Message ID | 1443634014-3026-9-git-send-email-Anna.Schumaker@Netapp.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Anna Schumaker <Anna.Schumaker@netapp.com> writes: > @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, > struct file *file_out, loff_t pos_out, > size_t len, unsigned int flags) > { > - struct inode *inode_in; > - struct inode *inode_out; > ssize_t ret; > > - if (flags) > + /* Flags should only be used exclusively. */ > + if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY)) > + return -EINVAL; > + if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK)) > + return -EINVAL; > + if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP)) > return -EINVAL; > Do you also need: if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP)) return -EINVAL; so that future user-space can test if the kernel supports new flags? NeilBrown
On 08/10/15 02:40, Neil Brown wrote: > Anna Schumaker <Anna.Schumaker@netapp.com> writes: > >> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, >> struct file *file_out, loff_t pos_out, >> size_t len, unsigned int flags) >> { >> - struct inode *inode_in; >> - struct inode *inode_out; >> ssize_t ret; >> >> - if (flags) >> + /* Flags should only be used exclusively. */ >> + if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY)) >> + return -EINVAL; >> + if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK)) >> + return -EINVAL; >> + if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP)) >> return -EINVAL; >> > > Do you also need: > > if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP)) > return -EINVAL; > > so that future user-space can test if the kernel supports new flags? Seems like a good idea, yes. Also that got me thinking about COPY_FR_SPARSE. What's the current behavior when copying a sparse range? Is the hole propagated by default (good), or is it expanded? Note cp(1) has --sparse={never,auto,always}. Auto is the default, so it would be good I think if that was the default mode for copy_file_range(). With other sparse modes, we'd have to avoid copy_file_range() unless there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}. Note currently cp --sparse=always will detect runs of zeros and also avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE, ...) thanks, Pádraig. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: > This allows us to have an in-kernel copy mechanism that avoids frequent > switches between kernel and user space. This is especially useful so > NFSD can support server-side copies. > > I make pagecache copies configurable by adding three new (exclusive) > flags: > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. > - COPY_FR_DEDUP creates a reflink, but only if the contents of both > ranges are identical. All but FR_COPY really should be a separate system call. Clones (an dedup as a special case of clones) are really a separate beast from file copies. If I want to clone a file I either want it clone fully or fail, not copy a certain amount. That means that a) we need to return an error not short "write", and b) locking impementations are important - we need to prevent other applications from racing with our clone even if it is large, while to get these semantics for the possible short returning file copy will require a proper userland locking protocol. Last but not least file copies need to be interruptible while clones should be not. All this is already important for local file systems and even more important for NFS exporting. So I'd suggest to drop this patch and just let your syscall handle actualy copies with all their horrors. We can go with Peng's patches to generalize the btrfs ioctls for clones for now which is what everyone already uses anyway, and then add a separate sys_file_clone later. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: > On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: > > This allows us to have an in-kernel copy mechanism that avoids frequent > > switches between kernel and user space. This is especially useful so > > NFSD can support server-side copies. > > > > I make pagecache copies configurable by adding three new (exclusive) > > flags: > > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. > > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. > > - COPY_FR_DEDUP creates a reflink, but only if the contents of both > > ranges are identical. > > All but FR_COPY really should be a separate system call. Clones (an > dedup as a special case of clones) are really a separate beast from file > copies. > > If I want to clone a file I either want it clone fully or fail, not copy > a certain amount. That means that a) we need to return an error not > short "write", and b) locking impementations are important - we need to > prevent other applications from racing with our clone even if it is > large, while to get these semantics for the possible short returning > file copy will require a proper userland locking protocol. Last but not > least file copies need to be interruptible while clones should be not. > All this is already important for local file systems and even more > important for NFS exporting. > > So I'd suggest to drop this patch and just let your syscall handle > actualy copies with all their horrors. We can go with Peng's patches > to generalize the btrfs ioctls for clones for now which is what everyone > already uses anyway, and then add a separate sys_file_clone later. Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. What does everyone think about generalizing EXTENT_SAME? The interface enables one to ask the kernel to dedupe multiple file ranges in a single call. That's more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming that the extra complexity buys us the ability to ... multi-dedupe at the same time, with locks held on the source file? I'm happy to generalize the existing EXTENT_SAME, but please yell if you really hate the interface. --D > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: >> > This allows us to have an in-kernel copy mechanism that avoids frequent >> > switches between kernel and user space. This is especially useful so >> > NFSD can support server-side copies. >> > >> > I make pagecache copies configurable by adding three new (exclusive) >> > flags: >> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. >> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. >> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both >> > ranges are identical. >> >> All but FR_COPY really should be a separate system call. Clones (an >> dedup as a special case of clones) are really a separate beast from file >> copies. >> >> If I want to clone a file I either want it clone fully or fail, not copy >> a certain amount. That means that a) we need to return an error not >> short "write", and b) locking impementations are important - we need to >> prevent other applications from racing with our clone even if it is >> large, while to get these semantics for the possible short returning >> file copy will require a proper userland locking protocol. Last but not >> least file copies need to be interruptible while clones should be not. >> All this is already important for local file systems and even more >> important for NFS exporting. >> >> So I'd suggest to drop this patch and just let your syscall handle >> actualy copies with all their horrors. We can go with Peng's patches >> to generalize the btrfs ioctls for clones for now which is what everyone >> already uses anyway, and then add a separate sys_file_clone later. > > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. > > What does everyone think about generalizing EXTENT_SAME? The interface enables > one to ask the kernel to dedupe multiple file ranges in a single call. That's > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming > that the extra complexity buys us the ability to ... multi-dedupe at the same > time, with locks held on the source file? How is this supposed to be implemented on something like NFS without protocol changes? Trond -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote: > On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong > <darrick.wong@oracle.com> wrote: > > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: > >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: > >> > This allows us to have an in-kernel copy mechanism that avoids frequent > >> > switches between kernel and user space. This is especially useful so > >> > NFSD can support server-side copies. > >> > > >> > I make pagecache copies configurable by adding three new (exclusive) > >> > flags: > >> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. > >> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. > >> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both > >> > ranges are identical. > >> > >> All but FR_COPY really should be a separate system call. Clones (an > >> dedup as a special case of clones) are really a separate beast from file > >> copies. > >> > >> If I want to clone a file I either want it clone fully or fail, not copy > >> a certain amount. That means that a) we need to return an error not > >> short "write", and b) locking impementations are important - we need to > >> prevent other applications from racing with our clone even if it is > >> large, while to get these semantics for the possible short returning > >> file copy will require a proper userland locking protocol. Last but not > >> least file copies need to be interruptible while clones should be not. > >> All this is already important for local file systems and even more > >> important for NFS exporting. > >> > >> So I'd suggest to drop this patch and just let your syscall handle > >> actualy copies with all their horrors. We can go with Peng's patches > >> to generalize the btrfs ioctls for clones for now which is what everyone > >> already uses anyway, and then add a separate sys_file_clone later. > > > > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from > > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. > > > > What does everyone think about generalizing EXTENT_SAME? The interface enables > > one to ask the kernel to dedupe multiple file ranges in a single call. That's > > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming > > that the extra complexity buys us the ability to ... multi-dedupe at the same > > time, with locks held on the source file? > > How is this supposed to be implemented on something like NFS without > protocol changes? Quite frankly, I'm not sure. Assuming NFS doesn't already have some sort of deduplication primitive (I could be totally wrong about that) I'd probably just leave the appropriate ops function pointer set to NULL and return -EOPNOTSUPP to userspace. Trying to fake it by comparing contents on the client and issuing a reflink might be doable with hard locks but if I had to guess I'd say that's even less palatable than simply bailing out. :) IOW: I was only considering the filesystems that already support dedupe, which is basically btrfs and future-XFS. --D > > Trond > -- > To unsubscribe from this list: send the line "unsubscribe linux-api" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 04:17:49PM -0700, Darrick J. Wong wrote: > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. > > What does everyone think about generalizing EXTENT_SAME? The interface enables > one to ask the kernel to dedupe multiple file ranges in a single call. That's > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming > that the extra complexity buys us the ability to ... multi-dedupe at the same > time, with locks held on the source file? > > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really > hate the interface. It's not pretty, but if the btrfs folks have a good reason for it I don't see a reason to diverge. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 12, 2015 at 11:36:31PM -0400, Trond Myklebust wrote: > How is this supposed to be implemented on something like NFS without > protocol changes? Explicit dedup has no chance of working over NFS or other network protocols without protocol changes. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/07/2015 09:40 PM, Neil Brown wrote: > Anna Schumaker <Anna.Schumaker@netapp.com> writes: > >> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, >> struct file *file_out, loff_t pos_out, >> size_t len, unsigned int flags) >> { >> - struct inode *inode_in; >> - struct inode *inode_out; >> ssize_t ret; >> >> - if (flags) >> + /* Flags should only be used exclusively. */ >> + if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY)) >> + return -EINVAL; >> + if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK)) >> + return -EINVAL; >> + if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP)) >> return -EINVAL; >> > > Do you also need: > > if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP)) > return -EINVAL; > > so that future user-space can test if the kernel supports new flags? Probably. I'll add that in! Thanks, Anna > > NeilBrown > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/09/2015 07:15 AM, Pádraig Brady wrote: > On 08/10/15 02:40, Neil Brown wrote: >> Anna Schumaker <Anna.Schumaker@netapp.com> writes: >> >>> @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, >>> struct file *file_out, loff_t pos_out, >>> size_t len, unsigned int flags) >>> { >>> - struct inode *inode_in; >>> - struct inode *inode_out; >>> ssize_t ret; >>> >>> - if (flags) >>> + /* Flags should only be used exclusively. */ >>> + if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY)) >>> + return -EINVAL; >>> + if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK)) >>> + return -EINVAL; >>> + if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP)) >>> return -EINVAL; >>> >> >> Do you also need: >> >> if (flags & ~(COPY_FR_COPY | COPY_FR_REFLINK | COPY_FR_DEDUP)) >> return -EINVAL; >> >> so that future user-space can test if the kernel supports new flags? > > Seems like a good idea, yes. > > Also that got me thinking about COPY_FR_SPARSE. > What's the current behavior when copying a sparse range? > Is the hole propagated by default (good), or is it expanded? I haven't tried it, but I think the hole would be expanded :(. I'm having splice() handle the pagecache copy part, and (as far as I know) splice() doesn't know anything about sparse files. I might be able to put in some kind of fallocate() / splice() loop to copy the range in multiple pieces. I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will have to determine how best to interpret "auto". I'm more inclined to add a single COPY_FR_SPARSE flag to enable creating sparse files, and then have the application tell us what to do for any given range. Anna > > Note cp(1) has --sparse={never,auto,always}. Auto is the default, > so it would be good I think if that was the default mode for copy_file_range(). > With other sparse modes, we'd have to avoid copy_file_range() unless > there was control possible with COPY_FR_SPARSE_{AUTO,NONE,ALWAYS}. > Note currently cp --sparse=always will detect runs of zeros and also > avoid speculative preallocation by using fallocate (fd, FALLOC_FL_PUNCH_HOLE, ...) > > thanks, > Pádraig. > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 13, 2015 at 04:25:29PM -0400, Anna Schumaker wrote: > I haven't tried it, but I think the hole would be expanded :(. I'm having splice() handle the pagecache copy part, and (as far as I know) splice() doesn't know anything about sparse files. I might be able to put in some kind of fallocate() / splice() loop to copy the range in multiple pieces. > > I don't want to add COPY_FR_SPARSE_AUTO, because then the kernel will have to determine how best to interpret "auto". I'm more inclined to add a single COPY_FR_SPARSE flag to enable creating sparse files, and then have the application tell us what to do for any given range. Teh right think is to keep sparse ranges spare as much as possible. This would require the same sort of support as NFS READ_PLUS so I think it's worthwhile to try it. If the file system can't support it it won't be sparse, so we'll get a worse quality of implementation. But please don't add even more weird flags that just confuse users. So far I think the only useful flag for copy_file_range is a PREALLOC or similar flag that says the destination range should have an implicit poix_fallocate performed on it. And due to the complexity of implementation I'm not even sure we need that in the first version. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/12/2015 07:17 PM, Darrick J. Wong wrote: > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: >>> This allows us to have an in-kernel copy mechanism that avoids frequent >>> switches between kernel and user space. This is especially useful so >>> NFSD can support server-side copies. >>> >>> I make pagecache copies configurable by adding three new (exclusive) >>> flags: >>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. >>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. >>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both >>> ranges are identical. >> >> All but FR_COPY really should be a separate system call. Clones (an >> dedup as a special case of clones) are really a separate beast from file >> copies. >> >> If I want to clone a file I either want it clone fully or fail, not copy >> a certain amount. That means that a) we need to return an error not >> short "write", and b) locking impementations are important - we need to >> prevent other applications from racing with our clone even if it is >> large, while to get these semantics for the possible short returning >> file copy will require a proper userland locking protocol. Last but not >> least file copies need to be interruptible while clones should be not. >> All this is already important for local file systems and even more >> important for NFS exporting. >> >> So I'd suggest to drop this patch and just let your syscall handle >> actualy copies with all their horrors. We can go with Peng's patches >> to generalize the btrfs ioctls for clones for now which is what everyone >> already uses anyway, and then add a separate sys_file_clone later. So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct? I can make those changes, but only if everybody is in agreement that it's the best way forward. The only reason I haven't done anything to make this system call interruptible is because I haven't been able to find any documentation or examples for making system calls interruptible. How do I do this? Anna > > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. > > What does everyone think about generalizing EXTENT_SAME? The interface enables > one to ask the kernel to dedupe multiple file ranges in a single call. That's > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming > that the extra complexity buys us the ability to ... multi-dedupe at the same > time, with locks held on the source file? > > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really > hate the interface. > > --D > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-api" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 10:59 AM, Anna Schumaker <Anna.Schumaker@netapp.com> wrote: > On 10/12/2015 07:17 PM, Darrick J. Wong wrote: >> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: >>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: >>>> This allows us to have an in-kernel copy mechanism that avoids frequent >>>> switches between kernel and user space. This is especially useful so >>>> NFSD can support server-side copies. >>>> >>>> I make pagecache copies configurable by adding three new (exclusive) >>>> flags: >>>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. >>>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. >>>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both >>>> ranges are identical. >>> >>> All but FR_COPY really should be a separate system call. Clones (an >>> dedup as a special case of clones) are really a separate beast from file >>> copies. >>> >>> If I want to clone a file I either want it clone fully or fail, not copy >>> a certain amount. That means that a) we need to return an error not >>> short "write", and b) locking impementations are important - we need to >>> prevent other applications from racing with our clone even if it is >>> large, while to get these semantics for the possible short returning >>> file copy will require a proper userland locking protocol. Last but not >>> least file copies need to be interruptible while clones should be not. >>> All this is already important for local file systems and even more >>> important for NFS exporting. >>> >>> So I'd suggest to drop this patch and just let your syscall handle >>> actualy copies with all their horrors. We can go with Peng's patches >>> to generalize the btrfs ioctls for clones for now which is what everyone >>> already uses anyway, and then add a separate sys_file_clone later. > > So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct? I can make those changes, but only if everybody is in agreement that it's the best way forward. I personally rather like the reflink option. That thing is quite useful. > > The only reason I haven't done anything to make this system call interruptible is because I haven't been able to find any documentation or examples for making system calls interruptible. How do I do this? > For just interruptability, avoid waiting in non-interruptable ways and return -EINTR if one of your wait calls returns -EINTR. For restartability, it's more complicated. There are special values you can return that give the signal code hints as to what to do. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 01:59:40PM -0400, Anna Schumaker wrote: > On 10/12/2015 07:17 PM, Darrick J. Wong wrote: > > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: > >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: > >>> This allows us to have an in-kernel copy mechanism that avoids frequent > >>> switches between kernel and user space. This is especially useful so > >>> NFSD can support server-side copies. > >>> > >>> I make pagecache copies configurable by adding three new (exclusive) > >>> flags: > >>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. > >>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. > >>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both > >>> ranges are identical. > >> > >> All but FR_COPY really should be a separate system call. Clones (an > >> dedup as a special case of clones) are really a separate beast from file > >> copies. > >> > >> If I want to clone a file I either want it clone fully or fail, not copy > >> a certain amount. That means that a) we need to return an error not > >> short "write", and b) locking impementations are important - we need to > >> prevent other applications from racing with our clone even if it is > >> large, while to get these semantics for the possible short returning > >> file copy will require a proper userland locking protocol. Last but not > >> least file copies need to be interruptible while clones should be not. > >> All this is already important for local file systems and even more > >> important for NFS exporting. > >> > >> So I'd suggest to drop this patch and just let your syscall handle > >> actualy copies with all their horrors. We can go with Peng's patches > >> to generalize the btrfs ioctls for clones for now which is what everyone > >> already uses anyway, and then add a separate sys_file_clone later. > > So what I'm hearing is that I should drop the reflink and dedup flags and > change this system call only perform a full copy (with preserving of > sparseness), correct? I can make those changes, but only if everybody is in > agreement that it's the best way forward. Sounds fine to me; I'll work on promoting EXTENT_SAME to the VFS. > The only reason I haven't done anything to make this system call > interruptible is because I haven't been able to find any documentation or > examples for making system calls interruptible. How do I do this? I thought it was mostly a matter of sprinkling in "if (signal_pending(...)) return -ERESTARTSYS" type things whenever it's convenient to check. The splice code already seems to have this, though I'm no expert on what the splice code actually does. :) --D > > Anna > > > > > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from > > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. > > > > What does everyone think about generalizing EXTENT_SAME? The interface enables > > one to ask the kernel to dedupe multiple file ranges in a single call. That's > > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming > > that the extra complexity buys us the ability to ... multi-dedupe at the same > > time, with locks held on the source file? > > > > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really > > hate the interface. > > > > --D > > > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-api" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:11 AM, Darrick J. Wong <darrick.wong@oracle.com> wrote: > On Wed, Oct 14, 2015 at 01:59:40PM -0400, Anna Schumaker wrote: >> On 10/12/2015 07:17 PM, Darrick J. Wong wrote: >> > On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote: >> >> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote: >> >>> This allows us to have an in-kernel copy mechanism that avoids frequent >> >>> switches between kernel and user space. This is especially useful so >> >>> NFSD can support server-side copies. >> >>> >> >>> I make pagecache copies configurable by adding three new (exclusive) >> >>> flags: >> >>> - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink. >> >>> - COPY_FR_COPY does a full data copy, but may be filesystem accelerated. >> >>> - COPY_FR_DEDUP creates a reflink, but only if the contents of both >> >>> ranges are identical. >> >> >> >> All but FR_COPY really should be a separate system call. Clones (an >> >> dedup as a special case of clones) are really a separate beast from file >> >> copies. >> >> >> >> If I want to clone a file I either want it clone fully or fail, not copy >> >> a certain amount. That means that a) we need to return an error not >> >> short "write", and b) locking impementations are important - we need to >> >> prevent other applications from racing with our clone even if it is >> >> large, while to get these semantics for the possible short returning >> >> file copy will require a proper userland locking protocol. Last but not >> >> least file copies need to be interruptible while clones should be not. >> >> All this is already important for local file systems and even more >> >> important for NFS exporting. >> >> >> >> So I'd suggest to drop this patch and just let your syscall handle >> >> actualy copies with all their horrors. We can go with Peng's patches >> >> to generalize the btrfs ioctls for clones for now which is what everyone >> >> already uses anyway, and then add a separate sys_file_clone later. >> >> So what I'm hearing is that I should drop the reflink and dedup flags and >> change this system call only perform a full copy (with preserving of >> sparseness), correct? I can make those changes, but only if everybody is in >> agreement that it's the best way forward. > > Sounds fine to me; I'll work on promoting EXTENT_SAME to the VFS. > >> The only reason I haven't done anything to make this system call >> interruptible is because I haven't been able to find any documentation or >> examples for making system calls interruptible. How do I do this? > > I thought it was mostly a matter of sprinkling in "if (signal_pending(...)) > return -ERESTARTSYS" type things whenever it's convenient to check. The splice > code already seems to have this, though I'm no expert on what the splice code > actually does. :) > Oh, right. That's for making loops that don't otherwise block interruptible. If you're doing wait_xyz, then you want to use the interruptable variable of that. Anyway, I just checked on x86. The relevant error codes are (I think): -EINTR: returns -EINTR to userspace with no special handling. -ERESTARTNOINTR: end the syscall, call a signal handler if appropriate, then retry the syscall with the same arguments (i.e. the syscall needs to make sure that trying again is an acceptable thing to do by, e.g., updating offsets that are used). -ERESTARTSYS: same as -ERESTARTNOINTR *unless* there's an unblocked signal handler that has SA_RESTART clear, which which case the caller gets -EINTR. -ERESTARTNOHAND: end the syscall and retry with the same arguments if no signal handler would be called; otherwise call the signal handler and return -EINTR to the caller. -ERESTART_RESTARTBLOCK: return -EINTR if a signal is delivered and otherwise use the restart_block mechanism. (Don't use that -- it's evil.) So -ERESTARTSYS is probably the most sensible thing to use under normal circumstances. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote: > > So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct? I can make those changes, but only if everybody is in agreement that it's the best way forward. > > I personally rather like the reflink option. That thing is quite useful. reflink is very useful, probably more useful than the copy actually. But it is different from a copy. It should be a separate interface. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:27 AM, Christoph Hellwig <hch@infradead.org> wrote: > On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote: >> > So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct? I can make those changes, but only if everybody is in agreement that it's the best way forward. >> >> I personally rather like the reflink option. That thing is quite useful. > > reflink is very useful, probably more useful than the copy actually. But it > is different from a copy. It should be a separate interface. One might argue that reflink is like copy + immediate dedupe. Also, I can imagine there being network protocols over which you can't really tell the difference between reflink and server-to-server copy. --Andy
On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote: > One might argue that reflink is like copy + immediate dedupe. Not, it's not. It's all that and more, because it is an operation that is atomic vs other writes to the file and it's an operation that either clones the whole range or nothing. That's a very important difference. > Also, I > can imagine there being network protocols over which you can't really > tell the difference between reflink and server-to-server copy. For NFS we specificly have a CLONE and a COPY operations so that smart servers can support the proper clone, and dumb servers still get copy offload. Other protocols might only be able to support COPY if they don't have a CLONE primitive. Note that a clone also always is a valid copy, just with much simpler an at the same time more useful semantics. Take a look at the NFSv4.2 sections for CLONE vs COPY if you're interested. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote: > On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote: >> One might argue that reflink is like copy + immediate dedupe. > > Not, it's not. It's all that and more, because it is an operation that > is atomic vs other writes to the file and it's an operation that either > clones the whole range or nothing. That's a very important difference. Fair enough. Would copy_file_range without the reflink option removed still be permitted to link blocks on supported filesystems (btrfs and maybe XFS)? --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-10-14 14:27, Christoph Hellwig wrote: > On Wed, Oct 14, 2015 at 11:08:40AM -0700, Andy Lutomirski wrote: >>> So what I'm hearing is that I should drop the reflink and dedup flags and change this system call only perform a full copy (with preserving of sparseness), correct? I can make those changes, but only if everybody is in agreement that it's the best way forward. >> >> I personally rather like the reflink option. That thing is quite useful. > > reflink is very useful, probably more useful than the copy actually. But it > is different from a copy. It should be a separate interface. Whether or not reflink is different from a copy is entirely a matter of who is looking at it. For someone looking directly at the block device, or trying to manipulate the block layout of the filesystem it is definitely not a copy. For a database app that needs ACID transaction semantics, it is definitely not a copy (although for that usage, it's arguably significantly better than a copy). From the point of view of a generic userspace app that didn't perform the copy operation however, and for anyone looking at it after the fact without paying attention to the block layout, a reflink _is_ for all intents and purposes functionally equivalent to a copy of the reflinked data (assuming of course that the filesystem implements it properly, and that the hardware behaves right). I would not in fact be surprised if at least some SCSI devices that implement the XCOPY command do so internally using a reflink (I have not personally read the standard, but even if it 'requires' a compliant device to actually create a separate copy of the data, there will still be some vendors who ignore this), and it is well known that some SSD's do in-band data deduplication effectively reducing a traditional copy to a reflink at the firmware level. I agree that we shouldn't try to make a reflink by default (less than intelligent programmers won't read the docs completely, and will make various stupid assumptions about how this is 'supposed' to work, making the defaults less ambiguous is a good thing), but it makes sense (at least, it does to me) to have the ability to say 'make this block of data appear at this location as well, I don't care how you do it as long as they are functionally independent for userspace applications'.
On 2015-10-14 14:53, Andy Lutomirski wrote: > On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote: >> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote: >>> One might argue that reflink is like copy + immediate dedupe. >> >> Not, it's not. It's all that and more, because it is an operation that >> is atomic vs other writes to the file and it's an operation that either >> clones the whole range or nothing. That's a very important difference. > > Fair enough. > > Would copy_file_range without the reflink option removed still be > permitted to link blocks on supported filesystems (btrfs and maybe > XFS)? I would argue that it should have such functionality, but not do so by default (maybe add some option to tell it to ask the FS to accelerate the copy operation?).
On 14/10/15 20:14, Austin S Hemmelgarn wrote: > On 2015-10-14 14:53, Andy Lutomirski wrote: >> On Wed, Oct 14, 2015 at 11:49 AM, Christoph Hellwig <hch@infradead.org> wrote: >>> On Wed, Oct 14, 2015 at 11:38:13AM -0700, Andy Lutomirski wrote: >>>> One might argue that reflink is like copy + immediate dedupe. >>> >>> Not, it's not. It's all that and more, because it is an operation that >>> is atomic vs other writes to the file and it's an operation that either >>> clones the whole range or nothing. That's a very important difference. >> >> Fair enough. >> >> Would copy_file_range without the reflink option removed still be >> permitted to link blocks on supported filesystems (btrfs and maybe >> XFS)? > I would argue that it should have such functionality, but not do so by > default (maybe add some option to tell it to ask the FS to accelerate > the copy operation?). Heh, so back to the REFLINK flag :) TBH given the overlap between "copy" and "reflink", I quite like the REFLINK flag as a general interface to reflink. thanks, Pádraig -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 11:53:45AM -0700, Andy Lutomirski wrote: > Would copy_file_range without the reflink option removed still be > permitted to link blocks on supported filesystems (btrfs and maybe > XFS)? Absolutely. Unless the COPY_FALLOCATE or whatever we call it option is specified of course. But I'd really love to get basic copy infrastructure in for 4.4 and then define these options later. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 14, 2015 at 03:08:46PM -0400, Austin S Hemmelgarn wrote: > Whether or not reflink is different from a copy is entirely a matter of who > is looking at it. So what? I've been trying to explain why clone semantics matter, and I've not seen a counter argument for that. I've also explained a couple times that a valid clone always is a valid copy, and I've only heard some slight disagreement, and so far none as long as we take the COPY_FALLOCATE option into account. Note that all of this also applies to storage devices - any smart array will do a clone-like operation underneath an XCOPY, but so far SCSI doesn't provide full clone _semantics_ even if you can emulate a lot of it using a lot of complexity around ROD tokens. Similar at the SCSI level you can perform a fallocate-like operation using the anchor bit in the UNMAP or WRITE SAME commands. > I agree that we shouldn't try to make a reflink by default (less than > intelligent programmers won't read the docs completely, and will make > various stupid assumptions about how this is 'supposed' to work, making the > defaults less ambiguous is a good thing), but it makes sense (at least, it > does to me) to have the ability to say 'make this block of data appear at > this location as well, I don't care how you do it as long as they are > functionally independent for userspace applications'. Yes, we absolutely should use reflink as a default implementation for copy where available. But we also need a clone or reflink interface that only gives us well specified reflink semantics, and not the much weaker copy semantics. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-10-15 02:36, Christoph Hellwig wrote: > On Wed, Oct 14, 2015 at 03:08:46PM -0400, Austin S Hemmelgarn wrote: >> Whether or not reflink is different from a copy is entirely a matter of who >> is looking at it. > > So what? I've been trying to explain why clone semantics matter, and > I've not seen a counter argument for that. I've also explained a couple > times that a valid clone always is a valid copy, and I've only heard > some slight disagreement, and so far none as long as we take the > COPY_FALLOCATE option into account. > > Note that all of this also applies to storage devices - any smart array > will do a clone-like operation underneath an XCOPY, but so far SCSI > doesn't provide full clone _semantics_ even if you can emulate a lot of > it using a lot of complexity around ROD tokens. > > Similar at the SCSI level you can perform a fallocate-like operation > using the anchor bit in the UNMAP or WRITE SAME commands. > >> I agree that we shouldn't try to make a reflink by default (less than >> intelligent programmers won't read the docs completely, and will make >> various stupid assumptions about how this is 'supposed' to work, making the >> defaults less ambiguous is a good thing), but it makes sense (at least, it >> does to me) to have the ability to say 'make this block of data appear at >> this location as well, I don't care how you do it as long as they are >> functionally independent for userspace applications'. > > Yes, we absolutely should use reflink as a default implementation for > copy where available. > > But we also need a clone or reflink interface that only gives us well > specified reflink semantics, and not the much weaker copy semantics. > Ah, I was completely misunderstanding your meaning, sorry about any confusion that I may have caused as a result of this. My only point with saying we shouldn't reflink by default is that there are many (unintelligent) people who will assume that since the syscall has copy in it's name, that's what it will do; and, while I don't think we should cater to such individuals, it does make sense to have a syscall that says in it's name that it copies data actually do so by default.
On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote: > My only point with saying we shouldn't reflink by default is that there are > many (unintelligent) people who will assume that since the syscall has copy > in it's name, that's what it will do; and, while I don't think we should > cater to such individuals, it does make sense to have a syscall that says in > it's name that it copies data actually do so by default. As far as the user is concerned a reflink is a copy. A very efficient copy. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-10-16 01:38, Christoph Hellwig wrote: > On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote: >> My only point with saying we shouldn't reflink by default is that there are >> many (unintelligent) people who will assume that since the syscall has copy >> in it's name, that's what it will do; and, while I don't think we should >> cater to such individuals, it does make sense to have a syscall that says in >> it's name that it copies data actually do so by default. > > As far as the user is concerned a reflink is a copy. A very efficient > copy. I should have been specific, what I meant was that some people will assume that it actually creates a physical, on-disk byte-for-byte copy of the data. There are many people out there (and sadly I have to deal with some at work) who are absolutely terrified of the concept of data deduplication, and will likely refuse to use this syscall for _anything_ if it reflinks by default on filesystems that support it.
On 16/10/15 12:46, Austin S Hemmelgarn wrote: > On 2015-10-16 01:38, Christoph Hellwig wrote: >> On Thu, Oct 15, 2015 at 08:24:51AM -0400, Austin S Hemmelgarn wrote: >>> My only point with saying we shouldn't reflink by default is that there are >>> many (unintelligent) people who will assume that since the syscall has copy >>> in it's name, that's what it will do; and, while I don't think we should >>> cater to such individuals, it does make sense to have a syscall that says in >>> it's name that it copies data actually do so by default. >> >> As far as the user is concerned a reflink is a copy. A very efficient >> copy. > I should have been specific, what I meant was that some people will > assume that it actually creates a physical, on-disk byte-for-byte copy > of the data. There are many people out there (and sadly I have to deal > with some at work) who are absolutely terrified of the concept of data > deduplication, and will likely refuse to use this syscall for _anything_ > if it reflinks by default on filesystems that support it. Right. reflinking is transparent to the user, though its consequences are not. Consequences being the possible extra latency or ENOSPC on CoW. Therefore reflinking should be an explicit action/flag IMHO. cheers, Pádraig. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 16, 2015 at 07:46:41AM -0400, Austin S Hemmelgarn wrote: > I should have been specific, what I meant was that some people will assume > that it actually creates a physical, on-disk byte-for-byte copy of the data. > There are many people out there (and sadly I have to deal with some at work) > who are absolutely terrified of the concept of data deduplication, and will > likely refuse to use this syscall for _anything_ if it reflinks by default > on filesystems that support it. If they use a file system that supports COW or dedup they are toast already. It's not the system call that does the 'dedup', it's the file system or storage device, that's where they need to set their preferences. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 16, 2015 at 01:02:23PM +0100, P??draig Brady wrote: > Right. reflinking is transparent to the user, though its consequences are not. > Consequences being the possible extra latency or ENOSPC on CoW. You can get all these consequences without doing the file system reflink by using a COW file system, any dedup scheme or thinly provisioned or COW storage devices. > Therefore reflinking should be an explicit action/flag IMHO. This still does not make any sense, as it only prevents one of many ways a file could do COW operations underneath. If you don't want ENOSPC use fallocate, or the proposed COPY_FALLOC flag. If you want care about latency you need to carefull benchmark your setup but in general falloc / COPY_FALLOC might be a good starting point. But for 99% of the copies a reflink is exactly the right thing to do. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-10-16 08:24, Christoph Hellwig wrote: > On Fri, Oct 16, 2015 at 01:02:23PM +0100, P??draig Brady wrote: >> Right. reflinking is transparent to the user, though its consequences are not. >> Consequences being the possible extra latency or ENOSPC on CoW. > > You can get all these consequences without doing the file system reflink > by using a COW file system, any dedup scheme or thinly provisioned or > COW storage devices. > >> Therefore reflinking should be an explicit action/flag IMHO. > > This still does not make any sense, as it only prevents one of many > ways a file could do COW operations underneath. If you don't want > ENOSPC use fallocate, or the proposed COPY_FALLOC flag. If you want > care about latency you need to carefull benchmark your setup but in > general falloc / COPY_FALLOC might be a good starting point. But for > 99% of the copies a reflink is exactly the right thing to do. There is at least one reason other than avoiding ENOSPC and minimizing latency that people may want to avoid reflinking things: They actually _want_ multiple physically independent copies of the same file on the disk. Usually people do go about this wrong (some people I know don't understand that having multiple copies of a file on the same filesystem provides no greater safety than one copy), but that doesn't mean that this isn't a perfectly valid use case for copying a file.
On 2015-10-16 08:21, Christoph Hellwig wrote: > On Fri, Oct 16, 2015 at 07:46:41AM -0400, Austin S Hemmelgarn wrote: >> I should have been specific, what I meant was that some people will assume >> that it actually creates a physical, on-disk byte-for-byte copy of the data. >> There are many people out there (and sadly I have to deal with some at work) >> who are absolutely terrified of the concept of data deduplication, and will >> likely refuse to use this syscall for _anything_ if it reflinks by default >> on filesystems that support it. > > If they use a file system that supports COW or dedup they are toast > already. It's not the system call that does the 'dedup', it's the file > system or storage device, that's where they need to set their > preferences. BTRFS is COW and supports deduplication, it does _absolutely zero_ reflinking and/or deduplication unless you explicitly tell it to do so. Likewise, ZFS is COW and supports deduplication, it also does _absolutely zero_ reflinking and/or deduplication unless you tell it to (note that in-band deduplication is off by default on ZFS). AFAIK, XFS will not automatically reflink instead of copying either (and if it does decide to do it automatically, that will just be something else to add to the list of why I will never use it on any of my systems). OCFS2 supports reflinks (although not many people know this, and I think it implements them slightly differently from BTRFS/ZFS/XFS) and yet again, does _absolutely zero_ reflinking unless you tell it to. Based on this, it is in no way the filesystem that does the deduplication and reflinking, it only handles the implementation and provides the option to the user to do it. Certain parts of userspace do try to reflink things instead of copying (for example, coreutils recently started doing so in mv and has had the option to do so with cp for a while now), but a properly designed general purpose filesystem does not and should not do this without the user telling it to do so.
On Fri, Oct 16, 2015 at 08:50:41AM -0400, Austin S Hemmelgarn wrote: > Certain parts of userspace do try to reflink things instead of copying (for > example, coreutils recently started doing so in mv and has had the option to > do so with cp for a while now), but a properly designed general purpose > filesystem does not and should not do this without the user telling it to do > so. But they do. Get out of your narrow local Linux file system view. Every all flash array or hyperconverge hypervisor will dedeup the hell out of your data, heck some SSDs even do it on the device. Your NFS or CIFS server already does or soon will do dedup and reflinks behind the scenes, that's the whole point of adding these features to the protocol. And except for the odd fear or COW or dedup, and the ENOSPC issue for which we have a flag with a very well defined meaning I've still not heard any good arguments against it. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-10-16 09:12, Christoph Hellwig wrote: > On Fri, Oct 16, 2015 at 08:50:41AM -0400, Austin S Hemmelgarn wrote: >> Certain parts of userspace do try to reflink things instead of copying (for >> example, coreutils recently started doing so in mv and has had the option to >> do so with cp for a while now), but a properly designed general purpose >> filesystem does not and should not do this without the user telling it to do >> so. > > But they do. Get out of your narrow local Linux file system view. > Every all flash array or hyperconverge hypervisor will dedeup the hell > out of your data, heck some SSDs even do it on the device. Your NFS or > CIFS server already does or soon will do dedup and reflinks behind the > scenes, that's the whole point of adding these features to the protocol. Unless things have significantly changed on Windows and OS X, NTFS and HFS+ do not do automatic data deduplication (I'm not sure whether either even supports reflinks, although NTFS is at least partly COW), and I know for certain that FAT, UDF, Minix, BeFS, and Venti do not do so. NFS and CIFS/SMB both have support in the protocol, but unless either the client asks for it specifically, or the server is manually configured to do it automatically (although current versions of Windows server might do it by default, but if they do it is not documented anywhere I've seen), they don't do it. 9P has no provisions for reflinks/deduplication. AFS/Coda/Ceph/Lustre/GFS2 might do deduplication, but I'm pretty certain that they do not do so by default, and even then they really don't fit the 'general purpose' bit in my statement above. So, overall, my statement still holds for any widely used filesystem technology that is actually 'general purpose'. Furthermore, if you actually read my statement, you will notice that I only said that _filesystems_ should not do it without being told to do so, and (intentionally) said absolutely nothing about any kind of storage devices or virtualization. Ideally, SSD's really shouldn't do it either unless they have a 100% guarantee that the entire block going bad will not render the data unrecoverable (most do in fact use ECC internally, but they typically only handle two or three bad bits out of a full byte). And as far as hypervisors go, a good storage hypervisor should be providing some guarantee of reliability, which means either it is already storing multiple copies of _everything_ or using some form of erasure coding so that it can recover from issues with the underlying storage devices without causing issues for higher levels, thus meaning that deduplication in that context is safe for all intents and purposes. > And except for the odd fear or COW or dedup, and the ENOSPC issue for > which we have a flag with a very well defined meaning I've still not > heard any good arguments against it. Most people who I know who demonstrate this fear are just fine with COW, it's the deduplication that they're terrified of, and TBH that's largely because they've only ever seen it used in unsafe ways. My main argument (which I admittedly have not really stated properly at all during this discussion) is that almost everyone is likely to jump on this, which _will_ change long established semantics in many things that switch to this, and there will almost certainly be serious backlash from that.
On Tue, Oct 13, 2015 at 12:27:37AM -0700, Christoph Hellwig wrote: > On Mon, Oct 12, 2015 at 04:17:49PM -0700, Darrick J. Wong wrote: > > Hm. Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from > > btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl. > > > > What does everyone think about generalizing EXTENT_SAME? The interface enables > > one to ask the kernel to dedupe multiple file ranges in a single call. That's > > more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming > > that the extra complexity buys us the ability to ... multi-dedupe at the same > > time, with locks held on the source file? > > > > I'm happy to generalize the existing EXTENT_SAME, but please yell if you really > > hate the interface. > > It's not pretty, but if the btrfs folks have a good reason for it I > don't see a reason to diverge. I started hoisting EXTENT_SAME into the VFS but I don't like the name because this ioctl implies some sort of action, but "EXTENT SAME" lacks a verb. Since we have to introduce a new symbol anyway, I'm going to use FS_DEDUPE_RANGE. struct file_dedupe_range { ... } #define FI_DEDUPE_RANGE _IOWR(0x94, 54, struct file_dedupe_range) (Honestly, I'm not in love with FICLONERANGE either, but FIDEDUPRANGE was just unpronounceable mess.) Also, for the btrfs folks: Why does extent_same call mnt_want_write_file on the fd that we pass into the ioctl? Shouldn't we be calling it on the fd that's in the btrfs_ioctl_extent_same_info structure because that'ss the file that gets its blocks remapped? --D -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/read_write.c b/fs/read_write.c index ee9fa37..4fb9b8e 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -7,6 +7,7 @@ #include <linux/slab.h> #include <linux/stat.h> #include <linux/fcntl.h> +#include <linux/copy.h> #include <linux/file.h> #include <linux/uio.h> #include <linux/fsnotify.h> @@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd, } #endif +static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + size_t len) +{ + ssize_t ret; + + ret = rw_verify_area(READ, file_in, &pos_in, len); + if (ret >= 0) { + len = ret; + ret = rw_verify_area(WRITE, file_out, &pos_out, len); + if (ret >= 0) + len = ret; + } + if (ret < 0) + return ret; + + file_start_write(file_out); + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0); + file_end_write(file_out); + + return ret; +} + /* * copy_file_range() differs from regular file read and write in that it * specifically allows return partial success. When it does so is up to @@ -1338,34 +1362,26 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, size_t len, unsigned int flags) { - struct inode *inode_in; - struct inode *inode_out; ssize_t ret; - if (flags) + /* Flags should only be used exclusively. */ + if ((flags & COPY_FR_COPY) && (flags & ~COPY_FR_COPY)) + return -EINVAL; + if ((flags & COPY_FR_REFLINK) && (flags & ~COPY_FR_REFLINK)) + return -EINVAL; + if ((flags & COPY_FR_DEDUP) && (flags & ~COPY_FR_DEDUP)) return -EINVAL; - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */ - ret = rw_verify_area(READ, file_in, &pos_in, len); - if (ret >= 0) - ret = rw_verify_area(WRITE, file_out, &pos_out, len); - if (ret < 0) - return ret; + /* Default behavior is to try both. */ + if (flags == 0) + flags = COPY_FR_COPY | COPY_FR_REFLINK; if (!(file_in->f_mode & FMODE_READ) || !(file_out->f_mode & FMODE_WRITE) || (file_out->f_flags & O_APPEND) || - !file_out->f_op || !file_out->f_op->copy_file_range) + !file_out->f_op) return -EBADF; - inode_in = file_inode(file_in); - inode_out = file_inode(file_out); - - /* make sure offsets don't wrap and the input is inside i_size */ - if (pos_in + len < pos_in || pos_out + len < pos_out || - pos_in + len > i_size_read(inode_in)) - return -EINVAL; - if (len == 0) return 0; @@ -1373,8 +1389,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in, if (ret) return ret; - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out, - len, flags); + ret = -EOPNOTSUPP; + if (file_out->f_op->copy_file_range && (file_in->f_op == file_out->f_op)) + ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, + pos_out, len, flags); + if ((ret < 0) && (flags & COPY_FR_COPY)) + ret = vfs_copy_file_pagecache(file_in, pos_in, file_out, + pos_out, len); if (ret > 0) { fsnotify_access(file_in); add_rchar(current, ret); diff --git a/include/linux/copy.h b/include/linux/copy.h new file mode 100644 index 0000000..fd54543 --- /dev/null +++ b/include/linux/copy.h @@ -0,0 +1,6 @@ +#ifndef _LINUX_COPY_H +#define _LINUX_COPY_H + +#include <uapi/linux/copy.h> + +#endif /* _LINUX_COPY_H */ diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index f7b2db4..faafd67 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -90,6 +90,7 @@ header-y += coda_psdev.h header-y += coff.h header-y += connector.h header-y += const.h +header-y += copy.h header-y += cramfs_fs.h header-y += cuda.h header-y += cyclades.h diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h new file mode 100644 index 0000000..b807dcd --- /dev/null +++ b/include/uapi/linux/copy.h @@ -0,0 +1,8 @@ +#ifndef _UAPI_LINUX_COPY_H +#define _UAPI_LINUX_COPY_H + +#define COPY_FR_COPY (1 << 0) /* Only do a pagecache copy. */ +#define COPY_FR_REFLINK (1 << 1) /* Only make a reflink. */ +#define COPY_FR_DEDUP (1 << 2) /* Deduplicate file data. */ + +#endif /* _UAPI_LINUX_COPY_H */