[8/9] vfs: hoist the btrfs deduplication ioctl to the vfs

Message ID 20151219085559.12478.33700.stgit@birch.djwong.org

Commit Message

Darrick J. Wong Dec. 19, 2015, 8:55 a.m. UTC
Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
more systematic (FIDEDUPERANGE).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/compat_ioctl.c       |    1 
 fs/ioctl.c              |   38 ++++++++++++++++++
 fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |    4 ++
 include/uapi/linux/fs.h |   30 ++++++++++++++
 5 files changed, 173 insertions(+)



--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Biggers Jan. 12, 2016, 6:07 a.m. UTC | #1
Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
(note: I realize the patch is mostly just moving the code that already existed
in btrfs, but in the VFS it deserves a more thorough review):

At a high level, I am confused about what is meant by the "source" and
"destination" files.  I understand that with more than two files, you
effectively have to choose one file to treat specially and dedupe with all the
other files (an NxN comparison isn't realistic).  But with just two files, a
deduplication operation should be completely symmetric, should it not?  The end
result should be that the data is deduplicated, regardless of the order in which
I gave the file descriptors.  So why is there some non-symmetric behavior?
There are several examples but one is that the VFS is checking !S_ISREG() on the
"source" file descriptor but not on the "destination" file descriptor.  Another
is that different permissions are required on the source versus on the
destination.  If there are good reasons for the nonsymmetry then this needs to
be clearly explained in the man page; otherwise it may not be clear what to use
as the "source" and what to use as the "destination".

It seems odd to be adding "copy" as a system call but then have "dedupe" and
"clone" as ioctls rather than system calls... it seems that they should all be
one or the other (at least, if we put aside the fact that the ioctls already
exist in btrfs).

The range checking in clone_verify_area() appears incomplete.  Someone could
provide len=UINT64_MAX and all the checks would still pass even though 'pos+len'
would overflow.
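
A minimal sketch of the overflow concern, written as a userspace model and not
as the kernel's clone_verify_area().  The names range_is_valid() and 'limit'
are hypothetical, and unsigned 64-bit offsets are assumed here (the kernel
works with signed loff_t values).  The point is only that comparing 'len'
against the remaining space avoids ever computing 'pos + len':

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's clone_verify_area()): returns true
 * when the byte range [pos, pos + len) ends at or below 'limit' and the
 * addition cannot wrap around.
 */
bool range_is_valid(uint64_t pos, uint64_t len, uint64_t limit)
{
	/* A start beyond the limit is invalid; this also keeps the
	 * subtraction below from underflowing. */
	if (pos > limit)
		return false;

	/* Compare len with the space that is left instead of forming
	 * pos + len, which could overflow for len = UINT64_MAX. */
	return len <= limit - pos;
}
```

With this shape, range_is_valid(1, UINT64_MAX, UINT64_MAX) is rejected, whereas
a naive 'pos + len <= limit' test would wrap to 0 and accept it.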

Should the ioctl be interruptible?  Right now it always goes through *all* the
'struct file_dedupe_range_info's you passed in --- potentially up to 65535 of
them.

Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
whole.

What permissions do you need on the destination file descriptors?  The man page
implies they must be open for writing and not appending.  The implementation
differs: it requires FMODE_WRITE only for non-admin users, and it doesn't check
for O_APPEND at all.  The man page also says you get EPERM if "dest_fd is
immutable" and ETXTBSY if "one of the files is a swap file", which I don't see
actually happening in the implementation; it seems those error codes perhaps
shouldn't exist at all for this ioctl but rather be left to open(..., O_WRONLY).

If the filesystem doesn't support deduplication, or I pass in a strange file
descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The man
page isn't clear.

Under what circumstances will 'bytes_deduped' differ from the count that was
passed in?  If short counts are allowed, what will the 'status' be in that
case: FILE_DEDUPE_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?  Can
data be deduped even if only a prefix of the data region matches?

The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.

The man page isn't clear about whether the ioctl stops early if an error occurs
or always processes all the 'struct file_dedupe_range_info's you pass in.  And
if it were, hypothetically, to stop early, how is the user meant to know on
which file it stopped?

The man page says "logical_offset" but in the struct it is called "dest_offset".

There are some variables named "same" which don't really make sense now that the
ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.

Eric
Darrick J. Wong Jan. 12, 2016, 9:14 a.m. UTC | #2
[adding btrfs to the cc since we're talking about a whole new dedupe interface]

On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
> (note: I realize the patch is mostly just moving the code that already existed
> in btrfs, but in the VFS it deserves a more thorough review):

Wheee. :)

Yes, let's discuss the concerns about the btrfs extent same ioctl.

I believe Christoph dislikes the odd return mechanism (i.e. status and
bytes_deduped) and doubts that the vectorization is really necessary.  There's
not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
does".  I suspect that will leave my explanations lacking, since I neither
designed the btrfs interface nor know all that much about the decisions made to
arrive at what we have now.

(I agree with both of hch's complaints.)

Really, the best argument for keeping this ioctl is to avoid breaking
duperemove.  Even then, given that current duperemove checks for btrfs before
trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
leaving the old btrfs ioctl intact, and train duperemove to try the new
ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.

Frankly, I also wouldn't mind changing the VFS dedupe ioctl to something
that resembles the clone_range interface:

int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

struct file_dedupe_range {
    __s64 src_fd;
    __u64 src_offset;
    __u64 length;
    __u64 dest_offset;
    __u64 flags;
};

"See if the byte range src_offset:length in src_fd matches all of
dest_offset:length in dest_fd; if so, share src_fd's physical storage with
dest_fd.  Both fds must be files; if they are the same file the ranges cannot
overlap; src_fd must be readable; dest_fd must be writable or append-only.
Offsets and lengths probably need to be block-aligned, but that is filesystem
dependent."

The error conditions would be a superset of the ones we know about today.  I'd
return EOVERFLOW or something if length is longer than the FS wants to deal
with.
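
To illustrate the "if they are the same file the ranges cannot overlap" rule
in the proposed description, here is a rough sketch.  ranges_overlap() is a
hypothetical helper, not part of any existing interface, and it assumes both
ranges have the same length and that neither wraps past UINT64_MAX:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when [a, a + len) and [b, b + len) share at
 * least one byte.  Assumes neither range wraps around.
 */
bool ranges_overlap(uint64_t a, uint64_t b, uint64_t len)
{
	/* An empty range cannot overlap anything. */
	if (len == 0)
		return false;

	/* Two equal-length ranges overlap exactly when their start
	 * offsets are less than 'len' apart. */
	if (a <= b)
		return b - a < len;
	return a - b < len;
}
```

A caller deduping within one file would refuse the request when this returns
true for src_offset and dest_offset.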

Now all the vectorization problems go away, and since it's a new VFS interface
we can define everything from the start.

Christoph, if this new interface solves your complaints I think I'd like to get
started on the code/docs soon.

> At a high level, I am confused about what is meant by the "source" and
> "destination" files.  I understand that with more than two files, you
> effectively have to choose one file to treat specially and dedupe with all
> the other files (an NxN comparison isn't realistic).  But with just two
> files, a deduplication operation should be completely symmetric, should it
> not?  The end

Not sure what you mean by 'symmetric', but in any case the convention seems
to be that src_fd's storage is shared with dest_fd if there's a match.

> result should be that the data is deduplicated, regardless of the order in
> which I gave the file descriptors.  So why is there some non-symmetric
> behavior?  There are several examples but one is that the VFS is checking
> !S_ISREG() on the "source" file descriptor but not on the "destination" file
> descriptor.

The dedupe_range function pointer should only be supplied for regular files.

> Another is that different permissions are required on the source versus on
> the destination.  If there are good reasons for the nonsymmetry then this
> needs to be clearly explained in the man page; otherwise it may not be clear
> what to use as the "source" and what to use as the "destination".
> 
> It seems odd to be adding "copy" as a system call but then have "dedupe" and
> "clone" as ioctls rather than system calls... it seems that they should all
> be one or the other (at least, if we put aside the fact that the ioctls
> already exist in btrfs).

We can't put the clone ioctl aside; coreutils has already started using it.

I'm not sure if clone_range or extent_same are all that popular, though.
AFAIK duperemove is the only program using extent_same, and I don't know
of anything using clone_range.

(Well, xfs_io does...)

> The range checking in clone_verify_area() appears incomplete.  Someone could
> provide len=UINT64_MAX and all the checks would still pass even though
> 'pos+len' would overflow.

Yeah...

> Should the ioctl be interruptible?  Right now it always goes through *all*
> the 'struct file_dedupe_range_info's you passed in --- potentially up to
> 65535 of them.

There probably ought to be explicit signal checks, or we could just get rid
of the vectorization entirely. :)

> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
> whole.

Right, because bytes_deduped is a part of file_dedupe_range_info, not
file_dedupe_range.

(Note the bytes_deduped = 0 earlier in the function.)

> What permissions do you need on the destination file descriptors?  The man
> page implies they must be open for writing and not appending.  The
> implementation differs: it requires FMODE_WRITE only for non-admin users, and
> it doesn't check for O_APPEND at all.

I think the result of an earlier discussion was that src_fd must be readable,
and dest_fd must be writable or appendable.

> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> "one of the files is a swap file", which I don't see actually happening in
> the implementation; it seems those error codes perhaps shouldn't exist at all
> for this ioctl but rather be left to open(..., O_WRONLY).

Those could be hoisted to the VFS (from the XFS implementation), I think.

> If the filesystem doesn't support deduplication, or I pass in a strange file
> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
> man page isn't clear.

Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
directories; and EINVAL for any other kind of non-file fd.  I suspect the
clone* manpages don't make this too clear either.

> Under what circumstances will 'bytes_deduped' differ from the count that was
> passed in?

btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(

> If short counts are allowed, what will the 'status' be in that case:
> FILE_DEDUPE_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?

One of those two.

> Can data be deduped even if only a prefix of the data region matches?

No.
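
Since short counts are possible, a caller probably has to loop.  The following
is only a sketch of how userspace might do that, assuming the vectorized
interface as it is exposed in <linux/fs.h> (struct file_dedupe_range with a
trailing array of struct file_dedupe_range_info) and assuming a negative
per-destination status carries an errno value.  dedupe_range() is a
hypothetical wrapper name:

```c
#include <errno.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Hypothetical wrapper: try to dedupe 'len' bytes of src_fd at src_off
 * against dest_fd at dest_off, one destination per call, repeating because
 * the kernel may process fewer bytes than requested.  Returns the number of
 * bytes deduped, or a negative errno value.  Stops at the first chunk whose
 * contents differ.
 */
int64_t dedupe_range(int src_fd, uint64_t src_off,
		     int dest_fd, uint64_t dest_off, uint64_t len)
{
	struct file_dedupe_range *req;
	uint64_t done = 0;
	int64_t ret = 0;

	/* Header plus room for exactly one destination record. */
	req = calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
	if (!req)
		return -ENOMEM;

	while (done < len) {
		struct file_dedupe_range_info *info = &req->info[0];

		req->src_offset = src_off + done;
		req->src_length = len - done;
		req->dest_count = 1;
		info->dest_fd = dest_fd;
		info->dest_offset = dest_off + done;
		info->bytes_deduped = 0;
		info->status = 0;

		/* The ioctl is issued on the source file descriptor. */
		if (ioctl(src_fd, FIDEDUPERANGE, req) < 0) {
			ret = -errno;
			break;
		}
		/* Assumed: a negative status is a per-destination errno. */
		if (info->status < 0) {
			ret = info->status;
			break;
		}
		/* Differing data or no progress: nothing more to do. */
		if (info->status == FILE_DEDUPE_RANGE_DIFFERS ||
		    info->bytes_deduped == 0)
			break;

		done += info->bytes_deduped;
	}

	free(req);
	return ret < 0 ? ret : (int64_t)done;
}
```

Whether the limit on a single call is 16MB or something else is up to the
filesystem, so the loop advances by whatever bytes_deduped reports.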

> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.

Oops, good catch. :(

> The man page isn't clear about whether the ioctl stops early if an error
> occurs or always processes all the 'struct file_dedupe_range_info's you pass
> in.  And if it were, hypothetically, to stop early, how is the user meant to
> know on which file it stopped?

I don't know if this should be the official behavior, but it stopped at
whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
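
As a sketch of that heuristic only (it is not documented behavior), userspace
could scan the returned records as below.  struct dedupe_info_view is an
illustrative stand-in for file_dedupe_range_info whose layout is an
assumption, and first_untouched() is a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the per-destination record; the field names
 * follow the discussion, the exact layout is an assumption. */
struct dedupe_info_view {
	int64_t  dest_fd;
	uint64_t dest_offset;
	uint64_t bytes_deduped;
	int32_t  status;
	uint32_t reserved;
};

/*
 * Hypothetical helper: return the index of the first record that still has
 * both status and bytes_deduped equal to zero, or 'count' when every record
 * looks as if it was processed.
 */
size_t first_untouched(const struct dedupe_info_view *info, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (info[i].status == 0 && info[i].bytes_deduped == 0)
			break;
	}
	return i;
}
```

The heuristic cannot tell an unprocessed record from one that was processed
but legitimately deduped zero bytes, which is part of why an explicit rule in
the man page would be better.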

> The man page says "logical_offset" but in the struct it is called
> "dest_offset".

Oops.

> There are some variables named "same" which don't really make sense now that
> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.

Perhaps not.

I'll later take a look at how many of these issues apply to clone/clone_range.

--D

> 
> Eric
Eric Biggers Jan. 13, 2016, 2:36 a.m. UTC | #3
On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> 
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl to something
> that resembles the clone_range interface:
> 
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);

Getting rid of the vectorization certainly avoids a lot of API complexity.  I
don't know precisely why it was made vectorized in the first place, but to me it
seems it shouldn't be vectorized unless there's a strong reason for it to be.

> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
> ...
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.

Deduplication is a strange operation since you're not "reading" or "writing" in
the traditional sense.  All file data stays exactly the same from a user
perspective.

So rather than arbitrarily say that you're "reading" from one file and "writing"
to the other, one possibility is to have an API where you submit two files and
the implementation decides how to best deduplicate them.  That could involve
sharing existing extents in file 1 with file 2; or sharing existing extents in
file 2 with file 1; or creating a new extent and referencing it from each file.
It would be up to the implementation to choose the most efficient approach.

But I take it that the existing btrfs ioctl doesn't work that way, and it always
will share the existing extents in file 1 with file 2.

That's probably a perfectly fine way to do it, but I was wondering whether it
really makes sense to hard-code that detail in the API, since an API could be
defined more generally.

And from a practical perspective, people using the ioctl to deduplicate two files
need to know which one to submit to the API as the "source" and which as the
"destination", if it matters.

Related to this are the permissions you need on each file descriptor.  Since
deduplication isn't "reading" or "writing" in the traditional sense it could,
theoretically, be argued that no permissions at all should be required: neither
FMODE_READ nor FMODE_WRITE.

Most likely there are concerns that would make that a bad idea, though.  Perhaps
someone could create mischief by causing fragmentation in system files.

If this was previously discussed, can you point me to that discussion?

The proposed man page is also very brief on permissions, only mentioning them
indirectly in the error codes section.

> > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > "one of the files is a swap file", which I don't see actually happening in
> > the implementation; it seems those error codes perhaps shouldn't exist at all
> > for this ioctl but rather be left to open(..., O_WRONLY).
> 
> Those could be hoisted to the VFS (from the XFS implementation), I think.

If the destination file has to already be open for writing then it can't be
immutable.  So I don't think that error code is needed.  Checking for an active
swapfile, on the other hand, may be valid, although I don't see such a check
anywhere in the version of the code I'm looking at.

> I'll later take a look at how many of these issues apply to clone/clone_range.

clone and clone_range seem to avoid the two biggest issues I saw: they aren't
vectorized, and there's a natural distinction between the "source" and the
"destination" of the operation.  They definitely need careful consideration as
well, though.
Darrick J. Wong Jan. 23, 2016, 12:54 a.m. UTC | #4
On Tue, Jan 12, 2016 at 08:36:58PM -0600, Eric Biggers wrote:
> On Tue, Jan 12, 2016 at 01:14:32AM -0800, Darrick J. Wong wrote:
> > 
> > Frankly, I also wouldn't mind changing the VFS dedupe ioctl to something
> > that resembles the clone_range interface:
> > 
> > int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
> 
> Getting rid of the vectorization certainly avoids a lot of API complexity.  I
> don't know precisely why it was made vectorized in the first place, but to me it
> seems it shouldn't be vectorized unless there's a strong reason for it to be.

I suppose the idea was that you could ask the kernel to dedupe a bunch of files
"atomically" and that locks would be held (at least on the source file) during
the entire operation...?

Ick.  Given that you can have 65536 vectors of 16MB apiece, you could be
reading up to 1TB of data into the pagecache.  That's a lot for a single
ioctl, though I guess we unlock all the inodes between each vector.

(Welll... we don't check for pending signals.  Urk.)

> > Not sure what you mean by 'symmetric', but in any case the convention seems
> > to be that src_fd's storage is shared with dest_fd if there's a match.
> > ...
> > I think the result of an earlier discussion was that src_fd must be readable,
> > and dest_fd must be writable or appendable.
> 
> Deduplication is a strange operation since you're not "reading" or "writing" in
> the traditional sense.  All file data stays exactly the same from a user
> perspective.
> 
> So rather than arbitrarily say that you're "reading" from one file and "writing"
> to the other, one possibility is to have an API where you submit two files and
> the implementation decides how to best deduplicate them.  That could involve
> sharing existing extents in file 1 with file 2; or sharing existing extents in
> file 2 with file 1; or creating a new extent and referencing it from each file.
> It would be up to the implementation to choose the most efficient approach.

That's a very good point!  We needn't limit the API to whatever the first
implementer does; all of the above are valid interpretations of what dedupe
could involve.  I might even argue that this is the use-case for the vectorized
dedupe call -- analyze all these proposed deduplications and figure out the
best plan to make it happen.

(Ofc the current implementation doesn't allow this, but we can change the
code when the singularity happens, etc.)

> But I take it that the existing btrfs ioctl doesn't work that way, and it always
> will share the existing extents in file 1 with file 2.
> 
> That's probably a perfectly fine way to do it, but I was wondering whether it
> really makes sense to hard-code that detail in the API, since an API could be
> defined more generally.
> 
> And from a practical perspective, people using the ioctl to deduplicate two files
> need to know which one to submit to the API as the "source" and which as the
> "destination", if it matters.
> 
> Related to this are the permissions you need on each file descriptor.  Since
> deduplication isn't "reading" or "writing" in the traditional sense it could,
> theoretically, be argued that no permissions at all should be required: neither
> FMODE_READ nor FMODE_WRITE.

I don't like the idea of being able to make major changes to a file that
I've opened with O_RDONLY. :)

I suppose if you're going to allow the FS to figure out its own strategy
and all that, then by the above reasoning one ought to have all the files
opened as writable.

> Most likely there are concerns that would make that a bad idea, though.  Perhaps
> someone could create mischief by causing fragmentation in system files.
> 
> If this was previously discussed, can you point me to that discussion?

I don't know if there was a previous discussion, alas.

> The proposed man page is also very brief on permissions, only mentioning them
> indirectly in the error codes section.

There's a lot of unfortunate hand-waving in that manpage -- I wasn't involved
in designing the initial ioctl, so for the most part I'm simply sniffing my
way through based on observed behaviors (of duperemove) and guessing based on
the source code as I write the XFS version. :)

> > > The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
> > > "one of the files is a swap file", which I don't see actually happening in
> > > the implementation; it seems those error codes perhaps shouldn't exist at all
> > > for this ioctl but rather be left to open(..., O_WRONLY).
> > 
> > Those could be hoisted to the VFS (from the XFS implementation), I think.
> 
> If the destination file has to already be open for writing then it can't be
> immutable.  So I don't think that error code is needed.  Checking for an active
> swapfile, on the other hand, may be valid, although I don't see such a check
> anywhere in the version of the code I'm looking at.

It's at the top of xfs_file_share_range() in the more recent RFCs of XFS reflink.

> > I'll later take a look at how many of these issues apply to clone/clone_range.
> 
> clone and clone_range seem to avoid the two biggest issues I saw: they aren't
> vectorized, and there's a natural distinction between the "source" and the
> "destination" of the operation.  They definitely need careful consideration as
> well, though.

Indeed.  I appreciate your review of the manpages/patches!

--D

Kirill A. Shutemov July 27, 2016, 9:51 p.m. UTC | #5
On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> more systematic (FIDEDUPERANGE).
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/compat_ioctl.c       |    1 
>  fs/ioctl.c              |   38 ++++++++++++++++++
>  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h      |    4 ++
>  include/uapi/linux/fs.h |   30 ++++++++++++++
>  5 files changed, 173 insertions(+)
> 
> 
> diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> index 70d4b10..eab31e7 100644
> --- a/fs/compat_ioctl.c
> +++ b/fs/compat_ioctl.c
> @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
>  
>  	case FICLONE:
>  	case FICLONERANGE:
> +	case FIDEDUPERANGE:
>  		goto do_ioctl;
>  
>  	case FIBMAP:
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 84c6e79..fcdd33b 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
>  	return thaw_super(sb);
>  }
>  
> +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> +{
> +	struct file_dedupe_range __user *argp = arg;
> +	struct file_dedupe_range *same = NULL;
> +	int ret;
> +	unsigned long size;
> +	u16 count;
> +
> +	if (get_user(count, &argp->dest_count)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	size = offsetof(struct file_dedupe_range __user, info[count]);

Vlastimil triggered this during fuzzing:

http://paste.opensuse.org/view/raw/99203426

High order allocation without __GFP_NOWARN + fallback. That's not good.

Basically, we don't have any sanity check of 'dest_count' here. This u16
comes directly from userspace. And we call memdup_user() based on it.

Here's a program which makes the kernel allocate an order-9 page:

https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22

Should we put some reasonable upper limit for the 'dest_count'?
What is typical 'dest_count'?

> +
> +	same = memdup_user(argp, size);
> +	if (IS_ERR(same)) {
> +		ret = PTR_ERR(same);
> +		same = NULL;
> +		goto out;
> +	}
> +
> +	ret = vfs_dedupe_file_range(file, same);
> +	if (ret)
> +		goto out;
> +
> +	ret = copy_to_user(argp, same, size);
> +	if (ret)
> +		ret = -EFAULT;
> +
> +out:
> +	kfree(same);
> +	return ret;
> +}
> +
Darrick J. Wong July 28, 2016, 6:07 p.m. UTC | #6
On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);

(I still hate this interface.)

> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes the kernel allocate an order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22
> 
> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?

There are two userland programs I know of that call this ioctl.  The
first is xfs_io, which always sets dest_count = 1.

The other is duperemove, which seems capable of setting dest_count to
however many fragments it finds, up to a max of 120.  Capping size to
x86's 4k page size yields 127 entries.  On bigger machines with 64k
pages, that increases to 2047.  I think that's enough for anybody.

(Honestly, 127 dedupe candidates * max 16M extent length is already
2GB of IO for a single call.)
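
The 127 and 2047 figures can be reproduced with a small model of the argument
layout.  The two structs below are illustrative mirrors whose field sizes are
assumed to match the uapi header (a 24-byte header followed by 32-byte
records), and max_dest_count() is a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of one destination record (assumed 32 bytes). */
struct dedupe_info_layout {
	int64_t  dest_fd;
	uint64_t dest_offset;
	uint64_t bytes_deduped;
	int32_t  status;
	uint32_t reserved;
};

/* Illustrative mirror of the request header (assumed 24 bytes) followed
 * by a variable number of destination records. */
struct dedupe_req_layout {
	uint64_t src_offset;
	uint64_t src_length;
	uint16_t dest_count;
	uint16_t reserved1;
	uint32_t reserved2;
	struct dedupe_info_layout info[];
};

/*
 * Hypothetical helper: how many destination records fit when the whole
 * argument is capped at 'page_size' bytes.
 */
size_t max_dest_count(size_t page_size)
{
	size_t header = offsetof(struct dedupe_req_layout, info);

	if (page_size < header)
		return 0;
	return (page_size - header) / sizeof(struct dedupe_info_layout);
}
```

Under those assumptions max_dest_count(4096) is 127 and max_dest_count(65536)
is 2047, so a check of the form 'size > PAGE_SIZE' on the computed argument
size would cap dest_count at those values.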

--D

> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov
Darrick J. Wong July 28, 2016, 7:25 p.m. UTC | #7
On Thu, Jul 28, 2016 at 12:51:30AM +0300, Kirill A. Shutemov wrote:
> On Sat, Dec 19, 2015 at 12:55:59AM -0800, Darrick J. Wong wrote:
> > Hoist the btrfs EXTENT_SAME ioctl up to the VFS and make the name
> > more systematic (FIDEDUPERANGE).
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/compat_ioctl.c       |    1 
> >  fs/ioctl.c              |   38 ++++++++++++++++++
> >  fs/read_write.c         |  100 +++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/fs.h      |    4 ++
> >  include/uapi/linux/fs.h |   30 ++++++++++++++
> >  5 files changed, 173 insertions(+)
> > 
> > 
> > diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> > index 70d4b10..eab31e7 100644
> > --- a/fs/compat_ioctl.c
> > +++ b/fs/compat_ioctl.c
> > @@ -1582,6 +1582,7 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
> >  
> >  	case FICLONE:
> >  	case FICLONERANGE:
> > +	case FIDEDUPERANGE:
> >  		goto do_ioctl;
> >  
> >  	case FIBMAP:
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 84c6e79..fcdd33b 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -568,6 +568,41 @@ static int ioctl_fsthaw(struct file *filp)
> >  	return thaw_super(sb);
> >  }
> >  
> > +static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
> > +{
> > +	struct file_dedupe_range __user *argp = arg;
> > +	struct file_dedupe_range *same = NULL;
> > +	int ret;
> > +	unsigned long size;
> > +	u16 count;
> > +
> > +	if (get_user(count, &argp->dest_count)) {
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +
> > +	size = offsetof(struct file_dedupe_range __user, info[count]);
> 
> Vlastimil triggered this during fuzzing:
> 
> http://paste.opensuse.org/view/raw/99203426
> 
> High order allocation without __GFP_NOWARN + fallback. That's not good.
> 
> Basically, we don't have any sanity check of 'dest_count' here. This u16
> comes directly from userspace. And we call memdup_user() based on it.
> 
> Here's a program which makes the kernel allocate an order-9 page:
> 
> https://gist.github.com/kiryl/2b344b51da1fd2725be420a996b10d22

I forgot to say, please wrap this up as an xfstest so we can check for
future regressions.  After the patch, the ioctl should return ENOMEM
to signal that the caller asked for a larger dest_count than we want to
allocate memory for.

--D

> Should we put some reasonable upper limit for the 'dest_count'?
> What is typical 'dest_count'?
> 
> > +
> > +	same = memdup_user(argp, size);
> > +	if (IS_ERR(same)) {
> > +		ret = PTR_ERR(same);
> > +		same = NULL;
> > +		goto out;
> > +	}
> > +
> > +	ret = vfs_dedupe_file_range(file, same);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = copy_to_user(argp, same, size);
> > +	if (ret)
> > +		ret = -EFAULT;
> > +
> > +out:
> > +	kfree(same);
> > +	return ret;
> > +}
> > +
> 
> -- 
>  Kirill A. Shutemov
Michael Kerrisk (man-pages) Aug. 7, 2016, 5:47 p.m. UTC | #8
Hi Darrick,

On 01/12/2016 08:14 PM, Darrick J. Wong wrote:
> [adding btrfs to the cc since we're talking about a whole new dedupe interface]

In the discussion below, many points of possible improvement were noted for
the man page... Would you be willing to put together a patch please?

Thanks,

Michael

  
> On Tue, Jan 12, 2016 at 12:07:14AM -0600, Eric Biggers wrote:
>> Some feedback on the VFS portion of the FIDEDUPERANGE ioctl and its man page...
>> (note: I realize the patch is mostly just moving the code that already existed
>> in btrfs, but in the VFS it deserves a more thorough review):
>
> Wheee. :)
>
> Yes, let's discuss the concerns about the btrfs extent same ioctl.
>
> I believe Christoph dislikes the odd return mechanism (i.e. status and
> bytes_deduped) and doubts that the vectorization is really necessary.  There's
> not a lot of documentation to go on aside from "Do whatever the BTRFS ioctl
> does".  I suspect that will leave my explanations lacking, since I neither
> designed the btrfs interface nor know all that much about the decisions made to
> arrive at what we have now.
>
> (I agree with both of hch's complaints.)
>
> Really, the best argument for keeping this ioctl is to avoid breaking
> duperemove.  Even then, given that current duperemove checks for btrfs before
> trying to use BTRFS_IOC_EXTENT_SAME, we could very well design a new dedupe
> ioctl for the VFS, hook the new dedupers (XFS) into the new VFS ioctl
> leaving the old btrfs ioctl intact, and train duperemove to try the new
> ioctl and fall back on the btrfs one if the VFS ioctl isn't supported.
>
> Frankly, I also wouldn't mind changing the VFS dedupe ioctl to something
> that resembles the clone_range interface:
>
> int ioctl(int dest_fd, FIDEDUPERANGE, struct file_dedupe_range * arg);
>
> struct file_dedupe_range {
>     __s64 src_fd;
>     __u64 src_offset;
>     __u64 length;
>     __u64 dest_offset;
>     __u64 flags;
> };
>
> "See if the byte range src_offset:length in src_fd matches all of
> dest_offset:length in dest_fd; if so, share src_fd's physical storage with
> dest_fd.  Both fds must be files; if they are the same file the ranges cannot
> overlap; src_fd must be readable; dest_fd must be writable or append-only.
> Offsets and lengths probably need to be block-aligned, but that is filesystem
> dependent."
>
> The error conditions would be a superset of the ones we know about today.  I'd
> return EOVERFLOW or something if length is longer than the FS wants to deal
> with.
>
> Now all the vectorization problems go away, and since it's a new VFS interface
> we can define everything from the start.
>
> Christoph, if this new interface solves your complaints I think I'd like to get
> started on the code/docs soon.
>
>> At high level, I am confused about what is meant by the "source" and
>> "destination" files.  I understand that with more than two files, you
>> effectively have to choose one file to treat specially and dedupe with all
>> the other files (an NxN comparison isn't realistic).  But with just two
>> files, a deduplication operation should be completely symmetric, should it
>> not?  The end
>
> Not sure what you mean by 'symmetric', but in any case the convention seems
> to be that src_fd's storage is shared with dest_fd if there's a match.
>
>> result should be that the data is deduplicated, regardless of the order in
>> which I gave the file descriptors.  So why is there some non-symmetric
>> behavior?  There are several examples but one is that the VFS is checking
>> !S_ISREG() on the "source" file descriptor but not on the "destination" file
>> descriptor.
>
> The dedupe_range function pointer should only be supplied for regular files.
>
>> Another is that different permissions are required on the source versus on
>> the destination.  If there are good reasons for the nonsymmetry then this
>> needs to be clearly explained in the man page; otherwise it may not be clear
>> what to use as the "source" and what to use as the "destination".
>>
>> It seems odd to be adding "copy" as a system call but then have "dedupe" and
>> "clone" as ioctls rather than system calls... it seems that they should all
>> be one or the other (at least, if we put aside the fact that the ioctls
>> already exist in btrfs).
>
> We can't put the clone ioctl aside; coreutils has already started using it.
>
> I'm not sure if clone_range or extent_same are all that popular, though.
> AFAIK duperemove is the only program using extent_same, and I don't know
> of anything using clone_range.
>
> (Well, xfs_io does...)
>
>> The range checking in clone_verify_area() appears incomplete.  Someone could
>> provide len=UINT64_MAX and all the checks would still pass even though
>> 'pos+len' would overflow.
>
> Yeah...
>
>> Should the ioctl be interruptible?  Right now it always goes through *all*
>> the 'struct file_dedupe_range_info's you passed in --- potentially up to
>> 65535 of them.
>
> There probably ought to be explicit signal checks, or we could just get rid
> of the vectorization entirely. :)
>
>> Why 'info->bytes_deduped += deduped' rather than 'info->bytes_deduped =
>> deduped'?  'bytes_deduped' is per file descriptor, not for the operation as a
>> whole.
>
> Right, because bytes_deduped is a part of file_dedupe_range_info, not
> file_dedupe_range.
>
> (Note the bytes_deduped = 0 earlier in the function.)
>
>> What permissions do you need on the destination file descriptors?  The man
>> page implies they must be open for writing and not appending.  The
>> implementation differs: it requires FMODE_WRITE only for non-admin users, and
>> it doesn't check for O_APPEND at all.
>
> I think the result of an earlier discussion was that src_fd must be readable,
> and dest_fd must be writable or appendable.
>
>> The man page also says you get EPERM if "dest_fd is immutable" and ETXTBSY if
>> "one of the files is a swap file", which I don't see actually happening in
>> the implementation; it seems those error codes perhaps shouldn't exist at all
>> for this ioctl but rather be left to open(..., O_WRONLY).
>
> Those could be hoisted to the VFS (from the XFS implementation), I think.
>
>> If the filesystem doesn't support deduplication, or I pass in a strange file
>> descriptor such as one for a named pipe, do I get EINVAL or EOPNOTSUPP?  The
>> man page isn't clear.
>
> Should be EOPNOTSUPP if dest_fd isn't a regular file; EISDIR if either are
> directories; and EINVAL for any other kind of non-file fd.  I suspect the
> clone* manpages don't make this too clear either.
>
>> Under what circumstances will 'bytes_deduped' differ from the count that was
>> passed in?
>
> btrfs/xfs will only compare the first 16MB.  Not documented anywhere. :(
>
>> If short counts are allowed, what will the 'status' be in that case:
>> FILE_DEDUPE_RANGE_DIFFERS, FILE_DEDUPE_RANGE_SAME, or something else?
>
> One of those two.
>
>> Can data be deduped even if only a prefix of the data region matches?
>
> No.
>
>> The man page doesn't mention FILE_DEDUPE_RANGE_SAME at all, instead calling it
>> 0; it only mentions FILE_DEDUPE_RANGE_DIFFERS.
>
> Oops, good catch. :(
>
>> The man page isn't clear about whether the ioctl stops early if an error
>> occurs or always processes all the 'struct file_dedupe_range_info's you pass
>> in.  And if it were, hypothetically, to stop early, how is the user meant to
>> know on which file it stopped?
>
> I don't know if this should be the official behavior, but it stopped at
> whichever file_dedupe_range_info has both status and bytes_deduped set to zero.
>
>> The man page says "logical_offset" but in the struct it is called
>> "dest_offset".
>
> Oops.
>
>> There are some variables named "same" which don't really make sense now that
>> the ioctl is called FIDEDUPERANGE instead of EXTENT_SAME.
>
> Perhaps not.
>
> I'll later take a look at how many of these issues apply to clone/clone_range.
>
> --D
>
>>
>> Eric
>
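
To illustrate the overflow concern raised above for clone_verify_area()
(len=UINT64_MAX passing because 'pos+len' wraps), here is a minimal standalone
sketch. It is not the kernel's code: naive_range_ok(), safe_range_ok() and the
limit parameter are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Naive check: pos + len can wrap around and end up looking small. */
static bool naive_range_ok(uint64_t pos, uint64_t len, uint64_t limit)
{
	return pos + len <= limit;
}

/* Overflow-safe check: compares len with the room left, never adds. */
static bool safe_range_ok(uint64_t pos, uint64_t len, uint64_t limit)
{
	return pos <= limit && len <= limit - pos;
}

/* Usage example: the same huge length against both checks. */
static void range_check_demo(void)
{
	const uint64_t limit = (uint64_t)INT64_MAX;

	assert(naive_range_ok(16, UINT64_MAX, limit));  /* wrongly accepted */
	assert(!safe_range_ok(16, UINT64_MAX, limit));  /* rejected */
	assert(safe_range_ok(0, 4096, limit));          /* ordinary range */
}
```

Comparing len against (limit - pos) avoids the addition entirely, so no input
can wrap the arithmetic.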

Patch

diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index 70d4b10..eab31e7 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1582,6 +1582,7 @@  COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
 
 	case FICLONE:
 	case FICLONERANGE:
+	case FIDEDUPERANGE:
 		goto do_ioctl;
 
 	case FIBMAP:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 84c6e79..fcdd33b 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -568,6 +568,41 @@  static int ioctl_fsthaw(struct file *filp)
 	return thaw_super(sb);
 }
 
+static long ioctl_file_dedupe_range(struct file *file, void __user *arg)
+{
+	struct file_dedupe_range __user *argp = arg;
+	struct file_dedupe_range *same = NULL;
+	int ret;
+	unsigned long size;
+	u16 count;
+
+	if (get_user(count, &argp->dest_count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	size = offsetof(struct file_dedupe_range __user, info[count]);
+
+	same = memdup_user(argp, size);
+	if (IS_ERR(same)) {
+		ret = PTR_ERR(same);
+		same = NULL;
+		goto out;
+	}
+
+	ret = vfs_dedupe_file_range(file, same);
+	if (ret)
+		goto out;
+
+	ret = copy_to_user(argp, same, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	kfree(same);
+	return ret;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -629,6 +664,9 @@  int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 	case FICLONERANGE:
 		return ioctl_file_clone_range(filp, argp);
 
+	case FIDEDUPERANGE:
+		return ioctl_file_dedupe_range(filp, argp);
+
 	default:
 		if (S_ISREG(inode->i_mode))
 			error = file_ioctl(filp, cmd, arg);
diff --git a/fs/read_write.c b/fs/read_write.c
index 0713e28..aaaad52 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1523,3 +1523,103 @@  int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 	return ret;
 }
 EXPORT_SYMBOL(vfs_clone_file_range);
+
+int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
+{
+	struct file_dedupe_range_info *info;
+	struct inode *src = file_inode(file);
+	u64 off;
+	u64 len;
+	int i;
+	int ret;
+	bool is_admin = capable(CAP_SYS_ADMIN);
+	u16 count = same->dest_count;
+	struct file *dst_file;
+	loff_t dst_off;
+	ssize_t deduped;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (same->reserved1 || same->reserved2)
+		return -EINVAL;
+
+	off = same->src_offset;
+	len = same->src_length;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = -EINVAL;
+	if (!S_ISREG(src->i_mode))
+		goto out;
+
+	ret = clone_verify_area(file, off, len, false);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+
+	/* pre-format output fields to sane values */
+	for (i = 0; i < count; i++) {
+		same->info[i].bytes_deduped = 0ULL;
+		same->info[i].status = FILE_DEDUPE_RANGE_SAME;
+	}
+
+	for (i = 0, info = same->info; i < count; i++, info++) {
+		struct inode *dst;
+		struct fd dst_fd = fdget(info->dest_fd);
+
+		dst_file = dst_fd.file;
+		if (!dst_file) {
+			info->status = -EBADF;
+			goto next_loop;
+		}
+		dst = file_inode(dst_file);
+
+		ret = mnt_want_write_file(dst_file);
+		if (ret) {
+			info->status = ret;
+			goto next_loop;
+		}
+
+		dst_off = info->dest_offset;
+		ret = clone_verify_area(dst_file, dst_off, len, true);
+		if (ret < 0) {
+			info->status = ret;
+			goto next_file;
+		}
+		ret = 0;
+
+		if (info->reserved) {
+			info->status = -EINVAL;
+		} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+			info->status = -EINVAL;
+		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
+			info->status = -EXDEV;
+		} else if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+		} else if (dst_file->f_op->dedupe_file_range == NULL) {
+			info->status = -EINVAL;
+		} else {
+			deduped = dst_file->f_op->dedupe_file_range(file, off,
+							len, dst_file,
+							info->dest_offset);
+			if (deduped == -EBADE)
+				info->status = FILE_DEDUPE_RANGE_DIFFERS;
+			else if (deduped < 0)
+				info->status = deduped;
+			else
+				info->bytes_deduped += deduped;
+		}
+
+next_file:
+		mnt_drop_write_file(dst_file);
+next_loop:
+		fdput(dst_fd);
+	}
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL(vfs_dedupe_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26b7607..bcfa23d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1633,6 +1633,8 @@  struct file_operations {
 			loff_t, size_t, unsigned int);
 	int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
 			u64);
+	ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
+			u64);
 };
 
 struct inode_operations {
@@ -1688,6 +1690,8 @@  extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, u64 len);
+extern int vfs_dedupe_file_range(struct file *file,
+				 struct file_dedupe_range *same);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index ad60359..801986e 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -52,6 +52,35 @@  struct fstrim_range {
 	__u64 minlen;
 };
 
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME		0
+#define FILE_DEDUPE_RANGE_DIFFERS	1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+	__s64 dest_fd;		/* in - destination file */
+	__u64 dest_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file. */
+	/* status of this dedupe operation:
+	 * < 0 for error
+	 * == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+	 * == FILE_DEDUPE_RANGE_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;		/* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+	__u64 src_offset;	/* in - start of extent in source */
+	__u64 src_length;	/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;	/* must be zero */
+	__u32 reserved2;	/* must be zero */
+	struct file_dedupe_range_info info[0];
+};
+
 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {
 	unsigned long nr_files;		/* read only */
@@ -177,6 +206,7 @@  struct blkzeroout2 {
 #define FITRIM		_IOWR('X', 121, struct fstrim_range)	/* Trim */
 #define FICLONE		_IOW(0x94, 9, int)
 #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
+#define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
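
For readers following the interface discussion, a minimal userspace sketch of
calling the ioctl with a single destination is shown below. It assumes a
kernel and <linux/fs.h> that carry the definitions from this patch;
dedupe_one() is a hypothetical wrapper name, and returning -errno for an
ioctl-level failure is a convention of this example only, not something the
patch defines.

```c
#include <assert.h>
#include <errno.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/*
 * Try to dedupe len bytes at dest_off in dest_fd against src_off in src_fd.
 * Returns the per-destination status (FILE_DEDUPE_RANGE_SAME,
 * FILE_DEDUPE_RANGE_DIFFERS, or a negative errno), or -errno if the ioctl
 * itself failed. *bytes_deduped receives the reported byte count.
 */
static int dedupe_one(int src_fd, uint64_t src_off, uint64_t len,
		      int dest_fd, uint64_t dest_off, uint64_t *bytes_deduped)
{
	struct file_dedupe_range *arg;
	int status;

	assert(bytes_deduped != NULL);
	*bytes_deduped = 0;

	/* Header plus exactly one info element; calloc zeroes the reserved fields. */
	arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
	if (!arg)
		return -ENOMEM;

	arg->src_offset = src_off;
	arg->src_length = len;
	arg->dest_count = 1;
	arg->info[0].dest_fd = dest_fd;
	arg->info[0].dest_offset = dest_off;

	if (ioctl(src_fd, FIDEDUPERANGE, arg) < 0) {
		status = -errno;
	} else {
		status = arg->info[0].status;
		*bytes_deduped = arg->info[0].bytes_deduped;
	}

	free(arg);
	return status;
}
```

Note that the ioctl is issued on the source descriptor while each destination
descriptor travels inside the info array, which is the asymmetry questioned in
the review above.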