[v3,2/4] ovl: use vfs_clone_file_range() for copy up if possible
diff mbox

Message ID 1473856994-27463-3-git-send-email-amir73il@gmail.com
State New
Headers show

Commit Message

Amir Goldstein Sept. 14, 2016, 12:43 p.m. UTC
When copying up within the same fs, try to use vfs_clone_file_range().
This is very efficient when lower and upper are on the same fs
with file reflink support. If vfs_clone_file_range() fails because
lower and upper are not on the same fs or if fs has no reflink support,
copy up falls back to the regular data copy code.

Tested correct behavior when lower and upper are on:
1. same ext4 (copy)
2. same xfs + reflink patches + mkfs.xfs (copy)
3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)

For comparison, on my laptop, xfstest overlay/001 (copy up of large
sparse files) takes less than 1 second in the xfs reflink setup vs.
25 seconds on the rest of the setups.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/copy_up.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Miklos Szeredi Sept. 21, 2016, 3:09 p.m. UTC | #1
On Wed, Sep 14, 2016 at 2:43 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> When copying up within the same fs, try to use vfs_clone_file_range().
> This is very efficient when lower and upper are on the same fs
> with file reflink support. If vfs_clone_file_range() fails because
> lower and upper are not on the same fs or if fs has no reflink support,
> copy up falls back to the regular data copy code.
>
> Tested correct behavior when lower and upper are on:
> 1. same ext4 (copy)
> 2. same xfs + reflink patches + mkfs.xfs (copy)
> 3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
> 4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
>
> For comparison, on my laptop, xfstest overlay/001 (copy up of large
> sparse files) takes less than 1 second in the xfs reflink setup vs.
> 25 seconds on the rest of the setups.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/copy_up.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 43fdc27..ba039f8 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -136,6 +136,16 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>                 goto out_fput;
>         }
>
> +       /* Try to use clone_file_range to clone up within the same fs */
> +       error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
> +       if (!error)
> +               goto out;
> +       /* If we can clone but clone failed - abort */
> +       if (error != -EXDEV && error != -EOPNOTSUPP)
> +               goto out;

Would be safer to fall back on any error.

Otherwise ACK.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Sept. 21, 2016, 5:01 p.m. UTC | #2
On Wed, Sep 21, 2016 at 6:09 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Sep 14, 2016 at 2:43 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> When copying up within the same fs, try to use vfs_clone_file_range().
>> This is very efficient when lower and upper are on the same fs
>> with file reflink support. If vfs_clone_file_range() fails because
>> lower and upper are not on the same fs or if fs has no reflink support,
>> copy up falls back to the regular data copy code.
>>
>> Tested correct behavior when lower and upper are on:
>> 1. same ext4 (copy)
>> 2. same xfs + reflink patches + mkfs.xfs (copy)
>> 3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
>> 4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
>>
>> For comparison, on my laptop, xfstest overlay/001 (copy up of large
>> sparse files) takes less than 1 second in the xfs reflink setup vs.
>> 25 seconds on the rest of the setups.
>>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>  fs/overlayfs/copy_up.c | 12 +++++++++++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
>> index 43fdc27..ba039f8 100644
>> --- a/fs/overlayfs/copy_up.c
>> +++ b/fs/overlayfs/copy_up.c
>> @@ -136,6 +136,16 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>>                 goto out_fput;
>>         }
>>
>> +       /* Try to use clone_file_range to clone up within the same fs */
>> +       error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
>> +       if (!error)
>> +               goto out;
>> +       /* If we can clone but clone failed - abort */
>> +       if (error != -EXDEV && error != -EOPNOTSUPP)
>> +               goto out;
>
> Would be safer to fall back on any error.
>

Agreed. Dave, since you suggested the 'softer' fall back, do you have
any objections?

> Otherwise ACK.
>

Will you be taking this to your tree?

Please note that this patch depends on patch v3 1/4,
because vfs_clone_file_range() in current mainline
fails to clone from lower to upper due to upper and lower being
private mount clones
and therefore not the same f_path.mnt.

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Sept. 21, 2016, 6:29 p.m. UTC | #3
On Wed, Sep 21, 2016 at 7:01 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 6:09 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Wed, Sep 14, 2016 at 2:43 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> When copying up within the same fs, try to use vfs_clone_file_range().
>>> This is very efficient when lower and upper are on the same fs
>>> with file reflink support. If vfs_clone_file_range() fails because
>>> lower and upper are not on the same fs or if fs has no reflink support,
>>> copy up falls back to the regular data copy code.
>>>
>>> Tested correct behavior when lower and upper are on:
>>> 1. same ext4 (copy)
>>> 2. same xfs + reflink patches + mkfs.xfs (copy)
>>> 3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
>>> 4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
>>>
>>> For comparison, on my laptop, xfstest overlay/001 (copy up of large
>>> sparse files) takes less than 1 second in the xfs reflink setup vs.
>>> 25 seconds on the rest of the setups.
>>>
>>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>>> ---
>>>  fs/overlayfs/copy_up.c | 12 +++++++++++-
>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
>>> index 43fdc27..ba039f8 100644
>>> --- a/fs/overlayfs/copy_up.c
>>> +++ b/fs/overlayfs/copy_up.c
>>> @@ -136,6 +136,16 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>>>                 goto out_fput;
>>>         }
>>>
>>> +       /* Try to use clone_file_range to clone up within the same fs */
>>> +       error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
>>> +       if (!error)
>>> +               goto out;
>>> +       /* If we can clone but clone failed - abort */
>>> +       if (error != -EXDEV && error != -EOPNOTSUPP)
>>> +               goto out;
>>
>> Would be safer to fall back on any error.
>>
>
> Agreed. Dave, since you suggested the 'softer' fall back, do you have
> any objections?
>
>> Otherwise ACK.
>>
>
> Will you be taking this to your tree?

Sure I can take it.

>
> Please note that this patch depends on patch v3 1/4,
> because vfs_clone_file_range() in current mainline
> fails to clone from lower to upper due to upper and lower being
> private mount clones
> and therefore not the same f_path.mnt.

Right.  I didn't do a thorough audit of ->clone_file_range()
implementations, but 1/4 is probably OK.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 21, 2016, 9:48 p.m. UTC | #4
On Wed, Sep 21, 2016 at 08:01:22PM +0300, Amir Goldstein wrote:
> On Wed, Sep 21, 2016 at 6:09 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > On Wed, Sep 14, 2016 at 2:43 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> >> When copying up within the same fs, try to use vfs_clone_file_range().
> >> This is very efficient when lower and upper are on the same fs
> >> with file reflink support. If vfs_clone_file_range() fails because
> >> lower and upper are not on the same fs or if fs has no reflink support,
> >> copy up falls back to the regular data copy code.
> >>
> >> Tested correct behavior when lower and upper are on:
> >> 1. same ext4 (copy)
> >> 2. same xfs + reflink patches + mkfs.xfs (copy)
> >> 3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
> >> 4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
> >>
> >> For comparison, on my laptop, xfstest overlay/001 (copy up of large
> >> sparse files) takes less than 1 second in the xfs reflink setup vs.
> >> 25 seconds on the rest of the setups.
> >>
> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >> ---
> >>  fs/overlayfs/copy_up.c | 12 +++++++++++-
> >>  1 file changed, 11 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> >> index 43fdc27..ba039f8 100644
> >> --- a/fs/overlayfs/copy_up.c
> >> +++ b/fs/overlayfs/copy_up.c
> >> @@ -136,6 +136,16 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
> >>                 goto out_fput;
> >>         }
> >>
> >> +       /* Try to use clone_file_range to clone up within the same fs */
> >> +       error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
> >> +       if (!error)
> >> +               goto out;
> >> +       /* If we can clone but clone failed - abort */
> >> +       if (error != -EXDEV && error != -EOPNOTSUPP)
> >> +               goto out;
> >
> > Would be safer to fall back on any error.
> >
> 
> Agreed. Dave, since you suggested the 'softer' fall back, do you have
> any objections?

If you get any error other than -EXDEV or -EOPNOTSUPP from a clone
operation, there's somethign seriously wrong with the metadata of
the inode or th eunderlying storage. You're not going to be able to
copy the data if a clone fails for exactly the same reasons.

What's worse is that you might get a partial copy before failure, or
there might not be a failure during copy at all because the fs uses
delayed allocation and the corruption problem during allocation that
prevented clone from working is not triggered until writeback time.
i.e. you may not have anyone to report the fact that the copyup
failed by the time the failure is known.

Your choice, really - I'd much prefer that known hard errors are
propagated immediately than leave them to chance and have a user
then wonder where the $ANTI-DEITY their data has gone later on.

Cheers,

Dave.
Al Viro Sept. 21, 2016, 9:57 p.m. UTC | #5
On Thu, Sep 22, 2016 at 07:48:15AM +1000, Dave Chinner wrote:

> If you get any error other than -EXDEV or -EOPNOTSUPP from a clone
> operation, there's somethign seriously wrong with the metadata of
> the inode or th eunderlying storage.

Such as -ENOSPC?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Sept. 21, 2016, 10:33 p.m. UTC | #6
On Wed, Sep 21, 2016 at 10:57:24PM +0100, Al Viro wrote:
> On Thu, Sep 22, 2016 at 07:48:15AM +1000, Dave Chinner wrote:
> 
> > If you get any error other than -EXDEV or -EOPNOTSUPP from a clone
> > operation, there's somethign seriously wrong with the metadata of
> > the inode or the underlying storage.
> 
> Such as -ENOSPC?

Yup, that's a fatal error, too.  i.e. if a clone returns ENOSPC
because there isn't space for the extra metadata, then the fallback
data copy is almost certainly going to fail with ENOSPC when trying
to reserve/allocate space for both the extra data copy and the extra
metadata....

Cheers,

Dave.
Darrick J. Wong Sept. 22, 2016, 2:25 a.m. UTC | #7
On Thu, Sep 22, 2016 at 08:33:31AM +1000, Dave Chinner wrote:
> On Wed, Sep 21, 2016 at 10:57:24PM +0100, Al Viro wrote:
> > On Thu, Sep 22, 2016 at 07:48:15AM +1000, Dave Chinner wrote:
> > 
> > > If you get any error other than -EXDEV or -EOPNOTSUPP from a clone
> > > operation, there's somethign seriously wrong with the metadata of
> > > the inode or the underlying storage.
> > 
> > Such as -ENOSPC?
> 
> Yup, that's a fatal error, too.  i.e. if a clone returns ENOSPC
> because there isn't space for the extra metadata, then the fallback
> data copy is almost certainly going to fail with ENOSPC when trying
> to reserve/allocate space for both the extra data copy and the extra
> metadata....

XFS will return ENOSPC during reflink if one of the relevant AGs is
running low on space for the refcount/rmap trees; however there might be
enough blocks in another AG to make a regular old copy.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Sept. 22, 2016, 2:52 a.m. UTC | #8
On Thu, Sep 22, 2016 at 5:25 AM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, Sep 22, 2016 at 08:33:31AM +1000, Dave Chinner wrote:
>> On Wed, Sep 21, 2016 at 10:57:24PM +0100, Al Viro wrote:
>> > On Thu, Sep 22, 2016 at 07:48:15AM +1000, Dave Chinner wrote:
>> >
>> > > If you get any error other than -EXDEV or -EOPNOTSUPP from a clone
>> > > operation, there's somethign seriously wrong with the metadata of
>> > > the inode or the underlying storage.
>> >
>> > Such as -ENOSPC?
>>
>> Yup, that's a fatal error, too.  i.e. if a clone returns ENOSPC
>> because there isn't space for the extra metadata, then the fallback
>> data copy is almost certainly going to fail with ENOSPC when trying
>> to reserve/allocate space for both the extra data copy and the extra
>> metadata....
>
> XFS will return ENOSPC during reflink if one of the relevant AGs is
> running low on space for the refcount/rmap trees; however there might be
> enough blocks in another AG to make a regular old copy.
>

That settles it then. I rather be prudent and retry copy anyway.
Other (maybe future) filesystems may have their own non-fatal reasons
to fail clone
and they have the right to do so.

I will resend the ACKed clone_range patch pair for Miklos to pick up

For now, I am not at ease about resending the copy_range patch pair
without a proper way for users to get independent test coverage for it.

Unless Miklos or Al feel confident enough about taking those patches
on my testing testimony and their review?

Darrick, do you have an easy way to reproduce the extreme case of
clone failure due to certain AG ENOSPC? any xfstest for it?
I recon that the fallback to copy_range could be useful in that case
(i.e. very large file with one block in the unfortunate AG)

Cheers,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Sept. 29, 2016, 9 a.m. UTC | #9
On Wed, Sep 21, 2016 at 9:29 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Sep 21, 2016 at 7:01 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Wed, Sep 21, 2016 at 6:09 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Wed, Sep 14, 2016 at 2:43 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>> When copying up within the same fs, try to use vfs_clone_file_range().
>>>> This is very efficient when lower and upper are on the same fs
>>>> with file reflink support. If vfs_clone_file_range() fails because
>>>> lower and upper are not on the same fs or if fs has no reflink support,
>>>> copy up falls back to the regular data copy code.
>>>>
>>>> Tested correct behavior when lower and upper are on:
>>>> 1. same ext4 (copy)
>>>> 2. same xfs + reflink patches + mkfs.xfs (copy)
>>>> 3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
>>>> 4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
>>>>
>>>> For comparison, on my laptop, xfstest overlay/001 (copy up of large
>>>> sparse files) takes less than 1 second in the xfs reflink setup vs.
>>>> 25 seconds on the rest of the setups.
>>>>
>>>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>>>> ---
>>>>  fs/overlayfs/copy_up.c | 12 +++++++++++-
>>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
>>>> index 43fdc27..ba039f8 100644
>>>> --- a/fs/overlayfs/copy_up.c
>>>> +++ b/fs/overlayfs/copy_up.c
>>>> @@ -136,6 +136,16 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
>>>>                 goto out_fput;
>>>>         }
>>>>
>>>> +       /* Try to use clone_file_range to clone up within the same fs */
>>>> +       error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
>>>> +       if (!error)
>>>> +               goto out;
>>>> +       /* If we can clone but clone failed - abort */
>>>> +       if (error != -EXDEV && error != -EOPNOTSUPP)
>>>> +               goto out;
>>>
>>> Would be safer to fall back on any error.
>>>
>>
>> Agreed. Dave, since you suggested the 'softer' fall back, do you have
>> any objections?
>>
>>> Otherwise ACK.
>>>
>>
>> Will you be taking this to your tree?
>
> Sure I can take it.
>

Miklos,

Will you be able to queue v4 patches for 4.9?

Thanks,
Amir.

>>
>> Please note that this patch depends on patch v3 1/4,
>> because vfs_clone_file_range() in current mainline
>> fails to clone from lower to upper due to upper and lower being
>> private mount clones
>> and therefore not the same f_path.mnt.
>
> Right.  I didn't do a thorough audit of ->clone_file_range()
> implementations, but 1/4 is probably OK.
>
> Thanks,
> Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Sept. 30, 2016, 11:14 a.m. UTC | #10
On Thu, Sep 29, 2016 at 11:00 AM, Amir Goldstein <amir73il@gmail.com> wrote:

> Will you be able to queue v4 patches for 4.9?

Queued to

 git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #overlayfs-next

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 43fdc27..ba039f8 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -136,6 +136,16 @@  static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 		goto out_fput;
 	}
 
+	/* Try to use clone_file_range to clone up within the same fs */
+	error = vfs_clone_file_range(old_file, 0, new_file, 0, len);
+	if (!error)
+		goto out;
+	/* If we can clone but clone failed - abort */
+	if (error != -EXDEV && error != -EOPNOTSUPP)
+		goto out;
+	/* Can't clone, so now we try to copy the data */
+	error = 0;
+
 	/* FIXME: copy up sparse files efficiently */
 	while (len) {
 		size_t this_len = OVL_COPY_UP_CHUNK_SIZE;
@@ -160,7 +170,7 @@  static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 
 		len -= bytes;
 	}
-
+out:
 	fput(new_file);
 out_fput:
 	fput(old_file);