mbox series

[v6,00/28] fs: fixes for serious clone/dedupe problems

Message ID 154013850285.29026.16168387526580596209.stgit@magnolia (mailing list archive)
Headers show
Series fs: fixes for serious clone/dedupe problems | expand

Message

Darrick J. Wong Oct. 21, 2018, 4:15 p.m. UTC
Hi all,

Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
reflink implementation, and tracked it down to reflink forgetting to do
some of the file-extending activities that must happen for regular
writes.

We then started auditing the clone, dedupe, and copyfile code and
realized that from a file contents perspective, clonerange isn't any
different from a regular file write.  Unfortunately, we also noticed
that *unlike* a regular write, clonerange skips a ton of overflow
checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
and RLIMIT_FSIZE.  We also observed that cloning into a file did not
strip security privileges (suid, capabilities) like a regular write
would.  I also noticed that xfs and ocfs2 need to dump the page cache
before remapping blocks, not after.

In fixing the range checking problems I also realized that both dedupe
and copyfile tell userspace how much of the requested operation was
acted upon.  Since the range validation can shorten a clone request (or
we can ENOSPC midway through), we might as well plumb the short
operation reporting back through the VFS indirection code to userspace.
I added a few more cleanups to the xfs code per reviewer suggestions.

So, here's the whole giant pile of patches[1] that fix all the problems.
This branch is against current upstream (4.19-rc8).  The patch
"generic: test reflink side effects" recently sent to fstests exercises
the fixes in this series.  Tests are in [2].

--D

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel

Comments

Dave Chinner Oct. 22, 2018, 2:21 a.m. UTC | #1
On Sun, Oct 21, 2018 at 09:15:03AM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> reflink implementation, and tracked it down to reflink forgetting to do
> some of the file-extending activities that must happen for regular
> writes.
> 
> We then started auditing the clone, dedupe, and copyfile code and
> realized that from a file contents perspective, clonerange isn't any
> different from a regular file write.  Unfortunately, we also noticed
> that *unlike* a regular write, clonerange skips a ton of overflow
> checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> strip security privileges (suid, capabilities) like a regular write
> would.  I also noticed that xfs and ocfs2 need to dump the page cache
> before remapping blocks, not after.
> 
> In fixing the range checking problems I also realized that both dedupe
> and copyfile tell userspace how much of the requested operation was
> acted upon.  Since the range validation can shorten a clone request (or
> we can ENOSPC midway through), we might as well plumb the short
> operation reporting back through the VFS indirection code to userspace.
> I added a few more cleanups to the xfs code per reviewer suggestions.
> 
> So, here's the whole giant pile of patches[1] that fix all the problems.
> This branch is against current upstream (4.19-rc8).  The patch
> "generic: test reflink side effects" recently sent to fstests exercises
> the fixes in this series.  Tests are in [2].

Ok, so now that all the patches (other than the ocfs2 bits) have been
reviewed, how do we want to merge this? I can take it through the
XFS tree given that there is a bit of XFS changes that needs to be
co-ordinated with it, or should it go through some other tree?

The other question I have is who reviews ocfs2 changes these days?
Do they get reviewed, or just shepherded in via akpm's tree?

Cheers,

Dave.
Dave Chinner Oct. 22, 2018, 4:37 a.m. UTC | #2
On Mon, Oct 22, 2018 at 01:21:12PM +1100, Dave Chinner wrote:
> On Sun, Oct 21, 2018 at 09:15:03AM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > Dave, Eric, and I have been chasing a stale data exposure bug in the XFS
> > reflink implementation, and tracked it down to reflink forgetting to do
> > some of the file-extending activities that must happen for regular
> > writes.
> > 
> > We then started auditing the clone, dedupe, and copyfile code and
> > realized that from a file contents perspective, clonerange isn't any
> > different from a regular file write.  Unfortunately, we also noticed
> > that *unlike* a regular write, clonerange skips a ton of overflow
> > checks, such as validating the ranges against s_maxbytes, MAX_NON_LFS,
> > and RLIMIT_FSIZE.  We also observed that cloning into a file did not
> > strip security privileges (suid, capabilities) like a regular write
> > would.  I also noticed that xfs and ocfs2 need to dump the page cache
> > before remapping blocks, not after.
> > 
> > In fixing the range checking problems I also realized that both dedupe
> > and copyfile tell userspace how much of the requested operation was
> > acted upon.  Since the range validation can shorten a clone request (or
> > we can ENOSPC midway through), we might as well plumb the short
> > operation reporting back through the VFS indirection code to userspace.
> > I added a few more cleanups to the xfs code per reviewer suggestions.
> > 
> > So, here's the whole giant pile of patches[1] that fix all the problems.
> > This branch is against current upstream (4.19-rc8).  The patch
> > "generic: test reflink side effects" recently sent to fstests exercises
> > the fixes in this series.  Tests are in [2].
> 
> Ok, so now that all the patches (other than the ocfs2 bits) have been
> reviewed, how do we want to merge this? I can take it through the
> XFS tree given that there is a bit of XFS changes that needs to be
> co-ordinated with it, or should it go through some other tree?

Ok, this is a bit of a mess. the patches do not merge cleanly to a
4.19-rc1 base kernel because of all the changes to
include/linux/fs.h that have hit the tree after this. There's also
failures against Documentation/filesystems/fs.h

IOWs, it's not going to get merged through the main XFS tree because
I don't have the patience to resolve all the patch application
failures, then when it comes to merge make sure all the merge
failures end up being resolved correctly.

So if I take it through the XFS tree, it will being a standalone
branch based on 4.19-rc8 and won't hit linux-next until after the
first XFS merge when I can rebase the for-next branch...

Cheers,

Dave.
Al Viro Oct. 22, 2018, 4:52 a.m. UTC | #3
On Mon, Oct 22, 2018 at 03:37:41PM +1100, Dave Chinner wrote:

> Ok, this is a bit of a mess. the patches do not merge cleanly to a
> 4.19-rc1 base kernel because of all the changes to
> include/linux/fs.h that have hit the tree after this. There's also
> failures against Documentation/filesystems/fs.h
> 
> IOWs, it's not going to get merged through the main XFS tree because
> I don't have the patience to resolve all the patch application
> failures, then when it comes to merge make sure all the merge
> failures end up being resolved correctly.
> 
> So if I take it through the XFS tree, it will being a standalone
> branch based on 4.19-rc8 and won't hit linux-next until after the
> first XFS merge when I can rebase the for-next branch...

How many conflicts does it have with XFS tree?  I can take it via
vfs.git...
Dave Chinner Oct. 22, 2018, 5:08 a.m. UTC | #4
On Mon, Oct 22, 2018 at 05:52:49AM +0100, Al Viro wrote:
> On Mon, Oct 22, 2018 at 03:37:41PM +1100, Dave Chinner wrote:
> 
> > Ok, this is a bit of a mess. the patches do not merge cleanly to a
> > 4.19-rc1 base kernel because of all the changes to
> > include/linux/fs.h that have hit the tree after this. There's also
> > failures against Documentation/filesystems/fs.h
> > 
> > IOWs, it's not going to get merged through the main XFS tree because
> > I don't have the patience to resolve all the patch application
> > failures, then when it comes to merge make sure all the merge
> > failures end up being resolved correctly.
> > 
> > So if I take it through the XFS tree, it will being a standalone
> > branch based on 4.19-rc8 and won't hit linux-next until after the
> > first XFS merge when I can rebase the for-next branch...
> 
> How many conflicts does it have with XFS tree?  I can take it via
> vfs.git...

I gave up after 4 of the first 6 or 7 patches had conflicts in vfs
and documentation code.

There were changes that went into 4.19-rc7 that changed
{do|vfs}_clone_file_range() prototypes and this patchset hits
prototypes adjacent to that multiple times. There's also a conflicts
against a new f_ops->fadvise method. These all appear to be direct
fallout of fixes needed for all the overlay f_ops changes.

The XFS changes at the end of the patchset are based on
commits that were merged into -rc7 and -rc8, so if you are using
-rc8 as your base, then it all merges cleanly. There are no
conflicts with the current xfs/for-next branch.

I've just merged and built it into my test tree (-rc8, xfs/for-next,
djwong/devel) so I can test it properly, but if it merges cleanly
with the vfs tree you are building then that's probably the easiest
way to merge it all...

Cheers,

Dave.
Amir Goldstein Oct. 22, 2018, 5:42 a.m. UTC | #5
On Mon, Oct 22, 2018 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Oct 22, 2018 at 05:52:49AM +0100, Al Viro wrote:
> > On Mon, Oct 22, 2018 at 03:37:41PM +1100, Dave Chinner wrote:
> >
> > > Ok, this is a bit of a mess. the patches do not merge cleanly to a
> > > 4.19-rc1 base kernel because of all the changes to
> > > include/linux/fs.h that have hit the tree after this. There's also
> > > failures against Documentation/filesystems/fs.h
> > >
> > > IOWs, it's not going to get merged through the main XFS tree because
> > > I don't have the patience to resolve all the patch application
> > > failures, then when it comes to merge make sure all the merge
> > > failures end up being resolved correctly.
> > >
> > > So if I take it through the XFS tree, it will being a standalone
> > > branch based on 4.19-rc8 and won't hit linux-next until after the
> > > first XFS merge when I can rebase the for-next branch...
> >
> > How many conflicts does it have with XFS tree?  I can take it via
> > vfs.git...
>
> I gave up after 4 of the first 6 or 7 patches had conflicts in vfs
> and documentation code.
>
> There were changes that went into 4.19-rc7 that changed
> {do|vfs}_clone_file_range() prototypes and this patchset hits
> prototypes adjacent to that multiple times. There's also a conflicts
> against a new f_ops->fadvise method. These all appear to be direct
> fallout of fixes needed for all the overlay f_ops changes.
>
> The XFS changes at the end of the patchset are based on
> commits that were merged into -rc7 and -rc8, so if you are using
> -rc8 as your base, then it all merges cleanly. There are no
> conflicts with the current xfs/for-next branch.
>
> I've just merged and built it into my test tree (-rc8, xfs/for-next,
> djwong/devel) so I can test it properly, but if it merges cleanly
> with the vfs tree you are building then that's probably the easiest
> way to merge it all...
>

Dave,

Pardon my ignorance, but its an opportunity for me to learn a thing
or two about kernel development process.

First, I asked Darrick to base his patches on top of -rc8 intentionally
to avoid the conflict with "swap names of {do|vfs}_clone_file_range()" (*).
My change pre dates his changes so it makes sense.

What I don't get is why does it need to create a problem?
Can you not back merge -rc8 into xfs/for-next (or into vfs/for-next for
that matter) and then merge Darrick's patches?

What is the culprit with doing that?

Thanks,
Amir.

(*) Yes, I do realize "swap names of {do|vfs}_clone_file_range()"
is a backporting landmine. It's been on my todo list to send it to Greg
here I am going to do it now...
Dave Chinner Oct. 22, 2018, 6:55 a.m. UTC | #6
On Mon, Oct 22, 2018 at 08:42:29AM +0300, Amir Goldstein wrote:
> On Mon, Oct 22, 2018 at 8:09 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Oct 22, 2018 at 05:52:49AM +0100, Al Viro wrote:
> > > On Mon, Oct 22, 2018 at 03:37:41PM +1100, Dave Chinner wrote:
> > >
> > > > Ok, this is a bit of a mess. the patches do not merge cleanly to a
> > > > 4.19-rc1 base kernel because of all the changes to
> > > > include/linux/fs.h that have hit the tree after this. There's also
> > > > failures against Documentation/filesystems/fs.h
> > > >
> > > > IOWs, it's not going to get merged through the main XFS tree because
> > > > I don't have the patience to resolve all the patch application
> > > > failures, then when it comes to merge make sure all the merge
> > > > failures end up being resolved correctly.
> > > >
> > > > So if I take it through the XFS tree, it will being a standalone
> > > > branch based on 4.19-rc8 and won't hit linux-next until after the
> > > > first XFS merge when I can rebase the for-next branch...
> > >
> > > How many conflicts does it have with XFS tree?  I can take it via
> > > vfs.git...
> >
> > I gave up after 4 of the first 6 or 7 patches had conflicts in vfs
> > and documentation code.
> >
> > There were changes that went into 4.19-rc7 that changed
> > {do|vfs}_clone_file_range() prototypes and this patchset hits
> > prototypes adjacent to that multiple times. There's also a conflicts
> > against a new f_ops->fadvise method. These all appear to be direct
> > fallout of fixes needed for all the overlay f_ops changes.
> >
> > The XFS changes at the end of the patchset are based on
> > commits that were merged into -rc7 and -rc8, so if you are using
> > -rc8 as your base, then it all merges cleanly. There are no
> > conflicts with the current xfs/for-next branch.
> >
> > I've just merged and built it into my test tree (-rc8, xfs/for-next,
> > djwong/devel) so I can test it properly, but if it merges cleanly
> > with the vfs tree you are building then that's probably the easiest
> > way to merge it all...
> >
> 
> Dave,
> 
> Pardon my ignorance, but its an opportunity for me to learn a thing
> or two about kernel development process.
> 
> First, I asked Darrick to base his patches on top of -rc8 intentionally
> to avoid the conflict with "swap names of {do|vfs}_clone_file_range()" (*).
> My change pre dates his changes so it makes sense.

Actually, I asked him to do that because the critical clone/dedupe
data corruption bug fixes for XFS that were merged into -rc8. They
created substantial conflicts with the XFS code being
rearranged in the patch set, and that wasn't something easy to
resolve.

But because those XFS commits in 4.19-rc8 are from a stable in a
topic branch in the xfs tree (xfs-4.19-fixes-1) which is merged into
the for-next tree before anything else, the XFS changes in Darrick's
patchset merge cleanly with the XFS for-next branch.

What doesn't merge cleanly is all the VFS API and prototype stuff
that got changed after -rc1 because it's not in the master branch of
the xfs dev tree.

> What I don't get is why does it need to create a problem?
> Can you not back merge -rc8 into xfs/for-next (or into vfs/for-next for
> that matter) and then merge Darrick's patches?

Because it forces everyone to move their base kernel forward to test
the latest fixes. i.e. i forces everyone to rebase their local dev
trees to -rc8 regardless of whether they want to or not.

If you're doing work outside the XFS tree, then that forces you to
rebase everything you are working on, not just modify your XFS
patches that are affected by the new XFS changes. it also breaks the
concept of having a baseline for regression tests - force upgrading
to -rc8 breaks the baseline for comparing just XFS changes.

The other reason is topic branches. They need to have a stable base.
i.e. all your topic branches that get merged in need to be on the
same base commit so they can be merged and rearranged without
problems or having to rewrite commits. The for-next tree can move
forward on a rebase with topic branches if necessary, but  it has
downsides for downstream developers.

For example, I can do:

git reset --hard v4.19-rc8
git am djw-clone-dedupe-rework

and have it apply cleanly. But I can't turn that into a topic branch
for merge into linux-xfs/for-next because it's not based on the
correct branch.  i.e. I need to do this to create a topic branch for
for-next merge:

git reset --hard linux-xfs/master
git merge linux-xfs/xfs-4.19-fixes-1
git am djw-clone-dedupe-rework
{rejects galore}

I can fix all the rejectsi (boring, tedious, time consuming), but
then when I go to merge it back into a 4.19-rc8 kernel like so:

git reset --hard v4.19-rc8
git merge linux-xfs/vfs-4.20-clone-dedupe-rework
[conflicts!]

I essentially have to revert all of the conflict resolution I just
did to get it into ithe XFS topic branch. And those conflicts will
need to be resolved in the upstream merge, too, and by every XFS
developer who changes their base kernel to >=4.19-rc7.

IOWs, there's downsides everywhere. Yes, we could rebase the entire
tree on 4.19-rc8, but Linus *really* didn't like dev trees to be
rebased to a late -rcX just to clean up merge issues like this. He
wanted to see the development history of the code he was asked to
pull and wanted to see merge conflicts and resolve them himself so
he knew who and what was causing problems when merging code. That
way he nkew in future what changes to allow late in -rcX releases
and what to defer to after the next merge window.

Co-ordination between tree is important - high level API changes are
exactly the sort of thing that causes dev tree merge pain late in a
dev cycle. Hence this sort of thing is best done immediately after
-rc1 when everyone is getting ready to rebase and start their next
dev cycle and nobody is really affected by the API change.

So, historically speaking, I've done things this way because that's
how Linus wanted them done to make life easier for both himself and
everyone who manages a dev tree.

Cheers,

Dave.