diff mbox series

[3/4] Btrfs: check if destination root is read-only for deduplication

Message ID 20181212180559.15249-4-fdmanana@kernel.org (mailing list archive)
State New, archived
Headers show
Series Btrfs: a few more cleanups and fixes for clone/deduplication | expand

Commit Message

Filipe Manana Dec. 12, 2018, 6:05 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Checking if the destination root is read-only was being performed only for
clone operations. Make deduplication check it as well, as it does not make
sense to not do it, even if it is an operation that does not change the
file contents (such as defrag for example, which checks first if the root
is read-only).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ioctl.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

Comments

David Sterba Dec. 13, 2018, 4:07 p.m. UTC | #1
On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Checking if the destination root is read-only was being performed only for
> clone operations. Make deduplication check it as well, as it does not make
> sense to not do it, even if it is an operation that does not change the
> file contents (such as defrag for example, which checks first if the root
> is read-only).

And this is also change in user-visible behaviour of dedupe, so this
needs to be verified if it's not breaking existing tools.
Filipe Manana Jan. 31, 2019, 4:39 p.m. UTC | #2
On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Checking if the destination root is read-only was being performed only for
> > clone operations. Make deduplication check it as well, as it does not make
> > sense to not do it, even if it is an operation that does not change the
> > file contents (such as defrag for example, which checks first if the root
> > is read-only).
>
> And this is also change in user-visible behaviour of dedupe, so this
> needs to be verified if it's not breaking existing tools.

Have you had the chance to do such verification?

This actually conflicts with send. Send does not expect a root/tree to
change, and with dedupe on read-only roots happening
in parallel with send is going to cause all sorts of unexpected and
undesired problems...

This is a problem introduced by dedupe ioctl when it landed, since
send existed for a longer time (when nothing else was
allowed to change read-only roots, including defrag).

I understand it can break some applications, but adding other solution
such as preventing send and dedupe from running in parallel
(erroring out or block and wait for each other, etc) is going to be
really ugly. There's always the workaround for apps to set the
subvolume
to RW mode, do the dedupe, then switch it back to RO mode.

Thanks.
Hugo Mills Jan. 31, 2019, 4:44 p.m. UTC | #3
On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Checking if the destination root is read-only was being performed only for
> > > clone operations. Make deduplication check it as well, as it does not make
> > > sense to not do it, even if it is an operation that does not change the
> > > file contents (such as defrag for example, which checks first if the root
> > > is read-only).
> >
> > And this is also change in user-visible behaviour of dedupe, so this
> > needs to be verified if it's not breaking existing tools.
> 
> Have you had the chance to do such verification?
> 
> This actually conflicts with send. Send does not expect a root/tree to
> change, and with dedupe on read-only roots happening
> in parallel with send is going to cause all sorts of unexpected and
> undesired problems...
> 
> This is a problem introduced by dedupe ioctl when it landed, since
> send existed for a longer time (when nothing else was
> allowed to change read-only roots, including defrag).
> 
> I understand it can break some applications, but adding other solution
> such as preventing send and dedupe from running in parallel
> (erroring out or block and wait for each other, etc) is going to be
> really ugly. There's always the workaround for apps to set the
> subvolume
> to RW mode, do the dedupe, then switch it back to RO mode.

   Only if you want your incremental send chain to break on the way
past...

   I think it's fairly clear by now (particularly from the last thread
on this a couple of weeks ago) that making RO subvols RW and then back
again is a fast way to broken incremental receives.

   Hugo.
Filipe Manana Feb. 12, 2019, 5:59 p.m. UTC | #4
On Thu, Jan 31, 2019 at 4:39 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Checking if the destination root is read-only was being performed only for
> > > clone operations. Make deduplication check it as well, as it does not make
> > > sense to not do it, even if it is an operation that does not change the
> > > file contents (such as defrag for example, which checks first if the root
> > > is read-only).
> >
> > And this is also change in user-visible behaviour of dedupe, so this
> > needs to be verified if it's not breaking existing tools.
>
> Have you had the chance to do such verification?
>
> This actually conflicts with send. Send does not expect a root/tree to
> change, and with dedupe on read-only roots happening
> in parallel with send is going to cause all sorts of unexpected and
> undesired problems...
>
> This is a problem introduced by dedupe ioctl when it landed, since
> send existed for a longer time (when nothing else was
> allowed to change read-only roots, including defrag).

Another ping.
This is a problem that has to be solved one way or another.

Thanks.

>
> I understand it can break some applications, but adding other solution
> such as preventing send and dedupe from running in parallel
> (erroring out or block and wait for each other, etc) is going to be
> really ugly. There's always the workaround for apps to set the
> subvolume
> to RW mode, do the dedupe, then switch it back to RO mode.
>
> Thanks.
David Sterba Feb. 18, 2019, 3:38 p.m. UTC | #5
On Thu, Jan 31, 2019 at 04:44:39PM +0000, Hugo Mills wrote:
> On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Checking if the destination root is read-only was being performed only for
> > > > clone operations. Make deduplication check it as well, as it does not make
> > > > sense to not do it, even if it is an operation that does not change the
> > > > file contents (such as defrag for example, which checks first if the root
> > > > is read-only).
> > >
> > > And this is also change in user-visible behaviour of dedupe, so this
> > > needs to be verified if it's not breaking existing tools.
> > 
> > Have you had the chance to do such verification?
> > 
> > This actually conflicts with send. Send does not expect a root/tree to
> > change, and with dedupe on read-only roots happening
> > in parallel with send is going to cause all sorts of unexpected and
> > undesired problems...
> > 
> > This is a problem introduced by dedupe ioctl when it landed, since
> > send existed for a longer time (when nothing else was
> > allowed to change read-only roots, including defrag).
> > 
> > I understand it can break some applications, but adding other solution
> > such as preventing send and dedupe from running in parallel
> > (erroring out or block and wait for each other, etc) is going to be
> > really ugly. There's always the workaround for apps to set the
> > subvolume
> > to RW mode, do the dedupe, then switch it back to RO mode.
> 
>    Only if you want your incremental send chain to break on the way
> past...
> 
>    I think it's fairly clear by now (particularly from the last thread
> on this a couple of weeks ago) that making RO subvols RW and then back
> again is a fast way to broken incremental receives.

So, I think the way it should be fixed is to keep deduplication off the
read-only subvolumes. The main reason I see is to avoid the random
problems that arise from send + dedupe interaction. The cost is lower
deduplication ratio.

The main usecase being a primary subvolume with RO snapshots over time,
with optional incremental send. That I know is common and widely used.

A problematic usecase that would utilize deduplication over RO snapshots
does could be something like a set of subvolumes that have very similar
data, get snapshotted or set RO, followed by a dedupe pass.

To support the latter I'd go only via the RO->RW, dedupe, RW->RO way, as
this records that there was a change and does not collide with send
assumptions. That this leads to loss of incremental send must be
communicated to the user with possible override, eg. --I-know-what-I-am-doing
option.
David Sterba Feb. 18, 2019, 4:01 p.m. UTC | #6
On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Checking if the destination root is read-only was being performed only for
> clone operations. Make deduplication check it as well, as it does not make
> sense to not do it, even if it is an operation that does not change the
> file contents (such as defrag for example, which checks first if the root
> is read-only).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I've copied the portions of our discussion and added the patch to
misc-next.
Filipe Manana Feb. 18, 2019, 4:55 p.m. UTC | #7
On Mon, Feb 18, 2019 at 3:36 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Jan 31, 2019 at 04:44:39PM +0000, Hugo Mills wrote:
> > On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> > > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> > > >
> > > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > Checking if the destination root is read-only was being performed only for
> > > > > clone operations. Make deduplication check it as well, as it does not make
> > > > > sense to not do it, even if it is an operation that does not change the
> > > > > file contents (such as defrag for example, which checks first if the root
> > > > > is read-only).
> > > >
> > > > And this is also change in user-visible behaviour of dedupe, so this
> > > > needs to be verified if it's not breaking existing tools.
> > >
> > > Have you had the chance to do such verification?
> > >
> > > This actually conflicts with send. Send does not expect a root/tree to
> > > change, and with dedupe on read-only roots happening
> > > in parallel with send is going to cause all sorts of unexpected and
> > > undesired problems...
> > >
> > > This is a problem introduced by dedupe ioctl when it landed, since
> > > send existed for a longer time (when nothing else was
> > > allowed to change read-only roots, including defrag).
> > >
> > > I understand it can break some applications, but adding other solution
> > > such as preventing send and dedupe from running in parallel
> > > (erroring out or block and wait for each other, etc) is going to be
> > > really ugly. There's always the workaround for apps to set the
> > > subvolume
> > > to RW mode, do the dedupe, then switch it back to RO mode.
> >
> >    Only if you want your incremental send chain to break on the way
> > past...
> >
> >    I think it's fairly clear by now (particularly from the last thread
> > on this a couple of weeks ago) that making RO subvols RW and then back
> > again is a fast way to broken incremental receives.
>
> So, I think the way it should be fixed is to keep deduplication off the
> read-only subvolumes. The main reason I see is to avoid the random
> problems that arise from send + dedupe interaction. The cost is lower
> deduplication ratio.
>
> The main usecase being a primary subvolume with RO snapshots over time,
> with optional incremental send. That I know is common and widely used.
>
> A problematic usecase that would utilize deduplication over RO snapshots
> does could be something like a set of subvolumes that have very similar
> data, get snapshotted or set RO, followed by a dedupe pass.
>
> To support the latter I'd go only via the RO->RW, dedupe, RW->RO way, as
> this records that there was a change and does not collide with send
> assumptions. That this leads to loss of incremental send must be
> communicated to the user with possible override, eg. --I-know-what-I-am-doing
> option.

I fully agree.
Thanks.
Zygo Blaxell Feb. 20, 2019, 4:41 p.m. UTC | #8
On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Checking if the destination root is read-only was being performed only for
> > > clone operations. Make deduplication check it as well, as it does not make
> > > sense to not do it, even if it is an operation that does not change the
> > > file contents (such as defrag for example, which checks first if the root
> > > is read-only).
> >
> > And this is also change in user-visible behaviour of dedupe, so this
> > needs to be verified if it's not breaking existing tools.
> 
> Have you had the chance to do such verification?
> 
> This actually conflicts with send. Send does not expect a root/tree to
> change, and with dedupe on read-only roots happening
> in parallel with send is going to cause all sorts of unexpected and
> undesired problems...

This is a problem bees ran into.  There is a workaround in bees (called
--workaround-btrfs-send) that avoids RO subvols as dedupe targets.
As the name of the option implies, it works around problems in btrfs send.

This kernel change makes the workaround mandatory now, as the default
case (without workaround) will fail on every RO subvol even if that
behavior is desired by the user.  That breaks an important use case on
the receiving side of sends--to dedupe the received subvols together
while also protecting them against modification on the target system
with the RO flag--and preserving that use case is why the send workaround
was optional (and not default) in bees.

bees also won't handle the RO/RW/RO transition correctly, as it didn't
seem like a sane thing to support at the time.  That is arguably something
to be fixed in bees.

> This is a problem introduced by dedupe ioctl when it landed, since
> send existed for a longer time (when nothing else was
> allowed to change read-only roots, including defrag).

Is there a reason why incremental send can't simply be fixed?  As far
as I can tell, send is failing because of a runtime check that seems to
be too strict; however, I haven't tried removing that check to see if
it fixes the problem in send, or just hides the next problem.

More details at:

	https://github.com/Zygo/bees/issues/79#issuecomment-429039036

> I understand it can break some applications, but adding other solution
> such as preventing send and dedupe from running in parallel
> (erroring out or block and wait for each other, etc) is going to be
> really ugly. There's always the workaround for apps to set the
> subvolume
> to RW mode, do the dedupe, then switch it back to RO mode.
> 
> Thanks.
Filipe Manana Feb. 20, 2019, 4:54 p.m. UTC | #9
On Wed, Feb 20, 2019 at 4:42 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Checking if the destination root is read-only was being performed only for
> > > > clone operations. Make deduplication check it as well, as it does not make
> > > > sense to not do it, even if it is an operation that does not change the
> > > > file contents (such as defrag for example, which checks first if the root
> > > > is read-only).
> > >
> > > And this is also change in user-visible behaviour of dedupe, so this
> > > needs to be verified if it's not breaking existing tools.
> >
> > Have you had the chance to do such verification?
> >
> > This actually conflicts with send. Send does not expect a root/tree to
> > change, and with dedupe on read-only roots happening
> > in parallel with send is going to cause all sorts of unexpected and
> > undesired problems...
>
> This is a problem bees ran into.  There is a workaround in bees (called
> --workaround-btrfs-send) that avoids RO subvols as dedupe targets.
> As the name of the option implies, it works around problems in btrfs send.
>
> This kernel change makes the workaround mandatory now, as the default
> case (without workaround) will fail on every RO subvol even if that
> behavior is desired by the user.  That breaks an important use case on
> the receiving side of sends--to dedupe the received subvols together
> while also protecting them against modification on the target system
> with the RO flag--and preserving that use case is why the send workaround
> was optional (and not default) in bees.
>
> bees also won't handle the RO/RW/RO transition correctly, as it didn't
> seem like a sane thing to support at the time.  That is arguably something
> to be fixed in bees.
>
> > This is a problem introduced by dedupe ioctl when it landed, since
> > send existed for a longer time (when nothing else was
> > allowed to change read-only roots, including defrag).
>
> Is there a reason why incremental send can't simply be fixed?

This is a problem that affects both incremental and non-incremental (full) send.

> As far
> as I can tell, send is failing because of a runtime check that seems to
> be too strict; however, I haven't tried removing that check to see if
> it fixes the problem in send, or just hides the next problem.

The problem is send was designed with the idea that read-only roots
don't ever change.

The failures that can happen are many and unpredictable, from
occasionally failing with some error, to invalid memory accesses, use
after free problems, etc.
Essentially all caused by races when the nodes/leafs from the
read-only tree change while send is running.

I don't know what runtime check you are mentioning that is too strict.
You can definitely do dedupe on a read-only root and after it finishes
do a send (either full or incremental), and it will work.

>
> More details at:
>
>         https://github.com/Zygo/bees/issues/79#issuecomment-429039036
>
> > I understand it can break some applications, but adding other solution
> > such as preventing send and dedupe from running in parallel
> > (erroring out or block and wait for each other, etc) is going to be
> > really ugly. There's always the workaround for apps to set the
> > subvolume
> > to RW mode, do the dedupe, then switch it back to RO mode.
> >
> > Thanks.
Zygo Blaxell Feb. 20, 2019, 5:17 p.m. UTC | #10
On Wed, Feb 20, 2019 at 04:54:09PM +0000, Filipe Manana wrote:
> On Wed, Feb 20, 2019 at 4:42 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> > > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> > > >
> > > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > Checking if the destination root is read-only was being performed only for
> > > > > clone operations. Make deduplication check it as well, as it does not make
> > > > > sense to not do it, even if it is an operation that does not change the
> > > > > file contents (such as defrag for example, which checks first if the root
> > > > > is read-only).
> > > >
> > > > And this is also change in user-visible behaviour of dedupe, so this
> > > > needs to be verified if it's not breaking existing tools.
> > >
> > > Have you had the chance to do such verification?
> > >
> > > This actually conflicts with send. Send does not expect a root/tree to
> > > change, and with dedupe on read-only roots happening
> > > in parallel with send is going to cause all sorts of unexpected and
> > > undesired problems...
> >
> > This is a problem bees ran into.  There is a workaround in bees (called
> > --workaround-btrfs-send) that avoids RO subvols as dedupe targets.
> > As the name of the option implies, it works around problems in btrfs send.
> >
> > This kernel change makes the workaround mandatory now, as the default
> > case (without workaround) will fail on every RO subvol even if that
> > behavior is desired by the user.  That breaks an important use case on
> > the receiving side of sends--to dedupe the received subvols together
> > while also protecting them against modification on the target system
> > with the RO flag--and preserving that use case is why the send workaround
> > was optional (and not default) in bees.
> >
> > bees also won't handle the RO/RW/RO transition correctly, as it didn't
> > seem like a sane thing to support at the time.  That is arguably something
> > to be fixed in bees.
> >
> > > This is a problem introduced by dedupe ioctl when it landed, since
> > > send existed for a longer time (when nothing else was
> > > allowed to change read-only roots, including defrag).
> >
> > Is there a reason why incremental send can't simply be fixed?
> 
> This is a problem that affects both incremental and non-incremental (full) send.
> 
> > As far
> > as I can tell, send is failing because of a runtime check that seems to
> > be too strict; however, I haven't tried removing that check to see if
> > it fixes the problem in send, or just hides the next problem.
> 
> The problem is send was designed with the idea that read-only roots
> don't ever change.

...except when they're snapshotted, balanced, or deduped (to list places
where that assumption hasn't held so far).

> The failures that can happen are many and unpredictable, from
> occasionally failing with some error, to invalid memory accesses, use
> after free problems, etc.
> Essentially all caused by races when the nodes/leafs from the
> read-only tree change while send is running.

Maybe you can explain this further?  As far as I can tell, all of those
are send bugs that should just be (and over the years have been) fixed.

> I don't know what runtime check you are mentioning that is too strict.

It's this one, in send.c:

                        /*
                         * We may have found an extent item that has changed
                         * only its disk_bytenr field and the corresponding
                         * inode item was not updated. This case happens due to
                         * very specific timings during relocation when a leaf
                         * that contains file extent items is COWed while
                         * relocation is ongoing and its in the stage where it
                         * updates data pointers. So when this happens we can
                         * safely ignore it since we know it's the same extent,
                         * but just at different logical and physical locations
                         * (when an extent is fully replaced with a new one, we
                         * know the generation number must have changed too,
                         * since snapshot creation implies committing the current
                         * transaction, and the inode item must have been updated
                         * as well).
                         * This replacement of the disk_bytenr happens at
                         * relocation.c:replace_file_extents() through
                         * relocation.c:btrfs_reloc_cow_block().
                         */
                        if (btrfs_file_extent_generation(leaf_l, ei_l) ==
                            btrfs_file_extent_generation(leaf_r, ei_r) &&
                            btrfs_file_extent_ram_bytes(leaf_l, ei_l) ==
                            btrfs_file_extent_ram_bytes(leaf_r, ei_r) &&
                            btrfs_file_extent_compression(leaf_l, ei_l) ==
                            btrfs_file_extent_compression(leaf_r, ei_r) &&
                            btrfs_file_extent_encryption(leaf_l, ei_l) ==
                            btrfs_file_extent_encryption(leaf_r, ei_r) &&
                            btrfs_file_extent_other_encoding(leaf_l, ei_l) ==
                            btrfs_file_extent_other_encoding(leaf_r, ei_r) &&
                            btrfs_file_extent_type(leaf_l, ei_l) ==
                            btrfs_file_extent_type(leaf_r, ei_r) &&
                            btrfs_file_extent_disk_bytenr(leaf_l, ei_l) !=
                            btrfs_file_extent_disk_bytenr(leaf_r, ei_r) &&
                            btrfs_file_extent_disk_num_bytes(leaf_l, ei_l) ==
                            btrfs_file_extent_disk_num_bytes(leaf_r, ei_r) &&
                            btrfs_file_extent_offset(leaf_l, ei_l) ==
                            btrfs_file_extent_offset(leaf_r, ei_r) &&
                            btrfs_file_extent_num_bytes(leaf_l, ei_l) ==
                            btrfs_file_extent_num_bytes(leaf_r, ei_r))
                                return 0;
                }

                inconsistent_snapshot_error(sctx, result, "extent");
                return -EIO;

This is the point where bees users report send failures.

I don't really understand what that code is trying to achieve, though.
If we are diffing two subvol trees then we should just send anything
that's different--even if we occasionally send a few redundant extents
because we happen to be sending during a balance, because that's better
than bombing out completely.

Or is the problem more like we lose our ei pointer in one of the subvols
when the extent tree changes under it?  That would be harder to solve,
it would have to keep releasing and reacquiring everything.

> You can definitely do dedupe on a read-only root and after it finishes
> do a send (either full or incremental), and it will work.

If there's a way to make a subvol RW temporarily _without breaking
incremental send from that subvol_ (i.e. without clearing the parent
UUIDs, maybe also allowing only dedupe) then I have no objection.

> > More details at:
> >
> >         https://github.com/Zygo/bees/issues/79#issuecomment-429039036
> >
> > > I understand it can break some applications, but adding other solution
> > > such as preventing send and dedupe from running in parallel
> > > (erroring out or block and wait for each other, etc) is going to be
> > > really ugly. There's always the workaround for apps to set the
> > > subvolume
> > > to RW mode, do the dedupe, then switch it back to RO mode.
> > >
> > > Thanks.
Zygo Blaxell Feb. 21, 2019, 4:54 p.m. UTC | #11
On Wed, Feb 20, 2019 at 04:54:09PM +0000, Filipe Manana wrote:
> On Wed, Feb 20, 2019 at 4:42 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> > > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> > > >
> > > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > Checking if the destination root is read-only was being performed only for
> > > > > clone operations. Make deduplication check it as well, as it does not make
> > > > > sense to not do it, even if it is an operation that does not change the
> > > > > file contents (such as defrag for example, which checks first if the root
> > > > > is read-only).
> > > >
> > > > And this is also change in user-visible behaviour of dedupe, so this
> > > > needs to be verified if it's not breaking existing tools.
> > >
> > > Have you had the chance to do such verification?
> > >
> > > This actually conflicts with send. Send does not expect a root/tree to
> > > change, and with dedupe on read-only roots happening
> > > in parallel with send is going to cause all sorts of unexpected and
> > > undesired problems...
> >
> > This is a problem bees ran into.  There is a workaround in bees (called
> > --workaround-btrfs-send) that avoids RO subvols as dedupe targets.
> > As the name of the option implies, it works around problems in btrfs send.
> >
> > This kernel change makes the workaround mandatory now, as the default
> > case (without workaround) will fail on every RO subvol even if that
> > behavior is desired by the user.  That breaks an important use case on
> > the receiving side of sends--to dedupe the received subvols together
> > while also protecting them against modification on the target system
> > with the RO flag--and preserving that use case is why the send workaround
> > was optional (and not default) in bees.
> >
> > bees also won't handle the RO/RW/RO transition correctly, as it didn't
> > seem like a sane thing to support at the time.  That is arguably something
> > to be fixed in bees.
> >
> > > This is a problem introduced by dedupe ioctl when it landed, since
> > > send existed for a longer time (when nothing else was
> > > allowed to change read-only roots, including defrag).
> >
> > Is there a reason why incremental send can't simply be fixed?
> 
> This is a problem that affects both incremental and non-incremental (full) send.
> 
> > As far
> > as I can tell, send is failing because of a runtime check that seems to
> > be too strict; however, I haven't tried removing that check to see if
> > it fixes the problem in send, or just hides the next problem.
> 
> The problem is send was designed with the idea that read-only roots
> don't ever change.

Wait...don't ever change, or don't change while send is running?  If the
only thing we have to do is prevent concurrent send and dedupe, then that
might be much easier--and much less intrusive than breaking dedupe of
all RO subvols.

How does concurrent send and balance work now?  Maybe dedupe can use the
same approach.

> The failures that can happen are many and unpredictable, from
> occasionally failing with some error, to invalid memory accesses, use
> after free problems, etc.
> Essentially all caused by races when the nodes/leafs from the
> read-only tree change while send is running.
> 
> I don't know what runtime check you are mentioning that is too strict.
> You can definitely do dedupe on a read-only root and after it finishes
> do a send (either full or incremental), and it will work.

OK, after rereading https://github.com/Zygo/bees/issues/79 it's not
clear to me whether the problem reported was "send fails while bees is
running" or "send fails if bees is run between two incremental sends."
I had concluded it was the latter, which would definitely be a send bug,
but I can't find any statement to that effect from the bug reporter.

If it's the former, then all we need is some way to prevent dedupe while
sending on the same subvol.  Maybe send could set a flag (rwlock?) on the
subvol data structure, and dedupe can check the flag, with this behavior:

	- if dedupe is running while send starts, block the send until
	the dedupe is finished

	- if send is running when dedupe starts, reject the dedupe
	with...EBUSY?

Userspace dedupe could look for EBUSY and reschedule dedupe of the RO
subvol later.

Dedupe could block until the send is finished (similar to the way
that subvol deletes block until balance is finished) but I'd prefer to
have dedupe return immediately with an error so userspace can schedule
something different to dedupe (similar to the way that multiple concurrent
balances exclude each other).


> >
> > More details at:
> >
> >         https://github.com/Zygo/bees/issues/79#issuecomment-429039036
> >
> > > I understand it can break some applications, but adding other solution
> > > such as preventing send and dedupe from running in parallel
> > > (erroring out or block and wait for each other, etc) is going to be
> > > really ugly. There's always the workaround for apps to set the
> > > subvolume
> > > to RW mode, do the dedupe, then switch it back to RO mode.
> > >
> > > Thanks.
Filipe Manana Feb. 22, 2019, 11:13 a.m. UTC | #12
On Wed, Feb 20, 2019 at 5:17 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Wed, Feb 20, 2019 at 04:54:09PM +0000, Filipe Manana wrote:
> > On Wed, Feb 20, 2019 at 4:42 PM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Thu, Jan 31, 2019 at 04:39:22PM +0000, Filipe Manana wrote:
> > > > On Thu, Dec 13, 2018 at 4:08 PM David Sterba <dsterba@suse.cz> wrote:
> > > > >
> > > > > On Wed, Dec 12, 2018 at 06:05:58PM +0000, fdmanana@kernel.org wrote:
> > > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > > >
> > > > > > Checking if the destination root is read-only was being performed only for
> > > > > > clone operations. Make deduplication check it as well, as it does not make
> > > > > > sense to not do it, even if it is an operation that does not change the
> > > > > > file contents (such as defrag for example, which checks first if the root
> > > > > > is read-only).
> > > > >
> > > > > And this is also change in user-visible behaviour of dedupe, so this
> > > > > needs to be verified if it's not breaking existing tools.
> > > >
> > > > Have you had the chance to do such verification?
> > > >
> > > > This actually conflicts with send. Send does not expect a root/tree to
> > > > change, and with dedupe on read-only roots happening
> > > > in parallel with send is going to cause all sorts of unexpected and
> > > > undesired problems...
> > >
> > > This is a problem bees ran into.  There is a workaround in bees (called
> > > --workaround-btrfs-send) that avoids RO subvols as dedupe targets.
> > > As the name of the option implies, it works around problems in btrfs send.
> > >
> > > This kernel change makes the workaround mandatory now, as the default
> > > case (without workaround) will fail on every RO subvol even if that
> > > behavior is desired by the user.  That breaks an important use case on
> > > the receiving side of sends--to dedupe the received subvols together
> > > while also protecting them against modification on the target system
> > > with the RO flag--and preserving that use case is why the send workaround
> > > was optional (and not default) in bees.
> > >
> > > bees also won't handle the RO/RW/RO transition correctly, as it didn't
> > > seem like a sane thing to support at the time.  That is arguably something
> > > to be fixed in bees.
> > >
> > > > This is a problem introduced by dedupe ioctl when it landed, since
> > > > send existed for a longer time (when nothing else was
> > > > allowed to change read-only roots, including defrag).
> > >
> > > Is there a reason why incremental send can't simply be fixed?
> >
> > This is a problem that affects both incremental and non-incremental (full) send.
> >
> > > As far
> > > as I can tell, send is failing because of a runtime check that seems to
> > > be too strict; however, I haven't tried removing that check to see if
> > > it fixes the problem in send, or just hides the next problem.
> >
> > The problem is send was designed with the idea that read-only roots
> > don't ever change.
>
> ...except when they're snapshotted, balanced, or deduped (to list places
> where that assumption hasn't held so far).

Yes, we had bugs related to that (I fixed some of them). However, the
snapshotting ones [1, 2] happened to be easy to solve because only the
root node is COWed.
The one related to balance [3], which I only remember one and matches
the piece of code you pointed to (which I fixed long ago), actually
only solves
one problem, which was hitting a BUG_ON() (if the extent item was
changed by relocation we know no data changed and we can ignore the
change).
The other problem it doesn't solve is essentially the same problem
that a concurrent dedupe can cause, which I only realized some weeks
ago after noticing
dedup was allowed on RO roots (subvolumes/snapshots).

I'll explain below what problem it is.

>
> > The failures that can happen are many and unpredictable, from
> > occasionally failing with some error, to invalid memory accesses, use
> > after free problems, etc.
> > Essentially all caused by races when the nodes/leafs from the
> > read-only tree change while send is running.
>
> Maybe you can explain this further?  As far as I can tell, all of those
> are send bugs that should just be (and over the years have been) fixed.

So send was always lockless and transactionless, because it's a
potentially long running operation that could delay everything
else if it were not (specially if it held transactions open) and
because it needs to operate on a tree that never changes, so that it
always
sees a consistent state (often it needs to do multiple searches on a
tree for many different reasons). Well this second reason is actually
what
matters and the first reason is more a consequence of the second
reason. But well, the world is not perfect
and it seems that during the design/implementation stage of send,
balance and snapshotting of a snapshot being used by send were
forgotten.
Then after send came, dedupe came and because it allows to dedupe
against files in a RO tree, brought another way of being able to
modify a RO tree that can be in use by a concurrent send - this was
missed during the review process obviously, otherwise it would have
never allowed dedupe on RO trees or send would have been changed as
well to somehow work with a concurrent dedupe.

The big problem, for which I don't have a solution, is that dedupe and
balance COW entire paths, and not just the root node of a tree like
the snapshotting case.
For example while send is doing work with a leaf from a RO snapshot
tree, if that leaf is COWed, its extent is marked as free once
the transaction that COWed it is committed. That means that
immediately after that, the extent can be reused, allocated to some
other tree
leaf or node - while send is still using the respective in-memory
representation of the leaf, the extent buffer. So some other task will
be modifying the
extent buffer while send is using it as well - this is what brings
unexpected results that can result in crashes or, with luck, some
random failures that make
send return some error only. For a leaf we could just clone the extent
buffer while preventing transaction commits, and then allow them to
happen again after
cloning.

Now COWing a leaf also results in COWing its parent node, parent of
the parent, etc, all the way up to the root node (pointers need to be
changed in every parent). So if a node is COWed while send is using
it, and before it finishes using it the same extent gets reused for
another tree/node, send would
follow an incorrect path since now the node has different pointers or
it's in an undefined/inconsistent state due to the concurrent
modification. Creating a copy/clone of
the node while transaction commits are disabled is not an option,
since when using the copy, nodes below it might have been COWed and no
longer exist, so we are
accessing stale pointers.

Getting the extent of a node/leaf immediately reused after COWing it
and after its transaction is committed is not very common and happens
mostly when running close
to ENOSPC, plus the time spent processing a node/leaf by send is very
short. So this is effectively a hard to hit problem.

The bug fixes for the snapshotting of a RO snapshot work and are
simple because snapshotting only COWs the root of a tree, so we stop
transactions commits
temporarily, make a copy of the root for send to use and allow
transaction commits to happen again, while send is happy doing what it
has to using the cloned root
nodes - works because snapshotting doesn't COW any other nodes/leafs
of the tree, i.e., the pointers in the root node don't change.

>
> > I don't know what runtime check you are mentioning that is too strict.
>
> It's this one, in send.c:
>
>                         /*
>                          * We may have found an extent item that has changed
>                          * only its disk_bytenr field and the corresponding
>                          * inode item was not updated. This case happens due to
>                          * very specific timings during relocation when a leaf
>                          * that contains file extent items is COWed while
>                          * relocation is ongoing and its in the stage where it
>                          * updates data pointers. So when this happens we can
>                          * safely ignore it since we know it's the same extent,
>                          * but just at different logical and physical locations
>                          * (when an extent is fully replaced with a new one, we
>                          * know the generation number must have changed too,
>                          * since snapshot creation implies committing the current
>                          * transaction, and the inode item must have been updated
>                          * as well).
>                          * This replacement of the disk_bytenr happens at
>                          * relocation.c:replace_file_extents() through
>                          * relocation.c:btrfs_reloc_cow_block().
>                          */
>                         if (btrfs_file_extent_generation(leaf_l, ei_l) ==
>                             btrfs_file_extent_generation(leaf_r, ei_r) &&
>                             btrfs_file_extent_ram_bytes(leaf_l, ei_l) ==
>                             btrfs_file_extent_ram_bytes(leaf_r, ei_r) &&
>                             btrfs_file_extent_compression(leaf_l, ei_l) ==
>                             btrfs_file_extent_compression(leaf_r, ei_r) &&
>                             btrfs_file_extent_encryption(leaf_l, ei_l) ==
>                             btrfs_file_extent_encryption(leaf_r, ei_r) &&
>                             btrfs_file_extent_other_encoding(leaf_l, ei_l) ==
>                             btrfs_file_extent_other_encoding(leaf_r, ei_r) &&
>                             btrfs_file_extent_type(leaf_l, ei_l) ==
>                             btrfs_file_extent_type(leaf_r, ei_r) &&
>                             btrfs_file_extent_disk_bytenr(leaf_l, ei_l) !=
>                             btrfs_file_extent_disk_bytenr(leaf_r, ei_r) &&
>                             btrfs_file_extent_disk_num_bytes(leaf_l, ei_l) ==
>                             btrfs_file_extent_disk_num_bytes(leaf_r, ei_r) &&
>                             btrfs_file_extent_offset(leaf_l, ei_l) ==
>                             btrfs_file_extent_offset(leaf_r, ei_r) &&
>                             btrfs_file_extent_num_bytes(leaf_l, ei_l) ==
>                             btrfs_file_extent_num_bytes(leaf_r, ei_r))
>                                 return 0;
>                 }
>
>                 inconsistent_snapshot_error(sctx, result, "extent");
>                 return -EIO;
>
> This is the point where bees users report send failures.

That seems yo happen when dedupe is running in parallel with send.
That can be fixed by ignoring file extent items that changed without
the inode item having
changed as well (assuming this can only be caused by balance and
dedupe, and there's no other path for this).
When I added tjat check to fix a BUG_ON hit due to balance changing
file extent items (without updating inode items), I wasn't aware that
an ongoing dedupe could
cause the same problem (back then, when I added that check [3], I
thought that dedupe was like clone and would refuse to work on RO
subvolumes/snapshots).
However I don't see immediately how it's possible, since dedupe
updates file extent items (or inserts new ones) in the same
transaction where it updates the inode
item, so it's really weird how send would find the new/modified file
extent items from the commit root without seeing an update inode item
as well.

Btw, why weren't those reports reported to the btrfs mailing list?

However the bigger problem, mentioned above, I really don't have a
solution for it now, and it's far from being a trivial problem
(perhaps simply making send use read locks
for searches on the commit root would work). It can be caused by
either a concurrent dedupe or a concurrent balance (and hopefully
there's nothing else that can modify RO trees),
so a generic solution needs to be found. Preventing dedupe and send
from running in parallel, as you suggest in another reply, is
something I suggested too earlier in this thread.
Preventing balance and send running in parallel, well... it starts to get messy.

So not doing nothing for now, that is, not applying this patch to
disable dedupe on RO roots and wait for a better solution (best case,
for 5.2 merge window), is reasonable
and not something I'm against.
David, would you consider at least excluding it from 5.1 to allow for
a different solution to pop up for another merge window?

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5e84fd8d0634d056248b67463b42f6c85896a19




>
> I don't really understand what that code is trying to achieve, though.
> If we are diffing two subvol trees then we should just send anything
> that's different--even if we occasionally send a few redundant extents
> because we happen to be sending during a balance, because that's better
> than bombing out completely.
>
> Or is the problem more like we lose our ei pointer in one of the subvols
> when the extent tree changes under it?  That would be harder to solve,
> it would have to keep releasing and reacquiring everything.
>
> > You can definitely do dedupe on a read-only root and after it finishes
> > do a send (either full or incremental), and it will work.
>
> If there's a way to make a subvol RW temporarily _without breaking
> incremental send from that subvol_ (i.e. without clearing the parent
> UUIDs, maybe also allowing only dedupe) then I have no objection.
>
> > > More details at:
> > >
> > >         https://github.com/Zygo/bees/issues/79#issuecomment-429039036
> > >
> > > > I understand it can break some applications, but adding other solution
> > > > such as preventing send and dedupe from running in parallel
> > > > (erroring out or block and wait for each other, etc) is going to be
> > > > really ugly. There's always the workaround for apps to set the
> > > > subvolume
> > > > to RW mode, do the dedupe, then switch it back to RO mode.
> > > >
> > > > Thanks.
David Sterba Feb. 22, 2019, 5:25 p.m. UTC | #13
On Fri, Feb 22, 2019 at 11:13:48AM +0000, Filipe Manana wrote:
> So not doing nothing for now, that is, not applying this patch to
> disable dedupe on RO roots and wait for a better solution (best case,
> for 5.2 merge window), is reasonable
> and not something I'm against.
> David, would you consider at least excluding it from 5.1 to allow for
> a different solution to pop up for another merge window?

Ok, I'll remove it from the 5.1 queue for now. Thanks for the detailed
analysis. This fix favors send (for the correctness reasons), but the
deduplication usecase is also important. I hope we'll be able to come up
with a solution that does not hurt usability too much (or affects send
and dedupe equally with a sane fallback behaviour). The exclusion of
send and other operation (snapshot) already exists so the suggested
-EBUSY approach.
diff mbox series

Patch

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ffe940ceb80a..4e9efc93340e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3907,12 +3907,8 @@  static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	u64 wb_len;
 	int ret;
 
-	if (!(remap_flags & REMAP_FILE_DEDUP)) {
-		struct btrfs_root *root_out = BTRFS_I(inode_out)->root;
-
-		if (btrfs_root_readonly(root_out))
-			return -EROFS;
-	}
+	if (btrfs_root_readonly(BTRFS_I(inode_out)->root))
+		return -EROFS;
 
 	if (file_in->f_path.mnt != file_out->f_path.mnt ||
 	    inode_in->i_sb != inode_out->i_sb)