[v2,0/6] RAID1 with 3- and 4- copies

Message ID cover.1559917235.git.dsterba@suse.com

David Sterba June 10, 2019, 12:29 p.m. UTC
Hi,

this patchset brings RAID1 with 3 and 4 copies as a separate feature,
as outlined in V1
(https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).

This should help a bit with the raid56 situation, where the write hole
hurts most for metadata and there is no block group profile that offers
resistance to the loss of 2 devices.

I've gathered some feedback from knowledgeable people on IRC and the
following setup is considered good enough (certainly better than what we
have now):

- data: RAID6
- metadata: RAID1C3

RAID1C3 and RAID6 have different characteristics in terms of space
consumption and repair.


Space consumption
~~~~~~~~~~~~~~~~~

* RAID6 multiplies the metadata footprint only by N/(N-2), so with more
  devices the parity overhead ratio is small

* RAID1C3 will always consume 67% of the metadata chunks for redundancy

The overall size of metadata is typically in the range of gigabytes to
hundreds of gigabytes (depending on the use case); a rough estimate is
1%-10% of the filesystem size. With larger filesystems the percentage is
usually smaller.

So, for the 3-copy raid1 the cost of redundancy is better expressed as
the absolute number of gigabytes "wasted" on redundancy than as the
ratio, which does look scary compared to raid6.
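The trade-off can be made concrete with a small calculation (plain
shell/awk for illustration, not part of the patchset): raid6 spends 2 of
N stripes on parity, while raid1c3 always spends 2 of its 3 copies on
redundancy.

```shell
# Print the redundancy overhead of raid6 vs raid1c3 metadata for a few
# device counts: raid6 overhead is 2/N, raid1c3 is a constant 2/3 (~67%).
for n in 4 6 10 20; do
    raid6=$(awk -v n="$n" 'BEGIN { printf "%.0f", 2 / n * 100 }')
    echo "N=$n devices: raid6 overhead ${raid6}%, raid1c3 overhead 67%"
done
```

E.g. with 10 devices and 50 GiB of raw metadata chunks, raid6 loses
about 10 GiB to parity while raid1c3 "wastes" about 33 GiB, which is why
the absolute gigabyte figure is the more useful number here.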


Repair
~~~~~~

RAID6 needs to access all available devices to recalculate data from the
P and Q stripes, whether 1 or 2 devices are missing.

RAID1C3 can utilize the independence of each copy and also the way
RAID1 works in btrfs. In the scenario with 1 missing device, one of the
2 remaining correct copies is read and written to the repaired device.

Given how the 2-copy RAID1 works on btrfs, the block groups could be
spread over several devices, so the load during repair would be spread
as well.

Additionally, device replace works sequentially and in big chunks, so on
a lightly used system the read pattern is seek-friendly.


Compatibility
~~~~~~~~~~~~~

The new block group types cost an incompatibility bit, so old kernels
will refuse to mount a filesystem with the RAID1C3 feature, i.e. one
with any chunk of the new type.

To upgrade existing filesystems, use the balance filters, e.g. from RAID6:

  $ btrfs balance start -mconvert=raid1c3 /path
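For illustration, the conversion and a check of the result might look
like this (a sketch; the mount point is a placeholder and
`btrfs filesystem df` is only used to confirm the new profile):

```shell
# Convert the metadata chunks of an existing raid6 filesystem to
# raid1c3; data chunks keep their raid6 profile.
btrfs balance start -mconvert=raid1c3 /mnt/data

# The metadata lines of the report should now show raid1c3.
btrfs filesystem df /mnt/data
```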


Merge target
~~~~~~~~~~~~

I'd like to push that to misc-next for wider testing and merge it to
5.3, unless something bad pops up. Given that the code changes are small
and just add new types with their constraints, with the rest done by the
generic code, I'm not expecting problems that can't be fixed before the
full release.


Testing so far
~~~~~~~~~~~~~~

* mkfs with the profiles
* fstests (no specific tests, only check that it does not break)
* profile conversions between single/raid1/raid5/raid1c3/raid6/raid1c4
  with added devices where needed
* scrub

TODO:

* 1 missing device followed by repair
* 2 missing devices followed by repair


David Sterba (6):
  btrfs: add mask for all RAID1 types
  btrfs: use mask for RAID56 profiles
  btrfs: document BTRFS_MAX_MIRRORS
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompat for raid1 with 3, 4 copies

 fs/btrfs/ctree.h                | 14 ++++++++--
 fs/btrfs/extent-tree.c          | 19 +++++++------
 fs/btrfs/scrub.c                |  2 +-
 fs/btrfs/super.c                |  6 +++++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/volumes.c              | 48 ++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h              |  4 +++
 include/uapi/linux/btrfs.h      |  5 +++-
 include/uapi/linux/btrfs_tree.h | 10 +++++++
 9 files changed, 90 insertions(+), 20 deletions(-)

Comments

Hugo Mills June 10, 2019, 12:42 p.m. UTC | #1
Hi, David,

On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> this patchset brings the RAID1 with 3 and 4 copies as a separate
> feature as outlined in V1
> (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
[...]
> Compatibility
> ~~~~~~~~~~~~~
> 
> The new block group types cost an incompatibility bit, so old kernel
> will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> the filesystem with the new type.
> 
> To upgrade existing filesystems use the balance filters eg. from RAID6
> 
>   $ btrfs balance start -mconvert=raid1c3 /path
[...]

   If I do:

$ btrfs balance start -mprofiles=raid1c3,convert=raid1 \
                      -dprofiles=raid1c3,convert=raid6 /path

will that clear the incompatibility bit?

(I'm not sure if profiles= and convert= work together, but let's
assume that they do for the purposes of this question).

   Hugo.
David Sterba June 10, 2019, 2:02 p.m. UTC | #2
On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
>    Hi, David,
> 
> [...]
> 
>    If I do:
> 
> $ btrfs balance start -mprofiles=raid13c,convert=raid1 \
>                       -dprofiles=raid13c,convert=raid6 /path
> 
> will that clear the incompatibility bit?

No, the bit will stay, even though there are no chunks of the raid1c3
type. Same for raid5/6.

Dropping the bit would need an extra pass through all chunks after
balance, which is feasible, and I don't see usability surprises. The
fact that you ask means that the current behaviour is probably the
opposite of what users expect.

> (I'm not sure if profiles= and convert= work together, but let's
> assume that they do for the purposes of this question).

Yes, they work together.
Hugo Mills June 10, 2019, 2:48 p.m. UTC | #3
On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> > [...]
> > 
> >    If I do:
> > 
> > $ btrfs balance start -mprofiles=raid1c3,convert=raid1 \
> >                       -dprofiles=raid1c3,convert=raid6 /path
> > 
> > will that clear the incompatibility bit?
> 
> No the bit will stay, even though there are no chunks of the raid1c3
> type. Same for raid5/6.
> 
> Dropping the bit would need an extra pass trough all chunks after
> balance, which is feasible and I don't see usability surprises. That you
> ask means that the current behaviour is probably opposite to what users
> expect.

   We've had a couple of cases in the past where people have tried out
a new feature on a new kernel, then turned it off again and not been
able to go back to an earlier kernel. Particularly in this case, I can
see people being surprised at the trapdoor. "I don't have any RAID1C3
on this filesystem: why can't I go back to 5.2?"

   Hugo.
David Sterba June 11, 2019, 9:53 a.m. UTC | #4
On Mon, Jun 10, 2019 at 02:48:06PM +0000, Hugo Mills wrote:
> On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> > On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> > > [...]
> > > 
> > >    If I do:
> > > 
> > > $ btrfs balance start -mprofiles=raid1c3,convert=raid1 \
> > >                       -dprofiles=raid1c3,convert=raid6 /path
> > > 
> > > will that clear the incompatibility bit?
> > 
> > No the bit will stay, even though there are no chunks of the raid1c3
> > type. Same for raid5/6.
> > 
> > Dropping the bit would need an extra pass through all chunks after
> > balance, which is feasible and I don't see usability surprises. That you
> > ask means that the current behaviour is probably opposite to what users
> > expect.
> 
>    We've had a couple of cases in the past where people have tried out
> a new feature on a new kernel, then turned it off again and not been
> able to go back to an earlier kernel. Particularly in this case, I can
> see people being surprised at the trapdoor. "I don't have any RAID1C3
> on this filesystem: why can't I go back to 5.2?"

Undoing the incompat bit is expensive in some cases, e.g. for ZSTD it
would mean scanning all extents, but in the case of the raid profiles
it's easy to check the list of per-profile space infos.

So, my current idea is to use the sysfs interface. The /features
directory lists the files representing the features, and writing 1 to
such a file followed by a sync would trigger the rescan and eventually
drop the bit.

The meaning of the /sys/fs/btrfs/features/* values is defined for 1,
which means 'can be set at runtime', so the ability to unset the feature
would be e.g. 3, as a bitmask of possible actions (0b01 set, 0b10 unset).

We do have infrastructure for changing the state in a safe manner even
from sysfs: it sets a bit somewhere and the commit processes it. That's
why the sync is required, but I don't think that's harming usability too
much.
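As a sketch of how the proposed interface might be used (hypothetical:
the feature file name raid1c34 is an assumption, the message does not
say whether the write would go to the global or the per-filesystem
features directory, and the idea is retracted later in this thread):

```shell
# Hypothetical usage of the proposed interface: ask the kernel to
# re-check whether the feature bit is still needed, then let a commit
# process the change.
echo 1 > /sys/fs/btrfs/features/raid1c34
sync

# The file's value would be a bitmask of allowed actions:
# 1 = can be set at runtime, 2 = can be unset, 3 = both.
cat /sys/fs/btrfs/features/raid1c34
```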
David Sterba June 11, 2019, 12:03 p.m. UTC | #5
On Tue, Jun 11, 2019 at 11:53:15AM +0200, David Sterba wrote:
> On Mon, Jun 10, 2019 at 02:48:06PM +0000, Hugo Mills wrote:
> > On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> > > On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> > > > [...]
> > > > 
> > > >    If I do:
> > > > 
> > > > $ btrfs balance start -mprofiles=raid1c3,convert=raid1 \
> > > >                       -dprofiles=raid1c3,convert=raid6 /path
> > > > 
> > > > will that clear the incompatibility bit?
> > > 
> > > No the bit will stay, even though there are no chunks of the raid1c3
> > > type. Same for raid5/6.
> > > 
> > > Dropping the bit would need an extra pass through all chunks after
> > > balance, which is feasible and I don't see usability surprises. That you
> > > ask means that the current behaviour is probably opposite to what users
> > > expect.
> > 
> >    We've had a couple of cases in the past where people have tried out
> > a new feature on a new kernel, then turned it off again and not been
> > able to go back to an earlier kernel. Particularly in this case, I can
> > see people being surprised at the trapdoor. "I don't have any RAID1C3
> > on this filesystem: why can't I go back to 5.2?"
> 
> Undoing the incompat bit is expensive in some cases, eg. for ZSTD this
> would mean to scan all extents, but in case of the raid profiles it's
> easy to check the list of space infos that are per-profile.
> 
> So, my current idea is to use the sysfs interface. The /features
> directory lists the files representing features and writing 1 to the
> file followed by a sync would trigger the rescan and drop the bit
> eventually.
> 
> The meaning of the /sys/fs/btrfs/features/* is defined for 1, which
> means 'can be set at runtime', so the ability to unset the feature would
> be eg. 3, as a bitmask of possible actions (0b01 set, 0b10 unset).
> 
> We do have infrastructure for changing the state in a safe manner even
> from sysfs, which sets a bit somewhere and commit processes that. That's
> why the sync is required, but I don't think that's harming usability too
> much.

Scratch that, there's a much simpler way that would work as expected in
the example: after removing the last block group of the given type, the
incompat bit will be dropped.
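With that behaviour, Hugo's example from earlier in the thread would
then drop the bit as expected (a sketch, with the same placeholder
path):

```shell
# Convert everything away from the new profiles; once the last raid1c3
# block group is removed, the incompat bit would be dropped and the
# filesystem would mount on pre-5.3 kernels again.
btrfs balance start -mprofiles=raid1c3,convert=raid1 \
                    -dprofiles=raid1c3,convert=raid6 /path
```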
David Sterba June 25, 2019, 5:47 p.m. UTC | #6
On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> [...]
> 
> Testing so far
> ~~~~~~~~~~~~~~
> 
> * mkfs with the profiles
> * fstests (no specific tests, only check that it does not break)
> * profile conversions between single/raid1/raid5/raid1c3/raid6/raid1c4
>   with added devices where needed
> * scrub
> 
> TODO:
> 
> * 1 missing device followed by repair
> * 2 missing devices followed by repair

Unfortunately, neither of the two cases works as expected and I don't have
time to fix it for the 5.3 deadline. As the 3-copy raid1 is supposed to be
a replacement for raid6, I consider the lack of repair capability a show
stopper, so the main part of the patchset is postponed.

The test I did was something like this:

- create fs with 3 devices, raid1c3
- fill with some data
- unmount
- wipe 2nd device
- mount degraded
- replace missing
- remount read-write, write data to verify that it works
- unmount
- mount as usual   <-- here it fails and the device is still reported missing

The same happens for the 2 missing devices.
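The sequence above as a script (a sketch; the device names, the mount
point and the devid of the wiped device are placeholders):

```shell
#!/bin/sh -e
# Reproduce the repair failure: 3-device raid1c3 filesystem, wipe one
# device, replace it, then attempt a normal mount.
mkfs.btrfs -f -m raid1c3 -d raid1c3 /dev/vdb /dev/vdc /dev/vdd
mount /dev/vdb /mnt
cp -a /usr/share/doc /mnt            # fill with some data
umount /mnt

wipefs -a /dev/vdc                   # simulate the lost device

mount -o degraded /dev/vdb /mnt
btrfs replace start -B 2 /dev/vde /mnt   # 2 = devid of the wiped device
mount -o remount,rw /mnt
touch /mnt/still-writable            # writes work at this point
umount /mnt

mount /dev/vdb /mnt                  # fails, device still reported missing
```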