
[0/4] btrfs: cleanups and preparation for the incoming RAID56J features

Message ID: cover.1652428644.git.wqu@suse.com

Message

Qu Wenruo May 13, 2022, 8:34 a.m. UTC
Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
(J for journal), if we rely on ad-hoc if () else if () branches to
calculate things like the number of P/Q stripes, it will cause a lot of
problems.

In fact, during development I have hit tons of bugs related to this.

One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
to update the check for the RAID5 and RAID6 profiles, we get a bad
nr_data == num_stripes and screw up the later writeback.

90% of my suffering comes from such ad-hoc manual checks on RAID56.

Another example is scrub_stripe(), where, due to the extra per-device
reservation, @dev_extent_len is no longer the same as the data stripe
length calculated from the extent_map.

So this patchset does the following cleanups to prepare for the
upcoming RAID56J (the coding is already finished, and the functionality
and on-disk format are fine, although there is no journal yet):

- Calculate the device stripe length locally inside scrub_stripe()
  This removes one of the nastier and less obvious mismatches.

- Use btrfs_raid_array[] based calculation instead of ad-hoc checks
  The only exception is scrub_nr_raid_mirrors(), which has several
  differences from btrfs_num_copies():

  * No iteration over all RAID6 combinations
    Not sure whether that is planned or not.

  * Use of bioc->num_stripes directly
    In that context, bioc already covers all the mirrors of the same
    stripe, thus there is no need to look it up using btrfs_raid_array[].

With all these cleanups, RAID56J will be much easier to implement.
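
To illustrate the direction, here is a minimal standalone sketch (my own
illustration, not the kernel code; the tiny table below only mimics the
role of btrfs_raid_array[] and its nparity member):

  /* Standalone C illustration, NOT kernel code. */
  #include <stdio.h>

  enum raid_index { IDX_RAID5, IDX_RAID6, IDX_LAST };

  struct raid_attr {
          const char *name;
          int nparity;            /* number of P/Q stripes */
  };

  /* A hypothetical RAID5J/RAID6J would only need a new row here,
   * instead of touching every if () else if () site. */
  static const struct raid_attr raid_array[IDX_LAST] = {
          [IDX_RAID5] = { .name = "raid5", .nparity = 1 },
          [IDX_RAID6] = { .name = "raid6", .nparity = 2 },
  };

  /* One table lookup instead of: if (raid5) ... else if (raid6) ... */
  static int nr_data_stripes(enum raid_index idx, int num_stripes)
  {
          return num_stripes - raid_array[idx].nparity;
  }

  int main(void)
  {
          printf("raid6 with 6 stripes -> %d data stripes\n",
                 nr_data_stripes(IDX_RAID6, 6));
          return 0;
  }

The real code derives the index from the block group flags, but the
effect is the same: a new profile is mostly a new table entry rather
than another branch.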

Qu Wenruo (4):
  btrfs: remove @dev_extent_len argument from scrub_stripe() function
  btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
    calculation
  btrfs: use btrfs_raid_array[] to calculate the number of parity
    stripes
  btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

 fs/btrfs/raid56.c  | 10 ++--------
 fs/btrfs/raid56.h  | 12 ++----------
 fs/btrfs/scrub.c   | 13 +++++++------
 fs/btrfs/volumes.c | 35 +++++++++++++++++++++--------------
 fs/btrfs/volumes.h |  2 ++
 5 files changed, 34 insertions(+), 38 deletions(-)

Comments

David Sterba May 13, 2022, 11:38 a.m. UTC | #1
On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal),

Do you need to introduce completely new profiles? IIRC in my drafts I
was more inclined to reuse a dedicated raid1c3 block group, of an
arbitrary size and not used by anything else. A new profile would
technically work too but it brings other issues.

As a feature related to raid56, I was working on the striped
raid10c34 profiles but was not able to make them work. In a sense this is
easier as it reuses existing code, but if you make the journal work we
won't need that.

> if we rely on ad-hoc if () else if () branches to
> calculate things like the number of P/Q stripes, it will cause a lot of
> problems.
> 
> In fact, during development I have hit tons of bugs related to this.
> 
> One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
> to update the check for the RAID5 and RAID6 profiles, we get a bad
> nr_data == num_stripes and screw up the later writeback.
> 
> 90% of my suffering comes from such ad-hoc manual checks on RAID56.
> 
> Another example is scrub_stripe(), where, due to the extra per-device
> reservation, @dev_extent_len is no longer the same as the data stripe
> length calculated from the extent_map.
> 
> So this patchset does the following cleanups to prepare for the
> upcoming RAID56J (the coding is already finished, and the functionality
> and on-disk format are fine, although there is no journal yet):
> 
> - Calculate the device stripe length locally inside scrub_stripe()
>   This removes one of the nastier and less obvious mismatches.
> 
> - Use btrfs_raid_array[] based calculation instead of ad-hoc checks
>   The only exception is scrub_nr_raid_mirrors(), which has several
>   differences from btrfs_num_copies():
> 
>   * No iteration over all RAID6 combinations
>     Not sure whether that is planned or not.
> 
>   * Use of bioc->num_stripes directly
>     In that context, bioc already covers all the mirrors of the same
>     stripe, thus there is no need to look it up using btrfs_raid_array[].
> 
> With all these cleanups, RAID56J will be much easier to implement.
> 
> Qu Wenruo (4):
>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>     calculation
>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>     stripes
>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

The preparatory patches look good to me.
Qu Wenruo May 13, 2022, 12:21 p.m. UTC | #2
On 2022/5/13 19:38, David Sterba wrote:
> On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Do you need to introduce completely new profiles? IIRC in my drafts I
> was more inclined to reuse a dedicated raid1c3 block group, of an
> arbitrary size and not used by anything else. A new profile would
> technically work too but it brings other issues.

The point here is that I want to reserve some space for each data stripe
of a RAID56 chunk.

Unfortunately, we need some on-disk format change anyway, to reuse
btrfs_chunk::io_align to represent how many bytes are reserved in each
dev extent before the data really starts.

So new RAID56 types are needed anyway.

Sure, this will make the dev_extent length mismatch the extent length
calculated purely from the chunk map.

But all the involved changes are not that huge, and still reviewable.
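
To make that concrete, a minimal standalone sketch (my own illustration,
not the actual patch; only the reuse of io_align as a per-dev-extent
reservation comes from the description above, the names and numbers are
made up):

  /* Standalone C sketch, NOT kernel code. */
  #include <stdint.h>
  #include <stdio.h>

  struct chunk_sketch {
          uint64_t dev_extent_len; /* length of each device extent */
          uint64_t io_align;       /* reused: bytes reserved (journal)
                                      before the data really starts */
  };

  int main(void)
  {
          struct chunk_sketch c = {
                  .dev_extent_len = 1ULL << 30, /* 1 GiB dev extent */
                  .io_align = 64ULL << 20,      /* 64 MiB reserved */
          };

          /* This is exactly why dev_extent_len no longer matches the
           * stripe length calculated purely from the chunk map. */
          uint64_t data_len = c.dev_extent_len - c.io_align;

          printf("reserved %llu MiB, data stripe %llu MiB\n",
                 (unsigned long long)(c.io_align >> 20),
                 (unsigned long long)(data_len >> 20));
          return 0;
  }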

In my latest version, which is already running under fstests (with the
new RAID56J profiles added to the related test cases) but without real
journal code yet, the patch looks like this:

  fs/btrfs/block-group.c          |  23 ++++++-
  fs/btrfs/ctree.h                |   7 +-
  fs/btrfs/scrub.c                |  15 ++--
  fs/btrfs/tree-checker.c         |   4 ++
  fs/btrfs/volumes.c              | 118 +++++++++++++++++++++++++++++---
  fs/btrfs/volumes.h              |   7 +-
  fs/btrfs/zoned.c                |   2 +
  include/uapi/linux/btrfs.h      |   1 +
  include/uapi/linux/btrfs_tree.h |  30 +++++++-
  9 files changed, 185 insertions(+), 22 deletions(-)

Most changes are in the dev extent handling, and some call sites cannot
use btrfs_raid_array[] directly.

I guess you'll see that kernel patch, along with the needed progs patch,
next week.

Thanks,
Qu
>
> As a feature related to raid56, I was working on the striped
> raid10c34 profiles but was not able to make them work. In a sense this is
> easier as it reuses existing code, but if you make the journal work we
> won't need that.
>
>> if we rely on ad-hoc if () else if () branches to
>> calculate things like the number of P/Q stripes, it will cause a lot of
>> problems.
>>
>> In fact, during development I have hit tons of bugs related to this.
>>
>> One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
>> to update the check for the RAID5 and RAID6 profiles, we get a bad
>> nr_data == num_stripes and screw up the later writeback.
>>
>> 90% of my suffering comes from such ad-hoc manual checks on RAID56.
>>
>> Another example is scrub_stripe(), where, due to the extra per-device
>> reservation, @dev_extent_len is no longer the same as the data stripe
>> length calculated from the extent_map.
>>
>> So this patchset does the following cleanups to prepare for the
>> upcoming RAID56J (the coding is already finished, and the functionality
>> and on-disk format are fine, although there is no journal yet):
>>
>> - Calculate the device stripe length locally inside scrub_stripe()
>>    This removes one of the nastier and less obvious mismatches.
>>
>> - Use btrfs_raid_array[] based calculation instead of ad-hoc checks
>>    The only exception is scrub_nr_raid_mirrors(), which has several
>>    differences from btrfs_num_copies():
>>
>>    * No iteration over all RAID6 combinations
>>      Not sure whether that is planned or not.
>>
>>    * Use of bioc->num_stripes directly
>>      In that context, bioc already covers all the mirrors of the same
>>      stripe, thus there is no need to look it up using btrfs_raid_array[].
>>
>> With all these cleanups, RAID56J will be much easier to implement.
>>
>> Qu Wenruo (4):
>>    btrfs: remove @dev_extent_len argument from scrub_stripe() function
>>    btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>>      calculation
>>    btrfs: use btrfs_raid_array[] to calculate the number of parity
>>      stripes
>>    btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
>
> The preparatory patches look good to me.
Forza May 13, 2022, 3:14 p.m. UTC | #3
Hi, 

---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----

> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal), 

Great to see work being done on the RAID56 parts of Btrfs. :) 

I am just a user of btrfs and don't have a full understanding of the internals, but it makes me a little curious that we chose to use journals and RMW instead of a CoW solution to solve the write hole.


Since we need on-disk changes to implement it, would it not be better to rethink the raid56 modes and implement a solution with full CoW, such as variable stripe extents etc.? It is likely much more work, but it could also have better performance because it avoids double writes and RMW cycles.

Thanks 
 
Forza
Qu Wenruo May 13, 2022, 10:58 p.m. UTC | #4
On 2022/5/13 23:14, Forza wrote:
> Hi,
>
> ---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----
>
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Great to see work being done on the RAID56 parts of Btrfs. :)
>
> I am just a user of btrfs and don't have a full understanding of the internals, but it makes me a little curious that we chose to use journals and RMW instead of a CoW solution to solve the write hole.

In fact, Johannes from WDC is already working on a purely CoW-based
solution, called the stripe tree.

The idea there is to introduce a new layer of mapping.

With that stripe tree, inside one chunk, the logical bytenr is no longer
directly mapped to a physical location, but can be dynamically mapped to
any physical location inside the chunk range.

So previously, if we have a RAID56 chunk on 3 disks that looks like this:

  Logical bytenr   X               X + 64K         X + 128K
                   |               |               |

then we have the following on-disk mapping:

  [X, X + 64K)           Devid 1    Physical Y1
  [X + 64K, X + 128K)    Devid 2    Physical Y2
  Parity for the above   Devid 3    Physical Y3

So if we just write 64K at logical bytenr X, we need to read out the data
at logical bytenr [X + 64K, X + 128K), then calculate the parity and
write it to devid 3 at physical offset Y3.

But with the new stripe tree, we can map [X, X + 64K) to any location
on devid 1, and the same goes for [X + 64K, X + 128K) and the parity.

Then, when we write data into logical bytenr [X, X + 64K), we just
find a free 64K range in the stripe tree of devid 1, and check whether
[X + 64K, X + 128K) is already mapped in the stripe tree.

a) Mapped

If we have [X + 64K, X + 128K) mapped, then we read that range out,
update our parity stripe, write the parity stripe to some new
location (CoW), and then free up the old one.

b) Not mapped

This means we don't have any data written into that range, thus it is
all zeros. We calculate the parity against all zeros, then find a new
location for the parity on devid 3, write the newly calculated parity
and insert a mapping for the new parity location.
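
As a small aside on case b): RAID5 parity is a byte-wise XOR of the data
stripes, so an unmapped stripe can simply be treated as zeros and the
parity of a single written stripe equals that data. A toy standalone
illustration (not btrfs code):

  /* Toy XOR parity: an all-zero (unmapped) stripe contributes nothing. */
  #include <stdio.h>
  #include <string.h>

  #define STRIPE 8

  static void xor_into(unsigned char *p, const unsigned char *d, int len)
  {
          for (int i = 0; i < len; i++)
                  p[i] ^= d[i];
  }

  int main(void)
  {
          unsigned char d1[STRIPE] = "DATA1  ";   /* the stripe we write */
          unsigned char d2[STRIPE] = { 0 };       /* unmapped == zeros */
          unsigned char parity[STRIPE] = { 0 };

          xor_into(parity, d1, STRIPE);
          xor_into(parity, d2, STRIPE);

          /* With d2 all zero, the parity equals d1 byte for byte. */
          printf("parity equals d1: %s\n",
                 memcmp(parity, d1, STRIPE) == 0 ? "yes" : "no");
          return 0;
  }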


With this, we in fact decouple the 1:1 mapping for RAID56 and get way
more flexibility, although this idea no longer follows the strict
rotation of RAID5, making it a middle ground between RAID4 and RAID5.


This brilliant idea was introduced mostly to support different chunk
profiles on zoned devices, but Johannes is working on enabling it for
non-zoned devices too.



Then you may ask why I'm still pushing this much more traditional RAID56J
solution; the reasons are:

- Complexity
   The stripe tree is flexible, thus more complex.
   And AFAIK it will affect all chunk types, not only RAID56.
   Thus it can be more challenging.

- It currently relies on the zone unit to split extents/stripes
   Thus I believe Johannes can solve that without any problems.

- I just want a valid way to steal code from dm/md guys :)

Don't get me wrong, I totally believe the stripe tree can be the silver
bullet, but that doesn't prevent us from exploring some different (and
more traditional) ways.

>
>
> Since we need on-disk changes to implement it, would it not be better to rethink the raid56 modes and implement a solution with full CoW, such as variable stripe extents etc.? It is likely much more work, but it could also have better performance because it avoids double writes and RMW cycles.

Well, the journal will have optimizations, e.g. a full stripe write
doesn't need to journal its data.

I'll learn from (steal code from) dm/md to implement it.


But there are problems inherent to RAID56, affecting dm/md raid56 too.

Like bitrot in one data stripe while we're writing data into the other
data stripe: the RMW will read out the bad data stripe, calculate the
parity from it, and make the bitrot permanent.

Such a destructive RMW will not be detected by traditional raid56 on a
traditional fs, but it can be detected by btrfs.
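
To spell that out with a toy model (my own illustration, single bytes
instead of real stripes, not btrfs or dm/md code):

  /* Destructive RMW in miniature: parity gets recomputed from a
   * bitrotted D2, so D2 can no longer be rebuilt from D1 and P. */
  #include <stdio.h>

  int main(void)
  {
          unsigned char d1 = 0xAA, d2 = 0x55;
          unsigned char p = d1 ^ d2;              /* good parity */

          unsigned char d2_rot = d2 ^ 0x01;       /* bitrot hits D2 on disk */

          /* RMW update of D1: read the (rotten) D2, recompute parity. */
          unsigned char d1_new = 0xBB;
          p = d1_new ^ d2_rot;

          /* Rebuilding D2 now returns the rotten value; csums let btrfs
           * detect this, a traditional raid56 stack cannot. */
          unsigned char d2_rebuilt = d1_new ^ p;
          printf("rebuilt D2 %s the original\n",
                 d2_rebuilt == d2 ? "matches" : "no longer matches");
          return 0;
  }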

Thus after the RAID56J project, I'll spend more time on that destructive
RMW problem.

Thanks,
Qu

>
> Thanks
>
> Forza
>
David Sterba June 15, 2022, 11:45 a.m. UTC | #5
On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal), if we rely on ad-hoc if () else if () branches to
> calculate things like the number of P/Q stripes, it will cause a lot of
> problems.
> 
> In fact, during development I have hit tons of bugs related to this.
> 
> One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
> to update the check for the RAID5 and RAID6 profiles, we get a bad
> nr_data == num_stripes and screw up the later writeback.
> 
> 90% of my suffering comes from such ad-hoc manual checks on RAID56.
> 
> Another example is scrub_stripe(), where, due to the extra per-device
> reservation, @dev_extent_len is no longer the same as the data stripe
> length calculated from the extent_map.
> 
> So this patchset does the following cleanups to prepare for the
> upcoming RAID56J (the coding is already finished, and the functionality
> and on-disk format are fine, although there is no journal yet):
> 
> - Calculate the device stripe length locally inside scrub_stripe()
>   This removes one of the nastier and less obvious mismatches.
> 
> - Use btrfs_raid_array[] based calculation instead of ad-hoc checks
>   The only exception is scrub_nr_raid_mirrors(), which has several
>   differences from btrfs_num_copies():
> 
>   * No iteration over all RAID6 combinations
>     Not sure whether that is planned or not.
> 
>   * Use of bioc->num_stripes directly
>     In that context, bioc already covers all the mirrors of the same
>     stripe, thus there is no need to look it up using btrfs_raid_array[].
> 
> With all these cleanups, RAID56J will be much easier to implement.
> 
> Qu Wenruo (4):
>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>     calculation
>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>     stripes
>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

Added to misc-next, thanks.