[00/12] btrfs: introduce write-intent bitmaps for RAID56

Message ID: cover.1657171615.git.wqu@suse.com

Message

Qu Wenruo July 7, 2022, 5:32 a.m. UTC
[CHANGELOG]
RFC->v1:
- Fix a corner case in write_intent_set_bits()
  If the range covers the last existing entry but still needs a new
  entry, the old code would not insert the new entry, which later
  triggered a warning in write_intent_clear_bits().
  (A standalone sketch of this case follows the changelog.)

- Add selftests for the write-intent bitmaps
  The write-intent bitmap is a sparse array of bitmaps.
  Some corner cases are tricky to get right on the first try (see the
  case above).
  The test cases should prevent such problems from happening again.

- Fix hang with dev-replace, and better bitmap bio submission
  Previously we held device_list_mutex while submitting the bitmap
  bios, which can deadlock with dev-replace/dev-removal.
  Fix it by using RCU to grab a local copy of the device list and
  submitting the bitmap bios against that copy.

  Furthermore, there is no need to follow the superblock writeback
  scheme: since the bitmap content is always the same on all devices,
  we can just submit the same page to each device and use an atomic
  counter to wait for the writes to finish.

  Now there is no more crash/warning/deadlock in btrfs/070.
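
For reference, below is a minimal user-space model of the sparse bitmap
that reproduces the corner case mentioned above. All names, limits and
helpers are illustrative and much simpler than the kernel code; it only
shows why a range covering the last existing entry may still need a
fresh entry inserted.

/*
 * Minimal user-space model of the write-intent sparse bitmap, only to
 * illustrate the corner case above.  Names, limits and logic are
 * illustrative; this is not the kernel implementation.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCKSIZE	(64ULL * 1024)		/* BTRFS_STRIPE_LEN */
#define ENTRY_RANGE	(64 * BLOCKSIZE)	/* 64 bits per entry = 4MiB */
#define MAX_ENTRIES	224

struct wi_entry {
	uint64_t bytenr;	/* start of the 4MiB range, ENTRY_RANGE aligned */
	uint64_t bitmap;	/* one bit per 64KiB block */
};

static struct wi_entry entries[MAX_ENTRIES];
static int nr_entries;

/* Find the entry starting at @bytenr, or the slot where it must be inserted. */
static int find_slot(uint64_t bytenr, int *found)
{
	int i;

	for (i = 0; i < nr_entries; i++) {
		if (entries[i].bytenr == bytenr) {
			*found = 1;
			return i;
		}
		if (entries[i].bytenr > bytenr)
			break;
	}
	*found = 0;
	return i;
}

static void set_bits(uint64_t start, uint64_t len)
{
	uint64_t cur;

	for (cur = start & ~(ENTRY_RANGE - 1); cur < start + len;
	     cur += ENTRY_RANGE) {
		uint64_t first = (cur > start) ? cur : start;
		uint64_t last = (cur + ENTRY_RANGE < start + len) ?
				cur + ENTRY_RANGE : start + len;
		int lo = (int)((first - cur) / BLOCKSIZE);
		int hi = (int)((last - cur + BLOCKSIZE - 1) / BLOCKSIZE);
		int found, slot, bit;

		slot = find_slot(cur, &found);
		if (!found) {
			/*
			 * The RFC bug: when the range covered the last
			 * existing entry but still needed a fresh entry
			 * here, this insertion was skipped.
			 */
			assert(nr_entries < MAX_ENTRIES);
			memmove(&entries[slot + 1], &entries[slot],
				(nr_entries - slot) * sizeof(entries[0]));
			entries[slot].bytenr = cur;
			entries[slot].bitmap = 0;
			nr_entries++;
		}
		for (bit = lo; bit < hi; bit++)
			entries[slot].bitmap |= 1ULL << bit;
	}
}

int main(void)
{
	/* Create the (last) existing entry, covering [4MiB, 8MiB). */
	set_bits(4ULL << 20, 64 * 1024);
	/* This range covers that last entry and spills into [8MiB, 12MiB). */
	set_bits((8ULL << 20) - 64 * 1024, 128 * 1024);
	printf("entries in use: %d\n", nr_entries);	/* expect 2 */
	return 0;
}

Running it should report two entries in use; the old RFC code
effectively skipped the second insertion in this situation.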

[BACKGROUND]
Unlike md-raid, btrfs RAID56 has no mechanism to re-sync its devices
after a power loss.

For pure mirror-based profiles this is fine, as btrfs can use its csums
to find the correct mirror and repair the bad copies.

But for RAID56, the repair itself needs data from the other devices,
thus any out-of-sync data degrades the fault tolerance.

Even worse, an incorrect RMW can use the stale data to generate P/Q,
removing any possibility of recovering the data.


md-raid handles this with a write-intent bitmap for faster resilvering,
and with a journal (partial parity log for RAID5) to survive even a
power loss combined with a device loss.

[OBJECTIVE]

This patchset introduces a btrfs-specific write-intent bitmap.

The bitmap is located at physical offset 1MiB of each device, and its
content is identical across all devices.

For every RAID56 write (currently all RAID56 writes, including full
stripe writes), the write-intent bitmap is updated and flushed to all
writeable devices before the real bios are submitted to the disks.

So even if a power loss happens, at the next mount we know which full
stripes need to be checked, and can start a scrub on the involved
logical bytenr ranges.
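
In rough pseudo-code, the intended ordering for a single RAID56 write
looks like the sketch below. btrfs_write_intent_mark_dirty() and
btrfs_write_intent_writeback() are the helpers named in the TODO
section; their signatures here are assumed, and submit_stripe_bios() is
only a stand-in for the real submission path, so this is a sketch, not
the actual raid56.c code.

/* Pseudo-code sketch only, helper signatures are assumed. */
static void raid56_write_one_stripe(struct btrfs_fs_info *fs_info,
				    u64 full_stripe_start, u32 len)
{
	/* 1. Set the bits covering this full stripe in the in-memory bitmap. */
	btrfs_write_intent_mark_dirty(fs_info, full_stripe_start, len);

	/* 2. Flush the 4KiB bitmap to every writeable device and wait. */
	btrfs_write_intent_writeback(fs_info);

	/* 3. Only now submit the real data/parity bios. */
	submit_stripe_bios(fs_info, full_stripe_start, len);

	/*
	 * 4. Once the full stripe write finishes, the bits are cleared
	 *    again, so a clean shutdown leaves an empty bitmap.
	 */
}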

[NO RECOVERY CODE YET]

Unfortunately, this patchset only implements the write-intent bitmap
code; the recovery part is still a placeholder, as we need a scrub
refactor so that scrub can work on a single logical bytenr range.

[ADVANTAGE OF BTRFS SPECIFIC WRITE-INTENT BITMAPS]

Since btrfs has csums for its metadata and CoWed data, unlike dm-bitmap,
which can only be used for faster resilvering, we can fully rebuild the
full stripe, as long as:

1) There is no missing device
   For the missing-device case, we still need a full journal.

2) Untouched data stays untouched
   This should mostly hold on sane hardware.

And since the btrfs-specific write-intent bitmap is pretty small (4KiB
in total), the overhead is much lower than that of a full journal.

In the future, we may allow users to choose between just the bitmap and
a full journal to match their requirements.

[BITMAPS DESIGN]

The bitmaps on-disk format looks like this:

 [ super ][ entry 1 ][ entry 2 ] ... [entry N]
 |<---------  super::size (4K) ------------->|

The super block records how many entries are in use.

Each entry is 128 bits (16 bytes) in size: one u64 for the bytenr and
one u64 for the bitmap.

All utilized entries are sorted by bytenr, and no bits may overlap.

The blocksize is currently fixed to BTRFS_STRIPE_LEN (64KiB), so each
entry can cover at most 4MiB, and the whole bitmap can hold 224 entries.

In the worst case it can track 14MiB of dirty ranges
(one bit set per entry, which also means 2-disk RAID5 or 3-disk RAID6).

In the best case it can track 896MiB of dirty ranges
(all bits set in every entry).
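
Expressed as C structures, the on-disk layout could look like the
sketch below. Only the sizes and counts come from the description
above; the field names are illustrative and the real definitions live
in the new write-intent.h.

/* Illustrative sketch of the on-disk layout; field names are made up here. */
#include <stdint.h>

/* 16 bytes per entry: one u64 bytenr plus one u64 bitmap. */
struct write_intent_entry {
	uint64_t bytenr;	/* logical start, BTRFS_STRIPE_LEN aligned */
	uint64_t bitmap;	/* 64 bits, one per 64KiB block => 4MiB/entry */
};

/*
 * Super and entries share one 4KiB block at physical offset 1MiB of
 * every device.  With 224 entries of 16 bytes each, the super occupies
 * the remaining 4096 - 224 * 16 = 512 bytes.
 */
struct write_intent_super {
	/* ... magic, csum, flags and other fields elided ... */
	uint64_t size;		/* total size of super + entries, 4KiB for now */
	uint64_t nr_entries;	/* number of entries currently in use */
	/* padding up to 512 bytes */
};

/*
 * Capacity, as stated above:
 *   worst case: 224 entries *  1 bit  * 64KiB =  14MiB of dirty range
 *   best case:  224 entries * 64 bits * 64KiB = 896MiB of dirty range
 */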

[WHY NOT BTRFS BTREE]

The current write-intent structure needs two features:

- Its data needs to survive across transaction boundaries
  Normally this means a write-intent btree would need to act like a
  proper tree root, with METADATA_ITEMs for all its tree blocks.

- Its data updates must happen outside of a transaction
  Currently only the log tree can do that, but unfortunately the log
  tree cannot survive across a transaction boundary.

Thus a write-intent btree can only meet one of the two requirements and
is not a suitable solution here.

[TESTING AND BENCHMARK]

For performance benchmarks, unfortunately I don't have 3 HDDs to test
with. I will do the benchmarks once I have secured enough hardware.

For testing, the series survives the volume/raid/dev-replace test
groups with no write-intent bitmap leakage.

Unfortunately there is still a warning triggered in btrfs/070; it is
still under investigation and is hopefully a false alert in the bitmap
clearing path.

[TODO]
- Scrub refactor to allow proper recovery at mount time
  The scrub interface needs to be changed to scrub by logical bytenr.

  This can be a huge amount of work, thus for now we will focus only on
  a new RAID56 scrub interface for write-intent recovery.

- Extra optimizations
  * Skip full stripe writes
  * Enlarge the window between btrfs_write_intent_mark_dirty() and
    btrfs_write_intent_writeback()
    So that we can merge more dirty bits and cause fewer bitmap
    writebacks

- Proper performance benchmark
  Needs hardware/bare-metal VMs, since I don't have any physical machine
  large enough to contain 3 3.5" HDDs.


Qu Wenruo (12):
  btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
  btrfs: introduce a new experimental compat RO flag,
    WRITE_INTENT_BITMAP
  btrfs: introduce the on-disk format of btrfs write intent bitmaps
  btrfs: load/create write-intent bitmaps at mount time
  btrfs: write-intent: write the newly created bitmaps to all disks
  btrfs: write-intent: introduce an internal helper to set bits for a
    range.
  btrfs: write-intent: introduce an internal helper to clear bits for a
    range.
  btrfs: selftests: add selftests for write-intent bitmaps
  btrfs: write back write intent bitmap after barrier_all_devices()
  btrfs: update and writeback the write-intent bitmap for RAID56 write.
  btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
  btrfs: warn and clear bitmaps if there is dirty bitmap at mount time

 fs/btrfs/Makefile                           |   5 +-
 fs/btrfs/ctree.h                            |  24 +-
 fs/btrfs/disk-io.c                          |  54 ++
 fs/btrfs/raid56.c                           |  16 +
 fs/btrfs/sysfs.c                            |   2 +
 fs/btrfs/tests/btrfs-tests.c                |   4 +
 fs/btrfs/tests/btrfs-tests.h                |   2 +
 fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
 fs/btrfs/volumes.c                          |  34 +-
 fs/btrfs/write-intent.c                     | 903 ++++++++++++++++++++
 fs/btrfs/write-intent.h                     | 303 +++++++
 fs/btrfs/zoned.c                            |   8 +
 include/uapi/linux/btrfs.h                  |  17 +
 13 files changed, 1610 insertions(+), 9 deletions(-)
 create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
 create mode 100644 fs/btrfs/write-intent.c
 create mode 100644 fs/btrfs/write-intent.h

Comments

Christoph Hellwig July 7, 2022, 5:36 a.m. UTC | #1
Just a high level comment from someone who has no direct stake in this:

The pain point of classic parity RAID is the read-modify-write cycles.
This series is a band-aid to work around the write holes caused by
them, but it does not help with the performance problems associated
with them.

But btrfs is a file system fundamentally designed to write out of place
and thus can get around these problems by writing data in a way that
fundamentally never does a read-modify-write for a set of parity RAID
stripes.

Wouldn't it be a better idea to use the raid stripe tree that Johannes
suggested and apply it to parity writes, instead of beating the dead
horse of classic RAID?
Qu Wenruo July 7, 2022, 5:48 a.m. UTC | #2
On 2022/7/7 13:36, Christoph Hellwig wrote:
> Just a high level comment from someone who has no direct stake in this:
>
> The pain point of classic parity RAID is the read-modify-write cycles,
> and this series is a bandaid to work around the write holes caused by
> them, but does not help with the performane problems associated with
> them.
>
> But btrfs is a file system fundamentally designed to write out of place
> and thus can get around these problems by writing data in a way that
> fundamentally never does a read-modify-write for a set of parity RAID
> stripes.
>
> Wouldn't it be a better idea to use the raid stripe tree that Johannes
> suggest and apply that to parity writes instead of beating the dead
> horse of classic RAID?

This has been discussed before; in short, there are some unknown
aspects of the RST tree from Johannes:

- It may not support RAID56 for metadata forever.
   This may not be a big deal though.

- It is not yet determined how RST will handle non-COW RAID56 writes
   Just CoW the partial data write and its parity to another location?
   How do we reclaim the unreferenced part?

   Currently RST mainly addresses RAID0/1 support for zoned devices;
   RAID56 support is a good thing to come, but it is not yet the focus.

- ENOSPC
   If we go the COW way, the ENOSPC situation will be more pressing.

   All partial writes would then have to be COWed.
   This is going to need much more work; I'm not sure the existing data
   space handling code can handle it at all.

In fact, I think RAID56 in a modern COW environment is worth a full
talk at some conference.
If I had the chance, I would really like to co-host a talk with
Johannes on this (but the stupid zero-covid policy is definitely not
helping at all here).

Thus I went with the tried-and-true solution: a write-intent bitmap and
later a full journal. It provides a way to close the write hole before
we have a better solution, instead of marking RAID56 unsafe and driving
away new users.

Thanks,
Qu
Johannes Thumshirn July 7, 2022, 9:37 a.m. UTC | #3
On 07.07.22 07:49, Qu Wenruo wrote:
> 
> 
> On 2022/7/7 13:36, Christoph Hellwig wrote:
>> Just a high level comment from someone who has no direct stake in this:
>>
>> The pain point of classic parity RAID is the read-modify-write cycles,
>> and this series is a bandaid to work around the write holes caused by
>> them, but does not help with the performane problems associated with
>> them.
>>
>> But btrfs is a file system fundamentally designed to write out of place
>> and thus can get around these problems by writing data in a way that
>> fundamentally never does a read-modify-write for a set of parity RAID
>> stripes.
>>
>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>> suggest and apply that to parity writes instead of beating the dead
>> horse of classic RAID?
> 
> This has been discussed before, in short there are some unknown aspects
> of RST tree from Johannes:
> 
> - It may not support RAID56 for metadata forever.
>    This may not be a big deal though.

This might be a problem indeed, but a) we have the RAID1C[34] profiles and
b) we can place the RST for metadata into the SYSTEM block-group.
 
> - Not yet determined how RST to handle non-COW RAID56 writes
>    Just CoW the partial data write and its parity to other location?
>    How to reclaim the unreferred part?
> 
>    Currently RST is still mainly to address RAID0/1 support for zoned
>    device, RAID56 support is a good thing to come, but not yet focused on
>    RAID56.

Well, RAID1 was the low-hanging fruit and I had to start somewhere. Now that
the overall concept and RAID1 look promising, I've started to look into RAID0.

RAID0 introduces striping for the first time in the context of RST, so there
might be dragons, but nothing unsolvable.

Once this is done, RAID10 should just fall into place (famous last words?),
and with that completed most of the things we need for RAID56 are there as
well, from the RST and volumes.c side of things. What's left is getting rid
of the RMW cycles that are done for sub-stripe-size writes.

> 
> - ENOSPC
>    If we go COW way, the ENOSPC situation will be more pressing.
> 
>    Now all partial writes must be COWed.
>    This is going to need way more work, I'm not sure if the existing
>    data space handling code is able to handle it at all.

Well, just as with normal btrfs, either you can overwrite the "garbage"
space with valid data again, or you need to do garbage collection in the
case of a zoned filesystem.

> 
> In fact, I think the RAID56 in modern COW environment is worthy a full
> talk in some conference.
> If I had the chance, I really want to co-host a talk with Johannes on
> this (but the stupid zero-covid policy is definitely not helping at all
> here).
> 
> Thus I go the tried and true solution, write-intent bitmap and later
> full journal. To provide a way to close the write-hole before we had a
> better solution, instead of marking RAID56 unsafe and drive away new users.
> 
> Thanks,
> Qu
>
Qu Wenruo July 7, 2022, 9:45 a.m. UTC | #4
On 2022/7/7 17:37, Johannes Thumshirn wrote:
> On 07.07.22 07:49, Qu Wenruo wrote:
>>
>>
>> On 2022/7/7 13:36, Christoph Hellwig wrote:
>>> Just a high level comment from someone who has no direct stake in this:
>>>
>>> The pain point of classic parity RAID is the read-modify-write cycles,
>>> and this series is a bandaid to work around the write holes caused by
>>> them, but does not help with the performane problems associated with
>>> them.
>>>
>>> But btrfs is a file system fundamentally designed to write out of place
>>> and thus can get around these problems by writing data in a way that
>>> fundamentally never does a read-modify-write for a set of parity RAID
>>> stripes.
>>>
>>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>>> suggest and apply that to parity writes instead of beating the dead
>>> horse of classic RAID?
>>
>> This has been discussed before, in short there are some unknown aspects
>> of RST tree from Johannes:
>>
>> - It may not support RAID56 for metadata forever.
>>     This may not be a big deal though.
>
> This might be a problem indeed, but a) we have the RAID1C[34] profiles and
> b) we can place the RST for meta-data into the SYSTEM block-group.
>
>> - Not yet determined how RST to handle non-COW RAID56 writes
>>     Just CoW the partial data write and its parity to other location?
>>     How to reclaim the unreferred part?
>>
>>     Currently RST is still mainly to address RAID0/1 support for zoned
>>     device, RAID56 support is a good thing to come, but not yet focused on
>>     RAID56.
>
> Well RAID1 was the low hanging fruit and I had to start somewhere. Now that
> the overall concept and RAID1 looks promising I've started to look into RAID0.
>
> RAID0 introduces srtiping for the 1st time in the context of RST, so there might
> be dragons, but nothing unsolvable.
>
> Once this is done, RAID10 should just fall into place (famous last words?) and
> with this completed most of the things we need for RAID56 are there as well, from
> the RST and volumes.c side of things. What's left is getting rid of the RMW cycles
> that are done for sub stripe size writes.
>
>>
>> - ENOSPC
>>     If we go COW way, the ENOSPC situation will be more pressing.
>>
>>     Now all partial writes must be COWed.
>>     This is going to need way more work, I'm not sure if the existing
>>     data space handling code is able to handle it at all.
>
> Well just as with normal btrfs as well, either you can override the "garbage"
> space with valid data again or you need to do garbage collection in case of
> a zoned file-system.

My concern is (not considering zoned yet): what happens if we do a
partial write into a full stripe?

Like this:

D1	| Old data | Old data |
D2	| To write |  Unused  |
P	| Parity 1 | Parity 2 |

The "To write" part will definitely need a new RST entry, no doubt about that.

But what would happen to Parity 1? We need to do a CoW to some new
location, right?

So the old physical location of "Parity 1" will be marked reserved and
only freed after we have committed a full transaction?


Another concern is, what if the following case happens?

D1	| Old 1 | Old 2 |
D2	| Old 3 | Old 4 |
P	|  P 1  |  P 2  |

If we only write part of the data into "Old 3", do we need to read the
whole of "Old 3" out and CoW the full stripe? Or can we CoW only the
affected sectors?

Thanks,
Qu
>
>>
>> In fact, I think the RAID56 in modern COW environment is worthy a full
>> talk in some conference.
>> If I had the chance, I really want to co-host a talk with Johannes on
>> this (but the stupid zero-covid policy is definitely not helping at all
>> here).
>>
>> Thus I go the tried and true solution, write-intent bitmap and later
>> full journal. To provide a way to close the write-hole before we had a
>> better solution, instead of marking RAID56 unsafe and drive away new users.
>>
>> Thanks,
>> Qu
>>
>
Qu Wenruo July 7, 2022, 10:42 a.m. UTC | #5
On 2022/7/7 17:45, Qu Wenruo wrote:
>
>
> On 2022/7/7 17:37, Johannes Thumshirn wrote:
>> On 07.07.22 07:49, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/7/7 13:36, Christoph Hellwig wrote:
>>>> Just a high level comment from someone who has no direct stake in this:
>>>>
>>>> The pain point of classic parity RAID is the read-modify-write cycles,
>>>> and this series is a bandaid to work around the write holes caused by
>>>> them, but does not help with the performane problems associated with
>>>> them.
>>>>
>>>> But btrfs is a file system fundamentally designed to write out of place
>>>> and thus can get around these problems by writing data in a way that
>>>> fundamentally never does a read-modify-write for a set of parity RAID
>>>> stripes.
>>>>
>>>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>>>> suggest and apply that to parity writes instead of beating the dead
>>>> horse of classic RAID?
>>>
>>> This has been discussed before, in short there are some unknown aspects
>>> of RST tree from Johannes:
>>>
>>> - It may not support RAID56 for metadata forever.
>>>     This may not be a big deal though.
>>
>> This might be a problem indeed, but a) we have the RAID1C[34] profiles
>> and
>> b) we can place the RST for meta-data into the SYSTEM block-group.
>>
>>> - Not yet determined how RST to handle non-COW RAID56 writes
>>>     Just CoW the partial data write and its parity to other location?
>>>     How to reclaim the unreferred part?
>>>
>>>     Currently RST is still mainly to address RAID0/1 support for zoned
>>>     device, RAID56 support is a good thing to come, but not yet
>>> focused on
>>>     RAID56.
>>
>> Well RAID1 was the low hanging fruit and I had to start somewhere. Now
>> that
>> the overall concept and RAID1 looks promising I've started to look
>> into RAID0.
>>
>> RAID0 introduces srtiping for the 1st time in the context of RST, so
>> there might
>> be dragons, but nothing unsolvable.
>>
>> Once this is done, RAID10 should just fall into place (famous last
>> words?) and
>> with this completed most of the things we need for RAID56 are there as
>> well, from
>> the RST and volumes.c side of things. What's left is getting rid of
>> the RMW cycles
>> that are done for sub stripe size writes.
>>
>>>
>>> - ENOSPC
>>>     If we go COW way, the ENOSPC situation will be more pressing.
>>>
>>>     Now all partial writes must be COWed.
>>>     This is going to need way more work, I'm not sure if the existing
>>>     data space handling code is able to handle it at all.
>>
>> Well just as with normal btrfs as well, either you can override the
>> "garbage"
>> space with valid data again or you need to do garbage collection in
>> case of
>> a zoned file-system.
>
> My concern is, (not considering zoned yet) if we do a partial write into
> a full stripe, what would happen?
>
> Like this:
>
> D1    | Old data | Old data |
> D2    | To write |  Unused  |
> P    | Parity 1 | Parity 2 |
>
> The "To write" part will definite need a new RST entry, so no double.
>
> But what would happen to Parity 1? We need to do a CoW to some new
> location, right?
>
> So the old physical location for "Parity 1" will be mark reserved and
> only freed after we did a full transaction?
>
>
> Another concern is, what if the following case happen?
>
> D1    | Old 1 | Old 2 |
> D2    | Old 3 | Old 4 |
> P    |  P 1  |  P 2  |
>
> If we only write part of data into "Old 3" do we need to read the whole
> "Old 3" out and CoW the full stripe? Or can we do sectors CoW only?

To be more accurate, if we go the COW path for data and parity stripes,
we will eventually hit the following situation:
	0		  64K
Disk 1  | Old D1  | Old D2  |
Disk 2  | Free D3 | Free D4 |
Disk 3  | Old P1  | Old P2  |

Currently the extent allocator considers the ranges at D3 and D4
available.
But if we write into D3 or D4, we need the write-intent bitmap (or a
full journal) to close the write hole (the write-intent bitmap requires
that no device is missing, while a full journal has no such restriction).


If we go CoW, this means that if we want to write into D3, we have to
CoW P1.
But if all stripes look like the above, even with parity rotation, all
the data/parity space on disk 3 can already be in use.
Thus there is really no space left for the extra CoW.

To me, as long as we go CoW for parity, the core idea is no different
from a new extent allocator policy that avoids partial writes.
And that can be very challenging work.

Thanks,
Qu
>
> Thanks,
> Qu
>>
>>>
>>> In fact, I think the RAID56 in modern COW environment is worthy a full
>>> talk in some conference.
>>> If I had the chance, I really want to co-host a talk with Johannes on
>>> this (but the stupid zero-covid policy is definitely not helping at all
>>> here).
>>>
>>> Thus I go the tried and true solution, write-intent bitmap and later
>>> full journal. To provide a way to close the write-hole before we had a
>>> better solution, instead of marking RAID56 unsafe and drive away new
>>> users.
>>>
>>> Thanks,
>>> Qu
>>>
>>
Johannes Thumshirn July 7, 2022, 12:23 p.m. UTC | #6
On 07.07.22 11:45, Qu Wenruo wrote:
> 
> 
> On 2022/7/7 17:37, Johannes Thumshirn wrote:
>> On 07.07.22 07:49, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/7/7 13:36, Christoph Hellwig wrote:
>>>> Just a high level comment from someone who has no direct stake in this:
>>>>
>>>> The pain point of classic parity RAID is the read-modify-write cycles,
>>>> and this series is a bandaid to work around the write holes caused by
>>>> them, but does not help with the performane problems associated with
>>>> them.
>>>>
>>>> But btrfs is a file system fundamentally designed to write out of place
>>>> and thus can get around these problems by writing data in a way that
>>>> fundamentally never does a read-modify-write for a set of parity RAID
>>>> stripes.
>>>>
>>>> Wouldn't it be a better idea to use the raid stripe tree that Johannes
>>>> suggest and apply that to parity writes instead of beating the dead
>>>> horse of classic RAID?
>>>
>>> This has been discussed before, in short there are some unknown aspects
>>> of RST tree from Johannes:
>>>
>>> - It may not support RAID56 for metadata forever.
>>>     This may not be a big deal though.
>>
>> This might be a problem indeed, but a) we have the RAID1C[34] profiles and
>> b) we can place the RST for meta-data into the SYSTEM block-group.
>>
>>> - Not yet determined how RST to handle non-COW RAID56 writes
>>>     Just CoW the partial data write and its parity to other location?
>>>     How to reclaim the unreferred part?
>>>
>>>     Currently RST is still mainly to address RAID0/1 support for zoned
>>>     device, RAID56 support is a good thing to come, but not yet focused on
>>>     RAID56.
>>
>> Well RAID1 was the low hanging fruit and I had to start somewhere. Now that
>> the overall concept and RAID1 looks promising I've started to look into RAID0.
>>
>> RAID0 introduces srtiping for the 1st time in the context of RST, so there might
>> be dragons, but nothing unsolvable.
>>
>> Once this is done, RAID10 should just fall into place (famous last words?) and
>> with this completed most of the things we need for RAID56 are there as well, from
>> the RST and volumes.c side of things. What's left is getting rid of the RMW cycles
>> that are done for sub stripe size writes.
>>
>>>
>>> - ENOSPC
>>>     If we go COW way, the ENOSPC situation will be more pressing.
>>>
>>>     Now all partial writes must be COWed.
>>>     This is going to need way more work, I'm not sure if the existing
>>>     data space handling code is able to handle it at all.
>>
>> Well just as with normal btrfs as well, either you can override the "garbage"
>> space with valid data again or you need to do garbage collection in case of
>> a zoned file-system.
> 
> My concern is, (not considering zoned yet) if we do a partial write into
> a full stripe, what would happen?
> 
> Like this:
> 
> D1	| Old data | Old data |
> D2	| To write |  Unused  |
> P	| Parity 1 | Parity 2 |
> 
> The "To write" part will definite need a new RST entry, so no double.
> 
> But what would happen to Parity 1? We need to do a CoW to some new
> location, right?
> 
> So the old physical location for "Parity 1" will be mark reserved and
> only freed after we did a full transaction?

Correct

> 
> 
> Another concern is, what if the following case happen?
> 
> D1	| Old 1 | Old 2 |
> D2	| Old 3 | Old 4 |
> P	|  P 1  |  P 2  |
> 
> If we only write part of data into "Old 3" do we need to read the whole
> "Old 3" out and CoW the full stripe? Or can we do sectors CoW only?

Well, that'll be an implementation detail. I'll see what I can come up with
to make this work.
Christoph Hellwig July 7, 2022, 1:36 p.m. UTC | #7
On Thu, Jul 07, 2022 at 01:48:11PM +0800, Qu Wenruo wrote:
> - It may not support RAID56 for metadata forever.
>   This may not be a big deal though.

Does parity raid for metadata make much sense to start with?

> - Not yet determined how RST to handle non-COW RAID56 writes

The whole point is that with parity raid you really do not ever want
to do non-COW writes, as that is what gets you the raid hole problem.
Qu Wenruo July 7, 2022, 1:48 p.m. UTC | #8
On 2022/7/7 21:36, Christoph Hellwig wrote:
> On Thu, Jul 07, 2022 at 01:48:11PM +0800, Qu Wenruo wrote:
>> - It may not support RAID56 for metadata forever.
>>    This may not be a big deal though.
> 
> Does parity raid for metadata make much sense to start with?
> 
>> - Not yet determined how RST to handle non-COW RAID56 writes
> 
> The whole point is that with parity raid you really do not ever want
> to do non-COW writes, as that is what gets you the raid hole problem.

Even without non-COW writes you can still get screwed up.

Just write some new data into unused space while there is already other
data in the same vertical stripe (aka a partial write), and the raid
hole problem comes back.

Let me say this again: the extent allocator is not P/Q friendly at all,
and even if we update the allocator policy, the ENOSPC problem will come
to bother the data profiles too.

In fact, considering real-world data chunk usage on large filesystems
(in most cases over 95%, easily reaching 98%), I'm pretty sure ENOSPC
will be a real problem.

Thus any COW-based policy for P/Q-based RAID will not only need a big
update to the extent allocator, but will also bring ENOSPC-related
problems.

Thus, to me, if we go with the COW policy directly, we are not really
going to provide a better RAID56; we're just trading one problem for
another.

That's why I still believe a traditional write-intent bitmap/journal
makes sense.
It doesn't rely on big extent allocator changes, but still solves the
write-hole problem.

Thanks,
Qu
Lukas Straub July 13, 2022, 4:18 p.m. UTC | #9
Hello Qu,

I think I mentioned it elsewhere, but could this be made general to all
raid levels (e.g. raid1/10 too)? This way, the bitmap can also be used
to fix the corruption of nocow files on btrfs raid. IMHO this is very
important since openSUSE and Fedora use nocow for certain files
(databases, vm images) by default and currently anyone using btrfs raid
there will be shot in the foot by this.

More comments below.

On Thu,  7 Jul 2022 13:32:25 +0800
Qu Wenruo <wqu@suse.com> wrote:
> [...]
> [OBJECTIVE]
> 
> This patchset will introduce a btrfs specific write-intent bitmap.
> 
> The bitmap will locate at physical offset 1MiB of each device, and the
> content is the same between all devices.
> 
> When there is a RAID56 write (currently all RAID56 write, _including full
> stripe write_), before submitting all the real bios to disks,
> write-intent bitmap will be updated and flushed to all writeable
> devices.

You'll need to update the bitmap even with full stripe writes. I don't
know btrfs internals well, but this example should apply:

1. A power loss happens during a full stripe write. If the bitmap wasn't set,
the whole stripe will contain inconsistent data:

	0		32K		64K
Disk 1	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 3	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)

2. Partial stripe write happens, only updates one data + parity:

	0		32K		64K
Disk 1	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
Disk 3	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)

3. We lose Disk 1. We try to recover Disk 1's data by using Disk 2's data
+ parity. Because Disk 2 is inconsistent, we get invalid data.

Thus, we need to scrub the stripe even after a full stripe write to
prevent this.

> So even if a powerloss happened, at the next mount time we know which
> full stripes needs to check, and can start a scrub for those involved
> logical bytenr ranges.
> 
> [...]
> 
> [BITMAPS DESIGN]
> 
> The bitmaps on-disk format looks like this:
> 
>  [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>  |<---------  super::size (4K) ------------->|
> 
> Super block contains how many entires are in use.
> 
> Each entry is 128 bits (16 bytes) in size, containing one u64 for
> bytenr, and u64 for one bitmap.
> 
> And all utilized entries will be sorted in their bytenr order, and no
> bit can overlap.
> 
> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.

IMHO we can go much larger; mdraid, for example, uses a blocksize of
64MiB by default. Sure, we'll scrub many unrelated stripes on recovery,
but write performance will be better.

Regards,
Lukas Straub

> For the worst case, it can contain 14MiB dirty ranges.
> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
> 
> For the best case, it can contain 896MiB dirty ranges.
> (all bits set per bitmap)
> 
> [WHY NOT BTRFS BTREE]
> 
> Current write-intent structure needs two features:
> 
> - Its data needs to survive cross stripe boundary
>   Normally this means write-intent btree needs to acts like a proper
>   tree root, has METADATA_ITEMs for all its tree blocks.
> 
> - Its data update must be outside of a transaction
>   Currently only log tree can do such thing.
>   But unfortunately log tree can not survive across transaction
>   boundary.
> 
> Thus write-intent btree can only meet one of the requirement, not a
> suitable solution here.
> 
> [TESTING AND BENCHMARK]
> 
> For performance benchmark, unfortunately I don't have 3 HDDs to test.
> Will do the benchmark after secured enough hardware.
> 
> For testing, it can survive volume/raid/dev-replace test groups, and no
> write-intent bitmap leakage.
> 
> Unfortunately there is still a warning triggered in btrfs/070, still
> under investigation, hopefully to be a false alert in bitmap clearing
> path.
> 
> [TODO]
> - Scrub refactor to allow us to do proper recovery at mount time
>   Need to change scrub interface to scrub based on logical bytenr.
> 
>   This can be a super big work, thus currently we will focus only on
>   RAID56 new scrub interface for write-intent recovery only.
> 
> - Extra optimizations
>   * Skip full stripe writes
>   * Enlarge the window between btrfs_write_intent_mark_dirty() and
>     btrfs_write_intent_writeback()
>     So that we can merge more dirty bites and cause less bitmaps
>     writeback
> 
> - Proper performance benchmark
>   Needs hardware/baremetal VMs, since I don't have any physical machine
>   large enough to contian 3 3.5" HDDs.
> 
> 
> Qu Wenruo (12):
>   btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>   btrfs: introduce a new experimental compat RO flag,
>     WRITE_INTENT_BITMAP
>   btrfs: introduce the on-disk format of btrfs write intent bitmaps
>   btrfs: load/create write-intent bitmaps at mount time
>   btrfs: write-intent: write the newly created bitmaps to all disks
>   btrfs: write-intent: introduce an internal helper to set bits for a
>     range.
>   btrfs: write-intent: introduce an internal helper to clear bits for a
>     range.
>   btrfs: selftests: add selftests for write-intent bitmaps
>   btrfs: write back write intent bitmap after barrier_all_devices()
>   btrfs: update and writeback the write-intent bitmap for RAID56 write.
>   btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>   btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
> 
>  fs/btrfs/Makefile                           |   5 +-
>  fs/btrfs/ctree.h                            |  24 +-
>  fs/btrfs/disk-io.c                          |  54 ++
>  fs/btrfs/raid56.c                           |  16 +
>  fs/btrfs/sysfs.c                            |   2 +
>  fs/btrfs/tests/btrfs-tests.c                |   4 +
>  fs/btrfs/tests/btrfs-tests.h                |   2 +
>  fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>  fs/btrfs/volumes.c                          |  34 +-
>  fs/btrfs/write-intent.c                     | 903 ++++++++++++++++++++
>  fs/btrfs/write-intent.h                     | 303 +++++++
>  fs/btrfs/zoned.c                            |   8 +
>  include/uapi/linux/btrfs.h                  |  17 +
>  13 files changed, 1610 insertions(+), 9 deletions(-)
>  create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>  create mode 100644 fs/btrfs/write-intent.c
>  create mode 100644 fs/btrfs/write-intent.h
> 



--
Qu Wenruo July 13, 2022, 11 p.m. UTC | #10
On 2022/7/14 00:18, Lukas Straub wrote:
> Hello Qu,
>
> I think I mentioned it elsewhere, but could this be made general to all
> raid levels (e.g. raid1/10 too)? This way, the bitmap can also be used
> to fix the corruption of nocow files on btrfs raid. IMHO this is very
> important since openSUSE and Fedora use nocow for certain files
> (databases, vm images) by default and currently anyone using btrfs raid
> there will be shot in the foot by this.

Yes, that's in the future plan.

Although currently we're still at the discussion stage, figuring out
which way we should really go (RST or write-intent).
>
> More comments below.
>
> On Thu,  7 Jul 2022 13:32:25 +0800
> Qu Wenruo <wqu@suse.com>
>> [...]
>> [OBJECTIVE]
>>
>> This patchset will introduce a btrfs specific write-intent bitmap.
>>
>> The bitmap will locate at physical offset 1MiB of each device, and the
>> content is the same between all devices.
>>
>> When there is a RAID56 write (currently all RAID56 write, _including full
>> stripe write_), before submitting all the real bios to disks,
>> write-intent bitmap will be updated and flushed to all writeable
>> devices.
>
> You'll need to update the bitmap even with full stripe writes. I don't
> know btrfs internals well, but this example should apply:
>
> 1. Powerloss happens during a full stripe write. If the bitmap wasn't set,
> the whole stripe will contain inconsistent data:

That's where btrfs COW shines.

If we do a full stripe write and hit a power loss, that full stripe's
data won't be read at mount at all.

All metadata will still point back to the old data, thus only partial
writes matter here.

Thanks,
Qu
>
> 	0		32K		64K
> Disk 1	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 3	|iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)
>
> 2. Partial stripe write happens, only updates one data + parity:
>
> 	0		32K		64K
> Disk 1	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 2  |iiiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (data stripe)
> Disk 3	|XXiiiiiiiiiiiiiiiiiiiiiiiiiiiii| (parity stripe)
>
> 3. We loose Disk 1. We try to recover Disk 1 data by using Disk 2 data
> + parity. Because Disk 2 is inconsistent we get invalid data.
>
> Thus, we need to scrub the stripe even after a full stripe write to
> prevent this.
>
>> So even if a powerloss happened, at the next mount time we know which
>> full stripes needs to check, and can start a scrub for those involved
>> logical bytenr ranges.
>>
>> [...]
>>
>> [BITMAPS DESIGN]
>>
>> The bitmaps on-disk format looks like this:
>>
>>   [ super ][ entry 1 ][ entry 2 ] ... [entry N]
>>   |<---------  super::size (4K) ------------->|
>>
>> Super block contains how many entires are in use.
>>
>> Each entry is 128 bits (16 bytes) in size, containing one u64 for
>> bytenr, and u64 for one bitmap.
>>
>> And all utilized entries will be sorted in their bytenr order, and no
>> bit can overlap.
>>
>> The blocksize is now fixed to BTRFS_STRIPE_LEN (64KiB), so each entry
>> can contain at most 4MiB, and the whole bitmaps can contain 224 entries.
>
> IMHO we can go much larger, mdraid for example uses a blocksize of
> 64MiB by default. Sure, we'll scrub many unrelated stripes on recovery
> but write performance will be better.
>
> Regards,
> Lukas Straub
>
>> For the worst case, it can contain 14MiB dirty ranges.
>> (1 bits set per bitmap, also means 2 disks RAID5 or 3 disks RAID6).
>>
>> For the best case, it can contain 896MiB dirty ranges.
>> (all bits set per bitmap)
>>
>> [WHY NOT BTRFS BTREE]
>>
>> Current write-intent structure needs two features:
>>
>> - Its data needs to survive cross stripe boundary
>>    Normally this means write-intent btree needs to acts like a proper
>>    tree root, has METADATA_ITEMs for all its tree blocks.
>>
>> - Its data update must be outside of a transaction
>>    Currently only log tree can do such thing.
>>    But unfortunately log tree can not survive across transaction
>>    boundary.
>>
>> Thus write-intent btree can only meet one of the requirement, not a
>> suitable solution here.
>>
>> [TESTING AND BENCHMARK]
>>
>> For performance benchmark, unfortunately I don't have 3 HDDs to test.
>> Will do the benchmark after secured enough hardware.
>>
>> For testing, it can survive volume/raid/dev-replace test groups, and no
>> write-intent bitmap leakage.
>>
>> Unfortunately there is still a warning triggered in btrfs/070, still
>> under investigation, hopefully to be a false alert in bitmap clearing
>> path.
>>
>> [TODO]
>> - Scrub refactor to allow us to do proper recovery at mount time
>>    Need to change scrub interface to scrub based on logical bytenr.
>>
>>    This can be a super big work, thus currently we will focus only on
>>    RAID56 new scrub interface for write-intent recovery only.
>>
>> - Extra optimizations
>>    * Skip full stripe writes
>>    * Enlarge the window between btrfs_write_intent_mark_dirty() and
>>      btrfs_write_intent_writeback()
>>      So that we can merge more dirty bites and cause less bitmaps
>>      writeback
>>
>> - Proper performance benchmark
>>    Needs hardware/baremetal VMs, since I don't have any physical machine
>>    large enough to contian 3 3.5" HDDs.
>>
>>
>> Qu Wenruo (12):
>>    btrfs: introduce new compat RO flag, EXTRA_SUPER_RESERVED
>>    btrfs: introduce a new experimental compat RO flag,
>>      WRITE_INTENT_BITMAP
>>    btrfs: introduce the on-disk format of btrfs write intent bitmaps
>>    btrfs: load/create write-intent bitmaps at mount time
>>    btrfs: write-intent: write the newly created bitmaps to all disks
>>    btrfs: write-intent: introduce an internal helper to set bits for a
>>      range.
>>    btrfs: write-intent: introduce an internal helper to clear bits for a
>>      range.
>>    btrfs: selftests: add selftests for write-intent bitmaps
>>    btrfs: write back write intent bitmap after barrier_all_devices()
>>    btrfs: update and writeback the write-intent bitmap for RAID56 write.
>>    btrfs: raid56: clear write-intent bimaps when a full stripe finishes.
>>    btrfs: warn and clear bitmaps if there is dirty bitmap at mount time
>>
>>   fs/btrfs/Makefile                           |   5 +-
>>   fs/btrfs/ctree.h                            |  24 +-
>>   fs/btrfs/disk-io.c                          |  54 ++
>>   fs/btrfs/raid56.c                           |  16 +
>>   fs/btrfs/sysfs.c                            |   2 +
>>   fs/btrfs/tests/btrfs-tests.c                |   4 +
>>   fs/btrfs/tests/btrfs-tests.h                |   2 +
>>   fs/btrfs/tests/write-intent-bitmaps-tests.c | 247 ++++++
>>   fs/btrfs/volumes.c                          |  34 +-
>>   fs/btrfs/write-intent.c                     | 903 ++++++++++++++++++++
>>   fs/btrfs/write-intent.h                     | 303 +++++++
>>   fs/btrfs/zoned.c                            |   8 +
>>   include/uapi/linux/btrfs.h                  |  17 +
>>   13 files changed, 1610 insertions(+), 9 deletions(-)
>>   create mode 100644 fs/btrfs/tests/write-intent-bitmaps-tests.c
>>   create mode 100644 fs/btrfs/write-intent.c
>>   create mode 100644 fs/btrfs/write-intent.h
>>
>
>
>