
[DRAFT] btrfs: RAID56J journal on-disk format draft

Message ID f6679e0b681f9b1a74dfccbe05dcb5b6eb0878f6.1653372729.git.wqu@suse.com (mailing list archive)
State New, archived
Series [DRAFT] btrfs: RAID56J journal on-disk format draft

Commit Message

Qu Wenruo May 24, 2022, 6:13 a.m. UTC
This is the draft version of the on-disk format for RAID56J journal.

The overall idea is, we have the following elements:

1) A fixed header
   Recording things like if the journal is clean or dirty, and how many
   entries it has.

2) One or at most 127 entries
   Each entry will point to a range of data in the per-device reserved
   range.

3) Data in the remaining reserved space

The write time and recovery workflow is embedded into the on-disk
format.
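
To make the discussion easier, here is a rough illustration of what such a
layout could look like. This is only a sketch; the field names and sizes
below are assumptions for illustration, not the actual definitions from the
patch:

#include <linux/types.h>

/*
 * Illustrative sketch only, not the real on-disk definitions of the patch.
 * It just mirrors the three elements described above.
 */
struct raid56j_journal_entry {
	/* Logical bytenr of the full stripe this entry journals. */
	__le64 logical;
	/* Offset of the journaled copy inside the per-device reserved range. */
	__le64 journal_offset;
	/* Length of the journaled data. */
	__le32 length;
	__le32 __unused;
};

struct raid56j_journal_header {
	__le64 magic;
	/* CLEAN or DIRTY. */
	__le64 state;
	/* Number of valid entries, at most 127. */
	__le32 nr_entries;
	__u8 __reserved[4];
	/* One or at most 127 entries follow the header. */
	struct raid56j_journal_entry entries[127];
};

The data referenced by each entry then lives in the remaining reserved
space, after the header and the entry array.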

Unfortunately we will not have any optimization for RAID56J; every
write will be journaled, no exception.

Furthermore, due to the current write behavior of RAID56, where we always
submit a full 64K stripe no matter what, we have a very limited size for the
data part (at most 15 64K stripes).

So far, I don't believe we will have a fast RAID56J at all.

There are still some behaviors that need extra comments, like whether we
really need to do an extensive flush when converting a DIRTY log to CLEAN.

Any feedback is welcomed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 include/uapi/linux/btrfs_tree.h | 142 ++++++++++++++++++++++++++++++++
 1 file changed, 142 insertions(+)

Comments

kernel test robot May 24, 2022, 11:08 a.m. UTC | #1
Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.18]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-RAID56J-journal-on-disk-format-draft/20220524-141549
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20220524/202205241800.sJRhtNfS-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/695dee41baee3a72c39396cb548088e91472d04d
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-RAID56J-journal-on-disk-format-draft/20220524-141549
        git checkout 695dee41baee3a72c39396cb548088e91472d04d
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:
>> ./usr/include/linux/btrfs_tree.h:1156:9: error: unknown type name 'u8'
    1156 |         u8 __reserved[4];
         |         ^~
kernel test robot May 24, 2022, 12:19 p.m. UTC | #2
Hi Qu,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on kdave/for-next]
[also build test ERROR on v5.18 next-20220524]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-RAID56J-journal-on-disk-format-draft/20220524-141549
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
config: x86_64-randconfig-a016 (https://download.01.org/0day-ci/archive/20220524/202205242033.I6ix2WkH-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 10c9ecce9f6096e18222a331c5e7d085bd813f75)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/695dee41baee3a72c39396cb548088e91472d04d
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qu-Wenruo/btrfs-RAID56J-journal-on-disk-format-draft/20220524-141549
        git checkout 695dee41baee3a72c39396cb548088e91472d04d
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <built-in>:1:
>> ./usr/include/linux/btrfs_tree.h:1156:2: error: unknown type name 'u8'
           u8 __reserved[4];
           ^
   1 error generated.
David Sterba May 24, 2022, 5:02 p.m. UTC | #3
On Tue, May 24, 2022 at 02:13:47PM +0800, Qu Wenruo wrote:
> This is the draft version of the on-disk format for RAID56J journal.
> 
> The overall idea is, we have the following elements:
> 
> 1) A fixed header
>    Recording things like if the journal is clean or dirty, and how many
>    entries it has.
> 
> 2) One or at most 127 entries
>    Each entry will point to a range of data in the per-device reserved
>    range.
> 
> 3) Data in the remaining reserved space
> 
> The write time and recovery workflow is embedded into the on-disk
> format.
> 
> Unfortunately we will not have any optimization for the RAID56J, every
> write will be journaled, no exception.
> 
> Furthermore due to current write behavior of RAID56, we always submit a
> full 64K stripe no matter what, we have every limited size for the data
> part (at most 15 64K stripe).
> 
> So far, I don't believe we will have a fast RAID56J at all.

Well, that does not sound encouraging. One option discussed in the past
for how to fix the write hole was to always do a full RMW cycle. Having a
journal that is "not fast at all" would require a format change and would
probably have a comparable performance drop.
Qu Wenruo May 24, 2022, 10:31 p.m. UTC | #4
On 2022/5/25 01:02, David Sterba wrote:
> On Tue, May 24, 2022 at 02:13:47PM +0800, Qu Wenruo wrote:
>> This is the draft version of the on-disk format for RAID56J journal.
>>
>> The overall idea is, we have the following elements:
>>
>> 1) A fixed header
>>     Recording things like if the journal is clean or dirty, and how many
>>     entries it has.
>>
>> 2) One or at most 127 entries
>>     Each entry will point to a range of data in the per-device reserved
>>     range.
>>
>> 3) Data in the remaining reserved space
>>
>> The write time and recovery workflow is embedded into the on-disk
>> format.
>>
>> Unfortunately we will not have any optimization for the RAID56J, every
>> write will be journaled, no exception.
>>
>> Furthermore due to current write behavior of RAID56, we always submit a
>> full 64K stripe no matter what, we have every limited size for the data
>> part (at most 15 64K stripe).
>>
>> So far, I don't believe we will have a fast RAID56J at all.
>
> Well, that does not sound encouraging. One option discussed in the past
> how to fix the write hole was to always do full RMW cycle. Having a "not
> fast journal at all" would require a format change and have probably a
> comparable performance drop.
>
That's why the next step is to improve the RMW cycle, to only write the
vertical stripes first.

Which can help a little on the performance front.

Thanks,
Qu
Christoph Hellwig May 25, 2022, 9 a.m. UTC | #5
On Tue, May 24, 2022 at 07:02:34PM +0200, David Sterba wrote:
> Well, that does not sound encouraging. One option discussed in the past
> how to fix the write hole was to always do full RMW cycle. Having a "not
> fast journal at all" would require a format change and have probably a
> comparable performance drop.

So maybe I'm just dumb, but what is the problem with only using
raid56 for data, forbidding nocow for it and thus avoiding the
problem entirely?
Qu Wenruo May 25, 2022, 9:13 a.m. UTC | #6
On 2022/5/25 17:00, Christoph Hellwig wrote:
> On Tue, May 24, 2022 at 07:02:34PM +0200, David Sterba wrote:
>> Well, that does not sound encouraging. One option discussed in the past
>> how to fix the write hole was to always do full RMW cycle. Having a "not
>> fast journal at all" would require a format change and have probably a
>> comparable performance drop.
>
> So maybe I'm just dumb, but what is the problem with only using
> raid56 for data, forbidding nowcow for it and thus avoiding the
> problem entirely?

The problem is, we can have partial writes for RAID56, no matter if we
use NODATACOW or not.

For example, we have a very typical 3 disks RAID5:

	0	32K	64K
Disk 1  |DDDDDDD|       |
Disk 2  |ddddddd|ddddddd|
Disk 3  |PPPPPPP|PPPPPPP|


D = old data, it's there for a while.
d = new data, we want to write.

Now the bios for disk 1 and disk 3 have finished, but before the bio for
disk 2 can finish, we hit a power loss.

Btrfs reverts to old data, so we should still only see D, but no new data d.

So far so good.

But what if disk 1 now disappears?

To read out old data (DDDDD), we need to rebuild using disk2 and disk3.

But please note that now disk 3 has the new parity, while disk 2 still has
the old data.

Now the recovered data will be wrong, and will not pass the btrfs csum check.

This is the write-hole problem: it's not screwing up all our data all of a
sudden, but corrupting our data byte by byte each time we hit a power loss.
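
For illustration, here is a tiny stand-alone demo (toy code, not btrfs code)
of why the rebuild goes wrong, assuming single-byte "disks" and XOR parity:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t D_old = 0xAA;			/* old data on disk 1 */
	uint8_t d_old = 0x11, d_new = 0x22;	/* old/new data on disk 2 */

	/* New parity already written to disk 3 before the power loss. */
	uint8_t P_new = D_old ^ d_new;

	/*
	 * Power loss: disk 3 holds P_new, but disk 2 still holds d_old.
	 * Now disk 1 disappears and must be rebuilt from the other two.
	 */
	uint8_t D_rebuilt = d_old ^ P_new;

	printf("original D = 0x%02x, rebuilt D = 0x%02x (%s)\n",
	       D_old, D_rebuilt,
	       D_rebuilt == D_old ? "ok" : "corrupted, csum mismatch");
	return 0;
}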

Thanks,
Qu
Christoph Hellwig May 25, 2022, 9:26 a.m. UTC | #7
On Wed, May 25, 2022 at 05:13:11PM +0800, Qu Wenruo wrote:
> The problem is, we can have partial write for RAID56, no matter if we
> use NODATACOW or not.
> 
> For example, we have a very typical 3 disks RAID5:
> 
> 	0	32K	64K
> Disk 1  |DDDDDDD|       |
> Disk 2  |ddddddd|ddddddd|
> Disk 3  |PPPPPPP|PPPPPPP|
> 
> 
> D = old data, it's there for a while.
> d = new data, we want to write.

Oh.  I keep forgetting that the striping is entirely on the physical
block basis and not the logical block basis.  Which makes the whole idea
of btrfs integrated raid5/6 not all that useful compared to just using
mdraid :(
Qu Wenruo May 25, 2022, 9:35 a.m. UTC | #8
On 2022/5/25 17:26, Christoph Hellwig wrote:
> On Wed, May 25, 2022 at 05:13:11PM +0800, Qu Wenruo wrote:
>> The problem is, we can have partial write for RAID56, no matter if we
>> use NODATACOW or not.
>>
>> For example, we have a very typical 3 disks RAID5:
>>
>> 	0	32K	64K
>> Disk 1  |DDDDDDD|       |
>> Disk 2  |ddddddd|ddddddd|
>> Disk 3  |PPPPPPP|PPPPPPP|
>>
>>
>> D = old data, it's there for a while.
>> d = new data, we want to write.
>
> Oh.  I keep forgetting that the striping is entirely on the physіcal
> block basis and not logic block basis.  Which makes the whole idea
> of btrfs integrated raid5/6 not all that useful compared to just using
> mdraid :(

Yep, that's why I have to go the old journal way.

But you may want to explore the super awesome idea of raid stripe tree
from Johannes.

The idea is we introduce a new layer of logical addr -> internal mapping
-> physical addr.
By that, we get rid of the strict physical address requirement.

And when we update the new stripe, we just insert two new mappings for
(dddd), and two new mappings for the new (PPPPP).

If a power loss happens, we still see the old internal mapping, and can get
the correct recovery.

But it still seems to have a lot of things to resolve for now.
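
Just to visualize the idea (this is purely my guess for illustration, not
the actual RST design from Johannes), a mapping item could conceptually look
like:

#include <linux/types.h>

/*
 * Hypothetical sketch of a stripe tree mapping item, only to illustrate
 * the "logical addr -> internal mapping -> physical addr" layer above.
 * Names and layout are made up; the real RST may differ completely.
 */
struct rst_stride {
	__le64 devid;		/* device holding this data/parity stride */
	__le64 physical;	/* physical offset on that device */
};

struct rst_mapping_item {
	__le64 logical;		/* logical address covered by this item */
	__le64 length;
	/* one stride per device taking part in this write */
	struct rst_stride strides[];
};

A new write of (ddddddd) and (PPPPPPP) would then insert new items pointing
at new physical locations, while the old items keep pointing at the old,
still-consistent copies.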

Thanks,
Qu
waxhead May 26, 2022, 9:06 a.m. UTC | #9
Qu Wenruo wrote:
> 
> 
> On 2022/5/25 17:26, Christoph Hellwig wrote:
>> On Wed, May 25, 2022 at 05:13:11PM +0800, Qu Wenruo wrote:
>>> The problem is, we can have partial write for RAID56, no matter if we
>>> use NODATACOW or not.
>>>
>>> For example, we have a very typical 3 disks RAID5:
>>>
>>>     0    32K    64K
>>> Disk 1  |DDDDDDD|       |
>>> Disk 2  |ddddddd|ddddddd|
>>> Disk 3  |PPPPPPP|PPPPPPP|
>>>
>>>
>>> D = old data, it's there for a while.
>>> d = new data, we want to write.
>>
>> Oh.  I keep forgetting that the striping is entirely on the physіcal
>> block basis and not logic block basis.  Which makes the whole idea
>> of btrfs integrated raid5/6 not all that useful compared to just using
>> mdraid :(
> 
> Yep, that's why I have to go the old journal way.
> 
> But you may want to explore the super awesome idea of raid stripe tree
> from Johannes.
> 
> The idea is we introduce a new layer of logical addr -> internal mapping
> -> physical addr.
> By that, we get rid of the strict physical address requirement.
> 
> And when we update the new stripe, we just insert two new mapping for
> (dddd), and two new mapping for the new (PPPPP).
> 
> If power loss happen, we still see the old internal mapping, and can get
> the correct recovery.
> 
> But it still seems to have a lot of things to resolve for now.
> 
> Thanks,
> Qu

I am just a humble BTRFS user, and while I think the journaled approach
sounds super interesting I believe that the stripe tree sounds like the
better solution in the long run.

Is it really such a good idea to add a (potentially temporary) journaled
raid mode if the stripe tree version really is better? What about Josef
Bacik's extent tree v2? Would that fit better with the stripe tree /
would it cause problems with the journaled mode?

As a regular user I think that adding another raid56 mode may be
confusing, especially for people that do not understand how things work
(which absolutely sometimes includes me too). Quite some BTRFS use also
happens outside the datacenter, and it is regular Joe and co. who complain
the most when they screw up, which to some extent prevents adoption on
non-stellar hardware, adoption which again would/could lead to bug reports
and a better filesystem in the long run. So therefore:

If the standard raid56 mode is unstable and discouraged from use, would it
not be better to sneakily drop it once and for all, e.g. just make it
so that new filesystems created with raid56 automatically use the new
(and better) raid56j mode? Effectively preventing users from making
filesystems with the "bad" raid56 after a certain btrfs-progs version?

This way the raid56 code would seem to be fixed albeit getting slower 
(as I understand it), but the number of configurations available is not 
overwhelming for us regular people.

PS! I understand that I sound like I am not too keen on the new raid56j
mode, which is sort of true, but that does not mean that I am ungrateful
for it :)
Qu Wenruo May 26, 2022, 9:26 a.m. UTC | #10
On 2022/5/26 17:06, waxhead wrote:
> Qu Wenruo wrote:
>>
>>
>> On 2022/5/25 17:26, Christoph Hellwig wrote:
>>> On Wed, May 25, 2022 at 05:13:11PM +0800, Qu Wenruo wrote:
>>>> The problem is, we can have partial write for RAID56, no matter if we
>>>> use NODATACOW or not.
>>>>
>>>> For example, we have a very typical 3 disks RAID5:
>>>>
>>>>     0    32K    64K
>>>> Disk 1  |DDDDDDD|       |
>>>> Disk 2  |ddddddd|ddddddd|
>>>> Disk 3  |PPPPPPP|PPPPPPP|
>>>>
>>>>
>>>> D = old data, it's there for a while.
>>>> d = new data, we want to write.
>>>
>>> Oh.  I keep forgetting that the striping is entirely on the physіcal
>>> block basis and not logic block basis.  Which makes the whole idea
>>> of btrfs integrated raid5/6 not all that useful compared to just using
>>> mdraid :(
>>
>> Yep, that's why I have to go the old journal way.
>>
>> But you may want to explore the super awesome idea of raid stripe tree
>> from Johannes.
>>
>> The idea is we introduce a new layer of logical addr -> internal mapping
>> -> physical addr.
>> By that, we get rid of the strict physical address requirement.
>>
>> And when we update the new stripe, we just insert two new mapping for
>> (dddd), and two new mapping for the new (PPPPP).
>>
>> If power loss happen, we still see the old internal mapping, and can get
>> the correct recovery.
>>
>> But it still seems to have a lot of things to resolve for now.
>>
>> Thanks,
>> Qu
>
> I am just a humble BTRFS user and while I think the journaled approach
> sounds superinteresting I believe that the stripe tree sounds like the
> better solution in the long run.
>
> Is it really such a good idea to add a (potentially temporary) journaled
> raid mode if the stripe tree version really is better?

A journal is simpler to implement, and has been tried and true for a long
time, although the on-disk format change is unavoidable.

Another problem is, for now we don't have a good idea of whether it's even
possible to use the stripe tree for metadata.
(And the stripe tree is still at an early stage.)

Sure, forcing RAID10/RAID1C* on metadata would be acceptable for most
users, but it's still something to take into consideration.

> What about Josef
> Bacik's extent tree v2 ? Would that fit better with the stripe tree /
> would it cause problems with the journaled mode?

I don't believe extent tree v2 would affect RAID56J at all.

Not 100% sure about RST (raid stripe tree), but from the initial
impression, some tricks from extent tree v2 may help RST.

>
> As a regular user I think that adding another raid56 mode may be
> confusing, especially for people that do not understand how things work
> (which absolutely sometimes includes me too), Quite some BTRFS use is
> also done outside the datacenter, and it is regular joe and co. that
> complains the most when they screw up, which to some extent prevents
> adoption on non-stellar hardware which again would/could lead to
> bugreorts and a better filesystem in the long run. So therefore:
>
> If the standard raid56 mode is unstable and discouraged to use, would it
> not be better to sneakily drop that once and for all e.g. just make it
> so that new filesystems created with raid56 automatically uses the new
> (and better) raid56j mode? Effectively preventing users from making
> filesystems with the "bad" raid56 after a certain btrfs-progs version?

Deprecation needs time, and RAID56J is unfortunately not a drop-in
replacement; it needs an on-disk format change, and introduces new RAID profiles.

If the code is finished and properly tested (through several kernel
releases), we may switch all raid56 to raid56J in mkfs.btrfs and balance
(aka, the balance profile raid56j becomes the default one for raid56).

For RST, it's harder to say with confidence now, a lot of things are not
yet determined...

Thanks,
Qu

>
> This way the raid56 code would seem to be fixed albeit getting slower
> (as I understand it), but the number of configurations available is not
> overwhelming for us regular people.
>
> PS! I understand that I sound like I am not to keen on the new raid56j
> mode which is sort of true, but that does not mean that I am ungrateful
> for it :)
Goffredo Baroncelli May 26, 2022, 3:30 p.m. UTC | #11
On 26/05/2022 11.26, Qu Wenruo wrote:
[...]
> 
> Deprecation needs time, and RAID56J is not a drop-in replacement
> unfortunately, it needs on-disk format change, and is new RAID profiles.
> 
IIRC, the original idea was to dedicate a separate disk to the journal.
Instead of a dedicated disk, we could create a new chunk type to store the journal.
The main advantage is that the data/metadata are still stored in the "old"
raid5/6 chunk type. And only the journal is stored in the new chunk type.

This would simplify the forward compatibility (you can leave the already existing raid5/6 chunks, and enable the journal mode only if the journal chunk exists).

And even the backward compatibility is simple: you need only to delete the journal chunk (via an external tool ?)

However I understand that at this phase of development nobody wants to introduce such a big change.

> If the code is finished and properly tested (through several kernel
> released), we may switch all raid56 to raid56J in mkfs.btrfs and balance
> (aka, balance profile raid56j becomes the default one for raid56).

If a balance is enough, I think that this can be considered a "drop-in" replacement.

> 
> For RST, it's harder to say with confidence now, a lot of things are not
> yet determined...
> 
> Thanks,
> Qu
> 
>>
>> This way the raid56 code would seem to be fixed albeit getting slower
>> (as I understand it), but the number of configurations available is not
>> overwhelming for us regular people.
>>
>> PS! I understand that I sound like I am not to keen on the new raid56j
>> mode which is sort of true, but that does not mean that I am ungrateful
>> for it :)
David Sterba May 26, 2022, 4:10 p.m. UTC | #12
On Thu, May 26, 2022 at 05:30:16PM +0200, Goffredo Baroncelli wrote:
> On 26/05/2022 11.26, Qu Wenruo wrote:
> [...]
> > 
> > Deprecation needs time, and RAID56J is not a drop-in replacement
> > unfortunately, it needs on-disk format change, and is new RAID profiles.
> > 
> IIRC, the original idea was to dedicate a separate disk to the journal.

And was rejected, as it would be a single point of failure for a drive
setup that must survive the failure of 1 or 2 devices.
Wang Yugui June 1, 2022, 2:06 a.m. UTC | #13
Hi,

> This is the draft version of the on-disk format for RAID56J journal.
> 
> The overall idea is, we have the following elements:
> 
> 1) A fixed header
>    Recording things like if the journal is clean or dirty, and how many
>    entries it has.
> 
> 2) One or at most 127 entries
>    Each entry will point to a range of data in the per-device reserved
>    range.

Can we put this journal in a device just like 'mke2fs -O journal_dev'
or 'mkfs.xfs -l logdev'?

A fast & small journal device may help the performance.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/01
Qu Wenruo June 1, 2022, 2:13 a.m. UTC | #14
On 2022/6/1 10:06, Wang Yugui wrote:
> Hi,
> 
>> This is the draft version of the on-disk format for RAID56J journal.
>>
>> The overall idea is, we have the following elements:
>>
>> 1) A fixed header
>>     Recording things like if the journal is clean or dirty, and how many
>>     entries it has.
>>
>> 2) One or at most 127 entries
>>     Each entry will point to a range of data in the per-device reserved
>>     range.
> 
> Can we put this journal in a device just like ' mke2fs -O journal_dev'
> or 'mkfs.xfs -l logdev'?
> 
> A fast & small journal device may help the performance.

Then that lacks the ability to lose one device.

The journal device must be there no matter what.

Furthermore, this will still need an on-disk format change for a special
type of device.

Thanks,
Qu
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/01
>
Wang Yugui June 1, 2022, 2:25 a.m. UTC | #15
Hi,

> On 2022/6/1 10:06, Wang Yugui wrote:
> > Hi,
> >
> >> This is the draft version of the on-disk format for RAID56J journal.
> >>
> >> The overall idea is, we have the following elements:
> >>
> >> 1) A fixed header
> >>     Recording things like if the journal is clean or dirty, and how many
> >>     entries it has.
> >>
> >> 2) One or at most 127 entries
> >>     Each entry will point to a range of data in the per-device reserved
> >>     range.
> >
> > Can we put this journal in a device just like 'mke2fs -O journal_dev'
> > or 'mkfs.xfs -l logdev'?
> >
> > A fast & small journal device may help the performance.
> 
> Then that lacks the ability to lose one device.
> 
> The journal device must be there no matter what.
> 
> Furthermore, this will still need a on-disk format change for a special type of device.

If we save the journal on every RAID56 HDD, it will always be very slow,
because the journal data is in a different place than normal data, so HDD
seeks will always happen?

If we save the journal on a device just like 'mke2fs -O journal_dev' or
'mkfs.xfs -l logdev', then this device just works like an NVDIMM?  We may
not need RAID56/RAID1 for the journal data.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/01
Qu Wenruo June 1, 2022, 2:55 a.m. UTC | #16
On 2022/6/1 10:25, Wang Yugui wrote:
> Hi,
>
>> On 2022/6/1 10:06, Wang Yugui wrote:
>>> Hi,
>>>
>>>> This is the draft version of the on-disk format for RAID56J journal.
>>>>
>>>> The overall idea is, we have the following elements:
>>>>
>>>> 1) A fixed header
>>>>      Recording things like if the journal is clean or dirty, and how many
>>>>      entries it has.
>>>>
>>>> 2) One or at most 127 entries
>>>>      Each entry will point to a range of data in the per-device reserved
>>>>      range.
>>>
>>> Can we put this journal in a device just like 'mke2fs -O journal_dev'
>>> or 'mkfs.xfs -l logdev'?
>>>
>>> A fast & small journal device may help the performance.
>>
>> Then that lacks the ability to lose one device.
>>
>> The journal device must be there no matter what.
>>
>> Furthermore, this will still need a on-disk format change for a special type of device.
>
> If we save journal on every RAID56 HDD, it will always be very slow,
> because journal data is in a different place than normal data, so HDD
> seek is always happen?
>
> If we save journal on a device just like 'mke2fs -O journal_dev' or 'mkfs.xfs
> -l logdev', then this device just works like NVDIMM?  We may not need
> RAID56/RAID1 for journal data.

That device is a single point of failure. If you lose that device, the
write hole comes back.

RAID56 can tolerate one or two device failures for sure.
Thus a single point of failure is against RAID56.


If one is not bothered by the write hole, then they don't need any
journal at all.

Thanks,
Qu
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/01
>
>
Wang Yugui June 1, 2022, 9:07 a.m. UTC | #17
Hi,

> On 2022/6/1 10:25, Wang Yugui wrote:
> > Hi,
> >
> >> On 2022/6/1 10:06, Wang Yugui wrote:
> >>> Hi,
> >>>
> >>>> This is the draft version of the on-disk format for RAID56J journal.
> >>>>
> >>>> The overall idea is, we have the following elements:
> >>>>
> >>>> 1) A fixed header
> >>>>      Recording things like if the journal is clean or dirty, and how many
> >>>>      entries it has.
> >>>>
> >>>> 2) One or at most 127 entries
> >>>>      Each entry will point to a range of data in the per-device reserved
> >>>>      range.
> >>>
> >>> Can we put this journal in a device just like 'mke2fs -O journal_dev'
> >>> or 'mkfs.xfs -l logdev'?
> >>>
> >>> A fast & small journal device may help the performance.
> >>
> >> Then that lacks the ability to lose one device.
> >>
> >> The journal device must be there no matter what.
> >>
> >> Furthermore, this will still need a on-disk format change for a special type of device.
> >
> > If we save journal on every RAID56 HDD, it will always be very slow,
> > because journal data is in a different place than normal data, so HDD
> > seek is always happen?
> >
> > If we save journal on a device just like 'mke2fs -O journal_dev' or 'mkfs.xfs
> > -l logdev', then this device just works like NVDIMM?  We may not need
> > RAID56/RAID1 for journal data.
> 
> That device is the single point of failure. You lost that device, write
> hole come again.

The HW RAID card has a 'single point of failure' too, such as the NVDIMM
inside the HW RAID card.

but power-loss frequency > hdd failure frequency > NVDIMM/ssd failure
frequency

so it still helps a lot.

> RAID56 can tolerant one or two device failures for sure.
> Thus one point failure is against RAID56.
> 
> 
> If one is not bothered with writehole, then they doesn't need any
> journal at all.

I thought 'degraded read-only' will help more cases than 'degraded
read-write' with the write hole.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/01
Qu Wenruo June 1, 2022, 9:27 a.m. UTC | #18
On 2022/6/1 17:07, Wang Yugui wrote:
> Hi,
>
>> On 2022/6/1 10:25, Wang Yugui wrote:
>>> Hi,
>>>
>>>> On 2022/6/1 10:06, Wang Yugui wrote:
>>>>> Hi,
>>>>>
>>>>>> This is the draft version of the on-disk format for RAID56J journal.
>>>>>>
>>>>>> The overall idea is, we have the following elements:
>>>>>>
>>>>>> 1) A fixed header
>>>>>>       Recording things like if the journal is clean or dirty, and how many
>>>>>>       entries it has.
>>>>>>
>>>>>> 2) One or at most 127 entries
>>>>>>       Each entry will point to a range of data in the per-device reserved
>>>>>>       range.
>>>>>
>>>>> Can we put this journal in a device just like 'mke2fs -O journal_dev'
>>>>> or 'mkfs.xfs -l logdev'?
>>>>>
>>>>> A fast & small journal device may help the performance.
>>>>
>>>> Then that lacks the ability to lose one device.
>>>>
>>>> The journal device must be there no matter what.
>>>>
>>>> Furthermore, this will still need a on-disk format change for a special type of device.
>>>
>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>> because journal data is in a different place than normal data, so HDD
>>> seek is always happen?
>>>
>>> If we save journal on a device just like 'mke2fs -O journal_dev' or 'mkfs.xfs
>>> -l logdev', then this device just works like NVDIMM?  We may not need
>>> RAID56/RAID1 for journal data.
>>
>> That device is the single point of failure. You lost that device, write
>> hole come again.
>
> The HW RAID card have 'single point of failure'  too, such as the NVDIMM
> inside HW RAID card.
>
> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd failure
> frequency

It's a completely different level.

For btrfs RAID, we have no special treatment for any disk.
And our RAID is focused on ensuring device tolerance.

In your RAID card case, indeed the failure rate of the card is much lower.
In the journal device case, how do you ensure that the possibility of the
journal device going missing is way lower than that of all the other devices?

So this doesn't make sense, unless you put the journal on something that is
definitely not a regular disk.

I don't believe this benefits most users.
Just consider how many regular people use a dedicated journal device for
XFS/EXT4 on top of md/dm RAID56.

>
> so It still help a lot.
>
>> RAID56 can tolerant one or two device failures for sure.
>> Thus one point failure is against RAID56.
>>
>>
>> If one is not bothered with writehole, then they doesn't need any
>> journal at all.
>
> I though 'degraded read-only' will help more case than 'degraded
> read-write' with writehole.

I don't get what you're talking about here.

Thanks,
Qu

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/01
>
>
Paul Jones June 1, 2022, 9:56 a.m. UTC | #19
> -----Original Message-----
> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
> Sent: Wednesday, 1 June 2022 7:27 PM
> To: Wang Yugui <wangyugui@e16-tech.com>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
> 
> 

> >>> If we save journal on every RAID56 HDD, it will always be very slow,
> >>> because journal data is in a different place than normal data, so
> >>> HDD seek is always happen?
> >>>
> >>> If we save journal on a device just like 'mke2fs -O journal_dev' or
> >>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
> >>> may not need
> >>> RAID56/RAID1 for journal data.
> >>
> >> That device is the single point of failure. You lost that device,
> >> write hole come again.
> >
> > The HW RAID card have 'single point of failure'  too, such as the
> > NVDIMM inside HW RAID card.
> >
> > but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
> > failure frequency
> 
> It's a completely different level.
> 
> For btrfs RAID, we have no special treat for any disk.
> And our RAID is focusing on ensuring device tolerance.
> 
> In your RAID card case, indeed the failure rate of the card is much lower.
> In journal device case, how do you ensure it's still true that the journal device
> missing possibility is way lower than all the other devices?
> 
> So this doesn't make sense, unless you introduce the journal to something
> definitely not a regular disk.
> 
> I don't believe this benefit most users.
> Just consider how many regular people use dedicated journal device for
> XFS/EXT4 upon md/dm RAID56.

A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.

As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in its current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.

Paul.
Qu Wenruo June 1, 2022, 10:12 a.m. UTC | #20
On 2022/6/1 17:56, Paul Jones wrote:
>
>> -----Original Message-----
>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>> Sent: Wednesday, 1 June 2022 7:27 PM
>> To: Wang Yugui <wangyugui@e16-tech.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>
>>
>
>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>> because journal data is in a different place than normal data, so
>>>>> HDD seek is always happen?
>>>>>
>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>> may not need
>>>>> RAID56/RAID1 for journal data.
>>>>
>>>> That device is the single point of failure. You lost that device,
>>>> write hole come again.
>>>
>>> The HW RAID card have 'single point of failure'  too, such as the
>>> NVDIMM inside HW RAID card.
>>>
>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>> failure frequency
>>
>> It's a completely different level.
>>
>> For btrfs RAID, we have no special treat for any disk.
>> And our RAID is focusing on ensuring device tolerance.
>>
>> In your RAID card case, indeed the failure rate of the card is much lower.
>> In journal device case, how do you ensure it's still true that the journal device
>> missing possibility is way lower than all the other devices?
>>
>> So this doesn't make sense, unless you introduce the journal to something
>> definitely not a regular disk.
>>
>> I don't believe this benefit most users.
>> Just consider how many regular people use dedicated journal device for
>> XFS/EXT4 upon md/dm RAID56.
>
> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
>
> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.

In fact, since the write hole is only a problem for power loss (and explicit
degraded writes), another solution is to only record whether the fs was
gracefully closed.

If the fs was not gracefully closed (indicated by a bit in the superblock),
then we just trigger a full scrub on all existing RAID56 block groups.

This should solve the problem, with the extra cost of a slow scrub for
each unclean shutdown.

To be extra safe, during that scrub run, we really want the user to wait for
the scrub to finish.

But on the other hand, I totally understand users won't be happy to wait
for 10+ hours just due to an unclean shutdown...
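
As a stand-alone model of that idea (the flag name and the helpers are made
up for illustration, not existing btrfs symbols):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical "RAID56 dirty" bit, not an existing btrfs superblock flag. */
#define RAID56_DIRTY	(1ULL << 0)

struct sb_model {
	uint64_t flags;		/* stand-in for the superblock flags */
};

static void mount_fs(struct sb_model *sb)
{
	if (sb->flags & RAID56_DIRTY)
		printf("unclean shutdown: scrub all RAID56 block groups\n");
	/* Mark dirty while mounted; cleared again on graceful unmount. */
	sb->flags |= RAID56_DIRTY;
}

static void unmount_fs(struct sb_model *sb)
{
	/* Graceful close: clear the bit so the next mount can skip the scrub. */
	sb->flags &= ~RAID56_DIRTY;
}

int main(void)
{
	struct sb_model sb = { .flags = RAID56_DIRTY };	/* crashed last time */

	mount_fs(&sb);		/* triggers the full RAID56 scrub */
	unmount_fs(&sb);
	mount_fs(&sb);		/* clean this time, no scrub needed */
	return 0;
}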

Thanks,
Qu

>
> Paul.
Qu Wenruo June 1, 2022, 12:21 p.m. UTC | #21
On 2022/6/1 17:56, Paul Jones wrote:
>
>> -----Original Message-----
>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>> Sent: Wednesday, 1 June 2022 7:27 PM
>> To: Wang Yugui <wangyugui@e16-tech.com>
>> Cc: linux-btrfs@vger.kernel.org
>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>
>>
>
>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>> because journal data is in a different place than normal data, so
>>>>> HDD seek is always happen?
>>>>>
>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>> may not need
>>>>> RAID56/RAID1 for journal data.
>>>>
>>>> That device is the single point of failure. You lost that device,
>>>> write hole come again.
>>>
>>> The HW RAID card have 'single point of failure'  too, such as the
>>> NVDIMM inside HW RAID card.
>>>
>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>> failure frequency
>>
>> It's a completely different level.
>>
>> For btrfs RAID, we have no special treat for any disk.
>> And our RAID is focusing on ensuring device tolerance.
>>
>> In your RAID card case, indeed the failure rate of the card is much lower.
>> In journal device case, how do you ensure it's still true that the journal device
>> missing possibility is way lower than all the other devices?
>>
>> So this doesn't make sense, unless you introduce the journal to something
>> definitely not a regular disk.
>>
>> I don't believe this benefit most users.
>> Just consider how many regular people use dedicated journal device for
>> XFS/EXT4 upon md/dm RAID56.
>
> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.

After more consideration, it looks like it's indeed better.

Although we break the guarantee on bad devices, if the journal device is
missing, we just fall back to the old RAID56 behavior.

It's not the best situation, but we still keep all the content we already have.
The problem is that for future writes, we may degrade the recovery ability
byte by byte.

Thanks,
Qu
>
> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.
>
> Paul.
Robert Krig June 1, 2022, 2:55 p.m. UTC | #22
I guess you guys are probably aware, but I thought I'd mention it 
anyway. With ZFS for example you can create mirrored log or cache disks, 
using either whole disks or just partitions.

Wouldn't a mirrored journal device remove the single point of failure? 
If you had the optional capability to create a raid1 journal on two 
disks (let's assume SSDs or NVMEs).



On 01.06.22 14:21, Qu Wenruo wrote:
>
>
> On 2022/6/1 17:56, Paul Jones wrote:
>>
>>> -----Original Message-----
>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>> Cc: linux-btrfs@vger.kernel.org
>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>>
>>>
>>
>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>>> because journal data is in a different place than normal data, so
>>>>>> HDD seek is always happen?
>>>>>>
>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>>> may not need
>>>>>> RAID56/RAID1 for journal data.
>>>>>
>>>>> That device is the single point of failure. You lost that device,
>>>>> write hole come again.
>>>>
>>>> The HW RAID card have 'single point of failure'  too, such as the
>>>> NVDIMM inside HW RAID card.
>>>>
>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>>> failure frequency
>>>
>>> It's a completely different level.
>>>
>>> For btrfs RAID, we have no special treat for any disk.
>>> And our RAID is focusing on ensuring device tolerance.
>>>
>>> In your RAID card case, indeed the failure rate of the card is much 
>>> lower.
>>> In journal device case, how do you ensure it's still true that the 
>>> journal device
>>> missing possibility is way lower than all the other devices?
>>>
>>> So this doesn't make sense, unless you introduce the journal to 
>>> something
>>> definitely not a regular disk.
>>>
>>> I don't believe this benefit most users.
>>> Just consider how many regular people use dedicated journal device for
>>> XFS/EXT4 upon md/dm RAID56.
>>
>> A good solid state drive should be far less error prone than spinning 
>> drives, so would be a good candidate. Not perfect, but better.
>
> After more consideration, it looks like it's indeed better.
>
> Although we break the guarantee on bad devices, if the journal device is
> missing, we just fall back to the old RAID56 behavior.
>
> It's not the best situation, but we still have all the content we have.
> The problem is for future write, we may degrade the recovery ability
> bytes by bytes.
>
> Thanks,
> Qu
>>
>> As an end user I think focusing on stability and recovery tools is a 
>> better use of time than fixing the write hole, as I wouldn't even 
>> consider using Raid56 in it's current state. The write hole problem 
>> can be alleviated by a UPS and not using Raid56 for a busy write 
>> load. It's still good to brainstorm the issue though, as it will need 
>> solving eventually.
>>
>> Paul.
Martin Raiber June 1, 2022, 6:49 p.m. UTC | #23
On 01.06.2022 12:12 Qu Wenruo wrote:
>
>
> On 2022/6/1 17:56, Paul Jones wrote:
>>
>>> -----Original Message-----
>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>> Cc: linux-btrfs@vger.kernel.org
>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>>
>>>
>>
>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>>> because journal data is in a different place than normal data, so
>>>>>> HDD seek is always happen?
>>>>>>
>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>>> may not need
>>>>>> RAID56/RAID1 for journal data.
>>>>>
>>>>> That device is the single point of failure. You lost that device,
>>>>> write hole come again.
>>>>
>>>> The HW RAID card have 'single point of failure'  too, such as the
>>>> NVDIMM inside HW RAID card.
>>>>
>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>>> failure frequency
>>>
>>> It's a completely different level.
>>>
>>> For btrfs RAID, we have no special treat for any disk.
>>> And our RAID is focusing on ensuring device tolerance.
>>>
>>> In your RAID card case, indeed the failure rate of the card is much lower.
>>> In journal device case, how do you ensure it's still true that the journal device
>>> missing possibility is way lower than all the other devices?
>>>
>>> So this doesn't make sense, unless you introduce the journal to something
>>> definitely not a regular disk.
>>>
>>> I don't believe this benefit most users.
>>> Just consider how many regular people use dedicated journal device for
>>> XFS/EXT4 upon md/dm RAID56.
>>
>> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
>>
>> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.
>
> In fact, since write hole is only a problem for power loss (and explicit
> degraded write), another solution is, only record if the fs is
> gracefully closed.
>
> If the fs is not gracefully closed (by a bit in superblock), then we
> just trigger a full scrub on all existing RAID56 block groups.
>
> This should solve the problem, with the extra cost of slow scrub for
> each unclean shutdown.
>
> To be extra safe, during that scrub run, we really want user to wait for
> the scrub to finish.
>
> But on the other hand, I totally understand user won't be happy to wait
> for 10+ hours just due to a unclean shutdown...
Would it be possible to put the stripe offsets/numbers into a journal and commit them before the write? Then, during mount you could scrub only those after an unclean shutdown.
>
> Thanks,
> Qu
>
>>
>> Paul.
Qu Wenruo June 1, 2022, 9:37 p.m. UTC | #24
On 2022/6/2 02:49, Martin Raiber wrote:
> On 01.06.2022 12:12 Qu Wenruo wrote:
>>
>>
>> On 2022/6/1 17:56, Paul Jones wrote:
>>>
>>>> -----Original Message-----
>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>>> Cc: linux-btrfs@vger.kernel.org
>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>>>
>>>>
>>>
>>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>>>> because journal data is in a different place than normal data, so
>>>>>>> HDD seek is always happen?
>>>>>>>
>>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>>>> may not need
>>>>>>> RAID56/RAID1 for journal data.
>>>>>>
>>>>>> That device is the single point of failure. You lost that device,
>>>>>> write hole come again.
>>>>>
>>>>> The HW RAID card have 'single point of failure'  too, such as the
>>>>> NVDIMM inside HW RAID card.
>>>>>
>>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>>>> failure frequency
>>>>
>>>> It's a completely different level.
>>>>
>>>> For btrfs RAID, we have no special treat for any disk.
>>>> And our RAID is focusing on ensuring device tolerance.
>>>>
>>>> In your RAID card case, indeed the failure rate of the card is much lower.
>>>> In journal device case, how do you ensure it's still true that the journal device
>>>> missing possibility is way lower than all the other devices?
>>>>
>>>> So this doesn't make sense, unless you introduce the journal to something
>>>> definitely not a regular disk.
>>>>
>>>> I don't believe this benefit most users.
>>>> Just consider how many regular people use dedicated journal device for
>>>> XFS/EXT4 upon md/dm RAID56.
>>>
>>> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
>>>
>>> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.
>>
>> In fact, since write hole is only a problem for power loss (and explicit
>> degraded write), another solution is, only record if the fs is
>> gracefully closed.
>>
>> If the fs is not gracefully closed (by a bit in superblock), then we
>> just trigger a full scrub on all existing RAID56 block groups.
>>
>> This should solve the problem, with the extra cost of slow scrub for
>> each unclean shutdown.
>>
>> To be extra safe, during that scrub run, we really want user to wait for
>> the scrub to finish.
>>
>> But on the other hand, I totally understand user won't be happy to wait
>> for 10+ hours just due to a unclean shutdown...
> Would it be possible to put the stripe offsets/numbers into a journal/commit them before write? Then, during mount you could scrub only those after an unclean shutdown.

If we go that path, we can already do a full journal, and only replay that
journal without the need for a scrub at all.

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
>>>
>>> Paul.
>
>
Lukas Straub June 3, 2022, 9:32 a.m. UTC | #25
On Thu, 2 Jun 2022 05:37:11 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> On 2022/6/2 02:49, Martin Raiber wrote:
> > On 01.06.2022 12:12 Qu Wenruo wrote:  
> >>
> >>
> >> On 2022/6/1 17:56, Paul Jones wrote:  
> >>>  
> >>>> -----Original Message-----
> >>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
> >>>> Sent: Wednesday, 1 June 2022 7:27 PM
> >>>> To: Wang Yugui <wangyugui@e16-tech.com>
> >>>> Cc: linux-btrfs@vger.kernel.org
> >>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
> >>>>
> >>>>  
> >>>  
> >>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
> >>>>>>> because journal data is in a different place than normal data, so
> >>>>>>> HDD seek is always happen?
> >>>>>>>
> >>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
> >>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
> >>>>>>> may not need
> >>>>>>> RAID56/RAID1 for journal data.  
> >>>>>>
> >>>>>> That device is the single point of failure. You lost that device,
> >>>>>> write hole come again.  
> >>>>>
> >>>>> The HW RAID card have 'single point of failure'  too, such as the
> >>>>> NVDIMM inside HW RAID card.
> >>>>>
> >>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
> >>>>> failure frequency  
> >>>>
> >>>> It's a completely different level.
> >>>>
> >>>> For btrfs RAID, we have no special treat for any disk.
> >>>> And our RAID is focusing on ensuring device tolerance.
> >>>>
> >>>> In your RAID card case, indeed the failure rate of the card is much lower.
> >>>> In journal device case, how do you ensure it's still true that the journal device
> >>>> missing possibility is way lower than all the other devices?
> >>>>
> >>>> So this doesn't make sense, unless you introduce the journal to something
> >>>> definitely not a regular disk.
> >>>>
> >>>> I don't believe this benefit most users.
> >>>> Just consider how many regular people use dedicated journal device for
> >>>> XFS/EXT4 upon md/dm RAID56.  
> >>>
> >>> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
> >>>
> >>> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.  
> >>
> >> In fact, since write hole is only a problem for power loss (and explicit
> >> degraded write), another solution is, only record if the fs is
> >> gracefully closed.
> >>
> >> If the fs is not gracefully closed (by a bit in superblock), then we
> >> just trigger a full scrub on all existing RAID56 block groups.
> >>
> >> This should solve the problem, with the extra cost of slow scrub for
> >> each unclean shutdown.
> >>
> >> To be extra safe, during that scrub run, we really want user to wait for
> >> the scrub to finish.
> >>
> >> But on the other hand, I totally understand user won't be happy to wait
> >> for 10+ hours just due to a unclean shutdown...  
> > Would it be possible to put the stripe offsets/numbers into a journal/commit them before write? Then, during mount you could scrub only those after an unclean shutdown.  
> 
> If we go that path, we can already do full journal, and only replay that
> journal without the need for scrub at all.

Hello Qu,

If you don't care about the write-hole, you can also use a dirty bitmap
like mdraid 5/6 does. There, one bit in the bitmap represents for
example one gigabyte of the disk that _may_ be dirty, and the bit is left
dirty for a while and doesn't need to be set for each write. Or you
could do a per-block-group dirty bit.
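
For illustration, a toy model of such a coarse bitmap (granularity and
names are just examples, not how mdraid or btrfs actually store it):

#include <stdint.h>
#include <stdio.h>

#define REGION_SHIFT	30			/* one bit per 1 GiB region */
#define NR_REGIONS	64			/* model a 64 GiB device */

static uint64_t dirty_bitmap;			/* one bit per region */

/* Set the bit covering dev_offset before writing into that region. */
static void mark_region_dirty(uint64_t dev_offset)
{
	dirty_bitmap |= 1ULL << (dev_offset >> REGION_SHIFT);
}

/* What a crash recovery would scan instead of the whole device. */
static void scrub_dirty_regions(void)
{
	for (int i = 0; i < NR_REGIONS; i++)
		if (dirty_bitmap & (1ULL << i))
			printf("scrub region %d (offset %llu)\n", i,
			       (unsigned long long)i << REGION_SHIFT);
}

int main(void)
{
	mark_region_dirty(5ULL << 30);		/* write lands in region 5 */
	mark_region_dirty((5ULL << 30) + 4096);	/* same region, bit already set */
	scrub_dirty_regions();
	return 0;
}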

And while you're at it, add the same mechanism to all the other raid
and dup modes to fix the inconsistency of NOCOW files after a crash.

Regards,
Lukas Straub

> Thanks,
> Qu
> 
> >>
> >> Thanks,
> >> Qu
> >>  
> >>>
> >>> Paul.  
> >
> >  



--
Qu Wenruo June 3, 2022, 9:59 a.m. UTC | #26
On 2022/6/3 17:32, Lukas Straub wrote:
> On Thu, 2 Jun 2022 05:37:11 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>> On 2022/6/2 02:49, Martin Raiber wrote:
>>> On 01.06.2022 12:12 Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/6/1 17:56, Paul Jones wrote:
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>>>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>>>>> Cc: linux-btrfs@vger.kernel.org
>>>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>>>>>
>>>>>>
>>>>>
>>>>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>>>>>> because journal data is in a different place than normal data, so
>>>>>>>>> HDD seek is always happen?
>>>>>>>>>
>>>>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>>>>>> may not need
>>>>>>>>> RAID56/RAID1 for journal data.
>>>>>>>>
>>>>>>>> That device is the single point of failure. You lost that device,
>>>>>>>> write hole come again.
>>>>>>>
>>>>>>> The HW RAID card have 'single point of failure'  too, such as the
>>>>>>> NVDIMM inside HW RAID card.
>>>>>>>
>>>>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>>>>>> failure frequency
>>>>>>
>>>>>> It's a completely different level.
>>>>>>
>>>>>> For btrfs RAID, we have no special treat for any disk.
>>>>>> And our RAID is focusing on ensuring device tolerance.
>>>>>>
>>>>>> In your RAID card case, indeed the failure rate of the card is much lower.
>>>>>> In journal device case, how do you ensure it's still true that the journal device
>>>>>> missing possibility is way lower than all the other devices?
>>>>>>
>>>>>> So this doesn't make sense, unless you introduce the journal to something
>>>>>> definitely not a regular disk.
>>>>>>
>>>>>> I don't believe this benefit most users.
>>>>>> Just consider how many regular people use dedicated journal device for
>>>>>> XFS/EXT4 upon md/dm RAID56.
>>>>>
>>>>> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
>>>>>
>>>>> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.
>>>>
>>>> In fact, since write hole is only a problem for power loss (and explicit
>>>> degraded write), another solution is, only record if the fs is
>>>> gracefully closed.
>>>>
>>>> If the fs is not gracefully closed (by a bit in superblock), then we
>>>> just trigger a full scrub on all existing RAID56 block groups.
>>>>
>>>> This should solve the problem, with the extra cost of slow scrub for
>>>> each unclean shutdown.
>>>>
>>>> To be extra safe, during that scrub run, we really want user to wait for
>>>> the scrub to finish.
>>>>
>>>> But on the other hand, I totally understand user won't be happy to wait
>>>> for 10+ hours just due to a unclean shutdown...
>>> Would it be possible to put the stripe offsets/numbers into a journal/commit them before write? Then, during mount you could scrub only those after an unclean shutdown.
>>
>> If we go that path, we can already do full journal, and only replay that
>> journal without the need for scrub at all.
>
> Hello Qu,
>
> If you don't care about the write-hole, you can also use a dirty bitmap
> like mdraid 5/6 does. There, one bit in the bitmap represents for
> example one gigabyte of the disk that _may_ be dirty, and the bit is left
> dirty for a while and doesn't need to be set for each write. Or you
> could do a per-block-group dirty bit.

That would be a pretty good way to do an auto scrub after a dirty close.

Currently we have quite a few different ideas; some are pretty similar,
just at different sides of a spectrum:

     Easier to implement        ..     Harder to implement
|<- More on mount time scrub   ..     More on journal ->|
|					|	|	\- Full journal
|					|	\--- Per bg dirty bitmap
|					\----------- Per bg dirty flag
\--------------------------------------------------- Per sb dirty flag

In fact, the dirty bitmap is just a simplified version of a journal (only
recording the metadata, without the data).
Unlike dm/dm-raid56, with btrfs scrub we should be able to fully
recover the data without problem.

Even with a per-bg dirty bitmap, we still need some extra location to
record the bitmap. Thus it needs an on-disk format change anyway.

Currently only the sb dirty flag may be backward compatible.

And whether we should wait for the scrub to finish before allowing the user
to do anything with the fs is another concern.

Even using a bitmap, we may have several GiB of data that needs to be scrubbed.
If we wait for the scrub to finish, it's the best and safest way, but
users won't be happy at all.

If we go the scrub-resume way, it's faster but still leaves a large window
that allows the write hole to reduce our tolerance.

Thanks,
Qu
>
> And while you're at it, add the same mechanism to all the other raid
> and dup modes to fix the inconsistency of NOCOW files after a crash.
>
> Regards,
> Lukas Straub
>
>> Thanks,
>> Qu
>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Paul.
>>>
>>>
>
>
>
Qu Wenruo June 6, 2022, 8:16 a.m. UTC | #27
On 2022/6/3 17:59, Qu Wenruo wrote:
>
>
> On 2022/6/3 17:32, Lukas Straub wrote:
>> On Thu, 2 Jun 2022 05:37:11 +0800
>> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>> On 2022/6/2 02:49, Martin Raiber wrote:
>>>> On 01.06.2022 12:12 Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2022/6/1 17:56, Paul Jones wrote:
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>>>>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>>>>>> Cc: linux-btrfs@vger.kernel.org
>>>>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format
>>>>>>> draft
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>>>>> If we save journal on every RAID56 HDD, it will always be very
>>>>>>>>>> slow,
>>>>>>>>>> because journal data is in a different place than normal data, so
>>>>>>>>>> HDD seek is always happen?
>>>>>>>>>>
>>>>>>>>>> If we save journal on a device just like 'mke2fs -O
>>>>>>>>>> journal_dev' or
>>>>>>>>>> 'mkfs.xfs -l logdev', then this device just works like
>>>>>>>>>> NVDIMM?  We
>>>>>>>>>> may not need
>>>>>>>>>> RAID56/RAID1 for journal data.
>>>>>>>>>
>>>>>>>>> That device is the single point of failure. You lost that device,
>>>>>>>>> write hole come again.
>>>>>>>>
>>>>>>>> The HW RAID card have 'single point of failure'  too, such as the
>>>>>>>> NVDIMM inside HW RAID card.
>>>>>>>>
>>>>>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>>>>>>> failure frequency
>>>>>>>
>>>>>>> It's a completely different level.
>>>>>>>
>>>>>>> For btrfs RAID, we have no special treat for any disk.
>>>>>>> And our RAID is focusing on ensuring device tolerance.
>>>>>>>
>>>>>>> In your RAID card case, indeed the failure rate of the card is
>>>>>>> much lower.
>>>>>>> In journal device case, how do you ensure it's still true that
>>>>>>> the journal device
>>>>>>> missing possibility is way lower than all the other devices?
>>>>>>>
>>>>>>> So this doesn't make sense, unless you introduce the journal to
>>>>>>> something
>>>>>>> definitely not a regular disk.
>>>>>>>
>>>>>>> I don't believe this benefit most users.
>>>>>>> Just consider how many regular people use dedicated journal
>>>>>>> device for
>>>>>>> XFS/EXT4 upon md/dm RAID56.
>>>>>>
>>>>>> A good solid state drive should be far less error prone than
>>>>>> spinning drives, so would be a good candidate. Not perfect, but
>>>>>> better.
>>>>>>
>>>>>> As an end user I think focusing on stability and recovery tools is
>>>>>> a better use of time than fixing the write hole, as I wouldn't
>>>>>> even consider using Raid56 in it's current state. The write hole
>>>>>> problem can be alleviated by a UPS and not using Raid56 for a busy
>>>>>> write load. It's still good to brainstorm the issue though, as it
>>>>>> will need solving eventually.
>>>>>
>>>>> In fact, since write hole is only a problem for power loss (and
>>>>> explicit
>>>>> degraded write), another solution is, only record if the fs is
>>>>> gracefully closed.
>>>>>
>>>>> If the fs is not gracefully closed (by a bit in superblock), then we
>>>>> just trigger a full scrub on all existing RAID56 block groups.
>>>>>
>>>>> This should solve the problem, with the extra cost of slow scrub for
>>>>> each unclean shutdown.
>>>>>
>>>>> To be extra safe, during that scrub run, we really want user to
>>>>> wait for
>>>>> the scrub to finish.
>>>>>
>>>>> But on the other hand, I totally understand user won't be happy to
>>>>> wait
>>>>> for 10+ hours just due to a unclean shutdown...
>>>> Would it be possible to put the stripe offsets/numbers into a
>>>> journal/commit them before write? Then, during mount you could scrub
>>>> only those after an unclean shutdown.
>>>
>>> If we go that path, we can already do full journal, and only replay that
>>> journal without the need for scrub at all.
>>
>> Hello Qu,
>>
>> If you don't care about the write-hole, you can also use a dirty bitmap
>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>> example one gigabyte of the disk that _may_ be dirty, and the bit is left
>> dirty for a while and doesn't need to be set for each write. Or you
>> could do a per-block-group dirty bit.
>
> That would be a pretty good way for auto scrub after dirty close.
>
> Currently we have quite some different ideas, but some are pretty
> similar but at different side of a spectrum:
>
>      Easier to implement        ..     Harder to implement
> |<- More on mount time scrub   ..     More on journal ->|
> |                    |    |    \- Full journal
> |                    |    \--- Per bg dirty bitmap
> |                    \----------- Per bg dirty flag
> \--------------------------------------------------- Per sb dirty flag

In fact, recently I've been checking the MD code (including their MD-raid5).

It turns out they have a write-intent bitmap, which is almost the per-bg
dirty bitmap in the above spectrum.

In fact, since btrfs has CoW and checksums for metadata (and part
of its data), btrfs scrub can do a much better job than MD at resilvering
the range.

Furthermore, we have a pretty good reserved space (1MiB per device), and a pretty
reasonable stripe length (1GiB).
This means we only need 32KiB of bitmap for each RAID56 stripe (one bit
per 4KiB sector), much smaller than the 1MiB we reserved.
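
Just to make that sizing explicit, a minimal back-of-the-envelope sketch,
assuming one bit per 4KiB sector over a 1GiB stripe (both granularities are
assumptions here, not part of the draft format):

/* Rough per-stripe write-intent bitmap sizing.
 * Assumptions (not from the draft): 4KiB sectors, 1GiB stripe, 1 bit/sector. */
#include <stdio.h>

int main(void)
{
	const unsigned long long stripe_len = 1ULL << 30;	/* 1GiB */
	const unsigned long long sectorsize = 4096;		/* 4KiB */
	unsigned long long bits = stripe_len / sectorsize;	/* 262144 bits */

	printf("bitmap per 1GiB stripe: %llu KiB\n", bits / 8 >> 10);	/* 32 KiB */
	return 0;
}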

I think this can be a pretty reasonable middle ground: faster than a full
journal, while the amount to scrub should be small enough to be
done at mount time.

Thanks,
Qu
>
> In fact, the dirty bitmap is just a simplified version of journal (only
> record the metadata, without data).
> Unlike dm/dm-raid56, with btrfs scrub, we should be able to fully
> recover the data without problem.
>
> Even with per-bg dirty bitmap, we still need some extra location to
> record the bitmap. Thus it needs a on-disk format change anyway.
>
> Currently only sb dirty flag may be backward compatible.
>
> And whether we should wait for the scrub to finish before allowing use
> to do anything into the fs is also another concern.
>
> Even using bitmap, we may have several GiB data needs to be scrubbed.
> If we wait for the scrub to finish, it's the best and safest way, but
> users won't be happy at all.
>
> If we go scrub resume way, it's faster but still leaves a large window
> to allow write-hole to reduce our tolerance.
>
> Thanks,
> Qu
>>
>> And while you're at it, add the same mechanism to all the other raid
>> and dup modes to fix the inconsistency of NOCOW files after a crash.
>>
>> Regards,
>> Lukas Straub
>>
>>> Thanks,
>>> Qu
>>>
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>> Paul.
>>>>
>>>>
>>
>>
>>
Qu Wenruo June 6, 2022, 11:21 a.m. UTC | #28
On 2022/6/6 16:16, Qu Wenruo wrote:
>
>
[...]
>>>
>>> Hello Qu,
>>>
>>> If you don't care about the write-hole, you can also use a dirty bitmap
>>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>>> example one gigabyte of the disk that _may_ be dirty, and the bit is
>>> left
>>> dirty for a while and doesn't need to be set for each write. Or you
>>> could do a per-block-group dirty bit.
>>
>> That would be a pretty good way for auto scrub after dirty close.
>>
>> Currently we have quite some different ideas, but some are pretty
>> similar but at different side of a spectrum:
>>
>>      Easier to implement        ..     Harder to implement
>> |<- More on mount time scrub   ..     More on journal ->|
>> |                    |    |    \- Full journal
>> |                    |    \--- Per bg dirty bitmap
>> |                    \----------- Per bg dirty flag
>> \--------------------------------------------------- Per sb dirty flag
>
> In fact, recently I'm checking the MD code (including their MD-raid5).
>
> It turns out they have write-intent bitmap, which is almost the per-bg
> dirty bitmap in above spectrum.
>
> In fact, since btrfs has all the CoW and checksum for metadata (and part
> of its data), btrfs scrub can do a much better job than MD to resilver
> the range.
>
> Furthermore, we have a pretty good reserved space (1M), and has a pretty
> reasonable stripe length (1GiB).
> This means, we only need 32KiB for the bitmap for each RAID56 stripe,
> much smaller than the 1MiB we reserved.
>
> I think this can be a pretty reasonable middle ground, faster than full
> journal, while the amount to scrub should be reasonable enough to be
> done at mount time.

Furthermore, this even allows us to go with something like a bitmap tree for
such a write-intent bitmap.
And as long as the user is not using RAID56 for metadata (maybe it's even
OK to use RAID56 for metadata), it should be pretty safe against
most write holes (for metadata and CoW data only though, NOCOW data is
still affected).

Thus I believe this can be a valid path to explore, and it may even have a
higher priority than a full journal.

Thanks,
Qu


>
> Thanks,
> Qu
>>
>> In fact, the dirty bitmap is just a simplified version of journal (only
>> record the metadata, without data).
>> Unlike dm/dm-raid56, with btrfs scrub, we should be able to fully
>> recover the data without problem.
>>
>> Even with per-bg dirty bitmap, we still need some extra location to
>> record the bitmap. Thus it needs a on-disk format change anyway.
>>
>> Currently only sb dirty flag may be backward compatible.
>>
>> And whether we should wait for the scrub to finish before allowing use
>> to do anything into the fs is also another concern.
>>
>> Even using bitmap, we may have several GiB data needs to be scrubbed.
>> If we wait for the scrub to finish, it's the best and safest way, but
>> users won't be happy at all.
>>
>> If we go scrub resume way, it's faster but still leaves a large window
>> to allow write-hole to reduce our tolerance.
>>
>> Thanks,
>> Qu
>>>
>>> And while you're at it, add the same mechanism to all the other raid
>>> and dup modes to fix the inconsistency of NOCOW files after a crash.
>>>
>>> Regards,
>>> Lukas Straub
>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>>
>>>>>>> Paul.
>>>>>
>>>>>
>>>
>>>
>>>
Goffredo Baroncelli June 6, 2022, 6:10 p.m. UTC | #29
Hi Qu,

On 06/06/2022 13.21, Qu Wenruo wrote:
> 
> 
> On 2022/6/6 16:16, Qu Wenruo wrote:
>>
>>
> [...]
>>>>
>>>> Hello Qu,
>>>>
>>>> If you don't care about the write-hole, you can also use a dirty bitmap
>>>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>>>> example one gigabyte of the disk that _may_ be dirty, and the bit is
>>>> left
>>>> dirty for a while and doesn't need to be set for each write. Or you
>>>> could do a per-block-group dirty bit.
>>>
>>> That would be a pretty good way for auto scrub after dirty close.
>>>
>>> Currently we have quite some different ideas, but some are pretty
>>> similar but at different side of a spectrum:
>>>
>>>      Easier to implement        ..     Harder to implement
>>> |<- More on mount time scrub   ..     More on journal ->|
>>> |                    |    |    \- Full journal
>>> |                    |    \--- Per bg dirty bitmap
>>> |                    \----------- Per bg dirty flag
>>> \--------------------------------------------------- Per sb dirty flag
>>
>> In fact, recently I'm checking the MD code (including their MD-raid5).
>>
>> It turns out they have write-intent bitmap, which is almost the per-bg
>> dirty bitmap in above spectrum.
>>
>> In fact, since btrfs has all the CoW and checksum for metadata (and part
>> of its data), btrfs scrub can do a much better job than MD to resilver
>> the range.
>>
>> Furthermore, we have a pretty good reserved space (1M), and has a pretty
>> reasonable stripe length (1GiB).
>> This means, we only need 32KiB for the bitmap for each RAID56 stripe,
>> much smaller than the 1MiB we reserved.
>>
>> I think this can be a pretty reasonable middle ground, faster than full
>> journal, while the amount to scrub should be reasonable enough to be
>> done at mount time.

RAID5 is "single fault proof". This means that it can sustain only one
failure *at a time*, like:
1) unavailability of a disk (e.g. data disk failure)
2) a missing write in the stripe (e.g. unclean shutdown)

a) Until now (i.e. without the "intent bitmap"), even if these failures happen
on different days (i.e. not at the same time), the result may be a "write hole".

b) With the intent bitmap, the write hole requires that 1) and 2) happen
at the same time. But that would no longer be a "single fault", with only one
exception: if these failures have a common cause (e.g. a power
failure which in turn causes the death of a disk). In that case it has to be
considered a "single fault".

But with a battery backup (i.e. no power failure), the likelihood of b) becomes
negligible.

This is to say that a write-intent bitmap would provide a huge
improvement to the resilience of btrfs raid5, and in turn raid6.

My only suggestion is to find a way to store the intent bitmap not in the
raid5/6 block group, but in a separate block group, with the appropriate level
of redundancy.

This is for two main reasons:
1) In the future BTRFS may get the ability to allocate this block group on a
dedicated disk set. I see two main cases:
a) In the case of raid6, we can store the intent bitmap (or the journal) in a
raid1C3 BG allocated on the faster disks. The con is that each block has to be
written 3x2 times. But if you have a hybrid disk set (some SSDs and some HDDs),
you get a noticeable gain in performance.
b) Another option is to spread the intent bitmap (or the journal) across *all*
disks, where each disk contains only the related data (if we update only disk #1
and disk #2, we have to update only the intent bitmap (or the journal) on
disk #1 and disk #2).


2) Having a dedicated bg for the intent bitmap (or the journal) has another big
advantage: you don't need to change the meaning of the raid5/6 bg. This means
that an older kernel can read/write a raid5/6 filesystem: it is sufficient to
ignore the intent bitmap (or the journal).



> 
> Furthermore, this even allows us to go something like bitmap tree, for
> such write-intent bitmap.
> And as long as the user is not using RAID56 for metadata (maybe even
> it's OK to use RAID56 for metadata), it should be pretty safe against
> most write-hole (for metadata and CoW data only though, nocow data is
> still affected).
> 
> Thus I believe this can be a valid path to explore, and even have a
> higher priority than full journal.
> 
> Thanks,
> Qu
>
Qu Wenruo June 7, 2022, 1:27 a.m. UTC | #30
On 2022/6/7 02:10, Goffredo Baroncelli wrote:
> Hi Qu,
>
> On 06/06/2022 13.21, Qu Wenruo wrote:
>>
>>
>> On 2022/6/6 16:16, Qu Wenruo wrote:
>>>
>>>
>> [...]
>>>>>
>>>>> Hello Qu,
>>>>>
>>>>> If you don't care about the write-hole, you can also use a dirty
>>>>> bitmap
>>>>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>>>>> example one gigabyte of the disk that _may_ be dirty, and the bit is
>>>>> left
>>>>> dirty for a while and doesn't need to be set for each write. Or you
>>>>> could do a per-block-group dirty bit.
>>>>
>>>> That would be a pretty good way for auto scrub after dirty close.
>>>>
>>>> Currently we have quite some different ideas, but some are pretty
>>>> similar but at different side of a spectrum:
>>>>
>>>>      Easier to implement        ..     Harder to implement
>>>> |<- More on mount time scrub   ..     More on journal ->|
>>>> |                    |    |    \- Full journal
>>>> |                    |    \--- Per bg dirty bitmap
>>>> |                    \----------- Per bg dirty flag
>>>> \--------------------------------------------------- Per sb dirty flag
>>>
>>> In fact, recently I'm checking the MD code (including their MD-raid5).
>>>
>>> It turns out they have write-intent bitmap, which is almost the per-bg
>>> dirty bitmap in above spectrum.
>>>
>>> In fact, since btrfs has all the CoW and checksum for metadata (and part
>>> of its data), btrfs scrub can do a much better job than MD to resilver
>>> the range.
>>>
>>> Furthermore, we have a pretty good reserved space (1M), and has a pretty
>>> reasonable stripe length (1GiB).
>>> This means, we only need 32KiB for the bitmap for each RAID56 stripe,
>>> much smaller than the 1MiB we reserved.
>>>
>>> I think this can be a pretty reasonable middle ground, faster than full
>>> journal, while the amount to scrub should be reasonable enough to be
>>> done at mount time.
>
> Raid5 is "single fault proof". This means that it can sustain only one
> failure *at time* like:
> 1) unavailability of a disk (e.g. data disk failure)
> 2) a missing write in the stripe (e.g. unclean shutdown)

Yep, only one of them happens at a time.

But end users want to handle all of them at the same time.
And it's not that unreasonable to hit cases like a power loss, after which one
disk also gives up.

So I'm afraid we will want to go full journal (or at least allow users to
choose) in the future for that case.

>
> a) Until now (i.e. without the "bitmap intent"), even if these failures
> happen
> in different days (i.e. not at the same time), the result may be a
> "write hole".

Yep.

>
> b) With the bitmap intent, the write hole requires that 1) and 2) happen
> at the same time. But this would be not anymore a "single fault", with
> only an
> exception: if these failure have a common cause (e.g. a power
> failure which in turn cause the dead of a disk). In this case this has
> to be
> considered "single fault".

Exactly, so the write-intent tree (I hate the DM naming, which just calls it a
bitmap) would still be an obvious improvement.

And that's exactly why multi-device DM profiles all have that configurable.

>
> But with a battery backup (i.e. no power failure), the likelihood of b)
> became
> negligible.
>
> This to say that a write intent bitmap will provide an huge
> improvement of the resilience of a btrfs raid5, and in turn raid6.
>
> My only suggestions, is to find a way to store the bitmap intent not in the
> raid5/6 block group, but in a separate block group, with the appropriate
> level
> of redundancy.

That's why I want to reject RAID56 for metadata, and just store the
write-intent tree in the metadata, like what we do for fsync (the log tree).

>
> This for two main reasons:
> 1) in future BTRFS may get the ability of allocating this block group in a
> dedicate disks set. I see two main cases:
> a) in case of raid6, we can store the intent bitmap (or the journal) in a
> raid1C3 BG allocated in the faster disks. The cons is that each block
> has to be
> written 3x2 times. But if you have an hybrid disks set (some ssd and
> some hdd,
> you got a noticeable gain of performance)

In fact, for a 4-disk setup, RAID10 has a good enough chance to tolerate 2
missing disks.

The chance to tolerate two missing devices for 4-disk RAID10 is:

4 / 6 = 66.7%

4 is the number of surviving combinations, no order involved:
(1, 3), (1, 4), (2, 3), (2, 4)
(or 4C2 - 2).

6 is 4C2.

So there is really no need to go RAID1C3 unless you really want guaranteed
2-disk tolerance.
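
To make the counting above concrete, a small brute-force sketch; the mirror
pairing {disk 0, disk 1} and {disk 2, disk 3} is an assumed layout for
illustration only:

/* Count how many two-disk failures a 4-disk RAID10 survives.
 * Assumed layout: mirror pairs {0,1} and {2,3}. */
#include <stdio.h>

int main(void)
{
	int survive = 0, total = 0;

	for (int a = 0; a < 4; a++) {
		for (int b = a + 1; b < 4; b++) {
			total++;		/* 4C2 = 6 combinations */
			/* fatal only if both lost disks share a mirror pair */
			if (a / 2 != b / 2)
				survive++;
		}
	}
	printf("%d/%d = %.1f%%\n", survive, total, 100.0 * survive / total);
	return 0;	/* prints 4/6 = 66.7% */
}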

> b) another option is to spread the intent bitmap (or the journal) in
> *all* disks,
> where each disks contains only the the related data (if we update only
> disk #1
> and disk #2, we have to update only the intent bitmap (or the journal) in
> disk #1 and  disk #2)

That's my initial per-device reservation method.

But for the write-intent tree, I tend not to go that way, but to use an
RO-compatible flag instead, as it's much simpler and more backward compatible.

Thanks,
Qu
>
>
> 2) having a dedicate bg for the intent bitmap (or the journal), has
> another big
> advantage: you don't need to change the meaning of the raid5/6 bg. This
> means
> that an older kernel can read/write a raid5/6 filesystem: it sufficient
> to ignore
> the intent bitmap (or the journal)
>
>
>
>>
>> Furthermore, this even allows us to go something like bitmap tree, for
>> such write-intent bitmap.
>> And as long as the user is not using RAID56 for metadata (maybe even
>> it's OK to use RAID56 for metadata), it should be pretty safe against
>> most write-hole (for metadata and CoW data only though, nocow data is
>> still affected).
>>
>> Thus I believe this can be a valid path to explore, and even have a
>> higher priority than full journal.
>>
>> Thanks,
>> Qu
>>
>
>
>
Goffredo Baroncelli June 7, 2022, 5:36 p.m. UTC | #31
On 07/06/2022 03.27, Qu Wenruo wrote:
> 
> 
> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
[...]

>>
>> But with a battery backup (i.e. no power failure), the likelihood of b)
>> became
>> negligible.
>>
>> This to say that a write intent bitmap will provide an huge
>> improvement of the resilience of a btrfs raid5, and in turn raid6.
>>
>> My only suggestions, is to find a way to store the bitmap intent not in the
>> raid5/6 block group, but in a separate block group, with the appropriate
>> level
>> of redundancy.
> 
> That's why I want to reject RAID56 as metadata, and just store the
> write-intent tree into the metadata, like what we did for fsync (log tree).
> 

My suggestion was not to use the btrfs metadata to store the "write-intent", but
to track the space used by the write-intent storage area with a bg. Then the
write intent can be handled not with a btrfs btree, but (e.g.) by simply
writing a bitmap of the used blocks, or pairs of [start, length]....

I really like the idea of storing the write intent in a btree. I find it very
elegant. However, I don't think it is convenient.

The write-intent disk format is not performance related: you don't need to seek
inside it, and it is small. You need to read it (entirely) only in case of power
failure, and in any case the biggest cost is scrubbing the last updated blocks. So
a btree is not needed.

Moreover, the handling of raid5/6 is a layer below the btree. I think that
updating the write-intent btree would be a performance bottleneck. I am quite sure
that the write intent likely requires less than one metadata page (16K today);
however, to store this page you need to update the metadata page tracking...

>>
>> This for two main reasons:
>> 1) in future BTRFS may get the ability of allocating this block group in a
>> dedicate disks set. I see two main cases:
>> a) in case of raid6, we can store the intent bitmap (or the journal) in a
>> raid1C3 BG allocated in the faster disks. The cons is that each block
>> has to be
>> written 3x2 times. But if you have an hybrid disks set (some ssd and
>> some hdd,
>> you got a noticeable gain of performance)
> 
> In fact, for 4 disk usage, RAID10 has good enough chance to tolerate 2
> missing disks.
> 
> In fact, the chance to tolerate two missing devices for 4 disks RAID10 is:
> 
> 4 / 6 = 66.7%
> 
> 4 is the total valid combinations, no order involved, including:
> (1, 3), (1, 4), (2, 3) (2, 4).
> (Or 4C2 - 2)
> 
> 6 is the 4C2.
> 
> So really no need to go RAID1C3 unless you're really want to ensured 2
> disks tolerance.

I don't get the point: I started talking about raid6. RAID6 is two-failure
proof (you need three failures to see the problem... in theory).

If P is the probability of a disk failure (with P << 1), the likelihood of
a RAID6 failure is O(P^3). The same holds for RAID1C3.

Instead, the RAID10 failure likelihood is only a bit less than a two-disk failure:
a RAID10 (4 disks) failure is O(0.66 * P^2) ~ O(P^2).

Because P << 1, we have P^3 << 0.66 * P^2.
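
Just to put example numbers on that asymptotic argument (P = 0.01 is an
arbitrary illustration value, and the constant factors are dropped):

/* Order-of-magnitude comparison for an example per-disk failure probability. */
#include <stdio.h>

int main(void)
{
	double p = 0.01;	/* arbitrary example value */

	printf("RAID6/RAID1C3 failure ~ P^3 = %.0e\n", p * p * p);
	printf("4-disk RAID10 failure ~ P^2 = %.0e\n", p * p);
	return 0;
}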
> 
>> b) another option is to spread the intent bitmap (or the journal) in
>> *all* disks,
>> where each disks contains only the the related data (if we update only
>> disk #1
>> and disk #2, we have to update only the intent bitmap (or the journal) in
>> disk #1 and  disk #2)
> 
> That's my initial per-device reservation method.
> 
> But for write-intent tree, I tend to not go that way, but with a
> RO-compatible flag instead, as it's much simpler and more back compatible.
> 
> Thanks,
> Qu
>>
>>
>> 2) having a dedicate bg for the intent bitmap (or the journal), has
>> another big
>> advantage: you don't need to change the meaning of the raid5/6 bg. This
>> means
>> that an older kernel can read/write a raid5/6 filesystem: it sufficient
>> to ignore
>> the intent bitmap (or the journal)
>>
>>
>>
>>>
>>> Furthermore, this even allows us to go something like bitmap tree, for
>>> such write-intent bitmap.
>>> And as long as the user is not using RAID56 for metadata (maybe even
>>> it's OK to use RAID56 for metadata), it should be pretty safe against
>>> most write-hole (for metadata and CoW data only though, nocow data is
>>> still affected).
>>>
>>> Thus I believe this can be a valid path to explore, and even have a
>>> higher priority than full journal.
>>>
>>> Thanks,
>>> Qu
>>>
>>
>>
>>
Qu Wenruo June 7, 2022, 10:14 p.m. UTC | #32
On 2022/6/8 01:36, Goffredo Baroncelli wrote:
> On 07/06/2022 03.27, Qu Wenruo wrote:
>>
>>
>> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
> [...]
>
>>>
>>> But with a battery backup (i.e. no power failure), the likelihood of b)
>>> became
>>> negligible.
>>>
>>> This to say that a write intent bitmap will provide an huge
>>> improvement of the resilience of a btrfs raid5, and in turn raid6.
>>>
>>> My only suggestions, is to find a way to store the bitmap intent not
>>> in the
>>> raid5/6 block group, but in a separate block group, with the appropriate
>>> level
>>> of redundancy.
>>
>> That's why I want to reject RAID56 as metadata, and just store the
>> write-intent tree into the metadata, like what we did for fsync (log
>> tree).
>>
>
> My suggestion was not to use the btrfs metadata to store the
> "write-intent", but
> to track the space used by the write-intent storage area with a bg. Then
> the
> write intent can be handled not with a btrfs btree, but (e.g.) simply
> writing a bitmap of the used blocks, or the pairs [starts, length]....

That solution requires a lot of extra changes to chunk allocation, and
out-of-btree tracking.

Furthermore, the btrfs btree itself has CoW to defend against power loss.

By not using the btree, we would pay a much higher price in the complexity of
implementing everything.

>
> I really like the idea to store the write intent in a btree. I find it very
> elegant. However I don't think that it is convenient.
>
> The write intent disk format is not performance related, you don't need
> to seek
> inside it; and it is small: you need to read it (entirerly) only in case
> of power
> failure, and in any case the biggest cost is to scrub the last updated
> blocks. So
> it is not needed a btree.

But such a write-intent bitmap must survive power loss by itself.

And in fact, that bitmap is not as small as you think.

In fact, for users who need a write-intent tree/bitmap, we're talking
about at least TiB-level usage.

4TiB of used space already needs 128MiB if we really go with a straight bitmap
(one bit per 4KiB sector).
Embedding it all on a per-device basis is completely possible, but
implementing it is much more complex.

128MiB is not that large, so in theory we're fine keeping an in-memory
bitmap.
But what would happen if we go to 32TiB? Then a 1GiB in-memory bitmap is
needed, which is not really acceptable anymore.
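
For reference, the straight-bitmap sizes above come out of this kind of
arithmetic (again assuming one bit per 4KiB sector):

/* Straight write-intent bitmap size, assuming one bit per 4KiB sector. */
#include <stdio.h>

int main(void)
{
	const unsigned long long tib = 1ULL << 40;
	const unsigned long long sizes[] = { 4 * tib, 32 * tib };

	for (int i = 0; i < 2; i++) {
		unsigned long long bits = sizes[i] / 4096;

		/* 4 TiB -> 128 MiB, 32 TiB -> 1024 MiB (1 GiB) */
		printf("%llu TiB -> %llu MiB bitmap\n",
		       sizes[i] / tib, bits / 8 >> 20);
	}
	return 0;
}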

When we start to choose which parts of the large bitmap pool are really
needed, then a btree starts to make sense. We can store a super large bitmap
using bitmap- and extent-based entries pretty easily, just like the free
space cache tree.

>
> Moreover, the handling of raid5/6 is a layer below the btree.

CSUM is also a layer below, but we still put it into the csum tree.

The handling of the write-intent bitmap/tree is indeed a layer lower.
But traditional DM lacks awareness of the upper-layer fs, and thus has a
lot of problems, like being unable to detect bit rot in RAID1 for example.

Yes, we care about layer separation, but more at the code level.
For functionality, layer separation is not that big a deal anymore.

> I think that
> updating the write-intent btree would be a performance bottleneck. I am
> quite sure
> that the write intent likely requires less than one metadata page (16K
> today);
> however to store this page you need to update the metadata page tracking...

We already have the existing log tree code doing similar things (though for a
quite different purpose), and it's used to speed up fsync.

Furthermore, the DM-layer bitmap is not a straight bitmap of all sectors
either, and its performance cost is almost negligible for sequential RW.

I don't think btree handling would be a performance bottleneck, as
NODATACOW for data doesn't improve performance much other than through the
implied NODATASUM.

>
>>>
>>> This for two main reasons:
>>> 1) in future BTRFS may get the ability of allocating this block group
>>> in a
>>> dedicate disks set. I see two main cases:
>>> a) in case of raid6, we can store the intent bitmap (or the journal)
>>> in a
>>> raid1C3 BG allocated in the faster disks. The cons is that each block
>>> has to be
>>> written 3x2 times. But if you have an hybrid disks set (some ssd and
>>> some hdd,
>>> you got a noticeable gain of performance)
>>
>> In fact, for 4 disk usage, RAID10 has good enough chance to tolerate 2
>> missing disks.
>>
>> In fact, the chance to tolerate two missing devices for 4 disks RAID10
>> is:
>>
>> 4 / 6 = 66.7%
>>
>> 4 is the total valid combinations, no order involved, including:
>> (1, 3), (1, 4), (2, 3) (2, 4).
>> (Or 4C2 - 2)
>>
>> 6 is the 4C2.
>>
>> So really no need to go RAID1C3 unless you're really want to ensured 2
>> disks tolerance.
>
> I don't get the point: I started talking about raid6. The raid6 is two
> failures proof (you need three failure to see the problem... in theory).
>
> If P is the probability of a disk failure (with P << 1), the likelihood of
> a RAID6 failure is O(P^3). The same is RAID1C3.
>
> Instead RAID10 failure likelihood is only a bit lesser than two disk
> failure:
> RAID10 (4 disks) failure is O(0.66 * P^2) ~ O(P^2).
>
> Because P is << 1 then  P^3 << 0.66 * P^2.

My point here is, although RAID10 is not guaranteed to survive losing 2 disks,
losing two disks still has a high enough chance of survival.

While RAID10 only has two copies of the data, instead of 3 from RAID1C3,
such cost saving can be attractive for a lot of users.

Thanks,
Qu

>>
>>> b) another option is to spread the intent bitmap (or the journal) in
>>> *all* disks,
>>> where each disks contains only the the related data (if we update only
>>> disk #1
>>> and disk #2, we have to update only the intent bitmap (or the
>>> journal) in
>>> disk #1 and  disk #2)
>>
>> That's my initial per-device reservation method.
>>
>> But for write-intent tree, I tend to not go that way, but with a
>> RO-compatible flag instead, as it's much simpler and more back
>> compatible.
>>
>> Thanks,
>> Qu
>>>
>>>
>>> 2) having a dedicate bg for the intent bitmap (or the journal), has
>>> another big
>>> advantage: you don't need to change the meaning of the raid5/6 bg. This
>>> means
>>> that an older kernel can read/write a raid5/6 filesystem: it sufficient
>>> to ignore
>>> the intent bitmap (or the journal)
>>>
>>>
>>>
>>>>
>>>> Furthermore, this even allows us to go something like bitmap tree, for
>>>> such write-intent bitmap.
>>>> And as long as the user is not using RAID56 for metadata (maybe even
>>>> it's OK to use RAID56 for metadata), it should be pretty safe against
>>>> most write-hole (for metadata and CoW data only though, nocow data is
>>>> still affected).
>>>>
>>>> Thus I believe this can be a valid path to explore, and even have a
>>>> higher priority than full journal.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>
>>>
>>>
>
>
Lukas Straub June 8, 2022, 3:17 p.m. UTC | #33
On Fri, 3 Jun 2022 17:59:59 +0800
Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:

> On 2022/6/3 17:32, Lukas Straub wrote:
> > On Thu, 2 Jun 2022 05:37:11 +0800
> > Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >  
> >> On 2022/6/2 02:49, Martin Raiber wrote:  
> >>> On 01.06.2022 12:12 Qu Wenruo wrote:  
> >>>>
> >>>>
> >>>> On 2022/6/1 17:56, Paul Jones wrote:  
> >>>>>  
> >>>>>> -----Original Message-----
> >>>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
> >>>>>> Sent: Wednesday, 1 June 2022 7:27 PM
> >>>>>> To: Wang Yugui <wangyugui@e16-tech.com>
> >>>>>> Cc: linux-btrfs@vger.kernel.org
> >>>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
> >>>>>>
> >>>>>>  
> >>>>>  
> >>>>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
> >>>>>>>>> because journal data is in a different place than normal data, so
> >>>>>>>>> HDD seek is always happen?
> >>>>>>>>>
> >>>>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
> >>>>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
> >>>>>>>>> may not need
> >>>>>>>>> RAID56/RAID1 for journal data.  
> >>>>>>>>
> >>>>>>>> That device is the single point of failure. You lost that device,
> >>>>>>>> write hole come again.  
> >>>>>>>
> >>>>>>> The HW RAID card have 'single point of failure'  too, such as the
> >>>>>>> NVDIMM inside HW RAID card.
> >>>>>>>
> >>>>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
> >>>>>>> failure frequency  
> >>>>>>
> >>>>>> It's a completely different level.
> >>>>>>
> >>>>>> For btrfs RAID, we have no special treat for any disk.
> >>>>>> And our RAID is focusing on ensuring device tolerance.
> >>>>>>
> >>>>>> In your RAID card case, indeed the failure rate of the card is much lower.
> >>>>>> In journal device case, how do you ensure it's still true that the journal device
> >>>>>> missing possibility is way lower than all the other devices?
> >>>>>>
> >>>>>> So this doesn't make sense, unless you introduce the journal to something
> >>>>>> definitely not a regular disk.
> >>>>>>
> >>>>>> I don't believe this benefit most users.
> >>>>>> Just consider how many regular people use dedicated journal device for
> >>>>>> XFS/EXT4 upon md/dm RAID56.  
> >>>>>
> >>>>> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
> >>>>>
> >>>>> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.  
> >>>>
> >>>> In fact, since write hole is only a problem for power loss (and explicit
> >>>> degraded write), another solution is, only record if the fs is
> >>>> gracefully closed.
> >>>>
> >>>> If the fs is not gracefully closed (by a bit in superblock), then we
> >>>> just trigger a full scrub on all existing RAID56 block groups.
> >>>>
> >>>> This should solve the problem, with the extra cost of slow scrub for
> >>>> each unclean shutdown.
> >>>>
> >>>> To be extra safe, during that scrub run, we really want user to wait for
> >>>> the scrub to finish.
> >>>>
> >>>> But on the other hand, I totally understand user won't be happy to wait
> >>>> for 10+ hours just due to a unclean shutdown...  
> >>> Would it be possible to put the stripe offsets/numbers into a journal/commit them before write? Then, during mount you could scrub only those after an unclean shutdown.  
> >>
> >> If we go that path, we can already do full journal, and only replay that
> >> journal without the need for scrub at all.  
> >
> > Hello Qu,
> >
> > If you don't care about the write-hole, you can also use a dirty bitmap
> > like mdraid 5/6 does. There, one bit in the bitmap represents for
> > example one gigabyte of the disk that _may_ be dirty, and the bit is left
> > dirty for a while and doesn't need to be set for each write. Or you
> > could do a per-block-group dirty bit.  
> 
> That would be a pretty good way for auto scrub after dirty close.
> 
> Currently we have quite some different ideas, but some are pretty
> similar but at different side of a spectrum:
> 
>      Easier to implement        ..     Harder to implement
> |<- More on mount time scrub   ..     More on journal ->|
> |					|	|	\- Full journal
> |					|	\--- Per bg dirty bitmap
> |					\----------- Per bg dirty flag
> \--------------------------------------------------- Per sb dirty flag
> 
> In fact, the dirty bitmap is just a simplified version of journal (only
> record the metadata, without data).
> Unlike dm/dm-raid56, with btrfs scrub, we should be able to fully
> recover the data without problem.
> 
> Even with per-bg dirty bitmap, we still need some extra location to
> record the bitmap. Thus it needs a on-disk format change anyway.
> 
> Currently only sb dirty flag may be backward compatible.
> 
> And whether we should wait for the scrub to finish before allowing use
> to do anything into the fs is also another concern.
> 
> Even using bitmap, we may have several GiB data needs to be scrubbed.
> If we wait for the scrub to finish, it's the best and safest way, but
> users won't be happy at all.
> 

Hmm, but it doesn't really make a difference in safety whether we allow
use while scrub/resync is running: the disks have inconsistent data, and
if we now lose one disk, the write hole happens.

The only thing to watch out for, while scrub/resync is running and a
write is submitted to the filesystem, is to scrub the stripe before
writing to it.


Regards,
Lukas Straub

> If we go scrub resume way, it's faster but still leaves a large window
> to allow write-hole to reduce our tolerance.
> 
> Thanks,
> Qu
> >
> > And while you're at it, add the same mechanism to all the other raid
> > and dup modes to fix the inconsistency of NOCOW files after a crash.
> >
> > Regards,
> > Lukas Straub
> >  
> >> Thanks,
> >> Qu
> >>  
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>  
> >>>>>
> >>>>> Paul.  
> >>>
> >>>  
> >
> >
> >  



--
Goffredo Baroncelli June 8, 2022, 5:26 p.m. UTC | #34
On 08/06/2022 00.14, Qu Wenruo wrote:
> 
> 
> On 2022/6/8 01:36, Goffredo Baroncelli wrote:
>> On 07/06/2022 03.27, Qu Wenruo wrote:
>>>
>>>
>>> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
>> [...]
>>
>>>>
>>>> But with a battery backup (i.e. no power failure), the likelihood of b)
>>>> became
>>>> negligible.
>>>>
>>>> This to say that a write intent bitmap will provide an huge
>>>> improvement of the resilience of a btrfs raid5, and in turn raid6.
>>>>
>>>> My only suggestions, is to find a way to store the bitmap intent not
>>>> in the
>>>> raid5/6 block group, but in a separate block group, with the appropriate
>>>> level
>>>> of redundancy.
>>>
>>> That's why I want to reject RAID56 as metadata, and just store the
>>> write-intent tree into the metadata, like what we did for fsync (log
>>> tree).
>>>
>>
>> My suggestion was not to use the btrfs metadata to store the
>> "write-intent", but
>> to track the space used by the write-intent storage area with a bg. Then
>> the
>> write intent can be handled not with a btrfs btree, but (e.g.) simply
>> writing a bitmap of the used blocks, or the pairs [starts, length]....
> 
> That solution requires a lot of extra change to chunk allocation, and
> out-of-btree tracking.
> 
> Furthermore, btrfs Btree itself has CoW to defend against the power loss.

[...]
> 
> But such write intent bitmap must survive powerloss by itself.
> 

What is the reason that the write intent "must survive"?
My understanding is that if the write intent is not fully written,
no data is written either. And to check whether a write intent
is fully written, it is enough to check it against a checksum.

I imagine (but maybe I am wrong) that the sequence should be (a rough sketch
of this ordering follows below):

1) write the intent (w/ checksum)
2) sync
3) write the raid5/6 data
4) sync
5) invalidate the intent
6) sync

a) If the power loss happens before 3, we don't *need* to scrub anything.
But if the checksum matches, we will.
Hopefully that doesn't harm anything (only a delay at mount time after a power loss).


b) If the power loss happens after 2 (and before 6), we should scrub all the
potentially impacted disk blocks. But the consistency of the write intent is
guaranteed by "2) sync" (and this doesn't depend on whether the intent is
stored in a btree or in another kind of storage).
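
A minimal sketch of that ordering; the helpers are stand-in stubs I made up
to name the steps, not existing btrfs functions:

/* Hypothetical ordering sketch only; the stubs below stand in for the real
 * intent/data writes and device flushes, they are not btrfs code. */
#include <stdio.h>

static int write_intent(void)        { puts("1) write intent + csum"); return 0; }
static int sync_devices(void)        { puts("   sync");                return 0; }
static int write_raid56_stripe(void) { puts("3) write data + parity"); return 0; }
static int clear_intent(void)        { puts("5) invalidate intent");   return 0; }

int main(void)
{
	if (write_intent() || sync_devices())		/* steps 1) and 2) */
		return 1;
	if (write_raid56_stripe() || sync_devices())	/* steps 3) and 4) */
		return 1;
	if (clear_intent() || sync_devices())		/* steps 5) and 6) */
		return 1;
	return 0;
}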


> And in fact, that bitmap is not small as you think.
> 
> In fact, for users who need write-intent tree/bitmap, we're talking
> about at least TiB level usage.

The worst case is that the data to write is as big as the memory. Usually
sizeof(memory) << sizeof(disk), so I think an "extent based" intent map
would be more efficient than a bitmap.

Suppose we have a 16GB server, and we track the blocks using extents
(64-bit pointer + 16-bit length, in 4k units). The worst case is that none of
the pages are contiguous, so we need 16GB/4k*10 = ~41MB to track all the pages.

The point is that for each 16GB of data, we need to write a further 41MB,
which is a negligible quantity, ~0.2%; and this factor is constant.
And in case of a power loss, you only have to scrub the
changed blocks (plus the intrinsic amplification factor of raid5/6).


On the other side, suppose we have a 16TB disk set, and we track the
blocks using a bitmap, where each bit represents 1MB.
To track all the disks we need 2MB. However:
1) if you write 4k, you still need to write the 2MB bitmap
2) the worst case is that you need to update 16GB of 4k pages, where each
page is 1MB away from the nearest. This means that you need to scrub
16GB/4k*1MB = 4TB of disks (plus the intrinsic amplification factor
of raid5/6).

If we reduce the unit of the page, we reduce the amplification factor
for the scrub, but we increase the size of the bitmap.
For example, if each bit tracks a 4k page, we have a bitmap of
4Gbit (512MB) for a 16TB filesystem. And
1.bis) if you write 4k, you will still need to write 512MB of intent.
2.bis) on the other side, in case of a power loss, you only have to scrub the
impacted disk pages (plus the intrinsic amplification factor
of raid5/6).
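
The worst-case numbers above can be reproduced with something like this
(16GB of scattered dirty 4k pages, a 16TB disk set, 10-byte extent records;
all figures follow the example, MiB vs. decimal MB rounding aside):

/* Worst-case tracking cost: extent records vs. whole-disk bitmap. */
#include <stdio.h>

int main(void)
{
	const unsigned long long gb = 1ULL << 30, tb = 1ULL << 40;
	unsigned long long dirty = 16 * gb, disks = 16 * tb;

	/* extent based: 8-byte offset + 2-byte length per non-contiguous 4k page */
	unsigned long long extent_bytes = dirty / 4096 * 10;
	/* bitmap based: one bit per 1MB, and one bit per 4k page */
	unsigned long long bmp_1m = disks / (1ULL << 20) / 8;
	unsigned long long bmp_4k = disks / 4096 / 8;

	printf("extent records:   ~%llu MiB\n", extent_bytes >> 20);	/* ~40 MiB */
	printf("bitmap (1MB/bit):  %llu MiB\n", bmp_1m >> 20);		/* 2 MiB */
	printf("bitmap (4k/bit):   %llu MiB\n", bmp_4k >> 20);		/* 512 MiB */
	return 0;
}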

> 
> 4TiB used space needs already 128MiB if we really go straight bitmap for
> them.
> Embedding them all in a per-device basis is completely possible, but
> when implementing it, it's much complex.
> 
> 128MiB is not that large, so in theory we're fine to keep an in-memory
> bitmap.
> But what would happen if we go 32TiB? Then 1GiB in-memory bitmap is
> needed, which is not really acceptable anymore.
> 
> When we start to choose what part is really needed in the large bitmap
> pool, then Btree starts to make sense. We can store a super large bitmap
> using bitmap and extent based entries pretty easily, just like free
> space cache tree.
> 
>>
>> Moreover, the handling of raid5/6 is a layer below the btree.
> 
> While CSUM is also a layer below, but we still put it into CSUM tree.
> 
> The handling of write-intent bitmap/tree is indeed a layer lower.
> But traditional DM lacks the awareness of the upper layer fs, thus has a
> lot of problems like unable to detect bit rot in RAID1 for example.
> 
> Yes, we care about layer separation, but more in a code level.
> For functionality, layer separation is not that a big deal already.
> 
>> I think that
>> updating the write-intent btree would be a performance bottleneck. I am
>> quite sure
>> that the write intent likely requires less than one metadata page (16K
>> today);
>> however to store this page you need to update the metadata page tracking...
> 
> We already have the existing log tree code doing similar (but still
> quite different purpose) things, and it's used to speed up fsync.
> 
> Furthermore, DM layer bitmap is not a straight bitmap of all sectors
> either, and for performance it's almost negligible for sequential RW.
> 
> I don't think Btree handling would be a performance bottleneck, as
> NODATACOW for data doesn't improve much performance other than the
> implied NODATASUM.
> 
>>
>>>>
>>>> This for two main reasons:
>>>> 1) in future BTRFS may get the ability of allocating this block group
>>>> in a
>>>> dedicate disks set. I see two main cases:
>>>> a) in case of raid6, we can store the intent bitmap (or the journal)
>>>> in a
>>>> raid1C3 BG allocated in the faster disks. The cons is that each block
>>>> has to be
>>>> written 3x2 times. But if you have an hybrid disks set (some ssd and
>>>> some hdd,
>>>> you got a noticeable gain of performance)
>>>
>>> In fact, for 4 disk usage, RAID10 has good enough chance to tolerate 2
>>> missing disks.
>>>
>>> In fact, the chance to tolerate two missing devices for 4 disks RAID10
>>> is:
>>>
>>> 4 / 6 = 66.7%
>>>
>>> 4 is the total valid combinations, no order involved, including:
>>> (1, 3), (1, 4), (2, 3) (2, 4).
>>> (Or 4C2 - 2)
>>>
>>> 6 is the 4C2.
>>>
>>> So really no need to go RAID1C3 unless you're really want to ensured 2
>>> disks tolerance.
>>
>> I don't get the point: I started talking about raid6. The raid6 is two
>> failures proof (you need three failure to see the problem... in theory).
>>
>> If P is the probability of a disk failure (with P << 1), the likelihood of
>> a RAID6 failure is O(P^3). The same is RAID1C3.
>>
>> Instead RAID10 failure likelihood is only a bit lesser than two disk
>> failure:
>> RAID10 (4 disks) failure is O(0.66 * P^2) ~ O(P^2).
>>
>> Because P is << 1 then  P^3 << 0.66 * P^2.
> 
> My point here is, although RAID10 is not ensured to lose 2 disks, just
> losing two disks still have a high enough chance to survive.
> 
> While RAID10 only have two copies of data, instead of 3 from RAID1C3,
> such cost saving can be attractive for a lot of users though.
> 
> Thanks,
> Qu
> 
>>>
>>>> b) another option is to spread the intent bitmap (or the journal) in
>>>> *all* disks,
>>>> where each disks contains only the the related data (if we update only
>>>> disk #1
>>>> and disk #2, we have to update only the intent bitmap (or the
>>>> journal) in
>>>> disk #1 and  disk #2)
>>>
>>> That's my initial per-device reservation method.
>>>
>>> But for write-intent tree, I tend to not go that way, but with a
>>> RO-compatible flag instead, as it's much simpler and more back
>>> compatible.
>>>
>>> Thanks,
>>> Qu
>>>>
>>>>
>>>> 2) having a dedicate bg for the intent bitmap (or the journal), has
>>>> another big
>>>> advantage: you don't need to change the meaning of the raid5/6 bg. This
>>>> means
>>>> that an older kernel can read/write a raid5/6 filesystem: it sufficient
>>>> to ignore
>>>> the intent bitmap (or the journal)
>>>>
>>>>
>>>>
>>>>>
>>>>> Furthermore, this even allows us to go something like bitmap tree, for
>>>>> such write-intent bitmap.
>>>>> And as long as the user is not using RAID56 for metadata (maybe even
>>>>> it's OK to use RAID56 for metadata), it should be pretty safe against
>>>>> most write-hole (for metadata and CoW data only though, nocow data is
>>>>> still affected).
>>>>>
>>>>> Thus I believe this can be a valid path to explore, and even have a
>>>>> higher priority than full journal.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>
>>>>
>>>>
>>
>>
Goffredo Baroncelli June 8, 2022, 5:32 p.m. UTC | #35
On 08/06/2022 17.17, Lukas Straub wrote:
> On Fri, 3 Jun 2022 17:59:59 +0800
> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 
>> On 2022/6/3 17:32, Lukas Straub wrote:
>>> On Thu, 2 Jun 2022 05:37:11 +0800
>>> Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>   
>>>> On 2022/6/2 02:49, Martin Raiber wrote:
>>>>> On 01.06.2022 12:12 Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2022/6/1 17:56, Paul Jones wrote:
>>>>>>>   
>>>>>>>> -----Original Message-----
>>>>>>>> From: Qu Wenruo <quwenruo.btrfs@gmx.com>
>>>>>>>> Sent: Wednesday, 1 June 2022 7:27 PM
>>>>>>>> To: Wang Yugui <wangyugui@e16-tech.com>
>>>>>>>> Cc: linux-btrfs@vger.kernel.org
>>>>>>>> Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
>>>>>>>>
>>>>>>>>   
>>>>>>>   
>>>>>>>>>>> If we save journal on every RAID56 HDD, it will always be very slow,
>>>>>>>>>>> because journal data is in a different place than normal data, so
>>>>>>>>>>> HDD seek is always happen?
>>>>>>>>>>>
>>>>>>>>>>> If we save journal on a device just like 'mke2fs -O journal_dev' or
>>>>>>>>>>> 'mkfs.xfs -l logdev', then this device just works like NVDIMM?  We
>>>>>>>>>>> may not need
>>>>>>>>>>> RAID56/RAID1 for journal data.
>>>>>>>>>>
>>>>>>>>>> That device is the single point of failure. You lost that device,
>>>>>>>>>> write hole come again.
>>>>>>>>>
>>>>>>>>> The HW RAID card have 'single point of failure'  too, such as the
>>>>>>>>> NVDIMM inside HW RAID card.
>>>>>>>>>
>>>>>>>>> but  power-lost frequency > hdd failure frequency  > NVDIMM/ssd
>>>>>>>>> failure frequency
>>>>>>>>
>>>>>>>> It's a completely different level.
>>>>>>>>
>>>>>>>> For btrfs RAID, we have no special treat for any disk.
>>>>>>>> And our RAID is focusing on ensuring device tolerance.
>>>>>>>>
>>>>>>>> In your RAID card case, indeed the failure rate of the card is much lower.
>>>>>>>> In journal device case, how do you ensure it's still true that the journal device
>>>>>>>> missing possibility is way lower than all the other devices?
>>>>>>>>
>>>>>>>> So this doesn't make sense, unless you introduce the journal to something
>>>>>>>> definitely not a regular disk.
>>>>>>>>
>>>>>>>> I don't believe this benefit most users.
>>>>>>>> Just consider how many regular people use dedicated journal device for
>>>>>>>> XFS/EXT4 upon md/dm RAID56.
>>>>>>>
>>>>>>> A good solid state drive should be far less error prone than spinning drives, so would be a good candidate. Not perfect, but better.
>>>>>>>
>>>>>>> As an end user I think focusing on stability and recovery tools is a better use of time than fixing the write hole, as I wouldn't even consider using Raid56 in it's current state. The write hole problem can be alleviated by a UPS and not using Raid56 for a busy write load. It's still good to brainstorm the issue though, as it will need solving eventually.
>>>>>>
>>>>>> In fact, since write hole is only a problem for power loss (and explicit
>>>>>> degraded write), another solution is, only record if the fs is
>>>>>> gracefully closed.
>>>>>>
>>>>>> If the fs is not gracefully closed (by a bit in superblock), then we
>>>>>> just trigger a full scrub on all existing RAID56 block groups.
>>>>>>
>>>>>> This should solve the problem, with the extra cost of slow scrub for
>>>>>> each unclean shutdown.
>>>>>>
>>>>>> To be extra safe, during that scrub run, we really want user to wait for
>>>>>> the scrub to finish.
>>>>>>
>>>>>> But on the other hand, I totally understand user won't be happy to wait
>>>>>> for 10+ hours just due to a unclean shutdown...
>>>>> Would it be possible to put the stripe offsets/numbers into a journal/commit them before write? Then, during mount you could scrub only those after an unclean shutdown.
>>>>
>>>> If we go that path, we can already do full journal, and only replay that
>>>> journal without the need for scrub at all.
>>>
>>> Hello Qu,
>>>
>>> If you don't care about the write-hole, you can also use a dirty bitmap
>>> like mdraid 5/6 does. There, one bit in the bitmap represents for
>>> example one gigabyte of the disk that _may_ be dirty, and the bit is left
>>> dirty for a while and doesn't need to be set for each write. Or you
>>> could do a per-block-group dirty bit.
>>
>> That would be a pretty good way for auto scrub after dirty close.
>>
>> Currently we have quite some different ideas, but some are pretty
>> similar but at different side of a spectrum:
>>
>>       Easier to implement        ..     Harder to implement
>> |<- More on mount time scrub   ..     More on journal ->|
>> |					|	|	\- Full journal
>> |					|	\--- Per bg dirty bitmap
>> |					\----------- Per bg dirty flag
>> \--------------------------------------------------- Per sb dirty flag
>>
>> In fact, the dirty bitmap is just a simplified version of journal (only
>> record the metadata, without data).
>> Unlike dm/dm-raid56, with btrfs scrub, we should be able to fully
>> recover the data without problem.
>>
>> Even with per-bg dirty bitmap, we still need some extra location to
>> record the bitmap. Thus it needs a on-disk format change anyway.
>>
>> Currently only sb dirty flag may be backward compatible.
>>
>> And whether we should wait for the scrub to finish before allowing use
>> to do anything into the fs is also another concern.
>>
>> Even using bitmap, we may have several GiB data needs to be scrubbed.
>> If we wait for the scrub to finish, it's the best and safest way, but
>> users won't be happy at all.
>>
> 
> Hmm, but it doesn't really make a difference in safety whether we allow
> use while scrub/resync is running: The disks have inconsistent data and
> if we now loose one disk, write-hole happens.

This is not correct. When you update the data, to compute the parity you
should read all the data and check the checksums. Then you can
write the (correct) parity.

For raid5 there is an optimization so that you don't need to read all the data stripes:
you can compute the new parity as

	new-parity = old-parity ^ new-data ^ old-data

This would create the write hole, but I hope that BTRFS doesn't have this
optimization.
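
For illustration, the read-modify-write shortcut on a single byte of parity
(purely a sketch, not btrfs code):

/* Read-modify-write parity update for RAID5, on one byte for illustration.
 * new_parity = old_parity ^ old_data ^ new_data -- only the touched data
 * block and the parity block need to be read/written. */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0xa5, d1 = 0x3c, d2 = 0x0f;	/* data blocks */
	uint8_t parity = d0 ^ d1 ^ d2;

	uint8_t new_d1 = 0x77;				/* partial write to d1 */
	uint8_t rmw_parity = parity ^ d1 ^ new_d1;	/* the shortcut */

	/* same result as recomputing from all (consistent) data blocks */
	assert(rmw_parity == (d0 ^ new_d1 ^ d2));
	return 0;
}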


> 
> The only thing to watch out for while scrub/resync is running and a
> write is submitted to the filesystem, is to scrub the stripe before
> writing to it.
> 
> 
> Regards,
> Lukas Straub
> 
>> If we go scrub resume way, it's faster but still leaves a large window
>> to allow write-hole to reduce our tolerance.
>>
>> Thanks,
>> Qu
>>>
>>> And while you're at it, add the same mechanism to all the other raid
>>> and dup modes to fix the inconsistency of NOCOW files after a crash.
>>>
>>> Regards,
>>> Lukas Straub
>>>   
>>>> Thanks,
>>>> Qu
>>>>   
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>   
>>>>>>>
>>>>>>> Paul.
>>>>>
>>>>>   
>>>
>>>
>>>   
> 
> 
>
Qu Wenruo June 13, 2022, 2:27 a.m. UTC | #36
On 2022/6/9 01:26, Goffredo Baroncelli wrote:
> On 08/06/2022 00.14, Qu Wenruo wrote:
>>
>>
>> On 2022/6/8 01:36, Goffredo Baroncelli wrote:
>>> On 07/06/2022 03.27, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
>>> [...]
>>>
>>>>>
>>>>> But with a battery backup (i.e. no power failure), the likelihood
>>>>> of b)
>>>>> became
>>>>> negligible.
>>>>>
>>>>> This to say that a write intent bitmap will provide an huge
>>>>> improvement of the resilience of a btrfs raid5, and in turn raid6.
>>>>>
>>>>> My only suggestions, is to find a way to store the bitmap intent not
>>>>> in the
>>>>> raid5/6 block group, but in a separate block group, with the
>>>>> appropriate
>>>>> level
>>>>> of redundancy.
>>>>
>>>> That's why I want to reject RAID56 as metadata, and just store the
>>>> write-intent tree into the metadata, like what we did for fsync (log
>>>> tree).
>>>>
>>>
>>> My suggestion was not to use the btrfs metadata to store the
>>> "write-intent", but
>>> to track the space used by the write-intent storage area with a bg. Then
>>> the
>>> write intent can be handled not with a btrfs btree, but (e.g.) simply
>>> writing a bitmap of the used blocks, or the pairs [starts, length]....
>>
>> That solution requires a lot of extra change to chunk allocation, and
>> out-of-btree tracking.
>>
>> Furthermore, btrfs Btree itself has CoW to defend against the power loss.
>
> [...]
>>
>> But such write intent bitmap must survive powerloss by itself.
>>
>
> What is the reason that the write intent "must survive" ?

Easy: so that we know which ranges need to be scrubbed, in order to close the
write hole.

Consider the following RAID56 case, a partial write:

Disk1:		|WWWWW|     |
Disk2:		|     |WWWWW|
Parity:		|WWWWW|WWWWW|

Say we write something into the write-intent bitmap and do the real
write, but a power loss happens.

Only the untouched data on Disk1 and Disk2 is safe.
The rest may or may not have been updated.

At the next mount, we should scrub this full stripe to close the write hole.

If, for example, the metadata is on another disk (not involved
in the RAID5 array) and that disk is lost,
we are unable to know which full stripes we need to scrub, thus the write
hole is still there.

> My understanding is that if the write intent is not fully wrote,
> also no data is wrote. And to check if a write intent
> is fully wrote, it is enough to check against a checksums.
>
> I imagine (but maybe I am wrong) that the sequence should be:
>
> 1) write the intent (w/ checksum)
> 2) sync
> 3) write the raid5/6 data
> 4) sync
> 5) invalid the intent
> 6) sync
>
> a) If the powerloss happens before 3, we don't *need* to scrub anything.
> But if the checksum matches we will.
> But hopefully it doesn't harm (only a delay at mount time after a poweroff)
>
>
> b) If the powerloss happens after 2 (and until 6), we should scrub all
> the potential
> impacted blocks disk. But the consistency of the write intent is guarantee
> by "2) sync" (and this doesn't depend by the fact that the intent is
> stored in a btree or in another kind of storage)
>
>
>> And in fact, that bitmap is not small as you think.
>>
>> In fact, for users who need write-intent tree/bitmap, we're talking
>> about at least TiB level usage.
>
> The worst case is that the data to write is as big as the memory. Usually
> sizeof(memory) << sizeof(disk), so I think an extent-based intent map
> would be more efficient than a bitmap.

Nope, a write-intent bitmap/tree doesn't work like that.

In fact for mdraid, the bitmap is normally only 16KiB in size
(e.g. if a 16KiB bitmap covered a 16TiB array, each bit would track
roughly 128MiB).

It has a proper reclaim mechanism, like flushing the disks so that the
already-written bits can be cleared again.

Thus I don't really think we need to keep a bitmap covering the whole
device, sorry for the confusion.
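
Roughly, the cycle I mean looks like the sketch below. md-bitmap.c does
something conceptually similar, but every helper name here (the wi_*()
and submit_raid56_stripes() functions, struct wi_bitmap) is invented for
illustration, not real code:

/* Purely illustrative; all names are made up. */
static int raid56_bitmap_protected_write(struct wi_bitmap *wib,
					 u64 first_stripe, u64 nr_stripes)
{
	int ret;

	/* 1) Mark the target full stripes dirty in the in-memory bitmap. */
	wi_set_bits(wib, first_stripe, nr_stripes);

	/* 2) Persist the bitmap (flush/FUA) before any data write is issued. */
	ret = wi_commit(wib);
	if (ret < 0)
		return ret;

	/* 3) Only now submit the real RAID56 full stripe writes. */
	ret = submit_raid56_stripes(wib, first_stripe, nr_stripes);
	if (ret < 0)
		return ret;

	/*
	 * 4) Once the data has reached all involved devices (e.g. after a
	 *    flush), clear the bits again, lazily and in batches; this is
	 *    what keeps the on-disk bitmap tiny.
	 */
	wi_clear_bits(wib, first_stripe, nr_stripes);
	return 0;
}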



Furthermore, some of the previous ideas, especially using a btrfs btree
for the write-intent bitmap, are not feasible.

The problem here is the lifespan.
Previously I thought the write-intent tree could have the same lifespan as
the log tree.

But that's not true; the problem is that data writeback can cross
transaction boundaries.

This means the proposed write-intent tree needs to survive multiple
transactions, unlike the log tree, thus it needs metadata
extent items.

The requirement for metadata extent items can get pretty messy.

Thus I'd say the write-intent bitmap should really go the mdraid way:
reserve a small space (a 1MiB~2MiB range for each device), then get
updated with the same timing/behavior as md-bitmap.c.
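
Something like the following is roughly what I have in mind for the
per-device reserved range, loosely inspired by md's bitmap superblock.
Everything here (names, field sizes, the 1MiB/4KiB split) is just a
sketch, not a proposed final format:

#include <linux/types.h>

/* Sketch only: a 1MiB per-device reservation, 4KiB super + bitmap. */
#define WRITE_INTENT_RESERVED_SIZE	(1024U * 1024U)
#define WRITE_INTENT_SUPER_SIZE		4096U
#define WRITE_INTENT_BITMAP_BYTES	\
	(WRITE_INTENT_RESERVED_SIZE - WRITE_INTENT_SUPER_SIZE)

struct write_intent_super {
	/* CRC32 of this super block, same rationale as the RAID56J header. */
	__u8   csum[4];

	/* Bumped on every bitmap update, to pick the newest copy. */
	__le64 generation;

	/* Bytes covered by one bit, e.g. the 64KiB stripe length. */
	__le32 blocksize;

	/* DIRTY/CLEAN style flags, similar to md-bitmap. */
	__le32 flags;

	/* Pad the super block to its full 4KiB size. */
	__u8   __reserved[WRITE_INTENT_SUPER_SIZE - 20];
} __attribute__ ((__packed__));

The bitmap itself would occupy the remaining WRITE_INTENT_BITMAP_BYTES
right after the super block, one bit per blocksize range, with bits
cleared again once the covered writes have safely hit all involved
devices.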

Thanks,
Qu
>
> Suppose we have a 16GB server, and we track the blocks using an
> extent-based scheme (64-bit pointer + 16-bit length in 4k units). The
> worst case is that all the pages are not contiguous, so we need
> 16GB/4k*10 = 41MB to track all the pages.
>
> The point is that for each 16GB of data, we need to write a further 41MB,
> which is a negligible quantity, about 0.2%; and this factor is constant.
> And in case of a powerloss, you have to scrub only the
> changed blocks (plus the intrinsic amplification factor of raid5/6).
>
>
> On the other side, suppose we have a 16TB disk set, and we track the
> blocks using a bitmap, where each bit represents 1MB.
> To track all the disks we need 2MB. However:
> 1) if you write 4k, you still need to write the 2MB of intent
> 2) the worst case is that you need to update 16GB of 4k pages, where each
> page is 1MB away from the nearest. This means that you need to scrub
> 16GB/4k*1MB = 4TB of disks (plus the intrinsic amplification factor
> of raid5/6).
>
> If we reduce the unit of the page, we reduce the amplification factor
> for the scrub, but we increase the size of the bitmap.
> For example if each bit tracks a 4k page, we have a bitmap of
> about 512MB for a 16TB filesystem. And
> 1.bis) if you write 4k, you will still need to write that 512MB of intent.
> 2.bis) on the other side in case of powerloss, you have to scrub only the
> impacted disk pages (plus the intrinsic amplification factor
> of raid5/6).
>
>>
>> 4TiB of used space already needs 128MiB if we really go with a straight
>> bitmap for it.
>> Embedding it all on a per-device basis is completely possible, but
>> implementing it is much more complex.
>>
>> 128MiB is not that large, so in theory we're fine to keep an in-memory
>> bitmap.
>> But what would happen if we go 32TiB? Then 1GiB in-memory bitmap is
>> needed, which is not really acceptable anymore.
>>
>> When we start to choose what part is really needed in the large bitmap
>> pool, then Btree starts to make sense. We can store a super large bitmap
>> using bitmap and extent based entries pretty easily, just like free
>> space cache tree.
>>
>>>
>>> Moreover, the handling of raid5/6 is a layer below the btree.
>>
>> While CSUM is also a layer below, we still put it into the CSUM tree.
>>
>> The handling of the write-intent bitmap/tree is indeed a layer lower.
>> But traditional DM lacks awareness of the upper layer fs, thus has a
>> lot of problems, like being unable to detect bit rot in RAID1 for example.
>>
>> Yes, we care about layer separation, but more at the code level.
>> For functionality, layer separation is not that big a deal anymore.
>>
>>> I think that
>>> updating the write-intent btree would be a performance bottleneck. I am
>>> quite sure
>>> that the write intent likely requires less than one metadata page (16K
>>> today);
>>> however to store this page you need to update the metadata page
>>> tracking...
>>
>> We already have the existing log tree code doing similar things (though
>> for a quite different purpose), and it's used to speed up fsync.
>>
>> Furthermore, the DM layer bitmap is not a straight bitmap of all sectors
>> either, and its performance impact is almost negligible for sequential RW.
>>
>> I don't think Btree handling would be a performance bottleneck, as
>> NODATACOW for data doesn't improve much performance other than the
>> implied NODATASUM.
>>
>>>
>>>>>
>>>>> This for two main reasons:
>>>>> 1) in future BTRFS may get the ability of allocating this block group
>>>>> in a
>>>>> dedicate disks set. I see two main cases:
>>>>> a) in case of raid6, we can store the intent bitmap (or the journal)
>>>>> in a
>>>>> raid1C3 BG allocated in the faster disks. The cons is that each block
>>>>> has to be
>>>>> written 3x2 times. But if you have an hybrid disks set (some ssd and
>>>>> some hdd,
>>>>> you got a noticeable gain of performance)
>>>>
>>>> In fact, for 4 disk usage, RAID10 has a good enough chance to tolerate
>>>> 2 missing disks.
>>>>
>>>> In fact, the chance to tolerate two missing devices for 4 disks RAID10
>>>> is:
>>>>
>>>> 4 / 6 = 66.7%
>>>>
>>>> 4 is the total valid combinations, no order involved, including:
>>>> (1, 3), (1, 4), (2, 3) (2, 4).
>>>> (Or 4C2 - 2)
>>>>
>>>> 6 is the 4C2.
>>>>
>>>> So there is really no need to go RAID1C3 unless you really want
>>>> guaranteed 2-disk tolerance.
>>>
>>> I don't get the point: I started talking about raid6. Raid6 is proof
>>> against two failures (you need three failures to see a problem... in theory).
>>>
>>> If P is the probability of a disk failure (with P << 1), the
>>> likelihood of
>>> a RAID6 failure is O(P^3). The same is RAID1C3.
>>>
>>> Instead, the RAID10 failure likelihood is only a bit less than that of
>>> a two disk failure:
>>> RAID10 (4 disks) failure is O(0.66 * P^2) ~ O(P^2).
>>>
>>> Because P is << 1 then  P^3 << 0.66 * P^2.
>>
>> My point here is, although RAID10 is not guaranteed to survive losing 2
>> disks, it still has a high enough chance to survive.
>>
>> And while RAID10 only has two copies of data, instead of 3 with RAID1C3,
>> such cost saving can be attractive for a lot of users.
>>
>> Thanks,
>> Qu
>>
>>>>
>>>>> b) another option is to spread the intent bitmap (or the journal) in
>>>>> *all* disks,
>>>>> where each disk contains only the related data (if we update only
>>>>> disk #1
>>>>> and disk #2, we have to update only the intent bitmap (or the
>>>>> journal) in
>>>>> disk #1 and  disk #2)
>>>>
>>>> That's my initial per-device reservation method.
>>>>
>>>> But for the write-intent tree, I tend not to go that way, and to use
>>>> an RO-compatible flag instead, as it's much simpler and more backward
>>>> compatible.
>>>>
>>>> Thanks,
>>>> Qu
>>>>>
>>>>>
>>>>> 2) having a dedicated bg for the intent bitmap (or the journal) has
>>>>> another big advantage: you don't need to change the meaning of the
>>>>> raid5/6 bg. This means that an older kernel can read/write a raid5/6
>>>>> filesystem: it is sufficient for it to ignore the intent bitmap (or
>>>>> the journal)
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Furthermore, this even allows us to go with something like a bitmap
>>>>>> tree for such a write-intent bitmap.
>>>>>> And as long as the user is not using RAID56 for metadata (maybe it's
>>>>>> even OK to use RAID56 for metadata), it should be pretty safe against
>>>>>> most write-hole cases (for metadata and CoW data only though, nocow
>>>>>> data is still affected).
>>>>>>
>>>>>> Thus I believe this can be a valid path to explore, and even have a
>>>>>> higher priority than full journal.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>
>
diff mbox series

Patch

diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 46991a27013b..17dace961695 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -1042,4 +1042,146 @@  struct btrfs_verity_descriptor_item {
 	__u8 encryption;
 } __attribute__ ((__packed__));
 
+/*
+ * For checksum used in RAID56J, we don't really want to use the
+ * one used by the fs (can be CRC32, XXHASH, SHA256, BLAKE2), as
+ * most meta/data written already has its own csum calculated.
+ *
+ * Furthermore, we don't have a good way just to reuse the csum.
+ * (metadata has csum inlined, data csum is passed through ordered extents)
+ *
+ * So here the csum for RAID56J is fixed to CRC32 to avoid unnecessary overhead.
+ */
+#define BTRFS_RAID56J_CSUM_SIZE		(4)
+
+#define BTRFS_RAID56J_STATUS_DIRTY	(1 << 0)
+#define BTRFS_RAID56J_STATUS_CLEAN	(1 << 1)
+
+/*
+ * ON-DISK FORMAT
+ * ==============
+ *
+ * This is the header part for RAID56 journal.
+ *
+ * Unlike all structures above, they are not stored in btrfs btree.
+ *
+ * The overall on-disk format looks like this:
+ *
+ * [ raid56j_header ][ raid56j_entry 1 ] .. [raid56j_entry N ]
+ * |------------------ At most in 4K size -------------------|
+ *
+ * One header can contain at most 127 entries.
+ * All entries should have corresponding data.
+ *
+ * Even for full stripe write, we must still journal all its content.
+ * As we can have cases like a full stripe full of dirty metadata.
+ * If we don't journal them, we can easily screw up all the metadata.
+ *
+ * Also we can not just discard the journal of a single block group,
+ * as metadata writes in other block groups can still be journaled, thus
+ * we would have to discard the journals of all block groups, which is not
+ * really feasible here.
+ *
+ * So unfortunately we have no optimization for journal write at all.
+ *
+ * [ stripe data 1 ][ stripe data 2 ]   ...   [ stripe data N]
+ * |------- At most btrfs_chunk::per_dev_reserved - 4K ------|
+ *
+ * Normally we put the full 64K stripe into the journal, and use 1M as the
+ * default reservation for the journal.
+ * This means we can have at most 15 data stripes for now, and it would be
+ * the bottleneck.
+ *
+ * Later we can enhance RAID56 writes to only write the modified vertical
+ * stripes, then we can have 255 data stripes.
+ *
+ * WRITE TIME WORKFLOW
+ * ===================
+ *
+ * And when writes are queued into a RAID56J block group, we will update the
+ * btrfs_raid56j_header, and put the pending bios for the device into a bio
+ * list. At this stage, we do not submit the real device bio yet.
+ *
+ * Under the following situations, we need to write the journal to disk:
+ * a) the 4KiB header area is full of entries
+ * b) the data space is full of data
+ * c) explicit flush/sync request
+ *
+ * When we need to write the journal, we will bump up the journal_generation,
+ * build a bio using the 4KiB memory and bios in the list, and submit it to
+ * the real device.
+ *
+ * If the device has write cache, we also need to flush the device.
+ *
+ * Then submit the involved bios in the list, flush, then convert the journal
+ * back to CLEAN status, and flush again.
+ *
+ * XXX: Do we really need the extra flush and the conversion to CLEAN?
+ * The extra flushes will definitely hurt performance, while re-doing the
+ * journal replay will not hurt anything except mount time.
+ *
+ * RECOVERY WORKFLOW
+ * =================
+ *
+ * At mount time, we load all block groups, and for RAID56J block groups we need
+ * to load their headers into the per-bg-per-dev memory.
+ * Not only for the status but also the journal_generation value.
+ *
+ * If the journal is dirty, we replay the journal, flush, then clean the DIRTY
+ * flag.
+ *
+ * Such journal replay will happen before any other write (even before log
+ * replay) at mount time.
+ */
+struct btrfs_raid56j_header {
+	/* Csum for the header only. */
+	__u8 csum[BTRFS_RAID56J_CSUM_SIZE];
+
+	/* The current status of the RAID56J chunk. */
+	__u32 status;
+
+	/* How many entries we have. */
+	__u32 nr_entries;
+	/*
+	 * Journal generation, we can not go with transid, as for data write
+	 * we have no reliable transid to utilize.
+	 *
+	 * Thus here we introduce a block group specific journal generation,
+	 * which gets bumped each time a dirty RAID56J header gets updated.
+	 * (AKA, just converting a DIRTY header to CLEAN won't bump the
+	 *  journal generation).
+	 */
+	__u64 journal_generation;
+
+	/* Reserved space, and bump the header to 32 bytes. */
+	__u8 __reserved[12];
+} __attribute__ ((__packed__));
+
+/* Pointer for the real data. */
+struct btrfs_raid56j_entry {
+	/* Csum of the data. */
+	__u8 csum[BTRFS_RAID56J_CSUM_SIZE];
+
+	/* Where the journaled data should be written to. */
+	__u64 physical;
+
+	/* Logical bytenr of the data, can be (u64)-1 or (u64)-2 for P/Q. */
+	__u64 logical;
+
+	/*
+	 * Where the data is in the journal.
+	 * Starting from the btrfs_raid56j_header.
+	 *
+	 * Offset can be 0, which means this is a full stripe write that does
+	 * not need to be journaled.
+	 * The COW nature of btrfs ensures the write destination is not used
+	 * by anyone, thus there is no write hole.
+	 */
+	__u32 offset;
+	__u32 len;
+
+	/* Bump to 32 bytes. */
+	__u8 __reserved[4];
+} __attribute__ ((__packed__));
+
 #endif /* _BTRFS_CTREE_H_ */