diff mbox series

Combining nodatasum + compression

Message ID 01020179f656e348-b07c8001-7b74-432a-b215-ccc7b17fbfdf-000000@eu-west-1.amazonses.com (mailing list archive)
State New, archived
Headers show
Series Combining nodatasum + compression | expand

Commit Message

Martin Raiber June 10, 2021, 2:32 p.m. UTC
Hi,

when btrfs is running on a block device that improves integrity (e.g. 
Ceph), it's usefull to run it with nodatasum to reduce the amount of 
metadata and random IO.

In that case it would also be useful to be able to run it with 
compression combined with nodatasum as well.

I actually found that if nodatasum is specified after compress-force, 
that it doesn't remove reset the compress/nodatasum bit, while the other 
way around it doesn't work.

That combined with


should do the trick.

The above code was added with the argument that having no checksums with 
compression would damage too much data in case of corruption ( 
https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/ 
). This argument doesn't apply much to single device file systems and 
even less to file systems on Ceph like volumes.

Comments

Qu Wenruo June 11, 2021, 12:18 a.m. UTC | #1
On 2021/6/10 下午10:32, Martin Raiber wrote:
> Hi,
>
> when btrfs is running on a block device that improves integrity (e.g.
> Ceph), it's usefull to run it with nodatasum to reduce the amount of
> metadata and random IO.
>
> In that case it would also be useful to be able to run it with
> compression combined with nodatasum as well.
>
> I actually found that if nodatasum is specified after compress-force,
> that it doesn't remove reset the compress/nodatasum bit, while the other
> way around it doesn't work.
>
> That combined with
>
> --- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 +0200
> +++ linux-5.10.39/fs/btrfs/inode.c      2021-05-31 16:11:19.461591523 +0200
> @@ -408,8 +408,7 @@
>    */
>   static inline bool inode_can_compress(struct btrfs_inode *inode)
>   {
> -       if (inode->flags & BTRFS_INODE_NODATACOW ||
> -           inode->flags & BTRFS_INODE_NODATASUM)
> +       if (inode->flags & BTRFS_INODE_NODATACOW)
>                  return false;
>          return true;
>   }
>
> should do the trick. >
> The above code was added with the argument that having no checksums with
> compression would damage too much data in case of corruption (
> https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/
> ).

It doesn't make a difference whether it's a single device fs or not.
If we don't have csum, the corruption is not only affecting the sector
where the corruption is, but the full compressed extent.

Furthermore, it's not that simple.

Current code we always expect compressed read to find some csum.
Just check btrfs_submit_compressed_read(), it will call
btrfs_lookup_bio_sums().
Which will fill the csum array with 0 if it can not find any csum.

Then at endio callbacks, we verify the csum against the data we read, if
it's all zero, the csum will definitely mismatch and discard the data no
matter if it's correct or not.

The same thing applies to btrfs_submit_compressed_write(), it will
always generate csum.

The diff will just give you a false sense of compression without csum.
It will still generate csum for write and relies on csum check at read time.

Thanks,
Qu

> This argument doesn't apply much to single device file systems and
> even less to file systems on Ceph like volumes.
>
Martin Raiber June 11, 2021, 10:55 a.m. UTC | #2
On 11.06.2021 02:18 Qu Wenruo wrote:
> On 2021/6/10 下午10:32, Martin Raiber wrote:
>> when btrfs is running on a block device that improves integrity (e.g.
>> Ceph), it's usefull to run it with nodatasum to reduce the amount of
>> metadata and random IO.
>>
>> In that case it would also be useful to be able to run it with
>> compression combined with nodatasum as well.
>>
>> I actually found that if nodatasum is specified after compress-force,
>> that it doesn't remove reset the compress/nodatasum bit, while the other
>> way around it doesn't work.
>>
>> That combined with
>>
>> --- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 
>> +0200
>> +++ linux-5.10.39/fs/btrfs/inode.c      2021-05-31 16:11:19.461591523 
>> +0200
>> @@ -408,8 +408,7 @@
>>    */
>>   static inline bool inode_can_compress(struct btrfs_inode *inode)
>>   {
>> -       if (inode->flags & BTRFS_INODE_NODATACOW ||
>> -           inode->flags & BTRFS_INODE_NODATASUM)
>> +       if (inode->flags & BTRFS_INODE_NODATACOW)
>>                  return false;
>>          return true;
>>   }
>>
>> should do the trick. >
>> The above code was added with the argument that having no checksums with
>> compression would damage too much data in case of corruption (
>> https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/
>> ).
>
> It doesn't make a difference whether it's a single device fs or not.
> If we don't have csum, the corruption is not only affecting the sector
> where the corruption is, but the full compressed extent.
Doesn't really matter if it doesn't happen if btrfs is on a e.g. ceph 
volume. Furthermore it depends on the use-case if corruption affecting 
one page, or a whole compressed block is something one can live with.
>
> Furthermore, it's not that simple.
>
> Current code we always expect compressed read to find some csum.
> Just check btrfs_submit_compressed_read(), it will call
> btrfs_lookup_bio_sums().
> Which will fill the csum array with 0 if it can not find any csum.

It seems to make that conditional w.r.t. BTRFS_INODE_NODATASUM

if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
     ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums);
     BUG_ON(ret); /* -ENOMEM */
}and in check_compressed_csum: if (inode->flags & BTRFS_INODE_NODATASUM) 
return 0; This seems to have been moved around recently ( 
https://github.com/torvalds/linux/commit/334c16d82cfe180f7b262a6f8ae2d9379f032b18 
).

>
> Then at endio callbacks, we verify the csum against the data we read, if
> it's all zero, the csum will definitely mismatch and discard the data no
> matter if it's correct or not.
>
> The same thing applies to btrfs_submit_compressed_write(), it will
> always generate csum.

Same here:

int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;

...

if (!skip_sum) {
         ret = btrfs_csum_one_bio(inode, bio, start, 1);
         BUG_ON(ret); /* -ENOMEM */
  }

>
> The diff will just give you a false sense of compression without csum.
> It will still generate csum for write and relies on csum check at read 
> time.
>
Qu Wenruo June 11, 2021, 11:15 a.m. UTC | #3
On 2021/6/11 下午6:55, Martin Raiber wrote:
> On 11.06.2021 02:18 Qu Wenruo wrote:
>> On 2021/6/10 下午10:32, Martin Raiber wrote:
>>> when btrfs is running on a block device that improves integrity (e.g.
>>> Ceph), it's usefull to run it with nodatasum to reduce the amount of
>>> metadata and random IO.
>>>
>>> In that case it would also be useful to be able to run it with
>>> compression combined with nodatasum as well.
>>>
>>> I actually found that if nodatasum is specified after compress-force,
>>> that it doesn't remove reset the compress/nodatasum bit, while the other
>>> way around it doesn't work.
>>>
>>> That combined with
>>>
>>> --- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580
>>> +0200
>>> +++ linux-5.10.39/fs/btrfs/inode.c      2021-05-31 16:11:19.461591523
>>> +0200
>>> @@ -408,8 +408,7 @@
>>>    */
>>>   static inline bool inode_can_compress(struct btrfs_inode *inode)
>>>   {
>>> -       if (inode->flags & BTRFS_INODE_NODATACOW ||
>>> -           inode->flags & BTRFS_INODE_NODATASUM)
>>> +       if (inode->flags & BTRFS_INODE_NODATACOW)
>>>                  return false;
>>>          return true;
>>>   }
>>>
>>> should do the trick. >
>>> The above code was added with the argument that having no checksums with
>>> compression would damage too much data in case of corruption (
>>> https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/
>>> ).
>>
>> It doesn't make a difference whether it's a single device fs or not.
>> If we don't have csum, the corruption is not only affecting the sector
>> where the corruption is, but the full compressed extent.
> Doesn't really matter if it doesn't happen if btrfs is on a e.g. ceph
> volume.

But this still means, all other regular end user can hit a case where
they set nodatasum for a directory and later corrupt their data.

This will increase the corruption loss, if user just migrate from older
kernel to newer one.

Yeah, you can blame the users for doing that, but I'm still not that
sure if such behavior change is fine.

Especially when this increase the chance of data loss.
It may be fine for ceph, but btrfs still have a wider user base to consider.

> Furthermore it depends on the use-case if corruption affecting
> one page, or a whole compressed block is something one can live with.

That's true, but my point is, at least we shouldn't increase the data
loss possibility for the end users.

If in the past, we allowed NODATASUM + compression, then converting to
current situation, I'm more or less fine, since it reduce the
possibility of data loss.

But not in the other direction.

>>
>> Furthermore, it's not that simple.
>>
>> Current code we always expect compressed read to find some csum.
>> Just check btrfs_submit_compressed_read(), it will call
>> btrfs_lookup_bio_sums().
>> Which will fill the csum array with 0 if it can not find any csum.
>
> It seems to make that conditional w.r.t. BTRFS_INODE_NODATASUM
>
> if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
>      ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums);
>      BUG_ON(ret); /* -ENOMEM */
> }and in check_compressed_csum: if (inode->flags & BTRFS_INODE_NODATASUM)
> return 0; This seems to have been moved around recently (
> https://github.com/torvalds/linux/commit/334c16d82cfe180f7b262a6f8ae2d9379f032b18
> ).

Forgot that patch, I should look one function deeper to confirm the
behavior.

That indeed reduces my concerns from the code point of view.

Thanks,
Qu
>
>>
>> Then at endio callbacks, we verify the csum against the data we read, if
>> it's all zero, the csum will definitely mismatch and discard the data no
>> matter if it's correct or not.
>>
>> The same thing applies to btrfs_submit_compressed_write(), it will
>> always generate csum.
>
> Same here:
>
> int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;
>
> ...
>
> if (!skip_sum) {
>          ret = btrfs_csum_one_bio(inode, bio, start, 1);
>          BUG_ON(ret); /* -ENOMEM */
>   }
>
>>
>> The diff will just give you a false sense of compression without csum.
>> It will still generate csum for write and relies on csum check at read
>> time. >>
>
>
David Sterba June 17, 2021, 12:15 p.m. UTC | #4
On Fri, Jun 11, 2021 at 07:15:29PM +0800, Qu Wenruo wrote:
> On 2021/6/11 下午6:55, Martin Raiber wrote:
> > On 11.06.2021 02:18 Qu Wenruo wrote:
> >> On 2021/6/10 下午10:32, Martin Raiber wrote:
> 
> But this still means, all other regular end user can hit a case where
> they set nodatasum for a directory and later corrupt their data.

So this where I'd say give users what they ask for, like the usecase
where the data correctness is done on another layer or solved by
completely different approach like reprovisioning in case of errors.

I would certainly not recommend to turning off nodatasum in general but
if thre's a usecase with known pros and cons then I'm for allowing it
(if feasible on technical grounds).

An example from past is (not)allowing DUP on a single device for data.
The counterargument for that was something like "but it won't help on SD
cards anyway", but we're not choosing hw or usecase for users.

I had a prototype to do "nometasum", it's obviously simple but with lack
of usecase I did not push it for a merge. If there's an interest for a
checksum-free usecase we can do that.

> This will increase the corruption loss, if user just migrate from older
> kernel to newer one.
> 
> Yeah, you can blame the users for doing that, but I'm still not that
> sure if such behavior change is fine.
> 
> Especially when this increase the chance of data loss.
> It may be fine for ceph, but btrfs still have a wider user base to consider.

So the silent change would make a difference, but it's still after a
decision to do nodatasum, that gets on a directory and inherited.
That it actually happens is not what user asked for, so from that point
I think it's the other way around.

> > Furthermore it depends on the use-case if corruption affecting
> > one page, or a whole compressed block is something one can live with.
> 
> That's true, but my point is, at least we shouldn't increase the data
> loss possibility for the end users.
> 
> If in the past, we allowed NODATASUM + compression, then converting to
> current situation, I'm more or less fine, since it reduce the
> possibility of data loss.

I'm going to read the reasoning again.
Qu Wenruo June 17, 2021, 12:35 p.m. UTC | #5
On 2021/6/17 下午8:15, David Sterba wrote:
> On Fri, Jun 11, 2021 at 07:15:29PM +0800, Qu Wenruo wrote:
>> On 2021/6/11 下午6:55, Martin Raiber wrote:
>>> On 11.06.2021 02:18 Qu Wenruo wrote:
>>>> On 2021/6/10 下午10:32, Martin Raiber wrote:
>>
>> But this still means, all other regular end user can hit a case where
>> they set nodatasum for a directory and later corrupt their data.
> 
> So this where I'd say give users what they ask for, like the usecase
> where the data correctness is done on another layer or solved by
> completely different approach like reprovisioning in case of errors.

Fine, then I guess what we need is to talk about the technical details then.

> 
> I would certainly not recommend to turning off nodatasum in general but
> if thre's a usecase with known pros and cons then I'm for allowing it
> (if feasible on technical grounds).
> 
> An example from past is (not)allowing DUP on a single device for data.
> The counterargument for that was something like "but it won't help on SD
> cards anyway", but we're not choosing hw or usecase for users.
> 
> I had a prototype to do "nometasum", it's obviously simple but with lack
> of usecase I did not push it for a merge. If there's an interest for a
> checksum-free usecase we can do that.

Already off-topic, but I doubt if it can really bring some benefit.

In the old days, maybe.

Nowadays, even with checksum disabled, we still have mandatory 
tree-checker, which on average takes a similar amount of time of csum 
the tree-block...

> 
>> This will increase the corruption loss, if user just migrate from older
>> kernel to newer one.
>>
>> Yeah, you can blame the users for doing that, but I'm still not that
>> sure if such behavior change is fine.
>>
>> Especially when this increase the chance of data loss.
>> It may be fine for ceph, but btrfs still have a wider user base to consider.
> 
> So the silent change would make a difference, but it's still after a
> decision to do nodatasum, that gets on a directory and inherited.
> That it actually happens is not what user asked for, so from that point
> I think it's the other way around.
> 
>>> Furthermore it depends on the use-case if corruption affecting
>>> one page, or a whole compressed block is something one can live with.
>>
>> That's true, but my point is, at least we shouldn't increase the data
>> loss possibility for the end users.
>>
>> If in the past, we allowed NODATASUM + compression, then converting to
>> current situation, I'm more or less fine, since it reduce the
>> possibility of data loss.
> 
> I'm going to read the reasoning again.
> 

That's because we used to force csum check for compressed data, but it's 
no longer the case anymore.

As Martin already stated, we skip the compressed csum for read, as long 
as the inode has NODATASUM flag.

Thanks,
Qu
diff mbox series

Patch

--- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580 +0200
+++ linux-5.10.39/fs/btrfs/inode.c      2021-05-31 16:11:19.461591523 +0200
@@ -408,8 +408,7 @@ 
   */
  static inline bool inode_can_compress(struct btrfs_inode *inode)
  {
-       if (inode->flags & BTRFS_INODE_NODATACOW ||
-           inode->flags & BTRFS_INODE_NODATASUM)
+       if (inode->flags & BTRFS_INODE_NODATACOW)
                 return false;
         return true;
  }