diff mbox series

btrfs: always fallback to buffered IO if the inode requires checksum

Message ID 78091f2b21b9d45678ca54e5a7d117adb69c0deb.1738574832.git.wqu@suse.com (mailing list archive)
State New
Headers show
Series btrfs: always fallback to buffered IO if the inode requires checksum | expand

Commit Message

Qu Wenruo Feb. 3, 2025, 9:27 a.m. UTC
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode none).

[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).

This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.

But the final write into the image file is handled by btrfs, which need
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).

So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
but btrfs requires no modification, this leads to the false csum
mismatch.

This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.

For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.

[FIX]
I have considered the following ideas to solve the problem:

- Make direct IO to always skip data checksum
  This not only requires a new incompatible flag, as it breaks the
  current per-inode NODATASUM flag.
  But also requires extra handling for no csum found cases.

  And this also reduces our checksum protection.

- Let hardware to handle all the checksum
  AKA, just nodatasum mount option.
  That requires trust for hardware (which is not that trustful in a lot
  of cases), and it's not generic at all.

- Always fallback to buffered IO if the inode requires checksum
  This is suggested by Christoph, and is the solution utilized by this
  patch.

  The cost is obvious, the extra buffer copying into page cache, thus it
  reduce the performance.
  But at least it's still user configurable, if the end user still wants
  the zero-copy performance, just set NODATASUM flag for the inode
  (which is a common practice for VM images on btrfs).

  Since we can not trust user space programs to keep the buffer
  consistent during direct IO, we have no choice but always falling
  back to buffered IO.
  At least by this, we avoid the more deadly false data checksum
  mismatch error.

Suggested-by: hch@infradead.org <hch@infradead.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Wang Yugui Feb. 3, 2025, 10:22 a.m. UTC | #1
Hi,

> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
> 
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
> 
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
> 
> But the final write into the image file is handled by btrfs, which need
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
> 
> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> but btrfs requires no modification, this leads to the false csum
> mismatch.
> 
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
> 
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
> 
> [FIX]
> I have considered the following ideas to solve the problem:
> 
> - Make direct IO to always skip data checksum
>   This not only requires a new incompatible flag, as it breaks the
>   current per-inode NODATASUM flag.
>   But also requires extra handling for no csum found cases.
> 
>   And this also reduces our checksum protection.
> 
> - Let hardware to handle all the checksum
>   AKA, just nodatasum mount option.
>   That requires trust for hardware (which is not that trustful in a lot
>   of cases), and it's not generic at all.
> 
> - Always fallback to buffered IO if the inode requires checksum
>   This is suggested by Christoph, and is the solution utilized by this
>   patch.
> 
>   The cost is obvious, the extra buffer copying into page cache, thus it
>   reduce the performance.
>   But at least it's still user configurable, if the end user still wants
>   the zero-copy performance, just set NODATASUM flag for the inode
>   (which is a common practice for VM images on btrfs).
> 
>   Since we can not trust user space programs to keep the buffer
>   consistent during direct IO, we have no choice but always falling
>   back to buffered IO.
>   At least by this, we avoid the more deadly false data checksum
>   mismatch error.

Could we mark the page for direct write to READ-only
when ! BTRFS_INODE_NODATASUM?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2025/02/03



> Suggested-by: hch@infradead.org <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index c99ceabcd792..d64cda76cc92 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>  			goto err;
>  	}
>  
> +	/*
> +	 * For direct IO, we have no control on the folio passed in, thus the content
> +	 * can change halfway after we calculated the data checksum.
> +	 *
> +	 * To be extra safe and avoid false data checksum mismatch, if the inode still
> +	 * requires data checksum, we just fall back to buffered IO by returning
> +	 * -ENOTBLK, and iomap will do the fallback.
> +	 */
> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> +		ret = -ENOTBLK;
> +		goto err;
> +	}
> +
>  	/*
>  	 * If this errors out it's because we couldn't invalidate pagecache for
>  	 * this range and we need to fallback to buffered IO, or we are doing a
> -- 
> 2.48.1
>
Filipe Manana Feb. 3, 2025, 11:04 a.m. UTC | #2
On Mon, Feb 3, 2025 at 9:28 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
>
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
>
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
>
> But the final write into the image file is handled by btrfs, which need
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
>
> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> but btrfs requires no modification, this leads to the false csum
> mismatch.
>
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
>
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
>
> [FIX]
> I have considered the following ideas to solve the problem:
>
> - Make direct IO to always skip data checksum
>   This not only requires a new incompatible flag, as it breaks the
>   current per-inode NODATASUM flag.
>   But also requires extra handling for no csum found cases.
>
>   And this also reduces our checksum protection.
>
> - Let hardware to handle all the checksum
>   AKA, just nodatasum mount option.
>   That requires trust for hardware (which is not that trustful in a lot
>   of cases), and it's not generic at all.
>
> - Always fallback to buffered IO if the inode requires checksum
>   This is suggested by Christoph, and is the solution utilized by this
>   patch.
>
>   The cost is obvious, the extra buffer copying into page cache, thus it
>   reduce the performance.
>   But at least it's still user configurable, if the end user still wants
>   the zero-copy performance, just set NODATASUM flag for the inode
>   (which is a common practice for VM images on btrfs).
>
>   Since we can not trust user space programs to keep the buffer
>   consistent during direct IO, we have no choice but always falling
>   back to buffered IO.
>   At least by this, we avoid the more deadly false data checksum
>   mismatch error.
>
> Suggested-by: hch@infradead.org <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index c99ceabcd792..d64cda76cc92 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>                         goto err;
>         }
>
> +       /*
> +        * For direct IO, we have no control on the folio passed in, thus the content

folio -> folios

> +        * can change halfway after we calculated the data checksum.
> +        *
> +        * To be extra safe and avoid false data checksum mismatch, if the inode still
> +        * requires data checksum, we just fall back to buffered IO by returning
> +        * -ENOTBLK, and iomap will do the fallback.
> +        */
> +       if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> +               ret = -ENOTBLK;
> +               goto err;
> +       }

Why do it here?

We can do it higher in the call stack at btrfs_direct_write() after
locking the inode, right before or after calling check_direct_IO().
It's far more straightforward.

Thanks.

> +
>         /*
>          * If this errors out it's because we couldn't invalidate pagecache for
>          * this range and we need to fallback to buffered IO, or we are doing a
> --
> 2.48.1
>
>
Goffredo Baroncelli Feb. 3, 2025, 8:19 p.m. UTC | #3
On 03/02/2025 10.27, Qu Wenruo wrote:
> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
> 
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
> 
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
> 
> But the final write into the image file is handled by btrfs, which need
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
> 
> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> but btrfs requires no modification, this leads to the false csum
> mismatch.
> 
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
> 
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
> 
> [FIX]
> I have considered the following ideas to solve the problem:
> 
> - Make direct IO to always skip data checksum
>    This not only requires a new incompatible flag, as it breaks the
>    current per-inode NODATASUM flag.
>    But also requires extra handling for no csum found cases.
> 
>    And this also reduces our checksum protection.
> 
> - Let hardware to handle all the checksum
>    AKA, just nodatasum mount option.
>    That requires trust for hardware (which is not that trustful in a lot
>    of cases), and it's not generic at all.
> 
> - Always fallback to buffered IO if the inode requires checksum
>    This is suggested by Christoph, and is the solution utilized by this
>    patch.
> 
>    The cost is obvious, the extra buffer copying into page cache, thus it
>    reduce the performance.
>    But at least it's still user configurable, if the end user still wants
>    the zero-copy performance, just set NODATASUM flag for the inode
>    (which is a common practice for VM images on btrfs).
> 
>    Since we can not trust user space programs to keep the buffer
>    consistent during direct IO, we have no choice but always falling
>    back to buffered IO.
>    At least by this, we avoid the more deadly false data checksum
>    mismatch error.

I tried a patch few years ago [1]. I think that it is important to point out that
even ZFS started to support DIRECT_IO only recently [2]. Until that, it ignored the
DIRECT_IO flag.

Moreover, I suggest to print a WARNING when a file is opened with
DIRECT_IO and DATASUM. Even tough it could cause a flood of dmesg.


[1] https://patchwork.kernel.org/project/linux-btrfs/patch/5d52220b-177d-72d4-7825-dbe6cbf8722f@inwind.it/
[2] https://www.phoronix.com/news/OpenZFS-Direct-IO

> Suggested-by: hch@infradead.org <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/direct-io.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index c99ceabcd792..d64cda76cc92 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>   			goto err;
>   	}
>   
> +	/*
> +	 * For direct IO, we have no control on the folio passed in, thus the content
> +	 * can change halfway after we calculated the data checksum.
> +	 *
> +	 * To be extra safe and avoid false data checksum mismatch, if the inode still
> +	 * requires data checksum, we just fall back to buffered IO by returning
> +	 * -ENOTBLK, and iomap will do the fallback.
> +	 */
> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> +		ret = -ENOTBLK;
> +		goto err;
> +	}
> +
>   	/*
>   	 * If this errors out it's because we couldn't invalidate pagecache for
>   	 * this range and we need to fallback to buffered IO, or we are doing a
Qu Wenruo Feb. 3, 2025, 10:57 p.m. UTC | #4
在 2025/2/3 21:34, Filipe Manana 写道:
> On Mon, Feb 3, 2025 at 9:28 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which need
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>> but btrfs requires no modification, this leads to the false csum
>> mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware to handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered IO if the inode requires checksum
>>    This is suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduce the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we can not trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling
>>    back to buffered IO.
>>    At least by this, we avoid the more deadly false data checksum
>>    mismatch error.
>>
>> Suggested-by: hch@infradead.org <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/direct-io.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index c99ceabcd792..d64cda76cc92 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>>                          goto err;
>>          }
>>
>> +       /*
>> +        * For direct IO, we have no control on the folio passed in, thus the content
>
> folio -> folios
>
>> +        * can change halfway after we calculated the data checksum.
>> +        *
>> +        * To be extra safe and avoid false data checksum mismatch, if the inode still
>> +        * requires data checksum, we just fall back to buffered IO by returning
>> +        * -ENOTBLK, and iomap will do the fallback.
>> +        */
>> +       if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
>> +               ret = -ENOTBLK;
>> +               goto err;
>> +       }
>
> Why do it here?
>
> We can do it higher in the call stack at btrfs_direct_write() after
> locking the inode, right before or after calling check_direct_IO().
> It's far more straightforward.


Thanks a lot for this!

It's indeed way better, will go that direction in the next update.

Thanks,
Qu

>
> Thanks.
>
>> +
>>          /*
>>           * If this errors out it's because we couldn't invalidate pagecache for
>>           * this range and we need to fallback to buffered IO, or we are doing a
>> --
>> 2.48.1
>>
>>
>
Qu Wenruo Feb. 3, 2025, 10:58 p.m. UTC | #5
在 2025/2/3 20:52, Wang Yugui 写道:
> Hi,
>
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which need
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>> but btrfs requires no modification, this leads to the false csum
>> mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware to handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered IO if the inode requires checksum
>>    This is suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduce the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we can not trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling
>>    back to buffered IO.
>>    At least by this, we avoid the more deadly false data checksum
>>    mismatch error.
>
> Could we mark the page for direct write to READ-only
> when ! BTRFS_INODE_NODATASUM?

We have no control on the source page.
It can be user page or even some weird mmapped ones.

Thanks,
QU
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2025/02/03
>
>
>
>> Suggested-by: hch@infradead.org <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/direct-io.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index c99ceabcd792..d64cda76cc92 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>>   			goto err;
>>   	}
>>
>> +	/*
>> +	 * For direct IO, we have no control on the folio passed in, thus the content
>> +	 * can change halfway after we calculated the data checksum.
>> +	 *
>> +	 * To be extra safe and avoid false data checksum mismatch, if the inode still
>> +	 * requires data checksum, we just fall back to buffered IO by returning
>> +	 * -ENOTBLK, and iomap will do the fallback.
>> +	 */
>> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
>> +		ret = -ENOTBLK;
>> +		goto err;
>> +	}
>> +
>>   	/*
>>   	 * If this errors out it's because we couldn't invalidate pagecache for
>>   	 * this range and we need to fallback to buffered IO, or we are doing a
>> --
>> 2.48.1
>>
>
>
>
Qu Wenruo Feb. 3, 2025, 11:06 p.m. UTC | #6
在 2025/2/4 06:49, Goffredo Baroncelli 写道:
> On 03/02/2025 10.27, Qu Wenruo wrote:
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which need
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>> but btrfs requires no modification, this leads to the false csum
>> mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware to handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered IO if the inode requires checksum
>>    This is suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduce the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we can not trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling
>>    back to buffered IO.
>>    At least by this, we avoid the more deadly false data checksum
>>    mismatch error.
>
> I tried a patch few years ago [1]. I think that it is important to point
> out that
> even ZFS started to support DIRECT_IO only recently [2]. Until that, it
> ignored the
> DIRECT_IO flag.
>
> Moreover, I suggest to print a WARNING when a file is opened with
> DIRECT_IO and DATASUM. Even tough it could cause a flood of dmesg.

The warning is a little overkilled, and under a lot of cases, dmesg
warning is not enough to inform the user.

And completely not supporting is also problematic.
Direct_IO condition is already complex (all the alignment checks etc),
I'm not a huge fan to introduce another extra check.

To me, the fallback itself is a pretty good compromise to everyone.

For people who really doesn't want to waste memory on page cache (e.g.
VM has its own page cache), then the fallback still works.
The page cache is just populated, written back, then immediately
invalidated, thus no long term page cache usage.

Not to mention this solves the false data csum error.

For people really believe their own workload handles memory cache
better, sure they can still do that with NODATASUM, and that's already
the common practice for a lot of databases.

Not to mention, even regular direct IO can fallback to buffered one by a
lot of other conditions, so the fallback itself is always a valid solution.

Thanks,
Qu

>
>
> [1] https://patchwork.kernel.org/project/linux-btrfs/
> patch/5d52220b-177d-72d4-7825-dbe6cbf8722f@inwind.it/
> [2] https://www.phoronix.com/news/OpenZFS-Direct-IO
>
>> Suggested-by: hch@infradead.org <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/direct-io.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index c99ceabcd792..d64cda76cc92 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode
>> *inode, loff_t start,
>>               goto err;
>>       }
>> +    /*
>> +     * For direct IO, we have no control on the folio passed in, thus
>> the content
>> +     * can change halfway after we calculated the data checksum.
>> +     *
>> +     * To be extra safe and avoid false data checksum mismatch, if
>> the inode still
>> +     * requires data checksum, we just fall back to buffered IO by
>> returning
>> +     * -ENOTBLK, and iomap will do the fallback.
>> +     */
>> +    if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
>> +        ret = -ENOTBLK;
>> +        goto err;
>> +    }
>> +
>>       /*
>>        * If this errors out it's because we couldn't invalidate
>> pagecache for
>>        * this range and we need to fallback to buffered IO, or we are
>> doing a
>
>
Wang Yugui Feb. 5, 2025, 11:10 p.m. UTC | #7
HI,

> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode none).
> 
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
> 
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
> 
> But the final write into the image file is handled by btrfs, which need
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
> 
> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
> but btrfs requires no modification, this leads to the false csum
> mismatch.
> 
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
> 
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
> 
> [FIX]
> I have considered the following ideas to solve the problem:
> 
> - Make direct IO to always skip data checksum
>   This not only requires a new incompatible flag, as it breaks the
>   current per-inode NODATASUM flag.
>   But also requires extra handling for no csum found cases.
> 
>   And this also reduces our checksum protection.
> 
> - Let hardware to handle all the checksum
>   AKA, just nodatasum mount option.
>   That requires trust for hardware (which is not that trustful in a lot
>   of cases), and it's not generic at all.
> 
> - Always fallback to buffered IO if the inode requires checksum
>   This is suggested by Christoph, and is the solution utilized by this
>   patch.
> 
>   The cost is obvious, the extra buffer copying into page cache, thus it
>   reduce the performance.
>   But at least it's still user configurable, if the end user still wants
>   the zero-copy performance, just set NODATASUM flag for the inode
>   (which is a common practice for VM images on btrfs).
> 
>   Since we can not trust user space programs to keep the buffer
>   consistent during direct IO, we have no choice but always falling
>   back to buffered IO.
>   At least by this, we avoid the more deadly false data checksum
>   mismatch error.
> 
> Suggested-by: hch@infradead.org <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Another point about uncached buffered I/O since 6.14.

when the direct i/o is fallbacked to buffered i/o,
should it be marked as uncached(RWF_DONTCACHE)?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2025/02/06



> ---
>  fs/btrfs/direct-io.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index c99ceabcd792..d64cda76cc92 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>  			goto err;
>  	}
>  
> +	/*
> +	 * For direct IO, we have no control on the folio passed in, thus the content
> +	 * can change halfway after we calculated the data checksum.
> +	 *
> +	 * To be extra safe and avoid false data checksum mismatch, if the inode still
> +	 * requires data checksum, we just fall back to buffered IO by returning
> +	 * -ENOTBLK, and iomap will do the fallback.
> +	 */
> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
> +		ret = -ENOTBLK;
> +		goto err;
> +	}
> +
>  	/*
>  	 * If this errors out it's because we couldn't invalidate pagecache for
>  	 * this range and we need to fallback to buffered IO, or we are doing a
> -- 
> 2.48.1
>
Qu Wenruo Feb. 5, 2025, 11:12 p.m. UTC | #8
在 2025/2/6 09:40, Wang Yugui 写道:
> HI,
> 
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which need
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>> but btrfs requires no modification, this leads to the false csum
>> mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware to handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered IO if the inode requires checksum
>>    This is suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduce the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we can not trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling
>>    back to buffered IO.
>>    At least by this, we avoid the more deadly false data checksum
>>    mismatch error.
>>
>> Suggested-by: hch@infradead.org <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Another point about uncached buffered I/O since 6.14.
> 
> when the direct i/o is fallbacked to buffered i/o,
> should it be marked as uncached(RWF_DONTCACHE)?

At the time of write, the uncached write support for btrfs is not yet 
merged.

Thanks,
Qu

> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2025/02/06
> 
> 
> 
>> ---
>>   fs/btrfs/direct-io.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index c99ceabcd792..d64cda76cc92 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -444,6 +444,19 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>>   			goto err;
>>   	}
>>   
>> +	/*
>> +	 * For direct IO, we have no control on the folio passed in, thus the content
>> +	 * can change halfway after we calculated the data checksum.
>> +	 *
>> +	 * To be extra safe and avoid false data checksum mismatch, if the inode still
>> +	 * requires data checksum, we just fall back to buffered IO by returning
>> +	 * -ENOTBLK, and iomap will do the fallback.
>> +	 */
>> +	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
>> +		ret = -ENOTBLK;
>> +		goto err;
>> +	}
>> +
>>   	/*
>>   	 * If this errors out it's because we couldn't invalidate pagecache for
>>   	 * this range and we need to fallback to buffered IO, or we are doing a
>> -- 
>> 2.48.1
>>
> 
>
diff mbox series

Patch

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index c99ceabcd792..d64cda76cc92 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -444,6 +444,19 @@  static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 			goto err;
 	}
 
+	/*
+	 * For direct IO, we have no control on the folio passed in, thus the content
+	 * can change halfway after we calculated the data checksum.
+	 *
+	 * To be extra safe and avoid false data checksum mismatch, if the inode still
+	 * requires data checksum, we just fall back to buffered IO by returning
+	 * -ENOTBLK, and iomap will do the fallback.
+	 */
+	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
+		ret = -ENOTBLK;
+		goto err;
+	}
+
 	/*
 	 * If this errors out it's because we couldn't invalidate pagecache for
 	 * this range and we need to fallback to buffered IO, or we are doing a