diff mbox series

[v2] fstests: add a generic test to verify direct IO writes with buffer contents change

Message ID d9c50aa0df6cde2cb39cb7c9f978dbc27dadb770.1739241217.git.wqu@suse.com (mailing list archive)
State New
Headers show
Series [v2] fstests: add a generic test to verify direct IO writes with buffer contents change | expand

Commit Message

Qu Wenruo Feb. 11, 2025, 5:52 a.m. UTC
There is a long existing btrfs problem that if some one is modifying the
buffer of an on-going direct IO write, it has a very high chance causing
permanent data checksum mismatch.

This is caused by the following factors:

- Direct IO buffer is out of the control of filesystem
  Thus user space can modify the contents during writeback.

- Btrfs generate its data checksum just before submitting the bio
  This means if the contents happens after the checksum is generated,
  the data written to disk will no longer match the checksum.

Btrfs later fixes the problem by forcing the direct IO to fallback to
buffered IO (if the inode requires data checksum), so that btrfs can
have a consistent view of the buffer.

This test case will verify the behavior by:

- Create a helper program 'dio-writeback-race'
  Which does direct IO writes block-by-block, and the buffer is always
  initialized to all 0xff before write,
  Then starting two threads:
  - One to submit the direct IO
  - One to modify the buffer to 0x00

  The program uses 4K as default block size, and 64MiB as the default
  file size.
  Which is more than enough to trigger tons of btrfs checksum errors
  on unpatched kernels.

- New test case generic/761
  Which will:

  * Use above program to create a 64MiB file

  * Do buffered read on that file
    Since above program is doing direct IO, there is no page cache
    populated.
    And the buffered read will need to read out all data from the disk,
    and if the filesystem supports data checksum, then the data checksum
    will also be verified against the data.

The test case passes on the following fses:
- ext4
- xfs
- btrfs with "nodatasum" mount option
  No data checksum to bother.

- btrfs with default "datasum" mount option and the fix "btrfs: always
  fallback to buffered write if the inode requires checksum"
  This makes btrfs to fallback on buffered IO so the contents won't
  change during writeback of page cache.

And fails on the following fses:

- btrfs with default "datasum" mount option and without the fix
  Expected.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Fix the comment on the default block size of dio-writeback-race
- Use proper type for pthread_exit() of do_io() function
- Fix the error message when filesize is invalid
- Fix the error message when unknown option is parsed
- Catch the thread return value correctly for pthread_join() on IO thread
- Always update @ret
- Return EXIT_SUCCESS/FAILURE based on @ret at error: tag
- Check the return value of pthread_join() correctly
- Remove unused cleanup override/include comments from the test case
- Add the missing fixed-by tag
---
 .gitignore               |   1 +
 src/Makefile             |   3 +-
 src/dio-writeback-race.c | 148 +++++++++++++++++++++++++++++++++++++++
 tests/generic/761        |  41 +++++++++++
 tests/generic/761.out    |   2 +
 5 files changed, 194 insertions(+), 1 deletion(-)
 create mode 100644 src/dio-writeback-race.c
 create mode 100755 tests/generic/761
 create mode 100644 tests/generic/761.out

Comments

Daniel Vacek Feb. 11, 2025, 11:30 a.m. UTC | #1
On Tue, 11 Feb 2025 at 11:16, Qu Wenruo <wqu@suse.com> wrote:
>
> There is a long existing btrfs problem that if some one is modifying the
> buffer of an on-going direct IO write, it has a very high chance causing
> permanent data checksum mismatch.
>
> This is caused by the following factors:
>
> - Direct IO buffer is out of the control of filesystem
>   Thus user space can modify the contents during writeback.
>
> - Btrfs generate its data checksum just before submitting the bio
>   This means if the contents happens after the checksum is generated,
>   the data written to disk will no longer match the checksum.
>
> Btrfs later fixes the problem by forcing the direct IO to fallback to
> buffered IO (if the inode requires data checksum), so that btrfs can
> have a consistent view of the buffer.

Would it be a bad idea if btrfs remapped the page read-only for the
duration of the writeback instead? Eventual page fault could either
block until the writeback is finished (which may be an unwanted
behavior) or it could map another copy of the data. This would keep
the direct IO behavior for applications which do not change the data
after submitting the IO.

Also, I'd maybe add a tracepoint or dynamic debug print (-once?) when
falling back to buffered IO.

> This test case will verify the behavior by:
>
> - Create a helper program 'dio-writeback-race'
>   Which does direct IO writes block-by-block, and the buffer is always
>   initialized to all 0xff before write,
>   Then starting two threads:
>   - One to submit the direct IO
>   - One to modify the buffer to 0x00
>
>   The program uses 4K as default block size, and 64MiB as the default
>   file size.
>   Which is more than enough to trigger tons of btrfs checksum errors
>   on unpatched kernels.
>
> - New test case generic/761
>   Which will:
>
>   * Use above program to create a 64MiB file
>
>   * Do buffered read on that file
>     Since above program is doing direct IO, there is no page cache
>     populated.
>     And the buffered read will need to read out all data from the disk,
>     and if the filesystem supports data checksum, then the data checksum
>     will also be verified against the data.
>
> The test case passes on the following fses:
> - ext4
> - xfs
> - btrfs with "nodatasum" mount option
>   No data checksum to bother.
>
> - btrfs with default "datasum" mount option and the fix "btrfs: always
>   fallback to buffered write if the inode requires checksum"
>   This makes btrfs to fallback on buffered IO so the contents won't
>   change during writeback of page cache.
>
> And fails on the following fses:
>
> - btrfs with default "datasum" mount option and without the fix
>   Expected.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Fix the comment on the default block size of dio-writeback-race
> - Use proper type for pthread_exit() of do_io() function
> - Fix the error message when filesize is invalid
> - Fix the error message when unknown option is parsed
> - Catch the thread return value correctly for pthread_join() on IO thread
> - Always update @ret
> - Return EXIT_SUCCESS/FAILURE based on @ret at error: tag
> - Check the return value of pthread_join() correctly
> - Remove unused cleanup override/include comments from the test case
> - Add the missing fixed-by tag
> ---
>  .gitignore               |   1 +
>  src/Makefile             |   3 +-
>  src/dio-writeback-race.c | 148 +++++++++++++++++++++++++++++++++++++++
>  tests/generic/761        |  41 +++++++++++
>  tests/generic/761.out    |   2 +
>  5 files changed, 194 insertions(+), 1 deletion(-)
>  create mode 100644 src/dio-writeback-race.c
>  create mode 100755 tests/generic/761
>  create mode 100644 tests/generic/761.out
>
> diff --git a/.gitignore b/.gitignore
> index efd477738e1e..7060f52cf6b8 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -210,6 +210,7 @@ tags
>  /src/perf/*.pyc
>  /src/fiemap-fault
>  /src/min_dio_alignment
> +/src/dio-writeback-race
>
>  # Symlinked files
>  /tests/generic/035.out
> diff --git a/src/Makefile b/src/Makefile
> index 1417c383863e..6ac72b366257 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -20,7 +20,8 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
>         t_get_file_time t_create_short_dirs t_create_long_dirs t_enospc \
>         t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
>         t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
> -       readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd
> +       readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
> +       dio-writeback-race
>
>  LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>         preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
> diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
> new file mode 100644
> index 000000000000..f0a2f6de531b
> --- /dev/null
> +++ b/src/dio-writeback-race.c
> @@ -0,0 +1,148 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * dio_writeback_race -- test direct IO writes with the contents of
> + * the buffer changed during writeback.
> + *
> + * Copyright (C) 2025 SUSE Linux Products GmbH.
> + */
> +
> +/*
> + * dio_writeback_race
> + *
> + * Compile:
> + *
> + *   gcc -Wall -pthread -o dio_writeback_race dio_writeback_race.c
> + *
> + * Make sure filesystems with explicit data checksum can handle direct IO
> + * writes correctly, so that even the contents of the direct IO buffer changes
> + * during writeback, the fs should still work fine without data checksum mismatch.
> + * (Normally by falling back to buffer IO to keep the checksum and contents
> + *  consistent)
> + *
> + * Usage:
> + *
> + *   dio_writeback_race [-b <buffersize>] [-s <filesize>] <filename>
> + *
> + * Where:
> + *
> + *   <filename> is the name of the test file to create, if the file exists
> + *   it will be overwritten.
> + *
> + *   <buffersize> is the buffer size for direct IO. Users are responsible to
> + *   probe the correct direct IO size and alignment requirement.
> + *   If not specified, the default value will be 4096.
> + *
> + *   <filesize> is the total size of the test file. If not aligned to <blocksize>
> + *   the result file size will be rounded up to <blocksize>.
> + *   If not specified, the default value will be 64MiB.
> + */
> +
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <pthread.h>
> +#include <unistd.h>
> +#include <getopt.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/stat.h>
> +
> +static int fd = -1;
> +static void *buf = NULL;
> +static unsigned long long filesize = 64 * 1024 * 1024;
> +static int blocksize = 4096;
> +
> +const int orig_pattern = 0xff;
> +const int modify_pattern = 0x00;
> +
> +static void *do_io(void *arg)
> +{
> +       ssize_t ret;
> +
> +       ret = write(fd, buf, blocksize);
> +       pthread_exit((void *)ret);
> +}
> +
> +static void *do_modify(void *arg)
> +{
> +       memset(buf, modify_pattern, blocksize);
> +       pthread_exit(NULL);
> +}
> +
> +int main (int argc, char *argv[])
> +{
> +       pthread_t io_thread;
> +       pthread_t modify_thread;
> +       unsigned long long filepos;
> +       int opt;
> +       int ret = -EINVAL;
> +
> +       while ((opt = getopt(argc, argv, "b:s:")) > 0) {
> +               switch (opt) {
> +               case 'b':
> +                       blocksize = atoi(optarg);
> +                       if (blocksize == 0) {
> +                               fprintf(stderr, "invalid blocksize '%s'\n", optarg);
> +                               goto error;
> +                       }
> +                       break;
> +               case 's':
> +                       filesize = atoll(optarg);
> +                       if (filesize == 0) {
> +                               fprintf(stderr, "invalid filesize '%s'\n", optarg);
> +                               goto error;
> +                       }
> +                       break;
> +               default:
> +                       fprintf(stderr, "unknown option '%c'\n", opt);
> +                       goto error;
> +               }
> +       }
> +       if (optind >= argc) {
> +               fprintf(stderr, "missing argument\n");
> +               goto error;
> +       }
> +       ret = posix_memalign(&buf, blocksize, blocksize);
> +       if (!buf) {
> +               fprintf(stderr, "failed to allocate aligned memory\n");
> +               exit(EXIT_FAILURE);
> +       }
> +       fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
> +       if (fd < 0) {
> +               fprintf(stderr, "failed to open file '%s': %m\n", argv[2]);
> +               goto error;
> +       }
> +
> +       for (filepos = 0; filepos < filesize; filepos += blocksize) {
> +               void *retval = NULL;
> +
> +               memset(buf, orig_pattern, blocksize);
> +               pthread_create(&io_thread, NULL, do_io, NULL);
> +               pthread_create(&modify_thread, NULL, do_modify, NULL);
> +               ret = pthread_join(io_thread, &retval);
> +               if (ret) {
> +                       errno = ret;
> +                       ret = -ret;
> +                       fprintf(stderr, "failed to join io thread: %m\n");
> +                       goto error;
> +               }
> +               if ((ssize_t )retval < blocksize) {
> +                       ret = -EIO;
> +                       fprintf(stderr, "io thread failed\n");
> +                       goto error;
> +               }
> +               ret = pthread_join(modify_thread, NULL);
> +               if (ret) {
> +                       errno = ret;
> +                       ret = -ret;
> +                       fprintf(stderr, "failed to join modify thread: %m\n");
> +                       goto error;
> +               }
> +       }
> +error:
> +       close(fd);
> +       free(buf);
> +       if (ret < 0)
> +               return EXIT_FAILURE;
> +       return EXIT_SUCCESS;
> +}
> diff --git a/tests/generic/761 b/tests/generic/761
> new file mode 100755
> index 000000000000..422b716d31b6
> --- /dev/null
> +++ b/tests/generic/761
> @@ -0,0 +1,41 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2025 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 761
> +#
> +# Making sure direct IO (O_DIRECT) writes won't cause any data checksum mismatch,
> +# even if the contents of the buffer changes during writeback.
> +#
> +# This is mostly for filesystems with data checksum support, which should fallback
> +# to buffer IO to avoid inconsistency.
> +# For filesystems without data checksum support, nothing needs to be bothered.
> +#
> +
> +. ./common/preamble
> +_begin_fstest auto quick
> +
> +_require_scratch
> +_require_odirect
> +_require_test_program dio-writeback-race
> +_fixed_by_kernel_commit XXXXXXXX \
> +       "btrfs: always fallback to buffered write if the inode requires checksum"
> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +blocksize=$(_get_block_size $SCRATCH_MNT)
> +filesize=$(( 64 * 1024 * 1024))
> +
> +echo "blocksize=$blocksize filesize=$filesize" >> $seqres.full
> +
> +$here/src/dio-writeback-race -b $blocksize -s $filesize $SCRATCH_MNT/foobar
> +
> +# Read out the file, which should trigger checksum verification
> +cat $SCRATCH_MNT/foobar > /dev/null
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/761.out b/tests/generic/761.out
> new file mode 100644
> index 000000000000..72ebba4cb426
> --- /dev/null
> +++ b/tests/generic/761.out
> @@ -0,0 +1,2 @@
> +QA output created by 761
> +Silence is golden
> --
> 2.48.1
>
>
David Disseldorp Feb. 11, 2025, 12:19 p.m. UTC | #2
On Tue, 11 Feb 2025 16:22:57 +1030, Qu Wenruo wrote:

...
> diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
> new file mode 100644
> index 000000000000..f0a2f6de531b
> --- /dev/null
> +++ b/src/dio-writeback-race.c
> @@ -0,0 +1,148 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * dio_writeback_race -- test direct IO writes with the contents of
> + * the buffer changed during writeback.
> + *
> + * Copyright (C) 2025 SUSE Linux Products GmbH.
> + */
> +
> +/*
> + * dio_writeback_race
> + *
> + * Compile:
> + *
> + *   gcc -Wall -pthread -o dio_writeback_race dio_writeback_race.c

nit: hyphens instead of underscores: dio-writeback-race.c

...
> +	fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
> +	if (fd < 0) {
> +		fprintf(stderr, "failed to open file '%s': %m\n", argv[2]);

argv[optind] instead of argv[2] in the error msg. With that fixed:

Reviewed-by: David Disseldorp <ddiss@suse.de>
Qu Wenruo Feb. 11, 2025, 8:49 p.m. UTC | #3
在 2025/2/11 22:00, Daniel Vacek 写道:
> On Tue, 11 Feb 2025 at 11:16, Qu Wenruo <wqu@suse.com> wrote:
>>
>> There is a long existing btrfs problem that if some one is modifying the
>> buffer of an on-going direct IO write, it has a very high chance causing
>> permanent data checksum mismatch.
>>
>> This is caused by the following factors:
>>
>> - Direct IO buffer is out of the control of filesystem
>>    Thus user space can modify the contents during writeback.
>>
>> - Btrfs generate its data checksum just before submitting the bio
>>    This means if the contents happens after the checksum is generated,
>>    the data written to disk will no longer match the checksum.
>>
>> Btrfs later fixes the problem by forcing the direct IO to fallback to
>> buffered IO (if the inode requires data checksum), so that btrfs can
>> have a consistent view of the buffer.
>
> Would it be a bad idea if btrfs remapped the page read-only for the
> duration of the writeback instead?

If you mean the direct IO source buffer, I do not think MM people will
be happy with that.

The direct source memory can be any memory.
 From user space memory to mmaped page cache or whatever weird memory type.

> Eventual page fault could either
> block until the writeback is finished (which may be an unwanted
> behavior) or it could map another copy of the data. This would keep
> the direct IO behavior for applications which do not change the data
> after submitting the IO.

HCH mentioned the similar idea, but to my limited understanding, it will
be a huge infrastructure change and no one is really actively looking
into this.
(I bet we will see shared pages between different address space before
this, and the shared pages idea is already stalled for years)

And if even MM experts are not looking into this, I'm not sure if it's
really a good idea.

>
> Also, I'd maybe add a tracepoint or dynamic debug print (-once?) when
> falling back to buffered IO.

It is not bringing much benefit AFAIK.

Tracepoints are for debug purposes and has almost zero overhead when not
enabled, thus -once makes no sense.

We have some existing OE related trace point that can help:

- trace_btrfs_finish_ordered_extent()
   Which contains the flag of the OE, if it's really a direct IO it will
   have BTRFS_ORDERED_DIRECT flag.

Although there are not enough events for things like direct IO itself,
but it's really out of the realm of this fix.

Feel free to enhance the trace events in another series.

Thanks,
Qu

>
>> This test case will verify the behavior by:
>>
>> - Create a helper program 'dio-writeback-race'
>>    Which does direct IO writes block-by-block, and the buffer is always
>>    initialized to all 0xff before write,
>>    Then starting two threads:
>>    - One to submit the direct IO
>>    - One to modify the buffer to 0x00
>>
>>    The program uses 4K as default block size, and 64MiB as the default
>>    file size.
>>    Which is more than enough to trigger tons of btrfs checksum errors
>>    on unpatched kernels.
>>
>> - New test case generic/761
>>    Which will:
>>
>>    * Use above program to create a 64MiB file
>>
>>    * Do buffered read on that file
>>      Since above program is doing direct IO, there is no page cache
>>      populated.
>>      And the buffered read will need to read out all data from the disk,
>>      and if the filesystem supports data checksum, then the data checksum
>>      will also be verified against the data.
>>
>> The test case passes on the following fses:
>> - ext4
>> - xfs
>> - btrfs with "nodatasum" mount option
>>    No data checksum to bother.
>>
>> - btrfs with default "datasum" mount option and the fix "btrfs: always
>>    fallback to buffered write if the inode requires checksum"
>>    This makes btrfs to fallback on buffered IO so the contents won't
>>    change during writeback of page cache.
>>
>> And fails on the following fses:
>>
>> - btrfs with default "datasum" mount option and without the fix
>>    Expected.
>>
>> Suggested-by: Christoph Hellwig <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Fix the comment on the default block size of dio-writeback-race
>> - Use proper type for pthread_exit() of do_io() function
>> - Fix the error message when filesize is invalid
>> - Fix the error message when unknown option is parsed
>> - Catch the thread return value correctly for pthread_join() on IO thread
>> - Always update @ret
>> - Return EXIT_SUCCESS/FAILURE based on @ret at error: tag
>> - Check the return value of pthread_join() correctly
>> - Remove unused cleanup override/include comments from the test case
>> - Add the missing fixed-by tag
>> ---
>>   .gitignore               |   1 +
>>   src/Makefile             |   3 +-
>>   src/dio-writeback-race.c | 148 +++++++++++++++++++++++++++++++++++++++
>>   tests/generic/761        |  41 +++++++++++
>>   tests/generic/761.out    |   2 +
>>   5 files changed, 194 insertions(+), 1 deletion(-)
>>   create mode 100644 src/dio-writeback-race.c
>>   create mode 100755 tests/generic/761
>>   create mode 100644 tests/generic/761.out
>>
>> diff --git a/.gitignore b/.gitignore
>> index efd477738e1e..7060f52cf6b8 100644
>> --- a/.gitignore
>> +++ b/.gitignore
>> @@ -210,6 +210,7 @@ tags
>>   /src/perf/*.pyc
>>   /src/fiemap-fault
>>   /src/min_dio_alignment
>> +/src/dio-writeback-race
>>
>>   # Symlinked files
>>   /tests/generic/035.out
>> diff --git a/src/Makefile b/src/Makefile
>> index 1417c383863e..6ac72b366257 100644
>> --- a/src/Makefile
>> +++ b/src/Makefile
>> @@ -20,7 +20,8 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
>>          t_get_file_time t_create_short_dirs t_create_long_dirs t_enospc \
>>          t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
>>          t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
>> -       readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd
>> +       readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
>> +       dio-writeback-race
>>
>>   LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>>          preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
>> diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
>> new file mode 100644
>> index 000000000000..f0a2f6de531b
>> --- /dev/null
>> +++ b/src/dio-writeback-race.c
>> @@ -0,0 +1,148 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * dio_writeback_race -- test direct IO writes with the contents of
>> + * the buffer changed during writeback.
>> + *
>> + * Copyright (C) 2025 SUSE Linux Products GmbH.
>> + */
>> +
>> +/*
>> + * dio_writeback_race
>> + *
>> + * Compile:
>> + *
>> + *   gcc -Wall -pthread -o dio_writeback_race dio_writeback_race.c
>> + *
>> + * Make sure filesystems with explicit data checksum can handle direct IO
>> + * writes correctly, so that even the contents of the direct IO buffer changes
>> + * during writeback, the fs should still work fine without data checksum mismatch.
>> + * (Normally by falling back to buffer IO to keep the checksum and contents
>> + *  consistent)
>> + *
>> + * Usage:
>> + *
>> + *   dio_writeback_race [-b <buffersize>] [-s <filesize>] <filename>
>> + *
>> + * Where:
>> + *
>> + *   <filename> is the name of the test file to create, if the file exists
>> + *   it will be overwritten.
>> + *
>> + *   <buffersize> is the buffer size for direct IO. Users are responsible to
>> + *   probe the correct direct IO size and alignment requirement.
>> + *   If not specified, the default value will be 4096.
>> + *
>> + *   <filesize> is the total size of the test file. If not aligned to <blocksize>
>> + *   the result file size will be rounded up to <blocksize>.
>> + *   If not specified, the default value will be 64MiB.
>> + */
>> +
>> +#include <fcntl.h>
>> +#include <stdlib.h>
>> +#include <stdio.h>
>> +#include <pthread.h>
>> +#include <unistd.h>
>> +#include <getopt.h>
>> +#include <string.h>
>> +#include <errno.h>
>> +#include <sys/stat.h>
>> +
>> +static int fd = -1;
>> +static void *buf = NULL;
>> +static unsigned long long filesize = 64 * 1024 * 1024;
>> +static int blocksize = 4096;
>> +
>> +const int orig_pattern = 0xff;
>> +const int modify_pattern = 0x00;
>> +
>> +static void *do_io(void *arg)
>> +{
>> +       ssize_t ret;
>> +
>> +       ret = write(fd, buf, blocksize);
>> +       pthread_exit((void *)ret);
>> +}
>> +
>> +static void *do_modify(void *arg)
>> +{
>> +       memset(buf, modify_pattern, blocksize);
>> +       pthread_exit(NULL);
>> +}
>> +
>> +int main (int argc, char *argv[])
>> +{
>> +       pthread_t io_thread;
>> +       pthread_t modify_thread;
>> +       unsigned long long filepos;
>> +       int opt;
>> +       int ret = -EINVAL;
>> +
>> +       while ((opt = getopt(argc, argv, "b:s:")) > 0) {
>> +               switch (opt) {
>> +               case 'b':
>> +                       blocksize = atoi(optarg);
>> +                       if (blocksize == 0) {
>> +                               fprintf(stderr, "invalid blocksize '%s'\n", optarg);
>> +                               goto error;
>> +                       }
>> +                       break;
>> +               case 's':
>> +                       filesize = atoll(optarg);
>> +                       if (filesize == 0) {
>> +                               fprintf(stderr, "invalid filesize '%s'\n", optarg);
>> +                               goto error;
>> +                       }
>> +                       break;
>> +               default:
>> +                       fprintf(stderr, "unknown option '%c'\n", opt);
>> +                       goto error;
>> +               }
>> +       }
>> +       if (optind >= argc) {
>> +               fprintf(stderr, "missing argument\n");
>> +               goto error;
>> +       }
>> +       ret = posix_memalign(&buf, blocksize, blocksize);
>> +       if (!buf) {
>> +               fprintf(stderr, "failed to allocate aligned memory\n");
>> +               exit(EXIT_FAILURE);
>> +       }
>> +       fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
>> +       if (fd < 0) {
>> +               fprintf(stderr, "failed to open file '%s': %m\n", argv[2]);
>> +               goto error;
>> +       }
>> +
>> +       for (filepos = 0; filepos < filesize; filepos += blocksize) {
>> +               void *retval = NULL;
>> +
>> +               memset(buf, orig_pattern, blocksize);
>> +               pthread_create(&io_thread, NULL, do_io, NULL);
>> +               pthread_create(&modify_thread, NULL, do_modify, NULL);
>> +               ret = pthread_join(io_thread, &retval);
>> +               if (ret) {
>> +                       errno = ret;
>> +                       ret = -ret;
>> +                       fprintf(stderr, "failed to join io thread: %m\n");
>> +                       goto error;
>> +               }
>> +               if ((ssize_t )retval < blocksize) {
>> +                       ret = -EIO;
>> +                       fprintf(stderr, "io thread failed\n");
>> +                       goto error;
>> +               }
>> +               ret = pthread_join(modify_thread, NULL);
>> +               if (ret) {
>> +                       errno = ret;
>> +                       ret = -ret;
>> +                       fprintf(stderr, "failed to join modify thread: %m\n");
>> +                       goto error;
>> +               }
>> +       }
>> +error:
>> +       close(fd);
>> +       free(buf);
>> +       if (ret < 0)
>> +               return EXIT_FAILURE;
>> +       return EXIT_SUCCESS;
>> +}
>> diff --git a/tests/generic/761 b/tests/generic/761
>> new file mode 100755
>> index 000000000000..422b716d31b6
>> --- /dev/null
>> +++ b/tests/generic/761
>> @@ -0,0 +1,41 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2025 SUSE Linux Products GmbH. All Rights Reserved.
>> +#
>> +# FS QA Test 761
>> +#
>> +# Making sure direct IO (O_DIRECT) writes won't cause any data checksum mismatch,
>> +# even if the contents of the buffer changes during writeback.
>> +#
>> +# This is mostly for filesystems with data checksum support, which should fallback
>> +# to buffer IO to avoid inconsistency.
>> +# For filesystems without data checksum support, nothing needs to be bothered.
>> +#
>> +
>> +. ./common/preamble
>> +_begin_fstest auto quick
>> +
>> +_require_scratch
>> +_require_odirect
>> +_require_test_program dio-writeback-race
>> +_fixed_by_kernel_commit XXXXXXXX \
>> +       "btrfs: always fallback to buffered write if the inode requires checksum"
>> +
>> +_scratch_mkfs > $seqres.full 2>&1
>> +_scratch_mount
>> +
>> +blocksize=$(_get_block_size $SCRATCH_MNT)
>> +filesize=$(( 64 * 1024 * 1024))
>> +
>> +echo "blocksize=$blocksize filesize=$filesize" >> $seqres.full
>> +
>> +$here/src/dio-writeback-race -b $blocksize -s $filesize $SCRATCH_MNT/foobar
>> +
>> +# Read out the file, which should trigger checksum verification
>> +cat $SCRATCH_MNT/foobar > /dev/null
>> +
>> +echo "Silence is golden"
>> +
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/generic/761.out b/tests/generic/761.out
>> new file mode 100644
>> index 000000000000..72ebba4cb426
>> --- /dev/null
>> +++ b/tests/generic/761.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 761
>> +Silence is golden
>> --
>> 2.48.1
>>
>>
>
Zorro Lang Feb. 12, 2025, 2:55 a.m. UTC | #4
On Tue, Feb 11, 2025 at 04:22:57PM +1030, Qu Wenruo wrote:
> There is a long existing btrfs problem that if some one is modifying the
> buffer of an on-going direct IO write, it has a very high chance causing
> permanent data checksum mismatch.
> 
> This is caused by the following factors:
> 
> - Direct IO buffer is out of the control of filesystem
>   Thus user space can modify the contents during writeback.
> 
> - Btrfs generate its data checksum just before submitting the bio
>   This means if the contents happens after the checksum is generated,
>   the data written to disk will no longer match the checksum.
> 
> Btrfs later fixes the problem by forcing the direct IO to fallback to
> buffered IO (if the inode requires data checksum), so that btrfs can
> have a consistent view of the buffer.
> 
> This test case will verify the behavior by:
> 
> - Create a helper program 'dio-writeback-race'
>   Which does direct IO writes block-by-block, and the buffer is always
>   initialized to all 0xff before write,
>   Then starting two threads:
>   - One to submit the direct IO
>   - One to modify the buffer to 0x00
> 
>   The program uses 4K as default block size, and 64MiB as the default
>   file size.
>   Which is more than enough to trigger tons of btrfs checksum errors
>   on unpatched kernels.
> 
> - New test case generic/761
>   Which will:
> 
>   * Use above program to create a 64MiB file
> 
>   * Do buffered read on that file
>     Since above program is doing direct IO, there is no page cache
>     populated.
>     And the buffered read will need to read out all data from the disk,
>     and if the filesystem supports data checksum, then the data checksum
>     will also be verified against the data.
> 
> The test case passes on the following fses:
> - ext4
> - xfs
> - btrfs with "nodatasum" mount option
>   No data checksum to bother.
> 
> - btrfs with default "datasum" mount option and the fix "btrfs: always
>   fallback to buffered write if the inode requires checksum"
>   This makes btrfs to fallback on buffered IO so the contents won't
>   change during writeback of page cache.
> 
> And fails on the following fses:
> 
> - btrfs with default "datasum" mount option and without the fix
>   Expected.
> 
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Fix the comment on the default block size of dio-writeback-race
> - Use proper type for pthread_exit() of do_io() function
> - Fix the error message when filesize is invalid
> - Fix the error message when unknown option is parsed
> - Catch the thread return value correctly for pthread_join() on IO thread
> - Always update @ret
> - Return EXIT_SUCCESS/FAILURE based on @ret at error: tag
> - Check the return value of pthread_join() correctly
> - Remove unused cleanup override/include comments from the test case
> - Add the missing fixed-by tag
> ---
>  .gitignore               |   1 +
>  src/Makefile             |   3 +-
>  src/dio-writeback-race.c | 148 +++++++++++++++++++++++++++++++++++++++
>  tests/generic/761        |  41 +++++++++++
>  tests/generic/761.out    |   2 +
>  5 files changed, 194 insertions(+), 1 deletion(-)
>  create mode 100644 src/dio-writeback-race.c
>  create mode 100755 tests/generic/761
>  create mode 100644 tests/generic/761.out
> 
> diff --git a/.gitignore b/.gitignore
> index efd477738e1e..7060f52cf6b8 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -210,6 +210,7 @@ tags
>  /src/perf/*.pyc
>  /src/fiemap-fault
>  /src/min_dio_alignment
> +/src/dio-writeback-race
>  
>  # Symlinked files
>  /tests/generic/035.out
> diff --git a/src/Makefile b/src/Makefile
> index 1417c383863e..6ac72b366257 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -20,7 +20,8 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
>  	t_get_file_time t_create_short_dirs t_create_long_dirs t_enospc \
>  	t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
>  	t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
> -	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd
> +	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
> +	dio-writeback-race
>  
>  LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>  	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
> diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
> new file mode 100644
> index 000000000000..f0a2f6de531b
> --- /dev/null
> +++ b/src/dio-writeback-race.c
> @@ -0,0 +1,148 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * dio_writeback_race -- test direct IO writes with the contents of
> + * the buffer changed during writeback.
> + *
> + * Copyright (C) 2025 SUSE Linux Products GmbH.
> + */
> +
> +/*
> + * dio_writeback_race
> + *
> + * Compile:
> + *
> + *   gcc -Wall -pthread -o dio_writeback_race dio_writeback_race.c
> + *
> + * Make sure filesystems with explicit data checksum can handle direct IO
> + * writes correctly, so that even the contents of the direct IO buffer changes
> + * during writeback, the fs should still work fine without data checksum mismatch.
> + * (Normally by falling back to buffer IO to keep the checksum and contents
> + *  consistent)
> + *
> + * Usage:
> + *
> + *   dio_writeback_race [-b <buffersize>] [-s <filesize>] <filename>
> + *
> + * Where:
> + *
> + *   <filename> is the name of the test file to create, if the file exists
> + *   it will be overwritten.
> + *
> + *   <buffersize> is the buffer size for direct IO. Users are responsible to
> + *   probe the correct direct IO size and alignment requirement.
> + *   If not specified, the default value will be 4096.
> + *
> + *   <filesize> is the total size of the test file. If not aligned to <blocksize>
> + *   the result file size will be rounded up to <blocksize>.
> + *   If not specified, the default value will be 64MiB.
> + */
> +
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <pthread.h>
> +#include <unistd.h>
> +#include <getopt.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/stat.h>
> +
> +static int fd = -1;
> +static void *buf = NULL;
> +static unsigned long long filesize = 64 * 1024 * 1024;
> +static int blocksize = 4096;
> +
> +const int orig_pattern = 0xff;
> +const int modify_pattern = 0x00;
> +
> +static void *do_io(void *arg)
> +{
> +	ssize_t ret;
> +
> +	ret = write(fd, buf, blocksize);
> +	pthread_exit((void *)ret);
> +}
> +
> +static void *do_modify(void *arg)
> +{
> +	memset(buf, modify_pattern, blocksize);
> +	pthread_exit(NULL);
> +}
> +
> +int main (int argc, char *argv[])
> +{
> +	pthread_t io_thread;
> +	pthread_t modify_thread;
> +	unsigned long long filepos;
> +	int opt;
> +	int ret = -EINVAL;
> +
> +	while ((opt = getopt(argc, argv, "b:s:")) > 0) {
> +		switch (opt) {
> +		case 'b':
> +			blocksize = atoi(optarg);
> +			if (blocksize == 0) {
> +				fprintf(stderr, "invalid blocksize '%s'\n", optarg);
> +				goto error;
> +			}
> +			break;
> +		case 's':
> +			filesize = atoll(optarg);
> +			if (filesize == 0) {
> +				fprintf(stderr, "invalid filesize '%s'\n", optarg);
> +				goto error;
> +			}
> +			break;
> +		default:
> +			fprintf(stderr, "unknown option '%c'\n", opt);
> +			goto error;
> +		}
> +	}
> +	if (optind >= argc) {
> +		fprintf(stderr, "missing argument\n");
> +		goto error;
> +	}
> +	ret = posix_memalign(&buf, blocksize, blocksize);
> +	if (!buf) {
> +		fprintf(stderr, "failed to allocate aligned memory\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
> +	if (fd < 0) {
> +		fprintf(stderr, "failed to open file '%s': %m\n", argv[2]);
> +		goto error;
> +	}
> +
> +	for (filepos = 0; filepos < filesize; filepos += blocksize) {
> +		void *retval = NULL;
> +
> +		memset(buf, orig_pattern, blocksize);
> +		pthread_create(&io_thread, NULL, do_io, NULL);
> +		pthread_create(&modify_thread, NULL, do_modify, NULL);
> +		ret = pthread_join(io_thread, &retval);
> +		if (ret) {
> +			errno = ret;
> +			ret = -ret;
> +			fprintf(stderr, "failed to join io thread: %m\n");
> +			goto error;
> +		}
> +		if ((ssize_t )retval < blocksize) {
> +			ret = -EIO;
> +			fprintf(stderr, "io thread failed\n");
> +			goto error;
> +		}
> +		ret = pthread_join(modify_thread, NULL);
> +		if (ret) {
> +			errno = ret;
> +			ret = -ret;
> +			fprintf(stderr, "failed to join modify thread: %m\n");
> +			goto error;
> +		}
> +	}
> +error:
> +	close(fd);
> +	free(buf);
> +	if (ret < 0)
> +		return EXIT_FAILURE;
> +	return EXIT_SUCCESS;
> +}
> diff --git a/tests/generic/761 b/tests/generic/761
> new file mode 100755
> index 000000000000..422b716d31b6
> --- /dev/null
> +++ b/tests/generic/761
> @@ -0,0 +1,41 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2025 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 761
> +#
> +# Making sure direct IO (O_DIRECT) writes won't cause any data checksum mismatch,
> +# even if the contents of the buffer changes during writeback.
> +#
> +# This is mostly for filesystems with data checksum support, which should fallback
> +# to buffer IO to avoid inconsistency.
> +# For filesystems without data checksum support, nothing needs to be bothered.
> +#
> +
> +. ./common/preamble
> +_begin_fstest auto quick
> +
> +_require_scratch
> +_require_odirect
> +_require_test_program dio-writeback-race
> +_fixed_by_kernel_commit XXXXXXXX \
> +	"btrfs: always fallback to buffered write if the inode requires checksum"
> +
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount
> +
> +blocksize=$(_get_block_size $SCRATCH_MNT)

_get_file_block_size ? Others looks good to me.

Reviewed-by: Zorro Lang <zlang@redhat.com>

I'll help to do these changes from all review points when I merge it, if you
don't want to change it more.

Thanks,
Zorro

> +filesize=$(( 64 * 1024 * 1024))
> +
> +echo "blocksize=$blocksize filesize=$filesize" >> $seqres.full
> +
> +$here/src/dio-writeback-race -b $blocksize -s $filesize $SCRATCH_MNT/foobar
> +
> +# Read out the file, which should trigger checksum verification
> +cat $SCRATCH_MNT/foobar > /dev/null
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/761.out b/tests/generic/761.out
> new file mode 100644
> index 000000000000..72ebba4cb426
> --- /dev/null
> +++ b/tests/generic/761.out
> @@ -0,0 +1,2 @@
> +QA output created by 761
> +Silence is golden
> -- 
> 2.48.1
> 
>
Qu Wenruo Feb. 12, 2025, 3:02 a.m. UTC | #5
在 2025/2/12 13:25, Zorro Lang 写道:
> On Tue, Feb 11, 2025 at 04:22:57PM +1030, Qu Wenruo wrote:
>> There is a long existing btrfs problem that if some one is modifying the
>> buffer of an on-going direct IO write, it has a very high chance causing
>> permanent data checksum mismatch.
>>
>> This is caused by the following factors:
>>
>> - Direct IO buffer is out of the control of filesystem
>>    Thus user space can modify the contents during writeback.
>>
>> - Btrfs generate its data checksum just before submitting the bio
>>    This means if the contents happens after the checksum is generated,
>>    the data written to disk will no longer match the checksum.
>>
>> Btrfs later fixes the problem by forcing the direct IO to fallback to
>> buffered IO (if the inode requires data checksum), so that btrfs can
>> have a consistent view of the buffer.
>>
>> This test case will verify the behavior by:
>>
>> - Create a helper program 'dio-writeback-race'
>>    Which does direct IO writes block-by-block, and the buffer is always
>>    initialized to all 0xff before write,
>>    Then starting two threads:
>>    - One to submit the direct IO
>>    - One to modify the buffer to 0x00
>>
>>    The program uses 4K as default block size, and 64MiB as the default
>>    file size.
>>    Which is more than enough to trigger tons of btrfs checksum errors
>>    on unpatched kernels.
>>
>> - New test case generic/761
>>    Which will:
>>
>>    * Use above program to create a 64MiB file
>>
>>    * Do buffered read on that file
>>      Since above program is doing direct IO, there is no page cache
>>      populated.
>>      And the buffered read will need to read out all data from the disk,
>>      and if the filesystem supports data checksum, then the data checksum
>>      will also be verified against the data.
>>
>> The test case passes on the following fses:
>> - ext4
>> - xfs
>> - btrfs with "nodatasum" mount option
>>    No data checksum to bother.
>>
>> - btrfs with default "datasum" mount option and the fix "btrfs: always
>>    fallback to buffered write if the inode requires checksum"
>>    This makes btrfs to fallback on buffered IO so the contents won't
>>    change during writeback of page cache.
>>
>> And fails on the following fses:
>>
>> - btrfs with default "datasum" mount option and without the fix
>>    Expected.
>>
>> Suggested-by: Christoph Hellwig <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Fix the comment on the default block size of dio-writeback-race
>> - Use proper type for pthread_exit() of do_io() function
>> - Fix the error message when filesize is invalid
>> - Fix the error message when unknown option is parsed
>> - Catch the thread return value correctly for pthread_join() on IO thread
>> - Always update @ret
>> - Return EXIT_SUCCESS/FAILURE based on @ret at error: tag
>> - Check the return value of pthread_join() correctly
>> - Remove unused cleanup override/include comments from the test case
>> - Add the missing fixed-by tag
>> ---
>>   .gitignore               |   1 +
>>   src/Makefile             |   3 +-
>>   src/dio-writeback-race.c | 148 +++++++++++++++++++++++++++++++++++++++
>>   tests/generic/761        |  41 +++++++++++
>>   tests/generic/761.out    |   2 +
>>   5 files changed, 194 insertions(+), 1 deletion(-)
>>   create mode 100644 src/dio-writeback-race.c
>>   create mode 100755 tests/generic/761
>>   create mode 100644 tests/generic/761.out
>>
>> diff --git a/.gitignore b/.gitignore
>> index efd477738e1e..7060f52cf6b8 100644
>> --- a/.gitignore
>> +++ b/.gitignore
>> @@ -210,6 +210,7 @@ tags
>>   /src/perf/*.pyc
>>   /src/fiemap-fault
>>   /src/min_dio_alignment
>> +/src/dio-writeback-race
>>
>>   # Symlinked files
>>   /tests/generic/035.out
>> diff --git a/src/Makefile b/src/Makefile
>> index 1417c383863e..6ac72b366257 100644
>> --- a/src/Makefile
>> +++ b/src/Makefile
>> @@ -20,7 +20,8 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
>>   	t_get_file_time t_create_short_dirs t_create_long_dirs t_enospc \
>>   	t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
>>   	t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
>> -	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd
>> +	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
>> +	dio-writeback-race
>>
>>   LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>>   	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
>> diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
>> new file mode 100644
>> index 000000000000..f0a2f6de531b
>> --- /dev/null
>> +++ b/src/dio-writeback-race.c
>> @@ -0,0 +1,148 @@
>> +// SPDX-License-Identifier: GPL-2.0+
>> +/*
>> + * dio_writeback_race -- test direct IO writes with the contents of
>> + * the buffer changed during writeback.
>> + *
>> + * Copyright (C) 2025 SUSE Linux Products GmbH.
>> + */
>> +
>> +/*
>> + * dio_writeback_race
>> + *
>> + * Compile:
>> + *
>> + *   gcc -Wall -pthread -o dio_writeback_race dio_writeback_race.c
>> + *
>> + * Make sure filesystems with explicit data checksum can handle direct IO
>> + * writes correctly, so that even the contents of the direct IO buffer changes
>> + * during writeback, the fs should still work fine without data checksum mismatch.
>> + * (Normally by falling back to buffer IO to keep the checksum and contents
>> + *  consistent)
>> + *
>> + * Usage:
>> + *
>> + *   dio_writeback_race [-b <buffersize>] [-s <filesize>] <filename>
>> + *
>> + * Where:
>> + *
>> + *   <filename> is the name of the test file to create, if the file exists
>> + *   it will be overwritten.
>> + *
>> + *   <buffersize> is the buffer size for direct IO. Users are responsible to
>> + *   probe the correct direct IO size and alignment requirement.
>> + *   If not specified, the default value will be 4096.
>> + *
>> + *   <filesize> is the total size of the test file. If not aligned to <blocksize>
>> + *   the result file size will be rounded up to <blocksize>.
>> + *   If not specified, the default value will be 64MiB.
>> + */
>> +
>> +#include <fcntl.h>
>> +#include <stdlib.h>
>> +#include <stdio.h>
>> +#include <pthread.h>
>> +#include <unistd.h>
>> +#include <getopt.h>
>> +#include <string.h>
>> +#include <errno.h>
>> +#include <sys/stat.h>
>> +
>> +static int fd = -1;
>> +static void *buf = NULL;
>> +static unsigned long long filesize = 64 * 1024 * 1024;
>> +static int blocksize = 4096;
>> +
>> +const int orig_pattern = 0xff;
>> +const int modify_pattern = 0x00;
>> +
>> +static void *do_io(void *arg)
>> +{
>> +	ssize_t ret;
>> +
>> +	ret = write(fd, buf, blocksize);
>> +	pthread_exit((void *)ret);
>> +}
>> +
>> +static void *do_modify(void *arg)
>> +{
>> +	memset(buf, modify_pattern, blocksize);
>> +	pthread_exit(NULL);
>> +}
>> +
>> +int main (int argc, char *argv[])
>> +{
>> +	pthread_t io_thread;
>> +	pthread_t modify_thread;
>> +	unsigned long long filepos;
>> +	int opt;
>> +	int ret = -EINVAL;
>> +
>> +	while ((opt = getopt(argc, argv, "b:s:")) > 0) {
>> +		switch (opt) {
>> +		case 'b':
>> +			blocksize = atoi(optarg);
>> +			if (blocksize == 0) {
>> +				fprintf(stderr, "invalid blocksize '%s'\n", optarg);
>> +				goto error;
>> +			}
>> +			break;
>> +		case 's':
>> +			filesize = atoll(optarg);
>> +			if (filesize == 0) {
>> +				fprintf(stderr, "invalid filesize '%s'\n", optarg);
>> +				goto error;
>> +			}
>> +			break;
>> +		default:
>> +			fprintf(stderr, "unknown option '%c'\n", opt);
>> +			goto error;
>> +		}
>> +	}
>> +	if (optind >= argc) {
>> +		fprintf(stderr, "missing argument\n");
>> +		goto error;
>> +	}
>> +	ret = posix_memalign(&buf, blocksize, blocksize);
>> +	if (!buf) {
>> +		fprintf(stderr, "failed to allocate aligned memory\n");
>> +		exit(EXIT_FAILURE);
>> +	}
>> +	fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
>> +	if (fd < 0) {
>> +		fprintf(stderr, "failed to open file '%s': %m\n", argv[2]);
>> +		goto error;
>> +	}
>> +
>> +	for (filepos = 0; filepos < filesize; filepos += blocksize) {
>> +		void *retval = NULL;
>> +
>> +		memset(buf, orig_pattern, blocksize);
>> +		pthread_create(&io_thread, NULL, do_io, NULL);
>> +		pthread_create(&modify_thread, NULL, do_modify, NULL);
>> +		ret = pthread_join(io_thread, &retval);
>> +		if (ret) {
>> +			errno = ret;
>> +			ret = -ret;
>> +			fprintf(stderr, "failed to join io thread: %m\n");
>> +			goto error;
>> +		}
>> +		if ((ssize_t )retval < blocksize) {
>> +			ret = -EIO;
>> +			fprintf(stderr, "io thread failed\n");
>> +			goto error;
>> +		}
>> +		ret = pthread_join(modify_thread, NULL);
>> +		if (ret) {
>> +			errno = ret;
>> +			ret = -ret;
>> +			fprintf(stderr, "failed to join modify thread: %m\n");
>> +			goto error;
>> +		}
>> +	}
>> +error:
>> +	close(fd);
>> +	free(buf);
>> +	if (ret < 0)
>> +		return EXIT_FAILURE;
>> +	return EXIT_SUCCESS;
>> +}
>> diff --git a/tests/generic/761 b/tests/generic/761
>> new file mode 100755
>> index 000000000000..422b716d31b6
>> --- /dev/null
>> +++ b/tests/generic/761
>> @@ -0,0 +1,41 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2025 SUSE Linux Products GmbH. All Rights Reserved.
>> +#
>> +# FS QA Test 761
>> +#
>> +# Making sure direct IO (O_DIRECT) writes won't cause any data checksum mismatch,
>> +# even if the contents of the buffer changes during writeback.
>> +#
>> +# This is mostly for filesystems with data checksum support, which should fallback
>> +# to buffer IO to avoid inconsistency.
>> +# For filesystems without data checksum support, nothing needs to be bothered.
>> +#
>> +
>> +. ./common/preamble
>> +_begin_fstest auto quick
>> +
>> +_require_scratch
>> +_require_odirect
>> +_require_test_program dio-writeback-race
>> +_fixed_by_kernel_commit XXXXXXXX \
>> +	"btrfs: always fallback to buffered write if the inode requires checksum"
>> +
>> +_scratch_mkfs > $seqres.full 2>&1
>> +_scratch_mount
>> +
>> +blocksize=$(_get_block_size $SCRATCH_MNT)
>
> _get_file_block_size ? Others looks good to me.

Thanks for pointing out the helper, indeed there are some extra handling
for ocfs2 and xfs.

>
> Reviewed-by: Zorro Lang <zlang@redhat.com>
>
> I'll help to do these changes from all review points when I merge it, if you
> don't want to change it more.

No worry, I was just going to send out a v3, I will address the
_get_file_block_size() and add your tag in the newer version.

Thanks,
Qu

>
> Thanks,
> Zorro
>
>> +filesize=$(( 64 * 1024 * 1024))
>> +
>> +echo "blocksize=$blocksize filesize=$filesize" >> $seqres.full
>> +
>> +$here/src/dio-writeback-race -b $blocksize -s $filesize $SCRATCH_MNT/foobar
>> +
>> +# Read out the file, which should trigger checksum verification
>> +cat $SCRATCH_MNT/foobar > /dev/null
>> +
>> +echo "Silence is golden"
>> +
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/generic/761.out b/tests/generic/761.out
>> new file mode 100644
>> index 000000000000..72ebba4cb426
>> --- /dev/null
>> +++ b/tests/generic/761.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 761
>> +Silence is golden
>> --
>> 2.48.1
>>
>>
>
>
diff mbox series

Patch

diff --git a/.gitignore b/.gitignore
index efd477738e1e..7060f52cf6b8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -210,6 +210,7 @@  tags
 /src/perf/*.pyc
 /src/fiemap-fault
 /src/min_dio_alignment
+/src/dio-writeback-race
 
 # Symlinked files
 /tests/generic/035.out
diff --git a/src/Makefile b/src/Makefile
index 1417c383863e..6ac72b366257 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -20,7 +20,8 @@  TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
 	t_get_file_time t_create_short_dirs t_create_long_dirs t_enospc \
 	t_mmap_writev_overlap checkpoint_journal mmap-rw-fault allocstale \
 	t_mmap_cow_memory_failure fake-dump-rootino dio-buf-fault rewinddir-test \
-	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd
+	readdir-while-renames dio-append-buf-fault dio-write-fsync-same-fd \
+	dio-writeback-race
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/dio-writeback-race.c b/src/dio-writeback-race.c
new file mode 100644
index 000000000000..f0a2f6de531b
--- /dev/null
+++ b/src/dio-writeback-race.c
@@ -0,0 +1,148 @@ 
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * dio_writeback_race -- test direct IO writes with the contents of
+ * the buffer changed during writeback.
+ *
+ * Copyright (C) 2025 SUSE Linux Products GmbH.
+ */
+
+/*
+ * dio_writeback_race
+ *
+ * Compile:
+ *
+ *   gcc -Wall -pthread -o dio_writeback_race dio_writeback_race.c
+ *
+ * Make sure filesystems with explicit data checksum can handle direct IO
+ * writes correctly, so that even the contents of the direct IO buffer changes
+ * during writeback, the fs should still work fine without data checksum mismatch.
+ * (Normally by falling back to buffer IO to keep the checksum and contents
+ *  consistent)
+ *
+ * Usage:
+ *
+ *   dio_writeback_race [-b <buffersize>] [-s <filesize>] <filename>
+ *
+ * Where:
+ *
+ *   <filename> is the name of the test file to create, if the file exists
+ *   it will be overwritten.
+ *
+ *   <buffersize> is the buffer size for direct IO. Users are responsible to
+ *   probe the correct direct IO size and alignment requirement.
+ *   If not specified, the default value will be 4096.
+ *
+ *   <filesize> is the total size of the test file. If not aligned to <blocksize>
+ *   the result file size will be rounded up to <blocksize>.
+ *   If not specified, the default value will be 64MiB.
+ */
+
+#include <fcntl.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <pthread.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/stat.h>
+
+static int fd = -1;
+static void *buf = NULL;
+static unsigned long long filesize = 64 * 1024 * 1024;
+static int blocksize = 4096;
+
+const int orig_pattern = 0xff;
+const int modify_pattern = 0x00;
+
+static void *do_io(void *arg)
+{
+	ssize_t ret;
+
+	ret = write(fd, buf, blocksize);
+	pthread_exit((void *)ret);
+}
+
+static void *do_modify(void *arg)
+{
+	memset(buf, modify_pattern, blocksize);
+	pthread_exit(NULL);
+}
+
+int main (int argc, char *argv[])
+{
+	pthread_t io_thread;
+	pthread_t modify_thread;
+	unsigned long long filepos;
+	int opt;
+	int ret = -EINVAL;
+
+	while ((opt = getopt(argc, argv, "b:s:")) > 0) {
+		switch (opt) {
+		case 'b':
+			blocksize = atoi(optarg);
+			if (blocksize == 0) {
+				fprintf(stderr, "invalid blocksize '%s'\n", optarg);
+				goto error;
+			}
+			break;
+		case 's':
+			filesize = atoll(optarg);
+			if (filesize == 0) {
+				fprintf(stderr, "invalid filesize '%s'\n", optarg);
+				goto error;
+			}
+			break;
+		default:
+			fprintf(stderr, "unknown option '%c'\n", opt);
+			goto error;
+		}
+	}
+	if (optind >= argc) {
+		fprintf(stderr, "missing argument\n");
+		goto error;
+	}
+	ret = posix_memalign(&buf, blocksize, blocksize);
+	if (!buf) {
+		fprintf(stderr, "failed to allocate aligned memory\n");
+		exit(EXIT_FAILURE);
+	}
+	fd = open(argv[optind], O_DIRECT | O_WRONLY | O_CREAT);
+	if (fd < 0) {
+		fprintf(stderr, "failed to open file '%s': %m\n", argv[2]);
+		goto error;
+	}
+
+	for (filepos = 0; filepos < filesize; filepos += blocksize) {
+		void *retval = NULL;
+
+		memset(buf, orig_pattern, blocksize);
+		pthread_create(&io_thread, NULL, do_io, NULL);
+		pthread_create(&modify_thread, NULL, do_modify, NULL);
+		ret = pthread_join(io_thread, &retval);
+		if (ret) {
+			errno = ret;
+			ret = -ret;
+			fprintf(stderr, "failed to join io thread: %m\n");
+			goto error;
+		}
+		if ((ssize_t )retval < blocksize) {
+			ret = -EIO;
+			fprintf(stderr, "io thread failed\n");
+			goto error;
+		}
+		ret = pthread_join(modify_thread, NULL);
+		if (ret) {
+			errno = ret;
+			ret = -ret;
+			fprintf(stderr, "failed to join modify thread: %m\n");
+			goto error;
+		}
+	}
+error:
+	close(fd);
+	free(buf);
+	if (ret < 0)
+		return EXIT_FAILURE;
+	return EXIT_SUCCESS;
+}
diff --git a/tests/generic/761 b/tests/generic/761
new file mode 100755
index 000000000000..422b716d31b6
--- /dev/null
+++ b/tests/generic/761
@@ -0,0 +1,41 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2025 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test 761
+#
+# Making sure direct IO (O_DIRECT) writes won't cause any data checksum mismatch,
+# even if the contents of the buffer changes during writeback.
+#
+# This is mostly for filesystems with data checksum support, which should fallback
+# to buffer IO to avoid inconsistency.
+# For filesystems without data checksum support, nothing needs to be bothered.
+#
+
+. ./common/preamble
+_begin_fstest auto quick
+
+_require_scratch
+_require_odirect
+_require_test_program dio-writeback-race
+_fixed_by_kernel_commit XXXXXXXX \
+	"btrfs: always fallback to buffered write if the inode requires checksum"
+
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount
+
+blocksize=$(_get_block_size $SCRATCH_MNT)
+filesize=$(( 64 * 1024 * 1024))
+
+echo "blocksize=$blocksize filesize=$filesize" >> $seqres.full
+
+$here/src/dio-writeback-race -b $blocksize -s $filesize $SCRATCH_MNT/foobar
+
+# Read out the file, which should trigger checksum verification
+cat $SCRATCH_MNT/foobar > /dev/null
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/761.out b/tests/generic/761.out
new file mode 100644
index 000000000000..72ebba4cb426
--- /dev/null
+++ b/tests/generic/761.out
@@ -0,0 +1,2 @@ 
+QA output created by 761
+Silence is golden