[GIT,PULL] Btrfs updates for 6.15

Message ID	cover.1742834133.git.dsterba@suse.com (mailing list archive)
State	New
Headers	show Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4DB391799F for <linux-btrfs@vger.kernel.org>; Mon, 24 Mar 2025 16:37:58 +0000 (UTC) From: David Sterba <dsterba@suse.com> To: torvalds@linux-foundation.org Cc: David Sterba <dsterba@suse.com>, linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [GIT PULL] Btrfs updates for 6.15 Date: Mon, 24 Mar 2025 17:37:51 +0100 Message-ID: <cover.1742834133.git.dsterba@suse.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit default: False [-2.80 / 50.00]; BAYES_HAM(-3.00)[100.00%]; MID_CONTAINS_FROM(1.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; R_MISSING_CHARSET(0.50)[]; NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain]; MIME_TRACE(0.00)[0:+]; TO_DN_SOME(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; ARC_NA(0.00)[]; DKIM_SIGNED(0.00)[suse.com:s=susede1]; FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[4]; RCVD_COUNT_TWO(0.00)[2]; TO_MATCH_ENVRCPT_ALL(0.00)[]; DBL_BLOCKED_OPENRESOLVER(0.00)[suse.com:mid,imap1.dmz-prg2.suse.org:helo]; RCVD_TLS_ALL(0.00)[]
Series	[GIT,PULL] Btrfs updates for 6.15 \| expand [GIT,PULL] Btrfs updates for 6.15

David Sterba March 24, 2025, 4:37 p.m. UTC

Hi,

please pull the following btrfs updates, thanks.

User visible changes:

- fall back to buffered write if direct io is done on a file that requires
  checksums
  - this avoids a problem with checksum mismatch errors, observed e.g. on
    virtual images when writes to pages under writeback cause the checksum
    mismatch reports
  - this may lead to some performance degradation but currently the
    recommended setup for VM images is to use the NOCOW file attribute that
    also disables checksums

- fast/realtime zstd levels -15 to -1
  - supported by mount options (compress=zstd:-5) and defrag ioctl
  - improved speed, reduced compression ratio, check the commit for sample
    measurements

- defrag ioctl extended to accept negative compression levels

- subpage mode
  - remove warning when subpage mode is used, the feature is now reasonably
    complete and tested
  - in debug mode allow to create 2K b-tree nodes to allow testing subpage on
    x86_64 with 4K pages too

Performance improvements:

- in send, better file path caching improves runtime (on sample load by -30%)

- on s390x with hardware zlib support prepare the input buffer in a better way
  to get the best results from the acceleration

- minor speed improvement in encoded read, avoid memory allocation in
  synchronous mode

Core:

- enable stable writes on inodes, replacing manually waiting for writeback
  and allowing to skip that on inodes without checksums

- add last checks and warnings for out-of-band dirty writes to pages,
  requiring a fixup ("fixup worker"), this should not be necessary since 5.8
  where get_user_page() and pin_user_pages*() prevent this
  - long history behind that, we'll be happy to remove the whole infrastructure
    in the near future

- more folio API conversions and preparations for large folio support

- subpage cleanups and refactoring, split handling of data and metadata to
  allow future support for large folios

- readpage works as block-by-block, no change for normal mode, this is
  preparation for future subpage updates

- block group refcount fixes and hardening

- delayed iput fixes

- in zoned mode, fix zone activation on filesystem with missing devices

- cleanups:
  - inode parameter cleanups
  - path auto-freeing updates
  - code flow simplifications in send
  - redundant parameter cleanups

----------------------------------------------------------------
The following changes since commit 4701f33a10702d5fc577c32434eb62adde0a1ae1:

  Linux 6.14-rc7 (2025-03-16 12:55:17 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.15-tag

for you to fetch changes up to 35fec1089ebb5617f85884d3fa6a699ce6337a75:

  btrfs: zoned: fix zone finishing with missing devices (2025-03-18 20:35:57 +0100)

----------------------------------------------------------------
Anand Jain (1):
      btrfs: sysfs: accept size suffixes for read policy values

Boris Burkov (5):
      btrfs: fix block group refcount race in btrfs_create_pending_block_groups()
      btrfs: harden block_group::bg_list against list_del() races
      btrfs: make btrfs_discard_workfn() block_group ref explicit
      btrfs: explicitly ref count block_group on new_bgs list
      btrfs: codify pattern for adding block_group to bg_list

Dan Carpenter (1):
      btrfs: return a literal instead of a variable in btrfs_init_dev_replace()

Daniel Vacek (3):
      btrfs: keep private struct on stack for sync reads in btrfs_encoded_read_regular_fill_pages()
      btrfs: zstd: enable negative compression levels mount option
      btrfs: defrag: extend ioctl to accept compression levels

David Sterba (55):
      btrfs: add __cold attribute to extent_io_tree_panic()
      btrfs: async-thread: switch local variables need_order bool
      btrfs: zstd: move zstd_parameters to the workspace
      btrfs: zstd: remove local variable for storing page offsets
      btrfs: unify ordering of btrfs_key initializations
      btrfs: simplify returns and labels in btrfs_init_fs_root()
      btrfs: update include and forward declarations in headers
      btrfs: pass struct btrfs_inode to can_nocow_extent()
      btrfs: pass struct btrfs_inode to extent_range_clear_dirty_for_io()
      btrfs: pass struct btrfs_inode to btrfs_read_locked_inode()
      btrfs: pass struct btrfs_inode to btrfs_iget_locked()
      btrfs: pass struct btrfs_inode to new_simple_dir()
      btrfs: pass struct btrfs_inode to btrfs_inode_type()
      btrfs: pass struct btrfs_inode to btrfs_defrag_file()
      btrfs: use struct btrfs_inode inside create_pending_snapshot()
      btrfs: pass struct btrfs_inode to fill_stack_inode_item()
      btrfs: pass struct btrfs_inode to btrfs_fill_inode()
      btrfs: pass struct btrfs_inode to btrfs_load_inode_props()
      btrfs: pass struct btrfs_inode to btrfs_inode_inherit_props()
      btrfs: props: switch prop_handler::apply to struct btrfs_inode
      btrfs: props: switch prop_handler::extract to struct btrfs_inode
      btrfs: pass struct btrfs_inode to clone_copy_inline_extent()
      btrfs: pass struct btrfs_inode to btrfs_double_mmap_lock()
      btrfs: pass struct btrfs_inode to btrfs_double_mmap_unlock()
      btrfs: pass struct btrfs_inode to btrfs_extent_same_range()
      btrfs: use struct btrfs_inode inside btrfs_remap_file_range()
      btrfs: use struct btrfs_inode inside btrfs_remap_file_range_prep()
      btrfs: use struct btrfs_inode inside btrfs_get_parent()
      btrfs: use struct btrfs_inode inside btrfs_get_name()
      btrfs: don't pass nodesize to __alloc_extent_buffer()
      btrfs: merge alloc_dummy_extent_buffer() helpers
      btrfs: simplify parameters of metadata folio helpers
      btrfs: add __pure attribute to eb page and folio counters
      btrfs: use num_extent_folios() in for loop bounds
      btrfs: do trivial BTRFS_PATH_AUTO_FREE conversions
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_dev_replace()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_run_dev_replace()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_check_dir_item_collision()
      btrfs: use BTRFS_PATH_AUTO_FREE in load_global_roots()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_root_free_objectid()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_get_name()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_extent_info()
      btrfs: use BTRFS_PATH_AUTO_FREE in run_delayed_extent_op()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_bio_sums()
      btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_remove_free_space_inode()
      btrfs: use BTRFS_PATH_AUTO_FREE in populate_free_space_tree()
      btrfs: use BTRFS_PATH_AUTO_FREE in clear_free_space_tree()
      btrfs: use BTRFS_PATH_AUTO_FREE in load_free_space_tree()
      btrfs: parameter constification in ioctl.c
      btrfs: pass btrfs_root pointers to send ioctl parameters
      btrfs: pass root pointers to search tree ioctl helpers
      btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()
      btrfs: simplify local variables in btrfs_ioctl_resize()
      btrfs: pass struct to btrfs_ioctl_subvol_getflags()
      btrfs: unify inode variable naming

Filipe Manana (57):
      btrfs: avoid assigning twice to block_start at btrfs_do_readpage()
      btrfs: send: remove duplicated logic from fs_path_reset()
      btrfs: send: make fs_path_len() inline and constify its argument
      btrfs: send: always use fs_path_len() to determine a path's length
      btrfs: send: simplify return logic from fs_path_prepare_for_add()
      btrfs: send: simplify return logic from fs_path_add()
      btrfs: send: implement fs_path_add_path() using fs_path_add()
      btrfs: send: simplify return logic from fs_path_add_from_extent_buffer()
      btrfs: send: return -ENAMETOOLONG when attempting a path that is too long
      btrfs: send: simplify return logic from __get_cur_name_and_parent()
      btrfs: send: simplify return logic from is_inode_existent()
      btrfs: send: simplify return logic from get_cur_inode_state()
      btrfs: send: factor out common logic when sending xattrs
      btrfs: send: only use boolean variables at process_recorded_refs()
      btrfs: send: add and use helper to rename current inode when processing refs
      btrfs: send: simplify return logic from send_remove_xattr()
      btrfs: send: simplify return logic from record_new_ref_if_needed()
      btrfs: send: simplify return logic from record_deleted_ref_if_needed()
      btrfs: send: simplify return logic from record_new_ref()
      btrfs: send: simplify return logic from record_deleted_ref()
      btrfs: send: simplify return logic from record_changed_ref()
      btrfs: send: remove unnecessary return variable from process_new_xattr()
      btrfs: send: simplify return logic from process_changed_xattr()
      btrfs: send: simplify return logic from send_verity()
      btrfs: send: simplify return logic from send_rename()
      btrfs: send: simplify return logic from send_link()
      btrfs: send: simplify return logic from send_unlink()
      btrfs: send: simplify return logic from send_rmdir()
      btrfs: send: keep the current inode's path cached
      btrfs: send: avoid path allocation for the current inode when issuing commands
      btrfs: send: simplify return logic from send_set_xattr()
      btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work()
      btrfs: get used bytes while holding lock at btrfs_reclaim_bgs_work()
      btrfs: fix reclaimed bytes accounting after automatic block group reclaim
      btrfs: fix non-empty delayed iputs list on unmount due to compressed write workers
      btrfs: move __btrfs_bio_end_io() code into its single caller
      btrfs: move btrfs_cleanup_bio() code into its single caller
      btrfs: fix non-empty delayed iputs list on unmount due to async workers
      btrfs: avoid unnecessary bio dereference at run_one_async_done()
      btrfs: send: remove unnecessary inode lookup at send_encoded_inline_extent()
      btrfs: send: simplify return logic from send_encoded_extent()
      btrfs: return a btrfs_inode from btrfs_iget_logging()
      btrfs: return a btrfs_inode from read_one_inode()
      btrfs: pass a btrfs_inode to fixup_inode_link_count()
      btrfs: make btrfs_iget() return a btrfs inode instead
      btrfs: make btrfs_iget_path() return a btrfs inode instead
      btrfs: remove unnecessary fs_info argument from create_reloc_inode()
      btrfs: remove unnecessary fs_info argument from delete_block_group_cache()
      btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()
      btrfs: tests: fix chunk map leak after failure to add it to the tree
      btrfs: avoid unnecessary memory allocation and copy at overwrite_item()
      btrfs: use variables to store extent buffer and slot at overwrite_item()
      btrfs: update outdated comment for overwrite_item()
      btrfs: use memcmp_extent_buffer() at replay_one_extent()
      btrfs: remove redundant else statement from btrfs_log_inode_parent()
      btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent()
      btrfs: remove end_no_trans label from btrfs_log_inode_parent()

Johannes Thumshirn (3):
      btrfs: zoned: exit btrfs_can_activate_zone if BTRFS_FS_NEED_ZONE_FINISH is set
      btrfs: zoned: fix zone activation with missing devices
      btrfs: zoned: fix zone finishing with missing devices

Mark Harmstone (2):
      btrfs: avoid linker error in btrfs_find_create_tree_block()
      btrfs: don't clobber ret in btrfs_validate_super()

Matthew Wilcox (Oracle) (2):
      btrfs: update some folio related comments
      btrfs: convert io_ctl_prepare_pages() to work on folios

Qu Wenruo (32):
      btrfs: remove duplicated metadata folio flag update in end_bbio_meta_read()
      btrfs: always fallback to buffered write if the inode requires checksum
      btrfs: zlib: refactor S390x HW acceleration buffer preparation
      btrfs: expose per-inode stable writes flag
      btrfs: factor out nocow ordered extent and extent map generation into a helper
      btrfs: move ordered extent cleanup to where they are allocated
      btrfs: remove btrfs_fs_info::sectors_per_page
      btrfs: factor out metadata subpage detection into a dedicated helper
      btrfs: make subpage attach and detach handle metadata properly
      btrfs: use metadata specific helpers to simplify extent buffer helpers
      btrfs: simplify subpage handling of btrfs_clear_buffer_dirty()
      btrfs: simplify subpage handling of write_one_eb()
      btrfs: simplify subpage handling of read_extent_buffer_pages_nowait()
      btrfs: require strict data/metadata split for subpage checks
      btrfs: prevent inline data extents read from touching blocks beyond its range
      btrfs: fix the qgroup data free range for inline data extents
      btrfs: introduce a read path dedicated extent lock helper
      btrfs: make btrfs_do_readpage() to do block-by-block read
      btrfs: allow buffered write to avoid full page read if it's block aligned
      btrfs: allow inline data extents creation if block size < page size
      btrfs: remove the subpage related warning message
      btrfs: properly limit inline data extent according to block size
      btrfs: allow debug builds to accept 2K block size
      btrfs: reject out-of-band dirty folios during writeback
      btrfs: run btrfs_error_commit_super() early
      btrfs: add extra warning if delayed iput is added when it's not allowed
      btrfs: subpage: make btrfs_is_subpage() check against a folio
      btrfs: add a size parameter to btrfs_alloc_subpage()
      btrfs: replace PAGE_SIZE with folio_size for subpage.[ch]
      btrfs: prepare btrfs_launcher_folio() for large folios support
      btrfs: prepare extent_io.c for future large folio support
      btrfs: prepare btrfs_page_mkwrite() for large folios

Sun YangKai (3):
      btrfs: simplify the return value handling in search_ioctl()
      btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()
      btrfs: avoid redundant path slot assignment in btrfs_search_forward()

 fs/btrfs/accessors.h              |   1 +
 fs/btrfs/acl.h                    |   2 +
 fs/btrfs/async-thread.c           |  11 +-
 fs/btrfs/backref.c                |   4 +-
 fs/btrfs/bio.c                    |  38 +--
 fs/btrfs/block-group.c            | 155 ++++++----
 fs/btrfs/btrfs_inode.h            |  17 +-
 fs/btrfs/compression.c            |  31 +-
 fs/btrfs/compression.h            |  26 +-
 fs/btrfs/ctree.c                  |  18 +-
 fs/btrfs/ctree.h                  |   2 +-
 fs/btrfs/defrag.c                 |  78 ++---
 fs/btrfs/defrag.h                 |   4 +-
 fs/btrfs/delayed-inode.c          |  99 ++++---
 fs/btrfs/delayed-inode.h          |   2 +-
 fs/btrfs/delayed-ref.h            |   2 +
 fs/btrfs/dev-replace.c            |  33 +--
 fs/btrfs/dir-item.c               |  24 +-
 fs/btrfs/dir-item.h               |   1 +
 fs/btrfs/direct-io.c              |  19 +-
 fs/btrfs/direct-io.h              |   2 +
 fs/btrfs/discard.c                |  34 ++-
 fs/btrfs/discard.h                |   1 +
 fs/btrfs/disk-io.c                | 109 ++++---
 fs/btrfs/export.c                 |  51 ++--
 fs/btrfs/extent-io-tree.c         |   8 +-
 fs/btrfs/extent-tree.c            |  63 ++--
 fs/btrfs/extent-tree.h            |   1 -
 fs/btrfs/extent_io.c              | 589 ++++++++++++++++++++++----------------
 fs/btrfs/extent_io.h              |   9 +-
 fs/btrfs/file-item.c              |  30 +-
 fs/btrfs/file-item.h              |   2 +
 fs/btrfs/file.c                   |  28 +-
 fs/btrfs/file.h                   |   2 +
 fs/btrfs/free-space-cache.c       |  57 ++--
 fs/btrfs/free-space-tree.c        |  45 ++-
 fs/btrfs/fs.c                     |   1 -
 fs/btrfs/fs.h                     |  26 +-
 fs/btrfs/inode-item.c             |   6 +-
 fs/btrfs/inode.c                  | 587 +++++++++++++++++++------------------
 fs/btrfs/ioctl.c                  | 217 +++++++-------
 fs/btrfs/ioctl.h                  |   4 +-
 fs/btrfs/locking.c                |   1 -
 fs/btrfs/ordered-data.c           |  23 +-
 fs/btrfs/ordered-data.h           |   9 +-
 fs/btrfs/print-tree.h             |   2 +
 fs/btrfs/props.c                  |  66 ++---
 fs/btrfs/props.h                  |   8 +-
 fs/btrfs/qgroup.c                 |   2 +-
 fs/btrfs/qgroup.h                 |   3 +
 fs/btrfs/raid-stripe-tree.h       |   1 +
 fs/btrfs/reflink.c                | 100 +++----
 fs/btrfs/relocation.c             |  30 +-
 fs/btrfs/scrub.c                  |   4 +-
 fs/btrfs/send.c                   | 544 ++++++++++++++++-------------------
 fs/btrfs/send.h                   |   4 +-
 fs/btrfs/space-info.c             |   2 +-
 fs/btrfs/subpage.c                | 224 +++++++++------
 fs/btrfs/subpage.h                |  56 +++-
 fs/btrfs/super.c                  |   6 +-
 fs/btrfs/sysfs.c                  |  14 +-
 fs/btrfs/sysfs.h                  |   1 +
 fs/btrfs/tests/extent-io-tests.c  |   6 +-
 fs/btrfs/tests/extent-map-tests.c |   1 +
 fs/btrfs/transaction.c            |  39 ++-
 fs/btrfs/tree-log.c               | 392 ++++++++++++-------------
 fs/btrfs/verity.c                 |   4 +-
 fs/btrfs/volumes.c                |  16 +-
 fs/btrfs/volumes.h                |   4 +
 fs/btrfs/xattr.h                  |   2 +
 fs/btrfs/zlib.c                   |  83 ++++--
 fs/btrfs/zoned.c                  |   9 +
 fs/btrfs/zstd.c                   |  66 +++--
 include/uapi/linux/btrfs.h        |  16 +-
 74 files changed, 2263 insertions(+), 1914 deletions(-)

pr-tracker-bot@kernel.org March 27, 2025, 8:48 p.m. UTC | #1

The pull request you sent on Mon, 24 Mar 2025 17:37:51 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.15-tag

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fd71def6d9abc5ae362fb9995d46049b7b0ed391

Thank you!

Josef Bacik March 28, 2025, 1:27 p.m. UTC | #2

On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote:
> Hi,
> 
> please pull the following btrfs updates, thanks.
> 
> User visible changes:
> 
> - fall back to buffered write if direct io is done on a file that requires
>   checksums

<trimming the everybody linux-btrfs from the cc list>

What?  We use this constantly in a bunch of places to avoid page cache overhead,
it's perfectly legitimate to do DIO to a file that requires checksums.  Does the
vm case mess this up?  Absolutely, but that's why we say use NOCOW for that
case.  We've always had this behavior, we've always been clear that if you break
it you buy it.  This is a huge regression for a pretty significant use case.
Thanks,

Josef

David Sterba March 28, 2025, 5:36 p.m. UTC | #3

On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote:
> On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote:
> > Hi,
> > 
> > please pull the following btrfs updates, thanks.
> > 
> > User visible changes:
> > 
> > - fall back to buffered write if direct io is done on a file that requires
> >   checksums
> 
> <trimming the everybody linux-btrfs from the cc list>
> 
> What?  We use this constantly in a bunch of places to avoid page cache overhead,
> it's perfectly legitimate to do DIO to a file that requires checksums.  Does the
> vm case mess this up?  Absolutely, but that's why we say use NOCOW for that
> case.  We've always had this behavior, we've always been clear that if you break
> it you buy it.  This is a huge regression for a pretty significant use case.

The patch has been up for like 2 months and you could have said "don't
because reasons" any time before the pull request. Now we're left with a
revert, or other alternatives making the use cases working.

Josef Bacik March 28, 2025, 7:39 p.m. UTC | #4

On Fri, Mar 28, 2025 at 06:36:44PM +0100, David Sterba wrote:
> On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote:
> > On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote:
> > > Hi,
> > > 
> > > please pull the following btrfs updates, thanks.
> > > 
> > > User visible changes:
> > > 
> > > - fall back to buffered write if direct io is done on a file that requires
> > >   checksums
> > 
> > <trimming the everybody linux-btrfs from the cc list>
> > 
> > What?  We use this constantly in a bunch of places to avoid page cache overhead,
> > it's perfectly legitimate to do DIO to a file that requires checksums.  Does the
> > vm case mess this up?  Absolutely, but that's why we say use NOCOW for that
> > case.  We've always had this behavior, we've always been clear that if you break
> > it you buy it.  This is a huge regression for a pretty significant use case.
> 
> The patch has been up for like 2 months and you could have said "don't
> because reasons" any time before the pull request. Now we're left with a
> revert, or other alternatives making the use cases working.

Boris told me about this and I forgot.  I took everybody off the CC list because
I don't want to revert it, in fact generally speaking I'd love to never have
these style of bug reports again.

But it is a pretty significant change.  Are we ok going forward saying you don't
get O_DIRECT unless you want NOCOW? Should we maybe allow for users to indicate
they're not dumb and can be trusted to do O_DIRECT properly? I just think this
opens us up to a lot more uncomfortable conversations than the other behavior.

I personally think this is better, if it's been sitting there for 2 months then
hooray we're in agreement. But I'm also worried it'll come back to bite us.
Thanks,

Josef

Qu Wenruo March 28, 2025, 9:36 p.m. UTC | #5

在 2025/3/29 06:09, Josef Bacik 写道:
> On Fri, Mar 28, 2025 at 06:36:44PM +0100, David Sterba wrote:
>> On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote:
>>> On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote:
>>>> Hi,
>>>>
>>>> please pull the following btrfs updates, thanks.
>>>>
>>>> User visible changes:
>>>>
>>>> - fall back to buffered write if direct io is done on a file that requires
>>>>    checksums
>>>
>>> <trimming the everybody linux-btrfs from the cc list>
>>>
>>> What?  We use this constantly in a bunch of places to avoid page cache overhead,
>>> it's perfectly legitimate to do DIO to a file that requires checksums.  Does the
>>> vm case mess this up?  Absolutely, but that's why we say use NOCOW for that
>>> case.  We've always had this behavior, we've always been clear that if you break
>>> it you buy it.  This is a huge regression for a pretty significant use case.

BTW, it's not clear, it's not in the documents (until recently, along 
with this change), and no one is really explaining that to end users 
until now.

If you're a experienced fs/mm developer, sure it's obvious, but most 
users (millions of VM users who choose to use none cache mode) are not 
so experienced.

Future more, a lot of such VM users are just trying to reduce page cache 
memory usage, not so about the zero-copy performance.
And the buffered fallback should still make a lot of sense for them, and 
still allow working data checksum (which they may care more than the 
performance).

>>
>> The patch has been up for like 2 months and you could have said "don't
>> because reasons" any time before the pull request. Now we're left with a
>> revert, or other alternatives making the use cases working.
> 
> Boris told me about this and I forgot.  I took everybody off the CC list because
> I don't want to revert it, in fact generally speaking I'd love to never have
> these style of bug reports again.
> 
> But it is a pretty significant change.  Are we ok going forward saying you don't
> get O_DIRECT unless you want NOCOW?

Despite NOCOW, there are still other things (like alignment) can make 
O_DIRECT to fall back buffered, without even letting the users know.

And it is always fs specific on whether it choose to fall buffered or 
return -EINVAL.

So in the first place there is never any guarantee that O_DIRECT always 
results in zero-copy writes.

> Should we maybe allow for users to indicate
> they're not dumb and can be trusted to do O_DIRECT properly? I just think this
> opens us up to a lot more uncomfortable conversations than the other behavior.

For whatever reason, no matter dumb users or not, all operations through 
a fs should not lead to data/metadata corruption, no matter if it's real 
corruption or just false csum mismatch.
(Not to mention it's not easy to fix the csum mismatch either)

If dumb programs can still corrupt the fs, then it's a bug in the fs.

To me, this is very straightforward, and the priority should be obvious.

> 
> I personally think this is better, if it's been sitting there for 2 months then
> hooray we're in agreement. But I'm also worried it'll come back to bite us.

I'd say, let it bite.

If it's some performance critical use cases, I believe the developer 
should be reasonable enough to be persuaded (either choose perf and go 
NODATASUM, or choose datacsum and accept the perf drop).

If it's newbie level usages, they shouldn't even notice any difference, 
except the fact btrfs no longer corrupts their files.

Even if it really bites in the future, I doubt it will be any worse.

On the other hand, the data csum mismatch for VM images are already 
biting us for too long, and it's really damaging the reputation of Btrfs.

Thanks,
Qu

> Thanks,
> 
> Josef

Daniel Vacek March 31, 2025, 9:35 a.m. UTC | #6

On Fri, 28 Mar 2025 at 20:40, Josef Bacik <josef@toxicpanda.com> wrote:
>
> ... Should we maybe allow for users to indicate
> they're not dumb and can be trusted to do O_DIRECT properly? I just think this

Seriously, I don't think it's a good idea.

Though, I'd vote for adding a tracing point so that people chasing a
performance issue/regression can check if this is possibly the cause.
Yet, that wouldn't restore the performance for you, of course.

Anyways, if you really trust your app that much (hey, it's your call,
right?) you can always patch *your* kernel. But IMO such a hack does
not belong to the upstream kernel.
As Qu mentioned, the FS needs to stay consistent and not allow users
to shoot themselves into a leg.

--nX

>
> Josef
>

David Sterba March 31, 2025, 11:58 p.m. UTC | #7

On Fri, Mar 28, 2025 at 03:39:27PM -0400, Josef Bacik wrote:
> On Fri, Mar 28, 2025 at 06:36:44PM +0100, David Sterba wrote:
> > On Fri, Mar 28, 2025 at 09:27:51AM -0400, Josef Bacik wrote:
> > > On Mon, Mar 24, 2025 at 05:37:51PM +0100, David Sterba wrote:
> > > > Hi,
> > > > 
> > > > please pull the following btrfs updates, thanks.
> > > > 
> > > > User visible changes:
> > > > 
> > > > - fall back to buffered write if direct io is done on a file that requires
> > > >   checksums
> > > 
> > > <trimming the everybody linux-btrfs from the cc list>
> > > 
> > > What?  We use this constantly in a bunch of places to avoid page cache overhead,
> > > it's perfectly legitimate to do DIO to a file that requires checksums.  Does the
> > > vm case mess this up?  Absolutely, but that's why we say use NOCOW for that
> > > case.  We've always had this behavior, we've always been clear that if you break
> > > it you buy it.  This is a huge regression for a pretty significant use case.
> > 
> > The patch has been up for like 2 months and you could have said "don't
> > because reasons" any time before the pull request. Now we're left with a
> > revert, or other alternatives making the use cases working.
> 
> Boris told me about this and I forgot.  I took everybody off the CC list because
> I don't want to revert it, in fact generally speaking I'd love to never have
> these style of bug reports again.
> 
> But it is a pretty significant change.  Are we ok going forward saying you don't
> get O_DIRECT unless you want NOCOW? Should we maybe allow for users to indicate
> they're not dumb and can be trusted to do O_DIRECT properly? I just think this
> opens us up to a lot more uncomfortable conversations than the other behavior.
> 
> I personally think this is better, if it's been sitting there for 2 months then
> hooray we're in agreement. But I'm also worried it'll come back to bite us.

I think we're in agreement where the fix is going and I also agree this
is a significant change and can indeed become worse in some sense than
what we have now. We already have exceptions for direct io when combined
with various features, most notably with compression where it falls back
to buffered right away.

We may need to enumerate and document the common combinations and decide
what to do about them and what are potential drawbacks. The upcoming
uncached IO support can possibly fill some gaps.

The fix is going towards correctness at the cost of performance and
arguably this should have been the default already. We ended up with
many people noticing checksum mismatches in syslog due to the VM setup.
It's rather bad usability to tell people to ignore certain errors that
would otherwise be quite significant.

Some of the combinations:

1) dio + no csum + no cow                  -> true direct io
2) dio + no csum +    cow                  -> almost direct io
3) dio +    csum +    cow + no compression -> fallback to buffered
4) dio +    csum +    cow      compression -> fallback to buffered

The changed case is 3.

Quoting your reply:

> We use this constantly in a bunch of places to avoid page cache overhead,
> it's perfectly legitimate to do DIO to a file that requires checksums.

So if you want to avoid page cache overhead the uncached mode does not
help that much I guess, the pages and related structures still need to
be set up.

I'm wondering if the use of direct io is more for convenience (not to
pollute the page cache) or for the real expectations of the performance
gains when the caching is done on the application side (e.g. databases).

The mutually exclusive feature is checksumming, so if the direct io is
for performance then the application could also turn off the checksums.

We don't have a magic bit and opening mode that can keep the dio + csum
working as before, and currently don't have a suggestion how to
implement it. Seems that one way or another the user application will
have to do some work to get best the performance.

[GIT,PULL] Btrfs updates for 6.15

Pull-request

Message

Comments