[v4,00/68] btrfs: add basic rw support for subpage sector size

Message ID 20201021062554.68132-1-wqu@suse.com (mailing list archive)

Message

Qu Wenruo Oct. 21, 2020, 6:24 a.m. UTC
Patches can be fetched from github:
https://github.com/adam900710/linux/tree/subpage_data_fullpage_write

=== Overview ===
Allow systems with 64K page size to mount btrfs filesystems with 4K
sector size and do regular read-write operations.

=== What works ===
- Subpage data read
  Both uncompressed and compressed data

- Subpage metadata read/write
  So far, single-threaded "fsstress" loops haven't crashed the system
  with all debug options enabled.
  (Currently running with "-n 2048" in a 1024-run loop.)

  This also means we can do subpage sized pure metadata operations like
  reflink (e.g. reflink can produce a 4K sector sized extent without
  problem).

- Full page data write
  Only uncompressed data has been tested so far.
  All data writes happen in page size units, including:
  * buffered write
  * dio write
  * hole punch for unaligned range
  This means that even if just one 4K sector is dirtied, we will write
  back the full 64K page as a data extent.
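
The full page write behaviour described above amounts to rounding every
dirty byte range out to page boundaries before writeback. A minimal
userspace sketch of that rounding follows; it is not code from the
patchset. The names ex_full_page_range() and struct ex_range are
hypothetical, and the 64K page size is hardcoded as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for illustration: 64K. */
#define EX_PAGE_SIZE (64u * 1024u)

/* Hypothetical helper type: a byte range as (start, len). */
struct ex_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Expand the dirty byte range [start, start + len) so that it begins
 * and ends on page boundaries. With full page data write, this expanded
 * range is what would be written back.
 */
static struct ex_range ex_full_page_range(uint64_t start, uint64_t len)
{
	const uint64_t mask = (uint64_t)EX_PAGE_SIZE - 1;
	struct ex_range r;
	uint64_t end;

	assert(len > 0);
	r.start = start & ~mask;               /* round start down */
	end = (start + len + mask) & ~mask;    /* round end up */
	r.len = end - r.start;
	return r;
}
```

For example, dirtying the single 4K sector at offset 4096 yields the
range (0, 65536), i.e. the whole first 64K page.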

=== What doesn't work ===
- Balance
- Scrub
  Both fail with csum check failures. This may be quick to solve, but
  the current development status and patchset size are enough for a
  milestone.

- Dev replace
  Unable to submit subpage data writes.


=== Challenges and solutions ===
- Metadata
  * One 64K page can contain several tree blocks
    Instead of full page read/write/lock, we use extent io tree to do
    sector aligned read/write/lock, and avoid full page lock if
    possible.

  * Metadata can cross 64K page boundary
    This only happens for certain converted filesystems. Considering how
    rarely they are used, just reject them for now and fix convert.

  Overall, metadata is not that complex as metadata has very limited
  interfaces.

- Data
  * Data has more page status and uses ordered extents
  * Data subpage write can be handled by iomap
    Instead of using the extent io tree to track each page status, go
    with full page writeback for now, so that I won't waste time
    implementing something which is designed to be replaced.

- Testing
  * No way to test under x86_64
    Currently I'm using an RK3399 board with NVME driver, planning to
    move to a Xavier AGX board.
    But we plan to add 2K sector size support as a pure testing sector
    size for x86_64 (but still 4K as minimal node size) to test subpage
    routines and make my life a little easier.
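
The metadata points above (several tree blocks per 64K page, and
rejecting tree blocks that cross a page boundary) come down to simple
offset arithmetic. The following is a rough userspace sketch, not code
from the patchset: all ex_* names are hypothetical, and the 64K page
size and 16K nodesize are assumed values for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only: 64K pages, 16K tree blocks (nodesize). */
#define EX_PAGE_SIZE 65536u
#define EX_NODESIZE  16384u

/* True when the sector size is smaller than the page size. */
static bool ex_is_subpage(uint32_t sectorsize, uint32_t page_size)
{
	return sectorsize < page_size;
}

/* Byte offset of a tree block inside the page that holds its start. */
static uint32_t ex_offset_in_page(uint64_t bytenr)
{
	return (uint32_t)(bytenr & (EX_PAGE_SIZE - 1));
}

/*
 * True when the tree block [bytenr, bytenr + nodesize) spans two pages.
 * Such a block is the case the cover letter proposes to reject.
 */
static bool ex_crosses_page(uint64_t bytenr, uint32_t nodesize)
{
	uint64_t first;
	uint64_t last;

	assert(nodesize > 0);
	first = bytenr / EX_PAGE_SIZE;
	last = (bytenr + nodesize - 1) / EX_PAGE_SIZE;
	return first != last;
}
```

With these assumed sizes, a 64K page holds up to four nodesize-aligned
tree blocks, while a tree block starting at offset 61440 (60K) would
spill into the next page and be rejected.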

=== TODO ===
- More testing 
  Obviously

- Balance and scrub support
- Limited data subpage write
  Mostly for balance and replace, as a workaround.

- Iomap support for true subpage data writeback

=== Patchset structure ===
Patch 01~03:	Small bug fixes
Patch 04~22:	Generic cleanup and refactors, which make sense without
		subpage support
Patch 23~27:	Subpage specific cleanup and refactors.
Patch 28~42:	Enablement for subpage RO mount
Patch 43~52:	Enablement for subpage metadata write
Patch 53~68:	Enablement for subpage data write (although still in
		page size)

=== Changelog ===
v2:
- Migrating to extent_io_tree based status/locking mechanism
  This gets rid of the ad-hoc subpage_eb_mapping structure and extra
  timing to verify the extent buffers.

  This also brings some extra cleanups for btree inode extent io tree
  hooks which make no sense for either subpage or regular sector size.

  This also completely removes the requirement for page status like
  Locked/Uptodate/Dirty. Now metadata pages only utilize Private status,
  while private pointer is always NULL.

- Submit proper subpage sized read for metadata
  With the help of the extent io tree, we no longer need to bother with
  full page reads. Now submit subpage sized metadata reads and do
  subpage locking.

- Remove some unnecessary refactors
  Some refactors, like extracting detach_extent_buffer_pages(), don't
  really make the code cleaner. We can easily add a subpage specific
  branch instead.

- Address the comments from v1

v3:
- Add compressed data read fix

- Also update page status according to extent status for btree inode
  This lets us reuse more code from the existing code base.

- Add metadata write support
  Only manually tested (with a fs created under x86_64, and script to do
  metadata only operations under aarch64 with 64K page size).

- More cleanup/refactors during metadata write support development.

v4:
- Add more refactors
  The most obvious one is the refactor of __set/__clear_extent_bit()
  to make the less common options less visible, and allow me to add more
  options more easily.

- Add full data page write support

- More bug fixes for existing patches
  Mostly bugs found during fsstress tests.

- Reduce page locking to minimal for metadata
  I hit a possible ABBA lock, where extent io tree locking and page
  locking lead to a deadlock.
  To resolve it without adding more requirements on the page locking
  sequence, subpage metadata only relies on extent io tree locking.
  Page locking is only reserved for unavoidable cases, like calling
  clear_page_dirty_for_io().

Goldwyn Rodrigues (1):
  btrfs: use iosize while reading compressed pages

Qu Wenruo (67):
  btrfs: extent-io-tests: remove invalid tests
  btrfs: extent_io: fix the comment on lock_extent_buffer_for_io().
  btrfs: extent_io: update the comment for find_first_extent_bit()
  btrfs: extent_io: sink the @failed_start parameter for
    set_extent_bit()
  btrfs: make btree inode io_tree has its special owner
  btrfs: disk-io: replace @fs_info and @private_data with @inode for
    btrfs_wq_submit_bio()
  btrfs: inode: sink parameter @start and @len for check_data_csum()
  btrfs: extent_io: unexport extent_invalidatepage()
  btrfs: extent_io: remove the forward declaration and rename
    __process_pages_contig
  btrfs: extent_io: rename pages_locked in process_pages_contig()
  btrfs: extent_io: only require sector size alignment for page read
  btrfs: extent_io: remove the extent_start/extent_len for
    end_bio_extent_readpage()
  btrfs: extent_io: integrate page status update into
    endio_readpage_release_extent()
  btrfs: extent_io: rename page_size to io_size in submit_extent_page()
  btrfs: extent_io: add assert_spin_locked() for
    attach_extent_buffer_page()
  btrfs: extent_io: extract the btree page submission code into its own
    helper function
  btrfs: extent_io: calculate inline extent buffer page size based on
    page size
  btrfs: extent_io: make btrfs_fs_info::buffer_radix to take sector size
    devided values
  btrfs: extent_io: sink less common parameters for __set_extent_bit()
  btrfs: extent_io: sink less common parameters for __clear_extent_bit()
  btrfs: disk_io: grab fs_info from extent_buffer::fs_info directly for
    btrfs_mark_buffer_dirty()
  btrfs: disk-io: make csum_tree_block() handle sectorsize smaller than
    page size
  btrfs: disk-io: extract the extent buffer verification from
    btree_readpage_end_io_hook()
  btrfs: disk-io: accept bvec directly for csum_dirty_buffer()
  btrfs: inode: make btrfs_readpage_end_io_hook() follow sector size
  btrfs: introduce a helper to determine if the sectorsize is smaller
    than PAGE_SIZE
  btrfs: extent_io: allow find_first_extent_bit() to find a range with
    exact bits match
  btrfs: extent_io: don't allow tree block to cross page boundary for
    subpage support
  btrfs: extent_io: update num_extent_pages() to support subpage sized
    extent buffer
  btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
  btrfs: disk-io: only clear EXTENT_LOCK bit for extent_invalidatepage()
  btrfs: extent-io: make type of extent_state::state to be at least 32
    bits
  btrfs: extent_io: use extent_io_tree to handle subpage extent buffer
    allocation
  btrfs: extent_io: make set/clear_extent_buffer_uptodate() to support
    subpage size
  btrfs: extent_io: make the assert test on page uptodate able to handle
    subpage
  btrfs: extent_io: implement subpage metadata read and its endio
    function
  btrfs: extent_io: implement try_release_extent_buffer() for subpage
    metadata support
  btrfs: extent_io: extra the core of test_range_bit() into
    test_range_bit_nolock()
  btrfs: extent_io: introduce EXTENT_READ_SUBMITTED to handle subpage
    data read
  btrfs: set btree inode track_uptodate for subpage support
  btrfs: allow RO mount of 4K sector size fs on 64K page system
  btrfs: disk-io: allow btree_set_page_dirty() to do more sanity check
    on subpage metadata
  btrfs: disk-io: support subpage metadata csum calculation at write
    time
  btrfs: extent_io: prevent extent_state from being merged for btree io
    tree
  btrfs: extent_io: make set_extent_buffer_dirty() to support subpage
    sized metadata
  btrfs: extent_io: add subpage support for clear_extent_buffer_dirty()
  btrfs: extent_io: make set_btree_ioerr() accept extent buffer
  btrfs: extent_io: introduce write_one_subpage_eb() function
  btrfs: extent_io: make lock_extent_buffer_for_io() subpage compatible
  btrfs: extent_io: introduce submit_btree_subpage() to submit a page
    for subpage metadata write
  btrfs: extent_io: introduce end_bio_subpage_eb_writepage() function
  btrfs: inode: make can_nocow_extent() check only return 1 if the range
    is no smaller than PAGE_SIZE
  btrfs: file: calculate reserve space based on PAGE_SIZE for buffered
    write
  btrfs: file: make hole punching page aligned for subpage
  btrfs: file: make btrfs_dirty_pages() follow page size to mark extent
    io tree
  btrfs: file: make btrfs_file_write_iter() to be page aligned
  btrfs: output extra info for space info update underflow
  btrfs: delalloc-space: make data space reservation to be page aligned
  btrfs: scrub: allow scrub to work with subpage sectorsize
  btrfs: inode: make btrfs_truncate_block() to do page alignment
  btrfs: file: make hole punch and zero range to be page aligned
  btrfs: file: make btrfs_fallocate() to use PAGE_SIZE as blocksize
  btrfs: inode: always mark the full page range delalloc for
    btrfs_page_mkwrite()
  btrfs: inode: require page alignement for direct io
  btrfs: inode: only do NOCOW write for page aligned extent
  btrfs: reflink: do full page writeback for reflink prepare
  btrfs: support subpage read write for test

 fs/btrfs/block-group.c           |    2 +-
 fs/btrfs/btrfs_inode.h           |   12 +
 fs/btrfs/ctree.c                 |    5 +-
 fs/btrfs/ctree.h                 |   43 +-
 fs/btrfs/delalloc-space.c        |   19 +-
 fs/btrfs/disk-io.c               |  425 ++++++--
 fs/btrfs/disk-io.h               |    8 +-
 fs/btrfs/extent-io-tree.h        |  145 ++-
 fs/btrfs/extent-tree.c           |    2 +-
 fs/btrfs/extent_io.c             | 1576 ++++++++++++++++++++++--------
 fs/btrfs/extent_io.h             |   27 +-
 fs/btrfs/extent_map.c            |    2 +-
 fs/btrfs/file.c                  |  140 ++-
 fs/btrfs/free-space-cache.c      |    2 +-
 fs/btrfs/inode.c                 |  117 ++-
 fs/btrfs/reflink.c               |   36 +-
 fs/btrfs/relocation.c            |    2 +-
 fs/btrfs/scrub.c                 |    8 -
 fs/btrfs/space-info.h            |    4 +-
 fs/btrfs/struct-funcs.c          |   18 +-
 fs/btrfs/tests/extent-io-tests.c |   26 +-
 fs/btrfs/transaction.c           |    4 +-
 fs/btrfs/volumes.c               |    2 +-
 include/trace/events/btrfs.h     |    1 +
 24 files changed, 1927 insertions(+), 699 deletions(-)

Comments

David Sterba Oct. 21, 2020, 11:22 a.m. UTC | #1
On Wed, Oct 21, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
> === Patchset structure ===
> Patch 01~03:	Small bug fixes
> Patch 04~22:	Generic cleanup and refactors, which make sense without
> 		subpage support
> Patch 23~27:	Subpage specific cleanup and refactors.
> Patch 28~42:	Enablement for subpage RO mount
> Patch 43~52:	Enablement for subpage metadata write
> Patch 53~68:	Enablement for subpage data write (although still in
> 		page size)

That's a sane grouping to merge it from the top, though there could
still be some updates required. There are some pending patchsets for next and
I don't have an estimate for conflicts regarding the cleanups you have
in this patchset so we'll see.  All up to 27 should be mergeable in this
dev cycle.

Qu Wenruo Oct. 21, 2020, 11:50 a.m. UTC | #2
On 2020/10/21 7:22 PM, David Sterba wrote:
> On Wed, Oct 21, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
>> === Patchset structure ===
>> Patch 01~03:	Small bug fixes
>> Patch 04~22:	Generic cleanup and refactors, which make sense without
>> 		subpage support
>> Patch 23~27:	Subpage specific cleanup and refactors.
>> Patch 28~42:	Enablement for subpage RO mount
>> Patch 43~52:	Enablement for subpage metadata write
>> Patch 53~68:	Enablement for subpage data write (although still in
>> 		page size)
> 
> That's a sane grouping to merge it from the top, though there could
> still be some updates required. There are some pending patchsets for next and
> I don't have an estimate for conflicts regarding the cleanups you have
> in this patchset so we'll see.  All up to 27 should be mergeable in this
> dev cycle.
> 

That's great. If the conflicts are not manageable, feel free to ask me
to do the rebase.

The main conflict I can foresee is with the metadata readpage refactor
from Nik, but my current structure is already using a similar way to
call submit_extent_page() directly, so I guess it shouldn't be too
destructive.

Thanks,
Qu

David Sterba Nov. 2, 2020, 2:56 p.m. UTC | #3
On Wed, Oct 21, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
> Patches can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage_data_fullpage_write
> 
> Qu Wenruo (67):

So far I've merged

      btrfs: extent_io: fix the comment on lock_extent_buffer_for_io()
      btrfs: extent_io: update the comment for find_first_extent_bit()
      btrfs: extent_io: sink the failed_start parameter to set_extent_bit()
      btrfs: disk-io: replace fs_info and private_data with inode for btrfs_wq_submit_bio()
      btrfs: inode: sink parameter start and len to check_data_csum()
      btrfs: extent_io: rename pages_locked in process_pages_contig()
      btrfs: extent_io: only require sector size alignment for page read
      btrfs: extent_io: rename page_size to io_size in submit_extent_page()

to misc-next.  This is from the first 20, the easy and safe changes.
There are a few more that need more explanation or another look.

Qu Wenruo Nov. 3, 2020, 12:06 a.m. UTC | #4
On 2020/11/2 10:56 PM, David Sterba wrote:
> On Wed, Oct 21, 2020 at 02:24:46PM +0800, Qu Wenruo wrote:
>> Patches can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage_data_fullpage_write
>>
>> Qu Wenruo (67):
> 
> So far I've merged
> 
>       btrfs: extent_io: fix the comment on lock_extent_buffer_for_io()
>       btrfs: extent_io: update the comment for find_first_extent_bit()
>       btrfs: extent_io: sink the failed_start parameter to set_extent_bit()
>       btrfs: disk-io: replace fs_info and private_data with inode for btrfs_wq_submit_bio()
>       btrfs: inode: sink parameter start and len to check_data_csum()
>       btrfs: extent_io: rename pages_locked in process_pages_contig()
>       btrfs: extent_io: only require sector size alignment for page read
>       btrfs: extent_io: rename page_size to io_size in submit_extent_page()
> 
> to misc-next.  This is from the first 20, the easy and safe changes.
> There are a few more that need more explanation or another look.
> 
That's great.

BTW, for the next update, I should rebase all patches onto the current
misc-next, right?
Especially to take advantage of things like sectorsize_bits.

Also, for the next round of patches, should I send all the patches in
one huge batch, or just send the safe refactors (with comments
addressed)?

Thanks,
Qu