[v4,00/18] btrfs: add read-only support for subpage sector size

Message ID 20210116071533.105780-1-wqu@suse.com (mailing list archive)

Qu Wenruo Jan. 16, 2021, 7:15 a.m. UTC
Patches can be fetched from github:
https://github.com/adam900710/linux/tree/subpage
Currently the branch also contains partial RW data support (still some
ordered extent and data csum mismatch problems)

Great thanks to David/Nikolay/Josef for their effort reviewing and
merging the preparation patches into misc-next.

=== What works ===

Just from the patchset:
- Data read
  Both regular and compressed data, with csum check.

- Metadata read

This means, with this patchset, 64K page systems can at least mount
btrfs with 4K sector size.

In the subpage branch
- Metadata read write and balance
  Not yet fully tested, since data write still has bugs that need to be
  solved.
  But considering that metadata operations from the previous iteration
  are mostly untouched, metadata read write should be pretty stable.

- Data read write and balance
  Only uncompressed data writes. Fsstress can survive for around 5000
  ops and more.
  But there are still some random data csum errors, and even rarer
  ordered extent related BUG_ON()s.
  Still investigating.

=== Needs feedback ===
The following design needs extra comments:

- u16 bitmap
  As David mentioned, using u16 as a bitmap is not the fastest way.
  That's also why the current bitmap code requires unsigned long (at
  least 32 bits) as the minimal unit.
  But using such a bitmap directly would at least double the memory
  usage.
  Thus the best way is to pack two u16 bitmaps into one u32 bitmap, but
  that still needs extra investigation to find a better practice.

  Anyway the skeleton should be pretty simple to expand.

- Separate handling for subpage metadata
  Currently the metadata read path (and later the write path) handles
  subpage metadata differently, mostly because page locking must be
  skipped for subpage metadata.
  I tried several times to use as much common code as possible, but
  every time I ended up reverting back to the current code.

  Thankfully, for data handling we will use the same common code.

- Incompatible subpage structure against iomap_page
  In btrfs we need more bits than iomap_page provides.
  This is because we need sector-perfect writes for data balance.
  E.g. if only one 4K sector is dirty in a 64K page, we should only
  write that dirty 4K back to disk, not the full 64K page.

  Data balance requires the new data extents to have exactly the
  same size as the original ones.
  This means, unless iomap_page gets extra bits like what we're doing in
  btrfs for dirty, we can't merge btrfs_subpage with iomap_page.
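
A minimal userspace sketch of the bitmap points above, assuming a 64K
page with 4K sectors (16 sectors per page, so one u16 per status). All
names here (demo_subpage, demo_set_range(), ...) are hypothetical and
are not the btrfs_subpage API from the patches; putting uptodate in the
low half and dirty in the high half of a u32 is likewise only an
assumption. It shows how two u16 bitmaps could share one u32, and how a
dirty bitmap gives the exact sector range to write back:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE		(64u * 1024u)	/* assumed page size */
#define DEMO_SECTOR_SIZE	(4u * 1024u)	/* assumed sector size */
#define DEMO_SECTORS_PER_PAGE	(DEMO_PAGE_SIZE / DEMO_SECTOR_SIZE)

/* Hypothetical per-page state: two u16 bitmaps packed into one u32. */
struct demo_subpage {
	uint32_t packed;	/* bits 0-15: uptodate, bits 16-31: dirty */
};

enum demo_bitmap { DEMO_UPTODATE = 0, DEMO_DIRTY = 1 };

/* Build a 16-bit mask covering the sectors in [offset, offset + len). */
static uint16_t demo_range_mask(uint32_t offset, uint32_t len)
{
	assert(len > 0 && offset + len <= DEMO_PAGE_SIZE);
	assert(offset % DEMO_SECTOR_SIZE == 0 && len % DEMO_SECTOR_SIZE == 0);

	uint32_t first = offset / DEMO_SECTOR_SIZE;
	uint32_t nr = len / DEMO_SECTOR_SIZE;

	return (uint16_t)(((1u << nr) - 1u) << first);
}

/* Set the bits of one status for a sector-aligned byte range. */
static void demo_set_range(struct demo_subpage *sp, enum demo_bitmap which,
			   uint32_t offset, uint32_t len)
{
	sp->packed |= (uint32_t)demo_range_mask(offset, len) << (16 * which);
}

/* Extract one of the two 16-bit bitmaps from the packed word. */
static uint16_t demo_get_bitmap(const struct demo_subpage *sp,
				enum demo_bitmap which)
{
	return (uint16_t)(sp->packed >> (16 * which));
}

/*
 * Find the next run of contiguous dirty sectors at or after sector
 * @start. Returns 1 and fills @offset/@len (in bytes) if a run exists,
 * 0 otherwise.
 */
static int demo_next_dirty_range(const struct demo_subpage *sp, uint32_t start,
				 uint32_t *offset, uint32_t *len)
{
	uint16_t dirty = demo_get_bitmap(sp, DEMO_DIRTY);
	uint32_t i = start;

	while (i < DEMO_SECTORS_PER_PAGE && !(dirty & (1u << i)))
		i++;
	if (i >= DEMO_SECTORS_PER_PAGE)
		return 0;

	uint32_t first = i;

	while (i < DEMO_SECTORS_PER_PAGE && (dirty & (1u << i)))
		i++;
	*offset = first * DEMO_SECTOR_SIZE;
	*len = (i - first) * DEMO_SECTOR_SIZE;
	return 1;
}
```

With only sector 2 marked dirty, demo_next_dirty_range() reports offset
8192 and length 4096, i.e. a single 4K write instead of the full 64K
page.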

=== Patchset structure ===
Patch 01~02:	More RW preparation patches.
		This is to separate page lock/unlock from plain
		lock/unlock_page() call with __process_pages_contig().
		This makes more sense for subpage data write, but it
		also works for regular sector size.
Patch 03~12:	Subpage metadata allocation and freeing
Patch 13~15:	Subpage metadata read path
Patch 16~17:	Subpage data read path
Patch 18:	Enable subpage RO support

=== Changelog ===
v1:
- Separate the main implementation from previous huge patchset
  Huge patchset doesn't make much sense.

- Use bitmap implementation
  Now page::private will be a pointer to btrfs_subpage structure, which
  contains bitmaps for various page status.

v2:
- Use page::private as btrfs_subpage for extra info
  This replaces the old extent io tree based solution, which reduces
  latency and doesn't require memory allocation for its operations.

- Cherry-pick new preparation patches from RW development
  Those new preparation patches improve readability on their own.

v3:
- Make dummy extent buffers follow the same subpage accessors
  Fsstress exposed several ASSERT()s for dummy extent buffers.
  It turns out we need to make dummy extent buffers own the same
  btrfs_subpage structure for the eb accessors to work properly.

- Two new small __process_pages_contig() related preparation patches
  One to enhance the error handling path of __process_pages_contig()
  for locked_page, one to merge one macro.

- Extent buffers refs count update
  Except try_release_extent_buffer(), all other eb uses will try to
  increase the ref count of the eb.
  For try_release_extent_buffer(), the eb refs check will happen inside
  the rcu critical section to avoid eb being freed.

- Comment updates
  Addressing the comments from the mailing list.

v4:
- Get rid of btrfs_subpage::tree_block_bitmap
  This is to reduce lock complexity (no need to bother extra subpage
  lock for metadata, all locks are existing locks)
  Now eb lookup mostly depends on the radix tree, with a little help
  from btrfs_subpage::under_alloc.
  I haven't experienced metadata related problems any more during
  my local fsstress tests.

- Fix a race on the metadata page dirty bit
  Fixed in the metadata RW patchset though.

- Rebased to latest misc-next branch
  With 4 patches removed, as they are already in misc-next.

Qu Wenruo (18):
  btrfs: update locked page dirty/writeback/error bits in
    __process_pages_contig()
  btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into
    PAGE_START_WRITEBACK
  btrfs: introduce the skeleton of btrfs_subpage structure
  btrfs: make attach_extent_buffer_page() to handle subpage case
  btrfs: make grab_extent_buffer_from_page() to handle subpage case
  btrfs: support subpage for extent buffer page release
  btrfs: attach private to dummy extent buffer pages
  btrfs: introduce helper for subpage uptodate status
  btrfs: introduce helper for subpage error status
  btrfs: make set/clear_extent_buffer_uptodate() to support subpage size
  btrfs: make btrfs_clone_extent_buffer() to be subpage compatible
  btrfs: implement try_release_extent_buffer() for subpage metadata
    support
  btrfs: introduce read_extent_buffer_subpage()
  btrfs: extent_io: make endio_readpage_update_page_status() to handle
    subpage case
  btrfs: disk-io: introduce subpage metadata validation check
  btrfs: introduce btrfs_subpage for data inodes
  btrfs: integrate page status update for data read path into
    begin/end_page_read()
  btrfs: allow RO mount of 4K sector size fs on 64K page system

 fs/btrfs/Makefile           |   3 +-
 fs/btrfs/compression.c      |  10 +-
 fs/btrfs/disk-io.c          |  82 +++++-
 fs/btrfs/extent_io.c        | 520 +++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h        |  15 +-
 fs/btrfs/file.c             |  24 +-
 fs/btrfs/free-space-cache.c |  15 +-
 fs/btrfs/inode.c            |  40 ++-
 fs/btrfs/ioctl.c            |   5 +-
 fs/btrfs/reflink.c          |   5 +-
 fs/btrfs/relocation.c       |  12 +-
 fs/btrfs/subpage.c          |  39 +++
 fs/btrfs/subpage.h          | 263 ++++++++++++++++++
 fs/btrfs/super.c            |   7 +
 14 files changed, 920 insertions(+), 120 deletions(-)
 create mode 100644 fs/btrfs/subpage.c
 create mode 100644 fs/btrfs/subpage.h

Comments

David Sterba Jan. 18, 2021, 11:17 p.m. UTC | #1
On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
> Patches can be fetched from github:
> https://github.com/adam900710/linux/tree/subpage
> Currently the branch also contains partial RW data support (still some
> ordered extent and data csum mismatch problems)
> 
> Great thanks to David/Nikolay/Josef for their effort reviewing and
> merging the preparation patches into misc-next.
> 
> === What works ===
> 
> Just from the patchset:
> - Data read
>   Both regular and compressed data, with csum check.
> 
> - Metadata read
> 
> This means, with this patchset, 64K page systems can at least mount
> btrfs with 4K sector size.

I haven't found anything serious, the comments are cosmetic and I can
fixup that or other simple things at commit time.

Is there anything serious still not working? As the subpage support is
sort of an isolated feature we could afford to get the first batch of
code in and continue polishing. Read-only support with 64k/4k is a good
milestone so I'm not worried too much about some smaller things left
behind, as long as the default case page size == sectorsize works.

Tests of this branch are still running but so far so good. I'll add it
as a topic branch to for-next for testing and my current plan is to push
it to misc-next soon, targeting 5.12.

> In the subpage branch
> - Metadata read write and balance
>   Not yet fully tested, since data write still has bugs that need to be
>   solved.
>   But considering that metadata operations from the previous iteration
>   are mostly untouched, metadata read write should be pretty stable.

I assume the bugs are for the 64k/4k use case.

> - Data read write and balance
>   Only uncompressed data writes. Fsstress can survive for around 5000
>   ops and more.
>   But there are still some random data csum errors, and even rarer
>   ordered extent related BUG_ON()s.
>   Still investigating.

You say it's for 'read write', right now getting the read-only support
without known bugs would be sufficient.

> === Needs feedback ===
> The following design needs extra comments:
> 
> - u16 bitmap
>   As David mentioned, using u16 as a bitmap is not the fastest way.
>   That's also why the current bitmap code requires unsigned long (at
>   least 32 bits) as the minimal unit.
>   But using such a bitmap directly would at least double the memory
>   usage.
>   Thus the best way is to pack two u16 bitmaps into one u32 bitmap, but
>   that still needs extra investigation to find a better practice.

I think that for the first implementation we can afford to trade off
performance for correctness. In this case that means non-optimal bitmap
tracking with the spinlock. Replacing it with better bitmap tracking
using atomics would be a separate step and can be reviewed independently
once we know the slow but correct case works as expected.
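
To illustrate the two steps, here is a userspace C11 sketch, not kernel
code: an atomic_flag spin loop stands in for the kernel spinlock, and
atomic_fetch_or() stands in for an atomic bit operation. The struct and
function names are made up for this example and do not come from the
patches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Step 1: plain u16 bitmap guarded by a lock (simple, but slower). */
struct demo_locked_bitmap {
	atomic_flag lock;	/* stand-in for a kernel spinlock */
	uint16_t bits;
};

static void demo_locked_init(struct demo_locked_bitmap *b)
{
	atomic_flag_clear(&b->lock);
	b->bits = 0;
}

static void demo_locked_set(struct demo_locked_bitmap *b, uint16_t mask)
{
	assert(mask != 0);
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&b->lock,
						 memory_order_acquire))
		;
	b->bits |= mask;
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Step 2 (later optimization): lock-free update with an atomic OR. */
struct demo_atomic_bitmap {
	_Atomic uint16_t bits;
};

static void demo_atomic_set(struct demo_atomic_bitmap *b, uint16_t mask)
{
	assert(mask != 0);
	atomic_fetch_or_explicit(&b->bits, mask, memory_order_relaxed);
}
```

Both variants end up with the same bits set; the difference is only in
how concurrent updaters are serialized, which is why the second one can
be reviewed as a separate change.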

>   Anyway the skeleton should be pretty simple to expand.
> 
> - Separate handling for subpage metadata
>   Currently the metadata read path (and later the write path) handles
>   subpage metadata differently, mostly because page locking must be
>   skipped for subpage metadata.
>   I tried several times to use as much common code as possible, but
>   every time I ended up reverting back to the current code.
> 
>   Thankfully, for data handling we will use the same common code.

Ok.

> - Incompatible subpage structure against iomap_page
>   In btrfs we need more bits than iomap_page provides.
>   This is because we need sector-perfect writes for data balance.
>   E.g. if only one 4K sector is dirty in a 64K page, we should only
>   write that dirty 4K back to disk, not the full 64K page.
> 
>   Data balance requires the new data extents to have exactly the
>   same size as the original ones.
>   This means, unless iomap_page gets extra bits like what we're doing in
>   btrfs for dirty, we can't merge btrfs_subpage with iomap_page.

Ok, so implementing the subpage support inside btrfs first gives us some
space for the specific needs or workarounds that would perhaps need
extensions of the iomap API. Once we have that working and understand
what exactly we need, then we can ask for iomap changes. This has worked
well, e.g. during the direct io conversion, so we can build on that.
Qu Wenruo Jan. 18, 2021, 11:26 p.m. UTC | #2
On 2021/1/19 7:17 AM, David Sterba wrote:
> On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
>> Patches can be fetched from github:
>> https://github.com/adam900710/linux/tree/subpage
>> Currently the branch also contains partial RW data support (still some
>> ordered extent and data csum mismatch problems)
>>
>> Great thanks to David/Nikolay/Josef for their effort reviewing and
>> merging the preparation patches into misc-next.
>>
>> === What works ===
>>
>> Just from the patchset:
>> - Data read
>>    Both regular and compressed data, with csum check.
>>
>> - Metadata read
>>
>> This means, with this patchset, 64K page systems can at least mount
>> btrfs with 4K sector size.
>
> I haven't found anything serious, the comments are cosmetic and I can
> fixup that or other simple things at commit time.
>
> Is there anything serious still not working?

Compression write (not even touching it).
Random (rare) ordered extent related bugs (from BUG_ON() due to missing
ordered extent to data csum mismatch).
Working on the ordered extent bug now.

> As the subpage support is
> sort of an isolated feature we could afford to get the first batch of
> code in and continue polishing. Read-only support with 64k/4k is a good
> milestone so I'm not worried too much about some smaller things left
> behind, as long as the default case page size == sectorsize works.

Yeah, that's the core design of the current subpage support: all subpage
cases will be handled in separate routines, leaving minimal impact on
existing code.

>
> Tests of this branch are still running but so far so good. I'll add it
> as a topic branch to for-next for testing and my current plan is to push
> it to misc-next soon, targeting 5.12.

That's great to hear.
>
>> In the subpage branch
>> - Metadata read write and balance
>>    Not yet fully tested, since data write still has bugs that need to be
>>    solved.
>>    But considering that metadata operations from the previous iteration
>>    are mostly untouched, metadata read write should be pretty stable.
>
> I assume the bugs are for the 64k/4k use case.

Yes, at least the 4K case passes fstests in my local env.

Thanks,
Qu

>
>> - Data read write and balance
>>    Only uncompressed data writes. Fsstress can survive for around 5000
>>    ops and more.
>>    But there are still some random data csum errors, and even rarer
>>    ordered extent related BUG_ON()s.
>>    Still investigating.
>
> You say it's for 'read write', right now getting the read-only support
> without known bugs would be sufficient.
>
>> === Needs feedback ===
>> The following design needs extra comments:
>>
>> - u16 bitmap
>>    As David mentioned, using u16 as a bitmap is not the fastest way.
>>    That's also why the current bitmap code requires unsigned long (at
>>    least 32 bits) as the minimal unit.
>>    But using such a bitmap directly would at least double the memory
>>    usage.
>>    Thus the best way is to pack two u16 bitmaps into one u32 bitmap, but
>>    that still needs extra investigation to find a better practice.
>
> I think that for the first implementation we can afford to trade off
> performance for correctness. In this case that means non-optimal bitmap
> tracking with the spinlock. Replacing it with better bitmap tracking
> using atomics would be a separate step and can be reviewed independently
> once we know the slow but correct case works as expected.
>
>>    Anyway the skeleton should be pretty simple to expand.
>>
>> - Separate handling for subpage metadata
>>    Currently the metadata read path (and later the write path) handles
>>    subpage metadata differently, mostly because page locking must be
>>    skipped for subpage metadata.
>>    I tried several times to use as much common code as possible, but
>>    every time I ended up reverting back to the current code.
>>
>>    Thankfully, for data handling we will use the same common code.
>
> Ok.
>
>> - Incompatible subpage structure against iomap_page
>>    In btrfs we need more bits than iomap_page provides.
>>    This is because we need sector-perfect writes for data balance.
>>    E.g. if only one 4K sector is dirty in a 64K page, we should only
>>    write that dirty 4K back to disk, not the full 64K page.
>>
>>    Data balance requires the new data extents to have exactly the
>>    same size as the original ones.
>>    This means, unless iomap_page gets extra bits like what we're doing in
>>    btrfs for dirty, we can't merge btrfs_subpage with iomap_page.
>
> Ok, so implementing the subpage support inside btrfs first gives us some
> space for the specific needs or workarounds that would perhaps need
> extensions of the iomap API. Once we have that working and understand
> what exactly we need, then we can ask for iomap changes. This has worked
> well, e.g. during the direct io conversion, so we can build on that.
>
David Sterba Jan. 24, 2021, 12:29 p.m. UTC | #3
On Tue, Jan 19, 2021 at 07:26:17AM +0800, Qu Wenruo wrote:
> On 2021/1/19 7:17 AM, David Sterba wrote:
> > On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
> > As the subpage support is
> > sort of an isolated feature we could afford to get the first batch of
> > code in and continue polishing. Read-only support with 64k/4k is a good
> > milestone so I'm not worried too much about some smaller things left
> > behind, as long as the default case page size == sectorsize works.
> 
> Yeah, that's the core design of the current subpage support: all subpage
> cases will be handled in separate routines, leaving minimal impact on
> existing code.
> 
> >
> > Tests of this branch are still running but so far so good. I'll add it
> > as a topic branch to for-next for testing and my current plan is to push
> > it to misc-next soon, targeting 5.12.
> 
> That's great to hear.
> >
> >> In the subpage branch
> >> - Metadata read write and balance
> >>    Not yet fully tested, since data write still has bugs that need to be
> >>    solved.
> >>    But considering that metadata operations from the previous iteration
> >>    are mostly untouched, metadata read write should be pretty stable.
> >
> > I assume the bugs are for the 64k/4k use case.
> 
> Yes, at least the 4K case passes fstests in my local env.

I'd done a pre-merge pass last week with fixups in changelogs, subjects
and some coding style fixes but that was before Josef's comments. Some
of them still need updates but I also don't want to throw away my
changes.  (Ideally I don't have to do them at all, you can get the gist
of what are the most common things I'm fixing by comparing both versions.)

Please have a look at the branch ext/qu/subpage-v4 in my github repo,
the patches are in the same order as in this posted patchset. If the
patch does not change you can keep it as is, I'll reuse what I have.

For the final merge of the read-only support, patch 1 could be dropped
as discussed. The rest is hopefully ok to go, please resend, thanks.
Qu Wenruo Jan. 25, 2021, 1:19 a.m. UTC | #4
On 2021/1/24 8:29 PM, David Sterba wrote:
> On Tue, Jan 19, 2021 at 07:26:17AM +0800, Qu Wenruo wrote:
>> On 2021/1/19 7:17 AM, David Sterba wrote:
>>> On Sat, Jan 16, 2021 at 03:15:15PM +0800, Qu Wenruo wrote:
>>> As the subpage support is
>>> sort of an isolated feature we could afford to get the first batch of
>>> code in and continue polishing. Read-only support with 64k/4k is a good
>>> milestone so I'm not worried too much about some smaller things left
>>> behind, as long as the default case page size == sectorsize works.
>>
>> Yeah, that's the core design of the current subpage support: all subpage
>> cases will be handled in separate routines, leaving minimal impact on
>> existing code.
>>
>>>
>>> Tests of this branch are still running but so far so good. I'll add it
>>> as a topic branch to for-next for testing and my current plan is to push
>>> it to misc-next soon, targeting 5.12.
>>
>> That's great to hear.
>>>
>>>> In the subpage branch
>>>> - Metadata read write and balance
>>>>     Not yet fully tested, since data write still has bugs that need to be
>>>>     solved.
>>>>     But considering that metadata operations from the previous iteration
>>>>     are mostly untouched, metadata read write should be pretty stable.
>>>
>>> I assume the bugs are for the 64k/4k use case.
>>
>> Yes, at least the 4K case passes fstests in my local env.
> 
> I'd done a pre-merge pass last week with fixups in changelogs, subjects
> and some coding style fixes but that was before Josef's comments. Some
> of them still need updates but I also don't want to throw away my
> changes.  (Ideally I don't have to do them at all, you can get the gist
> of what are the most common things I'm fixing by comparing both versions.)
> 
> Please have a look at the branch ext/qu/subpage-v4 in my github repo,
> the patches are in the same order as in this posted patchset. If the
> patch does not change you can keep it as is, I'll reuse what I have.

Already doing this, using ext/qu/subpage-v4 as the base, so all your
modifications should still be there.

Thanks,
Qu

> 
> For the final merge of the read-only support, patch 1 could be dropped
> as discussed. The rest is hopefully ok to go, please resend, thanks.
>