mbox series

[RFC,0/2] btrfs: defrag: further preparation for multi-page sector size

Message ID cover.1706068026.git.wqu@suse.com (mailing list archive)
Headers show
Series btrfs: defrag: further preparation for multi-page sector size | expand

Message

Qu Wenruo Jan. 24, 2024, 3:59 a.m. UTC
With the folio interface, it's much easier to support multi-page sector
size (aka, sector/block size > PAGE_SIZE, which is rare between major
upstream filesystems).

The basic idea is, if we firstly convert to full folio interface, and
allow an address space to only allocate folio which is exactly
block/sector size, the support for multi-page would be mostly done.

But before that support, there are still quite some conversion left for
btrfs.

Furthermore, with both subpage and multipage sector size, we need to
handle folio different:

- For subpage
  The folio would always be page sized.

- For multipage (and regular sectorsize == PAGE_SIZE)
  The folio would be sector sized.

Furthermore, the filemap interface would make various shifts more
complex.
As filemap_*() interfaces use index which is PAGE_SHIFT based,
meanwhile with potential larger folio, the folio shift can be larger
than PAGE_SHIFT.

Thus in the future, we may want to change filemap interface to accept
bytenr to avoid confusion between page and folio shifts.

The first patch would introduce a cached folio size and shift.

The second patch would make defrag to utilize the new cached folio size
and shift.
(And still use PAGE_SHIFT only for filemap*() interfaces)

Qu Wenruo (2):
  btrfs: introduce cached folio size
  btrfs: defrag: prepare defrag for larger data folio size

 fs/btrfs/defrag.c  | 69 +++++++++++++++++++++++++---------------------
 fs/btrfs/disk-io.c | 11 ++++++++
 fs/btrfs/fs.h      | 10 +++++++
 3 files changed, 58 insertions(+), 32 deletions(-)

Comments

Qu Wenruo Jan. 24, 2024, 4:03 a.m. UTC | #1
Adding MM list to this cover letter.

On 2024/1/24 14:29, Qu Wenruo wrote:
> With the folio interface, it's much easier to support multi-page sector
> size (aka, sector/block size > PAGE_SIZE, which is rare between major
> upstream filesystems).
> 
> The basic idea is, if we firstly convert to full folio interface, and
> allow an address space to only allocate folio which is exactly
> block/sector size, the support for multi-page would be mostly done.
> 
> But before that support, there are still quite some conversion left for
> btrfs.
> 
> Furthermore, with both subpage and multipage sector size, we need to
> handle folio different:
> 
> - For subpage
>    The folio would always be page sized.
> 
> - For multipage (and regular sectorsize == PAGE_SIZE)
>    The folio would be sector sized.
> 
> Furthermore, the filemap interface would make various shifts more
> complex.
> As filemap_*() interfaces use index which is PAGE_SHIFT based,
> meanwhile with potential larger folio, the folio shift can be larger
> than PAGE_SHIFT.

As I really want some feedback on this part.

I'm pretty sure we would have some filesystems go utilizing larger 
folios to implement their multi-page block size support.

Thus in that case, can we have an interface change to make all folio 
versions of filemap_*() to accept a file offset instead of page index?

Thanks,
Qu
Matthew Wilcox Jan. 24, 2024, 4:48 a.m. UTC | #2
On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
> I'm pretty sure we would have some filesystems go utilizing larger folios to
> implement their multi-page block size support.
> 
> Thus in that case, can we have an interface change to make all folio
> versions of filemap_*() to accept a file offset instead of page index?

You're confused.  There's no change needed to the filemap API to support
large folios used by large block sizes.  Quite possibly more of btrfs
is confused, but it's really very simple.  index == pos / PAGE_SIZE.
That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
machine.

That implies that folios must be at least order-4, but you can still
look up a folio at index 23 and get back the folio which was stored at
index 16 (range 16-31).

hugetlbfs made the mistake of 'hstate->order' and it's still not fixed.
It's a little better than it was (thanks to Sid), but more work is needed.
Just use the same approach as THPs or you're going to end up hurt.
Qu Wenruo Jan. 24, 2024, 5:27 a.m. UTC | #3
On 2024/1/24 15:18, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
>> I'm pretty sure we would have some filesystems go utilizing larger folios to
>> implement their multi-page block size support.
>>
>> Thus in that case, can we have an interface change to make all folio
>> versions of filemap_*() to accept a file offset instead of page index?
>
> You're confused.  There's no change needed to the filemap API to support
> large folios used by large block sizes.  Quite possibly more of btrfs
> is confused, but it's really very simple.  index == pos / PAGE_SIZE.
> That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
> machine.

Yes, I understand that filemap API is always working on PAGE_SHIFTed index.

The concern is, (hopefully) with more fses going to utilized large
folios, there would be two shifts.

One folio shift (ilog2(blocksize)), one PAGE_SHIFT for filemap interfaces.

And I'm pretty sure it's going to cause confusion, e.g. someone doing
the conversion without much think, and all go the folio shift, even for
filemap_get_folio().

Thus I'm wondering if it's possible to get a bytenr version of
filemap_get_folio().

(Or is it better just creating an inline wrapper inside the fs to avoid
confusion?)

>
> That implies that folios must be at least order-4, but you can still
> look up a folio at index 23 and get back the folio which was stored at
> index 16 (range 16-31).

Yep, that's also what I expect, and that is very handy.

Thanks,
Qu

>
> hugetlbfs made the mistake of 'hstate->order' and it's still not fixed.
> It's a little better than it was (thanks to Sid), but more work is needed.
> Just use the same approach as THPs or you're going to end up hurt.
>
Matthew Wilcox Jan. 24, 2024, 5:43 a.m. UTC | #4
On Wed, Jan 24, 2024 at 03:57:39PM +1030, Qu Wenruo wrote:
> 
> 
> On 2024/1/24 15:18, Matthew Wilcox wrote:
> > On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
> > > I'm pretty sure we would have some filesystems go utilizing larger folios to
> > > implement their multi-page block size support.
> > > 
> > > Thus in that case, can we have an interface change to make all folio
> > > versions of filemap_*() to accept a file offset instead of page index?
> > 
> > You're confused.  There's no change needed to the filemap API to support
> > large folios used by large block sizes.  Quite possibly more of btrfs
> > is confused, but it's really very simple.  index == pos / PAGE_SIZE.
> > That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
> > machine.
> 
> Yes, I understand that filemap API is always working on PAGE_SHIFTed index.

OK, good.

> The concern is, (hopefully) with more fses going to utilized large
> folios, there would be two shifts.
> 
> One folio shift (ilog2(blocksize)), one PAGE_SHIFT for filemap interfaces.

Don't shift the file position by the folio_shift().  You want to support
large(r) folios _and_ large blocksizes at the same time.  ie 64kB might
be the block size, but all that would mean would be that folio_shift()
would be at least 16.  It might be 17, 18 or 21 (for a THP).

Filesystems already have to deal with different PAGE_SIZE, SECTOR_SIZE,
fsblock size and LBA size.  Folios aren't making things any worse here
(they're also not making anything better in this area, but I never
claimed they would).

btrfs is slightly unusual in that it defined PAGE_SIZE and fsblock size
to be the same (and then had to deal with the consequences of arm64/x86
interoperability later).  But most filesystems have pretty good separation
of the four concepts.
Qu Wenruo Jan. 24, 2024, 5:50 a.m. UTC | #5
On 2024/1/24 16:13, Matthew Wilcox wrote:
> On Wed, Jan 24, 2024 at 03:57:39PM +1030, Qu Wenruo wrote:
>>
>>
>> On 2024/1/24 15:18, Matthew Wilcox wrote:
>>> On Wed, Jan 24, 2024 at 02:33:22PM +1030, Qu Wenruo wrote:
>>>> I'm pretty sure we would have some filesystems go utilizing larger folios to
>>>> implement their multi-page block size support.
>>>>
>>>> Thus in that case, can we have an interface change to make all folio
>>>> versions of filemap_*() to accept a file offset instead of page index?
>>>
>>> You're confused.  There's no change needed to the filemap API to support
>>> large folios used by large block sizes.  Quite possibly more of btrfs
>>> is confused, but it's really very simple.  index == pos / PAGE_SIZE.
>>> That's all.  Even if you have a 64kB block size device on a 4kB PAGE_SIZE
>>> machine.
>>
>> Yes, I understand that filemap API is always working on PAGE_SHIFTed index.
> 
> OK, good.
> 
>> The concern is, (hopefully) with more fses going to utilized large
>> folios, there would be two shifts.
>>
>> One folio shift (ilog2(blocksize)), one PAGE_SHIFT for filemap interfaces.
> 
> Don't shift the file position by the folio_shift().  You want to support
> large(r) folios _and_ large blocksizes at the same time.  ie 64kB might
> be the block size, but all that would mean would be that folio_shift()
> would be at least 16.  It might be 17, 18 or 21 (for a THP).

Indeed, I forgot we have THP.

> 
> Filesystems already have to deal with different PAGE_SIZE, SECTOR_SIZE,
> fsblock size and LBA size.  Folios aren't making things any worse here
> (they're also not making anything better in this area, but I never
> claimed they would).

OK, that makes sense.

So all the folio shifts would be an fs internal deal, and we have to 
handle it properly.

> 
> btrfs is slightly unusual in that it defined PAGE_SIZE and fsblock size
> to be the same (and then had to deal with the consequences of arm64/x86
> interoperability later).  But most filesystems have pretty good separation
> of the four concepts.

Indeed, although I also found most major fses do not support larger 
block size (> PAGE_SIZE) either.

I guess subpage is simpler in the past, and hopefully with larger folio, 
we can see more fses support multi-page blocksize soon.

Thanks,
Qu