[3/9] xfs: convert buffer cache to use high order folios

Message ID 20240318224715.3367463-4-david@fromorbit.com (mailing list archive)
State New
Series xfs: use large folios for buffers

Commit Message

Dave Chinner March 18, 2024, 10:45 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

Now that we have the buffer cache using the folio API, we can extend
the use of folios to allocate high order folios for multi-page
buffers rather than an array of single pages that are then vmapped
into a contiguous range.

This creates two types of buffers: single folio buffers that can
have arbitrary order, and multi-folio buffers made up of many single
page folios that get vmapped. The latter is essentially the existing
code, so there are no logic changes to handle this case.

There are a few places where we iterate the folios on a buffer.
These need to be converted to handle the high order folio case.
Luckily, this only occurs when bp->b_folio_count == 1, and the code
for handling this case is just a simple application of the folio API
to the operations that need to be performed.

The code that allocates buffers will optimistically attempt a high
order folio allocation as a fast path. If this high order allocation
fails, then we fall back to the existing multi-folio allocation
code. This now forms the slow allocation path, and hopefully will be
largely unused in normal conditions.

This should improve performance of large buffer operations (e.g.
large directory block sizes) as we should now mostly avoid the
expense of vmapping large buffers (and the vmap lock contention that
can occur) as well as avoid the runtime pressure that frequently
accessing kernel vmapped pages puts on the TLBs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 110 insertions(+), 31 deletions(-)

Comments

Christoph Hellwig March 19, 2024, 6:55 a.m. UTC | #1
This looks mostly good to me.  One question:
xfs_buf_free passes the folio count to mm_account_reclaimed_pages.
The name and implementation suggest it actually wants a count of
page size units, though.
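
For illustration only, a minimal sketch of what page-unit accounting in
xfs_buf_free() could look like, assuming the b_folios/b_folio_count fields
this series introduces and the existing folio_nr_pages() helper (this is not
code from the patch itself):

	if (bp->b_folio_count == 1)
		/* A single folio may be high order; count its pages. */
		mm_account_reclaimed_pages(folio_nr_pages(bp->b_folios[0]));
	else
		/* Multi-folio buffers always use one page per folio. */
		mm_account_reclaimed_pages(bp->b_folio_count);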
Darrick J. Wong March 19, 2024, 5:29 p.m. UTC | #2
On Tue, Mar 19, 2024 at 09:45:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the buffer cache using the folio API, we can extend
> the use of folios to allocate high order folios for multi-page
> buffers rather than an array of single pages that are then vmapped
> into a contiguous range.
> 
> This creates two types of buffers: single folio buffers that can
> have arbitrary order, and multi-folio buffers made up of many single
> page folios that get vmapped. The latter is essentially the existing
> code, so there are no logic changes to handle this case.
> 
> There are a few places where we iterate the folios on a buffer.
> These need to be converted to handle the high order folio case.
> Luckily, this only occurs when bp->b_folio_count == 1, and the code
> for handling this case is just a simple application of the folio API
> to the operations that need to be performed.
> 
> The code that allocates buffers will optimistically attempt a high
> order folio allocation as a fast path. If this high order allocation
> fails, then we fall back to the existing multi-folio allocation
> code. This now forms the slow allocation path, and hopefully will be
> largely unused in normal conditions.
> 
> This should improve performance of large buffer operations (e.g.
> large directory block sizes) as we should now mostly avoid the
> expense of vmapping large buffers (and the vmap lock contention that
> can occur) as well as avoid the runtime pressure that frequently
> accessing kernel vmapped pages put on the TLBs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 110 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 832226385154..7d9303497763 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -80,6 +80,10 @@ xfs_buf_is_vmapped(
>  	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
> +/*
> + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> + * allocating vmapped buffers.
> + */
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
> @@ -352,14 +356,63 @@ xfs_buf_alloc_kmem(
>  		bp->b_addr = NULL;
>  		return -ENOMEM;
>  	}
> -	bp->b_offset = offset_in_page(bp->b_addr);
>  	bp->b_folios = bp->b_folio_array;
>  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
> +/*
> + * Allocating a high order folio makes the assumption that buffers are a
> + * power-of-2 size so that ilog2() returns the exact order needed to fit
> + * the contents of the buffer. Buffer lengths are mostly a power of two,
> + * so this is not an unreasonable approach to take by default.
> + *
> + * The exception here are user xattr data buffers, which can be arbitrarily
> + * sized up to 64kB plus structure metadata. In that case, round up the order.

So.... does that mean a 128K folio for a 68k xattr remote value buffer?

I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
tree, which isn't awesome.  I haven't figured out a good way to deal
with that, since the verity code uses a lot of blkno/byte shifting and
2k merkle tree blocks suck.

(Just rambling, don't mind me)

> + */
> +static bool
> +xfs_buf_alloc_folio(
> +	struct xfs_buf	*bp,
> +	gfp_t		gfp_mask)
> +{
> +	int		length = BBTOB(bp->b_length);
> +	int		order = get_order(length);
> +
> +	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> +	if (!bp->b_folio_array[0])
> +		return false;
> +
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folio_count = 1;
> +	bp->b_flags |= _XBF_FOLIOS;
> +	return true;
> +}
> +
> +/*
> + * When we allocate folios for a buffer, we end up with one of two types of
> + * buffer.
> + *
> + * The first type is a single folio buffer - this may be a high order
> + * folio or just a single page sized folio, but either way they get treated the
> + * same way by the rest of the code - the buffer memory spans a single
> + * contiguous memory region that we don't have to map and unmap to access the
> + * data directly.
> + *
> + * The second type of buffer is the multi-folio buffer. These are *always* made
> + * up of single page folios so that they can be fed to vmap_ram() to return a
> + * contiguous memory region we can access the data through, or mark it as
> + * XBF_UNMAPPED and access the data directly through individual folio_address()
> + * calls.
> + *
> + * We don't use high order folios for this second type of buffer (yet) because
> + * having variable size folios makes offset-to-folio indexing and iteration of
> + * the data range more complex than if they are fixed size. This case should now
> + * be the slow path, though, so unless we regularly fail to allocate high order
> + * folios, there should be little need to optimise this path.
> + */
>  static int
>  xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
> @@ -371,7 +424,15 @@ xfs_buf_alloc_folios(
>  	if (flags & XBF_READ_AHEAD)
>  		gfp_mask |= __GFP_NORETRY;
>  
> -	/* Make sure that we have a page list */
> +	/* Assure zeroed buffer for non-read cases. */
> +	if (!(flags & XBF_READ))
> +		gfp_mask |= __GFP_ZERO;
> +
> +	/* Optimistically attempt a single high order folio allocation. */
> +	if (xfs_buf_alloc_folio(bp, gfp_mask))
> +		return 0;
> +
> +	/* Fall back to allocating an array of single page folios. */
>  	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
>  	if (bp->b_folio_count <= XB_FOLIOS) {
>  		bp->b_folios = bp->b_folio_array;
> @@ -383,9 +444,6 @@ xfs_buf_alloc_folios(
>  	}
>  	bp->b_flags |= _XBF_FOLIOS;
>  
> -	/* Assure zeroed buffer for non-read cases. */
> -	if (!(flags & XBF_READ))
> -		gfp_mask |= __GFP_ZERO;
>  
>  	/*
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
>  {
>  	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
> -		/* A single page buffer is always mappable */
> +		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
> @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
>  	int		*count,
>  	blk_opf_t	op)
>  {
> -	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_folio_count;
> -	int		nr_pages;
> +	int		folio_index;
> +	unsigned int	total_nr_folios = bp->b_folio_count;
> +	int		nr_folios;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
>  	int		offset;
>  
> -	/* skip the pages in the buffer before the start offset */
> -	page_index = 0;
> +	/*
> +	 * If the start offset if larger than a single page, we need to be
> +	 * careful. We might have a high order folio, in which case the indexing
> +	 * is from the start of the buffer. However, if we have more than one
> +	 * folio single page folio in the buffer, we need to skip the folios in
> +	 * the buffer before the start offset.
> +	 */
> +	folio_index = 0;
>  	offset = *buf_offset;
> -	while (offset >= PAGE_SIZE) {
> -		page_index++;
> -		offset -= PAGE_SIZE;
> +	if (bp->b_folio_count > 1) {
> +		while (offset >= PAGE_SIZE) {
> +			folio_index++;
> +			offset -= PAGE_SIZE;
> +		}
>  	}
>  
>  	/*
> @@ -1551,28 +1617,28 @@ xfs_buf_ioapply_map(
>  
>  next_chunk:
>  	atomic_inc(&bp->b_io_remaining);
> -	nr_pages = bio_max_segs(total_nr_pages);
> +	nr_folios = bio_max_segs(total_nr_folios);
>  
> -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
>  	bio->bi_iter.bi_sector = sector;
>  	bio->bi_end_io = xfs_buf_bio_end_io;
>  	bio->bi_private = bp;
>  
> -	for (; size && nr_pages; nr_pages--, page_index++) {
> -		int	rbytes, nbytes = PAGE_SIZE - offset;
> +	for (; size && nr_folios; nr_folios--, folio_index++) {
> +		struct folio	*folio = bp->b_folios[folio_index];
> +		int		nbytes = folio_size(folio) - offset;
>  
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> -				      offset);

Um, bio_add_folio returns a bool, maybe that's why hch was complaining
about hangs with only the first few patches applied?

Annoying if the kbuild robots don't catch that.  Either way, with hch's
replies to this and patch 2 dealt with,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> -		if (rbytes < nbytes)
> +		if (!bio_add_folio(bio, folio, nbytes,
> +				offset_in_folio(folio, offset)))
>  			break;
>  
>  		offset = 0;
>  		sector += BTOBB(nbytes);
>  		size -= nbytes;
> -		total_nr_pages--;
> +		total_nr_folios--;
>  	}
>  
>  	if (likely(bio->bi_iter.bi_size)) {
> @@ -1788,6 +1854,13 @@ xfs_buf_offset(
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> +	/* Single folio buffers may use large folios. */
> +	if (bp->b_folio_count == 1) {
> +		folio = bp->b_folios[0];
> +		return folio_address(folio) + offset_in_folio(folio, offset);
> +	}
> +
> +	/* Multi-folio buffers always use PAGE_SIZE folios */
>  	folio = bp->b_folios[offset >> PAGE_SHIFT];
>  	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
> @@ -1803,18 +1876,24 @@ xfs_buf_zero(
>  	bend = boff + bsize;
>  	while (boff < bend) {
>  		struct folio	*folio;
> -		int		page_index, page_offset, csize;
> +		int		folio_index, folio_offset, csize;
>  
> -		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> -		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		folio = bp->b_folios[page_index];
> -		csize = min_t(size_t, PAGE_SIZE - page_offset,
> +		/* Single folio buffers may use large folios. */
> +		if (bp->b_folio_count == 1) {
> +			folio = bp->b_folios[0];
> +			folio_offset = offset_in_folio(folio,
> +						bp->b_offset + boff);
> +		} else {
> +			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> +			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> +			folio = bp->b_folios[folio_index];
> +		}
> +
> +		csize = min_t(size_t, folio_size(folio) - folio_offset,
>  				      BBTOB(bp->b_length) - boff);
> +		ASSERT((csize + folio_offset) <= folio_size(folio));
>  
> -		ASSERT((csize + page_offset) <= PAGE_SIZE);
> -
> -		memset(folio_address(folio) + page_offset, 0, csize);
> -
> +		memset(folio_address(folio) + folio_offset, 0, csize);
>  		boff += csize;
>  	}
>  }
> -- 
> 2.43.0
> 
>
Christoph Hellwig March 19, 2024, 9:32 p.m. UTC | #3
On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> So.... does that mean a 128K folio for a 68k xattr remote value buffer?

I thought 64k was the maximum xattr size?

> I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> tree, which isn't awesome.

Maybe that's a question for the fsverity thread, but how do we end
up with these weird sizes?
Darrick J. Wong March 19, 2024, 9:38 p.m. UTC | #4
On Tue, Mar 19, 2024 at 02:32:04PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> > So.... does that mean a 128K folio for a 68k xattr remote value buffer?
> 
> I though 64k was the maximum xattr size?

64k is the maximum xattr value size, yes.  But remote xattr value blocks
now have block headers complete with owner/uuid/magic/etc.  Each block
can only store $blksz-56 bytes now.  Hence that 64k value needs
ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.

(On a 64k fsb filesystem, that bloats up to 128k.)

Luckily we bwrite remote blocks to disk directly, which is why we've
never blown out the buffer log item bitmap.

> > I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> > tree, which isn't awesome.
> 
> Maybe that's a question for the fsverity thread, but how do we end
> up with these weird sizes?

Same reason.  Merkle tree blocks are $blksz by default, but storing a
4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
that second block is wasted.

--D
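
A quick standalone sketch to sanity-check the block counts above, assuming
only the 56-byte per-block remote header mentioned in the thread (the helper
name here is made up for illustration):

#include <stdio.h>

/* Blocks needed for a remote xattr value when each block loses 56 bytes
 * to its header (owner/uuid/magic/CRC). */
static unsigned int remote_blocks(unsigned int valuelen, unsigned int blksz)
{
	unsigned int usable = blksz - 56;

	return (valuelen + usable - 1) / usable;
}

int main(void)
{
	/* 64k value on 4k blocks: ceil(65536 / 4040) == 17 */
	printf("%u\n", remote_blocks(65536, 4096));
	/* 4k merkle blob on 4k blocks: ceil(4096 / 4040) == 2 */
	printf("%u\n", remote_blocks(4096, 4096));
	return 0;
}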
Christoph Hellwig March 19, 2024, 9:41 p.m. UTC | #5
On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> now have block headers complete with owner/uuid/magic/etc.  Each block
> can only store $blksz-56 bytes now.  Hence that 64k value needs
> ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.

Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
add headers.  That almost asks for using vmalloc if the size is just
above a power of two.

> Same reason.  Merkle tree blocks are $blksz by default, but storing a
> 4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
> that second block is wasted.

*sad panda face*.  We should be able to come up with something
better than that..
Dave Chinner March 19, 2024, 9:55 p.m. UTC | #6
On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 09:45:54AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have the buffer cache using the folio API, we can extend
> > the use of folios to allocate high order folios for multi-page
> > buffers rather than an array of single pages that are then vmapped
> > into a contiguous range.
> > 
> > This creates two types of buffers: single folio buffers that can
> > have arbitrary order, and multi-folio buffers made up of many single
> > page folios that get vmapped. The latter is essentially the existing
> > code, so there are no logic changes to handle this case.
> > 
> > There are a few places where we iterate the folios on a buffer.
> > These need to be converted to handle the high order folio case.
> > Luckily, this only occurs when bp->b_folio_count == 1, and the code
> > for handling this case is just a simple application of the folio API
> > to the operations that need to be performed.
> > 
> > The code that allocates buffers will optimistically attempt a high
> > order folio allocation as a fast path. If this high order allocation
> > fails, then we fall back to the existing multi-folio allocation
> > code. This now forms the slow allocation path, and hopefully will be
> > largely unused in normal conditions.
> > 
> > This should improve performance of large buffer operations (e.g.
> > large directory block sizes) as we should now mostly avoid the
> > expense of vmapping large buffers (and the vmap lock contention that
> > can occur) as well as avoid the runtime pressure that frequently
> > accessing kernel vmapped pages put on the TLBs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 110 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 832226385154..7d9303497763 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -80,6 +80,10 @@ xfs_buf_is_vmapped(
> >  	return bp->b_addr && bp->b_folio_count > 1;
> >  }
> >  
> > +/*
> > + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> > + * allocating vmapped buffers.
> > + */
> >  static inline int
> >  xfs_buf_vmap_len(
> >  	struct xfs_buf	*bp)
> > @@ -352,14 +356,63 @@ xfs_buf_alloc_kmem(
> >  		bp->b_addr = NULL;
> >  		return -ENOMEM;
> >  	}
> > -	bp->b_offset = offset_in_page(bp->b_addr);
> >  	bp->b_folios = bp->b_folio_array;
> >  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> > +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
> >  	bp->b_folio_count = 1;
> >  	bp->b_flags |= _XBF_KMEM;
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Allocating a high order folio makes the assumption that buffers are a
> > + * power-of-2 size so that ilog2() returns the exact order needed to fit
> > + * the contents of the buffer. Buffer lengths are mostly a power of two,
> > + * so this is not an unreasonable approach to take by default.
> > + *
> > + * The exception here are user xattr data buffers, which can be arbitrarily
> > + * sized up to 64kB plus structure metadata. In that case, round up the order.
> 
> So.... does that mean a 128K folio for a 68k xattr remote value buffer?

Yes. I figured that >64kB metadata is a rare corner case, and so
the extra memory usage isn't that big a deal. If it does ever become
a problem, we can skip straight to vmalloc for such buffers...
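
To make the rounding concrete, a sketch of the arithmetic only, assuming 4kB
pages (get_order() rounds up to the next power-of-2 number of pages):

	int length = 68 * 1024;		/* 68kB remote xattr buffer */
	int order = get_order(length);	/* 17 pages -> order 5 */
	/* The folio is 2^5 * 4kB == 128kB, so ~60kB of it goes unused. */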

> I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> tree, which isn't awesome.

Yup, that's because we have headers in remote xattr blocks for
holding identifying information and CRCs.

> I haven't figured out a good way to deal
> with that, since the verity code uses a lot of blkno/byte shifting and
> 2k merkle tree blocks suck.

I really wish that the verity code just asked us to store individual
{key,value} tuples rather than large opaque blocks of fixed size
data. Then we could store everything efficiently as individual
btree records in the attr fork dabtree, and verity could still
implement its page-sized opaque block storage for all the other
filesystems behind that....

> > @@ -1551,28 +1617,28 @@ xfs_buf_ioapply_map(
> >  
> >  next_chunk:
> >  	atomic_inc(&bp->b_io_remaining);
> > -	nr_pages = bio_max_segs(total_nr_pages);
> > +	nr_folios = bio_max_segs(total_nr_folios);
> >  
> > -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> > +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
> >  	bio->bi_iter.bi_sector = sector;
> >  	bio->bi_end_io = xfs_buf_bio_end_io;
> >  	bio->bi_private = bp;
> >  
> > -	for (; size && nr_pages; nr_pages--, page_index++) {
> > -		int	rbytes, nbytes = PAGE_SIZE - offset;
> > +	for (; size && nr_folios; nr_folios--, folio_index++) {
> > +		struct folio	*folio = bp->b_folios[folio_index];
> > +		int		nbytes = folio_size(folio) - offset;
> >  
> >  		if (nbytes > size)
> >  			nbytes = size;
> >  
> > -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> > -				      offset);
> 
> Um, bio_add_folio returns a bool, maybe that's why hch was complaining
> about hangs with only the first few patches applied?

Good spot - entirely possible this is the problem...

> Annoying if the kbuild robots don't catch that.  Either way, with hch's
> replies to this and patch 2 dealt with,
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
Dave Chinner March 19, 2024, 10:23 p.m. UTC | #7
On Tue, Mar 19, 2024 at 02:41:27PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> > now have block headers complete with owner/uuid/magic/etc.  Each block
> > can only store $blksz-56 bytes now.  Hence that 64k value needs
> > ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.
> 
> Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
> add headers.

We needed CRCs for them, and they can be discontiguous so we also
needed self identifying information so xattrs could be reconstructed
from the header information.

If I was doing the v5 stuff again, I would have put the
information in the remote xattr name structure held in the dabtree
record, not the actual xattr data extent. But hindsight is 20:20....

-Dave.
Darrick J. Wong March 21, 2024, 2:12 a.m. UTC | #8
On Tue, Mar 19, 2024 at 02:41:27PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> > now have block headers complete with owner/uuid/magic/etc.  Each block
> > can only store $blksz-56 bytes now.  Hence that 64k value needs
> > ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.
> 
> Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
> add headers.  That almost asks for using vmalloc if the size is just
> above a a power of two.
> 
> > Same reason.  Merkle tree blocks are $blksz by default, but storing a
> > 4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
> > that second block is wasted.
> 
> *sad panda face*.  We should be able to come up with something
> better than that..

Well xfs_verity.c could just zlib/zstd/whatever compress the contents
and see if that helps.  We only need a ~2% compression to shrink to a
single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
1k blocks.

--D
Darrick J. Wong March 21, 2024, 2:40 a.m. UTC | #9
On Wed, Mar 20, 2024 at 07:12:36PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 02:41:27PM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > > 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> > > now have block headers complete with owner/uuid/magic/etc.  Each block
> > > can only store $blksz-56 bytes now.  Hence that 64k value needs
> > > ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.
> > 
> > Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
> > add headers.  That almost asks for using vmalloc if the size is just
> > above a a power of two.
> > 
> > > Same reason.  Merkle tree blocks are $blksz by default, but storing a
> > > 4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
> > > that second block is wasted.
> > 
> > *sad panda face*.  We should be able to come up with something
> > better than that..
> 
> Well xfs_verity.c could just zlib/zstd/whatever compress the contents
> and see if that helps.  We only need a ~2% compression to shrink to a
> single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
> 1k blocks.

...or we just turn off the attr remote header for XFS_ATTR_VERITY merkle
tree block values and XOR i_ino, merkle_pos, and the fs uuid into the
first 32 bytes.

--D

> --D
>
Christoph Hellwig March 21, 2024, 9:28 p.m. UTC | #10
On Wed, Mar 20, 2024 at 07:40:05PM -0700, Darrick J. Wong wrote:
> > Well xfs_verity.c could just zlib/zstd/whatever compress the contents
> > and see if that helps.  We only need a ~2% compression to shrink to a
> > single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
> > 1k blocks.
> 
> ...or we just turn off the attr remote header for XFS_ATTR_VERITY merkle
> tree block values and XOR i_ino, merkle_pos, and the fs uuid into the
> first 32 bytes.

That does sound much better than wasting the space or crazy compression
schemes.
Darrick J. Wong March 21, 2024, 9:39 p.m. UTC | #11
On Thu, Mar 21, 2024 at 02:28:23PM -0700, Christoph Hellwig wrote:
> On Wed, Mar 20, 2024 at 07:40:05PM -0700, Darrick J. Wong wrote:
> > > Well xfs_verity.c could just zlib/zstd/whatever compress the contents
> > > and see if that helps.  We only need a ~2% compression to shrink to a
> > > single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
> > > 1k blocks.
> > 
> > ...or we just turn off the attr remote header for XFS_ATTR_VERITY merkle
> > tree block values and XOR i_ino, merkle_pos, and the fs uuid into the
> > first 32 bytes.
> 
> That does sound much better than wasting the space or crazy compression
> schemes.

It turns out that merkle tree blocks are nearly "random" data, which
means that my experiments with creating fsverity files and trying to
compress merkle tree blocks actually made it *bigger*.  No zlib for us!

--D
Christoph Hellwig March 21, 2024, 10:02 p.m. UTC | #12
On Thu, Mar 21, 2024 at 02:39:03PM -0700, Darrick J. Wong wrote:
> 
> It turns out that merkle tree blocks are nearly "random" data, which
> means that my experiments with creating fsverity files and trying to
> compress merkle tree blocks actually made it *bigger*.  No zlib for us!

For a cryptographically secure hash, not having high entropy would be a
bad thing..
Pankaj Raghav (Samsung) March 22, 2024, 8:02 a.m. UTC | #13
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
>  {
>  	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
> -		/* A single page buffer is always mappable */
> +		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
> @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
>  	int		*count,
>  	blk_opf_t	op)
>  {
> -	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_folio_count;
> -	int		nr_pages;
> +	int		folio_index;
> +	unsigned int	total_nr_folios = bp->b_folio_count;
> +	int		nr_folios;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
>  	int		offset;
>  
> -	/* skip the pages in the buffer before the start offset */
> -	page_index = 0;
> +	/*
> +	 * If the start offset if larger than a single page, we need to be
> +	 * careful. We might have a high order folio, in which case the indexing
> +	 * is from the start of the buffer. However, if we have more than one
> +	 * folio single page folio in the buffer, we need to skip the folios in
s/folio single page folio/single page folio/

> +	 * the buffer before the start offset.
> +	 */
> +	folio_index = 0;
>  	offset = *buf_offset;
> -	while (offset >= PAGE_SIZE) {
> -		page_index++;
> -		offset -= PAGE_SIZE;
> +	if (bp->b_folio_count > 1) {
> +		while (offset >= PAGE_SIZE) {
> +			folio_index++;
> +			offset -= PAGE_SIZE;

Can this be:
folio_index = offset >> PAGE_SHIFT;
offset = offset_in_page(offset);

instead of a loop?

> +		}
>  	}
>  
>  	/*

--
Pankaj
Dave Chinner March 22, 2024, 10:04 p.m. UTC | #14
On Fri, Mar 22, 2024 at 09:02:31AM +0100, Pankaj Raghav (Samsung) wrote:
> >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
> >  {
> >  	ASSERT(bp->b_flags & _XBF_FOLIOS);
> >  	if (bp->b_folio_count == 1) {
> > -		/* A single page buffer is always mappable */
> > +		/* A single folio buffer is always mappable */
> >  		bp->b_addr = folio_address(bp->b_folios[0]);
> >  	} else if (flags & XBF_UNMAPPED) {
> >  		bp->b_addr = NULL;
> > @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
> >  	int		*count,
> >  	blk_opf_t	op)
> >  {
> > -	int		page_index;
> > -	unsigned int	total_nr_pages = bp->b_folio_count;
> > -	int		nr_pages;
> > +	int		folio_index;
> > +	unsigned int	total_nr_folios = bp->b_folio_count;
> > +	int		nr_folios;
> >  	struct bio	*bio;
> >  	sector_t	sector =  bp->b_maps[map].bm_bn;
> >  	int		size;
> >  	int		offset;
> >  
> > -	/* skip the pages in the buffer before the start offset */
> > -	page_index = 0;
> > +	/*
> > +	 * If the start offset if larger than a single page, we need to be
> > +	 * careful. We might have a high order folio, in which case the indexing
> > +	 * is from the start of the buffer. However, if we have more than one
> > +	 * folio single page folio in the buffer, we need to skip the folios in
> s/folio single page folio/single page folio/
> 
> > +	 * the buffer before the start offset.
> > +	 */
> > +	folio_index = 0;
> >  	offset = *buf_offset;
> > -	while (offset >= PAGE_SIZE) {
> > -		page_index++;
> > -		offset -= PAGE_SIZE;
> > +	if (bp->b_folio_count > 1) {
> > +		while (offset >= PAGE_SIZE) {
> > +			folio_index++;
> > +			offset -= PAGE_SIZE;
> 
> Can this be:
> folio_index = offset >> PAGE_SHIFT;
> offset = offset_in_page(offset);
> 
> instead of a loop?

It could, but I'm not going to change it or even bother to fix the
comment as this code goes away later in the patch set.

See "[PATCH 7/9] xfs: walk b_addr for buffer I/O" later in the
series - once we can rely on bp->b_addr always being set for buffers
we convert this code to a simpler and more efficient folio based
iteration that is not reliant on pages or PAGE_SIZE at all.

-Dave.
Pankaj Raghav (Samsung) March 25, 2024, 11:17 a.m. UTC | #15
On Sat, Mar 23, 2024 at 09:04:51AM +1100, Dave Chinner wrote:
> On Fri, Mar 22, 2024 at 09:02:31AM +0100, Pankaj Raghav (Samsung) wrote:
> > >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > > @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
> > >  {
> > >  	ASSERT(bp->b_flags & _XBF_FOLIOS);
> > >  	if (bp->b_folio_count == 1) {
> > > -		/* A single page buffer is always mappable */
> > > +		/* A single folio buffer is always mappable */
> > >  		bp->b_addr = folio_address(bp->b_folios[0]);
> > >  	} else if (flags & XBF_UNMAPPED) {
> > >  		bp->b_addr = NULL;
> > > @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
> > >  	int		*count,
> > >  	blk_opf_t	op)
> > >  {
> > > -	int		page_index;
> > > -	unsigned int	total_nr_pages = bp->b_folio_count;
> > > -	int		nr_pages;
> > > +	int		folio_index;
> > > +	unsigned int	total_nr_folios = bp->b_folio_count;
> > > +	int		nr_folios;
> > >  	struct bio	*bio;
> > >  	sector_t	sector =  bp->b_maps[map].bm_bn;
> > >  	int		size;
> > >  	int		offset;
> > >  
> > > -	/* skip the pages in the buffer before the start offset */
> > > -	page_index = 0;
> > > +	/*
> > > +	 * If the start offset if larger than a single page, we need to be
> > > +	 * careful. We might have a high order folio, in which case the indexing
> > > +	 * is from the start of the buffer. However, if we have more than one
> > > +	 * folio single page folio in the buffer, we need to skip the folios in
> > s/folio single page folio/single page folio/
> > 
> > > +	 * the buffer before the start offset.
> > > +	 */
> > > +	folio_index = 0;
> > >  	offset = *buf_offset;
> > > -	while (offset >= PAGE_SIZE) {
> > > -		page_index++;
> > > -		offset -= PAGE_SIZE;
> > > +	if (bp->b_folio_count > 1) {
> > > +		while (offset >= PAGE_SIZE) {
> > > +			folio_index++;
> > > +			offset -= PAGE_SIZE;
> > 
> > Can this be:
> > folio_index = offset >> PAGE_SHIFT;
> > offset = offset_in_page(offset);
> > 
> > instead of a loop?
> 
> It could, but I'm not going to change it or even bother to fix the
> comment as this code goes away later in the patch set.
> 
> See "[PATCH 7/9] xfs: walk b_addr for buffer I/O" later in the
> series - once we can rely on bp->b_addr always being set for buffers
> we convert this code to a simpler and more efficient folio based
> iteration that is not reliant on pages or PAGE_SIZE at all.
> 

Ah, I missed seeing the whole series before replying to individual patches.
Nvm. 

Thanks for the pointer on patch 7, and I agree that it does not make 
sense to change this patch as it goes away later.

--
Pankaj

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 832226385154..7d9303497763 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,6 +80,10 @@  xfs_buf_is_vmapped(
 	return bp->b_addr && bp->b_folio_count > 1;
 }
 
+/*
+ * See comment above xfs_buf_alloc_folios() about the constraints placed on
+ * allocating vmapped buffers.
+ */
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
@@ -352,14 +356,63 @@  xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_offset = offset_in_page(bp->b_addr);
 	bp->b_folios = bp->b_folio_array;
 	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
+/*
+ * Allocating a high order folio makes the assumption that buffers are a
+ * power-of-2 size so that ilog2() returns the exact order needed to fit
+ * the contents of the buffer. Buffer lengths are mostly a power of two,
+ * so this is not an unreasonable approach to take by default.
+ *
+ * The exception here are user xattr data buffers, which can be arbitrarily
+ * sized up to 64kB plus structure metadata. In that case, round up the order.
+ */
+static bool
+xfs_buf_alloc_folio(
+	struct xfs_buf	*bp,
+	gfp_t		gfp_mask)
+{
+	int		length = BBTOB(bp->b_length);
+	int		order = get_order(length);
+
+	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
+	if (!bp->b_folio_array[0])
+		return false;
+
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folio_count = 1;
+	bp->b_flags |= _XBF_FOLIOS;
+	return true;
+}
+
+/*
+ * When we allocate folios for a buffer, we end up with one of two types of
+ * buffer.
+ *
+ * The first type is a single folio buffer - this may be a high order
+ * folio or just a single page sized folio, but either way they get treated the
+ * same way by the rest of the code - the buffer memory spans a single
+ * contiguous memory region that we don't have to map and unmap to access the
+ * data directly.
+ *
+ * The second type of buffer is the multi-folio buffer. These are *always* made
+ * up of single page folios so that they can be fed to vmap_ram() to return a
+ * contiguous memory region we can access the data through, or mark it as
+ * XBF_UNMAPPED and access the data directly through individual folio_address()
+ * calls.
+ *
+ * We don't use high order folios for this second type of buffer (yet) because
+ * having variable size folios makes offset-to-folio indexing and iteration of
+ * the data range more complex than if they are fixed size. This case should now
+ * be the slow path, though, so unless we regularly fail to allocate high order
+ * folios, there should be little need to optimise this path.
+ */
 static int
 xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
@@ -371,7 +424,15 @@  xfs_buf_alloc_folios(
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 
-	/* Make sure that we have a page list */
+	/* Assure zeroed buffer for non-read cases. */
+	if (!(flags & XBF_READ))
+		gfp_mask |= __GFP_ZERO;
+
+	/* Optimistically attempt a single high order folio allocation. */
+	if (xfs_buf_alloc_folio(bp, gfp_mask))
+		return 0;
+
+	/* Fall back to allocating an array of single page folios. */
 	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 	if (bp->b_folio_count <= XB_FOLIOS) {
 		bp->b_folios = bp->b_folio_array;
@@ -383,9 +444,6 @@  xfs_buf_alloc_folios(
 	}
 	bp->b_flags |= _XBF_FOLIOS;
 
-	/* Assure zeroed buffer for non-read cases. */
-	if (!(flags & XBF_READ))
-		gfp_mask |= __GFP_ZERO;
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
@@ -426,7 +484,7 @@  _xfs_buf_map_folios(
 {
 	ASSERT(bp->b_flags & _XBF_FOLIOS);
 	if (bp->b_folio_count == 1) {
-		/* A single page buffer is always mappable */
+		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
@@ -1525,20 +1583,28 @@  xfs_buf_ioapply_map(
 	int		*count,
 	blk_opf_t	op)
 {
-	int		page_index;
-	unsigned int	total_nr_pages = bp->b_folio_count;
-	int		nr_pages;
+	int		folio_index;
+	unsigned int	total_nr_folios = bp->b_folio_count;
+	int		nr_folios;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
 	int		size;
 	int		offset;
 
-	/* skip the pages in the buffer before the start offset */
-	page_index = 0;
+	/*
+	 * If the start offset if larger than a single page, we need to be
+	 * careful. We might have a high order folio, in which case the indexing
+	 * is from the start of the buffer. However, if we have more than one
+	 * folio single page folio in the buffer, we need to skip the folios in
+	 * the buffer before the start offset.
+	 */
+	folio_index = 0;
 	offset = *buf_offset;
-	while (offset >= PAGE_SIZE) {
-		page_index++;
-		offset -= PAGE_SIZE;
+	if (bp->b_folio_count > 1) {
+		while (offset >= PAGE_SIZE) {
+			folio_index++;
+			offset -= PAGE_SIZE;
+		}
 	}
 
 	/*
@@ -1551,28 +1617,28 @@  xfs_buf_ioapply_map(
 
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
-	nr_pages = bio_max_segs(total_nr_pages);
+	nr_folios = bio_max_segs(total_nr_folios);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
+	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
-	for (; size && nr_pages; nr_pages--, page_index++) {
-		int	rbytes, nbytes = PAGE_SIZE - offset;
+	for (; size && nr_folios; nr_folios--, folio_index++) {
+		struct folio	*folio = bp->b_folios[folio_index];
+		int		nbytes = folio_size(folio) - offset;
 
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
-				      offset);
-		if (rbytes < nbytes)
+		if (!bio_add_folio(bio, folio, nbytes,
+				offset_in_folio(folio, offset)))
 			break;
 
 		offset = 0;
 		sector += BTOBB(nbytes);
 		size -= nbytes;
-		total_nr_pages--;
+		total_nr_folios--;
 	}
 
 	if (likely(bio->bi_iter.bi_size)) {
@@ -1788,6 +1854,13 @@  xfs_buf_offset(
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
+	/* Single folio buffers may use large folios. */
+	if (bp->b_folio_count == 1) {
+		folio = bp->b_folios[0];
+		return folio_address(folio) + offset_in_folio(folio, offset);
+	}
+
+	/* Multi-folio buffers always use PAGE_SIZE folios */
 	folio = bp->b_folios[offset >> PAGE_SHIFT];
 	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
@@ -1803,18 +1876,24 @@  xfs_buf_zero(
 	bend = boff + bsize;
 	while (boff < bend) {
 		struct folio	*folio;
-		int		page_index, page_offset, csize;
+		int		folio_index, folio_offset, csize;
 
-		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
-		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		folio = bp->b_folios[page_index];
-		csize = min_t(size_t, PAGE_SIZE - page_offset,
+		/* Single folio buffers may use large folios. */
+		if (bp->b_folio_count == 1) {
+			folio = bp->b_folios[0];
+			folio_offset = offset_in_folio(folio,
+						bp->b_offset + boff);
+		} else {
+			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
+			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
+			folio = bp->b_folios[folio_index];
+		}
+
+		csize = min_t(size_t, folio_size(folio) - folio_offset,
 				      BBTOB(bp->b_length) - boff);
+		ASSERT((csize + folio_offset) <= folio_size(folio));
 
-		ASSERT((csize + page_offset) <= PAGE_SIZE);
-
-		memset(folio_address(folio) + page_offset, 0, csize);
-
+		memset(folio_address(folio) + folio_offset, 0, csize);
 		boff += csize;
 	}
 }