diff mbox series

[v5,11/24] xfs: add XBF_VERITY_SEEN xfs_buf flag

Message ID 20240304191046.157464-13-aalbersh@redhat.com (mailing list archive)
State New
Headers show
Series fs-verity support for XFS | expand

Commit Message

Andrey Albershteyn March 4, 2024, 7:10 p.m. UTC
One of the essential ideas of fs-verity is that pages which have
already been verified do not need to be re-verified while they remain
in the page cache.

XFS will store Merkle tree blocks in extended file attributes. When an
extended attribute is read, its data is placed into an xfs_buf.

fs-verity uses the PG_checked flag to track the status of the blocks
in a page. This flag can have two meanings: the page was
re-instantiated, and the only block in the page is verified.

However, in XFS, the data in the buffer is not aligned with the
xfs_buf pages and we do not hold a reference to these pages. Moreover,
these pages are released when the value is copied out in the xfs_attr
code. In other words, we cannot directly mark the underlying xfs_buf
pages as verified, the way fs-verity does for other filesystems.

One way to track that these pages were processed by fs-verity is to
mark the buffer as verified instead. If the buffer is evicted, the
incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
xfs_attr_get() returns a new buffer without the flag. The xfs_buf flag
is thus used to tell fs-verity whether this buffer was already cached
or not.

The second state indicated by PG_checked is whether the only block in
the page is verified. This is not the case for XFS, as there can be
multiple blocks in a single buffer (e.g. a 64k page size with a 4k
Merkle tree block size). This case is handled by the fs-verity bitmap;
fs-verity always uses the bitmap for XFS regardless of the Merkle tree
block size.

In short, the flag means that the value of the extended attribute in
the buffer has been processed by fs-verity.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_buf.h | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Comments

Darrick J. Wong March 7, 2024, 10:46 p.m. UTC | #1
On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> [...]
>
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 73249abca968..2a73918193ba 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -24,14 +24,15 @@ struct xfs_buf;
>  
>  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
>  
> -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */

Yuck.  I still dislike this entire approach.

XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
gets loaded on behalf of a merkle tree request, then uses the extra
space to shadow the contents of the ondisk block.  AFAICT the shadow
doesn't get updated even if the cached data does, which sounds like a
landmine for coherency issues.

XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
exposing the low level buffering details of the xattr code to
xfs_attr_get callers.

XBF_VERITY_SEEN is a layering violation because now the overall buffer
cache can track file metadata state.  I think the reason why you need
this state flag is because the datadev buffer target indexes based on
physical xfs_daddr_t, whereas merkle tree blocks have their own internal
block numbers.  You can't directly go from the merkle block number to an
xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
fell out of memory.

ISTR asking for a separation of these indices when I reviewed a
previous version of this patchset.  At the time it seemed to me that a
much more efficient way to cache the merkle tree blocks would be to set
up a direct (merkle tree block number) -> (blob of data) lookup table.
That I don't see here.

In the spirit of the recent collaboration style that I've taken with
Christoph, I pulled your branch and started appending patches to it to
see if the design that I'm asking for is reasonable.  As it so happens,
I was working on a simplified version of the xfs buffer cache ("fsbuf")
that could be used by simple filesystems to get them off of buffer
heads.

(Ab)using the fsbuf code did indeed work (and passed all the fstests -g
verity tests), so now I know the idea is reasonable.  Patches 11, 12,
14, and 15 become unnecessary.  However, this solution is itself grossly
overengineered, since all we want are the following operations:

peek(key): returns an fsbuf if there's any data cached for key

get(key): returns an fsbuf for key, regardless of state

store(fsbuf, p): attach a memory buffer p to fsbuf

Then the xfs ->read_merkle_tree_block function becomes:

	bp = peek(key)
	if (bp)
		/* return bp data up to verity */

	p = xfs_attr_get(key)
	if (!p)
		/* error */

	bp = get(key)
	store(bp, p)

	/* return bp data up to verity */

*Fortunately* you've also rebased against the 6.9 for-next branch, which
means I think there's an elegant solution at hand.  The online repair
patches that are in that branch make it possible to create a standalone
xfs_buftarg.  We can attach one of these to an xfs_inode to cache merkle
tree blocks.  xfs_bufs have well known usage semantics, they're already
hooked up to shrinkers so that we can reclaim the data if memory is
tight, and they can clean up a lot of code.

Could you take a look at my branch here:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-07

And tell me what you think?

--D

>  
>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1u << 16)/* inode buffer */
> @@ -65,6 +66,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> +	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
> -- 
> 2.42.0
> 
>
Dave Chinner March 8, 2024, 1:59 a.m. UTC | #2
On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > [...]
>
> (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> 14, and 15 become unnecessary.  However, this solution is itself grossly
> overengineered, since all we want are the following operations:
> 
> peek(key): returns an fsbuf if there's any data cached for key
> 
> get(key): returns an fsbuf for key, regardless of state
> 
> store(fsbuf, p): attach a memory buffer p to fsbuf
> 
> Then the xfs ->read_merkle_tree_block function becomes:
> 
> 	bp = peek(key)
> 	if (bp)
> 		/* return bp data up to verity */
> 
> 	p = xfs_attr_get(key)
> 	if (!p)
> 		/* error */
> 
> 	bp = get(key)
> 	store(bp, p)

Ok, that looks good - it definitely gets rid of a lot of the
nastiness, but I have to ask: why does it need to be based on
xfs_bufs?  That's just wasting 300 bytes of memory on a handle to
store a key and an opaque blob in a rhashtable.

IIUC, the key here is a sequential index, so an xarray would be a
much better choice as it doesn't require internal storage of the
key.

i.e.

	p = xa_load(key);
	if (p)
		return p;

	xfs_attr_get(key);
	if (!args->value)
		/* error */

	/*
	 * store the current value, freeing any old value that we
	 * replaced at this key. Don't care about failure to store,
	 * this is optimistic caching.
	 */
	p = xa_store(key, args->value, GFP_NOFS);
	if (p)
		kvfree(p);
	return args->value;

-Dave.
Darrick J. Wong March 8, 2024, 3:31 a.m. UTC | #3
On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > > One of essential ideas of fs-verity is that pages which are already
> > > verified won't need to be re-verified if they still in page cache.
> > > 
> > > XFS will store Merkle tree blocks in extended file attributes. When
> > > read extended attribute data is put into xfs_buf.
> > > 
> > > fs-verity uses PG_checked flag to track status of the blocks in the
> > > page. This flag can has two meanings - page was re-instantiated and
> > > the only block in the page is verified.
> > > 
> > > However, in XFS, the data in the buffer is not aligned with xfs_buf
> > > pages and we don't have a reference to these pages. Moreover, these
> > > pages are released when value is copied out in xfs_attr code. In
> > > other words, we can not directly mark underlying xfs_buf's pages as
> > > verified as it's done by fs-verity for other filesystems.
> > > 
> > > One way to track that these pages were processed by fs-verity is to
> > > mark buffer as verified instead. If buffer is evicted the incore
> > > XBF_VERITY_SEEN flag is lost. When the xattr is read again
> > > xfs_attr_get() returns new buffer without the flag. The xfs_buf's
> > > flag is then used to tell fs-verity this buffer was cached or not.
> > > 
> > > The second state indicated by PG_checked is if the only block in the
> > > PAGE is verified. This is not the case for XFS as there could be
> > > multiple blocks in single buffer (page size 64k block size 4k). This
> > > is handled by fs-verity bitmap. fs-verity is always uses bitmap for
> > > XFS despite of Merkle tree block size.
> > > 
> > > The meaning of the flag is that value of the extended attribute in
> > > the buffer is processed by fs-verity.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > ---
> > >  fs/xfs/xfs_buf.h | 18 ++++++++++--------
> > >  1 file changed, 10 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > index 73249abca968..2a73918193ba 100644
> > > --- a/fs/xfs/xfs_buf.h
> > > +++ b/fs/xfs/xfs_buf.h
> > > @@ -24,14 +24,15 @@ struct xfs_buf;
> > >  
> > >  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
> > >  
> > > -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> > > -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> > > -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> > > -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> > > -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> > > -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> > > -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> > > +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> > > +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> > > +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> > > +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> > > +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> > > +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> > > +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> > > +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
> > 
> > Yuck.  I still dislike this entire approach.
> > 
> > XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
> > gets loaded on behalf of a merkle tree request, then uses the extra
> > space to shadow the contents of the ondisk block.  AFAICT the shadow
> > doesn't get updated even if the cached data does, which sounds like a
> > landmine for coherency issues.
> >
> > XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
> > exposing the low level buffering details of the xattr code to
> > xfs_attr_get callers.
> > 
> > XBF_VERITY_SEEN is a layering violation because now the overall buffer
> > cache can track file metadata state.  I think the reason why you need
> > this state flag is because the datadev buffer target indexes based on
> > physical xfs_daddr_t, whereas merkle tree blocks have their own internal
> > block numbers.  You can't directly go from the merkle block number to an
> > xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
> > fell out of memory.
> >
> > ISTR asking for a separation of these indices when I reviewed some
> > previous version of this patchset.  At the time it seems to me that a
> > much more efficient way to cache the merkle tree blocks would be to set
> > up a direct (merkle tree block number) -> (blob of data) lookup table.
> > That I don't see here.
> >
> > In the spirit of the recent collaboration style that I've taken with
> > Christoph, I pulled your branch and started appending patches to it to
> > see if the design that I'm asking for is reasonable.  As it so happens,
> > I was working on a simplified version of the xfs buffer cache ("fsbuf")
> > that could be used by simple filesystems to get them off of buffer
> > heads.
> > 
> > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > overengineered, since all we want are the following operations:
> > 
> > peek(key): returns an fsbuf if there's any data cached for key
> > 
> > get(key): returns an fsbuf for key, regardless of state
> > 
> > store(fsbuf, p): attach a memory buffer p to fsbuf
> > 
> > Then the xfs ->read_merkle_tree_block function becomes:
> > 
> > 	bp = peek(key)
> > 	if (bp)
> > 		/* return bp data up to verity */
> > 
> > 	p = xfs_attr_get(key)
> > 	if (!p)
> > 		/* error */
> > 
> > 	bp = get(key)
> > 	store(bp, p)
> 
> Ok, that looks good - it definitely gets rid of a lot of the
> nastiness, but I have to ask: why does it need to be based on
> xfs_bufs?

(copying from IRC) It was still warm in my brain L2 after all the xfile
buftarg cleaning and merging that just got done a few weeks ago.   So I
went with the simplest thing I could rig up to test my ideas, and now
we're at the madly iterate until exhaustion stage. ;)

>            That's just wasting 300 bytes of memory on a handle to
> store a key and an opaque blob in a rhashtable.

Yep.  The fsbufs implementation was a lot more slender, but a bunch more
code.  I agree that I ought to go look at xarrays or something that's
more of a direct mapping as a next step.  However, I wanted to get
Andrey's feedback on this general approach first.

> IIUC, the key here is a sequential index, so an xarray would be a
> much better choice as it doesn't require internal storage of the
> key.

I wonder, what are the access patterns for merkle blobs?  Is it actually
sequential, or is it more like 0 -> N -> N*N as we walk towards leaves?

Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
that done because merkle trees may some day have more than 2^32 blocks
in them?  That won't play well with things like xarrays on 32-bit
machines.

(Granted we've been talking about deprecating XFS on 32-bit for a while
now but we're not the whole world)

> i.e.
> 
> 	p = xa_load(key);
> 	if (p)
> 		return p;
> 
> 	xfs_attr_get(key);
> 	if (!args->value)
> 		/* error */
> 
> 	/*
> 	 * store the current value, freeing any old value that we
> 	 * replaced at this key. Don't care about failure to store,
> 	 * this is optimistic caching.
> 	 */
> 	p = xa_store(key, args->value, GFP_NOFS);
> 	if (p)
> 		kvfree(p);
> 	return args->value;

Attractive.  Will have to take a look at that tomorrow.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong March 9, 2024, 4:28 p.m. UTC | #4
On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > > > One of essential ideas of fs-verity is that pages which are already
> > > > verified won't need to be re-verified if they still in page cache.
> > > > 
> > > > XFS will store Merkle tree blocks in extended file attributes. When
> > > > read extended attribute data is put into xfs_buf.
> > > > 
> > > > fs-verity uses PG_checked flag to track status of the blocks in the
> > > > page. This flag can has two meanings - page was re-instantiated and
> > > > the only block in the page is verified.
> > > > 
> > > > However, in XFS, the data in the buffer is not aligned with xfs_buf
> > > > pages and we don't have a reference to these pages. Moreover, these
> > > > pages are released when value is copied out in xfs_attr code. In
> > > > other words, we can not directly mark underlying xfs_buf's pages as
> > > > verified as it's done by fs-verity for other filesystems.
> > > > 
> > > > One way to track that these pages were processed by fs-verity is to
> > > > mark buffer as verified instead. If buffer is evicted the incore
> > > > XBF_VERITY_SEEN flag is lost. When the xattr is read again
> > > > xfs_attr_get() returns new buffer without the flag. The xfs_buf's
> > > > flag is then used to tell fs-verity this buffer was cached or not.
> > > > 
> > > > The second state indicated by PG_checked is if the only block in the
> > > > PAGE is verified. This is not the case for XFS as there could be
> > > > multiple blocks in single buffer (page size 64k block size 4k). This
> > > > is handled by fs-verity bitmap. fs-verity is always uses bitmap for
> > > > XFS despite of Merkle tree block size.
> > > > 
> > > > The meaning of the flag is that value of the extended attribute in
> > > > the buffer is processed by fs-verity.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_buf.h | 18 ++++++++++--------
> > > >  1 file changed, 10 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > > index 73249abca968..2a73918193ba 100644
> > > > --- a/fs/xfs/xfs_buf.h
> > > > +++ b/fs/xfs/xfs_buf.h
> > > > @@ -24,14 +24,15 @@ struct xfs_buf;
> > > >  
> > > >  #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
> > > >  
> > > > -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> > > > -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> > > > -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> > > > -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > > -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> > > > -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> > > > -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> > > > -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> > > > +#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
> > > > +#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
> > > > +#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
> > > > +#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> > > > +#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
> > > > +#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
> > > > +#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
> > > > +#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
> > > > +#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
> > > 
> > > Yuck.  I still dislike this entire approach.
> > > 
> > > XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
> > > gets loaded on behalf of a merkle tree request, then uses the extra
> > > space to shadow the contents of the ondisk block.  AFAICT the shadow
> > > doesn't get updated even if the cached data does, which sounds like a
> > > landmine for coherency issues.
> > >
> > > XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
> > > exposing the low level buffering details of the xattr code to
> > > xfs_attr_get callers.
> > > 
> > > XBF_VERITY_SEEN is a layering violation because now the overall buffer
> > > cache can track file metadata state.  I think the reason why you need
> > > this state flag is because the datadev buffer target indexes based on
> > > physical xfs_daddr_t, whereas merkle tree blocks have their own internal
> > > block numbers.  You can't directly go from the merkle block number to an
> > > xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
> > > fell out of memory.
> > >
> > > ISTR asking for a separation of these indices when I reviewed some
> > > previous version of this patchset.  At the time it seems to me that a
> > > much more efficient way to cache the merkle tree blocks would be to set
> > > up a direct (merkle tree block number) -> (blob of data) lookup table.
> > > That I don't see here.
> > >
> > > In the spirit of the recent collaboration style that I've taken with
> > > Christoph, I pulled your branch and started appending patches to it to
> > > see if the design that I'm asking for is reasonable.  As it so happens,
> > > I was working on a simplified version of the xfs buffer cache ("fsbuf")
> > > that could be used by simple filesystems to get them off of buffer
> > > heads.
> > > 
> > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > overengineered, since all we want are the following operations:
> > > 
> > > peek(key): returns an fsbuf if there's any data cached for key
> > > 
> > > get(key): returns an fsbuf for key, regardless of state
> > > 
> > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > 
> > > Then the xfs ->read_merkle_tree_block function becomes:
> > > 
> > > 	bp = peek(key)
> > > 	if (bp)
> > > 		/* return bp data up to verity */
> > > 
> > > 	p = xfs_attr_get(key)
> > > 	if (!p)
> > > 		/* error */
> > > 
> > > 	bp = get(key)
> > > 	store(bp, p)
> > 
> > Ok, that looks good - it definitely gets rid of a lot of the
> > nastiness, but I have to ask: why does it need to be based on
> > xfs_bufs?
> 
> (copying from IRC) It was still warm in my brain L2 after all the xfile
> buftarg cleaning and merging that just got done a few weeks ago.   So I
> went with the simplest thing I could rig up to test my ideas, and now
> we're at the madly iterate until exhaustion stage. ;)
> 
> >            That's just wasting 300 bytes of memory on a handle to
> > store a key and a opaque blob in a rhashtable.
> 
> Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> code.  I agree that I ought to go look at xarrays or something that's
> more of a direct mapping as a next step.  However, i wanted to get
> Andrey's feedback on this general approach first.
> 
> > IIUC, the key here is a sequential index, so an xarray would be a
> > much better choice as it doesn't require internal storage of the
> > key.
> 
> I wonder, what are the access patterns for merkle blobs?  Is it actually
> sequential, or is more like 0 -> N -> N*N as we walk towards leaves?
> 
> Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> that done because merkle trees may some day have more than 2^32 blocks
> in them?  That won't play well with things like xarrays on 32-bit
> machines.
> 
> (Granted we've been talking about deprecating XFS on 32-bit for a while
> now but we're not the whole world)
> 
> > i.e.
> > 
> > 	p = xa_load(key);
> > 	if (p)
> > 		return p;
> > 
> > 	xfs_attr_get(key);
> > 	if (!args->value)
> > 		/* error */
> > 
> > 	/*
> > 	 * store the current value, freeing any old value that we
> > 	 * replaced at this key. Don't care about failure to store,
> > 	 * this is optimistic caching.
> > 	 */
> > 	p = xa_store(key, args->value, GFP_NOFS);
> > 	if (p)
> > 		kvfree(p);
> > 	return args->value;
> 
> Attractive.  Will have to take a look at that tomorrow.

Done.  I think.  Not sure that I actually got all the interactions
between the shrinker and the xarray correct though.  KASAN and lockdep
don't have any complaints running fstests, so that's a start.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

--D

> 
> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
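The xa_load()/xa_store() pattern Dave sketches in the quoted exchange above can be modeled in plain userspace C. This is an illustrative sketch only: slots[], slow_load(), and cache_get() are hypothetical stand-ins for the per-inode xarray and the xfs_attr_get() path, not real XFS or xarray interfaces.

```c
#include <stdlib.h>

#define NSLOTS 128

/* slot table stands in for the per-inode xarray */
static void *slots[NSLOTS];

/* stand-in for xfs_attr_get(): "fetch" a merkle block the slow way.
 * Here we just allocate a buffer stamped with the key. */
static void *slow_load(unsigned long key)
{
	unsigned long *p = malloc(sizeof(*p));

	if (p)
		*p = key;
	return p;
}

static void *cache_get(unsigned long key)
{
	void *p = slots[key % NSLOTS];	/* xa_load() */
	void *old;

	if (p)
		return p;		/* cache hit, no xattr I/O */

	p = slow_load(key);		/* xfs_attr_get() path */
	if (!p)
		return NULL;		/* error */

	/*
	 * xa_store(): publish the value and free whatever we displaced
	 * (under concurrency another thread may have raced us here).
	 * Failure to cache would be harmless - this is optimistic caching.
	 */
	old = slots[key % NSLOTS];
	slots[key % NSLOTS] = p;
	free(old);
	return p;
}
```

The point of the shape is that a cache miss never fails the read: the slow path result is returned regardless of whether the store stuck.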
Dave Chinner March 11, 2024, 12:26 a.m. UTC | #5
On Sat, Mar 09, 2024 at 08:28:28AM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > > overengineered, since all we want are the following operations:
> > > > 
> > > > peek(key): returns an fsbuf if there's any data cached for key
> > > > 
> > > > get(key): returns an fsbuf for key, regardless of state
> > > > 
> > > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > > 
> > > > Then the xfs ->read_merkle_tree_block function becomes:
> > > > 
> > > > 	bp = peek(key)
> > > > 	if (bp)
> > > > 		/* return bp data up to verity */
> > > > 
> > > > 	p = xfs_attr_get(key)
> > > > 	if (!p)
> > > > 		/* error */
> > > > 
> > > > 	bp = get(key)
> > > > 	store(bp, p)
> > > 
> > > Ok, that looks good - it definitely gets rid of a lot of the
> > > nastiness, but I have to ask: why does it need to be based on
> > > xfs_bufs?
> > 
> > (copying from IRC) It was still warm in my brain L2 after all the xfile
> > buftarg cleaning and merging that just got done a few weeks ago.   So I
> > went with the simplest thing I could rig up to test my ideas, and now
> > we're at the madly iterate until exhaustion stage. ;)
> > 
> > >            That's just wasting 300 bytes of memory on a handle to
> > > store a key and a opaque blob in a rhashtable.
> > 
> > Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> > code.  I agree that I ought to go look at xarrays or something that's
> > more of a direct mapping as a next step.  However, i wanted to get
> > Andrey's feedback on this general approach first.
> > 
> > > IIUC, the key here is a sequential index, so an xarray would be a
> > > much better choice as it doesn't require internal storage of the
> > > key.
> > 
> > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > sequential, or is more like 0 -> N -> N*N as we walk towards leaves?

I think the leaf level (i.e. individual record) access patterns
largely match data access patterns, so I'd just treat it as if
it's a normal file being accessed....

> > Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> > that done because merkle trees may some day have more than 2^32 blocks
> > in them?  That won't play well with things like xarrays on 32-bit
> > machines.
> > 
> > (Granted we've been talking about deprecating XFS on 32-bit for a while
> > now but we're not the whole world)
> > 
> > > i.e.
> > > 
> > > 	p = xa_load(key);
> > > 	if (p)
> > > 		return p;
> > > 
> > > 	xfs_attr_get(key);
> > > 	if (!args->value)
> > > 		/* error */
> > > 
> > > 	/*
> > > 	 * store the current value, freeing any old value that we
> > > 	 * replaced at this key. Don't care about failure to store,
> > > 	 * this is optimistic caching.
> > > 	 */
> > > 	p = xa_store(key, args->value, GFP_NOFS);
> > > 	if (p)
> > > 		kvfree(p);
> > > 	return args->value;
> > 
> > Attractive.  Will have to take a look at that tomorrow.
> 
> Done.  I think.  Not sure that I actually got all the interactions
> between the shrinker and the xarray correct though.  KASAN and lockdep
> don't have any complaints running fstests, so that's a start.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

My initial impression is "over-engineered".

I personally would have just allocated the xattr value buffer with a
little extra size and added all the external cache information (a
reference counter is all we need as these are fixed sized blocks) to
the tail of the blob we actually pass to fsverity. If we tag the
inode in the radix tree as having verity blobs that can be freed, we
can then just extend the existing fs superblock shrinker callout to
also walk all the verity inodes with cached data to try to reclaim
some objects...

But, if a generic blob cache is what it takes to move this forwards,
so be it.

-Dave.
Darrick J. Wong March 11, 2024, 3:25 p.m. UTC | #6
On Mon, Mar 11, 2024 at 11:26:24AM +1100, Dave Chinner wrote:
> On Sat, Mar 09, 2024 at 08:28:28AM -0800, Darrick J. Wong wrote:
> > On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> > > On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > > > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g
> > > > > verity tests), so now I know the idea is reasonable.  Patches 11, 12,
> > > > > 14, and 15 become unnecessary.  However, this solution is itself grossly
> > > > > overengineered, since all we want are the following operations:
> > > > > 
> > > > > peek(key): returns an fsbuf if there's any data cached for key
> > > > > 
> > > > > get(key): returns an fsbuf for key, regardless of state
> > > > > 
> > > > > store(fsbuf, p): attach a memory buffer p to fsbuf
> > > > > 
> > > > > Then the xfs ->read_merkle_tree_block function becomes:
> > > > > 
> > > > > 	bp = peek(key)
> > > > > 	if (bp)
> > > > > 		/* return bp data up to verity */
> > > > > 
> > > > > 	p = xfs_attr_get(key)
> > > > > 	if (!p)
> > > > > 		/* error */
> > > > > 
> > > > > 	bp = get(key)
> > > > > 	store(bp, p)
> > > > 
> > > > Ok, that looks good - it definitely gets rid of a lot of the
> > > > nastiness, but I have to ask: why does it need to be based on
> > > > xfs_bufs?
> > > 
> > > (copying from IRC) It was still warm in my brain L2 after all the xfile
> > > buftarg cleaning and merging that just got done a few weeks ago.   So I
> > > went with the simplest thing I could rig up to test my ideas, and now
> > > we're at the madly iterate until exhaustion stage. ;)
> > > 
> > > >            That's just wasting 300 bytes of memory on a handle to
> > > > store a key and a opaque blob in a rhashtable.
> > > 
> > > Yep.  The fsbufs implementation was a lot more slender, but a bunch more
> > > code.  I agree that I ought to go look at xarrays or something that's
> > > more of a direct mapping as a next step.  However, I wanted to get
> > > Andrey's feedback on this general approach first.
> > > 
> > > > IIUC, the key here is a sequential index, so an xarray would be a
> > > > much better choice as it doesn't require internal storage of the
> > > > key.
> > > 
> > > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves?
> 
> I think the leaf level (i.e. individual record) access patterns
> largely match data access patterns, so I'd just treat it as if
> it's a normal file being accessed....

<nod> The latest version of this tries to avoid letting reclaim take the
top of the tree.  Logically this makes sense to me to reduce read verify
latency, but I was hoping Eric or Andrey or someone with more
familiarity with fsverity would chime in on whether or not that made
sense.

> > > Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> > > that done because merkle trees may some day have more than 2^32 blocks
> > > in them?  That won't play well with things like xarrays on 32-bit
> > > machines.
> > > 
> > > (Granted we've been talking about deprecating XFS on 32-bit for a while
> > > now but we're not the whole world)
> > > 
> > > > i.e.
> > > > 
> > > > 	p = xa_load(key);
> > > > 	if (p)
> > > > 		return p;
> > > > 
> > > > 	xfs_attr_get(key);
> > > > 	if (!args->value)
> > > > 		/* error */
> > > > 
> > > > 	/*
> > > > 	 * store the current value, freeing any old value that we
> > > > 	 * replaced at this key. Don't care about failure to store,
> > > > 	 * this is optimistic caching.
> > > > 	 */
> > > > 	p = xa_store(key, args->value, GFP_NOFS);
> > > > 	if (p)
> > > > 		kvfree(p);
> > > > 	return args->value;
> > > 
> > > Attractive.  Will have to take a look at that tomorrow.
> > 
> > Done.  I think.  Not sure that I actually got all the interactions
> > between the shrinker and the xarray correct though.  KASAN and lockdep
> > don't have any complaints running fstests, so that's a start.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09
> 
> My initial impression is "over-engineered".

Overly focused on unfamiliar data structures -- I've never built
anything with an xarray before.

> I personally would have just allocated the xattr value buffer with a
> little extra size and added all the external cache information (a
> reference counter is all we need as these are fixed sized blocks) to
> the tail of the blob we actually pass to fsverity.

Oho, that's a good idea.  I didn't like the separate allocation anyway.
Friday was mostly chatting with willy and trying to make sure I got the
xarray access patterns correct.

>                                                    If we tag the
> inode in the radix tree as having verity blobs that can be freed, we
> can then just extend the existing fs superblock shrinker callout to
> also walk all the verity inodes with cached data to try to reclaim
> some objects...

This too is a wonderful suggestion -- use the third radix tree tag to
mark inodes with extra incore caches that can be reclaimed, then teach
xfs_reclaim_inodes_{nr,count} to scan them.  Allocating a per-inode
shrinker was likely to cause problems with flooding debugfs with too
many knobs anyway.

> But, if a generic blob cache is what it takes to move this forwards,
> so be it.

Not necessarily. ;)

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Eric Biggers March 12, 2024, 2:43 a.m. UTC | #7
On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves?
> > 
> > I think the leaf level (i.e. individual record) access patterns
> > largely match data access patterns, so I'd just treat it as if
> > it's a normal file being accessed....
> 
> <nod> The latest version of this tries to avoid letting reclaim take the
> top of the tree.  Logically this makes sense to me to reduce read verify
> latency, but I was hoping Eric or Andrey or someone with more
> familiarity with fsverity would chime in on whether or not that made
> sense.

The levels are stored ordered from root to leaf; they aren't interleaved.  So
technically the overall access pattern isn't sequential, but it is sequential
within each level, and the leaf level is over 99% of the accesses anyway
(assuming the default parameters where each block has 128 children).

It makes sense to try to keep the top levels, i.e. the blocks with the lowest
indices, cached.  (The other filesystems that support fsverity currently aren't
doing anything special to treat them differently from other pagecache pages.)

- Eric
Darrick J. Wong March 12, 2024, 2:45 a.m. UTC | #8
On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:

<snip>

> <nod> The latest version of this tries to avoid letting reclaim take the
> top of the tree.  Logically this makes sense to me to reduce read verify
> latency, but I was hoping Eric or Andrey or someone with more
> familiarity with fsverity would chime in on whether or not that made
> sense.
> 
> > > > Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
> > > > that done because merkle trees may some day have more than 2^32 blocks
> > > > in them?  That won't play well with things like xarrays on 32-bit
> > > > machines.
> > > > 
> > > > (Granted we've been talking about deprecating XFS on 32-bit for a while
> > > > now but we're not the whole world)
> > > > 
> > > > > i.e.
> > > > > 
> > > > > 	p = xa_load(key);
> > > > > 	if (p)
> > > > > 		return p;
> > > > > 
> > > > > 	xfs_attr_get(key);
> > > > > 	if (!args->value)
> > > > > 		/* error */
> > > > > 
> > > > > 	/*
> > > > > 	 * store the current value, freeing any old value that we
> > > > > 	 * replaced at this key. Don't care about failure to store,
> > > > > 	 * this is optimistic caching.
> > > > > 	 */
> > > > > 	p = xa_store(key, args->value, GFP_NOFS);
> > > > > 	if (p)
> > > > > 		kvfree(p);
> > > > > 	return args->value;
> > > > 
> > > > Attractive.  Will have to take a look at that tomorrow.
> > > 
> > > Done.  I think.  Not sure that I actually got all the interactions
> > > between the shrinker and the xarray correct though.  KASAN and lockdep
> > > don't have any complaints running fstests, so that's a start.
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09
> > 
> > My initial impression is "over-engineered".
> 
> Overly focused on unfamiliar data structures -- I've never built
> anything with an xarray before.
> 
> > I personally would have just allocated the xattr value buffer with a
> > little extra size and added all the external cache information (a
> > reference counter is all we need as these are fixed sized blocks) to
> > the tail of the blob we actually pass to fsverity.
> 
> Oho, that's a good idea.  I didn't like the separate allocation anyway.
> Friday was mostly chatting with willy and trying to make sure I got the
> xarray access patterns correct.
> 
> >                                                    If we tag the
> > inode in the radix tree as having verity blobs that can be freed, we
> > can then just extend the existing fs superblock shrinker callout to
> > also walk all the verity inodes with cached data to try to reclaim
> > some objects...
> 
> This too is a wonderful suggestion -- use the third radix tree tag to
> mark inodes with extra incore caches that can be reclaimed, then teach
> xfs_reclaim_inodes_{nr,count} to scan them.  Allocating a per-inode
> shrinker was likely to cause problems with flooding debugfs with too
> many knobs anyway.
> 
> > But, if a generic blob cache is what it takes to move this forwards,
> > so be it.
> 
> Not necessarily. ;)

And here's today's branch, with xfs_blobcache.[ch] removed and a few
more cleanups:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11

--D

> 
> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
>
Darrick J. Wong March 12, 2024, 4:15 a.m. UTC | #9
On Mon, Mar 11, 2024 at 07:43:20PM -0700, Eric Biggers wrote:
> On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > > I wonder, what are the access patterns for merkle blobs?  Is it actually
> > > > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves?
> > > 
> > > I think the leaf level (i.e. individual record) access patterns
> > > largely match data access patterns, so I'd just treat it as if
> > > it's a normal file being accessed....
> > 
> > <nod> The latest version of this tries to avoid letting reclaim take the
> > top of the tree.  Logically this makes sense to me to reduce read verify
> > latency, but I was hoping Eric or Andrey or someone with more
> > familiarity with fsverity would chime in on whether or not that made
> > sense.
> 
> The levels are stored ordered from root to leaf; they aren't interleaved.  So
> technically the overall access pattern isn't sequential, but it is sequential
> within each level, and the leaf level is over 99% of the accesses anyway
> (assuming the default parameters where each block has 128 children).

<nod>

> It makes sense to try to keep the top levels, i.e. the blocks with the lowest
> indices, cached.  (The other filesystems that support fsverity currently aren't
> doing anything special to treat them differently from other pagecache pages.)

<nod>

--D

> - Eric
>
Dave Chinner March 12, 2024, 7:01 a.m. UTC | #10
On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > But, if a generic blob cache is what it takes to move this forwards,
> > > so be it.
> > 
> > Not necessarily. ;)
> 
> And here's today's branch, with xfs_blobcache.[ch] removed and a few
> more cleanups:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11

Walking all the inodes counting all the verity blobs in the shrinker
is going to be -expensive-. Shrinkers are run very frequently and
with high concurrency under memory pressure by direct reclaim, and
every single shrinker execution is going to run that traversal even
if it is decided there is nothing that can be shrunk.

IMO, it would be better to keep a count of reclaimable objects
either on the inode itself (protected by the xa_lock when
adding/removing) to avoid needing to walk the xarray to count the
blocks on the inode. Even better would be a counter in the perag or
a global percpu counter in the mount of all caches objects. Both of
those pretty much remove all the shrinker side counting overhead.

Couple of other small things.

- verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
  really has nothing to do with the icache other than calling
  xfs_icwalk(). That gets rid of some of the config ifdefs.

- SHRINK_STOP is what should be returned by the scan when
  xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
  not LONG_MAX.

- In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
  how many objects to try to free, not how many we must free. i.e.
  even if we can't free objects, they are still objects that got
  scanned and so should decrement nr_to_scan...

-Dave.
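The two suggestions above (a running count so ->count_objects needs no traversal, and nr_to_scan consumed per object examined) can be modeled together in userspace. Everything here is an illustrative sketch, not kernel code: blob_cache, cache_store(), cache_count(), and cache_scan() are hypothetical names, and a flat slot table stands in for the per-inode xarray.

```c
#include <stdlib.h>

#define NSLOTS 64

struct blob_cache {
	void *slot[NSLOTS];
	int busy[NSLOTS];		/* nonzero while a blob is in use */
	unsigned long nr_cached;	/* updated together with slot[] */
};

/* add/remove keeps the counter in sync, so counting is O(1) */
static void cache_store(struct blob_cache *c, unsigned int i, void *p)
{
	if (!c->slot[i] && p)
		c->nr_cached++;
	else if (c->slot[i] && !p)
		c->nr_cached--;
	c->slot[i] = p;
}

/* shrinker ->count_objects: no walk of the cache required */
static unsigned long cache_count(struct blob_cache *c)
{
	return c->nr_cached;
}

/*
 * shrinker ->scan_objects: nr_to_scan is consumed per object
 * examined, freeable or not; only the number actually freed is
 * returned to the caller.
 */
static unsigned long cache_scan(struct blob_cache *c, unsigned long nr_to_scan)
{
	unsigned long freed = 0;
	unsigned int i;

	for (i = 0; i < NSLOTS && nr_to_scan; i++) {
		if (!c->slot[i])
			continue;
		nr_to_scan--;		/* scanned even if we can't free it */
		if (c->busy[i])
			continue;
		free(c->slot[i]);
		cache_store(c, i, NULL);
		freed++;
	}
	return freed;
}
```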
Darrick J. Wong March 12, 2024, 8:04 p.m. UTC | #11
On Tue, Mar 12, 2024 at 06:01:01PM +1100, Dave Chinner wrote:
> On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > But, if a generic blob cache is what it takes to move this forwards,
> > > > so be it.
> > > 
> > > Not necessarily. ;)
> > 
> > And here's today's branch, with xfs_blobcache.[ch] removed and a few
> > more cleanups:
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11
> 
> Walking all the inodes counting all the verity blobs in the shrinker
> is going to be -expensive-. Shrinkers are run very frequently and
> with high concurrency under memory pressure by direct reclaim, and
> every single shrinker execution is going to run that traversal even
> if it is decided there is nothing that can be shrunk.
> 
> IMO, it would be better to keep a count of reclaimable objects
> either on the inode itself (protected by the xa_lock when
> adding/removing) to avoid needing to walk the xarray to count the
> blocks on the inode. Even better would be a counter in the perag or
> a global percpu counter in the mount of all caches objects. Both of
> those pretty much remove all the shrinker side counting overhead.

I went with a global percpu counter, let's see if lockdep/kasan have
anything to say about my new design. :P

> Couple of other small things.
> 
> - verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
>   really has nothing to do with the icache other than calling
>   xfs_icwalk(). That gets rid of some of the config ifdefs.

Done.

> - SHRINK_STOP is what should be returned by the scan when
>   xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
>   not LONG_MAX.

Aha.  Ok, thanks for the tipoff. ;)

> - In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
>   how many objects to try to free, not how many we must free. i.e.
>   even if we can't free objects, they are still objects that got
>   scanned and so should decrement nr_to_scan...

<nod>

Is there any way for ->scan_objects to tell its caller how much memory
it actually freed?  Or does it only know about objects?  I suppose
"number of bytes freed" wouldn't be that helpful since someone else
could allocate all the freed memory immediately anyway.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner March 12, 2024, 9:45 p.m. UTC | #12
On Tue, Mar 12, 2024 at 01:04:24PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 06:01:01PM +1100, Dave Chinner wrote:
> > On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> > > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > > > But, if a generic blob cache is what it takes to move this forwards,
> > > > > so be it.
> > > > 
> > > > Not necessarily. ;)
> > > 
> > > And here's today's branch, with xfs_blobcache.[ch] removed and a few
> > > more cleanups:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11
> > 
> > Walking all the inodes counting all the verity blobs in the shrinker
> > is going to be -expensive-. Shrinkers are run very frequently and
> > with high concurrency under memory pressure by direct reclaim, and
> > every single shrinker execution is going to run that traversal even
> > if it is decided there is nothing that can be shrunk.
> > 
> > IMO, it would be better to keep a count of reclaimable objects
> > either on the inode itself (protected by the xa_lock when
> > adding/removing) to avoid needing to walk the xarray to count the
> > blocks on the inode. Even better would be a counter in the perag or
> > a global percpu counter in the mount of all caches objects. Both of
> > those pretty much remove all the shrinker side counting overhead.
> 
> I went with a global percpu counter, let's see if lockdep/kasan have
> anything to say about my new design. :P
> 
> > Couple of other small things.
> > 
> > - verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
> >   really has nothing to do with the icache other than calling
> >   xfs_icwalk(). That gets rid of some of the config ifdefs.
> 
> Done.
> 
> > - SHRINK_STOP is what should be returned by the scan when
> >   xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
> >   not LONG_MAX.
> 
> Aha.  Ok, thanks for the tipoff. ;)
> 
> > - In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
> >   how many objects to try to free, not how many we must free. i.e.
> >   even if we can't free objects, they are still objects that got
> >   scanned and so should decrement nr_to_scan...
> 
> <nod>
> 
> Is there any way for ->scan_objects to tell its caller how much memory
> it actually freed? Or does it only know about objects? 

The shrinker infrastructure itself only concerns itself with object
counts. It's almost impossible to balance slab caches against each
other based on memory used because the objects are of different sizes.

For example, it's easy to achieve a 1:1 balance for dentry objects
and inode objects when shrinking is done by object count. Now try
getting that balance when individual shrinkers and the shrinker
infrastructure don't know the relative size of the objects in caches
it is trying to balance against. For shmem and fuse inodes it is a 4:1
size difference between a dentry and the inode, for XFS inodes it's 5:1,
ext4 it's 6:1, etc. Yet they all want the same 1:1 object count
balance with the dentry cache, and the same shrinker implementation
has to manage that...

IOWs, there are two key scan values - a control value to tell the
shrinker scan how many objects to scan, and a return value that
tells the shrinker how many objects that specific scan freed.

> I suppose
> "number of bytes freed" wouldn't be that helpful since someone else
> could allocate all the freed memory immediately anyway.

The problem is that there isn't a direct relationship between
"object freed" and "memory available for allocation" with slab cache
shrinkers. If you look at the back end of SLUB freeing, when it
frees a backing page for a cache because it is now empty (all
objects freed) it calls mm_account_reclaimed_pages() to account for
the page being freed. We do the same in xfs_buf_free_pages() to
account for the fact the object being freed also freed a bunch of
extra pages not directly accounted to the shrinker.

This is the feedback to the memory reclaim process to determine how
much progress was made by the entire shrinker scan. Memory reclaim
is not actually caring how many objects were freed, it's using
hidden accounting to track how many pages actually got freed by the
overall 'every shrinker scan' call.

IOWs, object-count-based shrinking makes for relatively simple
cross-cache balance tuning, whilst actual memory freed accounting
by scans is hidden in the background...

In the case of this cache, it might be worthwhile adding something
like this for vmalloc()d objects:

	if (is_vmalloc_addr(ptr))
		mm_account_reclaimed_pages(how_many_pages(ptr))
	kvfree(ptr)

as I think that anything allocated from the SLUB heap by kvmalloc()
is already accounted to reclaim by the SLUB freeing code.

-Dave.
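The kvfree() accounting idea above can be modeled in userspace as a sketch. mm_account_reclaimed_pages() and is_vmalloc_addr() are real kernel interfaces named in the thread; the reclaimed_pages counter, how_many_pages(), and blob_free() below are illustrative stand-ins, not kernel code.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* stands in for the hidden accounting mm_account_reclaimed_pages()
 * feeds back to memory reclaim */
static unsigned long reclaimed_pages;

static unsigned long how_many_pages(size_t size)
{
	/* round up: a partially used trailing page still gets freed */
	return (size + PAGE_SIZE - 1) / PAGE_SIZE;
}

/*
 * Free a cached blob; if it was page-backed (the vmalloc() case),
 * report the whole pages it occupied so reclaim can measure scan
 * progress in pages rather than object counts.
 */
static void blob_free(void *p, size_t size, int page_backed)
{
	if (page_backed)
		reclaimed_pages += how_many_pages(size);
	free(p);
}
```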
Patch

diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 73249abca968..2a73918193ba 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -24,14 +24,15 @@  struct xfs_buf;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
-#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
-#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
-#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
-#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
-#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
-#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
-#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
-#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
+#define XBF_READ		(1u << 0) /* buffer intended for reading from device */
+#define XBF_WRITE		(1u << 1) /* buffer intended for writing to device */
+#define XBF_READ_AHEAD		(1u << 2) /* asynchronous read-ahead */
+#define XBF_NO_IOACCT		(1u << 3) /* bypass I/O accounting (non-LRU bufs) */
+#define XBF_ASYNC		(1u << 4) /* initiator will not wait for completion */
+#define XBF_DONE		(1u << 5) /* all pages in the buffer uptodate */
+#define XBF_STALE		(1u << 6) /* buffer has been staled, do not find it */
+#define XBF_WRITE_FAIL		(1u << 7) /* async writes have failed on this buffer */
+#define XBF_VERITY_SEEN		(1u << 8) /* buffer was processed by fs-verity */
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1u << 16)/* inode buffer */
@@ -65,6 +66,7 @@  typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
+	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \