diff mbox series

[v5,07/24] fsverity: support block-based Merkle tree caching

Message ID 20240304191046.157464-9-aalbersh@redhat.com (mailing list archive)
State Superseded
Headers show
Series fs-verity support for XFS | expand

Commit Message

Andrey Albershteyn March 4, 2024, 7:10 p.m. UTC
In the current implementation fs-verity expects filesystem to
provide PAGEs filled with Merkle tree blocks. Then, when fs-verity
is done with processing the blocks, reference to PAGE is freed. This
doesn't fit well with the way XFS manages its memory.

To allow XFS integrate fs-verity this patch adds ability to
fs-verity verification code to take Merkle tree blocks instead of
PAGE reference. This way ext4, f2fs, and btrfs are still able to
pass PAGE references and XFS can pass reference to Merkle tree
blocks stored in XFS's buffer infrastructure.

Another addition is invalidation function which tells fs-verity to
mark part of Merkle tree as not verified. This function is used
by filesystem to tell fs-verity to invalidate block which was
evicted from memory.

Depending on Merkle tree block size fs-verity is using either bitmap
or PG_checked flag to track "verified" status of the blocks. With a
Merkle tree block caching (XFS) there is no PAGE to flag it as
verified. fs-verity always uses bitmap to track verified blocks for
filesystems which use block caching.

Further this patch allows filesystem to make additional processing
on verified pages via fsverity_drop_block() instead of just dropping
a reference. This will be used by XFS for internal buffer cache
manipulation in further patches. The btrfs, ext4, and f2fs just drop
the reference.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/verity/fsverity_private.h |   8 +++
 fs/verity/open.c             |   8 ++-
 fs/verity/read_metadata.c    |  64 +++++++++++-------
 fs/verity/verify.c           | 125 +++++++++++++++++++++++++++--------
 include/linux/fsverity.h     |  65 ++++++++++++++++++
 5 files changed, 217 insertions(+), 53 deletions(-)

Comments

Eric Biggers March 6, 2024, 3:56 a.m. UTC | #1
On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> index b3506f56e180..dad33e6ff0d6 100644
> --- a/fs/verity/fsverity_private.h
> +++ b/fs/verity/fsverity_private.h
> @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
>  
>  void __init fsverity_init_workqueue(void);
>  
> +/*
> + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> + * block->context.
> + */
> +void fsverity_drop_block(struct inode *inode,
> +			 struct fsverity_blockbuf *block);
> +
>  #endif /* _FSVERITY_PRIVATE_H */

This should be paired with a helper function that reads a Merkle tree block by
calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
being consistent with having a helper function for drop, this would prevent code
duplication between verify_data_block() and fsverity_read_merkle_tree().

I recommend that it look like this:

int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
				    unsigned long ra_bytes,
				    struct fsverity_blockbuf *block);

'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
would be the number of bytes for the filesystem to (optionally) readahead if the
block is not yet cached.  I think that things work out simpler if these values
are measured in bytes, not blocks.  'block' would be at the end because it's an
output, and it can be confusing to interleave inputs and outputs in parameters.

Also, the drop function should be named consistently:

void fsverity_drop_merkle_tree_block(struct inode *inode,
                                     struct fsverity_blockbuf *block);

It would be a bit long, but I think it's helpful for consistency and also
because there are other types of "blocks" that it could be confused with.
(Data blocks, filesystem blocks, internal blocks of the hash function.)

> diff --git a/fs/verity/open.c b/fs/verity/open.c
> index fdeb95eca3af..6e6922b4b014 100644
> --- a/fs/verity/open.c
> +++ b/fs/verity/open.c
> @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
>  	if (err)
>  		goto fail;
>  
> -	if (vi->tree_params.block_size != PAGE_SIZE) {
> +	/*
> +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> +	 * fs-verity should use hash_block_verified bitmap as there's no page
> +	 * to mark it with PG_checked.
> +	 */
> +	if (vi->tree_params.block_size != PAGE_SIZE ||
> +			inode->i_sb->s_vop->read_merkle_tree_block) {

Technically, all filesystems "pass Merkle tree blocks to fs-verity".  Maybe
stick to calling it "block-based Merkle tree caching", as you have in the commit
title, as that seems clearer.

> diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> index f58432772d9e..5da40b5a81af 100644
> --- a/fs/verity/read_metadata.c
> +++ b/fs/verity/read_metadata.c
> @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  {
>  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
>  	u64 end_offset;
> -	unsigned int offs_in_page;
> +	unsigned int offs_in_block;
>  	pgoff_t index, last_index;
>  	int retval = 0;
>  	int err = 0;
> +	const unsigned int block_size = vi->tree_params.block_size;
> +	const u8 log_blocksize = vi->tree_params.log_blocksize;
>  
>  	end_offset = min(offset + length, vi->tree_params.tree_size);
>  	if (offset >= end_offset)
>  		return 0;
> -	offs_in_page = offset_in_page(offset);
> -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> +	offs_in_block = offset & (block_size - 1);
> +	last_index = (end_offset - 1) >> log_blocksize;
>  
>  	/*
> -	 * Iterate through each Merkle tree page in the requested range and copy
> -	 * the requested portion to userspace.  Note that the Merkle tree block
> -	 * size isn't important here, as we are returning a byte stream; i.e.,
> -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace. Note that we are returning
> +	 * a byte stream.
>  	 */
> -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> +	for (index = offset >> log_blocksize; index <= last_index; index++) {
>  		unsigned long num_ra_pages =
>  			min_t(unsigned long, last_index - index + 1,
>  			      inode->i_sb->s_bdi->io_pages);
>  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> -						   PAGE_SIZE - offs_in_page);
> -		struct page *page;
> -		const void *virt;
> +						   block_size - offs_in_block);
> +		struct fsverity_blockbuf block = {
> +			.size = block_size,
> +		};
>  
> -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> -		if (IS_ERR(page)) {
> -			err = PTR_ERR(page);
> +		if (!vops->read_merkle_tree_block) {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			unsigned long page_idx =
> +				round_down(index, blocks_per_page);
> +			struct page *page = vops->read_merkle_tree_page(inode,
> +					page_idx, num_ra_pages);
> +
> +			if (IS_ERR(page)) {
> +				err = PTR_ERR(page);
> +			} else {
> +				block.kaddr = kmap_local_page(page) +
> +					((index - page_idx) << log_blocksize);
> +				block.context = page;
> +			}
> +		} else {
> +			err = vops->read_merkle_tree_block(inode,
> +					index << log_blocksize,
> +					&block, log_blocksize, num_ra_pages);
> +		}
> +
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %d reading Merkle tree page %lu",
> -				     err, index);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, index << log_blocksize);
>  			break;
>  		}
>  
> -		virt = kmap_local_page(page);
> -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> -			kunmap_local(virt);
> -			put_page(page);
> +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> +			fsverity_drop_block(inode, &block);
>  			err = -EFAULT;
>  			break;
>  		}
> -		kunmap_local(virt);
> -		put_page(page);
> +		fsverity_drop_block(inode, &block);
>  
>  		retval += bytes_to_copy;
>  		buf += bytes_to_copy;
> @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  			break;
>  		}
>  		cond_resched();
> -		offs_in_page = 0;
> +		offs_in_block = 0;
>  	}
>  	return retval ? retval : err;
>  }

This is one of the two places that would be simplified quite a bit by adding an
fsverity_read_merkle_tree_block() helper function, especially if the loop
variable was changed to measure bytes instead of blocks so that the Merkle tree
block position and readahead amount can more easily be computed in bytes.  E.g.:

static int fsverity_read_merkle_tree(struct inode *inode,
				     const struct fsverity_info *vi,
				     void __user *buf, u64 pos, int length)
{
	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
	[...]

	while (pos < end_pos) {
		// bytes_to_copy = remaining length in current block
		[...]
		err = fsverity_read_merkle_tree_block(inode,
						      pos - offs_in_block,
						      ra_bytes,
						      &block);
		[...]
		pos += bytes_to_copy;
		[...]
	}

> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index 4fcad0825a12..de71911d400c 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -13,14 +13,17 @@
>  static struct workqueue_struct *fsverity_read_workqueue;
>  
>  /*
> - * Returns true if the hash block with index @hblock_idx in the tree, located in
> - * @hpage, has already been verified.
> + * Returns true if the hash block with index @hblock_idx in the tree has
> + * already been verified.
>   */
> -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> +static bool is_hash_block_verified(struct inode *inode,
> +				   struct fsverity_blockbuf *block,
>  				   unsigned long hblock_idx)
>  {
>  	unsigned int blocks_per_page;
>  	unsigned int i;
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	struct page *hpage = (struct page *)block->context;
>  
>  	/*
>  	 * When the Merkle tree block size and page size are the same, then the
> @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
>  	if (!vi->hash_block_verified)
>  		return PageChecked(hpage);
>  
> +	/*
> +	 * Filesystems which use block based caching (e.g. XFS) always use
> +	 * bitmap.
> +	 */
> +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> +		return test_bit(hblock_idx, vi->hash_block_verified);
>  	/*
>  	 * When the Merkle tree block size and page size differ, we use a bitmap
>  	 * to indicate whether each hash block has been verified.

This is hard to follow because the code goes from handling page-based caching to
handling block-based caching to handling page-based caching again.  Also, the
comment says that block-based caching uses the bitmap, but it doesn't give any
hint at why the bitmap is being handled differently in that case from the
page-based caching case that also uses the bitmap.

How about doing something like the following where the block-based caching case
is handled first, before either block-based caching case:

	/*
	 * If the filesystem uses block-based caching, then
	 * ->hash_block_verified is always used and the filesystem pushes
	 * invalidations to it as needed.
	 */
	if (inode->i_sb->s_vop->read_merkle_tree_block)
		return test_bit(hblock_idx, vi->hash_block_verified);

	/* Otherwise, the filesystem uses page-based caching. */
	hpage = (struct page *)block->context;

Note, to avoid confusion, the 'hpage' variable shouldn't be set until it's been
verified that the filesystem is using page-based caching.

> @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> verify_data_block(struct inode *inode, struct fsverity_info *vi,
>		  const void *data, u64 data_pos, unsigned long max_ra_pages)

max_ra_pages should become max_ra_bytes.

> +	int num_ra_pages;

unsigned long ra_bytes.  Also, this should be declared in the scope later on
where it's actually needed.

> @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		unsigned long next_hidx;
>  		unsigned long hblock_idx;
>  		pgoff_t hpage_idx;
> +		u64 hblock_pos;
>  		unsigned int hblock_offset_in_page;

hpage_idx and hblock_offset_in_page should both go away.
fsverity_read_merkle_tree_block() should do the conversion to/from pages.

>  		unsigned int hoffset;
>  		struct page *hpage;
> -		const void *haddr;
> +		struct fsverity_blockbuf *block = &hblocks[level].block;
>  
>  		/*
>  		 * The index of the block in the current level; also the index
> @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>		/* Index of the hash block in the tree overall */
>		hblock_idx = params->level_start[level] + next_hidx;
>
>		/* Byte offset of the hash block within the page */
>  		hblock_offset_in_page =
>  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
>  
> +		/* Offset of the Merkle tree block into the tree */
> +		hblock_pos = hblock_idx << params->log_blocksize;

The comment for hblock_pos should say:

	/* Byte offset of the hash block in the tree overall */

... so that it's consistent with the one for hblock_idx which says:

	/* Index of the hash block in the tree overall */

> +		num_ra_pages = level == 0 ?
> +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> +
> +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> +				inode, hblock_pos, block, params->log_blocksize,
> +				num_ra_pages);
> +		} else {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> +
> +			if (IS_ERR(hpage)) {
> +				err = PTR_ERR(hpage);
> +			} else {
> +				block->kaddr = kmap_local_page(hpage) +
> +					hblock_offset_in_page;
> +				block->context = hpage;
> +			}
> +		}

This has the weirdness with the readahead amount again, where it's still being
measured in pages.  I think it would make more sense to measure it in bytes.  It
would stay bytes when passed to ->read_merkle_tree_block, or be converted to
pages at the last minute for ->read_merkle_tree_page which works in pages.

> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %ld reading Merkle tree page %lu",
> -				     PTR_ERR(hpage), hpage_idx);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, hblock_idx);
>  			goto error;
>  		}

This error message should go in fsverity_read_merkle_tree_block().

Doing that would also eliminate the need to add an 'err' variable to
verify_data_block(), which could be confusing because it returns a bool.

>  	/* Descend the tree verifying hash blocks. */
>  	for (; level > 0; level--) {
> -		struct page *hpage = hblocks[level - 1].page;
> -		const void *haddr = hblocks[level - 1].addr;
> +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> +		const void *haddr = block->kaddr;
>  		unsigned long hblock_idx = hblocks[level - 1].index;
>  		unsigned int hoffset = hblocks[level - 1].hoffset;
> +		struct page *hpage = (struct page *)block->context;
>  
>  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
>  			goto error;
> @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>		if (vi->hash_block_verified)
>			set_bit(hblock_idx, vi->hash_block_verified);
>		else
>  			SetPageChecked(hpage);

Since the page is only valid when vi->hash_block_verified is NULL, remove the
hpage variable and do 'SetPageChecked((struct page *)block->context);'.
Yes, there's no issue until the pointer is actually dereferenced, but it's
confusing to have incorrectly-typed pointer variables sitting around.

> +/**
> + * fsverity_invalidate_block() - invalidate Merkle tree block
> + * @inode: inode to which this Merkle tree blocks belong
> + * @block: block to be invalidated
> + *
> + * This function invalidates/clears "verified" state of Merkle tree block
> + * in the fs-verity bitmap. The block needs to have ->offset set.
> + */
> +void fsverity_invalidate_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)
> +{
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> +
> +	if (block->offset > vi->tree_params.tree_size) {
> +		fsverity_err(inode,
> +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> +			     vi->tree_params.tree_size, block->offset);
> +		return;
> +	}
> +
> +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> +}
> +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);

I don't think this function should be using struct fsverity_blockbuf.  It only
uses the 'offset' field, which is orthogonal to what ->read_merkle_tree_block
and ->drop_merkle_tree_block use.

The function name fsverity_invalidate_block() also has the issue where it's not
clear what type of block it's talking about.

How about changing the prototype to:

void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);

Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
can it happen in cases like filesystem corruption?  If it can only happen due to
kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.

Please note that there's also an off-by-one error.  pos > tree_size should be
pos >= tree_size.

> +void fsverity_drop_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)

fsverity_drop_merkle_tree_block

> +{
> +	if (inode->i_sb->s_vop->drop_block)
> +		inode->i_sb->s_vop->drop_block(block);
> +	else {

Add braces above.

(See line 213 of Documentation/process/coding-style.rst)

> +		struct page *page = (struct page *)block->context;
> +
> +		kunmap_local(block->kaddr);
> +		put_page(page);

put_page((struct page *)block->context), and remove the page variable

> +	}
> +	block->kaddr = NULL;

Also set block->context = NULL

> +/**
> + * struct fsverity_blockbuf - Merkle Tree block buffer
> + * @kaddr: virtual address of the block's data
> + * @offset: block's offset into Merkle tree
> + * @size: the Merkle tree block size
> + * @context: filesystem private context
> + *
> + * Buffer containing single Merkle Tree block. These buffers are passed
> + *  - to filesystem, when fs-verity is building merkel tree,
> + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> + * Filesystems sets kaddr together with size to point to a memory which contains
> + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> + * written down to disk.
> + *
> + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> + * ->drop_block to let filesystem know that memory can be freed.
> + *
> + * The context is optional. This field can be used by filesystem to passthrough
> + * state from ->read_merkle_tree_block to ->drop_block.
> + */
> +struct fsverity_blockbuf {
> +	void *kaddr;
> +	u64 offset;
> +	unsigned int size;
> +	void *context;
> +};

If this is only used by ->read_merkle_tree_block and ->drop_merkle_tree_block,
then it can be simplified.  The only fields needed are kaddr and context.

The comment also still incorrectly claims that the struct is used for writes.

> +	/**
> +	 * Read a Merkle tree block of the given inode.
> +	 * @inode: the inode
> +	 * @pos: byte offset of the block within the Merkle tree
> +	 * @block: block buffer for filesystem to point it to the block
> +	 * @log_blocksize: log2 of the size of the expected block
> +	 * @ra_bytes: The number of bytes that should be
> +	 *		prefetched starting at @pos if the page at @pos
> +	 *		isn't already cached.  Implementations may ignore this
> +	 *		argument; it's only a performance optimization.
> +	 *
> +	 * This can be called at any time on an open verity file.  It may be
> +	 * called by multiple processes concurrently.
> +	 *
> +	 * In case that block was evicted from the memory filesystem has to use
> +	 * fsverity_invalidate_block() to let fsverity know that block's
> +	 * verification state is not valid anymore.
> +	 *
> +	 * Return: 0 on success, -errno on failure
> +	 */
> +	int (*read_merkle_tree_block)(struct inode *inode,
> +				      u64 pos,
> +				      struct fsverity_blockbuf *block,
> +				      unsigned int log_blocksize,
> +				      u64 ra_bytes);

How about:
	
	int (*read_merkle_tree_block)(struct inode *inode,
				      u64 pos,
				      unsigned int size,
				      unsigned long ra_bytes,
				      struct fsverity_blockbuf *block);

- size instead of log_blocksize, for consistency with other functions and
  because it's what XFS actually wants

- ra_bytes as unsigned long, since u64 readahead amounts don't really make
  sense.  (This isn't too important though; it could be u64 if you really want.)

- block at the end since it's an output parameter

It also needs to be clearly documented that filesystems must implement either
->read_merkle_tree_page, *or* both ->read_merkle_tree_block and
->drop_merkle_tree_block.

Also, the comment about invalidating the block is still unclear regarding the
timing of when the filesystem must do it.

> +
> +	/**
> +	 * Release the reference to a Merkle tree block
> +	 *
> +	 * @block: the block to release
> +	 *
> +	 * This is called when fs-verity is done with a block obtained with
> +	 * ->read_merkle_tree_block().
> +	 */
> +	void (*drop_block)(struct fsverity_blockbuf *block);

drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block

- Eric
Darrick J. Wong March 7, 2024, 9:54 p.m. UTC | #2
On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > index b3506f56e180..dad33e6ff0d6 100644
> > --- a/fs/verity/fsverity_private.h
> > +++ b/fs/verity/fsverity_private.h
> > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> >  
> >  void __init fsverity_init_workqueue(void);
> >  
> > +/*
> > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > + * block->context.
> > + */
> > +void fsverity_drop_block(struct inode *inode,
> > +			 struct fsverity_blockbuf *block);
> > +
> >  #endif /* _FSVERITY_PRIVATE_H */
> 
> This should be paired with a helper function that reads a Merkle tree block by
> calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> being consistent with having a helper function for drop, this would prevent code
> duplication between verify_data_block() and fsverity_read_merkle_tree().
> 
> I recommend that it look like this:
> 
> int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> 				    unsigned long ra_bytes,
> 				    struct fsverity_blockbuf *block);
> 
> 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> would be the number of bytes for the filesystem to (optionally) readahead if the
> block is not yet cached.  I think that things work out simpler if these values
> are measured in bytes, not blocks.  'block' would be at the end because it's an
> output, and it can be confusing to interleave inputs and outputs in parameters.

FWIW I don't really like 'pos' here because that's usually short for
"file position", which is a byte, and this looks a lot more like a
merkle tree block number.

u64 blkno?

Or better yet use a typedef ("merkle_blkno_t") to make it really clear
when we're dealing with a tree block number.  Ignore checkpatch
complaining about typeedefs. :)

> Also, the drop function should be named consistently:
> 
> void fsverity_drop_merkle_tree_block(struct inode *inode,
>                                      struct fsverity_blockbuf *block);
> 
> It would be a bit long, but I think it's helpful for consistency and also
> because there are other types of "blocks" that it could be confused with.
> (Data blocks, filesystem blocks, internal blocks of the hash function.)

I agree.

> > diff --git a/fs/verity/open.c b/fs/verity/open.c
> > index fdeb95eca3af..6e6922b4b014 100644
> > --- a/fs/verity/open.c
> > +++ b/fs/verity/open.c
> > @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
> >  	if (err)
> >  		goto fail;
> >  
> > -	if (vi->tree_params.block_size != PAGE_SIZE) {
> > +	/*
> > +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> > +	 * fs-verity should use hash_block_verified bitmap as there's no page
> > +	 * to mark it with PG_checked.
> > +	 */
> > +	if (vi->tree_params.block_size != PAGE_SIZE ||
> > +			inode->i_sb->s_vop->read_merkle_tree_block) {
> 
> Technically, all filesystems "pass Merkle tree blocks to fs-verity".  Maybe
> stick to calling it "block-based Merkle tree caching", as you have in the commit
> title, as that seems clearer.
> 
> > diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> > index f58432772d9e..5da40b5a81af 100644
> > --- a/fs/verity/read_metadata.c
> > +++ b/fs/verity/read_metadata.c
> > @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
> >  {
> >  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
> >  	u64 end_offset;
> > -	unsigned int offs_in_page;
> > +	unsigned int offs_in_block;
> >  	pgoff_t index, last_index;
> >  	int retval = 0;
> >  	int err = 0;
> > +	const unsigned int block_size = vi->tree_params.block_size;
> > +	const u8 log_blocksize = vi->tree_params.log_blocksize;
> >  
> >  	end_offset = min(offset + length, vi->tree_params.tree_size);
> >  	if (offset >= end_offset)
> >  		return 0;
> > -	offs_in_page = offset_in_page(offset);
> > -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> > +	offs_in_block = offset & (block_size - 1);
> > +	last_index = (end_offset - 1) >> log_blocksize;
> >  
> >  	/*
> > -	 * Iterate through each Merkle tree page in the requested range and copy
> > -	 * the requested portion to userspace.  Note that the Merkle tree block
> > -	 * size isn't important here, as we are returning a byte stream; i.e.,
> > -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> > +	 * Iterate through each Merkle tree block in the requested range and
> > +	 * copy the requested portion to userspace. Note that we are returning
> > +	 * a byte stream.
> >  	 */
> > -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> > +	for (index = offset >> log_blocksize; index <= last_index; index++) {
> >  		unsigned long num_ra_pages =
> >  			min_t(unsigned long, last_index - index + 1,
> >  			      inode->i_sb->s_bdi->io_pages);
> >  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> > -						   PAGE_SIZE - offs_in_page);
> > -		struct page *page;
> > -		const void *virt;
> > +						   block_size - offs_in_block);
> > +		struct fsverity_blockbuf block = {
> > +			.size = block_size,
> > +		};
> >  
> > -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> > -		if (IS_ERR(page)) {
> > -			err = PTR_ERR(page);
> > +		if (!vops->read_merkle_tree_block) {
> > +			unsigned int blocks_per_page =
> > +				vi->tree_params.blocks_per_page;
> > +			unsigned long page_idx =
> > +				round_down(index, blocks_per_page);
> > +			struct page *page = vops->read_merkle_tree_page(inode,
> > +					page_idx, num_ra_pages);
> > +
> > +			if (IS_ERR(page)) {
> > +				err = PTR_ERR(page);
> > +			} else {
> > +				block.kaddr = kmap_local_page(page) +
> > +					((index - page_idx) << log_blocksize);
> > +				block.context = page;
> > +			}
> > +		} else {
> > +			err = vops->read_merkle_tree_block(inode,
> > +					index << log_blocksize,
> > +					&block, log_blocksize, num_ra_pages);
> > +		}
> > +
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %d reading Merkle tree page %lu",
> > -				     err, index);
> > +				     "Error %d reading Merkle tree block %lu",
> > +				     err, index << log_blocksize);
> >  			break;
> >  		}
> >  
> > -		virt = kmap_local_page(page);
> > -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> > -			kunmap_local(virt);
> > -			put_page(page);
> > +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> > +			fsverity_drop_block(inode, &block);
> >  			err = -EFAULT;
> >  			break;
> >  		}
> > -		kunmap_local(virt);
> > -		put_page(page);
> > +		fsverity_drop_block(inode, &block);
> >  
> >  		retval += bytes_to_copy;
> >  		buf += bytes_to_copy;
> > @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
> >  			break;
> >  		}
> >  		cond_resched();
> > -		offs_in_page = 0;
> > +		offs_in_block = 0;
> >  	}
> >  	return retval ? retval : err;
> >  }
> 
> This is one of the two places that would be simplified quite a bit by adding an
> fsverity_read_merkle_tree_block() helper function, especially if the loop
> variable was changed to measure bytes instead of blocks so that the Merkle tree
> block position and readahead amount can more easily be computed in bytes.  E.g.:

Yes!  I really agree here -- I downloaded the git repo, tried to read
this function, and felt that it would be a lot easier to understand if
the two access paths were separate functions instead of a lot of
switched logic within one function.

> static int fsverity_read_merkle_tree(struct inode *inode,
> 				     const struct fsverity_info *vi,
> 				     void __user *buf, u64 pos, int length)
> {
> 	const u64 end_pos = min(pos + length, vi->tree_params.tree_size);
> 	[...]
> 
> 	while (pos < end_pos) {
> 		// bytes_to_copy = remaining length in current block
> 		[...]
> 		err = fsverity_read_merkle_tree_block(inode,
> 						      pos - offs_in_block,
> 						      ra_bytes,
> 						      &block);
> 		[...]
> 		pos += bytes_to_copy;
> 		[...]
> 	}
> 
> > diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> > index 4fcad0825a12..de71911d400c 100644
> > --- a/fs/verity/verify.c
> > +++ b/fs/verity/verify.c
> > @@ -13,14 +13,17 @@
> >  static struct workqueue_struct *fsverity_read_workqueue;
> >  
> >  /*
> > - * Returns true if the hash block with index @hblock_idx in the tree, located in
> > - * @hpage, has already been verified.
> > + * Returns true if the hash block with index @hblock_idx in the tree has
> > + * already been verified.
> >   */
> > -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> > +static bool is_hash_block_verified(struct inode *inode,
> > +				   struct fsverity_blockbuf *block,
> >  				   unsigned long hblock_idx)
> >  {
> >  	unsigned int blocks_per_page;
> >  	unsigned int i;
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	struct page *hpage = (struct page *)block->context;
> >  
> >  	/*
> >  	 * When the Merkle tree block size and page size are the same, then the
> > @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> >  	if (!vi->hash_block_verified)
> >  		return PageChecked(hpage);
> >  
> > +	/*
> > +	 * Filesystems which use block based caching (e.g. XFS) always use
> > +	 * bitmap.
> > +	 */
> > +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> > +		return test_bit(hblock_idx, vi->hash_block_verified);
> >  	/*
> >  	 * When the Merkle tree block size and page size differ, we use a bitmap
> >  	 * to indicate whether each hash block has been verified.
> 
> This is hard to follow because the code goes from handling page-based caching to
> handling block-based caching to handling page-based caching again.  Also, the
> comment says that block-based caching uses the bitmap, but it doesn't give any
> hint at why the bitmap is being handled differently in that case from the
> page-based caching case that also uses the bitmap.
> 
> How about doing something like the following where the block-based caching case
> is handled first, before either block-based caching case:
> 
> 	/*
> 	 * If the filesystem uses block-based caching, then
> 	 * ->hash_block_verified is always used and the filesystem pushes
> 	 * invalidations to it as needed.
> 	 */
> 	if (inode->i_sb->s_vop->read_merkle_tree_block)
> 		return test_bit(hblock_idx, vi->hash_block_verified);
> 
> 	/* Otherwise, the filesystem uses page-based caching. */
> 	hpage = (struct page *)block->context;
> 
> Note, to avoid confusion, the 'hpage' variable shouldn't be set until it's been
> verified that the filesystem is using page-based caching.

(Yep.)

> > @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> > verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		  const void *data, u64 data_pos, unsigned long max_ra_pages)
> 
> max_ra_pages should become max_ra_bytes.
> 
> > +	int num_ra_pages;
> 
> unsigned long ra_bytes.  Also, this should be declared in the scope later on
> where it's actually needed.

Agreed.

> > @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >  		unsigned long next_hidx;
> >  		unsigned long hblock_idx;
> >  		pgoff_t hpage_idx;
> > +		u64 hblock_pos;
> >  		unsigned int hblock_offset_in_page;
> 
> hpage_idx and hblock_offset_in_page should both go away.
> fsverity_read_merkle_tree_block() should do the conversion to/from pages.
> 
> >  		unsigned int hoffset;
> >  		struct page *hpage;
> > -		const void *haddr;
> > +		struct fsverity_blockbuf *block = &hblocks[level].block;
> >  
> >  		/*
> >  		 * The index of the block in the current level; also the index
> > @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		/* Index of the hash block in the tree overall */
> >		hblock_idx = params->level_start[level] + next_hidx;
> >
> >		/* Byte offset of the hash block within the page */
> >  		hblock_offset_in_page =
> >  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
> >  
> > +		/* Offset of the Merkle tree block into the tree */
> > +		hblock_pos = hblock_idx << params->log_blocksize;
> 
> The comment for hblock_pos should say:
> 
> 	/* Byte offset of the hash block in the tree overall */
> 
> ... so that it's consistent with the one for hblock_idx which says:
> 
> 	/* Index of the hash block in the tree overall */
> 
> > +		num_ra_pages = level == 0 ?
> > +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> > +
> > +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> > +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> > +				inode, hblock_pos, block, params->log_blocksize,
> > +				num_ra_pages);
> > +		} else {
> > +			unsigned int blocks_per_page =
> > +				vi->tree_params.blocks_per_page;
> > +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> > +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> > +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> > +
> > +			if (IS_ERR(hpage)) {
> > +				err = PTR_ERR(hpage);
> > +			} else {
> > +				block->kaddr = kmap_local_page(hpage) +
> > +					hblock_offset_in_page;
> > +				block->context = hpage;
> > +			}
> > +		}
> 
> This has the weirdness with the readahead amount again, where it's still being
> measured in pages.  I think it would make more sense to measure it in bytes.  It
> would stay bytes when passed to ->read_merkle_tree_block, or be converted to
> pages at the last minute for ->read_merkle_tree_page which works in pages.
> 
> > +		if (err) {
> >  			fsverity_err(inode,
> > -				     "Error %ld reading Merkle tree page %lu",
> > -				     PTR_ERR(hpage), hpage_idx);
> > +				     "Error %d reading Merkle tree block %lu",
> > +				     err, hblock_idx);
> >  			goto error;
> >  		}
> 
> This error message should go in fsverity_read_merkle_tree_block().
> 
> Doing that would also eliminate the need to add an 'err' variable to
> verify_data_block(), which could be confusing because it returns a bool.
> 
> >  	/* Descend the tree verifying hash blocks. */
> >  	for (; level > 0; level--) {
> > -		struct page *hpage = hblocks[level - 1].page;
> > -		const void *haddr = hblocks[level - 1].addr;
> > +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> > +		const void *haddr = block->kaddr;
> >  		unsigned long hblock_idx = hblocks[level - 1].index;
> >  		unsigned int hoffset = hblocks[level - 1].hoffset;
> > +		struct page *hpage = (struct page *)block->context;
> >  
> >  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
> >  			goto error;
> > @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
> >		if (vi->hash_block_verified)
> >			set_bit(hblock_idx, vi->hash_block_verified);
> >		else
> >  			SetPageChecked(hpage);
> 
> Since the page is only valid when vi->hash_block_verified is NULL, remove the
> hpage variable and do 'SetPageChecked((struct page *)block->context);'.
> Yes, there's no issue until the pointer is actually dereferenced, but it's
> confusing to have incorrectly-typed pointer variables sitting around.
> 
> > +/**
> > + * fsverity_invalidate_block() - invalidate Merkle tree block
> > + * @inode: inode to which this Merkle tree blocks belong
> > + * @block: block to be invalidated
> > + *
> > + * This function invalidates/clears "verified" state of Merkle tree block
> > + * in the fs-verity bitmap. The block needs to have ->offset set.
> > + */
> > +void fsverity_invalidate_block(struct inode *inode,
> > +		struct fsverity_blockbuf *block)
> > +{
> > +	struct fsverity_info *vi = inode->i_verity_info;
> > +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> > +
> > +	if (block->offset > vi->tree_params.tree_size) {
> > +		fsverity_err(inode,
> > +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> > +			     vi->tree_params.tree_size, block->offset);
> > +		return;
> > +	}
> > +
> > +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> > +}
> > +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
> 
> I don't think this function should be using struct fsverity_blockbuf.  It only
> uses the 'offset' field, which is orthogonal to what ->read_merkle_tree_block
> and ->drop_merkle_tree_block use.
> 
> The function name fsverity_invalidate_block() also has the issue where it's not
> clear what type of block it's talking about.
> 
> How about changing the prototype to:
> 
> void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
>
> Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> can it happen in cases like filesystem corruption?  If it can only happen due to
> kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.

I think XFS only passes to _invalidate_* the same pos that was passed to
->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
problem.

Perhaps this function ought to note that @pos is supposed to be the same
value that was given to ->read_merkle_tree_block?

Or: make the implementations return 1 for "reloaded from disk", 0 for
"still in cache", or a negative error code.  Then fsverity can call
the invalidation routine itself and XFS doesn't have to worry about this
part.

(I think?  I have questions about the xfs_invalidate_blocks function.)

> Please note that there's also an off-by-one error.  pos > tree_size should be
> pos >= tree_size.
> 
> > +void fsverity_drop_block(struct inode *inode,
> > +		struct fsverity_blockbuf *block)
> 
> fsverity_drop_merkle_tree_block
> 
> > +{
> > +	if (inode->i_sb->s_vop->drop_block)
> > +		inode->i_sb->s_vop->drop_block(block);
> > +	else {
> 
> Add braces above.
> 
> (See line 213 of Documentation/process/coding-style.rst)
> 
> > +		struct page *page = (struct page *)block->context;
> > +
> > +		kunmap_local(block->kaddr);
> > +		put_page(page);
> 
> put_page((struct page *)block->context), and remove the page variable
> 
> > +	}
> > +	block->kaddr = NULL;
> 
> Also set block->context = NULL
> 
> > +/**
> > + * struct fsverity_blockbuf - Merkle Tree block buffer
> > + * @kaddr: virtual address of the block's data
> > + * @offset: block's offset into Merkle tree
> > + * @size: the Merkle tree block size
> > + * @context: filesystem private context
> > + *
> > + * Buffer containing single Merkle Tree block. These buffers are passed
> > + *  - to filesystem, when fs-verity is building merkel tree,
> > + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> > + * Filesystems sets kaddr together with size to point to a memory which contains
> > + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> > + * written down to disk.
> > + *
> > + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> > + * ->drop_block to let filesystem know that memory can be freed.
> > + *
> > + * The context is optional. This field can be used by filesystem to passthrough
> > + * state from ->read_merkle_tree_block to ->drop_block.
> > + */
> > +struct fsverity_blockbuf {
> > +	void *kaddr;
> > +	u64 offset;
> > +	unsigned int size;
> > +	void *context;
> > +};
> 
> If this is only used by ->read_merkle_tree_block and ->drop_merkle_tree_block,
> then it can be simplified.  The only fields needed are kaddr and context.
> 
> The comment also still incorrectly claims that the struct is used for writes.
> 
> > +	/**
> > +	 * Read a Merkle tree block of the given inode.
> > +	 * @inode: the inode
> > +	 * @pos: byte offset of the block within the Merkle tree
> > +	 * @block: block buffer for filesystem to point it to the block
> > +	 * @log_blocksize: log2 of the size of the expected block
> > +	 * @ra_bytes: The number of bytes that should be
> > +	 *		prefetched starting at @pos if the page at @pos
> > +	 *		isn't already cached.  Implementations may ignore this
> > +	 *		argument; it's only a performance optimization.
> > +	 *
> > +	 * This can be called at any time on an open verity file.  It may be
> > +	 * called by multiple processes concurrently.
> > +	 *
> > +	 * In case that block was evicted from the memory filesystem has to use
> > +	 * fsverity_invalidate_block() to let fsverity know that block's
> > +	 * verification state is not valid anymore.

Ah, ok, that's why fsverity_invalidate_block only does one block at a
time.

> > +	 *
> > +	 * Return: 0 on success, -errno on failure
> > +	 */
> > +	int (*read_merkle_tree_block)(struct inode *inode,
> > +				      u64 pos,
> > +				      struct fsverity_blockbuf *block,
> > +				      unsigned int log_blocksize,
> > +				      u64 ra_bytes);
> 
> How about:
> 	
> 	int (*read_merkle_tree_block)(struct inode *inode,
> 				      u64 pos,
> 				      unsigned int size,
> 				      unsigned long ra_bytes,
> 				      struct fsverity_blockbuf *block);
> 
> - size instead of log_blocksize, for consistency with other functions and
>   because it's what XFS actually wants
> 
> - ra_bytes as unsigned long, since u64 readahead amounts don't really make
>   sense.  (This isn't too important though; it could be u64 if you really want.)
> 
> - block at the end since it's an output parameter
> 
> It also needs to be clearly documented that filesystems must implement either
> ->read_merkle_tree_page, *or* both ->read_merkle_tree_block and
> ->drop_merkle_tree_block.
> 
> Also, the comment about invalidating the block is still unclear regarding the
> timing of when the filesystem must do it.
> 
> > +
> > +	/**
> > +	 * Release the reference to a Merkle tree block
> > +	 *
> > +	 * @block: the block to release
> > +	 *
> > +	 * This is called when fs-verity is done with a block obtained with
> > +	 * ->read_merkle_tree_block().
> > +	 */
> > +	void (*drop_block)(struct fsverity_blockbuf *block);
> 
> drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block

Yep.  I noticed that xfs_verity.c doesn't put them together, which made
me wonder if the write_merkle_tree_block path made use of that.  It
doesn't, AFAICT.

And I think the reason is that when we're setting up the merkle tree,
we want to stream the contents straight to disk instead of ending up
with a huge cache that might not all be necessary?

--D

> - Eric
>
Eric Biggers March 7, 2024, 10:49 p.m. UTC | #3
On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > index b3506f56e180..dad33e6ff0d6 100644
> > > --- a/fs/verity/fsverity_private.h
> > > +++ b/fs/verity/fsverity_private.h
> > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > >  
> > >  void __init fsverity_init_workqueue(void);
> > >  
> > > +/*
> > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > + * block->context.
> > > + */
> > > +void fsverity_drop_block(struct inode *inode,
> > > +			 struct fsverity_blockbuf *block);
> > > +
> > >  #endif /* _FSVERITY_PRIVATE_H */
> > 
> > This should be paired with a helper function that reads a Merkle tree block by
> > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > being consistent with having a helper function for drop, this would prevent code
> > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > 
> > I recommend that it look like this:
> > 
> > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > 				    unsigned long ra_bytes,
> > 				    struct fsverity_blockbuf *block);
> > 
> > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > would be the number of bytes for the filesystem to (optionally) readahead if the
> > block is not yet cached.  I think that things work out simpler if these values
> > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > output, and it can be confusing to interleave inputs and outputs in parameters.
> 
> FWIW I don't really like 'pos' here because that's usually short for
> "file position", which is a byte, and this looks a lot more like a
> merkle tree block number.
> 
> u64 blkno?
> 
> Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> when we're dealing with a tree block number.  Ignore checkpatch
> complaining about typeedefs. :)

My suggestion is for 'pos' to be a byte position, in alignment with the
pagecache naming convention as well as ->write_merkle_tree_block which currently
uses a 'u64 pos' byte position too.

It would also work for it to be a block index, in which case it should be named
'index' or 'blkno' and have type unsigned long.  I *think* that things work out
a bit cleaner if it's a byte position, but I could be convinced that it's
actually better for it to be a block index.

> > How about changing the prototype to:
> > 
> > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> >
> > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > can it happen in cases like filesystem corruption?  If it can only happen due to
> > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> 
> I think XFS only passes to _invalidate_* the same pos that was passed to
> ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> problem.
> 
> Perhaps this function ought to note that @pos is supposed to be the same
> value that was given to ->read_merkle_tree_block?
> 
> Or: make the implementations return 1 for "reloaded from disk", 0 for
> "still in cache", or a negative error code.  Then fsverity can call
> the invalidation routine itself and XFS doesn't have to worry about this
> part.
> 
> (I think?  I have questions about the xfs_invalidate_blocks function.)

It looks like XFS can invalidate blocks other than the one being read by
->read_merkle_tree_block.

If it really was only a matter of the single block being read, then it would
indeed be simpler to just make it a piece of information returned from
->read_merkle_tree_block.

If the generic invalidation function is needed, it needs to be clearly
documented when filesystems are expected to invalidate blocks.

> > > +
> > > +	/**
> > > +	 * Release the reference to a Merkle tree block
> > > +	 *
> > > +	 * @block: the block to release
> > > +	 *
> > > +	 * This is called when fs-verity is done with a block obtained with
> > > +	 * ->read_merkle_tree_block().
> > > +	 */
> > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > 
> > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> 
> Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> me wonder if the write_merkle_tree_block path made use of that.  It
> doesn't, AFAICT.
> 
> And I think the reason is that when we're setting up the merkle tree,
> we want to stream the contents straight to disk instead of ending up
> with a huge cache that might not all be necessary?

In the current patchset, fsverity_blockbuf isn't used for writes (despite the
comment saying it is).  I think that's fine and that it keeps things simpler.
Filesystems can still cache the blocks that are passed to
->write_merkle_tree_block if they want to.  That already happens on ext4 and
f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.

- Eric
Darrick J. Wong March 8, 2024, 3:50 a.m. UTC | #4
On Thu, Mar 07, 2024 at 02:49:03PM -0800, Eric Biggers wrote:
> On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > > index b3506f56e180..dad33e6ff0d6 100644
> > > > --- a/fs/verity/fsverity_private.h
> > > > +++ b/fs/verity/fsverity_private.h
> > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > > >  
> > > >  void __init fsverity_init_workqueue(void);
> > > >  
> > > > +/*
> > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > > + * block->context.
> > > > + */
> > > > +void fsverity_drop_block(struct inode *inode,
> > > > +			 struct fsverity_blockbuf *block);
> > > > +
> > > >  #endif /* _FSVERITY_PRIVATE_H */
> > > 
> > > This should be paired with a helper function that reads a Merkle tree block by
> > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > > being consistent with having a helper function for drop, this would prevent code
> > > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > > 
> > > I recommend that it look like this:
> > > 
> > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > > 				    unsigned long ra_bytes,
> > > 				    struct fsverity_blockbuf *block);
> > > 
> > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > > would be the number of bytes for the filesystem to (optionally) readahead if the
> > > block is not yet cached.  I think that things work out simpler if these values
> > > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > > output, and it can be confusing to interleave inputs and outputs in parameters.
> > 
> > FWIW I don't really like 'pos' here because that's usually short for
> > "file position", which is a byte, and this looks a lot more like a
> > merkle tree block number.
> > 
> > u64 blkno?
> > 
> > Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> > when we're dealing with a tree block number.  Ignore checkpatch
> > complaining about typeedefs. :)
> 
> My suggestion is for 'pos' to be a byte position, in alignment with the
> pagecache naming convention as well as ->write_merkle_tree_block which currently
> uses a 'u64 pos' byte position too.
> 
> It would also work for it to be a block index, in which case it should be named
> 'index' or 'blkno' and have type unsigned long.  I *think* that things work out
> a bit cleaner if it's a byte position, but I could be convinced that it's
> actually better for it to be a block index.

It's probably cleaner (or at least willy says so) to measure everything
in bytes.  The only place that gets messy is if we want to cache merkle
tree blocks in an xarray or something, in which case we'd want to >> by
the block_shift to avoid leaving gaps.

> > > How about changing the prototype to:
> > > 
> > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> > >
> > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > > can it happen in cases like filesystem corruption?  If it can only happen due to
> > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> > 
> > I think XFS only passes to _invalidate_* the same pos that was passed to
> > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> > problem.
> > 
> > Perhaps this function ought to note that @pos is supposed to be the same
> > value that was given to ->read_merkle_tree_block?
> > 
> > Or: make the implementations return 1 for "reloaded from disk", 0 for
> > "still in cache", or a negative error code.  Then fsverity can call
> > the invalidation routine itself and XFS doesn't have to worry about this
> > part.
> > 
> > (I think?  I have questions about the xfs_invalidate_blocks function.)
> 
> It looks like XFS can invalidate blocks other than the one being read by
> ->read_merkle_tree_block.
> 
> If it really was only a matter of the single block being read, then it would
> indeed be simpler to just make it a piece of information returned from
> ->read_merkle_tree_block.
> 
> If the generic invalidation function is needed, it needs to be clearly
> documented when filesystems are expected to invalidate blocks.

I /think/ the generic invalidation is only necessary with the weird
DOUBLE_ALLOC thing.  If the other two implementations (ext4/f2fs)
haven't needed a "nuke from orbit" function then xfs shouldn't require
one too.

> > > > +
> > > > +	/**
> > > > +	 * Release the reference to a Merkle tree block
> > > > +	 *
> > > > +	 * @block: the block to release
> > > > +	 *
> > > > +	 * This is called when fs-verity is done with a block obtained with
> > > > +	 * ->read_merkle_tree_block().
> > > > +	 */
> > > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > > 
> > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> > 
> > Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> > me wonder if the write_merkle_tree_block path made use of that.  It
> > doesn't, AFAICT.
> > 
> > And I think the reason is that when we're setting up the merkle tree,
> > we want to stream the contents straight to disk instead of ending up
> > with a huge cache that might not all be necessary?
> 
> In the current patchset, fsverity_blockbuf isn't used for writes (despite the
> comment saying it is).  I think that's fine and that it keeps things simpler.
> Filesystems can still cache the blocks that are passed to
> ->write_merkle_tree_block if they want to.  That already happens on ext4 and
> f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.

Oh!  Maybe it should do that then.  At least the root block?

--D

> - Eric
>
Darrick J. Wong March 9, 2024, 4:24 p.m. UTC | #5
On Thu, Mar 07, 2024 at 07:50:36PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 02:49:03PM -0800, Eric Biggers wrote:
> > On Thu, Mar 07, 2024 at 01:54:01PM -0800, Darrick J. Wong wrote:
> > > On Tue, Mar 05, 2024 at 07:56:22PM -0800, Eric Biggers wrote:
> > > > On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> > > > > diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> > > > > index b3506f56e180..dad33e6ff0d6 100644
> > > > > --- a/fs/verity/fsverity_private.h
> > > > > +++ b/fs/verity/fsverity_private.h
> > > > > @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
> > > > >  
> > > > >  void __init fsverity_init_workqueue(void);
> > > > >  
> > > > > +/*
> > > > > + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> > > > > + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> > > > > + * block->context.
> > > > > + */
> > > > > +void fsverity_drop_block(struct inode *inode,
> > > > > +			 struct fsverity_blockbuf *block);
> > > > > +
> > > > >  #endif /* _FSVERITY_PRIVATE_H */
> > > > 
> > > > This should be paired with a helper function that reads a Merkle tree block by
> > > > calling ->read_merkle_tree_block or ->read_merkle_tree_page as needed.  Besides
> > > > being consistent with having a helper function for drop, this would prevent code
> > > > duplication between verify_data_block() and fsverity_read_merkle_tree().
> > > > 
> > > > I recommend that it look like this:
> > > > 
> > > > int fsverity_read_merkle_tree_block(struct inode *inode, u64 pos,
> > > > 				    unsigned long ra_bytes,
> > > > 				    struct fsverity_blockbuf *block);
> > > > 
> > > > 'pos' would be the byte position of the block in the Merkle tree, and 'ra_bytes'
> > > > would be the number of bytes for the filesystem to (optionally) readahead if the
> > > > block is not yet cached.  I think that things work out simpler if these values
> > > > are measured in bytes, not blocks.  'block' would be at the end because it's an
> > > > output, and it can be confusing to interleave inputs and outputs in parameters.
> > > 
> > > FWIW I don't really like 'pos' here because that's usually short for
> > > "file position", which is a byte, and this looks a lot more like a
> > > merkle tree block number.
> > > 
> > > u64 blkno?
> > > 
> > > Or better yet use a typedef ("merkle_blkno_t") to make it really clear
> > > when we're dealing with a tree block number.  Ignore checkpatch
> > > complaining about typeedefs. :)
> > 
> > My suggestion is for 'pos' to be a byte position, in alignment with the
> > pagecache naming convention as well as ->write_merkle_tree_block which currently
> > uses a 'u64 pos' byte position too.
> > 
> > It would also work for it to be a block index, in which case it should be named
> > 'index' or 'blkno' and have type unsigned long.  I *think* that things work out
> > a bit cleaner if it's a byte position, but I could be convinced that it's
> > actually better for it to be a block index.
> 
> It's probably cleaner (or at least willy says so) to measure everything
> in bytes.  The only place that gets messy is if we want to cache merkle
> tree blocks in an xarray or something, in which case we'd want to >> by
> the block_shift to avoid leaving gaps.

...and now having just done that, yes, I would like to retain passing
log_blocksize to the ->read_merkle_tree_block function.  I cleaned it up
a bit for my own purposes:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=fsverity-cleanups-6.9&id=4d36c90be4ac143f3e31989317bde7cd6c109c6e

Though I saw you had other feedback to Andrey about that. :)

--D

> > > > How about changing the prototype to:
> > > > 
> > > > void fsverity_invalidate_merkle_tree_block(struct inode *inode, u64 pos);
> > > >
> > > > Also, is it a kernel bug for the pos to be beyond the end of the Merkle tree, or
> > > > can it happen in cases like filesystem corruption?  If it can only happen due to
> > > > kernel bugs, WARN_ON_ONCE() might be more appropriate than an error message.
> > > 
> > > I think XFS only passes to _invalidate_* the same pos that was passed to
> > > ->read_merkle_tree_block, so this is a kernel bug, not a fs corruption
> > > problem.
> > > 
> > > Perhaps this function ought to note that @pos is supposed to be the same
> > > value that was given to ->read_merkle_tree_block?
> > > 
> > > Or: make the implementations return 1 for "reloaded from disk", 0 for
> > > "still in cache", or a negative error code.  Then fsverity can call
> > > the invalidation routine itself and XFS doesn't have to worry about this
> > > part.
> > > 
> > > (I think?  I have questions about the xfs_invalidate_blocks function.)
> > 
> > It looks like XFS can invalidate blocks other than the one being read by
> > ->read_merkle_tree_block.
> > 
> > If it really was only a matter of the single block being read, then it would
> > indeed be simpler to just make it a piece of information returned from
> > ->read_merkle_tree_block.
> > 
> > If the generic invalidation function is needed, it needs to be clearly
> > documented when filesystems are expected to invalidate blocks.
> 
> I /think/ the generic invalidation is only necessary with the weird
> DOUBLE_ALLOC thing.  If the other two implementations (ext4/f2fs)
> haven't needed a "nuke from orbit" function then xfs shouldn't require
> one too.
> 
> > > > > +
> > > > > +	/**
> > > > > +	 * Release the reference to a Merkle tree block
> > > > > +	 *
> > > > > +	 * @block: the block to release
> > > > > +	 *
> > > > > +	 * This is called when fs-verity is done with a block obtained with
> > > > > +	 * ->read_merkle_tree_block().
> > > > > +	 */
> > > > > +	void (*drop_block)(struct fsverity_blockbuf *block);
> > > > 
> > > > drop_merkle_tree_block, so that it's clearly paired with read_merkle_tree_block
> > > 
> > > Yep.  I noticed that xfs_verity.c doesn't put them together, which made
> > > me wonder if the write_merkle_tree_block path made use of that.  It
> > > doesn't, AFAICT.
> > > 
> > > And I think the reason is that when we're setting up the merkle tree,
> > > we want to stream the contents straight to disk instead of ending up
> > > with a huge cache that might not all be necessary?
> > 
> > In the current patchset, fsverity_blockbuf isn't used for writes (despite the
> > comment saying it is).  I think that's fine and that it keeps things simpler.
> > Filesystems can still cache the blocks that are passed to
> > ->write_merkle_tree_block if they want to.  That already happens on ext4 and
> > f2fs; FS_IOC_ENABLE_VERITY results in the Merkle tree being in the pagecache.
> 
> Oh!  Maybe it should do that then.  At least the root block?
> 
> --D
> 
> > - Eric
> > 
>
Darrick J. Wong March 11, 2024, 11:22 p.m. UTC | #6
On Mon, Mar 04, 2024 at 08:10:30PM +0100, Andrey Albershteyn wrote:
> In the current implementation fs-verity expects filesystem to
> provide PAGEs filled with Merkle tree blocks. Then, when fs-verity
> is done with processing the blocks, reference to PAGE is freed. This
> doesn't fit well with the way XFS manages its memory.
> 
> To allow XFS integrate fs-verity this patch adds ability to
> fs-verity verification code to take Merkle tree blocks instead of
> PAGE reference. This way ext4, f2fs, and btrfs are still able to
> pass PAGE references and XFS can pass reference to Merkle tree
> blocks stored in XFS's buffer infrastructure.
> 
> Another addition is invalidation function which tells fs-verity to
> mark part of Merkle tree as not verified. This function is used
> by filesystem to tell fs-verity to invalidate block which was
> evicted from memory.
> 
> Depending on Merkle tree block size fs-verity is using either bitmap
> or PG_checked flag to track "verified" status of the blocks. With a
> Merkle tree block caching (XFS) there is no PAGE to flag it as
> verified. fs-verity always uses bitmap to track verified blocks for
> filesystems which use block caching.
> 
> Further this patch allows filesystem to make additional processing
> on verified pages via fsverity_drop_block() instead of just dropping
> a reference. This will be used by XFS for internal buffer cache
> manipulation in further patches. The btrfs, ext4, and f2fs just drop
> the reference.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/verity/fsverity_private.h |   8 +++
>  fs/verity/open.c             |   8 ++-
>  fs/verity/read_metadata.c    |  64 +++++++++++-------
>  fs/verity/verify.c           | 125 +++++++++++++++++++++++++++--------
>  include/linux/fsverity.h     |  65 ++++++++++++++++++
>  5 files changed, 217 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
> index b3506f56e180..dad33e6ff0d6 100644
> --- a/fs/verity/fsverity_private.h
> +++ b/fs/verity/fsverity_private.h
> @@ -154,4 +154,12 @@ static inline void fsverity_init_signature(void)
>  
>  void __init fsverity_init_workqueue(void);
>  
> +/*
> + * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
> + * filesystem if ->drop_block() is set, otherwise, drop the reference in the
> + * block->context.
> + */
> +void fsverity_drop_block(struct inode *inode,
> +			 struct fsverity_blockbuf *block);
> +
>  #endif /* _FSVERITY_PRIVATE_H */
> diff --git a/fs/verity/open.c b/fs/verity/open.c
> index fdeb95eca3af..6e6922b4b014 100644
> --- a/fs/verity/open.c
> +++ b/fs/verity/open.c
> @@ -213,7 +213,13 @@ struct fsverity_info *fsverity_create_info(const struct inode *inode,
>  	if (err)
>  		goto fail;
>  
> -	if (vi->tree_params.block_size != PAGE_SIZE) {
> +	/*
> +	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
> +	 * fs-verity should use hash_block_verified bitmap as there's no page
> +	 * to mark it with PG_checked.
> +	 */
> +	if (vi->tree_params.block_size != PAGE_SIZE ||
> +			inode->i_sb->s_vop->read_merkle_tree_block) {
>  		/*
>  		 * When the Merkle tree block size and page size differ, we use
>  		 * a bitmap to keep track of which hash blocks have been
> diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
> index f58432772d9e..5da40b5a81af 100644
> --- a/fs/verity/read_metadata.c
> +++ b/fs/verity/read_metadata.c
> @@ -18,50 +18,68 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  {
>  	const struct fsverity_operations *vops = inode->i_sb->s_vop;
>  	u64 end_offset;
> -	unsigned int offs_in_page;
> +	unsigned int offs_in_block;
>  	pgoff_t index, last_index;
>  	int retval = 0;
>  	int err = 0;
> +	const unsigned int block_size = vi->tree_params.block_size;
> +	const u8 log_blocksize = vi->tree_params.log_blocksize;
>  
>  	end_offset = min(offset + length, vi->tree_params.tree_size);
>  	if (offset >= end_offset)
>  		return 0;
> -	offs_in_page = offset_in_page(offset);
> -	last_index = (end_offset - 1) >> PAGE_SHIFT;
> +	offs_in_block = offset & (block_size - 1);
> +	last_index = (end_offset - 1) >> log_blocksize;
>  
>  	/*
> -	 * Iterate through each Merkle tree page in the requested range and copy
> -	 * the requested portion to userspace.  Note that the Merkle tree block
> -	 * size isn't important here, as we are returning a byte stream; i.e.,
> -	 * we can just work with pages even if the tree block size != PAGE_SIZE.
> +	 * Iterate through each Merkle tree block in the requested range and
> +	 * copy the requested portion to userspace. Note that we are returning
> +	 * a byte stream.
>  	 */
> -	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
> +	for (index = offset >> log_blocksize; index <= last_index; index++) {
>  		unsigned long num_ra_pages =
>  			min_t(unsigned long, last_index - index + 1,
>  			      inode->i_sb->s_bdi->io_pages);
>  		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
> -						   PAGE_SIZE - offs_in_page);
> -		struct page *page;
> -		const void *virt;
> +						   block_size - offs_in_block);
> +		struct fsverity_blockbuf block = {
> +			.size = block_size,
> +		};
>  
> -		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
> -		if (IS_ERR(page)) {
> -			err = PTR_ERR(page);
> +		if (!vops->read_merkle_tree_block) {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			unsigned long page_idx =
> +				round_down(index, blocks_per_page);
> +			struct page *page = vops->read_merkle_tree_page(inode,
> +					page_idx, num_ra_pages);
> +
> +			if (IS_ERR(page)) {
> +				err = PTR_ERR(page);
> +			} else {
> +				block.kaddr = kmap_local_page(page) +
> +					((index - page_idx) << log_blocksize);
> +				block.context = page;
> +			}
> +		} else {
> +			err = vops->read_merkle_tree_block(inode,
> +					index << log_blocksize,
> +					&block, log_blocksize, num_ra_pages);
> +		}
> +
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %d reading Merkle tree page %lu",
> -				     err, index);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, index << log_blocksize);
>  			break;
>  		}
>  
> -		virt = kmap_local_page(page);
> -		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
> -			kunmap_local(virt);
> -			put_page(page);
> +		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
> +			fsverity_drop_block(inode, &block);
>  			err = -EFAULT;
>  			break;
>  		}
> -		kunmap_local(virt);
> -		put_page(page);
> +		fsverity_drop_block(inode, &block);
>  
>  		retval += bytes_to_copy;
>  		buf += bytes_to_copy;
> @@ -72,7 +90,7 @@ static int fsverity_read_merkle_tree(struct inode *inode,
>  			break;
>  		}
>  		cond_resched();
> -		offs_in_page = 0;
> +		offs_in_block = 0;
>  	}
>  	return retval ? retval : err;
>  }
> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index 4fcad0825a12..de71911d400c 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -13,14 +13,17 @@
>  static struct workqueue_struct *fsverity_read_workqueue;
>  
>  /*
> - * Returns true if the hash block with index @hblock_idx in the tree, located in
> - * @hpage, has already been verified.
> + * Returns true if the hash block with index @hblock_idx in the tree has
> + * already been verified.
>   */
> -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> +static bool is_hash_block_verified(struct inode *inode,
> +				   struct fsverity_blockbuf *block,
>  				   unsigned long hblock_idx)
>  {
>  	unsigned int blocks_per_page;
>  	unsigned int i;
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	struct page *hpage = (struct page *)block->context;
>  
>  	/*
>  	 * When the Merkle tree block size and page size are the same, then the
> @@ -34,6 +37,12 @@ static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
>  	if (!vi->hash_block_verified)
>  		return PageChecked(hpage);
>  
> +	/*
> +	 * Filesystems which use block based caching (e.g. XFS) always use
> +	 * bitmap.
> +	 */
> +	if (inode->i_sb->s_vop->read_merkle_tree_block)
> +		return test_bit(hblock_idx, vi->hash_block_verified);
>  	/*
>  	 * When the Merkle tree block size and page size differ, we use a bitmap
>  	 * to indicate whether each hash block has been verified.
> @@ -95,15 +104,15 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  	const struct merkle_tree_params *params = &vi->tree_params;
>  	const unsigned int hsize = params->digest_size;
>  	int level;
> +	int err;

FWIW, Dan Carpenter complained that err can be used uninitialized
here...

> +	int num_ra_pages;
>  	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	const u8 *want_hash;
>  	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
>  	/* The hash blocks that are traversed, indexed by level */
>  	struct {
> -		/* Page containing the hash block */
> -		struct page *page;
> -		/* Mapped address of the hash block (will be within @page) */
> -		const void *addr;
> +		/* Buffer containing the hash block */
> +		struct fsverity_blockbuf block;
>  		/* Index of the hash block in the tree overall */
>  		unsigned long index;
>  		/* Byte offset of the wanted hash relative to @addr */
> @@ -144,10 +153,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		unsigned long next_hidx;
>  		unsigned long hblock_idx;
>  		pgoff_t hpage_idx;
> +		u64 hblock_pos;
>  		unsigned int hblock_offset_in_page;
>  		unsigned int hoffset;
>  		struct page *hpage;
> -		const void *haddr;
> +		struct fsverity_blockbuf *block = &hblocks[level].block;
>  
>  		/*
>  		 * The index of the block in the current level; also the index
> @@ -165,29 +175,49 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		hblock_offset_in_page =
>  			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
>  
> +		/* Offset of the Merkle tree block into the tree */
> +		hblock_pos = hblock_idx << params->log_blocksize;
> +
>  		/* Byte offset of the hash within the block */
>  		hoffset = (hidx << params->log_digestsize) &
>  			  (params->block_size - 1);
>  
> -		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
> -				hpage_idx, level == 0 ? min(max_ra_pages,
> -					params->tree_pages - hpage_idx) : 0);
> -		if (IS_ERR(hpage)) {
> +		num_ra_pages = level == 0 ?
> +			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
> +
> +		if (inode->i_sb->s_vop->read_merkle_tree_block) {
> +			err = inode->i_sb->s_vop->read_merkle_tree_block(
> +				inode, hblock_pos, block, params->log_blocksize,
> +				num_ra_pages);
> +		} else {
> +			unsigned int blocks_per_page =
> +				vi->tree_params.blocks_per_page;
> +			hblock_idx = round_down(hblock_idx, blocks_per_page);
> +			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
> +				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
> +
> +			if (IS_ERR(hpage)) {
> +				err = PTR_ERR(hpage);
> +			} else {
> +				block->kaddr = kmap_local_page(hpage) +
> +					hblock_offset_in_page;
> +				block->context = hpage;

...in the case where hpage is not an error.  I'll fix that in my tree.

https://lore.kernel.org/oe-kbuild-all/f9e79efc-a16f-495d-8e53-d5a9eef5936e@moroto.mountain/T/#u

--D

> +			}
> +		}
> +
> +		if (err) {
>  			fsverity_err(inode,
> -				     "Error %ld reading Merkle tree page %lu",
> -				     PTR_ERR(hpage), hpage_idx);
> +				     "Error %d reading Merkle tree block %lu",
> +				     err, hblock_idx);
>  			goto error;
>  		}
> -		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
> -		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
> -			memcpy(_want_hash, haddr + hoffset, hsize);
> +
> +		if (is_hash_block_verified(inode, block, hblock_idx)) {
> +			memcpy(_want_hash, block->kaddr + hoffset, hsize);
>  			want_hash = _want_hash;
> -			kunmap_local(haddr);
> -			put_page(hpage);
> +			fsverity_drop_block(inode, block);
>  			goto descend;
>  		}
> -		hblocks[level].page = hpage;
> -		hblocks[level].addr = haddr;
>  		hblocks[level].index = hblock_idx;
>  		hblocks[level].hoffset = hoffset;
>  		hidx = next_hidx;
> @@ -197,10 +227,11 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  descend:
>  	/* Descend the tree verifying hash blocks. */
>  	for (; level > 0; level--) {
> -		struct page *hpage = hblocks[level - 1].page;
> -		const void *haddr = hblocks[level - 1].addr;
> +		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
> +		const void *haddr = block->kaddr;
>  		unsigned long hblock_idx = hblocks[level - 1].index;
>  		unsigned int hoffset = hblocks[level - 1].hoffset;
> +		struct page *hpage = (struct page *)block->context;
>  
>  		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
>  			goto error;
> @@ -217,8 +248,7 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  			SetPageChecked(hpage);
>  		memcpy(_want_hash, haddr + hoffset, hsize);
>  		want_hash = _want_hash;
> -		kunmap_local(haddr);
> -		put_page(hpage);
> +		fsverity_drop_block(inode, block);
>  	}
>  
>  	/* Finally, verify the data block. */
> @@ -235,10 +265,8 @@ verify_data_block(struct inode *inode, struct fsverity_info *vi,
>  		     params->hash_alg->name, hsize, want_hash,
>  		     params->hash_alg->name, hsize, real_hash);
>  error:
> -	for (; level > 0; level--) {
> -		kunmap_local(hblocks[level - 1].addr);
> -		put_page(hblocks[level - 1].page);
> -	}
> +	for (; level > 0; level--)
> +		fsverity_drop_block(inode, &hblocks[level - 1].block);
>  	return false;
>  }
>  
> @@ -362,3 +390,42 @@ void __init fsverity_init_workqueue(void)
>  	if (!fsverity_read_workqueue)
>  		panic("failed to allocate fsverity_read_queue");
>  }
> +
> +/**
> + * fsverity_invalidate_block() - invalidate Merkle tree block
> + * @inode: inode to which this Merkle tree blocks belong
> + * @block: block to be invalidated
> + *
> + * This function invalidates/clears "verified" state of Merkle tree block
> + * in the fs-verity bitmap. The block needs to have ->offset set.
> + */
> +void fsverity_invalidate_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)
> +{
> +	struct fsverity_info *vi = inode->i_verity_info;
> +	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
> +
> +	if (block->offset > vi->tree_params.tree_size) {
> +		fsverity_err(inode,
> +"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
> +			     vi->tree_params.tree_size, block->offset);
> +		return;
> +	}
> +
> +	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
> +}
> +EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
> +
> +void fsverity_drop_block(struct inode *inode,
> +		struct fsverity_blockbuf *block)
> +{
> +	if (inode->i_sb->s_vop->drop_block)
> +		inode->i_sb->s_vop->drop_block(block);
> +	else {
> +		struct page *page = (struct page *)block->context;
> +
> +		kunmap_local(block->kaddr);
> +		put_page(page);
> +	}
> +	block->kaddr = NULL;
> +}
> diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
> index ac58b19f23d3..0973b521ac5a 100644
> --- a/include/linux/fsverity.h
> +++ b/include/linux/fsverity.h
> @@ -26,6 +26,33 @@
>  /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
>  #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
>  
> +/**
> + * struct fsverity_blockbuf - Merkle Tree block buffer
> + * @kaddr: virtual address of the block's data
> + * @offset: block's offset into Merkle tree
> + * @size: the Merkle tree block size
> + * @context: filesystem private context
> + *
> + * Buffer containing single Merkle Tree block. These buffers are passed
> + *  - to filesystem, when fs-verity is building merkel tree,
> + *  - from filesystem, when fs-verity is reading merkle tree from a disk.
> + * Filesystems sets kaddr together with size to point to a memory which contains
> + * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
> + * written down to disk.
> + *
> + * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
> + * ->drop_block to let filesystem know that memory can be freed.
> + *
> + * The context is optional. This field can be used by filesystem to passthrough
> + * state from ->read_merkle_tree_block to ->drop_block.
> + */
> +struct fsverity_blockbuf {
> +	void *kaddr;
> +	u64 offset;
> +	unsigned int size;
> +	void *context;
> +};
> +
>  /* Verity operations for filesystems */
>  struct fsverity_operations {
>  
> @@ -107,6 +134,32 @@ struct fsverity_operations {
>  					      pgoff_t index,
>  					      unsigned long num_ra_pages);
>  
> +	/**
> +	 * Read a Merkle tree block of the given inode.
> +	 * @inode: the inode
> +	 * @pos: byte offset of the block within the Merkle tree
> +	 * @block: block buffer for filesystem to point it to the block
> +	 * @log_blocksize: log2 of the size of the expected block
> +	 * @ra_bytes: The number of bytes that should be
> +	 *		prefetched starting at @pos if the page at @pos
> +	 *		isn't already cached.  Implementations may ignore this
> +	 *		argument; it's only a performance optimization.
> +	 *
> +	 * This can be called at any time on an open verity file.  It may be
> +	 * called by multiple processes concurrently.
> +	 *
> +	 * In case that block was evicted from the memory filesystem has to use
> +	 * fsverity_invalidate_block() to let fsverity know that block's
> +	 * verification state is not valid anymore.
> +	 *
> +	 * Return: 0 on success, -errno on failure
> +	 */
> +	int (*read_merkle_tree_block)(struct inode *inode,
> +				      u64 pos,
> +				      struct fsverity_blockbuf *block,
> +				      unsigned int log_blocksize,
> +				      u64 ra_bytes);
> +
>  	/**
>  	 * Write a Merkle tree block to the given inode.
>  	 *
> @@ -122,6 +175,16 @@ struct fsverity_operations {
>  	 */
>  	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
>  				       u64 pos, unsigned int size);
> +
> +	/**
> +	 * Release the reference to a Merkle tree block
> +	 *
> +	 * @block: the block to release
> +	 *
> +	 * This is called when fs-verity is done with a block obtained with
> +	 * ->read_merkle_tree_block().
> +	 */
> +	void (*drop_block)(struct fsverity_blockbuf *block);
>  };
>  
>  #ifdef CONFIG_FS_VERITY
> @@ -175,6 +238,8 @@ int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg);
>  bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset);
>  void fsverity_verify_bio(struct bio *bio);
>  void fsverity_enqueue_verify_work(struct work_struct *work);
> +void fsverity_invalidate_block(struct inode *inode,
> +		struct fsverity_blockbuf *block);
>  
>  #else /* !CONFIG_FS_VERITY */
>  
> -- 
> 2.42.0
> 
>
diff mbox series

Patch

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index b3506f56e180..dad33e6ff0d6 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -154,4 +154,12 @@  static inline void fsverity_init_signature(void)
 
 void __init fsverity_init_workqueue(void);
 
+/*
+ * Drop 'block' obtained with ->read_merkle_tree_block(). Calls out back to
+ * filesystem if ->drop_block() is set, otherwise, drop the reference in the
+ * block->context.
+ */
+void fsverity_drop_block(struct inode *inode,
+			 struct fsverity_blockbuf *block);
+
 #endif /* _FSVERITY_PRIVATE_H */
diff --git a/fs/verity/open.c b/fs/verity/open.c
index fdeb95eca3af..6e6922b4b014 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -213,7 +213,13 @@  struct fsverity_info *fsverity_create_info(const struct inode *inode,
 	if (err)
 		goto fail;
 
-	if (vi->tree_params.block_size != PAGE_SIZE) {
+	/*
+	 * If fs passes Merkle tree blocks to fs-verity (e.g. XFS), then
+	 * fs-verity should use hash_block_verified bitmap as there's no page
+	 * to mark it with PG_checked.
+	 */
+	if (vi->tree_params.block_size != PAGE_SIZE ||
+			inode->i_sb->s_vop->read_merkle_tree_block) {
 		/*
 		 * When the Merkle tree block size and page size differ, we use
 		 * a bitmap to keep track of which hash blocks have been
diff --git a/fs/verity/read_metadata.c b/fs/verity/read_metadata.c
index f58432772d9e..5da40b5a81af 100644
--- a/fs/verity/read_metadata.c
+++ b/fs/verity/read_metadata.c
@@ -18,50 +18,68 @@  static int fsverity_read_merkle_tree(struct inode *inode,
 {
 	const struct fsverity_operations *vops = inode->i_sb->s_vop;
 	u64 end_offset;
-	unsigned int offs_in_page;
+	unsigned int offs_in_block;
 	pgoff_t index, last_index;
 	int retval = 0;
 	int err = 0;
+	const unsigned int block_size = vi->tree_params.block_size;
+	const u8 log_blocksize = vi->tree_params.log_blocksize;
 
 	end_offset = min(offset + length, vi->tree_params.tree_size);
 	if (offset >= end_offset)
 		return 0;
-	offs_in_page = offset_in_page(offset);
-	last_index = (end_offset - 1) >> PAGE_SHIFT;
+	offs_in_block = offset & (block_size - 1);
+	last_index = (end_offset - 1) >> log_blocksize;
 
 	/*
-	 * Iterate through each Merkle tree page in the requested range and copy
-	 * the requested portion to userspace.  Note that the Merkle tree block
-	 * size isn't important here, as we are returning a byte stream; i.e.,
-	 * we can just work with pages even if the tree block size != PAGE_SIZE.
+	 * Iterate through each Merkle tree block in the requested range and
+	 * copy the requested portion to userspace. Note that we are returning
+	 * a byte stream.
 	 */
-	for (index = offset >> PAGE_SHIFT; index <= last_index; index++) {
+	for (index = offset >> log_blocksize; index <= last_index; index++) {
 		unsigned long num_ra_pages =
 			min_t(unsigned long, last_index - index + 1,
 			      inode->i_sb->s_bdi->io_pages);
 		unsigned int bytes_to_copy = min_t(u64, end_offset - offset,
-						   PAGE_SIZE - offs_in_page);
-		struct page *page;
-		const void *virt;
+						   block_size - offs_in_block);
+		struct fsverity_blockbuf block = {
+			.size = block_size,
+		};
 
-		page = vops->read_merkle_tree_page(inode, index, num_ra_pages);
-		if (IS_ERR(page)) {
-			err = PTR_ERR(page);
+		if (!vops->read_merkle_tree_block) {
+			unsigned int blocks_per_page =
+				vi->tree_params.blocks_per_page;
+			unsigned long page_idx =
+				round_down(index, blocks_per_page);
+			struct page *page = vops->read_merkle_tree_page(inode,
+					page_idx, num_ra_pages);
+
+			if (IS_ERR(page)) {
+				err = PTR_ERR(page);
+			} else {
+				block.kaddr = kmap_local_page(page) +
+					((index - page_idx) << log_blocksize);
+				block.context = page;
+			}
+		} else {
+			err = vops->read_merkle_tree_block(inode,
+					index << log_blocksize,
+					&block, log_blocksize, num_ra_pages);
+		}
+
+		if (err) {
 			fsverity_err(inode,
-				     "Error %d reading Merkle tree page %lu",
-				     err, index);
+				     "Error %d reading Merkle tree block %lu",
+				     err, index << log_blocksize);
 			break;
 		}
 
-		virt = kmap_local_page(page);
-		if (copy_to_user(buf, virt + offs_in_page, bytes_to_copy)) {
-			kunmap_local(virt);
-			put_page(page);
+		if (copy_to_user(buf, block.kaddr + offs_in_block, bytes_to_copy)) {
+			fsverity_drop_block(inode, &block);
 			err = -EFAULT;
 			break;
 		}
-		kunmap_local(virt);
-		put_page(page);
+		fsverity_drop_block(inode, &block);
 
 		retval += bytes_to_copy;
 		buf += bytes_to_copy;
@@ -72,7 +90,7 @@  static int fsverity_read_merkle_tree(struct inode *inode,
 			break;
 		}
 		cond_resched();
-		offs_in_page = 0;
+		offs_in_block = 0;
 	}
 	return retval ? retval : err;
 }
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 4fcad0825a12..de71911d400c 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -13,14 +13,17 @@ 
 static struct workqueue_struct *fsverity_read_workqueue;
 
 /*
- * Returns true if the hash block with index @hblock_idx in the tree, located in
- * @hpage, has already been verified.
+ * Returns true if the hash block with index @hblock_idx in the tree has
+ * already been verified.
  */
-static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
+static bool is_hash_block_verified(struct inode *inode,
+				   struct fsverity_blockbuf *block,
 				   unsigned long hblock_idx)
 {
 	unsigned int blocks_per_page;
 	unsigned int i;
+	struct fsverity_info *vi = inode->i_verity_info;
+	struct page *hpage = (struct page *)block->context;
 
 	/*
 	 * When the Merkle tree block size and page size are the same, then the
@@ -34,6 +37,12 @@  static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
 	if (!vi->hash_block_verified)
 		return PageChecked(hpage);
 
+	/*
+	 * Filesystems which use block based caching (e.g. XFS) always use
+	 * bitmap.
+	 */
+	if (inode->i_sb->s_vop->read_merkle_tree_block)
+		return test_bit(hblock_idx, vi->hash_block_verified);
 	/*
 	 * When the Merkle tree block size and page size differ, we use a bitmap
 	 * to indicate whether each hash block has been verified.
@@ -95,15 +104,15 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 	const struct merkle_tree_params *params = &vi->tree_params;
 	const unsigned int hsize = params->digest_size;
 	int level;
+	int err;
+	int num_ra_pages;
 	u8 _want_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	const u8 *want_hash;
 	u8 real_hash[FS_VERITY_MAX_DIGEST_SIZE];
 	/* The hash blocks that are traversed, indexed by level */
 	struct {
-		/* Page containing the hash block */
-		struct page *page;
-		/* Mapped address of the hash block (will be within @page) */
-		const void *addr;
+		/* Buffer containing the hash block */
+		struct fsverity_blockbuf block;
 		/* Index of the hash block in the tree overall */
 		unsigned long index;
 		/* Byte offset of the wanted hash relative to @addr */
@@ -144,10 +153,11 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		unsigned long next_hidx;
 		unsigned long hblock_idx;
 		pgoff_t hpage_idx;
+		u64 hblock_pos;
 		unsigned int hblock_offset_in_page;
 		unsigned int hoffset;
 		struct page *hpage;
-		const void *haddr;
+		struct fsverity_blockbuf *block = &hblocks[level].block;
 
 		/*
 		 * The index of the block in the current level; also the index
@@ -165,29 +175,49 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		hblock_offset_in_page =
 			(hblock_idx << params->log_blocksize) & ~PAGE_MASK;
 
+		/* Offset of the Merkle tree block into the tree */
+		hblock_pos = hblock_idx << params->log_blocksize;
+
 		/* Byte offset of the hash within the block */
 		hoffset = (hidx << params->log_digestsize) &
 			  (params->block_size - 1);
 
-		hpage = inode->i_sb->s_vop->read_merkle_tree_page(inode,
-				hpage_idx, level == 0 ? min(max_ra_pages,
-					params->tree_pages - hpage_idx) : 0);
-		if (IS_ERR(hpage)) {
+		num_ra_pages = level == 0 ?
+			min(max_ra_pages, params->tree_pages - hpage_idx) : 0;
+
+		if (inode->i_sb->s_vop->read_merkle_tree_block) {
+			err = inode->i_sb->s_vop->read_merkle_tree_block(
+				inode, hblock_pos, block, params->log_blocksize,
+				num_ra_pages);
+		} else {
+			unsigned int blocks_per_page =
+				vi->tree_params.blocks_per_page;
+			hblock_idx = round_down(hblock_idx, blocks_per_page);
+			hpage = inode->i_sb->s_vop->read_merkle_tree_page(
+				inode, hpage_idx, (num_ra_pages << PAGE_SHIFT));
+
+			if (IS_ERR(hpage)) {
+				err = PTR_ERR(hpage);
+			} else {
+				block->kaddr = kmap_local_page(hpage) +
+					hblock_offset_in_page;
+				block->context = hpage;
+			}
+		}
+
+		if (err) {
 			fsverity_err(inode,
-				     "Error %ld reading Merkle tree page %lu",
-				     PTR_ERR(hpage), hpage_idx);
+				     "Error %d reading Merkle tree block %lu",
+				     err, hblock_idx);
 			goto error;
 		}
-		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
-		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
-			memcpy(_want_hash, haddr + hoffset, hsize);
+
+		if (is_hash_block_verified(inode, block, hblock_idx)) {
+			memcpy(_want_hash, block->kaddr + hoffset, hsize);
 			want_hash = _want_hash;
-			kunmap_local(haddr);
-			put_page(hpage);
+			fsverity_drop_block(inode, block);
 			goto descend;
 		}
-		hblocks[level].page = hpage;
-		hblocks[level].addr = haddr;
 		hblocks[level].index = hblock_idx;
 		hblocks[level].hoffset = hoffset;
 		hidx = next_hidx;
@@ -197,10 +227,11 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 descend:
 	/* Descend the tree verifying hash blocks. */
 	for (; level > 0; level--) {
-		struct page *hpage = hblocks[level - 1].page;
-		const void *haddr = hblocks[level - 1].addr;
+		struct fsverity_blockbuf *block = &hblocks[level - 1].block;
+		const void *haddr = block->kaddr;
 		unsigned long hblock_idx = hblocks[level - 1].index;
 		unsigned int hoffset = hblocks[level - 1].hoffset;
+		struct page *hpage = (struct page *)block->context;
 
 		if (fsverity_hash_block(params, inode, haddr, real_hash) != 0)
 			goto error;
@@ -217,8 +248,7 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 			SetPageChecked(hpage);
 		memcpy(_want_hash, haddr + hoffset, hsize);
 		want_hash = _want_hash;
-		kunmap_local(haddr);
-		put_page(hpage);
+		fsverity_drop_block(inode, block);
 	}
 
 	/* Finally, verify the data block. */
@@ -235,10 +265,8 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 		     params->hash_alg->name, hsize, want_hash,
 		     params->hash_alg->name, hsize, real_hash);
 error:
-	for (; level > 0; level--) {
-		kunmap_local(hblocks[level - 1].addr);
-		put_page(hblocks[level - 1].page);
-	}
+	for (; level > 0; level--)
+		fsverity_drop_block(inode, &hblocks[level - 1].block);
 	return false;
 }
 
@@ -362,3 +390,42 @@  void __init fsverity_init_workqueue(void)
 	if (!fsverity_read_workqueue)
 		panic("failed to allocate fsverity_read_queue");
 }
+
+/**
+ * fsverity_invalidate_block() - invalidate Merkle tree block
+ * @inode: inode to which this Merkle tree blocks belong
+ * @block: block to be invalidated
+ *
+ * This function invalidates/clears "verified" state of Merkle tree block
+ * in the fs-verity bitmap. The block needs to have ->offset set.
+ */
+void fsverity_invalidate_block(struct inode *inode,
+		struct fsverity_blockbuf *block)
+{
+	struct fsverity_info *vi = inode->i_verity_info;
+	const unsigned int log_blocksize = vi->tree_params.log_blocksize;
+
+	if (block->offset > vi->tree_params.tree_size) {
+		fsverity_err(inode,
+"Trying to invalidate beyond Merkle tree (tree %lld, offset %lld)",
+			     vi->tree_params.tree_size, block->offset);
+		return;
+	}
+
+	clear_bit(block->offset >> log_blocksize, vi->hash_block_verified);
+}
+EXPORT_SYMBOL_GPL(fsverity_invalidate_block);
+
+void fsverity_drop_block(struct inode *inode,
+		struct fsverity_blockbuf *block)
+{
+	if (inode->i_sb->s_vop->drop_block)
+		inode->i_sb->s_vop->drop_block(block);
+	else {
+		struct page *page = (struct page *)block->context;
+
+		kunmap_local(block->kaddr);
+		put_page(page);
+	}
+	block->kaddr = NULL;
+}
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index ac58b19f23d3..0973b521ac5a 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -26,6 +26,33 @@ 
 /* Arbitrary limit to bound the kmalloc() size.  Can be changed. */
 #define FS_VERITY_MAX_DESCRIPTOR_SIZE	16384
 
+/**
+ * struct fsverity_blockbuf - Merkle Tree block buffer
+ * @kaddr: virtual address of the block's data
+ * @offset: block's offset into Merkle tree
+ * @size: the Merkle tree block size
+ * @context: filesystem private context
+ *
+ * Buffer containing single Merkle Tree block. These buffers are passed
+ *  - to filesystem, when fs-verity is building merkel tree,
+ *  - from filesystem, when fs-verity is reading merkle tree from a disk.
+ * Filesystems sets kaddr together with size to point to a memory which contains
+ * Merkle tree block. Same is done by fs-verity when Merkle tree is need to be
+ * written down to disk.
+ *
+ * While reading the tree, fs-verity calls ->read_merkle_tree_block followed by
+ * ->drop_block to let filesystem know that memory can be freed.
+ *
+ * The context is optional. This field can be used by filesystem to passthrough
+ * state from ->read_merkle_tree_block to ->drop_block.
+ */
+struct fsverity_blockbuf {
+	void *kaddr;
+	u64 offset;
+	unsigned int size;
+	void *context;
+};
+
 /* Verity operations for filesystems */
 struct fsverity_operations {
 
@@ -107,6 +134,32 @@  struct fsverity_operations {
 					      pgoff_t index,
 					      unsigned long num_ra_pages);
 
+	/**
+	 * Read a Merkle tree block of the given inode.
+	 * @inode: the inode
+	 * @pos: byte offset of the block within the Merkle tree
+	 * @block: block buffer for filesystem to point it to the block
+	 * @log_blocksize: log2 of the size of the expected block
+	 * @ra_bytes: The number of bytes that should be
+	 *		prefetched starting at @pos if the page at @pos
+	 *		isn't already cached.  Implementations may ignore this
+	 *		argument; it's only a performance optimization.
+	 *
+	 * This can be called at any time on an open verity file.  It may be
+	 * called by multiple processes concurrently.
+	 *
+	 * In case that block was evicted from the memory filesystem has to use
+	 * fsverity_invalidate_block() to let fsverity know that block's
+	 * verification state is not valid anymore.
+	 *
+	 * Return: 0 on success, -errno on failure
+	 */
+	int (*read_merkle_tree_block)(struct inode *inode,
+				      u64 pos,
+				      struct fsverity_blockbuf *block,
+				      unsigned int log_blocksize,
+				      u64 ra_bytes);
+
 	/**
 	 * Write a Merkle tree block to the given inode.
 	 *
@@ -122,6 +175,16 @@  struct fsverity_operations {
 	 */
 	int (*write_merkle_tree_block)(struct inode *inode, const void *buf,
 				       u64 pos, unsigned int size);
+
+	/**
+	 * Release the reference to a Merkle tree block
+	 *
+	 * @block: the block to release
+	 *
+	 * This is called when fs-verity is done with a block obtained with
+	 * ->read_merkle_tree_block().
+	 */
+	void (*drop_block)(struct fsverity_blockbuf *block);
 };
 
 #ifdef CONFIG_FS_VERITY
@@ -175,6 +238,8 @@  int fsverity_ioctl_read_metadata(struct file *filp, const void __user *uarg);
 bool fsverity_verify_blocks(struct folio *folio, size_t len, size_t offset);
 void fsverity_verify_bio(struct bio *bio);
 void fsverity_enqueue_verify_work(struct work_struct *work);
+void fsverity_invalidate_block(struct inode *inode,
+		struct fsverity_blockbuf *block);
 
 #else /* !CONFIG_FS_VERITY */