[v3,07/28] fsverity: always use bitmap to track verified status

Message ID 20231006184922.252188-8-aalbersh@redhat.com (mailing list archive)
State New, archived
Series fs-verity support for XFS

Commit Message

Andrey Albershteyn Oct. 6, 2023, 6:49 p.m. UTC
The bitmap is used to track the verified status of Merkle tree
blocks which are smaller than a PAGE.  Blocks which fit exactly in a
page use PageChecked() to track their "verified" status.

This patch switches to always using the bitmap to track verified
status.  This is needed to move fs-verity away from page management
and to work only with Merkle tree blocks.

Also, this patch removes the spinlock.  The lock was used to reset
the bits in the bitmap belonging to one page.  This patch works with
only one Merkle tree block at a time and won't reset the status of
other blocks.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/verity/fsverity_private.h |  1 -
 fs/verity/open.c             | 49 ++++++++++++-------------
 fs/verity/verify.c           | 71 +++++-------------------------------
 3 files changed, 33 insertions(+), 88 deletions(-)

Comments

Eric Biggers Oct. 11, 2023, 3:15 a.m. UTC | #1
On Fri, Oct 06, 2023 at 08:49:01PM +0200, Andrey Albershteyn wrote:
> The bitmap is used to track verified status of the Merkle tree
> blocks which are smaller than a PAGE. Blocks which fits exactly in a
> page - use PageChecked() for tracking "verified" status.
> 
> This patch switches to always use bitmap to track verified status.
> This is needed to move fs-verity away from page management and work
> only with Merkle tree blocks.

How complicated would it be to keep supporting using the page bit when
merkle_tree_block_size == page_size and the filesystem supports it?  It's an
efficient solution, so it would be a shame to lose it.  Also it doesn't have the
max file size limit that the bitmap has.

> Also, this patch removes spinlock. The lock was used to reset bits
> in bitmap belonging to one page. This patch works only with one
> Merkle tree block and won't reset other blocks status.

The spinlock is needed when there are multiple Merkle tree blocks per page and
the filesystem is using the page-based caching.  So I don't think you can remove
it.  Can you elaborate on why you feel it can be removed?

>  /*
> - * Returns true if the hash block with index @hblock_idx in the tree, located in
> - * @hpage, has already been verified.
> + * Returns true if the hash block with index @hblock_idx in the tree has
> + * already been verified.
>   */
> -static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
> -				   unsigned long hblock_idx)
> +static bool is_hash_block_verified(struct fsverity_info *vi,
> +				   unsigned long hblock_idx,
> +				   bool block_cached)
>  {
> -	bool verified;
> -	unsigned int blocks_per_page;
> -	unsigned int i;
> -
> -	/*
> -	 * When the Merkle tree block size and page size are the same, then the
> -	 * ->hash_block_verified bitmap isn't allocated, and we use PG_checked
> -	 * to directly indicate whether the page's block has been verified.
> -	 *
> -	 * Using PG_checked also guarantees that we re-verify hash pages that
> -	 * get evicted and re-instantiated from the backing storage, as new
> -	 * pages always start out with PG_checked cleared.
> -	 */
> -	if (!vi->hash_block_verified)
> -		return PageChecked(hpage);
> -
> -	/*
> -	 * When the Merkle tree block size and page size differ, we use a bitmap
> -	 * to indicate whether each hash block has been verified.
> -	 *
> -	 * However, we still need to ensure that hash pages that get evicted and
> -	 * re-instantiated from the backing storage are re-verified.  To do
> -	 * this, we use PG_checked again, but now it doesn't really mean
> -	 * "checked".  Instead, now it just serves as an indicator for whether
> -	 * the hash page is newly instantiated or not.
> -	 *
> -	 * The first thread that sees PG_checked=0 must clear the corresponding
> -	 * bitmap bits, then set PG_checked=1.  This requires a spinlock.  To
> -	 * avoid having to take this spinlock in the common case of
> -	 * PG_checked=1, we start with an opportunistic lockless read.
> -	 */

Note that the above comment explains why the spinlock is needed.

> -	if (PageChecked(hpage)) {
> -		/*
> -		 * A read memory barrier is needed here to give ACQUIRE
> -		 * semantics to the above PageChecked() test.
> -		 */
> -		smp_rmb();
> +	if (block_cached)
>  		return test_bit(hblock_idx, vi->hash_block_verified);

It's not clear what block_cached is supposed to mean here.

- Eric
Andrey Albershteyn Oct. 11, 2023, 1:03 p.m. UTC | #2
On 2023-10-10 20:15:43, Eric Biggers wrote:
> On Fri, Oct 06, 2023 at 08:49:01PM +0200, Andrey Albershteyn wrote:
> > The bitmap is used to track verified status of the Merkle tree
> > blocks which are smaller than a PAGE. Blocks which fits exactly in a
> > page - use PageChecked() for tracking "verified" status.
> > 
> > This patch switches to always use bitmap to track verified status.
> > This is needed to move fs-verity away from page management and work
> > only with Merkle tree blocks.
> 
> How complicated would it be to keep supporting using the page bit when
> merkle_tree_block_size == page_size and the filesystem supports it?  It's an
> efficient solution, so it would be a shame to lose it.  Also it doesn't have the
> max file size limit that the bitmap has.

Well, I think it's possible, but my motivation was to step away from
page manipulation as much as possible, with the intent of not
affecting other filesystems too much. I can probably add handling of
this case to fsverity_read_merkle_tree_block(), but fs/verity will
still create the bitmap and have the limit. The other way is
basically to revert the changes done in patch 09; then it would
probably be quite a mix of page/block handling in fs/verity/verify.c

> > Also, this patch removes spinlock. The lock was used to reset bits
> > in bitmap belonging to one page. This patch works only with one
> > Merkle tree block and won't reset other blocks status.
> 
> The spinlock is needed when there are multiple Merkle tree blocks per page and
> the filesystem is using the page-based caching.  So I don't think you can remove
> it.  Can you elaborate on why you feel it can be removed?

With this patch, is_hash_block_verified() doesn't reset the bits for
blocks belonging to the same page. Even if the page is
re-instantiated, only one block is checked in this case. So, when
the other blocks are checked, they are reset.

	if (block_cached)
		return test_bit(hblock_idx, vi->hash_block_verified);

> > -	if (PageChecked(hpage)) {
> > -		/*
> > -		 * A read memory barrier is needed here to give ACQUIRE
> > -		 * semantics to the above PageChecked() test.
> > -		 */
> > -		smp_rmb();
> > +	if (block_cached)
> >  		return test_bit(hblock_idx, vi->hash_block_verified);
> 
> It's not clear what block_cached is supposed to mean here.

It means the block was re-instantiated or was already in the cache.
I can try to come up with a better name.
Eric Biggers Oct. 12, 2023, 7:27 a.m. UTC | #3
On Wed, Oct 11, 2023 at 03:03:55PM +0200, Andrey Albershteyn wrote:
> > How complicated would it be to keep supporting using the page bit when
> > merkle_tree_block_size == page_size and the filesystem supports it?  It's an
> > efficient solution, so it would be a shame to lose it.  Also it doesn't have the
> > max file size limit that the bitmap has.
> 
> Well, I think it's possible but my motivation was to step away from
> page manipulation as much as possible with intent to not affect other
> filesystems too much. I can probably add handling of this case to
> fsverity_read_merkle_tree_block() but fs/verity still will create
> bitmap and have a limit. The other way is basically revert changes
> done in patch 09, then, it probably will be quite a mix of page/block
> handling in fs/verity/verify.c

The page-based caching still has to be supported anyway, since that's what the
other filesystems that support fsverity use, and it seems you don't plan to
change that.  The question is where the "block verified" flags should be stored.
Currently there are two options: PG_checked and the separate bitmap.  I'm not
yet convinced that removing the support for the PG_checked method is a good
change.  PG_checked is a nice solution for the cases where it can be used; it
requires no extra memory, no locking, and has no max file size.  Also, this
change seems mostly orthogonal to what you're actually trying to accomplish.

> > > Also, this patch removes spinlock. The lock was used to reset bits
> > > in bitmap belonging to one page. This patch works only with one
> > > Merkle tree block and won't reset other blocks status.
> > 
> > The spinlock is needed when there are multiple Merkle tree blocks per page and
> > the filesystem is using the page-based caching.  So I don't think you can remove
> > it.  Can you elaborate on why you feel it can be removed?
> 
> With this patch is_hash_block_verified() doesn't reset bits for
> blocks belonging to the same page. Even if page is re-instantiated
> only one block is checked in this case. So, when other blocks are
> checked they are reset.
> 
> 	if (block_cached)
> 		return test_bit(hblock_idx, vi->hash_block_verified);

When part of the Merkle tree cache is evicted and re-instantiated, whether that
part is a "page" or something else, the verified status of all the blocks
contained in that part need to be invalidated so that they get re-verified.  The
existing code does that.  I don't see how your proposed code does that.

- Eric
Darrick J. Wong Oct. 13, 2023, 3:12 a.m. UTC | #4
Hi Eric,

[Please excuse my ignorance, this is only the third time I've dived
into fsverity.]

On Thu, Oct 12, 2023 at 12:27:46AM -0700, Eric Biggers wrote:
> On Wed, Oct 11, 2023 at 03:03:55PM +0200, Andrey Albershteyn wrote:
> > > How complicated would it be to keep supporting using the page bit when
> > > merkle_tree_block_size == page_size and the filesystem supports it?  It's an
> > > efficient solution, so it would be a shame to lose it.  Also it doesn't have the
> > > max file size limit that the bitmap has.

How complex would it be to get rid of the bitmap entirely, and validate
all the verity tree blocks within a page all at once instead of
individual blocks within a page?

Assuming willy isn't grinding his axe to get rid of PGchecked,
obviously. ;)

> > Well, I think it's possible but my motivation was to step away from
> > page manipulation as much as possible with intent to not affect other
> > filesystems too much. I can probably add handling of this case to
> > fsverity_read_merkle_tree_block() but fs/verity still will create
> > bitmap and have a limit. The other way is basically revert changes
> > done in patch 09, then, it probably will be quite a mix of page/block
> > handling in fs/verity/verify.c
> 
> The page-based caching still has to be supported anyway, since that's what the
> other filesystems that support fsverity use, and it seems you don't plan to
> change that.

I frankly have been asking myself why /this/ patchset adds so much extra
code and flags and whatnot to XFS and fs/verity.  From what I can tell,
the xfs buffer cache has been extended to allocate double the memory so
that xattr contents can be shadowed.  getxattr for merkle tree contents
then pins the buffer, shadows the contents, and hands both back to the
caller (aka xfs_read_merkle_tree_block).   The shadow memory is then
handed to fs/verity to do its magic; following that, fsverity releases
the reference and we can eventually drop the xfs_buf reference.

But this seems way overcomplicated to me.  ->read_merkle_tree_page hands
us a pgoff_t and a suggestion for page readahead, and wants us to return
an uptodate locked page, right?

Why can't xfs allocate a page, walk the requested range to fill the page
with merkle tree blocks that were written to the xattr structure (or
zero the page contents if there is no xattr), and hand that page to
fsverity?  (It helps to provide the merkle tree block size to
xfs_read_merkle_tree_page, thanks for adding that).

Assuming fsverity also wants some caching, we could augment the
xfs_inode to point to a separate address_space for cached merkle tree
pages, and then xfs_read_merkle_tree_page can use __filemap_get_folio to
find uptodate cached pages, or instantiate one and make it uptodate.
Even better, we can pretty easily use multipage folios for this, though
AFAICT the fs/verity code isn't yet up to handling that.

The only thing I can't quite figure out is how to get memory reclaim to
scan the extra address_space when it wants to try to reclaim pages.
That part ext4 and f2fs got for free because they stuffed the merkle
tree in the posteof space.

But wouldn't that solve /all/ the plumbing problems without scattering
bits of new code and flags into the xfs buffer cache, the extended
attributes code, and elsewhere?  And then xfs would not need to burn up
vmap space to allocate 8K memory blocks just to provide 4K merkle tree
blocks to fs/verity.

That's just my 2 cents from spending a couple of hours hypothesizing how
I would fill out the fsverity_operations.

> The question is where the "block verified" flags should be stored.
> Currently there are two options: PG_checked and the separate bitmap.  I'm not
> yet convinced that removing the support for the PG_checked method is a good
> change.  PG_checked is a nice solution for the cases where it can be used; it
> requires no extra memory, no locking, and has no max file size.  Also, this
> change seems mostly orthogonal to what you're actually trying to accomplish.

That scheme sounds way better to me than the bitmap. ;)

--D

> > > > Also, this patch removes spinlock. The lock was used to reset bits
> > > > in bitmap belonging to one page. This patch works only with one
> > > > Merkle tree block and won't reset other blocks status.
> > > 
> > > The spinlock is needed when there are multiple Merkle tree blocks per page and
> > > the filesystem is using the page-based caching.  So I don't think you can remove
> > > it.  Can you elaborate on why you feel it can be removed?
> > 
> > With this patch is_hash_block_verified() doesn't reset bits for
> > blocks belonging to the same page. Even if page is re-instantiated
> > only one block is checked in this case. So, when other blocks are
> > checked they are reset.
> > 
> > 	if (block_cached)
> > 		return test_bit(hblock_idx, vi->hash_block_verified);
> 
> When part of the Merkle tree cache is evicted and re-instantiated, whether that
> part is a "page" or something else, the verified status of all the blocks
> contained in that part need to be invalidated so that they get re-verified.  The
> existing code does that.  I don't see how your proposed code does that.
> 
> - Eric
Andrey Albershteyn Oct. 16, 2023, 11:52 a.m. UTC | #5
On 2023-10-12 00:27:46, Eric Biggers wrote:
> On Wed, Oct 11, 2023 at 03:03:55PM +0200, Andrey Albershteyn wrote:
> > > How complicated would it be to keep supporting using the page bit when
> > > merkle_tree_block_size == page_size and the filesystem supports it?  It's an
> > > efficient solution, so it would be a shame to lose it.  Also it doesn't have the
> > > max file size limit that the bitmap has.
> > 
> > Well, I think it's possible but my motivation was to step away from
> > page manipulation as much as possible with intent to not affect other
> > filesystems too much. I can probably add handling of this case to
> > fsverity_read_merkle_tree_block() but fs/verity still will create
> > bitmap and have a limit. The other way is basically revert changes
> > done in patch 09, then, it probably will be quite a mix of page/block
> > handling in fs/verity/verify.c
> 
> The page-based caching still has to be supported anyway, since that's what the
> other filesystems that support fsverity use, and it seems you don't plan to
> change that.  The question is where the "block verified" flags should be stored.
> Currently there are two options: PG_checked and the separate bitmap.  I'm not
> yet convinced that removing the support for the PG_checked method is a good
> change.  PG_checked is a nice solution for the cases where it can be used; it
> requires no extra memory, no locking, and has no max file size.  Also, this
> change seems mostly orthogonal to what you're actually trying to accomplish.

Yes, I was trying to combine PG_checked and the bitmap in a way that
can also be used by XFS. I will try to change this patch to not drop
the merkle_tree_block_size == page_size case.

> > > > Also, this patch removes spinlock. The lock was used to reset bits
> > > > in bitmap belonging to one page. This patch works only with one
> > > > Merkle tree block and won't reset other blocks status.
> > > 
> > > The spinlock is needed when there are multiple Merkle tree blocks per page and
> > > the filesystem is using the page-based caching.  So I don't think you can remove
> > > it.  Can you elaborate on why you feel it can be removed?
> > 
> > With this patch is_hash_block_verified() doesn't reset bits for
> > blocks belonging to the same page. Even if page is re-instantiated
> > only one block is checked in this case. So, when other blocks are
> > checked they are reset.
> > 
> > 	if (block_cached)
> > 		return test_bit(hblock_idx, vi->hash_block_verified);
> 
> When part of the Merkle tree cache is evicted and re-instantiated, whether that
> part is a "page" or something else, the verified status of all the blocks
> contained in that part need to be invalidated so that they get re-verified.  The
> existing code does that.  I don't see how your proposed code does that.

Oh, I see the problem now; I will revert it.
Eric Biggers Oct. 17, 2023, 4:58 a.m. UTC | #6
On Thu, Oct 12, 2023 at 08:12:09PM -0700, Darrick J. Wong wrote:
> Hi Eric,
> 
> [Please excuse my ignorance, this is only the third time I've dived
> into fsverity.]
> 
> On Thu, Oct 12, 2023 at 12:27:46AM -0700, Eric Biggers wrote:
> > On Wed, Oct 11, 2023 at 03:03:55PM +0200, Andrey Albershteyn wrote:
> > > > How complicated would it be to keep supporting using the page bit when
> > > > merkle_tree_block_size == page_size and the filesystem supports it?  It's an
> > > > efficient solution, so it would be a shame to lose it.  Also it doesn't have the
> > > > max file size limit that the bitmap has.
> 
> How complex would it be to get rid of the bitmap entirely, and validate
> all the verity tree blocks within a page all at once instead of
> individual blocks within a page?
> 
> Assuming willy isn't grinding his axe to get rid of PGchecked,
> obviously. ;)

See what I wrote earlier at
https://lore.kernel.org/linux-xfs/Y5ltzp6yeMo1oDSk@sol.localdomain.
Basically it would increase the worst-case latency by a lot.

> 
> > > Well, I think it's possible but my motivation was to step away from
> > > page manipulation as much as possible with intent to not affect other
> > > filesystems too much. I can probably add handling of this case to
> > > fsverity_read_merkle_tree_block() but fs/verity still will create
> > > bitmap and have a limit. The other way is basically revert changes
> > > done in patch 09, then, it probably will be quite a mix of page/block
> > > handling in fs/verity/verify.c
> > 
> > The page-based caching still has to be supported anyway, since that's what the
> > other filesystems that support fsverity use, and it seems you don't plan to
> > change that.
> 
> I frankly have been asking myself why /this/ patchset adds so much extra
> code and flags and whatnot to XFS and fs/verity.  From what I can tell,
> the xfs buffer cache has been extended to allocate double the memory so
> that xattr contents can be shadowed.  getxattr for merkle tree contents
> then pins the buffer, shadows the contents, and hands both back to the
> caller (aka xfs_read_merkle_tree_block).   The shadow memory is then
> handed to fs/verity to do its magic; following that, fsverity releases
> the reference and we can eventually drop the xfs_buf reference.
> 
> But this seems way overcomplicated to me.  ->read_merkle_tree_page hands
> us a pgoff_t and a suggestion for page readahead, and wants us to return
> an uptodate locked page, right?
> 
> Why can't xfs allocate a page, walk the requested range to fill the page
> with merkle tree blocks that were written to the xattr structure (or
> zero the page contents if there is no xattr), and hand that page to
> fsverity?  (It helps to provide the merkle tree block size to
> xfs_read_merkle_tree_page, thanks for adding that).

Earlier versions of this patchset did that.  But, it's only really feasible if
the pages are actually cached.  Otherwise it's very inefficient and can result
in random ENOMEM.

> Assuming fsverity also wants some caching, we could augment the
> xfs_inode to point to a separate address_space for cached merkle tree
> pages, and then xfs_read_merkle_tree_page can use __filemap_get_folio to
> find uptodate cached pages, or instantiate one and make it uptodate.
> Even better, we can pretty easily use multipage folios for this, though
> AFAICT the fs/verity code isn't yet up to handling that.
> 
> The only thing I can't quite figure out is how to get memory reclaim to
> scan the extra address_space when it wants to try to reclaim pages.
> That part ext4 and f2fs got for free because they stuffed the merkle
> tree in the posteof space.
> 
> But wouldn't that solve /all/ the plumbing problems without scattering
> bits of new code and flags into the xfs buffer cache, the extended
> attributes code, and elsewhere?  And then xfs would not need to burn up
> vmap space to allocate 8K memory blocks just to provide 4k merkel tree
> blocks to fs/verity.
> 
> That's just my 2 cents from spending a couple of hours hypothesizing how
> I would fill out the fsverity_operations.

That might work.  I'm not sure about the details though, e.g. can mapping->host
point to the file's inode or would it need to be a fake one.

- Eric
Christoph Hellwig Oct. 17, 2023, 5:57 a.m. UTC | #7
On Thu, Oct 12, 2023 at 12:27:46AM -0700, Eric Biggers wrote:
> Currently there are two options: PG_checked and the separate bitmap.  I'm not
> yet convinced that removing the support for the PG_checked method is a good
> change.  PG_checked is a nice solution for the cases where it can be used; it
> requires no extra memory, no locking, and has no max file size.  Also, this
> change seems mostly orthogonal to what you're actually trying to accomplish.

Given that willy has been on a (IMHO reasonable) quest to kill off
as many page flags as possible, I'd really like to seize the
opportunity and kill PG_checked in fsverity.  How big are the
downsides of the bitmap vs. using the page flag, and do they matter
in practice?
Christoph Hellwig Oct. 17, 2023, 6:01 a.m. UTC | #8
On Thu, Oct 12, 2023 at 08:12:09PM -0700, Darrick J. Wong wrote:
> I frankly have been asking myself why /this/ patchset adds so much extra
> code and flags and whatnot to XFS and fs/verity.  From what I can tell,
> the xfs buffer cache has been extended to allocate double the memory so
> that xattr contents can be shadowed.  getxattr for merkle tree contents
> then pins the buffer, shadows the contents, and hands both back to the
> caller (aka xfs_read_merkle_tree_block).   The shadow memory is then
> handed to fs/verity to do its magic; following that, fsverity releases
> the reference and we can eventually drop the xfs_buf reference.
> 
> But this seems way overcomplicated to me.  ->read_merkle_tree_page hands
> us a pgoff_t and a suggestion for page readahead, and wants us to return
> an uptodate locked page, right?

It does.  That being said, I really much prefer the block-based
interface from Andrey.  It is a lot cleaner and without weird page
cache internals, although it can still be implemented very nicely
by filesystems that store the tree in the page cache.

> The only thing I can't quite figure out is how to get memory reclaim to
> scan the extra address_space when it wants to try to reclaim pages.
> That part ext4 and f2fs got for free because they stuffed the merkle
> tree in the posteof space.

Except for the magic swapper_spaces, an address_space without an
inode is not a thing, so you need to allocate an extra inode anyway,
which is what reclaim works on.
Eric Biggers Oct. 17, 2023, 5:49 p.m. UTC | #9
On Mon, Oct 16, 2023 at 10:57:45PM -0700, Christoph Hellwig wrote:
> On Thu, Oct 12, 2023 at 12:27:46AM -0700, Eric Biggers wrote:
> > Currently there are two options: PG_checked and the separate bitmap.  I'm not
> > yet convinced that removing the support for the PG_checked method is a good
> > change.  PG_checked is a nice solution for the cases where it can be used; it
> > requires no extra memory, no locking, and has no max file size.  Also, this
> > change seems mostly orthogonal to what you're actually trying to accomplish.
> 
> Given that willy has been on a (IMHO reasonable) quest to kill off
> as many as possible page flags I'd really like to seize the opportunity
> and kill PageCheck in fsverity.  How big are the downsides of the bitmap
> vs using the page flags, and do they matter in practice?
> 

Currently PG_checked is used even with the bitmap-based approach, as a way to
ensure that hash pages that get evicted and re-instantiated from the backing
storage are re-verified.  See is_hash_block_verified() in fs/verity/verify.c.
That would need to be replaced with something else.  I'm not sure what else
would work and still be efficient.

No one has actually deployed the bitmap-based approach in production yet; it was
added in v6.3 and is only used for merkle_tree_block_size != PAGE_SIZE.  But,
the performance and memory overhead is probably not significant in practice.
I'm more worried about the file size limit of the bitmap-based approach, which
is currently ~4 TB.  *Probably* no one is using fsverity on files that large,
but introducing a limit for a case that wasn't supported before
(merkle_tree_block_size != PAGE_SIZE) isn't as bad as retroactively introducing
a limit for existing files that worked before and refusing to open them.  Huge
files would need something other than a simple bitmap, which would add
complexity and overhead.

Note that many filesystems use PG_checked for other purposes such as keeping
track of when directory blocks have been validated.  I'm not sure how feasible
it will be to free up that flag entirely.

- Eric
Darrick J. Wong Oct. 18, 2023, 2:35 a.m. UTC | #10
On Mon, Oct 16, 2023 at 09:58:34PM -0700, Eric Biggers wrote:
> On Thu, Oct 12, 2023 at 08:12:09PM -0700, Darrick J. Wong wrote:
> > Hi Eric,
> > 
> > [Please excuse my ignorance, this is only the third time I've dived
> > into fsverity.]
> > 
> > On Thu, Oct 12, 2023 at 12:27:46AM -0700, Eric Biggers wrote:
> > > On Wed, Oct 11, 2023 at 03:03:55PM +0200, Andrey Albershteyn wrote:
> > > > > How complicated would it be to keep supporting using the page bit when
> > > > > merkle_tree_block_size == page_size and the filesystem supports it?  It's an
> > > > > efficient solution, so it would be a shame to lose it.  Also it doesn't have the
> > > > > max file size limit that the bitmap has.
> > 
> > How complex would it be to get rid of the bitmap entirely, and validate
> > all the verity tree blocks within a page all at once instead of
> > individual blocks within a page?
> > 
> > Assuming willy isn't grinding his axe to get rid of PGchecked,
> > obviously. ;)
> 
> See what I wrote earlier at
> https://lore.kernel.org/linux-xfs/Y5ltzp6yeMo1oDSk@sol.localdomain.
> Basically it would increase the worst-case latency by a lot.

Ahh.  Yeah, ok, got it, you don't necessarily want to prefault a bunch
more merkle tree blocks all at once just to read a single byte. :)

> > 
> > > > Well, I think it's possible but my motivation was to step away from
> > > > page manipulation as much as possible with intent to not affect other
> > > > filesystems too much. I can probably add handling of this case to
> > > > fsverity_read_merkle_tree_block() but fs/verity still will create
> > > > bitmap and have a limit. The other way is basically revert changes
> > > > done in patch 09, then, it probably will be quite a mix of page/block
> > > > handling in fs/verity/verify.c
> > > 
> > > The page-based caching still has to be supported anyway, since that's what the
> > > other filesystems that support fsverity use, and it seems you don't plan to
> > > change that.
> > 
> > I frankly have been asking myself why /this/ patchset adds so much extra
> > code and flags and whatnot to XFS and fs/verity.  From what I can tell,
> > the xfs buffer cache has been extended to allocate double the memory so
> > that xattr contents can be shadowed.  getxattr for merkle tree contents
> > then pins the buffer, shadows the contents, and hands both back to the
> > caller (aka xfs_read_merkle_tree_block).   The shadow memory is then
> > handed to fs/verity to do its magic; following that, fsverity releases
> > the reference and we can eventually drop the xfs_buf reference.
> > 
> > But this seems way overcomplicated to me.  ->read_merkle_tree_page hands
> > us a pgoff_t and a suggestion for page readahead, and wants us to return
> > an uptodate locked page, right?
> > 
> > Why can't xfs allocate a page, walk the requested range to fill the page
> > with merkle tree blocks that were written to the xattr structure (or
> > zero the page contents if there is no xattr), and hand that page to
> > fsverity?  (It helps to provide the merkle tree block size to
> > xfs_read_merkle_tree_page, thanks for adding that).
> 
> Earlier versions of this patchset did that.  But, it's only really feasible if
> the pages are actually cached.  Otherwise it's very inefficient and can result
> in random ENOMEM.
> 
> > Assuming fsverity also wants some caching, we could augment the
> > xfs_inode to point to a separate address_space for cached merkle tree
> > pages, and then xfs_read_merkle_tree_page can use __filemap_get_folio to
> > find uptodate cached pages, or instantiate one and make it uptodate.
> > Even better, we can pretty easily use multipage folios for this, though
> > AFAICT the fs/verity code isn't yet up to handling that.
> > 
> > The only thing I can't quite figure out is how to get memory reclaim to
> > scan the extra address_space when it wants to try to reclaim pages.
> > That part ext4 and f2fs got for free because they stuffed the merkle
> > tree in the posteof space.
> > 
> > But wouldn't that solve /all/ the plumbing problems without scattering
> > bits of new code and flags into the xfs buffer cache, the extended
> > attributes code, and elsewhere?  And then xfs would not need to burn up
> > vmap space to allocate 8K memory blocks just to provide 4k merkel tree
> > blocks to fs/verity.
> > 
> > That's just my 2 cents from spending a couple of hours hypothesizing how
> > I would fill out the fsverity_operations.
> 
> That might work.  I'm not sure about the details though, e.g. can mapping->host
> point to the file's inode or would it need to be a fake one.

Me neither.  I /think/ invalidate_mapping_pages would do the job, though
one would probably want to have a custom shrinker so we could find out
just how many pages are desired by reclaim, and then try to invalidate
only that many pages.  Not sure what happens if there are multiple
mappings whose ->host pointers point to the same inode.

Alternately I wonder if we could make a tmpfs file that would evict live
contents instead of writing them to the paging file... ;)

--D

> 
> - Eric

Patch

diff --git a/fs/verity/fsverity_private.h b/fs/verity/fsverity_private.h
index d071a6e32581..9611eeae3527 100644
--- a/fs/verity/fsverity_private.h
+++ b/fs/verity/fsverity_private.h
@@ -69,7 +69,6 @@  struct fsverity_info {
 	u8 file_digest[FS_VERITY_MAX_DIGEST_SIZE];
 	const struct inode *inode;
 	unsigned long *hash_block_verified;
-	spinlock_t hash_page_init_lock;
 };
 
 #define FS_VERITY_MAX_SIGNATURE_SIZE	(FS_VERITY_MAX_DESCRIPTOR_SIZE - \
diff --git a/fs/verity/open.c b/fs/verity/open.c
index 6c31a871b84b..dfb9fe6aaae9 100644
--- a/fs/verity/open.c
+++ b/fs/verity/open.c
@@ -182,6 +182,7 @@  struct fsverity_info *fsverity_create_info(const struct inode *inode,
 {
 	struct fsverity_info *vi;
 	int err;
+	unsigned long num_bits;
 
 	vi = kmem_cache_zalloc(fsverity_info_cachep, GFP_KERNEL);
 	if (!vi)
@@ -213,33 +214,29 @@  struct fsverity_info *fsverity_create_info(const struct inode *inode,
 	if (err)
 		goto fail;
 
-	if (vi->tree_params.block_size != PAGE_SIZE) {
-		/*
-		 * When the Merkle tree block size and page size differ, we use
-		 * a bitmap to keep track of which hash blocks have been
-		 * verified.  This bitmap must contain one bit per hash block,
-		 * including alignment to a page boundary at the end.
-		 *
-		 * Eventually, to support extremely large files in an efficient
-		 * way, it might be necessary to make pages of this bitmap
-		 * reclaimable.  But for now, simply allocating the whole bitmap
-		 * is a simple solution that works well on the files on which
-		 * fsverity is realistically used.  E.g., with SHA-256 and 4K
-		 * blocks, a 100MB file only needs a 24-byte bitmap, and the
-		 * bitmap for any file under 17GB fits in a 4K page.
-		 */
-		unsigned long num_bits =
-			vi->tree_params.tree_pages <<
-			vi->tree_params.log_blocks_per_page;
+	/*
+	 * We use a bitmap to keep track of which hash blocks have been
+	 * verified.  This bitmap must contain one bit per hash block,
+	 * including alignment to a page boundary at the end.
+	 *
+	 * Eventually, to support extremely large files in an efficient
+	 * way, it might be necessary to make pages of this bitmap
+	 * reclaimable.  But for now, simply allocating the whole bitmap
+	 * is a simple solution that works well on the files on which
+	 * fsverity is realistically used.  E.g., with SHA-256 and 4K
+	 * blocks, a 100MB file only needs a 24-byte bitmap, and the
+	 * bitmap for any file under 17GB fits in a 4K page.
+	 */
+	num_bits =
+		vi->tree_params.tree_pages <<
+		vi->tree_params.log_blocks_per_page;
 
-		vi->hash_block_verified = kvcalloc(BITS_TO_LONGS(num_bits),
-						   sizeof(unsigned long),
-						   GFP_KERNEL);
-		if (!vi->hash_block_verified) {
-			err = -ENOMEM;
-			goto fail;
-		}
-		spin_lock_init(&vi->hash_page_init_lock);
+	vi->hash_block_verified = kvcalloc(BITS_TO_LONGS(num_bits),
+					   sizeof(unsigned long),
+					   GFP_KERNEL);
+	if (!vi->hash_block_verified) {
+		err = -ENOMEM;
+		goto fail;
 	}
 
 	return vi;
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index 2fe7bd57b16e..e7b13d143ae9 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -13,69 +13,18 @@ 
 static struct workqueue_struct *fsverity_read_workqueue;
 
 /*
- * Returns true if the hash block with index @hblock_idx in the tree, located in
- * @hpage, has already been verified.
+ * Returns true if the hash block with index @hblock_idx in the tree has
+ * already been verified.
  */
-static bool is_hash_block_verified(struct fsverity_info *vi, struct page *hpage,
-				   unsigned long hblock_idx)
+static bool is_hash_block_verified(struct fsverity_info *vi,
+				   unsigned long hblock_idx,
+				   bool block_cached)
 {
-	bool verified;
-	unsigned int blocks_per_page;
-	unsigned int i;
-
-	/*
-	 * When the Merkle tree block size and page size are the same, then the
-	 * ->hash_block_verified bitmap isn't allocated, and we use PG_checked
-	 * to directly indicate whether the page's block has been verified.
-	 *
-	 * Using PG_checked also guarantees that we re-verify hash pages that
-	 * get evicted and re-instantiated from the backing storage, as new
-	 * pages always start out with PG_checked cleared.
-	 */
-	if (!vi->hash_block_verified)
-		return PageChecked(hpage);
-
-	/*
-	 * When the Merkle tree block size and page size differ, we use a bitmap
-	 * to indicate whether each hash block has been verified.
-	 *
-	 * However, we still need to ensure that hash pages that get evicted and
-	 * re-instantiated from the backing storage are re-verified.  To do
-	 * this, we use PG_checked again, but now it doesn't really mean
-	 * "checked".  Instead, now it just serves as an indicator for whether
-	 * the hash page is newly instantiated or not.
-	 *
-	 * The first thread that sees PG_checked=0 must clear the corresponding
-	 * bitmap bits, then set PG_checked=1.  This requires a spinlock.  To
-	 * avoid having to take this spinlock in the common case of
-	 * PG_checked=1, we start with an opportunistic lockless read.
-	 */
-	if (PageChecked(hpage)) {
-		/*
-		 * A read memory barrier is needed here to give ACQUIRE
-		 * semantics to the above PageChecked() test.
-		 */
-		smp_rmb();
+	if (block_cached)
 		return test_bit(hblock_idx, vi->hash_block_verified);
-	}
-	spin_lock(&vi->hash_page_init_lock);
-	if (PageChecked(hpage)) {
-		verified = test_bit(hblock_idx, vi->hash_block_verified);
-	} else {
-		blocks_per_page = vi->tree_params.blocks_per_page;
-		hblock_idx = round_down(hblock_idx, blocks_per_page);
-		for (i = 0; i < blocks_per_page; i++)
-			clear_bit(hblock_idx + i, vi->hash_block_verified);
-		/*
-		 * A write memory barrier is needed here to give RELEASE
-		 * semantics to the below SetPageChecked() operation.
-		 */
-		smp_wmb();
-		SetPageChecked(hpage);
-		verified = false;
-	}
-	spin_unlock(&vi->hash_page_init_lock);
-	return verified;
+
+	clear_bit(hblock_idx, vi->hash_block_verified);
+	return false;
 }
 
 /*
@@ -179,7 +128,7 @@  verify_data_block(struct inode *inode, struct fsverity_info *vi,
 			goto error;
 		}
 		haddr = kmap_local_page(hpage) + hblock_offset_in_page;
-		if (is_hash_block_verified(vi, hpage, hblock_idx)) {
+		if (is_hash_block_verified(vi, hblock_idx, PageChecked(hpage))) {
 			memcpy(_want_hash, haddr + hoffset, hsize);
 			want_hash = _want_hash;
 			kunmap_local(haddr);