[RFC,10/11] xfs: add fs-verity support

Message ID	20221213172935.680971-11-aalbersh@redhat.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: Andrey Albershteyn <aalbersh@redhat.com> To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org Cc: Andrey Albershteyn <aalbersh@redhat.com> Subject: [RFC PATCH 10/11] xfs: add fs-verity support Date: Tue, 13 Dec 2022 18:29:34 +0100 Message-Id: <20221213172935.680971-11-aalbersh@redhat.com> In-Reply-To: <20221213172935.680971-1-aalbersh@redhat.com> References: <20221213172935.680971-1-aalbersh@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	fs-verity support for XFS \| expand [RFC,00/11] fs-verity support for XFS [RFC,01/11] xfs: enable large folios in xfs_setup_inode() [RFC,02/11] pagemap: add mapping_clear_large_folios() wrapper [RFC,03/11] xfs: add attribute type for fs-verity [RFC,04/11] xfs: add fs-verity ro-compat flag [RFC,05/11] xfs: add inode on-disk VERITY flag [RFC,06/11] xfs: initialize fs-verity on file open and cleanup on inode destruction [RFC,07/11] xfs: disable direct read path for fs-verity sealed files [RFC,08/11] xfs: don't enable large folios on fs-verity sealed inode [RFC,09/11] iomap: fs-verity verification on page read [RFC,10/11] xfs: add fs-verity support [RFC,11/11] xfs: add fs-verity ioctls

Andrey Albershteyn Dec. 13, 2022, 5:29 p.m. UTC

Add integration with fs-verity. The XFS store fs-verity metadata in
the extended attributes. The metadata consist of verity descriptor
and Merkle tree pages.

The descriptor is stored under "verity_descriptor" extended
attribute. The Merkle tree pages are stored under binary indexes.

When fs-verity is enabled on an inode, the XFS_IVERITY flag is set
meaning that the Merkle tree is being build. Then, pagecache is
flushed and large folios are disabled as these aren't yet supported
by fs-verity. This is done in xfs_begin_enable_verity() to make sure
that fs-verity operations on the inode don't populate cache with
large folios during a tree build. The initialization ends with
storing of verity descriptor and setting inode on-disk flag
(XFS_DIFLAG2_VERITY).

Also add check that block size == PAGE_SIZE as fs-verity doesn't
support different sizes yet.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/Makefile          |   1 +
 fs/xfs/libxfs/xfs_attr.c |   8 ++
 fs/xfs/xfs_inode.h       |   1 +
 fs/xfs/xfs_super.c       |  10 ++
 fs/xfs/xfs_verity.c      | 203 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_verity.h      |  19 ++++
 6 files changed, 242 insertions(+)
 create mode 100644 fs/xfs/xfs_verity.c
 create mode 100644 fs/xfs/xfs_verity.h

Eric Biggers Dec. 13, 2022, 7:08 p.m. UTC | #1

On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> 
> Also add check that block size == PAGE_SIZE as fs-verity doesn't
> support different sizes yet.

That's coming with
https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@kernel.org/T/#u,
which I'll be resending soon and I hope to apply for 6.3.
Review and testing of that patchset, along with its associated xfstests update
(https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@kernel.org/T/#u),
would be greatly appreciated.

Note, as proposed there will still be a limit of:

	merkle_tree_block_size <= fs_block_size <= page_size

Hopefully you don't need fs_block_size > page_size or
merkle_tree_block_size > fs_block_size?

- Eric

Darrick J. Wong Dec. 13, 2022, 7:22 p.m. UTC | #2

On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > 
> > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > support different sizes yet.
> 
> That's coming with
> https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@kernel.org/T/#u,
> which I'll be resending soon and I hope to apply for 6.3.
> Review and testing of that patchset, along with its associated xfstests update
> (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@kernel.org/T/#u),
> would be greatly appreciated.
> 
> Note, as proposed there will still be a limit of:
> 
> 	merkle_tree_block_size <= fs_block_size <= page_size
> 
> Hopefully you don't need fs_block_size > page_size or

Not right this second, but large folios (and the ability to demand them)
are probably the last major piece that we need to support running
64k-fsblock filesystems on x86.

> merkle_tree_block_size > fs_block_size?

Doubtful.

--D

> 
> - Eric

Eric Biggers Dec. 13, 2022, 8:13 p.m. UTC | #3

On Tue, Dec 13, 2022 at 11:22:31AM -0800, Darrick J. Wong wrote:
> On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> > On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > > 
> > > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > > support different sizes yet.
> > 
> > That's coming with
> > https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@kernel.org/T/#u,
> > which I'll be resending soon and I hope to apply for 6.3.
> > Review and testing of that patchset, along with its associated xfstests update
> > (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@kernel.org/T/#u),
> > would be greatly appreciated.
> > 
> > Note, as proposed there will still be a limit of:
> > 
> > 	merkle_tree_block_size <= fs_block_size <= page_size
> > 
> > Hopefully you don't need fs_block_size > page_size or
> 
> Not right this second, but large folios (and the ability to demand them)
> are probably the last major piece that we need to support running
> 64k-fsblock filesystems on x86.
> 
> > merkle_tree_block_size > fs_block_size?
> 
> Doubtful.
> 

Thanks for the info.  Actually, I think

	merkle_tree_block_size <= fs_block_size <= page_size

is wrong.  It should actually be

	merkle_tree_block_size <= min(fs_block_size, page_size)

So there shouldn't actually be a problem with fs_block_size > page_size
(assuming that the filesystem doesn't have unrelated problems with it).

merkle_tree_block_size <= page_size comes from the way that fs/verity/verify.c
works (again, talking about the version with my patchset "fsverity: support for
non-4K pages" applied, and not the current upstream version which has a stronger
assumption of merkle_tree_block_size == page_size).

merkle_tree_block_size <= fs_block_size comes from the fact that every time data
is verified, it must be Merkle tree block aligned.  Maybe even that is not
necessarily a problem, if the filesystem waits to collect a full page (or folio)
before verifying it.  But ext4 will do a FS block at a time.

- Eric

Dave Chinner Dec. 13, 2022, 8:33 p.m. UTC | #4

On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > 
> > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > support different sizes yet.
> 
> That's coming with
> https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@kernel.org/T/#u,
> which I'll be resending soon and I hope to apply for 6.3.
> Review and testing of that patchset, along with its associated xfstests update
> (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@kernel.org/T/#u),
> would be greatly appreciated.
> 
> Note, as proposed there will still be a limit of:
> 
> 	merkle_tree_block_size <= fs_block_size <= page_size

> Hopefully you don't need fs_block_size > page_size or

Yes, we will.

This back on my radar now that folios have settled down. It's
pretty trivial for XFS to do because we already support metadata
block sizes > filesystem block size. Here is an old prototype:

https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@fromorbit.com/

> merkle_tree_block_size > fs_block_size?

That's also a desirable addition.

XFS is using xattrs to hold merkle tree blocks so the merkle tree
storage is are already independent of the filesystem block size and
page cache limitations. Being able to using 64kB merkle tree blocks
would be really handy for reducing the search depth and overall IO
footprint of really large files.

Cheers,

Dave.

Eric Biggers Dec. 13, 2022, 8:39 p.m. UTC | #5

On Wed, Dec 14, 2022 at 07:33:19AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> > On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > > 
> > > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > > support different sizes yet.
> > 
> > That's coming with
> > https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@kernel.org/T/#u,
> > which I'll be resending soon and I hope to apply for 6.3.
> > Review and testing of that patchset, along with its associated xfstests update
> > (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@kernel.org/T/#u),
> > would be greatly appreciated.
> > 
> > Note, as proposed there will still be a limit of:
> > 
> > 	merkle_tree_block_size <= fs_block_size <= page_size
> 
> > Hopefully you don't need fs_block_size > page_size or
> 
> Yes, we will.
> 
> This back on my radar now that folios have settled down. It's
> pretty trivial for XFS to do because we already support metadata
> block sizes > filesystem block size. Here is an old prototype:
> 
> https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@fromorbit.com/

As per my follow-up response
(https://lore.kernel.org/r/Y5jc7P1ZeWHiTKRF@sol.localdomain),
I now think that wouldn't actually be a problem.

> > merkle_tree_block_size > fs_block_size?
> 
> That's also a desirable addition.
> 
> XFS is using xattrs to hold merkle tree blocks so the merkle tree
> storage is are already independent of the filesystem block size and
> page cache limitations. Being able to using 64kB merkle tree blocks
> would be really handy for reducing the search depth and overall IO
> footprint of really large files.

Well, the main problem is that using a Merkle tree block of 64K would mean that
you can never read less than 64K at a time.

- Eric

Dave Chinner Dec. 13, 2022, 9:40 p.m. UTC | #6

On Tue, Dec 13, 2022 at 12:39:39PM -0800, Eric Biggers wrote:
> On Wed, Dec 14, 2022 at 07:33:19AM +1100, Dave Chinner wrote:
> > On Tue, Dec 13, 2022 at 11:08:45AM -0800, Eric Biggers wrote:
> > > On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> > > > 
> > > > Also add check that block size == PAGE_SIZE as fs-verity doesn't
> > > > support different sizes yet.
> > > 
> > > That's coming with
> > > https://lore.kernel.org/linux-fsdevel/20221028224539.171818-1-ebiggers@kernel.org/T/#u,
> > > which I'll be resending soon and I hope to apply for 6.3.
> > > Review and testing of that patchset, along with its associated xfstests update
> > > (https://lore.kernel.org/fstests/20221211070704.341481-1-ebiggers@kernel.org/T/#u),
> > > would be greatly appreciated.
> > > 
> > > Note, as proposed there will still be a limit of:
> > > 
> > > 	merkle_tree_block_size <= fs_block_size <= page_size
> > 
> > > Hopefully you don't need fs_block_size > page_size or
> > 
> > Yes, we will.
> > 
> > This back on my radar now that folios have settled down. It's
> > pretty trivial for XFS to do because we already support metadata
> > block sizes > filesystem block size. Here is an old prototype:
> > 
> > https://lore.kernel.org/linux-xfs/20181107063127.3902-1-david@fromorbit.com/
> 
> As per my follow-up response
> (https://lore.kernel.org/r/Y5jc7P1ZeWHiTKRF@sol.localdomain),
> I now think that wouldn't actually be a problem.

Good to hear.

> > > merkle_tree_block_size > fs_block_size?
> > 
> > That's also a desirable addition.
> > 
> > XFS is using xattrs to hold merkle tree blocks so the merkle tree
> > storage is are already independent of the filesystem block size and
> > page cache limitations. Being able to using 64kB merkle tree blocks
> > would be really handy for reducing the search depth and overall IO
> > footprint of really large files.
> 
> Well, the main problem is that using a Merkle tree block of 64K would mean that
> you can never read less than 64K at a time.

Sure, but why does that matter?

The typical cost of a 64kB IO is only about 5% more than a
4kB IO, even on slow spinning storage.  However, we bring an order
of magnitude more data into the cache with that IO, so we can then
process more data before we have to go to disk again and take
another latency hit.

FYI, we have this large 64kB block size option for directories in
XFS already - you can have a 4kB block size filesystem with a 64kB
directory block size. The larger block size is a little slower for
small directories because they have higher per-leaf block CPU
processing overhead, but once you get to millions of records in a
single directory or really high sustained IO load, the larger block
size is *much* faster because the reduction in IO latency and search
efficiency more than makes up for the single block CPU processing
overhead...

The merkle tree is little different - once we get into TB scale
files, the merkle tree is indexing millions of individual records.
At this point overall record lookup and IO efficiency dominates the
data access time, not the amount of data each individual IO
retreives from disk.

Keep in mind that the block size used for the merkle tree would be a
filesystem choice. If we have the capability to support 64kB
merkle tree blocks, then XFS can make the choice of what block size
to use at the point where we are measuring the file because we know
how large the file is at that point. And because we're storing the
merkle tree blocks in xattrs, we know exactly what block size the
merkle tree data was stored in from the xattr metadata...

Cheers,

Dave.

Dave Chinner Dec. 14, 2022, 7:58 a.m. UTC | #7

On Tue, Dec 13, 2022 at 06:29:34PM +0100, Andrey Albershteyn wrote:
> Add integration with fs-verity. The XFS store fs-verity metadata in
> the extended attributes. The metadata consist of verity descriptor
> and Merkle tree pages.
> 
> The descriptor is stored under "verity_descriptor" extended
> attribute. The Merkle tree pages are stored under binary indexes.
> 
> When fs-verity is enabled on an inode, the XFS_IVERITY flag is set
> meaning that the Merkle tree is being build. Then, pagecache is
> flushed and large folios are disabled as these aren't yet supported
> by fs-verity. This is done in xfs_begin_enable_verity() to make sure
> that fs-verity operations on the inode don't populate cache with
> large folios during a tree build. The initialization ends with
> storing of verity descriptor and setting inode on-disk flag
> (XFS_DIFLAG2_VERITY).
> 
> Also add check that block size == PAGE_SIZE as fs-verity doesn't
> support different sizes yet.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/Makefile          |   1 +
>  fs/xfs/libxfs/xfs_attr.c |   8 ++
>  fs/xfs/xfs_inode.h       |   1 +
>  fs/xfs/xfs_super.c       |  10 ++
>  fs/xfs/xfs_verity.c      | 203 +++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_verity.h      |  19 ++++
>  6 files changed, 242 insertions(+)
>  create mode 100644 fs/xfs/xfs_verity.c
>  create mode 100644 fs/xfs/xfs_verity.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 42d0496fdad7d..5afa8ae5b3b7f 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -131,6 +131,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)	+= xfs_acl.o
>  xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
> +xfs-$(CONFIG_FS_VERITY)		+= xfs_verity.o
>  
>  # notify failure
>  ifeq ($(CONFIG_MEMORY_FAILURE),y)
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 57080ea4c869b..42013fc99b76a 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -26,6 +26,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_attr_item.h"
>  #include "xfs_xattr.h"
> +#include "xfs_verity.h"
>  
>  struct kmem_cache		*xfs_attr_intent_cache;
>  
> @@ -1632,6 +1633,13 @@ xfs_attr_namecheck(
>  		return xfs_verify_pptr(mp, (struct xfs_parent_name_rec *)name);
>  	}
>  
> +	if (flags & XFS_ATTR_VERITY) {
> +		if (length != sizeof(__be64) &&
> +				length != XFS_VERITY_DESCRIPTOR_NAME_LEN)

This needs a comment describing what it is checking as it is not
obvious from reading the code. I can grok what the
XFS_VERITY_DESCRIPTOR_NAME_LEN check is about, but the sizeof()
check is not obvious.

I also suspect it is also better to explicitly check for valid
values rather than invalid values. i.e.

		/* describe what name is 8 bytes in length */
		if (length == sizeof(__be64))
			return true;
		/* Verity descriptor blocks are held in a named attribute. */
		if (length == XFS_VERITY_DESCRIPTOR_NAME_LEN)
			return true;
		/* Not a valid verity attribute name length */
		return false.

> +			return false;
> +		return true;
> +	}
> +
>  	return xfs_str_attr_namecheck(name, length);
>  }
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 5735de32beebd..070631adac572 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -325,6 +325,7 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>   * plain old IRECLAIMABLE inode.
>   */
>  #define XFS_INACTIVATING	(1 << 13)
> +#define XFS_IVERITY		(1 << 14) /* merkle tree is in progress */

Does this flag mean the inode has verity information on it (as the
name suggests) or that the inode is currently being measured and the
merkle tree is being built (as the comment suggests)?
>  
>  /* All inode state flags related to inode reclaim. */
>  #define XFS_ALL_IRECLAIM_FLAGS	(XFS_IRECLAIMABLE | \
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 50c2c819ba940..a3c89d2c06a8a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -41,6 +41,7 @@
>  #include "xfs_attr_item.h"
>  #include "xfs_xattr.h"
>  #include "xfs_iunlink_item.h"
> +#include "xfs_verity.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -1469,6 +1470,9 @@ xfs_fs_fill_super(
>  	sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
>  #endif
>  	sb->s_op = &xfs_super_operations;
> +#ifdef CONFIG_FS_VERITY
> +	sb->s_vop = &xfs_verity_ops;
> +#endif
>  
>  	/*
>  	 * Delay mount work if the debug hook is set. This is debug
> @@ -1669,6 +1673,12 @@ xfs_fs_fill_super(
>  		xfs_alert(mp,
>  	"EXPERIMENTAL parent pointer feature enabled. Use at your own risk!");
>  
> +	if (xfs_has_verity(mp) && mp->m_super->s_blocksize != PAGE_SIZE) {
> +		xfs_alert(mp,
> +			"Cannot use fs-verity with block size != PAGE_SIZE");
> +		goto out_filestream_unmount;
> +	}

Needs an EXPERIMENTAL warning to be emitted here, jus tlike the
parent pointer feature check above.

> +
>  	error = xfs_mountfs(mp);
>  	if (error)
>  		goto out_filestream_unmount;
> diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> new file mode 100644
> index 0000000000000..112a72d0b0ca7
> --- /dev/null
> +++ b/fs/xfs/xfs_verity.c
> @@ -0,0 +1,203 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Red Hat, Inc.
> + */
> +#include "xfs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_da_format.h"
> +#include "xfs_da_btree.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_inode.h"
> +#include "xfs_attr.h"
> +#include "xfs_verity.h"
> +#include "xfs_bmap_util.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans.h"
> +
> +static int
> +xfs_get_verity_descriptor(
> +	struct inode		*inode,
> +	void			*buf,
> +	size_t			buf_size)

So what is the API being implemented here? It passes in a NULL *buf
and buf_size = 0 to query the size of the verity descriptor?

And then allocates a buffer the size returned and calls again with
*buf of the right size and buf_size = the right size?

> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error = 0;
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
> +		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
> +		.valuelen	= buf_size,
> +	};
> +
> +	error = xfs_attr_get(&args);
> +	if (error)
> +		return error;
> +
> +	if (buf_size == 0)
> +		return args.valuelen;
> +
> +	if (args.valuelen > buf_size) {
> +		kmem_free(args.value);
> +		return -ERANGE;
> +	}

Ok, this won't happen. If the provided value length (buf_size) is
smaller than the attribute value length found, xfs_attr_copy_value()
will return -ERANGE and set args.valuelen to the actual attribute
size found. That will be caught by the initial error return, so
you would never have noticed the kmem_free(NULL) call here.

xfs_attr_copy_value() deals with attributes both larger and smaller
than the provided buffer correctly.

> +	memcpy(buf, args.value, buf_size);

However, this will overrun the allocated args.value buffer if the
attribute that was found is smaller than buf_size - the actual
length of the attribute value that is copied is written into
args.valuelen and it may be less than the size of the buffer
supplied....

> +
> +	kmem_free(args.value);
> +	return args.valuelen;
> +}

It seems this this function can be rewritten as: 

{
	struct xfs_inode	*ip = XFS_I(inode);
	int			error = 0;
	struct xfs_da_args	args = {
		.dp		= ip,
		.attr_filter	= XFS_ATTR_VERITY,
		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
		.value		= buf,
		.valuelen	= buf_size,
	};

	error = xfs_attr_get(&args);
	if (error)
		return error;

	return args.valuelen;
}

And function as it does now.

> +
> +static int
> +xfs_begin_enable_verity(
> +	struct file	    *filp)
> +{
> +	struct inode	    *inode = file_inode(filp);
> +	struct xfs_inode    *ip = XFS_I(inode);
> +	int		    error = 0;

Hmmm. This code looks like it assumes the XFS_IOLOCK_EXCL is already
held - when was that lock taken? Can you add this:

	ASSERT(xfs_is_ilocked(ip, XFS_IOLOCK_EXCL));

> +
> +	if (IS_DAX(inode))
> +		return -EINVAL;

So fsverity is not compatible with DAX? What happens if we
turn on DAX after fsverity has been enabled on an inode? I didn't
see an exception to avoid IS_DAX() inodes with fsverity enabled
in the xfs_file_read_iter() code....

> +
> +	if (xfs_iflags_test(ip, XFS_IVERITY))
> +		return -EBUSY;
> +	xfs_iflags_set(ip, XFS_IVERITY);

Ah, so this is the "merkle tree being built" flag.

Can we name it XFS_IVERITY_CONSTRUCTION or something similar that
tells the reader that is a "work is in progress" flag rather than an
"inode has verity enabled" flag?

> +	/*
> +	 * As fs-verity doesn't support multi-page folios yet, flush everything
> +	 * from page cache and disable it
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +
> +	inode_dio_wait(inode);

This only works to stop all IO if somethign is already holding
the XFS_IOLOCK_EXCL, hence my comment about the assert above.

> +	error = xfs_flush_unmap_range(ip, 0, XFS_ISIZE(ip));
> +	if (error)
> +		goto out;

	WARN_ON_ONCE(inode->i_mapping->nr_pages != 0);

Ok, so by this point nothing can instantiate new pages in the page
cache, and the mapping should be empty. Which means there should be
no large pages in it at all.

Which means it is safe to remove large folio support from the
mapping at this point..

> +	mapping_clear_large_folios(inode->i_mapping);

If Willy doesn't want to this wrapper to exist (and I can see why he
doesn't want it), just open code it with a clear comment detailing
that it will go away as the fsverity infrastructure is updated to
support multipage folios.

	/*
	 * This bit of nastiness will go away when fsverity support
	 * for multi-page folios is added.
	 */
	clear_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);

> +
> +out:
> +	filemap_invalidate_unlock(inode->i_mapping);
> +	if (error)
> +		xfs_iflags_clear(ip, XFS_IVERITY);
> +	return error;
> +}
> +
> +static int
> +xfs_end_enable_verity(
> +	struct file		*filp,
> +	const void		*desc,
> +	size_t			desc_size,
> +	u64			merkle_tree_size)
> +{
> +	struct inode		*inode = file_inode(filp);
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.whichfork	= XFS_ATTR_FORK,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.attr_flags	= XATTR_CREATE,
> +		.name		= (const uint8_t *)XFS_VERITY_DESCRIPTOR_NAME,
> +		.namelen	= XFS_VERITY_DESCRIPTOR_NAME_LEN,
> +		.value		= (void *)desc,
> +		.valuelen	= desc_size,
> +	};
> +	int			error = 0;

	ASSERT(xfs_is_ilocked(ip, XFS_IOLOCK_EXCL));

> +
> +	/* fs-verity failed, just cleanup */
> +	if (desc == NULL) {
> +		mapping_set_large_folios(inode->i_mapping);
> +		goto out;

I don't think it's safe to enable this while there may be page cache
operations running (i.e. through the page fault path). It also uses
__set_bit(), which is not guaranteed to be an atomic bit operation
(i.e. might be a RMW operation) and so isn't safe to perform on
variables that might be modified concurrently (as can happen with
mapping->flags in this context).

I think if verity measurement fails, the least of our worries is
re-enabling large folio support. hence I'd just leave this out for
the moment - the disabling of large folios is really only a
temporary thing...

> +	}
> +
> +	error = xfs_attr_set(&args);
> +	if (error)
> +		goto out;
> +
> +	/* Set fsverity inode flag */
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
> +	if (error)
> +		goto out;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);

xfs_trans_alloc_inode()?

> +
> +	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;
> +	inode->i_flags |= S_VERITY;
> +
> +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +	error = xfs_trans_commit(tp);

I think this should be a data integrity commit here. We're setting
the on-disk flag and guaranteeing to the caller that the verity
information has been written and verity is enabled, so we should
really guarantee that it is persistent before returning to
the caller.

	/*
	 * Ensure that we've persisted the verity information before
	 * we enable it on the inode and tell the caller we have
	 * sealed the inode.
	 */
	ip->i_diflags2 |= XFS_DIFLAG2_VERITY;

	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
	xfs_trans_set_sync(tp);

	error = xfs_trans_commit(tp);

	if (!error)
		inode->i_flags |= S_VERITY;

> +
> +out:
> +	if (error)
> +		mapping_set_large_folios(inode->i_mapping);

See above. 

> +
> +	xfs_iflags_clear(ip, XFS_IVERITY);
> +	return error;
> +}
> +
> +static struct page *
> +xfs_read_merkle_tree_page(
> +	struct inode		*inode,
> +	pgoff_t			index,
> +	unsigned long		num_ra_pages)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	struct page		*page;
> +	__be64			name = cpu_to_be64(index);
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.name		= (const uint8_t *)&name,
> +		.namelen	= sizeof(__be64),
> +		.valuelen	= PAGE_SIZE,
> +	};
> +	int			error = 0;
> +
> +	error = xfs_attr_get(&args);
> +	if (error)
> +		return ERR_PTR(-EFAULT);
> +
> +	page = alloc_page(GFP_KERNEL);
> +	if (!page)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memcpy(page_address(page), args.value, args.valuelen);
> +
> +	kmem_free(args.value);
> +	return page;
> +}

Ok, this will work, but now I see why the merkel tree needs
validation on every read - the validation code makes the assumption
that the same *physical page* for a given tree block is fed to it
over and over again. It doesn't appear to check that the data in the
page is unchanged, just checks for the PageChecked() flag being set.
Not sure how the verify_page algorithm will work with variable sized
multi-page folios in the page cache, but that's a future problem...

I suspect we want to have xfs_attr_get() return us the xfs_buf that
contains the attribute value rather than a copy of the data itself.
Then we can pull the backing page from the xfs_buf here, grab a
reference to it and hand it back to the caller. 

All we need is a callout when the page ref is dropped by
verify_page() so that we can release the reference to the xfs_buf
that owns the page....

> +static int
> +xfs_write_merkle_tree_block(
> +	struct inode		*inode,
> +	const void		*buf,
> +	u64			index,
> +	int			log_blocksize)

"log blocksize"?

What's the journal got to do with writing a merkle tree block? :)

Hmmmm. The index is in units of blocks, and the size is always
one block. So to convert both size and index to units of bytes, we
have to shift upwards by "log_blocksize".

Yup, everyone immediately converts index to a byte offset, and
length to a byte count.

Eric, can we get this interface converted to pass an offset for the
index and a buffer length for the contents of *buf in bytes as
that's how everyone uses it?

> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	__be64			name = cpu_to_be64(index);
> +	struct xfs_da_args	args = {
> +		.dp		= ip,
> +		.whichfork	= XFS_ATTR_FORK,
> +		.attr_filter	= XFS_ATTR_VERITY,
> +		.attr_flags	= XATTR_CREATE,
> +		.name		= (const uint8_t *)&name,
> +		.namelen	= sizeof(__be64),
> +		.value		= (void *)buf,
> +		.valuelen	= 1 << log_blocksize,
> +	};

I'd prefer to store the merkle tree block as a byte offset (i.e.
index << log_blocksize) rather than a block index that is dependent
on something else to define the object size. That way we can
actually validate that the entire merkle tree is present (e.g. when
scrubbing/repairing inodes) without having to know what the merkle
tree block size is.

> +
> +	return xfs_attr_set(&args);
> +}
> +
> +const struct fsverity_operations xfs_verity_ops = {
> +	.begin_enable_verity = &xfs_begin_enable_verity,
> +	.end_enable_verity = &xfs_end_enable_verity,
> +	.get_verity_descriptor = &xfs_get_verity_descriptor,
> +	.read_merkle_tree_page = &xfs_read_merkle_tree_page,
> +	.write_merkle_tree_block = &xfs_write_merkle_tree_block,
> +};
> diff --git a/fs/xfs/xfs_verity.h b/fs/xfs/xfs_verity.h
> new file mode 100644
> index 0000000000000..ae5d87ca32a86
> --- /dev/null
> +++ b/fs/xfs/xfs_verity.h
> @@ -0,0 +1,19 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2022 Red Hat, Inc.
> + */
> +#ifndef __XFS_VERITY_H__
> +#define __XFS_VERITY_H__
> +
> +#include <linux/fsverity.h>
> +
> +#define XFS_VERITY_DESCRIPTOR_NAME "verity_descriptor"
> +#define XFS_VERITY_DESCRIPTOR_NAME_LEN 17

Need to put a

	BUILD_BUG_ON(strlen(XFS_VERITY_DESCRIPTOR_NAME) ==
			XFS_VERITY_DESCRIPTOR_NAME_LEN);

somewhere in the code - there's a function that checks on-disk
structure sizes somewhere like this (in xfs_ondisk.c?)....

Cheers,

Dave.

[RFC,10/11] xfs: add fs-verity support

Commit Message

Comments

Patch