Message ID | 20240304191046.157464-13-aalbersh@redhat.com (mailing list archive)
---|---
State | New, archived
Series | fs-verity support for XFS
On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> One of the essential ideas of fs-verity is that pages which are already
> verified won't need to be re-verified if they are still in the page
> cache.
>
> XFS will store Merkle tree blocks in extended file attributes. When
> read, extended attribute data is put into an xfs_buf.
>
> fs-verity uses the PG_checked flag to track the status of the blocks in
> the page. This flag can have two meanings - the page was re-instantiated
> and the only block in the page is verified.
>
> However, in XFS, the data in the buffer is not aligned with xfs_buf
> pages and we don't have a reference to these pages. Moreover, these
> pages are released when the value is copied out in the xfs_attr code.
> In other words, we can not directly mark the underlying xfs_buf's pages
> as verified as it's done by fs-verity for other filesystems.
>
> One way to track that these pages were processed by fs-verity is to
> mark the buffer as verified instead. If the buffer is evicted, the
> incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
> xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
> flag is then used to tell fs-verity whether this buffer was cached or
> not.
>
> The second state indicated by PG_checked is whether the only block in
> the PAGE is verified. This is not the case for XFS, as there could be
> multiple blocks in a single buffer (page size 64k, block size 4k). This
> is handled by the fs-verity bitmap. fs-verity always uses the bitmap
> for XFS regardless of the Merkle tree block size.
>
> The meaning of the flag is that the value of the extended attribute in
> the buffer is processed by fs-verity.
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/xfs/xfs_buf.h | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 73249abca968..2a73918193ba 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -24,14 +24,15 @@ struct xfs_buf;
>
>  #define XFS_BUF_DADDR_NULL ((xfs_daddr_t) (-1LL))
>
> -#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> -#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> -#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> -#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> -#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> -#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> -#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> +#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
> +#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
> +#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
> +#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
> +#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> +#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> +#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
> +#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
> +#define XBF_VERITY_SEEN	 (1u << 8) /* buffer was processed by fs-verity */

Yuck.  I still dislike this entire approach.

XBF_DOUBLE_ALLOC doubles the memory consumption of any xattr block that
gets loaded on behalf of a merkle tree request, then uses the extra
space to shadow the contents of the ondisk block.
AFAICT the shadow doesn't get updated even if the cached data does,
which sounds like a landmine for coherency issues.

XFS_DA_OP_BUFFER is a little gross, since I don't like the idea of
exposing the low level buffering details of the xattr code to
xfs_attr_get callers.

XBF_VERITY_SEEN is a layering violation because now the overall buffer
cache can track file metadata state.  I think the reason why you need
this state flag is because the datadev buffer target indexes based on
physical xfs_daddr_t, whereas merkle tree blocks have their own internal
block numbers.  You can't directly go from the merkle block number to an
xfs_daddr_t, so you can't use xfs_buf_incore to figure out if the block
fell out of memory.

ISTR asking for a separation of these indices when I reviewed some
previous version of this patchset.  At the time it seemed to me that a
much more efficient way to cache the merkle tree blocks would be to set
up a direct (merkle tree block number) -> (blob of data) lookup table.
That I don't see here.

In the spirit of the recent collaboration style that I've taken with
Christoph, I pulled your branch and started appending patches to it to
see if the design that I'm asking for is reasonable.  As it so happens,
I was working on a simplified version of the xfs buffer cache ("fsbuf")
that could be used by simple filesystems to get them off of buffer
heads.

(Ab)using the fsbuf code did indeed work (and passed all the fstests -g
verity tests), so now I know the idea is reasonable.  Patches 11, 12,
14, and 15 become unnecessary.
However, this solution is itself grossly overengineered, since all we
want are the following operations:

	peek(key): returns an fsbuf if there's any data cached for key

	get(key): returns an fsbuf for key, regardless of state

	store(fsbuf, p): attach a memory buffer p to fsbuf

Then the xfs ->read_merkle_tree_block function becomes:

	bp = peek(key)
	if (bp)
		/* return bp data up to verity */

	p = xfs_attr_get(key)
	if (!p)
		/* error */

	bp = get(key)
	store(bp, p)
	/* return bp data up to verity */

*Fortunately* you've also rebased against the 6.9 for-next branch, which
means I think there's an elegant solution at hand.  The online repair
patches that are in that branch make it possible to create a standalone
xfs_buftarg.  We can attach one of these to an xfs_inode to cache merkle
tree blocks.  xfs_bufs have well known usage semantics, they're already
hooked up to shrinkers so that we can reclaim the data if memory is
tight, and they can clean up a lot of code.

Could you take a look at my branch here:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-07

And tell me what you think?

--D

>  /* buffer type flags for write callbacks */
>  #define _XBF_INODES	 (1u << 16)/* inode buffer */
> @@ -65,6 +66,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ XBF_DONE,		"DONE" }, \
>  	{ XBF_STALE,		"STALE" }, \
>  	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
> +	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
> --
> 2.42.0
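The peek/get/store read path above can be modeled in a few lines of userspace C. This is only a sketch of the caching pattern, not kernel code: the fixed-size pointer table stands in for the proposed fsbuf cache, and `fetch_from_xattr()` is a hypothetical stand-in for `xfs_attr_get()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MERKLE_BLOCK_SIZE	4096
#define CACHE_SLOTS		64

/* One slot per merkle block number; stands in for the fsbuf cache. */
static void *cache[CACHE_SLOTS];

/* peek(key): return cached data for key, or NULL if nothing is cached. */
static void *cache_peek(uint64_t key)
{
	return key < CACHE_SLOTS ? cache[key] : NULL;
}

/* store(key, p): attach blob p to the cache for key. */
static void cache_store(uint64_t key, void *p)
{
	if (key < CACHE_SLOTS) {
		free(cache[key]);
		cache[key] = p;
	}
}

/* Hypothetical stand-in for reading the block out of the xattr fork. */
static void *fetch_from_xattr(uint64_t key)
{
	void *p = malloc(MERKLE_BLOCK_SIZE);

	if (p)
		memset(p, (int)(key & 0xff), MERKLE_BLOCK_SIZE);
	return p;
}

/* The shape of ->read_merkle_tree_block sketched in the mail above. */
static void *read_merkle_tree_block(uint64_t key)
{
	void *p = cache_peek(key);

	if (p)
		return p;	/* cached: hand straight back to fs-verity */

	p = fetch_from_xattr(key);
	if (!p)
		return NULL;	/* error */

	cache_store(key, p);
	return p;
}
```

A hit in `cache_peek()` is what lets fs-verity skip re-verification; a miss falls back to the xattr read and populates the cache for the next access.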
On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > [snip]
>
> Yuck.  I still dislike this entire approach.
>
> [snip]
> However, this solution is itself grossly overengineered, since all we
> want are the following operations:
>
> 	peek(key): returns an fsbuf if there's any data cached for key
>
> 	get(key): returns an fsbuf for key, regardless of state
>
> 	store(fsbuf, p): attach a memory buffer p to fsbuf
>
> Then the xfs ->read_merkle_tree_block function becomes:
>
> 	bp = peek(key)
> 	if (bp)
> 		/* return bp data up to verity */
>
> 	p = xfs_attr_get(key)
> 	if (!p)
> 		/* error */
>
> 	bp = get(key)
> 	store(bp, p)

Ok, that looks good - it definitely gets rid of a lot of the nastiness,
but I have to ask: why does it need to be based on xfs_bufs?  That's
just wasting 300 bytes of memory on a handle to store a key and an
opaque blob in a rhashtable.

IIUC, the key here is a sequential index, so an xarray would be a much
better choice as it doesn't require internal storage of the key.  i.e.

	p = xa_load(key);
	if (p)
		return p;

	xfs_attr_get(key);
	if (!args->value)
		/* error */

	/*
	 * store the current value, freeing any old value that we
	 * replaced at this key.  Don't care about failure to store,
	 * this is optimistic caching.
	 */
	p = xa_store(key, args->value, GFP_NOFS);
	if (p)
		kvfree(p);
	return args->value;

-Dave.
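The xa_load()/xa_store() flow above can likewise be modeled in userspace. A plain pointer array keyed directly by merkle block number plays the role of the kernel xarray, which is exactly the property being pointed at here: no key is stored inside the entry. `merkle_read_xattr()` is again a hypothetical stand-in for `xfs_attr_get()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MERKLE_BLOCK_SIZE	4096
#define NBLOCKS			128

/* index -> blob, playing the role of xa_load()/xa_store() */
static void *blob_cache[NBLOCKS];

/* Hypothetical stand-in for xfs_attr_get() filling args->value. */
static void *merkle_read_xattr(uint64_t key)
{
	void *p = malloc(MERKLE_BLOCK_SIZE);

	if (p)
		memset(p, 0xab, MERKLE_BLOCK_SIZE);
	return p;
}

static void *read_merkle_block(uint64_t key)
{
	void *p, *old;

	if (key >= NBLOCKS)
		return NULL;	/* out of range for this toy cache */

	p = blob_cache[key];	/* xa_load() */
	if (p)
		return p;	/* cache hit */

	p = merkle_read_xattr(key);
	if (!p)
		return NULL;	/* error */

	/*
	 * Store the current value, freeing any old value we replaced at
	 * this key; failing to cache would be harmless (optimistic
	 * caching, as in the quoted sketch).
	 */
	old = blob_cache[key];	/* xa_store() returns the old entry */
	blob_cache[key] = p;
	if (old)
		free(old);
	return p;
}
```

The kernel xarray additionally handles locking and sparse indices, which the flat array here deliberately ignores.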
On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > On Mon, Mar 04, 2024 at 08:10:34PM +0100, Andrey Albershteyn wrote:
> > > [snip]
> >
> > [snip]
> > However, this solution is itself grossly overengineered, since all
> > we want are the following operations:
> >
> > [snip]
>
> Ok, that looks good - it definitely gets rid of a lot of the
> nastiness, but I have to ask: why does it need to be based on
> xfs_bufs?

(copying from IRC) It was still warm in my brain L2 after all the xfile
buftarg cleaning and merging that just got done a few weeks ago.  So I
went with the simplest thing I could rig up to test my ideas, and now
we're at the madly iterate until exhaustion stage. ;)

> That's just wasting 300 bytes of memory on a handle to store a key and
> an opaque blob in a rhashtable.

Yep.  The fsbufs implementation was a lot more slender, but a bunch more
code.  I agree that I ought to go look at xarrays or something that's
more of a direct mapping as a next step.  However, I wanted to get
Andrey's feedback on this general approach first.

> IIUC, the key here is a sequential index, so an xarray would be a much
> better choice as it doesn't require internal storage of the key.

I wonder, what are the access patterns for merkle blobs?  Is it actually
sequential, or is it more like 0 -> N -> N*N as we walk towards leaves?

Also -- the fsverity block interfaces pass in a "u64 pos" argument.  Was
that done because merkle trees may some day have more than 2^32 blocks
in them?  That won't play well with things like xarrays on 32-bit
machines.
(Granted we've been talking about deprecating XFS on 32-bit for a while
now but we're not the whole world)

> i.e.
>
> 	p = xa_load(key);
> 	if (p)
> 		return p;
>
> 	xfs_attr_get(key);
> 	if (!args->value)
> 		/* error */
>
> 	/*
> 	 * store the current value, freeing any old value that we
> 	 * replaced at this key.  Don't care about failure to store,
> 	 * this is optimistic caching.
> 	 */
> 	p = xa_store(key, args->value, GFP_NOFS);
> 	if (p)
> 		kvfree(p);
> 	return args->value;

Attractive.  Will have to take a look at that tomorrow.

--D

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > On Thu, Mar 07, 2024 at 02:46:54PM -0800, Darrick J. Wong wrote:
> > > [snip]
> >
> > [snip]
>
> [snip]
> (Granted we've been talking about deprecating XFS on 32-bit for a
> while now but we're not the whole world)
>
> > i.e.
> >
> > 	[snip]
>
> Attractive.  Will have to take a look at that tomorrow.

Done.  I think.  Not sure that I actually got all the interactions
between the shrinker and the xarray correct though.  KASAN and lockdep
don't have any complaints running fstests, so that's a start.

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

--D
On Sat, Mar 09, 2024 at 08:28:28AM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote:
> > On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote:
> > > [snip]
> >
> > [snip]
> > > IIUC, the key here is a sequential index, so an xarray would be a
> > > much better choice as it doesn't require internal storage of the
> > > key.
> >
> > I wonder, what are the access patterns for merkle blobs?  Is it
> > actually sequential, or is it more like 0 -> N -> N*N as we walk
> > towards leaves?

I think the leaf level (i.e. individual record) access patterns largely
match data access patterns, so I'd just treat it as if it's a normal
file being accessed....

> > [snip]
>
> Done.  I think.  Not sure that I actually got all the interactions
> between the shrinker and the xarray correct though.  KASAN and lockdep
> don't have any complaints running fstests, so that's a start.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09

My initial impression is "over-engineered".  I personally would have
just allocated the xattr value buffer with a little extra size and added
all the external cache information (a reference counter is all we need
as these are fixed sized blocks) to the tail of the blob we actually
pass to fsverity.
If we tag the inode in the radix tree as having verity blobs that can
be freed, we can then just extend the existing fs superblock shrinker
callout to also walk all the verity inodes with cached data to try to
reclaim some objects...

But, if a generic blob cache is what it takes to move this forward,
so be it.

-Dave.
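[Editor's note: Dave's xa_load()/xa_store() sketch above is easy to model outside the kernel. The following is a userspace sketch of that optimistic-caching pattern only — the array-backed cache, fetch_block() and read_merkle_block() names are illustrative stand-ins, not the real xarray or xfs_attr_get() code:]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/*
 * Userspace model of the optimistic caching pattern: the lookup is a
 * plain indexed load, and a store replaces (and frees) any old value.
 * A real implementation would key a struct xarray by merkle block
 * index; all names here are hypothetical.
 */
#define CACHE_SLOTS 128

static char *cache[CACHE_SLOTS];
static int fetch_count;	/* how many times we fell back to "disk" */

/* stand-in for xfs_attr_get(): pretend to read the xattr from disk */
static char *fetch_block(unsigned long key)
{
	char *p = malloc(32);

	if (!p)
		return NULL;
	snprintf(p, 32, "merkle-block-%lu", key);
	fetch_count++;
	return p;
}

static char *read_merkle_block(unsigned long key)
{
	char *p, *old;

	p = cache[key % CACHE_SLOTS];	/* models xa_load() */
	if (p)
		return p;

	p = fetch_block(key);		/* models xfs_attr_get() */
	if (!p)
		return NULL;

	/*
	 * Models xa_store(): cache optimistically, freeing any old
	 * value we replaced. A store failure would simply mean the
	 * block is not cached this time around.
	 */
	old = cache[key % CACHE_SLOTS];
	cache[key % CACHE_SLOTS] = p;
	free(old);
	return p;
}
```

The point of the pattern is visible in the second lookup: a repeat read for the same key never touches the fetch path.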
On Mon, Mar 11, 2024 at 11:26:24AM +1100, Dave Chinner wrote: > On Sat, Mar 09, 2024 at 08:28:28AM -0800, Darrick J. Wong wrote: > > On Thu, Mar 07, 2024 at 07:31:38PM -0800, Darrick J. Wong wrote: > > > On Fri, Mar 08, 2024 at 12:59:56PM +1100, Dave Chinner wrote: > > > > > (Ab)using the fsbuf code did indeed work (and passed all the fstests -g > > > > > verity tests), so now I know the idea is reasonable. Patches 11, 12, > > > > > 14, and 15 become unnecessary. However, this solution is itself grossly > > > > > overengineered, since all we want are the following operations: > > > > > > > > > > peek(key): returns an fsbuf if there's any data cached for key > > > > > > > > > > get(key): returns an fsbuf for key, regardless of state > > > > > > > > > > store(fsbuf, p): attach a memory buffer p to fsbuf > > > > > > > > > > Then the xfs ->read_merkle_tree_block function becomes: > > > > > > > > > > bp = peek(key) > > > > > if (bp) > > > > > /* return bp data up to verity */ > > > > > > > > > > p = xfs_attr_get(key) > > > > > if (!p) > > > > > /* error */ > > > > > > > > > > bp = get(key) > > > > > store(bp, p) > > > > > > > > Ok, that looks good - it definitely gets rid of a lot of the > > > > nastiness, but I have to ask: why does it need to be based on > > > > xfs_bufs? > > > > > > (copying from IRC) It was still warm in my brain L2 after all the xfile > > > buftarg cleaning and merging that just got done a few weeks ago. So I > > > went with the simplest thing I could rig up to test my ideas, and now > > > we're at the madly iterate until exhaustion stage. ;) > > > > > > > That's just wasting 300 bytes of memory on a handle to > > > > store a key and a opaque blob in a rhashtable. > > > > > > Yep. The fsbufs implementation was a lot more slender, but a bunch more > > > code. I agree that I ought to go look at xarrays or something that's > > > more of a direct mapping as a next step. However, i wanted to get > > > Andrey's feedback on this general approach first. 
> > > > > > > IIUC, the key here is a sequential index, so an xarray would be a > > > > much better choice as it doesn't require internal storage of the > > > > key. > > > > > > I wonder, what are the access patterns for merkle blobs? Is it actually > > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves? > > I think the leaf level (i.e. individual record) access patterns > largely match data access patterns, so I'd just treat it like as if > it's a normal file being accessed.... <nod> The latest version of this tries to avoid letting reclaim take the top of the tree. Logically this makes sense to me to reduce read verify latency, but I was hoping Eric or Andrey or someone with more familiarity with fsverity would chime in on whether or not that made sense. > > > Also -- the fsverity block interfaces pass in a "u64 pos" argument. Was > > > that done because merkle trees may some day have more than 2^32 blocks > > > in them? That won't play well with things like xarrays on 32-bit > > > machines. > > > > > > (Granted we've been talking about deprecating XFS on 32-bit for a while > > > now but we're not the whole world) > > > > > > > i.e. > > > > > > > > p = xa_load(key); > > > > if (p) > > > > return p; > > > > > > > > xfs_attr_get(key); > > > > if (!args->value) > > > > /* error */ > > > > > > > > /* > > > > * store the current value, freeing any old value that we > > > > * replaced at this key. Don't care about failure to store, > > > > * this is optimistic caching. > > > > */ > > > > p = xa_store(key, args->value, GFP_NOFS); > > > > if (p) > > > > kvfree(p); > > > > return args->value; > > > > > > Attractive. Will have to take a look at that tomorrow. > > > > Done. I think. Not sure that I actually got all the interactions > > between the shrinker and the xarray correct though. KASAN and lockdep > > don't have any complaints running fstests, so that's a start. 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09 > > My initial impression is "over-engineered". Overly focused on unfamiliar data structures -- I've never built anything with an xarray before. > I personally would have just allocated the xattr value buffer with a > little extra size and added all the external cache information (a > reference counter is all we need as these are fixed sized blocks) to > the tail of the blob we actually pass to fsverity. Oho, that's a good idea. I didn't like the separate allocation anyway. Friday was mostly chatting with willy and trying to make sure I got the xarray access patterns correct. > If we tag the > inode in the radix tree as having verity blobs that can be freed, we > can then just extend the existing fs sueprblock shrinker callout to > also walk all the verity inodes with cached data to try to reclaim > some objects... This too is a wonderful suggestion -- use the third radix tree tag to mark inodes with extra incore caches that can be reclaimed, then teach xfs_reclaim_inodes_{nr,count} to scan them. Allocating a per-inode shrinker was likely to cause problems with flooding debugfs with too many knobs anyway. > But, if a generic blob cache is what it takes to move this forwards, > so be it. Not necessarily. ;) --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote: > > > > I wonder, what are the access patterns for merkle blobs? Is it actually > > > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves? > > > > I think the leaf level (i.e. individual record) access patterns > > largely match data access patterns, so I'd just treat it like as if > > it's a normal file being accessed.... > > <nod> The latest version of this tries to avoid letting reclaim take the > top of the tree. Logically this makes sense to me to reduce read verify > latency, but I was hoping Eric or Andrey or someone with more > familiarity with fsverity would chime in on whether or not that made > sense. The levels are stored ordered from root to leaf; they aren't interleaved. So technically the overall access pattern isn't sequential, but it is sequential within each level, and the leaf level is over 99% of the accesses anyway (assuming the default parameters where each block has 128 children). It makes sense to try to keep the top levels, i.e. the blocks with the lowest indices, cached. (The other filesystems that support fsverity currently aren't doing anything special to treat them differently from other pagecache pages.) - Eric
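[Editor's note: Eric's point about leaf dominance can be checked with the default geometry. This is a sketch assuming 4096-byte Merkle tree blocks and 32-byte SHA-256 digests (so 128 children per block); merkle_tree_blocks() is a hypothetical helper for illustration, not fs-verity code:]

```c
#include <assert.h>

/*
 * Count Merkle tree blocks per level for a file of data_blocks
 * 4k blocks, assuming the default fs-verity parameters (4096-byte
 * tree blocks, 32-byte digests => 128 children per tree block).
 * Returns the total tree block count; *leaf_blocks gets the
 * leaf-level count.
 */
static unsigned long long merkle_tree_blocks(unsigned long long data_blocks,
					     unsigned long long *leaf_blocks)
{
	const unsigned long long per_block = 4096 / 32;	/* 128 children */
	unsigned long long level = (data_blocks + per_block - 1) / per_block;
	unsigned long long total = 0;

	if (leaf_blocks)
		*leaf_blocks = level;
	/* each level hashes the one below it, up to a single root block */
	while (level > 1) {
		total += level;
		level = (level + per_block - 1) / per_block;
	}
	return total + 1;	/* the root block */
}
```

For a 1 GiB file (262144 data blocks) this gives 2048 leaf blocks, 16 mid-level blocks and 1 root — the leaf level is over 99% of the tree, which is why keeping only the low-index (upper-level) blocks pinned is cheap.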
On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote: <snip> > <nod> The latest version of this tries to avoid letting reclaim take the > top of the tree. Logically this makes sense to me to reduce read verify > latency, but I was hoping Eric or Andrey or someone with more > familiarity with fsverity would chime in on whether or not that made > sense. > > > > > Also -- the fsverity block interfaces pass in a "u64 pos" argument. Was > > > > that done because merkle trees may some day have more than 2^32 blocks > > > > in them? That won't play well with things like xarrays on 32-bit > > > > machines. > > > > > > > > (Granted we've been talking about deprecating XFS on 32-bit for a while > > > > now but we're not the whole world) > > > > > > > > > i.e. > > > > > > > > > > p = xa_load(key); > > > > > if (p) > > > > > return p; > > > > > > > > > > xfs_attr_get(key); > > > > > if (!args->value) > > > > > /* error */ > > > > > > > > > > /* > > > > > * store the current value, freeing any old value that we > > > > > * replaced at this key. Don't care about failure to store, > > > > > * this is optimistic caching. > > > > > */ > > > > > p = xa_store(key, args->value, GFP_NOFS); > > > > > if (p) > > > > > kvfree(p); > > > > > return args->value; > > > > > > > > Attractive. Will have to take a look at that tomorrow. > > > > > > Done. I think. Not sure that I actually got all the interactions > > > between the shrinker and the xarray correct though. KASAN and lockdep > > > don't have any complaints running fstests, so that's a start. > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fsverity-cleanups-6.9_2024-03-09 > > > > My initial impression is "over-engineered". > > Overly focused on unfamiliar data structures -- I've never built > anything with an xarray before. 
> > > I personally would have just allocated the xattr value buffer with a > > little extra size and added all the external cache information (a > > reference counter is all we need as these are fixed sized blocks) to > > the tail of the blob we actually pass to fsverity. > > Oho, that's a good idea. I didn't like the separate allocation anyway. > Friday was mostly chatting with willy and trying to make sure I got the > xarray access patterns correct. > > > If we tag the > > inode in the radix tree as having verity blobs that can be freed, we > > can then just extend the existing fs sueprblock shrinker callout to > > also walk all the verity inodes with cached data to try to reclaim > > some objects... > > This too is a wonderful suggestion -- use the third radix tree tag to > mark inodes with extra incore caches that can be reclaimed, then teach > xfs_reclaim_inodes_{nr,count} to scan them. Allocating a per-inode > shrinker was likely to cause problems with flooding debugfs with too > many knobs anyway. > > > But, if a generic blob cache is what it takes to move this forwards, > > so be it. > > Not necessarily. ;) And here's today's branch, with xfs_blobcache.[ch] removed and a few more cleanups: https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11 --D > > --D > > > -Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > >
On Mon, Mar 11, 2024 at 07:43:20PM -0700, Eric Biggers wrote: > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote: > > > > > I wonder, what are the access patterns for merkle blobs? Is it actually > > > > > sequential, or is more like 0 -> N -> N*N as we walk towards leaves? > > > > > > I think the leaf level (i.e. individual record) access patterns > > > largely match data access patterns, so I'd just treat it like as if > > > it's a normal file being accessed.... > > > > <nod> The latest version of this tries to avoid letting reclaim take the > > top of the tree. Logically this makes sense to me to reduce read verify > > latency, but I was hoping Eric or Andrey or someone with more > > familiarity with fsverity would chime in on whether or not that made > > sense. > > The levels are stored ordered from root to leaf; they aren't interleaved. So > technically the overall access pattern isn't sequential, but it is sequential > within each level, and the leaf level is over 99% of the accesses anyway > (assuming the default parameters where each block has 128 children). <nod> > It makes sense to try to keep the top levels, i.e. the blocks with the lowest > indices, cached. (The other filesystems that support fsverity currently aren't > doing anything special to treat them differently from other pagecache pages.) <nod> --D > - Eric >
On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote:
> > > But, if a generic blob cache is what it takes to move this forwards,
> > > so be it.
> >
> > Not necessarily. ;)
>
> And here's today's branch, with xfs_blobcache.[ch] removed and a few
> more cleanups:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11

Walking all the inodes counting all the verity blobs in the shrinker
is going to be -expensive-. Shrinkers are run very frequently and
with high concurrency under memory pressure by direct reclaim, and
every single shrinker execution is going to run that traversal even
if it is decided there is nothing that can be shrunk.

IMO, it would be better to keep a count of reclaimable objects
either on the inode itself (protected by the xa_lock when
adding/removing) to avoid needing to walk the xarray to count the
blocks on the inode. Even better would be a counter in the perag or
a global percpu counter in the mount of all cached objects. Both of
those pretty much remove all the shrinker-side counting overhead.

Couple of other small things.

- verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It
  really has nothing to do with the icache other than calling
  xfs_icwalk(). That gets rid of some of the config ifdefs.

- SHRINK_STOP is what should be returned by the scan when
  xfs_verity_shrinker_scan() wants the shrinker to immediately stop,
  not LONG_MAX.

- In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
  how many objects to try to free, not how many we must free. i.e.
  even if we can't free objects, they are still objects that got
  scanned and so should decrement nr_to_scan...

-Dave.
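[Editor's note: the nr_to_scan rule in Dave's last point can be modelled in userspace. Everything below — the SHRINK_STOP value, the object arrays, shrink_scan() itself — is an illustrative model of the counting semantics, not the kernel shrinker API:]

```c
#include <assert.h>

/*
 * Model of scan-side shrinker accounting: every object examined
 * counts against nr_to_scan, whether or not it could be freed; the
 * return value is how many objects actually got freed. A real
 * implementation would return SHRINK_STOP to abort the scan early.
 */
#define SHRINK_STOP	(~0UL)
#define NR_OBJECTS	64

static int object_busy[NR_OBJECTS];	/* nonzero: can't be freed now */
static int object_present[NR_OBJECTS];

static unsigned long shrink_scan(unsigned long nr_to_scan)
{
	unsigned long freed = 0;
	int i;

	for (i = 0; i < NR_OBJECTS && nr_to_scan; i++) {
		if (!object_present[i])
			continue;
		/*
		 * Every object we look at is "scanned", so it counts
		 * against nr_to_scan even if we can't free it.
		 */
		nr_to_scan--;
		if (object_busy[i])
			continue;
		object_present[i] = 0;
		freed++;
	}
	return freed;
}
```

So a scan asked to look at 5 objects of which 2 are busy reports 3 freed, and the busy objects still consumed scan budget — which is the distinction between "try to free" and "must free" above.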
On Tue, Mar 12, 2024 at 06:01:01PM +1100, Dave Chinner wrote: > On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote: > > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote: > > > > But, if a generic blob cache is what it takes to move this forwards, > > > > so be it. > > > > > > Not necessarily. ;) > > > > And here's today's branch, with xfs_blobcache.[ch] removed and a few > > more cleanups: > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11 > > Walking all the inodes counting all the verity blobs in the shrinker > is going to be -expensive-. Shrinkers are run very frequently and > with high concurrency under memory pressure by direct reclaim, and > every single shrinker execution is going to run that traversal even > if it is decided there is nothing that can be shrunk. > > IMO, it would be better to keep a count of reclaimable objects > either on the inode itself (protected by the xa_lock when > adding/removing) to avoid needing to walk the xarray to count the > blocks on the inode. Even better would be a counter in the perag or > a global percpu counter in the mount of all caches objects. Both of > those pretty much remove all the shrinker side counting overhead. I went with a global percpu counter, let's see if lockdep/kasan have anything to say about my new design. :P > Couple of other small things. > > - verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It > really has nothing to do with the icache other than calling > xfs_icwalk(). That gets rid of some of the config ifdefs. Done. > - SHRINK_STOP is what should be returned by the scan when > xfs_verity_shrinker_scan() wants the shrinker to immediately stop, > not LONG_MAX. Aha. Ok, thanks for the tipoff. ;) > - In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of > how many object to try to free, not how many we must free. i.e. 
> even if we can't free objects, they are still objects that got > scanned and so should decement nr_to_scan... <nod> Is there any way for ->scan_objects to tell its caller how much memory it actually freed? Or does it only know about objects? I suppose "number of bytes freed" wouldn't be that helpful since someone else could allocate all the freed memory immediately anyway. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Tue, Mar 12, 2024 at 01:04:24PM -0700, Darrick J. Wong wrote: > On Tue, Mar 12, 2024 at 06:01:01PM +1100, Dave Chinner wrote: > > On Mon, Mar 11, 2024 at 07:45:07PM -0700, Darrick J. Wong wrote: > > > On Mon, Mar 11, 2024 at 08:25:05AM -0700, Darrick J. Wong wrote: > > > > > But, if a generic blob cache is what it takes to move this forwards, > > > > > so be it. > > > > > > > > Not necessarily. ;) > > > > > > And here's today's branch, with xfs_blobcache.[ch] removed and a few > > > more cleanups: > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=fsverity-cleanups-6.9_2024-03-11 > > > > Walking all the inodes counting all the verity blobs in the shrinker > > is going to be -expensive-. Shrinkers are run very frequently and > > with high concurrency under memory pressure by direct reclaim, and > > every single shrinker execution is going to run that traversal even > > if it is decided there is nothing that can be shrunk. > > > > IMO, it would be better to keep a count of reclaimable objects > > either on the inode itself (protected by the xa_lock when > > adding/removing) to avoid needing to walk the xarray to count the > > blocks on the inode. Even better would be a counter in the perag or > > a global percpu counter in the mount of all caches objects. Both of > > those pretty much remove all the shrinker side counting overhead. > > I went with a global percpu counter, let's see if lockdep/kasan have > anything to say about my new design. :P > > > Couple of other small things. > > > > - verity shrinker belongs in xfs_verity.c, not xfs_icache.c. It > > really has nothing to do with the icache other than calling > > xfs_icwalk(). That gets rid of some of the config ifdefs. > > Done. > > > - SHRINK_STOP is what should be returned by the scan when > > xfs_verity_shrinker_scan() wants the shrinker to immediately stop, > > not LONG_MAX. > > Aha. Ok, thanks for the tipoff. 
;)

> > - In xfs_verity_cache_shrink_scan(), the nr_to_scan is a count of
> > how many object to try to free, not how many we must free. i.e.
> > even if we can't free objects, they are still objects that got
> > scanned and so should decement nr_to_scan...
>
> <nod>
>
> Is there any way for ->scan_objects to tell its caller how much memory
> it actually freed? Or does it only know about objects?

The shrinker infrastructure itself only concerns itself with object
counts. It's almost impossible to balance slab caches by memory used
because the objects are of different sizes. For example, it's easy to
achieve a 1:1 balance for dentry objects and inode objects when
shrinking is done by object count. Now try getting that balance when
individual shrinkers and the shrinker infrastructure don't know the
relative size of the objects in the caches they are trying to balance
against.

For shmem and fuse inodes it is a 4:1 size difference between a dentry
and the inode, XFS inodes is 5:1, ext4 it's 6:1, etc. Yet they all
want the same 1:1 object count balance with the dentry cache, and the
same shrinker implementation has to manage that...

IOWs, there are two key scan values - a control value to tell the
shrinker scan how many objects to scan, and a return value that tells
the shrinker how many objects that specific scan freed.

> I suppose
> "number of bytes freed" wouldn't be that helpful since someone else
> could allocate all the freed memory immediately anyway.

The problem is that there isn't a direct relationship between "object
freed" and "memory available for allocation" with slab cache
shrinkers. If you look at the back end of SLUB freeing, when it frees
a backing page for a cache because it is now empty (all objects freed)
it calls mm_account_reclaimed_pages() to account for the page being
freed. We do the same in xfs_buf_free_pages() to account for the fact
that the object being freed also freed a bunch of extra pages not
directly accounted to the shrinker.
This is the feedback to the memory reclaim process to determine how
much progress was made by the entire shrinker scan. Memory reclaim
doesn't actually care how many objects were freed; it uses hidden
accounting to track how many pages actually got freed by the overall
'every shrinker scan' call.

IOWs, object-count-based shrinking makes for relatively simple
cross-cache balance tuning, whilst the actual memory-freed accounting
by scans is hidden in the background...

In the case of this cache, it might be worthwhile adding something
like this for vmalloc()d objects:

	if (is_vmalloc_addr(ptr))
		mm_account_reclaimed_pages(how_many_pages(ptr))
	kvfree(ptr)

as I think that anything allocated from the SLUB heap by kvmalloc()
is already accounted to reclaim by the SLUB freeing code.

-Dave.
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 73249abca968..2a73918193ba 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -24,14 +24,15 @@ struct xfs_buf;
 
 #define XFS_BUF_DADDR_NULL	((xfs_daddr_t) (-1LL))
 
-#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
-#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
-#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
-#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
-#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
-#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
-#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
-#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
+#define XBF_READ	 (1u << 0) /* buffer intended for reading from device */
+#define XBF_WRITE	 (1u << 1) /* buffer intended for writing to device */
+#define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
+#define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
+#define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
+#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
+#define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
+#define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
+#define XBF_VERITY_SEEN	 (1u << 8) /* buffer was processed by fs-verity */
 
 /* buffer type flags for write callbacks */
 #define _XBF_INODES	 (1u << 16)/* inode buffer */
@@ -65,6 +66,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_WRITE_FAIL,	"WRITE_FAIL" }, \
+	{ XBF_VERITY_SEEN,	"VERITY_SEEN" }, \
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
One of the essential ideas of fs-verity is that pages which are
already verified won't need to be re-verified if they are still in the
page cache.

XFS will store Merkle tree blocks in extended file attributes. When
read, the extended attribute data is put into an xfs_buf.

fs-verity uses the PG_checked flag to track the status of the blocks
in a page. This flag can have two meanings - the page was
re-instantiated, and the only block in the page is verified.

However, in XFS, the data in the buffer is not aligned with the
xfs_buf pages and we don't have a reference to these pages. Moreover,
these pages are released when the value is copied out in the xfs_attr
code. In other words, we can not directly mark the underlying
xfs_buf's pages as verified as it's done by fs-verity for other
filesystems.

One way to track that these pages were processed by fs-verity is to
mark the buffer as verified instead. If the buffer is evicted, the
incore XBF_VERITY_SEEN flag is lost. When the xattr is read again,
xfs_attr_get() returns a new buffer without the flag. The xfs_buf's
flag is then used to tell fs-verity whether this buffer was cached or
not.

The second state indicated by PG_checked is whether the only block in
the PAGE is verified. This is not the case for XFS, as there can be
multiple blocks in a single buffer (page size 64k, block size 4k).
This is handled by the fs-verity bitmap. fs-verity always uses the
bitmap for XFS regardless of the Merkle tree block size.

The meaning of the flag is that the value of the extended attribute in
the buffer has been processed by fs-verity.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/xfs/xfs_buf.h | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)