
[v5,06/24] fsverity: pass tree_blocksize to end_enable_verity()

Message ID 20240304191046.157464-8-aalbersh@redhat.com (mailing list archive)
State Superseded
Series fs-verity support for XFS

Commit Message

Andrey Albershteyn March 4, 2024, 7:10 p.m. UTC
XFS will need to know tree_blocksize to remove the tree in case of an
error. The size is needed to calculate offsets of particular Merkle
tree blocks.

Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
---
 fs/btrfs/verity.c        | 4 +++-
 fs/ext4/verity.c         | 3 ++-
 fs/f2fs/verity.c         | 3 ++-
 fs/verity/enable.c       | 6 ++++--
 include/linux/fsverity.h | 4 +++-
 5 files changed, 14 insertions(+), 6 deletions(-)

Comments

Eric Biggers March 5, 2024, 12:52 a.m. UTC | #1
On Mon, Mar 04, 2024 at 08:10:29PM +0100, Andrey Albershteyn wrote:
> XFS will need to know tree_blocksize to remove the tree in case of an
> error. The size is needed to calculate offsets of particular Merkle
> tree blocks.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> ---
>  fs/btrfs/verity.c        | 4 +++-
>  fs/ext4/verity.c         | 3 ++-
>  fs/f2fs/verity.c         | 3 ++-
>  fs/verity/enable.c       | 6 ++++--
>  include/linux/fsverity.h | 4 +++-
>  5 files changed, 14 insertions(+), 6 deletions(-)

How will XFS handle dropping a file's incomplete tree if the system crashes
while it's being built?

I think this is why none of the other filesystems have needed the tree_blocksize
in ->end_enable_verity() yet.  They need to be able to drop the tree from just
the information the filesystem has on-disk anyway.  ext4 and f2fs just truncate
past EOF, while btrfs has some code that finds all the verity metadata items and
deletes them (see btrfs_drop_verity_items() in fs/btrfs/verity.c).

Technically you don't *have* to drop incomplete trees, since it shouldn't cause
a behavior difference.  But it seems like something that should be cleaned up.

- Eric
Darrick J. Wong March 6, 2024, 4:30 p.m. UTC | #2
On Mon, Mar 04, 2024 at 04:52:42PM -0800, Eric Biggers wrote:
> On Mon, Mar 04, 2024 at 08:10:29PM +0100, Andrey Albershteyn wrote:
> > XFS will need to know tree_blocksize to remove the tree in case of an
> > error. The size is needed to calculate offsets of particular Merkle
> > tree blocks.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
> > ---
> >  fs/btrfs/verity.c        | 4 +++-
> >  fs/ext4/verity.c         | 3 ++-
> >  fs/f2fs/verity.c         | 3 ++-
> >  fs/verity/enable.c       | 6 ++++--
> >  include/linux/fsverity.h | 4 +++-
> >  5 files changed, 14 insertions(+), 6 deletions(-)
> 
> How will XFS handle dropping a file's incomplete tree if the system crashes
> while it's being built?

AFAICT it simply leaves the half-constructed tree in the xattr data.

> I think this is why none of the other filesystems have needed the tree_blocksize
> in ->end_enable_verity() yet.  They need to be able to drop the tree from just
> the information the filesystem has on-disk anyway.  ext4 and f2fs just truncate
> past EOF, while btrfs has some code that finds all the verity metadata items and
> deletes them (see btrfs_drop_verity_items() in fs/btrfs/verity.c).
> 
> Technically you don't *have* to drop incomplete trees, since it shouldn't cause
> a behavior difference.  But it seems like something that should be cleaned up.

If it's required that a failed FS_IOC_ENABLE_VERITY clean up the
unfinished merkle tree, then you'd have to introduce some kind of log
intent item to roll back blocks from an unfinished tree.  That log
item has to be committed as the first item in a chain of transactions,
each of which adds a merkle tree block and relogs the rollback item.
When we finish the tree, we log a done item to whiteout the intent.

Log recovery, upon finding an intent without the done item, will replay
the intent, which (in this case) will remove all the blocks.  In theory
a failed merkle tree commit also should do this, but most likely that
will cause an fs shutdown anyway.

(That's a fair amount of work.)

Or you could leave the unfinished tree as-is; that will waste space, but
if userspace tries again, the xattr code will replace the old merkle
tree block contents with the new ones.  This assumes that we're not
using XATTR_CREATE during FS_IOC_ENABLE_VERITY.

--D

> 
> - Eric
>
Eric Biggers March 7, 2024, 10:02 p.m. UTC | #3
On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> Or you could leave the unfinished tree as-is; that will waste space, but
> if userspace tries again, the xattr code will replace the old merkle
> tree block contents with the new ones.  This assumes that we're not
> using XATTR_CREATE during FS_IOC_ENABLE_VERITY.

This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
that was interrupted and the one that completed, there may be extra Merkle tree
blocks left over.

BTW, is xfs_repair planned to do anything about any such extra blocks?

- Eric
Darrick J. Wong March 8, 2024, 3:46 a.m. UTC | #4
On Thu, Mar 07, 2024 at 02:02:24PM -0800, Eric Biggers wrote:
> On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> > Or you could leave the unfinished tree as-is; that will waste space, but
> > if userspace tries again, the xattr code will replace the old merkle
> > tree block contents with the new ones.  This assumes that we're not
> > using XATTR_CREATE during FS_IOC_ENABLE_VERITY.
> 
> This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
> that was interrupted and the one that completed, there may be extra Merkle tree
> blocks left over.

What if ->enable_begin walked the xattrs and trimmed out any verity
xattrs that were already there?  Though I think ->enable_end actually
could do this since one of the args is the tree size, right?

(Hmm that still wouldn't be any guarantee of success since those xattr
remove calls are each separate transactions...)

> BTW, is xfs_repair planned to do anything about any such extra blocks?

Sorry to answer your question with a question, but how much checking is
$filesystem expected to do for merkle trees?

In theory xfs_repair could learn how to interpret the verity descriptor,
walk the merkle tree blocks, and even read the file data to confirm
intactness.  If the descriptor specifies the highest block address then
we could certainly trim off excess blocks.  But I don't know how much of
libfsverity actually lets you do that; I haven't looked into that
deeply. :/

For xfs_scrub I guess the job is theoretically simpler, since we only
need to stream reads of the verity files through the page cache and let
verity tell us if the file data are consistent.

For both tools, if something finds errors in the merkle tree structure
itself, do we turn off verity?  Or do we do something nasty like
truncate the file?

Is there an ioctl or something that allows userspace to validate an
entire file's contents?  Sort of like what BLKVERIFY would have done for
block devices, except that we might believe its answers?

Also -- inconsistencies between the file data and the merkle tree aren't
something that xfs can self-heal, right?

--D

> - Eric
>
Eric Biggers March 8, 2024, 4:40 a.m. UTC | #5
On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > BTW, is xfs_repair planned to do anything about any such extra blocks?
> 
> Sorry to answer your question with a question, but how much checking is
> $filesystem expected to do for merkle trees?
> 
> In theory xfs_repair could learn how to interpret the verity descriptor,
> walk the merkle tree blocks, and even read the file data to confirm
> intactness.  If the descriptor specifies the highest block address then
> we could certainly trim off excess blocks.  But I don't know how much of
> libfsverity actually lets you do that; I haven't looked into that
> deeply. :/
> 
> For xfs_scrub I guess the job is theoretically simpler, since we only
> need to stream reads of the verity files through the page cache and let
> verity tell us if the file data are consistent.
> 
> For both tools, if something finds errors in the merkle tree structure
> itself, do we turn off verity?  Or do we do something nasty like
> truncate the file?

As far as I know (I haven't been following btrfs-progs, but I'm familiar with
e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
validating the data of verity inodes against their Merkle trees.

e2fsck does delete the verity metadata of inodes that don't have the verity flag
enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.

I suppose that ideally, if an inode's verity metadata is invalid, then fsck
should delete that inode's verity metadata and remove the verity flag from the
inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
fairly straightforward, but it probably wouldn't catch much compared to actually
validating the data against the Merkle tree.  And actually validating the data
against the Merkle tree would be complex and expensive.  Note, none of this
would work on files that are encrypted.

Re: libfsverity, I think it would be possible to validate a Merkle tree using
libfsverity_compute_digest() and the callbacks that it supports.  But that's not
quite what it was designed for.

> Is there an ioctl or something that allows userspace to validate an
> entire file's contents?  Sort of like what BLKVERIFY would have done for
> block devices, except that we might believe its answers?

Just reading the whole file and seeing whether you get an error would do it.

Though if you want to make sure it's really re-reading the on-disk data, it's
necessary to drop the file's pagecache first.

> Also -- inconsistencies between the file data and the merkle tree aren't
> something that xfs can self-heal, right?

Similar to the file data itself, the only way to self-heal would be via mechanisms
that provide redundancy.  There's been some interest in adding support for forward
error correction (FEC) to fsverity similar to what dm-verity has, but this would be
complex, and it's not something that anyone has gotten around to yet.

- Eric
Dave Chinner March 8, 2024, 9:34 p.m. UTC | #6
On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> On Thu, Mar 07, 2024 at 02:02:24PM -0800, Eric Biggers wrote:
> > On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> > > Or you could leave the unfinished tree as-is; that will waste space, but
> > > if userspace tries again, the xattr code will replace the old merkle
> > > tree block contents with the new ones.  This assumes that we're not
> > > using XATTR_CREATE during FS_IOC_ENABLE_VERITY.
> > 
> > This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
> > that was interrupted and the one that completed, there may be extra Merkle tree
> > blocks left over.
> 
> What if ->enable_begin walked the xattrs and trimmed out any verity
> xattrs that were already there?  Though I think ->enable_end actually
> could do this since one of the args is the tree size, right?

If we are overwriting xattrs, it's effectively a remove then a new
create operation, so we may as well just add an XFS_ATTR_VERITY
namespace invalidation filter that removes any xattr in that
namespace in ->enable_begin...

> > BTW, is xfs_repair planned to do anything about any such extra blocks?
> 
> Sorry to answer your question with a question, but how much checking is
> $filesystem expected to do for merkle trees?
> 
> In theory xfs_repair could learn how to interpret the verity descriptor,
> walk the merkle tree blocks, and even read the file data to confirm
> intactness.  If the descriptor specifies the highest block address then
> we could certainly trim off excess blocks.  But I don't know how much of
> libfsverity actually lets you do that; I haven't looked into that
> deeply. :/

Perhaps a generic fsverity userspace checking library we can link in
to fs utilities like e2fsck and xfs_repair is the way to go here.
That way any filesystem that supports fsverity can do offline
validation of the merkle tree after checking the metadata is OK if
desired.

> For xfs_scrub I guess the job is theoretically simpler, since we only
> need to stream reads of the verity files through the page cache and let
> verity tell us if the file data are consistent.

*nod*

> For both tools, if something finds errors in the merkle tree structure
> itself, do we turn off verity?  Or do we do something nasty like
> truncate the file?

Mark it as "data corrupt" in terms of generic XFS health status, and
leave it up to the user to repair the data and/or recalc the merkle
tree, depending on what they find when they look at the corrupt file
status.

> Is there an ioctl or something that allows userspace to validate an
> entire file's contents?  Sort of like what BLKVERIFY would have done for
> block devices, except that we might believe its answers?
> 
> Also -- inconsistencies between the file data and the merkle tree aren't
> something that xfs can self-heal, right?

Not that I know of - the file data has to be validated before we can
tell if the error is in the data or the merkle tree, and only the
user can validate the data is correct.

-Dave.
Darrick J. Wong March 9, 2024, 4:19 p.m. UTC | #7
On Sat, Mar 09, 2024 at 08:34:51AM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > On Thu, Mar 07, 2024 at 02:02:24PM -0800, Eric Biggers wrote:
> > > On Wed, Mar 06, 2024 at 08:30:00AM -0800, Darrick J. Wong wrote:
> > > > Or you could leave the unfinished tree as-is; that will waste space, but
> > > > if userspace tries again, the xattr code will replace the old merkle
> > > > tree block contents with the new ones.  This assumes that we're not
> > > > using XATTR_CREATE during FS_IOC_ENABLE_VERITY.
> > > 
> > > This should work, though if the file was shrunk between the FS_IOC_ENABLE_VERITY
> > > that was interrupted and the one that completed, there may be extra Merkle tree
> > > blocks left over.
> > 
> > What if ->enable_begin walked the xattrs and trimmed out any verity
> > xattrs that were already there?  Though I think ->enable_end actually
> > could do this since one of the args is the tree size, right?
> 
> If we are overwriting xattrs, it's effectively a remove then a new
> create operation, so we may as well just add a XFS_ATTR_VERITY
> namespace invalidation filter that removes any xattr in that
> namespace in ->enable_begin...

Yeah, that sounds like a good idea.  One nice aspect of the generic
listxattr code (aka not the simplified one that scrub uses) is that the
cursor tracking means that we could actually iterate-and-zap old merkle
tree blocks.

If we know the size of the merkle tree ahead of time (say it's N blocks)
then we just start zapping N, then N+1, etc. until we don't find any
more.  That wouldn't be exhaustive, but it's good enough to catch most
cases.

Online fsck should, however, have a way to call ensure_verity_info() so
that it can scan the xattrs looking for merkle tree blocks beyond
tree_size, missing merkle tree blocks within tree_size, missing
descriptors, etc.  It looks like the merkle tree block contents are
entirely hashes (no sibling/child/parent pointers, block headers, etc.)
so there's not a lot to check in the tree structure.  It looks pretty
similar to flattening a heap into a linear array.

> > > BTW, is xfs_repair planned to do anything about any such extra blocks?
> > 
> > Sorry to answer your question with a question, but how much checking is
> > $filesystem expected to do for merkle trees?
> > 
> > In theory xfs_repair could learn how to interpret the verity descriptor,
> > walk the merkle tree blocks, and even read the file data to confirm
> > intactness.  If the descriptor specifies the highest block address then
> > we could certainly trim off excess blocks.  But I don't know how much of
> > libfsverity actually lets you do that; I haven't looked into that
> > deeply. :/
> 
> Perhaps a generic fsverity userspace checking library we can link in
> to fs utilities like e2fsck and xfs_repair is the way to go here.
> That way any filesystem that supports fsverity can do offline
> validation of the merkle tree after checking the metadata is OK if
> desired.

That'd be nice.  Does the above checking sound reasonable? :)

> > For xfs_scrub I guess the job is theoretically simpler, since we only
> > need to stream reads of the verity files through the page cache and let
> > verity tell us if the file data are consistent.
> 
> *nod*

I had another thought overnight -- regular read()s incur the cost of
copying pagecache contents to userspace.  Do we really care about that,
though?  In theory we could mmap verity file contents and then use
MADV_POPULATE_READ to pull in the page cache and return error codes.  No
copying, and fewer syscalls.

> > For both tools, if something finds errors in the merkle tree structure
> > itself, do we turn off verity?  Or do we do something nasty like
> > truncate the file?
> 
> Mark it as "data corrupt" in terms of generic XFS health status, and
> leave it up to the user to repair the data and/or recalc the merkle
> tree, depending on what they find when they look at the corrupt file
> status.

Is there a way to forcibly read the file contents even if it fails
verity validation?  I was assuming the only recourse in that case is to
delete the file and restore from backup/package manager/etc.

> > Is there an ioctl or something that allows userspace to validate an
> > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > block devices, except that we might believe its answers?
> > 
> > Also -- inconsistencies between the file data and the merkle tree aren't
> > something that xfs can self-heal, right?
> 
> Not that I know of - the file data has to be validated before we can
> tell if the error is in the data or the merkle tree, and only the
> user can validate the data is correct.

<nod>

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Darrick J. Wong March 11, 2024, 10:38 p.m. UTC | #8
[add willy and linux-mm]

On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
> > [...]
> > 
> > For both tools, if something finds errors in the merkle tree structure
> > itself, do we turn off verity?  Or do we do something nasty like
> > truncate the file?
> 
> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
> validating the data of verity inodes against their Merkle trees.
> 
> e2fsck does delete the verity metadata of inodes that don't have the verity flag
> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
> 
> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
> should delete that inode's verity metadata and remove the verity flag from the
> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
> fairly straightforward, but it probably wouldn't catch much compared to actually
> validating the data against the Merkle tree.  And actually validating the data
> against the Merkle tree would be complex and expensive.  Note, none of this
> would work on files that are encrypted.
> 
> Re: libfsverity, I think it would be possible to validate a Merkle tree using
> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
> quite what it was designed for.
> 
> > Is there an ioctl or something that allows userspace to validate an
> > entire file's contents?  Sort of like what BLKVERIFY would have done for
> > block devices, except that we might believe its answers?
> 
> Just reading the whole file and seeing whether you get an error would do it.
> 
> Though if you want to make sure it's really re-reading the on-disk data, it's
> necessary to drop the file's pagecache first.

I tried a straight pagecache read and it worked like a charm!

But then I thought to myself, do I really want to waste memory bandwidth
copying a bunch of data?  No.  I don't even want to incur system call
overhead from reading a single byte every $pagesize bytes.

So I created 2M mmap areas and read a byte every $pagesize bytes.  That
worked too, except that the failure mode is a SIGBUS, and taking
signals like that is annoying to handle.

Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
like what I wanted -- it prefaults in the pages, and "If populating
fails, a SIGBUS signal is not generated; instead, an error is returned."

But then I tried rigging up a test to see if I could catch an EIO, and
instead I had to SIGKILL the process!  It looks like filemap_fault returns
VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
__do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
the VM_FAULT_RETRY is translated to -EBUSY.

__get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
that to madvise_populate.  Unfortunately, madvise_populate increments
its loop counter by the return value (still 0) so it runs in an
infinite loop.  The only way out is SIGKILL.

So I don't know what the correct behavior is here, other than the
infinite loop seems pretty suspect.  Is it the correct behavior that
madvise_populate returns EIO if __get_user_pages ever returns zero?
That doesn't quite sound right if it's the case that a zero return could
also happen if memory is tight.

I suppose filemap_fault could return VM_FAULT_SIGBUS in this one
scenario so userspace would get an -EFAULT.  That would solve this one
case of weird behavior.  But I think that doesn't happen in the
page_not_uptodate case because fpin is non-null?

As for xfs_scrub validating data files, I suppose it's not /so/
terrible to read one byte every $fsblocksize so that we can report
exactly where fsverity and the file data became inconsistent.  The
POPULATE_READ interface doesn't tell you how many pages it /did/ manage
to load, so perhaps MADV_POPULATE_READ isn't workable anyway.

(and now I'm just handwaving wildly about pagecache behaviors ;))

--D

> > Also -- inconsistencies between the file data and the merkle tree aren't
> > something that xfs can self-heal, right?
> 
> Similar to file data itself, only way to self-heal would be via mechanisms that
> provide redundancy.  There's been some interest in adding support forward error
> correction (FEC) to fsverity similar to what dm-verity has, but this would be
> complex, and it's not something that anyone has gotten around to yet.
> 
> - Eric
>
David Hildenbrand March 12, 2024, 3:13 p.m. UTC | #9
On 11.03.24 23:38, Darrick J. Wong wrote:
> [add willy and linux-mm]
> 
> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>> [...]
> 
> I tried a straight pagecache read and it worked like a charm!
> 
> But then I thought to myself, do I really want to waste memory bandwidth
> copying a bunch of data?  No.  I don't even want to incur system call
> overhead from reading a single byte every $pagesize bytes.
> 
> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
> worked too, insofar as SIGBUSes are annoying to handle.  But it's
> annoying to take signals like that.
> 
> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
> like what I wanted -- it prefaults in the pages, and "If populating
> fails, a SIGBUS signal is not generated; instead, an error is returned."
> 

Yes, these were the expected semantics :)

> But then I tried rigging up a test to see if I could catch an EIO, and
> instead I had to SIGKILL the process!  It looks filemap_fault returns
> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_pages,
> the VM_FAULT_RETRY is translated to -EBUSY.
> 
> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
> that to madvise_populate.  Unfortunately, madvise_populate increments
> its loop counter by the return value (still 0) so it runs in an
> infinite loop.  The only way out is SIGKILL.

That's certainly unexpected. One user I know is QEMU, which primarily 
uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is 
primarily used with shmem/hugetlb, where I haven't heard of any such 
endless loops.

> 
> So I don't know what the correct behavior is here, other than the
> infinite loop seems pretty suspect.  Is it the correct behavior that
> madvise_populate returns EIO if __get_user_pages ever returns zero?
> That doesn't quite sound right if it's the case that a zero return could
> also happen if memory is tight.

madvise_populate() ends up calling faultin_vma_page_range() in a loop. 
That one calls __get_user_pages().

__get_user_pages() documents: "0 return value is possible when the fault 
would need to be retried."

So that's what the caller does. IIRC, there are cases where we really 
have to retry (at least once) and will make progress, so treating "0" as 
an error would be wrong.

Staring at other __get_user_pages() users, __get_user_pages_locked() 
documents: "Please note that this function, unlike __get_user_pages(), 
will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".

But there is some elaborate retry logic in there, whereby the retry will 
set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second 
retry attempt (there are cases where we retry more often, but that's 
related to something else I believe).

So maybe we need a similar retry logic in faultin_vma_page_range()? Or 
make it use __get_user_pages_locked(), but I recall when I introduced 
MADV_POPULATE_READ, there was a catch to it.
David Hildenbrand March 12, 2024, 3:33 p.m. UTC | #10
On 12.03.24 16:13, David Hildenbrand wrote:
> On 11.03.24 23:38, Darrick J. Wong wrote:
>> [add willy and linux-mm]
>>
>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>> [...]
>>
>> I tried a straight pagecache read and it worked like a charm!
>>
>> But then I thought to myself, do I really want to waste memory bandwidth
>> copying a bunch of data?  No.  I don't even want to incur system call
>> overhead from reading a single byte every $pagesize bytes.
>>
>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>> annoying to take signals like that.
>>
>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>> like what I wanted -- it prefaults in the pages, and "If populating
>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>
> 
> Yes, these were the expected semantics :)
> 
>> But then I tried rigging up a test to see if I could catch an EIO, and
>> instead I had to SIGKILL the process!  It looks like filemap_fault returns
>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
>> the VM_FAULT_RETRY is translated to -EBUSY.
>>
>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>> that to madvise_populate.  Unfortunately, madvise_populate increments
>> its loop counter by the return value (still 0) so it runs in an
>> infinite loop.  The only way out is SIGKILL.
> 
> That's certainly unexpected. One user I know is QEMU, which primarily
> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
> primarily used with shmem/hugetlb, where I haven't heard of any such
> endless loops.
> 
>>
>> So I don't know what the correct behavior is here, other than the
>> infinite loop seems pretty suspect.  Is it the correct behavior that
>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>> That doesn't quite sound right if it's the case that a zero return could
>> also happen if memory is tight.
> 
> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
> That one calls __get_user_pages().
> 
> __get_user_pages() documents: "0 return value is possible when the fault
> would need to be retried."
> 
> So that's what the caller does. IIRC, there are cases where we really
> have to retry (at least once) and will make progress, so treating "0" as
> an error would be wrong.
> 
> Staring at other __get_user_pages() users, __get_user_pages_locked()
> documents: "Please note that this function, unlike __get_user_pages(),
> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
> 
> But there is some elaborate retry logic in there, whereby the retry will
> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
> retry attempt (there are cases where we retry more often, but that's
> related to something else I believe).
> 
> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
> make it use __get_user_pages_locked(), but I recall when I introduced
> MADV_POPULATE_READ, there was a catch to it.

I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the 
mmap()+access case you describe above.

Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately 
see how we would transition from a VM_FAULT_RETRY loop to 
VM_FAULT_SIGBUS. Because VM_FAULT_SIGBUS would be required for that 
function to call do_sigbus().
Darrick J. Wong March 12, 2024, 4:44 p.m. UTC | #11
On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> On 12.03.24 16:13, David Hildenbrand wrote:
> > [...]
> 
> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
> mmap()+access case you describe above.
> 
> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
> Because VM_FAULT_SIGBUS would be required for that function to call
> do_sigbus().

The code I was looking at yesterday in filemap_fault was:

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
	if (fpin)
		goto out_retry;
	folio_put(folio);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;
	filemap_invalidate_unlock_shared(mapping);

	return VM_FAULT_SIGBUS;

Wherein I /think/ fpin is non-null in this case, so if
filemap_read_folio returns an error, we'll do this instead:

out_retry:
	/*
	 * We dropped the mmap_lock, we need to return to the fault handler to
	 * re-find the vma and come back and find our hopefully still populated
	 * page.
	 */
	if (!IS_ERR(folio))
		folio_put(folio);
	if (mapping_locked)
		filemap_invalidate_unlock_shared(mapping);
	if (fpin)
		fput(fpin);
	return ret | VM_FAULT_RETRY;

and since ret was 0 before the goto, the only return code is
VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:

	if (fpin) {
		if (error)
			ret |= VM_FAULT_SIGBUS;
		goto out_retry;
	}

But I think the hard part here is that there doesn't seem to be any
distinction between transient read errors (e.g. disk cable fell out) vs.
semi-permanent errors (e.g. verity says the hash doesn't match).
AFAICT, either the read(ahead) sets uptodate and callers read the page,
or it doesn't set it and callers treat that as an error-retry
opportunity.

For the transient error case VM_FAULT_RETRY makes perfect sense; for the
second case I imagine we'd want something closer to _SIGBUS.

</handwave>

--D

David Hildenbrand March 13, 2024, 12:29 p.m. UTC | #12
On 12.03.24 17:44, Darrick J. Wong wrote:
> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>> On 12.03.24 16:13, David Hildenbrand wrote:
> [...]
> 
> But I think the hard part here is that there doesn't seem to be any
> distinction between transient read errors (e.g. disk cable fell out) vs.
> semi-permanent errors (e.g. verity says the hash doesn't match).
> AFAICT, either the read(ahead) sets uptodate and callers read the page,
> or it doesn't set it and callers treat that as an error-retry
> opportunity.
> 
> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> second case I imagine we'd want something closer to _SIGBUS.


Agreed, it's really hard to judge when it's the right time to give up 
retrying. At least with MADV_POPULATE_READ we should try achieving the 
same behavior as with mmap()+read access. So if the latter manages to 
trigger SIGBUS, MADV_POPULATE_READ should return an error.

Is there an easy way for me to reproduce this scenario?
Darrick J. Wong March 13, 2024, 5:19 p.m. UTC | #13
On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
> On 12.03.24 17:44, Darrick J. Wong wrote:
> > On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
> > > On 12.03.24 16:13, David Hildenbrand wrote:
> > [...]
> > 
> > But I think the hard part here is that there doesn't seem to be any
> > distinction between transient read errors (e.g. disk cable fell out) vs.
> > semi-permanent errors (e.g. verity says the hash doesn't match).
> > AFAICT, either the read(ahead) sets uptodate and callers read the page,
> > or it doesn't set it and callers treat that as an error-retry
> > opportunity.
> > 
> > For the transient error case VM_FAULT_RETRY makes perfect sense; for the
> > second case I imagine we'd want something closer to _SIGBUS.
> 
> 
> Agreed, it's really hard to judge when it's the right time to give up
> retrying. At least with MADV_POPULATE_READ we should try achieving the same
> behavior as with mmap()+read access. So if the latter manages to trigger
> SIGBUS, MADV_POPULATE_READ should return an error.
> 
> Is there an easy way for me to reproduce this scenario?

Yes.  Take this Makefile:

CFLAGS=-Wall -Werror -O2 -g -Wno-unused-variable

all: mpr

and this C program mpr.c:

/* test MAP_POPULATE_READ on a file */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define min(a, b)	((a) < (b) ? (a) : (b))
#define BUFSIZE		(2097152)

int main(int argc, char *argv[])
{
	struct stat sb;
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t read_sz, pos;
	void *addr;
	char c;
	int fd, ret;

	if (argc != 2) {
		printf("Usage: %s fname\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	ret = fstat(fd, &sb);
	if (ret) {
		perror("fstat");
		return 1;
	}

	/* Validate the file contents with regular reads */
	for (pos = 0; pos < sb.st_size; pos += sb.st_blksize) {
		ret = pread(fd, &c, 1, pos);
		if (ret < 0) {
			if (errno != EIO) {
				perror("pread");
				return 1;
			}

			printf("%s: at offset %llu: %s\n", argv[1],
					(unsigned long long)pos,
					strerror(errno));
			break;
		}
	}

	ret = pread(fd, &c, 1, sb.st_size);
	if (ret < 0) {
		if (errno != EIO) {
			perror("pread");
			return 1;
		}

		printf("%s: at offset %llu: %s\n", argv[1],
				(unsigned long long)sb.st_size,
				strerror(errno));
	}

	/* Validate the file contents with MADV_POPULATE_READ */
	read_sz = ((sb.st_size + (pagesize - 1)) / pagesize) * pagesize;
	printf("%s: read bytes %llu\n", argv[1], (unsigned long long)read_sz);

	for (pos = 0; pos < read_sz; pos += BUFSIZE) {
		unsigned int mappos;
		size_t maplen = min(read_sz - pos, BUFSIZE);

		addr = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, pos);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		ret = madvise(addr, maplen, MADV_POPULATE_READ);
		if (ret) {
			perror("madvise");
			return 1;
		}

		ret = munmap(addr, maplen);
		if (ret) {
			perror("munmap");
			return 1;
		}
	}

	ret = close(fd);
	if (ret) {
		perror("close");
		return 1;
	}

	return 0;
}

and this shell script mpr.sh:

#!/bin/bash -x

# Try to trigger infinite loop with regular IO errors and MADV_POPULATE_READ

scriptdir="$(dirname "$0")"

commands=(dmsetup mkfs.xfs xfs_io timeout strace "$scriptdir/mpr")
for cmd in "${commands[@]}"; do
	if ! command -v "$cmd" &>/dev/null; then
		echo "$cmd: Command required for this program."
		exit 1
	fi
done

dev="${1:-/dev/sda}"
mnt="${2:-/mnt}"
dmtarget="dumbtarget"

# Clean up any old mounts
umount "$dev" "$mnt"
dmsetup remove "$dmtarget"
rmmod xfs

# Create dm linear mapping to block device and format filesystem
sectors="$(blockdev --getsz "$dev")"
tgt="/dev/mapper/$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup create "$dmtarget"
mkfs.xfs -f "$tgt"

# Create a file that we'll read, then cycle mount to zap pagecache
mount "$tgt" "$mnt"
xfs_io -f -c "pwrite -S 0x58 0 1m" "$mnt/a"
umount "$mnt"
mount "$tgt" "$mnt"

# Load file metadata
stat "$mnt/a"

# Induce EIO errors on read
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors error" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"

# Try to provoke the kernel; kill the process after 10s so we can clean up
timeout -s KILL 10s strace -s99 -e madvise "$scriptdir/mpr" "$mnt/a"

# Stop EIO errors so we can unmount
dmsetup suspend --noflush --nolockfs "$dmtarget"
echo "0 $sectors linear $dev 0" | dmsetup load "$dmtarget"
dmsetup resume "$dmtarget"

# Unmount and clean up after ourselves
umount "$mnt"
dmsetup remove "$dmtarget"
<EOF>

make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
stall in the madvise call until timeout sends sigkill to the program;
you can crank the 10s timeout up if you want.

<insert usual disclaimer that I run all these things in scratch VMs>

--D

> -- 
> Cheers,
> 
> David / dhildenb
> 
>
David Hildenbrand March 13, 2024, 7:10 p.m. UTC | #14
On 13.03.24 18:19, Darrick J. Wong wrote:
> On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
>> On 12.03.24 17:44, Darrick J. Wong wrote:
>>> On Tue, Mar 12, 2024 at 04:33:14PM +0100, David Hildenbrand wrote:
>>>> On 12.03.24 16:13, David Hildenbrand wrote:
>>>>> On 11.03.24 23:38, Darrick J. Wong wrote:
>>>>>> [add willy and linux-mm]
>>>>>>
>>>>>> On Thu, Mar 07, 2024 at 08:40:17PM -0800, Eric Biggers wrote:
>>>>>>> On Thu, Mar 07, 2024 at 07:46:50PM -0800, Darrick J. Wong wrote:
>>>>>>>>> BTW, is xfs_repair planned to do anything about any such extra blocks?
>>>>>>>>
>>>>>>>> Sorry to answer your question with a question, but how much checking is
>>>>>>>> $filesystem expected to do for merkle trees?
>>>>>>>>
>>>>>>>> In theory xfs_repair could learn how to interpret the verity descriptor,
>>>>>>>> walk the merkle tree blocks, and even read the file data to confirm
>>>>>>>> intactness.  If the descriptor specifies the highest block address then
>>>>>>>> we could certainly trim off excess blocks.  But I don't know how much of
>>>>>>>> libfsverity actually lets you do that; I haven't looked into that
>>>>>>>> deeply. :/
>>>>>>>>
>>>>>>>> For xfs_scrub I guess the job is theoretically simpler, since we only
>>>>>>>> need to stream reads of the verity files through the page cache and let
>>>>>>>> verity tell us if the file data are consistent.
>>>>>>>>
>>>>>>>> For both tools, if something finds errors in the merkle tree structure
>>>>>>>> itself, do we turn off verity?  Or do we do something nasty like
>>>>>>>> truncate the file?
>>>>>>>
>>>>>>> As far as I know (I haven't been following btrfs-progs, but I'm familiar with
>>>>>>> e2fsprogs and f2fs-tools), there isn't yet any precedent for fsck actually
>>>>>>> validating the data of verity inodes against their Merkle trees.
>>>>>>>
>>>>>>> e2fsck does delete the verity metadata of inodes that don't have the verity flag
>>>>>>> enabled.  That handles cleaning up after a crash during FS_IOC_ENABLE_VERITY.
>>>>>>>
>>>>>>> I suppose that ideally, if an inode's verity metadata is invalid, then fsck
>>>>>>> should delete that inode's verity metadata and remove the verity flag from the
>>>>>>> inode.  Checking for a missing or obviously corrupt fsverity_descriptor would be
>>>>>>> fairly straightforward, but it probably wouldn't catch much compared to actually
>>>>>>> validating the data against the Merkle tree.  And actually validating the data
>>>>>>> against the Merkle tree would be complex and expensive.  Note, none of this
>>>>>>> would work on files that are encrypted.
>>>>>>>
>>>>>>> Re: libfsverity, I think it would be possible to validate a Merkle tree using
>>>>>>> libfsverity_compute_digest() and the callbacks that it supports.  But that's not
>>>>>>> quite what it was designed for.
>>>>>>>
>>>>>>>> Is there an ioctl or something that allows userspace to validate an
>>>>>>>> entire file's contents?  Sort of like what BLKVERIFY would have done for
>>>>>>>> block devices, except that we might believe its answers?
>>>>>>>
>>>>>>> Just reading the whole file and seeing whether you get an error would do it.
>>>>>>>
>>>>>>> Though if you want to make sure it's really re-reading the on-disk data, it's
>>>>>>> necessary to drop the file's pagecache first.
>>>>>>
>>>>>> I tried a straight pagecache read and it worked like a charm!
>>>>>>
>>>>>> But then I thought to myself, do I really want to waste memory bandwidth
>>>>>> copying a bunch of data?  No.  I don't even want to incur system call
>>>>>> overhead from reading a single byte every $pagesize bytes.
>>>>>>
>>>>>> So I created 2M mmap areas and read a byte every $pagesize bytes.  That
>>>>>> worked too, insofar as SIGBUSes are annoying to handle.  But it's
>>>>>> annoying to take signals like that.
>>>>>>
>>>>>> Then I started looking at madvise.  MADV_POPULATE_READ looked exactly
>>>>>> like what I wanted -- it prefaults in the pages, and "If populating
>>>>>> fails, a SIGBUS signal is not generated; instead, an error is returned."
>>>>>>
>>>>>
>>>>> Yes, these were the expected semantics :)
>>>>>
>>>>>> But then I tried rigging up a test to see if I could catch an EIO, and
>>>>>> instead I had to SIGKILL the process!  It looks like filemap_fault returns
>>>>>> VM_FAULT_RETRY to __xfs_filemap_fault, which propagates up through
>>>>>> __do_fault -> do_read_fault -> do_fault -> handle_pte_fault ->
>>>>>> handle_mm_fault -> faultin_page -> __get_user_pages.  At faultin_page,
>>>>>> the VM_FAULT_RETRY is translated to -EBUSY.
>>>>>>
>>>>>> __get_user_pages squashes -EBUSY to 0, so faultin_vma_page_range returns
>>>>>> that to madvise_populate.  Unfortunately, madvise_populate increments
>>>>>> its loop counter by the return value (still 0) so it runs in an
>>>>>> infinite loop.  The only way out is SIGKILL.
>>>>>
>>>>> That's certainly unexpected. One user I know is QEMU, which primarily
>>>>> uses MADV_POPULATE_WRITE to prefault page tables. Prefaulting in QEMU is
>>>>> primarily used with shmem/hugetlb, where I haven't heard of any such
>>>>> endless loops.
>>>>>
>>>>>>
>>>>>> So I don't know what the correct behavior is here, other than the
>>>>>> infinite loop seems pretty suspect.  Is it the correct behavior that
>>>>>> madvise_populate returns EIO if __get_user_pages ever returns zero?
>>>>>> That doesn't quite sound right if it's the case that a zero return could
>>>>>> also happen if memory is tight.
>>>>>
>>>>> madvise_populate() ends up calling faultin_vma_page_range() in a loop.
>>>>> That one calls __get_user_pages().
>>>>>
>>>>> __get_user_pages() documents: "0 return value is possible when the fault
>>>>> would need to be retried."
>>>>>
>>>>> So that's what the caller does. IIRC, there are cases where we really
>>>>> have to retry (at least once) and will make progress, so treating "0" as
>>>>> an error would be wrong.
>>>>>
>>>>> Staring at other __get_user_pages() users, __get_user_pages_locked()
>>>>> documents: "Please note that this function, unlike __get_user_pages(),
>>>>> will not return 0 for nr_pages > 0, unless FOLL_NOWAIT is used.".
>>>>>
>>>>> But there is some elaborate retry logic in there, whereby the retry will
>>>>> set FOLL_TRIED->FAULT_FLAG_TRIED, and I think we'd fail on the second
>>>>> retry attempt (there are cases where we retry more often, but that's
>>>>> related to something else I believe).
>>>>>
>>>>> So maybe we need a similar retry logic in faultin_vma_page_range()? Or
>>>>> make it use __get_user_pages_locked(), but I recall when I introduced
>>>>> MADV_POPULATE_READ, there was a catch to it.
>>>>
>>>> I'm trying to figure out who will be setting the VM_FAULT_SIGBUS in the
>>>> mmap()+access case you describe above.
>>>>
>>>> Staring at arch/x86/mm/fault.c:do_user_addr_fault(), I don't immediately see
>>>> how we would transition from a VM_FAULT_RETRY loop to VM_FAULT_SIGBUS.
>>>> Because VM_FAULT_SIGBUS would be required for that function to call
>>>> do_sigbus().
>>>
>>> The code I was looking at yesterday in filemap_fault was:
>>>
>>> page_not_uptodate:
>>> 	/*
>>> 	 * Umm, take care of errors if the page isn't up-to-date.
>>> 	 * Try to re-read it _once_. We do this synchronously,
>>> 	 * because there really aren't any performance issues here
>>> 	 * and we need to check for errors.
>>> 	 */
>>> 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>>> 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
>>> 	if (fpin)
>>> 		goto out_retry;
>>> 	folio_put(folio);
>>>
>>> 	if (!error || error == AOP_TRUNCATED_PAGE)
>>> 		goto retry_find;
>>> 	filemap_invalidate_unlock_shared(mapping);
>>>
>>> 	return VM_FAULT_SIGBUS;
>>>
>>> Wherein I /think/ fpin is non-null in this case, so if
>>> filemap_read_folio returns an error, we'll do this instead:
>>>
>>> out_retry:
>>> 	/*
>>> 	 * We dropped the mmap_lock, we need to return to the fault handler to
>>> 	 * re-find the vma and come back and find our hopefully still populated
>>> 	 * page.
>>> 	 */
>>> 	if (!IS_ERR(folio))
>>> 		folio_put(folio);
>>> 	if (mapping_locked)
>>> 		filemap_invalidate_unlock_shared(mapping);
>>> 	if (fpin)
>>> 		fput(fpin);
>>> 	return ret | VM_FAULT_RETRY;
>>>
>>> and since ret was 0 before the goto, the only return code is
>>> VM_FAULT_RETRY.  I had speculated that perhaps we could instead do:
>>>
>>> 	if (fpin) {
>>> 		if (error)
>>> 			ret |= VM_FAULT_SIGBUS;
>>> 		goto out_retry;
>>> 	}
>>>
>>> But I think the hard part here is that there doesn't seem to be any
>>> distinction between transient read errors (e.g. disk cable fell out) vs.
>>> semi-permanent errors (e.g. verity says the hash doesn't match).
>>> AFAICT, either the read(ahead) sets uptodate and callers read the page,
>>> or it doesn't set it and callers treat that as an error-retry
>>> opportunity.
>>>
>>> For the transient error case VM_FAULT_RETRY makes perfect sense; for the
>>> second case I imagine we'd want something closer to _SIGBUS.
>>
>>
>> Agreed, it's really hard to judge when it's the right time to give up
>> retrying. At least with MADV_POPULATE_READ we should try achieving the same
>> behavior as with mmap()+read access. So if the latter manages to trigger
>> SIGBUS, MADV_POPULATE_READ should return an error.
>>
>> Is there an easy way for me to reproduce this scenario?
> 
> Yes.  Take this Makefile:
> 
> [... Makefile, mpr.c, and mpr.sh snipped; identical to the copies in Darrick's mail above ...]
> 
> make the C program, then run ./mpr.sh <device> <mountpoint>.  It should
> stall in the madvise call until timeout sends sigkill to the program;
> you can crank the 10s timeout up if you want.
> 
> <insert usual disclaimer that I run all these things in scratch VMs>

Yes, seems to work, nice!


[  452.455636] buffer_io_error: 6 callbacks suppressed
[  452.455638] Buffer I/O error on dev dm-0, logical block 16, async page read
[  452.456169] Buffer I/O error on dev dm-0, logical block 17, async page read
[  452.456456] Buffer I/O error on dev dm-0, logical block 18, async page read
[  452.456754] Buffer I/O error on dev dm-0, logical block 19, async page read
[  452.457061] Buffer I/O error on dev dm-0, logical block 20, async page read
[  452.457350] Buffer I/O error on dev dm-0, logical block 21, async page read
[  452.457639] Buffer I/O error on dev dm-0, logical block 22, async page read
[  452.457942] Buffer I/O error on dev dm-0, logical block 23, async page read
[  452.458242] Buffer I/O error on dev dm-0, logical block 16, async page read
[  452.458552] Buffer I/O error on dev dm-0, logical block 17, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
madvise(0x7f9393624000, 1048576, MADV_POPULATE_READ./mpr.sh: line 45:  2070 Killed                  tim"


And once I switch over to reading instead of MADV_POPULATE_READ:

[  753.940230] buffer_io_error: 6 callbacks suppressed
[  753.940233] Buffer I/O error on dev dm-0, logical block 8192, async page read
[  753.941402] Buffer I/O error on dev dm-0, logical block 8193, async page read
[  753.942084] Buffer I/O error on dev dm-0, logical block 8194, async page read
[  753.942738] Buffer I/O error on dev dm-0, logical block 8195, async page read
[  753.943412] Buffer I/O error on dev dm-0, logical block 8196, async page read
[  753.944088] Buffer I/O error on dev dm-0, logical block 8197, async page read
[  753.944741] Buffer I/O error on dev dm-0, logical block 8198, async page read
[  753.945415] Buffer I/O error on dev dm-0, logical block 8199, async page read
[  753.946105] Buffer I/O error on dev dm-0, logical block 8192, async page read
[  753.946661] Buffer I/O error on dev dm-0, logical block 8193, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
--- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f34f82d8000} ---
+++ killed by SIGBUS (core dumped) +++
timeout: the monitored command dumped core
./mpr.sh: line 45:  2388 Bus error               timeout -s KILL 10s strace -s99 -e madvise "$scriptdir"


Let me dig into how the fault handler is able to conclude SIGBUS here!
David Hildenbrand March 13, 2024, 9:03 p.m. UTC | #15
On 13.03.24 20:10, David Hildenbrand wrote:
> On 13.03.24 18:19, Darrick J. Wong wrote:
>> On Wed, Mar 13, 2024 at 01:29:12PM +0100, David Hildenbrand wrote:
>>> [... deeply quoted discussion and reproducer snipped; identical to the previous mails ...]
> 
> Yes, seems to work, nice!
> 
> 
> [  452.455636] buffer_io_error: 6 callbacks suppressed
> [  452.455638] Buffer I/O error on dev dm-0, logical block 16, async page read
> [  452.456169] Buffer I/O error on dev dm-0, logical block 17, async page read
> [  452.456456] Buffer I/O error on dev dm-0, logical block 18, async page read
> [  452.456754] Buffer I/O error on dev dm-0, logical block 19, async page read
> [  452.457061] Buffer I/O error on dev dm-0, logical block 20, async page read
> [  452.457350] Buffer I/O error on dev dm-0, logical block 21, async page read
> [  452.457639] Buffer I/O error on dev dm-0, logical block 22, async page read
> [  452.457942] Buffer I/O error on dev dm-0, logical block 23, async page read
> [  452.458242] Buffer I/O error on dev dm-0, logical block 16, async page read
> [  452.458552] Buffer I/O error on dev dm-0, logical block 17, async page read
> + timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
> /mnt/tmp//a: at offset 0: Input/output error
> /mnt/tmp//a: read bytes 1048576
> madvise(0x7f9393624000, 1048576, MADV_POPULATE_READ./mpr.sh: line 45:  2070 Killed                  tim"
> 
> 
> And once I switch over to reading instead of MADV_POPULATE_READ:
> 
> [  753.940230] buffer_io_error: 6 callbacks suppressed
> [  753.940233] Buffer I/O error on dev dm-0, logical block 8192, async page read
> [  753.941402] Buffer I/O error on dev dm-0, logical block 8193, async page read
> [  753.942084] Buffer I/O error on dev dm-0, logical block 8194, async page read
> [  753.942738] Buffer I/O error on dev dm-0, logical block 8195, async page read
> [  753.943412] Buffer I/O error on dev dm-0, logical block 8196, async page read
> [  753.944088] Buffer I/O error on dev dm-0, logical block 8197, async page read
> [  753.944741] Buffer I/O error on dev dm-0, logical block 8198, async page read
> [  753.945415] Buffer I/O error on dev dm-0, logical block 8199, async page read
> [  753.946105] Buffer I/O error on dev dm-0, logical block 8192, async page read
> [  753.946661] Buffer I/O error on dev dm-0, logical block 8193, async page read
> + timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
> /mnt/tmp//a: at offset 0: Input/output error
> /mnt/tmp//a: read bytes 1048576
> --- SIGBUS {si_signo=SIGBUS, si_code=BUS_ADRERR, si_addr=0x7f34f82d8000} ---
> +++ killed by SIGBUS (core dumped) +++
> timeout: the monitored command dumped core
> ./mpr.sh: line 45:  2388 Bus error               timeout -s KILL 10s strace -s99 -e madvise "$scriptdir"
> 
> 
Let me dig into how the fault handler is able to conclude SIGBUS here!

Might be as simple as this:

diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d5..62df1548b7779 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1734,8 +1734,8 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
         if (check_vma_flags(vma, gup_flags))
                 return -EINVAL;
  
-       ret = __get_user_pages(mm, start, nr_pages, gup_flags,
-                              NULL, locked);
+       ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked,
+                                     gup_flags);
         lru_add_drain();
         return ret;
  }


[   57.154542] Buffer I/O error on dev dm-0, logical block 16, async page read
[   57.155276] Buffer I/O error on dev dm-0, logical block 17, async page read
[   57.155911] Buffer I/O error on dev dm-0, logical block 18, async page read
[   57.156568] Buffer I/O error on dev dm-0, logical block 19, async page read
[   57.157245] Buffer I/O error on dev dm-0, logical block 20, async page read
[   57.157880] Buffer I/O error on dev dm-0, logical block 21, async page read
[   57.158539] Buffer I/O error on dev dm-0, logical block 22, async page read
[   57.159197] Buffer I/O error on dev dm-0, logical block 23, async page read
[   57.159838] Buffer I/O error on dev dm-0, logical block 16, async page read
[   57.160492] Buffer I/O error on dev dm-0, logical block 17, async page read
+ timeout -s KILL 10s strace -s99 -e madvise ./mpr /mnt/tmp//a
[   57.190472] Retrying
/mnt/tmp//a: at offset 0: Input/output error
/mnt/tmp//a: read bytes 1048576
madvise(0x7f0fa0384000, 1048576, MADV_POPULATE_READ) = -1 EFAULT (Bad address)
madvise: Bad address


And -EFAULT is what MADV_POPULATE_READ is documented to return for SIGBUS.

(there are a couple of ways we could speed up MADV_POPULATE_READ, maybe
at some point using VMA locks)
diff mbox series

Patch

diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index 66e2270b0dae..966630523502 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -621,6 +621,7 @@  static int btrfs_begin_enable_verity(struct file *filp)
  * @desc:              verity descriptor to write out (NULL in error conditions)
  * @desc_size:         size of the verity descriptor (variable with signatures)
  * @merkle_tree_size:  size of the merkle tree in bytes
+ * @tree_blocksize:    the Merkle tree block size
  *
  * If desc is null, then VFS is signaling an error occurred during verity
  * enable, and we should try to rollback. Otherwise, attempt to finish verity.
@@ -628,7 +629,8 @@  static int btrfs_begin_enable_verity(struct file *filp)
  * Returns 0 on success, negative error code on error.
  */
 static int btrfs_end_enable_verity(struct file *filp, const void *desc,
-				   size_t desc_size, u64 merkle_tree_size)
+				   size_t desc_size, u64 merkle_tree_size,
+				   unsigned int tree_blocksize)
 {
 	struct btrfs_inode *inode = BTRFS_I(file_inode(filp));
 	int ret = 0;
diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index 2f37e1ea3955..da2095a81349 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -189,7 +189,8 @@  static int ext4_write_verity_descriptor(struct inode *inode, const void *desc,
 }
 
 static int ext4_end_enable_verity(struct file *filp, const void *desc,
-				  size_t desc_size, u64 merkle_tree_size)
+				  size_t desc_size, u64 merkle_tree_size,
+				  unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	const int credits = 2; /* superblock and inode for ext4_orphan_del() */
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index 4fc95f353a7a..b4461b9f47a3 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -144,7 +144,8 @@  static int f2fs_begin_enable_verity(struct file *filp)
 }
 
 static int f2fs_end_enable_verity(struct file *filp, const void *desc,
-				  size_t desc_size, u64 merkle_tree_size)
+				  size_t desc_size, u64 merkle_tree_size,
+				  unsigned int tree_blocksize)
 {
 	struct inode *inode = file_inode(filp);
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
diff --git a/fs/verity/enable.c b/fs/verity/enable.c
index c284f46d1b53..04e060880b79 100644
--- a/fs/verity/enable.c
+++ b/fs/verity/enable.c
@@ -274,7 +274,8 @@  static int enable_verity(struct file *filp,
 	 * Serialized with ->begin_enable_verity() by the inode lock.
 	 */
 	inode_lock(inode);
-	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size);
+	err = vops->end_enable_verity(filp, desc, desc_size, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	if (err) {
 		fsverity_err(inode, "%ps() failed with err %d",
@@ -300,7 +301,8 @@  static int enable_verity(struct file *filp,
 
 rollback:
 	inode_lock(inode);
-	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size);
+	(void)vops->end_enable_verity(filp, NULL, 0, params.tree_size,
+				      params.block_size);
 	inode_unlock(inode);
 	goto out;
 }
diff --git a/include/linux/fsverity.h b/include/linux/fsverity.h
index 1eb7eae580be..ac58b19f23d3 100644
--- a/include/linux/fsverity.h
+++ b/include/linux/fsverity.h
@@ -51,6 +51,7 @@  struct fsverity_operations {
 	 * @desc: the verity descriptor to write, or NULL on failure
 	 * @desc_size: size of verity descriptor, or 0 on failure
 	 * @merkle_tree_size: total bytes the Merkle tree took up
+	 * @tree_blocksize: the Merkle tree block size
 	 *
 	 * If desc == NULL, then enabling verity failed and the filesystem only
 	 * must do any necessary cleanups.  Else, it must also store the given
@@ -65,7 +66,8 @@  struct fsverity_operations {
 	 * Return: 0 on success, -errno on failure
 	 */
 	int (*end_enable_verity)(struct file *filp, const void *desc,
-				 size_t desc_size, u64 merkle_tree_size);
+				 size_t desc_size, u64 merkle_tree_size,
+				 unsigned int tree_blocksize);
 
 	/**
 	 * Get the verity descriptor of the given inode.