[2/2,v12] Btrfs: be aware of btree inode write errors to avoid fs corruption

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such an error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit), which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got COWed or deleted and is no
longer valid (for this latter case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
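
As a rough illustration of where that message comes from, the sketch
below models the check in user space: a parent records the generation it
expects a child block to carry, and a block read from disk with a
different generation is rejected. The struct and function names are
hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a btree block header. */
struct sketch_block_header {
	uint64_t bytenr;     /* logical address of the block */
	uint64_t generation; /* transaction id stored in the block */
};

/*
 * Returns true when the block read from disk carries the generation the
 * parent pointer expects ("wanted"), false on a mismatch ("found" differs).
 */
bool sketch_verify_parent_transid(const struct sketch_block_header *found,
				  uint64_t wanted_generation)
{
	assert(found != NULL);
	return found->generation == wanted_generation;
}
```

With the numbers from the message quoted above (wanted 25748, found
29562) the check fails, which is how an unwritten or stale block gets
noticed at read time, long after the commit that should have written it.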

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
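
The "catch and clear" problem can be modelled in a few lines of user
space C. This is only a sketch under simplifying assumptions: a single
boolean stands in for the mapping's error bit, and every name below is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one mapping-wide error bit (in the spirit of AS_EIO). */
struct sketch_mapping {
	bool eio;
};

/* Called when writeback of any page in the mapping fails. */
void sketch_record_write_error(struct sketch_mapping *m)
{
	assert(m != NULL);
	m->eio = true;
}

/*
 * The first waiter sees the error and clears it, so any later waiter
 * gets 0 even though a write did fail.
 */
int sketch_wait_and_clear(struct sketch_mapping *m)
{
	assert(m != NULL);
	if (m->eio) {
		m->eio = false;
		return -EIO;
	}
	return 0;
}
```

If the first call to sketch_wait_and_clear() comes from a log sync, it
returns -EIO and the second call, standing in for the transaction
commit, returns 0. The single bit also cannot say whether the failed
extent belonged to a log tree or not.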

Using the 3 new flags for the btree inode also gives us what
AS_EIO/AS_ENOSPC would give us in the following case: writepages()
returns success after starting writeback for all dirty pages, and the
writeback for all of them finishes with errors before
filemap_fdatawait_range() is called. Because we were not using
AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it
could not know that writeback errors happened (the pages were no longer
tagged for writeback).
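
A minimal user space sketch of the three-flag idea follows. The flag
and function names are hypothetical, not the identifiers used by the
patch, and the sketch assumes a log sync only tests its own flag while
the transaction commit resets both log flags (as the changelog for V8
and V10 describes).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits kept in the btree inode's runtime flags. */
#define SKETCH_ERR_TREE (1u << 0) /* non-log tree eb write failed (fatal) */
#define SKETCH_ERR_LOG0 (1u << 1) /* log eb write failed, sub-transaction 0 */
#define SKETCH_ERR_LOG1 (1u << 2) /* log eb write failed, sub-transaction 1 */

struct sketch_btree_inode {
	unsigned int runtime_flags;
};

/* log_index: -1 for a non-log tree eb, 0 or 1 for a log sub-transaction. */
void sketch_set_write_error(struct sketch_btree_inode *ino, int log_index)
{
	assert(ino != NULL);
	assert(log_index >= -1 && log_index <= 1);
	if (log_index == -1)
		ino->runtime_flags |= SKETCH_ERR_TREE;
	else if (log_index == 0)
		ino->runtime_flags |= SKETCH_ERR_LOG0;
	else
		ino->runtime_flags |= SKETCH_ERR_LOG1;
}

/* A log sync only looks at the flag of its own sub-transaction. */
int sketch_log_sync_check(const struct sketch_btree_inode *ino, int log_index)
{
	assert(ino != NULL);
	assert(log_index == 0 || log_index == 1);
	unsigned int bit = log_index == 0 ? SKETCH_ERR_LOG0 : SKETCH_ERR_LOG1;

	return (ino->runtime_flags & bit) ? -EIO : 0;
}

/* A commit fails only on a non-log tree error; log flags are reset here. */
int sketch_commit_check(struct sketch_btree_inode *ino)
{
	assert(ino != NULL);
	int ret = (ino->runtime_flags & SKETCH_ERR_TREE) ? -EIO : 0;

	ino->runtime_flags &= ~(SKETCH_ERR_LOG0 | SKETCH_ERR_LOG1);
	return ret;
}
```

Because each class of extent buffer has its own flag, a log sync cannot
consume an error that belongs to the transaction commit, and a failed
log eb write does not force a transaction abort.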

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---

V2: If an extent buffer's write failed but it's also deleted from the tree
    before the transaction commits, don't abort the transaction with -EIO,
    since the unwritten node/leaf it represents can't be pointed to by any
    other node in a tree.

V3: Correct V2, missed unstaged changes.

V4: Use root's key to figure out which counter to update.

V5: Decrement the error counters too when an eb is made dirty again (the
    next write attempt might succeed).

V6: Moved counters from transaction struct to fs_info struct, because there's
    a (short) time window where fs_info->running_transaction is NULL.
    There are now 2 counters for log extent buffers too, each one representing
    a different log transaction.

V7: Track the eb's log index in the eb itself, otherwise it wasn't possible
    to find it when writeback was triggered from a transaction commit.

V8: Track the log eb write errors per root instead, and reset them on a
    transaction commit.

V9: Don't decrement the error counters if the eb is deleted or re-written.
    It is not safe because there's a time window when committing a transaction,
    between setting fs_info->running_transaction to NULL and checking the
    error counters in btrfs_write_and_wait_transaction(), where a new transaction
    can start and delete or re-write an eb that has the write error flag set.
    If this happens it means the previous transaction can write a superblock
    that refers to trees that point to unwritten nodes.
    Replaced the counters with simple flags in the btree inode's runtime
    flags - essentially back to V1 but accounting for the 2 different log
    sub-transactions.
    Removed access to an eb's parent root through
    BTRFS_I(eb->pages[0]->mapping->host)->root since it was not correct, as this
    always gives us the btree inode's root (objectid 1ULL). Instead use the
    field eb->log_index to know whether it's a log btree eb (and which
    sub-transaction) or a non-log btree eb.

V10: Clear the log eb write error flags in a more logical place (transaction
     commit function).

V11: Updated commit message and a comment, replaced an ASSERT() with a BUG()
     and changed eb->lock_nested to a short to keep the structure size.

V12: Removed leftovers from previous versions (a no longer necessary #include
     and the prototype in extent_io.h of a function that no longer exists) and
     updated parts of a comment that applied only to some past versions.
     Rebased against latest integration branch (didn't apply cleanly) and re-tested.

 fs/btrfs/btrfs_inode.h | 11 ++++++++
 fs/btrfs/disk-io.c     |  4 +--
 fs/btrfs/extent-tree.c |  4 ++-
 fs/btrfs/extent_io.c   | 74 +++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/extent_io.h   |  7 +++--
 fs/btrfs/transaction.c | 26 ++++++++++++++++++
 6 files changed, 114 insertions(+), 12 deletions(-)
