
I/O errors block the entire filesystem

Message ID orhajmb7kc.fsf@livre.home (mailing list archive)

Commit Message

Alexandre Oliva April 4, 2013, 4:10 p.m. UTC
I've been trying to figure out the btrfs I/O stack, to understand
why, sometimes (but not always), after a failure to read a
(non-replicated data) block from the disk, the file being accessed
becomes permanently locked, and the filesystem unmountable.

Sometimes (but not always) it's possible to kill the process that
accessed the file, and sometimes (but not always) the failure causes
the machine load to skyrocket by 60+ processes.

In one of the failures that caused machine load spikes, I tried to
collect info on active processes with perf top and SysRq-T, but nothing
there seemed to explain the spike.  Thoughts on how to figure out what's
causing this?

Another weirdness I noticed is that, after a single read failure,
btree_io_failed_hook gets called multiple times, until io_pages gets
down to zero.  This seems wrong: I think it should only be called once
when a single block fails, rather than having that single failure get
all pending pages marked as failed, no?

Here are some instrumented dumps I collected from one occurrence of the
scenario described in the previous paragraph (it didn't cause a load
spike).  Only one disk block had a read failure.  At the end, I enclose
the patch that got those dumps printed, the result of several iterations
in which one failure led me to find another function to instrument.

end_request: I/O error, dev sdd, sector 183052083
btrfs: bdev /dev/sdd4 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
btrfs_end_bio orig -EIO 1 > 0 pending 0 end ffffffffa0240820,ffffffffa020c2d0
end_workqueue_bio err -5 bi_rw 0
ata5: EH complete
end_workqueue_fn err -5 end_io ffffffffa020c2d0,ffffffffa0231080
btree_io_failed_hook failed_mirror 1 io_pages 15 readahead 0
end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
btree_io_failed_hook failed_mirror 1 io_pages 14 readahead 0
end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
[...repeat both msgs with io_pages decremented one at a time...]
btree_io_failed_hook failed_mirror 1 io_pages 0 readahead 0
end_bio_extent_readpage err -5 failed_hook ffffffffa020bed0 ret -5
(no further related messages)

Comments

Liu Bo May 15, 2013, 2:48 a.m. UTC | #1
On Thu, Apr 04, 2013 at 01:10:27PM -0300, Alexandre Oliva wrote:
> I've been trying to figure out the btrfs I/O stack to try to understand
> why, sometimes (but not always), after a failure to read a (data
> non-replicated) block from the disk, the file being accessed becomes
> permanently locked, and the filesystem, unmountable.
> 
> Sometimes (but not always) it's possible to kill the process that
> accessed the file, and sometimes (but not always) the failure causes
> the machine load to skyrocket by 60+ processes.
> 
> In one of the failures that caused machine load spikes, I tried to
> collect info on active processes with perf top and SysRq-T, but nothing
> there seemed to explain the spike.  Thoughts on how to figure out what's
> causing this?

Hi Alexandre,

Although I've seen your solution patch in this thread, I'm still curious
about this scenario; could you please share the reproducer script or
something?

> 
> Another weirdness I noticed is that, after a single read failure,
> btree_io_failed_hook gets called multiple times, until io_pages gets
> down to zero.  This seems wrong: I think it should only be called once
> when a single block fails, rather than having that single failure get
> all pending pages marked as failed, no?

I guess that you're using '-l 64k -n 64k' for mkfs.btrfs, so an 'eb'
(extent buffer) has 16 pages in total (64k node / 4k page size), and
reading an eb will not complete until all 16 pages finish reading.

thanks,
liubo

Alexandre Oliva May 15, 2013, 6:45 p.m. UTC | #2
On May 14, 2013, Liu Bo <bo.li.liu@oracle.com> wrote:

>> In one of the failures that caused machine load spikes, I tried to
>> collect info on active processes with perf top and SysRq-T, but nothing
>> there seemed to explain the spike.  Thoughts on how to figure out what's
>> causing this?

> Although I've seen your solution patch in this thread, I'm still curious
> about this scenario, could you please share the reproducer script or
> something?

I'm afraid I don't have one.  I just use the filesystem on various
disks, with Ceph OSDs and other non-Ceph subvolumes and files, and
occasionally I run into one of these bad blocks and the filesystem gets
into these odd states.

> I guess that you're using '-l 64k -n 64k' for mkfs.btrfs

That is correct, but IIUC this should only affect metadata, and metadata
recovery from the DUP block works.  It's data (single copy) that fails
as described.

Patch

Be verbose about the path followed after an I/O error

From: Alexandre Oliva <lxoliva@fsfla.org>


---
 fs/btrfs/disk-io.c   |   22 ++++++++++++++++++++--
 fs/btrfs/extent_io.c |    6 ++++++
 fs/btrfs/volumes.c   |   31 +++++++++++++++++++++++++++++--
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6d19a0a..20f9828 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -659,13 +659,18 @@  static int btree_io_failed_hook(struct page *page, int failed_mirror)
 {
 	struct extent_buffer *eb;
 	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	long io_pages;
+	bool readahead;
 
 	eb = (struct extent_buffer *)page->private;
 	set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
 	eb->read_mirror = failed_mirror;
-	atomic_dec(&eb->io_pages);
-	if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+	io_pages = atomic_dec_return(&eb->io_pages);
+	if ((readahead = test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)))
 		btree_readahead_hook(root, eb, eb->start, -EIO);
+	printk(KERN_ERR
+	       "btree_io_failed_hook failed_mirror %i io_pages %li readahead %i\n",
+	       failed_mirror, io_pages, readahead);
 	return -EIO;	/* we fixed nothing */
 }
 
@@ -674,6 +679,12 @@  static void end_workqueue_bio(struct bio *bio, int err)
 	struct end_io_wq *end_io_wq = bio->bi_private;
 	struct btrfs_fs_info *fs_info;
 
+	if (err) {
+		printk(KERN_ERR
+		       "end_workqueue_bio err %i bi_rw %lx\n",
+		       err, (unsigned long)bio->bi_rw);
+	}
+
 	fs_info = end_io_wq->info;
 	end_io_wq->error = err;
 	end_io_wq->work.func = end_workqueue_fn;
@@ -1647,6 +1658,13 @@  static void end_workqueue_fn(struct btrfs_work *work)
 	fs_info = end_io_wq->info;
 
 	error = end_io_wq->error;
+
+	if (error) {
+		printk(KERN_ERR
+		       "end_workqueue_fn err %i end_io %p,%p\n",
+		       error, bio->bi_end_io, end_io_wq->end_io);
+	}
+
 	bio->bi_private = end_io_wq->private;
 	bio->bi_end_io = end_io_wq->end_io;
 	kfree(end_io_wq);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cdee391..355b24e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2422,6 +2422,9 @@  static void end_bio_extent_readpage(struct bio *bio, int err)
 
 		if (!uptodate && tree->ops && tree->ops->readpage_io_failed_hook) {
 			ret = tree->ops->readpage_io_failed_hook(page, mirror);
+			printk(KERN_ERR
+			       "end_bio_extent_readpage err %i failed_hook %p ret %i\n",
+			       err, tree->ops->readpage_io_failed_hook, ret);
 			if (!ret && !err &&
 			    test_bit(BIO_UPTODATE, &bio->bi_flags))
 				uptodate = 1;
@@ -2437,6 +2440,9 @@  static void end_bio_extent_readpage(struct bio *bio, int err)
 			 * remain responsible for that page.
 			 */
 			ret = bio_readpage_error(bio, page, start, end, mirror, NULL);
+			printk(KERN_ERR
+			       "end_bio_extent_readpage err %i readpage_error ret %i\n",
+			       err, ret);
 			if (ret == 0) {
 				uptodate =
 					test_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2854c82..59ea128 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5033,6 +5033,8 @@  static void btrfs_end_bio(struct bio *bio, int err)
 {
 	struct btrfs_bio *bbio = extract_bbio_from_bio_private(bio->bi_private);
 	int is_orig_bio = 0;
+	int pending;
+	int errors = 0;
 
 	if (err) {
 		atomic_inc(&bbio->error);
@@ -5062,11 +5064,13 @@  static void btrfs_end_bio(struct bio *bio, int err)
 	if (bio == bbio->orig_bio)
 		is_orig_bio = 1;
 
-	if (atomic_dec_and_test(&bbio->stripes_pending)) {
+	if ((pending = atomic_dec_return(&bbio->stripes_pending)) == 0) {
+		void (*endio)(struct bio *, int);
 		if (!is_orig_bio) {
 			bio_put(bio);
 			bio = bbio->orig_bio;
 		}
+		endio = bio->bi_end_io;
 		bio->bi_private = bbio->private;
 		bio->bi_end_io = bbio->end_io;
 		bio->bi_bdev = (struct block_device *)
@@ -5074,21 +5078,44 @@  static void btrfs_end_bio(struct bio *bio, int err)
 		/* only send an error to the higher layers if it is
 		 * beyond the tolerance of the btrfs bio
 		 */
-		if (atomic_read(&bbio->error) > bbio->max_errors) {
+		;
+		if ((errors = atomic_read(&bbio->error)) > bbio->max_errors) {
 			err = -EIO;
+			printk(KERN_ERR
+			       "btrfs_end_bio %s-EIO %i > %i pending %i end %p,%p\n",
+			       is_orig_bio ? "orig " : "",
+			       errors, bbio->max_errors, pending,
+			       endio, bio->bi_end_io);
 		} else {
 			/*
 			 * this bio is actually up to date, we didn't
 			 * go over the max number of errors
 			 */
 			set_bit(BIO_UPTODATE, &bio->bi_flags);
+			if (errors) {
+				printk(KERN_ERR
+				       "btrfs_end_bio %s retry %i <= %i pending %i end %p,%p\n",
+				       is_orig_bio ? "orig " : "",
+				       errors, bbio->max_errors, pending,
+				       endio, bio->bi_end_io);
+			}
 			err = 0;
 		}
 		kfree(bbio);
 
 		bio_endio(bio, err);
 	} else if (!is_orig_bio) {
+		errors = atomic_read(&bbio->error);
 		bio_put(bio);
+		if (errors) {
+			printk(KERN_ERR
+			       "btrfs_end_bio no-endio %i <?> %i pending %i\n",
+			       errors, bbio->max_errors, pending);
+		}
+	} else if ((errors = atomic_read(&bbio->error))) {
+		printk(KERN_ERR
+		       "btrfs_end_bio orig no-endio %i <?> %i pending %i\n",
+		       errors, bbio->max_errors, pending);
 	}
 }