Message ID | 20190227134230.11860-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] Btrfs: fix file corruption after snapshotting due to mix of buffered/DIO writes | expand |
On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote: > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) > * from already being in a transaction and our join_transaction doesn't > * have to re-take the fs freeze lock. > */ > - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) > + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { > writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > + } else { > + struct btrfs_pending_snapshot *pending; > + struct list_head *head = &trans->transaction->pending_snapshots; > + > + /* > + * Flush dellaloc for any root that is going to be snapshotted. > + * This is done to avoid a corrupted version of files, in the > + * snapshots, that had both buffered and direct IO writes (even > + * if they were done sequentially) due to an unordered update of > + * the inode's size on disk. > + */ > + list_for_each_entry(pending, head, list) > + btrfs_start_delalloc_snapshot(pending->root); > + } A diff would be better than words, incremental on top of your patch: @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans) * if they were done sequentially) due to an unordered update of * the inode's size on disk. */ - list_for_each_entry(pending, head, list) - btrfs_start_delalloc_snapshot(pending->root); + list_for_each_entry(pending, head, list) { + int ret; + + ret = btrfs_start_delalloc_snapshot(pending->root); + if (ret < 0) { + writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); + break; + } + } } return 0; } --- Making a switch by the exact error is probably not necessary and wouldn't be future proof anyway.
On Wed, Feb 27, 2019 at 5:25 PM David Sterba <dsterba@suse.cz> wrote: > > On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote: > > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) > > * from already being in a transaction and our join_transaction doesn't > > * have to re-take the fs freeze lock. > > */ > > - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) > > + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { > > writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > > + } else { > > + struct btrfs_pending_snapshot *pending; > > + struct list_head *head = &trans->transaction->pending_snapshots; > > + > > + /* > > + * Flush dellaloc for any root that is going to be snapshotted. > > + * This is done to avoid a corrupted version of files, in the > > + * snapshots, that had both buffered and direct IO writes (even > > + * if they were done sequentially) due to an unordered update of > > + * the inode's size on disk. > > + */ > > + list_for_each_entry(pending, head, list) > > + btrfs_start_delalloc_snapshot(pending->root); > > + } > > A diff would be better than words, incremental on top of your patch: You mean, better than a full 2nd version patch I suppose. > > @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans) > * if they were done sequentially) due to an unordered update of > * the inode's size on disk. > */ > - list_for_each_entry(pending, head, list) > - btrfs_start_delalloc_snapshot(pending->root); > + list_for_each_entry(pending, head, list) { > + int ret; > + > + ret = btrfs_start_delalloc_snapshot(pending->root); > + if (ret < 0) { > + writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > + break; > + } What do you expect by falling back to writeback_inodes_sb()? It all ends up going through the same btrfs writeback path. And as I left it, if an error happens for one root, it still tries to flush writeback for all the remaining roots as well, so I don't get it why you fallback to writeback_inodes_sb(). > + } > } > return 0; > } > --- > > Making a switch by the exact error is probably not necessary and wouldn't be > future proof anyway. Not sure I understood that sentence. Anyway, it's not clear to me whether as it is it's fine, or do you want an incremental patch, or something else. thanks
On Wed, Feb 27, 2019 at 05:32:55PM +0000, Filipe Manana wrote: > On Wed, Feb 27, 2019 at 5:25 PM David Sterba <dsterba@suse.cz> wrote: > > > > On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote: > > > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) > > > * from already being in a transaction and our join_transaction doesn't > > > * have to re-take the fs freeze lock. > > > */ > > > - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) > > > + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { > > > writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > > > + } else { > > > + struct btrfs_pending_snapshot *pending; > > > + struct list_head *head = &trans->transaction->pending_snapshots; > > > + > > > + /* > > > + * Flush dellaloc for any root that is going to be snapshotted. > > > + * This is done to avoid a corrupted version of files, in the > > > + * snapshots, that had both buffered and direct IO writes (even > > > + * if they were done sequentially) due to an unordered update of > > > + * the inode's size on disk. > > > + */ > > > + list_for_each_entry(pending, head, list) > > > + btrfs_start_delalloc_snapshot(pending->root); > > > + } > > > > A diff would be better than words, incremental on top of your patch: > > You mean, better than a full 2nd version patch I suppose. I mean better than my attempts to explain by words the error handling logic that I was proposing. > > > > @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans) > > * if they were done sequentially) due to an unordered update of > > * the inode's size on disk. > > */ > > - list_for_each_entry(pending, head, list) > > - btrfs_start_delalloc_snapshot(pending->root); > > + list_for_each_entry(pending, head, list) { > > + int ret; > > + > > + ret = btrfs_start_delalloc_snapshot(pending->root); > > + if (ret < 0) { > > + writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > > + break; > > + } > > What do you expect by falling back to writeback_inodes_sb()? > It all ends up going through the same btrfs writeback path. > And as I left it, if an error happens for one root, it still tries to > flush writeback for all the remaining roots as well, so I don't get it > why you fallback to writeback_inodes_sb(). As I read the changelog, you say that the corruption does not happen with FLUSHONCOMMIT, which does writeback_inodes_sb. Using that would be too heavy, thus you only start the delalloc on snapshotted roots. So the idea is to use the same logic of flushoncommit in the unlikely case when btrfs_start_delalloc_snapshot would fail. Only at some performance cost, unless I'm missing something. As for the v2 as you implement it without any error handling, doesn't this allow the corruption to happen? If start_delalloc_inodes has a lot of inodes for which it needs to allocate delalloc_work, the failure is possible. That the list_for_each continues does not affect that particular root. > > Making a switch by the exact error is probably not necessary and wouldn't be > > future proof anyway. > > Not sure I understood that sentence. Under v1 I was suggesting to filter out all possible errors from btrfs_start_delalloc_snapshot, EROFS and ENOMEM. So by 'making a switch' I meant if (ret == -EROFS) { break; } else { writeback_inodes_sb(); break; } > Anyway, it's not clear to me whether as it is it's fine, or do you > want an incremental patch, or something else. I want to continue the discussion about the error handling. The incremental diff was to show actual code behind my idea. If this weren't a correctness and commit related code I probably would not go that far be ok with the errors and abort. So I'm hoping you could help me find good options with low impact as the change affects the default commit path.
On Wed, Feb 27, 2019 at 6:30 PM David Sterba <dsterba@suse.cz> wrote: > > On Wed, Feb 27, 2019 at 05:32:55PM +0000, Filipe Manana wrote: > > On Wed, Feb 27, 2019 at 5:25 PM David Sterba <dsterba@suse.cz> wrote: > > > > > > On Wed, Feb 27, 2019 at 01:42:30PM +0000, fdmanana@kernel.org wrote: > > > > @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) > > > > * from already being in a transaction and our join_transaction doesn't > > > > * have to re-take the fs freeze lock. > > > > */ > > > > - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) > > > > + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { > > > > writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > > > > + } else { > > > > + struct btrfs_pending_snapshot *pending; > > > > + struct list_head *head = &trans->transaction->pending_snapshots; > > > > + > > > > + /* > > > > + * Flush dellaloc for any root that is going to be snapshotted. > > > > + * This is done to avoid a corrupted version of files, in the > > > > + * snapshots, that had both buffered and direct IO writes (even > > > > + * if they were done sequentially) due to an unordered update of > > > > + * the inode's size on disk. > > > > + */ > > > > + list_for_each_entry(pending, head, list) > > > > + btrfs_start_delalloc_snapshot(pending->root); > > > > + } > > > > > > A diff would be better than words, incremental on top of your patch: > > > > You mean, better than a full 2nd version patch I suppose. > > I mean better than my attempts to explain by words the error handling > logic that I was proposing. > > > > > > > @@ -1912,8 +1912,15 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans) > > > * if they were done sequentially) due to an unordered update of > > > * the inode's size on disk. > > > */ > > > - list_for_each_entry(pending, head, list) > > > - btrfs_start_delalloc_snapshot(pending->root); > > > + list_for_each_entry(pending, head, list) { > > > + int ret; > > > + > > > + ret = btrfs_start_delalloc_snapshot(pending->root); > > > + if (ret < 0) { > > > + writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > > > + break; > > > + } > > > > What do you expect by falling back to writeback_inodes_sb()? > > It all ends up going through the same btrfs writeback path. > > And as I left it, if an error happens for one root, it still tries to > > flush writeback for all the remaining roots as well, so I don't get it > > why you fallback to writeback_inodes_sb(). > > As I read the changelog, you say that the corruption does not happen > with FLUSHONCOMMIT, which does writeback_inodes_sb. Using that would be > too heavy, thus you only start the delalloc on snapshotted roots. It does not happen with flushoncommit because delalloc is flushed for all roots. If flushoncommit (writeback_inodes_sb()) would flush only for roots that are about to be snapshotted, the corruption wouldn't happen as well. > > So the idea is to use the same logic of flushoncommit in the unlikely > case when btrfs_start_delalloc_snapshot would fail. Only at some > performance cost, unless I'm missing something. Ok, so I hope the first paragraph above explains why there's no need to fallback to the "flush all delalloc for all roots" logic (writeback_inodes_sb()). > > As for the v2 as you implement it without any error handling, doesn't > this allow the corruption to happen? If start_delalloc_inodes has a lot > of inodes for which it needs to allocate delalloc_work, the failure is > possible. That the list_for_each continues does not affect that > particular root. Yes, it allows for the corruption to happen, that's why I had the error returned in v1. writeback_inodes_sb() isn't special in any way - flushing some delalloc range can fail the same way, it ends up calling the same btrfs writepages() callback. Yes, it doesn't return any error, because it's relying on callers either not caring about it, or if they do, checking the inode's mapping for an error, which btrfs sets through its writepages() callback if any an error happens (by calling mapping_set_error()), or any other fs specific way of storing/checking for writeback errors. If we get an error when flushing the dealloc range from the example in the changelog, then the corruption happens, regardless of writeback being triggered by writeback_inodes_sb() or btrfs_start_delalloc_snapshot(). > > > > Making a switch by the exact error is probably not necessary and wouldn't be > > > future proof anyway. > > > > Not sure I understood that sentence. > > Under v1 I was suggesting to filter out all possible errors from > btrfs_start_delalloc_snapshot, EROFS and ENOMEM. So by 'making a switch' > I meant > > if (ret == -EROFS) { > break; > } else { > writeback_inodes_sb(); > break; > } > > > Anyway, it's not clear to me whether as it is it's fine, or do you > > want an incremental patch, or something else. > > I want to continue the discussion about the error handling. The > incremental diff was to show actual code behind my idea. > > If this weren't a correctness and commit related code I probably would > not go that far be ok with the errors and abort. So I'm hoping you > could help me find good options with low impact as the change affects > the default commit path.
On Wed, Feb 27, 2019 at 06:56:10PM +0000, Filipe Manana wrote: > > > What do you expect by falling back to writeback_inodes_sb()? > > > It all ends up going through the same btrfs writeback path. > > > And as I left it, if an error happens for one root, it still tries to > > > flush writeback for all the remaining roots as well, so I don't get it > > > why you fallback to writeback_inodes_sb(). > > > > As I read the changelog, you say that the corruption does not happen > > with FLUSHONCOMMIT, which does writeback_inodes_sb. Using that would be > > too heavy, thus you only start the delalloc on snapshotted roots. > > It does not happen with flushoncommit because delalloc is flushed for all roots. > If flushoncommit (writeback_inodes_sb()) would flush only for roots > that are about to be snapshotted, the corruption wouldn't happen as > well. > > > > > So the idea is to use the same logic of flushoncommit in the unlikely > > case when btrfs_start_delalloc_snapshot would fail. Only at some > > performance cost, unless I'm missing something. > > Ok, so I hope the first paragraph above explains why there's no need > to fallback to the "flush all delalloc for all roots" logic > (writeback_inodes_sb()). Yeah, I think I get it now. > > As for the v2 as you implement it without any error handling, doesn't > > this allow the corruption to happen? If start_delalloc_inodes has a lot > > of inodes for which it needs to allocate delalloc_work, the failure is > > possible. That the list_for_each continues does not affect that > > particular root. > > Yes, it allows for the corruption to happen, that's why I had the > error returned in v1. > > writeback_inodes_sb() isn't special in any way - flushing some > delalloc range can fail the same way, it ends up calling the same > btrfs writepages() callback. Yes, it doesn't return any error, because > it's relying on callers either not caring about it, or if they do, > checking the inode's mapping for an error, which btrfs sets through > its writepages() callback if any an error happens (by calling > mapping_set_error()), > or any other fs specific way of storing/checking for writeback errors. Ok, so the the error handling needs to stay and there's no simple way around it. > If we get an error when flushing the dealloc range from the example in > the changelog, then the corruption happens, regardless of writeback > being > triggered by writeback_inodes_sb() or btrfs_start_delalloc_snapshot(). Thanks for the time discussing that. I'll use code from v1 and the subject from v2 and add the patch to the queue.
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 4ec2b660d014..f96867e55350 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans) } } -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle *trans) { + struct btrfs_fs_info *fs_info = trans->fs_info; + /* * We use writeback_inodes_sb here because if we used * btrfs_start_delalloc_roots we would deadlock with fs freeze. @@ -1897,15 +1899,45 @@ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) * from already being in a transaction and our join_transaction doesn't * have to re-take the fs freeze lock. */ - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); + } else { + struct btrfs_pending_snapshot *pending; + struct list_head *head = &trans->transaction->pending_snapshots; + + /* + * Flush dellaloc for any root that is going to be snapshotted. + * This is done to avoid a corrupted version of files, in the + * snapshots, that had both buffered and direct IO writes (even + * if they were done sequentially) due to an unordered update of + * the inode's size on disk. + */ + list_for_each_entry(pending, head, list) + btrfs_start_delalloc_snapshot(pending->root); + } return 0; } -static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info) +static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle *trans) { - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) + struct btrfs_fs_info *fs_info = trans->fs_info; + + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); + } else { + struct btrfs_pending_snapshot *pending; + struct list_head *head = &trans->transaction->pending_snapshots; + + /* + * Wait for any dellaloc that we started previously for the roots + * that are going to be snapshotted. This is to avoid a corrupted + * version of files in the snapshots that had both buffered and + * direct IO writes (even if they were done sequentially). + */ + list_for_each_entry(pending, head, list) + btrfs_wait_ordered_extents(pending->root, + U64_MAX, 0, U64_MAX); + } } int btrfs_commit_transaction(struct btrfs_trans_handle *trans) @@ -2024,7 +2056,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) extwriter_counter_dec(cur_trans, trans->type); - ret = btrfs_start_delalloc_flush(fs_info); + ret = btrfs_start_delalloc_flush(trans); if (ret) goto cleanup_transaction; @@ -2040,7 +2072,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) if (ret) goto cleanup_transaction; - btrfs_wait_delalloc_flush(fs_info); + btrfs_wait_delalloc_flush(trans); btrfs_scrub_pause(fs_info); /*