Btrfs: deal with error on secondary log properly
diff mbox

Message ID 1440522583-32120-1-git-send-email-jbacik@fb.com
State New
Headers show

Commit Message

Josef Bacik Aug. 25, 2015, 5:09 p.m. UTC
If we have an fsync at the same time in two seperate subvolumes we could end up
with the tree log pointing at invalid blocks.  We need to notice if our writeout
failed in anyway, if it did then we need to do a full transaction commit and
return an error on the subvolume that gave us the io error.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/file.c     | 4 ++++
 fs/btrfs/tree-log.c | 4 ++++
 2 files changed, 8 insertions(+)

Comments

Liu Bo Aug. 26, 2015, 2:06 a.m. UTC | #1
On Tue, Aug 25, 2015 at 01:09:43PM -0400, Josef Bacik wrote:
> If we have an fsync at the same time in two seperate subvolumes we could end up
> with the tree log pointing at invalid blocks.  We need to notice if our writeout

Mind to share more details of the problem?

Thanks,

-liubo
> failed in anyway, if it did then we need to do a full transaction commit and
> return an error on the subvolume that gave us the io error.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/file.c     | 4 ++++
>  fs/btrfs/tree-log.c | 4 ++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b823fac..c8f49f5 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2054,6 +2054,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  			if (!ret) {
>  				ret = btrfs_end_transaction(trans, root);
>  				goto out;
> +			} else if (ctx.io_err) {
> +				btrfs_end_transaction(trans, root);
> +				ret = ctx.io_err;
> +				goto out;
>  			}
>  		}
>  		if (!full_sync) {
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 1bbaace..b4f15f5 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2857,6 +2857,10 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>  						mark);
>  		btrfs_wait_logged_extents(trans, log, log_transid);
> +		if (ret) {
> +			btrfs_set_log_full_commit(root->fs_info, trans);
> +			ctx->io_err = ret;
> +		}
>  		wait_log_commit(log_root_tree,
>  				root_log_ctx.log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Aug. 26, 2015, 7:06 p.m. UTC | #2
On 08/25/2015 10:06 PM, Liu Bo wrote:
> On Tue, Aug 25, 2015 at 01:09:43PM -0400, Josef Bacik wrote:
>> If we have an fsync at the same time in two seperate subvolumes we could end up
>> with the tree log pointing at invalid blocks.  We need to notice if our writeout
>
> Mind to share more details of the problem?
>

It's the problem Filipe was trying to solve.  Say we fsync() on two 
different subvols, one of them will race in and be the one who commits 
the log root tree.  So process A waits on process B to add its log to 
the log root tree and write out its log.  If we get an IO error while 
writing out the log the log root tree will be pointing to invalid crap, 
and we also won't return an error back to userspace.  We need to notice 
if there was an error, turn on the transaction commit stuff since we've 
already updated the the log root tree with our subvol log so we don't 
get garbage on the disk, and we need to return an error to process B. 
Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Bo Aug. 28, 2015, 2:45 a.m. UTC | #3
On Tue, Aug 25, 2015 at 01:09:43PM -0400, Josef Bacik wrote:
> If we have an fsync at the same time in two seperate subvolumes we could end up
> with the tree log pointing at invalid blocks.  We need to notice if our writeout
> failed in anyway, if it did then we need to do a full transaction commit and
> return an error on the subvolume that gave us the io error.  Thanks,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/btrfs/file.c     | 4 ++++
>  fs/btrfs/tree-log.c | 4 ++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index b823fac..c8f49f5 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2054,6 +2054,10 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  			if (!ret) {
>  				ret = btrfs_end_transaction(trans, root);
>  				goto out;
> +			} else if (ctx.io_err) {
> +				btrfs_end_transaction(trans, root);
> +				ret = ctx.io_err;
> +				goto out;
>  			}
>  		}
>  		if (!full_sync) {
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 1bbaace..b4f15f5 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2857,6 +2857,10 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>  						mark);
>  		btrfs_wait_logged_extents(trans, log, log_transid);
> +		if (ret) {
> +			btrfs_set_log_full_commit(root->fs_info, trans);
> +			ctx->io_err = ret;
> +		}
>  		wait_log_commit(log_root_tree,
>  				root_log_ctx.log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Filipe Manana Aug. 28, 2015, 1:23 p.m. UTC | #4
On Wed, Aug 26, 2015 at 8:06 PM, Josef Bacik <jbacik@fb.com> wrote:
> On 08/25/2015 10:06 PM, Liu Bo wrote:
>>
>> On Tue, Aug 25, 2015 at 01:09:43PM -0400, Josef Bacik wrote:
>>>
>>> If we have an fsync at the same time in two seperate subvolumes we could
>>> end up
>>> with the tree log pointing at invalid blocks.  We need to notice if our
>>> writeout
>>
>>
>> Mind to share more details of the problem?
>>
>
> It's the problem Filipe was trying to solve.  Say we fsync() on two
> different subvols, one of them will race in and be the one who commits the
> log root tree.  So process A waits on process B to add its log to the log
> root tree and write out its log.  If we get an IO error while writing out
> the log the log root tree will be pointing to invalid crap, and we also
> won't return an error back to userspace.  We need to notice if there was an
> error, turn on the transaction commit stuff since we've already updated the
> the log root tree with our subvol log so we don't get garbage on the disk,
> and we need to return an error to process B. Thanks,

Josef,

So the problem was that without forcing A to trigger a transaction
commit if B gets an error when writing one or more log tree
nodes/leafs, A could write a superblock pointing to a log root tree
for which not all nodes/leafs were persisted. Then B would fall back
to a transaction commit and everything would be ok - i.e. we would
only have a "small" time window where a superblock points to an
invalid log root tree.

So the fix here shouldn't be only to force A do a transaction commit
(call btrfs_set_log_full_commit())? Why do we need to make B return an
error to userspace and not fallback to a transaction commit too (as it
was before this change)? After all this is the kind of failure for
which we can fallback to a transaction commit without losing any inode
metadata (links, owner, group, xattrs, etc) nor file data.

The change looks good, just puzzled why we need to return the error to
userspace.

thanks

>
> Josef
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Aug. 28, 2015, 1:54 p.m. UTC | #5
On 08/28/2015 09:23 AM, Filipe David Manana wrote:
> On Wed, Aug 26, 2015 at 8:06 PM, Josef Bacik <jbacik@fb.com> wrote:
>> On 08/25/2015 10:06 PM, Liu Bo wrote:
>>>
>>> On Tue, Aug 25, 2015 at 01:09:43PM -0400, Josef Bacik wrote:
>>>>
>>>> If we have an fsync at the same time in two seperate subvolumes we could
>>>> end up
>>>> with the tree log pointing at invalid blocks.  We need to notice if our
>>>> writeout
>>>
>>>
>>> Mind to share more details of the problem?
>>>
>>
>> It's the problem Filipe was trying to solve.  Say we fsync() on two
>> different subvols, one of them will race in and be the one who commits the
>> log root tree.  So process A waits on process B to add its log to the log
>> root tree and write out its log.  If we get an IO error while writing out
>> the log the log root tree will be pointing to invalid crap, and we also
>> won't return an error back to userspace.  We need to notice if there was an
>> error, turn on the transaction commit stuff since we've already updated the
>> the log root tree with our subvol log so we don't get garbage on the disk,
>> and we need to return an error to process B. Thanks,
>
> Josef,
>
> So the problem was that without forcing A to trigger a transaction
> commit if B gets an error when writing one or more log tree
> nodes/leafs, A could write a superblock pointing to a log root tree
> for which not all nodes/leafs were persisted. Then B would fall back
> to a transaction commit and everything would be ok - i.e. we would
> only have a "small" time window where a superblock points to an
> invalid log root tree.
>
> So the fix here shouldn't be only to force A do a transaction commit
> (call btrfs_set_log_full_commit())? Why do we need to make B return an
> error to userspace and not fallback to a transaction commit too (as it
> was before this change)? After all this is the kind of failure for
> which we can fallback to a transaction commit without losing any inode
> metadata (links, owner, group, xattrs, etc) nor file data.
>
> The change looks good, just puzzled why we need to return the error to
> userspace.
>

Ah well there's a reason in upcoming patches!  Basically I'm moving the 
wait on ordered extents back into the sync log stuff so I can have my go 
fast stripes back, and when I do that we'll want to return an error 
since it could be from wait_ordered_extents.  But you are right, if we 
only error out writing the metadata we can just commit the transaction 
and not return an error.  I can change this bit to not return an error 
if we want, or wait for my other patches which will fix it up.  Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..c8f49f5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2054,6 +2054,10 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 			if (!ret) {
 				ret = btrfs_end_transaction(trans, root);
 				goto out;
+			} else if (ctx.io_err) {
+				btrfs_end_transaction(trans, root);
+				ret = ctx.io_err;
+				goto out;
 			}
 		}
 		if (!full_sync) {
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1bbaace..b4f15f5 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2857,6 +2857,10 @@  int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
 						mark);
 		btrfs_wait_logged_extents(trans, log, log_transid);
+		if (ret) {
+			btrfs_set_log_full_commit(root->fs_info, trans);
+			ctx->io_err = ret;
+		}
 		wait_log_commit(log_root_tree,
 				root_log_ctx.log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);