
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space

Message ID 1533522630-21758-1-git-send-email-robbieko@synology.com (mailing list archive)
State New, archived
Series Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space

Commit Message

robbieko Aug. 6, 2018, 2:30 a.m. UTC
From: Robbie Ko <robbieko@synology.com>

Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
forced nocow writes to fall back to COW, during writeback,
when a snapshot is created. This resulted in writes made before
creating the snapshot unexpectedly failing with ENOSPC during
writeback, even though success (0) had been returned to user space
through the write system call.

The steps leading to this problem are:

1. When it's not possible to allocate data space for a write,
the buffered write path checks if a NOCOW write is possible.
If it is, it will not reserve space and success (0) is returned
to user space (see the sketch after this list).

2. Then when a snapshot is created, the root's will_be_snapshotted
atomic is incremented and writeback is triggered for all inodes
that belong to the root being snapshotted. Incrementing that atomic
forces all previous writes to fall back to COW during writeback
(running delalloc).

3. This results in the writeback for those inodes failing, and
therefore the ENOSPC error being set in their mappings, so that a
subsequent fsync on them will report the error to user space. So it's
not completely silent data loss (since fsync will report ENOSPC), but
it's very unexpected and undesirable behaviour, because if a clean
shutdown/unmount of the filesystem happens without previous calls to
fsync, the data is expected to be present in the files after mounting
the filesystem again.
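
A minimal sketch of the decision in step 1, with simplified helper names
(reserve_data_space() and can_nocow_range() are illustrative, not the
exact btrfs functions; the real logic lives in the buffered write path
in fs/btrfs/file.c):

static int reserve_for_buffered_write(struct inode *inode, loff_t pos,
				      size_t len, bool *nocow)
{
	int ret;

	*nocow = false;
	ret = reserve_data_space(inode, pos, len);	/* normal COW path */
	if (ret != -ENOSPC)
		return ret;

	/*
	 * No data space left: if the range can be overwritten in place
	 * (NODATACOW or prealloc extent, no csums, no pending snapshot),
	 * skip the data reservation and let the write succeed.
	 */
	if (can_nocow_range(inode, pos, len)) {
		*nocow = true;
		return 0;
	}

	return -ENOSPC;		/* reported to user space at write() time */
}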

So fix this by adding a new atomic named snapshot_force_cow to the
root structure, which prevents this behaviour and works the following
way (a condensed sketch follows the list):

1. It is incremented when we start to create a snapshot, after
triggering writeback and before waiting for writeback to finish.

2. This new atomic is now what is used by writeback (running delalloc)
to decide whether we need to fall back to COW or not. Because we
incremented this new atomic after triggering writeback in the snapshot
creation ioctl, we ensure that all buffered writes that happened
before snapshot creation will succeed and not fall back to COW
(which would make them fail with ENOSPC).

3. The existing atomic, will_be_snapshotted, is kept because it is
used to force new buffered writes, that start after we started
snapshotting, to reserve data space even when NOCOW is possible.
This makes these writes fail early with ENOSPC when there's no
available space to allocate, preventing the unexpected behaviour
of writeback later failing with ENOSPC due to a fallback to COW mode.
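
A condensed sketch of the ordering this fix establishes, abridged from
the diff quoted in the replies below (elided context is marked in
comments):

/* create_snapshot() in fs/btrfs/ioctl.c, abridged: */

/*
 * Force new buffered writes to reserve data space even when NOCOW is
 * possible, so they fail early with ENOSPC instead of at writeback time.
 */
atomic_inc(&root->will_be_snapshotted);
smp_mb__after_atomic();
/* ... wait for in-flight NOCOW writers, then trigger writeback for the root ... */
if (ret)
	goto dec_and_free;

/*
 * All previous writes have started writeback in NOCOW mode; only writes
 * starting from here fall back to COW while the snapshot is created.
 */
atomic_inc(&root->snapshot_force_cow);
snapshot_force_cow = true;

btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);

/* run_delalloc_nocow() in fs/btrfs/inode.c, abridged: */
if (!nolock && unlikely(atomic_read(&root->snapshot_force_cow)))
	goto out_check;		/* pending snapshot: take the COW path */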

Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/inode.c   | 26 +++++---------------------
 fs/btrfs/ioctl.c   | 16 ++++++++++++++++
 4 files changed, 23 insertions(+), 21 deletions(-)

Comments

David Sterba Aug. 7, 2018, 3:52 p.m. UTC | #1
On Mon, Aug 06, 2018 at 10:30:30AM +0800, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> forced nocow writes to fallback to COW, during writeback,
> when a snapshot is created. This resulted in writes made before
> creating the snapshot to unexpectedly fail with ENOSPC during
> writeback when success (0) was returned to user space through
> the write system call.
> 
> The steps leading to this problem are:
> 
> 1. When it's not possible to allocate data space for a write,
> the buffered write path checks if a NOCOW write is possible.
> If it is, it will not reserve space and success (0) is returned
> to user space.
> 
> 2. Then when a snapshot is created, the root's will_be_snapshotted
> atomic is incremented and writeback is triggered for all inode's
> that belong to the root being snapshotted. Incrementing that atomic
> forces all previous writes to fallback to COW during writeback
> (running delalloc).
> 
> 3. This results in the writeback for the inodes to fail and therefore
> setting the ENOSPC error in their mappings, so that a subsequent fsync
> on them will report the error to user space. So it's not a completely
> silent data loss (since fsync will report ENOSPC) but it's a very
> unexpected and undesirable behaviour, because if a clean
> shutdown/unmount of the filesystem happens without previous calls to
> fsync, it is expected to have the data present in the files after
> mounting the filesystem again.
> 
> So fix this by adding a new atomic named snapshot_force_cow to the
> root structure which prevents this behaviour and works the following way:
> 
> 1. It is incremented when we start to create a snapshot after
> triggering writeback and before waiting for writeback to finish.
> 
> 2. This new atomic is now what is used by writeback (running delalloc)
> to decide whether we need to fallback to COW or not. Because we
> incremented this new atomic after triggering writeback in the snapshot
> creation ioctl, we ensure that all buffered writes that happened
> before snapshot creation will succeed and not fallback to COW
> (which would make them fail with ENOSPC).
> 
> 3. The existing atomic, will_be_snapshotted, is kept because it is
> used to force new buffered writes, that start after we started
> snapshotting, to reserve data space even when NOCOW is possible.
> This makes these writes fail early with ENOSPC when there's no
> available space to allocate, preventing the unexpected behaviour
> of writeback later failing with ENOSPC due to a fallback to COW mode.
> 
> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next and will probably be in the 2nd pull for the 4.19
merge window, thanks.
Qu Wenruo April 25, 2019, 9:29 a.m. UTC | #2
On 2018/8/6 10:30 AM, robbieko wrote:
> From: Robbie Ko <robbieko@synology.com>
> 
> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> forced nocow writes to fallback to COW, during writeback,
> when a snapshot is created. This resulted in writes made before
> creating the snapshot to unexpectedly fail with ENOSPC during
> writeback when success (0) was returned to user space through
> the write system call.
> 
> The steps leading to this problem are:
> 
> 1. When it's not possible to allocate data space for a write,
> the buffered write path checks if a NOCOW write is possible.
> If it is, it will not reserve space and success (0) is returned
> to user space.
> 
> 2. Then when a snapshot is created, the root's will_be_snapshotted
> atomic is incremented and writeback is triggered for all inode's
> that belong to the root being snapshotted. Incrementing that atomic
> forces all previous writes to fallback to COW during writeback
> (running delalloc).
> 
> 3. This results in the writeback for the inodes to fail and therefore
> setting the ENOSPC error in their mappings, so that a subsequent fsync
> on them will report the error to user space. So it's not a completely
> silent data loss (since fsync will report ENOSPC) but it's a very
> unexpected and undesirable behaviour, because if a clean
> shutdown/unmount of the filesystem happens without previous calls to
> fsync, it is expected to have the data present in the files after
> mounting the filesystem again.
> 
> So fix this by adding a new atomic named snapshot_force_cow to the
> root structure which prevents this behaviour and works the following way:

This fix is a little out of date now.

Since 609e804d771f ("Btrfs: fix file corruption after snapshotting due
to mix of buffered/DIO writes") we write all dirty inodes/pages back to
disk before the snapshot is created.

This makes sure that all dirty inodes/pages from before the transaction
reach disk before the snapshot subvolume is created (a rough sketch of
this ordering follows).
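
A rough sketch of that ordering, per the discussion in this thread; the
call sites and signatures are approximate and not the literal code of
commit 609e804d771f:

/*
 * Approximate sketch of what 609e804d771f does during the transaction
 * commit: flush each pending snapshot's source root before the snapshot
 * subvolume is created (illustrative, not the literal upstream code).
 */
static int flush_pending_snapshot_sources(struct btrfs_trans_handle *trans)
{
	struct btrfs_pending_snapshot *pending;
	int ret;

	list_for_each_entry(pending, &trans->transaction->pending_snapshots, list) {
		/* Write back all dirty pages of the snapshot source ... */
		ret = btrfs_start_delalloc_snapshot(pending->root);
		if (ret)
			return ret;
		/* ... and wait for the resulting ordered extents to complete, */
		btrfs_wait_ordered_extents(pending->root, U64_MAX, 0, (u64)-1);
		/* so the snapshot only sees data that has reached disk. */
	}
	return 0;
}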

Thus the whole will_be_snapshotted member no longer makes sense, and in
fact it could cause more problems if we're going back to always checking
NODATACOW at buffered write time.

Can we just revert this patch and remove the will_be_snapshotted member?

Thanks,
Qu

> 
> 1. It is incremented when we start to create a snapshot after
> triggering writeback and before waiting for writeback to finish.
> 
> 2. This new atomic is now what is used by writeback (running delalloc)
> to decide whether we need to fallback to COW or not. Because we
> incremented this new atomic after triggering writeback in the snapshot
> creation ioctl, we ensure that all buffered writes that happened
> before snapshot creation will succeed and not fallback to COW
> (which would make them fail with ENOSPC).
> 
> 3. The existing atomic, will_be_snapshotted, is kept because it is
> used to force new buffered writes, that start after we started
> snapshotting, to reserve data space even when NOCOW is possible.
> This makes these writes fail early with ENOSPC when there's no
> available space to allocate, preventing the unexpected behaviour
> of writeback later failing with ENOSPC due to a fallback to COW mode.
> 
> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/disk-io.c |  1 +
>  fs/btrfs/inode.c   | 26 +++++---------------------
>  fs/btrfs/ioctl.c   | 16 ++++++++++++++++
>  4 files changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 118346a..663ce05 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>  	int send_in_progress;
>  	struct btrfs_subvolume_writers *subv_writers;
>  	atomic_t will_be_snapshotted;
> +	atomic_t snapshot_force_cow;
>  
>  	/* For qgroup metadata reserved space */
>  	spinlock_t qgroup_meta_rsv_lock;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 205092d..5573916 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>  	atomic_set(&root->log_batch, 0);
>  	refcount_set(&root->refs, 1);
>  	atomic_set(&root->will_be_snapshotted, 0);
> +	atomic_set(&root->snapshot_force_cow, 0);
>  	root->log_transid = 0;
>  	root->log_transid_committed = -1;
>  	root->last_log_commit = 0;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index eba61bc..263b852 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  	u64 disk_num_bytes;
>  	u64 ram_bytes;
>  	int extent_type;
> -	int ret, err;
> +	int ret;
>  	int type;
>  	int nocow;
>  	int check_prev = 1;
> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  			 * if there are pending snapshots for this root,
>  			 * we fall into common COW way.
>  			 */
> -			if (!nolock) {
> -				err = btrfs_start_write_no_snapshotting(root);
> -				if (!err)
> -					goto out_check;
> -			}
> +			if (!nolock &&
> +				unlikely(atomic_read(&root->snapshot_force_cow)))
> +				goto out_check;
>  			/*
>  			 * force cow if csum exists in the range.
>  			 * this ensure that csum for a given extent are
> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  			ret = csum_exist_in_range(fs_info, disk_bytenr,
>  						  num_bytes);
>  			if (ret) {
> -				if (!nolock)
> -					btrfs_end_write_no_snapshotting(root);
> -
>  				/*
>  				 * ret could be -EIO if the above fails to read
>  				 * metadata.
> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  				WARN_ON_ONCE(nolock);
>  				goto out_check;
>  			}
> -			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
> -				if (!nolock)
> -					btrfs_end_write_no_snapshotting(root);
> +			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
>  				goto out_check;
> -			}
>  			nocow = 1;
>  		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>  			extent_end = found_key.offset +
> @@ -1453,8 +1445,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  out_check:
>  		if (extent_end <= start) {
>  			path->slots[0]++;
> -			if (!nolock && nocow)
> -				btrfs_end_write_no_snapshotting(root);
>  			if (nocow)
>  				btrfs_dec_nocow_writers(fs_info, disk_bytenr);
>  			goto next_slot;
> @@ -1476,8 +1466,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  					     end, page_started, nr_written, 1,
>  					     NULL);
>  			if (ret) {
> -				if (!nolock && nocow)
> -					btrfs_end_write_no_snapshotting(root);
>  				if (nocow)
>  					btrfs_dec_nocow_writers(fs_info,
>  								disk_bytenr);
> @@ -1497,8 +1485,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  					  ram_bytes, BTRFS_COMPRESS_NONE,
>  					  BTRFS_ORDERED_PREALLOC);
>  			if (IS_ERR(em)) {
> -				if (!nolock && nocow)
> -					btrfs_end_write_no_snapshotting(root);
>  				if (nocow)
>  					btrfs_dec_nocow_writers(fs_info,
>  								disk_bytenr);
> @@ -1537,8 +1523,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>  					     EXTENT_CLEAR_DATA_RESV,
>  					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
>  
> -		if (!nolock && nocow)
> -			btrfs_end_write_no_snapshotting(root);
>  		cur_offset = extent_end;
>  
>  		/*
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b077544..331b495 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>  	struct btrfs_pending_snapshot *pending_snapshot;
>  	struct btrfs_trans_handle *trans;
>  	int ret;
> +	bool snapshot_force_cow = false;
>  
>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>  		return -EINVAL;
> @@ -777,6 +778,11 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>  		goto free_pending;
>  	}
>  
> +	/*
> +	 * Force new buffered writes to reserve space even when NOCOW is
> +	 * possible. This is to avoid later writeback (running dealloc)
> +	 * to fallback to COW mode and unexpectedly fail with ENOSPC.
> +	 */
>  	atomic_inc(&root->will_be_snapshotted);
>  	smp_mb__after_atomic();
>  	/* wait for no snapshot writes */
> @@ -787,6 +793,14 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>  	if (ret)
>  		goto dec_and_free;
>  
> +	/*
> +	 * All previous writes have started writeback in NOCOW mode, so now
> +	 * we force future writes to fallback to COW mode during snapshot
> +	 * creation.
> +	 */
> +	atomic_inc(&root->snapshot_force_cow);
> +	snapshot_force_cow = true;
> +
>  	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>  
>  	btrfs_init_block_rsv(&pending_snapshot->block_rsv,
> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>  fail:
>  	btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
>  dec_and_free:
> +	if (snapshot_force_cow)
> +		atomic_dec(&root->snapshot_force_cow);
>  	if (atomic_dec_and_test(&root->will_be_snapshotted))
>  		wake_up_var(&root->will_be_snapshotted);
>  free_pending:
>
Nikolay Borisov April 25, 2019, 9:37 a.m. UTC | #3
On 25.04.2019 at 12:29, Qu Wenruo wrote:
> 
> 
> On 2018/8/6 上午10:30, robbieko wrote:
>> From: Robbie Ko <robbieko@synology.com>
>>
>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>> forced nocow writes to fallback to COW, during writeback,
>> when a snapshot is created. This resulted in writes made before
>> creating the snapshot to unexpectedly fail with ENOSPC during
>> writeback when success (0) was returned to user space through
>> the write system call.
>>
>> The steps leading to this problem are:
>>
>> 1. When it's not possible to allocate data space for a write,
>> the buffered write path checks if a NOCOW write is possible.
>> If it is, it will not reserve space and success (0) is returned
>> to user space.
>>
>> 2. Then when a snapshot is created, the root's will_be_snapshotted
>> atomic is incremented and writeback is triggered for all inode's
>> that belong to the root being snapshotted. Incrementing that atomic
>> forces all previous writes to fallback to COW during writeback
>> (running delalloc).
>>
>> 3. This results in the writeback for the inodes to fail and therefore
>> setting the ENOSPC error in their mappings, so that a subsequent fsync
>> on them will report the error to user space. So it's not a completely
>> silent data loss (since fsync will report ENOSPC) but it's a very
>> unexpected and undesirable behaviour, because if a clean
>> shutdown/unmount of the filesystem happens without previous calls to
>> fsync, it is expected to have the data present in the files after
>> mounting the filesystem again.
>>
>> So fix this by adding a new atomic named snapshot_force_cow to the
>> root structure which prevents this behaviour and works the following way:
> 
> This fix is a little out of dated now.
> 
> Since 609e804d771f ("Btrfs: fix file corruption after snapshotting due
> to mix of buffered/DIO writes") we will write back all dirty
> inodes/pages back to disk before snapshot.
> 
> This makes sure that all dirty inodes/pages before a transaction will
> reach disk before snapshot  subvolume is created.
> 
> Thus the whole will_be_snapshotted member no long makes sense, and in
> fact it could cause more problem if we're going back to always check
> NODATACOW at buffered write time.
> 
> Can we just revert this patch and the will_be_snapshotted member ?

I'm very much in favor of removing the whole will_be_snapshotted
mechanism. In fact I have a series which essentially replaces it with
an rwsem:

1. Snapshots are serialized by acquiring the write side of the semaphore.

2. Nocow writers simply acquire the read side of the semaphore.

The will_be_snapshotted thing is very convoluted: it relies on a percpu
counter/waitqueue to exclude snapshots from pending nocow writers, and
OTOH it depends on an atomic_t and an implicit wait queue (thanks to the
wait_var infrastructure) to exclude nocow writers from pending snapshots.
Filipe had some concerns regarding performance, but if the patch you
mentioned fixed all the issues I'm all in favor of removing the code!
(A sketch of the rwsem scheme follows.)
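
A minimal sketch of the rwsem scheme described above, using hypothetical
member and helper names (the actual series may differ); it assumes a
"struct rw_semaphore snapshot_rwsem" member added to struct btrfs_root:

/* Nocow writers take the read side; trylock failure means a snapshot is pending. */
static bool start_nocow_write(struct btrfs_root *root)
{
	return down_read_trylock(&root->snapshot_rwsem);
}

static void end_nocow_write(struct btrfs_root *root)
{
	up_read(&root->snapshot_rwsem);
}

/* Snapshot creation takes the write side, waiting out all in-flight nocow writers. */
static void snapshot_lock(struct btrfs_root *root)
{
	down_write(&root->snapshot_rwsem);
}

static void snapshot_unlock(struct btrfs_root *root)
{
	up_write(&root->snapshot_rwsem);
}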


> 
> Thanks,
> Qu
> 
>>
>> 1. It is incremented when we start to create a snapshot after
>> triggering writeback and before waiting for writeback to finish.
>>
>> 2. This new atomic is now what is used by writeback (running delalloc)
>> to decide whether we need to fallback to COW or not. Because we
>> incremented this new atomic after triggering writeback in the snapshot
>> creation ioctl, we ensure that all buffered writes that happened
>> before snapshot creation will succeed and not fallback to COW
>> (which would make them fail with ENOSPC).
>>
>> 3. The existing atomic, will_be_snapshotted, is kept because it is
>> used to force new buffered writes, that start after we started
>> snapshotting, to reserve data space even when NOCOW is possible.
>> This makes these writes fail early with ENOSPC when there's no
>> available space to allocate, preventing the unexpected behaviour
>> of writeback later failing with ENOSPC due to a fallback to COW mode.
>>
>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/ctree.h   |  1 +
>>  fs/btrfs/disk-io.c |  1 +
>>  fs/btrfs/inode.c   | 26 +++++---------------------
>>  fs/btrfs/ioctl.c   | 16 ++++++++++++++++
>>  4 files changed, 23 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 118346a..663ce05 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>>  	int send_in_progress;
>>  	struct btrfs_subvolume_writers *subv_writers;
>>  	atomic_t will_be_snapshotted;
>> +	atomic_t snapshot_force_cow;
>>  
>>  	/* For qgroup metadata reserved space */
>>  	spinlock_t qgroup_meta_rsv_lock;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 205092d..5573916 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>>  	atomic_set(&root->log_batch, 0);
>>  	refcount_set(&root->refs, 1);
>>  	atomic_set(&root->will_be_snapshotted, 0);
>> +	atomic_set(&root->snapshot_force_cow, 0);
>>  	root->log_transid = 0;
>>  	root->log_transid_committed = -1;
>>  	root->last_log_commit = 0;
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index eba61bc..263b852 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  	u64 disk_num_bytes;
>>  	u64 ram_bytes;
>>  	int extent_type;
>> -	int ret, err;
>> +	int ret;
>>  	int type;
>>  	int nocow;
>>  	int check_prev = 1;
>> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  			 * if there are pending snapshots for this root,
>>  			 * we fall into common COW way.
>>  			 */
>> -			if (!nolock) {
>> -				err = btrfs_start_write_no_snapshotting(root);
>> -				if (!err)
>> -					goto out_check;
>> -			}
>> +			if (!nolock &&
>> +				unlikely(atomic_read(&root->snapshot_force_cow)))
>> +				goto out_check;
>>  			/*
>>  			 * force cow if csum exists in the range.
>>  			 * this ensure that csum for a given extent are
>> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  			ret = csum_exist_in_range(fs_info, disk_bytenr,
>>  						  num_bytes);
>>  			if (ret) {
>> -				if (!nolock)
>> -					btrfs_end_write_no_snapshotting(root);
>> -
>>  				/*
>>  				 * ret could be -EIO if the above fails to read
>>  				 * metadata.
>> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  				WARN_ON_ONCE(nolock);
>>  				goto out_check;
>>  			}
>> -			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
>> -				if (!nolock)
>> -					btrfs_end_write_no_snapshotting(root);
>> +			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
>>  				goto out_check;
>> -			}
>>  			nocow = 1;
>>  		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>>  			extent_end = found_key.offset +
>> @@ -1453,8 +1445,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  out_check:
>>  		if (extent_end <= start) {
>>  			path->slots[0]++;
>> -			if (!nolock && nocow)
>> -				btrfs_end_write_no_snapshotting(root);
>>  			if (nocow)
>>  				btrfs_dec_nocow_writers(fs_info, disk_bytenr);
>>  			goto next_slot;
>> @@ -1476,8 +1466,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  					     end, page_started, nr_written, 1,
>>  					     NULL);
>>  			if (ret) {
>> -				if (!nolock && nocow)
>> -					btrfs_end_write_no_snapshotting(root);
>>  				if (nocow)
>>  					btrfs_dec_nocow_writers(fs_info,
>>  								disk_bytenr);
>> @@ -1497,8 +1485,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  					  ram_bytes, BTRFS_COMPRESS_NONE,
>>  					  BTRFS_ORDERED_PREALLOC);
>>  			if (IS_ERR(em)) {
>> -				if (!nolock && nocow)
>> -					btrfs_end_write_no_snapshotting(root);
>>  				if (nocow)
>>  					btrfs_dec_nocow_writers(fs_info,
>>  								disk_bytenr);
>> @@ -1537,8 +1523,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>  					     EXTENT_CLEAR_DATA_RESV,
>>  					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
>>  
>> -		if (!nolock && nocow)
>> -			btrfs_end_write_no_snapshotting(root);
>>  		cur_offset = extent_end;
>>  
>>  		/*
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index b077544..331b495 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>  	struct btrfs_pending_snapshot *pending_snapshot;
>>  	struct btrfs_trans_handle *trans;
>>  	int ret;
>> +	bool snapshot_force_cow = false;
>>  
>>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>>  		return -EINVAL;
>> @@ -777,6 +778,11 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>  		goto free_pending;
>>  	}
>>  
>> +	/*
>> +	 * Force new buffered writes to reserve space even when NOCOW is
>> +	 * possible. This is to avoid later writeback (running dealloc)
>> +	 * to fallback to COW mode and unexpectedly fail with ENOSPC.
>> +	 */
>>  	atomic_inc(&root->will_be_snapshotted);
>>  	smp_mb__after_atomic();
>>  	/* wait for no snapshot writes */
>> @@ -787,6 +793,14 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>  	if (ret)
>>  		goto dec_and_free;
>>  
>> +	/*
>> +	 * All previous writes have started writeback in NOCOW mode, so now
>> +	 * we force future writes to fallback to COW mode during snapshot
>> +	 * creation.
>> +	 */
>> +	atomic_inc(&root->snapshot_force_cow);
>> +	snapshot_force_cow = true;
>> +
>>  	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>>  
>>  	btrfs_init_block_rsv(&pending_snapshot->block_rsv,
>> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>  fail:
>>  	btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
>>  dec_and_free:
>> +	if (snapshot_force_cow)
>> +		atomic_dec(&root->snapshot_force_cow);
>>  	if (atomic_dec_and_test(&root->will_be_snapshotted))
>>  		wake_up_var(&root->will_be_snapshotted);
>>  free_pending:
>>
>
Qu Wenruo April 25, 2019, 10:07 a.m. UTC | #4
On 2019/4/25 5:37 PM, Nikolay Borisov wrote:
>
>
> On 25.04.19 г. 12:29 ч., Qu Wenruo wrote:
>>
>>
>> On 2018/8/6 上午10:30, robbieko wrote:
>>> From: Robbie Ko <robbieko@synology.com>
>>>
>>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> forced nocow writes to fallback to COW, during writeback,
>>> when a snapshot is created. This resulted in writes made before
>>> creating the snapshot to unexpectedly fail with ENOSPC during
>>> writeback when success (0) was returned to user space through
>>> the write system call.
>>>
>>> The steps leading to this problem are:
>>>
>>> 1. When it's not possible to allocate data space for a write,
>>> the buffered write path checks if a NOCOW write is possible.
>>> If it is, it will not reserve space and success (0) is returned
>>> to user space.
>>>
>>> 2. Then when a snapshot is created, the root's will_be_snapshotted
>>> atomic is incremented and writeback is triggered for all inode's
>>> that belong to the root being snapshotted. Incrementing that atomic
>>> forces all previous writes to fallback to COW during writeback
>>> (running delalloc).
>>>
>>> 3. This results in the writeback for the inodes to fail and therefore
>>> setting the ENOSPC error in their mappings, so that a subsequent fsync
>>> on them will report the error to user space. So it's not a completely
>>> silent data loss (since fsync will report ENOSPC) but it's a very
>>> unexpected and undesirable behaviour, because if a clean
>>> shutdown/unmount of the filesystem happens without previous calls to
>>> fsync, it is expected to have the data present in the files after
>>> mounting the filesystem again.
>>>
>>> So fix this by adding a new atomic named snapshot_force_cow to the
>>> root structure which prevents this behaviour and works the following way:
>>
>> This fix is a little out of dated now.
>>
>> Since 609e804d771f ("Btrfs: fix file corruption after snapshotting due
>> to mix of buffered/DIO writes") we will write back all dirty
>> inodes/pages back to disk before snapshot.
>>
>> This makes sure that all dirty inodes/pages before a transaction will
>> reach disk before snapshot  subvolume is created.
>>
>> Thus the whole will_be_snapshotted member no long makes sense, and in
>> fact it could cause more problem if we're going back to always check
>> NODATACOW at buffered write time.
>>
>> Can we just revert this patch and the will_be_snapshotted member ?
>
> I'm very much in favor of removing the whole will_be_snapshotted
> mechanism. In fact I have a series which replaces it with a rwsem
> essentially:
>
> 1. Serializing snapshots (acquiring write-side of the semaphore)

With the new btrfs_start_delalloc_snapshot() in
btrfs_commit_transaction(), do we really need to do anything special now?

Snapshot creation will only happen in btrfs_commit_transaction(); as
long as all dirty inodes/pages are written out before the pending
snapshots are created, we're completely fine.

Or did I miss something?

Thanks,
Qu

>
> 2. Nocow writers simply acquire the readsize of the semaphore.
>
> the will_be_snapshoted thing is very convoluted relying on a percpu
> counter/waitqueue  to exclude snapshots from pending nocow writers. OTOH
> it depends on atomic_t and an implicit wait queue thanks to wait_var
> infrastructure to exclude nocow writers from pending snapshots. Filipe
> had some concerns regarding performance but if the patch you mentioned
> fixed all issues I'm all in favor of removing the code!
>
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> 1. It is incremented when we start to create a snapshot after
>>> triggering writeback and before waiting for writeback to finish.
>>>
>>> 2. This new atomic is now what is used by writeback (running delalloc)
>>> to decide whether we need to fallback to COW or not. Because we
>>> incremented this new atomic after triggering writeback in the snapshot
>>> creation ioctl, we ensure that all buffered writes that happened
>>> before snapshot creation will succeed and not fallback to COW
>>> (which would make them fail with ENOSPC).
>>>
>>> 3. The existing atomic, will_be_snapshotted, is kept because it is
>>> used to force new buffered writes, that start after we started
>>> snapshotting, to reserve data space even when NOCOW is possible.
>>> This makes these writes fail early with ENOSPC when there's no
>>> available space to allocate, preventing the unexpected behaviour
>>> of writeback later failing with ENOSPC due to a fallback to COW mode.
>>>
>>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>  fs/btrfs/ctree.h   |  1 +
>>>  fs/btrfs/disk-io.c |  1 +
>>>  fs/btrfs/inode.c   | 26 +++++---------------------
>>>  fs/btrfs/ioctl.c   | 16 ++++++++++++++++
>>>  4 files changed, 23 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 118346a..663ce05 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>>>  	int send_in_progress;
>>>  	struct btrfs_subvolume_writers *subv_writers;
>>>  	atomic_t will_be_snapshotted;
>>> +	atomic_t snapshot_force_cow;
>>>
>>>  	/* For qgroup metadata reserved space */
>>>  	spinlock_t qgroup_meta_rsv_lock;
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 205092d..5573916 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>>>  	atomic_set(&root->log_batch, 0);
>>>  	refcount_set(&root->refs, 1);
>>>  	atomic_set(&root->will_be_snapshotted, 0);
>>> +	atomic_set(&root->snapshot_force_cow, 0);
>>>  	root->log_transid = 0;
>>>  	root->log_transid_committed = -1;
>>>  	root->last_log_commit = 0;
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index eba61bc..263b852 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  	u64 disk_num_bytes;
>>>  	u64 ram_bytes;
>>>  	int extent_type;
>>> -	int ret, err;
>>> +	int ret;
>>>  	int type;
>>>  	int nocow;
>>>  	int check_prev = 1;
>>> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  			 * if there are pending snapshots for this root,
>>>  			 * we fall into common COW way.
>>>  			 */
>>> -			if (!nolock) {
>>> -				err = btrfs_start_write_no_snapshotting(root);
>>> -				if (!err)
>>> -					goto out_check;
>>> -			}
>>> +			if (!nolock &&
>>> +				unlikely(atomic_read(&root->snapshot_force_cow)))
>>> +				goto out_check;
>>>  			/*
>>>  			 * force cow if csum exists in the range.
>>>  			 * this ensure that csum for a given extent are
>>> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  			ret = csum_exist_in_range(fs_info, disk_bytenr,
>>>  						  num_bytes);
>>>  			if (ret) {
>>> -				if (!nolock)
>>> -					btrfs_end_write_no_snapshotting(root);
>>> -
>>>  				/*
>>>  				 * ret could be -EIO if the above fails to read
>>>  				 * metadata.
>>> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  				WARN_ON_ONCE(nolock);
>>>  				goto out_check;
>>>  			}
>>> -			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
>>> -				if (!nolock)
>>> -					btrfs_end_write_no_snapshotting(root);
>>> +			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
>>>  				goto out_check;
>>> -			}
>>>  			nocow = 1;
>>>  		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>>>  			extent_end = found_key.offset +
>>> @@ -1453,8 +1445,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  out_check:
>>>  		if (extent_end <= start) {
>>>  			path->slots[0]++;
>>> -			if (!nolock && nocow)
>>> -				btrfs_end_write_no_snapshotting(root);
>>>  			if (nocow)
>>>  				btrfs_dec_nocow_writers(fs_info, disk_bytenr);
>>>  			goto next_slot;
>>> @@ -1476,8 +1466,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  					     end, page_started, nr_written, 1,
>>>  					     NULL);
>>>  			if (ret) {
>>> -				if (!nolock && nocow)
>>> -					btrfs_end_write_no_snapshotting(root);
>>>  				if (nocow)
>>>  					btrfs_dec_nocow_writers(fs_info,
>>>  								disk_bytenr);
>>> @@ -1497,8 +1485,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  					  ram_bytes, BTRFS_COMPRESS_NONE,
>>>  					  BTRFS_ORDERED_PREALLOC);
>>>  			if (IS_ERR(em)) {
>>> -				if (!nolock && nocow)
>>> -					btrfs_end_write_no_snapshotting(root);
>>>  				if (nocow)
>>>  					btrfs_dec_nocow_writers(fs_info,
>>>  								disk_bytenr);
>>> @@ -1537,8 +1523,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>  					     EXTENT_CLEAR_DATA_RESV,
>>>  					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
>>>
>>> -		if (!nolock && nocow)
>>> -			btrfs_end_write_no_snapshotting(root);
>>>  		cur_offset = extent_end;
>>>
>>>  		/*
>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>> index b077544..331b495 100644
>>> --- a/fs/btrfs/ioctl.c
>>> +++ b/fs/btrfs/ioctl.c
>>> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>  	struct btrfs_pending_snapshot *pending_snapshot;
>>>  	struct btrfs_trans_handle *trans;
>>>  	int ret;
>>> +	bool snapshot_force_cow = false;
>>>
>>>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>>>  		return -EINVAL;
>>> @@ -777,6 +778,11 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>  		goto free_pending;
>>>  	}
>>>
>>> +	/*
>>> +	 * Force new buffered writes to reserve space even when NOCOW is
>>> +	 * possible. This is to avoid later writeback (running dealloc)
>>> +	 * to fallback to COW mode and unexpectedly fail with ENOSPC.
>>> +	 */
>>>  	atomic_inc(&root->will_be_snapshotted);
>>>  	smp_mb__after_atomic();
>>>  	/* wait for no snapshot writes */
>>> @@ -787,6 +793,14 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>  	if (ret)
>>>  		goto dec_and_free;
>>>
>>> +	/*
>>> +	 * All previous writes have started writeback in NOCOW mode, so now
>>> +	 * we force future writes to fallback to COW mode during snapshot
>>> +	 * creation.
>>> +	 */
>>> +	atomic_inc(&root->snapshot_force_cow);
>>> +	snapshot_force_cow = true;
>>> +
>>>  	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>>>
>>>  	btrfs_init_block_rsv(&pending_snapshot->block_rsv,
>>> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>  fail:
>>>  	btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
>>>  dec_and_free:
>>> +	if (snapshot_force_cow)
>>> +		atomic_dec(&root->snapshot_force_cow);
>>>  	if (atomic_dec_and_test(&root->will_be_snapshotted))
>>>  		wake_up_var(&root->will_be_snapshotted);
>>>  free_pending:
>>>
>>
Filipe Manana April 25, 2019, 2:51 p.m. UTC | #5
On Thu, Apr 25, 2019 at 2:06 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2019/4/25 下午5:37, Nikolay Borisov wrote:
> >
> >
> > On 25.04.19 г. 12:29 ч., Qu Wenruo wrote:
> >>
> >>
> >> On 2018/8/6 上午10:30, robbieko wrote:
> >>> From: Robbie Ko <robbieko@synology.com>
> >>>
> >>> Commit e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> >>> forced nocow writes to fallback to COW, during writeback,
> >>> when a snapshot is created. This resulted in writes made before
> >>> creating the snapshot to unexpectedly fail with ENOSPC during
> >>> writeback when success (0) was returned to user space through
> >>> the write system call.
> >>>
> >>> The steps leading to this problem are:
> >>>
> >>> 1. When it's not possible to allocate data space for a write,
> >>> the buffered write path checks if a NOCOW write is possible.
> >>> If it is, it will not reserve space and success (0) is returned
> >>> to user space.
> >>>
> >>> 2. Then when a snapshot is created, the root's will_be_snapshotted
> >>> atomic is incremented and writeback is triggered for all inode's
> >>> that belong to the root being snapshotted. Incrementing that atomic
> >>> forces all previous writes to fallback to COW during writeback
> >>> (running delalloc).
> >>>
> >>> 3. This results in the writeback for the inodes to fail and therefore
> >>> setting the ENOSPC error in their mappings, so that a subsequent fsync
> >>> on them will report the error to user space. So it's not a completely
> >>> silent data loss (since fsync will report ENOSPC) but it's a very
> >>> unexpected and undesirable behaviour, because if a clean
> >>> shutdown/unmount of the filesystem happens without previous calls to
> >>> fsync, it is expected to have the data present in the files after
> >>> mounting the filesystem again.
> >>>
> >>> So fix this by adding a new atomic named snapshot_force_cow to the
> >>> root structure which prevents this behaviour and works the following way:
> >>
> >> This fix is a little out of dated now.
> >>
> >> Since 609e804d771f ("Btrfs: fix file corruption after snapshotting due
> >> to mix of buffered/DIO writes") we will write back all dirty
> >> inodes/pages back to disk before snapshot.
> >>
> >> This makes sure that all dirty inodes/pages before a transaction will
> >> reach disk before snapshot  subvolume is created.
> >>
> >> Thus the whole will_be_snapshotted member no long makes sense, and in
> >> fact it could cause more problem if we're going back to always check
> >> NODATACOW at buffered write time.
> >>
> >> Can we just revert this patch and the will_be_snapshotted member ?
> >
> > I'm very much in favor of removing the whole will_be_snapshotted
> > mechanism. In fact I have a series which replaces it with a rwsem
> > essentially:
> >
> > 1. Serializing snapshots (acquiring write-side of the semaphore)
>
> With the new btrfs_start_delalloc_snapshot() in
> btrfs_commit_transaction(), do we really need to do anything special now?
>
> Snapshot creation will only happen in btrfs_commit_transaction(), as
> long as all dirty inodes/pages are written before pending snapshots,
> we're completely fine.
>
> Or did I miss something?

You missed the whole point of both patches.

The one I authored recently is about ensuring we see an ordered update
of an inode's disk_i_size / i_size.
It flushes delalloc a second time, during the transaction commit, to
ensure an ordered update of disk_i_size in case direct IO writes
happened during snapshotting.
btrfs/078 could detect this when not running with the no-holes feature
enabled, since fsck would report missing file extent items.

Robbie's patch is about making sure that buffered nodatacow writes
that happened before snapshotting (and for which success was returned
to user space) will not fail silently during the writeback triggered
by snapshot creation.
There's even a test case for this, btrfs/170, which I submitted.

So no, you can't simply revert Robbie's change; that would re-introduce
the bug it fixed.

Thanks.

>
> Thanks,
> Qu
>
> >
> > 2. Nocow writers simply acquire the readsize of the semaphore.
> >
> > the will_be_snapshoted thing is very convoluted relying on a percpu
> > counter/waitqueue  to exclude snapshots from pending nocow writers. OTOH
> > it depends on atomic_t and an implicit wait queue thanks to wait_var
> > infrastructure to exclude nocow writers from pending snapshots. Filipe
> > had some concerns regarding performance but if the patch you mentioned
> > fixed all issues I'm all in favor of removing the code!
> >
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> 1. It is incremented when we start to create a snapshot after
> >>> triggering writeback and before waiting for writeback to finish.
> >>>
> >>> 2. This new atomic is now what is used by writeback (running delalloc)
> >>> to decide whether we need to fallback to COW or not. Because we
> >>> incremented this new atomic after triggering writeback in the snapshot
> >>> creation ioctl, we ensure that all buffered writes that happened
> >>> before snapshot creation will succeed and not fallback to COW
> >>> (which would make them fail with ENOSPC).
> >>>
> >>> 3. The existing atomic, will_be_snapshotted, is kept because it is
> >>> used to force new buffered writes, that start after we started
> >>> snapshotting, to reserve data space even when NOCOW is possible.
> >>> This makes these writes fail early with ENOSPC when there's no
> >>> available space to allocate, preventing the unexpected behaviour
> >>> of writeback later failing with ENOSPC due to a fallback to COW mode.
> >>>
> >>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> >>> Signed-off-by: Robbie Ko <robbieko@synology.com>
> >>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >>> ---
> >>>  fs/btrfs/ctree.h   |  1 +
> >>>  fs/btrfs/disk-io.c |  1 +
> >>>  fs/btrfs/inode.c   | 26 +++++---------------------
> >>>  fs/btrfs/ioctl.c   | 16 ++++++++++++++++
> >>>  4 files changed, 23 insertions(+), 21 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>> index 118346a..663ce05 100644
> >>> --- a/fs/btrfs/ctree.h
> >>> +++ b/fs/btrfs/ctree.h
> >>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
> >>>     int send_in_progress;
> >>>     struct btrfs_subvolume_writers *subv_writers;
> >>>     atomic_t will_be_snapshotted;
> >>> +   atomic_t snapshot_force_cow;
> >>>
> >>>     /* For qgroup metadata reserved space */
> >>>     spinlock_t qgroup_meta_rsv_lock;
> >>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>> index 205092d..5573916 100644
> >>> --- a/fs/btrfs/disk-io.c
> >>> +++ b/fs/btrfs/disk-io.c
> >>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> >>>     atomic_set(&root->log_batch, 0);
> >>>     refcount_set(&root->refs, 1);
> >>>     atomic_set(&root->will_be_snapshotted, 0);
> >>> +   atomic_set(&root->snapshot_force_cow, 0);
> >>>     root->log_transid = 0;
> >>>     root->log_transid_committed = -1;
> >>>     root->last_log_commit = 0;
> >>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>> index eba61bc..263b852 100644
> >>> --- a/fs/btrfs/inode.c
> >>> +++ b/fs/btrfs/inode.c
> >>> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>     u64 disk_num_bytes;
> >>>     u64 ram_bytes;
> >>>     int extent_type;
> >>> -   int ret, err;
> >>> +   int ret;
> >>>     int type;
> >>>     int nocow;
> >>>     int check_prev = 1;
> >>> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>                      * if there are pending snapshots for this root,
> >>>                      * we fall into common COW way.
> >>>                      */
> >>> -                   if (!nolock) {
> >>> -                           err = btrfs_start_write_no_snapshotting(root);
> >>> -                           if (!err)
> >>> -                                   goto out_check;
> >>> -                   }
> >>> +                   if (!nolock &&
> >>> +                           unlikely(atomic_read(&root->snapshot_force_cow)))
> >>> +                           goto out_check;
> >>>                     /*
> >>>                      * force cow if csum exists in the range.
> >>>                      * this ensure that csum for a given extent are
> >>> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>                     ret = csum_exist_in_range(fs_info, disk_bytenr,
> >>>                                               num_bytes);
> >>>                     if (ret) {
> >>> -                           if (!nolock)
> >>> -                                   btrfs_end_write_no_snapshotting(root);
> >>> -
> >>>                             /*
> >>>                              * ret could be -EIO if the above fails to read
> >>>                              * metadata.
> >>> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>                             WARN_ON_ONCE(nolock);
> >>>                             goto out_check;
> >>>                     }
> >>> -                   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
> >>> -                           if (!nolock)
> >>> -                                   btrfs_end_write_no_snapshotting(root);
> >>> +                   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
> >>>                             goto out_check;
> >>> -                   }
> >>>                     nocow = 1;
> >>>             } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> >>>                     extent_end = found_key.offset +
> >>> @@ -1453,8 +1445,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>  out_check:
> >>>             if (extent_end <= start) {
> >>>                     path->slots[0]++;
> >>> -                   if (!nolock && nocow)
> >>> -                           btrfs_end_write_no_snapshotting(root);
> >>>                     if (nocow)
> >>>                             btrfs_dec_nocow_writers(fs_info, disk_bytenr);
> >>>                     goto next_slot;
> >>> @@ -1476,8 +1466,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>                                          end, page_started, nr_written, 1,
> >>>                                          NULL);
> >>>                     if (ret) {
> >>> -                           if (!nolock && nocow)
> >>> -                                   btrfs_end_write_no_snapshotting(root);
> >>>                             if (nocow)
> >>>                                     btrfs_dec_nocow_writers(fs_info,
> >>>                                                             disk_bytenr);
> >>> @@ -1497,8 +1485,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>                                       ram_bytes, BTRFS_COMPRESS_NONE,
> >>>                                       BTRFS_ORDERED_PREALLOC);
> >>>                     if (IS_ERR(em)) {
> >>> -                           if (!nolock && nocow)
> >>> -                                   btrfs_end_write_no_snapshotting(root);
> >>>                             if (nocow)
> >>>                                     btrfs_dec_nocow_writers(fs_info,
> >>>                                                             disk_bytenr);
> >>> @@ -1537,8 +1523,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>                                          EXTENT_CLEAR_DATA_RESV,
> >>>                                          PAGE_UNLOCK | PAGE_SET_PRIVATE2);
> >>>
> >>> -           if (!nolock && nocow)
> >>> -                   btrfs_end_write_no_snapshotting(root);
> >>>             cur_offset = extent_end;
> >>>
> >>>             /*
> >>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >>> index b077544..331b495 100644
> >>> --- a/fs/btrfs/ioctl.c
> >>> +++ b/fs/btrfs/ioctl.c
> >>> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>     struct btrfs_pending_snapshot *pending_snapshot;
> >>>     struct btrfs_trans_handle *trans;
> >>>     int ret;
> >>> +   bool snapshot_force_cow = false;
> >>>
> >>>     if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
> >>>             return -EINVAL;
> >>> @@ -777,6 +778,11 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>             goto free_pending;
> >>>     }
> >>>
> >>> +   /*
> >>> +    * Force new buffered writes to reserve space even when NOCOW is
> >>> +    * possible. This is to avoid later writeback (running dealloc)
> >>> +    * to fallback to COW mode and unexpectedly fail with ENOSPC.
> >>> +    */
> >>>     atomic_inc(&root->will_be_snapshotted);
> >>>     smp_mb__after_atomic();
> >>>     /* wait for no snapshot writes */
> >>> @@ -787,6 +793,14 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>     if (ret)
> >>>             goto dec_and_free;
> >>>
> >>> +   /*
> >>> +    * All previous writes have started writeback in NOCOW mode, so now
> >>> +    * we force future writes to fallback to COW mode during snapshot
> >>> +    * creation.
> >>> +    */
> >>> +   atomic_inc(&root->snapshot_force_cow);
> >>> +   snapshot_force_cow = true;
> >>> +
> >>>     btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
> >>>
> >>>     btrfs_init_block_rsv(&pending_snapshot->block_rsv,
> >>> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>  fail:
> >>>     btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
> >>>  dec_and_free:
> >>> +   if (snapshot_force_cow)
> >>> +           atomic_dec(&root->snapshot_force_cow);
> >>>     if (atomic_dec_and_test(&root->will_be_snapshotted))
> >>>             wake_up_var(&root->will_be_snapshotted);
> >>>  free_pending:
> >>>
> >>
Qu Wenruo April 25, 2019, 11:20 p.m. UTC | #6
[snip]
>>>
>>> 1. Serializing snapshots (acquiring write-side of the semaphore)
>>
>> With the new btrfs_start_delalloc_snapshot() in
>> btrfs_commit_transaction(), do we really need to do anything special now?
>>
>> Snapshot creation will only happen in btrfs_commit_transaction(), as
>> long as all dirty inodes/pages are written before pending snapshots,
>> we're completely fine.
>>
>> Or did I miss something?
>
> You missed the whole point of both patches.
>
> The one I authored recently, is about ensuring we see ordered update
> of an inode's disk_i_size / i_size.
> That flushes delalloc a second time, during the transaction commit, to
> ensure an ordered update of disk_i_size in case direct IO writes
> happened during snapshotting.
> btrfs/078 could detect this when not running with the no-holes feature
> enabled, since fsck will report missing file extent items.

But that flush at commit time ensures all dirty pages of the snapshot
source reach disk, killing the following old case (a reproduction sketch
follows):
- preallocate
- buffered write
  at this point, the write will be nodatacow
- snapshot creation
  now the write needs to be COWed
- dirty page writeback happens

With your flush, snapshot creation will write out all of the source's
dirty pages before the new snapshot is created.
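
A small user-space reproduction of that old case, as a sketch under
assumptions: the paths are illustrative, the snapshot is taken via the
btrfs CLI through system(), and whether the buffered write is actually
nodatacow also depends on the filesystem being low on space or the file
having the NOCOW attribute set:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/subvol/file";	/* assumed subvolume path */
	char buf[4096];
	int fd;

	memset(buf, 0xaa, sizeof(buf));

	fd = open(path, O_CREAT | O_RDWR, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* 1) preallocate an extent */
	if (posix_fallocate(fd, 0, 1 << 20) != 0) {
		fprintf(stderr, "fallocate failed\n");
		return 1;
	}

	/* 2) buffered write into the preallocated range (may be nodatacow) */
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
		perror("pwrite");
		return 1;
	}

	/* 3) snapshot the subvolume while the page is still dirty */
	if (system("btrfs subvolume snapshot /mnt/subvol /mnt/snap") != 0)
		fprintf(stderr, "snapshot failed\n");

	/*
	 * 4) writeback of the dirty page now has to COW; before the fixes
	 *    discussed in this thread it could fail with ENOSPC when space
	 *    is tight.
	 */
	if (fsync(fd) < 0)
		perror("fsync");	/* would report the ENOSPC */

	close(fd);
	return 0;
}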

>
> Robbie's patch is about making sure that buffered nodatacow writes
> that happened before snapshotting (and success was returned to user
> space), will not fail silently during the writeback triggered by
> snapshot creation.
> There's even a test case for this, btrfs/170, which I submitted.
>
> So no, you can't simply revert Robbie's change, that will re-introduce
> the bug it fixed.

With your patch, I see nothing that needs to be handled specially in a
snapshot source. Thus we don't need Robbie's patch after your more
comprehensive fix.

Or did I miss something again?

Thanks,
Qu

>
> Thanks.
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> 2. Nocow writers simply acquire the readsize of the semaphore.
>>>
>>> the will_be_snapshoted thing is very convoluted relying on a percpu
>>> counter/waitqueue  to exclude snapshots from pending nocow writers. OTOH
>>> it depends on atomic_t and an implicit wait queue thanks to wait_var
>>> infrastructure to exclude nocow writers from pending snapshots. Filipe
>>> had some concerns regarding performance but if the patch you mentioned
>>> fixed all issues I'm all in favor of removing the code!
>>>
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> 1. It is incremented when we start to create a snapshot after
>>>>> triggering writeback and before waiting for writeback to finish.
>>>>>
>>>>> 2. This new atomic is now what is used by writeback (running delalloc)
>>>>> to decide whether we need to fallback to COW or not. Because we
>>>>> incremented this new atomic after triggering writeback in the snapshot
>>>>> creation ioctl, we ensure that all buffered writes that happened
>>>>> before snapshot creation will succeed and not fallback to COW
>>>>> (which would make them fail with ENOSPC).
>>>>>
>>>>> 3. The existing atomic, will_be_snapshotted, is kept because it is
>>>>> used to force new buffered writes, that start after we started
>>>>> snapshotting, to reserve data space even when NOCOW is possible.
>>>>> This makes these writes fail early with ENOSPC when there's no
>>>>> available space to allocate, preventing the unexpected behaviour
>>>>> of writeback later failing with ENOSPC due to a fallback to COW mode.
>>>>>
>>>>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
>>>>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>>>>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>>>>> ---
>>>>>  fs/btrfs/ctree.h   |  1 +
>>>>>  fs/btrfs/disk-io.c |  1 +
>>>>>  fs/btrfs/inode.c   | 26 +++++---------------------
>>>>>  fs/btrfs/ioctl.c   | 16 ++++++++++++++++
>>>>>  4 files changed, 23 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>>> index 118346a..663ce05 100644
>>>>> --- a/fs/btrfs/ctree.h
>>>>> +++ b/fs/btrfs/ctree.h
>>>>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
>>>>>     int send_in_progress;
>>>>>     struct btrfs_subvolume_writers *subv_writers;
>>>>>     atomic_t will_be_snapshotted;
>>>>> +   atomic_t snapshot_force_cow;
>>>>>
>>>>>     /* For qgroup metadata reserved space */
>>>>>     spinlock_t qgroup_meta_rsv_lock;
>>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>>> index 205092d..5573916 100644
>>>>> --- a/fs/btrfs/disk-io.c
>>>>> +++ b/fs/btrfs/disk-io.c
>>>>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>>>>>     atomic_set(&root->log_batch, 0);
>>>>>     refcount_set(&root->refs, 1);
>>>>>     atomic_set(&root->will_be_snapshotted, 0);
>>>>> +   atomic_set(&root->snapshot_force_cow, 0);
>>>>>     root->log_transid = 0;
>>>>>     root->log_transid_committed = -1;
>>>>>     root->last_log_commit = 0;
>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>> index eba61bc..263b852 100644
>>>>> --- a/fs/btrfs/inode.c
>>>>> +++ b/fs/btrfs/inode.c
>>>>> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>     u64 disk_num_bytes;
>>>>>     u64 ram_bytes;
>>>>>     int extent_type;
>>>>> -   int ret, err;
>>>>> +   int ret;
>>>>>     int type;
>>>>>     int nocow;
>>>>>     int check_prev = 1;
>>>>> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>                      * if there are pending snapshots for this root,
>>>>>                      * we fall into common COW way.
>>>>>                      */
>>>>> -                   if (!nolock) {
>>>>> -                           err = btrfs_start_write_no_snapshotting(root);
>>>>> -                           if (!err)
>>>>> -                                   goto out_check;
>>>>> -                   }
>>>>> +                   if (!nolock &&
>>>>> +                           unlikely(atomic_read(&root->snapshot_force_cow)))
>>>>> +                           goto out_check;
>>>>>                     /*
>>>>>                      * force cow if csum exists in the range.
>>>>>                      * this ensure that csum for a given extent are
>>>>> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>                     ret = csum_exist_in_range(fs_info, disk_bytenr,
>>>>>                                               num_bytes);
>>>>>                     if (ret) {
>>>>> -                           if (!nolock)
>>>>> -                                   btrfs_end_write_no_snapshotting(root);
>>>>> -
>>>>>                             /*
>>>>>                              * ret could be -EIO if the above fails to read
>>>>>                              * metadata.
>>>>> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>                             WARN_ON_ONCE(nolock);
>>>>>                             goto out_check;
>>>>>                     }
>>>>> -                   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
>>>>> -                           if (!nolock)
>>>>> -                                   btrfs_end_write_no_snapshotting(root);
>>>>> +                   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
>>>>>                             goto out_check;
>>>>> -                   }
>>>>>                     nocow = 1;
>>>>>             } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
>>>>>                     extent_end = found_key.offset +
>>>>> @@ -1453,8 +1445,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>  out_check:
>>>>>             if (extent_end <= start) {
>>>>>                     path->slots[0]++;
>>>>> -                   if (!nolock && nocow)
>>>>> -                           btrfs_end_write_no_snapshotting(root);
>>>>>                     if (nocow)
>>>>>                             btrfs_dec_nocow_writers(fs_info, disk_bytenr);
>>>>>                     goto next_slot;
>>>>> @@ -1476,8 +1466,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>                                          end, page_started, nr_written, 1,
>>>>>                                          NULL);
>>>>>                     if (ret) {
>>>>> -                           if (!nolock && nocow)
>>>>> -                                   btrfs_end_write_no_snapshotting(root);
>>>>>                             if (nocow)
>>>>>                                     btrfs_dec_nocow_writers(fs_info,
>>>>>                                                             disk_bytenr);
>>>>> @@ -1497,8 +1485,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>                                       ram_bytes, BTRFS_COMPRESS_NONE,
>>>>>                                       BTRFS_ORDERED_PREALLOC);
>>>>>                     if (IS_ERR(em)) {
>>>>> -                           if (!nolock && nocow)
>>>>> -                                   btrfs_end_write_no_snapshotting(root);
>>>>>                             if (nocow)
>>>>>                                     btrfs_dec_nocow_writers(fs_info,
>>>>>                                                             disk_bytenr);
>>>>> @@ -1537,8 +1523,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>>>>>                                          EXTENT_CLEAR_DATA_RESV,
>>>>>                                          PAGE_UNLOCK | PAGE_SET_PRIVATE2);
>>>>>
>>>>> -           if (!nolock && nocow)
>>>>> -                   btrfs_end_write_no_snapshotting(root);
>>>>>             cur_offset = extent_end;
>>>>>
>>>>>             /*
>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>>>>> index b077544..331b495 100644
>>>>> --- a/fs/btrfs/ioctl.c
>>>>> +++ b/fs/btrfs/ioctl.c
>>>>> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>>>     struct btrfs_pending_snapshot *pending_snapshot;
>>>>>     struct btrfs_trans_handle *trans;
>>>>>     int ret;
>>>>> +   bool snapshot_force_cow = false;
>>>>>
>>>>>     if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>>>>>             return -EINVAL;
>>>>> @@ -777,6 +778,11 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>>>             goto free_pending;
>>>>>     }
>>>>>
>>>>> +   /*
>>>>> +    * Force new buffered writes to reserve space even when NOCOW is
>>>>> +    * possible. This is to avoid later writeback (running dealloc)
>>>>> +    * to fallback to COW mode and unexpectedly fail with ENOSPC.
>>>>> +    */
>>>>>     atomic_inc(&root->will_be_snapshotted);
>>>>>     smp_mb__after_atomic();
>>>>>     /* wait for no snapshot writes */
>>>>> @@ -787,6 +793,14 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>>>     if (ret)
>>>>>             goto dec_and_free;
>>>>>
>>>>> +   /*
>>>>> +    * All previous writes have started writeback in NOCOW mode, so now
>>>>> +    * we force future writes to fallback to COW mode during snapshot
>>>>> +    * creation.
>>>>> +    */
>>>>> +   atomic_inc(&root->snapshot_force_cow);
>>>>> +   snapshot_force_cow = true;
>>>>> +
>>>>>     btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
>>>>>
>>>>>     btrfs_init_block_rsv(&pending_snapshot->block_rsv,
>>>>> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>>>>  fail:
>>>>>     btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
>>>>>  dec_and_free:
>>>>> +   if (snapshot_force_cow)
>>>>> +           atomic_dec(&root->snapshot_force_cow);
>>>>>     if (atomic_dec_and_test(&root->will_be_snapshotted))
>>>>>             wake_up_var(&root->will_be_snapshotted);
>>>>>  free_pending:
>>>>>
>>>>
>
>
>
robbieko April 26, 2019, 2:47 a.m. UTC | #7
Qu Wenruo wrote on 2019-04-26 07:20:
> [snip]
>>>> 
>>>> 1. Serializing snapshots (acquiring write-side of the semaphore)
>>> 
>>> With the new btrfs_start_delalloc_snapshot() in
>>> btrfs_commit_transaction(), do we really need to do anything special 
>>> now?
>>> 
>>> Snapshot creation will only happen in btrfs_commit_transaction(), as
>>> long as all dirty inodes/pages are written before pending snapshots,
>>> we're completely fine.
>>> 
>>> Or did I miss something?
>> 
>> You missed the whole point of both patches.
>> 
>> The one I authored recently, is about ensuring we see ordered update
>> of an inode's disk_i_size / i_size.
>> That flushes delalloc a second time, during the transaction commit, to
>> ensure an ordered update of disk_i_size in case direct IO writes
>> happened during snapshotting.
>> btrfs/078 could detect this when not running with the no-holes feature
>> enabled, since fsck will report missing file extent items.
> 
> But that flush at commit time ensures all pages of the source snapshot
> to disk, killing the following old case:
> - preallocate
> - buffered write
>   at this timing, the write will be nodatacow
> - snapshot creation
>   now the write needs to be cowed
> - dirty page writeback happens
> 
> With your flush, snapshot creation will flush all its dirty pages 
> before
> the new snapshot is created.
> 
>> 
>> Robbie's patch is about making sure that buffered nodatacow writes
>> that happened before snapshotting (and success was returned to user
>> space), will not fail silently during the writeback triggered by
>> snapshot creation.
>> There's even a test case for this, btrfs/170, which I submitted.
>> 
>> So no, you can't simply revert Robbie's change, that will re-introduce
>> the bug it fixed.
> 
> With your patch, I see nothing need to be handled specially in a
> snapshot source. Thus we don't need Robbie patch after your more
> comprehensive fix.
> 
> Or did I miss something again?

If you revert my patch directly, then when the volume is full a buffered
write will check that NOCOW is possible and skip the space reservation,
but the data only sits in the page cache and has not been written back
to disk yet.
If a snapshot is then triggered, will_be_snapshotted gets set and the
delalloc flush starts, so run_delalloc_nocow can no longer take the
NOCOW path and must write the data with COW, but there is no space left
to allocate, so the data is lost.
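
To make the failure mode concrete, here is a minimal sketch of the two
decisions involved (plain C with hypothetical helpers, not the actual
btrfs functions):

/*
 * Sketch only: the write-time decision and the writeback-time decision
 * can disagree once a snapshot starts, which is the silent ENOSPC above.
 */
#include <stdbool.h>
#include <errno.h>

static bool free_space_available;	/* volume full -> false */
static bool snapshot_pending;		/* set once snapshot creation starts */

/* Buffered write: decide whether space must be reserved up front. */
static int reserve_for_buffered_write(bool range_can_nocow)
{
	if (free_space_available)
		return 0;		/* normal COW reservation */
	if (range_can_nocow)
		return 0;		/* no reservation, rely on NOCOW */
	return -ENOSPC;			/* error reported to write(2) early */
}

/* Writeback (delalloc): decide how the dirty pages are written. */
static int writeback_dirty_range(bool range_can_nocow)
{
	if (range_can_nocow && !snapshot_pending)
		return 0;		/* overwrite the extent in place */
	if (!free_space_available)
		return -ENOSPC;		/* too late: write(2) already returned success */
	return 0;			/* COW into newly allocated space */
}

The write succeeds through the first function, snapshot creation then
flips snapshot_pending, and the second function can no longer honour the
NOCOW assumption, which is exactly the case described above.
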
Thanks.
Robbieko

> Thanks,
> Qu
> 
>> 
>> Thanks.
>> 
>>> 
>>> Thanks,
>>> Qu
>>> 
>>>> 
>>>> 2. Nocow writers simply acquire the readsize of the semaphore.
>>>> 
>>>> the will_be_snapshoted thing is very convoluted relying on a percpu
>>>> counter/waitqueue  to exclude snapshots from pending nocow writers. 
>>>> OTOH
>>>> it depends on atomic_t and an implicit wait queue thanks to wait_var
>>>> infrastructure to exclude nocow writers from pending snapshots. 
>>>> Filipe
>>>> had some concerns regarding performance but if the patch you 
>>>> mentioned
>>>> fixed all issues I'm all in favor of removing the code!
>>>> 
>>>> 
Qu Wenruo April 26, 2019, 2:53 a.m. UTC | #8
On 2019/4/26 10:47 AM, robbieko wrote:
> Qu Wenruo wrote on 2019-04-26 07:20:
>> [snip]
>>>>>
>>>>> 1. Serializing snapshots (acquiring write-side of the semaphore)
>>>>
>>>> With the new btrfs_start_delalloc_snapshot() in
>>>> btrfs_commit_transaction(), do we really need to do anything special
>>>> now?
>>>>
>>>> Snapshot creation will only happen in btrfs_commit_transaction(), as
>>>> long as all dirty inodes/pages are written before pending snapshots,
>>>> we're completely fine.
>>>>
>>>> Or did I miss something?
>>>
>>> You missed the whole point of both patches.
>>>
>>> The one I authored recently, is about ensuring we see ordered update
>>> of an inode's disk_i_size / i_size.
>>> That flushes delalloc a second time, during the transaction commit, to
>>> ensure an ordered update of disk_i_size in case direct IO writes
>>> happened during snapshotting.
>>> btrfs/078 could detect this when not running with the no-holes feature
>>> enabled, since fsck will report missing file extent items.
>>
>> But that flush at commit time ensures all pages of the source snapshot
>> to disk, killing the following old case:
>> - preallocate
>> - buffered write
>>   at this timing, the write will be nodatacow
>> - snapshot creation
>>   now the write needs to be cowed
>> - dirty page writeback happens
>>
>> With your flush, snapshot creation will flush all its dirty pages before
>> the new snapshot is created.
>>
>>>
>>> Robbie's patch is about making sure that buffered nodatacow writes
>>> that happened before snapshotting (and success was returned to user
>>> space), will not fail silently during the writeback triggered by
>>> snapshot creation.
>>> There's even a test case for this, btrfs/170, which I submitted.
>>>
>>> So no, you can't simply revert Robbie's change, that will re-introduce
>>> the bug it fixed.
>>
>> With your patch, I see nothing need to be handled specially in a
>> snapshot source. Thus we don't need Robbie patch after your more
>> comprehensive fix.
>>
>> Or did I miss something again?
> 
> If you revert my patch directly, when volume is full, when writing data,
> it will check and try to use nocow to write, but if the data is still
> in the page cache, it has not been written back to disk.
> If snapshot is triggered,

But now, before creating a new snapshot, we flush all data.

This means that at this time the data will be written back *NOCOW*, just
as expected.

> will_be_snapshotted will be set, and then flush

will_be_snapshotted will also be gone.

Thanks,
Qu

robbieko April 26, 2019, 3:32 a.m. UTC | #9
Qu Wenruo wrote on 2019-04-26 10:53:
> On 2019/4/26 10:47 AM, robbieko wrote:
>> Qu Wenruo wrote on 2019-04-26 07:20:
>>> [snip]
>>>>>> 
>>>>>> 1. Serializing snapshots (acquiring write-side of the semaphore)
>>>>> 
>>>>> With the new btrfs_start_delalloc_snapshot() in
>>>>> btrfs_commit_transaction(), do we really need to do anything 
>>>>> special
>>>>> now?
>>>>> 
>>>>> Snapshot creation will only happen in btrfs_commit_transaction(), 
>>>>> as
>>>>> long as all dirty inodes/pages are written before pending 
>>>>> snapshots,
>>>>> we're completely fine.
>>>>> 
>>>>> Or did I miss something?
>>>> 
>>>> You missed the whole point of both patches.
>>>> 
>>>> The one I authored recently, is about ensuring we see ordered update
>>>> of an inode's disk_i_size / i_size.
>>>> That flushes delalloc a second time, during the transaction commit, 
>>>> to
>>>> ensure an ordered update of disk_i_size in case direct IO writes
>>>> happened during snapshotting.
>>>> btrfs/078 could detect this when not running with the no-holes 
>>>> feature
>>>> enabled, since fsck will report missing file extent items.
>>> 
>>> But that flush at commit time ensures all pages of the source 
>>> snapshot
>>> to disk, killing the following old case:
>>> - preallocate
>>> - buffered write
>>>   at this timing, the write will be nodatacow
>>> - snapshot creation
>>>   now the write needs to be cowed
>>> - dirty page writeback happens
>>> 
>>> With your flush, snapshot creation will flush all its dirty pages 
>>> before
>>> the new snapshot is created.
>>> 
>>>> 
>>>> Robbie's patch is about making sure that buffered nodatacow writes
>>>> that happened before snapshotting (and success was returned to user
>>>> space), will not fail silently during the writeback triggered by
>>>> snapshot creation.
>>>> There's even a test case for this, btrfs/170, which I submitted.
>>>> 
>>>> So no, you can't simply revert Robbie's change, that will 
>>>> re-introduce
>>>> the bug it fixed.
>>> 
>>> With your patch, I see nothing need to be handled specially in a
>>> snapshot source. Thus we don't need Robbie patch after your more
>>> comprehensive fix.
>>> 
>>> Or did I miss something again?
>> 
>> If you revert my patch directly, when volume is full, when writing 
>> data,
>> it will check and try to use nocow to write, but if the data is still
>> in the page cache, it has not been written back to disk.
>> If snapshot is triggered,
> 
> But now, before creating new snapshot, we will flush all data.
> 
> This means at this time, the data will be written back *NOCOW*, just as
> expected.
> 
>> will_be_snapshotted will be set, and then flush
> 
> will_be_snapshotted will also be gone.
> 
> Thanks,
> Qu
> 

Why would will_be_snapshotted disappear?

If will_be_snapshotted is removed, then when the volume is full,
how do we ensure that a snapshot and concurrent writes can both
proceed without data loss?

How do you avoid losing data written between btrfs_start_delalloc_snapshot
and create_pending_snapshot during that window?

E.g. chattr +C /mnt
fallocate -l 4096 /mnt/small_file
fallocate -l $((64*1024*1024)) /mnt/large_file

First process:
while true; do
dd if=/dev/urandom of=/mnt/small_file bs=4k count=1 conv=notrunc
done

Second process:
while true; do
dd if=/dev/urandom of=/mnt/large_file bs=64k count=1024 conv=notrunc
done

Third process: create snapshot.
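
For reference, the ordering this patch establishes around snapshot
creation looks roughly like the sketch below (simplified C, hypothetical
helper names; the real steps live in create_snapshot() in ioctl.c). The
window asked about above is between steps 2 and 5, and will_be_snapshotted
is what makes writes landing there reserve space:

/* Rough timeline of the flag handling, not the actual kernel code. */
struct subvol_state {
	int will_be_snapshotted;	/* new buffered writes must reserve space */
	int snapshot_force_cow;		/* running delalloc must fall back to COW */
};

/* Hypothetical stand-ins for the real helpers named in the comments. */
static void flush_delalloc(struct subvol_state *r) { (void)r; }		/* ~ btrfs_start_delalloc_snapshot() */
static void wait_ordered_extents(struct subvol_state *r) { (void)r; }	/* ~ btrfs_wait_ordered_extents() */
static void commit_pending_snapshot(struct subvol_state *r) { (void)r; }	/* ~ transaction commit / create_pending_snapshot() */

static void create_snapshot_sketch(struct subvol_state *root)
{
	root->will_be_snapshotted++;	/* 1. new writes now reserve (COW) space */
	flush_delalloc(root);		/* 2. older NOCOW writes reach disk as NOCOW */
	root->snapshot_force_cow++;	/* 3. anything dirtied from here on is COWed */
	wait_ordered_extents(root);	/* 4. wait for the flushed IO to complete */
	commit_pending_snapshot(root);	/* 5. the snapshot itself is created here */
	root->snapshot_force_cow--;
	root->will_be_snapshotted--;
}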



Qu Wenruo April 26, 2019, 6:32 a.m. UTC | #10
On 2019/4/26 11:32 AM, robbieko wrote:
[snipped]
>>
>> But now, before creating new snapshot, we will flush all data.
>>
>> This means at this time, the data will be written back *NOCOW*, just as
>> expected.
>>
>>> will_be_snapshotted will be set, and then flush
>>
>> will_be_snapshotted will also be gone.
>>
>> Thanks,
>> Qu
>>
>
> Why will will_be_snapshotted disappear?
>
> If will_be_snapshotted is removed, when volume is full,
> how do we ensure that snapshot and write are both
> simultaneous data not loss ?
>
> How do you avoid the write data between btrfs_start_delalloc_snapshot
> and create_pending_snapshot during this time ?

Oh, I forgot that buffered writes don't rely on a transaction, so there
is a race window between those two functions.


So will_be_snapshotted is there to ensure that buffered writes in that
race window always reserve space for COW.

So your patch changes the behavior from:
                   /------------------------ inc(will_be_snapshotted)
                   |
-------------------|-------------------------------------
reserve: NOCOW     | reserve: COW
delalloc: NOCOW    | delalloc: COW

That reserve-NOCOW -> delalloc-COW transition is what causes the ENOSPC.

To the new behavior:

                   /------------------------ inc(will_be_snapshotted)
                   |                     /-- inc(snapshot_force_cow)
-------------------|---------------------|---------------
reserve: NOCOW     | reserve: COW        | reserve: COW
delalloc: NOCOW    | delalloc: NOCOW     | delalloc: COW

And now reserve->delalloc is either NOCOW->NOCOW or COW->COW, avoiding
the possible ENOSPC.
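
Put as a sketch (plain C with made-up helper names, not the kernel code),
the two flags make the reservation decision and the writeback decision
agree for any given write:

#include <stdbool.h>

struct root_flags {
	bool will_be_snapshotted;	/* raised before the pre-snapshot flush */
	bool snapshot_force_cow;	/* raised after the pre-snapshot flush */
};

/* Buffered write time: reserve space as if this range will be COWed? */
static bool must_reserve_cow_space(const struct root_flags *r, bool can_nocow)
{
	return !can_nocow || r->will_be_snapshotted;
}

/* Delalloc/writeback time: actually write this range with COW? */
static bool must_writeback_cow(const struct root_flags *r, bool can_nocow)
{
	return !can_nocow || r->snapshot_force_cow;
}

A range that skipped the reservation is flushed before snapshot_force_cow
is raised, and a range that sees snapshot_force_cow at writeback time was
written after will_be_snapshotted was raised, so it already holds a COW
reservation. The pairs are NOCOW->NOCOW or COW->COW, never NOCOW->COW.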

It makes sense now (although it is way more complex).

Although I still have some questions about all this
will_be_snapshotted/snapshot_force_cow code.

1) Does the btrfs_start_delalloc_snapshot() call in create_snapshot()
   still make sense?
   We now have btrfs_start_delalloc_snapshot() in
   btrfs_start_delalloc_flush().
   Can we remove that call in create_snapshot() and:

2) Can't we make the race window smaller? (If we can remove the call
   mentioned in the above question.)
   E.g. move the inc(will_be_snapshotted) before
   btrfs_start_delalloc_snapshot() in btrfs_start_delalloc_flush(),
   and move inc(snapshot_force_cow) after that
   btrfs_start_delalloc_snapshot() call?

   As you also mentioned, the real race window is between
   btrfs_start_delalloc_snapshot() and create_pending_snapshot().

Thanks,
Qu

>
> E.g. chattr +C /mnt
> Fallocate -l 4096 /mnt/small_file
> Fallocate -l $((64*1024*1024)) /mnt/large_file
>
> First process:
> While true; do
> dd if=/dev/urandom of=/mnt/small_file bs=4k count=1 conv=notrunc
> Done
>
> Second process:
> While true; do
> dd if=/dev/urandom of=/mnt/large_file bs=64k count=1024 conv=notrunc
> Done
>
> Third process: create snapshot.
Qu Wenruo April 26, 2019, 6:59 a.m. UTC | #11
On 2019/4/26 2:32 PM, Qu Wenruo wrote:
>
>
> On 2019/4/26 11:32 AM, robbieko wrote:
> [snipped]
>>>
>>> But now, before creating new snapshot, we will flush all data.
>>>
>>> This means at this time, the data will be written back *NOCOW*, just as
>>> expected.
>>>
>>>> will_be_snapshotted will be set, and then flush
>>>
>>> will_be_snapshotted will also be gone.
>>>
>>> Thanks,
>>> Qu
>>>
>>
>> Why will will_be_snapshotted disappear?
>>
>> If will_be_snapshotted is removed, when volume is full,
>> how do we ensure that snapshot and write are both
>> simultaneous data not loss ?
>>
>> How do you avoid the write data between btrfs_start_delalloc_snapshot
>> and create_pending_snapshot during this time ?
>
> Oh, I forgot that buffered write doesn't rely on transaction, thus it
> has the race window between above 2 functions.
>
>
> So will_be_snapshotted is to ensure buffered write at above race window
> always reserve space for COW.
>
> So your patch changes the behavior from:
>                    /------------------------ inc(will_be_snapshotted)
>                    |
> -------------------|-------------------------------------
> reserve: NOCOW     | reserve: COW
> delalloc: NOCOW    | delalloc: COW
>
> That reserve NOCOW->delalloc COW is causing ENOSPC.
>
> To new the behavior:
>
>                    /------------------------ inc(will_be_snapshotted)
>                    |                     /-- inc(snapshot_force_cow)
> -------------------|---------------------|---------------
> reserve: NOCOW     | reserve: COW        | reserve: COW
> delalloc: NOCOW    | delalloc: NOCOW     | delalloc: COW
>
> And now reserve->delalloc is either NOCOW->NOCOW or COW->COW, avoiding
> the possible ENOSPC.
>
> Then it makes sense now. (although way more complex)
>
> Although I still have some questions about all these
> will_be_snapshotted/snapshot_force_cow code.
>
> 1) Does that btrfs_start_delalloc_snapshot() in create_snapshot() call
>    still make sense?
>    We now have btrfs_start_delalloc_snapshot() in
>    btrfs_start_delalloc_flush().
>    Can we remove that call in create_snapshot() and:
>
> 2) Can't we make the race window smaller? (If we can remove the call in
>    mentioned in above quiestion)
>    E.g. move the inc(will_be_snapshotted) before
>    btrfs_start_delalloc_snapshot() in btrfs_start_delalloc_flush(),
>    and move inc(snapshot_force_cow) after that
>    btrfs_start_delalloc_snapshot() call?
>
>    As you also mentioned, the real race window is between
>    btrfs_start_delalloc_snapshot() and create_pending_snapshot().

And another (unrelated) question:

3) Why do we use such a complex way to prevent the race? Why not just do
   a per-subvolume freeze like various volume managers do?

   The current way keeps buffered writes going, but is it worth it?

   If we freeze/block writes while creating a snapshot, it just falls
   back to traditional volume manager behavior, shouldn't cause much
   trouble, and we can get rid of all these snapshot-specific fixes.
Thanks,
Qu

>
> Thanks,
> Qu
>
>>
>> E.g. chattr +C /mnt
>> Fallocate -l 4096 /mnt/small_file
>> Fallocate -l $((64*1024*1024)) /mnt/large_file
>>
>> First process:
>> While true; do
>> dd if=/dev/urandom of=/mnt/small_file bs=4k count=1 conv=notrunc
>> Done
>>
>> Second process:
>> While true; do
>> dd if=/dev/urandom of=/mnt/large_file bs=64k count=1024 conv=notrunc
>> Done
>>
>> Third process: create snapshot.
>
Filipe Manana April 26, 2019, 9:03 a.m. UTC | #12
On Fri, Apr 26, 2019 at 12:21 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> [snip]
> >>>
> >>> 1. Serializing snapshots (acquiring write-side of the semaphore)
> >>
> >> With the new btrfs_start_delalloc_snapshot() in
> >> btrfs_commit_transaction(), do we really need to do anything special now?
> >>
> >> Snapshot creation will only happen in btrfs_commit_transaction(), as
> >> long as all dirty inodes/pages are written before pending snapshots,
> >> we're completely fine.
> >>
> >> Or did I miss something?
> >
> > You missed the whole point of both patches.
> >
> > The one I authored recently, is about ensuring we see ordered update
> > of an inode's disk_i_size / i_size.
> > That flushes delalloc a second time, during the transaction commit, to
> > ensure an ordered update of disk_i_size in case direct IO writes
> > happened during snapshotting.
> > btrfs/078 could detect this when not running with the no-holes feature
> > enabled, since fsck will report missing file extent items.
>
> But that flush at commit time ensures all pages of the source snapshot
> to disk, killing the following old case:
> - preallocate
> - buffered write
>   at this timing, the write will be nodatacow
> - snapshot creation
>   now the write needs to be cowed
> - dirty page writeback happens
>
> With your flush, snapshot creation will flush all its dirty pages before
> the new snapshot is created.
>
> >
> > Robbie's patch is about making sure that buffered nodatacow writes
> > that happened before snapshotting (and success was returned to user
> > space), will not fail silently during the writeback triggered by
> > snapshot creation.
> > There's even a test case for this, btrfs/170, which I submitted.
> >
> > So no, you can't simply revert Robbie's change, that will re-introduce
> > the bug it fixed.
>
> With your patch, I see nothing need to be handled specially in a
> snapshot source. Thus we don't need Robbie patch after your more
> comprehensive fix.

Qu,

Do you think I would have written my previous reply without confirming
that the test passes after reverting Robbie's patch?

I did test it, not just to be sure but also to avoid wasting my time
and the time of anyone else following this thread, especially those
who are participating in it.

Really, you have no excuse for being so stubborn, there is a
deterministic test for this issue.

Thanks.

> Or did I miss something again?
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> 2. Nocow writers simply acquire the readsize of the semaphore.
> >>>
> >>> the will_be_snapshoted thing is very convoluted relying on a percpu
> >>> counter/waitqueue  to exclude snapshots from pending nocow writers. OTOH
> >>> it depends on atomic_t and an implicit wait queue thanks to wait_var
> >>> infrastructure to exclude nocow writers from pending snapshots. Filipe
> >>> had some concerns regarding performance but if the patch you mentioned
> >>> fixed all issues I'm all in favor of removing the code!
> >>>
> >>>
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> 1. It is incremented when we start to create a snapshot after
> >>>>> triggering writeback and before waiting for writeback to finish.
> >>>>>
> >>>>> 2. This new atomic is now what is used by writeback (running delalloc)
> >>>>> to decide whether we need to fallback to COW or not. Because we
> >>>>> incremented this new atomic after triggering writeback in the snapshot
> >>>>> creation ioctl, we ensure that all buffered writes that happened
> >>>>> before snapshot creation will succeed and not fallback to COW
> >>>>> (which would make them fail with ENOSPC).
> >>>>>
> >>>>> 3. The existing atomic, will_be_snapshotted, is kept because it is
> >>>>> used to force new buffered writes, that start after we started
> >>>>> snapshotting, to reserve data space even when NOCOW is possible.
> >>>>> This makes these writes fail early with ENOSPC when there's no
> >>>>> available space to allocate, preventing the unexpected behaviour
> >>>>> of writeback later failing with ENOSPC due to a fallback to COW mode.
> >>>>>
> >>>>> Fixes: e9894fd3e3b3 ("Btrfs: fix snapshot vs nocow writting")
> >>>>> Signed-off-by: Robbie Ko <robbieko@synology.com>
> >>>>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >>>>> ---
> >>>>>  fs/btrfs/ctree.h   |  1 +
> >>>>>  fs/btrfs/disk-io.c |  1 +
> >>>>>  fs/btrfs/inode.c   | 26 +++++---------------------
> >>>>>  fs/btrfs/ioctl.c   | 16 ++++++++++++++++
> >>>>>  4 files changed, 23 insertions(+), 21 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>>>> index 118346a..663ce05 100644
> >>>>> --- a/fs/btrfs/ctree.h
> >>>>> +++ b/fs/btrfs/ctree.h
> >>>>> @@ -1277,6 +1277,7 @@ struct btrfs_root {
> >>>>>     int send_in_progress;
> >>>>>     struct btrfs_subvolume_writers *subv_writers;
> >>>>>     atomic_t will_be_snapshotted;
> >>>>> +   atomic_t snapshot_force_cow;
> >>>>>
> >>>>>     /* For qgroup metadata reserved space */
> >>>>>     spinlock_t qgroup_meta_rsv_lock;
> >>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>>>> index 205092d..5573916 100644
> >>>>> --- a/fs/btrfs/disk-io.c
> >>>>> +++ b/fs/btrfs/disk-io.c
> >>>>> @@ -1216,6 +1216,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
> >>>>>     atomic_set(&root->log_batch, 0);
> >>>>>     refcount_set(&root->refs, 1);
> >>>>>     atomic_set(&root->will_be_snapshotted, 0);
> >>>>> +   atomic_set(&root->snapshot_force_cow, 0);
> >>>>>     root->log_transid = 0;
> >>>>>     root->log_transid_committed = -1;
> >>>>>     root->last_log_commit = 0;
> >>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>>>> index eba61bc..263b852 100644
> >>>>> --- a/fs/btrfs/inode.c
> >>>>> +++ b/fs/btrfs/inode.c
> >>>>> @@ -1275,7 +1275,7 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>     u64 disk_num_bytes;
> >>>>>     u64 ram_bytes;
> >>>>>     int extent_type;
> >>>>> -   int ret, err;
> >>>>> +   int ret;
> >>>>>     int type;
> >>>>>     int nocow;
> >>>>>     int check_prev = 1;
> >>>>> @@ -1407,11 +1407,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>                      * if there are pending snapshots for this root,
> >>>>>                      * we fall into common COW way.
> >>>>>                      */
> >>>>> -                   if (!nolock) {
> >>>>> -                           err = btrfs_start_write_no_snapshotting(root);
> >>>>> -                           if (!err)
> >>>>> -                                   goto out_check;
> >>>>> -                   }
> >>>>> +                   if (!nolock &&
> >>>>> +                           unlikely(atomic_read(&root->snapshot_force_cow)))
> >>>>> +                           goto out_check;
> >>>>>                     /*
> >>>>>                      * force cow if csum exists in the range.
> >>>>>                      * this ensure that csum for a given extent are
> >>>>> @@ -1420,9 +1418,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>                     ret = csum_exist_in_range(fs_info, disk_bytenr,
> >>>>>                                               num_bytes);
> >>>>>                     if (ret) {
> >>>>> -                           if (!nolock)
> >>>>> -                                   btrfs_end_write_no_snapshotting(root);
> >>>>> -
> >>>>>                             /*
> >>>>>                              * ret could be -EIO if the above fails to read
> >>>>>                              * metadata.
> >>>>> @@ -1435,11 +1430,8 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>                             WARN_ON_ONCE(nolock);
> >>>>>                             goto out_check;
> >>>>>                     }
> >>>>> -                   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
> >>>>> -                           if (!nolock)
> >>>>> -                                   btrfs_end_write_no_snapshotting(root);
> >>>>> +                   if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
> >>>>>                             goto out_check;
> >>>>> -                   }
> >>>>>                     nocow = 1;
> >>>>>             } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
> >>>>>                     extent_end = found_key.offset +
> >>>>> @@ -1453,8 +1445,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>  out_check:
> >>>>>             if (extent_end <= start) {
> >>>>>                     path->slots[0]++;
> >>>>> -                   if (!nolock && nocow)
> >>>>> -                           btrfs_end_write_no_snapshotting(root);
> >>>>>                     if (nocow)
> >>>>>                             btrfs_dec_nocow_writers(fs_info, disk_bytenr);
> >>>>>                     goto next_slot;
> >>>>> @@ -1476,8 +1466,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>                                          end, page_started, nr_written, 1,
> >>>>>                                          NULL);
> >>>>>                     if (ret) {
> >>>>> -                           if (!nolock && nocow)
> >>>>> -                                   btrfs_end_write_no_snapshotting(root);
> >>>>>                             if (nocow)
> >>>>>                                     btrfs_dec_nocow_writers(fs_info,
> >>>>>                                                             disk_bytenr);
> >>>>> @@ -1497,8 +1485,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>                                       ram_bytes, BTRFS_COMPRESS_NONE,
> >>>>>                                       BTRFS_ORDERED_PREALLOC);
> >>>>>                     if (IS_ERR(em)) {
> >>>>> -                           if (!nolock && nocow)
> >>>>> -                                   btrfs_end_write_no_snapshotting(root);
> >>>>>                             if (nocow)
> >>>>>                                     btrfs_dec_nocow_writers(fs_info,
> >>>>>                                                             disk_bytenr);
> >>>>> @@ -1537,8 +1523,6 @@ static noinline int run_delalloc_nocow(struct inode *inode,
> >>>>>                                          EXTENT_CLEAR_DATA_RESV,
> >>>>>                                          PAGE_UNLOCK | PAGE_SET_PRIVATE2);
> >>>>>
> >>>>> -           if (!nolock && nocow)
> >>>>> -                   btrfs_end_write_no_snapshotting(root);
> >>>>>             cur_offset = extent_end;
> >>>>>
> >>>>>             /*
> >>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >>>>> index b077544..331b495 100644
> >>>>> --- a/fs/btrfs/ioctl.c
> >>>>> +++ b/fs/btrfs/ioctl.c
> >>>>> @@ -761,6 +761,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>>>     struct btrfs_pending_snapshot *pending_snapshot;
> >>>>>     struct btrfs_trans_handle *trans;
> >>>>>     int ret;
> >>>>> +   bool snapshot_force_cow = false;
> >>>>>
> >>>>>     if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
> >>>>>             return -EINVAL;
> >>>>> @@ -777,6 +778,11 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>>>             goto free_pending;
> >>>>>     }
> >>>>>
> >>>>> +   /*
> >>>>> +    * Force new buffered writes to reserve space even when NOCOW is
> >>>>> +    * possible. This is to avoid later writeback (running dealloc)
> >>>>> +    * to fallback to COW mode and unexpectedly fail with ENOSPC.
> >>>>> +    */
> >>>>>     atomic_inc(&root->will_be_snapshotted);
> >>>>>     smp_mb__after_atomic();
> >>>>>     /* wait for no snapshot writes */
> >>>>> @@ -787,6 +793,14 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>>>     if (ret)
> >>>>>             goto dec_and_free;
> >>>>>
> >>>>> +   /*
> >>>>> +    * All previous writes have started writeback in NOCOW mode, so now
> >>>>> +    * we force future writes to fallback to COW mode during snapshot
> >>>>> +    * creation.
> >>>>> +    */
> >>>>> +   atomic_inc(&root->snapshot_force_cow);
> >>>>> +   snapshot_force_cow = true;
> >>>>> +
> >>>>>     btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
> >>>>>
> >>>>>     btrfs_init_block_rsv(&pending_snapshot->block_rsv,
> >>>>> @@ -851,6 +865,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
> >>>>>  fail:
> >>>>>     btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
> >>>>>  dec_and_free:
> >>>>> +   if (snapshot_force_cow)
> >>>>> +           atomic_dec(&root->snapshot_force_cow);
> >>>>>     if (atomic_dec_and_test(&root->will_be_snapshotted))
> >>>>>             wake_up_var(&root->will_be_snapshotted);
> >>>>>  free_pending:
> >>>>>
> >>>>
> >
> >
> >
Filipe Manana April 26, 2019, 9:09 a.m. UTC | #13
On Fri, Apr 26, 2019 at 7:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> > On 2019/4/26 2:32 PM, Qu Wenruo wrote:
> >
> >
> > > On 2019/4/26 11:32 AM, robbieko wrote:
> > [snipped]
> >>>
> >>> But now, before creating a new snapshot, we will flush all data.
> >>>
> >>> This means at this time, the data will be written back *NOCOW*, just as
> >>> expected.
> >>>
> >>>> will_be_snapshotted will be set, and then flush
> >>>
> >>> will_be_snapshotted will also be gone.
> >>>
> >>> Thanks,
> >>> Qu
> >>>
> >>
> >> Why will will_be_snapshotted disappear?
> >>
> >> If will_be_snapshotted is removed, how do we ensure, when the volume
> >> is full, that a snapshot and concurrent writes can both proceed
> >> without data loss?
> >>
> >> How do you handle data written between btrfs_start_delalloc_snapshot
> >> and create_pending_snapshot during this window?
> >
> > Oh, I forgot that buffered writes don't rely on a transaction, thus there
> > is a race window between the above two functions.
> >
> >
> > So will_be_snapshotted is there to ensure that buffered writes in the above
> > race window always reserve space for COW.
> >
> > So your patch changes the behavior from:
> >                    /------------------------ inc(will_be_snapshotted)
> >                    |
> > -------------------|-------------------------------------
> > reserve: NOCOW     | reserve: COW
> > delalloc: NOCOW    | delalloc: COW
> >
> > That reserve-NOCOW to delalloc-COW transition is what causes the ENOSPC.
> >
> > To the new behavior:
> >
> >                    /------------------------ inc(will_be_snapshotted)
> >                    |                     /-- inc(snapshot_force_cow)
> > -------------------|---------------------|---------------
> > reserve: NOCOW     | reserve: COW        | reserve: COW
> > delalloc: NOCOW    | delalloc: NOCOW     | delalloc: COW
> >
> > And now reserve->delalloc is either NOCOW->NOCOW or COW->COW, avoiding
> > the possible ENOSPC.
> >
> > Then it makes sense now. (although way more complex)
> >
> > Although I still have some questions about all this
> > will_be_snapshotted/snapshot_force_cow code.
> >
> > 1) Does that btrfs_start_delalloc_snapshot() call in create_snapshot()
> >    still make sense?
> >    We now have btrfs_start_delalloc_snapshot() in
> >    btrfs_start_delalloc_flush().
> >    Can we remove that call in create_snapshot() and:
> >
> > 2) Can't we make the race window smaller? (If we can remove the call
> >    mentioned in the above question)
> >    E.g. move the inc(will_be_snapshotted) before
> >    btrfs_start_delalloc_snapshot() in btrfs_start_delalloc_flush(),
> >    and move inc(snapshot_force_cow) after that
> >    btrfs_start_delalloc_snapshot() call?
> >
> >    As you also mentioned, the real race window is between
> >    btrfs_start_delalloc_snapshot() and create_pending_snapshot().
>
> And another (unrelated) question:
>
> 3) Why do we use such a complex way to prevent the race? Why not just do a
>    per-subvolume freeze, like various volume managers?
>
>    The current way keeps buffered writes going, but is it worth it?

And that's precisely one of the advantages of Btrfs' snapshotting - it
does not block writes while snapshotting is in progress.
Blocking them is terrible for user experience - waiting for
snapshotting IO (flushing delalloc, waiting for delalloc, committing a
transaction, etc) can take seconds, or minutes, or maybe more in
extreme cases.

As a user, while I'm browsing youtube, replying to these threads,
playing games, etc, I would really hate if I experience stalls while
snapper decided to take a snapshot of my root or home subvolumes.
Not to mention the case for IO intensive applications (databases, etc).

>
>    If we freeze/block writes when creating a snapshot, it just falls back
>    to traditional volume manager behavior, which shouldn't cause much of a
>    problem, and we can get rid of all these snapshot-specific fixes.

So if after that we still find bugs in snapshotting, and we will for
sure, then we do what? Disable snapshotting in btrfs, remove all the
supporting code?

NACK.

>
> Thanks,
> Qu
>
> >
> > Thanks,
> > Qu
> >
> >>
> >> E.g. chattr +C /mnt
> >> fallocate -l 4096 /mnt/small_file
> >> fallocate -l $((64*1024*1024)) /mnt/large_file
> >>
> >> First process:
> >> while true; do
> >>     dd if=/dev/urandom of=/mnt/small_file bs=4k count=1 conv=notrunc
> >> done
> >>
> >> Second process:
> >> while true; do
> >>     dd if=/dev/urandom of=/mnt/large_file bs=64k count=1024 conv=notrunc
> >> done
> >>
> >> Third process: create snapshot.
> >
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346a..663ce05 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1277,6 +1277,7 @@  struct btrfs_root {
 	int send_in_progress;
 	struct btrfs_subvolume_writers *subv_writers;
 	atomic_t will_be_snapshotted;
+	atomic_t snapshot_force_cow;
 
 	/* For qgroup metadata reserved space */
 	spinlock_t qgroup_meta_rsv_lock;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092d..5573916 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1216,6 +1216,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	atomic_set(&root->log_batch, 0);
 	refcount_set(&root->refs, 1);
 	atomic_set(&root->will_be_snapshotted, 0);
+	atomic_set(&root->snapshot_force_cow, 0);
 	root->log_transid = 0;
 	root->log_transid_committed = -1;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eba61bc..263b852 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1275,7 +1275,7 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 	u64 disk_num_bytes;
 	u64 ram_bytes;
 	int extent_type;
-	int ret, err;
+	int ret;
 	int type;
 	int nocow;
 	int check_prev = 1;
@@ -1407,11 +1407,9 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 			 * if there are pending snapshots for this root,
 			 * we fall into common COW way.
 			 */
-			if (!nolock) {
-				err = btrfs_start_write_no_snapshotting(root);
-				if (!err)
-					goto out_check;
-			}
+			if (!nolock &&
+				unlikely(atomic_read(&root->snapshot_force_cow)))
+				goto out_check;
 			/*
 			 * force cow if csum exists in the range.
 			 * this ensure that csum for a given extent are
@@ -1420,9 +1418,6 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 			ret = csum_exist_in_range(fs_info, disk_bytenr,
 						  num_bytes);
 			if (ret) {
-				if (!nolock)
-					btrfs_end_write_no_snapshotting(root);
-
 				/*
 				 * ret could be -EIO if the above fails to read
 				 * metadata.
@@ -1435,11 +1430,8 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 				WARN_ON_ONCE(nolock);
 				goto out_check;
 			}
-			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) {
-				if (!nolock)
-					btrfs_end_write_no_snapshotting(root);
+			if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
 				goto out_check;
-			}
 			nocow = 1;
 		} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 			extent_end = found_key.offset +
@@ -1453,8 +1445,6 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 out_check:
 		if (extent_end <= start) {
 			path->slots[0]++;
-			if (!nolock && nocow)
-				btrfs_end_write_no_snapshotting(root);
 			if (nocow)
 				btrfs_dec_nocow_writers(fs_info, disk_bytenr);
 			goto next_slot;
@@ -1476,8 +1466,6 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 					     end, page_started, nr_written, 1,
 					     NULL);
 			if (ret) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshotting(root);
 				if (nocow)
 					btrfs_dec_nocow_writers(fs_info,
 								disk_bytenr);
@@ -1497,8 +1485,6 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 					  ram_bytes, BTRFS_COMPRESS_NONE,
 					  BTRFS_ORDERED_PREALLOC);
 			if (IS_ERR(em)) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshotting(root);
 				if (nocow)
 					btrfs_dec_nocow_writers(fs_info,
 								disk_bytenr);
@@ -1537,8 +1523,6 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 					     EXTENT_CLEAR_DATA_RESV,
 					     PAGE_UNLOCK | PAGE_SET_PRIVATE2);
 
-		if (!nolock && nocow)
-			btrfs_end_write_no_snapshotting(root);
 		cur_offset = extent_end;
 
 		/*
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b077544..331b495 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -761,6 +761,7 @@  static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	struct btrfs_pending_snapshot *pending_snapshot;
 	struct btrfs_trans_handle *trans;
 	int ret;
+	bool snapshot_force_cow = false;
 
 	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 		return -EINVAL;
@@ -777,6 +778,11 @@  static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 		goto free_pending;
 	}
 
+	/*
+	 * Force new buffered writes to reserve space even when NOCOW is
+	 * possible. This is to avoid later writeback (running dealloc)
+	 * to fallback to COW mode and unexpectedly fail with ENOSPC.
+	 */
 	atomic_inc(&root->will_be_snapshotted);
 	smp_mb__after_atomic();
 	/* wait for no snapshot writes */
@@ -787,6 +793,14 @@  static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (ret)
 		goto dec_and_free;
 
+	/*
+	 * All previous writes have started writeback in NOCOW mode, so now
+	 * we force future writes to fallback to COW mode during snapshot
+	 * creation.
+	 */
+	atomic_inc(&root->snapshot_force_cow);
+	snapshot_force_cow = true;
+
 	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
 
 	btrfs_init_block_rsv(&pending_snapshot->block_rsv,
@@ -851,6 +865,8 @@  static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 fail:
 	btrfs_subvolume_release_metadata(fs_info, &pending_snapshot->block_rsv);
 dec_and_free:
+	if (snapshot_force_cow)
+		atomic_dec(&root->snapshot_force_cow);
 	if (atomic_dec_and_test(&root->will_be_snapshotted))
 		wake_up_var(&root->will_be_snapshotted);
 free_pending:
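
To summarize the ordering discussed in the thread, below is a small userspace
model (standard C11, not kernel code) of the two-flag protocol this patch
establishes in create_snapshot(). Only the flag names mirror the fields added
to struct btrfs_root; the phases, helper names and printouts are illustrative
assumptions, not Btrfs APIs.

/*
 * Userspace model (C11, not kernel code) of the two-flag protocol the
 * patch establishes in create_snapshot().  The flag names mirror the
 * fields added to struct btrfs_root; everything else is illustrative.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Non-zero: new buffered writes must reserve data space even if NOCOW. */
static atomic_int will_be_snapshotted;
/* Non-zero: writeback (run_delalloc_nocow) must fall back to COW. */
static atomic_int snapshot_force_cow;

static void show(const char *phase)
{
        bool reserve_cow  = atomic_load(&will_be_snapshotted) != 0;
        bool delalloc_cow = atomic_load(&snapshot_force_cow) != 0;

        printf("%-28s reserve: %-6s delalloc: %s\n", phase,
               reserve_cow ? "COW" : "NOCOW",
               delalloc_cow ? "COW" : "NOCOW");
}

int main(void)
{
        show("before snapshot");

        /* create_snapshot(): force new writes to reserve data space... */
        atomic_fetch_add(&will_be_snapshotted, 1);
        /* ...then flush delalloc; earlier NOCOW writes still go out NOCOW. */
        show("flushing pre-snapshot data");

        /* After the flush, force any later writeback to fall back to COW. */
        atomic_fetch_add(&snapshot_force_cow, 1);
        show("while snapshot is created");

        /* Snapshot committed (or creation failed): drop both flags. */
        atomic_fetch_sub(&snapshot_force_cow, 1);
        atomic_fetch_sub(&will_be_snapshotted, 1);
        show("after snapshot");

        return 0;
}

Running it reproduces the columns of Qu's diagram: reserve NOCOW / delalloc
NOCOW before the snapshot starts, reserve COW / delalloc NOCOW while the
pre-snapshot data is being flushed, and reserve COW / delalloc COW once
snapshot_force_cow is raised, so the reservation and writeback modes for any
given write always agree.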