[3/3] dm clone: Flush destination device before committing metadata

Message ID 20191204140654.26214-4-ntsironis@arrikto.com (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer
Headers show
Series dm clone: Flush destination device before committing metadata to avoid data corruption | expand

Commit Message

Nikos Tsironis Dec. 4, 2019, 2:06 p.m. UTC
dm-clone maintains an on-disk bitmap which records which regions are
valid in the destination device, i.e., which regions have already been
hydrated, or have been written to directly, via user I/O.

Setting a bit in the on-disk bitmap means that the corresponding region is
valid in the destination device, so we redirect all I/O targeting it to
the destination device.

Suppose the destination device has a volatile write-back cache and the
following sequence of events occurs:

1. A region gets hydrated, either through the background hydration or
   because it was written to directly, via user I/O.

2. The commit timeout expires and we commit the metadata, marking that
   region as valid in the destination device.

3. The system crashes and the destination device's cache has not been
   flushed, meaning the region's data are lost.

The next time we read that region we read it from the destination
device, since the metadata have been successfully committed, but the
data are lost due to the crash, so we read garbage instead of the old
data.

This has several implications:

1. In case of background hydration or of writes with size smaller than
   the region size (which means we first copy the whole region and then
   issue the smaller write), we corrupt data that the user never
   touched.

2. In case of writes with size equal to the device's logical block size,
   we fail to provide atomic sector writes. When the system recovers the
   user will read garbage from the sector instead of the old data or the
   new data.

3. In case of writes without the FUA flag set, after the system
   recovers, the written sectors will contain garbage instead of a
   random mix of sectors containing either old data or new data, thus we
   fail again to provide atomic sector writes.

4. Even when the user flushes the dm-clone device, because we first
   commit the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down).

The only case which is unaffected is that of writes with size equal to
the region size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

To solve this and avoid the potential data corruption, we flush the
destination device **before** committing the metadata.

This ensures that any freshly hydrated regions, for which we commit the
metadata, are properly written to non-volatile storage and won't be lost
in case of a crash.

Fixes: 7431b7835f55 ("dm: add clone target")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-clone-target.c | 46 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

Comments

Mike Snitzer Dec. 5, 2019, 7:46 p.m. UTC | #1
On Wed, Dec 04 2019 at  9:06P -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> [ commit message quoted in full; snipped ]
> 
> diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
> index 613c913c296c..d1e1b5b56b1b 100644
> --- a/drivers/md/dm-clone-target.c
> +++ b/drivers/md/dm-clone-target.c
> @@ -86,6 +86,12 @@ struct clone {
>  
>  	struct dm_clone_metadata *cmd;
>  
> +	/*
> +	 * bio used to flush the destination device, before committing the
> +	 * metadata.
> +	 */
> +	struct bio flush_bio;
> +
>  	/* Region hydration hash table */
>  	struct hash_table_bucket *ht;
>  
> @@ -1108,10 +1114,13 @@ static bool need_commit_due_to_time(struct clone *clone)
>  /*
>   * A non-zero return indicates read-only or fail mode.
>   */
> -static int commit_metadata(struct clone *clone)
> +static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
>  {
>  	int r = 0;
>  
> +	if (dest_dev_flushed)
> +		*dest_dev_flushed = false;
> +
>  	mutex_lock(&clone->commit_lock);
>  
>  	if (!dm_clone_changed_this_transaction(clone->cmd))
> @@ -1128,6 +1137,19 @@ static int commit_metadata(struct clone *clone)
>  		goto out;
>  	}
>  
> +	bio_reset(&clone->flush_bio);
> +	bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
> +	clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> +
> +	r = submit_bio_wait(&clone->flush_bio);
> +	if (unlikely(r)) {
> +		__metadata_operation_failed(clone, "flush destination device", r);
> +		goto out;
> +	}
> +
> +	if (dest_dev_flushed)
> +		*dest_dev_flushed = true;
> +
>  	r = dm_clone_metadata_commit(clone->cmd);
>  	if (unlikely(r)) {
>  		__metadata_operation_failed(clone, "dm_clone_metadata_commit", r);
> @@ -1199,6 +1221,7 @@ static void process_deferred_bios(struct clone *clone)
>  static void process_deferred_flush_bios(struct clone *clone)
>  {
>  	struct bio *bio;
> +	bool dest_dev_flushed;
>  	struct bio_list bios = BIO_EMPTY_LIST;
>  	struct bio_list bio_completions = BIO_EMPTY_LIST;
>  
> @@ -1218,7 +1241,7 @@ static void process_deferred_flush_bios(struct clone *clone)
>  	    !(dm_clone_changed_this_transaction(clone->cmd) && need_commit_due_to_time(clone)))
>  		return;
>  
> -	if (commit_metadata(clone)) {
> +	if (commit_metadata(clone, &dest_dev_flushed)) {
>  		bio_list_merge(&bios, &bio_completions);
>  
>  		while ((bio = bio_list_pop(&bios)))
> @@ -1232,8 +1255,17 @@ static void process_deferred_flush_bios(struct clone *clone)
>  	while ((bio = bio_list_pop(&bio_completions)))
>  		bio_endio(bio);
>  
> -	while ((bio = bio_list_pop(&bios)))
> -		generic_make_request(bio);
> +	while ((bio = bio_list_pop(&bios))) {
> +		if ((bio->bi_opf & REQ_PREFLUSH) && dest_dev_flushed) {
> +			/* We just flushed the destination device as part of
> +			 * the metadata commit, so there is no reason to send
> +			 * another flush.
> +			 */
> +			bio_endio(bio);
> +		} else {
> +			generic_make_request(bio);
> +		}
> +	}
>  }
>  
>  static void do_worker(struct work_struct *work)
> @@ -1405,7 +1437,7 @@ static void clone_status(struct dm_target *ti, status_type_t type,
>  
>  		/* Commit to ensure statistics aren't out-of-date */
>  		if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
> -			(void) commit_metadata(clone);
> +			(void) commit_metadata(clone, NULL);
>  
>  		r = dm_clone_get_free_metadata_block_count(clone->cmd, &nr_free_metadata_blocks);
>  
> @@ -1839,6 +1871,7 @@ static int clone_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>  	bio_list_init(&clone->deferred_flush_completions);
>  	clone->hydration_offset = 0;
>  	atomic_set(&clone->hydrations_in_flight, 0);
> +	bio_init(&clone->flush_bio, NULL, 0);
>  
>  	clone->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0);
>  	if (!clone->wq) {
> @@ -1912,6 +1945,7 @@ static void clone_dtr(struct dm_target *ti)
>  	struct clone *clone = ti->private;
>  
>  	mutex_destroy(&clone->commit_lock);
> +	bio_uninit(&clone->flush_bio);
>  
>  	for (i = 0; i < clone->nr_ctr_args; i++)
>  		kfree(clone->ctr_args[i]);
> @@ -1966,7 +2000,7 @@ static void clone_postsuspend(struct dm_target *ti)
>  	wait_event(clone->hydration_stopped, !atomic_read(&clone->hydrations_in_flight));
>  	flush_workqueue(clone->wq);
>  
> -	(void) commit_metadata(clone);
> +	(void) commit_metadata(clone, NULL);
>  }
>  
>  static void clone_resume(struct dm_target *ti)
> -- 
> 2.11.0
> 


Like the dm-thin patch I replied to, I'd rather avoid open-coding
blkdev_issue_flush() (I also added a !bio_has_data() check); here is the incremental:

---
 drivers/md/dm-clone-target.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index d1e1b5b56b1b..bce99bff8678 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -86,12 +86,6 @@ struct clone {
 
 	struct dm_clone_metadata *cmd;
 
-	/*
-	 * bio used to flush the destination device, before committing the
-	 * metadata.
-	 */
-	struct bio flush_bio;
-
 	/* Region hydration hash table */
 	struct hash_table_bucket *ht;
 
@@ -1137,11 +1131,7 @@ static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
 		goto out;
 	}
 
-	bio_reset(&clone->flush_bio);
-	bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
-	clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
-
-	r = submit_bio_wait(&clone->flush_bio);
+	r = blkdev_issue_flush(clone->dest_dev->bdev, GFP_NOIO, NULL);
 	if (unlikely(r)) {
 		__metadata_operation_failed(clone, "flush destination device", r);
 		goto out;
@@ -1256,7 +1246,8 @@ static void process_deferred_flush_bios(struct clone *clone)
 		bio_endio(bio);
 
 	while ((bio = bio_list_pop(&bios))) {
-		if ((bio->bi_opf & REQ_PREFLUSH) && dest_dev_flushed) {
+		if (dest_dev_flushed &&
+		    (bio->bi_opf & REQ_PREFLUSH) && !bio_has_data(bio)) {
 			/* We just flushed the destination device as part of
 			 * the metadata commit, so there is no reason to send
 			 * another flush.
@@ -1871,7 +1862,6 @@ static int clone_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	bio_list_init(&clone->deferred_flush_completions);
 	clone->hydration_offset = 0;
 	atomic_set(&clone->hydrations_in_flight, 0);
-	bio_init(&clone->flush_bio, NULL, 0);
 
 	clone->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0);
 	if (!clone->wq) {
@@ -1945,7 +1935,6 @@ static void clone_dtr(struct dm_target *ti)
 	struct clone *clone = ti->private;
 
 	mutex_destroy(&clone->commit_lock);
-	bio_uninit(&clone->flush_bio);
 
 	for (i = 0; i < clone->nr_ctr_args; i++)
 		kfree(clone->ctr_args[i]);
Mike Snitzer Dec. 5, 2019, 8:07 p.m. UTC | #2
On Thu, Dec 05 2019 at  2:46pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> [ quoted message and patch snipped ]
> 
> Like the dm-thin patch I replied to, would rather avoid open-coding
> blkdev_issue_flush (also I check !bio_has_data), here is incremental:

Sorry for the noise relative to the !bio_has_data() check, we don't need it.
DM core will split the flush from the data (see dec_pending()'s REQ_PREFLUSH
check).

I'm dropping the extra !bio_has_data() checks from the incrementals I
did; will review again and push out to linux-next. There is still time to
change it if you fundamentally disagree with using blkdev_issue_flush().

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Nikos Tsironis Dec. 5, 2019, 9:49 p.m. UTC | #3
On 12/5/19 10:07 PM, Mike Snitzer wrote:
> On Thu, Dec 05 2019 at  2:46pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> [ quoted commit message and patch snipped ]
>>
>>
>> Like the dm-thin patch I replied to, would rather avoid open-coding
>> blkdev_issue_flush (also I check !bio_has_data), here is incremental:
> 
> Sorry for the noise relative to !bio_has_data check.. we don't need it.
> DM core will split flush from data (see dec_pending()'s  REQ_PREFLUSH
> check).
> 

It's OK. I know this; that's why I didn't include the !bio_has_data() check
in the first place.

> I'm dropping the extra !bio_has_data() checks from the incrementals I
> did; will review again and push out to linux-next.. still have time to
> change if you fundamentally disagree with using blkdev_issue_flush()
> 

For dm-clone, I didn't use blkdev_issue_flush() to avoid allocating and
freeing a new bio every time we commit the metadata. I haven't measured
the allocation/freeing overhead, and it probably won't be huge, but I
would still like to avoid it, if you don't mind.

For dm-thin, indeed, there is not much to gain by not using
blkdev_issue_flush(), since we still allocate a new bio, indirectly, in
the stack.

Thanks,
Nikos

Mike Snitzer Dec. 5, 2019, 10:09 p.m. UTC | #4
On Thu, Dec 05 2019 at  4:49pm -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> [ quoted thread snipped ]
> >>>@@ -86,6 +86,12 @@ struct clone {
> >>>  	struct dm_clone_metadata *cmd;
> >>>+	/*
> >>>+	 * bio used to flush the destination device, before committing the
> >>>+	 * metadata.
> >>>+	 */
> >>>+	struct bio flush_bio;
> >>>+
> >>>  	/* Region hydration hash table */
> >>>  	struct hash_table_bucket *ht;
> >>>@@ -1108,10 +1114,13 @@ static bool need_commit_due_to_time(struct clone *clone)
> >>>  /*
> >>>   * A non-zero return indicates read-only or fail mode.
> >>>   */
> >>>-static int commit_metadata(struct clone *clone)
> >>>+static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
> >>>  {
> >>>  	int r = 0;
> >>>+	if (dest_dev_flushed)
> >>>+		*dest_dev_flushed = false;
> >>>+
> >>>  	mutex_lock(&clone->commit_lock);
> >>>  	if (!dm_clone_changed_this_transaction(clone->cmd))
> >>>@@ -1128,6 +1137,19 @@ static int commit_metadata(struct clone *clone)
> >>>  		goto out;
> >>>  	}
> >>>+	bio_reset(&clone->flush_bio);
> >>>+	bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
> >>>+	clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> >>>+
> >>>+	r = submit_bio_wait(&clone->flush_bio);
> >>>+	if (unlikely(r)) {
> >>>+		__metadata_operation_failed(clone, "flush destination device", r);
> >>>+		goto out;
> >>>+	}
> >>>+
> >>>+	if (dest_dev_flushed)
> >>>+		*dest_dev_flushed = true;
> >>>+
> >>>  	r = dm_clone_metadata_commit(clone->cmd);
> >>>  	if (unlikely(r)) {
> >>>  		__metadata_operation_failed(clone, "dm_clone_metadata_commit", r);
> >>>@@ -1199,6 +1221,7 @@ static void process_deferred_bios(struct clone *clone)
> >>>  static void process_deferred_flush_bios(struct clone *clone)
> >>>  {
> >>>  	struct bio *bio;
> >>>+	bool dest_dev_flushed;
> >>>  	struct bio_list bios = BIO_EMPTY_LIST;
> >>>  	struct bio_list bio_completions = BIO_EMPTY_LIST;
> >>>@@ -1218,7 +1241,7 @@ static void process_deferred_flush_bios(struct clone *clone)
> >>>  	    !(dm_clone_changed_this_transaction(clone->cmd) && need_commit_due_to_time(clone)))
> >>>  		return;
> >>>-	if (commit_metadata(clone)) {
> >>>+	if (commit_metadata(clone, &dest_dev_flushed)) {
> >>>  		bio_list_merge(&bios, &bio_completions);
> >>>  		while ((bio = bio_list_pop(&bios)))
> >>>@@ -1232,8 +1255,17 @@ static void process_deferred_flush_bios(struct clone *clone)
> >>>  	while ((bio = bio_list_pop(&bio_completions)))
> >>>  		bio_endio(bio);
> >>>-	while ((bio = bio_list_pop(&bios)))
> >>>-		generic_make_request(bio);
> >>>+	while ((bio = bio_list_pop(&bios))) {
> >>>+		if ((bio->bi_opf & REQ_PREFLUSH) && dest_dev_flushed) {
> >>>+			/* We just flushed the destination device as part of
> >>>+			 * the metadata commit, so there is no reason to send
> >>>+			 * another flush.
> >>>+			 */
> >>>+			bio_endio(bio);
> >>>+		} else {
> >>>+			generic_make_request(bio);
> >>>+		}
> >>>+	}
> >>>  }
> >>>  static void do_worker(struct work_struct *work)
> >>>@@ -1405,7 +1437,7 @@ static void clone_status(struct dm_target *ti, status_type_t type,
> >>>  		/* Commit to ensure statistics aren't out-of-date */
> >>>  		if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
> >>>-			(void) commit_metadata(clone);
> >>>+			(void) commit_metadata(clone, NULL);
> >>>  		r = dm_clone_get_free_metadata_block_count(clone->cmd, &nr_free_metadata_blocks);
> >>>@@ -1839,6 +1871,7 @@ static int clone_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >>>  	bio_list_init(&clone->deferred_flush_completions);
> >>>  	clone->hydration_offset = 0;
> >>>  	atomic_set(&clone->hydrations_in_flight, 0);
> >>>+	bio_init(&clone->flush_bio, NULL, 0);
> >>>  	clone->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0);
> >>>  	if (!clone->wq) {
> >>>@@ -1912,6 +1945,7 @@ static void clone_dtr(struct dm_target *ti)
> >>>  	struct clone *clone = ti->private;
> >>>  	mutex_destroy(&clone->commit_lock);
> >>>+	bio_uninit(&clone->flush_bio);
> >>>  	for (i = 0; i < clone->nr_ctr_args; i++)
> >>>  		kfree(clone->ctr_args[i]);
> >>>@@ -1966,7 +2000,7 @@ static void clone_postsuspend(struct dm_target *ti)
> >>>  	wait_event(clone->hydration_stopped, !atomic_read(&clone->hydrations_in_flight));
> >>>  	flush_workqueue(clone->wq);
> >>>-	(void) commit_metadata(clone);
> >>>+	(void) commit_metadata(clone, NULL);
> >>>  }
> >>>  static void clone_resume(struct dm_target *ti)
> >>>-- 
> >>>2.11.0
> >>>
> >>
> >>
> >>Like the dm-thin patch I replied to, would rather avoid open-coding
> >>blkdev_issue_flush (also I check !bio_has_data), here is incremental:
> >
> >Sorry for the noise relative to !bio_has_data check.. we don't need it.
> >DM core will split flush from data (see dec_pending()'s  REQ_PREFLUSH
> >check).
> >
> 
> It's OK. I know this, that's why I didn't put the !bio_has_data check in
> the first place.
> 
> >I'm dropping the extra !bio_has_data() checks from the incrementals I
> >did; will review again and push out to linux-next.. still have time to
> >change if you fundamentally disagree with using blkdev_issue_flush()
> >
> 
> For dm-clone, I didn't use blkdev_issue_flush() to avoid allocating and
> freeing a new bio every time we commit the metadata. I haven't measured
> the allocation/freeing overhead and probably it won't be huge, but still
> I would like to avoid it, if you don't mind.

That's fine, I've restored your code.
 
> For dm-thin, indeed, there is not much to gain by not using
> blkdev_issue_flush(), since we still allocate a new bio, indirectly, in
> the stack.

But thinp obviously could if there is actual benefit to avoiding this
flush bio allocation, via blkdev_issue_flush, every commit.

Mike

Nikos Tsironis Dec. 5, 2019, 10:42 p.m. UTC | #5
On 12/6/19 12:09 AM, Mike Snitzer wrote:
> On Thu, Dec 05 2019 at  4:49pm -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> On 12/5/19 10:07 PM, Mike Snitzer wrote:
>>> On Thu, Dec 05 2019 at  2:46pm -0500,
>>> Mike Snitzer <snitzer@redhat.com> wrote:
>>>
>>>> On Wed, Dec 04 2019 at  9:06P -0500,
>>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>>>>
>>>>> [ snip ]
>>>>
>>>>
>>>> Like the dm-thin patch I replied to, would rather avoid open-coding
>>>> blkdev_issue_flush (also I check !bio_has_data), here is incremental:
>>>
>>> Sorry for the noise relative to !bio_has_data check.. we don't need it.
>>> DM core will split flush from data (see dec_pending()'s  REQ_PREFLUSH
>>> check).
>>>
>>
>> It's OK. I know this, that's why I didn't put the !bio_has_data check in
>> the first place.
>>
>>> I'm dropping the extra !bio_has_data() checks from the incrementals I
>>> did; will review again and push out to linux-next.. still have time to
>>> change if you fundamentally disagree with using blkdev_issue_flush()
>>>
>>
>> For dm-clone, I didn't use blkdev_issue_flush() to avoid allocating and
>> freeing a new bio every time we commit the metadata. I haven't measured
>> the allocation/freeing overhead and probably it won't be huge, but still
>> I would like to avoid it, if you don't mind.
> 
> That's fine, I've restored your code.
>   
>> For dm-thin, indeed, there is not much to gain by not using
>> blkdev_issue_flush(), since we still allocate a new bio, indirectly, in
>> the stack.
> 
> But thinp obviously could if there is actual benefit to avoiding this
> flush bio allocation, via blkdev_issue_flush, every commit.
> 

Yes, we could do the flush in thinp exactly the same way we do it in
dm-clone. Add a struct bio field in struct pool_c and use that in the
callback.

It would work since the callback is called holding a write lock on
pmd->root_lock, so it's executed only by a single thread at a time.

I didn't go for it in my implementation, because I didn't like having to
make that assumption in the callback, i.e., that it's executed under a
lock and so it's safe to have the bio in struct pool_c.

In hindsight, maybe this was a bad call, since it's technically feasible
to do it this way and we could just add a comment stating that the
callback is executed atomically.

If you want I can send a new follow-on patch tomorrow implementing the
flush in thinp the same way it's implemented in dm-clone.

Nikos

> Mike
> 

Mike Snitzer Dec. 6, 2019, 4:21 p.m. UTC | #6
On Thu, Dec 05 2019 at  5:42P -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> On 12/6/19 12:09 AM, Mike Snitzer wrote:
> > On Thu, Dec 05 2019 at  4:49pm -0500,
> > Nikos Tsironis <ntsironis@arrikto.com> wrote:
> > 
> > > For dm-thin, indeed, there is not much to gain by not using
> > > blkdev_issue_flush(), since we still allocate a new bio, indirectly, in
> > > the stack.
> > 
> > But thinp obviously could if there is actual benefit to avoiding this
> > flush bio allocation, via blkdev_issue_flush, every commit.
> > 
> 
> Yes, we could do the flush in thinp exactly the same way we do it in
> dm-clone. Add a struct bio field in struct pool_c and use that in the
> callback.
> 
> It would work since the callback is called holding a write lock on
> pmd->root_lock, so it's executed only by a single thread at a time.
> 
> I didn't go for it in my implementation, because I didn't like having to
> make that assumption in the callback, i.e., that it's executed under a
> lock and so it's safe to have the bio in struct pool_c.
> 
> In hindsight, maybe this was a bad call, since it's technically feasible
> to do it this way and we could just add a comment stating that the
> callback is executed atomically.
> 
> If you want I can send a new follow-on patch tomorrow implementing the
> flush in thinp the same way it's implemented in dm-clone.

I took care of it, here is the incremental:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 73d191ddbb9f..57626c27a54b 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -328,6 +328,7 @@ struct pool_c {
 	dm_block_t low_water_blocks;
 	struct pool_features requested_pf; /* Features requested during table load */
 	struct pool_features adjusted_pf;  /* Features used after adjusting for constituent devices */
+	struct bio flush_bio;
 };
 
 /*
@@ -3123,6 +3124,7 @@ static void pool_dtr(struct dm_target *ti)
 	__pool_dec(pt->pool);
 	dm_put_device(ti, pt->metadata_dev);
 	dm_put_device(ti, pt->data_dev);
+	bio_uninit(&pt->flush_bio);
 	kfree(pt);
 
 	mutex_unlock(&dm_thin_pool_table.mutex);
@@ -3202,8 +3204,13 @@ static void metadata_low_callback(void *context)
 static int metadata_pre_commit_callback(void *context)
 {
 	struct pool_c *pt = context;
+	struct bio *flush_bio = &pt->flush_bio;
 
-	return blkdev_issue_flush(pt->data_dev->bdev, GFP_NOIO, NULL);
+	bio_reset(flush_bio);
+	bio_set_dev(flush_bio, pt->data_dev->bdev);
+	flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	return submit_bio_wait(flush_bio);
 }
 
 static sector_t get_dev_size(struct block_device *bdev)
@@ -3374,6 +3381,7 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	pt->data_dev = data_dev;
 	pt->low_water_blocks = low_water_blocks;
 	pt->adjusted_pf = pt->requested_pf = pf;
+	bio_init(&pt->flush_bio, NULL, 0);
 	ti->num_flush_bios = 1;
 
 	/*


Nikos Tsironis Dec. 6, 2019, 4:46 p.m. UTC | #7
On 12/6/19 6:21 PM, Mike Snitzer wrote:
> On Thu, Dec 05 2019 at  5:42P -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> On 12/6/19 12:09 AM, Mike Snitzer wrote:
>>> On Thu, Dec 05 2019 at  4:49pm -0500,
>>> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>>>
>>>> For dm-thin, indeed, there is not much to gain by not using
>>>> blkdev_issue_flush(), since we still allocate a new bio, indirectly, in
>>>> the stack.
>>>
>>> But thinp obviously could if there is actual benefit to avoiding this
>>> flush bio allocation, via blkdev_issue_flush, every commit.
>>>
>>
>> Yes, we could do the flush in thinp exactly the same way we do it in
>> dm-clone. Add a struct bio field in struct pool_c and use that in the
>> callback.
>>
>> It would work since the callback is called holding a write lock on
>> pmd->root_lock, so it's executed only by a single thread at a time.
>>
>> I didn't go for it in my implementation, because I didn't like having to
>> make that assumption in the callback, i.e., that it's executed under a
>> lock and so it's safe to have the bio in struct pool_c.
>>
>> In hindsight, maybe this was a bad call, since it's technically feasible
>> to do it this way and we could just add a comment stating that the
>> callback is executed atomically.
>>
>> If you want I can send a new follow-on patch tomorrow implementing the
>> flush in thinp the same way it's implemented in dm-clone.
> 
> I took care of it, here is the incremental:
> 

Awesome, thanks!
  
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 73d191ddbb9f..57626c27a54b 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -328,6 +328,7 @@ struct pool_c {
>   	dm_block_t low_water_blocks;
>   	struct pool_features requested_pf; /* Features requested during table load */
>   	struct pool_features adjusted_pf;  /* Features used after adjusting for constituent devices */
> +	struct bio flush_bio;
>   };
>   
>   /*
> @@ -3123,6 +3124,7 @@ static void pool_dtr(struct dm_target *ti)
>   	__pool_dec(pt->pool);
>   	dm_put_device(ti, pt->metadata_dev);
>   	dm_put_device(ti, pt->data_dev);
> +	bio_uninit(&pt->flush_bio);
>   	kfree(pt);
>   
>   	mutex_unlock(&dm_thin_pool_table.mutex);
> @@ -3202,8 +3204,13 @@ static void metadata_low_callback(void *context)
>   static int metadata_pre_commit_callback(void *context)
>   {
>   	struct pool_c *pt = context;
> +	struct bio *flush_bio = &pt->flush_bio;
>   
> -	return blkdev_issue_flush(pt->data_dev->bdev, GFP_NOIO, NULL);
> +	bio_reset(flush_bio);
> +	bio_set_dev(flush_bio, pt->data_dev->bdev);
> +	flush_bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> +
> +	return submit_bio_wait(flush_bio);
>   }
>   
>   static sector_t get_dev_size(struct block_device *bdev)
> @@ -3374,6 +3381,7 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
>   	pt->data_dev = data_dev;
>   	pt->low_water_blocks = low_water_blocks;
>   	pt->adjusted_pf = pt->requested_pf = pf;
> +	bio_init(&pt->flush_bio, NULL, 0);
>   	ti->num_flush_bios = 1;
>   
>   	/*
> 

Looks good,

Thanks Nikos


Patch

diff --git a/drivers/md/dm-clone-target.c b/drivers/md/dm-clone-target.c
index 613c913c296c..d1e1b5b56b1b 100644
--- a/drivers/md/dm-clone-target.c
+++ b/drivers/md/dm-clone-target.c
@@ -86,6 +86,12 @@  struct clone {
 
 	struct dm_clone_metadata *cmd;
 
+	/*
+	 * bio used to flush the destination device, before committing the
+	 * metadata.
+	 */
+	struct bio flush_bio;
+
 	/* Region hydration hash table */
 	struct hash_table_bucket *ht;
 
@@ -1108,10 +1114,13 @@  static bool need_commit_due_to_time(struct clone *clone)
 /*
  * A non-zero return indicates read-only or fail mode.
  */
-static int commit_metadata(struct clone *clone)
+static int commit_metadata(struct clone *clone, bool *dest_dev_flushed)
 {
 	int r = 0;
 
+	if (dest_dev_flushed)
+		*dest_dev_flushed = false;
+
 	mutex_lock(&clone->commit_lock);
 
 	if (!dm_clone_changed_this_transaction(clone->cmd))
@@ -1128,6 +1137,19 @@  static int commit_metadata(struct clone *clone)
 		goto out;
 	}
 
+	bio_reset(&clone->flush_bio);
+	bio_set_dev(&clone->flush_bio, clone->dest_dev->bdev);
+	clone->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	r = submit_bio_wait(&clone->flush_bio);
+	if (unlikely(r)) {
+		__metadata_operation_failed(clone, "flush destination device", r);
+		goto out;
+	}
+
+	if (dest_dev_flushed)
+		*dest_dev_flushed = true;
+
 	r = dm_clone_metadata_commit(clone->cmd);
 	if (unlikely(r)) {
 		__metadata_operation_failed(clone, "dm_clone_metadata_commit", r);
@@ -1199,6 +1221,7 @@  static void process_deferred_bios(struct clone *clone)
 static void process_deferred_flush_bios(struct clone *clone)
 {
 	struct bio *bio;
+	bool dest_dev_flushed;
 	struct bio_list bios = BIO_EMPTY_LIST;
 	struct bio_list bio_completions = BIO_EMPTY_LIST;
 
@@ -1218,7 +1241,7 @@  static void process_deferred_flush_bios(struct clone *clone)
 	    !(dm_clone_changed_this_transaction(clone->cmd) && need_commit_due_to_time(clone)))
 		return;
 
-	if (commit_metadata(clone)) {
+	if (commit_metadata(clone, &dest_dev_flushed)) {
 		bio_list_merge(&bios, &bio_completions);
 
 		while ((bio = bio_list_pop(&bios)))
@@ -1232,8 +1255,17 @@  static void process_deferred_flush_bios(struct clone *clone)
 	while ((bio = bio_list_pop(&bio_completions)))
 		bio_endio(bio);
 
-	while ((bio = bio_list_pop(&bios)))
-		generic_make_request(bio);
+	while ((bio = bio_list_pop(&bios))) {
+		if ((bio->bi_opf & REQ_PREFLUSH) && dest_dev_flushed) {
+			/* We just flushed the destination device as part of
+			 * the metadata commit, so there is no reason to send
+			 * another flush.
+			 */
+			bio_endio(bio);
+		} else {
+			generic_make_request(bio);
+		}
+	}
 }
 
 static void do_worker(struct work_struct *work)
@@ -1405,7 +1437,7 @@  static void clone_status(struct dm_target *ti, status_type_t type,
 
 		/* Commit to ensure statistics aren't out-of-date */
 		if (!(status_flags & DM_STATUS_NOFLUSH_FLAG) && !dm_suspended(ti))
-			(void) commit_metadata(clone);
+			(void) commit_metadata(clone, NULL);
 
 		r = dm_clone_get_free_metadata_block_count(clone->cmd, &nr_free_metadata_blocks);
 
@@ -1839,6 +1871,7 @@  static int clone_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	bio_list_init(&clone->deferred_flush_completions);
 	clone->hydration_offset = 0;
 	atomic_set(&clone->hydrations_in_flight, 0);
+	bio_init(&clone->flush_bio, NULL, 0);
 
 	clone->wq = alloc_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM, 0);
 	if (!clone->wq) {
@@ -1912,6 +1945,7 @@  static void clone_dtr(struct dm_target *ti)
 	struct clone *clone = ti->private;
 
 	mutex_destroy(&clone->commit_lock);
+	bio_uninit(&clone->flush_bio);
 
 	for (i = 0; i < clone->nr_ctr_args; i++)
 		kfree(clone->ctr_args[i]);
@@ -1966,7 +2000,7 @@  static void clone_postsuspend(struct dm_target *ti)
 	wait_event(clone->hydration_stopped, !atomic_read(&clone->hydrations_in_flight));
 	flush_workqueue(clone->wq);
 
-	(void) commit_metadata(clone);
+	(void) commit_metadata(clone, NULL);
 }
 
 static void clone_resume(struct dm_target *ti)