diff mbox series

[2/2] dm thin: Flush data device before committing metadata

Message ID 20191204140742.26273-3-ntsironis@arrikto.com (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer
Headers show
Series dm thin: Flush data device before committing metadata to avoid data corruption | expand

Commit Message

Nikos Tsironis Dec. 4, 2019, 2:07 p.m. UTC
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occur:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, which now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers, the written sectors
   will contain garbage instead of a random mix of sectors containing
   either old data or new data, thus we fail again to provide atomic
   sector writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device).

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers the provisioned blocks might contain garbage
instead of zeroes.

To solve this and avoid the potential data corruption we flush the
pool's data device **before** committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-thin.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Comments

Joe Thornber Dec. 4, 2019, 3:27 p.m. UTC | #1
On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.


Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
original bio is still remapped and issued at the end of process_deferred_bios?

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Nikos Tsironis Dec. 4, 2019, 4:17 p.m. UTC | #2
On 12/4/19 5:27 PM, Joe Thornber wrote:
> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>> The thin provisioning target maintains per thin device mappings that map
>> virtual blocks to data blocks in the data device.
> 
> 
> Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
> original bio is still remapped and issued at the end of process_deferred_bios?
> 

Yes, that's correct. I thought of it and of putting a check in
process_deferred_bios() to complete FLUSH bios immediately, but I have
one concern and I preferred to be safe than sorry.

In __commit_transaction() there is the following check:

   if (unlikely(!pmd->in_service))
             return 0;

, which means we don't commit the metadata, and thus we don't flush the
data device, in case the pool is not in service.

Opening a thin device doesn't seem to put the pool in service, since
dm_pool_open_thin_device() uses pmd_write_lock_in_core().

Can I assume that the pool is in service if I/O can be mapped to a thin
device? If so, it's safe to put such a check in process_deferred_bios().

On second thought though, in order for a flush bio to end up in
deferred_flush_bios in the first place, someone must have changed the
metadata and thus put the pool in service. Otherwise, it would have been
submitted directly to the data device. So, it's probably safe to check
for flush bios after commit() in process_deferred_bios() and complete
them immediately.

If you confirm too that this is safe, I will send a second version of
the patch adding the check.

Thanks,
Nikos

> - Joe
> 

Mike Snitzer Dec. 4, 2019, 4:39 p.m. UTC | #3
On Wed, Dec 04 2019 at 11:17am -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> On 12/4/19 5:27 PM, Joe Thornber wrote:
> >On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
> >>The thin provisioning target maintains per thin device mappings that map
> >>virtual blocks to data blocks in the data device.
> >
> >
> >Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
> >original bio is still remapped and issued at the end of process_deferred_bios?
> >
> 
> Yes, that's correct. I thought of it and of putting a check in
> process_deferred_bios() to complete FLUSH bios immediately, but I have
> one concern and I preferred to be safe than sorry.
> 
> In __commit_transaction() there is the following check:
> 
>   if (unlikely(!pmd->in_service))
>             return 0;
> 
> , which means we don't commit the metadata, and thus we don't flush the
> data device, in case the pool is not in service.
> 
> Opening a thin device doesn't seem to put the pool in service, since
> dm_pool_open_thin_device() uses pmd_write_lock_in_core().
> 
> Can I assume that the pool is in service if I/O can be mapped to a thin
> device? If so, it's safe to put such a check in process_deferred_bios().

In service means upper layer has issued a write to a thin device of a
pool.  The header for commit 873f258becca87 gets into more detail.

> On second thought though, in order for a flush bio to end up in
> deferred_flush_bios in the first place, someone must have changed the
> metadata and thus put the pool in service. Otherwise, it would have been
> submitted directly to the data device. So, it's probably safe to check
> for flush bios after commit() in process_deferred_bios() and complete
> them immediately.

Yes, I think so, which was Joe's original point.
 
> If you confirm too that this is safe, I will send a second version of
> the patch adding the check.

Not seeing why we need another in_service check.  After your changes are
applied, any commit will trigger a preceding flush, so the deferred
flushes are redundant.

By definition, these deferred bios imply the pool is in service.

I'd be fine with seeing a 3rd follow-on thinp patch that completes the
redundant flushes immediately.

Thanks,
Mike

Nikos Tsironis Dec. 4, 2019, 4:47 p.m. UTC | #4
On 12/4/19 6:39 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at 11:17am -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> On 12/4/19 5:27 PM, Joe Thornber wrote:
>>> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>>>> The thin provisioning target maintains per thin device mappings that map
>>>> virtual blocks to data blocks in the data device.
>>>
>>>
>>> Ack.  But I think we're issuing the FLUSH twice with your patch.  Since the
>>> original bio is still remapped and issued at the end of process_deferred_bios?
>>>
>>
>> Yes, that's correct. I thought of it and of putting a check in
>> process_deferred_bios() to complete FLUSH bios immediately, but I have
>> one concern and I preferred to be safe than sorry.
>>
>> In __commit_transaction() there is the following check:
>>
>>    if (unlikely(!pmd->in_service))
>>              return 0;
>>
>> , which means we don't commit the metadata, and thus we don't flush the
>> data device, in case the pool is not in service.
>>
>> Opening a thin device doesn't seem to put the pool in service, since
>> dm_pool_open_thin_device() uses pmd_write_lock_in_core().
>>
>> Can I assume that the pool is in service if I/O can be mapped to a thin
>> device? If so, it's safe to put such a check in process_deferred_bios().
> 
> In service means upper layer has issued a write to a thin device of a
> pool.  The header for commit 873f258becca87 gets into more detail.
> 
>> On second thought though, in order for a flush bio to end up in
>> deferred_flush_bios in the first place, someone must have changed the
>> metadata and thus put the pool in service. Otherwise, it would have been
>> submitted directly to the data device. So, it's probably safe to check
>> for flush bios after commit() in process_deferred_bios() and complete
>> them immediately.
> 
> Yes, I think so, which was Joe's original point.
>   
>> If you confirm too that this is safe, I will send a second version of
>> the patch adding the check.
> 
> Not seeing why we need another in_service check.  After your changes are
> applied, any commit will trigger a preceding flush, so the deferred
> flushes are redundant.
> 

Yes, I meant add a check in process_deferred_bios(), after commit(), to
check for REQ_PREFLUSH bios and complete them immediately. I should have
clarified that.

> By definition, these deferred bios imply the pool is in service.
> 
> I'd be fine with seeing a 3rd follow-on thinp patch that completes the
> redundant flushes immediately.
> 

Ack, I will send another patch fixing this.

Nikos

> Thanks,
> Mike
> 


Patch

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 5a2c494cb552..e0be545080d0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3180,6 +3180,34 @@  static void metadata_low_callback(void *context)
 	dm_table_event(pool->ti->table);
 }
 
+/*
+ * We need to flush the data device **before** committing the metadata.
+ *
+ * This ensures that the data blocks of any newly inserted mappings are
+ * properly written to non-volatile storage and won't be lost in case of a
+ * crash.
+ *
+ * Failure to do so can result in data corruption in the case of internal or
+ * external snapshots and in the case of newly provisioned blocks, when block
+ * zeroing is enabled.
+ */
+static int metadata_pre_commit_callback(void *context)
+{
+	struct pool_c *pt = context;
+	struct bio bio;
+	int r;
+
+	bio_init(&bio, NULL, 0);
+	bio_set_dev(&bio, pt->data_dev->bdev);
+	bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	r = submit_bio_wait(&bio);
+
+	bio_uninit(&bio);
+
+	return r;
+}
+
 static sector_t get_dev_size(struct block_device *bdev)
 {
 	return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
@@ -3374,6 +3402,10 @@  static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	if (r)
 		goto out_flags_changed;
 
+	dm_pool_register_pre_commit_callback(pt->pool->pmd,
+					     metadata_pre_commit_callback,
+					     pt);
+
 	pt->callbacks.congested_fn = pool_is_congested;
 	dm_table_add_target_callbacks(ti->table, &pt->callbacks);