| Message ID | 20191204140742.26273-3-ntsironis@arrikto.com |
|---|---|
| State | Accepted, archived |
| Delegated to | Mike Snitzer |
| Series | dm thin: Flush data device before committing metadata to avoid data corruption |
On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.

Ack. But I think we're issuing the FLUSH twice with your patch, since the
original bio is still remapped and issued at the end of
process_deferred_bios?

- Joe
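For context, the double FLUSH Joe describes comes from the tail of
process_deferred_bios(), which reissues every deferred bio after the
metadata commit, including the original flush bio. A paraphrased sketch of
the pre-patch loop (from dm-thin.c of that era; surrounding details elided):

    /*
     * Paraphrased sketch, not the verbatim kernel code: commit() writes
     * the metadata (and, with the patch applied, first flushes the data
     * device), then every deferred bio -- including the original
     * REQ_PREFLUSH bio -- is still reissued, flushing a second time.
     */
    if (commit(pool)) {
            while ((bio = bio_list_pop(&bios)))
                    bio_io_error(bio);
            return;
    }
    pool->last_commit_jiffies = jiffies;

    while ((bio = bio_list_pop(&bios)))
            generic_make_request(bio);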
On 12/4/19 5:27 PM, Joe Thornber wrote:
> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>> The thin provisioning target maintains per thin device mappings that map
>> virtual blocks to data blocks in the data device.
>
> Ack. But I think we're issuing the FLUSH twice with your patch, since the
> original bio is still remapped and issued at the end of
> process_deferred_bios?

Yes, that's correct. I thought of this, and of putting a check in
process_deferred_bios() to complete FLUSH bios immediately, but I had
one concern and preferred to be safe rather than sorry.

In __commit_transaction() there is the following check:

    if (unlikely(!pmd->in_service))
        return 0;

which means we don't commit the metadata, and thus don't flush the data
device, when the pool is not in service.

Opening a thin device doesn't seem to put the pool in service, since
dm_pool_open_thin_device() uses pmd_write_lock_in_core().

Can I assume that the pool is in service if I/O can be mapped to a thin
device? If so, it's safe to put such a check in process_deferred_bios().

On second thought though, for a flush bio to end up in
deferred_flush_bios in the first place, someone must have changed the
metadata and thus put the pool in service. Otherwise, it would have been
submitted directly to the data device. So, it's probably safe to check
for flush bios after commit() in process_deferred_bios() and complete
them immediately.

If you confirm that this is safe as well, I will send a second version
of the patch adding the check.

Thanks,
Nikos
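Concretely, the check being proposed could look like this at the tail of
process_deferred_bios() -- a minimal sketch assuming the pre-commit flush
patch is already applied, not the final follow-on patch:

    while ((bio = bio_list_pop(&bios))) {
            /*
             * commit() has already flushed the data device via the
             * pre-commit callback, so a deferred REQ_PREFLUSH bio is
             * redundant and can be completed immediately rather than
             * remapped and reissued (which would flush twice).
             */
            if (bio->bi_opf & REQ_PREFLUSH)
                    bio_endio(bio);
            else
                    generic_make_request(bio);
    }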
On Wed, Dec 04 2019 at 11:17am -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> On 12/4/19 5:27 PM, Joe Thornber wrote:
>> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>>> The thin provisioning target maintains per thin device mappings that map
>>> virtual blocks to data blocks in the data device.
>>
>> Ack. But I think we're issuing the FLUSH twice with your patch, since the
>> original bio is still remapped and issued at the end of
>> process_deferred_bios?
>
> Yes, that's correct. I thought of this, and of putting a check in
> process_deferred_bios() to complete FLUSH bios immediately, but I had
> one concern and preferred to be safe rather than sorry.
>
> In __commit_transaction() there is the following check:
>
>     if (unlikely(!pmd->in_service))
>         return 0;
>
> which means we don't commit the metadata, and thus don't flush the data
> device, when the pool is not in service.
>
> Opening a thin device doesn't seem to put the pool in service, since
> dm_pool_open_thin_device() uses pmd_write_lock_in_core().
>
> Can I assume that the pool is in service if I/O can be mapped to a thin
> device? If so, it's safe to put such a check in process_deferred_bios().

In service means an upper layer has issued a write to a thin device of a
pool. The header for commit 873f258becca87 gets into more detail.

> On second thought though, for a flush bio to end up in
> deferred_flush_bios in the first place, someone must have changed the
> metadata and thus put the pool in service. Otherwise, it would have been
> submitted directly to the data device. So, it's probably safe to check
> for flush bios after commit() in process_deferred_bios() and complete
> them immediately.

Yes, I think so, which was Joe's original point.

> If you confirm that this is safe as well, I will send a second version
> of the patch adding the check.

Not seeing why we need another in_service check. After your changes are
applied, any commit will trigger a preceding flush, so the deferred
flushes are redundant. By definition, these deferred bios imply the pool
is in service.

I'd be fine with seeing a 3rd follow-on thinp patch that completes the
redundant flushes immediately.

Thanks,
Mike
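For reference, the locking distinction being discussed: pmd_write_lock()
marks the pool as in service, while pmd_write_lock_in_core() only takes
the lock. A paraphrased sketch of the two helpers as they appear in
dm-thin-metadata.c around commit 873f258becca87 (not verbatim):

    static inline void pmd_write_lock_in_core(struct dm_pool_metadata *pmd)
    {
            /*
             * Takes the lock but does NOT mark the pool as in service;
             * used for operations, such as opening a thin device, that
             * only touch in-core state.
             */
            down_write(&pmd->root_lock);
    }

    static inline void pmd_write_lock(struct dm_pool_metadata *pmd)
    {
            pmd_write_lock_in_core(pmd);
            /*
             * Any on-disk metadata change puts the pool in service,
             * which is what allows __commit_transaction() to proceed.
             */
            if (unlikely(!pmd->in_service))
                    pmd->in_service = true;
    }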
On 12/4/19 6:39 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at 11:17am -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>
>> On 12/4/19 5:27 PM, Joe Thornber wrote:
>>> On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
>>>> The thin provisioning target maintains per thin device mappings that map
>>>> virtual blocks to data blocks in the data device.
>>>
>>> Ack. But I think we're issuing the FLUSH twice with your patch, since the
>>> original bio is still remapped and issued at the end of
>>> process_deferred_bios?
>>
>> Yes, that's correct. I thought of this, and of putting a check in
>> process_deferred_bios() to complete FLUSH bios immediately, but I had
>> one concern and preferred to be safe rather than sorry.
>>
>> In __commit_transaction() there is the following check:
>>
>>     if (unlikely(!pmd->in_service))
>>         return 0;
>>
>> which means we don't commit the metadata, and thus don't flush the data
>> device, when the pool is not in service.
>>
>> Opening a thin device doesn't seem to put the pool in service, since
>> dm_pool_open_thin_device() uses pmd_write_lock_in_core().
>>
>> Can I assume that the pool is in service if I/O can be mapped to a thin
>> device? If so, it's safe to put such a check in process_deferred_bios().
>
> In service means an upper layer has issued a write to a thin device of a
> pool. The header for commit 873f258becca87 gets into more detail.
>
>> On second thought though, for a flush bio to end up in
>> deferred_flush_bios in the first place, someone must have changed the
>> metadata and thus put the pool in service. Otherwise, it would have been
>> submitted directly to the data device. So, it's probably safe to check
>> for flush bios after commit() in process_deferred_bios() and complete
>> them immediately.
>
> Yes, I think so, which was Joe's original point.
>
>> If you confirm that this is safe as well, I will send a second version
>> of the patch adding the check.
>
> Not seeing why we need another in_service check. After your changes are
> applied, any commit will trigger a preceding flush, so the deferred
> flushes are redundant.

Yes, I meant adding a check in process_deferred_bios(), after commit(),
that finds REQ_PREFLUSH bios and completes them immediately. I should
have clarified that.

> By definition, these deferred bios imply the pool is in service.
>
> I'd be fine with seeing a 3rd follow-on thinp patch that completes the
> redundant flushes immediately.

Ack, I will send another patch fixing this.

Nikos

> Thanks,
> Mike
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 5a2c494cb552..e0be545080d0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3180,6 +3180,34 @@ static void metadata_low_callback(void *context)
 	dm_table_event(pool->ti->table);
 }
 
+/*
+ * We need to flush the data device **before** committing the metadata.
+ *
+ * This ensures that the data blocks of any newly inserted mappings are
+ * properly written to non-volatile storage and won't be lost in case of a
+ * crash.
+ *
+ * Failure to do so can result in data corruption in the case of internal or
+ * external snapshots and in the case of newly provisioned blocks, when block
+ * zeroing is enabled.
+ */
+static int metadata_pre_commit_callback(void *context)
+{
+	struct pool_c *pt = context;
+	struct bio bio;
+	int r;
+
+	bio_init(&bio, NULL, 0);
+	bio_set_dev(&bio, pt->data_dev->bdev);
+	bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+
+	r = submit_bio_wait(&bio);
+
+	bio_uninit(&bio);
+
+	return r;
+}
+
 static sector_t get_dev_size(struct block_device *bdev)
 {
 	return i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
@@ -3374,6 +3402,10 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
 	if (r)
 		goto out_flags_changed;
 
+	dm_pool_register_pre_commit_callback(pt->pool->pmd,
+					     metadata_pre_commit_callback,
+					     pt);
+
 	pt->callbacks.congested_fn = pool_is_congested;
 	dm_table_add_target_callbacks(ti->table, &pt->callbacks);
 
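The hunks above only show the dm-thin.c side; the dm-thin-metadata.c
counterpart, which stores and invokes the registered callback, is not part
of this diff. As a rough sketch of how the commit path would use it --
field names here follow the dm_pool_register_pre_commit_callback() call
above and are assumptions, not the actual metadata patch:

    /*
     * Assumed sketch, not the actual dm-thin-metadata.c hunk: the commit
     * path runs the registered callback before any metadata reaches the
     * disk, so the data device is flushed first.
     */
    static int __commit_transaction(struct dm_pool_metadata *pmd)
    {
            int r;

            if (unlikely(!pmd->in_service))
                    return 0;

            if (pmd->pre_commit_fn) {
                    r = pmd->pre_commit_fn(pmd->pre_commit_context);
                    if (r < 0) {
                            DMERR("pre-commit callback failed");
                            return r;
                    }
            }

            /*
             * ... existing logic: write out the changed btree nodes, then
             * commit the superblock with its own flush/FUA write ...
             */
    }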
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the following
sequence of events occurs:

1. We write to a shared block.
2. A new data block is allocated.
3. We copy the shared block to the new data block using kcopyd (COW).
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, which now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device we read it
from the data block allocated in step (2), since the metadata have been
successfully committed. The data are lost due to the crash, so we read
garbage instead of the old, shared data.

This has the following implications:

1. In case of writes to shared blocks, with size smaller than the pool's
   block size (which means we first copy the whole block and then issue
   the smaller write), we corrupt data that the user never touched.

2. In case of writes to shared blocks, with size equal to the device's
   logical block size, we fail to provide atomic sector writes. When the
   system recovers the user will read garbage from that sector instead
   of the old data or the new data.

3. Even for writes to shared blocks, with size equal to the pool's block
   size (overwrites), after the system recovers the written sectors will
   contain garbage instead of a random mix of sectors containing either
   old data or new data, so again we fail to provide atomic sector
   writes.

4. Even when the user flushes the thin device, because we first commit
   the metadata and then pass down the flush, the same risk for
   corruption exists (if the system crashes after the metadata have been
   committed but before the flush is passed down to the data device).

The only case which is unaffected is that of writes with size equal to
the pool's block size and with the FUA flag set. But, because FUA writes
trigger metadata commits, this case can trigger the corruption
indirectly.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled. After
the system recovers the provisioned blocks might contain garbage instead
of zeroes.

To solve this and avoid the potential data corruption we flush the pool's
data device **before** committing its metadata. This ensures that the
data blocks of any newly inserted mappings are properly written to
non-volatile storage and won't be lost in case of a crash.

Cc: stable@vger.kernel.org
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
---
 drivers/md/dm-thin.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)