
[0/2] dm thin: Flush data device before committing metadata to avoid data corruption

Message ID 20191204140742.26273-1-ntsironis@arrikto.com

Message

Nikos Tsironis Dec. 4, 2019, 2:07 p.m. UTC
The thin provisioning target maintains per thin device mappings that map
virtual blocks to data blocks in the data device.

When we write to a shared block, in case of internal snapshots, or
provision a new block, in case of external snapshots, we copy the shared
block to a new data block (COW), update the mapping for the relevant
virtual block and then issue the write to the new data block.

Suppose the data device has a volatile write-back cache and the
following sequence of events occurs:

1. We write to a shared block
2. A new data block is allocated
3. We copy the shared block to the new data block using kcopyd (COW)
4. We insert the new mapping for the virtual block in the btree for that
   thin device.
5. The commit timeout expires and we commit the metadata, which now
   includes the new mapping from step (4).
6. The system crashes and the data device's cache has not been flushed,
   meaning that the COWed data are lost.

The next time we read that virtual block of the thin device, we read
it from the data block allocated in step (2), since the metadata have
been successfully committed. The data are lost due to the crash, so we
read garbage instead of the old, shared data.

Moreover, apart from internal and external snapshots, the same issue
exists for newly provisioned blocks, when block zeroing is enabled.
After the system recovers, the provisioned blocks might contain garbage
instead of zeroes.

For more information regarding the implications of this, please see the
relevant commit.

To solve this and avoid the potential data corruption, we have to flush
the pool's data device before committing its metadata.

This ensures that the data blocks of any newly inserted mappings are
properly written to non-volatile storage and won't be lost in case of a
crash.
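
A minimal sketch of the idea, with illustrative names (the real hook is
added by the patches below, and the exact bio API differs across kernel
versions): issue an empty preflush bio to the data device and wait for
it to complete, right before committing the metadata.

	#include <linux/bio.h>
	#include <linux/blkdev.h>

	/*
	 * Illustrative only: flush the data device's volatile cache so
	 * that the data blocks of newly inserted mappings reach stable
	 * storage before the metadata commit makes them visible.
	 */
	static int flush_data_dev(struct block_device *data_bdev)
	{
		struct bio bio;

		bio_init(&bio, NULL, 0);
		bio_set_dev(&bio, data_bdev);
		bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;

		return submit_bio_wait(&bio);
	}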

Nikos Tsironis (2):
  dm thin metadata: Add support for a pre-commit callback
  dm thin: Flush data device before committing metadata

 drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
 drivers/md/dm-thin-metadata.h |  7 +++++++
 drivers/md/dm-thin.c          | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

Comments

Eric Wheeler Dec. 4, 2019, 7:58 p.m. UTC | #1
On Wed, 4 Dec 2019, Nikos Tsironis wrote:

> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.
> 
> When we write to a shared block, in case of internal snapshots, or
> provision a new block, in case of external snapshots, we copy the shared
> block to a new data block (COW), update the mapping for the relevant
> virtual block and then issue the write to the new data block.
> 
> Suppose the data device has a volatile write-back cache and the
> following sequence of events occur:

For those with NV caches, can the data disk flush be optional (maybe as a 
table flag)?

--
Eric Wheeler



> 
> 1. We write to a shared block
> 2. A new data block is allocated
> 3. We copy the shared block to the new data block using kcopyd (COW)
> 4. We insert the new mapping for the virtual block in the btree for that
>    thin device.
> 5. The commit timeout expires and we commit the metadata, that now
>    includes the new mapping from step (4).
> 6. The system crashes and the data device's cache has not been flushed,
>    meaning that the COWed data are lost.
> 
> The next time we read that virtual block of the thin device we read it
> from the data block allocated in step (2), since the metadata have been
> successfully committed. The data are lost due to the crash, so we read
> garbage instead of the old, shared data.
> 
> Moreover, apart from internal and external snapshots, the same issue
> exists for newly provisioned blocks, when block zeroing is enabled.
> After the system recovers the provisioned blocks might contain garbage
> instead of zeroes.
> 
> For more information regarding the implications of this please see the
> relevant commit.
> 
> To solve this and avoid the potential data corruption we have to flush
> the pool's data device before committing its metadata.
> 
> This ensures that the data blocks of any newly inserted mappings are
> properly written to non-volatile storage and won't be lost in case of a
> crash.
> 
> Nikos Tsironis (2):
>   dm thin metadata: Add support for a pre-commit callback
>   dm thin: Flush data device before committing metadata
> 
>  drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
>  drivers/md/dm-thin-metadata.h |  7 +++++++
>  drivers/md/dm-thin.c          | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 68 insertions(+)
> 
> -- 
> 2.11.0
> 
Mike Snitzer Dec. 4, 2019, 8:17 p.m. UTC | #2
On Wed, Dec 04 2019 at  2:58pm -0500,
Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:

> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> 
> > The thin provisioning target maintains per thin device mappings that map
> > virtual blocks to data blocks in the data device.
> > 
> > When we write to a shared block, in case of internal snapshots, or
> > provision a new block, in case of external snapshots, we copy the shared
> > block to a new data block (COW), update the mapping for the relevant
> > virtual block and then issue the write to the new data block.
> > 
> > Suppose the data device has a volatile write-back cache and the
> > following sequence of events occur:
> 
> For those with NV caches, can the data disk flush be optional (maybe as a 
> table flag)?

IIRC block core should avoid issuing the flush if not needed.  I'll have
a closer look to verify as much.

Mike

Nikos Tsironis Dec. 5, 2019, 3:31 p.m. UTC | #3
On 12/4/19 10:17 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at  2:58pm -0500,
> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> 
>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>
>>> The thin provisioning target maintains per thin device mappings that map
>>> virtual blocks to data blocks in the data device.
>>>
>>> When we write to a shared block, in case of internal snapshots, or
>>> provision a new block, in case of external snapshots, we copy the shared
>>> block to a new data block (COW), update the mapping for the relevant
>>> virtual block and then issue the write to the new data block.
>>>
>>> Suppose the data device has a volatile write-back cache and the
>>> following sequence of events occur:
>>
>> For those with NV caches, can the data disk flush be optional (maybe as a
>> table flag)?
> 
> IIRC block core should avoid issuing the flush if not needed.  I'll have
> a closer look to verify as much.
> 

For devices without a volatile write-back cache, block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload, and
completes empty REQ_PREFLUSH requests before they reach the driver.

This happens in generic_make_request_checks():

		/*
		 * Filter flush bio's early so that make_request based
		 * drivers without flush support don't have to worry
		 * about them.
		 */
		if (op_is_flush(bio->bi_opf) &&
		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
		        if (!nr_sectors) {
		                status = BLK_STS_OK;
		                goto end_io;
		        }
		}

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write-back cache or not.

You could check this by looking at /sys/block/<device>/queue/write_cache.
If it says "write back", then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).

This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).
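
For example, a udev rule along these lines (a hypothetical example; the
device match is a placeholder and you must be certain the cache really
is non-volatile) would apply the override every time the device's queue
is created:

	# /etc/udev/rules.d/99-write-cache.rules (hypothetical)
	ACTION=="add|change", KERNEL=="sdX", ATTR{queue/write_cache}="write through"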

Nikos

> Mike
> 

Mike Snitzer Dec. 5, 2019, 3:42 p.m. UTC | #4
On Thu, Dec 05 2019 at 10:31am -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> On 12/4/19 10:17 PM, Mike Snitzer wrote:
> >On Wed, Dec 04 2019 at  2:58pm -0500,
> >Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> >
> >>On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> >>
> >>>The thin provisioning target maintains per thin device mappings that map
> >>>virtual blocks to data blocks in the data device.
> >>>
> >>>When we write to a shared block, in case of internal snapshots, or
> >>>provision a new block, in case of external snapshots, we copy the shared
> >>>block to a new data block (COW), update the mapping for the relevant
> >>>virtual block and then issue the write to the new data block.
> >>>
> >>>Suppose the data device has a volatile write-back cache and the
> >>>following sequence of events occur:
> >>
> >>For those with NV caches, can the data disk flush be optional (maybe as a
> >>table flag)?
> >
> >IIRC block core should avoid issuing the flush if not needed.  I'll have
> >a closer look to verify as much.
> >
> 
> For devices without a volatile write-back cache block core strips off
> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> completes empty REQ_PREFLUSH requests before entering the driver.
> 
> This happens in generic_make_request_checks():
> 
> 		/*
> 		 * Filter flush bio's early so that make_request based
> 		 * drivers without flush support don't have to worry
> 		 * about them.
> 		 */
> 		if (op_is_flush(bio->bi_opf) &&
> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> 		        if (!nr_sectors) {
> 		                status = BLK_STS_OK;
> 		                goto end_io;
> 		        }
> 		}
> 
> If I am not mistaken, it all depends on whether the underlying device
> reports the existence of a write back cache or not.

Yes, thanks for confirming my memory of the situation.

> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
> 
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
> 
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Not delved into this aspect of Linux's capabilities but it strikes me as
"dangerous" to twiddle device capabilities like this.  Best to fix the
driver to properly expose the cache (or not, as the case may be).  It
should also be noted that with DM, the capabilities are stacked up at
device creation time.  So any changes to the underlying devices will
_not_ be reflected in the higher-level DM device.
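
For reference, the interface a driver uses to expose this is
blk_queue_write_cache(); a rough sketch (illustrative values, not any
particular driver):

	#include <linux/blkdev.h>

	/*
	 * Illustrative only: a driver with a volatile write-back cache
	 * must advertise it so block core issues flushes; a controller
	 * whose cache is protected (BBU/non-volatile) can register as
	 * write through, and block core will filter the flushes out.
	 */
	static void example_expose_cache(struct request_queue *q,
					 bool volatile_wb_cache)
	{
		if (volatile_wb_cache)
			blk_queue_write_cache(q, true, true);
		else
			blk_queue_write_cache(q, false, false);
	}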

Mike

Nikos Tsironis Dec. 5, 2019, 4:02 p.m. UTC | #5
On 12/5/19 5:42 PM, Mike Snitzer wrote:
> On Thu, Dec 05 2019 at 10:31am -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
> 
>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>>>
>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>
>>>>> The thin provisioning target maintains per thin device mappings that map
>>>>> virtual blocks to data blocks in the data device.
>>>>>
>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>> provision a new block, in case of external snapshots, we copy the shared
>>>>> block to a new data block (COW), update the mapping for the relevant
>>>>> virtual block and then issue the write to the new data block.
>>>>>
>>>>> Suppose the data device has a volatile write-back cache and the
>>>>> following sequence of events occur:
>>>>
>>>> For those with NV caches, can the data disk flush be optional (maybe as a
>>>> table flag)?
>>>
>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>> a closer look to verify as much.
>>>
>>
>> For devices without a volatile write-back cache block core strips off
>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>> completes empty REQ_PREFLUSH requests before entering the driver.
>>
>> This happens in generic_make_request_checks():
>>
>> 		/*
>> 		 * Filter flush bio's early so that make_request based
>> 		 * drivers without flush support don't have to worry
>> 		 * about them.
>> 		 */
>> 		if (op_is_flush(bio->bi_opf) &&
>> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>> 		        if (!nr_sectors) {
>> 		                status = BLK_STS_OK;
>> 		                goto end_io;
>> 		        }
>> 		}
>>
>> If I am not mistaken, it all depends on whether the underlying device
>> reports the existence of a write back cache or not.
> 
> Yes, thanks for confirming my memory of the situation.
> 
>> You could check this by looking at /sys/block/<device>/queue/write_cache
>> If it says "write back" then flushes will be issued.
>>
>> In case the sysfs entry reports a "write back" cache for a device with a
>> non-volatile write cache, I think you can change the kernel's view of
>> the device by writing to this entry (you could also create a udev rule
>> for this).
>>
>> This way you can set the write cache as write through. This will
>> eliminate the cache flushes issued by the kernel, without altering the
>> device state (Documentation/block/queue-sysfs.rst).
> 
> Not delved into this aspect of Linux's capabilities but it strikes me as
> "dangerous" to twiddle device capabilities like this.  Best to fix
> driver to properly expose cache (or not, as the case may be).  It should
> also be noted that with DM, the capabilities are stacked up at device
> creation time.  So any changes to the underlying devices will _not_ be
> reflected to the high level DM device.
> 

Yes, I agree completely. The queue-sysfs doc also mentions that it's not
safe to do that. I just mentioned it for completeness.

As far as DM is concerned, you are right. You would have to deactivate
and reactivate all DM devices for the change to propagate to upper
layers. That's why I mentioned udev: that way the change is made to the
lower-level device when its queue is first created, and it is properly
propagated to the upper layers.

But, again, I agree that this is not something safe to do and it's
better to make sure the driver properly exposes the cache capabilities,
as you said.

Nikos

> Mike
> 

Eric Wheeler Dec. 5, 2019, 10:34 p.m. UTC | #6
On Thu, 5 Dec 2019, Nikos Tsironis wrote:
> On 12/4/19 10:17 PM, Mike Snitzer wrote:
> > On Wed, Dec 04 2019 at  2:58pm -0500,
> > Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > 
> > > On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> > >
> > > > The thin provisioning target maintains per thin device mappings that map
> > > > virtual blocks to data blocks in the data device.
> > > >
> > > > When we write to a shared block, in case of internal snapshots, or
> > > > provision a new block, in case of external snapshots, we copy the shared
> > > > block to a new data block (COW), update the mapping for the relevant
> > > > virtual block and then issue the write to the new data block.
> > > >
> > > > Suppose the data device has a volatile write-back cache and the
> > > > following sequence of events occur:
> > >
> > > For those with NV caches, can the data disk flush be optional (maybe as a
> > > table flag)?
> > 
> > IIRC block core should avoid issuing the flush if not needed.  I'll have
> > a closer look to verify as much.
> > 
> 
> For devices without a volatile write-back cache block core strips off
> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> completes empty REQ_PREFLUSH requests before entering the driver.
> 
> This happens in generic_make_request_checks():
> 
> 		/*
> 		 * Filter flush bio's early so that make_request based
> 		 * drivers without flush support don't have to worry
> 		 * about them.
> 		 */
> 		if (op_is_flush(bio->bi_opf) &&
> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> 		        if (!nr_sectors) {
> 		                status = BLK_STS_OK;
> 		                goto end_io;
> 		        }
> 		}
> 
> If I am not mistaken, it all depends on whether the underlying device
> reports the existence of a write back cache or not.
> 
> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
> 
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
> 
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Interesting, I'll remember that. I think this is a documentation bug; isn't this backwards:
	'This means that it might not be safe to toggle the setting from 
	"write back" to "write through", since that will also eliminate
	cache flushes issued by the kernel.'
	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]


How does this work with stacking blockdevs?  Does it inherit from the
lower-level dev? If an upper-level is misconfigured, would a writeback at
higher levels clear the flush for lower levels?

--
Eric Wheeler



> Nikos
> 
> > Mike
> > 
> 


Nikos Tsironis Dec. 6, 2019, 3:14 p.m. UTC | #7
On 12/6/19 12:34 AM, Eric Wheeler wrote:
> On Thu, 5 Dec 2019, Nikos Tsironis wrote:
>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>>>
>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>
>>>>> The thin provisioning target maintains per thin device mappings that map
>>>>> virtual blocks to data blocks in the data device.
>>>>>
>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>> provision a new block, in case of external snapshots, we copy the shared
>>>>> block to a new data block (COW), update the mapping for the relevant
>>>>> virtual block and then issue the write to the new data block.
>>>>>
>>>>> Suppose the data device has a volatile write-back cache and the
>>>>> following sequence of events occur:
>>>>
>>>> For those with NV caches, can the data disk flush be optional (maybe as a
>>>> table flag)?
>>>
>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>> a closer look to verify as much.
>>>
>>
>> For devices without a volatile write-back cache block core strips off
>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>> completes empty REQ_PREFLUSH requests before entering the driver.
>>
>> This happens in generic_make_request_checks():
>>
>> 		/*
>> 		 * Filter flush bio's early so that make_request based
>> 		 * drivers without flush support don't have to worry
>> 		 * about them.
>> 		 */
>> 		if (op_is_flush(bio->bi_opf) &&
>> 		    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>> 		        bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>> 		        if (!nr_sectors) {
>> 		                status = BLK_STS_OK;
>> 		                goto end_io;
>> 		        }
>> 		}
>>
>> If I am not mistaken, it all depends on whether the underlying device
>> reports the existence of a write back cache or not.
>>
>> You could check this by looking at /sys/block/<device>/queue/write_cache
>> If it says "write back" then flushes will be issued.
>>
>> In case the sysfs entry reports a "write back" cache for a device with a
>> non-volatile write cache, I think you can change the kernel's view of
>> the device by writing to this entry (you could also create a udev rule
>> for this).
>>
>> This way you can set the write cache as write through. This will
>> eliminate the cache flushes issued by the kernel, without altering the
>> device state (Documentation/block/queue-sysfs.rst).
> 
> Interesting, I'll remember that. I think this is a documentation bug, isn't this backwards:
> 	'This means that it might not be safe to toggle the setting from
> 	"write back" to "write through", since that will also eliminate
> 	cache flushes issued by the kernel.'
> 	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
> 
> 

If a device has a volatile cache, then the write_cache sysfs entry will
be "write back" and we have to issue flushes to the device. In all other
cases, write_cache will be "write through".

It's not safe to toggle write_cache from "write back" to "write through"
because this stops the kernel from sending flushes to the device, but
the device will continue caching the writes. So, in case something goes
wrong, you might lose your writes or end up with some kind of
corruption.

> How does this work with stacking blockdevs?  Does it inherit from the
> lower-level dev? If an upper-level is misconfigured, would a writeback at
> higher levels would clear the flush for lower levels?
> 

As Mike already mentioned in another reply to this thread, the device
capabilities are stacked up when each device is created and are
inherited from component devices.

The logic for device stacking is implemented in various functions in
block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
etc.), which are also used by DM core in dm-table.c to set the
capabilities of DM devices.

If an upper layer device reports a "write back" cache then flushes will
be issued to it by the kernel, no matter what the capabilities of the
underlying devices are.

Normally an upper layer device would report a "write back" cache if at
least one underlying device supports flushes. But, some DM devices
report a "write back" cache irrespective of the underlying devices,
e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
their own metadata. They then pass the flush request down to the
underlying device and rely on block core to do the right thing: either
actually send the flush to the device, if it has a volatile cache, or
complete it immediately.
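
A rough sketch of this stacking behaviour (illustrative only, not the
exact dm-table.c code):

	#include <linux/blkdev.h>

	/*
	 * Illustrative only: the stacked device advertises a volatile
	 * write-back cache if any component device has one; targets
	 * like dm-thin simply force it on regardless, so that flushes
	 * reach them and can be passed down.
	 */
	static void example_stack_write_cache(struct request_queue *top,
					      struct request_queue **bottom,
					      int n)
	{
		bool wc = false;
		int i;

		for (i = 0; i < n; i++)
			if (test_bit(QUEUE_FLAG_WC, &bottom[i]->queue_flags))
				wc = true;

		blk_queue_write_cache(top, wc, false);
	}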

Nikos

> --
> Eric Wheeler
> 
> 
> 
>> Nikos
>>
>>> Mike
>>>
>>

Eric Wheeler Dec. 6, 2019, 8:06 p.m. UTC | #8
On Fri, 6 Dec 2019, Nikos Tsironis wrote:
> On 12/6/19 12:34 AM, Eric Wheeler wrote:
> > On Thu, 5 Dec 2019, Nikos Tsironis wrote:
> > > On 12/4/19 10:17 PM, Mike Snitzer wrote:
> > > > On Wed, Dec 04 2019 at  2:58pm -0500,
> > > > Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
> > > >
> > > > > On Wed, 4 Dec 2019, Nikos Tsironis wrote:
> > > > >
> > > > > > The thin provisioning target maintains per thin device mappings that
> > > > > > map
> > > > > > virtual blocks to data blocks in the data device.
> > > > > >
> > > > > > When we write to a shared block, in case of internal snapshots, or
> > > > > > provision a new block, in case of external snapshots, we copy the
> > > > > > shared
> > > > > > block to a new data block (COW), update the mapping for the relevant
> > > > > > virtual block and then issue the write to the new data block.
> > > > > >
> > > > > > Suppose the data device has a volatile write-back cache and the
> > > > > > following sequence of events occur:
> > > > >
> > > > > For those with NV caches, can the data disk flush be optional (maybe
> > > > > as a
> > > > > table flag)?
> > > >
> > > > IIRC block core should avoid issuing the flush if not needed.  I'll have
> > > > a closer look to verify as much.
> > > >
> > >
> > > For devices without a volatile write-back cache block core strips off
> > > the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> > > completes empty REQ_PREFLUSH requests before entering the driver.
> > >
> > > This happens in generic_make_request_checks():
> > >
> > >   /*
> > >    * Filter flush bio's early so that make_request based
> > >    * drivers without flush support don't have to worry
> > >    * about them.
> > >    */
> > >   if (op_is_flush(bio->bi_opf) &&
> > >       !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> > >           bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
> > >           if (!nr_sectors) {
> > >                   status = BLK_STS_OK;
> > >                   goto end_io;
> > >           }
> > >   }
> > >
> > > If I am not mistaken, it all depends on whether the underlying device
> > > reports the existence of a write back cache or not.
> > >
> > > You could check this by looking at /sys/block/<device>/queue/write_cache
> > > If it says "write back" then flushes will be issued.
> > >
> > > In case the sysfs entry reports a "write back" cache for a device with a
> > > non-volatile write cache, I think you can change the kernel's view of
> > > the device by writing to this entry (you could also create a udev rule
> > > for this).
> > >
> > > This way you can set the write cache as write through. This will
> > > eliminate the cache flushes issued by the kernel, without altering the
> > > device state (Documentation/block/queue-sysfs.rst).
> > 
> > Interesting, I'll remember that. I think this is a documentation bug, isn't
> > this backwards:
> >  'This means that it might not be safe to toggle the setting from
> >  "write back" to "write through", since that will also eliminate
> >  cache flushes issued by the kernel.'
> >  [https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
> > 
> > 
> 
> If a device has a volatile cache then the write_cache sysfs entry will
> be "write back" and we have to issue flushes to the device. In all other
> cases write_cache will be "write through".

Forgive my misunderstanding, but if I have a RAID controller with a
cache and BBU, with the RAID volume set to write-back mode in the
controller, are you saying that the sysfs entry should show "write
through"? I had always understood that it was safe to disable flushes
with a non-volatile cache, and that a non-volatile cache is called a
write-back cache.

It is strange to me that this terminology in the kernel would be backwards 
from how it is expressed in a RAID controller. Incidentally, I have an 
Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in 
write-back mode and the second volume is write-through. In both cases 
sysfs reports "write through":

[root@hv1-he ~]# cat /sys/block/sda/queue/write_cache 
write through
[root@hv1-he ~]# cat /sys/block/sdb/queue/write_cache 
write through

This is running 4.19.75, so we can at least say that the 9460 does not
support proper representation of the VD cache mode in sysfs, but which
is correct? Should it not be that the sysfs entry reports the same cache
mode as the RAID controller?

-Eric

> 
> It's not safe to toggle write_cache from "write back" to "write through"
> because this stops the kernel from sending flushes to the device, but
> the device will continue caching the writes. So, in case something goes
> wrong, you might lose your writes or end up with some kind of
> corruption.
> 
> > How does this work with stacking blockdevs?  Does it inherit from the
> > lower-level dev? If an upper-level is misconfigured, would a writeback at
> > higher levels would clear the flush for lower levels?
> > 
> 
> As Mike already mentioned in another reply to this thread, the device
> capabilities are stacked up when each device is created and are
> inherited from component devices.
> 
> The logic for device stacking is implemented in various functions in
> block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
> etc.), which are used also by DM core in dm-table.c to set the
> capabilities of DM devices.
> 
> If an upper layer device reports a "write back" cache then flushes will
> be issued to it by the kernel, no matter what the capabilities of the
> underlying devices are.
> 
> Normally an upper layer device would report a "write back" cache if at
> least one underlying device supports flushes. But, some DM devices
> report a "write back" cache irrespective of the underlying devices,
> e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
> their own metadata. They then pass the flush request down to the
> underlying device and rely on block core to do the right thing. Either
> actually send the flush to the device, if it has a volatile cache, or
> complete it immediately.
> 
> Nikos
> 
> > --
> > Eric Wheeler
> > 
> > 
> > 
> > > Nikos
> > >
> > > > Mike
> > > >
> > >
> 
> 


Nikos Tsironis Dec. 9, 2019, 2:25 p.m. UTC | #9
On 12/6/19 10:06 PM, Eric Wheeler wrote:
> On Fri, 6 Dec 2019, Nikos Tsironis wrote:
>> On 12/6/19 12:34 AM, Eric Wheeler wrote:
>>> On Thu, 5 Dec 2019, Nikos Tsironis wrote:
>>>> On 12/4/19 10:17 PM, Mike Snitzer wrote:
>>>>> On Wed, Dec 04 2019 at  2:58pm -0500,
>>>>> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>>>>>
>>>>>> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>>>>>>
>>>>>>> The thin provisioning target maintains per thin device mappings that
>>>>>>> map
>>>>>>> virtual blocks to data blocks in the data device.
>>>>>>>
>>>>>>> When we write to a shared block, in case of internal snapshots, or
>>>>>>> provision a new block, in case of external snapshots, we copy the
>>>>>>> shared
>>>>>>> block to a new data block (COW), update the mapping for the relevant
>>>>>>> virtual block and then issue the write to the new data block.
>>>>>>>
>>>>>>> Suppose the data device has a volatile write-back cache and the
>>>>>>> following sequence of events occur:
>>>>>>
>>>>>> For those with NV caches, can the data disk flush be optional (maybe
>>>>>> as a
>>>>>> table flag)?
>>>>>
>>>>> IIRC block core should avoid issuing the flush if not needed.  I'll have
>>>>> a closer look to verify as much.
>>>>>
>>>>
>>>> For devices without a volatile write-back cache block core strips off
>>>> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
>>>> completes empty REQ_PREFLUSH requests before entering the driver.
>>>>
>>>> This happens in generic_make_request_checks():
>>>>
>>>>    /*
>>>>     * Filter flush bio's early so that make_request based
>>>>     * drivers without flush support don't have to worry
>>>>     * about them.
>>>>     */
>>>>    if (op_is_flush(bio->bi_opf) &&
>>>>        !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
>>>>            bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
>>>>            if (!nr_sectors) {
>>>>                    status = BLK_STS_OK;
>>>>                    goto end_io;
>>>>            }
>>>>    }
>>>>
>>>> If I am not mistaken, it all depends on whether the underlying device
>>>> reports the existence of a write back cache or not.
>>>>
>>>> You could check this by looking at /sys/block/<device>/queue/write_cache
>>>> If it says "write back" then flushes will be issued.
>>>>
>>>> In case the sysfs entry reports a "write back" cache for a device with a
>>>> non-volatile write cache, I think you can change the kernel's view of
>>>> the device by writing to this entry (you could also create a udev rule
>>>> for this).
>>>>
>>>> This way you can set the write cache as write through. This will
>>>> eliminate the cache flushes issued by the kernel, without altering the
>>>> device state (Documentation/block/queue-sysfs.rst).
>>>
>>> Interesting, I'll remember that. I think this is a documentation bug, isn't
>>> this backwards:
>>>   'This means that it might not be safe to toggle the setting from
>>>   "write back" to "write through", since that will also eliminate
>>>   cache flushes issued by the kernel.'
>>>   [https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
>>>
>>>
>>
>> If a device has a volatile cache then the write_cache sysfs entry will
>> be "write back" and we have to issue flushes to the device. In all other
>> cases write_cache will be "write through".
> 
> Forgive my misunderstanding, but if I have a RAID controller with a cache
> and BBU with the RAID volume set to write-back mode in the controller, are
> you saying that the sysfs entry should show "write through"? I had always
> understood that it was safe to disable flushes with a non-volatile cache
> and a non-volatile cache is called a write-back cache.
> 

From the device perspective, a non-volatile cache operating in
write-back mode is indeed called a write-back cache.

But, from the OS perspective, a non-volatile cache (whether it operates
in write-back or write-through mode) is, for all intents and purposes,
equivalent to a write-through cache: when the device acknowledges a
write, it's guaranteed that the written data won't be lost in case of
power loss.

So, in the case of a controller with a BBU and/or a non-volatile cache,
you don't care what the device does internally. All that matters is that
acked writes won't be lost in case of power failure.

I believe that the sysfs entry reports exactly that: whether the kernel
should treat the device as having a volatile write-back cache, so we
have to issue flushes to ensure the data are properly persisted, or as
having no cache or a write-through cache, so flushes are not necessary.

> It is strange to me that this terminology in the kernel would be backwards
> from how it is expressed in a RAID controller. Incidentally, I have an
> Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in
> write-back mode and the second volume is write-through. In both cases
> sysfs reports "write through":
> 
> [root@hv1-he ~]# cat /sys/block/sda/queue/write_cache
> write through
> [root@hv1-he ~]# cat /sys/block/sdb/queue/write_cache
> write through
> 
> This is running 4.19.75, so we can at least say that the 9460 does not
> support proper representation of the VD cache mode in sysfs, but which is
> correct? Should it not be that the sysfs entry reports the same cache mode
> of the RAID controller?
> 

My guess is that the controller reports to the kernel that it has a
write-through cache (or no cache at all) on purpose, to avoid
unnecessary flushes. Since it can ensure the persistence of acked writes
by other means, e.g., a BBU, as far as the kernel is concerned the
device can be treated as one with a write-through cache.

Moreover, I think that MegaRAID controllers, in the default write-back
mode, automatically switch the write policy to write-through if the BBU
is low, has failed, or is being charged.

So, I think it makes sense to report to the kernel that the device has a
write-through cache, even though internally the device operates the
cache in write-back mode.

Nikos

> -Eric
> 
>>
>> It's not safe to toggle write_cache from "write back" to "write through"
>> because this stops the kernel from sending flushes to the device, but
>> the device will continue caching the writes. So, in case something goes
>> wrong, you might lose your writes or end up with some kind of
>> corruption.
>>
>>> How does this work with stacking blockdevs?  Does it inherit from the
>>> lower-level dev? If an upper-level is misconfigured, would a writeback at
>>> higher levels would clear the flush for lower levels?
>>>
>>
>> As Mike already mentioned in another reply to this thread, the device
>> capabilities are stacked up when each device is created and are
>> inherited from component devices.
>>
>> The logic for device stacking is implemented in various functions in
>> block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
>> etc.), which are used also by DM core in dm-table.c to set the
>> capabilities of DM devices.
>>
>> If an upper layer device reports a "write back" cache then flushes will
>> be issued to it by the kernel, no matter what the capabilities of the
>> underlying devices are.
>>
>> Normally an upper layer device would report a "write back" cache if at
>> least one underlying device supports flushes. But, some DM devices
>> report a "write back" cache irrespective of the underlying devices,
>> e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
>> their own metadata. They then pass the flush request down to the
>> underlying device and rely on block core to do the right thing. Either
>> actually send the flush to the device, if it has a volatile cache, or
>> complete it immediately.
>>
>> Nikos
>>
>>> --
>>> Eric Wheeler
>>>
>>>
>>>
>>>> Nikos
>>>>
>>>>> Mike
>>>>>
>>>>
>>
>>
