Message ID: 20191204140742.26273-1-ntsironis@arrikto.com (mailing list archive)
Series: dm thin: Flush data device before committing metadata to avoid data corruption
On Wed, 4 Dec 2019, Nikos Tsironis wrote:

> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.
>
> When we write to a shared block, in case of internal snapshots, or
> provision a new block, in case of external snapshots, we copy the shared
> block to a new data block (COW), update the mapping for the relevant
> virtual block and then issue the write to the new data block.
>
> Suppose the data device has a volatile write-back cache and the
> following sequence of events occur:

For those with NV caches, can the data disk flush be optional (maybe as a
table flag)?

--
Eric Wheeler

> 1. We write to a shared block
> 2. A new data block is allocated
> 3. We copy the shared block to the new data block using kcopyd (COW)
> 4. We insert the new mapping for the virtual block in the btree for that
>    thin device.
> 5. The commit timeout expires and we commit the metadata, that now
>    includes the new mapping from step (4).
> 6. The system crashes and the data device's cache has not been flushed,
>    meaning that the COWed data are lost.
>
> The next time we read that virtual block of the thin device we read it
> from the data block allocated in step (2), since the metadata have been
> successfully committed. The data are lost due to the crash, so we read
> garbage instead of the old, shared data.
>
> Moreover, apart from internal and external snapshots, the same issue
> exists for newly provisioned blocks, when block zeroing is enabled.
> After the system recovers the provisioned blocks might contain garbage
> instead of zeroes.
>
> For more information regarding the implications of this please see the
> relevant commit.
>
> To solve this and avoid the potential data corruption we have to flush
> the pool's data device before committing its metadata.
>
> This ensures that the data blocks of any newly inserted mappings are
> properly written to non-volatile storage and won't be lost in case of a
> crash.
> Nikos Tsironis (2):
>   dm thin metadata: Add support for a pre-commit callback
>   dm thin: Flush data device before committing metadata
>
>  drivers/md/dm-thin-metadata.c | 29 +++++++++++++++++++++++++++++
>  drivers/md/dm-thin-metadata.h |  7 +++++++
>  drivers/md/dm-thin.c          | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 68 insertions(+)
>
> --
> 2.11.0

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
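As a rough illustration of the failure sequence described in the cover letter, here is a toy model (plain Python, not the kernel code paths; all names are invented). It models a data device with a volatile write-back cache and shows why committing the new mapping before flushing the data device can expose garbage after a crash:

```python
# Toy model of the race: a data device with a volatile write-back cache,
# and thin-pool metadata committed independently of data-device flushes.
# All names here are invented for illustration only.

class DataDevice:
    def __init__(self):
        self.disk = {}    # persistent blocks
        self.cache = {}   # volatile write-back cache

    def write(self, block, data):
        self.cache[block] = data          # cached, not yet durable

    def flush(self):
        self.disk.update(self.cache)      # make cached writes durable
        self.cache.clear()

    def crash(self):
        self.cache.clear()                # volatile contents are lost

    def read(self, block):
        return self.cache.get(block, self.disk.get(block, "<garbage>"))

# Without the fix: COW to a new block, commit the new mapping, but do
# NOT flush the data device before the metadata commit.
dev = DataDevice()
dev.write(0, "shared data"); dev.flush()  # the original shared block
metadata = {"vblock 0": 0}                # committed mapping

dev.write(1, "shared data")               # steps 2-3: kcopyd copy (cached)
metadata["vblock 0"] = 1                  # steps 4-5: metadata commit (durable)
dev.crash()                               # step 6: power loss
print(dev.read(metadata["vblock 0"]))     # -> <garbage>

# With the fix: flush the data device BEFORE committing the metadata.
dev2 = DataDevice()
dev2.write(0, "shared data"); dev2.flush()
metadata2 = {"vblock 0": 0}
dev2.write(1, "shared data")
dev2.flush()                              # data flushed first
metadata2["vblock 0"] = 1
dev2.crash()
print(dev2.read(metadata2["vblock 0"]))   # -> shared data
```

Either the old data or the new data after a crash would be acceptable; reading a block the metadata points at but whose contents never reached stable storage is not, and that is exactly what the first half of the sketch produces.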
On Wed, Dec 04 2019 at 2:58pm -0500,
Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:

> On Wed, 4 Dec 2019, Nikos Tsironis wrote:
>
> > The thin provisioning target maintains per thin device mappings that map
> > virtual blocks to data blocks in the data device.
> >
> > [...]
> >
> > Suppose the data device has a volatile write-back cache and the
> > following sequence of events occur:
>
> For those with NV caches, can the data disk flush be optional (maybe as a
> table flag)?

IIRC block core should avoid issuing the flush if not needed. I'll have
a closer look to verify as much.

Mike
On 12/4/19 10:17 PM, Mike Snitzer wrote:
> On Wed, Dec 04 2019 at 2:58pm -0500,
> Eric Wheeler <dm-devel@lists.ewheeler.net> wrote:
>
>> [...]
>>
>> For those with NV caches, can the data disk flush be optional (maybe as a
>> table flag)?
>
> IIRC block core should avoid issuing the flush if not needed. I'll have
> a closer look to verify as much.
>

For devices without a volatile write-back cache block core strips off
the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
completes empty REQ_PREFLUSH requests before entering the driver.

This happens in generic_make_request_checks():

	/*
	 * Filter flush bio's early so that make_request based
	 * drivers without flush support don't have to worry
	 * about them.
	 */
	if (op_is_flush(bio->bi_opf) &&
	    !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
		bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
		if (!nr_sectors) {
			status = BLK_STS_OK;
			goto end_io;
		}
	}

If I am not mistaken, it all depends on whether the underlying device
reports the existence of a write back cache or not.

You could check this by looking at /sys/block/<device>/queue/write_cache
If it says "write back" then flushes will be issued.

In case the sysfs entry reports a "write back" cache for a device with a
non-volatile write cache, I think you can change the kernel's view of
the device by writing to this entry (you could also create a udev rule
for this).
This way you can set the write cache as write through. This will
eliminate the cache flushes issued by the kernel, without altering the
device state (Documentation/block/queue-sysfs.rst).

Nikos
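Concretely, inspecting and, where the cache is known to be non-volatile, overriding the kernel's view looks like this (the device name sda is just an example; the existence check keeps the snippet harmless on systems without that device):

```shell
dev=sda    # example device name
wc_attr=/sys/block/$dev/queue/write_cache

# Inspect how the kernel views the device's write cache:
if [ -e "$wc_attr" ]; then
    cat "$wc_attr"
fi

# Only if the cache is known to be battery-backed / non-volatile:
# override the kernel's view so it stops issuing flushes. This does NOT
# change the device's own cache mode (see queue-sysfs.rst caveats).
if [ -e "$wc_attr" ]; then
    echo "write through" > "$wc_attr"
fi
```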
On Thu, Dec 05 2019 at 10:31am -0500,
Nikos Tsironis <ntsironis@arrikto.com> wrote:

> For devices without a volatile write-back cache block core strips off
> the REQ_PREFLUSH and REQ_FUA bits from requests with a payload and
> completes empty REQ_PREFLUSH requests before entering the driver.
>
> [...]
>
> If I am not mistaken, it all depends on whether the underlying device
> reports the existence of a write back cache or not.

Yes, thanks for confirming my memory of the situation.

> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
>
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
>
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Not delved into this aspect of Linux's capabilities but it strikes me as
"dangerous" to twiddle device capabilities like this. Best to fix the
driver to properly expose the cache (or not, as the case may be). It
should also be noted that with DM, the capabilities are stacked up at
device creation time. So any changes to the underlying devices will
_not_ be reflected to the high level DM device.

Mike
On 12/5/19 5:42 PM, Mike Snitzer wrote:
> On Thu, Dec 05 2019 at 10:31am -0500,
> Nikos Tsironis <ntsironis@arrikto.com> wrote:
>
>> [...]
>>
>> This way you can set the write cache as write through. This will
>> eliminate the cache flushes issued by the kernel, without altering the
>> device state (Documentation/block/queue-sysfs.rst).
>
> Not delved into this aspect of Linux's capabilities but it strikes me as
> "dangerous" to twiddle device capabilities like this. Best to fix the
> driver to properly expose the cache (or not, as the case may be). It
> should also be noted that with DM, the capabilities are stacked up at
> device creation time. So any changes to the underlying devices will
> _not_ be reflected to the high level DM device.
>

Yes, I agree completely. The queue-sysfs doc also mentions that it's not
safe to do that. I just mentioned it for completeness.

As far as DM is concerned, you are right. You would have to deactivate
and reactivate all DM devices for the change to propagate to upper
layers. That's why I mentioned udev, because that way the change will be
made to the lower level device when its queue is first created and it
will be properly propagated to upper layers.

But, again, I agree that this is not something safe to do and it's
better to make sure the driver properly exposes the cache capabilities,
as you said.

Nikos
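A sketch of the udev approach mentioned above, assuming a hypothetical disk sdb backed by a battery-backed (non-volatile) cache. The rule file name and device match are illustrative; in practice the file would be installed under /etc/udev/rules.d/ (it is written to the current directory here purely for illustration):

```shell
# Hypothetical udev rule: pin the kernel's cache view at queue creation
# time, so DM devices stacked on top of sdb later inherit the setting.
# ATTR{...}="..." in an assignment writes the sysfs attribute.
cat > 99-write-cache.rules <<'EOF'
# Assumed: sdb sits behind a battery-backed (non-volatile) cache.
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/write_cache}="write through"
EOF

cat 99-write-cache.rules
```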
On Thu, 5 Dec 2019, Nikos Tsironis wrote:

> [...]
>
> You could check this by looking at /sys/block/<device>/queue/write_cache
> If it says "write back" then flushes will be issued.
>
> In case the sysfs entry reports a "write back" cache for a device with a
> non-volatile write cache, I think you can change the kernel's view of
> the device by writing to this entry (you could also create a udev rule
> for this).
>
> This way you can set the write cache as write through. This will
> eliminate the cache flushes issued by the kernel, without altering the
> device state (Documentation/block/queue-sysfs.rst).

Interesting, I'll remember that. I think this is a documentation bug;
isn't this backwards:

	'This means that it might not be safe to toggle the setting from
	"write back" to "write through", since that will also eliminate
	cache flushes issued by the kernel.'
	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]

How does this work with stacking blockdevs? Does it inherit from the
lower-level dev? If an upper level is misconfigured, would a writeback
at higher levels clear the flush for lower levels?

--
Eric Wheeler
On 12/6/19 12:34 AM, Eric Wheeler wrote:
> On Thu, 5 Dec 2019, Nikos Tsironis wrote:
>> [...]
>>
>> This way you can set the write cache as write through. This will
>> eliminate the cache flushes issued by the kernel, without altering the
>> device state (Documentation/block/queue-sysfs.rst).
>
> Interesting, I'll remember that. I think this is a documentation bug;
> isn't this backwards:
>
> 	'This means that it might not be safe to toggle the setting from
> 	"write back" to "write through", since that will also eliminate
> 	cache flushes issued by the kernel.'
> 	[https://www.kernel.org/doc/Documentation/block/queue-sysfs.rst]
>

If a device has a volatile cache then the write_cache sysfs entry will
be "write back" and we have to issue flushes to the device. In all other
cases write_cache will be "write through".

It's not safe to toggle write_cache from "write back" to "write through"
because this stops the kernel from sending flushes to the device, but
the device will continue caching the writes. So, in case something goes
wrong, you might lose your writes or end up with some kind of
corruption.

> How does this work with stacking blockdevs? Does it inherit from the
> lower-level dev? If an upper level is misconfigured, would a writeback
> at higher levels clear the flush for lower levels?
>

As Mike already mentioned in another reply to this thread, the device
capabilities are stacked up when each device is created and are
inherited from its component devices.

The logic for device stacking is implemented in various functions in
block/blk-settings.c (blk_set_stacking_limits(), blk_stack_limits(),
etc.), which are also used by DM core in dm-table.c to set the
capabilities of DM devices.

If an upper layer device reports a "write back" cache then flushes will
be issued to it by the kernel, no matter what the capabilities of the
underlying devices are.

Normally an upper layer device would report a "write back" cache if at
least one underlying device supports flushes. But, some DM devices
report a "write back" cache irrespective of the underlying devices,
e.g., dm-thin, dm-clone, dm-cache. This is required so they can flush
their own metadata. They then pass the flush request down to the
underlying device and rely on block core to do the right thing: either
actually send the flush to the device, if it has a volatile cache, or
complete it immediately.

Nikos
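As a rough illustration (plain Python, not kernel code; the function names are invented), the stacking and flush-filtering behaviour described above amounts to:

```python
# Minimal model of the two behaviours discussed in this thread:
#  1. a stacked device advertises a write-back cache if any component
#     does (the blk_stack_limits() idea for the QUEUE_FLAG_WC flag);
#  2. block core completes an empty flush immediately unless the queue
#     advertises a write-back cache (the generic_make_request_checks()
#     filtering quoted earlier).
# Function names are invented for illustration.

def stack_write_cache(component_caches):
    """The stacked device is 'write back' if any component is."""
    return "write back" if "write back" in component_caches else "write through"

def submit_empty_flush(device_cache):
    """Empty REQ_PREFLUSH handling for a device with the given cache view."""
    if device_cache == "write back":
        return "flush issued to device"
    return "completed immediately (BLK_STS_OK)"

# A DM device stacked on one write-back and one write-through disk
# inherits 'write back', so flushes reach it and are passed down:
top = stack_write_cache(["write back", "write through"])
print(top)                        # write back
print(submit_empty_flush(top))    # flush issued to device

# A device viewed as write through gets its flushes completed early:
print(submit_empty_flush("write through"))
```

Note that this only models the normal inheritance case; as described above, targets like dm-thin advertise a write-back cache unconditionally and rely on this same filtering at the lower layers.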
On Fri, 6 Dec 2019, Nikos Tsironis wrote:

> [...]
>
> If a device has a volatile cache then the write_cache sysfs entry will
> be "write back" and we have to issue flushes to the device. In all other
> cases write_cache will be "write through".

Forgive my misunderstanding, but if I have a RAID controller with a
cache and BBU, with the RAID volume set to write-back mode in the
controller, are you saying that the sysfs entry should show "write
through"? I had always understood that it was safe to disable flushes
with a non-volatile cache, and a non-volatile cache is called a
write-back cache.

It is strange to me that this terminology in the kernel would be
backwards from how it is expressed in a RAID controller. Incidentally, I
have an Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in
write-back mode and the second volume is write-through. In both cases
sysfs reports "write through":

	[root@hv1-he ~]# cat /sys/block/sda/queue/write_cache
	write through
	[root@hv1-he ~]# cat /sys/block/sdb/queue/write_cache
	write through

This is running 4.19.75, so we can at least say that the 9460 does not
support proper representation of the VD cache mode in sysfs, but which
is correct? Should it not be that the sysfs entry reports the same cache
mode as the RAID controller?

-Eric
On 12/6/19 10:06 PM, Eric Wheeler wrote:
> On Fri, 6 Dec 2019, Nikos Tsironis wrote:
>> [...]
>>
>> If a device has a volatile cache then the write_cache sysfs entry will
>> be "write back" and we have to issue flushes to the device. In all other
>> cases write_cache will be "write through".
>
> Forgive my misunderstanding, but if I have a RAID controller with a
> cache and BBU, with the RAID volume set to write-back mode in the
> controller, are you saying that the sysfs entry should show "write
> through"? I had always understood that it was safe to disable flushes
> with a non-volatile cache, and a non-volatile cache is called a
> write-back cache.
>

From the device perspective, a non-volatile cache operating in
write-back mode is indeed called a write-back cache.

But, from the OS perspective, a non-volatile cache (whether it operates
in write-back or write-through mode) is, for all intents and purposes,
equivalent to a write-through cache: when the device acknowledges a
write it's guaranteed that the written data won't be lost in case of
power loss.

So, in the case of a controller with a BBU and/or a non-volatile cache,
you don't care what the device does internally. All that matters is that
acked writes won't be lost in case of power failure.

I believe that the sysfs entry reports exactly that: whether the kernel
should treat the device as having a volatile write-back cache, so we
have to issue flushes to ensure the data are properly persisted, or as
having no cache or a write-through cache, so flushes are not necessary.

> It is strange to me that this terminology in the kernel would be
> backwards from how it is expressed in a RAID controller. Incidentally, I
> have an Avago MegaRAID 9460 with 2 volumes. The first volume (sda) is in
> write-back mode and the second volume is write-through. In both cases
> sysfs reports "write through".
>
> This is running 4.19.75, so we can at least say that the 9460 does not
> support proper representation of the VD cache mode in sysfs, but which
> is correct? Should it not be that the sysfs entry reports the same cache
> mode as the RAID controller?
>

My guess is that the controller reports to the kernel that it has a
write-through cache (or no cache at all) on purpose, to avoid
unnecessary flushes. Since it can ensure the persistence of acked writes
by other means, e.g., a BBU unit, as far as the kernel is concerned the
device can be treated as one with a write-through cache.

Moreover, I think that MegaRAID controllers, in the default write-back
mode, automatically switch the write policy to write-through if the BBU
is low, has failed or is being charged. So, I think it makes sense to
report to the kernel that the device has a write-through cache, even
though internally the device operates the cache in write-back mode.

Nikos