
[1/4] generic: require discard zero behavior for dmlogwrites on XFS

Message ID 20200826143815.360002-2-bfoster@redhat.com (mailing list archive)
State Deferred, archived
Series fix up generic dmlogwrites tests to work with XFS

Commit Message

Brian Foster Aug. 26, 2020, 2:38 p.m. UTC
Several generic fstests use dm-log-writes to test the filesystem for
consistency at various crash recovery points. dm-log-writes and the
associated replay mechanism rely on zeroing via discard to clear
stale blocks when moving to various points in time of the fs. If the
storage doesn't provide zeroing or the discard requests exceed the
hardcoded maximum (128MB) of the fallback solution to physically
write zeroes, stale blocks are left around in the target fs. This
scheme is known to cause issues on XFS v5 superblocks if recovery
observes metadata from a future variant of an fs that has been
replayed to an older point in time. This corrupts the filesystem and
leads to false test failures.

generic/482 already works around this problem by using a thin volume
as the target device, which provides consistent and efficient
discard zeroing behavior, but other tests have seen similar issues
on XFS. Add an XFS specific check to the dmlogwrites init time code
that requires discard zeroing support and otherwise skips the test
to avoid false positive failures.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 common/dmlogwrites | 10 ++++++++--
 common/rc          | 14 ++++++++++++++
 tests/generic/470  |  2 +-
 3 files changed, 23 insertions(+), 3 deletions(-)
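
For context, the replay step these tests depend on looks roughly like
this (a sketch adapted from the _log_writes_replay_log() helper in
common/dmlogwrites; the marker name is illustrative):

  # replay the recorded write log onto the target device up to a mark;
  # discards recorded in the log must reliably zero the target for the
  # result to be an exact point-in-time image of the fs
  $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
  	--replay $blkdev --end-mark $marker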

Comments

Amir Goldstein Aug. 27, 2020, 6:58 a.m. UTC | #1
On Wed, Aug 26, 2020 at 5:38 PM Brian Foster <bfoster@redhat.com> wrote:
>
> Several generic fstests use dm-log-writes to test the filesystem for
> consistency at various crash recovery points. dm-log-writes and the
> associated replay mechanism rely on zeroing via discard to clear
> stale blocks when moving to various points in time of the fs. If the
> storage doesn't provide zeroing or the discard requests exceed the
> hardcoded maximum (128MB) of the fallback solution to physically
> write zeroes, stale blocks are left around in the target fs. This
> scheme is known to cause issues on XFS v5 superblocks if recovery
> observes metadata from a future variant of an fs that has been
> replayed to an older point in time. This corrupts the filesystem and
> leads to false test failures.
>
> generic/482 already works around this problem by using a thin volume
> as the target device, which provides consistent and efficient
> discard zeroing behavior, but other tests have seen similar issues
> on XFS. Add an XFS specific check to the dmlogwrites init time code
> that requires discard zeroing support and otherwise skips the test
> to avoid false positive failures.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  common/dmlogwrites | 10 ++++++++--
>  common/rc          | 14 ++++++++++++++
>  tests/generic/470  |  2 +-
>  3 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 573f4b8a..92cc6ce2 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -43,9 +43,10 @@ _require_log_writes_dax_mountopt()
>         _require_test_program "log-writes/replay-log"
>
>         local ret=0
> -       local mountopt=$1
> +       local dev=$1
> +       local mountopt=$2
>
> -       _log_writes_init $SCRATCH_DEV
> +       _log_writes_init $dev
>         _log_writes_mkfs > /dev/null 2>&1
>         _log_writes_mount "-o $mountopt" > /dev/null 2>&1
>         # Check options to be sure.
> @@ -66,6 +67,11 @@ _log_writes_init()
>         [ -z "$blkdev" ] && _fail \
>         "block dev must be specified for _log_writes_init"
>
> +       # XFS requires discard zeroing support on the target device to work
> +       # reliably with dm-log-writes. Use dm-thin devices in tests that want
> +       # to provide reliable discard zeroing support.
> +       [ $FSTYP == "xfs" ] && _require_discard_zeroes $blkdev
> +

I imagine that ext4 could also be burned by this.
Do we have a reason to limit this requirement to xfs?
I prefer to make it generic.

>         local BLK_DEV_SIZE=`blockdev --getsz $blkdev`
>         LOGWRITES_NAME=logwrites-test
>         LOGWRITES_DMDEV=/dev/mapper/$LOGWRITES_NAME
> diff --git a/common/rc b/common/rc
> index aa5a7409..fedb5221 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -4313,6 +4313,20 @@ _require_mknod()
>         rm -f $TEST_DIR/$seq.null
>  }
>
> +# check that discard is supported and subsequent reads return zeroes
> +_require_discard_zeroes()
> +{
> +       local dev=$1
> +
> +       _require_command "$BLKDISCARD_PROG" blkdiscard
> +
> +       $XFS_IO_PROG -c "pwrite -S 0xcd 0 4k" $dev > /dev/null 2>&1 ||
> +               _fail "write error"
> +       $BLKDISCARD_PROG -o 0 -l 1m $dev || _notrun "no discard support"
> +       hexdump -n 4096 $dev | head -n 1 | grep cdcd &&
> +               _notrun "no discard zeroing support"
> +}
> +

I am fine with your solution, but if there was a discussion on the best way to
solve the problem, I missed it, so I would like to hear what Christoph has to say.

Thanks,
Amir.
Christoph Hellwig Aug. 27, 2020, 7:02 a.m. UTC | #2
On Thu, Aug 27, 2020 at 09:58:09AM +0300, Amir Goldstein wrote:
> I imagine that ext4 could also be burned by this.
> Do we have a reason to limit this requirement to xfs?
> I prefer to make it generic.

The whole idea that discard zeroes data is just completely broken.
Discard is a hint that is explicitly allowed to do nothing.
Amir Goldstein Aug. 27, 2020, 7:29 a.m. UTC | #3
On Thu, Aug 27, 2020 at 10:02 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Aug 27, 2020 at 09:58:09AM +0300, Amir Goldstein wrote:
> > I imagine that ext4 could also be burned by this.
> > Do we have a reason to limit this requirement to xfs?
> > I prefer to make it generic.
>
> The whole idea that discard zeroes data is just completely broken.
> Discard is a hint that is explicitly allowed to do nothing.

I figured you'd say something like that :)
but since we are talking about dm-thin as a solution for predictable
behavior at the moment, and this sanity check helps avoid adding
new tests that can fail to some extent, is the proposed bandaid good enough
to keep those tests alive until a better solution is proposed?

Another concern that I have is that perhaps adding dm-thin to the fsx tests
would change the timing of I/O in a subtle way and hide the original bugs that they
caught:

47c7d0b19502 ("xfs: fix incorrect log_flushed on fsync")
3af423b03435 ("xfs: evict CoW fork extents when performing finsert/fcollapse")
51e3ae81ec58 ("ext4: fix interaction between i_size, fallocate, and
delalloc after a crash")

Because at the time I (re)wrote Josef's fsx tests, they flushed out the bugs
on spinning rust much more frequently than on SSD.

So Brian, if you could verify that the fsx tests still catch the original bugs
by reverting the fixes or running with an old kernel, preferably on a
slow disk, that would be great.
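
A rough sketch of that verification, with placeholder paths, assuming
the reverted kernel is rebuilt and booted in between:

  # revert one of the fixes listed above in the kernel tree...
  git -C ~/src/linux revert --no-edit 47c7d0b19502
  # ...rebuild and boot that kernel, then check whether the fsx-based
  # dm-log-writes tests catch the regression again
  cd ~/src/xfstests && ./check generic/455 generic/457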

I know it is not a simple request, but I just don't know how else to trust
these changes. I guess a quick way for you to avoid the hassle is to add
_require_discard_zeroes (patch #1) and drop the rest.

Thanks,
Amir.
Christoph Hellwig Aug. 27, 2020, 7:37 a.m. UTC | #4
On Thu, Aug 27, 2020 at 10:29:05AM +0300, Amir Goldstein wrote:
> I figured you'd say something like that :)
> but since we are talking about dm-thin as a solution for predictable
> behavior at the moment and this sanity check helps avoiding adding
> new tests that can fail to some extent, is the proposed bandaid good enough
> to keep those tests alive until a better solution is proposed?

Well, the problem is that a test that wants to reliably nuke data needs
to... *drumroll* reliably nuke data.  Which means zeroing, or at least
a known pattern.  Discard doesn't give you that.

I don't see how a plain discard is going to work for any file system
for that particular case.
Christoph Hellwig Aug. 27, 2020, 7:38 a.m. UTC | #5
On Wed, Aug 26, 2020 at 10:38:12AM -0400, Brian Foster wrote:
> Several generic fstests use dm-log-writes to test the filesystem for
> consistency at various crash recovery points. dm-log-writes and the
> associated replay mechanism rely on zeroing via discard to clear
> stale blocks when moving to various points in time of the fs. If the
> storage doesn't provide zeroing or the discard requests exceed the
> hardcoded maximum (128MB) of the fallback solution to physically
> write zeroes, stale blocks are left around in the target fs. This
> scheme is known to cause issues on XFS v5 superblocks if recovery
> observes metadata from a future variant of an fs that has been
> replayed to an older point in time. This corrupts the filesystem and
> leads to false test failures.
> 
> generic/482 already works around this problem by using a thin volume
> as the target device, which provides consistent and efficient
> discard zeroing behavior, but other tests have seen similar issues
> on XFS. Add an XFS specific check to the dmlogwrites init time code
> that requires discard zeroing support and otherwise skips the test
> to avoid false positive failures.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  common/dmlogwrites | 10 ++++++++--
>  common/rc          | 14 ++++++++++++++
>  tests/generic/470  |  2 +-
>  3 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 573f4b8a..92cc6ce2 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -43,9 +43,10 @@ _require_log_writes_dax_mountopt()
>  	_require_test_program "log-writes/replay-log"
>  
>  	local ret=0
> -	local mountopt=$1
> +	local dev=$1
> +	local mountopt=$2
>  
> -	_log_writes_init $SCRATCH_DEV
> +	_log_writes_init $dev
>  	_log_writes_mkfs > /dev/null 2>&1
>  	_log_writes_mount "-o $mountopt" > /dev/null 2>&1
>  	# Check options to be sure.
> @@ -66,6 +67,11 @@ _log_writes_init()
>  	[ -z "$blkdev" ] && _fail \
>  	"block dev must be specified for _log_writes_init"
>  
> +	# XFS requires discard zeroing support on the target device to work
> +	# reliably with dm-log-writes. Use dm-thin devices in tests that want
> +	# to provide reliable discard zeroing support.
> +	[ $FSTYP == "xfs" ] && _require_discard_zeroes $blkdev
> +
>  	local BLK_DEV_SIZE=`blockdev --getsz $blkdev`
>  	LOGWRITES_NAME=logwrites-test
>  	LOGWRITES_DMDEV=/dev/mapper/$LOGWRITES_NAME
> diff --git a/common/rc b/common/rc
> index aa5a7409..fedb5221 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -4313,6 +4313,20 @@ _require_mknod()
>  	rm -f $TEST_DIR/$seq.null
>  }
>  
> +# check that discard is supported and subsequent reads return zeroes
> +_require_discard_zeroes()
> +{
> +	local dev=$1
> +
> +	_require_command "$BLKDISCARD_PROG" blkdiscard
> +
> +	$XFS_IO_PROG -c "pwrite -S 0xcd 0 4k" $dev > /dev/null 2>&1 ||
> +		_fail "write error"
> +	$BLKDISCARD_PROG -o 0 -l 1m $dev || _notrun "no discard support"
> +	hexdump -n 4096 $dev | head -n 1 | grep cdcd &&
> +		_notrun "no discard zeroing support"
> +}

This test is bogus.  A discard may zero parts of the request or all
of it.  It may decide to zero based on the LBA, the physical block
number in the SSD, the phase of the moon, or a random number generator.
Brian Foster Aug. 27, 2020, 2:11 p.m. UTC | #6
On Thu, Aug 27, 2020 at 10:29:05AM +0300, Amir Goldstein wrote:
> On Thu, Aug 27, 2020 at 10:02 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Aug 27, 2020 at 09:58:09AM +0300, Amir Goldstein wrote:
> > > I imagine that ext4 could also be burned by this.
> > > Do we have a reason to limit this requirement to xfs?
> > > I prefer to make it generic.
> >
> > The whole idea that discard zeroes data is just completely broken.
> > Discard is a hint that is explicitly allowed to do nothing.
> 
> I figured you'd say something like that :)
> but since we are talking about dm-thin as a solution for predictable
> behavior at the moment and this sanity check helps avoiding adding
> new tests that can fail to some extent, is the proposed bandaid good enough
> to keep those tests alive until a better solution is proposed?
> 
> Another concern that I have is that perhaps adding dm-thin to the fsx tests
> would change timing of io in a subtle way and hide the original bugs that they
> caught:
> 
> 47c7d0b19502 ("xfs: fix incorrect log_flushed on fsync")
> 3af423b03435 ("xfs: evict CoW fork extents when performing finsert/fcollapse")
> 51e3ae81ec58 ("ext4: fix interaction between i_size, fallocate, and
> delalloc after a crash")
> 
> Because at the time I (re)wrote Josef's fsx tests, they flushed out the bugs
> on spinning rust much more frequently than on SSD.
> 
> So Brian, if you could verify that the fsx tests still catch the original bugs
> by reverting the fixes or running with an old kernel and probably running
> on a slow disk that would be great.
> 

I made these particular changes because it's how we previously addressed
generic/482 and they were a pretty clear and quick way to address the
problem. I'm pretty sure I've seen these tests reproduce legitimate
issues with or without thin, but that's from memory so I can't confirm
that with certainty or suggest that applies for every regression these
have discovered in the past.

If we want to go down that path, I'd say let's just assume those no
longer reproduce. That doesn't make these tests any less broken on XFS
(v5, at least) today, so I guess I'd drop the thin changes (note again
that generic/482 is already using dmthinp) and just disable these tests
on XFS until we can come up with an acceptable solution to make them
more reliable. That's slightly unfortunate because I think the tests are
quite effective, but we're continuing to see spurious failures that are
not trivial to diagnose.

IIRC, I did at one point experiment with removing the (128M?) physical
zeroing limit from the associated logwrites tool, but that somewhat
predictably caused the test to take an extremely long time. I'm not sure
I even allowed it to finish, tbh. Anyway, some options are to allow a
larger physical zeroing limit and remove the tests from the auto group,
use smaller target devices to reduce the amount of zeroing required,
require the user to already have a thinp or similar volume as a scratch
dev, or otherwise find some more efficient zeroing mechanism.

> I know it is not a simple request, but I just don't know how else to trust
> these changes. I guess a quick way for you to avoid the hassle is to add
> _require_discard_zeroes (patch #1) and drop the rest.
> 

That was going to be my fallback suggestion if the changes to the tests
were problematic for whatever reason. Christoph points out that the
discard zeroing logic is not predictable in general as it might be on
dm-thinp, so I think for now we should probably just notrun these on
XFS. I'll send something like that as a v2 of patch 1.

Brian

> Thanks,
> Amir.
>
Josef Bacik Aug. 27, 2020, 3:57 p.m. UTC | #7
On 8/27/20 3:37 AM, Christoph Hellwig wrote:
> On Thu, Aug 27, 2020 at 10:29:05AM +0300, Amir Goldstein wrote:
>> I figured you'd say something like that :)
>> but since we are talking about dm-thin as a solution for predictable
>> behavior at the moment and this sanity check helps avoiding adding
>> new tests that can fail to some extent, is the proposed bandaid good enough
>> to keep those tests alive until a better solution is proposed?
> 
> Well, the problem is that a test that wants to reliable nuke data needs
> to... *drumroll* reliably nuke data.  Which means zeroing or at least
> a known pattern.  discard doesn't give you that.
> 
> I don't see how a plain discard is going to work for any file system
> for that particular case.
> 

This sort of brings up a good point: the whole point of DISCARD support 
in log-writes was to expose problems where we may have been discarding 
real data we cared about, hence adding the forced zero'ing stuff for 
devices that didn't support discard.  But that made the incorrect 
assumption that a drive with actual discard support would actually 
return 0's for discarded data.  That assumption was based on hardware 
that did actually do that, but now we live in the brave new world of 
significantly shittier drives.  Does dm-thinp reliably unmap the ranges 
we discard, and thus give us this zero'ing behavior?  Because we might 
as well just use that for everything so log-writes doesn't have to 
resort to pwrite()'ing zeros everywhere.  Thanks,

Josef
Christoph Hellwig Aug. 27, 2020, 5:02 p.m. UTC | #8
On Thu, Aug 27, 2020 at 11:57:03AM -0400, Josef Bacik wrote:
> This sort of brings up a good point, the whole point of DISCARD support in
> log-writes was to expose problems where we may have been discarding real
> data we cared about, hence adding the forced zero'ing stuff for devices that
> didn't support discard.  But that made the incorrect assumption that a drive
> with actual discard support would actually return 0's for discarded data.
> That assumption was based on hardware that did actually do that, but now we
> live in the brave new world of significantly shittier drives.  Does dm-thinp
> reliably unmap the ranges we discard, and thus give us this zero'ing
> behavior?  Because we might as well just use that for everything so
> log-writes doesn't have to resort to pwrite()'ing zeros everywhere.  Thanks,

We have a write zeroes operation in the block layer.  For some devices
this is as efficient as discard, and that should (I think) include dm.
Brian Foster Aug. 27, 2020, 6:35 p.m. UTC | #9
On Thu, Aug 27, 2020 at 06:02:42PM +0100, Christoph Hellwig wrote:
> On Thu, Aug 27, 2020 at 11:57:03AM -0400, Josef Bacik wrote:
> > This sort of brings up a good point, the whole point of DISCARD support in
> > log-writes was to expose problems where we may have been discarding real
> > data we cared about, hence adding the forced zero'ing stuff for devices that
> > didn't support discard.  But that made the incorrect assumption that a drive
> > with actual discard support would actually return 0's for discarded data.
> > That assumption was based on hardware that did actually do that, but now we
> > live in the brave new world of significantly shittier drives.  Does dm-thinp
> > reliably unmap the ranges we discard, and thus give us this zero'ing
> > behavior?  Because we might as well just use that for everything so
> > log-writes doesn't have to resort to pwrite()'ing zeros everywhere.  Thanks,
> 

That's pretty much what this series does. It only modifies the generic
tests because I didn't want to mess with the others (all btrfs, I think)
that might not have any issues, but I wouldn't be opposed to burying the
logic into the dmlogwrites bits so it just always creates a thin volume
behind the scenes. If we were going to do that, I'd prefer to do it as a
follow up to these patches (dropping patch 1, most likely) so at least
they can remain enabled on XFS for the time being.

OTOH, perhaps the thinp behavior could be internal, but conditional
based on XFS. It's not really clear to me if this problem is more of an
XFS phenomenon or just that XFS happens to have some unique recovery
checking logic that explicitly detects it. It seems more like the
latter, but I don't know enough about ext4 or btrfs to say..

> We have a write zeroes operation in the block layer.  For some devices
> this is as efficient as discard, and that should (I think) dm.
> 

Do you mean BLKZEROOUT? I see that is more efficient than writing zeroes
from userspace, but I don't think it's efficient enough to solve this
problem. It takes about 3m to manually zero my 15GB lvm (dm-linear)
scratch device on my test vm via dd using sync writes. A 'blkdiscard -z'
saves me about half that time, but IIRC this is an operation that would
occur every time the logwrites device is replayed to a particular
recovery point (which can happen many times per test).
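
A sketch of that comparison, with an illustrative device path:

  dev=/dev/mapper/test-scratch                   # ~15GB dm-linear device
  time dd if=/dev/zero of=$dev bs=1M oflag=sync  # manual zeroing, ~3m here
  time blkdiscard -z $dev                        # BLKZEROOUT, ~half that

Note that 'blkdiscard -z' issues the BLKZEROOUT ioctl rather than a
discard.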

Brian
Brian Foster Aug. 28, 2020, 2:10 p.m. UTC | #10
Re-add lists to CC.

On Fri, Aug 28, 2020 at 06:47:06AM +0300, Amir Goldstein wrote:
> On Thu, Aug 27, 2020, 5:11 PM Brian Foster <bfoster@redhat.com> wrote:
> 
> > On Thu, Aug 27, 2020 at 10:29:05AM +0300, Amir Goldstein wrote:
> > > On Thu, Aug 27, 2020 at 10:02 AM Christoph Hellwig <hch@infradead.org>
> > wrote:
> > > >
> > > > On Thu, Aug 27, 2020 at 09:58:09AM +0300, Amir Goldstein wrote:
> > > > > I imagine that ext4 could also be burned by this.
> > > > > Do we have a reason to limit this requirement to xfs?
> > > > > I prefer to make it generic.
> > > >
> > > > The whole idea that discard zeroes data is just completely broken.
> > > > Discard is a hint that is explicitly allowed to do nothing.
> > >
> > > I figured you'd say something like that :)
> > > but since we are talking about dm-thin as a solution for predictable
> > > behavior at the moment and this sanity check helps avoiding adding
> > > new tests that can fail to some extent, is the proposed bandaid good
> > enough
> > > to keep those tests alive until a better solution is proposed?
> > >
> > > Another concern that I have is that perhaps adding dm-thin to the fsx
> > tests
> > > would change timing of io in a subtle way and hide the original bugs
> > that they
> > > caught:
> > >
> > > 47c7d0b19502 ("xfs: fix incorrect log_flushed on fsync")
> > > 3af423b03435 ("xfs: evict CoW fork extents when performing
> > finsert/fcollapse")
> > > 51e3ae81ec58 ("ext4: fix interaction between i_size, fallocate, and
> > > delalloc after a crash")
> > >
> > > Because at the time I (re)wrote Josef's fsx tests, they flushed out the
> > bugs
> > > on spinning rust much more frequently than on SSD.
> > >
> > > So Brian, if you could verify that the fsx tests still catch the
> > original bugs
> > > by reverting the fixes or running with an old kernel and probably running
> > > on a slow disk that would be great.
> > >
> >
> > I made these particular changes because it's how we previously addressed
> > generic/482 and they were a pretty clear and quick way to address the
> > problem. I'm pretty sure I've seen these tests reproduce legitimate
> > issues with or without thin, but that's from memory so I can't confirm
> > that with certainty or suggest that applies for every regression these
> > have discovered in the past.
> >
> > If we want to go down that path, I'd say lets just assume those no
> > longer reproduce. That doesn't make these tests any less broken on XFS
> > (v5, at least) today, so I guess I'd drop the thin changes (note again
> > that generic/482 is already using dmthinp) and just disable these tests
> > on XFS until we can come up with an acceptable solution to make them
> > more reliable. That's slightly unfortunate because I think the tests are
> > quite effective, but we're continuing to see spurious failures that are
> > not trivial to diagnose.
> 
> 
> 
> > IIRC, I did at one point experiment with removing the (128M?) physical
> > zeroing limit from the associated logwrites tool, but that somewhat
> > predictably caused the test to take an extremely long time. I'm not sure
> > I even allowed it to finish, tbh. Anyways, perhaps some options are to
> > allow a larger physical zeroing limit and remove the tests from the auto
> > group, use smaller target devices to reduce the amount of zeroing
> > required, require the user to have a thinp or some such volume as a
> > scratch dev already or otherwise find some other more efficient zeroing
> > mechanism.
> >
> > > I know it is not a simple request, but I just don't know how else to
> > trust
> > > these changes. I guess a quick way for you to avoid the hassle is to add
> > > _require_discard_zeroes (patch #1) and drop the rest.
> > >
> >
> > That was going to be my fallback suggestion if the changes to the tests
> > were problematic for whatever reason. Christoph points out that the
> > discard zeroing logic is not predictable in general as it might be on
> > dm-thinp, so I think for now we should probably just notrun these on
> > XFS. I'll send something like that as a v2 of patch 1.
> >
> 
> Maybe there is a better way to stay safe and not sorry.
> 
> Can we use dm thinp only for replay but not for fsx?
> I suppose reliable zeroing is only needed in replay?
> 

I think that would work, but might clutter or complicate the tests by
using multiple devices for I/O vs. replay. That kind of strikes me as
overkill given that two of these run multiple instances of fsx (which
has a fixed size file) and the other looks like it runs a simple xfs_io
command with a fixed amount of I/O. Ironically, generic/482 seems like
the test with the most I/O overhead (via fsstress), but it's been using
dm-thin for a while now.

I think if we're really that concerned with dm-thin allocation overhead,
we should either configure the cluster size to amortize the cost of it
or just look into using smaller block devices so replay-log falls back
to manual zeroing when discard doesn't work. A quick test of the latter
doesn't show a huge increase in test runtime for 455/457, but I'd also
have to confirm that this works as expected and eliminates the spurious
corruption issue. Of course, a tradeoff could be that if discard does
work but doesn't provide zeroing (which Christoph already explained is
unpredictable), then I think we're still susceptible to the original
problem. Perhaps we could create a mode that simply forces manual
zeroing over discards if there isn't one already...

Brian

> Thanks,
> Amir.
Christoph Hellwig Aug. 29, 2020, 6:44 a.m. UTC | #11
On Thu, Aug 27, 2020 at 11:57:03AM -0400, Josef Bacik wrote:
> > 
> 
> This sort of brings up a good point, the whole point of DISCARD support in
> log-writes was to expose problems where we may have been discarding real
> data we cared about, hence adding the forced zero'ing stuff for devices that
> didn't support discard.  But that made the incorrect assumption that a drive
> with actual discard support would actually return 0's for discarded data.
> That assumption was based on hardware that did actually do that, but now we
> live in the brave new world of significantly shittier drives.  Does dm-thinp
> reliably unmap the ranges we discard, and thus give us this zero'ing
> behavior?  Because we might as well just use that for everything so
> log-writes doesn't have to resort to pwrite()'ing zeros everywhere.  Thanks,

So the right fix is to replace issuing REQ_OP_DISCARD bios with
REQ_OP_WRITE_ZEROES (plus the check for that).  That'll give you the
effective zeroing part if the device supports it, and you can still
fall back to manual zeroing (and blk-lib.c actually has helpers for
that as well).
Christoph Hellwig Aug. 29, 2020, 6:46 a.m. UTC | #12
On Thu, Aug 27, 2020 at 02:35:07PM -0400, Brian Foster wrote:
> OTOH, perhaps the thinp behavior could be internal, but conditional
> based on XFS. It's not really clear to me if this problem is more of an
> XFS phenomenon or just that XFS happens to have some unique recovery
> checking logic that explicitly detects it. It seems more like the
> latter, but I don't know enough about ext4 or btrfs to say..

The way I understand the tests (and Josef's mail seems to confirm that)
is that they use discards to ensure data disappears.  Unfortunately
that's how discard sometimes works, but not all the time.

> > We have a write zeroes operation in the block layer.  For some devices
> > this is as efficient as discard, and that should (I think) dm.
> > 
> 
> Do you mean BLKZEROOUT? I see that is more efficient than writing zeroes
> from userspace, but I don't think it's efficient enough to solve this
> problem. It takes about 3m to manually zero my 15GB lvm (dm-linear)
> scratch device on my test vm via dd using sync writes. A 'blkdiscard -z'
> saves me about half that time, but IIRC this is an operation that would
> occur every time the logwrites device is replayed to a particular
> recovery point (which can happen many times per test).

Are we talking about the block layer interface or the userspace syscall
one?  I thought it was the former, in which case REQ_OP_WRITE_ZEROES
is the interface.  The user interface is harder - you need to use fallocate
on the block device, but the flags are mapped kinda weird:

FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE guarantees you a
REQ_OP_WRITE_ZEROES, but there is a bunch of other variants that include
fallbacks.
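
A minimal userspace sketch of the guaranteed variant, assuming the
target is a block device node; xfs_io's fpunch issues exactly
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:

  dev=/dev/mapper/some-dev               # placeholder device
  size=$(blockdev --getsize64 $dev)
  # intended to map to REQ_OP_WRITE_ZEROES on the block device, per the
  # flag mapping described above
  xfs_io -c "fpunch 0 $size" $dev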
Amir Goldstein Aug. 30, 2020, 1:30 p.m. UTC | #13
On Sat, Aug 29, 2020 at 9:47 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Aug 27, 2020 at 02:35:07PM -0400, Brian Foster wrote:
> > OTOH, perhaps the thinp behavior could be internal, but conditional
> > based on XFS. It's not really clear to me if this problem is more of an
> > XFS phenomenon or just that XFS happens to have some unique recovery
> > checking logic that explicitly detects it. It seems more like the
> > latter, but I don't know enough about ext4 or btrfs to say..
>
> The way I understand the tests (and Josefs mail seems to confirm that)
> is that it uses discards to ensure data disappears.  Unfortunately
> that's only how discard sometimes work, but not all the time.
>

I think we are confusing two slightly different uses of discard in those tests.
One use case is that dm-logwrites records discards in test workloads and then
needs to replay them to simulate the sequence of I/O events up to a crash point.

I think that use case is less interesting because, as Christoph points out,
discard is not reliable, so I think we should get rid of "-o discard" in the
tests - it did not catch any issues that I know of.

But there is another discard in those tests, issued by _log_writes_mkfs
(at least for xfs and ext4). This discard has the side effect of
making sure that a replay from the start to a point in time first wipes
all stale data from the replay block device.

Of course, the problem we encounter is that it does not wipe all stale
data when not running on dm-thinp, which is why I suggested always
using dm-thinp for the replay device, but I can live perfectly well with
Brian's v1 patches, where both workload and replay are done on dm-thinp.

Josef had two variants for those tests: one doing "replay from start"
for every checkpoint, utilizing this discard-as-wipe behavior, and one
flavor that used dm-thinp to take snapshots and replayed from snapshot T
to the next mark.

I remember someone tried converting some of the tests to the snapshot
flavor, but it turned out to be slower, so we left it as is (always replay from
the start).

In conclusion, I *think* the correct fix for the failing tests is:
1. Use dm-thinp for all those tests (as v1 does)
2. In _log_writes_replay_log{,_range}(), start by explicitly
    wiping the dm-thinp volume, either with a hole punch/discard command
    or by re-creating the thin volume, whichever is faster (see the
    sketch below), instead of relying on the replay of the discard
    operations recorded from mkfs, which only sort of worked by accident.
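
A sketch of what (2) could look like, assuming the replay device is a
thin volume; _log_writes_wipe_replay_dev and $LOGWRITES_REPLAY_DEV are
hypothetical names, not existing fstests helpers:

  _log_writes_wipe_replay_dev()
  {
  	# on dm-thinp a whole-device discard unmaps every block, and
  	# reads of unmapped thin blocks return zeroes, so this drops
  	# all stale data before replaying from the start
  	$BLKDISCARD_PROG $LOGWRITES_REPLAY_DEV || \
  		_fail "failed to wipe replay device"
  }

Re-creating the thin volume (dmsetup message ... delete / create_thin)
should achieve the same, at the cost of more device churn.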

Thanks,
Amir.
Brian Foster Aug. 31, 2020, 1:37 p.m. UTC | #14
On Sat, Aug 29, 2020 at 07:46:59AM +0100, Christoph Hellwig wrote:
> On Thu, Aug 27, 2020 at 02:35:07PM -0400, Brian Foster wrote:
> > OTOH, perhaps the thinp behavior could be internal, but conditional
> > based on XFS. It's not really clear to me if this problem is more of an
> > XFS phenomenon or just that XFS happens to have some unique recovery
> > checking logic that explicitly detects it. It seems more like the
> > latter, but I don't know enough about ext4 or btrfs to say..
> 
> The way I understand the tests (and Josefs mail seems to confirm that)
> is that it uses discards to ensure data disappears.  Unfortunately
> that's only how discard sometimes work, but not all the time.
> 

I think Amir's followup describes how the infrastructure uses discard
better than I could. I'm not intimately familiar with how it works, so
my goal was to take the same approach as generic/482 and update the
tests to provide the predictable behavior expected by the
infrastructure. If folks want to revisit all of that to improve the
tests to not rely on discard and break that dependency, that seems like
a fine direction, but it also seems that can come later as improvements
to the broader infrastructure.

> > > We have a write zeroes operation in the block layer.  For some devices
> > > this is as efficient as discard, and that should (I think) dm.
> > > 
> > 
> > Do you mean BLKZEROOUT? I see that is more efficient than writing zeroes
> > from userspace, but I don't think it's efficient enough to solve this
> > problem. It takes about 3m to manually zero my 15GB lvm (dm-linear)
> > scratch device on my test vm via dd using sync writes. A 'blkdiscard -z'
> > saves me about half that time, but IIRC this is an operation that would
> > occur every time the logwrites device is replayed to a particular
> > recovery point (which can happen many times per test).
> 
> Are we talking about the block layer interface or the userspace syscall
> one?  I though it was the former, in which case REQ_OP_WRITE_ZEROES
> is the interface.  User interface is harder - you need to use fallocate
> on the block device, but the flags are mapped kinda weird:
> 
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE guarantees you a
> REQ_OP_WRITE_ZEROES, but there is a bunch of other variants that include
> fallbacks.
> 

I was using the BLKZEROOUT ioctl in my previous test because
fallocate(PUNCH_HOLE|KEEP_SIZE) (zeroing offload) isn't supported on
this device. I see similar results as above with
fallocate(PUNCH_HOLE|KEEP_SIZE) though, which seems to fall back to
__blkdev_issue_zero_pages() in that case.

Brian

Patch

diff --git a/common/dmlogwrites b/common/dmlogwrites
index 573f4b8a..92cc6ce2 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -43,9 +43,10 @@  _require_log_writes_dax_mountopt()
 	_require_test_program "log-writes/replay-log"
 
 	local ret=0
-	local mountopt=$1
+	local dev=$1
+	local mountopt=$2
 
-	_log_writes_init $SCRATCH_DEV
+	_log_writes_init $dev
 	_log_writes_mkfs > /dev/null 2>&1
 	_log_writes_mount "-o $mountopt" > /dev/null 2>&1
 	# Check options to be sure.
@@ -66,6 +67,11 @@  _log_writes_init()
 	[ -z "$blkdev" ] && _fail \
 	"block dev must be specified for _log_writes_init"
 
+	# XFS requires discard zeroing support on the target device to work
+	# reliably with dm-log-writes. Use dm-thin devices in tests that want
+	# to provide reliable discard zeroing support.
+	[ $FSTYP == "xfs" ] && _require_discard_zeroes $blkdev
+
 	local BLK_DEV_SIZE=`blockdev --getsz $blkdev`
 	LOGWRITES_NAME=logwrites-test
 	LOGWRITES_DMDEV=/dev/mapper/$LOGWRITES_NAME
diff --git a/common/rc b/common/rc
index aa5a7409..fedb5221 100644
--- a/common/rc
+++ b/common/rc
@@ -4313,6 +4313,20 @@  _require_mknod()
 	rm -f $TEST_DIR/$seq.null
 }
 
+# check that discard is supported and subsequent reads return zeroes
+_require_discard_zeroes()
+{
+	local dev=$1
+
+	_require_command "$BLKDISCARD_PROG" blkdiscard
+
+	$XFS_IO_PROG -c "pwrite -S 0xcd 0 4k" $dev > /dev/null 2>&1 ||
+		_fail "write error"
+	$BLKDISCARD_PROG -o 0 -l 1m $dev || _notrun "no discard support"
+	hexdump -n 4096 $dev | head -n 1 | grep cdcd &&
+		_notrun "no discard zeroing support"
+}
+
 init_rc
 
 ################################################################################
diff --git a/tests/generic/470 b/tests/generic/470
index fd6da563..707b6237 100755
--- a/tests/generic/470
+++ b/tests/generic/470
@@ -35,7 +35,7 @@  rm -f $seqres.full
 _supported_fs generic
 _supported_os Linux
 _require_scratch
-_require_log_writes_dax_mountopt "dax"
+_require_log_writes_dax_mountopt $SCRATCH_DEV "dax"
 _require_xfs_io_command "mmap" "-S"
 _require_xfs_io_command "log_writes"