diff mbox series

[01/12] generic/757: fix various bugs in this test

Message ID 173197064441.904310.18406008193922603782.stgit@frogsfrogsfrogs (mailing list archive)
State New, archived
Headers show
Series [01/12] generic/757: fix various bugs in this test | expand

Commit Message

Darrick J. Wong Nov. 18, 2024, 11:01 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Fix this test so the check doesn't fail on XFS, and restrict runtime to
100 loops because otherwise this test takes many hours.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 tests/generic/757 |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Zorro Lang Nov. 21, 2024, 9:56 a.m. UTC | #1
On Mon, Nov 18, 2024 at 03:01:38PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Fix this test so the check doesn't fail on XFS, and restrict runtime to
> 100 loops because otherwise this test takes many hours.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  tests/generic/757 |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/tests/generic/757 b/tests/generic/757
> index 0ff5a8ac00182b..9d41975bde07bb 100755
> --- a/tests/generic/757
> +++ b/tests/generic/757
> @@ -63,9 +63,14 @@ prev=$(_log_writes_mark_to_entry_number mkfs)
>  cur=$(_log_writes_find_next_fua $prev)
>  [ -z "$cur" ] && _fail "failed to locate next FUA write"
>  
> -while [ ! -z "$cur" ]; do
> +for ((i = 0; i < 100; i++)); do
>  	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
>  
> +	# xfs_repair won't run if the log is dirty
> +	if [ $FSTYP = "xfs" ]; then
> +		_scratch_mount
> +		_scratch_unmount
> +	fi

Hi Darrick,

I didn't merge this patch last week, due to we were still talking
about the "discards" things:

https://lore.kernel.org/fstests/20241115182821.s3pt4wmkueyjggx3@dell-per750-06-vm-08.rhts.eng.pek2.redhat.com/T/#u

Do you think we need to do a force discards at here, or change the
SCRATCH_DEV to dmthin to support discards?

Thanks,
Zorro

>  	_check_scratch_fs
>  
>  	prev=$cur
>
Christoph Hellwig Nov. 21, 2024, 10:05 a.m. UTC | #2
On Thu, Nov 21, 2024 at 05:56:24PM +0800, Zorro Lang wrote:
> I didn't merge this patch last week, due to we were still talking
> about the "discards" things:
> 
> https://lore.kernel.org/fstests/20241115182821.s3pt4wmkueyjggx3@dell-per750-06-vm-08.rhts.eng.pek2.redhat.com/T/#u
> 
> Do you think we need to do a force discards at here, or change the
> SCRATCH_DEV to dmthin to support discards?

FYI, I'm seeing regular failures with generic/757 when using Darrick's
not yet merged RT rmap support, but only with that.

But the whole discard thing leaves me really confused, and the commit
log in the patch references by the above link doesn't clear that up
either.

Why does dmlogwrites require discard for XFS (and apprently XFS only)?
Note that discard is not required and often does not zero data.  So
if we need data to be zeroed we need to do that explicitly, and
preferably in a way that is obvious.
Christoph Hellwig Nov. 21, 2024, 10:13 a.m. UTC | #3
On Thu, Nov 21, 2024 at 11:05:55AM +0100, Christoph Hellwig wrote:
> But the whole discard thing leaves me really confused, and the commit
> log in the patch references by the above link doesn't clear that up
> either.
> 
> Why does dmlogwrites require discard for XFS (and apprently XFS only)?
> Note that discard is not required and often does not zero data.  So
> if we need data to be zeroed we need to do that explicitly, and
> preferably in a way that is obvious.

Ok, I found the problem.  src/log-writes/log-writes.c does try to
discard to zero, which is broken.  This patch switches it to the
proper zeroout ioctl, which fixes my generic/757 issues, and should
also allow to revert commits to use dm-thin referenced in your link.

diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
index aa53473974d9..cb3ac962574c 100644
--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -31,7 +31,7 @@ static int discard_range(struct log *log, u64 start, u64 len)
 {
 	u64 range[2] = { start, len };
 
-	if (ioctl(log->replayfd, BLKDISCARD, &range) < 0) {
+	if (ioctl(log->replayfd, BLKZEROOUT, &range) < 0) {
 		if (log_writes_verbose)
 			printf("replay device doesn't support discard, "
 			       "switching to writing zeros\n");
Christoph Hellwig Nov. 21, 2024, 10:52 a.m. UTC | #4
On Thu, Nov 21, 2024 at 11:13:25AM +0100, Christoph Hellwig wrote:
> proper zeroout ioctl, which fixes my generic/757 issues, and should

It turns out then when I extrapolate my shortened 10 iteration run
to the 100 currently in the test it would take ~ 30 minutes.  I'm
not sure that's really a reasonable run time for a single test in
the auto group.
Brian Foster Nov. 21, 2024, 12:28 p.m. UTC | #5
On Thu, Nov 21, 2024 at 11:05:55AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 21, 2024 at 05:56:24PM +0800, Zorro Lang wrote:
> > I didn't merge this patch last week, due to we were still talking
> > about the "discards" things:
> > 
> > https://lore.kernel.org/fstests/20241115182821.s3pt4wmkueyjggx3@dell-per750-06-vm-08.rhts.eng.pek2.redhat.com/T/#u
> > 
> > Do you think we need to do a force discards at here, or change the
> > SCRATCH_DEV to dmthin to support discards?
> 
> FYI, I'm seeing regular failures with generic/757 when using Darrick's
> not yet merged RT rmap support, but only with that.
> 
> But the whole discard thing leaves me really confused, and the commit
> log in the patch references by the above link doesn't clear that up
> either.
> 
> Why does dmlogwrites require discard for XFS (and apprently XFS only)?
> Note that discard is not required and often does not zero data.  So
> if we need data to be zeroed we need to do that explicitly, and
> preferably in a way that is obvious.
> 

IIRC it was to accommodate the test program, which presumably used
discard for efficiency reasons because it did a lot of context switching
to different point-in-time variations of the fs. If the discard didn't
actually zero the range (depending on the underlying test dev), then at
least on XFS, we'd see odd recovery issues and whatnot from the fs going
forward/back in time.

Therefore the reason for using dm-thin was that it was an easy way to
provide predictable behavior to the test program, where discards punch
out blocks that subsequently return zeroes.

I don't recall all the specifics, but I thought part of the reason for
using discard over explicit zeroing was the latter made the test
impractically slow. I could be misremembering, but if you want to change
it I'd suggest to at least verify runtimes on some of the preexisting
logwrites tests as well.

Brian
Christoph Hellwig Nov. 21, 2024, 1:12 p.m. UTC | #6
On Thu, Nov 21, 2024 at 07:28:09AM -0500, Brian Foster wrote:
> IIRC it was to accommodate the test program, which presumably used
> discard for efficiency reasons because it did a lot of context switching
> to different point-in-time variations of the fs. If the discard didn't
> actually zero the range (depending on the underlying test dev), then at
> least on XFS, we'd see odd recovery issues and whatnot from the fs going
> forward/back in time.

But the fundamental problem is that discard does not actually zero
data.  It sometimes might, but also often doesn't because it's not
part of the contract.  That's why my patch two switch to the zeroing
ioctl is the right thing.  Note that this doesn't mean explicitly
writing zeroes, it still uses zeroing offloads available in the drivers.

> I don't recall all the specifics, but I thought part of the reason for
> using discard over explicit zeroing was the latter made the test
> impractically slow. I could be misremembering, but if you want to change
> it I'd suggest to at least verify runtimes on some of the preexisting
> logwrites tests as well.

I'm all for speeding up tests.  But relying on a unspecified side effect
of an operation and then requiring a driver that implements that side
effect without documenting that isn't really good practice.
Brian Foster Nov. 21, 2024, 2:11 p.m. UTC | #7
On Thu, Nov 21, 2024 at 02:12:39PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 21, 2024 at 07:28:09AM -0500, Brian Foster wrote:
> > IIRC it was to accommodate the test program, which presumably used
> > discard for efficiency reasons because it did a lot of context switching
> > to different point-in-time variations of the fs. If the discard didn't
> > actually zero the range (depending on the underlying test dev), then at
> > least on XFS, we'd see odd recovery issues and whatnot from the fs going
> > forward/back in time.
> 
> But the fundamental problem is that discard does not actually zero
> data.  It sometimes might, but also often doesn't because it's not
> part of the contract.  That's why my patch two switch to the zeroing
> ioctl is the right thing.  Note that this doesn't mean explicitly
> writing zeroes, it still uses zeroing offloads available in the drivers.
> 

Ok. I get that wrt discard.. I'm saying that's why we inserted dm-thin
into the stack, because it had predictable behavior that fell in line
with this particular design quirk of the higher level test program.

I've no issue with changing it so long as we maintain effectiveness of
the test. I'm not familiar enough with the zeroing offloads in the
drivers you mention to know whether that means this test can still run
quickly in the general case, or only with select hardware..?

> > I don't recall all the specifics, but I thought part of the reason for
> > using discard over explicit zeroing was the latter made the test
> > impractically slow. I could be misremembering, but if you want to change
> > it I'd suggest to at least verify runtimes on some of the preexisting
> > logwrites tests as well.
> 
> I'm all for speeding up tests.  But relying on a unspecified side effect
> of an operation and then requiring a driver that implements that side
> effect without documenting that isn't really good practice.
> 

It's a hack to facilitate test coverage. It would obviously need to be
revisited if behavior changed sufficiently to break the test.

I'm not really sure what you're asking for wrt documentation. A quick
scan of the git history shows the first such commit is 65cc9a235919
("generic/482: use thin volume as data device"), the commit log for
which seems to explain the reasoning.

Brian
Darrick J. Wong Nov. 21, 2024, 4:04 p.m. UTC | #8
On Thu, Nov 21, 2024 at 07:28:09AM -0500, Brian Foster wrote:
> On Thu, Nov 21, 2024 at 11:05:55AM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 21, 2024 at 05:56:24PM +0800, Zorro Lang wrote:
> > > I didn't merge this patch last week, due to we were still talking
> > > about the "discards" things:
> > > 
> > > https://lore.kernel.org/fstests/20241115182821.s3pt4wmkueyjggx3@dell-per750-06-vm-08.rhts.eng.pek2.redhat.com/T/#u
> > > 
> > > Do you think we need to do a force discards at here, or change the
> > > SCRATCH_DEV to dmthin to support discards?
> > 
> > FYI, I'm seeing regular failures with generic/757 when using Darrick's
> > not yet merged RT rmap support, but only with that.
> > 
> > But the whole discard thing leaves me really confused, and the commit
> > log in the patch references by the above link doesn't clear that up
> > either.
> > 
> > Why does dmlogwrites require discard for XFS (and apprently XFS only)?
> > Note that discard is not required and often does not zero data.  So
> > if we need data to be zeroed we need to do that explicitly, and
> > preferably in a way that is obvious.
> > 
> 
> IIRC it was to accommodate the test program, which presumably used
> discard for efficiency reasons because it did a lot of context switching
> to different point-in-time variations of the fs. If the discard didn't
> actually zero the range (depending on the underlying test dev), then at
> least on XFS, we'd see odd recovery issues and whatnot from the fs going
> forward/back in time.

Yes, that's my recollection too -- performing a logwrite replay of an
old mark means that you can end up with blocks with the correct fs uuid
but an LSN that's higher than anything in the log.  Recovery will then
skip the block replay, which is not correct.

I suppose we could fix log recovery to treat incoming block LSNs that
are higher than the log head as if there were no block contents at all.
OTOH going backwards in time isn't usually a concern...right?

> Therefore the reason for using dm-thin was that it was an easy way to
> provide predictable behavior to the test program, where discards punch
> out blocks that subsequently return zeroes.

Yep.  The test needs to reset the block device to a zeroed state.
Discards get us there quickly, but only if discard_zeroes_data==1.
Hence bolting dm-thinp (where this is guaranteed) onto the logwrites
tests.

> I don't recall all the specifics, but I thought part of the reason for
> using discard over explicit zeroing was the latter made the test
> impractically slow. I could be misremembering, but if you want to change
> it I'd suggest to at least verify runtimes on some of the preexisting
> logwrites tests as well.

Not sure -- I think BLKZEROOUT will cause allocations and real disk
writes if we're not careful.

--D

> Brian
> 
>
Darrick J. Wong Nov. 21, 2024, 4:33 p.m. UTC | #9
On Thu, Nov 21, 2024 at 11:52:48AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 21, 2024 at 11:13:25AM +0100, Christoph Hellwig wrote:
> > proper zeroout ioctl, which fixes my generic/757 issues, and should
> 
> It turns out then when I extrapolate my shortened 10 iteration run
> to the 100 currently in the test it would take ~ 30 minutes.  I'm
> not sure that's really a reasonable run time for a single test in
> the auto group.

Yeah, perhaps I should adjust this one to use TIME_FACTOR too?

while _soak_loop_running $((10 * TIME_FACTOR)); do
	# timeconsuming stuff
done

(and then you can SOAK_DURATION=10s to limit the runtime)

--D
Darrick J. Wong Nov. 21, 2024, 5:19 p.m. UTC | #10
On Thu, Nov 21, 2024 at 08:33:55AM -0800, Darrick J. Wong wrote:
> On Thu, Nov 21, 2024 at 11:52:48AM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 21, 2024 at 11:13:25AM +0100, Christoph Hellwig wrote:
> > > proper zeroout ioctl, which fixes my generic/757 issues, and should
> > 
> > It turns out then when I extrapolate my shortened 10 iteration run
> > to the 100 currently in the test it would take ~ 30 minutes.  I'm
> > not sure that's really a reasonable run time for a single test in
> > the auto group.
> 
> Yeah, perhaps I should adjust this one to use TIME_FACTOR too?
> 
> while _soak_loop_running $((10 * TIME_FACTOR)); do
> 	# timeconsuming stuff
> done
> 
> (and then you can SOAK_DURATION=10s to limit the runtime)

Oh, that's with your BLKDISCARD -> BLKZEROOUT change applied, isn't it?
On my system, 100 loops takes 96 seconds with discard and 639 seconds
with zeroout.

So yes, we should shift this test to use thinp.  I was going to say that
we should also make _log_writes_init skip the test if the block device
has discard_zeroes_data==0, but apparently it's now hardwired to zero
and (AFAICT) there's no way to tell if unmap actually zeroes even for
software defined devices like thinp where we *know* that works. :(

--D
Christoph Hellwig Nov. 22, 2024, 12:31 p.m. UTC | #11
On Thu, Nov 21, 2024 at 09:11:56AM -0500, Brian Foster wrote:
> > I'm all for speeding up tests.  But relying on a unspecified side effect
> > of an operation and then requiring a driver that implements that side
> > effect without documenting that isn't really good practice.
> > 
> 
> It's a hack to facilitate test coverage. It would obviously need to be
> revisited if behavior changed sufficiently to break the test.
> 
> I'm not really sure what you're asking for wrt documentation. A quick
> scan of the git history shows the first such commit is 65cc9a235919
> ("generic/482: use thin volume as data device"), the commit log for
> which seems to explain the reasoning.

A comment on _log_writes_init that it must only be used by dm-thin
because it relies on the undocumented behavior that dm-trim zeroes
all blocks discarded.

Or even better my moving the dm-think setup boilerplate into the log
writes helpers, so that it gets done automatically.
Christoph Hellwig Nov. 22, 2024, 12:34 p.m. UTC | #12
On Thu, Nov 21, 2024 at 08:04:15AM -0800, Darrick J. Wong wrote:
> > IIRC it was to accommodate the test program, which presumably used
> > discard for efficiency reasons because it did a lot of context switching
> > to different point-in-time variations of the fs. If the discard didn't
> > actually zero the range (depending on the underlying test dev), then at
> > least on XFS, we'd see odd recovery issues and whatnot from the fs going
> > forward/back in time.
> 
> Yes, that's my recollection too -- performing a logwrite replay of an
> old mark means that you can end up with blocks with the correct fs uuid
> but an LSN that's higher than anything in the log.  Recovery will then
> skip the block replay, which is not correct.
> 
> I suppose we could fix log recovery to treat incoming block LSNs that
> are higher than the log head as if there were no block contents at all.
> OTOH going backwards in time isn't usually a concern...right?

It's probably the best we can do.  Recover as far as everything validated
and then give up.

> 
> > Therefore the reason for using dm-thin was that it was an easy way to
> > provide predictable behavior to the test program, where discards punch
> > out blocks that subsequently return zeroes.
> 
> Yep.  The test needs to reset the block device to a zeroed state.
> Discards get us there quickly, but only if discard_zeroes_data==1.
> Hence bolting dm-thinp (where this is guaranteed) onto the logwrites
> tests.

discard_zeroes_data was unfortunately always broken as no standard
gives you any such guarantee.  The best you get is a guarantee that
it returns zeroes if it actually deallocated the block, but if it
deallocates a given block or not is a black box.

> 
> > I don't recall all the specifics, but I thought part of the reason for
> > using discard over explicit zeroing was the latter made the test
> > impractically slow. I could be misremembering, but if you want to change
> > it I'd suggest to at least verify runtimes on some of the preexisting
> > logwrites tests as well.
> 
> Not sure -- I think BLKZEROOUT will cause allocations and real disk
> writes if we're not careful.

If the device reports a queue/write_zeroes_max_bytes it supports a hardware
offload.  That could still write zeroes to the media if the device
is stupid enough, but hopefully not many are.
Christoph Hellwig Nov. 22, 2024, 12:35 p.m. UTC | #13
On Thu, Nov 21, 2024 at 09:19:24AM -0800, Darrick J. Wong wrote:
> Oh, that's with your BLKDISCARD -> BLKZEROOUT change applied, isn't it?
> On my system, 100 loops takes 96 seconds with discard and 639 seconds
> with zeroout.

The discard path is guaranteed to fail after the first iteration, but
that at least happens very quickly..
Brian Foster Nov. 22, 2024, 1:49 p.m. UTC | #14
On Fri, Nov 22, 2024 at 01:31:33PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 21, 2024 at 09:11:56AM -0500, Brian Foster wrote:
> > > I'm all for speeding up tests.  But relying on a unspecified side effect
> > > of an operation and then requiring a driver that implements that side
> > > effect without documenting that isn't really good practice.
> > > 
> > 
> > It's a hack to facilitate test coverage. It would obviously need to be
> > revisited if behavior changed sufficiently to break the test.
> > 
> > I'm not really sure what you're asking for wrt documentation. A quick
> > scan of the git history shows the first such commit is 65cc9a235919
> > ("generic/482: use thin volume as data device"), the commit log for
> > which seems to explain the reasoning.
> 
> A comment on _log_writes_init that it must only be used by dm-thin
> because it relies on the undocumented behavior that dm-trim zeroes
> all blocks discarded.
> 
> Or even better my moving the dm-think setup boilerplate into the log
> writes helpers, so that it gets done automatically.
> 

A related idea might be to incorporate your BLKZEROOUT fix so the core
tool is fundamentally correct, but then wrap the existing discard
behavior in a param or something that the dm-thin oriented tests can
pass to enable it as a fast zero hack/optimization.

But that all seems reasonable to me either way. I'm not sure that's
something I would have fully abstracted into the logwrites stuff
initially, but here we are ~5 years later and it seems pretty much every
additional logwrites test has wanted the same treatment. If whoever
wants to convert this newer test over wants to start by refactoring
things that way, that sounds like a welcome cleanup to me.

Brian
Darrick J. Wong Nov. 22, 2024, 4:13 p.m. UTC | #15
On Fri, Nov 22, 2024 at 08:49:42AM -0500, Brian Foster wrote:
> On Fri, Nov 22, 2024 at 01:31:33PM +0100, Christoph Hellwig wrote:
> > On Thu, Nov 21, 2024 at 09:11:56AM -0500, Brian Foster wrote:
> > > > I'm all for speeding up tests.  But relying on a unspecified side effect
> > > > of an operation and then requiring a driver that implements that side
> > > > effect without documenting that isn't really good practice.
> > > > 
> > > 
> > > It's a hack to facilitate test coverage. It would obviously need to be
> > > revisited if behavior changed sufficiently to break the test.
> > > 
> > > I'm not really sure what you're asking for wrt documentation. A quick
> > > scan of the git history shows the first such commit is 65cc9a235919
> > > ("generic/482: use thin volume as data device"), the commit log for
> > > which seems to explain the reasoning.
> > 
> > A comment on _log_writes_init that it must only be used by dm-thin
> > because it relies on the undocumented behavior that dm-trim zeroes
> > all blocks discarded.
> > 
> > Or even better my moving the dm-think setup boilerplate into the log
> > writes helpers, so that it gets done automatically.
> > 
> 
> A related idea might be to incorporate your BLKZEROOUT fix so the core
> tool is fundamentally correct, but then wrap the existing discard
> behavior in a param or something that the dm-thin oriented tests can
> pass to enable it as a fast zero hack/optimization.
> 
> But that all seems reasonable to me either way. I'm not sure that's
> something I would have fully abstracted into the logwrites stuff
> initially, but here we are ~5 years later and it seems pretty much every
> additional logwrites test has wanted the same treatment. If whoever
> wants to convert this newer test over wants to start by refactoring
> things that way, that sounds like a welcome cleanup to me.

Ugh, I just want to fix this stupid test and move on with the bugfixes,
not refactor every logwrites user in the codebase just to reduce one
test's runtime from hours to 90s.

It's not as simple as making the logwrites init function set up thinp on
its own, because there's at least one test out there (generic/470) that
takes care of its own discarding, and then there's whatever the strange
stuff that the tests/btrfs/ users do -- it looks fairly simple, but I
don't really want to go digging into that just to make sure I didn't
break their testing.

I'll send what I have currently, which adds a warning about running
logwrites on a device that supports discard but isn't thinp... in
addition to fixing the xfs log recovery thing, and in addition to fixing
the loop duration.

I guess I can add yet another patch to switch the replay program to use
BLKDISCARD if the _init function thinks it's ok, but seriously... you
guys need to send start sending patches implementing the new
functionality that you suggest.

--D

> Brian
> 
>
Christoph Hellwig Nov. 22, 2024, 4:20 p.m. UTC | #16
On Fri, Nov 22, 2024 at 08:13:47AM -0800, Darrick J. Wong wrote:
> I guess I can add yet another patch to switch the replay program to use
> BLKDISCARD if the _init function thinks it's ok, but seriously... you
> guys need to send start sending patches implementing the new
> functionality that you suggest.

I can look into the refactoring and the comments, I wasn't planning
to offload the work onto you, just making sure we have robust
infrastructure.
Brian Foster Nov. 22, 2024, 4:33 p.m. UTC | #17
On Fri, Nov 22, 2024 at 08:13:47AM -0800, Darrick J. Wong wrote:
> On Fri, Nov 22, 2024 at 08:49:42AM -0500, Brian Foster wrote:
> > On Fri, Nov 22, 2024 at 01:31:33PM +0100, Christoph Hellwig wrote:
> > > On Thu, Nov 21, 2024 at 09:11:56AM -0500, Brian Foster wrote:
> > > > > I'm all for speeding up tests.  But relying on a unspecified side effect
> > > > > of an operation and then requiring a driver that implements that side
> > > > > effect without documenting that isn't really good practice.
> > > > > 
> > > > 
> > > > It's a hack to facilitate test coverage. It would obviously need to be
> > > > revisited if behavior changed sufficiently to break the test.
> > > > 
> > > > I'm not really sure what you're asking for wrt documentation. A quick
> > > > scan of the git history shows the first such commit is 65cc9a235919
> > > > ("generic/482: use thin volume as data device"), the commit log for
> > > > which seems to explain the reasoning.
> > > 
> > > A comment on _log_writes_init that it must only be used by dm-thin
> > > because it relies on the undocumented behavior that dm-trim zeroes
> > > all blocks discarded.
> > > 
> > > Or even better my moving the dm-think setup boilerplate into the log
> > > writes helpers, so that it gets done automatically.
> > > 
> > 
> > A related idea might be to incorporate your BLKZEROOUT fix so the core
> > tool is fundamentally correct, but then wrap the existing discard
> > behavior in a param or something that the dm-thin oriented tests can
> > pass to enable it as a fast zero hack/optimization.
> > 
> > But that all seems reasonable to me either way. I'm not sure that's
> > something I would have fully abstracted into the logwrites stuff
> > initially, but here we are ~5 years later and it seems pretty much every
> > additional logwrites test has wanted the same treatment. If whoever
> > wants to convert this newer test over wants to start by refactoring
> > things that way, that sounds like a welcome cleanup to me.
> 
> Ugh, I just want to fix this stupid test and move on with the bugfixes,
> not refactor every logwrites user in the codebase just to reduce one
> test's runtime from hours to 90s.
> 
> It's not as simple as making the logwrites init function set up thinp on
> its own, because there's at least one test out there (generic/470) that
> takes care of its own discarding, and then there's whatever the strange
> stuff that the tests/btrfs/ users do -- it looks fairly simple, but I
> don't really want to go digging into that just to make sure I didn't
> break their testing.
> 
> I'll send what I have currently, which adds a warning about running
> logwrites on a device that supports discard but isn't thinp... in
> addition to fixing the xfs log recovery thing, and in addition to fixing
> the loop duration.
> 
> I guess I can add yet another patch to switch the replay program to use
> BLKDISCARD if the _init function thinks it's ok, but seriously... you
> guys need to send start sending patches implementing the new
> functionality that you suggest.
> 

Sorry, I should have been more clear. I certainly don't insist on it as
an immediate change or to gatekeep the current patch. I'm just acking
the idea, and I think it's perfectly fair to say "more time consuming
than I have time for right now" if you planned to just fixup the test
itself. I may get to it opportunistically someday, or if hch cares
enough about it he's certainly capable of picking it up sooner.

For future reference, I'm generally not trying to tell people what to do
with their patches or force work on people, etc. I realize we have a
tendency to do that. I don't like it either. It would be nice if we had
a clearer way to express/discuss an idea without implying it as a
demand.

Brian

> --D
> 
> > Brian
> > 
> > 
>
Darrick J. Wong Nov. 22, 2024, 4:37 p.m. UTC | #18
On Fri, Nov 22, 2024 at 11:33:24AM -0500, Brian Foster wrote:
> On Fri, Nov 22, 2024 at 08:13:47AM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 22, 2024 at 08:49:42AM -0500, Brian Foster wrote:
> > > On Fri, Nov 22, 2024 at 01:31:33PM +0100, Christoph Hellwig wrote:
> > > > On Thu, Nov 21, 2024 at 09:11:56AM -0500, Brian Foster wrote:
> > > > > > I'm all for speeding up tests.  But relying on a unspecified side effect
> > > > > > of an operation and then requiring a driver that implements that side
> > > > > > effect without documenting that isn't really good practice.
> > > > > > 
> > > > > 
> > > > > It's a hack to facilitate test coverage. It would obviously need to be
> > > > > revisited if behavior changed sufficiently to break the test.
> > > > > 
> > > > > I'm not really sure what you're asking for wrt documentation. A quick
> > > > > scan of the git history shows the first such commit is 65cc9a235919
> > > > > ("generic/482: use thin volume as data device"), the commit log for
> > > > > which seems to explain the reasoning.
> > > > 
> > > > A comment on _log_writes_init that it must only be used by dm-thin
> > > > because it relies on the undocumented behavior that dm-trim zeroes
> > > > all blocks discarded.
> > > > 
> > > > Or even better my moving the dm-think setup boilerplate into the log
> > > > writes helpers, so that it gets done automatically.
> > > > 
> > > 
> > > A related idea might be to incorporate your BLKZEROOUT fix so the core
> > > tool is fundamentally correct, but then wrap the existing discard
> > > behavior in a param or something that the dm-thin oriented tests can
> > > pass to enable it as a fast zero hack/optimization.
> > > 
> > > But that all seems reasonable to me either way. I'm not sure that's
> > > something I would have fully abstracted into the logwrites stuff
> > > initially, but here we are ~5 years later and it seems pretty much every
> > > additional logwrites test has wanted the same treatment. If whoever
> > > wants to convert this newer test over wants to start by refactoring
> > > things that way, that sounds like a welcome cleanup to me.
> > 
> > Ugh, I just want to fix this stupid test and move on with the bugfixes,
> > not refactor every logwrites user in the codebase just to reduce one
> > test's runtime from hours to 90s.
> > 
> > It's not as simple as making the logwrites init function set up thinp on
> > its own, because there's at least one test out there (generic/470) that
> > takes care of its own discarding, and then there's whatever the strange
> > stuff that the tests/btrfs/ users do -- it looks fairly simple, but I
> > don't really want to go digging into that just to make sure I didn't
> > break their testing.
> > 
> > I'll send what I have currently, which adds a warning about running
> > logwrites on a device that supports discard but isn't thinp... in
> > addition to fixing the xfs log recovery thing, and in addition to fixing
> > the loop duration.
> > 
> > I guess I can add yet another patch to switch the replay program to use
> > BLKDISCARD if the _init function thinks it's ok, but seriously... you
> > guys need to send start sending patches implementing the new
> > functionality that you suggest.
> > 
> 
> Sorry, I should have been more clear. I certainly don't insist on it as
> an immediate change or to gatekeep the current patch. I'm just acking
> the idea, and I think it's perfectly fair to say "more time consuming
> than I have time for right now" if you planned to just fixup the test
> itself. I may get to it opportunistically someday, or if hch cares
> enough about it he's certainly capable of picking it up sooner.
> 
> For future reference, I'm generally not trying to tell people what to do
> with their patches or force work on people, etc. I realize we have a
> tendency to do that. I don't like it either. It would be nice if we had
> a clearer way to express/discuss an idea without implying it as a
> demand.

/me suggests

"Here's something that we ought to do, though as a separate patchset:
clean up all the $fubar to be $less_fubar.  In the meantime this patch
is good enough for now."

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > 
> > 
>
diff mbox series

Patch

diff --git a/tests/generic/757 b/tests/generic/757
index 0ff5a8ac00182b..9d41975bde07bb 100755
--- a/tests/generic/757
+++ b/tests/generic/757
@@ -63,9 +63,14 @@  prev=$(_log_writes_mark_to_entry_number mkfs)
 cur=$(_log_writes_find_next_fua $prev)
 [ -z "$cur" ] && _fail "failed to locate next FUA write"
 
-while [ ! -z "$cur" ]; do
+for ((i = 0; i < 100; i++)); do
 	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
 
+	# xfs_repair won't run if the log is dirty
+	if [ $FSTYP = "xfs" ]; then
+		_scratch_mount
+		_scratch_unmount
+	fi
 	_check_scratch_fs
 
 	prev=$cur