Message ID | 173706974273.1927324.11899201065662863518.stgit@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Series | [01/23] generic/476: fix fsstress process management |
On Thu, Jan 16, 2025 at 03:28:33PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Prior to commit 8973af00ec21, in the absence of an explicit
> SOAK_DURATION, this test would run 2500 fsstress operations each of ten
> times through the loop body. On the author's machines, this kept the
> runtime to about 30s total. Oddly, this was changed to 30s per loop
> body with no specific justification in the middle of an fsstress process
> management change.

I'm pretty sure that was because when you run g/650 on a machine
with 64p, the number of ops performed on the filesystem is
nr_cpus * 2500 * nr_loops.

In that case, each loop was taking over 90s to run, so the overall
runtime was up in the 15-20 minute mark. I wanted to cap the runtime
of each loop to min(nr_ops, SOAK_DURATION) so that it ran in about 5
minutes in the worst case, i.e. (nr_loops * SOAK_DURATION).

I probably misunderstood how -n nr_ops vs --duration=30 interact;
I expected it to run until either were exhausted, not for duration
to override nr_ops as implied by this:

> On the author's machine, this explodes the runtime from ~30s to 420s.
> Put things back the way they were.

Yeah, OK, that's exactly what keep_running() does - duration
overrides nr_ops.

Ok, so keeping or reverting the change will simply make different
people unhappy because of the excessive runtime the test has at
either end of the CPU count spectrum - what's the best way to
go about providing the desired min(nr_ops, max loop time) behaviour?
Do we simply cap the maximum process count to keep the number of ops
down to something reasonable (e.g. 16), or something else?

-Dave.
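To put numbers on the scaling described above, here is a minimal sketch of the nr_cpus * 2500 * nr_loops arithmetic and of the process-count cap floated at the end of the message. The helper names `total_ops` and `cap_procs` are hypothetical and do not exist in fstests; the loop count of ten comes from the quoted commit message, and the cap of 16 is only the example value suggested above, not a decided limit.

```bash
#!/bin/bash
# Hypothetical helper (not part of fstests): total fsstress operations
# issued when each of nr_cpus processes runs nr_ops per loop iteration.
total_ops() {
	local nr_cpus=$1 nr_ops=$2 nr_loops=$3
	echo $((nr_cpus * nr_ops * nr_loops))
}

# Hypothetical helper: clamp the process count to a ceiling, which is
# one of the options raised in the thread (e.g. 16).
cap_procs() {
	local nr_cpus=$1 max_procs=$2
	if [ "$nr_cpus" -gt "$max_procs" ]; then
		echo "$max_procs"
	else
		echo "$nr_cpus"
	fi
}

# Uncapped 64p machine, 2500 ops per process, ten loops.
total_ops 64 2500 10                     # prints 1600000
# Same machine with the process count capped at 16.
total_ops "$(cap_procs 64 16)" 2500 10   # prints 400000
```

Under these assumptions the cap cuts the work on a 64p machine by a factor of four, while machines with 16 or fewer CPUs are unaffected.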
On Tue, Jan 21, 2025 at 03:57:23PM +1100, Dave Chinner wrote:
> I probably misunderstood how -n nr_ops vs --duration=30 interact;
> I expected it to run until either were exhausted, not for duration
> to override nr_ops as implied by this:

There are (at least) two ways that a soak duration is being used
today; one is where someone wants to run a very long soak for hours
and where if you go long by an hour or two it's no big deal. The
other is where you are specifying a soak duration as part of a smoke
test (using the smoketest group), where you might be hoping to keep
the overall run time to 15-20 minutes and so you set SOAK_DURATION to
3m.

(This was based on some research that Darrick did which showed that
running the original 5 tests in the smoketest group gave you most of
the code coverage of running all of the quick group, which had
ballooned from 15 minutes many years ago to an hour or more. I just
noticed that we've since added two more tests to the smoketest group;
it might be worth checking whether those two new tests added to the
smoketest group significantly improve code coverage or not. It
would be unfortunate if the runtime bloat that happened to the quick
group also happens to the smoketest group...)

The bottom line is that, in addition to trying to design semantics for
users who might be at either end of the CPU count spectrum, we should
also consider that SOAK_DURATION could be set to values ranging from
minutes to hours.

Thanks,

- Ted
On Tue, Jan 21, 2025 at 08:00:27AM -0500, Theodore Ts'o wrote:
> On Tue, Jan 21, 2025 at 03:57:23PM +1100, Dave Chinner wrote:
> > I probably misunderstood how -n nr_ops vs --duration=30 interact;
> > I expected it to run until either were exhausted, not for duration
> > to override nr_ops as implied by this:
>
> There are (at least) two ways that a soak duration is being used
> today; one is where someone wants to run a very long soak for hours
> and where if you go long by an hour or two it's no big deal. The
> other is where you are specifying a soak duration as part of a smoke
> test (using the smoketest group), where you might be hoping to keep
> the overall run time to 15-20 minutes and so you set SOAK_DURATION to
> 3m.

check-parallel on my 64p machine runs the full auto group test in
under 10 minutes.

i.e. if you have a typical modern server (64-128p, 256GB RAM and a
couple of NVMe SSDs), then check-parallel allows a full test run in
the same time that './check -g smoketest' will run....

> (This was based on some research that Darrick did which showed that
> running the original 5 tests in the smoketest group gave you most of
> the code coverage of running all of the quick group, which had
> ballooned from 15 minutes many years ago to an hour or more. I just
> noticed that we've since added two more tests to the smoketest group;
> it might be worth checking whether those two new tests added to the
> smoketest group significantly improve code coverage or not. It
> would be unfortunate if the runtime bloat that happened to the quick
> group also happens to the smoketest group...)

Yes, and I've previously made the point about how check-parallel
changes the way we should be looking at dev-test cycles. We no
longer have to care that auto group testing takes 4 hours to run and
have to work around that with things like smoketest groups. If you
can run the whole auto test group in 10-15 minutes, then we don't
need "quick", "smoketest", etc to reduce dev-test cycle time
anymore...

> The bottom line is in addition to trying to design semantics for users
> who might be at either end of the CPU count spectrum, we should also
> consider that SOAK_DURATION could be set for values ranging from
> minutes to hours.

I don't see much point in testing for hours with check-parallel. The
whole point of it is to enable iteration across the entire fs test
matrix as fast as possible.

If you want to do long running soak tests, then keep using check for
that. If you want to run the auto group test across 100 different
mkfs option combinations, then that is where check-parallel comes
in - it'll take a few hours to do this instead of a week.

-Dave.
diff --git a/tests/generic/650 b/tests/generic/650
index 60f86fdf518961..d376488f2fedeb 100755
--- a/tests/generic/650
+++ b/tests/generic/650
@@ -68,11 +68,8 @@ test "$nr_cpus" -gt 1024 && nr_cpus="$nr_hotplug_cpus"
 fsstress_args+=(-p $nr_cpus)
 if [ -n "$SOAK_DURATION" ]; then
 	test "$SOAK_DURATION" -lt 10 && SOAK_DURATION=10
-else
-	# run for 30s per iteration max
-	SOAK_DURATION=300
+	fsstress_args+=(--duration="$((SOAK_DURATION / 10))")
 fi
-fsstress_args+=(--duration="$((SOAK_DURATION / 10))")
 
 nr_ops=$((2500 * TIME_FACTOR))
 fsstress_args+=(-n $nr_ops)
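
As a rough illustration of what the patched hunk does, the sketch below wraps the same logic in a standalone function so it can be run outside the test. The function name `build_fsstress_args` is hypothetical, and defaulting TIME_FACTOR to 1 when unset is an assumption made for this example; the real test also builds other arguments that are not shown in the hunk.

```bash
#!/bin/bash
# Hypothetical wrapper around the patched logic from tests/generic/650.
# Prints the fsstress arguments that the hunk above would assemble.
build_fsstress_args() {
	local nr_cpus=$1
	local soak=${SOAK_DURATION:-}
	local time_factor=${TIME_FACTOR:-1}	# assumed default for this sketch
	local args=(-p "$nr_cpus")

	if [ -n "$soak" ]; then
		# Enforce a 10s floor, then give each loop a tenth of the total.
		if [ "$soak" -lt 10 ]; then
			soak=10
		fi
		args+=(--duration="$((soak / 10))")
	fi
	# With no SOAK_DURATION, only the operation count bounds each loop.
	args+=(-n "$((2500 * time_factor))")
	echo "${args[*]}"
}

SOAK_DURATION= build_fsstress_args 4      # prints: -p 4 -n 2500
SOAK_DURATION=300 build_fsstress_args 4   # prints: -p 4 --duration=30 -n 2500
```

The first call corresponds to the restored default, where no --duration is passed at all. The second shows that an explicit SOAK_DURATION still yields a per-loop duration, which, per the discussion above, takes precedence over the -n count.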