[PATCHSET,0/3] fstests: direct specification of looping test duration

Message ID 168123682679.4086541.13812285218510940665.stgit@frogsfrogsfrogs

Message

Darrick J. Wong April 11, 2023, 6:13 p.m. UTC
Hi all,

One of the things that I do as a maintainer is to designate a handful of
VMs to run fstests for unusually long periods of time.  I call this
practice long-term soak testing.  There are actually three separate
fleets for this -- one runs alongside the nightly builds, one runs
alongside weekly rebases, and the last one runs stable releases.

My interaction with all three fleets is pretty much the same -- load
current builds of software, and try to run the exerciser tests for a
set duration of time -- 12 hours, 6.5 days, 30 days, etc.  TIME_FACTOR
does not work well for this usage model, because it is difficult to
guess the correct time factor given that the VMs are heterogeneous and
the IO completion rate is not perfectly predictable.

Worse yet, if you want to run (say) all the recoveryloop tests on one VM
(because recoveryloop is prone to crashing), it's impossible to set a
TIME_FACTOR so that each loop test gets equal runtime.  That can be
hacked around with config sections, but that doesn't solve the first
problem.

This series introduces a new configuration variable, SOAK_DURATION,
that allows test runners to directly control the runtime of various
long-soak and looping recovery tests.  This is intended to be an
alternative to TIME_FACTOR, since that variable usually adjusts
operation counts, which are roughly proportional to runtime but
otherwise not a direct measure of time.
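
For illustration, a minimal local.config sketch might look like the
following; the device paths are placeholders, and the h/m/s duration
suffixes are assumed to be what the new src/soak_duration.awk helper
parses:

    # local.config sketch -- hypothetical devices
    export TEST_DEV=/dev/vda
    export TEST_DIR=/mnt/test
    export SCRATCH_DEV=/dev/vdb
    export SCRATCH_MNT=/mnt/scratch

    # Cap each soak/looping test at roughly three hours of wall-clock
    # time, regardless of how quickly this VM completes IO.
    export SOAK_DURATION=3h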

With this override in place, I can configure the long soak fleet to run
for exactly as long as I want, and the machines actually hit their time
budget targets.  The recoveryloop fleet now divides looping-test time
equally among the four tests in that group so that each gets ~3 hours
of coverage every night.
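
As a concrete sketch of how that fleet might be driven (assuming
SOAK_DURATION, like other fstests configuration variables, can also be
supplied through the environment rather than local.config):

    # Give every looping recovery test the same three-hour budget for
    # tonight's run; the duration value here is hypothetical.  The
    # -g recoveryloop group selection is standard fstests usage.
    SOAK_DURATION=3h ./check -g recoveryloop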

There are more tests that could use this than I actually modified here,
but I've done enough to show this off as a proof of concept.

If you're going to start using this mess, you probably ought to just
pull from my git tree, which is linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=soak-duration
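
For example, one way to try the branch directly (assuming the usual
git.kernel.org clone URL corresponding to this cgit path):

    # Fetch the soak-duration branch of the fstests development tree.
    $ git clone -b soak-duration \
        https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git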
---
 check                 |   14 +++++++++
 common/config         |    7 ++++
 common/fuzzy          |    7 ++++
 common/rc             |   34 +++++++++++++++++++++
 common/report         |    1 +
 ltp/fsstress.c        |   78 +++++++++++++++++++++++++++++++++++++++++++++++--
 ltp/fsx.c             |   50 +++++++++++++++++++++++++++++++
 src/soak_duration.awk |   23 ++++++++++++++
 tests/generic/019     |    1 +
 tests/generic/388     |    2 +
 tests/generic/475     |    2 +
 tests/generic/476     |    7 +++-
 tests/generic/482     |    5 +++
 tests/generic/521     |    1 +
 tests/generic/522     |    1 +
 tests/generic/642     |    1 +
 tests/generic/648     |    8 +++--
 17 files changed, 229 insertions(+), 13 deletions(-)
 create mode 100644 src/soak_duration.awk

Comments

Andrey Albershteyn April 13, 2023, 10:48 a.m. UTC | #1
On Tue, Apr 11, 2023 at 11:13:46AM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> [...]

The set looks good to me (the second commit has a different var name,
but fine by me)
Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
Darrick J. Wong April 13, 2023, 2:47 p.m. UTC | #2
On Thu, Apr 13, 2023 at 12:48:36PM +0200, Andrey Albershteyn wrote:
> On Tue, Apr 11, 2023 at 11:13:46AM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > [...]
> 
> The set looks good to me (the second commit has a different var name,
> but fine by me)

Which variable name, specifically?

--D

> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
> 
> -- 
> - Andrey
>
Andrey Albershteyn April 13, 2023, 3:43 p.m. UTC | #3
On Thu, Apr 13, 2023 at 07:47:08AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 13, 2023 at 12:48:36PM +0200, Andrey Albershteyn wrote:
> > > [...]
> > 
> > The set looks good to me (the second commit has a different var name,
> > but fine by me)
> 
> Which variable name, specifically?

STRESS_DURATION in the commit message
Darrick J. Wong April 15, 2023, 12:28 a.m. UTC | #4
On Thu, Apr 13, 2023 at 05:43:52PM +0200, Andrey Albershteyn wrote:
> On Thu, Apr 13, 2023 at 07:47:08AM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 13, 2023 at 12:48:36PM +0200, Andrey Albershteyn wrote:
> > > >  tests/generic/648     |    8 +++--
> > > >  17 files changed, 229 insertions(+), 13 deletions(-)
> > > >  create mode 100644 src/soak_duration.awk
> > > > 
> > > 
> > > The set looks good to me (the second commit has different var name,
> > > but fine by me)
> > 
> > Which variable name, specifically?
> 
> STRESS_DURATION in the commit message

Ah, will fix and resend.  Thanks for noticing!

--D

> -- 
> - Andrey
>