mbox series

[00/11] t: reduce direct disk access to data structures

Message ID cover.1697607222.git.ps@pks.im (mailing list archive)
Headers show
Series t: reduce direct disk access to data structures | expand

Message

Patrick Steinhardt Oct. 18, 2023, 5:35 a.m. UTC
Hi,

this patch series refactors a bunch of our tests to perform less direct
disk access to on-disk data structures. Instead, the tests are converted
to use Git tools or our test-tool to access data to the best extent
possible. This serves two benefits:

    - We increase test coverage of our own code base.

    - We become less dependent on the actual on-disk format.

The main motivation for this patch series was the second bullet point as
it is preparatory work to get the reftable backend upstreamed. My intent
is to get rid of many or even most of the current blockers in the Git
project before trying to send the reftable implementation upstream.
While this will be a lot of up-front work that is going to span over a
long time period, I think this approach will make everyones live easier
by doing comparatively small and incremental improvements to the Git
project. Ultimately, the final patch series should in the best case only
contain the new backend as well as testing infrastructure, but not much
else.

Patrick

Patrick Steinhardt (11):
  t: add helpers to test for reference existence
  t: allow skipping expected object ID in `ref-store update-ref`
  t: convert tests to use helpers for reference existence
  t: convert tests to not write references via the filesystem
  t: convert tests to not access symrefs via the filesystem
  t: convert tests to not access reflog via the filesystem
  t1450: convert tests to remove worktrees via git-worktree(1)
  t4207: delete replace references via git-update-ref(1)
  t7300: assert exact states of repo
  t7900: assert the absence of refs via git-for-each-ref(1)
  t: mark several tests that assume the files backend with REFFILES

 t/README                           |  9 ++++
 t/helper/test-ref-store.c          | 38 +++++++++++++--
 t/t1400-update-ref.sh              | 49 ++++++++++----------
 t/t1430-bad-ref-name.sh            | 39 ++++++++++------
 t/t1450-fsck.sh                    | 46 ++++++++++---------
 t/t2011-checkout-invalid-head.sh   | 16 +++----
 t/t3200-branch.sh                  | 74 ++++++++++++++++--------------
 t/t3400-rebase.sh                  |  2 +-
 t/t3404-rebase-interactive.sh      |  2 +-
 t/t4013-diff-various.sh            |  2 +-
 t/t4202-log.sh                     |  2 +-
 t/t4207-log-decoration-colors.sh   |  4 +-
 t/t5521-pull-options.sh            |  4 +-
 t/t5526-fetch-submodules.sh        |  2 +-
 t/t5605-clone-local.sh             |  6 +--
 t/t5702-protocol-v2.sh             | 24 +++++++---
 t/t7300-clean.sh                   | 23 ++++++----
 t/t7900-maintenance.sh             |  3 +-
 t/t9133-git-svn-nested-git-repo.sh |  2 +-
 t/test-lib-functions.sh            | 66 ++++++++++++++++++++++++++
 20 files changed, 277 insertions(+), 136 deletions(-)

Comments

Patrick Steinhardt Oct. 18, 2023, 5:39 a.m. UTC | #1
On Wed, Oct 18, 2023 at 07:35:03AM +0200, Patrick Steinhardt wrote:
> Hi,
> 
> this patch series refactors a bunch of our tests to perform less direct
> disk access to on-disk data structures. Instead, the tests are converted
> to use Git tools or our test-tool to access data to the best extent
> possible. This serves two benefits:
> 
>     - We increase test coverage of our own code base.
> 
>     - We become less dependent on the actual on-disk format.
> 
> The main motivation for this patch series was the second bullet point as
> it is preparatory work to get the reftable backend upstreamed. My intent
> is to get rid of many or even most of the current blockers in the Git
> project before trying to send the reftable implementation upstream.
> While this will be a lot of up-front work that is going to span over a
> long time period, I think this approach will make everyones live easier
> by doing comparatively small and incremental improvements to the Git
> project. Ultimately, the final patch series should in the best case only
> contain the new backend as well as testing infrastructure, but not much
> else.
> 
> Patrick

As usual, I forgot to mention that this applies on top of current
master, which is at a9ecda2788 (The eighteenth batch, 2023-10-13) at the
time of writing. I look forward to the day where I remember to include
this information in the cover letter...

Patrick

> Patrick Steinhardt (11):
>   t: add helpers to test for reference existence
>   t: allow skipping expected object ID in `ref-store update-ref`
>   t: convert tests to use helpers for reference existence
>   t: convert tests to not write references via the filesystem
>   t: convert tests to not access symrefs via the filesystem
>   t: convert tests to not access reflog via the filesystem
>   t1450: convert tests to remove worktrees via git-worktree(1)
>   t4207: delete replace references via git-update-ref(1)
>   t7300: assert exact states of repo
>   t7900: assert the absence of refs via git-for-each-ref(1)
>   t: mark several tests that assume the files backend with REFFILES
> 
>  t/README                           |  9 ++++
>  t/helper/test-ref-store.c          | 38 +++++++++++++--
>  t/t1400-update-ref.sh              | 49 ++++++++++----------
>  t/t1430-bad-ref-name.sh            | 39 ++++++++++------
>  t/t1450-fsck.sh                    | 46 ++++++++++---------
>  t/t2011-checkout-invalid-head.sh   | 16 +++----
>  t/t3200-branch.sh                  | 74 ++++++++++++++++--------------
>  t/t3400-rebase.sh                  |  2 +-
>  t/t3404-rebase-interactive.sh      |  2 +-
>  t/t4013-diff-various.sh            |  2 +-
>  t/t4202-log.sh                     |  2 +-
>  t/t4207-log-decoration-colors.sh   |  4 +-
>  t/t5521-pull-options.sh            |  4 +-
>  t/t5526-fetch-submodules.sh        |  2 +-
>  t/t5605-clone-local.sh             |  6 +--
>  t/t5702-protocol-v2.sh             | 24 +++++++---
>  t/t7300-clean.sh                   | 23 ++++++----
>  t/t7900-maintenance.sh             |  3 +-
>  t/t9133-git-svn-nested-git-repo.sh |  2 +-
>  t/test-lib-functions.sh            | 66 ++++++++++++++++++++++++++
>  20 files changed, 277 insertions(+), 136 deletions(-)
> 
> -- 
> 2.42.0
>
Junio C Hamano Oct. 18, 2023, 3:32 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> this patch series refactors a bunch of our tests to perform less direct
> disk access to on-disk data structures. Instead, the tests are converted
> to use Git tools or our test-tool to access data to the best extent
> possible. This serves two benefits:

Laudable goal.

>     - We increase test coverage of our own code base.

Meaning the new code added to test-tool for this series will also
get tested and bugs spotted?

>     - We become less dependent on the actual on-disk format.

Yes, this is very desirable.  Without looking at the implementation,
I see some issues aiming for this goal may involve. [a] using the
production code for validation would mean our expectation to be
compared to the reality to be validated can be affected by the same
bug, making two wrongs to appear right; [b] using a separate
implementation used only for validation would at least mean we will
have to make the same mistake in unique part of both implementations
that is less likely to miss bugs compared to [a], but bugs in shared
part of the production code and validation code will be hidden the
same way as [a].

But you have thought about this series a lot deeper and longer than
I have, so let me read on and find out what your solution is.


> The main motivation for this patch series was the second bullet point as
> it is preparatory work to get the reftable backend upstreamed.

Yay.

> My intent
> is to get rid of many or even most of the current blockers in the Git
> project before trying to send the reftable implementation upstream.
> While this will be a lot of up-front work that is going to span over a
> long time period, I think this approach will make everyones live easier
> by doing comparatively small and incremental improvements to the Git
> project.

Thank you very much for being considerate.  I assume this approach
is what those in the Contributors' Summit agreed to be the best way
forward to do anything sizeable.

> Ultimately, the final patch series should in the best case only
> contain the new backend as well as testing infrastructure, but not much
> else.

;-)
Junio C Hamano Oct. 18, 2023, 11:40 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

> As usual, I forgot to mention that this applies on top of current
> master, which is at a9ecda2788 (The eighteenth batch, 2023-10-13) at the
> time of writing. I look forward to the day where I remember to include
> this information in the cover letter...

I heard rumors that we have a --base option that records such a
piece of information in the format-patch output (I haven't used it
myself).  Would it have helped, or perhaps it still needs some
usability tweak?
Han-Wen Nienhuys Oct. 19, 2023, 10:13 a.m. UTC | #4
On Wed, Oct 18, 2023 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote:
> > this patch series refactors a bunch of our tests to perform less direct
> > disk access to on-disk data structures. Instead, the tests are converted
> > to use Git tools or our test-tool to access data to the best extent
> > possible. This serves two benefits:
>
> Laudable goal.
>
> >     - We increase test coverage of our own code base.
>
> Meaning the new code added to test-tool for this series will also
> get tested and bugs spotted?
>
> >     - We become less dependent on the actual on-disk format.
>
> Yes, this is very desirable.  Without looking at the implementation,
> I see some issues aiming for this goal may involve. [a] using the
> production code for validation would mean our expectation to be
> compared to the reality to be validated can be affected by the same
> bug, making two wrongs to appear right; [b] using a separate
> implementation used only for validation would at least mean we will
> have to make the same mistake in unique part of both implementations
> that is less likely to miss bugs compared to [a], but bugs in shared
> part of the production code and validation code will be hidden the
> same way as [a].

I think it would be really great if there were separate unittests for
the ref backend API. Some of the reftable work was needlessly
difficult because the contract of the API was underspecified. The API
is well compartmentalized in refs-internal.h, and a lot of the API
behavior can be tested as a black box, eg.

* setup symref HEAD pointing to R1
* setup transaction updating ref R1 from C1 to C2
* commit transaction, check that it succeeds
* read ref R1, check if it is C2
* read reflog for R1, see that it has a C1 => C2 update
* read reflog for HEAD, see that it has a C1 => C2 update

Tests for the loose/packed backend could directly mess with the
on-disk files to test failure scenarios.

With unittests like that, the tests can zoom in on the functionality
of the ref backend, and provide more convenient coverage for
dynamic/static analysis.
Junio C Hamano Oct. 19, 2023, 5:55 p.m. UTC | #5
Han-Wen Nienhuys <hanwen@google.com> writes:

> I think it would be really great if there were separate unittests for
> the ref backend API. Some of the reftable work was needlessly
> difficult because the contract of the API was underspecified. The API
> is well compartmentalized in refs-internal.h, and a lot of the API
> behavior can be tested as a black box, eg.
>
> * setup symref HEAD pointing to R1
> * setup transaction updating ref R1 from C1 to C2
> * commit transaction, check that it succeeds
> * read ref R1, check if it is C2
> * read reflog for R1, see that it has a C1 => C2 update
> * read reflog for HEAD, see that it has a C1 => C2 update
>
> Tests for the loose/packed backend could directly mess with the
> on-disk files to test failure scenarios.
>
> With unittests like that, the tests can zoom in on the functionality
> of the ref backend, and provide more convenient coverage for
> dynamic/static analysis.

Yeah, I agree that something like that would really be great.
Patrick Steinhardt Oct. 23, 2023, 11:57 a.m. UTC | #6
On Wed, Oct 18, 2023 at 04:40:35PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > As usual, I forgot to mention that this applies on top of current
> > master, which is at a9ecda2788 (The eighteenth batch, 2023-10-13) at the
> > time of writing. I look forward to the day where I remember to include
> > this information in the cover letter...
> 
> I heard rumors that we have a --base option that records such a
> piece of information in the format-patch output (I haven't used it
> myself).  Would it have helped, or perhaps it still needs some
> usability tweak?

I was aware of the option, but somehow never thought to use it. I guess
others are in the same boat as this is a semi-frequent topic to come up
on the mailing list.

What will fix this for me from now on is that I've now discovered the
`format.useAutoBase` config. To fix this for everybody else without
having to go through the same process we could change the default to
`format.useAutoBase=auto`. This can lead to hard-to-understand errors as
pointed out by 7efba5fa39f (format-patch: teach format.useAutoBase
"whenAble" option, 2020-10-01) though:

```
$ git format-patch -1 <an old commit>
fatal: base commit shouldn't be in revision list
```

But this was fixed via the new "whenAble" value for this configuration,
which will cause Git to ignore the case where it's unable to figure out
the base commit. So maybe we should consider to make it the new default?
I cannot think of any real downside. In the best case we simplify the
life of many maintainers of mailing list based projects. In the worst
case where we are unable to figure out the base commit nothing changes.

Patrick
Patrick Steinhardt Oct. 23, 2023, 1:58 p.m. UTC | #7
On Thu, Oct 19, 2023 at 12:13:12PM +0200, Han-Wen Nienhuys wrote:
> On Wed, Oct 18, 2023 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > this patch series refactors a bunch of our tests to perform less direct
> > > disk access to on-disk data structures. Instead, the tests are converted
> > > to use Git tools or our test-tool to access data to the best extent
> > > possible. This serves two benefits:
> >
> > Laudable goal.
> >
> > >     - We increase test coverage of our own code base.
> >
> > Meaning the new code added to test-tool for this series will also
> > get tested and bugs spotted?

For now all the helpers of the test-tool only have implicit test
coverage, but I get your point. If we decide to instead transform this
test tool into production-level code as you suggested (e.g. `git
rev-parse --exists`) then this becomes less of a discussion point as we
would of course have proper test coverage for it.

> > >     - We become less dependent on the actual on-disk format.
> >
> > Yes, this is very desirable.  Without looking at the implementation,
> > I see some issues aiming for this goal may involve. [a] using the
> > production code for validation would mean our expectation to be
> > compared to the reality to be validated can be affected by the same
> > bug, making two wrongs to appear right; [b] using a separate
> > implementation used only for validation would at least mean we will
> > have to make the same mistake in unique part of both implementations
> > that is less likely to miss bugs compared to [a], but bugs in shared
> > part of the production code and validation code will be hidden the
> > same way as [a].
> 
> I think it would be really great if there were separate unittests for
> the ref backend API. Some of the reftable work was needlessly
> difficult because the contract of the API was underspecified. The API
> is well compartmentalized in refs-internal.h, and a lot of the API
> behavior can be tested as a black box, eg.
> 
> * setup symref HEAD pointing to R1
> * setup transaction updating ref R1 from C1 to C2
> * commit transaction, check that it succeeds
> * read ref R1, check if it is C2
> * read reflog for R1, see that it has a C1 => C2 update
> * read reflog for HEAD, see that it has a C1 => C2 update
> 
> Tests for the loose/packed backend could directly mess with the
> on-disk files to test failure scenarios.
> 
> With unittests like that, the tests can zoom in on the functionality
> of the ref backend, and provide more convenient coverage for
> dynamic/static analysis.

Agreed. Ideally, I think our tests should be split up into two layers:

    1. Low-level tests which are backend specific. These allow us to
       assert the on-disk state of the respective backend thoroughly and
       also allow us to explicitly test for edge cases that are only of
       relevance to this specific backend.

       The reftable backend in its current (non-upstream) already has
       such tests, but we don't really have explicit tests for the files
       backend. This is a gap that should ideally be filled at some
       point in time.

    2. Higher-level tests should then be allowed to assume that the
       underlying logic works as expected. These tests are thus free to
       use plumbing tools that tie into the reference backends.

Patrick