[0/8] fetch: refactor code that prints reference updates


Message ID

cover.1678878623.git.ps@pks.im (mailing list archive)

Headers

Feedback-ID: i197146af:Fastmail
Date: Wed, 15 Mar 2023 12:21:01 +0100
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Subject: [PATCH 0/8] fetch: refactor code that prints reference updates
Message-ID: <cover.1678878623.git.ps@pks.im>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
        protocol="application/pgp-signature"; boundary="gMIC0HWLKYU+yMvP"
Content-Disposition: inline
Precedence: bulk

Series

fetch: refactor code that prints reference updates

Message

Patrick Steinhardt March 15, 2023, 11:21 a.m. UTC

Hi,

at GitLab, we want to gain more control over fetches to achieve two
different things:

    1. We want to take control of the reference updates so that we can
       atomically update all or a subset of references that git-fetch
       would have updated.

    2. We want to be able to quarantine objects in a fetch so that we
       can e.g. perform consistency checks for them before they land in
       the main repository.

To do this, we aim to use git-fetch(1)'s `--dry-run` mode with a
manually set up quarantine directory. One issue we currently face though
is that git-fetch(1), to the best of my knowledge, has no mode in which
it would print all reference updates in a machine-parseable format.

I thus set out to implement a "porcelain"-style mode for git-fetch(1)
that surfaces this information:

    - The reference that would be updated.

    - The remote reference this is coming from.

    - The old and new object IDs of the reference.

    - Whether there's any error, like a D/F conflict.
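
As a rough illustration of the four pieces of information listed above,
one reference update could be modelled as a small record and rendered as
a single line. The struct, the function name, the status flag and the
space-separated layout are all assumptions made for this sketch; this
series does not define the porcelain format.

```c
#include <stdio.h>

/* Hypothetical record for one reference update; not part of builtin/fetch.c. */
struct ref_update_record {
	const char *local_ref;  /* reference that would be updated */
	const char *remote_ref; /* remote reference it is coming from */
	const char *old_oid;    /* hex object ID before the update */
	const char *new_oid;    /* hex object ID after the update */
	char flag;              /* assumed status flag: ' ' for ok, '!' for error */
};

/*
 * Format one record as "<flag> <old> <new> <local> <remote>\n" into buf.
 * Returns what snprintf() returns: the length the full line would have.
 */
static int format_ref_update_record(char *buf, size_t size,
				    const struct ref_update_record *rec)
{
	return snprintf(buf, size, "%c %s %s %s %s\n", rec->flag,
			rec->old_oid, rec->new_oid,
			rec->local_ref, rec->remote_ref);
}
```

A consumer could then split each line on spaces, which works because
reference names cannot contain spaces.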

I had a hard time understanding the current implementation of how ref
updates are printed though. So as a first step towards such a porcelain
mode this patch series refactors said code. It sets out to achieve two
major goals:

    - There should be as little global state as possible. This is to reduce
      confusion and having to repeat the same incantations in multiple
      different locations.

    - The logic should be as self-contained as possible. This is so that
      it can easily be changed in a subsequent patch series.

This patch series does exactly that, but does not yet introduce the new
machine-parseable porcelain mode.

Patrick

Patrick Steinhardt (8):
  fetch: rename `display` buffer to avoid name conflict
  fetch: move reference width calculation into `display_state`
  fetch: move output format into `display_state`
  fetch: pass the full local reference name to `format_display`
  fetch: deduplicate handling of per-reference format
  fetch: deduplicate logic to print remote URL
  fetch: fix inconsistent summary width for pruned and updated refs
  fetch: centralize printing of reference updates

 builtin/fetch.c | 270 ++++++++++++++++++++++++------------------------
 1 file changed, 134 insertions(+), 136 deletions(-)

Comments

Jonathan Tan March 17, 2023, 8:24 p.m. UTC | #1

Patrick Steinhardt <ps@pks.im> writes:
>     1. We want to take control of the reference updates so that we can
>        atomically update all or a subset of references that git-fetch
>        would have updated.
> 
>     2. We want to be able to quarantine objects in a fetch so that we
>        can e.g. perform consistency checks for them before they land in
>        the main repository.

If you want to do this, something that might be possible is to change
the RHS of the refspecs to put the refs in a namespace of your choice
(e.g. ...:refs/<UUID>/...) and then you can look at what's generated and
process them as you wish.
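
A minimal sketch of that refspec rewrite, assuming a caller that builds
the refspec strings itself. The helper name and the choice to drop a
leading "refs/" from the destination are illustrative assumptions, not
existing Git code.

```c
#include <stdio.h>
#include <string.h>

/*
 * Rewrite the right-hand side of "<src>:<dst>" so that it lands below
 * "refs/<ns>/". A leading "refs/" on <dst> is dropped first, and a
 * leading '+' on <src> is kept as-is because it is part of the left side.
 * Returns 0 on success, -1 if there is no ':' or the result does not fit.
 */
static int namespace_refspec(char *out, size_t size,
			     const char *refspec, const char *ns)
{
	const char *colon = strchr(refspec, ':');
	const char *dst;
	int n;

	if (!colon)
		return -1;
	dst = colon + 1;
	if (!strncmp(dst, "refs/", 5))
		dst += 5;
	n = snprintf(out, size, "%.*s:refs/%s/%s",
		     (int)(colon - refspec), refspec, ns, dst);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With a namespace of "tmp-1234", the refspec "+refs/heads/*:refs/heads/*"
becomes "+refs/heads/*:refs/tmp-1234/heads/*".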

>     - There should be as little global state as possible. This is to reduce
>       confusion and having to repeat the same incantations in multiple
>       different locations.

Makes sense.

>     - The logic should be as self-contained as possible. This is so that
>       it can easily be changed in a subsequent patch series.

Also makes sense, but I think that some of your patches might be
contrary to this goal (more details below).

I've read all the patches, but will just summarize my thoughts here.

> Patrick Steinhardt (8):
>   fetch: rename `display` buffer to avoid name conflict

One other way, as others have discussed, is to just name the new
variable display_state. (I would prefer that, at the very least so
that in case someone else has a patch that contains the identifier
"display", problems would be more easily noticed. This is very unlikely
to happen but I think it's a good general direction for the Git project
to follow.)

>   fetch: move reference width calculation into `display_state`
>   fetch: move output format into `display_state`
>   fetch: pass the full local reference name to `format_display`

All these are good changes that I would be happy to see merged.

>   fetch: deduplicate handling of per-reference format

I'm not so sure that this is the correct abstraction. I think that this
and the last patch might be counterproductive to your stated goal of
having one more mode of printing the refs, in fact, since when we have
that new mode, the format would be different but the printing would
remain (so we should split the format and printing).

>   fetch: deduplicate logic to print remote URL

Makes sense, although I would need to consider only storing the
raw URL in the struct display_state and processing it when it needs
to be emitted (haven't checked if this is feasible, though).

>   fetch: fix inconsistent summary width for pruned and updated refs

This changes the behavior in that the summary width, even when printing
the summary of pruned refs, is computed based only on the updated refs.
The summary width might need to remain out of the struct display_state
for now.

>   fetch: centralize printing of reference updates

Same as "fetch: deduplicate handling of per-reference format".

Patrick Steinhardt March 20, 2023, 6:57 a.m. UTC | #2

On Fri, Mar 17, 2023 at 01:24:49PM -0700, Jonathan Tan wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> >     1. We want to take control of the reference updates so that we can
> >        atomically update all or a subset of references that git-fetch
> >        would have updated.
> > 
> >     2. We want to be able to quarantine objects in a fetch so that we
> >        can e.g. perform consistency checks for them before they land in
> >        the main repository.
> 
> If you want to do this, something that might be possible is to change
> the RHS of the refspecs to put the refs in a namespace of your choice
> (e.g. ...:refs/<UUID>/...) and then you can look at what's generated and
> process them as you wish.

There are two major problems with this, unfortunately:

- We want to use the machine-parseable format in our repository
  mirroring functionality, where you can easily end up fetching
  thousands or even hundreds of thousands of references. If you need to
  write all of them anew in a first step then you'll end up slower than
  before.

- Repository mirroring is comparatively flexible in what it allows. Most
  importantly, it gives you the option to say that divergent references
  should not be updated at all, which translates into an unforced fetch.
  It's even possible to have fetches with mixed forced and unforced
  reference updates. So if we fetched into a separate namespace first,
  we'd now have to reimplement checks for forced updates in Gitaly so
  that we correctly update only those refs that would have been updated
  by Git. We'd also need to manually figure out deleted references.

  This would be quite a risky change and would duplicate a lot of
  knowledge. Furthermore, merging the two sets of references would
  likely be quite expensive performance-wise.

Also, even if we did have a different RHS, it still wouldn't fix the
issue that objects are written into the main object database directly.
Ideally, we'd really only accept changes into the repository once we
have fully verified all of them. Right now it can happen that we refuse
a fetch, but the objects would continue to exist in the repository.

A second motivation for the quarantine directory is so that we can
enumerate all objects that are indeed new. This will eventually be used
to implement more efficient replication of the repository, where we can
theoretically just take all of the fetched objects in the quarantine
object directory and copy them to the replicas of that repository.

[snip]
> >   fetch: deduplicate handling of per-reference format
> 
> I'm not so sure that this is the correct abstraction. I think that this
> and the last patch might be counterproductive to your stated goal of
> having one more mode of printing the refs, in fact, since when we have
> that new mode, the format would be different but the printing would
> remain (so we should split the format and printing).

I already have the full implementation of the new machine-parseable
format available locally, but didn't want to send it as part of this
patch series yet to avoid it becoming overly large. But I can say that
this change really did make the end goal easier to achieve, due to two
reasons:

- If we continued to handle the per-reference format at the different
  callsites, I'd have to also amend each of the callers when introducing
  the new format as we're going to use a different format there. But
  when doing this in `format_display()`, we really only need to have a
  single switch at the beginning to check whether to use the machine
  parseable format or the other one.

- Currently, all reference updates are printed to stderr. As stderr is
  also used to display errors and the progress bar, this really makes it
  not a good fit for the machine-parseable format. Instead, I decided
  that it would make more sense to print the new format to stdout. And
  by having the printing logic self-contained we again only have a
  single location we need to change.
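
The two points above can be sketched roughly as follows. All names here
are hypothetical stand-ins (the real `struct display_state` and
`format_display()` in builtin/fetch.c look different), and both line
layouts are simplified assumptions.

```c
#include <stdio.h>

/* Hypothetical names; not the definitions used in builtin/fetch.c. */
enum display_format_sketch {
	DISPLAY_FORMAT_HUMAN,
	DISPLAY_FORMAT_PORCELAIN,
};

struct display_state_sketch {
	enum display_format_sketch format;
};

/* Machine-parseable lines go to stdout; everything else stays on stderr. */
static FILE *display_stream(const struct display_state_sketch *state)
{
	return state->format == DISPLAY_FORMAT_PORCELAIN ? stdout : stderr;
}

/* The single place that decides how one reference update is rendered. */
static int format_update(char *buf, size_t size,
			 const struct display_state_sketch *state,
			 const char *summary, const char *remote,
			 const char *local)
{
	switch (state->format) {
	case DISPLAY_FORMAT_PORCELAIN:
		return snprintf(buf, size, "%s %s %s\n", summary, remote, local);
	case DISPLAY_FORMAT_HUMAN:
	default:
		return snprintf(buf, size, " %s %s -> %s\n", summary, remote, local);
	}
}

/* Callers only ever use this function; they never pick a format or stream. */
static void display_update(const struct display_state_sketch *state,
			   const char *summary, const char *remote,
			   const char *local)
{
	char buf[1024];

	format_update(buf, sizeof(buf), state, summary, remote, local);
	fputs(buf, display_stream(state));
}
```

Adding another format then means adding one enum value and one `case`,
while the callers of `display_update()` stay untouched.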

I realize though that all of this isn't as well-documented in the commit
messages as it should be, which is also something that Junio complained
about. I'll hopefully do a better job in v2 of this patch series.

> >   fetch: deduplicate logic to print remote URL
> 
> Makes sense, although I would need to consider only storing the
> raw URL in the struct display_state and processing it when it needs
> to be emitted (haven't checked if this is feasible, though).
> 
> >   fetch: fix inconsistent summary width for pruned and updated refs
> 
> This changes the behavior in that the summary width, even when printing
> the summary of pruned refs, is computed based only on the updated refs.
> The summary width might need to remain out of the struct display_state
> for now.

Fair, that's a case I didn't yet consider. I'll have another look.

Patrick

Patrick Steinhardt March 20, 2023, 12:26 p.m. UTC | #3

On Fri, Mar 17, 2023 at 01:24:49PM -0700, Jonathan Tan wrote:
> Patrick Steinhardt <ps@pks.im> writes:
[snip]
> >   fetch: deduplicate logic to print remote URL
> 
> Makes sense, although I would need to consider only storing the
> raw URL in the struct display_state and processing it when it needs
> to be emitted (haven't checked if this is feasible, though).

We likely could, but right now the benefit isn't all that high. If the
URL was only used in `display_ref_update()` then this would be easy
enough to do. But we also access the sanitized URL when the connectivity
check fails or when printing to FETCH_HEAD.

If we provided an accessor function that returns the URL it would be
trivial to do, but what do we really gain here? In the best case we save
an allocation for the URL and two loops ranging over it. That doesn't
quite feel worth it to me.
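
For reference, such an accessor could look roughly like the sketch
below. The struct, the function name and the sanitization rules
(dropping trailing slashes and a trailing ".git") are assumptions made
for illustration, not the code in builtin/fetch.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical state; only the raw URL is stored up front. */
struct url_state_sketch {
	const char *raw_url; /* as passed in; not owned */
	char *sanitized_url; /* computed on first access; owned */
};

/*
 * Return the sanitized URL, computing and caching it on first use.
 * Returns NULL if the allocation fails.
 */
static const char *get_sanitized_url(struct url_state_sketch *state)
{
	size_t len;

	if (state->sanitized_url)
		return state->sanitized_url;

	len = strlen(state->raw_url);
	/* Assumed sanitization: drop trailing slashes, then a trailing ".git". */
	while (len > 0 && state->raw_url[len - 1] == '/')
		len--;
	if (len > 4 && !strncmp(state->raw_url + len - 4, ".git", 4))
		len -= 4;

	state->sanitized_url = malloc(len + 1);
	if (!state->sanitized_url)
		return NULL;
	memcpy(state->sanitized_url, state->raw_url, len);
	state->sanitized_url[len] = '\0';
	return state->sanitized_url;
}
```

The allocation and the scan over the URL are skipped only when nothing
ever asks for the sanitized form, which is the small saving discussed
above.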

Patrick