[0/2] receive-pack: use advertised reference tips to inform connectivity check

Message ID	cover.1666967670.git.ps@pks.im (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Feedback-ID: i197146af:Fastmail Date: Fri, 28 Oct 2022 16:42:19 +0200 From: Patrick Steinhardt <ps@pks.im> To: git@vger.kernel.org Subject: [PATCH 0/2] receive-pack: use advertised reference tips to inform connectivity check Message-ID: <cover.1666967670.git.ps@pks.im> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="2YtZw0Tfs3ybVKNC" Content-Disposition: inline Precedence: bulk
Series	receive-pack: use advertised reference tips to inform connectivity check \| expand [0/2] receive-pack: use advertised reference tips to inform connectivity check [1/2] connected: allow supplying different view of reachable objects [2/2] receive-pack: use advertised reference tips to inform connectivity check

Message ID

cover.1666967670.git.ps@pks.im (mailing list archive)

Headers

Feedback-ID: i197146af:Fastmail
Date: Fri, 28 Oct 2022 16:42:19 +0200
From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Subject: [PATCH 0/2] receive-pack: use advertised reference tips to inform
 connectivity check
Message-ID: <cover.1666967670.git.ps@pks.im>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
        protocol="application/pgp-signature"; boundary="2YtZw0Tfs3ybVKNC"
Content-Disposition: inline
Precedence: bulk

Series

receive-pack: use advertised reference tips to inform connectivity check | expand

Message

Patrick Steinhardt Oct. 28, 2022, 2:42 p.m. UTC

Hi,

this patch series improves the connectivity check done by stateful
git-receive-pack(1) to only consider references as reachable that have
been advertised to the client. This has two advantages:

    - A client shouldn't assume objects to exist that have not been part
      of the reference advertisement. But if it excluded an object from
      the packfile that is reachable via any ref that is excluded from
      the reference advertisement due to `transfer.hideRefs` we'd have
      accepted the push anyway. I'd argue that this is a bug in the
      current implementation.

    - Second, by using advertised refs as inputs instead of `git
      rev-list --not --all` we avoid looking up all refs that are
      irrelevant to the current push. This can be a huge performance
      improvement in repos that have a huge amount of internal, hidden
      refs. In one of our repos with 7m refs, of which 6.8m are hidden,
      this speeds up pushes from ~30s to ~4.5s.

One downside is that we need to pass in the object IDs that were part of
the reference advertisement via the standard input, which is seemingly
slower than reading them from the refdb. I'm discussing this in the
second commit.

Patrick

Patrick Steinhardt (2):
  connected: allow supplying different view of reachable objects
  receive-pack: use advertised reference tips to inform connectivity
    check

 builtin/receive-pack.c | 31 ++++++++++++++++++++++---------
 connected.c            |  9 ++++++++-
 connected.h            |  7 +++++++
 3 files changed, 37 insertions(+), 10 deletions(-)

Comments

Junio C Hamano Oct. 28, 2022, 4:40 p.m. UTC | #1

Patrick Steinhardt <ps@pks.im> writes:

> Hi,
>
> this patch series improves the connectivity check done by stateful
> git-receive-pack(1) to only consider references as reachable that have
> been advertised to the client. This has two advantages:
>
>     - A client shouldn't assume objects to exist that have not been part
>       of the reference advertisement. But if it excluded an object from
>       the packfile that is reachable via any ref that is excluded from
>       the reference advertisement due to `transfer.hideRefs` we'd have
>       accepted the push anyway. I'd argue that this is a bug in the
>       current implementation.

I agree that it is bad to accept an incoming pack that depends on an
object that is *only* reachable by a hidden ref, but what you said
above is a bit stronger than that.  Use transfer.hideRefs does not
have to be to hide objects (e.g. I could hide the tip of the
branches that holds tips of older maintenance tracks, just to
unclutter, and giving a pack that depends on older parts of history
is just fine).

Let's let it pass, as the cover letter material won't become part of
the permanent history ;-)

>     - Second, by using advertised refs as inputs instead of `git
>       rev-list --not --all` we avoid looking up all refs that are
>       irrelevant to the current push. This can be a huge performance
>       improvement in repos that have a huge amount of internal, hidden
>       refs. In one of our repos with 7m refs, of which 6.8m are hidden,
>       this speeds up pushes from ~30s to ~4.5s.

Yes.

> One downside is that we need to pass in the object IDs that were part of
> the reference advertisement via the standard input, which is seemingly
> slower than reading them from the refdb. I'm discussing this in the
> second commit.

Interesting.

Taylor Blau Nov. 1, 2022, 1:30 a.m. UTC | #2

On Fri, Oct 28, 2022 at 04:42:19PM +0200, Patrick Steinhardt wrote:
> this patch series improves the connectivity check done by stateful
> git-receive-pack(1) to only consider references as reachable that have
> been advertised to the client. This has two advantages:

A third advantage which I didn't see mentioned here is that we have a
better chance of being helped by reachability bitmaps here. Because we
aren't likely to cover hidden references with bitmaps, we would be
paying a significant cost to build up a complete picture of what's
reachable from --all.

That will inevitably involve a lot of traversing from unbitmapped
reference tips (i.e. the hidden ones) down to either a random bitmap
sprinkled throughout history, or the root of that line of history.

But if we limited the negated side of our connectivity check to just the
non-hidden references, then we will likely end up with more full bitmap
coverage of those tips, without having to do any additional traversal.

So that is definitely something worth experimenting with.

Thanks,
Taylor

Jeff King Nov. 1, 2022, 9 a.m. UTC | #3

On Fri, Oct 28, 2022 at 04:42:19PM +0200, Patrick Steinhardt wrote:

>     - A client shouldn't assume objects to exist that have not been part
>       of the reference advertisement. But if it excluded an object from
>       the packfile that is reachable via any ref that is excluded from
>       the reference advertisement due to `transfer.hideRefs` we'd have
>       accepted the push anyway. I'd argue that this is a bug in the
>       current implementation.

Like others, I don't think this is a bug exactly. We'd never introduce a
corruption. We're just more lenient with clients than we need to be.

But I don't think your scheme changes that. In a sense, the tips used by
"rev-list --not --all" are really an optimization. We will walk the
history from the to-be-updated ref tips all the way down to the roots if
we have to. So imagine that I have object X which is not referenced at
all (neither hidden nor visible ref). We obviously do not advertise it
to the client, but let's further imagine that a client sends us a pack
with X..Y, and a request to update some ref to Y.

Both before and after your code, if rev-list is able to walk down from Y
until either we hit all roots or all UNINTERESTING commits, it will be
satisfied. So as long as the receiving repo actually has all of the
history leading up to X, it will allow the push, regardless of your
patch.

If we wanted to stop being lenient, we'd have to actually check that
every object we traverse is either reachable, or came from the
just-pushed pack.

There's also a subtle timing issue here. Our connectivity check happens
after we've finished receiving the pack. So not only are we including
hidden refs, but we are using the ref state at the end of the push
(after receiving and processing the incoming pack), rather than the
beginning.

From the same "leniency" lens this seems like the wrong thing. But as
above, it doesn't matter in practice, because these tips are really an
optimization to tell rev-list that it can stop traversing.

If you think of the connectivity check less as "did the client try to
cheat" and more as "is it OK to update these refs without introducing a
corruption", then it makes sense that you'd want to do read the inputs
to the check as close to the ref update as possible, because it shrinks
the race window which could introduce corruption.

Imagine a situation like this:

  0. We advertise to client that we have commit X.

  1. Client starts pushing up a pack with X..Y and asks to update some
     branch to Y.

  2. Meanwhile, the branch with X is deleted, and X is pruned.

  3. Server finishes receiving the pack. All looks good, and then we
     start a connectivity check.

In the current code, that check starts with the current ref state (with
X deleted) as a given, and makes sure that we have the objects we need
to update the refs. After your patches, it would take X as a given, and
stop traversing when we see it.

That same race exists before your patch, but it's between the time of
"rev-list --not --all" running and the ref update. After your patch,
it's between the advertisement and the ref update, which can be a long
time (hours or even days, if the client is very slow).

In practice I'm not sure how big a deal this is. If we feed the
now-pruned X to rev-list, it may notice that X went away, though we've
been reducing the number of checks there in the name of efficiency
(e.g., if it's still in the commit graph, we'd say "OK, good enough"
these days, even if we don't have it on disk anymore).

But it feels like a wrong direction to make that race longer if there's
no need to.

So all that said...

>     - Second, by using advertised refs as inputs instead of `git
>       rev-list --not --all` we avoid looking up all refs that are
>       irrelevant to the current push. This can be a huge performance
>       improvement in repos that have a huge amount of internal, hidden
>       refs. In one of our repos with 7m refs, of which 6.8m are hidden,
>       this speeds up pushes from ~30s to ~4.5s.

I like the general direction here of avoiding the hidden refs. The
client _shouldn't_ have been using them, so we can optimistically assume
they're useless (and in the case of races or other weirdness, rev-list
just ends up traversing a bit further).

But we can split the two ideas in your series:

  1. Feed the advertised tips from receive-pack to rev-list.

  2. Avoid traversing from the hidden tips.

Doing (1) gets you (2) for free. But if we don't want to do (1), and I
don't think we do, we can get (2) by just teaching rev-list to narrow
the check.

I see some discussion in the other part of the thread, and we may need a
new rev-list option to do this, as mentioned there. However, you _might_
be able to do it the existing --exclude mechanism. I.e., something like:

  rev-list --stdin --not --exclude 'refs/hidden/*' --all

The gotchas are:

  - I'm not 100% sure that --exclude globbing and transfer.hideRefs
    syntax are compatible. You'd want to check.

  - these would have to come on the command line (at least with the
    current code). Probably nobody has enough hiderefs patterns for that
    to be a problem (and remember we are passing the glob pattern here,
    not the 6.8M refs themselves). But it could bite somebody in a
    pathological case.

-Peff

Patrick Steinhardt Nov. 1, 2022, 11:49 a.m. UTC | #4

On Tue, Nov 01, 2022 at 05:00:22AM -0400, Jeff King wrote:
> On Fri, Oct 28, 2022 at 04:42:19PM +0200, Patrick Steinhardt wrote:
> 
> >     - A client shouldn't assume objects to exist that have not been part
> >       of the reference advertisement. But if it excluded an object from
> >       the packfile that is reachable via any ref that is excluded from
> >       the reference advertisement due to `transfer.hideRefs` we'd have
> >       accepted the push anyway. I'd argue that this is a bug in the
> >       current implementation.
> 
> Like others, I don't think this is a bug exactly. We'd never introduce a
> corruption. We're just more lenient with clients than we need to be.
> 
> But I don't think your scheme changes that. In a sense, the tips used by
> "rev-list --not --all" are really an optimization. We will walk the
> history from the to-be-updated ref tips all the way down to the roots if
> we have to. So imagine that I have object X which is not referenced at
> all (neither hidden nor visible ref). We obviously do not advertise it
> to the client, but let's further imagine that a client sends us a pack
> with X..Y, and a request to update some ref to Y.
> 
> Both before and after your code, if rev-list is able to walk down from Y
> until either we hit all roots or all UNINTERESTING commits, it will be
> satisfied. So as long as the receiving repo actually has all of the
> history leading up to X, it will allow the push, regardless of your
> patch.

Oh, right! Now I see where my thinko was, which means both you and
Taylor are correct. I somehow assumed that we'd fail the connectivity
check in that case, but all it means is that we now potentially walk
more objects than we'd have done if we used `--not --all`.

> If we wanted to stop being lenient, we'd have to actually check that
> every object we traverse is either reachable, or came from the
> just-pushed pack.

Yes, indeed.

> There's also a subtle timing issue here. Our connectivity check happens
> after we've finished receiving the pack. So not only are we including
> hidden refs, but we are using the ref state at the end of the push
> (after receiving and processing the incoming pack), rather than the
> beginning.
> 
> From the same "leniency" lens this seems like the wrong thing. But as
> above, it doesn't matter in practice, because these tips are really an
> optimization to tell rev-list that it can stop traversing.
> 
> If you think of the connectivity check less as "did the client try to
> cheat" and more as "is it OK to update these refs without introducing a
> corruption", then it makes sense that you'd want to do read the inputs
> to the check as close to the ref update as possible, because it shrinks
> the race window which could introduce corruption.

Agreed.

> Imagine a situation like this:
> 
>   0. We advertise to client that we have commit X.
> 
>   1. Client starts pushing up a pack with X..Y and asks to update some
>      branch to Y.
> 
>   2. Meanwhile, the branch with X is deleted, and X is pruned.
> 
>   3. Server finishes receiving the pack. All looks good, and then we
>      start a connectivity check.
> 
> In the current code, that check starts with the current ref state (with
> X deleted) as a given, and makes sure that we have the objects we need
> to update the refs. After your patches, it would take X as a given, and
> stop traversing when we see it.
> 
> That same race exists before your patch, but it's between the time of
> "rev-list --not --all" running and the ref update. After your patch,
> it's between the advertisement and the ref update, which can be a long
> time (hours or even days, if the client is very slow).
> 
> In practice I'm not sure how big a deal this is. If we feed the
> now-pruned X to rev-list, it may notice that X went away, though we've
> been reducing the number of checks there in the name of efficiency
> (e.g., if it's still in the commit graph, we'd say "OK, good enough"
> these days, even if we don't have it on disk anymore).
> 
> But it feels like a wrong direction to make that race longer if there's
> no need to.

Good point.

> So all that said...
> 
> >     - Second, by using advertised refs as inputs instead of `git
> >       rev-list --not --all` we avoid looking up all refs that are
> >       irrelevant to the current push. This can be a huge performance
> >       improvement in repos that have a huge amount of internal, hidden
> >       refs. In one of our repos with 7m refs, of which 6.8m are hidden,
> >       this speeds up pushes from ~30s to ~4.5s.
> 
> I like the general direction here of avoiding the hidden refs. The
> client _shouldn't_ have been using them, so we can optimistically assume
> they're useless (and in the case of races or other weirdness, rev-list
> just ends up traversing a bit further).
> 
> But we can split the two ideas in your series:
> 
>   1. Feed the advertised tips from receive-pack to rev-list.
> 
>   2. Avoid traversing from the hidden tips.
> 
> Doing (1) gets you (2) for free. But if we don't want to do (1), and I
> don't think we do, we can get (2) by just teaching rev-list to narrow
> the check.
> 
> I see some discussion in the other part of the thread, and we may need a
> new rev-list option to do this, as mentioned there. However, you _might_
> be able to do it the existing --exclude mechanism. I.e., something like:
> 
>   rev-list --stdin --not --exclude 'refs/hidden/*' --all

Yeah, Taylor proposed to add a new `--visible-refs=receive` option that
lets git-rev-list(1) automatically add all references that are visible
when paying attention to `receive.hideRefs`. I actually like this idea
and will likely have a look at how easy or hard it is to implement.

> The gotchas are:
> 
>   - I'm not 100% sure that --exclude globbing and transfer.hideRefs
>     syntax are compatible. You'd want to check.
> 
>   - these would have to come on the command line (at least with the
>     current code). Probably nobody has enough hiderefs patterns for that
>     to be a problem (and remember we are passing the glob pattern here,
>     not the 6.8M refs themselves). But it could bite somebody in a
>     pathological case.
> 
> -Peff

Well, we can avoid these gotchas if we used `--visible-refs`.

Patrick