diff mbox series

[4/4] Doc: push with --base

Message ID 6250c13897e3cc01f247d80c148cf8dc5e7f3ad0.1604362701.git.jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series "Push" protocol change proposal: user-specified base | expand

Commit Message

Jonathan Tan Nov. 3, 2020, 12:26 a.m. UTC
Includes protocol documentation and a design document.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/pack-protocol.txt  | 10 ++--
 Documentation/technical/push-with-base.txt | 61 ++++++++++++++++++++++
 2 files changed, 68 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/technical/push-with-base.txt

Comments

Jonathan Nieder Nov. 3, 2020, 5:35 a.m. UTC | #1
(+cc: Elijah Newren and Derrick Stolee because of [1])
Hi,

Jonathan Tan wrote:

> Includes protocol documentation and a design document.

Thanks for writing this.

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/technical/pack-protocol.txt  | 10 ++--
>  Documentation/technical/push-with-base.txt | 61 ++++++++++++++++++++++
>  2 files changed, 68 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/technical/push-with-base.txt
> 
> diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
> index e13a2c064d..0485616701 100644
> --- a/Documentation/technical/pack-protocol.txt
> +++ b/Documentation/technical/pack-protocol.txt
> @@ -59,9 +59,9 @@ Parameters", and are supported by the Git, SSH, and HTTP protocols.
>  Each Extra Parameter takes the form of `<key>=<value>` or `<key>`.
>  
>  Servers that receive any such Extra Parameters MUST ignore all
> -unrecognized keys. Currently, the only Extra Parameter recognized is
> -"version" with a value of '1' or '2'.  See protocol-v2.txt for more
> -information on protocol version 2.
> +unrecognized keys. Currently, the only Extra Parameters recognized are
> +"version" with a value of '1' or '2' and, for push, "base" with an OID.  See
> +protocol-v2.txt for more information on protocol version 2.

It's nice to see an example of extra parameters being useful in
protocol v0 as well. :)

>  
>  Git Transport
>  -------------
> @@ -506,6 +506,10 @@ real difference is that the capability listing is different - the only
>  possible values are 'report-status', 'report-status-v2', 'delete-refs',
>  'ofs-delta', 'atomic' and 'push-options'.
>  
> +If a "base=<oid>" Extra Parameter was sent by the client, and the
> +server recognizes that object, the server MAY send "<oid> .have" in
> +lieu of all the reference obj-ids and names.
> +

Can this only appear once, or is it permitted to pass multiple oids
this way?

I'm also curious about what Junio asked about elsewhere in the thread:
for cases that benefit from more complex negotiation than the client
proposing a particular oid, what comes next in this roadmap?  I like
that this is an optional feature so we could clean up later by
removing support for it; do we expect to?  If we do expect to, is
there anything we could do to minimize the impact of the feature later
(e.g. using a less short-and-sweet key name than `base`, maybe)?

>  Reference Update Request and Packfile Transfer
>  ----------------------------------------------
>  
> diff --git a/Documentation/technical/push-with-base.txt b/Documentation/technical/push-with-base.txt
> new file mode 100644
> index 0000000000..d56aa7f900
> --- /dev/null
> +++ b/Documentation/technical/push-with-base.txt
> @@ -0,0 +1,61 @@
> +Push with base design notes
> +===========================
> +
> +This feature allows clients, when pushing, to indicate that a
> +certain object is an ancestor of all pushed commits and that they
> +believe that the server knows of this object. This in turn allows
> +servers to send an abbreviated ref advertisement containing only that
> +object.
> +
> +Besides bandwidth savings, this also ensures that the ref
> +advertisement contains information relevant to the client. For
> +example, at least one project (Gerrit [1]) have included workarounds
> +to send ancestors of refs that move often, even though the ref
> +advertisement is only meant to contain refs.
> +
> +[1] https://gerrit.googlesource.com/gerrit/+/refs/heads/master/java/com/google/gerrit/server/git/receive/HackPushNegotiateHook.java
> +
> +
> +Design overview
> +---------------
> +
> +The "base" being sent is sent as an Extra Parameter, supported in the
> +git://, ssh://, and http(s):// protocols. By sending it as an Extra
> +Parameter, the server is aware of this parameter before it generates
> +the ref advertisement, thus making it able to tailor the ref
> +advertisement accordingly. Sending it as an Extra Parameter also makes
> +this protocol backwards-compatible, as servers will ignore any Extra
> +Parameters they do not understand. (The push will then proceed as if
> +neither party had this feature.)
> +
> +The remote helper protocol has been extended to support the
> +"push-base" capability and an option of the same name. When a remote
> +helper advertises this capability, it thus indicates that it supports
> +this option. Git then will send "option push-base" if the user
> +specifies it when invoking "git push".
> +
> +The remote-curl remote helper bundled with Git has been updated to
> +support this capability and option.

Since this is an optional remote helper capability, it could be
removed later.  Good (but the capability and option names would still
need to be reserved).

> +
> +
> +Future work
> +-----------
> +
> +In the future, we might want a way to automatically determine the base
> +instead of always having the user specify it. However, this does not
> +make obsolete any of the current work - once the base is automatically
> +determined, we still need this protocol to communicate it to the
> +server, and allowing the user to specify the base manually is still
> +useful.

Makes sense.  I think this isn't "might", but "would"; most users
that do not know why their push is slow wouldn't know to use this
feature.

> +
> +
> +Alternatives
> +------------
> +
> +- Making a more substantial protocol change like "fetch" protocol v2.
> +  This would eliminate the need for some of the remote helper updates;
> +  as part of the protocol change, the protocol could be made to
> +  support "stateless-connect" and thus no remote helper updates (like
> +  "push-base") would be needed. For "fetch", the protocol change has
> +  enabled features like wanted-refs and packfile-uris, but I do not
> +  have any similar ideas in mind for "push".

I think you're saying that we don't need a "push" v2 because v0
already has what a user would want.

Git protocol v2 for fetch brought two major changes:

- it changed the response for the initial request, allowing
  abbreviating the ref advertisement at last

- it defined a structure for requests and responses, simplifying the
  addition of later protocol improvements.  In particular, because the
  initial response is a capability advertisement, it allows changing
  the ref advertisement format more in the future.

Both of those changes would be valuable for push.  The ref
advertisements are large, and matching the structure of commands used
by fetchv2 would make debugging easier.

There are some specific applications I'm interested in after that
(e.g., pushing symrefs), but the fundamental extensibility improvement
is larger than any particular application I could think of.

That said, I'm not against experimenting with extra parameters before
we go there, as a way of getting more information about what a
workable negotiation for push looks like.  The push ref advertisement
serves two roles: it advertises commits, to serve as negotiation, and
it tells the client the current values of refs they may want to
update, so they can send an appropriate compare-and-swap command for
a force-push.  If we have a workable replacement negotiation system,
then the other role can be replaced with something simpler.  For
example:

- listing ref updates before the pack, to allow the server to bail
  out early if they would be rejected, and

- allowing an explicit force update of a ref instead of requiring the
  client to compare-and-swap if they don't care about the old value
  (but taking care to still make the client "fetch first" when
  appropriate!)

Thanks,
Jonathan

[1] https://lore.kernel.org/git/CABPp-BEswHLhymQ1_07g3qqu=7kFR3eQyAHR0qMgSvi6THy=zQ@mail.gmail.com/
Derrick Stolee Nov. 3, 2020, 1:53 p.m. UTC | #2
On 11/2/2020 7:26 PM, Jonathan Tan wrote:
> Includes protocol documentation and a design document.

>  Servers that receive any such Extra Parameters MUST ignore all
> -unrecognized keys. Currently, the only Extra Parameter recognized is
> -"version" with a value of '1' or '2'.  See protocol-v2.txt for more
> -information on protocol version 2.
> +unrecognized keys. Currently, the only Extra Parameters recognized are
> +"version" with a value of '1' or '2' and, for push, "base" with an OID.  See
> +protocol-v2.txt for more information on protocol version 2.
>  
>  Git Transport
>  -------------
> @@ -506,6 +506,10 @@ real difference is that the capability listing is different - the only
>  possible values are 'report-status', 'report-status-v2', 'delete-refs',
>  'ofs-delta', 'atomic' and 'push-options'.
>  
> +If a "base=<oid>" Extra Parameter was sent by the client, and the
> +server recognizes that object, the server MAY send "<oid> .have" in
> +lieu of all the reference obj-ids and names.
> +

nit: no comma before "and the server recognizes" as these are both
conditions of the "if". My preference is also to include a "then"
before the result, especially with compound conditions like this.
Perhaps also make it more active voice while we are here?

+ If the client sent a "base=<oid>" Extra Parameter and the server
+ recognizes that object, then the server MAY send "<oid> .have" in
+ lieu of all the reference object ids and names.

> +Push with base design notes
> +===========================

Perhaps start with a brief problem statement?

+ If a client wishes to push data to a server without first running
+ `git fetch`, then they might not have local copies of the objects
+ at the server's ref tips. The client cannot compute an adequate
+ common base and will send more objects than necessary.

Then,

+ The "push with base" feature allows...

> +This feature allows clients, when pushing, to indicate that a
> +certain object is an ancestor of all pushed commits and that they
> +believe that the server knows of this object. This in turn allows
> +servers to send an abbreviated ref advertisement containing only that
> +object.

Why only one base? Could there be value in sending multiple possible
bases and the server can report the subset that they know about?

> +Besides bandwidth savings,

It seems you are underselling the bandwidth savings here, because
we save in both the ref advertisement and the push pack size. It
might be good to call out both metrics so we can see that the ref
improvement works even for cases where the client recently fetched
and can figure out the proper base on its own!

> this also ensures that the ref
> +advertisement contains information relevant to the client. For
> +example, at least one project (Gerrit [1]) have included workarounds
> +to send ancestors of refs that move often, even though the ref
> +advertisement is only meant to contain refs.
> +
> +[1] https://gerrit.googlesource.com/gerrit/+/refs/heads/master/java/com/google/gerrit/server/git/receive/HackPushNegotiateHook.java
> +
> +
> +Design overview
> +---------------
> +
> +The "base" being sent is sent as an Extra Parameter, supported in the
> +git://, ssh://, and http(s):// protocols. By sending it as an Extra
> +Parameter, the server is aware of this parameter before it generates
> +the ref advertisement, thus making it able to tailor the ref
> +advertisement accordingly. Sending it as an Extra Parameter also makes
> +this protocol backwards-compatible, as servers will ignore any Extra
> +Parameters they do not understand. (The push will then proceed as if
> +neither party had this feature.)

I'm not familiar enough with the extra parameter logic to know if we
allow multi-valued extra parameters. It would be good to relax the
"base" parameter to either allow multiple instances of the parameter
or to have something like a comma-separated list of OIDs be allowed.

This might be particularly important for multiple refs being pushed
at the same time.

> +The remote helper protocol has been extended to support the
> +"push-base" capability and an option of the same name. When a remote
> +helper advertises this capability, it thus indicates that it supports
> +this option. Git then will send "option push-base" if the user
> +specifies it when invoking "git push".
> +
> +The remote-curl remote helper bundled with Git has been updated to
> +support this capability and option.> +
> +
> +Future work
> +-----------
> +
> +In the future, we might want a way to automatically determine the base
> +instead of always having the user specify it. However, this does not
> +make obsolete any of the current work - once the base is automatically
> +determined, we still need this protocol to communicate it to the
> +server, and allowing the user to specify the base manually is still
> +useful.

It is appropriate to separate the protocol ability from the automatic
computation on the client. It would be nice to know how you expect that
to work.

For the single-ref push case, my default instinct would be to find a
merge-base between that ref and all of the remote refs matching the
remote we are pushing to. Bonus points if we find _all_ (maximal)
merge-bases!

The equivalent for the multi-ref push might be to find the boundary
commits from something like the following command:

	git rev-list --boundary \
		--not refs/remotes/origin/A \
		--not refs/remotes/origin/B \
		--not refs/remotes/origin/C \
		refs/heads/toPush/1 \
		refs/heads/toPush/2 \
		refs/heads/toPush/3

But you might have something better in mind.

> +
> +
> +Alternatives
> +------------

Odd to have a single bullet in a bulleted list.

> +- Making a more substantial protocol change like "fetch" protocol v2.

This isn't a full sentence.

Perhaps: "One considered approach was to introduce a multi-phase
negotiation step similar to how 'git fetch' operates."

> +  This would eliminate the need for some of the remote helper updates;
> +  as part of the protocol change, the protocol could be made to
> +  support "stateless-connect" and thus no remote helper updates (like
> +  "push-base") would be needed. For "fetch", the protocol change has
> +  enabled features like wanted-refs and packfile-uris, but I do not
> +  have any similar ideas in mind for "push".

I think that you should mention that a negotiation phase here would
use significantly more bandwidth to do the ref advertisements on both
ends with only a rare case that the negotiated bases are better than
the locally-computed bases. Everything is about tradeoffs here!

I'm generally excited about the opportunities here. I'd love to see
some measurements for reduced ref advertisements and reduced object
counts in the pushed pack.

Thanks,
-Stolee
Jeff King Nov. 3, 2020, 3:18 p.m. UTC | #3
On Mon, Nov 02, 2020 at 09:35:54PM -0800, Jonathan Nieder wrote:

> I think you're saying that we don't need a "push" v2 because v0
> already has what a user would want.
> 
> Git protocol v2 for fetch brought two major changes:
> 
> - it changed the response for the initial request, allowing
>   abbreviating the ref advertisement at last
> 
> - it defined a structure for requests and responses, simplifying the
>   addition of later protocol improvements.  In particular, because the
>   initial response is a capability advertisement, it allows changing
>   the ref advertisement format more in the future.
> 
> Both of those changes would be valuable for push.  The ref
> advertisements are large, and matching the structure of commands used
> by fetchv2 would make debugging easier.
> 
> There are some specific applications I'm interested in after that
> (e.g., pushing symrefs), but the fundamental extensibility improvement
> is larger than any particular application I could think of.

You pretty much summed up what I was going to respond. :)

But I'd go further here...

> That said, I'm not against experimenting with extra parameters before
> we go there, as a way of getting more information about what a
> workable negotiation for push looks like.

I'd prefer to avoid doing this as an extra parameter for a few reasons:

  - once it's in a released version, it's much harder for us to take it
    away

  - the extra parameters area is a hack that helped us bootstrap v2. We
    could probably use the same hack to bootstrap v3, etc. But it has
    limitations for stuffing in arbitrary data. An obvious one is size.
    We can transmit a single base, but would be limited if we wanted to
    be able to send multiple. We already ran into this once with the
    "symref=foo:bar" capability overflowing pkt-line limits. Here I'm
    not even sure what the limits might be (it's subject to things like
    how big an HTTP header a proxy will pass, or how large an
    environment variable an ssh implementation supports)

  - it potentially pushes more data/work outside of the git protocol
    itself. E.g., web servers have to translate Git-Protocol headers
    into the GIT_PROTOCOL environment for v2. I guess this new field
    works in our tests because we copy the header's value entirely in
    our apache.conf. But I wonder how many systems in the wild may only
    work if it contains "version=2".

-Peff
Junio C Hamano Nov. 3, 2020, 5:35 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> On Mon, Nov 02, 2020 at 09:35:54PM -0800, Jonathan Nieder wrote:
>
>> I think you're saying that we don't need a "push" v2 because v0
>> already has what a user would want.
>> 
>> Git protocol v2 for fetch brought two major changes:
>> 
>> - it changed the response for the initial request, allowing
>>   abbreviating the ref advertisement at last
>> 
>> - it defined a structure for requests and responses, simplifying the
>>   addition of later protocol improvements.  In particular, because the
>>   initial response is a capability advertisement, it allows changing
>>   the ref advertisement format more in the future.
>> 
>> Both of those changes would be valuable for push.  The ref
>> advertisements are large, and matching the structure of commands used
>> by fetchv2 would make debugging easier.
>> 
>> There are some specific applications I'm interested in after that
>> (e.g., pushing symrefs), but the fundamental extensibility improvement
>> is larger than any particular application I could think of.
>
> You pretty much summed up what I was going to respond. :)
>
> But I'd go further here...
>
>> That said, I'm not against experimenting with extra parameters before
>> we go there, as a way of getting more information about what a
>> workable negotiation for push looks like.
>
> I'd prefer to avoid doing this as an extra parameter for a few reasons:
>
>   - once it's in a released version, it's much harder for us to take it
>     away
>
>   - the extra parameters area is a hack that helped us bootstrap v2. We
>     could probably use the same hack to bootstrap v3, etc. But it has
>     limitations for stuffing in arbitrary data. An obvious one is size.
>     We can transmit a single base, but would be limited if we wanted to
>     be able to send multiple. We already ran into this once with the
>     "symref=foo:bar" capability overflowing pkt-line limits. Here I'm
>     not even sure what the limits might be (it's subject to things like
>     how big an HTTP header a proxy will pass, or how large an
>     environment variable an ssh implementation supports)
>
>   - it potentially pushes more data/work outside of the git protocol
>     itself. E.g., web servers have to translate Git-Protocol headers
>     into the GIT_PROTOCOL environment for v2. I guess this new field
>     works in our tests because we copy the header's value entirely in
>     our apache.conf. But I wonder how many systems in the wild may only
>     work if it contains "version=2".

I do not have much to add to what has been said so far, other than
offering historical perspective.

The single biggest reason why "fetch" has common ancestor discovery
negotiation and "push" does not is because the design comes from the
use case the inventor of Git and those worked on the early protocol
wanted to support---you are pushing into your own repository you
alone push into, your work is disseminated to others who fetch from
your repository, and you get others' work by fetching from theirs.

In such a world without a central server where everybody pushes
into, by definition, a pusher knows all the objects that have ever
been pushed into the receiving repository when running "git push".
They are all objects that passed through the repository you are
pushing from to the receiving repository in your previous pushes.
The advertised ref(s) are expected to be known to the repository you
are pusing from anyway, and if that is not the case, you would first
fetch from there before force-pushing.

Hence, when you push, there isn't much need to walk back from the
tip of refs at the remote to discover common ancestor like we do for
the fetch side in pre-central-server world.

On the other hand, you expect that remote refs point at objects
unknown to you when you fetch from your colleagues, so it is
expected that you have to perform the common ancestor discovery
negotiation.

After 15 years, we live in a different world.

People expect that a single repository at their hosting sites can be
used as the central meeting point for the project, just like CVS/SVN
servers were in older world.  "git push" would need to accept that
reality and start common ancestor discovery eventually.
Jonathan Tan Nov. 9, 2020, 7:56 p.m. UTC | #5
> People expect that a single repository at their hosting sites can be
> used as the central meeting point for the project, just like CVS/SVN
> servers were in older world.  "git push" would need to accept that
> reality and start common ancestor discovery eventually.

Thanks for your reply (and everyone else's). I was thinking that a more
rudimentary form of the feature would suffice, since I wasn't expecting
much more need in the future, but looks like this isn't the case. I'll
be thinking of a more comprehensive idea.
Derrick Stolee Nov. 9, 2020, 9 p.m. UTC | #6
On 11/9/20 2:56 PM, Jonathan Tan wrote:
>> People expect that a single repository at their hosting sites can be
>> used as the central meeting point for the project, just like CVS/SVN
>> servers were in older world.  "git push" would need to accept that
>> reality and start common ancestor discovery eventually.
> 
> Thanks for your reply (and everyone else's). I was thinking that a more
> rudimentary form of the feature would suffice, since I wasn't expecting
> much more need in the future, but looks like this isn't the case. I'll
> be thinking of a more comprehensive idea.

I think this "half round negotiation" idea you have has merit, and can
get us 95% of the benefit that a multi-round negotiation would bring
without those extra steps.

My concerns with the current series is that it isn't fully ready for
even that case. In my mind, a protocol change like this would need:

1. A top-to-bottom implementation that allows a user to opt-in to
   this new behavior with a config setting.

2. A demonstration of situations where this algorithm out-performs
   the existing algorithm (i.e. client is far behind server, but
   topic is a small change based on something in server's history)

3. A clear way to handle odd cases, such as multiple merge-bases.
   This leads to a change in how you are sending the data.

Perhaps this "send multiple OIDs in a payload" is already half-way to
implementing a full negotiation, and we might as well go all the way
in the writing. I expect that sending all maximal merge-bases will be
sufficient for the vast majority of cases, and so any multi-round
negotiation process will almost always end after the client sends that
data.

Looking forward to your next version.

Thanks,
-Stolee
Junio C Hamano Nov. 9, 2020, 9:20 p.m. UTC | #7
Jonathan Tan <jonathantanmy@google.com> writes:

>> People expect that a single repository at their hosting sites can be
>> used as the central meeting point for the project, just like CVS/SVN
>> servers were in older world.  "git push" would need to accept that
>> reality and start common ancestor discovery eventually.
>
> Thanks for your reply (and everyone else's). I was thinking that a more
> rudimentary form of the feature would suffice, since I wasn't expecting
> much more need in the future, but looks like this isn't the case. I'll
> be thinking of a more comprehensive idea.

I said "eventually", meaning that we may not have to solve it
immediately, but judging from the need for ad-hoc workarounds like
sending older commits that are not necessarily at the tip of
anything from the receiving end as if they are tips and then another
ad-hoc workaround like this one, it seems that the need is real.

Would the earlier refactoring of the negotiation part into a
separate negotiator module help, or did the refactor not remove the
deep assumption that it is only about the fetch/upload-pack traffic
and we need a design for push/receive-pack from scratch?

Thanks.
Jeff King Nov. 9, 2020, 9:40 p.m. UTC | #8
On Mon, Nov 09, 2020 at 01:20:14PM -0800, Junio C Hamano wrote:

> Would the earlier refactoring of the negotiation part into a
> separate negotiator module help, or did the refactor not remove the
> deep assumption that it is only about the fetch/upload-pack traffic
> and we need a design for push/receive-pack from scratch?

I haven't thought too deeply about this problem area, but there is one
challenge I'd expect with reusing the existing negotiation protocol and
code: the stateless side is flipped for http.

In the fetch protocol, the client is stateful and makes a series of
requests to a stateless server, each time saying "I want X, we agreed
previously on Y, and here are some more options Z; let me know if that's
enough to generate a pack". If we were to flip that around, the client
must remain the stateful party, but it's no longer the receiver of the
pack. For the receiver (i.e., the stateless server) to iterate through
its history looking for a shard point, I think the message would have to
be more like "last time you told me about X, and gave me cut-points Y to
resume your traversal; tell me more about Y so that I can decide if
we're ready to send a pack".

-Peff
Jonathan Tan Nov. 9, 2020, 10:22 p.m. UTC | #9
> On 11/9/20 2:56 PM, Jonathan Tan wrote:
> >> People expect that a single repository at their hosting sites can be
> >> used as the central meeting point for the project, just like CVS/SVN
> >> servers were in older world.  "git push" would need to accept that
> >> reality and start common ancestor discovery eventually.
> > 
> > Thanks for your reply (and everyone else's). I was thinking that a more
> > rudimentary form of the feature would suffice, since I wasn't expecting
> > much more need in the future, but looks like this isn't the case. I'll
> > be thinking of a more comprehensive idea.
> 
> I think this "half round negotiation" idea you have has merit, and can
> get us 95% of the benefit that a multi-round negotiation would bring
> without those extra steps.

Thanks.

> My concerns with the current series is that it isn't fully ready for
> even that case. In my mind, a protocol change like this would need:
> 
> 1. A top-to-bottom implementation that allows a user to opt-in to
>    this new behavior with a config setting.

Makes sense. The client already needs to opt-in by specifying the base,
but yes, an option needs to be added for the server.

> 2. A demonstration of situations where this algorithm out-performs
>    the existing algorithm (i.e. client is far behind server, but
>    topic is a small change based on something in server's history)

Noted.

> 3. A clear way to handle odd cases, such as multiple merge-bases.
>    This leads to a change in how you are sending the data.
> 
> Perhaps this "send multiple OIDs in a payload" is already half-way to
> implementing a full negotiation, and we might as well go all the way
> in the writing. I expect that sending all maximal merge-bases will be
> sufficient for the vast majority of cases, and so any multi-round
> negotiation process will almost always end after the client sends that
> data.

I think the ability to send multiple bases would work in this scenario,
but this might run into concerns that other reviewers have brought up
(specifically, the fact that we're using environment variables and HTTP
headers to pass this information - and among other things, line length
limits). So a more comprehensive solution might be needed, even if only
for this reason.

> Looking forward to your next version.

Thanks.
Jonathan Tan Nov. 9, 2020, 10:47 p.m. UTC | #10
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> >> People expect that a single repository at their hosting sites can be
> >> used as the central meeting point for the project, just like CVS/SVN
> >> servers were in older world.  "git push" would need to accept that
> >> reality and start common ancestor discovery eventually.
> >
> > Thanks for your reply (and everyone else's). I was thinking that a more
> > rudimentary form of the feature would suffice, since I wasn't expecting
> > much more need in the future, but looks like this isn't the case. I'll
> > be thinking of a more comprehensive idea.
> 
> I said "eventually", meaning that we may not have to solve it
> immediately, but judging from the need for ad-hoc workarounds like
> sending older commits that are not necessarily at the tip of
> anything from the receiving end as if they are tips and then another
> ad-hoc workaround like this one, it seems that the need is real.

That's true. And if we need a more comprehensive solution than putting
bases as Extra Parameters (and don't think that base-in-Extra-Parameter
will form a part of that comprehensive solution), then I do think that
we should go for the comprehensive solution instead of making something
obsolete that we will still need to support later.

> Would the earlier refactoring of the negotiation part into a
> separate negotiator module help, or did the refactor not remove the
> deep assumption that it is only about the fetch/upload-pack traffic
> and we need a design for push/receive-pack from scratch?

The negotiator module might help - looking that the API, it takes a list
of known commons (none in this case, because we want to skip the long
ref advertisement) and a list of tips to check (for this case, the tips
we want to push), and then the negotiation starts. So we might be able
to use it in a push v2.

Having said that, it might be possible to reuse the existing fetch v0/v2
protocol to perform this negotiation (preferably v2, so that we can skip
the ref advertisement). We'll just need to add the opposite of the
"no-done" capability to make sure that the server never sends a
packfile.

Once we have found the base commit(s) either through push v2 or through
fetch+opposite-of-no-done, we'll need to send the packfile somehow. My
proposal here could do it, although then we might run into the problems
Peff describes about Extra Parameters [1]. If we don't use Extra
Parameters, we would probably need a push v2, but we might then run into
similar problems to what we had during the fetch v2 migration (e.g.
unexpected subtly different behaviors).

[1] https://lore.kernel.org/git/20201103151859.GA444466@coredump.intra.peff.net/
diff mbox series

Patch

diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
index e13a2c064d..0485616701 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -59,9 +59,9 @@  Parameters", and are supported by the Git, SSH, and HTTP protocols.
 Each Extra Parameter takes the form of `<key>=<value>` or `<key>`.
 
 Servers that receive any such Extra Parameters MUST ignore all
-unrecognized keys. Currently, the only Extra Parameter recognized is
-"version" with a value of '1' or '2'.  See protocol-v2.txt for more
-information on protocol version 2.
+unrecognized keys. Currently, the only Extra Parameters recognized are
+"version" with a value of '1' or '2' and, for push, "base" with an OID.  See
+protocol-v2.txt for more information on protocol version 2.
 
 Git Transport
 -------------
@@ -506,6 +506,10 @@  real difference is that the capability listing is different - the only
 possible values are 'report-status', 'report-status-v2', 'delete-refs',
 'ofs-delta', 'atomic' and 'push-options'.
 
+If a "base=<oid>" Extra Parameter was sent by the client, and the
+server recognizes that object, the server MAY send "<oid> .have" in
+lieu of all the reference obj-ids and names.
+
 Reference Update Request and Packfile Transfer
 ----------------------------------------------
 
diff --git a/Documentation/technical/push-with-base.txt b/Documentation/technical/push-with-base.txt
new file mode 100644
index 0000000000..d56aa7f900
--- /dev/null
+++ b/Documentation/technical/push-with-base.txt
@@ -0,0 +1,61 @@ 
+Push with base design notes
+===========================
+
+This feature allows clients, when pushing, to indicate that a
+certain object is an ancestor of all pushed commits and that they
+believe that the server knows of this object. This in turn allows
+servers to send an abbreviated ref advertisement containing only that
+object.
+
+Besides bandwidth savings, this also ensures that the ref
+advertisement contains information relevant to the client. For
+example, at least one project (Gerrit [1]) have included workarounds
+to send ancestors of refs that move often, even though the ref
+advertisement is only meant to contain refs.
+
+[1] https://gerrit.googlesource.com/gerrit/+/refs/heads/master/java/com/google/gerrit/server/git/receive/HackPushNegotiateHook.java
+
+
+Design overview
+---------------
+
+The "base" being sent is sent as an Extra Parameter, supported in the
+git://, ssh://, and http(s):// protocols. By sending it as an Extra
+Parameter, the server is aware of this parameter before it generates
+the ref advertisement, thus making it able to tailor the ref
+advertisement accordingly. Sending it as an Extra Parameter also makes
+this protocol backwards-compatible, as servers will ignore any Extra
+Parameters they do not understand. (The push will then proceed as if
+neither party had this feature.)
+
+The remote helper protocol has been extended to support the
+"push-base" capability and an option of the same name. When a remote
+helper advertises this capability, it thus indicates that it supports
+this option. Git then will send "option push-base" if the user
+specifies it when invoking "git push".
+
+The remote-curl remote helper bundled with Git has been updated to
+support this capability and option.
+
+
+Future work
+-----------
+
+In the future, we might want a way to automatically determine the base
+instead of always having the user specify it. However, this does not
+make obsolete any of the current work - once the base is automatically
+determined, we still need this protocol to communicate it to the
+server, and allowing the user to specify the base manually is still
+useful.
+
+
+Alternatives
+------------
+
+- Making a more substantial protocol change like "fetch" protocol v2.
+  This would eliminate the need for some of the remote helper updates;
+  as part of the protocol change, the protocol could be made to
+  support "stateless-connect" and thus no remote helper updates (like
+  "push-base") would be needed. For "fetch", the protocol change has
+  enabled features like wanted-refs and packfile-uris, but I do not
+  have any similar ideas in mind for "push".