[5/8] Documentation: add Packfile URIs design doc

Message ID	4eea9d927af1df11cdb0342e969b293a6e317d46.1590789428.git.jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers
  Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org>
  Date: Fri, 29 May 2020 15:30:17 -0700
  In-Reply-To: <cover.1590789428.git.jonathantanmy@google.com>
  Message-Id: <4eea9d927af1df11cdb0342e969b293a6e317d46.1590789428.git.jonathantanmy@google.com>
  Mime-Version: 1.0
  References: <cover.1590789428.git.jonathantanmy@google.com>
  Subject: [PATCH 5/8] Documentation: add Packfile URIs design doc
  From: Jonathan Tan <jonathantanmy@google.com>
  To: git@vger.kernel.org
  Cc: Jonathan Tan <jonathantanmy@google.com>
  Content-Type: text/plain; charset="UTF-8"
  Sender: git-owner@vger.kernel.org
  Precedence: bulk
Series	CDN offloading update
  [0/8] CDN offloading update
  [1/8] http: use --stdin when getting dumb HTTP pack
  [2/8] http: improve documentation of http_pack_request
  [3/8] http-fetch: support fetching packfiles by URL
  [4/8] Documentation: order protocol v2 sections
  [5/8] Documentation: add Packfile URIs design doc
  [6/8] upload-pack: refactor reading of pack-objects out
  [7/8] fetch-pack: support more than one pack lockfile
  [8/8] upload-pack: send part of packfile response as uri

Jonathan Tan May 29, 2020, 10:30 p.m. UTC

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/packfile-uri.txt | 78 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  | 28 ++++++++-
 2 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

Junio C Hamano May 30, 2020, 12:15 a.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/technical/packfile-uri.txt | 78 ++++++++++++++++++++++++
>  Documentation/technical/protocol-v2.txt  | 28 ++++++++-
>  2 files changed, 105 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/technical/packfile-uri.txt
>
> diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
> new file mode 100644
> index 0000000000..6a5a6440d5
> --- /dev/null
> +++ b/Documentation/technical/packfile-uri.txt
> @@ -0,0 +1,78 @@
> +Packfile URIs
> +=============
> +
> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU usage
> +(for example, by serving some data through a CDN), and (in the future) provides
> +some measure of resumability to clients.
> +
> +This feature is available only in protocol version 2.

Yay.

> +Protocol
> +--------
> +
> +The server advertises `packfile-uris`.

Is this a new "protocol capability"?  There are multiple things that
are "advertised" over the wire (there is "reference advertisement")
and readers would want to know if this is yet another kind of
advertisement or a new variety of the "capability".

> +If the client then communicates which protocols (HTTPS, etc.) it supports with
> +a `packfile-uris` argument, the server MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing URIs of any of the given protocols. The URIs point to
> +packfiles that use only features that the client has declared that it supports
> +(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
> +this section.
> +
> +Clients then should understand that the returned packfile could be incomplete,

I am guessing that this merely means that the result of downloading
and installing the packfile does not necessarily make the resulting
repository up-to-date with respect to the "want" the "fetch" command
requested.  But the above can easily be misread that the returned
packfile is somewhat broken, corrupt or missing objects that it
ought to have (e.g. a deltified object lacks its base object in the
file). [#1]

> +and that it needs to download all the given URIs before the fetch or clone is
> +complete.

So if I say "I want history leading to commit X", and choose to use
the packfile-uris that told me to fetch two packfiles P and Q, does
it mean that I only need to fetch P and Q, index-pack them, and store
the resulting two packfiles and their idx files in my repository to
have the history leading to commit X?

I would have guessed that the resulting repository after fetching
these URIs can still be incomplete and the "packfile" section of the
response needs to be processed before the fetch or clone is complete,
but the above does not say so.

> +Server design
> +-------------
> +
> +The server can be trivially made compatible with the proposed protocol by
> +having it advertise `packfile-uris`, tolerating the client sending
> +`packfile-uris`, and never sending any `packfile-uris` section. But we should
> +include some sort of non-trivial implementation in the Minimum Viable Product,
> +at least so that we can test the client.
> +
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for example,
> +servers to delegate serving of large blobs to CDNs.

Let me see if the above is understandable.

Does "git cat-file blob <sha1>" come back when we try to "wget/curl"
the <uri>?

> +Client design
> +-------------
> +
> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.

Same question again.  As URIs are allowed to be incomplete (point #1 above),
it is still too early to declare success after all URIs have been downloaded
as packfiles, isn't it?  Don't the "latest bits" still need to be extracted
out of the usual "packfile" section of the protocol?

> +The division of work (initial fetch + additional URIs) introduces convenient
> +points for resumption of an interrupted clone - such resumption can be done
> +after the Minimum Viable Product (see "Future work").
> +
> +The client can inhibit this feature (i.e. refrain from sending the
> +`packfile-uris` parameter) by passing --no-packfile-uris to `git fetch`.

By default, as long as the other side advertises packfile-uris, the
client automatically attempts to use it and users need to explicitly
disable it?  

It's quite different from the way we introduce new features and I am
wondering if it is a good approach.

> + * On the server, a long-running process that takes in entire requests and
> +   outputs a list of URIs and the corresponding inclusion and exclusion sets of
> +   objects. This allows, e.g., signed URIs to be used and packfiles for common
> +   requests to be cached.

Did we discuss "inclusion and exclusion sets" whatever they are?

> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> index ef7514a3ee..7e1b3a0bfe 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -325,13 +325,26 @@ included in the client's request:
>  	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
>  	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
>  
> +If the 'packfile-uris' feature is advertised, the following argument
> +can be included in the client's request as well as the potential
> +addition of the 'packfile-uris' section in the server's response as
> +explained below.
> +
> +    packfile-uris <comma-separated list of protocols>
> +	Indicates to the server that the client is willing to receive
> +	URIs of any of the given protocols in place of objects in the
> +	sent packfile. Before performing the connectivity check, the
> +	client should download from all given URIs. Currently, the
> +	protocols supported are "http" and "https".

Ah, I think the puzzlement I repeatedly expressed above is starting
to dissolve.  You are assuming that the receiving end would remember
the URIs but the in-protocol packfile data at the end is handled
first, and then before the final connectivity check is done,
packfiles are downloaded from the URIs.  If you spelled out that
assumption early in the document, then I wouldn't have had to ask
all those questions.  I was assuming a different order (i.e. CDN
packfiles first to establish the baseline, and then in-protocol
packfile to fill the "latest bits", the last mile to reach the tips
of requested refs).

In practice, I suspect that these fetches would go in parallel with
the processing of the in-protocol packfile, but spelling it out as
if these are done sequentially would help establish the right
mental model.  

"(1) Process in-protocol packfiles first, and then (2) fetch CDN
URIs, and after all is done, (3) update the tips of refs" would
serve as a base to establish a good mental model.  It would be even
better to throw in another step before all that: (0) record the
wanted-refs and CDN URIs to a safe place.  If you get disconnected
before finishing (1), you have to redo from scratch, but once
you finished (0) and (1), then (2) and (3) can be done at your
leisure using the information you saved in step (0), and (1) can be
retried if your connection is lousy.
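
The ordering described above could be sketched roughly as follows.  This
is only an illustration of the mental model, not the code in this series:
`FetchState`, `run_fetch` and the three callbacks are hypothetical names,
and the sequential loop ignores the parallelism a real client would
likely use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FetchState:
    """Step (0): information recorded before anything else (hypothetical layout)."""
    wanted_refs: Dict[str, str]  # refname -> object id
    uris: List[str]              # packfile URIs announced by the server


def run_fetch(
    state: FetchState,
    process_inline_pack: Callable[[], None],
    download_pack: Callable[[str], None],
    update_ref: Callable[[str, str], None],
) -> None:
    # (1) Process the in-protocol packfile first.
    process_inline_pack()
    # (2) Download every URI; the fetch is not complete until all are in.
    for uri in state.uris:
        download_pack(uri)
    # (3) Only after everything has arrived, update the tips of refs.
    for refname, oid in state.wanted_refs.items():
        update_ref(refname, oid)
```

Because step (0) is plain data, steps (2) and (3) could in principle be
replayed from a saved `FetchState` after an interruption.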

>  The response of `fetch` is broken into a number of sections separated by
>  delimiter packets (0001), with each section beginning with its section
>  header. Most sections are sent only when the packfile is sent.
>  
>      output = acknowledgements flush-pkt |
>  	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -	     [wanted-refs delim-pkt] packfile flush-pkt
> +	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +	     packfile flush-pkt
>  
>      acknowledgments = PKT-LINE("acknowledgments" LF)
>  		      (nak | *ack)
> @@ -349,6 +362,9 @@ header. Most sections are sent only when the packfile is sent.
>  		  *PKT-LINE(wanted-ref LF)
>      wanted-ref = obj-id SP refname
>  
> +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)

40* 

>      packfile = PKT-LINE("packfile" LF)
>  	       *PKT-LINE(%x01-03 *%x00-ff)
>  
> @@ -420,6 +436,16 @@ header. Most sections are sent only when the packfile is sent.
>  	* The server MUST NOT send any refs which were not requested
>  	  using 'want-ref' lines.
>  
> +    packfile-uris section
> +	* This section is only included if the client sent
> +	  'packfile-uris' and the server has at least one such URI to
> +	  send.
> +
> +	* Always begins with the section header "packfile-uris".
> +
> +	* For each URI the server sends, it sends a hash of the pack's
> +	  contents (as output by git index-pack) followed by the URI.

OK.  This allows the URI that feeds us the packfile contents to have
any name, and still lets us verify the contents match what the other
end wanted to feed us.
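
To make the line format concrete, here is a minimal sketch of parsing one
`packfile-uri` payload.  It assumes the `40*(HEXDIGIT)` in the grammar is
meant as exactly 40 hex digits (SHA-1 only) and tolerates a missing
trailing LF; `parse_packfile_uri` is a hypothetical helper, not a
function in git.

```python
import re
from typing import Tuple

# 40 hex digits, one SP, then the URI (bytes 0x20-0xff), optional LF.
_PACKFILE_URI = re.compile(rb"([0-9a-fA-F]{40}) ([\x20-\xff]*)\n?")


def parse_packfile_uri(payload: bytes) -> Tuple[str, bytes]:
    """Split a pkt-line payload into (pack hash, URI)."""
    m = _PACKFILE_URI.fullmatch(payload)
    if m is None:
        raise ValueError("malformed packfile-uri line")
    # Normalise the hash to lowercase hex; leave the URI bytes untouched.
    return m.group(1).decode("ascii").lower(), m.group(2)
```

The returned hash is what the client would compare against the downloaded
pack, whatever name the URI happens to carry.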

Thanks.

>      packfile section
>  	* This section is only included if the client has sent 'want'
>  	  lines in its request and either requested that no more

Junio C Hamano May 30, 2020, 12:22 a.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

> In practice, I suspect that these fetches would go in parallel with
> the processing of the in-protocol packfile, but spelling it out as
> if these are done sequentially would help establish the right
> mental model.  
>
> "(1) Process in-protocol packfiles first, and then (2) fetch CDN
> URIs, and after all is done, (3) update the tips of refs" would
> serve as a base to establish a good mental model.  It would be even
> better to throw in another step before all that: (0) record the
> wanted-refs and CDN URIs to a safe place.  If you get disconnected
> before finishing (1), you have to redo from scratch, but once
> you finished (0) and (1), then (2) and (3) can be done at your
> leisure using the information you saved in step (0), and (1) can be
> retried if your connection is lousy.

We need to be a bit careful here.  After finishing (0) and (1), the
most recent history near the requested tips is not anchored by any
ref.  We of course cannot point at these "most recent" objects with
refs because it is very likely that they are not connected to the
parts of the history we already have in the receiving repository.
The huge gap exists to be filled by the bulk download from CDN.

So a GC that happens before (3) completes can discard object data
obtained in step (1).  One way to protect it may be to use a .keep
file but then some procedure needs to be there to remove it once we
are done.  Perhaps at the end of (1), the name of that .keep file is
added to the set of information we keep until (3) happens (the
remainder of the set of information was obtained in step (0)).

Jonathan Tan June 1, 2020, 11:07 p.m. UTC | #3

> > +Protocol
> > +--------
> > +
> > +The server advertises `packfile-uris`.
> 
> Is this a new "protocol capability"?  There are multiple things that
> are "advertised" over the wire (there is "reference advertisement")
> and readers would want to know if this is yet another kind of
> advertisement or a new variety of the "capability".

Yes, it's a new protocol capability. I'll update the text.

> > +If the client then communicates which protocols (HTTPS, etc.) it supports with
> > +a `packfile-uris` argument, the server MAY send a `packfile-uris` section
> > +directly before the `packfile` section (right after `wanted-refs` if it is
> > +sent) containing URIs of any of the given protocols. The URIs point to
> > +packfiles that use only features that the client has declared that it supports
> > +(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
> > +this section.
> > +
> > +Clients then should understand that the returned packfile could be incomplete,
> 
> I am guessing that this merely means that the result of downloading
> and installing the packfile does not necessarily make the resulting
> repository up-to-date with respect to the "want" the "fetch" command
> requested.  But the above can easily be misread that the returned
> packfile is somewhat broken, corrupt or missing objects that it
> ought to have (e.g. a deltified object lacks its base object in the
> file). [#1]

Most of this is resolved below, but I'll try to write upfront what's
going on so it will be clear from the beginning (and not just at the
end).

But you bring up a good point here - can one of the linked-by-URI packs
have a delta against the inline packfile, or vice versa? I'll take a
look and clarify that.

> > +and that it needs to download all the given URIs before the fetch or clone is
> > +complete.
> 
> So if I say "I want history leading to commit X", and choose to use
> the packfile-uris that told me to fetch two packfiles P and Q, does
> it mean that I only need to fetch P and Q, index-pack them, and store
> the resulting two packfiles and their idx files in my repository to
> have the history leading to commit X?
> 
> I would have guessed that the resulting repository after fetching
> these URIs can still be incomplete and the "packfile" section of the
> response needs to be processed before the fetch or clone is complete,
> but the above does not say so.

I'll clarify the statement.

> > +Server design
> > +-------------
> > +
> > +The server can be trivially made compatible with the proposed protocol by
> > +having it advertise `packfile-uris`, tolerating the client sending
> > +`packfile-uris`, and never sending any `packfile-uris` section. But we should
> > +include some sort of non-trivial implementation in the Minimum Viable Product,
> > +at least so that we can test the client.
> > +
> > +This is the implementation: a feature, marked experimental, that allows the
> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> > +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> > +with the given sha1 can be replaced by the given URI. This allows, for example,
> > +servers to delegate serving of large blobs to CDNs.
> 
> Let me see if the above is understandable.
> 
> Does "git cat-file blob <sha1>" come back when we try to "wget/curl"
> the <uri>?

No - a packfile containing a single object will be returned, not the
uncompressed and headerless object. I'll update the text to clarify
that.
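
As an illustration of the difference, the following sketch builds a
version 2 packfile holding a single undeltified blob, which is the kind
of payload the URI would serve (as opposed to the raw blob contents).
`single_blob_pack` is a hypothetical helper written for this example; in
practice the pack would be produced by git itself.

```python
import hashlib
import struct
import zlib

OBJ_BLOB = 3  # object type code for blobs in the pack format


def single_blob_pack(data: bytes) -> bytes:
    # Header: "PACK" signature, version 2, one object.
    out = b"PACK" + struct.pack(">II", 2, 1)
    # Entry header: type in bits 4-6 of the first byte; the size is a
    # variable-length integer with 4 bits in the first byte, 7 in the rest.
    size = len(data)
    byte = (OBJ_BLOB << 4) | (size & 0x0F)
    size >>= 4
    header = bytearray()
    while size:
        header.append(byte | 0x80)  # high bit set: more size bytes follow
        byte = size & 0x7F
        size >>= 7
    header.append(byte)
    # Object body: zlib-deflated blob contents.
    out += bytes(header) + zlib.compress(data)
    # Trailer: SHA-1 checksum over everything before it.
    return out + hashlib.sha1(out).digest()
```

So a client that downloads the URI has to run the result through
index-pack like any other pack, rather than writing it out as a loose
object.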

> > +The division of work (initial fetch + additional URIs) introduces convenient
> > +points for resumption of an interrupted clone - such resumption can be done
> > +after the Minimum Viable Product (see "Future work").
> > +
> > +The client can inhibit this feature (i.e. refrain from sending the
> > +`packfile-uris` parameter) by passing --no-packfile-uris to `git fetch`.
> 
> By default, as long as the other side advertises packfile-uris, the
> client automatically attempts to use it and users need to explicitly
> disable it?  
> 
> It's quite different from the way we introduce new features and I am
> wondering if it is a good approach.

The client has to opt in first with the fetch.uriprotocols config (which
says which protocols it wants to support), but that is not written in
this spec. I'll add it.

> > + * On the server, a long-running process that takes in entire requests and
> > +   outputs a list of URIs and the corresponding inclusion and exclusion sets of
> > +   objects. This allows, e.g., signed URIs to be used and packfiles for common
> > +   requests to be cached.
> 
> Did we discuss "inclusion and exclusion sets" whatever they are?

No - this is future/speculative so I didn't want to spend too much time
explaining this. I was thinking of saying that this URI includes all
objects from commit A (inclusion) but not from commits B and C
(exclusion). Maybe I should just leave it out.

> > diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> > index ef7514a3ee..7e1b3a0bfe 100644
> > --- a/Documentation/technical/protocol-v2.txt
> > +++ b/Documentation/technical/protocol-v2.txt
> > @@ -325,13 +325,26 @@ included in the client's request:
> >  	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
> >  	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
> >  
> > +If the 'packfile-uris' feature is advertised, the following argument
> > +can be included in the client's request as well as the potential
> > +addition of the 'packfile-uris' section in the server's response as
> > +explained below.
> > +
> > +    packfile-uris <comma-separated list of protocols>
> > +	Indicates to the server that the client is willing to receive
> > +	URIs of any of the given protocols in place of objects in the
> > +	sent packfile. Before performing the connectivity check, the
> > +	client should download from all given URIs. Currently, the
> > +	protocols supported are "http" and "https".
> 
> Ah, I think the puzzlement I repeatedly expressed above is starting
> to dissolve.  You are assuming that the receiving end would remember
> the URIs but the in-protocol packfile data at the end is handled
> first, and then before the final connectivity check is done,
> packfiles are downloaded from the URIs.  If you spelled out that
> assumption early in the document, then I wouldn't have had to ask
> all those questions.  I was assuming a different order (i.e. CDN
> packfiles first to establish the baseline, and then in-protocol
> packfile to fill the "latest bits", the last mile to reach the tips
> of requested refs).
> 
> In practice, I suspect that these fetches would go in parallel with
> the processing of the in-protocol packfile, but spelling it out as
> if these are done sequentially would help establish the right
> mental model.  

OK.

> "(1) Process in-protocol packfiles first, and then (2) fetch CDN
> URIs, and after all is done, (3) update the tips of refs" would
> serve as a base to establish a good mental model.  It would be even
> better to throw in another step before all that: (0) record the
> wanted-refs and CDN URIs to a safe place.  If you get disconnected
> before finishing (1), you have to redo from scratch, but once
> you finished (0) and (1), then (2) and (3) can be done at your
> leisure using the information you saved in step (0), and (1) can be
> retried if your connection is lousy.

OK. This set of patches does not do (0) yet, and I think doing so would
require more design - in particular, if we have fetch resumption, we
would need to somehow track that none of the previously fetched objects
have been deleted.

Thanks for all your comments.

Jonathan Tan June 1, 2020, 11:10 p.m. UTC | #4

> Junio C Hamano <gitster@pobox.com> writes:
> 
> > In practice, I suspect that these fetches would go in parallel with
> > the processing of the in-protocol packfile, but spelling it out as
> > if these are done sequentially would help establish the right
> > mental model.  
> >
> > "(1) Process in-protocol packfiles first, and then (2) fetch CDN
> > URIs, and after all is done, (3) update the tips of refs" would
> > serve as a base to establish a good mental model.  It would be even
> > better to throw in another step before all that: (0) record the
> > wanted-refs and CDN URIs to a safe place.  If you get disconnected
> > before finishing (1), you have to redo from scratch, but once
> > you finished (0) and (1), then (2) and (3) can be done at your
> > leisure using the information you saved in step (0), and (1) can be
> > retried if your connection is lousy.
> 
> We need to be a bit careful here.  After finishing (0) and (1), the
> most recent history near the requested tips is not anchored by any
> ref.  We of course cannot point at these "most recent" objects with
> refs because it is very likely that they are not connected to the
> parts of the history we already have in the receiving repository.
> The huge gap exists to be filled by the bulk download from CDN.
> 
> So a GC that happens before (3) completes can discard object data
> obtained in step (1).  One way to protect it may be to use a .keep
> file but then some procedure needs to be there to remove it once we
> are done.  Perhaps at the end of (1), the name of that .keep file is
> added to the set of information we keep until (3) happens (the
> remainder of the set of information was obtained in step (0)).

Yes, this is precisely what we're doing - the packs obtained through the
packfile URIs are all written with keep files, and the names of the keep
files are added to a list. They are then deleted at the same time that
the regular keep file (the one generated during an ordinary fetch) is
deleted.

I'll also add this information to the spec.
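
The bookkeeping could look something like the sketch below.  It is not
the fetch-pack code from this series; `KeepList` and its methods are
hypothetical, and the only thing taken from git is the convention that
`pack-X.keep` sits next to `pack-X.pack`.

```python
import os
from typing import List


class KeepList:
    """Remember the .keep files created during one fetch (hypothetical)."""

    def __init__(self) -> None:
        self.paths: List[str] = []

    def protect(self, pack_path: str) -> str:
        # A .keep file next to the pack marks it as one to be kept.
        if not pack_path.endswith(".pack"):
            raise ValueError("expected a path ending in .pack")
        keep_path = pack_path[: -len(".pack")] + ".keep"
        with open(keep_path, "w") as f:
            f.write("fetch in progress\n")
        self.paths.append(keep_path)
        return keep_path

    def release_all(self) -> None:
        # Called once refs are updated: drop every keep file together.
        for path in self.paths:
            if os.path.exists(path):
                os.remove(path)
        self.paths.clear()
```

The inline pack and each URI pack would go through `protect()`, and
`release_all()` would run only after the refs point at the new history.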

Jonathan Tan June 10, 2020, 1:14 a.m. UTC | #5

> > @@ -349,6 +362,9 @@ header. Most sections are sent only when the packfile is sent.
> >  		  *PKT-LINE(wanted-ref LF)
> >      wanted-ref = obj-id SP refname
> >  
> > +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> > +    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
> 
> 40* 

I'm almost ready to send out an updated version, but have one question:
what do you mean by this? If you mean that I should use "obj-id"
instead, I didn't want to because it's not the hash of an object, but
the hash of a packfile.

Junio C Hamano June 10, 2020, 5:16 p.m. UTC | #6

Jonathan Tan <jonathantanmy@google.com> writes:

>> > @@ -349,6 +362,9 @@ header. Most sections are sent only when the packfile is sent.
>> >  		  *PKT-LINE(wanted-ref LF)
>> >      wanted-ref = obj-id SP refname
>> >  
>> > +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
>> > +    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
>> 
>> 40* 
>
> I'm almost ready to send out an updated version, but have one question:
> what do you mean by this? If you mean that I should use "obj-id"
> instead, I didn't want to because it's not the hash of an object, but
> the hash of a packfile.

It clearly is not an object name, but it is a run of hexdigits whose
length is the same as (hexadecimal representation of) the object name.

How is "obj-id" we see above in the precontext of that hunk defined?
Does it use 40*(HEXDIGIT), too?  Do we plan to support non SHA-1 hashes
in this design in the future, and if so how?

"We are only focused on SHA-1 hashes for now" is a perfectly
acceptable answer, and then 40* here makes 100% sense, but then we'd
need to say "for now this design only assumes SHA-1 hash" upfront, I
would think, to remind ourselves that we need to consider this part
of the system when we upgrade to SHA-256.

Thanks.

Jonathan Tan June 10, 2020, 6:04 p.m. UTC | #7

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> >> > @@ -349,6 +362,9 @@ header. Most sections are sent only when the packfile is sent.
> >> >  		  *PKT-LINE(wanted-ref LF)
> >> >      wanted-ref = obj-id SP refname
> >> >  
> >> > +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> >> > +    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
> >> 
> >> 40* 
> >
> > I'm almost ready to send out an updated version, but have one question:
> > what do you mean by this? If you mean that I should use "obj-id"
> > instead, I didn't want to because it's not the hash of an object, but
> > the hash of a packfile.
> 
> It clearly is not an object name, but it is a run of hexdigits whose
> length is the same as (hexadecimal representation of) the object name.
> 
> How is "obj-id" we see above in the precontext of that hunk defined?
> Does it use 40*(HEXDIGIT), too?  

Yes, it's defined as such in protocol-common.txt:

  obj-id    =  40*(HEXDIGIT)

> Do we plan to support non SHA-1 hashes
> in this design in the future, and if so how?
> 
> "We are only focused on SHA-1 hashes for now" is a perfectly
> acceptable answer, and then 40* here makes 100% sense, but then we'd
> need to say "for now this design only assumes SHA-1 hash" upfront, I
> would think, to remind ourselves that we need to consider this part
> of the system when we upgrade to SHA-256.

This will be whatever is output by index-pack after "pack\t" or
"keep\t". I'll make a note in the version I'm about to send out. Yes,
we'll definitely need to remind ourselves about considering this part
when we upgrade.
