[WIP,RFC,2/5] Documentation: add Packfile URIs design doc

Message ID 0461b362569362c6d0e73951469c547a03a1b59d.1543879256.git.jonathantanmy@google.com (mailing list archive)
State New, archived
Series Design for offloading part of packfile response to CDN

Commit Message

Jonathan Tan Dec. 3, 2018, 11:37 p.m. UTC
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/packfile-uri.txt | 83 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  |  6 +-
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

Comments

Stefan Beller Dec. 4, 2018, 12:21 a.m. UTC | #1
Thanks for bringing this design to the list!

> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> index 345c00e08c..2cb1c41742 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
>
>      output = acknowledgements flush-pkt |
>              [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -            [wanted-refs delim-pkt] packfile flush-pkt
> +            [wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +            packfile flush-pkt

While this is an RFC and incomplete, we'd need to remember to
add packfile-uris to the capabilities list above, stating that it requires
thin-pack and ofs-delta to be sent, and what to expect from it.

The mention of --no-packfile-urls in the Client design above
seems to imply we'd want to turn it on by default, which I thought
was not the usual way we introduce new things.

An odd way of disabling it would be --no-thin-pack, hoping the
client-side implementation abides by the implied requirements.

>      acknowledgments = PKT-LINE("acknowledgments" LF)
>                       (nak | *ack)
> @@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
>                   *PKT-LINE(wanted-ref LF)
>      wanted-ref = obj-id SP refname
>
> +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)

Is the *%x20-ff a fancy way of saying obj-id?

While the server is configured with (oid, URL) pairs,
we would not need to send the exact oid to the client,
as the client can figure that out on its own by reading
the downloaded pack.

Instead we could send an integrity hash (i.e. the packfile
downloaded from "uri" is expected to hash to $oid here)
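
For illustration, the payload of each pkt-line could then look
something like this (hypothetical syntax and a made-up URL, not what
the patch currently sends):

    uri <hash-of-packfile> https://cdn.example.com/pack-1234.pack

where the first field is the hash of the packfile itself rather than
of any object inside it.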

Thanks,
Stefan
brian m. carlson Dec. 4, 2018, 1:54 a.m. UTC | #2
On Mon, Dec 03, 2018 at 03:37:35PM -0800, Jonathan Tan wrote:
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/technical/packfile-uri.txt | 83 ++++++++++++++++++++++++
>  Documentation/technical/protocol-v2.txt  |  6 +-
>  2 files changed, 88 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/technical/packfile-uri.txt
> 
> diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
> new file mode 100644
> index 0000000000..6535801486
> --- /dev/null
> +++ b/Documentation/technical/packfile-uri.txt
> @@ -0,0 +1,83 @@
> +Packfile URIs
> +=============
> +
> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU usage
> +(for example, by serving some data through a CDN), and (in the future) provides
> +some measure of resumability to clients.
> +
> +This feature is available only in protocol version 2.
> +
> +Protocol
> +--------
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta
> +
> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.
> +
> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).


Some thoughts here:

First, I'd like to see a section (and a bit in the implementation)
requiring HTTPS if the original protocol is secure (SSH or HTTPS).
Allowing the server to downgrade to HTTP, even by accident, would be a
security problem.

Second, this feature likely should be opt-in for SSH. One issue I've
seen repeatedly is that people don't want to use HTTPS to fetch things
when they're using SSH for Git. Many people in corporate environments
have proxies that break HTTP for non-browser use cases[0], and using SSH
is the only way that they can make a functional Git connection.

Third, I think the server needs to be required to both support Range
headers and never change the content of a URI, so that we can have
resumable clone implicit in this design. There are some places in the
world where connections are poor and fetching even the initial packfile
at once might be a problem. (I've seen such questions on Stack
Overflow, for example.)
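
For what it's worth, resuming such a download needs nothing special on
the client side beyond a ranged request; with a made-up URL, roughly:

    curl -C - -o pack-1234.pack https://cdn.example.com/pack-1234.pack

-C - makes curl send a Range header starting at the size of whatever
partial file is already on disk, which only works if the server honors
Range requests and the content behind the URI never changes.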

Having said that, I think overall this is a good idea and I'm glad to
see a proposal for it.

[0] For example, a naughty-word filter may corrupt or block certain byte
sequences that occur incidentally in the pack stream.
Jonathan Tan Dec. 4, 2018, 7:29 p.m. UTC | #3
> Some thoughts here:
> 
> First, I'd like to see a section (and a bit in the implementation)
> requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> Allowing the server to downgrade to HTTP, even by accident, would be a
> security problem.
> 
> Second, this feature likely should be opt-in for SSH. One issue I've
> seen repeatedly is that people don't want to use HTTPS to fetch things
> when they're using SSH for Git. Many people in corporate environments
> have proxies that break HTTP for non-browser use cases[0], and using SSH
> is the only way that they can make a functional Git connection.

Good points about SSH support and the client needing to control which
protocols the server will send URIs for. I'll include a line in the
client request in which the client can specify which protocols it is OK
with.

> Third, I think the server needs to be required to both support Range
> headers and never change the content of a URI, so that we can have
> resumable clone implicit in this design. There are some places in the
> world where connections are poor and fetching even the initial packfile
> at once might be a problem. (I've seen such questions on Stack
> Overflow, for example.)

Good points. I'll add these in the next revision.

> Having said that, I think overall this is a good idea and I'm glad to
> see a proposal for it.

Thanks, and thanks for your comments too.
Junio C Hamano Dec. 5, 2018, 5:02 a.m. UTC | #4
Jonathan Tan <jonathantanmy@google.com> writes:

> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU usage
> +(for example, by serving some data through a CDN), and (in the future) provides
> +some measure of resumability to clients.

Without reading the remainder, this makes readers anticipate a few
good things ;-)

 - "part of", so pre-generated constant material can be given from
   CDN and then followed-up by "filling the gaps" small packfile,
   perhaps?

 - The "part of" transmission may not bring the repository up to
   date wrt the "want" objects; would this feature involve "you
   asked history up to these commits, but with this pack-uri, you'll
   be getting history up to these (somewhat stale) commits"?

Anyway, let's read on.

> +This feature is available only in protocol version 2.
> +
> +Protocol
> +--------
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta

"with the following" meaning "with all of the following", or "with
any of the following"?  Is there a reason why the server side must
require that the client understands and is willing to accept a
thin-pack when wanting to use packfile-uris?  The same question for
the ofs-delta.

When the pregenerated constant material the server plans to hand out
the uris for was prepared by using ofs-delta encoding, the server
cannot give the uri to it when the client does not want an ofs-delta
encoded packfile, but it feels somewhat strange that we require the
most capable client at the protocol level.  After all, the server
side could prepare one with ofs-delta and another without ofs-delta
and depending on what the client is capable of, hand out different
URIs, if it wanted to.

The reason why I care is that thin and ofs will *NOT* forever be
the only optional features of the pack format.  We may invent yet
another such optional 'frotz' feature, which may greatly help the
efficiency of the packfile encoding, hence it may be preferable
to always generate a CDN packfile with that feature, in
addition to thin and ofs.  Would we add 'frotz' to the above list in
the documentation, then?  What would happen to existing servers and
clients written before that time then?

My recommendation is to drop the mention of "thin" and "ofs" from
the above list, and also from the following paragraph.  The "it MAY
send" will serve as a practical escape clause to allow a server/CDN
implementation that *ALWAYS* prepares pregenerated material that can
only be digested by clients that support thin and ofs.  Such a server
can send packfile-URIs only when all of the three are given by the
client and be compliant.  And such an update to the proposed document
would allow a more diskful server to prepare both thin and non-thin
pregenerated packs and choose which one to give to the client depending
on the capability.

> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.

So, this is OK, but

> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).

weaken or remove the (parenthetical comment) in the last sentence,
and replace the beginning of the section with something like

	If the client replies with 'packfile-uris', when the server
	sends the packfile, it MAY send a `packfile-uris` section...

You may steal what I wrote in the above response to help the
server-side folks to decide how to actually implement the "it MAY
send a packfile-uris" part in the document.

> +Server design
> +-------------
> +
> +The server can be trivially made compatible with the proposed protocol by
> +having it advertise `packfile-uris`, tolerating the client sending
> +`packfile-uris`, and never sending any `packfile-uris` section. But we should
> +include some sort of non-trivial implementation in the Minimum Viable Product,
> +at least so that we can test the client.
> +
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for example,
> +servers to delegate serving of large blobs to CDNs.

;-)

> +Client design
> +-------------
> +
> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.
> +
> +The division of work (initial fetch + additional URIs) introduces convenient
> +points for resumption of an interrupted clone - such resumption can be done
> +after the Minimum Viable Product (see "Future work").
> +
> +The client can inhibit this feature (i.e. refrain from sending the
> +`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.

OK, this comes back to what I alluded to at the beginning.  We could
respond to a full-clone request by feeding a series of packfile-uris
and some ref information, perhaps like this:

	* Grab this packfile and update your remote-tracking refs
          and tags to these values; you'd be as if you cloned the
          project when it was at v1.0.

	* When you are done with the above, grab this packfile and
          update your remote-tracking refs and tags to these values;
          you'd be as if you cloned the project when it was at v2.0.

	* When you are done with the above, grab this packfile and
          update your remote-tracking refs and tags to these values;
          you'd be as if you cloned the project when it was at v3.0.

	...

	* When you are done with the above, here is the remaining
          packdata to bring you fully up to date with your original
          "want"s.

and before fully reading the proposal, I anticipated that it was
what you were going to describe.  The major difference is "up to the
packdata given to you so far, you'd be as if you fetched these" ref
information, which would allow you to be interrupted and then simply
resume, without having to remember the set of packfile-uris yet to
be processed across a fetch/clone failure.  If you successfully fetch
packfile for ..v1.0, you can update the remote-tracking refs to
match as if you fetched back when that was the most recent state of
the project, and then if you failed while transferring packfile for
v1.0..v2.0, the resuming would just reissue "git fetch" internally.

I think what you proposed, i.e. without the "with the data up to
this packfile, you have history to these objects", would also work,
even though it requires us to remember more of what we learned
during the initial attempt throughout retrying failed transfers.

> +Future work
> +-----------
> +
> +The protocol design allows some evolution of the server and client without any
> +need for protocol changes, so only a small-scoped design is included here to
> +form the MVP. For example, the following can be done:
> +
> + * On the server, a long-running process that takes in entire requests and
> +   outputs a list of URIs and the corresponding inclusion and exclusion sets of
> +   objects. This allows, e.g., signed URIs to be used and packfiles for common
> +   requests to be cached.
> + * On the client, resumption of clone. If a clone is interrupted, information
> +   could be recorded in the repository's config and a "clone-resume" command
> +   can resume the clone in progress. (Resumption of subsequent fetches is more
> +   difficult because that must deal with the user wanting to use the repository
> +   even after the fetch was interrupted.)
> +
> +There are some possible features that will require a change in protocol:
> +
> + * Additional HTTP headers (e.g. authentication)
> + * Byte range support
> + * Different file formats referenced by URIs (e.g. raw object)
> +
> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> index 345c00e08c..2cb1c41742 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
>  
>      output = acknowledgements flush-pkt |
>  	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -	     [wanted-refs delim-pkt] packfile flush-pkt
> +	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +	     packfile flush-pkt
>  
>      acknowledgments = PKT-LINE("acknowledgments" LF)
>  		      (nak | *ack)
> @@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
>  		  *PKT-LINE(wanted-ref LF)
>      wanted-ref = obj-id SP refname
>  
> +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
> +
>      packfile = PKT-LINE("packfile" LF)
>  	       *PKT-LINE(%x01-03 *%x00-ff)
Junio C Hamano Dec. 5, 2018, 5:55 a.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> So, this is OK, but
>
>> +Clients then should understand that the returned packfile could be incomplete,
>> +and that it needs to download all the given URIs before the fetch or clone is
>> +complete. Each URI should point to a Git packfile (which may be a thin pack and
>> +which may contain offset deltas).
>
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
>
> 	If the client replies with 'packfile-uris', when the server
> 	sends the packfile, it MAY send a `packfile-uris` section...
>
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

By the way, I do agree with the practical consideration the design
you described makes.  For a pregenerated pack that brings you from
v1.0 to v2.0, "thin" would roughly save the transfer of one full
checkout (compressed, of course), and "ofs" would also save several
bytes per object.  Compared to a single pack that delivers everything
the fetcher wants, concatenation of packs without "thin" to transfer
the same set of objects would cost quite a lot more.

And I do not think we should care too much about fetchers that lack
either thin or ofs, so I'd imagine that any client that asks for
packfile-uris would also advertise thin and ofs as well.  So in
practice, a request with packfile-uris that lacks thin or ofs would
be pretty rare, and requiring all three versus requiring only one would
not make much practical difference.  It's just that I think singling
out these two capabilities as hard requirements at the protocol
level is wrong.
Jonathan Tan Dec. 6, 2018, 11:16 p.m. UTC | #6
> > +This feature allows servers to serve part of their packfile response as URIs.
> > +This allows server designs that improve scalability in bandwidth and CPU usage
> > +(for example, by serving some data through a CDN), and (in the future) provides
> > +some measure of resumability to clients.
> 
> Without reading the remainder, this makes readers anticipate a few
> good things ;-)
> 
>  - "part of", so pre-generated constant material can be given from
>    CDN and then followed-up by "filling the gaps" small packfile,
>    perhaps?

Yes :-)

>  - The "part of" transmission may not bring the repository up to
>    date wrt the "want" objects; would this feature involve "you
>    asked history up to these commits, but with this pack-uri, you'll
>    be getting history up to these (somewhat stale) commits"?

It could be, but not necessarily. In my current WIP implementation, for
example, pack URIs don't give you any commits at all (and thus, no
history) - only blobs. Quite a few people first think of the "stale
clone then top-up" case, though - I wonder if it would be a good idea to
give the blob example in this paragraph in order to put people in the
right frame of mind.

> > +If the client replies with the following arguments:
> > +
> > + * packfile-uris
> > + * thin-pack
> > + * ofs-delta
> 
> "with the following" meaning "with all of the following", or "with
> any of the following"?  Is there a reason why the server side must
> require that the client understands and is willing to accept a
> thin-pack when wanting to use packfile-uris?  The same question for
> the ofs-delta.

"All of the following", but from your later comments, we probably don't
need this section anyway.

> My recommendation is to drop the mention of "thin" and "ofs" from
> the above list, and also from the following paragraph.  The "it MAY
> send" will serve as a practical escape clause to allow a server/CDN
> implementation that *ALWAYS* prepares pregenerated material that can
> only be digested by clients that support thin and ofs.  Such a server
> can send packfile-URIs only when all of the three are given by the
> client and be compliant.  And such an update to the proposed document
> would allow a more diskful server to prepare both thin and non-thin
> pregenerated packs and choose which one to give to the client depending
> on the capability.

That is true - we can just let the server decide. I'll update the patch
accordingly, and state that the URIs should point to packs with features
like thin-pack and ofs-delta only if the client has declared that it
supports them.

> > +Clients then should understand that the returned packfile could be incomplete,
> > +and that it needs to download all the given URIs before the fetch or clone is
> > +complete. Each URI should point to a Git packfile (which may be a thin pack and
> > +which may contain offset deltas).
> 
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
> 
> 	If the client replies with 'packfile-uris', when the server
> 	sends the packfile, it MAY send a `packfile-uris` section...
> 
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

OK, will do.

> OK, this comes back to what I alluded to at the beginning.  We could
> respond to a full-clone request by feeding a series of packfile-uris
> and some ref information, perhaps like this:
> 
> 	* Grab this packfile and update your remote-tracking refs
>           and tags to these values; you'd be as if you cloned the
>           project when it was at v1.0.
> 
> 	* When you are done with the above, grab this packfile and
>           update your remote-tracking refs and tags to these values;
>           you'd be as if you cloned the project when it was at v2.0.
> 
> 	* When you are done with the above, grab this packfile and
>           update your remote-tracking refs and tags to these values;
>           you'd be as if you cloned the project when it was at v3.0.
> 
> 	...
> 
> 	* When you are done with the above, here is the remaining
>           packdata to bring you fully up to date with your original
>           "want"s.
> 
> and before fully reading the proposal, I anticipated that it was
> what you were going to describe.  The major difference is "up to the
> packdata given to you so far, you'd be as if you fetched these" ref
> information, which would allow you to be interrupted and then simply
> resume, without having to remember the set of packfile-uris yet to
> be processed across a fetch/clone failure.  If you successfully fetch
> packfile for ..v1.0, you can update the remote-tracking refs to
> match as if you fetched back when that was the most recent state of
> the project, and then if you failed while transferring packfile for
> v1.0..v2.0, the resuming would just reissue "git fetch" internally.

The "up to" would work if we had the stale clone + periodic "upgrades"
arrangement you describe, but not when, for example, we just want to
separate large blobs out. If we were to insist on attaching ref
information to each packfile URI (or turn the packfiles into bundles),
it is true that we can have resumable fetch, although I haven't fully
thought out the implications of letting the user modify the repository
while a fetch is in progress. (What happens if the user wipes out their
object store in between fetching one packfile and the next, for
example?) That is why I only talked about resumable clone, not resumable
fetch.
Christian Couder Feb. 19, 2019, 1:22 p.m. UTC | #7
On Tue, Dec 4, 2018 at 8:31 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > Some thoughts here:
> >
> > First, I'd like to see a section (and a bit in the implementation)
> > requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> > Allowing the server to downgrade to HTTP, even by accident, would be a
> > security problem.
> >
> > Second, this feature likely should be opt-in for SSH. One issue I've
> > seen repeatedly is that people don't want to use HTTPS to fetch things
> > when they're using SSH for Git. Many people in corporate environments
> > have proxies that break HTTP for non-browser use cases[0], and using SSH
> > is the only way that they can make a functional Git connection.
>
> Good points about SSH support and the client needing to control which
> protocols the server will send URIs for. I'll include a line in the
> client request in which the client can specify which protocols it is OK
> with.

What if a client is ok to fetch from some servers but not others (for
example github.com and gitlab.com but nothing else)?

Or what if a client is ok to fetch using SSH from some servers and
HTTPS from other servers but nothing else?

I also wonder in general how this would interact with promisor/partial
clone remotes.

When we discussed promisor/partial clone remotes in the thread
following this email:

https://public-inbox.org/git/20181016174304.GA221682@aiede.svl.corp.google.com/

it looked like you were ok with having many promisor remotes, which I
think could fill the same use cases especially related to large
objects.

As clients would configure promisor remotes explicitly, there would be
no issues about which protocol and servers are allowed or not.

If the issue is that you want the server to decide which promisor
remotes would be used without the client having to do anything, maybe
that could be something added on top of the possibility to have many
promisor remotes.
Ævar Arnfjörð Bjarmason Feb. 19, 2019, 1:44 p.m. UTC | #8
On Tue, Dec 04 2018, brian m. carlson wrote:

> On Mon, Dec 03, 2018 at 03:37:35PM -0800, Jonathan Tan wrote:
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
>>  Documentation/technical/packfile-uri.txt | 83 ++++++++++++++++++++++++
>>  Documentation/technical/protocol-v2.txt  |  6 +-
>>  2 files changed, 88 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/technical/packfile-uri.txt
>>
>> diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
>> new file mode 100644
>> index 0000000000..6535801486
>> --- /dev/null
>> +++ b/Documentation/technical/packfile-uri.txt
>> @@ -0,0 +1,83 @@
>> +Packfile URIs
>> +=============
>> +
>> +This feature allows servers to serve part of their packfile response as URIs.
>> +This allows server designs that improve scalability in bandwidth and CPU usage
>> +(for example, by serving some data through a CDN), and (in the future) provides
>> +some measure of resumability to clients.
>> +
>> +This feature is available only in protocol version 2.
>> +
>> +Protocol
>> +--------
>> +
>> +The server advertises `packfile-uris`.
>> +
>> +If the client replies with the following arguments:
>> +
>> + * packfile-uris
>> + * thin-pack
>> + * ofs-delta
>> +
>> +when the server sends the packfile, it MAY send a `packfile-uris` section
>> +directly before the `packfile` section (right after `wanted-refs` if it is
>> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
>> +this section.
>> +
>> +Clients then should understand that the returned packfile could be incomplete,
>> +and that it needs to download all the given URIs before the fetch or clone is
>> +complete. Each URI should point to a Git packfile (which may be a thin pack and
>> +which may contain offset deltas).
>
>
> First, I'd like to see a section (and a bit in the implementation)
> requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> Allowing the server to downgrade to HTTP, even by accident, would be a
> security problem.

Maybe I've misunderstood the design (I'm writing some other follow-up
E-Mails in this thread which might clarify things for me), but I don't
see why.

We get the ref advertisement from the server. We don't need to trust the
CDN server or the transport layer. We just download whatever we get from
there, validate the packfile with SHA-1 (and in the future SHA-256). It
doesn't matter if the CDN transport is insecure.

You can do this offline with git today, you don't need to trust me to
trust that my copy of git.git I give you on a sketchy USB stick is
genuine. Just unpack it, then compare the SHA-1s you get with:

    git ls-remote https://github.com/git/git.git

So this is a case similar to Debian's where they distribute packages
over http, but manifests over https: https://whydoesaptnotusehttps.com

> Second, this feature likely should be opt-in for SSH. One issue I've
> seen repeatedly is that people don't want to use HTTPS to fetch things
> when they're using SSH for Git. Many people in corporate environments
> have proxies that break HTTP for non-browser use cases[0], and using SSH
> is the only way that they can make a functional Git connection.

Yeah, there should definitely be accommodations for such clients, per my
reading clients can always ignore the CDN and proceed with a normal
negotiation. Isn't that enough, or is something extra needed?

> Third, I think the server needs to be required to both support Range
> headers and never change the content of a URI, so that we can have
> resumable clone implicit in this design. There are some places in the
> world where connections are poor and fetching even the initial packfile
> at once might be a problem. (I've seen such questions on Stack
> Overflow, for example.)

I think this should be a MAY, not a MUST, in RFC 2119 terms. There are still
many users who might want to offload things to a very dumb CDN, such as
Debian where they don't control their own mirrors, but might want to
offload a 1GB packfile download to some random university's Debian
mirror.

Such a download (over http) will work most of the time. If it's not
resumable it still sucks less than no CDN at all, and clients can always
fall back if the CDN breaks, which they should be doing anyway in case
of other sorts of issues.

> Having said that, I think overall this is a good idea and I'm glad to
> see a proposal for it.
>
> [0] For example, a naughty-word filter may corrupt or block certain byte
> sequences that occur incidentally in the pack stream.
Ævar Arnfjörð Bjarmason Feb. 19, 2019, 2:28 p.m. UTC | #9
On Tue, Dec 04 2018, Jonathan Tan wrote:

I meant to follow up after Git Merge, but didn't remember until this
thread was bumped.

But some things I'd like to clarify / am concerned about...

> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.
> +
> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).
> [...]
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for example,
> +servers to delegate serving of large blobs to CDNs.

Okey, so the server advertisement is not just "<urls>" but <oid><url>
pairs. More on this later...

> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.

And this. I don't quite understand this well enough, but maybe it helps
if I talk about what I'd expect out of CDN offloading. It comes down to
three things:

 * The server should be able to point to some "seed" packfiles *without*
   necessarily knowing what OIDs are in them, or having to tell the client.

 * The client should be able to just blindly get this data ("I guess
   this is where most of it is"), unpack it, see what OIDs it has, and
   *then* without initiating a new connection continue a want/have
   dialog.

   This effectively "bootstraps" a "clone" midway into an arbitrary
   "fetch".

 * There should be no requirement that a client successfully downloads
   the advertised CDNs, for fault handling (also discussed in
   https://public-inbox.org/git/87lg2b6gg0.fsf@evledraar.gmail.com/)

More concretely, I'd like to have a setup where a server can just dumbly
point to some URL that probably has most of the data, without having any
idea what OIDs are in it. So that e.g. some machine entirely
disconnected from the server (and with just a regular clone) can
continually generate an up-to-date-enough packfile.

I don't see how this is compatible with the server needing to send a
bunch of "<oid> <url>" lines, or why a client "cannot declare that the
fetch is complete until all URIs have been downloaded as
packfiles". Can't it fall back on the normal dialog?

Other thoughts:

 * If there isn't such a close coordination between git server & CDN, is
   there a case for having pack *.idx files on the CDN, so clients can
   inspect them to see if they'd like to download the full referenced
   pack?

 * Without the server needing to know enough about the packs to
   advertise "<oid> <url>" is there a way to e.g. advertise 4x packs to
   clients:

       big.pack, last-month.pack, last-week.pack, last-day.pack

   Or some other optimistic negotiation where clients, even ones just
   doing regular fetches, can seek to get more up-to-date with one of
   the more recent packs before doing the first fetch in 3 days?

   In the past I'd toyed with creating a similar "not quite CDN" setup
   using git-bundle.
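
   Roughly along these lines, with made-up paths and URLs:

       # on some box with a regular clone:
       git bundle create big.bundle --all
       # publish big.bundle on any static host; a client then does:
       curl -o big.bundle https://mirror.example.com/big.bundle
       git clone big.bundle repo
       cd repo
       git remote set-url origin https://git.example.com/repo.git
       git fetch origin    # catch up with the real server

   i.e. a dumb blob of data that gets you most of the way, followed by
   a normal fetch against the real server.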
Jonathan Tan Feb. 19, 2019, 8:10 p.m. UTC | #10
> > Good points about SSH support and the client needing to control which
> > protocols the server will send URIs for. I'll include a line in the
> > client request in which the client can specify which protocols it is OK
> > with.
> 
> What if a client is ok to fetch from some servers but not others (for
> example github.com and gitlab.com but nothing else)?
> 
> Or what if a client is ok to fetch using SSH from some servers and
> HTTPS from other servers but nothing else?

The objects received from the various CDNs are still rehashed by the
client (so they are identified with the correct name), and if the client
is fetching from a server, presumably it can trust the URLs it receives
(just like it trusts ref names, and so on). Do you know of a specific
case in which a client wants to fetch from some servers but not others?
(In any case, if this happens, the client can just disable the CDN
support.)
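
As a sketch of why the CDN transport matters less here than usual
(with a made-up URL, and glossing over details of the actual
implementation), the client-side handling of one advertised URI
amounts to roughly:

    curl -o cdn.pack https://cdn.example.com/pack-1234.pack
    git index-pack --stdin --fix-thin <cdn.pack

and index-pack recomputes every object's name while indexing, so a
tampered or corrupted pack cannot silently introduce wrong objects.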

> I also wonder in general how this would interact with promisor/partial
> clone remotes.
> 
> When we discussed promisor/partial clone remotes in the thread
> following this email:
> 
> https://public-inbox.org/git/20181016174304.GA221682@aiede.svl.corp.google.com/
> 
> it looked like you were ok with having many promisor remotes, which I
> think could fill the same use cases especially related to large
> objects.
>
> As clients would configure promisor remotes explicitly, there would be
> no issues about which protocol and servers are allowed or not.
> 
> If the issue is that you want the server to decide which promisor
> remotes would be used without the client having to do anything, maybe
> that could be something added on top of the possibility to have many
> promisor remotes.

It's true that there is a slight overlap with respect to large objects,
but this protocol can also handle large sets of objects being offloaded
to CDN, not only single ones. (The included implementation only handles
single objects, as a minimum viable product, but it is conceivable that
the server implementation is later expanded to allow offloading of sets
of objects.)

And this protocol is meant to be able to use CDNs to help serve objects,
whether single objects or sets of objects. In the case of promisor
remotes, the thing we fetch from has to be a Git server. (We could use
dumb HTTP from a CDN, but that defeats the purpose in at least one way -
with dumb HTTP, we have to fetch objects individually, but with URL
support, we can fetch objects as sets too.)
Jonathan Tan Feb. 19, 2019, 10:06 p.m. UTC | #11
> > +when the server sends the packfile, it MAY send a `packfile-uris` section
> > +directly before the `packfile` section (right after `wanted-refs` if it is
> > +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> > +this section.
> > +
> > +Clients then should understand that the returned packfile could be incomplete,
> > +and that it needs to download all the given URIs before the fetch or clone is
> > +complete. Each URI should point to a Git packfile (which may be a thin pack and
> > +which may contain offset deltas).
> > [...]
> > +This is the implementation: a feature, marked experimental, that allows the
> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> > +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> > +with the given sha1 can be replaced by the given URI. This allows, for example,
> > +servers to delegate serving of large blobs to CDNs.
> 
> Okey, so the server advertisement is not just "<urls>" but <oid><url>
> pairs. More on this later...

Actually, the server advertisement is just "<urls>". (The OID is there
to tell the server which object to omit if it sends the URL.) But I see
that the rest of your comments still stand.
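
Concretely, with made-up values, a server configured with

    [uploadpack]
        blobPackfileUri = "0123456789abcdef0123456789abcdef01234567 https://cdn.example.com/big-blob.pack"

would omit that blob from the pack it generates and advertise the URI
in the packfile-uris section instead; only the URI itself goes over
the wire.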

> More concretely, I'd like to have a setup where a server can just dumbly
> point to some URL that probably has most of the data, without having any
> idea what OIDs are in it. So that e.g. some machine entirely
> disconnected from the server (and with just a regular clone) can
> continually generate an up-to-date-enough packfile.

Thanks for the concrete use case. Server ignorance would work in this
case, since the client can concisely communicate to the server what
objects it obtained from the CDN (in this case, through "have" lines),
but it does not seem to work in the general case (e.g. offloading large
blobs; or the CDN serving a pack suitable for a shallow clone -
containing all objects referenced by the last few commits, whether
changed in that commit or not).

In this case, maybe the batch job can also inform the server which
commit the CDN is prepared to serve.

> I don't see how this is compatible with the server needing to send a
> bunch of "<oid> <url>" lines, or why a client "cannot declare that the
> fetch is complete until all URIs have been downloaded as
> packfiles". Can't it fall back on the normal dialog?

As stated above, the server advertisement is just "<url>", but you're
right that the server still needs to know their corresponding OIDs (or
have some knowledge like "this pack contains all objects in between this
commit and that commit").

I was thinking that there is no normal dialog to be had with this
protocol, since (as above) in the general case, the client cannot
concisely communicate what objects it obtained from the CDN.

> Other thoughts:
> 
>  * If there isn't such a close coordination between git server & CDN, is
>    there a case for having pack *.idx files on the CDN, so clients can
>    inspect them to see if they'd like to download the full referenced
>    pack?

I'm not sure if I understand this fully, but off the top of my head, the
.idx file doesn't contain relations between objects, so I don't think
the client has enough information to decide if it wants to download the
corresponding packfile.

>  * Without the server needing to know enough about the packs to
>    advertise "<oid> <url>" is there a way to e.g. advertise 4x packs to
>    clients:
> 
>        big.pack, last-month.pack, last-week.pack, last-day.pack
> 
>    Or some other optimistic negotiation where clients, even ones just
>    doing regular fetches, can seek to get more up-to-date with one of
>    the more recent packs before doing the first fetch in 3 days?
> 
>    In the past I'd toyed with creating a similar "not quite CDN" setup
>    using git-bundle.

I think such optimistic downloading of packs during a regular fetch
would only work in a partial clone where "holes" are tolerated. (That
does bring up the possibility of having a fetch mode in which we
download potentially incomplete packfiles into a partial repo and then
"complete" the repo through a not-yet-implemented process, but I
haven't thought through this.)
brian m. carlson Feb. 21, 2019, 1:09 a.m. UTC | #12
On Tue, Feb 19, 2019 at 02:44:31PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Dec 04 2018, brian m. carlson wrote:
> > First, I'd like to see a section (and a bit in the implementation)
> > requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> > Allowing the server to downgrade to HTTP, even by accident, would be a
> > security problem.
> 
> Maybe I've misunderstood the design (I'm writing some other follow-up
> E-Mails in this thread which might clarify things for me), but I don't
> see why.
> 
> We get the ref advertisement from the server. We don't need to trust the
> CDN server or the transport layer. We just download whatever we get from
> there, validate the packfile with SHA-1 (and in the future SHA-256). It
> doesn't matter if the CDN transport is insecure.
> 
> You can do this offline with git today, you don't need to trust me to
> trust that my copy of git.git I give you on a sketchy USB stick is
> genuine. Just unpack it, then compare the SHA-1s you get with:
> 
>     git ls-remote https://github.com/git/git.git
> 
> So this is a case similar to Debian's where they distribute packages
> over http, but manifests over https: https://whydoesaptnotusehttps.com

This assumes that integrity of the data is the only reason you'd want to
use HTTPS. There's also confidentiality. Perhaps a user is downloading
data that will help them circumvent the Great Firewall of China. A
downgrade to HTTP could result in a long prison sentence.

Furthermore, some ISPs tamper with headers to allow tracking, and some
environments (e.g. schools and libraries) perform opportunistic
filtering on HTTP connections to filter certain content (and a lot of
this filtering is really simplistic).

Moreover, Google is planning on using this and filters in place of Git
LFS for large objects. I expect that if this approach becomes viable, it
may actually grow authentication functionality, or, depending on how the
series uses the existing code, it may already have it. In such a case,
we should not allow authentication to go over a plaintext connection
when the user thinks that the connection they're using is encrypted
(since they used an SSH or HTTPS URL to clone or fetch).

Downgrades from HTTPS to HTTP are generally considered CVE-worthy. We
need to make sure that we refuse to allow a downgrade on the client
side, even if the server ignores our request for a secure protocol.

> > Second, this feature likely should be opt-in for SSH. One issue I've
> > seen repeatedly is that people don't want to use HTTPS to fetch things
> > when they're using SSH for Git. Many people in corporate environments
> > have proxies that break HTTP for non-browser use cases[0], and using SSH
> > is the only way that they can make a functional Git connection.
> 
> Yeah, there should definitely be accommodations for such clients, per my
> reading clients can always ignore the CDN and proceed with a normal
> negotiation. Isn't that enough, or is something extra needed?

I think at least a config option and a command line flag are needed to
be able to turn CDN usage off. There needs to be an easy way for people
in broken environments to circumvent the breakage.
Ævar Arnfjörð Bjarmason Feb. 22, 2019, 9:34 a.m. UTC | #13
On Thu, Feb 21 2019, brian m. carlson wrote:

> On Tue, Feb 19, 2019 at 02:44:31PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Tue, Dec 04 2018, brian m. carlson wrote:
>> > First, I'd like to see a section (and a bit in the implementation)
>> > requiring HTTPS if the original protocol is secure (SSH or HTTPS).
>> > Allowing the server to downgrade to HTTP, even by accident, would be a
>> > security problem.
>>
>> Maybe I've misunderstood the design (I'm writing some other follow-up
>> E-Mails in this thread which might clarify things for me), but I don't
>> see why.
>>
>> We get the ref advertisement from the server. We don't need to trust the
>> CDN server or the transport layer. We just download whatever we get from
>> there, validate the packfile with SHA-1 (and in the future SHA-256). It
>> doesn't matter if the CDN transport is insecure.
>>
>> You can do this offline with git today, you don't need to trust me to
>> trust that my copy of git.git I give you on a sketchy USB stick is
>> genuine. Just unpack it, then compare the SHA-1s you get with:
>>
>>     git ls-remote https://github.com/git/git.git
>>
>> So this is a case similar to Debian's where they distribute packages
>> over http, but manifests over https: https://whydoesaptnotusehttps.com
>
> This assumes that integrity of the data is the only reason you'd want to
> use HTTPS. There's also confidentiality. Perhaps a user is downloading
> data that will help them circumvent the Great Firewall of China. A
> downgrade to HTTP could result in a long prison sentence.
>
> Furthermore, some ISPs tamper with headers to allow tracking, and some
> environments (e.g. schools and libraries) perform opportunistic
> filtering on HTTP connections to filter certain content (and a lot of
> this filtering is really simplistic).
>
> Moreover, Google is planning on using this and filters in place of Git
> LFS for large objects. I expect that if this approach becomes viable, it
> may actually grow authentication functionality, or, depending on how the
> series uses the existing code, it may already have it. In such a case,
> we should not allow authentication to go over a plaintext connection
> when the user thinks that the connection they're using is encrypted
> (since they used an SSH or HTTPS URL to clone or fetch).
>
> Downgrades from HTTPS to HTTP are generally considered CVE-worthy. We
> need to make sure that we refuse to allow a downgrade on the client
> side, even if the server ignores our request for a secure protocol.

All good points. I definitely agree we shouldn't do downgrading by
default for the reasons you've outlined, and should e.g. make this an opt-in.

I'm just mindful that git's used as infrastructure in a lot of unusual
cases, e.g. something like what apt's doing (after carefully weighing
http vs. https for their use-case).

So I think providing some optional escape hatch is still a good idea.

>> > Second, this feature likely should be opt-in for SSH. One issue I've
>> > seen repeatedly is that people don't want to use HTTPS to fetch things
>> > when they're using SSH for Git. Many people in corporate environments
>> > have proxies that break HTTP for non-browser use cases[0], and using SSH
>> > is the only way that they can make a functional Git connection.
>>
>> Yeah, there should definitely be accommodations for such clients, per my
>> reading clients can always ignore the CDN and proceed with a normal
>> negotiation. Isn't that enough, or is something extra needed?
>
> I think at least a config option and a command line flag are needed to
> be able to turn CDN usage off. There needs to be an easy way for people
> in broken environments to circumvent the breakage.

Yeah, but let's try hard to make it Just Work. I.e. if in the middle of
the dialog the CDN connection is broken, can we retry then, and if that
fails, just continue with negotiation against the server?

As opposed to erroring by default, and the user needing to retry with
some config option...
Christian Couder Feb. 22, 2019, 11:35 a.m. UTC | #14
On Tue, Feb 19, 2019 at 9:10 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > > Good points about SSH support and the client needing to control which
> > > protocols the server will send URIs for. I'll include a line in the
> > > client request in which the client can specify which protocols it is OK
> > > with.
> >
> > What if a client is ok to fetch from some servers but not others (for
> > example github.com and gitlab.com but nothing else)?
> >
> > Or what if a client is ok to fetch using SSH from some servers and
> > HTTPS from other servers but nothing else?
>
> The objects received from the various CDNs are still rehashed by the
> client (so they are identified with the correct name), and if the client
> is fetching from a server, presumably it can trust the URLs it receives
> (just like it trusts ref names, and so on). Do you know of a specific
> case in which a client wants to fetch from some servers but not others?

For example, I think the Great Firewall of China lets people in China
use GitHub.com but not Google.com. So if people start configuring
their repos on GitHub so that they send packs that contain Google.com
CDN URLs (or actually anything that the Firewall blocks), it might
create many problems for users in China if they don't have a way to
opt out of receiving packs with those kinds of URLs.

> (In any case, if this happens, the client can just disable the CDN
> support.)

Would this mean that people in China will not be able to use the
feature at all, because too many of their clones could be blocked? Or
that they will have to create forks to mirror any interesting repo and
reconfigure those forks to work well from China?

> > I also wonder in general how this would interact with promisor/partial
> > clone remotes.
> >
> > When we discussed promisor/partial clone remotes in the thread
> > following this email:
> >
> > https://public-inbox.org/git/20181016174304.GA221682@aiede.svl.corp.google.com/
> >
> > it looked like you were ok with having many promisor remotes, which I
> > think could fill the same use cases especially related to large
> > objects.
> >
> > As clients would configure promisor remotes explicitly, there would be
> > no issues about which protocol and servers are allowed or not.
> >
> > If the issue is that you want the server to decide which promisor
> > remotes would be used without the client having to do anything, maybe
> > that could be something added on top of the possibility to have many
> > promisor remotes.
>
> It's true that there is a slight overlap with respect to large objects,
> but this protocol can also handle large sets of objects being offloaded
> to CDN, not only single ones.

Isn't partial clone also designed to handle large sets of objects?

> (The included implementation only handles
> single objects, as a minimum viable product, but it is conceivable that
> the server implementation is later expanded to allow offloading of sets
> of objects.)
>
> And this protocol is meant to be able to use CDNs to help serve objects,
> whether single objects or sets of objects. In the case of promisor
> remotes, the thing we fetch from has to be a Git server.

When we discussed the plan for many promisor remotes, Jonathan Nieder
(in the email linked above) suggested:

 2. Simplifying the protocol for fetching missing objects so that it
    can be satisfied by a lighter weight object storage system than
    a full Git server.  The ODB helpers introduced in this series are
    meant to speak such a simpler protocol since they are only used
    for one-off requests of a collection of missing objects instead of
    needing to understand refs, Git's negotiation, etc.

and I agreed with that point.

Is there something that you don't like in many promisor remotes?
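
For comparison, and assuming the many-promisor-remotes support
discussed there, the client-side setup might look something like this
(made-up names and values):

    [remote "cdn"]
        url = https://cdn.example.com/repo.git
        promisor = true
        partialCloneFilter = blob:limit=1m

that is, the client explicitly chooses the extra server it is willing
to fetch missing objects from, rather than the main server choosing
for it.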

> (We could use
> dumb HTTP from a CDN, but that defeats the purpose in at least one way -
> with dumb HTTP, we have to fetch objects individually, but with URL
> support, we can fetch objects as sets too.)

Patch

diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000000..6535801486
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,83 @@ 
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises `packfile-uris`.
+
+If the client replies with the following arguments:
+
+ * packfile-uris
+ * thin-pack
+ * ofs-delta
+
+when the server sends the packfile, it MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
+this section.
+
+Clients then should understand that the returned packfile could be incomplete,
+and that it needs to download all the given URIs before the fetch or clone is
+complete. Each URI should point to a Git packfile (which may be a thin pack and
+which may contain offset deltas).
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
+with the given sha1 can be replaced by the given URI. This allows, for example,
+servers to delegate serving of large blobs to CDNs.
+
+Client design
+-------------
+
+While fetching, the client needs to remember the list of URIs and cannot
+declare that the fetch is complete until all URIs have been downloaded as
+packfiles.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+The client can inhibit this feature (i.e. refrain from sending the
+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, a long-running process that takes in entire requests and
+   outputs a list of URIs and the corresponding inclusion and exclusion sets of
+   objects. This allows, e.g., signed URIs to be used and packfiles for common
+   requests to be cached.
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
+
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 345c00e08c..2cb1c41742 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -313,7 +313,8 @@  header. Most sections are sent only when the packfile is sent.
 
     output = acknowledgements flush-pkt |
 	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
-	     [wanted-refs delim-pkt] packfile flush-pkt
+	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
+	     packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -331,6 +332,9 @@  header. Most sections are sent only when the packfile is sent.
 		  *PKT-LINE(wanted-ref LF)
     wanted-ref = obj-id SP refname
 
+    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
+
     packfile = PKT-LINE("packfile" LF)
 	       *PKT-LINE(%x01-03 *%x00-ff)
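
For illustration, with the grammar above, a response that uses the new
section might look roughly like this on the wire (pkt-line length
prefixes omitted for readability; the URI is made up):

    packfile-uris
    uri https://cdn.example.com/pack-1234.pack
    0001                (delim-pkt)
    packfile
    ....                (sideband-multiplexed pack data)
    0000                (flush-pkt)

after which the client still has to download and index
pack-1234.pack before it can consider the fetch complete.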