
[v2,6/9] Documentation: add Packfile URIs design doc

Message ID 6344c225897de1a2d8aa86d610e9eaf1c6ec82b4.1591821067.git.jonathantanmy@google.com (mailing list archive)
State New, archived
Series: CDN offloading update

Commit Message

Jonathan Tan June 10, 2020, 8:57 p.m. UTC
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/packfile-uri.txt | 78 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  | 32 +++++++++-
 2 files changed, 109 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

Comments

Junio C Hamano June 11, 2020, 1:55 a.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> +The server advertises the `packfile-uris` capability.
> +
> +If the client then communicates which protocols (HTTPS, etc.) it supports with
> +a `packfile-uris` argument, the server MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing URIs of any of the given protocols. The URIs point to
> +packfiles that use only features that the client has declared that it supports
> +(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
> +this section.

Even if we have some pre-packaged packfiles, if the requestor lacks
some feature (by the way, shouldn't we consistently use "capability"
instead of "feature" to refer to things like "ofs-delta"?) needed to
understand them, we would already know that fact (i.e. the lack of
capability on the other side) by the time we have to decide whether
we can send the packfile-uris section.  OK, this makes sense.

> +Clients should then download and index all the given URIs (in addition to
> +downloading and indexing the packfile given in the `packfile` section of the
> +response) before performing the connectivity check.

Looking good.

Are there other requirements on the packfiles that can be retrieved
with the URIs listed in the packfile-uris section?  Clearly they
must, together with the contents of the dynamic 'packfile' section,
give a fully connected set of objects to the other side, given the
objects it claimed to have.  But are we allowed to include
extra/unasked-for objects in such a packfile?  Or is it better to
leave it unspecified?

Thanks.
Ævar Arnfjörð Bjarmason Nov. 25, 2020, 9:15 a.m. UTC | #2
On Wed, Jun 10 2020, Jonathan Tan wrote:

> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, all such
> +blobs are excluded, replaced with URIs. The client will download those URIs,
> +expecting them to each point to packfiles containing single blobs.

I was poking at this recently to see whether I could change it into the
more dumb method I noted in
https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/

As an aside on a protocol level could that be supported with this
current verb by having the client say "packfile-uris=early" or something
like that instead of "packfile-uris"? The server advertising the same,
and the client then just requesting packfile-uris before ls-refs or
whatever? The server would need to be stateful about what's requested
when and serve up something different than the current
one-blob-per-pack. Any pointers to where/how to implement that would be
welcome, I got lost in the non-linearity of the
connect.c/fetch-pack.c/upload-pack.c code yesterday.

But I'm mainly replying here to ask if it's intentional that clients are
tolerant of the server sending whatever it pleases in the supposedly
"single blob" packs. As demonstrated by the tests passing with this
patch:
    
    diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
    index 7d5b17909bb..4fe2030f4c1 100755
    --- a/t/t5702-protocol-v2.sh
    +++ b/t/t5702-protocol-v2.sh
    @@ -797,11 +797,12 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
     
     configure_exclusion () {
     	git -C "$1" hash-object "$2" >objh &&
    +	echo -n shattered | git -C "$1" hash-object --stdin -w >>objh &&
     	git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
     	git -C "$1" config --add \
     		"uploadpack.blobpackfileuri" \
    -		"$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
    -	cat objh
    +		"$(head -n 1 objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
    +	head -n 1 objh
     }
     
     test_expect_success 'part of packfile response provided as URI' '
    @@ -820,10 +821,11 @@ test_expect_success 'part of packfile response provided as URI' '
     	configure_exclusion "$P" my-blob >h &&
     	configure_exclusion "$P" other-blob >h2 &&
     
    -	GIT_TRACE=1 GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
    -	git -c protocol.version=2 \
    +	GIT_TRACE=1 GIT_TRACE2="$(pwd)/log" GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
    +	CHECK_SHATTERED=1 git -c protocol.version=2 \
     		-c fetch.uriprotocols=http,https \
     		clone "$HTTPD_URL/smart/http_parent" http_child &&
    +	cp "$(pwd)/log" /tmp/clone.log &&
     
     	# Ensure that my-blob and other-blob are in separate packfiles.
     	for idx in http_child/.git/objects/pack/*.idx
    @@ -832,7 +834,7 @@ test_expect_success 'part of packfile response provided as URI' '
     		{
     			grep "^[0-9a-f]\{16,\} " out || :
     		} >out.objectlist &&
    -		if test_line_count = 1 out.objectlist
    +		if true
     		then
     			if grep $(cat h) out
     			then

As you may guess from the "shattered" I was trying to find if the
particulars around the partial fsck allowed me to exploit this somehow,
I haven't found a way to do that, just be annoying by sending the client
more than they asked for, but I could also do that with the normal
dialog. Just wondering if the client should be opening the pack and
barfing if it has more than one object, or not care.
Jonathan Tan Nov. 25, 2020, 7:09 p.m. UTC | #3
> On Wed, Jun 10 2020, Jonathan Tan wrote:
> 
> > +This is the implementation: a feature, marked experimental, that allows the
> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> > +<uri>` entries. Whenever the list of objects to be sent is assembled, all such
> > +blobs are excluded, replaced with URIs. The client will download those URIs,
> > +expecting them to each point to packfiles containing single blobs.
> 
> I was poking at this recently to see whether I could change it into the
> more dumb method I noted in
> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/
> 
> As an aside on a protocol level could that be supported with this
> current verb by having the client say "packfile-uris=early" or something
> like that instead of "packfile-uris"? 

Hmm...would the advantage of this be that the client can subsequently
report any OIDs it finds as "want"s?

I guess the protocol could be extended to support "packfile-uris" at any
negotiation step.

> The server advertising the same,
> and the client then just requesting packfile-uris before ls-refs or
> whatever? The server would need to be stateful about what's requested
> when and serve up something different than the current
> one-blob-per-pack. 

Statefulness will be difficult. Right now, protocol v2 is stateless,
and updating it to be stateful will be difficult, I believe - at least
for HTTP, the stateless design has long been there and other
implementations of Git or systems that use Git might have already made
that assumption (it is stateless both for protocol v0 and v2).

As for serving more than one blob per pack, the current protocol and
implementation already allow this. You can see a demonstration by
cloning the following repository, which supports it on the server side:

  GIT_TRACE_PACKET=1 git -c fetch.uriprotocols=https clone \
    https://chromium.googlesource.com/chromium/src/base
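
(For contrast, the experimental implementation in this series is
configured per excluded blob.  A minimal sketch of setting that up,
with hypothetical repository path and CDN URL, using the three-field
"<object-hash> <pack-hash> <uri>" value format that the t5702 tests
quoted earlier in this thread use:)

  # "big-blob" is a file whose blob is already in the srv.git repository;
  # publish it as a single-blob pack and tell upload-pack to offload it.
  blob=$(git -C srv.git hash-object big-blob) &&
  pack=$(echo "$blob" | git -C srv.git pack-objects /var/www/cdn/mypack) &&
  git -C srv.git config --add uploadpack.blobpackfileuri \
    "$blob $pack https://cdn.example.com/mypack-$pack.pack"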

> Any pointers to where/how to implement that would be
> welcome, I got lost in the non-linearity of the
> connect.c/fetch-pack.c/upload-pack.c code yesterday.

upload_pack_v2() in upload-pack.c and do_fetch_pack_v2() in fetch-pack.c
have the state machines of the server and client side respectively - I
think those would be the first places to look.

> But I'm mainly replying here to ask if it's intentional that clients are
> tolerant of the server sending whatever it pleases in the supposedly
> "single blob" packs. As demonstrated by the tests passing with this
> patch:

[snip test]

Yes, it has the same tolerance w.r.t. the packfile URI packs as w.r.t.
the inline packfile that gets sent.

> As you may guess from the "shattered" I was trying to find if the
> particulars around the partial fsck allowed me to exploit this somehow,
> I haven't found a way to do that, just be annoying by sending the client
> more than they asked for, but I could also do that with the normal
> dialog. Just wondering if the client should be opening the pack and
> barfing if it has more than one object, or not care.

Ah yes, as you said, it's the same as with the normal dialog.
Ævar Arnfjörð Bjarmason Dec. 1, 2020, 12:48 p.m. UTC | #4
On Wed, Nov 25 2020, Jonathan Tan wrote:

>> On Wed, Jun 10 2020, Jonathan Tan wrote:
>> 
>> > +This is the implementation: a feature, marked experimental, that allows the
>> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
>> > +<uri>` entries. Whenever the list of objects to be sent is assembled, all such
>> > +blobs are excluded, replaced with URIs. The client will download those URIs,
>> > +expecting them to each point to packfiles containing single blobs.
>> 
>> I was poking at this recently to see whether I could change it into the
>> more dumb method I noted in
>> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/
>> 
>> As an aside on a protocol level could that be supported with this
>> current verb by having the client say "packfile-uris=early" or something
>> like that instead of "packfile-uris"? 
>
> Hmm...would the advantage of this be that the client can subsequently
> report any OIDs it finds as "want"s?

Yes, as a means-to-an-end of allowing the server CDN part to be dumb and
possibly out-of-date.

I.e. the current "here's some blobs and then a complementary pack"
requires very close CDN/server coordination.

Whereas if you can say early in the dialog "here's a pack that should
have most of what you need, then do a want/fetch to get up-to-date" you
can just make that pack in a cronjob, the pack could even be corrupted
etc. and we'd just optimistically fall back on a full clone.

It wouldn't be reporting any OID it finds (too noisy). I haven't
experimented, but one dumb way to do it is for the server to serve up a
pack with "start negotiating with this SHA", which would just be a
recent revision of HEAD guaranteed to be in the served pack.

Or the client could scour the pack and e.g. find all commit tips in it
by running the equivalent of commit-graph on it first. More expensive,
but expensive on the client-side, and allows e.g. re-using that codepath
to clone on the basis of a GIT_ALTERNATE_OBJECT_DIRECTORIES containing
objects the <remote url> might have already.
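
(A rough sketch of that "scour the pack" idea using existing plumbing,
assuming the CDN pack was downloaded to big-cdn.pack and indexed into
an otherwise empty object store; finding true tips would additionally
mean dropping commits reachable from other commits in the pack:)

    git init --bare scratch.git &&
    git -C scratch.git index-pack --stdin <big-cdn.pack &&
    git -C scratch.git cat-file --batch-all-objects \
        --batch-check='%(objecttype) %(objectname)' |
        awk '$1 == "commit" { print $2 }' >candidate-haves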

> I guess the protocol could be extended to support "packfile-uris" at any
> negotiation step.

*nod*

>> The server advertising the same,
>> and the client then just requesting packfile-uris before ls-refs or
>> whatever? The server would need to be stateful about what's requested
>> when and serve up something different than the current
>> one-blob-per-pack. 
>
> Statefulness will be difficult. Right now, protocol v2 is stateless,
> and updating it to be stateful will be difficult, I believe - at least
> for HTTP, the stateless design has long been there and other
> implementations of Git or systems that use Git might have already made
> that assumption (it is stateless both for protocol v0 and v2).

Sorry, on second thought what I'm suggesting doesn't really require
state, just that the server get/read/understand the "client supports
this" flags, which it already does.

> As for serving more than one blob per pack, the current protocol and
> implementation already allow this. You can see a demonstration by
> cloning the following repository, which supports it on the server side:
>
>   GIT_TRACE_PACKET=1 git -c fetch.uriprotocols=https clone \
>     https://chromium.googlesource.com/chromium/src/base

So cloning that produces two packs (the packfile-uri one, and the
negotiated one), the packfile-uri one being:
    
    $ git show-index <objects/pack/pack-d917d9803d3acb03da7ea9e8ebb8e643364ba051.idx
    12 3ea853619c232488e683139217585f520ec636e0 (8e33883c)
    9371 83d1358e4979bf9aff879cb4150276cd3894463a (0ea17153)
    13640 d0f8c611c044b36b8ac57b7a3af18a8d88e4dde2 (af8be7a0)
    572 da9ca8b36d5029bd9b18addc054ba9c0b016e6bc (d57fdbac)

So 3ea853619 is a commit with a tree da9ca8b36/; that tree has a
pointer to win/ via 83d1358e4, which has a blob win/.clang-tidy at
d0f8c611c.
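
(On the wire, per the grammar added to protocol-v2.txt in the patch
below, that pack is announced in a packfile-uris section of roughly
this shape, each line pkt-line framed; the CDN URI is illustrative:)

    packfile-uris
    d917d9803d3acb03da7ea9e8ebb8e643364ba051 https://cdn.example.com/pack-d917d9803d3acb03da7ea9e8ebb8e643364ba051.pack
    <delim-pkt>
    packfile
    <packfile data, sideband-encoded>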

So this is intended & you're already using it like that. So shouldn't
the docs be updated from:

    [...]one or more `uploadpack.blobPackfileUri=<sha1> <uri>`
    entries. Whenever the list of objects to be sent is assembled, all
    such blobs are excluded, replaced with URIs. The client will
    download those URIs, expecting them to each point to packfiles
    containing single blobs.

To something like:

    [...]one or more `uploadpack.postWantPackfileUri=<sha1> <uri>`
    entries. When the server sends the subsequent full pack, any
    objects sent out-of-band may be excluded. The client will download
    those URIs, expecting each to contain some number of objects. The
    union of the packs found at these URIs and the server's own returned
    packfile is expected to yield a valid repository that'll pass an
    fsck.

Aside from the hassle of renaming the "uploadpack.blobPackfileUri"
variable to some more accurate "this clearly has nothing to do with
blobs per se, really" name, that re-done paragraph seems to more
accurately reflect what this is doing & intended to do.

Also, why is it necessary to make it an error if the expected hash for
the packfile doesn't match, since we're about to do a full fsck
connectivity check anyway? The POC patch at the end of this mail[1]
shows that we mostly support it not matching now.

I suppose you might want to point to an evil CDN, but since it
currently needs to fsck, the potential for that evilness seems rather
limited. It's not a hassle in the current closely coupled server/CDN
implementation, but for e.g. a server dumbly pointing at
https://some-cdn.example/REPONAME/some-big-recent-pack.pack it is a
hassle, so having it at least optionally support the NULL_SHA would be
ideal for that. So I'm curious what you think it adds to the overall
security of the feature.

>> Any pointers to where/how to implement that would be
>> welcome, I got lost in the non-linearity of the
>> connect.c/fetch-pack.c/upload-pack.c code yesterday.
>
> upload_pack_v2() in upload-pack.c and do_fetch_pack_v2() in fetch-pack.c
> have the state machines of the server and client side respectively - I
> think those would be the first places to look.

Thanks.

>> But I'm mainly replying here to ask if it's intentional that clients are
>> tolerant of the server sending whatever it pleases in the supposedly
>> "single blob" packs. As demonstrated by the tests passing with this
>> patch:
>
> [snip test]
>
> Yes, it has the same tolerance w.r.t. the packfile URI packs as w.r.t.
> the inline packfile that gets sent.
>
>> As you may guess from the "shattered" I was trying to find if the
>> particulars around the partial fsck allowed me to exploit this somehow,
>> I haven't found a way to do that, just be annoying by sending the client
>> more than they asked for, but I could also do that with the normal
>> dialog. Just wondering if the client should be opening the pack and
>> barfing if it has more than one object, or not care.
>
> Ah yes, as you said, it's the same as with the normal dialog.

1.:

diff --git a/fetch-pack.c b/fetch-pack.c
index b10c432315..87d948b023 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1661,9 +1661,9 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 
 		if (memcmp(packfile_uris.items[i].string, packname,
 			   the_hash_algo->hexsz))
-			die("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
-			    uri, (int) the_hash_algo->hexsz,
-			    packfile_uris.items[i].string);
+			warning("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
+				uri, (int) the_hash_algo->hexsz,
+				packfile_uris.items[i].string);
 
 		string_list_append_nodup(pack_lockfiles,
 					 xstrfmt("%s/pack/pack-%s.keep",
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 7d5b17909b..bc0fc4d8e3 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -798,9 +798,13 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
 configure_exclusion () {
 	git -C "$1" hash-object "$2" >objh &&
 	git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
+	fake_sha=$(git hash-object --stdin <packh) &&
+	mv \
+		"$HTTPD_DOCUMENT_ROOT_PATH/mypack-$(cat packh).pack" \
+		"$HTTPD_DOCUMENT_ROOT_PATH/mypack-$fake_sha.pack"
 	git -C "$1" config --add \
 		"uploadpack.blobpackfileuri" \
-		"$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
+		"$(cat objh) $fake_sha $HTTPD_URL/dumb/mypack-$fake_sha.pack" &&
 	cat objh
 }
 
@@ -852,7 +856,7 @@ test_expect_success 'part of packfile response provided as URI' '
 	test_line_count = 6 filelist
 '
 
-test_expect_success 'fetching with valid packfile URI but invalid hash fails' '
+test_expect_success 'fetching with valid packfile URI but invalid hash warns' '
 	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
 	rm -rf "$P" http_child log &&
 
@@ -870,13 +874,12 @@ test_expect_success 'fetching with valid packfile URI but invalid hash fails' '
 	# the hash of the packfile, since the hash does not matter for this
 	# test as long as it is not the hash of the pack, and it is of the
 	# expected length.
-	git -C "$P" hash-object other-blob >objh &&
 	git -C "$P" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
 	git -C "$P" config --add \
 		"uploadpack.blobpackfileuri" \
 		"$(cat objh) $(cat objh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
 
-	test_must_fail env GIT_TEST_SIDEBAND_ALL=1 \
+	env GIT_TEST_SIDEBAND_ALL=1 \
 		git -c protocol.version=2 \
 		-c fetch.uriprotocols=http,https \
 		clone "$HTTPD_URL/smart/http_parent" http_child 2>err &&

Patch

diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000000..318713abc3
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,78 @@ 
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises the `packfile-uris` capability.
+
+If the client then communicates which protocols (HTTPS, etc.) it supports with
+a `packfile-uris` argument, the server MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing URIs of any of the given protocols. The URIs point to
+packfiles that use only features that the client has declared that it supports
+(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
+this section.
+
+Clients should then download and index all the given URIs (in addition to
+downloading and indexing the packfile given in the `packfile` section of the
+response) before performing the connectivity check.
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, all such
+blobs are excluded, replaced with URIs. The client will download those URIs,
+expecting them to each point to packfiles containing single blobs.
+
+Client design
+-------------
+
+The client has a config variable `fetch.uriprotocols` that determines which
+protocols the end user is willing to use. By default, this is empty.
+
+When the client downloads the given URIs, it should store them with "keep"
+files, just like it does with the packfile in the `packfile` section. These
+additional "keep" files can only be removed after the refs have been updated -
+just like the "keep" file for the packfile in the `packfile` section.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, more sophisticated means of excluding objects (e.g. by
+   specifying a commit to represent that commit and all objects that it
+   references).
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 995f07481e..f9f4e4ddd0 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -323,13 +323,26 @@  included in the client's request:
 	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
 	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
 
+If the 'packfile-uris' feature is advertised, the following argument
+can be included in the client's request as well as the potential
+addition of the 'packfile-uris' section in the server's response as
+explained below.
+
+    packfile-uris <comma-separated list of protocols>
+	Indicates to the server that the client is willing to receive
+	URIs of any of the given protocols in place of objects in the
+	sent packfile. Before performing the connectivity check, the
+	client should download from all given URIs. Currently, the
+	protocols supported are "http" and "https".
+
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
 header. Most sections are sent only when the packfile is sent.
 
     output = acknowledgements flush-pkt |
 	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
-	     [wanted-refs delim-pkt] packfile flush-pkt
+	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
+	     packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -347,6 +360,9 @@  header. Most sections are sent only when the packfile is sent.
 		  *PKT-LINE(wanted-ref LF)
     wanted-ref = obj-id SP refname
 
+    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
+
     packfile = PKT-LINE("packfile" LF)
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
@@ -418,6 +434,20 @@  header. Most sections are sent only when the packfile is sent.
 	* The server MUST NOT send any refs which were not requested
 	  using 'want-ref' lines.
 
+    packfile-uris section
+	* This section is only included if the client sent
+	  'packfile-uris' and the server has at least one such URI to
+	  send.
+
+	* Always begins with the section header "packfile-uris".
+
+	* For each URI the server sends, it sends a hash of the pack's
+	  contents (as output by git index-pack) followed by the URI.
+
+	* The hashes are 40 hex characters long. When Git upgrades to a new
+	  hash algorithm, this might need to be updated. (It should match
+	  whatever index-pack outputs after "pack\t" or "keep\t".)
+
     packfile section
 	* This section is only included if the client has sent 'want'
 	  lines in its request and either requested that no more