
[v2,0/8] CDN offloading of fetch response

Message ID cover.1552073690.git.jonathantanmy@google.com

Message

Jonathan Tan March 8, 2019, 9:55 p.m. UTC
Here's my current progress - the only thing that is lacking is more
tests, maybe, so I think it's ready for review.

I wrote in version 1:

> I know
> that there are some implementation details that could be improved (e.g.
> parallelization of the CDN downloads, starting CDN downloads *after*
> closing the first HTTP request, holding on to the .keep locks until
> after the refs are set), but will work on those once the overall design
> is more or less finalized.

This code now starts CDN downloads after closing the first HTTP request,
and it holds on to the .keep files until after the refs are set. I'll
leave parallelization of the CDN downloads for later work.
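
(For reference, a rough sketch of how the series does that, with a
made-up CDN URL and pack hash: for each URI the server advertised,
fetch-pack spawns a dumb-HTTP download and reads its single line of
output,

    $ git http-fetch --packfile "$CDN_URL/one.pack"
    keep	0123456789abcdef0123456789abcdef01234567

i.e. "keep", a TAB, and the hash that index-pack computed for the pack.
fetch-pack then records $GIT_DIR/objects/pack/pack-<that hash>.keep as a
pack lockfile and only drops it once the refs have been updated.)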

One relatively significant change: someone pointed out that the issue fixed by 
50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
occur here, so I have changed it so that the server is required to send
the packfile's hash along with the URI.
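
(Concretely - placeholder hashes and URL, not output from the series -
each uploadpack.blobpackfileuri value now takes the form
"<object-hash> <pack-hash> <uri>" instead of "<object-hash> <uri>":

    $ git config --add uploadpack.blobpackfileuri \
        "<object-hash> <pack-hash> https://cdn.example.com/one.pack"

and each line of the packfile-uris response section is likewise
"<pack-hash> SP <uri>", which the client compares against the hash
index-pack reports for the pack it actually downloaded.)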

This does mean that Ævar's workflow described in [1] would not work.
Quoting Ævar:

> More concretely, I'd like to have a setup where a server can just dumbly
> point to some URL that probably has most of the data, without having any
> idea what OIDs are in it. So that e.g. some machine entirely
> disconnected from the server (and with just a regular clone) can
> continually generate an up-to-date-enough packfile.

With 50d3413740, it seems to me that it's important for the server to
know details about the URIs that it points to, so such a disconnection
would not work.

[1] https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/

Jonathan Tan (8):
  http: use --stdin when getting dumb HTTP pack
  http: improve documentation of http_pack_request
  http-fetch: support fetching packfiles by URL
  Documentation: order protocol v2 sections
  Documentation: add Packfile URIs design doc
  upload-pack: refactor reading of pack-objects out
  fetch-pack: support more than one pack lockfile
  upload-pack: send part of packfile response as uri

 Documentation/git-http-fetch.txt         |   8 +-
 Documentation/technical/packfile-uri.txt |  78 ++++++++++++
 Documentation/technical/protocol-v2.txt  |  44 +++++--
 builtin/fetch-pack.c                     |  17 ++-
 builtin/pack-objects.c                   |  76 +++++++++++
 connected.c                              |   8 +-
 fetch-pack.c                             | 129 +++++++++++++++++--
 fetch-pack.h                             |   2 +-
 http-fetch.c                             |  64 ++++++++--
 http.c                                   |  82 +++++++-----
 http.h                                   |  32 ++++-
 t/t5550-http-fetch-dumb.sh               |  25 ++++
 t/t5702-protocol-v2.sh                   |  57 +++++++++
 transport-helper.c                       |   5 +-
 transport.c                              |  14 +-
 transport.h                              |   6 +-
 upload-pack.c                            | 155 +++++++++++++++++------
 17 files changed, 670 insertions(+), 132 deletions(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

Range-diff against v1:
1:  987c9b686b < -:  ---------- http: use --stdin and --keep when downloading pack
-:  ---------- > 1:  87173d0ad1 http: use --stdin when getting dumb HTTP pack
2:  4b53a6f48c ! 2:  66d31720a0 http: improve documentation of http_pack_request
    @@ -45,4 +45,4 @@
     + */
      extern struct http_pack_request *new_http_pack_request(
      	struct packed_git *target, const char *base_url);
    - int finish_http_pack_request(struct http_pack_request *preq, char **lockfile);
    + extern int finish_http_pack_request(struct http_pack_request *preq);
3:  afe73a282d ! 3:  02373f6afe http-fetch: support fetching packfiles by URL
    @@ -31,7 +31,8 @@
     +--packfile::
     +	Instead of a commit id on the command line (which is not expected in
     +	this case), 'git http-fetch' fetches the packfile directly at the given
    -+	URL and generates the corresponding .idx file.
    ++	URL and uses index-pack to generate corresponding .idx and .keep files.
    ++	The output of index-pack is printed to stdout.
     +
      --recover::
      	Verify that everything reachable from target is fetched.  Used after
    @@ -101,12 +102,12 @@
     +		struct http_pack_request *preq;
     +		struct slot_results results;
     +		int ret;
    -+		char *lockfile;
     +
     +		preq = new_http_pack_request(NULL, url);
     +		if (preq == NULL)
     +			die("couldn't create http pack request");
     +		preq->slot->results = &results;
    ++		preq->generate_keep = 1;
     +
     +		if (start_active_slot(preq->slot)) {
     +			run_active_slot(preq->slot);
    @@ -118,9 +119,8 @@
     +			die("Unable to start request");
     +		}
     +
    -+		if ((ret = finish_http_pack_request(preq, &lockfile)))
    ++		if ((ret = finish_http_pack_request(preq)))
     +			die("finish_http_pack_request gave result %d", ret);
    -+		unlink(lockfile);
     +		release_http_pack_request(preq);
     +		rc = 0;
     +	} else {
    @@ -180,6 +180,20 @@
      	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
      
     @@
    + 	argv_array_push(&ip.args, "--stdin");
    + 	ip.git_cmd = 1;
    + 	ip.in = tmpfile_fd;
    +-	ip.no_stdout = 1;
    ++	if (preq->generate_keep) {
    ++		argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX,
    ++				 (uintmax_t)getpid());
    ++		ip.out = 0;
    ++	} else {
    ++		ip.no_stdout = 1;
    ++	}
    + 
    + 	if (run_command(&ip)) {
    + 		ret = -1;
      		goto cleanup;
      	}
      
    @@ -243,6 +257,19 @@
      	 * pack list that target is in. finish_http_pack_request() will remove
      	 * target from lst and call install_packed_git() on target.
      	 */
    + 	struct packed_git **lst;
    + 
    ++	/*
    ++	 * If this is true, finish_http_pack_request() will pass "--keep" to
    ++	 * index-pack, resulting in the creation of a keep file, and will not
    ++	 * suppress its stdout (that is, the "keep\t<hash>\n" line will be
    ++	 * printed to stdout).
    ++	 */
    ++	unsigned generate_keep : 1;
    ++
    + 	/*
    + 	 * State managed by functions in http.c.
    + 	 */
     @@
      };
      
    @@ -269,7 +296,14 @@
     +test_expect_success 'http-fetch --packfile' '
     +	git init packfileclient &&
     +	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git && ls objects/pack/pack-*.pack) &&
    -+	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p &&
    ++	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p >out &&
    ++
    ++	# Ensure that the expected files are generated
    ++	grep "^keep.[0-9a-f]\{16,\}$" out &&
    ++	cut -c6- out >packhash &&
    ++	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).pack" &&
    ++	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).idx" &&
    ++	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).keep" &&
     +
     +	# Ensure that it has the HEAD of repo_pack, at least
     +	HASH=$(git -C "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git rev-parse HEAD) &&
4:  15a00cd3e1 = 4:  2e0f2daba0 Documentation: order protocol v2 sections
5:  9df7f6e9a4 ! 5:  5ce56844d3 Documentation: add Packfile URIs design doc
    @@ -64,7 +64,7 @@
     +after the Minimum Viable Product (see "Future work").
     +
     +The client can inhibit this feature (i.e. refrain from sending the
    -+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
    ++`packfile-uris` parameter) by passing --no-packfile-uris to `git fetch`.
     +
     +Future work
     +-----------
    @@ -93,6 +93,24 @@
      --- a/Documentation/technical/protocol-v2.txt
      +++ b/Documentation/technical/protocol-v2.txt
     @@
    + 	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
    + 	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
    + 
    ++If the 'packfile-uris' feature is advertised, the following argument
    ++can be included in the client's request as well as the potential
    ++addition of the 'packfile-uris' section in the server's response as
    ++explained below.
    ++
    ++    packfile-uris <comma-separated list of protocols>
    ++	Indicates to the server that the client is willing to receive
    ++	URIs of any of the given protocols in place of objects in the
    ++	sent packfile. Before performing the connectivity check, the
    ++	client should download from all given URIs. Currently, the
    ++	protocols supported are "http" and "https".
    ++
    + The response of `fetch` is broken into a number of sections separated by
    + delimiter packets (0001), with each section beginning with its section
    + header. Most sections are sent only when the packfile is sent.
      
          output = acknowledgements flush-pkt |
      	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
    @@ -107,8 +125,25 @@
          wanted-ref = obj-id SP refname
      
     +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
    -+    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
    ++    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
     +
          packfile = PKT-LINE("packfile" LF)
      	       *PKT-LINE(%x01-03 *%x00-ff)
      
    +@@
    + 	* The server MUST NOT send any refs which were not requested
    + 	  using 'want-ref' lines.
    + 
    ++    packfile-uris section
    ++	* This section is only included if the client sent
    ++	  'packfile-uris' and the server has at least one such URI to
    ++	  send.
    ++
    ++	* Always begins with the section header "packfile-uris".
    ++
    ++	* For each URI the server sends, it sends a hash of the pack's
    ++	  contents (as output by git index-pack) followed by the URI.
    ++
    +     packfile section
    + 	* This section is only included if the client has sent 'want'
    + 	  lines in its request and either requested that no more
6:  639235562e = 6:  d487f46b0f upload-pack: refactor reading of pack-objects out
-:  ---------- > 7:  f46504a166 fetch-pack: support more than one pack lockfile
7:  0e821b4427 ! 8:  c7546868cf upload-pack: send part of packfile response as uri
    @@ -5,12 +5,12 @@
         Teach upload-pack to send part of its packfile response as URIs.
     
         An administrator may configure a repository with one or more
    -    "uploadpack.blobpackfileuri" lines, each line containing an OID and a
    -    URI. A client may configure fetch.uriprotocols to be a comma-separated
    -    list of protocols that it is willing to use to fetch additional
    -    packfiles - this list will be sent to the server. Whenever an object
    -    with one of those OIDs would appear in the packfile transmitted by
    -    upload-pack, the server may exclude that object, and instead send the
    +    "uploadpack.blobpackfileuri" lines, each line containing an OID, a pack
    +    hash, and a URI. A client may configure fetch.uriprotocols to be a
    +    comma-separated list of protocols that it is willing to use to fetch
    +    additional packfiles - this list will be sent to the server. Whenever an
    +    object with one of those OIDs would appear in the packfile transmitted
    +    by upload-pack, the server may exclude that object, and instead send the
         URI. The client will then download the packs referred to by those URIs
         before performing the connectivity check.
     
    @@ -35,6 +35,7 @@
      
     +struct configured_exclusion {
     +	struct oidmap_entry e;
    ++	char *pack_hash_hex;
     +	char *uri;
     +};
     +static struct oidmap configured_exclusions;
    @@ -60,6 +61,8 @@
     +
     +		if (!ex)
     +			BUG("configured exclusion wasn't configured");
    ++		write_in_full(1, ex->pack_hash_hex, strlen(ex->pack_hash_hex));
    ++		write_in_full(1, " ", 1);
     +		write_in_full(1, ex->uri, strlen(ex->uri));
     +		write_in_full(1, "\n", 1);
     +	}
    @@ -100,15 +103,25 @@
      	}
     +	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
     +		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
    -+		const char *end;
    ++		const char *oid_end, *pack_end;
    ++		/*
    ++		 * Stores the pack hash. This is not a true object ID, but is
    ++		 * of the same form.
    ++		 */
    ++		struct object_id pack_hash;
     +
    -+		if (parse_oid_hex(v, &ex->e.oid, &end) || *end != ' ')
    ++		if (parse_oid_hex(v, &ex->e.oid, &oid_end) ||
    ++		    *oid_end != ' ' ||
    ++		    parse_oid_hex(oid_end + 1, &pack_hash, &pack_end) ||
    ++		    *pack_end != ' ')
     +			die(_("value of uploadpack.blobpackfileuri must be "
    -+			      "of the form '<sha-1> <uri>' (got '%s')"), v);
    ++			      "of the form '<object-hash> <pack-hash> <uri>' (got '%s')"), v);
     +		if (oidmap_get(&configured_exclusions, &ex->e.oid))
     +			die(_("object already configured in another "
     +			      "uploadpack.blobpackfileuri (got '%s')"), v);
    -+		ex->uri = xstrdup(end + 1);
    ++		ex->pack_hash_hex = xcalloc(1, pack_end - oid_end);
    ++		memcpy(ex->pack_hash_hex, oid_end + 1, pack_end - oid_end - 1);
    ++		ex->uri = xstrdup(pack_end + 1);
     +		oidmap_put(&configured_exclusions, ex);
     +	}
      	return git_default_config(k, v, cb);
    @@ -144,6 +157,42 @@
      
      /* Remember to update object flag allocation in object.h */
      #define COMPLETE	(1U << 0)
    +@@
    + }
    + 
    + static int get_pack(struct fetch_pack_args *args,
    +-		    int xd[2], struct string_list *pack_lockfiles)
    ++		    int xd[2], struct string_list *pack_lockfiles,
    ++		    int only_packfile)
    + {
    + 	struct async demux;
    + 	int do_keep = args->keep_pack;
    +@@
    + 					"--keep=fetch-pack %"PRIuMAX " on %s",
    + 					(uintmax_t)getpid(), hostname);
    + 		}
    +-		if (args->check_self_contained_and_connected)
    ++		if (only_packfile && args->check_self_contained_and_connected)
    + 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
    ++		else
    ++			/*
    ++			 * We cannot perform any connectivity checks because
    ++			 * not all packs have been downloaded; let the caller
    ++			 * have this responsibility.
    ++			 */
    ++			args->check_self_contained_and_connected = 0;
    + 		if (args->from_promisor)
    + 			argv_array_push(&cmd.args, "--promisor");
    + 	}
    +@@
    + 		alternate_shallow_file = setup_temporary_shallow(si->shallow);
    + 	else
    + 		alternate_shallow_file = NULL;
    +-	if (get_pack(args, fd, pack_lockfiles))
    ++	if (get_pack(args, fd, pack_lockfiles, 1))
    + 		die(_("git fetch-pack: fetch failed."));
    + 
    +  all_done:
     @@
      		warning("filtering not recognized by server, ignoring");
      	}
    @@ -175,27 +224,16 @@
      		die(_("error processing wanted refs: %d"), reader->status);
      }
      
    -+static void receive_packfile_uris(struct packet_reader *reader)
    ++static void receive_packfile_uris(struct packet_reader *reader,
    ++				  struct string_list *uris)
     +{
     +	process_section_header(reader, "packfile-uris", 0);
     +	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
    -+		const char *p;
    -+		struct child_process cmd = CHILD_PROCESS_INIT;
    -+
    -+
    -+		if (!skip_prefix(reader->line, "uri ", &p))
    -+			die("expected 'uri <uri>', got: %s\n", reader->line);
    ++		if (reader->pktlen < the_hash_algo->hexsz ||
    ++		    reader->line[the_hash_algo->hexsz] != ' ')
    ++			die("expected '<hash> <uri>', got: %s\n", reader->line);
     +
    -+		argv_array_push(&cmd.args, "http-fetch");
    -+		argv_array_push(&cmd.args, "--packfile");
    -+		argv_array_push(&cmd.args, p);
    -+		cmd.git_cmd = 1;
    -+		cmd.no_stdin = 1;
    -+		cmd.no_stdout = 1;
    -+		if (start_command(&cmd))
    -+			die("fetch-pack: unable to spawn");
    -+		if (finish_command(&cmd))
    -+			die("fetch-pack: unable to finish");
    ++		string_list_append(uris, reader->line);
     +	}
     +	if (reader->status != PACKET_READ_DELIM)
     +		die("expected DELIM");
    @@ -205,15 +243,81 @@
      	FETCH_CHECK_LOCAL = 0,
      	FETCH_SEND_REQUEST,
     @@
    + 	int in_vain = 0;
    + 	int haves_to_send = INITIAL_FLUSH;
    + 	struct fetch_negotiator negotiator;
    ++	struct string_list packfile_uris = STRING_LIST_INIT_DUP;
    ++	int i;
    ++
    + 	fetch_negotiator_init(&negotiator, negotiation_algorithm);
    + 	packet_reader_init(&reader, fd[0], NULL, 0,
    + 			   PACKET_READ_CHOMP_NEWLINE |
    +@@
    + 			if (process_section_header(&reader, "wanted-refs", 1))
      				receive_wanted_refs(&reader, sought, nr_sought);
      
    - 			/* get the pack */
    -+			if (process_section_header(&reader, "packfile-uris", 1)) {
    -+				receive_packfile_uris(&reader);
    -+			}
    +-			/* get the pack */
    ++			/* get the pack(s) */
    ++			if (process_section_header(&reader, "packfile-uris", 1))
    ++				receive_packfile_uris(&reader, &packfile_uris);
      			process_section_header(&reader, "packfile", 0);
    - 			if (get_pack(args, fd, pack_lockfile))
    +-			if (get_pack(args, fd, pack_lockfiles))
    ++			if (get_pack(args, fd, pack_lockfiles,
    ++				     !packfile_uris.nr))
      				die(_("git fetch-pack: fetch failed."));
    + 
    + 			state = FETCH_DONE;
    +@@
    + 		}
    + 	}
    + 
    ++	for (i = 0; i < packfile_uris.nr; i++) {
    ++		struct child_process cmd = CHILD_PROCESS_INIT;
    ++		char packname[GIT_MAX_HEXSZ + 1];
    ++		const char *uri = packfile_uris.items[i].string +
    ++			the_hash_algo->hexsz + 1;
    ++
    ++		argv_array_push(&cmd.args, "http-fetch");
    ++		argv_array_push(&cmd.args, "--packfile");
    ++		argv_array_push(&cmd.args, uri);
    ++		cmd.git_cmd = 1;
    ++		cmd.no_stdin = 1;
    ++		cmd.out = -1;
    ++		if (start_command(&cmd))
    ++			die("fetch-pack: unable to spawn http-fetch");
    ++
    ++		if (read_in_full(cmd.out, packname, 5) < 0 ||
    ++		    memcmp(packname, "keep\t", 5))
    ++			die("fetch-pack: expected keep then TAB at start of http-fetch output");
    ++
    ++		if (read_in_full(cmd.out, packname,
    ++				 the_hash_algo->hexsz + 1) < 0 ||
    ++		    packname[the_hash_algo->hexsz] != '\n')
    ++			die("fetch-pack: expected hash then LF at end of http-fetch output");
    ++
    ++		packname[the_hash_algo->hexsz] = '\0';
    ++
    ++		close(cmd.out);
    ++
    ++		if (finish_command(&cmd))
    ++			die("fetch-pack: unable to finish http-fetch");
    ++
    ++		if (memcmp(packfile_uris.items[i].string, packname,
    ++			   the_hash_algo->hexsz))
    ++			die("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
    ++			    uri, (int) the_hash_algo->hexsz,
    ++			    packfile_uris.items[i].string);
    ++
    ++		string_list_append_nodup(pack_lockfiles,
    ++					 xstrfmt("%s/pack/pack-%s.keep",
    ++						 get_object_directory(),
    ++						 packname));
    ++	}
    ++	string_list_clear(&packfile_uris, 0);
    ++
    + 	negotiator.release(&negotiator);
    + 	oidset_clear(&common);
    + 	return ref;
     @@
      	git_config_get_bool("transfer.fsckobjects", &transfer_fsck_objects);
      	git_config_get_string("fetch.negotiationalgorithm",
    @@ -237,6 +341,15 @@
      	test_i18ngrep "expected no other sections to be sent after no .ready." err
      '
      
    ++configure_exclusion () {
    ++	git -C "$1" hash-object "$2" >objh &&
    ++	git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
    ++	git -C "$1" config --add \
    ++		"uploadpack.blobpackfileuri" \
    ++		"$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
    ++	cat objh
    ++}
    ++
     +test_expect_success 'part of packfile response provided as URI' '
     +	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
     +	rm -rf "$P" http_child log &&
    @@ -250,32 +363,22 @@
     +	git -C "$P" add other-blob &&
     +	git -C "$P" commit -m x &&
     +
    -+	# Create a packfile for my-blob and configure it for exclusion.
    -+	git -C "$P" hash-object my-blob >h &&
    -+	git -C "$P" pack-objects --stdout <h \
    -+		>"$HTTPD_DOCUMENT_ROOT_PATH/one.pack" &&
    -+	git -C "$P" config \
    -+		"uploadpack.blobpackfileuri" \
    -+		"$(cat h) $HTTPD_URL/dumb/one.pack" &&
    -+
    -+	# Do the same for other-blob.
    -+	git -C "$P" hash-object other-blob >h2 &&
    -+	git -C "$P" pack-objects --stdout <h2 \
    -+		>"$HTTPD_DOCUMENT_ROOT_PATH/two.pack" &&
    -+	git -C "$P" config --add \
    -+		"uploadpack.blobpackfileuri" \
    -+		"$(cat h2) $HTTPD_URL/dumb/two.pack" &&
    ++	configure_exclusion "$P" my-blob >h &&
    ++	configure_exclusion "$P" other-blob >h2 &&
     +
    -+	GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
    -+		git -c protocol.version=2 \
    ++	GIT_TRACE=1 GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
    ++	git -c protocol.version=2 \
     +		-c fetch.uriprotocols=http,https \
    -+		clone  "$HTTPD_URL/smart/http_parent" http_child &&
    ++		clone "$HTTPD_URL/smart/http_parent" http_child &&
     +
     +	# Ensure that my-blob and other-blob are in separate packfiles.
     +	for idx in http_child/.git/objects/pack/*.idx
     +	do
     +		git verify-pack --verbose $idx >out &&
    -+		if test "$(grep "^[0-9a-f]\{40\} " out | wc -l)" = 1
    ++		{
    ++			grep "^[0-9a-f]\{16,\} " out || :
    ++		} >out.objectlist &&
    ++		if test_line_count = 1 out.objectlist
     +		then
     +			if grep $(cat h) out
     +			then
    @@ -288,7 +391,11 @@
     +		fi
     +	done &&
     +	test -f hfound &&
    -+	test -f h2found
    ++	test -f h2found &&
    ++
    ++	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
    ++	ls http_child/.git/objects/pack/* >filelist &&
    ++	test_line_count = 6 filelist
     +'
     +
      stop_httpd
    @@ -335,7 +442,7 @@
     +				packet_write_fmt(1, "\1packfile-uris\n");
     +			}
     +			*p = '\0';
    -+			packet_write_fmt(1, "\1uri %s\n", os->buffer);
    ++			packet_write_fmt(1, "\1%s\n", os->buffer);
     +
     +			os->used -= p - os->buffer + 1;
     +			memmove(os->buffer, p + 1, os->used);
8:  2c31d71166 < -:  ---------- SQUASH???

Comments

Josh Steadmon March 19, 2019, 8:48 p.m. UTC | #1
On 2019.03.08 13:55, Jonathan Tan wrote:
> Here's my current progress - the only thing that is lacking is more
> tests, maybe, so I think it's ready for review.

This series looks good to me.

Reviewed-by: Josh Steadmon <steadmon@google.com>
Jeff King April 23, 2019, 5:21 a.m. UTC | #2
On Fri, Mar 08, 2019 at 01:55:12PM -0800, Jonathan Tan wrote:

> Here's my current progress - the only thing that is lacking is more
> tests, maybe, so I think it's ready for review.

A bit belated, but here are some overall thoughts. A lot of my thinking
comes from comparing this to previous work I had done on teaching the
client to fetch first from a bundle and then do a follow-up fetch from
the server.

> This code now starts CDN downloads after closing the first HTTP request,
> and it holds on to the .keep files until after the refs are set. I'll
> leave parallelization of the CDN downloads for later work.

Good. One of the problems with the bundle+followup approach was either
having to hold open or re-establish the connection to the original
server, since that fetch had to be put on hold.

I agree that parallelizing can come later. I do wonder if it's worth
making a new tool rather than trying to reuse git-http-fetch. Yes, it
basically does what we want already, but there's quite a lot of cruft in
that dumb-http code base. Using http_get_file(), or even curl more
directly might be a bit easier.

One problem in particular I'd worry about is that the http code
generally expects authentication to happen through the initial
ref-discovery, and then be set for all subsequent requests. So I doubt
an http_pack_request actually handles auth at all, and retro-fitting it
may be tricky.

> One relatively significant change: someone pointed out that the issue fixed by 
> 50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
> occur here, so I have changed it so that the server is required to send
> the packfile's hash along with the URI.

I have mixed feelings on that. I like how it clearly makes the server
the source of authority. You don't have to care about CDN tampering,
because you know you are getting the bytes the server wanted you to.

I think even without that this is still _mostly_ true, though. You're
going to compute the hash of every object the CDN sends you, and those
objects must fit into the missing gaps in what the server sends you. So
the worst case is that a malicious CDN could send you some extra
objects. That's probably not the end of the world, but I do like the
extra guarantee of knowing that you got byte-for-byte what the server
wanted you to.

> This does mean that Ævar's workflow described in [1] would not work.
> Quoting Ævar:
> 
> > More concretely, I'd like to have a setup where a server can just dumbly
> > point to some URL that probably has most of the data, without having any
> > idea what OIDs are in it. So that e.g. some machine entirely
> > disconnected from the server (and with just a regular clone) can
> > continually generate an up-to-date-enough packfile.
> 
> With 50d3413740, it seems to me that it's important for the server to
> know details about the URIs that it points to, so such a disconnection
> would not work.

I think even without 50d3413740, this is important for your scheme. One
of the weaknesses (and strengths, I suppose) of the bundle+followup
scheme was that the initial bundle generally had to meet the git
repository guarantee. After you fetched the bundle, you'd tell the
server "OK, now I have commit X" just like you were a regular client.

But I really like that your proposal does away with that, and asks for
tighter cooperation between the server and the CDN. It means the CDN can
serve some random subset of the objects. But once we do that, now the
server _has_ to know what was in those URLs it pointed to, because our
protocol has no good way of communicating a random subset of objects (if
they're just blobs, we could enumerate all of them to the server as
haves; yuck.  But as soon as you get a tree or commit from the CDN, you
have to have all of the reachable objects to be able to claim it).

So I think this is pretty inherent to your proposal, but it's the
right tradeoff to be making here (I think there could still be room for
the other thing, too, but it's just a different feature).

I have a few other comments on the design, but I'll send them in
response to patch 5.

-Peff
Jonathan Tan April 23, 2019, 7:23 p.m. UTC | #3
> On Fri, Mar 08, 2019 at 01:55:12PM -0800, Jonathan Tan wrote:
> 
> > Here's my current progress - the only thing that is lacking is more
> > tests, maybe, so I think it's ready for review.
> 
> A bit belated, but here are some overall thoughts. A lot of my thinking
> comes from comparing this to previous work I had done on teaching the
> client to fetch first from a bundle and then do a follow-up fetch from
> the server.

Thanks for your thoughts.

> > This code now starts CDN downloads after closing the first HTTP request,
> > and it holds on to the .keep files until after the refs are set. I'll
> > leave parallelization of the CDN downloads for later work.
> 
> Good. One of the problems with the bundle+followup approach was either
> having to hold open or re-establish the connection to the original
> server, since that fetch had to be put on hold.
> 
> I agree that parallelizing can come later. I do wonder if it's worth
> making a new tool rather than trying to reuse git-http-fetch. Yes, it
> basically does what we want already, but there's quite a lot of cruft in
> that dumb-http code base. Using http_get_file(), or even curl more
> directly might be a bit easier.
> 
> One problem in particular I'd worry about is that the http code
> generally expects authentication to happen through the initial
> ref-discovery, and then be set for all subsequent requests. So I doubt
> an http_pack_request actually handles auth at all, and retro-fitting it
> may be tricky.

Thanks - I'll keep that in mind. If auth isn't handled, then we'll
definitely need something new.

> > One relatively significant change: someone pointed out that the issue fixed by 
> > 50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
> > occur here, so I have changed it so that the server is required to send
> > the packfile's hash along with the URI.
> 
> I have mixed feelings on that. I like how it clearly makes the server
> the source of authority. You don't have to care about CDN tampering,
> because you know you are getting the bytes the server wanted you to.

I was thinking not of CDN tampering but of a malicious server. Suppose a
server knew of (1) the hash of a target commit, and (2) the URL of a
packfile containing that target commit. The server can craft a commit
whose parent is that target commit and advertise it, and serves both
that commit as a packfile and the URL as packfile-uri. When the user
subsequently pushes, they will probably inadvertently push everything
including the target commit.
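
(To sketch that with plumbing - everything below is hypothetical, and
$TARGET is only a hash the attacker learned, not an object it has:

    $ tree=$(git mktree </dev/null)      # the empty tree is enough
    $ evil=$(printf '%s\n' "tree $tree" "parent $TARGET" \
          "author E <e@example.com> 0 +0000" \
          "committer E <e@example.com> 0 +0000" "" bait |
          git hash-object -t commit -w --stdin --literally)

The server advertises $evil, serves it in the inline packfile, and sends
the CDN URL as a packfile-uri; the victim's next push of history built
on $evil then uploads $TARGET and everything reachable from it.)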

This is similar to the 50d3413740 case except that now the server needs
an additional datum (2). The packfile's hash is yet another additional
datum. I admit that it may not be that useful, though.

> I think even without that this is still _mostly_ true, though. You're
> going to compute the hash of every object the CDN sends you, and those
> objects must fit into the missing gaps in what the server sends you. So
> the worst case is that a malicious CDN could send you some extra
> objects. That's probably not the end of the world, but I do like the
> extra guarantee of knowing that you got byte-for-byte what the server
> wanted you to.

Yes, the guarantee is nice.

> > This does mean that Ævar's workflow described in [1] would not work.
> > Quoting Ævar:
> > 
> > > More concretely, I'd like to have a setup where a server can just dumbly
> > > point to some URL that probably has most of the data, without having any
> > > idea what OIDs are in it. So that e.g. some machine entirely
> > > disconnected from the server (and with just a regular clone) can
> > > continually generate an up-to-date-enough packfile.
> > 
> > With 50d3413740, it seems to me that it's important for the server to
> > know details about the URIs that it points to, so such a disconnection
> > would not work.
> 
> I think even without 50d3413740, this is important for your scheme. One
> of the weaknesses (and strengths, I suppose) of the bundle+followup
> scheme was that the initial bundle generally had to meet the git
> repository guarantee. After you fetched the bundle, you'd tell the
> server "OK, now I have commit X" just like you were a regular client.
> 
> But I really like that your proposal does away with that, and asks for
> tighter cooperation between the server and the CDN. It means the CDN can
> serve some random subset of the objects. But once we do that, now the
> server _has_ to know what was in those URLs it pointed to, because our
> protocol has no good way of communicating a random subset of objects (if
> they're just blobs, we could enumerate all of them to the server as
> haves; yuck.  But as soon as you get a tree or commit from the CDN, you
> have to have all of the reachable objects to be able to claim it).

Ah...you're right that the server still needs some inkling of what the
CDN's packfile has.

Without the packfile hash, the knowledge needed is slightly less: the
server just has to know a subset of the objects instead of the exact
list of objects, but I can't think of how this can benefit us other than
that we don't require so much consistency (it's OK if the server's
knowledge of the CDN is updated significantly later than the CDN is
updated, if the CDN reuses the same URL for each version of the
packfile).

> So I think this is pretty inherent to your proposal, but it's the
> right tradeoff to be making here (I think there could still be room for
> the other thing, too, but it's just a different feature).

Thanks.
Ævar Arnfjörð Bjarmason April 24, 2019, 9:09 a.m. UTC | #4
On Fri, Mar 08 2019, Jonathan Tan wrote:

> One relatively significant change: someone pointed out that the issue fixed by
> 50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
> occur here, so I have changed it so that the server is required to send
> the packfile's hash along with the URI.
>
> This does mean that Ævar's workflow described in
> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/ would
> not work.  Quoting Ævar:
>
>> More concretely, I'd like to have a setup where a server can just dumbly
>> point to some URL that probably has most of the data, without having any
>> idea what OIDs are in it. So that e.g. some machine entirely
>> disconnected from the server (and with just a regular clone) can
>> continually generate an up-to-date-enough packfile.
>
> With 50d3413740, it seems to me that it's important for the server to
> know details about the URIs that it points to, so such a disconnection
> would not work.

For what it's worth I'm fine with that, and as is obvious in some of the
previous threads I hadn't thought about that threat model. This "I know
XYZ has a pack that should have most of it, check what's in it and get
back to me with HAVEs" was a nice-to-have, but not a must.

And you can still get most of the way there with this proposal and the
techniques described in the follow-up to
https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/ if I'm
reading this series correctly.

I.e. the server would know the CDNs are going to be generating a pack
from the "master" history with the tip being the Nth commit, and since
both agree on the pack algorithm, they can generate the same OID/pack
hash pair without further coordination; the server would then
optimistically advertise that. This would give you the sort of "lazy"
CDN setup I was describing with slightly more work than just "here's
some random latest pack".
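
(A sketch of that, under the heavy assumption that the two machines
produce byte-for-byte identical packs - same git version, same pack
configuration, and an agreed-upon tip $TIP; the "lazy" part is that
nothing else needs to be exchanged:

    # run on both the server and the disconnected pack builder:
    $ echo "$TIP" | git pack-objects --revs some-dir/bulk >hash

Both sides end up with some-dir/bulk-<hash>.pack and the same <hash> on
stdout, which is exactly the pack hash the server needs in order to
advertise the CDN's copy. Whether that reproducibility assumption holds
in practice is the extra coordination being discussed.)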

But isn't it the case that if a malicious server ever got a hold of a
pack SHA-1 it knows to be on a private CDN *and* a commit that's in it
we'd have the same problem? Shouldn't this security model be prominently
documented in Documentation/technical/packfile-uri.txt? I.e. "THOU MUST
NEVER LEAK A COMBINATION OF THE PACK SHA-1 OF A PRIVATE CDN AND A
PRIVATE COMMIT SHA-1!".

It seems likely that once we have git CDN support the first thing CDNs
are going to do is to serve up such packs via a URL that includes the
pack SHA-1. Once the combination of that & a commit in it leaks we're
back to square one, no? *My* default approach in setting up such
infrastructure without knowing about 50d3413740 would be not to care
about the privacy of the SHA-1s, even in the case of private data.

Isn't there also overlap here with non-CDN shallow/narrow fetching?
Where a server can detect such a client, rely on it to get the objects
from elsewhere (e.g. via adding an unrelated remote), and then get a
push that gives it secret data?

I don't see any fool-proof way out of it, but maybe a mode where we:

 1. Get the CDN data URLs (which for the purposes of this example I'm
    assuming are "base packs" containing at least some commits...)

 2. Do a follow-up request to the server where we e.g. pretend not to
    have the last N commits for tips it advertises and ask it to serve
    that up from a non-CDN. If it can't we know it's lying and trying an
    attack similar to the one described in 50d3413740.

    Conversely, if it can, we can't know that it isn't lying, but at that
    point it seems unlikely, since surely the useful thing is a delta
    against the
    recent-but-server-lied-about-having-it commit, at least in most
    cases....

This would also have the advantage of addressing some of the reliability
concerns brought up elsewhere. I.e. once we're doing a sanity-check
post-fetch anyway a failure to download 1/3 packs from a CDN becomes
recoverable, although the design outlined here (understandably) would
prefer that to be done in another "fetch" dialog so the server can keep
the "CDN + my data should be 100% of the data" state.