[1/8] http: use --stdin when getting dumb HTTP pack

Message ID 4d17d560b87746acfd62ff785cc22c09600d4e65.1590789428.git.jonathantanmy@google.com (mailing list archive)
State New, archived
Series CDN offloading update

Commit Message

Jonathan Tan May 29, 2020, 10:30 p.m. UTC
When Git fetches a pack using dumb HTTP, it reuses the server's name for
the packfile (which incorporates a hash), which is different from the
behavior of fetch-pack and receive-pack.

A subsequent patch will allow downloading packs over HTTP(S) as part of
a fetch. These downloads will not necessarily be from a Git repository,
and thus may not have a hash as part of its name.

Thus, teach http to pass --stdin to index-pack, so that we have no
reliance on the server's name for the packfile.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 http.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)

Comments

Junio C Hamano May 29, 2020, 11 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> When Git fetches a pack using dumb HTTP, it reuses the server's name for
> the packfile (which incorporates a hash), which is different from the
> behavior of fetch-pack and receive-pack.

My first two reads of the above mistakenly thought that for some
reason the packfile has the URL to the server encoded in its name,
but that is not what you meant by "the server's name".  You rather
meant "the name the server stores the packfile under", "the name the
server gave the packfile", "it reuses the name the server uses for
the packfile".  The last rephrase may be the easiest to understand.

> A subsequent patch will allow downloading packs over HTTP(S) as part of
> a fetch. These downloads will not necessarily be from a Git repository,
> and thus may not have a hash as part of its name.

A location that is not necessarily a Git repository can still honor
the naming convention, so I find this a bit weak argument.  After
all, the procedure to prepare such a CDN backed file would be using
Git and the (git) "natural" name for the resulting packfile is
easily available to it, isn't it?

I am not necessarily against loosening the limitation of the shape
of the filename, but we may want to say the reason why we want to
name the packfile on the CDN side differently from how Git would
naturally name them.  What benefit would we get out of being able
to do so?  Perhaps it makes arrangements such as "you can fetch
'pack-v1.0.pack' to become reasonably up-to-date if you at least
have the version v1.0 software", "if the last time you fetched from
us was 2020-05-20 or after, you can fetch 'pack-2020-05-20.pack' and
be assured that you aren't missing anything", etc.?  Such a "why
would somebody want to name the packfile differently" would make a
more convincing justification.

> Thus, teach http to pass --stdin to index-pack, so that we have no
> reliance on the server's name for the packfile.

OK.  By definition, if we feed the packdata via --stdin, the
index-pack command would not even _know_ what filename we use,
or the name the other side had.  Makes sense.
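
To illustrate the point about --stdin, here is a minimal POSIX sketch of handing a file to a child process as its standard input. It is not the code from http.c (which uses Git's internal run_command API); the helper name run_with_file_as_stdin is hypothetical and exists only for this illustration. The child receives an open descriptor and never sees the file's name.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with its standard input connected
 * to the file at `path`. Returns the child's exit code, or -1 if the
 * file cannot be opened, the fork fails, or the child dies abnormally.
 */
int run_with_file_as_stdin(const char *path, char *const argv[])
{
	int fd = open(path, O_RDONLY);
	pid_t pid;
	int status;

	if (fd < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0) {
		/* Child: the file becomes fd 0; its name is never passed along. */
		if (dup2(fd, STDIN_FILENO) < 0)
			_exit(127);
		if (fd != STDIN_FILENO)
			close(fd);
		execvp(argv[0], argv);
		_exit(127); /* only reached if exec failed */
	}

	/* Parent: its copy of the descriptor is no longer needed. */
	close(fd);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Under these assumptions, a caller inside a repository could pass an argv of {"git", "index-pack", "--stdin", NULL} together with the path of the downloaded temporary file; index-pack would then choose the final pack name itself.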

Jonathan Tan June 1, 2020, 8:37 p.m. UTC | #2
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > When Git fetches a pack using dumb HTTP, it reuses the server's name for
> > the packfile (which incorporates a hash), which is different from the
> > behavior of fetch-pack and receive-pack.
> 
> My first two reads of the above mistakenly thought that for some
> reason the packfile has the URL to the server encoded in its name,
> but that is not what you meant by "the server's name".  You rather
> meant "the name the server stores the packfile under", "the name the
> server gave the packfile", "it reuses the name the server uses for
> the packfile".  The last rephrase may be the easiest to understand.

OK - I'll use that.

> > A subsequent patch will allow downloading packs over HTTP(S) as part of
> > a fetch. These downloads will not necessarily be from a Git repository,
> > and thus may not have a hash as part of its name.
> 
> A location that is not necessarily a Git repository can still honor
> the naming convention, so I find this a bit weak argument.  After
> all, the procedure to prepare such a CDN backed file would be using
> Git and the (git) "natural" name for the resulting packfile is
> easily available to it, isn't it?
> 
> I am not necessarily against loosening the limitation of the shape
> of the filename, but we may want to say the reason why we want to
> name the packfile on the CDN side differently from how Git would
> naturally name them.  What benefit would we get out of being able
> to do so?  Perhaps it makes arrangements such as "you can fetch
> 'pack-v1.0.pack' to become reasonably up-to-date if you at least
> have the version v1.0 software", "if the last time you fetched from
> us was 2020-05-20 or after, you can fetch 'pack-2020-05-20.pack' and
> be assured that you aren't missing anything", etc.?  Such a "why
> would somebody want to name the packfile differently" would make a
> more convincing justification.

I didn't want to unnecessarily exclude features like signed URLs which
may change the shape of the URL - for example, in Google Cloud Storage,
the signed part is a suffix [1]. I'll include this in the commit
message.

Having said that, after rereading my patch:

 (1) I'm not sure anymore if the restriction is that there must be a
     hash in the filename. It might be just that the filename
     must end in ".pack.temp". (Having said that, if the filename was
     not named "<hash>.pack.temp", it would look different to the rest
     of the contents of "objects/pack/", which may or may not be fine.)

 (2) The filename restriction in question is on the local filename, not
     the URL. We could do any manipulation we want on the URL (e.g. by
     appending ".pack.temp").
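
As a rough illustration of point (1), the pre-patch code only requires that the temporary name end in ".pack.temp" and derives the ".idx.temp" name from it; nothing in that step inspects a hash. The standalone sketch below mimics that derivation without Git's strip_suffix() and xstrfmt() helpers. The function name idx_temp_name is hypothetical and not part of Git.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if `name` ends with ".pack.temp", return a newly
 * allocated copy with that suffix replaced by ".idx.temp". Return NULL
 * if the suffix is missing or allocation fails. The caller frees the
 * result.
 */
char *idx_temp_name(const char *name)
{
	static const char pack_suffix[] = ".pack.temp";
	static const char idx_suffix[] = ".idx.temp";
	size_t name_len = strlen(name);
	size_t suffix_len = sizeof(pack_suffix) - 1;
	size_t stem_len;
	char *out;

	/* Only the suffix is checked; the stem may be any string. */
	if (name_len < suffix_len ||
	    strcmp(name + name_len - suffix_len, pack_suffix) != 0)
		return NULL;

	stem_len = name_len - suffix_len;
	out = malloc(stem_len + sizeof(idx_suffix));
	if (!out)
		return NULL;
	memcpy(out, name, stem_len);
	/* sizeof(idx_suffix) includes the terminating NUL. */
	memcpy(out + stem_len, idx_suffix, sizeof(idx_suffix));
	return out;
}
```

With this sketch, both "pack-<hash>.pack.temp" and an arbitrary "download.pack.temp" are accepted, which matches the observation that the restriction is on the suffix rather than on the presence of a hash.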

And one idea that came up at $DAYJOB is that if we're using a suffix of
the URL as the filename, there may be a clash of names anyway, so we
might as well use the hash instead (which is reported by the server).
I'll take a further look - maybe this patch will no longer be needed.

[1] https://cloud.google.com/storage/docs/access-control/signed-urls

Patch

diff --git a/http.c b/http.c
index 4882c9f5b2..130e9d6259 100644
--- a/http.c
+++ b/http.c
@@ -2276,9 +2276,9 @@  int finish_http_pack_request(struct http_pack_request *preq)
 {
 	struct packed_git **lst;
 	struct packed_git *p = preq->target;
-	char *tmp_idx;
-	size_t len;
 	struct child_process ip = CHILD_PROCESS_INIT;
+	int tmpfile_fd;
+	int ret = 0;
 
 	close_pack_index(p);
 
@@ -2290,35 +2290,24 @@  int finish_http_pack_request(struct http_pack_request *preq)
 		lst = &((*lst)->next);
 	*lst = (*lst)->next;
 
-	if (!strip_suffix(preq->tmpfile.buf, ".pack.temp", &len))
-		BUG("pack tmpfile does not end in .pack.temp?");
-	tmp_idx = xstrfmt("%.*s.idx.temp", (int)len, preq->tmpfile.buf);
+	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
 	argv_array_push(&ip.args, "index-pack");
-	argv_array_pushl(&ip.args, "-o", tmp_idx, NULL);
-	argv_array_push(&ip.args, preq->tmpfile.buf);
+	argv_array_push(&ip.args, "--stdin");
 	ip.git_cmd = 1;
-	ip.no_stdin = 1;
+	ip.in = tmpfile_fd;
 	ip.no_stdout = 1;
 
 	if (run_command(&ip)) {
-		unlink(preq->tmpfile.buf);
-		unlink(tmp_idx);
-		free(tmp_idx);
-		return -1;
-	}
-
-	unlink(sha1_pack_index_name(p->hash));
-
-	if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->hash))
-	 || finalize_object_file(tmp_idx, sha1_pack_index_name(p->hash))) {
-		free(tmp_idx);
-		return -1;
+		ret = -1;
+		goto cleanup;
 	}
 
 	install_packed_git(the_repository, p);
-	free(tmp_idx);
-	return 0;
+cleanup:
+	close(tmpfile_fd);
+	unlink(preq->tmpfile.buf);
+	return ret;
 }
 
 struct http_pack_request *new_http_pack_request(