Message ID | 20190221001447.124088-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] http: use --stdin and --keep when downloading pack | expand |
On Wed, Feb 20, 2019 at 04:14:47PM -0800, Jonathan Tan wrote: > This is part of the work of CDN offloading of fetch responses. > > I have plans to use the http_pack_request suite of functions to > implement the part where we download from CDN over HTTP(S), but need > this change to be able to do so. I think it's better from the code > quality perspective to reuse these functions, but this necessitates a > behavior change in that we no longer use the filename as declared by the > server, so I'm sending this as RFC to see what the community thinks. I think it makes sense. We don't use the server names for any of the other protocols, and I'm happy to see one less place where we may inherit a stupid or malicious item of data from the server. > diff --git a/http-push.c b/http-push.c > index b22c7caea0..409b266b0c 100644 > --- a/http-push.c > +++ b/http-push.c > @@ -586,11 +586,16 @@ static void finish_request(struct transfer_request *request) > fprintf(stderr, "Unable to get pack file %s\n%s", > request->url, curl_errorstr); > } else { > + char *lockfile; > + > preq = (struct http_pack_request *)request->userData; > > if (preq) { > - if (finish_http_pack_request(preq) == 0) > + if (finish_http_pack_request(preq, > + &lockfile) == 0) { > + unlink(lockfile); > fail = 0; > + } > release_http_pack_request(preq); > } > } I was puzzled that you had to touch http-push.c. But indeed, it seems to have some fetching code in it, too? I'm willing to throw up my hands in disgust at the http-push code without looking further at this point. :) > argv_array_push(&ip.args, "index-pack"); > - argv_array_pushl(&ip.args, "-o", tmp_idx, NULL); > - argv_array_push(&ip.args, preq->tmpfile.buf); > + argv_array_push(&ip.args, "--stdin"); > + argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX, (uintmax_t)getpid()); > ip.git_cmd = 1; > - ip.no_stdin = 1; > - ip.no_stdout = 1; > + ip.in = tmpfile_fd; > + ip.out = -1; > > - if (run_command(&ip)) { > - unlink(preq->tmpfile.buf); > - unlink(tmp_idx); > - free(tmp_idx); > - return -1; > + if (start_command(&ip)) { > + ret = -1; > + goto cleanup; > } > > - unlink(sha1_pack_index_name(p->sha1)); > + *lockfile = index_pack_lockfile(ip.out); > + close(ip.out); We're now doing bi-directional I/O with index-pack. But it should be deadlock-free, because we know the output is small and will only come at the end after we've closed its stdin. > - if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1)) > - || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) { > - free(tmp_idx); > - return -1; > + if (finish_command(&ip)) { > + ret = -1; > + goto cleanup; > } If the command fails but we got something in *lockfile, should we clean it up? Likewise, do we need to be installing a signal handler to clean it up in case we die in other code paths (or by a signal)? -Peff
> On Wed, Feb 20, 2019 at 04:14:47PM -0800, Jonathan Tan wrote: > > > This is part of the work of CDN offloading of fetch responses. > > > > I have plans to use the http_pack_request suite of functions to > > implement the part where we download from CDN over HTTP(S), but need > > this change to be able to do so. I think it's better from the code > > quality perspective to reuse these functions, but this necessitates a > > behavior change in that we no longer use the filename as declared by the > > server, so I'm sending this as RFC to see what the community thinks. > > I think it makes sense. We don't use the server names for any of the > other protocols, and I'm happy to see one less place where we may > inherit a stupid or malicious item of data from the server. Thanks. That's true. > I was puzzled that you had to touch http-push.c. But indeed, it seems to > have some fetching code in it, too? I'm willing to throw up my hands in > disgust at the http-push code without looking further at this point. :) :) > > argv_array_push(&ip.args, "index-pack"); > > - argv_array_pushl(&ip.args, "-o", tmp_idx, NULL); > > - argv_array_push(&ip.args, preq->tmpfile.buf); > > + argv_array_push(&ip.args, "--stdin"); > > + argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX, (uintmax_t)getpid()); > > ip.git_cmd = 1; > > - ip.no_stdin = 1; > > - ip.no_stdout = 1; > > + ip.in = tmpfile_fd; > > + ip.out = -1; > > > > - if (run_command(&ip)) { > > - unlink(preq->tmpfile.buf); > > - unlink(tmp_idx); > > - free(tmp_idx); > > - return -1; > > + if (start_command(&ip)) { > > + ret = -1; > > + goto cleanup; > > } > > > > - unlink(sha1_pack_index_name(p->sha1)); > > + *lockfile = index_pack_lockfile(ip.out); > > + close(ip.out); > > We're now doing bi-directional I/O with index-pack. But it should be > deadlock-free, because we know the output is small and will only come at > the end after we've closed its stdin. Yes. This is also how fetch-pack.c does it, for example. > > - if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1)) > > - || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) { > > - free(tmp_idx); > > - return -1; > > + if (finish_command(&ip)) { > > + ret = -1; > > + goto cleanup; > > } > > If the command fails but we got something in *lockfile, should we clean > it up? Likewise, do we need to be installing a signal handler to clean > it up in case we die in other code paths (or by a signal)? My inclination is not to do it - as far as I can tell, we don't have cleanup or signal handlers in fetch-pack.c either.
Jonathan Tan <jonathantanmy@google.com> writes: > When Git fetches a pack using dumb HTTP, it does at least 2 things > differently from when it fetches using fetch-pack or receive-pack: (1) > it reuses the server's name for the packfile (which incorporates a hash) > for the packfile, and (2) it does not create a .keep file to avoid race > conditions with repack. > > A subsequent patch will allow downloading packs over HTTP(S) as part of > a fetch. These downloads will not necessarily be from a Git repository, > and thus may not have a hash as part of its name. Also, generating a > .keep file will be necessary to avoid race conditions with repack (until > the fetch has successfully written the new refs). > > Thus, teach http to pass --stdin and --keep to index-pack, the former so > that we have no reliance on the server's name for the packfile, and the > latter so that we have the .keep file. Makes sense. Can a malicious dumb server mount a passive attack that serves misnamed packfiles and wait for victims to come? > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > This is part of the work of CDN offloading of fetch responses. > > I have plans to use the http_pack_request suite of functions to > implement the part where we download from CDN over HTTP(S), but need > this change to be able to do so. I think it's better from the code > quality perspective to reuse these functions, but this necessitates a > behavior change in that we no longer use the filename as declared by the > server, so I'm sending this as RFC to see what the community thinks. > --- > http-push.c | 7 ++++++- > http-walker.c | 5 ++++- > http.c | 42 ++++++++++++++++++++---------------------- > http.h | 2 +- > 4 files changed, 31 insertions(+), 25 deletions(-) > > diff --git a/http-push.c b/http-push.c > index b22c7caea0..409b266b0c 100644 > --- a/http-push.c > +++ b/http-push.c > @@ -586,11 +586,16 @@ static void finish_request(struct transfer_request *request) > fprintf(stderr, "Unable to get pack file %s\n%s", > request->url, curl_errorstr); > } else { > + char *lockfile; > + > preq = (struct http_pack_request *)request->userData; > > if (preq) { > - if (finish_http_pack_request(preq) == 0) > + if (finish_http_pack_request(preq, > + &lockfile) == 0) { > + unlink(lockfile); > fail = 0; > + } > release_http_pack_request(preq); > } > } > diff --git a/http-walker.c b/http-walker.c > index 8ae5d76c6a..804dc82304 100644 > --- a/http-walker.c > +++ b/http-walker.c > @@ -425,6 +425,7 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne > int ret; > struct slot_results results; > struct http_pack_request *preq; > + char *lockfile; > > if (fetch_indices(walker, repo)) > return -1; > @@ -457,7 +458,9 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne > goto abort; > } > > - ret = finish_http_pack_request(preq); > + ret = finish_http_pack_request(preq, &lockfile); > + if (!ret) > + unlink(lockfile); > release_http_pack_request(preq); > if (ret) > return ret; > diff --git a/http.c b/http.c > index a32ad36ddf..5f8e602cd2 100644 > --- a/http.c > +++ b/http.c > @@ -2200,13 +2200,13 @@ void release_http_pack_request(struct http_pack_request *preq) > free(preq); > } > > -int finish_http_pack_request(struct http_pack_request *preq) > +int finish_http_pack_request(struct http_pack_request *preq, char **lockfile) > { > struct packed_git **lst; > struct packed_git *p = preq->target; > - char *tmp_idx; > - size_t len; > struct child_process ip = CHILD_PROCESS_INIT; > + int tmpfile_fd; > + int ret = 0; > > close_pack_index(p); > > @@ -2218,35 +2218,33 @@ int finish_http_pack_request(struct http_pack_request *preq) > lst = &((*lst)->next); > *lst = (*lst)->next; > > - if (!strip_suffix(preq->tmpfile.buf, ".pack.temp", &len)) > - BUG("pack tmpfile does not end in .pack.temp?"); > - tmp_idx = xstrfmt("%.*s.idx.temp", (int)len, preq->tmpfile.buf); > + tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY); > > argv_array_push(&ip.args, "index-pack"); > - argv_array_pushl(&ip.args, "-o", tmp_idx, NULL); > - argv_array_push(&ip.args, preq->tmpfile.buf); > + argv_array_push(&ip.args, "--stdin"); > + argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX, (uintmax_t)getpid()); > ip.git_cmd = 1; > - ip.no_stdin = 1; > - ip.no_stdout = 1; > + ip.in = tmpfile_fd; > + ip.out = -1; > > - if (run_command(&ip)) { > - unlink(preq->tmpfile.buf); > - unlink(tmp_idx); > - free(tmp_idx); > - return -1; > + if (start_command(&ip)) { > + ret = -1; > + goto cleanup; > } > > - unlink(sha1_pack_index_name(p->sha1)); > + *lockfile = index_pack_lockfile(ip.out); > + close(ip.out); > > - if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1)) > - || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) { > - free(tmp_idx); > - return -1; > + if (finish_command(&ip)) { > + ret = -1; > + goto cleanup; > } > > install_packed_git(the_repository, p); > - free(tmp_idx); > - return 0; > +cleanup: > + close(tmpfile_fd); > + unlink(preq->tmpfile.buf); > + return ret; > } Nicely done. > > struct http_pack_request *new_http_pack_request( > diff --git a/http.h b/http.h > index 4eb4e808e5..20d1c85d0b 100644 > --- a/http.h > +++ b/http.h > @@ -212,7 +212,7 @@ struct http_pack_request { > > extern struct http_pack_request *new_http_pack_request( > struct packed_git *target, const char *base_url); > -extern int finish_http_pack_request(struct http_pack_request *preq); > +int finish_http_pack_request(struct http_pack_request *preq, char **lockfile); > extern void release_http_pack_request(struct http_pack_request *preq); > > /* Helpers for fetching object */
On Thu, Feb 21, 2019 at 10:49:48AM -0800, Jonathan Tan wrote: > > > - if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1)) > > > - || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) { > > > - free(tmp_idx); > > > - return -1; > > > + if (finish_command(&ip)) { > > > + ret = -1; > > > + goto cleanup; > > > } > > > > If the command fails but we got something in *lockfile, should we clean > > it up? Likewise, do we need to be installing a signal handler to clean > > it up in case we die in other code paths (or by a signal)? > > My inclination is not to do it - as far as I can tell, we don't have > cleanup or signal handlers in fetch-pack.c either. Hmm, yeah, you're right. Nor do we do it on fetch. At first I was surprised that this doesn't cause more problems (we'd leave it in place after seeing a pre-receive hook fail, for instance), but in practice it gets cleaned up along with the rest of the quarantine area. There's still a window where we may leave it around, though (and we don't quarantine fetches at all), so it might be worth addressing at some point. But I agree it is totally out of scope for this patch. -Peff
diff --git a/http-push.c b/http-push.c index b22c7caea0..409b266b0c 100644 --- a/http-push.c +++ b/http-push.c @@ -586,11 +586,16 @@ static void finish_request(struct transfer_request *request) fprintf(stderr, "Unable to get pack file %s\n%s", request->url, curl_errorstr); } else { + char *lockfile; + preq = (struct http_pack_request *)request->userData; if (preq) { - if (finish_http_pack_request(preq) == 0) + if (finish_http_pack_request(preq, + &lockfile) == 0) { + unlink(lockfile); fail = 0; + } release_http_pack_request(preq); } } diff --git a/http-walker.c b/http-walker.c index 8ae5d76c6a..804dc82304 100644 --- a/http-walker.c +++ b/http-walker.c @@ -425,6 +425,7 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne int ret; struct slot_results results; struct http_pack_request *preq; + char *lockfile; if (fetch_indices(walker, repo)) return -1; @@ -457,7 +458,9 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne goto abort; } - ret = finish_http_pack_request(preq); + ret = finish_http_pack_request(preq, &lockfile); + if (!ret) + unlink(lockfile); release_http_pack_request(preq); if (ret) return ret; diff --git a/http.c b/http.c index a32ad36ddf..5f8e602cd2 100644 --- a/http.c +++ b/http.c @@ -2200,13 +2200,13 @@ void release_http_pack_request(struct http_pack_request *preq) free(preq); } -int finish_http_pack_request(struct http_pack_request *preq) +int finish_http_pack_request(struct http_pack_request *preq, char **lockfile) { struct packed_git **lst; struct packed_git *p = preq->target; - char *tmp_idx; - size_t len; struct child_process ip = CHILD_PROCESS_INIT; + int tmpfile_fd; + int ret = 0; close_pack_index(p); @@ -2218,35 +2218,33 @@ int finish_http_pack_request(struct http_pack_request *preq) lst = &((*lst)->next); *lst = (*lst)->next; - if (!strip_suffix(preq->tmpfile.buf, ".pack.temp", &len)) - BUG("pack tmpfile does not end in .pack.temp?"); - tmp_idx = xstrfmt("%.*s.idx.temp", (int)len, preq->tmpfile.buf); + tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY); argv_array_push(&ip.args, "index-pack"); - argv_array_pushl(&ip.args, "-o", tmp_idx, NULL); - argv_array_push(&ip.args, preq->tmpfile.buf); + argv_array_push(&ip.args, "--stdin"); + argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX, (uintmax_t)getpid()); ip.git_cmd = 1; - ip.no_stdin = 1; - ip.no_stdout = 1; + ip.in = tmpfile_fd; + ip.out = -1; - if (run_command(&ip)) { - unlink(preq->tmpfile.buf); - unlink(tmp_idx); - free(tmp_idx); - return -1; + if (start_command(&ip)) { + ret = -1; + goto cleanup; } - unlink(sha1_pack_index_name(p->sha1)); + *lockfile = index_pack_lockfile(ip.out); + close(ip.out); - if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1)) - || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) { - free(tmp_idx); - return -1; + if (finish_command(&ip)) { + ret = -1; + goto cleanup; } install_packed_git(the_repository, p); - free(tmp_idx); - return 0; +cleanup: + close(tmpfile_fd); + unlink(preq->tmpfile.buf); + return ret; } struct http_pack_request *new_http_pack_request( diff --git a/http.h b/http.h index 4eb4e808e5..20d1c85d0b 100644 --- a/http.h +++ b/http.h @@ -212,7 +212,7 @@ struct http_pack_request { extern struct http_pack_request *new_http_pack_request( struct packed_git *target, const char *base_url); -extern int finish_http_pack_request(struct http_pack_request *preq); +int finish_http_pack_request(struct http_pack_request *preq, char **lockfile); extern void release_http_pack_request(struct http_pack_request *preq); /* Helpers for fetching object */
When Git fetches a pack using dumb HTTP, it does at least 2 things differently from when it fetches using fetch-pack or receive-pack: (1) it reuses the server's name for the packfile (which incorporates a hash) for the packfile, and (2) it does not create a .keep file to avoid race conditions with repack. A subsequent patch will allow downloading packs over HTTP(S) as part of a fetch. These downloads will not necessarily be from a Git repository, and thus may not have a hash as part of its name. Also, generating a .keep file will be necessary to avoid race conditions with repack (until the fetch has successfully written the new refs). Thus, teach http to pass --stdin and --keep to index-pack, the former so that we have no reliance on the server's name for the packfile, and the latter so that we have the .keep file. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- This is part of the work of CDN offloading of fetch responses. I have plans to use the http_pack_request suite of functions to implement the part where we download from CDN over HTTP(S), but need this change to be able to do so. I think it's better from the code quality perspective to reuse these functions, but this necessitates a behavior change in that we no longer use the filename as declared by the server, so I'm sending this as RFC to see what the community thinks. --- http-push.c | 7 ++++++- http-walker.c | 5 ++++- http.c | 42 ++++++++++++++++++++---------------------- http.h | 2 +- 4 files changed, 31 insertions(+), 25 deletions(-)