Message ID | 6b3a628719e0593893e537de0220a5e0d5460232.1590789428.git.jonathantanmy@google.com
State      | New, archived
Series     | CDN offloading update
Jonathan Tan <jonathantanmy@google.com> writes:

> +'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin | --packfile | <commit>] <url>
> ...
> +--packfile::
> +	Instead of a commit id on the command line (which is not expected in
> +	this case), 'git http-fetch' fetches the packfile directly at the given
> +	URL and uses index-pack to generate corresponding .idx and .keep files.
> +	The output of index-pack is printed to stdout.

This makes sense as an external interface, I guess.  How should this
interact with the --stdin option, which is more of a batch-mode
operation ("instead of getting a single <dest filename, object name>
pair from the command line, handle many pairs read from the standard
input")?  Would it be beneficial to allow an unbounded number of
packfiles, not just a single one, to be fetched and indexed by a
single invocation of the command?

I suspect that, given the relatively large size of a single request for
fetching a packfile, one invocation of the command per packfile won't
be too heavy an overhead, so the lack of such orthogonality may only
hurt conceptual cleanliness, not performance.  OK.

> -	if (argc != arg + 2 - commits_on_stdin)
> +	if (argc != arg + 2 - (commits_on_stdin || packfile))
>  		usage(http_fetch_usage);
>  	if (commits_on_stdin) {
>  		commits = walker_targets_stdin(&commit_id, &write_ref);
> +	} else if (packfile) {
> +		/* URL will be set later */

Prefer to see an empty statement spelled more explicitly, like this:

		; /* URL will be set later */

Otherwise the reader would be left wondering if a line was (or lines
were) accidentally lost after this comment.

>  	} else {
>  		commit_id = (char **) &argv[arg++];
>  		commits = 1;
>  	}
>
> +	if (packfile) {
> +		url = xstrdup(argv[arg]);
> +	} else {
> +		if (argv[arg])
> +			str_end_url_with_slash(argv[arg], &url);
> +	}
>
>  	setup_git_directory();
>
>  	git_config(git_default_config, NULL);
>
>  	http_init(NULL, url, 0);
> +	if (packfile) {
> +		struct http_pack_request *preq;
> +		struct slot_results results;
> +		int ret;
> +
> +		preq = new_http_pack_request(NULL, url);
> +		if (preq == NULL)
> +			die("couldn't create http pack request");
> +		preq->slot->results = &results;
> +		preq->generate_keep = 1;
> +
> +		if (start_active_slot(preq->slot)) {
> +			run_active_slot(preq->slot);
> +			if (results.curl_result != CURLE_OK) {
> +				die("Unable to get pack file %s\n%s", preq->url,
> +				    curl_errorstr);
> +			}
> +		} else {
> +			die("Unable to start request");
> +		}
> +
> +		if ((ret = finish_http_pack_request(preq)))
> +			die("finish_http_pack_request gave result %d", ret);
> +		release_http_pack_request(preq);
> +		rc = 0;

The above probably wants to be a single helper function.

The other side of the if/else may also become another helper function.

That way, the flow of control would become clearer.  After all, these
two branches do not share all that much: only http_init() and
http_cleanup(), and nothing else.

For that matter, even before introducing this new mode of operation, a
preparatory patch that moves the original logic of this function into a
helper function, to be called from the "else" side, may make it easier
to see what is going on.
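Concretely, the split being suggested might look something like the
sketch below.  The helper name fetch_single_packfile() is invented here
for illustration; the body is simply the packfile branch from the
quoted hunk, so cmd_main() would shrink to choosing between this helper
and a corresponding walker helper:

	/* Fetch the pack at the given URL and feed it to index-pack. */
	static int fetch_single_packfile(const char *url)
	{
		struct http_pack_request *preq;
		struct slot_results results;
		int ret;

		preq = new_http_pack_request(NULL, url);
		if (!preq)
			die("couldn't create http pack request");
		preq->slot->results = &results;
		preq->generate_keep = 1;

		if (!start_active_slot(preq->slot))
			die("Unable to start request");

		run_active_slot(preq->slot);
		if (results.curl_result != CURLE_OK)
			die("Unable to get pack file %s\n%s", preq->url,
			    curl_errorstr);

		ret = finish_http_pack_request(preq);
		if (ret)
			die("finish_http_pack_request gave result %d", ret);
		release_http_pack_request(preq);
		return 0;
	}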
> diff --git a/http.c b/http.c
> index 130e9d6259..ac66215ee6 100644
> --- a/http.c
> +++ b/http.c
> @@ -2280,15 +2280,18 @@ int finish_http_pack_request(struct http_pack_request *preq)
>  	int tmpfile_fd;
>  	int ret = 0;
>
> -	close_pack_index(p);
> +	if (p)
> +		close_pack_index(p);
>
>  	fclose(preq->packfile);
>  	preq->packfile = NULL;
>
> -	lst = preq->lst;
> -	while (*lst != p)
> -		lst = &((*lst)->next);
> -	*lst = (*lst)->next;
> +	if (p) {
> +		lst = preq->lst;
> +		while (*lst != p)
> +			lst = &((*lst)->next);
> +		*lst = (*lst)->next;
> +	}

This is quite ugly.  What is the original meaning of the target field
of the pack_request structure again?  A packed_git structure that will
be filled in once we are done fetching the packfile from the other side
and have installed it in our repository?  When we are (ab)using the
http_fetch code to fetch a single packfile, we do not install the
packfile into the running process, because we are only (re)using the
existing machinery as a poor man's "curl | git index-pack --stdin"?

I do not think it is a bad idea to roll our own "curl | git index-pack
--stdin", but I do find this an ugly way to do so.  Perhaps a set of
lower-level helper functions can be isolated out of the existing code
before this new feature is added, and then the new "just fetch and pipe
it to index-pack" feature could be written using these helpers but with
a separate set of entry points?  Would that be a good way to make the
resulting code cleaner than this patch does?  I dunno.

> diff --git a/http.h b/http.h
> index a5b082f3ae..709dfa4c19 100644
> --- a/http.h
> +++ b/http.h
> @@ -223,12 +223,21 @@ struct http_pack_request {
>  	struct active_request_slot *slot;
>
>  	/*
> -	 * After calling new_http_pack_request(), point lst to the head of the
> +	 * After calling new_http_pack_request(), if fetching a pack that
> +	 * http_get_info_packs() told us about, point lst to the head of the
>  	 * pack list that target is in. finish_http_pack_request() will remove
>  	 * target from lst and call install_packed_git() on target.
>  	 */
>  	struct packed_git **lst;
>
> +	/*
> +	 * If this is true, finish_http_pack_request() will pass "--keep" to
> +	 * index-pack, resulting in the creation of a keep file, and will not
> +	 * suppress its stdout (that is, the "keep\t<hash>\n" line will be
> +	 * printed to stdout).
> +	 */
> +	unsigned generate_keep : 1;
> +

I suspect that this is a sign that this single patch is trying to do
too many things at the same time.

 - Whether we are fetching a single packfile from a URL, or walking to
   fetch all the packfiles in the repository at a given URL

 - Whether packfiles taken from outer space are marked with the "keep"
   bit

 - Whether the obtained packfile(s) are internally "installed" into the
   running process

are conceptually independent choices, but they somehow end up mixed
together here, it seems.

Thanks.
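One possible shape for the "separate set of entry points" idea, sketched
at the header level only; the name new_direct_http_pack_request() is
hypothetical and is not something this patch introduces:

	/*
	 * Hypothetical: fetch the pack at packfile_url verbatim and hand it
	 * to "git index-pack --stdin".  Nothing is installed into the
	 * running process, so no packed_git or pack list is involved.
	 */
	struct http_pack_request *new_direct_http_pack_request(const char *packfile_url);

	/*
	 * Existing walker-oriented entry point: target must be an element of
	 * a list obtained from http_get_info_packs(), and
	 * finish_http_pack_request() installs the result via
	 * install_packed_git().
	 */
	struct http_pack_request *new_http_pack_request(struct packed_git *target,
							const char *base_url);

With two entry points like these, finish_http_pack_request() would not
need the "if (p)" special cases that the quoted hunk adds.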
> > -	if (argc != arg + 2 - commits_on_stdin)
> > +	if (argc != arg + 2 - (commits_on_stdin || packfile))
> >  		usage(http_fetch_usage);
> >  	if (commits_on_stdin) {
> >  		commits = walker_targets_stdin(&commit_id, &write_ref);
> > +	} else if (packfile) {
> > +		/* URL will be set later */
>
> Prefer to see an empty statement spelled more explicitly, like this:
>
> 		; /* URL will be set later */
>
> Otherwise the reader would be left wondering if a line was (or lines
> were) accidentally lost after this comment.

OK.

> The above probably wants to be a single helper function.
>
> The other side of the if/else may also become another helper function.
>
> That way, the flow of control would become clearer.  After all, these
> two branches do not share all that much: only http_init() and
> http_cleanup(), and nothing else.
>
> For that matter, even before introducing this new mode of operation, a
> preparatory patch that moves the original logic of this function into a
> helper function, to be called from the "else" side, may make it easier
> to see what is going on.

OK.

> > diff --git a/http.c b/http.c
> > index 130e9d6259..ac66215ee6 100644
> > --- a/http.c
> > +++ b/http.c
> > @@ -2280,15 +2280,18 @@ int finish_http_pack_request(struct http_pack_request *preq)
> >  	int tmpfile_fd;
> >  	int ret = 0;
> >
> > -	close_pack_index(p);
> > +	if (p)
> > +		close_pack_index(p);
> >
> >  	fclose(preq->packfile);
> >  	preq->packfile = NULL;
> >
> > -	lst = preq->lst;
> > -	while (*lst != p)
> > -		lst = &((*lst)->next);
> > -	*lst = (*lst)->next;
> > +	if (p) {
> > +		lst = preq->lst;
> > +		while (*lst != p)
> > +			lst = &((*lst)->next);
> > +		*lst = (*lst)->next;
> > +	}
>
> This is quite ugly.  What is the original meaning of the target field
> of the pack_request structure again?  A packed_git structure that will
> be filled in once we are done fetching the packfile from the other side
> and have installed it in our repository?  When we are (ab)using the
> http_fetch code to fetch a single packfile, we do not install the
> packfile into the running process, because we are only (re)using the
> existing machinery as a poor man's "curl | git index-pack --stdin"?
>
> I do not think it is a bad idea to roll our own "curl | git index-pack
> --stdin", but I do find this an ugly way to do so.  Perhaps a set of
> lower-level helper functions can be isolated out of the existing code
> before this new feature is added, and then the new "just fetch and pipe
> it to index-pack" feature could be written using these helpers but with
> a separate set of entry points?  Would that be a good way to make the
> resulting code cleaner than this patch does?  I dunno.

OK - I'll extract the functions as you suggested earlier in the email,
and then it might be more obvious whether it can be done more cleanly.

> > diff --git a/http.h b/http.h
> > index a5b082f3ae..709dfa4c19 100644
> > --- a/http.h
> > +++ b/http.h
> > @@ -223,12 +223,21 @@ struct http_pack_request {
> >  	struct active_request_slot *slot;
> >
> >  	/*
> > -	 * After calling new_http_pack_request(), point lst to the head of the
> > +	 * After calling new_http_pack_request(), if fetching a pack that
> > +	 * http_get_info_packs() told us about, point lst to the head of the
> >  	 * pack list that target is in. finish_http_pack_request() will remove
> >  	 * target from lst and call install_packed_git() on target.
> >  	 */
> >  	struct packed_git **lst;
> >
> > +	/*
> > +	 * If this is true, finish_http_pack_request() will pass "--keep" to
> > +	 * index-pack, resulting in the creation of a keep file, and will not
> > +	 * suppress its stdout (that is, the "keep\t<hash>\n" line will be
> > +	 * printed to stdout).
> > +	 */
> > +	unsigned generate_keep : 1;
> > +
>
> I suspect that this is a sign that this single patch is trying to do
> too many things at the same time.
>
>  - Whether we are fetching a single packfile from a URL, or walking to
>    fetch all the packfiles in the repository at a given URL
>
>  - Whether packfiles taken from outer space are marked with the "keep"
>    bit
>
>  - Whether the obtained packfile(s) are internally "installed" into the
>    running process
>
> are conceptually independent choices, but they somehow end up mixed
> together here, it seems.

You might be right... I needed another mode that makes the opposite
choice on each of these 3 points, so these points became options that
could be toggled independently.  I'll see if there is a better way to
do this.
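To make the three choices explicit, one could imagine giving each its
own bit rather than inferring them from whether target is NULL.  Purely
as an illustration (the struct name and all fields except generate_keep
are invented here and are not part of the patch):

	struct pack_fetch_options {
		unsigned direct_url : 1;     /* fetch one pack at a literal URL      */
		unsigned generate_keep : 1;  /* have index-pack write a .keep file   */
		unsigned install_pack : 1;   /* install_packed_git() when finished   */
	};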
diff --git a/Documentation/git-http-fetch.txt b/Documentation/git-http-fetch.txt
index 666b042679..8357359a9b 100644
--- a/Documentation/git-http-fetch.txt
+++ b/Documentation/git-http-fetch.txt
@@ -9,7 +9,7 @@ git-http-fetch - Download from a remote Git repository via HTTP
 SYNOPSIS
 --------
 [verse]
-'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin] <commit> <url>
+'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin | --packfile | <commit>] <url>
 
 DESCRIPTION
 -----------
@@ -40,6 +40,12 @@ commit-id::
 
 		<commit-id>['\t'<filename-as-in--w>]
 
+--packfile::
+	Instead of a commit id on the command line (which is not expected in
+	this case), 'git http-fetch' fetches the packfile directly at the given
+	URL and uses index-pack to generate corresponding .idx and .keep files.
+	The output of index-pack is printed to stdout.
+
 --recover::
 	Verify that everything reachable from target is fetched.  Used after
 	an earlier fetch is interrupted.
diff --git a/http-fetch.c b/http-fetch.c
index a32ac118d9..a9764d6f96 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -5,7 +5,7 @@
 #include "walker.h"
 
 static const char http_fetch_usage[] = "git http-fetch "
-"[-c] [-t] [-a] [-v] [--recover] [-w ref] [--stdin] commit-id url";
+"[-c] [-t] [-a] [-v] [--recover] [-w ref] [--stdin | --packfile | commit-id] url";
 
 int cmd_main(int argc, const char **argv)
 {
@@ -19,6 +19,7 @@ int cmd_main(int argc, const char **argv)
 	int rc = 0;
 	int get_verbosely = 0;
 	int get_recover = 0;
+	int packfile = 0;
 
 	while (arg < argc && argv[arg][0] == '-') {
 		if (argv[arg][1] == 't') {
@@ -35,43 +36,80 @@ int cmd_main(int argc, const char **argv)
 			get_recover = 1;
 		} else if (!strcmp(argv[arg], "--stdin")) {
 			commits_on_stdin = 1;
+		} else if (!strcmp(argv[arg], "--packfile")) {
+			packfile = 1;
 		}
 		arg++;
 	}
-	if (argc != arg + 2 - commits_on_stdin)
+	if (argc != arg + 2 - (commits_on_stdin || packfile))
 		usage(http_fetch_usage);
 	if (commits_on_stdin) {
 		commits = walker_targets_stdin(&commit_id, &write_ref);
+	} else if (packfile) {
+		/* URL will be set later */
 	} else {
 		commit_id = (char **) &argv[arg++];
 		commits = 1;
 	}
 
-	if (argv[arg])
-		str_end_url_with_slash(argv[arg], &url);
+	if (packfile) {
+		url = xstrdup(argv[arg]);
+	} else {
+		if (argv[arg])
+			str_end_url_with_slash(argv[arg], &url);
+	}
 
 	setup_git_directory();
 
 	git_config(git_default_config, NULL);
 
 	http_init(NULL, url, 0);
-	walker = get_http_walker(url);
-	walker->get_verbosely = get_verbosely;
-	walker->get_recover = get_recover;
 
-	rc = walker_fetch(walker, commits, commit_id, write_ref, url);
+	if (packfile) {
+		struct http_pack_request *preq;
+		struct slot_results results;
+		int ret;
+
+		preq = new_http_pack_request(NULL, url);
+		if (preq == NULL)
+			die("couldn't create http pack request");
+		preq->slot->results = &results;
+		preq->generate_keep = 1;
+
+		if (start_active_slot(preq->slot)) {
+			run_active_slot(preq->slot);
+			if (results.curl_result != CURLE_OK) {
+				die("Unable to get pack file %s\n%s", preq->url,
+				    curl_errorstr);
+			}
+		} else {
+			die("Unable to start request");
+		}
+
+		if ((ret = finish_http_pack_request(preq)))
+			die("finish_http_pack_request gave result %d", ret);
+		release_http_pack_request(preq);
+		rc = 0;
+	} else {
+		walker = get_http_walker(url);
+		walker->get_verbosely = get_verbosely;
+		walker->get_recover = get_recover;
+
+		rc = walker_fetch(walker, commits, commit_id, write_ref, url);
 
-	if (commits_on_stdin)
-		walker_targets_free(commits, commit_id, write_ref);
+		if (commits_on_stdin)
+			walker_targets_free(commits, commit_id, write_ref);
 
-	if (walker->corrupt_object_found) {
-		fprintf(stderr,
+		if (walker->corrupt_object_found) {
+			fprintf(stderr,
 "Some loose object were found to be corrupt, but they might be just\n"
 "a false '404 Not Found' error message sent with incorrect HTTP\n"
 "status code.  Suggest running 'git fsck'.\n");
+		}
+
+		walker_free(walker);
 	}
 
-	walker_free(walker);
 	http_cleanup();
 
 	free(url);
diff --git a/http.c b/http.c
index 130e9d6259..ac66215ee6 100644
--- a/http.c
+++ b/http.c
@@ -2280,15 +2280,18 @@ int finish_http_pack_request(struct http_pack_request *preq)
 	int tmpfile_fd;
 	int ret = 0;
 
-	close_pack_index(p);
+	if (p)
+		close_pack_index(p);
 
 	fclose(preq->packfile);
 	preq->packfile = NULL;
 
-	lst = preq->lst;
-	while (*lst != p)
-		lst = &((*lst)->next);
-	*lst = (*lst)->next;
+	if (p) {
+		lst = preq->lst;
+		while (*lst != p)
+			lst = &((*lst)->next);
+		*lst = (*lst)->next;
+	}
 
 	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
@@ -2296,14 +2299,21 @@ int finish_http_pack_request(struct http_pack_request *preq)
 	argv_array_push(&ip.args, "--stdin");
 	ip.git_cmd = 1;
 	ip.in = tmpfile_fd;
-	ip.no_stdout = 1;
+	if (preq->generate_keep) {
+		argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX,
+				 (uintmax_t)getpid());
+		ip.out = 0;
+	} else {
+		ip.no_stdout = 1;
+	}
 
 	if (run_command(&ip)) {
 		ret = -1;
 		goto cleanup;
 	}
 
-	install_packed_git(the_repository, p);
+	if (p)
+		install_packed_git(the_repository, p);
 
 cleanup:
 	close(tmpfile_fd);
 	unlink(preq->tmpfile.buf);
@@ -2321,12 +2331,31 @@ struct http_pack_request *new_http_pack_request(
 	strbuf_init(&preq->tmpfile, 0);
 	preq->target = target;
 
-	end_url_with_slash(&buf, base_url);
-	strbuf_addf(&buf, "objects/pack/pack-%s.pack",
-		hash_to_hex(target->hash));
-	preq->url = strbuf_detach(&buf, NULL);
+	if (target) {
+		end_url_with_slash(&buf, base_url);
+		strbuf_addf(&buf, "objects/pack/pack-%s.pack",
+			hash_to_hex(target->hash));
+		preq->url = strbuf_detach(&buf, NULL);
+	} else {
+		preq->url = xstrdup(base_url);
+	}
+
+	if (target) {
+		strbuf_addf(&preq->tmpfile, "%s.temp",
+			sha1_pack_name(target->hash));
+	} else {
+		const char *shortened_url;
+		size_t url_len = strlen(base_url);
+
+		shortened_url = url_len <= 50
+			? base_url : base_url + (url_len - 50);
+		strbuf_addf(&preq->tmpfile, "%s/pack/pack-",
+			get_object_directory());
+		strbuf_addstr_urlencode(&preq->tmpfile,
+					shortened_url, is_rfc3986_unreserved);
+		strbuf_addstr(&preq->tmpfile, ".temp");
+	}
 
-	strbuf_addf(&preq->tmpfile, "%s.temp", sha1_pack_name(target->hash));
 	preq->packfile = fopen(preq->tmpfile.buf, "a");
 	if (!preq->packfile) {
 		error("Unable to open local file %s for pack",
@@ -2350,7 +2379,7 @@ struct http_pack_request *new_http_pack_request(
 		if (http_is_verbose)
 			fprintf(stderr,
 				"Resuming fetch of pack %s at byte %"PRIuMAX"\n",
-				hash_to_hex(target->hash),
+				target ? hash_to_hex(target->hash) : base_url,
 				(uintmax_t)prev_posn);
 		http_opt_request_remainder(preq->slot->curl, prev_posn);
 	}
diff --git a/http.h b/http.h
index a5b082f3ae..709dfa4c19 100644
--- a/http.h
+++ b/http.h
@@ -223,12 +223,21 @@ struct http_pack_request {
 	struct active_request_slot *slot;
 
 	/*
-	 * After calling new_http_pack_request(), point lst to the head of the
+	 * After calling new_http_pack_request(), if fetching a pack that
+	 * http_get_info_packs() told us about, point lst to the head of the
 	 * pack list that target is in. finish_http_pack_request() will remove
 	 * target from lst and call install_packed_git() on target.
 	 */
 	struct packed_git **lst;
 
+	/*
+	 * If this is true, finish_http_pack_request() will pass "--keep" to
+	 * index-pack, resulting in the creation of a keep file, and will not
+	 * suppress its stdout (that is, the "keep\t<hash>\n" line will be
+	 * printed to stdout).
+	 */
+	unsigned generate_keep : 1;
+
 	/*
 	 * State managed by functions in http.c.
 	 */
@@ -237,8 +246,12 @@ struct http_pack_request {
 };
 
 /*
- * target must be an element in a pack list obtained from
- * http_get_info_packs().
+ * If fetching a pack that http_get_info_packs() told us about, set target to
+ * an element in a pack list obtained from http_get_info_packs(). The actual
+ * URL fetched will be base_url followed by a suffix with the hash of the pack.
+ *
+ * Otherwise, set target to NULL. The actual URL fetched will be base_url
+ * itself.
  */
 struct http_pack_request *new_http_pack_request(
 	struct packed_git *target, const char *base_url);
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 50485300eb..53010efc49 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -199,6 +199,23 @@ test_expect_success 'fetch packed objects' '
 	git clone $HTTPD_URL/dumb/repo_pack.git
 '
 
+test_expect_success 'http-fetch --packfile' '
+	git init packfileclient &&
+	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git && ls objects/pack/pack-*.pack) &&
+	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p >out &&
+
+	# Ensure that the expected files are generated
+	grep "^keep.[0-9a-f]\{16,\}$" out &&
+	cut -c6- out >packhash &&
+	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).pack" &&
+	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).idx" &&
+	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).keep" &&
+
+	# Ensure that it has the HEAD of repo_pack, at least
+	HASH=$(git -C "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git rev-parse HEAD) &&
+	git -C packfileclient cat-file -e "$HASH"
+'
+
 test_expect_success 'fetch notices corrupt pack' '
 	cp -R "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git &&
 	(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git &&
@@ -214,6 +231,14 @@ test_expect_success 'fetch notices corrupt pack' '
 	)
 '
 
+test_expect_success 'http-fetch --packfile with corrupt pack' '
+	rm -rf packfileclient &&
+	git init packfileclient &&
+	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git && ls objects/pack/pack-*.pack) &&
+	test_must_fail git -C packfileclient http-fetch --packfile \
+		"$HTTPD_URL"/dumb/repo_bad1.git/$p
+'
+
 test_expect_success 'fetch notices corrupt idx' '
 	cp -R "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
 	(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
Teach http-fetch the ability to download packfiles directly, given a
URL, and to verify them.

The http_pack_request suite of functions has been modified to support a
NULL target. When target is NULL, the given URL is downloaded directly
instead of being treated as the root of a repository.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-http-fetch.txt |  8 +++-
 http-fetch.c                     | 64 +++++++++++++++++++++++++-------
 http.c                           | 55 ++++++++++++++++++++-------
 http.h                           | 19 ++++++++--
 t/t5550-http-fetch-dumb.sh       | 25 +++++++++++++
 5 files changed, 141 insertions(+), 30 deletions(-)
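A small sketch of the calling convention this message describes, based
only on the behaviour shown in the diff above; the helper name
request_pack() is invented here for illustration:

	/*
	 * With a non-NULL target, the pack URL is derived from the repository
	 * root ("objects/pack/pack-<hash>.pack" appended to url); with a NULL
	 * target, url is fetched verbatim as the packfile itself.
	 */
	static struct http_pack_request *request_pack(struct packed_git *target,
						      const char *url)
	{
		struct http_pack_request *preq = new_http_pack_request(target, url);

		if (!preq)
			die("couldn't create http pack request");
		/* only the direct-URL mode asks index-pack for a .keep file */
		preq->generate_keep = !target;
		return preq;
	}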