Message ID | 2d6d8c2671fe424c752994dcb5277d4d923e17a0.1611455251.git.jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Check .gitmodules when using packfile URIs | expand |
Jonathan Tan <jonathantanmy@google.com> writes: > diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh > index 7d5b17909b..8b8fb43dbc 100755 > ... > + sane_unset GIT_TEST_SIDEBAND_ALL && > + git -c protocol.version=2 -c transfer.fsckobjects=1 \ > + -c fetch.uriprotocols=http,https \ > + clone "$HTTPD_URL/smart/http_parent" http_child && > + > + # Ensure that there are exactly 4 files (2 .pack and 2 .idx). Ehh, please don't. We may add multi-pack-index there, or perhaps reverse index files in the future. If you care about having two packs logically because you are exercising the out-of-band prepackaged packfile plus the dynamic transfer, make sure you have two packs (and probably the idx files that go with them). Don't assume there will be one .idx each for them *AND* nothing else there. > + ls http_child/.git/objects/pack/* >filelist && > + test_line_count = 4 filelist > +' IOW, d=http_child/.git/objects/pack/ ls "$d"/*.pack "$d"/*.idx >filelist && test_line_count = 4 filelist or something like that.
On Sun, Jan 24 2021, Jonathan Tan wrote: > +void register_found_gitmodules(const struct object_id *oid) > +{ > + oidset_insert(&gitmodules_found, oid); > +} > + In fsck.c we only use this variable to insert into it, or in fsck_blob() to do the actual check, but then we either abort early if we've found it, or right after that: if (object_on_skiplist(options, oid)) return 0; So (along with comments I have below...) you could just use the existing "skiplist" option instead, no? > int fsck_finish(struct fsck_options *options) > { > int ret = 0; > @@ -1262,10 +1267,13 @@ int fsck_finish(struct fsck_options *options) > if (!buf) { > if (is_promisor_object(oid)) > continue; > - ret |= report(options, > - oid, OBJ_BLOB, > - FSCK_MSG_GITMODULES_MISSING, > - "unable to read .gitmodules blob"); > + if (options->print_dangling_gitmodules) > + printf("%s\n", oid_to_hex(oid)); > + else > + ret |= report(options, > + oid, OBJ_BLOB, > + FSCK_MSG_GITMODULES_MISSING, > + "unable to read .gitmodules blob"); > continue; > } > > diff --git a/fsck.h b/fsck.h > index 69cf715e79..4b8cf03445 100644 > --- a/fsck.h > +++ b/fsck.h > @@ -41,6 +41,12 @@ struct fsck_options { > int *msg_type; > struct oidset skiplist; > kh_oid_map_t *object_names; > + > + /* > + * If 1, print the hashes of missing .gitmodules blobs instead of > + * considering them to be errors. > + */ > + unsigned print_dangling_gitmodules:1; > }; > > #define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL, OIDSET_INIT } > @@ -62,6 +68,8 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options); > int fsck_object(struct object *obj, void *data, unsigned long size, > struct fsck_options *options); > > +void register_found_gitmodules(const struct object_id *oid); > + > /* > * Some fsck checks are context-dependent, and may end up queued; run this > * after completing all fsck_object() calls in order to resolve any remaining This whole thing seems just like the bad path I took in earlier rounds of my in-flight mktag series. You don't need this new custom API. You just setup an error handler for your fsck which ignores / prints / logs / whatever the OIDs you want if you get a FSCK_MSG_GITMODULES_MISSING error, which you then "return 0" on. If you don't have FSCK_MSG_GITMODULES_MISSING punt and call fsck_error_function().
On Sun, Jan 24 2021, Jonathan Tan wrote: > --fsck-objects:: > - Die if the pack contains broken objects. For internal use only. > + For internal use only. > ++ > +Die if the pack contains broken objects. If the pack contains a tree > +pointing to a .gitmodules blob that does not exist, prints the hash of > +that blob (for the caller to check) after the hash that goes into the > +name of the pack/idx file (see "Notes"). [I should have waited a bit and sent one E-Mail] Is this really generally usable as an IPC mechanism, what if we need another set of OIDs we care about? Shouldn't it at least be hidden behind some option so you don't get a deluge of output from index-pack if you're not in this packfile-uri mode? But, along with my other E-Mail... > [...] > +static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids) > +{ > + int len = the_hash_algo->hexsz + 1; /* hash + NL */ > + > + do { > + char hex_hash[GIT_MAX_HEXSZ + 1]; > + int read_len = read_in_full(fd, hex_hash, len); > + struct object_id oid; > + const char *end; > + > + if (!read_len) > + return; > + if (read_len != len) > + die("invalid length read %d", read_len); > + if (parse_oid_hex(hex_hash, &oid, &end) || *end != '\n') > + die("invalid hash"); > + oidset_insert(gitmodules_oids, &oid); > + } while (1); > +} > + Doesn't this IPC mechanism already exist in the form of fsck.skipList? See my 1f3299fda9 (fsck: make fsck_config() re-usable, 2021-01-05) on "next". I.e. as noted in my just-sent-E-Mail you could probably just re-use skiplist as-is. Or if not it seems to me that this whole IPC mechanism would be better done with a tempfile and passing it along like we already pass the fsck.skipList between these processes. I doubt it's going to be large enough to matter, we could just put it in .git/ somewhere, like we put gc.log etc (but created with a mktemp() name...). Or if we want to keep the "print <list> | process" model we can refactor the existing fsck IPC noted in 1f3299fda9 a bit, so e.g. you pass some version of "lines prefixed with "fsck-skiplist: " go into list xyz via a command-line option. And then existing option(s) and your potential new list (which as noted, I think is probably redundant to the skiplist) can use it.
Junio C Hamano <gitster@pobox.com> writes: > Jonathan Tan <jonathantanmy@google.com> writes: > >> diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh >> index 7d5b17909b..8b8fb43dbc 100755 >> ... >> + sane_unset GIT_TEST_SIDEBAND_ALL && >> + git -c protocol.version=2 -c transfer.fsckobjects=1 \ >> + -c fetch.uriprotocols=http,https \ >> + clone "$HTTPD_URL/smart/http_parent" http_child && >> + >> + # Ensure that there are exactly 4 files (2 .pack and 2 .idx). > > Ehh, please don't. We may add multi-pack-index there, or perhaps > reverse index files in the future. If you care about having two > packs logically because you are exercising the out-of-band > prepackaged packfile plus the dynamic transfer, make sure you have > two packs (and probably the idx files that go with them). Don't > assume there will be one .idx each for them *AND* nothing else > there. > >> + ls http_child/.git/objects/pack/* >filelist && >> + test_line_count = 4 filelist >> +' > > IOW, > > d=http_child/.git/objects/pack/ > ls "$d"/*.pack "$d"/*.idx >filelist && > test_line_count = 4 filelist > > or something like that. FYI, I have the following queued to make the tip of 'seen' pass the tests. ---- >8 -------- >8 -------- >8 -------- >8 -------- >8 -------- >8 ---- From: Junio C Hamano <gitster@pobox.com> Date: Mon, 25 Jan 2021 17:27:10 -0800 Subject: [PATCH] SQUASH??? test fix --- t/t5702-protocol-v2.sh | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh index 8b8fb43dbc..b1bc73a9a9 100755 --- a/t/t5702-protocol-v2.sh +++ b/t/t5702-protocol-v2.sh @@ -847,8 +847,9 @@ test_expect_success 'part of packfile response provided as URI' ' test -f hfound && test -f h2found && - # Ensure that there are exactly 6 files (3 .pack and 3 .idx). - ls http_child/.git/objects/pack/* >filelist && + # Ensure that there are exactly 3 packfiles with associated .idx + ls http_child/.git/objects/pack/*.pack \ + http_child/.git/objects/pack/*.idx >filelist && test_line_count = 6 filelist ' @@ -901,8 +902,9 @@ test_expect_success 'packfile-uri with transfer.fsckobjects' ' -c fetch.uriprotocols=http,https \ clone "$HTTPD_URL/smart/http_parent" http_child && - # Ensure that there are exactly 4 files (2 .pack and 2 .idx). - ls http_child/.git/objects/pack/* >filelist && + # Ensure that there are exactly 2 packfiles with associated .idx + ls http_child/.git/objects/pack/*.pack \ + http_child/.git/objects/pack/*.idx >filelist && test_line_count = 4 filelist ' @@ -956,8 +958,9 @@ test_expect_success 'packfile-uri with transfer.fsckobjects succeeds when .gitmo -c fetch.uriprotocols=http,https \ clone "$HTTPD_URL/smart/http_parent" http_child && - # Ensure that there are exactly 4 files (2 .pack and 2 .idx). - ls http_child/.git/objects/pack/* >filelist && + # Ensure that there are exactly 2 packfiles with associated .idx + ls http_child/.git/objects/pack/*.pack \ + http_child/.git/objects/pack/*.idx >filelist && test_line_count = 4 filelist '
> On Sun, Jan 24 2021, Jonathan Tan wrote: > > > +void register_found_gitmodules(const struct object_id *oid) > > +{ > > + oidset_insert(&gitmodules_found, oid); > > +} > > + > > In fsck.c we only use this variable to insert into it, or in fsck_blob() > to do the actual check, but then we either abort early if we've found > it, or right after that: By "this variable", do you mean gitmodules_found? fsck_finish() consumes it. > if (object_on_skiplist(options, oid)) > return 0; > > So (along with comments I have below...) you could just use the existing > "skiplist" option instead, no? I don't understand this part (in particular, the part you quoted). About "skiplist", I'll reply to your other email [1] which has more details. [1] https://lore.kernel.org/git/87czxu7c15.fsf@evledraar.gmail.com/ > This whole thing seems just like the bad path I took in earlier rounds > of my in-flight mktag series. You don't need this new custom API. You > just setup an error handler for your fsck which ignores / prints / logs > / whatever the OIDs you want if you get a FSCK_MSG_GITMODULES_MISSING > error, which you then "return 0" on. > > If you don't have FSCK_MSG_GITMODULES_MISSING punt and call > fsck_error_function(). I tried that first, and the issue is that IDs like FSCK_MSG_GITMODULES_MISSING are internal to fsck.c. As for whether we should start exposing the IDs publicly, I think we should wait until a few new cases like this come up, so that we more fully understand the requirements first.
> > Ehh, please don't. We may add multi-pack-index there, or perhaps > > reverse index files in the future. If you care about having two > > packs logically because you are exercising the out-of-band > > prepackaged packfile plus the dynamic transfer, make sure you have > > two packs (and probably the idx files that go with them). Don't > > assume there will be one .idx each for them *AND* nothing else > > there. > > > >> + ls http_child/.git/objects/pack/* >filelist && > >> + test_line_count = 4 filelist > >> +' > > > > IOW, > > > > d=http_child/.git/objects/pack/ > > ls "$d"/*.pack "$d"/*.idx >filelist && > > test_line_count = 4 filelist > > > > or something like that. > > FYI, I have the following queued to make the tip of 'seen' pass the > tests. [snip] OK - I'll include these changes in the next version.
> On Sun, Jan 24 2021, Jonathan Tan wrote: > > --fsck-objects:: > > - Die if the pack contains broken objects. For internal use only. > > + For internal use only. > > ++ > > +Die if the pack contains broken objects. If the pack contains a tree > > +pointing to a .gitmodules blob that does not exist, prints the hash of > > +that blob (for the caller to check) after the hash that goes into the > > +name of the pack/idx file (see "Notes"). > > [I should have waited a bit and sent one E-Mail] > > Is this really generally usable as an IPC mechanism, what if we need > another set of OIDs we care about? Shouldn't it at least be hidden > behind some option so you don't get a deluge of output from index-pack > if you're not in this packfile-uri mode? --fsck-objects is only for internal use, and it's only used by fetch-pack.c. So its only consumer does want the output. Junio also mentioned the possibility of another set of OIDs, and I replied [1]. [1] https://lore.kernel.org/git/20210128003536.3874866-1-jonathantanmy@google.com/ > But, along with my other E-Mail... > > > [...] > > +static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids) > > +{ > > + int len = the_hash_algo->hexsz + 1; /* hash + NL */ > > + > > + do { > > + char hex_hash[GIT_MAX_HEXSZ + 1]; > > + int read_len = read_in_full(fd, hex_hash, len); > > + struct object_id oid; > > + const char *end; > > + > > + if (!read_len) > > + return; > > + if (read_len != len) > > + die("invalid length read %d", read_len); > > + if (parse_oid_hex(hex_hash, &oid, &end) || *end != '\n') > > + die("invalid hash"); > > + oidset_insert(gitmodules_oids, &oid); > > + } while (1); > > +} > > + > > Doesn't this IPC mechanism already exist in the form of fsck.skipList? > See my 1f3299fda9 (fsck: make fsck_config() re-usable, 2021-01-05) on > "next". I.e. as noted in my just-sent-E-Mail you could probably just > re-use skiplist as-is. I'm not sure how fsck.skipList could be used here. Before running fsck_finish() for the first time, we don't know which .gitmodules are missing and which are not. And when running fsck_finish() for the second time, we definitely do not want to skip any blobs. > Or if not it seems to me that this whole IPC mechanism would be better > done with a tempfile and passing it along like we already pass the > fsck.skipList between these processes. > > I doubt it's going to be large enough to matter, we could just put it in > .git/ somewhere, like we put gc.log etc (but created with a mktemp() > name...). > > Or if we want to keep the "print <list> | process" model we can refactor > the existing fsck IPC noted in 1f3299fda9 a bit, so e.g. you pass some > version of "lines prefixed with "fsck-skiplist: " go into list xyz via a > command-line option. And then existing option(s) and your potential new > list (which as noted, I think is probably redundant to the skiplist) can > use it. I think using stdout is superior to using a tempfile - we don't have to worry about interrupted invocations, for example. What do you mean by "the existing fsck IPC noted in 1f3299fda9"? If you mean the ability to pass a list of OIDs, for example using "-c fsck.skipList=filename.txt", I'm not sure that it solves anything. Firstly, I don't think that the skipList is useful here (as I said earlier). And secondly, I don't think that OID input is the issue - right now, the design is a process (index-pack, calling fsck_finish()) writing to its output which is then picked up by the calling process (fetch-pack). We are not sending the dangling .gitmodules through stdin anywhere.
On Thu, Jan 28 2021, Jonathan Tan wrote: Sorry I managed to miss this at the time. Hopefully a late reply is better than never. >> On Sun, Jan 24 2021, Jonathan Tan wrote: >> >> > +void register_found_gitmodules(const struct object_id *oid) >> > +{ >> > + oidset_insert(&gitmodules_found, oid); >> > +} >> > + >> >> In fsck.c we only use this variable to insert into it, or in fsck_blob() >> to do the actual check, but then we either abort early if we've found >> it, or right after that: > > By "this variable", do you mean gitmodules_found? fsck_finish() consumes > it. Yes, consumes it to emit errors with report(), no? >> if (object_on_skiplist(options, oid)) >> return 0; >> >> So (along with comments I have below...) you could just use the existing >> "skiplist" option instead, no? > > I don't understand this part (in particular, the part you quoted). About > "skiplist", I'll reply to your other email [1] which has more details. > > [1] https://lore.kernel.org/git/87czxu7c15.fsf@evledraar.gmail.com/ *nod* >> This whole thing seems just like the bad path I took in earlier rounds >> of my in-flight mktag series. You don't need this new custom API. You >> just setup an error handler for your fsck which ignores / prints / logs >> / whatever the OIDs you want if you get a FSCK_MSG_GITMODULES_MISSING >> error, which you then "return 0" on. >> >> If you don't have FSCK_MSG_GITMODULES_MISSING punt and call >> fsck_error_function(). > > I tried that first, and the issue is that IDs like > FSCK_MSG_GITMODULES_MISSING are internal to fsck.c. As for whether we > should start exposing the IDs publicly, I think we should wait until a > few new cases like this come up, so that we more fully understand the > requirements first. The requirement is that you want the objects ids we'd otherwise error about in fsck_finish(). Yeah we don't pass the "fsck_msg_id" down in the "report()" function, but you can reliably strstr() it out of the message. We document & hard rely on that already, since it's also a config key. But yeah, we could just change the report function to pass down the id and move the relevant macros from fsck.c to fsck.h. I think that would be a smaller change conceptually than a special-case flag in fsck_options for something we could otherwise do with the error reporting.
On Thu, Jan 28 2021, Jonathan Tan wrote: >> On Sun, Jan 24 2021, Jonathan Tan wrote: >> > --fsck-objects:: >> > - Die if the pack contains broken objects. For internal use only. >> > + For internal use only. >> > ++ >> > +Die if the pack contains broken objects. If the pack contains a tree >> > +pointing to a .gitmodules blob that does not exist, prints the hash of >> > +that blob (for the caller to check) after the hash that goes into the >> > +name of the pack/idx file (see "Notes"). >> >> [I should have waited a bit and sent one E-Mail] >> >> Is this really generally usable as an IPC mechanism, what if we need >> another set of OIDs we care about? Shouldn't it at least be hidden >> behind some option so you don't get a deluge of output from index-pack >> if you're not in this packfile-uri mode? > > --fsck-objects is only for internal use, and it's only used by > fetch-pack.c. So its only consumer does want the output. > > Junio also mentioned the possibility of another set of OIDs, and I > replied [1]. > > [1] https://lore.kernel.org/git/20210128003536.3874866-1-jonathantanmy@google.com/ > >> But, along with my other E-Mail... >> >> > [...] >> > +static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids) >> > +{ >> > + int len = the_hash_algo->hexsz + 1; /* hash + NL */ >> > + >> > + do { >> > + char hex_hash[GIT_MAX_HEXSZ + 1]; >> > + int read_len = read_in_full(fd, hex_hash, len); >> > + struct object_id oid; >> > + const char *end; >> > + >> > + if (!read_len) >> > + return; >> > + if (read_len != len) >> > + die("invalid length read %d", read_len); >> > + if (parse_oid_hex(hex_hash, &oid, &end) || *end != '\n') >> > + die("invalid hash"); >> > + oidset_insert(gitmodules_oids, &oid); >> > + } while (1); >> > +} >> > + >> >> Doesn't this IPC mechanism already exist in the form of fsck.skipList? >> See my 1f3299fda9 (fsck: make fsck_config() re-usable, 2021-01-05) on >> "next". I.e. as noted in my just-sent-E-Mail you could probably just >> re-use skiplist as-is. > > I'm not sure how fsck.skipList could be used here. Before running > fsck_finish() for the first time, we don't know which .gitmodules are > missing and which are not. And when running fsck_finish() for the second > time, we definitely do not want to skip any blobs. > >> Or if not it seems to me that this whole IPC mechanism would be better >> done with a tempfile and passing it along like we already pass the >> fsck.skipList between these processes. >> >> I doubt it's going to be large enough to matter, we could just put it in >> .git/ somewhere, like we put gc.log etc (but created with a mktemp() >> name...). >> >> Or if we want to keep the "print <list> | process" model we can refactor >> the existing fsck IPC noted in 1f3299fda9 a bit, so e.g. you pass some >> version of "lines prefixed with "fsck-skiplist: " go into list xyz via a >> command-line option. And then existing option(s) and your potential new >> list (which as noted, I think is probably redundant to the skiplist) can >> use it. > > I think using stdout is superior to using a tempfile - we don't have to > worry about interrupted invocations, for example. > > What do you mean by "the existing fsck IPC noted in 1f3299fda9"? If you > mean the ability to pass a list of OIDs, for example using "-c > fsck.skipList=filename.txt", I'm not sure that it solves anything. > Firstly, I don't think that the skipList is useful here (as I said > earlier). And secondly, I don't think that OID input is the issue - > right now, the design is a process (index-pack, calling fsck_finish()) > writing to its output which is then picked up by the calling process > (fetch-pack). We are not sending the dangling .gitmodules through stdin > anywhere. Sorry for being unclear here. I don't think (honestly I don't remember, it's been almost a month) that I meant to you should use the skipList. Looking at that code again we use object_on_skiplist() to do an early punt in report(), but also fsck_blob(), presumably you never want the latter, and that early punting wouldn't be needed if your report() function intercepted the modules blob id for stashing it away / later reporting / whatever. So yeah, I'm 99% sure now that's not what I meant :) What I meant with: Or if we want to keep the "print <list> | process"[...] Is that we have an existing ad-hoc IPC model for these commands in passing along the skipList, which is made more complex because sometimes the initial process reads the file, sometimes it passes it along as-is to the child. And then there's this patch that passes OIDs too, but through a different mechanism. I was suggesting that perhaps it made more sense to refactor both so they could use the same mechanism, because we're potentially passing two lists of OIDs between the two. Just one goes via line-at-a-time in the output, the other via a config option on the command-line.
On Sun, Jan 24 2021, Jonathan Tan wrote: > diff --git a/builtin/index-pack.c b/builtin/index-pack.c > index 557bd2f348..f995c15115 100644 > --- a/builtin/index-pack.c > +++ b/builtin/index-pack.c > @@ -1888,8 +1888,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) > else > close(input_fd); > > - if (do_fsck_object && fsck_finish(&fsck_options)) > - die(_("fsck error in pack objects")); > + if (do_fsck_object) { > + struct fsck_options fo = FSCK_OPTIONS_STRICT; > + > + fo.print_dangling_gitmodules = 1; > + if (fsck_finish(&fo)) > + die(_("fsck error in pack objects")); > + } > [...] > +static void fsck_gitmodules_oids(struct oidset *gitmodules_oids) > +{ > + struct oidset_iter iter; > + const struct object_id *oid; > + struct fsck_options fo = FSCK_OPTIONS_STRICT; > + > + if (!oidset_size(gitmodules_oids)) > + return; > + > + oidset_iter_init(gitmodules_oids, &iter); > + while ((oid = oidset_iter_next(&iter))) > + register_found_gitmodules(oid); > + if (fsck_finish(&fo)) > + die("fsck failed"); > +} > + What's the need for STRICT here & can't the former use the existing fsck_options in index-pack.c? With this on top we pass all tests: diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 18531199242..5464edf4778 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1933,10 +1933,8 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) close(input_fd); if (do_fsck_object) { - struct fsck_options fo = FSCK_OPTIONS_STRICT; - - fo.print_dangling_gitmodules = 1; - if (fsck_finish(&fo)) + fsck_options.print_dangling_gitmodules = 1; + if (fsck_finish(&fsck_options)) die(_("fsck error in pack objects")); } diff --git a/fetch-pack.c b/fetch-pack.c index 0a337a04f1f..a8754d97e3d 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -997,7 +997,7 @@ static void fsck_gitmodules_oids(struct oidset *gitmodules_oids) { struct oidset_iter iter; const struct object_id *oid; - struct fsck_options fo = FSCK_OPTIONS_STRICT; + struct fsck_options fo = FSCK_OPTIONS_DEFAULT; if (!oidset_size(gitmodules_oids)) return;
> > I tried that first, and the issue is that IDs like > > FSCK_MSG_GITMODULES_MISSING are internal to fsck.c. As for whether we > > should start exposing the IDs publicly, I think we should wait until a > > few new cases like this come up, so that we more fully understand the > > requirements first. > > The requirement is that you want the objects ids we'd otherwise error > about in fsck_finish(). Yeah we don't pass the "fsck_msg_id" down in the > "report()" function, but you can reliably strstr() it out of the > message. We can't strstr() because of false positives (if, e.g. there is a submodule name that contains the string we're looking for), but looking at report() in fsck.c, the message ID is the very first thing appended, so I think we can use starts_with(). > We document & hard rely on that already, since it's also a > config key. Ah, good point. > But yeah, we could just change the report function to pass down the id > and move the relevant macros from fsck.c to fsck.h. I think that would > be a smaller change conceptually than a special-case flag in > fsck_options for something we could otherwise do with the error > reporting. I agree - I thought this wouldn't be possible, but like you said, we can reliably make use of the string in report() (or pass the ID, like your patch set [1] does) so we should do this. What would be the best way to proceed, now that we have at least 2 patch sets (mine and yours) in play? I was thinking that I should update my one to use the string reported in report() (with starts_with()), so that both our patch sets can be reviewed and merged in parallel, and after that, update the fsck code to use the ID instead of the string. [1] https://lore.kernel.org/git/87blcja2ha.fsf@evledraar.gmail.com/
> Sorry for being unclear here. I don't think (honestly I don't remember, > it's been almost a month) that I meant to you should use the skipList. > > Looking at that code again we use object_on_skiplist() to do an early > punt in report(), but also fsck_blob(), presumably you never want the > latter, and that early punting wouldn't be needed if your report() > function intercepted the modules blob id for stashing it away / later > reporting / whatever. > > So yeah, I'm 99% sure now that's not what I meant :) > > What I meant with: > > Or if we want to keep the "print <list> | process"[...] > > Is that we have an existing ad-hoc IPC model for these commands in > passing along the skipList, which is made more complex because sometimes > the initial process reads the file, sometimes it passes it along as-is > to the child. > > And then there's this patch that passes OIDs too, but through a > different mechanism. > > I was suggesting that perhaps it made more sense to refactor both so > they could use the same mechanism, because we're potentially passing two > lists of OIDs between the two. Just one goes via line-at-a-time in the > output, the other via a config option on the command-line. Thanks for your explanation. I still think that they are quite different - skiplist is a user-written file containing a list of OIDs that will likely never change, whereas my list of dangling .gitmodules is a list of OIDs dynamically generated (and thus, always different) whenever a fetch is done. So I think it's quite reasonable to pass skiplist as a file name, and my list should be passed line-by-line.
> What's the need for STRICT here & can't the former use the existing > fsck_options in index-pack.c? With this on top we pass all tests: [snip code] Good point - I'll do that.
On Wed, Feb 17 2021, Jonathan Tan wrote: >> Sorry for being unclear here. I don't think (honestly I don't remember, >> it's been almost a month) that I meant to you should use the skipList. >> >> Looking at that code again we use object_on_skiplist() to do an early >> punt in report(), but also fsck_blob(), presumably you never want the >> latter, and that early punting wouldn't be needed if your report() >> function intercepted the modules blob id for stashing it away / later >> reporting / whatever. >> >> So yeah, I'm 99% sure now that's not what I meant :) >> >> What I meant with: >> >> Or if we want to keep the "print <list> | process"[...] >> >> Is that we have an existing ad-hoc IPC model for these commands in >> passing along the skipList, which is made more complex because sometimes >> the initial process reads the file, sometimes it passes it along as-is >> to the child. >> >> And then there's this patch that passes OIDs too, but through a >> different mechanism. >> >> I was suggesting that perhaps it made more sense to refactor both so >> they could use the same mechanism, because we're potentially passing two >> lists of OIDs between the two. Just one goes via line-at-a-time in the >> output, the other via a config option on the command-line. > > Thanks for your explanation. I still think that they are quite different > - skiplist is a user-written file containing a list of OIDs that will > likely never change, whereas my list of dangling .gitmodules is a list > of OIDs dynamically generated (and thus, always different) whenever a > fetch is done. So I think it's quite reasonable to pass skiplist as a > file name, and my list should be passed line-by-line. Sure, but I'm not talking about passing it as a tempfile. Yes, I suggested that in the third-to-last paragraph of [1] but then went on to say that we could also move to some IPC mechanism where you spew in the list of dangling .gitmodules, and we also spew in the skipList and anything else we want to pass in. I'm not saying this needs to be part of this series. But let me rephrase: We now have some combination of {receive-pack,upload-pack,send-pack,fetch-pack,unpack-objects} that need to communicate locally or pass data back & forth, passing data either via a CLI option to read a file, packnames/refs on --stdin, or (now) a single list of OIDs on stdout. Let's say we don't just need to pass the .gitmodules OIDs, but also e.g. .mailmap OIDs or whatever (due to some future vulnerability). Would this IPC mechanism deal with that, or would we need to introduce a breaking change (Re: my recently send mail about concurrent updates of libexec programs)? Can we use soemething like pkt-line to talk back & forth in an extensible way? Not needed now, just food for thought... 1. https://lore.kernel.org/git/87czxu7c15.fsf@evledraar.gmail.com/
diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt index af0c26232c..e74a4a1eda 100644 --- a/Documentation/git-index-pack.txt +++ b/Documentation/git-index-pack.txt @@ -78,7 +78,12 @@ OPTIONS Die if the pack contains broken links. For internal use only. --fsck-objects:: - Die if the pack contains broken objects. For internal use only. + For internal use only. ++ +Die if the pack contains broken objects. If the pack contains a tree +pointing to a .gitmodules blob that does not exist, prints the hash of +that blob (for the caller to check) after the hash that goes into the +name of the pack/idx file (see "Notes"). --threads=<n>:: Specifies the number of threads to spawn when resolving diff --git a/builtin/index-pack.c b/builtin/index-pack.c index 557bd2f348..f995c15115 100644 --- a/builtin/index-pack.c +++ b/builtin/index-pack.c @@ -1888,8 +1888,13 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix) else close(input_fd); - if (do_fsck_object && fsck_finish(&fsck_options)) - die(_("fsck error in pack objects")); + if (do_fsck_object) { + struct fsck_options fo = FSCK_OPTIONS_STRICT; + + fo.print_dangling_gitmodules = 1; + if (fsck_finish(&fo)) + die(_("fsck error in pack objects")); + } free(objects); strbuf_release(&index_name_buf); diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c index d49d050e6e..ed2c9b42e9 100644 --- a/builtin/receive-pack.c +++ b/builtin/receive-pack.c @@ -2275,7 +2275,7 @@ static const char *unpack(int err_fd, struct shallow_info *si) status = start_command(&child); if (status) return "index-pack fork failed"; - pack_lockfile = index_pack_lockfile(child.out); + pack_lockfile = index_pack_lockfile(child.out, NULL); close(child.out); status = finish_command(&child); if (status) diff --git a/fetch-pack.c b/fetch-pack.c index fe69635eb5..128362e0ba 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -796,6 +796,26 @@ static void write_promisor_file(const char *keep_name, strbuf_release(&promisor_name); } +static void parse_gitmodules_oids(int fd, struct oidset *gitmodules_oids) +{ + int len = the_hash_algo->hexsz + 1; /* hash + NL */ + + do { + char hex_hash[GIT_MAX_HEXSZ + 1]; + int read_len = read_in_full(fd, hex_hash, len); + struct object_id oid; + const char *end; + + if (!read_len) + return; + if (read_len != len) + die("invalid length read %d", read_len); + if (parse_oid_hex(hex_hash, &oid, &end) || *end != '\n') + die("invalid hash"); + oidset_insert(gitmodules_oids, &oid); + } while (1); +} + /* * If packfile URIs were provided, pass a non-NULL pointer to index_pack_args. * The string to pass as the --index-pack-args argument to http-fetch will be @@ -804,7 +824,8 @@ static void write_promisor_file(const char *keep_name, static int get_pack(struct fetch_pack_args *args, int xd[2], struct string_list *pack_lockfiles, char **index_pack_args, - struct ref **sought, int nr_sought) + struct ref **sought, int nr_sought, + struct oidset *gitmodules_oids) { struct async demux; int do_keep = args->keep_pack; @@ -812,6 +833,7 @@ static int get_pack(struct fetch_pack_args *args, struct pack_header header; int pass_header = 0; struct child_process cmd = CHILD_PROCESS_INIT; + int fsck_objects = 0; int ret; memset(&demux, 0, sizeof(demux)); @@ -846,8 +868,15 @@ static int get_pack(struct fetch_pack_args *args, strvec_push(&cmd.args, alternate_shallow_file); } - if (do_keep || args->from_promisor || index_pack_args) { - if (pack_lockfiles) + if (fetch_fsck_objects >= 0 + ? fetch_fsck_objects + : transfer_fsck_objects >= 0 + ? transfer_fsck_objects + : 0) + fsck_objects = 1; + + if (do_keep || args->from_promisor || index_pack_args || fsck_objects) { + if (pack_lockfiles || fsck_objects) cmd.out = -1; cmd_name = "index-pack"; strvec_push(&cmd.args, cmd_name); @@ -897,11 +926,7 @@ static int get_pack(struct fetch_pack_args *args, strvec_pushf(&cmd.args, "--pack_header=%"PRIu32",%"PRIu32, ntohl(header.hdr_version), ntohl(header.hdr_entries)); - if (fetch_fsck_objects >= 0 - ? fetch_fsck_objects - : transfer_fsck_objects >= 0 - ? transfer_fsck_objects - : 0) { + if (fsck_objects) { if (args->from_promisor || index_pack_args) /* * We cannot use --strict in index-pack because it @@ -931,10 +956,15 @@ static int get_pack(struct fetch_pack_args *args, cmd.git_cmd = 1; if (start_command(&cmd)) die(_("fetch-pack: unable to fork off %s"), cmd_name); - if (do_keep && pack_lockfiles) { - char *pack_lockfile = index_pack_lockfile(cmd.out); + if (do_keep && (pack_lockfiles || fsck_objects)) { + int is_well_formed; + char *pack_lockfile = index_pack_lockfile(cmd.out, &is_well_formed); + + if (!is_well_formed) + die(_("fetch-pack: invalid index-pack output")); if (pack_lockfile) string_list_append_nodup(pack_lockfiles, pack_lockfile); + parse_gitmodules_oids(cmd.out, gitmodules_oids); close(cmd.out); } @@ -969,6 +999,22 @@ static int cmp_ref_by_name(const void *a_, const void *b_) return strcmp(a->name, b->name); } +static void fsck_gitmodules_oids(struct oidset *gitmodules_oids) +{ + struct oidset_iter iter; + const struct object_id *oid; + struct fsck_options fo = FSCK_OPTIONS_STRICT; + + if (!oidset_size(gitmodules_oids)) + return; + + oidset_iter_init(gitmodules_oids, &iter); + while ((oid = oidset_iter_next(&iter))) + register_found_gitmodules(oid); + if (fsck_finish(&fo)) + die("fsck failed"); +} + static struct ref *do_fetch_pack(struct fetch_pack_args *args, int fd[2], const struct ref *orig_ref, @@ -983,6 +1029,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args, int agent_len; struct fetch_negotiator negotiator_alloc; struct fetch_negotiator *negotiator; + struct oidset gitmodules_oids = OIDSET_INIT; negotiator = &negotiator_alloc; fetch_negotiator_init(r, negotiator); @@ -1098,8 +1145,10 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args, alternate_shallow_file = setup_temporary_shallow(si->shallow); else alternate_shallow_file = NULL; - if (get_pack(args, fd, pack_lockfiles, NULL, sought, nr_sought)) + if (get_pack(args, fd, pack_lockfiles, NULL, sought, nr_sought, + &gitmodules_oids)) die(_("git fetch-pack: fetch failed.")); + fsck_gitmodules_oids(&gitmodules_oids); all_done: if (negotiator) @@ -1550,6 +1599,7 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args, struct string_list packfile_uris = STRING_LIST_INIT_DUP; int i; char *index_pack_args = NULL; + struct oidset gitmodules_oids = OIDSET_INIT; negotiator = &negotiator_alloc; fetch_negotiator_init(r, negotiator); @@ -1640,7 +1690,7 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args, process_section_header(&reader, "packfile", 0); if (get_pack(args, fd, pack_lockfiles, packfile_uris.nr ? &index_pack_args : NULL, - sought, nr_sought)) + sought, nr_sought, &gitmodules_oids)) die(_("git fetch-pack: fetch failed.")); do_check_stateless_delimiter(args, &reader); @@ -1680,6 +1730,8 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args, packname[the_hash_algo->hexsz] = '\0'; + parse_gitmodules_oids(cmd.out, &gitmodules_oids); + close(cmd.out); if (finish_command(&cmd)) @@ -1699,6 +1751,8 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args, string_list_clear(&packfile_uris, 0); FREE_AND_NULL(index_pack_args); + fsck_gitmodules_oids(&gitmodules_oids); + if (negotiator) negotiator->release(negotiator); diff --git a/fsck.c b/fsck.c index f82e2fe9e3..04f3d342af 100644 --- a/fsck.c +++ b/fsck.c @@ -1243,6 +1243,11 @@ int fsck_error_function(struct fsck_options *o, return 1; } +void register_found_gitmodules(const struct object_id *oid) +{ + oidset_insert(&gitmodules_found, oid); +} + int fsck_finish(struct fsck_options *options) { int ret = 0; @@ -1262,10 +1267,13 @@ int fsck_finish(struct fsck_options *options) if (!buf) { if (is_promisor_object(oid)) continue; - ret |= report(options, - oid, OBJ_BLOB, - FSCK_MSG_GITMODULES_MISSING, - "unable to read .gitmodules blob"); + if (options->print_dangling_gitmodules) + printf("%s\n", oid_to_hex(oid)); + else + ret |= report(options, + oid, OBJ_BLOB, + FSCK_MSG_GITMODULES_MISSING, + "unable to read .gitmodules blob"); continue; } diff --git a/fsck.h b/fsck.h index 69cf715e79..4b8cf03445 100644 --- a/fsck.h +++ b/fsck.h @@ -41,6 +41,12 @@ struct fsck_options { int *msg_type; struct oidset skiplist; kh_oid_map_t *object_names; + + /* + * If 1, print the hashes of missing .gitmodules blobs instead of + * considering them to be errors. + */ + unsigned print_dangling_gitmodules:1; }; #define FSCK_OPTIONS_DEFAULT { NULL, fsck_error_function, 0, NULL, OIDSET_INIT } @@ -62,6 +68,8 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options); int fsck_object(struct object *obj, void *data, unsigned long size, struct fsck_options *options); +void register_found_gitmodules(const struct object_id *oid); + /* * Some fsck checks are context-dependent, and may end up queued; run this * after completing all fsck_object() calls in order to resolve any remaining diff --git a/pack-write.c b/pack-write.c index 3513665e1e..f66ea8e5a1 100644 --- a/pack-write.c +++ b/pack-write.c @@ -272,7 +272,7 @@ void fixup_pack_header_footer(int pack_fd, fsync_or_die(pack_fd, pack_name); } -char *index_pack_lockfile(int ip_out) +char *index_pack_lockfile(int ip_out, int *is_well_formed) { char packname[GIT_MAX_HEXSZ + 6]; const int len = the_hash_algo->hexsz + 6; @@ -286,11 +286,17 @@ char *index_pack_lockfile(int ip_out) */ if (read_in_full(ip_out, packname, len) == len && packname[len-1] == '\n') { const char *name; + + if (is_well_formed) + *is_well_formed = 1; packname[len-1] = 0; if (skip_prefix(packname, "keep\t", &name)) return xstrfmt("%s/pack/pack-%s.keep", get_object_directory(), name); + return NULL; } + if (is_well_formed) + *is_well_formed = 0; return NULL; } diff --git a/pack.h b/pack.h index 9fc0945ac9..09cffec395 100644 --- a/pack.h +++ b/pack.h @@ -85,7 +85,7 @@ int verify_pack_index(struct packed_git *); int verify_pack(struct repository *, struct packed_git *, verify_fn fn, struct progress *, uint32_t); off_t write_pack_header(struct hashfile *f, uint32_t); void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t); -char *index_pack_lockfile(int fd); +char *index_pack_lockfile(int fd, int *is_well_formed); /* * The "hdr" output buffer should be at least this big, which will handle sizes diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh index 7d5b17909b..8b8fb43dbc 100755 --- a/t/t5702-protocol-v2.sh +++ b/t/t5702-protocol-v2.sh @@ -936,6 +936,53 @@ test_expect_success 'packfile-uri with transfer.fsckobjects fails on bad object' test_i18ngrep "invalid author/committer line - missing email" error ' +test_expect_success 'packfile-uri with transfer.fsckobjects succeeds when .gitmodules is separate from tree' ' + P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" && + rm -rf "$P" http_child && + + git init "$P" && + git -C "$P" config "uploadpack.allowsidebandall" "true" && + + echo "[submodule libfoo]" >"$P/.gitmodules" && + echo "path = include/foo" >>"$P/.gitmodules" && + echo "url = git://example.com/git/lib.git" >>"$P/.gitmodules" && + git -C "$P" add .gitmodules && + git -C "$P" commit -m x && + + configure_exclusion "$P" .gitmodules >h && + + sane_unset GIT_TEST_SIDEBAND_ALL && + git -c protocol.version=2 -c transfer.fsckobjects=1 \ + -c fetch.uriprotocols=http,https \ + clone "$HTTPD_URL/smart/http_parent" http_child && + + # Ensure that there are exactly 4 files (2 .pack and 2 .idx). + ls http_child/.git/objects/pack/* >filelist && + test_line_count = 4 filelist +' + +test_expect_success 'packfile-uri with transfer.fsckobjects fails when .gitmodules separate from tree is invalid' ' + P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" && + rm -rf "$P" http_child err && + + git init "$P" && + git -C "$P" config "uploadpack.allowsidebandall" "true" && + + echo "[submodule \"..\"]" >"$P/.gitmodules" && + echo "path = include/foo" >>"$P/.gitmodules" && + echo "url = git://example.com/git/lib.git" >>"$P/.gitmodules" && + git -C "$P" add .gitmodules && + git -C "$P" commit -m x && + + configure_exclusion "$P" .gitmodules >h && + + sane_unset GIT_TEST_SIDEBAND_ALL && + test_must_fail git -c protocol.version=2 -c transfer.fsckobjects=1 \ + -c fetch.uriprotocols=http,https \ + clone "$HTTPD_URL/smart/http_parent" http_child 2>err && + test_i18ngrep "disallowed submodule name" err +' + # DO NOT add non-httpd-specific tests here, because the last part of this # test script is only executed when httpd is available and enabled.
Teach index-pack to print dangling .gitmodules links after its "keep" or "pack" line instead of declaring an error, and teach fetch-pack to check such lines printed. This allows the tree side of the .gitmodules link to be in one packfile and the blob side to be in another without failing the fsck check, because it is now fetch-pack which checks such objects after all packfiles have been downloaded and indexed (and not index-pack on an individual packfile, as it is before this commit). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Documentation/git-index-pack.txt | 7 ++- builtin/index-pack.c | 9 +++- builtin/receive-pack.c | 2 +- fetch-pack.c | 78 +++++++++++++++++++++++++++----- fsck.c | 16 +++++-- fsck.h | 8 ++++ pack-write.c | 8 +++- pack.h | 2 +- t/t5702-protocol-v2.sh | 47 +++++++++++++++++++ 9 files changed, 155 insertions(+), 22 deletions(-)