Message ID | Z5r7BuEJvjwQ9f4G@ArchLinux (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | add more ref consistency checks | expand |
shejialuo <shejialuo@gmail.com> writes: > In "packed-backend.c::create_snapshot", if there is a header (the line > which starts with '#'), we will check whether the line starts with "# > pack-refs with:". As we are going to implement the header consistency > check, we should port this check into "packed_fsck". > > However, the above check is not enough, this is because "git pack-refs" > will always write "PACKED_REFS_HEADER" which is a constant string to the > "packed-refs" file. So, we should check the following things for the > header. I haven't done history digging in this area for a while, but we should make sure we are not flagging a file that was written in ancient version of Git whose repository is still supported. > 1. If the header does not exist, we may report an error to the user > because it should exist, but we do allow no header in "packed-refs" > file. So, create a new fsck message "packedRefMissingHeader(INFO)" to > warn the user and also keep compatibility. Are we sure "it should exist"? I think the header did not exist before "Git v1.5.0". I didn't check with other reimplementations of Git (like jgit or libgit2), but as long as our reading side of the runtime allows a packed-refs file without the header without complaint, I do not think it is a good idea to treat it as a report-worthy event from "git fsck". > 2. If the header content does not start with "# packed-ref with:", we > should report an error just like what "create_snapshot" does. So, > create a new fsck message "badPackedRefHeader(ERROR)" for this. This I can agree with. If the first line begins with "#" but not with that string (with a trailing SP), that is a sign that it may not even be a valid packed-refs file, which is a report-worthy event. > 3. If the header content is not the same as the constant string > "PACKED_REFS_HEADER", ideally, we should report an error to the user. NO. THAT IS NOT IDEAL AT ALL. The header was written like this: /* perhaps other traits later as well */ fprintf(cbdata.refs_file, "# pack-refs with: peeled \n"); in the older versions of Git before it was made into a separate preprocessor macro and lost the comment (the above excerpt is from "git show v1.5.0:builtin-pack-refs.c"). Notice "other traits later" in the comment? The thing is _designed_ to be extensible. In fact, these days we support a few more traits static const char PACKED_REFS_HEADER[] = "# pack-refs with: peeled fully-peeled sorted \n"; (an excerpt from the current refs/packed-backend.c). Reporting an error when you see something written by an older version of Git is far from ideal. > However, we allow other contents as long as the header content starts > with "# packed-ref with:". To keep compatibility, create a new fsck > message "unknownPackedRefHeader(INFO)" to warn about this. We may > tighten this rule in the future. Whatever we do, what we do with an unknown trait should be in line with what the runtime does. If the runtime failed (we do not, but this is to illustrate the principle [*]) on a packed-refs file without "sorted" trait, noticing that "sorted" is not there and flagging as an error is a good thing to do. But if the runtime gracefully degrades and sorts the list of refs read from such a packed-refs file before continuing, then a packed-refs file that lack "sorted" trait is not a report-worthy event. I do not offhand recall if we introduced the concept of mandatory vs optional traits in the packed-refs part of the system (like we have in the index extension subsystem, where a version of Git that encounters an unknown *and* mandatory index extension must refuse to touch the repository), but if there is a mandatory trait declared in the header that our version of Git does not understand, it is a report-worthy event that must be flagged with "git refs verify". > +static int packed_fsck_ref_header(struct fsck_options *o, const char *start, const char *eol) > +{ > + const char *err_fmt = NULL; > + int fsck_msg_id = -1; > + > + if (!starts_with(start, "# pack-refs with:")) { > + err_fmt = "'%.*s' does not start with '# pack-refs with:'"; > + fsck_msg_id = FSCK_MSG_BAD_PACKED_REF_HEADER; > + } else if (strncmp(start, PACKED_REFS_HEADER, strlen(PACKED_REFS_HEADER))) { > + err_fmt = "'%.*s' is an unknown packed-refs header"; > + fsck_msg_id = FSCK_MSG_UNKNOWN_PACKED_REF_HEADER; > + } As I outlined above, this is totally unacceptable. Inspecting the header is good, but if this code claims to be a checker, it should do at least what the runtime does, i.e. parse the header to tell what traits the packed-file declares, not just assuming that it is a fixed string. And error on unknown trait(s) if they are mandatory (if such a concept is implemented in the runtime reading side). Informing on an unknown and optional trait(s) I can live with, but personally I wouldn't recommend it. In other words, report loudly if it is an error, but otherwise stay silent if we know we tolerate it well. > +static int packed_fsck_ref_content(struct fsck_options *o, > + const char *start, const char *eof) > +{ > + struct strbuf packed_entry = STRBUF_INIT; > + int line_number = 1; We limit ourselves with about 1 billion refs in the packed-refs file, which may be plenty, but I do not quite understand the use of this variable. There is no loop inside this so ... > + const char *eol; > + int ret = 0; > + > + strbuf_addf(&packed_entry, "packed-refs line %d", line_number); ... this is always line #1, and then > + ret |= packed_fsck_ref_next_line(o, &packed_entry, start, eof, &eol); > + if (*start == '#') { > + ret |= packed_fsck_ref_header(o, start, eol); > + > + start = eol + 1; > + line_number++; ... it may be incremented, but upon returning from the funcition, it is lost. Perhaps you wanted to make it a function-scope static, but then you are allowed to read one single packed-refs file during the life of your process before you exit, which I am not sure is what you want? > + } else { > + struct fsck_ref_report report = { 0 }; > + report.path = "packed-refs"; > + > + ret |= fsck_report_ref(o, &report, > + FSCK_MSG_PACKED_REF_MISSING_HEADER, > + "missing header line"); > + } > + > + strbuf_release(&packed_entry); > + return ret; > +} I'll stop here for now. Thanks.
On Thu, Jan 30, 2025 at 10:58:32AM -0800, Junio C Hamano wrote: > shejialuo <shejialuo@gmail.com> writes: > > > In "packed-backend.c::create_snapshot", if there is a header (the line > > which starts with '#'), we will check whether the line starts with "# > > pack-refs with:". As we are going to implement the header consistency > > check, we should port this check into "packed_fsck". > > > > However, the above check is not enough, this is because "git pack-refs" > > will always write "PACKED_REFS_HEADER" which is a constant string to the > > "packed-refs" file. So, we should check the following things for the > > header. > > I haven't done history digging in this area for a while, but we > should make sure we are not flagging a file that was written in > ancient version of Git whose repository is still supported. > Understood. > > 1. If the header does not exist, we may report an error to the user > > because it should exist, but we do allow no header in "packed-refs" > > file. So, create a new fsck message "packedRefMissingHeader(INFO)" to > > warn the user and also keep compatibility. > > Are we sure "it should exist"? I think the header did not exist > before "Git v1.5.0". I didn't check with other reimplementations of > Git (like jgit or libgit2), but as long as our reading side of the > runtime allows a packed-refs file without the header without > complaint, I do not think it is a good idea to treat it as a > report-worthy event from "git fsck". > OK, let me improve this in the next version. > > 2. If the header content does not start with "# packed-ref with:", we > > should report an error just like what "create_snapshot" does. So, > > create a new fsck message "badPackedRefHeader(ERROR)" for this. > > This I can agree with. If the first line begins with "#" but not > with that string (with a trailing SP), that is a sign that it may > not even be a valid packed-refs file, which is a report-worthy > event. > > > 3. If the header content is not the same as the constant string > > "PACKED_REFS_HEADER", ideally, we should report an error to the user. > > NO. THAT IS NOT IDEAL AT ALL. > > The header was written like this: > > /* perhaps other traits later as well */ > fprintf(cbdata.refs_file, "# pack-refs with: peeled \n"); > > in the older versions of Git before it was made into a separate > preprocessor macro and lost the comment (the above excerpt is from > "git show v1.5.0:builtin-pack-refs.c"). > > Notice "other traits later" in the comment? > > The thing is _designed_ to be extensible. In fact, these days we > support a few more traits > > static const char PACKED_REFS_HEADER[] = > "# pack-refs with: peeled fully-peeled sorted \n"; > > (an excerpt from the current refs/packed-backend.c). > > Reporting an error when you see something written by an older > version of Git is far from ideal. > Understood, I think we should be consistency with the runtime check. > > However, we allow other contents as long as the header content starts > > with "# packed-ref with:". To keep compatibility, create a new fsck > > message "unknownPackedRefHeader(INFO)" to warn about this. We may > > tighten this rule in the future. > > Whatever we do, what we do with an unknown trait should be in line > with what the runtime does. If the runtime failed (we do not, but > this is to illustrate the principle [*]) on a packed-refs file > without "sorted" trait, noticing that "sorted" is not there and > flagging as an error is a good thing to do. But if the runtime > gracefully degrades and sorts the list of refs read from such a > packed-refs file before continuing, then a packed-refs file that > lack "sorted" trait is not a report-worthy event. > Actually, the runtime won't complain about this. I agree with you here. > I do not offhand recall if we introduced the concept of mandatory vs > optional traits in the packed-refs part of the system (like we have > in the index extension subsystem, where a version of Git that > encounters an unknown *and* mandatory index extension must refuse to > touch the repository), but if there is a mandatory trait declared in > the header that our version of Git does not understand, it is a > report-worthy event that must be flagged with "git refs verify". > I don't think any trait in "packed-refs" is mandatory. Because I have done some experiments before implementing the code. We should only check case 2 here. > > +static int packed_fsck_ref_header(struct fsck_options *o, const char *start, const char *eol) > > +{ > > + const char *err_fmt = NULL; > > + int fsck_msg_id = -1; > > + > > + if (!starts_with(start, "# pack-refs with:")) { > > + err_fmt = "'%.*s' does not start with '# pack-refs with:'"; > > + fsck_msg_id = FSCK_MSG_BAD_PACKED_REF_HEADER; > > + } else if (strncmp(start, PACKED_REFS_HEADER, strlen(PACKED_REFS_HEADER))) { > > + err_fmt = "'%.*s' is an unknown packed-refs header"; > > + fsck_msg_id = FSCK_MSG_UNKNOWN_PACKED_REF_HEADER; > > + } > > As I outlined above, this is totally unacceptable. > > Inspecting the header is good, but if this code claims to be a > checker, it should do at least what the runtime does, i.e. parse the > header to tell what traits the packed-file declares, not just > assuming that it is a fixed string. And error on unknown trait(s) > if they are mandatory (if such a concept is implemented in the > runtime reading side). Informing on an unknown and optional > trait(s) I can live with, but personally I wouldn't recommend it. > Got it, I don't want to report unknown trait(s) either. > In other words, report loudly if it is an error, but otherwise stay > silent if we know we tolerate it well. > Thanks for this suggestion. > > +static int packed_fsck_ref_content(struct fsck_options *o, > > + const char *start, const char *eof) > > +{ > > + struct strbuf packed_entry = STRBUF_INIT; > > + int line_number = 1; > > We limit ourselves with about 1 billion refs in the packed-refs > file, which may be plenty, Let me change this to `size_t`. This would be better. > but I do not quite understand the use of > this variable. There is no loop inside this so ... > The reason why I define this variable is that I am going to use loop to check each entry in the next patch. > > + const char *eol; > > + int ret = 0; > > + > > + strbuf_addf(&packed_entry, "packed-refs line %d", line_number); > > ... this is always line #1, and then > > > + ret |= packed_fsck_ref_next_line(o, &packed_entry, start, eof, &eol); > > + if (*start == '#') { > > + ret |= packed_fsck_ref_header(o, start, eol); > > + > > + start = eol + 1; > > + line_number++; > > ... it may be incremented, but upon returning from the funcition, it > is lost. > > Perhaps you wanted to make it a function-scope static, but then you > are allowed to read one single packed-refs file during the life of > your process before you exit, which I am not sure is what you want? > Actually, what I want is use this variable for looping the each ref entry in the "packed-refs" file. > > + } else { > > + struct fsck_ref_report report = { 0 }; > > + report.path = "packed-refs"; > > + > > + ret |= fsck_report_ref(o, &report, > > + FSCK_MSG_PACKED_REF_MISSING_HEADER, > > + "missing header line"); > > + } > > + > > + strbuf_release(&packed_entry); > > + return ret; > > +} Thanks, Jialuo
diff --git a/Documentation/fsck-msgids.txt b/Documentation/fsck-msgids.txt index b14bc44ca4..34375a3143 100644 --- a/Documentation/fsck-msgids.txt +++ b/Documentation/fsck-msgids.txt @@ -16,6 +16,10 @@ `badObjectSha1`:: (ERROR) An object has a bad sha1. +`badPackedRefHeader`:: + (ERROR) The "packed-refs" file contains an invalid + header. + `badParentSha1`:: (ERROR) A commit object has a bad parent sha1. @@ -176,6 +180,13 @@ `nullSha1`:: (WARN) Tree contains entries pointing to a null sha1. +`packedRefEntryNotTerminated`:: + (ERROR) The "packed-refs" file contains an entry that is + not terminated by a newline. + +`packedRefMissingHeader`:: + (INFO) The "packed-refs" file does not contain the header. + `refMissingNewline`:: (INFO) A loose ref that does not end with newline(LF). As valid implementations of Git never created such a loose ref @@ -208,6 +219,11 @@ `treeNotSorted`:: (ERROR) A tree is not properly sorted. +`unknownPackedRefHeader`:: + (INFO) The "packed-refs" header starts with "# pack-refs with:" + but the remaining content is not the same as what `git pack-refs` + would write. + `unknownType`:: (ERROR) Found an unknown object type. diff --git a/fsck.h b/fsck.h index a44c231a5f..3107a0093d 100644 --- a/fsck.h +++ b/fsck.h @@ -30,6 +30,7 @@ enum fsck_msg_type { FUNC(BAD_EMAIL, ERROR) \ FUNC(BAD_NAME, ERROR) \ FUNC(BAD_OBJECT_SHA1, ERROR) \ + FUNC(BAD_PACKED_REF_HEADER, ERROR) \ FUNC(BAD_PARENT_SHA1, ERROR) \ FUNC(BAD_REF_CONTENT, ERROR) \ FUNC(BAD_REF_FILETYPE, ERROR) \ @@ -53,6 +54,7 @@ enum fsck_msg_type { FUNC(MISSING_TYPE, ERROR) \ FUNC(MISSING_TYPE_ENTRY, ERROR) \ FUNC(MULTIPLE_AUTHORS, ERROR) \ + FUNC(PACKED_REF_ENTRY_NOT_TERMINATED, ERROR) \ FUNC(TREE_NOT_SORTED, ERROR) \ FUNC(UNKNOWN_TYPE, ERROR) \ FUNC(ZERO_PADDED_DATE, ERROR) \ @@ -90,6 +92,8 @@ enum fsck_msg_type { FUNC(REF_MISSING_NEWLINE, INFO) \ FUNC(SYMREF_TARGET_IS_NOT_A_REF, INFO) \ FUNC(TRAILING_REF_CONTENT, INFO) \ + FUNC(UNKNOWN_PACKED_REF_HEADER, INFO) \ + FUNC(PACKED_REF_MISSING_HEADER, INFO) \ /* ignored (elevated when requested) */ \ FUNC(EXTRA_HEADER_ENTRY, IGNORE) diff --git a/refs/packed-backend.c b/refs/packed-backend.c index 6401cecd5f..883189f3a1 100644 --- a/refs/packed-backend.c +++ b/refs/packed-backend.c @@ -1749,12 +1749,92 @@ static struct ref_iterator *packed_reflog_iterator_begin(struct ref_store *ref_s return empty_ref_iterator_begin(); } +static int packed_fsck_ref_next_line(struct fsck_options *o, + struct strbuf *packed_entry, const char *start, + const char *eof, const char **eol) +{ + int ret = 0; + + *eol = memchr(start, '\n', eof - start); + if (!*eol) { + struct fsck_ref_report report = { 0 }; + + report.path = packed_entry->buf; + ret = fsck_report_ref(o, &report, + FSCK_MSG_PACKED_REF_ENTRY_NOT_TERMINATED, + "'%.*s' is not terminated with a newline", + (int)(eof - start), start); + + /* + * There is no newline but we still want to parse it to the end of + * the buffer. + */ + *eol = eof; + } + + return ret; +} + +static int packed_fsck_ref_header(struct fsck_options *o, const char *start, const char *eol) +{ + const char *err_fmt = NULL; + int fsck_msg_id = -1; + + if (!starts_with(start, "# pack-refs with:")) { + err_fmt = "'%.*s' does not start with '# pack-refs with:'"; + fsck_msg_id = FSCK_MSG_BAD_PACKED_REF_HEADER; + } else if (strncmp(start, PACKED_REFS_HEADER, strlen(PACKED_REFS_HEADER))) { + err_fmt = "'%.*s' is an unknown packed-refs header"; + fsck_msg_id = FSCK_MSG_UNKNOWN_PACKED_REF_HEADER; + } + + if (err_fmt && fsck_msg_id >= 0) { + struct fsck_ref_report report = { 0 }; + report.path = "packed-refs.header"; + + return fsck_report_ref(o, &report, fsck_msg_id, err_fmt, + (int)(eol - start), start); + + } + + return 0; +} + +static int packed_fsck_ref_content(struct fsck_options *o, + const char *start, const char *eof) +{ + struct strbuf packed_entry = STRBUF_INIT; + int line_number = 1; + const char *eol; + int ret = 0; + + strbuf_addf(&packed_entry, "packed-refs line %d", line_number); + ret |= packed_fsck_ref_next_line(o, &packed_entry, start, eof, &eol); + if (*start == '#') { + ret |= packed_fsck_ref_header(o, start, eol); + + start = eol + 1; + line_number++; + } else { + struct fsck_ref_report report = { 0 }; + report.path = "packed-refs"; + + ret |= fsck_report_ref(o, &report, + FSCK_MSG_PACKED_REF_MISSING_HEADER, + "missing header line"); + } + + strbuf_release(&packed_entry); + return ret; +} + static int packed_fsck(struct ref_store *ref_store, struct fsck_options *o, struct worktree *wt) { struct packed_ref_store *refs = packed_downcast(ref_store, REF_STORE_READ, "fsck"); + struct strbuf packed_ref_content = STRBUF_INIT; int ret = 0; int fd; @@ -1786,7 +1866,16 @@ static int packed_fsck(struct ref_store *ref_store, goto cleanup; } + if (strbuf_read(&packed_ref_content, fd, 0) < 0) { + ret = error_errno(_("unable to read %s"), refs->path); + goto cleanup; + } + + ret = packed_fsck_ref_content(o, packed_ref_content.buf, + packed_ref_content.buf + packed_ref_content.len); + cleanup: + strbuf_release(&packed_ref_content); return ret; } diff --git a/t/t0602-reffiles-fsck.sh b/t/t0602-reffiles-fsck.sh index 42c8d4ca1e..a7b46b6cb9 100755 --- a/t/t0602-reffiles-fsck.sh +++ b/t/t0602-reffiles-fsck.sh @@ -639,4 +639,50 @@ test_expect_success SYMLINKS 'the filetype of packed-refs should be checked' ' ) ' +test_expect_success 'packed-refs header should be checked' ' + test_when_finished "rm -rf repo" && + git init repo && + ( + cd repo && + test_commit default && + + git refs verify 2>err && + test_must_be_empty err && + + printf "$(git rev-parse main) refs/heads/main\n" >.git/packed-refs && + git refs verify 2>err && + cat >expect <<-EOF && + warning: packed-refs: packedRefMissingHeader: missing header line + EOF + rm .git/packed-refs && + test_cmp expect err && + + for bad_header in "# pack-refs wit: peeled fully-peeled sorted " \ + "# pack-refs with traits: peeled fully-peeled sorted " \ + "# pack-refs with a: peeled fully-peeled" + do + printf "%s\n" "$bad_header" >.git/packed-refs && + test_must_fail git refs verify 2>err && + cat >expect <<-EOF && + error: packed-refs.header: badPackedRefHeader: '\''$bad_header'\'' does not start with '\''# pack-refs with:'\'' + EOF + rm .git/packed-refs && + test_cmp expect err || return 1 + done && + + for unknown_header in "# pack-refs with: peeled fully-peeled sorted garbage" \ + "# pack-refs with: peeled" \ + "# pack-refs with: peeled peeled-fully sort" + do + printf "%s\n" "$unknown_header" >.git/packed-refs && + git refs verify 2>err && + cat >expect <<-EOF && + warning: packed-refs.header: unknownPackedRefHeader: '\''$unknown_header'\'' is an unknown packed-refs header + EOF + rm .git/packed-refs && + test_cmp expect err || return 1 + done + ) +' + test_done
In "packed-backend.c::create_snapshot", if there is a header (the line which starts with '#'), we will check whether the line starts with "# pack-refs with:". As we are going to implement the header consistency check, we should port this check into "packed_fsck". However, the above check is not enough, this is because "git pack-refs" will always write "PACKED_REFS_HEADER" which is a constant string to the "packed-refs" file. So, we should check the following things for the header. 1. If the header does not exist, we may report an error to the user because it should exist, but we do allow no header in "packed-refs" file. So, create a new fsck message "packedRefMissingHeader(INFO)" to warn the user and also keep compatibility. 2. If the header content does not start with "# packed-ref with:", we should report an error just like what "create_snapshot" does. So, create a new fsck message "badPackedRefHeader(ERROR)" for this. 3. If the header content is not the same as the constant string "PACKED_REFS_HEADER", ideally, we should report an error to the user. However, we allow other contents as long as the header content starts with "# packed-ref with:". To keep compatibility, create a new fsck message "unknownPackedRefHeader(INFO)" to warn about this. We may tighten this rule in the future. In order to achieve above checks, read the "packed-refs" file via "strbuf_read". Like what "create_snapshot" and other functions do, we could split the line by finding the next newline in the buffer. When we cannot find a newline, we could report an error. So, create a function "packed_fsck_ref_next_line" to find the next newline and if there is no such newline, use "packedRefEntryNotTerminated(ERROR)" to report an error to the user. Then, parse the first line to apply the above three checks. Update the test to excise the code. Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> --- Documentation/fsck-msgids.txt | 16 +++++++ fsck.h | 4 ++ refs/packed-backend.c | 89 +++++++++++++++++++++++++++++++++++ t/t0602-reffiles-fsck.sh | 46 ++++++++++++++++++ 4 files changed, 155 insertions(+)