Message ID | pull.795.v3.git.git.1590870357549.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] fast-import: add new --date-format=raw-permissive format | expand |
On Sat, May 30, 2020 at 08:25:57PM +0000, Elijah Newren via GitGitGadget wrote: > fast-import: accept invalid timezones so we can import existing repos > > Changes since v2: Thanks, this version looks good to me. -Peff
[Originally sent 5 days ago, but seems to have been a victim of the vger.kernel.org problems at the time, re-sending] On Sat, May 30 2020, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> Full snipped E-Mail in the archive: https://lore.kernel.org/git/pull.795.v3.git.git.1590870357549.gitgitgadget@gmail.com/ > There are multiple repositories in the wild with random, invalid > timezones. Most notably is a commit from rails.git with a timezone of > "+051800"[1]. A few searches will find other repos with that same > invalid timezone as well. Further, Peff reports that GitHub relaxed > their fsck checks in August 2011 to accept any timezone value[2], and > there have been multiple reports to filter-repo about fast-import > crashing while trying to import their existing repositories since they > had timezone values such as "-7349423" and "-43455309"[3]. I've been looking at some of our duplicate logic here after my mktag series where we now use fsck validation. It had a hardcoded "1400" offset value, which I see fast-import.c still has. Then in mailmap.c we have parse_name_and_email(), then there's split_ident_line() in ident.c, and of course fsck_ident(). record_person_from_buf() in fmt-merge-msg.c, copy_name() and copy_email() in ref-filter.c. Maybe handle_from() in mailinfo.c also counts. Anyway, aside from the last these are all parsers for "author/committer" lines in commits one way or another. But I was wondering about fast-import.c in particular. I think Elijah's patch here is obviously good an incremental improvement. But stepping back a bit: who cares about sort-of-fsck validation in fast-import.c anyway? Shouldn't it just pretty much be importing data as-is, and then we could document "if you don't trust it, run fsck afterwards"? Or, if it's a use-case people actually care about, then I might see about unifying some of these parser functions as part of a series I'm preparing.
"=?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?=" Bjarmason <avarab@gmail.com> writes: > But I was wondering about fast-import.c in particular. I think Elijah's > patch here is obviously good an incremental improvement. But stepping > back a bit: who cares about sort-of-fsck validation in fast-import.c > anyway? Those who want to notice and verify the procedure they used to produce the import data from the original before it is too late? I.e. data gets imported to Git, victory declared and then old SCM turned gets off---and only then the resulting imported repository is found not to pass fsck. > Shouldn't it just pretty much be importing data as-is, and then we could > document "if you don't trust it, run fsck afterwards"? If it is a small import, the distinction does not matter, but for a huge import, the procedure to produce the data is likely to be mechanical, so even after processing just a very small portion of early part of the datastream, systematic errors would be noticed before fast-import wastes importing too much garbage that need to be discarded after running such fsck. So in that sense, I suspect that there is value in the early validation. > Or, if it's a use-case people actually care about, then I might see > about unifying some of these parser functions as part of a series I'm > preparing. I think allowing people to loosen particular checks for fast-import (or elsewhere for that matter) is a good idea, and you can do so more easily once the existing checking is migrated to your new scheme that shares code with the fsck machinery.
On Wed, Feb 03 2021, Junio C Hamano wrote: > "=?utf-8?B?w4Z2YXIgQXJuZmrDtnLDsA==?=" Bjarmason <avarab@gmail.com> > writes: > >> But I was wondering about fast-import.c in particular. I think Elijah's >> patch here is obviously good an incremental improvement. But stepping >> back a bit: who cares about sort-of-fsck validation in fast-import.c >> anyway? > > Those who want to notice and verify the procedure they used to > produce the import data from the original before it is too late? > > I.e. data gets imported to Git, victory declared and then old SCM > turned gets off---and only then the resulting imported repository is > found not to pass fsck. > >> Shouldn't it just pretty much be importing data as-is, and then we could >> document "if you don't trust it, run fsck afterwards"? > > If it is a small import, the distinction does not matter, but for a > huge import, the procedure to produce the data is likely to be > mechanical, so even after processing just a very small portion of > early part of the datastream, systematic errors would be noticed > before fast-import wastes importing too much garbage that need to be > discarded after running such fsck. So in that sense, I suspect that > there is value in the early validation. What I was fishing for here is that perhaps since fast-import was originally written this use-case of in-place conversion of primary data on a server might have become too obscure to care about, i.e. as opposed to doing a conversion locally and then "git push"-ing it to something that does transfer.fsckObjects. >> Or, if it's a use-case people actually care about, then I might see >> about unifying some of these parser functions as part of a series I'm >> preparing. > > I think allowing people to loosen particular checks for fast-import > (or elsewhere for that matter) is a good idea, and you can do so > more easily once the existing checking is migrated to your new > scheme that shares code with the fsck machinery. ...allright, depending on how much of a hassle that is I might just add tests for the differences and leave this particular problem to someone else :)
diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt index 77c6b3d0019..7d9aad2a7e1 100644 --- a/Documentation/git-fast-import.txt +++ b/Documentation/git-fast-import.txt @@ -293,7 +293,14 @@ by users who are located in the same location and time zone. In this case a reasonable offset from UTC could be assumed. + Unlike the `rfc2822` format, this format is very strict. Any -variation in formatting will cause fast-import to reject the value. +variation in formatting will cause fast-import to reject the value, +and some sanity checks on the numeric values may also be performed. + +`raw-permissive`:: + This is the same as `raw` except that no sanity checks on + the numeric epoch and local offset are performed. This can + be useful when trying to filter or import an existing history + with e.g. bogus timezone values. `rfc2822`:: This is the standard email format as described by RFC 2822. diff --git a/fast-import.c b/fast-import.c index c98970274c4..0dfa14dc8c3 100644 --- a/fast-import.c +++ b/fast-import.c @@ -139,6 +139,7 @@ struct hash_list { typedef enum { WHENSPEC_RAW = 1, + WHENSPEC_RAW_PERMISSIVE, WHENSPEC_RFC2822, WHENSPEC_NOW } whenspec_type; @@ -1911,7 +1912,7 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res) return 1; } -static int validate_raw_date(const char *src, struct strbuf *result) +static int validate_raw_date(const char *src, struct strbuf *result, int strict) { const char *orig_src = src; char *endp; @@ -1920,7 +1921,11 @@ static int validate_raw_date(const char *src, struct strbuf *result) errno = 0; num = strtoul(src, &endp, 10); - /* NEEDSWORK: perhaps check for reasonable values? */ + /* + * NEEDSWORK: perhaps check for reasonable values? For example, we + * could error on values representing times more than a + * day in the future. + */ if (errno || endp == src || *endp != ' ') return -1; @@ -1929,7 +1934,13 @@ static int validate_raw_date(const char *src, struct strbuf *result) return -1; num = strtoul(src + 1, &endp, 10); - if (errno || endp == src + 1 || *endp || 1400 < num) + /* + * NEEDSWORK: check for brokenness other than num > 1400, such as + * (num % 100) >= 60, or ((num % 100) % 15) != 0 ? + */ + if (errno || endp == src + 1 || *endp || /* did not parse */ + (strict && (1400 < num)) /* parsed a broken timezone */ + ) return -1; strbuf_addstr(result, orig_src); @@ -1963,7 +1974,11 @@ static char *parse_ident(const char *buf) switch (whenspec) { case WHENSPEC_RAW: - if (validate_raw_date(ltgt, &ident) < 0) + if (validate_raw_date(ltgt, &ident, 1) < 0) + die("Invalid raw date \"%s\" in ident: %s", ltgt, buf); + break; + case WHENSPEC_RAW_PERMISSIVE: + if (validate_raw_date(ltgt, &ident, 0) < 0) die("Invalid raw date \"%s\" in ident: %s", ltgt, buf); break; case WHENSPEC_RFC2822: @@ -3258,6 +3273,8 @@ static void option_date_format(const char *fmt) { if (!strcmp(fmt, "raw")) whenspec = WHENSPEC_RAW; + else if (!strcmp(fmt, "raw-permissive")) + whenspec = WHENSPEC_RAW_PERMISSIVE; else if (!strcmp(fmt, "rfc2822")) whenspec = WHENSPEC_RFC2822; else if (!strcmp(fmt, "now")) diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh index 768257b29e0..e151df81c06 100755 --- a/t/t9300-fast-import.sh +++ b/t/t9300-fast-import.sh @@ -410,6 +410,34 @@ test_expect_success 'B: accept empty committer' ' test -z "$out" ' +test_expect_success 'B: reject invalid timezone' ' + cat >input <<-INPUT_END && + commit refs/heads/invalid-timezone + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800 + data <<COMMIT + empty commit + COMMIT + INPUT_END + + test_when_finished "git update-ref -d refs/heads/invalid-timezone" && + test_must_fail git fast-import <input +' + +test_expect_success 'B: accept invalid timezone with raw-permissive' ' + cat >input <<-INPUT_END && + commit refs/heads/invalid-timezone + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800 + data <<COMMIT + empty commit + COMMIT + INPUT_END + + git init invalid-timezone && + git -C invalid-timezone fast-import --date-format=raw-permissive <input && + git -C invalid-timezone cat-file -p invalid-timezone >out && + grep "1234567890 [+]051800" out +' + test_expect_success 'B: accept and fixup committer with no name' ' cat >input <<-INPUT_END && commit refs/heads/empty-committer-2