fast-import: accept invalid timezones so we can import existing repos
diff mbox series

Message ID pull.795.git.git.1590693313099.gitgitgadget@gmail.com
State New
Headers show
Series
  • fast-import: accept invalid timezones so we can import existing repos
Related show

Commit Message

Derrick Stolee via GitGitGadget May 28, 2020, 7:15 p.m. UTC
From: Elijah Newren <newren@gmail.com>

There are multiple repositories in the wild with random, invalid
timezones.  Most notably is a commit from rails.git with a timezone of
"+051800"[1].  However, a few searches found other repos with that same
invalid timezone.  Further, Peff reports that GitHub relaxed their fsck
checks in August 2011 to accept any timezone value[2], and there have
been multiple reports to filter-repo about fast-import crashing while
trying to import their existing repositories since they had timezone
values such as "-7349423" and "-43455309"[3].

It is not clear what created these invalid timezones, but since git has
permitted their use and worked with these repositories for years at this
point, it seems pointless to make fast-import be the only thing that
disallows them.  Relax the parsing to allow these timezones when using
raw import format; when using --date-format=rfc2822 (which is not the
default), we can continue being more strict about timezones.

[1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
[2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
[3] https://github.com/newren/git-filter-repo/issues/88

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fast-import: accept invalid timezones so we can import existing repos
    
    Note: this is not a fix for a regression, so you can ignore it for 2.27;
    it can sit in pu.
    
    Peff leaned towards normalizing these timezones in fast-export[1], but
    (A) it's not clear what "valid" timezone we should randomly pick and
    regardless of what we pick, it seems it'll be wrong for most cases, (B)
    it would provide yet another way that "git fast-export --all | git
    fast-import" would not preserve the original history, as users sometimes
    expect[2], and (C) that'd prevent users from passing a special callback
    to filter-repo to fix up these values[3].
    
    Since I'm not a fan of picking a random value to reassign these to (in
    either fast-export or fast-import), I went the route of relaxing the
    requirements in fast-import, similar to what Peff reports GitHub did
    about 9 years ago in their incoming fsck checks.
    
    [1] 
    https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
    [2] 
    https://lore.kernel.org/git/CABPp-BFLJ48BZ97Y9mr4i3q7HMqjq18cXMgSYdxqD1cMzH8Spg@mail.gmail.com/
    [3] Example: 
    https://github.com/newren/git-filter-repo/issues/88#issuecomment-629706776

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-795%2Fnewren%2Floosen-fast-import-timezone-parsing-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-795/newren/loosen-fast-import-timezone-parsing-v1
Pull-Request: https://github.com/git/git/pull/795

 fast-import.c          |  7 +++----
 t/t9300-fast-import.sh | 17 +++++++++++++++++
 2 files changed, 20 insertions(+), 4 deletions(-)


base-commit: 2d5e9f31ac46017895ce6a183467037d29ceb9d3

Comments

Jonathan Nieder May 28, 2020, 7:26 p.m. UTC | #1
Hi,

Elijah Newren wrote:

>                  Relax the parsing to allow these timezones when using
> raw import format; when using --date-format=rfc2822 (which is not the
> default), we can continue being more strict about timezones.

There are two different use cases here that we may want to distinguish.

The original motivation for fast-import was for importing a repository
from some non-Git storage system (another VCS, a collection of patches
and tarballs, or whatever).  Such an importer might use
--date-format=raw just because that's simple, to avoid overhead.  In
that case, the strict parsing seems useful for catching bugs in the
importer.

"git filter-repo" is for taking an existing repository and modifying
it.  In this case, it would be *possible* to take an invalid timezone
and normalize it to "whatever 'git log' would show", but that's
overreaching relative to the caller's intent: the caller has a specific
set of history modifications they meant to make, and fixing the time
zone wasn't necessarily one of them.

To that end, would it make sense for this to be a new date-format
(e.g., --date-format=raw-permissive) to avoid regressing behavior in
the original case?

Thanks and hope that helps,
Jonathan

Patch
diff mbox series

diff --git a/fast-import.c b/fast-import.c
index c98970274c4..4a3c193007d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1915,11 +1915,10 @@  static int validate_raw_date(const char *src, struct strbuf *result)
 {
 	const char *orig_src = src;
 	char *endp;
-	unsigned long num;
 
 	errno = 0;
 
-	num = strtoul(src, &endp, 10);
+	strtoul(src, &endp, 10);
 	/* NEEDSWORK: perhaps check for reasonable values? */
 	if (errno || endp == src || *endp != ' ')
 		return -1;
@@ -1928,8 +1927,8 @@  static int validate_raw_date(const char *src, struct strbuf *result)
 	if (*src != '-' && *src != '+')
 		return -1;
 
-	num = strtoul(src + 1, &endp, 10);
-	if (errno || endp == src + 1 || *endp || 1400 < num)
+	strtoul(src + 1, &endp, 10);
+	if (errno || endp == src + 1 || *endp)
 		return -1;
 
 	strbuf_addstr(result, orig_src);
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 768257b29e0..0e798e68476 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -410,6 +410,23 @@  test_expect_success 'B: accept empty committer' '
 	test -z "$out"
 '
 
+test_expect_success 'B: accept invalid timezone' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	test_when_finished "git update-ref -d refs/heads/invalid-timezone
+		git gc
+		git prune" &&
+	git fast-import <input &&
+	git cat-file -p invalid-timezone >out &&
+	grep "1234567890 [+]051800" out
+'
+
 test_expect_success 'B: accept and fixup committer with no name' '
 	cat >input <<-INPUT_END &&
 	commit refs/heads/empty-committer-2