[v3] fast-import: add new --date-format=raw-permissive format
diff mbox series

Message ID pull.795.v3.git.git.1590870357549.gitgitgadget@gmail.com
State New
Headers show
Series
  • [v3] fast-import: add new --date-format=raw-permissive format
Related show

Commit Message

Junio C Hamano via GitGitGadget May 30, 2020, 8:25 p.m. UTC
From: Elijah Newren <newren@gmail.com>

There are multiple repositories in the wild with random, invalid
timezones.  Most notably is a commit from rails.git with a timezone of
"+051800"[1].  A few searches will find other repos with that same
invalid timezone as well.  Further, Peff reports that GitHub relaxed
their fsck checks in August 2011 to accept any timezone value[2], and
there have been multiple reports to filter-repo about fast-import
crashing while trying to import their existing repositories since they
had timezone values such as "-7349423" and "-43455309"[3].

The existing check on timezone values inside fast-import may prove
useful for people who are crafting fast-import input by hand or with a
new script.  For them, the check may help them avoid accidentally
recording invalid dates.  (Note that this check is rather simplistic and
there are still several forms of invalid dates that fast-import does not
check for: dates in the future, timezone values with minutes that are
not divisible by 15, and timezone values with minutes that are 60 or
greater.)  While this simple check may have some value for those users,
other users or tools will want to import existing repositories as-is.
Provide a --date-format=raw-permissive format that will not error out on
these otherwise invalid timezones so that such existing repositories can
be imported.

[1] https://github.com/rails/rails/commit/4cf94979c9f4d6683c9338d694d5eb3106a4e734
[2] https://lore.kernel.org/git/20200521195513.GA1542632@coredump.intra.peff.net/
[3] https://github.com/newren/git-filter-repo/issues/88

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    fast-import: accept invalid timezones so we can import existing repos
    
    Changes since v2:
    
     * Add documentation
     * Note the fact that the "strict" method really isn't all that strict
       with some NEEDSWORK comments
     * Check for parsed as unsigned before checking that value range makes
       sense
     * Simplify the testcase as suggested by Peff, leaving it to stick out a
       bit like a sore thumb from the rest of the tests in the same file
       (#leftoverbits)

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-795%2Fnewren%2Floosen-fast-import-timezone-parsing-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-795/newren/loosen-fast-import-timezone-parsing-v3
Pull-Request: https://github.com/git/git/pull/795

Range-diff vs v2:

 1:  9580aacdb21 ! 1:  48326d16dbd fast-import: add new --date-format=raw-permissive format
     @@ Commit message
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     + ## Documentation/git-fast-import.txt ##
     +@@ Documentation/git-fast-import.txt: by users who are located in the same location and time zone.  In this
     + case a reasonable offset from UTC could be assumed.
     + +
     + Unlike the `rfc2822` format, this format is very strict.  Any
     +-variation in formatting will cause fast-import to reject the value.
     ++variation in formatting will cause fast-import to reject the value,
     ++and some sanity checks on the numeric values may also be performed.
     ++
     ++`raw-permissive`::
     ++	This is the same as `raw` except that no sanity checks on
     ++	the numeric epoch and local offset are performed.  This can
     ++	be useful when trying to filter or import an existing history
     ++	with e.g. bogus timezone values.
     + 
     + `rfc2822`::
     + 	This is the standard email format as described by RFC 2822.
     +
       ## fast-import.c ##
      @@ fast-import.c: struct hash_list {
       
     @@ fast-import.c: static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax
       {
       	const char *orig_src = src;
       	char *endp;
     - 	unsigned long num;
     -+	int out_of_range_timezone;
     - 
     +@@ fast-import.c: static int validate_raw_date(const char *src, struct strbuf *result)
       	errno = 0;
       
     + 	num = strtoul(src, &endp, 10);
     +-	/* NEEDSWORK: perhaps check for reasonable values? */
     ++	/*
     ++	 * NEEDSWORK: perhaps check for reasonable values? For example, we
     ++	 *            could error on values representing times more than a
     ++	 *            day in the future.
     ++	 */
     + 	if (errno || endp == src || *endp != ' ')
     + 		return -1;
     + 
      @@ fast-import.c: static int validate_raw_date(const char *src, struct strbuf *result)
       		return -1;
       
       	num = strtoul(src + 1, &endp, 10);
      -	if (errno || endp == src + 1 || *endp || 1400 < num)
     -+	out_of_range_timezone = strict && (1400 < num);
     -+	if (errno || endp == src + 1 || *endp || out_of_range_timezone)
     ++	/*
     ++	 * NEEDSWORK: check for brokenness other than num > 1400, such as
     ++	 *            (num % 100) >= 60, or ((num % 100) % 15) != 0 ?
     ++	 */
     ++	if (errno || endp == src + 1 || *endp || /* did not parse */
     ++	    (strict && (1400 < num))             /* parsed a broken timezone */
     ++	   )
       		return -1;
       
       	strbuf_addstr(result, orig_src);
     @@ t/t9300-fast-import.sh: test_expect_success 'B: accept empty committer' '
      +	COMMIT
      +	INPUT_END
      +
     -+	test_when_finished "git update-ref -d refs/heads/invalid-timezone
     -+		git gc
     -+		git prune" &&
     -+	git fast-import --date-format=raw-permissive <input &&
     -+	git cat-file -p invalid-timezone >out &&
     ++	git init invalid-timezone &&
     ++	git -C invalid-timezone fast-import --date-format=raw-permissive <input &&
     ++	git -C invalid-timezone cat-file -p invalid-timezone >out &&
      +	grep "1234567890 [+]051800" out
      +'
      +


 Documentation/git-fast-import.txt |  9 ++++++++-
 fast-import.c                     | 25 +++++++++++++++++++++----
 t/t9300-fast-import.sh            | 28 ++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 5 deletions(-)


base-commit: 2d5e9f31ac46017895ce6a183467037d29ceb9d3

Comments

Jeff King May 30, 2020, 11:13 p.m. UTC | #1
On Sat, May 30, 2020 at 08:25:57PM +0000, Elijah Newren via GitGitGadget wrote:

>     fast-import: accept invalid timezones so we can import existing repos
>     
>     Changes since v2:

Thanks, this version looks good to me.

-Peff

Patch
diff mbox series

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index 77c6b3d0019..7d9aad2a7e1 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -293,7 +293,14 @@  by users who are located in the same location and time zone.  In this
 case a reasonable offset from UTC could be assumed.
 +
 Unlike the `rfc2822` format, this format is very strict.  Any
-variation in formatting will cause fast-import to reject the value.
+variation in formatting will cause fast-import to reject the value,
+and some sanity checks on the numeric values may also be performed.
+
+`raw-permissive`::
+	This is the same as `raw` except that no sanity checks on
+	the numeric epoch and local offset are performed.  This can
+	be useful when trying to filter or import an existing history
+	with e.g. bogus timezone values.
 
 `rfc2822`::
 	This is the standard email format as described by RFC 2822.
diff --git a/fast-import.c b/fast-import.c
index c98970274c4..0dfa14dc8c3 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -139,6 +139,7 @@  struct hash_list {
 
 typedef enum {
 	WHENSPEC_RAW = 1,
+	WHENSPEC_RAW_PERMISSIVE,
 	WHENSPEC_RFC2822,
 	WHENSPEC_NOW
 } whenspec_type;
@@ -1911,7 +1912,7 @@  static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 	return 1;
 }
 
-static int validate_raw_date(const char *src, struct strbuf *result)
+static int validate_raw_date(const char *src, struct strbuf *result, int strict)
 {
 	const char *orig_src = src;
 	char *endp;
@@ -1920,7 +1921,11 @@  static int validate_raw_date(const char *src, struct strbuf *result)
 	errno = 0;
 
 	num = strtoul(src, &endp, 10);
-	/* NEEDSWORK: perhaps check for reasonable values? */
+	/*
+	 * NEEDSWORK: perhaps check for reasonable values? For example, we
+	 *            could error on values representing times more than a
+	 *            day in the future.
+	 */
 	if (errno || endp == src || *endp != ' ')
 		return -1;
 
@@ -1929,7 +1934,13 @@  static int validate_raw_date(const char *src, struct strbuf *result)
 		return -1;
 
 	num = strtoul(src + 1, &endp, 10);
-	if (errno || endp == src + 1 || *endp || 1400 < num)
+	/*
+	 * NEEDSWORK: check for brokenness other than num > 1400, such as
+	 *            (num % 100) >= 60, or ((num % 100) % 15) != 0 ?
+	 */
+	if (errno || endp == src + 1 || *endp || /* did not parse */
+	    (strict && (1400 < num))             /* parsed a broken timezone */
+	   )
 		return -1;
 
 	strbuf_addstr(result, orig_src);
@@ -1963,7 +1974,11 @@  static char *parse_ident(const char *buf)
 
 	switch (whenspec) {
 	case WHENSPEC_RAW:
-		if (validate_raw_date(ltgt, &ident) < 0)
+		if (validate_raw_date(ltgt, &ident, 1) < 0)
+			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
+		break;
+	case WHENSPEC_RAW_PERMISSIVE:
+		if (validate_raw_date(ltgt, &ident, 0) < 0)
 			die("Invalid raw date \"%s\" in ident: %s", ltgt, buf);
 		break;
 	case WHENSPEC_RFC2822:
@@ -3258,6 +3273,8 @@  static void option_date_format(const char *fmt)
 {
 	if (!strcmp(fmt, "raw"))
 		whenspec = WHENSPEC_RAW;
+	else if (!strcmp(fmt, "raw-permissive"))
+		whenspec = WHENSPEC_RAW_PERMISSIVE;
 	else if (!strcmp(fmt, "rfc2822"))
 		whenspec = WHENSPEC_RFC2822;
 	else if (!strcmp(fmt, "now"))
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 768257b29e0..e151df81c06 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -410,6 +410,34 @@  test_expect_success 'B: accept empty committer' '
 	test -z "$out"
 '
 
+test_expect_success 'B: reject invalid timezone' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	test_when_finished "git update-ref -d refs/heads/invalid-timezone" &&
+	test_must_fail git fast-import <input
+'
+
+test_expect_success 'B: accept invalid timezone with raw-permissive' '
+	cat >input <<-INPUT_END &&
+	commit refs/heads/invalid-timezone
+	committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> 1234567890 +051800
+	data <<COMMIT
+	empty commit
+	COMMIT
+	INPUT_END
+
+	git init invalid-timezone &&
+	git -C invalid-timezone fast-import --date-format=raw-permissive <input &&
+	git -C invalid-timezone cat-file -p invalid-timezone >out &&
+	grep "1234567890 [+]051800" out
+'
+
 test_expect_success 'B: accept and fixup committer with no name' '
 	cat >input <<-INPUT_END &&
 	commit refs/heads/empty-committer-2