[1/6] fast-import: tighten parsing of paths

Message ID	20240322000304.76810-2-thalia@archibald.dev (mailing list archive)
State	Superseded
Headers	show Received: from mail-0201.mail-europe.com (mail-0201.mail-europe.com [51.77.79.158]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C400A382 for <git@vger.kernel.org>; Fri, 22 Mar 2024 00:03:45 +0000 (UTC) Date: Fri, 22 Mar 2024 00:03:18 +0000 To: git@vger.kernel.org From: Thalia Archibald <thalia@archibald.dev> Cc: Elijah Newren <newren@gmail.com>, Thalia Archibald <thalia@archibald.dev> Subject: [PATCH 1/6] fast-import: tighten parsing of paths Message-ID: <20240322000304.76810-2-thalia@archibald.dev> In-Reply-To: <20240322000304.76810-1-thalia@archibald.dev> References: <20240322000304.76810-1-thalia@archibald.dev> Feedback-ID: 63908566:user:proton Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Series	fast-import: tighten parsing of paths \| expand [0/6] fast-import: tighten parsing of paths [1/6] fast-import: tighten parsing of paths [2/6] fast-import: directly use strbufs for paths [3/6] fast-import: release unfreed strbufs [4/6] fast-import: remove dead strbuf [5/6] fast-import: document C-style escapes for paths [6/6] fast-import: forbid escaped NUL in paths

Message ID

20240322000304.76810-2-thalia@archibald.dev (mailing list archive)

State

Superseded

Headers

Date: Fri, 22 Mar 2024 00:03:18 +0000
To: git@vger.kernel.org
From: Thalia Archibald <thalia@archibald.dev>
Cc: Elijah Newren <newren@gmail.com>, Thalia Archibald <thalia@archibald.dev>
Subject: [PATCH 1/6] fast-import: tighten parsing of paths
Message-ID: <20240322000304.76810-2-thalia@archibald.dev>
In-Reply-To: <20240322000304.76810-1-thalia@archibald.dev>
References: <20240322000304.76810-1-thalia@archibald.dev>
Feedback-ID: 63908566:user:proton
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Series

fast-import: tighten parsing of paths | expand

Commit Message

Thalia Archibald March 22, 2024, 12:03 a.m. UTC

Path parsing in fast-import is inconsistent and many unquoting errors
are suppressed.

`<path>` appears in the grammar in these places:

    filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
    filedelete ::= 'D' SP <path> LF
    filecopy   ::= 'C' SP <path> SP <path> LF
    filerename ::= 'R' SP <path> SP <path> LF
    ls         ::= 'ls' SP <dataref> SP <path> LF
    ls-commit  ::= 'ls' SP <path> LF

and fast-import.c parses them in five different ways:

1. For filemodify and filedelete:
   If `<path>` is a valid quoted string, unquote it;
   otherwise, treat it as literal bytes (including any number of SP).
2. For filecopy (source) and filerename (source):
   If `<path>` is a valid quoted string, unquote it;
   otherwise, treat it as literal bytes until the next SP.
3. For filecopy (dest) and filerename (dest):
   Like 1., but an unquoted empty string is an error.
4. For ls:
   If `<path>` starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes (including any number of SP).
5. For ls-commit:
   Unquote `<path>` and report parse errors.
   (It must start with `"` to disambiguate from ls.)

In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine and could silently introduce escapes that would
be incorrectly parsed due to this.

The documentation states that “To use a source path that contains SP the
path must be quoted.”, so it is expected that some implementations
depend on spaces being allowed in paths in the final position. Thus we
have two documented ways to parse paths, so simplify the implementation
to that.

Now we have:

1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
   filerename (dest), ls, and ls-commit:

   If `<path>` starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes (including any number of SP).
   Garbage after a quoted string or an unquoted empty string are errors.
   (In ls-commit, it must be quoted to disambiguate from ls.)

2. `parse_path_space` for filecopy (source) and filerename (source):

   If `<path>` starts with `"`, unquote it and report parse errors;
   otherwise, treat it as literal bytes until the next SP.
   It must be followed by a SP. An unquoted empty string is an error.

Signed-off-by: Thalia Archibald <thalia@archibald.dev>
---
 Documentation/git-fast-import.txt |   3 +-
 builtin/fast-import.c             | 115 ++++++++------
 t/t9300-fast-import.sh            | 252 +++++++++++++++++++++++++++++-
 3 files changed, 316 insertions(+), 54 deletions(-)

Comments

Thalia Archibald March 22, 2024, 12:11 a.m. UTC | #1

Looks like my cover letter was dropped and placing each Cc: on a separate line
only sends to the last one. Let’s try again. Here's my cover letter and the full
relevant list from contrib/contacts is now CC'd:

> fast-import has subtle differences in how it parses file paths between each
> occurrence of <path> in the grammar. Many errors were suppressed or not checked,
> which could lead to silent data corruption. A particularly bad case was when a
> front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not
> supported), it would be treated as literal bytes instead of a quoted string.
> 
> Bring path parsing into line with the documented behavior and improve
> documentation to fill in missing details.
> 
> This patch series is patterned after 06454cb9a3 (fast-import: tighten parsing of
> datarefs, 2012-04-07), which did similar fixes across the grammar, but for
> marks.
> 
> This is my first contribution to Git, so please let me know if there's something
> I've missed. I'm working on a tool for advanced repo transformations (like a
> union of filter-repo and Reposurgeon workflows), so I've been living in
> fast-import code and I have more parsing fixes planned.

Patrick Steinhardt March 28, 2024, 8:21 a.m. UTC | #2

On Fri, Mar 22, 2024 at 12:03:18AM +0000, Thalia Archibald wrote:
> Path parsing in fast-import is inconsistent and many unquoting errors
> are suppressed.
> 
> `<path>` appears in the grammar in these places:
> 
>     filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF
>     filedelete ::= 'D' SP <path> LF
>     filecopy   ::= 'C' SP <path> SP <path> LF
>     filerename ::= 'R' SP <path> SP <path> LF
>     ls         ::= 'ls' SP <dataref> SP <path> LF
>     ls-commit  ::= 'ls' SP <path> LF
> 
> and fast-import.c parses them in five different ways:
> 
> 1. For filemodify and filedelete:
>    If `<path>` is a valid quoted string, unquote it;
>    otherwise, treat it as literal bytes (including any number of SP).
> 2. For filecopy (source) and filerename (source):
>    If `<path>` is a valid quoted string, unquote it;
>    otherwise, treat it as literal bytes until the next SP.
> 3. For filecopy (dest) and filerename (dest):
>    Like 1., but an unquoted empty string is an error.
> 4. For ls:
>    If `<path>` starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes (including any number of SP).
> 5. For ls-commit:
>    Unquote `<path>` and report parse errors.
>    (It must start with `"` to disambiguate from ls.)
> 
> In the first three, any errors from trying to unquote a string are
> suppressed, so a quoted string that contains invalid escapes would be
> interpreted as literal bytes. For example, `"\xff"` would fail to
> unquote (because hex escapes are not supported), and it would instead be
> interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is
> certainly not intended. Some front-ends erroneously use their language's
> standard quoting routine and could silently introduce escapes that would
> be incorrectly parsed due to this.
> 
> The documentation states that “To use a source path that contains SP the
> path must be quoted.”, so it is expected that some implementations
> depend on spaces being allowed in paths in the final position. Thus we
> have two documented ways to parse paths, so simplify the implementation
> to that.
> 
> Now we have:
> 
> 1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
>    filerename (dest), ls, and ls-commit:
> 
>    If `<path>` starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes (including any number of SP).
>    Garbage after a quoted string or an unquoted empty string are errors.
>    (In ls-commit, it must be quoted to disambiguate from ls.)
> 
> 2. `parse_path_space` for filecopy (source) and filerename (source):
> 
>    If `<path>` starts with `"`, unquote it and report parse errors;
>    otherwise, treat it as literal bytes until the next SP.
>    It must be followed by a SP. An unquoted empty string is an error.
> 
> Signed-off-by: Thalia Archibald <thalia@archibald.dev>
> ---
>  Documentation/git-fast-import.txt |   3 +-
>  builtin/fast-import.c             | 115 ++++++++------
>  t/t9300-fast-import.sh            | 252 +++++++++++++++++++++++++++++-
>  3 files changed, 316 insertions(+), 54 deletions(-)
> 
> diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
> index b2607366b9..271bd63a10 100644
> --- a/Documentation/git-fast-import.txt
> +++ b/Documentation/git-fast-import.txt
> @@ -649,7 +649,8 @@ The value of `<path>` must be in canonical form. That is it must not:
>  * contain the special component `.` or `..` (e.g. `foo/./bar` and
>    `foo/../bar` are invalid).
>  
> -The root of the tree can be represented by an empty string as `<path>`.
> +The root of the tree can be represented by a quoted empty string (`""`)
> +as `<path>`.

So this is part of the "filemodify" section with the following syntax:

    'M' SP <mode> SP <dataref> SP <path> LF

The way I interpret this change is that <path> could previously be empty
(`SP LF`), but now it needs to be quoted (`SP '"' '"' LF). This seems to
be related to cases (1) and (3) of your commit messages, where
"filemodify" could contain an unquoted empty string whereas "filecopy"
and "filerename" would complain about such an unquoted string.

In any case I don't see a strong argument why exactly it should be
forbidden to have an unquoted empty path here, and I do wonder whether
it would break existing writers of the format when we retroactively
tighten this case. Isn't it possible to instead loosen it such that all
three of the above actions know to handle unquoted empty paths?

>  It is recommended that `<path>` always be encoded using UTF-8.
>  
> diff --git a/builtin/fast-import.c b/builtin/fast-import.c
> index 71a195ca22..b2adec8d9a 100644
> --- a/builtin/fast-import.c
> +++ b/builtin/fast-import.c
> @@ -2224,7 +2224,7 @@ static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
>   *
>   *   idnum ::= ':' bigint;
>   *
> - * Return the first character after the value in *endptr.
> + * Update *endptr to point to the first character after the value.

I think it would make sense to put these improvements for comments into
a separate commit. Otherwise it makes you wonder whether this behaviour
is new now.

>   * Complain if the following character is not what is expected,
>   * either a space or end of the string.
> @@ -2257,8 +2257,8 @@ static uintmax_t parse_mark_ref_eol(const char *p)
>  }
>  
>  /*
> - * Parse the mark reference, demanding a trailing space.  Return a
> - * pointer to the space.
> + * Parse the mark reference, demanding a trailing space. Update *p to
> + * point to the first character after the space.
>   */

Same.

>  static uintmax_t parse_mark_ref_space(const char **p)
>  {
> @@ -2272,10 +2272,57 @@ static uintmax_t parse_mark_ref_space(const char **p)
>  	return mark;
>  }
>  
> +/*
> + * Parse the path string into the strbuf. It may be quoted with escape sequences
> + * or unquoted without escape sequences. When unquoted, it may only contain a
> + * space if `allow_spaces` is nonzero.
> + */
> +static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
> +{
> +	strbuf_reset(sb);

It's not all that customary in our codebase to have the function reset
the `strbuf` for the caller because it does make the function less
flexible. I would either keep the `strbuf_reset()` on the caller side or
at least document this behaviour in the comment.

> +	if (*p == '"') {
> +		if (unquote_c_style(sb, p, endp))
> +			die("Invalid %s: %s", field, command_buf.buf);
> +	} else {
> +		if (allow_spaces)
> +			*endp = p + strlen(p);

I wonder whether `stop_at_unquoted_space` might be more telling. It's
not like we disallow spaces here, it's that we treat them as the
separator to the next field.

> +		else
> +			*endp = strchr(p, ' ');
> +		if (*endp == p)
> +			die("Missing %s: %s", field, command_buf.buf);

Error messages should start with a lower-case letter and be
translateable. But these are simply moved over from the previous code,
so I don't mind much if you want to keep them as-is.

> +		strbuf_add(sb, p, *endp - p);
> +	}
> +}
> +
> +/*
> + * Parse the path string into the strbuf, and complain if this is not the end of
> + * the string. It may contain spaces even when unquoted.
> + */
> +static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
> +{
> +	const char *end;
> +
> +	parse_path(sb, p, &end, 1, field);
> +	if (*end)
> +		die("Garbage after %s: %s", field, command_buf.buf);
> +}
> +
> +/*
> + * Parse the path string into the strbuf, and ensure it is followed by a space.
> + * It may not contain spaces when unquoted. Update *endp to point to the first
> + * character after the space.
> + */
> +static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
> +{
> +	parse_path(sb, p, endp, 0, field);
> +	if (**endp != ' ')
> +		die("Missing space after %s: %s", field, command_buf.buf);
> +	(*endp)++;
> +}
> +
>  static void file_change_m(const char *p, struct branch *b)
>  {
>  	static struct strbuf uq = STRBUF_INIT;
> -	const char *endp;
>  	struct object_entry *oe;
>  	struct object_id oid;
>  	uint16_t mode, inline_data = 0;
> @@ -2312,12 +2359,8 @@ static void file_change_m(const char *p, struct branch *b)
>  			die("Missing space after SHA1: %s", command_buf.buf);
>  	}
>  
> -	strbuf_reset(&uq);
> -	if (!unquote_c_style(&uq, p, &endp)) {
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;

This is loosening the condition so that we also accept unquoted paths
now. Okay.

>  
>  	/* Git does not track empty, non-toplevel directories. */
>  	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
> @@ -2381,48 +2424,23 @@ static void file_change_m(const char *p, struct branch *b)
>  static void file_change_d(const char *p, struct branch *b)
>  {
>  	static struct strbuf uq = STRBUF_INIT;
> -	const char *endp;
>  
> -	strbuf_reset(&uq);
> -	if (!unquote_c_style(&uq, p, &endp)) {
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;
>  	tree_content_remove(&b->branch_tree, p, NULL, 1);
>  }

Same.

> -static void file_change_cr(const char *s, struct branch *b, int rename)
> +static void file_change_cr(const char *p, struct branch *b, int rename)
>  {
> -	const char *d;
> +	const char *s, *d;
>  	static struct strbuf s_uq = STRBUF_INIT;
>  	static struct strbuf d_uq = STRBUF_INIT;
> -	const char *endp;
>  	struct tree_entry leaf;
>  
> -	strbuf_reset(&s_uq);
> -	if (!unquote_c_style(&s_uq, s, &endp)) {
> -		if (*endp != ' ')
> -			die("Missing space after source: %s", command_buf.buf);
> -	} else {
> -		endp = strchr(s, ' ');
> -		if (!endp)
> -			die("Missing space after source: %s", command_buf.buf);
> -		strbuf_add(&s_uq, s, endp - s);
> -	}
> +	parse_path_space(&s_uq, p, &p, "source");
> +	parse_path_eol(&d_uq, p, "dest");
>  	s = s_uq.buf;
> -
> -	endp++;
> -	if (!*endp)
> -		die("Missing dest: %s", command_buf.buf);
> -
> -	d = endp;
> -	strbuf_reset(&d_uq);
> -	if (!unquote_c_style(&d_uq, d, &endp)) {
> -		if (*endp)
> -			die("Garbage after dest in: %s", command_buf.buf);
> -		d = d_uq.buf;
> -	}
> +	d = d_uq.buf;

Nice simplification. The source path should behave the same, and parsing
of the destination path has been loosened to also allow unquoted paths.

>  	memset(&leaf, 0, sizeof(leaf));
>  	if (rename)
> @@ -3168,6 +3186,7 @@ static void parse_ls(const char *p, struct branch *b)
>  {
>  	struct tree_entry *root = NULL;
>  	struct tree_entry leaf = {NULL};
> +	static struct strbuf uq = STRBUF_INIT;

I know the code had this as a static variable before, as well. But is
this really necessary? Can't we leave it as non-static and then release
the buffer at the end of this function?

>  	/* ls SP (<tree-ish> SP)? <path> */
>  	if (*p == '"') {
> @@ -3182,16 +3201,8 @@ static void parse_ls(const char *p, struct branch *b)
>  			root->versions[1].mode = S_IFDIR;
>  		load_tree(root);
>  	}
> -	if (*p == '"') {
> -		static struct strbuf uq = STRBUF_INIT;
> -		const char *endp;
> -		strbuf_reset(&uq);
> -		if (unquote_c_style(&uq, p, &endp))
> -			die("Invalid path: %s", command_buf.buf);
> -		if (*endp)
> -			die("Garbage after path in: %s", command_buf.buf);
> -		p = uq.buf;
> -	}
> +	parse_path_eol(&uq, p, "path");
> +	p = uq.buf;

And this case should behave the same.

>  	tree_content_get(root, p, &leaf, 1);
>  	/*
>  	 * A directory in preparation would have a sha1 of zero
> diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
> index dbb5042b0b..ef04b43f46 100755
> --- a/t/t9300-fast-import.sh
> +++ b/t/t9300-fast-import.sh
> @@ -2146,6 +2146,7 @@ test_expect_success 'Q: deny note on empty branch' '
>  	EOF
>  	test_must_fail git fast-import <input
>  '
> +
>  ###
>  ### series R (feature and option)
>  ###
> @@ -2794,7 +2795,7 @@ test_expect_success 'R: blob appears only once' '
>  '
>  
>  ###
> -### series S
> +### series S (mark and path parsing)
>  ###
>  #
>  # Make sure missing spaces and EOLs after mark references
> @@ -3064,6 +3065,255 @@ test_expect_success 'S: ls with garbage after sha1 must fail' '
>  	test_grep "space after tree-ish" err
>  '
>  
> +#
> +# Path parsing
> +#
> +# There are two sorts of ways a path can be parsed, depending on whether it is
> +# the last field on the line. Additionally, ls without a <tree-ish> has a
> +# special case. Test every occurrence of <path> in the grammar against every
> +# error case.
> +#
> +
> +#
> +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
> +# filerename (dest), and ls.
> +#
> +# commit :301 from root -- modify hello.c
> +# commit :302 from :301 -- modify $path
> +# commit :303 from :302 -- delete $path
> +# commit :304 from :301 -- copy hello.c $path
> +# commit :305 from :301 -- rename hello.c $path
> +# ls :305 $path
> +#
> +test_path_eol_success () {
> +	test="$1" path="$2" unquoted_path="$3"

Should these variables be local?

> +	test_expect_success "S: paths at EOL with $test must work" '
> +		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
> +		blob
> +		mark :401
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		blob
> +		mark :402
> +		data <<BLOB
> +		hallo welt
> +		BLOB
> +
> +		commit refs/heads/path-eol
> +		mark :301
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		initial commit
> +		COMMIT
> +		M 100644 :401 hello.c
> +
> +		commit refs/heads/path-eol
> +		mark :302
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filemodify
> +		COMMIT
> +		from :301
> +		M 100644 :402 '"$path"'
> +
> +		commit refs/heads/path-eol
> +		mark :303
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filedelete
> +		COMMIT
> +		from :302
> +		D '"$path"'
> +
> +		commit refs/heads/path-eol
> +		mark :304
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filecopy dest
> +		COMMIT
> +		from :301
> +		C hello.c '"$path"'
> +
> +		commit refs/heads/path-eol
> +		mark :305
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filerename dest
> +		COMMIT
> +		from :301
> +		R hello.c '"$path"'
> +
> +		ls :305 '"$path"'
> +		EOF
> +
> +		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
> +		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
> +		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
> +		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
> +		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
> +		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
> +
> +		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
> +		git ls-tree $commit_m | sort >tree_m.out &&
> +		test_cmp tree_m.exp tree_m.out &&
> +
> +		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
> +		git ls-tree $commit_d >tree_d.out &&
> +		test_cmp tree_d.exp tree_d.out &&
> +
> +		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
> +		git ls-tree $commit_c | sort >tree_c.out &&
> +		test_cmp tree_c.exp tree_c.out &&
> +
> +		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
> +		git ls-tree $commit_r >tree_r.out &&
> +		test_cmp tree_r.exp tree_r.out &&
> +
> +		test_cmp out tree_r.exp &&
> +
> +		git branch -D path-eol
> +	'
> +}
> +
> +test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
> +test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
> +
> +#
> +# Valid paths before a space: filecopy (source) and filerename (source).
> +#
> +# commit :301 from root -- modify $path
> +# commit :302 from :301 -- copy $path hello2.c
> +# commit :303 from :301 -- rename $path hello2.c
> +#
> +test_path_space_success () {
> +	test="$1" path="$2" unquoted_path="$3"

Same question here, should these be local?

> +	test_expect_success "S: paths before space with $test must work" '
> +		git fast-import --export-marks=marks.out <<-EOF 2>err &&
> +		blob
> +		mark :401
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		commit refs/heads/path-space
> +		mark :301
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		initial commit
> +		COMMIT
> +		M 100644 :401 '"$path"'
> +
> +		commit refs/heads/path-space
> +		mark :302
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filecopy source
> +		COMMIT
> +		from :301
> +		C '"$path"' hello2.c
> +
> +		commit refs/heads/path-space
> +		mark :303
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit filerename source
> +		COMMIT
> +		from :301
> +		R '"$path"' hello2.c
> +
> +		EOF
> +
> +		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
> +		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
> +		blob=$(grep :401 marks.out | cut -d\  -f2) &&
> +
> +		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
> +		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
> +		git ls-tree $commit_c | sort >tree_c.out &&
> +		test_cmp tree_c.exp tree_c.out &&
> +
> +		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
> +		git ls-tree $commit_r >tree_r.out &&
> +		test_cmp tree_r.exp tree_r.out &&
> +
> +		git branch -D path-space
> +	'
> +}
> +
> +test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
> +test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
> +
> +#
> +# Test a single commit change with an invalid path. Run it with all occurrences
> +# of <path> in the grammar against all error kinds.
> +#
> +test_path_fail () {
> +	what="$1" path="$2" err_grep="$3"

Same.

> +	test_expect_success "S: $change with $what must fail" '
> +		test_must_fail git fast-import <<-EOF 2>err &&
> +		blob
> +		mark :1
> +		data <<BLOB
> +		hello world
> +		BLOB
> +
> +		commit refs/heads/S-path-fail
> +		mark :2
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit setup
> +		COMMIT
> +		M 100644 :1 hello.c
> +
> +		commit refs/heads/S-path-fail
> +		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
> +		data <<COMMIT
> +		commit with bad path
> +		COMMIT
> +		from :2
> +		'"$prefix$path$suffix"'
> +		EOF
> +
> +		test_grep '"'$err_grep'"' err
> +	'
> +}
> +
> +test_path_base_fail () {
> +	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
> +	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
> +}
> +test_path_eol_quoted_fail () {
> +	test_path_base_fail
> +	test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
> +	test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
> +}
> +test_path_eol_fail () {
> +	test_path_eol_quoted_fail
> +	test_path_fail 'empty unquoted path' '' "Missing $field"
> +}
> +test_path_space_fail () {
> +	test_path_base_fail
> +	test_path_fail 'empty unquoted path' '' "Missing $field"
> +	test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
> +}
> +
> +change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
> +change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
> +change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
> +change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
> +change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
> +change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
> +change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail

This is quite confusing because you now mix two different styles, where
some of the functions take arguments while others pass arguments via
variables. I think it would be preferable to pass all arguments as
proper function arguments.

Patrick

> +# When 'ls' has no <tree-ish>, the <path> must be quoted.
> +change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
> +test_path_eol_quoted_fail &&
> +test_path_fail 'empty unquoted path' '' "Invalid dataref"
> +
>  ###
>  ### series T (ls)
>  ###
> -- 
> 2.44.0
> 
> 
>

Thalia Archibald April 1, 2024, 9:05 a.m. UTC | #3

(Sorry for first sending as HTML)

On Mar 28, 2024, at 01:21, Patrick Steinhardt <ps@pks.im> wrote:
> 
> So this is part of the "filemodify" section with the following syntax:
> 
>    'M' SP <mode> SP <dataref> SP <path> LF
> 
> The way I interpret this change is that <path> could previously be empty
> (`SP LF`), but now it needs to be quoted (`SP '"' '"' LF). This seems to
> be related to cases (1) and (3) of your commit messages, where
> "filemodify" could contain an unquoted empty string whereas "filecopy"
> and "filerename" would complain about such an unquoted string.
> 
> In any case I don't see a strong argument why exactly it should be
> forbidden to have an unquoted empty path here, and I do wonder whether
> it would break existing writers of the format when we retroactively
> tighten this case. Isn't it possible to instead loosen it such that all
> three of the above actions know to handle unquoted empty paths?

At first, I strongly thought that we should forbid this case of unquoted empty
paths. It's a somewhat peculiar case in that it refers to the root of the repo
and few front-ends use it. I surveyed git fast-export, git-filter-repo,
Reposurgeon, hg-fast-export, cvs-fast-export (by Eric S. Raymond),
cvs-fast-export (by Roger Vaughn), svn2git, bzr-fastexport, and bk fast-export,
but none of them ever target the root of the repo. I assumed that if an unquoted
empty path was ever emitted, it was likely an bug that should not be accepted
(e.g., a null byte array somehow).

However, most occurrences of <path> in the grammar have allowed unquoted empty
strings to mean the root for 14 years and documentation has implied that it's
allowed for 13 years. It's just the two cases of the destination paths of
filecopy and filerename that don't allow it, and those are less-used operations,
so front-ends may never encounter that error.

Some assumed errors in emitting empty paths are caught by validation with file
modes, so even if it's loosened it's still fairly safe. filemodify must be
040000 when it targets the root, and filecopy and filerename to the root must
have a source path that's a directory. The worst case is filedelete
unintentionally targeting the root, but that's been allowed to be an unquoted
empty path for almost its entire lifetime, so I don't think should be changed.

I've changed it to allow unquoted empty paths for all operations in patch v2
3/8 (fast-import: allow unquoted empty path for root).

> This is loosening the condition so that we also accept unquoted paths
> now. Okay.

On the contrary, v1 tightens all paths to forbid unquoted empty strings. v2 now
allows it and should make that change more clear.

> On Fri, Mar 22, 2024 at 12:03:18AM +0000, Thalia Archibald wrote:
>> +/*
>> + * Parse the path string into the strbuf. It may be quoted with escape sequences
>> + * or unquoted without escape sequences. When unquoted, it may only contain a
>> + * space if `allow_spaces` is nonzero.
>> + */
>> +static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
>> +{
>> + strbuf_reset(sb);
>> + if (*p == '"') {
>> + if (unquote_c_style(sb, p, endp))
>> + die("Invalid %s: %s", field, command_buf.buf);
>> + } else {
>> + if (allow_spaces)
>> + *endp = p + strlen(p);
> 
> I wonder whether `stop_at_unquoted_space` might be more telling. It's
> not like we disallow spaces here, it's that we treat them as the
> separator to the next field.

I agree, but I’d rather something shorter, so I’ve changed it to `include_spaces`.

>> + else
>> + *endp = strchr(p, ' ');
>> + if (*endp == p)
>> + die("Missing %s: %s", field, command_buf.buf);
> 
> Error messages should start with a lower-case letter and be
> translateable. But these are simply moved over from the previous code,
> so I don't mind much if you want to keep them as-is.
> 
>> + strbuf_add(sb, p, *endp - p);
>> + }
>> +}

fast-import isn’t a porcelain command, AFAIK, so I thought the convention is
that its output isn't translated?

From po/README.md:
> 
> The output from Git's plumbing utilities will primarily be read by
> programs and would break scripts under non-C locales if it was
> translated. Plumbing strings should not be translated, since
> they're part of Git's API.

I would, however, like to improve its error messages. For example, diagnosing
errors more precisely or changing the outdated “GIT” to “Git”.

To what extent should the exact error messages be considered part of Git's API?

Thalia

diff --git a/Documentation/git-fast-import.txt b/Documentation/git-fast-import.txt
index b2607366b9..271bd63a10 100644
--- a/Documentation/git-fast-import.txt
+++ b/Documentation/git-fast-import.txt
@@ -649,7 +649,8 @@  The value of `<path>` must be in canonical form. That is it must not:
 * contain the special component `.` or `..` (e.g. `foo/./bar` and
   `foo/../bar` are invalid).
 
-The root of the tree can be represented by an empty string as `<path>`.
+The root of the tree can be represented by a quoted empty string (`""`)
+as `<path>`.
 
 It is recommended that `<path>` always be encoded using UTF-8.
 
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 71a195ca22..b2adec8d9a 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -2224,7 +2224,7 @@  static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch
  *
  *   idnum ::= ':' bigint;
  *
- * Return the first character after the value in *endptr.
+ * Update *endptr to point to the first character after the value.
  *
  * Complain if the following character is not what is expected,
  * either a space or end of the string.
@@ -2257,8 +2257,8 @@  static uintmax_t parse_mark_ref_eol(const char *p)
 }
 
 /*
- * Parse the mark reference, demanding a trailing space.  Return a
- * pointer to the space.
+ * Parse the mark reference, demanding a trailing space. Update *p to
+ * point to the first character after the space.
  */
 static uintmax_t parse_mark_ref_space(const char **p)
 {
@@ -2272,10 +2272,57 @@  static uintmax_t parse_mark_ref_space(const char **p)
 	return mark;
 }
 
+/*
+ * Parse the path string into the strbuf. It may be quoted with escape sequences
+ * or unquoted without escape sequences. When unquoted, it may only contain a
+ * space if `allow_spaces` is nonzero.
+ */
+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field)
+{
+	strbuf_reset(sb);
+	if (*p == '"') {
+		if (unquote_c_style(sb, p, endp))
+			die("Invalid %s: %s", field, command_buf.buf);
+	} else {
+		if (allow_spaces)
+			*endp = p + strlen(p);
+		else
+			*endp = strchr(p, ' ');
+		if (*endp == p)
+			die("Missing %s: %s", field, command_buf.buf);
+		strbuf_add(sb, p, *endp - p);
+	}
+}
+
+/*
+ * Parse the path string into the strbuf, and complain if this is not the end of
+ * the string. It may contain spaces even when unquoted.
+ */
+static void parse_path_eol(struct strbuf *sb, const char *p, const char *field)
+{
+	const char *end;
+
+	parse_path(sb, p, &end, 1, field);
+	if (*end)
+		die("Garbage after %s: %s", field, command_buf.buf);
+}
+
+/*
+ * Parse the path string into the strbuf, and ensure it is followed by a space.
+ * It may not contain spaces when unquoted. Update *endp to point to the first
+ * character after the space.
+ */
+static void parse_path_space(struct strbuf *sb, const char *p, const char **endp, const char *field)
+{
+	parse_path(sb, p, endp, 0, field);
+	if (**endp != ' ')
+		die("Missing space after %s: %s", field, command_buf.buf);
+	(*endp)++;
+}
+
 static void file_change_m(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 	struct object_entry *oe;
 	struct object_id oid;
 	uint16_t mode, inline_data = 0;
@@ -2312,12 +2359,8 @@  static void file_change_m(const char *p, struct branch *b)
 			die("Missing space after SHA1: %s", command_buf.buf);
 	}
 
-	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 
 	/* Git does not track empty, non-toplevel directories. */
 	if (S_ISDIR(mode) && is_empty_tree_oid(&oid) && *p) {
@@ -2381,48 +2424,23 @@  static void file_change_m(const char *p, struct branch *b)
 static void file_change_d(const char *p, struct branch *b)
 {
 	static struct strbuf uq = STRBUF_INIT;
-	const char *endp;
 
-	strbuf_reset(&uq);
-	if (!unquote_c_style(&uq, p, &endp)) {
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_remove(&b->branch_tree, p, NULL, 1);
 }
 
-static void file_change_cr(const char *s, struct branch *b, int rename)
+static void file_change_cr(const char *p, struct branch *b, int rename)
 {
-	const char *d;
+	const char *s, *d;
 	static struct strbuf s_uq = STRBUF_INIT;
 	static struct strbuf d_uq = STRBUF_INIT;
-	const char *endp;
 	struct tree_entry leaf;
 
-	strbuf_reset(&s_uq);
-	if (!unquote_c_style(&s_uq, s, &endp)) {
-		if (*endp != ' ')
-			die("Missing space after source: %s", command_buf.buf);
-	} else {
-		endp = strchr(s, ' ');
-		if (!endp)
-			die("Missing space after source: %s", command_buf.buf);
-		strbuf_add(&s_uq, s, endp - s);
-	}
+	parse_path_space(&s_uq, p, &p, "source");
+	parse_path_eol(&d_uq, p, "dest");
 	s = s_uq.buf;
-
-	endp++;
-	if (!*endp)
-		die("Missing dest: %s", command_buf.buf);
-
-	d = endp;
-	strbuf_reset(&d_uq);
-	if (!unquote_c_style(&d_uq, d, &endp)) {
-		if (*endp)
-			die("Garbage after dest in: %s", command_buf.buf);
-		d = d_uq.buf;
-	}
+	d = d_uq.buf;
 
 	memset(&leaf, 0, sizeof(leaf));
 	if (rename)
@@ -3168,6 +3186,7 @@  static void parse_ls(const char *p, struct branch *b)
 {
 	struct tree_entry *root = NULL;
 	struct tree_entry leaf = {NULL};
+	static struct strbuf uq = STRBUF_INIT;
 
 	/* ls SP (<tree-ish> SP)? <path> */
 	if (*p == '"') {
@@ -3182,16 +3201,8 @@  static void parse_ls(const char *p, struct branch *b)
 			root->versions[1].mode = S_IFDIR;
 		load_tree(root);
 	}
-	if (*p == '"') {
-		static struct strbuf uq = STRBUF_INIT;
-		const char *endp;
-		strbuf_reset(&uq);
-		if (unquote_c_style(&uq, p, &endp))
-			die("Invalid path: %s", command_buf.buf);
-		if (*endp)
-			die("Garbage after path in: %s", command_buf.buf);
-		p = uq.buf;
-	}
+	parse_path_eol(&uq, p, "path");
+	p = uq.buf;
 	tree_content_get(root, p, &leaf, 1);
 	/*
 	 * A directory in preparation would have a sha1 of zero
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index dbb5042b0b..ef04b43f46 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -2146,6 +2146,7 @@  test_expect_success 'Q: deny note on empty branch' '
 	EOF
 	test_must_fail git fast-import <input
 '
+
 ###
 ### series R (feature and option)
 ###
@@ -2794,7 +2795,7 @@  test_expect_success 'R: blob appears only once' '
 '
 
 ###
-### series S
+### series S (mark and path parsing)
 ###
 #
 # Make sure missing spaces and EOLs after mark references
@@ -3064,6 +3065,255 @@  test_expect_success 'S: ls with garbage after sha1 must fail' '
 	test_grep "space after tree-ish" err
 '
 
+#
+# Path parsing
+#
+# There are two sorts of ways a path can be parsed, depending on whether it is
+# the last field on the line. Additionally, ls without a <tree-ish> has a
+# special case. Test every occurrence of <path> in the grammar against every
+# error case.
+#
+
+#
+# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest),
+# filerename (dest), and ls.
+#
+# commit :301 from root -- modify hello.c
+# commit :302 from :301 -- modify $path
+# commit :303 from :302 -- delete $path
+# commit :304 from :301 -- copy hello.c $path
+# commit :305 from :301 -- rename hello.c $path
+# ls :305 $path
+#
+test_path_eol_success () {
+	test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths at EOL with $test must work" '
+		git fast-import --export-marks=marks.out <<-EOF >out 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		blob
+		mark :402
+		data <<BLOB
+		hallo welt
+		BLOB
+
+		commit refs/heads/path-eol
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 hello.c
+
+		commit refs/heads/path-eol
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filemodify
+		COMMIT
+		from :301
+		M 100644 :402 '"$path"'
+
+		commit refs/heads/path-eol
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filedelete
+		COMMIT
+		from :302
+		D '"$path"'
+
+		commit refs/heads/path-eol
+		mark :304
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy dest
+		COMMIT
+		from :301
+		C hello.c '"$path"'
+
+		commit refs/heads/path-eol
+		mark :305
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename dest
+		COMMIT
+		from :301
+		R hello.c '"$path"'
+
+		ls :305 '"$path"'
+		EOF
+
+		commit_m=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_d=$(grep :303 marks.out | cut -d\  -f2) &&
+		commit_c=$(grep :304 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :305 marks.out | cut -d\  -f2) &&
+		blob1=$(grep :401 marks.out | cut -d\  -f2) &&
+		blob2=$(grep :402 marks.out | cut -d\  -f2) &&
+
+		( printf "100644 blob $blob2\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_m.exp &&
+		git ls-tree $commit_m | sort >tree_m.out &&
+		test_cmp tree_m.exp tree_m.out &&
+
+		printf "100644 blob $blob1\thello.c\n" >tree_d.exp &&
+		git ls-tree $commit_d >tree_d.out &&
+		test_cmp tree_d.exp tree_d.out &&
+
+		( printf "100644 blob $blob1\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob1\thello.c\n" ) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob1\t'"$unquoted_path"'\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		test_cmp out tree_r.exp &&
+
+		git branch -D path-eol
+	'
+}
+
+test_path_eol_success 'quoted spaces'   '" hello world.c "' ' hello world.c '
+test_path_eol_success 'unquoted spaces' ' hello world.c '   ' hello world.c '
+
+#
+# Valid paths before a space: filecopy (source) and filerename (source).
+#
+# commit :301 from root -- modify $path
+# commit :302 from :301 -- copy $path hello2.c
+# commit :303 from :301 -- rename $path hello2.c
+#
+test_path_space_success () {
+	test="$1" path="$2" unquoted_path="$3"
+	test_expect_success "S: paths before space with $test must work" '
+		git fast-import --export-marks=marks.out <<-EOF 2>err &&
+		blob
+		mark :401
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/path-space
+		mark :301
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		initial commit
+		COMMIT
+		M 100644 :401 '"$path"'
+
+		commit refs/heads/path-space
+		mark :302
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filecopy source
+		COMMIT
+		from :301
+		C '"$path"' hello2.c
+
+		commit refs/heads/path-space
+		mark :303
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit filerename source
+		COMMIT
+		from :301
+		R '"$path"' hello2.c
+
+		EOF
+
+		commit_c=$(grep :302 marks.out | cut -d\  -f2) &&
+		commit_r=$(grep :303 marks.out | cut -d\  -f2) &&
+		blob=$(grep :401 marks.out | cut -d\  -f2) &&
+
+		( printf "100644 blob $blob\t'"$unquoted_path"'\n" &&
+		  printf "100644 blob $blob\thello2.c\n" ) | sort >tree_c.exp &&
+		git ls-tree $commit_c | sort >tree_c.out &&
+		test_cmp tree_c.exp tree_c.out &&
+
+		printf "100644 blob $blob\thello2.c\n" >tree_r.exp &&
+		git ls-tree $commit_r >tree_r.out &&
+		test_cmp tree_r.exp tree_r.out &&
+
+		git branch -D path-space
+	'
+}
+
+test_path_space_success 'quoted spaces'      '" hello world.c "' ' hello world.c '
+test_path_space_success 'no unquoted spaces' 'hello_world.c'     'hello_world.c'
+
+#
+# Test a single commit change with an invalid path. Run it with all occurrences
+# of <path> in the grammar against all error kinds.
+#
+test_path_fail () {
+	what="$1" path="$2" err_grep="$3"
+	test_expect_success "S: $change with $what must fail" '
+		test_must_fail git fast-import <<-EOF 2>err &&
+		blob
+		mark :1
+		data <<BLOB
+		hello world
+		BLOB
+
+		commit refs/heads/S-path-fail
+		mark :2
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit setup
+		COMMIT
+		M 100644 :1 hello.c
+
+		commit refs/heads/S-path-fail
+		committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE
+		data <<COMMIT
+		commit with bad path
+		COMMIT
+		from :2
+		'"$prefix$path$suffix"'
+		EOF
+
+		test_grep '"'$err_grep'"' err
+	'
+}
+
+test_path_base_fail () {
+	test_path_fail 'unclosed " in '"$field"          '"hello.c'    "Invalid $field"
+	test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field"
+}
+test_path_eol_quoted_fail () {
+	test_path_base_fail
+	test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field"
+	test_path_fail "space after quoted $field"   '"hello.c" ' "Garbage after $field"
+}
+test_path_eol_fail () {
+	test_path_eol_quoted_fail
+	test_path_fail 'empty unquoted path' '' "Missing $field"
+}
+test_path_space_fail () {
+	test_path_base_fail
+	test_path_fail 'empty unquoted path' '' "Missing $field"
+	test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field"
+}
+
+change=filemodify       prefix='M 100644 :1 ' field=path   suffix=''         test_path_eol_fail
+change=filedelete       prefix='D '           field=path   suffix=''         test_path_eol_fail
+change=filecopy         prefix='C '           field=source suffix=' world.c' test_path_space_fail
+change=filecopy         prefix='C hello.c '   field=dest   suffix=''         test_path_eol_fail
+change=filerename       prefix='R '           field=source suffix=' world.c' test_path_space_fail
+change=filerename       prefix='R hello.c '   field=dest   suffix=''         test_path_eol_fail
+change='ls (in commit)' prefix='ls :2 '       field=path   suffix=''         test_path_eol_fail
+
+# When 'ls' has no <tree-ish>, the <path> must be quoted.
+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \
+test_path_eol_quoted_fail &&
+test_path_fail 'empty unquoted path' '' "Invalid dataref"
+
 ###
 ### series T (ls)
 ###

[1/6] fast-import: tighten parsing of paths

Commit Message

Comments

Patch