[8/8] sparse-checkout: write escaped patterns in cone mode
diff mbox series

Message ID 79b6e9a565fde766954d7cda84a835b9015af6cb.1579029963.git.gitgitgadget@gmail.com
State New
Headers show
Series
  • Harden the sparse-checkout builtin
Related show

Commit Message

sunlin via GitGitGadget Jan. 14, 2020, 7:26 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

If a user somehow creates a directory with an asterisk (*) or backslash
(\), then the "git sparse-checkout set" command will struggle to provide
the correct pattern in the sparse-checkout file. When not in cone mode,
the provided pattern is written directly into the sparse-checkout file.
However, in cone mode we expect a list of paths to directories and then
we convert those into patterns.

Even more specifically, the goal is to always allow the following from
the root of a repo:

  git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin

The ls-tree command provides directory names with an unescaped asterisk.
It also quotes the directories that contain an escaped backslash. We
must remove these quotes, then keep the escaped backslashes.

However, there is some care needed for the timing of these escapes. The
in-memory pattern list is used to update the working directory before
writing the patterns to disk. Thus, we need the command to have the
unescaped names in the hashsets for the cone comparisons, then escape
the patterns later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 48 ++++++++++++++++++++++++++++--
 t/t1091-sparse-checkout-builtin.sh | 21 +++++++++++--
 2 files changed, 64 insertions(+), 5 deletions(-)

Comments

Jeff King Jan. 14, 2020, 9:25 p.m. UTC | #1
On Tue, Jan 14, 2020 at 07:26:02PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> If a user somehow creates a directory with an asterisk (*) or backslash
> (\), then the "git sparse-checkout set" command will struggle to provide
> the correct pattern in the sparse-checkout file. When not in cone mode,
> the provided pattern is written directly into the sparse-checkout file.
> However, in cone mode we expect a list of paths to directories and then
> we convert those into patterns.
> 
> Even more specifically, the goal is to always allow the following from
> the root of a repo:
> 
>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
> 
> The ls-tree command provides directory names with an unescaped asterisk.
> It also quotes the directories that contain an escaped backslash. We
> must remove these quotes, then keep the escaped backslashes.

Do we need to document these rules somewhere? Naively I'd expect
"--stdin" to take in literal pathnames. But of course it can't represent
a path with a newline. So perhaps it makes sense to take quoted names by
default, and allow literal NUL-separated input with "-z" if anybody
wants it.

-Peff
Derrick Stolee Jan. 14, 2020, 10:11 p.m. UTC | #2
On 1/14/2020 4:25 PM, Jeff King wrote:
> On Tue, Jan 14, 2020 at 07:26:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> If a user somehow creates a directory with an asterisk (*) or backslash
>> (\), then the "git sparse-checkout set" command will struggle to provide
>> the correct pattern in the sparse-checkout file. When not in cone mode,
>> the provided pattern is written directly into the sparse-checkout file.
>> However, in cone mode we expect a list of paths to directories and then
>> we convert those into patterns.
>>
>> Even more specifically, the goal is to always allow the following from
>> the root of a repo:
>>
>>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
>>
>> The ls-tree command provides directory names with an unescaped asterisk.
>> It also quotes the directories that contain an escaped backslash. We
>> must remove these quotes, then keep the escaped backslashes.
> 
> Do we need to document these rules somewhere? Naively I'd expect
> "--stdin" to take in literal pathnames. But of course it can't represent
> a path with a newline. So perhaps it makes sense to take quoted names by
> default, and allow literal NUL-separated input with "-z" if anybody
> wants it.

This is worth thinking about the right way to describe the rules:

1. You don't _need_ quotes. They happen to come along for the ride in
  'git ls-tree' so it doesn't mess up shell scripts that iterate on
  those entries. At least, that's why I think they are quoted.

2. If you use quotes, the first layer of quotes will be removed.

How much of this needs to be documented explicitly, or how much should
we say "The input format matches what we would expect from 'git ls-tree
--name-only'"?

Thanks,
-Stolee
Jeff King Jan. 14, 2020, 10:48 p.m. UTC | #3
On Tue, Jan 14, 2020 at 05:11:03PM -0500, Derrick Stolee wrote:

> > Do we need to document these rules somewhere? Naively I'd expect
> > "--stdin" to take in literal pathnames. But of course it can't represent
> > a path with a newline. So perhaps it makes sense to take quoted names by
> > default, and allow literal NUL-separated input with "-z" if anybody
> > wants it.
> 
> This is worth thinking about the right way to describe the rules:
> 
> 1. You don't _need_ quotes. They happen to come along for the ride in
>   'git ls-tree' so it doesn't mess up shell scripts that iterate on
>   those entries. At least, that's why I think they are quoted.

It's not just shell scripts. Without quoting, the syntax becomes
ambiguous (e.g., imagine a file with a newline in it). So most Git
output that shows a filename will quote it if necessary, unless
NUL separators are being used.

> 2. If you use quotes, the first layer of quotes will be removed.

I take this to mean that anything starting with a double-quote will have
the outer layer removed, and backslash escapes inside expanded. And
anything without a starting double quote (even if it has internal
backslash escapes!) will be taken literally.

That would match how things like "update-index --index-info" work.

As far as implementation, I know you're trying to keep some of the
escaping, but I think it might make more sense to do use
unquote_c_style() to parse the input (see update-index's use for some
prior art), and then re-quote as necessary to put things into the
sparse-checkout file (I guess quoting more than just quote_c_style()
would do, since you need to quote glob metacharacters like '*' and
probably "!"). But as much as possible, I think you'd want literal
strings inside the program, and just quoting/unquoting at the edges.

> How much of this needs to be documented explicitly, or how much should
> we say "The input format matches what we would expect from 'git ls-tree
> --name-only'"?

I think it's fine to say that, and maybe call attention to the quoting.
Like:

  The input format matches the output of `git ls-tree --name-only`. This
  includes interpreting pathnames that begin with a double quote (") as
  C-style quoted strings.

Disappointingly, update-index does not seem to explain the rules
anywhere. fast-import does cover it. Maybe it's something that ought to
be hoisted out into gitcli(7) or similar (or maybe it has been and I
just can't find it).

-Peff
Derrick Stolee Jan. 24, 2020, 9:10 p.m. UTC | #4
On 1/14/2020 5:48 PM, Jeff King wrote:
> On Tue, Jan 14, 2020 at 05:11:03PM -0500, Derrick Stolee wrote:
> 
>>> Do we need to document these rules somewhere? Naively I'd expect
>>> "--stdin" to take in literal pathnames. But of course it can't represent
>>> a path with a newline. So perhaps it makes sense to take quoted names by
>>> default, and allow literal NUL-separated input with "-z" if anybody
>>> wants it.
>>
>> This is worth thinking about the right way to describe the rules:
>>
>> 1. You don't _need_ quotes. They happen to come along for the ride in
>>   'git ls-tree' so it doesn't mess up shell scripts that iterate on
>>   those entries. At least, that's why I think they are quoted.
> 
> It's not just shell scripts. Without quoting, the syntax becomes
> ambiguous (e.g., imagine a file with a newline in it). So most Git
> output that shows a filename will quote it if necessary, unless
> NUL separators are being used.

Good to know.

>> 2. If you use quotes, the first layer of quotes will be removed.
> 
> I take this to mean that anything starting with a double-quote will have
> the outer layer removed, and backslash escapes inside expanded. And
> anything without a starting double quote (even if it has internal
> backslash escapes!) will be taken literally.

Hm. Perhaps you are right! The ls-tree output for the test example
is:

	deep
	folder1
	folder2
	"zbad\\dir"
	zdoes*exist

so the "zdoes*exist" value is not escaped. I believe the current
logic does something extra: consider supplying this input to
'git sparse-checkout set --stdin':

	deep
	folder1
	folder2
	"zbad\\dir"
	zdoes\*exist

then should we un-escape "\*" to "*"? Or is this not a valid input
because a backslash should have been quoted into C-style quotes?

The behavior in the current series allows this output that would
never be written by "git ls-tree".

> That would match how things like "update-index --index-info" work.
> 
> As far as implementation, I know you're trying to keep some of the
> escaping, but I think it might make more sense to do use
> unquote_c_style() to parse the input (see update-index's use for some
> prior art), and then re-quote as necessary to put things into the
> sparse-checkout file (I guess quoting more than just quote_c_style()
> would do, since you need to quote glob metacharacters like '*' and
> probably "!"). But as much as possible, I think you'd want literal
> strings inside the program, and just quoting/unquoting at the edges.

I was playing around with this, and I think that quote_c_style() is
necessary for the output, but we have a strange in-memory situation
for the other escaping: we both fill the hashsets with the un-escaped
data and fill the pattern list with the escaped patterns.

I'll add a commit with the quote_c_style() calls during the 'list'
subcommand.

>> How much of this needs to be documented explicitly, or how much should
>> we say "The input format matches what we would expect from 'git ls-tree
>> --name-only'"?
> 
> I think it's fine to say that, and maybe call attention to the quoting.
> Like:
> 
>   The input format matches the output of `git ls-tree --name-only`. This
>   includes interpreting pathnames that begin with a double quote (") as
>   C-style quoted strings.
> 
> Disappointingly, update-index does not seem to explain the rules
> anywhere. fast-import does cover it. Maybe it's something that ought to
> be hoisted out into gitcli(7) or similar (or maybe it has been and I
> just can't find it).

I'll start the process by using your recommended language. I noticed
also that the 'set' command doesn't actually document what happens
when in cone mode, so I will correct that, too.

Thanks,
-Stolee
Jeff King Jan. 24, 2020, 9:42 p.m. UTC | #5
On Fri, Jan 24, 2020 at 04:10:21PM -0500, Derrick Stolee wrote:

> Hm. Perhaps you are right! The ls-tree output for the test example
> is:
> 
> 	deep
> 	folder1
> 	folder2
> 	"zbad\\dir"
> 	zdoes*exist
> 
> so the "zdoes*exist" value is not escaped. I believe the current
> logic does something extra: consider supplying this input to
> 'git sparse-checkout set --stdin':
> 
> 	deep
> 	folder1
> 	folder2
> 	"zbad\\dir"
> 	zdoes\*exist
> 
> then should we un-escape "\*" to "*"? Or is this not a valid input
> because a backslash should have been quoted into C-style quotes?

I'd think we should not un-escape anything, because we weren't told that
this was a C-style quoted string by the presence of an opening
double-quote. And that's how, say, update-index behaves:

  $ blob=$(echo foo | git hash-object -w --stdin)
  $ printf '100644 %s\t%s\n' \
      $blob 'just*asterisk' \
      $blob 'backslash\without\quotes' \
      $blob '"backslash\\with\\quotes"' |
    git update-index --index-info

which results in:

  $ git ls-files
  "backslash\\with\\quotes"
  "backslash\\without\\quotes"
  just*asterisk

  [same, but without quoting]
  $ git ls-files -z | tr '\0' '\n'
  backslash\with\quotes
  backslash\without\quotes
  just*asterisk

> The behavior in the current series allows this output that would
> never be written by "git ls-tree".

Yes, I think we'd never write that, because ls-tree would quote anything
with a backslash in it, even though it's not strictly necessary. But it
would be valid input to specify a file that has backslashes but not
double-quotes, and I think sparse-checkout should be changed to match
update-index here.

> I was playing around with this, and I think that quote_c_style() is
> necessary for the output, but we have a strange in-memory situation
> for the other escaping: we both fill the hashsets with the un-escaped
> data and fill the pattern list with the escaped patterns.

Yeah, but I think that the syntactic escaping on input might not have
identical rules to the escaping needed for the patterns.

So it makes sense to me to handle input as a separate mechanism, get a
pristine copy of what the user was trying to communicate to us, and then
re-escape whatever we need to put into the pattern list.

And ultimately the flow would be something like:

  - read input
    - if argument is from command-line, use it verbatim
    - else if reading stdin with "-z", use it verbatim
    - else if line starts with double-quote, unquote_c_style()
    - else use line verbatim
    - the result is a single pristine filename
  - fill hashset with pristine filenames
  - generate pattern list to write to sparse file, escaping filenames as
    necessary according to sparse-pattern rules

Obviously you don't have a "-z" yet, but I think it's something we'd
probably want in the long run. And anything coming from the command-line
shouldn't need quoting to get it to us either (and so we'd need to
escape before writing to the sparse file).

-Peff
Derrick Stolee Jan. 28, 2020, 3:03 p.m. UTC | #6
On 1/24/2020 4:42 PM, Jeff King wrote:
> And ultimately the flow would be something like:
> 
>   - read input
>     - if argument is from command-line, use it verbatim
>     - else if reading stdin with "-z", use it verbatim
>     - else if line starts with double-quote, unquote_c_style()
>     - else use line verbatim
>     - the result is a single pristine filename
>   - fill hashset with pristine filenames
>   - generate pattern list to write to sparse file, escaping filenames as
>     necessary according to sparse-pattern rules
> 
> Obviously you don't have a "-z" yet, but I think it's something we'd
> probably want in the long run. And anything coming from the command-line
> shouldn't need quoting to get it to us either (and so we'd need to
> escape before writing to the sparse file).

This recommendation came async with my v2, so I'll follow shortly with
a v3 that uses this flow. I have something that I think works, after
slightly adapting my tests, but now I need to make sure that all the
patches still make sense and build cleanly.

Thanks,
-Stolee

Patch
diff mbox series

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 3cee8ab46e..61d2c30036 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -140,6 +140,22 @@  static int update_working_directory(struct pattern_list *pl)
 	return result;
 }
 
+static char *escaped_pattern(char *pattern)
+{
+	char *p = pattern;
+	struct strbuf final = STRBUF_INIT;
+
+	while (*p) {
+		if (*p == '*' || *p == '\\')
+			strbuf_addch(&final, '\\');
+
+		strbuf_addch(&final, *p);
+		p++;
+	}
+
+	return strbuf_detach(&final, NULL);
+}
+
 static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
 {
 	int i;
@@ -164,10 +180,11 @@  static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
 	fprintf(fp, "/*\n!/*/\n");
 
 	for (i = 0; i < sl.nr; i++) {
-		char *pattern = sl.items[i].string;
+		char *pattern = escaped_pattern(sl.items[i].string);
 
 		if (strlen(pattern))
 			fprintf(fp, "%s/\n!%s/*/\n", pattern, pattern);
+		free(pattern);
 	}
 
 	string_list_clear(&sl, 0);
@@ -185,8 +202,9 @@  static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
 	string_list_remove_duplicates(&sl, 0);
 
 	for (i = 0; i < sl.nr; i++) {
-		char *pattern = sl.items[i].string;
+		char *pattern = escaped_pattern(sl.items[i].string);
 		fprintf(fp, "%s/\n", pattern);
+		free(pattern);
 	}
 }
 
@@ -337,7 +355,9 @@  static void insert_recursive_pattern(struct pattern_list *pl, struct strbuf *pat
 {
 	struct pattern_entry *e = xmalloc(sizeof(*e));
 	e->patternlen = path->len;
-	e->pattern = strbuf_detach(path, NULL);
+	e->pattern = dup_and_filter_pattern(path->buf);
+	strbuf_release(path);
+
 	hashmap_entry_init(&e->ent,
 			   ignore_case ?
 			   strihash(e->pattern) :
@@ -369,6 +389,7 @@  static void insert_recursive_pattern(struct pattern_list *pl, struct strbuf *pat
 
 static void strbuf_to_cone_pattern(struct strbuf *line, struct pattern_list *pl)
 {
+	int i;
 	strbuf_trim(line);
 
 	strbuf_trim_trailing_dir_sep(line);
@@ -376,6 +397,27 @@  static void strbuf_to_cone_pattern(struct strbuf *line, struct pattern_list *pl)
 	if (!line->len)
 		return;
 
+	for (i = 0; i < line->len; i++) {
+		if (line->buf[i] == '*') {
+			strbuf_insert(line, i, "\\", 1);
+			i++;
+		}
+
+		if (line->buf[i] == '\\') {
+			if (i < line->len - 1 && line->buf[i + 1] == '\\')
+				i++;
+			else
+				strbuf_insert(line, i, "\\", 1);
+
+			i++;
+		}
+	}
+
+	if (line->buf[0] == '"' && line->buf[line->len - 1] == '"') {
+		strbuf_remove(line, 0, 1);
+		strbuf_remove(line, line->len - 1, 1);
+	}
+
 	if (line->buf[0] != '/')
 		strbuf_insert(line, 0, "/", 1);
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index 051c1f3bf2..3da7c10bd9 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -309,6 +309,9 @@  check_read_tree_errors () {
 	REPO=$1
 	FILES=$2
 	ERRORS=$3
+	git -C $REPO -c core.sparseCheckoutCone=false read-tree -mu HEAD 2>err &&
+	test_must_be_empty err &&
+	check_files $REPO "$FILES" &&
 	git -C $REPO read-tree -mu HEAD 2>err &&
 	if test -z "$ERRORS"
 	then
@@ -379,14 +382,28 @@  test_expect_success BSLASHPSPEC 'pattern-checks: escaped "*"' '
 	git -C escaped reset --hard $COMMIT &&
 	check_files escaped "a deep folder1 folder2 zbad\\dir zdoes*exist" &&
 	git -C escaped sparse-checkout init --cone &&
-	cat >escaped/.git/info/sparse-checkout <<-\EOF &&
+	git -C escaped sparse-checkout set zbad\\dir zdoes\*not\*exist zdoes\*exist &&
+	cat >expect <<-\EOF &&
 	/*
 	!/*/
 	/zbad\\dir/
+	/zdoes\*exist/
 	/zdoes\*not\*exist/
+	EOF
+	test_cmp expect escaped/.git/info/sparse-checkout &&
+	check_read_tree_errors escaped "a zbad\\dir zdoes*exist" &&
+	git -C escaped ls-tree -d --name-only HEAD | git -C escaped sparse-checkout set --stdin &&
+	cat >expect <<-\EOF &&
+	/*
+	!/*/
+	/deep/
+	/folder1/
+	/folder2/
+	/zbad\\dir/
 	/zdoes\*exist/
 	EOF
-	check_read_tree_errors escaped "a zbad\\dir zdoes*exist"
+	test_cmp expect escaped/.git/info/sparse-checkout &&
+	check_files escaped "a deep folder1 folder2 zbad\\dir zdoes*exist"
 '
 
 test_done