Message ID | 79b6e9a565fde766954d7cda84a835b9015af6cb.1579029963.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Harden the sparse-checkout builtin | expand |
On Tue, Jan 14, 2020 at 07:26:02PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > If a user somehow creates a directory with an asterisk (*) or backslash > (\), then the "git sparse-checkout set" command will struggle to provide > the correct pattern in the sparse-checkout file. When not in cone mode, > the provided pattern is written directly into the sparse-checkout file. > However, in cone mode we expect a list of paths to directories and then > we convert those into patterns. > > Even more specifically, the goal is to always allow the following from > the root of a repo: > > git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin > > The ls-tree command provides directory names with an unescaped asterisk. > It also quotes the directories that contain an escaped backslash. We > must remove these quotes, then keep the escaped backslashes. Do we need to document these rules somewhere? Naively I'd expect "--stdin" to take in literal pathnames. But of course it can't represent a path with a newline. So perhaps it makes sense to take quoted names by default, and allow literal NUL-separated input with "-z" if anybody wants it. -Peff
On 1/14/2020 4:25 PM, Jeff King wrote: > On Tue, Jan 14, 2020 at 07:26:02PM +0000, Derrick Stolee via GitGitGadget wrote: > >> From: Derrick Stolee <dstolee@microsoft.com> >> >> If a user somehow creates a directory with an asterisk (*) or backslash >> (\), then the "git sparse-checkout set" command will struggle to provide >> the correct pattern in the sparse-checkout file. When not in cone mode, >> the provided pattern is written directly into the sparse-checkout file. >> However, in cone mode we expect a list of paths to directories and then >> we convert those into patterns. >> >> Even more specifically, the goal is to always allow the following from >> the root of a repo: >> >> git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin >> >> The ls-tree command provides directory names with an unescaped asterisk. >> It also quotes the directories that contain an escaped backslash. We >> must remove these quotes, then keep the escaped backslashes. > > Do we need to document these rules somewhere? Naively I'd expect > "--stdin" to take in literal pathnames. But of course it can't represent > a path with a newline. So perhaps it makes sense to take quoted names by > default, and allow literal NUL-separated input with "-z" if anybody > wants it. This is worth thinking about the right way to describe the rules: 1. You don't _need_ quotes. They happen to come along for the ride in 'git ls-tree' so it doesn't mess up shell scripts that iterate on those entries. At least, that's why I think they are quoted. 2. If you use quotes, the first layer of quotes will be removed. How much of this needs to be documented explicitly, or how much should we say "The input format matches what we would expect from 'git ls-tree --name-only'"? Thanks, -Stolee
On Tue, Jan 14, 2020 at 05:11:03PM -0500, Derrick Stolee wrote: > > Do we need to document these rules somewhere? Naively I'd expect > > "--stdin" to take in literal pathnames. But of course it can't represent > > a path with a newline. So perhaps it makes sense to take quoted names by > > default, and allow literal NUL-separated input with "-z" if anybody > > wants it. > > This is worth thinking about the right way to describe the rules: > > 1. You don't _need_ quotes. They happen to come along for the ride in > 'git ls-tree' so it doesn't mess up shell scripts that iterate on > those entries. At least, that's why I think they are quoted. It's not just shell scripts. Without quoting, the syntax becomes ambiguous (e.g., imagine a file with a newline in it). So most Git output that shows a filename will quote it if necessary, unless NUL separators are being used. > 2. If you use quotes, the first layer of quotes will be removed. I take this to mean that anything starting with a double-quote will have the outer layer removed, and backslash escapes inside expanded. And anything without a starting double quote (even if it has internal backslash escapes!) will be taken literally. That would match how things like "update-index --index-info" work. As far as implementation, I know you're trying to keep some of the escaping, but I think it might make more sense to do use unquote_c_style() to parse the input (see update-index's use for some prior art), and then re-quote as necessary to put things into the sparse-checkout file (I guess quoting more than just quote_c_style() would do, since you need to quote glob metacharacters like '*' and probably "!"). But as much as possible, I think you'd want literal strings inside the program, and just quoting/unquoting at the edges. > How much of this needs to be documented explicitly, or how much should > we say "The input format matches what we would expect from 'git ls-tree > --name-only'"? I think it's fine to say that, and maybe call attention to the quoting. Like: The input format matches the output of `git ls-tree --name-only`. This includes interpreting pathnames that begin with a double quote (") as C-style quoted strings. Disappointingly, update-index does not seem to explain the rules anywhere. fast-import does cover it. Maybe it's something that ought to be hoisted out into gitcli(7) or similar (or maybe it has been and I just can't find it). -Peff
On 1/14/2020 5:48 PM, Jeff King wrote: > On Tue, Jan 14, 2020 at 05:11:03PM -0500, Derrick Stolee wrote: > >>> Do we need to document these rules somewhere? Naively I'd expect >>> "--stdin" to take in literal pathnames. But of course it can't represent >>> a path with a newline. So perhaps it makes sense to take quoted names by >>> default, and allow literal NUL-separated input with "-z" if anybody >>> wants it. >> >> This is worth thinking about the right way to describe the rules: >> >> 1. You don't _need_ quotes. They happen to come along for the ride in >> 'git ls-tree' so it doesn't mess up shell scripts that iterate on >> those entries. At least, that's why I think they are quoted. > > It's not just shell scripts. Without quoting, the syntax becomes > ambiguous (e.g., imagine a file with a newline in it). So most Git > output that shows a filename will quote it if necessary, unless > NUL separators are being used. Good to know. >> 2. If you use quotes, the first layer of quotes will be removed. > > I take this to mean that anything starting with a double-quote will have > the outer layer removed, and backslash escapes inside expanded. And > anything without a starting double quote (even if it has internal > backslash escapes!) will be taken literally. Hm. Perhaps you are right! The ls-tree output for the test example is: deep folder1 folder2 "zbad\\dir" zdoes*exist so the "zdoes*exist" value is not escaped. I believe the current logic does something extra: consider supplying this input to 'git sparse-checkout set --stdin': deep folder1 folder2 "zbad\\dir" zdoes\*exist then should we un-escape "\*" to "*"? Or is this not a valid input because a backslash should have been quoted into C-style quotes? The behavior in the current series allows this output that would never be written by "git ls-tree". > That would match how things like "update-index --index-info" work. > > As far as implementation, I know you're trying to keep some of the > escaping, but I think it might make more sense to do use > unquote_c_style() to parse the input (see update-index's use for some > prior art), and then re-quote as necessary to put things into the > sparse-checkout file (I guess quoting more than just quote_c_style() > would do, since you need to quote glob metacharacters like '*' and > probably "!"). But as much as possible, I think you'd want literal > strings inside the program, and just quoting/unquoting at the edges. I was playing around with this, and I think that quote_c_style() is necessary for the output, but we have a strange in-memory situation for the other escaping: we both fill the hashsets with the un-escaped data and fill the pattern list with the escaped patterns. I'll add a commit with the quote_c_style() calls during the 'list' subcommand. >> How much of this needs to be documented explicitly, or how much should >> we say "The input format matches what we would expect from 'git ls-tree >> --name-only'"? > > I think it's fine to say that, and maybe call attention to the quoting. > Like: > > The input format matches the output of `git ls-tree --name-only`. This > includes interpreting pathnames that begin with a double quote (") as > C-style quoted strings. > > Disappointingly, update-index does not seem to explain the rules > anywhere. fast-import does cover it. Maybe it's something that ought to > be hoisted out into gitcli(7) or similar (or maybe it has been and I > just can't find it). I'll start the process by using your recommended language. I noticed also that the 'set' command doesn't actually document what happens when in cone mode, so I will correct that, too. Thanks, -Stolee
On Fri, Jan 24, 2020 at 04:10:21PM -0500, Derrick Stolee wrote: > Hm. Perhaps you are right! The ls-tree output for the test example > is: > > deep > folder1 > folder2 > "zbad\\dir" > zdoes*exist > > so the "zdoes*exist" value is not escaped. I believe the current > logic does something extra: consider supplying this input to > 'git sparse-checkout set --stdin': > > deep > folder1 > folder2 > "zbad\\dir" > zdoes\*exist > > then should we un-escape "\*" to "*"? Or is this not a valid input > because a backslash should have been quoted into C-style quotes? I'd think we should not un-escape anything, because we weren't told that this was a C-style quoted string by the presence of an opening double-quote. And that's how, say, update-index behaves: $ blob=$(echo foo | git hash-object -w --stdin) $ printf '100644 %s\t%s\n' \ $blob 'just*asterisk' \ $blob 'backslash\without\quotes' \ $blob '"backslash\\with\\quotes"' | git update-index --index-info which results in: $ git ls-files "backslash\\with\\quotes" "backslash\\without\\quotes" just*asterisk [same, but without quoting] $ git ls-files -z | tr '\0' '\n' backslash\with\quotes backslash\without\quotes just*asterisk > The behavior in the current series allows this output that would > never be written by "git ls-tree". Yes, I think we'd never write that, because ls-tree would quote anything with a backslash in it, even though it's not strictly necessary. But it would be valid input to specify a file that has backslashes but not double-quotes, and I think sparse-checkout should be changed to match update-index here. > I was playing around with this, and I think that quote_c_style() is > necessary for the output, but we have a strange in-memory situation > for the other escaping: we both fill the hashsets with the un-escaped > data and fill the pattern list with the escaped patterns. Yeah, but I think that the syntactic escaping on input might not have identical rules to the escaping needed for the patterns. So it makes sense to me to handle input as a separate mechanism, get a pristine copy of what the user was trying to communicate to us, and then re-escape whatever we need to put into the pattern list. And ultimately the flow would be something like: - read input - if argument is from command-line, use it verbatim - else if reading stdin with "-z", use it verbatim - else if line starts with double-quote, unquote_c_style() - else use line verbatim - the result is a single pristine filename - fill hashset with pristine filenames - generate pattern list to write to sparse file, escaping filenames as necessary according to sparse-pattern rules Obviously you don't have a "-z" yet, but I think it's something we'd probably want in the long run. And anything coming from the command-line shouldn't need quoting to get it to us either (and so we'd need to escape before writing to the sparse file). -Peff
On 1/24/2020 4:42 PM, Jeff King wrote: > And ultimately the flow would be something like: > > - read input > - if argument is from command-line, use it verbatim > - else if reading stdin with "-z", use it verbatim > - else if line starts with double-quote, unquote_c_style() > - else use line verbatim > - the result is a single pristine filename > - fill hashset with pristine filenames > - generate pattern list to write to sparse file, escaping filenames as > necessary according to sparse-pattern rules > > Obviously you don't have a "-z" yet, but I think it's something we'd > probably want in the long run. And anything coming from the command-line > shouldn't need quoting to get it to us either (and so we'd need to > escape before writing to the sparse file). This recommendation came async with my v2, so I'll follow shortly with a v3 that uses this flow. I have something that I think works, after slightly adapting my tests, but now I need to make sure that all the patches still make sense and build cleanly. Thanks, -Stolee
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c index 3cee8ab46e..61d2c30036 100644 --- a/builtin/sparse-checkout.c +++ b/builtin/sparse-checkout.c @@ -140,6 +140,22 @@ static int update_working_directory(struct pattern_list *pl) return result; } +static char *escaped_pattern(char *pattern) +{ + char *p = pattern; + struct strbuf final = STRBUF_INIT; + + while (*p) { + if (*p == '*' || *p == '\\') + strbuf_addch(&final, '\\'); + + strbuf_addch(&final, *p); + p++; + } + + return strbuf_detach(&final, NULL); +} + static void write_cone_to_file(FILE *fp, struct pattern_list *pl) { int i; @@ -164,10 +180,11 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl) fprintf(fp, "/*\n!/*/\n"); for (i = 0; i < sl.nr; i++) { - char *pattern = sl.items[i].string; + char *pattern = escaped_pattern(sl.items[i].string); if (strlen(pattern)) fprintf(fp, "%s/\n!%s/*/\n", pattern, pattern); + free(pattern); } string_list_clear(&sl, 0); @@ -185,8 +202,9 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl) string_list_remove_duplicates(&sl, 0); for (i = 0; i < sl.nr; i++) { - char *pattern = sl.items[i].string; + char *pattern = escaped_pattern(sl.items[i].string); fprintf(fp, "%s/\n", pattern); + free(pattern); } } @@ -337,7 +355,9 @@ static void insert_recursive_pattern(struct pattern_list *pl, struct strbuf *pat { struct pattern_entry *e = xmalloc(sizeof(*e)); e->patternlen = path->len; - e->pattern = strbuf_detach(path, NULL); + e->pattern = dup_and_filter_pattern(path->buf); + strbuf_release(path); + hashmap_entry_init(&e->ent, ignore_case ? strihash(e->pattern) : @@ -369,6 +389,7 @@ static void insert_recursive_pattern(struct pattern_list *pl, struct strbuf *pat static void strbuf_to_cone_pattern(struct strbuf *line, struct pattern_list *pl) { + int i; strbuf_trim(line); strbuf_trim_trailing_dir_sep(line); @@ -376,6 +397,27 @@ static void strbuf_to_cone_pattern(struct strbuf *line, struct pattern_list *pl) if (!line->len) return; + for (i = 0; i < line->len; i++) { + if (line->buf[i] == '*') { + strbuf_insert(line, i, "\\", 1); + i++; + } + + if (line->buf[i] == '\\') { + if (i < line->len - 1 && line->buf[i + 1] == '\\') + i++; + else + strbuf_insert(line, i, "\\", 1); + + i++; + } + } + + if (line->buf[0] == '"' && line->buf[line->len - 1] == '"') { + strbuf_remove(line, 0, 1); + strbuf_remove(line, line->len - 1, 1); + } + if (line->buf[0] != '/') strbuf_insert(line, 0, "/", 1); diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh index 051c1f3bf2..3da7c10bd9 100755 --- a/t/t1091-sparse-checkout-builtin.sh +++ b/t/t1091-sparse-checkout-builtin.sh @@ -309,6 +309,9 @@ check_read_tree_errors () { REPO=$1 FILES=$2 ERRORS=$3 + git -C $REPO -c core.sparseCheckoutCone=false read-tree -mu HEAD 2>err && + test_must_be_empty err && + check_files $REPO "$FILES" && git -C $REPO read-tree -mu HEAD 2>err && if test -z "$ERRORS" then @@ -379,14 +382,28 @@ test_expect_success BSLASHPSPEC 'pattern-checks: escaped "*"' ' git -C escaped reset --hard $COMMIT && check_files escaped "a deep folder1 folder2 zbad\\dir zdoes*exist" && git -C escaped sparse-checkout init --cone && - cat >escaped/.git/info/sparse-checkout <<-\EOF && + git -C escaped sparse-checkout set zbad\\dir zdoes\*not\*exist zdoes\*exist && + cat >expect <<-\EOF && /* !/*/ /zbad\\dir/ + /zdoes\*exist/ /zdoes\*not\*exist/ + EOF + test_cmp expect escaped/.git/info/sparse-checkout && + check_read_tree_errors escaped "a zbad\\dir zdoes*exist" && + git -C escaped ls-tree -d --name-only HEAD | git -C escaped sparse-checkout set --stdin && + cat >expect <<-\EOF && + /* + !/*/ + /deep/ + /folder1/ + /folder2/ + /zbad\\dir/ /zdoes\*exist/ EOF - check_read_tree_errors escaped "a zbad\\dir zdoes*exist" + test_cmp expect escaped/.git/info/sparse-checkout && + check_files escaped "a deep folder1 folder2 zbad\\dir zdoes*exist" ' test_done