diff mbox series

[PATCH/RFC,v1,1/1] git restore -p . and precomposed unicode

Message ID 20210124151306.23185-1-tboegi@web.de (mailing list archive)
State New, archived
Headers show
Series [PATCH/RFC,v1,1/1] git restore -p . and precomposed unicode | expand

Commit Message

Torsten Bögershausen Jan. 24, 2021, 3:13 p.m. UTC
From: Torsten Bögershausen <tboegi@web.de>

The following sequence leads to a "BUG" assertion running under MacOS:

  !/bin/sh
  DIR=git-test-restore-p
  Adiarnfd=$(printf 'A\314\210')
  DIRNAME=xx${Adiarnfd}yy
  mkdir $DIR &&
  cd $DIR &&
  git init &&
  mkdir $DIRNAME &&
  cd $DIRNAME &&
  echo "Initial" >file &&
  git add file &&
  echo "One more line"  >>file &&
  echo y | git restore -p . &&
  echo "OK"

 Initialized empty Git repository in /tmp/git-test-restore-p/.git/
 BUG: pathspec.c:495: error initializing pathspec_item
 Cannot close git diff-index --cached --numstat
 [snip]


The command `git restore` is run from a directory inside a Git repo.
The Git needs to split the $CWD into 2 parts:
The path to the repo and "the rest", if any.
"The rest" becomes a "prefix" later used inside the pathspec code.

As an example, "/path/to/repo/dir-inside-repå" would determine
"/path/to/repo" as the root of the repo, the place where the
configuration file .git/config is found.

The rest becomes the prefix ("dir-inside-repå"), from where the pathspec
machinery expands the ".", more about this later.
If there is a decomposed form, (making the decomposing visible like this),
"dir-inside-rep°a" doesn't match "dir-inside-repå".

The solution is to read the config variable "core.precomposeunicode" early.
Then, if configured, precompose "prefix" (and argv) and handle the prefix
over into pathspec for expanding "." into a list of path names tracked by Git.

[1] git-bugreport-2021-01-06-1209.txt (git can't deal with special characters)
[2] https://lore.kernel.org/git/A102844A-9501-4A86-854D-E3B387D378AA@icloud.com/

Reported-by: Daniel Troger <random_n0body@icloud.com>
Helped-By: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 This may need some refinements, but we need to start somewhere...
 Are there any good ideas how to improve the commit message ?
 Should the code in git.c be "hidden" in a function somewhere else ?
 Other comments are appreciated.


compat/precompose_utf8.c     | 24 ++++++++++++++++++++++++
 compat/precompose_utf8.h     |  2 ++
 git-compat-util.h            |  8 ++++++++
 git.c                        |  9 +++++++++
 t/t3910-mac-os-precompose.sh | 15 +++++++++++++++
 5 files changed, 58 insertions(+)

--
2.30.0.155.g66e871b664

Comments

Junio C Hamano Jan. 24, 2021, 7:51 p.m. UTC | #1
tboegi@web.de writes:

> The solution is to read the config variable "core.precomposeunicode" early.

For a single command like "restore", we need to fail if it is run
outside a repository anyway, so it is OK, but code in "git.c" in
general can be called outside a repository, we may not know where
our configuration files are, though.  So I'd prefer to avoid
adding too many "do this thing early for every single command".

Where do we normally read that variable?  I see calls to
precompose_argv() on only a few codepaths

builtin/diff-files.c:38:	precompose_argv(argc, argv);
builtin/diff-index.c:28:	precompose_argv(argc, argv);
builtin/diff-tree.c:129:	precompose_argv(argc, argv);
builtin/diff.c:455:	precompose_argv(argc, argv);
builtin/submodule--helper.c:1260:	precompose_argv(diff_args.nr, diff_args.v);
parse-options.c:872:	precompose_argv(argc, argv);

I guess the reason we can get away with it is because most of the
newer commands use the parse_options() API, and the call to
precompose_argv() is used by the codepaths implementing these
commands.  And as a rule, these commands read config first before
calling parse_options(), so by the time the control reaches there,
the value of the variable may be already known.

The question is why "restore -p" is so special?  Or does this patch
mean everybody, even the ones that use parse_options() is broken?

I guess "prefix" needs to be munged for everybody, so "restore -p"
on the title of the patch is a red herring, and the problem applies
to all "git" commands---in which case, inside "git.c" would be the
right place to do so.

I wonder if everybody who calls precompoase_argv() has access to the
prefix, though.  If so, would it make more sense to extend the API
to

	precompose_argv(int argc, char **argv, char **prefix)

so that the callers only need to call a single function, without any
additional code like this patch does?

Also, as the current precompose_argv() begins like this:

        void precompose_argv(int argc, const char **argv)
        {
                int i = 0;
                const char *oldarg;
                char *newarg;
                iconv_t ic_precompose;

                if (precomposed_unicode != 1)
                        return;

and environment.c initializes the global to (-1), I wonder if we can
get away with an approach not to read the "config" anywhere outside
precompose_argv() function.  Instead, can't the above snippet become
more like:

                if (precomposed_unicode < 0)
                        precomposed_unicode = read from config;
		if (precomposed_unicode != 1)
			return;

That way, you do not even have to touch anything outside
compat/precompose_utf8.c, other than adjusting the callers to pass
the pointer to their prefix to be munged.

Namely

> +int precompose_read_config_gently(void)

This can become file-scope static.

> +{
> +	git_config_get_bool("core.precomposeunicode", &precomposed_unicode);
> +	return precomposed_unicode == 1;
> +}
>
>  void probe_utf8_pathname_composition(void)
>  {
> @@ -60,6 +65,25 @@ void probe_utf8_pathname_composition(void)
>  	strbuf_release(&path);
>  }
>
> +char *precompose_string_if_needed(const char *in)
> +{

This too.

> diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
> index 6f843d3e1a..ce82857d73 100644
> --- a/compat/precompose_utf8.h
> +++ b/compat/precompose_utf8.h
> @@ -28,6 +28,8 @@ typedef struct {
>  	struct dirent_prec_psx *dirent_nfc;
>  } PREC_DIR;
>
> +int precompose_read_config_gently(void);
> +char *precompose_string_if_needed(const char *in);
>  void precompose_argv(int argc, const char **argv);
>  void probe_utf8_pathname_composition(void);

And this can go away.

> diff --git a/git-compat-util.h b/git-compat-util.h
> index 104993b975..f34854b66f 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -252,6 +252,14 @@ typedef unsigned long uintptr_t;
>  #ifdef PRECOMPOSE_UNICODE
>  #include "compat/precompose_utf8.h"
>  #else
> +static inline int precompose_read_config_gently(void)
> +{
> +	return 0;
> +}
> +static inline char *precompose_string_if_needed(const char *in)
> +{
> +	return NULL; /* no need to precompose a string */
> +}

So do these.

> diff --git a/git.c b/git.c
> index a00a0a4d94..f09e14f733 100644
> --- a/git.c
> +++ b/git.c
> @@ -421,6 +421,15 @@ static int run_builtin(struct cmd_struct *p, int argc, const char **argv)
>  			prefix = setup_git_directory_gently(&nongit_ok);
>  		}
>
> +		if (precompose_read_config_gently()) {
> +			precompose_argv(argc, argv);
> +			if (prefix) {
> +				const char *prec_pfx;
> +					prec_pfx = precompose_string_if_needed(prefix);
> +				if (prec_pfx)
> +					prefix = prec_pfx; /* memory lost */
> +			}
> +		}

And this would move to the beginning of precompose_argv()
implementation.

Also the code we currently use to read the core.precomposeunicode
configuration variable (presumably somewhere in git_default_config()
callchain; I didn't check) can go away.

Which would be much nicer outcome, if doable (again, I didn't check).

> diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
> index 54ce19e353..bbbc50da93 100755
> --- a/t/t3910-mac-os-precompose.sh
> +++ b/t/t3910-mac-os-precompose.sh
> @@ -191,6 +191,21 @@ test_expect_failure 'handle existing decomposed filenames' '
>  	test_must_be_empty untracked
>  '
>
> +test_expect_success "unicode decomposed: git restore -p . " '
> +	DIRNAMEPWD=dir.Odiarnfc &&
> +	DIRNAMEINREPO=dir.$Adiarnfc &&
> +	export DIRNAMEPWD DIRNAMEINREPO &&
> +	git init $DIRNAMEPWD &&
> +	( cd $DIRNAMEPWD &&
> +		mkdir $DIRNAMEINREPO &&

Style:

	(
		cd $DIRNAMEPWD &&
		mkdir $DIRNAMEINREPO &&

Is "restore" the only thing that has this issue?  

I would imagine that anything that takes '.' pathspec to limit its
operation to the current subdirectory (e.g. "cd sub && git diff .")
would be affected (clarification: I am *not* hinting that other
commands need to be tested---I am however hinting to update the
proposed log message to explain either (1) this applies in general
to commands that does X, or (2) this affects only "restore -p" which
does this unusual thing Y that no other commands do).

Thanks.


> +		cd $DIRNAMEINREPO &&
> +		echo "Initial" >file &&
> +		git add file &&
> +		echo "More stuff" >>file &&
> +		echo y | git restore -p .
> +	)
> +'
> +
>  # Test if the global core.precomposeunicode stops autosensing
>  # Must be the last test case
>  test_expect_success "respect git config --global core.precomposeunicode" '
> --
> 2.30.0.155.g66e871b664
Torsten Bögershausen Jan. 25, 2021, 4:53 p.m. UTC | #2
On Sun, Jan 24, 2021 at 11:51:33AM -0800, Junio C Hamano wrote:

Thanks for the fast review, much appreciated.

The short answer: There is more to be done, V2 is coming somewhen.
For the longer story, please see inline.

> tboegi@web.de writes:
>
> > The solution is to read the config variable "core.precomposeunicode" early.
>
> For a single command like "restore", we need to fail if it is run
> outside a repository anyway, so it is OK, but code in "git.c" in
> general can be called outside a repository, we may not know where
> our configuration files are, though.  So I'd prefer to avoid
> adding too many "do this thing early for every single command".
>
> Where do we normally read that variable?  I see calls to
> precompose_argv() on only a few codepaths
>
> builtin/diff-files.c:38:	precompose_argv(argc, argv);
> builtin/diff-index.c:28:	precompose_argv(argc, argv);
> builtin/diff-tree.c:129:	precompose_argv(argc, argv);
> builtin/diff.c:455:	precompose_argv(argc, argv);
> builtin/submodule--helper.c:1260:	precompose_argv(diff_args.nr, diff_args.v);
> parse-options.c:872:	precompose_argv(argc, argv);
>
> I guess the reason we can get away with it is because most of the
> newer commands use the parse_options() API, and the call to
> precompose_argv() is used by the codepaths implementing these
> commands.  And as a rule, these commands read config first before
> calling parse_options(), so by the time the control reaches there,
> the value of the variable may be already known.

Yes.

>
> The question is why "restore -p" is so special?  Or does this patch
> mean everybody, even the ones that use parse_options() is broken?

One special thing is that it uses pathspec, `git restore -p .`
and $CWD is inside a decomposed directory.

There is more work to be done, as it seems.
Running `git status . ` in an ASCII based repo (ignore the debug prints)

  user@mac:/tmp/210125-precomp/A.dir>  git status .
  git.c/handle_builtin:699 cmd="status" len=6
  git.c:428 prefix="A.dir/" (6)
  git.c:434 prec_pfx="(null)" (4294967295)
  builtin/commit.c:1401 prefix="A.dir/" (6)
  On branch master
  Changes not staged for commit:
    (use "git add <file>..." to update what will be committed)
      (use "git restore <file>..." to discard changes in working directory)
              modified:   O.file

-------------------------
But doing the same test within a decomosed directory:
  user@mac:/tmp/210125-decomp/Ä.dir> git status .
  git.c/handle_builtin:699 cmd="status" len=6
  git.c:428 prefix="Ä.dir/" (8)
  git.c:434 prec_pfx="Ä.dir/" (7)
  builtin/commit.c:1401 prefix="Ä.dir/" (7)
  On branch master
  nothing to commit, working tree clean
---------
But using ".." as the pathspec works:

  user@mac:/tmp/210125-decomp/Ä.dir> git status ..
  git.c/handle_builtin:699 cmd="status" len=6
  git.c:428 prefix="Ä.dir/" (8)
  git.c:434 prec_pfx="Ä.dir/" (7)
  builtin/commit.c:1401 prefix="Ä.dir/" (7)
  On branch master
  Changes not staged for commit:
    (use "git add <file>..." to update what will be committed)
      (use "git restore <file>..." to discard changes in working directory)
              modified:   "../A\314\210.dir/\303\226.file"

  no changes added to commit (use "git add" and/or "git commit -a")
--------------

So in that sense, it seems as if `git restore` is special because
the patch helps.

More patches are needed asseen above.

And the rest of the suggestions makes all sense.
diff mbox series

Patch

diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index 136250fbf6..06e371660f 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -36,6 +36,11 @@  static size_t has_non_ascii(const char *s, size_t maxlen, size_t *strlen_c)
 	return ret;
 }

+int precompose_read_config_gently(void)
+{
+	git_config_get_bool("core.precomposeunicode", &precomposed_unicode);
+	return precomposed_unicode == 1;
+}

 void probe_utf8_pathname_composition(void)
 {
@@ -60,6 +65,25 @@  void probe_utf8_pathname_composition(void)
 	strbuf_release(&path);
 }

+char *precompose_string_if_needed(const char *in)
+{
+	size_t inlen = strlen(in);
+	size_t outlen;
+	char *out = NULL;
+	if ((has_non_ascii(in, inlen, NULL)) && (precomposed_unicode == 1)) {
+		int saved_errno = errno;
+		out = reencode_string_len(in, inlen,
+					  repo_encoding, path_encoding,
+					  &outlen);
+		if (out && outlen == inlen && !memcmp(in, out, outlen)) {
+			/* strings are identical: no need to return a new one */
+			free(out);
+			out = NULL;
+		}
+		errno = saved_errno;
+	}
+	return out;
+}

 void precompose_argv(int argc, const char **argv)
 {
diff --git a/compat/precompose_utf8.h b/compat/precompose_utf8.h
index 6f843d3e1a..ce82857d73 100644
--- a/compat/precompose_utf8.h
+++ b/compat/precompose_utf8.h
@@ -28,6 +28,8 @@  typedef struct {
 	struct dirent_prec_psx *dirent_nfc;
 } PREC_DIR;

+int precompose_read_config_gently(void);
+char *precompose_string_if_needed(const char *in);
 void precompose_argv(int argc, const char **argv);
 void probe_utf8_pathname_composition(void);

diff --git a/git-compat-util.h b/git-compat-util.h
index 104993b975..f34854b66f 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -252,6 +252,14 @@  typedef unsigned long uintptr_t;
 #ifdef PRECOMPOSE_UNICODE
 #include "compat/precompose_utf8.h"
 #else
+static inline int precompose_read_config_gently(void)
+{
+	return 0;
+}
+static inline char *precompose_string_if_needed(const char *in)
+{
+	return NULL; /* no need to precompose a string */
+}
 static inline void precompose_argv(int argc, const char **argv)
 {
 	; /* nothing */
diff --git a/git.c b/git.c
index a00a0a4d94..f09e14f733 100644
--- a/git.c
+++ b/git.c
@@ -421,6 +421,15 @@  static int run_builtin(struct cmd_struct *p, int argc, const char **argv)
 			prefix = setup_git_directory_gently(&nongit_ok);
 		}

+		if (precompose_read_config_gently()) {
+			precompose_argv(argc, argv);
+			if (prefix) {
+				const char *prec_pfx;
+					prec_pfx = precompose_string_if_needed(prefix);
+				if (prec_pfx)
+					prefix = prec_pfx; /* memory lost */
+			}
+		}
 		if (use_pager == -1 && p->option & (RUN_SETUP | RUN_SETUP_GENTLY) &&
 		    !(p->option & DELAY_PAGER_CONFIG))
 			use_pager = check_pager_config(p->cmd);
diff --git a/t/t3910-mac-os-precompose.sh b/t/t3910-mac-os-precompose.sh
index 54ce19e353..bbbc50da93 100755
--- a/t/t3910-mac-os-precompose.sh
+++ b/t/t3910-mac-os-precompose.sh
@@ -191,6 +191,21 @@  test_expect_failure 'handle existing decomposed filenames' '
 	test_must_be_empty untracked
 '

+test_expect_success "unicode decomposed: git restore -p . " '
+	DIRNAMEPWD=dir.Odiarnfc &&
+	DIRNAMEINREPO=dir.$Adiarnfc &&
+	export DIRNAMEPWD DIRNAMEINREPO &&
+	git init $DIRNAMEPWD &&
+	( cd $DIRNAMEPWD &&
+		mkdir $DIRNAMEINREPO &&
+		cd $DIRNAMEINREPO &&
+		echo "Initial" >file &&
+		git add file &&
+		echo "More stuff" >>file &&
+		echo y | git restore -p .
+	)
+'
+
 # Test if the global core.precomposeunicode stops autosensing
 # Must be the last test case
 test_expect_success "respect git config --global core.precomposeunicode" '