Poison gettext with the Ook language
diff mbox series

Message ID 20181022153633.31757-1-pclouds@gmail.com
State New
Headers show
Series
  • Poison gettext with the Ook language
Related show

Commit Message

Duy Nguyen Oct. 22, 2018, 3:36 p.m. UTC
The current gettext() function just replaces all strings with
'# GETTEXT POISON #' including format strings and hides the things
that we should be allowed to grep (like branch names, or some other
codes) even when gettext is poisoned.

This patch implements the poisoned _() with a universal and totally
legit language called Ook [1]. We could actually grep stuff even in
with this because format strings are preserved.

Long term, we could implement an "ook translator" for test_i18ngrep
and friends so that they translate English to Ook, allowing us to
match full text while making sure the text in the code is still marked
for translation.

[1] https://en.wikipedia.org/wiki/Unseen_University#Librarian

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 This started out as something fun to do while running the test suite
 last weekend. But it turns out actually working! If this patch ends
 up in git.git, the Librarian would be so proud!

 gettext.c       | 54 +++++++++++++++++++++++++++++++++++++++++++++++++
 gettext.h       |  7 ++++---
 t/lib-rebase.sh |  2 +-
 3 files changed, 59 insertions(+), 4 deletions(-)

Comments

SZEDER Gábor Oct. 22, 2018, 8:22 p.m. UTC | #1
On Mon, Oct 22, 2018 at 05:36:33PM +0200, Nguyễn Thái Ngọc Duy wrote:
> The current gettext() function just replaces all strings with
> '# GETTEXT POISON #' including format strings and hides the things
> that we should be allowed to grep (like branch names, or some other
> codes) even when gettext is poisoned.
> 
> This patch implements the poisoned _() with a universal and totally
> legit language called Ook [1]. We could actually grep stuff even in
> with this because format strings are preserved.

Once upon a time a GETTEXT_POISON build job failed on me, and the
error message:

  error: # GETTEXT POISON #

was not particularly useful.  Ook wouldn't help with that...

So I came up with the following couple of patches that implement a
"scrambled" format that makes the poisoned output legible for humans
but still gibberish for machine consumption (i.e. grep-ing the text
part would still fail):

  error:  U.n.a.b.l.e. .t.o. .c.r.e.a.t.e. .'./home/szeder/src/git/t/trash directory.t1404-update-ref-errors/.git/packed-refs...l.o.c.k.'.:. .File exists...

I have been running GETTEXT_POISON builds with this series for some
months now, but haven't submitted it yet, because I haven't decided
yet whether including strings (paths, refs, etc.) in the output as
they are is a feature or a flaw.  And because it embarrassingly leaks
every single translated string... :)


SZEDER Gábor (8):
  test-lib.sh: preserve GIT_GETTEXT_POISON from the environment
  gettext: don't poison if GIT_GETTEXT_POISON is set but empty
  lib-rebase: loosen GETTEXT_POISON check in fake editor
  gettext: #ifdef away GETTEXT POISON-related code from _() and Q_()
  gettext: put "# GETTEXT POISON #" string literal into a macro
  gettext: use an enum for the mode of GETTEXT POISONing
  gettext: introduce GIT_GETTEXT_POISON=scrambled
  travis-ci: run GETTEXT POISON build job in scrambled mode, too

 Makefile                  |  2 +-
 ci/lib-travisci.sh        |  1 +
 ci/run-build-and-tests.sh | 10 +++++--
 gettext.c                 | 63 +++++++++++++++++++++++++++++++++++----
 gettext.h                 | 33 +++++++++++++++-----
 t/lib-rebase.sh           |  2 +-
 t/test-lib.sh             | 17 ++++++++++-
 7 files changed, 110 insertions(+), 18 deletions(-)
Johannes Schindelin Oct. 22, 2018, 8:52 p.m. UTC | #2
Hi Duy,

On Mon, 22 Oct 2018, Nguyễn Thái Ngọc Duy wrote:

> The current gettext() function just replaces all strings with
> '# GETTEXT POISON #' including format strings and hides the things
> that we should be allowed to grep (like branch names, or some other
> codes) even when gettext is poisoned.
> 
> This patch implements the poisoned _() with a universal and totally
> legit language called Ook [1]. We could actually grep stuff even in
> with this because format strings are preserved.
> 
> Long term, we could implement an "ook translator" for test_i18ngrep
> and friends so that they translate English to Ook, allowing us to
> match full text while making sure the text in the code is still marked
> for translation.
> 
> [1] https://en.wikipedia.org/wiki/Unseen_University#Librarian
> 
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  This started out as something fun to do while running the test suite
>  last weekend. But it turns out actually working! If this patch ends
>  up in git.git, the Librarian would be so proud!

Cute.
Dscho

> 
>  gettext.c       | 54 +++++++++++++++++++++++++++++++++++++++++++++++++
>  gettext.h       |  7 ++++---
>  t/lib-rebase.sh |  2 +-
>  3 files changed, 59 insertions(+), 4 deletions(-)
> 
> diff --git a/gettext.c b/gettext.c
> index 7272771c8e..29901e1ddd 100644
> --- a/gettext.c
> +++ b/gettext.c
> @@ -56,6 +56,60 @@ int use_gettext_poison(void)
>  }
>  #endif
>  
> +const char *gettext_poison(const char *msgid)
> +{
> +	/*
> +	 * gettext() returns a string that is always valid. We would
> +	 * need a hash map for that but let's stay simple and keep the
> +	 * last 64 gettext() results. Should be more than enough.
> +	 */
> +	static char *bufs[64];
> +	static int i;
> +	struct strbuf sb = STRBUF_INIT;
> +	char *buf;
> +	const char *p;
> +	const char *type_specifiers = "diouxXeEfFgGaAcsCSpnm%";
> +
> +	if (!strchr(msgid, '%'))
> +		return "Eek!";
> +
> +	p = msgid;
> +	while (*p) {
> +		const char *type;
> +		switch (*p) {
> +		case '%':
> +			/*
> +			 * No strict parsing. We simply look for the end of a
> +			 * format string
> +			 */
> +			type = p + 1;
> +			while (*type && !strchr(type_specifiers, *type))
> +				type++;
> +			if (*type)
> +				type++;
> +			strbuf_add(&sb, p, (int)(type - p));
> +			p = type;
> +			break;
> +		default:
> +			if (!isalpha(*p)) {
> +				strbuf_addch(&sb, *p);
> +				p++;
> +				break;
> +			}
> +			if (isupper(*p))
> +				strbuf_addstr(&sb, "Ook");
> +			else
> +				strbuf_addstr(&sb, "ook");
> +			while (isalpha(*p))
> +				p++;
> +		}
> +	}
> +	buf = bufs[(i++) % ARRAY_SIZE(bufs)];
> +	free(buf);
> +	buf = strbuf_detach(&sb, NULL);
> +	return buf;
> +}
> +
>  #ifndef NO_GETTEXT
>  static int test_vsnprintf(const char *fmt, ...)
>  {
> diff --git a/gettext.h b/gettext.h
> index 7eee64a34f..dc9851a06a 100644
> --- a/gettext.h
> +++ b/gettext.h
> @@ -41,8 +41,9 @@ static inline int gettext_width(const char *s)
>  }
>  #endif
>  
> +const char *gettext_poison(const char *);
>  #ifdef GETTEXT_POISON
> -extern int use_gettext_poison(void);
> +int use_gettext_poison(void);
>  #else
>  #define use_gettext_poison() 0
>  #endif
> @@ -51,14 +52,14 @@ static inline FORMAT_PRESERVING(1) const char *_(const char *msgid)
>  {
>  	if (!*msgid)
>  		return "";
> -	return use_gettext_poison() ? "# GETTEXT POISON #" : gettext(msgid);
> +	return use_gettext_poison() ? gettext_poison(msgid) : gettext(msgid);
>  }
>  
>  static inline FORMAT_PRESERVING(1) FORMAT_PRESERVING(2)
>  const char *Q_(const char *msgid, const char *plu, unsigned long n)
>  {
>  	if (use_gettext_poison())
> -		return "# GETTEXT POISON #";
> +		return gettext_poison(msgid);
>  	return ngettext(msgid, plu, n);
>  }
>  
> diff --git a/t/lib-rebase.sh b/t/lib-rebase.sh
> index 2ca9fb69d6..1e8440e935 100644
> --- a/t/lib-rebase.sh
> +++ b/t/lib-rebase.sh
> @@ -29,7 +29,7 @@ set_fake_editor () {
>  	*/COMMIT_EDITMSG)
>  		test -z "$EXPECT_HEADER_COUNT" ||
>  			test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# This is a combination of \(.*\) commits\./\1/p' < "$1")" ||
> -			test "# # GETTEXT POISON #" = "$(sed -n '1p' < "$1")" ||
> +			test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# Ook ook ook ook ook \(.*\) ook\./\1/p' < "$1")" ||
>  			exit
>  		test -z "$FAKE_COMMIT_MESSAGE" || echo "$FAKE_COMMIT_MESSAGE" > "$1"
>  		test -z "$FAKE_COMMIT_AMEND" || echo "$FAKE_COMMIT_AMEND" >> "$1"
> -- 
> 2.19.1.647.g708186aaf9
> 
>
Ævar Arnfjörð Bjarmason Oct. 22, 2018, 9:09 p.m. UTC | #3
On Mon, Oct 22 2018, Nguyễn Thái Ngọc Duy wrote:

> The current gettext() function just replaces all strings with
> '# GETTEXT POISON #' including format strings and hides the things
> that we should be allowed to grep (like branch names, or some other
> codes) even when gettext is poisoned.
>
> This patch implements the poisoned _() with a universal and totally
> legit language called Ook [1]. We could actually grep stuff even in
> with this because format strings are preserved.
>
> Long term, we could implement an "ook translator" for test_i18ngrep
> and friends so that they translate English to Ook, allowing us to
> match full text while making sure the text in the code is still marked
> for translation.

Replying to both this patch, and SZEDER's series for brevity. Thanks
both for working on this. It's obvious that this stuff needs some
polishing, and it's great that's being done, and sorry for my part of
having left this in the current state.

a)

Both this patch and SZEDER's 7/8 are addressing the issue that the
poison mode doesn't preserve format specifiers.

I haven't tried to dig this up in the list archive, and maybe it only
exists in my mind, but I seem to remember that this was explicitly
discussed as a goal at the time.

I.e. we were expecting the lack of this to break tests, and that was
considered a feature as it would spot plumbing messages we shouldn't be
translating.

However, it's my opinion that this was just a bad idea to begin with,
I've CC'd a couple of prolific markers of i18n from my log grepping (and
feel free to add more) to see if they disagre.

I think it probably helped me a bit in the beginning when i18n was
bootstrapping and there was a *lot* to mark for translation, but we've
long since stabilized and aren't doing that anymore, so it's much easier
to just review the patches to see if they translate plumbing.

All of which is to say that I think something like your patch here is
fine to just accept, and the only improvement I'd suggest is some note
in the commit message saying that this was always intentional, but
nobody can name a case where it helped, so let's just drop it.

In SZEDER's case that we shouldn't have GIT_GETTEXT_POISON=scrambled,
let's just make it boolean and make "scrambled" (i.e. preserving format
specifiecs) the default.

b)

SZEDER's series, and your patch (although it's smaller) still preserve
the notion that there's such a thing as a GETTEXT_POISON *build* and you
actually need to compile git with that flag to make use of it.

This, as I recall, was just paranoia on my part back in 2010/2011. I
wanted to be able to present a version of this i18n stuff where people
who didn't like it could be assured that if they didn't opt-in to it
they wouldn't get any of the code (sans a few trivial and obviously
correct wrapper functions).

So I think the only reason to keep it compile-time is performance, but I
don't think that matters. It's not like we're printing gigabytes of _()
formatted output. Everything where formatting matters is plumbing which
doesn't use this API. These messages are always for human consumption.

So shouldn't we just drop this notion of GETTEXT_POISON at compile-time
entirely, and make GIT_GETTEXT_POISON=<boolean> a test flag that all
builds of git are aware of? I think so.
Ævar Arnfjörð Bjarmason Oct. 22, 2018, 9:46 p.m. UTC | #4
On Mon, Oct 22 2018, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Oct 22 2018, Nguyễn Thái Ngọc Duy wrote:
>
>> The current gettext() function just replaces all strings with
>> '# GETTEXT POISON #' including format strings and hides the things
>> that we should be allowed to grep (like branch names, or some other
>> codes) even when gettext is poisoned.
>>
>> This patch implements the poisoned _() with a universal and totally
>> legit language called Ook [1]. We could actually grep stuff even in
>> with this because format strings are preserved.
>>
>> Long term, we could implement an "ook translator" for test_i18ngrep
>> and friends so that they translate English to Ook, allowing us to
>> match full text while making sure the text in the code is still marked
>> for translation.
>
> Replying to both this patch, and SZEDER's series for brevity. Thanks
> both for working on this. It's obvious that this stuff needs some
> polishing, and it's great that's being done, and sorry for my part of
> having left this in the current state.
>
> a)
>
> Both this patch and SZEDER's 7/8 are addressing the issue that the
> poison mode doesn't preserve format specifiers.
>
> I haven't tried to dig this up in the list archive, and maybe it only
> exists in my mind, but I seem to remember that this was explicitly
> discussed as a goal at the time.
>
> I.e. we were expecting the lack of this to break tests, and that was
> considered a feature as it would spot plumbing messages we shouldn't be
> translating.
>
> However, it's my opinion that this was just a bad idea to begin with,
> I've CC'd a couple of prolific markers of i18n from my log grepping (and
> feel free to add more) to see if they disagre.
>
> I think it probably helped me a bit in the beginning when i18n was
> bootstrapping and there was a *lot* to mark for translation, but we've
> long since stabilized and aren't doing that anymore, so it's much easier
> to just review the patches to see if they translate plumbing.
>
> All of which is to say that I think something like your patch here is
> fine to just accept, and the only improvement I'd suggest is some note
> in the commit message saying that this was always intentional, but
> nobody can name a case where it helped, so let's just drop it.
>
> In SZEDER's case that we shouldn't have GIT_GETTEXT_POISON=scrambled,
> let's just make it boolean and make "scrambled" (i.e. preserving format
> specifiecs) the default.

[...]

> b)
>
> SZEDER's series, and your patch (although it's smaller) still preserve
> the notion that there's such a thing as a GETTEXT_POISON *build* and you
> actually need to compile git with that flag to make use of it.
>
> This, as I recall, was just paranoia on my part back in 2010/2011. I
> wanted to be able to present a version of this i18n stuff where people
> who didn't like it could be assured that if they didn't opt-in to it
> they wouldn't get any of the code (sans a few trivial and obviously
> correct wrapper functions).
>
> So I think the only reason to keep it compile-time is performance, but I
> don't think that matters. It's not like we're printing gigabytes of _()
> formatted output. Everything where formatting matters is plumbing which
> doesn't use this API. These messages are always for human consumption.
>
> So shouldn't we just drop this notion of GETTEXT_POISON at compile-time
> entirely, and make GIT_GETTEXT_POISON=<boolean> a test flag that all
> builds of git are aware of? I think so.

Something like this, untested except I compiled it and ran one test
with/without GIT_GETTEXT_POISON, so it may have bugs:

 Makefile                | 10 +---------
 gettext.c               |  4 +---
 gettext.h               |  4 ----
 po/README               | 13 ++++---------
 t/README                |  5 +++++
 t/helper/test-tool.c    |  1 +
 t/helper/test-tool.h    |  1 +
 t/test-lib-functions.sh |  8 ++++----
 t/test-lib.sh           | 12 +++---------
 9 files changed, 20 insertions(+), 38 deletions(-)

diff --git a/Makefile b/Makefile
index d18ab0fe78..78190ee9d9 100644
--- a/Makefile
+++ b/Makefile
@@ -362,11 +362,6 @@ all::
 # Define HAVE_DEV_TTY if your system can open /dev/tty to interact with the
 # user.
 #
-# Define GETTEXT_POISON if you are debugging the choice of strings marked
-# for translation.  In a GETTEXT_POISON build, you can turn all strings marked
-# for translation into gibberish by setting the GIT_GETTEXT_POISON variable
-# (to any value) in your environment.
-#
 # Define JSMIN to point to JavaScript minifier that functions as
 # a filter to have gitweb.js minified.
 #
@@ -714,6 +709,7 @@ TEST_BUILTINS_OBJS += test-dump-split-index.o
 TEST_BUILTINS_OBJS += test-dump-untracked-cache.o
 TEST_BUILTINS_OBJS += test-example-decorate.o
 TEST_BUILTINS_OBJS += test-genrandom.o
+TEST_BUILTINS_OBJS += test-gettext-poison.o
 TEST_BUILTINS_OBJS += test-hashmap.o
 TEST_BUILTINS_OBJS += test-index-version.o
 TEST_BUILTINS_OBJS += test-json-writer.o
@@ -1439,9 +1435,6 @@ endif
 ifdef NO_SYMLINK_HEAD
 	BASIC_CFLAGS += -DNO_SYMLINK_HEAD
 endif
-ifdef GETTEXT_POISON
-	BASIC_CFLAGS += -DGETTEXT_POISON
-endif
 ifdef NO_GETTEXT
 	BASIC_CFLAGS += -DNO_GETTEXT
 	USE_GETTEXT_SCHEME ?= fallthrough
@@ -2591,7 +2584,6 @@ ifdef GIT_TEST_CMP_USE_COPIED_CONTEXT
 	@echo GIT_TEST_CMP_USE_COPIED_CONTEXT=YesPlease >>$@+
 endif
 	@echo NO_GETTEXT=\''$(subst ','\'',$(subst ','\'',$(NO_GETTEXT)))'\' >>$@+
-	@echo GETTEXT_POISON=\''$(subst ','\'',$(subst ','\'',$(GETTEXT_POISON)))'\' >>$@+
 ifdef GIT_PERF_REPEAT_COUNT
 	@echo GIT_PERF_REPEAT_COUNT=\''$(subst ','\'',$(subst ','\'',$(GIT_PERF_REPEAT_COUNT)))'\' >>$@+
 endif
diff --git a/gettext.c b/gettext.c
index 7272771c8e..d71e394284 100644
--- a/gettext.c
+++ b/gettext.c
@@ -46,15 +46,13 @@ const char *get_preferred_languages(void)
 	return NULL;
 }

-#ifdef GETTEXT_POISON
 int use_gettext_poison(void)
 {
 	static int poison_requested = -1;
 	if (poison_requested == -1)
-		poison_requested = getenv("GIT_GETTEXT_POISON") ? 1 : 0;
+		poison_requested = git_env_bool("GIT_GETTEXT_POISON", 0);
 	return poison_requested;
 }
-#endif

 #ifndef NO_GETTEXT
 static int test_vsnprintf(const char *fmt, ...)
diff --git a/gettext.h b/gettext.h
index 7eee64a34f..4c492d9f57 100644
--- a/gettext.h
+++ b/gettext.h
@@ -41,11 +41,7 @@ static inline int gettext_width(const char *s)
 }
 #endif

-#ifdef GETTEXT_POISON
 extern int use_gettext_poison(void);
-#else
-#define use_gettext_poison() 0
-#endif

 static inline FORMAT_PRESERVING(1) const char *_(const char *msgid)
 {
diff --git a/po/README b/po/README
index fef4c0f0b5..9dad277e33 100644
--- a/po/README
+++ b/po/README
@@ -289,16 +289,11 @@ something in the test suite might still depend on the US English
 version of the strings, e.g. to grep some error message or other
 output.

-To smoke out issues like these Git can be compiled with gettext poison
-support, at the top-level:
+To smoke out issues like these Git tested with a translation mode that
+emits gibberish on every call to gettext. To use it run the test suite
+with it, e.g.:

-    make GETTEXT_POISON=YesPlease
-
-That'll give you a git which emits gibberish on every call to
-gettext. It's obviously not meant to be installed, but you should run
-the test suite with it:
-
-    cd t && prove -j 9 ./t[0-9]*.sh
+    cd t && GIT_GETTEXT_POISON=true prove -j 9 ./t[0-9]*.sh

 If tests break with it you should inspect them manually and see if
 what you're translating is sane, i.e. that you're not translating
diff --git a/t/README b/t/README
index 8847489640..b3b2fa0b11 100644
--- a/t/README
+++ b/t/README
@@ -301,6 +301,11 @@ that cannot be easily covered by a few specific test cases. These
 could be enabled by running the test suite with correct GIT_TEST_
 environment set.

+GIT_GETTEXT_POISON=<boolean> turns all strings marked for translation
+into gibberish. Used for spotting those tests that need to be marked
+with a C_LOCALE_OUTPUT prerequisite when adding more strings for
+translation. See "Testing marked strings" in po/README for details.
+
 GIT_TEST_SPLIT_INDEX=<boolean> forces split-index mode on the whole
 test suite. Accept any boolean values that are accepted by git-config.

diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 6b5836dc1b..3e75672a37 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -19,6 +19,7 @@ static struct test_cmd cmds[] = {
 	{ "dump-untracked-cache", cmd__dump_untracked_cache },
 	{ "example-decorate", cmd__example_decorate },
 	{ "genrandom", cmd__genrandom },
+	{ "gettext-poison", cmd__gettext_poison },
 	{ "hashmap", cmd__hashmap },
 	{ "index-version", cmd__index_version },
 	{ "json-writer", cmd__json_writer },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index e4890566da..04f033b7fc 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -15,6 +15,7 @@ int cmd__dump_split_index(int argc, const char **argv);
 int cmd__dump_untracked_cache(int argc, const char **argv);
 int cmd__example_decorate(int argc, const char **argv);
 int cmd__genrandom(int argc, const char **argv);
+int cmd__gettext_poison(int argc, const char **argv);
 int cmd__hashmap(int argc, const char **argv);
 int cmd__index_version(int argc, const char **argv);
 int cmd__json_writer(int argc, const char **argv);
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 78d8c3783b..4e8683addd 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -755,16 +755,16 @@ test_cmp_bin() {

 # Use this instead of test_cmp to compare files that contain expected and
 # actual output from git commands that can be translated.  When running
-# under GETTEXT_POISON this pretends that the command produced expected
+# under GIT_GETTEXT_POISON this pretends that the command produced expected
 # results.
 test_i18ncmp () {
-	test -n "$GETTEXT_POISON" || test_cmp "$@"
+	! test_have_prereq C_LOCALE_OUTPUT || test_cmp "$@"
 }

 # Use this instead of "grep expected-string actual" to see if the
 # output from a git command that can be translated either contains an
 # expected string, or does not contain an unwanted one.  When running
-# under GETTEXT_POISON this pretends that the command produced expected
+# under GIT_GETTEXT_POISON this pretends that the command produced expected
 # results.
 test_i18ngrep () {
 	eval "last_arg=\${$#}"
@@ -779,7 +779,7 @@ test_i18ngrep () {
 		error "bug in the test script: too few parameters to test_i18ngrep"
 	fi

-	if test -n "$GETTEXT_POISON"
+	if ! test_have_prereq C_LOCALE_OUTPUT
 	then
 		# pretend success
 		return 0
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 897e6fcc94..f82d2149a2 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -105,6 +105,7 @@ unset VISUAL EMAIL LANGUAGE COLUMNS $("$PERL_PATH" -e '
 		TRACE
 		DEBUG
 		TEST
+		GETTEXT_POISON
 		.*_TEST
 		PROVE
 		VALGRIND
@@ -1104,15 +1105,8 @@ test -n "$USE_LIBPCRE1" && test_set_prereq LIBPCRE1
 test -n "$USE_LIBPCRE2" && test_set_prereq LIBPCRE2
 test -z "$NO_GETTEXT" && test_set_prereq GETTEXT

-# Can we rely on git's output in the C locale?
-if test -n "$GETTEXT_POISON"
-then
-	GIT_GETTEXT_POISON=YesPlease
-	export GIT_GETTEXT_POISON
-	test_set_prereq GETTEXT_POISON
-else
-	test_set_prereq C_LOCALE_OUTPUT
-fi
+test_lazy_prereq GETTEXT_POISON 'test-tool gettext-poison'
+test_lazy_prereq C_LOCALE_OUTPUT '! test-tool gettext-poison'

 if test -z "$GIT_TEST_CHECK_CACHE_TREE"
 then
Junio C Hamano Oct. 22, 2018, 11:04 p.m. UTC | #5
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> So I think the only reason to keep it compile-time is performance, but I
> don't think that matters. It's not like we're printing gigabytes of _()
> formatted output. Everything where formatting matters is plumbing which
> doesn't use this API. These messages are always for human consumption.
>
> So shouldn't we just drop this notion of GETTEXT_POISON at compile-time
> entirely, and make GIT_GETTEXT_POISON=<boolean> a test flag that all
> builds of git are aware of? I think so.

Haven't thought things through, but that sounds like a good thing to
aim for.  Keep the ideas coming...
Johannes Schindelin Oct. 23, 2018, 9:30 a.m. UTC | #6
Hi Ævar,

On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote:

> So I think the only reason to keep it [GETTEXT_POISON] compile-time is
> performance, but I don't think that matters. It's not like we're
> printing gigabytes of _() formatted output. Everything where formatting
> matters is plumbing which doesn't use this API. These messages are
> always for human consumption.

Well, let's make sure that your impression is correct before going too
far. I, too, had the impression that gettext cannot possibly be expensive,
especifally in Git for Windows' settings, where we do not even ship
translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid
initialization if the locale dir is not present, 2018-04-21):

	The runtime of a simple `git.exe version` call on Windows is
	currently dominated by the gettext setup, adding a whopping ~150ms
	to the ~210ms total.

I would be in favor of your change to make this a runtime option, of
course, as long as it does not affect performance greatly (in particular
on Windows, where we fight an uphill battle to make Git faster).

Ciao,
Dscho
Ævar Arnfjörð Bjarmason Oct. 23, 2018, 10:17 a.m. UTC | #7
On Tue, Oct 23 2018, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote:
>
>> So I think the only reason to keep it [GETTEXT_POISON] compile-time is
>> performance, but I don't think that matters. It's not like we're
>> printing gigabytes of _() formatted output. Everything where formatting
>> matters is plumbing which doesn't use this API. These messages are
>> always for human consumption.
>
> Well, let's make sure that your impression is correct before going too
> far. I, too, had the impression that gettext cannot possibly be expensive,
> especifally in Git for Windows' settings, where we do not even ship
> translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid
> initialization if the locale dir is not present, 2018-04-21):
>
> 	The runtime of a simple `git.exe version` call on Windows is
> 	currently dominated by the gettext setup, adding a whopping ~150ms
> 	to the ~210ms total.
>
> I would be in favor of your change to make this a runtime option, of
> course, as long as it does not affect performance greatly (in particular
> on Windows, where we fight an uphill battle to make Git faster).

How expensive gettext() may or may not be isn't relevant to the
GETTEXT_POISON compile-time option.

The effect of what I'm suggesting here, and which my WIP patch in
<875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a
one-time getenv() for each process that prints a _() message that we
aren't doing now, and for each message call a function that would check
a boolean "are we in poison mode" static global variable.

Perhaps some of that's expensive on Windows, but given the recent
patches by Microsoft employees to add GIT_TEST_* env options I assumed
not, but in any case it won't have anything to do with how expensive
gettext may or may not be, you'll already be paying that cost now (or
not, with NO_GETTEXT).
Johannes Schindelin Oct. 23, 2018, 11:07 a.m. UTC | #8
Hi Ævar,

On Tue, 23 Oct 2018, Ævar Arnfjörð Bjarmason wrote:

> On Tue, Oct 23 2018, Johannes Schindelin wrote:
> 
> > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote:
> >
> >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is
> >> performance, but I don't think that matters. It's not like we're
> >> printing gigabytes of _() formatted output. Everything where formatting
> >> matters is plumbing which doesn't use this API. These messages are
> >> always for human consumption.
> >
> > Well, let's make sure that your impression is correct before going too
> > far. I, too, had the impression that gettext cannot possibly be expensive,
> > especifally in Git for Windows' settings, where we do not even ship
> > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid
> > initialization if the locale dir is not present, 2018-04-21):
> >
> > 	The runtime of a simple `git.exe version` call on Windows is
> > 	currently dominated by the gettext setup, adding a whopping ~150ms
> > 	to the ~210ms total.
> >
> > I would be in favor of your change to make this a runtime option, of
> > course, as long as it does not affect performance greatly (in particular
> > on Windows, where we fight an uphill battle to make Git faster).
> 
> How expensive gettext() may or may not be isn't relevant to the
> GETTEXT_POISON compile-time option.
> 
> The effect of what I'm suggesting here, and which my WIP patch in
> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a
> one-time getenv() for each process that prints a _() message that we
> aren't doing now, and for each message call a function that would check
> a boolean "are we in poison mode" static global variable.

Yep, you are right, we are *already* going through _() as a function,
whether gettext() is initialized or not. My bad.

> Perhaps some of that's expensive on Windows, but given the recent
> patches by Microsoft employees to add GIT_TEST_* env options I assumed
> not, but in any case it won't have anything to do with how expensive
> gettext may or may not be, you'll already be paying that cost now (or
> not, with NO_GETTEXT).

Indeed, we want to measure performance better, and that's what those
environment variables are for.

Thanks,
Dscho
Duy Nguyen Oct. 23, 2018, 2:37 p.m. UTC | #9
On Mon, Oct 22, 2018 at 10:23 PM SZEDER Gábor <szeder.dev@gmail.com> wrote:
> Once upon a time a GETTEXT_POISON build job failed on me, and the
> error message:
>
>   error: # GETTEXT POISON #
>
> was not particularly useful.  Ook wouldn't help with that...

Oook?

> So I came up with the following couple of patches that implement a
> "scrambled" format that makes the poisoned output legible for humans
> but still gibberish for machine consumption (i.e. grep-ing the text
> part would still fail):
>
>   error:  U.n.a.b.l.e. .t.o. .c.r.e.a.t.e. .'./home/szeder/src/git/t/trash directory.t1404-update-ref-errors/.git/packed-refs...l.o.c.k.'.:. .File exists...
>
> I have been running GETTEXT_POISON builds with this series for some
> months now, but haven't submitted it yet, because I haven't decided
> yet whether including strings (paths, refs, etc.) in the output as
> they are is a feature or a flaw.

A feature to me. The only thing that got through and but should not be
grep-able is strerror(). But that's orthogonal to this scrambling
stuff.

> And because it embarrassingly leaks every single translated string... :)

Easy to fix. I hope this means you have no more excuse to not push
this forward ;-)
Duy Nguyen Oct. 23, 2018, 3 p.m. UTC | #10
On Tue, Oct 23, 2018 at 12:17 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Tue, Oct 23 2018, Johannes Schindelin wrote:
>
> > Hi Ævar,
> >
> > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote:
> >
> >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is
> >> performance, but I don't think that matters. It's not like we're
> >> printing gigabytes of _() formatted output. Everything where formatting
> >> matters is plumbing which doesn't use this API. These messages are
> >> always for human consumption.
> >
> > Well, let's make sure that your impression is correct before going too
> > far. I, too, had the impression that gettext cannot possibly be expensive,
> > especifally in Git for Windows' settings, where we do not even ship
> > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid
> > initialization if the locale dir is not present, 2018-04-21):
> >
> >       The runtime of a simple `git.exe version` call on Windows is
> >       currently dominated by the gettext setup, adding a whopping ~150ms
> >       to the ~210ms total.
> >
> > I would be in favor of your change to make this a runtime option, of
> > course, as long as it does not affect performance greatly (in particular
> > on Windows, where we fight an uphill battle to make Git faster).
>
> How expensive gettext() may or may not be isn't relevant to the
> GETTEXT_POISON compile-time option.

If a user requests NO_GETTEXT, they should have zero (or near zero)
cost related to gettext. Which is true until now (the inline _()
version is expanded to pretty much no-op with the exception of NULL
string)

> The effect of what I'm suggesting here, and which my WIP patch in
> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a
> one-time getenv() for each process that prints a _() message that we
> aren't doing now, and for each message call a function that would check
> a boolean "are we in poison mode" static global variable.

Just don't do the getenv() inside _() even if you just do it one time.
getenv() may invalidate whatever value from the previous call. I would
not want to find out my code broke because of an getenv() inside some
innocent _()... And we should still respect NO_GETTEXT, not doing any
extra work when it's defined.
Ævar Arnfjörð Bjarmason Oct. 23, 2018, 4:45 p.m. UTC | #11
On Tue, Oct 23 2018, Duy Nguyen wrote:

> On Tue, Oct 23, 2018 at 12:17 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Tue, Oct 23 2018, Johannes Schindelin wrote:
>>
>> > Hi Ævar,
>> >
>> > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote:
>> >
>> >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is
>> >> performance, but I don't think that matters. It's not like we're
>> >> printing gigabytes of _() formatted output. Everything where formatting
>> >> matters is plumbing which doesn't use this API. These messages are
>> >> always for human consumption.
>> >
>> > Well, let's make sure that your impression is correct before going too
>> > far. I, too, had the impression that gettext cannot possibly be expensive,
>> > especifally in Git for Windows' settings, where we do not even ship
>> > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid
>> > initialization if the locale dir is not present, 2018-04-21):
>> >
>> >       The runtime of a simple `git.exe version` call on Windows is
>> >       currently dominated by the gettext setup, adding a whopping ~150ms
>> >       to the ~210ms total.
>> >
>> > I would be in favor of your change to make this a runtime option, of
>> > course, as long as it does not affect performance greatly (in particular
>> > on Windows, where we fight an uphill battle to make Git faster).
>>
>> How expensive gettext() may or may not be isn't relevant to the
>> GETTEXT_POISON compile-time option.
>
> If a user requests NO_GETTEXT, they should have zero (or near zero)
> cost related to gettext. Which is true until now (the inline _()
> version is expanded to pretty much no-op with the exception of NULL
> string)

I'm assuming by "until now" you mean until my RFC patch in
https://public-inbox.org/git/875zxtd59e.fsf@evledraar.gmail.com/

Yeah it's more expensive, but I don't see how it matters. We'd be
nano-optimizing a thing that's never going to become a
bottleneck. I.e. the patch is basically taking the commented-out
codepath in this test program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int have_flag(void)
    {
        return 0;
        /*
        static int flag = -1;
        if (flag == -1)
            flag = strlen(getenv("GIT_TEST_FOO"));
        return flag;
        */
    }


    int main(void)
    {
        while (1) {
            const char *out = have_flag() ? "foo" : "bar";
            puts(out);
        }
    }

When I test that on my laptop with gcc 8.2.0 I get ~230MB/s of output on
both versions, i.e. where have_flag() can be trivially constant folded,
and where we need to call getenv() for both of:

    GIT_TEST_FOO= ./main | pv >/dev/null
    GIT_TEST_FOO=1 ./main | pv >/dev/null

We don't have any codepaths where we're calling gettext() to output more
than a screenfull or two of output, so optimizing this doesn't make
sense.

>> The effect of what I'm suggesting here, and which my WIP patch in
>> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a
>> one-time getenv() for each process that prints a _() message that we
>> aren't doing now, and for each message call a function that would check
>> a boolean "are we in poison mode" static global variable.
>
> Just don't do the getenv() inside _() even if you just do it one time.
> getenv() may invalidate whatever value from the previous call. I would
> not want to find out my code broke because of an getenv() inside some
> innocent _()...

How would any code break? We have various getenv("GIT_TEST_*...")
already that work the same way. Unless you set that in the environment
the function is idempotent, and I don't see how anyone would do that
accidentally.

> And we should still respect NO_GETTEXT, not doing any extra work when
> it's defined.

This is not how any of our NO_* defines work. *Sometimes* defining them
guarantees you do no extra work, but in other cases we've decided that
bending over backwards with #ifdef in various places isn't worth it.

E.g. grep_opt in grep.h is a good example of this. Even if you don't
compile with PCRE you're pointlessly memzero-ing out members of the
struct that'll never be used in your config, because it's easier to
maintain the code that way (types always checked at compile-time).

(And no, despite my involvement in PCRE I didn't introduce that
particular pattern)

For the reasons noted above I think NO_GETTEXT is another such
case. Just like GIT_TEST_SPLIT_INDEX (added in your d6e3c181bc
("read-cache: force split index mode with GIT_TEST_SPLIT_INDEX",
2014-06-13)) etc.
Duy Nguyen Oct. 24, 2018, 2:41 p.m. UTC | #12
On Tue, Oct 23, 2018 at 6:45 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> >> The effect of what I'm suggesting here, and which my WIP patch in
> >> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a
> >> one-time getenv() for each process that prints a _() message that we
> >> aren't doing now, and for each message call a function that would check
> >> a boolean "are we in poison mode" static global variable.
> >
> > Just don't do the getenv() inside _() even if you just do it one time.
> > getenv() may invalidate whatever value from the previous call. I would
> > not want to find out my code broke because of an getenv() inside some
> > innocent _()...
>
> How would any code break? We have various getenv("GIT_TEST_*...")
> already that work the same way. Unless you set that in the environment
> the function is idempotent, and I don't see how anyone would do that
> accidentally.

I didn't check those GIT_TEST_ but I think the difference is in
complexity. When you write

 var = getenv("foo");
 complexFunction();

you probably start to feel scary and wrap getenv() with a strdup()
because you usually don't know exactly what complexFunction() can call
(and even if you do, you can't be sure what it may call in the
future).

The person who writes

 printf(_("%s"), getenv("foo"));

may not go through the same thought process as with complexFunction().
If _() calls getenv(), because you the order of parameter evaluation
is unspecified, you cannot be sure if getenv("foo") will be called
before or after the one inside _(). One of them may screw up the
other.

> > And we should still respect NO_GETTEXT, not doing any extra work when
> > it's defined.
>
> This is not how any of our NO_* defines work. *Sometimes* defining them
> guarantees you do no extra work, but in other cases we've decided that
> bending over backwards with #ifdef in various places isn't worth it.

I guess it boils down to "worth it". For me #ifdef NO_GETTEXT would be
limited to gettext.h and it's not that much work. But you do the
actual work. You decide.
Ævar Arnfjörð Bjarmason Oct. 24, 2018, 5:54 p.m. UTC | #13
On Wed, Oct 24 2018, Duy Nguyen wrote:

> On Tue, Oct 23, 2018 at 6:45 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> >> The effect of what I'm suggesting here, and which my WIP patch in
>> >> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a
>> >> one-time getenv() for each process that prints a _() message that we
>> >> aren't doing now, and for each message call a function that would check
>> >> a boolean "are we in poison mode" static global variable.
>> >
>> > Just don't do the getenv() inside _() even if you just do it one time.
>> > getenv() may invalidate whatever value from the previous call. I would
>> > not want to find out my code broke because of an getenv() inside some
>> > innocent _()...
>>
>> How would any code break? We have various getenv("GIT_TEST_*...")
>> already that work the same way. Unless you set that in the environment
>> the function is idempotent, and I don't see how anyone would do that
>> accidentally.
>
> I didn't check those GIT_TEST_ but I think the difference is in
> complexity. When you write
>
>  var = getenv("foo");
>  complexFunction();
>
> you probably start to feel scary and wrap getenv() with a strdup()
> because you usually don't know exactly what complexFunction() can call
> (and even if you do, you can't be sure what it may call in the
> future).
>
> The person who writes
>
>  printf(_("%s"), getenv("foo"));
>
> may not go through the same thought process as with complexFunction().
> If _() calls getenv(), because you the order of parameter evaluation
> is unspecified, you cannot be sure if getenv("foo") will be called
> before or after the one inside _(). One of them may screw up the
> other.

Ah, you mean because getenv() is not reentrant such a construct may
cause us to return something else entirely for getenv("bar") (e.g. in
this case the value for getenv("bar")).

Is that something you or anyone else has seen in practice? My intuition
is that while POSIX doesn't make that promise it isn't likely that
there's any implementation that would mutate the env purely on a
getenv(), but setenv() (munging some static "env" area in-place) might
be another matter.

But we could always call use_gettext_poison() really early from
git_setup_gettext() (called from our main()) if we're worried about
this, it would then set the static "poison_requested" variable and we
wouldn't call getenv() again, e.g. if we're later calling it in the
middle of multithreaded code, or within the same same sequence point.

>> > And we should still respect NO_GETTEXT, not doing any extra work when
>> > it's defined.
>>
>> This is not how any of our NO_* defines work. *Sometimes* defining them
>> guarantees you do no extra work, but in other cases we've decided that
>> bending over backwards with #ifdef in various places isn't worth it.
>
> I guess it boils down to "worth it". For me #ifdef NO_GETTEXT would be
> limited to gettext.h and it's not that much work. But you do the
> actual work. You decide.

Yeah the ifdef is pretty small and not really worth worrynig about, the
main benefit is being able to run tests in this mode without
recompiling.

I hadn't been running with GETTEXT_POISON when I build git because I
hadn't been bothered to build twice just for it, but am now running it
alongside other GIT_TEST_* modes.
Junio C Hamano Oct. 25, 2018, 3:52 a.m. UTC | #14
Duy Nguyen <pclouds@gmail.com> writes:

> The person who writes
>
>  printf(_("%s"), getenv("foo"));
>
> may not go through the same thought process as with complexFunction().
> If _() calls getenv(), because you the order of parameter evaluation
> is unspecified, you cannot be sure if getenv("foo") will be called
> before or after the one inside _(). One of them may screw up the
> other.

Yup, sometimes we've been sloppy but we should strive to mimick
efforts like f4ef5173 ("determine_author_info(): copy getenv
output", 2014-08-27).
Jeff King Oct. 25, 2018, 6:20 a.m. UTC | #15
On Thu, Oct 25, 2018 at 12:52:55PM +0900, Junio C Hamano wrote:

> Duy Nguyen <pclouds@gmail.com> writes:
> 
> > The person who writes
> >
> >  printf(_("%s"), getenv("foo"));
> >
> > may not go through the same thought process as with complexFunction().
> > If _() calls getenv(), because you the order of parameter evaluation
> > is unspecified, you cannot be sure if getenv("foo") will be called
> > before or after the one inside _(). One of them may screw up the
> > other.
> 
> Yup, sometimes we've been sloppy but we should strive to mimick
> efforts like f4ef5173 ("determine_author_info(): copy getenv
> output", 2014-08-27).

I've wondered about this before. Even calling:

  foo = xstrdup(getenv("bar"));

is not necessarily correct, because xstrdup() relies on xmalloc(), which
may check GIT_ALLOC_LIMIT (we do cache that result, but it can happen on
the first malloc).

I also wouldn't be surprised if there are cases in our threaded code
that use getenv() without taking a lock.

I've definitely run into setenv()/getenv() races on Linux (inside Git,
even, though it was while working on custom code). But I wonder how
common it is for getenv() to be invalidated by another getenv() call.
Certainly POSIX allows it, but every implementation I've seen (which is
admittedly few) is passing back pointers to a chunk of environment
memory.

I.e., could we mostly ignore this problem as not applying to most modern
systems? And if there is such a system, give it a fallback like:

  /*
   * For systems that use a single buffer for getenv(), this hacks
   * around it by giving it _four_ buffers. That's just punting on
   * the problem, but it at least gives enough breathing room for
   * the caller to do something sane like use non-trivial functions
   * to copy the string. It still does nothing for threading, but
   * hopefully such systems don't support pthreads in the first place. ;)
   */
  const char *xgetenv(const char *key)
  {
	static struct strbuf bufs[] = {
		STRBUF_INIT, STRBUF_INIT, STRBUF_INIT, STRBUF_INIT
	};
	static unsigned int cur;
	struct strbuf *buf;
	const char *value;
	size_t len;

	value = getenv(key);
	if (!value)
		return NULL;

	buf = bufs[cur++];
	cur %= ARRAY_SIZE(bufs);

	/*
	 * We have to do this length check ourselves, because allocating
	 * the strbuf may invalidate "value"!
	 */
	len = strlen(value);
	if (buf->alloc <= len) {
		strbuf_grow(buf, len);
		value = getenv(key);
		if (!value)
			return NULL; /* whoops, it went away! */
		len = strlen(value); /* paranoia that it didn't change */
	}

	strbuf_reset(buf);
	strbuf_add(buf, value, len);

	return buf->buf;
  }

I dunno. Maybe I am being overly optimistic. But I strongly suspect we
have such bugs already in our code base, and nobody has run into them
(OTOH, they are quite finicky due to things like the caching I
mentioned).

-Peff
Junio C Hamano Oct. 27, 2018, 6:55 a.m. UTC | #16
Jeff King <peff@peff.net> writes:

> I.e., could we mostly ignore this problem as not applying to most modern
> systems? And if there is such a system, give it a fallback like:
>
>   /*
>    * For systems that use a single buffer for getenv(), this hacks
>    * around it by giving it _four_ buffers. That's just punting on
>    * the problem, but it at least gives enough breathing room for
>    * the caller to do something sane like use non-trivial functions
>    * to copy the string. It still does nothing for threading, but
>    * hopefully such systems don't support pthreads in the first place. ;)
>    */

Tempting.

> I dunno. Maybe I am being overly optimistic. But I strongly suspect we
> have such bugs already in our code base, and nobody has run into them

Yeah, I tend to agree that I would not be surprised at all if we
have many such "bugs" that triggers for nobody on modern platforms.
Jakub Narębski Oct. 27, 2018, 10:11 a.m. UTC | #17
SZEDER Gábor <szeder.dev@gmail.com> writes:
> On Mon, Oct 22, 2018 at 05:36:33PM +0200, Nguyễn Thái Ngọc Duy wrote:
>>
>> The current gettext() function just replaces all strings with
>> '# GETTEXT POISON #' including format strings and hides the things
>> that we should be allowed to grep (like branch names, or some other
>> codes) even when gettext is poisoned.
>> 
>> This patch implements the poisoned _() with a universal and totally
>> legit language called Ook [1]. We could actually grep stuff even in
>> with this because format strings are preserved.
>
> Once upon a time a GETTEXT_POISON build job failed on me, and the
> error message:
>
>   error: # GETTEXT POISON #
>
> was not particularly useful.  Ook wouldn't help with that...
>
> So I came up with the following couple of patches that implement a
> "scrambled" format that makes the poisoned output legible for humans
> but still gibberish for machine consumption (i.e. grep-ing the text
> part would still fail):
>
>   error:  U.n.a.b.l.e. .t.o. .c.r.e.a.t.e. .'./home/szeder/src/git/t/trash directory.t1404-update-ref-errors/.git/packed-refs...l.o.c.k.'.:. .File exists...
>
> I have been running GETTEXT_POISON builds with this series for some
> months now, but haven't submitted it yet, because I haven't decided
> yet whether including strings (paths, refs, etc.) in the output as
> they are is a feature or a flaw.  And because it embarrassingly leaks
> every single translated string... :)

There is similar technique called "pseudolocalization", meant for
testing i18n aspect of software.

In one of most common forms, the string

  Edit program settings

woukd be translated to

  [!!! εÐiţ Þr0ģЯãm səTτıИğ§ !!!]

(possibly using mirrored locale, i.e. right-to-left order).

The brackets [!!! ... !!!] are used as a "poison", to detect
translatable text, and to spot issues with truncation; it also helps
with finding "lego" translation.

It would also stress-test Unicode handling...

Regards,

Patch
diff mbox series

diff --git a/gettext.c b/gettext.c
index 7272771c8e..29901e1ddd 100644
--- a/gettext.c
+++ b/gettext.c
@@ -56,6 +56,60 @@  int use_gettext_poison(void)
 }
 #endif
 
+const char *gettext_poison(const char *msgid)
+{
+	/*
+	 * gettext() returns a string that is always valid. We would
+	 * need a hash map for that but let's stay simple and keep the
+	 * last 64 gettext() results. Should be more than enough.
+	 */
+	static char *bufs[64];
+	static int i;
+	struct strbuf sb = STRBUF_INIT;
+	char *buf;
+	const char *p;
+	const char *type_specifiers = "diouxXeEfFgGaAcsCSpnm%";
+
+	if (!strchr(msgid, '%'))
+		return "Eek!";
+
+	p = msgid;
+	while (*p) {
+		const char *type;
+		switch (*p) {
+		case '%':
+			/*
+			 * No strict parsing. We simply look for the end of a
+			 * format string
+			 */
+			type = p + 1;
+			while (*type && !strchr(type_specifiers, *type))
+				type++;
+			if (*type)
+				type++;
+			strbuf_add(&sb, p, (int)(type - p));
+			p = type;
+			break;
+		default:
+			if (!isalpha(*p)) {
+				strbuf_addch(&sb, *p);
+				p++;
+				break;
+			}
+			if (isupper(*p))
+				strbuf_addstr(&sb, "Ook");
+			else
+				strbuf_addstr(&sb, "ook");
+			while (isalpha(*p))
+				p++;
+		}
+	}
+	buf = bufs[(i++) % ARRAY_SIZE(bufs)];
+	free(buf);
+	buf = strbuf_detach(&sb, NULL);
+	return buf;
+}
+
 #ifndef NO_GETTEXT
 static int test_vsnprintf(const char *fmt, ...)
 {
diff --git a/gettext.h b/gettext.h
index 7eee64a34f..dc9851a06a 100644
--- a/gettext.h
+++ b/gettext.h
@@ -41,8 +41,9 @@  static inline int gettext_width(const char *s)
 }
 #endif
 
+const char *gettext_poison(const char *);
 #ifdef GETTEXT_POISON
-extern int use_gettext_poison(void);
+int use_gettext_poison(void);
 #else
 #define use_gettext_poison() 0
 #endif
@@ -51,14 +52,14 @@  static inline FORMAT_PRESERVING(1) const char *_(const char *msgid)
 {
 	if (!*msgid)
 		return "";
-	return use_gettext_poison() ? "# GETTEXT POISON #" : gettext(msgid);
+	return use_gettext_poison() ? gettext_poison(msgid) : gettext(msgid);
 }
 
 static inline FORMAT_PRESERVING(1) FORMAT_PRESERVING(2)
 const char *Q_(const char *msgid, const char *plu, unsigned long n)
 {
 	if (use_gettext_poison())
-		return "# GETTEXT POISON #";
+		return gettext_poison(msgid);
 	return ngettext(msgid, plu, n);
 }
 
diff --git a/t/lib-rebase.sh b/t/lib-rebase.sh
index 2ca9fb69d6..1e8440e935 100644
--- a/t/lib-rebase.sh
+++ b/t/lib-rebase.sh
@@ -29,7 +29,7 @@  set_fake_editor () {
 	*/COMMIT_EDITMSG)
 		test -z "$EXPECT_HEADER_COUNT" ||
 			test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# This is a combination of \(.*\) commits\./\1/p' < "$1")" ||
-			test "# # GETTEXT POISON #" = "$(sed -n '1p' < "$1")" ||
+			test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# Ook ook ook ook ook \(.*\) ook\./\1/p' < "$1")" ||
 			exit
 		test -z "$FAKE_COMMIT_MESSAGE" || echo "$FAKE_COMMIT_MESSAGE" > "$1"
 		test -z "$FAKE_COMMIT_AMEND" || echo "$FAKE_COMMIT_AMEND" >> "$1"