Message ID | 20181022153633.31757-1-pclouds@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Poison gettext with the Ook language | expand |
On Mon, Oct 22, 2018 at 05:36:33PM +0200, Nguyễn Thái Ngọc Duy wrote: > The current gettext() function just replaces all strings with > '# GETTEXT POISON #' including format strings and hides the things > that we should be allowed to grep (like branch names, or some other > codes) even when gettext is poisoned. > > This patch implements the poisoned _() with a universal and totally > legit language called Ook [1]. We could actually grep stuff even in > with this because format strings are preserved. Once upon a time a GETTEXT_POISON build job failed on me, and the error message: error: # GETTEXT POISON # was not particularly useful. Ook wouldn't help with that... So I came up with the following couple of patches that implement a "scrambled" format that makes the poisoned output legible for humans but still gibberish for machine consumption (i.e. grep-ing the text part would still fail): error: U.n.a.b.l.e. .t.o. .c.r.e.a.t.e. .'./home/szeder/src/git/t/trash directory.t1404-update-ref-errors/.git/packed-refs...l.o.c.k.'.:. .File exists... I have been running GETTEXT_POISON builds with this series for some months now, but haven't submitted it yet, because I haven't decided yet whether including strings (paths, refs, etc.) in the output as they are is a feature or a flaw. And because it embarrassingly leaks every single translated string... :) SZEDER Gábor (8): test-lib.sh: preserve GIT_GETTEXT_POISON from the environment gettext: don't poison if GIT_GETTEXT_POISON is set but empty lib-rebase: loosen GETTEXT_POISON check in fake editor gettext: #ifdef away GETTEXT POISON-related code from _() and Q_() gettext: put "# GETTEXT POISON #" string literal into a macro gettext: use an enum for the mode of GETTEXT POISONing gettext: introduce GIT_GETTEXT_POISON=scrambled travis-ci: run GETTEXT POISON build job in scrambled mode, too Makefile | 2 +- ci/lib-travisci.sh | 1 + ci/run-build-and-tests.sh | 10 +++++-- gettext.c | 63 +++++++++++++++++++++++++++++++++++---- gettext.h | 33 +++++++++++++++----- t/lib-rebase.sh | 2 +- t/test-lib.sh | 17 ++++++++++- 7 files changed, 110 insertions(+), 18 deletions(-)
Hi Duy, On Mon, 22 Oct 2018, Nguyễn Thái Ngọc Duy wrote: > The current gettext() function just replaces all strings with > '# GETTEXT POISON #' including format strings and hides the things > that we should be allowed to grep (like branch names, or some other > codes) even when gettext is poisoned. > > This patch implements the poisoned _() with a universal and totally > legit language called Ook [1]. We could actually grep stuff even in > with this because format strings are preserved. > > Long term, we could implement an "ook translator" for test_i18ngrep > and friends so that they translate English to Ook, allowing us to > match full text while making sure the text in the code is still marked > for translation. > > [1] https://en.wikipedia.org/wiki/Unseen_University#Librarian > > Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> > --- > This started out as something fun to do while running the test suite > last weekend. But it turns out actually working! If this patch ends > up in git.git, the Librarian would be so proud! Cute. Dscho > > gettext.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++ > gettext.h | 7 ++++--- > t/lib-rebase.sh | 2 +- > 3 files changed, 59 insertions(+), 4 deletions(-) > > diff --git a/gettext.c b/gettext.c > index 7272771c8e..29901e1ddd 100644 > --- a/gettext.c > +++ b/gettext.c > @@ -56,6 +56,60 @@ int use_gettext_poison(void) > } > #endif > > +const char *gettext_poison(const char *msgid) > +{ > + /* > + * gettext() returns a string that is always valid. We would > + * need a hash map for that but let's stay simple and keep the > + * last 64 gettext() results. Should be more than enough. > + */ > + static char *bufs[64]; > + static int i; > + struct strbuf sb = STRBUF_INIT; > + char *buf; > + const char *p; > + const char *type_specifiers = "diouxXeEfFgGaAcsCSpnm%"; > + > + if (!strchr(msgid, '%')) > + return "Eek!"; > + > + p = msgid; > + while (*p) { > + const char *type; > + switch (*p) { > + case '%': > + /* > + * No strict parsing. We simply look for the end of a > + * format string > + */ > + type = p + 1; > + while (*type && !strchr(type_specifiers, *type)) > + type++; > + if (*type) > + type++; > + strbuf_add(&sb, p, (int)(type - p)); > + p = type; > + break; > + default: > + if (!isalpha(*p)) { > + strbuf_addch(&sb, *p); > + p++; > + break; > + } > + if (isupper(*p)) > + strbuf_addstr(&sb, "Ook"); > + else > + strbuf_addstr(&sb, "ook"); > + while (isalpha(*p)) > + p++; > + } > + } > + buf = bufs[(i++) % ARRAY_SIZE(bufs)]; > + free(buf); > + buf = strbuf_detach(&sb, NULL); > + return buf; > +} > + > #ifndef NO_GETTEXT > static int test_vsnprintf(const char *fmt, ...) > { > diff --git a/gettext.h b/gettext.h > index 7eee64a34f..dc9851a06a 100644 > --- a/gettext.h > +++ b/gettext.h > @@ -41,8 +41,9 @@ static inline int gettext_width(const char *s) > } > #endif > > +const char *gettext_poison(const char *); > #ifdef GETTEXT_POISON > -extern int use_gettext_poison(void); > +int use_gettext_poison(void); > #else > #define use_gettext_poison() 0 > #endif > @@ -51,14 +52,14 @@ static inline FORMAT_PRESERVING(1) const char *_(const char *msgid) > { > if (!*msgid) > return ""; > - return use_gettext_poison() ? "# GETTEXT POISON #" : gettext(msgid); > + return use_gettext_poison() ? gettext_poison(msgid) : gettext(msgid); > } > > static inline FORMAT_PRESERVING(1) FORMAT_PRESERVING(2) > const char *Q_(const char *msgid, const char *plu, unsigned long n) > { > if (use_gettext_poison()) > - return "# GETTEXT POISON #"; > + return gettext_poison(msgid); > return ngettext(msgid, plu, n); > } > > diff --git a/t/lib-rebase.sh b/t/lib-rebase.sh > index 2ca9fb69d6..1e8440e935 100644 > --- a/t/lib-rebase.sh > +++ b/t/lib-rebase.sh > @@ -29,7 +29,7 @@ set_fake_editor () { > */COMMIT_EDITMSG) > test -z "$EXPECT_HEADER_COUNT" || > test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# This is a combination of \(.*\) commits\./\1/p' < "$1")" || > - test "# # GETTEXT POISON #" = "$(sed -n '1p' < "$1")" || > + test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# Ook ook ook ook ook \(.*\) ook\./\1/p' < "$1")" || > exit > test -z "$FAKE_COMMIT_MESSAGE" || echo "$FAKE_COMMIT_MESSAGE" > "$1" > test -z "$FAKE_COMMIT_AMEND" || echo "$FAKE_COMMIT_AMEND" >> "$1" > -- > 2.19.1.647.g708186aaf9 > >
On Mon, Oct 22 2018, Nguyễn Thái Ngọc Duy wrote: > The current gettext() function just replaces all strings with > '# GETTEXT POISON #' including format strings and hides the things > that we should be allowed to grep (like branch names, or some other > codes) even when gettext is poisoned. > > This patch implements the poisoned _() with a universal and totally > legit language called Ook [1]. We could actually grep stuff even in > with this because format strings are preserved. > > Long term, we could implement an "ook translator" for test_i18ngrep > and friends so that they translate English to Ook, allowing us to > match full text while making sure the text in the code is still marked > for translation. Replying to both this patch, and SZEDER's series for brevity. Thanks both for working on this. It's obvious that this stuff needs some polishing, and it's great that's being done, and sorry for my part of having left this in the current state. a) Both this patch and SZEDER's 7/8 are addressing the issue that the poison mode doesn't preserve format specifiers. I haven't tried to dig this up in the list archive, and maybe it only exists in my mind, but I seem to remember that this was explicitly discussed as a goal at the time. I.e. we were expecting the lack of this to break tests, and that was considered a feature as it would spot plumbing messages we shouldn't be translating. However, it's my opinion that this was just a bad idea to begin with, I've CC'd a couple of prolific markers of i18n from my log grepping (and feel free to add more) to see if they disagre. I think it probably helped me a bit in the beginning when i18n was bootstrapping and there was a *lot* to mark for translation, but we've long since stabilized and aren't doing that anymore, so it's much easier to just review the patches to see if they translate plumbing. All of which is to say that I think something like your patch here is fine to just accept, and the only improvement I'd suggest is some note in the commit message saying that this was always intentional, but nobody can name a case where it helped, so let's just drop it. In SZEDER's case that we shouldn't have GIT_GETTEXT_POISON=scrambled, let's just make it boolean and make "scrambled" (i.e. preserving format specifiecs) the default. b) SZEDER's series, and your patch (although it's smaller) still preserve the notion that there's such a thing as a GETTEXT_POISON *build* and you actually need to compile git with that flag to make use of it. This, as I recall, was just paranoia on my part back in 2010/2011. I wanted to be able to present a version of this i18n stuff where people who didn't like it could be assured that if they didn't opt-in to it they wouldn't get any of the code (sans a few trivial and obviously correct wrapper functions). So I think the only reason to keep it compile-time is performance, but I don't think that matters. It's not like we're printing gigabytes of _() formatted output. Everything where formatting matters is plumbing which doesn't use this API. These messages are always for human consumption. So shouldn't we just drop this notion of GETTEXT_POISON at compile-time entirely, and make GIT_GETTEXT_POISON=<boolean> a test flag that all builds of git are aware of? I think so.
On Mon, Oct 22 2018, Ævar Arnfjörð Bjarmason wrote: > On Mon, Oct 22 2018, Nguyễn Thái Ngọc Duy wrote: > >> The current gettext() function just replaces all strings with >> '# GETTEXT POISON #' including format strings and hides the things >> that we should be allowed to grep (like branch names, or some other >> codes) even when gettext is poisoned. >> >> This patch implements the poisoned _() with a universal and totally >> legit language called Ook [1]. We could actually grep stuff even in >> with this because format strings are preserved. >> >> Long term, we could implement an "ook translator" for test_i18ngrep >> and friends so that they translate English to Ook, allowing us to >> match full text while making sure the text in the code is still marked >> for translation. > > Replying to both this patch, and SZEDER's series for brevity. Thanks > both for working on this. It's obvious that this stuff needs some > polishing, and it's great that's being done, and sorry for my part of > having left this in the current state. > > a) > > Both this patch and SZEDER's 7/8 are addressing the issue that the > poison mode doesn't preserve format specifiers. > > I haven't tried to dig this up in the list archive, and maybe it only > exists in my mind, but I seem to remember that this was explicitly > discussed as a goal at the time. > > I.e. we were expecting the lack of this to break tests, and that was > considered a feature as it would spot plumbing messages we shouldn't be > translating. > > However, it's my opinion that this was just a bad idea to begin with, > I've CC'd a couple of prolific markers of i18n from my log grepping (and > feel free to add more) to see if they disagre. > > I think it probably helped me a bit in the beginning when i18n was > bootstrapping and there was a *lot* to mark for translation, but we've > long since stabilized and aren't doing that anymore, so it's much easier > to just review the patches to see if they translate plumbing. > > All of which is to say that I think something like your patch here is > fine to just accept, and the only improvement I'd suggest is some note > in the commit message saying that this was always intentional, but > nobody can name a case where it helped, so let's just drop it. > > In SZEDER's case that we shouldn't have GIT_GETTEXT_POISON=scrambled, > let's just make it boolean and make "scrambled" (i.e. preserving format > specifiecs) the default. [...] > b) > > SZEDER's series, and your patch (although it's smaller) still preserve > the notion that there's such a thing as a GETTEXT_POISON *build* and you > actually need to compile git with that flag to make use of it. > > This, as I recall, was just paranoia on my part back in 2010/2011. I > wanted to be able to present a version of this i18n stuff where people > who didn't like it could be assured that if they didn't opt-in to it > they wouldn't get any of the code (sans a few trivial and obviously > correct wrapper functions). > > So I think the only reason to keep it compile-time is performance, but I > don't think that matters. It's not like we're printing gigabytes of _() > formatted output. Everything where formatting matters is plumbing which > doesn't use this API. These messages are always for human consumption. > > So shouldn't we just drop this notion of GETTEXT_POISON at compile-time > entirely, and make GIT_GETTEXT_POISON=<boolean> a test flag that all > builds of git are aware of? I think so. Something like this, untested except I compiled it and ran one test with/without GIT_GETTEXT_POISON, so it may have bugs: Makefile | 10 +--------- gettext.c | 4 +--- gettext.h | 4 ---- po/README | 13 ++++--------- t/README | 5 +++++ t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + t/test-lib-functions.sh | 8 ++++---- t/test-lib.sh | 12 +++--------- 9 files changed, 20 insertions(+), 38 deletions(-) diff --git a/Makefile b/Makefile index d18ab0fe78..78190ee9d9 100644 --- a/Makefile +++ b/Makefile @@ -362,11 +362,6 @@ all:: # Define HAVE_DEV_TTY if your system can open /dev/tty to interact with the # user. # -# Define GETTEXT_POISON if you are debugging the choice of strings marked -# for translation. In a GETTEXT_POISON build, you can turn all strings marked -# for translation into gibberish by setting the GIT_GETTEXT_POISON variable -# (to any value) in your environment. -# # Define JSMIN to point to JavaScript minifier that functions as # a filter to have gitweb.js minified. # @@ -714,6 +709,7 @@ TEST_BUILTINS_OBJS += test-dump-split-index.o TEST_BUILTINS_OBJS += test-dump-untracked-cache.o TEST_BUILTINS_OBJS += test-example-decorate.o TEST_BUILTINS_OBJS += test-genrandom.o +TEST_BUILTINS_OBJS += test-gettext-poison.o TEST_BUILTINS_OBJS += test-hashmap.o TEST_BUILTINS_OBJS += test-index-version.o TEST_BUILTINS_OBJS += test-json-writer.o @@ -1439,9 +1435,6 @@ endif ifdef NO_SYMLINK_HEAD BASIC_CFLAGS += -DNO_SYMLINK_HEAD endif -ifdef GETTEXT_POISON - BASIC_CFLAGS += -DGETTEXT_POISON -endif ifdef NO_GETTEXT BASIC_CFLAGS += -DNO_GETTEXT USE_GETTEXT_SCHEME ?= fallthrough @@ -2591,7 +2584,6 @@ ifdef GIT_TEST_CMP_USE_COPIED_CONTEXT @echo GIT_TEST_CMP_USE_COPIED_CONTEXT=YesPlease >>$@+ endif @echo NO_GETTEXT=\''$(subst ','\'',$(subst ','\'',$(NO_GETTEXT)))'\' >>$@+ - @echo GETTEXT_POISON=\''$(subst ','\'',$(subst ','\'',$(GETTEXT_POISON)))'\' >>$@+ ifdef GIT_PERF_REPEAT_COUNT @echo GIT_PERF_REPEAT_COUNT=\''$(subst ','\'',$(subst ','\'',$(GIT_PERF_REPEAT_COUNT)))'\' >>$@+ endif diff --git a/gettext.c b/gettext.c index 7272771c8e..d71e394284 100644 --- a/gettext.c +++ b/gettext.c @@ -46,15 +46,13 @@ const char *get_preferred_languages(void) return NULL; } -#ifdef GETTEXT_POISON int use_gettext_poison(void) { static int poison_requested = -1; if (poison_requested == -1) - poison_requested = getenv("GIT_GETTEXT_POISON") ? 1 : 0; + poison_requested = git_env_bool("GIT_GETTEXT_POISON", 0); return poison_requested; } -#endif #ifndef NO_GETTEXT static int test_vsnprintf(const char *fmt, ...) diff --git a/gettext.h b/gettext.h index 7eee64a34f..4c492d9f57 100644 --- a/gettext.h +++ b/gettext.h @@ -41,11 +41,7 @@ static inline int gettext_width(const char *s) } #endif -#ifdef GETTEXT_POISON extern int use_gettext_poison(void); -#else -#define use_gettext_poison() 0 -#endif static inline FORMAT_PRESERVING(1) const char *_(const char *msgid) { diff --git a/po/README b/po/README index fef4c0f0b5..9dad277e33 100644 --- a/po/README +++ b/po/README @@ -289,16 +289,11 @@ something in the test suite might still depend on the US English version of the strings, e.g. to grep some error message or other output. -To smoke out issues like these Git can be compiled with gettext poison -support, at the top-level: +To smoke out issues like these Git tested with a translation mode that +emits gibberish on every call to gettext. To use it run the test suite +with it, e.g.: - make GETTEXT_POISON=YesPlease - -That'll give you a git which emits gibberish on every call to -gettext. It's obviously not meant to be installed, but you should run -the test suite with it: - - cd t && prove -j 9 ./t[0-9]*.sh + cd t && GIT_GETTEXT_POISON=true prove -j 9 ./t[0-9]*.sh If tests break with it you should inspect them manually and see if what you're translating is sane, i.e. that you're not translating diff --git a/t/README b/t/README index 8847489640..b3b2fa0b11 100644 --- a/t/README +++ b/t/README @@ -301,6 +301,11 @@ that cannot be easily covered by a few specific test cases. These could be enabled by running the test suite with correct GIT_TEST_ environment set. +GIT_GETTEXT_POISON=<boolean> turns all strings marked for translation +into gibberish. Used for spotting those tests that need to be marked +with a C_LOCALE_OUTPUT prerequisite when adding more strings for +translation. See "Testing marked strings" in po/README for details. + GIT_TEST_SPLIT_INDEX=<boolean> forces split-index mode on the whole test suite. Accept any boolean values that are accepted by git-config. diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index 6b5836dc1b..3e75672a37 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -19,6 +19,7 @@ static struct test_cmd cmds[] = { { "dump-untracked-cache", cmd__dump_untracked_cache }, { "example-decorate", cmd__example_decorate }, { "genrandom", cmd__genrandom }, + { "gettext-poison", cmd__gettext_poison }, { "hashmap", cmd__hashmap }, { "index-version", cmd__index_version }, { "json-writer", cmd__json_writer }, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index e4890566da..04f033b7fc 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -15,6 +15,7 @@ int cmd__dump_split_index(int argc, const char **argv); int cmd__dump_untracked_cache(int argc, const char **argv); int cmd__example_decorate(int argc, const char **argv); int cmd__genrandom(int argc, const char **argv); +int cmd__gettext_poison(int argc, const char **argv); int cmd__hashmap(int argc, const char **argv); int cmd__index_version(int argc, const char **argv); int cmd__json_writer(int argc, const char **argv); diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 78d8c3783b..4e8683addd 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -755,16 +755,16 @@ test_cmp_bin() { # Use this instead of test_cmp to compare files that contain expected and # actual output from git commands that can be translated. When running -# under GETTEXT_POISON this pretends that the command produced expected +# under GIT_GETTEXT_POISON this pretends that the command produced expected # results. test_i18ncmp () { - test -n "$GETTEXT_POISON" || test_cmp "$@" + ! test_have_prereq C_LOCALE_OUTPUT || test_cmp "$@" } # Use this instead of "grep expected-string actual" to see if the # output from a git command that can be translated either contains an # expected string, or does not contain an unwanted one. When running -# under GETTEXT_POISON this pretends that the command produced expected +# under GIT_GETTEXT_POISON this pretends that the command produced expected # results. test_i18ngrep () { eval "last_arg=\${$#}" @@ -779,7 +779,7 @@ test_i18ngrep () { error "bug in the test script: too few parameters to test_i18ngrep" fi - if test -n "$GETTEXT_POISON" + if ! test_have_prereq C_LOCALE_OUTPUT then # pretend success return 0 diff --git a/t/test-lib.sh b/t/test-lib.sh index 897e6fcc94..f82d2149a2 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -105,6 +105,7 @@ unset VISUAL EMAIL LANGUAGE COLUMNS $("$PERL_PATH" -e ' TRACE DEBUG TEST + GETTEXT_POISON .*_TEST PROVE VALGRIND @@ -1104,15 +1105,8 @@ test -n "$USE_LIBPCRE1" && test_set_prereq LIBPCRE1 test -n "$USE_LIBPCRE2" && test_set_prereq LIBPCRE2 test -z "$NO_GETTEXT" && test_set_prereq GETTEXT -# Can we rely on git's output in the C locale? -if test -n "$GETTEXT_POISON" -then - GIT_GETTEXT_POISON=YesPlease - export GIT_GETTEXT_POISON - test_set_prereq GETTEXT_POISON -else - test_set_prereq C_LOCALE_OUTPUT -fi +test_lazy_prereq GETTEXT_POISON 'test-tool gettext-poison' +test_lazy_prereq C_LOCALE_OUTPUT '! test-tool gettext-poison' if test -z "$GIT_TEST_CHECK_CACHE_TREE" then
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > So I think the only reason to keep it compile-time is performance, but I > don't think that matters. It's not like we're printing gigabytes of _() > formatted output. Everything where formatting matters is plumbing which > doesn't use this API. These messages are always for human consumption. > > So shouldn't we just drop this notion of GETTEXT_POISON at compile-time > entirely, and make GIT_GETTEXT_POISON=<boolean> a test flag that all > builds of git are aware of? I think so. Haven't thought things through, but that sounds like a good thing to aim for. Keep the ideas coming...
Hi Ævar, On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote: > So I think the only reason to keep it [GETTEXT_POISON] compile-time is > performance, but I don't think that matters. It's not like we're > printing gigabytes of _() formatted output. Everything where formatting > matters is plumbing which doesn't use this API. These messages are > always for human consumption. Well, let's make sure that your impression is correct before going too far. I, too, had the impression that gettext cannot possibly be expensive, especifally in Git for Windows' settings, where we do not even ship translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid initialization if the locale dir is not present, 2018-04-21): The runtime of a simple `git.exe version` call on Windows is currently dominated by the gettext setup, adding a whopping ~150ms to the ~210ms total. I would be in favor of your change to make this a runtime option, of course, as long as it does not affect performance greatly (in particular on Windows, where we fight an uphill battle to make Git faster). Ciao, Dscho
On Tue, Oct 23 2018, Johannes Schindelin wrote: > Hi Ævar, > > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote: > >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is >> performance, but I don't think that matters. It's not like we're >> printing gigabytes of _() formatted output. Everything where formatting >> matters is plumbing which doesn't use this API. These messages are >> always for human consumption. > > Well, let's make sure that your impression is correct before going too > far. I, too, had the impression that gettext cannot possibly be expensive, > especifally in Git for Windows' settings, where we do not even ship > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid > initialization if the locale dir is not present, 2018-04-21): > > The runtime of a simple `git.exe version` call on Windows is > currently dominated by the gettext setup, adding a whopping ~150ms > to the ~210ms total. > > I would be in favor of your change to make this a runtime option, of > course, as long as it does not affect performance greatly (in particular > on Windows, where we fight an uphill battle to make Git faster). How expensive gettext() may or may not be isn't relevant to the GETTEXT_POISON compile-time option. The effect of what I'm suggesting here, and which my WIP patch in <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a one-time getenv() for each process that prints a _() message that we aren't doing now, and for each message call a function that would check a boolean "are we in poison mode" static global variable. Perhaps some of that's expensive on Windows, but given the recent patches by Microsoft employees to add GIT_TEST_* env options I assumed not, but in any case it won't have anything to do with how expensive gettext may or may not be, you'll already be paying that cost now (or not, with NO_GETTEXT).
Hi Ævar, On Tue, 23 Oct 2018, Ævar Arnfjörð Bjarmason wrote: > On Tue, Oct 23 2018, Johannes Schindelin wrote: > > > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote: > > > >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is > >> performance, but I don't think that matters. It's not like we're > >> printing gigabytes of _() formatted output. Everything where formatting > >> matters is plumbing which doesn't use this API. These messages are > >> always for human consumption. > > > > Well, let's make sure that your impression is correct before going too > > far. I, too, had the impression that gettext cannot possibly be expensive, > > especifally in Git for Windows' settings, where we do not even ship > > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid > > initialization if the locale dir is not present, 2018-04-21): > > > > The runtime of a simple `git.exe version` call on Windows is > > currently dominated by the gettext setup, adding a whopping ~150ms > > to the ~210ms total. > > > > I would be in favor of your change to make this a runtime option, of > > course, as long as it does not affect performance greatly (in particular > > on Windows, where we fight an uphill battle to make Git faster). > > How expensive gettext() may or may not be isn't relevant to the > GETTEXT_POISON compile-time option. > > The effect of what I'm suggesting here, and which my WIP patch in > <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a > one-time getenv() for each process that prints a _() message that we > aren't doing now, and for each message call a function that would check > a boolean "are we in poison mode" static global variable. Yep, you are right, we are *already* going through _() as a function, whether gettext() is initialized or not. My bad. > Perhaps some of that's expensive on Windows, but given the recent > patches by Microsoft employees to add GIT_TEST_* env options I assumed > not, but in any case it won't have anything to do with how expensive > gettext may or may not be, you'll already be paying that cost now (or > not, with NO_GETTEXT). Indeed, we want to measure performance better, and that's what those environment variables are for. Thanks, Dscho
On Mon, Oct 22, 2018 at 10:23 PM SZEDER Gábor <szeder.dev@gmail.com> wrote: > Once upon a time a GETTEXT_POISON build job failed on me, and the > error message: > > error: # GETTEXT POISON # > > was not particularly useful. Ook wouldn't help with that... Oook? > So I came up with the following couple of patches that implement a > "scrambled" format that makes the poisoned output legible for humans > but still gibberish for machine consumption (i.e. grep-ing the text > part would still fail): > > error: U.n.a.b.l.e. .t.o. .c.r.e.a.t.e. .'./home/szeder/src/git/t/trash directory.t1404-update-ref-errors/.git/packed-refs...l.o.c.k.'.:. .File exists... > > I have been running GETTEXT_POISON builds with this series for some > months now, but haven't submitted it yet, because I haven't decided > yet whether including strings (paths, refs, etc.) in the output as > they are is a feature or a flaw. A feature to me. The only thing that got through and but should not be grep-able is strerror(). But that's orthogonal to this scrambling stuff. > And because it embarrassingly leaks every single translated string... :) Easy to fix. I hope this means you have no more excuse to not push this forward ;-)
On Tue, Oct 23, 2018 at 12:17 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > > On Tue, Oct 23 2018, Johannes Schindelin wrote: > > > Hi Ævar, > > > > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote: > > > >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is > >> performance, but I don't think that matters. It's not like we're > >> printing gigabytes of _() formatted output. Everything where formatting > >> matters is plumbing which doesn't use this API. These messages are > >> always for human consumption. > > > > Well, let's make sure that your impression is correct before going too > > far. I, too, had the impression that gettext cannot possibly be expensive, > > especifally in Git for Windows' settings, where we do not even ship > > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid > > initialization if the locale dir is not present, 2018-04-21): > > > > The runtime of a simple `git.exe version` call on Windows is > > currently dominated by the gettext setup, adding a whopping ~150ms > > to the ~210ms total. > > > > I would be in favor of your change to make this a runtime option, of > > course, as long as it does not affect performance greatly (in particular > > on Windows, where we fight an uphill battle to make Git faster). > > How expensive gettext() may or may not be isn't relevant to the > GETTEXT_POISON compile-time option. If a user requests NO_GETTEXT, they should have zero (or near zero) cost related to gettext. Which is true until now (the inline _() version is expanded to pretty much no-op with the exception of NULL string) > The effect of what I'm suggesting here, and which my WIP patch in > <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a > one-time getenv() for each process that prints a _() message that we > aren't doing now, and for each message call a function that would check > a boolean "are we in poison mode" static global variable. Just don't do the getenv() inside _() even if you just do it one time. getenv() may invalidate whatever value from the previous call. I would not want to find out my code broke because of an getenv() inside some innocent _()... And we should still respect NO_GETTEXT, not doing any extra work when it's defined.
On Tue, Oct 23 2018, Duy Nguyen wrote: > On Tue, Oct 23, 2018 at 12:17 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> >> On Tue, Oct 23 2018, Johannes Schindelin wrote: >> >> > Hi Ævar, >> > >> > On Mon, 22 Oct 2018, Ævar Arnfjörð Bjarmason wrote: >> > >> >> So I think the only reason to keep it [GETTEXT_POISON] compile-time is >> >> performance, but I don't think that matters. It's not like we're >> >> printing gigabytes of _() formatted output. Everything where formatting >> >> matters is plumbing which doesn't use this API. These messages are >> >> always for human consumption. >> > >> > Well, let's make sure that your impression is correct before going too >> > far. I, too, had the impression that gettext cannot possibly be expensive, >> > especifally in Git for Windows' settings, where we do not even ship >> > translations. Yet see the commit message of cc5e1bf99247 (gettext: avoid >> > initialization if the locale dir is not present, 2018-04-21): >> > >> > The runtime of a simple `git.exe version` call on Windows is >> > currently dominated by the gettext setup, adding a whopping ~150ms >> > to the ~210ms total. >> > >> > I would be in favor of your change to make this a runtime option, of >> > course, as long as it does not affect performance greatly (in particular >> > on Windows, where we fight an uphill battle to make Git faster). >> >> How expensive gettext() may or may not be isn't relevant to the >> GETTEXT_POISON compile-time option. > > If a user requests NO_GETTEXT, they should have zero (or near zero) > cost related to gettext. Which is true until now (the inline _() > version is expanded to pretty much no-op with the exception of NULL > string) I'm assuming by "until now" you mean until my RFC patch in https://public-inbox.org/git/875zxtd59e.fsf@evledraar.gmail.com/ Yeah it's more expensive, but I don't see how it matters. We'd be nano-optimizing a thing that's never going to become a bottleneck. I.e. the patch is basically taking the commented-out codepath in this test program: #include <stdio.h> #include <stdlib.h> #include <string.h> int have_flag(void) { return 0; /* static int flag = -1; if (flag == -1) flag = strlen(getenv("GIT_TEST_FOO")); return flag; */ } int main(void) { while (1) { const char *out = have_flag() ? "foo" : "bar"; puts(out); } } When I test that on my laptop with gcc 8.2.0 I get ~230MB/s of output on both versions, i.e. where have_flag() can be trivially constant folded, and where we need to call getenv() for both of: GIT_TEST_FOO= ./main | pv >/dev/null GIT_TEST_FOO=1 ./main | pv >/dev/null We don't have any codepaths where we're calling gettext() to output more than a screenfull or two of output, so optimizing this doesn't make sense. >> The effect of what I'm suggesting here, and which my WIP patch in >> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a >> one-time getenv() for each process that prints a _() message that we >> aren't doing now, and for each message call a function that would check >> a boolean "are we in poison mode" static global variable. > > Just don't do the getenv() inside _() even if you just do it one time. > getenv() may invalidate whatever value from the previous call. I would > not want to find out my code broke because of an getenv() inside some > innocent _()... How would any code break? We have various getenv("GIT_TEST_*...") already that work the same way. Unless you set that in the environment the function is idempotent, and I don't see how anyone would do that accidentally. > And we should still respect NO_GETTEXT, not doing any extra work when > it's defined. This is not how any of our NO_* defines work. *Sometimes* defining them guarantees you do no extra work, but in other cases we've decided that bending over backwards with #ifdef in various places isn't worth it. E.g. grep_opt in grep.h is a good example of this. Even if you don't compile with PCRE you're pointlessly memzero-ing out members of the struct that'll never be used in your config, because it's easier to maintain the code that way (types always checked at compile-time). (And no, despite my involvement in PCRE I didn't introduce that particular pattern) For the reasons noted above I think NO_GETTEXT is another such case. Just like GIT_TEST_SPLIT_INDEX (added in your d6e3c181bc ("read-cache: force split index mode with GIT_TEST_SPLIT_INDEX", 2014-06-13)) etc.
On Tue, Oct 23, 2018 at 6:45 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > >> The effect of what I'm suggesting here, and which my WIP patch in > >> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a > >> one-time getenv() for each process that prints a _() message that we > >> aren't doing now, and for each message call a function that would check > >> a boolean "are we in poison mode" static global variable. > > > > Just don't do the getenv() inside _() even if you just do it one time. > > getenv() may invalidate whatever value from the previous call. I would > > not want to find out my code broke because of an getenv() inside some > > innocent _()... > > How would any code break? We have various getenv("GIT_TEST_*...") > already that work the same way. Unless you set that in the environment > the function is idempotent, and I don't see how anyone would do that > accidentally. I didn't check those GIT_TEST_ but I think the difference is in complexity. When you write var = getenv("foo"); complexFunction(); you probably start to feel scary and wrap getenv() with a strdup() because you usually don't know exactly what complexFunction() can call (and even if you do, you can't be sure what it may call in the future). The person who writes printf(_("%s"), getenv("foo")); may not go through the same thought process as with complexFunction(). If _() calls getenv(), because you the order of parameter evaluation is unspecified, you cannot be sure if getenv("foo") will be called before or after the one inside _(). One of them may screw up the other. > > And we should still respect NO_GETTEXT, not doing any extra work when > > it's defined. > > This is not how any of our NO_* defines work. *Sometimes* defining them > guarantees you do no extra work, but in other cases we've decided that > bending over backwards with #ifdef in various places isn't worth it. I guess it boils down to "worth it". For me #ifdef NO_GETTEXT would be limited to gettext.h and it's not that much work. But you do the actual work. You decide.
On Wed, Oct 24 2018, Duy Nguyen wrote: > On Tue, Oct 23, 2018 at 6:45 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> The effect of what I'm suggesting here, and which my WIP patch in >> >> <875zxtd59e.fsf@evledraar.gmail.com> implements is that we'd do a >> >> one-time getenv() for each process that prints a _() message that we >> >> aren't doing now, and for each message call a function that would check >> >> a boolean "are we in poison mode" static global variable. >> > >> > Just don't do the getenv() inside _() even if you just do it one time. >> > getenv() may invalidate whatever value from the previous call. I would >> > not want to find out my code broke because of an getenv() inside some >> > innocent _()... >> >> How would any code break? We have various getenv("GIT_TEST_*...") >> already that work the same way. Unless you set that in the environment >> the function is idempotent, and I don't see how anyone would do that >> accidentally. > > I didn't check those GIT_TEST_ but I think the difference is in > complexity. When you write > > var = getenv("foo"); > complexFunction(); > > you probably start to feel scary and wrap getenv() with a strdup() > because you usually don't know exactly what complexFunction() can call > (and even if you do, you can't be sure what it may call in the > future). > > The person who writes > > printf(_("%s"), getenv("foo")); > > may not go through the same thought process as with complexFunction(). > If _() calls getenv(), because you the order of parameter evaluation > is unspecified, you cannot be sure if getenv("foo") will be called > before or after the one inside _(). One of them may screw up the > other. Ah, you mean because getenv() is not reentrant such a construct may cause us to return something else entirely for getenv("bar") (e.g. in this case the value for getenv("bar")). Is that something you or anyone else has seen in practice? My intuition is that while POSIX doesn't make that promise it isn't likely that there's any implementation that would mutate the env purely on a getenv(), but setenv() (munging some static "env" area in-place) might be another matter. But we could always call use_gettext_poison() really early from git_setup_gettext() (called from our main()) if we're worried about this, it would then set the static "poison_requested" variable and we wouldn't call getenv() again, e.g. if we're later calling it in the middle of multithreaded code, or within the same same sequence point. >> > And we should still respect NO_GETTEXT, not doing any extra work when >> > it's defined. >> >> This is not how any of our NO_* defines work. *Sometimes* defining them >> guarantees you do no extra work, but in other cases we've decided that >> bending over backwards with #ifdef in various places isn't worth it. > > I guess it boils down to "worth it". For me #ifdef NO_GETTEXT would be > limited to gettext.h and it's not that much work. But you do the > actual work. You decide. Yeah the ifdef is pretty small and not really worth worrynig about, the main benefit is being able to run tests in this mode without recompiling. I hadn't been running with GETTEXT_POISON when I build git because I hadn't been bothered to build twice just for it, but am now running it alongside other GIT_TEST_* modes.
Duy Nguyen <pclouds@gmail.com> writes: > The person who writes > > printf(_("%s"), getenv("foo")); > > may not go through the same thought process as with complexFunction(). > If _() calls getenv(), because you the order of parameter evaluation > is unspecified, you cannot be sure if getenv("foo") will be called > before or after the one inside _(). One of them may screw up the > other. Yup, sometimes we've been sloppy but we should strive to mimick efforts like f4ef5173 ("determine_author_info(): copy getenv output", 2014-08-27).
On Thu, Oct 25, 2018 at 12:52:55PM +0900, Junio C Hamano wrote: > Duy Nguyen <pclouds@gmail.com> writes: > > > The person who writes > > > > printf(_("%s"), getenv("foo")); > > > > may not go through the same thought process as with complexFunction(). > > If _() calls getenv(), because you the order of parameter evaluation > > is unspecified, you cannot be sure if getenv("foo") will be called > > before or after the one inside _(). One of them may screw up the > > other. > > Yup, sometimes we've been sloppy but we should strive to mimick > efforts like f4ef5173 ("determine_author_info(): copy getenv > output", 2014-08-27). I've wondered about this before. Even calling: foo = xstrdup(getenv("bar")); is not necessarily correct, because xstrdup() relies on xmalloc(), which may check GIT_ALLOC_LIMIT (we do cache that result, but it can happen on the first malloc). I also wouldn't be surprised if there are cases in our threaded code that use getenv() without taking a lock. I've definitely run into setenv()/getenv() races on Linux (inside Git, even, though it was while working on custom code). But I wonder how common it is for getenv() to be invalidated by another getenv() call. Certainly POSIX allows it, but every implementation I've seen (which is admittedly few) is passing back pointers to a chunk of environment memory. I.e., could we mostly ignore this problem as not applying to most modern systems? And if there is such a system, give it a fallback like: /* * For systems that use a single buffer for getenv(), this hacks * around it by giving it _four_ buffers. That's just punting on * the problem, but it at least gives enough breathing room for * the caller to do something sane like use non-trivial functions * to copy the string. It still does nothing for threading, but * hopefully such systems don't support pthreads in the first place. ;) */ const char *xgetenv(const char *key) { static struct strbuf bufs[] = { STRBUF_INIT, STRBUF_INIT, STRBUF_INIT, STRBUF_INIT }; static unsigned int cur; struct strbuf *buf; const char *value; size_t len; value = getenv(key); if (!value) return NULL; buf = bufs[cur++]; cur %= ARRAY_SIZE(bufs); /* * We have to do this length check ourselves, because allocating * the strbuf may invalidate "value"! */ len = strlen(value); if (buf->alloc <= len) { strbuf_grow(buf, len); value = getenv(key); if (!value) return NULL; /* whoops, it went away! */ len = strlen(value); /* paranoia that it didn't change */ } strbuf_reset(buf); strbuf_add(buf, value, len); return buf->buf; } I dunno. Maybe I am being overly optimistic. But I strongly suspect we have such bugs already in our code base, and nobody has run into them (OTOH, they are quite finicky due to things like the caching I mentioned). -Peff
Jeff King <peff@peff.net> writes: > I.e., could we mostly ignore this problem as not applying to most modern > systems? And if there is such a system, give it a fallback like: > > /* > * For systems that use a single buffer for getenv(), this hacks > * around it by giving it _four_ buffers. That's just punting on > * the problem, but it at least gives enough breathing room for > * the caller to do something sane like use non-trivial functions > * to copy the string. It still does nothing for threading, but > * hopefully such systems don't support pthreads in the first place. ;) > */ Tempting. > I dunno. Maybe I am being overly optimistic. But I strongly suspect we > have such bugs already in our code base, and nobody has run into them Yeah, I tend to agree that I would not be surprised at all if we have many such "bugs" that triggers for nobody on modern platforms.
SZEDER Gábor <szeder.dev@gmail.com> writes: > On Mon, Oct 22, 2018 at 05:36:33PM +0200, Nguyễn Thái Ngọc Duy wrote: >> >> The current gettext() function just replaces all strings with >> '# GETTEXT POISON #' including format strings and hides the things >> that we should be allowed to grep (like branch names, or some other >> codes) even when gettext is poisoned. >> >> This patch implements the poisoned _() with a universal and totally >> legit language called Ook [1]. We could actually grep stuff even in >> with this because format strings are preserved. > > Once upon a time a GETTEXT_POISON build job failed on me, and the > error message: > > error: # GETTEXT POISON # > > was not particularly useful. Ook wouldn't help with that... > > So I came up with the following couple of patches that implement a > "scrambled" format that makes the poisoned output legible for humans > but still gibberish for machine consumption (i.e. grep-ing the text > part would still fail): > > error: U.n.a.b.l.e. .t.o. .c.r.e.a.t.e. .'./home/szeder/src/git/t/trash directory.t1404-update-ref-errors/.git/packed-refs...l.o.c.k.'.:. .File exists... > > I have been running GETTEXT_POISON builds with this series for some > months now, but haven't submitted it yet, because I haven't decided > yet whether including strings (paths, refs, etc.) in the output as > they are is a feature or a flaw. And because it embarrassingly leaks > every single translated string... :) There is similar technique called "pseudolocalization", meant for testing i18n aspect of software. In one of most common forms, the string Edit program settings woukd be translated to [!!! εÐiţ Þr0ģЯãm səTτıИğ§ !!!] (possibly using mirrored locale, i.e. right-to-left order). The brackets [!!! ... !!!] are used as a "poison", to detect translatable text, and to spot issues with truncation; it also helps with finding "lego" translation. It would also stress-test Unicode handling... Regards,
diff --git a/gettext.c b/gettext.c index 7272771c8e..29901e1ddd 100644 --- a/gettext.c +++ b/gettext.c @@ -56,6 +56,60 @@ int use_gettext_poison(void) } #endif +const char *gettext_poison(const char *msgid) +{ + /* + * gettext() returns a string that is always valid. We would + * need a hash map for that but let's stay simple and keep the + * last 64 gettext() results. Should be more than enough. + */ + static char *bufs[64]; + static int i; + struct strbuf sb = STRBUF_INIT; + char *buf; + const char *p; + const char *type_specifiers = "diouxXeEfFgGaAcsCSpnm%"; + + if (!strchr(msgid, '%')) + return "Eek!"; + + p = msgid; + while (*p) { + const char *type; + switch (*p) { + case '%': + /* + * No strict parsing. We simply look for the end of a + * format string + */ + type = p + 1; + while (*type && !strchr(type_specifiers, *type)) + type++; + if (*type) + type++; + strbuf_add(&sb, p, (int)(type - p)); + p = type; + break; + default: + if (!isalpha(*p)) { + strbuf_addch(&sb, *p); + p++; + break; + } + if (isupper(*p)) + strbuf_addstr(&sb, "Ook"); + else + strbuf_addstr(&sb, "ook"); + while (isalpha(*p)) + p++; + } + } + buf = bufs[(i++) % ARRAY_SIZE(bufs)]; + free(buf); + buf = strbuf_detach(&sb, NULL); + return buf; +} + #ifndef NO_GETTEXT static int test_vsnprintf(const char *fmt, ...) { diff --git a/gettext.h b/gettext.h index 7eee64a34f..dc9851a06a 100644 --- a/gettext.h +++ b/gettext.h @@ -41,8 +41,9 @@ static inline int gettext_width(const char *s) } #endif +const char *gettext_poison(const char *); #ifdef GETTEXT_POISON -extern int use_gettext_poison(void); +int use_gettext_poison(void); #else #define use_gettext_poison() 0 #endif @@ -51,14 +52,14 @@ static inline FORMAT_PRESERVING(1) const char *_(const char *msgid) { if (!*msgid) return ""; - return use_gettext_poison() ? "# GETTEXT POISON #" : gettext(msgid); + return use_gettext_poison() ? gettext_poison(msgid) : gettext(msgid); } static inline FORMAT_PRESERVING(1) FORMAT_PRESERVING(2) const char *Q_(const char *msgid, const char *plu, unsigned long n) { if (use_gettext_poison()) - return "# GETTEXT POISON #"; + return gettext_poison(msgid); return ngettext(msgid, plu, n); } diff --git a/t/lib-rebase.sh b/t/lib-rebase.sh index 2ca9fb69d6..1e8440e935 100644 --- a/t/lib-rebase.sh +++ b/t/lib-rebase.sh @@ -29,7 +29,7 @@ set_fake_editor () { */COMMIT_EDITMSG) test -z "$EXPECT_HEADER_COUNT" || test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# This is a combination of \(.*\) commits\./\1/p' < "$1")" || - test "# # GETTEXT POISON #" = "$(sed -n '1p' < "$1")" || + test "$EXPECT_HEADER_COUNT" = "$(sed -n '1s/^# Ook ook ook ook ook \(.*\) ook\./\1/p' < "$1")" || exit test -z "$FAKE_COMMIT_MESSAGE" || echo "$FAKE_COMMIT_MESSAGE" > "$1" test -z "$FAKE_COMMIT_AMEND" || echo "$FAKE_COMMIT_AMEND" >> "$1"
The current gettext() function just replaces all strings with '# GETTEXT POISON #' including format strings and hides the things that we should be allowed to grep (like branch names, or some other codes) even when gettext is poisoned. This patch implements the poisoned _() with a universal and totally legit language called Ook [1]. We could actually grep stuff even in with this because format strings are preserved. Long term, we could implement an "ook translator" for test_i18ngrep and friends so that they translate English to Ook, allowing us to match full text while making sure the text in the code is still marked for translation. [1] https://en.wikipedia.org/wiki/Unseen_University#Librarian Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> --- This started out as something fun to do while running the test suite last weekend. But it turns out actually working! If this patch ends up in git.git, the Librarian would be so proud! gettext.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++ gettext.h | 7 ++++--- t/lib-rebase.sh | 2 +- 3 files changed, 59 insertions(+), 4 deletions(-)