diff mbox series

ci: disallow directional formatting

Message ID pull.1071.git.1635857935312.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series ci: disallow directional formatting | expand

Commit Message

Johannes Schindelin Nov. 2, 2021, 12:58 p.m. UTC
From: Johannes Schindelin <johannes.schindelin@gmx.de>

As described in https://trojansource.codes/trojan-source.pdf, it is
possible to abuse directional formatting (a feature of Unicode) to
deceive human readers into interpreting code differently from compilers.

It is highly unlikely that Git's source code wants to contain such
directional formatting in the first place, so let's disallow it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    ci: disallow directional formatting
    
    I just stumbled over
    https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
    which details an interesting social-engineering attack: it uses
    directional formatting in source code to pretend to human readers that
    the code does something different than it actually does.
    
    It is highly unlikely that Git's source code wants to contain such
    directional formatting in the first place, so let's disallow it.
    
    Technically, this is not exactly -rc material, but the paper was just
    published, and I want us to be safe.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1071

 .github/workflows/main.yml | 7 +++++++
 1 file changed, 7 insertions(+)


base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49

Comments

Ævar Arnfjörð Bjarmason Nov. 2, 2021, 3:01 p.m. UTC | #1
On Tue, Nov 02 2021, Johannes Schindelin via GitGitGadget wrote:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> As described in https://trojansource.codes/trojan-source.pdf, it is
> possible to abuse directional formatting (a feature of Unicode) to
> deceive human readers into interpreting code differently from compilers.
>
> It is highly unlikely that Git's source code wants to contain such
> directional formatting in the first place, so let's disallow it.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>     ci: disallow directional formatting
>     
>     I just stumbled over
>     https://siliconangle.com/2021/11/01/trojan-source-technique-can-inject-malware-source-code-without-detection/,
>     which details an interesting social-engineering attack: it uses
>     directional formatting in source code to pretend to human readers that
>     the code does something different than it actually does.
>     
>     It is highly unlikely that Git's source code wants to contain such
>     directional formatting in the first place, so let's disallow it.
>     
>     Technically, this is not exactly -rc material, but the paper was just
>     published, and I want us to be safe.

There's a parallel discussion about doing something to detect this in
"git am", which for the git project seems like a better place to put
this.

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1071%2Fdscho%2Fcheck-for-utf-8-directional-formatting-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1071/dscho/check-for-utf-8-directional-formatting-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1071
>
>  .github/workflows/main.yml | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> index 6ed6a9e8076..7b4b4df03c3 100644
> --- a/.github/workflows/main.yml
> +++ b/.github/workflows/main.yml
> @@ -289,6 +289,13 @@ jobs:
>      - uses: actions/checkout@v2
>      - run: ci/install-dependencies.sh
>      - run: ci/run-static-analysis.sh
> +    - name: disallow Unicode directional formatting
> +      run: |
> +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> +        # could use `git grep -P` with the `\u` syntax).
> +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>    sparse:
>      needs: ci-config
>      if: needs.ci-config.outputs.enabled == 'yes'
>
> base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49

It would be easier to maintain this if were added to
ci/run-static-analysis.sh instead, where we have some similar tests, and
if it lives there we could do away with the double-escaping, then it can
also be run manually.

Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
its unconditional support for e.g. unicode properties in regexes.

How will this change impact RTL languages being added to po/?
Taylor Blau Nov. 2, 2021, 3:48 p.m. UTC | #2
On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
> There's a parallel discussion about doing something to detect this in
> "git am", which for the git project seems like a better place to put
> this.

I don't think that one impacts the other necessarily. Having `git am`
guard against this would probably be sufficient to protect Junio
accidentally apply something containing directional formatting to his
tree unknowingly.

But the idea that we rely on the import mechanism to protect against
this doesn't sit well with me. Ultimately, we should be relying on a
static check like below to ensure that directional formatting hasn't
entered the tree by any mechanism (not just 'git am').

> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > index 6ed6a9e8076..7b4b4df03c3 100644
> > --- a/.github/workflows/main.yml
> > +++ b/.github/workflows/main.yml
> > @@ -289,6 +289,13 @@ jobs:
> >      - uses: actions/checkout@v2
> >      - run: ci/install-dependencies.sh
> >      - run: ci/run-static-analysis.sh
> > +    - name: disallow Unicode directional formatting
> > +      run: |
> > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> > +        # could use `git grep -P` with the `\u` syntax).
> > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
> >    sparse:
> >      needs: ci-config
> >      if: needs.ci-config.outputs.enabled == 'yes'
> >
> > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
>
> It would be easier to maintain this if were added to
> ci/run-static-analysis.sh instead, where we have some similar tests, and
> if it lives there we could do away with the double-escaping, then it can
> also be run manually.
>
> Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
> its unconditional support for e.g. unicode properties in regexes.

I agree that the double-escaping is ugly. I think that this would be
easier to maintain if it lived in ci/run-static-analysis.sh or its own
script like ci/check-directional-formatting.sh.

And yes, constructing a byte pattern is a little odd as well, but I
think that it's the best you can do if you're limited to running 'git
grep' without libpcre. I wondered if we could depend on perl being
around during CI, but as far as I know we can since install Perl modules
in ci/install-dependencies.sh and use Perl extensively through the test
suite.

Thanks,
Taylor
Ævar Arnfjörð Bjarmason Nov. 2, 2021, 4:03 p.m. UTC | #3
On Tue, Nov 02 2021, Taylor Blau wrote:

> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> There's a parallel discussion about doing something to detect this in
>> "git am", which for the git project seems like a better place to put
>> this.
>
> I don't think that one impacts the other necessarily. Having `git am`
> guard against this would probably be sufficient to protect Junio
> accidentally apply something containing directional formatting to his
> tree unknowingly.
>
> But the idea that we rely on the import mechanism to protect against
> this doesn't sit well with me. Ultimately, we should be relying on a
> static check like below to ensure that directional formatting hasn't
> entered the tree by any mechanism (not just 'git am').
>
>> > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
>> > index 6ed6a9e8076..7b4b4df03c3 100644
>> > --- a/.github/workflows/main.yml
>> > +++ b/.github/workflows/main.yml
>> > @@ -289,6 +289,13 @@ jobs:
>> >      - uses: actions/checkout@v2
>> >      - run: ci/install-dependencies.sh
>> >      - run: ci/run-static-analysis.sh
>> > +    - name: disallow Unicode directional formatting
>> > +      run: |
>> > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
>> > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
>> > +        # could use `git grep -P` with the `\u` syntax).
>> > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
>> > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>> >    sparse:
>> >      needs: ci-config
>> >      if: needs.ci-config.outputs.enabled == 'yes'
>> >
>> > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
>>
>> It would be easier to maintain this if were added to
>> ci/run-static-analysis.sh instead, where we have some similar tests, and
>> if it lives there we could do away with the double-escaping, then it can
>> also be run manually.
>>
>> Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
>> its unconditional support for e.g. unicode properties in regexes.
>
> I agree that the double-escaping is ugly. I think that this would be
> easier to maintain if it lived in ci/run-static-analysis.sh or its own
> script like ci/check-directional-formatting.sh.
>
> And yes, constructing a byte pattern is a little odd as well, but I
> think that it's the best you can do if you're limited to running 'git
> grep' without libpcre. I wondered if we could depend on perl being
> around during CI, but as far as I know we can since install Perl modules
> in ci/install-dependencies.sh and use Perl extensively through the test
> suite.

We can hard depend on anything listed in [1][2], in this case there's a
Perl 5.30 available.

We don't need any Perl modules, it's all Perl-native regex features and
a small for-loop.

On the topic at large I wonder how much we need to worry about this at
all, seems like somewher between case of "the anglosphere discovers
scary foreign characters in Unicode again" and "security researcher
creates scary landing page for well-known Unicode edge-case"[3] :)

In this particular case the test cases involved seem extremely contrived
and unlikely to be something we'd end up with in our code in any case,
even in a case where all the reviewers would be fooled by this method in
itself.

I.e. since you can't use these to "fold" lines in the eyes of a human
viewer it's all rather contrived cases where comments and valid C code
lives on the same line, e.g.:

    /*‮ } ⁦if (isAdmin)⁩ ⁦ begin admins only */
        printf("You are an admin.\n");
    /* end admins only ‮ { ⁦*/

Which at least in Emacs is highlighted the right way, which is the first
major clue, and having tested some of these just now on AIX I've got xlc
whining about some cases that gcc/clang will happily pass by, so our
tooling at large will probably catch these anyway...

1. https://github.com/actions/virtual-environments
2. https://github.com/actions/virtual-environments/blob/main/images/linux/Ubuntu2004-README.md
3. https://www.trojansource.codes/
Johannes Schindelin Nov. 2, 2021, 4:12 p.m. UTC | #4
Hi Taylor,

On Tue, 2 Nov 2021, Taylor Blau wrote:

> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > There's a parallel discussion about doing something to detect this in
> > "git am", which for the git project seems like a better place to put
> > this.
>
> I don't think that one impacts the other necessarily. Having `git am`
> guard against this would probably be sufficient to protect Junio
> accidentally apply something containing directional formatting to his
> tree unknowingly.
>
> But the idea that we rely on the import mechanism to protect against
> this doesn't sit well with me. Ultimately, we should be relying on a
> static check like below to ensure that directional formatting hasn't
> entered the tree by any mechanism (not just 'git am').

Yep, the `git am` change and the CI change are addressing two very
different concerns.

> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
> > > index 6ed6a9e8076..7b4b4df03c3 100644
> > > --- a/.github/workflows/main.yml
> > > +++ b/.github/workflows/main.yml
> > > @@ -289,6 +289,13 @@ jobs:
> > >      - uses: actions/checkout@v2
> > >      - run: ci/install-dependencies.sh
> > >      - run: ci/run-static-analysis.sh
> > > +    - name: disallow Unicode directional formatting
> > > +      run: |
> > > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
> > > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
> > > +        # could use `git grep -P` with the `\u` syntax).
> > > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
> > > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
> > >    sparse:
> > >      needs: ci-config
> > >      if: needs.ci-config.outputs.enabled == 'yes'
> > >
> > > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
> >
> > It would be easier to maintain this if were added to
> > ci/run-static-analysis.sh instead, where we have some similar tests, and
> > if it lives there we could do away with the double-escaping, then it can
> > also be run manually.
> >
> > Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
> > its unconditional support for e.g. unicode properties in regexes.
>
> I agree that the double-escaping is ugly. I think that this would be
> easier to maintain if it lived in ci/run-static-analysis.sh or its own
> script like ci/check-directional-formatting.sh.

That's a good idea, I will put it into its own script in ci/.

> And yes, constructing a byte pattern is a little odd as well, but I
> think that it's the best you can do if you're limited to running 'git
> grep' without libpcre. I wondered if we could depend on perl being
> around during CI, but as far as I know we can since install Perl modules
> in ci/install-dependencies.sh and use Perl extensively through the test
> suite.

Perl alone won't fix anything. A crucial part of the `git grep` invocation
is the `-I` option: ignore binary files.

Ciao,
Dscho
Ævar Arnfjörð Bjarmason Nov. 2, 2021, 4:38 p.m. UTC | #5
On Tue, Nov 02 2021, Johannes Schindelin wrote:

> Hi Taylor,
>
> On Tue, 2 Nov 2021, Taylor Blau wrote:
>
>> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> > There's a parallel discussion about doing something to detect this in
>> > "git am", which for the git project seems like a better place to put
>> > this.
>>
>> I don't think that one impacts the other necessarily. Having `git am`
>> guard against this would probably be sufficient to protect Junio
>> accidentally apply something containing directional formatting to his
>> tree unknowingly.
>>
>> But the idea that we rely on the import mechanism to protect against
>> this doesn't sit well with me. Ultimately, we should be relying on a
>> static check like below to ensure that directional formatting hasn't
>> entered the tree by any mechanism (not just 'git am').
>
> Yep, the `git am` change and the CI change are addressing two very
> different concerns.
>
>> > > diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
>> > > index 6ed6a9e8076..7b4b4df03c3 100644
>> > > --- a/.github/workflows/main.yml
>> > > +++ b/.github/workflows/main.yml
>> > > @@ -289,6 +289,13 @@ jobs:
>> > >      - uses: actions/checkout@v2
>> > >      - run: ci/install-dependencies.sh
>> > >      - run: ci/run-static-analysis.sh
>> > > +    - name: disallow Unicode directional formatting
>> > > +      run: |
>> > > +        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
>> > > +        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
>> > > +        # could use `git grep -P` with the `\u` syntax).
>> > > +        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
>> > > +          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
>> > >    sparse:
>> > >      needs: ci-config
>> > >      if: needs.ci-config.outputs.enabled == 'yes'
>> > >
>> > > base-commit: 0cddd84c9f3e9c3d793ec93034ef679335f35e49
>> >
>> > It would be easier to maintain this if were added to
>> > ci/run-static-analysis.sh instead, where we have some similar tests, and
>> > if it lives there we could do away with the double-escaping, then it can
>> > also be run manually.
>> >
>> > Also, can't we just pipe "git ls-files -z" into "perl -0ne" here and use
>> > its unconditional support for e.g. unicode properties in regexes.
>>
>> I agree that the double-escaping is ugly. I think that this would be
>> easier to maintain if it lived in ci/run-static-analysis.sh or its own
>> script like ci/check-directional-formatting.sh.
>
> That's a good idea, I will put it into its own script in ci/.
>
>> And yes, constructing a byte pattern is a little odd as well, but I
>> think that it's the best you can do if you're limited to running 'git
>> grep' without libpcre. I wondered if we could depend on perl being
>> around during CI, but as far as I know we can since install Perl modules
>> in ci/install-dependencies.sh and use Perl extensively through the test
>> suite.
>
> Perl alone won't fix anything. A crucial part of the `git grep` invocation
> is the `-I` option: ignore binary files.

As noted in [1] I'm rather skeptical of us needing this at all, but why
is "-I" needed over asking "git grep" or "git ls-files" to exclude
binary files?

    git ls-files ':!(attr:binary)'

I.e. why do ad-hoc binary detection on the fly in git.git when we should
already be marking what files are binary?

If this check shows a false positive due to a binary file isn't that a
good thing (sans LTR issues I mentioned upthread), i.e. we should be
adding that file to .gitattributes, no?

In any case, I meant that the match on the RHS might be easier with
Perl, in such a pipeline you could always farm out the binary detection
to GNU grep on the LHS or whatever. Maybe you don't want to do it with
Perl, but using the -I option seems like a bad idea in either case.

1. https://lore.kernel.org/git/211102.86lf261q2e.gmgdl@evledraar.gmail.com/
Junio C Hamano Nov. 3, 2021, 5:20 p.m. UTC | #6
Taylor Blau <me@ttaylorr.com> writes:

> On Tue, Nov 02, 2021 at 04:01:57PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> There's a parallel discussion about doing something to detect this in
>> "git am", which for the git project seems like a better place to put
>> this.
>
> I don't think that one impacts the other necessarily. Having `git am`
> guard against this would probably be sufficient to protect Junio
> accidentally apply something containing directional formatting to his
> tree unknowingly.
>
> But the idea that we rely on the import mechanism to protect against
> this doesn't sit well with me. Ultimately, we should be relying on a
> static check like below to ensure that directional formatting hasn't
> entered the tree by any mechanism (not just 'git am').

Yes.  Quite honestly, such a check shouldn't be in "am" proper at
all.

Rather, for am users who care, they should protect themselves with
something like the pre-applypatch hook, which can perform the same
check as their pre-commit hook to protect their other commits.
diff mbox series

Patch

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 6ed6a9e8076..7b4b4df03c3 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -289,6 +289,13 @@  jobs:
     - uses: actions/checkout@v2
     - run: ci/install-dependencies.sh
     - run: ci/run-static-analysis.sh
+    - name: disallow Unicode directional formatting
+      run: |
+        # Use UTF-8-aware `printf` to feed a byte pattern to non-UTF-8-aware `git grep`
+        # (Ubuntu's `git grep` is compiled without support for libpcre, otherwise we
+        # could use `git grep -P` with the `\u` syntax).
+        ! LANG=C git grep -Il "$(LANG=C.UTF-8 printf \
+          '\\(\u202a\\|\u202b\\|\u202c\\|\u202d\\|\u202e\\|\u2066\\|\u2067\\|\u2068\\|\u2069\\)')"
   sparse:
     needs: ci-config
     if: needs.ci-config.outputs.enabled == 'yes'