diff mbox series

leak tests: add an interface to the LSAN_OPTIONS "suppressions"

Message ID patch-1.1-9190f3c128f-20211022T102725Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series leak tests: add an interface to the LSAN_OPTIONS "suppressions" | expand

Commit Message

Ævar Arnfjörð Bjarmason Oct. 22, 2021, 10:32 a.m. UTC
Extend the SANITIZE=leak testing mode added in 956d2e4639b (tests: add
a test mode for SANITIZE=leak, run it in CI, 2021-09-23) to optionally
be able to add a "suppressions" file to the $LSAN_OPTIONS.

This allows for marking tests as passing with
"TEST_PASSES_SANITIZE_LEAK=true" when they still have failure due more
general widespread memory leaks, such as from the "git log" family of
commands. We can now mark the "git -C" tests as passing.

For getting specific tests to pass this is preferable to using
UNLEAK() in these codepaths, as I'll have fixes for those leaks soon,
and being able to atomically mark relevant tests as passing with
"TEST_PASSES_SANITIZE_LEAK=true" helps to explain those changes. See
[1] for more details.

1. https://lore.kernel.org/git/211022.86sfwtl6uj.gmgdl@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

On Fri, Oct 22 2021, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Oct 22 2021, Taylor Blau wrote:
>
>> On Thu, Oct 21, 2021 at 01:50:55PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Wed, Oct 20 2021, Taylor Blau wrote:
[...]
> If you want to pick that approach up and run with it I think it would
> probably make sense to factor that suggested test_expect_success out
> into a function in test-lib-functions.sh or whatever, and call it as
> e.g.:
>     
>     TEST_PASSES_SANITIZE_LEAK=true
>      . ./test-lib.sh
>     declare_known_leaks <<-\EOF
>     add_rev_cmdline
>     [...]
>     EOF
>
> Then pipe it through sed 's/^/leak:/' and have it set LSAN_OPTIONS for
> you.
>
> Doing it that way would be less boilerplate for each test that wants it,
> and is also more likely to work with other non-LSAN leak appoaches,
> i.e. as long as something can take a list of lines matching stack traces
> we can feed that to that leak checker's idea of a whitelist.

I just went ahead and hacked that up. If you're OK with that approach
it would really help reduce the work for leak changes I've got
planned, and as noted gives you the end-state of a passing 5319.

I don't know if it makes more sense for you to base on top of this
if/when Junio picks it up, or to integrate it into your series
etc. Maybe Junio will chime in ...

 t/t0056-git-C.sh        |  4 ++++
 t/test-lib-functions.sh | 42 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

Comments

Taylor Blau Oct. 26, 2021, 8:23 p.m. UTC | #1
On Fri, Oct 22, 2021 at 12:32:17PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Extend the SANITIZE=leak testing mode added in 956d2e4639b (tests: add
> a test mode for SANITIZE=leak, run it in CI, 2021-09-23) to optionally
> be able to add a "suppressions" file to the $LSAN_OPTIONS.
>
> This allows for marking tests as passing with
> "TEST_PASSES_SANITIZE_LEAK=true" when they still have failure due more
> general widespread memory leaks, such as from the "git log" family of
> commands. We can now mark the "git -C" tests as passing.
>
> For getting specific tests to pass this is preferable to using
> UNLEAK() in these codepaths, as I'll have fixes for those leaks soon,
> and being able to atomically mark relevant tests as passing with
> "TEST_PASSES_SANITIZE_LEAK=true" helps to explain those changes. See
> [1] for more details.
>
> 1. https://lore.kernel.org/git/211022.86sfwtl6uj.gmgdl@evledraar.gmail.com/
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>
> On Fri, Oct 22 2021, Ævar Arnfjörð Bjarmason wrote:
>
> > On Fri, Oct 22 2021, Taylor Blau wrote:
> >
> >> On Thu, Oct 21, 2021 at 01:50:55PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>>
> >>> On Wed, Oct 20 2021, Taylor Blau wrote:
> [...]
> > If you want to pick that approach up and run with it I think it would
> > probably make sense to factor that suggested test_expect_success out
> > into a function in test-lib-functions.sh or whatever, and call it as
> > e.g.:
> >
> >     TEST_PASSES_SANITIZE_LEAK=true
> >      . ./test-lib.sh
> >     declare_known_leaks <<-\EOF
> >     add_rev_cmdline
> >     [...]
> >     EOF
> >
> > Then pipe it through sed 's/^/leak:/' and have it set LSAN_OPTIONS for
> > you.
> >
> > Doing it that way would be less boilerplate for each test that wants it,
> > and is also more likely to work with other non-LSAN leak appoaches,
> > i.e. as long as something can take a list of lines matching stack traces
> > we can feed that to that leak checker's idea of a whitelist.
>
> I just went ahead and hacked that up. If you're OK with that approach
> it would really help reduce the work for leak changes I've got
> planned, and as noted gives you the end-state of a passing 5319.
>
> I don't know if it makes more sense for you to base on top of this
> if/when Junio picks it up, or to integrate it into your series
> etc. Maybe Junio will chime in ...

Hmm. This seems neat, but I haven't been able to get it to work without
encountering a rather annoying bug along the way.

Here is the preamble of my modified t5319 to include all of the leaks I
found in the previous round (but decided not to fix):

TEST_PASSES_SANITIZE_LEAK=true
. ./test-lib.sh
todo_leaks <<-\EOF
^add_rev_cmdline$
^cmd_log_init_finish$
^prepare_revision_walk$
^prep_parse_options$
EOF

So running that when git is compiled with SANITIZE=leak *should* work,
but instead produces this rather annoying leak detection after t5319.7:

  =================================================================
  ==1785298==ERROR: LeakSanitizer: detected memory leaks

  Indirect leak of 41 byte(s) in 3 object(s) allocated from:
      #0 0x7f2f2f866db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
      #1 0x7f2f2f64b4aa in __GI___strdup string/strdup.c:42

  -----------------------------------------------------
  Suppressions used:
    count      bytes template
        1        576 ^add_rev_cmdline$
  -----------------------------------------------------

  SUMMARY: LeakSanitizer: 41 byte(s) leaked in 3 allocation(s).
  Aborted

Huh? Looking past __GI___strdup (which I assume is just a
symbolification failure on ASan's part), who calls strdup()? That trace
is annoyingly incomplete, and doesn't really give us anything to go off
of.

It does seem to remind me of this:

  https://github.com/google/sanitizers/issues/148

though that issue goes in so many different directions that I'm not sure
whether it really is the same issue or not. In any case, that leak
*still* shows up even when suppressing xmalloc() and xrealloc(), so I
almost think that it's a leak within ASan itself.

But most interesting is that those calls go away when I stop setting
`suppressions` in $LSAN_OPTIONS by dropping the call to your todo_leaks.

So this all feels like a bug in ASan to me. I'm curious if it works on
your system, but in the meantime I think the best path forward is to
drop the last patch of my original series (the one with the three
UNLEAK() calls) and to avoid relying on this patch for the time being.

Thanks,
Taylor
Jeff King Oct. 26, 2021, 9:11 p.m. UTC | #2
On Tue, Oct 26, 2021 at 04:23:14PM -0400, Taylor Blau wrote:

> So this all feels like a bug in ASan to me. I'm curious if it works on
> your system, but in the meantime I think the best path forward is to
> drop the last patch of my original series (the one with the three
> UNLEAK() calls) and to avoid relying on this patch for the time being.

Bugs aside, I'd much rather see UNLEAK() annotations than external ones,
for all the reasons we introduced UNLEAK() in the first place:

  - it keeps the annotations near the code. Yes, that creates conflicts
    when the code is changed (or the leak is actually fixed), but that's
    a feature. It keeps them from going stale.

  - leak-checkers only know where things are allocated, not who is
    supposed to own them. So you're often annotating the wrong thing;
    it's not a strdup() call which is buggy and leaking, but the
    function five layers up the stack which was supposed to take
    ownership and didn't.

And if we avoid any annotation bugs by doing so, that's icing on the
cake. :)

-Peff
Taylor Blau Oct. 26, 2021, 9:30 p.m. UTC | #3
On Tue, Oct 26, 2021 at 05:11:29PM -0400, Jeff King wrote:
> On Tue, Oct 26, 2021 at 04:23:14PM -0400, Taylor Blau wrote:
>
> > So this all feels like a bug in ASan to me. I'm curious if it works on
> > your system, but in the meantime I think the best path forward is to
> > drop the last patch of my original series (the one with the three
> > UNLEAK() calls) and to avoid relying on this patch for the time being.
>
> Bugs aside, I'd much rather see UNLEAK() annotations than external ones,
> for all the reasons we introduced UNLEAK() in the first place:
>
>   - it keeps the annotations near the code. Yes, that creates conflicts
>     when the code is changed (or the leak is actually fixed), but that's
>     a feature. It keeps them from going stale.

I agree completely. I noted as much in my message here:

    https://lore.kernel.org/git/YXJAfICQN8s5Gm7s@nand.local/

but Ævar made it sound like his work would be made much easier without
the conflict. Since I'm not in any kind of rush to make t5319 leak-free,
I figured that queueing the parts of that series that wouldn't conflict
with Ævar's ongoing work would be a net-positive.

>   - leak-checkers only know where things are allocated, not who is
>     supposed to own them. So you're often annotating the wrong thing;
>     it's not a strdup() call which is buggy and leaking, but the
>     function five layers up the stack which was supposed to take
>     ownership and didn't.

TBH, I find this much more compelling. Either way, I don't feel strongly
enough to deviate from v2 much, and I don't want to create more work for
Ævar along the way.

Thanks,
Taylor
Jeff King Oct. 26, 2021, 9:48 p.m. UTC | #4
On Tue, Oct 26, 2021 at 05:30:47PM -0400, Taylor Blau wrote:

> > Bugs aside, I'd much rather see UNLEAK() annotations than external ones,
> > for all the reasons we introduced UNLEAK() in the first place:
> >
> >   - it keeps the annotations near the code. Yes, that creates conflicts
> >     when the code is changed (or the leak is actually fixed), but that's
> >     a feature. It keeps them from going stale.
> 
> I agree completely. I noted as much in my message here:
> 
>     https://lore.kernel.org/git/YXJAfICQN8s5Gm7s@nand.local/
> 
> but Ævar made it sound like his work would be made much easier without
> the conflict. Since I'm not in any kind of rush to make t5319 leak-free,
> I figured that queueing the parts of that series that wouldn't conflict
> with Ævar's ongoing work would be a net-positive.

Yeah, to be clear, if there's work in progress in an area, then _not_
annotating it (with either method) is perfectly fine with me in the
meantime.

-Peff
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 7:51 a.m. UTC | #5
On Tue, Oct 26 2021, Taylor Blau wrote:

> On Fri, Oct 22, 2021 at 12:32:17PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Extend the SANITIZE=leak testing mode added in 956d2e4639b (tests: add
>> a test mode for SANITIZE=leak, run it in CI, 2021-09-23) to optionally
>> be able to add a "suppressions" file to the $LSAN_OPTIONS.
>>
>> This allows for marking tests as passing with
>> "TEST_PASSES_SANITIZE_LEAK=true" when they still have failure due more
>> general widespread memory leaks, such as from the "git log" family of
>> commands. We can now mark the "git -C" tests as passing.
>>
>> For getting specific tests to pass this is preferable to using
>> UNLEAK() in these codepaths, as I'll have fixes for those leaks soon,
>> and being able to atomically mark relevant tests as passing with
>> "TEST_PASSES_SANITIZE_LEAK=true" helps to explain those changes. See
>> [1] for more details.
>>
>> 1. https://lore.kernel.org/git/211022.86sfwtl6uj.gmgdl@evledraar.gmail.com/
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>
>> On Fri, Oct 22 2021, Ævar Arnfjörð Bjarmason wrote:
>>
>> > On Fri, Oct 22 2021, Taylor Blau wrote:
>> >
>> >> On Thu, Oct 21, 2021 at 01:50:55PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> >>>
>> >>> On Wed, Oct 20 2021, Taylor Blau wrote:
>> [...]
>> > If you want to pick that approach up and run with it I think it would
>> > probably make sense to factor that suggested test_expect_success out
>> > into a function in test-lib-functions.sh or whatever, and call it as
>> > e.g.:
>> >
>> >     TEST_PASSES_SANITIZE_LEAK=true
>> >      . ./test-lib.sh
>> >     declare_known_leaks <<-\EOF
>> >     add_rev_cmdline
>> >     [...]
>> >     EOF
>> >
>> > Then pipe it through sed 's/^/leak:/' and have it set LSAN_OPTIONS for
>> > you.
>> >
>> > Doing it that way would be less boilerplate for each test that wants it,
>> > and is also more likely to work with other non-LSAN leak appoaches,
>> > i.e. as long as something can take a list of lines matching stack traces
>> > we can feed that to that leak checker's idea of a whitelist.
>>
>> I just went ahead and hacked that up. If you're OK with that approach
>> it would really help reduce the work for leak changes I've got
>> planned, and as noted gives you the end-state of a passing 5319.
>>
>> I don't know if it makes more sense for you to base on top of this
>> if/when Junio picks it up, or to integrate it into your series
>> etc. Maybe Junio will chime in ...
>
> Hmm. This seems neat, but I haven't been able to get it to work without
> encountering a rather annoying bug along the way.
>
> Here is the preamble of my modified t5319 to include all of the leaks I
> found in the previous round (but decided not to fix):
>
> TEST_PASSES_SANITIZE_LEAK=true
> . ./test-lib.sh
> todo_leaks <<-\EOF
> ^add_rev_cmdline$
> ^cmd_log_init_finish$
> ^prepare_revision_walk$
> ^prep_parse_options$
> EOF
>
> So running that when git is compiled with SANITIZE=leak *should* work,
> but instead produces this rather annoying leak detection after t5319.7:
>
>   =================================================================
>   ==1785298==ERROR: LeakSanitizer: detected memory leaks
>
>   Indirect leak of 41 byte(s) in 3 object(s) allocated from:
>       #0 0x7f2f2f866db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
>       #1 0x7f2f2f64b4aa in __GI___strdup string/strdup.c:42
>
>   -----------------------------------------------------
>   Suppressions used:
>     count      bytes template
>         1        576 ^add_rev_cmdline$
>   -----------------------------------------------------
>
>   SUMMARY: LeakSanitizer: 41 byte(s) leaked in 3 allocation(s).
>   Aborted
>
> Huh? Looking past __GI___strdup (which I assume is just a
> symbolification failure on ASan's part), who calls strdup()? That trace
> is annoyingly incomplete, and doesn't really give us anything to go off
> of.
>
> It does seem to remind me of this:
>
>   https://github.com/google/sanitizers/issues/148
>
> though that issue goes in so many different directions that I'm not sure
> whether it really is the same issue or not. In any case, that leak
> *still* shows up even when suppressing xmalloc() and xrealloc(), so I
> almost think that it's a leak within ASan itself.
>
> But most interesting is that those calls go away when I stop setting
> `suppressions` in $LSAN_OPTIONS by dropping the call to your todo_leaks.
>
> So this all feels like a bug in ASan to me. I'm curious if it works on
> your system, but in the meantime I think the best path forward is to
> drop the last patch of my original series (the one with the three
> UNLEAK() calls) and to avoid relying on this patch for the time being.

There are similar cases where LSAN doesn't provide as meaningful of a
stacktrace as valgrind, sometimes when tracing leaks I get a relatively
bad stacktrace like that ending in some __GI___*<stdlib-name>
function. I'll usually compile without SANITIZE=leak and just run a
slower:

    valgrind --leak-check=full --show-leak-kinds=all

In this case however it's not a bug or bad leak tracing, but an artifact
of us using these stacktrace exclusions.

If you run this manually you'll see:
    
    $ ~/g/git/git -c core.multiPackIndex=false rev-list --objects --all
    cd0747a9352b58d112f0010134351efc7bbad4a6
    [... snipped output ...]
    
    =================================================================
    ==58023==ERROR: LeakSanitizer: detected memory leaks
    
    Direct leak of 576 byte(s) in 1 object(s) allocated from:
        #0 0x7fd0bfae50c5 in __interceptor_realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:82
        #1 0x5637b3bd9163 in xrealloc /home/avar/g/git/wrapper.c:126
        #2 0x5637b3b6a1e4 in add_rev_cmdline /home/avar/g/git/revision.c:1482
        #3 0x5637b3b6a412 in handle_one_ref /home/avar/g/git/revision.c:1534
        #4 0x5637b3b49c4e in do_for_each_ref_helper /home/avar/g/git/refs.c:1483
        #5 0x5637b3b54afc in do_for_each_repo_ref_iterator refs/iterator.c:418
        #6 0x5637b3b49cc8 in do_for_each_ref /home/avar/g/git/refs.c:1498
        #7 0x5637b3b49d07 in refs_for_each_ref /home/avar/g/git/refs.c:1504
        #8 0x5637b3b6a563 in handle_refs /home/avar/g/git/revision.c:1578
        #9 0x5637b3b6e0e7 in handle_revision_pseudo_opt /home/avar/g/git/revision.c:2597
        #10 0x5637b3b6e9d5 in setup_revisions /home/avar/g/git/revision.c:2738
        #11 0x5637b39ebc58 in cmd_rev_list builtin/rev-list.c:550
        #12 0x5637b3932a89 in run_builtin /home/avar/g/git/git.c:461
        #13 0x5637b3932e98 in handle_builtin /home/avar/g/git/git.c:713
        #14 0x5637b3933105 in run_argv /home/avar/g/git/git.c:780
        #15 0x5637b39335ae in cmd_main /home/avar/g/git/git.c:911
        #16 0x5637b3a1a898 in main /home/avar/g/git/common-main.c:52
        #17 0x7fd0bf860d09 in __libc_start_main ../csu/libc-start.c:308
    
    Indirect leak of 41 byte(s) in 3 object(s) allocated from:
        #0 0x7fd0bfae4db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
        #1 0x7fd0bf8c7e4a in __GI___strdup string/strdup.c:42
    
    SUMMARY: LeakSanitizer: 617 byte(s) leaked in 4 allocation(s).

Which is why I put the "strdup" there, but forgot to tell you when I
submitted the real PATCH version that that one shouldn't have the ^$
anchoring.

FWIW this will work on top of your series:
    
    diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
    index a3c72b68f7c..891de720b2c 100755
    --- a/t/t5319-multi-pack-index.sh
    +++ b/t/t5319-multi-pack-index.sh
    @@ -2,6 +2,17 @@
     
     test_description='multi-pack-indexes'
     . ./test-lib.sh
    +todo_leaks <<-\EOF
    +^add_rev_cmdline$
    +^cmd_log_init_finish$
    +^prep_parse_options$
    +^prepare_revision_walk$
    +^start_progress_delay$
    +
    +# An indirect leak reported because we excluded the leaks above
    +strdup$
    +EOF
    +
     
     GIT_TEST_MULTI_PACK_INDEX=0
     objdir=.git/objects
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 8:04 a.m. UTC | #6
On Tue, Oct 26 2021, Jeff King wrote:

It seems that I've gotten what I wanted from this side-thread, yay!
Thanks both, not to pile on, but just to clarify:

> On Tue, Oct 26, 2021 at 04:23:14PM -0400, Taylor Blau wrote:
>
>> So this all feels like a bug in ASan to me. I'm curious if it works on
>> your system, but in the meantime I think the best path forward is to
>> drop the last patch of my original series (the one with the three
>> UNLEAK() calls) and to avoid relying on this patch for the time being.
>
> Bugs aside, I'd much rather see UNLEAK() annotations than external ones,
> for all the reasons we introduced UNLEAK() in the first place:
>
>   - it keeps the annotations near the code. Yes, that creates conflicts
>     when the code is changed (or the leak is actually fixed), but that's
>     a feature. It keeps them from going stale.

I fully agree with you in general for "good" uses of UNLEAK(). E.g. consider:

    struct strbuf buf = xmalloc(...);
    [...]
    UNLEAK(buf);

If I was fixing leaks in that sort of code what I pointed out downthread
in [1] wouldn't apply.

So to clarify, I'm not asking in [1] that UNLEAK() not be used at all
while we have in-flight leak fixes. I.e. I'd run into a textual
conflict, but that would be trivial to resolve, and obvious what the
semantic & textual conflict was.

Rather, it's not marking specific leaks, but UNLEAK() on a container
struct that's problematic.

Depending on how they're used those structs may or may not leak. So
e.g. Taylor's upthread 11/11[2] contained:
    
    diff --git a/builtin/rev-list.c b/builtin/rev-list.c
    index 36cb909eba..df3811e763 100644
    --- a/builtin/rev-list.c
    +++ b/builtin/rev-list.c
    @@ -549,6 +549,8 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
     
            argc = setup_revisions(argc, argv, &revs, &s_r_opt);
     
    +       UNLEAK(revs);
    +
            memset(&info, 0, sizeof(info));
            info.revs = &revs;
            if (revs.bisect)

Which if you on master replace that with:
    
    diff --git a/builtin/rev-list.c b/builtin/rev-list.c
    index 36cb909ebaa..3cce87e0eb7 100644
    --- a/builtin/rev-list.c
    +++ b/builtin/rev-list.c
    @@ -548,6 +548,7 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
                    revs.do_not_die_on_missing_tree = 1;
     
            argc = setup_revisions(argc, argv, &revs, &s_r_opt);
    +       BUG("using this?");
     
            memset(&info, 0, sizeof(info));
            info.revs = &revs;

You'll run into a failure with:

    GIT_TEST_PASSING_SANITIZE_LEAK=true ./t0002-gitfile.sh

I.e. our existing leak tests already take that codepath and don't run
into a leak from using "revs".

So we wouldn't just be marking a known specific leak, but hiding all
leaks & non-leaks in container from the top-level, and thus hide
potential regressions, an addition to attaining the end-goal of marking
some specific test as passing.

Now, that specific example isn't very useful right now, since we don't
have a release_revisions() at all, but e.g. the first batch of fixes
I've got for the revisions.[ch] leak fix most common cases (e.g. the
pending array), but not some obscure ones.

Being able to incrementally mark those leaks as fixed on a test-by-test
basis has the advantage over UNLEAK() that we won't have regressions
while we're not at a 100% leak free (at which point we could remove the
UNLEAK()).

>   - leak-checkers only know where things are allocated, not who is
>     supposed to own them. So you're often annotating the wrong thing;
>     it's not a strdup() call which is buggy and leaking, but the
>     function five layers up the stack which was supposed to take
>     ownership and didn't.

As noted in [3] this case is because the LSAN suppressions list was in
play, so we excluded the meaningful part of the stack trace (which is
shown in that mail).

Hrm, now that I think about it I think that the cases where I had to
resort to valgrind to get around such crappy stacktraces (when that was
all I got, even without suppressions) were probably all because there
was an UNLEAK() in play...

1. https://lore.kernel.org/git/211022.86sfwtl6uj.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/f1bb8b73ffdb78995dc5653791f9a64adf216e21.1634787555.git.me@ttaylorr.com/
3. https://lore.kernel.org/git/211027.86zgquyk52.gmgdl@evledraar.gmail.com/
Jeff King Oct. 27, 2021, 9:06 a.m. UTC | #7
On Wed, Oct 27, 2021 at 10:04:17AM +0200, Ævar Arnfjörð Bjarmason wrote:

> >   - it keeps the annotations near the code. Yes, that creates conflicts
> >     when the code is changed (or the leak is actually fixed), but that's
> >     a feature. It keeps them from going stale.
> 
> I fully agree with you in general for "good" uses of UNLEAK(). E.g. consider:
> 
>     struct strbuf buf = xmalloc(...);
>     [...]
>     UNLEAK(buf);
> 
> If I was fixing leaks in that sort of code what I pointed out downthread
> in [1] wouldn't apply.

I'm OK with such a use of UNLEAK(), though of course you could probably
just free the buffer in that instance.

But...

> So to clarify, I'm not asking in [1] that UNLEAK() not be used at all
> while we have in-flight leak fixes. I.e. I'd run into a textual
> conflict, but that would be trivial to resolve, and obvious what the
> semantic & textual conflict was.
> 
> Rather, it's not marking specific leaks, but UNLEAK() on a container
> struct that's problematic.

...that's the whole point of UNLEAK(), IMHO. You know that the container
struct is leaked, but you don't care because the program is about to
exit, or there's no easy way to avoid the leak.

So it's not the "container" element, but rather it can be a problem if
people annotate too broadly (you will miss some leaks). In the case of
rev_info, there is no way to _not_ leak right now, because it has no
cleanup function. In the long run we should probably add one. But in the
meantime, annotating them so we can find the _interesting_ leaks is
worth doing (and that was the whole point of adding UNLEAK in the first
place.

And I think the UNLEAK() calls in Taylor's patch are fine in that sense.
While yes, _some_ runs of those commands will not leak, the point is to
say that when they do leak, they are not interesting. And they are not
interesting because that rev_info is in scope until the program exits,
so anything it points to is only leaked at a moment where we no longer
care.

> So we wouldn't just be marking a known specific leak, but hiding all
> leaks & non-leaks in container from the top-level, and thus hide
> potential regressions, an addition to attaining the end-goal of marking
> some specific test as passing.

I'd argue it _is_ a known specific leak. It is rev_info going out of
scope that causes the leak, not rev_info holding on to memory in things
like its pending array.

Now those can be interesting, too (if it no longer needs the memory, can
it perhaps get rid of it). But IMHO all of that is pretty secondary to
clearing the noise so you can find "true" leaks (ones where some
sub-function really is allocating and then losing the pointer entirely,
especially if it's called an arbitrary number of times).

> >   - leak-checkers only know where things are allocated, not who is
> >     supposed to own them. So you're often annotating the wrong thing;
> >     it's not a strdup() call which is buggy and leaking, but the
> >     function five layers up the stack which was supposed to take
> >     ownership and didn't.
> 
> As noted in [3] this case is because the LSAN suppressions list was in
> play, so we excluded the meaningful part of the stack trace (which is
> shown in that mail).

I don't think that's true at all. The annotations you showed in that
message, for example, are pointing at add_rev_cmdline(). But that is
_not_ the source of the leak, nor where it should be fixed. It is a
necessary part of how rev_info works. The leak is when the rev_info goes
out of scope without anybody cleaning it up.

> Hrm, now that I think about it I think that the cases where I had to
> resort to valgrind to get around such crappy stacktraces (when that was
> all I got, even without suppressions) were probably all because there
> was an UNLEAK() in play...

I don't see how UNLEAK() would impact stack traces. It should either
make something not-leaked-at-all (in which case LSan will no longer
mention it), or it does nothing (it throws some wasted memory into a
structure which is itself not leaked).

-Peff
Junio C Hamano Oct. 27, 2021, 8:21 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> I don't think that's true at all. The annotations you showed in that
> message, for example, are pointing at add_rev_cmdline(). But that is
> _not_ the source of the leak, nor where it should be fixed. It is a
> necessary part of how rev_info works. The leak is when the rev_info goes
> out of scope without anybody cleaning it up.

True.

>> Hrm, now that I think about it I think that the cases where I had to
>> resort to valgrind to get around such crappy stacktraces (when that was
>> all I got, even without suppressions) were probably all because there
>> was an UNLEAK() in play...
>
> I don't see how UNLEAK() would impact stack traces. It should either
> make something not-leaked-at-all (in which case LSan will no longer
> mention it), or it does nothing (it throws some wasted memory into a
> structure which is itself not leaked).

I'd prefer to see judicious use of UNLEAK() over LSAN suppressions
that lists things that are not the source of the problem.  The list
being kept in the tests that are far away from the actual source
does not help, either.

Thanks.
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 8:57 p.m. UTC | #9
On Wed, Oct 27 2021, Jeff King wrote:

> On Wed, Oct 27, 2021 at 10:04:17AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> >   - it keeps the annotations near the code. Yes, that creates conflicts
>> >     when the code is changed (or the leak is actually fixed), but that's
>> >     a feature. It keeps them from going stale.
>> 
>> I fully agree with you in general for "good" uses of UNLEAK(). E.g. consider:
>> 
>>     struct strbuf buf = xmalloc(...);
>>     [...]
>>     UNLEAK(buf);
>> 
>> If I was fixing leaks in that sort of code what I pointed out downthread
>> in [1] wouldn't apply.
>
> I'm OK with such a use of UNLEAK(), though of course you could probably
> just free the buffer in that instance.
>
> But...
>
>> So to clarify, I'm not asking in [1] that UNLEAK() not be used at all
>> while we have in-flight leak fixes. I.e. I'd run into a textual
>> conflict, but that would be trivial to resolve, and obvious what the
>> semantic & textual conflict was.
>> 
>> Rather, it's not marking specific leaks, but UNLEAK() on a container
>> struct that's problematic.
>
> ...that's the whole point of UNLEAK(), IMHO. You know that the container
> struct is leaked, but you don't care because the program is about to
> exit, or there's no easy way to avoid the leak.
>
> So it's not the "container" element, but rather it can be a problem if
> people annotate too broadly (you will miss some leaks). In the case of
> rev_info, there is no way to _not_ leak right now, because it has no
> cleanup function.

It doesn't have one, but there are uses of setup_revisions() and
rev_info usage that don't leak, as that builtin/rev-list.c case shows.

I mean, in that case it's not doing much of anything, but at least we
test that setup_revisions() itself doesn't leak right now, but wouldn't
with UNLEAK().

> In the long run we should probably add one.

After the current in-flight patches I've got making diff_free() a bit
more useful, and then a revisions_release() that'll handle most cases,
which gets us started on mass-marking the tests as leak free.

So just FWIW I'm not saying "hey can we hold off on that UNLEAK() for
far future xyz", but for a thing I've got queued up that I'd rather not
start rewriting...

> [...]But in the meantime, annotating them so we can find the
> _interesting_ leaks is worth doing (and that was the whole point of
> adding UNLEAK in the first place.
>
> And I think the UNLEAK() calls in Taylor's patch are fine in that sense.
> While yes, _some_ runs of those commands will not leak, the point is to
> say that when they do leak, they are not interesting. And they are not
> interesting because that rev_info is in scope until the program exits,
> so anything it points to is only leaked at a moment where we no longer
> care.

I don't per-se care about leaks at program exit either, but am making
them leak-free as a proxy for making the relevant APIs useful for
persistent use.

So I think the historical approach of just sprinkling UNLEAK() since
we're exiting anyway is taking too narrow of a view.

In that case if we'd like to see if revisions.[ch] is leak-free we'd
need parallel test-helper tests (or whatever) for all its features, or
we can just piggy-back on every "git log" invocation in the test
suite...

>> So we wouldn't just be marking a known specific leak, but hiding all
>> leaks & non-leaks in container from the top-level, and thus hide
>> potential regressions, an addition to attaining the end-goal of marking
>> some specific test as passing.
>
> I'd argue it _is_ a known specific leak. It is rev_info going out of
> scope that causes the leak, not rev_info holding on to memory in things
> like its pending array.
>
> Now those can be interesting, too (if it no longer needs the memory, can
> it perhaps get rid of it). But IMHO all of that is pretty secondary to
> clearing the noise so you can find "true" leaks (ones where some
> sub-function really is allocating and then losing the pointer entirely,
> especially if it's called an arbitrary number of times).

I wasn't trying to get into the philosophical discussion of whether a
struct or a struct it contains can be said to be the "real" source of a
leak :)

Yeah, I agree that from that POV there's no difference.

I'm merely referring to the practical aspect of patches I've got queued
up, which do things like fix all leaks of rev_info.pending, mark tests
that now pass, fix another rev_info.whatever, mark tests etc.

For rev_info in particular it's a bit iffy to UNLEAK() it, since it also
contains pointers to stuff it doesn't "own", sometimes things are
allocated and handed off to it, sometimes for a while, sometimes
permanently.

So by UNLEAK()ing it you're not just unleaking "its" memory, but
potentially hiding leaks in APIs that need to interact with it in that
way. E.g. mailmap.c is one of those cases.

>> >   - leak-checkers only know where things are allocated, not who is
>> >     supposed to own them. So you're often annotating the wrong thing;
>> >     it's not a strdup() call which is buggy and leaking, but the
>> >     function five layers up the stack which was supposed to take
>> >     ownership and didn't.
>> 
>> As noted in [3] this case is because the LSAN suppressions list was in
>> play, so we excluded the meaningful part of the stack trace (which is
>> shown in that mail).
>
> I don't think that's true at all. The annotations you showed in that
> message, for example, are pointing at add_rev_cmdline(). But that is
> _not_ the source of the leak, nor where it should be fixed. It is a
> necessary part of how rev_info works. The leak is when the rev_info goes
> out of scope without anybody cleaning it up.
>
>> Hrm, now that I think about it I think that the cases where I had to
>> resort to valgrind to get around such crappy stacktraces (when that was
>> all I got, even without suppressions) were probably all because there
>> was an UNLEAK() in play...
>
> I don't see how UNLEAK() would impact stack traces. It should either
> make something not-leaked-at-all (in which case LSan will no longer
> mention it), or it does nothing (it throws some wasted memory into a
> structure which is itself not leaked).

Yes, I think either categorically wrong here, or it applies to some
other case I wasn't able to dig up. Or maybe not, doesn't Taylor's
example take it from "Direct leak" to "Indirect leak" with the
suppression in play? I think those were related somehow (but don't have
that in front of me as I type this out).

E.g. (to reinforce your point) try compiling with SANITIZE=leak and running:

    $ TZ=UTC t/helper/test-tool date show:format:%z 1466000000 +0200
    1466000000 -> +0000
    +0200 -> +0000
    
    =================================================================
    ==335188==ERROR: LeakSanitizer: detected memory leaks
    
    Direct leak of 3 byte(s) in 1 object(s) allocated from:
        #0 0x7f31cdd21db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
        #1 0x7f31cdb04e4a in __GI___strdup string/strdup.c:42
    
    SUMMARY: LeakSanitizer: 3 byte(s) leaked in 1 allocation(s).

Now try with "SANITIZE=" and valgrind can give you the "real" source:

    $ TZ=UTC valgrind --leak-check=full t/helper/test-tool date show:format:%z 1466000000 +0200
    ==336515== Memcheck, a memory error detector
    ==336515== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
    ==336515== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
    ==336515== Command: t/helper/test-tool date show:format:%z 1466000000 +0200
    ==336515==
    1466000000 -> +0000
    +0200 -> +0000
    ==336515==
    ==336515== HEAP SUMMARY:
    ==336515==     in use at exit: 1,208 bytes in 13 blocks
    ==336515==   total heap usage: 50 allocs, 37 frees, 27,593 bytes allocated
    ==336515==
    ==336515== 3 bytes in 1 blocks are definitely lost in loss record 1 of 13
    ==336515==    at 0x483877F: malloc (vg_replace_malloc.c:307)
    ==336515==    by 0x49C6E4A: strdup (strdup.c:42)
    ==336515==    by 0x261E3A: xstrdup (wrapper.c:29)
    ==336515==    by 0x152785: parse_date_format (date.c:991)
    ==336515==    by 0x116A34: show_dates (test-date.c:39)
    ==336515==    by 0x116DE9: cmd__date (test-date.c:116)
    ==336515==    by 0x115227: cmd_main (test-tool.c:127)
    ==336515==    by 0x115334: main (common-main.c:52)
    ==336515==
    ==336515== LEAK SUMMARY:
    ==336515==    definitely lost: 3 bytes in 1 blocks
    ==336515==    indirectly lost: 0 bytes in 0 blocks
    ==336515==      possibly lost: 0 bytes in 0 blocks
    ==336515==    still reachable: 1,205 bytes in 12 blocks
    ==336515==         suppressed: 0 bytes in 0 blocks
    ==336515== Reachable blocks (those to which a pointer was found) are not shown.
    ==336515== To see them, rerun with: --leak-check=full --show-leak-kinds=all
    ==336515==
    ==336515== For lists of detected and suppressed errors, rerun with: -s
    ==336515== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

I don't know why it habitually loses track of seemingly easy-to-track
leaks sometimes, but it does. There isn't an UNLEAK() in play here.

When I run into these I try it with valgrind, fix the leak, then test
again with SANITIZE=leak. I haven't found cases where it actually misses
things, just cases where the stack traces suck.
Jeff King Oct. 29, 2021, 8:56 p.m. UTC | #10
On Wed, Oct 27, 2021 at 10:57:52PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > So it's not the "container" element, but rather it can be a problem if
> > people annotate too broadly (you will miss some leaks). In the case of
> > rev_info, there is no way to _not_ leak right now, because it has no
> > cleanup function.
> 
> It doesn't have one, but there are uses of setup_revisions() and
> rev_info usage that don't leak, as that builtin/rev-list.c case shows.
> 
> I mean, in that case it's not doing much of anything, but at least we
> test that setup_revisions() itself doesn't leak right now, but wouldn't
> with UNLEAK().

I don't think that's true. If you UNLEAK() the rev_info in the caller,
then it will only affect allocations that are still reachable from
rev_info. I.e., things that are by definition not a leak in
setup_revisions().

Now you could argue that setup_revisions() is "leaking" by allocating
things and stuffing them into rev_info that it should not be. But we can
never know that until we have an actual function that cleans up a
rev_info, which defines what it's "supposed" to have ownership of.

Maybe we have callers that explicitly try to de-allocate bits of the
rev_info. But IMHO that is the source of the whole problem: how is
random code using rev_info supposed to know which of its internal
details are owned or not? This should be documented and enforced with a
single function.

> So just FWIW I'm not saying "hey can we hold off on that UNLEAK() for
> far future xyz", but for a thing I've got queued up that I'd rather not
> start rewriting...

Just to be clear: I am totally fine with dropping Taylor's UNLEAK
patches (as I've said already). I was only arguing here that annotating
via external files is worse than just adding an UNLEAK().

I'm also trying to combat what I see as mis-conceptions or inaccuracies
about what UNLEAK() does or its implications (or even what counts as a
"leak"). But I hope in the long run that we don't need _any_ kind of
annotation, because we'll actually be leak-free. And then we don't have
to care about any of this.

> > I don't see how UNLEAK() would impact stack traces. It should either
> > make something not-leaked-at-all (in which case LSan will no longer
> > mention it), or it does nothing (it throws some wasted memory into a
> > structure which is itself not leaked).
> 
> Yes, I think either categorically wrong here, or it applies to some
> other case I wasn't able to dig up. Or maybe not, doesn't Taylor's
> example take it from "Direct leak" to "Indirect leak" with the
> suppression in play? I think those were related somehow (but don't have
> that in front of me as I type this out).

I don't think UNLEAK() can move something from "direct" to "indirect" in
LSan's terminology. If rev_info points to an array of structs, and those
structs point to allocated strings, then the array itself is a "direct"
leak, and the strings are "indirect" (they are leaked, but presumably
fixing the direct leak would also deallocate them).

If UNLEAK() makes the array not-leaked, then those indirect leaks don't
become direct. They should be transitively not-leaked, too.

> E.g. (to reinforce your point) try compiling with SANITIZE=leak and running:
> 
>     $ TZ=UTC t/helper/test-tool date show:format:%z 1466000000 +0200
>     1466000000 -> +0000
>     +0200 -> +0000
>     
>     =================================================================
>     ==335188==ERROR: LeakSanitizer: detected memory leaks
>     
>     Direct leak of 3 byte(s) in 1 object(s) allocated from:
>         #0 0x7f31cdd21db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
>         #1 0x7f31cdb04e4a in __GI___strdup string/strdup.c:42
>     
>     SUMMARY: LeakSanitizer: 3 byte(s) leaked in 1 allocation(s).

So these should be real leaks. Of course with the lousy stack trace it's
hard to see what they are. But I don't see how UNLEAK() is responsible
for making the lousy stack trace. You could try compiling with LSan but
_not_ -DSUPPRESS_ANNOTATED_LEAKS and see if the result is similarly bad
(but I expect it to be, since test-date.c does not have any UNLEAK()
calls in it).

-Peff
Jeff King Oct. 29, 2021, 9:05 p.m. UTC | #11
On Fri, Oct 29, 2021 at 04:56:31PM -0400, Jeff King wrote:

> > E.g. (to reinforce your point) try compiling with SANITIZE=leak and running:
> > 
> >     $ TZ=UTC t/helper/test-tool date show:format:%z 1466000000 +0200
> >     1466000000 -> +0000
> >     +0200 -> +0000
> >     
> >     =================================================================
> >     ==335188==ERROR: LeakSanitizer: detected memory leaks
> >     
> >     Direct leak of 3 byte(s) in 1 object(s) allocated from:
> >         #0 0x7f31cdd21db0 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:54
> >         #1 0x7f31cdb04e4a in __GI___strdup string/strdup.c:42
> >     
> >     SUMMARY: LeakSanitizer: 3 byte(s) leaked in 1 allocation(s).
> 
> So these should be real leaks. Of course with the lousy stack trace it's
> hard to see what they are. But I don't see how UNLEAK() is responsible
> for making the lousy stack trace. You could try compiling with LSan but
> _not_ -DSUPPRESS_ANNOTATED_LEAKS and see if the result is similarly bad
> (but I expect it to be, since test-date.c does not have any UNLEAK()
> calls in it).

I just tested this, and it is unrelated to UNLEAK(). Interestingly,
compiling with SANITIZE=address does give the correct stack trace, so I
think LSan is just buggy here for some reason.

We usually suppress leaks with ASAN_OPTIONS because there are so many
(and we want to get at the more important signal of actual memory
errors). But it wouldn't be hard to have that setting respect your
PASSING_SANITIZE_LEAK variables. The big downside is that ASan runs of
the test suite are much more expensive.

Hmm. A little googling turns up LSan's fast_unwind_on_malloc option. And
indeed:

  $ LSAN_OPTIONS=fast_unwind_on_malloc=0 t/helper/test-tool date show:format:%z 1466000000 +0200
  1466000000 -> +0000
  +0200 -> +0000
  
  =================================================================
  ==39628==ERROR: LeakSanitizer: detected memory leaks
  
  Direct leak of 3 byte(s) in 1 object(s) allocated from:
      #0 0x7fa664e39b94 in __interceptor_malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:56
      #1 0x7fa664c124aa in __GI___strdup string/strdup.c:42
      #2 0x557636c40c2e in xstrdup /home/peff/compile/git/wrapper.c:29
      #3 0x557636b2ae52 in parse_date_format /home/peff/compile/git/date.c:991
      #4 0x557636aedb0e in show_dates t/helper/test-date.c:39
      #5 0x557636aedee1 in cmd__date t/helper/test-date.c:116
      #6 0x557636aec1f0 in cmd_main t/helper/test-tool.c:127
      #7 0x557636aec30d in main /home/peff/compile/git/common-main.c:52
      #8 0x7fa664babe49 in __libc_start_main ../csu/libc-start.c:314
      #9 0x557636aebec9 in _start (/home/peff/compile/git/t/helper/test-tool+0xcec9)
  
  SUMMARY: LeakSanitizer: 3 byte(s) leaked in 1 allocation(s).

So perhaps we ought to be setting that by default.

-Peff
diff mbox series

Patch

diff --git a/t/t0056-git-C.sh b/t/t0056-git-C.sh
index 2630e756dab..490aefa81a1 100755
--- a/t/t0056-git-C.sh
+++ b/t/t0056-git-C.sh
@@ -2,7 +2,11 @@ 
 
 test_description='"-C <path>" option and its effects on other path-related options'
 
+TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
+todo_leaks <<-\EOF
+^cmd_log_init_finish$
+EOF
 
 test_expect_success '"git -C <path>" runs git from the directory <path>' '
 	test_create_repo dir1 &&
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index eef2262a360..d89bf5da7dc 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -243,6 +243,48 @@  debug () {
 	done
 }
 
+# Declare known "general" memory leaks, for use with TEST_PASSES_SANITIZE_LEAK=true.
+#
+# Matches lines in a stack trace that leaks. Intended for
+# LSAN_OPTIONS, but the format is intended to be easy to use with
+# other leak checkers, so the "leak:" prefix is omitted (and added for
+# you).
+#
+# Use it immediately after sourcing test-lib.sh (or equivalent), and
+# after a "TEST_PASSES_SANITIZE_LEAK=true" has been set. E.g:
+#
+#    TEST_PASSES_SANITIZE_LEAK=true
+#    . ./test-lib.sh
+#    todo_leaks <<-\EOF
+#    ^cmd_log_init_finish$
+#    EOF
+#
+# The "^" and "$" anchors don't suggest full regex syntax support,
+# that's the only anchoring (or other metacharacter) understood by
+# LSAN_OPTIONS,.
+#
+# See
+# https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer#suppressions
+# for the relevant LSAN_OPTIONS documentation.
+todo_leaks () {
+	if ! test_have_prereq SANITIZE_LEAK
+	then
+		return 0
+	fi
+
+	# Try not to interfere with any test logic
+	suppressions=.lsan-suppressions.txt
+	if test -d .git
+	then
+		suppressions=".git/$suppressions"
+	fi
+	suppressions="$PWD/$suppressions"
+
+	sed 's/^/leak:/' >"$suppressions" &&
+	LSAN_OPTIONS="$LSAN_OPTIONS:suppressions=\"$suppressions\"" &&
+	export LSAN_OPTIONS
+}
+
 # Usage: test_commit [options] <message> [<file> [<contents> [<tag>]]]
 #   -C <dir>:
 #	Run all git commands in directory <dir>