diff mbox series

[v2,2/4] Makefile/coccicheck: speed up and fix bug with duplicate hunks

Message ID 20210305170724.23859-3-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series Makefile: fix bugs in coccicheck and speed it up | expand

Commit Message

Ævar Arnfjörð Bjarmason March 5, 2021, 5:07 p.m. UTC
Change the coccicheck target to run on all of our *.c and *.h files
with --no-includes, instead of only on the *.c files with
--all-includes.

This speeds it up significantly and reduces its memory usage, since it
doesn't need to parse N includes for every file it visits.

See [1] for a discussion thread about this commit with some timings
for details, but briefly: This change speeds it up by ~2x and makes it
use much less memory. Or a reduction of a max of around ~2GB
per-process (under the old SPATCH_BATCH_SIZE=0) to around ~200MB.

Looking at the history of the coccicheck target I think we've always
been running it in the wrong mode for what we wanted to achieve:

 - When it was added in 63f0a758a06 (add coccicheck make target,
   2016-09-15) it explicitly processed only the %.c files.

 - Then in a9a884aea57 (coccicheck: use --all-includes by default,
   2016-09-30) it started processing the %.h files by looking around for
   its own includes.

Let's instead just point it to both our *.c and *.h files, then
there's no need to have it recursively look around for included files
to change.

Setting --no-includes would not work if we expected to set
COCCI_SOURCES to a subset of our source files, but that's not what
we're doing here. If anyone manually tweaks COCCI_SOURCES they'll now
need to tweak SPATCH_FLAGS too. The speed and correctness we gain is
worth that small trade-off.

Using --no-includes also fixes a subtle bug introduced in
960154b9c17 (coccicheck: optionally batch spatch invocations,
2019-05-06) with duplicate hunks being written to the
generated *.patch files.

This is because that change altered a file-at-a-time for-loop to an
invocation of "xargs -n X". This would not matter for most other
programs, but it matters for spatch.

This is because each spatch invocation will maintain shared lock files
in /tmp, check if files being parsed were changed etc. I haven't dug
into why exactly, but it's easy to reproduce the issue[2]. The issue
goes away entirely if we just use --no-includes, which as noted above
would have made sense even without that issue.

1. https://lore.kernel.org/git/20210302205103.12230-1-avarab@gmail.com/
2. A session showing racy spatch under xargs -n X:

    $ cat test.cocci
    @@
    expression E1;
    @@
    - strbuf_avail(E1)
    + strbuf_has(E1)
    $ for i in 1 2 4 16 64 128 512
    do
        echo with xargs -n $i: &&
	echo *.c | xargs -n $i spatch --sp-file \
            test.cocci --all-includes --patch . 2>/dev/null | \
	grep -F +++ | sort | uniq -c
    done
    with xargs -n 1:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
    with xargs -n 2:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
    with xargs -n 4:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
    with xargs -n 16:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
          2 +++ b/strbuf.h
    with xargs -n 64:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
          2 +++ b/strbuf.h
    with xargs -n 128:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
          2 +++ b/strbuf.h
    with xargs -n 512:
          1 +++ b/convert.c
          1 +++ b/strbuf.c
          1 +++ b/strbuf.h

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Jeff King March 6, 2021, 10:45 a.m. UTC | #1
On Fri, Mar 05, 2021 at 06:07:22PM +0100, Ævar Arnfjörð Bjarmason wrote:

> Using --no-includes also fixes a subtle bug introduced in
> 960154b9c17 (coccicheck: optionally batch spatch invocations,
> 2019-05-06) with duplicate hunks being written to the
> generated *.patch files.
> 
> This is because that change altered a file-at-a-time for-loop to an
> invocation of "xargs -n X". This would not matter for most other
> programs, but it matters for spatch.
> 
> This is because each spatch invocation will maintain shared lock files
> in /tmp, check if files being parsed were changed etc. I haven't dug
> into why exactly, but it's easy to reproduce the issue[2]. The issue
> goes away entirely if we just use --no-includes, which as noted above
> would have made sense even without that issue.

This part still doesn't make any sense to me. If we are running with
SPATCH_BATCH_SIZE=1, which is the default, then "xargs -n" is still
going to run it in file-at-a-time mode. From spatch's perspective,
there's no difference between a for-loop and "xargs -n 1" (unless it
somehow cares about stdin, but it shouldn't).

Using strace, I do see it creating files in /tmp, but they are all named
after the process id, and cleaned up before exit. So I don't see how
they could interfere with each other (certainly not in a sequential run,
but even if you were to use "xargs -P" to get parallel runs, they seem
distinct).

If we increase the batch size, I'd expect _fewer_ duplicates. Because in
file-at-a-time mode with --all-includes, wouldn't every file that
mentions an include possibly end up emitting a patch for it?

The results you show here (which do replicate for me) imply something
much weirder is going on:

>     with xargs -n 1:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c
>     with xargs -n 2:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c
>     with xargs -n 4:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c

These results are wrong! They are not finding the entry in strbuf.h that
should be changed.

>     with xargs -n 16:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c
>           2 +++ b/strbuf.h
>     with xargs -n 64:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c
>           2 +++ b/strbuf.h
>     with xargs -n 128:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c
>           2 +++ b/strbuf.h

These ones are also wrong. Now we find the strbuf.h mention, but we are
finding it twice.

>     with xargs -n 512:
>           1 +++ b/convert.c
>           1 +++ b/strbuf.c
>           1 +++ b/strbuf.h

And this one, which is given all of the paths in one invocation gets it
right. I'd expect that over the "128" version, which is running two
independent spatch invocations. But the fact that "64" and "16" produce
exactly two duplicates makes little sense. And even less that 1, 2, and
4 don't find the header change at all.

Running the same test with a for loop produces the same (wrong) results
as "-n 1", as expected:

  $ for i in *.c; do
      spatch --sp-file test.cocci --all-includes --patch . $i 2>/dev/null
    done | grep -F +++ | sort | uniq -c
        1 +++ b/convert.c
        1 +++ b/strbuf.c

So in short I have no idea what spatch is doing. But I remain
unconvinced that there is anything wrong with the batch-size patch.

Running it like your patch will, feeding the header directly along with
--no-includes, does find the correct result:

  $ for i in *.c *.h; do
      spatch --sp-file test.cocci --no-includes --patch . $i 2>/dev/null
    done | grep -F +++ | sort | uniq -c
        1 +++ b/convert.c
        1 +++ b/strbuf.c
        1 +++ b/strbuf.h

though I am still concerned by René's example which is missing more
results.

-Peff
Ævar Arnfjörð Bjarmason March 6, 2021, 7:29 p.m. UTC | #2
On Sat, Mar 06 2021, Jeff King wrote:

> On Fri, Mar 05, 2021 at 06:07:22PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Using --no-includes also fixes a subtle bug introduced in
>> 960154b9c17 (coccicheck: optionally batch spatch invocations,
>> 2019-05-06) with duplicate hunks being written to the
>> generated *.patch files.
>> 
>> This is because that change altered a file-at-a-time for-loop to an
>> invocation of "xargs -n X". This would not matter for most other
>> programs, but it matters for spatch.
>> 
>> This is because each spatch invocation will maintain shared lock files
>> in /tmp, check if files being parsed were changed etc. I haven't dug
>> into why exactly, but it's easy to reproduce the issue[2]. The issue
>> goes away entirely if we just use --no-includes, which as noted above
>> would have made sense even without that issue.
>
> This part still doesn't make any sense to me. If we are running with
> SPATCH_BATCH_SIZE=1, which is the default, then "xargs -n" is still
> going to run it in file-at-a-time mode. From spatch's perspective,
> there's no difference between a for-loop and "xargs -n 1" (unless it
> somehow cares about stdin, but it shouldn't).
>
> Using strace, I do see it creating files in /tmp, but they are all named
> after the process id, and cleaned up before exit. So I don't see how
> they could interfere with each other (certainly not in a sequential run,
> but even if you were to use "xargs -P" to get parallel runs, they seem
> distinct).
>
> If we increase the batch size, I'd expect _fewer_ duplicates. Because in
> file-at-a-time mode with --all-includes, wouldn't every file that
> mentions an include possibly end up emitting a patch for it?
>
> The results you show here (which do replicate for me) imply something
> much weirder is going on:
>
>>     with xargs -n 1:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>>     with xargs -n 2:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>>     with xargs -n 4:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>
> These results are wrong! They are not finding the entry in strbuf.h that
> should be changed.
>
>>     with xargs -n 16:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>>           2 +++ b/strbuf.h
>>     with xargs -n 64:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>>           2 +++ b/strbuf.h
>>     with xargs -n 128:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>>           2 +++ b/strbuf.h
>
> These ones are also wrong. Now we find the strbuf.h mention, but we are
> finding it twice.
>
>>     with xargs -n 512:
>>           1 +++ b/convert.c
>>           1 +++ b/strbuf.c
>>           1 +++ b/strbuf.h
>
> And this one, which is given all of the paths in one invocation gets it
> right. I'd expect that over the "128" version, which is running two
> independent spatch invocations. But the fact that "64" and "16" produce
> exactly two duplicates makes little sense. And even less that 1, 2, and
> 4 don't find the header change at all.

Yes, I don't know what's going on with spatch. Just the observed results
of duplication before/after youc commit.

> Running the same test with a for loop produces the same (wrong) results
> as "-n 1", as expected:
>
>   $ for i in *.c; do
>       spatch --sp-file test.cocci --all-includes --patch . $i 2>/dev/null
>     done | grep -F +++ | sort | uniq -c
>         1 +++ b/convert.c
>         1 +++ b/strbuf.c
>
> So in short I have no idea what spatch is doing. But I remain
> unconvinced that there is anything wrong with the batch-size patch.

Right, but that's an unrelated "didn't change this thing it should have"
bug (which we had before), not the duplicate hunk bug.

> Running it like your patch will, feeding the header directly along with
> --no-includes, does find the correct result:
>
>   $ for i in *.c *.h; do
>       spatch --sp-file test.cocci --no-includes --patch . $i 2>/dev/null
>     done | grep -F +++ | sort | uniq -c
>         1 +++ b/convert.c
>         1 +++ b/strbuf.c
>         1 +++ b/strbuf.h
>
> though I am still concerned by René's example which is missing more
> results.

*nod*. I just sent a patch on top to fix that, although it undoes most
 of the speedup in this series.
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index f881b558c44..798a0517131 100644
--- a/Makefile
+++ b/Makefile
@@ -1196,7 +1196,8 @@  SPARSE_FLAGS ?=
 SP_EXTRA_FLAGS = -Wno-universal-initializer
 
 # For the 'coccicheck' target
-SPATCH_FLAGS = --all-includes --patch .
+SPATCH_FLAGS = --no-includes --patch .
+
 # For the 'coccicheck' target; setting SPATCH_BATCH_SIZE higher will
 # usually result in less CPU usage at the cost of higher peak memory.
 # Setting it to 0 will feed all files in a single spatch invocation.
@@ -2853,7 +2854,7 @@  check: config-list.h command-list.h
 		exit 1; \
 	fi
 
-FOUND_C_SOURCES = $(filter %.c,$(shell $(FIND_SOURCE_FILES)))
+FOUND_C_SOURCES = $(filter %.c %.h,$(shell $(FIND_SOURCE_FILES)))
 COCCI_SOURCES = $(filter-out $(THIRD_PARTY_SOURCES),$(FOUND_C_SOURCES))
 
 %.cocci.patch: %.cocci $(COCCI_SOURCES)