coccinelle: improve array.cocci
diff mbox series

Message ID 0d9cf772-268d-bd00-1cbb-0bbbec9dfc9a@web.de
State New
Headers show
Series
  • coccinelle: improve array.cocci
Related show

Commit Message

Markus Elfring Nov. 18, 2019, 4:10 p.m. UTC
From: Markus Elfring <elfring@users.sourceforge.net>
Date: Mon, 18 Nov 2019 17:00:37 +0100

This script contained transformation rules for the semantic patch language
which used similar code.

1. Delete two SmPL rules which were used to transform source code fragments
   (pointer expressions) so that the search pattern “sizeof(T)” would work
   in the third rule.
   See also the topic “coccinelle: adjustments for array.cocci?”:
   https://public-inbox.org/git/f28f5fb8-2814-9df5-faf2-7146ed1a1f4d@web.de/

2. Combine the remaining rules by using six SmPL disjunctions.

3. Adjust case distinctions and corresponding metavariables so that
   the desired search for update candidates can be more complete.

4. Increase the precision for the specification of required changes.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 contrib/coccinelle/array.cocci | 100 ++++++---------------------------
 1 file changed, 18 insertions(+), 82 deletions(-)

--
2.24.0

Comments

René Scharfe Nov. 19, 2019, 7:15 p.m. UTC | #1
Am 18.11.19 um 17:10 schrieb Markus Elfring:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Mon, 18 Nov 2019 17:00:37 +0100
>
> This script contained transformation rules for the semantic patch language
> which used similar code.
>
> 1. Delete two SmPL rules which were used to transform source code fragments
>    (pointer expressions) so that the search pattern “sizeof(T)” would work
>    in the third rule.
>    See also the topic “coccinelle: adjustments for array.cocci?”:
>    https://public-inbox.org/git/f28f5fb8-2814-9df5-faf2-7146ed1a1f4d@web.de/
>
> 2. Combine the remaining rules by using six SmPL disjunctions.
>
> 3. Adjust case distinctions and corresponding metavariables so that
>    the desired search for update candidates can be more complete.
>
> 4. Increase the precision for the specification of required changes.
>
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---
>  contrib/coccinelle/array.cocci | 100 ++++++---------------------------
>  1 file changed, 18 insertions(+), 82 deletions(-)

The diff is hard to read, so here's the resulting semantic patch:

-- start --
@@
type T;
T[] src_arr;
expression n, dst_e, src_e;
expression* dst_p_e, src_p_e;
@@
(
(
-memcpy
+COPY_ARRAY
|
-memmove
+MOVE_ARRAY
)
       (
        dst_e,
        src_e
-       , (n) * \( sizeof(T) \| sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \)
+       , n
       )
|
+ALLOC_ARRAY(
             dst_p_e
-                    = xmalloc((n) * \( sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \| sizeof(T) \))
+            , n)
)
-- end --

I like that COPY_ARRAY and MOVE_ARRAY are handled in the same rule,
as they share the same parameters and do the same -- except that
the latter handles overlaps, while the former may be a bit faster.

And I like that it's short.

I don't like that ALLOC_ARRAY is handled in the same rule, as it is
quite different from the other two macros.

Coccinelle needs significantly longer to apply the new version.
Here are times for master:

Benchmark #1: make contrib/coccinelle/array.cocci.patch
  Time (mean ± σ):     19.314 s ±  0.200 s    [User: 19.065 s, System: 0.224 s]
  Range (min … max):   19.009 s … 19.718 s    10 runs

... and here with the patch applied:

Benchmark #1: make contrib/coccinelle/array.cocci.patch
  Time (mean ± σ):     43.420 s ±  0.490 s    [User: 43.087 s, System: 0.273 s]
  Range (min … max):   42.636 s … 44.359 s    10 runs

The current version checks if source and destination are of the same type,
and whether the sizeof operand is either said type or an element of source
or destination.  The new one does not.  So I don't see claim 4 ("Increase
the precision") fulfilled, quite the opposite rather.  It can produce e.g.
a transformation like this:

 void f(int *dst, char *src, size_t n)
 {
-	memcpy(dst, src, n * sizeof(short));
+	COPY_ARRAY(dst, src, n);
 }

The COPY_ARRAY there effectively expands to:

	memcpy(dst, src, n * sizeof(*dst));

... which is quite different -- if short is 2 bytes wide and int 4 bytes
then we copy twice as many bytes as before.

I think an automatic transformation should only be generated if it is
safe.  It's hard to spot a weird case in a generated patch amid ten
well-behaving ones.

>
> diff --git a/contrib/coccinelle/array.cocci b/contrib/coccinelle/array.cocci
> index 46b8d2ee11..bcd6ff4793 100644
> --- a/contrib/coccinelle/array.cocci
> +++ b/contrib/coccinelle/array.cocci
> @@ -1,90 +1,26 @@
> -@@
> -expression dst, src, n, E;
> -@@
> -  memcpy(dst, src, n * sizeof(
> -- E[...]
> -+ *(E)
> -  ))
> -
>  @@
>  type T;
> -T *ptr;
> -T[] arr;
> -expression E, n;
> -@@
> -(
> -  memcpy(ptr, E,
> -- n * sizeof(*(ptr))
> -+ n * sizeof(T)
> -  )
> -|
> -  memcpy(arr, E,
> -- n * sizeof(*(arr))
> -+ n * sizeof(T)
> -  )
> -|
> -  memcpy(E, ptr,
> -- n * sizeof(*(ptr))
> -+ n * sizeof(T)
> -  )
> -|
> -  memcpy(E, arr,
> -- n * sizeof(*(arr))
> -+ n * sizeof(T)
> -  )
> -)
> -
> -@@
> -type T;
> -T *dst_ptr;
> -T *src_ptr;
> -T[] dst_arr;
>  T[] src_arr;
> -expression n;
> +expression n, dst_e, src_e;
> +expression* dst_p_e, src_p_e;
>  @@
>  (
> -- memcpy(dst_ptr, src_ptr, (n) * sizeof(T))
> -+ COPY_ARRAY(dst_ptr, src_ptr, n)
> -|
> -- memcpy(dst_ptr, src_arr, (n) * sizeof(T))
> -+ COPY_ARRAY(dst_ptr, src_arr, n)
> -|
> -- memcpy(dst_arr, src_ptr, (n) * sizeof(T))
> -+ COPY_ARRAY(dst_arr, src_ptr, n)
> -|
> -- memcpy(dst_arr, src_arr, (n) * sizeof(T))
> -+ COPY_ARRAY(dst_arr, src_arr, n)
> -)
> -
> -@@
> -type T;
> -T *dst;
> -T *src;
> -expression n;
> -@@
>  (
> -- memmove(dst, src, (n) * sizeof(*dst));
> -+ MOVE_ARRAY(dst, src, n);
> -|
> -- memmove(dst, src, (n) * sizeof(*src));
> -+ MOVE_ARRAY(dst, src, n);
> +-memcpy
> ++COPY_ARRAY
>  |
> -- memmove(dst, src, (n) * sizeof(T));
> -+ MOVE_ARRAY(dst, src, n);
> +-memmove
> ++MOVE_ARRAY
> +)
> +       (
> +        dst_e,
> +        src_e
> +-       , (n) * \( sizeof(T) \| sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \)
> ++       , n
> +       )
> +|
> ++ALLOC_ARRAY(
> +             dst_p_e
> +-                    = xmalloc((n) * \( sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \| sizeof(T) \))
> ++            , n)
>  )
> -
> -@@
> -type T;
> -T *ptr;
> -expression n;
> -@@
> -- ptr = xmalloc((n) * sizeof(*ptr));
> -+ ALLOC_ARRAY(ptr, n);
> -
> -@@
> -type T;
> -T *ptr;
> -expression n;
> -@@
> -- ptr = xmalloc((n) * sizeof(T));
> -+ ALLOC_ARRAY(ptr, n);
> --
> 2.24.0
>
Markus Elfring Nov. 20, 2019, 9:01 a.m. UTC | #2
> I like that COPY_ARRAY and MOVE_ARRAY are handled in the same rule,
> as they share the same parameters and do the same -- except that
> the latter handles overlaps, while the former may be a bit faster.
>
> And I like that it's short.

Thanks for such positive feedback after our growing discussion.


> I don't like that ALLOC_ARRAY is handled in the same rule, as it is
> quite different from the other two macros.

This case distinction can share a few metavariables with the other
transformation approach, can't it?


> Coccinelle needs significantly longer to apply the new version.

This can happen because of a more complete source code search pattern,
can't it?

The data processing can benefit from parallelisation (if desired.)
https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L745


> Here are times for master:

The SmPL script execution times can be analysed also directly with
the help of the Coccinelle software by profiling functionality.
https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L736


> ... and here with the patch applied:
>
> Benchmark #1: make contrib/coccinelle/array.cocci.patch
>   Time (mean ± σ):     43.420 s ±  0.490 s    [User: 43.087 s, System: 0.273 s]

I got an other distribution of run times on my test system.


> The current version checks if source and destination are of the same type,
> and whether the sizeof operand is either said type or an element of source
> or destination.

The specification of metavariables for pointer types has got some consequences.


> The new one does not.

I suggest to use a search for (pointer) expressions instead.
This approach can trigger other consequences then.


> So I don't see claim 4 ("Increase the precision") fulfilled,

I tried to express an adjustment on the change granularity by the plus
and minus characters at the beginning of the lines in the semantic patch.

The SmPL disjunctions provide also more common functionality now.


> quite the opposite rather.

The search for compatible pointers can become even more challenging.


> I think an automatic transformation should only be generated if it is safe.

Different expectations can occur around safety and change convenience.

Would you eventually work with SmPL script variants in parallel according
to different confidence settings?


> It's hard to spot a weird case in a generated patch amid ten
> well-behaving ones.

I can follow also this development concern to some degree.

Regards,
Markus
René Scharfe Nov. 21, 2019, 7:02 p.m. UTC | #3
Am 20.11.19 um 10:01 schrieb Markus Elfring:
>> I don't like that ALLOC_ARRAY is handled in the same rule, as it is
>> quite different from the other two macros.
>
> This case distinction can share a few metavariables with the other
> transformation approach, can't it?

Can it can, but should it?  In my opinion it should not; separate
concerns should get their own rules.  That's easier to manage for
developers.  I suspect it's also easier for Coccinelle to evaluate,
but didn't check.

>> Coccinelle needs significantly longer to apply the new version.
>
> This can happen because of a more complete source code search pattern,
> can't it?

Perhaps.

> The data processing can benefit from parallelisation (if desired.)
> https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L745

Right.  I use MAKEFLAGS += -j6, which runs six spatch instances in
parallel for the coccicheck make target of Git instead.

>> Here are times for master:
>
> The SmPL script execution times can be analysed also directly with
> the help of the Coccinelle software by profiling functionality.
> https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L736

OK, so --profile allows to analyze in which of its parts Coccinelle
spends the extra time.

>> The current version checks if source and destination are of the same type,
>> and whether the sizeof operand is either said type or an element of source
>> or destination.
>
> The specification of metavariables for pointer types has got some consequences.
>
>
>> The new one does not.
>
> I suggest to use a search for (pointer) expressions instead.
> This approach can trigger other consequences then.

Why don't we need to check the type?

>> So I don't see claim 4 ("Increase the precision") fulfilled,
>
> I tried to express an adjustment on the change granularity by the plus
> and minus characters at the beginning of the lines in the semantic patch.

Hmm, to me "precision" means to transform exactly those cases that are
intended to be transformed, i.e. to avoid false positives and negatives.
What you seem to mean here I'd rather describe as "reduce duplication".

> The SmPL disjunctions provide also more common functionality now.
>
>
>> quite the opposite rather.
>
> The search for compatible pointers can become even more challenging.

It's what we currently have, in an a clunky way.

>> I think an automatic transformation should only be generated if it is safe.
>
> Different expectations can occur around safety and change convenience.
>
> Would you eventually work with SmPL script variants in parallel according
> to different confidence settings?

Me?  No.  If I can't trust automatic transformations then I don't want
them.  I can already generate bugs fast enough manually, thank you
very much. :)

René
Markus Elfring Nov. 21, 2019, 7:44 p.m. UTC | #4
>> This case distinction can share a few metavariables with the other
>> transformation approach, can't it?
>
> Can it can, but should it?  In my opinion it should not;

I presented a software design in an other direction.
Some data processing approaches can benefit from sharing common information.


> separate concerns should get their own rules.

Strict separation triggers corresponding consequences.


> That's easier to manage for developers.

This view can be reasonable.


> I suspect it's also easier for Coccinelle to evaluate, but didn't check.

I find such an assumption questionable.


> I use MAKEFLAGS += -j6, which runs six spatch instances in
> parallel for the coccicheck make target of Git instead.

The program “spatch” supports parallelisation also directly by the parameter “--jobs”.
Did you try it out occasionally?


> OK, so --profile allows to analyze in which of its parts Coccinelle
> spends the extra time.

Some information about time distribution will be displayed.


>> I suggest to use a search for (pointer) expressions instead.
>> This approach can trigger other consequences then.
>
> Why don't we need to check the type?

I got the impression that we stumble on a general challenge for generic
source code searches.
How many efforts would we like to invest in solving type safety issues?


>> Would you eventually work with SmPL script variants in parallel according
>> to different confidence settings?
>
> Me?  No.

Such a view can be fine.

But I am also still trying to improve various implementation details
despite of known software limitations.


> If I can't trust automatic transformations then I don't want them.

I need to live with compromises together also with current development tools.


> I can already generate bugs fast enough manually, thank you very much. :)

This is usual.

I hope that specific tools can make our lives occasionally a bit easier.

Regards,
Markus
Junio C Hamano Nov. 22, 2019, 5:54 a.m. UTC | #5
René Scharfe <l.s.r@web.de> writes:

> The current version checks if source and destination are of the same type,
> and whether the sizeof operand is either said type or an element of source
> or destination.  The new one does not.  So I don't see claim 4 ("Increase
> the precision") fulfilled, quite the opposite rather.  It can produce e.g.
> a transformation like this:
>
>  void f(int *dst, char *src, size_t n)
>  {
> -	memcpy(dst, src, n * sizeof(short));
> +	COPY_ARRAY(dst, src, n);
>  }
>
> The COPY_ARRAY there effectively expands to:
>
> 	memcpy(dst, src, n * sizeof(*dst));
>
> ... which is quite different -- if short is 2 bytes wide and int 4 bytes
> then we copy twice as many bytes as before.
>
> I think an automatic transformation should only be generated if it is
> safe.  It's hard to spot a weird case in a generated patch amid ten
> well-behaving ones.

Nicely said; I agree 100% with you that the priority of this project
is to use these *.cocci transformations in such a way that they are
absolutely safe---so that humans do not have to spend time sifting
the result through to find accidental bad transformations.

And thanks for taking time to very clearly explain why the proposed
rewrite is not something we want to take.
Markus Elfring Nov. 22, 2019, 7:34 a.m. UTC | #6
> Nicely said; I agree 100% with you that the priority of this project
> is to use these *.cocci transformations in such a way that they are
> absolutely safe

Such a goal can be generally desirable.

But I got the impression that there are target conflicts to consider
for the currently discussed SmPL script.
The available transformation approaches show different open issues,
don't they?

Your desire is easier to fulfil for other change patterns.


> ---so that humans do not have to spend time sifting the result through
> to find accidental bad transformations.

Automatic source code analysis contains the usual risk for false positives.
How many efforts would we like to invest in improving corresponding
software solutions?


> And thanks for taking time to very clearly explain why the proposed
> rewrite is not something we want to take.

Would you like to check once more if additional update candidates
will be found in the source files with the presented SmPL script variant?

Regards,
Markus
SZEDER Gábor Nov. 22, 2019, 3:29 p.m. UTC | #7
On Thu, Nov 21, 2019 at 08:44:12PM +0100, Markus Elfring wrote:
> The program “spatch” supports parallelisation also directly by the parameter “--jobs”.
> Did you try it out occasionally?

I did try --jobs on a couple of occasions, and the results always
varied between broken, not working, or downright making things even
slower.


  $ spatch --version
  spatch version 1.0.4 with Python support and with PCRE support
  $ spatch --sp-file contrib/coccinelle/array.cocci --all-includes --patch . --jobs 2 alias.c alloc.c
  init_defs_builtins: /usr/lib/coccinelle/standard.h
  HANDLING: alias.c alloc.c
  Fatal error: exception Sys_error("array: No such file or directory")

This issue seems to be fixed in later versions, but this is the
version what many distros still ship and what is used in our CI
builds, so we do care about 1.0.4.


  $ spatch --version
  spatch version 1.0.8 compiled with OCaml version 4.05.0
  Flags passed to the configure script: [none]
  OCaml scripting support: yes
  Python scripting support: yes
  Syntax of regular expressions: PCRE
  $ /usr/bin/time --format='%e | %M' make contrib/coccinelle/array.cocci.patch
      SPATCH contrib/coccinelle/array.cocci
  102.06 | 129084

Our Makefile recipes run Coccinelle in a sequential loop, one 'spatch'
invocation for each source file by default.  Therefore, merely passing
in '--jobs <N>' doesn't bring any runtime benefits:

  $ /usr/bin/time --format='%e | %M' make SPATCH_FLAGS='--all-includes --patch . --jobs 8' contrib/coccinelle/array.cocci.patch
      SPATCH contrib/coccinelle/array.cocci
  105.31 | 118512

Some time ago we found that invoking 'spatch' with multiple files at
once does bring notable speedup (with 1.0.4), although at the cost of
drastically increased memory footprint, see commit 960154b9c1
(coccicheck: optionally batch spatch invocations, 2019-05-06).  Alas,
trying to use that in the hope that 'spatch' can do more in parallel
if it has more files to process at once doesn't bring any runtime
benefits, either:

  $ /usr/bin/time --format='%e | %M' make SPATCH_FLAGS='--all-includes --patch . --jobs 8' SPATCH_BATCH_SIZE=8 contrib/coccinelle/array.cocci.patch
      SPATCH contrib/coccinelle/array.cocci
  116.27 | 349964

And by further increasing the batch size it just gets notably slower;
also note the order of magnitude higher max memory usage:

  $ /usr/bin/time --format='%e | %M' make SPATCH_FLAGS='--all-includes --patch . --jobs 8' SPATCH_BATCH_SIZE=32 contrib/coccinelle/array.cocci.patch
      SPATCH contrib/coccinelle/array.cocci
  197.70 | 1205784

It appears that batching 'spatch' invocations with 1.0.8 does not
bring the same benefits as with 1.0.4, but brings slowdowns instead...

Anyway, looking at 'ps u -L' output it appears that 'spatch' doesn't
really do any parallel work, and there are only two 'spatch' processes
and no threads despite '--jobs 8':

  szeder    2561  0.4  0.5  36944 21520 pts/0    S+   15:31   0:00 spatch
  szeder    2567 97.1 30.5 1228372 1205332 pts/0 R+   15:31   0:29 spatch


Note that 1.0.8 above was run in a Docker container, while 1.0.4 on
the host.  This may or may not have influenced the runtimes reported
above.  FWIW, 'make -j4 coccicheck' parallelizes just fine even in the
container and with 1.0.8.


A different approach relying on 'make -j' to parallelize 'spatch'
invocations was discussed here:

  https://public-inbox.org/git/20180802115522.16107-1-szeder.dev@gmail.com/T/#u
Markus Elfring Nov. 22, 2019, 4:17 p.m. UTC | #8
> I did try --jobs on a couple of occasions,

Thanks for your feedback.


> and the results always varied between broken, not working,

The parallelisation support by the Coccinelle software was questionable
for a while.


> or downright making things even slower.

I wonder about this information.


…
>   spatch version 1.0.4 with Python support and with PCRE support>   Fatal error: exception Sys_error("array: No such file or directory")

Another bit of background information:
* https://lore.kernel.org/cocci/99082e9d-8047-eee3-68dd-9849868d4a96@users.sourceforge.net/
  https://systeme.lip6.fr/pipermail/cocci/2016-August/003546.html

* https://lore.kernel.org/cocci/alpine.DEB.2.10.1506171707190.2578@hadrien/
  https://systeme.lip6.fr/pipermail/cocci/2015-June/002141.html


…
> Therefore, merely passing in '--jobs <N>' doesn't bring any runtime benefits:

This program parameter should be used for other command variants.

…
> It appears that batching 'spatch' invocations with 1.0.8 does not
> bring the same benefits as with 1.0.4, but brings slowdowns instead...

Would you like to share your experiences also on the Coccinelle mailing list?

Regards,
Markus

Patch
diff mbox series

diff --git a/contrib/coccinelle/array.cocci b/contrib/coccinelle/array.cocci
index 46b8d2ee11..bcd6ff4793 100644
--- a/contrib/coccinelle/array.cocci
+++ b/contrib/coccinelle/array.cocci
@@ -1,90 +1,26 @@ 
-@@
-expression dst, src, n, E;
-@@
-  memcpy(dst, src, n * sizeof(
-- E[...]
-+ *(E)
-  ))
-
 @@
 type T;
-T *ptr;
-T[] arr;
-expression E, n;
-@@
-(
-  memcpy(ptr, E,
-- n * sizeof(*(ptr))
-+ n * sizeof(T)
-  )
-|
-  memcpy(arr, E,
-- n * sizeof(*(arr))
-+ n * sizeof(T)
-  )
-|
-  memcpy(E, ptr,
-- n * sizeof(*(ptr))
-+ n * sizeof(T)
-  )
-|
-  memcpy(E, arr,
-- n * sizeof(*(arr))
-+ n * sizeof(T)
-  )
-)
-
-@@
-type T;
-T *dst_ptr;
-T *src_ptr;
-T[] dst_arr;
 T[] src_arr;
-expression n;
+expression n, dst_e, src_e;
+expression* dst_p_e, src_p_e;
 @@
 (
-- memcpy(dst_ptr, src_ptr, (n) * sizeof(T))
-+ COPY_ARRAY(dst_ptr, src_ptr, n)
-|
-- memcpy(dst_ptr, src_arr, (n) * sizeof(T))
-+ COPY_ARRAY(dst_ptr, src_arr, n)
-|
-- memcpy(dst_arr, src_ptr, (n) * sizeof(T))
-+ COPY_ARRAY(dst_arr, src_ptr, n)
-|
-- memcpy(dst_arr, src_arr, (n) * sizeof(T))
-+ COPY_ARRAY(dst_arr, src_arr, n)
-)
-
-@@
-type T;
-T *dst;
-T *src;
-expression n;
-@@
 (
-- memmove(dst, src, (n) * sizeof(*dst));
-+ MOVE_ARRAY(dst, src, n);
-|
-- memmove(dst, src, (n) * sizeof(*src));
-+ MOVE_ARRAY(dst, src, n);
+-memcpy
++COPY_ARRAY
 |
-- memmove(dst, src, (n) * sizeof(T));
-+ MOVE_ARRAY(dst, src, n);
+-memmove
++MOVE_ARRAY
+)
+       (
+        dst_e,
+        src_e
+-       , (n) * \( sizeof(T) \| sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \)
++       , n
+       )
+|
++ALLOC_ARRAY(
+             dst_p_e
+-                    = xmalloc((n) * \( sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \| sizeof(T) \))
++            , n)
 )
-
-@@
-type T;
-T *ptr;
-expression n;
-@@
-- ptr = xmalloc((n) * sizeof(*ptr));
-+ ALLOC_ARRAY(ptr, n);
-
-@@
-type T;
-T *ptr;
-expression n;
-@@
-- ptr = xmalloc((n) * sizeof(T));
-+ ALLOC_ARRAY(ptr, n);