Message ID | 0d9cf772-268d-bd00-1cbb-0bbbec9dfc9a@web.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | coccinelle: improve array.cocci | expand |
Am 18.11.19 um 17:10 schrieb Markus Elfring: > From: Markus Elfring <elfring@users.sourceforge.net> > Date: Mon, 18 Nov 2019 17:00:37 +0100 > > This script contained transformation rules for the semantic patch language > which used similar code. > > 1. Delete two SmPL rules which were used to transform source code fragments > (pointer expressions) so that the search pattern “sizeof(T)” would work > in the third rule. > See also the topic “coccinelle: adjustments for array.cocci?”: > https://public-inbox.org/git/f28f5fb8-2814-9df5-faf2-7146ed1a1f4d@web.de/ > > 2. Combine the remaining rules by using six SmPL disjunctions. > > 3. Adjust case distinctions and corresponding metavariables so that > the desired search for update candidates can be more complete. > > 4. Increase the precision for the specification of required changes. > > Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> > --- > contrib/coccinelle/array.cocci | 100 ++++++--------------------------- > 1 file changed, 18 insertions(+), 82 deletions(-) The diff is hard to read, so here's the resulting semantic patch: -- start -- @@ type T; T[] src_arr; expression n, dst_e, src_e; expression* dst_p_e, src_p_e; @@ ( ( -memcpy +COPY_ARRAY | -memmove +MOVE_ARRAY ) ( dst_e, src_e - , (n) * \( sizeof(T) \| sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \) + , n ) | +ALLOC_ARRAY( dst_p_e - = xmalloc((n) * \( sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \| sizeof(T) \)) + , n) ) -- end -- I like that COPY_ARRAY and MOVE_ARRAY are handled in the same rule, as they share the same parameters and do the same -- except that the latter handles overlaps, while the former may be a bit faster. And I like that it's short. I don't like that ALLOC_ARRAY is handled in the same rule, as it is quite different from the other two macros. Coccinelle needs significantly longer to apply the new version. Here are times for master: Benchmark #1: make contrib/coccinelle/array.cocci.patch Time (mean ± σ): 19.314 s ± 0.200 s [User: 19.065 s, System: 0.224 s] Range (min … max): 19.009 s … 19.718 s 10 runs ... and here with the patch applied: Benchmark #1: make contrib/coccinelle/array.cocci.patch Time (mean ± σ): 43.420 s ± 0.490 s [User: 43.087 s, System: 0.273 s] Range (min … max): 42.636 s … 44.359 s 10 runs The current version checks if source and destination are of the same type, and whether the sizeof operand is either said type or an element of source or destination. The new one does not. So I don't see claim 4 ("Increase the precision") fulfilled, quite the opposite rather. It can produce e.g. a transformation like this: void f(int *dst, char *src, size_t n) { - memcpy(dst, src, n * sizeof(short)); + COPY_ARRAY(dst, src, n); } The COPY_ARRAY there effectively expands to: memcpy(dst, src, n * sizeof(*dst)); ... which is quite different -- if short is 2 bytes wide and int 4 bytes then we copy twice as many bytes as before. I think an automatic transformation should only be generated if it is safe. It's hard to spot a weird case in a generated patch amid ten well-behaving ones. > > diff --git a/contrib/coccinelle/array.cocci b/contrib/coccinelle/array.cocci > index 46b8d2ee11..bcd6ff4793 100644 > --- a/contrib/coccinelle/array.cocci > +++ b/contrib/coccinelle/array.cocci > @@ -1,90 +1,26 @@ > -@@ > -expression dst, src, n, E; > -@@ > - memcpy(dst, src, n * sizeof( > -- E[...] > -+ *(E) > - )) > - > @@ > type T; > -T *ptr; > -T[] arr; > -expression E, n; > -@@ > -( > - memcpy(ptr, E, > -- n * sizeof(*(ptr)) > -+ n * sizeof(T) > - ) > -| > - memcpy(arr, E, > -- n * sizeof(*(arr)) > -+ n * sizeof(T) > - ) > -| > - memcpy(E, ptr, > -- n * sizeof(*(ptr)) > -+ n * sizeof(T) > - ) > -| > - memcpy(E, arr, > -- n * sizeof(*(arr)) > -+ n * sizeof(T) > - ) > -) > - > -@@ > -type T; > -T *dst_ptr; > -T *src_ptr; > -T[] dst_arr; > T[] src_arr; > -expression n; > +expression n, dst_e, src_e; > +expression* dst_p_e, src_p_e; > @@ > ( > -- memcpy(dst_ptr, src_ptr, (n) * sizeof(T)) > -+ COPY_ARRAY(dst_ptr, src_ptr, n) > -| > -- memcpy(dst_ptr, src_arr, (n) * sizeof(T)) > -+ COPY_ARRAY(dst_ptr, src_arr, n) > -| > -- memcpy(dst_arr, src_ptr, (n) * sizeof(T)) > -+ COPY_ARRAY(dst_arr, src_ptr, n) > -| > -- memcpy(dst_arr, src_arr, (n) * sizeof(T)) > -+ COPY_ARRAY(dst_arr, src_arr, n) > -) > - > -@@ > -type T; > -T *dst; > -T *src; > -expression n; > -@@ > ( > -- memmove(dst, src, (n) * sizeof(*dst)); > -+ MOVE_ARRAY(dst, src, n); > -| > -- memmove(dst, src, (n) * sizeof(*src)); > -+ MOVE_ARRAY(dst, src, n); > +-memcpy > ++COPY_ARRAY > | > -- memmove(dst, src, (n) * sizeof(T)); > -+ MOVE_ARRAY(dst, src, n); > +-memmove > ++MOVE_ARRAY > +) > + ( > + dst_e, > + src_e > +- , (n) * \( sizeof(T) \| sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \) > ++ , n > + ) > +| > ++ALLOC_ARRAY( > + dst_p_e > +- = xmalloc((n) * \( sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \| sizeof(T) \)) > ++ , n) > ) > - > -@@ > -type T; > -T *ptr; > -expression n; > -@@ > -- ptr = xmalloc((n) * sizeof(*ptr)); > -+ ALLOC_ARRAY(ptr, n); > - > -@@ > -type T; > -T *ptr; > -expression n; > -@@ > -- ptr = xmalloc((n) * sizeof(T)); > -+ ALLOC_ARRAY(ptr, n); > -- > 2.24.0 >
> I like that COPY_ARRAY and MOVE_ARRAY are handled in the same rule, > as they share the same parameters and do the same -- except that > the latter handles overlaps, while the former may be a bit faster. > > And I like that it's short. Thanks for such positive feedback after our growing discussion. > I don't like that ALLOC_ARRAY is handled in the same rule, as it is > quite different from the other two macros. This case distinction can share a few metavariables with the other transformation approach, can't it? > Coccinelle needs significantly longer to apply the new version. This can happen because of a more complete source code search pattern, can't it? The data processing can benefit from parallelisation (if desired.) https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L745 > Here are times for master: The SmPL script execution times can be analysed also directly with the help of the Coccinelle software by profiling functionality. https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L736 > ... and here with the patch applied: > > Benchmark #1: make contrib/coccinelle/array.cocci.patch > Time (mean ± σ): 43.420 s ± 0.490 s [User: 43.087 s, System: 0.273 s] I got an other distribution of run times on my test system. > The current version checks if source and destination are of the same type, > and whether the sizeof operand is either said type or an element of source > or destination. The specification of metavariables for pointer types has got some consequences. > The new one does not. I suggest to use a search for (pointer) expressions instead. This approach can trigger other consequences then. > So I don't see claim 4 ("Increase the precision") fulfilled, I tried to express an adjustment on the change granularity by the plus and minus characters at the beginning of the lines in the semantic patch. The SmPL disjunctions provide also more common functionality now. > quite the opposite rather. The search for compatible pointers can become even more challenging. > I think an automatic transformation should only be generated if it is safe. Different expectations can occur around safety and change convenience. Would you eventually work with SmPL script variants in parallel according to different confidence settings? > It's hard to spot a weird case in a generated patch amid ten > well-behaving ones. I can follow also this development concern to some degree. Regards, Markus
Am 20.11.19 um 10:01 schrieb Markus Elfring: >> I don't like that ALLOC_ARRAY is handled in the same rule, as it is >> quite different from the other two macros. > > This case distinction can share a few metavariables with the other > transformation approach, can't it? Can it can, but should it? In my opinion it should not; separate concerns should get their own rules. That's easier to manage for developers. I suspect it's also easier for Coccinelle to evaluate, but didn't check. >> Coccinelle needs significantly longer to apply the new version. > > This can happen because of a more complete source code search pattern, > can't it? Perhaps. > The data processing can benefit from parallelisation (if desired.) > https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L745 Right. I use MAKEFLAGS += -j6, which runs six spatch instances in parallel for the coccicheck make target of Git instead. >> Here are times for master: > > The SmPL script execution times can be analysed also directly with > the help of the Coccinelle software by profiling functionality. > https://github.com/coccinelle/coccinelle/blob/66a1118e04a6aaf1acdae89623313c8e05158a8d/docs/manual/spatch_options.tex#L736 OK, so --profile allows to analyze in which of its parts Coccinelle spends the extra time. >> The current version checks if source and destination are of the same type, >> and whether the sizeof operand is either said type or an element of source >> or destination. > > The specification of metavariables for pointer types has got some consequences. > > >> The new one does not. > > I suggest to use a search for (pointer) expressions instead. > This approach can trigger other consequences then. Why don't we need to check the type? >> So I don't see claim 4 ("Increase the precision") fulfilled, > > I tried to express an adjustment on the change granularity by the plus > and minus characters at the beginning of the lines in the semantic patch. Hmm, to me "precision" means to transform exactly those cases that are intended to be transformed, i.e. to avoid false positives and negatives. What you seem to mean here I'd rather describe as "reduce duplication". > The SmPL disjunctions provide also more common functionality now. > > >> quite the opposite rather. > > The search for compatible pointers can become even more challenging. It's what we currently have, in an a clunky way. >> I think an automatic transformation should only be generated if it is safe. > > Different expectations can occur around safety and change convenience. > > Would you eventually work with SmPL script variants in parallel according > to different confidence settings? Me? No. If I can't trust automatic transformations then I don't want them. I can already generate bugs fast enough manually, thank you very much. :) René
>> This case distinction can share a few metavariables with the other >> transformation approach, can't it? > > Can it can, but should it? In my opinion it should not; I presented a software design in an other direction. Some data processing approaches can benefit from sharing common information. > separate concerns should get their own rules. Strict separation triggers corresponding consequences. > That's easier to manage for developers. This view can be reasonable. > I suspect it's also easier for Coccinelle to evaluate, but didn't check. I find such an assumption questionable. > I use MAKEFLAGS += -j6, which runs six spatch instances in > parallel for the coccicheck make target of Git instead. The program “spatch” supports parallelisation also directly by the parameter “--jobs”. Did you try it out occasionally? > OK, so --profile allows to analyze in which of its parts Coccinelle > spends the extra time. Some information about time distribution will be displayed. >> I suggest to use a search for (pointer) expressions instead. >> This approach can trigger other consequences then. > > Why don't we need to check the type? I got the impression that we stumble on a general challenge for generic source code searches. How many efforts would we like to invest in solving type safety issues? >> Would you eventually work with SmPL script variants in parallel according >> to different confidence settings? > > Me? No. Such a view can be fine. But I am also still trying to improve various implementation details despite of known software limitations. > If I can't trust automatic transformations then I don't want them. I need to live with compromises together also with current development tools. > I can already generate bugs fast enough manually, thank you very much. :) This is usual. I hope that specific tools can make our lives occasionally a bit easier. Regards, Markus
René Scharfe <l.s.r@web.de> writes: > The current version checks if source and destination are of the same type, > and whether the sizeof operand is either said type or an element of source > or destination. The new one does not. So I don't see claim 4 ("Increase > the precision") fulfilled, quite the opposite rather. It can produce e.g. > a transformation like this: > > void f(int *dst, char *src, size_t n) > { > - memcpy(dst, src, n * sizeof(short)); > + COPY_ARRAY(dst, src, n); > } > > The COPY_ARRAY there effectively expands to: > > memcpy(dst, src, n * sizeof(*dst)); > > ... which is quite different -- if short is 2 bytes wide and int 4 bytes > then we copy twice as many bytes as before. > > I think an automatic transformation should only be generated if it is > safe. It's hard to spot a weird case in a generated patch amid ten > well-behaving ones. Nicely said; I agree 100% with you that the priority of this project is to use these *.cocci transformations in such a way that they are absolutely safe---so that humans do not have to spend time sifting the result through to find accidental bad transformations. And thanks for taking time to very clearly explain why the proposed rewrite is not something we want to take.
> Nicely said; I agree 100% with you that the priority of this project > is to use these *.cocci transformations in such a way that they are > absolutely safe Such a goal can be generally desirable. But I got the impression that there are target conflicts to consider for the currently discussed SmPL script. The available transformation approaches show different open issues, don't they? Your desire is easier to fulfil for other change patterns. > ---so that humans do not have to spend time sifting the result through > to find accidental bad transformations. Automatic source code analysis contains the usual risk for false positives. How many efforts would we like to invest in improving corresponding software solutions? > And thanks for taking time to very clearly explain why the proposed > rewrite is not something we want to take. Would you like to check once more if additional update candidates will be found in the source files with the presented SmPL script variant? Regards, Markus
On Thu, Nov 21, 2019 at 08:44:12PM +0100, Markus Elfring wrote: > The program “spatch” supports parallelisation also directly by the parameter “--jobs”. > Did you try it out occasionally? I did try --jobs on a couple of occasions, and the results always varied between broken, not working, or downright making things even slower. $ spatch --version spatch version 1.0.4 with Python support and with PCRE support $ spatch --sp-file contrib/coccinelle/array.cocci --all-includes --patch . --jobs 2 alias.c alloc.c init_defs_builtins: /usr/lib/coccinelle/standard.h HANDLING: alias.c alloc.c Fatal error: exception Sys_error("array: No such file or directory") This issue seems to be fixed in later versions, but this is the version what many distros still ship and what is used in our CI builds, so we do care about 1.0.4. $ spatch --version spatch version 1.0.8 compiled with OCaml version 4.05.0 Flags passed to the configure script: [none] OCaml scripting support: yes Python scripting support: yes Syntax of regular expressions: PCRE $ /usr/bin/time --format='%e | %M' make contrib/coccinelle/array.cocci.patch SPATCH contrib/coccinelle/array.cocci 102.06 | 129084 Our Makefile recipes run Coccinelle in a sequential loop, one 'spatch' invocation for each source file by default. Therefore, merely passing in '--jobs <N>' doesn't bring any runtime benefits: $ /usr/bin/time --format='%e | %M' make SPATCH_FLAGS='--all-includes --patch . --jobs 8' contrib/coccinelle/array.cocci.patch SPATCH contrib/coccinelle/array.cocci 105.31 | 118512 Some time ago we found that invoking 'spatch' with multiple files at once does bring notable speedup (with 1.0.4), although at the cost of drastically increased memory footprint, see commit 960154b9c1 (coccicheck: optionally batch spatch invocations, 2019-05-06). Alas, trying to use that in the hope that 'spatch' can do more in parallel if it has more files to process at once doesn't bring any runtime benefits, either: $ /usr/bin/time --format='%e | %M' make SPATCH_FLAGS='--all-includes --patch . --jobs 8' SPATCH_BATCH_SIZE=8 contrib/coccinelle/array.cocci.patch SPATCH contrib/coccinelle/array.cocci 116.27 | 349964 And by further increasing the batch size it just gets notably slower; also note the order of magnitude higher max memory usage: $ /usr/bin/time --format='%e | %M' make SPATCH_FLAGS='--all-includes --patch . --jobs 8' SPATCH_BATCH_SIZE=32 contrib/coccinelle/array.cocci.patch SPATCH contrib/coccinelle/array.cocci 197.70 | 1205784 It appears that batching 'spatch' invocations with 1.0.8 does not bring the same benefits as with 1.0.4, but brings slowdowns instead... Anyway, looking at 'ps u -L' output it appears that 'spatch' doesn't really do any parallel work, and there are only two 'spatch' processes and no threads despite '--jobs 8': szeder 2561 0.4 0.5 36944 21520 pts/0 S+ 15:31 0:00 spatch szeder 2567 97.1 30.5 1228372 1205332 pts/0 R+ 15:31 0:29 spatch Note that 1.0.8 above was run in a Docker container, while 1.0.4 on the host. This may or may not have influenced the runtimes reported above. FWIW, 'make -j4 coccicheck' parallelizes just fine even in the container and with 1.0.8. A different approach relying on 'make -j' to parallelize 'spatch' invocations was discussed here: https://public-inbox.org/git/20180802115522.16107-1-szeder.dev@gmail.com/T/#u
> I did try --jobs on a couple of occasions, Thanks for your feedback. > and the results always varied between broken, not working, The parallelisation support by the Coccinelle software was questionable for a while. > or downright making things even slower. I wonder about this information. … > spatch version 1.0.4 with Python support and with PCRE support … > Fatal error: exception Sys_error("array: No such file or directory") Another bit of background information: * https://lore.kernel.org/cocci/99082e9d-8047-eee3-68dd-9849868d4a96@users.sourceforge.net/ https://systeme.lip6.fr/pipermail/cocci/2016-August/003546.html * https://lore.kernel.org/cocci/alpine.DEB.2.10.1506171707190.2578@hadrien/ https://systeme.lip6.fr/pipermail/cocci/2015-June/002141.html … > Therefore, merely passing in '--jobs <N>' doesn't bring any runtime benefits: This program parameter should be used for other command variants. … > It appears that batching 'spatch' invocations with 1.0.8 does not > bring the same benefits as with 1.0.4, but brings slowdowns instead... Would you like to share your experiences also on the Coccinelle mailing list? Regards, Markus
> This script contained transformation rules for the semantic patch language > which used similar code. How do you think about to adjust any implementation details according to the shown proposal? Would you like to continue corresponding code review? Regards, Markus
diff --git a/contrib/coccinelle/array.cocci b/contrib/coccinelle/array.cocci index 46b8d2ee11..bcd6ff4793 100644 --- a/contrib/coccinelle/array.cocci +++ b/contrib/coccinelle/array.cocci @@ -1,90 +1,26 @@ -@@ -expression dst, src, n, E; -@@ - memcpy(dst, src, n * sizeof( -- E[...] -+ *(E) - )) - @@ type T; -T *ptr; -T[] arr; -expression E, n; -@@ -( - memcpy(ptr, E, -- n * sizeof(*(ptr)) -+ n * sizeof(T) - ) -| - memcpy(arr, E, -- n * sizeof(*(arr)) -+ n * sizeof(T) - ) -| - memcpy(E, ptr, -- n * sizeof(*(ptr)) -+ n * sizeof(T) - ) -| - memcpy(E, arr, -- n * sizeof(*(arr)) -+ n * sizeof(T) - ) -) - -@@ -type T; -T *dst_ptr; -T *src_ptr; -T[] dst_arr; T[] src_arr; -expression n; +expression n, dst_e, src_e; +expression* dst_p_e, src_p_e; @@ ( -- memcpy(dst_ptr, src_ptr, (n) * sizeof(T)) -+ COPY_ARRAY(dst_ptr, src_ptr, n) -| -- memcpy(dst_ptr, src_arr, (n) * sizeof(T)) -+ COPY_ARRAY(dst_ptr, src_arr, n) -| -- memcpy(dst_arr, src_ptr, (n) * sizeof(T)) -+ COPY_ARRAY(dst_arr, src_ptr, n) -| -- memcpy(dst_arr, src_arr, (n) * sizeof(T)) -+ COPY_ARRAY(dst_arr, src_arr, n) -) - -@@ -type T; -T *dst; -T *src; -expression n; -@@ ( -- memmove(dst, src, (n) * sizeof(*dst)); -+ MOVE_ARRAY(dst, src, n); -| -- memmove(dst, src, (n) * sizeof(*src)); -+ MOVE_ARRAY(dst, src, n); +-memcpy ++COPY_ARRAY | -- memmove(dst, src, (n) * sizeof(T)); -+ MOVE_ARRAY(dst, src, n); +-memmove ++MOVE_ARRAY +) + ( + dst_e, + src_e +- , (n) * \( sizeof(T) \| sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \) ++ , n + ) +| ++ALLOC_ARRAY( + dst_p_e +- = xmalloc((n) * \( sizeof( \( *(src_p_e) \| src_e[...] \| src_arr \) ) \| sizeof(T) \)) ++ , n) ) - -@@ -type T; -T *ptr; -expression n; -@@ -- ptr = xmalloc((n) * sizeof(*ptr)); -+ ALLOC_ARRAY(ptr, n); - -@@ -type T; -T *ptr; -expression n; -@@ -- ptr = xmalloc((n) * sizeof(T)); -+ ALLOC_ARRAY(ptr, n);