mbox series

[0/11] renaming argv_array

Message ID 20200728202124.GA1021264@coredump.intra.peff.net (mailing list archive)
Headers show
Series renaming argv_array | expand

Message

Jeff King July 28, 2020, 8:21 p.m. UTC
The argv_array data type has turned out to be useful in our code base,
but the name isn't very good. From patch 2 of this series:

  The name "argv-array" isn't very good, because it describes what the
  data type can be used for (program argument arrays), not what it
  actually is (a dynamically-growing string array that maintains a
  NULL-terminator invariant). This leads to people being hesitant to use
  it for other cases where it would actually be a good fit. The existing
  name is also clunky to use. It's overly long, and the name often leads
  to saying things like "argv.argv" (i.e., the field names overlap with
  variable names, since they're describing the use, not the type). Let's
  give it a more neutral name.

This has bugged me for a while, so I decided to finally fix it. It
wasn't _too_ painful, though I'm sure there will be a little fallout
with topics in flight.

I tried to split out the mechanical bits into their own patches to make
reviewing easier. Patches 5-7 really could be a single patch, but
they're too big for the mailing list. I'm OK to leave them separate, or
they could be squashed together.

We could stop at patch 9 for now and allow topics in flight to catch up
before removing the compat layers. But the struct field renaming has to
happen as a single step, so it will be a pain whenever we do it. If
we're going to go this route, I'd just as soon do it all now and deal
with other topics as they get merged.

  [01/11]: argv-array: use size_t for count and alloc
  [02/11]: argv-array: rename to strvec
  [03/11]: strvec: rename files from argv-array to strvec
  [04/11]: quote: rename sq_dequote_to_argv_array to mention strvec
  [05/11]: strvec: convert builtin/ callers away from argv_array name
  [06/11]: strvec: convert more callers away from argv_array name
  [07/11]: strvec: convert remaining callers away from argv_array name
  [08/11]: strvec: fix indentation in renamed calls
  [09/11]: strvec: update documention to avoid argv_array
  [10/11]: strvec: drop argv_array compatibility layer
  [11/11]: strvec: rename struct fields

 Documentation/technical/api-parse-options.txt |   4 +-
 Makefile                                      |   2 +-
 add-interactive.c                             |  28 +--
 add-patch.c                                   |  48 ++--
 argv-array.c                                  | 109 ---------
 bisect.c                                      |  20 +-
 builtin/add.c                                 |  18 +-
 builtin/am.c                                  |  80 +++----
 builtin/annotate.c                            |  10 +-
 builtin/bisect--helper.c                      |  20 +-
 builtin/bundle.c                              |  14 +-
 builtin/clone.c                               |  38 +--
 builtin/commit.c                              |   8 +-
 builtin/describe.c                            |  44 ++--
 builtin/difftool.c                            |  30 +--
 builtin/fetch.c                               |  64 ++---
 builtin/gc.c                                  |  78 +++----
 builtin/grep.c                                |   2 +-
 builtin/log.c                                 |  12 +-
 builtin/ls-remote.c                           |   6 +-
 builtin/pack-objects.c                        |  26 +--
 builtin/pull.c                                | 160 ++++++-------
 builtin/range-diff.c                          |   4 +-
 builtin/rebase.c                              |  90 ++++----
 builtin/receive-pack.c                        | 126 +++++-----
 builtin/remote-ext.c                          |   4 +-
 builtin/remote.c                              |  26 +--
 builtin/repack.c                              |  72 +++---
 builtin/replace.c                             |  18 +-
 builtin/show-branch.c                         |  16 +-
 builtin/stash.c                               | 162 ++++++-------
 builtin/submodule--helper.c                   | 144 ++++++------
 builtin/update-ref.c                          |   2 +-
 builtin/upload-archive.c                      |  12 +-
 builtin/worktree.c                            |  68 +++---
 bundle.c                                      |  24 +-
 bundle.h                                      |   4 +-
 column.c                                      |  12 +-
 commit.c                                      |  10 +-
 compat/mingw.c                                |   4 +-
 compat/terminal.c                             |  18 +-
 connect.c                                     |  69 +++---
 connected.c                                   |  24 +-
 daemon.c                                      |  60 ++---
 diff.c                                        |  32 +--
 environment.c                                 |  12 +-
 exec-cmd.c                                    |  18 +-
 exec-cmd.h                                    |   4 +-
 fast-import.c                                 |   4 +-
 fetch-pack.c                                  |  46 ++--
 fsmonitor.c                                   |   6 +-
 git.c                                         |  32 +--
 gpg-interface.c                               |  22 +-
 graph.c                                       |  16 +-
 http-backend.c                                |   8 +-
 http-push.c                                   |  18 +-
 http.c                                        |   8 +-
 imap-send.c                                   |   2 +-
 line-log.c                                    |   8 +-
 list-objects-filter-options.c                 |   2 +-
 ls-refs.c                                     |  18 +-
 ls-refs.h                                     |   4 +-
 merge.c                                       |  18 +-
 midx.c                                        |  12 +-
 pager.c                                       |   8 +-
 parse-options-cb.c                            |   8 +-
 pathspec.c                                    |  10 +-
 quote.c                                       |   8 +-
 quote.h                                       |   8 +-
 range-diff.c                                  |  40 ++--
 range-diff.h                                  |   4 +-
 ref-filter.c                                  |  12 +-
 refs.c                                        |   8 +-
 refs.h                                        |   4 +-
 refspec.c                                     |  10 +-
 refspec.h                                     |   4 +-
 remote-curl.c                                 | 102 ++++----
 remote-testsvn.c                              |  10 +-
 remote.c                                      |  20 +-
 remote.h                                      |   4 +-
 revision.c                                    |  20 +-
 run-command.c                                 |  72 +++---
 run-command.h                                 |  12 +-
 send-pack.c                                   |  18 +-
 sequencer.c                                   | 126 +++++-----
 serve.c                                       |  20 +-
 serve.h                                       |   4 +-
 sha1-file.c                                   |  14 +-
 strvec.c                                      | 109 +++++++++
 argv-array.h => strvec.h                      |  58 ++---
 sub-process.c                                 |   2 +-
 submodule.c                                   | 218 +++++++++---------
 submodule.h                                   |   6 +-
 t/helper/test-run-command.c                   |  52 ++---
 t/helper/test-trace2.c                        |   2 +-
 tmp-objdir.c                                  |  20 +-
 transport-helper.c                            |  36 +--
 transport-internal.h                          |   4 +-
 transport.c                                   |  12 +-
 transport.h                                   |   2 +-
 unpack-trees.c                                |  12 +-
 unpack-trees.h                                |   4 +-
 upload-pack.c                                 |  53 +++--
 upload-pack.h                                 |   4 +-
 wt-status.c                                   |  19 +-
 105 files changed, 1619 insertions(+), 1620 deletions(-)
 delete mode 100644 argv-array.c
 create mode 100644 strvec.c
 rename argv-array.h => strvec.h (50%)

Comments

Jacob Keller July 28, 2020, 10:45 p.m. UTC | #1
On Tue, Jul 28, 2020 at 1:25 PM Jeff King <peff@peff.net> wrote:
>
> The argv_array data type has turned out to be useful in our code base,
> but the name isn't very good. From patch 2 of this series:
>
>   The name "argv-array" isn't very good, because it describes what the
>   data type can be used for (program argument arrays), not what it
>   actually is (a dynamically-growing string array that maintains a
>   NULL-terminator invariant). This leads to people being hesitant to use
>   it for other cases where it would actually be a good fit. The existing
>   name is also clunky to use. It's overly long, and the name often leads
>   to saying things like "argv.argv" (i.e., the field names overlap with
>   variable names, since they're describing the use, not the type). Let's
>   give it a more neutral name.
>
> This has bugged me for a while, so I decided to finally fix it. It
> wasn't _too_ painful, though I'm sure there will be a little fallout
> with topics in flight.
>
> I tried to split out the mechanical bits into their own patches to make
> reviewing easier. Patches 5-7 really could be a single patch, but
> they're too big for the mailing list. I'm OK to leave them separate, or
> they could be squashed together.
>
> We could stop at patch 9 for now and allow topics in flight to catch up
> before removing the compat layers. But the struct field renaming has to
> happen as a single step, so it will be a pain whenever we do it. If
> we're going to go this route, I'd just as soon do it all now and deal
> with other topics as they get merged.
>
>   [01/11]: argv-array: use size_t for count and alloc
>   [02/11]: argv-array: rename to strvec
>   [03/11]: strvec: rename files from argv-array to strvec
>   [04/11]: quote: rename sq_dequote_to_argv_array to mention strvec
>   [05/11]: strvec: convert builtin/ callers away from argv_array name
>   [06/11]: strvec: convert more callers away from argv_array name
>   [07/11]: strvec: convert remaining callers away from argv_array name
>   [08/11]: strvec: fix indentation in renamed calls
>   [09/11]: strvec: update documention to avoid argv_array
>   [10/11]: strvec: drop argv_array compatibility layer
>   [11/11]: strvec: rename struct fields
>

I like this. It definitely helps to name the API after what it does.

One thing I thought I would see but I guess we simply don't have one
is a technical doc that details the strvec. I guess we just never had
one for argv_array? Probably worth adding one at some point.

Thanks,
Jake
Jeff King July 29, 2020, 12:06 a.m. UTC | #2
On Tue, Jul 28, 2020 at 03:45:08PM -0700, Jacob Keller wrote:

> One thing I thought I would see but I guess we simply don't have one
> is a technical doc that details the strvec. I guess we just never had
> one for argv_array? Probably worth adding one at some point.

It all got moved into the header file in 971b1f24a2 (argv-array: move
doc to argv-array.h, 2019-11-17). It got the same s/argv_array/strvec/
as the code in patch 2. I think what's there is pretty reasonable, but
patches welcome if you have suggestions about what could be added. :)

-Peff
Christian Couder July 29, 2020, 6:15 a.m. UTC | #3
On Tue, Jul 28, 2020 at 10:23 PM Jeff King <peff@peff.net> wrote:
>
> The argv_array data type has turned out to be useful in our code base,
> but the name isn't very good. From patch 2 of this series:
>
>   The name "argv-array" isn't very good, because it describes what the
>   data type can be used for (program argument arrays), not what it
>   actually is (a dynamically-growing string array that maintains a
>   NULL-terminator invariant).

I cannot help but notice that you still use "array" when describing
what it is. You actually use "string array" to describe what it is,
and at the same time say that the name should describe what it is. So
I would expect after the above sentence that you would rename it to
"string_array" or "str_array".

In patch 2 you also say:

> I settled on "strvec" because "vector" is the name for a dynamic array
> type in many programming languages.

in which you also use "array" to describe it.

> "strarray" would work, too, but it's
> longer and a bit more awkward to say (and don't we all say these things
> in our mind as we type them?).

It's longer than "strarray" by 2 characters only.

Also we still use "array" in "oid_array" which is very similar to
this. And the implementation is based on the ALLOC_GROW macro which
uses the REALLOC_ARRAY macro.

We also use ALLOC_ARRAY, FLEX_ARRAY, CALLOC_ARRAY, COPY_ARRAY and
MOVE_ARRAY macros.

So if you don't like the "array" part of the name, are you going to
also change "oid_array" into "oidvec" and for example "REALLOC_ARRAY"
into "REALLOC_VEC" or "REALLOC_VECTOR"?

If you want to change only "argv_array" (and not also "oid_array",
"REALLOC_ARRAY" and perhaps other *_ARRAY macros) into something else,
then I think it would be better to be consistent with them.
Christian Couder July 29, 2020, 6:19 a.m. UTC | #4
On Wed, Jul 29, 2020 at 8:15 AM Christian Couder
<christian.couder@gmail.com> wrote:

> > "strarray" would work, too, but it's
> > longer and a bit more awkward to say (and don't we all say these things
> > in our mind as we type them?).
>
> It's longer than "strarray" by 2 characters only.

Sorry, I mean it's shorter than "strarray" by 2 characters only.
Eric Sunshine July 29, 2020, 1:32 p.m. UTC | #5
On Wed, Jul 29, 2020 at 2:16 AM Christian Couder
<christian.couder@gmail.com> wrote:
> I would expect after the above sentence that you would rename it to
> "string_array" or "str_array".
> [...]
> Also we still use "array" in "oid_array" which is very similar to
> this. And the implementation is based on the ALLOC_GROW macro which
> uses the REALLOC_ARRAY macro.
>
> We also use ALLOC_ARRAY, FLEX_ARRAY, CALLOC_ARRAY, COPY_ARRAY and
> MOVE_ARRAY macros.
> [...]
> If you want to change only "argv_array" (and not also "oid_array",
> "REALLOC_ARRAY" and perhaps other *_ARRAY macros) into something else,
> then I think it would be better to be consistent with them.

Dipping my toe into the bikeshed paint bucket...

Naming consistency is certainly a valid concern, and if this was a
public API I might agree that such consistency should outweigh some
other concerns, however, this is a private, project-specific API in
which convenience and concision weigh more heavily in my opinion.
There is value in succinctness, not just when writing code (by having
to type less), but also when reading it, for the same reason that we
use short and sweet variable and function names. To wit: short, well
chosen, idiomatic names let the structure of the code show through
(often) at-a-glance, whereas long names force us to laboriously read
code, making it harder to discern the overall structure and logic.

There is also value[1] in having "vec" (or "v") in the name as opposed
to "array", specifically because this isn't just an array of strings;
it is a NULL-terminated vector of strings. Thus, it is suitable for
functions which accept such a datatype, which often have "v" in their
names, as well (for instance, execv()).

On the topic of a "terminated array": As a NULL-terminated array of
strings, the name "strvec" provides naming consistency and fits nicely
alongside the name "strbuf", a NUL-terminated array of characters,
while also sharing the relative concision of that name.

In my opinion, the name "strvec" is a good and reasonably succinct
compromise between other names, such as "strv"[2] and "strs"[3],
proposed previously. It gets my "+1".

[1]: https://lore.kernel.org/git/20180227221808.GE11187@sigill.intra.peff.net/
[2]: https://lore.kernel.org/git/20180227220443.GB11187@sigill.intra.peff.net/
[3]: https://lore.kernel.org/git/CAPig+cS+G-xC51n-Ud0Wbmcc-zeHBM3-5WQQAFm9gwm9LNk3Gg@mail.gmail.com/
Jeff King July 29, 2020, 4:33 p.m. UTC | #6
On Wed, Jul 29, 2020 at 08:15:54AM +0200, Christian Couder wrote:

> On Tue, Jul 28, 2020 at 10:23 PM Jeff King <peff@peff.net> wrote:
> >
> > The argv_array data type has turned out to be useful in our code base,
> > but the name isn't very good. From patch 2 of this series:
> >
> >   The name "argv-array" isn't very good, because it describes what the
> >   data type can be used for (program argument arrays), not what it
> >   actually is (a dynamically-growing string array that maintains a
> >   NULL-terminator invariant).
> 
> I cannot help but notice that you still use "array" when describing
> what it is. You actually use "string array" to describe what it is,
> and at the same time say that the name should describe what it is. So
> I would expect after the above sentence that you would rename it to
> "string_array" or "str_array".

I think one thing that leads me to "vec" is the "v" from "argv", which
implies the NULL-terminator (though of course that's _not_ implied by
"vector" in other languages). None of the other "array" types have that
feature, and it's an important one here.

> > "strarray" would work, too, but it's
> > longer and a bit more awkward to say (and don't we all say these things
> > in our mind as we type them?).
> 
> It's longer than "strarray" by 2 characters only.

Yeah, I agree the length between the two is not super important. Mostly
I find it harder to read and say. That's obviously subjective, but I do
think counts for something.

We use an underscore for "oid_array" (which I think makes sense because
it's awkward to concatenate directly a word that starts with a vowel and
has multiple syllables). That makes it inconsistent with oidmap and
oidset. Likewise, I was hoping for consistency with strbuf here.

> Also we still use "array" in "oid_array" which is very similar to
> this. And the implementation is based on the ALLOC_GROW macro which
> uses the REALLOC_ARRAY macro.
> 
> We also use ALLOC_ARRAY, FLEX_ARRAY, CALLOC_ARRAY, COPY_ARRAY and
> MOVE_ARRAY macros.

Those macro names are pretty bulky, but I think it matters a lot less
because they're used a lot less. The type name here is prepended to all
of the "method" functions like strvec_pushf(), etc.

> So if you don't like the "array" part of the name, are you going to
> also change "oid_array" into "oidvec" and for example "REALLOC_ARRAY"
> into "REALLOC_VEC" or "REALLOC_VECTOR"?

I hadn't planned on it, just because I think the cost of changing versus
the benefit is not that high. If I were designing from scratch, I'd
definitely consider oidvec (but probably not REALLOC_VEC for the reasons
above).

I think the cost of changing away from argv_array _is_ worth it, and
once we're doing that, we can choose between the alternatives without
paying an extra cost.

> If you want to change only "argv_array" (and not also "oid_array",
> "REALLOC_ARRAY" and perhaps other *_ARRAY macros) into something else,
> then I think it would be better to be consistent with them.

I definitely sympathize with the consistency argument, and I don't think
'str_array" is the end of the world. But I think there are many
dimensions of inconsistency already, so without renaming every data
structure, we're going to be inconsistent with something. Mostly I just
subjectively find strvec easier to read, say, and type, and I don't
think the inconsistencies are so glaring that it's a problem.

-Peff
René Scharfe Aug. 11, 2020, 4:08 p.m. UTC | #7
Am 28.07.20 um 22:21 schrieb Jeff King:
> The argv_array data type has turned out to be useful in our code base,
> but the name isn't very good. From patch 2 of this series:
>
>   The name "argv-array" isn't very good, because it describes what the
>   data type can be used for (program argument arrays), not what it
>   actually is (a dynamically-growing string array that maintains a
>   NULL-terminator invariant). This leads to people being hesitant to use
>   it for other cases where it would actually be a good fit. The existing
>   name is also clunky to use. It's overly long, and the name often leads
>   to saying things like "argv.argv" (i.e., the field names overlap with
>   variable names, since they're describing the use, not the type). Let's
>   give it a more neutral name.
>
> This has bugged me for a while, so I decided to finally fix it. It
> wasn't _too_ painful, though I'm sure there will be a little fallout
> with topics in flight.

Just as this landed in master now, https://lobste.rs/ decided to link to
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2493.pdf, which is a
paper about reserved identifiers in C.  It contains a nice overview.

Anyway, 7.31 of C11 says: "All external names described below are
reserved no matter what headers are included by the program."  And
7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
lowercase letter may be added to the declarations in the <string.h>
header."  So the names of the strvec functions are reserved.

Also how about using Coccinelle and patience to reduce the impact of
such a change next time?  I.e. adding the new thing, providing a
semantic patch for converting old code, waiting a reasonable amount of
time after the last conversion was necessary and then removing the
old thing.

René
Taylor Blau Aug. 11, 2020, 6:28 p.m. UTC | #8
On Tue, Aug 11, 2020 at 06:08:22PM +0200, René Scharfe wrote:
> Am 28.07.20 um 22:21 schrieb Jeff King:
> > The argv_array data type has turned out to be useful in our code base,
> > but the name isn't very good. From patch 2 of this series:
> >
> >   The name "argv-array" isn't very good, because it describes what the
> >   data type can be used for (program argument arrays), not what it
> >   actually is (a dynamically-growing string array that maintains a
> >   NULL-terminator invariant). This leads to people being hesitant to use
> >   it for other cases where it would actually be a good fit. The existing
> >   name is also clunky to use. It's overly long, and the name often leads
> >   to saying things like "argv.argv" (i.e., the field names overlap with
> >   variable names, since they're describing the use, not the type). Let's
> >   give it a more neutral name.
> >
> > This has bugged me for a while, so I decided to finally fix it. It
> > wasn't _too_ painful, though I'm sure there will be a little fallout
> > with topics in flight.
>
> Just as this landed in master now, https://lobste.rs/ decided to link to
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2493.pdf, which is a
> paper about reserved identifiers in C.  It contains a nice overview.
>
> Anyway, 7.31 of C11 says: "All external names described below are
> reserved no matter what headers are included by the program."  And
> 7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
> lowercase letter may be added to the declarations in the <string.h>
> header."  So the names of the strvec functions are reserved.
>
> Also how about using Coccinelle and patience to reduce the impact of
> such a change next time?  I.e. adding the new thing, providing a
> semantic patch for converting old code, waiting a reasonable amount of
> time after the last conversion was necessary and then removing the
> old thing.

I don't think that this is a bad idea at all, but I'm also totally fine
with the approach that Peff took here. (Bear in mind that I'm saying
this as someone who didn't have any topics that used argv_array, and so
didn't have to do any conversion effort ;-)).

I worry a little bit about providing too long a grace period where we
have to worry about not introducing new uses of argv_array, and still
having to deal with some (perhaps less) fall-out later down the line.

For this change, I think it was perfectly fine to just rip the bandaid
off.

> René

Thanks,
Taylor
Junio C Hamano Aug. 11, 2020, 7 p.m. UTC | #9
René Scharfe <l.s.r@web.de> writes:

> 7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
> lowercase letter may be added to the declarations in the <string.h>
> header."  So the names of the strvec functions are reserved.

Ugh, strbuf functions are reserved, too?
Jacob Keller Aug. 11, 2020, 8:39 p.m. UTC | #10
On Tue, Aug 11, 2020 at 12:02 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> René Scharfe <l.s.r@web.de> writes:
>
> > 7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
> > lowercase letter may be added to the declarations in the <string.h>
> > header."  So the names of the strvec functions are reserved.
>
> Ugh, strbuf functions are reserved, too?

At least that paper is proposing a better solution in the long term...
Junio C Hamano Aug. 11, 2020, 9:03 p.m. UTC | #11
Jacob Keller <jacob.keller@gmail.com> writes:

> On Tue, Aug 11, 2020 at 12:02 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> René Scharfe <l.s.r@web.de> writes:
>>
>> > 7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
>> > lowercase letter may be added to the declarations in the <string.h>
>> > header."  So the names of the strvec functions are reserved.
>>
>> Ugh, strbuf functions are reserved, too?
>
> At least that paper is proposing a better solution in the long term...

I am not sure what the "potentially" really buys to our fellow
developers.  At least we are not folks "targeting a specific version
of the standard"; we care about the future standards, too X-<.
Johannes Schindelin Aug. 12, 2020, 12:42 p.m. UTC | #12
Hi,

On Tue, 11 Aug 2020, Junio C Hamano wrote:

> René Scharfe <l.s.r@web.de> writes:
>
> > 7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
> > lowercase letter may be added to the declarations in the <string.h>
> > header."  So the names of the strvec functions are reserved.
>
> Ugh, strbuf functions are reserved, too?

We could try to implement the very patient approach proposed by René and
rename `strbuf` to `git_buf` :-)

Ciao,
Dscho
Jeff King Aug. 12, 2020, 3:06 p.m. UTC | #13
On Tue, Aug 11, 2020 at 06:08:22PM +0200, René Scharfe wrote:

> > This has bugged me for a while, so I decided to finally fix it. It
> > wasn't _too_ painful, though I'm sure there will be a little fallout
> > with topics in flight.
> 
> Just as this landed in master now, https://lobste.rs/ decided to link to
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2493.pdf, which is a
> paper about reserved identifiers in C.  It contains a nice overview.
> 
> Anyway, 7.31 of C11 says: "All external names described below are
> reserved no matter what headers are included by the program."  And
> 7.31.13 goes on: "Function names that begin with str, mem, or wcs and a
> lowercase letter may be added to the declarations in the <string.h>
> header."  So the names of the strvec functions are reserved.

It was the same in C99. I think this is one of those cases where we need
to worry less about what the standard says and more about what the real
world does. As long as we're ready for C25 or whatever to add "strvec"
and we accept that we'll need to change the name then, I'm not
particularly concerned. A compiler that starts warning about "str"
functions in the meantime would be impractical I think (forget strbuf,
stuff like strip_extension() would be illegal).

> Also how about using Coccinelle and patience to reduce the impact of
> such a change next time?  I.e. adding the new thing, providing a
> semantic patch for converting old code, waiting a reasonable amount of
> time after the last conversion was necessary and then removing the
> old thing.

So I almost sent a rant about Coccinelle along with this series. :)

Debian unstable now ships coccinelle 1.0.8, and it's unbelievably slow
compared to 1.0.4. Running "make coccicheck" is currently at 80 minutes
of CPU time running each script in parallel, with none of them down.
They're also all consuming 6GB of RAM each, so I'm killing them all.

I got somewhere similar when I was working on this series, got fed up,
and then just did the whole thing with grep, which was easier. I'm open
to the idea that a slower transition might have helped topics in flight
catch up, but it also would have prolonged the pain. So I dunno.

-Peff
Jeff King Aug. 12, 2020, 3:10 p.m. UTC | #14
On Wed, Aug 12, 2020 at 11:06:11AM -0400, Jeff King wrote:

> Debian unstable now ships coccinelle 1.0.8, and it's unbelievably slow
> compared to 1.0.4. Running "make coccicheck" is currently at 80 minutes
> of CPU time running each script in parallel, with none of them down.
> They're also all consuming 6GB of RAM each, so I'm killing them all.

This seems to be related to setting SPATCH_BATCH_SIZE to "0". It used to make
things go much faster (if you had enough memory), but now seems to just
consume tons of CPU. Setting it to "1" finishes the whole thing in ~13
minutes of CPU (~2m wall-clock).

So that's at least a path forward, but in general I have been frustrated
with operational aspects of coccinelle like this.

-Peff
René Scharfe Aug. 12, 2020, 4:23 p.m. UTC | #15
Am 12.08.20 um 17:10 schrieb Jeff King:
> On Wed, Aug 12, 2020 at 11:06:11AM -0400, Jeff King wrote:
>
>> Debian unstable now ships coccinelle 1.0.8, and it's unbelievably slow
>> compared to 1.0.4. Running "make coccicheck" is currently at 80 minutes
>> of CPU time running each script in parallel, with none of them down.
>> They're also all consuming 6GB of RAM each, so I'm killing them all.
>
> This seems to be related to setting SPATCH_BATCH_SIZE to "0". It used to make
> things go much faster (if you had enough memory), but now seems to just
> consume tons of CPU. Setting it to "1" finishes the whole thing in ~13
> minutes of CPU (~2m wall-clock).

This bit me as well, and I settled with SPATCH_BATCH_SIZE = 10.  With
MAKEFLAGS += -j3 I get these number, which are quite similar to yours
(except I don't dare use more cores due to cooling issues..):

  real	4m12,393s
  user	12m15,447s
  sys	0m10,418s

> So that's at least a path forward, but in general I have been frustrated
> with operational aspects of coccinelle like this.

And I was a bit shocked when Coccinelle's testing package became
unmaintained for a while and I had to compile it from source.

And yes, coccicheck is quite heavy.  When I merge all .cocci files into
one I get:

  real	2m7,164s
  user	2m5,389s
  sys	0m1,572s

Nice.  With spatch -j3 I get basically the same numbers, though.  Hmm.

René
Jeff King Aug. 12, 2020, 5:08 p.m. UTC | #16
On Wed, Aug 12, 2020 at 06:23:01PM +0200, René Scharfe wrote:

> > This seems to be related to setting SPATCH_BATCH_SIZE to "0". It used to make
> > things go much faster (if you had enough memory), but now seems to just
> > consume tons of CPU. Setting it to "1" finishes the whole thing in ~13
> > minutes of CPU (~2m wall-clock).
> 
> This bit me as well, and I settled with SPATCH_BATCH_SIZE = 10.  With
> MAKEFLAGS += -j3 I get these number, which are quite similar to yours
> (except I don't dare use more cores due to cooling issues..):
> 
>   real	4m12,393s
>   user	12m15,447s
>   sys	0m10,418s

Interestingly, that was slower for me (2m47s wall-clock, with 27m of
CPU). Using "2" is slightly faster than "1". But "3" is a little less
fast, and "4" is slower than "1". So...no clue what is going on.

> > So that's at least a path forward, but in general I have been frustrated
> > with operational aspects of coccinelle like this.
> 
> And I was a bit shocked when Coccinelle's testing package became
> unmaintained for a while and I had to compile it from source.

Yeah, I've had various issues with the packaging. For a long time they
had 1.0.7 in experimental, but with no python support. I wonder if it's
worth starting to use python scriptlets in our coccinelle rules, as
described in 4d168e742a (coccinelle: use <...> for function exclusion,
2018-08-28). They're faster and IMHO easier to understand.

Of course I tried it out and got some inscrutable errors:

  SPATCH contrib/coccinelle/object_id.cocci
  init_defs_builtins: /usr/bin/../lib/coccinelle/standard.h
  Python path configuration:
    PYTHONHOME = '/lib/x86_64-linux-gnu/..'
    PYTHONPATH = '/usr/bin/../lib/coccinelle/python'
    program name = 'python3'
    isolated = 0
    environment = 1
    user site = 1
    import site = 1
    sys._base_executable = '/usr/bin/python3'
    sys.base_prefix = '/lib/x86_64-linux-gnu/..'
    sys.base_exec_prefix = '/lib/x86_64-linux-gnu/..'
    sys.executable = '/usr/bin/python3'
    sys.prefix = '/lib/x86_64-linux-gnu/..'
    sys.exec_prefix = '/lib/x86_64-linux-gnu/..'
    sys.path = [
      '/usr/bin/../lib/coccinelle/python',
      '/lib/x86_64-linux-gnu/../lib/python38.zip',
      '/lib/x86_64-linux-gnu/../lib/python3.8',
      '/lib/x86_64-linux-gnu/../lib/python3.8/lib-dynload',
    ]
  Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
  Python runtime state: core initialized
  ModuleNotFoundError: No module named 'encodings'

Fun.

-Peff
René Scharfe Aug. 12, 2020, 6:18 p.m. UTC | #17
Am 12.08.20 um 19:08 schrieb Jeff King:
> Yeah, I've had various issues with the packaging. For a long time they
> had 1.0.7 in experimental, but with no python support. I wonder if it's
> worth starting to use python scriptlets in our coccinelle rules, as
> described in 4d168e742a (coccinelle: use <...> for function exclusion,
> 2018-08-28). They're faster and IMHO easier to understand.

The idea to use Python as a faster alternative to anything makes me
a bit uneasy.  That can't be right. ;-)

> Of course I tried it out and got some inscrutable errors:
>
>   SPATCH contrib/coccinelle/object_id.cocci
>   init_defs_builtins: /usr/bin/../lib/coccinelle/standard.h
>   Python path configuration:
>     PYTHONHOME = '/lib/x86_64-linux-gnu/..'

This looks bogus.  Can you try to unset this environment variable?
Or set it to "/usr"?

>     PYTHONPATH = '/usr/bin/../lib/coccinelle/python'
>     program name = 'python3'
>     isolated = 0
>     environment = 1
>     user site = 1
>     import site = 1
>     sys._base_executable = '/usr/bin/python3'
>     sys.base_prefix = '/lib/x86_64-linux-gnu/..'
>     sys.base_exec_prefix = '/lib/x86_64-linux-gnu/..'
>     sys.executable = '/usr/bin/python3'
>     sys.prefix = '/lib/x86_64-linux-gnu/..'
>     sys.exec_prefix = '/lib/x86_64-linux-gnu/..'
>     sys.path = [
>       '/usr/bin/../lib/coccinelle/python',
>       '/lib/x86_64-linux-gnu/../lib/python38.zip',
>       '/lib/x86_64-linux-gnu/../lib/python3.8',
>       '/lib/x86_64-linux-gnu/../lib/python3.8/lib-dynload',
>     ]
>   Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
>   Python runtime state: core initialized
>   ModuleNotFoundError: No module named 'encodings'

I got this instead:

-- snip --
init_defs_builtins: /usr/bin/../lib/coccinelle/standard.h
Py.find_library: unable to find the Python library [libpython3.8m.so returned Library not found] [/usr/bin/../lib/libpython3.8m.so returned Library not found] [libpython3.8.so returned Library not found] [/usr/bin/../lib/libpython3.8.so returned Library not found]
-- snap --

Then I did "sudo ln -s libpython3.8.so.1.0 libpython3.8.so" in
/usr/lib/x86_64-linux-gnu, and now it seems to be happy.

René
Jeff King Aug. 12, 2020, 7:57 p.m. UTC | #18
On Wed, Aug 12, 2020 at 08:18:33PM +0200, René Scharfe wrote:

> Am 12.08.20 um 19:08 schrieb Jeff King:
> > Yeah, I've had various issues with the packaging. For a long time they
> > had 1.0.7 in experimental, but with no python support. I wonder if it's
> > worth starting to use python scriptlets in our coccinelle rules, as
> > described in 4d168e742a (coccinelle: use <...> for function exclusion,
> > 2018-08-28). They're faster and IMHO easier to understand.
> 
> The idea to use Python as a faster alternative to anything makes me
> a bit uneasy.  That can't be right. ;-)

Well, when you're comparing it to some exponential algorithm with the
other technique, even a scripted language can do well. :)

I definitely remember getting big speedups back at the time of that
commit. But applying the patch below, building object_id.cocci.patch
actually gets a few seconds slower. So maybe something changed between
the various coccinelle versions. At any rate, it doesn't seem worth
pursuing further.

> > Of course I tried it out and got some inscrutable errors:
> >
> >   SPATCH contrib/coccinelle/object_id.cocci
> >   init_defs_builtins: /usr/bin/../lib/coccinelle/standard.h
> >   Python path configuration:
> >     PYTHONHOME = '/lib/x86_64-linux-gnu/..'
> 
> This looks bogus.  Can you try to unset this environment variable?
> Or set it to "/usr"?

It's not set in my environment. However, I was able to solve it by
fiddling with my system python packages (I'm not sure of the exact
cause; it was a case of corrected-while-testing).

-Peff