mbox series

[v2,0/7] strvec: use size_t to store nr and alloc

Message ID cover-v2-0.7-00000000000-20210912T001420Z-avarab@gmail.com (mailing list archive)
Headers show
Series strvec: use size_t to store nr and alloc | expand

Message

Ævar Arnfjörð Bjarmason Sept. 12, 2021, 12:15 a.m. UTC
This is a proposed v2 of Jeff King's one-patch change to change
strvec's nr/alloc from "int" to "size_t". As noted below I think it's
worthwhile to not only change that in the struct, but also in code
that directly references the "nr" member.

On Sat, Sep 11 2021, Philip Oakley wrote:

> On 11/09/2021 17:13, Ævar Arnfjörð Bjarmason wrote:
>> On Sat, Sep 11 2021, Jeff King wrote:
>>
>>> We converted argv_array (which later became strvec) to use size_t in
>>> 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in
>>> order to avoid the possibility of integer overflow. But later, commit
>>> d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally
>>> converted these back to ints!
>>>
>>> Those two commits were part of the same patch series. I'm pretty sure
>>> what happened is that they were originally written in the opposite order
>>> and then cleaned up and re-ordered during an interactive rebase. And
>>> when resolving the inevitable conflict, I mistakenly took the "rename"
>>> patch completely, accidentally dropping the type change.
>>>
>>> We can correct it now; better late than never.
>>>
>>> Signed-off-by: Jeff King <peff@peff.net>
>>> ---
>>> This was posted previously in the midst of another thread, but I don't
>>> think was picked up. There was some positive reaction, but one "do we
>>> really need this?" to which I responded in detail:
>>>
>>>   https://lore.kernel.org/git/YTIBnT8Ue1HZXs82@coredump.intra.peff.net/
>>>
>>> I don't really think any of that needs to go into the commit message,
>>> but if that's a hold-up, I can try to summarize it (though I think
>>> referring to the commit which _already_ did this and was accidentally
>>> reverted would be sufficient).
>> Thanks, I have a WIP version of this outstanding starting with this
>> patch that I was planning to submit sometime, but I'm happy to have you
>> pursue it, especially with the ~100 outstanding patches I have in
>> master..seen.
>>
>> It does feel somewhere between iffy and a landmine waiting to be stepped
>> on to only convert the member itself, and not any of the corresponding
>> "int" variables that track it to "size_t".
>>
>> If you do the change I suggested in
>> https://lore.kernel.org/git/87v93i8svd.fsf@evledraar.gmail.com/ you'll
>> find that there's at least one first-order reference to this that now
>> uses "int" that if converted to "size_t" will result in a wrap-around
>> error, we're lucky that one has a test failure.
>>
>> I can tell you what that bug is, but maybe it's better if you find it
>> yourself :) I.e. I found *that* one, but I'm not sure I found them
>> all. I just s/int nr/size_t *nr/ and eyeballed the wall off compiler
>> errors & the code context (note: pointer, obviously broken, but makes
>> the compiler yell).
>>
>> That particular bug will be caught by the compiler as it involves a >= 0
>> comparison against unsigned, but we may not not have that everywhere...
>
> I'm particularly interested in the int -> size_t change problem as part
> of the wider 4GB limitations for the LLP64 systems [0] such as the
> RaspPi, git-lfs (on windows [1]), and Git-for-Windows[2]. It is a big
> problem.

Okey, fine, no fun excercise for the reader then ;)

This is what I'd been sitting on locally since that recent thread, I
polished it up a bit since Jeff King posted his version.

The potential overflow bug I mentioned is in rebase.c. See
5/7. "Potential" because it's not a bug now, but that code
intentionally considers a strvec, and then iterates it from nr-1 to 0,
and if it reaches 0 intentionally counts down one more to -1 to
indicate that it's visited all elements.

We then check that with i >= 0, except of course if it becomes
unsigned that doesn't become -1, but rather it wraps around.

The rest of this is all changes to have that s/int/size_t/ radiate
outwards, i.e. when we assign that value to a variable somewhere its
now a "size_t" instead of an "int" etc.

> [0]
> http://nickdesaulniers.github.io/blog/2016/05/30/data-models-and-word-size/
> [1] https://github.com/git-lfs/git-lfs/issues/2434  Git on Windows
> client corrupts files > 4Gb
> [2] https://github.com/git-for-windows/git/pull/2179  [DRAFT] for
> testing : Fix 4Gb limit for large files on Git for Windows

Jeff King (1):
  strvec: use size_t to store nr and alloc

Ævar Arnfjörð Bjarmason (6):
  remote-curl: pass "struct strvec *" instead of int/char ** pair
  pack-objects: pass "struct strvec *" instead of int/char ** pair
  sequencer.[ch]: pass "struct strvec *" instead of int/char ** pair
  upload-pack.c: pass "struct strvec *" instead of int/char ** pair
  rebase: don't have loop over "struct strvec" depend on signed "nr"
  strvec API users: change some "int" tracking "nr" to "size_t"

 builtin/pack-objects.c |  6 +++---
 builtin/rebase.c       | 26 ++++++++++++--------------
 connect.c              |  8 ++++----
 fetch-pack.c           |  4 ++--
 ls-refs.c              |  2 +-
 remote-curl.c          | 23 +++++++++++------------
 sequencer.c            |  8 ++++----
 sequencer.h            |  4 ++--
 serve.c                |  2 +-
 shallow.c              |  5 +++--
 shallow.h              |  6 ++++--
 strvec.h               |  4 ++--
 submodule.c            |  2 +-
 upload-pack.c          |  7 +++----
 14 files changed, 53 insertions(+), 54 deletions(-)

Range-diff against v1:
-:  ----------- > 1:  2ef48d734e8 remote-curl: pass "struct strvec *" instead of int/char ** pair
-:  ----------- > 2:  7f59a58ed97 pack-objects: pass "struct strvec *" instead of int/char ** pair
-:  ----------- > 3:  c35cfb9c9c5 sequencer.[ch]: pass "struct strvec *" instead of int/char ** pair
-:  ----------- > 4:  2e0b82d4316 upload-pack.c: pass "struct strvec *" instead of int/char ** pair
-:  ----------- > 5:  be85a0565ef rebase: don't have loop over "struct strvec" depend on signed "nr"
1:  498f5ed80dc ! 6:  ba17290852c strvec: use size_t to store nr and alloc
    @@ Commit message
         We can correct it now; better late than never.
     
         Signed-off-by: Jeff King <peff@peff.net>
    +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## strvec.h ##
     @@ strvec.h: extern const char *empty_strvec[];
-:  ----------- > 7:  2edd9708888 strvec API users: change some "int" tracking "nr" to "size_t"

Comments

Jeff King Sept. 12, 2021, 10:19 p.m. UTC | #1
On Sun, Sep 12, 2021 at 02:15:48AM +0200, Ævar Arnfjörð Bjarmason wrote:

> This is what I'd been sitting on locally since that recent thread, I
> polished it up a bit since Jeff King posted his version.
> 
> The potential overflow bug I mentioned is in rebase.c. See
> 5/7. "Potential" because it's not a bug now, but that code
> intentionally considers a strvec, and then iterates it from nr-1 to 0,
> and if it reaches 0 intentionally counts down one more to -1 to
> indicate that it's visited all elements.
> 
> We then check that with i >= 0, except of course if it becomes
> unsigned that doesn't become -1, but rather it wraps around.

You can also just use ssize_t, or you can compare against SIZE_MAX to
catch the wraparound (there's some prior art in sort_revindex()). That
said, I don't mind rewriting loops to count up rather than down. It
usually makes them easier to follow (and in your patch 5, I do not see
any reason we would need to count down rather than up; we do not even
care where we find "-q", only that we found it.

> The rest of this is all changes to have that s/int/size_t/ radiate
> outwards, i.e. when we assign that value to a variable somewhere its
> now a "size_t" instead of an "int" etc.

I'm a little "meh" on some of these, for a few reasons:

 - anything calling into setup_revisions() eventually is just kicking
   the can anyway. And these are generally not buggy in the first place,
   since they're bounded argv creations.

 - passing a strvec instead of the broken-down pair is a less flexible
   interface. It's one thing if the callee benefits from seeing the
   strvec (say, because they may push more items onto it). But I think
   with strbufs, we have a general guideline that if a function _can_
   take the bare pointer, then it should. (Sorry, I don't have a
   succinct reference to CodingGuidelines or anything like that; I feel
   like this is wisdom we came up with on the list in the early days of
   strbufs).

 - if we are going to pass a strvec, it should almost certainly be
   const, to make it clear how we intend to use it.

So if we we wanted to try to reduce the int/size_t conversions here (and
I don't mind doing it, but am not altogether sure it is a good use of
time, because the rabbit hole runs deep), I think we ought to be
switching to size_t everywhere-ish along whole call chains. Or possibly
providing a checked size_to_int() which will safely catch and abort.
These cases are largely stupid things that real people would never come
across. The real goal is making sure we don't get hit with a memory
safety bug (under-allocation, converting a big size_t to a negative int,
etc).

-Peff
Junio C Hamano Sept. 13, 2021, 5:38 a.m. UTC | #2
Jeff King <peff@peff.net> writes:

> I'm a little "meh" on some of these, for a few reasons:
>
>  - anything calling into setup_revisions() eventually is just kicking
>    the can anyway. And these are generally not buggy in the first place,
>    since they're bounded argv creations.
>
>  - passing a strvec instead of the broken-down pair is a less flexible
>    interface. It's one thing if the callee benefits from seeing the
>    strvec (say, because they may push more items onto it). But I think
>    with strbufs, we have a general guideline that if a function _can_
>    take the bare pointer, then it should. (Sorry, I don't have a
>    succinct reference to CodingGuidelines or anything like that; I feel
>    like this is wisdom we came up with on the list in the early days of
>    strbufs).
>
>  - if we are going to pass a strvec, it should almost certainly be
>    const, to make it clear how we intend to use it.

All true.

> These cases are largely stupid things that real people would never come
> across. The real goal is making sure we don't get hit with a memory
> safety bug (under-allocation, converting a big size_t to a negative int,
> etc).

Yes.  

Ævar, I do not mean any disrespect to you, but I have to say that
topics like this one are starting to wear my concentration and
patience down really thin and making me really slow down.

Perhaps I am biased by not yet having seen what you eventually want
to build on top, and because of that I do not understood why and how
these "clean ups" are so valuable to have right now (as opposed to
just letting the sleeping dog lie), which you would of course have a
much better chance to know than I do.

But with that "bias", many of the recent patches from you look more
like pointless churn, mixed with fixes to theoretical problems, than
clean-ups with real benefits.

What makes it worse is that there are occasional real gems among
these "meh" patches, which means I have to read all of them anyway
to sift wheat from chaff X-<.
Philip Oakley Sept. 13, 2021, 10:47 a.m. UTC | #3
On 12/09/2021 01:15, Ævar Arnfjörð Bjarmason wrote:
> This is a proposed v2 of Jeff King's one-patch change to change
> strvec's nr/alloc from "int" to "size_t". As noted below I think it's
> worthwhile to not only change that in the struct, but also in code
> that directly references the "nr" member.
>
> On Sat, Sep 11 2021, Philip Oakley wrote:
>
>> On 11/09/2021 17:13, Ævar Arnfjörð Bjarmason wrote:
>>> On Sat, Sep 11 2021, Jeff King wrote:
>>>
>>>> We converted argv_array (which later became strvec) to use size_t in
>>>> 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in
>>>> order to avoid the possibility of integer overflow. But later, commit
>>>> d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally
>>>> converted these back to ints!
>>>>
>>>> Those two commits were part of the same patch series. I'm pretty sure
>>>> what happened is that they were originally written in the opposite order
>>>> and then cleaned up and re-ordered during an interactive rebase. And
>>>> when resolving the inevitable conflict, I mistakenly took the "rename"
>>>> patch completely, accidentally dropping the type change.
>>>>
>>>> We can correct it now; better late than never.
>>>>
>>>> Signed-off-by: Jeff King <peff@peff.net>
>>>> ---
>>>> This was posted previously in the midst of another thread, but I don't
>>>> think was picked up. There was some positive reaction, but one "do we
>>>> really need this?" to which I responded in detail:
>>>>
>>>>   https://lore.kernel.org/git/YTIBnT8Ue1HZXs82@coredump.intra.peff.net/
>>>>
>>>> I don't really think any of that needs to go into the commit message,
>>>> but if that's a hold-up, I can try to summarize it (though I think
>>>> referring to the commit which _already_ did this and was accidentally
>>>> reverted would be sufficient).
>>> Thanks, I have a WIP version of this outstanding starting with this
>>> patch that I was planning to submit sometime, but I'm happy to have you
>>> pursue it, especially with the ~100 outstanding patches I have in
>>> master..seen.
>>>
>>> It does feel somewhere between iffy and a landmine waiting to be stepped
>>> on to only convert the member itself, and not any of the corresponding
>>> "int" variables that track it to "size_t".
>>>
>>> If you do the change I suggested in
>>> https://lore.kernel.org/git/87v93i8svd.fsf@evledraar.gmail.com/ you'll
>>> find that there's at least one first-order reference to this that now
>>> uses "int" that if converted to "size_t" will result in a wrap-around
>>> error, we're lucky that one has a test failure.
>>>
>>> I can tell you what that bug is, but maybe it's better if you find it
>>> yourself :) I.e. I found *that* one, but I'm not sure I found them
>>> all. I just s/int nr/size_t *nr/ and eyeballed the wall off compiler
>>> errors & the code context (note: pointer, obviously broken, but makes
>>> the compiler yell).
>>>
>>> That particular bug will be caught by the compiler as it involves a >= 0
>>> comparison against unsigned, but we may not not have that everywhere...
>> I'm particularly interested in the int -> size_t change problem as part
>> of the wider 4GB limitations for the LLP64 systems [0] such as the
>> RaspPi, git-lfs (on windows [1]), and Git-for-Windows[2]. It is a big
>> problem.
> Okey, fine, no fun excercise for the reader then ;)

There's a lot of weeds in there ;-)  In some ways it feels like the
SHA1->SHA256 transition, but without the consensus, as the 'shifting
foundations' problem only affect a subgroup of a subgroup (large Windows
files and repositories).
>
> This is what I'd been sitting on locally since that recent thread, I
> polished it up a bit since Jeff King posted his version.
>
> The potential overflow bug I mentioned is in rebase.c. See
> 5/7. "Potential" because it's not a bug now, but that code
> intentionally considers a strvec, and then iterates it from nr-1 to 0,
> and if it reaches 0 intentionally counts down one more to -1 to
> indicate that it's visited all elements.

It's these tidbits about how the problems surface, their detection and
resolution that are really useful. Along with general awareness raising.
At least here the issue is reasonably tightly focussed, and even then,
testing is hard.
>
> We then check that with i >= 0, except of course if it becomes
> unsigned that doesn't become -1, but rather it wraps around.
>
> The rest of this is all changes to have that s/int/size_t/ radiate
> outwards, i.e. when we assign that value to a variable somewhere its
> now a "size_t" instead of an "int" etc.

In the LLP64 case, I'm somewhat concerned about the possible pushback of
a wide spread s/int/size_t/ on the codebase's look & feel.
 (aside) I don't think there is even a `1S` to match the  the `1L` and
`1U` shorthands used in various places.

None of that is part of the series, but the patches are beneficial to
the codes portability.

>
>> [0]
>> http://nickdesaulniers.github.io/blog/2016/05/30/data-models-and-word-size/
>> [1] https://github.com/git-lfs/git-lfs/issues/2434  Git on Windows
>> client corrupts files > 4Gb
>> [2] https://github.com/git-for-windows/git/pull/2179  [DRAFT] for
>> testing : Fix 4Gb limit for large files on Git for Windows
> Jeff King (1):
>   strvec: use size_t to store nr and alloc
>
> Ævar Arnfjörð Bjarmason (6):
>   remote-curl: pass "struct strvec *" instead of int/char ** pair
>   pack-objects: pass "struct strvec *" instead of int/char ** pair
>   sequencer.[ch]: pass "struct strvec *" instead of int/char ** pair
>   upload-pack.c: pass "struct strvec *" instead of int/char ** pair
>   rebase: don't have loop over "struct strvec" depend on signed "nr"
>   strvec API users: change some "int" tracking "nr" to "size_t"
>
>  builtin/pack-objects.c |  6 +++---
>  builtin/rebase.c       | 26 ++++++++++++--------------
>  connect.c              |  8 ++++----
>  fetch-pack.c           |  4 ++--
>  ls-refs.c              |  2 +-
>  remote-curl.c          | 23 +++++++++++------------
>  sequencer.c            |  8 ++++----
>  sequencer.h            |  4 ++--
>  serve.c                |  2 +-
>  shallow.c              |  5 +++--
>  shallow.h              |  6 ++++--
>  strvec.h               |  4 ++--
>  submodule.c            |  2 +-
>  upload-pack.c          |  7 +++----
>  14 files changed, 53 insertions(+), 54 deletions(-)
>
> Range-diff against v1:
> -:  ----------- > 1:  2ef48d734e8 remote-curl: pass "struct strvec *" instead of int/char ** pair
> -:  ----------- > 2:  7f59a58ed97 pack-objects: pass "struct strvec *" instead of int/char ** pair
> -:  ----------- > 3:  c35cfb9c9c5 sequencer.[ch]: pass "struct strvec *" instead of int/char ** pair
> -:  ----------- > 4:  2e0b82d4316 upload-pack.c: pass "struct strvec *" instead of int/char ** pair
> -:  ----------- > 5:  be85a0565ef rebase: don't have loop over "struct strvec" depend on signed "nr"
> 1:  498f5ed80dc ! 6:  ba17290852c strvec: use size_t to store nr and alloc
>     @@ Commit message
>          We can correct it now; better late than never.
>      
>          Signed-off-by: Jeff King <peff@peff.net>
>     +    Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>      
>       ## strvec.h ##
>      @@ strvec.h: extern const char *empty_strvec[];
> -:  ----------- > 7:  2edd9708888 strvec API users: change some "int" tracking "nr" to "size_t"
Ævar Arnfjörð Bjarmason Sept. 13, 2021, 12:29 p.m. UTC | #4
On Sun, Sep 12 2021, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
>> I'm a little "meh" on some of these, for a few reasons:
>>
>>  - anything calling into setup_revisions() eventually is just kicking
>>    the can anyway. And these are generally not buggy in the first place,
>>    since they're bounded argv creations.
>>
>>  - passing a strvec instead of the broken-down pair is a less flexible
>>    interface. It's one thing if the callee benefits from seeing the
>>    strvec (say, because they may push more items onto it). But I think
>>    with strbufs, we have a general guideline that if a function _can_
>>    take the bare pointer, then it should. (Sorry, I don't have a
>>    succinct reference to CodingGuidelines or anything like that; I feel
>>    like this is wisdom we came up with on the list in the early days of
>>    strbufs).
>>
>>  - if we are going to pass a strvec, it should almost certainly be
>>    const, to make it clear how we intend to use it.
>
> All true.
>
>> These cases are largely stupid things that real people would never come
>> across. The real goal is making sure we don't get hit with a memory
>> safety bug (under-allocation, converting a big size_t to a negative int,
>> etc).
>
> Yes.  
>
> Ævar, I do not mean any disrespect to you, but I have to say that
> topics like this one are starting to wear my concentration and
> patience down really thin and making me really slow down.

I'm sorry. I'll try to lay off on the patch firehose until the delta
I've got in master..seen is way down from what it is now.

> Perhaps I am biased by not yet having seen what you eventually want
> to build on top, and because of that I do not understood why and how
> these "clean ups" are so valuable to have right now (as opposed to
> just letting the sleeping dog lie), which you would of course have a
> much better chance to know than I do.

I could go into that, but it's probably best to leave it at: Yeah for
these seemingly going nowhere small changes I'm generally taking them
somewhere interesting.

But figuring out how to split that is still a hard problem. E.g. I've
had one series (the fsck error message improvements) that's been stalled
for ~3 months due to size/including the "interesting" part, but recently
relatively small increments of prep code changes seem to get a lot of
review traction.

As for this strvec.h s/int/size_t/ topic. I'm not taking that anywhere,
Jeff suggested it and came up with the patch, I figured more helpful
than "if we change s/int/size_t/g for x, shouldn't we change that for y
which whe assign x to?" would be patches I had to do that, which I'd
come up with after Jeff suggested this direction in response to another
topic.
Jeff King Sept. 13, 2021, 5:20 p.m. UTC | #5
On Mon, Sep 13, 2021 at 02:29:01PM +0200, Ævar Arnfjörð Bjarmason wrote:

> As for this strvec.h s/int/size_t/ topic. I'm not taking that anywhere,
> Jeff suggested it and came up with the patch, I figured more helpful
> than "if we change s/int/size_t/g for x, shouldn't we change that for y
> which whe assign x to?" would be patches I had to do that, which I'd
> come up with after Jeff suggested this direction in response to another
> topic.

I'm not inherently opposed to further int/size_t cleanups. But the
trouble is that my single patch stands on its own as an improvement to a
real issue, and does not (as far as I know) have any functional
downsides (either known or even hypothetical, aside from the obvious
mismatch that some callers will still use "int").

But doing wide-spread int/size_t conversion has less obvious immediate
benefit, is much easier to get wrong, and may introduce further
complications (e.g., differences of opinion in whether we should be
passing strvecs around more, or just using size_t in more places).

So I don't mind a series in that direction (though I don't necessarily
think it is the best use of time), but I'd prefer not to see my original
patch tied up in it.

-Peff