[v3,2/5] strvec: add a strvec_pushvec()

Message ID	patch-v3-2.5-321b8ba3f0e-20210826T140414Z-avarab@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= <avarab@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Emily Shaffer <emilyshaffer@google.com>, Derrick Stolee <stolee@gmail.com>, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= <avarab@gmail.com> Subject: [PATCH v3 2/5] strvec: add a strvec_pushvec() Date: Thu, 26 Aug 2021 16:05:48 +0200 Message-Id: <patch-v3-2.5-321b8ba3f0e-20210826T140414Z-avarab@gmail.com> In-Reply-To: <cover-v3-0.5-00000000000-20210826T140414Z-avarab@gmail.com> References: <cover-0.4-0000000000-20210727T004015Z-avarab@gmail.com> <cover-v3-0.5-00000000000-20210826T140414Z-avarab@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	bundle: show progress on "unbundle" \| expand [v3,0/5] bundle: show progress on "unbundle" [v3,1/5] bundle API: start writing API documentation [v3,2/5] strvec: add a strvec_pushvec() [v3,3/5] bundle API: change "flags" to be "extra_index_pack_args" [v3,4/5] index-pack: add --progress-title option [v3,5/5] bundle: show progress on "unbundle"

Ævar Arnfjörð Bjarmason Aug. 26, 2021, 2:05 p.m. UTC

Add a strvec_pushvec() function to concatenate two "struct strvec *"
together, and modify code added in 50d89ad6542 (submodule: use
argv_array instead of hand-building arrays, 2012-09-01) to use it. In
a subsequent commit we'll gain another API user.

This could also have been named strvec_concat()[1], but I opted to
make its name consistent with the strbuf_addbuf() function instead. We
only name these sorts of functions *_concat() in one instance:
parse_options_concat().

1. http://lore.kernel.org/git/30620e13-4509-1905-7644-9962b6adf9c5@gmail.com

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 strvec.c    | 8 ++++++++
 strvec.h    | 7 +++++++
 submodule.c | 4 +---
 3 files changed, 16 insertions(+), 3 deletions(-)

Junio C Hamano Aug. 28, 2021, 1:23 a.m. UTC | #1

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Add a strvec_pushvec() function to concatenate two "struct strvec *"
> together, and modify code added in 50d89ad6542 (submodule: use
> argv_array instead of hand-building arrays, 2012-09-01) to use it. In
> a subsequent commit we'll gain another API user.
>
> This could also have been named strvec_concat()[1], but I opted to
> make its name consistent with the strbuf_addbuf() function instead. We
> only name these sorts of functions *_concat() in one instance:
> parse_options_concat().

FWIW pushvec() is a much better name than concat() for this.

concat(A, B) could be an operation that is non-destructive for both
A and B that returns a new vector C.  I would imagine that it would
be how, if not majority of, but at least a nontrivial percentage of,
readers expect a function called concat() to behave.  A better name
for destructive version would probably be append(), if there is no
other constraints on naming.

The name pushvec(A, B) makes it clear, thanks to similarity with
other push*(A, ...) variants, that A is mutated using the other
arguments.  The name is much more clear with less chance of
misunderstanding.

> +void strvec_pushvec(struct strvec *array, const struct strvec *items)
> +{
> +	int i;
> +
> +	for (i = 0; i < items->nr; i++)
> +		strvec_push(array, items->v[i]);
> +}

This implementation is not wrong per-se, but is somewhat
disappointing.  When items->nr is large, especially relative to the
original array->alloc, it would incur unnecessary reallocations that
we can easily avoid by pre-sizing the array before pushing the
elements of items from it.

In the original code that became the first user of this helper, it
may not have made much difference, but now it is becoming a more
generally reusable API function, we should care.

Other than that, looks quite straight-forward.

Thanks.

Junio C Hamano Aug. 28, 2021, 1:29 a.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

>> +void strvec_pushvec(struct strvec *array, const struct strvec *items)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < items->nr; i++)
>> +		strvec_push(array, items->v[i]);
>> +}
>
> This implementation is not wrong per-se, but is somewhat
> disappointing.  When items->nr is large, especially relative to the
> original array->alloc, it would incur unnecessary reallocations that
> we can easily avoid by pre-sizing the array before pushing the
> elements of items from it.
>
> In the original code that became the first user of this helper, it
> may not have made much difference, but now it is becoming a more
> generally reusable API function, we should care.

And if we do not care, you can rewrite the code that became the
first user of this helper to instead call strvec_pushv() on the
items->v array that is guaranteed to be NULL terminated, without
inventing this new helper.

I think I am fine with either way.

Jeff King Aug. 28, 2021, 4:12 a.m. UTC | #3

On Fri, Aug 27, 2021 at 06:29:30PM -0700, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
> 
> >> +void strvec_pushvec(struct strvec *array, const struct strvec *items)
> >> +{
> >> +	int i;
> >> +
> >> +	for (i = 0; i < items->nr; i++)
> >> +		strvec_push(array, items->v[i]);
> >> +}
> >
> > This implementation is not wrong per-se, but is somewhat
> > disappointing.  When items->nr is large, especially relative to the
> > original array->alloc, it would incur unnecessary reallocations that
> > we can easily avoid by pre-sizing the array before pushing the
> > elements of items from it.
> >
> > In the original code that became the first user of this helper, it
> > may not have made much difference, but now it is becoming a more
> > generally reusable API function, we should care.
> 
> And if we do not care, you can rewrite the code that became the
> first user of this helper to instead call strvec_pushv() on the
> items->v array that is guaranteed to be NULL terminated, without
> inventing this new helper.

I came here to say that. ;)

I do not mind using pushv() directly, or a pushvec() that is a
convenience wrapper for pushv(). Even better if that wrapper is smart
enough to pre-allocate based on items->nr, as you mentioned, but that
can also come later.

One thing that did surprise me: the use of "int" here for iterating,
rather than size_t. But it seems that strvec is already storing ints,
which is an accident!

I think we'd want the patch below. It can be applied independently
(though if we do take the index-iterating version of Ævar's patch, I
think it should switch to size_t).

-- >8 --
Subject: [PATCH] strvec: use size_t to store nr and alloc

We converted argv_array (which later became strvec) to use size_t in
819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in
order to avoid the possibility of integer overflow. But later, commit
d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally
converted these back to ints!

Those two commits were part of the same patch series. I'm pretty sure
what happened is that they were originally written in the opposite order
and then cleaned up and re-ordered during an interactive rebase. And
when resolving the inevitable conflict, I mistakenly took the "rename"
patch completely, accidentally dropping the type change.

We can correct it now; better late than never.

Signed-off-by: Jeff King <peff@peff.net>
---
 strvec.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/strvec.h b/strvec.h
index fdcad75b45..6b3cbd6758 100644
--- a/strvec.h
+++ b/strvec.h
@@ -29,8 +29,8 @@ extern const char *empty_strvec[];
  */
 struct strvec {
 	const char **v;
-	int nr;
-	int alloc;
+	size_t nr;
+	size_t alloc;
 };

 #define STRVEC_INIT { empty_strvec, 0, 0 }

Junio C Hamano Aug. 29, 2021, 11:54 p.m. UTC | #4

Jeff King <peff@peff.net> writes:

> I think we'd want the patch below. It can be applied independently
> (though if we do take the index-iterating version of Ævar's patch, I
> think it should switch to size_t).

Yeah, I do not see a strong need for _pushvec(), especially the
variant that does not preallocate, when we have _pushv().  But the
type fix below does make sense.

> -- >8 --
> Subject: [PATCH] strvec: use size_t to store nr and alloc
>
> We converted argv_array (which later became strvec) to use size_t in
> 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in
> order to avoid the possibility of integer overflow. But later, commit
> d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally
> converted these back to ints!
>
> Those two commits were part of the same patch series. I'm pretty sure
> what happened is that they were originally written in the opposite order
> and then cleaned up and re-ordered during an interactive rebase. And
> when resolving the inevitable conflict, I mistakenly took the "rename"
> patch completely, accidentally dropping the type change.
>
> We can correct it now; better late than never.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  strvec.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/strvec.h b/strvec.h
> index fdcad75b45..6b3cbd6758 100644
> --- a/strvec.h
> +++ b/strvec.h
> @@ -29,8 +29,8 @@ extern const char *empty_strvec[];
>   */
>  struct strvec {
>  	const char **v;
> -	int nr;
> -	int alloc;
> +	size_t nr;
> +	size_t alloc;
>  };
>  
>  #define STRVEC_INIT { empty_strvec, 0, 0 }

Derrick Stolee Aug. 30, 2021, 5:30 p.m. UTC | #5

On 8/29/2021 7:54 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> I think we'd want the patch below. It can be applied independently
>> (though if we do take the index-iterating version of Ævar's patch, I
>> think it should switch to size_t).
> 
> Yeah, I do not see a strong need for _pushvec(), especially the
> variant that does not preallocate, when we have _pushv().  But the
> type fix below does make sense.

Thanks for chiming in. I was not aware of _pushv() when I asked
about the _pushvec() variant. Sorry, Ævar, for sending you down
an unnecessary direction.
 
>> -- >8 --
>> Subject: [PATCH] strvec: use size_t to store nr and alloc
>>
>> We converted argv_array (which later became strvec) to use size_t in
>> 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in
>> order to avoid the possibility of integer overflow. But later, commit
>> d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally
>> converted these back to ints!
>>
>> Those two commits were part of the same patch series. I'm pretty sure
>> what happened is that they were originally written in the opposite order
>> and then cleaned up and re-ordered during an interactive rebase. And
>> when resolving the inevitable conflict, I mistakenly took the "rename"
>> patch completely, accidentally dropping the type change.
>>
>> We can correct it now; better late than never.
...
>>  struct strvec {
>>  	const char **v;
>> -	int nr;
>> -	int alloc;
>> +	size_t nr;
>> +	size_t alloc;
>>  };

This is also a good change to take.

Thanks,
-Stolee

Ævar Arnfjörð Bjarmason Sept. 2, 2021, 11:19 p.m. UTC | #6

On Sat, Aug 28 2021, Jeff King wrote:

> On Fri, Aug 27, 2021 at 06:29:30PM -0700, Junio C Hamano wrote:
>
>> Junio C Hamano <gitster@pobox.com> writes:
>> 
>> >> +void strvec_pushvec(struct strvec *array, const struct strvec *items)
>> >> +{
>> >> +	int i;
>> >> +
>> >> +	for (i = 0; i < items->nr; i++)
>> >> +		strvec_push(array, items->v[i]);
>> >> +}
>> >
>> > This implementation is not wrong per-se, but is somewhat
>> > disappointing.  When items->nr is large, especially relative to the
>> > original array->alloc, it would incur unnecessary reallocations that
>> > we can easily avoid by pre-sizing the array before pushing the
>> > elements of items from it.
>> >
>> > In the original code that became the first user of this helper, it
>> > may not have made much difference, but now it is becoming a more
>> > generally reusable API function, we should care.
>> 
>> And if we do not care, you can rewrite the code that became the
>> first user of this helper to instead call strvec_pushv() on the
>> items->v array that is guaranteed to be NULL terminated, without
>> inventing this new helper.
>
> I came here to say that. ;)
>
> I do not mind using pushv() directly, or a pushvec() that is a
> convenience wrapper for pushv(). Even better if that wrapper is smart
> enough to pre-allocate based on items->nr, as you mentioned, but that
> can also come later.
>
> One thing that did surprise me: the use of "int" here for iterating,
> rather than size_t. But it seems that strvec is already storing ints,
> which is an accident!

Is it really? If you temporarily try to say convert that to "size_t *nr"
to have the compiler catch all the cases where we use "nr", and then
s/size_t/int/g those all, you'll find that e.g. setup_revisions() and
the like expect to take either "int argc" or the strvec equivalent.

We can sensibly convert some of those to size_t, but not all, and the
int v.s. size_t inconsistency as a result feels weird.

Since the main point of this API is to be a wrapper for what a C main()
would take, shouldn't its prototype mirror its prototype? I.e. we should
stick to "int" here?

> I think we'd want the patch below. It can be applied independently
> (though if we do take the index-iterating version of Ævar's patch, I
> think it should switch to size_t).
>
> -- >8 --
> Subject: [PATCH] strvec: use size_t to store nr and alloc
>
> We converted argv_array (which later became strvec) to use size_t in
> 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in
> order to avoid the possibility of integer overflow. But later, commit
> d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally
> converted these back to ints!
>
> Those two commits were part of the same patch series. I'm pretty sure
> what happened is that they were originally written in the opposite order
> and then cleaned up and re-ordered during an interactive rebase. And
> when resolving the inevitable conflict, I mistakenly took the "rename"
> patch completely, accidentally dropping the type change.
>
> We can correct it now; better late than never.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  strvec.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/strvec.h b/strvec.h
> index fdcad75b45..6b3cbd6758 100644
> --- a/strvec.h
> +++ b/strvec.h
> @@ -29,8 +29,8 @@ extern const char *empty_strvec[];
>   */
>  struct strvec {
>  	const char **v;
> -	int nr;
> -	int alloc;
> +	size_t nr;
> +	size_t alloc;
>  };
>  
>  #define STRVEC_INIT { empty_strvec, 0, 0 }

Junio C Hamano Sept. 3, 2021, 5:05 a.m. UTC | #7

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Since the main point of this API is to be a wrapper for what a C main()
> would take, shouldn't its prototype mirror its prototype? I.e. we should
> stick to "int" here?

That's a fair-enough argument.

In other words, we do not mind the upper limit of 2 billion strings,
even though we do care that we can have more than 4 billion bytes in
any of these strings ;-)

Jeff King Sept. 3, 2021, 11:06 a.m. UTC | #8

On Fri, Sep 03, 2021 at 01:19:33AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > One thing that did surprise me: the use of "int" here for iterating,
> > rather than size_t. But it seems that strvec is already storing ints,
> > which is an accident!
> 
> Is it really? If you temporarily try to say convert that to "size_t *nr"
> to have the compiler catch all the cases where we use "nr", and then
> s/size_t/int/g those all, you'll find that e.g. setup_revisions() and
> the like expect to take either "int argc" or the strvec equivalent.

It most definitely is an accident, in the sense that the commit which
changed it did not mean to. :)

> We can sensibly convert some of those to size_t, but not all, and the
> int v.s. size_t inconsistency as a result feels weird.
> 
> Since the main point of this API is to be a wrapper for what a C main()
> would take, shouldn't its prototype mirror its prototype? I.e. we should
> stick to "int" here?

That isn't the main point of this API. The reason the name changed away
from argv_array was so that it would be more obviously a generally
useful array of strings.

But even if that were the use, the point isn't that we expect to see 2B
or even 4B strings in any reasonable use case. The point is that we
don't want to accidentally overflow the integer counter, leading to
reads and writes of random bits of memory. And all it takes to do that
is some code like:

  while (strbuf_readline(stdin, &buf))
	strvec_push(&foo, buf.buf);

On a system with 32-bit size_t, you are not likely to actually allocate
2B strings before you'd fail. But on most 64-bit systems, you have
plenty of address space (and these days, often RAM) to do so, but you
still only need 2B strings, because "int" is 32 bits.

So code like setup_revisions() which uses an int is fine. It is reading
from a source with that much smaller "int" limit in the first place.
Whether strvec can handle bigger arrays or not does not matter either
way. And when it later uses ints to operate on such a strvec, it's OK in
practice; even if the size_t gets truncated on the fly to an int, we
know we did not put more than an int's worth of items into the array in
the first place.

But code which has potentially larger inputs (either because it reads
iteratively into the strvec as above, or because it is itself using a
larger data type) is now also OK, because it's using size_t.

What's not OK is code which operates on a potentially-unbounded strvec
using an int. E.g., following the code I showed above with something
like:

  int nr = foo.nr;
  foo.v[nr] = xstrdup("replacement string");

which may write to memory outside of foo.v.

Obviously that's nonsense nobody would ever write directly. It is
possible for some code to do this inadvertently (say, passing a ptr/int
pair through a function interface), but I think it's fairly unlikely in
practice. So while I do think more code ought to be using size_t in
general, I don't think it's necessary to audit this carefully. The
important thing to me is protecting strvec itself.

-Peff

[v3,2/5] strvec: add a strvec_pushvec()

Commit Message

Comments

Patch