Message ID | patch-v3-2.5-321b8ba3f0e-20210826T140414Z-avarab@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | bundle: show progress on "unbundle" | expand |
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Add a strvec_pushvec() function to concatenate two "struct strvec *" > together, and modify code added in 50d89ad6542 (submodule: use > argv_array instead of hand-building arrays, 2012-09-01) to use it. In > a subsequent commit we'll gain another API user. > > This could also have been named strvec_concat()[1], but I opted to > make its name consistent with the strbuf_addbuf() function instead. We > only name these sorts of functions *_concat() in one instance: > parse_options_concat(). FWIW pushvec() is a much better name than concat() for this. concat(A, B) could be an operation that is non-destructive for both A and B that returns a new vector C. I would imagine that it would be how, if not majority of, but at least a nontrivial percentage of, readers expect a function called concat() to behave. A better name for destructive version would probably be append(), if there is no other constraints on naming. The name pushvec(A, B) makes it clear, thanks to similarity with other push*(A, ...) variants, that A is mutated using the other arguments. The name is much more clear with less chance of misunderstanding. > +void strvec_pushvec(struct strvec *array, const struct strvec *items) > +{ > + int i; > + > + for (i = 0; i < items->nr; i++) > + strvec_push(array, items->v[i]); > +} This implementation is not wrong per-se, but is somewhat disappointing. When items->nr is large, especially relative to the original array->alloc, it would incur unnecessary reallocations that we can easily avoid by pre-sizing the array before pushing the elements of items from it. In the original code that became the first user of this helper, it may not have made much difference, but now it is becoming a more generally reusable API function, we should care. Other than that, looks quite straight-forward. Thanks.
Junio C Hamano <gitster@pobox.com> writes: >> +void strvec_pushvec(struct strvec *array, const struct strvec *items) >> +{ >> + int i; >> + >> + for (i = 0; i < items->nr; i++) >> + strvec_push(array, items->v[i]); >> +} > > This implementation is not wrong per-se, but is somewhat > disappointing. When items->nr is large, especially relative to the > original array->alloc, it would incur unnecessary reallocations that > we can easily avoid by pre-sizing the array before pushing the > elements of items from it. > > In the original code that became the first user of this helper, it > may not have made much difference, but now it is becoming a more > generally reusable API function, we should care. And if we do not care, you can rewrite the code that became the first user of this helper to instead call strvec_pushv() on the items->v array that is guaranteed to be NULL terminated, without inventing this new helper. I think I am fine with either way.
On Fri, Aug 27, 2021 at 06:29:30PM -0700, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > >> +void strvec_pushvec(struct strvec *array, const struct strvec *items) > >> +{ > >> + int i; > >> + > >> + for (i = 0; i < items->nr; i++) > >> + strvec_push(array, items->v[i]); > >> +} > > > > This implementation is not wrong per-se, but is somewhat > > disappointing. When items->nr is large, especially relative to the > > original array->alloc, it would incur unnecessary reallocations that > > we can easily avoid by pre-sizing the array before pushing the > > elements of items from it. > > > > In the original code that became the first user of this helper, it > > may not have made much difference, but now it is becoming a more > > generally reusable API function, we should care. > > And if we do not care, you can rewrite the code that became the > first user of this helper to instead call strvec_pushv() on the > items->v array that is guaranteed to be NULL terminated, without > inventing this new helper. I came here to say that. ;) I do not mind using pushv() directly, or a pushvec() that is a convenience wrapper for pushv(). Even better if that wrapper is smart enough to pre-allocate based on items->nr, as you mentioned, but that can also come later. One thing that did surprise me: the use of "int" here for iterating, rather than size_t. But it seems that strvec is already storing ints, which is an accident! I think we'd want the patch below. It can be applied independently (though if we do take the index-iterating version of Ævar's patch, I think it should switch to size_t). -- >8 -- Subject: [PATCH] strvec: use size_t to store nr and alloc We converted argv_array (which later became strvec) to use size_t in 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in order to avoid the possibility of integer overflow. But later, commit d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally converted these back to ints! Those two commits were part of the same patch series. I'm pretty sure what happened is that they were originally written in the opposite order and then cleaned up and re-ordered during an interactive rebase. And when resolving the inevitable conflict, I mistakenly took the "rename" patch completely, accidentally dropping the type change. We can correct it now; better late than never. Signed-off-by: Jeff King <peff@peff.net> --- strvec.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/strvec.h b/strvec.h index fdcad75b45..6b3cbd6758 100644 --- a/strvec.h +++ b/strvec.h @@ -29,8 +29,8 @@ extern const char *empty_strvec[]; */ struct strvec { const char **v; - int nr; - int alloc; + size_t nr; + size_t alloc; }; #define STRVEC_INIT { empty_strvec, 0, 0 }
Jeff King <peff@peff.net> writes: > I think we'd want the patch below. It can be applied independently > (though if we do take the index-iterating version of Ævar's patch, I > think it should switch to size_t). Yeah, I do not see a strong need for _pushvec(), especially the variant that does not preallocate, when we have _pushv(). But the type fix below does make sense. > -- >8 -- > Subject: [PATCH] strvec: use size_t to store nr and alloc > > We converted argv_array (which later became strvec) to use size_t in > 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in > order to avoid the possibility of integer overflow. But later, commit > d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally > converted these back to ints! > > Those two commits were part of the same patch series. I'm pretty sure > what happened is that they were originally written in the opposite order > and then cleaned up and re-ordered during an interactive rebase. And > when resolving the inevitable conflict, I mistakenly took the "rename" > patch completely, accidentally dropping the type change. > > We can correct it now; better late than never. > > Signed-off-by: Jeff King <peff@peff.net> > --- > strvec.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/strvec.h b/strvec.h > index fdcad75b45..6b3cbd6758 100644 > --- a/strvec.h > +++ b/strvec.h > @@ -29,8 +29,8 @@ extern const char *empty_strvec[]; > */ > struct strvec { > const char **v; > - int nr; > - int alloc; > + size_t nr; > + size_t alloc; > }; > > #define STRVEC_INIT { empty_strvec, 0, 0 }
On 8/29/2021 7:54 PM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> I think we'd want the patch below. It can be applied independently >> (though if we do take the index-iterating version of Ævar's patch, I >> think it should switch to size_t). > > Yeah, I do not see a strong need for _pushvec(), especially the > variant that does not preallocate, when we have _pushv(). But the > type fix below does make sense. Thanks for chiming in. I was not aware of _pushv() when I asked about the _pushvec() variant. Sorry, Ævar, for sending you down an unnecessary direction. >> -- >8 -- >> Subject: [PATCH] strvec: use size_t to store nr and alloc >> >> We converted argv_array (which later became strvec) to use size_t in >> 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in >> order to avoid the possibility of integer overflow. But later, commit >> d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally >> converted these back to ints! >> >> Those two commits were part of the same patch series. I'm pretty sure >> what happened is that they were originally written in the opposite order >> and then cleaned up and re-ordered during an interactive rebase. And >> when resolving the inevitable conflict, I mistakenly took the "rename" >> patch completely, accidentally dropping the type change. >> >> We can correct it now; better late than never. ... >> struct strvec { >> const char **v; >> - int nr; >> - int alloc; >> + size_t nr; >> + size_t alloc; >> }; This is also a good change to take. Thanks, -Stolee
On Sat, Aug 28 2021, Jeff King wrote: > On Fri, Aug 27, 2021 at 06:29:30PM -0700, Junio C Hamano wrote: > >> Junio C Hamano <gitster@pobox.com> writes: >> >> >> +void strvec_pushvec(struct strvec *array, const struct strvec *items) >> >> +{ >> >> + int i; >> >> + >> >> + for (i = 0; i < items->nr; i++) >> >> + strvec_push(array, items->v[i]); >> >> +} >> > >> > This implementation is not wrong per-se, but is somewhat >> > disappointing. When items->nr is large, especially relative to the >> > original array->alloc, it would incur unnecessary reallocations that >> > we can easily avoid by pre-sizing the array before pushing the >> > elements of items from it. >> > >> > In the original code that became the first user of this helper, it >> > may not have made much difference, but now it is becoming a more >> > generally reusable API function, we should care. >> >> And if we do not care, you can rewrite the code that became the >> first user of this helper to instead call strvec_pushv() on the >> items->v array that is guaranteed to be NULL terminated, without >> inventing this new helper. > > I came here to say that. ;) > > I do not mind using pushv() directly, or a pushvec() that is a > convenience wrapper for pushv(). Even better if that wrapper is smart > enough to pre-allocate based on items->nr, as you mentioned, but that > can also come later. > > One thing that did surprise me: the use of "int" here for iterating, > rather than size_t. But it seems that strvec is already storing ints, > which is an accident! Is it really? If you temporarily try to say convert that to "size_t *nr" to have the compiler catch all the cases where we use "nr", and then s/size_t/int/g those all, you'll find that e.g. setup_revisions() and the like expect to take either "int argc" or the strvec equivalent. We can sensibly convert some of those to size_t, but not all, and the int v.s. size_t inconsistency as a result feels weird. Since the main point of this API is to be a wrapper for what a C main() would take, shouldn't its prototype mirror its prototype? I.e. we should stick to "int" here? > I think we'd want the patch below. It can be applied independently > (though if we do take the index-iterating version of Ævar's patch, I > think it should switch to size_t). > > -- >8 -- > Subject: [PATCH] strvec: use size_t to store nr and alloc > > We converted argv_array (which later became strvec) to use size_t in > 819f0e76b1 (argv-array: use size_t for count and alloc, 2020-07-28) in > order to avoid the possibility of integer overflow. But later, commit > d70a9eb611 (strvec: rename struct fields, 2020-07-28) accidentally > converted these back to ints! > > Those two commits were part of the same patch series. I'm pretty sure > what happened is that they were originally written in the opposite order > and then cleaned up and re-ordered during an interactive rebase. And > when resolving the inevitable conflict, I mistakenly took the "rename" > patch completely, accidentally dropping the type change. > > We can correct it now; better late than never. > > Signed-off-by: Jeff King <peff@peff.net> > --- > strvec.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/strvec.h b/strvec.h > index fdcad75b45..6b3cbd6758 100644 > --- a/strvec.h > +++ b/strvec.h > @@ -29,8 +29,8 @@ extern const char *empty_strvec[]; > */ > struct strvec { > const char **v; > - int nr; > - int alloc; > + size_t nr; > + size_t alloc; > }; > > #define STRVEC_INIT { empty_strvec, 0, 0 }
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Since the main point of this API is to be a wrapper for what a C main() > would take, shouldn't its prototype mirror its prototype? I.e. we should > stick to "int" here? That's a fair-enough argument. In other words, we do not mind the upper limit of 2 billion strings, even though we do care that we can have more than 4 billion bytes in any of these strings ;-)
On Fri, Sep 03, 2021 at 01:19:33AM +0200, Ævar Arnfjörð Bjarmason wrote: > > One thing that did surprise me: the use of "int" here for iterating, > > rather than size_t. But it seems that strvec is already storing ints, > > which is an accident! > > Is it really? If you temporarily try to say convert that to "size_t *nr" > to have the compiler catch all the cases where we use "nr", and then > s/size_t/int/g those all, you'll find that e.g. setup_revisions() and > the like expect to take either "int argc" or the strvec equivalent. It most definitely is an accident, in the sense that the commit which changed it did not mean to. :) > We can sensibly convert some of those to size_t, but not all, and the > int v.s. size_t inconsistency as a result feels weird. > > Since the main point of this API is to be a wrapper for what a C main() > would take, shouldn't its prototype mirror its prototype? I.e. we should > stick to "int" here? That isn't the main point of this API. The reason the name changed away from argv_array was so that it would be more obviously a generally useful array of strings. But even if that were the use, the point isn't that we expect to see 2B or even 4B strings in any reasonable use case. The point is that we don't want to accidentally overflow the integer counter, leading to reads and writes of random bits of memory. And all it takes to do that is some code like: while (strbuf_readline(stdin, &buf)) strvec_push(&foo, buf.buf); On a system with 32-bit size_t, you are not likely to actually allocate 2B strings before you'd fail. But on most 64-bit systems, you have plenty of address space (and these days, often RAM) to do so, but you still only need 2B strings, because "int" is 32 bits. So code like setup_revisions() which uses an int is fine. It is reading from a source with that much smaller "int" limit in the first place. Whether strvec can handle bigger arrays or not does not matter either way. And when it later uses ints to operate on such a strvec, it's OK in practice; even if the size_t gets truncated on the fly to an int, we know we did not put more than an int's worth of items into the array in the first place. But code which has potentially larger inputs (either because it reads iteratively into the strvec as above, or because it is itself using a larger data type) is now also OK, because it's using size_t. What's not OK is code which operates on a potentially-unbounded strvec using an int. E.g., following the code I showed above with something like: int nr = foo.nr; foo.v[nr] = xstrdup("replacement string"); which may write to memory outside of foo.v. Obviously that's nonsense nobody would ever write directly. It is possible for some code to do this inadvertently (say, passing a ptr/int pair through a function interface), but I think it's fairly unlikely in practice. So while I do think more code ought to be using size_t in general, I don't think it's necessary to audit this carefully. The important thing to me is protecting strvec itself. -Peff
diff --git a/strvec.c b/strvec.c index 61a76ce6cb9..54ed8427c13 100644 --- a/strvec.c +++ b/strvec.c @@ -56,6 +56,14 @@ void strvec_pushv(struct strvec *array, const char **items) strvec_push(array, *items); } +void strvec_pushvec(struct strvec *array, const struct strvec *items) +{ + int i; + + for (i = 0; i < items->nr; i++) + strvec_push(array, items->v[i]); +} + void strvec_pop(struct strvec *array) { if (!array->nr) diff --git a/strvec.h b/strvec.h index fdcad75b45b..9003bde2b7d 100644 --- a/strvec.h +++ b/strvec.h @@ -62,6 +62,13 @@ void strvec_pushl(struct strvec *, ...); /* Push a null-terminated array of strings onto the end of the array. */ void strvec_pushv(struct strvec *, const char **); +/** + * Push the contents of another "struct strvec *" onto the end of the + * array. Like strvec_pushv(), this is a convenience wrapper that + * calls strvec_push() in a loop. + */ +void strvec_pushvec(struct strvec *, const struct strvec *); + /** * Remove the final element from the array. If there are no * elements in the array, do nothing. diff --git a/submodule.c b/submodule.c index 8e611fe1dbf..84b5f5dc0e0 100644 --- a/submodule.c +++ b/submodule.c @@ -1606,7 +1606,6 @@ int fetch_populated_submodules(struct repository *r, int default_option, int quiet, int max_parallel_jobs) { - int i; struct submodule_parallel_fetch spf = SPF_INIT; spf.r = r; @@ -1622,8 +1621,7 @@ int fetch_populated_submodules(struct repository *r, die(_("index file corrupt")); strvec_push(&spf.args, "fetch"); - for (i = 0; i < options->nr; i++) - strvec_push(&spf.args, options->v[i]); + strvec_pushvec(&spf.args, options); strvec_push(&spf.args, "--recurse-submodules-default"); /* default value, "--submodule-prefix" and its value are added later */
Add a strvec_pushvec() function to concatenate two "struct strvec *" together, and modify code added in 50d89ad6542 (submodule: use argv_array instead of hand-building arrays, 2012-09-01) to use it. In a subsequent commit we'll gain another API user. This could also have been named strvec_concat()[1], but I opted to make its name consistent with the strbuf_addbuf() function instead. We only name these sorts of functions *_concat() in one instance: parse_options_concat(). 1. http://lore.kernel.org/git/30620e13-4509-1905-7644-9962b6adf9c5@gmail.com Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- strvec.c | 8 ++++++++ strvec.h | 7 +++++++ submodule.c | 4 +--- 3 files changed, 16 insertions(+), 3 deletions(-)