Message ID | 6658b231a906dde6acbe7ce156da693ef7dc40e6.1681845518.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | banned: mark `strok()` as banned | expand |
Taylor Blau <me@ttaylorr.com> writes: > Introduce a variant of the `string_list_split_in_place()` function that > takes a string of accepted delimiters. > > By contrast to its cousin `string_list_split_in_place()` which splits > the given string at every instance of the single character `delim`, the > `_multi` variant splits the given string any any character appearing in > the string `delim`. > > Like `strtok()`, the `_multi` variant skips past sequential delimiting > characters. For example: > > string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1); > > would place in `xs` the elements "foo", "bar", and "baz". strtok() also skips leading and trailing delimiters, i.e. the above will give you identical result for ":foo:bar:baz:". It would be useful to test that here in addition to the existing ones. > +for test_fn in test_split test_split_in_place_multi > +do > + $test_fn "foo:bar:baz" ":" "-1" <<-\EOF > + 3 > + [0]: "foo" > + [1]: "bar" > + [2]: "baz" > + EOF > > + $test_fn "foo:bar:baz" ":" "0" <<-\EOF > + 1 > + [0]: "foo:bar:baz" > + EOF > > + $test_fn "foo:bar:baz" ":" "1" <<-\EOF > + 2 > + [0]: "foo" > + [1]: "bar:baz" > + EOF > > + $test_fn "foo:bar:baz" ":" "2" <<-\EOF > + 3 > + [0]: "foo" > + [1]: "bar" > + [2]: "baz" > + EOF
On Tue, Apr 18, 2023 at 12:39:48PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > Introduce a variant of the `string_list_split_in_place()` function that > > takes a string of accepted delimiters. > > > > By contrast to its cousin `string_list_split_in_place()` which splits > > the given string at every instance of the single character `delim`, the > > `_multi` variant splits the given string any any character appearing in > > the string `delim`. > > > > Like `strtok()`, the `_multi` variant skips past sequential delimiting > > characters. For example: > > > > string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1); > > > > would place in `xs` the elements "foo", "bar", and "baz". > > strtok() also skips leading and trailing delimiters, i.e. the above > will give you identical result for ":foo:bar:baz:". I'm not sure the results are identical. Adding this test case for testing the behavior of string_list_split_in_place() passes before and after this series: --- 8< --- diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh index 9c5094616a..dfe970a566 100755 --- a/t/t0063-string-list.sh +++ b/t/t0063-string-list.sh @@ -72,6 +72,15 @@ test_split ":" ":" "-1" <<EOF [1]: "" EOF +test_split ":foo:bar:baz:" ":" "-1" <<-\EOF +5 +[0]: "" +[1]: "foo" +[2]: "bar" +[3]: "baz" +[4]: "" +EOF + test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF 3 [0]: "foo" --- >8 --- > It would be useful to test that here in addition to the existing ones. Sure. FWIW, the behavior for string_list_split_in_place_multi() is slightly different, since it will eat up all of the leading delimiter tokens and treat the first token as "foo". Here's a diff that could be squashed into this patch which captures both cases: --- 8< --- diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh index 9c5094616a..efc84dc124 100755 --- a/t/t0063-string-list.sh +++ b/t/t0063-string-list.sh @@ -72,6 +72,15 @@ test_split ":" ":" "-1" <<EOF [1]: "" EOF +test_split ":foo:bar:baz:" ":" "-1" <<-\EOF +5 +[0]: "" +[1]: "foo" +[2]: "bar" +[3]: "baz" +[4]: "" +EOF + test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF 3 [0]: "foo" @@ -104,6 +113,14 @@ test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF [2]: "" EOF +test_split_in_place_multi ":;:foo:;:bar:;:baz:;:" ":;" "-1" <<-\EOF +4 +[0]: "foo" +[1]: "bar" +[2]: "baz" +[3]: "" +EOF + test_expect_success "test filter_string_list" ' test "x-" = "x$(test-tool string-list filter - y)" && test "x-" = "x$(test-tool string-list filter no y)" && --- >8 --- Thanks, Taylor
On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote: > Introduce a variant of the `string_list_split_in_place()` function that > takes a string of accepted delimiters. > > By contrast to its cousin `string_list_split_in_place()` which splits > the given string at every instance of the single character `delim`, the > `_multi` variant splits the given string any any character appearing in > the string `delim`. > > Like `strtok()`, the `_multi` variant skips past sequential delimiting > characters. For example: > > string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1); > > would place in `xs` the elements "foo", "bar", and "baz". I have mixed feelings on this. There are multiple orthogonal options here: single/multiple delimiter, and how to deal with sequential delimiters (you call it "runs" here, though I think of it inverted as "allow empty fields"). Plus existing ones like maxsplit. Your underlying function, string_list_split_in_place_1(), handles these independently. But it seems like a subtle gotcha that string_list_split_in_place() and its _multi() variant, which purport to differ only in one dimension (representation of delimiter list), also subtly differ in another dimension. Especially because it's a facet that might not come up in simple tests, I can see somebody incorrectly applying one or the other. Obviously one solution is to add the "runs" option to all variants. But I'd be hesitant to burden existing callers. So I'd propose one of: 1. Make your _1() function public, with a name like _with_options() or something (though the function name is sadly already quite long). Leave string_list_split_in_place() as a wrapper that behaves as now, and have the few new callers use the with_options() variant. 2. Don't bother implementing the "runs" version. The only users would be test programs, and nobody cares much either way for their cases. Document in the commit message (and possibly above the function) that this isn't a strict replacement for strtok(). That would hopefully be enough for a potential caller to think about the behavior, and we can add "runs" if and when somebody actually wants it. -Peff
Am 22.04.23 um 13:12 schrieb Jeff King: > On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote: > >> Introduce a variant of the `string_list_split_in_place()` function that >> takes a string of accepted delimiters. >> >> By contrast to its cousin `string_list_split_in_place()` which splits >> the given string at every instance of the single character `delim`, the >> `_multi` variant splits the given string any any character appearing in >> the string `delim`. >> >> Like `strtok()`, the `_multi` variant skips past sequential delimiting >> characters. For example: >> >> string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1); >> >> would place in `xs` the elements "foo", "bar", and "baz". Which one is it here really, string_list_split_in_place() or _multi()? > I have mixed feelings on this. > > There are multiple orthogonal options here: single/multiple delimiter, > and how to deal with sequential delimiters (you call it "runs" here, > though I think of it inverted as "allow empty fields"). Plus existing > ones like maxsplit. > > Your underlying function, string_list_split_in_place_1(), handles these > independently. But it seems like a subtle gotcha that > string_list_split_in_place() and its _multi() variant, which purport to > differ only in one dimension (representation of delimiter list), also > subtly differ in another dimension. Especially because it's a facet that > might not come up in simple tests, I can see somebody incorrectly > applying one or the other. > > Obviously one solution is to add the "runs" option to all variants. But > I'd be hesitant to burden existing callers. So I'd propose one of: > > 1. Make your _1() function public, with a name like _with_options() or > something (though the function name is sadly already quite long). > Leave string_list_split_in_place() as a wrapper that behaves as > now, and have the few new callers use the with_options() variant. > > 2. Don't bother implementing the "runs" version. The only users would > be test programs, and nobody cares much either way for their cases. > Document in the commit message (and possibly above the function) > that this isn't a strict replacement for strtok(). That would > hopefully be enough for a potential caller to think about the > behavior, and we can add "runs" if and when somebody actually wants > it. You can call string_list_remove_empty_items() to get rid of empty strings. And if the single-char version calls a multi-char version under the hood then its only advantage is backward compatibility -- but converting all callers isn't that hard. So the only string_list split function version you really need accepts multiple delimiters and preserves empty items and can be called string_list_split_in_place(). René
On Sat, Apr 22, 2023 at 05:53:05PM +0200, René Scharfe wrote: > > Obviously one solution is to add the "runs" option to all variants. But > > I'd be hesitant to burden existing callers. So I'd propose one of: > > > > 1. Make your _1() function public, with a name like _with_options() or > > something (though the function name is sadly already quite long). > > Leave string_list_split_in_place() as a wrapper that behaves as > > now, and have the few new callers use the with_options() variant. > > > > 2. Don't bother implementing the "runs" version. The only users would > > be test programs, and nobody cares much either way for their cases. > > Document in the commit message (and possibly above the function) > > that this isn't a strict replacement for strtok(). That would > > hopefully be enough for a potential caller to think about the > > behavior, and we can add "runs" if and when somebody actually wants > > it. > > You can call string_list_remove_empty_items() to get rid of empty > strings. And if the single-char version calls a multi-char version > under the hood then its only advantage is backward compatibility -- but > converting all callers isn't that hard. So the only string_list split > function version you really need accepts multiple delimiters and > preserves empty items and can be called string_list_split_in_place(). Ooh, I like that. I didn't think of processing after the fact. And indeed, we have several spots already that do the split/remove_empty pair. And yes, it looks like there are only 7 callers which would need a trivial update if we just switched to the multi-char version. That's very compelling. -Peff
On Sat, Apr 22, 2023 at 07:12:13AM -0400, Jeff King wrote: > On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote: > > > Introduce a variant of the `string_list_split_in_place()` function that > > takes a string of accepted delimiters. > > > > By contrast to its cousin `string_list_split_in_place()` which splits > > the given string at every instance of the single character `delim`, the > > `_multi` variant splits the given string any any character appearing in > > the string `delim`. > > > > Like `strtok()`, the `_multi` variant skips past sequential delimiting > > characters. For example: > > > > string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1); > > > > would place in `xs` the elements "foo", "bar", and "baz". > > I have mixed feelings on this. Hmm. I implemented it this way partially after reading your suggestion in [1], but also to allow using `string_list_split_in_place()` as a strict replacement for strtok(). And I agree that for the existing users of strtok(), which are all in the test helpers, this probably doesn't matter much either way. Though I feel like since we are banning strtok() over the whole tree, that we should provide a suitable replacement at the time we ban strtok(), not at the time somebody needs to rely on its non-empty fields behavior. [1]: https://lore.kernel.org/git/20230418102320.GB508219@coredump.intra.peff.net/ > Obviously one solution is to add the "runs" option to all variants. But > I'd be hesitant to burden existing callers. So I'd propose one of: > > 1. Make your _1() function public, with a name like _with_options() or > something (though the function name is sadly already quite long). > Leave string_list_split_in_place() as a wrapper that behaves as > now, and have the few new callers use the with_options() variant. I think that in general I'd prefer (2) to avoid polluting the list of declarations in string-list.h, but in this case I think that in this case it is the right thing to do. Thanks, Taylor
On Sat, Apr 22, 2023 at 10:38:13PM -0400, Taylor Blau wrote: > > Obviously one solution is to add the "runs" option to all variants. But > > I'd be hesitant to burden existing callers. So I'd propose one of: > > > > 1. Make your _1() function public, with a name like _with_options() or > > something (though the function name is sadly already quite long). > > Leave string_list_split_in_place() as a wrapper that behaves as > > now, and have the few new callers use the with_options() variant. > > I think that in general I'd prefer (2) to avoid polluting the list of > declarations in string-list.h, but in this case I think that in this > case it is the right thing to do. Oh, nevermind. I just read René's solution below and it is much cleaner. I take it back ;-). Thanks, Taylor
Jeff King <peff@peff.net> writes: > And yes, it looks like there are only 7 callers which would need a > trivial update if we just switched to the multi-char version. > > That's very compelling. Sounds good. Thanks.
diff --git a/string-list.c b/string-list.c index db473f273e..b27a53f2e1 100644 --- a/string-list.c +++ b/string-list.c @@ -300,8 +300,9 @@ int string_list_split(struct string_list *list, const char *string, } } -int string_list_split_in_place(struct string_list *list, char *string, - int delim, int maxsplit) +static int string_list_split_in_place_1(struct string_list *list, char *string, + const char *delim, int maxsplit, + unsigned runs) { int count = 0; char *p = string, *end; @@ -310,13 +311,16 @@ int string_list_split_in_place(struct string_list *list, char *string, die("internal error in string_list_split_in_place(): " "list->strdup_strings must not be set"); for (;;) { + if (runs) + p += strspn(p, delim); + count++; if (maxsplit >= 0 && count > maxsplit) { string_list_append(list, p); return count; } - end = strchr(p, delim); - if (end) { + end = p + strcspn(p, delim); + if (end && *end) { *end = '\0'; string_list_append(list, p); p = end + 1; @@ -326,3 +330,17 @@ int string_list_split_in_place(struct string_list *list, char *string, } } } + +int string_list_split_in_place_multi(struct string_list *list, char *string, + const char *delim, int maxsplit) +{ + return string_list_split_in_place_1(list, string, delim, maxsplit, 1); +} + +int string_list_split_in_place(struct string_list *list, char *string, + int delim, int maxsplit) +{ + char delim_s[2] = { delim, 0 }; + + return string_list_split_in_place_1(list, string, delim_s, maxsplit, 0); +} diff --git a/string-list.h b/string-list.h index c7b0d5d000..f01bbb0bb6 100644 --- a/string-list.h +++ b/string-list.h @@ -268,7 +268,14 @@ int string_list_split(struct string_list *list, const char *string, * new string_list_items point into string (which therefore must not * be modified or freed while the string_list is in use). * list->strdup_strings must *not* be set. + * + * The "_multi" variant splits the given string on any character + * appearing in "delim", and the non-"_multi" variant splits only on the + * given character. The "_multi" variant behaves like `strtok()` where + * no element contains the delimiting byte(s). */ +int string_list_split_in_place_multi(struct string_list *list, char *string, + const char *delim, int maxsplit); int string_list_split_in_place(struct string_list *list, char *string, int delim, int maxsplit); #endif /* STRING_LIST_H */ diff --git a/t/helper/test-string-list.c b/t/helper/test-string-list.c index 2123dda85b..119bc9e1c9 100644 --- a/t/helper/test-string-list.c +++ b/t/helper/test-string-list.c @@ -73,6 +73,21 @@ int cmd__string_list(int argc, const char **argv) return 0; } + if (argc == 5 && !strcmp(argv[1], "split_in_place_multi")) { + struct string_list list = STRING_LIST_INIT_NODUP; + int i; + char *s = xstrdup(argv[2]); + const char *delim = argv[3]; + int maxsplit = atoi(argv[4]); + + i = string_list_split_in_place_multi(&list, s, delim, maxsplit); + printf("%d\n", i); + write_list(&list); + string_list_clear(&list, 0); + free(s); + return 0; + } + if (argc == 4 && !strcmp(argv[1], "filter")) { /* * Retain only the items that have the specified prefix. diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh index 46d4839194..9c5094616a 100755 --- a/t/t0063-string-list.sh +++ b/t/t0063-string-list.sh @@ -18,42 +18,53 @@ test_split () { " } -test_split "foo:bar:baz" ":" "-1" <<EOF -3 -[0]: "foo" -[1]: "bar" -[2]: "baz" -EOF +test_split_in_place_multi () { + cat >expected && + test_expect_success "split_in_place_multi $1 at $2, max $3" " + test-tool string-list split_in_place_multi '$1' '$2' '$3' >actual && + test_cmp expected actual + " +} -test_split "foo:bar:baz" ":" "0" <<EOF -1 -[0]: "foo:bar:baz" -EOF +for test_fn in test_split test_split_in_place_multi +do + $test_fn "foo:bar:baz" ":" "-1" <<-\EOF + 3 + [0]: "foo" + [1]: "bar" + [2]: "baz" + EOF -test_split "foo:bar:baz" ":" "1" <<EOF -2 -[0]: "foo" -[1]: "bar:baz" -EOF + $test_fn "foo:bar:baz" ":" "0" <<-\EOF + 1 + [0]: "foo:bar:baz" + EOF -test_split "foo:bar:baz" ":" "2" <<EOF -3 -[0]: "foo" -[1]: "bar" -[2]: "baz" -EOF + $test_fn "foo:bar:baz" ":" "1" <<-\EOF + 2 + [0]: "foo" + [1]: "bar:baz" + EOF -test_split "foo:bar:" ":" "-1" <<EOF -3 -[0]: "foo" -[1]: "bar" -[2]: "" -EOF + $test_fn "foo:bar:baz" ":" "2" <<-\EOF + 3 + [0]: "foo" + [1]: "bar" + [2]: "baz" + EOF -test_split "" ":" "-1" <<EOF -1 -[0]: "" -EOF + $test_fn "foo:bar:" ":" "-1" <<-\EOF + 3 + [0]: "foo" + [1]: "bar" + [2]: "" + EOF + + $test_fn "" ":" "-1" <<-\EOF + 1 + [0]: "" + EOF +done test_split ":" ":" "-1" <<EOF 2 @@ -61,6 +72,38 @@ test_split ":" ":" "-1" <<EOF [1]: "" EOF +test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF +3 +[0]: "foo" +[1]: "bar" +[2]: "baz" +EOF + +test_split_in_place_multi "foo:;:bar:;:baz" ":;" "0" <<-\EOF +1 +[0]: "foo:;:bar:;:baz" +EOF + +test_split_in_place_multi "foo:;:bar:;:baz" ":;" "1" <<-\EOF +2 +[0]: "foo" +[1]: "bar:;:baz" +EOF + +test_split_in_place_multi "foo:;:bar:;:baz" ":;" "2" <<-\EOF +3 +[0]: "foo" +[1]: "bar" +[2]: "baz" +EOF + +test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF +3 +[0]: "foo" +[1]: "bar" +[2]: "" +EOF + test_expect_success "test filter_string_list" ' test "x-" = "x$(test-tool string-list filter - y)" && test "x-" = "x$(test-tool string-list filter no y)" &&
Introduce a variant of the `string_list_split_in_place()` function that takes a string of accepted delimiters. By contrast to its cousin `string_list_split_in_place()` which splits the given string at every instance of the single character `delim`, the `_multi` variant splits the given string any any character appearing in the string `delim`. Like `strtok()`, the `_multi` variant skips past sequential delimiting characters. For example: string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1); would place in `xs` the elements "foo", "bar", and "baz". Instead of using `strchr(2)` to locate the first occurrence of the given delimiter character, `string_list_split_in_place_multi()` uses `strcspn(2)` to move past the initial segment of characters comprised of any characters in the delimiting set. When only a single delimiting character is provided, `strcspn(2)` has equivalent performance to `strchr(2)`. Modern `strcspn(2)` implementations treat an empty delimiter or the singleton delimiter as a special case and fall back to calling strchrnul(). Both glibc[1] and musl[2] implement `strcspn(2)` this way. Since the `_multi` variant is a generalization of the original implementation, reimplement `string_list_split_in_place()` in terms of the more general function by providing a single-character string for the list of accepted delimiters. To avoid regressions, update t0063 in this patch as well. Any "common" test cases (i.e., those that produce the same result whether you call `string_list_split()` or `string_list_split_in_place_multi()`) are grouped into a loop which is parameterized over the function to test. Any cases which aren't common (of which there is one existing case, and a handful of new ones added which are specific to the `_multi` variant) are tested independently. [1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35 [2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11 Signed-off-by: Taylor Blau <me@ttaylorr.com> --- string-list.c | 26 +++++++-- string-list.h | 7 +++ t/helper/test-string-list.c | 15 ++++++ t/t0063-string-list.sh | 105 +++++++++++++++++++++++++----------- 4 files changed, 118 insertions(+), 35 deletions(-)