diff mbox series

[v2,1/6] string-list: introduce `string_list_split_in_place_multi()`

Message ID 6658b231a906dde6acbe7ce156da693ef7dc40e6.1681845518.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series banned: mark `strok()` as banned | expand

Commit Message

Taylor Blau April 18, 2023, 7:18 p.m. UTC
Introduce a variant of the `string_list_split_in_place()` function that
takes a string of accepted delimiters.

By contrast to its cousin `string_list_split_in_place()` which splits
the given string at every instance of the single character `delim`, the
`_multi` variant splits the given string any any character appearing in
the string `delim`.

Like `strtok()`, the `_multi` variant skips past sequential delimiting
characters. For example:

    string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);

would place in `xs` the elements "foo", "bar", and "baz".

Instead of using `strchr(2)` to locate the first occurrence of the given
delimiter character, `string_list_split_in_place_multi()` uses
`strcspn(2)` to move past the initial segment of characters comprised of
any characters in the delimiting set.

When only a single delimiting character is provided, `strcspn(2)` has
equivalent performance to `strchr(2)`. Modern `strcspn(2)`
implementations treat an empty delimiter or the singleton delimiter as a
special case and fall back to calling strchrnul(). Both glibc[1] and
musl[2] implement `strcspn(2)` this way.

Since the `_multi` variant is a generalization of the original
implementation, reimplement `string_list_split_in_place()` in terms of
the more general function by providing a single-character string for the
list of accepted delimiters.

To avoid regressions, update t0063 in this patch as well. Any "common"
test cases (i.e., those that produce the same result whether you call
`string_list_split()` or `string_list_split_in_place_multi()`) are
grouped into a loop which is parameterized over the function to test.

Any cases which aren't common (of which there is one existing case, and
a handful of new ones added which are specific to the `_multi` variant)
are tested independently.

[1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
[2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 string-list.c               |  26 +++++++--
 string-list.h               |   7 +++
 t/helper/test-string-list.c |  15 ++++++
 t/t0063-string-list.sh      | 105 +++++++++++++++++++++++++-----------
 4 files changed, 118 insertions(+), 35 deletions(-)

Comments

Junio C Hamano April 18, 2023, 7:39 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> Introduce a variant of the `string_list_split_in_place()` function that
> takes a string of accepted delimiters.
>
> By contrast to its cousin `string_list_split_in_place()` which splits
> the given string at every instance of the single character `delim`, the
> `_multi` variant splits the given string any any character appearing in
> the string `delim`.
>
> Like `strtok()`, the `_multi` variant skips past sequential delimiting
> characters. For example:
>
>     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
>
> would place in `xs` the elements "foo", "bar", and "baz".

strtok() also skips leading and trailing delimiters, i.e. the above
will give you identical result for ":foo:bar:baz:".

It would be useful to test that here in addition to the existing ones.

> +for test_fn in test_split test_split_in_place_multi
> +do
> +	$test_fn "foo:bar:baz" ":" "-1" <<-\EOF
> +	3
> +	[0]: "foo"
> +	[1]: "bar"
> +	[2]: "baz"
> +	EOF
>  
> +	$test_fn "foo:bar:baz" ":" "0" <<-\EOF
> +	1
> +	[0]: "foo:bar:baz"
> +	EOF
>  
> +	$test_fn "foo:bar:baz" ":" "1" <<-\EOF
> +	2
> +	[0]: "foo"
> +	[1]: "bar:baz"
> +	EOF
>  
> +	$test_fn "foo:bar:baz" ":" "2" <<-\EOF
> +	3
> +	[0]: "foo"
> +	[1]: "bar"
> +	[2]: "baz"
> +	EOF
Taylor Blau April 18, 2023, 8:54 p.m. UTC | #2
On Tue, Apr 18, 2023 at 12:39:48PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Introduce a variant of the `string_list_split_in_place()` function that
> > takes a string of accepted delimiters.
> >
> > By contrast to its cousin `string_list_split_in_place()` which splits
> > the given string at every instance of the single character `delim`, the
> > `_multi` variant splits the given string any any character appearing in
> > the string `delim`.
> >
> > Like `strtok()`, the `_multi` variant skips past sequential delimiting
> > characters. For example:
> >
> >     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
> >
> > would place in `xs` the elements "foo", "bar", and "baz".
>
> strtok() also skips leading and trailing delimiters, i.e. the above
> will give you identical result for ":foo:bar:baz:".

I'm not sure the results are identical. Adding this test case for
testing the behavior of string_list_split_in_place() passes before and
after this series:

--- 8< ---
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 9c5094616a..dfe970a566 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -72,6 +72,15 @@ test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF

+test_split ":foo:bar:baz:" ":" "-1" <<-\EOF
+5
+[0]: ""
+[1]: "foo"
+[2]: "bar"
+[3]: "baz"
+[4]: ""
+EOF
+
 test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
 3
 [0]: "foo"
--- >8 ---

> It would be useful to test that here in addition to the existing ones.

Sure. FWIW, the behavior for string_list_split_in_place_multi() is
slightly different, since it will eat up all of the leading delimiter
tokens and treat the first token as "foo".

Here's a diff that could be squashed into this patch which captures both
cases:

--- 8< ---
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 9c5094616a..efc84dc124 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -72,6 +72,15 @@ test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF

+test_split ":foo:bar:baz:" ":" "-1" <<-\EOF
+5
+[0]: ""
+[1]: "foo"
+[2]: "bar"
+[3]: "baz"
+[4]: ""
+EOF
+
 test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
 3
 [0]: "foo"
@@ -104,6 +113,14 @@ test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF
 [2]: ""
 EOF

+test_split_in_place_multi ":;:foo:;:bar:;:baz:;:" ":;" "-1" <<-\EOF
+4
+[0]: "foo"
+[1]: "bar"
+[2]: "baz"
+[3]: ""
+EOF
+
 test_expect_success "test filter_string_list" '
 	test "x-" = "x$(test-tool string-list filter - y)" &&
 	test "x-" = "x$(test-tool string-list filter no y)" &&
--- >8 ---


Thanks,
Taylor
Jeff King April 22, 2023, 11:12 a.m. UTC | #3
On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote:

> Introduce a variant of the `string_list_split_in_place()` function that
> takes a string of accepted delimiters.
> 
> By contrast to its cousin `string_list_split_in_place()` which splits
> the given string at every instance of the single character `delim`, the
> `_multi` variant splits the given string any any character appearing in
> the string `delim`.
> 
> Like `strtok()`, the `_multi` variant skips past sequential delimiting
> characters. For example:
> 
>     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
> 
> would place in `xs` the elements "foo", "bar", and "baz".

I have mixed feelings on this.

There are multiple orthogonal options here: single/multiple delimiter,
and how to deal with sequential delimiters (you call it "runs" here,
though I think of it inverted as "allow empty fields"). Plus existing
ones like maxsplit.

Your underlying function, string_list_split_in_place_1(), handles these
independently. But it seems like a subtle gotcha that
string_list_split_in_place() and its _multi() variant, which purport to
differ only in one dimension (representation of delimiter list), also
subtly differ in another dimension. Especially because it's a facet that
might not come up in simple tests, I can see somebody incorrectly
applying one or the other.

Obviously one solution is to add the "runs" option to all variants. But
I'd be hesitant to burden existing callers. So I'd propose one of:

  1. Make your _1() function public, with a name like _with_options() or
     something (though the function name is sadly already quite long).
     Leave string_list_split_in_place() as a wrapper that behaves as
     now, and have the few new callers use the with_options() variant.

  2. Don't bother implementing the "runs" version. The only users would
     be test programs, and nobody cares much either way for their cases.
     Document in the commit message (and possibly above the function)
     that this isn't a strict replacement for strtok(). That would
     hopefully be enough for a potential caller to think about the
     behavior, and we can add "runs" if and when somebody actually wants
     it.

-Peff
René Scharfe April 22, 2023, 3:53 p.m. UTC | #4
Am 22.04.23 um 13:12 schrieb Jeff King:
> On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote:
>
>> Introduce a variant of the `string_list_split_in_place()` function that
>> takes a string of accepted delimiters.
>>
>> By contrast to its cousin `string_list_split_in_place()` which splits
>> the given string at every instance of the single character `delim`, the
>> `_multi` variant splits the given string any any character appearing in
>> the string `delim`.
>>
>> Like `strtok()`, the `_multi` variant skips past sequential delimiting
>> characters. For example:
>>
>>     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
>>
>> would place in `xs` the elements "foo", "bar", and "baz".

Which one is it here really, string_list_split_in_place() or _multi()?

> I have mixed feelings on this.
>
> There are multiple orthogonal options here: single/multiple delimiter,
> and how to deal with sequential delimiters (you call it "runs" here,
> though I think of it inverted as "allow empty fields"). Plus existing
> ones like maxsplit.
>
> Your underlying function, string_list_split_in_place_1(), handles these
> independently. But it seems like a subtle gotcha that
> string_list_split_in_place() and its _multi() variant, which purport to
> differ only in one dimension (representation of delimiter list), also
> subtly differ in another dimension. Especially because it's a facet that
> might not come up in simple tests, I can see somebody incorrectly
> applying one or the other.
>
> Obviously one solution is to add the "runs" option to all variants. But
> I'd be hesitant to burden existing callers. So I'd propose one of:
>
>   1. Make your _1() function public, with a name like _with_options() or
>      something (though the function name is sadly already quite long).
>      Leave string_list_split_in_place() as a wrapper that behaves as
>      now, and have the few new callers use the with_options() variant.
>
>   2. Don't bother implementing the "runs" version. The only users would
>      be test programs, and nobody cares much either way for their cases.
>      Document in the commit message (and possibly above the function)
>      that this isn't a strict replacement for strtok(). That would
>      hopefully be enough for a potential caller to think about the
>      behavior, and we can add "runs" if and when somebody actually wants
>      it.

You can call string_list_remove_empty_items() to get rid of empty
strings.  And if the single-char version calls a multi-char version
under the hood then its only advantage is backward compatibility -- but
converting all callers isn't that hard.  So the only string_list split
function version you really need accepts multiple delimiters and
preserves empty items and can be called string_list_split_in_place().

René
Jeff King April 23, 2023, 12:35 a.m. UTC | #5
On Sat, Apr 22, 2023 at 05:53:05PM +0200, René Scharfe wrote:

> > Obviously one solution is to add the "runs" option to all variants. But
> > I'd be hesitant to burden existing callers. So I'd propose one of:
> >
> >   1. Make your _1() function public, with a name like _with_options() or
> >      something (though the function name is sadly already quite long).
> >      Leave string_list_split_in_place() as a wrapper that behaves as
> >      now, and have the few new callers use the with_options() variant.
> >
> >   2. Don't bother implementing the "runs" version. The only users would
> >      be test programs, and nobody cares much either way for their cases.
> >      Document in the commit message (and possibly above the function)
> >      that this isn't a strict replacement for strtok(). That would
> >      hopefully be enough for a potential caller to think about the
> >      behavior, and we can add "runs" if and when somebody actually wants
> >      it.
> 
> You can call string_list_remove_empty_items() to get rid of empty
> strings.  And if the single-char version calls a multi-char version
> under the hood then its only advantage is backward compatibility -- but
> converting all callers isn't that hard.  So the only string_list split
> function version you really need accepts multiple delimiters and
> preserves empty items and can be called string_list_split_in_place().

Ooh, I like that. I didn't think of processing after the fact. And
indeed, we have several spots already that do the split/remove_empty
pair.

And yes, it looks like there are only 7 callers which would need a
trivial update if we just switched to the multi-char version.

That's very compelling.

-Peff
Taylor Blau April 23, 2023, 2:38 a.m. UTC | #6
On Sat, Apr 22, 2023 at 07:12:13AM -0400, Jeff King wrote:
> On Tue, Apr 18, 2023 at 03:18:43PM -0400, Taylor Blau wrote:
>
> > Introduce a variant of the `string_list_split_in_place()` function that
> > takes a string of accepted delimiters.
> >
> > By contrast to its cousin `string_list_split_in_place()` which splits
> > the given string at every instance of the single character `delim`, the
> > `_multi` variant splits the given string any any character appearing in
> > the string `delim`.
> >
> > Like `strtok()`, the `_multi` variant skips past sequential delimiting
> > characters. For example:
> >
> >     string_list_split_in_place(&xs, xstrdup("foo::bar::baz"), ":", -1);
> >
> > would place in `xs` the elements "foo", "bar", and "baz".
>
> I have mixed feelings on this.

Hmm. I implemented it this way partially after reading your suggestion
in [1], but also to allow using `string_list_split_in_place()` as a
strict replacement for strtok().

And I agree that for the existing users of strtok(), which are all in the
test helpers, this probably doesn't matter much either way. Though I
feel like since we are banning strtok() over the whole tree, that we
should provide a suitable replacement at the time we ban strtok(), not
at the time somebody needs to rely on its non-empty fields behavior.

[1]: https://lore.kernel.org/git/20230418102320.GB508219@coredump.intra.peff.net/

> Obviously one solution is to add the "runs" option to all variants. But
> I'd be hesitant to burden existing callers. So I'd propose one of:
>
>   1. Make your _1() function public, with a name like _with_options() or
>      something (though the function name is sadly already quite long).
>      Leave string_list_split_in_place() as a wrapper that behaves as
>      now, and have the few new callers use the with_options() variant.

I think that in general I'd prefer (2) to avoid polluting the list of
declarations in string-list.h, but in this case I think that in this
case it is the right thing to do.

Thanks,
Taylor
Taylor Blau April 23, 2023, 2:40 a.m. UTC | #7
On Sat, Apr 22, 2023 at 10:38:13PM -0400, Taylor Blau wrote:
> > Obviously one solution is to add the "runs" option to all variants. But
> > I'd be hesitant to burden existing callers. So I'd propose one of:
> >
> >   1. Make your _1() function public, with a name like _with_options() or
> >      something (though the function name is sadly already quite long).
> >      Leave string_list_split_in_place() as a wrapper that behaves as
> >      now, and have the few new callers use the with_options() variant.
>
> I think that in general I'd prefer (2) to avoid polluting the list of
> declarations in string-list.h, but in this case I think that in this
> case it is the right thing to do.

Oh, nevermind. I just read René's solution below and it is much cleaner.
I take it back ;-).

Thanks,
Taylor
Junio C Hamano April 24, 2023, 4:24 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> And yes, it looks like there are only 7 callers which would need a
> trivial update if we just switched to the multi-char version.
>
> That's very compelling.

Sounds good.  Thanks.
diff mbox series

Patch

diff --git a/string-list.c b/string-list.c
index db473f273e..b27a53f2e1 100644
--- a/string-list.c
+++ b/string-list.c
@@ -300,8 +300,9 @@  int string_list_split(struct string_list *list, const char *string,
 	}
 }
 
-int string_list_split_in_place(struct string_list *list, char *string,
-			       int delim, int maxsplit)
+static int string_list_split_in_place_1(struct string_list *list, char *string,
+					const char *delim, int maxsplit,
+					unsigned runs)
 {
 	int count = 0;
 	char *p = string, *end;
@@ -310,13 +311,16 @@  int string_list_split_in_place(struct string_list *list, char *string,
 		die("internal error in string_list_split_in_place(): "
 		    "list->strdup_strings must not be set");
 	for (;;) {
+		if (runs)
+			p += strspn(p, delim);
+
 		count++;
 		if (maxsplit >= 0 && count > maxsplit) {
 			string_list_append(list, p);
 			return count;
 		}
-		end = strchr(p, delim);
-		if (end) {
+		end = p + strcspn(p, delim);
+		if (end && *end) {
 			*end = '\0';
 			string_list_append(list, p);
 			p = end + 1;
@@ -326,3 +330,17 @@  int string_list_split_in_place(struct string_list *list, char *string,
 		}
 	}
 }
+
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit)
+{
+	return string_list_split_in_place_1(list, string, delim, maxsplit, 1);
+}
+
+int string_list_split_in_place(struct string_list *list, char *string,
+			       int delim, int maxsplit)
+{
+	char delim_s[2] = { delim, 0 };
+
+	return string_list_split_in_place_1(list, string, delim_s, maxsplit, 0);
+}
diff --git a/string-list.h b/string-list.h
index c7b0d5d000..f01bbb0bb6 100644
--- a/string-list.h
+++ b/string-list.h
@@ -268,7 +268,14 @@  int string_list_split(struct string_list *list, const char *string,
  * new string_list_items point into string (which therefore must not
  * be modified or freed while the string_list is in use).
  * list->strdup_strings must *not* be set.
+ *
+ * The "_multi" variant splits the given string on any character
+ * appearing in "delim", and the non-"_multi" variant splits only on the
+ * given character. The "_multi" variant behaves like `strtok()` where
+ * no element contains the delimiting byte(s).
  */
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit);
 int string_list_split_in_place(struct string_list *list, char *string,
 			       int delim, int maxsplit);
 #endif /* STRING_LIST_H */
diff --git a/t/helper/test-string-list.c b/t/helper/test-string-list.c
index 2123dda85b..119bc9e1c9 100644
--- a/t/helper/test-string-list.c
+++ b/t/helper/test-string-list.c
@@ -73,6 +73,21 @@  int cmd__string_list(int argc, const char **argv)
 		return 0;
 	}
 
+	if (argc == 5 && !strcmp(argv[1], "split_in_place_multi")) {
+		struct string_list list = STRING_LIST_INIT_NODUP;
+		int i;
+		char *s = xstrdup(argv[2]);
+		const char *delim = argv[3];
+		int maxsplit = atoi(argv[4]);
+
+		i = string_list_split_in_place_multi(&list, s, delim, maxsplit);
+		printf("%d\n", i);
+		write_list(&list);
+		string_list_clear(&list, 0);
+		free(s);
+		return 0;
+	}
+
 	if (argc == 4 && !strcmp(argv[1], "filter")) {
 		/*
 		 * Retain only the items that have the specified prefix.
diff --git a/t/t0063-string-list.sh b/t/t0063-string-list.sh
index 46d4839194..9c5094616a 100755
--- a/t/t0063-string-list.sh
+++ b/t/t0063-string-list.sh
@@ -18,42 +18,53 @@  test_split () {
 	"
 }
 
-test_split "foo:bar:baz" ":" "-1" <<EOF
-3
-[0]: "foo"
-[1]: "bar"
-[2]: "baz"
-EOF
+test_split_in_place_multi () {
+	cat >expected &&
+	test_expect_success "split_in_place_multi $1 at $2, max $3" "
+		test-tool string-list split_in_place_multi '$1' '$2' '$3' >actual &&
+		test_cmp expected actual
+	"
+}
 
-test_split "foo:bar:baz" ":" "0" <<EOF
-1
-[0]: "foo:bar:baz"
-EOF
+for test_fn in test_split test_split_in_place_multi
+do
+	$test_fn "foo:bar:baz" ":" "-1" <<-\EOF
+	3
+	[0]: "foo"
+	[1]: "bar"
+	[2]: "baz"
+	EOF
 
-test_split "foo:bar:baz" ":" "1" <<EOF
-2
-[0]: "foo"
-[1]: "bar:baz"
-EOF
+	$test_fn "foo:bar:baz" ":" "0" <<-\EOF
+	1
+	[0]: "foo:bar:baz"
+	EOF
 
-test_split "foo:bar:baz" ":" "2" <<EOF
-3
-[0]: "foo"
-[1]: "bar"
-[2]: "baz"
-EOF
+	$test_fn "foo:bar:baz" ":" "1" <<-\EOF
+	2
+	[0]: "foo"
+	[1]: "bar:baz"
+	EOF
 
-test_split "foo:bar:" ":" "-1" <<EOF
-3
-[0]: "foo"
-[1]: "bar"
-[2]: ""
-EOF
+	$test_fn "foo:bar:baz" ":" "2" <<-\EOF
+	3
+	[0]: "foo"
+	[1]: "bar"
+	[2]: "baz"
+	EOF
 
-test_split "" ":" "-1" <<EOF
-1
-[0]: ""
-EOF
+	$test_fn "foo:bar:" ":" "-1" <<-\EOF
+	3
+	[0]: "foo"
+	[1]: "bar"
+	[2]: ""
+	EOF
+
+	$test_fn "" ":" "-1" <<-\EOF
+	1
+	[0]: ""
+	EOF
+done
 
 test_split ":" ":" "-1" <<EOF
 2
@@ -61,6 +72,38 @@  test_split ":" ":" "-1" <<EOF
 [1]: ""
 EOF
 
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "-1" <<-\EOF
+3
+[0]: "foo"
+[1]: "bar"
+[2]: "baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "0" <<-\EOF
+1
+[0]: "foo:;:bar:;:baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "1" <<-\EOF
+2
+[0]: "foo"
+[1]: "bar:;:baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:baz" ":;" "2" <<-\EOF
+3
+[0]: "foo"
+[1]: "bar"
+[2]: "baz"
+EOF
+
+test_split_in_place_multi "foo:;:bar:;:" ":;" "-1" <<-\EOF
+3
+[0]: "foo"
+[1]: "bar"
+[2]: ""
+EOF
+
 test_expect_success "test filter_string_list" '
 	test "x-" = "x$(test-tool string-list filter - y)" &&
 	test "x-" = "x$(test-tool string-list filter no y)" &&