diff mbox series

[1/5] string-list: introduce `string_list_split_in_place_multi()`

Message ID dda218c8c1fdc0ca2e4352b820f3432565a74a23.1681428696.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series banned: mark `strok()`, `strtok_r()` as banned | expand

Commit Message

Taylor Blau April 13, 2023, 11:31 p.m. UTC
Introduce a variant of the `string_list_split_in_place()` function that
takes a string of accepted delimiters.

By contrast to its cousin `string_list_split_in_place()` which splits
the given string at every instance of the single character `delim`, the
`_multi` variant splits the given string any any character appearing in
the string `delim`.

Instead of using `strchr(2)` to locate the first occurrence of the given
delimiter character, `string_list_split_in_place_multi()` uses
`strpbrk(2)` to find the first occurrence of *any* character in the given
delimiter string.

Since the `_multi` variant is a generalization of the original
implementation, reimplement `string_list_split_in_place()` in terms of
the more general function by providing a single-character string for the
list of accepted delimiters.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 string-list.c | 15 ++++++++++++---
 string-list.h |  6 ++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

Comments

Jeff King April 18, 2023, 10:10 a.m. UTC | #1
On Thu, Apr 13, 2023 at 07:31:43PM -0400, Taylor Blau wrote:

> Instead of using `strchr(2)` to locate the first occurrence of the given
> delimiter character, `string_list_split_in_place_multi()` uses
> `strpbrk(2)` to find the first occurrence of *any* character in the given
> delimiter string.
> 
> Since the `_multi` variant is a generalization of the original
> implementation, reimplement `string_list_split_in_place()` in terms of
> the more general function by providing a single-character string for the
> list of accepted delimiters.

I'd imagine that strpbrk() is potentially a lot slower than strchr().
But I kind of doubt that splitting is such a hot path that it will
matter. If we want to care, I think we could do something like:

  end = delim[1] ? strpbrk(p, delim) : strchr(p, delim[0]);

It's entirely possible that a half-decent strpbrk() implementation does
this already.

So I mention it mostly in case we need to revisit this later. I think
it's OK to ignore for now.

-Peff
Taylor Blau April 18, 2023, 5:08 p.m. UTC | #2
On Tue, Apr 18, 2023 at 06:10:58AM -0400, Jeff King wrote:
> On Thu, Apr 13, 2023 at 07:31:43PM -0400, Taylor Blau wrote:
>
> > Instead of using `strchr(2)` to locate the first occurrence of the given
> > delimiter character, `string_list_split_in_place_multi()` uses
> > `strpbrk(2)` to find the first occurrence of *any* character in the given
> > delimiter string.
> >
> > Since the `_multi` variant is a generalization of the original
> > implementation, reimplement `string_list_split_in_place()` in terms of
> > the more general function by providing a single-character string for the
> > list of accepted delimiters.
>
> I'd imagine that strpbrk() is potentially a lot slower than strchr().
> But I kind of doubt that splitting is such a hot path that it will
> matter. If we want to care, I think we could do something like:
>
>   end = delim[1] ? strpbrk(p, delim) : strchr(p, delim[0]);
>
> It's entirely possible that a half-decent strpbrk() implementation does
> this already.

I did a little research here, and it looks like both glibc[1] and
musl[2] already do this. Those links are for their implementations of
strcspn(), but which both implement[^1] strpbrk() as "p += strcspn(p,
d)".

> So I mention it mostly in case we need to revisit this later. I think
> it's OK to ignore for now.

In addition to being OK from a performance standpoint I think solving
the semantics concern you noted lower down in the thread[3] is fairly
straightforward.

When in "_multi" mode, I think this boils down to moving past any number
of characters in `delim` before each iteration of the loop. And that's
doable with "p += strspn(p, delim)".

That avoids any behavior change, and more closely matches what strtok()
does, so I think it's a better path.

Thanks,
Taylor

[1]: https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcspn.c;hb=glibc-2.37#l35
[2]: https://git.musl-libc.org/cgit/musl/tree/src/string/strcspn.c?h=v1.2.3#n11
[3]: https://lore.kernel.org/git/20230418102320.GB508219@coredump.intra.peff.net/

[^1]: Not entirely. If strcspn() takes you to the end of the string,
  strpbrk() will return NULL, instead of a pointer to the end of the
  string.
diff mbox series

Patch

diff --git a/string-list.c b/string-list.c
index db473f273e1..67f9ff18904 100644
--- a/string-list.c
+++ b/string-list.c
@@ -300,8 +300,8 @@  int string_list_split(struct string_list *list, const char *string,
 	}
 }
 
-int string_list_split_in_place(struct string_list *list, char *string,
-			       int delim, int maxsplit)
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit)
 {
 	int count = 0;
 	char *p = string, *end;
@@ -315,7 +315,7 @@  int string_list_split_in_place(struct string_list *list, char *string,
 			string_list_append(list, p);
 			return count;
 		}
-		end = strchr(p, delim);
+		end = strpbrk(p, delim);
 		if (end) {
 			*end = '\0';
 			string_list_append(list, p);
@@ -326,3 +326,12 @@  int string_list_split_in_place(struct string_list *list, char *string,
 		}
 	}
 }
+
+int string_list_split_in_place(struct string_list *list, char *string,
+			       int delim, int maxsplit)
+{
+	char delim_s[2] = { delim, 0 };
+
+	return string_list_split_in_place_multi(list, string, delim_s,
+						maxsplit);
+}
diff --git a/string-list.h b/string-list.h
index c7b0d5d0008..670d4fc8fb7 100644
--- a/string-list.h
+++ b/string-list.h
@@ -268,7 +268,13 @@  int string_list_split(struct string_list *list, const char *string,
  * new string_list_items point into string (which therefore must not
  * be modified or freed while the string_list is in use).
  * list->strdup_strings must *not* be set.
+ *
+ * The "_multi" variant splits the given string on any character
+ * appearing in "delim", and the non-"_multi" variant splits only on the
+ * given character.
  */
+int string_list_split_in_place_multi(struct string_list *list, char *string,
+				     const char *delim, int maxsplit);
 int string_list_split_in_place(struct string_list *list, char *string,
 			       int delim, int maxsplit);
 #endif /* STRING_LIST_H */