diff mbox series

ordered string-list considered harmful, was Re: [PATCH v3] Allow aliases that include other aliases

Message ID 20180906191203.GA26184@sigill.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series ordered string-list considered harmful, was Re: [PATCH v3] Allow aliases that include other aliases | expand

Commit Message

Jeff King Sept. 6, 2018, 7:12 p.m. UTC
On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote:

> > +		string_list_append(&cmd_list, *argv[0]);
> 
> This will create an unsorted list. You'd have to use
> string_list_insert() here for a sorted list, or
> unsorted_string_list_has_string() in the earlier call.
> 
> It's unfortunate that string_list makes this so easy to get wrong.

This is getting really off-topic (since it sounds like we'd probably
want to use an ordered list here), but is it crazy to think that
basically every use of an ordered string list could just be a hashmap?

And then the sometimes-sorted/sometimes-not duality of string-list could
go away?

As a bonus, that would fix a bunch of places which possibly quadratic,
since calling string_list_insert() may have to do O(n) work to shift the
memory around (though the worst case is reverse-sorted order, so in
practice many of these callers are actually fine, and those that aren't
often do a series of appends followed by a sort).

One pain point is that hashmaps are a little inconvenient due to their
memory ownership. But I sketched out an API below (and converted one
caller) that could make that less awful.

Another alternative is for string_list to keep a "sorted" flag, reset it
after an _append(), and BUG() when it's not set in the right places.
That's still easy to mis-use, but at least you get a run-time check.

---
 apply.c  | 35 +++++++++--------------------
 apply.h  |  3 ++-
 strmap.h | 42 +++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 25 deletions(-)

Comments

Jeff King Sept. 6, 2018, 7:20 p.m. UTC | #1
On Thu, Sep 06, 2018 at 03:12:03PM -0400, Jeff King wrote:

> On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote:
> 
> > > +		string_list_append(&cmd_list, *argv[0]);
> > 
> > This will create an unsorted list. You'd have to use
> > string_list_insert() here for a sorted list, or
> > unsorted_string_list_has_string() in the earlier call.
> > 
> > It's unfortunate that string_list makes this so easy to get wrong.
> 
> This is getting really off-topic (since it sounds like we'd probably
> want to use an ordered list here), but is it crazy to think that
> basically every use of an ordered string list could just be a hashmap?

Er, oops, I used "ordered" to mean two things here.

I meant that the code regarding aliases would use an _unsorted_ list
(where we care about keeping the original insertion order).

But what I think is harmful is a _sorted_ list, because of the
"accidentally quadratic" nature, and because it's easy to call its
functions on an unsorted list.

-Peff
Stefan Beller Sept. 6, 2018, 8:04 p.m. UTC | #2
On Thu, Sep 6, 2018 at 12:12 PM Jeff King <peff@peff.net> wrote:
>
> On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote:
>
> > > +           string_list_append(&cmd_list, *argv[0]);
> >
> > This will create an unsorted list. You'd have to use
> > string_list_insert() here for a sorted list, or
> > unsorted_string_list_has_string() in the earlier call.
> >
> > It's unfortunate that string_list makes this so easy to get wrong.
>
> This is getting really off-topic (since it sounds like we'd probably
> want to use an ordered list here), but is it crazy to think that
> basically every use of an ordered string list could just be a hashmap?

Does a hashmap guarantee an order?
I thought we had an example of an ordered list in the submodule code
but could not find it, maybe it is gone already or did not rely on the order
as I thought.

It turns out we make never use of a custom compare function in
the stringlist, which helps gaining confidence this use case is nowhere
to be found in the code.

> And then the sometimes-sorted/sometimes-not duality of string-list could
> go away?

Sounds good, ship it. :-)

Stefan
Jeff King Sept. 6, 2018, 8:49 p.m. UTC | #3
On Thu, Sep 06, 2018 at 01:04:18PM -0700, Stefan Beller wrote:

> On Thu, Sep 6, 2018 at 12:12 PM Jeff King <peff@peff.net> wrote:
> >
> > On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote:
> >
> > > > +           string_list_append(&cmd_list, *argv[0]);
> > >
> > > This will create an unsorted list. You'd have to use
> > > string_list_insert() here for a sorted list, or
> > > unsorted_string_list_has_string() in the earlier call.
> > >
> > > It's unfortunate that string_list makes this so easy to get wrong.
> >
> > This is getting really off-topic (since it sounds like we'd probably
> > want to use an ordered list here), but is it crazy to think that
> > basically every use of an ordered string list could just be a hashmap?
> 
> Does a hashmap guarantee an order?

No, it definitely doesn't.

I guess the reading-between-the-lines assumption that I didn't quite say
is: I think most (if not all) of the users of sorted string lists don't
actually care about a particular order. They just want efficient lookup.

> I thought we had an example of an ordered list in the submodule code
> but could not find it, maybe it is gone already or did not rely on the order
> as I thought.
> 
> It turns out we make never use of a custom compare function in
> the stringlist, which helps gaining confidence this use case is nowhere
> to be found in the code.

Plenty of code uses the default strcmp. You can find users which assume
sorting by their use of string_list_insert() versus _append(). Or ones
that call string_list_sort(), of course.

-Peff
Stefan Beller Sept. 6, 2018, 8:54 p.m. UTC | #4
> > Does a hashmap guarantee an order?
>
> No, it definitely doesn't.
>
> I guess the reading-between-the-lines assumption that I didn't quite say
> is: I think most (if not all) of the users of sorted string lists don't
> actually care about a particular order. They just want efficient lookup.
>
> > I thought we had an example of an ordered list in the submodule code
> > but could not find it, maybe it is gone already or did not rely on the order
> > as I thought.
> >
> > It turns out we make never use of a custom compare function in
> > the stringlist, which helps gaining confidence this use case is nowhere
> > to be found in the code.
>
> Plenty of code uses the default strcmp. You can find users which assume
> sorting by their use of string_list_insert() versus _append(). Or ones
> that call string_list_sort(), of course.

Here comes my reading-between-the-lines assumption:

When using the default comparison function, you probably only care
about the efficient lookup as described above, but if you had a non-default
order, then we'd have strong evidence of the contrary as the author of such
code would have found reasons why that order is superior than default order
(and don't tell me a different order helps making lookups even more efficient,
this must be another reason).
Jonathan Nieder Sept. 6, 2018, 11:50 p.m. UTC | #5
Hi,

Jeff King wrote:

> But what I think is harmful is a _sorted_ list, because of the
> "accidentally quadratic" nature, and because it's easy to call its
> functions on an unsorted list.

I agree --- in general, it tends to be better to build an unsorted
string list and then sort it.

Once I've done so, what is your advice about getting fast lookups
in the result?  Should I build an auxiliary hashmap as well?  Or
is this an argument for the 'sorted' flag + BUG approach you
already mentioned?

Whatever we do, I agree with your goal of getting rid of
string_list_insert.

Thanks,
Jonathan
Jeff King Sept. 7, 2018, 3:12 a.m. UTC | #6
On Thu, Sep 06, 2018 at 01:54:15PM -0700, Stefan Beller wrote:

> > > It turns out we make never use of a custom compare function in
> > > the stringlist, which helps gaining confidence this use case is nowhere
> > > to be found in the code.
> >
> > Plenty of code uses the default strcmp. You can find users which assume
> > sorting by their use of string_list_insert() versus _append(). Or ones
> > that call string_list_sort(), of course.
> 
> Here comes my reading-between-the-lines assumption:
> 
> When using the default comparison function, you probably only care
> about the efficient lookup as described above, but if you had a non-default
> order, then we'd have strong evidence of the contrary as the author of such
> code would have found reasons why that order is superior than default order
> (and don't tell me a different order helps making lookups even more efficient,
> this must be another reason).

That's a reasonable hypothesis. It looks like there are a few cases
where we assign to a string_list.cmp, so I picked one arbitrarily to
look at: split_maildir() uses a custom filename comparison. Despite
having no recollection of this code, it appears to come from my commit
18505c3423. :)

And yes, we really do care about order there, as we're trying to read
the files in "maildir order". And I don't think a hash would do.

That said, I don't think we care about using string-list's sorted
operations here (like its binary-search lookup). It would be enough for
us to generate the list (whether as string-list or no; we'd actually be
happy with any array), sort it, and then iterate over the result.

So I think some of these are definitely going to require some thoughtful
conversion to the correct data type. Mostly what I was asking in the
beginning was: does this seem like an overtly terrible line of thinking
to anyone. And so far I think the answer is no. So the next step is to
proceed to actually trying some conversions, and seeing what kinds of
snags I hit. The devil, as usual, is in the details. :)

-Peff
Jeff King Sept. 7, 2018, 3:24 a.m. UTC | #7
On Thu, Sep 06, 2018 at 04:50:33PM -0700, Jonathan Nieder wrote:

> Hi,
> 
> Jeff King wrote:
> 
> > But what I think is harmful is a _sorted_ list, because of the
> > "accidentally quadratic" nature, and because it's easy to call its
> > functions on an unsorted list.
> 
> I agree --- in general, it tends to be better to build an unsorted
> string list and then sort it.
> 
> Once I've done so, what is your advice about getting fast lookups
> in the result?  Should I build an auxiliary hashmap as well?  Or
> is this an argument for the 'sorted' flag + BUG approach you
> already mentioned?

I don't see any point in generating a sorted list and _then_ making an
auxiliary hashmap. My idea was that if you're using a sorted string-list
for lookup, then you can replace the whole thing with a hash (inserting
as you go, rather than sorting at the end).

I.e., imagine there are three use cases for string lists:

  1. Build a list via append(), sort at the end, then do a series of
     efficient queries.

  2. Build a list via insert(), which is always sorted. Possibly query
     while building, or after finished building.

  3. Build an unsorted list and iterate over it.

We know that (2) is potentially quadratic during the build-and-sort
step. It would be nice to turn it into (1), but it's not always possible
if we query it while still building. Turning these into a hashmap is an
easy fix with no real downsides.

Case (1) actually isn't a problem for run-time. We'd like to get rid of
it only because the string_list functions are a bit confusing in terms
of which ones expect us to be sorted or not (and if ever forget to sort
before querying, we'd see all kinds of subtle bugs).

Converting these cases to use a hashmap is one way we might get rid of
the confusing string-list functions. And it doesn't carry any big-O
downside, though it may be slower in practice (e.g., hashmaps tend to be
a bit malloc-heavy).

Case (3) is the one we'd like to preserve as the "right" use of
string-list, since it's hard to get wrong (after we remove the sorting
functions).

I think Stefan pointed out a "case 4" in the other part of the thread:
ones where we really care not just about fast lookup, but actual
iteration order.  These ones may need some special care, but I don't
think there are many of them.

So that's _one_ way to make the world better.

Another way is to try to make the functions harder to misuse. E.g.,
maybe putting "sorted" into the name of string_list_has_string(), so
it's on an equal footing with the unsorted variant. Or the sorted flag I
mentioned. Those can help the misuse problem, but they don't help with
the case (2) quadratic ones. It's probably less work, though.

I think I like the hashmap way, if the conversion isn't too painful. As
I said to Stefan in the other part of the thread, I'm mostly at the
"does this seem like a terrible idea" stage. I'd have to start
conversions to see how many of each case we have (and if there are ones
that don't fit into my taxonomy above).

-Peff
Jonathan Nieder Sept. 7, 2018, 6:32 a.m. UTC | #8
Jeff King wrote:

> I don't see any point in generating a sorted list and _then_ making an
> auxiliary hashmap. My idea was that if you're using a sorted string-list
> for lookup, then you can replace the whole thing with a hash (inserting
> as you go, rather than sorting at the end).

What if I'm sorting a string list in preparation for emitting a sorted
list, and I *also* want to perform lookups in that same list?  In
other words:

[...]
> I think Stefan pointed out a "case 4" in the other part of the thread:
> ones where we really care not just about fast lookup, but actual
> iteration order.

I had assumed that that was the whole point of this data structure.
Anything else that is using it for lookups should indeed use a hash
map instead, and I can take my share of blame for missing this kind of
thing in review.

[...]
> I think I like the hashmap way, if the conversion isn't too painful.

If we don't have any callers that actually need the sort-and-lookup
thing, then yay, let's get rid of it.  But I don't actually think of
this as the hashmap way.  It's the get-rid-of-the-unneeded-feature
way.

In other words, *regardless* of what else we should do, we should
update any callers that want a hashmap to use a hashmap.  Please go
ahead, even if it doesn't let us simplify the string list API at all.

Thanks,
Jonathan
Ævar Arnfjörð Bjarmason Sept. 7, 2018, 7:20 a.m. UTC | #9
On Fri, Sep 07 2018, Jonathan Nieder wrote:

> Jeff King wrote:
>
>> I don't see any point in generating a sorted list and _then_ making an
>> auxiliary hashmap. My idea was that if you're using a sorted string-list
>> for lookup, then you can replace the whole thing with a hash (inserting
>> as you go, rather than sorting at the end).
>
> What if I'm sorting a string list in preparation for emitting a sorted
> list, and I *also* want to perform lookups in that same list?  In
> other words:

If this turns out to be a common use-case perhaps the easiest way to
support that would be to make the hashmap (optionally?) ordered, as Ruby
1.9 did with their hash implementation:
https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/
Jonathan Nieder Sept. 7, 2018, 7:23 a.m. UTC | #10
Ævar Arnfjörð Bjarmason wrote:
> On Fri, Sep 07 2018, Jonathan Nieder wrote:
>> Jeff King wrote:

>>> I don't see any point in generating a sorted list and _then_ making an
>>> auxiliary hashmap. My idea was that if you're using a sorted string-list
>>> for lookup, then you can replace the whole thing with a hash (inserting
>>> as you go, rather than sorting at the end).
>>
>> What if I'm sorting a string list in preparation for emitting a sorted
>> list, and I *also* want to perform lookups in that same list?  In
>> other words:
>
> If this turns out to be a common use-case perhaps the easiest way to
> support that would be to make the hashmap (optionally?) ordered, as Ruby
> 1.9 did with their hash implementation:
> https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/

That's about recording the order of insertion.  I'm talking about
something much simpler: sorting an array (as preparation for emitting
it) and binary searching to find an entry in that array.
Jeff King Sept. 7, 2018, 2:48 p.m. UTC | #11
On Thu, Sep 06, 2018 at 11:32:41PM -0700, Jonathan Nieder wrote:

> > I think Stefan pointed out a "case 4" in the other part of the thread:
> > ones where we really care not just about fast lookup, but actual
> > iteration order.
> 
> I had assumed that that was the whole point of this data structure.
> Anything else that is using it for lookups should indeed use a hash
> map instead, and I can take my share of blame for missing this kind of
> thing in review.

Keep in mind we didn't have a decent generic hashmap for many years. So
I think string-list got used in its place.

> > I think I like the hashmap way, if the conversion isn't too painful.
> 
> If we don't have any callers that actually need the sort-and-lookup
> thing, then yay, let's get rid of it.  But I don't actually think of
> this as the hashmap way.  It's the get-rid-of-the-unneeded-feature
> way.
> 
> In other words, *regardless* of what else we should do, we should
> update any callers that want a hashmap to use a hashmap.  Please go
> ahead, even if it doesn't let us simplify the string list API at all.

Great, I think we're on the same page. Thanks!

-Peff
brian m. carlson Sept. 8, 2018, 4:49 p.m. UTC | #12
On Fri, Sep 07, 2018 at 12:23:53AM -0700, Jonathan Nieder wrote:
> Ævar Arnfjörð Bjarmason wrote:
> > If this turns out to be a common use-case perhaps the easiest way to
> > support that would be to make the hashmap (optionally?) ordered, as Ruby
> > 1.9 did with their hash implementation:
> > https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/
> 
> That's about recording the order of insertion.  I'm talking about
> something much simpler: sorting an array (as preparation for emitting
> it) and binary searching to find an entry in that array.

If you want both a collection that is always sorted and has efficient
lookup for an arbitrary entry, then a B-tree is probably a better
choice.  That's the primitive that Rust provides for that situation.
diff mbox series

Patch

diff --git a/apply.c b/apply.c
index e485fbc6bc..df3ba94caf 100644
--- a/apply.c
+++ b/apply.c
@@ -90,7 +90,7 @@  int init_apply_state(struct apply_state *state,
 	state->ws_error_action = warn_on_ws_error;
 	state->ws_ignore_action = ignore_ws_none;
 	state->linenr = 1;
-	string_list_init(&state->fn_table, 0);
+	strmap_init(&state->fn_table);
 	string_list_init(&state->limit_by_name, 0);
 	string_list_init(&state->symlink_changes, 0);
 	strbuf_init(&state->root, 0);
@@ -3265,16 +3265,10 @@  static int read_file_or_gitlink(const struct cache_entry *ce, struct strbuf *buf
 
 static struct patch *in_fn_table(struct apply_state *state, const char *name)
 {
-	struct string_list_item *item;
-
 	if (name == NULL)
 		return NULL;
 
-	item = string_list_lookup(&state->fn_table, name);
-	if (item != NULL)
-		return (struct patch *)item->util;
-
-	return NULL;
+	return strmap_get(&state->fn_table, name);
 }
 
 /*
@@ -3304,26 +3298,21 @@  static int was_deleted(struct patch *patch)
 
 static void add_to_fn_table(struct apply_state *state, struct patch *patch)
 {
-	struct string_list_item *item;
-
 	/*
 	 * Always add new_name unless patch is a deletion
 	 * This should cover the cases for normal diffs,
 	 * file creations and copies
 	 */
-	if (patch->new_name != NULL) {
-		item = string_list_insert(&state->fn_table, patch->new_name);
-		item->util = patch;
-	}
+	if (patch->new_name != NULL)
+		strmap_put(&state->fn_table, patch->new_name, patch);
 
 	/*
 	 * store a failure on rename/deletion cases because
 	 * later chunks shouldn't patch old names
 	 */
-	if ((patch->new_name == NULL) || (patch->is_rename)) {
-		item = string_list_insert(&state->fn_table, patch->old_name);
-		item->util = PATH_WAS_DELETED;
-	}
+	if ((patch->new_name == NULL) || (patch->is_rename))
+		strmap_put(&state->fn_table, patch->old_name,
+			   PATH_WAS_DELETED);
 }
 
 static void prepare_fn_table(struct apply_state *state, struct patch *patch)
@@ -3332,11 +3321,9 @@  static void prepare_fn_table(struct apply_state *state, struct patch *patch)
 	 * store information about incoming file deletion
 	 */
 	while (patch) {
-		if ((patch->new_name == NULL) || (patch->is_rename)) {
-			struct string_list_item *item;
-			item = string_list_insert(&state->fn_table, patch->old_name);
-			item->util = PATH_TO_BE_DELETED;
-		}
+		if ((patch->new_name == NULL) || (patch->is_rename))
+			strmap_put(&state->fn_table, patch->old_name,
+				   PATH_TO_BE_DELETED);
 		patch = patch->next;
 	}
 }
@@ -4757,7 +4744,7 @@  static int apply_patch(struct apply_state *state,
 end:
 	free_patch_list(list);
 	strbuf_release(&buf);
-	string_list_clear(&state->fn_table, 0);
+	strmap_clear(&state->fn_table);
 	return res;
 }
 
diff --git a/apply.h b/apply.h
index 5948348133..66f681622e 100644
--- a/apply.h
+++ b/apply.h
@@ -3,6 +3,7 @@ 
 
 #include "lockfile.h"
 #include "string-list.h"
+#include "strmap.h"
 
 struct repository;
 
@@ -98,7 +99,7 @@  struct apply_state {
 	 * Records filenames that have been touched, in order to handle
 	 * the case where more than one patches touch the same file.
 	 */
-	struct string_list fn_table;
+	struct strmap fn_table;
 
 	/*
 	 * This is to save reporting routines before using
diff --git a/strmap.h b/strmap.h
new file mode 100644
index 0000000000..2748f50e3e
--- /dev/null
+++ b/strmap.h
@@ -0,0 +1,42 @@ 
+#ifndef STRMAP_H
+#define STRMAP_H
+
+struct strmap {
+	struct hashmap map;
+};
+
+#define STRMAP_INIT { { NULL } }
+
+/*
+ * Initialize an empty strmap (this is unnecessary if it was initialized with
+ * STRMAP_INIT).
+ */
+void strmap_init(struct strmap *map);
+
+/*
+ * Remove all entries from the map, releasing any allocated resources.
+ */
+void strmap_clear(struct strmap *map);
+
+/*
+ * Insert "str" into the map, pointing to "data". A copy of "str" is made, so
+ * it does not need to persist after the this function is called.
+ *
+ * If an entry for "str" already exists, its data pointer is overwritten, and
+ * the original data pointer returned. Otherwise, returns NULL.
+ */
+void *strmap_put(struct strmap *map, const char *str, void *data);
+
+/*
+ * Return the data pointer mapped by "str", or NULL if the entry does not
+ * exist.
+ */
+void *strmap_get(struct strmap *map, const char *str);
+
+/*
+ * Return non-zero iff "str" is present in the map. This differs from
+ * strmap_get() in that it can distinguish entries with a NULL data pointer.
+ */
+int strmap_contains(struct strmap *map, const char *str);
+
+#endif /* STRMAP_H */