Message ID | 20180906191203.GA26184@sigill.intra.peff.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ordered string-list considered harmful, was Re: [PATCH v3] Allow aliases that include other aliases | expand |
On Thu, Sep 06, 2018 at 03:12:03PM -0400, Jeff King wrote: > On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote: > > > > + string_list_append(&cmd_list, *argv[0]); > > > > This will create an unsorted list. You'd have to use > > string_list_insert() here for a sorted list, or > > unsorted_string_list_has_string() in the earlier call. > > > > It's unfortunate that string_list makes this so easy to get wrong. > > This is getting really off-topic (since it sounds like we'd probably > want to use an ordered list here), but is it crazy to think that > basically every use of an ordered string list could just be a hashmap? Er, oops, I used "ordered" to mean two things here. I meant that the code regarding aliases would use an _unsorted_ list (where we care about keeping the original insertion order). But what I think is harmful is a _sorted_ list, because of the "accidentally quadratic" nature, and because it's easy to call its functions on an unsorted list. -Peff
On Thu, Sep 6, 2018 at 12:12 PM Jeff King <peff@peff.net> wrote: > > On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote: > > > > + string_list_append(&cmd_list, *argv[0]); > > > > This will create an unsorted list. You'd have to use > > string_list_insert() here for a sorted list, or > > unsorted_string_list_has_string() in the earlier call. > > > > It's unfortunate that string_list makes this so easy to get wrong. > > This is getting really off-topic (since it sounds like we'd probably > want to use an ordered list here), but is it crazy to think that > basically every use of an ordered string list could just be a hashmap? Does a hashmap guarantee an order? I thought we had an example of an ordered list in the submodule code but could not find it, maybe it is gone already or did not rely on the order as I thought. It turns out we make never use of a custom compare function in the stringlist, which helps gaining confidence this use case is nowhere to be found in the code. > And then the sometimes-sorted/sometimes-not duality of string-list could > go away? Sounds good, ship it. :-) Stefan
On Thu, Sep 06, 2018 at 01:04:18PM -0700, Stefan Beller wrote: > On Thu, Sep 6, 2018 at 12:12 PM Jeff King <peff@peff.net> wrote: > > > > On Thu, Sep 06, 2018 at 10:59:42AM -0400, Jeff King wrote: > > > > > > + string_list_append(&cmd_list, *argv[0]); > > > > > > This will create an unsorted list. You'd have to use > > > string_list_insert() here for a sorted list, or > > > unsorted_string_list_has_string() in the earlier call. > > > > > > It's unfortunate that string_list makes this so easy to get wrong. > > > > This is getting really off-topic (since it sounds like we'd probably > > want to use an ordered list here), but is it crazy to think that > > basically every use of an ordered string list could just be a hashmap? > > Does a hashmap guarantee an order? No, it definitely doesn't. I guess the reading-between-the-lines assumption that I didn't quite say is: I think most (if not all) of the users of sorted string lists don't actually care about a particular order. They just want efficient lookup. > I thought we had an example of an ordered list in the submodule code > but could not find it, maybe it is gone already or did not rely on the order > as I thought. > > It turns out we make never use of a custom compare function in > the stringlist, which helps gaining confidence this use case is nowhere > to be found in the code. Plenty of code uses the default strcmp. You can find users which assume sorting by their use of string_list_insert() versus _append(). Or ones that call string_list_sort(), of course. -Peff
> > Does a hashmap guarantee an order? > > No, it definitely doesn't. > > I guess the reading-between-the-lines assumption that I didn't quite say > is: I think most (if not all) of the users of sorted string lists don't > actually care about a particular order. They just want efficient lookup. > > > I thought we had an example of an ordered list in the submodule code > > but could not find it, maybe it is gone already or did not rely on the order > > as I thought. > > > > It turns out we make never use of a custom compare function in > > the stringlist, which helps gaining confidence this use case is nowhere > > to be found in the code. > > Plenty of code uses the default strcmp. You can find users which assume > sorting by their use of string_list_insert() versus _append(). Or ones > that call string_list_sort(), of course. Here comes my reading-between-the-lines assumption: When using the default comparison function, you probably only care about the efficient lookup as described above, but if you had a non-default order, then we'd have strong evidence of the contrary as the author of such code would have found reasons why that order is superior than default order (and don't tell me a different order helps making lookups even more efficient, this must be another reason).
Hi, Jeff King wrote: > But what I think is harmful is a _sorted_ list, because of the > "accidentally quadratic" nature, and because it's easy to call its > functions on an unsorted list. I agree --- in general, it tends to be better to build an unsorted string list and then sort it. Once I've done so, what is your advice about getting fast lookups in the result? Should I build an auxiliary hashmap as well? Or is this an argument for the 'sorted' flag + BUG approach you already mentioned? Whatever we do, I agree with your goal of getting rid of string_list_insert. Thanks, Jonathan
On Thu, Sep 06, 2018 at 01:54:15PM -0700, Stefan Beller wrote: > > > It turns out we make never use of a custom compare function in > > > the stringlist, which helps gaining confidence this use case is nowhere > > > to be found in the code. > > > > Plenty of code uses the default strcmp. You can find users which assume > > sorting by their use of string_list_insert() versus _append(). Or ones > > that call string_list_sort(), of course. > > Here comes my reading-between-the-lines assumption: > > When using the default comparison function, you probably only care > about the efficient lookup as described above, but if you had a non-default > order, then we'd have strong evidence of the contrary as the author of such > code would have found reasons why that order is superior than default order > (and don't tell me a different order helps making lookups even more efficient, > this must be another reason). That's a reasonable hypothesis. It looks like there are a few cases where we assign to a string_list.cmp, so I picked one arbitrarily to look at: split_maildir() uses a custom filename comparison. Despite having no recollection of this code, it appears to come from my commit 18505c3423. :) And yes, we really do care about order there, as we're trying to read the files in "maildir order". And I don't think a hash would do. That said, I don't think we care about using string-list's sorted operations here (like its binary-search lookup). It would be enough for us to generate the list (whether as string-list or no; we'd actually be happy with any array), sort it, and then iterate over the result. So I think some of these are definitely going to require some thoughtful conversion to the correct data type. Mostly what I was asking in the beginning was: does this seem like an overtly terrible line of thinking to anyone. And so far I think the answer is no. So the next step is to proceed to actually trying some conversions, and seeing what kinds of snags I hit. The devil, as usual, is in the details. :) -Peff
On Thu, Sep 06, 2018 at 04:50:33PM -0700, Jonathan Nieder wrote: > Hi, > > Jeff King wrote: > > > But what I think is harmful is a _sorted_ list, because of the > > "accidentally quadratic" nature, and because it's easy to call its > > functions on an unsorted list. > > I agree --- in general, it tends to be better to build an unsorted > string list and then sort it. > > Once I've done so, what is your advice about getting fast lookups > in the result? Should I build an auxiliary hashmap as well? Or > is this an argument for the 'sorted' flag + BUG approach you > already mentioned? I don't see any point in generating a sorted list and _then_ making an auxiliary hashmap. My idea was that if you're using a sorted string-list for lookup, then you can replace the whole thing with a hash (inserting as you go, rather than sorting at the end). I.e., imagine there are three use cases for string lists: 1. Build a list via append(), sort at the end, then do a series of efficient queries. 2. Build a list via insert(), which is always sorted. Possibly query while building, or after finished building. 3. Build an unsorted list and iterate over it. We know that (2) is potentially quadratic during the build-and-sort step. It would be nice to turn it into (1), but it's not always possible if we query it while still building. Turning these into a hashmap is an easy fix with no real downsides. Case (1) actually isn't a problem for run-time. We'd like to get rid of it only because the string_list functions are a bit confusing in terms of which ones expect us to be sorted or not (and if ever forget to sort before querying, we'd see all kinds of subtle bugs). Converting these cases to use a hashmap is one way we might get rid of the confusing string-list functions. And it doesn't carry any big-O downside, though it may be slower in practice (e.g., hashmaps tend to be a bit malloc-heavy). Case (3) is the one we'd like to preserve as the "right" use of string-list, since it's hard to get wrong (after we remove the sorting functions). I think Stefan pointed out a "case 4" in the other part of the thread: ones where we really care not just about fast lookup, but actual iteration order. These ones may need some special care, but I don't think there are many of them. So that's _one_ way to make the world better. Another way is to try to make the functions harder to misuse. E.g., maybe putting "sorted" into the name of string_list_has_string(), so it's on an equal footing with the unsorted variant. Or the sorted flag I mentioned. Those can help the misuse problem, but they don't help with the case (2) quadratic ones. It's probably less work, though. I think I like the hashmap way, if the conversion isn't too painful. As I said to Stefan in the other part of the thread, I'm mostly at the "does this seem like a terrible idea" stage. I'd have to start conversions to see how many of each case we have (and if there are ones that don't fit into my taxonomy above). -Peff
Jeff King wrote: > I don't see any point in generating a sorted list and _then_ making an > auxiliary hashmap. My idea was that if you're using a sorted string-list > for lookup, then you can replace the whole thing with a hash (inserting > as you go, rather than sorting at the end). What if I'm sorting a string list in preparation for emitting a sorted list, and I *also* want to perform lookups in that same list? In other words: [...] > I think Stefan pointed out a "case 4" in the other part of the thread: > ones where we really care not just about fast lookup, but actual > iteration order. I had assumed that that was the whole point of this data structure. Anything else that is using it for lookups should indeed use a hash map instead, and I can take my share of blame for missing this kind of thing in review. [...] > I think I like the hashmap way, if the conversion isn't too painful. If we don't have any callers that actually need the sort-and-lookup thing, then yay, let's get rid of it. But I don't actually think of this as the hashmap way. It's the get-rid-of-the-unneeded-feature way. In other words, *regardless* of what else we should do, we should update any callers that want a hashmap to use a hashmap. Please go ahead, even if it doesn't let us simplify the string list API at all. Thanks, Jonathan
On Fri, Sep 07 2018, Jonathan Nieder wrote: > Jeff King wrote: > >> I don't see any point in generating a sorted list and _then_ making an >> auxiliary hashmap. My idea was that if you're using a sorted string-list >> for lookup, then you can replace the whole thing with a hash (inserting >> as you go, rather than sorting at the end). > > What if I'm sorting a string list in preparation for emitting a sorted > list, and I *also* want to perform lookups in that same list? In > other words: If this turns out to be a common use-case perhaps the easiest way to support that would be to make the hashmap (optionally?) ordered, as Ruby 1.9 did with their hash implementation: https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/
Ævar Arnfjörð Bjarmason wrote: > On Fri, Sep 07 2018, Jonathan Nieder wrote: >> Jeff King wrote: >>> I don't see any point in generating a sorted list and _then_ making an >>> auxiliary hashmap. My idea was that if you're using a sorted string-list >>> for lookup, then you can replace the whole thing with a hash (inserting >>> as you go, rather than sorting at the end). >> >> What if I'm sorting a string list in preparation for emitting a sorted >> list, and I *also* want to perform lookups in that same list? In >> other words: > > If this turns out to be a common use-case perhaps the easiest way to > support that would be to make the hashmap (optionally?) ordered, as Ruby > 1.9 did with their hash implementation: > https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/ That's about recording the order of insertion. I'm talking about something much simpler: sorting an array (as preparation for emitting it) and binary searching to find an entry in that array.
On Thu, Sep 06, 2018 at 11:32:41PM -0700, Jonathan Nieder wrote: > > I think Stefan pointed out a "case 4" in the other part of the thread: > > ones where we really care not just about fast lookup, but actual > > iteration order. > > I had assumed that that was the whole point of this data structure. > Anything else that is using it for lookups should indeed use a hash > map instead, and I can take my share of blame for missing this kind of > thing in review. Keep in mind we didn't have a decent generic hashmap for many years. So I think string-list got used in its place. > > I think I like the hashmap way, if the conversion isn't too painful. > > If we don't have any callers that actually need the sort-and-lookup > thing, then yay, let's get rid of it. But I don't actually think of > this as the hashmap way. It's the get-rid-of-the-unneeded-feature > way. > > In other words, *regardless* of what else we should do, we should > update any callers that want a hashmap to use a hashmap. Please go > ahead, even if it doesn't let us simplify the string list API at all. Great, I think we're on the same page. Thanks! -Peff
On Fri, Sep 07, 2018 at 12:23:53AM -0700, Jonathan Nieder wrote: > Ævar Arnfjörð Bjarmason wrote: > > If this turns out to be a common use-case perhaps the easiest way to > > support that would be to make the hashmap (optionally?) ordered, as Ruby > > 1.9 did with their hash implementation: > > https://www.igvita.com/2009/02/04/ruby-19-internals-ordered-hash/ > > That's about recording the order of insertion. I'm talking about > something much simpler: sorting an array (as preparation for emitting > it) and binary searching to find an entry in that array. If you want both a collection that is always sorted and has efficient lookup for an arbitrary entry, then a B-tree is probably a better choice. That's the primitive that Rust provides for that situation.
diff --git a/apply.c b/apply.c index e485fbc6bc..df3ba94caf 100644 --- a/apply.c +++ b/apply.c @@ -90,7 +90,7 @@ int init_apply_state(struct apply_state *state, state->ws_error_action = warn_on_ws_error; state->ws_ignore_action = ignore_ws_none; state->linenr = 1; - string_list_init(&state->fn_table, 0); + strmap_init(&state->fn_table); string_list_init(&state->limit_by_name, 0); string_list_init(&state->symlink_changes, 0); strbuf_init(&state->root, 0); @@ -3265,16 +3265,10 @@ static int read_file_or_gitlink(const struct cache_entry *ce, struct strbuf *buf static struct patch *in_fn_table(struct apply_state *state, const char *name) { - struct string_list_item *item; - if (name == NULL) return NULL; - item = string_list_lookup(&state->fn_table, name); - if (item != NULL) - return (struct patch *)item->util; - - return NULL; + return strmap_get(&state->fn_table, name); } /* @@ -3304,26 +3298,21 @@ static int was_deleted(struct patch *patch) static void add_to_fn_table(struct apply_state *state, struct patch *patch) { - struct string_list_item *item; - /* * Always add new_name unless patch is a deletion * This should cover the cases for normal diffs, * file creations and copies */ - if (patch->new_name != NULL) { - item = string_list_insert(&state->fn_table, patch->new_name); - item->util = patch; - } + if (patch->new_name != NULL) + strmap_put(&state->fn_table, patch->new_name, patch); /* * store a failure on rename/deletion cases because * later chunks shouldn't patch old names */ - if ((patch->new_name == NULL) || (patch->is_rename)) { - item = string_list_insert(&state->fn_table, patch->old_name); - item->util = PATH_WAS_DELETED; - } + if ((patch->new_name == NULL) || (patch->is_rename)) + strmap_put(&state->fn_table, patch->old_name, + PATH_WAS_DELETED); } static void prepare_fn_table(struct apply_state *state, struct patch *patch) @@ -3332,11 +3321,9 @@ static void prepare_fn_table(struct apply_state *state, struct patch *patch) * store information about incoming file deletion */ while (patch) { - if ((patch->new_name == NULL) || (patch->is_rename)) { - struct string_list_item *item; - item = string_list_insert(&state->fn_table, patch->old_name); - item->util = PATH_TO_BE_DELETED; - } + if ((patch->new_name == NULL) || (patch->is_rename)) + strmap_put(&state->fn_table, patch->old_name, + PATH_TO_BE_DELETED); patch = patch->next; } } @@ -4757,7 +4744,7 @@ static int apply_patch(struct apply_state *state, end: free_patch_list(list); strbuf_release(&buf); - string_list_clear(&state->fn_table, 0); + strmap_clear(&state->fn_table); return res; } diff --git a/apply.h b/apply.h index 5948348133..66f681622e 100644 --- a/apply.h +++ b/apply.h @@ -3,6 +3,7 @@ #include "lockfile.h" #include "string-list.h" +#include "strmap.h" struct repository; @@ -98,7 +99,7 @@ struct apply_state { * Records filenames that have been touched, in order to handle * the case where more than one patches touch the same file. */ - struct string_list fn_table; + struct strmap fn_table; /* * This is to save reporting routines before using diff --git a/strmap.h b/strmap.h new file mode 100644 index 0000000000..2748f50e3e --- /dev/null +++ b/strmap.h @@ -0,0 +1,42 @@ +#ifndef STRMAP_H +#define STRMAP_H + +struct strmap { + struct hashmap map; +}; + +#define STRMAP_INIT { { NULL } } + +/* + * Initialize an empty strmap (this is unnecessary if it was initialized with + * STRMAP_INIT). + */ +void strmap_init(struct strmap *map); + +/* + * Remove all entries from the map, releasing any allocated resources. + */ +void strmap_clear(struct strmap *map); + +/* + * Insert "str" into the map, pointing to "data". A copy of "str" is made, so + * it does not need to persist after the this function is called. + * + * If an entry for "str" already exists, its data pointer is overwritten, and + * the original data pointer returned. Otherwise, returns NULL. + */ +void *strmap_put(struct strmap *map, const char *str, void *data); + +/* + * Return the data pointer mapped by "str", or NULL if the entry does not + * exist. + */ +void *strmap_get(struct strmap *map, const char *str); + +/* + * Return non-zero iff "str" is present in the map. This differs from + * strmap_get() in that it can distinguish entries with a NULL data pointer. + */ +int strmap_contains(struct strmap *map, const char *str); + +#endif /* STRMAP_H */