mbox series

[v4,00/13] Add struct strmap and associated utility functions

Message ID pull.835.v4.git.git.1604535765.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Add struct strmap and associated utility functions | expand

Message

Derrick Stolee via GitGitGadget Nov. 5, 2020, 12:22 a.m. UTC
Here I introduce a new strmap type (and strintmap and strset), which my new
merge backed, merge-ort, uses heavily. (I also made significant use of it in
my changes to diffcore-rename). This strmap type was based on Peff's
proposal from a couple years ago[1], but has additions that I made as I used
it, and a number of additions/changes suggested by Peff in his reviews. I
also start the series off with some changes to hashmap, based on Peff's
feedback on v1 & v2.

NOTE: While en/merge-ort-impl depends on this series, there are no changes
in v4 that affect it so en/merge-ort-impl does not need a reroll.

Changes since v3 (almost all of which were suggestions from Peff):

 * Fix pointer math due to platform differences in FLEX_ALLOC definition,
   and a few other FLEXPTR_ALLOC_STR cleanups
 * Define strmap_for_each_entry in terms of hashmap_for_each_entry instead
   of lower level functions
 * Use simpler _INIT macros
 * Remove strset_check_and_add() from API as per Peff's suggestion
   (merge-ort doesn't need it; we can add it later)
 * Update comments and commit messages to update now obsolete statements due
   to changes from earlier reviews

[1] 
https://lore.kernel.org/git/20180906191203.GA26184@sigill.intra.peff.net/

Elijah Newren (13):
  hashmap: add usage documentation explaining hashmap_free[_entries]()
  hashmap: adjust spacing to fix argument alignment
  hashmap: allow re-use after hashmap_free()
  hashmap: introduce a new hashmap_partial_clear()
  hashmap: provide deallocation function names
  strmap: new utility functions
  strmap: add more utility functions
  strmap: enable faster clearing and reusing of strmaps
  strmap: add functions facilitating use as a string->int map
  strmap: add a strset sub-type
  strmap: enable allocations to come from a mem_pool
  strmap: take advantage of FLEXPTR_ALLOC_STR when relevant
  Use new HASHMAP_INIT macro to simplify hashmap initialization

 Makefile                |   1 +
 add-interactive.c       |   2 +-
 attr.c                  |  26 ++--
 blame.c                 |   2 +-
 bloom.c                 |   5 +-
 builtin/difftool.c      |   9 +-
 builtin/fetch.c         |   6 +-
 builtin/shortlog.c      |   2 +-
 config.c                |   2 +-
 diff.c                  |   4 +-
 diffcore-rename.c       |   2 +-
 dir.c                   |   8 +-
 hashmap.c               |  74 +++++++----
 hashmap.h               |  91 +++++++++++---
 merge-recursive.c       |   6 +-
 name-hash.c             |   4 +-
 object.c                |   2 +-
 oidmap.c                |   2 +-
 patch-ids.c             |   2 +-
 range-diff.c            |   6 +-
 ref-filter.c            |   2 +-
 revision.c              |  11 +-
 sequencer.c             |   4 +-
 strmap.c                | 151 ++++++++++++++++++++++
 strmap.h                | 270 ++++++++++++++++++++++++++++++++++++++++
 submodule-config.c      |   4 +-
 t/helper/test-hashmap.c |   9 +-
 27 files changed, 593 insertions(+), 114 deletions(-)
 create mode 100644 strmap.c
 create mode 100644 strmap.h


base-commit: d4a392452e292ff924e79ec8458611c0f679d6d4
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-835%2Fnewren%2Fstrmap-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-835/newren/strmap-v4
Pull-Request: https://github.com/git/git/pull/835

Range-diff vs v3:

  1:  af6b6fcb46 =  1:  af6b6fcb46 hashmap: add usage documentation explaining hashmap_free[_entries]()
  2:  591161fd78 =  2:  591161fd78 hashmap: adjust spacing to fix argument alignment
  3:  f2718d036d =  3:  f2718d036d hashmap: allow re-use after hashmap_free()
  4:  61f1da3c51 =  4:  61f1da3c51 hashmap: introduce a new hashmap_partial_clear()
  5:  861e8d65ae =  5:  861e8d65ae hashmap: provide deallocation function names
  6:  448d3b219f =  6:  448d3b219f strmap: new utility functions
  7:  42633b8d03 !  7:  5e8004c728 strmap: add more utility functions
     @@ strmap.h: void *strmap_get(struct strmap *map, const char *str);
      + * iterate through @map using @iter, @var is a pointer to a type strmap_entry
      + */
      +#define strmap_for_each_entry(mystrmap, iter, var)	\
     -+	for (var = hashmap_iter_first_entry_offset(&(mystrmap)->map, iter, 0); \
     -+		var; \
     -+		var = hashmap_iter_next_entry_offset(iter, 0))
     ++	hashmap_for_each_entry(&(mystrmap)->map, iter, var, ent)
      +
       #endif /* STRMAP_H */
  8:  ea942eb803 =  8:  fd96e9fc8d strmap: enable faster clearing and reusing of strmaps
  9:  c1d2172171 !  9:  f499934f54 strmap: add functions facilitating use as a string->int map
     @@ Commit message
          isn't the case when we're storing an int value directly in the void*
          slot instead of using the void* slot as a pointer to data.
      
     -    A note on the name: if anyone has a better name suggestion than
     -    strintmap, I'm happy to take it.  It seems slightly unwieldy, but I have
     -    not been able to come up with a better name.
     -
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## strmap.c ##
     @@ strmap.h: int cmp_strmap_entry(const void *hashmap_cmp_fn_data,
       			.strdup_strings = 1,                          \
       		    }
      +#define STRINTMAP_INIT { \
     -+			.map.map = HASHMAP_INIT(cmp_strmap_entry, NULL),  \
     -+			.map.strdup_strings = 1,                          \
     -+			.default_value = 0,                               \
     -+		    }
     ++			.map = STRMAP_INIT,   \
     ++			.default_value = 0,   \
     ++		       }
       
       /*
        * Initialize the members of the strmap.  Any keys added to the strmap will
      @@ strmap.h: static inline int strmap_empty(struct strmap *map)
     - 		var; \
     - 		var = hashmap_iter_next_entry_offset(iter, 0))
     + #define strmap_for_each_entry(mystrmap, iter, var)	\
     + 	hashmap_for_each_entry(&(mystrmap)->map, iter, var, ent)
       
      +
      +/*
     @@ strmap.h: static inline int strmap_empty(struct strmap *map)
      + * Primary differences:
      + *    1) Since the void* value is just an int in disguise, there is no value
      + *       to free.  (Thus one fewer argument to strintmap_clear)
     -+ *    2) strintmap_get() returns an int; it also requires an extra parameter to
     -+ *       be specified so it knows what value to return if the underlying strmap
     -+ *       has not key matching the given string.
     ++ *    2) strintmap_get() returns an int, or returns the default_value if the
     ++ *       key is not found in the strintmap.
      + *    3) No strmap_put() equivalent; strintmap_set() and strintmap_incr()
      + *       instead.
      + */
 10:  0f57735f5e ! 10:  ee1ec55f1b strmap: add a strset sub-type
     @@ Commit message
          free_values argument to *_clear(), and the *_get() function), and
          strset_add() is chosen as the API instead of strset_put().
      
     -    Finally, shortlog already had a more minimal strset API; so this adds a
     -    strset_check_and_add() function for its benefit to allow it to switch
     -    over to this strset implementation.
     -
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     - ## strmap.c ##
     -@@ strmap.c: void strintmap_incr(struct strintmap *map, const char *str, intptr_t amt)
     - 	else
     - 		strintmap_set(map, str, map->default_value + amt);
     - }
     -+
     -+int strset_check_and_add(struct strset *set, const char *str)
     -+{
     -+	if (strset_contains(set, str))
     -+		return 1;
     -+	strset_add(set, str);
     -+	return 0;
     -+}
     -
       ## strmap.h ##
      @@ strmap.h: int cmp_strmap_entry(const void *hashmap_cmp_fn_data,
     - 			.map.strdup_strings = 1,                          \
     - 			.default_value = 0,                               \
     - 		    }
     -+#define STRSET_INIT { \
     -+			.map.map = HASHMAP_INIT(cmp_strmap_entry, NULL),  \
     -+			.map.strdup_strings = 1,                          \
     -+		    }
     + 			.map = STRMAP_INIT,   \
     + 			.default_value = 0,   \
     + 		       }
     ++#define STRSET_INIT { .map = STRMAP_INIT }
       
       /*
        * Initialize the members of the strmap.  Any keys added to the strmap will
     @@ strmap.h: static inline void strintmap_set(struct strintmap *map, const char *st
      +{
      +	strmap_put(&set->map, str, NULL);
      +}
     -+
     -+/* Returns 1 if str already in set.  Otherwise adds str to set and returns 0 */
     -+int strset_check_and_add(struct strset *set, const char *str);
      +
       #endif /* STRMAP_H */
 11:  980537e877 = 11:  73a57045c3 strmap: enable allocations to come from a mem_pool
 12:  7f93cbb525 ! 12:  0352260de4 strmap: take advantage of FLEXPTR_ALLOC_STR when relevant
     @@ strmap.c: static void strmap_free_entries_(struct strmap *map, int free_values)
       	}
       }
       
     -@@ strmap.c: void *strmap_put(struct strmap *map, const char *str, void *data)
     - 		old = entry->value;
     +@@ strmap.c: void strmap_partial_clear(struct strmap *map, int free_values)
     + void *strmap_put(struct strmap *map, const char *str, void *data)
     + {
     + 	struct strmap_entry *entry = find_strmap_entry(map, str);
     +-	void *old = NULL;
     + 
     + 	if (entry) {
     +-		old = entry->value;
     ++		void *old = entry->value;
       		entry->value = data;
     - 	} else {
     +-	} else {
      -		const char *key = str;
      -
      -		entry = map->pool ? mem_pool_alloc(map->pool, sizeof(*entry))
      -				  : xmalloc(sizeof(*entry));
     -+		if (map->strdup_strings) {
     -+			if (!map->pool) {
     -+				FLEXPTR_ALLOC_STR(entry, key, str);
     -+			} else {
     -+				/* Remember +1 for nul byte twice below */
     -+				size_t len = strlen(str);
     -+				entry = mem_pool_alloc(map->pool,
     -+					       st_add3(sizeof(*entry), len, 1));
     -+				memcpy(entry->keydata, str, len+1);
     -+			}
     -+		} else if (!map->pool) {
     -+			entry = xmalloc(sizeof(*entry));
     -+		} else {
     -+			entry = mem_pool_alloc(map->pool, sizeof(*entry));
     -+		}
     - 		hashmap_entry_init(&entry->ent, strhash(str));
     --
     +-		hashmap_entry_init(&entry->ent, strhash(str));
     ++		return old;
     ++	}
     + 
      -		if (map->strdup_strings)
      -			key = map->pool ? mem_pool_strdup(map->pool, str)
      -					: xstrdup(str);
      -		entry->key = key;
     -+		entry->key = map->strdup_strings ? entry->keydata : str;
     - 		entry->value = data;
     - 		hashmap_add(&map->map, &entry->ent);
     +-		entry->value = data;
     +-		hashmap_add(&map->map, &entry->ent);
     ++	if (map->strdup_strings) {
     ++		if (!map->pool) {
     ++			FLEXPTR_ALLOC_STR(entry, key, str);
     ++		} else {
     ++			size_t len = st_add(strlen(str), 1); /* include NUL */
     ++			entry = mem_pool_alloc(map->pool,
     ++					       st_add(sizeof(*entry), len));
     ++			memcpy(entry + 1, str, len);
     ++			entry->key = (void *)(entry + 1);
     ++		}
     ++	} else if (!map->pool) {
     ++		entry = xmalloc(sizeof(*entry));
     ++	} else {
     ++		entry = mem_pool_alloc(map->pool, sizeof(*entry));
       	}
     +-	return old;
     ++	hashmap_entry_init(&entry->ent, strhash(str));
     ++	if (!map->strdup_strings)
     ++		entry->key = str;
     ++	entry->value = data;
     ++	hashmap_add(&map->map, &entry->ent);
     ++	return NULL;
     + }
     + 
     + struct strmap_entry *strmap_get_entry(struct strmap *map, const char *str)
      @@ strmap.c: void strmap_remove(struct strmap *map, const char *str, int free_value)
       		return;
       	if (free_value)
     @@ strmap.h: struct strmap_entry {
       	struct hashmap_entry ent;
       	const char *key;
       	void *value;
     -+	char keydata[FLEX_ARRAY]; /* if strdup_strings=1, key == &keydata[0] */
     ++	/* strmap_entry may be allocated extra space to store the key at end */
       };
       
       int cmp_strmap_entry(const void *hashmap_cmp_fn_data,
 13:  5f41fc63e5 = 13:  617926540b Use new HASHMAP_INIT macro to simplify hashmap initialization

Comments

Jeff King Nov. 5, 2020, 1:29 p.m. UTC | #1
On Thu, Nov 05, 2020 at 12:22:32AM +0000, Elijah Newren via GitGitGadget wrote:

> Changes since v3 (almost all of which were suggestions from Peff):
> 
>  * Fix pointer math due to platform differences in FLEX_ALLOC definition,
>    and a few other FLEXPTR_ALLOC_STR cleanups
>  * Define strmap_for_each_entry in terms of hashmap_for_each_entry instead
>    of lower level functions
>  * Use simpler _INIT macros
>  * Remove strset_check_and_add() from API as per Peff's suggestion
>    (merge-ort doesn't need it; we can add it later)
>  * Update comments and commit messages to update now obsolete statements due
>    to changes from earlier reviews

Thanks, this version looks good to me.

I think we might as well do this on top now:

-- >8 --
Subject: [PATCH] shortlog: drop custom strset implementation

We can use the strset recently added in strmap.h instead. Note that this
doesn't have a "check_and_add" function. We can easily write the same
thing using separate "contains" and "adds" calls. This is slightly less
efficient, in that it hashes the string twice, but for our use here it
shouldn't be a big deal either way.

I did leave it as a separate helper function, though, since we use it in
three separate spots (some of which are in the middle of a conditional).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/shortlog.c | 59 ++++++----------------------------------------
 1 file changed, 7 insertions(+), 52 deletions(-)

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 83f0a739b4..2d036ceec2 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -10,6 +10,7 @@
 #include "shortlog.h"
 #include "parse-options.h"
 #include "trailer.h"
+#include "strmap.h"
 
 static char const * const shortlog_usage[] = {
 	N_("git shortlog [<options>] [<revision-range>] [[--] <path>...]"),
@@ -169,60 +170,14 @@ static void read_from_stdin(struct shortlog *log)
 	strbuf_release(&oneline);
 }
 
-struct strset_item {
-	struct hashmap_entry ent;
-	char value[FLEX_ARRAY];
-};
-
-struct strset {
-	struct hashmap map;
-};
-
-#define STRSET_INIT { { NULL } }
-
-static int strset_item_hashcmp(const void *hash_data,
-			       const struct hashmap_entry *entry,
-			       const struct hashmap_entry *entry_or_key,
-			       const void *keydata)
+static int check_dup(struct strset *dups, const char *str)
 {
-	const struct strset_item *a, *b;
-
-	a = container_of(entry, const struct strset_item, ent);
-	if (keydata)
-		return strcmp(a->value, keydata);
-
-	b = container_of(entry_or_key, const struct strset_item, ent);
-	return strcmp(a->value, b->value);
-}
-
-/*
- * Adds "str" to the set if it was not already present; returns true if it was
- * already there.
- */
-static int strset_check_and_add(struct strset *ss, const char *str)
-{
-	unsigned int hash = strhash(str);
-	struct strset_item *item;
-
-	if (!ss->map.table)
-		hashmap_init(&ss->map, strset_item_hashcmp, NULL, 0);
-
-	if (hashmap_get_from_hash(&ss->map, hash, str))
+	if (strset_contains(dups, str))
 		return 1;
-
-	FLEX_ALLOC_STR(item, value, str);
-	hashmap_entry_init(&item->ent, hash);
-	hashmap_add(&ss->map, &item->ent);
+	strset_add(dups, str);
 	return 0;
 }
 
-static void strset_clear(struct strset *ss)
-{
-	if (!ss->map.table)
-		return;
-	hashmap_clear_and_free(&ss->map, struct strset_item, ent);
-}
-
 static void insert_records_from_trailers(struct shortlog *log,
 					 struct strset *dups,
 					 struct commit *commit,
@@ -253,7 +208,7 @@ static void insert_records_from_trailers(struct shortlog *log,
 		if (!parse_ident(log, &ident, value))
 			value = ident.buf;
 
-		if (strset_check_and_add(dups, value))
+		if (check_dup(dups, value))
 			continue;
 		insert_one_record(log, value, oneline);
 	}
@@ -291,7 +246,7 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 				      log->email ? "%aN <%aE>" : "%aN",
 				      &ident, &ctx);
 		if (!HAS_MULTI_BITS(log->groups) ||
-		    !strset_check_and_add(&dups, ident.buf))
+		    !check_dup(&dups, ident.buf))
 			insert_one_record(log, ident.buf, oneline_str);
 	}
 	if (log->groups & SHORTLOG_GROUP_COMMITTER) {
@@ -300,7 +255,7 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 				      log->email ? "%cN <%cE>" : "%cN",
 				      &ident, &ctx);
 		if (!HAS_MULTI_BITS(log->groups) ||
-		    !strset_check_and_add(&dups, ident.buf))
+		    !check_dup(&dups, ident.buf))
 			insert_one_record(log, ident.buf, oneline_str);
 	}
 	if (log->groups & SHORTLOG_GROUP_TRAILER) {
Junio C Hamano Nov. 5, 2020, 8:25 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

> Subject: [PATCH] shortlog: drop custom strset implementation
>
> We can use the strset recently added in strmap.h instead. Note that this
> doesn't have a "check_and_add" function. We can easily write the same
> thing using separate "contains" and "adds" calls. This is slightly less
> efficient, in that it hashes the string twice, but for our use here it
> shouldn't be a big deal either way.
>
> I did leave it as a separate helper function, though, since we use it in
> three separate spots (some of which are in the middle of a conditional).

It makes sense, but check_dup() sounds like its use pattern would be

	if (check_dup(it) == NO_DUP)
		add(it);

where it is more like "add it but just once".

By the way, is a strset a set or a bag?  If it makes the second add
an no-op, perhaps your check_dup() is what strset_add() should do
itself?  What builtin/shortlog.c::check_dup() does smells like it is
a workaround for the lack of a naturally-expected feature.
Jeff King Nov. 5, 2020, 9:17 p.m. UTC | #3
On Thu, Nov 05, 2020 at 12:25:14PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Subject: [PATCH] shortlog: drop custom strset implementation
> >
> > We can use the strset recently added in strmap.h instead. Note that this
> > doesn't have a "check_and_add" function. We can easily write the same
> > thing using separate "contains" and "adds" calls. This is slightly less
> > efficient, in that it hashes the string twice, but for our use here it
> > shouldn't be a big deal either way.
> >
> > I did leave it as a separate helper function, though, since we use it in
> > three separate spots (some of which are in the middle of a conditional).
> 
> It makes sense, but check_dup() sounds like its use pattern would be
> 
> 	if (check_dup(it) == NO_DUP)
> 		add(it);
> 
> where it is more like "add it but just once".

Hmph. I picked that name (versus just "contains") hoping it be general
enough to cover the dual operation.  Better name suggestions are
welcome. Though...

> By the way, is a strset a set or a bag?  If it makes the second add
> an no-op, perhaps your check_dup() is what strset_add() should do
> itself?  What builtin/shortlog.c::check_dup() does smells like it is
> a workaround for the lack of a naturally-expected feature.

Yes, if strset_add() returned an integer telling us whether the item was
already in the set, then we could use it directly. It's slightly
non-trivial to do, though, as it's built around strmap_put(), which
returns a pointer to the old value. But since we're a set and not a map,
that value is always NULL; we can't tell the difference between "I was
storing an old value which was NULL" and "I was not storing any value".

If strset were built around strintmap it could store "1" for "present in
the set". It somehow feels hacky, though, to induce extra value writes
just for the sake of working around the API.

Since strset is defined within strmap.c, perhaps it wouldn't be too bad
for it to be more intimate with the details here. I.e., to use
find_strmap_entry() directly, and if the value is not present, to create
a new hashmap entry. That would require hacking up strmap_put() into a
few helpers, but it's probably not too bad.

-Peff
Elijah Newren Nov. 5, 2020, 9:22 p.m. UTC | #4
On Thu, Nov 5, 2020 at 12:25 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > Subject: [PATCH] shortlog: drop custom strset implementation
> >
> > We can use the strset recently added in strmap.h instead. Note that this
> > doesn't have a "check_and_add" function. We can easily write the same
> > thing using separate "contains" and "adds" calls. This is slightly less
> > efficient, in that it hashes the string twice, but for our use here it
> > shouldn't be a big deal either way.
> >
> > I did leave it as a separate helper function, though, since we use it in
> > three separate spots (some of which are in the middle of a conditional).
>
> It makes sense, but check_dup() sounds like its use pattern would be
>
>         if (check_dup(it) == NO_DUP)
>                 add(it);
>
> where it is more like "add it but just once".
>
> By the way, is a strset a set or a bag?  If it makes the second add

strset is a set; there is no way to get duplicate entries.

> an no-op, perhaps your check_dup() is what strset_add() should do
> itself?  What builtin/shortlog.c::check_dup() does smells like it is
> a workaround for the lack of a naturally-expected feature.

Is the expectation that strset_add() would return a boolean for
whether a new entry was added?
Junio C Hamano Nov. 5, 2020, 10:15 p.m. UTC | #5
Elijah Newren <newren@gmail.com> writes:

>> It makes sense, but check_dup() sounds like its use pattern would be
>>
>>         if (check_dup(it) == NO_DUP)
>>                 add(it);
>>
>> where it is more like "add it but just once".
>>
>> By the way, is a strset a set or a bag?  If it makes the second add
>
> strset is a set; there is no way to get duplicate entries.
>
>> an no-op, perhaps your check_dup() is what strset_add() should do
>> itself?  What builtin/shortlog.c::check_dup() does smells like it is
>> a workaround for the lack of a naturally-expected feature.
>
> Is the expectation that strset_add() would return a boolean for
> whether a new entry was added?

It seems to be a reasonable expectation that the caller can tell if
the add was "already there and was a no-op", judging from what we
saw in the shortlog code, which was the first audience the API was
introduced to support.  It seems to benefit from it if it were
available, and has to work around the lack of it with check_dup()
wrapper.