diff mbox series

fetch: replace string-list used as a look-up table with a hashmap

Message ID xmqqin2sj6df.fsf@gitster-ct.c.googlers.com (mailing list archive)
State New, archived
Headers show
Series fetch: replace string-list used as a look-up table with a hashmap | expand

Commit Message

Junio C Hamano Sept. 26, 2018, 9:28 p.m. UTC
In find_non_local_tags() helper function (used to implement the
"follow tags"), we use string_list_has_string() on two string lists
as a way to see if a refname has already been processed, etc.

All this code predates more modern in-core lookup API like hashmap;
replace them with two hashmaps and one string list---the hashmaps
are used for look-ups and the string list is to keep the order of
items in the returned result stable (which is the only single thing
hashmap does worse than lookups on string-list).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * Just to remind ourselves that we talked about getting rid of
   string-list used as look-up tables by replacing them with hashmap
   but haven't seen much effort expended on it.  I think this is my
   first semi-serious use of hashmap, and the code is expected to be
   full of newbie mistakes, inefficiencies and ignorance of idioms;
   pointing out any of which is much appreciated.

 builtin/fetch.c | 116 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 90 insertions(+), 26 deletions(-)

Comments

Stefan Beller Sept. 26, 2018, 10:59 p.m. UTC | #1
On Wed, Sep 26, 2018 at 2:28 PM Junio C Hamano <gitster@pobox.com> wrote:

>
> +struct refname_hash_entry {
> +       struct hashmap_entry ent; /* must be the first member */

$ git grep "struct hashmap_entry" reveals that we have another
convention that we follow but not document, which is to stress
the importance of putting the hashmap_entry first. ;-)

> +       struct object_id oid;
> +       char refname[FLEX_ARRAY];
> +};
> +
> +static int refname_hash_entry_cmp(const void *hashmap_cmp_fn_data,
> +                                 const struct refname_hash_entry *e1,
> +                                 const struct refname_hash_entry *e2,
> +                                 const void *keydata)
> +{
> +       return strcmp(e1->refname, keydata ? keydata : e2->refname);
> +}

and later

    hashmap_init(...  (hashmap_cmp_fn)refname_hash_entry_cmp, ...);

I wonder if we'd want to stick to this style, as that seems to be the easiest
to do, and drop the efforts started in 55c965f3a2 (Merge branch
'sb/hashmap-cleanup', 2017-08-11), that avoids the cast in the init,
but puts the (implicit) casts in the _cmp function as we'd take
void pointers as the function signature and recast to a local variable.

> +
> +static struct refname_hash_entry *refname_hash_add(struct hashmap *map,
> +                                                  const char *refname,
> +                                                  const struct object_id *oid)
> +{
> +       struct refname_hash_entry *ent;
> +       size_t len = strlen(refname);
> +
> +       FLEX_ALLOC_MEM(ent, refname, refname, len);
> +       hashmap_entry_init(ent, memhash(refname, len));

memhash is preferred over strhash as we already know the length.
For a second I wondered why we use it instead of strhash

> +static int refname_hash_exists(struct hashmap *map, const char *refname)
> +{
> +       struct hashmap_entry key;
> +       size_t len = strlen(refname);
> +       hashmap_entry_init(&key, memhash(refname, len));

Here we could use strhash, but as they could change to
differing implementation, we keep using memhash.

>         /*
> -        * For all the tags in the remote_refs string list,
> +        * For all the tags in the remote_refs_list,
>          * add them to the list of refs to be fetched
>          */
> -       for_each_string_list_item(item, &remote_refs) {
> +       for_each_string_list_item(remote_ref_item, &remote_refs_list) {
> +               const char *refname = remote_ref_item->string;
> +               struct hashmap_entry key;
> +
> +               hashmap_entry_init(&key, memhash(refname, strlen(refname)));
> +               item = hashmap_get(&remote_refs, &key, refname);

> +               if (!item)
> +                       continue; /* can this happen??? */

I thought this could happen when we have hash collisions,
I convinced myself otherwise, as we pass the refname to hashmap_get
as the 'keydata', so when there is a hash collision we keep comparing
to the real value.

And as whenever we have an insert to remote_refs_list, we also
have a refname_hash_add to the remote_refs hashmap,
I think the return of NULL is not possible, and we could switch
it to BUG(...);


> All this code predates more modern in-core lookup API like hashmap;
> replace them with two hashmaps and one string list---the hashmaps
> are used for look-ups and the string list is to keep the order of
> items in the returned result stable (which is the only single thing
> hashmap does worse than lookups on string-list).
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---

I wonder if we want to add an API to the hashmap that allows for ordered
iteration (as hashmap_iter_* relies on the hashing function for its output)

I think it may be too early to do so, but there are 2 thoughts on how to do it:
* Each element keeps (prev/next) pointers to the previously inserted
  element and updates the next pointer when the next element is inserted.
  When removing elements we'd have to update the next and prev pointer
  of the adjacent elements.
  Then iteration can be done in insertion-order.
  Probably this would go into its own file ordered-hashmap.{c, h}
* provide an order function to hashmap_iter_init, which then
  orders according to that function before the first call of
  hashmap_iter_next returns.

>  * Just to remind ourselves that we talked about getting rid of
>    string-list used as look-up tables by replacing them with hashmap
>    but haven't seen much effort expended on it.  I think this is my
>    first semi-serious use of hashmap, and the code is expected to be
>    full of newbie mistakes, inefficiencies and ignorance of idioms;
>    pointing out any of which is much appreciated.

I could not find newbie mistakes, but gave my thoughts anyway.

Stefan
Jeff King Sept. 27, 2018, 5:34 a.m. UTC | #2
On Wed, Sep 26, 2018 at 02:28:28PM -0700, Junio C Hamano wrote:

> In find_non_local_tags() helper function (used to implement the
> "follow tags"), we use string_list_has_string() on two string lists
> as a way to see if a refname has already been processed, etc.
> 
> All this code predates more modern in-core lookup API like hashmap;
> replace them with two hashmaps and one string list---the hashmaps
> are used for look-ups and the string list is to keep the order of
> items in the returned result stable (which is the only single thing
> hashmap does worse than lookups on string-list).
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
> 
>  * Just to remind ourselves that we talked about getting rid of
>    string-list used as look-up tables by replacing them with hashmap
>    but haven't seen much effort expended on it.  I think this is my
>    first semi-serious use of hashmap, and the code is expected to be
>    full of newbie mistakes, inefficiencies and ignorance of idioms;
>    pointing out any of which is much appreciated.

As you probably noticed, there's a bit of boilerplate required to use a
hashmap. I had figured we could replace most of these with a single
"strmap" API to map a string to a void pointer (which is basically
what string-list gives you).

That would save on the boilerplate. But your solution here replaces a
"void *" pointer with an actual "struct object_id" member, which
improves type-safety. It also removes questions about memory lifetimes
(at the minor cost of copying the oids). So this path is probably better
if we don't mind a little extra code.

I do note that your struct just has the same information as "struct
ref":

> +struct refname_hash_entry {
> +	struct hashmap_entry ent; /* must be the first member */
> +	struct object_id oid;
> +	char refname[FLEX_ARRAY];
> +};

So yet another alternative here is to just define a single hashmap that
stores void pointers. That also throws away some type safety, but is
maybe conceptually simpler. I dunno. It's actually a pain to do that
with "struct hashmap" because it requires the caller to handle
allocation. An open-addressed hash table, as we use elsewhere (and in
khash.h) is a bit simpler, since it doesn't need to do any per-entry
malloc.

To be clear, I'm perfectly happy with the approach in your patch here.
I'm just musing on what might might be the least painful thing for doing
more of them.

-Peff
Jeff King Sept. 27, 2018, 5:54 a.m. UTC | #3
On Wed, Sep 26, 2018 at 03:59:08PM -0700, Stefan Beller wrote:

> > +struct refname_hash_entry {
> > +       struct hashmap_entry ent; /* must be the first member */
> 
> $ git grep "struct hashmap_entry" reveals that we have another
> convention that we follow but not document, which is to stress
> the importance of putting the hashmap_entry first. ;-)

One thing I've liked about the list.h implementation is that you can
store the list pointer anywhere in the struct, or even have the same
struct in multiple lists.

The only funny thing is that you have to "dereference" the iterator like
this:

  struct list_head *pos;
  struct actual_thing *item;
  ...
  item = list_entry(pos, struct actual_thing, list_member);

which is a minor pain, but it's reasonably hard to get it wrong.

I wonder if we could do the same here. The hashmap would only ever see
the "struct hashmap_entry", and then the caller would convert that back
to the actual type. I think we could even get away with not converting
existing callers; if the hashmap _is_ at the front, then that
list_entry() really just devolves to a cast. So as long as the struct
definition and the users of the struct agree, it would just work.

> > +static int refname_hash_entry_cmp(const void *hashmap_cmp_fn_data,
> > +                                 const struct refname_hash_entry *e1,
> > +                                 const struct refname_hash_entry *e2,
> > +                                 const void *keydata)
> > +{
> > +       return strcmp(e1->refname, keydata ? keydata : e2->refname);
> > +}
> 
> and later
> 
>     hashmap_init(...  (hashmap_cmp_fn)refname_hash_entry_cmp, ...);
> 
> I wonder if we'd want to stick to this style, as that seems to be the easiest
> to do, and drop the efforts started in 55c965f3a2 (Merge branch
> 'sb/hashmap-cleanup', 2017-08-11), that avoids the cast in the init,
> but puts the (implicit) casts in the _cmp function as we'd take
> void pointers as the function signature and recast to a local variable.

I think that casting the function pointer technically breaks the C
standard, though probably not in a way that is a problem on any real
systems.

The other thing about the "cast inside the cmp function" approach from
sb/hashmap-cleanup is that it is less likely to go stale. If we change
the definition of hashmap_cmp_fn, then any functions which are manually
cast will suppress the compiler errors. When the function takes void
pointers, the same can happen if "struct refname_hash_entry" is swapped
out for another struct, but IMHO that's a less likely error to make.

I admit that's just a gut feeling, though. It would be nice if we could
avoid casting altogether, but I think that puts us into macro territory
(which I'm not altogether opposed to, but it has its own drawbacks).

-Peff
Junio C Hamano Sept. 29, 2018, 6:31 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> On Wed, Sep 26, 2018 at 03:59:08PM -0700, Stefan Beller wrote:
>
>> > +struct refname_hash_entry {
>> > +       struct hashmap_entry ent; /* must be the first member */
>> 
>> $ git grep "struct hashmap_entry" reveals that we have another
>> convention that we follow but not document, which is to stress
>> the importance of putting the hashmap_entry first. ;-)
>
> One thing I've liked about the list.h implementation is that you can
> store the list pointer anywhere in the struct, or even have the same
> struct in multiple lists.
>
> The only funny thing is that you have to "dereference" the iterator like
> this:
>
>   struct list_head *pos;
>   struct actual_thing *item;
>   ...
>   item = list_entry(pos, struct actual_thing, list_member);
>
> which is a minor pain, but it's reasonably hard to get it wrong.
>
> I wonder if we could do the same here. The hashmap would only ever see
> the "struct hashmap_entry", and then the caller would convert that back
> to the actual type.

Hmph, how would hashmap_cmp_fn look like with that scheme?  It would
get one entry, another entry (or just the skeleton of it) and
optionally a separate keydata (if the second one is skeleton), and
the first two points at the embedded hashmap struct, not the
surrounding one.  The callback function is now responsible for
calling a hashmap_entry() macro that adjusts for the negative offset
like list_entry() does?

> I think we could even get away with not converting
> existing callers; if the hashmap _is_ at the front, then that
> list_entry() really just devolves to a cast.

Yes.

> So as long as the struct
> definition and the users of the struct agree, it would just work.

Yes, too.

Was it ever a consideration, when allowing struct list-head anywhere
in the enclosing struct, that it would allow an element to be on
more than one list?  Would it benefit us to be able to place an
element in multiple hashmaps because we do not have to have the
embedded hashmap_entry always at the beginning if we did this
change?
Junio C Hamano Sept. 30, 2018, 2:11 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

> So yet another alternative here is to just define a single hashmap that
> stores void pointers. That also throws away some type safety, but is
> maybe conceptually simpler. I dunno.

I was tempted to try that route, which would essentially give us "if
you are abusing string-list as a look-up table, and if either you do
not need to iterate over all the elements, or you do not care about
the order in which such an interation is made, then use this
instead" API that can more easily substitute string-list without
boilerplate.

But I am not sure if that is worth it.  Perhaps we may end up doing
so when we find the second thing to move from string-list to hashmap,
but not with this patch (and yes, there is another string-list user
in the same source file, which I think can reuse what I added with
this patch).

In any case, I expect to be offline more than half of the next week,
so before I forget, here is an updated patch.  Two changes since the
version posted earlier are:

 - remote_refs_list (not remote_refs which is a hashmap) is passed
   to string_list_clear() at the end.  This is a fix for an outright
   bug Ramsay noticed.

 - The callback function now takes (const void *) and casts its
   parameters to correct type before they are used, instead of
   casting the pointer to a function whose signature is wrong in the
   call to hashmap_init().  This was noticed by Stefan and I agree
   that doing it this way makes more sense.

-- >8 --
Subject: [PATCH v2] fetch: replace string-list used as a look-up table with a hashmap

In find_non_local_tags() helper function (used to implement the
"follow tags"), we use string_list_has_string() on two string lists
as a way to see if a refname has already been processed, etc.

All this code predates more modern in-core lookup API like hashmap;
replace them with two hashmaps and one string list---the hashmaps
are used for look-ups and the string list is to keep the order of
items in the returned result stable (which is the only single thing
hashmap does worse than lookups on string-list).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/fetch.c | 121 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 94 insertions(+), 27 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0696abfc2a..489531ec93 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -246,16 +246,76 @@ static int will_fetch(struct ref **head, const unsigned char *sha1)
 	return 0;
 }
 
+struct refname_hash_entry {
+	struct hashmap_entry ent; /* must be the first member */
+	struct object_id oid;
+	char refname[FLEX_ARRAY];
+};
+
+static int refname_hash_entry_cmp(const void *hashmap_cmp_fn_data,
+				  const void *e1_,
+				  const void *e2_,
+				  const void *keydata)
+{
+	const struct refname_hash_entry *e1 = e1_;
+	const struct refname_hash_entry *e2 = e2_;
+
+	return strcmp(e1->refname, keydata ? keydata : e2->refname);
+}
+
+static struct refname_hash_entry *refname_hash_add(struct hashmap *map,
+						   const char *refname,
+						   const struct object_id *oid)
+{
+	struct refname_hash_entry *ent;
+	size_t len = strlen(refname);
+
+	FLEX_ALLOC_MEM(ent, refname, refname, len);
+	hashmap_entry_init(ent, memhash(refname, len));
+	oidcpy(&ent->oid, oid);
+	hashmap_add(map, ent);
+	return ent;
+}
+
+static int add_one_refname(const char *refname,
+			   const struct object_id *oid,
+			   int flag, void *cbdata)
+{
+	struct hashmap *refname_map = cbdata;
+
+	(void) refname_hash_add(refname_map, refname, oid);
+	return 0;
+}
+
+static void refname_hash_init(struct hashmap *map)
+{
+	hashmap_init(map, refname_hash_entry_cmp, NULL, 0);
+}
+
+static int refname_hash_exists(struct hashmap *map, const char *refname)
+{
+	struct hashmap_entry key;
+	size_t len = strlen(refname);
+	hashmap_entry_init(&key, memhash(refname, len));
+
+	return !!hashmap_get(map, &key, refname);
+}
+
 static void find_non_local_tags(const struct ref *refs,
 				struct ref **head,
 				struct ref ***tail)
 {
-	struct string_list existing_refs = STRING_LIST_INIT_DUP;
-	struct string_list remote_refs = STRING_LIST_INIT_NODUP;
+	struct hashmap existing_refs;
+	struct hashmap remote_refs;
+	struct string_list remote_refs_list = STRING_LIST_INIT_NODUP;
+	struct string_list_item *remote_ref_item;
 	const struct ref *ref;
-	struct string_list_item *item = NULL;
+	struct refname_hash_entry *item = NULL;
 
-	for_each_ref(add_existing, &existing_refs);
+	refname_hash_init(&existing_refs);
+	refname_hash_init(&remote_refs);
+
+	for_each_ref(add_one_refname, &existing_refs);
 	for (ref = refs; ref; ref = ref->next) {
 		if (!starts_with(ref->name, "refs/tags/"))
 			continue;
@@ -271,10 +331,10 @@ static void find_non_local_tags(const struct ref *refs,
 			    !has_object_file_with_flags(&ref->old_oid,
 							OBJECT_INFO_QUICK) &&
 			    !will_fetch(head, ref->old_oid.hash) &&
-			    !has_sha1_file_with_flags(item->util,
+			    !has_sha1_file_with_flags(item->oid.hash,
 						      OBJECT_INFO_QUICK) &&
-			    !will_fetch(head, item->util))
-				item->util = NULL;
+			    !will_fetch(head, item->oid.hash))
+				oidclr(&item->oid);
 			item = NULL;
 			continue;
 		}
@@ -286,48 +346,55 @@ static void find_non_local_tags(const struct ref *refs,
 		 * fetch.
 		 */
 		if (item &&
-		    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
-		    !will_fetch(head, item->util))
-			item->util = NULL;
+		    !has_sha1_file_with_flags(item->oid.hash, OBJECT_INFO_QUICK) &&
+		    !will_fetch(head, item->oid.hash))
+			oidclr(&item->oid);
 
 		item = NULL;
 
 		/* skip duplicates and refs that we already have */
-		if (string_list_has_string(&remote_refs, ref->name) ||
-		    string_list_has_string(&existing_refs, ref->name))
+		if (refname_hash_exists(&remote_refs, ref->name) ||
+		    refname_hash_exists(&existing_refs, ref->name))
 			continue;
 
-		item = string_list_insert(&remote_refs, ref->name);
-		item->util = (void *)&ref->old_oid;
+		item = refname_hash_add(&remote_refs, ref->name, &ref->old_oid);
+		string_list_insert(&remote_refs_list, ref->name);
 	}
-	string_list_clear(&existing_refs, 1);
+	hashmap_free(&existing_refs, 1);
 
 	/*
 	 * We may have a final lightweight tag that needs to be
 	 * checked to see if it needs fetching.
 	 */
 	if (item &&
-	    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
-	    !will_fetch(head, item->util))
-		item->util = NULL;
+	    !has_sha1_file_with_flags(item->oid.hash, OBJECT_INFO_QUICK) &&
+	    !will_fetch(head, item->oid.hash))
+		oidclr(&item->oid);
 
 	/*
-	 * For all the tags in the remote_refs string list,
+	 * For all the tags in the remote_refs_list,
 	 * add them to the list of refs to be fetched
 	 */
-	for_each_string_list_item(item, &remote_refs) {
+	for_each_string_list_item(remote_ref_item, &remote_refs_list) {
+		const char *refname = remote_ref_item->string;
+		struct hashmap_entry key;
+
+		hashmap_entry_init(&key, memhash(refname, strlen(refname)));
+		item = hashmap_get(&remote_refs, &key, refname);
+		if (!item)
+			continue; /* can this happen??? */
+
 		/* Unless we have already decided to ignore this item... */
-		if (item->util)
-		{
-			struct ref *rm = alloc_ref(item->string);
-			rm->peer_ref = alloc_ref(item->string);
-			oidcpy(&rm->old_oid, item->util);
+		if (!is_null_oid(&item->oid)) {
+			struct ref *rm = alloc_ref(item->refname);
+			rm->peer_ref = alloc_ref(item->refname);
+			oidcpy(&rm->old_oid, &item->oid);
 			**tail = rm;
 			*tail = &rm->next;
 		}
 	}
-
-	string_list_clear(&remote_refs, 0);
+	hashmap_free(&remote_refs, 1);
+	string_list_clear(&remote_refs_list, 0);
 }
 
 static struct ref *get_ref_map(struct remote *remote,
Jeff King Sept. 30, 2018, 4:57 a.m. UTC | #6
On Sat, Sep 29, 2018 at 11:31:08AM -0700, Junio C Hamano wrote:

> > The only funny thing is that you have to "dereference" the iterator like
> > this:
> >
> >   struct list_head *pos;
> >   struct actual_thing *item;
> >   ...
> >   item = list_entry(pos, struct actual_thing, list_member);
> >
> > which is a minor pain, but it's reasonably hard to get it wrong.
> >
> > I wonder if we could do the same here. The hashmap would only ever see
> > the "struct hashmap_entry", and then the caller would convert that back
> > to the actual type.
> 
> Hmph, how would hashmap_cmp_fn look like with that scheme?  It would
> get one entry, another entry (or just the skeleton of it) and
> optionally a separate keydata (if the second one is skeleton), and
> the first two points at the embedded hashmap struct, not the
> surrounding one.  The callback function is now responsible for
> calling a hashmap_entry() macro that adjusts for the negative offset
> like list_entry() does?

Exactly. The comparison functions currently look something like this:

  int foo_hash_cmp(const void *data,
		   const void *e1, const void *e2,
		   const void *keydata)
  {
	const struct foo *f1 = (const struct foo *)e1;
	const struct foo *f2 = (const struct foo *)e2;
	return strcmp(foo->field, keydata ? keydata : foo->field);
  }

(this is using the correct function signature; you can drop the casts if
you violate that, but IMHO this style is more maintainable in the long
run). With the hashmap_entry at an arbitrary position, the body is:

  const struct foo *f1 = hash_entry(e1, struct foo, hash_field);
  const struct foo *f2 = hash_entry(e2, struct foo, hash_field);
  return strcmp(foo->field, keydata ? keydata : foo->field);

where hash_field is the member of "struct foo" that holds the "struct
hashmap_entry". So that's not so different, but there is an interesting
implication here. The comparison callback has to know which
hashmap_entry name to use! So if you have a struct that can be in two
hashes, like:

  struct foo {
    char *name;
    struct hashmap_entry h1;
    struct hashmap_entry h2;
  };

then you cannot use a single foo_hash_cmp() function. You have to use a
different one for each field. Which seems kind of nasty and error-prone.

So I dunno. Maybe this is not a good direction. I have to admit that I'm
not wild about our hashmap implementation in general:

  - the way it handles allocations is often awkward, or causes you to
    write confusing boilerplate

  - it's not type-safe

  - it seems slower than open-addressed alternatives (e.g., over in [1]
    we determined that oidset can be made almost twice as fast using
    khash)

  - the chained implementation in hashmap.c is probably better for cases
    where there's a lot of deletions, because removing from the hashmap
    truly removes everything (whereas with open-addressed schemes, we
    either didn't implement deletion at all, or it sets a marker which
    allows the item to be dropped next time the hash is resized)

[1] https://public-inbox.org/git/20180814015842.GA27055@sigill.intra.peff.net/

> Was it ever a consideration, when allowing struct list-head anywhere
> in the enclosing struct, that it would allow an element to be on
> more than one list?

I don't know the history of that list code in that kernel, but I always
assumed that was one of the goals. We don't use it yet, but there are
places where we could (e.g., packed_git is part of the regular packed
list, as well as the mru; the former is implemented as a singly-linked
list, but that's mostly historical).

It also allows us to have an item in a list and a hashmap (since if they
both needed to be at the start of the struct, that wouldn't work). We do
use that ability for delta_base_cache_entry.

> Would it benefit us to be able to place an
> element in multiple hashmaps because we do not have to have the
> embedded hashmap_entry always at the beginning if we did this
> change?

I'm not sure. I think it would allow us to more aggressively stick the
hashmap_entry inside existing structs. For instance, in the patch from
this thread, could we actually shove the hashmap_entry structs into
"struct ref", and save having to allocate the extra refname_hash struct?

I guess that pollutes "struct ref", though. Unless we generically say
"it can be a part of up to 3 hashes" and provide struct hashmap_entry
h1, struct hashmap_entry h2, etc. And then your new code does:

  struct hashmap existing_refs;
  struct hashmap remote_refs;

  /*
   * These need to be different compare functions to access the
   * h1 and h2 fields in "struct ref"!
   */
  hashmap_init(&existing_refs, ref_hash_cmp_h1);
  hashmap_init(&remote_refs, ref_hash_cmp_h2);

  ...

  hashmap_add(&existing_refs, &some_ref->h1);

But that seems pretty error-prone (not just confusing h1 and h2 here,
but how do we know that some other part of the code isn't using "h1" for
its own hash already?)

So again, maybe this is just a bad direction.

At least with an open-addressed scheme, the hash table itself contains
everything, and you can just store pointers to the existing refs.

-Peff
diff mbox series

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0696abfc2a..5d1fd8dd49 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -246,16 +246,73 @@  static int will_fetch(struct ref **head, const unsigned char *sha1)
 	return 0;
 }
 
+struct refname_hash_entry {
+	struct hashmap_entry ent; /* must be the first member */
+	struct object_id oid;
+	char refname[FLEX_ARRAY];
+};
+
+static int refname_hash_entry_cmp(const void *hashmap_cmp_fn_data,
+				  const struct refname_hash_entry *e1,
+				  const struct refname_hash_entry *e2,
+				  const void *keydata)
+{
+	return strcmp(e1->refname, keydata ? keydata : e2->refname);
+}
+
+static struct refname_hash_entry *refname_hash_add(struct hashmap *map,
+						   const char *refname,
+						   const struct object_id *oid)
+{
+	struct refname_hash_entry *ent;
+	size_t len = strlen(refname);
+
+	FLEX_ALLOC_MEM(ent, refname, refname, len);
+	hashmap_entry_init(ent, memhash(refname, len));
+	oidcpy(&ent->oid, oid);
+	hashmap_add(map, ent);
+	return ent;
+}
+
+static int add_one_refname(const char *refname,
+			   const struct object_id *oid,
+			   int flag, void *cbdata)
+{
+	struct hashmap *refname_map = cbdata;
+
+	(void) refname_hash_add(refname_map, refname, oid);
+	return 0;
+}
+
+static void refname_hash_init(struct hashmap *map)
+{
+	hashmap_init(map, (hashmap_cmp_fn)refname_hash_entry_cmp, NULL, 0);
+}
+
+static int refname_hash_exists(struct hashmap *map, const char *refname)
+{
+	struct hashmap_entry key;
+	size_t len = strlen(refname);
+	hashmap_entry_init(&key, memhash(refname, len));
+
+	return !!hashmap_get(map, &key, refname);
+}
+
 static void find_non_local_tags(const struct ref *refs,
 				struct ref **head,
 				struct ref ***tail)
 {
-	struct string_list existing_refs = STRING_LIST_INIT_DUP;
-	struct string_list remote_refs = STRING_LIST_INIT_NODUP;
+	struct hashmap existing_refs;
+	struct hashmap remote_refs;
+	struct string_list remote_refs_list = STRING_LIST_INIT_NODUP;
+	struct string_list_item *remote_ref_item;
 	const struct ref *ref;
-	struct string_list_item *item = NULL;
+	struct refname_hash_entry *item = NULL;
 
-	for_each_ref(add_existing, &existing_refs);
+	refname_hash_init(&existing_refs);
+	refname_hash_init(&remote_refs);
+
+	for_each_ref(add_one_refname, &existing_refs);
 	for (ref = refs; ref; ref = ref->next) {
 		if (!starts_with(ref->name, "refs/tags/"))
 			continue;
@@ -271,10 +328,10 @@  static void find_non_local_tags(const struct ref *refs,
 			    !has_object_file_with_flags(&ref->old_oid,
 							OBJECT_INFO_QUICK) &&
 			    !will_fetch(head, ref->old_oid.hash) &&
-			    !has_sha1_file_with_flags(item->util,
+			    !has_sha1_file_with_flags(item->oid.hash,
 						      OBJECT_INFO_QUICK) &&
-			    !will_fetch(head, item->util))
-				item->util = NULL;
+			    !will_fetch(head, item->oid.hash))
+				oidclr(&item->oid);
 			item = NULL;
 			continue;
 		}
@@ -286,47 +343,54 @@  static void find_non_local_tags(const struct ref *refs,
 		 * fetch.
 		 */
 		if (item &&
-		    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
-		    !will_fetch(head, item->util))
-			item->util = NULL;
+		    !has_sha1_file_with_flags(item->oid.hash, OBJECT_INFO_QUICK) &&
+		    !will_fetch(head, item->oid.hash))
+			oidclr(&item->oid);
 
 		item = NULL;
 
 		/* skip duplicates and refs that we already have */
-		if (string_list_has_string(&remote_refs, ref->name) ||
-		    string_list_has_string(&existing_refs, ref->name))
+		if (refname_hash_exists(&remote_refs, ref->name) ||
+		    refname_hash_exists(&existing_refs, ref->name))
 			continue;
 
-		item = string_list_insert(&remote_refs, ref->name);
-		item->util = (void *)&ref->old_oid;
+		item = refname_hash_add(&remote_refs, ref->name, &ref->old_oid);
+		string_list_insert(&remote_refs_list, ref->name);
 	}
-	string_list_clear(&existing_refs, 1);
+	hashmap_free(&existing_refs, 1);
 
 	/*
 	 * We may have a final lightweight tag that needs to be
 	 * checked to see if it needs fetching.
 	 */
 	if (item &&
-	    !has_sha1_file_with_flags(item->util, OBJECT_INFO_QUICK) &&
-	    !will_fetch(head, item->util))
-		item->util = NULL;
+	    !has_sha1_file_with_flags(item->oid.hash, OBJECT_INFO_QUICK) &&
+	    !will_fetch(head, item->oid.hash))
+		oidclr(&item->oid);
 
 	/*
-	 * For all the tags in the remote_refs string list,
+	 * For all the tags in the remote_refs_list,
 	 * add them to the list of refs to be fetched
 	 */
-	for_each_string_list_item(item, &remote_refs) {
+	for_each_string_list_item(remote_ref_item, &remote_refs_list) {
+		const char *refname = remote_ref_item->string;
+		struct hashmap_entry key;
+
+		hashmap_entry_init(&key, memhash(refname, strlen(refname)));
+		item = hashmap_get(&remote_refs, &key, refname);
+		if (!item)
+			continue; /* can this happen??? */
+
 		/* Unless we have already decided to ignore this item... */
-		if (item->util)
-		{
-			struct ref *rm = alloc_ref(item->string);
-			rm->peer_ref = alloc_ref(item->string);
-			oidcpy(&rm->old_oid, item->util);
+		if (!is_null_oid(&item->oid)) {
+			struct ref *rm = alloc_ref(item->refname);
+			rm->peer_ref = alloc_ref(item->refname);
+			oidcpy(&rm->old_oid, &item->oid);
 			**tail = rm;
 			*tail = &rm->next;
 		}
 	}
-
+	hashmap_free(&remote_refs, 1);
 	string_list_clear(&remote_refs, 0);
 }