
fast-import: replace custom hash with hashmap.c

Message ID 20200403121508.GA638328@coredump.intra.peff.net (mailing list archive)
State New, archived
Series fast-import: replace custom hash with hashmap.c

Commit Message

Jeff King April 3, 2020, 12:15 p.m. UTC
We use a custom hash in fast-import to store the set of objects we've
imported so far. It has a fixed set of 2^16 buckets and chains any
collisions with a linked list. As the number of objects grows larger
than that, the load factor increases and we degrade to O(n) lookups and
O(n^2) insertions.
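For example, by the time we reach the 2^24-object test below, each of
the 2^16 buckets holds an average chain of 2^24 / 2^16 = 256 entries,
so every insertion first scans a chain of roughly that length looking
for an existing match.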

We can scale better by using our hashmap.c implementation, which will
resize the bucket count as we grow. This does incur an extra memory cost
of 8 bytes per object, as hashmap stores the integer hash value for each
entry in its hashmap_entry struct (which we really don't care about
here, because we're just reusing the embedded object hash). But I think
the numbers below justify this (and our per-object memory cost is
already much higher).
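
For context, the entry we now embed in each object looks roughly like
this (per hashmap.h; shown here only for illustration):

  struct hashmap_entry {
  	struct hashmap_entry *next;	/* bucket chain; same role as the old "next" */
  	unsigned int hash;		/* cached hash code: the extra per-entry cost */
  };

The old table needed only a single "next" pointer per object, so with
alignment the embedded entry costs 8 more bytes on a 64-bit build.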

I also looked at using khash, but it seemed to perform slightly worse
than hashmap at all sizes, and worse even than the existing code for
small sizes. It's also awkward to use here, because we want to look up a
"struct object_entry" from a "struct object_id", and it doesn't handle
mismatched keys as well. Making a mapping of object_id to object_entry
would be more natural, but that would require pulling the embedded oid
out of the object_entry or incurring an extra 32 bytes per object.
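
For reference, such a mapping might have been declared along these
lines (a sketch only, not the actual khash experiment; it assumes
git's khash.h plus the usual oidhash()/oideq() helpers):

  /* the key is stored by value, duplicating the 32-byte object_id */
  static inline unsigned int oid_hash_by_value(struct object_id oid)
  {
  	return oidhash(&oid);
  }
  static inline int oid_eq_by_value(struct object_id a, struct object_id b)
  {
  	return oideq(&a, &b);
  }
  KHASH_INIT(object_by_oid, struct object_id, struct object_entry *, 1,
  	   oid_hash_by_value, oid_eq_by_value)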

In a synthetic test creating as many cheap, tiny objects as possible

  perl -e '
      my $bits = shift;
      my $nr = 2**$bits;

      for (my $i = 0; $i < $nr; $i++) {
              print "blob\n";
              print "data 4\n";
              print pack("N", $i);
      }
  ' $bits | git fast-import

I got these results:

  nr_objects   master       khash      hashmap
  2^20         0m4.317s     0m5.109s   0m3.890s
  2^21         0m10.204s    0m9.702s   0m7.933s
  2^22         0m27.159s    0m17.911s  0m16.751s
  2^23         1m19.038s    0m35.080s  0m31.963s
  2^24         4m18.766s    1m10.233s  1m6.793s

which points to hashmap as the winner. We didn't have any perf tests for
fast-export or fast-import, so I added one as a more real-world case.
It uses an export without blobs since that's significantly cheaper than
a full one, but still is an interesting case people might use (e.g., for
rewriting history). It will emphasize this change in some ways (as a
percentage we spend more time making objects and less shuffling blob
bytes around) and less in others (the total object count is lower).

Here are the results for linux.git:

  Test                        HEAD^                 HEAD
  ----------------------------------------------------------------------------
  9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
  9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%

It only has ~5.2M commits and trees, so this is a larger effect than I
expected (the 2^23 case above only improved by 50s or so, but here we
gained almost 90s). This is probably due to actually performing more
object lookups in a real import with trees and commits, as opposed to
just dumping a bunch of blobs into a pack.
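(For scale: ~5.2M entries spread over the old 2^16 buckets works out
to chains of roughly 80 entries each.)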

Signed-off-by: Jeff King <peff@peff.net>
---
 fast-import.c                      | 62 +++++++++++++++++++-----------
 t/perf/p9300-fast-import-export.sh | 23 +++++++++++
 2 files changed, 62 insertions(+), 23 deletions(-)
 create mode 100755 t/perf/p9300-fast-import-export.sh

Comments

René Scharfe April 3, 2020, 6:53 p.m. UTC | #1
Am 03.04.20 um 14:15 schrieb Jeff King:
> We use a custom hash in fast-import to store the set of objects we've
> imported so far. It has a fixed set of 2^16 buckets and chains any
> collisions with a linked list. As the number of objects grows larger
> than that, the load factor increases and we degrade to O(n) lookups and
> O(n^2) insertions.
>
> We can scale better by using our hashmap.c implementation, which will
> resize the bucket count as we grow. This does incur an extra memory cost
> of 8 bytes per object, as hashmap stores the integer hash value for each
> entry in its hashmap_entry struct (which we really don't care about
> here, because we're just reusing the embedded object hash). But I think
> the numbers below justify this (and our per-object memory cost is
> already much higher).
>
> I also looked at using khash, but it seemed to perform slightly worse
> than hashmap at all sizes, and worse even than the existing code for
> small sizes. It's also awkward to use here, because we want to look up a
> "struct object_entry" from a "struct object_id", and it doesn't handle
> mismatched keys as well. Making a mapping of object_id to object_entry
> would be more natural, but that would require pulling the embedded oid
> out of the object_entry or incurring an extra 32 bytes per object.
>
> In a synthetic test creating as many cheap, tiny objects as possible
>
>   perl -e '
>       my $bits = shift;
>       my $nr = 2**$bits;
>
>       for (my $i = 0; $i < $nr; $i++) {
>               print "blob\n";
>               print "data 4\n";
>               print pack("N", $i);
>       }
>   ' $bits | git fast-import
>
> I got these results:
>
>   nr_objects   master       khash      hashmap
>   2^20         0m4.317s     0m5.109s   0m3.890s
>   2^21         0m10.204s    0m9.702s   0m7.933s
>   2^22         0m27.159s    0m17.911s  0m16.751s
>   2^23         1m19.038s    0m35.080s  0m31.963s
>   2^24         4m18.766s    1m10.233s  1m6.793s

I get similar numbers.

>
> which points to hashmap as the winner. We didn't have any perf tests for
> fast-export or fast-import, so I added one as a more real-world case.
> It uses an export without blobs since that's significantly cheaper than
> a full one, but still is an interesting case people might use (e.g., for
> rewriting history). It will emphasize this change in some ways (as a
> percentage we spend more time making objects and less shuffling blob
> bytes around) and less in others (the total object count is lower).
>
> Here are the results for linux.git:
>
>   Test                        HEAD^                 HEAD
>   ----------------------------------------------------------------------------
>   9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
>   9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%

My numbers look a bit different for this, not sure why:

Test                        origin/master         HEAD
---------------------------------------------------------------------------
9300.1: export (no-blobs)   69.36(66.44+1.56)     67.89(66.07+1.56) -2.1%
9300.2: import (no-blobs)   295.10(293.83+1.19)   283.83(282.91+0.91) -3.8%

They are still in favor of the patch, just not as strongly as yours.

>
> It only has ~5.2M commits and trees, so this is a larger effect than I
> expected (the 2^23 case above only improved by 50s or so, but here we
> gained almost 90s). This is probably due to actually performing more
> object lookups in a real import with trees and commits, as opposed to
> just dumping a bunch of blobs into a pack.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  fast-import.c                      | 62 +++++++++++++++++++-----------
>  t/perf/p9300-fast-import-export.sh | 23 +++++++++++
>  2 files changed, 62 insertions(+), 23 deletions(-)
>  create mode 100755 t/perf/p9300-fast-import-export.sh
>

I have to warn you up front: I'm not familiar with hashmap or
fast-import, so I'll focus on stylistic topics -- bikeshedding.  The
actual change looks reasonable to me overall and the performance is
convincing.

> diff --git a/fast-import.c b/fast-import.c
> index 202dda11a6..0ef6defc10 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -39,12 +39,28 @@
>
>  struct object_entry {
>  	struct pack_idx_entry idx;
> -	struct object_entry *next;
> +	struct hashmap_entry ent;
>  	uint32_t type : TYPE_BITS,
>  		pack_id : PACK_ID_BITS,
>  		depth : DEPTH_BITS;
>  };
>
> +static int object_entry_hashcmp(const void *map_data,
> +				const struct hashmap_entry *eptr,
> +				const struct hashmap_entry *entry_or_key,
> +				const void *keydata)
> +{
> +	const struct object_id *oid = keydata;
> +	const struct object_entry *e1, *e2;
> +
> +	e1 = container_of(eptr, const struct object_entry, ent);
> +	if (oid)
> +		return oidcmp(&e1->idx.oid, oid);
> +
> +	e2 = container_of(entry_or_key, const struct object_entry, ent);
> +	return oidcmp(&e1->idx.oid, &e2->idx.oid);

Other hashmap users express this more tersely, similar to this:

	const struct object_entry *e1, *e2;

	e1 = container_of(eptr, const struct object_entry, ent);
	e2 = container_of(entry_or_key, const struct object_entry, ent);
	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);

> +}
> +
>  struct object_entry_pool {
>  	struct object_entry_pool *next_pool;
>  	struct object_entry *next_free;
> @@ -178,7 +194,7 @@ static off_t pack_size;
>  /* Table of objects we've written. */
>  static unsigned int object_entry_alloc = 5000;
>  static struct object_entry_pool *blocks;
> -static struct object_entry *object_table[1 << 16];
> +static struct hashmap object_table;
>  static struct mark_set *marks;
>  static const char *export_marks_file;
>  static const char *import_marks_file;
> @@ -455,44 +471,42 @@ static struct object_entry *new_object(struct object_id *oid)
>
>  static struct object_entry *find_object(struct object_id *oid)
>  {
> -	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
> -	struct object_entry *e;
> -	for (e = object_table[h]; e; e = e->next)
> -		if (oideq(oid, &e->idx.oid))
> -			return e;
> +	struct hashmap_entry lookup_entry, *e;
> +
> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> +	e = hashmap_get(&object_table, &lookup_entry, oid);
> +	if (e)
> +		return container_of(e, struct object_entry, ent);

(Worth repeating:) No casting trick, yay!

>  	return NULL;
>  }
>
>  static struct object_entry *insert_object(struct object_id *oid)
>  {
> -	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
> -	struct object_entry *e = object_table[h];
> +	struct object_entry *e;
> +	struct hashmap_entry lookup_entry, *hashent;
>
> -	while (e) {
> -		if (oideq(oid, &e->idx.oid))
> -			return e;
> -		e = e->next;
> -	}
> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
> +	if (hashent)
> +		return container_of(hashent, struct object_entry, ent);

That duplicates find_object()...

>
>  	e = new_object(oid);
> -	e->next = object_table[h];
>  	e->idx.offset = 0;
> -	object_table[h] = e;
> +	e->ent.hash = lookup_entry.hash;

... except for this part.  Would it make sense to replace this line with
a call to hashmap_entry_init()?  Seems cleaner to me.  It would look
like this:

	struct object_entry *e = find_object(oid);

	if (e)
		return e;

	e = new_object(oid);
	e->idx.offset = 0;
	hashmap_entry_init(&e->ent, oidhash(oid));
	hashmap_add(&object_table, &e->ent);
	return e;

> +	hashmap_add(&object_table, &e->ent);
>  	return e;
>  }
>
>  static void invalidate_pack_id(unsigned int id)
>  {
> -	unsigned int h;
>  	unsigned long lu;
>  	struct tag *t;
> +	struct hashmap_iter iter;
> +	struct object_entry *e;
>
> -	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> -		struct object_entry *e;
> -
> -		for (e = object_table[h]; e; e = e->next)
> -			if (e->pack_id == id)
> -				e->pack_id = MAX_PACK_ID;
> +	hashmap_for_each_entry(&object_table, &iter, e, ent) {
> +		if (e->pack_id == id)
> +			e->pack_id = MAX_PACK_ID;
>  	}
>
>  	for (lu = 0; lu < branch_table_sz; lu++) {
> @@ -3511,6 +3525,8 @@ int cmd_main(int argc, const char **argv)
>  	avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
>  	marks = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
>
> +	hashmap_init(&object_table, object_entry_hashcmp, NULL, 0);
> +
>  	/*
>  	 * We don't parse most options until after we've seen the set of
>  	 * "feature" lines at the start of the stream (which allows the command
> diff --git a/t/perf/p9300-fast-import-export.sh b/t/perf/p9300-fast-import-export.sh
> new file mode 100755
> index 0000000000..c3c743d04a
> --- /dev/null
> +++ b/t/perf/p9300-fast-import-export.sh
> @@ -0,0 +1,23 @@
> +#!/bin/sh
> +
> +test_description='test fast-import and fast-export performance'
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +# Use --no-data here to produce a vastly smaller export file.
> +# This is much cheaper to work with but should still exercise
> +# fast-import pretty well (we'll still process all commits and
> +# trees, which account for 60% or more of objects in most repos).
> +#
> +# Use --reencode to avoid the default of aborting on non-utf8 commits,
> +# which lets this test run against a wider variety of sample repos.
> +test_perf 'export (no-blobs)' '
> +	git fast-export --reencode=yes --no-data HEAD >export
> +'
> +
> +test_perf 'import (no-blobs)' '
> +	git fast-import --force <export
> +'
> +
> +test_done
>
Jeff King April 5, 2020, 6:59 p.m. UTC | #2
On Fri, Apr 03, 2020 at 08:53:30PM +0200, René Scharfe wrote:

> > Here are the results for linux.git:
> >
> >   Test                        HEAD^                 HEAD
> >   ----------------------------------------------------------------------------
> >   9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
> >   9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%
> 
> My numbers look a bit different for this, not sure why:
> 
> Test                        origin/master         HEAD
> ---------------------------------------------------------------------------
> 9300.1: export (no-blobs)   69.36(66.44+1.56)     67.89(66.07+1.56) -2.1%
> 9300.2: import (no-blobs)   295.10(293.83+1.19)   283.83(282.91+0.91) -3.8%
> 
> They are still in favor of the patch, just not as strongly as yours.

Interesting. I re-ran mine just to double check, and got:

Test                        HEAD^                 HEAD                      
----------------------------------------------------------------------------
9300.1: export (no-blobs)   63.11(62.31+0.79)     62.00(61.21+0.79) -1.8%   
9300.2: import (no-blobs)   261.79(261.06+0.72)   188.55(187.87+0.66) -28.0%

(for some reason my whole machine is faster today; maybe I had closed
slack).

This is on a fresh-ish clone of linux.git.

> > +	e1 = container_of(eptr, const struct object_entry, ent);
> > +	if (oid)
> > +		return oidcmp(&e1->idx.oid, oid);
> > +
> > +	e2 = container_of(entry_or_key, const struct object_entry, ent);
> > +	return oidcmp(&e1->idx.oid, &e2->idx.oid);
> 
> Other hashmap users express this more tersely, similar to this:
> 
> 	const struct object_entry *e1, *e2;
> 
> 	e1 = container_of(eptr, const struct object_entry, ent);
> 	e2 = container_of(entry_or_key, const struct object_entry, ent);
> 	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);

I wasn't sure if we'd ever see entry_or_key as NULL, in which case the
second container_of() would be undefined behavior. There's a
container_of_or_null() for this, but it seemed simpler to just avoid the
deref entirely if we didn't need it.

I think in practice we won't ever pass NULL, though. Even a
hashmap_get() needs to pass in a hashmap_entry with the hash value in
it. Though I think that's technically _also_ undefined behavior. If I
have a struct where the entry is not at the beginning, like:

  struct foo {
    const char *bar;
    struct hashmap_entry ent;
  };

then:

  e2 = container_of(ptr_to_entry, struct foo, ent);

is going to form a pointer at 8 bytes before ptr_to_entry. If it really
is a "struct hashmap_entry" on the stack, then it's pointing at garbage.

We don't dereference it, so it's likely fine in practice, but even
computing such a pointer does violate the standard.
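
Spelled out as a stand-alone sketch (with container_of() written out
roughly as git-compat-util.h defines it), the situation is:

  #include <stddef.h>

  #define container_of(ptr, type, member) \
  	((type *)((char *)(ptr) - offsetof(type, member)))

  struct hashmap_entry { struct hashmap_entry *next; unsigned int hash; };

  struct foo {
  	const char *bar;		/* "ent" is NOT the first member */
  	struct hashmap_entry ent;
  };

  static void lookup_demo(void)
  {
  	struct hashmap_entry key;	/* bare entry, as hashmap_get() callers build */
  	/*
  	 * This forms an address offsetof(struct foo, ent) bytes before
  	 * "key", which lies inside no struct foo.  We never dereference
  	 * it when keydata is used instead, but merely computing it is
  	 * what the standard does not guarantee.
  	 */
  	struct foo *f = container_of(&key, struct foo, ent);
  	(void)f;
  }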

> > +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> > +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
> > +	if (hashent)
> > +		return container_of(hashent, struct object_entry, ent);
> 
> That duplicates find_object()...

Yes. This is how most hashmap users do it. It's only a few lines, and
sharing without extra work would mean passing the hash value around
(otherwise we'd have to recompute it again), which is awkward.

Though in our case oidhash() is cheap enough that perhaps it's not worth
worrying about?
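
To illustrate (hypothetical, not something the patch does): sharing the
lookup would mean a helper that takes the precomputed hash, e.g.:

  /* hypothetical helper that find_object() and insert_object() would share */
  static struct object_entry *find_object_hashed(struct object_id *oid,
  					       unsigned int hash)
  {
  	struct hashmap_entry lookup_entry, *e;

  	hashmap_entry_init(&lookup_entry, hash);
  	e = hashmap_get(&object_table, &lookup_entry, oid);
  	return e ? container_of(e, struct object_entry, ent) : NULL;
  }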

> >  	e = new_object(oid);
> > -	e->next = object_table[h];
> >  	e->idx.offset = 0;
> > -	object_table[h] = e;
> > +	e->ent.hash = lookup_entry.hash;
> 
> ... except for this part.  Would it make sense to replace this line with
> a call to hashmap_entry_init()?  Seems cleaner to me.  It would look
> like this:
> 	struct object_entry *e = find_object(oid);
> 
> 	if (e)
> 		return e;
> 
> 	e = new_object(oid);
> 	e->idx.offset = 0;
> 	hashmap_entry_init(&e->ent, oidhash(oid));
> 	hashmap_add(&object_table, &e->ent);
> 	return e;

Right, that calls oidhash() again. If we're OK with that, I agree it's a
bit shorter and not any more awkward. I do worry slightly that it sets a
bad example, though (you wouldn't want to repeat a strhash(), for
example).

-Peff
René Scharfe April 6, 2020, 6:07 p.m. UTC | #3
Am 05.04.20 um 20:59 schrieb Jeff King:
> On Fri, Apr 03, 2020 at 08:53:30PM +0200, René Scharfe wrote:
>
>>> Here are the results for linux.git:
>>>
>>>   Test                        HEAD^                 HEAD
>>>   ----------------------------------------------------------------------------
>>>   9300.1: export (no-blobs)   67.64(66.96+0.67)     67.81(67.06+0.75) +0.3%
>>>   9300.2: import (no-blobs)   284.04(283.34+0.69)   198.09(196.01+0.92) -30.3%
>>
>> My numbers look a bit different for this, not sure why:
>>
>> Test                        origin/master         HEAD
>> ---------------------------------------------------------------------------
>> 9300.1: export (no-blobs)   69.36(66.44+1.56)     67.89(66.07+1.56) -2.1%
>> 9300.2: import (no-blobs)   295.10(293.83+1.19)   283.83(282.91+0.91) -3.8%
>>
>> They are still in favor of the patch, just not as strongly as yours.
>
> Interesting. I re-ran mine just to double check, and got:
>
> Test                        HEAD^                 HEAD
> ----------------------------------------------------------------------------
> 9300.1: export (no-blobs)   63.11(62.31+0.79)     62.00(61.21+0.79) -1.8%
> 9300.2: import (no-blobs)   261.79(261.06+0.72)   188.55(187.87+0.66) -28.0%
>
> (for some reason my whole machine is faster today; maybe I had closed
> slack).
>
> This is on a fresh-ish clone of linux.git.

A second run today reported:

Test                        origin/master         HEAD
---------------------------------------------------------------------------
9300.1: export (no-blobs)   61.58(59.93+1.40)     60.63(59.35+1.22) -1.5%
9300.2: import (no-blobs)   239.64(239.00+0.63)   246.02(245.18+0.82) +2.7%

git describe says v5.4-3890-g86fd3e9df543 in that repo.

Dunno.  My PC has thermal issues and stressing it for half an hour straight
may cause it to throttle?

With Git's own repo I just got this:

Test                        origin/master       HEAD
-----------------------------------------------------------------------
9300.1: export (no-blobs)   2.58(2.45+0.05)     2.81(2.75+0.05) +8.9%
9300.2: import (no-blobs)   10.41(10.37+0.03)   10.75(10.72+0.02) +3.3%

>
>>> +	e1 = container_of(eptr, const struct object_entry, ent);
>>> +	if (oid)
>>> +		return oidcmp(&e1->idx.oid, oid);
>>> +
>>> +	e2 = container_of(entry_or_key, const struct object_entry, ent);
>>> +	return oidcmp(&e1->idx.oid, &e2->idx.oid);
>>
>> Other hashmap users express this more tersely, similar to this:
>>
>> 	const struct object_entry *e1, *e2;
>>
>> 	e1 = container_of(eptr, const struct object_entry, ent);
>> 	e2 = container_of(entry_or_key, const struct object_entry, ent);
>> 	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);
>
> I wasn't sure if we'd ever see entry_or_key as NULL, in which case the
> second container_of() would be undefined behavior. There's a
> container_of_or_null() for this, but it seemed simpler to just avoid the
> deref entirely if we didn't need it.

OK.  The other users might have the same issue then, though.

> I think in practice we won't ever pass NULL, though. Even a
> hashmap_get() needs to pass in a hashmap_entry with the hash value in
> it. Though I think that's technically _also_ undefined behavior. If I
> have a struct where the entry is not at the beginning, like:
>
>   struct foo {
>     const char *bar;
>     struct hashmap_entry ent;
>   };
>
> then:
>
>   e2 = container_of(ptr_to_entry, struct foo, ent);
>
> is going to form a pointer at 8 bytes before ptr_to_entry. If it really
> is a "struct hashmap_entry" on the stack, then it's pointing at garbage.
>
> We don't dereference it, so it's likely fine in practice, but even
> computing such a pointer does violate the standard.

Really?  If ptr_to_entry was NULL then any pointer arithmetic on it would
be bad.  If it points to a bare hashmap_entry and we lied to container_of
by telling it that it's embedded in some other struct then that would be
bad as well.  But if there actually is a surrounding struct of the
specified type then the resulting pointer must surely be valid, no matter
if the object it points to was initialized?  Do you have the relevant
chapter and verse handy?

>
>>> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
>>> +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
>>> +	if (hashent)
>>> +		return container_of(hashent, struct object_entry, ent);
>>
>> That duplicates find_object()...
>
> Yes. This is how most hashmap users do it. It's only a few lines, and
> sharing without extra work would mean passing the hash value around
> (otherwise we'd have to recompute it again), which is awkward.

I find touching hashmap_entry members directly more awkward.

Compilers likely inline find_object() here, so the text size and
performance should be about the same.

> Though in our case oidhash() is cheap enough that perhaps it's not worth
> worrying about?

Probably, but I didn't measure.  My system is quite noisy.

>
>>>  	e = new_object(oid);
>>> -	e->next = object_table[h];
>>>  	e->idx.offset = 0;
>>> -	object_table[h] = e;
>>> +	e->ent.hash = lookup_entry.hash;
>>
>> ... except for this part.  Would it make sense to replace this line with
>> a call to hashmap_entry_init()?  Seems cleaner to me.  It would look
>> like this:
>> 	struct object_entry *e = find_object(oid);
>>
>> 	if (e)
>> 		return e;
>>
>> 	e = new_object(oid);
>> 	e->idx.offset = 0;
>> 	hashmap_entry_init(&e->ent, oidhash(oid));
>> 	hashmap_add(&object_table, &e->ent);
>> 	return e;
>
> Right, that calls oidhash() again. If we're OK with that, I agree it's a
> bit shorter and not any more awkward. I do worry slightly that it sets a
> bad example, though (you wouldn't want to repeat a strhash(), for
> example).

Stuffing the oidhash() result into a variable and using it twice with
hashmap_entry_init() would work as well.  This would make the reason for
the duplicate find_object() code obvious, while keeping struct
hashmap_entry opaque.
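
Concretely, insert_object() could look like this (just a sketch,
untested):

  static struct object_entry *insert_object(struct object_id *oid)
  {
  	unsigned int hash = oidhash(oid);
  	struct hashmap_entry lookup_entry, *hashent;
  	struct object_entry *e;

  	hashmap_entry_init(&lookup_entry, hash);
  	hashent = hashmap_get(&object_table, &lookup_entry, oid);
  	if (hashent)
  		return container_of(hashent, struct object_entry, ent);

  	e = new_object(oid);
  	e->idx.offset = 0;
  	hashmap_entry_init(&e->ent, hash);
  	hashmap_add(&object_table, &e->ent);
  	return e;
  }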

René
Jeff King April 6, 2020, 6:35 p.m. UTC | #4
On Mon, Apr 06, 2020 at 08:07:45PM +0200, René Scharfe wrote:

> > Interesting. I re-ran mine just to double check, and got:
> [...]
> A second run today reported:
> 
> Test                        origin/master         HEAD
> ---------------------------------------------------------------------------
> 9300.1: export (no-blobs)   61.58(59.93+1.40)     60.63(59.35+1.22) -1.5%
> 9300.2: import (no-blobs)   239.64(239.00+0.63)   246.02(245.18+0.82) +2.7%
> 
> git describe says v5.4-3890-g86fd3e9df543 in that repo.

I'm on v5.6-rc7-188-g1b649e0bcae7, but I can't imagine it makes a big
difference.

> Dunno.  My PC has thermal issues and stressing it for half an hour straight
> may cause it to throttle?

Yeah. I wonder what the variance is between the 3 runs (assuming you're
using ./run and doing 3). I.e., is one run in the first set much faster
than the others, and we pick it as best-of-3.

> With Git's own repo I just got this:
> 
> Test                        origin/master       HEAD
> -----------------------------------------------------------------------
> 9300.1: export (no-blobs)   2.58(2.45+0.05)     2.81(2.75+0.05) +8.9%
> 9300.2: import (no-blobs)   10.41(10.37+0.03)   10.75(10.72+0.02) +3.3%

I get similar numbers here. But I wouldn't expect much. We only have
~160k commits+trees in git.git, so the average chain length is ~2.5.

> > I think in practice we won't ever pass NULL, though. Even a
> > hashmap_get() needs to pass in a hashmap_entry with the hash value in
> > it. Though I think that's technically _also_ undefined behavior. If I
> > have a struct where the entry is not at the beginning, like:
> >
> >   struct foo {
> >     const char *bar;
> >     struct hashmap_entry ent;
> >   };
> >
> > then:
> >
> >   e2 = container_of(ptr_to_entry, struct foo, ent);
> >
> > is going to form a pointer at 8 bytes before ptr_to_entry. If it really
> > is a "struct hashmap_entry" on the stack, then it's pointing at garbage.
> >
> > We don't dereference it, so it's likely fine in practice, but even
> > computing such a pointer does violate the standard.
> 
> Really?  If ptr_to_entry was NULL then any pointer arithmetic on it would
> be bad.  If it points to a bare hashmap_entry and we lied to container_of
> by telling it that it's embedded in some other struct then that would be
> bad as well.  But if there actually is a surrounding struct of the
> specified type then the resulting pointer must surely be valid, no matter
> if the object it points to was initialized?  Do you have the relevant
> chapter and verse handy?

Pointing to uninitialized bits is fine according to the standard (and I
think that's what you're asking about for chapter and verse). But we
really do lie to container_of(). See remote.c, for example. In
make_remote(), we call hashmap_get() with a pointer to lookup_entry,
which is a bare "struct hashmap_entry". That should end up in
remotes_hash_cmp(), which unconditionally computes a pointer to a
"struct remote".

Now the hashmap_entry there is at the beginning of the struct, so the
offset is 0 and the two pointers are the same. So while the pointer's
type is incorrect, we didn't compute an invalid pointer value. And
traditionally the hashmap code required that it be at the front of the
containing struct, but that was loosened recently (and container_of
introduced) in 5efabc7ed9 (Merge branch 'ew/hashmap', 2019-10-15).

And grepping around, I don't see any cases where it's _not_ at the
beginning. So perhaps this is a problem waiting to bite us.

> Stuffing the oidhash() result into a variable and using it twice with
> hashmap_entry_init() would work as well.  This would make the reason for
> the duplicate find_object() code obvious, while keeping struct
> hashmap_entry opaque.

I'd prefer not to use a separate variable, as that requires giving it a
type. Perhaps:

  hashmap_entry_init(&e->ent, lookup_entry.hash);

which is used elsewhere is OK? That still assumes there's a hash field,
but I think hashmap has always been pretty up front about that (the real
sin in the original is assuming that nothing else needs to be
initialized).
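
I.e. the tail of insert_object() would become (sketch):

  	e = new_object(oid);
  	e->idx.offset = 0;
  	hashmap_entry_init(&e->ent, lookup_entry.hash);
  	hashmap_add(&object_table, &e->ent);
  	return e;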

-Peff
Jeff King April 6, 2020, 6:46 p.m. UTC | #5
On Mon, Apr 06, 2020 at 02:35:42PM -0400, Jeff King wrote:

> Pointing to uninitialized bits is fine according to the standard (and I
> think that's what you're asking about for chapter and verse). But we
> really do lie to container_of(). See remote.c, for example. In
> make_remote(), we call hashmap_get() with a pointer to lookup_entry,
> which is a bare "struct hashmap_entry". That should end up in
> remotes_hash_cmp(), which unconditionally computes a pointer to a
> "struct remote".
> 
> Now the hashmap_entry there is at the beginning of the struct, so the
> offset is 0 and the two pointers are the same. So while the pointer's
> type is incorrect, we didn't compute an invalid pointer value. And
> traditionally the hashmap code required that it be at the front of the
> containing struct, but that was loosened recently (and container_of
> introduced) in 5efabc7ed9 (Merge branch 'ew/hashmap', 2019-10-15).
> 
> And grepping around, I don't see any cases where it's _not_ at the
> beginning. So perhaps this is a problem waiting to bite us.

I was curious whether ASan/UBSan might complain if we did something like
this:

diff --git a/oidmap.h b/oidmap.h
index c66a83ab1d..1abfe966f2 100644
--- a/oidmap.h
+++ b/oidmap.h
@@ -12,10 +12,10 @@
  * internal_entry field.
  */
 struct oidmap_entry {
+	struct object_id oid;
+
 	/* For internal use only */
 	struct hashmap_entry internal_entry;
-
-	struct object_id oid;
 };
 

It does, but not because of the subtle UB issue. It's just that the rest
of oidmap still assumes the hashmap-is-first logic, and we need:

diff --git a/oidmap.c b/oidmap.c
index 423aa014a3..4065ea4b79 100644
--- a/oidmap.c
+++ b/oidmap.c
@@ -35,7 +35,10 @@ void *oidmap_get(const struct oidmap *map, const struct object_id *key)
 	if (!map->map.cmpfn)
 		return NULL;
 
-	return hashmap_get_from_hash(&map->map, oidhash(key), key);
+	return container_of_or_null(
+		hashmap_get_from_hash(&map->map, oidhash(key), key),
+		struct oidmap_entry,
+		internal_entry);
 }
 
 void *oidmap_remove(struct oidmap *map, const struct object_id *key)
diff --git a/oidmap.h b/oidmap.h
--- a/oidmap.h
+++ b/oidmap.h
@@ -79,7 +79,10 @@ static inline void oidmap_iter_init(struct oidmap *map, struct oidmap_iter *iter
 static inline void *oidmap_iter_next(struct oidmap_iter *iter)
 {
 	/* TODO: this API could be reworked to do compile-time type checks */
-	return (void *)hashmap_iter_next(&iter->h_iter);
+	return container_of_or_null(
+		hashmap_iter_next(&iter->h_iter),
+		struct oidmap_entry,
+		internal_entry);
 }
 
 static inline void *oidmap_iter_first(struct oidmap *map,

-Peff
René Scharfe April 7, 2020, 6:37 p.m. UTC | #6
Am 06.04.20 um 20:35 schrieb Jeff King:
> On Mon, Apr 06, 2020 at 08:07:45PM +0200, René Scharfe wrote:
>
>>> Interesting. I re-ran mine just to double check, and got:
>> [...]
>> A second run today reported:
>>
>> Test                        origin/master         HEAD
>> ---------------------------------------------------------------------------
>> 9300.1: export (no-blobs)   61.58(59.93+1.40)     60.63(59.35+1.22) -1.5%
>> 9300.2: import (no-blobs)   239.64(239.00+0.63)   246.02(245.18+0.82) +2.7%
>>
>> git describe says v5.4-3890-g86fd3e9df543 in that repo.
>
> I'm on v5.6-rc7-188-g1b649e0bcae7, but I can't imagine it makes a big
> difference.
>
>> Dunno.  My PC has thermal issues and stressing it for half an hour straight
>> may cause it to throttle?
>
> Yeah. I wonder what the variance is between the 3 runs (assuming you're
> using ./run and doing 3). I.e., is one run in the first set much faster
> than the others, and we pick it as best-of-3.

I did use run with the default three laps.  Ran p9300 directly with
patch v2 applied instead now, got this:

Test                        this tree
-----------------------------------------------
9300.1: export (no-blobs)   64.80(60.95+1.47)
9300.2: import (no-blobs)   210.00(206.02+2.04)
9300.1: export (no-blobs)   62.58(60.74+1.35)
9300.2: import (no-blobs)   209.07(206.78+1.96)
9300.1: export (no-blobs)   61.33(60.17+1.03)
9300.2: import (no-blobs)   207.73(205.47+2.04)

> But we
> really do lie to container_of(). See remote.c, for example. In
> make_remote(), we call hashmap_get() with a pointer to lookup_entry,
> which is a bare "struct hashmap_entry". That should end up in
> remotes_hash_cmp(), which unconditionally computes a pointer to a
> "struct remote".

Ugh.  Any optimization level would delay that to after the keydata
check, though, I guess.

There's also function pointer casting going on (with patch_util_cmp()
and sequence_entry_cmp()), which is undefined behavior IIUC.

> Now the hashmap_entry there is at the beginning of the struct, so the
> offset is 0 and the two pointers are the same. So while the pointer's
> type is incorrect, we didn't compute an invalid pointer value. And
> traditionally the hashmap code required that it be at the front of the
> containing struct, but that was loosened recently (and container_of
> introduced) in 5efabc7ed9 (Merge branch 'ew/hashmap', 2019-10-15).

Hmm, OK.

> And grepping around, I don't see any cases where it's _not_ at the
> beginning. So perhaps this is a problem waiting to bite us.

Possibly.

Speaking of grep, I ended up with this oneliner to show all the
comparison functions used with hashmaps:

git grep -W -f <(git grep -wh hashmap_init | cut -f2 -d, | grep -v -e ' .* ' -e NULL -e '^ *$' | sed 's/ //; s/.*)//; s/^/int /' | sort -u)

But I didn't learn much from looking at them, except that most have
that unconditional container_of.

>> Stuffing the oidhash() result into a variable and using it twice with
>> hashmap_entry_init() would work as well.  This would make the reason for
>> the duplicate find_object() code obvious, while keeping struct
>> hashmap_entry opaque.
>
> I'd prefer not to use a separate variable, as that requires giving it a
> type.

Specifying the type that oidhash() returns and hashmap_entry_init()
accepts is redundant and inconvenient, but that line should be easy to
find and adapt when the function signatures are eventually changed.
Which is perhaps never going to happen anyway?

> Perhaps:
>
>   hashmap_entry_init(&e->ent, lookup_entry.hash);
>
> which is used elsewhere is OK? That still assumes there's a hash field,
> but I think hashmap has always been pretty up front about that (the real
> sin in the original is assuming that nothing else needs to be
> initialized).

That conflicts with this statement from hashmap.h:

 * struct hashmap_entry is an opaque structure representing an entry in the
 * hash table.

Patch v2 looks good to me, though!

René

Patch

diff --git a/fast-import.c b/fast-import.c
index 202dda11a6..0ef6defc10 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -39,12 +39,28 @@ 
 
 struct object_entry {
 	struct pack_idx_entry idx;
-	struct object_entry *next;
+	struct hashmap_entry ent;
 	uint32_t type : TYPE_BITS,
 		pack_id : PACK_ID_BITS,
 		depth : DEPTH_BITS;
 };
 
+static int object_entry_hashcmp(const void *map_data,
+				const struct hashmap_entry *eptr,
+				const struct hashmap_entry *entry_or_key,
+				const void *keydata)
+{
+	const struct object_id *oid = keydata;
+	const struct object_entry *e1, *e2;
+
+	e1 = container_of(eptr, const struct object_entry, ent);
+	if (oid)
+		return oidcmp(&e1->idx.oid, oid);
+
+	e2 = container_of(entry_or_key, const struct object_entry, ent);
+	return oidcmp(&e1->idx.oid, &e2->idx.oid);
+}
+
 struct object_entry_pool {
 	struct object_entry_pool *next_pool;
 	struct object_entry *next_free;
@@ -178,7 +194,7 @@  static off_t pack_size;
 /* Table of objects we've written. */
 static unsigned int object_entry_alloc = 5000;
 static struct object_entry_pool *blocks;
-static struct object_entry *object_table[1 << 16];
+static struct hashmap object_table;
 static struct mark_set *marks;
 static const char *export_marks_file;
 static const char *import_marks_file;
@@ -455,44 +471,42 @@  static struct object_entry *new_object(struct object_id *oid)
 
 static struct object_entry *find_object(struct object_id *oid)
 {
-	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
-	struct object_entry *e;
-	for (e = object_table[h]; e; e = e->next)
-		if (oideq(oid, &e->idx.oid))
-			return e;
+	struct hashmap_entry lookup_entry, *e;
+
+	hashmap_entry_init(&lookup_entry, oidhash(oid));
+	e = hashmap_get(&object_table, &lookup_entry, oid);
+	if (e)
+		return container_of(e, struct object_entry, ent);
 	return NULL;
 }
 
 static struct object_entry *insert_object(struct object_id *oid)
 {
-	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
-	struct object_entry *e = object_table[h];
+	struct object_entry *e;
+	struct hashmap_entry lookup_entry, *hashent;
 
-	while (e) {
-		if (oideq(oid, &e->idx.oid))
-			return e;
-		e = e->next;
-	}
+	hashmap_entry_init(&lookup_entry, oidhash(oid));
+	hashent = hashmap_get(&object_table, &lookup_entry, oid);
+	if (hashent)
+		return container_of(hashent, struct object_entry, ent);
 
 	e = new_object(oid);
-	e->next = object_table[h];
 	e->idx.offset = 0;
-	object_table[h] = e;
+	e->ent.hash = lookup_entry.hash;
+	hashmap_add(&object_table, &e->ent);
 	return e;
 }
 
 static void invalidate_pack_id(unsigned int id)
 {
-	unsigned int h;
 	unsigned long lu;
 	struct tag *t;
+	struct hashmap_iter iter;
+	struct object_entry *e;
 
-	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
-		struct object_entry *e;
-
-		for (e = object_table[h]; e; e = e->next)
-			if (e->pack_id == id)
-				e->pack_id = MAX_PACK_ID;
+	hashmap_for_each_entry(&object_table, &iter, e, ent) {
+		if (e->pack_id == id)
+			e->pack_id = MAX_PACK_ID;
 	}
 
 	for (lu = 0; lu < branch_table_sz; lu++) {
@@ -3511,6 +3525,8 @@  int cmd_main(int argc, const char **argv)
 	avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
 	marks = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
 
+	hashmap_init(&object_table, object_entry_hashcmp, NULL, 0);
+
 	/*
 	 * We don't parse most options until after we've seen the set of
 	 * "feature" lines at the start of the stream (which allows the command
diff --git a/t/perf/p9300-fast-import-export.sh b/t/perf/p9300-fast-import-export.sh
new file mode 100755
index 0000000000..c3c743d04a
--- /dev/null
+++ b/t/perf/p9300-fast-import-export.sh
@@ -0,0 +1,23 @@ 
+#!/bin/sh
+
+test_description='test fast-import and fast-export performance'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+# Use --no-data here to produce a vastly smaller export file.
+# This is much cheaper to work with but should still exercise
+# fast-import pretty well (we'll still process all commits and
+# trees, which account for 60% or more of objects in most repos).
+#
+# Use --reencode to avoid the default of aborting on non-utf8 commits,
+# which lets this test run against a wider variety of sample repos.
+test_perf 'export (no-blobs)' '
+	git fast-export --reencode=yes --no-data HEAD >export
+'
+
+test_perf 'import (no-blobs)' '
+	git fast-import --force <export
+'
+
+test_done