Message ID: 20200403121508.GA638328@coredump.intra.peff.net
State:      New, archived
Series:     fast-import: replace custom hash with hashmap.c
On 03.04.20 at 14:15, Jeff King wrote:
> We use a custom hash in fast-import to store the set of objects we've imported so far. It has a fixed set of 2^16 buckets and chains any collisions with a linked list. As the number of objects grows larger than that, the load factor increases and we degrade to O(n) lookups and O(n^2) insertions.
>
> We can scale better by using our hashmap.c implementation, which will resize the bucket count as we grow. This does incur an extra memory cost of 8 bytes per object, as hashmap stores the integer hash value for each entry in its hashmap_entry struct (which we really don't care about here, because we're just reusing the embedded object hash). But I think the numbers below justify this (and our per-object memory cost is already much higher).
>
> I also looked at using khash, but it seemed to perform slightly worse than hashmap at all sizes, and worse even than the existing code for small sizes. It's also awkward to use here, because we want to look up a "struct object_entry" from a "struct object_id", and it doesn't handle mismatched keys as well. Making a mapping of object_id to object_entry would be more natural, but that would require pulling the embedded oid out of the object_entry or incurring an extra 32 bytes per object.
>
> In a synthetic test creating as many cheap, tiny objects as possible
>
>   perl -e '
>     my $bits = shift;
>     my $nr = 2**$bits;
>
>     for (my $i = 0; $i < $nr; $i++) {
>       print "blob\n";
>       print "data 4\n";
>       print pack("N", $i);
>     }
>   ' $bits | git fast-import
>
> I got these results:
>
>   nr_objects   master      khash       hashmap
>   2^20         0m4.317s    0m5.109s    0m3.890s
>   2^21         0m10.204s   0m9.702s    0m7.933s
>   2^22         0m27.159s   0m17.911s   0m16.751s
>   2^23         1m19.038s   0m35.080s   0m31.963s
>   2^24         4m18.766s   1m10.233s   1m6.793s

I get similar numbers.

> which points to hashmap as the winner. We didn't have any perf tests for fast-export or fast-import, so I added one as a more real-world case. It uses an export without blobs since that's significantly cheaper than a full one, but still is an interesting case people might use (e.g., for rewriting history). It will emphasize this change in some ways (as a percentage we spend more time making objects and less shuffling blob bytes around) and less in others (the total object count is lower).
>
> Here are the results for linux.git:
>
>   Test                        HEAD^               HEAD
>   ----------------------------------------------------------------------------
>   9300.1: export (no-blobs)   67.64(66.96+0.67)   67.81(67.06+0.75)   +0.3%
>   9300.2: import (no-blobs)   284.04(283.34+0.69) 198.09(196.01+0.92) -30.3%

My numbers look a bit different for this, not sure why:

  Test                        origin/master       HEAD
  ---------------------------------------------------------------------------
  9300.1: export (no-blobs)   69.36(66.44+1.56)   67.89(66.07+1.56)   -2.1%
  9300.2: import (no-blobs)   295.10(293.83+1.19) 283.83(282.91+0.91) -3.8%

They are still in favor of the patch, just not as strongly as yours.

> It only has ~5.2M commits and trees, so this is a larger effect than I expected (the 2^23 case above only improved by 50s or so, but here we gained almost 90s). This is probably due to actually performing more object lookups in a real import with trees and commits, as opposed to just dumping a bunch of blobs into a pack.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  fast-import.c                      | 62 +++++++++++++++++++-----------
>  t/perf/p9300-fast-import-export.sh | 23 +++++++++++
>  2 files changed, 62 insertions(+), 23 deletions(-)
>  create mode 100755 t/perf/p9300-fast-import-export.sh

I have to warn you up front: I'm not familiar with hashmap or fast-import, so I'll focus on stylistic topics -- bikeshedding. The actual change looks reasonable to me overall and the performance is convincing.

> diff --git a/fast-import.c b/fast-import.c
> index 202dda11a6..0ef6defc10 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -39,12 +39,28 @@
>  
>  struct object_entry {
>  	struct pack_idx_entry idx;
> -	struct object_entry *next;
> +	struct hashmap_entry ent;
>  	uint32_t type : TYPE_BITS,
>  		pack_id : PACK_ID_BITS,
>  		depth : DEPTH_BITS;
>  };
>  
> +static int object_entry_hashcmp(const void *map_data,
> +				const struct hashmap_entry *eptr,
> +				const struct hashmap_entry *entry_or_key,
> +				const void *keydata)
> +{
> +	const struct object_id *oid = keydata;
> +	const struct object_entry *e1, *e2;
> +
> +	e1 = container_of(eptr, const struct object_entry, ent);
> +	if (oid)
> +		return oidcmp(&e1->idx.oid, oid);
> +
> +	e2 = container_of(entry_or_key, const struct object_entry, ent);
> +	return oidcmp(&e1->idx.oid, &e2->idx.oid);

Other hashmap users express this more tersely, similar to this:

	const struct object_entry *e1, *e2;

	e1 = container_of(eptr, const struct object_entry, ent);
	e2 = container_of(entry_or_key, const struct object_entry, ent);
	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);

> +}
> +
>  struct object_entry_pool {
>  	struct object_entry_pool *next_pool;
>  	struct object_entry *next_free;
> @@ -178,7 +194,7 @@ static off_t pack_size;
>  /* Table of objects we've written. */
>  static unsigned int object_entry_alloc = 5000;
>  static struct object_entry_pool *blocks;
> -static struct object_entry *object_table[1 << 16];
> +static struct hashmap object_table;
>  static struct mark_set *marks;
>  static const char *export_marks_file;
>  static const char *import_marks_file;
> @@ -455,44 +471,42 @@ static struct object_entry *new_object(struct object_id *oid)
>  
>  static struct object_entry *find_object(struct object_id *oid)
>  {
> -	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
> -	struct object_entry *e;
> -	for (e = object_table[h]; e; e = e->next)
> -		if (oideq(oid, &e->idx.oid))
> -			return e;
> +	struct hashmap_entry lookup_entry, *e;
> +
> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> +	e = hashmap_get(&object_table, &lookup_entry, oid);
> +	if (e)
> +		return container_of(e, struct object_entry, ent);

(Worth repeating:) No casting trick, yay!

>  	return NULL;
>  }
>  
>  static struct object_entry *insert_object(struct object_id *oid)
>  {
> -	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
> -	struct object_entry *e = object_table[h];
> +	struct object_entry *e;
> +	struct hashmap_entry lookup_entry, *hashent;
>  
> -	while (e) {
> -		if (oideq(oid, &e->idx.oid))
> -			return e;
> -		e = e->next;
> -	}
> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
> +	if (hashent)
> +		return container_of(hashent, struct object_entry, ent);

That duplicates find_object()...

>  
>  	e = new_object(oid);
> -	e->next = object_table[h];
>  	e->idx.offset = 0;
> -	object_table[h] = e;
> +	e->ent.hash = lookup_entry.hash;

... except for this part. Would it make sense to replace this line with a call to hashmap_entry_init()? Seems cleaner to me. It would look like this:

	struct object_entry *e = find_object(oid);

	if (e)
		return e;

	e = new_object(oid);
	e->idx.offset = 0;
	hashmap_entry_init(&e->ent, oidhash(oid));
	hashmap_add(&object_table, &e->ent);
	return e;

> +	hashmap_add(&object_table, &e->ent);
>  	return e;
>  }
>  
>  static void invalidate_pack_id(unsigned int id)
>  {
> -	unsigned int h;
>  	unsigned long lu;
>  	struct tag *t;
> +	struct hashmap_iter iter;
> +	struct object_entry *e;
>  
> -	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
> -		struct object_entry *e;
> -
> -		for (e = object_table[h]; e; e = e->next)
> -			if (e->pack_id == id)
> -				e->pack_id = MAX_PACK_ID;
> +	hashmap_for_each_entry(&object_table, &iter, e, ent) {
> +		if (e->pack_id == id)
> +			e->pack_id = MAX_PACK_ID;
>  	}
>  
>  	for (lu = 0; lu < branch_table_sz; lu++) {
> @@ -3511,6 +3525,8 @@ int cmd_main(int argc, const char **argv)
>  	avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
>  	marks = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
>  
> +	hashmap_init(&object_table, object_entry_hashcmp, NULL, 0);
> +
>  	/*
>  	 * We don't parse most options until after we've seen the set of
>  	 * "feature" lines at the start of the stream (which allows the command
> diff --git a/t/perf/p9300-fast-import-export.sh b/t/perf/p9300-fast-import-export.sh
> new file mode 100755
> index 0000000000..c3c743d04a
> --- /dev/null
> +++ b/t/perf/p9300-fast-import-export.sh
> @@ -0,0 +1,23 @@
> +#!/bin/sh
> +
> +test_description='test fast-import and fast-export performance'
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +# Use --no-data here to produce a vastly smaller export file.
> +# This is much cheaper to work with but should still exercise
> +# fast-import pretty well (we'll still process all commits and
> +# trees, which account for 60% or more of objects in most repos).
> +#
> +# Use --reencode to avoid the default of aborting on non-utf8 commits,
> +# which lets this test run against a wider variety of sample repos.
> +test_perf 'export (no-blobs)' '
> +	git fast-export --reencode=yes --no-data HEAD >export
> +'
> +
> +test_perf 'import (no-blobs)' '
> +	git fast-import --force <export
> +'
> +
> +test_done
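For readers who, like René, do not have hashmap.c fresh in mind, here is a minimal sketch of the embedded-entry pattern the patch builds on. It uses only calls that appear in the quoted diff (hashmap_init(), hashmap_entry_init(), hashmap_get(), hashmap_add(), container_of(), oidhash(), oidcmp()); the "example_*" names and the payload field are made up for illustration, and this is a simplified sketch, not the actual fast-import code.

	/* A value type with the hashmap entry embedded in it. */
	struct example_entry {
		struct hashmap_entry ent;	/* what hashmap.c links into its buckets */
		struct object_id oid;		/* the real key */
		int payload;
	};

	/* Comparison callback: same shape as object_entry_hashcmp() above. */
	static int example_hashcmp(const void *map_data,
				   const struct hashmap_entry *eptr,
				   const struct hashmap_entry *entry_or_key,
				   const void *keydata)
	{
		const struct example_entry *a, *b;
		const struct object_id *oid = keydata;

		a = container_of(eptr, const struct example_entry, ent);
		if (oid)	/* lookup by bare object_id */
			return oidcmp(&a->oid, oid);
		b = container_of(entry_or_key, const struct example_entry, ent);
		return oidcmp(&a->oid, &b->oid);
	}

	static struct hashmap example_map;

	static void example_setup(void)
	{
		hashmap_init(&example_map, example_hashcmp, NULL, 0);
	}

	static struct example_entry *example_find(const struct object_id *oid)
	{
		struct hashmap_entry lookup, *found;

		hashmap_entry_init(&lookup, oidhash(oid));	/* stash the hash value */
		found = hashmap_get(&example_map, &lookup, oid);
		return found ? container_of(found, struct example_entry, ent) : NULL;
	}

	static void example_add(struct example_entry *e)
	{
		hashmap_entry_init(&e->ent, oidhash(&e->oid));
		hashmap_add(&example_map, &e->ent);
	}

hashmap_get() hands back the embedded hashmap_entry, and container_of() recovers the surrounding struct; that is what replaces the old "e->next" chaining and the casting trick René is glad to see gone.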
On Fri, Apr 03, 2020 at 08:53:30PM +0200, René Scharfe wrote:

> > Here are the results for linux.git:
> >
> >   Test                        HEAD^               HEAD
> >   ----------------------------------------------------------------------------
> >   9300.1: export (no-blobs)   67.64(66.96+0.67)   67.81(67.06+0.75)   +0.3%
> >   9300.2: import (no-blobs)   284.04(283.34+0.69) 198.09(196.01+0.92) -30.3%
>
> My numbers look a bit different for this, not sure why:
>
>   Test                        origin/master       HEAD
>   ---------------------------------------------------------------------------
>   9300.1: export (no-blobs)   69.36(66.44+1.56)   67.89(66.07+1.56)   -2.1%
>   9300.2: import (no-blobs)   295.10(293.83+1.19) 283.83(282.91+0.91) -3.8%
>
> They are still in favor of the patch, just not as strongly as yours.

Interesting. I re-ran mine just to double check, and got:

  Test                        HEAD^               HEAD
  ----------------------------------------------------------------------------
  9300.1: export (no-blobs)   63.11(62.31+0.79)   62.00(61.21+0.79)   -1.8%
  9300.2: import (no-blobs)   261.79(261.06+0.72) 188.55(187.87+0.66) -28.0%

(for some reason my whole machine is faster today; maybe I had closed slack).

This is on a fresh-ish clone of linux.git.

> > +	e1 = container_of(eptr, const struct object_entry, ent);
> > +	if (oid)
> > +		return oidcmp(&e1->idx.oid, oid);
> > +
> > +	e2 = container_of(entry_or_key, const struct object_entry, ent);
> > +	return oidcmp(&e1->idx.oid, &e2->idx.oid);
>
> Other hashmap users express this more tersely, similar to this:
>
> 	const struct object_entry *e1, *e2;
>
> 	e1 = container_of(eptr, const struct object_entry, ent);
> 	e2 = container_of(entry_or_key, const struct object_entry, ent);
> 	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);

I wasn't sure if we'd ever see entry_or_key as NULL, in which case the second container_of() would be undefined behavior. There's a container_of_or_null() for this, but it seemed simpler to just avoid the deref entirely if we didn't need it.

I think in practice we won't ever pass NULL, though. Even a hashmap_get() needs to pass in a hashmap_entry with the hash value in it. Though I think that's technically _also_ undefined behavior. If I have a struct where the entry is not at the beginning, like:

  struct foo {
	const char *bar;
	struct hashmap_entry ent;
  };

then:

  e2 = container_of(ptr_to_entry, struct foo, ent);

is going to form a pointer at 8 bytes before ptr_to_entry. If it really is a "struct hashmap_entry" on the stack, then it's pointing at garbage.

We don't dereference it, so it's likely fine in practice, but even computing such a pointer does violate the standard.

> > +	hashmap_entry_init(&lookup_entry, oidhash(oid));
> > +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
> > +	if (hashent)
> > +		return container_of(hashent, struct object_entry, ent);
>
> That duplicates find_object()...

Yes. This is how most hashmap users do it. It's only a few lines, and sharing without extra work would mean passing the hash value around (otherwise we'd have to recompute it again), which is awkward.

Though in our case oidhash() is cheap enough that perhaps it's not worth worrying about?

> > 	e = new_object(oid);
> > -	e->next = object_table[h];
> > 	e->idx.offset = 0;
> > -	object_table[h] = e;
> > +	e->ent.hash = lookup_entry.hash;
>
> ... except for this part. Would it make sense to replace this line with a call to hashmap_entry_init()? Seems cleaner to me. It would look like this:
>
> 	struct object_entry *e = find_object(oid);
>
> 	if (e)
> 		return e;
>
> 	e = new_object(oid);
> 	e->idx.offset = 0;
> 	hashmap_entry_init(&e->ent, oidhash(oid));
> 	hashmap_add(&object_table, &e->ent);
> 	return e;

Right, that calls oidhash() again. If we're OK with that, I agree it's a bit shorter and not any more awkward. I do worry slightly that it sets a bad example, though (you wouldn't want to repeat a strhash(), for example).

-Peff
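The pointer concern Jeff raises here can be made concrete with a small standalone program. The definitions below are stand-ins written only for illustration (Git's real struct hashmap_entry and container_of() live in hashmap.h and differ in detail); the interesting part is just the address computation.

	#include <stddef.h>

	/* stand-in, not Git's definition */
	struct hashmap_entry {
		struct hashmap_entry *next;
		unsigned int hash;
	};

	/* the usual offsetof-based container_of */
	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))

	/* the hypothetical struct from the mail: the entry is NOT at offset 0 */
	struct foo {
		const char *bar;
		struct hashmap_entry ent;
	};

	int main(void)
	{
		struct hashmap_entry key = { 0 };	/* bare entry on the stack */
		struct foo *f = container_of(&key, struct foo, ent);
		/*
		 * f now points offsetof(struct foo, ent) bytes before "key",
		 * into memory that is not inside any struct foo. Merely
		 * forming that pointer is the undefined behavior being
		 * discussed, even though it is never dereferenced.
		 */
		(void)f;
		return 0;
	}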
On 05.04.20 at 20:59, Jeff King wrote:
> On Fri, Apr 03, 2020 at 08:53:30PM +0200, René Scharfe wrote:
>
>>> Here are the results for linux.git:
>>>
>>>   Test                        HEAD^               HEAD
>>>   ----------------------------------------------------------------------------
>>>   9300.1: export (no-blobs)   67.64(66.96+0.67)   67.81(67.06+0.75)   +0.3%
>>>   9300.2: import (no-blobs)   284.04(283.34+0.69) 198.09(196.01+0.92) -30.3%
>>
>> My numbers look a bit different for this, not sure why:
>>
>>   Test                        origin/master       HEAD
>>   ---------------------------------------------------------------------------
>>   9300.1: export (no-blobs)   69.36(66.44+1.56)   67.89(66.07+1.56)   -2.1%
>>   9300.2: import (no-blobs)   295.10(293.83+1.19) 283.83(282.91+0.91) -3.8%
>>
>> They are still in favor of the patch, just not as strongly as yours.
>
> Interesting. I re-ran mine just to double check, and got:
>
>   Test                        HEAD^               HEAD
>   ----------------------------------------------------------------------------
>   9300.1: export (no-blobs)   63.11(62.31+0.79)   62.00(61.21+0.79)   -1.8%
>   9300.2: import (no-blobs)   261.79(261.06+0.72) 188.55(187.87+0.66) -28.0%
>
> (for some reason my whole machine is faster today; maybe I had closed slack).
>
> This is on a fresh-ish clone of linux.git.

A second run today reported:

  Test                        origin/master       HEAD
  ---------------------------------------------------------------------------
  9300.1: export (no-blobs)   61.58(59.93+1.40)   60.63(59.35+1.22)   -1.5%
  9300.2: import (no-blobs)   239.64(239.00+0.63) 246.02(245.18+0.82) +2.7%

git describe says v5.4-3890-g86fd3e9df543 in that repo.

Dunno. My PC has thermal issues and stressing it for half an hour straight may cause it to throttle?

With Git's own repo I just got this:

  Test                        origin/master      HEAD
  -----------------------------------------------------------------------
  9300.1: export (no-blobs)   2.58(2.45+0.05)    2.81(2.75+0.05)    +8.9%
  9300.2: import (no-blobs)   10.41(10.37+0.03)  10.75(10.72+0.02)  +3.3%

>>> +	e1 = container_of(eptr, const struct object_entry, ent);
>>> +	if (oid)
>>> +		return oidcmp(&e1->idx.oid, oid);
>>> +
>>> +	e2 = container_of(entry_or_key, const struct object_entry, ent);
>>> +	return oidcmp(&e1->idx.oid, &e2->idx.oid);
>>
>> Other hashmap users express this more tersely, similar to this:
>>
>> 	const struct object_entry *e1, *e2;
>>
>> 	e1 = container_of(eptr, const struct object_entry, ent);
>> 	e2 = container_of(entry_or_key, const struct object_entry, ent);
>> 	return oidcmp(&e1->idx.oid, keydata ? keydata : &e2->idx.oid);
>
> I wasn't sure if we'd ever see entry_or_key as NULL, in which case the second container_of() would be undefined behavior. There's a container_of_or_null() for this, but it seemed simpler to just avoid the deref entirely if we didn't need it.

OK. The other users might have the same issue then, though.

> I think in practice we won't ever pass NULL, though. Even a hashmap_get() needs to pass in a hashmap_entry with the hash value in it. Though I think that's technically _also_ undefined behavior. If I have a struct where the entry is not at the beginning, like:
>
>   struct foo {
> 	const char *bar;
> 	struct hashmap_entry ent;
>   };
>
> then:
>
>   e2 = container_of(ptr_to_entry, struct foo, ent);
>
> is going to form a pointer at 8 bytes before ptr_to_entry. If it really is a "struct hashmap_entry" on the stack, then it's pointing at garbage.
>
> We don't dereference it, so it's likely fine in practice, but even computing such a pointer does violate the standard.

Really? If ptr_to_entry was NULL then any pointer arithmetic on it would be bad. If it points to a bare hashmap_entry and we lied to container_of by telling it that it's embedded in some other struct then that would be bad as well. But if there actually is a surrounding struct of the specified type then the resulting pointer must surely be valid, no matter if the object it points to was initialized? Do you have the relevant chapter and verse handy?

>>> +	hashmap_entry_init(&lookup_entry, oidhash(oid));
>>> +	hashent = hashmap_get(&object_table, &lookup_entry, oid);
>>> +	if (hashent)
>>> +		return container_of(hashent, struct object_entry, ent);
>>
>> That duplicates find_object()...
>
> Yes. This is how most hashmap users do it. It's only a few lines, and sharing without extra work would mean passing the hash value around (otherwise we'd have to recompute it again), which is awkward.

I find touching hashmap_entry members directly more awkward. Compilers likely inline find_object() here, so the text size and performance should be about the same.

> Though in our case oidhash() is cheap enough that perhaps it's not worth worrying about?

Probably, but I didn't measure. My system is quite noisy.

>>> 	e = new_object(oid);
>>> -	e->next = object_table[h];
>>> 	e->idx.offset = 0;
>>> -	object_table[h] = e;
>>> +	e->ent.hash = lookup_entry.hash;
>>
>> ... except for this part. Would it make sense to replace this line with a call to hashmap_entry_init()? Seems cleaner to me. It would look like this:
>>
>> 	struct object_entry *e = find_object(oid);
>>
>> 	if (e)
>> 		return e;
>>
>> 	e = new_object(oid);
>> 	e->idx.offset = 0;
>> 	hashmap_entry_init(&e->ent, oidhash(oid));
>> 	hashmap_add(&object_table, &e->ent);
>> 	return e;
>
> Right, that calls oidhash() again. If we're OK with that, I agree it's a bit shorter and not any more awkward. I do worry slightly that it sets a bad example, though (you wouldn't want to repeat a strhash(), for example).

Stuffing the oidhash() result into a variable and using it twice with hashmap_entry_init() would work as well. This would make the reason for the duplicate find_object() code obvious, while keeping struct hashmap_entry opaque.

René
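Concretely, the variant René sketches in his last paragraph might look roughly like this in insert_object(). This is only a sketch of the suggestion; the local variable name and its unsigned int type (matching what oidhash() returns and hashmap_entry_init() takes in the patch) are my own assumptions:

	static struct object_entry *insert_object(struct object_id *oid)
	{
		unsigned int hash = oidhash(oid);	/* computed once, used twice */
		struct hashmap_entry lookup_entry, *hashent;
		struct object_entry *e;

		hashmap_entry_init(&lookup_entry, hash);
		hashent = hashmap_get(&object_table, &lookup_entry, oid);
		if (hashent)
			return container_of(hashent, struct object_entry, ent);

		e = new_object(oid);
		e->idx.offset = 0;
		hashmap_entry_init(&e->ent, hash);	/* no poking at ent.hash */
		hashmap_add(&object_table, &e->ent);
		return e;
	}

This keeps struct hashmap_entry opaque while paying for oidhash() only once, which is the trade-off the next message picks up.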
On Mon, Apr 06, 2020 at 08:07:45PM +0200, René Scharfe wrote:

> > Interesting. I re-ran mine just to double check, and got:
> [...]
> A second run today reported:
>
>   Test                        origin/master       HEAD
>   ---------------------------------------------------------------------------
>   9300.1: export (no-blobs)   61.58(59.93+1.40)   60.63(59.35+1.22)   -1.5%
>   9300.2: import (no-blobs)   239.64(239.00+0.63) 246.02(245.18+0.82) +2.7%
>
> git describe says v5.4-3890-g86fd3e9df543 in that repo.

I'm on v5.6-rc7-188-g1b649e0bcae7, but I can't imagine it makes a big difference.

> Dunno. My PC has thermal issues and stressing it for half an hour straight may cause it to throttle?

Yeah. I wonder what the variance is between the 3 runs (assuming you're using ./run and doing 3). I.e., is one run in the first set much faster than the others, and we pick it as best-of-3.

> With Git's own repo I just got this:
>
>   Test                        origin/master      HEAD
>   -----------------------------------------------------------------------
>   9300.1: export (no-blobs)   2.58(2.45+0.05)    2.81(2.75+0.05)    +8.9%
>   9300.2: import (no-blobs)   10.41(10.37+0.03)  10.75(10.72+0.02)  +3.3%

I get similar numbers here. But I wouldn't expect much. We only have ~160k commits+trees in git.git, so the average chain length is ~2.5.

> > I think in practice we won't ever pass NULL, though. Even a hashmap_get() needs to pass in a hashmap_entry with the hash value in it. Though I think that's technically _also_ undefined behavior. If I have a struct where the entry is not at the beginning, like:
> >
> >   struct foo {
> > 	const char *bar;
> > 	struct hashmap_entry ent;
> >   };
> >
> > then:
> >
> >   e2 = container_of(ptr_to_entry, struct foo, ent);
> >
> > is going to form a pointer at 8 bytes before ptr_to_entry. If it really is a "struct hashmap_entry" on the stack, then it's pointing at garbage.
> >
> > We don't dereference it, so it's likely fine in practice, but even computing such a pointer does violate the standard.
>
> Really? If ptr_to_entry was NULL then any pointer arithmetic on it would be bad. If it points to a bare hashmap_entry and we lied to container_of by telling it that it's embedded in some other struct then that would be bad as well. But if there actually is a surrounding struct of the specified type then the resulting pointer must surely be valid, no matter if the object it points to was initialized? Do you have the relevant chapter and verse handy?

Pointing to uninitialized bits is fine according to the standard (and I think that's what you're asking about for chapter and verse). But we really do lie to container_of(). See remote.c, for example. In make_remote(), we call hashmap_get() with a pointer to lookup_entry, which is a bare "struct hashmap_entry". That should end up in remotes_hash_cmp(), which unconditionally computes a pointer to a "struct remote".

Now the hashmap_entry there is at the beginning of the struct, so the offset is 0 and the two pointers are the same. So while the pointer's type is incorrect, we didn't compute an invalid pointer value. And traditionally the hashmap code required that it be at the front of the containing struct, but that was loosened recently (and container_of introduced) in 5efabc7ed9 (Merge branch 'ew/hashmap', 2019-10-15).

And grepping around, I don't see any cases where it's _not_ at the beginning. So perhaps this is a problem waiting to bite us.

> Stuffing the oidhash() result into a variable and using it twice with hashmap_entry_init() would work as well. This would make the reason for the duplicate find_object() code obvious, while keeping struct hashmap_entry opaque.

I'd prefer not to use a separate variable, as that requires giving it a type. Perhaps:

  hashmap_entry_init(&e->ent, lookup_entry.hash);

which is used elsewhere is OK? That still assumes there's a hash field, but I think hashmap has always been pretty up front about that (the real sin in the original is assuming that nothing else needs to be initialized).

-Peff
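Spelled out, Jeff's middle ground (reusing the hash already stored in lookup_entry, but going through hashmap_entry_init() instead of assigning ent.hash directly) would make the tail of insert_object() look roughly like this; a sketch, not a tested change:

	e = new_object(oid);
	e->idx.offset = 0;
	hashmap_entry_init(&e->ent, lookup_entry.hash);	/* reuse the hash from the failed lookup */
	hashmap_add(&object_table, &e->ent);
	return e;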
On Mon, Apr 06, 2020 at 02:35:42PM -0400, Jeff King wrote:

> Pointing to uninitialized bits is fine according to the standard (and I think that's what you're asking about for chapter and verse). But we really do lie to container_of(). See remote.c, for example. In make_remote(), we call hashmap_get() with a pointer to lookup_entry, which is a bare "struct hashmap_entry". That should end up in remotes_hash_cmp(), which unconditionally computes a pointer to a "struct remote".
>
> Now the hashmap_entry there is at the beginning of the struct, so the offset is 0 and the two pointers are the same. So while the pointer's type is incorrect, we didn't compute an invalid pointer value. And traditionally the hashmap code required that it be at the front of the containing struct, but that was loosened recently (and container_of introduced) in 5efabc7ed9 (Merge branch 'ew/hashmap', 2019-10-15).
>
> And grepping around, I don't see any cases where it's _not_ at the beginning. So perhaps this is a problem waiting to bite us.

I was curious whether ASan/UBSan might complain if we did something like this:

diff --git a/oidmap.h b/oidmap.h
index c66a83ab1d..1abfe966f2 100644
--- a/oidmap.h
+++ b/oidmap.h
@@ -12,10 +12,10 @@
  * internal_entry field.
  */
 struct oidmap_entry {
+	struct object_id oid;
+
 	/* For internal use only */
 	struct hashmap_entry internal_entry;
-
-	struct object_id oid;
 };

It does, but not because of the subtle UB issue. It's just that the rest of oidmap still assumes the hashmap-is-first logic, and we need:

diff --git a/oidmap.c b/oidmap.c
index 423aa014a3..4065ea4b79 100644
--- a/oidmap.c
+++ b/oidmap.c
@@ -35,7 +35,10 @@ void *oidmap_get(const struct oidmap *map, const struct object_id *key)
 	if (!map->map.cmpfn)
 		return NULL;
 
-	return hashmap_get_from_hash(&map->map, oidhash(key), key);
+	return container_of_or_null(
+		hashmap_get_from_hash(&map->map, oidhash(key), key),
+		struct oidmap_entry,
+		internal_entry);
 }
 
 void *oidmap_remove(struct oidmap *map, const struct object_id *key)

struct oidmap {

@@ -79,7 +79,10 @@ static inline void oidmap_iter_init(struct oidmap *map, struct oidmap_iter *iter)
 static inline void *oidmap_iter_next(struct oidmap_iter *iter)
 {
 	/* TODO: this API could be reworked to do compile-time type checks */
-	return (void *)hashmap_iter_next(&iter->h_iter);
+	return container_of_or_null(
+		hashmap_iter_next(&iter->h_iter),
+		struct oidmap_entry,
+		internal_entry);
 }
 
 static inline void *oidmap_iter_first(struct oidmap *map,

-Peff
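For readers who have not met container_of_or_null() before: it is the NULL-propagating variant of container_of() used in the hunks above. An illustrative definition (not necessarily Git's exact spelling, and note that ptr is evaluated twice here) would be:

	#define container_of_or_null(ptr, type, member) \
		((ptr) ? container_of((ptr), type, member) : (type *)NULL)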
On 06.04.20 at 20:35, Jeff King wrote:
> On Mon, Apr 06, 2020 at 08:07:45PM +0200, René Scharfe wrote:
>
>>> Interesting. I re-ran mine just to double check, and got:
>> [...]
>> A second run today reported:
>>
>>   Test                        origin/master       HEAD
>>   ---------------------------------------------------------------------------
>>   9300.1: export (no-blobs)   61.58(59.93+1.40)   60.63(59.35+1.22)   -1.5%
>>   9300.2: import (no-blobs)   239.64(239.00+0.63) 246.02(245.18+0.82) +2.7%
>>
>> git describe says v5.4-3890-g86fd3e9df543 in that repo.
>
> I'm on v5.6-rc7-188-g1b649e0bcae7, but I can't imagine it makes a big difference.
>
>> Dunno. My PC has thermal issues and stressing it for half an hour straight may cause it to throttle?
>
> Yeah. I wonder what the variance is between the 3 runs (assuming you're using ./run and doing 3). I.e., is one run in the first set much faster than the others, and we pick it as best-of-3.

I did use run with the default three laps. Ran p9300 directly with patch v2 applied instead now, got this:

  Test                        this tree
  -----------------------------------------------
  9300.1: export (no-blobs)   64.80(60.95+1.47)
  9300.2: import (no-blobs)   210.00(206.02+2.04)
  9300.1: export (no-blobs)   62.58(60.74+1.35)
  9300.2: import (no-blobs)   209.07(206.78+1.96)
  9300.1: export (no-blobs)   61.33(60.17+1.03)
  9300.2: import (no-blobs)   207.73(205.47+2.04)

> But we really do lie to container_of(). See remote.c, for example. In make_remote(), we call hashmap_get() with a pointer to lookup_entry, which is a bare "struct hashmap_entry". That should end up in remotes_hash_cmp(), which unconditionally computes a pointer to a "struct remote".

Ugh. Any optimization level would delay that to after the keydata check, though, I guess.

There's also function pointer casting going on (with patch_util_cmp() and sequence_entry_cmp()), which is undefined behavior IIUC.

> Now the hashmap_entry there is at the beginning of the struct, so the offset is 0 and the two pointers are the same. So while the pointer's type is incorrect, we didn't compute an invalid pointer value. And traditionally the hashmap code required that it be at the front of the containing struct, but that was loosened recently (and container_of introduced) in 5efabc7ed9 (Merge branch 'ew/hashmap', 2019-10-15).

Hmm, OK.

> And grepping around, I don't see any cases where it's _not_ at the beginning. So perhaps this is a problem waiting to bite us.

Possibly.

Speaking of grep, I ended up with this oneliner to show all the comparison functions used with hashmaps:

  git grep -W -f <(git grep -wh hashmap_init | cut -f2 -d, | grep -v -e ' .* ' -e NULL -e '^ *$' | sed 's/ //; s/.*)//; s/^/int /' | sort -u)

But I didn't learn much from looking at them, except that most have that unconditional container_of.

>> Stuffing the oidhash() result into a variable and using it twice with hashmap_entry_init() would work as well. This would make the reason for the duplicate find_object() code obvious, while keeping struct hashmap_entry opaque.
>
> I'd prefer not to use a separate variable, as that requires giving it a type.

Specifying the type that oidhash() returns and hashmap_entry_init() accepts is redundant and inconvenient, but that line should be easy to find and adapt when the function signatures are eventually changed. Which is perhaps never going to happen anyway?

> Perhaps:
>
>   hashmap_entry_init(&e->ent, lookup_entry.hash);
>
> which is used elsewhere is OK? That still assumes there's a hash field, but I think hashmap has always been pretty up front about that (the real sin in the original is assuming that nothing else needs to be initialized).

That conflicts with this statement from hashmap.h:

 * struct hashmap_entry is an opaque structure representing an entry in the
 * hash table.

Patch v2 looks good to me, though!

René
diff --git a/fast-import.c b/fast-import.c
index 202dda11a6..0ef6defc10 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -39,12 +39,28 @@
 
 struct object_entry {
 	struct pack_idx_entry idx;
-	struct object_entry *next;
+	struct hashmap_entry ent;
 	uint32_t type : TYPE_BITS,
 		pack_id : PACK_ID_BITS,
 		depth : DEPTH_BITS;
 };
 
+static int object_entry_hashcmp(const void *map_data,
+				const struct hashmap_entry *eptr,
+				const struct hashmap_entry *entry_or_key,
+				const void *keydata)
+{
+	const struct object_id *oid = keydata;
+	const struct object_entry *e1, *e2;
+
+	e1 = container_of(eptr, const struct object_entry, ent);
+	if (oid)
+		return oidcmp(&e1->idx.oid, oid);
+
+	e2 = container_of(entry_or_key, const struct object_entry, ent);
+	return oidcmp(&e1->idx.oid, &e2->idx.oid);
+}
+
 struct object_entry_pool {
 	struct object_entry_pool *next_pool;
 	struct object_entry *next_free;
@@ -178,7 +194,7 @@ static off_t pack_size;
 /* Table of objects we've written. */
 static unsigned int object_entry_alloc = 5000;
 static struct object_entry_pool *blocks;
-static struct object_entry *object_table[1 << 16];
+static struct hashmap object_table;
 static struct mark_set *marks;
 static const char *export_marks_file;
 static const char *import_marks_file;
@@ -455,44 +471,42 @@ static struct object_entry *new_object(struct object_id *oid)
 
 static struct object_entry *find_object(struct object_id *oid)
 {
-	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
-	struct object_entry *e;
-	for (e = object_table[h]; e; e = e->next)
-		if (oideq(oid, &e->idx.oid))
-			return e;
+	struct hashmap_entry lookup_entry, *e;
+
+	hashmap_entry_init(&lookup_entry, oidhash(oid));
+	e = hashmap_get(&object_table, &lookup_entry, oid);
+	if (e)
+		return container_of(e, struct object_entry, ent);
 	return NULL;
 }
 
 static struct object_entry *insert_object(struct object_id *oid)
 {
-	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
-	struct object_entry *e = object_table[h];
+	struct object_entry *e;
+	struct hashmap_entry lookup_entry, *hashent;
 
-	while (e) {
-		if (oideq(oid, &e->idx.oid))
-			return e;
-		e = e->next;
-	}
+	hashmap_entry_init(&lookup_entry, oidhash(oid));
+	hashent = hashmap_get(&object_table, &lookup_entry, oid);
+	if (hashent)
+		return container_of(hashent, struct object_entry, ent);
 
 	e = new_object(oid);
-	e->next = object_table[h];
 	e->idx.offset = 0;
-	object_table[h] = e;
+	e->ent.hash = lookup_entry.hash;
+	hashmap_add(&object_table, &e->ent);
 	return e;
 }
 
 static void invalidate_pack_id(unsigned int id)
 {
-	unsigned int h;
 	unsigned long lu;
 	struct tag *t;
+	struct hashmap_iter iter;
+	struct object_entry *e;
 
-	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
-		struct object_entry *e;
-
-		for (e = object_table[h]; e; e = e->next)
-			if (e->pack_id == id)
-				e->pack_id = MAX_PACK_ID;
+	hashmap_for_each_entry(&object_table, &iter, e, ent) {
+		if (e->pack_id == id)
+			e->pack_id = MAX_PACK_ID;
 	}
 
 	for (lu = 0; lu < branch_table_sz; lu++) {
@@ -3511,6 +3525,8 @@ int cmd_main(int argc, const char **argv)
 	avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
 	marks = mem_pool_calloc(&fi_mem_pool, 1, sizeof(struct mark_set));
 
+	hashmap_init(&object_table, object_entry_hashcmp, NULL, 0);
+
 	/*
 	 * We don't parse most options until after we've seen the set of
 	 * "feature" lines at the start of the stream (which allows the command
diff --git a/t/perf/p9300-fast-import-export.sh b/t/perf/p9300-fast-import-export.sh
new file mode 100755
index 0000000000..c3c743d04a
--- /dev/null
+++ b/t/perf/p9300-fast-import-export.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+test_description='test fast-import and fast-export performance'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+# Use --no-data here to produce a vastly smaller export file.
+# This is much cheaper to work with but should still exercise
+# fast-import pretty well (we'll still process all commits and
+# trees, which account for 60% or more of objects in most repos).
+#
+# Use --reencode to avoid the default of aborting on non-utf8 commits,
+# which lets this test run against a wider variety of sample repos.
+test_perf 'export (no-blobs)' '
+	git fast-export --reencode=yes --no-data HEAD >export
+'
+
+test_perf 'import (no-blobs)' '
+	git fast-import --force <export
+'
+
+test_done
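For completeness, one way to run the new perf script against a large history, assuming the standard t/perf harness (./run and the GIT_PERF_REPO variable come from the existing perf framework; the path to a linux.git clone is a placeholder):

	cd t/perf
	GIT_PERF_REPO=/path/to/linux.git ./run HEAD^ HEAD -- p9300-fast-import-export.sh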
We use a custom hash in fast-import to store the set of objects we've imported so far. It has a fixed set of 2^16 buckets and chains any collisions with a linked list. As the number of objects grows larger than that, the load factor increases and we degrade to O(n) lookups and O(n^2) insertions.

We can scale better by using our hashmap.c implementation, which will resize the bucket count as we grow. This does incur an extra memory cost of 8 bytes per object, as hashmap stores the integer hash value for each entry in its hashmap_entry struct (which we really don't care about here, because we're just reusing the embedded object hash). But I think the numbers below justify this (and our per-object memory cost is already much higher).

I also looked at using khash, but it seemed to perform slightly worse than hashmap at all sizes, and worse even than the existing code for small sizes. It's also awkward to use here, because we want to look up a "struct object_entry" from a "struct object_id", and it doesn't handle mismatched keys as well. Making a mapping of object_id to object_entry would be more natural, but that would require pulling the embedded oid out of the object_entry or incurring an extra 32 bytes per object.

In a synthetic test creating as many cheap, tiny objects as possible

  perl -e '
    my $bits = shift;
    my $nr = 2**$bits;

    for (my $i = 0; $i < $nr; $i++) {
      print "blob\n";
      print "data 4\n";
      print pack("N", $i);
    }
  ' $bits | git fast-import

I got these results:

  nr_objects   master      khash       hashmap
  2^20         0m4.317s    0m5.109s    0m3.890s
  2^21         0m10.204s   0m9.702s    0m7.933s
  2^22         0m27.159s   0m17.911s   0m16.751s
  2^23         1m19.038s   0m35.080s   0m31.963s
  2^24         4m18.766s   1m10.233s   1m6.793s

which points to hashmap as the winner. We didn't have any perf tests for fast-export or fast-import, so I added one as a more real-world case. It uses an export without blobs since that's significantly cheaper than a full one, but still is an interesting case people might use (e.g., for rewriting history). It will emphasize this change in some ways (as a percentage we spend more time making objects and less shuffling blob bytes around) and less in others (the total object count is lower).

Here are the results for linux.git:

  Test                        HEAD^               HEAD
  ----------------------------------------------------------------------------
  9300.1: export (no-blobs)   67.64(66.96+0.67)   67.81(67.06+0.75)   +0.3%
  9300.2: import (no-blobs)   284.04(283.34+0.69) 198.09(196.01+0.92) -30.3%

It only has ~5.2M commits and trees, so this is a larger effect than I expected (the 2^23 case above only improved by 50s or so, but here we gained almost 90s). This is probably due to actually performing more object lookups in a real import with trees and commits, as opposed to just dumping a bunch of blobs into a pack.

Signed-off-by: Jeff King <peff@peff.net>
---
 fast-import.c                      | 62 +++++++++++++++++++-----------
 t/perf/p9300-fast-import-export.sh | 23 +++++++++++
 2 files changed, 62 insertions(+), 23 deletions(-)
 create mode 100755 t/perf/p9300-fast-import-export.sh