diff mbox series

[1/7] oid_array: use size_t for count and allocation

Message ID 20200330140309.GA2456038@coredump.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series oid_array cleanups | expand

Commit Message

Jeff King March 30, 2020, 2:03 p.m. UTC
The oid_array object uses an "int" to store the number of items and the
allocated size. It's rather unlikely for somebody to have more than 2^31
objects in a repository (the sha1's alone would be 40GB!), but if they
do, we'd overflow our alloc variable.

You can reproduce this case with something like:

  git init repo
  cd repo

  # make a pack with 2^24 objects
  perl -e '
    my $nr = 2**24;

    for (my $i = 0; $i < $nr; $i++) {
	    print "blob\n";
	    print "data 4\n";
	    print pack("N", $i);
    }
  ' | git fast-import

  # now make 256 copies of it; most of these objects will be duplicates,
  # but oid_array doesn't de-dup until all values are read and it can
  # sort the result.
  cd .git/objects/pack/
  pack=$(echo *.pack)
  idx=$(echo *.idx)
  for i in $(seq 0 255); do
    # no need to waste disk space
    ln "$pack" "pack-extra-$i.pack"
    ln "$idx" "pack-extra-$i.idx"
  done

  # and now force an oid_array to store all of it
  git cat-file --batch-all-objects --batch-check

which results in:

  fatal: size_t overflow: 32 * 18446744071562067968

So the good news is that st_mult() sees the problem (the large number is
because our int wraps negative, and then that gets cast to a size_t),
doing the job it was meant to: bailing in crazy situations rather than
causing an undersized buffer.

But we should avoid hitting this case at all, and instead limit
ourselves based on what malloc() is willing to give us. We can easily do
that by switching to size_t.

The cat-file process above made it to ~120GB virtual set size before the
integer overflow (our internal hash storage is 32-bytes now in
preparation for sha256, so we'd expect ~128GB total needed, plus
potentially more to copy from one realloc'd block to another). After
this patch (and about 130GB of RAM+swap), it does eventually read in the
whole set. No test for obvious reasons.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1-array.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jeff King March 30, 2020, 2:09 p.m. UTC | #1
On Mon, Mar 30, 2020 at 10:03:10AM -0400, Jeff King wrote:

> The oid_array object uses an "int" to store the number of items and the
> allocated size. It's rather unlikely for somebody to have more than 2^31
> objects in a repository (the sha1's alone would be 40GB!), but if they
> do, we'd overflow our alloc variable.
> 
> You can reproduce this case with something like:
> 
>   git init repo
>   cd repo
> 
>   # make a pack with 2^24 objects
>   perl -e '
>     my $nr = 2**24;
> 
>     for (my $i = 0; $i < $nr; $i++) {
> 	    print "blob\n";
> 	    print "data 4\n";
> 	    print pack("N", $i);
>     }
>   ' | git fast-import

I briefly tried to create a packfile manually that had a ton of objects
in it, thinking it would be faster. Conceptually it's pretty simple to
write a pack header, and then a series of pack-object headers (which are
the same single-byte each time) and then a 4-byte body. But you have to
zlib compress the body, which created a bunch of headaches.

So I arrived at this fast-import solution, which was...not super fast.
Profiling showed that we were spending 80% of the time inserting into
our custom hashtable, which is fixed at 2^16 entries and then chains
beyond that. Swapping it out for a khash proved much faster, but I'm not
sure if the memory games are too gross (see the comment in find_object
below).

I also didn't look into whether we could get rid of the extra allocating
pool (and store the structs right in the hash), or if it's necessary for
their pointers to be stable.

---
diff --git a/fast-import.c b/fast-import.c
index 202dda11a6..427dd73d26 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -39,12 +39,25 @@
 
 struct object_entry {
 	struct pack_idx_entry idx;
-	struct object_entry *next;
 	uint32_t type : TYPE_BITS,
 		pack_id : PACK_ID_BITS,
 		depth : DEPTH_BITS;
 };
 
+static inline unsigned int object_entry_hash(struct object_entry *oe)
+{
+	return oidhash(&oe->idx.oid);
+}
+
+static inline int object_entry_equal(struct object_entry *a,
+				     struct object_entry *b)
+{
+	return oideq(&a->idx.oid, &b->idx.oid);
+}
+
+KHASH_INIT(object_entry_set, struct object_entry *, int, 0,
+	   object_entry_hash, object_entry_equal);
+
 struct object_entry_pool {
 	struct object_entry_pool *next_pool;
 	struct object_entry *next_free;
@@ -178,7 +191,7 @@ static off_t pack_size;
 /* Table of objects we've written. */
 static unsigned int object_entry_alloc = 5000;
 static struct object_entry_pool *blocks;
-static struct object_entry *object_table[1 << 16];
+static kh_object_entry_set_t object_table;
 static struct mark_set *marks;
 static const char *export_marks_file;
 static const char *import_marks_file;
@@ -455,44 +468,40 @@ static struct object_entry *new_object(struct object_id *oid)
 
 static struct object_entry *find_object(struct object_id *oid)
 {
-	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
-	struct object_entry *e;
-	for (e = object_table[h]; e; e = e->next)
-		if (oideq(oid, &e->idx.oid))
-			return e;
+	/*
+	 * this cast works because we only look at the oid part of the entry,
+	 * and it comes first in the struct
+	 */
+	khiter_t pos = kh_get_object_entry_set(&object_table,
+					       (struct object_entry *)oid);
+	if (pos != kh_end(&object_table))
+		return kh_key(&object_table, pos);
 	return NULL;
 }
 
 static struct object_entry *insert_object(struct object_id *oid)
 {
-	unsigned int h = oid->hash[0] << 8 | oid->hash[1];
-	struct object_entry *e = object_table[h];
-
-	while (e) {
-		if (oideq(oid, &e->idx.oid))
-			return e;
-		e = e->next;
-	}
+	struct object_entry *e;
+	int redundant;
 
 	e = new_object(oid);
-	e->next = object_table[h];
 	e->idx.offset = 0;
-	object_table[h] = e;
+	kh_put_object_entry_set(&object_table, e, &redundant);
 	return e;
 }
 
 static void invalidate_pack_id(unsigned int id)
 {
-	unsigned int h;
 	unsigned long lu;
 	struct tag *t;
+	khiter_t iter;
 
-	for (h = 0; h < ARRAY_SIZE(object_table); h++) {
-		struct object_entry *e;
-
-		for (e = object_table[h]; e; e = e->next)
+	for (iter = kh_begin(&object_table); iter != kh_end(&object_table); iter++) {
+		if (kh_exist(&object_table, iter)) {
+			struct object_entry *e = kh_key(&object_table, iter);
 			if (e->pack_id == id)
 				e->pack_id = MAX_PACK_ID;
+		}
 	}
 
 	for (lu = 0; lu < branch_table_sz; lu++) {
Taylor Blau April 15, 2020, 12:27 a.m. UTC | #2
On Mon, Mar 30, 2020 at 10:03:09AM -0400, Jeff King wrote:
> The oid_array object uses an "int" to store the number of items and the
> allocated size. It's rather unlikely for somebody to have more than 2^31
> objects in a repository (the sha1's alone would be 40GB!), but if they
> do, we'd overflow our alloc variable.
>
> You can reproduce this case with something like:
>
>   git init repo
>   cd repo
>
>   # make a pack with 2^24 objects
>   perl -e '
>     my $nr = 2**24;
>
>     for (my $i = 0; $i < $nr; $i++) {
> 	    print "blob\n";
> 	    print "data 4\n";
> 	    print pack("N", $i);
>     }
>   ' | git fast-import
>
>   # now make 256 copies of it; most of these objects will be duplicates,
>   # but oid_array doesn't de-dup until all values are read and it can
>   # sort the result.
>   cd .git/objects/pack/
>   pack=$(echo *.pack)
>   idx=$(echo *.idx)
>   for i in $(seq 0 255); do
>     # no need to waste disk space
>     ln "$pack" "pack-extra-$i.pack"
>     ln "$idx" "pack-extra-$i.idx"
>   done
>
>   # and now force an oid_array to store all of it
>   git cat-file --batch-all-objects --batch-check
>
> which results in:
>
>   fatal: size_t overflow: 32 * 18446744071562067968
>
> So the good news is that st_mult() sees the problem (the large number is
> because our int wraps negative, and then that gets cast to a size_t),
> doing the job it was meant to: bailing in crazy situations rather than
> causing an undersized buffer.
>
> But we should avoid hitting this case at all, and instead limit
> ourselves based on what malloc() is willing to give us. We can easily do
> that by switching to size_t.
>
> The cat-file process above made it to ~120GB virtual set size before the
> integer overflow (our internal hash storage is 32-bytes now in
> preparation for sha256, so we'd expect ~128GB total needed, plus
> potentially more to copy from one realloc'd block to another). After
> this patch (and about 130GB of RAM+swap), it does eventually read in the
> whole set. No test for obvious reasons.

;). This patch looks good, and makes immediate sense given your
explanation.

> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  sha1-array.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/sha1-array.h b/sha1-array.h
> index dc1bca9c9a..c5e4b9324f 100644
> --- a/sha1-array.h
> +++ b/sha1-array.h
> @@ -49,8 +49,8 @@
>   */
>  struct oid_array {
>  	struct object_id *oid;
> -	int nr;
> -	int alloc;
> +	size_t nr;
> +	size_t alloc;
>  	int sorted;
>  };
>
> --
> 2.26.0.597.g7e08ed78ff

Thanks,
Taylor
diff mbox series

Patch

diff --git a/sha1-array.h b/sha1-array.h
index dc1bca9c9a..c5e4b9324f 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -49,8 +49,8 @@ 
  */
 struct oid_array {
 	struct object_id *oid;
-	int nr;
-	int alloc;
+	size_t nr;
+	size_t alloc;
 	int sorted;
 };