[0/5] handling 4GB .idx files

Message ID 20201113050631.GA744608@coredump.intra.peff.net

Message

Jeff King Nov. 13, 2020, 5:06 a.m. UTC
I recently ran into a case where Git could not read the pack it had
produced via running "git repack". The culprit turned out to be an .idx
file which crossed the 4GB barrier (in bytes, not number of objects).
This series fixes the problems I saw, along with similar ones I couldn't
trigger in practice, and protects the .idx loading code against integer
overflows that would fool the size checks.

  [1/5]: compute pack .idx byte offsets using size_t
  [2/5]: use size_t to store pack .idx byte offsets
  [3/5]: fsck: correctly compute checksums on idx files larger than 4GB
  [4/5]: block-sha1: take a size_t length parameter
  [5/5]: packfile: detect overflow in .idx file size checks

 block-sha1/sha1.c        |  2 +-
 block-sha1/sha1.h        |  2 +-
 builtin/index-pack.c     |  2 +-
 builtin/pack-redundant.c |  6 +++---
 pack-check.c             | 10 +++++-----
 pack-revindex.c          |  2 +-
 packfile.c               | 14 +++++++-------
 7 files changed, 19 insertions(+), 19 deletions(-)

-Peff

Comments

Thomas Braun Nov. 15, 2020, 2:43 p.m. UTC | #1
On 13.11.2020 06:06, Jeff King wrote:
> I recently ran into a case where Git could not read the pack it had
> produced via running "git repack". The culprit turned out to be an .idx
> file which crossed the 4GB barrier (in bytes, not number of objects).
> This series fixes the problems I saw, along with similar ones I couldn't
> trigger in practice, and protects the .idx loading code against integer
> overflows that would fool the size checks.

Would it be feasible to have a test case for this large index case? It
would almost certainly need an EXPENSIVE tag, and might not even work on
Windows yet. But hopefully someday I'll find some more time to push
large object support on Windows forward, and these kinds of tests would
really help then.
Jeff King Nov. 16, 2020, 4:10 a.m. UTC | #2
On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:

> On 13.11.2020 06:06, Jeff King wrote:
> > I recently ran into a case where Git could not read the pack it had
> > produced via running "git repack". The culprit turned out to be an .idx
> > file which crossed the 4GB barrier (in bytes, not number of objects).
> > This series fixes the problems I saw, along with similar ones I couldn't
> > trigger in practice, and protects the .idx loading code against integer
> > overflows that would fool the size checks.
> 
> Would it be feasible to have a test case for this large index case? It
> would almost certainly need an EXPENSIVE tag, and might not even work
> on Windows yet. But hopefully someday I'll find some more time to push
> large object support on Windows forward, and these kinds of tests
> would really help then.

I think it would be a level beyond what we usually consider even for
EXPENSIVE. The cheapest I could come up with to generate the case is:

  perl -e '
	for (0..154_000_000) {
		print "blob\n";
		print "data <<EOF\n";
		print "$_\n";
		print "EOF\n";
	}
  ' |
  git fast-import

which took almost 13 minutes of CPU to run, and peaked around 15GB of
RAM (and takes about 6.7GB on disk).
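
As a rough sanity check on the arithmetic (nothing from the patches
themselves): each v2 .idx entry costs 20 bytes of object name plus 4
bytes of CRC plus 4 bytes of offset, so 154M objects land just past the
2^32-byte mark:

  perl -e 'printf "%d bytes (2^32 is %d)\n", 154_000_000 * 28, 2**32'
  ls -l .git/objects/pack/*.idx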

In the resulting repo, the old code barfed on lookups:

  $ blob=$(echo 0 | git hash-object --stdin)
  $ git cat-file blob $blob
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file

whereas now it works:

  $ git cat-file blob $blob
  0

That's the most basic test I think you could do. More interesting is
looking at entries that are actually after the 4GB mark. That requires
dumping the whole index:

  final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
  git cat-file blob $final

That takes ~35s to run. Curiously, it also allocates 5GB of heap. For
some reason it decides to make an internal copy of the entries table. I
guess because it reads the file sequentially rather than mmap-ing it,
and 64-bit offsets in v2 idx files can't be resolved until we've read
the whole entry table (and it wants to output the entries in sha1
order).

The checksum bug requires running git-fsck on the repo. That's another 5
minutes of CPU (and even higher peak memory; I think we create a "struct
blob" for each one, and it seems to hit 20GB).

Hitting the other cases that I fixed but never triggered in practice
would need a repo about 4x as large. So figure an hour of CPU and 60GB
of RAM.

So I dunno. I wouldn't be opposed to codifying some of that in a script,
but I can't imagine anybody ever running it unless they were working on
this specific problem.

-Peff
Derrick Stolee Nov. 16, 2020, 1:30 p.m. UTC | #3
On 11/15/2020 11:10 PM, Jeff King wrote:
> On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:
> 
>> On 13.11.2020 06:06, Jeff King wrote:
>>> I recently ran into a case where Git could not read the pack it had
>>> produced via running "git repack". The culprit turned out to be an .idx
>>> file which crossed the 4GB barrier (in bytes, not number of objects).
>>> This series fixes the problems I saw, along with similar ones I couldn't
>>> trigger in practice, and protects the .idx loading code against integer
>>> overflows that would fool the size checks.
>>
>> Would it be feasible to have a test case for this large index case? It
>> would almost certainly need an EXPENSIVE tag, and might not even work
>> on Windows yet. But hopefully someday I'll find some more time to push
>> large object support on Windows forward, and these kinds of tests
>> would really help then.
> 
> I think it would be a level beyond what we usually consider even for
> EXPENSIVE. The cheapest I could come up with to generate the case is:

I agree that the cost of this test is more than I would expect for
EXPENSIVE.

>   perl -e '
> 	for (0..154_000_000) {
> 		print "blob\n";
> 		print "data <<EOF\n";
> 		print "$_\n";
> 		print "EOF\n";
> 	}
>   ' |
>   git fast-import
> 
> which took almost 13 minutes of CPU to run, and peaked around 15GB of
> RAM (and takes about 6.7GB on disk).

I was thinking that maybe the RAM requirements would be lower
if we batched the fast-import calls and then repacked, but then
the repack would probably be just as expensive.
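
Something like this untested sketch is what I had in mind (the batch
size of 20M and the eight batches are arbitrary, and it covers a bit
more than the 154M blobs above):

  for batch in 0 1 2 3 4 5 6 7
  do
	perl -e '
		my $start = $ARGV[0] * 20_000_000;
		for ($start..$start + 19_999_999) {
			print "blob\n";
			print "data <<EOF\n";
			print "$_\n";
			print "EOF\n";
		}
	' "$batch" |
	git fast-import
  done
  git repack -ad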

> In the resulting repo, the old code barfed on lookups:
> 
>   $ blob=$(echo 0 | git hash-object --stdin)
>   $ git cat-file blob $blob
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file
> 
> whereas now it works:
> 
>   $ git cat-file blob $blob
>   0
> 
> That's the most basic test I think you could do. More interesting is
> looking at entries that are actually after the 4GB mark. That requires
> dumping the whole index:
> 
>   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
>   git cat-file blob $final

Could you also (after running the test once) determine the largest
SHA-1, at least up to unique short-SHA? Then run something like

	git cat-file blob fffffe

Since your loop is hard-coded, you could even use the largest full
SHA-1.

Naturally, nothing short of a full .idx verification would be
completely sound, and we are already generating an enormous repo.

> So I dunno. I wouldn't be opposed to codifying some of that in a script,
> but I can't imagine anybody ever running it unless they were working on
> this specific problem.

It would be good to have this available somewhere in the codebase to
run whenever testing .idx changes. Perhaps create a new prerequisite
specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
environment variable?

It would be helpful to also write a multi-pack-index on top of this
.idx to ensure we can handle that case, too.

Thanks,
-Stolee
Jeff King Nov. 16, 2020, 11:49 p.m. UTC | #4
On Mon, Nov 16, 2020 at 08:30:34AM -0500, Derrick Stolee wrote:

> > which took almost 13 minutes of CPU to run, and peaked around 15GB of
> > RAM (and takes about 6.7GB on disk).
> 
> I was thinking that maybe the RAM requirements would be lower
> if we batched the fast-import calls and then repacked, but then
> the repack would probably be just as expensive.

I think it's even worse. Fast-import just holds enough data to create
the index (sha1, etc), but pack-objects is also holding data to support
the delta search, etc. A quick (well, quick to invoke, not to run):

   git show-index <.git/objects/pack/pack-*.idx |
   awk '{print $2}' |
   git pack-objects foo --all-progress

on the fast-import pack seems to cap out around 27GB.

I doubt you could do much better overall than fast-import in terms of
CPU. The trick is really that you need to have a matching content/sha1
pair for 154M objects, and that's where most of the time goes. If we
lied about what's in each object (just generating an index with sha1
...0001, ...0002, etc), we could go much faster. But it's a much less
interesting test then.
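
Just to sketch the shape of that idea (untested, and every field beyond
the layout is bogus), you could write a synthetic v2 .idx directly; it
matches no real pack, so it could only exercise the .idx-reading paths,
and wiring it into a repo (the pack-<hash> naming, a matching .pack
stub) is left out entirely:

  perl -e '
	use Digest::SHA;
	binmode(STDOUT);
	my $n = 154_000_000;
	my $sha = Digest::SHA->new(1);
	sub emit { my $d = join("", @_); $sha->add($d); print $d }

	emit("\377tOc", pack("N", 2));               # magic + version
	for my $b (0..255) {                         # cumulative fanout
		my $c = ($b + 1) * 2**24;
		$c = $n if $c > $n;
		emit(pack("N", $c));
	}
	emit(pack("N", $_), "\0" x 16) for 0..$n-1;  # 20-byte names, sorted
	emit(pack("N", 0)) for 0..$n-1;              # bogus crc32 values
	emit(pack("N", 12)) for 0..$n-1;             # bogus 31-bit offsets
	emit("\0" x 20);                             # bogus pack checksum
	print $sha->digest;                          # real idx checksum
  ' >fake.idx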

> > That's the most basic test I think you could do. More interesting is
> > looking at entries that are actually after the 4GB mark. That requires
> > dumping the whole index:
> > 
> >   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
> >   git cat-file blob $final
> 
> Could you also (after running the test once) determine the largest
> SHA-1, at least up to unique short-SHA? Then run something like
> 
> 	git cat-file blob fffffe
> 
> Since your loop is hard-coded, you could even use the largest full
> SHA-1.

That $final is the highest sha1. We could hard-code it, yes (and the
resulting lookup via cat-file is quite fast; it's the linear index dump
that's slow). We'd need the matching sha256 version, too. But it's
really the generation of the data that's the main issue.

> Naturally, nothing short of a full .idx verification would be
> completely sound, and we are already generating an enormous repo.

Yep.

> > So I dunno. I wouldn't be opposed to codifying some of that in a script,
> > but I can't imagine anybody ever running it unless they were working on
> > this specific problem.
> 
> It would be good to have this available somewhere in the codebase to
> run whenever testing .idx changes. Perhaps create a new prerequisite
> specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
> environment variable?

My feeling is that anybody who's really interested in playing with this
topic can find this thread in the archive. I don't think they're really
any worse off there than with a bit-rotting script in the repo that
nobody ever runs.

But if somebody wants to write up a test script, I'm happy to review it.

> It would be helpful to also write a multi-pack-index on top of this
> .idx to ensure we can handle that case, too.

I did run "git multi-pack-index write" on the resulting repo, which
completed in a reasonable amount of time (maybe 30-60s). And then
confirmed that lookups in the midx work just fine.
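
Roughly, that check was along these lines (illustrative, not a
transcript):

  git multi-pack-index write
  git multi-pack-index verify
  git cat-file blob $final    # should now be served via the midx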

-Peff
Thomas Braun Nov. 30, 2020, 10:57 p.m. UTC | #5
> Jeff King <peff@peff.net> wrote on 16.11.2020 at 05:10:

[...]

> So I dunno. I wouldn't be opposed to codifying some of that in 
> a script, but I can't imagine anybody ever running it unless they 
> were working on this specific problem.

Thanks for the pointers.

Below is what I came up with. It passes here. I've replaced the awk from the original draft with cut, and also moved the perl script out of the test, as I think the quoting gets way too messy otherwise. I've also added --no-dangling to git fsck, as otherwise it takes forever to output the obvious dangling blobs. The unpack limit is mostly for testing the test itself with a smaller number of blobs, but I still think it is worthwhile to force everything into a pack.

--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -97,4 +97,34 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
+{
+	echo "#!$SHELL_PATH"
+	cat <<'EOF'
+	   "$PERL_PATH" -e '
+		for (0..154_000_000) {
+			print "blob\n";
+			print "data <<EOF\n";
+			print "$_\n";
+			print "EOF\n";
+		} '
+EOF
+
+} >dump
+chmod +x dump
+
+test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '
+	test_config fastimport.unpacklimit 0 &&
+	./dump | git fast-import &&
+	blob=$(echo 0 | git hash-object --stdin) &&
+	git cat-file blob $blob >actual &&
+	echo 0 >expect &&
+	test_cmp expect actual &&
+	idx_pack=$(ls .git/objects/pack/*.idx) &&
+	test_file_not_empty $idx_pack &&
+	final=$(git show-index <$idx_pack | tail -1 | cut -d " " -f2) &&
+	git cat-file blob $final &&
+	git cat-file blob fffffff &&
+	git fsck --strict --no-dangling
+'
+
 test_done
--
Jeff King Dec. 1, 2020, 11:23 a.m. UTC | #6
On Mon, Nov 30, 2020 at 11:57:27PM +0100, Thomas Braun wrote:

> Below is what I came up with. It passes here. I've replaced the awk
> from the original draft with cut, and also moved the perl script out
> of the test, as I think the quoting gets way too messy otherwise. I've
> also added --no-dangling to git fsck, as otherwise it takes forever to
> output the obvious dangling blobs. The unpack limit is mostly for
> testing the test itself with a smaller number of blobs, but I still
> think it is worthwhile to force everything into a pack.

I think you can get rid of some of the quoting by using perl directly as
the interpreter, rather than a shell script that only invokes it with
-e. See below.

> --- a/t/t1600-index.sh
> +++ b/t/t1600-index.sh

I don't think this should go in t1600; that's about testing the
.git/index file, not a pack .idx. Probably t5302 would be more
appropriate.

> @@ -97,4 +97,34 @@ test_expect_success 'index version config precedence' '
>  	test_index_version 0 true 2 2
>  '
>  
> +{
> +	echo "#!$SHELL_PATH"
> +	cat <<'EOF'
> +	   "$PERL_PATH" -e '
> +		for (0..154_000_000) {
> +			print "blob\n";
> +			print "data <<EOF\n";
> +			print "$_\n";
> +			print "EOF\n";
> +		} '
> +EOF
> +
> +} >dump
> +chmod +x dump

You can simplify this a bit with write_script, as well. And we do prefer
to put this stuff in a test block, so verbosity, etc, is handled
correctly.

I didn't let it run to completion, but something like this seems to
work:

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 6d83aaf8a4..a4c1dc0f0a 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -97,23 +97,16 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
-{
-	echo "#!$SHELL_PATH"
-	cat <<'EOF'
-	   "$PERL_PATH" -e '
-		for (0..154_000_000) {
-			print "blob\n";
-			print "data <<EOF\n";
-			print "$_\n";
-			print "EOF\n";
-		} '
-EOF
-
-} >dump
-chmod +x dump
-
 test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '
 	test_config fastimport.unpacklimit 0 &&
+	write_script dump "$PERL_PATH" <<-\EOF &&
+	for (0..154_000_000) {
+		print "blob\n";
+		print "data <<EOF\n";
+		print "$_\n";
+		print "EOF\n";
+	}
+	EOF
 	./dump | git fast-import &&
 	blob=$(echo 0 | git hash-object --stdin) &&
 	git cat-file blob $blob >actual &&

> +test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '

You can drop the PERL prereq. Even without it set, we assume that we can
do basic perl one-liners that would work even in old versions of perl.

I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
(it's not strictly additive, since we can work in parallel on other
tests for the first bit, but still, yuck).

So we're looking at 2-3x to run the expensive tests now. This new one
would be 20x or more. I'm not sure if anybody would care or not (i.e.,
whether anyone actually runs the whole suite with this flag). I thought
we did for some CI job, but it looks like it's just the one-off in
t5608.

> +	git cat-file blob $final &&
> +	git cat-file blob fffffff &&

This final cat-file may be a problem when tested with SHA-256. You are
relying on the fact that there is exactly one object that matches seven
f's as its prefix. That may be true for SHA-1, but if so it's mostly
luck.  Seven hex digits is only 28 bits, which is ~260M. For 154M
objects, we'd expect an average of 0.57 objects per 7-digit prefix. So I
wouldn't be at all surprised if there are two of them for SHA-256.
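
For the curious, that figure is just:

  perl -e 'printf "%.2f objects per prefix\n", 154_000_000 / 2**28'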

I'm also not sure what it's testing that the $final one isn't.

-Peff
Taylor Blau Dec. 1, 2020, 6:27 p.m. UTC | #7
On Tue, Dec 01, 2020 at 06:23:28AM -0500, Jeff King wrote:
> I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
> VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
> seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
> bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
> (it's not strictly additive, since we can work in parallel on other
> tests for the first bit, but still, yuck).
>
> So we're looking at 2-3x to run the expensive tests now. This new one
> would be 20x or more. I'm not sure if anybody would care or not (i.e.,
> whether anyone actually runs the whole suite with this flag). I thought
> we did for some CI job, but it looks like it's just the one-off in
> t5608.

I had written something similar yesterday before mutt crashed and I
decided to stop work for the day.

I have a sense that probably very few people actually run GIT_TEST_LONG
regularly, and that that group may vanish entirely if we added a test
which increased the runtime of the suite by 20x in this mode.

I have mixed feelings about VERY_EXPENSIVE. On one hand, having this
test checked in so that we can quickly refer back to it in the case of a
regression is useful. On the other hand, what is it worth to have this
in-tree if nobody ever runs it? I'm speculating about whether or not
people would run this, of course.

My hunch is that anybody who is interested enough to fix regressions in
this area would be able to refer back to the list archive to dig up this
thread and recover the script.

I don't feel strongly, really, but just noting some light objections to
checking this test into the suite.

Thanks,
Taylor
Jeff King Dec. 2, 2020, 1:12 p.m. UTC | #8
On Tue, Dec 01, 2020 at 01:27:26PM -0500, Taylor Blau wrote:

> I have a sense that probably very few people actually run GIT_TEST_LONG
> regularly, and that that group may vanish entirely if we added a test
> which increased the runtime of the suite by 20x in this mode.

Yeah, and if that's so, then I'm OK with calling it EXPENSIVE and moving
on with our lives. And I guess merging this is one way to find out, if
anybody screams. The stakes are low enough that I don't mind doing that.

> I have mixed feelings about VERY_EXPENSIVE. On one hand, having this
> test checked in so that we can quickly refer back to it in the case of a
> regression is useful. On the other hand, what is it worth to have this
> in-tree if nobody ever runs it? I'm speculating about whether or not
> people would run this, of course.
> 
> My hunch is that anybody who is interested enough to fix regressions in
> this area would be able to refer back to the list archive to dig up this
> thread and recover the script.
> 
> I don't feel strongly, really, but just noting some light objections to
> checking this test into the suite.

Yeah, that about matches my feelings. But if Thomas wants to wrap it up
as a patch, I don't mind seeing what happens (and I think with the
suggestions I gave earlier, it would be in good shape).

-Peff