Message ID: 20201113050631.GA744608@coredump.intra.peff.net
Series: handling 4GB .idx files
On 13.11.2020 06:06, Jeff King wrote:
> I recently ran into a case where Git could not read the pack it had
> produced via running "git repack". The culprit turned out to be an .idx
> file which crossed the 4GB barrier (in bytes, not number of objects).
> This series fixes the problems I saw, along with similar ones I couldn't
> trigger in practice, and protects the .idx loading code against integer
> overflows that would fool the size checks.

Would it be feasible to have a test case for this large index case? This
should certainly have an EXPENSIVE tag, and might not even work on
Windows yet. But hopefully someday I'll find some more time to push large
object support on Windows forward, and these kinds of tests would really
help then.
On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:

> On 13.11.2020 06:06, Jeff King wrote:
> > I recently ran into a case where Git could not read the pack it had
> > produced via running "git repack". The culprit turned out to be an .idx
> > file which crossed the 4GB barrier (in bytes, not number of objects).
> > This series fixes the problems I saw, along with similar ones I couldn't
> > trigger in practice, and protects the .idx loading code against integer
> > overflows that would fool the size checks.
>
> Would it be feasible to have a test case for this large index case? This
> should certainly have an EXPENSIVE tag, and might not even work on
> Windows yet. But hopefully someday I'll find some more time to push large
> object support on Windows forward, and these kinds of tests would really
> help then.

I think it would be a level beyond what we usually consider even for
EXPENSIVE. The cheapest I could come up with to generate the case is:

  perl -e '
	for (0..154_000_000) {
		print "blob\n";
		print "data <<EOF\n";
		print "$_\n";
		print "EOF\n";
	}
  ' |
  git fast-import

which took almost 13 minutes of CPU to run, and peaked around 15GB of
RAM (and takes about 6.7GB on disk).

In the resulting repo, the old code barfed on lookups:

  $ blob=$(echo 0 | git hash-object --stdin)
  $ git cat-file blob $blob
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file

whereas now it works:

  $ git cat-file blob $blob
  0

That's the most basic test I think you could do. More interesting is
looking at entries that are actually after the 4GB mark. That requires
dumping the whole index:

  final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
  git cat-file blob $final

That takes ~35s to run. Curiously, it also allocates 5GB of heap. For
some reason it decides to make an internal copy of the entries table. I
guess because it reads the file sequentially rather than mmap-ing it,
and 64-bit offsets in v2 idx files can't be resolved until we've read
the whole entry table (and it wants to output the entries in sha1
order).

The checksum bug requires running git-fsck on the repo. That's another
5 minutes of CPU (and even higher peak memory; I think we create a
"struct blob" for each one, and it seems to hit 20GB).

Hitting the other cases that I fixed but never triggered in practice
would need a repo about 4x as large. So figure an hour of CPU and 60GB
of RAM.

So I dunno. I wouldn't be opposed to codifying some of that in a script,
but I can't imagine anybody ever running it unless they were working on
this specific problem.

-Peff
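The sizes Peff quotes imply the .idx itself ends up past 4GB; a quick,
portable way to confirm that the generated index really did cross the
boundary (a sketch; 4294967296 is 2^32) would be:

  # the repo should contain a single pack after the fast-import run
  idx=$(ls .git/objects/pack/*.idx) &&
  test $(wc -c <"$idx") -gt 4294967296 &&
  echo "idx crossed the 4GB mark"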
On 11/15/2020 11:10 PM, Jeff King wrote:
> On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:
>
>> On 13.11.2020 06:06, Jeff King wrote:
>>> I recently ran into a case where Git could not read the pack it had
>>> produced via running "git repack". The culprit turned out to be an .idx
>>> file which crossed the 4GB barrier (in bytes, not number of objects).
>>> This series fixes the problems I saw, along with similar ones I couldn't
>>> trigger in practice, and protects the .idx loading code against integer
>>> overflows that would fool the size checks.
>>
>> Would it be feasible to have a test case for this large index case? This
>> should certainly have an EXPENSIVE tag, and might not even work on
>> Windows yet. But hopefully someday I'll find some more time to push large
>> object support on Windows forward, and these kinds of tests would really
>> help then.
>
> I think it would be a level beyond what we usually consider even for
> EXPENSIVE. The cheapest I could come up with to generate the case is:

I agree that the cost of this test is more than I would expect for
EXPENSIVE.

>   perl -e '
> 	for (0..154_000_000) {
> 		print "blob\n";
> 		print "data <<EOF\n";
> 		print "$_\n";
> 		print "EOF\n";
> 	}
>   ' |
>   git fast-import
>
> which took almost 13 minutes of CPU to run, and peaked around 15GB of
> RAM (and takes about 6.7GB on disk).

I was thinking that maybe the RAM requirements would be lower if we
batched the fast-import calls and then repacked, but then the repack
would probably be just as expensive.

> In the resulting repo, the old code barfed on lookups:
>
>   $ blob=$(echo 0 | git hash-object --stdin)
>   $ git cat-file blob $blob
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file
>
> whereas now it works:
>
>   $ git cat-file blob $blob
>   0
>
> That's the most basic test I think you could do. More interesting is
> looking at entries that are actually after the 4GB mark. That requires
> dumping the whole index:
>
>   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
>   git cat-file blob $final

Could you also (after running the test once) determine the largest
SHA-1, at least up to unique short-SHA? Then run something like

  git cat-file blob fffffe

Since your loop is hard-coded, you could even use the largest full
SHA-1. Naturally, nothing short of a full .idx verification would be
completely sound, and we are already generating an enormous repo.

> So I dunno. I wouldn't be opposed to codifying some of that in a script,
> but I can't imagine anybody ever running it unless they were working on
> this specific problem.

It would be good to have this available somewhere in the codebase to
run whenever testing .idx changes. Perhaps create a new prerequisite
specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
environment variable?
It would be helpful to also write a multi-pack-index on top of this
.idx to ensure we can handle that case, too.

Thanks,
-Stolee
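A prerequisite along the lines Stolee suggests could be declared with
the test suite's existing helpers (a sketch; EXPENSIVE_IDX and
GIT_TEST_EXPENSIVE_IDX are placeholder names suggested here, not
existing ones):

  # satisfied only when the tester opts in via the environment
  test_lazy_prereq EXPENSIVE_IDX '
	test_bool_env GIT_TEST_EXPENSIVE_IDX false
  '

Tests guarded with "test_expect_success EXPENSIVE_IDX" would then be
skipped unless GIT_TEST_EXPENSIVE_IDX=true is set.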
On Mon, Nov 16, 2020 at 08:30:34AM -0500, Derrick Stolee wrote:

> > which took almost 13 minutes of CPU to run, and peaked around 15GB of
> > RAM (and takes about 6.7GB on disk).
>
> I was thinking that maybe the RAM requirements would be lower
> if we batched the fast-import calls and then repacked, but then
> the repack would probably be just as expensive.

I think it's even worse. Fast-import just holds enough data to create
the index (sha1, etc), but pack-objects is also holding data to support
the delta search, etc. A quick (well, quick to invoke, not to run):

  git show-index <.git/objects/pack/pack-*.idx |
  awk '{print $2}' |
  git pack-objects foo --all-progress

on the fast-import pack seems to cap out around 27GB.

I doubt you could do much better overall than fast-import in terms of
CPU. The trick is really that you need to have a matching content/sha1
pair for 154M objects, and that's where most of the time goes. If we
lied about what's in each object (just generating an index with sha1
...0001, ...0002, etc), we could go much faster. But it's a much less
interesting test then.

> > That's the most basic test I think you could do. More interesting is
> > looking at entries that are actually after the 4GB mark. That requires
> > dumping the whole index:
> >
> >   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
> >   git cat-file blob $final
>
> Could you also (after running the test once) determine the largest
> SHA-1, at least up to unique short-SHA? Then run something like
>
>   git cat-file blob fffffe
>
> Since your loop is hard-coded, you could even use the largest full
> SHA-1.

That $final is the highest sha1. We could hard-code it, yes (and the
resulting lookup via cat-file is quite fast; it's the linear index dump
that's slow). We'd need the matching sha256 version, too. But it's
really the generation of the data that's the main issue.

> Naturally, nothing short of a full .idx verification would be
> completely sound, and we are already generating an enormous repo.

Yep.

> > So I dunno. I wouldn't be opposed to codifying some of that in a script,
> > but I can't imagine anybody ever running it unless they were working on
> > this specific problem.
>
> It would be good to have this available somewhere in the codebase to
> run whenever testing .idx changes. Perhaps create a new prerequisite
> specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
> environment variable?

My feeling is that anybody who's really interested in playing with this
topic can find this thread in the archive. I don't think they're really
any worse off there than with a bit-rotting script in the repo that
nobody ever runs. But if somebody wants to write up a test script, I'm
happy to review it.

> It would be helpful to also write a multi-pack-index on top of this
> .idx to ensure we can handle that case, too.

I did run "git multi-pack-index write" on the resulting repo, which
completed in a reasonable amount of time (maybe 30-60s). And then
confirmed that lookups in the midx work just fine.

-Peff
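The midx check Peff describes amounts to something like this (a sketch;
$final is the last object from the show-index dump earlier, and the
verify step is an additional, heavier check):

  # build a multi-pack-index over the existing pack(s)
  git multi-pack-index write &&
  # object lookups should now be served via the midx
  git cat-file blob $final &&
  # optional full verification of the midx itself
  git multi-pack-index verify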
> Jeff King <peff@peff.net> wrote on 16.11.2020 05:10:

[...]

> So I dunno. I wouldn't be opposed to codifying some of that in
> a script, but I can't imagine anybody ever running it unless they
> were working on this specific problem.

Thanks for the pointers. Below is what I came up with. It passes here.

I've replaced awk with cut from the original draft, and also moved the
perl script out of the test, as I think the quoting gets way too messy
otherwise. And I've added --no-dangling to git fsck, as otherwise it
takes forever to output the obvious dangling blobs. The unpack limit is
mostly for testing the test itself with a smaller number of blobs. But
I still think it is worthwhile to force everything into a pack.

--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -97,4 +97,34 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
+{
+	echo "#!$SHELL_PATH"
+	cat <<'EOF'
+	"$PERL_PATH" -e '
+		for (0..154_000_000) {
+			print "blob\n";
+			print "data <<EOF\n";
+			print "$_\n";
+			print "EOF\n";
+		} '
+EOF
+
+} >dump
+chmod +x dump
+
+test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '
+	test_config fastimport.unpacklimit 0 &&
+	./dump | git fast-import &&
+	blob=$(echo 0 | git hash-object --stdin) &&
+	git cat-file blob $blob >actual &&
+	echo 0 >expect &&
+	test_cmp expect actual &&
+	idx_pack=$(ls .git/objects/pack/*.idx) &&
+	test_file_not_empty $idx_pack &&
+	final=$(git show-index <$idx_pack | tail -1 | cut -d " " -f2) &&
+	git cat-file blob $final &&
+	git cat-file blob fffffff &&
+	git fsck --strict --no-dangling
+'
+
 test_done
--
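As a usage note for anyone trying the patch: the EXPENSIVE prerequisite
is driven by GIT_TEST_LONG, so the new test only runs when that is set,
e.g. (a sketch; -v just makes the test output visible):

  cd t &&
  GIT_TEST_LONG=1 ./t1600-index.sh -v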
On Mon, Nov 30, 2020 at 11:57:27PM +0100, Thomas Braun wrote:

> Below is what I came up with. It passes here. I've replaced awk with
> cut from the original draft, and also moved the perl script out of the
> test, as I think the quoting gets way too messy otherwise. And I've
> added --no-dangling to git fsck, as otherwise it takes forever to
> output the obvious dangling blobs. The unpack limit is mostly for
> testing the test itself with a smaller number of blobs. But I still
> think it is worthwhile to force everything into a pack.

I think you can get rid of some of the quoting by using perl directly
as the interpreter, rather than a shell script that only invokes it
with -e. See below.

> --- a/t/t1600-index.sh
> +++ b/t/t1600-index.sh

I don't think this should go in t1600; that's about testing the
.git/index file, not a pack .idx. Probably t5302 would be more
appropriate.

> @@ -97,4 +97,34 @@ test_expect_success 'index version config precedence' '
>  	test_index_version 0 true 2 2
>  '
> 
> +{
> +	echo "#!$SHELL_PATH"
> +	cat <<'EOF'
> +	"$PERL_PATH" -e '
> +		for (0..154_000_000) {
> +			print "blob\n";
> +			print "data <<EOF\n";
> +			print "$_\n";
> +			print "EOF\n";
> +		} '
> +EOF
> +
> +} >dump
> +chmod +x dump

You can simplify this a bit with write_script, as well. And we do
prefer to put this stuff in a test block, so verbosity, etc, is handled
correctly. I didn't let it run to completion, but something like this
seems to work:

diff --git a/t/t1600-index.sh b/t/t1600-index.sh
index 6d83aaf8a4..a4c1dc0f0a 100755
--- a/t/t1600-index.sh
+++ b/t/t1600-index.sh
@@ -97,23 +97,16 @@ test_expect_success 'index version config precedence' '
 	test_index_version 0 true 2 2
 '
 
-{
-	echo "#!$SHELL_PATH"
-	cat <<'EOF'
-	"$PERL_PATH" -e '
-		for (0..154_000_000) {
-			print "blob\n";
-			print "data <<EOF\n";
-			print "$_\n";
-			print "EOF\n";
-		} '
-EOF
-
-} >dump
-chmod +x dump
-
 test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '
 	test_config fastimport.unpacklimit 0 &&
+	write_script dump "$PERL_PATH" <<-\EOF &&
+	for (0..154_000_000) {
+		print "blob\n";
+		print "data <<EOF\n";
+		print "$_\n";
+		print "EOF\n";
+	}
+	EOF
 	./dump | git fast-import &&
 	blob=$(echo 0 | git hash-object --stdin) &&
 	git cat-file blob $blob >actual &&

> +test_expect_success EXPENSIVE,PERL 'Test 4GB boundary for the index' '

You can drop the PERL prereq. Even without it set, we assume that we
can do basic perl one-liners that would work even in old versions of
perl.

I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes
64 seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It
got a bit worse since then, as t7900 adds an EXPENSIVE test that takes
~200s (it's not strictly additive, since we can work in parallel on
other tests for the first bit, but still, yuck).

So we're looking at 2-3x to run the expensive tests now. This new one
would be 20x or more. I'm not sure if anybody would care or not (i.e.,
whether anyone actually runs the whole suite with this flag). I thought
we did for some CI job, but it looks like it's just the one-off in
t5608.

> +	git cat-file blob $final &&
> +	git cat-file blob fffffff &&

This final cat-file may be a problem when tested with SHA-256. You are
relying on the fact that there is exactly one object that matches seven
f's as its prefix. That may be true for SHA-1, but if so it's mostly
luck. Seven hex digits is only 28 bits, which is ~260M.
For 154M objects, we'd expect an average of 0.57 objects per 7-digit
prefix. So I wouldn't be at all surprised if there are two of them for
SHA-256. I'm also not sure what it's testing that the $final one isn't.

-Peff
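To check how many objects actually share a given short prefix (for
either hash), git rev-parse's --disambiguate option lists every
matching object, so something like this (a sketch) would flag the
ambiguity Peff is worried about:

  # prints one full object name per object matching the prefix;
  # anything other than "1" means "git cat-file blob fffffff" is ambiguous
  git rev-parse --disambiguate=fffffff | wc -l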
On Tue, Dec 01, 2020 at 06:23:28AM -0500, Jeff King wrote:

> I'm not sure if EXPENSIVE is the right ballpark, or if we'd want a
> VERY_EXPENSIVE. On my machine, the whole test suite for v2.29.0 takes 64
> seconds to run, and setting GIT_TEST_LONG=1 bumps that to 103s. It got a
> bit worse since then, as t7900 adds an EXPENSIVE test that takes ~200s
> (it's not strictly additive, since we can work in parallel on other
> tests for the first bit, but still, yuck).
>
> So we're looking at 2-3x to run the expensive tests now. This new one
> would be 20x or more. I'm not sure if anybody would care or not (i.e.,
> whether anyone actually runs the whole suite with this flag). I thought
> we did for some CI job, but it looks like it's just the one-off in
> t5608.

I had written something similar yesterday before mutt crashed and I
decided to stop work for the day.

I have a sense that probably very few people actually run GIT_TEST_LONG
regularly, and that that group may vanish entirely if we added a test
which increased the runtime of the suite by 20x in this mode.

I have mixed feelings about VERY_EXPENSIVE. On one hand, having this
test checked in so that we can quickly refer back to it in the case of
a regression is useful. On the other hand, what is it worth to have
this in-tree if nobody ever runs it? I'm speculating about whether or
not people would run this, of course.

My hunch is that anybody who is interested enough to fix regressions in
this area would be able to refer back to the list archive to dig up
this thread and recover the script.

I don't feel strongly, really, but just noting some light objections to
checking this test into the suite.

Thanks,
Taylor
On Tue, Dec 01, 2020 at 01:27:26PM -0500, Taylor Blau wrote:

> I have a sense that probably very few people actually run GIT_TEST_LONG
> regularly, and that that group may vanish entirely if we added a test
> which increased the runtime of the suite by 20x in this mode.

Yeah, and if that's so, then I'm OK with calling it EXPENSIVE and
moving on with our lives. And I guess merging this is one way to find
out, if anybody screams. The stakes are low enough that I don't mind
doing that.

> I have mixed feelings about VERY_EXPENSIVE. On one hand, having this
> test checked in so that we can quickly refer back to it in the case of
> a regression is useful. On the other hand, what is it worth to have
> this in-tree if nobody ever runs it? I'm speculating about whether or
> not people would run this, of course.
>
> My hunch is that anybody who is interested enough to fix regressions in
> this area would be able to refer back to the list archive to dig up
> this thread and recover the script.
>
> I don't feel strongly, really, but just noting some light objections to
> checking this test into the suite.

Yeah, that about matches my feelings. But if Thomas wants to wrap it up
as a patch, I don't mind seeing what happens (and I think with the
suggestions I gave earlier, it would be in good shape).

-Peff