diff mbox series

[v3,1/9] t5326: demonstrate bitmap corruption after permutation

Message ID babce7d29a85df0da54cb651433111bc33097a4e.1641320129.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: prevent bitmap corruption when permuting pack order | expand

Commit Message

Taylor Blau Jan. 4, 2022, 6:15 p.m. UTC
This patch demonstrates a cause of bitmap corruption that can occur when
the contents of the multi-pack index does not change, but the underlying
object order does.

In this example, we have a MIDX containing two packs, each with a
distinct set of objects (pack A corresponds to the tree, blob, and
commit from the first patch, and pack B corresponds to the second
patch).

First, a MIDX is written where the 'A' pack is preferred. As expected,
the bitmaps generated there are in-tact. But then, we generate an
identical MIDX with a different object order: this time preferring pack
'B'.

Due to a bug which will be explained and fixed in the following commit,
the MIDX is updated, but the .rev file is not, causing the .bitmap file
to be read incorrectly. Specifically, the .bitmap file will contain
correct data, but the auxiliary object order in the .rev file is stale,
causing readers to get confused by reading the new bitmaps using the old
object order.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t5326-multi-pack-bitmaps.sh | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

Comments

Jonathan Tan Jan. 20, 2022, 5:55 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:
> This patch demonstrates a cause of bitmap corruption that can occur when
> the contents of the multi-pack index does not change, but the underlying
> object order does.
> 
> In this example, we have a MIDX containing two packs, each with a
> distinct set of objects (pack A corresponds to the tree, blob, and
> commit from the first patch, and pack B corresponds to the second
> patch).
> 
> First, a MIDX is written where the 'A' pack is preferred. As expected,
> the bitmaps generated there are in-tact. But then, we generate an
> identical MIDX with a different object order: this time preferring pack
> 'B'.
> 
> Due to a bug which will be explained and fixed in the following commit,
> the MIDX is updated, but the .rev file is not, causing the .bitmap file
> to be read incorrectly. Specifically, the .bitmap file will contain
> correct data, but the auxiliary object order in the .rev file is stale,
> causing readers to get confused by reading the new bitmaps using the old
> object order.

Thanks - overall, this looks like a bug that needs to be fixed.

For the benefit of other reviewers, here's my summary of the problem:
the .midx, .rev, and .bitmap files are almost always generated together,
and it is possible for two different invocations of Git to generate the
same .midx but a different .rev and .bitmap. For example, when
generating a .midx+.rev+.bitmap for 2 disjoint packfiles, the 1st time
with one packfile as preferred and the 2nd time with the other packfile
as preferred. In .midx, packfiles are always ordered by lexicographical
order of their names, and the preferred status only matters when an
object is in multiple packfiles (which never happens in this case, since
the packfiles are disjoint). But the preferred status affects .rev and
.bitmap, because they use a concept called "pseudo-pack order" (see
pack-format.txt for more details) in which the preferred pack comes
first.

As an effort to ensure that Git reads coherent .midx, .rev, and .bitmap
files, both the .rev and .bitmap files are keyed on the checksum of the
.midx file. But the issue here is that a .rev and a .bitmap could both
refer to the same .midx checksum when the .rev and .bitmap files are not
coherent with respect to each other (e.g. when a Git process has written
the .rev, but not the .bitmap yet - but this would appear perfectly
ordinary to another concurrently running Git process, since the .midx
checksum in the .rev and .bitmap files match).

This problem is exacerbated by the fact that the .rev has its .midx
checksum in its filename, whereas the .bitmap has its .midx checksum in
its file contents. When generating .midx+.rev+.bitmap, it would write
the .bitmap but not the .rev, since a .rev of the same filename already
exists.

The solution is to embed the .rev in the .midx. This means that the
checksum stored in .bitmap takes into account the contents of what would
have been in .rev, solving the coherency issue. (There are other
solutions like storing the name of the preferred pack in .midx, but I
think that putting the contents of .rev in the .midx is best.)
Taylor Blau Jan. 20, 2022, 10:11 p.m. UTC | #2
On Thu, Jan 20, 2022 at 09:55:41AM -0800, Jonathan Tan wrote:
> As an effort to ensure that Git reads coherent .midx, .rev, and .bitmap
> files, both the .rev and .bitmap files are keyed on the checksum of the
> .midx file. But the issue here is that a .rev and a .bitmap could both
> refer to the same .midx checksum when the .rev and .bitmap files are not
> coherent with respect to each other (e.g. when a Git process has written
> the .rev, but not the .bitmap yet - but this would appear perfectly
> ordinary to another concurrently running Git process, since the .midx
> checksum in the .rev and .bitmap files match).

Kind of, and it's possible that we're saying the same thing here using
different words.

But seeing an out-of-sync .bitmap and .rev is really a symptom of the
underlying problem, which is that the MIDX checksum could (prior to
these patches) be unchanged even when the object order it represents
_does_ change (e.g., in the case of swapping the preferred pack around
like the test here demonstrates).

> This problem is exacerbated by the fact that the .rev has its .midx
> checksum in its filename, whereas the .bitmap has its .midx checksum in
> its file contents. When generating .midx+.rev+.bitmap, it would write
> the .bitmap but not the .rev, since a .rev of the same filename already
> exists.

This isn't quite right: both have the MIDX's checksum in their filename.
For example after running `git repack --write-midx --write-bitmap-index`
on a random repository, I get these MIDX-related files:

    $ find .git/objects/pack -type f -name 'multi-pack-index*'
    .git/objects/pack/multi-pack-index-fd71600b4ceb4caf23a538c7829b9284f2b66d73.rev
    .git/objects/pack/multi-pack-index-fd71600b4ceb4caf23a538c7829b9284f2b66d73.bitmap
    .git/objects/pack/multi-pack-index

Before these patches, it was possible for the MIDX's object order to
change but for its checksum to remain the same. The problem here is that
we rename(2)'d the .bitmap into place, but only link(2)'d the .rev into
place.

So one of the two was updated, and the other was left behind. That does
make them incoherent with respect to each other; but I find it more
useful to think of it as the .rev being out-of-sync with the MIDX.

> The solution is to embed the .rev in the .midx. This means that the
> checksum stored in .bitmap takes into account the contents of what would
> have been in .rev, solving the coherency issue. (There are other
> solutions like storing the name of the preferred pack in .midx, but I
> think that putting the contents of .rev in the .midx is best.)

Right. The problem being that it's possible to change the MIDX's
object order without changing its checksum. Since the .rev file's
contents encodes the pseudo-pack order, embedding it into the MIDX is
sufficient to guarantee that the MIDX's checksum changes whenever the
object order does.

Apologies if that all exactly matched up with your understanding, and I
was just telling you stuff that you already knew. But I figure that this
bug is subtle enough that a little bit of hair-splitting doesn't hurt
;).

Thanks,
Taylor
Junio C Hamano Jan. 20, 2022, 10:41 p.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:

> Right. The problem being that it's possible to change the MIDX's
> object order without changing its checksum. Since the .rev file's
> contents encodes the pseudo-pack order, embedding it into the MIDX is
> sufficient to guarantee that the MIDX's checksum changes whenever the
> object order does.

To slightly rephrase,

    The problem being that before the change to embed reverse table in
    MIDX, it was possible to change the object order without changing
    the checksum, but with that change alone, if a rebuilt MIDX has the
    same checksum, it is guaranteed that it records the same objects in
    the same order.

Hence, shuffling between rename(2) and link(2), which was the code
change that is only necessary to force updating the existing MIDX
with newly recreated one because two MIDX with the same checksum
could record objects in different orders, is no longer necessary.

Am I on the same page as you two?
Taylor Blau Jan. 20, 2022, 10:46 p.m. UTC | #4
On Thu, Jan 20, 2022 at 02:41:47PM -0800, Junio C Hamano wrote:
> Am I on the same page as you two?

Yes, exactly.

Thanks,
Taylor
Jonathan Tan Jan. 24, 2022, 5:40 p.m. UTC | #5
Taylor Blau <me@ttaylorr.com> writes:
> On Thu, Jan 20, 2022 at 09:55:41AM -0800, Jonathan Tan wrote:
> > As an effort to ensure that Git reads coherent .midx, .rev, and .bitmap
> > files, both the .rev and .bitmap files are keyed on the checksum of the
> > .midx file. But the issue here is that a .rev and a .bitmap could both
> > refer to the same .midx checksum when the .rev and .bitmap files are not
> > coherent with respect to each other (e.g. when a Git process has written
> > the .rev, but not the .bitmap yet - but this would appear perfectly
> > ordinary to another concurrently running Git process, since the .midx
> > checksum in the .rev and .bitmap files match).
> 
> Kind of, and it's possible that we're saying the same thing here using
> different words.
> 
> But seeing an out-of-sync .bitmap and .rev is really a symptom of the
> underlying problem, which is that the MIDX checksum could (prior to
> these patches) be unchanged even when the object order it represents
> _does_ change (e.g., in the case of swapping the preferred pack around
> like the test here demonstrates).

Yeah - I think the way you're describing it is that the concept of
preferred packfile is inherent to the .midx, whereas I just think of it
as an input that helps the algorithm choose between the many .midx
possible given a set of packfiles. Maybe these are just two different
ways of looking at the same thing.

> > This problem is exacerbated by the fact that the .rev has its .midx
> > checksum in its filename, whereas the .bitmap has its .midx checksum in
> > its file contents. When generating .midx+.rev+.bitmap, it would write
> > the .bitmap but not the .rev, since a .rev of the same filename already
> > exists.
> 
> This isn't quite right: both have the MIDX's checksum in their filename.
> For example after running `git repack --write-midx --write-bitmap-index`
> on a random repository, I get these MIDX-related files:
> 
>     $ find .git/objects/pack -type f -name 'multi-pack-index*'
>     .git/objects/pack/multi-pack-index-fd71600b4ceb4caf23a538c7829b9284f2b66d73.rev
>     .git/objects/pack/multi-pack-index-fd71600b4ceb4caf23a538c7829b9284f2b66d73.bitmap
>     .git/objects/pack/multi-pack-index

Ah, thanks for the clarification. I saw that bitmaps were overridden
correctly and saw BITMAP_OPT_HASH_CACHE in the bitmap-format
documentation, and made an incorrect assumption.

> Apologies if that all exactly matched up with your understanding, and I
> was just telling you stuff that you already knew. But I figure that this
> bug is subtle enough that a little bit of hair-splitting doesn't hurt
> ;).

Thanks for all the clarification, especially about the bitmap.
diff mbox series

Patch

diff --git a/t/t5326-multi-pack-bitmaps.sh b/t/t5326-multi-pack-bitmaps.sh
index e187f90f29..0ca2868b0b 100755
--- a/t/t5326-multi-pack-bitmaps.sh
+++ b/t/t5326-multi-pack-bitmaps.sh
@@ -395,4 +395,35 @@  test_expect_success 'hash-cache values are propagated from pack bitmaps' '
 	)
 '
 
+test_expect_failure 'changing the preferred pack does not corrupt bitmaps' '
+	rm -fr repo &&
+	git init repo &&
+	test_when_finished "rm -fr repo" &&
+	(
+		cd repo &&
+
+		test_commit A &&
+		test_commit B &&
+
+		git rev-list --objects --no-object-names HEAD^ >A.objects &&
+		git rev-list --objects --no-object-names HEAD^.. >B.objects &&
+
+		A=$(git pack-objects $objdir/pack/pack <A.objects) &&
+		B=$(git pack-objects $objdir/pack/pack <B.objects) &&
+
+		cat >indexes <<-EOF &&
+		pack-$A.idx
+		pack-$B.idx
+		EOF
+
+		git multi-pack-index write --bitmap --stdin-packs \
+			--preferred-pack=pack-$A.pack <indexes &&
+		git rev-list --test-bitmap A &&
+
+		git multi-pack-index write --bitmap --stdin-packs \
+			--preferred-pack=pack-$B.pack <indexes &&
+		git rev-list --test-bitmap A
+	)
+'
+
 test_done