diff mbox series

[v2,01/10] bundle: optionally skip reachability walk

Message ID b3828725bc8f8887b9b4777a0e3d84224a427f31.1674487310.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Bundle URIs V: creationToken heuristic for incremental fetches | expand

Commit Message

Derrick Stolee Jan. 23, 2023, 3:21 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

When unbundling a bundle, the verify_bundle() method checks two things
with regards to the prerequisite commits:

 1. Those commits are in the object store, and
 2. Those commits are reachable from refs.

During testing of the bundle URI feature, where multiple bundles are
unbundled in the same process, the ref store did not appear to be
refreshing with the new refs/bundles/* references added within that
process. This caused the second half -- the reachability walk -- report
that some commits were not present, despite actually being present.

One way to attempt to fix this would be to create a way to force-refresh
the ref state. That would correct this for these cases where the
refs/bundles/* references have been updated. However, this still is an
expensive operation in a repository with many references.

Instead, optionally allow callers to skip this portion by instead just
checking for presence within the object store. Use this when unbundling
in bundle-uri.c.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle-uri.c | 8 +++++++-
 bundle.c     | 3 ++-
 bundle.h     | 1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

Comments

Junio C Hamano Jan. 23, 2023, 6:03 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> When unbundling a bundle, the verify_bundle() method checks two things
> with regards to the prerequisite commits:
>
>  1. Those commits are in the object store, and
>  2. Those commits are reachable from refs.
>
> During testing of the bundle URI feature, where multiple bundles are
> unbundled in the same process, the ref store did not appear to be
> refreshing with the new refs/bundles/* references added within that
> process. This caused the second half -- the reachability walk -- report
> that some commits were not present, despite actually being present.
>
> One way to attempt to fix this would be to create a way to force-refresh
> the ref state. That would correct this for these cases where the
> refs/bundles/* references have been updated. However, this still is an
> expensive operation in a repository with many references.
>
> Instead, optionally allow callers to skip this portion by instead just
> checking for presence within the object store. Use this when unbundling
> in bundle-uri.c.

This step is new in this round.

I am assuming that this approach is to avoid repeated "now we
unbundled one, let's spend enormous cycles to update the in-core
view of the ref store before processing the next bundle"---instead
we unbundle all, assuming the prerequisites for each and every
bundle are satisfied.

I am OK as long as we check the assumption holds true at the end;
this looks like a good optimization.
Derrick Stolee Jan. 23, 2023, 6:24 p.m. UTC | #2
On 1/23/2023 1:03 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> When unbundling a bundle, the verify_bundle() method checks two things
>> with regards to the prerequisite commits:
>>
>>  1. Those commits are in the object store, and
>>  2. Those commits are reachable from refs.
>>
>> During testing of the bundle URI feature, where multiple bundles are
>> unbundled in the same process, the ref store did not appear to be
>> refreshing with the new refs/bundles/* references added within that
>> process. This caused the second half -- the reachability walk -- report
>> that some commits were not present, despite actually being present.
>>
>> One way to attempt to fix this would be to create a way to force-refresh
>> the ref state. That would correct this for these cases where the
>> refs/bundles/* references have been updated. However, this still is an
>> expensive operation in a repository with many references.
>>
>> Instead, optionally allow callers to skip this portion by instead just
>> checking for presence within the object store. Use this when unbundling
>> in bundle-uri.c.
> 
> This step is new in this round.
> 
> I am assuming that this approach is to avoid repeated "now we
> unbundled one, let's spend enormous cycles to update the in-core
> view of the ref store before processing the next bundle"---instead
> we unbundle all, assuming the prerequisites for each and every
> bundle are satisfied.

We are specifically removing the requirement that the objects are
reachable from refs, we still check that the objects are in the
object store. Thus, we can only be in a bad state afterwards if
the required objects for a bundle were in the object store,
previously unreachable, and one of these two things happened:

1. Some objects reachable from those required commits were already
   missing in the repository (so the repo's object store was broken
   but only for some unreachable objects).

2. A GC pruned those objects between verifying the bundle and
   writing the refs/bundles/* refs after unbundling.

I think we should trust the repository to not be in the first state,
but the race condition in the second option will create a state
where we have missing objects that are now reachable from refs.
 
> I am OK as long as we check the assumption holds true at the end;
> this looks like a good optimization.
 
So are you recommending that we verify all objects reachable from
the new refs/bundles/* are present after unbundling? That would
prevent the possibility of a GC race, but at some significant run-
time performance costs. Do we do the same as we unpack from a
fetch? Do we apply the same scrutiny to the objects during
unbundling as we do from a fetched pack? They both use 'git
index-pack --stdin --fix-thin', so my guess is that we do the same
amount of checks for both cases.

Since this is only one of multiple ways to add objects that depend
on possibly-unreachable objects, my preferred way to avoid the
GC race is by using care in the GC process itself (such as the new
--expire-to option to recover from these races).

Thanks,
-Stolee
Junio C Hamano Jan. 23, 2023, 8:13 p.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> writes:

> We are specifically removing the requirement that the objects are
> reachable from refs, we still check that the objects are in the
> object store. Thus, we can only be in a bad state afterwards if
> the required objects for a bundle were in the object store,
> previously unreachable, and one of these two things happened:
>
> 1. Some objects reachable from those required commits were already
>    missing in the repository (so the repo's object store was broken
>    but only for some unreachable objects).

A repository having some unreachable objects floating in the object
store is not corrupt.  As long as all the objects reachable from refs
are connected, that is a perfectly sane state.

But allowing unbundling with the sanity check loosened WILL corrupt
it, at the moment you point some objects from the bundle with refs.

> I think we should trust the repository to not be in the first state,

So, I think this line of thought is simply mistaken.

>> I am OK as long as we check the assumption holds true at the end;
>> this looks like a good optimization.
>  
> So are you recommending that we verify all objects reachable from
> the new refs/bundles/* are present after unbundling?

Making sure that prerequisites are connected will reduce the span of
the DAG we would need to verify.  After unbundling all bundles, but
before updating the refs to point at the tips in the bundles, if we
can make sure that these prerequisite objects named in the bundles
are reachable from the tips recorded in the bundles, while stopping
the traversal at the tips of original refs (remember: we have only
updated objects in the object store, but haven't updated the refs from
the bundles), that would allow us to make sure that the updates to
refs proposed by the bundles will not corrupt the repository.
Junio C Hamano Jan. 23, 2023, 9:08 p.m. UTC | #4
Derrick Stolee <derrickstolee@github.com> writes:

> Do we do the same as we unpack from a fetch?

We should.

We only consider tips of refs and objects that are reachable from
them to be "present", and there may be random objects that float in
the object store without any guarantee that no the objects that
ought to be reachable from them are missing from the object store,
but they do not play part in the common ancestor discovery.

And then after we unpack, we ensure that the proposed updates to
refs made by the fetch operation will not corrupt the repository.
This can be guaranteed by making sure that objects to be placed at
the updated tip can all reach some existing tips of refs.  We trust
refs before the operation (in this case, 'git fetch') begins.  We
ensure that refs after the operation can be trusted before updating
them.  Where "trust" here means "objects pointed at by them are
connected.
Junio C Hamano Jan. 23, 2023, 10:30 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> A repository having some unreachable objects floating in the object
> store is not corrupt.  As long as all the objects reachable from refs
> are connected, that is a perfectly sane state.
>
> But allowing unbundling with the sanity check loosened WILL corrupt
> it, at the moment you point some objects from the bundle with refs.

While all of the above is true, I think existing check done by
bundle.c::verify_bundle() is stricter than necessary.  We make sure
that the prerequiste objects exist and are reachable from the refs.
But for the purpose of ensuring the health of the repo after the
operation, it is also OK if the prerequisite objects exist and they
pass connected.c::check_connected() test to reach existing refs.
verify_bundle() that is used in unbundle() does not allow it.

In a simplest case, with a single ref and a single strand of pearls
history, with a few cruft history that are connected to the main
history but are not anchored by any ref (e.g. the tip of 'main' was
rewound recently and we haven't done any GC).

                             7---8 (cruft)
                            /
   0---1---2---3---4---5---6 refs/heads/main

And they have another repository created back when '5' was the
latest and greatest, which built three commits on top of it.

   0---1---2---3---4---5---A---B---C

They create a bundle of 5..C to update us to C.  In the meantime, we
have created three commits ourselves (6, 7, and 8) but threw two
away.

If a bundle requires commit '5', because it is reachable from an
existing ref (which points at the 'main' branch), we can unbundle it
and point a ref at the tip of the history contained within the
bundle safely.  Commit '5' being pointed by a ref means the commit,
its ancestors, and all trees and blobs referenced are available to
the repository (some may be fetched lazily from promisor), and
unless the producer lied and placed unconnected data in the bundle,
unpacking a bundle on top of '5' should give us all the objects that
are needed to complete the new tip proposed by the bundle (e.g. 'C').

                             7---8 (cruft)
                            /
   0---1---2---3---4---5---6 refs/heads/main
                        \
                         A---B---C refs/bundle-1/main

And existing check that I called "sticter than necessary" easily
makes sure that it is safe to point 'C' with our ref.

Imagine another party cloned us back when 'main' was pointing at '8'
(which we since then have rewound), and built a few commits on it.

                                   X---Y refs/bundle-2/main
                                  /
                             7---8 (cruft)
                            /
   0---1---2---3---4---5---6 refs/heads/main

As they did not know we'd rewind and discard 7 and 8, they would
have created their bundle to cover 8..Y.  The current test will fail
because '8' that is prerequisite is not reachable from any ref on
our side.  But that is too pessimistic.

As long as we haven't garbage collected so that all objects
reachable from '7' and '8' are available to this repository,
however, we should be able to unbundle the bundle that has '8' as
its prerequisite.  For that, we only need that '8' passes the
check_connected() check, which essentially means "we shouldn't find
any missing link while traversing history from '8' that stops at any
existing refs".

Again this relies on the fact that unbundling code makes sure that
incoming data is fully connected (i.e. bundle producer did not lie
about the prerequisite).
Derrick Stolee Jan. 24, 2023, 12:27 p.m. UTC | #6
On 1/23/2023 5:30 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
>> A repository having some unreachable objects floating in the object
>> store is not corrupt.  As long as all the objects reachable from refs
>> are connected, that is a perfectly sane state.
>>
>> But allowing unbundling with the sanity check loosened WILL corrupt
>> it, at the moment you point some objects from the bundle with refs.
> 
> While all of the above is true, I think existing check done by
> bundle.c::verify_bundle() is stricter than necessary.  We make sure
> that the prerequiste objects exist and are reachable from the refs.
> But for the purpose of ensuring the health of the repo after the
> operation, it is also OK if the prerequisite objects exist and they
> pass connected.c::check_connected() test to reach existing refs.
> verify_bundle() that is used in unbundle() does not allow it.

Thank you for all of the detailed explanation, here and in other
messages.

I'll focus on this area today and see what I can learn and how I
can approach this problem in a different way. The current options
that I see are:

 1. Leave verify_bundle() as-is and figure out how to refresh the
    refs. (This would remain a stricter check than necessary.)

 2. Find out how to modify verify_bundle() so it can do the more
    relaxed connectivity check.

 3. Take the connectivity check that fetch uses before updating
    refs and add that check before updating refs in the bundle URI
    code.

There could also be a combination of (2) and (3), or others I have
not considered until I go poking around in the code.

I'll let you know what I find.

Thanks,
-Stolee
Junio C Hamano Jan. 24, 2023, 3:22 p.m. UTC | #7
Derrick Stolee <derrickstolee@github.com> writes:

> I'll focus on this area today and see what I can learn and how I
> can approach this problem in a different way. The current options
> that I see are:
>
>  1. Leave verify_bundle() as-is and figure out how to refresh the
>     refs. (This would remain a stricter check than necessary.)

Even if we switch to "assume everything is OK, remember a few key
facts (like prerequisites and tips) about each bundle as we unpack
them, and sanity check the results at the end" approach, doesn't
that last step require us to be able to see the final state of the
refs?  If so, wouldn't we need to figure out how to refresh the refs
no matter what?

>  2. Find out how to modify verify_bundle() so it can do the more
>     relaxed connectivity check.

I am not sure what kind of relaxing you have in mind, but as long as
we can guarantee the connectedness of the end result

>  3. Take the connectivity check that fetch uses before updating
>     refs and add that check before updating refs in the bundle URI
>     code.

This is optional at much lower priority, isn't it?  In the second
example in the message you are responding to, I do not think it is
too bad to reject the bundle based on '8' that has been rewound away
(in other words, a bundle publisher ought to be basing their bundles
on well publicized and commonly available commits).  Only when we
try to be overly helpful to such a use case, it becomes necessary to
loosen the rule from "all prerequisites must be reachable from
existing refs" to "or prerequisites that are not reachable from any
refs are also OK if they pass check_connected()".

The current check to require that prerequisites are reachable from
refs does not have to check trees and blobs, because any commit that
is reachable from an existing ref is complete[*] by definition.

    Let's define a term: a commit is "complete" iff it is not
    missing any objects that it (recursively) references to.

The check done by check_connected() is more expensive because it has
to prove that a commit, which is found in the object store and may
or may not be reachable from any refs, is complete.  The tranversal
still can take advantage of the fact that commits _reachable_ from
refs are guaranteed to be complete, but until the traversal reaches
a commit that is reachable from refs (e.g. when inspecting commits
'8' and then '7' until it reaches '6', in the second example in the
message you are responding to) we need to look at trees and blobs.

> There could also be a combination of (2) and (3), or others I have
> not considered until I go poking around in the code.
>
> I'll let you know what I find.

Thanks.  Unlike areas that allow glitches as long as workarounds are
available (e.g. UI), the object store + refs layer is where it is
absolutely required to be correct.  I am happy to see capable minds
are on it.
diff mbox series

Patch

diff --git a/bundle-uri.c b/bundle-uri.c
index 36268dda172..2f079f713cf 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -322,9 +322,15 @@  static int unbundle_from_file(struct repository *r, const char *file)
 	 * Skip the reachability walk here, since we will be adding
 	 * a reachable ref pointing to the new tips, which will reach
 	 * the prerequisite commits.
+	 *
+	 * Since multiple iterations of unbundle_from_file() can create
+	 * new commits in the object store that are not reachable from
+	 * the current cached state of the ref store, skip the reachability
+	 * walk and move forward as long as the objects are present in the
+	 * object store.
 	 */
 	if ((result = unbundle(r, &header, bundle_fd, NULL,
-			       VERIFY_BUNDLE_QUIET)))
+			       VERIFY_BUNDLE_QUIET | VERIFY_BUNDLE_SKIP_REACHABLE)))
 		return 1;
 
 	/*
diff --git a/bundle.c b/bundle.c
index 4ef7256aa11..b51974f0806 100644
--- a/bundle.c
+++ b/bundle.c
@@ -223,7 +223,8 @@  int verify_bundle(struct repository *r,
 			error("%s", message);
 		error("%s %s", oid_to_hex(oid), name);
 	}
-	if (revs.pending.nr != p->nr)
+	if (revs.pending.nr != p->nr ||
+	    (flags & VERIFY_BUNDLE_SKIP_REACHABLE))
 		goto cleanup;
 	req_nr = revs.pending.nr;
 	setup_revisions(2, argv, &revs, NULL);
diff --git a/bundle.h b/bundle.h
index 9f2bd733a6a..24c30e5f74a 100644
--- a/bundle.h
+++ b/bundle.h
@@ -34,6 +34,7 @@  int create_bundle(struct repository *r, const char *path,
 enum verify_bundle_flags {
 	VERIFY_BUNDLE_VERBOSE = (1 << 0),
 	VERIFY_BUNDLE_QUIET = (1 << 1),
+	VERIFY_BUNDLE_SKIP_REACHABLE = (1 << 2),
 };
 
 int verify_bundle(struct repository *r, struct bundle_header *header,