| Message ID | 20240919234741.1317946-3-calvinwan@google.com (mailing list archive) |
|---|---|
| State | New |
| Series | revision: fix reachable objects being gc'ed in no blob clone repo |
Calvin Wan <calvinwan@google.com> writes:

> In a partial repository, creating a local commit and then fetching
> causes the following state to occur:
>
>     commit    tree    blob
>      C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>      |
>      C2 ---- T2 -- B2 (created locally, in non-promisor pack)
>      |
>      C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> During garbage collection, parents of promisor objects are marked as
> UNINTERESTING and are subsequently garbage collected. In this case, C2
> would be deleted and attempts to access that commit would result in
> "bad object" errors (originally reported here[1]).

Understandable.

> This is not a bug in gc since it should be the case that parents of
> promisor objects are also promisor objects (fsck assumes this as
> well).

I am not sure where this "not a bug" claim comes from. Here, the definition of "promisor objects" seems to be anything that is reachable from objects in promisor packs, but isn't the source of the bug that collects C2 exactly that "gc" uses such a definition for the discardable objects that can be refetched from promisor remotes?

> When promisor objects are fetched, the state of the repository
> should ensure that the above holds true. Therefore, do not declare
> local commits as "have" in partial repositories so they can be fetched
> into a promisor pack.

Could you clarify, in an updated version of the proposed log message, what this means in the context of the example you gave above?

We pretend that C2 and anything it reaches do not exist locally, to force them to be fetched from the remote? We'd end up having two copies of C2 (one that we created locally and had before starting this fetch, the other we fetched when we fetched C3 from them)? This sounds awfully inefficient, both network bandwidth- and local disk-wise.

I was hoping to see that the issue can be fixed on the "gc" side, regardless of how the objects enter our repository, but perhaps I am missing something. Isn't it just a matter of collecting C1 and C3 but not C2? Or, to put it another way: first create a list of objects to be packed (regardless of whether they are in promisor packs), then remove the objects that are in promisor packs from that list, and pack the objects still remaining in it?
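For illustration only, here is a rough shell sketch of the gc-side approach suggested above (list the reachable objects, subtract what is already in promisor packs, pack the rest). It is not part of the patch; the pack paths, the use of show-index/comm, and the lack of error handling are all simplifying assumptions.

    # Sketch, not a real fix: keep objects like C2/T2/B2 by packing every
    # reachable object that is not already in a promisor pack.

    # 1. All reachable object ids.
    git rev-list --objects --all --reflog | cut -d' ' -f1 | sort -u >reachable

    # 2. Object ids that live in promisor packs (packs with a .promisor file).
    for p in .git/objects/pack/pack-*.promisor
    do
        git show-index <"${p%.promisor}.idx" | cut -d' ' -f2
    done | sort -u >promised

    # 3. Pack the difference into an ordinary (non-promisor) pack so that a
    #    later prune cannot drop it.
    comm -23 reachable promised | git pack-objects .git/objects/pack/pack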
Junio C Hamano <gitster@pobox.com> writes:

> Calvin Wan <calvinwan@google.com> writes:
>
>> In a partial repository, creating a local commit and then fetching
>> causes the following state to occur:
>>
>>     commit    tree    blob
>>      C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>>      |
>>      C2 ---- T2 -- B2 (created locally, in non-promisor pack)
>>      |
>>      C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>>
>> During garbage collection, parents of promisor objects are marked as
>> UNINTERESTING and are subsequently garbage collected. In this case, C2
>> would be deleted and attempts to access that commit would result in
>> "bad object" errors (originally reported here[1]).
>
> Understandable.
> ...
>> When promisor objects are fetched, the state of the repository
>> should ensure that the above holds true. Therefore, do not declare
>> local commits as "have" in partial repositories so they can be fetched
>> into a promisor pack.
> ...
> We pretend that C2 and anything it reaches do not exist locally, to
> force them to be fetched from the remote? We'd end up having two
> copies of C2 (one that we created locally and had before starting
> this fetch, the other we fetched when we fetched C3 from them)?
> This sounds awfully inefficient, both network bandwidth- and local
> disk-wise.

One related thing that worries me is what happens after we make a large push, either directly to the remote or because what we pushed elsewhere was fetched by the remote, and then we need to fetch what they created on top. The history may look like this:

1. We fetch from the promisor remote; C is in promisor packs.

       ---C

2. We build on top; the 'x' are local.

       ---C---x---x---x---x---x---x---x---x

3. The 'x' we created above end up on the promisor side, and others build a few commits on top.

       ---C---x---x---x---x---x---x---x---x---o---o

4. Now we try to fetch from them, i.e. a repository with the history illustrated in 2. fetches the history illustrated in 3.

Because this change forbids the fetching side from telling the other side that it has 'x', the first "have" we are allowed to send is 'C', even though we only need to fetch the two commits 'o' from them. And 'x' could be numerous in distributed development workflows, as these "local" commits do not have to be ones you created locally yourself. You may have fetched and merged these commits from elsewhere, where the active development is happening. The only criterion that qualifies a commit as "local" (and causes us to omit it from the "have" communication) is that we did not obtain it directly from our promisor remote, so you may end up refetching a large portion of the history you already have from the promisor remote, just to have it in a promisor pack.

If we cannot change the definition of "is-promisor-object" for the purpose of "gc" (it is probably because I am missing something that you, JTan, and HanYang have already thought about that I do not see the reason why), I wonder if there is a way to avoid the refetching but still "move" these 'x' objects purely locally into a promisor pack. That is, the current "git fetch" without this patch would only fetch the two 'o' commits (and their associated trees and blobs) into a new promisor pack, but because we know that the commits 'x' have now become re-fetchable from the promisor, we can make them promisor objects by repacking them locally and marking the resulting pack as a promisor pack, without incurring the cost to the remote of preparing and sending 'x' to us again. That would give us the same protection the patch under discussion offers, wouldn't it?

I still think fixing "gc" would give us much more intuitive behaviour, though.

Thanks.
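A minimal shell sketch of that local "promotion" idea, assuming we somehow know the range of history (written here as OLD..NEW, a stand-in for the 'x' commits) that the promisor remote now has; the .promisor sentinel next to a pack is what marks it as a promisor pack, the rest is illustrative:

    # Sketch only: repack the commits the remote is now known to have into a
    # local pack, instead of refetching them. OLD..NEW stands in for the 'x'
    # range; error handling is omitted.
    pack=$(echo OLD..NEW | git pack-objects --revs .git/objects/pack/pack)

    # The presence of pack-<hash>.promisor next to pack-<hash>.pack is what
    # makes git treat the pack's contents as promisor objects.
    echo "promoted locally after fetch" >".git/objects/pack/pack-$pack.promisor"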
On Sun, Sep 22, 2024 at 2:53 PM Junio C Hamano <gitster@pobox.com> wrote:

> I was hoping to see that the issue can be fixed on the "gc" side,
> regardless of how the objects enter our repository, but perhaps I am
> missing something. Isn't it just a matter of collecting C1 and C3
> but not C2?

I tried to fix the issue on the "gc" side following JTan's suggestion, by packing local objects referenced by promisor objects into promisor packs. But it turns out the cost of "for each promisor object, parse it and decide whether the objects it references are local" is too great. In a test blob:none partial clone repo, the gc would take more than one hour on a 2019 MacBook, despite the repo only having 17071073 objects; normally it takes about 30 minutes.

> Or, to put it another way: first create a list of
> objects to be packed (regardless of whether they are in promisor
> packs), then remove the objects that are in promisor packs from
> that list, and pack the objects still remaining in it?

This would work, though the remaining objects in the list would be suboptimally packed due to the delta heuristic, because we would feed object ids directly into git-pack-objects instead of using rev-list. But that's how we pack promisor objects anyway.

In $JOB, we modified git-repack to pack everything into a giant promisor pack if the repo is partially cloned. This basically does the same thing as you suggested, but without the cost of constructing the object list and removing the objects in the promisor packs.

Thanks.
韩仰 <hanyang.tony@bytedance.com> writes:

> In $JOB, we modified git-repack to pack everything into a giant promisor
> pack if the repo is partially cloned.

I would imagine that would give you pretty much the same results as the posted patch, without incurring the cost of transferring the same objects from the promisor remote.

> This basically does the same thing as
> you suggested, but without the cost of constructing the object list and
> removing the objects in the promisor packs.

Yup, repacking, instead of creating a new pack only to hold the objects in the gap, would be much simpler.
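Roughly, and only as an illustration with stock commands (the change described above modifies git-repack itself, and these paths and flags are just a sketch), the "one giant promisor pack" approach amounts to something like:

    # Sketch only: put every reachable/reflog/index object into a single pack
    # and mark that pack as a promisor pack.
    pack=$(git pack-objects --all --reflog --indexed-objects \
        .git/objects/pack/pack </dev/null)
    touch ".git/objects/pack/pack-$pack.promisor"

    # Loose objects now covered by the pack can be dropped; superseded packs
    # would still have to be removed separately (as `git repack -a -d` does).
    git prune-packed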
韩仰 <hanyang.tony@bytedance.com> writes:

> On Sun, Sep 22, 2024 at 2:53 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> > I was hoping to see that the issue can be fixed on the "gc" side,
> > regardless of how the objects enter our repository, but perhaps I am
> > missing something. Isn't it just a matter of collecting C1 and C3
> > but not C2? Or, to put it another way: first create a list of
> > objects to be packed (regardless of whether they are in promisor
> > packs), then remove the objects that are in promisor packs from
> > that list, and pack the objects still remaining in it?
>
> I tried to fix the issue on the "gc" side following JTan's suggestion,
> by packing local objects referenced by promisor objects into promisor
> packs. But it turns out the cost of "for each promisor object, parse it
> and decide whether the objects it references are local" is too great.
> In a test blob:none partial clone repo, the gc would take more than one
> hour on a 2019 MacBook, despite the repo only having 17071073 objects;
> normally it takes about 30 minutes.

I found that running `time git submodule foreach git <create promisor pack set>` on Android takes 25 minutes on my machine. Granted, this is single threaded, but it is still quite an expensive operation to be doing on every recursive fetch. If this operation is so expensive, then unless we can figure out some method that doesn't involve creating a set of promisor pack objects, solving this during fetch is infeasible.
diff --git a/fetch-pack.c b/fetch-pack.c
index 58b4581ad8..c39b0f6ad4 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1297,12 +1297,23 @@ static void add_common(struct strbuf *req_buf, struct oidset *common)
 
 static int add_haves(struct fetch_negotiator *negotiator,
 		     struct strbuf *req_buf,
-		     int *haves_to_send)
+		     int *haves_to_send,
+		     int from_promisor)
 {
 	int haves_added = 0;
 	const struct object_id *oid;
 
 	while ((oid = negotiator->next(negotiator))) {
+		/*
+		 * In partial repos, do not declare local objects as "have"
+		 * so that they can be fetched into a promisor pack. Certain
+		 * operations mark parent commits of promisor objects as
+		 * UNINTERESTING and are subsequently garbage collected so
+		 * this ensures local commits are still available in promisor
+		 * packs after a fetch + gc.
+		 */
+		if (from_promisor && !is_in_promisor_pack(oid, 0))
+			continue;
 		packet_buf_write(req_buf, "have %s\n", oid_to_hex(oid));
 		if (++haves_added >= *haves_to_send)
 			break;
@@ -1405,7 +1416,7 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
 	/* Add all of the common commits we've found in previous rounds */
 	add_common(&req_buf, common);
 
-	haves_added = add_haves(negotiator, &req_buf, haves_to_send);
+	haves_added = add_haves(negotiator, &req_buf, haves_to_send, args->from_promisor);
 	*in_vain += haves_added;
 	trace2_data_intmax("negotiation_v2", the_repository, "haves_added", haves_added);
 	trace2_data_intmax("negotiation_v2", the_repository, "in_vain", *in_vain);
@@ -2178,7 +2189,7 @@ void negotiate_using_fetch(const struct oid_array *negotiation_tips,
 
 	packet_buf_write(&req_buf, "wait-for-done");
 
-	haves_added = add_haves(&negotiator, &req_buf, &haves_to_send);
+	haves_added = add_haves(&negotiator, &req_buf, &haves_to_send, 0);
 	in_vain += haves_added;
 	if (!haves_added || (seen_ack && in_vain >= MAX_IN_VAIN))
 		last_iteration = 1;
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 8415884754..cba9f7ed9b 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -693,6 +693,35 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
 	git -C client restore --recurse-submodules --source=HEAD^ :/
 '
 
+test_expect_success 'fetching from promisor remote fetches previously local commits' '
+	# Setup
+	git init full &&
+	git -C full config uploadpack.allowfilter 1 &&
+	git -C full config uploadpack.allowanysha1inwant 1 &&
+	touch full/foo &&
+	git -C full add foo &&
+	git -C full commit -m "commit 1" &&
+	git -C full checkout --detach &&
+
+	# Partial clone and push commit to remote
+	git clone "file://$(pwd)/full" --filter=blob:none partial &&
+	echo "hello" > partial/foo &&
+	git -C partial commit -a -m "commit 2" &&
+	git -C partial push &&
+
+	# gc in partial repo
+	git -C partial gc --prune=now &&
+
+	# Create another commit in normal repo
+	git -C full checkout main &&
+	echo " world" >> full/foo &&
+	git -C full commit -a -m "commit 3" &&
+
+	# Pull from remote in partial repo, and run gc again
+	git -C partial pull &&
+	git -C partial gc --prune=now
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
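To exercise the new test in isolation, the usual test-suite invocation should work from a built source tree (-v and -i are standard test-lib options for verbose output and stopping at the first failure):

    # Run the partial-clone test script on its own.
    cd t &&
    ./t5616-partial-clone.sh -v -i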
In a partial repository, creating a local commit and then fetching
causes the following state to occur:

    commit    tree    blob
     C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
     |
     C2 ---- T2 -- B2 (created locally, in non-promisor pack)
     |
     C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

During garbage collection, parents of promisor objects are marked as
UNINTERESTING and are subsequently garbage collected. In this case, C2
would be deleted and attempts to access that commit would result in
"bad object" errors (originally reported here[1]).

This is not a bug in gc since it should be the case that parents of
promisor objects are also promisor objects (fsck assumes this as
well). When promisor objects are fetched, the state of the repository
should ensure that the above holds true. Therefore, do not declare
local commits as "have" in partial repositories so they can be fetched
into a promisor pack.

[1] https://lore.kernel.org/git/20240802073143.56731-1-hanyang.tony@bytedance.com/

Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 fetch-pack.c             | 17 ++++++++++++++---
 t/t5616-partial-clone.sh | 29 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 3 deletions(-)
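One way to observe the intended effect on the wire is packet tracing: with this patch applied, the locally created commit should no longer show up among the "have" lines when fetching from the promisor remote. The log path below is arbitrary, and "partial"/"origin" refer to the repo and remote from the test above.

    # Log the have/want negotiation of a fetch from the promisor remote.
    GIT_TRACE_PACKET="$(pwd)/packet.log" git -C partial fetch origin
    grep ' have ' packet.log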