Message ID: 20240802073143.56731-1-hanyang.tony@bytedance.com (mailing list archive)
Series: revision: fix reachable objects being gc'ed in no blob clone repo
Han Young <hanyang.tony@bytedance.com> writes:

> Here are the minimal steps to recreate issue.

[snip]

I think the following is what is happening. Before the final gc, the
repo looks as follows:

  commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, in non-promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

After the final gc, {C,T,B}3 and {C,T,B}1 are in a promisor pack, but
all of {C,T,B}2 are deleted because they are thought to be promisor
objects.

> The last `git gc` will error out on fsck with error message like this:
>
> error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
> error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94

I'm not sure how `git gc` (or `git fsck`) knows the name of this object
(what is the type of this object, and which object refers to this
object?) but I think that if we implement one of the solutions I
describe below, this problem will go away.

> `git gc` will call `git repack`, which will call `git pack-objects`
> twice on a partially cloned repo. The first call to pack-objects
> combines all the promisor packfiles, and the second pack-objects
> command packs all reachable non-promisor objects into a normal packfile.

Yes, this is what I remember.

> However, a bug in setup_revision caused some non-promisor objects
> to be mistakenly marked as in promisor packfiles in the second call
> to pack-objects.

I think they ({C,T,B}2 in the example above) should be considered as
promisor objects, actually. From the partial clone doc, it says of a
promisor object: "the local repository has that object in one of its
promisor packfiles, or because another promisor object refers to it".
C3 (in a promisor packfile) refers to C2, so C2 is a promisor object.
It refers to T2, which refers to B2, so all of them are promisor
objects.
However, in the Git code, I don't think this definition is applied
recursively - if I remember correctly, we made the assumption (e.g. in
is_promisor_object() in packfile.c) that we only need to care about
objects in promisor packfiles and the objects they directly reference,
because how would we know what objects they indirectly reference? But
this does not cover the case in which we know what objects they
indirectly reference because we pushed them to the server in the first
place.

Solutions I can think of:

- When fetching from a promisor remote, never declare commits that are
  not in promisor packfiles as HAVE. This means that we would refetch
  C2 and T2 as being in promisor packfiles. But it's wasteful in
  network bandwidth, and does not repair problems in Git repos created
  by Git versions that do not have this solution.

- When fetching from a promisor remote, parse every object and repack
  any local objects referenced (directly or indirectly) into a
  promisor packfile. Also does not repair problems.

- When repacking all objects in promisor packfiles, if any object they
  refer to is present in a non-promisor packfile, do a revwalk on that
  object and pack those objects too. The repack will probably be
  slower because each object now has to be parsed. The revwalks
  themselves probably will not take too long, since they can stop at
  known promisor objects.

(One other thing that might be considered is, whenever pushing to a
promisor remote, to write the pack that's pushed as a promisor
packfile. But I don't think this is a good idea - the server may not
retain any packs that were pushed.)
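The gap between the one-hop definition used by is_promisor_object() and the recursive reading of the doc can be modeled with a toy object graph. This is a Python sketch for illustration only, not Git code; the graph and oids are invented to mirror the C1..C3 diagram above.

```python
# oid -> oids it directly references (C3's parent is C2, etc.)
refs = {
    "C3": ["C2", "T3"], "T3": ["B3"],
    "C2": ["C1", "T2"], "T2": ["B2"],
    "C1": ["T1"], "T1": ["B1"],
}
in_promisor_pack = {"C3", "T3", "B3", "C1", "T1", "B1"}

def promisor_objects_one_hop():
    """What is_promisor_object() effectively answers today: in a
    promisor pack, or directly referenced by such an object."""
    out = set(in_promisor_pack)
    for oid in in_promisor_pack:
        out.update(refs.get(oid, []))
    return out

def promisor_objects_recursive():
    """The recursive reading: everything reachable from any object
    that is in a promisor pack."""
    out, stack = set(), list(in_promisor_pack)
    while stack:
        oid = stack.pop()
        if oid not in out:
            out.add(oid)
            stack.extend(refs.get(oid, []))
    return out
```

Here C2 counts as a promisor object under both definitions (C3 references it directly), but T2 and B2 only count under the recursive one - which is the case the one-hop assumption does not cover.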
Jonathan Tan <jonathantanmy@google.com> writes:
> Solutions I can think of:
One more thing that I just thought of regarding the solution in this
patch. It seems to be to have a different separation of packs: all
objects currently in promisor packs and all objects currently not
in promisor packs. And the way it is done is to only exclude (in
this patch, mark UNINTERESTING, although it might be better to have
a separate flag for it) objects in promisor packs, but not their
ancestors. There are two ways we can go from here:
- Do not iterate past this object, just like for UNINTERESTING. This
would end up not packing objects that we need to pack (e.g. {C,T,B}2
below, if we only have a ref pointing to C3).
commit tree blob
C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
|
C2 ---- T2 -- B2 (created locally, in non-promisor pack)
|
C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
- Iterate past this object (I think this is the path this patch took,
but I didn't look at it closely). This works, but seems to be very
slow. We would need to walk through all reachable objects (promisor
object or not), unlike currently in which we stop once we have
reached a promisor object.
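The two traversal choices above can be contrasted with a small model. This is a Python sketch for illustration, not Git code; the linear history mirrors the diagram, with the only ref pointing at C3.

```python
# parent links for the toy history: C3 -> C2 -> C1
parent = {"C3": "C2", "C2": "C1", "C1": None}
in_promisor_pack = {"C3", "C1"}

def walk_stop(tip):
    """Do not iterate past promisor objects: the walk from C3 stops
    immediately and never returns the local C2."""
    out, c = [], tip
    while c is not None and c not in in_promisor_pack:
        out.append(c)
        c = parent[c]
    return out

def walk_past(tip):
    """Iterate past promisor objects: correct (C2 is returned for
    packing) but the walk now visits every reachable commit."""
    out, c = [], tip
    while c is not None:
        if c not in in_promisor_pack:
            out.append(c)
        c = parent[c]
    return out
```

With a ref only at C3, the first variant returns nothing (so {C,T,B}2 would be lost), while the second returns C2 at the cost of walking the whole history.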
Jonathan Tan <jonathantanmy@google.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>> Solutions I can think of:
>
> One more thing that I just thought of regarding the solution in this
> patch. It seems to be to have a different separation of packs: all
> objects currently in promisor packs and all objects currently not
> in promisor packs. And the way it is done is to only exclude (in
> this patch, mark UNINTERESTING, although it might be better to have
> a separate flag for it) objects in promisor packs, but not their
> ancestors.

You're right to mention two separate bits, especially because you do
not want the "I am in a promisor pack" bit to propagate down the
ancestry chain like the UNINTERESTING bit does. But isn't the approach
of enumerating all objects in promisor packs in an oidset, giving
is_promisor_object() a quick way to answer whether an object is or is
not in a promisor pack, sufficient to replace the need to use _any_
object flag bits to manage objects in promisor packs?

> There are two ways we can go from here:
>
> - Do not iterate past this object, just like for UNINTERESTING. This
>   would end up not packing objects that we need to pack (e.g. {C,T,B}2
>   below, if we only have a ref pointing to C3).
>
>   commit  tree  blob
>   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>   |
>   C2 ---- T2 -- B2 (created locally, in non-promisor pack)
>   |
>   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> - Iterate past this object (I think this is the path this patch took,
>   but I didn't look at it closely). This works, but seems to be very
>   slow. We would need to walk through all reachable objects (promisor
>   object or not), unlike currently in which we stop once we have
>   reached a promisor object.

Thanks for helping Han & Xinxin.
Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>> Solutions I can think of:
>>
>> One more thing that I just thought of regarding the solution in this
>> patch. It seems to be to have a different separation of packs: all
>> objects currently in promisor packs and all objects currently not
>> in promisor packs. And the way it is done is to only exclude (in
>> this patch, mark UNINTERESTING, although it might be better to have
>> a separate flag for it) objects in promisor packs, but not their
>> ancestors.
>
> You're right to mention two separate bits, especially because you do
> not want the "I am in a promisor pack" bit to propagate down to the
> ancestry chain like UNINTERESTING bit does. But isn't the approach
> to enumerate all objects in promisor packs in an oidset and give a
> quick way for is_promisor_object() to answer if an object is or is
> not in promisor pack sufficient to replace the need to use _any_
> object flag bits to manage objects in promisor packs?

Ah...yes, you're right. But if someone is intending to go this route,
note that you can't use the oidset in is_promisor_object() directly, as
it contains both objects in promisor packs and objects that they
directly reference. You'll need to adapt it.
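Jonathan's caveat is worth making concrete: the existing oidset answers a broader question than "is this object in a promisor pack". A Python sketch (illustrative only, not Git code; names are invented):

```python
# Objects actually stored in promisor packs, and their outgoing links.
in_promisor_pack = {"C3", "C1"}
direct_refs = {"C3": ["C2"], "C1": []}

# What the oidset behind is_promisor_object() effectively contains:
# pack members PLUS everything they directly reference.
promisor_oidset = set(in_promisor_pack)
for oid in in_promisor_pack:
    promisor_oidset.update(direct_refs[oid])
```

C2 ends up in the oidset even though it lives in a non-promisor pack, so a repack step that used this set to decide what to exclude would wrongly skip C2; the adapted set must contain pack members only.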
It seems that we're at a standstill for the various possible designs
that can solve this problem, so I decided to write up a design document
to discuss the ideas we've come up with so far and new ones. Hopefully
this will get us closer to a viable implementation we can agree on.

Missing Promisor Objects in Partial Repo Design Doc
===================================================

Basic Reproduction Steps
------------------------

- Partial clone repository
- Create local commit and push
- Fetch new changes
- Garbage collection

State After Reproduction
------------------------

  commit   tree   blob
  C3 ----- T3 --- B3 (fetched from remote, in promisor pack)
  |
  C2b ---- T2b -- B2b (created locally, in non-promisor pack)
  |
  C2a ---- T2a -- B2a (created locally, in non-promisor pack)
  |
  C1 ----- T1 --- B1 (fetched from remote, in promisor pack)

Explanation of the Problem
--------------------------

In a partial clone repository, non-promisor commits are locally
committed as children of promisor commits and then pushed up to the
server. Fetches of new history can result in promisor commits that have
non-promisor commits as ancestors. During garbage collection, objects
are repacked in 2 steps. In the first step, if there is more than one
promisor packfile, all objects in promisor packfiles are repacked into
a single promisor packfile. In the second step, a revision walk is made
from all refs (and some other things like HEAD and reflog entries) that
stops whenever it encounters a promisor object. In the example above,
if a ref pointed directly to C2a, it would be returned by the walk (as
an object to be packed). But if we only had a ref pointing to C3, the
revision walk immediately sees that it is a promisor object, does not
return it, and does not iterate through its parents.

(C2b is a bit of a special case. Despite not being in a promisor pack,
it is still considered to be a promisor object since C3 directly
references it.)
If we think this is a bad state, we should propagate the “promisor-ness”
of C3 to its ancestors. Git commands should either prevent this state
from occurring or tolerate it and fix it when we can. If we did run into
this state unexpectedly, then it would be considered a BUG.

If we think it is a valid state, we should NOT propagate the
“promisor-ness” of C3 to its ancestors. Git commands should respect that
this is a possible state and be able to work around it. Therefore, this
bug would then be strictly caused by garbage collection.

Bad State Solutions
===================

Fetch negotiation
-----------------

Implemented at
https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/

During fetch negotiation, if a commit is not in a promisor pack and
therefore local, do not declare it as "have" so it can be fetched into
a promisor pack.

Cost:
- Creation of set of promisor pack objects (by iterating through every
  .idx of promisor packs)
- Refetch number of local commits

Pros: Implementation is simple, client doesn’t have to repack, prevents
state from ever occurring in the repository.

Cons: Network cost of refetching could be high if many local commits
need to be refetched.

  commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, refetched into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Fetch repack
------------

Not yet implemented.

Enumerate the objects in the freshly fetched promisor packs, checking
every outgoing link to see if they reference a non-promisor object that
we have, to get a list of tips where local objects are parents of
promisor objects ("bad history"). After collecting these "tips of bad
history", you then start another traversal from them until you hit an
object in a promisor pack and stop traversal there. You have
successfully enumerated the local objects to be repacked into a
promisor pack.
Cost:
- Traversal through newly fetched promisor trees and commits
- Creation of set of promisor pack objects (for tips of bad history
  traversal to stop at a promisor object)
- Traversal through all local commits and check existence in promisor
  pack set
- Repack all pushed local commits

Pros: Prevents state from ever occurring in the repository, no network
cost.

Cons: Additional cost of repacking is incurred during fetch, more
complex implementation.

  commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Garbage Collection repack
-------------------------

Not yet implemented.

Same concept as “fetch repack”, but happens during garbage collection
instead. The traversal is more expensive since we no longer have access
to what was recently fetched, so we have to traverse through all
promisor packs to collect tips of “bad” history.

Cost:
- Creation of set of promisor pack objects
- Traversal through all promisor commits
- Traversal through all local commits and check existence in promisor
  object set
- Repack all pushed local commits

Pros: Can be run in the background as part of maintenance, no network
cost.

Cons: More expensive than “fetch repack”, state isn’t fixed until
garbage collection, more complex implementation.

  commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Garbage Collection repack all
-----------------------------

Implemented at
https://lore.kernel.org/git/20240925072021.77078-1-hanyang.tony@bytedance.com/

Repack all local commits into promisor packs during garbage collection.
Both valid scenarios:

  commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

  commit  tree  blob
  C3 ---- T3 -- B3 (created locally, packed into promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost: Repack all local commits

Pros: Can be run in the background as part of maintenance, no network
cost, less complex implementation, and less expensive than “garbage
collection repack”.

Cons: Packing local objects into promisor packs means that it is no
longer possible to detect if an object is missing due to repository
corruption or because we need to fetch it from a promisor remote.
Packing local objects into promisor packs means that garbage collection
will no longer remove unreachable local objects.

Valid State Solutions
=====================

Garbage Collection check
------------------------

Not yet implemented.

Currently during the garbage collection rev walk, whenever a promisor
commit is reached, it is marked UNINTERESTING, and then subsequently
all ancestors of the promisor commit are traversed and also marked
UNINTERESTING. Therefore, add a check for whether a commit is local or
not during promisor commit ancestor traversal and do not mark local
commits as UNINTERESTING.

  commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost:
- Adds an additional check to every ancestor of a promisor commit.

This is practically the only solution if the state is valid. Fsck would
also have to start checking for validity of ancestors of promisor
commits instead of ignoring them as it currently does.
Optimizations
=============

The “creation of set of promisor pack objects” can be replaced with
“creation of set of non-promisor objects” since the latter is almost
always cheaper and we can check for non-existence rather than
existence. This does not work for “fetch negotiation” since if we have
a commit that's in both a promisor pack and a non-promisor pack, the
algorithm's correctness relies on the fact that we report it as a
promisor object (because we really need the server to re-send it).
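The complement-set optimization above can be sketched in a few lines. This is an illustrative Python model, not Git code; the sets and oids are invented, and it assumes no object lives in both kinds of pack (exactly the assumption that breaks for "fetch negotiation"):

```python
# The (usually much smaller) set of objects in non-promisor packs.
non_promisor_objects = {"C2", "T2", "B2"}

def in_promisor_pack(oid):
    """Infer promisor-pack membership by absence from the non-promisor
    set, instead of enumerating every promisor pack's .idx."""
    return oid not in non_promisor_objects
```

For a repo where promisor packs hold the bulk of history, building `non_promisor_objects` touches far fewer index entries, which is why the swap is almost always cheaper.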
Calvin Wan <calvinwan@google.com> writes:

> It seems that we're at a standstill for the various possible designs
> that can solve this problem, so I decided to write up a design document
> to discuss the ideas we've come up with so far and new ones. Hopefully
> this will get us closer to a viable implementation we can agree on.

Thanks for writing this up. Very much appreciated. I'll hold my
thoughts before others have a chance to speak up, though.

Thanks.
> Missing Promisor Objects in Partial Repo Design Doc
> ===================================================
>
> Basic Reproduction Steps
> ------------------------
>
> - Partial clone repository
> - Create local commit and push
> - Fetch new changes
> - Garbage collection
>
> State After Reproduction
> ------------------------
>
>   commit   tree   blob
>   C3 ----- T3 --- B3 (fetched from remote, in promisor pack)
>   |
>   C2b ---- T2b -- B2b (created locally, in non-promisor pack)
>   |
>   C2a ---- T2a -- B2a (created locally, in non-promisor pack)
>   |
>   C1 ----- T1 --- B1 (fetched from remote, in promisor pack)
>
> Explanation of the Problem
> --------------------------
>
> In a partial clone repository, non-promisor commits are locally
> committed as children of promisor commits and then pushed up to the
> server. Fetches of new history can result in promisor commits that have
> non-promisor commits as ancestors. During garbage collection, objects
> are repacked in 2 steps. In the first step, if there is more than one
> promisor packfile, all objects in promisor packfiles are repacked into a
> single promisor packfile. In the second step, a revision walk is made
> from all refs (and some other things like HEAD and reflog entries) that
> stops whenever it encounters a promisor object. In the example above, if
> a ref pointed directly to C2a, it would be returned by the walk (as an
> object to be packed). But if we only had a ref pointing to C3, the
> revision walk immediately sees that it is a promisor object, does not
> return it, and does not iterate through its parents.

True. Will it become even worse, if a protocol extension Christian
proposes starts suggesting a repository that is not lazy to add a
promisor remote? In such a set-up, perhaps all history leading to C2b
down to the root are local, but C3 may have come from a promisor remote
(hence in a promisor pack).

> (C2b is a bit of a special case.
> Despite not being in a promisor pack,
> it is still considered to be a promisor object since C3 directly
> references it.)

Yes, and I suspect the root cause of this confusion is that "promisor
object", as defined today, is a flawed concept.

If C2b were pointed at by a local ref, just like the case where the
ref points at C2a, they should be treated the same way, as both of
them are locally created. To put it another way, presumably the local
commits have already been pushed out to elsewhere and the promisor
remote got hold of them, and that is why C3 can build on top of them.
And the fact that C2b is directly reachable from C3 and C2a is not
should not have any relevance if C2a or C2b are not _included_ in
promisor packs (hence both of them need to be included in the local
pack).

Two concepts that would have been useful are (1) objects that are in
promisor packs and (2) objects that are reachable from an object that
is in a promisor pack. I do not see how the current definition of
"promisor objects" (i.e. in a promisor pack, or one hop from an object
in a promisor pack) is useful in any context.

> If we think this is a bad state, we should propagate the “promisor-ness”
> of C3 to its ancestors. Git commands should either prevent this state
> from occurring or tolerate it and fix it when we can. If we did run into
> this state unexpectedly, then it would be considered a BUG.

Yup, that is the basis of the solutions we saw proposed so far.

> If we think it is a valid state, we should NOT propagate the
> “promisor-ness” of C3 to its ancestors. Git commands should respect that
> this is a possible state and be able to work around it. Therefore, this
> bug would then be strictly caused by garbage collection.

Yes, that is possibly an alternative.
> Bad State Solutions
> ===================
>
> Fetch negotiation
> -----------------
>
> Implemented at
> https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
>
> During fetch negotiation, if a commit is not in a promisor pack and
> therefore local, do not declare it as "have" so they can be fetched into
> a promisor pack.
>
> Cost:
> - Creation of set of promisor pack objects (by iterating through every
>   .idx of promisor packs)

What is "promisor PACK objects"? Is it different from the "promisor
objects" (i.e. what I called the useless definition above)?

> - Refetch number of local commits
>
> Pros: Implementation is simple, client doesn’t have to repack, prevents
> state from ever occurring in the repository.
>
> Cons: Network cost of refetching could be high if many local commits
> need to be refetched.

What if we get into the same state by creating local C4, which gets to
outside and on top of which C5 is built, which is now sitting at the
tip of the remote history and we fetch from them? In order to include
C4 in the "promisor pack", we refrain from saying C4 is a "have" for
us and refetch. Would C2 be fetched again?

I do not think C2 would be, because we made it an object in a promisor
pack when we "fixed" the history for C3. So the cost will not grow
proportionally to the depth of the history, which makes it OK from my
point of view.

> Garbage Collection repack
> -------------------------
>
> Not yet implemented.
>
> Same concept at “fetch repack”, but happens during garbage collection
> instead. The traversal is more expensive since we no longer have access
> to what was recently fetched so we have to traverse through all promisor
> packs to collect tips of “bad” history.
In other words, with the status quo, "git gc" attempts to repack
"objects in promisor packs" and "other objects that did not get
repacked in that step" separately, implements the latter in a buggy
way, and discards some objects. And fixing that bug by doing the right
thing is expensive.

Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
"git gc"? Wouldn't these "missing" objects be lazily fetchable, now
that C3 is known to the remote and the remote promises everything
reachable from what they offer is (re)fetchable from them? IOW, is
this a correctness issue, or only a performance issue (of having to
re-fetch what we once locally had)?

> Cons: Packing local objects into promisor packs means that it is no
> longer possible to detect if an object is missing due to repository
> corruption or because we need to fetch it from a promisor remote.

Is this true? Can we tell, when trying to access C2a/C2b/C2 after the
current version of "git gc" removes them from the local object store,
that they are missing due to repository corruption? After all, C3 can
reach them, so wouldn't it be possible for us to fetch them from the
promisor remote?

After a lazy clone that omits a lot of objects acquires many objects
over time by fetching missing objects on demand, wouldn't we want to
have an option to "slim" the local repository by discarding some of
these objects (the ones that are least frequently used), relying on
the promise by the promisor remote that even if we did so, they can be
fetched again? Can we treat the loss of C2a/C2b/C2 as if such a
feature prematurely kicked in? Or are we failing to refetch them for
some reason?

> Packing local objects into promisor packs means that garbage collection
> will no longer remove unreachable local objects.
>
> Valid State Solutions
> =====================
>
> Garbage Collection check
> ------------------------
>
> Not yet implemented.
> Currently during the garbage collection rev walk, whenever a promisor
> commit is reached, it is marked UNINTERESTING, and then subsequently all
> ancestors of the promisor commit are traversed and also marked
> UNINTERESTING. Therefore, add a check for whether a commit is local or
> not during promisor commit ancestor traversal and do not mark local
> commits as UNINTERESTING.
>
>   commit  tree  blob
>   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>   |
>   C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
>   |
>   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> Cost:
> - Adds an additional check to every ancestor of a promisor commit.
>
> This is practically the only solution if the state is valid. Fsck would
> also have to start checking for validity of ancestors of promisor
> commits instead of ignoring them as it currently does.

In the longer term, this looks like the most straight-forward and
easy-to-explain solution to me.

> Optimizations
> =============
>
> The “creation of set of promisor pack objects” can be replaced with
> “creation of set of non-promisor objects” since the latter is almost
> always cheaper and we can check for non-existence rather than existence.
> This does not work for “fetch negotiation” since if we have a commit
> that's in both a promisor pack and a non-promisor pack, the algorithm's
> correctness relies on the fact that we report it as a promisor object
> (because we really need the server to re-send it).
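The "Garbage Collection check" idea quoted above can be sketched as a small change to the UNINTERESTING propagation. This is an illustrative Python model, not Git code; the history mirrors the C1..C3 diagram and the oids are invented.

```python
parents = {"C3": ["C2"], "C2": ["C1"], "C1": []}
in_promisor_pack = {"C3", "C1"}

def mark_uninteresting(tip):
    """Propagate UNINTERESTING from a promisor commit, but skip local
    commits (the proposed check) so gc does not delete them."""
    uninteresting, seen, stack = set(), set(), [tip]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        if c in in_promisor_pack:
            uninteresting.add(c)
        # keep walking either way, so promisor ancestors of a local
        # commit (C1 below C2) are still marked
        stack.extend(parents[c])
    return uninteresting
```

Starting the walk at C3 marks C3 and C1 but leaves the local C2 unmarked, so the second pack-objects step still packs it; the added cost is the pack-membership check on every ancestor.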
On Wed, Oct 2, 2024 at 10:55 AM Junio C Hamano <gitster@pobox.com> wrote:
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"? Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them? IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?
>
> Is this true? Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption? After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
>
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in? Or are we failing to refetch them
> for some reason?

In a blobless clone, we expect commits and trees to be present in the
repo. If C2/T2 is missing, commands like "git merge" will complain
"cannot merge unrelated history" and fail. Or commands like "git log"
will try to lazily fetch the commit, but without 'have' negotiation,
end up pulling all the trees and blobs reachable from that commit.

It's possible to minimize the impact of missing commits by adding
negotiation to lazy fetching, but we probably need to adapt code in
many places where we don't do lazy fetching: "git log", "git merge",
commit graph, etc. It's no trivial amount of work.
On Tue, Oct 1, 2024 at 7:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> True. Will it become even worse, if a protocol extension Christian
> proposes starts suggesting a repository that is not lazy to add a
> promisor remote? In such a set-up, perhaps all history leading to
> C2b down to the root are local, but C3 may have come from a promisor
> remote (hence in a promisor pack).

Yes, if we, and consequently Git, consider this state to be
problematic.

> > Bad State Solutions
> > ===================
> >
> > Fetch negotiation
> > -----------------
> > Implemented at
> > https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
> >
> > During fetch negotiation, if a commit is not in a promisor pack and
> > therefore local, do not declare it as "have" so they can be fetched into
> > a promisor pack.
> >
> > Cost:
> > - Creation of set of promisor pack objects (by iterating through every
> >   .idx of promisor packs)
>
> What is "promisor PACK objects"? Is it different from the "promisor
> objects" (i.e. what I called the useless definition above)?

Objects that are in promisor packs, specifically the ones in packs
that have the flag packed_git::pack_promisor set. However, since this
design doc was sent out, it turns out the creation of a set of
promisor pack objects in a large repository (such as Android or
Chrome) is very expensive, so this design is infeasible in my opinion.

> > - Refetch number of local commits
> >
> > Pros: Implementation is simple, client doesn’t have to repack, prevents
> > state from ever occurring in the repository.
> >
> > Cons: Network cost of refetching could be high if many local commits
> > need to be refetched.
>
> What if we get into the same state by creating local C4, which gets
> to outside and on top of which C5 is built, which is now sitting at
> the tip of the remote history and we fetch from them? In order to
> include C4 in the "promisor pack", we refrain from saying C4 is a
> "have" for us and refetch.
> Would C2 be fetched again?
>
> I do not think C2 would be, because we made it an object in a
> promisor pack when we "fixed" the history for C3.
>
> So the cost will not grow proportionally to the depth of the
> history, which makes it OK from my point of view.

Correct, the cost of refetching is only a one-time cost, but
unfortunately the creation of a set of promisor pack objects isn't.

> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept at “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects. And fixing that bug by doing the right thing is
> expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"? Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them? IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?

My first thought is that from both the user and developer perspective,
we don't expect our reachable objects to be gc'ed. So all of the "bad
state" solutions work to ensure that that isn't the case in some way
or form. However, if it turns out that all of these solutions are much
more expensive and disruptive to the user than accepting that local
objects can be gc'ed and JIT refetching, then the latter seems much
more palatable.
It is inevitable that we take some performance hit to fix this problem
and we may just have to accept this as one of the costs of having
partial clones to begin with.

> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true? Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption? After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?

I should be clearer that "detecting if an object is missing due to
repository corruption" refers to fsck currently not having the
functionality to do that. We are "accidentally" discovering the
corruption when we try to access the missing object, but we can still
fetch them from the promisor remote afterwards.

> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in? Or are we failing to refetch them
> for some reason?

Yes, if such a feature existed, then it would be feasible and a
possible solution for this issue (I'm leaning quite towards this now
after testing out some of the other designs).
On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@google.com> wrote:
> Objects that are in promisor packs, specifically the ones that have the
> flag, packed_git::pack_promisor, set. However, since this design doc
> was sent out, it turns out the creation of a set of promisor pack objects
> in a large repository (such as Android or Chrome) is very expensive, so
> this design is infeasible in my opinion.

I wonder if a set of local loose/pack objects would be cheaper to
construct? Normally loose objects are always non-promisor objects,
unless the user runs something like `unpack-objects`.

> > After a lazy clone that omits a lot of objects acquires many objects
> > over time by fetching missing objects on demand, wouldn't we want to
> > have an option to "slim" the local repository by discarding some of
> > these objects (the ones that are least frequently used), relying on
> > the promise by the promisor remote that even if we did so, they can
> > be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> > feature prematurely kicked in? Or are we failing to refetch them
> > for some reason?
>
> Yes if such a feature existed, then it would be feasible and a possible
> solution for this issue (I'm leaning quite towards this now after testing
> out some of the other designs).

Since no partial clone filter omits commit objects, we always assume
commits are available in the codebase. `merge` reports "cannot merge
unrelated history" if one of the commits is missing, instead of trying
to fetch it.

Another problem is that the current lazy fetching code does not report
"haves" to the remote, so a lazy fetch of a commit ends up pulling all
the trees and blobs associated with that commit.

I also prefer the "fetch the missing objects" approach; making sure
the repo has all the "correct" objects is difficult to get right.
Han Young <hanyang.tony@bytedance.com> writes:
> On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@google.com> wrote:
>
> > Objects that are in promisor packs, specifically the ones that have the
> > flag, packed_git::pack_promisor, set. However, since this design doc
> > was sent out, it turns out the creation of a set of promisor pack objects
> > in a large repository (such as Android or Chrome) is very expensive, so
> > this design is infeasible in my opinion.
>
> I wonder if a set of local loose/pack objects would be cheaper to
> construct? Normally loose objects are always non-promisor objects,
> unless the user runs something like `unpack-objects`.

We had a similar idea at $JOB. Note that you don't actually need to
create the set - when looking up an object using
oid_object_info_extended(), we know if it's a loose object and, if
not, which pack it is in. The pack has a promisor bit that we can
check.

Note that there is a possibility of a false positive. If the same
object is in two packs - one promisor and one non-promisor - I believe
there's no guarantee that one pack will be preferred. So we can see
that the object is in a non-promisor pack, but there's no guarantee
that it's not also in a promisor pack. For the
omit-local-commits-in-"have" solution, this is a fatal flaw (we
absolutely must guarantee that we don't send any promisor commits),
but for the repack-on-fetch solution, it is no big deal (we are
looking for objects to repack into a promisor pack, and repacking a
promisor object into a promisor pack is perfectly fine). For this
reason, I think the repack-on-fetch solution is the most promising one
so far.

Loose objects are always non-promisor objects, yes. (I don't think
the user running `unpack-objects` counts - a user running a command
directly on a packfile in the .git directory is out of scope, I
think.)
> > > After a lazy clone that omits a lot of objects acquires many objects
> > > over time by fetching missing objects on demand, wouldn't we want to
> > > have an option to "slim" the local repository by discarding some of
> > > these objects (the ones that are least frequently used), relying on
> > > the promise by the promisor remote that even if we did so, they can
> > > be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> > > feature prematurely kicked in? Or are we failing to refetch them
> > > for some reason?
> >
> > Yes if such a feature existed, then it would be feasible and a possible
> > solution for this issue (I'm leaning quite towards this now after testing
> > out some of the other designs).
>
> Since no partial clone filter omits commit objects, we always assume
> commits are available in the codebase. `merge` reports "cannot merge
> unrelated history" if one of the commits is missing, instead of trying
> to fetch it.
> Another problem is that the current lazy fetching code does not report
> "haves" to the remote, so a lazy fetch of a commit ends up pulling all
> the trees and blobs associated with that commit.
> I also prefer the "fetch the missing objects" approach; making sure
> the repo has all the "correct" objects is difficult to get right.

If I remember correctly, our intention (or, at least, my intention) in
not treating missing commits differently was so that we don't limit
the solutions that we can implement. For example, we had the idea of
server-assisted merge base computation - this and other features would
make it feasible to omit commits locally. It has not been implemented,
though.
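The "best effort" check discussed above, and its false positive, can
be modeled with a toy lookup (plain Python; the `packs` list and its
ordering are assumptions standing in for Git's unspecified preference
between packs, not the real oid_object_info_extended() code):

```python
# Toy "best effort" promisor check: find the pack containing an
# object and report that pack's promisor bit. Which pack "wins" when
# an object is duplicated is modeled by list order.
packs = [
    {"name": "pack-a", "promisor": False, "objects": {"X"}},
    {"name": "pack-b", "promisor": True,  "objects": {"X", "Y"}},
]

def best_effort_is_promisor(oid):
    for pack in packs:            # first pack that has the object wins
        if oid in pack["objects"]:
            return pack["promisor"]
    return None                   # missing: the check cannot answer

# X is in both packs; the check reports "non-promisor" even though X
# is also in a promisor pack - the false positive described above.
assert best_effort_is_promisor("X") is False
assert best_effort_is_promisor("Y") is True
# For a missing object there is nothing to inspect, which is why this
# check cannot replace is_promisor_object() in missing-object paths.
assert best_effort_is_promisor("Z") is None
```

The `None` case previews the limitation raised later in the thread:
the check only works for objects we can actually look up.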
Junio C Hamano <gitster@pobox.com> writes:
> > (C2b is a bit of a special case. Despite not being in a promisor pack,
> > it is still considered to be a promisor object since C3 directly
> > references it.)
>
> Yes, and I suspect the root cause of this confusion is that
> "promisor object", as defined today, is a flawed concept. If C2b
> were pointed at by a local ref, just like the case where the ref
> points at C2a, they should be treated the same way, as both of them
> are locally created. To put it another way, presumably the local
> objects have already been pushed out to elsewhere and the promisor
> remote got hold of them, and that is why C3 can build on top of
> them. And the fact that C2b is directly reachable from C3 and C2a
> is not should not have any relevance if C2a and C2b are not
> _included_ in promisor packs (hence both of them need to be included
> in the local pack).
>
> Two concepts that would have been useful are (1) objects that are in
> promisor packs and (2) objects that are reachable from an object
> that is in a promisor pack. I do not see how the current definition
> of "promisor objects" (i.e. in a promisor pack, or one hop from an
> object in a promisor pack) is useful in any context.

The one-hop part in the current definition is meant to (a) explain
what objects the client knows the remote has (in theory the client has
no knowledge of objects beyond the first hop, but we now know this
theory not to be true) and (b) explain what objects a non-promisor
object can reference (in particular, a non-promisor tree can reference
promisor blobs, even when our knowledge of that promisor blob comes
only from a tree in a promisor pack).

If we think of a promisor commit being a child of a non-promisor
commit as a "bad state" that needs to be fixed [1], then the one-hop
current definition seems to be equivalent to (2).
As for (1), we do use that concept in Git, although it's limited to
the repack during GC (or maybe there are others that I don't recall),
so the concept doesn't have a widely-used name like "promisor object".

[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/

> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept as “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched, so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc", which attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repacks objects in promisor packs"
> separately, implements the latter in a buggy way and discards some
> objects. And fixing that bug by doing the right thing is expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"? Wouldn't these "missing" objects be lazily fetchable, now
> that C3 is known to the remote and the remote promises that
> everything reachable from what they offer is (re)fetchable from
> them? IOW, is this a correctness issue, or only a performance issue
> (of having to re-fetch what we once locally had)?

I believe the re-fetch didn't happen because it was run from a command
with fetch_if_missing=0. (But even if we decide that we shouldn't use
fetch_if_missing, and then change all commands to not use it, there
still remains the performance issue, so we should still fix it.)

> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true?
> Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption? After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
>
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again? Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in? Or are we failing to refetch them
> for some reason?

This is under the "repack all" option, which states that we repack all
objects (wherever they came from) into promisor packs. If we locally
created commit A and then its child commit B, and the repo got
corrupted so that we lost A, repacking all objects would mean that we
could never detect that the loss of A is problematic.
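The two concepts Junio contrasts - (1) objects in promisor packs and
(2) objects reachable from an object in a promisor pack - and the
"one hop" part of the current definition can be compared on the
thread's example graph with a toy computation (plain Python, not Git
code; parent and tree/blob links are treated uniformly as "refers
to"):

```python
# C3 -> T3 -> B3 and C3 -> C2 -> C1, etc.; generations 1 and 3 came
# from the promisor remote, generation 2 was created locally.
refs_to = {
    "C3": ["T3", "C2"], "T3": ["B3"],
    "C2": ["T2", "C1"], "T2": ["B2"],
    "C1": ["T1"],       "T1": ["B1"],
    "B1": [], "B2": [], "B3": [],
}
in_promisor_pack = {"C1", "T1", "B1", "C3", "T3", "B3"}   # notion (1)

# "One hop": in a promisor pack, or directly referenced by such an
# object.
one_hop = set(in_promisor_pack)
for obj in in_promisor_pack:
    one_hop.update(refs_to[obj])

# Notion (2): reachable from an object in a promisor pack.
reachable = set()
stack = list(in_promisor_pack)
while stack:
    obj = stack.pop()
    if obj not in reachable:
        reachable.add(obj)
        stack.extend(refs_to[obj])

assert one_hop - in_promisor_pack == {"C2"}            # C2 only
assert reachable - in_promisor_pack == {"C2", "T2", "B2"}
```

The three sets differ exactly at the locally created generation; once
that "bad state" history is repacked into a promisor pack, all three
coincide, which is the sense in which the one-hop definition becomes
equivalent to (2).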
Jonathan Tan <jonathantanmy@google.com> writes:
> For this reason, I think the repack-on-fetch solution is the most
> promising one so far.

I had time to take a closer look at this solution. One problem that
I've noticed is that the "best effort" promisor object check cannot
naively replace is_promisor_object(), because a lot of the time (e.g.
in revision.c's get_reference()) is_promisor_object() is used when an
object is missing, to check whether we need to error out or not. Our
"best effort" promisor object check cannot replace this because it
needs us to have looked up the object in the first place to check
whether it's loose or packed (and if packed, which packfile it's in),
so it can't work with an object that's missing.

So I think we'll need to use do_not_die_on_missing_objects. It does
have the weakness that if the object is not supposed to be missing, we
don't inform the user, but perhaps this is OK here because we know
that all objects we encounter on this object walk are promisor
objects, so if one is missing, that's OK.

In addition to do_not_die_on_missing_objects, we'll also need the
actual code that stops iteration through objects that pass our "best
effort" promisor object check. Probably the best place is in
get_revision_1() after the NULL check, but I haven't fully thought
through what happens if this option is used when some commits are
UNINTERESTING. (For the repack-on-fetch, no commits are UNINTERESTING,
but it's probably best to make sure our feature is as useful in as
many cases as possible, especially since we're going to further
complicate revision walking code, which is complicated enough.)
On Sat, Oct 12, 2024 at 10:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> So I think we'll need to use do_not_die_on_missing_objects. It does have
> the weakness that if the object is not supposed to be missing, we don't
> inform the user, but perhaps this is OK here because we know that all
> objects we encounter on this object walk are promisor objects, so if
> it's missing, it's OK.

And I think users would prefer the git command to succeed if possible,
rather than die on the first (noncritical) error. Maybe show a warning
and swallow the error?

> In addition to do_not_die_on_missing_objects, we'll also need the actual
> code that stops iteration through objects that pass our "best effort"
> promisor object check. Probably the best place is in get_revision_1()
> after the NULL check

get_revision_1() only does commit limiting, though. Some callers of
rev-list also do tree walking on commits; in a (corrupted) partial
repo, trees could also be missing. There isn't a central place we can
stop tree walking; callers using this feature would have to implement
"tree walking early termination" themselves.
Han Young <hanyang.tony@bytedance.com> writes:
> On Sat, Oct 12, 2024 at 10:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> > So I think we'll need to use do_not_die_on_missing_objects. It does have
> > the weakness that if the object is not supposed to be missing, we don't
> > inform the user, but perhaps this is OK here because we know that all
> > objects we encounter on this object walk are promisor objects, so if
> > it's missing, it's OK.
>
> And I think users would prefer the git command to succeed if possible,
> rather than die on the first (noncritical) error. Maybe show a warning
> and swallow the error?

Just to be clear, this is not an error condition, so we shouldn't show
an error or warning. Whenever we repack non-promisor objects in a
partial clone we will almost always encounter missing objects.

In the gc repack, this is mitigated by --exclude-promisor-objects,
which checks the promisor object set whenever a missing object is
encountered; it does not show an error if the missing object is in
that set. My proposal is to use do_not_die_on_missing_objects instead,
which always tolerates missing objects (without showing any error or
warning). This is probably not safe enough for the gc repack, but
should be OK for the fetch repack, since we are only repacking
ancestors of known promisor objects (so we can deduce that the missing
objects are promisor objects).

> > In addition to do_not_die_on_missing_objects, we'll also need the actual
> > code that stops iteration through objects that pass our "best effort"
> > promisor object check. Probably the best place is in get_revision_1()
> > after the NULL check
>
> get_revision_1() only does commit limiting though. Some callers of rev-list
> also do tree walking on commits,

Ah, yes, you're right. The repack on fetch is one such caller (that
will need tree walking).

> in a (corrupted) partial repo, tree could
> also be missing.
> There isn't a central place we can stop tree walking;
> callers using this feature would have to implement "tree walking early
> termination" themselves.

The repo could have been cloned with a tree filter (e.g.
"--filter=tree:0") too, in which case trees would be missing even if
the repo is not corrupted. But even in a non-corrupted
--filter=blob:none partial clone, we still don't want to iterate
through promisor trees, so that we don't repack them unnecessarily.
So yes, get_revision_1() is not the only place that needs to be
changed.

I think that there is a central place to stop tree walking - in
list-objects.c.
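The idea of one central cut point for commits and trees alike can be
illustrated with a toy walk (plain Python; a stand-in for what a cut
in the generic object walk of list-objects.c would achieve, not the
actual code):

```python
# Toy object walk that stops at "promisor" objects uniformly, whether
# they are commits, trees, or blobs - the analog of cutting in one
# central place instead of only in commit limiting (get_revision_1).
children = {
    "C2": ["T2", "C1"], "T2": ["B2"],
    "C1": ["T1"], "T1": ["B1"],
    "B1": [], "B2": [],
}
promisor = {"C1", "T1", "B1"}   # e.g. objects found in promisor packs

def walk(tip):
    # Collect non-promisor objects; do not descend into promisor
    # ones, so missing promisor trees/blobs are never even requested.
    out, stack = [], [tip]
    while stack:
        obj = stack.pop()
        if obj in promisor:
            continue            # the single early-termination point
        out.append(obj)
        stack.extend(children[obj])
    return sorted(out)

# Only the locally created generation is visited (and so repacked);
# T1/B1 are skipped even though C1's tree walk would reach them.
assert walk("C2") == ["B2", "C2", "T2"]
```

Because the cut happens where objects of every type are enqueued,
callers would not need to implement per-type early termination
themselves.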