
[0/1] revision: fix reachable objects being gc'ed in no blob clone repo

Message ID 20240802073143.56731-1-hanyang.tony@bytedance.com (mailing list archive)
Series revision: fix reachable objects being gc'ed in no blob clone repo

Message

Han Young Aug. 2, 2024, 7:31 a.m. UTC
We use --filter=blob:none to clone our large monorepo.
After a while we started getting reports from engineers complaining
that their local repository was broken. Upon further investigation,
we found that the broken repositories are missing objects that were
created in that particular local repository. git fsck reports "bad object: xxx".

Here are the minimal steps to recreate the issue.
    # create a normal git repo, add one file and push to remote
    $ mkdir full && cd full && git init && touch foo
    $ git add foo && git commit -m "commit 1" && git push

    # partial clone a copy of the repo we just created
    $ cd ..
    $ git clone git@example.org:example/foo.git --filter=blob:none partial

    # create a commit in partial cloned repo and push it to remote
    $ cd partial && echo 'hello' > foo && git commit -a -m "commit 2"
    $ git push

    # run gc in partial repo
    $ git gc --prune=now

    # in normal git repo, create another commit on top of the
    # commit we created in partial repo
    $ cd ../full && git pull && echo ' world' >> foo
    $ git commit -a -m "commit 3" && git push

    # pull from remote in partial repo, and run gc again
    $ cd ../partial && git pull && git gc --prune=now

The last `git gc` will error out on fsck with error message like this:

  error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
  error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94

Note that disabling the commit graph in the partial repo will cause
`git gc` to exit normally, but this does not solve the underlying
problem; in more complex situations, disabling the commit graph will
not avoid the error.

The problem is caused by a wrong result returned by setup_revisions()
when `--exclude-promisor-objects` is enabled.
`git gc` calls `git repack`, which calls `git pack-objects` twice on a
partially cloned repo. The first call to pack-objects combines all the
promisor packfiles, and the second pack-objects command packs all
reachable non-promisor objects into a normal packfile. However, a bug
in setup_revisions() causes some non-promisor objects to be mistakenly
treated as if they were in promisor packfiles during the second call
to pack-objects. These incorrectly marked objects are never repacked,
and are therefore deleted from the object store.

In revision.c, `process_parents()` recursively marks commit parents as
UNINTERESTING if the commit itself is UNINTERESTING.
`--exclude-promisor-objects` is implemented as "iterate over all
objects in promisor packfiles, mark them as UNINTERESTING". So when we
find a commit object in a promisor packfile, we also mark its
ancestors as UNINTERESTING, whether those ancestors are promisor
objects or not. In the example above, "commit 2" is a normal commit
object living in a normal packfile, but it is marked as a promisor
object and gc'ed from the object store.
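The faulty marking can be illustrated with a toy simulation (Python
pseudocode of the traversal described above, not git's actual C code;
the commit names follow the reproduction steps):

```python
# Toy model of the --exclude-promisor-objects bug described above.
# History: C1 <- C2 <- C3. C1 and C3 came from the remote (promisor
# packs); C2 was created locally and lives in a normal pack.
parents = {"C3": ["C2"], "C2": ["C1"], "C1": []}
in_promisor_pack = {"C1", "C3"}

def mark_uninteresting(commit, uninteresting):
    """Like process_parents(): UNINTERESTING propagates to all ancestors."""
    if commit in uninteresting:
        return
    uninteresting.add(commit)
    for parent in parents[commit]:
        mark_uninteresting(parent, uninteresting)

def second_pack_objects(refs):
    """Objects the second (non-promisor) pack-objects run would keep."""
    uninteresting = set()
    for commit in in_promisor_pack:     # --exclude-promisor-objects
        mark_uninteresting(commit, uninteresting)
    reachable, stack = set(), list(refs)
    while stack:
        commit = stack.pop()
        if commit not in reachable:
            reachable.add(commit)
            stack.extend(parents[commit])
    return reachable - uninteresting

# C2 is reachable and in no promisor pack, yet nothing is kept:
print(second_pack_objects({"C3"}))  # prints set() -- C2 is lost
```

If the mark stopped at promisor-pack boundaries instead of propagating
through them, C2 would be returned by the walk and survive the repack.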

Han Young (1):
  revision: don't set parents as uninteresting if exclude promisor

 revision.c               |  2 +-
 t/t0410-partial-clone.sh | 22 +++++++++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

Comments

Jonathan Tan Aug. 13, 2024, 12:45 a.m. UTC | #1
Han Young <hanyang.tony@bytedance.com> writes:
> Here are the minimal steps to recreate the issue.
[snip]

I think the following is what is happening. Before the final gc, the
repo looks as follows:

  commit  tree  blob
   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
   |
   C2 ---- T2 -- B2 (created locally, in non-promisor pack)
   |
   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

After the final gc, {C,T,B}3 and {C,T,B}1 are in a promisor pack, but
all of {C,T,B}2 are deleted because they are thought to be promisor
objects.

> The last `git gc` will error out on fsck with error message like this:
> 
>   error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94
>   error: Could not read d3fbfea9e448461c2b72a79a95a220ae10defd94

I'm not sure how `git gc` (or `git fsck`) knows the name of this
object (what is the type of this object, and which object refers to
this object?) but I think that if we implement one of the solutions I
describe below, this problem will go away.

> `git gc` will call `git repack`, which will call `git pack-objects`
> twice on a partially cloned repo. The first call to pack-objects 
> combines all the promisor packfiles, and the second pack-objects 
> command packs all reachable non-promisor objects into a normal packfile.

Yes, this is what I remember.

> However, a bug in setup_revision caused some non-promisor objects 
> to be mistakenly marked as in promisor packfiles in the second call 
> to pack-objects.

I think they ({C,T,B}2 in the example above) should be considered as
promisor objects, actually. From the partial clone doc, it says of a
promisor object: "the local repository has that object in one of its
promisor packfiles, or because another promisor object refers to it".
C3 (in a promisor packfile) refers to C2, so C2 is a promisor object. It
refers to T2, which refers to B2, so all of them are promisor objects.
However, in the Git code, I don't think this definition is applied
recursively - if I remember correctly, we made the assumption (e.g.
in is_promisor_object() in packfile.c) that we only need to care about
objects in promisor packfiles and the objects they directly reference,
because how would we know what objects they indirectly reference?
But this does not cover the case in which we know what objects they
indirectly reference because we pushed them to the server in the first
place.

Solutions I can think of:

 - When fetching from a promisor remote, never declare commits that are
   not in promisor packfiles as HAVE. This means that we would refetch
   C2 and T2 as being in promisor packfiles. But it's wasteful in
   network bandwidth, and does not repair problems in Git repos created
   by Git versions that do not have this solution.

 - When fetching from a promisor remote, parse every object and repack
   any local objects referenced (directly or indirectly) into a promisor
   packfile. Also does not repair problems.

 - When repacking all objects in promisor packfiles, if any object they
   refer to is present in a non-promisor packfile, do a revwalk on that
   object and pack those objects too. The repack will probably be slower
   because each object now has to be parsed. The revwalks themselves
   probably will not take too long, since they can stop at known promisor
   objects.

(One other thing that might be considered is, whenever pushing to a
promisor remote, to write the pack that's pushed as a promisor packfile.
But I don't think this is a good idea - the server may not retain any
packs that were pushed.)
Jonathan Tan Aug. 13, 2024, 5:18 p.m. UTC | #2
Jonathan Tan <jonathantanmy@google.com> writes:
> Solutions I can think of:

One more thing that I just thought of regarding the solution in this
patch. It seems to be to have a different separation of packs: all
objects currently in promisor packs and all objects currently not
in promisor packs. And the way it is done is to only exclude (in
this patch, mark UNINTERESTING, although it might be better to have
a separate flag for it) objects in promisor packs, but not their
ancestors. There are two ways we can go from here:

 - Do not iterate past this object, just like for UNINTERESTING. This
   would end up not packing objects that we need to pack (e.g. {C,T,B}2
   below, if we only have a ref pointing to C3).

  commit  tree  blob
   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
   |
   C2 ---- T2 -- B2 (created locally, in non-promisor pack)
   |
   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

 - Iterate past this object (I think this is the path this patch took,
   but I didn't look at it closely). This works, but seems to be very
   slow. We would need to walk through all reachable objects (promisor
   object or not), unlike currently in which we stop once we have
   reached a promisor object.
Junio C Hamano Aug. 14, 2024, 4:10 a.m. UTC | #3
Jonathan Tan <jonathantanmy@google.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>> Solutions I can think of:
>
> One more thing that I just thought of regarding the solution in this
> patch. It seems to be to have a different separation of packs: all
> objects currently in promisor packs and all objects currently not
> in promisor packs. And the way it is done is to only exclude (in
> this patch, mark UNINTERESTING, although it might be better to have
> a separate flag for it) objects in promisor packs, but not their
> ancestors.

You're right to mention two separate bits, especially because you do
not want the "I am in a promisor pack" bit to propagate down to the
ancestry chain like UNINTERESTING bit does.  But isn't the approach
to enumerate all objects in promisor packs in an oidset and give a
quick way for is_promisor_object() to answer if an object is or is
not in promisor pack sufficient to replace the need to use _any_
object flag bits to manage objects in promisor packs?
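Junio's oidset suggestion could look roughly like the following sketch
(Python standing in for git's C oidset API; the pack contents are made
up for illustration):

```python
# Enumerate, once, the object ids that appear in promisor pack .idx
# files into a set, then answer membership queries from it -- no
# object flag bits needed, and, unlike UNINTERESTING, the property
# cannot leak to ancestors.
def build_promisor_pack_oidset(promisor_pack_indexes):
    oidset = set()
    for idx in promisor_pack_indexes:   # each idx: iterable of oids
        oidset.update(idx)
    return oidset

# Toy pack contents matching the diagram in this thread.
pack1 = ["C1", "T1", "B1"]
pack2 = ["C3", "T3", "B3"]
oidset = build_promisor_pack_oidset([pack1, pack2])

def in_promisor_pack(oid):
    return oid in oidset

print(in_promisor_pack("C3"), in_promisor_pack("C2"))  # True False
```

Membership in this set answers only "is this object in a promisor
pack", which is exactly the property the second repack pass needs.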

> There are two ways we can go from here:
>
>  - Do not iterate past this object, just like for UNINTERESTING. This
>    would end up not packing objects that we need to pack (e.g. {C,T,B}2
>    below, if we only have a ref pointing to C3).
>
>   commit  tree  blob
>    C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>    |
>    C2 ---- T2 -- B2 (created locally, in non-promisor pack)
>    |
>    C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
>  - Iterate past this object (I think this is the path this patch took,
>    but I didn't look at it closely). This works, but seems to be very
>    slow. We would need to walk through all reachable objects (promisor
>    object or not), unlike currently in which we stop once we have
>    reached a promisor object.

Thanks for helping Han & Xinxin.
Jonathan Tan Aug. 14, 2024, 7:30 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Jonathan Tan <jonathantanmy@google.com> writes:
> >> Solutions I can think of:
> >
> > One more thing that I just thought of regarding the solution in this
> > patch. It seems to be to have a different separation of packs: all
> > objects currently in promisor packs and all objects currently not
> > in promisor packs. And the way it is done is to only exclude (in
> > this patch, mark UNINTERESTING, although it might be better to have
> > a separate flag for it) objects in promisor packs, but not their
> > ancestors.
> 
> You're right to mention two separate bits, especially because you do
> not want the "I am in a promisor pack" bit to propagate down to the
> ancestry chain like UNINTERESTING bit does.  But isn't the approach
> to enumerate all objects in promisor packs in an oidset and give a
> quick way for is_promisor_object() to answer if an object is or is
> not in promisor pack sufficient to replace the need to use _any_
> object flag bits to manage objects in promisor packs?

Ah...yes, you're right. But if someone is intending to go this route,
note that you can't use the oidset in is_promisor_object() directly,
as it contains both objects in promisor packs and objects that they
directly reference. You'll need to adapt it.
Calvin Wan Oct. 1, 2024, 7:17 p.m. UTC | #5
It seems that we're at a standstill for the various possible designs
that can solve this problem, so I decided to write up a design document
to discuss the ideas we've come up with so far and new ones. Hopefully
this will get us closer to a viable implementation we can agree on.

Missing Promisor Objects in Partial Repo Design Doc
===================================================

Basic Reproduction Steps
------------------------

 - Partial clone repository
 - Create local commit and push
 - Fetch new changes
 - Garbage collection

State After Reproduction
------------------------

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2b ---- T2b -- B2b (created locally, in non-promisor pack)
  |
  C2a ---- T2a -- B2a (created locally, in non-promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Explanation of the Problem
--------------------------

In a partial clone repository, non-promisor commits are locally
committed as children of promisor commits and then pushed up to the
server. Fetches of new history can result in promisor commits that have
non-promisor commits as ancestors. During garbage collection, objects
are repacked in 2 steps. In the first step, if there is more than one
promisor packfile, all objects in promisor packfiles are repacked into a
single promisor packfile. In the second step, a revision walk is made
from all refs (and some other things like HEAD and reflog entries) that
stops whenever it encounters a promisor object. In the example above, if
a ref pointed directly to C2a, it would be returned by the walk (as an
object to be packed). But if we only had a ref pointing to C3, the
revision walk immediately sees that it is a promisor object, does not
return it, and does not iterate through its parents.

(C2b is a bit of a special case. Despite not being in a promisor pack,
it is still considered to be a promisor object since C3 directly
references it.)

If we think this is a bad state, we should propagate the “promisor-ness”
of C3 to its ancestors. Git commands should either prevent this state
from occurring or tolerate it and fix it when we can. If we did run into
this state unexpectedly, then it would be considered a BUG.

If we think it is a valid state, we should NOT propagate the
“promisor-ness” of C3 to its ancestors. Git commands should respect that
this is a possible state and be able to work around it. Therefore, this
bug would then be strictly caused by garbage collection.


Bad State Solutions
===================

Fetch negotiation
-----------------
Implemented at
https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/

During fetch negotiation, if a commit is not in a promisor pack (and
is therefore local), do not declare it as a "have", so that it can be
fetched into a promisor pack.
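As a rough sketch of the idea (Python; in real git this filtering
would happen while advertising commits during the fetch protocol's
negotiation phase):

```python
# Withhold "have" lines for commits that are not in any promisor pack,
# so the server re-sends them and they land in a promisor pack locally.
promisor_pack_commits = {"C1", "C3"}     # toy data
negotiation_tips = ["C3", "C2"]          # commits we would advertise

haves = [c for c in negotiation_tips if c in promisor_pack_commits]
print(haves)  # ['C3'] -- C2 is withheld and will be refetched
```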

Cost:
- Creation of set of promisor pack objects (by iterating through every
  .idx of promisor packs)
- Refetch number of local commits

Pros: Implementation is simple, client doesn’t have to repack, prevents
state from ever occurring in the repository.

Cons: Network cost of refetching could be high if many local commits
need to be refetched.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, refetched into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Fetch repack
------------
Not yet implemented.

Enumerate the objects in the freshly fetched promisor packs, checking
every outgoing link to see if they reference a non-promisor object that
we have, to get a list of tips where local objects are parents of
promisor objects ("bad history"). After collecting these "tips of bad
history", you then start another traversal from them until you hit an
object in a promisor pack and stop traversal there. You have
successfully enumerated the local objects to be repacked into a promisor
pack.
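The two traversals above can be sketched like this (Python toy model
of the walk, using the object graph from the diagrams in this thread):

```python
# "Fetch repack" sketch: from the freshly fetched promisor objects,
# find outgoing links to local objects ("tips of bad history"), then
# walk from those tips, stopping at promisor objects, to collect the
# local objects that should be repacked into a promisor pack.
links = {                     # object -> objects it references
    "C3": ["T3", "C2"], "T3": ["B3"],
    "C2": ["T2", "C1"], "T2": ["B2"],
    "C1": ["T1"], "T1": ["B1"],
    "B1": [], "B2": [], "B3": [],
}
promisor_objects = {"C3", "T3", "B3", "C1", "T1", "B1"}
just_fetched = {"C3", "T3", "B3"}

# Step 1: tips of bad history = non-promisor objects directly
# referenced by the newly fetched promisor objects.
tips = {ref for obj in just_fetched for ref in links[obj]
        if ref not in promisor_objects}

# Step 2: walk from the tips, stopping whenever a promisor object is hit.
to_repack, stack = set(), list(tips)
while stack:
    obj = stack.pop()
    if obj in promisor_objects or obj in to_repack:
        continue
    to_repack.add(obj)
    stack.extend(links[obj])

print(sorted(to_repack))  # ['B2', 'C2', 'T2']
```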

Cost:
- Traversal through newly fetched promisor trees and commits
- Creation of set of promisor pack objects (for tips of bad history
  traversal to stop at a promisor object)
- Traversal through all local commits and check existence in promisor
  pack set
- Repack all pushed local commits

Pros: Prevents state from ever occurring in the repository, no network
cost.

Cons: Additional cost of repacking is incurred during fetch, more
complex implementation.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Garbage Collection repack
-------------------------
Not yet implemented.

Same concept as “fetch repack”, but happens during garbage collection
instead. The traversal is more expensive since we no longer have access
to what was recently fetched so we have to traverse through all promisor
packs to collect tips of “bad” history.

Cost:
- Creation of set of promisor pack objects
- Traversal through all promisor commits
- Traversal through all local commits and check existence in promisor
  object set
- Repack all pushed local commits

Pros: Can be run in the background as part of maintenance, no network
cost.

Cons: More expensive than “fetch repack”, state isn’t fixed until
garbage collection, more complex implementation.

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Garbage Collection repack all
-----------------------------
Implemented at
https://lore.kernel.org/git/20240925072021.77078-1-hanyang.tony@bytedance.com/ 

Repack all local commits into promisor packs during garbage collection.

Both of the following scenarios are valid:
commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

commit  tree  blob
  C3 ---- T3 -- B3 (created locally, packed into promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, packed into promisor pack)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost:
- Repack all local commits

Pros: Can be run in the background as part of maintenance, no network
cost, less complex implementation, and less expensive than “garbage
collection repack”.

Cons: Packing local objects into promisor packs means that it is no
longer possible to detect if an object is missing due to repository
corruption or because we need to fetch it from a promisor remote.
Packing local objects into promisor packs means that garbage collection
will no longer remove unreachable local objects.

Valid State Solutions
=====================
Garbage Collection check
------------------------
Not yet implemented.

Currently during the garbage collection rev walk, whenever a promisor
commit is reached, it is marked UNINTERESTING, and then subsequently all
ancestors of the promisor commit are traversed and also marked
UNINTERESTING. Therefore, add a check for whether a commit is local or
not during promisor commit ancestor traversal and do not mark local
commits as UNINTERESTING.
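The proposed check can be sketched as follows (Python model of the
walk; in git the check would live in the ancestor traversal in
revision.c, and trees/blobs would need the same treatment):

```python
# "Garbage collection check" sketch: when propagating UNINTERESTING
# from a promisor commit, stop at commits that are local (not in any
# promisor pack) so gc does not delete them.
parents = {"C3": ["C2"], "C2": ["C1"], "C1": []}
in_promisor_pack = {"C1", "C3"}

def mark_uninteresting_checked(commit, uninteresting):
    if commit in uninteresting:
        return
    uninteresting.add(commit)
    for parent in parents[commit]:
        if parent in in_promisor_pack:   # the added locality check
            mark_uninteresting_checked(parent, uninteresting)

uninteresting = set()
for commit in in_promisor_pack:
    mark_uninteresting_checked(commit, uninteresting)

reachable = {"C3", "C2", "C1"}
packed = reachable - uninteresting
print(sorted(packed))  # ['C2'] -- the local commit survives gc
```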

commit  tree  blob
  C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
  |
  C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
  |
  C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

Cost:
- Adds an additional check to every ancestor of a promisor commit.

This is practically the only solution if the state is valid. Fsck
would also have to start checking the validity of ancestors of
promisor commits instead of ignoring them as it currently does.

Optimizations
=============

The “creation of set of promisor pack objects” can be replaced with
“creation of set of non-promisor objects” since the latter is almost
always cheaper and we can check for non-existence rather than existence.
This does not work for “fetch negotiation” since if we have a commit
that's in both a promisor pack and a non-promisor pack, the algorithm's
correctness relies on the fact that we report it as a promisor object
(because we really need the server to re-send it).
Junio C Hamano Oct. 1, 2024, 7:35 p.m. UTC | #6
Calvin Wan <calvinwan@google.com> writes:

> It seems that we're at a standstill for the various possible designs
> that can solve this problem, so I decided to write up a design document
> to discuss the ideas we've come up with so far and new ones. Hopefully
> this will get us closer to a viable implementation we can agree on.

Thanks for writing this up.  Very much appreciated.

I'll hold my thoughts before others have chance to speak up, though.

Thanks.
Junio C Hamano Oct. 2, 2024, 2:54 a.m. UTC | #7
> Missing Promisor Objects in Partial Repo Design Doc
> ===================================================
>
> Basic Reproduction Steps
> ------------------------
>
>  - Partial clone repository
>  - Create local commit and push
>  - Fetch new changes
>  - Garbage collection
>
> State After Reproduction
> ------------------------
>
> commit  tree  blob
>   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>   |
>   C2b ---- T2b -- B2b (created locally, in non-promisor pack)
>   |
>   C2a ---- T2a -- B2a (created locally, in non-promisor pack)
>   |
>   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> Explanation of the Problem
> --------------------------
>
> In a partial clone repository, non-promisor commits are locally
> committed as children of promisor commits and then pushed up to the
> server. Fetches of new history can result in promisor commits that have
> non-promisor commits as ancestors. During garbage collection, objects
> are repacked in 2 steps. In the first step, if there is more than one
> promisor packfile, all objects in promisor packfiles are repacked into a
> single promisor packfile. In the second step, a revision walk is made
> from all refs (and some other things like HEAD and reflog entries) that
> stops whenever it encounters a promisor object. In the example above, if
> a ref pointed directly to C2a, it would be returned by the walk (as an
> object to be packed). But if we only had a ref pointing to C3, the
> revision walk immediately sees that it is a promisor object, does not
> return it, and does not iterate through its parents.

True.  Will it become even worse, if a protocol extension Christian
proposes starts suggesting a repository that is not lazy to add a
promisor remote?  In such a set-up, perhaps all history leading to
C2b down to the root are local, but C3 may have come from a promisor
remote (hence in a promisor pack).

> (C2b is a bit of a special case. Despite not being in a promisor pack,
> it is still considered to be a promisor object since C3 directly
> references it.)

Yes, and I suspect the root cause of this confusion is because
"promisor object", as defined today, is a flawed concept.  If C2b
were pointed by a local ref, just like the case the ref points at
C2a, they should be treated the same way, as both of them are
locally created.  To put it another way, presumably the local have
already been pushed out to elsewhere and the promisor remote got
hold of them, and that is why C3 can build on top of them.  And the
fact C2b is directly reachable from C3 and C2a is not should not
have any relevance if C2a or C2b are not _included_ in promisor
packs (hence both of them need to be included in the local pack).

Two concepts that would have been useful are (1) objects that are in
promisor packs and (2) objects that are reachable from an object
that is in a promisor pack.  I do not see how the current definition
of "promisor objects" (i.e. in a promisor pack, or one hop from an
object in a promisor pack) is useful in any context.

> If we think this is a bad state, we should propagate the “promisor-ness”
> of C3 to its ancestors. Git commands should either prevent this state
> from occurring or tolerate it and fix it when we can. If we did run into
> this state unexpectedly, then it would be considered a BUG.

Yup, that is the basis of the solutions we saw proposed so far.

> If we think it is a valid state, we should NOT propagate the
> “promisor-ness” of C3 to its ancestors. Git commands should respect that
> this is a possible state and be able to work around it. Therefore, this
> bug would then be strictly caused by garbage collection

Yes, that is possibly an alternative.

> Bad State Solutions
> ===================
>
> Fetch negotiation
> -----------------
> Implemented at
> https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
>
> During fetch negotiation, if a commit is not in a promisor pack (and
> is therefore local), do not declare it as a "have", so that it can be
> fetched into a promisor pack.
>
> Cost:
> - Creation of set of promisor pack objects (by iterating through every
>   .idx of promisor packs)

What is "promisor PACK objects"?  Is it different from the "promisor
objects" (i.e. what I called the useless definition above)?

> - Refetch number of local commits
>
> Pros: Implementation is simple, client doesn’t have to repack, prevents
> state from ever occurring in the repository.
>
> Cons: Network cost of refetching could be high if many local commits
> need to be refetched.

What if we get into the same state by creating local C4, which gets
to outside and on top of which C5 is built, which is now sitting at
the tip of the remote history and we fetch from them?  In order to
include C4 in the "promisor pack", we refrain from saying C4 is a
"have" for us and refetch.  Would C2 be fetched again?

I do not think C2 would be, because we made it an object in a
promisor pack when we "fixed" the history for C3.

So the cost will not grow proportionally to the depth of the
history, which makes it OK from my point of view.

> Garbage Collection repack
> -------------------------
> Not yet implemented.
>
> Same concept as “fetch repack”, but happens during garbage collection
> instead. The traversal is more expensive since we no longer have access
> to what was recently fetched so we have to traverse through all promisor
> packs to collect tips of “bad” history.

In other words, with the status quo, "git gc" that attempts to
repack "objects in promisor packs" and "other objects that did not
get repacked in the step that repack objects in promisor packs"
separately, it implements the latter in a buggy way and discards
some objects.  And fixing that bug by doing the right thing is
expensive.

Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
"git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
C3 is known to the remote and the remote promises everything
reachable from what they offer are (re)fetchable from them?  IOW, is
this a correctness issue, or only performance issue (of having to
re-fetch what we once locally had)?

> Cons: Packing local objects into promisor packs means that it is no
> longer possible to detect if an object is missing due to repository
> corruption or because we need to fetch it from a promisor remote.

Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
the current version of "git gc" removes them from the local object
store, that they are missing due to repository corruption?  After
all, C3 can reach them so wouldn't it be possible for us to fetch
them from the promisor remote?

After a lazy clone that omits a lot of objects acquires many objects
over time by fetching missing objects on demand, wouldn't we want to
have an option to "slim" the local repository by discarding some of
these objects (the ones that are least frequently used), relying on
the promise by the promisor remote that even if we did so, they can
be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
feature prematurely kicked in?  Or are we failing to refetch them
for some reason?

> Packing local objects into promisor packs means that garbage collection
> will no longer remove unreachable local objects.
>
> Valid State Solutions
> =====================
> Garbage Collection check
> ------------------------
> Not yet implemented.
>
> Currently during the garbage collection rev walk, whenever a promisor
> commit is reached, it is marked UNINTERESTING, and then subsequently all
> ancestors of the promisor commit are traversed and also marked
> UNINTERESTING. Therefore, add a check for whether a commit is local or
> not during promisor commit ancestor traversal and do not mark local
> commits as UNINTERESTING.
>
> commit  tree  blob
>   C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
>   |
>   C2 ---- T2 -- B2 (created locally, in non-promisor pack, gc does not delete)
>   |
>   C1 ---- T1 -- B1 (fetched from remote, in promisor pack)
>
> Cost:
> - Adds an additional check to every ancestor of a promisor commit.
>
> This is practically the only solution if the state is valid. Fsck would
> also have to start checking for validity of ancestors of promisor
> commits instead of ignoring them as it currently does.

In the longer term, this looks like the most straight-forward and
easy to explain solution to me.

> Optimizations
> =============
>
> The “creation of set of promisor pack objects” can be replaced with
> “creation of set of non-promisor objects” since the latter is almost
> always cheaper and we can check for non-existence rather than existence.
> This does not work for “fetch negotiation” since if we have a commit
> that's in both a promisor pack and a non-promisor pack, the algorithm's
> correctness relies on the fact that we report it as a promisor object
> (because we really need the server to re-send it).
Han Young Oct. 2, 2024, 7:57 a.m. UTC | #8
On Wed, Oct 2, 2024 at 10:55 AM Junio C Hamano <gitster@pobox.com> wrote:

> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them?  IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?
>
> Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption?  After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
>
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in?  Or are we failing to refetch them
> for some reason?

In a blobless clone, we expect commits and trees to be present in the repo.
If C2/T2 is missing, commands like "git merge" will complain
"cannot merge unrelated history" and fail. Commands like "git log" will
try to lazily fetch the missing commit, but without 'have' negotiation
they end up pulling all the trees and blobs reachable from that commit.

It's possible to minimize the impact of missing commits by adding
negotiation to lazy fetching, but we would probably need to adapt code
in many places where we don't currently do lazy fetching: "git log",
"git merge", the commit graph, etc. It's no trivial amount of work.
Calvin Wan Oct. 8, 2024, 9:35 p.m. UTC | #9
On Tue, Oct 1, 2024 at 7:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> True.  Will it become even worse, if a protocol extension Christian
> proposes starts suggesting a repository that is not lazy to add a
> promisor remote?  In such a set-up, perhaps all history leading to
> C2b down to the root are local, but C3 may have come from a promisor
> remote (hence in a promisor pack).

Yes, if we (and consequently Git) consider this state to be problematic.

> > Bad State Solutions
> > ===================
> >
> > Fetch negotiation
> > -----------------
> > Implemented at
> > https://lore.kernel.org/git/20240919234741.1317946-1-calvinwan@google.com/
> >
> > During fetch negotiation, if a commit is not in a promisor pack (and
> > is therefore local), do not declare it as a "have", so that it can be
> > fetched into a promisor pack.
> >
> > Cost:
> > - Creation of set of promisor pack objects (by iterating through every
> >   .idx of promisor packs)
>
> What is "promisor PACK objects"?  Is it different from the "promisor
> objects" (i.e. what I called the useless definition above)?

Objects that are in promisor packs, specifically objects in packs that
have the packed_git::pack_promisor flag set. However, since this design
doc was sent out, it turns out that the creation of a set of promisor
pack objects in a large repository (such as Android or Chrome) is very
expensive, so this design is infeasible in my opinion.
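The cost Calvin describes can be sketched with a toy Python model (the Pack class and object ids here are hypothetical stand-ins, not git's on-disk .idx format): building the set means a linear scan over every object listed in every promisor pack's index.

```python
# Rough model of "build the set of promisor pack objects" by unioning
# every promisor pack's index. Simplified stand-in structures only.

class Pack:
    def __init__(self, oids, promisor):
        self.oids = oids               # object ids listed in the .idx
        self.pack_promisor = promisor  # True if a .promisor file exists

def promisor_pack_objects(packs):
    """O(total objects across promisor packs) -- it is this full scan
    over every .idx that becomes prohibitive in very large repos."""
    objs = set()
    for p in packs:
        if p.pack_promisor:
            objs.update(p.oids)
    return objs

packs = [
    Pack({"C1", "T1", "C3"}, promisor=True),
    Pack({"C2", "T2"}, promisor=False),  # locally created objects
]
assert promisor_pack_objects(packs) == {"C1", "T1", "C3"}
assert "C2" not in promisor_pack_objects(packs)
```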

>
> > - Refetch number of local commits
> >
> > Pros: Implementation is simple, client doesn’t have to repack, prevents
> > state from ever occurring in the repository.
> >
> > Cons: Network cost of refetching could be high if many local commits
> > need to be refetched.
>
> What if we get into the same state by creating local C4, which gets
> to outside and on top of which C5 is built, which is now sitting at
> the tip of the remote history and we fetch from them?  In order to
> include C4 in the "promisor pack", we refrain from saying C4 is a
> "have" for us and refetch.  Would C2 be fetched again?
>
> I do not think C2 would be, because we made it an object in a
> promisor pack when we "fixed" the history for C3.
>
> So the cost will not grow proportionally to the depth of the
> history, which makes it OK from my point of view.

Correct, the cost of refetching is only a one-time cost, but
unfortunately the creation of a set of promisor pack objects isn't.

>
> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept at “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
>
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects.  And fixing that bug by doing the right thing is
> expensive.
>
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them?  IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?

My first thought is that from both the user and developer perspective,
we don't expect our reachable objects to be gc'ed. So all of the "bad
state" solutions work to ensure that that isn't the case in some shape
or form. However, if it turns out that all of these solutions are much
more expensive and disruptive to the user than accepting that local
objects can be gc'ed and JIT refetched, then the latter seems much more
palatable. It is inevitable that we take some performance hit to fix this
problem, and we may just have to accept this as one of the costs of
having partial clones to begin with.

>
> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
>
> Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption?  After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?

I should be more clear that "detecting if an object is missing due to
repository corruption" refers to fsck currently not having the
functionality to do that. We are "accidentally" discovering the
corruption when we try to access the missing object, but we can
still fetch them from the promisor remote afterwards.

> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in?  Or are we failing to refetch them
> for some reason?

Yes, if such a feature existed, it would be feasible and a possible
solution for this issue (I'm leaning quite heavily towards this now
after testing out some of the other designs).
Han Young Oct. 9, 2024, 6:46 a.m. UTC | #10
On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@google.com> wrote:

> Objects that are in promisor packs, specifically the ones that have the
> flag, packed_git::pack_promisor, set. However, since this design doc
> was sent out, it turns out the creation of a set of promisor pack objects
> in a large repository (such as Android or Chrome) is very expensive, so
> this design is infeasible in my opinion.

I wonder if a set of local loose/pack objects would be cheaper to construct?
Normally loose objects are always non-promisor objects, unless the user
runs something like `unpack-objects`.

> > After a lazy clone that omits a lot of objects acquires many objects
> > over time by fetching missing objects on demand, wouldn't we want to
> > have an option to "slim" the local repository by discarding some of
> > these objects (the ones that are least frequently used), relying on
> > the promise by the promisor remote that even if we did so, they can
> > be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> > feature prematurely kicked in?  Or are we failing to refetch them
> > for some reason?
>
> Yes if such a feature existed, then it would be feasible and a possible
> solution for this issue (I'm leaning quite towards this now after testing
> out some of the other designs).

Since no partial clone filter omits commit objects, we always assume
commits are available in the codebase. `merge` reports "cannot merge
unrelated history" if one of the commits is missing, instead of trying to
fetch it.
Another problem is that the current lazy fetching code does not report
"haves" to the remote, so a lazy fetch of a commit ends up pulling all
the trees and blobs associated with that commit.
I also prefer the "fetch the missing objects" approach; making sure
the repo has all the "correct" objects is difficult to get right.
Jonathan Tan Oct. 9, 2024, 6:34 p.m. UTC | #11
Han Young <hanyang.tony@bytedance.com> writes:
> On Wed, Oct 9, 2024 at 5:36 AM Calvin Wan <calvinwan@google.com> wrote:
> 
> > Objects that are in promisor packs, specifically the ones that have the
> > flag, packed_git::pack_promisor, set. However, since this design doc
> > was sent out, it turns out the creation of a set of promisor pack objects
> > in a large repository (such as Android or Chrome) is very expensive, so
> > this design is infeasible in my opinion.
> 
> I wonder if a set of local loose/pack objects will be cheaper to construct?
> Normally loose objects are always non-promisor objects, unless the user
> running something like `unpack-objects`.

We had a similar idea at $JOB. Note that you don't actually
need to create the set - when looking up an object using
oid_object_info_extended(), we know if it's a loose object and if not,
which pack it is in. The pack has a promisor bit that we can check.

Note that there is a possibility of a false positive. If the same object
is in two packs - one promisor and one non-promisor - I believe there's
no guarantee that one pack will be preferred. So we can see that the
object is in a non-promisor pack, but there's no guarantee that it's not
also in a promisor pack. For the omit-local-commits-in-"have" solution,
this is a fatal flaw (we absolutely must guarantee that we don't send
any promisor commits) but for the repack-on-fetch solution, this is no
big deal (we are looking for objects to repack into a promisor pack, and
repacking a promisor object into a promisor pack is perfectly fine). For
this reason, I think the repack-on-fetch solution is the most promising
one so far.

Loose objects are always non-promisor objects, yes. (I don't think the
user running `unpack-objects` counts - the user running a command on a
packfile in the .git directory is out of scope, I think.)
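The false positive Jonathan describes can be made concrete with a toy Python model (the classes and lookup function are illustrative stand-ins for oid_object_info_extended() and packed_git, not git's actual API):

```python
# Toy model of the per-lookup "best effort" promisor check. The object
# lookup returns *some* pack containing the object, with no guarantee
# which one when the object is duplicated across packs.

class Pack:
    def __init__(self, name, oids, promisor):
        self.name, self.oids, self.pack_promisor = name, oids, promisor

def find_pack(oid, packs):
    """Stand-in for oid_object_info_extended()'s pack lookup; the
    iteration order models "no preference" between duplicate packs."""
    for p in packs:
        if oid in p.oids:
            return p
    return None

def looks_local(oid, packs):
    """Best-effort check: object found in a non-promisor pack. Can be
    a false positive if the object is ALSO in a promisor pack."""
    p = find_pack(oid, packs)
    return p is not None and not p.pack_promisor

packs = [
    Pack("local", {"X"}, promisor=False),
    Pack("promisor", {"X", "C3"}, promisor=True),
]
# X lives in both packs; the lookup happens to hit the local pack
# first, so the check calls X local even though it is also a promisor
# object. Harmless when deciding what to repack into a promisor pack,
# fatal for the omit-local-commits-in-"have" idea.
assert looks_local("X", packs)
assert not looks_local("C3", packs)
```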

> > > After a lazy clone that omits a lot of objects acquires many objects
> > > over time by fetching missing objects on demand, wouldn't we want to
> > > have an option to "slim" the local repository by discarding some of
> > > these objects (the ones that are least frequently used), relying on
> > > the promise by the promisor remote that even if we did so, they can
> > > be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> > > feature prematurely kicked in?  Or are we failing to refetch them
> > > for some reason?
> >
> > Yes if such a feature existed, then it would be feasible and a possible
> > solution for this issue (I'm leaning quite towards this now after testing
> > out some of the other designs).
> 
> Since no partial clone filter omits commit objects, we always assume
> commits are available in the codebase. `merge` reports "cannot merge
> unrelated history" if one of the commits is missing, instead of trying to
> fetch it.
> Another problem is current lazy fetching code does not report "haves"
> to remote, so a lazy fetching of commit ended up pulling all the trees,
> blobs associated with that commit.
> I also prefer the "fetching the missing objects" approach, making sure
> the repo has all the "correct" objects is difficult to get right.

If I remember correctly, our intention (or, at least, my intention)
of not treating missing commits differently was so that we don't limit
the solutions that we can implement. For example, we had the idea of
server-assisted merge base computation - this and other features would
make it feasible to omit commits locally. It has not been implemented,
though.
Jonathan Tan Oct. 9, 2024, 6:53 p.m. UTC | #12
Junio C Hamano <gitster@pobox.com> writes:
> > (C2b is a bit of a special case. Despite not being in a promisor pack,
> > it is still considered to be a promisor object since C3 directly
> > references it.)
> 
> Yes, and I suspect the root cause of this confusion is because
> "promisor object", as defined today, is a flawed concept.  If C2b
> were pointed by a local ref, just like the case the ref points at
> C2a, they should be treated the same way, as both of them are
> locally created.  To put it another way, presumably the local have
> already been pushed out to elsewhere and the promisor remote got
> hold of them, and that is why C3 can build on top of them.  And the
> fact C2b is directly reachable from C3 and C2a is not should not
> have any relevance if C2a or C2b are not _included_ in promisor
> packs (hence both of them need to be included in the local pack).
> 
> Two concepts that would have been useful are (1) objects that are in
> promisor packs and (2) objects that are reachable from an object
> that is in a promisor pack.  I do not see how the current definition
> of "promisor objects" (i.e. in a promisor pack, or one hop from an
> object in a promisor pack) is useful in any context.

The one-hop part in the current definition is meant to (a) explain what
objects the client knows the remote has (in theory the client has no
knowledge of objects beyond the first hop, but we now know this theory
to not be true) and (b) explain what objects a non-promisor object can
reference (in particular, a non-promisor tree can reference promisor
blobs, even when our knowledge of that promisor blob only comes from a
tree in a promisor pack).

If we think that a promisor commit being a child of a non-promisor
commit as a "bad state" that needs to be fixed [1], then the one-hop
current definition seems to be equivalent to (2).

As for (1), we do use that concept in Git, although it's limited to the
repack during GC (or maybe there are others that I don't recall), so the
concept doesn't have a widely-used name like "promisor object".

[1] https://lore.kernel.org/git/20241001191811.1934900-1-calvinwan@google.com/
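The competing definitions in this subthread can be compared on a toy object graph (names and graph invented for illustration): (1) objects in promisor packs, the current one-hop "promisor object" definition, and (2) everything reachable from an object in a promisor pack.

```python
# Sketch of the promisor-object definitions under discussion,
# on a made-up object graph. Not git code.

REFS = {                 # obj -> directly referenced objects
    "C3": ["C2b", "T3"],
    "C2b": ["C2"], "C2": ["C1"], "C1": [],
    "T3": [],
}
IN_PROMISOR_PACK = {"C3", "T3", "C1"}  # (1) objects in promisor packs

def one_hop(refs, in_pack):
    """Current definition: in a promisor pack, or directly referenced
    by an object in a promisor pack."""
    out = set(in_pack)
    for obj in in_pack:
        out.update(refs[obj])
    return out

def reachable_from_pack(refs, in_pack):
    """(2): everything reachable from an object in a promisor pack."""
    seen, stack = set(), list(in_pack)
    while stack:
        o = stack.pop()
        if o in seen:
            continue
        seen.add(o)
        stack.extend(refs[o])
    return seen

# C2b is a promisor object only because C3 directly references it...
assert one_hop(REFS, IN_PROMISOR_PACK) == {"C3", "T3", "C1", "C2b"}
# ...while definition (2) additionally pulls in C2. Once every promisor
# commit's ancestors are themselves promisor objects (the "bad state"
# is fixed), the two definitions coincide.
assert reachable_from_pack(REFS, IN_PROMISOR_PACK) == {
    "C3", "C2b", "C2", "C1", "T3"}
```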

> > Garbage Collection repack
> > -------------------------
> > Not yet implemented.
> >
> > Same concept at “fetch repack”, but happens during garbage collection
> > instead. The traversal is more expensive since we no longer have access
> > to what was recently fetched so we have to traverse through all promisor
> > packs to collect tips of “bad” history.
> 
> In other words, with the status quo, "git gc" that attempts to
> repack "objects in promisor packs" and "other objects that did not
> get repacked in the step that repack objects in promisor packs"
> separately, it implements the latter in a buggy way and discards
> some objects.  And fixing that bug by doing the right thing is
> expensive.
> 
> Stepping back a bit, why is the loss of C2a/C2b/C2 a problem after
> "git gc"?  Wouldn't these "missing" objects be lazily fetchable, now
> C3 is known to the remote and the remote promises everything
> reachable from what they offer are (re)fetchable from them?  IOW, is
> this a correctness issue, or only performance issue (of having to
> re-fetch what we once locally had)?

I believe the re-fetch didn't happen because it was run from a command
with fetch_if_missing=0. (But even if we decide that we shouldn't use
fetch_if_missing, and then change all commands to not use it, there
still remains the performance issue, so we should still fix it.)

> > Cons: Packing local objects into promisor packs means that it is no
> > longer possible to detect if an object is missing due to repository
> > corruption or because we need to fetch it from a promisor remote.
> 
> Is this true?  Can we tell, when trying to access C2a/C2b/C2 after
> the current version of "git gc" removes them from the local object
> store, that they are missing due to repository corruption?  After
> all, C3 can reach them so wouldn't it be possible for us to fetch
> them from the promisor remote?
> 
> After a lazy clone that omits a lot of objects acquires many objects
> over time by fetching missing objects on demand, wouldn't we want to
> have an option to "slim" the local repository by discarding some of
> these objects (the ones that are least frequently used), relying on
> the promise by the promisor remote that even if we did so, they can
> be fetched again?  Can we treat loss of C2a/C2b/C2 as if such a
> feature prematurely kicked in?  Or are we failing to refetch them
> for some reason?

This is under the "repack all" option, which states that we repack all
objects (wherever they came from) into promisor packs. If we locally
created commit A and then its child commit B, and the repo got corrupted
so that we lost A, repacking all objects would mean that we could never
detect that the loss of A is problematic.
Jonathan Tan Oct. 12, 2024, 2:05 a.m. UTC | #13
Jonathan Tan <jonathantanmy@google.com> writes:
> For
> this reason, I think the repack-on-fetch solution is the most promising
> one so far.

I had time to take a closer look at this solution. One problem that
I've noticed is that the "best effort" promisor object check cannot
naively replace is_promisor_object(), because a lot of the time (e.g.
in revision.c's get_reference()) is_promisor_object() is used when an
object is missing to check whether we need to error out or not. Our
"best effort" promisor object check cannot replace this because it needs
us to have looked up the object in the first place to check whether it's
loose or packed (and if packed, which packfile it's in), so it can't
work with an object that's missing.

So I think we'll need to use do_not_die_on_missing_objects. It does have
the weakness that if the object is not supposed to be missing, we don't
inform the user, but perhaps this is OK here because we know that all
objects we encounter on this object walk are promisor objects, so if
it's missing, it's OK.

In addition to do_not_die_on_missing_objects, we'll also need the actual
code that stops iteration through objects that pass our "best effort"
promisor object check. Probably the best place is in get_revision_1()
after the NULL check, but I haven't fully thought through what happens
if this option is used when some commits are UNINTERESTING. (For the
repack-on-fetch, no commits are UNINTERESTING, but it's probably best
to make sure our feature is as useful in as many cases as possible,
especially since we're going to further complicate revision walking
code, which is complicated enough.)
Han Young Oct. 12, 2024, 3:30 a.m. UTC | #14
On Sat, Oct 12, 2024 at 10:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> So I think we'll need to use do_not_die_on_missing_objects. It does have
> the weakness that if the object is not supposed to be missing, we don't
> inform the user, but perhaps this is OK here because we know that all
> objects we encounter on this object walk are promisor objects, so if
> it's missing, it's OK.

And I think users would prefer the git command to succeed if possible,
rather than die on the first (noncritical) error. Maybe show a warning
and swallow the error?

> In addition to do_not_die_on_missing_objects, we'll also need the actual
> code that stops iteration through objects that pass our "best effort"
> promisor object check. Probably the best place is in get_revision_1()
> after the NULL check

get_revision_1() only does commit limiting though. Some callers of rev-list
also do tree walking on commits, and in a (corrupted) partial repo, trees
could also be missing. There isn't a central place where we can stop tree
walking; callers using this feature would have to implement "tree walking
early termination" themselves.
Jonathan Tan Oct. 14, 2024, 5:52 p.m. UTC | #15
Han Young <hanyang.tony@bytedance.com> writes:
> On Sat, Oct 12, 2024 at 10:05 AM Jonathan Tan <jonathantanmy@google.com> wrote:
> > So I think we'll need to use do_not_die_on_missing_objects. It does have
> > the weakness that if the object is not supposed to be missing, we don't
> > inform the user, but perhaps this is OK here because we know that all
> > objects we encounter on this object walk are promisor objects, so if
> > it's missing, it's OK.
> 
> And I think users would prefer the git command to succeed if possible,
> rather than die on the first (noncritical) error. Maybe show a warning
> and swallow the error?

Just to be clear, this is not an error condition so we shouldn't show an
error or warning. Whenever we repack non-promisor objects in a partial
clone we will almost always encounter missing objects. In the gc repack,
this is mitigated by --exclude-promisor-objects, which checks the
promisor object set whenever a missing object is encountered; it does
not show an error if the missing object is in that set.

My proposal is to use do_not_die_on_missing_objects instead, which
always tolerates missing objects (without showing any error or warning).
This is probably not safe enough for the gc repack, but should be OK
for the fetch repack, since we are only repacking ancestors of known
promisor objects (so we can deduce that the missing objects are promisor
objects).
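The two missing-object policies being contrasted here can be sketched in a toy Python walk (the graph, function, and flag names are illustrative; git's actual options are --exclude-promisor-objects and the do_not_die_on_missing_objects flag in revision.c):

```python
# Toy model: die on a missing object unless it is in a known promisor
# set (--exclude-promisor-objects style), vs. silently tolerating any
# missing object (do_not_die_on_missing_objects style). Not git code.

PARENTS = {"C3": ["C2"], "C2": ["C1"], "C1": []}
PRESENT = {"C3", "C1"}          # C2 was gc'ed away

def walk(tip, tolerate_missing, promisor_set=frozenset()):
    out, stack = [], [tip]
    while stack:
        c = stack.pop()
        if c not in PRESENT:
            if tolerate_missing or c in promisor_set:
                continue        # skip quietly, no error
            raise RuntimeError(f"Could not read {c}")
        out.append(c)
        stack.extend(PARENTS[c])
    return out

# Strict mode dies on the missing, unknown commit C2...
try:
    walk("C3", tolerate_missing=False)
    raise AssertionError("expected a failure")
except RuntimeError:
    pass
# ...--exclude-promisor-objects-style tolerates it only when C2 is in
# the (expensively computed) promisor set...
assert walk("C3", tolerate_missing=False, promisor_set={"C2"}) == ["C3"]
# ...and the tolerant mode walks past it unconditionally. Either way,
# C1 is never visited, since it is only reachable through C2.
assert walk("C3", tolerate_missing=True) == ["C3"]
```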

> > In addition to do_not_die_on_missing_objects, we'll also need the actual
> > code that stops iteration through objects that pass our "best effort"
> > promisor object check. Probably the best place is in get_revision_1()
> > after the NULL check
> 
> get_revision_1() only does commit limiting though. Some callers of rev-list
> also do tree walking on commits,

Ah, yes, you're right. The repack on fetch is one such caller (that will
need tree walking).

> in a (corrupted) partial repo, tree could
> also be missing. There isn't a central place we can stop tree walking,
> callers using this feature would have to implement "tree walking early
> termination" themself.

The repo could have been cloned with a tree filter (e.g.
"--filter=tree:0") too, in which case trees would be missing even if the
repo is not corrupted. But even in a non-corrupted --filter=blob:none
partial clone, we still don't want to iterate through promisor trees, so
that we don't repack them unnecessarily. So yes, get_revision_1() is not
the only place that needs to be changed.

I think that there is a central place to stop tree walking - in
list-objects.c.