
[0/3] list-object-filter: introduce depth filter

Message ID pull.1343.git.1662025272.gitgitgadget@gmail.com (mailing list archive)

Message

ZheNing Hu via GitGitGadget Sept. 1, 2022, 9:41 a.m. UTC
This patch series lets a partial clone have capabilities similar to those of
a shallow clone created with git clone --depth=<depth>.

Disadvantages of git clone --depth=<depth> --filter=blob:none: we must call
git fetch --unshallow to lift the shallow clone restriction, which downloads
the entire history of the current commit.

Disadvantages of git clone --filter=blob:none with git sparse-checkout: The
git client needs to send a lot of missing object ids to the server, which can
waste a lot of network traffic.

Now we can use git clone --filter="depth=<depth>" to omit all commits whose
depth is >= <depth>. This way, we get the advantages of both shallow clone
and partial clone: the depth of commits is limited, and other objects are
fetched on demand.
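
For example, roughly (hypothetical repository URL; on the command line the
filter spec is also spelled "depth:<depth>", as in the tests further down
this thread):

$ git clone --filter=depth:1 https://example.com/repo.git repo
$ cd repo
$ git checkout HEAD~   # missing parent commits/trees/blobs are fetched on demand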

Unfinished business for now:

 1. Git fetch has not yet learned the depth filter; if we can solve this
    problem, we may be able to have a better "batch fetch" for some needed
    commits (see [1] and the sketch below).
 2. Sometimes we may want a partial clone to avoid automatically downloading
    missing objects, e.g. when running git log, we might want results similar
    to a shallow clone (without commit grafts).

[1]:
https://lore.kernel.org/git/16633d89-6ccd-859d-8533-9861ad831c45@github.com/
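
As a rough sketch of what such a "batch fetch" could look like once git fetch
learns the filter (purely hypothetical syntax, not implemented by this series):

$ git fetch --filter=depth:100 origin <commit-id>   # hypothetical: batch-fetch the commits needed around <commit-id>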

ZheNing Hu (3):
  commit-graph: let commit graph respect commit graft
  list-object-filter: pass traversal_context in filter_init_fn
  list-object-filter: introduce depth filter

 Documentation/rev-list-options.txt  |   6 ++
 builtin/clone.c                     |  10 ++-
 commit-graph.c                      |  36 +++++++--
 list-objects-filter-options.c       |  30 +++++++
 list-objects-filter-options.h       |   6 ++
 list-objects-filter.c               |  78 ++++++++++++++++++-
 list-objects-filter.h               |   2 +
 list-objects.c                      |  10 +--
 list-objects.h                      |   8 ++
 shallow.c                           |  16 ++++
 shallow.h                           |   2 +
 t/t5616-partial-clone.sh            | 116 ++++++++++++++++++++++++++++
 t/t6112-rev-list-filters-objects.sh |  14 ++++
 upload-pack.c                       |  14 ----
 upload-pack.h                       |  14 ++++
 15 files changed, 330 insertions(+), 32 deletions(-)


base-commit: d42b38dfb5edf1a7fddd9542d722f91038407819
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1343%2Fadlternative%2Fzh%2Ffilter_depth-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1343/adlternative/zh/filter_depth-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1343

Comments

Derrick Stolee Sept. 1, 2022, 7:24 p.m. UTC | #1
On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> This patch series lets a partial clone have capabilities similar to those of
> a shallow clone created with git clone --depth=<depth>.
...
> Now we can use git clone --filter="depth=<depth>" to omit all commits whose
> depth is >= <depth>. This way, we get the advantages of both shallow clone
> and partial clone: the depth of commits is limited, and other objects are
> fetched on demand.

I have several concerns about this proposal.

The first is that "depth=X" doesn't mean anything after the first
clone. What will happen when we fetch the remaining objects?

Partial clone is designed to download a subset of objects, but make
the remaining reachable objects downloadable on demand. By dropping
reachable commits, the normal partial clone mechanism would result
in a 'git rev-list' call asking for a missing commit. Would this
inherit the "depth=X" but result in a huge amount of over-downloading
the trees and blobs in that commit range? Would it result in downloading
commits one-by-one, and then their root trees (and all reachable objects
from those root trees)?

Finally, computing the set of objects to send is just as expensive as
if we had a shallow clone (we can't use bitmaps). However, we get the
additional problem where fetches do not have a shallow boundary, so
the server will send deltas based on objects that are not necessarily
present locally, triggering extra requests to resolve those deltas.

This fallout remains undocumented and unexplored in this series, but I
doubt the investigation would result in positive outcomes.

> Disadvantages of git clone --depth=<depth> --filter=blob:none: we must call
> git fetch --unshallow to lift the shallow clone restriction, which downloads
> the entire history of the current commit.

How does your proposal fix this? Instead of unshallowing, users will
stumble across these objects and trigger huge downloads by accident.
 
> Disadvantages of git clone --filter=blob:none with git sparse-checkout: The
> git client needs to send a lot of missing object ids to the server, which
> can waste a lot of network traffic.

Asking for a list of blobs (especially limited to a sparse-checkout) is
much more efficient than what will happen when a user tries to do almost
anything in a repository formed the way you did here.

Thinking about this idea, I don't think it is viable. I would need to
see a lot of work done to test these scenarios closely to believe that
this type of partial clone is a desirable working state.

Thanks,
-Stolee
Johannes Schindelin Sept. 2, 2022, 1:48 p.m. UTC | #2
Hi ZheNing,

first of all: thank you for working on this. In the past, I thought that
this feature would be likely something we would want to have in Git.

But Stolee's concerns are valid, and made me think about it more. See
below for a more detailed analysis.

On Thu, 1 Sep 2022, Derrick Stolee wrote:

> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
>
> > [...]
> >
> > Disadvantages of git clone --filter=blob:none with git
> > sparse-checkout: The git client needs to send a lot of missing
> > object ids to the server, which can waste a lot of network
> > traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.

I agree. When you have all the commit and tree objects on the local side,
you can enumerate all the blob objects you need in one fell swoop, then
fetch them in a single network round trip.

When you lack tree objects, or worse, commit objects, this is not true.
You may very well need to fetch _quite_ a bunch of objects, then inspect
them to find out that you need to fetch more tree/commit objects, and then
a couple more round trips, before you can enumerate all of the objects you
need.

Concrete example: let's assume that you clone git.git with a "partial
depth" of 50. That is, while cloning, all of the tip commits' graphs will
be traversed up until the commits that are removed by 49 edges in the
commit graph. For example, v0.99~49 will be present locally after cloning,
but not v0.99~50.

Now, the first-parent depth of v0.99 is 955 (verify with `git rev-list
--count --first-parent v0.99`). None of the commits reachable from v0.99
other than the tip itself seem to be closer to any other tag, so all
commits reachable from v0.99~49 will be missing locally. And since reverts
are rare, we must assume that the vast majority of the associated root
tree objects are missing, too.

Digging through history, a contributor might need to investigate where,
say, `t/t4100/t-apply-7.expect` was introduced (it was in v0.99~206)
because they found something looking like a bug and they need to read the
commit message to see whether it was intentional. They know that this file
was already present in v0.99. Naturally, the command-line to investigate
that is:

	git log --diff-filter=A v0.99 -- t/t4100/t-apply-7.expect

So what does Git do in that operation? It traverses the commits starting
from v0.99, following the chain along the commit parents. When it
encounters v0.99~49, it figures out that it has to fetch v0.99~50. To see
whether v0.99~49 introduced that file, it then has to inspect that commit
object and then fetch the tree object (v0.99~50^{tree}). Then, Git
inspects that tree to find out the object ID for v0.99~50^{tree}:t/, sees
that it is identical to v0.99~49^{tree}:t/ and therefore the pathspec
filter skips this commit from the output of the `git log` command. A
couple of parent traversals later (always fetching the parent commit
object individually, then the associated tree object, then figuring out
that `t/` is unchanged) Git will encounter v0.99~55 where `t/` _did_
change. So now it also has to fetch _that_ tree object.
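
(To make that tree comparison concrete: in a full clone one can compare the
two tree IDs with something like

	git rev-parse v0.99~50:t v0.99~49:t

and if they are identical, the pathspec filter drops the commit from the
`git log` output. In the partial clone scenario above, each of those tree
objects first has to be fetched in its own round trip.)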

In total, we are looking at 400+ individual network round trips just to
fetch the required tree/commit objects, i.e. before Git can show you the
output of that `git log` command. And that's just for back-filling the
missing tree/commit objects.

If we had done this using a shallow clone, Git would have stopped at the
shallow boundary, the user would have had a chance to increase the depth
in bigger chunks (probably first extending the depth by 50, then maybe
100, then maybe going for 500) and while it would have been a lot of
manual labor, the total time would be still a lot shorter than those 400+
network round trips (which likely would incur some throttling on the
server side).
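
For reference, that manual deepening would look roughly like:

	git fetch --deepen=50    # extend the shallow history by another 50 commits
	git fetch --deepen=100
	git fetch --depth=500    # or jump to a larger absolute depth

Each of those is a single (admittedly expensive) request rather than hundreds
of tiny ones.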

> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.

Indeed, it is hard to think of a way how the design could result in
anything but undesirable behavior, both on the client and the server side.

We also have to consider that our experience with large repositories
demonstrates that tree and commit objects delta pretty well and are
virtually never a concern when cloning. It is always the sheer amount of
blob objects that is causing poor user experience when performing
non-partial clones of large repositories.

Now, I can be totally wrong in my expectation that there is _no_ scenario
where cloning with a "partial depth" would cause anything but poor
performance. If I am wrong, then there is value in having this feature,
but since it causes undesirable performance in all cases I can think of,
it definitely should be guarded behind an opt-in flag.

Ciao,
Dscho
ZheNing Hu Sept. 4, 2022, 7:27 a.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> 于2022年9月2日周五 03:24写道:
>
> On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> > This patch series lets a partial clone have capabilities similar to those of
> > a shallow clone created with git clone --depth=<depth>.
> ...
> > Now we can use git clone --filter="depth=<depth>" to omit all commits whose
> > depth is >= <depth>. This way, we get the advantages of both shallow clone
> > and partial clone: the depth of commits is limited, and other objects are
> > fetched on demand.
>
> I have several concerns about this proposal.
>
> The first is that "depth=X" doesn't mean anything after the first
> clone. What will happen when we fetch the remaining objects?
>

According to the current results, yes, it still downloads a large number
of commits.

Let's do a little test again:

$ git clone --filter=depth:2 git.git git
Cloning into 'git'...
remote: Enumerating objects: 4311, done.
remote: Counting objects: 100% (4311/4311), done.
remote: Compressing objects: 100% (3788/3788), done.

Let's see how many objects we have...
$ git cat-file --batch-check --batch-all-objects | grep blob | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    4098
$ git cat-file --batch-check --batch-all-objects | grep tree | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
     211
$ git cat-file --batch-check --batch-all-objects | grep commit | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
       2

$ git checkout HEAD~

This fetches nothing... because depth=2.

$  git checkout HEAD~
remote: Enumerating objects: 198514, done.
remote: Counting objects: 100% (198514/198514), done.
remote: Compressing objects: 100% (68511/68511), done.
remote: Total 198514 (delta 128408), reused 198509 (delta 128406), pack-reused 0
Receiving objects: 100% (198514/198514), 77.07 MiB | 9.58 MiB/s, done.
Resolving deltas: 100% (128408/128408), done.
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (1/1), 14.35 KiB | 14.35 MiB/s, done.
remote: Enumerating objects: 198014, done.
remote: Counting objects: 100% (198014/198014), done.
remote: Compressing objects: 100% (68362/68362), done.
remote: Total 198014 (delta 128056), reused 198012 (delta 128055), pack-reused 0
Receiving objects: 100% (198014/198014), 76.55 MiB | 14.00 MiB/s, done.
Resolving deltas: 100% (128056/128056), done.
Previous HEAD position was 624a936234 Merge branch 'en/merge-multi-strategies'
HEAD is now at 014a9ea207 Merge branch 'en/t4301-more-merge-tree-tests'

This fetches a lot of objects... (three times!)

$ git cat-file --batch-check --batch-all-objects | grep blob | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    4099
$ git cat-file --batch-check --batch-all-objects | grep tree | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    130712
$ git cat-file --batch-check --batch-all-objects | grep commit | wc -l
warning: This repository uses promisor remotes. Some objects may not be loaded.
    67815

It fetched too many commits and trees... but surprisingly, only one
more blob was downloaded.

I admit that this is very bad behavior; that's because we
have no commits locally...

Maybe one solution: we could also provide a commit-id parameter
inside the depth filter, like --filter="commit:014a9ea207, depth:1"...
We could clone with the blob:none filter to download all trees/commits,
then fetch blobs with this "commit-depth" filter... We could even
provide a more complex filter: --filter="commit:014a9ea207, depth:1, type=blob".
This may avoid downloading too many unneeded commits and trees...

git fetch --filter="commit:014a9ea207, depth:1, type=blob"

If git fetch learned this filter, then git checkout or other commands could
use it internally and heuristically:

e.g.

git checkout HEAD~
if HEAD~ is missing, or >= 75% of blobs/trees in HEAD~ are missing -> use the "commit-depth" filter
else -> use the blob:none filter

We can even make this commit-depth filter support multiple commits later.

> Partial clone is designed to download a subset of objects, but make
> the remaining reachable objects downloadable on demand. By dropping
> reachable commits, the normal partial clone mechanism would result
> in a 'git rev-list' call asking for a missing commit. Would this
> inherit the "depth=X" but result in a huge amount of over-downloading
> the trees and blobs in that commit range? Would it result in downloading
> commits one-by-one, and then their root trees (and all reachable objects
> from those root trees)?
>

I don't know if it's possible to let git rev-list know that commits are missing
and stop downloading them (just like git cat-file --batch --batch-all-objects does).

Similarly, we could let git log or other commands understand this...

Probably a config var: fetch.skipmissingcommits...

> Finally, computing the set of objects to send is just as expensive as
> if we had a shallow clone (we can't use bitmaps). However, we get the
> additional problem where fetches do not have a shallow boundary, so
> the server will send deltas based on objects that are not necessarily
> present locally, triggering extra requests to resolve those deltas.
>

Agreed, I think this may be a problem, but there is no good solution for it.

> This fallout remains undocumented and unexplored in this series, but I
> doubt the investigation would result in positive outcomes.
>
> > Disadvantages of git clone --depth=<depth> --filter=blob:none: we must call
> > git fetch --unshallow to lift the shallow clone restriction, which downloads
> > the entire history of the current commit.
>
> How does your proposal fix this? Instead of unshallowing, users will
> stumble across these objects and trigger huge downloads by accident.
>

As mentioned above, I would expect a commit-depth filter to fix this.

> > Disadvantages of git clone --filter=blob:none with git sparse-checkout: The
> > git client needs to send a lot of missing object ids to the server, which
> > can waste a lot of network traffic.
>
> Asking for a list of blobs (especially limited to a sparse-checkout) is
> much more efficient than what will happen when a user tries to do almost
> anything in a repository formed the way you did here.
>

Yes. Also as mentioned above, we could enable this filter in some specific
cases, e.g. when we have the commit but not all the trees/blobs in it.

> Thinking about this idea, I don't think it is viable. I would need to
> see a lot of work done to test these scenarios closely to believe that
> this type of partial clone is a desirable working state.
>

Agreed.

> Thanks,
> -Stolee

Thanks for these reviews and criticisms; they make me think more :)

ZheNing Hu
ZheNing Hu Sept. 4, 2022, 9:14 a.m. UTC | #4
Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月2日周五 21:48写道:
>
> Hi ZheNing,
>
> first of all: thank you for working on this. In the past, I thought that
> this feature would be likely something we would want to have in Git.
>

Originally, I just found that a "full git checkout" after a partial clone
sends so many blob ids:

$ git clone --filter=blob:none --no-checkout --sparse
git@github.com:derrickstolee/sparse-checkout-example.git
$ cd sparse-checkout-example
$ GIT_TRACE_PACKET=$HOME/packet.trace git checkout HEAD
$ grep want $HOME/packet.trace  | wc -l
4060

So I was thinking about whether this process could be simplified between
the client and the server. In git checkout, users only need all the objects
in a commit. So maybe we can let the git client tell the server about this
commit-id, and then the server sends all objects in this commit. Then I
found that it just looks like git clone|fetch --depth=1, but a shallow clone
doesn't seem as easy to extend with missing objects as a partial clone.

https://git-scm.com/docs/partial-clone#_non_tasks also said:

Every time the subject of "demand loading blobs" comes up it seems
that someone suggests that the server be allowed to "guess" and send
additional objects that may be related to the requested objects.

So I guessed --filter=depth:<depth> might be a solution, but as you and
Derrick have said, there are still many problems with this depth filter.

> But Stolee's concerns are valid, and made me think about it more. See
> below for a more detailed analysis.
>
> On Thu, 1 Sep 2022, Derrick Stolee wrote:
>
> > On 9/1/2022 5:41 AM, ZheNing Hu via GitGitGadget wrote:
> >
> > > [...]
> > >
> > > Disadvantages of git clone --filter=blob:none with git
> > > sparse-checkout: The git client needs to send a lot of missing
> > > object ids to the server, which can waste a lot of network
> > > traffic.
> >
> > Asking for a list of blobs (especially limited to a sparse-checkout) is
> > much more efficient than what will happen when a user tries to do almost
> > anything in a repository formed the way you did here.
>
> I agree. When you have all the commit and tree objects on the local side,
> you can enumerate all the blob objects you need in one fell swoop, then
> fetch them in a single network round trip.
>
> When you lack tree objects, or worse, commit objects, this is not true.
> You may very well need to fetch _quite_ a bunch of objects, then inspect
> them to find out that you need to fetch more tree/commit objects, and then
> a couple more round trips, before you can enumerate all of the objects you
> need.
>

I think this is because the previous design was that you had to fetch
these missing commits (also trees) and all their ancestors. Maybe we can
modify git rev-list to make it understand missing commits...

> Concrete example: let's assume that you clone git.git with a "partial
> depth" of 50. That is, while cloning, all of the tip commits' graphs will
> be traversed up until the commits that are removed by 49 edges in the
> commit graph. For example, v0.99~49 will be present locally after cloning,
> but not v0.99~50.
>
> Now, the first-parent depth of v0.99 is 955 (verify with `git rev-list
> --count --first-parent v0.99`). None of the commits reachable from v0.99
> other than the tip itself seem to be closer to any other tag, so all
> commits reachable from v0.99~49 will be missing locally. And since reverts
> are rare, we must assume that the vast majority of the associated root
> tree objects are missing, too.
>
> Digging through history, a contributor might need to investigate where,
> say, `t/t4100/t-apply-7.expect` was introduced (it was in v0.99~206)
> because they found something looking like a bug and they need to read the
> commit message to see whether it was intentional. They know that this file
> was already present in v0.99. Naturally, the command-line to investigate
> that is:
>
>         git log --diff-filter=A v0.99 -- t/t4100/t-apply-7.expect
>
> So what does Git do in that operation? It traverses the commits starting
> from v0.99, following the chain along the commit parents. When it
> encounters v0.99~49, it figures out that it has to fetch v0.99~50. To see
> whether v0.99~49 introduced that file, it then has to inspect that commit
> object and then fetch the tree object (v0.99~50^{tree}). Then, Git
> inspects that tree to find out the object ID for v0.99~50^{tree}:t/, sees
> that it is identical to v0.99~49^{tree}:t/ and therefore the pathspec
> filter skips this commit from the output of the `git log` command. A
> couple of parent traversals later (always fetching the parent commit
> object individually, then the associated tree object, then figuring out
> that `t/` is unchanged) Git will encounter v0.99~55 where `t/` _did_
> change. So now it also has to fetch _that_ tree object.
>

A very convincing example. I think git commands that may require the full
missing commit history should fetch all of the missing commits in one batch
(so this depth filter is not very useful here).

> In total, we are looking at 400+ individual network round trips just to
> fetch the required tree/commit objects, i.e. before Git can show you the
> output of that `git log` command. And that's just for back-filling the
> missing tree/commit objects.
>
> If we had done this using a shallow clone, Git would have stopped at the
> shallow boundary, the user would have had a chance to increase the depth
> in bigger chunks (probably first extending the depth by 50, then maybe
> 100, then maybe going for 500) and while it would have been a lot of
> manual labor, the total time would be still a lot shorter than those 400+
> network round trips (which likely would incur some throttling on the
> server side).
>

Agreed.

> > Thinking about this idea, I don't think it is viable. I would need to
> > see a lot of work done to test these scenarios closely to believe that
> > this type of partial clone is a desirable working state.
>
> Indeed, it is hard to think of a way how the design could result in
> anything but undesirable behavior, both on the client and the server side.
>
> We also have to consider that our experience with large repositories
> demonstrates that tree and commit objects delta pretty well and are
> virtually never a concern when cloning. It is always the sheer amount of
> blob objects that is causing poor user experience when performing
> non-partial clones of large repositories.
>

Thanks, I think I understand the problem here. By the way, does it make
sense to download just some of the commits/trees in a big repository
which has several million commits/trees?

> Now, I can be totally wrong in my expectation that there is _no_ scenario
> where cloning with a "partial depth" would cause anything but poor
> performance. If I am wrong, then there is value in having this feature,
> but since it causes undesirable performance in all cases I can think of,
> it definitely should be guarded behind an opt-in flag.
>

Well, now I think this depth filter might be a better fit for git fetch.

If git checkout or other commands just need to check a few commits, and
find that almost all objects (maybe >= 75%) in a commit are not local,
they could use this depth filter to download them.

> Ciao,
> Dscho

Thanks,
ZheNing Hu
Johannes Schindelin Sept. 7, 2022, 10:18 a.m. UTC | #5
Hi ZheNing,

On Sun, 4 Sep 2022, ZheNing Hu wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月2日周五 21:48写道:
>
> > [...]
> > When you have all the commit and tree objects on the local side,
> > you can enumerate all the blob objects you need in one fell swoop, then
> > fetch them in a single network round trip.
> >
> > When you lack tree objects, or worse, commit objects, this is not true.
> > You may very well need to fetch _quite_ a bunch of objects, then inspect
> > them to find out that you need to fetch more tree/commit objects, and then
> > a couple more round trips, before you can enumerate all of the objects you
> > need.
>
> I think this is because the previous design was that you had to fetch
> these missing commits (also trees) and all their ancestors. Maybe we can
> modify git rev-list to make it understand missing commits...

We do have such a modification, and it is called "shallow clone" ;-)

Granted, shallow clones are not a complete solution and turned out to be a
dead end (i.e. that design cannot be extended into anything more useful).
But that approach demonstrates what it would take to implement a logic
whereby Git understands that some commit ranges are missing and should not
be fetched automatically.
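
Concretely (hypothetical URL):

	git clone --depth=50 https://example.com/repo.git
	git -C repo log    # stops at the grafted boundary; nothing is fetched automatically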

> > [...] it is hard to think of a way how the design could result in
> > anything but undesirable behavior, both on the client and the server
> > side.
> >
> > We also have to consider that our experience with large repositories
> > demonstrates that tree and commit objects delta pretty well and are
> > virtually never a concern when cloning. It is always the sheer amount
> > of blob objects that is causing poor user experience when performing
> > non-partial clones of large repositories.
>
> Thanks, I think I understand the problem here. By the way, does it make
> sense to download just some of the commits/trees in a big repository
> which has several million commits/trees?

It probably only makes sense if we can come up with a good idea how to
teach Git the trick to stop downloading so many objects in costly
roundtrips.

But I wonder whether your scenarios are so different from the ones I
encountered, in that commit and tree objects do _not_ delta well on your
side?

If they _do_ delta well, i.e. if it is comparatively cheap to just fetch
them all in one go, it probably makes more sense to just drop the idea of
fetching only some commit/tree objects but not others in a partial clone,
and always fetch all of 'em.

> > Now, I can be totally wrong in my expectation that there is _no_ scenario
> > where cloning with a "partial depth" would cause anything but poor
> > performance. If I am wrong, then there is value in having this feature,
> > but since it causes undesirable performance in all cases I can think of,
> > it definitely should be guarded behind an opt-in flag.
>
> Well, now I think this depth filter might be a better fit for git fetch.

I disagree here, because I see all the same challenges as I described for
clones missing entire commit ranges.

> If git checkout or other commands just need to check a few commits, and
> find that almost all objects (maybe >= 75%) in a commit are not local,
> they could use this depth filter to download them.

If you want a clone that does not show any reasonable commit history
because it does not fetch commit objects on-the-fly, then we already have
such a thing with shallow clones.

The only way to make Git's revision walking logic perform _somewhat_
reasonably would be to teach it to fetch not just a single commit object
when it was asked for, but to somehow pass a desired depth by which to
"unshallow" automatically.

However, such a feature would come with the same undesirable implications
on the server side as shallow clones (fetches into shallow clones are
_really_ expensive on the server side).

Ciao,
Dscho
ZheNing Hu Sept. 11, 2022, 10:59 a.m. UTC | #6
Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月7日周三 18:18写道:
>
> Hi ZheNing,
>
> On Sun, 4 Sep 2022, ZheNing Hu wrote:
>
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> 于2022年9月2日周五 21:48写道:
> >
> > > [...]
> > > When you have all the commit and tree objects on the local side,
> > > you can enumerate all the blob objects you need in one fell swoop, then
> > > fetch them in a single network round trip.
> > >
> > > When you lack tree objects, or worse, commit objects, this is not true.
> > > You may very well need to fetch _quite_ a bunch of objects, then inspect
> > > them to find out that you need to fetch more tree/commit objects, and then
> > > a couple more round trips, before you can enumerate all of the objects you
> > > need.
> >
> > I think this is because the previous design was that you had to fetch
> > these missing commits (also trees) and all their ancestors. Maybe we can
> > modify git rev-list to make it understand missing commits...
>
> We do have such a modification, and it is called "shallow clone" ;-)
>
> Granted, shallow clones are not a complete solution and turned out to be a
> dead end (i.e. that design cannot be extended into anything more useful).

Yeah, the depth filter could have overcome this shortcoming, but it may
require a lot of network overhead in some special cases.

> But that approach demonstrates what it would take to implement a logic
> whereby Git understands that some commit ranges are missing and should not
> be fetched automatically.
>

Agreed. Git uses commit grafts to do so.

> > > [...] it is hard to think of a way how the design could result in
> > > anything but undesirable behavior, both on the client and the server
> > > side.
> > >
> > > We also have to consider that our experience with large repositories
> > > demonstrates that tree and commit objects delta pretty well and are
> > > virtually never a concern when cloning. It is always the sheer amount
> > > of blob objects that is causing poor user experience when performing
> > > non-partial clones of large repositories.
> >
> > Thanks, I think I understand the problem here. By the way, does it make
> > sense to download just some of the commits/trees in a big repository
> > which has several million commits/trees?
>
> It probably only makes sense if we can come up with a good idea how to
> teach Git the trick to stop downloading so many objects in costly
> roundtrips.
>

Good advice. Perhaps we should merge these multiple requests into one.
Maybe we should use a blob:none filter to download all missing trees/commits
if we need to iterate through the entire commit history.
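
(Today, one way to do that in a single request would be something like

$ git fetch --refetch --filter=blob:none origin

which, if the server allows it, re-downloads everything matching the filter
in one go instead of commit by commit.)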

> But I wonder whether your scenarios are so different from the ones I
> encountered, in that commit and tree objects do _not_ delta well on your
> side?
>
> If they _do_ delta well, i.e. if it is comparatively cheap to just fetch
> them all in one go, it probably makes more sense to just drop the idea of
> fetching only some commit/tree objects but not others in a partial clone,
> and always fetch all of 'em.
>

Delta is a wonderful thing most of the time (in cases where bulk acquisition
is required). But sometimes I think users just want to see the message of one
commit, so why do they have to download other commits/trees that are not
required?

Sometimes users may understand the access patterns of their git objects
better than the git server does. It would be nice if the user could download
a specific object just by its object id (that is only possible for blobs now,
right?)
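
A small illustration of that single-object, on-demand download as it exists
today: in a blob:none partial clone, asking for a missing blob already
triggers a one-object fetch from the promisor remote, for example (placeholder
object id):

$ git cat-file -p <blob-oid>   # the missing blob is lazily fetched on demand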

> > > Now, I can be totally wrong in my expectation that there is _no_ scenario
> > > where cloning with a "partial depth" would cause anything but poor
> > > performance. If I am wrong, then there is value in having this feature,
> > > but since it causes undesirable performance in all cases I can think of,
> > > it definitely should be guarded behind an opt-in flag.
> >
> > Well, now I think this depth filter might be a better fit for git fetch.
>
> I disagree here, because I see all the same challenges as I described for
> clones missing entire commit ranges.
>

Oh, a prerequisite is missing here: after we have all the commits and trees,
we would then use the depth filter to download the missing blobs.

> > If git checkout or other commands just need to check a few commits, and
> > find that almost all objects (maybe >= 75%) in a commit are not local,
> > they could use this depth filter to download them.
>
> If you want a clone that does not show any reasonable commit history
> because it does not fetch commit objects on-the-fly, then we already have
> such a thing with shallow clones.
>
> The only way to make Git's revision walking logic perform _somewhat_
> reasonably would be to teach it to fetch not just a single commit object
> when it was asked for, but to somehow pass a desired depth by which to
> "unshallow" automatically.
>
> However, such a feature would come with the same undesirable implications
> on the server side as shallow clones (fetches into shallow clones are
> _really_ expensive on the server side).
>

Agreed. Letting git shallow clone be smarter may work, but there are
big challenges too.

> Ciao,
> Dscho

Thanks,
ZheNing Hu