mbox series

[0/2] repack: pack everything into promisor packfile in partial repos

Message ID 20240925072021.77078-1-hanyang.tony@bytedance.com (mailing list archive)
Headers show
Series repack: pack everything into promisor packfile in partial repos | expand

Message

Han Young Sept. 25, 2024, 7:20 a.m. UTC
As suggested by Jonathan[1], there are number of ways to fix this issue.
We have already explored some of them in this thread, and so far none of them
is satisfiable. Calvin and I tried to address the problem from fetch-pack side
and rev-list side. But the fix either consumes too much CPU power or results
in inefficient bandwidth use.

So let's attack the problem from repack side. The goal is to prevent repack
from discarding local objects, previously it is done by carefully
separating promisor objects and normal objects in rev-list.
The implementation is flawed and no solution have been found so far.
Instead, we can get ride of rev-list and just pack everything into promisor
files. This way, no objects would be lost.

By using 'repack everything', repacking requires less work and we are not
using more bandwidth. The only downside is normal objects packing does not
benefiting from the history and path based delta calculation. Majority of
objects in a partial repo is promisor objects, so the impact of worse normal
objects repacking is negligible.

[1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/

Han Young (2):
  repack: pack everything into promisor packfile in partial repos
  t0410: adapt tests to repack changes

 builtin/repack.c         | 258 ++++++++++++++++++++++-----------------
 t/t0410-partial-clone.sh |  68 +----------
 2 files changed, 145 insertions(+), 181 deletions(-)

Comments

Phillip Wood Sept. 25, 2024, 3:20 p.m. UTC | #1
Hi Han

On 25/09/2024 08:20, Han Young wrote:
> As suggested by Jonathan[1], there are number of ways to fix this issue.
> We have already explored some of them in this thread, and so far none of them
> is satisfiable. Calvin and I tried to address the problem from fetch-pack side
> and rev-list side. But the fix either consumes too much CPU power or results
> in inefficient bandwidth use.

I was wondering if it would be possible to cache the tip commits in 
promisor packs when repacking so that a subsequent repack only has to 
walk the commits added since the last repack when it is trying to figure 
out if a local object should be moved into a promisor pack.

> So let's attack the problem from repack side. The goal is to prevent repack
> from discarding local objects, previously it is done by carefully
> separating promisor objects and normal objects in rev-list.
> The implementation is flawed and no solution have been found so far.
> Instead, we can get ride of rev-list and just pack everything into promisor
> files. This way, no objects would be lost.
> 
> By using 'repack everything', repacking requires less work and we are not
> using more bandwidth. The only downside is normal objects packing does not
> benefiting from the history and path based delta calculation.

I've just been looking at Documentation/technical/partial-clone.txt and 
I think there are a couple of other implications of this change

 > An object may be missing due to a partial clone or fetch, or missing
 > due to repository corruption.  To differentiate these cases, the
 > local repository specially indicates such filtered packfiles
 > obtained from promisor remotes as "promisor packfiles".

Packing local objects into promisor packfiles means that it is no longer 
possible to detect if an object is missing due to repository corruption 
or because we need to fetch it from a promisor remote.

 > `repack` in GC has been updated to not touch promisor packfiles at
 > all, and to only repack other objects.

Packing local objects into promisor packfiles means that GC will 
no-longer remove unreachable local objects.

It would be helpful if the cover letter or commit messages discussed the 
tradeoffs of these changes and updated that document accordingly.

Best Wishes

Phillip

> Majority of
> objects in a partial repo is promisor objects, so the impact of worse normal
> objects repacking is negligible.
> 
> [1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
> 
> Han Young (2):
>    repack: pack everything into promisor packfile in partial repos
>    t0410: adapt tests to repack changes
> 
>   builtin/repack.c         | 258 ++++++++++++++++++++++-----------------
>   t/t0410-partial-clone.sh |  68 +----------
>   2 files changed, 145 insertions(+), 181 deletions(-)
>
Junio C Hamano Sept. 25, 2024, 4:48 p.m. UTC | #2
Phillip Wood <phillip.wood123@gmail.com> writes:

> I was wondering if it would be possible to cache the tip commits in
> promisor packs when repacking so that a subsequent repack only has to
> walk the commits added since the last repack when it is trying to
> figure out if a local object should be moved into a promisor pack.

I was wondering the same thing.  If packfiles (and bundles) record
the entry points and the exit points of the DAG, it would help quite
a bit.

> It would be helpful if the cover letter or commit messages discussed
> the tradeoffs of these changes and updated that document accordingly.

I like the suggestion very much.

Thanks for a review.
Junio C Hamano Sept. 25, 2024, 5:03 p.m. UTC | #3
Han Young <hanyang.tony@bytedance.com> writes:

> By using 'repack everything', repacking requires less work and we are not
> using more bandwidth. The only downside is normal objects packing does not
> benefiting from the history and path based delta calculation. Majority of
> objects in a partial repo is promisor objects, so the impact of worse normal
> objects repacking is negligible.

There is an important assumption that any objects in promisor packs
*and* any objects that are (directly or indirectly) referenced by
these objects in promisor packs can safely be expunged from the
local object store because they can be later fetched again from the
promisor remote.  In that (in)famous failure case topology of the
history:

    commit  tree  blob
     C3 ---- T3 -- B3 (fetched from remote, in promisor pack)
     |
     C2 ---- T2 -- B2 (created locally, in non-promisor pack)
     |
     C1 ---- T1 -- B1 (fetched from remote, in promisor pack)

even though the objects associated with the commit C2 are created
locally, the fact that C3 in promisor pack references it alone is
sufficient for us to also assume that these "locally created" are
now refetchable from the promisor remote that gave us C3, hence it
is safe to repack the history leading to C3 and all objects involved
in the history and mark the resulting pack(s) promisor packs.

OK. That sounds workable, but aren't there downsides?

Thanks for working on this topic.