mbox series

[v2,0/3] repack: pack everything into promisor packfile in partial repos

Message ID 20241008081350.8950-1-hanyang.tony@bytedance.com (mailing list archive)
Headers show
Series repack: pack everything into promisor packfile in partial repos | expand

Message

Han Young Oct. 8, 2024, 8:13 a.m. UTC
As suggested by Jonathan[1], there are number of ways to fix this issue.
We have already explored some of them in this thread, and so far none of them
is satisfiable. Calvin and I tried to address the problem from fetch-pack side
and rev-list side. But the fix either consumes too much CPU power or results
in inefficient bandwidth use.

So let's attack the problem from repack side. The goal is to prevent repack
from discarding local objects, previously it is done by carefully
separating promisor objects and normal objects in rev-list.
The implementation is flawed and no solution have been found so far.
Instead, we can get ride of rev-list and just pack everything into promisor
files. This way, no objects would be lost.

By using 'repack everything', repacking requires less work and we are not
using more bandwidth.

Packing local objects into promisor packfiles means that it is no longer
possible to detect if an object is missing due to repository corruption
or because we need to fetch it from a promisor remote.

Promisor objects packing does not benefiting from the history and
path based delta calculation, and GC does not remove unreachable promisor
objects. By packing locally created normal objects into promisor packfile,
normal objects are converted into promisor objects. However, in partial cloned
repos, the number of locally created objects are small compared to promisor
objects. The impact should be negligible.

[1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/

*** Changes since v1 ***
Added tradeoffs in cover letter.
Fixed some partial clone test cases.
Updated partial clone documentation.

Han Young (3):
  repack: pack everything into packfile
  t0410: adapt tests to repack changes
  partial-clone: update doc

 Documentation/technical/partial-clone.txt |  16 +-
 builtin/repack.c                          | 257 ++++++++++++----------
 t/t0410-partial-clone.sh                  |  68 +-----
 t/t5616-partial-clone.sh                  |   9 +-
 4 files changed, 157 insertions(+), 193 deletions(-)

Comments

Junio C Hamano Oct. 8, 2024, 9:57 p.m. UTC | #1
Han Young <hanyang.tony@bytedance.com> writes:

> As suggested by Jonathan[1], there are number of ways to fix this issue.
> We have already explored some of them in this thread, and so far none of them
> is satisfiable. Calvin and I tried to address the problem from fetch-pack side
> and rev-list side. But the fix either consumes too much CPU power or results
> in inefficient bandwidth use.
>
> So let's attack the problem from repack side. The goal is to prevent repack
> from discarding local objects, previously it is done by carefully
> separating promisor objects and normal objects in rev-list.
> The implementation is flawed and no solution have been found so far.
> Instead, we can get ride of rev-list and just pack everything into promisor
> files. This way, no objects would be lost.
>
> By using 'repack everything', repacking requires less work and we are not
> using more bandwidth.

OK, perhaps.

> Packing local objects into promisor packfiles means that it is no longer
> possible to detect if an object is missing due to repository corruption
> or because we need to fetch it from a promisor remote.

Is it true that without doing this, we can tell between these two
cases, though?  More importantly, even if it is true, would there be
a practical difference?

In the sample scenario used in [1/3] where you created C2/T2/B2 on
top of C1/T1/B1 (which came from a promisor remote), somebody else
built C3/T3/B3 on top, and it came back from the promisor remote,
you could lose 3's objects and 1's objects and they can be refetched
but even if you lose 2's objects, since 3's objects are building on
top of them, you should be able to fetch them from the promisor
remote just like objects from 1 and 3, no?  So strictly speaking,
missing 2's objects may be "repository corruption" while missing 1's
and 3's objects may not be, would there be a practical use for that
information?

> Promisor objects packing does not benefiting from the history and
> path based delta calculation, and GC does not remove unreachable promisor
> objects. By packing locally created normal objects into promisor packfile,
> normal objects are converted into promisor objects. However, in partial cloned
> repos, the number of locally created objects are small compared to promisor
> objects. The impact should be negligible.

> [1] https://lore.kernel.org/git/20240813004508.2768102-1-jonathantanmy@google.com/
>
> *** Changes since v1 ***
> Added tradeoffs in cover letter.
> Fixed some partial clone test cases.
> Updated partial clone documentation.

These patches are based on the tip of master before 365529e1 (Merge
branch 'ps/leakfixes-part-7', 2024-10-02), which will give mildly
annoying conflicts when merged to 'seen'.

I've managed to apply and then merge, so unless review discussions
find needs for updates, there is no need for immediate reroll, but
if you end up having to update these patches, it is a good idea to
rebase the topic on top of v2.47.0 that was released early this
week, as we are now entering a new development cycle.

Thanks.


>
> Han Young (3):
>   repack: pack everything into packfile
>   t0410: adapt tests to repack changes
>   partial-clone: update doc
>
>  Documentation/technical/partial-clone.txt |  16 +-
>  builtin/repack.c                          | 257 ++++++++++++----------
>  t/t0410-partial-clone.sh                  |  68 +-----
>  t/t5616-partial-clone.sh                  |   9 +-
>  4 files changed, 157 insertions(+), 193 deletions(-)
Junio C Hamano Oct. 8, 2024, 10:43 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> I've managed to apply and then merge, so unless review discussions
> find needs for updates, there is no need for immediate reroll, but
> if you end up having to update these patches, it is a good idea to
> rebase the topic on top of v2.47.0 that was released early this
> week, as we are now entering a new development cycle.

When merged to the tip of 'seen', it seems to break t5710.  It might
be due to mismerge, but can you check on your end?

Thanks.
Han Young Oct. 9, 2024, 6:31 a.m. UTC | #3
On Wed, Oct 9, 2024 at 5:57 AM Junio C Hamano <gitster@pobox.com> wrote:

> > Packing local objects into promisor packfiles means that it is no longer
> > possible to detect if an object is missing due to repository corruption
> > or because we need to fetch it from a promisor remote.
>
> Is it true that without doing this, we can tell between these two
> cases, though?  More importantly, even if it is true, would there be
> a practical difference?
>
> In the sample scenario used in [1/3] where you created C2/T2/B2 on
> top of C1/T1/B1 (which came from a promisor remote), somebody else
> built C3/T3/B3 on top, and it came back from the promisor remote,
> you could lose 3's objects and 1's objects and they can be refetched
> but even if you lose 2's objects, since 3's objects are building on
> top of them, you should be able to fetch them from the promisor
> remote just like objects from 1 and 3, no?  So strictly speaking,
> missing 2's objects may be "repository corruption" while missing 1's
> and 3's objects may not be, would there be a practical use for that
> information?

 Some code path does check if the missing object is promisor object before
 lazy fetching, `--missing=` does this check.
But in that case, C2 is also a promisor object, the check would pass.
There are no partial clone filter that omits commits, missing commit will
always result in error. And even if we do report "repository corruption",
the best course of action is still try to fetching them.
So, no. I don't think there are practical uses for that information


> These patches are based on the tip of master before 365529e1 (Merge
> branch 'ps/leakfixes-part-7', 2024-10-02), which will give mildly
> annoying conflicts when merged to 'seen'.
>
> I've managed to apply and then merge, so unless review discussions
> find needs for updates, there is no need for immediate reroll, but
> if you end up having to update these patches, it is a good idea to
> rebase the topic on top of v2.47.0 that was released early this
> week, as we are now entering a new development cycle.

Thanks, I will rebase to master and see if any other tests break.