mbox series

[0/2] Fix background maintenance regression in Git 2.45.0

Message ID pull.1764.git.1721332546.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Fix background maintenance regression in Git 2.45.0 | expand

Message

John Cai via GitGitGadget July 18, 2024, 7:55 p.m. UTC
Here is an issue I noticed while exploring issues with my local copy of a
large monorepo. I was intending to show some engineers how nice the objects
were maintained by background maintenance, but saw hundreds of small
pack-files that were up to two months old. This time matched when I upgraded
to the microsoft/git fork that included the 2.45.0 release of Git.

The issue is that 'git multi-pack-index repack' was taught to call 'git
pack-objects' with the new '--stdin-packs' option. However, this changes the
object selection algorithm. Instead of using the objects referenced by the
multi-pack-index, it compares pack-files using a list of "included" and
"excluded" pack-files. This loses some granularity of how the
multi-pack-index chooses among duplicate objects.

The end result is that some objects that would normally have been included
in the new pack-file are no longer included. The copy that the
multi-pack-index references is in the pack-file that was intended to be
repacked, so that pack-file cannot be expired in the next 'git
multi-pack-index expire' step and is included again in the batch of objects
to repack.

In the context of the change that is reverted by this series, it seems the
motivation of the change was two-fold:

 1. some I/O benefits to using pack names over object names, and
 2. the ability to use an object walk to improve delta compression.

In my local prototyping, I've found that we could improve 'git pack-objects'
to use an object walk when given a set of objects over stdin without needing
to use pack-file names. I do not believe the '--stdin-packs' option should
be used for the 'git multi-pack-index repack' mechanism, or at least should
be done with great care and only in specific cases where some assumptions
can be made around duplicate objects and closure under reachability.

However, the prototype I've built to get these benefits is non-trivial due
to working to guarantee that partial clones do not accidentally download
missing blobs. That will follow in a separate series that can be reviewed at
a slower pace.

Thanks, -Stolee

Derrick Stolee (2):
  t5319: add failing test case for repack/expire
  midx-write: revert use of --stdin-packs

 midx-write.c                | 18 ++++++------
 t/t5319-multi-pack-index.sh | 55 +++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 9 deletions(-)


base-commit: bea9ecd24b0c3bf06cab4a851694fe09e7e51408
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1764%2Fderrickstolee%2Fincremental-repack-fix-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1764/derrickstolee/incremental-repack-fix-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1764

Comments

Junio C Hamano July 18, 2024, 9:57 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Here is an issue I noticed while exploring issues with my local copy of a
> large monorepo. I was intending to show some engineers how nice the objects
> were maintained by background maintenance, but saw hundreds of small
> pack-files that were up to two months old. This time matched when I upgraded
> to the microsoft/git fork that included the 2.45.0 release of Git.

I almost said "wow, perfect timing on the -rc1 day", but then
realized that this is not a regression during _this_ cycle, but a
cycle ago.

You already Cc'ed Taylor, whom I would have asked for help if I were
the one who found this issue, which is good.

Will queue.

Thanks.
Taylor Blau July 18, 2024, 10:38 p.m. UTC | #2
On Thu, Jul 18, 2024 at 02:57:55PM -0700, Junio C Hamano wrote:
> You already Cc'ed Taylor, whom I would have asked for help if I were
> the one who found this issue, which is good.

It looks like there is a small typo in my email address as
"me@ttayllorr.com" instead of "me@ttaylorr.com", but I saw the message
anyway ;-).

Thanks,
Taylor
Taylor Blau July 18, 2024, 10:50 p.m. UTC | #3
On Thu, Jul 18, 2024 at 07:55:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> The issue is that 'git multi-pack-index repack' was taught to call 'git
> pack-objects' with the new '--stdin-packs' option. However, this changes the
> object selection algorithm. Instead of using the objects referenced by the
> multi-pack-index, it compares pack-files using a list of "included" and
> "excluded" pack-files. This loses some granularity of how the
> multi-pack-index chooses among duplicate objects.

Thank you for looking at this so carefully! Let me double check my own
understanding.

Suppose we have a MIDX with some pack 'P' and say, some commit object
'C' which appears in that pack. Let's also suppose we have another pack
'Q' in the same MIDX which also contains 'C', but the MIDX selected its
copy from pack 'P'.

If we want to combine 'P' with some other packs (excluding 'Q'), then
the input to --stdin-packs will look something like:

    P.pack
    ^Q.pack
    ...

And the resulting pack would not contain 'C', since we would reject it
via: add_object_entry_from_pack() -> want_object_in_pack() ->
want_found_object() -> has_object_kept_pack(). The final function there
would find a copy of 'C' in 'Q', and since 'Q' is excluded, we would
reject 'C' as unwanted.

So the resulting pack would not contain 'C', and the MIDX would hold
onto its copy from pack 'P', resulting in 'P' being both (a) in the set
of packs to repack together, but also (b) non-expireable, since it has
at least one object selected from it in the MIDX.

> The end result is that some objects that would normally have been included
> in the new pack-file are no longer included. The copy that the
> multi-pack-index references is in the pack-file that was intended to be
> repacked, so that pack-file cannot be expired in the next 'git
> multi-pack-index expire' step and is included again in the batch of objects
> to repack.

I think this matches my own understanding, but let me know if I'm
missing something. Assuming I'm thinking about this the same way you
are, the fix (stop using --stdin-packss) makes sense to me.

Thanks,
Taylor
Derrick Stolee July 19, 2024, 1:21 p.m. UTC | #4
On 7/18/24 6:50 PM, Taylor Blau wrote:
> On Thu, Jul 18, 2024 at 07:55:44PM +0000, Derrick Stolee via GitGitGadget wrote:
>> The issue is that 'git multi-pack-index repack' was taught to call 'git
>> pack-objects' with the new '--stdin-packs' option. However, this changes the
>> object selection algorithm. Instead of using the objects referenced by the
>> multi-pack-index, it compares pack-files using a list of "included" and
>> "excluded" pack-files. This loses some granularity of how the
>> multi-pack-index chooses among duplicate objects.
> 
> Thank you for looking at this so carefully! Let me double check my own
> understanding.
> 
> Suppose we have a MIDX with some pack 'P' and say, some commit object
> 'C' which appears in that pack. Let's also suppose we have another pack
> 'Q' in the same MIDX which also contains 'C', but the MIDX selected its
> copy from pack 'P'.
> 
> If we want to combine 'P' with some other packs (excluding 'Q'), then
> the input to --stdin-packs will look something like:
> 
>      P.pack
>      ^Q.pack
>      ...
> 
> And the resulting pack would not contain 'C', since we would reject it
> via: add_object_entry_from_pack() -> want_object_in_pack() ->
> want_found_object() -> has_object_kept_pack(). The final function there
> would find a copy of 'C' in 'Q', and since 'Q' is excluded, we would
> reject 'C' as unwanted.
> 
> So the resulting pack would not contain 'C', and the MIDX would hold
> onto its copy from pack 'P', resulting in 'P' being both (a) in the set
> of packs to repack together, but also (b) non-expireable, since it has
> at least one object selected from it in the MIDX.
> 
>> The end result is that some objects that would normally have been included
>> in the new pack-file are no longer included. The copy that the
>> multi-pack-index references is in the pack-file that was intended to be
>> repacked, so that pack-file cannot be expired in the next 'git
>> multi-pack-index expire' step and is included again in the batch of objects
>> to repack.
> 
> I think this matches my own understanding, but let me know if I'm
> missing something. Assuming I'm thinking about this the same way you
> are, the fix (stop using --stdin-packss) makes sense to me.

Your interpretation matches mine. Thanks for the careful read.

I think we can accomplish similar goals that match the reasoning for
--stdin-packs (better deltas while also limiting the object walk to the 
repacked objects) with some changes to read_object_list_from_stdin(),
but that's a more subtle kind of change.

Taking this change as-is will cause a regression in the quality of
delta choices, but recovers our ability to expire "repacked" pack-files
in all cases.

Thanks,
-Stolee
Derrick Stolee July 19, 2024, 1:23 p.m. UTC | #5
On 7/18/24 6:38 PM, Taylor Blau wrote:
> On Thu, Jul 18, 2024 at 02:57:55PM -0700, Junio C Hamano wrote:
>> You already Cc'ed Taylor, whom I would have asked for help if I were
>> the one who found this issue, which is good.
> 
> It looks like there is a small typo in my email address as
> "me@ttayllorr.com" instead of "me@ttaylorr.com", but I saw the message
> anyway ;-).

Whoops! Sorry about that. Repeated too many letters. I've fixed the
GitGitGadget cover letter in case a v2 is required.

Thanks,
-Stolee
Derrick Stolee July 19, 2024, 1:24 p.m. UTC | #6
On 7/18/24 5:57 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Here is an issue I noticed while exploring issues with my local copy of a
>> large monorepo. I was intending to show some engineers how nice the objects
>> were maintained by background maintenance, but saw hundreds of small
>> pack-files that were up to two months old. This time matched when I upgraded
>> to the microsoft/git fork that included the 2.45.0 release of Git.
> 
> I almost said "wow, perfect timing on the -rc1 day", but then
> realized that this is not a regression during _this_ cycle, but a
> cycle ago.

I almost waited until after the release, but I wanted to put the
information out there just in case you were interested in taking it
into 2.46.0 or were planning on a 2.45.3.

Thanks,
-Stolee
Taylor Blau July 19, 2024, 1:38 p.m. UTC | #7
On Fri, Jul 19, 2024 at 09:21:51AM -0400, Derrick Stolee wrote:
> On 7/18/24 6:50 PM, Taylor Blau wrote:
>
> > I think this matches my own understanding, but let me know if I'm
> > missing something. Assuming I'm thinking about this the same way you
> > are, the fix (stop using --stdin-packss) makes sense to me.
>
> Your interpretation matches mine. Thanks for the careful read.
>
> I think we can accomplish similar goals that match the reasoning for
> --stdin-packs (better deltas while also limiting the object walk to the
> repacked objects) with some changes to read_object_list_from_stdin(),
> but that's a more subtle kind of change.

FWIW, the main motivation for that change was to limit the amount of
cross-process I/O that was necessary to generate the new pack. I figured
that for relatively small amounts of packs which contain relatively
large amounts of objects that it would be more efficient to write out
the pack names than the object names.

I was thinking a little bit about how we would alter the behavior of
'--stdin-packs' to match what the 'multi-pack-index repack' caller
needs. I agree that it is possible, and I doubly agree that it is subtle
;-).

TBH, I think that the amount of I/O we're potentially saving is dwarfed
by the amount of I/O and CPU time it takes to actually generate the new
pack, so I doubt the effort to make such a subtle change would be all
that worthwhile, though certainly an interesting exercise ;-).

> Taking this change as-is will cause a regression in the quality of
> delta choices, but recovers our ability to expire "repacked" pack-files
> in all cases.

Yeah, definitely.

Thanks,
Taylor
Junio C Hamano July 19, 2024, 3:13 p.m. UTC | #8
Derrick Stolee <stolee@gmail.com> writes:

> On 7/18/24 5:57 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> Here is an issue I noticed while exploring issues with my local copy of a
>>> large monorepo. I was intending to show some engineers how nice the objects
>>> were maintained by background maintenance, but saw hundreds of small
>>> pack-files that were up to two months old. This time matched when I upgraded
>>> to the microsoft/git fork that included the 2.45.0 release of Git.
>> I almost said "wow, perfect timing on the -rc1 day", but then
>> realized that this is not a regression during _this_ cycle, but a
>> cycle ago.
>
> I almost waited until after the release, but I wanted to put the
> information out there just in case you were interested in taking it
> into 2.46.0 or were planning on a 2.45.3.

Yup, thanks but this is not exactly a repository breaking data
corruption bug, and did not look ultra urgent.  Especially if we
want to pursue a solution that helps both expiring stale packs
better (which is what you are restoring) and making better delta
chain selection (which may be what you are losing) at the same time,
such a change could become a source of data corruption bug, so I'd
prefer to see it started early in a cycle, rather as a last-minute
"let's fix this too".

Thanks.
Derrick Stolee July 19, 2024, 4:20 p.m. UTC | #9
On 7/19/24 11:13 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/18/24 5:57 PM, Junio C Hamano wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>> Here is an issue I noticed while exploring issues with my local copy of a
>>>> large monorepo. I was intending to show some engineers how nice the objects
>>>> were maintained by background maintenance, but saw hundreds of small
>>>> pack-files that were up to two months old. This time matched when I upgraded
>>>> to the microsoft/git fork that included the 2.45.0 release of Git.
>>> I almost said "wow, perfect timing on the -rc1 day", but then
>>> realized that this is not a regression during _this_ cycle, but a
>>> cycle ago.
>>
>> I almost waited until after the release, but I wanted to put the
>> information out there just in case you were interested in taking it
>> into 2.46.0 or were planning on a 2.45.3.
> 
> Yup, thanks but this is not exactly a repository breaking data
> corruption bug, and did not look ultra urgent.  Especially if we
> want to pursue a solution that helps both expiring stale packs
> better (which is what you are restoring) and making better delta
> chain selection (which may be what you are losing) at the same time,
> such a change could become a source of data corruption bug, so I'd
> prefer to see it started early in a cycle, rather as a last-minute
> "let's fix this too".

I agree with this assessment. The microsoft/git fork has taken the
revert as users of that fork have a higher chance of being affected.

Thanks,
-Stolee