[7/7] builtin/merge-tree.c: implement support for `--write-pack`

Message ID e96921014557edb41dd73d93a8c3cf6cfaf0c719.1696629697.git.me@ttaylorr.com (mailing list archive)
State Superseded
Series merge-ort: implement support for packing objects together

Commit Message

Taylor Blau Oct. 6, 2023, 10:02 p.m. UTC
When using merge-tree often within a repository[^1], it is possible to
generate a relatively large number of loose objects, which can result in
degraded performance, and inode exhaustion in extreme cases.

Building on the functionality introduced in previous commits, the
bulk-checkin machinery now has support to write arbitrary blob and tree
objects which are small enough to be held in-core. We can use this to
write any blob/tree objects generated by ORT into a separate pack
instead of writing them out individually as loose.

This functionality is gated behind a new `--write-pack` option to
`merge-tree` that works with the (non-deprecated) `--write-tree` mode.
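
For illustration, a minimal invocation (mirroring the tests added at the
end of this patch; branch names are arbitrary) looks like:

    # merge two branches, writing any new blobs/trees into a single pack
    $ git merge-tree --write-pack side other

    # any new pack shows up alongside the existing ones
    $ find .git/objects/pack -type f -name "pack-*.idx"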

The implementation is relatively straightforward. There are two spots
within the ORT mechanism where we call `write_object_file()`, one for
content differences within blobs, and another to assemble any new trees
necessary to construct the merge. In each of those locations,
conditionally replace calls to `write_object_file()` with
`index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
depending on which kind of object we are writing.

The only remaining task is to begin and end the transaction necessary to
initialize the bulk-checkin machinery, and move any new pack(s) it
created into the main object store.

[^1]: Such is the case at GitHub, where we run presumptive "test merges"
  on open pull requests to decide whether we can light up the merge
  button green, i.e., whether the presumptive merge is free of
  conflicts.

  This is done in response to a number of user-initiated events,
  including viewing an open pull request whose last test merge is stale
  with respect to the current base and tip of the pull request. As a
  result, merge-tree can be run very frequently on large, active
  repositories.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-merge-tree.txt |  4 ++
 builtin/merge-tree.c             |  5 ++
 merge-ort.c                      | 43 +++++++++++----
 merge-recursive.h                |  1 +
 t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 136 insertions(+), 10 deletions(-)

Comments

Junio C Hamano Oct. 6, 2023, 10:35 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.

Well, be it "git merge-tree" or "git merge", new loose objects tend
to accumulate until "gc" kicks in, so it is not a new problem for
mere mortals, is it?

As one "interesting" use case of "merge-tree" is for a Git hosting
site with bare repositories to offer trial merges, without which
majority of the object their repositories acquire would have been in
packs pushed by their users, "Gee, loose objects consume many inodes
in exchange for easier selective pruning" becomes an issue, right?

Just like it hurts performance to have too many loose object files,
presumably it would also hurt performance to keep too many packs,
each coming from such a trial merge.  Do we have a "gc" story offered
for these packs created by the new feature?  E.g., "once merge-tree
is done creating a trial merge, we can discard the objects created
in the pack, because we never expose new objects in the pack to the
outside (processes running simultaneously), so instead of closing the
new packfile by calling flush_bulk_checkin_packfile(), we can safely
unlink the temporary pack.  We do not even need to spend cycles
running a gc to enumerate what is still reachable", or something like
that?

Thanks.
Taylor Blau Oct. 6, 2023, 11:02 p.m. UTC | #2
On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > When using merge-tree often within a repository[^1], it is possible to
> > generate a relatively large number of loose objects, which can result in
> > degraded performance, and inode exhaustion in extreme cases.
>
> Well, be it "git merge-tree" or "git merge", new loose objects tend
> to accumulate until "gc" kicks in, so it is not a new problem for
> mere mortals, is it?

Yeah, I would definitely suspect that this is more of an issue for
forges than individual Git users.

> As one "interesting" use case of "merge-tree" is for a Git hosting
> site with bare repositories to offer trial merges, without which
> majority of the object their repositories acquire would have been in
> packs pushed by their users, "Gee, loose objects consume many inodes
> in exchange for easier selective pruning" becomes an issue, right?

Right.

> Just like it hurts performance to have too many loose object files,
> presumably it would also hurt performance to keep too many packs,
> each came from such a trial merge.  Do we have a "gc" story offered
> for these packs created by the new feature?  E.g., "once merge-tree
> is done creating a trial merge, we can discard the objects created
> in the pack, because we never expose new objects in the pack to the
> outside, processes running simultaneously, so instead closing the
> new packfile by calling flush_bulk_checkin_packfile(), we can safely
> unlink the temporary pack.  We do not even need to spend cycles to
> run a gc that requires cycles to enumerate what is still reachable",
> or something like that?

I know Johannes worked on something like this recently. IIRC, it
effectively does something like:

    struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
    tmp_objdir_replace_primary_odb(tmp_objdir, 1);

at the beginning of a merge operation, and:

    tmp_objdir_discard_objects(tmp_objdir);

at the end. I haven't followed that work off-list very closely, but it
is only possible for GitHub to discard certain niche kinds of
merges/rebases, since in general we make the objects created during test
merges available via refs/pull/N/{merge,rebase}.

I think that like anything, this is a trade-off. Having lots of packs
can be a performance hindrance just like having lots of loose objects.
But since we can represent more objects with fewer inodes when packed,
storing those objects together in a pack is preferable when (a) you're
doing lots of test-merges, and (b) you want to keep those objects
around, e.g., because they are reachable.
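
(As an aside, packs that accumulate this way can later be rolled up by
ordinary maintenance; e.g., assuming periodic repacks are acceptable:)

    # roll small packs, such as those produced by test merges, up into
    # larger ones
    git repack --geometric=2 -d

    # or let background maintenance do the same thing
    git maintenance run --task=incremental-repack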

Thanks,
Taylor
Elijah Newren Oct. 8, 2023, 7:02 a.m. UTC | #3
On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
>
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.

It may still be nice to do this optimization for plain "git
merge" as well.  I had it in my list of ideas somewhere to do a
"fast-import-like" thing to avoid writing loose objects, as I
suspected writing loose objects would actually be a performance
impediment.

> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which
> > majority of the object their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
>
> Right.
>
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each came from such a trial merge.  Do we have a "gc" story offered
> > for these packs created by the new feature?  E.g., "once merge-tree
> > is done creating a trial merge, we can discard the objects created
> > in the pack, because we never expose new objects in the pack to the
> > outside, processes running simultaneously, so instead closing the
> > new packfile by calling flush_bulk_checkin_packfile(), we can safely
> > unlink the temporary pack.  We do not even need to spend cycles to
> > run a gc that requires cycles to enumerate what is still reachable",
> > or something like that?
>
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
>
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
>
> at the beginning of a merge operation, and:
>
>     tmp_objdir_discard_objects(tmp_objdir);
>
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during test
> merges available via refs/pull/N/{merge,rebase}.

Oh, at the contributor summit, Johannes said he only needed pass/fail,
not the actual commits, which is why I suggested this route.  If you
need to keep the actual commits, then this won't help.

I was interested in the same question as Junio, but from a different
angle.  fast-import documentation points out that the packs it creates
are suboptimal with poorer delta choices.  Are the packs created by
bulk-checkin prone to the same issues?  When I was thinking in terms
of having "git merge" use fast-import for pack creation instead of
writing loose objects (an idea I never investigated very far), I was
wondering if I'd need to mark those packs as "less optimal" and do
something to make sure they were more likely to be repacked.

I believe geometric repacking didn't exist back when I was thinking
about this, and perhaps geometric repacking automatically handles
things nicely for us.  Does it, or are we risking retaining
sub-optimal deltas from the bulk-checkin code?

(I've never really cracked open the pack code, so I have absolutely no
idea; I'm just curious.)

> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

A couple of the comments earlier in the series suggested this was
about streaming blobs to a pack in the bulk checkin code.  Are tree
and commit objects also put in the pack, or will those continue to be
written loosely?
Taylor Blau Oct. 8, 2023, 4:04 p.m. UTC | #4
On Sun, Oct 08, 2023 at 12:02:27AM -0700, Elijah Newren wrote:
> On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@ttaylorr.com> wrote:
> >
> > On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > > Taylor Blau <me@ttaylorr.com> writes:
> > >
> > > > When using merge-tree often within a repository[^1], it is possible to
> > > > generate a relatively large number of loose objects, which can result in
> > > > degraded performance, and inode exhaustion in extreme cases.
> > >
> > > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > > to accumulate until "gc" kicks in, so it is not a new problem for
> > > mere mortals, is it?
> >
> > Yeah, I would definitely suspect that this is more of an issue for
> > forges than individual Git users.
>
> It may still be nice to also do this optimization for plain "git
> merge" as well.  I had it in my list of ideas somewhere to do a
> "fast-import-like" thing to avoid writing loose objects, as I
> suspected that'd actually be a performance impediment.

I think that would be worth doing, definitely. I do worry a little bit
about locking in low-quality deltas (or lack thereof), but more on that
below...

> Oh, at the contributor summit, Johannes said he only needed pass/fail,
> not the actual commits, which is why I suggested this route.  If you
> need to keep the actual commits, then this won't help.

Yep, agreed. Like I said earlier, I think there are some niche scenarios
where we just care about "would this merge cleanly?", but in most other
cases we want to keep around the actual tree.

> I was interested in the same question as Junio, but from a different
> angle.  fast-import documentation points out that the packs it creates
> are suboptimal with poorer delta choices.  Are the packs created by
> bulk-checkin prone to the same issues?  When I was thinking in terms
> of having "git merge" use fast-import for pack creation instead of
> writing loose objects (an idea I never investigated very far), I was
> wondering if I'd need to mark those packs as "less optimal" and do
> something to make sure they were more likely to be repacked.
>
> I believe geometric repacking didn't exist back when I was thinking
> about this, and perhaps geometric repacking automatically handles
> things nicely for us.  Does it, or are we risking retaining
> sub-optimal deltas from the bulk-checkin code?
>
> (I've never really cracked open the pack code, so I have absolutely no
> idea; I'm just curious.)

Yes, the bulk-checkin mechanism suffers from an even worse problem, which
is that the pack it creates will contain no deltas whatsoever. The
contents of the pack are just getting written as-is, so there's no fancy
deltification going on.
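
(This is easy to see from the resulting pack itself; e.g., for a
hypothetical pack written by --write-pack:)

    # verify-pack's summary counts objects stored whole; for a
    # bulk-checkin pack, every object should be "non delta"
    git verify-pack -v .git/objects/pack/pack-<hash>.idx | grep "non delta"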

I think Michael Haggerty (?) suggested to me off-list that it might be
interesting to have a flag with which we could mark packs that have
bad/no deltas, so that we don't implicitly trust their contents as having
high quality deltas.

> > I think that like anything, this is a trade-off. Having lots of packs
> > can be a performance hindrance just like having lots of loose objects.
> > But since we can represent more objects with fewer inodes when packed,
> > storing those objects together in a pack is preferable when (a) you're
> > doing lots of test-merges, and (b) you want to keep those objects
> > around, e.g., because they are reachable.
>
> A couple of the comments earlier in the series suggested this was
> about streaming blobs to a pack in the bulk checkin code.  Are tree
> and commit objects also put in the pack, or will those continue to be
> written loosely?

This covers both blobs and trees, since IIUC that's all we'd need to
implement support for merge-tree to be able to write any objects it
creates into a pack. AFAIK merge-tree never generates any commit
objects. But teaching 'merge' to perform the same bulk-checkin trick
would just require us to implement index_bulk_commit_checkin_in_core()
or similar, which is straightforward to do on top of the existing code.

Thanks,
Taylor
Jeff King Oct. 8, 2023, 5:33 p.m. UTC | #5
On Sun, Oct 08, 2023 at 12:04:04PM -0400, Taylor Blau wrote:

> > I was interested in the same question as Junio, but from a different
> > angle.  fast-import documentation points out that the packs it creates
> > are suboptimal with poorer delta choices.  Are the packs created by
> > bulk-checkin prone to the same issues?  When I was thinking in terms
> > of having "git merge" use fast-import for pack creation instead of
> > writing loose objects (an idea I never investigated very far), I was
> > wondering if I'd need to mark those packs as "less optimal" and do
> > something to make sure they were more likely to be repacked.
> >
> > I believe geometric repacking didn't exist back when I was thinking
> > about this, and perhaps geometric repacking automatically handles
> > things nicely for us.  Does it, or are we risking retaining
> > sub-optimal deltas from the bulk-checkin code?
> >
> > (I've never really cracked open the pack code, so I have absolutely no
> > idea; I'm just curious.)
> 
> Yes, the bulk-checkin mechanism suffers from an even worse problem which
> is the pack it creates will contain no deltas whatsoever. The contents
> of the pack are just getting written as-is, so there's no fancy
> delta-ficiation going on.

I wonder how big a deal this would be in practice for merges.
pack-objects will look for deltas between any two candidate objects,
but in practice I think most deltas are between objects from multiple
commits (across the "time" dimension, if you will) rather than within a
single tree (the "space" dimension). And a merge operation is generally
creating a single new tree (recursive merging may create intermediate
states which would delta, but we don't actually need to keep those
intermediate ones; I wouldn't be surprised if we do, though).

We should be able to test that theory by looking at existing deltas.
Here's a script which builds an index of blobs and trees to the commits
that introduce them:

  git rev-list HEAD |
  git diff-tree --stdin -r -m -t --raw |
  perl -lne '
    if (/^[0-9a-f]/) {
      $commit = $_;
    } elsif (/^:\S+ \S+ \S+ (\S+)/) {
      $h{$1} = $commit;
    }
    END { print "$_ $h{$_}" for keys(%h) }
  ' >commits.db

And then we can see which deltas come from the same commit:

  git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)' |
  perl -alne '
    BEGIN {
      open(my $fh, "<", "commits.db");
      %commit = map { chomp; split } <$fh>;
    }
    next if $F[1] =~ /0{40}/; # not a delta
    next unless defined $commit{$F[0]}; # not in index
    print $commit{$F[0]} eq $commit{$F[1]} ? "inner" : "outer", " ", $_;
  '

In git.git, I see 460 "inner" deltas, and 193,081 "outer" ones. The
inner ones are mostly small single-file test vectors, which makes sense.
It's possible to have a merge result that does conflict resolution in
two such files (that would then delta), but it seems like a fairly
unlikely case. Numbers for linux.git are similar.

So it might just not be a big issue at all for this use case.

> I think Michael Haggerty (?) suggested to me off-list that it might be
> interesting to have a flag that we could mark packs with bad/no deltas
> as such so that we don't implicitly trust their contents as having high
> quality deltas.

I was going to suggest the same thing. ;) Unfortunately it's a bit
tricky to do as we have no room in the file format for an optional flag.
You'd have to add a ".mediocre-delta" file or something.

But here's another approach. I recall discussing a while back the idea
that we should not necessarily trust the quality of deltas in packs that
are pushed (and I think Thomas Gummerer even did some experiments inside
GitHub with those, though I don't remember the results). And one way
around that is during geometric repacking to consider the biggest/oldest
pack as "preferred", reuse its deltas, but always compute from scratch
with the others (neither reusing on-disk deltas, nor skipping
try_delta() when two objects come from the same pack).

That same strategy would work here (and for incremental fast-import
packs, though of course not if your fast-import pack is the "big" one
after you do a from-scratch import).

> > A couple of the comments earlier in the series suggested this was
> > about streaming blobs to a pack in the bulk checkin code.  Are tree
> > and commit objects also put in the pack, or will those continue to be
> > written loosely?
> 
> This covers both blobs and trees, since IIUC that's all we'd need to
> implement support for merge-tree to be able to write any objects it
> creates into a pack. AFAIK merge-tree never generates any commit
> objects. But teaching 'merge' to perform the same bulk-checkin trick
> would just require us implementing index_bulk_commit_checkin_in_core()
> or similar, which is straightforward to do on top of the existing code.

This is a bit of a devil's advocate question, but: would it make sense
to implement this as a general feature of Git's object-writing code, and
not tie it to a specific command? That is, what if a user could do:

  git --write-to-pack merge ...

but also:

  git --write-to-pack add ...

and the object-writing code would just write to a pack instead of
writing loose objects. That lets the caller decide when it is or is not
a good idea to use this mode. And if making loose objects gives bad
performance for merges, wouldn't the same be true of other operations
which generate many objects?

Possibly it exacerbates the "no deltas" issue from above (though it
would depend on the command).  The bigger question to me is one of
checkpointing. When do we finish off the pack with a .idx and make it
available to other readers? We could do it at program exit, but I
suspect there are some commands that really want to make objects
available sooner (e.g., as soon as "git add" writes an index, we'd want
those objects to already be available). Probably every program that
writes objects would need to be annotated with a checkpoint call (which
would be a noop in loose object mode).

So maybe it's a dumb direction. I dunno.

-Peff
Taylor Blau Oct. 9, 2023, 1:37 a.m. UTC | #6
On Sun, Oct 08, 2023 at 01:33:29PM -0400, Jeff King wrote:
> On Sun, Oct 08, 2023 at 12:04:04PM -0400, Taylor Blau wrote:
>
> > > I was interested in the same question as Junio, but from a different
> > > angle.  fast-import documentation points out that the packs it creates
> > > are suboptimal with poorer delta choices.  Are the packs created by
> > > bulk-checkin prone to the same issues?  When I was thinking in terms
> > > of having "git merge" use fast-import for pack creation instead of
> > > writing loose objects (an idea I never investigated very far), I was
> > > wondering if I'd need to mark those packs as "less optimal" and do
> > > something to make sure they were more likely to be repacked.
> > >
> > > I believe geometric repacking didn't exist back when I was thinking
> > > about this, and perhaps geometric repacking automatically handles
> > > things nicely for us.  Does it, or are we risking retaining
> > > sub-optimal deltas from the bulk-checkin code?
> > >
> > > (I've never really cracked open the pack code, so I have absolutely no
> > > idea; I'm just curious.)
> >
> > Yes, the bulk-checkin mechanism suffers from an even worse problem which
> > is the pack it creates will contain no deltas whatsoever. The contents
> > of the pack are just getting written as-is, so there's no fancy
> > delta-ficiation going on.
>
> I wonder how big a deal this would be in practice for merges.
> pack-objects will look for deltas between any two candidates objects,
> but in practice I think most deltas are between objects from multiple
> commits (across the "time" dimension, if you will) rather than within a
> single tree (the "space" dimension). And a merge operation is generally
> creating a single new tree (recursive merging may create intermediate
> states which would delta, but we don't actually need to keep those
> intermediate ones. I won't be surprised if we do, though).
>
> We should be able to test that theory by looking at existing deltas.
> Here's a script which builds an index of blobs and trees to the commits
> that introduce them:
>
>   git rev-list HEAD |
>   git diff-tree --stdin -r -m -t --raw |
>   perl -lne '
>     if (/^[0-9a-f]/) {
>       $commit = $_;
>     } elsif (/^:\S+ \S+ \S+ (\S+)/) {
>       $h{$1} = $commit;
>     }
>     END { print "$_ $h{$_}" for keys(%h) }
>   ' >commits.db
>
> And then we can see which deltas come from the same commit:
>
>   git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)' |
>   perl -alne '
>     BEGIN {
>       open(my $fh, "<", "commits.db");
>       %commit = map { chomp; split } <$fh>;
>     }
>     next if $F[1] =~ /0{40}/; # not a delta
>     next unless defined $commit{$F[0]}; # not in index
>     print $commit{$F[0]} eq $commit{$F[1]} ? "inner" : "outer", " ", $_;
>   '
>
> In git.git, I see 460 "inner" deltas, and 193,081 "outer" ones. The
> inner ones are mostly small single-file test vectors, which makes sense.
> It's possible to have a merge result that does conflict resolution in
> two such files (that would then delta), but it seems like a fairly
> unlikely case. Numbers for linux.git are similar.
>
> So it might just not be a big issue at all for this use case.

Very interesting, thanks for running (and documenting!) this experiment.
I'm mostly with you that it probably doesn't make a huge difference in
practice here.

One thing that I'm not entirely clear on is how we'd treat objects that
could be good delta candidates for each other between two packs. For
instance, if I write a tree corresponding to the merge between two
branches, it's likely that the resulting tree would be a good delta
candidate against either of the trees at the tips of those two refs.

But we won't pack those trees (the ones at the tips of the refs) in the
same pack as the tree containing their merge. If we later on tried to
repack, would we evaluate the tip trees as possible delta candidates
against the merged tree? Or would we look at the merged tree, realize it
isn't delta'd with anything, and then not attempt to find any
candidates?

> > I think Michael Haggerty (?) suggested to me off-list that it might be
> > interesting to have a flag that we could mark packs with bad/no deltas
> > as such so that we don't implicitly trust their contents as having high
> > quality deltas.
>
> I was going to suggest the same thing. ;) Unfortunately it's a bit
> tricky to do as we have no room in the file format for an optional flag.
> You'd have to add a ".mediocre-delta" file or something.

Yeah, I figured that we'd add a new ".baddeltas" file or something. (As
an aside, we probably should have an optional flags section in the .pack
format, since we seem to have a lot of optional pack extensions: .rev,
.bitmap, .keep, .promisor, etc.)

> But here's another approach. I recall discussing a while back the idea
> that we should not necessarily trust the quality of deltas in packs that
> are pushed (and I think Thomas Gummerer even did some experiments inside
> GitHub with those, though I don't remember the results). And one way
> around that is during geometric repacking to consider the biggest/oldest
> pack as "preferred", reuse its deltas, but always compute from scratch
> with the others (neither reusing on-disk deltas, nor skipping
> try_delta() when two objects come from the same pack).
>
> That same strategy would work here (and for incremental fast-import
> packs, though of course not if your fast-import pack is the "big" one
> after you do a from-scratch import).

Yeah, definitely. I think that that code is still living in GitHub's
fork, but inactive since we haven't set any of the relevant
configuration in GitHub's production environment.

> Possibly it exacerbates the "no deltas" issue from above (though it
> would depend on the command).  The bigger question to me is one of
> checkpointing. When do we finish off the pack with a .idx and make it
> available to other readers? We could do it at program exit, but I
> suspect there are some commands that really want to make objects
> available sooner (e.g., as soon as "git add" writes an index, we'd want
> those objects to already be available). Probably every program that
> writes objects would need to be annotated with a checkpoint call (which
> would be a noop in loose object mode).
>
> So maybe it's a dumb direction. I dunno.

I wouldn't say it's a dumb direction ;-). But I'd be hesitant to pursue
it without solving the "no deltas" question from earlier.

Thanks,
Taylor
Patrick Steinhardt Oct. 9, 2023, 10:54 a.m. UTC | #7
On Fri, Oct 06, 2023 at 07:02:05PM -0400, Taylor Blau wrote:
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
> 
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.
> 
> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which
> > majority of the object their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
> 
> Right.
> 
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each came from such a trial merge.  Do we have a "gc" story offered
> > for these packs created by the new feature?  E.g., "once merge-tree
> > is done creating a trial merge, we can discard the objects created
> > in the pack, because we never expose new objects in the pack to the
> > outside, processes running simultaneously, so instead closing the
> > new packfile by calling flush_bulk_checkin_packfile(), we can safely
> > unlink the temporary pack.  We do not even need to spend cycles to
> > run a gc that requires cycles to enumerate what is still reachable",
> > or something like that?
> 
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
> 
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
> 
> at the beginning of a merge operation, and:
> 
>     tmp_objdir_discard_objects(tmp_objdir);
> 
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during test
> merges available via refs/pull/N/{merge,rebase}.
> 
> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

In Gitaly, we usually set up quarantine directories for all operations
that create objects. This allows us to discard any newly written objects
in case either the RPC call gets cancelled or in case our access checks
determine that the change should not be allowed. The logic is rather
simple (a rough sketch in shell follows the list):

    1. Create a new temporary directory.

    2. Set up the new temporary directory as main object database via
       the `GIT_OBJECT_DIRECTORY` environment variable.

    3. Set up the main repository's object database via the
       `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.

    4. Execute Git commands that write objects with these environment
       variables set up. The new objects will end up neatly contained in
       the temporary directory.

    5. Once done, either discard the temporary object database or
       migrate objects into the main object database.
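
A rough sketch of the above in shell (paths, and the exact migration
step, are illustrative):

    # 1+2: a fresh temporary directory becomes the primary object database
    quarantine=$(mktemp -d)
    export GIT_OBJECT_DIRECTORY="$quarantine"

    # 3: the repository's real object database stays reachable as an alternate
    export GIT_ALTERNATE_OBJECT_DIRECTORIES=/path/to/repo.git/objects

    # 4: objects written by this command land in the quarantine directory
    git --git-dir=/path/to/repo.git merge-tree --write-tree side other

    # 5: either discard the quarantine wholesale ...
    rm -rf "$quarantine"
    # ... or migrate its contents into the main object database first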

I wonder whether this would be a viable approach for you, as well.

Patrick
Taylor Blau Oct. 9, 2023, 4:08 p.m. UTC | #8
On Mon, Oct 09, 2023 at 12:54:12PM +0200, Patrick Steinhardt wrote:
> In Gitaly, we usually set up quarantine directories for all operations
> that create objects. This allows us to discard any newly written objects
> in case either the RPC call gets cancelled or in case our access checks
> determine that the change should not be allowed. The logic is rather
> simple:
>
>     1. Create a new temporary directory.
>
>     2. Set up the new temporary directory as main object database via
>        the `GIT_OBJECT_DIRECTORY` environment variable.
>
>     3. Set up the main repository's object database via the
>        `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.

Is there a reason not to run Git in the quarantine environment and list
the main repository as an alternate via $GIT_DIR/objects/info/alternates
instead of the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable?

>     4. Execute Git commands that write objects with these environment
>        variables set up. The new objects will end up neatly contained in
>        the temporary directory.
>
>     5. Once done, either discard the temporary object database or
>        migrate objects into the main object daatabase.

Interesting. I'm curious why you don't use the builtin tmp_objdir
mechanism in Git itself. Do you need to run more than one command in the
quarantine environment? If so, it makes sense that you'd want to have
a scratch repository that lasts beyond the lifetime of a single process.

> I wonder whether this would be a viable approach for you, as well.

I think that the main problem that we are trying to solve with this
series is creating a potentially large number of loose objects. I think
that you could do something like what you propose above, with a 'git
repack -adk' before moving its objects back over to the main repository.

But since we're working in a single process only when doing a merge-tree
operation, I think it is probably more expedient to write the pack's
bytes directly.

> Patrick


Thanks,
Taylor
Junio C Hamano Oct. 9, 2023, 5:24 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

>> Yes, the bulk-checkin mechanism suffers from an even worse problem which
>> is the pack it creates will contain no deltas whatsoever. The contents
>> of the pack are just getting written as-is, so there's no fancy
>> delta-ficiation going on.
>
> I wonder how big a deal this would be in practice for merges.
> ...

Thanks for your experiments ;-)

The reason why the bulk-checkin mechanism does not attempt deltifying
(as opposed to fast-import, which attempts to deltify with the
immediately previous object and only with that single object) is
exactly the same.  It was done to support the initial check-in,
which by definition lacks the delta opportunity along the time axis.

As you describe, such a delta-less pack would risk missed
deltification opportunities when running a repack (without "-f"),
because the opposite of the well-known "reuse delta" heuristic, namely
"this object was stored in base form, so the previous pack-objects run
likely tried but did not find a good delta base for it; let's not
waste time retrying that", would get in the way.
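
(For reference, a repack that ignores such prior decisions and recomputes
deltas from scratch can be forced with:)

    # "-f" passes --no-reuse-delta to pack-objects, giving objects that
    # were stored in base form another chance at finding a delta base
    git repack -a -d -f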
Jeff King Oct. 9, 2023, 8:21 p.m. UTC | #10
On Sun, Oct 08, 2023 at 09:37:42PM -0400, Taylor Blau wrote:

> Very interesting, thanks for running (and documenting!) this experiment.
> I'm mostly with you that it probably doesn't make a huge difference in
> practice here.
> 
> One thing that I'm not entirely clear on is how we'd treat objects that
> could be good delta candidates for each other between two packs. For
> instance, if I write a tree corresponding to the merge between two
> branches, it's likely that the resulting tree would be a good delta
> candidate against either of the trees at the tips of those two refs.
> 
> But we won't pack those trees (the ones at the tips of the refs) in the
> same pack as the tree containing their merge. If we later on tried to
> repack, would we evaluate the tip trees as possible delta candidates
> against the merged tree? Or would we look at the merged tree, realize it
> isn't delta'd with anything, and then not attempt to find any
> candidates?

When we repack (either all-into-one, or in a geometric roll-up), we
should consider those trees as candidates. The only deltas we don't
consider are:

  - if something is already a delta in a pack, then we will usually
    reuse that delta verbatim (so you might get fooled by a mediocre
    delta and not go to the trouble to search again. But I don't think
    that applies here; there is no other tree in your new pack to make
    such a mediocre delta from, and anyway you are skipping deltas
    entirely)

  - if two objects are in the same pack but there's no delta
    relationship, the try_delta() heuristics will skip them immediately
    (under the assumption that we tried during the last repack and
    didn't find anything good).

So yes, if you had a big old pack with the original trees, and then a
new pack with the merge result, we should try to delta the new merge
result tree against the others, just as we would if it were loose.
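
(One way to check, after such a roll-up, whether the merge result's tree
actually picked up a delta base; "$tree" here stands for the tree written
by merge-tree:)

    echo "$tree" |
    git cat-file --batch-check='%(objectname) %(deltabase)'
    # a base object (no delta) reports the null OID as its delta base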

> > I was going to suggest the same thing. ;) Unfortunately it's a bit
> > tricky to do as we have no room in the file format for an optional flag.
> > You'd have to add a ".mediocre-delta" file or something.
> 
> Yeah, I figured that we'd add a new ".baddeltas" file or something. (As
> an aside, we probably should have an optional flags section in the .pack
> format, since we seem to have a lot of optional pack extensions: .rev,
> .bitmap, .keep, .promisor, etc.)

Yes, though since packv2 is the on-the-wire format, it's very hard to
change now. It might be easier to put an annotation into the .idx file.

-Peff
Patrick Steinhardt Oct. 10, 2023, 6:36 a.m. UTC | #11
On Mon, Oct 09, 2023 at 12:08:32PM -0400, Taylor Blau wrote:
> On Mon, Oct 09, 2023 at 12:54:12PM +0200, Patrick Steinhardt wrote:
> > In Gitaly, we usually set up quarantine directories for all operations
> > that create objects. This allows us to discard any newly written objects
> > in case either the RPC call gets cancelled or in case our access checks
> > determine that the change should not be allowed. The logic is rather
> > simple:
> >
> >     1. Create a new temporary directory.
> >
> >     2. Set up the new temporary directory as main object database via
> >        the `GIT_OBJECT_DIRECTORY` environment variable.
> >
> >     3. Set up the main repository's object database via the
> >        `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.
> 
> Is there a reason not to run Git in the quarantine environment and list
> the main repository as an alternate via $GIT_DIR/objects/info/alternates
> instead of the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable?

The quarantine environment as we use it is really only a single object
database, so you cannot run Git inside of it directly.

> >     4. Execute Git commands that write objects with these environment
> >        variables set up. The new objects will end up neatly contained in
> >        the temporary directory.
> >
> >     5. Once done, either discard the temporary object database or
> >        migrate objects into the main object daatabase.
> 
> Interesting. I'm curious why you don't use the builtin tmp_objdir
> mechanism in Git itself. Do you need to run more than one command in the
> quarantine environment? If so, that makes sense that you'd want to have
> a scratch repository that lasts beyond the lifetime of a single process.

It's a mixture of things:

    - Many commands simply don't set up a temporary object directory.

    - We want to check the result after the objects have been generated.
      Many of the commands don't provide hooks to do so in a reasonable
      way. So we want to check the result _after_ the command has exited
      already, and objects should not yet have been migrated into the
      target object database at that point.

    - Sometimes we indeed want to run multiple Git commands. We use this
      e.g. for worktreeless rebases, where we run a succession of
      commands to rebase every single commit.

So ultimately, our own quarantine directory sits at a conceptually
higher level than what Git commands would be able to provide.

> > I wonder whether this would be a viable approach for you, as well.
> 
> I think that the main problem that we are trying to solve with this
> series is creating a potentially large number of loose objects. I think
> that you could do something like what you propose above, with a 'git
> repacks -adk' before moving its objects over back to the main repository.

Yeah, that's certainly possible. I had the feeling that there are two
different problems that we're trying to solve in this thread:

    - The problem that the objects are of temporary nature. Regardless
      of whether they are in a packfile or loose, it doesn't make much
      sense to put them into the main object database as they would be
      unreachable anyway and thus get pruned after some time.

    - The problem that writing many loose objects can be slow.

My proposal only addresses the first problem.

> But since we're working in a single process only when doing a merge-tree
> operation, I think it is probably more expedient to write the pack's
> bytes directly.

If it's about efficiency and not about how we discard those objects
after the command has run to completion, then yes, I agree.

Patrick

Patch

diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
index ffc4fbf7e8..9d37609ef1 100644
--- a/Documentation/git-merge-tree.txt
+++ b/Documentation/git-merge-tree.txt
@@ -69,6 +69,10 @@  OPTIONS
 	specify a merge-base for the merge, and specifying multiple bases is
 	currently not supported. This option is incompatible with `--stdin`.
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+
 [[OUTPUT]]
 OUTPUT
 ------
diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
index 0de42aecf4..672ebd4c54 100644
--- a/builtin/merge-tree.c
+++ b/builtin/merge-tree.c
@@ -18,6 +18,7 @@ 
 #include "quote.h"
 #include "tree.h"
 #include "config.h"
+#include "bulk-checkin.h"
 
 static int line_termination = '\n';
 
@@ -414,6 +415,7 @@  struct merge_tree_options {
 	int show_messages;
 	int name_only;
 	int use_stdin;
+	int write_pack;
 };
 
 static int real_merge(struct merge_tree_options *o,
@@ -440,6 +442,7 @@  static int real_merge(struct merge_tree_options *o,
 	init_merge_options(&opt, the_repository);
 
 	opt.show_rename_progress = 0;
+	opt.write_pack = o->write_pack;
 
 	opt.branch1 = branch1;
 	opt.branch2 = branch2;
@@ -548,6 +551,8 @@  int cmd_merge_tree(int argc, const char **argv, const char *prefix)
 			   &merge_base,
 			   N_("commit"),
 			   N_("specify a merge-base for the merge")),
+		OPT_BOOL(0, "write-pack", &o.write_pack,
+			 N_("write new objects to a pack instead of as loose")),
 		OPT_END()
 	};
 
diff --git a/merge-ort.c b/merge-ort.c
index 8631c99700..85d8c5c6b3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -48,6 +48,7 @@ 
 #include "tree.h"
 #include "unpack-trees.h"
 #include "xdiff-interface.h"
+#include "bulk-checkin.h"
 
 /*
  * We have many arrays of size 3.  Whenever we have such an array, the
@@ -2124,11 +2125,19 @@  static int handle_content_merge(struct merge_options *opt,
 		if ((merge_status < 0) || !result_buf.ptr)
 			ret = err(opt, _("Failed to execute internal merge"));
 
-		if (!ret &&
-		    write_object_file(result_buf.ptr, result_buf.size,
-				      OBJ_BLOB, &result->oid))
-			ret = err(opt, _("Unable to add %s to database"),
-				  path);
+		if (!ret) {
+			ret = opt->write_pack
+				? index_blob_bulk_checkin_incore(&result->oid,
+								 result_buf.ptr,
+								 result_buf.size,
+								 path, 1)
+				: write_object_file(result_buf.ptr,
+						    result_buf.size,
+						    OBJ_BLOB, &result->oid);
+			if (ret)
+				ret = err(opt, _("Unable to add %s to database"),
+					  path);
+		}
 
 		free(result_buf.ptr);
 		if (ret)
@@ -3618,7 +3627,8 @@  static int tree_entry_order(const void *a_, const void *b_)
 				 b->string, strlen(b->string), bmi->result.mode);
 }
 
-static int write_tree(struct object_id *result_oid,
+static int write_tree(struct merge_options *opt,
+		      struct object_id *result_oid,
 		      struct string_list *versions,
 		      unsigned int offset,
 		      size_t hash_size)
@@ -3652,8 +3662,14 @@  static int write_tree(struct object_id *result_oid,
 	}
 
 	/* Write this object file out, and record in result_oid */
-	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
+	ret = opt->write_pack
+		? index_tree_bulk_checkin_incore(result_oid,
+						 buf.buf, buf.len, "", 1)
+		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
+
+	if (ret)
 		ret = -1;
+
 	strbuf_release(&buf);
 	return ret;
 }
@@ -3818,8 +3834,8 @@  static int write_completed_directory(struct merge_options *opt,
 		 */
 		dir_info->is_null = 0;
 		dir_info->result.mode = S_IFDIR;
-		if (write_tree(&dir_info->result.oid, &info->versions, offset,
-			       opt->repo->hash_algo->rawsz) < 0)
+		if (write_tree(opt, &dir_info->result.oid, &info->versions,
+			       offset, opt->repo->hash_algo->rawsz) < 0)
 			ret = -1;
 	}
 
@@ -4353,9 +4369,13 @@  static int process_entries(struct merge_options *opt,
 		fflush(stdout);
 		BUG("dir_metadata accounting completely off; shouldn't happen");
 	}
-	if (write_tree(result_oid, &dir_metadata.versions, 0,
+	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
 		       opt->repo->hash_algo->rawsz) < 0)
 		ret = -1;
+
+	if (opt->write_pack)
+		end_odb_transaction();
+
 cleanup:
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
@@ -4899,6 +4919,9 @@  static void merge_start(struct merge_options *opt, struct merge_result *result)
 	 */
 	strmap_init(&opt->priv->conflicts);
 
+	if (opt->write_pack)
+		begin_odb_transaction();
+
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
diff --git a/merge-recursive.h b/merge-recursive.h
index b88000e3c2..156e160876 100644
--- a/merge-recursive.h
+++ b/merge-recursive.h
@@ -48,6 +48,7 @@  struct merge_options {
 	unsigned renormalize : 1;
 	unsigned record_conflict_msgs_as_headers : 1;
 	const char *msg_header_prefix;
+	unsigned write_pack : 1;
 
 	/* internal fields used by the implementation */
 	struct merge_options_internal *priv;
diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
index 250f721795..2d81ff4de5 100755
--- a/t/t4301-merge-tree-write-tree.sh
+++ b/t/t4301-merge-tree-write-tree.sh
@@ -922,4 +922,97 @@  test_expect_success 'check the input format when --stdin is passed' '
 	test_cmp expect actual
 '
 
+packdir=".git/objects/pack"
+
+test_expect_success 'merge-tree can pack its result with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+
+	# base has lines [3, 4, 5]
+	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
+	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
+	#
+	# merging the two should result in a new blob object containing
+	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
+	test_commit -C repo base file "$(test_seq 3 5)" &&
+	git -C repo branch -M main &&
+	git -C repo checkout -b side main &&
+	test_commit -C repo side file "$(test_seq 1 5)" &&
+	git -C repo checkout -b other main &&
+	test_commit -C repo other file "$(test_seq 3 7)" &&
+
+	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
+	tree="$(git -C repo merge-tree --write-pack \
+		refs/tags/side refs/tags/other)" &&
+	blob="$(git -C repo rev-parse $tree:file)" &&
+	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&
+
+	test_must_be_empty packs.before &&
+	test_line_count = 1 packs.after &&
+
+	git show-index <$(cat packs.after) >objects &&
+	test_line_count = 2 objects &&
+	grep "^[1-9][0-9]* $tree" objects &&
+	grep "^[1-9][0-9]* $blob" objects
+'
+
+test_expect_success 'merge-tree can write multiple packs with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		git config pack.packSizeLimit 512 &&
+
+		test_seq 512 >f &&
+
+		# "f" contains roughly ~2,000 bytes.
+		#
+		# Each side ("foo" and "bar") adds a small amount of data at the
+		# beginning and end of "base", respectively.
+		git add f &&
+		test_tick &&
+		git commit -m base &&
+		git branch -M main &&
+
+		git checkout -b foo main &&
+		{
+			echo foo && cat f
+		} >f.tmp &&
+		mv f.tmp f &&
+		git add f &&
+		test_tick &&
+		git commit -m foo &&
+
+		git checkout -b bar main &&
+		echo bar >>f &&
+		git add f &&
+		test_tick &&
+		git commit -m bar &&
+
+		find $packdir -type f -name "pack-*.idx" >packs.before &&
+		# Merging either side should result in a new object which is
+		# larger than the configured pack.packSizeLimit, thus the
+		# result should be split into two separate packs.
+		tree="$(git merge-tree --write-pack \
+			refs/heads/foo refs/heads/bar)" &&
+		blob="$(git rev-parse $tree:f)" &&
+		find $packdir -type f -name "pack-*.idx" >packs.after &&
+
+		test_must_be_empty packs.before &&
+		test_line_count = 2 packs.after &&
+		for idx in $(cat packs.after)
+		do
+			git show-index <$idx || return 1
+		done >objects &&
+
+		# The resulting set of packs should contain one copy of both
+		# objects, each in a separate pack.
+		test_line_count = 2 objects &&
+		grep "^[1-9][0-9]* $tree" objects &&
+		grep "^[1-9][0-9]* $blob" objects
+
+	)
+'
+
 test_done