Message ID | e96921014557edb41dd73d93a8c3cf6cfaf0c719.1696629697.git.me@ttaylorr.com
---|---
State | Superseded
Series | merge-ort: implement support for packing objects together
Taylor Blau <me@ttaylorr.com> writes:

> When using merge-tree often within a repository[^1], it is possible to
> generate a relatively large number of loose objects, which can result in
> degraded performance, and inode exhaustion in extreme cases.

Well, be it "git merge-tree" or "git merge", new loose objects tend
to accumulate until "gc" kicks in, so it is not a new problem for
mere mortals, is it?

As one "interesting" use case of "merge-tree" is for a Git hosting
site with bare repositories to offer trial merges, without which the
majority of the objects their repositories acquire would have been in
packs pushed by their users, "Gee, loose objects consume many inodes
in exchange for easier selective pruning" becomes an issue, right?

Just like it hurts performance to have too many loose object files,
presumably it would also hurt performance to keep too many packs,
each of which came from such a trial merge. Do we have a "gc" story
offered for these packs created by the new feature? E.g., "once
merge-tree is done creating a trial merge, we can discard the objects
created in the pack, because we never expose new objects in the pack
to the outside (processes running simultaneously), so instead of
closing the new packfile by calling flush_bulk_checkin_packfile(), we
can safely unlink the temporary pack. We do not even need to run a gc
that spends cycles enumerating what is still reachable", or something
like that?

Thanks.
On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > When using merge-tree often within a repository[^1], it is possible to
> > generate a relatively large number of loose objects, which can result in
> > degraded performance, and inode exhaustion in extreme cases.
>
> Well, be it "git merge-tree" or "git merge", new loose objects tend
> to accumulate until "gc" kicks in, so it is not a new problem for
> mere mortals, is it?

Yeah, I would definitely suspect that this is more of an issue for
forges than individual Git users.

> As one "interesting" use case of "merge-tree" is for a Git hosting
> site with bare repositories to offer trial merges, without which the
> majority of the objects their repositories acquire would have been in
> packs pushed by their users, "Gee, loose objects consume many inodes
> in exchange for easier selective pruning" becomes an issue, right?

Right.

> Just like it hurts performance to have too many loose object files,
> presumably it would also hurt performance to keep too many packs,
> each of which came from such a trial merge. Do we have a "gc" story
> offered for these packs created by the new feature? E.g., "once
> merge-tree is done creating a trial merge, we can discard the objects
> created in the pack, because we never expose new objects in the pack
> to the outside (processes running simultaneously), so instead of
> closing the new packfile by calling flush_bulk_checkin_packfile(), we
> can safely unlink the temporary pack. We do not even need to run a gc
> that spends cycles enumerating what is still reachable", or something
> like that?

I know Johannes worked on something like this recently. IIRC, it
effectively does something like:

    struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
    tmp_objdir_replace_primary_odb(tmp_objdir, 1);

at the beginning of a merge operation, and:

    tmp_objdir_discard_objects(tmp_objdir);

at the end. I haven't followed that work off-list very closely, but it
is only possible for GitHub to discard certain niche kinds of
merges/rebases, since in general we make the objects created during
test merges available via refs/pull/N/{merge,rebase}.

I think that like anything, this is a trade-off. Having lots of packs
can be a performance hindrance just like having lots of loose objects.
But since we can represent more objects with fewer inodes when packed,
storing those objects together in a pack is preferable when (a) you're
doing lots of test-merges, and (b) you want to keep those objects
around, e.g., because they are reachable.

Thanks,
Taylor
On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
>
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.

It may still be nice to also do this optimization for plain "git
merge" as well. I had it in my list of ideas somewhere to do a
"fast-import-like" thing to avoid writing loose objects, as I
suspected that'd actually be a performance impediment.

> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which the
> > majority of the objects their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
>
> Right.
>
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each of which came from such a trial merge. Do we have a "gc" story
> > offered for these packs created by the new feature? E.g., "once
> > merge-tree is done creating a trial merge, we can discard the objects
> > created in the pack, because we never expose new objects in the pack
> > to the outside (processes running simultaneously), so instead of
> > closing the new packfile by calling flush_bulk_checkin_packfile(), we
> > can safely unlink the temporary pack. We do not even need to run a gc
> > that spends cycles enumerating what is still reachable", or something
> > like that?
>
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
>
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
>
> at the beginning of a merge operation, and:
>
>     tmp_objdir_discard_objects(tmp_objdir);
>
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during
> test merges available via refs/pull/N/{merge,rebase}.

Oh, at the contributor summit, Johannes said he only needed pass/fail,
not the actual commits, which is why I suggested this route. If you
need to keep the actual commits, then this won't help.

I was interested in the same question as Junio, but from a different
angle. fast-import documentation points out that the packs it creates
are suboptimal with poorer delta choices. Are the packs created by
bulk-checkin prone to the same issues? When I was thinking in terms
of having "git merge" use fast-import for pack creation instead of
writing loose objects (an idea I never investigated very far), I was
wondering if I'd need to mark those packs as "less optimal" and do
something to make sure they were more likely to be repacked.

I believe geometric repacking didn't exist back when I was thinking
about this, and perhaps geometric repacking automatically handles
things nicely for us. Does it, or are we risking retaining
sub-optimal deltas from the bulk-checkin code?

(I've never really cracked open the pack code, so I have absolutely no
idea; I'm just curious.)

> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

A couple of the comments earlier in the series suggested this was
about streaming blobs to a pack in the bulk checkin code. Are tree
and commit objects also put in the pack, or will those continue to be
written loosely?
On Sun, Oct 08, 2023 at 12:02:27AM -0700, Elijah Newren wrote:
> On Fri, Oct 6, 2023 at 4:02 PM Taylor Blau <me@ttaylorr.com> wrote:
> >
> > On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > > Taylor Blau <me@ttaylorr.com> writes:
> > >
> > > > When using merge-tree often within a repository[^1], it is possible to
> > > > generate a relatively large number of loose objects, which can result in
> > > > degraded performance, and inode exhaustion in extreme cases.
> > >
> > > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > > to accumulate until "gc" kicks in, so it is not a new problem for
> > > mere mortals, is it?
> >
> > Yeah, I would definitely suspect that this is more of an issue for
> > forges than individual Git users.
>
> It may still be nice to also do this optimization for plain "git
> merge" as well. I had it in my list of ideas somewhere to do a
> "fast-import-like" thing to avoid writing loose objects, as I
> suspected that'd actually be a performance impediment.

I think that would be worth doing, definitely. I do worry a little bit
about locking in low-quality deltas (or lack thereof), but more on that
below...

> Oh, at the contributor summit, Johannes said he only needed pass/fail,
> not the actual commits, which is why I suggested this route. If you
> need to keep the actual commits, then this won't help.

Yep, agreed. Like I said earlier, I think there are some niche
scenarios where we just care about "would this merge cleanly?", but in
most other cases we want to keep around the actual tree.

> I was interested in the same question as Junio, but from a different
> angle. fast-import documentation points out that the packs it creates
> are suboptimal with poorer delta choices. Are the packs created by
> bulk-checkin prone to the same issues? When I was thinking in terms
> of having "git merge" use fast-import for pack creation instead of
> writing loose objects (an idea I never investigated very far), I was
> wondering if I'd need to mark those packs as "less optimal" and do
> something to make sure they were more likely to be repacked.
>
> I believe geometric repacking didn't exist back when I was thinking
> about this, and perhaps geometric repacking automatically handles
> things nicely for us. Does it, or are we risking retaining
> sub-optimal deltas from the bulk-checkin code?
>
> (I've never really cracked open the pack code, so I have absolutely no
> idea; I'm just curious.)

Yes, the bulk-checkin mechanism suffers from an even worse problem,
which is that the pack it creates will contain no deltas whatsoever.
The contents of the pack are just getting written as-is, so there's no
fancy deltification going on.

I think Michael Haggerty (?) suggested to me off-list that it might be
interesting to have a flag that we could mark packs with bad/no deltas
as such so that we don't implicitly trust their contents as having
high-quality deltas.

> > I think that like anything, this is a trade-off. Having lots of packs
> > can be a performance hindrance just like having lots of loose objects.
> > But since we can represent more objects with fewer inodes when packed,
> > storing those objects together in a pack is preferable when (a) you're
> > doing lots of test-merges, and (b) you want to keep those objects
> > around, e.g., because they are reachable.
>
> A couple of the comments earlier in the series suggested this was
> about streaming blobs to a pack in the bulk checkin code. Are tree
> and commit objects also put in the pack, or will those continue to be
> written loosely?

This covers both blobs and trees, since IIUC that's all we'd need to
implement support for merge-tree to be able to write any objects it
creates into a pack. AFAIK merge-tree never generates any commit
objects. But teaching 'merge' to perform the same bulk-checkin trick
would just require us to implement index_bulk_commit_checkin_in_core()
or similar, which is straightforward to do on top of the existing code.

Thanks,
Taylor
On Sun, Oct 08, 2023 at 12:04:04PM -0400, Taylor Blau wrote:
> > I was interested in the same question as Junio, but from a different
> > angle. fast-import documentation points out that the packs it creates
> > are suboptimal with poorer delta choices. Are the packs created by
> > bulk-checkin prone to the same issues? When I was thinking in terms
> > of having "git merge" use fast-import for pack creation instead of
> > writing loose objects (an idea I never investigated very far), I was
> > wondering if I'd need to mark those packs as "less optimal" and do
> > something to make sure they were more likely to be repacked.
> >
> > I believe geometric repacking didn't exist back when I was thinking
> > about this, and perhaps geometric repacking automatically handles
> > things nicely for us. Does it, or are we risking retaining
> > sub-optimal deltas from the bulk-checkin code?
> >
> > (I've never really cracked open the pack code, so I have absolutely no
> > idea; I'm just curious.)
>
> Yes, the bulk-checkin mechanism suffers from an even worse problem,
> which is that the pack it creates will contain no deltas whatsoever.
> The contents of the pack are just getting written as-is, so there's no
> fancy deltification going on.

I wonder how big a deal this would be in practice for merges.
pack-objects will look for deltas between any two candidate objects,
but in practice I think most deltas are between objects from multiple
commits (across the "time" dimension, if you will) rather than within a
single tree (the "space" dimension). And a merge operation is generally
creating a single new tree (recursive merging may create intermediate
states which would delta, but we don't actually need to keep those
intermediate ones. I won't be surprised if we do, though).

We should be able to test that theory by looking at existing deltas.
Here's a script which builds an index of blobs and trees to the commits
that introduce them:

  git rev-list HEAD |
  git diff-tree --stdin -r -m -t --raw |
  perl -lne '
    if (/^[0-9a-f]/) {
            $commit = $_;
    } elsif (/^:\S+ \S+ \S+ (\S+)/) {
            $h{$1} = $commit;
    }
    END { print "$_ $h{$_}" for keys(%h) }
  ' >commits.db

And then we can see which deltas come from the same commit:

  git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)' |
  perl -alne '
    BEGIN {
            open(my $fh, "<", "commits.db");
            %commit = map { chomp; split } <$fh>;
    }
    next if $F[1] =~ /0{40}/; # not a delta
    next unless defined $commit{$F[0]}; # not in index
    print $commit{$F[0]} eq $commit{$F[1]} ? "inner" : "outer", " ", $_;
  '

In git.git, I see 460 "inner" deltas, and 193,081 "outer" ones. The
inner ones are mostly small single-file test vectors, which makes
sense. It's possible to have a merge result that does conflict
resolution in two such files (that would then delta), but it seems like
a fairly unlikely case. Numbers for linux.git are similar.

So it might just not be a big issue at all for this use case.

> I think Michael Haggerty (?) suggested to me off-list that it might be
> interesting to have a flag that we could mark packs with bad/no deltas
> as such so that we don't implicitly trust their contents as having
> high-quality deltas.

I was going to suggest the same thing. ;) Unfortunately it's a bit
tricky to do as we have no room in the file format for an optional
flag. You'd have to add a ".mediocre-delta" file or something.

But here's another approach. I recall discussing a while back the idea
that we should not necessarily trust the quality of deltas in packs
that are pushed (and I think Thomas Gummerer even did some experiments
inside GitHub with those, though I don't remember the results). And one
way around that is during geometric repacking to consider the
biggest/oldest pack as "preferred", reuse its deltas, but always
compute from scratch with the others (neither reusing on-disk deltas,
nor skipping try_delta() when two objects come from the same pack).

That same strategy would work here (and for incremental fast-import
packs, though of course not if your fast-import pack is the "big" one
after you do a from-scratch import).

> > A couple of the comments earlier in the series suggested this was
> > about streaming blobs to a pack in the bulk checkin code. Are tree
> > and commit objects also put in the pack, or will those continue to be
> > written loosely?
>
> This covers both blobs and trees, since IIUC that's all we'd need to
> implement support for merge-tree to be able to write any objects it
> creates into a pack. AFAIK merge-tree never generates any commit
> objects. But teaching 'merge' to perform the same bulk-checkin trick
> would just require us to implement index_bulk_commit_checkin_in_core()
> or similar, which is straightforward to do on top of the existing code.

This is a bit of a devil's advocate question, but: would it make sense
to implement this as a general feature of Git's object-writing code,
and not tie it to a specific command? That is, what if a user could do:

  git --write-to-pack merge ...

but also:

  git --write-to-pack add ...

and the object-writing code would just write to a pack instead of
writing loose objects. That lets the caller decide when it is or is not
a good idea to use this mode. And if making loose objects gives bad
performance for merges, wouldn't the same be true of other operations
which generate many objects?

Possibly it exacerbates the "no deltas" issue from above (though it
would depend on the command). The bigger question to me is one of
checkpointing. When do we finish off the pack with a .idx and make it
available to other readers? We could do it at program exit, but I
suspect there are some commands that really want to make objects
available sooner (e.g., as soon as "git add" writes an index, we'd want
those objects to already be available). Probably every program that
writes objects would need to be annotated with a checkpoint call (which
would be a noop in loose object mode).

So maybe it's a dumb direction. I dunno.

-Peff
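For reference, the geometric roll-up mentioned above can be sketched
roughly as follows (these are real `git repack` options; the sketch is
illustrative and not part of this series):

  # Roll up packs so that each remaining pack contains at least twice
  # as many objects as the next-smaller one; small per-merge packs get
  # combined. Note that on-disk deltas are normally reused verbatim,
  # which is exactly the "trusting mediocre deltas" concern above.
  git repack --geometric=2 -d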
On Sun, Oct 08, 2023 at 01:33:29PM -0400, Jeff King wrote:
> On Sun, Oct 08, 2023 at 12:04:04PM -0400, Taylor Blau wrote:
> > > I was interested in the same question as Junio, but from a different
> > > angle. fast-import documentation points out that the packs it creates
> > > are suboptimal with poorer delta choices. Are the packs created by
> > > bulk-checkin prone to the same issues? When I was thinking in terms
> > > of having "git merge" use fast-import for pack creation instead of
> > > writing loose objects (an idea I never investigated very far), I was
> > > wondering if I'd need to mark those packs as "less optimal" and do
> > > something to make sure they were more likely to be repacked.
> > >
> > > I believe geometric repacking didn't exist back when I was thinking
> > > about this, and perhaps geometric repacking automatically handles
> > > things nicely for us. Does it, or are we risking retaining
> > > sub-optimal deltas from the bulk-checkin code?
> > >
> > > (I've never really cracked open the pack code, so I have absolutely no
> > > idea; I'm just curious.)
> >
> > Yes, the bulk-checkin mechanism suffers from an even worse problem,
> > which is that the pack it creates will contain no deltas whatsoever.
> > The contents of the pack are just getting written as-is, so there's no
> > fancy deltification going on.
>
> I wonder how big a deal this would be in practice for merges.
> pack-objects will look for deltas between any two candidate objects,
> but in practice I think most deltas are between objects from multiple
> commits (across the "time" dimension, if you will) rather than within a
> single tree (the "space" dimension). And a merge operation is generally
> creating a single new tree (recursive merging may create intermediate
> states which would delta, but we don't actually need to keep those
> intermediate ones. I won't be surprised if we do, though).
>
> We should be able to test that theory by looking at existing deltas.
> Here's a script which builds an index of blobs and trees to the commits
> that introduce them:
>
>   git rev-list HEAD |
>   git diff-tree --stdin -r -m -t --raw |
>   perl -lne '
>     if (/^[0-9a-f]/) {
>             $commit = $_;
>     } elsif (/^:\S+ \S+ \S+ (\S+)/) {
>             $h{$1} = $commit;
>     }
>     END { print "$_ $h{$_}" for keys(%h) }
>   ' >commits.db
>
> And then we can see which deltas come from the same commit:
>
>   git cat-file --batch-all-objects --batch-check='%(objectname) %(deltabase)' |
>   perl -alne '
>     BEGIN {
>             open(my $fh, "<", "commits.db");
>             %commit = map { chomp; split } <$fh>;
>     }
>     next if $F[1] =~ /0{40}/; # not a delta
>     next unless defined $commit{$F[0]}; # not in index
>     print $commit{$F[0]} eq $commit{$F[1]} ? "inner" : "outer", " ", $_;
>   '
>
> In git.git, I see 460 "inner" deltas, and 193,081 "outer" ones. The
> inner ones are mostly small single-file test vectors, which makes
> sense. It's possible to have a merge result that does conflict
> resolution in two such files (that would then delta), but it seems like
> a fairly unlikely case. Numbers for linux.git are similar.
>
> So it might just not be a big issue at all for this use case.

Very interesting, thanks for running (and documenting!) this
experiment. I'm mostly with you that it probably doesn't make a huge
difference in practice here.

One thing that I'm not entirely clear on is how we'd treat objects that
could be good delta candidates for each other between two packs. For
instance, if I write a tree corresponding to the merge between two
branches, it's likely that the resulting tree would be a good delta
candidate against either of the trees at the tips of those two refs.

But we won't pack those trees (the ones at the tips of the refs) in the
same pack as the tree containing their merge. If we later on tried to
repack, would we evaluate the tip trees as possible delta candidates
against the merged tree? Or would we look at the merged tree, realize
it isn't delta'd with anything, and then not attempt to find any
candidates?

> > I think Michael Haggerty (?) suggested to me off-list that it might be
> > interesting to have a flag that we could mark packs with bad/no deltas
> > as such so that we don't implicitly trust their contents as having
> > high-quality deltas.
>
> I was going to suggest the same thing. ;) Unfortunately it's a bit
> tricky to do as we have no room in the file format for an optional
> flag. You'd have to add a ".mediocre-delta" file or something.

Yeah, I figured that we'd add a new ".baddeltas" file or something. (As
an aside, we probably should have an optional flags section in the
.pack format, since we seem to have a lot of optional pack extensions:
.rev, .bitmap, .keep, .promisor, etc.)

> But here's another approach. I recall discussing a while back the idea
> that we should not necessarily trust the quality of deltas in packs
> that are pushed (and I think Thomas Gummerer even did some experiments
> inside GitHub with those, though I don't remember the results). And one
> way around that is during geometric repacking to consider the
> biggest/oldest pack as "preferred", reuse its deltas, but always
> compute from scratch with the others (neither reusing on-disk deltas,
> nor skipping try_delta() when two objects come from the same pack).
>
> That same strategy would work here (and for incremental fast-import
> packs, though of course not if your fast-import pack is the "big" one
> after you do a from-scratch import).

Yeah, definitely. I think that that code is still living in GitHub's
fork, but inactive since we haven't set any of the relevant
configuration in GitHub's production environment.

> Possibly it exacerbates the "no deltas" issue from above (though it
> would depend on the command). The bigger question to me is one of
> checkpointing. When do we finish off the pack with a .idx and make it
> available to other readers? We could do it at program exit, but I
> suspect there are some commands that really want to make objects
> available sooner (e.g., as soon as "git add" writes an index, we'd want
> those objects to already be available). Probably every program that
> writes objects would need to be annotated with a checkpoint call (which
> would be a noop in loose object mode).
>
> So maybe it's a dumb direction. I dunno.

I wouldn't say it's a dumb direction ;-). But I'd be hesitant to pursue
it without solving the "no deltas" question from earlier.

Thanks,
Taylor
On Fri, Oct 06, 2023 at 07:02:05PM -0400, Taylor Blau wrote:
> On Fri, Oct 06, 2023 at 03:35:25PM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > When using merge-tree often within a repository[^1], it is possible to
> > > generate a relatively large number of loose objects, which can result in
> > > degraded performance, and inode exhaustion in extreme cases.
> >
> > Well, be it "git merge-tree" or "git merge", new loose objects tend
> > to accumulate until "gc" kicks in, so it is not a new problem for
> > mere mortals, is it?
>
> Yeah, I would definitely suspect that this is more of an issue for
> forges than individual Git users.
>
> > As one "interesting" use case of "merge-tree" is for a Git hosting
> > site with bare repositories to offer trial merges, without which the
> > majority of the objects their repositories acquire would have been in
> > packs pushed by their users, "Gee, loose objects consume many inodes
> > in exchange for easier selective pruning" becomes an issue, right?
>
> Right.
>
> > Just like it hurts performance to have too many loose object files,
> > presumably it would also hurt performance to keep too many packs,
> > each of which came from such a trial merge. Do we have a "gc" story
> > offered for these packs created by the new feature? E.g., "once
> > merge-tree is done creating a trial merge, we can discard the objects
> > created in the pack, because we never expose new objects in the pack
> > to the outside (processes running simultaneously), so instead of
> > closing the new packfile by calling flush_bulk_checkin_packfile(), we
> > can safely unlink the temporary pack. We do not even need to run a gc
> > that spends cycles enumerating what is still reachable", or something
> > like that?
>
> I know Johannes worked on something like this recently. IIRC, it
> effectively does something like:
>
>     struct tmp_objdir *tmp_objdir = tmp_objdir_create(...);
>     tmp_objdir_replace_primary_odb(tmp_objdir, 1);
>
> at the beginning of a merge operation, and:
>
>     tmp_objdir_discard_objects(tmp_objdir);
>
> at the end. I haven't followed that work off-list very closely, but it
> is only possible for GitHub to discard certain niche kinds of
> merges/rebases, since in general we make the objects created during
> test merges available via refs/pull/N/{merge,rebase}.
>
> I think that like anything, this is a trade-off. Having lots of packs
> can be a performance hindrance just like having lots of loose objects.
> But since we can represent more objects with fewer inodes when packed,
> storing those objects together in a pack is preferable when (a) you're
> doing lots of test-merges, and (b) you want to keep those objects
> around, e.g., because they are reachable.

In Gitaly, we usually set up quarantine directories for all operations
that create objects. This allows us to discard any newly written
objects in case either the RPC call gets cancelled or in case our
access checks determine that the change should not be allowed. The
logic is rather simple:

  1. Create a new temporary directory.

  2. Set up the new temporary directory as the main object database via
     the `GIT_OBJECT_DIRECTORY` environment variable.

  3. Set up the main repository's object database via the
     `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.

  4. Execute Git commands that write objects with these environment
     variables set up. The new objects will end up neatly contained in
     the temporary directory.

  5. Once done, either discard the temporary object database or
     migrate objects into the main object database.

I wonder whether this would be a viable approach for you, as well.

Patrick
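For illustration, the flow above could look roughly like this in shell
(paths and the final migration step are hypothetical sketches, not
Gitaly's actual code):

  repo=/path/to/repo.git

  # (1) Create a new temporary directory.
  quarantine=$(mktemp -d)

  # (2) + (3) New objects go into the quarantine, while the main
  # object database remains readable as an alternate.
  export GIT_OBJECT_DIRECTORY=$quarantine
  export GIT_ALTERNATE_OBJECT_DIRECTORIES=$repo/objects

  # (4) Object-writing commands are now contained in the quarantine.
  blob=$(echo quarantined | git --git-dir="$repo" hash-object -w --stdin)

  # (5) Either discard the quarantine wholesale, or migrate its
  # contents into the main object database before removing it.
  rm -rf "$quarantine"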
On Mon, Oct 09, 2023 at 12:54:12PM +0200, Patrick Steinhardt wrote:
> In Gitaly, we usually set up quarantine directories for all operations
> that create objects. This allows us to discard any newly written
> objects in case either the RPC call gets cancelled or in case our
> access checks determine that the change should not be allowed. The
> logic is rather simple:
>
>   1. Create a new temporary directory.
>
>   2. Set up the new temporary directory as the main object database via
>      the `GIT_OBJECT_DIRECTORY` environment variable.
>
>   3. Set up the main repository's object database via the
>      `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.

Is there a reason not to run Git in the quarantine environment and list
the main repository as an alternate via $GIT_DIR/objects/info/alternates
instead of the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable?

>   4. Execute Git commands that write objects with these environment
>      variables set up. The new objects will end up neatly contained in
>      the temporary directory.
>
>   5. Once done, either discard the temporary object database or
>      migrate objects into the main object database.

Interesting. I'm curious why you don't use the builtin tmp_objdir
mechanism in Git itself. Do you need to run more than one command in
the quarantine environment? If so, it makes sense that you'd want to
have a scratch repository that lasts beyond the lifetime of a single
process.

> I wonder whether this would be a viable approach for you, as well.

I think that the main problem that we are trying to solve with this
series is creating a potentially large number of loose objects. I think
that you could do something like what you propose above, with a 'git
repack -adk' before moving its objects over back to the main
repository.

But since we're working in a single process only when doing a
merge-tree operation, I think it is probably more expedient to write
the pack's bytes directly.

> Patrick

Thanks,
Taylor
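A sketch of the alternates-based variant being asked about, with
illustrative paths (the quarantine is itself a repository whose
alternates file points back at the main object store):

  git init --bare /tmp/quarantine.git
  echo /path/to/repo.git/objects \
      >/tmp/quarantine.git/objects/info/alternates

  # Git can now run directly inside the quarantine: objects from the
  # main repository are readable, while any new objects stay local to
  # /tmp/quarantine.git/objects.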
Jeff King <peff@peff.net> writes:

>> Yes, the bulk-checkin mechanism suffers from an even worse problem,
>> which is that the pack it creates will contain no deltas whatsoever.
>> The contents of the pack are just getting written as-is, so there's
>> no fancy deltification going on.
>
> I wonder how big a deal this would be in practice for merges.
> ...

Thanks for your experiments ;-)

The reason why the bulk-checkin mechanism does not attempt deltifying
(as opposed to fast-import, which attempts to deltify with the
immediately previous object and only with that single object) is
exactly the same: it was done to support the initial check-in, which
by definition lacks any delta opportunity along the time axis.

As you describe, such a delta-less pack would risk missed
deltification opportunities when running a repack (without "-f"), as
the opposite of the well-known "reuse delta" heuristic (i.e. "this
object was stored in the base form; it is likely that the previous
pack-objects run tried but did not find a good delta base for it, so
let's not waste time retrying") would get in the way.
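For context, the "-f" escape hatch mentioned above maps onto
pack-objects' --no-reuse-delta, and can be used to sidestep the
reuse-delta heuristic at the cost of recomputing everything:

  # Repack everything into one pack, ignoring (and recomputing)
  # existing deltas rather than reusing them verbatim.
  git repack -a -d -f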
On Sun, Oct 08, 2023 at 09:37:42PM -0400, Taylor Blau wrote:
> Very interesting, thanks for running (and documenting!) this
> experiment. I'm mostly with you that it probably doesn't make a huge
> difference in practice here.
>
> One thing that I'm not entirely clear on is how we'd treat objects that
> could be good delta candidates for each other between two packs. For
> instance, if I write a tree corresponding to the merge between two
> branches, it's likely that the resulting tree would be a good delta
> candidate against either of the trees at the tips of those two refs.
>
> But we won't pack those trees (the ones at the tips of the refs) in the
> same pack as the tree containing their merge. If we later on tried to
> repack, would we evaluate the tip trees as possible delta candidates
> against the merged tree? Or would we look at the merged tree, realize
> it isn't delta'd with anything, and then not attempt to find any
> candidates?

When we repack (either all-into-one, or in a geometric roll-up), we
should consider those trees as candidates. The only deltas we don't
consider are:

  - if something is already a delta in a pack, then we will usually
    reuse that delta verbatim (so you might get fooled by a mediocre
    delta and not go to the trouble to search again. But I don't think
    that applies here; there is no other tree in your new pack to make
    such a mediocre delta from, and anyway you are skipping deltas
    entirely)

  - if two objects are in the same pack but there's no delta
    relationship, the try_delta() heuristics will skip them immediately
    (under the assumption that we tried during the last repack and
    didn't find anything good).

So yes, if you had a big old pack with the original trees, and then a
new pack with the merge result, we should try to delta the new merge
result tree against the others, just as we would if it were loose.

> > I think Michael Haggerty (?) suggested to me off-list that it might be
> > interesting to have a flag that we could mark packs with bad/no deltas
> > as such so that we don't implicitly trust their contents as having
> > high-quality deltas.
>
> I was going to suggest the same thing. ;) Unfortunately it's a bit
> tricky to do as we have no room in the file format for an optional
> flag. You'd have to add a ".mediocre-delta" file or something.

Yes, though since packv2 is the on-the-wire format, it's very hard to
change now. It might be easier to put an annotation into the .idx file.

-Peff
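One way to observe this from the outside, reusing the %(deltabase)
format from the earlier script ($oid standing in for any object ID):

  # An all-zeros delta base means the object is stored whole (as the
  # bulk-checkin pack writes it); after a repack it may gain a real
  # delta base.
  echo $oid | git cat-file --batch-check='%(objectname) %(deltabase)'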
On Mon, Oct 09, 2023 at 12:08:32PM -0400, Taylor Blau wrote:
> On Mon, Oct 09, 2023 at 12:54:12PM +0200, Patrick Steinhardt wrote:
> > In Gitaly, we usually set up quarantine directories for all operations
> > that create objects. This allows us to discard any newly written
> > objects in case either the RPC call gets cancelled or in case our
> > access checks determine that the change should not be allowed. The
> > logic is rather simple:
> >
> >   1. Create a new temporary directory.
> >
> >   2. Set up the new temporary directory as the main object database via
> >      the `GIT_OBJECT_DIRECTORY` environment variable.
> >
> >   3. Set up the main repository's object database via the
> >      `GIT_ALTERNATE_OBJECT_DIRECTORIES` environment variable.
>
> Is there a reason not to run Git in the quarantine environment and list
> the main repository as an alternate via $GIT_DIR/objects/info/alternates
> instead of the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable?

The quarantine environment as we use it is really only a single object
database, so you cannot run Git inside of it directly.

> >   4. Execute Git commands that write objects with these environment
> >      variables set up. The new objects will end up neatly contained in
> >      the temporary directory.
> >
> >   5. Once done, either discard the temporary object database or
> >      migrate objects into the main object database.
>
> Interesting. I'm curious why you don't use the builtin tmp_objdir
> mechanism in Git itself. Do you need to run more than one command in
> the quarantine environment? If so, it makes sense that you'd want to
> have a scratch repository that lasts beyond the lifetime of a single
> process.

It's a mixture of things:

  - Many commands simply don't set up a temporary object directory.

  - We want to check the result after the objects have been generated.
    Many of the commands don't provide hooks to do so in a reasonable
    way. So we want to check the result _after_ the command has exited
    already, and objects should not yet have been migrated into the
    target object database at that point.

  - Sometimes we indeed want to run multiple Git commands. We use this
    e.g. for worktreeless rebases, where we run a succession of
    commands to rebase every single commit.

So ultimately, our own quarantine directory sits at a conceptually
higher level than what Git commands would be able to provide.

> > I wonder whether this would be a viable approach for you, as well.
>
> I think that the main problem that we are trying to solve with this
> series is creating a potentially large number of loose objects. I think
> that you could do something like what you propose above, with a 'git
> repack -adk' before moving its objects over back to the main
> repository.

Yeah, that's certainly possible. I had the feeling that there are two
different problems that we're trying to solve in this thread:

  - The problem that the objects are of a temporary nature. Regardless
    of whether they are in a packfile or loose, it doesn't make much
    sense to put them into the main object database as they would be
    unreachable anyway and thus get pruned after some time.

  - The problem that writing many loose objects can be slow.

My proposal only addresses the first problem.

> But since we're working in a single process only when doing a
> merge-tree operation, I think it is probably more expedient to write
> the pack's bytes directly.

If it's about efficiency and not about how we discard those objects
after the command has run to completion then yes, I agree.

Patrick
diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt
index ffc4fbf7e8..9d37609ef1 100644
--- a/Documentation/git-merge-tree.txt
+++ b/Documentation/git-merge-tree.txt
@@ -69,6 +69,10 @@ OPTIONS
 	specify a merge-base for the merge, and specifying multiple bases is
 	currently not supported. This option is incompatible with `--stdin`.
 
+--write-pack::
+	Write any new objects into a separate packfile instead of as
+	individual loose objects.
+
 [[OUTPUT]]
 OUTPUT
 ------
diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c
index 0de42aecf4..672ebd4c54 100644
--- a/builtin/merge-tree.c
+++ b/builtin/merge-tree.c
@@ -18,6 +18,7 @@
 #include "quote.h"
 #include "tree.h"
 #include "config.h"
+#include "bulk-checkin.h"
 
 static int line_termination = '\n';
 
@@ -414,6 +415,7 @@ struct merge_tree_options {
 	int show_messages;
 	int name_only;
 	int use_stdin;
+	int write_pack;
 };
 
 static int real_merge(struct merge_tree_options *o,
@@ -440,6 +442,7 @@ static int real_merge(struct merge_tree_options *o,
 
 	init_merge_options(&opt, the_repository);
 	opt.show_rename_progress = 0;
+	opt.write_pack = o->write_pack;
 
 	opt.branch1 = branch1;
 	opt.branch2 = branch2;
@@ -548,6 +551,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix)
 			   &merge_base,
 			   N_("commit"),
 			   N_("specify a merge-base for the merge")),
+		OPT_BOOL(0, "write-pack", &o.write_pack,
+			 N_("write new objects to a pack instead of as loose")),
 		OPT_END()
 	};
 
diff --git a/merge-ort.c b/merge-ort.c
index 8631c99700..85d8c5c6b3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -48,6 +48,7 @@
 #include "tree.h"
 #include "unpack-trees.h"
 #include "xdiff-interface.h"
+#include "bulk-checkin.h"
 
 /*
  * We have many arrays of size 3. Whenever we have such an array, the
@@ -2124,11 +2125,19 @@ static int handle_content_merge(struct merge_options *opt,
 		if ((merge_status < 0) || !result_buf.ptr)
 			ret = err(opt, _("Failed to execute internal merge"));
 
-		if (!ret &&
-		    write_object_file(result_buf.ptr, result_buf.size,
-				      OBJ_BLOB, &result->oid))
-			ret = err(opt, _("Unable to add %s to database"),
-				  path);
+		if (!ret) {
+			ret = opt->write_pack
+				? index_blob_bulk_checkin_incore(&result->oid,
+								 result_buf.ptr,
+								 result_buf.size,
+								 path, 1)
+				: write_object_file(result_buf.ptr,
+						    result_buf.size,
+						    OBJ_BLOB, &result->oid);
+			if (ret)
+				ret = err(opt, _("Unable to add %s to database"),
+					  path);
+		}
 
 		free(result_buf.ptr);
 		if (ret)
@@ -3618,7 +3627,8 @@ static int tree_entry_order(const void *a_, const void *b_)
 				 b->string, strlen(b->string), bmi->result.mode);
 }
 
-static int write_tree(struct object_id *result_oid,
+static int write_tree(struct merge_options *opt,
+		      struct object_id *result_oid,
 		      struct string_list *versions,
 		      unsigned int offset,
 		      size_t hash_size)
@@ -3652,8 +3662,14 @@ static int write_tree(struct object_id *result_oid,
 	}
 
 	/* Write this object file out, and record in result_oid */
-	if (write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid))
+	ret = opt->write_pack
+		? index_tree_bulk_checkin_incore(result_oid,
+						 buf.buf, buf.len, "", 1)
+		: write_object_file(buf.buf, buf.len, OBJ_TREE, result_oid);
+
+	if (ret)
 		ret = -1;
+
 	strbuf_release(&buf);
 	return ret;
 }
@@ -3818,8 +3834,8 @@ static int write_completed_directory(struct merge_options *opt,
 		 */
 		dir_info->is_null = 0;
 		dir_info->result.mode = S_IFDIR;
-		if (write_tree(&dir_info->result.oid, &info->versions, offset,
-			       opt->repo->hash_algo->rawsz) < 0)
+		if (write_tree(opt, &dir_info->result.oid, &info->versions,
+			       offset, opt->repo->hash_algo->rawsz) < 0)
 			ret = -1;
 	}
 
@@ -4353,9 +4369,13 @@ static int process_entries(struct merge_options *opt,
 		fflush(stdout);
 		BUG("dir_metadata accounting completely off; shouldn't happen");
 	}
-	if (write_tree(result_oid, &dir_metadata.versions, 0,
+	if (write_tree(opt, result_oid, &dir_metadata.versions, 0,
 		       opt->repo->hash_algo->rawsz) < 0)
 		ret = -1;
+
+	if (opt->write_pack)
+		end_odb_transaction();
+
 cleanup:
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
@@ -4899,6 +4919,9 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	 */
 	strmap_init(&opt->priv->conflicts);
 
+	if (opt->write_pack)
+		begin_odb_transaction();
+
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
diff --git a/merge-recursive.h b/merge-recursive.h
index b88000e3c2..156e160876 100644
--- a/merge-recursive.h
+++ b/merge-recursive.h
@@ -48,6 +48,7 @@ struct merge_options {
 	unsigned renormalize : 1;
 	unsigned record_conflict_msgs_as_headers : 1;
 	const char *msg_header_prefix;
+	unsigned write_pack : 1;
 
 	/* internal fields used by the implementation */
 	struct merge_options_internal *priv;
diff --git a/t/t4301-merge-tree-write-tree.sh b/t/t4301-merge-tree-write-tree.sh
index 250f721795..2d81ff4de5 100755
--- a/t/t4301-merge-tree-write-tree.sh
+++ b/t/t4301-merge-tree-write-tree.sh
@@ -922,4 +922,97 @@ test_expect_success 'check the input format when --stdin is passed' '
 	test_cmp expect actual
 '
 
+packdir=".git/objects/pack"
+
+test_expect_success 'merge-tree can pack its result with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+
+	# base has lines [3, 4, 5]
+	#   - side adds to the beginning, resulting in [1, 2, 3, 4, 5]
+	#   - other adds to the end, resulting in [3, 4, 5, 6, 7]
+	#
+	# merging the two should result in a new blob object containing
+	# [1, 2, 3, 4, 5, 6, 7], along with a new tree.
+	test_commit -C repo base file "$(test_seq 3 5)" &&
+	git -C repo branch -M main &&
+	git -C repo checkout -b side main &&
+	test_commit -C repo side file "$(test_seq 1 5)" &&
+	git -C repo checkout -b other main &&
+	test_commit -C repo other file "$(test_seq 3 7)" &&
+
+	find repo/$packdir -type f -name "pack-*.idx" >packs.before &&
+	tree="$(git -C repo merge-tree --write-pack \
+		refs/tags/side refs/tags/other)" &&
+	blob="$(git -C repo rev-parse $tree:file)" &&
+	find repo/$packdir -type f -name "pack-*.idx" >packs.after &&
+
+	test_must_be_empty packs.before &&
+	test_line_count = 1 packs.after &&
+
+	git show-index <$(cat packs.after) >objects &&
+	test_line_count = 2 objects &&
+	grep "^[1-9][0-9]* $tree" objects &&
+	grep "^[1-9][0-9]* $blob" objects
+'
+
+test_expect_success 'merge-tree can write multiple packs with --write-pack' '
+	test_when_finished "rm -rf repo" &&
+	git init repo &&
+	(
+		cd repo &&
+
+		git config pack.packSizeLimit 512 &&
+
+		test_seq 512 >f &&
+
+		# "f" contains roughly ~2,000 bytes.
+		#
+		# Each side ("foo" and "bar") adds a small amount of data at the
+		# beginning and end of "base", respectively.
+		git add f &&
+		test_tick &&
+		git commit -m base &&
+		git branch -M main &&
+
+		git checkout -b foo main &&
+		{
+			echo foo && cat f
+		} >f.tmp &&
+		mv f.tmp f &&
+		git add f &&
+		test_tick &&
+		git commit -m foo &&
+
+		git checkout -b bar main &&
+		echo bar >>f &&
+		git add f &&
+		test_tick &&
+		git commit -m bar &&
+
+		find $packdir -type f -name "pack-*.idx" >packs.before &&
+		# Merging either side should result in a new object which is
+		# larger than 1M, thus the result should be split into two
+		# separate packs.
+		tree="$(git merge-tree --write-pack \
+			refs/heads/foo refs/heads/bar)" &&
+		blob="$(git rev-parse $tree:f)" &&
+		find $packdir -type f -name "pack-*.idx" >packs.after &&
+
+		test_must_be_empty packs.before &&
+		test_line_count = 2 packs.after &&
+		for idx in $(cat packs.after)
+		do
+			git show-index <$idx || return 1
+		done >objects &&
+
+		# The resulting set of packs should contain one copy of both
+		# objects, each in a separate pack.
+		test_line_count = 2 objects &&
+		grep "^[1-9][0-9]* $tree" objects &&
+		grep "^[1-9][0-9]* $blob" objects
+	)
+'
+
 test_done
When using merge-tree often within a repository[^1], it is possible to
generate a relatively large number of loose objects, which can result in
degraded performance, and inode exhaustion in extreme cases.

Building on the functionality introduced in previous commits, the
bulk-checkin machinery now has support to write arbitrary blob and tree
objects which are small enough to be held in-core. We can use this to
write any blob/tree objects generated by ORT into a separate pack
instead of writing them out individually as loose.

This functionality is gated behind a new `--write-pack` option to
`merge-tree` that works with the (non-deprecated) `--write-tree` mode.

The implementation is relatively straightforward. There are two spots
within the ORT mechanism where we call `write_object_file()`, one for
content differences within blobs, and another to assemble any new trees
necessary to construct the merge. In each of those locations,
conditionally replace calls to `write_object_file()` with
`index_blob_bulk_checkin_incore()` or `index_tree_bulk_checkin_incore()`
depending on which kind of object we are writing.

The only remaining task is to begin and end the transaction necessary to
initialize the bulk-checkin machinery, and move any new pack(s) it
created into the main object store.

[^1]: Such is the case at GitHub, where we run presumptive "test merges"
  on open pull requests to see whether or not we can light up the merge
  button green depending on whether or not the presumptive merge was
  conflicted. This is done in response to a number of user-initiated
  events, including viewing an open pull request whose last test merge
  is stale with respect to the current base and tip of the pull request.
  As a result, merge-tree can be run very frequently on large, active
  repositories.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-merge-tree.txt |  4 ++
 builtin/merge-tree.c             |  5 ++
 merge-ort.c                      | 43 +++++++++++----
 merge-recursive.h                |  1 +
 t/t4301-merge-tree-write-tree.sh | 93 ++++++++++++++++++++++++++++++++
 5 files changed, 136 insertions(+), 10 deletions(-)
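With the series applied, the new mode can be exercised roughly like so
(branch names are illustrative; see t4301 above for the real tests):

  tree=$(git merge-tree --write-pack side other) &&
  # the merge's new blob/tree objects land in a pack rather than as
  # loose objects:
  find .git/objects/pack -name "pack-*.pack" &&
  git ls-tree "$tree"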