Message ID | 2da42e9ca26c9ef914b8b044047d505f00a27e20.1611098616.git.me@ttaylorr.com
---|---
State | Superseded
Series | repack: support repacking into a geometric sequence
Taylor Blau <me@ttaylorr.com> writes:

> Teach pack-objects an option to imply the revision machinery's new '--no-kept-objects' option when doing a reachability traversal.
>
> When '--assume-kept-packs-closed' is given as an argument to pack-objects, it behaves differently (i.e., passes different options to the ensuing revision walk) depending on whether or not other arguments are passed:
>
>   - If the caller also specifies a '--keep-pack' argument (to mark a pack as kept in-core), then assume that this combination means to stop traversal only at in-core packs.
>
>   - If instead the caller passes '--honor-pack-keep', then assume that the caller wants to stop traversal only at packs with a corresponding .keep file (consistent with the original meaning, which only refers to packs with a .keep file).
>
>   - If both '--keep-pack' and '--honor-pack-keep' are passed, then assume the caller wants to stop traversal at either kind of kept pack.

If there is an out-of-band guarantee that .keep packs won't refer to the outside world, then we can obtain results identical to what the existing --honor-pack-keep does (it traverses everything and then filters out what is in a .keep pack) by just stopping traversal when we see an object that is found in a .keep pack. OK, I guess that answers the correctness question I asked about [02/10].

It still is curious how we can safely "assume", but presumably we will see how in a patch that appears later in the series.

How "closed" are these kept packs supposed to be? When there are two .keep packs, should objects in each of the packs never refer outside their own pack, or is it OK for objects in one kept pack to refer to another object in the other kept pack? Readers, and those who want to understand and extend this code in the future, would need to know what definition of "closed" you are using here.

Thanks.
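The three option combinations in the quoted patch description amount to a small decision table. As a hypothetical Python sketch (a toy model for illustration, not the actual C code in pack-objects), the selection could look like:

```python
def kept_pack_kinds(keep_pack, honor_pack_keep):
    """Return which kinds of kept packs should stop the traversal,
    per the three option combinations described above (toy model)."""
    stop_at = set()
    if keep_pack:
        stop_at.add("in-core")   # packs marked kept via --keep-pack
    if honor_pack_keep:
        stop_at.add("on-disk")   # packs with a .keep file
    return stop_at
```

Passing both flags yields the union, matching the third bullet above.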
On Thu, Jan 28, 2021 at 07:21:09PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
>
> > Teach pack-objects an option to imply the revision machinery's new '--no-kept-objects' option when doing a reachability traversal.
> >
> > When '--assume-kept-packs-closed' is given as an argument to pack-objects, it behaves differently (i.e., passes different options to the ensuing revision walk) depending on whether or not other arguments are passed:
> >
> >   - If the caller also specifies a '--keep-pack' argument (to mark a pack as kept in-core), then assume that this combination means to stop traversal only at in-core packs.
> >
> >   - If instead the caller passes '--honor-pack-keep', then assume that the caller wants to stop traversal only at packs with a corresponding .keep file (consistent with the original meaning, which only refers to packs with a .keep file).
> >
> >   - If both '--keep-pack' and '--honor-pack-keep' are passed, then assume the caller wants to stop traversal at either kind of kept pack.
>
> If there is an out-of-band guarantee that .keep packs won't refer to the outside world, then we can obtain results identical to what the existing --honor-pack-keep does (it traverses everything and then filters out what is in a .keep pack) by just stopping traversal when we see an object that is found in a .keep pack. OK, I guess that answers the correctness question I asked about [02/10].
>
> It still is curious how we can safely "assume", but presumably we will see how in a patch that appears later in the series.

I think this would generally happen if the .keep packs are generated using something like "git repack -a", which packs everything reachable together. So if you do:

  git repack -ad
  touch .git/objects/pack/pack-whatever.keep
  ... some more packs come in, perhaps via pushes ...
  # imagine repack knew how to pass this along...
  git repack -a --assume-kept-packs-closed

then you'd repack just the objects that aren't in the big pack. (And other kept packs; this probably would not want to be used with .keep packs, since those can be racily created by receive-pack. Rather, you'd want to say "consider this specific list of packs to be closed and kept".)

It is dangerous, though. If your assumption is somehow wrong, then you'd potentially corrupt the repository (because you'd stop traversing, but perhaps delete a pack that contained a useful object you would have reached that you don't have elsewhere).

The overall goal here is being able to roll up loose objects and smaller packs without having to pay the cost of a full reachability traversal (which can take several minutes on large repositories). Another very-different direction is to just enumerate those objects without respect to reachability, stick them in a pack, and then delete the originals. That does imply something like "repack -k", though, and interacts weirdly with letting unreachable objects age out via their mtimes (we'd constantly suck them back into fresh packs).

That would work better if our unreachable "aging out" storage were marked as such (say, in a pack marked with a ".cruft" file, rather than just a regular loose object that might be new or might be cruft). Then a roll-up repack would leave cruft packs alone (neither rolling them up nor deleting them). A "real" repack would eventually delete them, but only after having done an actual reachability traversal, which makes sure there are no objects within them that need to be rescued.

> How "closed" are these kept packs supposed to be? When there are two .keep packs, should objects in each of the packs never refer outside their own pack, or is it OK for objects in one kept pack to refer to another object in the other kept pack? Readers, and those who want to understand and extend this code in the future, would need to know what definition of "closed" you are using here.

I think it would want to be "the set of all .keep packs is closed". In a "roll all into one" scenario like above, there is only one .keep pack. But in a geometric progression, that single pack which constitutes your base set could be multiple packs (the last whole "git repack -ad", but then a sequence of roll-ups that came on top of it).

-Peff
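To see concretely why a wrong closure assumption is dangerous, consider a toy model of the cut-off walk discussed above. This Python sketch is hypothetical (not git code): it stops traversal at "kept" objects, so if a kept object points at a non-kept ancestor, that ancestor is silently missed.

```python
def walk(roots, parents, kept):
    """Collect objects reachable from roots, but stop the walk at any
    object that lives in a kept pack (models a --no-kept-objects walk)."""
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen or obj in kept:
            continue  # cut off traversal at kept objects
        seen.add(obj)
        stack.extend(parents.get(obj, ()))
    return seen

# History A <- B <- C, where only B is in a kept pack.
parents = {"C": ["B"], "B": ["A"]}

# The kept set is NOT closed: B refers to A, which is not kept.
# A is reachable, but the cut-off walk never visits it -- so a repack
# that deletes everything neither walked nor kept would lose A.
found = walk({"C"}, parents, kept={"B"})
```

A full walk (empty kept set) would find all three objects; the cut-off walk finds only C, which is exactly the corruption scenario described above.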
On Fri, Jan 29, 2021 at 02:19:25PM -0500, Jeff King wrote:

> The overall goal here is being able to roll up loose objects and smaller packs without having to pay the cost of a full reachability traversal (which can take several minutes on large repositories). Another very-different direction is to just enumerate those objects without respect to reachability, stick them in a pack, and then delete the originals. That does imply something like "repack -k", though, and interacts weirdly with letting unreachable objects age out via their mtimes (we'd constantly suck them back into fresh packs).

As I mentioned in an earlier response to Junio, this was the original approach that I took when implementing this, but I ultimately decided against it because it means that we'll never let unreachable objects age out (as you note).

I wonder if our assumption that the union of kept packs is closed under reachability needs to be specified as an option. If the option is passed, then we stop the traversal as soon as we hit an object in the frozen packs. If not passed, then we do a full traversal but pass --honor-pack-keep to drop out objects in the frozen packs after the fact.

Thoughts?

> I think it would want to be "the set of all .keep packs is closed". In a "roll all into one" scenario like above, there is only one .keep pack. But in a geometric progression, that single pack which constitutes your base set could be multiple packs (the last whole "git repack -ad", but then a sequence of roll-ups that came on top of it).

I don't think having a roll-up strategy of "all-except-one" simplifies things. Or, if it does, then I don't understand it. Isn't this the exact same thing as a geometric repack which decides to keep only one pack?

ISTM that you would be susceptible to the same problems in this case, too.

Thanks,
Taylor
On Fri, Jan 29, 2021 at 03:01:48PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 02:19:25PM -0500, Jeff King wrote:
>
> > The overall goal here is being able to roll up loose objects and smaller packs without having to pay the cost of a full reachability traversal (which can take several minutes on large repositories). Another very-different direction is to just enumerate those objects without respect to reachability, stick them in a pack, and then delete the originals. That does imply something like "repack -k", though, and interacts weirdly with letting unreachable objects age out via their mtimes (we'd constantly suck them back into fresh packs).
>
> As I mentioned in an earlier response to Junio, this was the original approach that I took when implementing this, but I ultimately decided against it because it means that we'll never let unreachable objects age out (as you note).

Right. But that's no different than using "-k" most of the time, and then occasionally doing a more careful repack with short expiration times and a full reachability check. As you know, this is basically what we do at GitHub.

So it may be reasonable to go that direction, which is really defining a totally separate strategy from git-gc's "repack, and occasionally objects age out". Especially if we find that the assume-kept-packs-closed route is too risky (i.e., has too many cases where it's possible to cause corruption if our assumption isn't met). I'm not convinced either way at this point; I'm just thinking out loud about the options (and trying to give some context to the list).

> I wonder if our assumption that the union of kept packs is closed under reachability needs to be specified as an option. If the option is passed, then we stop the traversal as soon as we hit an object in the frozen packs. If not passed, then we do a full traversal but pass --honor-pack-keep to drop out objects in the frozen packs after the fact.
>
> Thoughts?

I'm confused. I thought the whole idea was to pass it as an option (the user telling Git "I know these packs are supposed to be closed; trust me")?

> > I think it would want to be "the set of all .keep packs is closed". In a "roll all into one" scenario like above, there is only one .keep pack. But in a geometric progression, that single pack which constitutes your base set could be multiple packs (the last whole "git repack -ad", but then a sequence of roll-ups that came on top of it).
>
> I don't think having a roll-up strategy of "all-except-one" simplifies things. Or, if it does, then I don't understand it. Isn't this the exact same thing as a geometric repack which decides to keep only one pack?
>
> ISTM that you would be susceptible to the same problems in this case, too.

I wasn't trying to argue that all-except-one avoids any problems. I was saying that the example I gave above was an all-into-one, but if you want to extend the concept to multiple packs, it has to cover the whole set. I.e., answering Junio's:

> is it OK for objects in one kept pack to refer to another object in the other kept pack?

with "yes".

-Peff
Jeff King <peff@peff.net> writes:

> > If there is an out-of-band guarantee that .keep packs won't refer to the outside world, then we can obtain results identical to what the existing --honor-pack-keep does (it traverses everything and then filters out what is in a .keep pack) by just stopping traversal when we see an object that is found in a .keep pack. OK, I guess that answers the correctness question I asked about [02/10].
> >
> > It still is curious how we can safely "assume", but presumably we will see how in a patch that appears later in the series.
>
> I think this would generally happen if the .keep packs are generated using something like "git repack -a", which packs everything reachable together. So if you do:
>
>   git repack -ad
>   touch .git/objects/pack/pack-whatever.keep
>   ... some more packs come in, perhaps via pushes ...
>   # imagine repack knew how to pass this along...
>   git repack -a --assume-kept-packs-closed
>
> then you'd repack just the objects that aren't in the big pack.

Yeah. As a tool to help the above workflow, where you are only creating another .keep pack out of the youngest objects (i.e. those that are either loose or in non-kept packs), it does make sense to take advantage of "the set of packs with .keep, as a whole, is closed", because by definition anything in a .keep pack cannot be pointing back at these younger objects.

It may become tricky once we start talking about creating a new .keep pack out of the youngest objects PLUS a few young kept packs, though. Starting from all on-disk .keep packs, you'd mark them with an in-core keep bit, then drop the in-core keep bit from the few young kept packs that you intend to coalesce with the youngest objects---that is how I would imagine your repacking strategy would go. The set of all the on-disk .keep packs may give us the "closed" guarantee, but if we exclude a few of the latest packs from that set, would the remainder still give us the "closed" guarantee we can take advantage of, in order to pack these youngest objects (including the ones in the kept packs that we are coalescing)?
On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:

> So it may be reasonable to go that direction, which is really defining a totally separate strategy from git-gc's "repack, and occasionally objects age out". Especially if we find that the assume-kept-packs-closed route is too risky (i.e., has too many cases where it's possible to cause corruption if our assumption isn't met).

Yeah, this whole conversation has made me very nervous about using reachability. Fundamentally, this isn't about reachability at all. The operation is as simple as telling pack-objects a list of packs that you do and don't want objects from, making a new pack out of that, and then optionally dropping the packs that you rolled up.

So, I think that teaching pack-objects a way to understand a caller that says "include objects from packs X, Y, and Z, but not if they appear in packs A, B, or C, and also pull in any loose objects" is the best way forward here.

Of course, you're going to be dragging along unreachable objects until you decide to do a full repack, but I'm OK with that since we wouldn't expect anybody to rely solely on geometric repacks without occasionally running 'git repack -ad'.

Junio: I don't think that you have picked this up yet, but please avoid doing so for now, and I'll send a new series that goes in the direction I outlined above.

Thanks,
Taylor
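The input mode described above is essentially set arithmetic over pack contents, with no reachability walk involved. A hypothetical Python sketch (the function name and shapes are invented for illustration; this is not pack-objects' actual interface):

```python
def rollup_objects(include_packs, exclude_packs, loose):
    """Pick objects for a roll-up pack: everything from the packs being
    rolled up, plus loose objects, minus anything that already lives in
    a pack we intend to keep. No reachability traversal is performed."""
    excluded = set()
    for pack in exclude_packs:
        excluded |= set(pack)
    picked = set(loose)
    for pack in include_packs:
        picked |= set(pack)
    return picked - excluded
```

Objects duplicated between a rolled-up pack and a kept pack are dropped, matching the "but not if they appear in packs A, B, or C" rule.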
Jeff King <peff@peff.net> writes:

> > I wonder if our assumption that the union of kept packs is closed under reachability needs to be specified as an option. If the option is passed, then we stop the traversal as soon as we hit an object in the frozen packs. If not passed, then we do a full traversal but pass --honor-pack-keep to drop out objects in the frozen packs after the fact.
> >
> > Thoughts?
>
> I'm confused. I thought the whole idea was to pass it as an option (the user telling Git "I know these packs are supposed to be closed; trust me")?

Yes, that is how I read these patches, and it sounds like an assumption that we can make under many scenarios/repacking strategies.
On Fri, Jan 29, 2021 at 12:30:38PM -0800, Junio C Hamano wrote:

> > I think this would generally happen if the .keep packs are generated using something like "git repack -a", which packs everything reachable together. So if you do:
> >
> >   git repack -ad
> >   touch .git/objects/pack/pack-whatever.keep
> >   ... some more packs come in, perhaps via pushes ...
> >   # imagine repack knew how to pass this along...
> >   git repack -a --assume-kept-packs-closed
> >
> > then you'd repack just the objects that aren't in the big pack.
>
> Yeah. As a tool to help the above workflow, where you are only creating another .keep out of the youngest objects (i.e. those that are either loose or in non-kept packs), because by definition anything in .keep cannot be pointing back at these younger objects, it does make sense to take advantage of "the set of packs with .keep as a whole is closed".
>
> It may become tricky once we start talking about creating a new .keep out of the youngest objects PLUS a few young keep packs, though.

Right. You'd have to make sure the younger packs were also created with that reachability in mind. I.e., if you are in a situation where you've got:

  - a big "old" pack

  - N new packs from pushes

then you can't assume anything about the reachability for those individual N packs. It would be wrong to split any of them into the "keep" side. They need to have all their objects traversed until we hit something in the old pack. But if you have a situation with:

  - a big "old" pack

  - M packs made from previous rollups on top of the old pack

  - N new packs from pushes

then I think you can still take the old pack + the M packs as a cohesive unit closed under reachability. The tricky part is knowing which packs are which (size is a heuristic, but it can be wrong; people may make a big push to a previously-small repository).

> Starting from all on-disk .keep packs, you'd mark them with an in-core keep bit, then drop the in-core keep bit from the few young kept packs that you intend to coalesce with the youngest objects---that is how I would imagine your repacking strategy would go. The set of all the on-disk .keep packs may give us the "closed" guarantee, but if we exclude a few of the latest packs from that set, would the remainder still give us the "closed" guarantee we can take advantage of, in order to pack these youngest objects (including the ones in the kept packs that we are coalescing)?

Yeah, I think we are going along the same lines. Except I think it is dangerous to use an on-disk ".keep" as your marker, because we will racily see incoming push packs with a ".keep" (which receive-pack/index-pack use as a lockfile until the refs are updated). So repack has to "somehow" get the list of which is which.

None of which is disputing your "it may become tricky", of course. ;) It is exactly this trickiness that I am worried about. And I am not being coy with "somehow", as if we have some custom not-yet-shared layer on top of repack that tracks this. We are still figuring out whether this is a good direction in the first place. :)

One of the things that led us to this reachability traversal, away from "just suck up all of the objects from these packs", is that this is how --unpacked works. I had always assumed it was implemented as "if it's loose, then put it in the pack". But it's not. It's attached to the revision traversal. And it actually gets some cases wrong! It will walk every commit, so you don't have to worry about a packed commit referring to an unpacked one. But it doesn't look at the trees of the packed commits (for the quite obvious reason that doing so is orders of magnitude more expensive).

That means that if there is a packed commit that refers to an unpacked blob (which is not referenced by an unpacked commit), then "rev-list --unpacked" will not report it (and likewise "git repack -d" would not pack it). It's easy to create such a situation manually, but I've included a more plausible sequence involving "--amend" and push/fetch unpackLimit at the end of this email.

At the core, --unpacked is assuming certain things about the reachability of loose/packed objects that aren't necessarily true. And this --assume-kept-packs-closed stuff is basically doing the same thing for a particular set of packs (albeit more so; I believe the patches here cut off traversal of parent pointers, not just commit-to-tree pointers).

One of the reasons I think nobody noticed with --unpacked is that the stakes are pretty low. If our assumption is wrong, the worst case is that a loose object remains unpacked by "repack -d". But we'd never delete it based on that information (instead, git prune would do its own traversal to find the correct reachability). And it would eventually get picked up by "git repack -ad".

But for repacking, the general strategy is to put things you want to keep into the new pack, and then delete the old ones (not marked as keep, of course). So if our assumption is ever wrong, it means we'd potentially drop packs that have reachable objects not found elsewhere, and we'd end up corrupting the repository.

So I think the paths forward are either:

  - come up with an air-tight system of making sure that we know packs we claim are closed under reachability really are (perhaps some marker that says "I was generated by repack -a")

  - have a "roll-up" mode that does not care about reachability at all, and just takes any objects from a particular set of packs (plus probably loose objects)

I'm still thinking aloud here, and not really sure which is a better path. I do feel like the failure modes for the second one are less risky.

Anyway, here's the --unpacked example, if you're curious. It's based on fetching, but you could invert it to do pushes (in which case it is repacking in "parent" that gets the wrong result).

-- >8 --
# two repos, one a clone of the other
git init parent
git -C parent commit --allow-empty -m base
git clone parent child

# now there's a small fetch, which will get
# exploded into loose objects
(
	cd parent
	echo small >small
	git add small
	git commit -m small
)
git -C child fetch

# We can verify that "rev-list --unpacked" reports these
# objects.
git -C child rev-list --objects --unpacked origin

# and now a bigger one that will remain a pack (we'll
# tweak unpackLimit instead of making a really big commit,
# but the concept is the same)
#
# There are two key things here:
#
#  - the "small" commit is no longer reachable, but the big one
#    still contains the "small" blob object. Using --amend is
#    a plausible mechanism for this happening.
#
#  - we are using bitmaps, which give us an exact answer for the set of
#    objects to send. Otherwise, pack-objects on the server actually
#    fails to notice the other side has told us it has the small blob,
#    and sends another copy of it.
(
	cd parent
	git repack -adb
	echo big >big
	git add big
	git commit --amend -m big
)
git -C child -c fetch.unpackLimit=1 fetch

# So now in the child we have a packed object
# whose ancestor is an unpacked one. rev-list
# now won't report the "small" blob (ac790413).
git -C child rev-list --objects --unpacked origin

# Even though we can see that it's present only as an
# unpacked object.
show_objects() {
	for i in child/.git/objects/pack/*.idx; do
		git show-index <$i
	done | cut -d' ' -f2 | sed 's/^/packed: /'
	find child/.git/objects/??/* |
	perl -F/ -alne 'print "  loose: $F[-2]$F[-1]"'
}
show_objects

# If we were to do an incremental repack now, it wouldn't be packed.
# (Note we have to kill off the reflog, which still references the
# rewound commit.)
rm -rf child/.git/logs
git -C child repack -d
show_objects
On Fri, Jan 29, 2021 at 05:43:32PM -0500, Jeff King wrote:

> So I think the paths forward are either:
>
>   - come up with an air-tight system of making sure that we know packs we claim are closed under reachability really are (perhaps some marker that says "I was generated by repack -a")
>
>   - have a "roll-up" mode that does not care about reachability at all, and just takes any objects from a particular set of packs (plus probably loose objects)
>
> I'm still thinking aloud here, and not really sure which is a better path. I do feel like the failure modes for the second one are less risky.

The more I think about it, the more I feel that the second option is the right approach. It seems like if you were naïvely implementing this from scratch, you'd pick the second one (i.e., have pack-objects understand a new input mode, and then make a pack based on that).

I am leery that we'd be able to get the first option "right" without attaching some sort of marker to each pack, especially given how difficult I think this is to reason about precisely. I suppose you could have a .closed file corresponding to each pack, or alternatively a $objdir/pack/pack-geometry file which specifies the same thing, but both of these feel overly restrictive.

Besides having to special-case the loose objects, is there any downside to doing the simpler thing here?

> Anyway, here's the --unpacked example, if you're curious. It's based on fetching, but you could invert it to do pushes (in which case it is repacking in "parent" that gets the wrong result).

Fascinating indeed :-).

Thanks,
Taylor
On Fri, Jan 29, 2021 at 05:10:20PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:
>
> > So it may be reasonable to go that direction, which is really defining a totally separate strategy from git-gc's "repack, and occasionally objects age out". Especially if we find that the assume-kept-packs-closed route is too risky (i.e., has too many cases where it's possible to cause corruption if our assumption isn't met).
>
> Yeah, this whole conversation has made me very nervous about using reachability. Fundamentally, this isn't about reachability at all. The operation is as simple as telling pack-objects a list of packs that you do and don't want objects from, making a new pack out of that, and then optionally dropping the packs that you rolled up.
>
> So, I think that teaching pack-objects a way to understand a caller that says "include objects from packs X, Y, and Z, but not if they appear in packs A, B, or C, and also pull in any loose objects" is the best way forward here.
>
> Of course, you're going to be dragging along unreachable objects until you decide to do a full repack, but I'm OK with that since we wouldn't expect anybody to rely solely on geometric repacks without occasionally running 'git repack -ad'.

While writing my other response, I had some thoughts that this "dragging along" might not be so bad. Just to lay out the problem as I see it, if you do:

  - frequently roll up all small packs and loose objects into a new pack, without regard to reachability

  - occasionally run "git repack -ad" to do a real traversal

then the problem is that unreachable objects never age out:

  - a loose unreachable object starts with a recent-ish mtime

  - the frequent roll-up rolls it into a pack, freshening its mtime

  - the full "repack -ad" doesn't delete it, because its pack mtime is too recent; it explodes it loose again

  - repeat forever

We know that "repack -d" is not 100% accurate because of similar "closed under reachability" assumptions (see my other email). But it's OK, because the worst case is an object that doesn't quite get packed yet, not one that gets deleted. So you could do something like:

  - roll up loose objects into a pack with "repack -d"; mostly accurate, but doesn't suck up unreachable objects

  - roll up small packs into a bigger pack without regard for reachability. This includes the pack created in the first step, but we know everything in it is actually reachable.

  - eventually run "repack -ad" to do a real traversal

That would extend the lifetime of unreachable objects which were found in a pack (they get dragged forward during the rollups). But they'd eventually get exploded loose during a "repack -ad", and then _not_ sucked back into a roll-up pack. And then eventually "repack -ad" removes them.

The downsides are:

  - doing a separate "repack -d" plus a roll-up repack is wasted work. But I think they could be combined into a single step (at the cost of some extra complexity in the implementation).

  - using "--unpacked" still means traversing every commit. That's much faster than traversing the whole object graph, but it still scales with the size of the repo, not the size of the new objects. That might be acceptable, though.

I do think the original problem goes away entirely if we can keep better track of the mtimes. I.e., if we had packs marked with ".cruft" instead of exploding loose, then the logic is:

  - roll up all loose objects and any objects in a pack that isn't marked as cruft (or keep); never delete a cruft pack at this stage

  - occasionally "repack -ad"; this does delete old cruft packs (because we'd have rescued any reachable objects they might have contained)

I'm not sure I want to block this topic on having cruft packs, though. Of course there are tons of _other_ reasons to want them (like not causing operational headaches when a repo's disk and inode usage grows by 10x due to exploding loose objects). So maybe it's not a bad idea to work on them together. I dunno.

-Peff
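The never-ages-out loop above can be seen in a toy timeline. This is a hypothetical Python sketch (invented names, purely illustrating the mtime argument, not git behavior in detail):

```python
def expires(expire_age, rollups_per_cycle, rollup_freshens):
    """Model the time between two 'repack -ad' runs: roll-ups happen once
    per time unit, then the full repack can drop an unreachable object
    whose recorded age reached expire_age. Returns True if the object can
    ever be expired."""
    age = 0
    for _ in range(rollups_per_cycle):
        age += 1          # time passes between roll-ups
        if rollup_freshens:
            age = 0       # sucked into a fresh pack: mtime reset
    return age >= expire_age
```

With a naïve roll-up that freshens mtimes, the recorded age never reaches the expiry threshold no matter how many cycles pass; the cruft-pack scheme avoids this by leaving cruft objects out of the roll-ups.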
On Fri, Jan 29, 2021 at 05:53:58PM -0500, Taylor Blau wrote:

> > I'm still thinking aloud here, and not really sure which is a better path. I do feel like the failure modes for the second one are less risky.
>
> The more I think about it, the more I feel that the second option is the right approach. It seems like if you were naïvely implementing this from scratch, you'd pick the second one (i.e., have pack-objects understand a new input mode, and then make a pack based on that).
>
> I am leery that we'd be able to get the first option "right" without attaching some sort of marker to each pack, especially given how difficult I think this is to reason about precisely. I suppose you could have a .closed file corresponding to each pack, or alternatively a $objdir/pack/pack-geometry file which specifies the same thing, but both of these feel overly restrictive.

Yeah, I think my gut feeling matches yours.

> Besides having to special-case the loose objects, is there any downside to doing the simpler thing here?

The other downside I can think of is that you can't just run "git repack --geometric" every time and eventually get a good result (or one that asymptotically approaches good ;) ). I.e., you now have two types of repacks: quick-and-dirty rollups, and "real" ones that do reachability. So you need some heuristics about how often you do one versus the other.

I'm definitely OK with that outcome. And I think we could even bake those heuristics into a script or mode of repack (e.g., maybe "gc --auto" would trigger a bigger repack every N times or something). But that's what I came up with by brainstorming. :)

-Peff
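For context, the geometric repack that the series aims at keeps packs in a size progression and rolls up the small ones. A hypothetical Python sketch of that selection (a sketch of the idea only; the heuristic in the actual series may differ in detail):

```python
def geometric_split(sizes, factor=2):
    """Given pack object counts, decide how many of the smallest packs
    to roll up so that each remaining pack holds at least `factor`
    times as many objects as all smaller packs combined."""
    sizes = sorted(sizes)
    split = 0        # roll up sizes[:split], keep sizes[split:]
    smaller = 0      # running total of packs below position i
    for i, n in enumerate(sizes):
        if i > 0 and n < factor * smaller:
            split = i + 1   # pack i breaks the progression; roll it up too
        smaller += n
    return split
```

For example, three one-object packs next to a ten-object pack get rolled up into one, leaving a two-pack progression.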
Taylor Blau <me@ttaylorr.com> writes:

> So, I think that teaching pack-objects a way to understand a caller that says "include objects from packs X, Y, and Z, but not if they appear in packs A, B, or C, and also pull in any loose objects" is the best way forward here.

Do our goals still include that the resulting packfile has good delta compression and object locality? The reachability traversal discovers which commits come close to which other commits, to help pack-objects arrange the resulting pack so that objects that appear close together in history appear close together in the pack. It also gives each object a pathname hint, to help group objects of the same type (either blobs or trees) with like paths together for better deltification.

Without a reachability traversal, I would imagine that it would become quite important to keep the order in which objects appear in the original pack, and the existing delta chains, as much as possible, or we'd be seeing a horribly inefficient pack like fast-import would produce.

Thanks.
Taylor Blau <me@ttaylorr.com> writes:

> On Fri, Jan 29, 2021 at 05:43:32PM -0500, Jeff King wrote:
>
> > So I think the paths forward are either:
> >
> >   - come up with an air-tight system of making sure that we know packs we claim are closed under reachability really are (perhaps some marker that says "I was generated by repack -a")
> >
> >   - have a "roll-up" mode that does not care about reachability at all, and just takes any objects from a particular set of packs (plus probably loose objects)
> >
> > I'm still thinking aloud here, and not really sure which is a better path. I do feel like the failure modes for the second one are less risky.
>
> The more I think about it, the more I feel that the second option is the right approach. It seems like if you were naïvely implementing this from scratch, you'd pick the second one (i.e., have pack-objects understand a new input mode, and then make a pack based on that).

Yes, a "roll-up" mode would be a sensible thing to have, as long as we can keep pruning out of the picture for now.

But in the end, I do think the "stop at any object in this frozen pack---these objects go all the way down to the root and we know they are reachable" optimization, which would give 'prune' a performance boost with a small margin of false positives about reachability (i.e. we may never be able to prune away an object in such a pack, even when it becomes unreachable), would be a valuable thing to have in a practical system. So from that point of view, the work done in these patches is not lost ;-)

The efficiency issue of the resulting pack that I mentioned earlier in a separate message is there in the "roll-up" mode, though.
On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > So, I think that teaching pack-objects a way to understand a caller that
> > says "include objects from packs X, Y, and Z, but not if they appear in
> > packs A, B, or C, and also pull in any loose objects" is the best way
> > forward here.
>
> Do our goals still include that the resulting packfile has good
> delta compression and object locality?  The reachability traversal
> discovers which commits come close to which other commits to help
> pack-objects arrange the resulting pack so that objects that appear
> close together in history also appear close together in the pack.  It
> also gives each object a pathname hint to help group objects of the
> same type (either blobs or trees) with like paths together for better
> deltification.

I think our goals here are a balance: end up with fewer packfiles,
while also ensuring that the packfiles we do create don't have
horrible delta compression and locality.

But now that you mention it, I remember that the reachability
traversal's bringing in object names was a reason that we decided to
implement this series using a reachability traversal in the first
place.

> Without the reachability traversal, I would imagine that it would
> become quite important to keep the order in which objects appear in
> the original pack, and the existing delta chains, as much as possible,
> or we'd be seeing a horribly inefficient pack like fast-import would
> produce.

Yeah; we'd definitely want to feed the objects to pack-objects in the
order that they appear in the original pack. Maybe that's not that bad
a tradeoff to make, though...

> Thanks.

Thanks,
Taylor
On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > So, I think that teaching pack-objects a way to understand a caller that
> > says "include objects from packs X, Y, and Z, but not if they appear in
> > packs A, B, or C, and also pull in any loose objects" is the best way
> > forward here.
>
> Do our goals still include that the resulting packfile has good
> delta compression and object locality?  The reachability traversal
> discovers which commits come close to which other commits to help
> pack-objects arrange the resulting pack so that objects that appear
> close together in history also appear close together in the pack.  It
> also gives each object a pathname hint to help group objects of the
> same type (either blobs or trees) with like paths together for better
> deltification.
>
> Without the reachability traversal, I would imagine that it would
> become quite important to keep the order in which objects appear in
> the original pack, and the existing delta chains, as much as possible,
> or we'd be seeing a horribly inefficient pack like fast-import would
> produce.

Thanks, that's another good point we discussed a while ago (off-list),
but which hasn't come up in this discussion yet.

Another option here is not to roll up packs at all, but instead to use
a midx to cover them all[1]. That solves the issue where object lookup
is O(nr_packs), and you retain the same locality and delta
characteristics. But I think part of the goal is to actually improve
the deltas, in two ways:

  - we'd hopefully find new delta opportunities between objects in the
    various packs

  - we'll drop some objects that are duplicated in other packs.
    Definitely we have to in order to avoid duplicates within the
    roll-up pack, but I think we'd want to even for objects that are
    in the "big" kept pack. These are likely bases of deltas in our
    roll-up pack, since the common cause there is --fix-thin adding
    them to complete the pack.
But we really prefer to serve fetches using the copies out of the main
pack, since they may already themselves be deltas (which makes them
way cheaper; we can send the delta straight off the disk, rather than
looking for a new possible base).

So I would anticipate the delta-compression phase actually trying to
do some new work. I do worry that the lack of pathname hints may make
the deltas we find much worse (or cause us to spend excessive CPU
searching for them). It's possible we could do a "best effort"
traversal where we walk new commits to find newly added pathnames, but
don't bother crossing into trees/commits that aren't in the set of
objects to be packed. It's OK to optimize for speed there, because
it's just feeding the delta heuristic, not the set of objects we plan
to pack.

-Peff

[1] Our end-game plan is actually to _also_ use a midx to cover the
    roll-ups and the "big" pack, since we'd want to generate bitmaps
    for the new objects, too.
On Fri, Jan 29, 2021 at 06:28:28PM -0500, Taylor Blau wrote:
> On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> > Do our goals still include that the resulting packfile has good
> > delta compression and object locality?  The reachability traversal
> > discovers which commits come close to which other commits to help
> > pack-objects arrange the resulting pack so that objects that appear
> > close together in history also appear close together in the pack.
> > It also gives each object a pathname hint to help group objects of
> > the same type (either blobs or trees) with like paths together for
> > better deltification.
>
> I think our goals here are a balance: end up with fewer packfiles,
> while also ensuring that the packfiles we do create don't have
> horrible delta compression and locality.
>
> But now that you mention it, I remember that the reachability
> traversal's bringing in object names was a reason that we decided to
> implement this series using a reachability traversal in the first
> place.

Peff shared a very clever idea with me today. Like in the naive
approach, we fill the list of "objects to pack" with everything in the
packs that are about to get rolled up, excluding anything that appears
in the large packs.

But we also do a reachability traversal whose starting points are all
of the commits in the packs that are about to be rolled up, filling in
the namehash of the objects we encounter along the way. Like in the
original version of this series, we stop early once we encounter an
object in any of the frozen packs (which are marked as kept in core),
and so we might not traverse through everything. But that's completely
OK, since we already know we have the right list of objects to pack
(at worst, we'd end up with some zero'd namehashes and come up with
slightly worse deltas).

I think that this is a nice middle-ground (and it allows us to reuse
lots of work from the original version), so I'm quite happy.
It's in my fork [1] in the tb/geometric-repack.wip branch, but I'll
try and clean those patches up tomorrow and send a v2 to the list.

Thanks,
Taylor

[1]: https://github.com/ttaylorr/git
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 54d715ead1..cbe08e7415 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -135,6 +135,17 @@ depth is 4095.
 	leading directory (e.g. `pack-123.pack`). The option could be
 	specified multiple times to keep multiple packs.
 
+--assume-kept-packs-closed::
+	This flag causes `git rev-list` to halt the object traversal
+	when it encounters an object found in a kept pack. This is
+	dissimilar to `--honor-pack-keep`, which only prunes unwanted
+	results after the full traversal is completed.
++
+Without any `--keep-pack=<pack-name>` arguments, only packs with an
+on-disk `*.keep` file are used when considering when to halt the
+traversal. If other packs are artificially marked as "kept" with
+`--keep-pack`, then those are considered as well.
+
 --incremental::
 	This flag causes an object already in a pack to be ignored
 	even if it would have otherwise been packed.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..a5dcd66f52 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -78,6 +78,7 @@ static int have_non_local_packs;
 static int incremental;
 static int ignore_packed_keep_on_disk;
 static int ignore_packed_keep_in_core;
+static int assume_kept_packs_closed;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
@@ -3542,6 +3543,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
 			 N_("ignore packs that have companion .keep file")),
+		OPT_BOOL(0, "assume-kept-packs-closed", &assume_kept_packs_closed,
+			 N_("assume the union of kept packs is closed under reachability")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("ignore this pack")),
 		OPT_INTEGER(0, "compression", &pack_compression_level,
@@ -3631,6 +3634,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--unpacked");
 	}
+	if (assume_kept_packs_closed)
+		use_internal_rev_list = 1;
 
 	if (exclude_promisor_objects) {
 		use_internal_rev_list = 1;
@@ -3711,6 +3716,14 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!p) /* no keep-able packs found */
 			ignore_packed_keep_on_disk = 0;
 	}
+	if (assume_kept_packs_closed) {
+		if (ignore_packed_keep_on_disk && ignore_packed_keep_in_core)
+			strvec_push(&rp, "--no-kept-objects");
+		else if (ignore_packed_keep_on_disk)
+			strvec_push(&rp, "--no-kept-objects=on-disk");
+		else if (ignore_packed_keep_in_core)
+			strvec_push(&rp, "--no-kept-objects=in-core");
+	}
 	if (local) {
 		/*
 		 * unlike ignore_packed_keep_on_disk above, we do not
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
index 9239d8aa46..0861305a04 100755
--- a/t/t6114-keep-packs.sh
+++ b/t/t6114-keep-packs.sh
@@ -66,4 +66,63 @@ test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
 	test_cmp expect actual
 '
 
+test_expect_success '--no-kept-objects can respect only in-core keep packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names packed..kept &&
+		git rev-list --objects --no-object-names loose
+	) | sort >expect &&
+
+	git pack-objects \
+		--assume-kept-packs-closed \
+		--keep-pack=pack-$MISC_PACK.pack \
+		--all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success 'setup additional --no-kept-objects tests' '
+	test_commit additional &&
+
+	ADDITIONAL_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/additional
+	^refs/tags/kept
+	EOF
+	)
+'
+
+test_expect_success '--no-kept-objects can respect only on-disk keep packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names kept..additional &&
+		git rev-list --objects --no-object-names packed
+	) | sort >expect &&
+
+	git pack-objects \
+		--assume-kept-packs-closed \
+		--honor-pack-keep \
+		--all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects can respect mixed kept packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names kept..additional &&
+		git rev-list --objects --no-object-names loose
+	) | sort >expect &&
+
+	git pack-objects \
+		--assume-kept-packs-closed \
+		--honor-pack-keep \
+		--keep-pack=pack-$MISC_PACK.pack \
+		--all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
 test_done
Teach pack-objects an option to imply the revision machinery's new
'--no-kept-objects' option when doing a reachability traversal.

When '--assume-kept-packs-closed' is given as an argument to
pack-objects, it behaves differently (i.e., passes different options
to the ensuing revision walk) depending on whether or not other
arguments are passed:

  - If the caller also specifies a '--keep-pack' argument (to mark a
    pack as kept in-core), then assume that this combination means to
    stop traversal only at in-core packs.

  - If instead the caller passes '--honor-pack-keep', then assume that
    the caller wants to stop traversal only at packs with a
    corresponding .keep file (consistent with the original meaning,
    which only refers to packs with a .keep file).

  - If both '--keep-pack' and '--honor-pack-keep' are passed, then
    assume the caller wants to stop traversal at either kind of kept
    pack.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt | 11 ++++++
 builtin/pack-objects.c             | 13 +++++++
 t/t6114-keep-packs.sh              | 59 ++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)