[03/10] builtin/pack-objects.c: learn '--assume-kept-packs-closed'

Message ID 2da42e9ca26c9ef914b8b044047d505f00a27e20.1611098616.git.me@ttaylorr.com (mailing list archive)
State Superseded
Series: repack: support repacking into a geometric sequence

Commit Message

Taylor Blau Jan. 19, 2021, 11:24 p.m. UTC
Teach pack-objects an option to imply the revision machinery's new
'--no-kept-objects' option when doing a reachability traversal.

When '--assume-kept-packs-closed' is given as an argument to
pack-objects, it behaves differently (i.e., passes different options to
the ensuing revision walk) depending on whether or not other arguments
are passed:

  - If the caller also specifies a '--keep-pack' argument (to mark a
    pack as kept in-core), then assume that this combination means to
    stop traversal only at in-core packs.

  - If instead the caller passes '--honor-pack-keep', then assume that
    the caller wants to stop traversal only at packs with a
    corresponding .keep file (consistent with the original meaning which
    only refers to packs with a .keep file).

  - If both '--keep-pack' and '--honor-pack-keep' are passed, then
    assume the caller wants to stop traversal at either kind of kept
    pack.
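
For example (the pack name below is illustrative):

  # stop traversal only at the in-core kept pack
  git pack-objects --assume-kept-packs-closed \
      --keep-pack=pack-1234abcd.pack --all out </dev/null

  # stop traversal only at packs with an on-disk .keep file
  git pack-objects --assume-kept-packs-closed --honor-pack-keep \
      --all out </dev/null

  # stop traversal at either kind of kept pack
  git pack-objects --assume-kept-packs-closed --honor-pack-keep \
      --keep-pack=pack-1234abcd.pack --all out </dev/null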

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-pack-objects.txt | 11 ++++++
 builtin/pack-objects.c             | 13 +++++++
 t/t6114-keep-packs.sh              | 59 ++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)

Comments

Junio C Hamano Jan. 29, 2021, 3:21 a.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> Teach pack-objects an option to imply the revision machinery's new
> '--no-kept-objects' option when doing a reachability traversal.
>
> When '--assume-kept-packs-closed' is given as an argument to
> pack-objects, it behaves differently (i.e., passes different options to
> the ensuing revision walk) depending on whether or not other arguments
> are passed:
>
>   - If the caller also specifies a '--keep-pack' argument (to mark a
>     pack as kept in-core), then assume that this combination means to
>     stop traversal only at in-core packs.
>
>   - If instead the caller passes '--honor-pack-keep', then assume that
>     the caller wants to stop traversal only at packs with a
>     corresponding .keep file (consistent with the original meaning which
>     only refers to packs with a .keep file).
>
>   - If both '--keep-pack' and '--honor-pack-keep' are passed, then
>     assume the caller wants to stop traversal at either kind of kept
>     pack.

If there is an out-of-band guarantee that .keep packs won't refer to
the outside world, then we can obtain identical results to what the
existing --honor-pack-keep (which traverses everything and then
filters out what is in a .keep pack) does by just stopping traversal
when we see an object that is found in a .keep pack.  OK, I guess
that answers the correctness question I asked about [02/10].

It still is curious how we can safely "assume", but presumably we
will see how in a patch that appears later in the series.

How "closed" are these kept packs supposed to be?  When there are
two .keep packs, should objects in each of the packs never refer
outside their own pack, or is it OK for objects in one kept pack to
refer to another object in the other kept pack?  Readers and those
who want to understand and extend this code in the future would need
to know what definition of "closed" you are using here.

Thanks.
Jeff King Jan. 29, 2021, 7:19 p.m. UTC | #2
On Thu, Jan 28, 2021 at 07:21:09PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > Teach pack-objects an option to imply the revision machinery's new
> > '--no-kept-objects' option when doing a reachability traversal.
> >
> > When '--assume-kept-packs-closed' is given as an argument to
> > pack-objects, it behaves differently (i.e., passes different options to
> > the ensuing revision walk) depending on whether or not other arguments
> > are passed:
> >
> >   - If the caller also specifies a '--keep-pack' argument (to mark a
> >     pack as kept in-core), then assume that this combination means to
> >     stop traversal only at in-core packs.
> >
> >   - If instead the caller passes '--honor-pack-keep', then assume that
> >     the caller wants to stop traversal only at packs with a
> >     corresponding .keep file (consistent with the original meaning which
> >     only refers to packs with a .keep file).
> >
> >   - If both '--keep-pack' and '--honor-pack-keep' are passed, then
> >     assume the caller wants to stop traversal at either kind of kept
> >     pack.
> 
> If there is an out-of-band guarantee that .keep packs won't refer to
> the outside world, then we can obtain identical results to what the
> existing --honor-pack-keep (which traverses everything and then
> filters out what is in a .keep pack) does by just stopping traversal
> when we see an object that is found in a .keep pack.  OK, I guess
> that answers the correctness question I asked about [02/10].
> 
> It still is curious how we can safely "assume", but presumably we
> will see how in a patch that appears later in the series.

I think this would generally happen if the .keep packs are generated
using something like "git repack -a", which packs everything reachable
together. So if you do:

  git repack -ad
  touch .git/objects/pack/pack-whatever.keep
  ... some more packs come in, perhaps via pushes ...
  # imagine repack knew how to pass this along...
  git repack -a --assume-kept-packs-closed

then you'd repack just the objects that aren't in the big pack.

(And other kept packs; though you probably would not want to use this
with on-disk .keep packs, since those can be racily created by
receive-pack. Rather, you'd want to say "consider this specific list of
packs to be closed and kept".)

It is dangerous, though. If your assumption is somehow wrong, then you'd
potentially corrupt the repository (because you'd stop traversing, but
perhaps delete a pack that contained a useful object you would have
reached that you don't have elsewhere).

The overall goal here is being able to roll up loose objects and smaller
packs without having to pay the cost of a full reachability traversal
(which can take several minutes on large repositories). Another
very-different direction there is to just enumerate those objects
without respect to reachability, stick them in a pack, and then delete
the originals. That does imply something like "repack -k", though, and
interacts weirdly with letting unreachable objects age out via their
mtimes (we'd constantly suck them back into fresh packs).

That would work better if our unreachable "aging out" storage were
marked as such (say, in a pack marked with a ".cruft" file, rather than
just a regular loose object that might be new or might be cruft). Then a
roll-up repack would leave cruft packs alone (neither rolling them up,
nor deleting them). A "real" repack would eventually delete them, but
only after having done an actual reachability traversal, which makes
sure there are no objects within them that need to be rescued.

> How "closed" are these kept packs supposed to be?  When there are
> two .keep packs, should objects in each of the packs never refer
> outside their own pack, or is it OK for objects in one kept pack to
> refer to another object in the other kept pack?  Readers and those
> who want to understand and extend this code in the future would need
> to know what definition of "closed" you are using here.

I think it would want to be "the set of all .keep packs is closed". In a
"roll all into one" scenario like above, there is only one .keep pack.
But in a geometric progression, that single pack which constitutes your
base set could be multiple packs (the last whole "git repack -ad", but
then a sequence of roll-ups that came on top of it).
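
For instance (sizes entirely made up), the closed base set might look
like:

  pack-A  (100000 objects)   <- last full "git repack -ad"
  pack-B    (2000 objects)   <- roll-up on top of A
  pack-C     (150 objects)   <- later roll-up on top of A and B

where A+B+C is closed as a whole, even though no single pack is.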

-Peff
Taylor Blau Jan. 29, 2021, 8:01 p.m. UTC | #3
On Fri, Jan 29, 2021 at 02:19:25PM -0500, Jeff King wrote:
> The overall goal here is being able to roll up loose objects and smaller
> packs without having to pay the cost of a full reachability traversal
> (which can take several minutes on large repositories). Another
> very-different direction there is to just enumerate those objects
> without respect to reachability, stick them in a pack, and then delete
> the originals. That does imply something like "repack -k", though, and
> interacts weirdly with letting unreachable objects age out via their
> mtimes (we'd constantly suck them back into fresh packs).

As I mentioned in an earlier response to Junio, this was the original
approach that I took when implementing this, but ultimately decided
against it because it means that we'll never let unreachable objects age
out (as you note).

I wonder if we need our assumption that the union of kept packs is
closed under reachability to be specified as an option. If the option is
passed, then we stop the traversal as soon as we hit an object in the
frozen packs. If not passed, then we do a full traversal but pass
--honor-pack-keep to drop out objects in the frozen packs after the
fact.

Thoughts?

> I think it would want to be "the set of all .keep packs is closed". In a
> "roll all into one" scenario like above, there is only one .keep pack.
> But in a geometric progression, that single pack which constitutes your
> base set could be multiple packs (the last whole "git repack -ad", but
> then a sequence of roll-ups that came on top of it).

I don't think having a roll-up strategy of "all-except-one" simplifies
things. Or, if it does, then I don't understand it. Isn't this the exact
same thing as a geometric repack which decides to keep only one pack?

ISTM that you would be susceptible to the same problems in this case,
too.


Thanks,
Taylor
Jeff King Jan. 29, 2021, 8:25 p.m. UTC | #4
On Fri, Jan 29, 2021 at 03:01:48PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 02:19:25PM -0500, Jeff King wrote:
> > The overall goal here is being able to roll up loose objects and smaller
> > packs without having to pay the cost of a full reachability traversal
> > (which can take several minutes on large repositories). Another
> > very-different direction there is to just enumerate those objects
> > without respect to reachability, stick them in a pack, and then delete
> > the originals. That does imply something like "repack -k", though, and
> > interacts weirdly with letting unreachable objects age out via their
> > mtimes (we'd constantly suck them back into fresh packs).
> 
> As I mentioned in an earlier response to Junio, this was the original
> approach that I took when implementing this, but ultimately decided
> against it because it means that we'll never let unreachable objects age
> out (as you note).

Right. But that's no different than using "-k" most of the time, and
then occasionally doing a more careful repack with short expiration
times and a full reachability check. As you know, this is basically what
we do at GitHub.

So it may be reasonable to go that direction, which is really defining a
totally separate strategy from git-gc's "repack, and occasionally
objects age out". Especially if we find that the
assume-kept-packs-closed route is too risky (i.e., has too many cases
where it's possible to cause corruption if our assumption isn't met).

I'm not convinced either way at this point, but just thinking out loud
on the options (and trying to give some context to the list).

> I wonder if we need our assumption that the union of kept packs is
> closed under reachability to be specified as an option. If the option is
> passed, then we stop the traversal as soon as we hit an object in the
> frozen packs. If not passed, then we do a full traversal but pass
> --honor-pack-keep to drop out objects in the frozen packs after the
> fact.
> 
> Thoughts?

I'm confused. I thought the whole idea was to pass it as an option (the
user telling Git "I know these packs are supposed to be closed; trust
me")?

> > I think it would want to be "the set of all .keep packs is closed". In a
> > "roll all into one" scenario like above, there is only one .keep pack.
> > But in a geometric progression, that single pack which constitutes your
> > base set could be multiple packs (the last whole "git repack -ad", but
> > then a sequence of roll-ups that came on top of it).
> 
> I don't think having a roll-up strategy of "all-except-one" simplifies
> things. Or, if it does, then I don't understand it. Isn't this the exact
> same thing as a geometric repack which decides to keep only one pack?
> 
> ISTM that you would be susceptible to the same problems in this case,
> too.

I wasn't trying to argue that all-except-one avoids any problems. I was
saying that the example I gave above was an all-into-one, but if you
want to extend the concept to multiple packs, it has to cover the whole
set. I.e., answering Junio's:

  > is it OK for objects in one kept pack to refer to another object in
  > the other kept pack?

with "yes".

-Peff
Junio C Hamano Jan. 29, 2021, 8:30 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

>> If there is an out-of-band guarantee that .keep packs won't refer to
>> the outside world, then we can obtain identical results to what the
>> existing --honor-pack-keep (which traverses everything and then
>> filters out what is in a .keep pack) does by just stopping traversal
>> when we see an object that is found in a .keep pack.  OK, I guess
>> that answers the correctness question I asked about [02/10].
>> 
>> It still is curious how we can safely "assume", but presumably we
>> will see how in a patch that appears later in the series.
>
> I think this would generally happen if the .keep packs are generated
> using something like "git repack -a", which packs everything reachable
> together. So if you do:
>
>   git repack -ad
>   touch .git/objects/pack/pack-whatever.keep
>   ... some more packs come in, perhaps via pushes ...
>   # imagine repack knew how to pass this along...
>   git repack -a --assume-kept-packs-closed
>
> then you'd repack just the objects that aren't in the big pack.

Yeah.  As a tool to help the above workflow, where you are only
creating another .keep out of youngest objects (i.e. those that are
either loose or in non-kept packs), because by definition anything
in .keep cannot be pointing back at these younger objects, it does
make sense to take advantage of "the set of packs with .keep as a
whole is closed".

It may become tricky once we start talking about creating a new
.keep out of youngest objects PLUS a few young keep packs, though.

Starting from all on-disk .keep packs, you'd mark them with the
in-core keep bit, then drop the in-core keep bit from the few young
keep packs that you intend to coalesce with the youngest
objects---that is how I would imagine your repacking strategy would
go.  The set of all the on-disk .keep packs may give us the "closed"
guarantee, but if we exclude the few latest packs from that set,
would the remainder still give us the "closed" guarantee we can take
advantage of, in order to pack these youngest objects (including the
ones in the kept packs that we are coalescing)?
Taylor Blau Jan. 29, 2021, 10:10 p.m. UTC | #6
On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:
> So it may be reasonable to go that direction, which is really defining a
> totally separate strategy from git-gc's "repack, and occasionally
> objects age out". Especially if we find that the
> assume-kept-packs-closed route is too risky (i.e., has too many cases
> where it's possible to cause corruption if our assumption isn't met).

Yeah, this whole conversation has made me very nervous about using
reachability. Fundamentally, this isn't about reachability at all. The
operation is as simple as telling pack-objects a list of packs that you
do and don't want objects from, making a new pack out of that, and then
optionally dropping the packs that you rolled up.

So, I think that teaching pack-objects a way to understand a caller that
says "include objects from packs X, Y, and Z, but not if they appear in
packs A, B, or C, and also pull in any loose objects" is the best way
forward here.
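
One could imagine an interface along these lines (the option name and
the "^" exclusion syntax are purely illustrative; no such flag exists
yet):

  git pack-objects --stdin-packs out <<-EOF
  pack-x.pack
  pack-y.pack
  pack-z.pack
  ^pack-a.pack
  ^pack-b.pack
  EOF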

Of course, you're going to be dragging along unreachable objects until
you decide to do a full repack, but I'm OK with that since we wouldn't
expect anybody to be solely relying on geometric repacks without
occasionally running 'git repack -ad'.

Junio: I don't think that you have picked this up yet, but please avoid
doing so for now, and I'll send a new series that goes in the direction
I outlined above.

Thanks,
Taylor
Junio C Hamano Jan. 29, 2021, 10:13 p.m. UTC | #7
Jeff King <peff@peff.net> writes:

>> I wonder if we need our assumption that the union of kept packs is
>> closed under reachability to be specified as an option. If the option is
>> passed, then we stop the traversal as soon as we hit an object in the
>> frozen packs. If not passed, then we do a full traversal but pass
>> --honor-pack-keep to drop out objects in the frozen packs after the
>> fact.
>> 
>> Thoughts?
>
> I'm confused. I thought the whole idea was to pass it as an option (the
> user telling Git "I know these packs are supposed to be closed; trust
> me")?

Yes, that is how I read these patches, and it sounds like an assumption
that we can make under many scenarios/repacking strategies.
Jeff King Jan. 29, 2021, 10:43 p.m. UTC | #8
On Fri, Jan 29, 2021 at 12:30:38PM -0800, Junio C Hamano wrote:

> > I think this would generally happen if the .keep packs are generated
> > using something like "git repack -a", which packs everything reachable
> > together. So if you do:
> >
> >   git repack -ad
> >   touch .git/objects/pack/pack-whatever.keep
> >   ... some more packs come in, perhaps via pushes ...
> >   # imagine repack knew how to pass this along...
> >   git repack -a --assume-kept-packs-closed
> >
> > then you'd repack just the objects that aren't in the big pack.
> 
> Yeah.  As a tool to help the above workflow, where you are only
> creating another .keep out of youngest objects (i.e. those that are
> either loose or in non-kept packs), because by definition anything
> in .keep cannot be pointing back at these younger objects, it does
> make sense to take advantage of "the set of packs with .keep as a
> whole is closed".
> 
> It may become tricky once we start talking about creating a new
> .keep out of youngest objects PLUS a few young keep packs, though.

Right. You'd have to make sure the younger packs were also created with
that reachability in mind. I.e., if you are in a situation where you've
got:

  - a big "old" pack
  - N new packs from pushes

then you can't assume anything about the reachability for those
individual N packs. It would be wrong to split any of them into the
"keep" side. They need to have all their objects traversed until we hit
something in the old pack.

But if you have a situation with:

  - a big "old" pack
  - M packs made from previous rollups on top of the old pack
  - N new packs from pushes

Then I think you can still take the old pack + the M packs as a
cohesive unit closed under reachability.

The tricky part is knowing which packs are which (size is a heuristic,
but it can be wrong; people may make a big push to a previously-small
repository).

> Starting from all on-disk .keep packs, you'd mark them with the
> in-core keep bit, then drop the in-core keep bit from the few young
> keep packs that you intend to coalesce with the youngest
> objects---that is how I would imagine your repacking strategy would
> go.  The set of all the on-disk .keep packs may give us the "closed"
> guarantee, but if we exclude the few latest packs from that set,
> would the remainder still give us the "closed" guarantee we can take
> advantage of, in order to pack these youngest objects (including the
> ones in the kept packs that we are coalescing)?

Yeah, I think we are going along the same lines. Except I think it is
dangerous to use on-disk ".keep" as your marker, because we will racily
see incoming push packs with a ".keep" (which receive-pack/index-pack
use as a lockfile until the refs are updated).

So repack has to "somehow" get the list of which is which.

None of which is disputing your "it may become tricky", of course. ;) It
is exactly this trickiness that I am worried about. And I am not being
coy with "somehow", as if we have some custom not-yet-shared layer on
top of repack that tracks this. We are still figuring out whether this
is a good direction in the first place. :)

One of the things that led us to this reachability traversal, away from
"just suck up all of the objects from these packs", is that this is how
--unpacked works. I had always assumed it was implemented as "if it's
loose, then put it in the pack". But it's not. It's attached to the
revision traversal. And it actually gets some cases wrong!

It will walk every commit, so you don't have to worry about a packed
commit referring to an unpacked one. But it doesn't look at the trees of
the packed commits (for the quite obvious reason that doing so is orders
of magnitude more expensive). That means that if there is a packed
commit that refers to an unpacked blob (which is not referenced by an
unpacked commit), then "rev-list --unpacked" will not report it (and
likewise "git repack -d" would not pack it).

It's easy to create such a situation manually, but I've included a
more plausible sequence involving "--amend" and push/fetch unpackLimit
at the end of this email.

At the core, --unpacked is assuming certain things about reachability of
loose/packed objects that aren't necessarily true. And this
--assume-kept-packs-closed stuff is basically doing the same thing for a
particular set of packs (albeit more so; I believe the patches here cut
off traversal of parent pointers, not just commit-to-tree pointers).

One of the reasons I think nobody noticed with --unpacked is that the
stakes are pretty low. If our assumption is wrong, the worst case is
that a loose object remains unpacked in "repack -d". But we'd never
delete it based on that information (instead, git prune would do its own
traversal to find the correct reachability). And it would eventually get
picked up by "git repack -ad".

But for repacking, the general strategy is to put things you want to
keep into the new pack, and then delete the old ones (not marked as
keep, of course). So if our assumption is ever wrong, it means we'd
potentially drop packs that have reachable objects not found elsewhere,
and we'd end up corrupting the repository.

So I think the paths forward are either:

  - come up with an air-tight system of making sure that we know packs
    we claim are closed under reachability really are (perhaps some
    marker that says "I was generated by repack -a")

  - have a "roll-up" mode that does not care about reachability at all,
    and just takes any objects from a particular set of packs (plus
    probably loose objects)

I'm still thinking aloud here, and not really sure which is a better
path. I do feel like the failure modes for the second one are less
risky.

Anyway, here's the --unpacked example, if you're curious. It's based on
fetching, but you could invert it to do pushes (in which case it is
repacking in "parent" that gets the wrong result).

-- >8 --
# two repos, one a clone of the other
git init parent
git -C parent commit --allow-empty -m base
git clone parent child

# now there's a small fetch, which will get
# exploded into loose objects.
(
	cd parent
	echo small >small
	git add small
	git commit -m small
)
git -C child fetch

# We can verify that "rev-list --unpacked" reports these
# objects.
git -C child rev-list --objects --unpacked origin

# and now a bigger one that will remain a pack (we'll
# tweak unpackLimit instead of making a really big commit,
# but the concept is the same)
#
# There are two key things here:
#
#   - the "small" commit is no longer reachable, but the big one
#     still contains the "small" blob object. Using --amend is
#     a plausible mechanism for this happening.
#
#   - we are using bitmaps, which give us an exact answer for the set of
#     objects to send. Otherwise, pack-objects on the server actually
#     fails to notice the other side has told us it has the small blob, and
#     sends another copy of it.
(
	cd parent
	git repack -adb
	echo big >big
	git add big
	git commit --amend -m big
)
git -C child -c fetch.unpackLimit=1 fetch

# So now in the child we have a packed object
# whose ancestor is an unpacked one. rev-list
# now won't report the "small" blob (ac790413).
git -C child rev-list --objects --unpacked origin

# Even though we can see that it's present only as an
# unpacked object.
show_objects() {
	for i in child/.git/objects/pack/*.idx; do
		git show-index <$i
	done | cut -d' ' -f2 | sed 's/^/packed: /'
	find child/.git/objects/??/* |
		perl -F/ -alne 'print " loose: $F[-2]$F[-1]"'
}

# If we were to do an incremental repack now, it wouldn't be packed.
# (Note we have to kill off the reflog, which still references the
# rewound commit).
rm -rf child/.git/logs
git -C child repack -d
show_objects
Taylor Blau Jan. 29, 2021, 10:53 p.m. UTC | #9
On Fri, Jan 29, 2021 at 05:43:32PM -0500, Jeff King wrote:
> So I think the paths forward are either:
>
>   - come up with an air-tight system of making sure that we know packs
>     we claim are closed under reachability really are (perhaps some
>     marker that says "I was generated by repack -a")
>
>   - have a "roll-up" mode that does not care about reachability at all,
>     and just takes any objects from a particular set of packs (plus
>     probably loose objects)
>
> I'm still thinking aloud here, and not really sure which is a better
> path. I do feel like the failure modes for the second one are less
> risky.

The more I think about it, the more I feel that the second option is the
right approach. It seems like if you were naïvely implementing this from
scratch, you'd pick the second one (i.e., have pack-objects
understand a new input mode, and then make a pack based on that).

I am leery that we'd be able to get the first option "right" without
attaching some sort of marker to each pack, especially given how
difficult I think that this is to reason about precisely. I suppose you
could have a .closed file corresponding to each pack, or alternatively a
$objdir/pack/pack-geometry file which specifies the same thing, but both
of these feel overly restrictive.

Besides having to special case the loose objects, is there any downside
to doing the simpler thing here?

> Anyway, here's the --unpacked example, if you're curious. It's based on
> fetching, but you could invert it to do pushes (in which case it is
> repacking in "parent" that gets the wrong result).

Fascinating indeed :-).

Thanks,
Taylor
Jeff King Jan. 29, 2021, 10:57 p.m. UTC | #10
On Fri, Jan 29, 2021 at 05:10:20PM -0500, Taylor Blau wrote:

> On Fri, Jan 29, 2021 at 03:25:37PM -0500, Jeff King wrote:
> > So it may be reasonable to go that direction, which is really defining a
> > totally separate strategy from git-gc's "repack, and occasionally
> > objects age out". Especially if we find that the
> > assume-kept-packs-closed route is too risky (i.e., has too many cases
> > where it's possible to cause corruption if our assumption isn't met).
> 
> Yeah, this whole conversation has made me very nervous about using
> reachability. Fundamentally, this isn't about reachability at all. The
> operation is as simple as telling pack-objects a list of packs that you
> do and don't want objects from, making a new pack out of that, and then
> optionally dropping the packs that you rolled up.
> 
> So, I think that teaching pack-objects a way to understand a caller that
> says "include objects from packs X, Y, and Z, but not if they appear in
> packs A, B, or C, and also pull in any loose objects" is the best way
> forward here.
> 
> Of course, you're going to be dragging along unreachable objects until
> you decide to do a full repack, but I'm OK with that since we wouldn't
> expect anybody to be solely relying on geometric repacks without
> occasionally running 'git repack -ad'.

While writing my other response, I had some thoughts that this "dragging
along" might not be so bad.

Just to lay out the problem as I see it, if you do:

  - frequently roll up all small packs and loose objects into a new
    pack, without regard to reachability

  - occasionally run "git repack -ad" to do a real traversal

then the problem is that unreachable objects never age out:

  - a loose unreachable object starts with a recent-ish mtime

  - the frequent roll-up rolls it into a pack, freshening its mtime

  - the full "repack -ad" doesn't delete it, because its pack mtime is
    too recent. It explodes it loose again.

  - repeat forever

We know that "repack -d" is not 100% accurate because of similar "closed
under reachability" assumptions (see my other email). But it's OK,
because the worst case is an object that doesn't quite get packed yet,
not that it gets deleted.

So you could do something like:

  - roll up loose objects into a pack with "repack -d"; mostly accurate,
    but doesn't suck up unreachable objects

  - roll up small packs into a bigger pack without regard for
    reachability. This includes the pack created in the first step, but
    we know everything in it is actually reachable.

  - eventually run "repack -ad" to do a real traversal

That would extend the lifetime of unreachable objects which were found
in a pack (they get dragged forward during the rollups). But they'd
eventually get exploded loose during a "repack -ad", and then _not_
sucked back into a roll-up pack. And then eventually "repack -ad"
removes them.
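
In command form, that cadence might look something like this (the
roll-up step is hypothetical; no such repack mode exists yet):

  git repack -d          # step 1: loose objects in, reachability-based
  git repack --roll-up   # step 2 (hypothetical): coalesce small packs
  ...                    # repeat steps 1 and 2 as new packs arrive
  git repack -ad         # step 3, occasionally: real traversal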

The downsides are:

  - doing a separate "repack -d" plus a roll-up repack is wasted work.
    But I think they could be combined into a single step (at the cost
    of some extra complexity in the implementation).

  - using "--unpacked" still means traversing every commit. That's much
    faster than traversing the whole object graph, but still scales with
    the size of the repo, not the size of the new objects. That might be
    acceptable, though.

I do think the original problem goes away entirely if we can keep better
track of the mtimes. I.e., if we had packs marked with ".cruft" instead
of exploding loose, then the logic is:

  - roll up all loose objects and any objects in a pack that isn't
    marked as cruft (or keep); never delete a cruft pack at this stage

  - occasionally "repack -ad"; this does delete old cruft packs (because
    we'd have rescued any reachable objects they might have contained)
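
On disk, that might look something like this (the ".cruft" marker file
is hypothetical; no such convention exists yet):

  objects/pack/pack-big.pack    # reachable objects
  objects/pack/pack-old.pack    # unreachable objects aging out
  objects/pack/pack-old.cruft   # hypothetical marker: skip during roll-ups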

I'm not sure I want to block this topic on having cruft packs, though.
Of course there are tons of _other_ reasons to want them (like not
causing operational headaches when a repo's disk and inode usage grows
by 10x due to exploding loose objects). So maybe it's not a bad idea to
work on them together. I dunno.

-Peff
Jeff King Jan. 29, 2021, 11 p.m. UTC | #11
On Fri, Jan 29, 2021 at 05:53:58PM -0500, Taylor Blau wrote:

> > I'm still thinking aloud here, and not really sure which is a better
> > path. I do feel like the failure modes for the second one are less
> > risky.
> 
> The more I think about it, the more I feel that the second option is the
> right approach. It seems like if you were naïvely implementing this from
> scratch, that you'd pick the second one (i.e., have pack-objects
> understand a new input mode, and then make a pack based on that).
> 
> I am leery that we'd be able to get the first option "right" without
> attaching some sort of marker to each pack, especially given how
> difficult I think that this is to reason about precisely. I suppose you
> could have a .closed file corresponding to each pack, or alternatively a
> $objdir/pack/pack-geometry file which specifies the same thing, but both
> of these feel overly restrictive.

Yeah, I think my gut feeling matches yours.

> Besides having to special case the loose objects, is there any downside
> to doing the simpler thing here?

The other downside I can think of is that you can't just run "git repack
--geometric" every time, and eventually get a good result (or one that
asymptotically approaches good ;) ). I.e., you now have two types of
repacks: quick and dirty rollups, and "real" ones that do reachability.
So you need some heuristics about how often you do one versus the other.

I'm definitely OK with that outcome. And I think we could even bake
those heuristics into a script or mode of repack (e.g., maybe "gc
--auto" would trigger a bigger repack every N times or something). But
that's what I came up with by brainstorming. :)

-Peff
Junio C Hamano Jan. 29, 2021, 11:03 p.m. UTC | #12
Taylor Blau <me@ttaylorr.com> writes:

> So, I think that teaching pack-objects a way to understand a caller that
> says "include objects from packs X, Y, and Z, but not if they appear in
> packs A, B, or C, and also pull in any loose objects" is the best way
> forward here.

Do our goals still include that the resulting packfile has good
delta compression and object locality?  Reachability traversal
discovers which commit comes close to which other commits to help
pack-objects arrange the resulting pack so that objects that
appear close together in history appear close together.  It also
gives each object a pathname hint to help group objects of the same
type (either blobs or trees) with like-paths together for better
deltification.

Without reachability traversal, I would imagine that it would become
quite important to keep the order in which objects appear in the
original pack, and existing delta chain, as much as possible, or
we'd be seeing a horribly inefficient pack like fast-import would
produce.

Thanks.
Junio C Hamano Jan. 29, 2021, 11:10 p.m. UTC | #13
Taylor Blau <me@ttaylorr.com> writes:

> On Fri, Jan 29, 2021 at 05:43:32PM -0500, Jeff King wrote:
>> So I think the paths forward are either:
>>
>>   - come up with an air-tight system of making sure that we know packs
>>     we claim are closed under reachability really are (perhaps some
>>     marker that says "I was generated by repack -a")
>>
>>   - have a "roll-up" mode that does not care about reachability at all,
>>     and just takes any objects from a particular set of packs (plus
>>     probably loose objects)
>>
>> I'm still thinking aloud here, and not really sure which is a better
>> path. I do feel like the failure modes for the second one are less
>> risky.
>
> The more I think about it, the more I feel that the second option is the
> right approach. It seems like if you were naïvely implementing this from
> scratch, that you'd pick the second one (i.e., have pack-objects
> understand a new input mode, and then make a pack based on that).

Yes, "roll-up" mode would be a sensible thing to have, as long as we
can keep pruning out of the picture for now.  But in the end, I do
think the "stop at any object in this frozen pack---these objects go
all the way down to the root and we know they are reachable"
optimization, which would give 'prune' a performance boost with a
small margin of false positives about reachability (i.e. we may
never be able to prune away an object in such a pack, even when it
becomes unreachable), would be a valuable thing to have in a
practical system, so from that point of view, the work done in these
patches is not lost ;-)

The efficiency issue of the resulting pack I mentioned earlier in a
separate message is there in the "roll-up" mode, though.
Taylor Blau Jan. 29, 2021, 11:28 p.m. UTC | #14
On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > So, I think that teaching pack-objects a way to understand a caller that
> > says "include objects from packs X, Y, and Z, but not if they appear in
> > packs A, B, or C, and also pull in any loose objects" is the best way
> > forward here.
>
> Do our goals still include that the resulting packfile has good
> delta compression and object locality?  Reachability traversal
> discovers which commit comes close to which other commits to help
> pack-objects arrange the resulting pack so that objects that
> appear close together in history appear close together.  It also
> gives each object a pathname hint to help group objects of the same
> type (either blobs or trees) with like-paths together for better
> deltification.

I think our goals here are a balance between having fewer packfiles
and ensuring that the packfiles we do create don't have horrible
delta compression and locality.

But now that you do mention it, I remember the reachability traversal's
bringing in object names was a reason that we decided to implement this
series using a reachability traversal in the first place.

> Without reachability traversal, I would imagine that it would become
> quite important to keep the order in which objects appear in the
> original pack, and existing delta chain, as much as possible, or
> we'd be seeing a horribly inefficient pack like fast-import would
> produce.

Yeah; we'd definitely want to feed the objects to pack-objects in the
order that they appear in the original pack. Maybe that's not that bad a
tradeoff to make, though...
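
Recovering that order from an existing pack is cheap, by the way: the
*.idx already records each object's offset, so something like this
(pack name illustrative) lists a pack's objects in pack order:

  git show-index <.git/objects/pack/pack-1234.idx |
  sort -n |
  cut -d' ' -f2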

> Thanks.

Thanks,
Taylor
Jeff King Jan. 29, 2021, 11:31 p.m. UTC | #15
On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > So, I think that teaching pack-objects a way to understand a caller that
> > says "include objects from packs X, Y, and Z, but not if they appear in
> > packs A, B, or C, and also pull in any loose objects" is the best way
> > forward here.
> 
> Do our goals still include that the resulting packfile has good
> delta compression and object locality?  Reachability traversal
> discovers which commit comes close to which other commits to help
> pack-objects arrange the resulting pack so that objects that
> appear close together in history appear close together.  It also
> gives each object a pathname hint to help group objects of the same
> type (either blobs or trees) with like-paths together for better
> deltification.
> 
> Without reachability traversal, I would imagine that it would become
> quite important to keep the order in which objects appear in the
> original pack, and existing delta chain, as much as possible, or
> we'd be seeing a horribly inefficient pack like fast-import would
> produce.

Thanks, that's another good point we discussed a while ago (off-list),
but hasn't come up in this discussion yet.

Another option here is not to roll up packs at all, but instead to use a
midx to cover them all[1]. That solves the issue where object lookup is
O(nr_packs), and you retain the same locality and delta characteristics.

But I think part of the goal is to actually improve the deltas, in two
ways:

  - we'd hopefully find new delta opportunities between objects in the
    various packs

  - we'll drop some objects that are duplicated in other packs.
    Definitely we have to avoid duplicates in the roll-up pack, but I
    think we'd want to even for objects that are in the "big" kept pack.
    These are likely bases of deltas in our roll-up pack, since the
    common cause there is --fix-thin adding them to complete the pack.
    But we really prefer to serve fetches using the ones out of the main
    pack, since they may already themselves be deltas (which makes them
    way cheaper; we can send the delta straight off the disk, rather
    than looking for a new possible base).

So I would anticipate the delta-compression phase actually trying to do
some new work. I do worry that the lack of pathname hints may make the
deltas we find much worse (or cause us to spend excessive CPU
searching for them). It's possible we could do a "best effort" traversal
where we walk new commits to find newly added pathnames, but don't
bother crossing into trees/commits that aren't in the set of objects to
be packed. It's OK to optimize for speed there, because it's just
feeding the delta heuristic, not the set of objects we'd plan to pack.

-Peff

[1] Our end-game plan is actually to _also_ use a midx to cover the
    roll-ups and the "big" pack, since we'd want to generate bitmaps for
    the new objects, too.
Taylor Blau Feb. 2, 2021, 3:04 a.m. UTC | #16
On Fri, Jan 29, 2021 at 06:28:28PM -0500, Taylor Blau wrote:
> On Fri, Jan 29, 2021 at 03:03:08PM -0800, Junio C Hamano wrote:
> > Are our goals still include that the resulting packfile has good
> > delta compression and object locality?  Reachability traversal
> > discovers which commit comes close to which other commits to help
> > pack-objects to arrange the resulting pack so that objects that
> > appear close together in history appears close together.  It also
> > gives each object a pathname hint to help group objects of the same
> > type (either blobs or trees) with like-paths together for better
> > deltification.
>
> I think our goals here are a balance between having fewer packfiles
> and ensuring that the packfiles we do create don't have horrible
> delta compression and locality.
>
> But now that you do mention it, I remember the reachability traversal's
> bringing in object names was a reason that we decided to implement this
> series using a reachability traversal in the first place.

Peff shared a very clever idea with me today. Like in the naive
approach, we fill the list of "objects to pack" with everything in the
packs that are about to get rolled up, excluding anything that appears
in the large packs.

But we do a reachability traversal whose starting points are all of the
commits in the packs that are about to be rolled up, filling in the
namehash of the objects we encounter along the way.
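
As a rough sketch, those starting points could be enumerated from a
to-be-rolled-up pack's idx (the pack name here is illustrative, and the
real implementation would do this in-process):

  git show-index <pack-roll.idx | cut -d' ' -f2 |
  git cat-file --batch-check='%(objectname) %(objecttype)' |
  awk '$2 == "commit" { print $1 }'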

Like in the original version of this series, we'll stop early once we
encounter an object in any of the frozen packs (which are marked as kept
in core), and so we might not traverse through everything. But that's
completely OK, since we know we have the right list of objects to pack
(at worst, we would have some zero'd namehashes and come up with
slightly worse deltas).

But, I think that this is a nice middle-ground (and it allows us to
reuse lots of work from the original version), so I'm quite happy.

It's in my fork [1] in the tb/geometric-repack.wip branch, but I'll try
and clean those patches up tomorrow and send a v2 to the list.

Thanks,
Taylor

[1]: https://github.com/ttaylorr/git

Patch

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 54d715ead1..cbe08e7415 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -135,6 +135,17 @@  depth is 4095.
 	leading directory (e.g. `pack-123.pack`). The option could be
 	specified multiple times to keep multiple packs.
 
+--assume-kept-packs-closed::
+	This flag causes `git rev-list` to halt the object traversal
+	when it encounters an object found in a kept pack. This is
+	unlike `--honor-pack-keep`, which only prunes unwanted
+	results after the full traversal is completed.
++
+Without any `--keep-pack=<pack-name>` arguments, only packs with an
+on-disk `*.keep` file are used when deciding where to halt the
+traversal. If other packs are artificially marked as "kept" with
+`--keep-pack`, then those are considered as well.
+
 --incremental::
 	This flag causes an object already in a pack to be ignored
 	even if it would have otherwise been packed.
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2a00358f34..a5dcd66f52 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -78,6 +78,7 @@  static int have_non_local_packs;
 static int incremental;
 static int ignore_packed_keep_on_disk;
 static int ignore_packed_keep_in_core;
+static int assume_kept_packs_closed;
 static int allow_ofs_delta;
 static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
@@ -3542,6 +3543,8 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("create packs suitable for shallow fetches")),
 		OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
 			 N_("ignore packs that have companion .keep file")),
+		OPT_BOOL(0, "assume-kept-packs-closed", &assume_kept_packs_closed,
+			 N_("assume the union of kept packs is closed under reachability")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
 				N_("ignore this pack")),
 		OPT_INTEGER(0, "compression", &pack_compression_level,
@@ -3631,6 +3634,8 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		strvec_push(&rp, "--unpacked");
 	}
+	if (assume_kept_packs_closed)
+		use_internal_rev_list = 1;
 
 	if (exclude_promisor_objects) {
 		use_internal_rev_list = 1;
@@ -3711,6 +3716,14 @@  int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!p) /* no keep-able packs found */
 			ignore_packed_keep_on_disk = 0;
 	}
+	if (assume_kept_packs_closed) {
+		if (ignore_packed_keep_on_disk && ignore_packed_keep_in_core)
+			strvec_push(&rp, "--no-kept-objects");
+		else if (ignore_packed_keep_on_disk)
+			strvec_push(&rp, "--no-kept-objects=on-disk");
+		else if (ignore_packed_keep_in_core)
+			strvec_push(&rp, "--no-kept-objects=in-core");
+	}
 	if (local) {
 		/*
 		 * unlike ignore_packed_keep_on_disk above, we do not
diff --git a/t/t6114-keep-packs.sh b/t/t6114-keep-packs.sh
index 9239d8aa46..0861305a04 100755
--- a/t/t6114-keep-packs.sh
+++ b/t/t6114-keep-packs.sh
@@ -66,4 +66,63 @@  test_expect_success '--no-kept-objects excludes kept non-MIDX object' '
 	test_cmp expect actual
 '
 
+test_expect_success '--no-kept-objects can respect only in-core keep packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names packed..kept &&
+		git rev-list --objects --no-object-names loose
+	) | sort >expect &&
+
+	git pack-objects \
+	  --assume-kept-packs-closed \
+	  --keep-pack=pack-$MISC_PACK.pack \
+	  --all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success 'setup additional --no-kept-objects tests' '
+	test_commit additional &&
+
+	ADDITIONAL_PACK=$(git pack-objects --revs .git/objects/pack/pack <<-EOF
+	refs/tags/additional
+	^refs/tags/kept
+	EOF
+	)
+'
+
+test_expect_success '--no-kept-objects can respect only on-disk keep packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names kept..additional &&
+		git rev-list --objects --no-object-names packed
+	) | sort >expect &&
+
+	git pack-objects \
+	  --assume-kept-packs-closed \
+	  --honor-pack-keep \
+	  --all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
+test_expect_success '--no-kept-objects can respect mixed kept packs' '
+	test_when_finished "rm -fr actual-*.idx actual-*.pack" &&
+	(
+		git rev-list --objects --no-object-names kept..additional &&
+		git rev-list --objects --no-object-names loose
+	) | sort >expect &&
+
+	git pack-objects \
+	  --assume-kept-packs-closed \
+	  --honor-pack-keep \
+	  --keep-pack=pack-$MISC_PACK.pack \
+	  --all actual </dev/null &&
+	idx_objects actual-*.idx >actual &&
+
+	test_cmp expect actual
+'
+
 test_done