diff mbox series

[01/10] packfile: introduce 'find_kept_pack_entry()'

Message ID dc7fa4c7a61f657e779e10385d3e8076d6dac36c.1611098616.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: support repacking into a geometric sequence | expand

Commit Message

Taylor Blau Jan. 19, 2021, 11:24 p.m. UTC
Future callers will want a function to fill a 'struct pack_entry' for a
given object id but _only_ from its position in any kept pack(s). They
could accomplish this by calling 'find_pack_entry()' and checking
whether the found pack is kept or not, but this is insufficient, since
there may be duplicate objects (and the mru cache makes it unpredictable
which variant we'll get).

Teach this new function to treat the two different kinds of kept packs
(on disk ones with .keep files, as well as in-core ones which are set by
manually poking the 'pack_keep_in_core' bit) separately. This will
become important for callers that only want to respect a certain kind of
kept pack.

Introduce 'find_kept_pack_entry()' which behaves like
'find_pack_entry()', except that it skips over packs which are not
marked kept. Callers will be added in subsequent patches.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 packfile.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 packfile.h |  6 +++++
 2 files changed, 65 insertions(+), 5 deletions(-)

Comments

Derrick Stolee Jan. 20, 2021, 1:40 p.m. UTC | #1
On 1/19/2021 6:24 PM, Taylor Blau wrote:
>  	for (m = r->objects->multi_pack_index; m; m = m->next) {
> -		if (fill_midx_entry(r, oid, e, m))
> +		if (!(fill_midx_entry(r, oid, e, m)))

nit: we don't need extra parens around fill_midx_entry().

> -		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
> -			list_move(&p->mru, &r->objects->packed_git_mru);
> -			return 1;
> +		if (p->multi_pack_index && !kept_only) {
> +			/*
> +			 * If this pack is covered by the MIDX, we'd have found
> +			 * the object already in the loop above if it was here,
> +			 * so don't bother looking.
> +			 *
> +			 * The exception is if we are looking only at kept
> +			 * packs. An object can be present in two packs covered
> +			 * by the MIDX, one kept and one not-kept. And as the
> +			 * MIDX points to only one copy of each object, it might
> +			 * have returned only the non-kept version above. We
> +			 * have to check again to be thorough.
> +			 */
> +			continue;
> +		}
> +		if (!kept_only ||
> +		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
> +		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
> +			if (fill_pack_entry(oid, e, p)) {
> +				list_move(&p->mru, &r->objects->packed_git_mru);
> +				return 1;
> +			}

Here is the meat of your patch. The comment helps a lot.

This might have been easier if the MIDX had preferred kept packs
over non-kept packs (before sorting by modified time). Perhaps
the MIDX could get an extra field to say "I preferred kept packs"
which would let us trust the MIDX return here without the pack
loop.

(Note: we can't just change the MIDX selection and then start
trusting all MIDXs to have the right tie-breakers because of
existing files in the wild.)

Thanks,
-Stolee
Taylor Blau Jan. 20, 2021, 2:38 p.m. UTC | #2
On Wed, Jan 20, 2021 at 08:40:22AM -0500, Derrick Stolee wrote:
> On 1/19/2021 6:24 PM, Taylor Blau wrote:
> >  	for (m = r->objects->multi_pack_index; m; m = m->next) {
> > -		if (fill_midx_entry(r, oid, e, m))
> > +		if (!(fill_midx_entry(r, oid, e, m)))
>
> nit: we don't need extra parens around fill_midx_entry().

Yep. I checked whether we should have written this as "if
(fill_midx_entry(...) < 0)", but fill_midx_entry returns a positive
number on error, so checking "!fill_midx_entry" is certainly what we
should be doing.

> > -		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
> > -			list_move(&p->mru, &r->objects->packed_git_mru);
> > -			return 1;
> > +		if (p->multi_pack_index && !kept_only) {
> > +			/*
> > +			 * If this pack is covered by the MIDX, we'd have found
> > +			 * the object already in the loop above if it was here,
> > +			 * so don't bother looking.
> > +			 *
> > +			 * The exception is if we are looking only at kept
> > +			 * packs. An object can be present in two packs covered
> > +			 * by the MIDX, one kept and one not-kept. And as the
> > +			 * MIDX points to only one copy of each object, it might
> > +			 * have returned only the non-kept version above. We
> > +			 * have to check again to be thorough.
> > +			 */
> > +			continue;
> > +		}
> > +		if (!kept_only ||
> > +		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
> > +		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
> > +			if (fill_pack_entry(oid, e, p)) {
> > +				list_move(&p->mru, &r->objects->packed_git_mru);
> > +				return 1;
> > +			}
>
> Here is the meat of your patch. The comment helps a lot.
>
> This might have been easier if the MIDX had preferred kept packs
> over non-kept packs (before sorting by modified time). Perhaps
> the MIDX could get an extra field to say "I preferred kept packs"
> which would let us trust the MIDX return here without the pack
> loop.
>
> (Note: we can't just change the MIDX selection and then start
> trusting all MIDXs to have the right tie-breakers because of
> existing files in the wild.)

Yeah, that is what makes it tricky. Changing the code isn't so hard: a
new field that we check and do one of two things when we're breaking
ties.

But I think the cognitive load is high, and I'm not sure that the
benefit (skipping another linear pass through non-MIDX'd packs when
looking up an object in kept packs only _and_ that object is duplicated)
is worth the extra hassle with the MIDX code.

All of that said, I do think that it's worth revisiting this and giving
it some more thought after multi-pack bitmaps to see whether we feel the
same or not.

Thanks,
Taylor
Junio C Hamano Jan. 29, 2021, 2:33 a.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:

> Future callers will want a function to fill a 'struct pack_entry' for a
> given object id but _only_ from its position in any kept pack(s). They
> could accomplish this by calling 'find_pack_entry()' and checking
> whether the found pack is kept or not, but this is insufficient, since
> there may be duplicate objects (and the mru cache makes it unpredictable
> which variant we'll get).

I wonder if we eventually need a callback interface to walk _all_
pack entries for a given object, so that "I am only interested in
instances in kept packs" will be under total control of the callers.
As it stands, it is "just grab any one that is in a kept pack, any
one of them is fine", which is almost just of as narrow utility as
the original's "just grab the first one---any one of them is fine",
the latter of which is "insufficient" as the log message says.

But this (in the context of the remainder of the series) might be
sufficient, at least for now.

> Teach this new function to treat the two different kinds of kept packs
> (on disk ones with .keep files, as well as in-core ones which are set by
> manually poking the 'pack_keep_in_core' bit) separately. This will
> become important for callers that only want to respect a certain kind of
> kept pack.

Or maybe not ;-)

If there are notable relationship between on-disk and in-core kept
packs (e.g. "the set of on-disk kept packs is a subset of in-core
kept packs", "usually on-disk kept packs get in-core kept bit upon
their packed_git instances are populated, but we can drop the bit at
runtime, so on-disk and in-core are pretty much independent and
there is no notable relationship"), it must be explained upfront to
help the reader form a sensible world view.

> Introduce 'find_kept_pack_entry()' which behaves like
> 'find_pack_entry()', except that it skips over packs which are not
> marked kept. Callers will be added in subsequent patches.
>
> Co-authored-by: Jeff King <peff@peff.net>
> Signed-off-by: Jeff King <peff@peff.net>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>

Fun.

Thanks.
Taylor Blau Jan. 29, 2021, 6:38 p.m. UTC | #4
On Thu, Jan 28, 2021 at 06:33:10PM -0800, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > Future callers will want a function to fill a 'struct pack_entry' for a
> > given object id but _only_ from its position in any kept pack(s). They
> > could accomplish this by calling 'find_pack_entry()' and checking
> > whether the found pack is kept or not, but this is insufficient, since
> > there may be duplicate objects (and the mru cache makes it unpredictable
> > which variant we'll get).
>
> I wonder if we eventually need a callback interface to walk _all_
> pack entries for a given object, so that "I am only interested in
> instances in kept packs" will be under total control of the callers.
> As it stands, it is "just grab any one that is in a kept pack, any
> one of them is fine", which is almost just of as narrow utility as
> the original's "just grab the first one---any one of them is fine",
> the latter of which is "insufficient" as the log message says.
>
> But this (in the context of the remainder of the series) might be
> sufficient, at least for now.

As you note, it's more about "can I find this object in any kept pack
(of a certain kind)" versus, "show me this object in a pack" (and hope
that if it appears in a kept pack, that that's the copy that is picked).

> > Teach this new function to treat the two different kinds of kept packs
> > (on disk ones with .keep files, as well as in-core ones which are set by
> > manually poking the 'pack_keep_in_core' bit) separately. This will
> > become important for callers that only want to respect a certain kind of
> > kept pack.
>
> Or maybe not ;-)

:-). The difference here is that we will only want to stop the traversal
at packs which are considered to be stable from the perspective of a
geometric repack.

We mark those packs as "stable" by setting their in-core kept bit, but
we don't write ".keep" files (which would make them on-disk kept). The
latter is up to the user, not us.

> If there are notable relationship between on-disk and in-core kept
> packs (e.g. "the set of on-disk kept packs is a subset of in-core
> kept packs", "usually on-disk kept packs get in-core kept bit upon
> their packed_git instances are populated, but we can drop the bit at
> runtime, so on-disk and in-core are pretty much independent and
> there is no notable relationship"), it must be explained upfront to
> help the reader form a sensible world view.

Unfortunately, I don't think that there is a sensible world-view here
to be formed. Honestly, the distinction between .keep packs and in-core
kept packs is incredibly narrow, and I find our separate handling of
them awkward and error-prone.

But, it is sort of what you'd want here (i.e., a way to mark all objects
in a pack as ignored without actually writing the physical file that
says "ignore all objects in this pack").

Thanks,
Taylor
Jeff King Jan. 29, 2021, 7:31 p.m. UTC | #5
On Thu, Jan 28, 2021 at 06:33:10PM -0800, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > Future callers will want a function to fill a 'struct pack_entry' for a
> > given object id but _only_ from its position in any kept pack(s). They
> > could accomplish this by calling 'find_pack_entry()' and checking
> > whether the found pack is kept or not, but this is insufficient, since
> > there may be duplicate objects (and the mru cache makes it unpredictable
> > which variant we'll get).
> 
> I wonder if we eventually need a callback interface to walk _all_
> pack entries for a given object, so that "I am only interested in
> instances in kept packs" will be under total control of the callers.
> As it stands, it is "just grab any one that is in a kept pack, any
> one of them is fine", which is almost just of as narrow utility as
> the original's "just grab the first one---any one of them is fine",
> the latter of which is "insufficient" as the log message says.

We do that already in pack-objects, and that's the problem: it's really
slow. So if you have few kept packs, but a lot of other ones, you'd like
to pre-split the packs into two lists, and not bother walking the one
you know won't turn up interesting results.

I think the commit message here doesn't emphasize that reasoning enough.
It talks about using "find_pack_entry()", and that is definitely not
sufficient for our purposes. But the interesting part is replacing the
existing "walk all packs and see if any were kept" logic, which happens
in patch 6.

So the more compelling argument, I think, is something like:

  - you sometimes want to know if object X is any kept packs

  - you can't use find_pack_entry(), because it only gives you the first
    pack it finds

  - you can walk over all packs and look for the object in each.
    pack-objects does this. But it's slow, because you are looking in
    packs you don't care about.

  - so it's helpful for the lookup to know up front which packs are
    interesting to find objects in and which are not, to avoid looking
    in the uninteresting ones

-Peff
Junio C Hamano Jan. 29, 2021, 8:20 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> So the more compelling argument, I think, is something like:
>
>   - you sometimes want to know if object X is any kept packs
>
>   - you can't use find_pack_entry(), because it only gives you the first
>     pack it finds
>
>   - you can walk over all packs and look for the object in each.
>     pack-objects does this. But it's slow, because you are looking in
>     packs you don't care about.
>
>   - so it's helpful for the lookup to know up front which packs are
>     interesting to find objects in and which are not, to avoid looking
>     in the uninteresting ones

That does make sense.
diff mbox series

Patch

diff --git a/packfile.c b/packfile.c
index 62d92e0c7c..30f43a1a35 100644
--- a/packfile.c
+++ b/packfile.c
@@ -2015,7 +2015,10 @@  static int fill_pack_entry(const struct object_id *oid,
 	return 1;
 }
 
-int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+static int find_one_pack_entry(struct repository *r,
+			       const struct object_id *oid,
+			       struct pack_entry *e,
+			       int kept_only)
 {
 	struct list_head *pos;
 	struct multi_pack_index *m;
@@ -2025,26 +2028,77 @@  int find_pack_entry(struct repository *r, const struct object_id *oid, struct pa
 		return 0;
 
 	for (m = r->objects->multi_pack_index; m; m = m->next) {
-		if (fill_midx_entry(r, oid, e, m))
+		if (!(fill_midx_entry(r, oid, e, m)))
+			continue;
+
+		if (!kept_only)
+			return 1;
+
+		if (((kept_only & ON_DISK_KEEP_PACKS) && e->p->pack_keep) ||
+		    ((kept_only & IN_CORE_KEEP_PACKS) && e->p->pack_keep_in_core))
 			return 1;
 	}
 
 	list_for_each(pos, &r->objects->packed_git_mru) {
 		struct packed_git *p = list_entry(pos, struct packed_git, mru);
-		if (!p->multi_pack_index && fill_pack_entry(oid, e, p)) {
-			list_move(&p->mru, &r->objects->packed_git_mru);
-			return 1;
+		if (p->multi_pack_index && !kept_only) {
+			/*
+			 * If this pack is covered by the MIDX, we'd have found
+			 * the object already in the loop above if it was here,
+			 * so don't bother looking.
+			 *
+			 * The exception is if we are looking only at kept
+			 * packs. An object can be present in two packs covered
+			 * by the MIDX, one kept and one not-kept. And as the
+			 * MIDX points to only one copy of each object, it might
+			 * have returned only the non-kept version above. We
+			 * have to check again to be thorough.
+			 */
+			continue;
+		}
+		if (!kept_only ||
+		    (((kept_only & ON_DISK_KEEP_PACKS) && p->pack_keep) ||
+		     ((kept_only & IN_CORE_KEEP_PACKS) && p->pack_keep_in_core))) {
+			if (fill_pack_entry(oid, e, p)) {
+				list_move(&p->mru, &r->objects->packed_git_mru);
+				return 1;
+			}
 		}
 	}
 	return 0;
 }
 
+int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e)
+{
+	return find_one_pack_entry(r, oid, e, 0);
+}
+
+int find_kept_pack_entry(struct repository *r,
+			 const struct object_id *oid,
+			 unsigned flags,
+			 struct pack_entry *e)
+{
+	/*
+	 * Load all packs, including midx packs, since our "kept" strategy
+	 * relies on that. We're relying on the side effect of it setting up
+	 * r->objects->packed_git, which is a little ugly.
+	 */
+	get_all_packs(r);
+	return find_one_pack_entry(r, oid, e, flags);
+}
+
 int has_object_pack(const struct object_id *oid)
 {
 	struct pack_entry e;
 	return find_pack_entry(the_repository, oid, &e);
 }
 
+int has_object_kept_pack(const struct object_id *oid, unsigned flags)
+{
+	struct pack_entry e;
+	return find_kept_pack_entry(the_repository, oid, flags, &e);
+}
+
 int has_pack_index(const unsigned char *sha1)
 {
 	struct stat st;
diff --git a/packfile.h b/packfile.h
index a58fc738e0..624327f64d 100644
--- a/packfile.h
+++ b/packfile.h
@@ -161,13 +161,19 @@  int packed_object_info(struct repository *r,
 void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
 const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
 
+#define ON_DISK_KEEP_PACKS 1
+#define IN_CORE_KEEP_PACKS 2
+#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
+
 /*
  * Iff a pack file in the given repository contains the object named by sha1,
  * return true and store its location to e.
  */
 int find_pack_entry(struct repository *r, const struct object_id *oid, struct pack_entry *e);
+int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsigned flags, struct pack_entry *e);
 
 int has_object_pack(const struct object_id *oid);
+int has_object_kept_pack(const struct object_id *oid, unsigned flags);
 
 int has_pack_index(const unsigned char *sha1);