diff mbox series

[v2,07/19] midx: introduce `bsearch_one_midx()`

Message ID bfd1dadbf15cf735392ca15b52834f104cbd6538.1721250704.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: incremental multi-pack indexes, part one | expand

Commit Message

Taylor Blau July 17, 2024, 9:12 p.m. UTC
The `bsearch_midx()` function will be extended in a following commit to
search for the location of a given object ID across all MIDXs in a chain
(or the single non-chain MIDX if no chain is available).

While most callers will naturally want to use the updated
`bsearch_midx()` function, there are a handful of special cases that
will want finer control and will only want to search through a single
MIDX.

For instance, the object abbreviation code, which cares about object IDs
near to where we'd expect to find a match in a MIDX. In that case, we
want to look at the nearby matches in each layer of the MIDX chain, not
just a single one).

Split the more fine-grained control out into a separate function called
`bsearch_one_midx()` which searches only a single MIDX.

At present both `bsearch_midx()` and `bsearch_one_midx()` have identical
behavior, but the following commit will rewrite the former to be aware
of incremental MIDXs for the remaining non-special case callers.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c        | 17 +++++++--
 midx.h        |  5 ++-
 object-name.c | 99 +++++++++++++++++++++++++++------------------------
 3 files changed, 71 insertions(+), 50 deletions(-)

Comments

Jeff King Aug. 1, 2024, 10:06 a.m. UTC | #1
On Wed, Jul 17, 2024 at 05:12:16PM -0400, Taylor Blau wrote:

> The `bsearch_midx()` function will be extended in a following commit to
> search for the location of a given object ID across all MIDXs in a chain
> (or the single non-chain MIDX if no chain is available).
> 
> While most callers will naturally want to use the updated
> `bsearch_midx()` function, there are a handful of special cases that
> will want finer control and will only want to search through a single
> MIDX.
> 
> For instance, the object abbreviation code, which cares about object IDs
> near to where we'd expect to find a match in a MIDX. In that case, we
> want to look at the nearby matches in each layer of the MIDX chain, not
> just a single one).

Hmm. That seems like a weird thing for the object abbreviation code to
want, just because the layers of the midx are essentially random with
respect to object names. So you have to search each layer individually
when looking for a contiguous segment of hash names. But maybe that's
why you want the one_midx() variant. Let's see...

>  static void unique_in_midx(struct multi_pack_index *m,
>  			   struct disambiguate_state *ds)
>  {
> -	uint32_t num, i, first = 0;
> -	const struct object_id *current = NULL;
> -	int len = ds->len > ds->repo->hash_algo->hexsz ?
> -		ds->repo->hash_algo->hexsz : ds->len;
> -	num = m->num_objects;
> +	for (; m; m = m->base_midx) {
> +		uint32_t num, i, first = 0;
> +		const struct object_id *current = NULL;
> +		int len = ds->len > ds->repo->hash_algo->hexsz ?
> +			ds->repo->hash_algo->hexsz : ds->len;
>  
> -	if (!num)
> -		return;
> +		num = m->num_objects + m->num_objects_in_base;
>  
> -	bsearch_midx(&ds->bin_pfx, m, &first);
> +		if (!num)
> +			continue;
>  
> -	/*
> -	 * At this point, "first" is the location of the lowest object
> -	 * with an object name that could match "bin_pfx".  See if we have
> -	 * 0, 1 or more objects that actually match(es).
> -	 */
> -	for (i = first; i < num && !ds->ambiguous; i++) {
> -		struct object_id oid;
> -		current = nth_midxed_object_oid(&oid, m, i);
> -		if (!match_hash(len, ds->bin_pfx.hash, current->hash))
> -			break;
> -		update_candidates(ds, current);
> +		bsearch_one_midx(&ds->bin_pfx, m, &first);
> +
> +		/*
> +		 * At this point, "first" is the location of the lowest
> +		 * object with an object name that could match
> +		 * "bin_pfx".  See if we have 0, 1 or more objects that
> +		 * actually match(es).
> +		 */
> +		for (i = first; i < num && !ds->ambiguous; i++) {
> +			struct object_id oid;
> +			current = nth_midxed_object_oid(&oid, m, i);
> +			if (!match_hash(len, ds->bin_pfx.hash, current->hash))
> +				break;
> +			update_candidates(ds, current);
> +		}
>  	}
>  }

This is much easier to read with "-w", of course. So yeah, the gist of
it is that we're going to loop over items in the chain via the base_midx
pointer, and then search each individually. So that makes sense.

One thing that confused me, though, is setting "num". From the "-w"
diff:

  -       num = m->num_objects;
  +
  +               num = m->num_objects + m->num_objects_in_base;
  
                  if (!num)
  -               return;
  +                       continue;

Before we only had one midx, so that was our limit. But now we are
looking at "num" as a limit in the global size of the chained midx.
Which feels weird, since we're just considering a single layer here. We
seem to use "num" in two ways:

  - we return if it's 0 (or now continue to the next layer). But
    wouldn't we want to do that per-layer? I don't think it will produce
    wrong answers, but we're less likely to kick in this early return
    (though it's not clear to me when it would ever kick in really; a
    zero-length midx?).

  - later we loop "i" from "first", using "num" as a boundary. But this
    "i" is a global position, since that's what bsearch_one_midx()
    returns, and what nth_midxed_object_oid() expects.

    So I think it's correct, though it feels like bsearch_one_midx()
    should still return the position within that midx (and then
    bsearch_midx() could add back m->num_objects_in_base to get a global
    position). And then I guess likewise there would need to be a
    midx-local version of nth_midxed_object_oid().

    I'm not sure if that would make things simpler, or just add to the
    confusion, though. It's easy to get the global/local functions mixed
    up, since of course the global ones also have to take a "struct
    multi_pack_index". We could make them take a
    "multi_pack_index_chain" instead, but now all of the other code
    which wants to treat chains and single midx files the same has to
    care about the distinction (probably by wrapping the single midx in
    a one-entry chain struct).


> @@ -708,37 +712,40 @@ static int repo_extend_abbrev_len(struct repository *r UNUSED,
>  static void find_abbrev_len_for_midx(struct multi_pack_index *m,
>  				     struct min_abbrev_data *mad)

And likewise here (again, much more readable with "-w"). Interestingly,
in this one...

>  {
> -	int match = 0;
> -	uint32_t num, first = 0;
> -	struct object_id oid;
> -	const struct object_id *mad_oid;
> +	for (; m; m = m->base_midx) {
> +		int match = 0;
> +		uint32_t num, first = 0;
> +		struct object_id oid;
> +		const struct object_id *mad_oid;
>  
> -	if (!m->num_objects)
> -		return;
> +		if (!m->num_objects)
> +			continue;

We do the early return/continue directly on the layer's m->num_objects,
which makes sense.

> -	num = m->num_objects;
> -	mad_oid = mad->oid;
> -	match = bsearch_midx(mad_oid, m, &first);
> +		num = m->num_objects + m->num_objects_in_base;
> +		mad_oid = mad->oid;
> +		match = bsearch_one_midx(mad_oid, m, &first);

But then of course we go back to the same global "num", as we must.

So I think it's all correct, minus the early continue on "!num" in the
first function.

-Peff
Taylor Blau Aug. 1, 2024, 7:54 p.m. UTC | #2
On Thu, Aug 01, 2024 at 06:06:35AM -0400, Jeff King wrote:
> One thing that confused me, though, is setting "num". From the "-w"
> diff:
>
>   -       num = m->num_objects;
>   +
>   +               num = m->num_objects + m->num_objects_in_base;
>
>                   if (!num)
>   -               return;
>   +                       continue;
>
> Before we only had one midx, so that was our limit. But now we are
> looking at "num" as a limit in the global size of the chained midx.
> Which feels weird, since we're just considering a single layer here. We
> seem to use "num" in two ways:
>
>   - we return if it's 0 (or now continue to the next layer). But
>     wouldn't we want to do that per-layer? I don't think it will produce
>     wrong answers, but we're less likely to kick in this early return
>     (though it's not clear to me when it would ever kick in really; a
>     zero-length midx?).

This is definitely a bug. We should certainly do something like:

    for (; m; m = m->base_midx) {
            uint32_t num;
            if (!m->num_objects)
                    continue;

            num = m->num_objects + m->num_objects_in_base;
            /* ... */
    }

I'll go ahead and fix this one up locally, which is easy enough to do.

>     So I think it's correct, though it feels like bsearch_one_midx()
>     should still return the position within that midx (and then
>     bsearch_midx() could add back m->num_objects_in_base to get a global
>     position). And then I guess likewise there would need to be a
>     midx-local version of nth_midxed_object_oid().

Like many of the other changes in this series, it's really a matter of
where you put the complexity: either it's in the callers or in the
function itself.

I think here I prefer having bsearch_one_midx() return the global
position, since it is directly usable in other top-level functions
within the MIDX API, like being able to pass it directly to
nth_midxed_object_oid() and etc. below.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index b6c3cd3e59..bb3fa43492 100644
--- a/midx.c
+++ b/midx.c
@@ -329,10 +329,21 @@  int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
 	return 0;
 }
 
-int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result)
+int bsearch_one_midx(const struct object_id *oid, struct multi_pack_index *m,
+		     uint32_t *result)
 {
-	return bsearch_hash(oid->hash, m->chunk_oid_fanout, m->chunk_oid_lookup,
-			    the_hash_algo->rawsz, result);
+	int ret = bsearch_hash(oid->hash, m->chunk_oid_fanout,
+			       m->chunk_oid_lookup, the_hash_algo->rawsz,
+			       result);
+	if (result)
+		*result += m->num_objects_in_base;
+	return ret;
+}
+
+int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m,
+		 uint32_t *result)
+{
+		return bsearch_one_midx(oid, m, result);
 }
 
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
diff --git a/midx.h b/midx.h
index 020e49f77c..46c53d69ff 100644
--- a/midx.h
+++ b/midx.h
@@ -90,7 +90,10 @@  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int nth_bitmapped_pack(struct repository *r, struct multi_pack_index *m,
 		       struct bitmapped_pack *bp, uint32_t pack_int_id);
-int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
+int bsearch_one_midx(const struct object_id *oid, struct multi_pack_index *m,
+		     uint32_t *result);
+int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m,
+		 uint32_t *result);
 off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos);
 uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos);
 struct object_id *nth_midxed_object_oid(struct object_id *oid,
diff --git a/object-name.c b/object-name.c
index 527b853ac4..e23b3e695a 100644
--- a/object-name.c
+++ b/object-name.c
@@ -134,28 +134,32 @@  static int match_hash(unsigned len, const unsigned char *a, const unsigned char
 static void unique_in_midx(struct multi_pack_index *m,
 			   struct disambiguate_state *ds)
 {
-	uint32_t num, i, first = 0;
-	const struct object_id *current = NULL;
-	int len = ds->len > ds->repo->hash_algo->hexsz ?
-		ds->repo->hash_algo->hexsz : ds->len;
-	num = m->num_objects;
+	for (; m; m = m->base_midx) {
+		uint32_t num, i, first = 0;
+		const struct object_id *current = NULL;
+		int len = ds->len > ds->repo->hash_algo->hexsz ?
+			ds->repo->hash_algo->hexsz : ds->len;
 
-	if (!num)
-		return;
+		num = m->num_objects + m->num_objects_in_base;
 
-	bsearch_midx(&ds->bin_pfx, m, &first);
+		if (!num)
+			continue;
 
-	/*
-	 * At this point, "first" is the location of the lowest object
-	 * with an object name that could match "bin_pfx".  See if we have
-	 * 0, 1 or more objects that actually match(es).
-	 */
-	for (i = first; i < num && !ds->ambiguous; i++) {
-		struct object_id oid;
-		current = nth_midxed_object_oid(&oid, m, i);
-		if (!match_hash(len, ds->bin_pfx.hash, current->hash))
-			break;
-		update_candidates(ds, current);
+		bsearch_one_midx(&ds->bin_pfx, m, &first);
+
+		/*
+		 * At this point, "first" is the location of the lowest
+		 * object with an object name that could match
+		 * "bin_pfx".  See if we have 0, 1 or more objects that
+		 * actually match(es).
+		 */
+		for (i = first; i < num && !ds->ambiguous; i++) {
+			struct object_id oid;
+			current = nth_midxed_object_oid(&oid, m, i);
+			if (!match_hash(len, ds->bin_pfx.hash, current->hash))
+				break;
+			update_candidates(ds, current);
+		}
 	}
 }
 
@@ -708,37 +712,40 @@  static int repo_extend_abbrev_len(struct repository *r UNUSED,
 static void find_abbrev_len_for_midx(struct multi_pack_index *m,
 				     struct min_abbrev_data *mad)
 {
-	int match = 0;
-	uint32_t num, first = 0;
-	struct object_id oid;
-	const struct object_id *mad_oid;
+	for (; m; m = m->base_midx) {
+		int match = 0;
+		uint32_t num, first = 0;
+		struct object_id oid;
+		const struct object_id *mad_oid;
 
-	if (!m->num_objects)
-		return;
+		if (!m->num_objects)
+			continue;
 
-	num = m->num_objects;
-	mad_oid = mad->oid;
-	match = bsearch_midx(mad_oid, m, &first);
+		num = m->num_objects + m->num_objects_in_base;
+		mad_oid = mad->oid;
+		match = bsearch_one_midx(mad_oid, m, &first);
 
-	/*
-	 * first is now the position in the packfile where we would insert
-	 * mad->hash if it does not exist (or the position of mad->hash if
-	 * it does exist). Hence, we consider a maximum of two objects
-	 * nearby for the abbreviation length.
-	 */
-	mad->init_len = 0;
-	if (!match) {
-		if (nth_midxed_object_oid(&oid, m, first))
-			extend_abbrev_len(&oid, mad);
-	} else if (first < num - 1) {
-		if (nth_midxed_object_oid(&oid, m, first + 1))
-			extend_abbrev_len(&oid, mad);
+		/*
+		 * first is now the position in the packfile where we
+		 * would insert mad->hash if it does not exist (or the
+		 * position of mad->hash if it does exist). Hence, we
+		 * consider a maximum of two objects nearby for the
+		 * abbreviation length.
+		 */
+		mad->init_len = 0;
+		if (!match) {
+			if (nth_midxed_object_oid(&oid, m, first))
+				extend_abbrev_len(&oid, mad);
+		} else if (first < num - 1) {
+			if (nth_midxed_object_oid(&oid, m, first + 1))
+				extend_abbrev_len(&oid, mad);
+		}
+		if (first > 0) {
+			if (nth_midxed_object_oid(&oid, m, first - 1))
+				extend_abbrev_len(&oid, mad);
+		}
+		mad->init_len = mad->cur_len;
 	}
-	if (first > 0) {
-		if (nth_midxed_object_oid(&oid, m, first - 1))
-			extend_abbrev_len(&oid, mad);
-	}
-	mad->init_len = mad->cur_len;
 }
 
 static void find_abbrev_len_for_pack(struct packed_git *p,