diff mbox series

[v2,03/19] midx: teach `nth_midxed_pack_int_id()` about incremental MIDXs

Message ID f449a72877de97d89e31a56724b4f65be2f33f20.1721250704.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: incremental multi-pack indexes, part one | expand

Commit Message

Taylor Blau July 17, 2024, 9:12 p.m. UTC
The function `nth_midxed_pack_int_id()` takes in a object position in
MIDX lexicographic order and returns an identifier of the pack from
which that object was selected in the MIDX.

Currently, the given object position is an index into the lexicographic
order of objects in a single MIDX. Change this position to instead refer
into the concatenated lexicographic order of all MIDXs in a MIDX chain.

This has two visible effects within the implementation of
`prepare_midx_pack()`:

  - First, the given position is now an index into the concatenated
    lexicographic order of all MIDXs in the order in which they appear
    in the MIDX chain.

  - Second the pack ID returned from this function is now also in the
    concatenated order of packs among all layers of the MIDX chain in
    the same order that they appear in the MIDX chain.

To do this, introduce the first of two general purpose helpers, this one
being `midx_for_object()`. `midx_for_object()` takes a double pointer to
a `struct multi_pack_index` as well as an object `pos` in terms of the
entire MIDX chain[^1].

The function chases down the '->base_midx' field until it finds the MIDX
layer within the chain that contains the given object. It then:

  - modifies the double pointer to point to the containing MIDX, instead
    of the tip of the chain, and

  - returns the MIDX-local position[^2] at which the given object can be
    found.

Use this function within `nth_midxed_pack_int_id()` so that the `pos` it
expects is now relative to the entire MIDX chain, and that it returns
the appropriate pack position for that object.

[^1]: As a reminder, this means that the object is identified among the
  objects contained in all layers of the incremental MIDX chain, not any
  particular layer. For example, consider MIDX chain with two individual
  MIDXs, one with 4 objects and another with 3 objects. If the MIDX with
  4 objects appears earlier in the chain, then asking for pack "6" would
  return the second object in the MIDX with 3 objects.

[^2]: Building on the previous example, asking for object 6 in a MIDX
  chain with (4, 3) objects, respectively, this would set the double
  pointer to point at the MIDX containing three objects, and would
  return an index to the second object within that MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

Comments

Jeff King Aug. 1, 2024, 9:30 a.m. UTC | #1
On Wed, Jul 17, 2024 at 05:12:04PM -0400, Taylor Blau wrote:

> [^1]: As a reminder, this means that the object is identified among the
>   objects contained in all layers of the incremental MIDX chain, not any
>   particular layer. For example, consider MIDX chain with two individual
>   MIDXs, one with 4 objects and another with 3 objects. If the MIDX with
>   4 objects appears earlier in the chain, then asking for pack "6" would
>   return the second object in the MIDX with 3 objects.

I think this is "object 6" in the final sentence?

Otherwise, the explanation lays things out pretty well. Let's look at
the code.

> +static uint32_t midx_for_object(struct multi_pack_index **_m, uint32_t pos)
> +{
> +	struct multi_pack_index *m = *_m;
> +	while (m && pos < m->num_objects_in_base)
> +		m = m->base_midx;

OK, so given a global position, we walk backwards until we find the
correct midx...

> +	if (!m)
> +		BUG("NULL multi-pack-index for object position: %"PRIu32, pos);
> +
> +	if (pos >= m->num_objects + m->num_objects_in_base)
> +		die(_("invalid MIDX object position, MIDX is likely corrupt"));

...and we double check that the given base claims to have that position.
Seems obvious.

> +	*_m = m;
> +
> +	return pos - m->num_objects_in_base;

And then we adjust it into a per-midx position.

> @@ -334,8 +351,10 @@ off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
>  
>  uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
>  {
> -	return get_be32(m->chunk_object_offsets +
> -			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
> +	pos = midx_for_object(&m, pos);
> +
> +	return m->num_packs_in_base + get_be32(m->chunk_object_offsets +
> +					       (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
>  }

OK, so now this function translates a global position into a local one,
and then we get the pack id for the local midx/pos, and then turn it
back into a global pack id.

That all makes sense, but you definitely have to read carefully to make
sure which positions/ids are global within the chain and which are local
to a midx.

I wonder if the type system can help us annotate them, but I suspect it
becomes awkward. Just typedef-ing them to uint32_t means the compiler
won't warn us when we use one in the wrong spot. Sticking them in
structs would solve that, but then using them is painful. Let's keep
reading and see if it's even an issue in practice.

-Peff
Taylor Blau Aug. 1, 2024, 6:57 p.m. UTC | #2
On Thu, Aug 01, 2024 at 05:30:06AM -0400, Jeff King wrote:
> On Wed, Jul 17, 2024 at 05:12:04PM -0400, Taylor Blau wrote:
>
> > [^1]: As a reminder, this means that the object is identified among the
> >   objects contained in all layers of the incremental MIDX chain, not any
> >   particular layer. For example, consider MIDX chain with two individual
> >   MIDXs, one with 4 objects and another with 3 objects. If the MIDX with
> >   4 objects appears earlier in the chain, then asking for pack "6" would
> >   return the second object in the MIDX with 3 objects.
>
> I think this is "object 6" in the final sentence?

Oops, yes. Thanks for spotting, this was as easy as s/pack/object on the
second to last line in the paragraph quoted above.

> OK, so now this function translates a global position into a local one,
> and then we get the pack id for the local midx/pos, and then turn it
> back into a global pack id.
>
> That all makes sense, but you definitely have to read carefully to make
> sure which positions/ids are global within the chain and which are local
> to a midx.

Exactly.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index 3992b05465..39d358da20 100644
--- a/midx.c
+++ b/midx.c
@@ -242,6 +242,23 @@  void close_midx(struct multi_pack_index *m)
 	free(m);
 }
 
+static uint32_t midx_for_object(struct multi_pack_index **_m, uint32_t pos)
+{
+	struct multi_pack_index *m = *_m;
+	while (m && pos < m->num_objects_in_base)
+		m = m->base_midx;
+
+	if (!m)
+		BUG("NULL multi-pack-index for object position: %"PRIu32, pos);
+
+	if (pos >= m->num_objects + m->num_objects_in_base)
+		die(_("invalid MIDX object position, MIDX is likely corrupt"));
+
+	*_m = m;
+
+	return pos - m->num_objects_in_base;
+}
+
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id)
 {
 	struct strbuf pack_name = STRBUF_INIT;
@@ -334,8 +351,10 @@  off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 
 uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos)
 {
-	return get_be32(m->chunk_object_offsets +
-			(off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
+	pos = midx_for_object(&m, pos);
+
+	return m->num_packs_in_base + get_be32(m->chunk_object_offsets +
+					       (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH);
 }
 
 int fill_midx_entry(struct repository *r,