diff mbox series

[v2,02/19] midx: add new fields for incremental MIDX chains

Message ID 337ebc6de7bdf6ff3b4b09c2bea3df2802174e8b.1721250704.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: incremental multi-pack indexes, part one | expand

Commit Message

Taylor Blau July 17, 2024, 9:12 p.m. UTC
The incremental MIDX chain feature is designed around the idea of
indexing into a concatenated lexicographic ordering of object IDs
present in the MIDX.

When given an object position, the MIDX machinery needs to be able to
locate both (a) which MIDX layer contains the given object, and (b) at
what position *within that MIDX layer* that object appears.

To do this, three new fields are added to the `struct multi_pack_index`:

  - struct multi_pack_index *base_midx;
  - uint32_t num_objects_in_base;
  - uint32_t num_packs_in_base;

These three fields store the pieces of information suggested by their
respective field names. In turn, the `num_objects_in_base` and
`num_packs_in_base` fields are used to crawl backwards along the
`base_midx` pointer to locate the appropriate position for a given
object within the MIDX that contains it.

The following commits will update various parts of the MIDX machinery
(as well as their callers from outside of midx.c and midx-write.c) to be
aware and make use of these fields when performing object lookups.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx.h | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Jeff King Aug. 1, 2024, 9:21 a.m. UTC | #1
On Wed, Jul 17, 2024 at 05:12:01PM -0400, Taylor Blau wrote:

> The incremental MIDX chain feature is designed around the idea of
> indexing into a concatenated lexicographic ordering of object IDs
> present in the MIDX.
> 
> When given an object position, the MIDX machinery needs to be able to
> locate both (a) which MIDX layer contains the given object, and (b) at
> what position *within that MIDX layer* that object appears.
> 
> To do this, three new fields are added to the `struct multi_pack_index`:
> 
>   - struct multi_pack_index *base_midx;
>   - uint32_t num_objects_in_base;
>   - uint32_t num_packs_in_base;
> 
> These three fields store the pieces of information suggested by their
> respective field names. In turn, the `num_objects_in_base` and
> `num_packs_in_base` fields are used to crawl backwards along the
> `base_midx` pointer to locate the appropriate position for a given
> object within the MIDX that contains it.

OK, so base_midx is a back-pointer. I think in theory you could compute
num_objects_in_base on the fly by doing that crawl yourself, but we'd
want to be able to do it in constant time, rather than O(# of midx)?

Makes sense.

-Peff
Taylor Blau Aug. 1, 2024, 6:54 p.m. UTC | #2
On Thu, Aug 01, 2024 at 05:21:42AM -0400, Jeff King wrote:
> On Wed, Jul 17, 2024 at 05:12:01PM -0400, Taylor Blau wrote:
>
> > The incremental MIDX chain feature is designed around the idea of
> > indexing into a concatenated lexicographic ordering of object IDs
> > present in the MIDX.
> >
> > When given an object position, the MIDX machinery needs to be able to
> > locate both (a) which MIDX layer contains the given object, and (b) at
> > what position *within that MIDX layer* that object appears.
> >
> > To do this, three new fields are added to the `struct multi_pack_index`:
> >
> >   - struct multi_pack_index *base_midx;
> >   - uint32_t num_objects_in_base;
> >   - uint32_t num_packs_in_base;
> >
> > These three fields store the pieces of information suggested by their
> > respective field names. In turn, the `num_objects_in_base` and
> > `num_packs_in_base` fields are used to crawl backwards along the
> > `base_midx` pointer to locate the appropriate position for a given
> > object within the MIDX that contains it.
>
> OK, so base_midx is a back-pointer. I think in theory you could compute
> num_objects_in_base on the fly by doing that crawl yourself, but we'd
> want to be able to do it in constant time, rather than O(# of midx)?
>
> Makes sense.

Yep. As you have seen in the later patches in this series, we end up
needing to read num_objects_in_base and num_packs_in_base quite
frequently, so it's nice to have them precomputed.

We could compute them lazily, but they're easy to build up in
midx.c::add_midx_to_chain(), where we're already validating that, for
e.g.:

    if (unsigned_add_overflows(midx_chain->num_packs,
                               midx_chain->num_packs_in_base)) {
            /* ... */
    }
    midx->num_packs_in_base = midx_chain->num_packs +
            midx_chain->num_packs_in_base;

> -Peff

Thanks,
Taylor
diff mbox series

Patch

diff --git a/midx.h b/midx.h
index 8554f2d616..020e49f77c 100644
--- a/midx.h
+++ b/midx.h
@@ -63,6 +63,10 @@  struct multi_pack_index {
 	const unsigned char *chunk_revindex;
 	size_t chunk_revindex_len;
 
+	struct multi_pack_index *base_midx;
+	uint32_t num_objects_in_base;
+	uint32_t num_packs_in_base;
+
 	const char **pack_names;
 	struct packed_git **packs;
 	char object_dir[FLEX_ARRAY];