diff mbox series

[v2,01/19] Documentation: describe incremental MIDX format

Message ID 014588b3ecf2e32a6ab1af8e77234dfb2f9ea75a.1721250704.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: incremental multi-pack indexes, part one | expand

Commit Message

Taylor Blau July 17, 2024, 9:11 p.m. UTC
Prepare to implement incremental multi-pack indexes (MIDXs) over the
next several commits by first describing the relevant prerequisites
(like a new chunk in the MIDX format, the directory structure for
incremental MIDXs, etc.)

The format is described in detail in the patch contents below, but the
high-level description is as follows.

Incremental MIDXs live in $GIT_DIR/objects/pack/multi-pack-index.d, and
each `*.midx` within that directory has a single "parent" MIDX, which is
the MIDX layer immediately before it in the MIDX chain. The chain order
resides in a file 'multi-pack-index-chain' in the same directory.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/multi-pack-index.txt | 100 +++++++++++++++++++
 1 file changed, 100 insertions(+)

Comments

Jeff King Aug. 1, 2024, 9:19 a.m. UTC | #1
On Wed, Jul 17, 2024 at 05:11:58PM -0400, Taylor Blau wrote:

> +Each individual component of the chain need only contain a small number
> +of packfiles. Appending to the chain does not invalidate earlier parts
> +of the chain, so repositories can control how much time is spent
> +updating the MIDX chain by determining the number of packs in each layer
> +of the MIDX chain.
> +
> +=== Design state
> +
> +At present, the incremental multi-pack indexes feature is missing two
> +important components:
> +
> +  - The ability to rewrite earlier portions of the MIDX chain (i.e., to
> +    "compact" some collection of adjacent MIDX layers into a single
> +    MIDX). At present the only supported way of shrinking a MIDX chain
> +    is to rewrite the entire chain from scratch without the `--split`
> +    flag.
> ++
> +There are no fundamental limitations that stand in the way of being able
> +to implement this feature. It is omitted from the initial implementation
> +in order to reduce the complexity, but will be added later.
> +
> +  - Support for reachability bitmaps. The classic single MIDX
> +    implementation does support reachability bitmaps (see the section
> +    titled "multi-pack-index reverse indexes" in
> +    linkgit:gitformat-pack[5] for more details).
> ++
> +As above, there are no fundamental limitations that stand in the way of
> +extending the incremental MIDX format to support reachability bitmaps.
> +The design below specifically takes this into account, and support for
> +reachability bitmaps will be added in a future patch series. It is
> +omitted from this series for the same reason as above.

It is nice that you added a bit of a roadmap here about what is
implemented and what is not, and that the design takes into account
future directions (especially incremental bitmap generation).

It does feel a little funny to say "this series" in text that will go
into the repository (i.e., somebody reading the checked out file will
say "huh? which series?"). I'm not sure how to word it better, except to
maybe just say "in the future" and "it is omitted for now" (and
obviously it's a pretty minor point).

> +In brief, to support reachability bitmaps with the incremental MIDX
> +feature, the concept of the pseudo-pack order is extended across each
> +layer of the incremental MIDX chain to form a concatenated pseudo-pack
> +order. This concatenation takes place in the same order as the chain
> +itself (in other words, the concatenated pseudo-pack order for a chain
> +`{$H1, $H2, $H3}` would be the pseudo-pack order for `$H1`, followed by
> +the pseudo-pack order for `$H2`, followed by the pseudo-pack order for
> +`$H3`).

OK, that makes sense. It's how I'd have intuitively assumed it to be,
and most importantly, it should allow appending to the chain without
regenerating (or even translating) earlier bitmaps.

> +=== File layout
> +
> +Instead of storing a single `multi-pack-index` file (with an optional
> +`.rev` and `.bitmap` extension) in `$GIT_DIR/objects/pack`, incremental
> +MIDXs are stored in the following layout:
> +
> +----
> +$GIT_DIR/objects/pack/multi-pack-index.d/
> +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-chain
> +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H1.midx
> +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H2.midx
> +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H3.midx
> +----
> +
> +The `multi-pack-index-chain` file contains a list of the incremental
> +MIDX files in the chain, in order. The above example shows a chain whose
> +`multi-pack-index-chain` file would contain the following lines:
> +
> +----
> +$H1
> +$H2
> +$H3
> +----
> +
> +The `multi-pack-index-$H1.midx` file contains the first layer of the
> +multi-pack-index chain. The `multi-pack-index-$H2.midx` file contains
> +the second layer of the chain, and so on.

Makes sense. How does the chained multi-pack-index.d interact with a
singular multi-pack-index? Generally we should not have both at the same
time, but I'd imagine they both exist for a brief period when moving
from one to another.

I assume the rules are the same as for commit-graphs, which use the same
on-disk structure. I can't think of a reason to prefer one over the
other but this might be a good place to document what does/should
happen.

> +=== Object positions for incremental MIDXs
> +
> +In the original multi-pack-index design, we refer to objects via their
> +lexicographic position (by object IDs) within the repository's singular
> +multi-pack-index. In the incremental multi-pack-index design, we refer
> +to objects via their index into a concatenated lexicographic ordering
> +among each component in the MIDX chain.

How do duplicate objects work here? I guess there aren't any duplicates
in the midx itself, only in the constituent packfiles. So from the
perspective of this section, I guess it doesn't matter? And from the
perspective of bitmaps (where the duplicate issue came up before), it is
business as usual: the midx revindex gives the bit order, and we'd
presumably concatenate the individual revindexes in chain order.

(Mostly just thinking out loud; I'm not sure there's much for you to
answer there).


Looking good so far...

-Peff
Taylor Blau Aug. 1, 2024, 6:52 p.m. UTC | #2
On Thu, Aug 01, 2024 at 05:19:52AM -0400, Jeff King wrote:
> > +As above, there are no fundamental limitations that stand in the way of
> > +extending the incremental MIDX format to support reachability bitmaps.
> > +The design below specifically takes this into account, and support for
> > +reachability bitmaps will be added in a future patch series. It is
> > +omitted from this series for the same reason as above.
>
> It is nice that you added a bit of a roadmap here about what is
> implemented and what is not, and that the design takes into account
> future directions (especially incremental bitmap generation).
>
> It does feel a little funny to say "this series" in text that will go
> into the repository (i.e., somebody reading the checked out file will
> say "huh? which series?"). I'm not sure how to word it better, except to
> maybe just say "in the future" and "it is omitted for now" (and
> obviously it's a pretty minor point).

s/this series/the current implementation/ ?

Definitely an oversight on my part, thanks for catching. I'll squash it
in.

> > +The `multi-pack-index-$H1.midx` file contains the first layer of the
> > +multi-pack-index chain. The `multi-pack-index-$H2.midx` file contains
> > +the second layer of the chain, and so on.
>
> Makes sense. How does the chained multi-pack-index.d interact with a
> singular multi-pack-index? Generally we should not have both at the same
> time, but I'd imagine they both exist for a brief period when moving
> from one to another.
>
> I assume the rules are the same as for commit-graphs, which use the same
> on-disk structure. I can't think of a reason to prefer one over the
> other but this might be a good place to document what does/should
> happen.

The commit-graph code reads the non-chained commit-graph first (see
commit-graph.c::read_commit_graph_one() for exact details, paraphrased
here):

    struct commit_graph *g = load_commit_graph_v1(...);
    if (!g)
            g = load_commit_graph_chain(...);
    return g;

and I matched the same for the MIDX code. I think there are reasonable
arguments for preferring either one over the other, so I think the
easiest thing to do is just throw our hands up and stick with the
convention ;-).

> > +=== Object positions for incremental MIDXs
> > +
> > +In the original multi-pack-index design, we refer to objects via their
> > +lexicographic position (by object IDs) within the repository's singular
> > +multi-pack-index. In the incremental multi-pack-index design, we refer
> > +to objects via their index into a concatenated lexicographic ordering
> > +among each component in the MIDX chain.
>
> How do duplicate objects work here? I guess there aren't any duplicates
> in the midx itself, only in the constituent packfiles. So from the
> perspective of this section, I guess it doesn't matter? And from the
> perspective of bitmaps (where the duplicate issue came up before), it is
> business as usual: the midx revindex gives the bit order, and we'd
> presumably concatenate the individual revindexes in chain order.
>
> (Mostly just thinking out loud; I'm not sure there's much for you to
> answer there).

Right. In a pre-incremental MIDX world, MIDXs contain no duplicate
object entries with respect to themselves. In this new world, the same
is true, with the additional property that MIDXs also contain no
duplicates with respect to their ancestors (when part of a MIDX chain).

> Looking good so far...
>
> -Peff

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
index f2221d2b44..d05e3d6dd9 100644
--- a/Documentation/technical/multi-pack-index.txt
+++ b/Documentation/technical/multi-pack-index.txt
@@ -61,6 +61,106 @@  Design Details
 - The MIDX file format uses a chunk-based approach (similar to the
   commit-graph file) that allows optional data to be added.
 
+Incremental multi-pack indexes
+------------------------------
+
+As repositories grow in size, it becomes more expensive to write a
+multi-pack index (MIDX) that includes all packfiles. To accommodate
+this, the "incremental multi-pack indexes" feature allows for combining
+a "chain" of multi-pack indexes.
+
+Each individual component of the chain need only contain a small number
+of packfiles. Appending to the chain does not invalidate earlier parts
+of the chain, so repositories can control how much time is spent
+updating the MIDX chain by determining the number of packs in each layer
+of the MIDX chain.
+
+=== Design state
+
+At present, the incremental multi-pack indexes feature is missing two
+important components:
+
+  - The ability to rewrite earlier portions of the MIDX chain (i.e., to
+    "compact" some collection of adjacent MIDX layers into a single
+    MIDX). At present the only supported way of shrinking a MIDX chain
+    is to rewrite the entire chain from scratch without the `--split`
+    flag.
++
+There are no fundamental limitations that stand in the way of being able
+to implement this feature. It is omitted from the initial implementation
+in order to reduce the complexity, but will be added later.
+
+  - Support for reachability bitmaps. The classic single MIDX
+    implementation does support reachability bitmaps (see the section
+    titled "multi-pack-index reverse indexes" in
+    linkgit:gitformat-pack[5] for more details).
++
+As above, there are no fundamental limitations that stand in the way of
+extending the incremental MIDX format to support reachability bitmaps.
+The design below specifically takes this into account, and support for
+reachability bitmaps will be added in a future patch series. It is
+omitted from this series for the same reason as above.
++
+In brief, to support reachability bitmaps with the incremental MIDX
+feature, the concept of the pseudo-pack order is extended across each
+layer of the incremental MIDX chain to form a concatenated pseudo-pack
+order. This concatenation takes place in the same order as the chain
+itself (in other words, the concatenated pseudo-pack order for a chain
+`{$H1, $H2, $H3}` would be the pseudo-pack order for `$H1`, followed by
+the pseudo-pack order for `$H2`, followed by the pseudo-pack order for
+`$H3`).
++
+The layout will then be extended so that each layer of the incremental
+MIDX chain can write a `*.bitmap`. The objects in each layer's bitmap
+are offset by the number of objects in the previous layers of the chain.
+
+=== File layout
+
+Instead of storing a single `multi-pack-index` file (with an optional
+`.rev` and `.bitmap` extension) in `$GIT_DIR/objects/pack`, incremental
+MIDXs are stored in the following layout:
+
+----
+$GIT_DIR/objects/pack/multi-pack-index.d/
+$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-chain
+$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H1.midx
+$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H2.midx
+$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H3.midx
+----
+
+The `multi-pack-index-chain` file contains a list of the incremental
+MIDX files in the chain, in order. The above example shows a chain whose
+`multi-pack-index-chain` file would contain the following lines:
+
+----
+$H1
+$H2
+$H3
+----
+
+The `multi-pack-index-$H1.midx` file contains the first layer of the
+multi-pack-index chain. The `multi-pack-index-$H2.midx` file contains
+the second layer of the chain, and so on.
+
+=== Object positions for incremental MIDXs
+
+In the original multi-pack-index design, we refer to objects via their
+lexicographic position (by object IDs) within the repository's singular
+multi-pack-index. In the incremental multi-pack-index design, we refer
+to objects via their index into a concatenated lexicographic ordering
+among each component in the MIDX chain.
+
+If `objects_nr()` is a function that returns the number of objects in a
+given MIDX layer, then the index of an object at lexicographic position
+`i` within, say, $H3 is defined as:
+
+----
+objects_nr($H2) + objects_nr($H1) + i
+----
+
+(in the C implementation, this is often computed as `i +
+m->num_objects_in_base`).
+
 Future Work
 -----------