diff mbox series

[v4,01/13] Documentation: describe incremental MIDX bitmaps

Message ID f565f2fff166bdf4bb2505f4a8853215a5365b17.1741983492.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series midx: incremental multi-pack indexes, part two | expand

Commit Message

Taylor Blau March 14, 2025, 8:18 p.m. UTC
Prepare to implement support for reachability bitmaps for the new
incremental multi-pack index (MIDX) feature over the following commits.

This commit begins by first describing the relevant format and usage
details for incremental MIDX bitmaps.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/multi-pack-index.adoc | 71 +++++++++++++++++++
 1 file changed, 71 insertions(+)

Comments

Jeff King March 18, 2025, 1:16 a.m. UTC | #1
On Fri, Mar 14, 2025 at 04:18:20PM -0400, Taylor Blau wrote:

> +In the incremental MIDX design, we extend this definition to include
> +objects from multiple layers of the MIDX chain. The pseudo-pack order
> +for incremental MIDXs is determined by concatenating the pseudo-pack
> +ordering for each layer of the MIDX chain in order. Formally two objects
> +`o1` and `o2` are compared as follows:
> +
> +1. If `o1` appears in an earlier layer of the MIDX chain than `o2`, then
> +  `o1` is considered less than `o2`.
> +
> +2. Otherwise, if `o1` and `o2` appear in the same MIDX layer, and that
> +   MIDX layer has no base, then if one of `pack(o1)` and `pack(o2)` is
> +   preferred and the other is not, then the preferred one sorts first. If
> +   there is a base layer (i.e. the MIDX layer is not the first layer in
> +   the chain), then if `pack(o1)` appears earlier in that MIDX layer's
> +   pack order, than `o1` is less than `o2`. Likewise if `pack(o2)`
> +   appears earlier, than the opposite is true.
> +
> +3. Otherwise, `o1` and `o2` appear in the same pack, and thus in the
> +   same MIDX layer. Sort `o1` and `o2` by their offset within their
> +   containing packfile.

OK, I think this ordering makes sense. I had to read this description
over several times to make sure I wasn't missing something. The earlier
part that says "it's just concatenating the pack order of the layers" is
a much more intuitive way of looking at it (modulo that you might need
to remove duplicates found in earlier layers).

But I think an even more basic way of thinking about it is that it's the
same as the pseudo-pack order you would get if you had a single midx of
all of the packs in all of the layers (in their layer order). We already
have to deal with (and have documented) duplicates in that case.

Not really suggesting any wording change here, just making sure I
grokked it all.

> +Note that the preferred pack is a property of the MIDX chain, not the
> +individual layers themselves. Fundamentally we could introduce a
> +per-layer preferred pack, but this is less relevant now that we can
> +perform multi-pack reuse across the set of packs in a MIDX.

Calling this out explicitly is good, since it's an obvious question
for somebody to have.

> +=== Reachability bitmaps and incremental MIDXs
> +
> +Each layer of an incremental MIDX chain may have its objects (and the
> +objects from any previous layer in the same MIDX chain) represented in
> +its own `*.bitmap` file.
> +
> +The structure of a `*.bitmap` file belonging to an incremental MIDX
> +chain is identical to that of a non-incremental MIDX bitmap, or a
> +classic single-pack bitmap. Since objects are added to the end of the
> +incremental MIDX's pseudo-pack order (see: above), it is possible to
> +extend a bitmap when appending to the end of a MIDX chain.
> +
> +(Note: it is possible likewise to compress a contiguous sequence of MIDX
> +incremental layers, and their `*.bitmap`(s) into a single layer and
> +`*.bitmap`, but this is not yet implemented.)
> +
> +The object positions used are global within the pseudo-pack order, so
> +subsequent layers will have, for example, `m->num_objects_in_base`
> +number of `0` bits in each of their four type bitmaps. This follows from
> +the fact that we only write type bitmap entries for objects present in
> +the layer immediately corresponding to the bitmap).

OK, so each layer's bitmap does depend on the layers above/before it.
That obviously needs to happen because each incremental midx is not
likely to be a complete reachability set anyway.

But I also wondered what would happen with a situation like this:

  A -- B
   \
    -- C

stored like this:

  base midx:
    - pack 1:
      - object A
      - object B, which can reach A
  incremental midx:
    - pack 2:
      - object A
      - object C, which can reach A

That is, two objects B and C both depend on A, which is duplicated in
two midx layers. Even if the incremental midx is complete in the sense
that C only depends on A, its bitmap cannot just be "11". Because the
bit position for object A in the incremental midx does not exist in the
pseudo-pack order at all! It must refer to the copy of "A" in the base
midx, so it's correct bitmap is "101" (A and C, but not B).

Again, just talking through it here.

> +Note also that only the bitmap pertaining to the most recent layer in an
> +incremental MIDX chain is used to store reachability information about
> +the interesting and uninteresting objects in a reachability query.
> +Earlier bitmap layers are only used to look up commit and pseudo-merge
> +bitmaps from that layer, as well as the type-level bitmaps for objects
> +in that layer.

I'm not quite sure what this means, but I guess you're saying that
internally as we produce a bitmap, we'll always use the complete bitmap
over all of the layers?

> +To simplify the implementation, type-level bitmaps are iterated
> +simultaneously, and their results are OR'd together to avoid recursively
> +calling internal bitmap functions.

OK, I guess we'll see what this means in the patches. ;)

The general rules for the data structure make sense to me, though.

-Peff
Elijah Newren March 18, 2025, 2:42 a.m. UTC | #2
On Fri, Mar 14, 2025 at 1:18 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> Prepare to implement support for reachability bitmaps for the new
> incremental multi-pack index (MIDX) feature over the following commits.
>
> This commit begins by first describing the relevant format and usage
> details for incremental MIDX bitmaps.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  Documentation/technical/multi-pack-index.adoc | 71 +++++++++++++++++++
>  1 file changed, 71 insertions(+)
>
> diff --git a/Documentation/technical/multi-pack-index.adoc b/Documentation/technical/multi-pack-index.adoc
> index cc063b30be..ab98ecfeb9 100644
> --- a/Documentation/technical/multi-pack-index.adoc
> +++ b/Documentation/technical/multi-pack-index.adoc
> @@ -164,6 +164,77 @@ objects_nr($H2) + objects_nr($H1) + i
>  (in the C implementation, this is often computed as `i +
>  m->num_objects_in_base`).
>
> +=== Pseudo-pack order for incremental MIDXs
> +
> +The original implementation of multi-pack reachability bitmaps defined
> +the pseudo-pack order in linkgit:gitformat-pack[5] (see the section
> +titled "multi-pack-index reverse indexes") roughly as follows:
> +
> +____
> +In short, a MIDX's pseudo-pack is the de-duplicated concatenation of
> +objects in packs stored by the MIDX, laid out in pack order, and the
> +packs arranged in MIDX order (with the preferred pack coming first).
> +____
> +
> +In the incremental MIDX design, we extend this definition to include
> +objects from multiple layers of the MIDX chain. The pseudo-pack order
> +for incremental MIDXs is determined by concatenating the pseudo-pack
> +ordering for each layer of the MIDX chain in order. Formally two objects
> +`o1` and `o2` are compared as follows:
> +
> +1. If `o1` appears in an earlier layer of the MIDX chain than `o2`, then
> +  `o1` is considered less than `o2`.

For sorting order, 'less than' doesn't tell us if you are sorting
smallest to greatest or greatest to smallest.  Maybe "less than (so
its order is earlier than) `o2'" ?

> +
> +2. Otherwise, if `o1` and `o2` appear in the same MIDX layer, and that
> +   MIDX layer has no base, then if one of `pack(o1)` and `pack(o2)` is
> +   preferred and the other is not, then the preferred one sorts first. If
> +   there is a base layer (i.e. the MIDX layer is not the first layer in
> +   the chain), then if `pack(o1)` appears earlier in that MIDX layer's
> +   pack order, than `o1` is less than `o2`. Likewise if `pack(o2)`

s/than/then/

> +   appears earlier, than the opposite is true.

s/than/then/

> +
> +3. Otherwise, `o1` and `o2` appear in the same pack, and thus in the
> +   same MIDX layer. Sort `o1` and `o2` by their offset within their
> +   containing packfile.
> +
> +Note that the preferred pack is a property of the MIDX chain, not the
> +individual layers themselves. Fundamentally we could introduce a
> +per-layer preferred pack, but this is less relevant now that we can
> +perform multi-pack reuse across the set of packs in a MIDX.
> +
> +=== Reachability bitmaps and incremental MIDXs
> +
> +Each layer of an incremental MIDX chain may have its objects (and the
> +objects from any previous layer in the same MIDX chain) represented in
> +its own `*.bitmap` file.
> +
> +The structure of a `*.bitmap` file belonging to an incremental MIDX
> +chain is identical to that of a non-incremental MIDX bitmap, or a
> +classic single-pack bitmap. Since objects are added to the end of the
> +incremental MIDX's pseudo-pack order (see: above), it is possible to

drop the colon?

> +extend a bitmap when appending to the end of a MIDX chain.
> +
> +(Note: it is possible likewise to compress a contiguous sequence of MIDX
> +incremental layers, and their `*.bitmap`(s) into a single layer and
> +`*.bitmap`, but this is not yet implemented.)

"`*.bitmap`(s)" feels slightly awkward and only saves 2 characters.
Maybe just "`*.bitmap` files"?

> +
> +The object positions used are global within the pseudo-pack order, so
> +subsequent layers will have, for example, `m->num_objects_in_base`
> +number of `0` bits in each of their four type bitmaps. This follows from
> +the fact that we only write type bitmap entries for objects present in
> +the layer immediately corresponding to the bitmap).
> +
> +Note also that only the bitmap pertaining to the most recent layer in an
> +incremental MIDX chain is used to store reachability information about
> +the interesting and uninteresting objects in a reachability query.
> +Earlier bitmap layers are only used to look up commit and pseudo-merge
> +bitmaps from that layer, as well as the type-level bitmaps for objects
> +in that layer.
> +
> +To simplify the implementation, type-level bitmaps are iterated
> +simultaneously, and their results are OR'd together to avoid recursively
> +calling internal bitmap functions.
> +
>  Future Work
>  -----------

Should the patch also remove the first item from Future Work, since
this series is implementing it?


> --
> 2.49.0.13.gd0d564685b
Taylor Blau March 18, 2025, 11:11 p.m. UTC | #3
On Mon, Mar 17, 2025 at 09:16:18PM -0400, Jeff King wrote:
> On Fri, Mar 14, 2025 at 04:18:20PM -0400, Taylor Blau wrote:
>
> > +In the incremental MIDX design, we extend this definition to include
> > +objects from multiple layers of the MIDX chain. The pseudo-pack order
> > +for incremental MIDXs is determined by concatenating the pseudo-pack
> > +ordering for each layer of the MIDX chain in order. Formally two objects
> > +`o1` and `o2` are compared as follows:
> > +
> > +1. If `o1` appears in an earlier layer of the MIDX chain than `o2`, then
> > +  `o1` is considered less than `o2`.
> > +
> > +2. Otherwise, if `o1` and `o2` appear in the same MIDX layer, and that
> > +   MIDX layer has no base, then if one of `pack(o1)` and `pack(o2)` is
> > +   preferred and the other is not, then the preferred one sorts first. If
> > +   there is a base layer (i.e. the MIDX layer is not the first layer in
> > +   the chain), then if `pack(o1)` appears earlier in that MIDX layer's
> > +   pack order, than `o1` is less than `o2`. Likewise if `pack(o2)`
> > +   appears earlier, than the opposite is true.
> > +
> > +3. Otherwise, `o1` and `o2` appear in the same pack, and thus in the
> > +   same MIDX layer. Sort `o1` and `o2` by their offset within their
> > +   containing packfile.
>
> OK, I think this ordering makes sense. I had to read this description
> over several times to make sure I wasn't missing something. The earlier
> part that says "it's just concatenating the pack order of the layers" is
> a much more intuitive way of looking at it (modulo that you might need
> to remove duplicates found in earlier layers).
>
> But I think an even more basic way of thinking about it is that it's the
> same as the pseudo-pack order you would get if you had a single midx of
> all of the packs in all of the layers (in their layer order). We already
> have to deal with (and have documented) duplicates in that case.
>
> Not really suggesting any wording change here, just making sure I
> grokked it all.

Yeah, those are both excellent ways to think about it. I hadn't
considered the "the new ordering is the same as the pseudo-pack order
you'd get if you had a single MIDX of all the packs in layer order"
thing before, but it's quite intuitive.

As a side note, it's somewhat hilarious to me that we could really
write:

    "The new ordering is the same as the pseudo-pack order you'd get if
    you had a single MIDX of all the packs in layer order, which is the
    same order you'd get if you had a single pack containing all of the
    objects in MIDX order."

;-)

> > +Note that the preferred pack is a property of the MIDX chain, not the
> > +individual layers themselves. Fundamentally we could introduce a
> > +per-layer preferred pack, but this is less relevant now that we can
> > +perform multi-pack reuse across the set of packs in a MIDX.
>
> Calling this out explicitly is good, since it's an obvious question
> for somebody to have.

Thanks, I think this was an addition from Patrick's earlier review of
the series.

> OK, so each layer's bitmap does depend on the layers above/before it.
> That obviously needs to happen because each incremental midx is not
> likely to be a complete reachability set anyway.
>
> But I also wondered what would happen with a situation like this:
>
>   A -- B
>    \
>     -- C
>
> stored like this:
>
>   base midx:
>     - pack 1:
>       - object A
>       - object B, which can reach A
>   incremental midx:
>     - pack 2:
>       - object A
>       - object C, which can reach A
>
> That is, two objects B and C both depend on A, which is duplicated in
> two midx layers. Even if the incremental midx is complete in the sense
> that C only depends on A, its bitmap cannot just be "11". Because the
> bit position for object A in the incremental midx does not exist in the
> pseudo-pack order at all! It must refer to the copy of "A" in the base
> midx, so it's correct bitmap is "101" (A and C, but not B).
>

Right. Since the base MIDX has objects A and B, B's bitmap here would be
"11". C's bit position in the subsequent layer is a function of where it
sits not just in that MIDX layer, but how many (de-duplicated) objects
exist in all prior layers. There are two, so the earliest bit position
possible to allocate towards C is the third bit. And since C reaches A,
its bitmap would indeed be "101".

> Again, just talking through it here.

Heh, thanks for saying so. It's good to know when we're just talking
through examples versus asking for changes. (Of course, the mere fact of
talking through an example is sometimes enough to suggest a change by
virtue of that example being confusing enough to need to be talked
through in the first place).

> > +Note also that only the bitmap pertaining to the most recent layer in an
> > +incremental MIDX chain is used to store reachability information about
> > +the interesting and uninteresting objects in a reachability query.
> > +Earlier bitmap layers are only used to look up commit and pseudo-merge
> > +bitmaps from that layer, as well as the type-level bitmaps for objects
> > +in that layer.
>
> I'm not quite sure what this means, but I guess you're saying that
> internally as we produce a bitmap, we'll always use the complete bitmap
> over all of the layers?

That's exactly right.

> > +To simplify the implementation, type-level bitmaps are iterated
> > +simultaneously, and their results are OR'd together to avoid recursively
> > +calling internal bitmap functions.
>
> OK, I guess we'll see what this means in the patches. ;)
>
> The general rules for the data structure make sense to me, though.

Great, and thanks in advance for the review as I work through the rest
of your emails :-).

Thanks,
Taylor
Taylor Blau March 18, 2025, 11:19 p.m. UTC | #4
On Mon, Mar 17, 2025 at 07:42:54PM -0700, Elijah Newren wrote:
> > +In the incremental MIDX design, we extend this definition to include
> > +objects from multiple layers of the MIDX chain. The pseudo-pack order
> > +for incremental MIDXs is determined by concatenating the pseudo-pack
> > +ordering for each layer of the MIDX chain in order. Formally two objects
> > +`o1` and `o2` are compared as follows:
> > +
> > +1. If `o1` appears in an earlier layer of the MIDX chain than `o2`, then
> > +  `o1` is considered less than `o2`.
>
> For sorting order, 'less than' doesn't tell us if you are sorting
> smallest to greatest or greatest to smallest.  Maybe "less than (so
> its order is earlier than) `o2'" ?

Oh, good suggestion. I found the alternative a little verbose, but went
with "sorts ahead of" instead of "less than".

> > +
> > +2. Otherwise, if `o1` and `o2` appear in the same MIDX layer, and that
> > +   MIDX layer has no base, then if one of `pack(o1)` and `pack(o2)` is
> > +   preferred and the other is not, then the preferred one sorts first. If
> > +   there is a base layer (i.e. the MIDX layer is not the first layer in
> > +   the chain), then if `pack(o1)` appears earlier in that MIDX layer's
> > +   pack order, than `o1` is less than `o2`. Likewise if `pack(o2)`
>
> s/than/then/
>
> > +   appears earlier, than the opposite is true.
>
> s/than/then/

Good catch on both accounts ;-).

> > +The structure of a `*.bitmap` file belonging to an incremental MIDX
> > +chain is identical to that of a non-incremental MIDX bitmap, or a
> > +classic single-pack bitmap. Since objects are added to the end of the
> > +incremental MIDX's pseudo-pack order (see: above), it is possible to
>
> drop the colon?

Yep, dropped.

> > +extend a bitmap when appending to the end of a MIDX chain.
> > +
> > +(Note: it is possible likewise to compress a contiguous sequence of MIDX
> > +incremental layers, and their `*.bitmap`(s) into a single layer and
> > +`*.bitmap`, but this is not yet implemented.)
>
> "`*.bitmap`(s)" feels slightly awkward and only saves 2 characters.
> Maybe just "`*.bitmap` files"?

Fair suggestion, sure!

> >  Future Work
> >  -----------
>
> Should the patch also remove the first item from Future Work, since
> this series is implementing it?

Hah, that was quite satisfying to do. I moved that to its own commit,
though, since this series doesn't implement incremental MIDXs, but
bitmap support for them. Incremental MIDXs were "done" as of b9497848df
(Merge branch 'tb/incremental-midx-part-1', 2024-08-19).

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/technical/multi-pack-index.adoc b/Documentation/technical/multi-pack-index.adoc
index cc063b30be..ab98ecfeb9 100644
--- a/Documentation/technical/multi-pack-index.adoc
+++ b/Documentation/technical/multi-pack-index.adoc
@@ -164,6 +164,77 @@  objects_nr($H2) + objects_nr($H1) + i
 (in the C implementation, this is often computed as `i +
 m->num_objects_in_base`).
 
+=== Pseudo-pack order for incremental MIDXs
+
+The original implementation of multi-pack reachability bitmaps defined
+the pseudo-pack order in linkgit:gitformat-pack[5] (see the section
+titled "multi-pack-index reverse indexes") roughly as follows:
+
+____
+In short, a MIDX's pseudo-pack is the de-duplicated concatenation of
+objects in packs stored by the MIDX, laid out in pack order, and the
+packs arranged in MIDX order (with the preferred pack coming first).
+____
+
+In the incremental MIDX design, we extend this definition to include
+objects from multiple layers of the MIDX chain. The pseudo-pack order
+for incremental MIDXs is determined by concatenating the pseudo-pack
+ordering for each layer of the MIDX chain in order. Formally two objects
+`o1` and `o2` are compared as follows:
+
+1. If `o1` appears in an earlier layer of the MIDX chain than `o2`, then
+  `o1` is considered less than `o2`.
+
+2. Otherwise, if `o1` and `o2` appear in the same MIDX layer, and that
+   MIDX layer has no base, then if one of `pack(o1)` and `pack(o2)` is
+   preferred and the other is not, then the preferred one sorts first. If
+   there is a base layer (i.e. the MIDX layer is not the first layer in
+   the chain), then if `pack(o1)` appears earlier in that MIDX layer's
+   pack order, than `o1` is less than `o2`. Likewise if `pack(o2)`
+   appears earlier, than the opposite is true.
+
+3. Otherwise, `o1` and `o2` appear in the same pack, and thus in the
+   same MIDX layer. Sort `o1` and `o2` by their offset within their
+   containing packfile.
+
+Note that the preferred pack is a property of the MIDX chain, not the
+individual layers themselves. Fundamentally we could introduce a
+per-layer preferred pack, but this is less relevant now that we can
+perform multi-pack reuse across the set of packs in a MIDX.
+
+=== Reachability bitmaps and incremental MIDXs
+
+Each layer of an incremental MIDX chain may have its objects (and the
+objects from any previous layer in the same MIDX chain) represented in
+its own `*.bitmap` file.
+
+The structure of a `*.bitmap` file belonging to an incremental MIDX
+chain is identical to that of a non-incremental MIDX bitmap, or a
+classic single-pack bitmap. Since objects are added to the end of the
+incremental MIDX's pseudo-pack order (see: above), it is possible to
+extend a bitmap when appending to the end of a MIDX chain.
+
+(Note: it is possible likewise to compress a contiguous sequence of MIDX
+incremental layers, and their `*.bitmap`(s) into a single layer and
+`*.bitmap`, but this is not yet implemented.)
+
+The object positions used are global within the pseudo-pack order, so
+subsequent layers will have, for example, `m->num_objects_in_base`
+number of `0` bits in each of their four type bitmaps. This follows from
+the fact that we only write type bitmap entries for objects present in
+the layer immediately corresponding to the bitmap).
+
+Note also that only the bitmap pertaining to the most recent layer in an
+incremental MIDX chain is used to store reachability information about
+the interesting and uninteresting objects in a reachability query.
+Earlier bitmap layers are only used to look up commit and pseudo-merge
+bitmaps from that layer, as well as the type-level bitmaps for objects
+in that layer.
+
+To simplify the implementation, type-level bitmaps are iterated
+simultaneously, and their results are OR'd together to avoid recursively
+calling internal bitmap functions.
+
 Future Work
 -----------