[6/9] Documentation/technical: describe multi-pack reverse indexes

Message ID: e64504bad6e181522946a8f234e12f569bede89e.1612998106.git.me@ttaylorr.com
State: Superseded
Series: midx: implement a multi-pack reverse index

Commit Message

Taylor Blau Feb. 10, 2021, 11:03 p.m. UTC

As a prerequisite to implementing multi-pack bitmaps, motivate and
describe the format and ordering of the multi-pack reverse index.

The subsequent patch will implement reading this format, and the patch
after that will implement writing it while producing a multi-pack index.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt | 83 +++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

Comments

Derrick Stolee Feb. 11, 2021, 2:48 a.m. UTC | #1
On 2/10/21 6:03 PM, Taylor Blau wrote:
> +Instead of mapping between offset, pack-, and index position, this

The "pack-," should be paired with "index-position" or drop the
hyphen in both cases. Perhaps just be explicit, especially since
"position" doesn't match with "offset":

  Instead of mapping between pack offset, pack position, and index
  position, ...

> +reverse index maps between an object's position within the midx, and
> +that object's position within a pseudo-pack that the midx describes.

nit: use multi-pack-index or MIDX, not lower-case 'midx'.

> +Crucially, the objects' positions within this pseudo-pack are the same
> +as their bit positions in a multi-pack reachability bitmap.
> +
> +As a motivating example, consider the multi-pack reachability bitmap
> +(which does not yet exist, but is what we are building towards here). We
> +need each bit to correspond to an object covered by the midx, and we
> +need to be able to convert bit positions back to index positions (from
> +which we can get the oid, etc).

These paragraphs are awkward. Instead of operating in the hypothetical
world of reachability bitmaps, focus on the fact that bitmaps need
a bidirectional mapping between "bit position" and an object ID.

Here is an attempt to reword some of the context you are using here.
Feel free to take as much or as little as you want.

  The multi-pack-index stores the object IDs in lexicographical order
  (lex-order) to allow binary search. To allow compressible reachability
  bitmaps to pair with a multi-pack-index, a different ordering is
  required. When paired with a single packfile, the order used is the
  object order within the packfile (called the pack-order). Construct
  a "pseudo-pack" by concatenating all tracked packfiles in the
  multi-pack-index. We now need a mapping between the lex-order and the
  pseudo-pack-order.

> +One solution is to let each bit position in the index correspond to
> +the same position in the oid-sorted index stored by the midx. But
> +because oids are effectively random, the resulting reachability
> +bitmaps would have no locality, and thus compress poorly. (This is the
> +reason that single-pack bitmaps use the pack ordering, and not the .idx
> +ordering, for the same purpose.)
> +
> +So we'd like to define an ordering for the whole midx based around
> +pack ordering. We can think of it as a pseudo-pack created by the
> +concatenation of all of the packs in the midx. E.g., if we had a midx
> +with three packs (a, b, c), with 10, 15, and 20 objects respectively, we
> +can imagine an ordering of the objects like:
> +
> +    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
> +
> +where the ordering of the packs is defined by the midx's pack list,
> +and then the ordering of objects within each pack is the same as the
> +order in the actual packfile.
> +
> +Given the list of packs and their counts of objects, you can
> +na&iuml;vely reconstruct that pseudo-pack ordering (e.g., the object at
> +position 27 must be (c,1) because packs "a" and "b" consumed 25 of the
> +slots). But there's a catch. Objects may be duplicated between packs, in
> +which case the midx only stores one pointer to the object (and thus we'd
> +want only one slot in the bitmap).
> +
> +Callers could handle duplicates themselves by reading objects in order
> +of their bit-position, but that's linear in the number of objects, and
> +much too expensive for ordinary bitmap lookups. Building a reverse index
> +solves this, since it is the logical inverse of the index, and that
> +index has already removed duplicates. But, building a reverse index on
> +the fly can be expensive. Since we already have an on-disk format for
> +pack-based reverse indexes, let's reuse it for the midx's pseudo-pack,
> +too.
> +
> +Objects from the midx are ordered as follows to string together the
> +pseudo-pack. Let _pack(o)_ return the pack from which _o_ was selected
> +by the midx, and define an ordering of packs based on their numeric ID
> +(as stored by the midx). Let _offset(o)_ return the object offset of _o_
> +within _pack(o)_. Then, compare _o~1~_ and _o~2~_ as follows:
> +
> +  - If one of _pack(o~1~)_ and _pack(o~2~)_ is preferred and the other
> +    is not, then the preferred one sorts first.
> ++
> +(This is a detail that allows the midx bitmap to determine which
> +pack should be used by the pack-reuse mechanism, since it can ask
> +the midx for the pack containing the object at bit position 0).
> +
> +  - If _pack(o~1~) &ne; pack(o~2~)_, then sort the two objects in
> +    descending order based on the pack ID.
> +
> +  - Otherwise, _pack(o~1~) &equals; pack(o~2~)_, and the objects are
> +    sorted in pack-order (i.e., _o~1~_ sorts ahead of _o~2~_ exactly
> +    when _offset(o~1~) &lt; offset(o~2~)_).
> +
> +In short, a midx's pseudo-pack is the de-duplicated concatenation of
> +objects in packs stored by the midx, laid out in pack order, and the
> +packs arranged in midx order (with the preferred pack coming first).
> +
> +Finally, note that the midx's reverse index is not stored as a chunk in
> +the multi-pack-index itself. This is done because the reverse index
> +includes the checksum of the pack or midx to which it belongs, which
> +makes it impossible to write in the midx. To avoid races when rewriting
> +the midx, a midx reverse index includes the midx's checksum in its
> +filename (e.g., `multi-pack-index-xyz.rev`).

The rest of these details make sense and sufficiently motivate the
ordering, once the concept is clear.

Thanks,
-Stolee

Taylor Blau Feb. 11, 2021, 3:03 a.m. UTC | #2
On Wed, Feb 10, 2021 at 09:48:20PM -0500, Derrick Stolee wrote:
> nit: use multi-pack-index or MIDX, not lower-case 'midx'.

Thanks.

> > +Crucially, the objects' positions within this pseudo-pack are the same
> > +as their bit positions in a multi-pack reachability bitmap.
> > +
> > +As a motivating example, consider the multi-pack reachability bitmap
> > +(which does not yet exist, but is what we are building towards here). We
> > +need each bit to correspond to an object covered by the midx, and we
> > +need to be able to convert bit positions back to index positions (from
> > +which we can get the oid, etc).
>
> These paragraphs are awkward. Instead of operating in the hypothetical
> world of reachability bitmaps, focus on the fact that bitmaps need
> a bidirectional mapping between "bit position" and an object ID.

Hmm. I could buy that these paragraphs are awkward, but I'm not sure
that what you proposed makes it less so.

I may be a bad person to judge what you wrote, since I am familiar with
the details of what it's describing. But my thoughts on that second and
third paragraph are basically:

  - define the valid orderings we might consider objects in a MIDX by,
    indicating which of those orderings we're going to use for
    multi-pack bitmaps

  - motivate the need for a mapping between lexicographic order and
    pseudo-pack order

> Here is an attempt to reword some of the context you are using here.
> Feel free to take as much or as little as you want.
>
>   The multi-pack-index stores the object IDs in lexicographical order
>   (lex-order) to allow binary search. To allow compressible reachability
>   bitmaps to pair with a multi-pack-index, a different ordering is
>   required. When paired with a single packfile, the order used is the
>   object order within the packfile (called the pack-order). Construct
>   a "pseudo-pack" by concatenating all tracked packfiles in the
>   multi-pack-index. We now need a mapping between the lex-order and the
>   pseudo-pack-order.

I struggled with what you wrote because I couldn't seem to neatly
place/replace that paragraph in with the existing text without referring
to yet-undefined concepts.

Maybe the confusion lies in the fact that we stray too far from the
point in the second and third paragraphs. What if we reordered the
second, third, and fourth paragraph like this:

		Instead of mapping between offset, pack-, and index position, this
		reverse index maps between an object's position within the MIDX, and
		that object's position within a pseudo-pack that the MIDX describes.

		To clarify these three orderings, consider a multi-pack reachability
		bitmap (which does not yet exist, but is what we are building towards
		here). Each bit needs to correspond to an object in the MIDX, and so we
		need an efficient mapping from bit position to MIDX position.

		One solution is to let bits occupy the same position in the oid-sorted
		index stored by the MIDX. But because oids are effectively random, the
		resulting reachability bitmaps would have no locality, and thus compress
		poorly. (This is the reason that single-pack bitmaps use the pack
		ordering, and not the .idx ordering, for the same purpose.)

		So we'd like to define an ordering for the whole MIDX based around
		pack ordering, which has far better locality (and thus compresses more
		efficiently). We can think of a pseudo-pack created by the concatenation
		of all of the packs in the MIDX. E.g., if we had a MIDX with three packs
		(a, b, c), with 10, 15, and 20 objects respectively, we can imagine an
		ordering of the objects like:

> [snip]
>
> The rest of these details make sense and sufficiently motivate the
> ordering, once the concept is clear.
>
> Thanks,
> -Stolee

Thanks,
Taylor
Patch

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 8833b71c8b..a14722f119 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -376,3 +376,86 @@  CHUNK DATA:
 TRAILER:
 
 	Index checksum of the above contents.
+
+== multi-pack-index reverse indexes
+
+Similar to the pack-based reverse index, the multi-pack index can also
+be used to generate a reverse index.
+
+Instead of mapping between offset, pack-, and index position, this
+reverse index maps between an object's position within the midx, and
+that object's position within a pseudo-pack that the midx describes.
+Crucially, the objects' positions within this pseudo-pack are the same
+as their bit positions in a multi-pack reachability bitmap.
+
+As a motivating example, consider the multi-pack reachability bitmap
+(which does not yet exist, but is what we are building towards here). We
+need each bit to correspond to an object covered by the midx, and we
+need to be able to convert bit positions back to index positions (from
+which we can get the oid, etc).
+
+One solution is to let each bit position in the index correspond to
+the same position in the oid-sorted index stored by the midx. But
+because oids are effectively random, the resulting reachability
+bitmaps would have no locality, and thus compress poorly. (This is the
+reason that single-pack bitmaps use the pack ordering, and not the .idx
+ordering, for the same purpose.)
+
+So we'd like to define an ordering for the whole midx based around
+pack ordering. We can think of it as a pseudo-pack created by the
+concatenation of all of the packs in the midx. E.g., if we had a midx
+with three packs (a, b, c), with 10, 15, and 20 objects respectively, we
+can imagine an ordering of the objects like:
+
+    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+where the ordering of the packs is defined by the midx's pack list,
+and then the ordering of objects within each pack is the same as the
+order in the actual packfile.
+
+Given the list of packs and their counts of objects, you can
+na&iuml;vely reconstruct that pseudo-pack ordering (e.g., the object at
+position 27 must be (c,1) because packs "a" and "b" consumed 25 of the
+slots). But there's a catch. Objects may be duplicated between packs, in
+which case the midx only stores one pointer to the object (and thus we'd
+want only one slot in the bitmap).
+
+Callers could handle duplicates themselves by reading objects in order
+of their bit-position, but that's linear in the number of objects, and
+much too expensive for ordinary bitmap lookups. Building a reverse index
+solves this, since it is the logical inverse of the index, and that
+index has already removed duplicates. But, building a reverse index on
+the fly can be expensive. Since we already have an on-disk format for
+pack-based reverse indexes, let's reuse it for the midx's pseudo-pack,
+too.
+
+Objects from the midx are ordered as follows to string together the
+pseudo-pack. Let _pack(o)_ return the pack from which _o_ was selected
+by the midx, and define an ordering of packs based on their numeric ID
+(as stored by the midx). Let _offset(o)_ return the object offset of _o_
+within _pack(o)_. Then, compare _o~1~_ and _o~2~_ as follows:
+
+  - If one of _pack(o~1~)_ and _pack(o~2~)_ is preferred and the other
+    is not, then the preferred one sorts first.
++
+(This is a detail that allows the midx bitmap to determine which
+pack should be used by the pack-reuse mechanism, since it can ask
+the midx for the pack containing the object at bit position 0).
+
+  - If _pack(o~1~) &ne; pack(o~2~)_, then sort the two objects in
+    descending order based on the pack ID.
+
+  - Otherwise, _pack(o~1~) &equals; pack(o~2~)_, and the objects are
+    sorted in pack-order (i.e., _o~1~_ sorts ahead of _o~2~_ exactly
+    when _offset(o~1~) &lt; offset(o~2~)_).
+
+In short, a midx's pseudo-pack is the de-duplicated concatenation of
+objects in packs stored by the midx, laid out in pack order, and the
+packs arranged in midx order (with the preferred pack coming first).
+
+Finally, note that the midx's reverse index is not stored as a chunk in
+the multi-pack-index itself. This is done because the reverse index
+includes the checksum of the pack or midx to which it belongs, which
+makes it impossible to write in the midx. To avoid races when rewriting
+the midx, a midx reverse index includes the midx's checksum in its
+filename (e.g., `multi-pack-index-xyz.rev`).
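
The ordering rules in the patch can be sketched as a sort key. This is a
minimal Python illustration, not code from the series: the Entry record,
build_reverse_index(), and the preferred_pack argument are hypothetical
names, and the input is assumed to hold exactly one entry per object (the
MIDX has already removed duplicates). Non-preferred packs are placed in
ascending pack-ID order, matching the a, b, c illustration; that direction
is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    oid: str      # object ID; the MIDX keeps entries sorted by this
    pack_id: int  # numeric ID of the pack the MIDX selected for the object
    offset: int   # offset of the object within that pack


def build_reverse_index(entries, preferred_pack):
    """Return (midx, rev, fwd) for de-duplicated entries.

    midx: entries in MIDX (oid-sorted) order
    rev:  rev[bit_pos] = midx_pos  (pseudo-pack order -> MIDX order)
    fwd:  fwd[midx_pos] = bit_pos  (MIDX order -> pseudo-pack order)
    """
    midx = sorted(entries, key=lambda e: e.oid)

    def pseudo_pack_key(midx_pos):
        e = midx[midx_pos]
        # False sorts before True, so the preferred pack comes first;
        # then pack ID (assumed ascending), then offset within the pack.
        return (e.pack_id != preferred_pack, e.pack_id, e.offset)

    rev = sorted(range(len(midx)), key=pseudo_pack_key)

    # Invert the permutation so lookups work in both directions.
    fwd = [0] * len(rev)
    for bit_pos, midx_pos in enumerate(rev):
        fwd[midx_pos] = bit_pos
    return midx, rev, fwd
```

With tables like these, midx[rev[0]].pack_id names the preferred pack,
which is the lookup the pack-reuse remark in the patch relies on.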