[v2,12/15] Documentation/technical: describe multi-pack reverse indexes

Message ID 404d730498938da034d860d894ddbb7d6dffc27d.1614193703.git.me@ttaylorr.com (mailing list archive)
State Superseded
Series midx: implement a multi-pack reverse index

Commit Message

Taylor Blau Feb. 24, 2021, 7:10 p.m. UTC
As a prerequisite to implementing multi-pack bitmaps, motivate and
describe the format and ordering of the multi-pack reverse index.

The subsequent patch will implement reading this format, and the patch
after that will implement writing it while producing a multi-pack index.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/pack-format.txt | 80 +++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

Comments

Jonathan Tan March 2, 2021, 4:21 a.m. UTC | #1
> +== multi-pack-index reverse indexes
> +
> +Similar to the pack-based reverse index, the multi-pack index can also
> +be used to generate a reverse index.
> +
> +Instead of mapping between offset, pack-, and index position, this
> +reverse index maps between an object's position within the MIDX, and
> +that object's position within a pseudo-pack that the MIDX describes.
> +
> +To clarify these three orderings

The paragraph seems to only describe 2 orderings - object's position
within the MIDX and object's position within the pseudo-pack. (Is the
third one the offset within the MIDX - which is, I believe, trivially
computable from the position within the MIDX?)

Also, which are stored in the .rev file?

The previous patches look good to me, and I'll review the remaining
patches hopefully tomorrow.

Taylor Blau March 2, 2021, 4:36 a.m. UTC | #2
On Mon, Mar 01, 2021 at 08:21:11PM -0800, Jonathan Tan wrote:
> The previous patches look good to me, and I'll review the remaining
> patches hopefully tomorrow.

Thanks; I am sorely behind recent activity on the list. I had a
last-minute errand to run last weekend and I haven't managed to quite
dig out of the hole I created for myself since then.

Incidentally, I have had this code (and the tb/multi-pack-bitmaps)
running on a couple of high-traffic repositories internal to GitHub, and
so have a couple of improvements that I was hoping to squash in, too.

Thanks,
Taylor

Taylor Blau March 2, 2021, 7:15 p.m. UTC | #3
On Mon, Mar 01, 2021 at 08:21:11PM -0800, Jonathan Tan wrote:
> > +== multi-pack-index reverse indexes
> > +
> > +Similar to the pack-based reverse index, the multi-pack index can also
> > +be used to generate a reverse index.
> > +
> > +Instead of mapping between offset, pack-, and index position, this
> > +reverse index maps between an object's position within the MIDX, and
> > +that object's position within a pseudo-pack that the MIDX describes.
> > +
> > +To clarify these three orderings
>
> The paragraph seems to only describe 2 orderings - object's position
> within the MIDX and object's position within the pseudo-pack. (Is the
> third one the offset within the MIDX - which is, I believe, trivially
> computable from the position within the MIDX?)

Sorry for the confusion. I was trying to distinguish between ordering
based on object offset, pack position, and index position.

I guess you could count that as 2, 3, or 4 different orderings (if you
classify "pack vs MIDX", "offset vs pack pos vs index pos" or the last
three plus "vs MIDX pos").

But I think that all of that is needlessly confusing, so I'd much rather
just say "To clarify the difference between these orderings".

> Also, which are stored in the .rev file?

The paragraph above describes it a little bit ("this reverse index maps
between ..."), but I think it could be made clearer. (I was intentionally
brief there since I wanted to not get too far into the details before
explaining the relevant concepts, but I think I went too far.)

How does this sound?

--- >8 ---

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 77eb591057..4bbbb188a4 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -387,12 +387,15 @@ be used to generate a reverse index.

 Instead of mapping between offset, pack-, and index position, this
 reverse index maps between an object's position within the MIDX, and
-that object's position within a pseudo-pack that the MIDX describes.
+that object's position within a pseudo-pack that the MIDX describes
+(i.e., the ith entry of the multi-pack reverse index holds the MIDX
+position of ith object in pseudo-pack order).

-To clarify these three orderings, consider a multi-pack reachability
-bitmap (which does not yet exist, but is what we are building towards
-here). Each bit needs to correspond to an object in the MIDX, and so we
-need an efficient mapping from bit position to MIDX position.
+To clarify the difference between these orderings, consider a multi-pack
+reachability bitmap (which does not yet exist, but is what we are
+building towards here). Each bit needs to correspond to an object in the
+MIDX, and so we need an efficient mapping from bit position to MIDX
+position.

 One solution is to let bits occupy the same position in the oid-sorted
 index stored by the MIDX. But because oids are effectively random, there

Jonathan Tan March 4, 2021, 2:03 a.m. UTC | #4
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 77eb591057..4bbbb188a4 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -387,12 +387,15 @@ be used to generate a reverse index.
> 
>  Instead of mapping between offset, pack-, and index position, this
>  reverse index maps between an object's position within the MIDX, and
> -that object's position within a pseudo-pack that the MIDX describes.
> +that object's position within a pseudo-pack that the MIDX describes
> +(i.e., the ith entry of the multi-pack reverse index holds the MIDX
> +position of ith object in pseudo-pack order).
> 
> -To clarify these three orderings, consider a multi-pack reachability
> -bitmap (which does not yet exist, but is what we are building towards
> -here). Each bit needs to correspond to an object in the MIDX, and so we
> -need an efficient mapping from bit position to MIDX position.
> +To clarify the difference between these orderings, consider a multi-pack
> +reachability bitmap (which does not yet exist, but is what we are
> +building towards here). Each bit needs to correspond to an object in the
> +MIDX, and so we need an efficient mapping from bit position to MIDX
> +position.
> 
>  One solution is to let bits occupy the same position in the oid-sorted
>  index stored by the MIDX. But because oids are effectively random, there

Thanks - this diff makes sense.

Patch

diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 1faa949bf6..77eb591057 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -379,3 +379,83 @@  CHUNK DATA:
 TRAILER:
 
 	Index checksum of the above contents.
+
+== multi-pack-index reverse indexes
+
+Similar to the pack-based reverse index, the multi-pack index can also
+be used to generate a reverse index.
+
+Instead of mapping between offset, pack-, and index position, this
+reverse index maps between an object's position within the MIDX, and
+that object's position within a pseudo-pack that the MIDX describes.
+
+To clarify these three orderings, consider a multi-pack reachability
+bitmap (which does not yet exist, but is what we are building towards
+here). Each bit needs to correspond to an object in the MIDX, and so we
+need an efficient mapping from bit position to MIDX position.
+
+One solution is to let bits occupy the same position in the oid-sorted
+index stored by the MIDX. But because oids are effectively random, there
+resulting reachability bitmaps would have no locality, and thus compress
+poorly. (This is the reason that single-pack bitmaps use the pack
+ordering, and not the .idx ordering, for the same purpose.)
+
+So we'd like to define an ordering for the whole MIDX based around
+pack ordering, which has far better locality (and thus compresses more
+efficiently). We can think of a pseudo-pack created by the concatenation
+of all of the packs in the MIDX. E.g., if we had a MIDX with three packs
+(a, b, c), with 10, 15, and 20 objects respectively, we can imagine an
+ordering of the objects like:
+
+    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
+
+where the ordering of the packs is defined by the MIDX's pack list,
+and then the ordering of objects within each pack is the same as the
+order in the actual packfile.
+
+Given the list of packs and their counts of objects, you can
+na&iuml;vely reconstruct that pseudo-pack ordering (e.g., the object at
+position 27 must be (c,1) because packs "a" and "b" consumed 25 of the
+slots). But there's a catch. Objects may be duplicated between packs, in
+which case the MIDX only stores one pointer to the object (and thus we'd
+want only one slot in the bitmap).
+
+Callers could handle duplicates themselves by reading objects in order
+of their bit-position, but that's linear in the number of objects, and
+much too expensive for ordinary bitmap lookups. Building a reverse index
+solves this, since it is the logical inverse of the index, and that
+index has already removed duplicates. But, building a reverse index on
+the fly can be expensive. Since we already have an on-disk format for
+pack-based reverse indexes, let's reuse it for the MIDX's pseudo-pack,
+too.
+
+Objects from the MIDX are ordered as follows to string together the
+pseudo-pack. Let _pack(o)_ return the pack from which _o_ was selected
+by the MIDX, and define an ordering of packs based on their numeric ID
+(as stored by the MIDX). Let _offset(o)_ return the object offset of _o_
+within _pack(o)_. Then, compare _o~1~_ and _o~2~_ as follows:
+
+  - If one of _pack(o~1~)_ and _pack(o~2~)_ is preferred and the other
+    is not, then the preferred one sorts first.
++
+(This is a detail that allows the MIDX bitmap to determine which
+pack should be used by the pack-reuse mechanism, since it can ask
+the MIDX for the pack containing the object at bit position 0).
+
+  - If _pack(o~1~) &ne; pack(o~2~)_, then sort the two objects in
+    descending order based on the pack ID.
+
+  - Otherwise, _pack(o~1~) &equals; pack(o~2~)_, and the objects are
+    sorted in pack-order (i.e., _o~1~_ sorts ahead of _o~2~_ exactly
+    when _offset(o~1~) &lt; offset(o~2~)_).
+
+In short, a MIDX's pseudo-pack is the de-duplicated concatenation of
+objects in packs stored by the MIDX, laid out in pack order, and the
+packs arranged in MIDX order (with the preferred pack coming first).
+
+Finally, note that the MIDX's reverse index is not stored as a chunk in
+the multi-pack-index itself. This is done because the reverse index
+includes the checksum of the pack or MIDX to which it belongs, which
+makes it impossible to write in the MIDX. To avoid races when rewriting
+the MIDX, a MIDX reverse index includes the MIDX's checksum in its
+filename (e.g., `multi-pack-index-xyz.rev`).
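
The ordering rules in the patch above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not git's implementation: `MidxEntry`, `build_midx_rev`, and `invert` are hypothetical names, and the MIDX is modelled as an in-memory list that has already been de-duplicated. Pack IDs are compared in ascending order so that the result matches the (a, b, c) layout drawn in the patch; the bullet text says "descending", so negate that key component if that reading is intended. As in the clarification in comment #3, `rev[i]` holds the MIDX position of the ith object in pseudo-pack order.

```python
from typing import List, NamedTuple


class MidxEntry(NamedTuple):
    """One object as selected by a hypothetical in-memory MIDX."""
    oid: str      # hex object ID; MIDX position is the rank in oid order
    pack_id: int  # numeric ID of the pack the MIDX chose for this object
    offset: int   # byte offset of the object inside that pack


def build_midx_rev(entries: List[MidxEntry], preferred_pack: int) -> List[int]:
    """Return rev, where rev[i] is the MIDX position of the i-th object
    in pseudo-pack order."""
    # MIDX order: de-duplicated objects sorted by oid.
    midx_order = sorted(entries, key=lambda e: e.oid)

    def pseudo_pack_key(midx_pos: int):
        e = midx_order[midx_pos]
        return (
            e.pack_id != preferred_pack,  # False sorts first: preferred pack leads
            e.pack_id,                    # assumption: ascending pack ID
            e.offset,                     # pack order within a single pack
        )

    return sorted(range(len(midx_order)), key=pseudo_pack_key)


def invert(rev: List[int]) -> List[int]:
    """Map MIDX position -> pseudo-pack (bit) position."""
    inv = [0] * len(rev)
    for bit_pos, midx_pos in enumerate(rev):
        inv[midx_pos] = bit_pos
    return inv


# Made-up sample data: five objects spread over three packs.
entries = [
    MidxEntry("aa", pack_id=1, offset=12),
    MidxEntry("3c", pack_id=0, offset=40),
    MidxEntry("f0", pack_id=1, offset=90),
    MidxEntry("07", pack_id=0, offset=12),
    MidxEntry("b4", pack_id=2, offset=12),
]
rev = build_midx_rev(entries, preferred_pack=1)
print(rev)          # [2, 4, 0, 1, 3]
print(invert(rev))  # [2, 3, 0, 4, 1]
```

With pack 1 preferred, bit position 0 belongs to the object at MIDX position 2 ("aa"), which lives in the preferred pack. That is the property the pack-reuse note in the patch relies on. A bitmap reader would use `rev` to go from bit position to MIDX position, and `invert(rev)` for the other direction.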