diff mbox series

[v2,05/24] Documentation: describe MIDX-based bitmaps

Message ID 64a260e0c6a116b7c6fa6fea2b9fd96bf416cb18.1624314293.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series multi-pack reachability bitmaps | expand

Commit Message

Taylor Blau June 21, 2021, 10:25 p.m. UTC
Update the technical documentation to describe the multi-pack bitmap
format. This patch merely introduces the new format, and describes its
high-level ideas. Git does not yet know how to read nor write these
multi-pack variants, and so the subsequent patches will:

  - Introduce code to interpret multi-pack bitmaps, according to this
    document.

  - Then, introduce code to write multi-pack bitmaps from the 'git
    multi-pack-index write' sub-command.

Finally, the implementation will gain tests in subsequent patches (as
opposed to inline with the patch teaching Git how to write multi-pack
bitmaps) to avoid a cyclic dependency.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/technical/bitmap-format.txt    | 72 ++++++++++++++++----
 Documentation/technical/multi-pack-index.txt | 10 +--
 2 files changed, 61 insertions(+), 21 deletions(-)

Comments

Jeff King July 21, 2021, 10:18 a.m. UTC | #1
On Mon, Jun 21, 2021 at 06:25:10PM -0400, Taylor Blau wrote:

> +An object is uniquely described by its bit position within a bitmap:
> +
> +	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
> +	the __n__th object in pack order. For a function `offset` which maps
> +	objects to their byte offset within a pack, pack order is defined as
> +	follows:
> +
> +		o1 <= o2 <==> offset(o1) <= offset(o2)
> +
> +	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
> +	__n__th object in MIDX order. With an additional function `pack` which
> +	maps objects to the pack they were selected from by the MIDX, MIDX order
> +	is defined as follows:
> +
> +		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
> +
> +	The ordering between packs is done lexicographically by the pack name,
> +	with the exception of the preferred pack, which sorts ahead of all other
> +	packs.

This doesn't render well as asciidoc (the final paragraph is taken as
more of the code block). But that is a problem through the whole file. I
think we should ignore it for now, and worry about asciidoc-ifying the
whole thing later, if we choose to.

> +	The ordering between packs is done lexicographically by the pack name,
> +	with the exception of the preferred pack, which sorts ahead of all other
> +	packs.

Hmm, I'm not sure if this "lexicographically" part is true. Really we're
building on the midx .rev format here. And that says "defined by the
MIDX's pack list" (though I can't offhand remember if that is
lexicographic, or if it is in the reverse-mtime order).

At any rate, should we just be referencing the rev documentation?

> [...]

The rest of the changes to the document seemed quite sensible to me.

-Peff
Taylor Blau July 21, 2021, 5:53 p.m. UTC | #2
On Wed, Jul 21, 2021 at 06:18:46AM -0400, Jeff King wrote:
> On Mon, Jun 21, 2021 at 06:25:10PM -0400, Taylor Blau wrote:
>
> > +An object is uniquely described by its bit position within a bitmap:
> > +
> > +	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
> > +	the __n__th object in pack order. For a function `offset` which maps
> > +	objects to their byte offset within a pack, pack order is defined as
> > +	follows:
> > +
> > +		o1 <= o2 <==> offset(o1) <= offset(o2)
> > +
> > +	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
> > +	__n__th object in MIDX order. With an additional function `pack` which
> > +	maps objects to the pack they were selected from by the MIDX, MIDX order
> > +	is defined as follows:
> > +
> > +		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
> > +
> > +	The ordering between packs is done lexicographically by the pack name,
> > +	with the exception of the preferred pack, which sorts ahead of all other
> > +	packs.
>
> This doesn't render well as asciidoc (the final paragraph is taken as
> more of the code block). But that is a problem through the whole file. I
> think we should ignore it for now, and worry about asciidoc-ifying the
> whole thing later, if we choose to.

Agreed; let's ignore it for now.

> > +	The ordering between packs is done lexicographically by the pack name,
> > +	with the exception of the preferred pack, which sorts ahead of all other
> > +	packs.
>
> Hmm, I'm not sure if this "lexicographically" part is true. Really we're
> building on the midx .rev format here. And that says "defined by the
> MIDX's pack list" (though I can't offhand remember if that is
> lexicographic, or if it is in the reverse-mtime order).
>
> At any rate, should we just be referencing the rev documentation?

The packs are listed in lex order in the MIDX, but that is so we can
binary search that list to determine whether a pack is included in the
MIDX or not.

I had to check, but we do use the lex order to resolve duplicate
objects, too. See (at the tip of this branch):

    QSORT(ctx.info, ctx.nr, pack_info_compare);

from within midx.c:write_midx_internal(). Here, ctx.info contains the
list of packs, and pack_info_compare is a thin wrapper around
strcmp()-ing the pack_name values of two packed_git structures.

Arguably, you'd get better EWAH compression of the bits between packs
if we sorted packs in reverse order according to their mtime. But I
suspect that it doesn't matter much in practice, since the number of
objects vastly outpaces the number of packs (but I haven't measured to
be certain, so take that with a grain of salt).

In any case, I think that you're right that adding too much detail hurts
us here, so we should really be mentioning the MIDX's .rev-file
documentation (unfortunately, we can't linkgit it, so mentioning it by
name will have to suffice). I plan to reroll with something like this on
top:

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index 25221c7ec8..04b3ec2178 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -26,9 +26,8 @@ An object is uniquely described by its bit position within a bitmap:

 		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)

-	The ordering between packs is done lexicographically by the pack name,
-	with the exception of the preferred pack, which sorts ahead of all other
-	packs.
+	The ordering between packs is done according to the MIDX's .rev file.
+	Notably, the preferred pack sorts ahead of all other packs.

 The on-disk representation (described below) of a bitmap is the same regardless
 of whether or not that bitmap belongs to a packfile or a MIDX. The only

Thanks,
Taylor
Jeff King July 23, 2021, 7:45 a.m. UTC | #3
On Wed, Jul 21, 2021 at 01:53:40PM -0400, Taylor Blau wrote:

> > > +	The ordering between packs is done lexicographically by the pack name,
> > > +	with the exception of the preferred pack, which sorts ahead of all other
> > > +	packs.
> >
> > Hmm, I'm not sure if this "lexicographically" part is true. Really we're
> > building on the midx .rev format here. And that says "defined by the
> > MIDX's pack list" (though I can't offhand remember if that is
> > lexicographic, or if it is in the reverse-mtime order).
> >
> > At any rate, should we just be referencing the rev documentation?
> 
> The packs are listed in lex order in the MIDX, but that is so we can
> binary search that list to determine whether a pack is included in the
> MIDX or not.
> 
> I had to check, but we do use the lex order to resolve duplicate
> objects, too. See (at the tip of this branch):
> 
>     QSORT(ctx.info, ctx.nr, pack_info_compare);
> 
> from within midx.c:write_midx_internal(). Here, ctx.info contains the
> list of packs, and pack_info_compare is a thin wrapper around
> strcmp()-ing the pack_name values of two packed_git structures.

Ah, OK, thanks for checking.

> Arguably, you'd get better EWAH compression of the bits between packs
> if we sorted packs in reverse order according to their mtime. But I
> suspect that it doesn't matter much in practice, since the number of
> objects vastly outpaces the number of packs (but I haven't measured to
> be certain, so take that with a grain of salt).

Agreed, especially when the intended use is with geometric repacking to
keep reasonable-sized packs.

Either way, I think heuristics to optimize the pack ordering can easily
come on top later. Let's keep this series focused on the fundamentals of
having midx bitmaps at all.

> In any case, I think that you're right that adding too much detail hurts
> us here, so we should really be mentioning the MIDX's .rev-file
> documentation (unfortunately, we can't linkgit it, so mentioning it by
> name will have to suffice). I plan to reroll with something like this on
> top:
> 
> diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> index 25221c7ec8..04b3ec2178 100644
> --- a/Documentation/technical/bitmap-format.txt
> +++ b/Documentation/technical/bitmap-format.txt
> @@ -26,9 +26,8 @@ An object is uniquely described by its bit position within a bitmap:
> 
>  		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
> 
> -	The ordering between packs is done lexicographically by the pack name,
> -	with the exception of the preferred pack, which sorts ahead of all other
> -	packs.
> +	The ordering between packs is done according to the MIDX's .rev file.
> +	Notably, the preferred pack sorts ahead of all other packs.
> 
>  The on-disk representation (described below) of a bitmap is the same regardless
>  of whether or not that bitmap belongs to a packfile or a MIDX. The only

Thanks, that looks much better. We can't linkgit, but we only build HTML
for these. So just a link to pack-format.html would work, as they'd
generally be found side-by-side in the filesystem. But since this
doesn't even really render as asciidoc, I'm not sure I care either way.
(Obviously we could also mention pack-format.txt by name, but it's
probably already obvious-ish to a human that this is where you'd find
information on the pack .rev format).

-Peff
diff mbox series

Patch

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index f8c18a0f7a..25221c7ec8 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -1,6 +1,45 @@ 
 GIT bitmap v1 format
 ====================
 
+== Pack and multi-pack bitmaps
+
+Bitmaps store reachability information about the set of objects in a packfile,
+or a multi-pack index (MIDX). The former is defined obviously, and the latter is
+defined as the union of objects in packs contained in the MIDX.
+
+A bitmap may belong to either one pack, or the repository's multi-pack index (if
+it exists). A repository may have at most one bitmap.
+
+An object is uniquely described by its bit position within a bitmap:
+
+	- If the bitmap belongs to a packfile, the __n__th bit corresponds to
+	the __n__th object in pack order. For a function `offset` which maps
+	objects to their byte offset within a pack, pack order is defined as
+	follows:
+
+		o1 <= o2 <==> offset(o1) <= offset(o2)
+
+	- If the bitmap belongs to a MIDX, the __n__th bit corresponds to the
+	__n__th object in MIDX order. With an additional function `pack` which
+	maps objects to the pack they were selected from by the MIDX, MIDX order
+	is defined as follows:
+
+		o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2)
+
+	The ordering between packs is done lexicographically by the pack name,
+	with the exception of the preferred pack, which sorts ahead of all other
+	packs.
+
+The on-disk representation (described below) of a bitmap is the same regardless
+of whether or not that bitmap belongs to a packfile or a MIDX. The only
+difference is the interpretation of the bits, which is described above.
+
+Certain bitmap extensions are supported (see: Appendix B). No extensions are
+required for bitmaps corresponding to packfiles. For bitmaps that correspond to
+MIDXs, both the bit-cache and rev-cache extensions are required.
+
+== On-disk format
+
 	- A header appears at the beginning:
 
 		4-byte signature: {'B', 'I', 'T', 'M'}
@@ -14,17 +53,19 @@  GIT bitmap v1 format
 			The following flags are supported:
 
 			- BITMAP_OPT_FULL_DAG (0x1) REQUIRED
-			This flag must always be present. It implies that the bitmap
-			index has been generated for a packfile with full closure
-			(i.e. where every single object in the packfile can find
-			 its parent links inside the same packfile). This is a
-			requirement for the bitmap index format, also present in JGit,
-			that greatly reduces the complexity of the implementation.
+			This flag must always be present. It implies that the
+			bitmap index has been generated for a packfile or
+			multi-pack index (MIDX) with full closure (i.e. where
+			every single object in the packfile/MIDX can find its
+			parent links inside the same packfile/MIDX). This is a
+			requirement for the bitmap index format, also present in
+			JGit, that greatly reduces the complexity of the
+			implementation.
 
 			- BITMAP_OPT_HASH_CACHE (0x4)
 			If present, the end of the bitmap file contains
 			`N` 32-bit name-hash values, one per object in the
-			pack. The format and meaning of the name-hash is
+			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
 		4-byte entry count (network byte order)
@@ -33,7 +74,8 @@  GIT bitmap v1 format
 
 		20-byte checksum
 
-			The SHA1 checksum of the pack this bitmap index belongs to.
+			The SHA1 checksum of the pack/MIDX this bitmap index
+			belongs to.
 
 	- 4 EWAH bitmaps that act as type indexes
 
@@ -50,7 +92,7 @@  GIT bitmap v1 format
 			- Tags
 
 		In each bitmap, the `n`th bit is set to true if the `n`th object
-		in the packfile is of that type.
+		in the packfile or multi-pack index is of that type.
 
 		The obvious consequence is that the OR of all 4 bitmaps will result
 		in a full set (all bits set), and the AND of all 4 bitmaps will
@@ -62,8 +104,9 @@  GIT bitmap v1 format
 		Each entry contains the following:
 
 		- 4-byte object position (network byte order)
-			The position **in the index for the packfile** where the
-			bitmap for this commit is found.
+			The position **in the index for the packfile or
+			multi-pack index** where the bitmap for this commit is
+			found.
 
 		- 1-byte XOR-offset
 			The xor offset used to compress this bitmap. For an entry
@@ -146,10 +189,11 @@  Name-hash cache
 ---------------
 
 If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
-a cache of 32-bit values, one per object in the pack. The value at
+a cache of 32-bit values, one per object in the pack/MIDX. The value at
 position `i` is the hash of the pathname at which the `i`th object
-(counting in index order) in the pack can be found.  This can be fed
-into the delta heuristics to compare objects with similar pathnames.
+(counting in index or multi-pack index order) in the pack/MIDX can be found.
+This can be fed into the delta heuristics to compare objects with similar
+pathnames.
 
 The hash algorithm used is:
 
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
index fb688976c4..1a73c3ee20 100644
--- a/Documentation/technical/multi-pack-index.txt
+++ b/Documentation/technical/multi-pack-index.txt
@@ -71,14 +71,10 @@  Future Work
   still reducing the number of binary searches required for object
   lookups.
 
-- The reachability bitmap is currently paired directly with a single
-  packfile, using the pack-order as the object order to hopefully
-  compress the bitmaps well using run-length encoding. This could be
-  extended to pair a reachability bitmap with a multi-pack-index. If
-  the multi-pack-index is extended to store a "stable object order"
+- If the multi-pack-index is extended to store a "stable object order"
   (a function Order(hash) = integer that is constant for a given hash,
-  even as the multi-pack-index is updated) then a reachability bitmap
-  could point to a multi-pack-index and be updated independently.
+  even as the multi-pack-index is updated) then MIDX bitmaps could be
+  updated independently of the MIDX.
 
 - Packfiles can be marked as "special" using empty files that share
   the initial name but replace ".pack" with ".keep" or ".promisor".