diff mbox series

[v3,1/6] Documentation/technical: describe bitmap lookup table extension

Message ID f72bf11e6efb4690ae808c0b56c3991c2b1ef266.1656924376.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series bitmap: integrate a lookup table extension to the bitmap format | expand

Commit Message

Abhradeep Chakraborty July 4, 2022, 8:46 a.m. UTC
From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

When reading bitmap file, Git loads each and every bitmap one by one
even if all the bitmaps are not required. A "bitmap lookup table"
extension to the bitmap format can reduce the overhead of loading
bitmaps which stores a list of bitmapped commit id pos (in the midx
or pack, along with their offset and xor offset. This way git can
load only the necessary bitmaps without loading the previous bitmaps.

Older versions of Git ignore the lookup table extension and don't
throw any kind of warning or error while parsing the bitmap file.

Add some information for the new "bitmap lookup table" extension in the
bitmap-format documentation.

Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Co-Authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 Documentation/technical/bitmap-format.txt | 39 +++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Comments

Philip Oakley July 8, 2022, 4:38 p.m. UTC | #1
Hi Abhradeep,

On 04/07/2022 09:46, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> When reading bitmap file, Git loads each and every bitmap one by one
> even if all the bitmaps are not required. A "bitmap lookup table"
> extension to the bitmap format can reduce the overhead of loading
> bitmaps which stores a list of bitmapped commit id pos (in the midx
> or pack, along with their offset and xor offset. This way git can
> load only the necessary bitmaps without loading the previous bitmaps.
>
> Older versions of Git ignore the lookup table extension and don't
> throw any kind of warning or error while parsing the bitmap file.
>
> Add some information for the new "bitmap lookup table" extension in the
> bitmap-format documentation.

Not sure if this is new in this extension, but should there be a link or
two to the basics of XOR compression and some of the bitmap look up
techniques?

It's not always obvious if these techniques are 'heuristic' and only
have partial commit data, or they have all the commits listed, Nor
how/why they work. My point is more about giving new readers a hand-up
in their understanding, rather than simple implementation details for
those who already know what is going on. For example, are there any
external articles that you found helpful in getting started that could
be referenced somewhere in the docs?

Separately I'm preparing a short series on adding 'reachability bitmap'
and 'commit graph' (among other stuff) to the glossary as part of giving
folks [0] stepping stones to cross the chasm of understanding

Philip

[0] me included;-)
>
> Mentored-by: Taylor Blau <me@ttaylorr.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Co-Authored-by: Taylor Blau <me@ttaylorr.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  Documentation/technical/bitmap-format.txt | 39 +++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> index 04b3ec21785..c30dc177643 100644
> --- a/Documentation/technical/bitmap-format.txt
> +++ b/Documentation/technical/bitmap-format.txt
> @@ -67,6 +67,17 @@ MIDXs, both the bit-cache and rev-cache extensions are required.
>  			pack/MIDX. The format and meaning of the name-hash is
>  			described below.
>  
> +			** {empty}
> +			BITMAP_OPT_LOOKUP_TABLE (0x10): :::
> +			If present, the end of the bitmap file contains a table
> +			containing a list of `N` <commit_pos, offset, xor_row>
> +			triplets. The format and meaning of the table is described
> +			below.
> ++
> +NOTE: Unlike the xor_offset used to compress an individual bitmap,
> +`xor_row` stores an *absolute* index into the lookup table, not a location
> +relative to the current entry.
> +
>  		4-byte entry count (network byte order)
>  
>  			The total count of entries (bitmapped commits) in this bitmap index.
> @@ -205,3 +216,31 @@ Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
>  If implementations want to choose a different hashing scheme, they are
>  free to do so, but MUST allocate a new header flag (because comparing
>  hashes made under two different schemes would be pointless).
> +
> +Commit lookup table
> +-------------------
> +
> +If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
> +bytes (preceding the name-hash cache and trailing hash) of the `.bitmap`
> +file contains a lookup table specifying the information needed to get
> +the desired bitmap from the entries without parsing previous unnecessary
> +bitmaps.
> +
> +For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
> +contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets
> +(sorted in the ascending order of `commit_pos`). The content of i'th
> +triplet is -
> +
> +	* {empty}
> +	commit_pos (4 byte integer, network byte order): ::
> +	It stores the object position of a commit (in the midx or pack
> +	index).
> +
> +	* {empty}
> +	offset (8 byte integer, network byte order): ::
> +	The offset from which that commit's bitmap can be read.
> +
> +	* {empty}
> +	xor_row (4 byte integer, network byte order): ::
> +	The position of the triplet whose bitmap is used to compress
> +	this one, or `0xffffffff` if no such bitmap exists.
Abhradeep Chakraborty July 9, 2022, 7:53 a.m. UTC | #2
Hello Philip,

Philip Oakley <philipoakley@iee.email> wrote:

> Not sure if this is new in this extension, but should there be a link or
> two to the basics of XOR compression and some of the bitmap look up
> techniques?
>
> It's not always obvious if these techniques are 'heuristic' and only
> have partial commit data, or they have all the commits listed, Nor
> how/why they work. My point is more about giving new readers a hand-up
> in their understanding, rather than simple implementation details for
> those who already know what is going on. For example, are there any
> external articles that you found helpful in getting started that could
> be referenced somewhere in the docs?

As this series is only about adding a lookup-table extension (and not
about bitmap itself), I am not sure whether it's good to include those
things in this series. But I agree with your point that it should be
able build a logical understanding among the new readers.

There are some external articles[1] which talk about bitmap internals.
But I think it would be better if we can make a new doc file (may be
`Documentation/technical/reachability-bitmaps.txt` or similar) rather
than putting those details in the `bitmap-format.txt` (As the name 
suggests, this file should only contain format details of bitmaps).
That file would provide the answers of "Why bitmaps", "how they are
stored",  "How they are fetched", "how they work with pack-objects,
git-fetch, midx etc.", "Detailed explanation of each bitmap extension"
, and lastly "what are the future works" (if any).

What do you think?

> Separately I'm preparing a short series on adding 'reachability bitmap'
> and 'commit graph' (among other stuff) to the glossary as part of giving
> folks [0] stepping stones to cross the chasm of understanding

Great!

Thanks :)

[1] https://github.blog/2015-09-22-counting-objects/, https://github.blog/2021-04-29-scaling-monorepo-maintenance/
Philip Oakley July 10, 2022, 3:01 p.m. UTC | #3
On 09/07/2022 08:53, Abhradeep Chakraborty wrote:
> Hello Philip,
>
> Philip Oakley <philipoakley@iee.email> wrote:
>
>> Not sure if this is new in this extension, but should there be a link or
>> two to the basics of XOR compression and some of the bitmap look up
>> techniques?
>>
>> It's not always obvious if these techniques are 'heuristic' and only
>> have partial commit data, or they have all the commits listed, Nor
>> how/why they work. My point is more about giving new readers a hand-up
>> in their understanding, rather than simple implementation details for
>> those who already know what is going on. For example, are there any
>> external articles that you found helpful in getting started that could
>> be referenced somewhere in the docs?
> As this series is only about adding a lookup-table extension (and not
> about bitmap itself), I am not sure whether it's good to include those
> things in this series. 

Thanks for the clarification. I must have slight misread some of the
discussions and falsely thought it was the XOR compression (which is a
technique I wasn't really aware of), that was being provided by the
extension - Where would it be best for me to look up the background to
your "extension" project?


> But I agree with your point that it should be
> able build a logical understanding among the new readers.

*nod*
>
> There are some external articles[1] which talk about bitmap internals.
> But I think it would be better if we can make a new doc file (may be
> `Documentation/technical/reachability-bitmaps.txt` or similar) rather
> than putting those details in the `bitmap-format.txt` 

Thanks for the two links. In general I agree about the format document.

> (As the name 
> suggests, this file should only contain format details of bitmaps).
> That file would provide the answers of "Why bitmaps", "how they are
> stored",  "How they are fetched", "how they work with pack-objects,
> git-fetch, midx etc.", "Detailed explanation of each bitmap extension"
> , and lastly "what are the future works" (if any).

One thing I've realised on reflection is that I'm unclear how the
'reachability bitmaps' and the 'commit-graph file' techniques relate to
each other (and to the ODB DAG), and what features they pick out within
their heuristic, explained at just enough level to allow folks to
appreciate what the options that select them will do for their use case.

>
> What do you think?

I'd be happy to collate contributions, suggestions and thoughts.

Trying to create these good introductory descriptions can be really
difficult, as you can only step into the same river once (the 'reading
for the first time problem' of not being able to un-hear the
explanations of others when reading a 2nd draft...)
>
>> Separately I'm preparing a short series on adding 'reachability bitmap'
>> and 'commit graph' (among other stuff) to the glossary as part of giving
>> folks [0] stepping stones to cross the chasm of understanding
> Great!
>
> Thanks :)
>
> [1] https://github.blog/2015-09-22-counting-objects/, https://github.blog/2021-04-29-scaling-monorepo-maintenance/
Thank you.
Taylor Blau July 14, 2022, 11:15 p.m. UTC | #4
On Sun, Jul 10, 2022 at 04:01:11PM +0100, Philip Oakley wrote:
> >> Not sure if this is new in this extension, but should there be a link or
> >> two to the basics of XOR compression and some of the bitmap look up
> >> techniques?
> >>
> >> It's not always obvious if these techniques are 'heuristic' and only
> >> have partial commit data, or they have all the commits listed, Nor
> >> how/why they work. My point is more about giving new readers a hand-up
> >> in their understanding, rather than simple implementation details for
> >> those who already know what is going on. For example, are there any
> >> external articles that you found helpful in getting started that could
> >> be referenced somewhere in the docs?
> > As this series is only about adding a lookup-table extension (and not
> > about bitmap itself), I am not sure whether it's good to include those
> > things in this series.
>
> Thanks for the clarification. I must have slight misread some of the
> discussions and falsely thought it was the XOR compression (which is a
> technique I wasn't really aware of), that was being provided by the
> extension - Where would it be best for me to look up the background to
> your "extension" project?

Yeah, Abhradeep is right that the XOR compression isn't new, we already
serialize bitmaps with optional XOR offsets. The gist is that we give an
offset of some previous bitmap that is used to compress the current one
by XORing the bits in the current bitamp with the previous one. These
XOR-compressed bitmaps are often sparse, so they compress well and
reduce the overall size of the .bitmap.

A slightly more detailed overview can be found in
Documentation/technical/bitmap-format.txt under the bullet point reading
"1-byte XOR-offset".

Thanks,
Taylor
Philip Oakley July 15, 2022, 10:36 a.m. UTC | #5
On 15/07/2022 00:15, Taylor Blau wrote:
> On Sun, Jul 10, 2022 at 04:01:11PM +0100, Philip Oakley wrote:
>>>> Not sure if this is new in this extension, but should there be a link or
>>>> two to the basics of XOR compression and some of the bitmap look up
>>>> techniques?
>>>>
>>>> It's not always obvious if these techniques are 'heuristic' and only
>>>> have partial commit data, or they have all the commits listed, Nor
>>>> how/why they work. My point is more about giving new readers a hand-up
>>>> in their understanding, rather than simple implementation details for
>>>> those who already know what is going on. For example, are there any
>>>> external articles that you found helpful in getting started that could
>>>> be referenced somewhere in the docs?
>>> As this series is only about adding a lookup-table extension (and not
>>> about bitmap itself), I am not sure whether it's good to include those
>>> things in this series.
>> Thanks for the clarification. I must have slight misread some of the
>> discussions and falsely thought it was the XOR compression (which is a
>> technique I wasn't really aware of), that was being provided by the
>> extension - Where would it be best for me to look up the background to
>> your "extension" project?
> Yeah, Abhradeep is right that the XOR compression isn't new, we already
> serialize bitmaps with optional XOR offsets. The gist is that we give an
> offset of some previous bitmap that is used to compress the current one
> by XORing the bits in the current bitamp with the previous one. These
> XOR-compressed bitmaps are often sparse, so they compress well and
> reduce the overall size of the .bitmap.

I was thinking of a short paragraph that covers the broader 'why'
aspects, rather than the what/how. For me, XOR is a 'new' compression
method that (IIUC) takes advantage of certain features of the way the
data is arranged, such that the XOR has lots of leading zeros, leading
to the compression mentioned.

I think it's that we sort on oid name, so that despite the oid being
long, we have (typically) sufficient oids that the leading XOR bits of
adjacent pairs allows effective compression. But I could have guessed
wildly wrong.

I'd been looking at
https://www.timescale.com/blog/time-series-compression-algorithms-explained/
which gave an overview for sorted floats.
I'd not had time to review the paper "Gorilla: A Fast, Scalable,
In-Memory Time Series Database"
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
>
> A slightly more detailed overview can be found in
> Documentation/technical/bitmap-format.txt under the bullet point reading
> "1-byte XOR-offset".
>
A separate point is the linkage (or not) between the older reachability
bit maps, and the commit graph, which sound to be independent options
and features, yet appear rather interrelated.

Thanks

Philip
Abhradeep Chakraborty July 15, 2022, 6:48 p.m. UTC | #6
On Sun, Jul 10, 2022 at 8:31 PM Philip Oakley <philipoakley@iee.email> wrote:
>
> On 09/07/2022 08:53, Abhradeep Chakraborty wrote:
> > Hello Philip,
> >
> > Philip Oakley <philipoakley@iee.email> wrote:
> >
> >> Not sure if this is new in this extension, but should there be a link or
> >> two to the basics of XOR compression and some of the bitmap look up
> >> techniques?
> >>
> >> It's not always obvious if these techniques are 'heuristic' and only
> >> have partial commit data, or they have all the commits listed, Nor
> >> how/why they work. My point is more about giving new readers a hand-up
> >> in their understanding, rather than simple implementation details for
> >> those who already know what is going on. For example, are there any
> >> external articles that you found helpful in getting started that could
> >> be referenced somewhere in the docs?
> > As this series is only about adding a lookup-table extension (and not
> > about bitmap itself), I am not sure whether it's good to include those
> > things in this series.
>
> Thanks for the clarification. I must have slight misread some of the
> discussions and falsely thought it was the XOR compression (which is a
> technique I wasn't really aware of), that was being provided by the
> extension - Where would it be best for me to look up the background to
> your "extension" project?

Sorry that I missed this message. I got the information related to
this project from the gsoc project ideas[1] page, additionally you can
see the comments[2].

[1] https://git.github.io/SoC-2022-Ideas/
[2] https://lore.kernel.org/git/YNovuzAsaEb2uIaa@nand.local/

> > (As the name
> > suggests, this file should only contain format details of bitmaps).
> > That file would provide the answers of "Why bitmaps", "how they are
> > stored",  "How they are fetched", "how they work with pack-objects,
> > git-fetch, midx etc.", "Detailed explanation of each bitmap extension"
> > , and lastly "what are the future works" (if any).
>
> One thing I've realised on reflection is that I'm unclear how the
> 'reachability bitmaps' and the 'commit-graph file' techniques relate to
> each other (and to the ODB DAG), and what features they pick out within
> their heuristic, explained at just enough level to allow folks to
> appreciate what the options that select them will do for their use case.

I am not familiar with 'commit-graph file', so I can't tell you about
that. But for bitmaps, you can look at the introductory patches[3].
After that, if you wish, you can also inspect the code related to
bitmaps.

[3] https://github.com/gitster/git/commit/e127310

> > What do you think?
>
> I'd be happy to collate contributions, suggestions and thoughts.
>
> Trying to create these good introductory descriptions can be really
> difficult, as you can only step into the same river once (the 'reading
> for the first time problem' of not being able to un-hear the
> explanations of others when reading a 2nd draft...)

I agree ;-)

Thanks :)
diff mbox series

Patch

diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
index 04b3ec21785..c30dc177643 100644
--- a/Documentation/technical/bitmap-format.txt
+++ b/Documentation/technical/bitmap-format.txt
@@ -67,6 +67,17 @@  MIDXs, both the bit-cache and rev-cache extensions are required.
 			pack/MIDX. The format and meaning of the name-hash is
 			described below.
 
+			** {empty}
+			BITMAP_OPT_LOOKUP_TABLE (0x10): :::
+			If present, the end of the bitmap file contains a table
+			containing a list of `N` <commit_pos, offset, xor_row>
+			triplets. The format and meaning of the table is described
+			below.
++
+NOTE: Unlike the xor_offset used to compress an individual bitmap,
+`xor_row` stores an *absolute* index into the lookup table, not a location
+relative to the current entry.
+
 		4-byte entry count (network byte order)
 
 			The total count of entries (bitmapped commits) in this bitmap index.
@@ -205,3 +216,31 @@  Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
 If implementations want to choose a different hashing scheme, they are
 free to do so, but MUST allocate a new header flag (because comparing
 hashes made under two different schemes would be pointless).
+
+Commit lookup table
+-------------------
+
+If the BITMAP_OPT_LOOKUP_TABLE flag is set, the last `N * (4 + 8 + 4)`
+bytes (preceding the name-hash cache and trailing hash) of the `.bitmap`
+file contains a lookup table specifying the information needed to get
+the desired bitmap from the entries without parsing previous unnecessary
+bitmaps.
+
+For a `.bitmap` containing `nr_entries` reachability bitmaps, the table
+contains a list of `nr_entries` <commit_pos, offset, xor_row> triplets
+(sorted in the ascending order of `commit_pos`). The content of i'th
+triplet is -
+
+	* {empty}
+	commit_pos (4 byte integer, network byte order): ::
+	It stores the object position of a commit (in the midx or pack
+	index).
+
+	* {empty}
+	offset (8 byte integer, network byte order): ::
+	The offset from which that commit's bitmap can be read.
+
+	* {empty}
+	xor_row (4 byte integer, network byte order): ::
+	The position of the triplet whose bitmap is used to compress
+	this one, or `0xffffffff` if no such bitmap exists.