[2/6] pack-bitmap: prepare to read lookup table extension

Message ID d139a4c48aa058b142c4860721b10344c5498031.1655728395.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series bitmap: integrate a lookup table extension to the bitmap format

Commit Message

Abhradeep Chakraborty June 20, 2022, 12:33 p.m. UTC
From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>

Bitmap lookup table extension can let git to parse only the necessary
bitmaps without loading the previous bitmaps one by one.

Teach git to read and use the bitmap lookup table extension.

Co-Authored-by: Taylor Blau <ttaylorr@github.com>
Mentored-by: Taylor Blau <ttaylorr@github.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
---
 pack-bitmap.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++--
 pack-bitmap.h |   1 +
 2 files changed, 166 insertions(+), 7 deletions(-)

Comments

Derrick Stolee June 20, 2022, 8:49 p.m. UTC | #1
On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> 
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.

Here is an attempt to reword this a bit:

  The bitmap lookup table extension was documented by an earlier
  change, but Git does not yet know how to parse that information.
  The extension allows parsing a smaller portion of the bitmap
  file in order to find bitmaps for specific commits.

 
> Teach git to read and use the bitmap lookup table extension.

Normally, I don't mind doing the read portion after the write
portion, but it would be nice to have them in the opposite
order so we can test writing the extension _and Git ignoring
the extension_ before implementing the parsing. As it stands,
most of the code in this patch is untested until patch 5.

General outline attempt:

1. Document the format.
2. Write the extension if the flag is given.
3. Add pack.writeBitmapLookupTable and add tests that write
   the lookup table (and do other bitmap reads on that data).
4. Read the lookup table. The tests from step 3 already cover
   acting upon the lookup table. (Perhaps we add a mode here
   that disables GIT_READ_COMMIT_TABLE since that is not used
   anywhere else.)
5. Performance tests.

> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

This environment variable does not appear to be used or
documented anywhere. Do we really want to use it as a way
to disable reading the lookup table in general? Or would it be
better to have a GIT_TEST_* variable for disabling the read
during testing?

> +			uint32_t entry_count = ntohl(header->entry_count);
> +			uint32_t table_size =
> +				(entry_count * the_hash_algo->rawsz) /* oids */ +
> +				(entry_count * sizeof(uint32_t)) /* offsets */ +
> +				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +				(sizeof(uint32_t)) /* flags */;

Here, uint32_t is probably fine, but maybe we should just use
size_t instead? Should we use st_mult() and st_add() everywhere?

Note: you're using the_hash_algo->rawsz here, which makes sense
because the bitmap format doesn't specify which hash algorithm is
used. Just making this note to say that we should include the hash
algorithm as a value in the bitmap format when we increment the
format version (in the future).

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +			index->table_lookup = (void *)(index_end - table_size);
> +			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;

st_mult(), st_add()? Or, should we assume safety now?

> +			index_end -= table_size;
> +		}

> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

Ok, don't load these entries pre-emptively if we have the lookup table.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint);

I see that we have a two-method recursion loop. Please move this
declaration to immediately before lazy_bitmap_for_commit() so it
is declared as late as possible.

> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						  uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}

I would call this "bitmap_hash_pos()" because we are getting a raw
hash and not a 'struct object_id'. Do you want a helper that fills
a 'struct object_id', perhaps passed-by-reference?
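
A helper along those lines might look something like this (just a sketch;
the name is made up, not something from the patch):

    static inline void bitmap_lookup_table_read_oid(struct bitmap_index *bitmap_git,
                                                    uint32_t pos,
                                                    struct object_id *oid)
    {
            /* copy the raw hash stored at 'pos' into a full object_id */
            oidread(oid, bitmap_git->table_lookup + st_mult(pos, the_hash_algo->rawsz));
    }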

> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)

> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)

These two helpers should probably return a size_t and uint32_t
instead of a pointer. Let these do get_be[32|64]() on the computed
pointer.
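
Something like this, perhaps (a sketch only, keeping the patch's names and
the 4-byte on-disk widths):

    static inline size_t bitmap_offset_pos(struct bitmap_index *bitmap_git,
                                           uint32_t pos)
    {
            /* each table row is a 4-byte offset followed by a 4-byte xor position */
            return get_be32(bitmap_git->table_offsets + st_mult(pos, 2 * sizeof(uint32_t)));
    }

    static inline uint32_t xor_position_pos(struct bitmap_index *bitmap_git,
                                            uint32_t pos)
    {
            return get_be32(bitmap_git->table_offsets +
                            st_add(st_mult(pos, 2 * sizeof(uint32_t)), sizeof(uint32_t)));
    }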

> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
> +	return !!found;

Ok, we are running binary search and converting the pointer into a
position.

Frequently, these kind of searches return an int, but use a negative
value to indicate that the value was not found. Using an int in this
way would restrict us to 2^31 bitmaps instead of 2^32, so maybe it
is not worth matching that practice.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));

These lines become simpler with a change in the helper methods'
prototypes, as I recommended higher up.

> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
> +
> +		xor_commit = lookup_commit(the_repository, &xor_oid);
> +		if (!xor_commit)
> +			return NULL;
> +
> +		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
> +						      &xor_pos);
> +	}

This is using an interesting type of tail-recursion. We might be
better off using a loop with a stack: push to the stack the commit
positions of the XOR bitmaps. At the very bottom, we get a bitmap
without an XOR base. Then, pop off the stack, modifying the bitmap
with XOR operations as we go. (Perhaps we also store these bitmaps
in-memory along the way?) Finally, we have the necessary bitmap.

This iterative approach avoids possible stack exhaustion if there
are long XOR chains in the file.
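
Very roughly, something like this (illustrative only; load_bitmap_at() is a
hypothetical helper that reads the flags and ewah data at one table position
and stores it with the given base, like the tail of lazy_bitmap_for_commit(),
and a real version would also stop walking when it reaches an already-stored
bitmap):

    static struct stored_bitmap *load_bitmap_chain(struct bitmap_index *bitmap_git,
                                                   uint32_t commit_pos)
    {
            struct stored_bitmap *sb = NULL;
            uint32_t *stack = NULL;
            size_t nr = 0, alloc = 0;
            uint32_t pos = commit_pos;

            /* push the chain of xor positions until we hit a bitmap with no base */
            while (pos != 0xffffffff) {
                    ALLOC_GROW(stack, nr + 1, alloc);
                    stack[nr++] = pos;
                    pos = get_be32(xor_position_pos(bitmap_git, pos));
            }

            /* pop back down, resolving and storing each bitmap on top of its base */
            while (nr) {
                    sb = load_bitmap_at(bitmap_git, stack[--nr], sb);
                    if (!sb)
                            break;
            }

            free(stack);
            return sb;
    }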

> +
> +	/*
> +	 * Don't bother reading the commit's index position or its xor
> +	 * offset:
> +	 *
> +	 *   - The commit's index position is irrelevant to us, since
> +	 *     load_bitmap_entries_v1 only uses it to learn the object
> +	 *     id which is used to compute the hashmap's key. We already
> +	 *     have an object id, so no need to look it up again.
> +	 *
> +	 *   - The xor_offset is unusable for us, since it specifies how
> +	 *     many entries previous to ours we should look at. This
> +	 *     makes sense when reading the bitmaps sequentially (as in
> +	 *     load_bitmap_entries_v1()), since we can keep track of
> +	 *     each bitmap as we read them.
> +	 *
> +	 *     But it can't work for us, since the bitmap's don't have a
> +	 *     fixed size. So we learn the position of the xor'd bitmap
> +	 *     from the commit table (and resolve it to a bitmap in the
> +	 *     above if-statement).
> +	 *
> +	 * Instead, we can skip ahead and immediately read the flags and
> +	 * ewah bitmap.
> +	 */
> +	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);
> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);

Looks like we'd want to call store_bitmap() while popping the stack
in the loop I recommended above.

> +}
> +
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;
> +
> +		/* NEEDSWORK: cache misses aren't recorded. */
> +		if (pos_hint)
> +			commit_pos = *pos_hint;
> +		else if (!bitmap_table_lookup(bitmap_git,
> +					      &commit->object.oid,
> +					      &commit_pos))
> +			return NULL;
> +		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
> +					      commit_pos);

The extra bonus of going incremental is that we don't have recursion
across two methods, which I always find difficult to reason about.

> +	}
> +	return kh_value(bitmap_git->bitmaps, hash_pos);
> +}

> @@ -26,6 +26,7 @@ struct bitmap_disk_header {
>  enum pack_bitmap_opts {
>  	BITMAP_OPT_FULL_DAG = 1,
>  	BITMAP_OPT_HASH_CACHE = 4,
> +	BITMAP_OPT_LOOKUP_TABLE = 16,

Perhaps it is time to use hexadecimal representation here to match the
file format document?
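
I.e. something like:

    enum pack_bitmap_opts {
            BITMAP_OPT_FULL_DAG = 0x1,
            BITMAP_OPT_HASH_CACHE = 0x4,
            BITMAP_OPT_LOOKUP_TABLE = 0x10,
    };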

Thanks,
-Stolee
Taylor Blau June 20, 2022, 10:06 p.m. UTC | #2
On Mon, Jun 20, 2022 at 12:33:10PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
>
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.
>
> Teach git to read and use the bitmap lookup table extension.
>
> Co-Authored-by: Taylor Blau <ttaylorr@github.com>
> Mentored-by: Taylor Blau <ttaylorr@github.com>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
> ---
>  pack-bitmap.c | 172 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  pack-bitmap.h |   1 +
>  2 files changed, 166 insertions(+), 7 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 36134222d7a..d5e5973a79f 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -15,6 +15,7 @@
>  #include "list-objects-filter-options.h"
>  #include "midx.h"
>  #include "config.h"
> +#include "hash-lookup.h"
>
>  /*
>   * An entry on the bitmap index, representing the bitmap for a given
> @@ -82,6 +83,13 @@ struct bitmap_index {
>  	/* The checksum of the packfile or MIDX; points into map. */
>  	const unsigned char *checksum;
>
> +	/*
> +	 * If not NULL, these point into the various commit table sections
> +	 * (within map).
> +	 */
> +	unsigned char *table_lookup;
> +	unsigned char *table_offsets;
> +

If table_offsets ends up being a list of just offsets, we could assign
this to the appropriate type, e.g., 'uint64_t *'. We would want to
avoid using a type whose width is platform dependent, like off_t.

But if you end up taking my suggestion from a previous response (of
making each entry in the offset table a triple of commit, offset, and
xor position), make sure to _not_ get tempted to define a struct and
assign table_lookup to be a pointer of that structure type.

That's because even though the struct *should* be packed as you expect,
the packing is mostly up to the compiler, so you can't guarantee its
members won't have padding between them or at the end of the struct for
alignment purposes.
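
For example (illustrative only; this struct is not in the patch), one might
be tempted to write:

    struct lookup_table_entry {
            unsigned char oid[GIT_MAX_RAWSZ];
            uint32_t offset;
            uint32_t xor_pos;
    };

but sizeof(struct lookup_table_entry) need not equal GIT_MAX_RAWSZ +
2 * sizeof(uint32_t) (and the on-disk rows use the_hash_algo->rawsz anyway),
so casting the mmap'd table to such a pointer could silently misread it.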

>  	/*
>  	 * Extended index.
>  	 *
> @@ -185,6 +193,24 @@ static int load_bitmap_header(struct bitmap_index *index)
>  			index->hashes = (void *)(index_end - cache_size);
>  			index_end -= cache_size;
>  		}
> +
> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

What is the purpose of the GIT_READ_COMMIT_TABLE environment variable? I
assume that it's to make it easier to run tests (especially performance
ones) with and without access to the lookup table. If so, we should
document that (lightly) in the commit message, and rename this to be
GIT_TEST_READ_COMMIT_TABLE to indicate that it shouldn't be used outside
of tests.
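
I.e., assuming we go with that name, just:

    if (flags & BITMAP_OPT_LOOKUP_TABLE &&
        git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {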

> +			uint32_t entry_count = ntohl(header->entry_count);
> +			uint32_t table_size =
> +				(entry_count * the_hash_algo->rawsz) /* oids */ +
> +				(entry_count * sizeof(uint32_t)) /* offsets */ +
> +				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +				(sizeof(uint32_t)) /* flags */;

entry_count is definitely a 4-byte integer, so uint32_t is the right
type. But I think table_size should be a size_t, and computations on it
should be more strictly checked. Perhaps something like:

    size_t table_size = sizeof(uint32_t); /* flags */
    table_size = st_add(table_size, st_mult(entry_count, the_hash_algo->rawsz)); /* oids */
    table_size = st_add(table_size, st_mult(entry_count, sizeof(uint32_t))); /* offsets */
    table_size = st_add(table_size, st_mult(entry_count, sizeof(uint32_t))); /* xor offsets */

or even:

    size_t table_size = sizeof(uint32_t); /* flags */
    table_size = st_add(table_size,
                        st_mult(entry_count,
                                the_hash_algo->rawsz + /* oids */
                                sizeof(uint32_t) + /* offsets*/
                                sizeof(uint32_t) /* xor offsets */
                               ));

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +			index->table_lookup = (void *)(index_end - table_size);
> +			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;
> +
> +			index_end -= table_size;
> +		}

Looks good.

> @@ -470,7 +496,7 @@ static int load_bitmap(struct bitmap_index *bitmap_git)
>  		!(bitmap_git->tags = read_bitmap_1(bitmap_git)))
>  		goto failed;
>
> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

No need to load each of the bitmaps individually via
load_bitmap_entries_v1() if we have a lookup table. That function
doesn't do any other initialization that we depend on, so it's OK to
just avoid calling it altogether.

>  	return 0;
> @@ -557,14 +583,145 @@ struct include_data {
>  	struct bitmap *seen;
>  };
>
> -struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
> -				      struct commit *commit)
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint);
> +
> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						  uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}
> +
> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)
> +{
> +	return bitmap_git->table_offsets + (pos * 2 * sizeof(uint32_t));
> +}
> +
> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)
> +{
> +	return (unsigned char*) bitmap_offset_pos(bitmap_git, pos) + sizeof(uint32_t);
> +}
> +
> +static int bitmap_lookup_cmp(const void *_va, const void *_vb)
> +{
> +	return hashcmp(_va, _vb);
> +}

All makes sense. Some light documentation might help explain what this
comparator function is used for (the bsearch() call below in
bitmap_table_lookup()), although I suspect that this function will get
slightly more complicated if you pack the table contents as I suggest,
in which case more documentation will definitely help.
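
E.g. a comment along these lines (wording is only a suggestion):

    /*
     * Compare a raw hash (the bsearch() key in bitmap_table_lookup()) against
     * one entry of the lookup table, which stores commit OIDs sorted in
     * lexicographic (hashcmp) order.
     */
    static int bitmap_lookup_cmp(const void *_va, const void *_vb)
    {
            return hashcmp(_va, _vb);
    }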

> +
> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;

If you end up changing the type of bitmap_git->table_lookup, make sure
that you scale the result of the pointer arithmetic accordingly, or cast
down to an 'unsigned char *' before you do any math.

> +	return !!found;
> +}
> +
> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));
> +
> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));

Interesting; this is a point that I forgot about from the original
patch. xor_pos is an index (not an offset) into the list of commits in
the table of contents, in the order they appear in that table. We should be
clear about (a) what that order is, and (b) that xor_pos is an index
into that order.

The rest of this function looks good to me.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;

I was going to suggest moving this check into the caller
bitmap_for_commit() and making it a BUG() to call
stored_bitmap_for_commit() with a NULL bitmap_git->table_lookup pointer.

And I think this makes sense... if we return NULL here, then we know
that we definitely don't have a stored bitmap, since there's no table to
look it up in and we have already loaded everything else. So we
propagate that NULL to the return value of bitmap_for_commit(), and that
makes sense. Good.

> +		/* NEEDSWORK: cache misses aren't recorded. */

Yeah. The problem here is that we can't record every commit that
_doesn't_ have a bitmap every time we return NULL from one of these
queries, since there are arbitrarily many such commits that don't have
bitmaps.

We could approximate it using a Bloom filter or something, and much of
that code is already written and could be interesting to try and reuse.

But I wonder if we could get by with something simpler, though, which
would cause us to load all bitmaps from the lookup table after a fixed
number of cache misses (at which point we should force ourselves to load
everything and just read everything out of a single O(1) lookup in the
stored bitmap table).

That may or may not be a good idea, and the threshold will probably be
highly dependent on the system. So it may not even be worth it, but I
think it's an interesting area to experiment in and think a little more
about.
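
For instance (very hand-wavy; neither lookup_table_misses nor
load_all_lookup_table_entries() exists in this patch), the miss path in
stored_bitmap_for_commit() could do something like:

    if (++bitmap_git->lookup_table_misses > LOOKUP_TABLE_MISS_THRESHOLD) {
            /*
             * Too many misses: fault in every bitmap named in the lookup
             * table once, then drop the table so that future lookups are a
             * single hashmap query.
             */
            load_all_lookup_table_entries(bitmap_git);
            bitmap_git->table_lookup = NULL;
    }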

> +		if (pos_hint)
> +			commit_pos = *pos_hint;

How does this commit_pos work again? I confess I have forgotten since I
wrote some of this code a while ago... :-).

> @@ -1699,8 +1856,9 @@ void test_bitmap_walk(struct rev_info *revs)
>  	if (revs->pending.nr != 1)
>  		die("you must specify exactly one commit to test");
>
> -	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> -		bitmap_git->version, bitmap_git->entry_count);
> +	if (!bitmap_git->table_lookup)
> +		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
> +			bitmap_git->version, bitmap_git->entry_count);

Should we print this regardless of whether or not there is a lookup
table? We should be able to learn the entry count either way.

Thanks,
Taylor
Abhradeep Chakraborty June 21, 2022, 10:28 a.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> wrote:

> Here is an attempt to reword this a bit:
>
>   The bitmap lookup table extension was documented by an earlier
>   change, but Git does not yet know how to parse that information.
>   The extension allows parsing a smaller portion of the bitmap
>   file in order to find bitmaps for specific commits.

Got it. Thanks.

> This environment variable does not appear to be used or
> documented anywhere. Do we really want to use it as a way
> to disable reading the lookup table in general? Or would it be
> better to have a GIT_TEST_* variable for disabling the read
> during testing?

GIT_TEST_* is perfect. This was mainly for testing purposes.

> Here, uint32_t is probably fine, but maybe we should just use
> size_t instead? Should we use st_mult() and st_add() everywhere?

Yeah, it would be better to use st_*().

> I see that we have a two-method recursion loop. Please move this
> declaration to immediately before lazy_bitmap_for_commit() so it
> is declared as late as possible.

Ok.

> These two helpers should probably return a size_t and uint32_t
> instead of a pointer. Let these do get_be[32|64]() on the computed
> pointer.

Ok.

> This is using an interesting type of tail-recursion. We might be
> better off using a loop with a stack: push to the stack the commit
> positions of the XOR bitmaps. At the very bottom, we get a bitmap
> without an XOR base. Then, pop off the stack, modifying the bitmap
> with XOR operations as we go. (Perhaps we also store these bitmaps
> in-memory along the way?) Finally, we have the necessary bitmap.

Hmm, got the point. I need to fix the xor-offset related issue first
(that I mentioned earlier) before doing this.

> Perhaps it is time to use hexadecimal representation here to match the
> file format document?

Yeah, of course!

Thanks :)
Abhradeep Chakraborty June 21, 2022, 11:52 a.m. UTC | #4
Taylor Blau <me@ttaylorr.com> wrote:

> What is the purpose of the GIT_READ_COMMIT_TABLE environment variable? I
> assume that it's to make it easier to run tests (especially performance
> ones) with and without access to the lookup table. If so, we should
> document that (lightly) in the commit message, and rename this to be
> GIT_TEST_READ_COMMIT_TABLE to indicate that it shouldn't be used outside
> of tests.

This is mainly for testing; GIT_TEST_READ_COMMIT_TABLE is perfect.


> All makes sense. Some light documentation might help explain what this
> comparator function is used for (the bsearch() call below in
> bitmap_table_lookup()), although I suspect that this function will get
> slightly more complicated if you pack the table contents as I suggest,
> in which case more documentation will definitely help.

Ok.

> Interesting; this is a point that I forgot about from the original
> patch. xor_pos is an index (not an offset) into the list of commits in
> the table of contents, in the order they appear in that table. We should be
> clear about (a) what that order is, and (b) that xor_pos is an index
> into that order.

This is exactly what I said in my first reply; I made a mistake here.
As xor_pos is relative to the current bitmap, it depends on the bitmap
entry order. These two orders are not the same: one is ordered by date,
the other is ordered lexicographically. I will fix it.

> Yeah. The problem here is that we can't record every commit that
> _doesn't_ have a bitmap every time we return NULL from one of these
> queries, since there are arbitrarily many such commits that don't have
> bitmaps.
>
> We could approximate it using a Bloom filter or something, and much of
> that code is already written and could be interesting to try and reuse.
> But I wonder if we could get by with something simpler, though, which
> would cause us to load all bitmaps from the lookup table after a fixed
> number of cache misses (at which point we should force ourselves to load
> everything and just read everything out of a single O(1) lookup in the
> stored bitmap table).
>
> That may or may not be a good idea, and the threshold will probably be
> highly dependent on the system. So it may not even be worth it, but I
> think it's an interesting area to experiment in and think a little more
> about.

Now I got the point. I wonder: what if we leave it as it is? How much would
it affect the code?

> How does this commit_pos work again? I confess I have forgotten since I
> wrote some of this code a while ago... :-).

It is using a recursive strategy. The first call to `stored_bitmap_for_commit`
does not have a `pos_hint`, so it uses `bitmap_table_lookup` to find the
commit's position in the list and then calls `lazy_bitmap_for_commit`. That
function reads the offset and xor-offset for the commit id's position in the
list. If an xor-offset exists, it uses it to get the xor-bitmap by calling
`stored_bitmap_for_commit` again, this time with `pos_hint` set to the
xor-offset. This goes on until the last non-xor bitmap has been found.

As I said before, xor-offset should be an absolute value to make it work
correctly.

> Should we print this regardless of whether or not there is a lookup
> table? We should be able to learn the entry count either way.

No, the check is necessary. "Bitmap v1 test (%d entries loaded)" means
all the bitmap entries have been loaded. It is basically for the
`load_bitmap_entries_v1` function, which loads all the bitmaps
one by one. But if there is a lookup table, `prepare_bitmap_git`
will not load every entry, and thus printing the above
line would be wrong.

Thanks :)
Taylor Blau June 22, 2022, 4:49 p.m. UTC | #5
On Tue, Jun 21, 2022 at 05:22:12PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@ttaylorr.com> wrote:
> > Yeah. The problem here is that we can't record every commit that
> > _doesn't_ have a bitmap every time we return NULL from one of these
> > queries, since there are arbitrarily many such commits that don't have
> > bitmaps.
> >
> > We could approximate it using a Bloom filter or something, and much of
> > that code is already written and could be interesting to try and reuse.
> > But I wonder if we could get by with something simpler, though, which
> > would cause us to load all bitmaps from the lookup table after a fixed
> > number of cache misses (at which point we should force ourselves to load
> > everything and just read everything out of a single O(1) lookup in the
> > stored bitmap table).
> >
> > That may or may not be a good idea, and the threshold will probably be
> > highly dependent on the system. So it may not even be worth it, but I
> > think it's an interesting area to experiment in and think a little more
> > about.
>
> Now I got the point. I wonder: what if we leave it as it is? How much would
> it affect the code?

I'm not sure, and I think that it depends a lot on the repository and
query that we're running.

I'd imagine that the effect is probably measurable, but small. Each hash
lookup is cheap, but if there are many such lookups (a large proportion
of which end up resulting in "no, we haven't loaded this bitmap yet" and
then "...because no such bitmap exists for that commit") at some point
it is worth it to fault all of the commits that _do_ have bitmaps in and
answer authoritatively.

In other words, right now we have to do two queries when a commit
doesn't have a bitmap stored:

  - first, a lookup to see whether we have already loaded a bitmap for
    that commit

  - then, a subsequent lookup to see whether the .bitmap file itself has
    a bitmap for that commit, but we just haven't loaded it yet

If we knew that we had loaded all of the bitmaps in the file, then we
could simplify the above two queries into one, since whatever the first
one returns is enough to know whether or not a bitmap exists at all.

> > How does this commit_pos work again? I confess I have forgotten since I
> > wrote some of this code a while ago... :-).
>
> It is using a recursive strategy. The first call to `stored_bitmap_for_commit`
> does not have a `pos_hint`, so it uses `bitmap_table_lookup` to find the
> commit's position in the list and then calls `lazy_bitmap_for_commit`. That
> function reads the offset and xor-offset for the commit id's position in the
> list. If an xor-offset exists, it uses it to get the xor-bitmap by calling
> `stored_bitmap_for_commit` again, this time with `pos_hint` set to the
> xor-offset. This goes on until the last non-xor bitmap has been found.

Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
convenient way to work some of this into a short comment to help other
readers in the future, too.

> As I said before, xor-offset should be an absolute value to make it work
> correctly.

Yep, makes sense.

> > Should we print this regardless of whether or not there is a lookup
> > table? We should be able to learn the entry count either way.
>
> No, the check is necessary. "Bitmap v1 test (%d entries loaded)" means
> all the bitmap entries have been loaded. It is basically for the
> `load_bitmap_entries_v1` function, which loads all the bitmaps
> one by one. But if there is a lookup table, `prepare_bitmap_git`
> will not load every entry, and thus printing the above
> line would be wrong.

Right, that part makes sense to me. But I wonder if we should still
print something, perhaps just "Bitmap v1 test" or "Bitmap v1 test (%d
entries)" omitting the "loaded" part.

Thanks,
Taylor
Abhradeep Chakraborty June 22, 2022, 5:18 p.m. UTC | #6
Taylor Blau <me@ttaylorr.com> wrote:

> In other words, right now we have to do two queries when a commit
> doesn't have a bitmap stored:
>
>   - first, a lookup to see whether we have already loaded a bitmap for
>     that commit
>
>   - then, a subsequent lookup to see whether the .bitmap file itself has
>     a bitmap for that commit, but we just haven't loaded it yet
>
> If we knew that we had loaded all of the bitmaps in the file, then we
> could simplify the above two queries into one, since whatever the first
> one returns is enough to know whether or not a bitmap exists at all.

Hmm, agreed.

> Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
> convenient way to work some of this into a short comment to help other
> readers in the future, too.

Actually, Derrick has suggested going with an iterative approach[1] instead of
the recursive approach. What's your view on it?

> Right, that part makes sense to me. But I wonder if we should still
> print something, perhaps just "Bitmap v1 test" or "Bitmap v1 test (%d
> entries)" omitting the "loaded" part.

Yeah, of course we can!

Thanks :)

[1] https://lore.kernel.org/git/92dc6860-ff35-0989-5114-fe1e220ca10c@github.com/
Taylor Blau June 22, 2022, 9:34 p.m. UTC | #7
On Wed, Jun 22, 2022 at 10:48:14PM +0530, Abhradeep Chakraborty wrote:
> > Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
> > convenient way to work some of this into a short comment to help other
> > readers in the future, too.
>
> Actually, Derrick has suggested going with an iterative approach[1] instead of
> the recursive approach. What's your view on it?

I don't have a strong feeling about it. In practice, we seem to top out
at ~500 bitmaps or so for large-ish repositories, so I would be
surprised to see this result in stack exhaustion even in the worst case
(every bitmap xor'd with the previous one, forming a long chain).

But it doesn't hurt to be defensive, so I think it's worth it as long as
you don't find the implementation too complex.

> [1] https://lore.kernel.org/git/92dc6860-ff35-0989-5114-fe1e220ca10c@github.com/

Thanks,
Taylor

Patch

diff --git a/pack-bitmap.c b/pack-bitmap.c
index 36134222d7a..d5e5973a79f 100644
--- a/pack-bitmap.c
+++ b/pack-bitmap.c
@@ -15,6 +15,7 @@ 
 #include "list-objects-filter-options.h"
 #include "midx.h"
 #include "config.h"
+#include "hash-lookup.h"
 
 /*
  * An entry on the bitmap index, representing the bitmap for a given
@@ -82,6 +83,13 @@  struct bitmap_index {
 	/* The checksum of the packfile or MIDX; points into map. */
 	const unsigned char *checksum;
 
+	/*
+	 * If not NULL, these point into the various commit table sections
+	 * (within map).
+	 */
+	unsigned char *table_lookup;
+	unsigned char *table_offsets;
+
 	/*
 	 * Extended index.
 	 *
@@ -185,6 +193,24 @@  static int load_bitmap_header(struct bitmap_index *index)
 			index->hashes = (void *)(index_end - cache_size);
 			index_end -= cache_size;
 		}
+
+		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
+		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {
+			uint32_t entry_count = ntohl(header->entry_count);
+			uint32_t table_size =
+				(entry_count * the_hash_algo->rawsz) /* oids */ +
+				(entry_count * sizeof(uint32_t)) /* offsets */ +
+				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
+				(sizeof(uint32_t)) /* flags */;
+
+			if (table_size > index_end - index->map - header_size)
+				return error("corrupted bitmap index file (too short to fit commit table)");
+
+			index->table_lookup = (void *)(index_end - table_size);
+			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;
+
+			index_end -= table_size;
+		}
 	}
 
 	index->entry_count = ntohl(header->entry_count);
@@ -470,7 +496,7 @@  static int load_bitmap(struct bitmap_index *bitmap_git)
 		!(bitmap_git->tags = read_bitmap_1(bitmap_git)))
 		goto failed;
 
-	if (load_bitmap_entries_v1(bitmap_git) < 0)
+	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
 		goto failed;
 
 	return 0;
@@ -557,14 +583,145 @@  struct include_data {
 	struct bitmap *seen;
 };
 
-struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
-				      struct commit *commit)
+static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
+						      struct commit *commit,
+						      uint32_t *pos_hint);
+
+static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
+						  uint32_t pos)
+{
+	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
+}
+
+static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
+					    uint32_t pos)
+{
+	return bitmap_git->table_offsets + (pos * 2 * sizeof(uint32_t));
+}
+
+static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
+					   uint32_t pos)
+{
+	return (unsigned char*) bitmap_offset_pos(bitmap_git, pos) + sizeof(uint32_t);
+}
+
+static int bitmap_lookup_cmp(const void *_va, const void *_vb)
+{
+	return hashcmp(_va, _vb);
+}
+
+static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
+			       struct object_id *oid,
+			       uint32_t *commit_pos)
+{
+	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
+				       bitmap_git->entry_count,
+				       the_hash_algo->rawsz, bitmap_lookup_cmp);
+	if (found)
+		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
+	return !!found;
+}
+
+static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
+						    struct object_id *oid,
+						    uint32_t commit_pos)
+{
+	uint32_t xor_pos;
+	off_t bitmap_ofs;
+
+	int flags;
+	struct ewah_bitmap *bitmap;
+	struct stored_bitmap *xor_bitmap;
+
+	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
+	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));
+
+	/*
+	 * Lazily load the xor'd bitmap if required (and we haven't done so
+	 * already). Make sure to pass the xor'd bitmap's position along as a
+	 * hint to avoid an unnecessary binary search in
+	 * stored_bitmap_for_commit().
+	 */
+	if (xor_pos == 0xffffffff) {
+		xor_bitmap = NULL;
+	} else {
+		struct commit *xor_commit;
+		struct object_id xor_oid;
+
+		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
+
+		xor_commit = lookup_commit(the_repository, &xor_oid);
+		if (!xor_commit)
+			return NULL;
+
+		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
+						      &xor_pos);
+	}
+
+	/*
+	 * Don't bother reading the commit's index position or its xor
+	 * offset:
+	 *
+	 *   - The commit's index position is irrelevant to us, since
+	 *     load_bitmap_entries_v1 only uses it to learn the object
+	 *     id which is used to compute the hashmap's key. We already
+	 *     have an object id, so no need to look it up again.
+	 *
+	 *   - The xor_offset is unusable for us, since it specifies how
+	 *     many entries previous to ours we should look at. This
+	 *     makes sense when reading the bitmaps sequentially (as in
+	 *     load_bitmap_entries_v1()), since we can keep track of
+	 *     each bitmap as we read them.
+	 *
+	 *     But it can't work for us, since the bitmap's don't have a
+	 *     fixed size. So we learn the position of the xor'd bitmap
+	 *     from the commit table (and resolve it to a bitmap in the
+	 *     above if-statement).
+	 *
+	 * Instead, we can skip ahead and immediately read the flags and
+	 * ewah bitmap.
+	 */
+	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
+	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
+	bitmap = read_bitmap_1(bitmap_git);
+	if (!bitmap)
+		return NULL;
+
+	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);
+}
+
+static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
+						      struct commit *commit,
+						      uint32_t *pos_hint)
 {
 	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
 					   commit->object.oid);
-	if (hash_pos >= kh_end(bitmap_git->bitmaps))
+	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
+		uint32_t commit_pos;
+		if (!bitmap_git->table_lookup)
+			return NULL;
+
+		/* NEEDSWORK: cache misses aren't recorded. */
+		if (pos_hint)
+			commit_pos = *pos_hint;
+		else if (!bitmap_table_lookup(bitmap_git,
+					      &commit->object.oid,
+					      &commit_pos))
+			return NULL;
+		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
+					      commit_pos);
+	}
+	return kh_value(bitmap_git->bitmaps, hash_pos);
+}
+
+struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git,
+				      struct commit *commit)
+{
+	struct stored_bitmap *sb = stored_bitmap_for_commit(bitmap_git, commit,
+							    NULL);
+	if (!sb)
 		return NULL;
-	return lookup_stored_bitmap(kh_value(bitmap_git->bitmaps, hash_pos));
+	return lookup_stored_bitmap(sb);
 }
 
 static inline int bitmap_position_extended(struct bitmap_index *bitmap_git,
@@ -1699,8 +1856,9 @@  void test_bitmap_walk(struct rev_info *revs)
 	if (revs->pending.nr != 1)
 		die("you must specify exactly one commit to test");
 
-	fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
-		bitmap_git->version, bitmap_git->entry_count);
+	if (!bitmap_git->table_lookup)
+		fprintf(stderr, "Bitmap v%d test (%d entries loaded)\n",
+			bitmap_git->version, bitmap_git->entry_count);
 
 	root = revs->pending.objects[0].item;
 	bm = bitmap_for_commit(bitmap_git, (struct commit *)root);
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 3d3ddd77345..37f86787a4d 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -26,6 +26,7 @@  struct bitmap_disk_header {
 enum pack_bitmap_opts {
 	BITMAP_OPT_FULL_DAG = 1,
 	BITMAP_OPT_HASH_CACHE = 4,
+	BITMAP_OPT_LOOKUP_TABLE = 16,
 };
 
 enum pack_bitmap_flags {