diff mbox series

[1/4] pack-bitmap: write lookup table extension by default

Message ID b7cfb1267fdd7f50f414c9f79377cb338a0c1ab0.1744924321.git.me@ttaylorr.com (mailing list archive)
State New
Headers show
Series pack-bitmap: enable lookup tables by default, misc. cleanups | expand

Commit Message

Taylor Blau April 17, 2025, 9:12 p.m. UTC
The lookup table extension for reachability bitmaps was first introduced
via 3fe0121479 (Merge branch 'ac/bitmap-lookup-table', 2022-09-05).
For each bitmapped commit, the lookup table encodes three unsigned
integers:

  - The pack position of the commit.
  - An offset within the *.bitmap itself where that commit's bitmap
    lives.
  - An index into the same table where information on that commit's XOR
    pair can be found.

Lookup tables make bitmap operations faster because they no longer have
to read the entire *.bitmap file to discover what commits have
corresponding reachability bitmaps.

When bitmap lookup tables were first introduced, we established a
baseline level of performance in p5310 with and without lookup tables.
Here is the baseline without:

    Test                                                    this tree
    -----------------------------------------------------------------------
    5310.6: simulated clone                               14.04(5.78+1.79)
    5310.7: simulated fetch                               1.95(3.05+0.20)
    5310.8: pack to file (bitmap)                         44.73(20.55+7.45)
    5310.9: rev-list (commits)                            0.78(0.46+0.10)
    5310.10: rev-list (objects)                           4.07(3.97+0.08)
    5310.11: rev-list with tag negated via --not          0.06(0.02+0.03)
             --all (objects)
    5310.12: rev-list with negative tag (objects)         0.21(0.15+0.05)
    5310.13: rev-list count with blob:none                0.24(0.17+0.06)
    5310.14: rev-list count with blob:limit=1k            7.07(5.92+0.48)
    5310.15: rev-list count with tree:0                   0.25(0.17+0.07)
    5310.16: simulated partial clone                      5.67(3.28+0.64)
    5310.18: clone (partial bitmap)                       16.05(8.34+1.86)
    5310.19: pack to file (partial bitmap)                59.76(27.22+7.43)
    5310.20: rev-list with tree filter (partial bitmap)   0.90(0.18+0.16)

, and here is the same set of tests, this time with the lookup table
enabled:

    Test                                                    this tree
    -----------------------------------------------------------------------
    5310.26: simulated clone                              13.69(5.72+1.78)
    5310.27: simulated fetch                              1.84(3.02+0.16)
    5310.28: pack to file (bitmap)                        45.63(20.67+7.50)
    5310.29: rev-list (commits)                           0.56(0.39+0.8)
    5310.30: rev-list (objects)                           3.77(3.74+0.08)
    5310.31: rev-list with tag negated via --not          0.05(0.02+0.03)
             --all (objects)
    5310.32: rev-list with negative tag (objects)         0.21(0.15+0.05)
    5310.33: rev-list count with blob:none                0.23(0.17+0.05)
    5310.34: rev-list count with blob:limit=1k            6.65(5.72+0.40)
    5310.35: rev-list count with tree:0                   0.23(0.16+0.06)
    5310.36: simulated partial clone                      5.57(3.26+0.59)
    5310.38: clone (partial bitmap)                       15.89(8.39+1.84)
    5310.39: pack to file (partial bitmap)                58.32(27.55+7.47)
    5310.40: rev-list with tree filter (partial bitmap)   0.73(0.18+0.15)

(All numbers here come from a ~2022-era copy of the kernel, via
Abhradeep Chakraborty who implemented the lookup table extension).

In the almost three years since lookup tables were introduced, GitHub
has used them in production without issue, taking advantage of the above
performance benefits along the way.

Since this feature has had sufficient time to flush out any bugs and/or
performance regressions, let's enable it by default so that all bitmap
users can reap similar performance benefits.

[1]: https://lore.kernel.org/git/pull.1266.git.1655728395.gitgitgadget@gmail.com/

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/pack.adoc | 2 +-
 builtin/multi-pack-index.c     | 1 +
 builtin/pack-objects.c         | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

Comments

Junio C Hamano April 17, 2025, 10:04 p.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 2a938466f5..6ad6f814e3 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -142,6 +142,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
>  	int ret;
>  
>  	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
> +	opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
>  
>  	git_config(git_multi_pack_index_write_config, NULL);
>  
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 3973267e9e..384fefbb1d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -239,7 +239,7 @@ static enum {
>  	WRITE_BITMAP_QUIET,
>  	WRITE_BITMAP_TRUE,
>  } write_bitmap_index;
> -static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
> +static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE | BITMAP_OPT_LOOKUP_TABLE;

Are these two hunks required to be kept in sync?

If so, I am wondering what is the right approach to make sure they
are.  The definition of MIDX_WRITE_BITMAP_* flag bits in midx.h and
BITMAP_OPT_* flag bits in write_bitmap_index enum are distinct two
sets, and we need a way to somehow convert between them back and
forth if we really wanted to ensure that these "default" values are
kept in sync automatically.

The reason I ask is mostly because I do not know offhand, and I
would imagine that it would be hard for third-parties to verify, if
these are only two places that need to be updated in order for
lookup table extensions to be written by default, when somebody
new wants to further update the default.
Jeff King April 18, 2025, 9:33 a.m. UTC | #2
On Thu, Apr 17, 2025 at 03:04:00PM -0700, Junio C Hamano wrote:

> Taylor Blau <me@ttaylorr.com> writes:
> 
> > diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> > index 2a938466f5..6ad6f814e3 100644
> > --- a/builtin/multi-pack-index.c
> > +++ b/builtin/multi-pack-index.c
> > @@ -142,6 +142,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
> >  	int ret;
> >  
> >  	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
> > +	opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
> >  
> >  	git_config(git_multi_pack_index_write_config, NULL);
> >  
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 3973267e9e..384fefbb1d 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -239,7 +239,7 @@ static enum {
> >  	WRITE_BITMAP_QUIET,
> >  	WRITE_BITMAP_TRUE,
> >  } write_bitmap_index;
> > -static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
> > +static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE | BITMAP_OPT_LOOKUP_TABLE;
> 
> Are these two hunks required to be kept in sync?

They're not technically required to be in sync. It is OK for the midx
bitmaps to have different options than the ones we make for packs. And
in theory they could intentionally diverge, though in practice I don't
think we (yet) have any extensions or options that would be more
appropriate for one or the other.

So if we did want to join them, I think it would make sense to still be
able to use different flags for each situation, but initialize them from
a common definition.

> If so, I am wondering what is the right approach to make sure they
> are.  The definition of MIDX_WRITE_BITMAP_* flag bits in midx.h and
> BITMAP_OPT_* flag bits in write_bitmap_index enum are distinct two
> sets, and we need a way to somehow convert between them back and
> forth if we really wanted to ensure that these "default" values are
> kept in sync automatically.

Yeah, I think that would be much simpler if the bitmap options were held
separately from the other midx flags, and then we could use the same
values for each (and a common "defaults" definition). Which is
conceptually simple, but means we have to plumb a separate variable
through a bunch of intermediate functions. See the (untested) patch
below.

I dunno if it's worth it for something that doesn't really change all
that often. OTOH, it does remove the extra layer of translation in
write_midx_bitmap().

diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 2a938466f5..fb073a946a 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -10,6 +10,7 @@
 #include "object-store-ll.h"
 #include "replace-object.h"
 #include "repository.h"
+#include "pack-bitmap.h"
 
 #define BUILTIN_MIDX_WRITE_USAGE \
 	N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]" \
@@ -54,6 +55,7 @@ static struct opts_multi_pack_index {
 	char *refs_snapshot;
 	unsigned long batch_size;
 	unsigned flags;
+	uint16_t bitmap_options;
 	int stdin_packs;
 } opts;
 
@@ -89,16 +91,16 @@ static int git_multi_pack_index_write_config(const char *var, const char *value,
 {
 	if (!strcmp(var, "pack.writebitmaphashcache")) {
 		if (git_config_bool(var, value))
-			opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
+			opts.bitmap_options |= BITMAP_OPT_HASH_CACHE;
 		else
-			opts.flags &= ~MIDX_WRITE_BITMAP_HASH_CACHE;
+			opts.bitmap_options &= ~BITMAP_OPT_HASH_CACHE;
 	}
 
 	if (!strcmp(var, "pack.writebitmaplookuptable")) {
 		if (git_config_bool(var, value))
-			opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
+			opts.bitmap_options |= BITMAP_OPT_LOOKUP_TABLE;
 		else
-			opts.flags &= ~MIDX_WRITE_BITMAP_LOOKUP_TABLE;
+			opts.bitmap_options &= ~BITMAP_OPT_LOOKUP_TABLE;
 	}
 
 	/*
@@ -141,7 +143,7 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
 	};
 	int ret;
 
-	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
+	opts.bitmap_options |= BITMAP_OPT_HASH_CACHE;
 
 	git_config(git_multi_pack_index_write_config, NULL);
 
@@ -167,7 +169,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
 
 		ret = write_midx_file_only(repo, opts.object_dir, &packs,
 					   opts.preferred_pack,
-					   opts.refs_snapshot, opts.flags);
+					   opts.refs_snapshot, opts.flags,
+					   opts.bitmap_options);
 
 		string_list_clear(&packs, 0);
 		free(opts.refs_snapshot);
@@ -177,7 +180,8 @@ static int cmd_multi_pack_index_write(int argc, const char **argv,
 	}
 
 	ret = write_midx_file(repo, opts.object_dir, opts.preferred_pack,
-			      opts.refs_snapshot, opts.flags);
+			      opts.refs_snapshot, opts.flags,
+			      opts.bitmap_options);
 
 	free(opts.refs_snapshot);
 	return ret;
@@ -236,7 +240,8 @@ static int cmd_multi_pack_index_expire(int argc, const char **argv,
 
 	FREE_AND_NULL(options);
 
-	return expire_midx_packs(the_repository, opts.object_dir, opts.flags);
+	return expire_midx_packs(the_repository, opts.object_dir, opts.flags,
+				 opts.bitmap_options);
 }
 
 static int cmd_multi_pack_index_repack(int argc, const char **argv,
@@ -269,7 +274,8 @@ static int cmd_multi_pack_index_repack(int argc, const char **argv,
 	FREE_AND_NULL(options);
 
 	return midx_repack(the_repository, opts.object_dir,
-			   (size_t)opts.batch_size, opts.flags);
+			   (size_t)opts.batch_size, opts.flags,
+			   opts.bitmap_options);
 }
 
 int cmd_multi_pack_index(int argc,
diff --git a/builtin/repack.c b/builtin/repack.c
index f3330ade7b..69c9a8bc4c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -1562,7 +1562,7 @@ int cmd_repack(int argc,
 		if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL, 0))
 			flags |= MIDX_WRITE_INCREMENTAL;
 		write_midx_file(the_repository, repo_get_object_directory(the_repository),
-				NULL, NULL, flags);
+				NULL, NULL, flags, 0);
 	}
 
 cleanup:
diff --git a/midx-write.c b/midx-write.c
index 48a4dc5e94..5729b989cb 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -841,10 +841,10 @@ static int write_midx_bitmap(struct write_midx_context *ctx,
 			     struct packing_data *pdata,
 			     struct commit **commits,
 			     uint32_t commits_nr,
-			     unsigned flags)
+			     unsigned flags,
+			     uint16_t options)
 {
 	int ret, i;
-	uint16_t options = 0;
 	struct bitmap_writer writer;
 	struct pack_idx_entry **index;
 	struct strbuf bitmap_name = STRBUF_INIT;
@@ -859,12 +859,6 @@ static int write_midx_bitmap(struct write_midx_context *ctx,
 		get_midx_filename_ext(ctx->repo->hash_algo, &bitmap_name,
 				      object_dir, midx_hash, MIDX_EXT_BITMAP);
 
-	if (flags & MIDX_WRITE_BITMAP_HASH_CACHE)
-		options |= BITMAP_OPT_HASH_CACHE;
-
-	if (flags & MIDX_WRITE_BITMAP_LOOKUP_TABLE)
-		options |= BITMAP_OPT_LOOKUP_TABLE;
-
 	/*
 	 * Build the MIDX-order index based on pdata.objects (which is already
 	 * in MIDX order; c.f., 'midx_pack_order_cmp()' for the definition of
@@ -1070,7 +1064,8 @@ static int write_midx_internal(struct repository *r, const char *object_dir,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
 			       const char *refs_snapshot,
-			       unsigned flags)
+			       unsigned flags,
+			       uint16_t bitmap_options)
 {
 	struct strbuf midx_name = STRBUF_INIT;
 	unsigned char midx_hash[GIT_MAX_RAWSZ];
@@ -1433,7 +1428,7 @@ static int write_midx_internal(struct repository *r, const char *object_dir,
 
 		if (write_midx_bitmap(&ctx, object_dir,
 				      midx_hash, &pdata, commits, commits_nr,
-				      flags) < 0) {
+				      flags, bitmap_options) < 0) {
 			error(_("could not write multi-pack bitmap"));
 			result = 1;
 			clear_packing_data(&pdata);
@@ -1529,23 +1524,27 @@ static int write_midx_internal(struct repository *r, const char *object_dir,
 
 int write_midx_file(struct repository *r, const char *object_dir,
 		    const char *preferred_pack_name,
-		    const char *refs_snapshot, unsigned flags)
+		    const char *refs_snapshot, unsigned flags,
+		    uint16_t bitmap_options)
 {
 	return write_midx_internal(r, object_dir, NULL, NULL,
 				   preferred_pack_name, refs_snapshot,
-				   flags);
+				   flags, bitmap_options);
 }
 
 int write_midx_file_only(struct repository *r, const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
-			 const char *refs_snapshot, unsigned flags)
+			 const char *refs_snapshot, unsigned flags,
+			 uint16_t bitmap_options)
 {
 	return write_midx_internal(r, object_dir, packs_to_include, NULL,
-				   preferred_pack_name, refs_snapshot, flags);
+				   preferred_pack_name, refs_snapshot, flags,
+				   bitmap_options);
 }
 
-int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags)
+int expire_midx_packs(struct repository *r, const char *object_dir,
+		      unsigned flags, uint16_t bitmap_options)
 {
 	uint32_t i, *count, result = 0;
 	struct string_list packs_to_drop = STRING_LIST_INIT_DUP;
@@ -1603,7 +1602,8 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla
 
 	if (packs_to_drop.nr)
 		result = write_midx_internal(r, object_dir, NULL,
-					     &packs_to_drop, NULL, NULL, flags);
+					     &packs_to_drop, NULL, NULL, flags,
+					     bitmap_options);
 
 	string_list_clear(&packs_to_drop, 0);
 
@@ -1718,7 +1718,8 @@ static void fill_included_packs_batch(struct repository *r,
 	free(pack_info);
 }
 
-int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, unsigned flags)
+int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
+		unsigned flags, uint16_t bitmap_options)
 {
 	int result = 0;
 	uint32_t i, packs_to_repack = 0;
@@ -1801,7 +1802,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	}
 
 	result = write_midx_internal(r, object_dir, NULL, NULL, NULL, NULL,
-				     flags);
+				     flags, bitmap_options);
 
 cleanup:
 	free(include_pack);
diff --git a/midx.h b/midx.h
index 9d1374cbd5..5a79302a7a 100644
--- a/midx.h
+++ b/midx.h
@@ -81,8 +81,6 @@ struct multi_pack_index {
 #define MIDX_PROGRESS     (1 << 0)
 #define MIDX_WRITE_REV_INDEX (1 << 1)
 #define MIDX_WRITE_BITMAP (1 << 2)
-#define MIDX_WRITE_BITMAP_HASH_CACHE (1 << 3)
-#define MIDX_WRITE_BITMAP_LOOKUP_TABLE (1 << 4)
 #define MIDX_WRITE_INCREMENTAL (1 << 5)
 
 #define MIDX_EXT_REV "rev"
@@ -131,15 +129,16 @@ int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, i
  */
 int write_midx_file(struct repository *r, const char *object_dir,
 		    const char *preferred_pack_name, const char *refs_snapshot,
-		    unsigned flags);
+		    unsigned flags, uint16_t bitmap_options);
 int write_midx_file_only(struct repository *r, const char *object_dir,
 			 struct string_list *packs_to_include,
 			 const char *preferred_pack_name,
-			 const char *refs_snapshot, unsigned flags);
+			 const char *refs_snapshot, unsigned flags,
+			 uint16_t bitmap_options);
 void clear_midx_file(struct repository *r);
 int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags);
-int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags);
-int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, unsigned flags);
+int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags, uint16_t bitmap_options);
+int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, unsigned flags, uint16_t bitmap_options);
 
 void close_midx(struct multi_pack_index *m);
Junio C Hamano April 18, 2025, 3:44 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> They're not technically required to be in sync. It is OK for the midx
> bitmaps to have different options than the ones we make for packs. And
> in theory they could intentionally diverge, though in practice I don't
> think we (yet) have any extensions or options that would be more
> appropriate for one or the other.
>
> So if we did want to join them, I think it would make sense to still be
> able to use different flags for each situation, but initialize them from
> a common definition.

Thanks for great explanation---I guess it is not worth pursuing,
then.  It is not like it would make the system misbehave when two
are set differently.
Taylor Blau April 18, 2025, 9:52 p.m. UTC | #4
On Fri, Apr 18, 2025 at 08:44:29AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > They're not technically required to be in sync. It is OK for the midx
> > bitmaps to have different options than the ones we make for packs. And
> > in theory they could intentionally diverge, though in practice I don't
> > think we (yet) have any extensions or options that would be more
> > appropriate for one or the other.
> >
> > So if we did want to join them, I think it would make sense to still be
> > able to use different flags for each situation, but initialize them from
> > a common definition.
>
> Thanks for great explanation---I guess it is not worth pursuing,
> then.  It is not like it would make the system misbehave when two
> are set differently.

Hah, Peff beat me to it. I saw your reply last night and was going to
write you a very similar response.

I think the summary from my perspective would be that: the two could
fall out-of-sync intentionally if we want the two commands to ever have
different defaults. Tangentially we could use some common "bitmap_flags"
field whose bits are defined in pack-bitmap.h and used in both places.

The latter is a bit awkward currently because the current "flags" that
we pass into the MIDX machinery from the builtin all have MIDX-specific
meanings. So we would have to either make sure that MIDX uses separate
bit positions (which is awful and far too fragile for my comfort/taste)
or store them as a separate set of flags (I think what Peff is getting
at above).

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/config/pack.adoc b/Documentation/config/pack.adoc
index da527377fa..ba538e3d9c 100644
--- a/Documentation/config/pack.adoc
+++ b/Documentation/config/pack.adoc
@@ -191,7 +191,7 @@  pack.writeBitmapLookupTable::
 	bitmap index (if one is written). This table is used to defer
 	loading individual bitmaps as late as possible. This can be
 	beneficial in repositories that have relatively large bitmap
-	indexes. Defaults to false.
+	indexes. Defaults to true.
 
 pack.readReverseIndex::
 	When true, git will read any .rev file(s) that may be available
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 2a938466f5..6ad6f814e3 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -142,6 +142,7 @@  static int cmd_multi_pack_index_write(int argc, const char **argv,
 	int ret;
 
 	opts.flags |= MIDX_WRITE_BITMAP_HASH_CACHE;
+	opts.flags |= MIDX_WRITE_BITMAP_LOOKUP_TABLE;
 
 	git_config(git_multi_pack_index_write_config, NULL);
 
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3973267e9e..384fefbb1d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -239,7 +239,7 @@  static enum {
 	WRITE_BITMAP_QUIET,
 	WRITE_BITMAP_TRUE,
 } write_bitmap_index;
-static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE;
+static uint16_t write_bitmap_options = BITMAP_OPT_HASH_CACHE | BITMAP_OPT_LOOKUP_TABLE;
 
 static int exclude_promisor_objects;
 static int exclude_promisor_objects_best_effort;