diff mbox series

[1/8] pack-bitmap: initialize `bitmap_writer_init()` with packing_data

Message ID 9d322fc5399c453913d08f35eee907b5c909ad6b.1723743050.git.me@ttaylorr.com (mailing list archive)
State Accepted
Commit 01e9d129396e98b611bff2ae82bb3d610b28c588
Headers show
Series pseudo-merge: avoid empty and non-closed pseudo-merge commits | expand

Commit Message

Taylor Blau Aug. 15, 2024, 5:31 p.m. UTC
In order to determine its object order, the pack-bitmap machinery keeps
a 'struct packing_data' corresponding to the pack or pseudo-pack (when
writing a MIDX bitmap) being written.

The to_pack field is provided to the bitmap machinery by callers of
bitmap_writer_build() and assigned to the bitmap_writer struct at that
point.

But a subsequent commit will want to have access to that data earlier on
during commit selection. Prepare for that by adding a 'to_pack' argument
to 'bitmap_writer_init()', and initializing the field during that
function.

Subsequent commits will clean up other functions which take
now-redundant arguments (like nr_objects, which is equivalent to
pdata->objects_nr, or pdata itself).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/pack-objects.c | 2 +-
 midx-write.c           | 2 +-
 pack-bitmap-write.c    | 4 +++-
 pack-bitmap.h          | 3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

Comments

Jeff King Aug. 17, 2024, 10:31 a.m. UTC | #1
On Thu, Aug 15, 2024 at 01:31:00PM -0400, Taylor Blau wrote:

> In order to determine its object order, the pack-bitmap machinery keeps
> a 'struct packing_data' corresponding to the pack or pseudo-pack (when
> writing a MIDX bitmap) being written.
> 
> The to_pack field is provided to the bitmap machinery by callers of
> bitmap_writer_build() and assigned to the bitmap_writer struct at that
> point.
> 
> But a subsequent commit will want to have access to that data earlier on
> during commit selection. Prepare for that by adding a 'to_pack' argument
> to 'bitmap_writer_init()', and initializing the field during that
> function.
> 
> Subsequent commits will clean up other functions which take
> now-redundant arguments (like nr_objects, which is equivalent to
> pdata->objects_nr, or pdata itself).

This (and the next few follow-on commits) seem like a good change to me.
It simplifies many of the function calls, and I think it expresses the
domain logic in the API: there is a single set of objects being mapped
to bits, and many parts of the process will rely on it.

Even the midx code, which is not generating a pack, uses a "fake"
packing_data as the way to express that (because inherently the bit
ordering is all coming from the pack-index nature). If we likewise ever
wrote code to generate bitmaps from an existing pack, it would probably
use packing_data, too. :)

-Peff
Taylor Blau Aug. 29, 2024, 7 p.m. UTC | #2
On Sat, Aug 17, 2024 at 06:31:55AM -0400, Jeff King wrote:
> On Thu, Aug 15, 2024 at 01:31:00PM -0400, Taylor Blau wrote:
>
> > In order to determine its object order, the pack-bitmap machinery keeps
> > a 'struct packing_data' corresponding to the pack or pseudo-pack (when
> > writing a MIDX bitmap) being written.
> >
> > The to_pack field is provided to the bitmap machinery by callers of
> > bitmap_writer_build() and assigned to the bitmap_writer struct at that
> > point.
> >
> > But a subsequent commit will want to have access to that data earlier on
> > during commit selection. Prepare for that by adding a 'to_pack' argument
> > to 'bitmap_writer_init()', and initializing the field during that
> > function.
> >
> > Subsequent commits will clean up other functions which take
> > now-redundant arguments (like nr_objects, which is equivalent to
> > pdata->objects_nr, or pdata itself).
>
> This (and the next few follow-on commits) seem like a good change to me.
> It simplifies many of the function calls, and I think it expresses the
> domain logic in the API: there is a single set of objects being mapped
> to bits, and many parts of the process will rely on it.

Thanks. Yeah, it was a little surprising to me that it wasn't already
this way, especially having worked in this area for so long. I suspect
it grew this way organically over time (though haven't actually gone
spelunking through the history to confirm).

> Even the midx code, which is not generating a pack, uses a "fake"
> packing_data as the way to express that (because inherently the bit
> ordering is all coming from the pack-index nature). If we likewise ever
> wrote code to generate bitmaps from an existing pack, it would probably
> use packing_data, too. :)

I agree for the most part, though there is a lot of weight in
packing_data that would be nice to not have to carry around. I know
within GitHub's infrastructure we sometimes OOM kill invocations of "git
multi-pack-index write --bitmap" because of the memory overhead (a lot
of which is dominated by the actual traversal and bitmap generation, but
a lot that comes from just the per-object overhead).

I've thought about alternative structures that might be a little more
memory efficient, but it's never gotten to the top of my list.

Thanks,
Taylor
Jeff King Aug. 29, 2024, 7:36 p.m. UTC | #3
On Thu, Aug 29, 2024 at 03:00:21PM -0400, Taylor Blau wrote:

> > Even the midx code, which is not generating a pack, uses a "fake"
> > packing_data as the way to express that (because inherently the bit
> > ordering is all coming from the pack-index nature). If we likewise ever
> > wrote code to generate bitmaps from an existing pack, it would probably
> > use packing_data, too. :)
> 
> I agree for the most part, though there is a lot of weight in
> packing_data that would be nice to not have to carry around. I know
> within GitHub's infrastructure we sometimes OOM kill invocations of "git
> multi-pack-index write --bitmap" because of the memory overhead (a lot
> of which is dominated by the actual traversal and bitmap generation, but
> a lot that comes from just the per-object overhead).
> 
> I've thought about alternative structures that might be a little more
> memory efficient, but it's never gotten to the top of my list.

True. What the index and bitmap steps really want is not an array of
object_entry, but an array of pack_idx_entry (which is the first
component of an object_entry).

I wonder how feasible it would be to simply hold two arrays with
corresponding entries at each index. Many places only care about one or
the other. But for places that do care about both, especially ones that
receive a pointer to an individual object_entry, they'd need to receive
pointers to both.

I briefly looked at the compile errors that come from making such a
change. Many of them look trivial, but I think some of them get weird
(the ext_bases array is also holding object_entry structs). So maybe
worth pursuing in the long run, but not something to knock out this
afternoon.

-Peff
diff mbox series

Patch

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f395488971..0ad533c045 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1342,7 +1342,7 @@  static void write_pack_file(void)
 
 			if (write_bitmap_index) {
 				bitmap_writer_init(&bitmap_writer,
-						   the_repository);
+						   the_repository, &to_pack);
 				bitmap_writer_set_checksum(&bitmap_writer, hash);
 				bitmap_writer_build_type_index(&bitmap_writer,
 					&to_pack, written_list, nr_written);
diff --git a/midx-write.c b/midx-write.c
index a77ee73c68..62f507eb72 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -825,7 +825,7 @@  static int write_midx_bitmap(const char *midx_name,
 	for (i = 0; i < pdata->nr_objects; i++)
 		index[i] = &pdata->objects[i].idx;
 
-	bitmap_writer_init(&writer, the_repository);
+	bitmap_writer_init(&writer, the_repository, pdata);
 	bitmap_writer_show_progress(&writer, flags & MIDX_PROGRESS);
 	bitmap_writer_build_type_index(&writer, pdata, index,
 				       pdata->nr_objects);
diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c
index bf96c80898..4a7d2d1370 100644
--- a/pack-bitmap-write.c
+++ b/pack-bitmap-write.c
@@ -41,13 +41,15 @@  static inline int bitmap_writer_nr_selected_commits(struct bitmap_writer *writer
 	return writer->selected_nr - writer->pseudo_merges_nr;
 }
 
-void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r)
+void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r,
+			struct packing_data *pdata)
 {
 	memset(writer, 0, sizeof(struct bitmap_writer));
 	if (writer->bitmaps)
 		BUG("bitmap writer already initialized");
 	writer->bitmaps = kh_init_oid_map();
 	writer->pseudo_merge_commits = kh_init_oid_map();
+	writer->to_pack = pdata;
 
 	string_list_init_dup(&writer->pseudo_merge_groups);
 
diff --git a/pack-bitmap.h b/pack-bitmap.h
index 1171e6d989..ab20d6a0b6 100644
--- a/pack-bitmap.h
+++ b/pack-bitmap.h
@@ -123,7 +123,8 @@  struct bitmap_writer {
 	unsigned char pack_checksum[GIT_MAX_RAWSZ];
 };
 
-void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r);
+void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r,
+			struct packing_data *pdata);
 void bitmap_writer_show_progress(struct bitmap_writer *writer, int show);
 void bitmap_writer_set_checksum(struct bitmap_writer *writer,
 				const unsigned char *sha1);