Message ID | 9d322fc5399c453913d08f35eee907b5c909ad6b.1723743050.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 01e9d129396e98b611bff2ae82bb3d610b28c588 |
Headers | show |
Series | pseudo-merge: avoid empty and non-closed pseudo-merge commits | expand |
On Thu, Aug 15, 2024 at 01:31:00PM -0400, Taylor Blau wrote: > In order to determine its object order, the pack-bitmap machinery keeps > a 'struct packing_data' corresponding to the pack or pseudo-pack (when > writing a MIDX bitmap) being written. > > The to_pack field is provided to the bitmap machinery by callers of > bitmap_writer_build() and assigned to the bitmap_writer struct at that > point. > > But a subsequent commit will want to have access to that data earlier on > during commit selection. Prepare for that by adding a 'to_pack' argument > to 'bitmap_writer_init()', and initializing the field during that > function. > > Subsequent commits will clean up other functions which take > now-redundant arguments (like nr_objects, which is equivalent to > pdata->objects_nr, or pdata itself). This (and the next few follow-on commits) seem like a good change to me. It simplifies many of the function calls, and I think it expresses the domain logic in the API: there is a single set of objects being mapped to bits, and many parts of the process will rely on it. Even the midx code, which is not generating a pack, uses a "fake" packing_data as the way to express that (because inherently the bit ordering is all coming from the pack-index nature). If we likewise ever wrote code to generate bitmaps from an existing pack, it would probably use packing_data, too. :) -Peff
On Sat, Aug 17, 2024 at 06:31:55AM -0400, Jeff King wrote: > On Thu, Aug 15, 2024 at 01:31:00PM -0400, Taylor Blau wrote: > > > In order to determine its object order, the pack-bitmap machinery keeps > > a 'struct packing_data' corresponding to the pack or pseudo-pack (when > > writing a MIDX bitmap) being written. > > > > The to_pack field is provided to the bitmap machinery by callers of > > bitmap_writer_build() and assigned to the bitmap_writer struct at that > > point. > > > > But a subsequent commit will want to have access to that data earlier on > > during commit selection. Prepare for that by adding a 'to_pack' argument > > to 'bitmap_writer_init()', and initializing the field during that > > function. > > > > Subsequent commits will clean up other functions which take > > now-redundant arguments (like nr_objects, which is equivalent to > > pdata->objects_nr, or pdata itself). > > This (and the next few follow-on commits) seem like a good change to me. > It simplifies many of the function calls, and I think it expresses the > domain logic in the API: there is a single set of objects being mapped > to bits, and many parts of the process will rely on it. Thanks. Yeah, it was a little surprising to me that it wasn't already this way, especially having worked in this area for so long. I suspect it grew this way organically over time (though haven't actually gone spelunking through the history to confirm). > Even the midx code, which is not generating a pack, uses a "fake" > packing_data as the way to express that (because inherently the bit > ordering is all coming from the pack-index nature). If we likewise ever > wrote code to generate bitmaps from an existing pack, it would probably > use packing_data, too. :) I agree for the most part, though there is a lot of weight in packing_data that would be nice to not have to carry around. I know within GitHub's infrastructure we sometimes OOM kill invocations of "git multi-pack-index write --bitmap" because of the memory overhead (a lot of which is dominated by the actual traversal and bitmap generation, but a lot that comes from just the per-object overhead). I've thought about alternative structures that might be a little more memory efficient, but it's never gotten to the top of my list. Thanks, Taylor
On Thu, Aug 29, 2024 at 03:00:21PM -0400, Taylor Blau wrote: > > Even the midx code, which is not generating a pack, uses a "fake" > > packing_data as the way to express that (because inherently the bit > > ordering is all coming from the pack-index nature). If we likewise ever > > wrote code to generate bitmaps from an existing pack, it would probably > > use packing_data, too. :) > > I agree for the most part, though there is a lot of weight in > packing_data that would be nice to not have to carry around. I know > within GitHub's infrastructure we sometimes OOM kill invocations of "git > multi-pack-index write --bitmap" because of the memory overhead (a lot > of which is dominated by the actual traversal and bitmap generation, but > a lot that comes from just the per-object overhead). > > I've thought about alternative structures that might be a little more > memory efficient, but it's never gotten to the top of my list. True. What the index and bitmap steps really want is not an array of object_entry, but an array of pack_idx_entry (which is the first component of an object_entry). I wonder how feasible it would be to simply hold two arrays with corresponding entries at each index. Many places only care about one or the other. But for places that do care about both, especially ones that receive a pointer to an individual object_entry, they'd need to receive pointers to both. I briefly looked at the compile errors that come from making such a change. Many of them look trivial, but I think some of them get weird (the ext_bases array is also holding object_entry structs). So maybe worth pursuing in the long run, but not something to knock out this afternoon. -Peff
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index f395488971..0ad533c045 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1342,7 +1342,7 @@ static void write_pack_file(void) if (write_bitmap_index) { bitmap_writer_init(&bitmap_writer, - the_repository); + the_repository, &to_pack); bitmap_writer_set_checksum(&bitmap_writer, hash); bitmap_writer_build_type_index(&bitmap_writer, &to_pack, written_list, nr_written); diff --git a/midx-write.c b/midx-write.c index a77ee73c68..62f507eb72 100644 --- a/midx-write.c +++ b/midx-write.c @@ -825,7 +825,7 @@ static int write_midx_bitmap(const char *midx_name, for (i = 0; i < pdata->nr_objects; i++) index[i] = &pdata->objects[i].idx; - bitmap_writer_init(&writer, the_repository); + bitmap_writer_init(&writer, the_repository, pdata); bitmap_writer_show_progress(&writer, flags & MIDX_PROGRESS); bitmap_writer_build_type_index(&writer, pdata, index, pdata->nr_objects); diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index bf96c80898..4a7d2d1370 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -41,13 +41,15 @@ static inline int bitmap_writer_nr_selected_commits(struct bitmap_writer *writer return writer->selected_nr - writer->pseudo_merges_nr; } -void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r) +void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r, + struct packing_data *pdata) { memset(writer, 0, sizeof(struct bitmap_writer)); if (writer->bitmaps) BUG("bitmap writer already initialized"); writer->bitmaps = kh_init_oid_map(); writer->pseudo_merge_commits = kh_init_oid_map(); + writer->to_pack = pdata; string_list_init_dup(&writer->pseudo_merge_groups); diff --git a/pack-bitmap.h b/pack-bitmap.h index 1171e6d989..ab20d6a0b6 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -123,7 +123,8 @@ struct bitmap_writer { unsigned char pack_checksum[GIT_MAX_RAWSZ]; }; -void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r); +void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r, + struct packing_data *pdata); void bitmap_writer_show_progress(struct bitmap_writer *writer, int show); void bitmap_writer_set_checksum(struct bitmap_writer *writer, const unsigned char *sha1);
In order to determine its object order, the pack-bitmap machinery keeps a 'struct packing_data' corresponding to the pack or pseudo-pack (when writing a MIDX bitmap) being written. The to_pack field is provided to the bitmap machinery by callers of bitmap_writer_build() and assigned to the bitmap_writer struct at that point. But a subsequent commit will want to have access to that data earlier on during commit selection. Prepare for that by adding a 'to_pack' argument to 'bitmap_writer_init()', and initializing the field during that function. Subsequent commits will clean up other functions which take now-redundant arguments (like nr_objects, which is equivalent to pdata->objects_nr, or pdata itself). Signed-off-by: Taylor Blau <me@ttaylorr.com> --- builtin/pack-objects.c | 2 +- midx-write.c | 2 +- pack-bitmap-write.c | 4 +++- pack-bitmap.h | 3 ++- 4 files changed, 7 insertions(+), 4 deletions(-)