[6/8] midx-write.c: support reading an existing MIDX with `packs_to_include`

Message ID 61268114c6562ba882210fd94b3f336efcb5c486.1716482279.git.me@ttaylorr.com (mailing list archive)
State Superseded
Series midx-write: miscellaneous clean-ups for incremental MIDXs

Commit Message

Taylor Blau May 23, 2024, 4:38 p.m. UTC
Avoid unconditionally copying all packs from an existing MIDX into a new
MIDX by checking that packs added via `fill_packs_from_midx()` appear in
the `to_include` set, if one was provided.

Do so by calling `should_include_pack()` from both `add_pack_to_midx()`
and `fill_packs_from_midx()`.

In order to make this work, teach `should_include_pack()` a new
"exclude_from_midx" parameter, which allows skipping the first check.
This is done so that the caller in `fill_packs_from_midx()` doesn't
reject all of the packs it provided since they appear in an existing
MIDX by definition.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 midx-write.c | 42 +++++++++++-------------------------------
 1 file changed, 11 insertions(+), 31 deletions(-)

Comments

Patrick Steinhardt May 29, 2024, 7:48 a.m. UTC | #1
On Thu, May 23, 2024 at 12:38:19PM -0400, Taylor Blau wrote:
> Avoid unconditionally copying all packs from an existing MIDX into a new
> MIDX by checking that packs added via `fill_packs_from_midx()` don't
> appear in the `to_include` set, if one was provided.
> 
> Do so by calling `should_include_pack()` from both `add_pack_to_midx()`
> and `fill_packs_from_midx()`.

This is missing an explanation why exactly we want that. Is the current
behaviour a bug? Is it a preparation for a future change? Is this change
expected to modify any existing behaviour?

Reading through the patch we now unconditionally load the existing MIDX
when writing a new one, but I'm not exactly sure what the effect of that
is going to be.

[snip]
> diff --git a/midx-write.c b/midx-write.c
> index 9712ac044f..36ac4ab65b 100644
> --- a/midx-write.c
> +++ b/midx-write.c
> @@ -101,27 +101,13 @@ struct write_midx_context {
>  };
>  
>  static int should_include_pack(const struct write_midx_context *ctx,
> -			       const char *file_name)
> +			       const char *file_name,
> +			       int exclude_from_midx)
>  {
> -	/*
> -	 * Note that at most one of ctx->m and ctx->to_include are set,
> -	 * so we are testing midx_contains_pack() and
> -	 * string_list_has_string() independently (guarded by the
> -	 * appropriate NULL checks).
> -	 *
> -	 * We could support passing to_include while reusing an existing
> -	 * MIDX, but don't currently since the reuse process drags
> -	 * forward all packs from an existing MIDX (without checking
> -	 * whether or not they appear in the to_include list).
> -	 *
> -	 * If we added support for that, these next two conditional
> -	 * should be performed independently (likely checking
> -	 * to_include before the existing MIDX).
> -	 */
> -	if (ctx->m && midx_contains_pack(ctx->m, file_name))
> +	if (exclude_from_midx && ctx->m && midx_contains_pack(ctx->m, file_name))
>  		return 0;
> -	else if (ctx->to_include &&
> -		 !string_list_has_string(ctx->to_include, file_name))
> +	if (ctx->to_include && !string_list_has_string(ctx->to_include,
> +						       file_name))

The second branch is a no-op change, right? The only change was that you
converted from `else if` to `if`. I'd propose to either keep this as-is,
or to do this change in the preceding patch already that introduces this
function.

Patrick

Taylor Blau May 29, 2024, 10:46 p.m. UTC | #2
On Wed, May 29, 2024 at 09:48:20AM +0200, Patrick Steinhardt wrote:
> On Thu, May 23, 2024 at 12:38:19PM -0400, Taylor Blau wrote:
> > Avoid unconditionally copying all packs from an existing MIDX into a new
> > MIDX by checking that packs added via `fill_packs_from_midx()` don't
> > appear in the `to_include` set, if one was provided.
> >
> > Do so by calling `should_include_pack()` from both `add_pack_to_midx()`
> > and `fill_packs_from_midx()`.
>
> This is missing an explanation why exactly we want that. Is the current
> behaviour a bug? Is it a preparation for a future change? Is this change
> expected to modify any existing behaviour?
>
> Reading through the patch we now unconditionally load the existing MIDX
> when writing a new one, but I'm not exactly sure what the effect of that
> is going to be.

Very fair. The short answer is that this is a prerequisite for the
incremental MIDX series that I'm working on. The longer answer is that
an incremental MIDX-aware writer needs to be able to consult the
existing MIDX (if one exists) and exclude any objects which already
appear in an earlier layer of the MIDX. This is done because we cannot
have the same object appear in multiple layers of the MIDX, for reasons
that are probably not interesting to this series.

I put a more concise version of the explanation above into this patch,
which I'll send as part of v2 of this series shortly.

> [snip]
> > diff --git a/midx-write.c b/midx-write.c
> > index 9712ac044f..36ac4ab65b 100644
> > --- a/midx-write.c
> > +++ b/midx-write.c
> > @@ -101,27 +101,13 @@ struct write_midx_context {
> >  };
> >
> >  static int should_include_pack(const struct write_midx_context *ctx,
> > -			       const char *file_name)
> > +			       const char *file_name,
> > +			       int exclude_from_midx)
> >  {
> > -	/*
> > -	 * Note that at most one of ctx->m and ctx->to_include are set,
> > -	 * so we are testing midx_contains_pack() and
> > -	 * string_list_has_string() independently (guarded by the
> > -	 * appropriate NULL checks).
> > -	 *
> > -	 * We could support passing to_include while reusing an existing
> > -	 * MIDX, but don't currently since the reuse process drags
> > -	 * forward all packs from an existing MIDX (without checking
> > -	 * whether or not they appear in the to_include list).
> > -	 *
> > -	 * If we added support for that, these next two conditional
> > -	 * should be performed independently (likely checking
> > -	 * to_include before the existing MIDX).
> > -	 */
> > -	if (ctx->m && midx_contains_pack(ctx->m, file_name))
> > +	if (exclude_from_midx && ctx->m && midx_contains_pack(ctx->m, file_name))
> >  		return 0;
> > -	else if (ctx->to_include &&
> > -		 !string_list_has_string(ctx->to_include, file_name))
> > +	if (ctx->to_include && !string_list_has_string(ctx->to_include,
> > +						       file_name))
>
> The second branch is a no-op change, right? The only change was that you
> converted from `else if` to `if`. I'd propose to either keep this as-is,
> or to do this change in the preceding patch already that introduces this
> function.

It is a no-op, but I would rather keep these separate to keep the
previous step a pure code movement rather than introducing any textual
changes.

Thanks,
Taylor

Patch

diff --git a/midx-write.c b/midx-write.c
index 9712ac044f..36ac4ab65b 100644
--- a/midx-write.c
+++ b/midx-write.c
@@ -101,27 +101,13 @@  struct write_midx_context {
 };
 
 static int should_include_pack(const struct write_midx_context *ctx,
-			       const char *file_name)
+			       const char *file_name,
+			       int exclude_from_midx)
 {
-	/*
-	 * Note that at most one of ctx->m and ctx->to_include are set,
-	 * so we are testing midx_contains_pack() and
-	 * string_list_has_string() independently (guarded by the
-	 * appropriate NULL checks).
-	 *
-	 * We could support passing to_include while reusing an existing
-	 * MIDX, but don't currently since the reuse process drags
-	 * forward all packs from an existing MIDX (without checking
-	 * whether or not they appear in the to_include list).
-	 *
-	 * If we added support for that, these next two conditional
-	 * should be performed independently (likely checking
-	 * to_include before the existing MIDX).
-	 */
-	if (ctx->m && midx_contains_pack(ctx->m, file_name))
+	if (exclude_from_midx && ctx->m && midx_contains_pack(ctx->m, file_name))
 		return 0;
-	else if (ctx->to_include &&
-		 !string_list_has_string(ctx->to_include, file_name))
+	if (ctx->to_include && !string_list_has_string(ctx->to_include,
+						       file_name))
 		return 0;
 	return 1;
 }
@@ -135,7 +121,7 @@  static void add_pack_to_midx(const char *full_path, size_t full_path_len,
 	if (ends_with(file_name, ".idx")) {
 		display_progress(ctx->progress, ++ctx->pack_paths_checked);
 
-		if (!should_include_pack(ctx, file_name))
+		if (!should_include_pack(ctx, file_name, 1))
 			return;
 
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
@@ -891,6 +877,9 @@  static int fill_packs_from_midx(struct write_midx_context *ctx,
 	uint32_t i;
 
 	for (i = 0; i < ctx->m->num_packs; i++) {
+		if (!should_include_pack(ctx, ctx->m->pack_names[i], 0))
+			continue;
+
 		ALLOC_GROW(ctx->info, ctx->nr + 1, ctx->alloc);
 
 		if (flags & MIDX_WRITE_REV_INDEX || preferred_pack_name) {
@@ -945,15 +934,7 @@  static int write_midx_internal(const char *object_dir,
 		die_errno(_("unable to create leading directories of %s"),
 			  midx_name.buf);
 
-	if (!packs_to_include) {
-		/*
-		 * Only reference an existing MIDX when not filtering which
-		 * packs to include, since all packs and objects are copied
-		 * blindly from an existing MIDX if one is present.
-		 */
-		ctx.m = lookup_multi_pack_index(the_repository, object_dir);
-	}
-
+	ctx.m = lookup_multi_pack_index(the_repository, object_dir);
 	if (ctx.m && !midx_checksum_valid(ctx.m)) {
 		warning(_("ignoring existing multi-pack-index; checksum mismatch"));
 		ctx.m = NULL;
@@ -962,6 +943,7 @@  static int write_midx_internal(const char *object_dir,
 	ctx.nr = 0;
 	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
 	ctx.info = NULL;
+	ctx.to_include = packs_to_include;
 	ALLOC_ARRAY(ctx.info, ctx.alloc);
 
 	if (ctx.m && fill_packs_from_midx(&ctx, preferred_pack_name,
@@ -978,8 +960,6 @@  static int write_midx_internal(const char *object_dir,
 	else
 		ctx.progress = NULL;
 
-	ctx.to_include = packs_to_include;
-
 	for_each_file_in_pack_dir(object_dir, add_pack_to_midx, &ctx);
 	stop_progress(&ctx.progress);