diff mbox series

[v2,2/8] builtin/multi-pack-index.c: support `--stdin-packs` mode

Message ID 59556e554565120929549239f1aee5a76d80ac8d.1631730270.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series repack: introduce `--write-midx` | expand

Commit Message

Taylor Blau Sept. 15, 2021, 6:24 p.m. UTC
To power a new `--write-midx` mode, `git repack` will want to write a
multi-pack index containing a certain set of packs in the repository.

This new option will be used by `git repack` to write a MIDX which
contains only the packs which will survive after the repack (that is, it
will exclude any packs which are about to be deleted).

This patch effectively exposes the function implemented in the previous
commit via the `git multi-pack-index` builtin. An alternative approach
would have been to call that function from the `git repack` builtin
directly, but this introduces awkward problems around closing and
reopening the object store, so the MIDX will be written out-of-process.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-multi-pack-index.txt |  4 ++++
 builtin/multi-pack-index.c             | 27 ++++++++++++++++++++++++++
 t/t5319-multi-pack-index.sh            | 15 ++++++++++++++
 3 files changed, 46 insertions(+)

Comments

Josh Steadmon Sept. 22, 2021, 10:29 p.m. UTC | #1
Thanks for the series! I have a couple of questions:


On 2021.09.15 14:24, Taylor Blau wrote:
> To power a new `--write-midx` mode, `git repack` will want to write a
> multi-pack index containing a certain set of packs in the repository.
> 
> This new option will be used by `git repack` to write a MIDX which
> contains only the packs which will survive after the repack (that is, it
> will exclude any packs which are about to be deleted).
> 
> This patch effectively exposes the function implemented in the previous
> commit via the `git multi-pack-index` builtin. An alternative approach
> would have been to call that function from the `git repack` builtin
> directly, but this introduces awkward problems around closing and
> reopening the object store, so the MIDX will be written out-of-process.

Could you elaborate a bit on the "awkward problems" here? I'm afraid I'm
missing the context here.


> diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
> index 73c0113b48..047647b5f2 100644
> --- a/builtin/multi-pack-index.c
> +++ b/builtin/multi-pack-index.c
> @@ -47,6 +47,7 @@ static struct opts_multi_pack_index {
>  	const char *preferred_pack;
>  	unsigned long batch_size;
>  	unsigned flags;
> +	int stdin_packs;
>  } opts;
>  
>  static struct option common_opts[] = {
> @@ -61,6 +62,16 @@ static struct option *add_common_options(struct option *prev)
>  	return parse_options_concat(common_opts, prev);
>  }
>  
> +static void read_packs_from_stdin(struct string_list *to)
> +{
> +	struct strbuf buf = STRBUF_INIT;
> +	while (strbuf_getline(&buf, stdin) != EOF)
> +		string_list_append(to, buf.buf);
> +	string_list_sort(to);
> +
> +	strbuf_release(&buf);
> +}
> +

I'm presuming that the packfile list is going to be generated
automatically, but what happens if that becomes corrupt somehow, and we
skip a packfile that should have been included? Will that cause
incorrect behavior, or will we just miss out on some of the bitmap
performance benefits?
Jonathan Tan Sept. 22, 2021, 11:11 p.m. UTC | #2
> An alternative approach
> would have been to call that function from the `git repack` builtin
> directly, but this introduces awkward problems around closing and
> reopening the object store, so the MIDX will be written out-of-process.

I'm not sure if the implementation direction started by this patch
(eventually, running "git multi-pack-index write --stdin-packs" to
replace the midx of a repository while "git repack" is running) would
work on platforms in which mmap-ing a file means that other processes
can't delete it, but if it works on a Windows CI, this should be fine.
(I don't have a Windows CI handy to test it on, though.)

Assuming it works on platforms like Windows, this patch looks fine.
Taylor Blau Sept. 23, 2021, 2:03 a.m. UTC | #3
On Wed, Sep 22, 2021 at 03:29:53PM -0700, Josh Steadmon wrote:
> Thanks for the series! I have a couple of questions:
>
>
> On 2021.09.15 14:24, Taylor Blau wrote:
> > To power a new `--write-midx` mode, `git repack` will want to write a
> > multi-pack index containing a certain set of packs in the repository.
> >
> > This new option will be used by `git repack` to write a MIDX which
> > contains only the packs which will survive after the repack (that is, it
> > will exclude any packs which are about to be deleted).
> >
> > This patch effectively exposes the function implemented in the previous
> > commit via the `git multi-pack-index` builtin. An alternative approach
> > would have been to call that function from the `git repack` builtin
> > directly, but this introduces awkward problems around closing and
> > reopening the object store, so the MIDX will be written out-of-process.
>
> Could you elaborate a bit on the "awkward problems" here? I'm afraid I'm
> missing the context here.

A variety of things can go wrong when the object store is closed and
re-opened in the same process. Many of the symptoms are described
beginning at this message:

  https://lore.kernel.org/git/YPf1m01mcdJ3HNBt@coredump.intra.peff.net/

and further down in the sub-thread. Many of those problems have been
resolved, but I'm not convinced that there aren't others lurking.

> > +static void read_packs_from_stdin(struct string_list *to)
> > +{
> > +	struct strbuf buf = STRBUF_INIT;
> > +	while (strbuf_getline(&buf, stdin) != EOF)
> > +		string_list_append(to, buf.buf);
> > +	string_list_sort(to);
> > +
> > +	strbuf_release(&buf);
> > +}
> > +
>
> I'm presuming that the packfile list is going to be generated
> automatically, but what happens if that becomes corrupt somehow, and we
> skip a packfile that should have been included? Will that cause
> incorrect behavior, or will we just miss out on some of the bitmap
> performance benefits?

A multi-pack bitmap can only refer to objects that are in a pack which
the repository's MIDX includes. So if we left off a pack from this list,
we'd be unable to cover that pack's objects in the resulting bitmap.
We'd also be unable to cover any objects which are reachable from the
missing pack's objects, since the set of objects in a bitmap must be
closed under reachability.

If, on the other hand, we read a line which does not correspond to any
pack, we'll simply ignore it. That's because we loop over the results of
get_all_packs() and try to find a match in this list instead of the
other way around.

We could mark the packs we found by abusing the string_list_item's util
pointer, but it's probably not worth it since this is mostly an internal
interface.

Thanks,
Taylor
Taylor Blau Sept. 23, 2021, 2:06 a.m. UTC | #4
On Wed, Sep 22, 2021 at 04:11:37PM -0700, Jonathan Tan wrote:
> > An alternative approach
> > would have been to call that function from the `git repack` builtin
> > directly, but this introduces awkward problems around closing and
> > reopening the object store, so the MIDX will be written out-of-process.
>
> I'm not sure if the implementation direction started by this patch
> (eventually, running "git multi-pack-index write --stdin-packs" to
> replace the midx of a repository while "git repack" is running) would
> work on platforms in which mmap-ing a file means that other processes
> can't delete it, but if it works on a Windows CI, this should be fine.
> (I don't have a Windows CI handy to test it on, though.)
>
> Assuming it works on platforms like Windows, this patch looks fine.

It definitely passes CI :-). But special care is taken to handle this
case, and it works for a couple of reasons. Notably that we only call
`write_midx_included_packs()` (which in turn invokes the
multi-pack-index builtin as a sub-process) *after* repack has called
close_object_store(). That means that `repack` won't be holding the MIDX
open while the sub-process is trying to write a new one in its place.

The other reason is that the multi-pack-index process also makes sure to
close the old MIDX before writing the new one, so we can be certain that
neither of these spots are mapping the file or have it opened when
trying to write over it.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
index a9df3dbd32..009c989ef8 100644
--- a/Documentation/git-multi-pack-index.txt
+++ b/Documentation/git-multi-pack-index.txt
@@ -45,6 +45,10 @@  write::
 
 	--[no-]bitmap::
 		Control whether or not a multi-pack bitmap is written.
+
+	--stdin-packs::
+		Write a multi-pack index containing only the set of
+		line-delimited pack index basenames provided over stdin.
 --
 
 verify::
diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c
index 73c0113b48..047647b5f2 100644
--- a/builtin/multi-pack-index.c
+++ b/builtin/multi-pack-index.c
@@ -47,6 +47,7 @@  static struct opts_multi_pack_index {
 	const char *preferred_pack;
 	unsigned long batch_size;
 	unsigned flags;
+	int stdin_packs;
 } opts;
 
 static struct option common_opts[] = {
@@ -61,6 +62,16 @@  static struct option *add_common_options(struct option *prev)
 	return parse_options_concat(common_opts, prev);
 }
 
+static void read_packs_from_stdin(struct string_list *to)
+{
+	struct strbuf buf = STRBUF_INIT;
+	while (strbuf_getline(&buf, stdin) != EOF)
+		string_list_append(to, buf.buf);
+	string_list_sort(to);
+
+	strbuf_release(&buf);
+}
+
 static int cmd_multi_pack_index_write(int argc, const char **argv)
 {
 	struct option *options;
@@ -70,6 +81,8 @@  static int cmd_multi_pack_index_write(int argc, const char **argv)
 			   N_("pack for reuse when computing a multi-pack bitmap")),
 		OPT_BIT(0, "bitmap", &opts.flags, N_("write multi-pack bitmap"),
 			MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX),
+		OPT_BOOL(0, "stdin-packs", &opts.stdin_packs,
+			 N_("write multi-pack index containing only given indexes")),
 		OPT_END(),
 	};
 
@@ -86,6 +99,20 @@  static int cmd_multi_pack_index_write(int argc, const char **argv)
 
 	FREE_AND_NULL(options);
 
+	if (opts.stdin_packs) {
+		struct string_list packs = STRING_LIST_INIT_DUP;
+		int ret;
+
+		read_packs_from_stdin(&packs);
+
+		ret = write_midx_file_only(opts.object_dir, &packs,
+					   opts.preferred_pack, opts.flags);
+
+		string_list_clear(&packs, 0);
+
+		return ret;
+
+	}
 	return write_midx_file(opts.object_dir, opts.preferred_pack,
 			       opts.flags);
 }
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bb04f0f23b..385f0a3efd 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -168,6 +168,21 @@  test_expect_success 'write midx with two packs' '
 
 compare_results_with_midx "two packs"
 
+test_expect_success 'write midx with --stdin-packs' '
+	rm -fr $objdir/pack/multi-pack-index &&
+
+	idx="$(find $objdir/pack -name "test-2-*.idx")" &&
+	basename "$idx" >in &&
+
+	git multi-pack-index write --stdin-packs <in &&
+
+	test-tool read-midx $objdir | grep "\.idx$" >packs &&
+
+	test_cmp packs in
+'
+
+compare_results_with_midx "mixed mode (one pack + extra)"
+
 test_expect_success 'write progress off for redirected stderr' '
 	git multi-pack-index --object-dir=$objdir write 2>err &&
 	test_line_count = 0 err