diff mbox series

index-pack: teach --promisor to require --stdin

Message ID 20241118190210.772105-1-jonathantanmy@google.com (mailing list archive)
State New
Headers show
Series index-pack: teach --promisor to require --stdin | expand

Commit Message

Jonathan Tan Nov. 18, 2024, 7:02 p.m. UTC
Currently,

 - Running "index-pack --promisor" outside a repo segfaults.
 - It may be confusing to a user that running "index-pack --promisor"
   within a repo may make changes to the repo's object DB, especially
   since the packs indexed by the index-pack invocation may not even be
   related to the repo.

As discussed in [1], teaching --promisor to require --stdin and forbid a
packfile name solves both these problems. This combination of arguments
requires a repo (since we are writing the resulting .pack and .idx to
it) and it is clear that the files are related to the repo.

Currently, Git uses "index-pack --promisor" only when fetching into
a repo, so it could be argued that we should teach "index-pack" a new
argument (say, "--fetching-mode") instead of tying --promisor to a
generic argument like "--stdin". However, this --promisor feature could
conceivably be used whenever we have a packfile that is known to come
from the promisor remote (whether obtained through Git's fetch protocol
or through other means) so it seems reasonable to use --stdin here -
one could envision a user-made script obtaining a packfile and then
running "index-pack --promisor --stdin", for example. In fact, it might
be possible to relax the restriction further (say, by also allowing
--promisor when indexing a packfile that is in the object DB), but
relaxing the restriction is backwards-compatible so we can revisit that
later.

One thing to watch out for is the possibility of a future Git feature
that indexes a pack in the context of a repo, but does not necessarily
write the resulting pack to it (and does not necessarily desire to
make any changes to the object DB). One such feature would be fetch
quarantine, which might need the repo context in order to detect
hash collisions, but would also need to ensure that the object DB
is undisturbed in case the fetch fails for whatever reason, even if
the reason occurs only after the indexing is complete. It may not be
obvious to the implementer of such a feature that "index-pack" could
sometimes write packs other than the indexed pack to the object DB,
but there are already other ways that "fetch" could write to the object
DB (in particular, packfile URIs and bundle URIs), so hopefully the
implementation of this future feature would already include a test that
the object DB be undisturbed.

This change requires the change to t5300 by 1f52cdfacb (index-pack:
document and test the --promisor option, 2022-03-09) to be undone.
(--promisor is already tested indirectly, so we don't need the explicit
test here any more.)

[1] https://lore.kernel.org/git/20241114005652.GC1140565@coredump.intra.peff.net/

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
This is on jt/repack-local-promisor.

Looking into it further, I think that we also need to require no
packfile name to be given (so that we are writing the file to the
repository). Therefore, I've added that requirement both in the code and
in the documentation.

I've tried to summarize our conversation in the commit message - if you
notice anything missing or incorrect, feel free to let me know.
---
 Documentation/git-index-pack.txt | 2 ++
 builtin/index-pack.c             | 4 ++++
 t/t5300-pack-object.sh           | 4 +---
 3 files changed, 7 insertions(+), 3 deletions(-)

Comments

Junio C Hamano Nov. 19, 2024, 3:29 a.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> Currently,
>
>  - Running "index-pack --promisor" outside a repo segfaults.
>  - It may be confusing to a user that running "index-pack --promisor"
>    within a repo may make changes to the repo's object DB, especially
>    since the packs indexed by the index-pack invocation may not even be
>    related to the repo.
>
> As discussed in [1], teaching --promisor to require --stdin and forbid a
> packfile name solves both these problems. This combination of arguments
> requires a repo (since we are writing the resulting .pack and .idx to
> it) and it is clear that the files are related to the repo.

Makes sense.

> This change requires the change to t5300 by 1f52cdfacb (index-pack:
> document and test the --promisor option, 2022-03-09) to be undone.
> (--promisor is already tested indirectly, so we don't need the explicit
> test here any more.)

OK.

> This is on jt/repack-local-promisor.
> diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
> index 4be09e58e7..ac96935d73 100644
> --- a/Documentation/git-index-pack.txt
> +++ b/Documentation/git-index-pack.txt
> @@ -144,6 +144,8 @@ Also, if there are objects in the given pack that references non-promisor
>  objects (in the repo), repacks those non-promisor objects into a promisor
>  pack. This avoids a situation in which a repo has non-promisor objects that are
>  accessible through promisor objects.
> ++
> +Requires --stdin, and requires <pack-file> to not be specified.
>  
>  NOTES
>  -----
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 08b340552f..c46b6e4061 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
>  		usage(index_pack_usage);
>  	if (fix_thin_pack && !from_stdin)
>  		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
> +	if (promisor_msg && !from_stdin)
> +		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
> +	if (promisor_msg && pack_name)
> +		die(_("--promisor cannot be used with a pack name"));
>  	if (from_stdin && !startup_info->have_repository)
>  		die(_("--stdin requires a git repository"));

OK.  Thanks, will queue.
Jeff King Nov. 19, 2024, 6:53 p.m. UTC | #2
On Mon, Nov 18, 2024 at 11:02:06AM -0800, Jonathan Tan wrote:

> Currently, Git uses "index-pack --promisor" only when fetching into
> a repo, so it could be argued that we should teach "index-pack" a new
> argument (say, "--fetching-mode") instead of tying --promisor to a
> generic argument like "--stdin". However, this --promisor feature could
> conceivably be used whenever we have a packfile that is known to come
> from the promisor remote (whether obtained through Git's fetch protocol
> or through other means) so it seems reasonable to use --stdin here -
> one could envision a user-made script obtaining a packfile and then
> running "index-pack --promisor --stdin", for example. In fact, it might
> be possible to relax the restriction further (say, by also allowing
> --promisor when indexing a packfile that is in the object DB), but
> relaxing the restriction is backwards-compatible so we can revisit that
> later.

Yeah, I agree with this summary.

> This change requires the change to t5300 by 1f52cdfacb (index-pack:
> document and test the --promisor option, 2022-03-09) to be undone.
> (--promisor is already tested indirectly, so we don't need the explicit
> test here any more.)

OK, I think this is reasonable.

> Looking into it further, I think that we also need to require no
> packfile name to be given (so that we are writing the file to the
> repository). Therefore, I've added that requirement both in the code and
> in the documentation.

Hmm. I didn't realize that you could specify a pack name _and_ --stdin,
but I guess it makes sense if you wanted to write the result to a
non-standard location (though curiously --stdin requires a repo, which
feels overly restrictive if you give a pack name).

But I think that makes the --stdin check redundant. I.e., here:

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 08b340552f..c46b6e4061 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
>  		usage(index_pack_usage);
>  	if (fix_thin_pack && !from_stdin)
>  		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
> +	if (promisor_msg && !from_stdin)
> +		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
> +	if (promisor_msg && pack_name)
> +		die(_("--promisor cannot be used with a pack name"));

...just the second one would be sufficient, because the context just
above this has:

	if (!pack_name && !from_stdin)
		usage(index_pack_usage);

So if there isn't a pack name then from_stdin must be set anyway.

What you've written won't behave incorrectly, but I wonder if this means
we can explain the rule in a more simple way:

  - the --promisor option requires that we be indexing a pack in the
    object database

  - when not given a pack name on the command line, we know this is true
    (because we generate the name ourselves internally)

  - when given a pack name on the command line, we _could_ check that it
    is inside the object directory, but we don't currently do so and
    just bail. That could be changed in the future.

And then there is no mention of --stdin at all (though of course it is
an implication of the second point, since we have to get input somehow).

-Peff
Junio C Hamano Nov. 20, 2024, 1:34 a.m. UTC | #3
Jeff King <peff@peff.net> writes:

> But I think that makes the --stdin check redundant. I.e., here:
>
>> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>> index 08b340552f..c46b6e4061 100644
>> --- a/builtin/index-pack.c
>> +++ b/builtin/index-pack.c
>> @@ -1970,6 +1970,10 @@ int cmd_index_pack(int argc,
>>  		usage(index_pack_usage);
>>  	if (fix_thin_pack && !from_stdin)
>>  		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
>> +	if (promisor_msg && !from_stdin)
>> +		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
>> +	if (promisor_msg && pack_name)
>> +		die(_("--promisor cannot be used with a pack name"));
>
> ...just the second one would be sufficient, because the context just
> above this has:
>
> 	if (!pack_name && !from_stdin)
> 		usage(index_pack_usage);
>
> So if there isn't a pack name then from_stdin must be set anyway.

Nice findings that leads to ... 

> What you've written won't behave incorrectly, but I wonder if this means
> we can explain the rule in a more simple way:
>
>   - the --promisor option requires that we be indexing a pack in the
>     object database
>
>   - when not given a pack name on the command line, we know this is true
>     (because we generate the name ourselves internally)
>
>   - when given a pack name on the command line, we _could_ check that it
>     is inside the object directory, but we don't currently do so and
>     just bail. That could be changed in the future.
>
> And then there is no mention of --stdin at all (though of course it is
> an implication of the second point, since we have to get input somehow).

... a good simplification.  Not of the implementation---as it is
already simple enough---but of the concept, and simplification of
the latter counts a lot more ;-)

Thanks, both, for working on this.
diff mbox series

Patch

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 4be09e58e7..ac96935d73 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -144,6 +144,8 @@  Also, if there are objects in the given pack that references non-promisor
 objects (in the repo), repacks those non-promisor objects into a promisor
 pack. This avoids a situation in which a repo has non-promisor objects that are
 accessible through promisor objects.
++
+Requires --stdin, and requires <pack-file> to not be specified.
 
 NOTES
 -----
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 08b340552f..c46b6e4061 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1970,6 +1970,10 @@  int cmd_index_pack(int argc,
 		usage(index_pack_usage);
 	if (fix_thin_pack && !from_stdin)
 		die(_("the option '%s' requires '%s'"), "--fix-thin", "--stdin");
+	if (promisor_msg && !from_stdin)
+		die(_("the option '%s' requires '%s'"), "--promisor", "--stdin");
+	if (promisor_msg && pack_name)
+		die(_("--promisor cannot be used with a pack name"));
 	if (from_stdin && !startup_info->have_repository)
 		die(_("--stdin requires a git repository"));
 	if (from_stdin && hash_algo)
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index aff164ddf8..c53f355e48 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -332,10 +332,8 @@  test_expect_success 'build pack index for an existing pack' '
 	git index-pack -o tmp.idx test-3.pack &&
 	cmp tmp.idx test-1-${packname_1}.idx &&
 
-	git index-pack --promisor=message test-3.pack &&
+	git index-pack test-3.pack &&
 	cmp test-3.idx test-1-${packname_1}.idx &&
-	echo message >expect &&
-	test_cmp expect test-3.promisor &&
 
 	cat test-2-${packname_2}.pack >test-3.pack &&
 	git index-pack -o tmp.idx test-2-${packname_2}.pack &&