revision: add `--ignore-missing-links` user option

Message ID 20230908174208.249184-1-karthik.188@gmail.com (mailing list archive)
State Superseded

Commit Message

karthik nayak Sept. 8, 2023, 5:42 p.m. UTC
The revision backend is used by multiple porcelain commands such as
git-rev-list(1) and git-log(1). The backend currently supports ignoring
missing links by setting the `ignore_missing_links` bit. This allows the
revision walk to skip any object links that are missing.

Currently there is no way to use git-rev-list(1) to traverse the objects
of the main object directory (GIT_OBJECT_DIRECTORY) and print the
boundary objects when moving from the main object directory to the
alternate object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES).

By exposing this new flag `--ignore-missing-links`, users can set the
required env variables (GIT_OBJECT_DIRECTORY and
GIT_ALTERNATE_OBJECT_DIRECTORIES) along with the `--boundary` flag to
find the boundary objects between object directories.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
---
 Documentation/rev-list-options.txt |  5 ++++
 revision.c                         |  2 ++
 t/t6022-rev-list-alternates.sh     | 43 ++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)
 create mode 100755 t/t6022-rev-list-alternates.sh

Comments

Junio C Hamano Sept. 8, 2023, 7:19 p.m. UTC | #1
Karthik Nayak <karthik.188@gmail.com> writes:

> The revision backend is used by multiple porcelain commands such as
> git-rev-list(1) and git-log(1). The backend currently supports ignoring
> missing links by setting the `ignore_missing_links` bit. This allows the
> revision walk to skip any object links that are missing.

> Currently there is no way to use git-rev-list(1) to traverse the objects
> of the main object directory (GIT_OBJECT_DIRECTORY) and print the
> boundary objects when moving from the main object directory to the
> alternate object directories (GIT_ALTERNATE_OBJECT_DIRECTORIES).

The above description needs to be tightened up a bit, I think.

What is left unsaid is that you arranged a repository to borrow from
an alternate object directory (or two), and plan to walk objects
with this bit on in the repository, while leaving the alternates
disabled.  Without stating that you plan to disable the alternates
while this mode of operation happens, nothing would happen when the
traversal goes from the main to the alternate because no links are
broken, no?
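
The scenario described above can be reproduced with stock git. The following is a minimal sketch, not taken from the patch: it uses empty commits instead of `test_commit_bulk`, and the directory names `main` and `alt` and the helper `make_commits` are illustrative assumptions. It shows that the walk only breaks once the alternate is left disabled.

```shell
#!/bin/sh
# Sketch only: the first five commits end up in an alternate object
# directory; the walk is then tried with and without that alternate.
set -e

tmp=$(mktemp -d)
cd "$tmp"

git init -q main
git -C main config user.name "Example"
git -C main config user.email "example@example.com"

# make_commits <first> <last>: hypothetical helper creating empty commits.
make_commits () {
    i=$1
    while [ "$i" -le "$2" ]
    do
        git -C main commit -q --allow-empty -m "commit $i"
        i=$((i + 1))
    done
}

make_commits 1 5
mkdir alt
mv main/.git/objects/* alt

# Commits 6..10 are written to main's own object directory, but git
# needs the alternate to resolve the current HEAD while creating them.
(
    GIT_ALTERNATE_OBJECT_DIRECTORIES="$tmp/alt"
    export GIT_ALTERNATE_OBJECT_DIRECTORIES
    make_commits 6 10
)

# With the alternate enabled no link is broken, so every commit is listed.
with_alt=$(GIT_ALTERNATE_OBJECT_DIRECTORIES="$tmp/alt" \
    git -C main rev-list HEAD | wc -l | tr -d ' ')

# With the alternate disabled, the parent of commit 6 cannot be read.
if git -C main rev-list HEAD >/dev/null 2>&1
then
    without_alt=ok
else
    without_alt=failed
fi

echo "commits visible with the alternate: $with_alt"
echo "walk without the alternate: $without_alt"
```

The second walk is the situation in which an option to tolerate missing links would come into play; with the alternate enabled there is nothing for it to ignore.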

> By exposing this new flag `--ignore-missing-links`, users can set the
> required env variables (GIT_OBJECT_DIRECTORY and
> GIT_ALTERNATE_OBJECT_DIRECTORIES) along with the `--boundary` flag to
> find the boundary objects between object directories.

Since this command is plumbing, there is not much reason to object to
surfacing features that already exist internally as command-line
options.  Having said that,

 * Suppose your traversal with --ignore-missing-links from the tip
   of a branch reaches a tree object A, and the tree object A has a
   link to a blob B and a blob C.  But B is in a separate object
   store that you usually access via the alternate mechanism.
   Instead of barfing "The repository is corrupt---object A points
   at object B that does not exist", we pretend that A does not have
   the link to B and keep traversing, discovering C and other
   objects.

   That much we can read from the above and also the documentation
   part of the patch.  The interaction with --boundary needs to be
   clarified in this description and the documentation, though.  It
   is unclear if you show 'A' or 'B' in this scenario.

 * Some traversals use the ignore-missing-links bit implicitly and
   currently there is no way to turn it off.  Is it plausible that
   a user may want to explicitly toggle it off, with the option
   negated, i.e. --no-ignore-missing-links?  I do not immediately
   see the utility of such an option, but that is only due to my
   lack of imagination.  For now, I think it makes sense not to
   allow negating this option, until somebody comes up with a useful
   use case.
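
The tree A / blob B / blob C situation from the first point can be simulated with stock git. This is a rough sketch under stated assumptions: the repository name `repo` and the file names `B` and `C` are made up, and the missing blob is produced by deleting its loose object file rather than by disabling an alternate. It only demonstrates the default behaviour (the walk dies); it does not use the option proposed in this patch.

```shell
#!/bin/sh
# Sketch only: a tree references blobs B and C, then B's object is removed.
set -e

tmp=$(mktemp -d)
cd "$tmp"

git init -q repo
git -C repo config user.name "Example"
git -C repo config user.email "example@example.com"

printf 'contents of B\n' >repo/B
printf 'contents of C\n' >repo/C
git -C repo add B C
git -C repo commit -q -m "tree A with blobs B and C"

# Delete the loose object file for blob B, leaving a dangling link in
# the tree that still names it.
oid_b=$(git -C repo rev-parse HEAD:B)
dir=$(printf '%s' "$oid_b" | cut -c1-2)
file=$(printf '%s' "$oid_b" | cut -c3-)
rm -f "repo/.git/objects/$dir/$file"

# Without any option that tolerates the gap, the object walk is
# expected to stop with an error.
if git -C repo rev-list --objects HEAD >/dev/null 2>&1
then
    walk_status=ok
else
    walk_status=failed
fi
echo "object walk with blob B missing: $walk_status"
```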

> +--ignore-missing-links::
> +	When an object points to another object that is missing, pretend as if the
> +	link did not exist. These missing links are not written to stdout unless
> +	the --boundary flag is passed.

Does "git rev-list" ever write "links"?  I thought not.

"These missing objects are not written" would be more sensible, but
we never write missing objects with or without the option, so it
is not even worth saying.

When "--boundary" is passed, do they appear as if they are
available?  If not, then the above description is very misleading.

    During traversal, if an object that is referenced does not
    exist, pretend as if the reference itself does not exist,
    instead of dying of a repository corruption.  Running the
    command with the "--boundary" option makes these missing
    objects, together with the objects on the edge of revision
    ranges (i.e. true boundary objects), appear on the output,
    prefixed with '-'.

or something like that, perhaps?
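
For reference, this is how "true" boundary objects already appear in the output of stock git. The sketch below assumes nothing from the patch; the repository name `repo` and the three empty commits are illustrative only. It walks a two-commit range and captures the single line that is prefixed with '-'.

```shell
#!/bin/sh
# Sketch only: show the '-' prefix that --boundary puts on the commit
# sitting on the edge of a revision range.
set -e

tmp=$(mktemp -d)
cd "$tmp"

git init -q repo
git -C repo config user.name "Example"
git -C repo config user.email "example@example.com"

for n in 1 2 3
do
    git -C repo commit -q --allow-empty -m "commit $n"
done

# Two commits are inside the range; HEAD~2 sits on its edge.
git -C repo rev-list --boundary HEAD~2..HEAD >list-output

edge=$(git -C repo rev-parse HEAD~2)
boundary_lines=$(grep "^-" list-output)
total=$(wc -l <list-output | tr -d ' ')

echo "lines in output: $total"
echo "boundary line:   $boundary_lines"
echo "edge commit:     $edge"
```

The wording suggested above would have the missing objects show up in the same '-' prefixed form, alongside edge commits such as the one captured here.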

> +# With `--ignore-missing-links`, we stop the traversal when we encounter a
> +# missing link.
> +test_expect_success 'rev-list only prints main odb commits with --ignore-missing-links' '
> +	test_stdout_line_count = 5 git -C main rev-list --ignore-missing-links HEAD
> +'
> +
> +# With `--ignore-missing-links` and `--boundary`, we can even print those boundary
> +# commits.
> +test_expect_success 'rev-list prints boundary commit with --ignore-missing-links' '
> +	git -C main rev-list --ignore-missing-links --boundary HEAD >list-output &&
> +	test_stdout_line_count = 6 cat list-output &&
> +	test_stdout_line_count = 1 grep "^-" list-output
> +'

These tests are way too loose.  You do not merely want to see a
certain number of boundary objects; you _know_ exactly which object
should be on the boundary, and you should check for that instead.
That would catch, for example, a bug that writes commit 'A', which
refers to a missing commit 'B', as the boundary object when the
intent was to write the missing commit 'B'.
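
One way to tighten such a check is to compare the boundary line against the exact object name instead of counting lines. The following is a sketch only: `check_boundary` is a hypothetical helper, and the sample output uses made-up object names purely for illustration.

```shell
#!/bin/sh
# check_boundary <rev-list-output-file> <expected-oid>
# Hypothetical helper: succeeds only if the output contains exactly one
# boundary line and that line names the expected object.
check_boundary () {
    actual=$(grep "^-" "$1" || true)
    [ "$actual" = "-$2" ]
}

# Fabricated sample output (made-up object names, illustration only).
sample=$(mktemp)
cat >"$sample" <<\EOF
3333333333333333333333333333333333333333
2222222222222222222222222222222222222222
-1111111111111111111111111111111111111111
EOF

if check_boundary "$sample" 1111111111111111111111111111111111111111
then
    right_object=pass
else
    right_object=fail
fi

# If the expected boundary were a different object, a count-only check
# would still pass on this output; the exact comparison does not.
if check_boundary "$sample" 2222222222222222222222222222222222222222
then
    wrong_object=pass
else
    wrong_object=fail
fi

echo "expected object on the boundary: $right_object"
echo "different object expected:       $wrong_object"
```

In git's own test suite the same idea is usually expressed by writing the expected lines to a file and comparing it with the actual output using `test_cmp`.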

Thanks.
karthik nayak Sept. 12, 2023, 2:42 p.m. UTC | #2
On Fri, Sep 8, 2023 at 9:19 PM Junio C Hamano <gitster@pobox.com> wrote:
> The above description needs to be tightened up a bit, I think.
>
> What is left unsaid is that you arranged a repository to borrow from
> an alternate object directory (or two), and plan to walk objects
> with this bit on in the repository, while leaving the alternates
> disabled.  Without stating that you plan to disable the alternates
> while this mode of operation happens, nothing would happen when the
> traversal goes from the main to the alternate because no links are
> broken, no?
>

Fair enough, I agree with your points. I'll amend the message to highlight this
scenario.

> > By exposing this new flag `--ignore-missing-links`, users can set the
> > required env variables (GIT_OBJECT_DIRECTORY and
> > GIT_ALTERNATE_OBJECT_DIRECTORIES) along with the `--boundary` flag to
> > find the boundary objects between object directories.
>
> Since this command is plumbing, there is not much reason to object to
> surfacing features that already exist internally as command-line
> options.  Having said that,
>
>  * Suppose your traversal with --ignore-missing-links from the tip
>    of a branch reaches a tree object A, and the tree object A has a
>    link to a blob B and a blob C.  But B is in a separate object
>    store that you usually access via the alternate mechanism.
>    Instead of barfing "The repository is corrupt---object A points
>    at object B that does not exist", we pretend that A does not have
>    the link to B and keep traversing, discovering C and other
>    objects.
>
>    That much we can read from the above and also the documentation
>    part of the patch.  The interaction with --boundary needs to be
>    clarified in this description and the documentation, though.  It
>    is unclear if you show 'A' or 'B' in this scenario.

Do note that the `--boundary` option only works with commits. With that in
mind, `--ignore-missing-links` used together with `--boundary` doesn't even
traverse non-commit objects, which means corrupted trees/blobs shouldn't matter.

But I did realize that `--ignore-missing-links`, as this patch stands, is broken
when used alongside the `--objects` flag (`--boundary` doesn't work with
`--objects` at the moment; this is something I plan to tackle soon after with a
`--boundary-objects` flag). The second version of this patch will have a fix to
ensure that even non-commit objects are ignored during traversal if the
`--objects` option is used.

>
>  * Some traversals use the ignore-missing-links bit implicitly and
>    currently there is no way to turn it off.  Is it plausible that
>    a user may want to explicitly toggle it off, with the option
>    negated, i.e. --no-ignore-missing-links?  I do not immediately
>    see the utility of such an option, but that is only due to my
>    lack of imagination.  For now, I think it makes sense not to
>    allow negating this option, until somebody comes up with a useful
>    use case.
>

Agreed!

> > +--ignore-missing-links::
> > +     When an object points to another object that is missing, pretend as if the
> > +     link did not exist. These missing links are not written to stdout unless
> > +     the --boundary flag is passed.
>
> Does "git rev-list" ever write "links"?  I thought not.
>
> "These missing objects are not written" would be more sensible, but
> we never write missing objects with or without the option, so it
> is not even worth saying.
>
> When "--boundary" is passed, do they appear as if they are
> available?  If not, then the above description is very misleading.
>
>     During traversal, if an object that is referenced does not
>     exist, pretend as if the reference itself does not exist,
>     instead of dying of a repository corruption.  Running the
>     command with the "--boundary" option makes these missing
>     objects, together with the objects on the edge of revision
>     ranges (i.e. true boundary objects), appear on the output,
>     prefixed with '-'.
>
> or something like that, perhaps?
>

This indeed is better, I've copied and modified it as needed.

> > +# With `--ignore-missing-links`, we stop the traversal when we encounter a
> > +# missing link.
> > +test_expect_success 'rev-list only prints main odb commits with --ignore-missing-links' '
> > +     test_stdout_line_count = 5 git -C main rev-list --ignore-missing-links HEAD
> > +'
> > +
> > +# With `--ignore-missing-links` and `--boundary`, we can even print those boundary
> > +# commits.
> > +test_expect_success 'rev-list prints boundary commit with --ignore-missing-links' '
> > +     git -C main rev-list --ignore-missing-links --boundary HEAD >list-output &&
> > +     test_stdout_line_count = 6 cat list-output &&
> > +     test_stdout_line_count = 1 grep "^-" list-output
> > +'
>
> These tests are way too loose.  You do not merely want to see a
> certain number of boundary objects; you _know_ exactly which object
> should be on the boundary, and you should check for that instead.
> That would catch, for example, a bug that writes commit 'A', which
> refers to a missing commit 'B', as the boundary object when the
> intent was to write the missing commit 'B'.
>

Fair enough, I will make them more specific and add some tests for
missing trees/blobs.

> Thanks.

Thank you for the review. Will send the next version of the patch soon :)

Patch

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index a4a0cb93b2..a0b48db8a8 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -227,6 +227,11 @@  explicitly.
 	Upon seeing an invalid object name in the input, pretend as if
 	the bad input was not given.
 
+--ignore-missing-links::
+	When an object points to another object that is missing, pretend as if the
+	link did not exist. These missing links are not written to stdout unless
+	the --boundary flag is passed.
+
 ifndef::git-rev-list[]
 --bisect::
 	Pretend as if the bad bisection ref `refs/bisect/bad`
diff --git a/revision.c b/revision.c
index 2f4c53ea20..cbfcbf6e28 100644
--- a/revision.c
+++ b/revision.c
@@ -2595,6 +2595,8 @@  static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->limited = 1;
 	} else if (!strcmp(arg, "--ignore-missing")) {
 		revs->ignore_missing = 1;
+	} else if (!strcmp(arg, "--ignore-missing-links")) {
+		revs->ignore_missing_links = 1;
 	} else if (opt && opt->allow_exclude_promisor_objects &&
 		   !strcmp(arg, "--exclude-promisor-objects")) {
 		if (fetch_if_missing)
diff --git a/t/t6022-rev-list-alternates.sh b/t/t6022-rev-list-alternates.sh
new file mode 100755
index 0000000000..626ebb2dce
--- /dev/null
+++ b/t/t6022-rev-list-alternates.sh
@@ -0,0 +1,43 @@ 
+#!/bin/sh
+
+test_description='handling of alternates in rev-list'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# We create 5 commits and move them to the alt directory and
+# create 5 more commits which will stay in the main odb.
+test_expect_success 'create repository and alternate directory' '
+	git init main &&
+	test_commit_bulk -C main 5 &&
+	mkdir alt &&
+	mv main/.git/objects/* alt &&
+	GIT_ALTERNATE_OBJECT_DIRECTORIES=$PWD/alt test_commit_bulk --start=6 -C main 5
+'
+
+# When the alternate odb is provided, all commits are listed.
+test_expect_success 'rev-list passes with alternate object directory' '
+	GIT_ALTERNATE_OBJECT_DIRECTORIES=$PWD/alt test_stdout_line_count = 10 git -C main rev-list HEAD
+'
+
+# When the alternate odb is not provided, rev-list fails since the 5th commit's
+# parent is not present in the main odb.
+test_expect_success 'rev-list fails without alternate object directory' '
+	test_must_fail git -C main rev-list HEAD
+'
+
+# With `--ignore-missing-links`, we stop the traversal when we encounter a
+# missing link.
+test_expect_success 'rev-list only prints main odb commits with --ignore-missing-links' '
+	test_stdout_line_count = 5 git -C main rev-list --ignore-missing-links HEAD
+'
+
+# With `--ignore-missing-links` and `--boundary`, we can even print those boundary
+# commits.
+test_expect_success 'rev-list prints boundary commit with --ignore-missing-links' '
+	git -C main rev-list --ignore-missing-links --boundary HEAD >list-output &&
+	test_stdout_line_count = 6 cat list-output &&
+	test_stdout_line_count = 1 grep "^-" list-output
+'
+
+test_done