diff mbox series

[v7,2/2] diff-files: integrate with sparse index

Message ID 20230322161820.3609-3-cheskaqiqi@gmail.com (mailing list archive)
State New, archived
Headers show
Series diff-files: integrate with sparse index | expand

Commit Message

Shuqi Liang March 22, 2023, 4:18 p.m. UTC
Originally, diff-files a pathspec that is out-of-cone in a sparse-index
environment, Git dies with "pathspec '<x>' did not match any files",
mainly because it does not expand the index so nothing is matched.
Expand the index when the <pathspec> needs an expanded index, i.e. the
<pathspec> contains wildcard that may need a full-index or the
<pathspec> is simply outside of sparse-checkout definition.

Remove full index requirement for `git diff-files`.Create an
'ensure_expanded' to handle silent failures. Add test to
ensure the index only expanded when necessary in `git diff-files`.

The `p2000` tests demonstrate a ~96% execution time reduction for 'git
diff-files' and a ~97% execution time reduction for 'git diff-files'
for a file using a sparse index:

Test                                           before  after
-----------------------------------------------------------------
2000.78: git diff-files (full-v3)              0.09    0.08 -11.1%
2000.79: git diff-files (full-v4)              0.09    0.09 +0.0%
2000.80: git diff-files (sparse-v3)            0.52    0.02 -96.2%
2000.81: git diff-files (sparse-v4)            0.51    0.02 -96.1%
2000.82: git diff-files f2/f4/a (full-v3)      0.06    0.07 +16.7%
2000.83: git diff-files f2/f4/a (full-v4)      0.08    0.08 +0.0%
2000.84: git diff-files f2/f4/a (sparse-v3)    0.46    0.01 -97.8%
2000.85: git diff-files f2/f4/a (sparse-v4)    0.51    0.02 -96.1%

Signed-off-by: Shuqi Liang <cheskaqiqi@gmail.com>
---
 builtin/diff-files.c                     |  8 ++++
 t/perf/p2000-sparse-operations.sh        |  2 +
 t/t1092-sparse-checkout-compatibility.sh | 52 ++++++++++++++++++++++++
 3 files changed, 62 insertions(+)

Comments

Victoria Dye April 13, 2023, 9:54 p.m. UTC | #1
Shuqi Liang wrote:
> diff --git a/builtin/diff-files.c b/builtin/diff-files.c
> index dc991f753b..db90592090 100644
> --- a/builtin/diff-files.c
> +++ b/builtin/diff-files.c
> @@ -27,6 +27,10 @@ int cmd_diff_files(int argc, const char **argv, const char *prefix)
>  		usage(diff_files_usage);
>  
>  	git_config(git_diff_basic_config, NULL); /* no "diff" UI options */
> +
> +	prepare_repo_settings(the_repository);
> +	the_repository->settings.command_requires_full_index = 0;
> +
>  	repo_init_revisions(the_repository, &rev, prefix);
>  	rev.abbrev = 0;
>  
> @@ -80,6 +84,10 @@ int cmd_diff_files(int argc, const char **argv, const char *prefix)
>  		result = -1;
>  		goto cleanup;
>  	}
> +
> +	if (pathspec_needs_expanded_index(the_repository->index, &rev.diffopt.pathspec))
> +		ensure_full_index(the_repository->index);

After reviewing the 'diff-index' integration [1], I'm wondering whether we
actually need pathspec expansion at all in this case. 'diff-files' compares
the working tree and the index, and will output a difference if the file on
disk differs from the index. But, if a file is SKIP_WORKTREE'd, that diff
will (I think?) always be empty. So, why would we need to expand a sparse
directory to match a pathspec to its contents if we *know* that the diff
will be empty?

[1] https://lore.kernel.org/git/20230408112342.404318-1-nanth.raghul@gmail.com/

> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index d23041e27a..152f3f752e 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -1401,6 +1401,30 @@ ensure_not_expanded () {
>  	test_region ! index ensure_full_index trace2.txt
>  }
>  
> +ensure_expanded () {
> +	rm -f trace2.txt &&
> +	if test -z "$WITHOUT_UNTRACKED_TXT"
> +	then
> +		echo >>sparse-index/untracked.txt
> +	fi &&
> +
> +	if test "$1" = "!"
> +	then
> +		shift &&
> +		test_must_fail env \
> +			GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
> +			git -C sparse-index "$@" \
> +			>sparse-index-out \
> +			2>sparse-index-error || return 1
> +	else
> +		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
> +			git -C sparse-index "$@" \
> +			>sparse-index-out \
> +			2>sparse-index-error || return 1
> +	fi &&
> +	test_region index ensure_full_index trace2.txt
> +}

This implementation duplicates a lot of the code from 'ensure_not_expanded'.
Can 'ensure_expanded' and 'ensure_not_expanded' be refactored to call a
common helper function (which contains the common code) instead?

> +
>  test_expect_success 'sparse-index is not expanded' '
>  	init_repos &&
>  
> @@ -2101,4 +2125,32 @@ test_expect_success 'diff-files with pathspec outside sparse definition' '
>  	test_all_match git diff-files "folder*/a" 
>  '
>  
> +test_expect_success 'diff-files pathspec expands index when necessary' '
> +	init_repos &&
> +
> +	write_script edit-contents <<-\EOF &&
> +	echo text >>"$1"
> +	EOF
> +
> +	run_on_all ../edit-contents deep/a &&
> +	
> +	# pathspec that should expand index
> +	ensure_expanded diff-files "*/a" &&
> +	ensure_expanded diff-files "**a" 

Similar to the comments in my 'diff-index' review [2]:

- The '**' in the pathspec doesn't do anything special unless using an
  explicit ':(glob)' pathspec. To make it clear that you're not trying to
  use a glob pathspec, you can use '*a' instead.
- Why are these pathspecs in quotes, but those in 'sparse index is not
  expanded: diff-files' are?

[2] https://lore.kernel.org/git/62821012-4fc3-5ad8-695c-70f7ab14a8c9@github.com/

> +'
> +
> +test_expect_success 'sparse index is not expanded: diff-files' '
> +	init_repos &&
> +
> +	write_script edit-contents <<-\EOF &&
> +	echo text >>"$1"
> +	EOF
> +
> +	run_on_all ../edit-contents deep/a &&
> +
> +	ensure_not_expanded diff-files &&
> +	ensure_not_expanded diff-files deep/a &&
> +	ensure_not_expanded diff-files deep/*
> +'
> +
>  test_done
Shuqi Liang April 20, 2023, 4:50 a.m. UTC | #2
Hi Victoria,

Sorry for the late reply. I'm still in the middle of my final exams period.

On Thu, Apr 13, 2023 at 5:55 PM Victoria Dye <vdye@github.com> wrote:
>
> Shuqi Liang wrote:
> > diff --git a/builtin/diff-files.c b/builtin/diff-files.c
> > index dc991f753b..db90592090 100644
> > --- a/builtin/diff-files.c
> > +++ b/builtin/diff-files.c
> > @@ -27,6 +27,10 @@ int cmd_diff_files(int argc, const char **argv, const char *prefix)
> >               usage(diff_files_usage);
> >
> >       git_config(git_diff_basic_config, NULL); /* no "diff" UI options */
> > +
> > +     prepare_repo_settings(the_repository);
> > +     the_repository->settings.command_requires_full_index = 0;
> > +
> >       repo_init_revisions(the_repository, &rev, prefix);
> >       rev.abbrev = 0;
> >
> > @@ -80,6 +84,10 @@ int cmd_diff_files(int argc, const char **argv, const char *prefix)
> >               result = -1;
> >               goto cleanup;
> >       }
> > +
> > +     if (pathspec_needs_expanded_index(the_repository->index, &rev.diffopt.pathspec))
> > +             ensure_full_index(the_repository->index);
>
> After reviewing the 'diff-index' integration [1], I'm wondering whether we
> actually need pathspec expansion at all in this case. 'diff-files' compares
> the working tree and the index, and will output a difference if the file on
> disk differs from the index. But, if a file is SKIP_WORKTREE'd, that diff
> will (I think?) always be empty. So, why would we need to expand a sparse
> directory to match a pathspec to its contents if we *know* that the diff
> will be empty?
>
> [1] https://lore.kernel.org/git/20230408112342.404318-1-nanth.raghul@gmail.com/


It's true that in the case of 'diff-files', expanding the sparse directory to
match a pathspec to its contents might not be necessary. If we don't use
pathspec expansion in this case. It could optimize for performance better.

However, there could be some edge cases. if a user manually modifies the
contents of a SKIP_WORKTREE file in the working tree, the diff between
the working tree and the index would no longer be empty. So I think, In this
case, expanding the sparse directory might still be necessary to ensure the
correct behavior of the 'diff-files' command.



> > diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> > index d23041e27a..152f3f752e 100755
> > --- a/t/t1092-sparse-checkout-compatibility.sh
> > +++ b/t/t1092-sparse-checkout-compatibility.sh
> > @@ -1401,6 +1401,30 @@ ensure_not_expanded () {
> >       test_region ! index ensure_full_index trace2.txt
> >  }
> >
> > +ensure_expanded () {
> > +     rm -f trace2.txt &&
> > +     if test -z "$WITHOUT_UNTRACKED_TXT"
> > +     then
> > +             echo >>sparse-index/untracked.txt
> > +     fi &&
> > +
> > +     if test "$1" = "!"
> > +     then
> > +             shift &&
> > +             test_must_fail env \
> > +                     GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
> > +                     git -C sparse-index "$@" \
> > +                     >sparse-index-out \
> > +                     2>sparse-index-error || return 1
> > +     else
> > +             GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
> > +                     git -C sparse-index "$@" \
> > +                     >sparse-index-out \
> > +                     2>sparse-index-error || return 1
> > +     fi &&
> > +     test_region index ensure_full_index trace2.txt
> > +}
>
> This implementation duplicates a lot of the code from 'ensure_not_expanded'.
> Can 'ensure_expanded' and 'ensure_not_expanded' be refactored to call a
> common helper function (which contains the common code) instead?

Will do!

> > +
> >  test_expect_success 'sparse-index is not expanded' '
> >       init_repos &&
> >
> > @@ -2101,4 +2125,32 @@ test_expect_success 'diff-files with pathspec outside sparse definition' '
> >       test_all_match git diff-files "folder*/a"
> >  '
> >
> > +test_expect_success 'diff-files pathspec expands index when necessary' '
> > +     init_repos &&
> > +
> > +     write_script edit-contents <<-\EOF &&
> > +     echo text >>"$1"
> > +     EOF
> > +
> > +     run_on_all ../edit-contents deep/a &&
> > +
> > +     # pathspec that should expand index
> > +     ensure_expanded diff-files "*/a" &&
> > +     ensure_expanded diff-files "**a"
>
> Similar to the comments in my 'diff-index' review [2]:
>
> - The '**' in the pathspec doesn't do anything special unless using an
>   explicit ':(glob)' pathspec. To make it clear that you're not trying to
>   use a glob pathspec, you can use '*a' instead.

Will do !

> - Why are these pathspecs in quotes, but those in 'sparse index is not
>   expanded: diff-files' are?
>
> [2] https://lore.kernel.org/git/62821012-4fc3-5ad8-695c-70f7ab14a8c9@github.com/


I quote around the pathspec  to prevent shell expansion  of the pathspec
patterns by the shell before they are passed to the git command. "*a" and
"*/a"  have special characters ' * '. I use quotes to tell the shell
to treat them
as regular characters.

In 'sparse index is not expanded: diff-files''deep/a'  does not contain any
special characters that the shell would try to expand. So I use it without
double quotes. And  I think I need to add a double quote to 'deep/*'.

Thanks,
Shuqi
Victoria Dye April 20, 2023, 3:26 p.m. UTC | #3
Shuqi Liang wrote:
> Hi Victoria,
> 
> Sorry for the late reply. I'm still in the middle of my final exams period.

No problem at all, thanks for following up!

> It's true that in the case of 'diff-files', expanding the sparse directory to
> match a pathspec to its contents might not be necessary. If we don't use
> pathspec expansion in this case. It could optimize for performance better.
> 
> However, there could be some edge cases. if a user manually modifies the
> contents of a SKIP_WORKTREE file in the working tree, the diff between
> the working tree and the index would no longer be empty. So I think, In this
> case, expanding the sparse directory might still be necessary to ensure the
> correct behavior of the 'diff-files' command.

If a user manually modifies a SKIP_WORKTREE file, SKIP_WORKTREE will be
removed from the file and the index expanded automatically [1]. If that
mechanism is working properly, there would be no need to manually check the
pathspec and expand the index.

Have you tried removing the 'pathspec_needs_expanded_index()' and running
the tests? If so, is 'diff-files' producing incorrect results? 

[1] https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
Shuqi Liang April 21, 2023, 1:10 a.m. UTC | #4
Hi Victoria,

> If a user manually modifies a SKIP_WORKTREE file, SKIP_WORKTREE will be
> removed from the file and the index expanded automatically [1]. If that
> mechanism is working properly, there would be no need to manually check the
> pathspec and expand the index.
>
> Have you tried removing the 'pathspec_needs_expanded_index()' and running
> the tests? If so, is 'diff-files' producing incorrect results?
>
> [1] https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/

As per your suggestion, I tried removing pathspec_needs_expanded_index()
from the code, and 'diff-files pathspec expands index when necessary'
test failed.

So, I'm thinking about keeping it to ensure everything works properly.
I'd like to know your thoughts on this. Should we keep it or explore
other options?

Thanks,
Shuqi
Victoria Dye April 21, 2023, 9:26 p.m. UTC | #5
Shuqi Liang wrote:
> Hi Victoria,
> 
>> If a user manually modifies a SKIP_WORKTREE file, SKIP_WORKTREE will be
>> removed from the file and the index expanded automatically [1]. If that
>> mechanism is working properly, there would be no need to manually check the
>> pathspec and expand the index.
>>
>> Have you tried removing the 'pathspec_needs_expanded_index()' and running
>> the tests? If so, is 'diff-files' producing incorrect results?
>>
>> [1] https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
> 
> As per your suggestion, I tried removing pathspec_needs_expanded_index()
> from the code, and 'diff-files pathspec expands index when necessary'
> test failed.
> 
> So, I'm thinking about keeping it to ensure everything works properly.
> I'd like to know your thoughts on this. Should we keep it or explore
> other options?

Did the test fail because the index wasn't expanded in a case where you
previously expended it to be expanded? Or because of the returned results of
'diff-files' are invalid?

Only the latter represents incorrect behavior. If we're aren't expanding the
index for a case that was causing index expansion before *and* the
user-facing behavior is as-expected, that's the best-case scenario for a
sparse index integration!

Taking a step back, it's important to remember that the overarching goal of
the project is not just to switch 'command_requires_full_index' to '0'
everywhere, but to find all of the places where Git is working with the
index and make sure that work can be done on a sparse directory.

In most cases, it's possible to adapt an index-related operation to work
with sparse directories (albeit with varying levels of complexity). The use
of 'ensure_full_index()' is reserved for cases where it is _impossible_ to
make Git perform a given action on a sparse directory - expanding the index
completely eliminates the performance gains had by using a sparse index, so
it should be avoided at all costs.

I hope that helps!

> 
> Thanks,
> Shuqi
Shuqi Liang April 22, 2023, 9:25 p.m. UTC | #6
Hi Victoria,

On Fri, Apr 21, 2023 at 5:26 PM Victoria Dye <vdye@github.com> wrote:
> Only the latter represents incorrect behavior. If we're aren't expanding the
> index for a case that was causing index expansion before *and* the
> user-facing behavior is as-expected, that's the best-case scenario for a
> sparse index integration!
>
> Taking a step back, it's important to remember that the overarching goal of
> the project is not just to switch 'command_requires_full_index' to '0'
> everywhere, but to find all of the places where Git is working with the
> index and make sure that work can be done on a sparse directory.
>
> In most cases, it's possible to adapt an index-related operation to work
> with sparse directories (albeit with varying levels of complexity). The use
> of 'ensure_full_index()' is reserved for cases where it is _impossible_ to
> make Git perform a given action on a sparse directory - expanding the index
> completely eliminates the performance gains had by using a sparse index, so
> it should be avoided at all costs.
>
> I hope that helps!

Thanks for reminding me about the ultimate goal of sparse index
integration! I've learned a lot from it. After looking into the test
failure, it seems that the index didn't expand in cases where I expected
it to. I'll go ahead and update my patch.

Thanks,
Shuqi
diff mbox series

Patch

diff --git a/builtin/diff-files.c b/builtin/diff-files.c
index dc991f753b..db90592090 100644
--- a/builtin/diff-files.c
+++ b/builtin/diff-files.c
@@ -27,6 +27,10 @@  int cmd_diff_files(int argc, const char **argv, const char *prefix)
 		usage(diff_files_usage);
 
 	git_config(git_diff_basic_config, NULL); /* no "diff" UI options */
+
+	prepare_repo_settings(the_repository);
+	the_repository->settings.command_requires_full_index = 0;
+
 	repo_init_revisions(the_repository, &rev, prefix);
 	rev.abbrev = 0;
 
@@ -80,6 +84,10 @@  int cmd_diff_files(int argc, const char **argv, const char *prefix)
 		result = -1;
 		goto cleanup;
 	}
+
+	if (pathspec_needs_expanded_index(the_repository->index, &rev.diffopt.pathspec))
+		ensure_full_index(the_repository->index);
+
 	result = run_diff_files(&rev, options);
 	result = diff_result_code(&rev.diffopt, result);
 cleanup:
diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 3242cfe91a..82751f2ca3 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -125,5 +125,7 @@  test_perf_on_all git checkout-index -f --all
 test_perf_on_all git update-index --add --remove $SPARSE_CONE/a
 test_perf_on_all "git rm -f $SPARSE_CONE/a && git checkout HEAD -- $SPARSE_CONE/a"
 test_perf_on_all git grep --cached --sparse bogus -- "f2/f1/f1/*"
+test_perf_on_all git diff-files
+test_perf_on_all git diff-files $SPARSE_CONE/a
 
 test_done
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index d23041e27a..152f3f752e 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1401,6 +1401,30 @@  ensure_not_expanded () {
 	test_region ! index ensure_full_index trace2.txt
 }
 
+ensure_expanded () {
+	rm -f trace2.txt &&
+	if test -z "$WITHOUT_UNTRACKED_TXT"
+	then
+		echo >>sparse-index/untracked.txt
+	fi &&
+
+	if test "$1" = "!"
+	then
+		shift &&
+		test_must_fail env \
+			GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+			git -C sparse-index "$@" \
+			>sparse-index-out \
+			2>sparse-index-error || return 1
+	else
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+			git -C sparse-index "$@" \
+			>sparse-index-out \
+			2>sparse-index-error || return 1
+	fi &&
+	test_region index ensure_full_index trace2.txt
+}
+
 test_expect_success 'sparse-index is not expanded' '
 	init_repos &&
 
@@ -2101,4 +2125,32 @@  test_expect_success 'diff-files with pathspec outside sparse definition' '
 	test_all_match git diff-files "folder*/a" 
 '
 
+test_expect_success 'diff-files pathspec expands index when necessary' '
+	init_repos &&
+
+	write_script edit-contents <<-\EOF &&
+	echo text >>"$1"
+	EOF
+
+	run_on_all ../edit-contents deep/a &&
+	
+	# pathspec that should expand index
+	ensure_expanded diff-files "*/a" &&
+	ensure_expanded diff-files "**a" 
+'
+
+test_expect_success 'sparse index is not expanded: diff-files' '
+	init_repos &&
+
+	write_script edit-contents <<-\EOF &&
+	echo text >>"$1"
+	EOF
+
+	run_on_all ../edit-contents deep/a &&
+
+	ensure_not_expanded diff-files &&
+	ensure_not_expanded diff-files deep/a &&
+	ensure_not_expanded diff-files deep/*
+'
+
 test_done