mbox series

[v7,0/2] diff-files: integrate with sparse index

Message ID 20230322161820.3609-1-cheskaqiqi@gmail.com (mailing list archive)
Headers show
Series diff-files: integrate with sparse index | expand

Message

Shuqi Liang March 22, 2023, 4:18 p.m. UTC
Changes since v6:

1. Fix word wrap in commit message.

2. Use  'mkdir -p folder1' since full-checkout already have folder1.

3. Use `--stat` to ignore file creation time differences in unrefreshed
index.

4. In 'diff-files with pathspec outside sparse definition' add 
'git diff-files "folder*/a" to show that the result is the same with a 
wildcard pathspec.

5. Create an 'ensure_expanded' to handle silent failures.


Shuqi Liang (2):
  t1092: add tests for `git diff-files`
  diff-files: integrate with sparse index

 builtin/diff-files.c                     |  8 ++
 t/perf/p2000-sparse-operations.sh        |  2 +
 t/t1092-sparse-checkout-compatibility.sh | 98 ++++++++++++++++++++++++
 3 files changed, 108 insertions(+)

Range-diff against v6:
1:  2a994e60bc ! 1:  e2dcf9921e t1092: add tests for `git diff-files`
    @@ Metadata
      ## Commit message ##
         t1092: add tests for `git diff-files`
     
    -    Before integrating the 'git diff-files' builtin
    -    with the sparse index feature, add tests to
    -    t1092-sparse-checkout-compatibility.sh to ensure it currently works
    -    with sparse-checkout and will still work with sparse index
    -    after that integration.
    +    Before integrating the 'git diff-files' builtin with the sparse index
    +    feature, add tests to t1092-sparse-checkout-compatibility.sh to ensure
    +    it currently works with sparse-checkout and will still work with sparse
    +    index after that integration.
     
    -    When adding tests against a sparse-checkout
    -    definition, we test two modes: all changes are
    -    within the sparse-checkout cone and some changes are outside
    -    the sparse-checkout cone.
    +    When adding tests against a sparse-checkout definition, we test two
    +    modes: all changes are within the sparse-checkout cone and some changes
    +    are outside the sparse-checkout cone.
     
    -    In order to have staged changes outside of
    -    the sparse-checkout cone, make a directory called 'folder1' and
    -    copy `a` into 'folder1/a'. 'folder1/a' is identical to `a` in the base
    -    commit. These make 'folder1/a' in the index, while leaving it outside of
    -    the sparse-checkout definition. Test 'folder1/a'being present on-disk
    +    In order to have staged changes outside of the sparse-checkout cone,
    +    make a directory called 'folder1' and copy `a` into 'folder1/a'.
    +    'folder1/a' is identical to `a` in the base commit. These make
    +    'folder1/a' in the index, while leaving it outside of the
    +    sparse-checkout definition. Test 'folder1/a'being present on-disk
         without modifications, then change content inside 'folder1/a' in order
         to test 'folder1/a' being present on-disk with modifications.
     
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep sparse direc
     +	test_all_match git diff-files deep/*
     +'
     +
    -+test_expect_failure 'diff-files with pathspec outside sparse definition' '
    ++test_expect_success 'diff-files with pathspec outside sparse definition' '
     +	init_repos &&
     +
     +	test_sparse_match test_must_fail git diff-files folder2/a &&
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'grep sparse direc
     +	# Add file to the index but outside of cone for sparse-checkout cases.
     +	# Add file to the index without sparse-checkout cases to ensure all have 
     +	# same output.
    -+	run_on_all mkdir folder1 &&
    ++	run_on_all mkdir -p folder1 &&
     +	run_on_all cp a folder1/a &&
     +
     +	# file present on-disk without modifications
    -+	test_all_match git diff-files &&
    -+	test_all_match git diff-files folder1/a &&
    ++	# use `--stat` to ignore file creation time differences in
    ++	# unrefreshed index
    ++	test_all_match git diff-files --stat &&
    ++	test_all_match git diff-files --stat folder1/a &&
    ++	test_all_match git diff-files --stat "folder*/a" &&
     +
     +	# file present on-disk with modifications
     +	run_on_all ../edit-contents folder1/a &&
     +	test_all_match git diff-files &&
    -+	test_all_match git diff-files folder1/a
    ++	test_all_match git diff-files folder1/a &&
    ++	test_all_match git diff-files "folder*/a" 
     +'
     +
      test_done
2:  ac730e372d ! 2:  fb8edaf583 diff-files: integrate with sparse index
    @@ Commit message
         <pathspec> contains wildcard that may need a full-index or the
         <pathspec> is simply outside of sparse-checkout definition.
     
    -    Remove full index requirement for `git diff-files`
    -    and add test to ensure the index only expanded when necessary
    -    in `git diff-files`.
    +    Remove full index requirement for `git diff-files`.Create an
    +    'ensure_expanded' to handle silent failures. Add test to
    +    ensure the index only expanded when necessary in `git diff-files`.
     
         The `p2000` tests demonstrate a ~96% execution time reduction for 'git
         diff-files' and a ~97% execution time reduction for 'git diff-files'
    @@ builtin/diff-files.c: int cmd_diff_files(int argc, const char **argv, const char
     +
     +	if (pathspec_needs_expanded_index(the_repository->index, &rev.diffopt.pathspec))
     +		ensure_full_index(the_repository->index);
    -+		
    ++
      	result = run_diff_files(&rev, options);
      	result = diff_result_code(&rev.diffopt, result);
      cleanup:
    @@ t/perf/p2000-sparse-operations.sh: test_perf_on_all git checkout-index -f --all
      test_done
     
      ## t/t1092-sparse-checkout-compatibility.sh ##
    -@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_failure 'diff-files with pathspec outside sparse definition' '
    - 	test_all_match git diff-files folder1/a
    +@@ t/t1092-sparse-checkout-compatibility.sh: ensure_not_expanded () {
    + 	test_region ! index ensure_full_index trace2.txt
    + }
    + 
    ++ensure_expanded () {
    ++	rm -f trace2.txt &&
    ++	if test -z "$WITHOUT_UNTRACKED_TXT"
    ++	then
    ++		echo >>sparse-index/untracked.txt
    ++	fi &&
    ++
    ++	if test "$1" = "!"
    ++	then
    ++		shift &&
    ++		test_must_fail env \
    ++			GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
    ++			git -C sparse-index "$@" \
    ++			>sparse-index-out \
    ++			2>sparse-index-error || return 1
    ++	else
    ++		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
    ++			git -C sparse-index "$@" \
    ++			>sparse-index-out \
    ++			2>sparse-index-error || return 1
    ++	fi &&
    ++	test_region index ensure_full_index trace2.txt
    ++}
    ++
    + test_expect_success 'sparse-index is not expanded' '
    + 	init_repos &&
    + 
    +@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'diff-files with pathspec outside sparse definition' '
    + 	test_all_match git diff-files "folder*/a" 
      '
      
     +test_expect_success 'diff-files pathspec expands index when necessary' '
    @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_failure 'diff-files with p
     +	run_on_all ../edit-contents deep/a &&
     +	
     +	# pathspec that should expand index
    -+	! ensure_not_expanded diff-files "*/a" &&
    -+	test_must_be_empty sparse-index-err &&
    -+
    -+	! ensure_not_expanded diff-files "**a" &&
    -+	test_must_be_empty sparse-index-err
    ++	ensure_expanded diff-files "*/a" &&
    ++	ensure_expanded diff-files "**a" 
     +'
     +
     +test_expect_success 'sparse index is not expanded: diff-files' '

Comments

Junio C Hamano March 22, 2023, 11:36 p.m. UTC | #1
Shuqi Liang <cheskaqiqi@gmail.com> writes:

> 3. Use `--stat` to ignore file creation time differences in unrefreshed
> index.

I am curious about this one.  Why is this a preferred solution over
say "run 'update-index --refresh' before running diff-files"?

Note that this is merely "I am curious", not "I think it is wrong".

Thanks.
Shuqi Liang March 23, 2023, 7:42 a.m. UTC | #2
On Wed, Mar 22, 2023 at 7:36 PM Junio C Hamano <gitster@pobox.com> wrote:

> > 3. Use `--stat` to ignore file creation time differences in unrefreshed
> > index.
>
> I am curious about this one.  Why is this a preferred solution over
> say "run 'update-index --refresh' before running diff-files"?
>
> Note that this is merely "I am curious", not "I think it is wrong".

Hi Junio

Thank you for your question, it has prompted me to consider the matter
further =)  I think both solutions, using git diff-files --stat and using git
update-index --refresh before git diff-files, can produce the same output but
in different ways.

When the index file is not up-to-date, git diff-files may show differences
between the working directory and the index that are caused by file creation
time differences, rather than actual changes to the file contents. By using git
diff-files --stat, which ignores file creation time differences.

While 'git update-index --refresh' updates the index file to match the contents
of the working tree. By running this command before git diff-files, we can
ensure that the index file is up-to-date and that the output of git diff-files
accurately reflects the differences between the working directory and the index.

Maybe using git update-index --refresh would be more direct and
straightforward solution.

(Hi Victoria, do you have any comments?  =)


Thanks
Shuqi
Junio C Hamano March 23, 2023, 4:03 p.m. UTC | #3
Shuqi Liang <cheskaqiqi@gmail.com> writes:

> When the index file is not up-to-date, git diff-files may show differences
> between the working directory and the index that are caused by file creation
> time differences, rather than actual changes to the file contents. By using git
> diff-files --stat, which ignores file creation time differences.

Use of "diff-files --stat" would mean that the contents of the blob
registered in the index will be inspected, which can be used to hide
the "stat dirty" condition.

But doesn't it cut both ways?  Starting from a clean index that has
up-to-date stat information for paths, we may want to test what
"stat dirty" changes diff-files reports when we touch paths in the
working tree, both inside and outside the spase cones.  A test with
"--stat" will not achieve that, exactly because it does not pay
attention to and hides the stat dirtiness.

On the other hand, if "update-index --refresh" is used in the test,
we may discover breakages caused by "update-index" not handling
the sparse index correctly.  It would be outside the topic of this
series, so avoiding it would be simpler, but (1) if it is not broken,
then as you said, it would be a more direct way to test diff-files,
and (2) if it is broken, it would need to be fixed anyway, before or
after this series.  So, I dunno...

Thanks.
Victoria Dye March 23, 2023, 5:25 p.m. UTC | #4
Shuqi Liang wrote:
> On Wed, Mar 22, 2023 at 7:36 PM Junio C Hamano <gitster@pobox.com> wrote:
> 
>>> 3. Use `--stat` to ignore file creation time differences in unrefreshed
>>> index.
>>
>> I am curious about this one.  Why is this a preferred solution over
>> say "run 'update-index --refresh' before running diff-files"?
>>
>> Note that this is merely "I am curious", not "I think it is wrong".
> 
> Hi Junio
> 
> Thank you for your question, it has prompted me to consider the matter
> further =)  I think both solutions, using git diff-files --stat and using git
> update-index --refresh before git diff-files, can produce the same output but
> in different ways.

While they'll (ideally) give the same user-facing result, there is a
difference in how they exercise 'diff-files' because of how 'update-index
--refresh' will affect SKIP_WORKTREE and sparse directories.

Using the same scenario you've set up for your test, suppose I start with a
fresh copy of the 't1092' repo. In the 'sparse-index' repo copy, 'folder1/'
will be a sparse directory:

$ git ls-files -t --sparse folder1/
S folder1/

(note: "S" indicates that SKIP_WORKTREE is applied to the entry)

Now suppose I copy 'a' into 'folder1/' and run 'update-index --refresh'
then 'ls-files' again:

$ git update-index --refresh
$ git ls-files -t --sparse folder1/
S folder1/0/
H folder1/a

(note: "H" indicates that 'folder1/a' does not have SKIP_WORKTREE applied)

The sparse directory has been expanded and SKIP_WORKTREE has been removed
from the file that's now present on-disk. This was an intentional "safety"
measure added in [1] to address the growing volume of bugs and complexities
in scenarios where SKIP_WORKTREE files existed on disk.

Ultimately, the main difference between this test with & without
'update-index' is who applies those index corrections when initially reading
the index: 'update-index' or 'diff-files'. I lean towards the latter because
the former is tested (almost identically) in 'update-index modify outside
sparse definition' earlier in 't1092'.

[1] https://lore.kernel.org/git/pull.1114.v2.git.1642175983.gitgitgadget@gmail.com/

> 
> When the index file is not up-to-date, git diff-files may show differences
> between the working directory and the index that are caused by file creation
> time differences, rather than actual changes to the file contents. By using git
> diff-files --stat, which ignores file creation time differences.

More or less, yes. Internally, 'diff-files' will "see" the file creation
differences, but the '--stat' format doesn't print them.

> 
> While 'git update-index --refresh' updates the index file to match the contents
> of the working tree. By running this command before git diff-files, we can
> ensure that the index file is up-to-date and that the output of git diff-files
> accurately reflects the differences between the working directory and the index.

This isn't quite true - 'update-index' only updates the *contents* of index
entries (or, colloquially, "stage them for commit") for files explicitly
provided as arguments. Separately, though, '--refresh' updates *all* index
entries' cached 'stat' information. 

Going a bit deeper: with no arguments, 'update-index' will read the index,
do nothing to it, then write it only if something has changed. In almost all
cases, reading the index doesn't cause any changes to it, making it a no-op.
However, the removal of SKIP_WORKTREE is done on read (including a refresh
of the entry's stat information), so a even plain 'update-index' *without*
'--refresh' would write a modified index to disk. In your test, that means:

	run_on_sparse mkdir -p folder1 &&
	run_on_sparse cp a folder1/a &&
	run_on_all git update-index &&
	test_all_match git diff-files

would get you the same result as:

	run_on_sparse mkdir -p folder1 &&
	run_on_sparse cp a folder1/a &&
	run_on_all git update-index --refresh &&
	test_all_match git diff-files

> 
> Maybe using git update-index --refresh would be more direct and
> straightforward solution.
> 
> (Hi Victoria, do you have any comments?  =)

I hope the above explanation is helpful. I still think '--stat' is the best
way to test this case, but I'm interested to hear your/others' thoughts on
the matter given the additional context.

> 
> 
> Thanks
> Shuqi
Shuqi Liang March 23, 2023, 11:59 p.m. UTC | #5
Hi Junio

On Thu, Mar 23, 2023 at 12:03 PM Junio C Hamano <gitster@pobox.com> wrote:

> > When the index file is not up-to-date, git diff-files may show differences
> > between the working directory and the index that are caused by file creation
> > time differences, rather than actual changes to the file contents. By using git
> > diff-files --stat, which ignores file creation time differences.
>
> Use of "diff-files --stat" would mean that the contents of the blob
> registered in the index will be inspected, which can be used to hide
> the "stat dirty" condition.
>
> But doesn't it cut both ways?  Starting from a clean index that has
> up-to-date stat information for paths, we may want to test what
> "stat dirty" changes diff-files reports when we touch paths in the
> working tree, both inside and outside the spase cones.  A test with
> "--stat" will not achieve that, exactly because it does not pay
> attention to and hides the stat dirtiness.

In this case, we can only use 'git diff-files --stat' when files are
present on disk without modifications. Since we know in the
full-checkout case 'diff-files --stat' will give empty output, so
sparse-checkout and sparse-index are also empty. These make
sure that the paths in the working tree are not dirty. So we do not
need to pay attention to 'stat dirty' change.

When 'file present on-disk with modifications'. We use 'git diff-files'
instead of  'git diff-files --stat' so we can get the expected
"modified" status but avoids potential breakages related to
inconsistency in the file creation time.

# file present on-disk without modifications
# use `--stat` to ignore file creation time differences in
# unrefreshed index
test_all_match git diff-files --stat &&
test_all_match git diff-files --stat folder1/a &&
test_all_match git diff-files --stat "folder*/a" &&

# file present on-disk with modifications
run_on_all ../edit-contents folder1/a &&
test_all_match git diff-files &&
test_all_match git diff-files folder1/a &&
test_all_match git diff-files "folder*/a"

> On the other hand, if "update-index --refresh" is used in the test,
> we may discover breakages caused by "update-index" not handling
> the sparse index correctly.  It would be outside the topic of this
> series, so avoiding it would be simpler, but (1) if it is not broken,
> then as you said, it would be a more direct way to test diff-files,
> and (2) if it is broken, it would need to be fixed anyway, before or
> after this series.  So, I dunno...

Thanks
Shuqi
Junio C Hamano April 13, 2023, 9:36 p.m. UTC | #6
Shuqi Liang <cheskaqiqi@gmail.com> writes:

> Changes since v6:
>
> 1. Fix word wrap in commit message.
>
> 2. Use  'mkdir -p folder1' since full-checkout already have folder1.
>
> 3. Use `--stat` to ignore file creation time differences in unrefreshed
> index.
>
> 4. In 'diff-files with pathspec outside sparse definition' add 
> 'git diff-files "folder*/a" to show that the result is the same with a 
> wildcard pathspec.
>
> 5. Create an 'ensure_expanded' to handle silent failures.

It seems that review comments have petered out.

Shall we mark this ready for 'next'?

Thanks.
Victoria Dye April 13, 2023, 9:38 p.m. UTC | #7
Junio C Hamano wrote:
> Shuqi Liang <cheskaqiqi@gmail.com> writes:
> 
>> Changes since v6:
>>
>> 1. Fix word wrap in commit message.
>>
>> 2. Use  'mkdir -p folder1' since full-checkout already have folder1.
>>
>> 3. Use `--stat` to ignore file creation time differences in unrefreshed
>> index.
>>
>> 4. In 'diff-files with pathspec outside sparse definition' add 
>> 'git diff-files "folder*/a" to show that the result is the same with a 
>> wildcard pathspec.
>>
>> 5. Create an 'ensure_expanded' to handle silent failures.
> 
> It seems that review comments have petered out.
> 
> Shall we mark this ready for 'next'?

Sorry for the delay - I noticed a couple more things that might warrant
changes before moving to 'next' and am in the process of writing up that
response. I'll send it within the hour.

> 
> Thanks.