diff mbox series

[RFC] add|rm|mv: fix bug that prevent the update of non-sparse

Message ID 80b5ba61861193daf7132aa64b65fc7dde90dacb.1634866698.git.matheus.bernardino@usp.br (mailing list archive)
State New, archived
Headers show
Series [RFC] add|rm|mv: fix bug that prevent the update of non-sparse | expand

Commit Message

Matheus Tavares Oct. 22, 2021, 2:28 a.m. UTC
On Mon, Oct 18, 2021 at 6:28 PM Sean Christopherson <seanjc@google.com> wrote:
>
> $ cat .git/info/sparse-checkout
> !arch/*
> !tools/arch/*
> !virt/kvm/arm/*
> /*
> arch/.gitignore
> arch/Kconfig
> arch/x86
> tools/arch/x86
> tools/include/uapi/linux/kvm.h
> !Documentation
> !drivers
>
> $ git read-tree -mu HEAD
>
> $ rm arch/x86/kvm/x86.c
[...]
> $ git add arch/x86
> The following paths and/or pathspecs matched paths that exist
> outside of your sparse-checkout definition, so will not be
> updated in the index:
> arch/x86

I think the problem may be that we are performing pattern matching
slightly different in add, mv, and rm, in comparison to "git
sparse-checkout". On "git sparse-checkout init" (or reapply), we call
clear_ce_flags() which calls path_matches_pattern_list() for each
component of the working tree paths. If the full path gives a match
result of UNDECIDED, we recursively try to use the match result from
the parent dir (or NOT_MATCHED if we reach the top with UNDECIDED).

In Sean's example, we get UNDECIDED for "arch/x86/kvm/x86.c", but
"arch/x86" gives MATCHED, so we end up using that for the full path.

However, in add|mv|rm we only call path_matches_pattern_list() for the
full path and get UNDECIDED, which we consider the same as NOT_MATCHED,
and end up disallowing the path update operation with a warning message.

The commands do work if we replace the sparsity pattern "arch/x86" with
"arch/x86/" (with a trailing slash), but note that it only works
because the pattern is relative to the root (see dir.c:1297). If we
change it to "x86/", it would no longer work.

So far, the only way I could think of to fix this would be to perform
pattern matching for the leading components of the paths too. That
doesn't seem very nice, though, as it can probably be quite expensive...
But here is a patch for discussion:

-- >8 --
Subject: [RFC PATCH] add|rm|mv: fix bug that prevent the update of non-sparse dirs

These three commands recently learned to avoid updating paths that do
not match the sparse-checkout patterns even if they are missing the
SKIP_WORKTREE bit. This is done using path_in_sparse_checkout(), which
tries to match the path with the current set of sparsity rules using
path_matches_pattern_list(). This is similar to what clear_ce_flags()
does when we run "git sparse-checkout init" or "git sparse-checkout
reapply". But note that clear_ce_flags() has a recursive behavior,
calling path_matches_pattern_list() for each component in a path,
whereas path_in_sparse_checkout() only calls it for the full path. This
makes the function miss matches such as the one between path "a/b/c" and
the pattern "b/". So if the user has the sparsity rules "!/a" and "b/",
for example, add, rm, and mv will fail to update the path "a/b/c" and
end up displaying a warning about "a/b/c" being outside the sparse
checkout even though it isn't. Note that this problem only occurs with
non-cone mode.

Fix this by making path_in_sparse_checkout() perform pattern matching
for every component in the given path when cone mode is disabled. (This
can be expensive, and we might want to do some form of caching for the
match results of the leading components. However, this is not
implemented in this patch.) Also add two tests for each command (add,
rm, and mv) to check that they behave correctly with the said pattern
matching. The first test would previously fail without this patch, while
the second already succeeded. It is added mostly to make sure that we
are not breaking the existing pattern matching for directories that are
really sparse, and also as a protection against any future
regressions.

Note that two other existing tests had to be changed: one test in t3602
checks that "git rm -r <dir>" won't remove sparse entries, but it
didn't allow the non-sparse entries inside <dir> to be removed. The
other one, in t7002, tested that "git mv" would correctly display a
warning message for sparse paths, but it accidentally expected the
message to include two non-sparse paths as well.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 dir.c                          | 33 ++++++++++++++++++++++++------
 t/t3602-rm-sparse-checkout.sh  | 37 +++++++++++++++++++++++++++++++---
 t/t3705-add-sparse-checkout.sh | 18 +++++++++++++++++
 t/t7002-mv-sparse-checkout.sh  | 28 +++++++++++++++++++++++--
 4 files changed, 105 insertions(+), 11 deletions(-)

Comments

Matheus Tavares Oct. 22, 2021, 4:03 a.m. UTC | #1
On Thu, Oct 21, 2021 at 11:28 PM Matheus Tavares <matheus.bernardino@usp.br> wrote:
>
> diff --git a/dir.c b/dir.c
> index a4306ab874..225487a59c 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1516,11 +1517,31 @@ static int path_in_sparse_checkout_1(const char *path,
>  	     !istate->sparse_checkout_patterns->use_cone_patterns))
>  		return 1;
>  
> -	base = strrchr(path, '/');
> -	return path_matches_pattern_list(path, strlen(path), base ? base + 1 : path,
> -					 &dtype,
> -					 istate->sparse_checkout_patterns,
> -					 istate) > 0;
> +	if (istate->sparse_checkout_patterns->use_cone_patterns) {
> +		const char *base = strrchr(path, '/');
> +		return path_matches_pattern_list(path, strlen(path),
> +				base ? base + 1 : path, &dtype,
> +				istate->sparse_checkout_patterns, istate) > 0;
> +	}
> +
> +	for (p = path; ; p++) {
> +		enum pattern_match_result match;
> +
> +		if (*p && *p != '/')
> +			continue;
> +
> +		match  = path_matches_pattern_list(path, p - path,
> +				last_slash ? last_slash + 1 : path, &dtype,
> +				istate->sparse_checkout_patterns, istate);
> +
> +		if (match != UNDECIDED)
> +			ret = match;
> +		if (!*p)
> +			break;
> +		last_slash = p;
> +	}
> +
> +	return ret;
>  }

Of course, after hitting send I realized it would make a lot more sense
to start the pattern matching from the full path and only go backwards
through the parent dirs until we find the first non-UNDECIDED result.
I.e. something like this:

static int path_in_sparse_checkout_1(const char *path,
				     struct index_state *istate,
				     int require_cone_mode)
{
	int dtype = DT_REG;
	enum pattern_match_result ret;
	const char *p, *base;

	/*
	 * We default to accepting a path if there are no patterns or
	 * they are of the wrong type.
	 */
	if (init_sparse_checkout_patterns(istate) ||
	    (require_cone_mode &&
	     !istate->sparse_checkout_patterns->use_cone_patterns))
		return 1;

	if (istate->sparse_checkout_patterns->use_cone_patterns) {
		base = strrchr(path, '/');
		return path_matches_pattern_list(path, strlen(path),
				base ? base + 1 : path, &dtype,
				istate->sparse_checkout_patterns, istate) > 0;
	}

	/*
	 * If the match for the path is UNDECIDED, try to match the parent dir
	 * recursively.
	 */
	for (p = path + strlen(path); p && p > path; p = base) {
		base = memrchr(path, '/', p - path);
		ret = path_matches_pattern_list(path, p - path,
				base ? base + 1 : path, &dtype,
				istate->sparse_checkout_patterns, istate);

		if (ret != UNDECIDED)
			break;
	}

	return ret == UNDECIDED ? NOT_MATCHED : ret;
}

But I will let others comment on the overall idea and/or other
alternatives before sending a possible v2.
Derrick Stolee Oct. 25, 2021, 4:40 p.m. UTC | #2
On 10/21/2021 10:28 PM, Matheus Tavares wrote:
> On Mon, Oct 18, 2021 at 6:28 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> $ cat .git/info/sparse-checkout
>> !arch/*
>> !tools/arch/*
>> !virt/kvm/arm/*
>> /*
>> arch/.gitignore
>> arch/Kconfig
>> arch/x86
>> tools/arch/x86
>> tools/include/uapi/linux/kvm.h
>> !Documentation
>> !drivers
>>
>> $ git read-tree -mu HEAD
>>
>> $ rm arch/x86/kvm/x86.c
> [...]
>> $ git add arch/x86
>> The following paths and/or pathspecs matched paths that exist
>> outside of your sparse-checkout definition, so will not be
>> updated in the index:
>> arch/x86
> 
> I think the problem may be that we are performing pattern matching
> slightly different in add, mv, and rm, in comparison to "git
> sparse-checkout". On "git sparse-checkout init" (or reapply), we call
> clear_ce_flags() which calls path_matches_pattern_list() for each
> component of the working tree paths. If the full path gives a match
> result of UNDECIDED, we recursively try to use the match result from
> the parent dir (or NOT_MATCHED if we reach the top with UNDECIDED).

Yes! I think this is absolutely the problem. Thanks for pointing
this out!
 
> In Sean's example, we get UNDECIDED for "arch/x86/kvm/x86.c", but
> "arch/x86" gives MATCHED, so we end up using that for the full path.
> 
> However, in add|mv|rm we only call path_matches_pattern_list() for the
> full path and get UNDECIDED, which we consider the same as NOT_MATCHED,
> and end up disallowing the path update operation with a warning message.
> 
> The commands do work if we replace the sparsity pattern "arch/x86" with
> "arch/x86/" (with a trailing slash), but note that it only works
> because the pattern is relative to the root (see dir.c:1297). If we
> change it to "x86/", it would no longer work.
> 
> So far, the only way I could think of to fix this would be to perform
> pattern matching for the leading components of the paths too. That
> doesn't seem very nice, though, as it can probably be quite expensive...
> But here is a patch for discussion:

I agree that it is expensive, but that's already the case for the
non-cone sparse-checkout patterns. Hopefully it is sufficient that
these cases are restricted to modified files (in the case of `git add .`)
or specific pathspecs (in the case of `git mv` and `git rm`).

> -- >8 --
> Subject: [RFC PATCH] add|rm|mv: fix bug that prevent the update of non-sparse dirs
> 
> These three commands recently learned to avoid updating paths that do
> not match the sparse-checkout patterns even if they are missing the
> SKIP_WORKTREE bit. This is done using path_in_sparse_checkout(), which
> tries to match the path with the current set of sparsity rules using
> path_matches_pattern_list(). This is similar to what clear_ce_flags()
> does when we run "git sparse-checkout init" or "git sparse-checkout
> reapply". But note that clear_ce_flags() has a recursive behavior,
> calling path_matches_pattern_list() for each component in a path,
> whereas path_in_sparse_checkout() only calls it for the full path. This
> makes the function miss matches such as the one between path "a/b/c" and
> the pattern "b/". So if the user has the sparsity rules "!/a" and "b/",
> for example, add, rm, and mv will fail to update the path "a/b/c" and
> end up displaying a warning about "a/b/c" being outside the sparse
> checkout even though it isn't. Note that this problem only occurs with
> non-cone mode.
> 
> Fix this by making path_in_sparse_checkout() perform pattern matching
> for every component in the given path when cone mode is disabled. (This
> can be expensive, and we might want to do some form of caching for the
> match results of the leading components. However, this is not
> implemented in this patch.) Also add two tests for each command (add,
> rm, and mv) to check that they behave correctly with the said pattern
> matching. The first test would previously fail without this patch, while
> the second already succeeded. It is added mostly to make sure that we
> are not breaking the existing pattern matching for directories that are
> really sparse, and also as a protection against any future
> regressions.
> 
> Note that two other existing tests had to be changed: one test in t3602
> checks that "git rm -r <dir>" won't remove sparse entries, but it
> didn't allow the non-sparse entries inside <dir> to be removed. The
> other one, in t7002, tested that "git mv" would correctly display a
> warning message for sparse paths, but it accidentally expected the
> message to include two non-sparse paths as well.


> @@ -1504,8 +1504,9 @@ static int path_in_sparse_checkout_1(const char *path,
>  				     struct index_state *istate,
>  				     int require_cone_mode)
>  {
> -	const char *base;
>  	int dtype = DT_REG;
> +	enum pattern_match_result ret = NOT_MATCHED;
> +	const char *p, *last_slash = NULL;
>  
>  	/*
>  	 * We default to accepting a path if there are no patterns or
> @@ -1516,11 +1517,31 @@ static int path_in_sparse_checkout_1(const char *path,
>  	     !istate->sparse_checkout_patterns->use_cone_patterns))
>  		return 1;
>  
> -	base = strrchr(path, '/');
> -	return path_matches_pattern_list(path, strlen(path), base ? base + 1 : path,
> -					 &dtype,
> -					 istate->sparse_checkout_patterns,
> -					 istate) > 0;
> +	if (istate->sparse_checkout_patterns->use_cone_patterns) {
> +		const char *base = strrchr(path, '/');
> +		return path_matches_pattern_list(path, strlen(path),
> +				base ? base + 1 : path, &dtype,
> +				istate->sparse_checkout_patterns, istate) > 0;
> +	}
> +
> +	for (p = path; ; p++) {
> +		enum pattern_match_result match;
> +
> +		if (*p && *p != '/')
> +			continue;
> +
> +		match  = path_matches_pattern_list(path, p - path,
> +				last_slash ? last_slash + 1 : path, &dtype,
> +				istate->sparse_checkout_patterns, istate);
> +
> +		if (match != UNDECIDED)
> +			ret = match;
> +		if (!*p)
> +			break;
> +		last_slash = p;
> +	}
> +
> +	return ret;

This implementation makes sense to me.

>  test_expect_success 'recursive rm does not remove sparse entries' '
>  	git reset --hard &&
>  	git sparse-checkout set sub/dir &&
> -	test_must_fail git rm -r sub &&
> -	git rm --sparse -r sub &&
> +	git rm -r sub &&

Interesting that the new pattern-matching already presents a change of
behavior in this test case.

>  	git status --porcelain -uno >actual &&
>  	cat >expected <<-\EOF &&
> +	D  sub/dir/e
> +	EOF
> +	test_cmp expected actual &&

And here is why. Excellent. I suppose that setting the pattern to be
"sub/dir/" would have shown this behavior before.

> +
> +	git rm --sparse -r sub &&
> +	git status --porcelain -uno >actual2 &&
> +	cat >expected2 <<-\EOF &&
>  	D  sub/d
>  	D  sub/dir/e
>  	EOF
> -	test_cmp expected actual
> +	test_cmp expected2 actual2
>  '

The rest of the test cases add new checks that are very valuable.

I love this idea and I agree that it would be better to change the
loop direction to match the full path first (as you mention in your
response).

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/dir.c b/dir.c
index a4306ab874..225487a59c 100644
--- a/dir.c
+++ b/dir.c
@@ -1504,8 +1504,9 @@  static int path_in_sparse_checkout_1(const char *path,
 				     struct index_state *istate,
 				     int require_cone_mode)
 {
-	const char *base;
 	int dtype = DT_REG;
+	enum pattern_match_result ret = NOT_MATCHED;
+	const char *p, *last_slash = NULL;
 
 	/*
 	 * We default to accepting a path if there are no patterns or
@@ -1516,11 +1517,31 @@  static int path_in_sparse_checkout_1(const char *path,
 	     !istate->sparse_checkout_patterns->use_cone_patterns))
 		return 1;
 
-	base = strrchr(path, '/');
-	return path_matches_pattern_list(path, strlen(path), base ? base + 1 : path,
-					 &dtype,
-					 istate->sparse_checkout_patterns,
-					 istate) > 0;
+	if (istate->sparse_checkout_patterns->use_cone_patterns) {
+		const char *base = strrchr(path, '/');
+		return path_matches_pattern_list(path, strlen(path),
+				base ? base + 1 : path, &dtype,
+				istate->sparse_checkout_patterns, istate) > 0;
+	}
+
+	for (p = path; ; p++) {
+		enum pattern_match_result match;
+
+		if (*p && *p != '/')
+			continue;
+
+		match  = path_matches_pattern_list(path, p - path,
+				last_slash ? last_slash + 1 : path, &dtype,
+				istate->sparse_checkout_patterns, istate);
+
+		if (match != UNDECIDED)
+			ret = match;
+		if (!*p)
+			break;
+		last_slash = p;
+	}
+
+	return ret;
 }
 
 int path_in_sparse_checkout(const char *path,
diff --git a/t/t3602-rm-sparse-checkout.sh b/t/t3602-rm-sparse-checkout.sh
index ecce497a9c..6e127b966e 100755
--- a/t/t3602-rm-sparse-checkout.sh
+++ b/t/t3602-rm-sparse-checkout.sh
@@ -40,14 +40,20 @@  done
 test_expect_success 'recursive rm does not remove sparse entries' '
 	git reset --hard &&
 	git sparse-checkout set sub/dir &&
-	test_must_fail git rm -r sub &&
-	git rm --sparse -r sub &&
+	git rm -r sub &&
 	git status --porcelain -uno >actual &&
 	cat >expected <<-\EOF &&
+	D  sub/dir/e
+	EOF
+	test_cmp expected actual &&
+
+	git rm --sparse -r sub &&
+	git status --porcelain -uno >actual2 &&
+	cat >expected2 <<-\EOF &&
 	D  sub/d
 	D  sub/dir/e
 	EOF
-	test_cmp expected actual
+	test_cmp expected2 actual2
 '
 
 test_expect_success 'recursive rm --sparse removes sparse entries' '
@@ -105,4 +111,29 @@  test_expect_success 'refuse to rm a non-skip-worktree path outside sparse cone'
 	test_path_is_missing b
 '
 
+test_expect_success 'can remove files from non-sparse dir' '
+	git reset --hard &&
+	git sparse-checkout disable &&
+	mkdir -p w x/y &&
+	test_commit w/f &&
+	test_commit x/y/f &&
+
+	git sparse-checkout set w !/x y &&
+	git rm w/f.t x/y/f.t 2>stderr &&
+	test_must_be_empty stderr
+'
+
+test_expect_success 'refuse to remove non-skip-worktree file from sparse dir' '
+	git reset --hard &&
+	git sparse-checkout disable &&
+	mkdir -p x/y/z &&
+	test_commit x/y/z/f &&
+	git sparse-checkout set !/x y !x/y/z &&
+
+	git update-index --no-skip-worktree x/y/z/f.t &&
+	test_must_fail git rm x/y/z/f.t 2>stderr &&
+	echo x/y/z/f.t | cat sparse_error_header - sparse_hint >expect &&
+	test_cmp expect stderr
+'
+
 test_done
diff --git a/t/t3705-add-sparse-checkout.sh b/t/t3705-add-sparse-checkout.sh
index 5b904988d4..63f888af12 100755
--- a/t/t3705-add-sparse-checkout.sh
+++ b/t/t3705-add-sparse-checkout.sh
@@ -214,4 +214,22 @@  test_expect_success 'add allows sparse entries with --sparse' '
 	test_must_be_empty stderr
 '
 
+test_expect_success 'can add files from non-sparse dir' '
+	git sparse-checkout set w !/x y &&
+	mkdir -p w x/y &&
+	touch w/f x/y/f &&
+	git add w/f x/y/f 2>stderr &&
+	test_must_be_empty stderr
+'
+
+test_expect_success 'refuse to add non-skip-worktree file from sparse dir' '
+	git sparse-checkout set !/x y !x/y/z &&
+	mkdir -p x/y/z &&
+	touch x/y/z/f &&
+	test_must_fail git add x/y/z/f 2>stderr &&
+	echo x/y/z/f | cat sparse_error_header - sparse_hint >expect &&
+	test_cmp expect stderr
+
+'
+
 test_done
diff --git a/t/t7002-mv-sparse-checkout.sh b/t/t7002-mv-sparse-checkout.sh
index 545748949a..91a857bf05 100755
--- a/t/t7002-mv-sparse-checkout.sh
+++ b/t/t7002-mv-sparse-checkout.sh
@@ -143,8 +143,6 @@  test_expect_success 'recursive mv refuses to move (possible) sparse' '
 	cat >>expect <<-\EOF &&
 	sub/d
 	sub2/d
-	sub/dir/e
-	sub2/dir/e
 	sub/dir2/e
 	sub2/dir2/e
 	EOF
@@ -186,4 +184,30 @@  test_expect_success 'recursive mv refuses to move sparse' '
 	git reset --hard HEAD~1
 '
 
+test_expect_success 'can move files to non-sparse dir' '
+	git reset --hard &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set a b c w !/x y &&
+	mkdir -p w x/y &&
+
+	git mv a w/new-a 2>stderr &&
+	git mv b x/y/new-b 2>stderr &&
+	test_must_be_empty stderr
+'
+
+test_expect_success 'refuse to move file to non-skip-worktree sparse path' '
+	git reset --hard &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set a !/x y !x/y/z &&
+	mkdir -p x/y/z &&
+
+	test_must_fail git mv a x/y/z/new-a 2>stderr &&
+	cat sparse_error_header >expect &&
+	cat >>expect <<-\EOF &&
+	x/y/z/new-a
+	EOF
+	cat sparse_hint >>expect &&
+	test_cmp expect stderr
+'
+
 test_done