diff mbox series

[v3,5/7] refresh_index(): add REFRESH_DONT_MARK_SPARSE_MATCHES flag

Message ID 34a61a0d03868c43d68a04bca8d86dd98de2aa28.1615588109.git.matheus.bernardino@usp.br (mailing list archive)
State Superseded
Headers show
Series add/rm: honor sparse checkout and warn on sparse paths | expand

Commit Message

Matheus Tavares March 12, 2021, 10:48 p.m. UTC
refresh_index() optionally takes a seen[] array to mark the pathspec
items that had matches in the index. This is used by `git add --refresh`
to find out if there was any pathspec without matches, and display an
error accordingly.

In the following patch, `git add` will also learn to warn about
pathspecs that match no eligible path for updating, but do match sparse
entries. For that, we will need a seen[] array marked exclusively with
matches from dense entries. To avoid having to call ce_path_match()
again for these entries after refresh_index() returns, add a flag that
implements this restriction inside the function itself.

Note that refresh_index() does not update sparse entries, regardless of
passing the flag or not. The flag only controls whether matches with
these entries should appear in the seen[] array.

While we are here, also realign the REFRESH_* flags and convert the hex
values to the more natural bit shift format, which makes it easier to
spot holes.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 cache.h      | 15 ++++++++-------
 read-cache.c |  5 ++++-
 2 files changed, 12 insertions(+), 8 deletions(-)

Comments

Junio C Hamano March 18, 2021, 11:45 p.m. UTC | #1
Matheus Tavares <matheus.bernardino@usp.br> writes:

> refresh_index() optionally takes a seen[] array to mark the pathspec
> items that had matches in the index. This is used by `git add --refresh`
> to find out if there was any pathspec without matches, and display an
> error accordingly.

It smells a bit iffy in that this allows a path to "match" a pattern
but yet not result in seen[] to record that the pattern participated
in the match.  If the series is working towards certain operations
not to touch paths that are outside the sparse checkout, shouldn't
it be making these paths not to match the pattern, and by doing so
it would automatically cause the pattern to be considered "not yet
matching any path" when the matcher attempts to match the pattern to
such a path?
Junio C Hamano March 19, 2021, midnight UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
>> refresh_index() optionally takes a seen[] array to mark the pathspec
>> items that had matches in the index. This is used by `git add --refresh`
>> to find out if there was any pathspec without matches, and display an
>> error accordingly.
>
> It smells a bit iffy in that this allows a path to "match" a pattern
> but yet not result in seen[] to record that the pattern participated
> in the match.  If the series is working towards certain operations
> not to touch paths that are outside the sparse checkout, shouldn't
> it be making these paths not to match the pattern, and by doing so
> it would automatically cause the pattern to be considered "not yet
> matching any path" when the matcher attempts to match the pattern to
> such a path?

In other words, the change makes me wonder why we are not adding a
flag that says "do we or do we not want to match paths outside the
sparse checkout cone?", with which the seen[] would automatically
record the right thing.
Matheus Tavares March 19, 2021, 12:23 p.m. UTC | #3
On Thu, Mar 18, 2021 at 9:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Matheus Tavares <matheus.bernardino@usp.br> writes:
> >
> >> refresh_index() optionally takes a seen[] array to mark the pathspec
> >> items that had matches in the index. This is used by `git add --refresh`
> >> to find out if there was any pathspec without matches, and display an
> >> error accordingly.
> >
> > It smells a bit iffy in that this allows a path to "match" a pattern
> > but yet not result in seen[] to record that the pattern participated
> > in the match.  If the series is working towards certain operations
> > not to touch paths that are outside the sparse checkout, shouldn't
> > it be making these paths not to match the pattern, and by doing so
> > it would automatically cause the pattern to be considered "not yet
> > matching any path" when the matcher attempts to match the pattern to
> > such a path?
>
> In other words, the change makes me wonder why we are not adding a
> flag that says "do we or do we not want to match paths outside the
> sparse checkout cone?", with which the seen[] would automatically
> record the right thing.

Yeah, makes sense. I didn't want to make the flag skip the sparse
paths unconditionally (i.e. without matching them) because then we
would also skip the ce_stage() checkings below and the later
ce_mark_uptodate(). And I wasn't sure whether this could cause any
unwanted side effects.

But thinking more carefully about this now, unmerged paths should
never have the SKIP_WORKTREE bit set anyway, right? What about the
CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar
with this code, but I'll try to investigate more later.
Junio C Hamano March 19, 2021, 4:05 p.m. UTC | #4
Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

>> In other words, the change makes me wonder why we are not adding a
>> flag that says "do we or do we not want to match paths outside the
>> sparse checkout cone?", with which the seen[] would automatically
>> record the right thing.
>
> Yeah, makes sense. I didn't want to make the flag skip the sparse
> paths unconditionally (i.e. without matching them) because then we
> would also skip the ce_stage() checkings below and the later
> ce_mark_uptodate(). And I wasn't sure whether this could cause any
> unwanted side effects.
>
> But thinking more carefully about this now, unmerged paths should
> never have the SKIP_WORKTREE bit set anyway, right? What about the
> CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar
> with this code, but I'll try to investigate more later.

I do not offhand know the answers to these questions you ask in the
last paragraph myself, and it could turn out to be that the "let
processing go as before, matching what those excluded paths by the
matcher, but making the caller responsible for filtering out the
paths excluded by sparcity out of the result from the matcher, but
tweak seen[]" approach, however ugly, is the most practical, after
your investigation.  But I'd prefer to see how bad the situation is
first ;-).

Thanks.
Matheus Tavares March 30, 2021, 6:51 p.m. UTC | #5
On Fri, Mar 19, 2021 at 1:05 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
>
> >> In other words, the change makes me wonder why we are not adding a
> >> flag that says "do we or do we not want to match paths outside the
> >> sparse checkout cone?", with which the seen[] would automatically
> >> record the right thing.
> >
> > Yeah, makes sense. I didn't want to make the flag skip the sparse
> > paths unconditionally (i.e. without matching them) because then we
> > would also skip the ce_stage() checkings below and the later
> > ce_mark_uptodate(). And I wasn't sure whether this could cause any
> > unwanted side effects.
> >
> > But thinking more carefully about this now, unmerged paths should
> > never have the SKIP_WORKTREE bit set anyway, right? What about the
> > CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar
> > with this code, but I'll try to investigate more later.

Sorry I haven't given any update on this yet. From what I could see so
far, it seems OK to ignore the skip_worktree entries in
refresh_index() when it is called from `git add --refresh`. But
because we would no longer mark the skip_worktree entries that match
the pathspec with CE_UPTODATE, do_write_index() would start checking
if they are racy clean (this is only done when `!ce_uptodate(ce)`),
which could add some lstat() overhead.

However, this made me think what happens today if we do have a racy
clean entry with the skip_worktree bit set... `git add --refresh` will
probably update the index without noticing that the entry is racy
clean (because it won't check CE_UPTODATE entries, and skip_worktree
entries will have this bit set in refresh_index()). Thus the entries'
size won't be truncated to zero when writing the index, and the entry
will appear unchanged even if we later unset the skip_worktree bit.

But can we have a "racy clean skip_worktree entry"? Yes, this seems
possible e.g. if the following sequence happens fast enough for mtime
to be the same before and after the update:

  echo x >file
  git update-index --refresh --skip-worktree file
  echo y>file

Here is a more complete example which artificially creates a "racy
clean skip_worktree entry", runs `git add --refresh`, and shows that
the racy clean entry was not detected:

# Setup
echo sparse >sparse
echo dense >dense
git add .
git commit -m files

# Emulate a racy clean situation
touch -d yesterday date
touch -r date sparse
git update-index --refresh --skip-worktree sparse
touch -r date .git/index
echo xparse >sparse
touch -r date sparse

# `git add` will now write a new index without checking if
# `sparse` is racy clean nor truncating its size
touch -r date dense
git add --refresh .

git update-index --no-skip-worktree sparse
git status
<doesn't show that `sparse` was modified>

This situation feels rather uncommon, but perhaps the same could
happen with `git sparse-checkout set` instead of `git update-index
--refresh --skip-worktree`? IDK. This made me think whether
refresh_index() should really mark skip_worktree entries with
CE_UPTODATE, in the first place.

Any thoughts?
Elijah Newren March 31, 2021, 9:14 a.m. UTC | #6
Hi,

On Tue, Mar 30, 2021 at 11:51 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Fri, Mar 19, 2021 at 1:05 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> >
> > >> In other words, the change makes me wonder why we are not adding a
> > >> flag that says "do we or do we not want to match paths outside the
> > >> sparse checkout cone?", with which the seen[] would automatically
> > >> record the right thing.
> > >
> > > Yeah, makes sense. I didn't want to make the flag skip the sparse
> > > paths unconditionally (i.e. without matching them) because then we
> > > would also skip the ce_stage() checkings below and the later
> > > ce_mark_uptodate(). And I wasn't sure whether this could cause any
> > > unwanted side effects.
> > >
> > > But thinking more carefully about this now, unmerged paths should
> > > never have the SKIP_WORKTREE bit set anyway, right? What about the
> > > CE_UPTODATE mark, would it be safe to skip it? I'm not very familiar
> > > with this code, but I'll try to investigate more later.
>
> Sorry I haven't given any update on this yet. From what I could see so
> far, it seems OK to ignore the skip_worktree entries in
> refresh_index() when it is called from `git add --refresh`. But
> because we would no longer mark the skip_worktree entries that match
> the pathspec with CE_UPTODATE, do_write_index() would start checking
> if they are racy clean (this is only done when `!ce_uptodate(ce)`),
> which could add some lstat() overhead.
>
> However, this made me think what happens today if we do have a racy
> clean entry with the skip_worktree bit set... `git add --refresh` will
> probably update the index without noticing that the entry is racy
> clean (because it won't check CE_UPTODATE entries, and skip_worktree
> entries will have this bit set in refresh_index()). Thus the entries'
> size won't be truncated to zero when writing the index, and the entry
> will appear unchanged even if we later unset the skip_worktree bit.
>
> But can we have a "racy clean skip_worktree entry"? Yes, this seems
> possible e.g. if the following sequence happens fast enough for mtime
> to be the same before and after the update:
>
>   echo x >file
>   git update-index --refresh --skip-worktree file
>   echo y>file
>
> Here is a more complete example which artificially creates a "racy
> clean skip_worktree entry", runs `git add --refresh`, and shows that
> the racy clean entry was not detected:
>
> # Setup
> echo sparse >sparse
> echo dense >dense
> git add .
> git commit -m files
>
> # Emulate a racy clean situation
> touch -d yesterday date
> touch -r date sparse
> git update-index --refresh --skip-worktree sparse
> touch -r date .git/index
> echo xparse >sparse
> touch -r date sparse
>
> # `git add` will now write a new index without checking if
> # `sparse` is racy clean nor truncating its size
> touch -r date dense
> git add --refresh .
>
> git update-index --no-skip-worktree sparse
> git status
> <doesn't show that `sparse` was modified>
>
> This situation feels rather uncommon, but perhaps the same could
> happen with `git sparse-checkout set` instead of `git update-index
> --refresh --skip-worktree`? IDK. This made me think whether
> refresh_index() should really mark skip_worktree entries with
> CE_UPTODATE, in the first place.
>
> Any thoughts?

Wow, that's a weird one.  Nice digging.  I don't understand the racily
clean stuff that well, or the CE_UPTODATE handling...but based on my
meager understanding of these things I'd say that marking
skip_worktree entries with CE_UPTODATE seems wrong and I'd agree with
your hunch that we shouldn't be updating it for those files.  If an
entry is skip_worktree, it's assumed to not be present in the working
tree and that we'll treat it as "unchanged".  So, when the filehappens
to be present despite that bit being set, checking the files' stat
information or pretending to have done so so we can claim it is up to
date seems wrong to me.  In fact, I'd say the mere recognition that
the file is present in the working tree despite being SKIP_WORKTREE to
me implies it's not "up-to-date" for at least one definition of that
term.

I suspect that if someone flips the skip-worktree bit on and off for a
file without removing it from the working tree, not marking
skip_worktree entries with CE_UPTODATE as you mention above might
force us to incur the cost of an extra stat on said file when we run
"git status" later.  If so, I think that's a totally reasonable cost,
so if that was your worry, I'd say that this is shifting the cost
where it belongs.

But, like I said, I'm not very familiar with the racily clean code or
CE_UPTODATE, so it's possible I said something above that doesn't even
make sense.  Does my reasoning help at all?
diff mbox series

Patch

diff --git a/cache.h b/cache.h
index 6fda8091f1..0b6074d49e 100644
--- a/cache.h
+++ b/cache.h
@@ -879,13 +879,14 @@  int match_stat_data_racy(const struct index_state *istate,
 
 void fill_stat_cache_info(struct index_state *istate, struct cache_entry *ce, struct stat *st);
 
-#define REFRESH_REALLY		0x0001	/* ignore_valid */
-#define REFRESH_UNMERGED	0x0002	/* allow unmerged */
-#define REFRESH_QUIET		0x0004	/* be quiet about it */
-#define REFRESH_IGNORE_MISSING	0x0008	/* ignore non-existent */
-#define REFRESH_IGNORE_SUBMODULES	0x0010	/* ignore submodules */
-#define REFRESH_IN_PORCELAIN	0x0020	/* user friendly output, not "needs update" */
-#define REFRESH_PROGRESS	0x0040  /* show progress bar if stderr is tty */
+#define REFRESH_REALLY                   (1 << 0) /* ignore_valid */
+#define REFRESH_UNMERGED                 (1 << 1) /* allow unmerged */
+#define REFRESH_QUIET                    (1 << 2) /* be quiet about it */
+#define REFRESH_IGNORE_MISSING           (1 << 3) /* ignore non-existent */
+#define REFRESH_IGNORE_SUBMODULES        (1 << 4) /* ignore submodules */
+#define REFRESH_IN_PORCELAIN             (1 << 5) /* user friendly output, not "needs update" */
+#define REFRESH_PROGRESS                 (1 << 6) /* show progress bar if stderr is tty */
+#define REFRESH_DONT_MARK_SPARSE_MATCHES (1 << 7) /* don't mark sparse entries' matches on seen[] */
 int refresh_index(struct index_state *, unsigned int flags, const struct pathspec *pathspec, char *seen, const char *header_msg);
 /*
  * Refresh the index and write it to disk.
diff --git a/read-cache.c b/read-cache.c
index 1e9a50c6c7..0bf7c64eff 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1514,6 +1514,7 @@  int refresh_index(struct index_state *istate, unsigned int flags,
 	int quiet = (flags & REFRESH_QUIET) != 0;
 	int not_new = (flags & REFRESH_IGNORE_MISSING) != 0;
 	int ignore_submodules = (flags & REFRESH_IGNORE_SUBMODULES) != 0;
+	int no_sparse_on_seen = (flags & REFRESH_DONT_MARK_SPARSE_MATCHES) != 0;
 	int first = 1;
 	int in_porcelain = (flags & REFRESH_IN_PORCELAIN);
 	unsigned int options = (CE_MATCH_REFRESH |
@@ -1552,12 +1553,14 @@  int refresh_index(struct index_state *istate, unsigned int flags,
 		int filtered = 0;
 		int t2_did_lstat = 0;
 		int t2_did_scan = 0;
+		char *cur_seen;
 
 		ce = istate->cache[i];
 		if (ignore_submodules && S_ISGITLINK(ce->ce_mode))
 			continue;
 
-		if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
+		cur_seen = no_sparse_on_seen && ce_skip_worktree(ce) ? NULL : seen;
+		if (pathspec && !ce_path_match(istate, ce, pathspec, cur_seen))
 			filtered = 1;
 
 		if (ce_stage(ce)) {