diff mbox series

update-index: refresh should rewrite index in case of racy timestamps

Message ID pull.1105.git.1640181390841.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series update-index: refresh should rewrite index in case of racy timestamps | expand

Commit Message

Marc Strapetz Dec. 22, 2021, 1:56 p.m. UTC
From: Marc Strapetz <marc.strapetz@syntevo.com>

update-index --refresh and --really-refresh should force writing of the
index file if racy timestamps have been encountered, as status already
does [1].

Note that calling update-index still does not guarantee that there will
be no more racy timestamps afterwards (the same holds true for status):

- calling update-index immediately after touching and adding a file may
  still leave racy timestamps if all three operations occur within the
  racy-tolerance (usually 1 second unless USE_NSEC has been defined)

- calling update-index for timestamps which are set into the future
  will leave them racy

To guarantee that such racy timestamps will be resolved would require to
wait until the system clock has passed beyond these timestamps and only
then write the index file. Especially for future timestamps, this does
not seem feasible because of possibly long delays/hangs.

Both --refresh and --really-refresh may in theory be used in
combination with --unresolve and --again which may reset the
"active_cache_changed" flag. There is no difference of whether we
write the index due to racy timestamps or due to other
reasons, like if --really-refresh has detected CE_ENTRY_CHANGED in
refresh_cache(). Hence, we will set the "active_cache_changed" flag
immediately after calling refresh_cache().

[1] https://lore.kernel.org/git/d3dd805c-7c1d-30a9-6574-a7bfcb7fc013@syntevo.com/

Signed-off-by: Marc Strapetz <marc.strapetz@syntevo.com>
---
    update-index: refresh should rewrite index in case of racy timestamps
    
    This patch makes update-index --refresh write the index if it contains
    racy timestamps, as discussed at [1].
    
    [1]
    https://lore.kernel.org/git/d3dd805c-7c1d-30a9-6574-a7bfcb7fc013@syntevo.com/

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1105%2Fmstrap%2Ffeature%2Fupdate-index-refresh-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1105/mstrap/feature/update-index-refresh-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1105

 builtin/update-index.c               |  6 +++
 cache.h                              |  1 +
 read-cache.c                         |  2 +-
 t/t2108-update-index-refresh-racy.sh | 58 ++++++++++++++++++++++++++++
 4 files changed, 66 insertions(+), 1 deletion(-)
 create mode 100755 t/t2108-update-index-refresh-racy.sh


base-commit: 597af311a2899bfd6640b9b107622c5795d5f998

Comments

Junio C Hamano Dec. 22, 2021, 11:52 p.m. UTC | #1
"Marc Strapetz via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  builtin/update-index.c               |  6 +++
>  cache.h                              |  1 +
>  read-cache.c                         |  2 +-
>  t/t2108-update-index-refresh-racy.sh | 58 ++++++++++++++++++++++++++++
>  4 files changed, 66 insertions(+), 1 deletion(-)
>  create mode 100755 t/t2108-update-index-refresh-racy.sh
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 187203e8bb5..0a069281e23 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -787,6 +787,12 @@ static int refresh(struct refresh_params *o, unsigned int flag)
>  	setup_work_tree();
>  	read_cache();
>  	*o->has_errors |= refresh_cache(o->flags | flag);
> +	if (has_racy_timestamp(&the_index)) {
> +		/* For racy timestamps we should set active_cache_changed immediately:
> +		 * other callbacks may follow for which some of them may reset
> +		 * active_cache_changed. */
> +		active_cache_changed |= SOMETHING_CHANGED;
> +	}

Documentation/CodingGuidelines says:

 - Multi-line comments include their delimiters on separate lines from
   the text.  E.g.

	/*
	 * A very long
	 * multi-line comment.
	 */

The last half-sentence puzzles me, partly because of the word
"callback", which is an implementation detail of how --refresh and
other actions are triggered by the update-index command.  Calling
them "operation" or "action" might be easier to understand.  I dunno.

But more problematic is the word "reset", which at least to me
implies that the SOMETHING_CHANGED bit may be cleared by them, which
sounds just wrong and broken.

    ... goes and looks ...

Ah, there are cases where we do clear active_cache_changed when we
notice that an operation detected an error, to avoid spreading the
breakage by writing the index file out, and I think that is the
right thing to do.  Which means that the above patch is not quite
right.  Perhaps taking all of the above together, something like
this?

	*o->has_errors |= refresh_cache(o->flags | flag);
	if (*o->has_errors)
		active_cache_changed = 0; 
	else if (has_racy_timestamps(&the_index))
        	/*
		 * Even if nothing else has changed, updating the file
		 * increases the chance that racy timestamps become
		 * non-racy, helping future run-time performance.
		 */
		active_cache_changed |= SOMETHING_CHANGED;


> diff --git a/t/t2108-update-index-refresh-racy.sh b/t/t2108-update-index-refresh-racy.sh
> new file mode 100755
> index 00000000000..ece1151847c
> --- /dev/null
> +++ b/t/t2108-update-index-refresh-racy.sh
> @@ -0,0 +1,58 @@
> +#!/bin/sh
> +
> +test_description='update-index refresh tests related to racy timestamps'
> +
> +. ./test-lib.sh
> +
> +reset_mtime() {

Documentation/CodingGuidelines

 - We prefer a space between the function name and the parentheses,
   and no space inside the parentheses. The opening "{" should also
   be on the same line.

	(incorrect)
	my_function(){
		...

	(correct)
	my_function () {
		...

> +	test-tool chmtime =$(test-tool chmtime --get .git/fs-tstamp) $1

Even if we know all the existing callers pass a single word argument
to this function, it would be a good discipline to put double-quotes
around "$1" to assure the readers that we are future-proofed.

> +}
> +
> +update_assert_unchanged() {
> +	local ts1=$(test-tool chmtime --get .git/index) &&
> +	git update-index $1 &&
> +	local ts2=$(test-tool chmtime --get .git/index) &&
> +	[ $ts1 -eq $ts2 ]

Documentation/CodingGuidelines

 - We prefer "test" over "[ ... ]".

> +}
> +
> +update_assert_changed() {
> +	local ts1=$(test-tool chmtime --get .git/index) &&
> +	test_might_fail git update-index $1 &&
> +	local ts2=$(test-tool chmtime --get .git/index) &&
> +	[ $ts1 -ne $ts2 ]
> +}
> +
> +test_expect_success 'setup' '
> +	touch .git/fs-tstamp &&

Not that it is wrong, but do we need to create such a throw-away
file inside the .git directory?

When we care only the presence of a path, and not that the path has
the current timestamp, we prefer not to use "touch".

	>.git/fs-tstamp

I am debating myself which is more appropriate in this case.  A
mistaken implementation of "touch" could call gettimeofday() and use
the result to call utimes(), leaving wallclock timestamp in the
result, but redirecation to create or truncate the path is a more
guaranteed way to make sure the timestamp comes from the filesystem,
so it may be more suitable for our needs here.

> +	test-tool chmtime -1 .git/fs-tstamp &&
> +	echo content >file &&
> +	reset_mtime file &&
> +
> +	git add file &&
> +	git commit -m "initial import"
> +'
> +
> +test_expect_success '--refresh has no racy timestamps to fix' '
> +	reset_mtime .git/index &&
> +	test-tool chmtime +1 .git/index &&
> +	update_assert_unchanged --refresh
> +'
> +
> +test_expect_success '--refresh should fix racy timestamp' '
> +	reset_mtime .git/index &&
> +	update_assert_changed --refresh
> +'
> +
> +test_expect_success '--really-refresh should fix racy timestamp' '
> +	reset_mtime .git/index &&
> +	update_assert_changed --really-refresh
> +'
> +
> +test_expect_success '--refresh should fix racy timestamp even if needs update' '
> +	echo content2 >file &&
> +	reset_mtime file &&
> +	reset_mtime .git/index &&
> +	update_assert_changed --refresh
> +'
> +
> +test_done
>
> base-commit: 597af311a2899bfd6640b9b107622c5795d5f998
Marc Strapetz Dec. 23, 2021, 6:24 p.m. UTC | #2
On 23/12/2021 00:52, Junio C Hamano wrote:
> Ah, there are cases where we do clear active_cache_changed when we
> notice that an operation detected an error, to avoid spreading the
> breakage by writing the index file out, and I think that is the
> right thing to do.  Which means that the above patch is not quite
> right.  Perhaps taking all of the above together, something like
> this?
> 
> 	*o->has_errors |= refresh_cache(o->flags | flag);
> 	if (*o->has_errors)
> 		active_cache_changed = 0;
> 	else if (has_racy_timestamps(&the_index))
>          	/*
> 		 * Even if nothing else has changed, updating the file
> 		 * increases the chance that racy timestamps become
> 		 * non-racy, helping future run-time performance.
> 		 */
> 		active_cache_changed |= SOMETHING_CHANGED;

I think it's safe to write the index even if refresh_cache() reports an 
"error" and we should actually do that:

The underlying refresh_index() will report an "error" only for "file: 
needs merge" and "file: needs update". In both cases, the corresponding 
entries will not have been updated. Every entry which has been updated 
is good on its own and writing these updates makes the index a little 
bit better. Subsequent calls to refresh_index() won't have to do the 
same work again (like invoking the quite expensive LFS filter).

This is also how cmd_status() currently works: it does not pay attention 
to the return value of refresh_index() and will always write the index 
if racy timestamps are encountered.

Overall, the "error" handling in update-index.c might not always do what 
one expects. Let's consider your suggested fix. When invoking:

update-index --refresh

this won't fix racy timestamps, however:

update-index --refresh --add untracked

will do. I think this is caused by active_cache_changed being used in 
two different ways: to indicate that the cache should be written and to 
indicate that it must not be written. It might be a good idea to take 
the latter "block index write" to a separate static variable in 
update-index.c.

Candidate usages of this new "block index write" variable will be in the 
existing callbacks: errors detected in unresolve_callback() should 
probably continue to block an index write, to ensure that either all or 
none of the specified files will be unresolved. For the 
reupdate_callback(), the underlying do_reupdate() seems to return 0 
always, so there is dead code in the callback (or am I completely 
blind?). To stay on the safe side, we may still continue to block an 
index write here. The refresh_callback() will never block an index write.

Does it make sense to clarify error handling in some preceding commit 
and only then address the razy timestamps? It will probably make this 
second commit clearer.

>> +}
>> +
>> +update_assert_changed() {
>> +	local ts1=$(test-tool chmtime --get .git/index) &&
>> +	test_might_fail git update-index $1 &&
>> +	local ts2=$(test-tool chmtime --get .git/index) &&
>> +	[ $ts1 -ne $ts2 ]
>> +}
>> +
>> +test_expect_success 'setup' '
>> +	touch .git/fs-tstamp &&
> 
> Not that it is wrong, but do we need to create such a throw-away
> file inside the .git directory?

We actually only need a timestamp for which we know that it is before 
the timestamp the next file system operation would create. I agree that 
it should be easy to rewrite that using "test-tool chmtime". This should 
also simplify reset_mtime().

Regarding all other comments, thanks, I'll address them as suggested for 
the next patch. And sorry for not checking CodingGuidelines before (I 
had completely missed this document).

-Marc
diff mbox series

Patch

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 187203e8bb5..0a069281e23 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -787,6 +787,12 @@  static int refresh(struct refresh_params *o, unsigned int flag)
 	setup_work_tree();
 	read_cache();
 	*o->has_errors |= refresh_cache(o->flags | flag);
+	if (has_racy_timestamp(&the_index)) {
+		/* For racy timestamps we should set active_cache_changed immediately:
+		 * other callbacks may follow for which some of them may reset
+		 * active_cache_changed. */
+		active_cache_changed |= SOMETHING_CHANGED;
+	}
 	return 0;
 }
 
diff --git a/cache.h b/cache.h
index cfba463aa97..dd1932e2d0e 100644
--- a/cache.h
+++ b/cache.h
@@ -891,6 +891,7 @@  void *read_blob_data_from_index(struct index_state *, const char *, unsigned lon
 #define CE_MATCH_IGNORE_FSMONITOR 0X20
 int is_racy_timestamp(const struct index_state *istate,
 		      const struct cache_entry *ce);
+int has_racy_timestamp(struct index_state *istate);
 int ie_match_stat(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 int ie_modified(struct index_state *, const struct cache_entry *, struct stat *, unsigned int);
 
diff --git a/read-cache.c b/read-cache.c
index cbe73f14e5e..ed297635a33 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2775,7 +2775,7 @@  static int repo_verify_index(struct repository *repo)
 	return verify_index_from(repo->index, repo->index_file);
 }
 
-static int has_racy_timestamp(struct index_state *istate)
+int has_racy_timestamp(struct index_state *istate)
 {
 	int entries = istate->cache_nr;
 	int i;
diff --git a/t/t2108-update-index-refresh-racy.sh b/t/t2108-update-index-refresh-racy.sh
new file mode 100755
index 00000000000..ece1151847c
--- /dev/null
+++ b/t/t2108-update-index-refresh-racy.sh
@@ -0,0 +1,58 @@ 
+#!/bin/sh
+
+test_description='update-index refresh tests related to racy timestamps'
+
+. ./test-lib.sh
+
+reset_mtime() {
+	test-tool chmtime =$(test-tool chmtime --get .git/fs-tstamp) $1
+}
+
+update_assert_unchanged() {
+	local ts1=$(test-tool chmtime --get .git/index) &&
+	git update-index $1 &&
+	local ts2=$(test-tool chmtime --get .git/index) &&
+	[ $ts1 -eq $ts2 ]
+}
+
+update_assert_changed() {
+	local ts1=$(test-tool chmtime --get .git/index) &&
+	test_might_fail git update-index $1 &&
+	local ts2=$(test-tool chmtime --get .git/index) &&
+	[ $ts1 -ne $ts2 ]
+}
+
+test_expect_success 'setup' '
+	touch .git/fs-tstamp &&
+	test-tool chmtime -1 .git/fs-tstamp &&
+	echo content >file &&
+	reset_mtime file &&
+
+	git add file &&
+	git commit -m "initial import"
+'
+
+test_expect_success '--refresh has no racy timestamps to fix' '
+	reset_mtime .git/index &&
+	test-tool chmtime +1 .git/index &&
+	update_assert_unchanged --refresh
+'
+
+test_expect_success '--refresh should fix racy timestamp' '
+	reset_mtime .git/index &&
+	update_assert_changed --refresh
+'
+
+test_expect_success '--really-refresh should fix racy timestamp' '
+	reset_mtime .git/index &&
+	update_assert_changed --really-refresh
+'
+
+test_expect_success '--refresh should fix racy timestamp even if needs update' '
+	echo content2 >file &&
+	reset_mtime file &&
+	reset_mtime .git/index &&
+	update_assert_changed --refresh
+'
+
+test_done