[2/4] split-index; stop abusing the `base_oid` to strip the "link" extension

Message ID f1897b880729b649ab24da14cbc3432d44b7c731.1679500859.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series Fix a few split-index bugs

Commit Message

Johannes Schindelin March 22, 2023, 4 p.m. UTC
From: Johannes Schindelin <johannes.schindelin@gmx.de>

When a split-index is in effect, the `$GIT_DIR/index` file needs to
contain a "link" extension that contains all the information about the
split-index, including the information about the shared index.

However, in some cases Git needs to suppress writing that "link"
extension (i.e. to fall back to writing a full index) even if the
in-memory index structure _has_ a `split_index` configured. This is the
case e.g. when "too many not shared" index entries exist.

In such instances, the current code sets the `base_oid` field of said
`split_index` structure to all-zero to indicate that `do_write_index()`
should skip writing the "link" extension.

This can lead to problems later on, when the in-memory index is still
used to perform other operations and eventually wants to write a
split-index, detects the presence of the `split_index` and reuses that,
too (under the assumption that it has been initialized correctly and
still has a non-null `base_oid`).

Let's stop zeroing out the `base_oid` to indicate that the "link"
extension should not be written.

One might be tempted to simply call `discard_split_index()` instead,
under the assumption that Git decided to write a non-split index and
therefore the the `split_index` structure might no longer be wanted.
However, that is not possible because that would release index entries
in `split_index->base` that are likely to still be in use. Therefore we
cannot do that.

The next best thing we _can_ do is to introduce a flag, specifically
indicating when the "link" extension should be skipped. So that's what
we do here.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 read-cache.c                 | 37 ++++++++++++++++++++++--------------
 t/t7527-builtin-fsmonitor.sh |  2 +-
 2 files changed, 24 insertions(+), 15 deletions(-)
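
The failure mode described in the commit message can be modelled in a few lines. This is only a sketch under stated assumptions: the `toy_*` structures and functions are hypothetical stand-ins, not Git's real `struct index_state`, `struct split_index`, or `do_write_index()`. Each "write" function simply reports whether a "link" extension would be emitted.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for Git's structures; not the real definitions. */
struct toy_oid { unsigned char hash[20]; };
struct toy_split_index { struct toy_oid base_oid; };
struct toy_index { struct toy_split_index *split_index; };

static bool toy_is_null_oid(const struct toy_oid *oid)
{
	static const struct toy_oid null_oid = { { 0 } };
	return !memcmp(oid, &null_oid, sizeof(*oid));
}

/* Does the index still carry a usable (non-null) base_oid? */
static bool toy_has_usable_base(const struct toy_index *istate)
{
	return istate->split_index &&
	       !toy_is_null_oid(&istate->split_index->base_oid);
}

/* Old behaviour: ask for a full index by zeroing base_oid (destructive). */
static bool toy_write_old(struct toy_index *istate, bool full_index)
{
	if (full_index && istate->split_index)
		memset(&istate->split_index->base_oid, 0,
		       sizeof(istate->split_index->base_oid));
	return toy_has_usable_base(istate);
}

/* New behaviour: ask for a full index via a flag; the state is untouched. */
static bool toy_write_new(const struct toy_index *istate, bool strip_link)
{
	return !strip_link && toy_has_usable_base(istate);
}
```

In this model, one full-index write through `toy_write_old()` leaves `base_oid` null, so every later write also skips the "link" extension. With `toy_write_new()`, the decision is made per call and a later split-index write still sees the original `base_oid`.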

Comments

Junio C Hamano March 22, 2023, 9:24 p.m. UTC | #1
"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> When a split-index is in effect, the `$GIT_DIR/index` file needs to
> contain a "link" extension that contains all the information about the
> split-index, including the information about the shared index.
> ...
> Let's stop zeroing out the `base_oid` to indicate that the "link"
> extension should not be written.

Nicely explained.

> One might be tempted to simply call `discard_split_index()` instead,
> under the assumption that Git decided to write a non-split index and
> therefore the the `split_index` structure might no longer be wanted.

"the the".

> +enum strip_extensions {
> +	WRITE_ALL_EXTENSIONS = 0,
> +	STRIP_ALL_EXTENSIONS = 1,
> +	STRIP_LINK_EXTENSION_ONLY = 2
> +};

We do not need to spell out the specific values for this enum; the
users' (i.e. the callers of do_write_index()) sole requirement is
for these symbols to have different values.

Also do we envision that (1) we would need to keep
STRIP_LINK_EXTENSION_ONLY as the largest value among the enum values,
or (2) we would never add a new value to the set?  Otherwise let's
end the last one with a trailing comma.
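
As a sketch of what that could look like: C numbers enumerators from 0 in declaration order when no values are given, so dropping the explicit values keeps 0, 1 and 2 as long as the order stays the same, and a trailing comma after the last enumerator has been allowed since C99. The `strip_mode_name()` helper below is hypothetical and only shows a user that relies on the names being distinct, not on their values.

```c
/* Same enumerators as in the patch, without explicit values. */
enum strip_extensions {
	WRITE_ALL_EXTENSIONS,
	STRIP_ALL_EXTENSIONS,
	STRIP_LINK_EXTENSION_ONLY,
};

/* Hypothetical helper, for illustration only: name a mode for tracing. */
static const char *strip_mode_name(enum strip_extensions mode)
{
	switch (mode) {
	case WRITE_ALL_EXTENSIONS:
		return "write-all";
	case STRIP_ALL_EXTENSIONS:
		return "strip-all";
	case STRIP_LINK_EXTENSION_ONLY:
		return "strip-link-only";
	}
	return "unknown";
}
```

With the trailing comma in place, adding a fourth enumerator later touches only the new line in the diff.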

Looking at the way strip_extensions variable is used in
do_write_index(), an alternative design might be to make it a set of
bits (e.g. unsigned write_extension) and give one bit to each
extension.  But such a clean-up is better left outside the topic, I
would imagine, as we do not have any need to skip an arbitrary set
of extensions right now.

> +/*
> + * Write the Git index into a `.lock` file
> + *
> + * If `strip_link_extension` is non-zero, avoid writing any "link" extension
> + * (used by the split-index feature).
> + */

Not exposing "enum strip_extensions" to the caller of this function,
like this patch does, is probably a very safe and sensible thing to
do.  We do not have a reason to allow its callers to (perhaps
mistakenly) pass STRIP_ALL_EXTENSIONS to it.

>  static int do_write_locked_index(struct index_state *istate, struct lock_file *lock,
> -				 unsigned flags)
> +				 unsigned flags, int strip_link_extension)
>  {
>  	int ret;
>  	int was_full = istate->sparse_index == INDEX_EXPANDED;
> @@ -3185,7 +3197,7 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>  	 */
>  	trace2_region_enter_printf("index", "do_write_index", the_repository,
>  				   "%s", get_lock_file_path(lock));
> -	ret = do_write_index(istate, lock->tempfile, 0, flags);
> +	ret = do_write_index(istate, lock->tempfile, strip_link_extension ? STRIP_LINK_EXTENSION_ONLY : 0, flags);
>  	trace2_region_leave_printf("index", "do_write_index", the_repository,
>  				   "%s", get_lock_file_path(lock));
>  

OK.

Very nicely done.
Jeff Hostetler March 23, 2023, 1:59 p.m. UTC | #2
On 3/22/23 5:24 PM, Junio C Hamano wrote:
> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
> 
>> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>>
>> When a split-index is in effect, the `$GIT_DIR/index` file needs to
>> contain a "link" extension that contains all the information about the
>> split-index, including the information about the shared index.
>> ...
>> Let's stop zeroing out the `base_oid` to indicate that the "link"
>> extension should not be written.
> 
> Nicely explained.
> 
>> One might be tempted to simply call `discard_split_index()` instead,
>> under the assumption that Git decided to write a non-split index and
>> therefore the the `split_index` structure might no longer be wanted.
> 
> "the the".
> 
>> +enum strip_extensions {
>> +	WRITE_ALL_EXTENSIONS = 0,
>> +	STRIP_ALL_EXTENSIONS = 1,
>> +	STRIP_LINK_EXTENSION_ONLY = 2
>> +};
> 
> We do not need to spell out the specific values for this enum; the
> users' (i.e. the callers of do_write_index()) sole requirement is
> for these symbols to have different values.

There are several calls to do_write_locked_index() that pass 0 or 1
as the new final arg.  If we update them to use these enum values,
then we don't need integer values here.

> 
> Also do we envision that (1) we would need to keep STRIP_LINK_ONLY
> to be with the largest value among the enum values, or (2) we would
> never add new value to the set?  Otherwise let's end the last one
> with a trailing comma.
> 
> Looking at the way strip_extensions variable is used in
> do_write_index(), an alternative design might be to make it a set of
> bits (e.g. unsigned write_extension) and give one bit to each
> extension.  But such a clean-up is better left outside the topic, I
> would imagine, as we do not have any need to skip an arbitrary set
> of extensions right now.

Agreed, I thought about suggesting a set of bits too, but right now
we only need to strip all of them or just this one.

> 
>> +/*
>> + * Write the Git index into a `.lock` file
>> + *
>> + * If `strip_link_extension` is non-zero, avoid writing any "link" extension
>> + * (used by the split-index feature).
>> + */
> 
> Not exposing "enum strip_extensions" to the caller of this function,
> like this patch does, is probably a very safe and sensible thing to
> do.  We do not have a reason to allow its callers to (perhaps
> mistakenly) pass STRIP_ALL_EXTENSIONS to it.
> 
>>   static int do_write_locked_index(struct index_state *istate, struct lock_file *lock,
>> -				 unsigned flags)
>> +				 unsigned flags, int strip_link_extension)
>>   {
>>   	int ret;
>>   	int was_full = istate->sparse_index == INDEX_EXPANDED;
>> @@ -3185,7 +3197,7 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>>   	 */
>>   	trace2_region_enter_printf("index", "do_write_index", the_repository,
>>   				   "%s", get_lock_file_path(lock));
>> -	ret = do_write_index(istate, lock->tempfile, 0, flags);
>> +	ret = do_write_index(istate, lock->tempfile, strip_link_extension ? STRIP_LINK_EXTENSION_ONLY : 0, flags);

In the else of the ?: operator, could we use the WRITE_ALL_EXTENSIONS
instead of 0?

>>   	trace2_region_leave_printf("index", "do_write_index", the_repository,
>>   				   "%s", get_lock_file_path(lock));
>>   
> 
> OK.
> 
> Very nicely done.
Jeff Hostetler March 23, 2023, 3:22 p.m. UTC | #3
On 3/22/23 12:00 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> 
> When a split-index is in effect, the `$GIT_DIR/index` file needs to
> contain a "link" extension that contains all the information about the
> split-index, including the information about the shared index.
> 
> However, in some cases Git needs to suppress writing that "link"
> extension (i.e. to fall back to writing a full index) even if the
> in-memory index structure _has_ a `split_index` configured. This is the
> case e.g. when "too many not shared" index entries exist.
> 
> In such instances, the current code sets the `base_oid` field of said
> `split_index` structure to all-zero to indicate that `do_write_index()`
> should skip writing the "link" extension.
> 
> This can lead to problems later on, when the in-memory index is still
> used to perform other operations and eventually wants to write a
> split-index, detects the presence of the `split_index` and reuses that,
> too (under the assumption that it has been initialized correctly and
> still has a non-null `base_oid`).
> 
> Let's stop zeroing out the `base_oid` to indicate that the "link"
> extension should not be written.
> 
> One might be tempted to simply call `discard_split_index()` instead,
> under the assumption that Git decided to write a non-split index and
> therefore the the `split_index` structure might no longer be wanted.
> However, that is not possible because that would release index entries
> in `split_index->base` that are likely to still be in use. Therefore we
> cannot do that.
> 
> The next best thing we _can_ do is to introduce a flag, specifically
> indicating when the "link" extension should be skipped. So that's what
> we do here.
> 
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>   read-cache.c                 | 37 ++++++++++++++++++++++--------------
>   t/t7527-builtin-fsmonitor.sh |  2 +-
>   2 files changed, 24 insertions(+), 15 deletions(-)
> 
> diff --git a/read-cache.c b/read-cache.c
> index b09128b1884..8fcb2d54c05 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -2868,6 +2868,12 @@ static int record_ieot(void)
>   	return !git_config_get_index_threads(&val) && val != 1;
>   }
>   
> +enum strip_extensions {
> +	WRITE_ALL_EXTENSIONS = 0,
> +	STRIP_ALL_EXTENSIONS = 1,
> +	STRIP_LINK_EXTENSION_ONLY = 2
> +};

Earlier (in a response to Junio's response on this commit) I said
that I didn't think we needed to make a bit set here, but I want
to re-think that or at least walk thru the change and talk out loud.

I'll explain in-line below.

> +
>   /*
>    * On success, `tempfile` is closed. If it is the temporary file
>    * of a `struct lock_file`, we will therefore effectively perform
> @@ -2876,7 +2882,7 @@ static int record_ieot(void)
>    * rely on it.
>    */
>   static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
> -			  int strip_extensions, unsigned flags)
> +			  enum strip_extensions strip_extensions, unsigned flags)
>   {
>   	uint64_t start = getnanotime();
>   	struct hashfile *f;
> @@ -3045,7 +3051,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>   			return -1;
>   	}
>   
> -	if (!strip_extensions && istate->split_index &&
> +	if (strip_extensions == WRITE_ALL_EXTENSIONS && istate->split_index &&
>   	    !is_null_oid(&istate->split_index->base_oid)) {

(I hate all of this double negative logic...)

Here we only want the extension if we have WRITE_ALL, so that is
NOT STRIP_ALL and NOT STRIP_LINK_ONLY, so that is OK.

>   		struct strbuf sb = STRBUF_INIT;
>   
> @@ -3060,7 +3066,7 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
>   		if (err)
>   			return -1;
>   	}
> -	if (!strip_extensions && !drop_cache_tree && istate->cache_tree) {
> +	if (strip_extensions != STRIP_ALL_EXTENSIONS && !drop_cache_tree && istate->cache_tree) {

Here we only want the extension when NOT STRIP_ALL, so this is
either WRITE_ALL or STRIP_LINK_ONLY, so this is OK.  The rest are
the same, so I'll omit them.
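
The two comparisons walked through above can be isolated as predicates to make the truth table explicit. This is a sketch only: `writes_link()` and `writes_others()` are hypothetical names, and they leave out the additional checks the real conditions perform (`istate->split_index`, the null `base_oid` test, `drop_cache_tree`, and so on).

```c
#include <stdbool.h>

enum strip_extensions {
	WRITE_ALL_EXTENSIONS = 0,
	STRIP_ALL_EXTENSIONS = 1,
	STRIP_LINK_EXTENSION_ONLY = 2
};

/* Mirrors the patched test that guards the "link" extension. */
static bool writes_link(enum strip_extensions mode)
{
	return mode == WRITE_ALL_EXTENSIONS;
}

/* Mirrors the patched test that guards the remaining extensions. */
static bool writes_others(enum strip_extensions mode)
{
	return mode != STRIP_ALL_EXTENSIONS;
}
```

Only `STRIP_LINK_EXTENSION_ONLY` makes the two predicates disagree: the "link" extension is skipped while the other extensions are still written.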

[...]

All of this looks correct, but I stumbled over things on my first
or second reading.  I wonder if it would simplify things to define
this as:

enum strip_extensions {
	WRITE_ALL_EXTENSIONS   = 0,
	STRIP_LINK_EXTENSION   = (1<<0),
	STRIP_OTHER_EXTENSIONS = (1<<1),
	STRIP_ALL_EXTENSIONS   = (STRIP_LINK_EXTENSION
				 | STRIP_OTHER_EXTENSIONS),
};

Then the link test becomes:
	if ( ! (strip_extensions & STRIP_LINK_EXTENSION) &&
	    istate->split_index &&
	    ...) {

and the others become:

	if ( ! (strip_extensions & STRIP_OTHER_EXTENSIONS) &&
	    ...) {

If we need to add the ability later to strip an individual extension,
we can easily add a bit to the enum and update the _ALL_ mask
and the corresponding `if` test.

In a later commit (probably in another series), we can invert
these double negatives to improve readability.
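
For reference, here is a self-contained sketch of the bit-set variant, with each flag given its own bit via `1 << n`. The helper names `should_write_link()` and `should_write_others()` are hypothetical; in the real `do_write_index()` these tests would be combined with the existing `istate` checks.

```c
#include <stdbool.h>

enum strip_extensions {
	WRITE_ALL_EXTENSIONS   = 0,
	STRIP_LINK_EXTENSION   = (1 << 0),
	STRIP_OTHER_EXTENSIONS = (1 << 1),
	STRIP_ALL_EXTENSIONS   = (STRIP_LINK_EXTENSION |
				  STRIP_OTHER_EXTENSIONS),
};

/* The "link" extension is written unless its strip bit is set. */
static bool should_write_link(unsigned strip)
{
	return !(strip & STRIP_LINK_EXTENSION);
}

/* All other extensions are written unless their strip bit is set. */
static bool should_write_others(unsigned strip)
{
	return !(strip & STRIP_OTHER_EXTENSIONS);
}
```

Each test now inspects exactly one bit, so a reader no longer has to work out which of the three enum values a given `==` or `!=` comparison covers.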

> +/*
> + * Write the Git index into a `.lock` file
> + *
> + * If `strip_link_extension` is non-zero, avoid writing any "link" extension
> + * (used by the split-index feature).
> + */
>   static int do_write_locked_index(struct index_state *istate, struct lock_file *lock,
> -				 unsigned flags)
> +				 unsigned flags, int strip_link_extension)
>   {
>   	int ret;
>   	int was_full = istate->sparse_index == INDEX_EXPANDED;
> @@ -3185,7 +3197,7 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>   	 */
>   	trace2_region_enter_printf("index", "do_write_index", the_repository,
>   				   "%s", get_lock_file_path(lock));
> -	ret = do_write_index(istate, lock->tempfile, 0, flags);
> +	ret = do_write_index(istate, lock->tempfile, strip_link_extension ? STRIP_LINK_EXTENSION_ONLY : 0, flags);
>   	trace2_region_leave_printf("index", "do_write_index", the_repository,
>   				   "%s", get_lock_file_path(lock));
>   
> @@ -3214,7 +3226,7 @@ static int write_split_index(struct index_state *istate,
>   {
>   	int ret;
>   	prepare_to_write_split_index(istate);
> -	ret = do_write_locked_index(istate, lock, flags);
> +	ret = do_write_locked_index(istate, lock, flags, 0);

could we use the enum values here instead of 0 ?

>   	finish_writing_split_index(istate);
>   	return ret;
>   }
> @@ -3366,9 +3378,7 @@ int write_locked_index(struct index_state *istate, struct lock_file *lock,
>   	if ((!si && !test_split_index_env) ||
>   	    alternate_index_output ||
>   	    (istate->cache_changed & ~EXTMASK)) {
> -		if (si)
> -			oidclr(&si->base_oid);
> -		ret = do_write_locked_index(istate, lock, flags);
> +		ret = do_write_locked_index(istate, lock, flags, 1);

and here

>   		goto out;
>   	}
>   
> @@ -3394,8 +3404,7 @@ int write_locked_index(struct index_state *istate, struct lock_file *lock,
>   		/* Same initial permissions as the main .git/index file */
>   		temp = mks_tempfile_sm(git_path("sharedindex_XXXXXX"), 0, 0666);
>   		if (!temp) {
> -			oidclr(&si->base_oid);
> -			ret = do_write_locked_index(istate, lock, flags);
> +			ret = do_write_locked_index(istate, lock, flags, 1);

and here

>   			goto out;
>   		}
>   		ret = write_shared_index(istate, &temp, flags);
> diff --git a/t/t7527-builtin-fsmonitor.sh b/t/t7527-builtin-fsmonitor.sh
> index cbafdd69602..9fab9a2ab38 100755
> --- a/t/t7527-builtin-fsmonitor.sh
> +++ b/t/t7527-builtin-fsmonitor.sh
> @@ -1003,7 +1003,7 @@ test_expect_success !UNICODE_COMPOSITION_SENSITIVE 'Unicode nfc/nfd' '
>   	egrep "^event: nfd/d_${utf8_nfc}/?$" ./unicode.trace
>   '
>   
> -test_expect_failure 'split-index and FSMonitor work well together' '
> +test_expect_success 'split-index and FSMonitor work well together' '
>   	git init split-index &&
>   	test_when_finished "git -C \"$PWD/split-index\" \
>   		fsmonitor--daemon stop" &&

Thanks
Jeff
Junio C Hamano March 23, 2023, 4:06 p.m. UTC | #4
Jeff Hostetler <git@jeffhostetler.com> writes:

>>> +enum strip_extensions {
>>> +	WRITE_ALL_EXTENSIONS = 0,
>>> +	STRIP_ALL_EXTENSIONS = 1,
>>> +	STRIP_LINK_EXTENSION_ONLY = 2
>>> +};
>> We do not need to spell out the specific values for this enum; the
>> users' (i.e. the callers of do_write_index()) sole requirement is
>> for these symbols to have different values.
>
> There are several calls to do_write_locked_index() that pass 0 or 1
> as the new final arg.  If we update them to use these enum values,
> then we don't need integer values here.

Good eyes.  Yes, the new caller that selectively passes
STRIP_LINK_EXTENSION_ONLY should pass WRITE_ALL_EXTENSIONS, not 0,
on the other side of ?: as you pointed out.

Thanks.
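
A minimal sketch of the agreed change to the `?:` expression, pulled out into a helper so it can be read on its own. `link_flag_to_mode()` is a hypothetical name; in the patch the expression sits inline in `do_write_locked_index()`.

```c
enum strip_extensions {
	WRITE_ALL_EXTENSIONS = 0,
	STRIP_ALL_EXTENSIONS = 1,
	STRIP_LINK_EXTENSION_ONLY = 2
};

/*
 * Hypothetical helper: translate the int flag taken by
 * do_write_locked_index() into the enum expected by do_write_index(),
 * using named constants on both sides of the conditional.
 */
static enum strip_extensions link_flag_to_mode(int strip_link_extension)
{
	return strip_link_extension ? STRIP_LINK_EXTENSION_ONLY
				    : WRITE_ALL_EXTENSIONS;
}
```

Because the wrapper accepts only an int flag, its callers still have no way to request `STRIP_ALL_EXTENSIONS` by accident.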
Patch

diff --git a/read-cache.c b/read-cache.c
index b09128b1884..8fcb2d54c05 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -2868,6 +2868,12 @@  static int record_ieot(void)
 	return !git_config_get_index_threads(&val) && val != 1;
 }
 
+enum strip_extensions {
+	WRITE_ALL_EXTENSIONS = 0,
+	STRIP_ALL_EXTENSIONS = 1,
+	STRIP_LINK_EXTENSION_ONLY = 2
+};
+
 /*
  * On success, `tempfile` is closed. If it is the temporary file
  * of a `struct lock_file`, we will therefore effectively perform
@@ -2876,7 +2882,7 @@  static int record_ieot(void)
  * rely on it.
  */
 static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
-			  int strip_extensions, unsigned flags)
+			  enum strip_extensions strip_extensions, unsigned flags)
 {
 	uint64_t start = getnanotime();
 	struct hashfile *f;
@@ -3045,7 +3051,7 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 			return -1;
 	}
 
-	if (!strip_extensions && istate->split_index &&
+	if (strip_extensions == WRITE_ALL_EXTENSIONS && istate->split_index &&
 	    !is_null_oid(&istate->split_index->base_oid)) {
 		struct strbuf sb = STRBUF_INIT;
 
@@ -3060,7 +3066,7 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		if (err)
 			return -1;
 	}
-	if (!strip_extensions && !drop_cache_tree && istate->cache_tree) {
+	if (strip_extensions != STRIP_ALL_EXTENSIONS && !drop_cache_tree && istate->cache_tree) {
 		struct strbuf sb = STRBUF_INIT;
 
 		cache_tree_write(&sb, istate->cache_tree);
@@ -3070,7 +3076,7 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		if (err)
 			return -1;
 	}
-	if (!strip_extensions && istate->resolve_undo) {
+	if (strip_extensions != STRIP_ALL_EXTENSIONS && istate->resolve_undo) {
 		struct strbuf sb = STRBUF_INIT;
 
 		resolve_undo_write(&sb, istate->resolve_undo);
@@ -3081,7 +3087,7 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		if (err)
 			return -1;
 	}
-	if (!strip_extensions && istate->untracked) {
+	if (strip_extensions != STRIP_ALL_EXTENSIONS && istate->untracked) {
 		struct strbuf sb = STRBUF_INIT;
 
 		write_untracked_extension(&sb, istate->untracked);
@@ -3092,7 +3098,7 @@  static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		if (err)
 			return -1;
 	}
-	if (!strip_extensions && istate->fsmonitor_last_update) {
+	if (strip_extensions != STRIP_ALL_EXTENSIONS && istate->fsmonitor_last_update) {
 		struct strbuf sb = STRBUF_INIT;
 
 		write_fsmonitor_extension(&sb, istate);
@@ -3166,8 +3172,14 @@  static int commit_locked_index(struct lock_file *lk)
 		return commit_lock_file(lk);
 }
 
+/*
+ * Write the Git index into a `.lock` file
+ *
+ * If `strip_link_extension` is non-zero, avoid writing any "link" extension
+ * (used by the split-index feature).
+ */
 static int do_write_locked_index(struct index_state *istate, struct lock_file *lock,
-				 unsigned flags)
+				 unsigned flags, int strip_link_extension)
 {
 	int ret;
 	int was_full = istate->sparse_index == INDEX_EXPANDED;
@@ -3185,7 +3197,7 @@  static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	 */
 	trace2_region_enter_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
-	ret = do_write_index(istate, lock->tempfile, 0, flags);
+	ret = do_write_index(istate, lock->tempfile, strip_link_extension ? STRIP_LINK_EXTENSION_ONLY : 0, flags);
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
@@ -3214,7 +3226,7 @@  static int write_split_index(struct index_state *istate,
 {
 	int ret;
 	prepare_to_write_split_index(istate);
-	ret = do_write_locked_index(istate, lock, flags);
+	ret = do_write_locked_index(istate, lock, flags, 0);
 	finish_writing_split_index(istate);
 	return ret;
 }
@@ -3366,9 +3378,7 @@  int write_locked_index(struct index_state *istate, struct lock_file *lock,
 	if ((!si && !test_split_index_env) ||
 	    alternate_index_output ||
 	    (istate->cache_changed & ~EXTMASK)) {
-		if (si)
-			oidclr(&si->base_oid);
-		ret = do_write_locked_index(istate, lock, flags);
+		ret = do_write_locked_index(istate, lock, flags, 1);
 		goto out;
 	}
 
@@ -3394,8 +3404,7 @@  int write_locked_index(struct index_state *istate, struct lock_file *lock,
 		/* Same initial permissions as the main .git/index file */
 		temp = mks_tempfile_sm(git_path("sharedindex_XXXXXX"), 0, 0666);
 		if (!temp) {
-			oidclr(&si->base_oid);
-			ret = do_write_locked_index(istate, lock, flags);
+			ret = do_write_locked_index(istate, lock, flags, 1);
 			goto out;
 		}
 		ret = write_shared_index(istate, &temp, flags);
diff --git a/t/t7527-builtin-fsmonitor.sh b/t/t7527-builtin-fsmonitor.sh
index cbafdd69602..9fab9a2ab38 100755
--- a/t/t7527-builtin-fsmonitor.sh
+++ b/t/t7527-builtin-fsmonitor.sh
@@ -1003,7 +1003,7 @@  test_expect_success !UNICODE_COMPOSITION_SENSITIVE 'Unicode nfc/nfd' '
 	egrep "^event: nfd/d_${utf8_nfc}/?$" ./unicode.trace
 '
 
-test_expect_failure 'split-index and FSMonitor work well together' '
+test_expect_success 'split-index and FSMonitor work well together' '
 	git init split-index &&
 	test_when_finished "git -C \"$PWD/split-index\" \
 		fsmonitor--daemon stop" &&