diff mbox series

[v2,4/5] split-index: don't compare stat data of entries already marked for split index

Message ID 20180927124434.30835-5-szeder.dev@gmail.com (mailing list archive)
State New, archived
Headers show
Series Fix the racy split index problem | expand

Commit Message

SZEDER Gábor Sept. 27, 2018, 12:44 p.m. UTC
When unpack_trees() constructs a new index, it copies cache entries
from the original index [1].  prepare_to_write_split_index() has to
deal with this, and it has a dedicated code path for copied entries
that are present in the shared index, where it compares the cached
data in the corresponding copied and original entries.  If the cached
data matches, then they are considered the same; if it differs, then
the copied entry will be marked for inclusion as a replacement entry
in the just about to be written split index by setting the
CE_UPDATE_IN_BASE flag.

However, a cache entry already has its CE_UPDATE_IN_BASE flag set upon
reading the split index, if the entry already has a replacement entry
there, or upon refreshing the cached stat data, if the corresponding
file was modified.  The state of this flag is then preserved when
unpack_trees() copies a cache entry from the shared index.

So modify prepare_to_write_split_index() to check the copied cache
entries' CE_UPDATE_IN_BASE flag first, and skip the thorough
comparison of cached data if the flag is already set.

Note that comparing the cached data in copied and original entries in
the shared index might actually be entirely unnecessary.  In theory
all code paths refreshing the cached stat data of an entry in the
shared index should set the CE_UPDATE_IN_BASE flag in that entry, and
unpack_trees() should preserve this flag when copying cache entries.
This means that the cached data is only ever changed if the
CE_UPDATE_IN_BASE flag is set as well.  Our test suite seems to
confirm this: instrumenting the conditions in question and running the
test suite repeatedly with 'GIT_TEST_SPLIT_INDEX=yes' showed that the
cached data in a copied entry differs from the data in the shared
entry only if its CE_UPDATE_IN_BASE flag is indeed set.

In practice, however, our test suite doesn't have 100% coverage,
GIT_TEST_SPLIT_INDEX is inherently random, and I certainly can't claim
to possess complete understanding of what goes on in unpack_trees()...
Therefore I kept the comparison of the cached data when
CE_UPDATE_IN_BASE is not set, just in case that an unnoticed or future
code path were to accidentally miss setting this flag upon refreshing
the cached stat data or unpack_trees() were to drop this flag while
copying a cache entry.

[1] Note that when unpack_trees() constructs the new index and decides
    that a cache entry should now refer to different content than what
    was recorded in the original index (e.g. 'git read-tree -m
    HEAD^'), then that can't really be considered a copy of the
    original, but rather the creation of a new entry.  Notably and
    pertinent to the split index feature, such a new entry doesn't
    have a reference to the original's shared index entry anymore,
    i.e. its 'index' field is set to 0.  Consequently, such an entry
    is treated by prepare_to_write_split_index() as an entry not
    present in the shared index and it will be added to the new split
    index, while the original entry will be marked as deleted, and
    neither the above discussion nor the changes in this patch apply
    to them.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 split-index.c | 79 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 62 insertions(+), 17 deletions(-)

Comments

SZEDER Gábor Sept. 27, 2018, 1:43 p.m. UTC | #1
On Thu, Sep 27, 2018 at 02:44:33PM +0200, SZEDER Gábor wrote:
>  split-index.c | 79 ++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 62 insertions(+), 17 deletions(-)

I generated this patch with more context lines than usual, so the two
conditions that I didn't add any comments to in this or in the next
patch are fully visible.

> diff --git a/split-index.c b/split-index.c
> index 548272ec33..7d8799f6b7 100644
> --- a/split-index.c
> +++ b/split-index.c
> @@ -204,19 +204,34 @@ void prepare_to_write_split_index(struct index_state *istate)
>  		 * that are not marked with either CE_MATCHED or
>  		 * CE_UPDATE_IN_BASE. If istate->cache[i] is a
>  		 * duplicate, deduplicate it.
>  		 */
>  		for (i = 0; i < istate->cache_nr; i++) {
>  			struct cache_entry *base;
> -			/* namelen is checked separately */
> -			const unsigned int ondisk_flags =
> -				CE_STAGEMASK | CE_VALID | CE_EXTENDED_FLAGS;
> -			unsigned int ce_flags, base_flags, ret;
>  			ce = istate->cache[i];
> -			if (!ce->index)
> +			if (!ce->index) {
> +				/*
> +				 * During simple update index operations this
> +				 * is a cache entry that is not present in
> +				 * the shared index.  It will be added to the
> +				 * split index.
> +				 *
> +				 * However, it might also represent a file
> +				 * that already has a cache entry in the
> +				 * shared index, but a new index has just
> +				 * been constructed by unpack_trees(), and
> +				 * this entry now refers to different content
> +				 * than what was recorded in the original
> +				 * index, e.g. during 'read-tree -m HEAD^' or
> +				 * 'checkout HEAD^'.  In this case the
> +				 * original entry in the shared index will be
> +				 * marked as deleted, and this entry will be
> +				 * added to the split index.
> +				 */
>  				continue;
> +			}
>  			if (ce->index > si->base->cache_nr) {
>  				ce->index = 0;
>  				continue;
>  			}

This condition in the context above checks whether a cache entry
refers to a non-existing entry in the shared index.

I don't understand the role of this condition, for two reasons:

  - Under what circumstances can this condition be ever fulfilled?

    I instrumented it and run the test suite repeatedly with
    'GIT_TEST_SPLIT_INDEX=yes', but it has never been fulfilled.  I
    also tried to come up with all kinds of elaborate scenarios to
    trigger it, but no joy, and code inspection didn't bring anything
    either.

  - There are similar conditions in 'split-index.c' in the functions
    mark_entry_for_delete() and replace_entry(); here is the one from
    the latter, but they only differ in the error message:

      if (pos >= istate->cache_nr)
          die("position for replacement %d exceeds base index size %d",
              (int)pos, istate->cache_nr);

    (Note that this 'istate->cache_nr' here equals
    to 'si->base->cache_nr'; see their caller merge_base_index().)

    The die() clearly indicates that fulfilling this condition is a
    Bad Thing.  These two functions are invoked to create a unified
    view of the just read split and shared indexes, so the fulfillment
    of this condition could indicate a corrupt index file, and
    die()ing right away seems to be justified.

    Then why doesn't the condition in prepare_to_write_split_index()
    die() as well?!  After all if it were fulfilled, then it would
    indicate a corruption in the current index_state, and writing a
    new split index from corrupt data doesn't seem like a particularly
    good idea.


>  			ce->ce_flags |= CE_MATCHED; /* or "shared" */
>  			base = si->base->cache[ce->index - 1];
> @@ -224,24 +239,54 @@ void prepare_to_write_split_index(struct index_state *istate)
>  				continue;
>  			if (ce->ce_namelen != base->ce_namelen ||
>  			    strcmp(ce->name, base->name)) {
>  				ce->index = 0;
>  				continue;
>  			}

I don't understand the role of this condition either, and just like
the one discussed above, the test suite with
'GIT_TEST_SPLIT_INDEX=yes' seems to never fulfill it.

> -			ce_flags = ce->ce_flags;
> -			base_flags = base->ce_flags;
> -			/* only on-disk flags matter */
> -			ce->ce_flags   &= ondisk_flags;
> -			base->ce_flags &= ondisk_flags;
> -			ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
> -				     offsetof(struct cache_entry, name) -
> -				     offsetof(struct cache_entry, ce_stat_data));
> -			ce->ce_flags = ce_flags;
> -			base->ce_flags = base_flags;
> -			if (ret)
> -				ce->ce_flags |= CE_UPDATE_IN_BASE;
> +			/*
> +			 * This is the copy of a cache entry that is present
> +			 * in the shared index, created by unpack_trees()
> +			 * while it constructed a new index.
> +			 */
> +			if (ce->ce_flags & CE_UPDATE_IN_BASE) {
> +				/*
> +				 * Already marked for inclusion in the split
> +				 * index, either because the corresponding
> +				 * file was modified and the cached stat data
> +				 * was refreshed, or because the original
> +				 * entry already had a replacement entry in
> +				 * the split index.
> +				 * Nothing to do.
> +				 */
> +			} else {
> +				/*
> +				 * Thoroughly compare the cached data to see
> +				 * whether it should be marked for inclusion
> +				 * in the split index.
> +				 *
> +				 * This comparison might be unnecessary, as
> +				 * code paths modifying the cached data do
> +				 * set CE_UPDATE_IN_BASE as well.
> +				 */
> +				const unsigned int ondisk_flags =
> +					CE_STAGEMASK | CE_VALID |
> +					CE_EXTENDED_FLAGS;
> +				unsigned int ce_flags, base_flags, ret;
> +				ce_flags = ce->ce_flags;
> +				base_flags = base->ce_flags;
> +				/* only on-disk flags matter */
> +				ce->ce_flags   &= ondisk_flags;
> +				base->ce_flags &= ondisk_flags;
> +				ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
> +					     offsetof(struct cache_entry, name) -
> +					     offsetof(struct cache_entry, ce_stat_data));
> +				ce->ce_flags = ce_flags;
> +				base->ce_flags = base_flags;
> +				if (ret)
> +					ce->ce_flags |= CE_UPDATE_IN_BASE;
> +			}
>  			discard_cache_entry(base);
>  			si->base->cache[ce->index - 1] = ce;
>  		}
>  		for (i = 0; i < si->base->cache_nr; i++) {
>  			ce = si->base->cache[i];
>  			if ((ce->ce_flags & CE_REMOVE) ||
> -- 
> 2.19.0.361.gafc87ffe72
>
diff mbox series

Patch

diff --git a/split-index.c b/split-index.c
index 548272ec33..7d8799f6b7 100644
--- a/split-index.c
+++ b/split-index.c
@@ -204,19 +204,34 @@  void prepare_to_write_split_index(struct index_state *istate)
 		 * that are not marked with either CE_MATCHED or
 		 * CE_UPDATE_IN_BASE. If istate->cache[i] is a
 		 * duplicate, deduplicate it.
 		 */
 		for (i = 0; i < istate->cache_nr; i++) {
 			struct cache_entry *base;
-			/* namelen is checked separately */
-			const unsigned int ondisk_flags =
-				CE_STAGEMASK | CE_VALID | CE_EXTENDED_FLAGS;
-			unsigned int ce_flags, base_flags, ret;
 			ce = istate->cache[i];
-			if (!ce->index)
+			if (!ce->index) {
+				/*
+				 * During simple update index operations this
+				 * is a cache entry that is not present in
+				 * the shared index.  It will be added to the
+				 * split index.
+				 *
+				 * However, it might also represent a file
+				 * that already has a cache entry in the
+				 * shared index, but a new index has just
+				 * been constructed by unpack_trees(), and
+				 * this entry now refers to different content
+				 * than what was recorded in the original
+				 * index, e.g. during 'read-tree -m HEAD^' or
+				 * 'checkout HEAD^'.  In this case the
+				 * original entry in the shared index will be
+				 * marked as deleted, and this entry will be
+				 * added to the split index.
+				 */
 				continue;
+			}
 			if (ce->index > si->base->cache_nr) {
 				ce->index = 0;
 				continue;
 			}
 			ce->ce_flags |= CE_MATCHED; /* or "shared" */
 			base = si->base->cache[ce->index - 1];
@@ -224,24 +239,54 @@  void prepare_to_write_split_index(struct index_state *istate)
 				continue;
 			if (ce->ce_namelen != base->ce_namelen ||
 			    strcmp(ce->name, base->name)) {
 				ce->index = 0;
 				continue;
 			}
-			ce_flags = ce->ce_flags;
-			base_flags = base->ce_flags;
-			/* only on-disk flags matter */
-			ce->ce_flags   &= ondisk_flags;
-			base->ce_flags &= ondisk_flags;
-			ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
-				     offsetof(struct cache_entry, name) -
-				     offsetof(struct cache_entry, ce_stat_data));
-			ce->ce_flags = ce_flags;
-			base->ce_flags = base_flags;
-			if (ret)
-				ce->ce_flags |= CE_UPDATE_IN_BASE;
+			/*
+			 * This is the copy of a cache entry that is present
+			 * in the shared index, created by unpack_trees()
+			 * while it constructed a new index.
+			 */
+			if (ce->ce_flags & CE_UPDATE_IN_BASE) {
+				/*
+				 * Already marked for inclusion in the split
+				 * index, either because the corresponding
+				 * file was modified and the cached stat data
+				 * was refreshed, or because the original
+				 * entry already had a replacement entry in
+				 * the split index.
+				 * Nothing to do.
+				 */
+			} else {
+				/*
+				 * Thoroughly compare the cached data to see
+				 * whether it should be marked for inclusion
+				 * in the split index.
+				 *
+				 * This comparison might be unnecessary, as
+				 * code paths modifying the cached data do
+				 * set CE_UPDATE_IN_BASE as well.
+				 */
+				const unsigned int ondisk_flags =
+					CE_STAGEMASK | CE_VALID |
+					CE_EXTENDED_FLAGS;
+				unsigned int ce_flags, base_flags, ret;
+				ce_flags = ce->ce_flags;
+				base_flags = base->ce_flags;
+				/* only on-disk flags matter */
+				ce->ce_flags   &= ondisk_flags;
+				base->ce_flags &= ondisk_flags;
+				ret = memcmp(&ce->ce_stat_data, &base->ce_stat_data,
+					     offsetof(struct cache_entry, name) -
+					     offsetof(struct cache_entry, ce_stat_data));
+				ce->ce_flags = ce_flags;
+				base->ce_flags = base_flags;
+				if (ret)
+					ce->ce_flags |= CE_UPDATE_IN_BASE;
+			}
 			discard_cache_entry(base);
 			si->base->cache[ce->index - 1] = ce;
 		}
 		for (i = 0; i < si->base->cache_nr; i++) {
 			ce = si->base->cache[i];
 			if ((ce->ce_flags & CE_REMOVE) ||