mbox series

[v2,0/6] cache API: always have a "istate->repo"

Message ID cover-v2-0.6-00000000000-20230112T124842Z-avarab@gmail.com (mailing list archive)
Headers show
Series cache API: always have a "istate->repo" | expand

Message

Ævar Arnfjörð Bjarmason Jan. 12, 2023, 12:55 p.m. UTC
The "struct index_state" contains a "repo" member, which should be a
pointer to the repository for that index, but which due to us
constructing such structs on an ad-hoc basis in various places wasn't
always available.

We'd thus end up with code like this, in the recent
ds/omit-trailing-hash-in-index topic:

			       struct repository *r = istate->repo ? istate->repo : the_repository;

Really we should be able to trust the "istate->repo", but were
carrying those sorts of conditionals because our index might come from
a manually constructed source, so we'd have to fall back to
"the_repository".

This series changes the relvant code so the "repo" field is always
non-NULL, as 6/6 here shows we had various workarounds in place for
that, which can now go away.

For v1 and more general summary, see:
https://lore.kernel.org/git/cover-0.5-00000000000-20230110T060340Z-avarab@gmail.com/

See
https://github.com/avar/git/tree/avar/do-not-lazy-populate-istate-repo-2
for passing CI and a fetchable branch for this topic.

Change since v1:

 * Typo/rewording/correcting of commit messages
 * Kept a "TODO" comment which Derrick notes I shouldn't have removed

 * Derrick suggested in
   https://lore.kernel.org/git/6b92fad2-6b74-fddb-679c-8c8735e7103d@github.com/
   that things might be nicer if we had an explicit initializer, which
   was also the subject of the previous discussion at
   https://lore.kernel.org/git/xmqqmt6vqo2w.fsf@gitster.g/; but
   concluded that it was probably best to leave it for now.

   I tried it out, and I think it's worth just doing that now, which
   is why there's a new 5/6 here: We start by adding an
   INDEX_STATE_INIT macro, and corresponding function.

   There's a bit of churn in 6/6 as all of those now will take a
   "repo" argument, but I think the end result is worth it, because
   even if "repo" remains the only thing that we need to initialize
   we're now able to use ALLOC_ARRAY() instead of CALLOC_ARRAY().

   We'll thus be helped by analysis tools (which would show access to
   un-init'd memory) if we miss properly init-ing not just the "repo"
   field, but anything in the structure, so our test coverage will be
   better.

   It also makes the code easier to follow and change, as it's now
   more obvious where we initialize this, and it'll be easier to
   change it in the future if e.g. we add a new member that has
   mandatory initialization (e.g. a "struct strbuf").

Ævar Arnfjörð Bjarmason (6):
  builtin/difftool.c: { 0 }-initialize rather than using memset()
  sparse-index.c: expand_to_path() can assume non-NULL "istate"
  sparse-index API: BUG() out on NULL ensure_full_index()
  read-cache.c: refactor set_new_index_sparsity() for subsequent commit
  cache API: add a "INDEX_STATE_INIT" macro/function, add
    release_index()
  treewide: always have a valid "index_state.repo" member

 apply.c                   |  2 +-
 builtin/difftool.c        |  4 +---
 builtin/sparse-checkout.c |  1 +
 builtin/stash.c           | 16 ++++++-------
 builtin/worktree.c        |  2 +-
 cache.h                   | 16 +++++++++++++
 fsmonitor-settings.c      | 14 ------------
 fsmonitor.c               |  2 +-
 merge-recursive.c         |  2 +-
 read-cache.c              | 48 +++++++++++++++++++--------------------
 repository.c              |  8 +++++--
 revision.c                |  2 +-
 sparse-index.c            | 15 ++++--------
 split-index.c             |  3 ++-
 unpack-trees.c            |  5 ++--
 15 files changed, 69 insertions(+), 71 deletions(-)

Range-diff against v1:
1:  214fe7d3fc2 = 1:  214fe7d3fc2 builtin/difftool.c: { 0 }-initialize rather than using memset()
2:  6db74fe958f = 2:  6db74fe958f sparse-index.c: expand_to_path() can assume non-NULL "istate"
3:  d96388acef6 ! 3:  25e9cff0e97 sparse-index API: fix TODO, BUG() out on NULL ensure_full_index()
    @@ Metadata
     Author: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
      ## Commit message ##
    -    sparse-index API: fix TODO, BUG() out on NULL ensure_full_index()
    +    sparse-index API: BUG() out on NULL ensure_full_index()
     
         Make the ensure_full_index() function stricter, and have it only
         accept a non-NULL "struct index_state". This function (and this
         behavior) was added in [1].
     
         The only reason it needed to be this lax was due to interaction with
    -    repo_index_has_changes(). See the addition of that code in . This
    -    fixes one of the TODO comments added in 0c18c059a15, the other one was
    -    already removed in [3].
    +    repo_index_has_changes(). See the addition of that code in [2].
     
         The other reason for why this was needed dates back to interaction
    -    with code added in [4]. In [5] we started calling ensure_full_index()
    +    with code added in [3]. In [4] we started calling ensure_full_index()
         in unpack_trees(), but the caller added in 34110cd4e39 wants to pass
         us a NULL "dst_index". Let's instead do the NULL check in
         unpack_trees() itself.
     
         1. 4300f8442a2 (sparse-index: implement ensure_full_index(), 2021-03-30)
         2. 0c18c059a15 (read-cache: ensure full index, 2021-04-01)
    -    3. d76723ee531 (status: use sparse-index throughout, 2021-07-14).
    -    4. 34110cd4e39 (Make 'unpack_trees()' have a separate source and
    +    3. 34110cd4e39 (Make 'unpack_trees()' have a separate source and
            destination index, 2008-03-06)
    -    5. 6863df35503 (unpack-trees: ensure full index, 2021-03-30)
    +    4. 6863df35503 (unpack-trees: ensure full index, 2021-03-30)
     
         Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
     
    - ## read-cache.c ##
    -@@ read-cache.c: int repo_index_has_changes(struct repository *repo,
    - 		}
    - 		diff_flush(&opt);
    - 		return opt.flags.has_changes != 0;
    --	} else {
    --		/* TODO: audit for interaction with sparse-index. */
    -+	} else if (istate) {
    - 		ensure_full_index(istate);
    - 		for (i = 0; sb && i < istate->cache_nr; i++) {
    - 			if (i)
    -@@ read-cache.c: int repo_index_has_changes(struct repository *repo,
    - 			strbuf_addstr(sb, istate->cache[i]->name);
    - 		}
    - 		return !!istate->cache_nr;
    -+	} else {
    -+		return 0;
    - 	}
    - }
    - 
    -
      ## sparse-index.c ##
     @@ sparse-index.c: void expand_index(struct index_state *istate, struct pattern_list *pl)
      	 * If the index is already full, then keep it full. We will convert
4:  e514a881e38 = 4:  a75977bd37b read-cache.c: refactor set_new_index_sparsity() for subsequent commit
-:  ----------- > 5:  ae256efe94a cache API: add a "INDEX_STATE_INIT" macro/function, add release_index()
5:  b3b0e592101 ! 6:  291233b0420 treewide: always have a valid "index_state.repo" member
    @@ Commit message
         second-guessing is the "istate->repo ? istate->repo : the_repository"
         code in [2]. We can now simply use "istate->repo".
     
    +    We're doing this by making use of the INDEX_STATE_INIT() macro (and
    +    corresponding function) added in the preceding commit, which now have
    +    mandatory "repo" arguments. As seen here there are cases where we set
    +    it to NULL, but only if we're about to fill in the correct non-NULL
    +    "repo" right afterwards.
    +
         For "fsmonitor-settings.c" we can remove the initialization of a NULL
         "r" argument to "the_repository". This was added back in [3], and was
         needed at the time for callers that would pass us the "r" from an
    @@ apply.c: static int preimage_oid_in_gitlink_patch(struct patch *p, struct object
      static int build_fake_ancestor(struct apply_state *state, struct patch *list)
      {
      	struct patch *patch;
    --	struct index_state result = { NULL };
    -+	struct index_state result = { .repo = state->repo };
    +-	struct index_state result = INDEX_STATE_INIT;
    ++	struct index_state result = INDEX_STATE_INIT(state->repo);
      	struct lock_file lock = LOCK_INIT;
      	int res;
      
    @@ builtin/difftool.c: static int run_dir_diff(const char *extcmd, int symlinks, co
      	struct hashmap symlinks2 = HASHMAP_INIT(pair_cmp, NULL);
      	struct hashmap_iter iter;
      	struct pair_entry *entry;
    --	struct index_state wtindex = { 0 };
    -+	struct index_state wtindex = { .repo = the_repository };
    +-	struct index_state wtindex = INDEX_STATE_INIT;
    ++	struct index_state wtindex = INDEX_STATE_INIT(the_repository);
      	struct checkout lstate, rstate;
      	int err = 0;
      	struct child_process cmd = CHILD_PROCESS_INIT;
    @@ builtin/sparse-checkout.c: static int update_working_directory(struct pattern_li
      	o.head_idx = -1;
      	o.src_index = r->index;
      	o.dst_index = r->index;
    -+	o.result.repo = r;
    +-	index_state_init(&o.result);
    ++	index_state_init(&o.result, r);
      	o.skip_sparse_checkout = 0;
      	o.pl = pl;
      
    @@ builtin/stash.c: static int save_untracked_files(struct stash_info *info, struct
      	int ret = 0;
      	struct strbuf untracked_msg = STRBUF_INIT;
      	struct child_process cp_upd_index = CHILD_PROCESS_INIT;
    --	struct index_state istate = { NULL };
    -+	struct index_state istate = { .repo = the_repository };
    +-	struct index_state istate = INDEX_STATE_INIT;
    ++	struct index_state istate = INDEX_STATE_INIT(the_repository);
      
      	cp_upd_index.git_cmd = 1;
      	strvec_pushl(&cp_upd_index.args, "update-index", "-z", "--add",
    @@ builtin/stash.c: static int stash_staged(struct stash_info *info, struct strbuf
      {
      	int ret = 0;
      	struct child_process cp_diff_tree = CHILD_PROCESS_INIT;
    --	struct index_state istate = { NULL };
    -+	struct index_state istate = { .repo = the_repository };
    +-	struct index_state istate = INDEX_STATE_INIT;
    ++	struct index_state istate = INDEX_STATE_INIT(the_repository);
      
      	if (write_index_as_tree(&info->w_tree, &istate, the_repository->index_file,
      				0, NULL)) {
    @@ builtin/stash.c: static int stash_patch(struct stash_info *info, const struct pa
      	int ret = 0;
      	struct child_process cp_read_tree = CHILD_PROCESS_INIT;
      	struct child_process cp_diff_tree = CHILD_PROCESS_INIT;
    --	struct index_state istate = { NULL };
    -+	struct index_state istate = { .repo = the_repository };
    +-	struct index_state istate = INDEX_STATE_INIT;
    ++	struct index_state istate = INDEX_STATE_INIT(the_repository);
      	char *old_index_env = NULL, *old_repo_index_file;
      
      	remove_path(stash_index_path.buf);
    @@ builtin/stash.c: static int stash_working_tree(struct stash_info *info, const st
      	struct rev_info rev;
      	struct child_process cp_upd_index = CHILD_PROCESS_INIT;
      	struct strbuf diff_output = STRBUF_INIT;
    --	struct index_state istate = { NULL };
    -+	struct index_state istate = { .repo = the_repository };
    +-	struct index_state istate = INDEX_STATE_INIT;
    ++	struct index_state istate = INDEX_STATE_INIT(the_repository);
      
      	init_revisions(&rev, NULL);
      	copy_pathspec(&rev.prune_data, ps);
    @@ builtin/worktree.c: static int unlock_worktree(int ac, const char **av, const ch
      
      static void validate_no_submodules(const struct worktree *wt)
      {
    --	struct index_state istate = { NULL };
    -+	struct index_state istate = { .repo = the_repository };
    +-	struct index_state istate = INDEX_STATE_INIT;
    ++	struct index_state istate = INDEX_STATE_INIT(the_repository);
      	struct strbuf path = STRBUF_INIT;
      	int i, found_submodules = 0;
      
     
    + ## cache.h ##
    +@@ cache.h: struct index_state {
    +  * If the variable won't be used again, use release_index() to free()
    +  * its resources. If it needs to be used again use discard_index(),
    +  * which does the same thing, but will use use index_state_init() at
    +- * the end.
    ++ * the end. The discard_index() will use its own "istate->repo" as the
    ++ * "r" argument to index_state_init() in that case.
    +  */
    +-#define INDEX_STATE_INIT { 0 }
    +-void index_state_init(struct index_state *istate);
    ++#define INDEX_STATE_INIT(r) { \
    ++	.repo = (r), \
    ++}
    ++void index_state_init(struct index_state *istate, struct repository *r);
    + void release_index(struct index_state *istate);
    + 
    + /* Name hashing */
    +
      ## fsmonitor-settings.c ##
     @@ fsmonitor-settings.c: static void lookup_fsmonitor_settings(struct repository *r)
      
    @@ merge-recursive.c: static int unpack_trees_start(struct merge_options *opt,
      {
      	int rc;
      	struct tree_desc t[3];
    --	struct index_state tmp_index = { NULL };
    -+	struct index_state tmp_index = { .repo = opt->repo };
    +-	struct index_state tmp_index = INDEX_STATE_INIT;
    ++	struct index_state tmp_index = INDEX_STATE_INIT(opt->repo);
      
      	memset(&opt->priv->unpack_opts, 0, sizeof(opt->priv->unpack_opts));
      	if (opt->priv->call_depth)
    @@ read-cache.c: int do_read_index(struct index_state *istate, const char *path, in
      	 * If the command explicitly requires a full index, force it
      	 * to be full. Otherwise, correct the sparsity based on repository
     @@ read-cache.c: int read_index_from(struct index_state *istate, const char *path,
    - 		discard_index(split_index->base);
    + 		release_index(split_index->base);
      	else
    - 		CALLOC_ARRAY(split_index->base, 1);
    -+	split_index->base->repo = istate->repo;
    + 		ALLOC_ARRAY(split_index->base, 1);
    +-	index_state_init(split_index->base);
    ++	index_state_init(split_index->base, istate->repo);
      
      	base_oid_hex = oid_to_hex(&split_index->base_oid);
      	base_path = xstrfmt("%s/sharedindex.%s", gitdir, base_oid_hex);
    +@@ read-cache.c: int is_index_unborn(struct index_state *istate)
    + 	return (!istate->cache_nr && !istate->timestamp.sec);
    + }
    + 
    +-void index_state_init(struct index_state *istate)
    ++void index_state_init(struct index_state *istate, struct repository *r)
    + {
    +-	struct index_state blank = INDEX_STATE_INIT;
    ++	struct index_state blank = INDEX_STATE_INIT(r);
    + 	memcpy(istate, &blank, sizeof(*istate));
    + }
    + 
    +@@ read-cache.c: void release_index(struct index_state *istate)
    + void discard_index(struct index_state *istate)
    + {
    + 	release_index(istate);
    +-	index_state_init(istate);
    ++	index_state_init(istate, istate->repo);
    + }
    + 
    + /*
     @@ read-cache.c: static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
      	int ieot_entries = 1;
      	struct index_entry_offset_table *ieot = NULL;
    @@ repository.c: void initialize_the_repository(void)
      	the_repo.remote_state = remote_state_new();
      	the_repo.parsed_objects = parsed_object_pool_new();
      
    -+	the_index.repo = the_repository;
    ++	index_state_init(&the_index, the_repository);
     +
      	repo_set_hash_algo(&the_repo, GIT_HASH_SHA1);
      }
      
    +@@ repository.c: int repo_read_index(struct repository *repo)
    + 
    + 	if (!repo->index) {
    + 		ALLOC_ARRAY(repo->index, 1);
    +-		index_state_init(repo->index);
    ++		index_state_init(repo->index, NULL /* set below */);
    + 	}
    + 
    + 	/* Complete the double-reference */
     
      ## revision.c ##
     @@ revision.c: void add_index_objects_to_pending(struct rev_info *revs, unsigned int flags)
      	worktrees = get_worktrees();
      	for (p = worktrees; *p; p++) {
      		struct worktree *wt = *p;
    --		struct index_state istate = { NULL };
    -+		struct index_state istate = { .repo = revs->repo };
    +-		struct index_state istate = INDEX_STATE_INIT;
    ++		struct index_state istate = INDEX_STATE_INIT(revs->repo);
      
      		if (wt->is_current)
      			continue; /* current index already taken care of */
    @@ split-index.c
     @@ split-index.c: void move_cache_to_base_index(struct index_state *istate)
      	}
      
    - 	CALLOC_ARRAY(si->base, 1);
    -+	si->base->repo = istate->repo;
    + 	ALLOC_ARRAY(si->base, 1);
    +-	index_state_init(si->base);
    ++	index_state_init(si->base, istate->repo);
      	si->base->version = istate->version;
      	/* zero timestamp disables racy test in ce_write_index() */
      	si->base->timestamp = istate->timestamp;
     
      ## unpack-trees.c ##
     @@ unpack-trees.c: int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
    + 		populate_from_existing_patterns(o, &pl);
      	}
      
    - 	memset(&o->result, 0, sizeof(o->result));
    -+	o->result.repo = o->src_index->repo;
    +-	index_state_init(&o->result);
    ++	index_state_init(&o->result, o->src_index->repo);
      	o->result.initialized = 1;
      	o->result.timestamp.sec = o->src_index->timestamp.sec;
      	o->result.timestamp.nsec = o->src_index->timestamp.nsec;

Comments

Derrick Stolee Jan. 12, 2023, 3:23 p.m. UTC | #1
On 1/12/2023 7:55 AM, Ævar Arnfjörð Bjarmason wrote:

>  * Derrick suggested in
>    https://lore.kernel.org/git/6b92fad2-6b74-fddb-679c-8c8735e7103d@github.com/
>    that things might be nicer if we had an explicit initializer, which
>    was also the subject of the previous discussion at
>    https://lore.kernel.org/git/xmqqmt6vqo2w.fsf@gitster.g/; but
>    concluded that it was probably best to leave it for now.
> 
>    I tried it out, and I think it's worth just doing that now, which
>    is why there's a new 5/6 here: We start by adding an
>    INDEX_STATE_INIT macro, and corresponding function.
> 
>    There's a bit of churn in 6/6 as all of those now will take a
>    "repo" argument, but I think the end result is worth it, because
>    even if "repo" remains the only thing that we need to initialize
>    we're now able to use ALLOC_ARRAY() instead of CALLOC_ARRAY().
> 
>    We'll thus be helped by analysis tools (which would show access to
>    un-init'd memory) if we miss properly init-ing not just the "repo"
>    field, but anything in the structure, so our test coverage will be
>    better.
> 
>    It also makes the code easier to follow and change, as it's now
>    more obvious where we initialize this, and it'll be easier to
>    change it in the future if e.g. we add a new member that has
>    mandatory initialization (e.g. a "struct strbuf").

I wasn't expecting the initializer idea to work as well as it has.
Now, Patch 5 does all the heavy lifting and Patch 6 is an easy read.

Outside of one question about the istate->initialized member (which
might not need any change) I'm happy with this version.

Thanks,
-Stolee
Junio C Hamano Jan. 13, 2023, 6:29 p.m. UTC | #2
Derrick Stolee <derrickstolee@github.com> writes:

> I wasn't expecting the initializer idea to work as well as it has.
> Now, Patch 5 does all the heavy lifting and Patch 6 is an easy read.
>
> Outside of one question about the istate->initialized member (which
> might not need any change) I'm happy with this version.

Thanks, both.  And especially for finding 913e0e99 (unpack_trees():
protect the handcrafted in-core index from read_cache(),
2008-08-23).

Will queue.
Junio C Hamano Jan. 13, 2023, 7:22 p.m. UTC | #3
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> The "struct index_state" contains a "repo" member, which should be a
> pointer to the repository for that index, but which due to us
> constructing such structs on an ad-hoc basis in various places wasn't
> always available.

I'd exclude 6/6 for now, as it seems to depend on some changes only
in 'next'.  Feel free to resend only that step with reduced scope so
that it applies to 'master', and send in incremental updates when
each of these topics that are only in 'next' graduates.

Thanks.
Ævar Arnfjörð Bjarmason Jan. 16, 2023, 1:38 p.m. UTC | #4
On Fri, Jan 13 2023, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> The "struct index_state" contains a "repo" member, which should be a
>> pointer to the repository for that index, but which due to us
>> constructing such structs on an ad-hoc basis in various places wasn't
>> always available.
>
> I'd exclude 6/6 for now, as it seems to depend on some changes only
> in 'next'.  Feel free to resend only that step with reduced scope so
> that it applies to 'master', and send in incremental updates when
> each of these topics that are only in 'next' graduates.

Okey, the 6/6 requires ds/omit-trailing-hash-in-index. As both it and
the 1-5/6 of this are in "next" now I think it's best that I just submit
the 6/6 stand-alone after both of those have graduated.
Philip Oakley Jan. 16, 2023, 4:53 p.m. UTC | #5
On 16/01/2023 13:38, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jan 13 2023, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>
>>> The "struct index_state" contains a "repo" member, which should be a
>>> pointer to the repository for that index, but which due to us
>>> constructing such structs on an ad-hoc basis in various places wasn't
>>> always available.
>> I'd exclude 6/6 for now, as it seems to depend on some changes only
>> in 'next'.  Feel free to resend only that step with reduced scope so
>> that it applies to 'master', and send in incremental updates when
>> each of these topics that are only in 'next' graduates.
> Okey, the 6/6 requires ds/omit-trailing-hash-in-index. As both it and
> the 1-5/6 of this are in "next" now I think it's best that I just submit
> the 6/6 stand-alone after both of those have graduated.
>
micro-nit: The commit message of 5/6 starts "As we'll see in a
subsequent commit..", which may need a slight tweak if 6/6 becomes 'far
away' in the commit tree.

--
Philip
Junio C Hamano Jan. 16, 2023, 6:39 p.m. UTC | #6
Philip Oakley <philipoakley@iee.email> writes:

> On 16/01/2023 13:38, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Jan 13 2023, Junio C Hamano wrote:
>>
>>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>>
>>>> The "struct index_state" contains a "repo" member, which should be a
>>>> pointer to the repository for that index, but which due to us
>>>> constructing such structs on an ad-hoc basis in various places wasn't
>>>> always available.
>>> I'd exclude 6/6 for now, as it seems to depend on some changes only
>>> in 'next'.  Feel free to resend only that step with reduced scope so
>>> that it applies to 'master', and send in incremental updates when
>>> each of these topics that are only in 'next' graduates.
>> Okey, the 6/6 requires ds/omit-trailing-hash-in-index. As both it and
>> the 1-5/6 of this are in "next" now I think it's best that I just submit
>> the 6/6 stand-alone after both of those have graduated.
>>
> micro-nit: The commit message of 5/6 starts "As we'll see in a
> subsequent commit..", which may need a slight tweak if 6/6 becomes 'far
> away' in the commit tree.

Indeed.  

I'd use 

	Hopefully in some not so distant future, we'll get
	advantages from always initializing the "repo" member ...

for now.