diff mbox series

setup: copy repository_format using helper

Message ID 20230612230453.70864-1-chooglen@google.com (mailing list archive)
State New, archived
Headers show
Series setup: copy repository_format using helper | expand

Commit Message

Glen Choo June 12, 2023, 11:04 p.m. UTC
In several parts of the setup machinery, we set up a repository_format
and then use it to set up the_repository in nearly the exact same way,
suggesting that we might be able to use a helper function to standardize
the behavior and make future modifications easier. Create this helper
function, setup_repository_from_format(), thus standardizing this
behavior.

To determine what the 'standardized behavior' should be, we can compare
the candidate call sites in repo_init(), setup_git_directory_gently(),
check_repository_format() and discover_git_directory(),

- All of them copy .worktree_config.

- All of them 'copy' .partial_clone. Most perform a shallow copy of the
  pointer, then set the .partial_clone = NULL so that it doesn't get
  cleared by clear_repository_format(). However,
  check_repository_format() copies the string deeply because the
  repository_format is sometimes read back (it is an "out" parameter).
  To accomodate both shallow copying and deep copying, toggle this
  behavior using the "modify_fmt_ok" parameter.

- Most of them set up repository.hash_algo, except
  discover_git_directory(). Our helper function unconditionally sets up
  .hash_algo because it turns out that discover_git_directory() probably
  doesn't need to set up "struct repository" at all!
  discover_git_directory() isn't actually used in the setup process - its
  only caller in the Git binary is read_early_config(). As explained by
  16ac8b8db6 (setup: introduce the discover_git_directory() function,
  2017-03-13), it is supposed to be an entrypoint into setup.c machinery
  that allows the Git directory to be discovered without side effects,
  in other words, we shouldn't have introduced side effects in
  ebaf3bcf1ae (repository: move global r_f_p_c to repo struct,
  2021-06-17). Fortunately, we didn't start to rely on this unintended
  behavior between then and now, so we can just drop it.

Signed-off-by: Glen Choo <chooglen@google.com>
---
Here's the helper function I had in mind. I was initially mistaken and
it turns out that we need to support deep copying, but fortunately,
t0001 is extremely thorough and will catch virtually any mistake in the
setup process. CI seems to pass, though it appears to be a little flaky
today and sometimes cancels jobs
(https://github.com/chooglen/git/actions/runs/5249029150).

If you're comfortable with it, I would prefer for you to squash this
into your patches so that we don't just end up changing the same few
lines. If not, I'll Reviewed-by your patches (if I don't find any other
concerns on a re-read) and send this as a 1-patch on top.

 repository.c |  7 +------
 setup.c      | 31 +++++++++++++++++++------------
 setup.h      | 10 ++++++++++
 3 files changed, 30 insertions(+), 18 deletions(-)

Comments

Victoria Dye June 13, 2023, 12:03 a.m. UTC | #1
Glen Choo wrote:
> In several parts of the setup machinery, we set up a repository_format
> and then use it to set up the_repository in nearly the exact same way,
> suggesting that we might be able to use a helper function to standardize
> the behavior and make future modifications easier. Create this helper
> function, setup_repository_from_format(), thus standardizing this
> behavior.
> 
> To determine what the 'standardized behavior' should be, we can compare
> the candidate call sites in repo_init(), setup_git_directory_gently(),
> check_repository_format() and discover_git_directory(),
> 
> - All of them copy .worktree_config.
> 
> - All of them 'copy' .partial_clone. Most perform a shallow copy of the
>   pointer, then set the .partial_clone = NULL so that it doesn't get
>   cleared by clear_repository_format(). However,
>   check_repository_format() copies the string deeply because the
>   repository_format is sometimes read back (it is an "out" parameter).
>   To accomodate both shallow copying and deep copying, toggle this
>   behavior using the "modify_fmt_ok" parameter.

Do you have a specific example of this happening? I see two uses of
'check_repository_format()' in the codebase:

1. in 'enter_repo()' ('path.c')
2. in 'init_db()' ('init-db.c')

The first one calls 'check_repository_format()' with 'NULL', which causes
the function to create a temporary 'struct repository_format' that is then
discarded at the end of the function - no need to worry about the value
being cleared there.

The second one does call 'check_repository_format()' with a 'struct
repository_format' instance, but the 'partial_clone' field field is not
accessed again after that. The only subsequent usages of the 'repo_fmt'
variable in 'init_db()' are:

- in 'validate_hash_algorithm()', where only the 'version' and 'hash_algo'
  fields are accessed.
- in 'create_default_files()', where only 'hash_algo' is accessed.

So, shouldn't it be safe to shallow-copy-and-NULL? But as I noted earlier
[1], if you do that it'll make the name 'check_repository_format()' a bit
misleading (since it's actually modifying its arg in place). So, if you
update to always shallow copy, 'check_repository_format()' should be renamed
to reflect its side effects.

[1] https://lore.kernel.org/git/49509708-c0a1-2439-a551-cab05d944b66@github.com/

> 
> - Most of them set up repository.hash_algo, except
>   discover_git_directory(). Our helper function unconditionally sets up
>   .hash_algo because it turns out that discover_git_directory() probably
>   doesn't need to set up "struct repository" at all!

If that's the case, shouldn't the 'repository_format' assignments in
'discover_git_directory()' be removed altogether? 

>   discover_git_directory() isn't actually used in the setup process - its
>   only caller in the Git binary is read_early_config(). As explained by
>   16ac8b8db6 (setup: introduce the discover_git_directory() function,
>   2017-03-13), it is supposed to be an entrypoint into setup.c machinery
>   that allows the Git directory to be discovered without side effects,
>   in other words, we shouldn't have introduced side effects in
>   ebaf3bcf1ae (repository: move global r_f_p_c to repo struct,
>   2021-06-17). Fortunately, we didn't start to rely on this unintended
>   behavior between then and now, so we can just drop it.
> 
> Signed-off-by: Glen Choo <chooglen@google.com>
> ---
> Here's the helper function I had in mind. I was initially mistaken and
> it turns out that we need to support deep copying, but fortunately,
> t0001 is extremely thorough and will catch virtually any mistake in the
> setup process. CI seems to pass, though it appears to be a little flaky
> today and sometimes cancels jobs
> (https://github.com/chooglen/git/actions/runs/5249029150).
> 
> If you're comfortable with it, I would prefer for you to squash this
> into your patches so that we don't just end up changing the same few
> lines. If not, I'll Reviewed-by your patches (if I don't find any other
> concerns on a re-read) and send this as a 1-patch on top.

Reading through the commit message & patch, I'm still not convinced this
refactor is a good idea - to me, it doesn't leave the code in a clearly
better state. If you feel strongly that it does, though, I'm happy to leave
it to others to review/decide but I would prefer that you keep it a separate
patch submission on top.

Thanks!

> diff --git a/repository.c b/repository.c
> index 104960f8f5..50f0b26a6c 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -181,12 +181,7 @@ int repo_init(struct repository *repo,
>  	if (read_and_verify_repository_format(&format, repo->commondir))
>  		goto error;
>  
> -	repo_set_hash_algo(repo, format.hash_algo);
> -	repo->repository_format_worktree_config = format.worktree_config;
> -
> -	/* take ownership of format.partial_clone */
> -	repo->repository_format_partial_clone = format.partial_clone;
> -	format.partial_clone = NULL;
> +	setup_repository_from_format(repo, &format, 1);
>  
>  	if (worktree)
>  		repo_set_worktree(repo, worktree);
> diff --git a/setup.c b/setup.c
> index d866395435..33ce58676f 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -1561,13 +1561,8 @@ const char *setup_git_directory_gently(int *nongit_ok)
>  			setup_git_env(gitdir);
>  		}
>  		if (startup_info->have_repository) {
> -			repo_set_hash_algo(the_repository, repo_fmt.hash_algo);
> -			the_repository->repository_format_worktree_config =
> -				repo_fmt.worktree_config;
> -			/* take ownership of repo_fmt.partial_clone */
> -			the_repository->repository_format_partial_clone =
> -				repo_fmt.partial_clone;
> -			repo_fmt.partial_clone = NULL;
> +			setup_repository_from_format(the_repository,
> +						     &repo_fmt, 1);
>  		}
>  	}
>  	/*
> @@ -1654,14 +1649,26 @@ void check_repository_format(struct repository_format *fmt)
>  		fmt = &repo_fmt;
>  	check_repository_format_gently(get_git_dir(), fmt, NULL);
>  	startup_info->have_repository = 1;
> -	repo_set_hash_algo(the_repository, fmt->hash_algo);
> -	the_repository->repository_format_worktree_config =
> -		fmt->worktree_config;
> -	the_repository->repository_format_partial_clone =
> -		xstrdup_or_null(fmt->partial_clone);
> +	setup_repository_from_format(the_repository, fmt, 0);
>  	clear_repository_format(&repo_fmt);
>  }
>  

I think you may be missing changes to 'discover_git_directory()'? Like I
mentioned above, though, if you don't think 'discover_git_directory()' needs
to set up 'the_repository', then those assignments should just be removed
(not replaced with 'setup_repository_from_format()').
Glen Choo June 13, 2023, 6:25 p.m. UTC | #2
Victoria Dye <vdye@github.com> writes:

>> - All of them 'copy' .partial_clone. Most perform a shallow copy of the
>>   pointer, then set the .partial_clone = NULL so that it doesn't get
>>   cleared by clear_repository_format(). However,
>>   check_repository_format() copies the string deeply because the
>>   repository_format is sometimes read back (it is an "out" parameter).
>>   To accomodate both shallow copying and deep copying, toggle this
>>   behavior using the "modify_fmt_ok" parameter.
>
> Do you have a specific example of this happening? I see two uses of
> 'check_repository_format()' in the codebase:
>
> 1. in 'enter_repo()' ('path.c')
> 2. in 'init_db()' ('init-db.c')
>
> The first one calls 'check_repository_format()' with 'NULL', which causes
> the function to create a temporary 'struct repository_format' that is then
> discarded at the end of the function - no need to worry about the value
> being cleared there.
>
> The second one does call 'check_repository_format()' with a 'struct
> repository_format' instance, but the 'partial_clone' field field is not
> accessed again after that. The only subsequent usages of the 'repo_fmt'
> variable in 'init_db()' are:
>
> - in 'validate_hash_algorithm()', where only the 'version' and 'hash_algo'
>   fields are accessed.
> - in 'create_default_files()', where only 'hash_algo' is accessed.
>
> So, shouldn't it be safe to shallow-copy-and-NULL? But as I noted earlier
> [1], if you do that it'll make the name 'check_repository_format()' a bit
> misleading (since it's actually modifying its arg in place). So, if you
> update to always shallow copy, 'check_repository_format()' should be renamed
> to reflect its side effects.

My understanding of check_repository_format() is that it serves double
duty of doing a) setup of the_repository and b) populating an "out"
parameter with the appropriate values. IMO a) is the side effect that
could warrant the rename, and b) is the expected, "read-only" use case.
From that perspective, doing a shallow copy here isn't really
introducing a weird side-effect (because the arg to an "out" parameter
should be zero-ed out to begin with), but it's returning a 'wrong'
value. You're right that it's safe because the NULL-ed value isn't read
back right now, but it's not any good if this function gains more
callers.

Your point about not having side effects in check_*() is a good one
though, and I'm starting to feel doubtful that we should be doing setup
there either....

>> If you're comfortable with it, I would prefer for you to squash this
>> into your patches so that we don't just end up changing the same few
>> lines. If not, I'll Reviewed-by your patches (if I don't find any other
>> concerns on a re-read) and send this as a 1-patch on top.
>
> Reading through the commit message & patch, I'm still not convinced this
> refactor is a good idea - to me, it doesn't leave the code in a clearly
> better state. If you feel strongly that it does, though, I'm happy to leave
> it to others to review/decide but I would prefer that you keep it a separate
> patch submission on top.

Okay. Given how weird check_repository_format() and
discover_git_directory() are, I think we haven't done enough
investigation to properly consolidate this logic, and doing that
introduces quite a lot of scope creep. It feels very unsatisfactory that
we are propagating a pattern that is suspicious in some places and
outright wrong in others instead of cleaning up as we go and leaving it
in a better state for future authors, but this series does leave some
_other_ parts in a better state (removing the global), and I think it's
still a net positive.

The helper function might not be a good idea yet, but I'm convinced that
removing the setup from discover_git_directory() is a good idea. I think
this series would be in a better state if we get rid of the wrong
pattern instead of extending it. Unfortunately, I forgot to include that
change in the patch I sent (ugh) but here's a patch that _just_ includes
the discover_git_directory() change that I hope you can squash into your
series (and you can use whatever bits of my commit message you see fit).

----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----

  diff --git a/setup.c b/setup.c
  index 33ce58676f..b172ffd48a 100644
  --- a/setup.c
  +++ b/setup.c
  @@ -1422,14 +1422,6 @@ int discover_git_directory(struct strbuf *commondir,
      return -1;
    }

  -	the_repository->repository_format_worktree_config =
  -		candidate.worktree_config;
  -
  -	/* take ownership of candidate.partial_clone */
  -	the_repository->repository_format_partial_clone =
  -		candidate.partial_clone;
  -	candidate.partial_clone = NULL;
  -
    clear_repository_format(&candidate);
    return 0;
  }

----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----

You can see that this patch based on top of yours passes CI

  https://github.com/git/git/commit/9469fe3a6b0efbe89d26ef096a2eebabea59c55f
  https://github.com/chooglen/git/actions/runs/5258672473

> I think you may be missing changes to 'discover_git_directory()'? Like I
> mentioned above, though, if you don't think 'discover_git_directory()' needs
> to set up 'the_repository', then those assignments should just be removed
> (not replaced with 'setup_repository_from_format()').

Ah sorry, yes they were meant to be removed. I somehow missed those as I
was preparing the patch.
Junio C Hamano June 13, 2023, 7:45 p.m. UTC | #3
Glen Choo <chooglen@google.com> writes:

> Victoria Dye <vdye@github.com> writes:
>
>> So, shouldn't it be safe to shallow-copy-and-NULL? But as I noted earlier
>> [1], if you do that it'll make the name 'check_repository_format()' a bit
>> misleading (since it's actually modifying its arg in place). So, if you
>> update to always shallow copy, 'check_repository_format()' should be renamed
>> to reflect its side effects.
>
> My understanding of check_repository_format() is that it serves double
> duty of doing a) setup of the_repository and b) populating an "out"
> parameter with the appropriate values. IMO a) is the side effect that
> could warrant the rename, and b) is the expected, "read-only" use case.
>
> From that perspective, doing a shallow copy here isn't really
> introducing a weird side-effect (because the arg to an "out" parameter
> should be zero-ed out to begin with), but it's returning a 'wrong'
> value. You're right that it's safe because the NULL-ed value isn't read
> back right now, but it's not any good if this function gains more
> callers.

Thanks for having this discussion.  The above makes perfect sense to
me.

> The helper function might not be a good idea yet, but I'm convinced that
> removing the setup from discover_git_directory() is a good idea. I think
> this series would be in a better state if we get rid of the wrong
> pattern instead of extending it.
> ...
>> I think you may be missing changes to 'discover_git_directory()'? Like I
>> mentioned above, though, if you don't think 'discover_git_directory()' needs
>> to set up 'the_repository', then those assignments should just be removed
>> (not replaced with 'setup_repository_from_format()').
>
> Ah sorry, yes they were meant to be removed. I somehow missed those as I
> was preparing the patch.

It looks like you two are in agreement at the end.  It does feel
that the change to make discover purely about discovering extends
the scope a bit too much, but it would be a good direction to go in
the longer term.

Thanks.
diff mbox series

Patch

diff --git a/repository.c b/repository.c
index 104960f8f5..50f0b26a6c 100644
--- a/repository.c
+++ b/repository.c
@@ -181,12 +181,7 @@  int repo_init(struct repository *repo,
 	if (read_and_verify_repository_format(&format, repo->commondir))
 		goto error;
 
-	repo_set_hash_algo(repo, format.hash_algo);
-	repo->repository_format_worktree_config = format.worktree_config;
-
-	/* take ownership of format.partial_clone */
-	repo->repository_format_partial_clone = format.partial_clone;
-	format.partial_clone = NULL;
+	setup_repository_from_format(repo, &format, 1);
 
 	if (worktree)
 		repo_set_worktree(repo, worktree);
diff --git a/setup.c b/setup.c
index d866395435..33ce58676f 100644
--- a/setup.c
+++ b/setup.c
@@ -1561,13 +1561,8 @@  const char *setup_git_directory_gently(int *nongit_ok)
 			setup_git_env(gitdir);
 		}
 		if (startup_info->have_repository) {
-			repo_set_hash_algo(the_repository, repo_fmt.hash_algo);
-			the_repository->repository_format_worktree_config =
-				repo_fmt.worktree_config;
-			/* take ownership of repo_fmt.partial_clone */
-			the_repository->repository_format_partial_clone =
-				repo_fmt.partial_clone;
-			repo_fmt.partial_clone = NULL;
+			setup_repository_from_format(the_repository,
+						     &repo_fmt, 1);
 		}
 	}
 	/*
@@ -1654,14 +1649,26 @@  void check_repository_format(struct repository_format *fmt)
 		fmt = &repo_fmt;
 	check_repository_format_gently(get_git_dir(), fmt, NULL);
 	startup_info->have_repository = 1;
-	repo_set_hash_algo(the_repository, fmt->hash_algo);
-	the_repository->repository_format_worktree_config =
-		fmt->worktree_config;
-	the_repository->repository_format_partial_clone =
-		xstrdup_or_null(fmt->partial_clone);
+	setup_repository_from_format(the_repository, fmt, 0);
 	clear_repository_format(&repo_fmt);
 }
 
+void setup_repository_from_format(struct repository *repo,
+				  struct repository_format *fmt,
+				  int modify_fmt_ok)
+{
+	repo_set_hash_algo(repo, fmt->hash_algo);
+	repo->repository_format_worktree_config = fmt->worktree_config;
+	if (modify_fmt_ok) {
+		repo->repository_format_partial_clone =
+			fmt->partial_clone;
+		fmt->partial_clone = NULL;
+	} else {
+		repo->repository_format_partial_clone =
+			xstrdup_or_null(fmt->partial_clone);
+	}
+}
+
 /*
  * Returns the "prefix", a path to the current working directory
  * relative to the work tree root, or NULL, if the current working
diff --git a/setup.h b/setup.h
index 4c1ca9d0c9..ed39aa38e0 100644
--- a/setup.h
+++ b/setup.h
@@ -140,6 +140,16 @@  int verify_repository_format(const struct repository_format *format,
  */
 void check_repository_format(struct repository_format *fmt);
 
+/*
+ * Setup a "struct repository" from the fields from the repository format.
+ * If "modify_fmt_ok" is nonzero, pointer members in "fmt" will be shallowly
+ * copied to repo and set to NULL (so that it's safe to clear "fmt").
+ */
+struct repository;
+void setup_repository_from_format(struct repository *repo,
+				  struct repository_format *fmt,
+				  int modify_fmt_ok);
+
 /*
  * NOTE NOTE NOTE!!
  *