diff mbox series

[1/7] submodule: lazily add submodule ODBs as alternates

Message ID 5994a517e8afc345e8f649b2368756e22b0e9ebe.1628618950.git.jonathantanmy@google.com (mailing list archive)
State Superseded
Headers show
Series In grep, no adding submodule ODB as alternates | expand

Commit Message

Jonathan Tan Aug. 10, 2021, 6:28 p.m. UTC
Teach Git to add submodule ODBs as alternates to the object store of
the_repository only upon the first access of an object not in
the_repository, and not when add_submodule_odb() is called.

This provides a means of gradually migrating from accessing a
submodule's object through alternates to accessing a submodule's object
by explicitly passing its repository object. Any Git command can declare
that it might access submodule objects by calling add_submodule_odb()
(as they do now), but the submodule ODBs themselves will not be added
until needed, so individual commands and/or combinations of arguments
can be migrated one by one.

[The advantage of explicit repository-object passing is code clarity (it
is clear which repository an object read is from), performance (there is
no need to linearly search through all submodule ODBs whenever an object
is accessed from any repository, whether superproject or submodule), and
the possibility of future features like partial clone submodules (which
right now is not possible because if an object is missing, we do not
know which repository to lazy-fetch into).]

This commit also introduces an environment variable that a test may set
to make the actual registration of alternates fatal, in order to
demonstrate that its codepaths do not need this registration.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 object-file.c |  5 +++++
 submodule.c   | 20 +++++++++++++++++++-
 submodule.h   |  7 +++++++
 t/README      | 10 ++++++++++
 4 files changed, 41 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Aug. 10, 2021, 9:13 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> @@ -1592,6 +1593,10 @@ static int do_oid_object_info_extended(struct repository *r,
>  				break;
>  		}
>  
> +		if (register_all_submodule_odb_as_alternates())
> +			/* We added some alternates; retry */
> +			continue;
> +

OK.  Unless we are running in the "we no longer are relying on the
submodule-odb-as-alternates hack" mode, the control may reach this
point number of times, so this caller expects the function to return
how many new odbs were added with this single invocation.

> diff --git a/submodule.c b/submodule.c
> index 8e611fe1db..8fde90e906 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -165,6 +165,8 @@ void stage_updated_gitmodules(struct index_state *istate)
>  		die(_("staging updated .gitmodules failed"));
>  }
>  
> +static struct string_list added_submodule_odb_paths = STRING_LIST_INIT_NODUP;
> +

I see 2/7 allows callers to add paths for submodule odb to this
list.

> @@ -178,12 +180,28 @@ int add_submodule_odb(const char *path)
>  		ret = -1;
>  		goto done;
>  	}
> -	add_to_alternates_memory(objects_directory.buf);
> +	string_list_insert(&added_submodule_odb_paths,
> +			   strbuf_detach(&objects_directory, NULL));

OK.  By default, any codepath that still call add_submodule_odb()
will add the paths to submodules to the list (so that they will
lazily be loaded).

> +int register_all_submodule_odb_as_alternates(void)
> +{
> +	int i;
> +	int ret = added_submodule_odb_paths.nr;
> +
> +	for (i = 0; i < added_submodule_odb_paths.nr; i++)
> +		add_to_alternates_memory(added_submodule_odb_paths.items[i].string);
> +	if (ret) {
> +		string_list_clear(&added_submodule_odb_paths, 0);
> +		if (git_env_bool("GIT_TEST_FATAL_REGISTER_SUBMODULE_ODB", 0))
> +			BUG("register_all_submodule_odb_as_alternates() called");
> +	}
> +	return ret;
> +}

OK.  We add new ones that were added to the list since we were
called for the last time and clear the list.  When we didn't add
anything, we will return 0.  Seems to mean well.

I wonder if we need to prepare ourselves to catch callers that call
add_submodule_odb() to the same path twice or more.  Probably not.

Thanks.

The "pretend as if objects in submodule odb are locally available",
even though it may have been useful, was an ugly hack invented
before we started adding the "access more than one repository at the
time with the 'repo' arg" extended API.  It is pleasing to see this
step shows an approach to incrementally migrate the users of the
hack.
Emily Shaffer Aug. 11, 2021, 9:33 p.m. UTC | #2
On Tue, Aug 10, 2021 at 11:28:39AM -0700, Jonathan Tan wrote:
> diff --git a/object-file.c b/object-file.c
> index 3d27dc8dea..621b121bcb 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -32,6 +32,7 @@
>  #include "packfile.h"
>  #include "object-store.h"
>  #include "promisor-remote.h"
> +#include "submodule.h"
>  
>  /* The maximum size for an object header. */
>  #define MAX_HEADER_LEN 32
> @@ -1592,6 +1593,10 @@ static int do_oid_object_info_extended(struct repository *r,
>  				break;
>  		}
>  
> +		if (register_all_submodule_odb_as_alternates())
> +			/* We added some alternates; retry */
> +			continue;
> +

Ok, this is where we finally get around to loading the alternate much
later. Fine.

>  		/* Check if it is a missing object */
>  		if (fetch_if_missing && repo_has_promisor_remote(r) &&
>  		    !already_retried &&
> diff --git a/submodule.c b/submodule.c
> index 8e611fe1db..8fde90e906 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -165,6 +165,8 @@ void stage_updated_gitmodules(struct index_state *istate)
>  		die(_("staging updated .gitmodules failed"));
>  }
>  
> +static struct string_list added_submodule_odb_paths = STRING_LIST_INIT_NODUP;
> +
>  /* TODO: remove this function, use repo_submodule_init instead. */
>  int add_submodule_odb(const char *path)
>  {
> @@ -178,12 +180,28 @@ int add_submodule_odb(const char *path)
>  		ret = -1;
>  		goto done;
>  	}
> -	add_to_alternates_memory(objects_directory.buf);
> +	string_list_insert(&added_submodule_odb_paths,
> +			   strbuf_detach(&objects_directory, NULL));

And here is where we hijack the usual alternate load and put the path
into the lazy load list instead. Ok.

>  done:
>  	strbuf_release(&objects_directory);
>  	return ret;
>  }
>  
> +int register_all_submodule_odb_as_alternates(void)
> +{
> +	int i;
> +	int ret = added_submodule_odb_paths.nr;
> +
> +	for (i = 0; i < added_submodule_odb_paths.nr; i++)
> +		add_to_alternates_memory(added_submodule_odb_paths.items[i].string);
> +	if (ret) {
> +		string_list_clear(&added_submodule_odb_paths, 0);
> +		if (git_env_bool("GIT_TEST_FATAL_REGISTER_SUBMODULE_ODB", 0))
> +			BUG("register_all_submodule_odb_as_alternates() called");

Nice, and this is the flag you mentioned in the cover letter to complain
if we're trying to use alternates to access submodule objects. This will
be useful for finding out other random places where we are using that
weird hack.

> +	}
> +	return ret;
> +}
> +
>  void set_diffopt_flags_from_submodule_config(struct diff_options *diffopt,
>  					     const char *path)
>  {
> diff --git a/submodule.h b/submodule.h
> index 84640c49c1..c252784bc2 100644
> --- a/submodule.h
> +++ b/submodule.h
> @@ -97,7 +97,14 @@ int submodule_uses_gitfile(const char *path);
>  #define SUBMODULE_REMOVAL_IGNORE_IGNORED_UNTRACKED (1<<2)
>  int bad_to_remove_submodule(const char *path, unsigned flags);
>  
> +/*
> + * Call add_submodule_odb() to add the submodule at the given path to a list.
> + * When register_all_submodule_odb_as_alternates() is called, the object stores
> + * of all submodules in that list will be added as alternates in
> + * the_repository.
> + */
>  int add_submodule_odb(const char *path);
> +int register_all_submodule_odb_as_alternates(void);

Does it need to be public? Could this be a static in submodule.c
instead?

>  
>  /*
>   * Checks if there are submodule changes in a..b. If a is the null OID,
> diff --git a/t/README b/t/README
> index 9e70122302..8b67b4f00b 100644
> --- a/t/README
> +++ b/t/README
> @@ -448,6 +448,16 @@ GIT_TEST_CHECKOUT_WORKERS=<n> overrides the 'checkout.workers' setting
>  to <n> and 'checkout.thresholdForParallelism' to 0, forcing the
>  execution of the parallel-checkout code.
>  
> +GIT_TEST_FATAL_REGISTER_SUBMODULE_ODB=<boolean>, when true, makes
> +registering submodule ODBs as alternates a fatal action. Support for
> +this environment variable can be removed once the migration to
> +explicitly providing repositories when accessing submodule objects is
> +complete (in which case we might want to replace this with a trace2
> +call so that users can make it visible if accessing submodule objects
> +without an explicit repository still happens) or needs to be abandoned
> +for whatever reason (in which case the migrated codepaths still retain
> +their performance benefits).
> +
>  Naming Tests
>  ------------
>  
> -- 
> 2.33.0.rc1.237.g0d66db33f3-goog
> 

Reviewed-by: Emily Shaffer <emilyshaffer@google.com>
Jonathan Tan Aug. 13, 2021, 4:23 p.m. UTC | #3
> > +int register_all_submodule_odb_as_alternates(void);
> 
> Does it need to be public? Could this be a static in submodule.c
> instead?

Thanks for taking a look at this patch series.

To answer this question: no - in this patch, I need to use it from
object-file.c to actually register the submodule ODBs as alternates when
we try to read an object that is not in the superproject.
Jonathan Tan Aug. 13, 2021, 4:53 p.m. UTC | #4
> > +static struct string_list added_submodule_odb_paths = STRING_LIST_INIT_NODUP;
> > +
> 
> I see 2/7 allows callers to add paths for submodule odb to this
> list.

Yes, and as you said below, add_submodule_odb() has been modified to add
to this list too.

> The "pretend as if objects in submodule odb are locally available",
> even though it may have been useful, was an ugly hack invented
> before we started adding the "access more than one repository at the
> time with the 'repo' arg" extended API.  It is pleasing to see this
> step shows an approach to incrementally migrate the users of the
> hack.

Thanks.
diff mbox series

Patch

diff --git a/object-file.c b/object-file.c
index 3d27dc8dea..621b121bcb 100644
--- a/object-file.c
+++ b/object-file.c
@@ -32,6 +32,7 @@ 
 #include "packfile.h"
 #include "object-store.h"
 #include "promisor-remote.h"
+#include "submodule.h"
 
 /* The maximum size for an object header. */
 #define MAX_HEADER_LEN 32
@@ -1592,6 +1593,10 @@  static int do_oid_object_info_extended(struct repository *r,
 				break;
 		}
 
+		if (register_all_submodule_odb_as_alternates())
+			/* We added some alternates; retry */
+			continue;
+
 		/* Check if it is a missing object */
 		if (fetch_if_missing && repo_has_promisor_remote(r) &&
 		    !already_retried &&
diff --git a/submodule.c b/submodule.c
index 8e611fe1db..8fde90e906 100644
--- a/submodule.c
+++ b/submodule.c
@@ -165,6 +165,8 @@  void stage_updated_gitmodules(struct index_state *istate)
 		die(_("staging updated .gitmodules failed"));
 }
 
+static struct string_list added_submodule_odb_paths = STRING_LIST_INIT_NODUP;
+
 /* TODO: remove this function, use repo_submodule_init instead. */
 int add_submodule_odb(const char *path)
 {
@@ -178,12 +180,28 @@  int add_submodule_odb(const char *path)
 		ret = -1;
 		goto done;
 	}
-	add_to_alternates_memory(objects_directory.buf);
+	string_list_insert(&added_submodule_odb_paths,
+			   strbuf_detach(&objects_directory, NULL));
 done:
 	strbuf_release(&objects_directory);
 	return ret;
 }
 
+int register_all_submodule_odb_as_alternates(void)
+{
+	int i;
+	int ret = added_submodule_odb_paths.nr;
+
+	for (i = 0; i < added_submodule_odb_paths.nr; i++)
+		add_to_alternates_memory(added_submodule_odb_paths.items[i].string);
+	if (ret) {
+		string_list_clear(&added_submodule_odb_paths, 0);
+		if (git_env_bool("GIT_TEST_FATAL_REGISTER_SUBMODULE_ODB", 0))
+			BUG("register_all_submodule_odb_as_alternates() called");
+	}
+	return ret;
+}
+
 void set_diffopt_flags_from_submodule_config(struct diff_options *diffopt,
 					     const char *path)
 {
diff --git a/submodule.h b/submodule.h
index 84640c49c1..c252784bc2 100644
--- a/submodule.h
+++ b/submodule.h
@@ -97,7 +97,14 @@  int submodule_uses_gitfile(const char *path);
 #define SUBMODULE_REMOVAL_IGNORE_IGNORED_UNTRACKED (1<<2)
 int bad_to_remove_submodule(const char *path, unsigned flags);
 
+/*
+ * Call add_submodule_odb() to add the submodule at the given path to a list.
+ * When register_all_submodule_odb_as_alternates() is called, the object stores
+ * of all submodules in that list will be added as alternates in
+ * the_repository.
+ */
 int add_submodule_odb(const char *path);
+int register_all_submodule_odb_as_alternates(void);
 
 /*
  * Checks if there are submodule changes in a..b. If a is the null OID,
diff --git a/t/README b/t/README
index 9e70122302..8b67b4f00b 100644
--- a/t/README
+++ b/t/README
@@ -448,6 +448,16 @@  GIT_TEST_CHECKOUT_WORKERS=<n> overrides the 'checkout.workers' setting
 to <n> and 'checkout.thresholdForParallelism' to 0, forcing the
 execution of the parallel-checkout code.
 
+GIT_TEST_FATAL_REGISTER_SUBMODULE_ODB=<boolean>, when true, makes
+registering submodule ODBs as alternates a fatal action. Support for
+this environment variable can be removed once the migration to
+explicitly providing repositories when accessing submodule objects is
+complete (in which case we might want to replace this with a trace2
+call so that users can make it visible if accessing submodule objects
+without an explicit repository still happens) or needs to be abandoned
+for whatever reason (in which case the migrated codepaths still retain
+their performance benefits).
+
 Naming Tests
 ------------