diff mbox series

[v4,1/5] setup: add an escape hatch for "no more default hash algorithm" change

Message ID 20240514011437.3779151-2-gitster@pobox.com (mailing list archive)
State Superseded
Headers show
Series Fix use of uninitialized hash algorithms | expand

Commit Message

Junio C Hamano May 14, 2024, 1:14 a.m. UTC
Partially revert c8aed5e8 (repository: stop setting SHA1 as the
default object hash, 2024-05-07), to keep end-user systems still
broken when we have gap in our test coverage but yet give them an
escape hatch to set the GIT_DEFAULT_HASH environment variable to
"sha1" in order to revert to the previous behaviour.  This variable
has been in use for using SHA-256 hash by default, and it should be
a better fit than inventing a new and test-only knob.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 environment.h |  1 +
 repository.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 setup.c       |  2 --
 3 files changed, 41 insertions(+), 2 deletions(-)

Comments

Patrick Steinhardt May 14, 2024, 4:32 a.m. UTC | #1
On Mon, May 13, 2024 at 06:14:33PM -0700, Junio C Hamano wrote:
[snip]
> diff --git a/repository.c b/repository.c
> index 15c10015b0..f912ee9a7c 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -1,5 +1,6 @@
>  #include "git-compat-util.h"
>  #include "abspath.h"
> +#include "environment.h"
>  #include "repository.h"
>  #include "object-store-ll.h"
>  #include "config.h"
> @@ -19,6 +20,27 @@
>  static struct repository the_repo;
>  struct repository *the_repository = &the_repo;
>  
> +static void set_default_hash_algo(struct repository *repo)
> +{
> +	const char *hash_name;
> +	int algo;
> +
> +	hash_name = getenv(GIT_DEFAULT_HASH_ENVIRONMENT);
> +	if (!hash_name)
> +		return;
> +	algo = hash_algo_by_name(hash_name);
> +
> +	/*
> +	 * NEEDSWORK: after all, falling back to SHA-1 by assigning
> +	 * GIT_HASH_SHA1 to algo here, instead of returning, may give
> +	 * us better behaviour.
> +	 */
> +	if (algo == GIT_HASH_UNKNOWN)
> +		return;
> +
> +	repo_set_hash_algo(repo, algo);
> +}

The problem with reusing "GIT_DEFAULT_HASH" is that we unconditionally
set it in our test suite in "test-lib.sh". This will have the effect
that we will never hit segfaults in our tests because we always end up
setting up the default hash, whereas our users now will.

I would propose to revert this back to the first iteration you had,
where the workaround only enables the SHA1 fallback. No users have yet
complained about the inability to pick the hash algo outside of a repo,
indicating that it's not widely used. And when they complain, there is
more motivation to fix this properly by adding a `--object-hash=` switch
to the respective commands so that a user can pick the desired object
hash.

Patrick
Junio C Hamano May 14, 2024, 3:05 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> The problem with reusing "GIT_DEFAULT_HASH" is that we unconditionally
> set it in our test suite in "test-lib.sh". This will have the effect
> that we will never hit segfaults in our tests because we always end up
> setting up the default hash, whereas our users now will.

I tried the version you posted with only SHA-1 fallback and it
failed with SHA256 CI jobs, and eventually came up with using the
existing environment variable.

The variable is the perfect fit in the longer term for our purpose.
It is what the end-users will set, not for papering over remaining
bugs from the "no longer there is a fallback default" change, but
for telling Git that they want to use sha256 repositories.

With GIT_DEFAULT_HASH set to sha256, we will be testing a
configuration that is very close to those end-users.

There probably are four cases we need to check:

 - In a repository, invocations of commands that should honor the
   hash function that is in use by the repository (i.e. this is true
   for most commands and their invocations).

 - In a repository, invocations of commands that should ignore the
   repository settings and always use SHA-1 (i.e. proposed fix to
   patch-id).

 - Outside a repository, invocations with GIT_DEFAULT_HASH set
   should probably parallel the above two, as if GIT_DEFAULT_HASH
   came from the repository.

 - Outside a repository, invocations without GIT_DEFAULT_HASH set
   should all default to SHA-1???

The last combination cannot be tested UNLESS we unset GIT_DEFAULT_HASH
that is given by test-lib.sh, but that can easily be arranged, now
we have one primary/central place that tests out-of-repository
invocations of various commands in t1517.

And using the existing environment variable has an added benefit
that we do not have to add an extra option to random commands.
Junio C Hamano May 14, 2024, 5:19 p.m. UTC | #3
Junio C Hamano <gitster@pobox.com> writes:

> Partially revert c8aed5e8 (repository: stop setting SHA1 as the
> default object hash, 2024-05-07), to keep end-user systems still
> broken when we have gap in our test coverage but yet give them an
> escape hatch to set the GIT_DEFAULT_HASH environment variable to
> "sha1" in order to revert to the previous behaviour.  This variable
> has been in use for using SHA-256 hash by default, and it should be
> a better fit than inventing a new and test-only knob.

Having done all of this, I actually am very tempted to add the
"always default to SHA-1" back as a fallback position to the
set_default_hash_algo() function.  We know we are going to get the
right hash algorithm when working in the repository, so the only
case the default matters in practice is when working outside the
repository.

We already have such a custom code for "git diff --no-index", and we
are adding a few more back in here, but they can disappear if we had
code to set the fallback default when GIT_DEFAULT_HASH does not
exist here.  The "always use SHA-1 regardless of the hash used by
the repository" code like "patch-id" should not depend on such a
fallback default but should have its own code to explicitly set it.

As the user can tweak what algorithm they want if the wanted to, it
does not sound too bad to have a fallback default when the user did
not choose.
Patrick Steinhardt May 15, 2024, 12:23 p.m. UTC | #4
On Tue, May 14, 2024 at 10:19:01AM -0700, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Partially revert c8aed5e8 (repository: stop setting SHA1 as the
> > default object hash, 2024-05-07), to keep end-user systems still
> > broken when we have gap in our test coverage but yet give them an
> > escape hatch to set the GIT_DEFAULT_HASH environment variable to
> > "sha1" in order to revert to the previous behaviour.  This variable
> > has been in use for using SHA-256 hash by default, and it should be
> > a better fit than inventing a new and test-only knob.
> 
> Having done all of this, I actually am very tempted to add the
> "always default to SHA-1" back as a fallback position to the
> set_default_hash_algo() function.  We know we are going to get the
> right hash algorithm when working in the repository, so the only
> case the default matters in practice is when working outside the
> repository.
> 
> We already have such a custom code for "git diff --no-index", and we
> are adding a few more back in here, but they can disappear if we had
> code to set the fallback default when GIT_DEFAULT_HASH does not
> exist here.  The "always use SHA-1 regardless of the hash used by
> the repository" code like "patch-id" should not depend on such a
> fallback default but should have its own code to explicitly set it.
> 
> As the user can tweak what algorithm they want if the wanted to, it
> does not sound too bad to have a fallback default when the user did
> not choose.

I think that this is going into the wrong direction. The patches that we
have added surface real issues. If we now unconditionally add back the
fallback to SHA-1, then we only punt those issues further down the road
to the point in time where we drop `the_hash_algo` and/or
`the_repository`. All the issues that were surfaced until now are
technical debt, and the proposed fixes have been documented with a
"TODO" item that they need more work.

In some cases it's as simple as adding in an option to the respective
command that lets the user pick the hash algorithm, and I do not think
that GIT_DEFAULT_HASH is a proper replacement for that. In other cases
like for git-patch-id(1) the issue runs deeper and we need to refactor
the whole subsystem to stop relying on `the_hash_algo` altogether.

Patrick
Junio C Hamano May 16, 2024, 3:31 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> Having done all of this, I actually am very tempted to add the
> "always default to SHA-1" back as a fallback position to the
> set_default_hash_algo() function.  We know we are going to get the
> right hash algorithm when working in the repository, so the only
> case the default matters in practice is when working outside the
> repository.

Not really.  It does not add anything to help either real world or
our tests.  The current test setting is already bad enough in that,
unlike in the real world settings, even tests with the SHA-1
algorithm has GIT_DEFAULT_HASH environment variable set, which means
that such a "if the environment variable is not set, further fall
back to SHA-1" does not do anything.

Unless we change t/test-lib.sh not to set GIT_DEFAULT_HASH tweaking
the fallback default in repository.c:set_default_hash_algo() based
on GIT_DEFAULT_HASH would not be a workable solution.

I wanted to arrange things so that the end-user exectuion by default
has an extra fallback (perhaps to SHA-1, or GIT_DEFAULT_HASH) to
avoid disrupting their real-world use, which we can disable in our
tests to expose code paths that still rely on the "default" set when
in-core repository struct gets initialized, but that is not possible
without changing the way t/test-lib.sh uses GIT_DEFAULT_HASH, it
seems.  So the arrangement unfortunately has to be "we have no
default, and bugs will break the real-world uses as well as tests
the same way.  The real-world users have to export an extra
'workaround' environment variable to force "default" to SHA-1 (or
GIT_DEFAULT_HASH) --- which may be "workable" but very far from being
intuitive.  They can set GIT_DEFAULT_HASH but to make it effective
everywhere, including the "default" given by set_default_hash_algo(),
they need to set this other "workaround" thing.

> We already have such a custom code for "git diff --no-index", and we
> are adding a few more back in here, but they can disappear if we had
> code to set the fallback default when GIT_DEFAULT_HASH does not
> exist here.

While I think a manual setting of the_hash_algo in "diff --no-index"
code path should not hardcode "SHA-1" but instead use the hash
specified by the GIT_DEFAULT_HASH environment to be consistent with
the use of "git" by the same parent process that had that variable
exported to the environment, that should not be done globally in
repository.c:set_default_hash_algo().
diff mbox series

Patch

diff --git a/environment.h b/environment.h
index 05fd94d7be..deaa29408f 100644
--- a/environment.h
+++ b/environment.h
@@ -56,6 +56,7 @@  const char *getenv_safe(struct strvec *argv, const char *name);
 #define GIT_OPTIONAL_LOCKS_ENVIRONMENT "GIT_OPTIONAL_LOCKS"
 #define GIT_TEXT_DOMAIN_DIR_ENVIRONMENT "GIT_TEXTDOMAINDIR"
 #define GIT_ATTR_SOURCE_ENVIRONMENT "GIT_ATTR_SOURCE"
+#define GIT_DEFAULT_HASH_ENVIRONMENT "GIT_DEFAULT_HASH"
 
 /*
  * Environment variable used in handshaking the wire protocol.
diff --git a/repository.c b/repository.c
index 15c10015b0..f912ee9a7c 100644
--- a/repository.c
+++ b/repository.c
@@ -1,5 +1,6 @@ 
 #include "git-compat-util.h"
 #include "abspath.h"
+#include "environment.h"
 #include "repository.h"
 #include "object-store-ll.h"
 #include "config.h"
@@ -19,6 +20,27 @@ 
 static struct repository the_repo;
 struct repository *the_repository = &the_repo;
 
+static void set_default_hash_algo(struct repository *repo)
+{
+	const char *hash_name;
+	int algo;
+
+	hash_name = getenv(GIT_DEFAULT_HASH_ENVIRONMENT);
+	if (!hash_name)
+		return;
+	algo = hash_algo_by_name(hash_name);
+
+	/*
+	 * NEEDSWORK: after all, falling back to SHA-1 by assigning
+	 * GIT_HASH_SHA1 to algo here, instead of returning, may give
+	 * us better behaviour.
+	 */
+	if (algo == GIT_HASH_UNKNOWN)
+		return;
+
+	repo_set_hash_algo(repo, algo);
+}
+
 void initialize_repository(struct repository *repo)
 {
 	repo->objects = raw_object_store_new();
@@ -26,6 +48,24 @@  void initialize_repository(struct repository *repo)
 	repo->parsed_objects = parsed_object_pool_new();
 	ALLOC_ARRAY(repo->index, 1);
 	index_state_init(repo->index, repo);
+
+	/*
+	 * When a command runs inside a repository, it learns what
+	 * hash algorithm is in use from the repository, but some
+	 * commands are designed to work outside a repository, yet
+	 * they want to access the_hash_algo, if only for the length
+	 * of the hashed value to see if their input looks like a
+	 * plausible hash value.
+	 *
+	 * We are in the process of identifying the codepaths and
+	 * giving them an appropriate default individually; any
+	 * unconverted codepath that tries to access the_hash_algo
+	 * will thus fail.  The end-users however have an escape hatch
+	 * to set GIT_DEFAULT_HASH environment variable to "sha1" get
+	 * back the old behaviour of defaulting to SHA-1.
+	 */
+	if (repo == the_repository)
+		set_default_hash_algo(repo);
 }
 
 static void expand_base_dir(char **out, const char *in,
diff --git a/setup.c b/setup.c
index 7c996659bd..d084703465 100644
--- a/setup.c
+++ b/setup.c
@@ -1840,8 +1840,6 @@  int daemonize(void)
 #define TEST_FILEMODE 1
 #endif
 
-#define GIT_DEFAULT_HASH_ENVIRONMENT "GIT_DEFAULT_HASH"
-
 static void copy_templates_1(struct strbuf *path, struct strbuf *template_path,
 			     DIR *dir)
 {