diff mbox series

[01/11] path: harden validation of HEAD with non-standard hashes

Message ID aa4d6f508b4af3923813e19ff82a4e8484d5ff11.1713519789.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series Stop relying on SHA1 fallback for `the_hash_algo` | expand

Commit Message

Patrick Steinhardt April 19, 2024, 9:51 a.m. UTC
The `validate_headref()` function takes a path to a supposed "HEAD" file
and checks whether its format is something that we understand. It is
used as part of our repository discovery to check whether a specific
directory is a Git directory or not.

Part of the validation is a check for a detached HEAD that contains a
plain object ID. To do this validation we use `get_oid_hex()`, which
relies on `the_hash_algo`. At this point in time the hash algo cannot
yet be initialized though because we didn't yet read the Git config.
Consequently, it will always be the SHA1 hash algorithm.

In practice this works alright because `get_oid_hex()` only ends up
checking whether the prefix of the buffer is a valid object ID. And
because SHA1 is shorter than SHA256, the function will successfully
parse SHA256 object IDs, as well.

It is somewhat fragile though and not really the intent to only check
for SHA1. With this in mind, harden the code to use `get_oid_hex_any()`
to check whether the "HEAD" file parses as any known hash.

One might be hard pressed to tighten the check even further and fully
validate the file contents, not only the prefix. In practice though that
wouldn't make a lot of sense as it could be that the repository uses a
hash function that produces longer hashes than SHA256, but which the
current version of Git doesn't understand yet. We'd still want to detect
the repository as proper Git repository in that case, and we will fail
eventually with a proper error message that the hash isn't understood
when trying to set up the repostiory format.

It follows that we could just leave the current code intact, as in
practice the code change doesn't have any user visible impact. But it
also prepares us for `the_hash_algo` being unset when there is no
repositroy.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 path.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

brian m. carlson April 19, 2024, 7:03 p.m. UTC | #1
On 2024-04-19 at 09:51:10, Patrick Steinhardt wrote:
> It follows that we could just leave the current code intact, as in
> practice the code change doesn't have any user visible impact. But it
> also prepares us for `the_hash_algo` being unset when there is no
> repositroy.

The patch looks fine and is well explained, but you have a typo
("repositroy").
Patrick Steinhardt April 22, 2024, 4:56 a.m. UTC | #2
On Fri, Apr 19, 2024 at 07:03:33PM +0000, brian m. carlson wrote:
> On 2024-04-19 at 09:51:10, Patrick Steinhardt wrote:
> > It follows that we could just leave the current code intact, as in
> > practice the code change doesn't have any user visible impact. But it
> > also prepares us for `the_hash_algo` being unset when there is no
> > repositroy.
> 
> The patch looks fine and is well explained, but you have a typo
> ("repositroy").

Thanks, fixed! I'll refrain from sending out another version just to
address this typo though.

Patrick
Junio C Hamano April 22, 2024, 4:15 p.m. UTC | #3
Patrick Steinhardt <ps@pks.im> writes:

> The `validate_headref()` function takes a path to a supposed "HEAD" file
> and checks whether its format is something that we understand. It is
> used as part of our repository discovery to check whether a specific
> directory is a Git directory or not.
>
> Part of the validation is a check for a detached HEAD that contains a
> plain object ID. To do this validation we use `get_oid_hex()`, which
> relies on `the_hash_algo`. At this point in time the hash algo cannot
> yet be initialized though because we didn't yet read the Git config.
> Consequently, it will always be the SHA1 hash algorithm.
>
> In practice this works alright because `get_oid_hex()` only ends up
> checking whether the prefix of the buffer is a valid object ID. And
> because SHA1 is shorter than SHA256, the function will successfully
> parse SHA256 object IDs, as well.
>
> It is somewhat fragile though and not really the intent to only check
> for SHA1. With this in mind, harden the code to use `get_oid_hex_any()`
> to check whether the "HEAD" file parses as any known hash.

All makes sense, and given the above, I strongly suspect that we
would want to make the validate_headref() function a file-scope
static in setup.c to make sure it is only called/callable from the
repository discovery codepath.  Especially that if somebody calls
this function again after we find out that the repository uses
SHA-256, we would want to let the caller know that the detached HEAD
records SHA-1 and we are in an inconsistent state.

> It follows that we could just leave the current code intact, as in
> practice the code change doesn't have any user visible impact. But it
> also prepares us for `the_hash_algo` being unset when there is no
> repositroy.

Or perhaps we use get_oid_hex_any() != GIT_HASH_UNKNOWN when
the_hash_algo has not been determined, and use !get_oid_hex() after
we have determined which algorithm we are using?  It may be what you
did in a later step in the series, so let me read on.

Thanks.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  path.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/path.c b/path.c
> index 67229edb9c..cc02165530 100644
> --- a/path.c
> +++ b/path.c
> @@ -693,7 +693,7 @@ int validate_headref(const char *path)
>  	/*
>  	 * Is this a detached HEAD?
>  	 */
> -	if (!get_oid_hex(buffer, &oid))
> +	if (get_oid_hex_any(buffer, &oid) != GIT_HASH_UNKNOWN)
>  		return 0;
>  
>  	return -1;
Patrick Steinhardt April 23, 2024, 4:50 a.m. UTC | #4
On Mon, Apr 22, 2024 at 09:15:41AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > The `validate_headref()` function takes a path to a supposed "HEAD" file
> > and checks whether its format is something that we understand. It is
> > used as part of our repository discovery to check whether a specific
> > directory is a Git directory or not.
> >
> > Part of the validation is a check for a detached HEAD that contains a
> > plain object ID. To do this validation we use `get_oid_hex()`, which
> > relies on `the_hash_algo`. At this point in time the hash algo cannot
> > yet be initialized though because we didn't yet read the Git config.
> > Consequently, it will always be the SHA1 hash algorithm.
> >
> > In practice this works alright because `get_oid_hex()` only ends up
> > checking whether the prefix of the buffer is a valid object ID. And
> > because SHA1 is shorter than SHA256, the function will successfully
> > parse SHA256 object IDs, as well.
> >
> > It is somewhat fragile though and not really the intent to only check
> > for SHA1. With this in mind, harden the code to use `get_oid_hex_any()`
> > to check whether the "HEAD" file parses as any known hash.
> 
> All makes sense, and given the above, I strongly suspect that we
> would want to make the validate_headref() function a file-scope
> static in setup.c to make sure it is only called/callable from the
> repository discovery codepath.  Especially that if somebody calls
> this function again after we find out that the repository uses
> SHA-256, we would want to let the caller know that the detached HEAD
> records SHA-1 and we are in an inconsistent state.

Fair enough, we can definitely do so. It only has a single callsite
anyway.

> > It follows that we could just leave the current code intact, as in
> > practice the code change doesn't have any user visible impact. But it
> > also prepares us for `the_hash_algo` being unset when there is no
> > repositroy.
> 
> Or perhaps we use get_oid_hex_any() != GIT_HASH_UNKNOWN when
> the_hash_algo has not been determined, and use !get_oid_hex() after
> we have determined which algorithm we are using?  It may be what you
> did in a later step in the series, so let me read on.

I didn't, and given that it's only used from `is_git_directory()` it
would probably not make much sense, either. I'll rather add another
patch on top that moves the function around to clarify that it is only
expected to be called from "setup.c".

Patrick

> Thanks.
> 
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  path.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/path.c b/path.c
> > index 67229edb9c..cc02165530 100644
> > --- a/path.c
> > +++ b/path.c
> > @@ -693,7 +693,7 @@ int validate_headref(const char *path)
> >  	/*
> >  	 * Is this a detached HEAD?
> >  	 */
> > -	if (!get_oid_hex(buffer, &oid))
> > +	if (get_oid_hex_any(buffer, &oid) != GIT_HASH_UNKNOWN)
> >  		return 0;
> >  
> >  	return -1;
Junio C Hamano April 23, 2024, 4:54 p.m. UTC | #5
Patrick Steinhardt <ps@pks.im> writes:

>> All makes sense, and given the above, I strongly suspect that we
>> would want to make the validate_headref() function a file-scope
>> static in setup.c to make sure it is only called/callable from the
>> repository discovery codepath.  Especially that if somebody calls
>> this function again after we find out that the repository uses
>> SHA-256, we would want to let the caller know that the detached HEAD
>> records SHA-1 and we are in an inconsistent state.
>
> Fair enough, we can definitely do so. It only has a single callsite
> anyway.

I was wondering if I was missing a future plans to call it from
other code paths that are triggered after the set-up was done.  If
that is not the case, we should do so.  It matters more for the
future callers than the current ones.  They _all_ have to be aware
of the deliberate looseness of the check here.

Thanks.
diff mbox series

Patch

diff --git a/path.c b/path.c
index 67229edb9c..cc02165530 100644
--- a/path.c
+++ b/path.c
@@ -693,7 +693,7 @@  int validate_headref(const char *path)
 	/*
 	 * Is this a detached HEAD?
 	 */
-	if (!get_oid_hex(buffer, &oid))
+	if (get_oid_hex_any(buffer, &oid) != GIT_HASH_UNKNOWN)
 		return 0;
 
 	return -1;