[0/7] clone: fix init of refdb with wrong object format

Message ID: cover.1701863960.git.ps@pks.im
Series: clone: fix init of refdb with wrong object format

Message

Patrick Steinhardt Dec. 6, 2023, 12:39 p.m. UTC
Hi,

when using git-clone(1), we initialize the complete repository before we
know about the object format used by the remote repository. This means
that we'll potentially create the refdb with the wrong object format in
case the local default object format and remote object format are not
the same.

This isn't much of a problem in the context of the files backend, which
never records the object format anyway. But it is a problem for the
reftable backend, which indeed records the object format in the on-disk
data structures. The result is thus a reftable with the wrong object format.
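
To show where that format ends up on disk, here is a sketch of the
reftable file header as I read the reftable spec. It is illustrative
only: the on-disk fields are big-endian with no padding, so the actual
reftable library does not literally use such a struct.

```
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the reftable file header (spec version 2). Version 1
 * headers are 24 bytes and imply SHA-1; version 2 appends a 4-byte
 * hash_id, which is exactly where the object format gets baked in.
 */
struct reftable_header_sketch {
	char magic[4];             /* "REFT" */
	uint8_t version;           /* 1 or 2 */
	uint8_t block_size[3];     /* uint24, big-endian */
	uint64_t min_update_index;
	uint64_t max_update_index;
	uint32_t hash_id;          /* v2 only: "sha1" or "s256" */
};

int main(void)
{
	/* The two hash IDs the spec defines, as big-endian 4-byte tags. */
	printf("sha1: 0x%08x\n", 0x73686131u); /* 's' 'h' 'a' '1' */
	printf("s256: 0x%08x\n", 0x73323536u); /* 's' '2' '5' '6' */
	return 0;
}
```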

This patch series aims to solve this issue by initializing the refdb at
a later point after we have learned about the remote object format. This
requires some careful reordering of code. Unfortunately, the end result
is not easily verifiable and thus I didn't add tests here. But it does
fix cloning of SHA256 repositories with the in-progress state of the
reftable backend.

While at it I noticed that this actually fixes a bug with bundle URIs
when the object formats diverge in this way.

The series is laid out as follows:

  - Patch 1 + 2: split out a function to create the refdb and make it
    possible to skip its initialization in `init_db()`.

  - Patch 3: allows git-remote-curl(1) to work with repos that get
    initialized during its lifetime.

  - Patch 4 - 6: address various corner cases where we access the refdb
    before we learned about the object format.

  - Patch 7: move initialization of the refdb to happen after we have
    learned about the object format.

This patch series actually addresses the last incompatibility for the
reftable backend that I have found. All tests except for the
files-backend-specific ones now pass with the current state I have at
[1], which is currently at e6f2f592b7 (t: skip tests which are
incompatible with reftable, 2023-11-24).

Thanks in advance for your reviews!

Patrick

Patrick Steinhardt (7):
  setup: extract function to create the refdb
  setup: allow skipping creation of the refdb
  remote-curl: rediscover repository when fetching refs
  builtin/clone: fix bundle URIs with mismatching object formats
  builtin/clone: set up sparse checkout later
  builtin/clone: skip reading HEAD when retrieving remote
  builtin/clone: create the refdb with the correct object format

 builtin/clone.c             |  65 ++++++++++++----------
 remote-curl.c               |   7 ++-
 remote.c                    |  26 +++++----
 remote.h                    |   1 +
 setup.c                     | 106 +++++++++++++++++++++---------------
 setup.h                     |   6 +-
 t/t5558-clone-bundle-uri.sh |  18 ++++++
 7 files changed, 140 insertions(+), 89 deletions(-)

Comments

Patrick Steinhardt Dec. 6, 2023, 1:09 p.m. UTC | #1
On Wed, Dec 06, 2023 at 01:39:44PM +0100, Patrick Steinhardt wrote:
> Hi,
> 
> when using git-clone(1), we initialize the complete repository before we
> know about the object format used by the remote repository. This means
> that we'll potentially create the refdb with the wrong object format in
> case the local default object format and remote object format are not
> the same.
> 
> This isn't much of a problem in the context of the files backend, which
> never records the object format anyway. But it is a problem for the
> reftable backend, which indeed records the object format in the on-disk
> data structures. The result is thus a reftable with the wrong object format.
> 
> This patch series aims to solve this issue by initializing the refdb at
> a later point after we have learned about the remote object format. This
> requires some careful reordering of code. Unfortunately, the end result
> is not easily verifiable and thus I didn't add tests here. But it does
> fix cloning of SHA256 repositories with the in-progress state of the
> reftable backend.
> 
> While at it I noticed that this actually fixes a bug with bundle URIs
> when the object formats diverge in this way.
> 
> The series is laid out as follows:
> 
>   - Patch 1 + 2: split out a function to create the refdb and make it
>     possible to skip its initialization in `init_db()`.
> 
>   - Patch 3: allows git-remote-curl(1) to work with repos that get
>     initialized during its lifetime.
> 
>   - Patch 4 - 6: address various corner cases where we access the refdb
>     before we learned about the object format.
> 
>   - Patch 7: move initialization of the refdb to happen after we have
>     learned about the object format.
> 
> This patch series actually addresses the last incompatibility for the
> reftable backend that I have found. All tests except for the
> files-backend-specific ones now pass with the current state I have at
> [1], which is currently at e6f2f592b7 (t: skip tests which are
> incompatible with reftable, 2023-11-24).

I forgot to add the link to the merge request that contains the current
reftable backend's implementation in case anybody is interested:
https://gitlab.com/gitlab-org/git/-/merge_requests/58.

Patrick

> Thanks in advance for your reviews!
> 
> Patrick
> 
> Patrick Steinhardt (7):
>   setup: extract function to create the refdb
>   setup: allow skipping creation of the refdb
>   remote-curl: rediscover repository when fetching refs
>   builtin/clone: fix bundle URIs with mismatching object formats
>   builtin/clone: set up sparse checkout later
>   builtin/clone: skip reading HEAD when retrieving remote
>   builtin/clone: create the refdb with the correct object format
> 
>  builtin/clone.c             |  65 ++++++++++++----------
>  remote-curl.c               |   7 ++-
>  remote.c                    |  26 +++++----
>  remote.h                    |   1 +
>  setup.c                     | 106 +++++++++++++++++++++---------------
>  setup.h                     |   6 +-
>  t/t5558-clone-bundle-uri.sh |  18 ++++++
>  7 files changed, 140 insertions(+), 89 deletions(-)
> 
> -- 
> 2.43.0
>
Junio C Hamano Dec. 10, 2023, 3:16 a.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> While at it I noticed that this actually fixes a bug with bundle URIs
> when the object formats diverge in this way.
> ...
> This patch series actually addresses the last incompatibility for the
> reftable backend that I have found. All tests except for the
> files-backend-specific ones now pass with the current state I have at
> [1], which is currently at e6f2f592b7 (t: skip tests which are
> incompatible with reftable, 2023-11-24).

An existing test

    $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh

passes with vanilla Git 2.43, but with these patches applied, it
fails the "7 - empty dumb HTTP" step.

Thanks.
Patrick Steinhardt Dec. 11, 2023, 11:34 a.m. UTC | #3
On Sat, Dec 09, 2023 at 07:16:30PM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > While at it I noticed that this actually fixes a bug with bundle URIs
> > when the object formats diverge in this way.
> > ...
> > This patch series actually addresses the last incompatibility for the
> > reftable backend that I have found. All tests except for the
> > files-backend-specific ones now pass with the current state I have at
> > [1], which is currently at e6f2f592b7 (t: skip tests which are
> > incompatible with reftable, 2023-11-24).
> 
> An existing test
> 
>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
> 
> passes with vanilla Git 2.43, but with these patches applied, it
> fails the "7 - empty dumb HTTP" step.

Indeed -- now that the GitLab CI definitions have landed in your master
branch, I can also see this failure as part of our CI job [1]. And with
the NixOS httpd refactorings I'm actually able to run these tests in the
first place. I'm really happy to see things come together like this, as
it means that from now on we'll detect such issues before sending a
series to the mailing list.

Anyway, regarding the test itself. I think the refactorings in this
patch series uncover a preexisting and already-known issue with empty
repositories when using the dumb HTTP protocol: there is no proper way
to know about the hash function that the remote repository uses [2].

Before my refactorings we used to fall back to the local default hash
format with which we've already initialized the repository, which is
wrong. Now we use the hash format we detected via the remote, which we
cannot detect because the remote is empty and does not advertise the
hash function, so we fall back to SHA1 and thus also do the wrong thing.
The only correct thing here would be to use the actual hash function
that the remote repository uses, but we have no to do so. So we're kind
of stuck here and can only choose between two different wrong ways to
pick the hash function.

We can restore the previous wrong behaviour by honoring GIT_DEFAULT_HASH
in git-remote-curl(1) in the case where we do not have a repository set
up yet. So something similar in spirit to the following:

```
diff --git a/remote-curl.c b/remote-curl.c
index fc29757b65..7e97c9c2e9 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -27,6 +27,7 @@
 static struct remote *remote;
 /* always ends with a trailing slash */
 static struct strbuf url = STRBUF_INIT;
+static int nongit;
 
 struct options {
 	int verbosity;
@@ -275,8 +276,30 @@ static const struct git_hash_algo *detect_hash_algo(struct discovery *heads)
 {
 	const char *p = memchr(heads->buf, '\t', heads->len);
 	int algo;
-	if (!p)
-		return the_hash_algo;
+
+	if (!p) {
+		const char *env;
+
+		if (!nongit)
+			return the_hash_algo;
+
+		/*
+		 * In case the remote neither advertises the hash format nor
+		 * any references we have no way to detect the correct hash
+		 * format. We can thus only guess what the remote is using,
+		 * where the best guess is to fall back to the default hash.
+		 */
+		env = getenv("GIT_DEFAULT_HASH");
+		if (env) {
+			algo = hash_algo_by_name(env);
+			if (algo == GIT_HASH_UNKNOWN)
+				die(_("unknown hash algorithm '%s'"), env);
+		} else {
+			algo = GIT_HASH_SHA1;
+		}
+
+		return &hash_algos[algo];
+	}
 
 	algo = hash_algo_by_length((p - heads->buf) / 2);
 	if (algo == GIT_HASH_UNKNOWN)
@@ -1521,7 +1544,6 @@ static int stateless_connect(const char *service_name)
 int cmd_main(int argc, const char **argv)
 {
 	struct strbuf buf = STRBUF_INIT;
-	int nongit;
 	int ret = 1;
 
 	setup_git_directory_gently(&nongit);
```
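
For context, the length-based detection that the hunk above keeps
relying on boils down to the following toy program. It is a standalone
re-implementation of the idea from ac093d0790, not the actual code in
remote-curl.c, which maps the hex length through hash_algo_by_length():

```
#include <stdio.h>
#include <string.h>

/* A 40-hex-char object name means SHA-1, 64 means SHA-256. */
static const char *guess_format(const char *info_refs_line)
{
	const char *tab = strchr(info_refs_line, '\t');

	if (!tab)
		return "unknown (no refs advertised)";
	switch ((int)(tab - info_refs_line)) {
	case 40: return "sha1";
	case 64: return "sha256";
	default: return "unknown";
	}
}

int main(void)
{
	const char *sha1_line =
		"9c2ba2bd0b0c4a4a38e1f2a08953ba49d3be6a55\trefs/heads/main";

	printf("%s\n", guess_format(sha1_line)); /* sha1 */
	printf("%s\n", guess_format(""));        /* the empty-repo case */
	return 0;
}
```

The empty-repository case is exactly the one where this scheme has
nothing to go on, which is why the hunk above has to fall back to
guessing.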

+Cc brian, as he's the author of [2].

Patrick

[1]: https://gitlab.com/gitlab-org/git/-/jobs/5723052108
[2]: ac093d0790 (remote-curl: detect algorithm for dumb HTTP by size, 2020-06-19)
Junio C Hamano Dec. 11, 2023, 2:57 p.m. UTC | #4
Patrick Steinhardt <ps@pks.im> writes:

>> An existing test
>> 
>>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
>> 
>> passes with vanilla Git 2.43, but with these patches applied, it
>> fails the "7 - empty dumb HTTP" step.
> ...
> Before my refactorings we used to fall back to the local default hash
> format with which we've already initialized the repository, which is
> wrong. Now we use the hash format we detected via the remote, which we
> cannot detect because the remote is empty and does not advertise the
> hash function, so we fall back to SHA1 and thus also do the wrong thing.

Yeah, that is why I did *not* say "the series *breaks* existing
tests".  It triggers a failure, and in this case, a test failure
does not mean the behaviour is broken because there is no correct
answer.  ... oh, wait.  There isn't?

I wonder if the right thing to do is to advertise the hash function
even from an absolutely empty repository.  There are no objects in
such a repository, but it should already know what hash function to
use when it creates its first object (determined at the repository
creation time), so that _could_ be advertised.  

> The only correct thing here would be to use the actual hash function
> that the remote repository uses, but we have no to do so.

We have "no way to do so"?  We have "not done so"?

It is possible for the client side to download the $GIT_DIR/config
file from the remote to learn what value extensions.objectFormat is
in use over there instead, I think, but at the same time, I highly
suspect that dumb HTTP outlived its usefulness to warrant such an
additional investment of engineering resource.

The simplest "fix" might be to leave what happens in this narrow
case (i.e. cloning over dumb HTTP from an empty repository)
undefined by not testing (or not insisting on one particular
outcome), but ...

> +Cc brian, as he's the author of [2].

... of course I trust Brian more than I trust myself in this area ;-)

> Patrick
>
> [1]: https://gitlab.com/gitlab-org/git/-/jobs/5723052108
> [2]: ac093d0790 (remote-curl: detect algorithm for dumb HTTP by size, 2020-06-19)

Thanks.
Patrick Steinhardt Dec. 11, 2023, 3:32 p.m. UTC | #5
On Mon, Dec 11, 2023 at 06:57:25AM -0800, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> >> An existing test
> >> 
> >>     $ make && cd t && GIT_TEST_DEFAULT_HASH=sha256 sh t5550-http-fetch-dumb.sh
> >> 
> >> passes with vanilla Git 2.43, but with these patches applied, it
> >> fails the "7 - empty dumb HTTP" step.
> > ...
> > Before my refactorings we used to fall back to the local default hash
> > format with which we've already initialized the repository, which is
> > wrong. Now we use the hash format we detected via the remote, which we
> > cannot detect because the remote is empty and does not advertise the
> > hash function, so we fall back to SHA1 and thus also do the wrong thing.
> 
> Yeah, that is why I did *not* say "the series *breaks* existing
> tests".  It triggers a failure, and in this case, a test failure
> does not mean the behaviour is broken because there is no correct
> answer.  ... oh, wait.  There isn't?
> 
> I wonder if the right thing to do is to advertise the hash function
> even from an absolutely empty repository.  There are no objects in
> such a repository, but it should already know what hash function to
> use when it creates its first object (determined at the repository
> creation time), so that _could_ be advertised.  

For the smart HTTP and SSH protocols definitely, and we already do. But
it's a different story for dumb HTTP, unfortunately, where there is no
CGI-like thing sitting between the client and the repository's data.

> > The only correct thing here would be to use the actual hash function
> > that the remote repository uses, but we have no to do so.
> 
> We have "no way to do so"?  We have "not done so"?

We have not done so until now, and we have no easy way to change this on
the server-side as the server is not controlled by us in the first
place. That leaves two options I can think of:

  - Try harder on the client-side, e.g. by trying to download the
    gitconfig as you propose further down. I wonder whether admins would
    typically block access to the config, but I doubt they do.

  - Change the format of `info/refs` to include the hash format, as this
    file _is_ controlled by us on the server-side. Doesn't help though
    in an empty repository, where the file is likely to never have been
    generated in the first place.

So it seems like downloading the gitconfig is the only viable option
that I can think of right now.
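
For illustration, "download the config and scan it" would boil down to
something like the following standalone sketch. It uses plain libcurl
and a naive string scan just to demonstrate the idea; remote-curl would
of course go through our own HTTP and config machinery, and the
"/config" path assumes the server exposes $GIT_DIR/config at all:

```
#include <curl/curl.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct buf { char *data; size_t len; };

static size_t collect(char *ptr, size_t size, size_t nmemb, void *userdata)
{
	struct buf *b = userdata;
	size_t n = size * nmemb;
	char *grown = realloc(b->data, b->len + n + 1);

	if (!grown)
		return 0; /* tells libcurl to abort the transfer */
	b->data = grown;
	memcpy(b->data + b->len, ptr, n);
	b->len += n;
	b->data[b->len] = '\0';
	return n;
}

int main(int argc, char **argv)
{
	struct buf b = { NULL, 0 };
	char url[4096];
	CURL *curl;
	char *p;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <remote-url-without-trailing-slash>\n",
			argv[0]);
		return 2;
	}
	snprintf(url, sizeof(url), "%s/config", argv[1]);

	curl_global_init(CURL_GLOBAL_DEFAULT);
	curl = curl_easy_init();
	if (!curl)
		return 1;
	curl_easy_setopt(curl, CURLOPT_URL, url);
	curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
	curl_easy_setopt(curl, CURLOPT_WRITEDATA, &b);
	if (curl_easy_perform(curl) != CURLE_OK || !b.data) {
		fprintf(stderr, "could not fetch %s\n", url);
		return 1;
	}

	/* Naive scan; a real client would use Git's config parser. */
	for (p = b.data; *p; p++)
		*p = tolower((unsigned char)*p);
	puts(strstr(b.data, "objectformat") && strstr(b.data, "sha256") ?
	     "sha256" : "sha1");

	curl_easy_cleanup(curl);
	curl_global_cleanup();
	free(b.data);
	return 0;
}
```

This compiles with `cc probe.c -lcurl` and prints "sha256" only if the
fetched config mentions it.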

It does increase the potential attack surface though because we would
start to unconditionally parse a config file from an untrusted source,
and we did hit issues in our config parser in the past already. You
could argue that we already parse untrusted configs via `.gitmodules`,
but these require opt-in to actually be used by anything if I'm not
mistaken.

So... I dunno.

> It is possible for the client side to download the $GIT_DIR/config
> file from the remote to learn what value extensions.objectFormat is
> in use over there instead, I think, but at the same time, I highly
> suspect that dumb HTTP outlived its usefulness to warrant such an
> additional investment of engineering resource.

Fair enough. All of this feels like an edge case (admin that uses dumb
HTTP) in an edge case (the cloned repository uses SHA256) in an edge
case (the remote repository is empty). Sure, SHA256 is likely to gain in
popularity eventually. But at the same time I'd expect that dumb HTTP
will become increasingly rare.

Taken together, chances for this to happen should be fairly low.

> The simplest "fix" might be to leave what happens in this narrow
> case (i.e. cloning over dumb HTTP from an empty repository)
> undefined by not testing (or not insisting on one particular
> outcome), but ...

I would be fine with that outcome, as well. It's not like the current
behaviour is correct in all cases either. The only benefit of that
behaviour is that a user can in fact work around the broken cases by
setting `GIT_DEFAULT_HASH` to the correct hash, and that benefit would
be retained by the diff I sent that made remote-curl.c aware of this
environment variable.

One additional solution would be to print a user-visible warning a la
"warning: failed to detect hash function of empty remote repository" and
then call it a day, potentially pointing out that a user can correct it
by re-cloning with `GIT_DEFAULT_HASH`. But the warning may not be
actionable by the user, because they may not know what hash function the
remote uses, either.

Patrick
brian m. carlson Dec. 11, 2023, 10:17 p.m. UTC | #6
On 2023-12-11 at 15:32:52, Patrick Steinhardt wrote:
> On Mon, Dec 11, 2023 at 06:57:25AM -0800, Junio C Hamano wrote:
> > We have "no way to do so"?  We have "not done so"?
> 
> We have not done so until now, and we have no easy way to change this on
> the server-side as the server is not controlled by us in the first
> place. That leaves two options I can think of:
> 
>   - Try harder on the client-side, e.g. by trying to download the
>     gitconfig as you propose further down. I wonder whether admins would
>     typically block access to the config, but I doubt they do.
> 
>   - Change the format of `info/refs` to include the hash format, as this
>     file _is_ controlled by us on the server-side. Doesn't help though
>     in an empty repository, where the file is likely to never have been
>     generated in the first place.
> 
> So it seems like downloading the gitconfig is the only viable option
> that I can think of right now.

I mean, we can add an `info/capabilities` file with capabilities and
assume the repository is SHA-1 without it.  I'm fine with that approach
as well, and it can be implemented as part of `git update-server-info`
pretty easily.
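
A minimal sketch of what that could look like as a fragment for
server-info.c; the one-key file format and the function name are made
up for illustration, and it leans on the usual git-internal helpers:

```
static int update_info_capabilities(void)
{
	/*
	 * Even an absolutely empty repository knows its own object
	 * format, so update-server-info can always write this out.
	 * Relies on git_path(), the_hash_algo and error_errno().
	 */
	FILE *fp = fopen(git_path("info/capabilities"), "w");

	if (!fp)
		return error_errno("unable to update info/capabilities");
	fprintf(fp, "object-format=%s\n", the_hash_algo->name);
	if (fclose(fp))
		return error_errno("unable to write info/capabilities");
	return 0;
}
```

A dumb-HTTP client could then fetch this file next to info/refs and
simply assume SHA-1 whenever it is absent.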

But yes, absent that approach or parsing the config file, we'll have to
just use the default settings.

> > The simplest "fix" might be to leave what happens in this narrow
> > case (i.e. cloning over dumb HTTP from an empty repository)
> > undefined by not testing (or not insisting on one particular
> > outcome), but ...
> 
> I would be fine with that outcome, as well. It's not like the current
> behaviour is correct in all cases either. The only benefit of that
> behaviour is that a user can in fact work around the broken cases by
> setting `GIT_DEFAULT_HASH` to the correct hash, and that benefit would
> be retained by the diff I sent that made remote-curl.c aware of this
> environment variable.

That would also be fine with me.