diff mbox series

[08/15] scalar: implement the `clone` subcommand

Message ID 2cbf0b611133df5fa7eed1bf38460f9d119d2a6e.1630359290.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Upstreaming the Scalar command | expand

Commit Message

Johannes Schindelin Aug. 30, 2021, 9:34 p.m. UTC
From: Johannes Schindelin <johannes.schindelin@gmx.de>

This implements Scalar's opinionated `clone` command: it tries to use a
partial clone and sets up a sparse checkout by default. In contrast to
`git clone`, `scalar clone` sets up the worktree in the `src/`
subdirectory, to encourage a separation between the source files and the
build output (which helps Git tremendously because it avoids untracked
files that have to be specifically ignored when refreshing the index).

Also, it registers the repository for regular, scheduled maintenance,
and configures a flurry of configuration settings based on the
experience and experiments of the Microsoft Windows and the Microsoft
Office development teams.

Note: We intentionally use a slightly wasteful `set_config()` function
(which does not reuse a single `strbuf`, for example, though performance
_really_ does not matter here) for convenience and readability.

Also note: since the `scalar clone` command is by far the most commonly
called `scalar` subcommand, we document it at the top of the manual
page.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 contrib/scalar/scalar.c          | 200 +++++++++++++++++++++++++++++++
 contrib/scalar/scalar.txt        |  35 +++++-
 contrib/scalar/t/t9099-scalar.sh |  32 +++++
 3 files changed, 262 insertions(+), 5 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Aug. 31, 2021, 8:23 a.m. UTC | #1
On Mon, Aug 30 2021, Johannes Schindelin via GitGitGadget wrote:

> This implements Scalar's opinionated `clone` command: it tries to use a
> partial clone and sets up a sparse checkout by default. In contrast to
> `git clone`, `scalar clone` sets up the worktree in the `src/`
> subdirectory, to encourage a separation between the source files and the
> build output (which helps Git tremendously because it avoids untracked
> files that have to be specifically ignored when refreshing the index).

Perhaps nobody else wondered this while reading this, but I thought this
might be some sparse/worktree magic where cloning into "foo" would have
"foo/.git", but the worktree was somehow magically mapped at foo/src/".

But no, it just takes your "scalar clone <url> foo" and translates it to
"foo/src", so you'll get a directory at "foo".

> Note: We intentionally use a slightly wasteful `set_config()` function
> (which does not reuse a single `strbuf`, for example, though performance
> _really_ does not matter here) for convenience and readability.

FWIW I think the commit message could do without this, that part of the
code is obviously not performance sensitive at all. But maybe an
explicit note helps anyway...
Eric Sunshine Aug. 31, 2021, 4:47 p.m. UTC | #2
On Tue, Aug 31, 2021 at 8:04 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Mon, Aug 30 2021, Johannes Schindelin via GitGitGadget wrote:
> > Note: We intentionally use a slightly wasteful `set_config()` function
> > (which does not reuse a single `strbuf`, for example, though performance
> > _really_ does not matter here) for convenience and readability.
>
> FWIW I think the commit message could do without this, that part of the
> code is obviously not performance sensitive at all. But maybe an
> explicit note helps anyway...

FWIW, I also found this distracting; it takes the reader's attention
away from more important aspects of the patch. (But it alone is not
worth a re-roll; it was just a minor hiccup.)
Junio C Hamano Sept. 1, 2021, 4:45 p.m. UTC | #3
"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +static char *remote_default_branch(const char *url)
> +{
> +	struct child_process cp = CHILD_PROCESS_INIT;
> +	struct strbuf out = STRBUF_INIT;
> +
> +	cp.git_cmd = 1;
> +	strvec_pushl(&cp.args, "ls-remote", "--symref", url, "HEAD", NULL);
> +	strbuf_addstr(&out, "-\n");

Is this a workaround for the problem that the first "ref:" line
won't be found by looking for "\nref: " in the loop?  Cute, but the
extra "-" is a bit misleading.

> +	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
> +		char *ref = out.buf;
> +
> +		while ((ref = strstr(ref + 1, "\nref: "))) {
> +			const char *p;
> +			char *head, *branch;
> +
> +			ref += strlen("\nref: ");
> +			head = strstr(ref, "\tHEAD");
> +
> +			if (!head || memchr(ref, '\n', head - ref))
> +				continue;

OK.  We expect "ref: " <refname> "\t" <head> "\n" where <head> is
"HEAD" for their .git/HEAD and refs/remotes/<nick>/HEAD for their
remote-tracking branch for the remote they call <nick>, on a single
line.  We reject a line that is not of that shape, and we reject a
line that is about remote-tracking branch by only looking for
"\tHEAD". Makes sense.

The strstr() goes from "ref + 1", which feels sloppy.  When we
reject the line we found that begins with "ref :", I would have
expected that the next scan would start at the beginning of the next
line, not from the middle of this line at the first letter 'e' in
'refs/heads/' on the current line "ref: refs/heads/.....".  As long
as the current line is long enough, strstr() would not miss the
beginning of the next line, so it might be OK.

> +			if (skip_prefix(ref, "refs/heads/", &p)) {
> +				branch = xstrndup(p, head - p);
> +				strbuf_release(&out);
> +				return branch;
> +			}
> +
> +			error(_("remote HEAD is not a branch: '%.*s'"),
> +			      (int)(head - ref), ref);
> +			strbuf_release(&out);
> +			return NULL;

OK.  Any symref whose basename is HEAD in their remote-tracking
hierarchy would have been rejected earlier in the loop.

Is there a particular reason why we return early here, instead of
breaking out of hte loop and let the generic "failed to get" code
path below to handle this case?

> +		}
> +	}
> +	warning(_("failed to get default branch name from remote; "
> +		  "using local default"));
> +	strbuf_reset(&out);
> +
> +	child_process_init(&cp);
> +	cp.git_cmd = 1;
> +	strvec_pushl(&cp.args, "symbolic-ref", "--short", "HEAD", NULL);
> +	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
> +		strbuf_trim(&out);
> +		return strbuf_detach(&out, NULL);
> +	}
> +
> +	strbuf_release(&out);
> +	error(_("failed to get default branch name"));
> +	return NULL;
> +}

> +static int cmd_clone(int argc, const char **argv)
> +{
> +	const char *branch = NULL;
> +	int full_clone = 0;
> +	struct option clone_options[] = {
> +		OPT_STRING('b', "branch", &branch, N_("<branch>"),
> +			   N_("branch to checkout after clone")),
> +		OPT_BOOL(0, "full-clone", &full_clone,
> +			 N_("when cloning, create full working directory")),
> +		OPT_END(),
> +	};
> +	const char * const clone_usage[] = {
> +		N_("scalar clone [<options>] [--] <repo> [<dir>]"),
> +		NULL
> +	};
> +	const char *url;
> +	char *enlistment = NULL, *dir = NULL;
> +	struct strbuf buf = STRBUF_INIT;
> +	int res;
> +
> +	argc = parse_options(argc, argv, NULL, clone_options, clone_usage, 0);
> +
> +	if (argc == 2) {
> +		url = argv[0];
> +		enlistment = xstrdup(argv[1]);
> +	} else if (argc == 1) {
> +		url = argv[0];
> +
> +		strbuf_addstr(&buf, url);
> +		/* Strip trailing slashes, if any */
> +		while (buf.len > 0 && is_dir_sep(buf.buf[buf.len - 1]))
> +			strbuf_setlen(&buf, buf.len - 1);
> +		/* Strip suffix `.git`, if any */
> +		strbuf_strip_suffix(&buf, ".git");
> +
> +		enlistment = find_last_dir_sep(buf.buf);
> +		if (!enlistment) {
> +			die(_("cannot deduce worktree name from '%s'"), url);
> +		}
> +		enlistment = xstrdup(enlistment + 1);
> +	} else {
> +		usage_msg_opt(_("You must specify a repository to clone."),
> +			      clone_usage, clone_options);
> +	}
> +
> +	if (is_directory(enlistment))
> +		die(_("directory '%s' exists already"), enlistment);
> +
> +	dir = xstrfmt("%s/src", enlistment);
> +
> +	strbuf_reset(&buf);
> +	if (branch)
> +		strbuf_addf(&buf, "init.defaultBranch=%s", branch);
> +	else {
> +		char *b = repo_default_branch_name(the_repository, 1);
> +		strbuf_addf(&buf, "init.defaultBranch=%s", b);
> +		free(b);

Doesn't "git clone" already use their HEAD without having to make an
extra "git ls-remote" roundtrip?

Ahh, you do not do "git clone"; you do "git init", set things up,
and then "git fetch" and checkout, all manually.

Which is kind of shame.

I wonder if it is a cleaner implementation to give a new option to
"git clone" that gives a command sequence (not necessarily have to
be implemented as a shell script) that specifies necessary
pre-configuration steps to be done before the command starts the
transfer step.
Derrick Stolee Sept. 3, 2021, 12:30 p.m. UTC | #4
On 9/1/2021 12:45 PM, Junio C Hamano wrote:
> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
> 
...
>> +	dir = xstrfmt("%s/src", enlistment);
>> +
>> +	strbuf_reset(&buf);
>> +	if (branch)
>> +		strbuf_addf(&buf, "init.defaultBranch=%s", branch);
>> +	else {
>> +		char *b = repo_default_branch_name(the_repository, 1);
>> +		strbuf_addf(&buf, "init.defaultBranch=%s", b);
>> +		free(b);
> 
> Doesn't "git clone" already use their HEAD without having to make an
> extra "git ls-remote" roundtrip?
> 
> Ahh, you do not do "git clone"; you do "git init", set things up,
> and then "git fetch" and checkout, all manually.
> 
> Which is kind of shame.
> 
> I wonder if it is a cleaner implementation to give a new option to
> "git clone" that gives a command sequence (not necessarily have to
> be implemented as a shell script) that specifies necessary
> pre-configuration steps to be done before the command starts the
> transfer step.

I agree that 'git clone' plus maybe some more improvements like
'--sparse=cone' to set up cone-mode sparse-checkout would be good.
And also the implementation being contributed here is cleaner if
we can use 'git clone'.

We are trying to balance a clean upstream implementation with some
custom things that we still need in our microsoft/git fork to
handle the integration with the GVFS Protocol (i.e. partial clone
on Azure Repos). That customization is cleaner to keep here in the
scalar code instead of adding an option to 'git clone'. It is
difficult to justify code patterns here due to choices we have made
in our fork, so I _could_ see a way to replace those custom bits
with new, custom flags to 'git clone'. It just requires additional
investment during our integration when we incorporate these upstream
changes. Naturally, I'm motivated to avoid that extra work.

If your opinion to switch to 'git clone' is a strong one, then I
could see us doing that change. I just want you to be aware of the
hidden reasons for choices like these.

Thanks,
-Stolee
Johannes Schindelin Sept. 3, 2021, 3:20 p.m. UTC | #5
Hi Junio,

On Wed, 1 Sep 2021, Junio C Hamano wrote:

> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > +static char *remote_default_branch(const char *url)
> > +{
> > +	struct child_process cp = CHILD_PROCESS_INIT;
> > +	struct strbuf out = STRBUF_INIT;
> > +
> > +	cp.git_cmd = 1;
> > +	strvec_pushl(&cp.args, "ls-remote", "--symref", url, "HEAD", NULL);
> > +	strbuf_addstr(&out, "-\n");
>
> Is this a workaround for the problem that the first "ref:" line
> won't be found by looking for "\nref: " in the loop?  Cute, but the
> extra "-" is a bit misleading.

The `-` is actually needed because of the `ref + 1` below, over which you
stumbled.

>
> > +	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
> > +		char *ref = out.buf;
> > +
> > +		while ((ref = strstr(ref + 1, "\nref: "))) {
> > +			const char *p;
> > +			char *head, *branch;
> > +
> > +			ref += strlen("\nref: ");
> > +			head = strstr(ref, "\tHEAD");
> > +
> > +			if (!head || memchr(ref, '\n', head - ref))
> > +				continue;
>
> OK.  We expect "ref: " <refname> "\t" <head> "\n" where <head> is
> "HEAD" for their .git/HEAD and refs/remotes/<nick>/HEAD for their
> remote-tracking branch for the remote they call <nick>, on a single
> line.  We reject a line that is not of that shape, and we reject a
> line that is about remote-tracking branch by only looking for
> "\tHEAD". Makes sense.
>
> The strstr() goes from "ref + 1", which feels sloppy.

I would use a different adjective, one that is less judgemental in nature,
but then, you were talking about your feelings.

> When we reject the line we found that begins with "ref :", I would have
> expected that the next scan would start at the beginning of the next
> line, not from the middle of this line at the first letter 'e' in
> 'refs/heads/' on the current line "ref: refs/heads/.....".  As long as
> the current line is long enough, strstr() would not miss the beginning
> of the next line, so it might be OK.

It would even work if the current line is shorter, but as you point out:
it is wasteful. And it could be improved to be more readable. I reworked
it, and it now looks like this:

	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
		const char *line = out.buf;

		while (*line) {
			const char *eol = strchrnul(line, '\n'), *p;
			size_t len = eol - line;
			char *branch;

			if (!skip_prefix(line, "ref: ", &p) ||
			    !strip_suffix_mem(line, &len, "\tHEAD")) {
				line = eol + (*eol == '\n');
				continue;
			}

			eol = line + len;
			if (skip_prefix(p, "refs/heads/", &p)) {
				branch = xstrndup(p, eol - p);
				strbuf_release(&out);
				return branch;
			}

			error(_("remote HEAD is not a branch: '%.*s'"),
			      (int)(eol - p), p);
			strbuf_release(&out);
			return NULL;
		}
	}

It now parses the output line by line, looking for the expected prefix and
suffix, then verifies the ref name format, and either returns the short
branch name or errors out with the message that this is not a branch.

>
> > +			if (skip_prefix(ref, "refs/heads/", &p)) {
> > +				branch = xstrndup(p, head - p);
> > +				strbuf_release(&out);
> > +				return branch;
> > +			}
> > +
> > +			error(_("remote HEAD is not a branch: '%.*s'"),
> > +			      (int)(head - ref), ref);
> > +			strbuf_release(&out);
> > +			return NULL;
>
> OK.  Any symref whose basename is HEAD in their remote-tracking
> hierarchy would have been rejected earlier in the loop.
>
> Is there a particular reason why we return early here, instead of
> breaking out of hte loop and let the generic "failed to get" code
> path below to handle this case?

Yes, the reason is that I wanted to err on the side of caution. If the
remote repository reports a default branch that is not a default branch at
all, I do not want to pretend that things are fine and then run into
trouble later when we set up a non-branch as remote-tracking target or
something like that.

>
> > +		}
> > +	}
> > +	warning(_("failed to get default branch name from remote; "
> > +		  "using local default"));
> > +	strbuf_reset(&out);
> > +
> > +	child_process_init(&cp);
> > +	cp.git_cmd = 1;
> > +	strvec_pushl(&cp.args, "symbolic-ref", "--short", "HEAD", NULL);
> > +	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
> > +		strbuf_trim(&out);
> > +		return strbuf_detach(&out, NULL);
> > +	}
> > +
> > +	strbuf_release(&out);
> > +	error(_("failed to get default branch name"));
> > +	return NULL;
> > +}
>
> > +static int cmd_clone(int argc, const char **argv)
> > +{
> > +	const char *branch = NULL;
> > +	int full_clone = 0;
> > +	struct option clone_options[] = {
> > +		OPT_STRING('b', "branch", &branch, N_("<branch>"),
> > +			   N_("branch to checkout after clone")),
> > +		OPT_BOOL(0, "full-clone", &full_clone,
> > +			 N_("when cloning, create full working directory")),
> > +		OPT_END(),
> > +	};
> > +	const char * const clone_usage[] = {
> > +		N_("scalar clone [<options>] [--] <repo> [<dir>]"),
> > +		NULL
> > +	};
> > +	const char *url;
> > +	char *enlistment = NULL, *dir = NULL;
> > +	struct strbuf buf = STRBUF_INIT;
> > +	int res;
> > +
> > +	argc = parse_options(argc, argv, NULL, clone_options, clone_usage, 0);
> > +
> > +	if (argc == 2) {
> > +		url = argv[0];
> > +		enlistment = xstrdup(argv[1]);
> > +	} else if (argc == 1) {
> > +		url = argv[0];
> > +
> > +		strbuf_addstr(&buf, url);
> > +		/* Strip trailing slashes, if any */
> > +		while (buf.len > 0 && is_dir_sep(buf.buf[buf.len - 1]))
> > +			strbuf_setlen(&buf, buf.len - 1);
> > +		/* Strip suffix `.git`, if any */
> > +		strbuf_strip_suffix(&buf, ".git");
> > +
> > +		enlistment = find_last_dir_sep(buf.buf);
> > +		if (!enlistment) {
> > +			die(_("cannot deduce worktree name from '%s'"), url);
> > +		}
> > +		enlistment = xstrdup(enlistment + 1);
> > +	} else {
> > +		usage_msg_opt(_("You must specify a repository to clone."),
> > +			      clone_usage, clone_options);
> > +	}
> > +
> > +	if (is_directory(enlistment))
> > +		die(_("directory '%s' exists already"), enlistment);
> > +
> > +	dir = xstrfmt("%s/src", enlistment);
> > +
> > +	strbuf_reset(&buf);
> > +	if (branch)
> > +		strbuf_addf(&buf, "init.defaultBranch=%s", branch);
> > +	else {
> > +		char *b = repo_default_branch_name(the_repository, 1);
> > +		strbuf_addf(&buf, "init.defaultBranch=%s", b);
> > +		free(b);
>
> Doesn't "git clone" already use their HEAD without having to make an
> extra "git ls-remote" roundtrip?
>
> Ahh, you do not do "git clone"; you do "git init", set things up,
> and then "git fetch" and checkout, all manually.
>
> Which is kind of shame.
>
> I wonder if it is a cleaner implementation to give a new option to
> "git clone" that gives a command sequence (not necessarily have to
> be implemented as a shell script) that specifies necessary
> pre-configuration steps to be done before the command starts the
> transfer step.

Right. It is a shame, I agree. And it is one of the things I want to work
on, after the Scalar patch series made it into Git.

The reason why I don't want to work on this now is that I expect this
effort to result in new options for `git clone`, new options that need to
be designed well, and where I fully expect a long discussion until we
reach a consensus how these options should look like, especially since we
will need to maintain backwards-compatibility of Scalar's functionality.

Therefore, in the interest to keep the patch series relatively easy to
review, I left this in the "for later" pile, for now.

Ciao,
Dscho
Johannes Schindelin Sept. 3, 2021, 3:21 p.m. UTC | #6
Hi Eric,

On Tue, 31 Aug 2021, Eric Sunshine wrote:

> On Tue, Aug 31, 2021 at 8:04 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> > On Mon, Aug 30 2021, Johannes Schindelin via GitGitGadget wrote:
> > > Note: We intentionally use a slightly wasteful `set_config()` function
> > > (which does not reuse a single `strbuf`, for example, though performance
> > > _really_ does not matter here) for convenience and readability.
> >
> > FWIW I think the commit message could do without this, that part of the
> > code is obviously not performance sensitive at all. But maybe an
> > explicit note helps anyway...
>
> FWIW, I also found this distracting; it takes the reader's attention
> away from more important aspects of the patch. (But it alone is not
> worth a re-roll; it was just a minor hiccup.)

Since I reworked the remote default branch parsing anyway, I removed this
paragraph from the commit message.

Ciao,
Dscho
Junio C Hamano Sept. 3, 2021, 5:18 p.m. UTC | #7
Derrick Stolee <stolee@gmail.com> writes:

>> Ahh, you do not do "git clone"; you do "git init", set things up,
>> and then "git fetch" and checkout, all manually.
>> 
>> Which is kind of shame.
>> 
>> I wonder if it is a cleaner implementation to give a new option to
>> "git clone" that gives a command sequence (not necessarily have to
>> be implemented as a shell script) that specifies necessary
>> pre-configuration steps to be done before the command starts the
>> transfer step.
>
> I agree that 'git clone' plus maybe some more improvements like
> '--sparse=cone' to set up cone-mode sparse-checkout would be good.
> And also the implementation being contributed here is cleaner if
> we can use 'git clone'.
>
> We are trying to balance a clean upstream implementation with some
> custom things that we still need in our microsoft/git fork to
> handle the integration with the GVFS Protocol (i.e. partial clone
> on Azure Repos). That customization is cleaner to keep here in the
> scalar code instead of adding an option to 'git clone'.

Oh, there is no disagreement on that point, at least in the short
term.  I was wondering why "clone" subcommand needs a duplicated
logic that should be unnecessary, before realizing that this was
not implemented as a wrapper to (possibly updated) "clone", and
I agree that starting with a looser coupling like this step does
is easier to everybody.

"Kind of shame" is just that I wished we had already prepared "git
clone" side to accept customization more easily before its various
distinct phases (new repository creation, where a custom logic may
want to affect the name and location of it and how "git init" is
driven, initial "fetch", where a custom logic may want to affect the
fetch refspec and its parameters like depths and cones, and initial
"checkout") do their things.  If we allowed such plug-in of logic to
affect how "git clone" worked already, it would have been possible
to do "scalar clone" with much less code.  It also would allow us to
reorganize the "clone --local" hack in a way that is easier to
reason about (I think even in today's code, the way I hooked it up
can be seen which is quite messy).  It may even help folks who want
to extend "git clone" to clone a repository recursively its
submodules with project-specific customizations (like which ones to
clone by default, etc.).

I suspect that learning from the way "scalar clone" is done on top
of "init" + "fetch" + "checkout" in this initial series may help us
extend "git clone" later to fill such needs.

> If your opinion to switch to 'git clone' is a strong one, then I
> could see us doing that change. I just want you to be aware of the
> hidden reasons for choices like these.

Not at all at this moment.

It is mostly that the way "init" + "fetch" + "checkout" was done in
this step reminded me of a much longer-term wish I have had for a
while.
Junio C Hamano Sept. 3, 2021, 5:29 p.m. UTC | #8
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> It would even work if the current line is shorter, but as you point out:
> it is wasteful. And it could be improved to be more readable. I reworked
> it, and it now looks like this:
>
> 	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
> 		const char *line = out.buf;
>
> 		while (*line) {
> 			const char *eol = strchrnul(line, '\n'), *p;
> 			size_t len = eol - line;
> 			char *branch;
>
> 			if (!skip_prefix(line, "ref: ", &p) ||
> 			    !strip_suffix_mem(line, &len, "\tHEAD")) {
> 				line = eol + (*eol == '\n');
> 				continue;
> 			}
>
> 			eol = line + len;
> 			if (skip_prefix(p, "refs/heads/", &p)) {
> 				branch = xstrndup(p, eol - p);
> 				strbuf_release(&out);
> 				return branch;
> 			}
>
> 			error(_("remote HEAD is not a branch: '%.*s'"),
> 			      (int)(eol - p), p);
> 			strbuf_release(&out);
> 			return NULL;
> 		}
> 	}
>
> It now parses the output line by line, looking for the expected prefix and
> suffix, then verifies the ref name format, and either returns the short
> branch name or errors out with the message that this is not a branch.

It is much easier to read and understand how the loop works with
above.

>> > +			error(_("remote HEAD is not a branch: '%.*s'"),
>> > +			      (int)(head - ref), ref);
>> > +			strbuf_release(&out);
>> > +			return NULL;
>>
>> OK.  Any symref whose basename is HEAD in their remote-tracking
>> hierarchy would have been rejected earlier in the loop.
>>
>> Is there a particular reason why we return early here, instead of
>> breaking out of hte loop and let the generic "failed to get" code
>> path below to handle this case?
>
> Yes, the reason is that I wanted to err on the side of caution. If the
> remote repository reports a default branch that is not a default branch at
> all, I do not want to pretend that things are fine and then run into
> trouble later when we set up a non-branch as remote-tracking target or
> something like that.

Wouldn't we have the same problem when the remote end does not
advertise HEAD and we fall back to "local default", though?  We'd
run into trouble later as we use "local default" that may correspond
to a non-branch there as remote-tracking target, or something like
that.

Not that I care too deeply in the error case, though.  I just felt
that this early return was an uneven way to follow the principle to
err on the side of caution, as we continue with the local default
when the other side fails to tell us what their HEAD points at.

Thanks.
Johannes Schindelin Sept. 8, 2021, 6:59 p.m. UTC | #9
Hi Junio,

On Fri, 3 Sep 2021, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > It would even work if the current line is shorter, but as you point out:
> > it is wasteful. And it could be improved to be more readable. I reworked
> > it, and it now looks like this:
> >
> > 	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
> > 		const char *line = out.buf;
> >
> > 		while (*line) {
> > 			const char *eol = strchrnul(line, '\n'), *p;
> > 			size_t len = eol - line;
> > 			char *branch;
> >
> > 			if (!skip_prefix(line, "ref: ", &p) ||
> > 			    !strip_suffix_mem(line, &len, "\tHEAD")) {
> > 				line = eol + (*eol == '\n');
> > 				continue;
> > 			}
> >
> > 			eol = line + len;
> > 			if (skip_prefix(p, "refs/heads/", &p)) {
> > 				branch = xstrndup(p, eol - p);
> > 				strbuf_release(&out);
> > 				return branch;
> > 			}
> >
> > 			error(_("remote HEAD is not a branch: '%.*s'"),
> > 			      (int)(eol - p), p);
> > 			strbuf_release(&out);
> > 			return NULL;
> > 		}
> > 	}
> >
> > It now parses the output line by line, looking for the expected prefix and
> > suffix, then verifies the ref name format, and either returns the short
> > branch name or errors out with the message that this is not a branch.
>
> It is much easier to read and understand how the loop works with
> above.

Excellent.

> >> > +			error(_("remote HEAD is not a branch: '%.*s'"),
> >> > +			      (int)(head - ref), ref);
> >> > +			strbuf_release(&out);
> >> > +			return NULL;
> >>
> >> OK.  Any symref whose basename is HEAD in their remote-tracking
> >> hierarchy would have been rejected earlier in the loop.
> >>
> >> Is there a particular reason why we return early here, instead of
> >> breaking out of hte loop and let the generic "failed to get" code
> >> path below to handle this case?
> >
> > Yes, the reason is that I wanted to err on the side of caution. If the
> > remote repository reports a default branch that is not a default branch at
> > all, I do not want to pretend that things are fine and then run into
> > trouble later when we set up a non-branch as remote-tracking target or
> > something like that.
>
> Wouldn't we have the same problem when the remote end does not
> advertise HEAD and we fall back to "local default", though?  We'd
> run into trouble later as we use "local default" that may correspond
> to a non-branch there as remote-tracking target, or something like
> that.

All true, there will be trouble at some stage.

The difference between the two cases, in my mind, is that cloning an empty
repository would run into the latter code path and might still have a
chance to work as intended by using the local default branch name.

In any case, as I indicated earlier, I am _very_ interested in moving as
much functionality as possible from Scalar to Git proper. In this
instance, I hope to move most of the code from `scalar.c` to `clone.c`
(guarded by one or more new options). And as soon as that happens, the
discussion we're having right now will be moot ;-)

Which means that I want to weigh how much effort to put into polishing an
unlikely code path on one side, and on the other side how much effort to
put into moving the functionality away from `contrib/` and deleting that
unlikely code path.

In the same vein, while this patch series contains (mostly) code in
`contrib/` (and therefore technically does not need to adhere strictly to
Git's code style), it is probably wise to pay closer attention to the code
style particularly in those parts that are prone to be moved verbatim (or
close to verbatim) to Git proper.

Thanks,
Dscho

> Not that I care too deeply in the error case, though.  I just felt
> that this early return was an uneven way to follow the principle to
> err on the side of caution, as we continue with the local default
> when the other side fails to tell us what their HEAD points at.
>
> Thanks.
>
Ævar Arnfjörð Bjarmason Sept. 9, 2021, 10:29 a.m. UTC | #10
On Wed, Sep 08 2021, Johannes Schindelin wrote:

> [...]
> Which means that I want to weigh how much effort to put into polishing an
> unlikely code path on one side, and on the other side how much effort to
> put into moving the functionality away from `contrib/` and deleting that
> unlikely code path.
>
> In the same vein, while this patch series contains (mostly) code in
> `contrib/` (and therefore technically does not need to adhere strictly to
> Git's code style), it is probably wise to pay closer attention to the code
> style particularly in those parts that are prone to be moved verbatim (or
> close to verbatim) to Git proper.

I don't think we have any such exception to our usual style & preferred
code patterns in contrib/* or compat/* in general. In the latter case we
have e.g. compat/regex/ and other externally-imported codebases, which
we've tried to stylistically modify as little as possible to make
subsequent imports easier, ditto sha1dc/ etc.

I think the general (but unwritten) rule has been to draw the
distinction on whether or not code is still externally maintained and
expected to be imported, or if it's expected to be maintained in git.git
going forward.

I think that this proposed series falls thoroughly in the latter
category, but maybe I've misunderstood it..

Also re my [1] I had some (still relevant, but unaddressed) points on v1
about how placing this in contrib/* made certain aspects of integrating
it into our build system harder. I was imagining that distinction as
purely an internal implementation detail to git.git (make install
etc. would behave the same), but per the above it seems to come with
deeper connotations than that at least in your mind.

1. https://lore.kernel.org/git/87r1dydp4m.fsf@evledraar.gmail.com
Elijah Newren Sept. 28, 2021, 5:19 a.m. UTC | #11
On Mon, Aug 30, 2021 at 2:36 PM Johannes Schindelin via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
...
>  COMMANDS
>  --------
>
> +Clone
> +~~~~~
> +
> +clone [<options>] <url> [<enlistment>]::
> +    Clones the specified repository, similar to linkgit:git-clone[1]. By
> +    default, only commit and tree objects are cloned. Once finished, the
> +    worktree is located at `<enlistment>/src`.
> ++
> +The sparse-checkout feature is enabled (except when run with `--full-clone`)
> +and the only files present are those in the top-level directory. Use
> +`git sparse-checkout set` to expand the set of directories you want to see,
> +or `git sparse-checkout disable` to expand to all files (see
> +linkgit:git-sparse-checkout[1] for more details). You can explore the
> +subdirectories outside your sparse-checkout by using `git ls-tree HEAD`.

Should this be `git ls-tree [-r] HEAD`?  Do you expect people to just
add directories that are found immediately under the toplevel, rather
than some that are a bit deeper?
Johannes Schindelin Oct. 6, 2021, 8:40 p.m. UTC | #12
Hi Elijah,

On Mon, 27 Sep 2021, Elijah Newren wrote:

> On Mon, Aug 30, 2021 at 2:36 PM Johannes Schindelin via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> ...
> >  COMMANDS
> >  --------
> >
> > +Clone
> > +~~~~~
> > +
> > +clone [<options>] <url> [<enlistment>]::
> > +    Clones the specified repository, similar to linkgit:git-clone[1]. By
> > +    default, only commit and tree objects are cloned. Once finished, the
> > +    worktree is located at `<enlistment>/src`.
> > ++
> > +The sparse-checkout feature is enabled (except when run with `--full-clone`)
> > +and the only files present are those in the top-level directory. Use
> > +`git sparse-checkout set` to expand the set of directories you want to see,
> > +or `git sparse-checkout disable` to expand to all files (see
> > +linkgit:git-sparse-checkout[1] for more details). You can explore the
> > +subdirectories outside your sparse-checkout by using `git ls-tree HEAD`.
>
> Should this be `git ls-tree [-r] HEAD`?  Do you expect people to just
> add directories that are found immediately under the toplevel, rather
> than some that are a bit deeper?

I fear that `git ls-tree -r HEAD` in any monorepo might be a bit too
overwhelming for any reader.

But I agree that just looking at HEAD is probably not enough. Maybe we
should use `git ls-tree HEAD[:<dir>]`?

Ciao,
Dscho
Elijah Newren Oct. 7, 2021, 2:09 p.m. UTC | #13
On Wed, Oct 6, 2021 at 1:40 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Elijah,
>
> On Mon, 27 Sep 2021, Elijah Newren wrote:
>
> > On Mon, Aug 30, 2021 at 2:36 PM Johannes Schindelin via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> > >
> > ...
> > >  COMMANDS
> > >  --------
> > >
> > > +Clone
> > > +~~~~~
> > > +
> > > +clone [<options>] <url> [<enlistment>]::
> > > +    Clones the specified repository, similar to linkgit:git-clone[1]. By
> > > +    default, only commit and tree objects are cloned. Once finished, the
> > > +    worktree is located at `<enlistment>/src`.
> > > ++
> > > +The sparse-checkout feature is enabled (except when run with `--full-clone`)
> > > +and the only files present are those in the top-level directory. Use
> > > +`git sparse-checkout set` to expand the set of directories you want to see,
> > > +or `git sparse-checkout disable` to expand to all files (see
> > > +linkgit:git-sparse-checkout[1] for more details). You can explore the
> > > +subdirectories outside your sparse-checkout by using `git ls-tree HEAD`.
> >
> > Should this be `git ls-tree [-r] HEAD`?  Do you expect people to just
> > add directories that are found immediately under the toplevel, rather
> > than some that are a bit deeper?
>
> I fear that `git ls-tree -r HEAD` in any monorepo might be a bit too
> overwhelming for any reader.

Oh, right...

> But I agree that just looking at HEAD is probably not enough. Maybe we
> should use `git ls-tree HEAD[:<dir>]`?

Sounds good.
diff mbox series

Patch

diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
index 91ceb97e552..13cdfa94d16 100644
--- a/contrib/scalar/scalar.c
+++ b/contrib/scalar/scalar.c
@@ -7,6 +7,7 @@ 
 #include "parse-options.h"
 #include "config.h"
 #include "run-command.h"
+#include "refs.h"
 
 /*
  * Remove the deepest subdirectory in the provided path string. Path must not
@@ -258,6 +259,204 @@  static int unregister_dir(void)
 	return res;
 }
 
+/* printf-style interface, expects `<key>=<value>` argument */
+static int set_config(const char *fmt, ...)
+{
+	struct strbuf buf = STRBUF_INIT;
+	char *value;
+	int res;
+	va_list args;
+
+	va_start(args, fmt);
+	strbuf_vaddf(&buf, fmt, args);
+	va_end(args);
+
+	value = strchr(buf.buf, '=');
+	if (value)
+		*(value++) = '\0';
+	res = git_config_set_gently(buf.buf, value);
+	strbuf_release(&buf);
+
+	return res;
+}
+
+static char *remote_default_branch(const char *url)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	struct strbuf out = STRBUF_INIT;
+
+	cp.git_cmd = 1;
+	strvec_pushl(&cp.args, "ls-remote", "--symref", url, "HEAD", NULL);
+	strbuf_addstr(&out, "-\n");
+	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
+		char *ref = out.buf;
+
+		while ((ref = strstr(ref + 1, "\nref: "))) {
+			const char *p;
+			char *head, *branch;
+
+			ref += strlen("\nref: ");
+			head = strstr(ref, "\tHEAD");
+
+			if (!head || memchr(ref, '\n', head - ref))
+				continue;
+
+			if (skip_prefix(ref, "refs/heads/", &p)) {
+				branch = xstrndup(p, head - p);
+				strbuf_release(&out);
+				return branch;
+			}
+
+			error(_("remote HEAD is not a branch: '%.*s'"),
+			      (int)(head - ref), ref);
+			strbuf_release(&out);
+			return NULL;
+		}
+	}
+	warning(_("failed to get default branch name from remote; "
+		  "using local default"));
+	strbuf_reset(&out);
+
+	child_process_init(&cp);
+	cp.git_cmd = 1;
+	strvec_pushl(&cp.args, "symbolic-ref", "--short", "HEAD", NULL);
+	if (!pipe_command(&cp, NULL, 0, &out, 0, NULL, 0)) {
+		strbuf_trim(&out);
+		return strbuf_detach(&out, NULL);
+	}
+
+	strbuf_release(&out);
+	error(_("failed to get default branch name"));
+	return NULL;
+}
+
+static int cmd_clone(int argc, const char **argv)
+{
+	const char *branch = NULL;
+	int full_clone = 0;
+	struct option clone_options[] = {
+		OPT_STRING('b', "branch", &branch, N_("<branch>"),
+			   N_("branch to checkout after clone")),
+		OPT_BOOL(0, "full-clone", &full_clone,
+			 N_("when cloning, create full working directory")),
+		OPT_END(),
+	};
+	const char * const clone_usage[] = {
+		N_("scalar clone [<options>] [--] <repo> [<dir>]"),
+		NULL
+	};
+	const char *url;
+	char *enlistment = NULL, *dir = NULL;
+	struct strbuf buf = STRBUF_INIT;
+	int res;
+
+	argc = parse_options(argc, argv, NULL, clone_options, clone_usage, 0);
+
+	if (argc == 2) {
+		url = argv[0];
+		enlistment = xstrdup(argv[1]);
+	} else if (argc == 1) {
+		url = argv[0];
+
+		strbuf_addstr(&buf, url);
+		/* Strip trailing slashes, if any */
+		while (buf.len > 0 && is_dir_sep(buf.buf[buf.len - 1]))
+			strbuf_setlen(&buf, buf.len - 1);
+		/* Strip suffix `.git`, if any */
+		strbuf_strip_suffix(&buf, ".git");
+
+		enlistment = find_last_dir_sep(buf.buf);
+		if (!enlistment) {
+			die(_("cannot deduce worktree name from '%s'"), url);
+		}
+		enlistment = xstrdup(enlistment + 1);
+	} else {
+		usage_msg_opt(_("You must specify a repository to clone."),
+			      clone_usage, clone_options);
+	}
+
+	if (is_directory(enlistment))
+		die(_("directory '%s' exists already"), enlistment);
+
+	dir = xstrfmt("%s/src", enlistment);
+
+	strbuf_reset(&buf);
+	if (branch)
+		strbuf_addf(&buf, "init.defaultBranch=%s", branch);
+	else {
+		char *b = repo_default_branch_name(the_repository, 1);
+		strbuf_addf(&buf, "init.defaultBranch=%s", b);
+		free(b);
+	}
+
+	if ((res = run_git("-c", buf.buf, "init", "--", dir, NULL)))
+		goto cleanup;
+
+	if (chdir(dir) < 0) {
+		res = error_errno(_("could not switch to '%s'"), dir);
+		goto cleanup;
+	}
+
+	setup_git_directory();
+
+	/* common-main already logs `argv` */
+	trace2_def_repo(the_repository);
+
+	if (!branch && !(branch = remote_default_branch(url))) {
+		res = error(_("failed to get default branch for '%s'"), url);
+		goto cleanup;
+	}
+
+	if (set_config("remote.origin.url=%s", url) ||
+	    set_config("remote.origin.fetch="
+		       "+refs/heads/*:refs/remotes/origin/*") ||
+	    set_config("remote.origin.promisor=true") ||
+	    set_config("remote.origin.partialCloneFilter=blob:none")) {
+		res = error(_("could not configure remote in '%s'"), dir);
+		goto cleanup;
+	}
+
+	if (!full_clone &&
+	    (res = run_git("sparse-checkout", "init", "--cone", NULL)))
+		goto cleanup;
+
+	if (set_recommended_config())
+		return error(_("could not configure '%s'"), dir);
+
+	if ((res = run_git("fetch", "--quiet", "origin", NULL))) {
+		warning(_("partial clone failed; attempting full clone"));
+
+		if (set_config("remote.origin.promisor") ||
+		    set_config("remote.origin.partialCloneFilter")) {
+			res = error(_("could not configure for full clone"));
+			goto cleanup;
+		}
+
+		if ((res = run_git("fetch", "--quiet", "origin", NULL)))
+			goto cleanup;
+	}
+
+	if ((res = set_config("branch.%s.remote=origin", branch)))
+		goto cleanup;
+	if ((res = set_config("branch.%s.merge=refs/heads/%s",
+			      branch, branch)))
+		goto cleanup;
+
+	strbuf_reset(&buf);
+	strbuf_addf(&buf, "origin/%s", branch);
+	res = run_git("checkout", "-f", "-t", buf.buf, NULL);
+	if (res)
+		goto cleanup;
+
+	res = register_dir();
+
+cleanup:
+	free(enlistment);
+	free(dir);
+	strbuf_release(&buf);
+	return res;
+}
+
 static int cmd_list(int argc, const char **argv)
 {
 	if (argc != 1)
@@ -354,6 +553,7 @@  static struct {
 	const char *name;
 	int (*fn)(int, const char **);
 } builtins[] = {
+	{ "clone", cmd_clone },
 	{ "list", cmd_list },
 	{ "register", cmd_register },
 	{ "unregister", cmd_unregister },
diff --git a/contrib/scalar/scalar.txt b/contrib/scalar/scalar.txt
index e1f629fddad..90d59f1d79f 100644
--- a/contrib/scalar/scalar.txt
+++ b/contrib/scalar/scalar.txt
@@ -8,6 +8,7 @@  scalar - an opinionated repository management tool
 SYNOPSIS
 --------
 [verse]
+scalar clone [--branch <main-branch>] [--full-clone] <url> [<enlistment>]
 scalar list
 scalar register [<enlistment>]
 scalar unregister [<enlistment>]
@@ -29,19 +30,43 @@  an existing Git worktree with Scalar whose name is not `src`, the enlistment
 will be identical to the worktree.
 
 The `scalar` command implements various subcommands, and different options
-depending on the subcommand. With the exception of `list`, all subcommands
-expect to be run in an enlistment.
+depending on the subcommand. With the exception of `clone` and `list`, all
+subcommands expect to be run in an enlistment.
 
 COMMANDS
 --------
 
+Clone
+~~~~~
+
+clone [<options>] <url> [<enlistment>]::
+    Clones the specified repository, similar to linkgit:git-clone[1]. By
+    default, only commit and tree objects are cloned. Once finished, the
+    worktree is located at `<enlistment>/src`.
++
+The sparse-checkout feature is enabled (except when run with `--full-clone`)
+and the only files present are those in the top-level directory. Use
+`git sparse-checkout set` to expand the set of directories you want to see,
+or `git sparse-checkout disable` to expand to all files (see
+linkgit:git-sparse-checkout[1] for more details). You can explore the
+subdirectories outside your sparse-checkout by using `git ls-tree HEAD`.
+
+-b <name>::
+--branch <name>::
+    Instead of checking out the branch pointed to by the cloned repository's
+    HEAD, check out the `<name>` branch instead.
+
+--[no-]full-clone::
+    A sparse-checkout is initialized by default. This behavior can be turned
+    off via `--full-clone`.
+
 List
 ~~~~
 
 list::
     To see which repositories are currently registered by the service, run
-    `scalar list`. This subcommand does not need to be run inside a Scalar
-    enlistment.
+    `scalar list`. This subcommand, like `clone`, does not need to be run
+    inside a Scalar enlistment.
 
 Register
 ~~~~~~~~
@@ -61,7 +86,7 @@  unregister [<enlistment>]::
 
 SEE ALSO
 --------
-linkgit:git-maintenance[1].
+linkgit:git-clone[1], linkgit:git-maintenance[1].
 
 Scalar
 ---
diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
index ef0e8d680d5..295398f62cc 100755
--- a/contrib/scalar/t/t9099-scalar.sh
+++ b/contrib/scalar/t/t9099-scalar.sh
@@ -10,6 +10,9 @@  PATH=$PWD/..:$PATH
 
 . ../../../t/test-lib.sh
 
+GIT_TEST_MAINT_SCHEDULER="crontab:test-tool crontab ../cron.txt"
+export GIT_TEST_MAINT_SCHEDULER
+
 test_expect_success 'scalar shows a usage' '
 	test_expect_code 129 scalar -h
 '
@@ -29,4 +32,33 @@  test_expect_success 'scalar unregister' '
 	! grep -F "$(pwd)/vanish/src" scalar.repos
 '
 
+test_expect_success 'set up repository to clone' '
+	test_commit first &&
+	test_commit second &&
+	test_commit third &&
+	git switch -c parallel first &&
+	mkdir -p 1/2 &&
+	test_commit 1/2/3 &&
+	git config uploadPack.allowFilter true &&
+	git config uploadPack.allowAnySHA1InWant true
+'
+
+test_expect_success 'scalar clone' '
+	second=$(git rev-parse --verify second:second.t) &&
+	scalar clone "file://$(pwd)" cloned &&
+	(
+		cd cloned/src &&
+
+		git config --get --global --fixed-value maintenance.repo \
+			"$(pwd)" &&
+
+		test_path_is_missing 1/2 &&
+		test_must_fail git rev-list --missing=print $second &&
+		git rev-list $second &&
+		git cat-file blob $second >actual &&
+		echo "second" >expect &&
+		test_cmp expect actual
+	)
+'
+
 test_done