[v2,01/18] maintenance: create basic maintenance runner

Message ID	63ec602a07756a41f8ccddd745562c567a4b3ed7.1595527000.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=AzkS=BC=vger.kernel.org=git-owner@kernel.org> Message-Id: <63ec602a07756a41f8ccddd745562c567a4b3ed7.1595527000.git.gitgitgadget@gmail.com> In-Reply-To: <pull.671.v2.git.1595527000.gitgitgadget@gmail.com> References: <pull.671.git.1594131695.gitgitgadget@gmail.com> <pull.671.v2.git.1595527000.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Thu, 23 Jul 2020 17:56:23 +0000 Subject: [PATCH v2 01/18] maintenance: create basic maintenance runner Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Johannes.Schindelin@gmx.de, sandals@crustytoothpaste.net, steadmon@google.com, jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com, phillip.wood123@gmail.com, emilyshaffer@google.com, sluongng@gmail.com, jonathantanmy@google.com, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Maintenance builtin, allowing 'gc --auto' customization \| expand [v2,00/18] Maintenance builtin, allowing 'gc --auto' customization [v2,01/18] maintenance: create basic maintenance runner [v2,02/18] maintenance: add --quiet option [v2,03/18] maintenance: replace run_auto_gc() [v2,04/18] maintenance: initialize task array [v2,05/18] maintenance: add commit-graph task [v2,06/18] maintenance: add --task option [v2,07/18] maintenance: take a lock on the objects directory [v2,08/18] maintenance: add prefetch task [v2,09/18] maintenance: add loose-objects task [v2,10/18] maintenance: add incremental-repack task [v2,11/18] maintenance: auto-size incremental-repack batch [v2,12/18] maintenance: create maintenance.<task>.enabled config [v2,13/18] maintenance: use pointers to check --auto [v2,14/18] maintenance: add auto condition for commit-graph task [v2,15/18] maintenance: create auto condition for loose-objects [v2,16/18] maintenance: add incremental-repack auto condition [v2,17/18] midx: use start_delayed_progress() [v2,18/18] maintenance: add trace2 regions for task execution

Koji Nakamaru via GitGitGadget July 23, 2020, 5:56 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

The 'gc' builtin is our current entrypoint for automatically maintaining
a repository. This one tool does many operations, such as repacking the
repository, packing refs, and rewriting the commit-graph file. The name
implies it performs "garbage collection" which means several different
things, and some users may not want to use this operation that rewrites
the entire object database.

Create a new 'maintenance' builtin that will become a more general-
purpose command. To start, it will only support the 'run' subcommand,
but will later expand to add subcommands for scheduling maintenance in
the background.

For now, the 'maintenance' builtin is a thin shim over the 'gc' builtin.
In fact, the only option is the '--auto' toggle, which is handed
directly to the 'gc' builtin. The current change is isolated to this
simple operation to prevent more interesting logic from being lost in
all of the boilerplate of adding a new builtin.

Use existing builtin/gc.c file because we want to share code between the
two builtins. It is possible that we will have 'maintenance' replace the
'gc' builtin entirely at some point, leaving 'git gc' as an alias for
some specific arguments to 'git maintenance run'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .gitignore                        |  1 +
 Documentation/git-maintenance.txt | 57 +++++++++++++++++++++++++++++
 builtin.h                         |  1 +
 builtin/gc.c                      | 59 +++++++++++++++++++++++++++++++
 git.c                             |  1 +
 t/t7900-maintenance.sh            | 22 ++++++++++++
 6 files changed, 141 insertions(+)
 create mode 100644 Documentation/git-maintenance.txt
 create mode 100755 t/t7900-maintenance.sh

Taylor Blau July 25, 2020, 1:26 a.m. UTC | #1

On Thu, Jul 23, 2020 at 05:56:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>

Everything in this patch looks very sensible to me, and I think that
implementing 'git maintenance' in the same file as the current 'git gc'
builtin makes a lot of sense to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

Đoàn Trần Công Danh July 25, 2020, 1:47 a.m. UTC | #2

On 2020-07-23 17:56:23+0000, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> new file mode 100755
> index 0000000000..d00641c4dd
> --- /dev/null
> +++ b/t/t7900-maintenance.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +
> +test_description='git maintenance builtin'
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'help text' '
> +	test_must_fail git maintenance -h 2>err &&
> +	test_i18ngrep "usage: git maintenance run" err
> +'

I think this test has been tested better already in t0012,
Anyway, if we would like to check the word "git maintenance run"
explicitly here, it would be clearer to

s/test_must_fail/test_expect_code 129/

> +
> +test_expect_success 'gc [--auto]' '
> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
> +	grep ",\"gc\"]" run-no-auto.txt  &&
> +	grep ",\"gc\",\"--auto\"]" run-auto.txt
> +'
> +
> +test_done
> -- 
> gitgitgadget
>

Jonathan Nieder July 29, 2020, 10:19 p.m. UTC | #3

Hi,

Derrick Stolee wrote:

> [Subject: maintenance: create basic maintenance runner]

Seems sensible, and a good way to set up for the later patches.  Let's
take a look at how it does that.

[...]
> --- /dev/null
> +++ b/Documentation/git-maintenance.txt
> @@ -0,0 +1,57 @@
[...]
> +SUBCOMMANDS
> +-----------
> +
> +run::
> +	Run one or more maintenance tasks.

[jrnieder] How do I supply the tasks on the command line?  Are they
parameters to this subcommand?  If so, it could make sense for this to
say something like

	run <task>...::

What is the exit code convention for "git maintenance run"?  (Many Git
commands don't document this, but since we're making a new command
it seems worth building the habit.)

[...]
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
[...]
> +static struct maintenance_opts {
> +	int auto_flag;
> +} opts;

Packing this in a struct feels a bit unusual.  Is the struct going to
be passed somewhere, or could these be individual locals in
cmd_maintenance?

[...]
> +
> +static int maintenance_task_gc(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "gc", NULL);
> +
> +	if (opts.auto_flag)
> +		argv_array_pushl(&cmd, "--auto", NULL);

These are both pushing single strings, so they can use argv_array_push.

[...]
> --- /dev/null
> +++ b/t/t7900-maintenance.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +
> +test_description='git maintenance builtin'
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0

Why does this disable commit graph and multipack index?  Is that setting
up for something that comes later?

[...]
> +test_expect_success 'gc [--auto]' '
> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
> +	grep ",\"gc\"]" run-no-auto.txt  &&
> +	grep ",\"gc\",\"--auto\"]" run-auto.txt

This feels a little odd in two ways:

- the use of "git gc" isn't a user-facing detail, so this is testing
  implementation instead of the user-facing behavior.  That's okay,
  though --- it can be useful to test internals sometimes.

- the way that this tests for "git gc" feels brittle: if the trace
  emitter changes some day to include a space after the comma, for
  example, then the test would start failing.  I thought that in the
  spirit of fakes, it could make sense to write a custom "git gc"
  command using test_write_script, but that isn't likely to work
  either since gc is a builtin.

Perhaps this is suggesting that we need some central test helper for
parsing traces so we can do this reliably in one place.  What do you
think?

Thanks,
Jonathan

Derrick Stolee July 30, 2020, 1:12 p.m. UTC | #4

On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> Hi,
> 
> Derrick Stolee wrote:
> 
>> [Subject: maintenance: create basic maintenance runner]
> 
> Seems sensible, and a good way to set up for the later patches.  Let's
> take a look at how it does that.
> 
> [...]
>> --- /dev/null
>> +++ b/Documentation/git-maintenance.txt
>> @@ -0,0 +1,57 @@
> [...]
>> +SUBCOMMANDS
>> +-----------
>> +
>> +run::
>> +	Run one or more maintenance tasks.
> 
> [jrnieder] How do I supply the tasks on the command line?  Are they
> parameters to this subcommand?  If so, it could make sense for this to
> say something like
> 
> 	run <task>...::

Hopefully this is documented to your satisfaction when the ability
to customize the tasks is implemented.

> What is the exit code convention for "git maintenance run"?  (Many Git
> commands don't document this, but since we're making a new command
> it seems worth building the habit.)

Is this worth doing? Do we really want every command to document
that "0 means success, everything else means failure, and some of
those exit codes mean a specific kind of failure that is global to
Git"?

> [...]
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
> [...]
>> +static struct maintenance_opts {
>> +	int auto_flag;
>> +} opts;
> 
> Packing this in a struct feels a bit unusual.  Is the struct going to
> be passed somewhere, or could these be individual locals in
> cmd_maintenance?

This will grow, and I'd rather have one global struct than many
individual global items. It makes it clearer when I use
"opts.auto_flag" that this corresponds to whether "--auto" was
provided as a command-line option.

> [...]
>> +
>> +static int maintenance_task_gc(void)
>> +{
>> +	int result;
>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>> +
>> +	argv_array_pushl(&cmd, "gc", NULL);
>> +
>> +	if (opts.auto_flag)
>> +		argv_array_pushl(&cmd, "--auto", NULL);
> 
> These are both pushing single strings, so they can use argv_array_push.

Thanks. I noticed a few of these myself. Luckily, I'll be going
through all of these invocations and replacing them with new
methods soon ;)

[1] https://lore.kernel.org/git/30933a71-3130-5478-cbfd-0ca5bb308cf2@gmail.com/

> [...]
>> --- /dev/null
>> +++ b/t/t7900-maintenance.sh
>> @@ -0,0 +1,22 @@
>> +#!/bin/sh
>> +
>> +test_description='git maintenance builtin'
>> +
>> +GIT_TEST_COMMIT_GRAPH=0
>> +GIT_TEST_MULTI_PACK_INDEX=0
> 
> Why does this disable commit graph and multipack index?  Is that setting
> up for something that comes later?

Yes, these don't need to be here yet.

> [...]
>> +test_expect_success 'gc [--auto]' '
>> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
>> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
>> +	grep ",\"gc\"]" run-no-auto.txt  &&
>> +	grep ",\"gc\",\"--auto\"]" run-auto.txt
> 
> This feels a little odd in two ways:
> 
> - the use of "git gc" isn't a user-facing detail, so this is testing
>   implementation instead of the user-facing behavior.  That's okay,
>   though --- it can be useful to test internals sometimes.

Consider this a "unit test" of the builtin. I'm testing whether the
command-line arguments had an effect on the child process.

> - the way that this tests for "git gc" feels brittle: if the trace
>   emitter changes some day to include a space after the comma, for
>   example, then the test would start failing.  I thought that in the
>   spirit of fakes, it could make sense to write a custom "git gc"
>   command using test_write_script, but that isn't likely to work
>   either since gc is a builtin.
> 
> Perhaps this is suggesting that we need some central test helper for
> parsing traces so we can do this reliably in one place.  What do you
> think?

I'm specifically using GIT_TRACE2_EVENT, which is intended for
consumption by computer, not humans. Adding whitespace or otherwise
changing the output format would be an unnecessary disruption that
is more likely to have downstream effects with external tools.

In some way, adding extra dependencies like this in our own test
suite can help ensure the output format is more stable.

If there is a better way to ask "Did my command call 'git gc' (with
no arguments|with these arguments)?" then I'm happy to consider it.

Thanks,
-Stolee

Jonathan Nieder July 31, 2020, 12:30 a.m. UTC | #5

Derrick Stolee wrote:
> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:

>> [jrnieder] How do I supply the tasks on the command line?  Are they
>> parameters to this subcommand?  If so, it could make sense for this to
>> say something like
>>
>> 	run <task>...::
>
> Hopefully this is documented to your satisfaction when the ability
> to customize the tasks is implemented.

Hm, do you mean that this change is too difficult to squash into the
first patch?

>> What is the exit code convention for "git maintenance run"?  (Many Git
>> commands don't document this, but since we're making a new command
>> it seems worth building the habit.)
>
> Is this worth doing? Do we really want every command to document
> that "0 means success, everything else means failure, and some of
> those exit codes mean a specific kind of failure that is global to
> Git"?

I mean things like "what exit code does it run if there is no
maintenance to do?"

[...]
>>> +static struct maintenance_opts {
>>> +	int auto_flag;
>>> +} opts;
>>
>> Packing this in a struct feels a bit unusual.  Is the struct going to
>> be passed somewhere, or could these be individual locals in
>> cmd_maintenance?
>
> This will grow, and I'd rather have one global struct than many
> individual global items. It makes it clearer when I use
> "opts.auto_flag" that this corresponds to whether "--auto" was
> provided as a command-line option.

That doesn't seem idiomatic within the Git codebase.

Can they be locals instead?

[...]
>> Perhaps this is suggesting that we need some central test helper for
>> parsing traces so we can do this reliably in one place.  What do you
>> think?
>
> I'm specifically using GIT_TRACE2_EVENT, which is intended for
> consumption by computer, not humans. Adding whitespace or otherwise
> changing the output format would be an unnecessary disruption that
> is more likely to have downstream effects with external tools.
>
> In some way, adding extra dependencies like this in our own test
> suite can help ensure the output format is more stable.

Hm, I thought the contract for GIT_TRACE2_EVENT is that it is JSON,
which is a well defined format.  To that end, I would sincerely hope
that external tools are not assuming anything beyond that the format
is JSON.

If that wasn't the intent, I'd rather that we not use a JSON-like
format at all.  That would make the expectations for evolving this
interface clearer.

> If there is a better way to ask "Did my command call 'git gc' (with
> no arguments|with these arguments)?" then I'm happy to consider it.

My proposal was just to factor this out into a function in
test-lib-functions.sh so it's easy to evolve over time in one place.

Thanks and hope that helps,
Jonathan

Jonathan Nieder July 31, 2020, 4:40 p.m. UTC | #6

Derrick Stolee wrote:
> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> > Derrick Stolee wrote:

>>> --- /dev/null
>>> +++ b/Documentation/git-maintenance.txt
>>> @@ -0,0 +1,57 @@
>> [...]
>>> +SUBCOMMANDS
>>> +-----------
>>> +
>>> +run::
>>> +	Run one or more maintenance tasks.
>>
>> [jrnieder] How do I supply the tasks on the command line?  Are they
>> parameters to this subcommand?  If so, it could make sense for this to
>> say something like
>>
>> 	run <task>...::
>
> Hopefully this is documented to your satisfaction when the ability
> to customize the tasks is implemented.

I think my last reply to this part missed the point a little, so let
me try again.

In multi-patch series, requests to make improvements to early parts
that would be obsoleted by later parts may seem strange --- isn't this
just wasted work?  Sure, occasionally someone will stumble on the
early part using "git bisect" or "git log", but isn't that rare enough
to not be worth worrying too much about?

And I'd agree with that reaction --- for a documentation nit like this
one, that benefit alone is probably not enough to be worth the fuss of
tweaking things.

When I focus more on early parts of a series, it is for a different
reason: from a release management point of view, it is useful to get
the early part of a series "cooked" as soon as possible so it can be
exposed to users in "next" and then become a foundation for later
patches in "master".  So we try to get as much of the early part of a
series into that state as we can, and to merge the early part even
without the later part as soon as the former is ready and valuable
enough to users on its own.

Relatedly, ensuring each commit is "complete" in this sense on its own
gives us more flexibility in case a problem is discovered later: we
can revert a patch that is causing issues without having to revert the
preceding part.

I think that might be part of the context I didn't convey well, and
why I want to get this in tip-top shape and start using it.

Thanks,
Jonathan

Derrick Stolee Aug. 3, 2020, 5:37 p.m. UTC | #7

On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
>> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> 
>>> [jrnieder] How do I supply the tasks on the command line?  Are they
>>> parameters to this subcommand?  If so, it could make sense for this to
>>> say something like
>>>
>>> 	run <task>...::
>>
>> Hopefully this is documented to your satisfaction when the ability
>> to customize the tasks is implemented.
> 
> Hm, do you mean that this change is too difficult to squash into the
> first patch?

I mean that there is only one task right now. Until the commit-graph
task is implemented, there is no need to have a --task=<task> option.

Creating a --task=<task> option in this patch would unnecessarily bloat
the patch by implementing multiple tasks while filling in the boilerplate
of creating a new builtin.

>>> What is the exit code convention for "git maintenance run"?  (Many Git
>>> commands don't document this, but since we're making a new command
>>> it seems worth building the habit.)
>>
>> Is this worth doing? Do we really want every command to document
>> that "0 means success, everything else means failure, and some of
>> those exit codes mean a specific kind of failure that is global to
>> Git"?
> 
> I mean things like "what exit code does it run if there is no
> maintenance to do?"
> 
> [...]
>>>> +static struct maintenance_opts {
>>>> +	int auto_flag;
>>>> +} opts;
>>>
>>> Packing this in a struct feels a bit unusual.  Is the struct going to
>>> be passed somewhere, or could these be individual locals in
>>> cmd_maintenance?
>>
>> This will grow, and I'd rather have one global struct than many
>> individual global items. It makes it clearer when I use
>> "opts.auto_flag" that this corresponds to whether "--auto" was
>> provided as a command-line option.
> 
> That doesn't seem idiomatic within the Git codebase.
>
> Can they be locals instead?

Which part do you want to be idiomatic? The fact that the parse-options
library typically refers to static global values or the fact that the
data is not organized in a struct?

Turning these into locals is _even worse_ than the current model of
static globals because that then requires updating the prototypes of
all the (static) local methods that operate on the arguments. Every
time a new option is added, that would cascade in method prototypes
all over again.

The natural thing to do to "fix" that situation is to create a struct
that holds the information from the parsed command-line arguments. But
then why is it a local parameter that is passed through all of the
local methods instead of a global (as presented here in this patch)?

> [...]
>>> Perhaps this is suggesting that we need some central test helper for
>>> parsing traces so we can do this reliably in one place.  What do you
>>> think?
>>
>> I'm specifically using GIT_TRACE2_EVENT, which is intended for
>> consumption by computer, not humans. Adding whitespace or otherwise
>> changing the output format would be an unnecessary disruption that
>> is more likely to have downstream effects with external tools.
>>
>> In some way, adding extra dependencies like this in our own test
>> suite can help ensure the output format is more stable.
> 
> Hm, I thought the contract for GIT_TRACE2_EVENT is that it is JSON,
> which is a well defined format.  To that end, I would sincerely hope
> that external tools are not assuming anything beyond that the format
> is JSON.
> 
> If that wasn't the intent, I'd rather that we not use a JSON-like
> format at all.  That would make the expectations for evolving this
> interface clearer.
> 
>> If there is a better way to ask "Did my command call 'git gc' (with
>> no arguments|with these arguments)?" then I'm happy to consider it.
> 
> My proposal was just to factor this out into a function in
> test-lib-functions.sh so it's easy to evolve over time in one place.

This is a valuable suggestion, but this series is already too large
to make such a change in addition to the patches already here. I'd
consider this more if there were more tests that checked "did we
run this subcommand with these arguments?" Better to extract a helper
when this is actually used in multiple places.

Thanks,
-Stolee

Jonathan Nieder Aug. 3, 2020, 5:46 p.m. UTC | #8

Derrick Stolee wrote:
> On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
>> Derrick Stolee wrote:
>>> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:

>>>> [jrnieder] How do I supply the tasks on the command line?  Are they
>>>> parameters to this subcommand?  If so, it could make sense for this to
>>>> say something like
>>>>
>>>> 	run <task>...::
>>>
>>> Hopefully this is documented to your satisfaction when the ability
>>> to customize the tasks is implemented.
[...]
> I mean that there is only one task right now. Until the commit-graph
> task is implemented, there is no need to have a --task=<task> option.

Ah, that wasn't clear to me from the docs.

You're saying that "git maintenance run" runs the "gc" task?  Can the
documentation say so?

[...]
>>>>> +static struct maintenance_opts {
>>>>> +	int auto_flag;
>>>>> +} opts;
>>>>
>>>> Packing this in a struct feels a bit unusual.  Is the struct going to
>>>> be passed somewhere, or could these be individual locals in
>>>> cmd_maintenance?
>>>
>>> This will grow, and I'd rather have one global struct than many
>>> individual global items. It makes it clearer when I use
>>> "opts.auto_flag" that this corresponds to whether "--auto" was
>>> provided as a command-line option.
>>
>> That doesn't seem idiomatic within the Git codebase.
>>
>> Can they be locals instead?
>
> Which part do you want to be idiomatic? The fact that the parse-options
> library typically refers to static global values or the fact that the
> data is not organized in a struct?

parse-options has no requirement about the values being global.  Some
older code does that, but newer code tends to use locals when
appropriate.

Putting it in a struct is fine as long as that struct is being passed
around.  What isn't idiomatic in Git is using a global struct for
namespacing.

[...]
> The natural thing to do to "fix" that situation is to create a struct
> that holds the information from the parsed command-line arguments. But
> then why is it a local parameter that is passed through all of the
> local methods instead of a global (as presented here in this patch)?

I'm having trouble parsing that last sentence.  You're saying a global
is preferable over passing around a pointer to a local "opts" struct?

[...]
>>> If there is a better way to ask "Did my command call 'git gc' (with
>>> no arguments|with these arguments)?" then I'm happy to consider it.
>>
>> My proposal was just to factor this out into a function in
>> test-lib-functions.sh so it's easy to evolve over time in one place.
>
> This is a valuable suggestion, but this series is already too large
> to make such a change in addition to the patches already here.

Hm, it's not clear to me that this would make the series significantly
larger.

And on the contrary, it would make the code less fragile.  I think this
is important.

https://chromium.googlesource.com/chromium/src/+/master/docs/cl_respect.md#remember-communication-can-be-hard:
if you'd like to meet out of band, let me know, and I'd be happy to go
into this further.

Thanks,
Jonathan

Taylor Blau Aug. 3, 2020, 10:46 p.m. UTC | #9

On Mon, Aug 03, 2020 at 10:46:54AM -0700, Jonathan Nieder wrote:
> Derrick Stolee wrote:
> > On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
> >> Derrick Stolee wrote:
> >>> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
>
> >>>> [jrnieder] How do I supply the tasks on the command line?  Are they
> >>>> parameters to this subcommand?  If so, it could make sense for this to
> >>>> say something like
> >>>>
> >>>> 	run <task>...::
> >>>
> >>> Hopefully this is documented to your satisfaction when the ability
> >>> to customize the tasks is implemented.
> [...]
> > I mean that there is only one task right now. Until the commit-graph
> > task is implemented, there is no need to have a --task=<task> option.
>
> Ah, that wasn't clear to me from the docs.
>
> You're saying that "git maintenance run" runs the "gc" task?  Can the
> documentation say so?
>
> [...]
> >>>>> +static struct maintenance_opts {
> >>>>> +	int auto_flag;
> >>>>> +} opts;
> >>>>
> >>>> Packing this in a struct feels a bit unusual.  Is the struct going to
> >>>> be passed somewhere, or could these be individual locals in
> >>>> cmd_maintenance?
> >>>
> >>> This will grow, and I'd rather have one global struct than many
> >>> individual global items. It makes it clearer when I use
> >>> "opts.auto_flag" that this corresponds to whether "--auto" was
> >>> provided as a command-line option.
> >>
> >> That doesn't seem idiomatic within the Git codebase.
> >>
> >> Can they be locals instead?
> >
> > Which part do you want to be idiomatic? The fact that the parse-options
> > library typically refers to static global values or the fact that the
> > data is not organized in a struct?
>
> parse-options has no requirement about the values being global.  Some
> older code does that, but newer code tends to use locals when
> appropriate.
>
> Putting it in a struct is fine as long as that struct is being passed
> around.  What isn't idiomatic in Git is using a global struct for
> namespacing.
>
> [...]
> > The natural thing to do to "fix" that situation is to create a struct
> > that holds the information from the parsed command-line arguments. But
> > then why is it a local parameter that is passed through all of the
> > local methods instead of a global (as presented here in this patch)?
>
> I'm having trouble parsing that last sentence.  You're saying a global
> is preferable over passing around a pointer to a local "opts" struct?
>
> [...]
> >>> If there is a better way to ask "Did my command call 'git gc' (with
> >>> no arguments|with these arguments)?" then I'm happy to consider it.
> >>
> >> My proposal was just to factor this out into a function in
> >> test-lib-functions.sh so it's easy to evolve over time in one place.
> >
> > This is a valuable suggestion, but this series is already too large
> > to make such a change in addition to the patches already here.
>
> Hm, it's not clear to me that this would make the series significantly
> larger.

I think what Stolee is trying to say is less about a change that would
make the series larger, it's about changing an already-large series.

> And on the contrary, it would make the code less fragile.  I think this
> is important.

I'm not sure that I see your argument. What we are really discussing is
whether or not we should have a static struct outside of 'cmd_gc()', or
a zero-initialized frame local struct within 'cmd_gc()'. I fully
understand the arguments in favor of one or the other, but I struggle to
grasp that this is worth our time to debate in this much detail.

> https://chromium.googlesource.com/chromium/src/+/master/docs/cl_respect.md#remember-communication-can-be-hard:
> if you'd like to meet out of band, let me know, and I'd be happy to go
> into this further.

I assume that the part of this document you were linking to was about
meeting out-of-band to discuss an issue instead of emailing back and
forth.

I appreciate your willingness to resolve this expediently, but I would
much prefer if these discussions (that is, the ones about the technical
content of this series) happened on-list for the benefit of other
reviewers.

> Thanks,
> Jonathan

Thanks,
Taylor

Jonathan Nieder Aug. 3, 2020, 11:01 p.m. UTC | #10

Taylor Blau wrote:
> On Mon, Aug 03, 2020 at 10:46:54AM -0700, Jonathan Nieder wrote:
>> Derrick Stolee wrote:
>>> On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
>>>> Derrick Stolee wrote:

>>>>> If there is a better way to ask "Did my command call 'git gc' (with
>>>>> no arguments|with these arguments)?" then I'm happy to consider it.
>>>>
>>>> My proposal was just to factor this out into a function in
>>>> test-lib-functions.sh so it's easy to evolve over time in one place.
>>>
>>> This is a valuable suggestion, but this series is already too large
>>> to make such a change in addition to the patches already here.
>>
>> Hm, it's not clear to me that this would make the series significantly
>> larger.
>
> I think what Stolee is trying to say is less about a change that would
> make the series larger, it's about changing an already-large series.
>
>> And on the contrary, it would make the code less fragile.  I think this
>> is important.
>
> I'm not sure that I see your argument. What we are really discussing is
> whether or not we should have a static struct outside of 'cmd_gc()', or
> a zero-initialized frame local struct within 'cmd_gc()'. I fully
> understand the arguments in favor of one or the other, but I struggle to
> grasp that this is worth our time to debate in this much detail.

Sorry for the lack of clarity.  The proposal Stolee is pushing back on
above is to have a helper in test-lib-functions.sh that tells a test
whether an event it cares about happened in traces (in this example,
invocation of "git gc").

I consider it important because if we are getting into the habit of
having our tests assume the current exact byte-for-byte trace JSON
output, then that is an assumption that is going to be painful to
unravel.  Factoring out a helper would also make the test easier to
read, but that is less important than "an ounce of prevention is worth
a pound of cure" to me in this example.

Jonathan

Taylor Blau Aug. 3, 2020, 11:08 p.m. UTC | #11

On Mon, Aug 03, 2020 at 04:01:34PM -0700, Jonathan Nieder wrote:
> Taylor Blau wrote:
> > On Mon, Aug 03, 2020 at 10:46:54AM -0700, Jonathan Nieder wrote:
> >> Derrick Stolee wrote:
> >>> On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
> >>>> Derrick Stolee wrote:
>
> >>>>> If there is a better way to ask "Did my command call 'git gc' (with
> >>>>> no arguments|with these arguments)?" then I'm happy to consider it.
> >>>>
> >>>> My proposal was just to factor this out into a function in
> >>>> test-lib-functions.sh so it's easy to evolve over time in one place.
> >>>
> >>> This is a valuable suggestion, but this series is already too large
> >>> to make such a change in addition to the patches already here.
> >>
> >> Hm, it's not clear to me that this would make the series significantly
> >> larger.
> >
> > I think what Stolee is trying to say is less about a change that would
> > make the series larger, it's about changing an already-large series.
> >
> >> And on the contrary, it would make the code less fragile.  I think this
> >> is important.
> >
> > I'm not sure that I see your argument. What we are really discussing is
> > whether or not we should have a static struct outside of 'cmd_gc()', or
> > a zero-initialized frame local struct within 'cmd_gc()'. I fully
> > understand the arguments in favor of one or the other, but I struggle to
> > grasp that this is worth our time to debate in this much detail.
>
> Sorry for the lack of clarity.  The proposal Stolee is pushing back on
> above is to have a helper in test-lib-functions.sh that tells a test
> whether an event it cares about happened in traces (in this example,
> invocation of "git gc").

Serves me right for reading too quickly ;).

> I consider it important because if we are getting into the habit of
> having our tests assume the current exact byte-for-byte trace JSON
> output, then that is an assumption that is going to be painful to
> unravel.  Factoring out a helper would also make the test easier to
> read, but that is less important than "an ounce of prevention is worth
> a pound of cure" to me in this example.

I still only partially buy into this, though. I had to deal with some
rather fragile grepping through trace2 JSON in t4216 (? or something,
the log-bloom tests) recently, and found it a little fragile, but not
overly so.

I'd rather just move forward, but I do feel strongly that we discuss
these matters on the list.

> Jonathan

Thanks,
Taylor

Jonathan Nieder Aug. 3, 2020, 11:17 p.m. UTC | #12

Taylor Blau wrote:

> I still only partially buy into this, though. I had to deal with some
> rather fragile grepping through trace2 JSON in t4216 (? or something,
> the log-bloom tests) recently, and found it a little fragile, but not
> overly so.
>
> I'd rather just move forward,

This means we're promising to never change the JSON serialization, or
that we're willing to pay with time in the future to update all tests
that assume the current JSON serialization.

I do not want to make that promise.  I've lived through the same work
of updating tests that assumed particular sha1s and having to change
them to be more generic and do not want to have to do that again.

If you're saying that you are willing to do that work when the time
comes, then that counts for a lot.  If you're saying that we should
not change the JSON serialization, then it makes me worry that we made
a mistake in choosing JSON as a format, since using a well defined
format with an ecosystem of serialization + deserialization libraries
was the whole point.

Jonathan

Jonathan Nieder Aug. 3, 2020, 11:52 p.m. UTC | #13

Derrick Stolee wrote:
> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:

>> Perhaps this is suggesting that we need some central test helper for
>> parsing traces so we can do this reliably in one place.  What do you
>> think?
>
> I'm specifically using GIT_TRACE2_EVENT, which is intended for
> consumption by computer, not humans. Adding whitespace or otherwise
> changing the output format would be an unnecessary disruption that
> is more likely to have downstream effects with external tools.
>
> In some way, adding extra dependencies like this in our own test
> suite can help ensure the output format is more stable.

I've discussed this a bit more at [1]: I do think this is output that
we cannot promise to keep stable.  Can you say a little more about why
extracting a helper would be problematic?  For example, would doing
that in a separate patch at the end of the series be workable for you?

More generally, one theme I'm sensing in the responses to the code
review comments I sent is that the series may have gotten too long to
be able to easily amend the beginning of it.  To that end, I'm happy
to make a reroll or make this easier in any other way I can.  This
also suggests to me that it may make sense to split off an early part
of the series and get that into a shape you're comfortable with before
rebasing the rest of the series on top as a separate topic (a tried
and true method, used recently for the "git bugreport" command, for
example).

Thoughts?

Thanks,
Jonathan

[1] https://lore.kernel.org/git/20200803231745.GB58587@google.com/

Junio C Hamano Aug. 4, 2020, 12:07 a.m. UTC | #14

Jonathan Nieder <jrnieder@gmail.com> writes:

> ...  If you're saying that we should
> not change the JSON serialization, then it makes me worry that we made
> a mistake in choosing JSON as a format, since using a well defined
> format with an ecosystem of serialization + deserialization libraries
> was the whole point.

Well said.

Derrick Stolee Aug. 4, 2020, 1:32 p.m. UTC | #15

On 8/3/2020 7:17 PM, Jonathan Nieder wrote:
> Taylor Blau wrote:
> 
>> I still only partially buy into this, though. I had to deal with some
>> rather fragile grepping through trace2 JSON in t4216 (? or something,
>> the log-bloom tests) recently, and found it a little fragile, but not
>> overly so.
>>
>> I'd rather just move forward,
> 
> This means we're promising to never change the JSON serialization, or
> that we're willing to pay with time in the future to update all tests
> that assume the current JSON serialization.
> 
> I do not want to make that promise.  I've lived through the same work
> of updating tests that assumed particular sha1s and having to change
> them to be more generic and do not want to have to do that again.
> 
> If you're saying that you are willing to do that work when the time
> comes, then that counts for a lot.  If you're saying that we should
> not change the JSON serialization, then it makes me worry that we made
> a mistake in choosing JSON as a format, since using a well defined
> format with an ecosystem of serialization + deserialization libraries
> was the whole point.

The benefit of JSON is that it is simple to be _additive_. We can
always jam more data into the format.

What I'm doing is making the following assumptions:

1. There will always be an event line corresponding to launching a
   subcommand.

2. This line will contain the list of arguments used for that
   subcommand in a JSON array.

3. We don't add extra whitespace in that JSON array.

The first two items are important promises made by the trace2 event
output. By depending on them, I'm only adding more places where we
expect the JSON *schema* to remain consistent. Rather, I only require
that the JSON schema does not _remove_ that data which is expected
by tools that are reading and reporting that data.

Only the third item is something that adds an additional restriction
on the JSON format. But also: why would we ever add extraneous
whitespace characters to the output intended for computer parsing?
This concern raises a giant "YAGNI" warning in my head.

What is seems like you are asking instead is for me to create a tool
in the test suite that parses each JSON line, extracts a specific
member from that JSON object, reconstructs a command-line invocation
from the JSON array, and reports whether that process worked for any
line in the event output.

That seems overly complicated and puts even more requirements on the
JSON schema than I have placed in my current implementation.

If this is to truly be a hard requirement for these tests to move
forward, then I would rather do an alternate approach that would be
a better test mechanism than scraping trace data: mock the subcommands.

I would create a new directory, "t/mock", and create a new 'git'
executable there. By setting --exec-path in our test commands, we
would launch this mocked version of Git instead of our built version.
The mock could report each set of command-line arguments to a known
file. With some care, we could even provide an instructions file to
tell it which error code to return or whether it should write data
to stdout. This would allow testing things like the "rewrite after
verify fails" behavior already in the maintenance builtin.

If I'm to spend time engineering something more complicated just to
check "did this subcommand run with these arguments?" then I'd
rather do it this way.

Thanks,
-Stolee

Jonathan Nieder Aug. 4, 2020, 2:42 p.m. UTC | #16

Derrick Stolee wrote:

> What is seems like you are asking instead is for me to create a tool
> in the test suite that parses each JSON line, extracts a specific
> member from that JSON object, reconstructs a command-line invocation
> from the JSON array, and reports whether that process worked for any
> line in the event output.

No, that isn't what I'm asking.

I'm asking for this patch series to take the existing "grep" lines
and put them in a function in test-lib-functions.sh, so that we can
change them in one place when the trace emitter changes.

[...]
> If this is to truly be a hard requirement for these tests to move
> forward,

Yes, from my point of view it really is.

But that "is this truly a hard requirement?" comes up tells me I have
not done a good job of communicating in this review.  A review is
about participants in the project working together to improve a patch,
not people making demands at each other.

[...]
> If I'm to spend time engineering something more complicated just to
> check "did this subcommand run with these arguments?" then

I don't see why this is more complicated than what is in patch 1.  In
fact, I think it would be a little more simple.

Jonathan

Derrick Stolee Aug. 4, 2020, 4:32 p.m. UTC | #17

On 8/4/2020 10:42 AM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
> 
>> What is seems like you are asking instead is for me to create a tool
>> in the test suite that parses each JSON line, extracts a specific
>> member from that JSON object, reconstructs a command-line invocation
>> from the JSON array, and reports whether that process worked for any
>> line in the event output.
> 
> No, that isn't what I'm asking.
> 
> I'm asking for this patch series to take the existing "grep" lines
> and put them in a function in test-lib-functions.sh, so that we can
> change them in one place when the trace emitter changes.

Thanks for clarifying, clearing up my misinterpretation of your
request. I'm coming around to this idea.

> [...]
>> If this is to truly be a hard requirement for these tests to move
>> forward,
> 
> Yes, from my point of view it really is.
> 
> But that "is this truly a hard requirement?" comes up tells me I have
> not done a good job of communicating in this review.  A review is
> about participants in the project working together to improve a patch,
> not people making demands at each other.

It's also about a balance. A patch author and a reviewer can have
disagreements, and it is important to discover exactly how hard each
is holding to their opinion.

> [...]
>> If I'm to spend time engineering something more complicated just to
>> check "did this subcommand run with these arguments?" then
> 
> I don't see why this is more complicated than what is in patch 1.  In
> fact, I think it would be a little more simple.

My interpretation (parsing JSON) _was_ more complicated, hence my
reservations. The helper still requires non-trivial script-fu (for
someone like me who is bad at scripting languages) but centralizing
the grep has value.

Here is my attempt so far:

trace2_subcommand_data () {
	command="$1" &&
	commas=$(echo $command | sed s/\ /\",\"/g) &&
	printf "[\"$commas\"]"
}

test_subcommand_called () {
	data=$(trace2_subcommand_data $1) &&
	grep $data $2
}

test_subcommand_not_called () {
	! test_subcommand_called $1 $2
}

with callers looking like

	test_subcommand_called "gc" run-no-auto.txt &&
	test_subcommand_not_called "gc --auto" run-auto.txt &&
	test_subcommand_called "gc --quiet" run-quiet.txt

Thanks,
-Stolee

Jonathan Nieder Aug. 4, 2020, 5:02 p.m. UTC | #18

Derrick Stolee wrote:

> Here is my attempt so far:
[...]
> 	test_subcommand_called "gc" run-no-auto.txt &&
> 	test_subcommand_not_called "gc --auto" run-auto.txt &&
> 	test_subcommand_called "gc --quiet" run-quiet.txt

I like it.  Some tweaks:
- can simplify by using 'printf' instead of sed
- passing the trace in stdin
- following the convention that helpers like test_i18ngrep use of
  accepting ! as a parameter to negate the exit code

That would make (untested)

# Check that the given command was invoked as part of the
# trace2-format trace on stdin.
#
#	test_subcommand_called [!] <command> <args>... < <trace>
#
# For example, to look for an invocation of "git upload-pack
# /path/to/repo"
#
#	GIT_TRACE2_EVENT=event.log git fetch ... &&
#	test_subcommand_called git upload-pack "$PATH" <event.log
#
# If the first parameter passed is !, this instead checks that
# the given command was not called.
#
test_subcommand_called () {
	local negate=
	if test "$1" = "!"
	then
		negate=t
		shift
	fi

	local expr=$(printf '"%s",' "$@")
	expr="${expr%,}"

	if test -n "$negate"
	then
		! grep "\[$expr\]"
	else
		grep "\[$expr\]"
	fi
}

Derrick Stolee Aug. 4, 2020, 5:51 p.m. UTC | #19

On 8/4/2020 1:02 PM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
> 
>> Here is my attempt so far:
> [...]
>> 	test_subcommand_called "gc" run-no-auto.txt &&
>> 	test_subcommand_not_called "gc --auto" run-auto.txt &&
>> 	test_subcommand_called "gc --quiet" run-quiet.txt
> 
> I like it.  Some tweaks:
> - can simplify by using 'printf' instead of sed
> - passing the trace in stdin
> - following the convention that helpers like test_i18ngrep use of
>   accepting ! as a parameter to negate the exit code
> 
> That would make (untested)
> 
> # Check that the given command was invoked as part of the
> # trace2-format trace on stdin.
> #
> #	test_subcommand_called [!] <command> <args>... < <trace>
> #
> # For example, to look for an invocation of "git upload-pack
> # /path/to/repo"
> #
> #	GIT_TRACE2_EVENT=event.log git fetch ... &&
> #	test_subcommand_called git upload-pack "$PATH" <event.log
> #
> # If the first parameter passed is !, this instead checks that
> # the given command was not called.
> #
> test_subcommand_called () {
> 	local negate=
> 	if test "$1" = "!"
> 	then
> 		negate=t
> 		shift
> 	fi
> 
> 	local expr=$(printf '"%s",' "$@")
> 	expr="${expr%,}"
> 
> 	if test -n "$negate"
> 	then
> 		! grep "\[$expr\]"
> 	else
> 		grep "\[$expr\]"
> 	fi
> }

This works! Thanks. With the ! usage, I think a rename to
'test_subcommand' makes sense. Further, I was somehow wrong
about not including "git" in the beginning, so here is my
example use of this method;

	test_subcommand git gc <run-no-auto.txt &&
	test_subcommand ! git gc --auto <run-auto.txt &&
	test_subcommand git gc --quiet <run-quiet.txt

Thanks,
-Stolee

Derrick Stolee Aug. 5, 2020, 3:02 p.m. UTC | #20

On 8/3/2020 1:46 PM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
>> On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
>>> Derrick Stolee wrote:
>>>> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:

Revisiting the first part of this thread, as it got lost in the discussion
about trace2 and JSON parsing...

>>>>> [jrnieder] How do I supply the tasks on the command line?  Are they
>>>>> parameters to this subcommand?  If so, it could make sense for this to
>>>>> say something like
>>>>>
>>>>> 	run <task>...::
>>>>
>>>> Hopefully this is documented to your satisfaction when the ability
>>>> to customize the tasks is implemented.
> [...]
>> I mean that there is only one task right now. Until the commit-graph
>> task is implemented, there is no need to have a --task=<task> option.
> 
> Ah, that wasn't clear to me from the docs.
> 
> You're saying that "git maintenance run" runs the "gc" task?  Can the
> documentation say so?

In this patch, the documentation says

	Run tasks to optimize Git repository data

the plural is a bit incorrect at this point, but not a huge deal.

The run subcommand says "Run one or more maintenance tasks" which again
is a bit misleading since "or more" is not technically possible. This
becomes a bit obvious when we see that the task list only contains a 'gc'
task.

So no, the documentation in this patch doesn't say "Run the 'gc' task"
because that could be gathered from context _and_ it will be incorrect
by the time this patch is integrated as part of the series.

So I struggle with finding a way to improve the documentation in this
patch to satisfy your concerns. Any change made here will just be un-done
by a later patch in the same series.

I'd rather have something vague now and improve the documentation
incrementally along with the implementation.

Another point you made earlier, but seems to be dropped in this email:
you wanted the return code documented. This also seems like wasteful
text, because the best I can think of is

  This command will return 0 unless a maintenance task failed in an
  unrecoverable way, or other typical failure cases such as incorrect
  usage or the command being run outside of a Git repository.

Please recommend something better, because that description suffers from
being too generic/obvious when talking about the maintenance task and
being an obvious repetition of Git's existing error code behavior.

Perhaps this would be easier if there was an established pattern of
documenting the return code. So, I went digging.

The examples I could find are:

 1. 'git cat-file -e'
 2. 'git check-ref-format'
 3. 'git commit-graph --object-dir=<dir>'
 4. 'git cvsimport'
 5. 'git fast-import'
 6. 'git filter-branch'
 7. 'git grep --quiet'
 8. 'git gui citool'
 9. 'git ls-remote --exit-code'
10. 'git merge-base --is-ancestor'
11. 'git push --recurse-submodules'
12. 'git reflog exists'
13. 'git rev-parse --parseopt'
14. 'git update-index --test-untracked-cache' 

Most of these are for specific options, and the exit code is documented
with relation to how that specific option may lead to a non-zero exit
code (1, 3, 7, 8, 9, 10, 11, 12, 13, 14).

The one that is most clearly part of the main description is for
the plumbing command 'git check-ref-format':

	Checks if a given 'refname' is acceptable, and exits with
	a non-zero status if it is not.

This seems appropriate for something intended for use as a helper
in script, with a very simple behavior. The exit code is the only
way it provides any feedback to the caller! This use of the exit
code as the only output is used by some other cases (1, 7, 9, 10,
11, 12, 13, 14).

Of course, we also have examples like this in git-cvsimport.txt:

	Otherwise, success is indicated the Unix way, i.e. by simply
	exiting with a zero exit status.

This phrase should be _assumed_ as part of Git itself, instead of
written in the documentation for specific commands.

The documentation for 'git fast-import' is more useful here:

	If fast-import is supplied invalid input it will terminate
	with a non-zero exit status and create a crash report...

But, you can see that the exit code is not the focus, but the
crash report _is_ important to document.

'git filter-branch' has a special exit code for the case that no
commits were found to update.

'git gui citool' returns an error code if the window is closed
before the command can complete. This is also not a core feature,
so is not a good example.

My only conclusion from this investigation is that the exit code
is documented when it is useful, but otherwise the standard Unix
pattern is assumed and not worth documenting.

If this is something we want to change, then I don't think it
valuable to have that discussion here. Instead, let's start that
discussion elsewhere with a series that updates existing docs to
catch up to a new standard of documenting the exit code.

[1] https://git-scm.com/docs/git-cat-file#Documentation/git-cat-file.txt--e
[2] https://git-scm.com/docs/git-check-ref-format#_description
[3] https://git-scm.com/docs/git-commit-graph#Documentation/git-commit-graph.txt---object-dir
[4] https://git-scm.com/docs/git-cvsimport#_output
[5] https://git-scm.com/docs/git-fast-import#_crash_reports
[6] https://git-scm.com/docs/git-filter-branch#_exit_status
[7] https://git-scm.com/docs/git-grep#Documentation/git-grep.txt---quiet
[8] https://git-scm.com/docs/git-gui#Documentation/git-gui.txt-codegitguicitoolcode
[9] https://git-scm.com/docs/git-ls-remote#Documentation/git-ls-remote.txt---exit-code
[10] https://git-scm.com/docs/git-merge-base#Documentation/git-merge-base.txt---is-ancestor
[11] https://git-scm.com/docs/git-push#Documentation/git-push.txt---recurse-submodulescheckon-demandonlyno
[12] https://git-scm.com/docs/git-reflog
[13] https://git-scm.com/docs/git-rev-parse#_parseopt
[14] https://git-scm.com/docs/git-update-index#Documentation/git-update-index.txt---test-untracked-cache

> [...]
>>>>>> +static struct maintenance_opts {
>>>>>> +	int auto_flag;
>>>>>> +} opts;
>>>>>
>>>>> Packing this in a struct feels a bit unusual.  Is the struct going to
>>>>> be passed somewhere, or could these be individual locals in
>>>>> cmd_maintenance?
>>>>
>>>> This will grow, and I'd rather have one global struct than many
>>>> individual global items. It makes it clearer when I use
>>>> "opts.auto_flag" that this corresponds to whether "--auto" was
>>>> provided as a command-line option.
>>>
>>> That doesn't seem idiomatic within the Git codebase.
>>>
>>> Can they be locals instead?
>>
>> Which part do you want to be idiomatic? The fact that the parse-options
>> library typically refers to static global values or the fact that the
>> data is not organized in a struct?
> 
> parse-options has no requirement about the values being global.  Some
> older code does that, but newer code tends to use locals when
> appropriate.

You're right. Global is not required. The struct needs to be 'static'
or initialized for compilation.

While trying to get rid of globals based on your recommendation, I
see that the callback functions in the option-parsing _do_ require
global state. This is used in "maintenance: add --task option" with
the global tasks array, but it also uses a global count of "how many
tasks have been selected" currently in the struct. Two ways to get
around that issue include (1) making a global int, or (2) looping
through the task list to count how many have been selected so far.

Neither of these solutions is clean. (In the diff below, I chose
option 2.)

> Putting it in a struct is fine as long as that struct is being passed
> around.  What isn't idiomatic in Git is using a global struct for
> namespacing.

In my opinion, using a global struct would be better than the existing
pattern of a bunch of global variables, but I get your point about
precedence. More on this below.

> [...]
>> The natural thing to do to "fix" that situation is to create a struct
>> that holds the information from the parsed command-line arguments. But
>> then why is it a local parameter that is passed through all of the
>> local methods instead of a global (as presented here in this patch)?
> 
> I'm having trouble parsing that last sentence.  You're saying a global
> is preferable over passing around a pointer to a local "opts" struct?

Using a "singleton" like the opts struct here, the code is definitely
much simpler to read. There is only ever one instance, because the struct
definition comes along with the instance declaration.

By making it a local instance, we then need to pass around the struct as
a parameter to all of the static methods in the builtin, even though there
is only ever one instance of this struct per process.

The two points above are the ones that led me to use the global instance
in the first place.

To get back to your point about being idiomatic, these builtins use a
struct to group options, then pass it around as a parameter:

* cat-file (struct batch_options)
* checkout (struct checkout_opts)
* rebase (struct rebase_options)
* tag (struct create_tag_options)
* worktree (struct add_opts)

Unfortunately, there _is_ an example of the behavior I was following in
this series: my own example in the sparse-checkout builtin. So, I'm just
repeating my own mistakes that were not challenged before.

I updated my v4-in-progress branch with that change, and the tip-to-tip
diff is below. It required some non-trivial care around the --task=<task>
callback (which can't see the local options struct) and other cases.

Thanks,
-Stolee

--- 

diff --git a/builtin/gc.c b/builtin/gc.c
index 1f0c519d14..3e473a1293 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -709,11 +709,10 @@ static const char * const builtin_maintenance_usage[] = {
 	NULL
 };
 
-static struct maintenance_opts {
+struct maintenance_opts {
 	int auto_flag;
 	int quiet;
-	int tasks_selected;
-} opts;
+};
 
 /* Remember to update object flag allocation in object.h */
 #define PARENT1		(1u<<16)
@@ -789,7 +788,7 @@ static int should_write_commit_graph(void)
 	return result;
 }
 
-static int run_write_commit_graph(void)
+static int run_write_commit_graph(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
@@ -797,13 +796,13 @@ static int run_write_commit_graph(void)
 	strvec_pushl(&child.args, "commit-graph", "write",
 		     "--split", "--reachable", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	return !!run_command(&child);
 }
 
-static int run_verify_commit_graph(void)
+static int run_verify_commit_graph(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
@@ -811,24 +810,24 @@ static int run_verify_commit_graph(void)
 	strvec_pushl(&child.args, "commit-graph", "verify",
 		     "--shallow", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	return !!run_command(&child);
 }
 
-static int maintenance_task_commit_graph(void)
+static int maintenance_task_commit_graph(struct maintenance_opts *opts)
 {
 	struct repository *r = the_repository;
 	char *chain_path;
 
 	close_object_store(r->objects);
-	if (run_write_commit_graph()) {
+	if (run_write_commit_graph(opts)) {
 		error(_("failed to write commit-graph"));
 		return 1;
 	}
 
-	if (!run_verify_commit_graph())
+	if (!run_verify_commit_graph(opts))
 		return 0;
 
 	warning(_("commit-graph verify caught error, rewriting"));
@@ -840,14 +839,14 @@ static int maintenance_task_commit_graph(void)
 	}
 	free(chain_path);
 
-	if (!run_write_commit_graph())
+	if (!run_write_commit_graph(opts))
 		return 0;
 
 	error(_("failed to rewrite commit-graph"));
 	return 1;
 }
 
-static int fetch_remote(const char *remote)
+static int fetch_remote(const char *remote, struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
@@ -855,7 +854,7 @@ static int fetch_remote(const char *remote)
 	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
 		     "--no-write-fetch-head", "--refmap=", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--quiet");
 
 	strvec_pushf(&child.args, "+refs/heads/*:refs/prefetch/%s/*", remote);
@@ -871,7 +870,7 @@ static int fill_each_remote(struct remote *remote, void *cbdata)
 	return 0;
 }
 
-static int maintenance_task_prefetch(void)
+static int maintenance_task_prefetch(struct maintenance_opts *opts)
 {
 	int result = 0;
 	struct string_list_item *item;
@@ -886,23 +885,23 @@ static int maintenance_task_prefetch(void)
 	for (item = remotes.items;
 	     item && item < remotes.items + remotes.nr;
 	     item++)
-		result |= fetch_remote(item->string);
+		result |= fetch_remote(item->string, opts);
 
 cleanup:
 	string_list_clear(&remotes, 0);
 	return result;
 }
 
-static int maintenance_task_gc(void)
+static int maintenance_task_gc(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_push(&child.args, "gc");
 
-	if (opts.auto_flag)
+	if (opts->auto_flag)
 		strvec_push(&child.args, "--auto");
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--quiet");
 	else
 		strvec_push(&child.args, "--no-quiet");
@@ -911,14 +910,14 @@ static int maintenance_task_gc(void)
 	return run_command(&child);
 }
 
-static int prune_packed(void)
+static int prune_packed(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_push(&child.args, "prune-packed");
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--quiet");
 
 	return !!run_command(&child);
@@ -977,7 +976,7 @@ static int write_loose_object_to_stdin(const struct object_id *oid,
 	return ++(d->count) > d->batch_size;
 }
 
-static int pack_loose(void)
+static int pack_loose(struct maintenance_opts *opts)
 {
 	struct repository *r = the_repository;
 	int result = 0;
@@ -996,7 +995,7 @@ static int pack_loose(void)
 	pack_proc.git_cmd = 1;
 
 	strvec_push(&pack_proc.args, "pack-objects");
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&pack_proc.args, "--quiet");
 	strvec_pushf(&pack_proc.args, "%s/pack/loose", r->objects->odb->path);
 
@@ -1027,9 +1026,9 @@ static int pack_loose(void)
 	return result;
 }
 
-static int maintenance_task_loose_objects(void)
+static int maintenance_task_loose_objects(struct maintenance_opts *opts)
 {
-	return prune_packed() || pack_loose();
+	return prune_packed(opts) || pack_loose(opts);
 }
 
 static int incremental_repack_auto_condition(void)
@@ -1061,14 +1060,14 @@ static int incremental_repack_auto_condition(void)
 	return count >= incremental_repack_auto_limit;
 }
 
-static int multi_pack_index_write(void)
+static int multi_pack_index_write(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "write", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	if (run_command(&child))
@@ -1077,7 +1076,7 @@ static int multi_pack_index_write(void)
 	return 0;
 }
 
-static int rewrite_multi_pack_index(void)
+static int rewrite_multi_pack_index(struct maintenance_opts *opts)
 {
 	struct repository *r = the_repository;
 	char *midx_name = get_midx_filename(r->objects->odb->path);
@@ -1085,17 +1084,18 @@ static int rewrite_multi_pack_index(void)
 	unlink(midx_name);
 	free(midx_name);
 
-	return multi_pack_index_write();
+	return multi_pack_index_write(opts);
 }
 
-static int multi_pack_index_verify(const char *message)
+static int multi_pack_index_verify(struct maintenance_opts *opts,
+				   const char *message)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "verify", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	if (run_command(&child)) {
@@ -1106,14 +1106,14 @@ static int multi_pack_index_verify(const char *message)
 	return 0;
 }
 
-static int multi_pack_index_expire(void)
+static int multi_pack_index_expire(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "expire", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	close_object_store(the_repository->objects);
@@ -1164,14 +1164,14 @@ static off_t get_auto_pack_size(void)
 	return result_size;
 }
 
-static int multi_pack_index_repack(void)
+static int multi_pack_index_repack(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "repack", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	strvec_pushf(&child.args, "--batch-size=%"PRIuMAX,
@@ -1185,7 +1185,7 @@ static int multi_pack_index_repack(void)
 	return 0;
 }
 
-static int maintenance_task_incremental_repack(void)
+static int maintenance_task_incremental_repack(struct maintenance_opts *opts)
 {
 	prepare_repo_settings(the_repository);
 	if (!the_repository->settings.core_multi_pack_index) {
@@ -1193,22 +1193,22 @@ static int maintenance_task_incremental_repack(void)
 		return 0;
 	}
 
-	if (multi_pack_index_write())
+	if (multi_pack_index_write(opts))
 		return 1;
-	if (multi_pack_index_verify("after initial write"))
-		return rewrite_multi_pack_index();
-	if (multi_pack_index_expire())
+	if (multi_pack_index_verify(opts, "after initial write"))
+		return rewrite_multi_pack_index(opts);
+	if (multi_pack_index_expire(opts))
 		return 1;
-	if (multi_pack_index_verify("after expire step"))
-		return !!rewrite_multi_pack_index();
-	if (multi_pack_index_repack())
+	if (multi_pack_index_verify(opts, "after expire step"))
+		return !!rewrite_multi_pack_index(opts);
+	if (multi_pack_index_repack(opts))
 		return 1;
-	if (multi_pack_index_verify("after repack step"))
-		return !!rewrite_multi_pack_index();
+	if (multi_pack_index_verify(opts, "after repack step"))
+		return !!rewrite_multi_pack_index(opts);
 	return 0;
 }
 
-typedef int maintenance_task_fn(void);
+typedef int maintenance_task_fn(struct maintenance_opts *opts);
 
 /*
  * An auto condition function returns 1 if the task should run
@@ -1275,9 +1275,9 @@ static int compare_tasks_by_selection(const void *a_, const void *b_)
 	return b->selected_order - a->selected_order;
 }
 
-static int maintenance_run(void)
+static int maintenance_run(struct maintenance_opts *opts)
 {
-	int i;
+	int i, found_selected = 0;
 	int result = 0;
 	struct lock_file lk;
 	struct repository *r = the_repository;
@@ -1291,7 +1291,7 @@ static int maintenance_run(void)
 		 * recursive process stack. Do not report an error in
 		 * that case.
 		 */
-		if (!opts.auto_flag && !opts.quiet)
+		if (!opts->auto_flag && !opts->quiet)
 			error(_("lock file '%s' exists, skipping maintenance"),
 			      lock_path);
 		free(lock_path);
@@ -1299,23 +1299,26 @@ static int maintenance_run(void)
 	}
 	free(lock_path);
 
-	if (opts.tasks_selected)
+	for (i = 0; !found_selected && i < TASK__COUNT; i++)
+		found_selected = tasks[i].selected;
+
+	if (found_selected)
 		QSORT(tasks, TASK__COUNT, compare_tasks_by_selection);
 
 	for (i = 0; i < TASK__COUNT; i++) {
-		if (opts.tasks_selected && !tasks[i].selected)
+		if (found_selected && !tasks[i].selected)
 			continue;
 
-		if (!opts.tasks_selected && !tasks[i].enabled)
+		if (!found_selected && !tasks[i].enabled)
 			continue;
 
-		if (opts.auto_flag &&
+		if (opts->auto_flag &&
 		    (!tasks[i].auto_condition ||
 		     !tasks[i].auto_condition()))
 			continue;
 
 		trace2_region_enter("maintenance", tasks[i].name, r);
-		if (tasks[i].fn()) {
+		if (tasks[i].fn(opts)) {
 			error(_("task '%s' failed"), tasks[i].name);
 			result = 1;
 		}
@@ -1349,17 +1352,15 @@ static void initialize_task_config(void)
 static int task_option_parse(const struct option *opt,
 			     const char *arg, int unset)
 {
-	int i;
+	int i, num_selected = 0;
 	struct maintenance_task *task = NULL;
 
 	BUG_ON_OPT_NEG(unset);
 
-	opts.tasks_selected++;
-
 	for (i = 0; i < TASK__COUNT; i++) {
+		num_selected += tasks[i].selected;
 		if (!strcasecmp(tasks[i].name, arg)) {
 			task = &tasks[i];
-			break;
 		}
 	}
 
@@ -1374,13 +1375,14 @@ static int task_option_parse(const struct option *opt,
 	}
 
 	task->selected = 1;
-	task->selected_order = opts.tasks_selected;
+	task->selected_order = num_selected + 1;
 
 	return 0;
 }
 
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
 {
+	static struct maintenance_opts opts;
 	static struct option builtin_maintenance_options[] = {
 		OPT_BOOL(0, "auto", &opts.auto_flag,
 			 N_("run tasks based on the state of the repository")),
@@ -1408,7 +1410,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 
 	if (argc == 1) {
 		if (!strcmp(argv[0], "run"))
-			return maintenance_run();
+			return maintenance_run(&opts);
 	}
 
 	usage_with_options(builtin_maintenance_usage,

[v2,01/18] maintenance: create basic maintenance runner

Commit Message

Comments

Patch