diff mbox series

[v2,3/7] maintenance: add --scheduled option and config

Message ID c728c57d85b17035d42313260620a7de5756b0c3.1598380805.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Maintenance III: background maintenance | expand

Commit Message

Derrick Stolee via GitGitGadget Aug. 25, 2020, 6:40 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

A user may want to run certain maintenance tasks based on frequency, not
conditions given in the repository. For example, the user may want to
perform a 'prefetch' task every hour, or 'gc' task every day. To assist,
update the 'git maintenance run --scheduled' command to check the config
for the last run of that task and add a number of seconds. The task
would then run only if the current time is beyond that minimum
timestamp.

Add a '--scheduled' option to 'git maintenance run' to only run tasks
that have had enough time pass since their last run. This is done for
each enabled task by checking if the current timestamp is at least as
large as the sum of 'maintenance.<task>.lastRun' and
'maintenance.<task>.schedule' in the Git config. This second value is
new to this commit, storing a number of seconds intended between runs.

A user could then set up an hourly maintenance run with the following
cron table:

  0 * * * * git -C <repo> maintenance run --scheduled

Then, the user could configure the repository with the following config
values:

  maintenance.prefetch.schedule  3000
  maintenance.gc.schedule       86000

These numbers are slightly lower than one hour and one day (in seconds).
The cron schedule will enforce the hourly run rate, but we can use these
schedules to ensure the 'gc' task runs once a day. The error is given
because the *.lastRun config option is specified at the _start_ of the
task run. Otherwise, a slow task run could shift the "daily" job of 'gc'
from a 10:00pm run to 11:00pm run, or later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/maintenance.txt |  9 +++++
 Documentation/git-maintenance.txt    | 13 ++++++-
 builtin/gc.c                         | 54 ++++++++++++++++++++++++++--
 t/t7900-maintenance.sh               | 20 +++++++++++
 4 files changed, 92 insertions(+), 4 deletions(-)

Comments

Junio C Hamano Aug. 25, 2020, 10:01 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> A user may want to run certain maintenance tasks based on frequency, not
> conditions given in the repository. For example, the user may want to
> perform a 'prefetch' task every hour, or 'gc' task every day. To assist,
> update the 'git maintenance run --scheduled' command to check the config
> for the last run of that task and add a number of seconds. The task
> would then run only if the current time is beyond that minimum
> timestamp.
>
> Add a '--scheduled' option to 'git maintenance run' to only run tasks
> that have had enough time pass since their last run. This is done for
> each enabled task by checking if the current timestamp is at least as
> large as the sum of 'maintenance.<task>.lastRun' and
> 'maintenance.<task>.schedule' in the Git config. This second value is
> new to this commit, storing a number of seconds intended between runs.
>
> A user could then set up an hourly maintenance run with the following
> cron table:
>
>   0 * * * * git -C <repo> maintenance run --scheduled

The scheme has one obvious drawback.  An hourly crontab entry means
your maintenance.*.schedule that is finer grained than an hour
increment will not run as expected.  You'd need to take all the
schedule intervals and take their GCD to come up with the frequency
of the single crontab entry.  

Wouldn't it make more sense to have N crontab entries for N tasks
you want to run periodically, each with their own frequency
controlled by crontab?  That way, you do not need to maintain
maintenance.*.schedule configuration variables and the --scheduled
option.  It might make maintenance.*.lastrun timestamps unneeded,
which would be an added plus to simplify the system quite
drastically.  Most importantly, that would be the way crontab users
are most used to in order to schedule their periodical jobs, so it
is one less thing to learn.
Derrick Stolee Aug. 26, 2020, 3:30 p.m. UTC | #2
On 8/25/2020 6:01 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> A user may want to run certain maintenance tasks based on frequency, not
>> conditions given in the repository. For example, the user may want to
>> perform a 'prefetch' task every hour, or 'gc' task every day. To assist,
>> update the 'git maintenance run --scheduled' command to check the config
>> for the last run of that task and add a number of seconds. The task
>> would then run only if the current time is beyond that minimum
>> timestamp.
>>
>> Add a '--scheduled' option to 'git maintenance run' to only run tasks
>> that have had enough time pass since their last run. This is done for
>> each enabled task by checking if the current timestamp is at least as
>> large as the sum of 'maintenance.<task>.lastRun' and
>> 'maintenance.<task>.schedule' in the Git config. This second value is
>> new to this commit, storing a number of seconds intended between runs.
>>
>> A user could then set up an hourly maintenance run with the following
>> cron table:
>>
>>   0 * * * * git -C <repo> maintenance run --scheduled
> 
> The scheme has one obvious drawback.  An hourly crontab entry means
> your maintenance.*.schedule that is finer grained than an hour
> increment will not run as expected.  You'd need to take all the
> schedule intervals and take their GCD to come up with the frequency
> of the single crontab entry.  

My intention for the *.schedule is that it is not an _exact_ frequency,
but instead a lower bound on the frequency. That can be shelved for now
as we discuss this setup:

> Wouldn't it make more sense to have N crontab entries for N tasks
> you want to run periodically, each with their own frequency
> controlled by crontab?  That way, you do not need to maintain
> maintenance.*.schedule configuration variables and the --scheduled
> option.  It might make maintenance.*.lastrun timestamps unneeded,
> which would be an added plus to simplify the system quite
> drastically.  Most importantly, that would be the way crontab users
> are most used to in order to schedule their periodical jobs, so it
> is one less thing to learn.

I had briefly considered setting up crontab entries for each task
(and possibly each repo) but ended up with these complications:

 1. Maintenance frequency differs by task, so we need to split the
    crontab by task. But we can't just split everything because we
    do not want multiple tasks running at the same time on one
    repository. We would need to group the tasks and have one entry
    saying "git maintenance run --task=<task1> --task=<task2> ..."
    for all tasks in the group.

 2. Different repositories might want different tasks at different
    frequencies, so we might need to split the crontab by repository.
    Again, we likely want to group repositories by these frequencies
    because a user could have 100 registered repositories and we don't
    really want to launch 100 parallel processes.

 3. If we want to stop maintenance, then restart it, we need to
    clear the crontab and repopulate it, which would require iterating
    through all "registered" repositories to read their config for
    frequencies.

 4. On macOS, editing the crontab doesn't require "sudo" but it _does_
    create a pop-up(!) to get permission from the user. It would be
    good to minimize how often we edit the crontab and instead use
    config edits to change frequencies.

With these things in mind, here is a suggested alternative design:

Let users specify a schedule frequency among this list: hourly, daily,
weekly, monthly. We then set the following* crontab:

	0 * * * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=hourly
	0 0 * * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=daily
	0 0 * * 0 git for-each-repo --config=maintenance.repos maintenance run --scheduled=weekly
	0 0 0 * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=monthly

*Of course, there is some care around "$path/git --exec-path=$path"
that I drop for ease here.

Then, "git maintenance (start|stop)" can be just as simple as we have
now: write a fixed schedule every time.

The problem here is that cron will launch these processes in parallel,
and then our object-database lock will cause some to fail! If anyone
knows a simple way to tell cron "run hourly _except_ not at midnight"
then we could let the "daily" schedule also run the "hourly" jobs, for
instance. Hopefully that pattern could be extended to the weekly and
monthly collisions.

Alternatively, we could run every hour and then interpret from config
if the current "hour" matches one of the schedules ourselves. So, the
crontab would be this simple:

	0 * * * * git for-each-repo --config=maintenance.repos maintenance run --scheduled

and then we would internally decide "is this the midnight hour?" and
"is this the first day of the week?" and "is this the first day of the
month?" to detect if we should run the daily/weekly/monthly tasks. While
it adds more time-awareness into Git, it does avoid the parallel task
collisions. There are some concerns here related to long-running tasks
delaying sequential runs of "git -C <repo> maintenance run --scheduled"
causing the "is this the midnight hour?" queries to fail and having
nightly/weekly/monthly maintenance be skipped accidentally. This
motivates the *.lastRun config giving us some guarantee of _eventually_
running the tasks, just _not too frequently_.

I hope this launches a good discussion to help us find a good cron
schedule strategy. After we land on a suitable strategy, I'll summarize
all of these subtleties in the commit message for posterity.

Hopefully, the current way that I integrate with crontab and test that
integration (in PATCH 6/7) could also be reviewed in parallel with this
discussion. I'm very curious to see how that could be improved.

Thanks,
-Stolee
Derrick Stolee Aug. 27, 2020, 3:47 p.m. UTC | #3
On 8/26/2020 11:30 AM, Derrick Stolee wrote:
> Let users specify a schedule frequency among this list: hourly, daily,
> weekly, monthly. We then set the following* crontab:
> 
> 	0 * * * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=hourly
> 	0 0 * * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=daily
> 	0 0 * * 0 git for-each-repo --config=maintenance.repos maintenance run --scheduled=weekly
> 	0 0 0 * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=monthly
> 
> *Of course, there is some care around "$path/git --exec-path=$path"
> that I drop for ease here.

Jeff Hostetler pointed out the following details in the crontab
documentation [1]:

 Ranges of numbers are allowed.  Ranges are two numbers separated with
 a hyphen.  The specified range is inclusive.  For example, 8-11 for
 an 'hours' entry specifies execution at hours 8, 9, 10, and 11. The
 first number must be less than or equal to the second one.

[1] https://man7.org/linux/man-pages/man5/crontab.5.html

This means we could try this schedule:

 0 1-23 * * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=hourly
 0 0 * * 1-6 git for-each-repo --config=maintenance.repos maintenance run --scheduled=daily
 0 0 1-30 * 0 git for-each-repo --config=maintenance.repos maintenance run --scheduled=weekly
 0 0 0 * * git for-each-repo --config=maintenance.repos maintenance run --scheduled=monthly

And it should behave this way:

 Run --scheduled=hourly every hour, except at midnight. This runs
 all "hourly" tasks.

 Run --scheduled=daily at midnight, except on Sunday. This runs all
 "hourly" and "daily" tasks.

 Run --scheduled=weekly at midnight Sunday, except on the first day
 of the month. This runs all "hourly", "daily", and "weekly" tasks.

 Run --scheduled=monthly at midnight on the first day of the month.
 This runs all scheduled tasks.

There is some subtlety between whether the "weekly" runs should be a
subset of "monthly" and maybe the easiest way to handle that would
be to not support "monthly" and have only "hourly", "daily", and "weekly"
options for now.

This should get around all of the parallel issues and allow us to drop
the *.lastRun config option.

Thoughts?

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/Documentation/config/maintenance.txt b/Documentation/config/maintenance.txt
index 8dd34169da..caacacd322 100644
--- a/Documentation/config/maintenance.txt
+++ b/Documentation/config/maintenance.txt
@@ -15,6 +15,15 @@  maintenance.<task>.lastRun::
 	`<task>` is run. It stores a timestamp representing the most-recent
 	run of the `<task>`.
 
+maintenance.<task>.schedule::
+	This config option controls whether or not the given `<task>` runs
+	during a `git maintenance run --scheduled` command. If the option
+	is an integer value `S`, then the `<task>` is run when the current
+	time is `S` seconds after the timestamp stored in
+	`maintenance.<task>.lastRun`. If the option has no value or a
+	non-integer value, then the task will never run with the `--scheduled`
+	option.
+
 maintenance.commit-graph.auto::
 	This integer config option controls how often the `commit-graph` task
 	should be run as part of `git maintenance run --auto`. If zero, then
diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index b44efb05a3..2bc02c65e4 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -107,7 +107,18 @@  OPTIONS
 	only if certain thresholds are met. For example, the `gc` task
 	runs when the number of loose objects exceeds the number stored
 	in the `gc.auto` config setting, or when the number of pack-files
-	exceeds the `gc.autoPackLimit` config setting.
+	exceeds the `gc.autoPackLimit` config setting. Not compatible with
+	the `--scheduled` option.
+
+--scheduled::
+	When combined with the `run` subcommand, run maintenance tasks
+	only if certain time conditions are met, as specified by the
+	`maintenance.<task>.schedule` config value for each `<task>`.
+	This config value specifies a number of seconds since the last
+	time that task ran, according to the `maintenance.<task>.lastRun`
+	config value. The tasks that are tested are those provided by
+	the `--task=<task>` option(s) or those with
+	`maintenance.<task>.enabled` set to true.
 
 --quiet::
 	Do not report progress or other information over `stderr`.
diff --git a/builtin/gc.c b/builtin/gc.c
index fb6f231a5c..5726a9a3b3 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -705,12 +705,13 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 }
 
 static const char * const builtin_maintenance_run_usage[] = {
-	N_("git maintenance run [--auto] [--[no-]quiet] [--task=<task>]"),
+	N_("git maintenance run [--auto] [--[no-]quiet] [--task=<task>] [--scheduled]"),
 	NULL
 };
 
 struct maintenance_run_opts {
 	int auto_flag;
+	int scheduled;
 	int quiet;
 };
 
@@ -1157,7 +1158,8 @@  struct maintenance_task {
 	const char *name;
 	maintenance_task_fn *fn;
 	maintenance_auto_fn *auto_condition;
-	unsigned enabled:1;
+	unsigned enabled:1,
+		 scheduled:1;
 
 	/* -1 if not selected. */
 	int selected_order;
@@ -1268,6 +1270,9 @@  static int maintenance_run_tasks(struct maintenance_run_opts *opts)
 		     !tasks[i].auto_condition()))
 			continue;
 
+		if (opts->scheduled && !tasks[i].scheduled)
+			continue;
+
 		update_last_run(&tasks[i]);
 
 		trace2_region_enter("maintenance", tasks[i].name, r);
@@ -1282,6 +1287,29 @@  static int maintenance_run_tasks(struct maintenance_run_opts *opts)
 	return result;
 }
 
+static void fill_schedule_info(struct maintenance_task *task,
+			       const char *config_name,
+			       timestamp_t schedule_delay)
+{
+	timestamp_t now = approxidate("now");
+	char *value = NULL;
+	struct strbuf last_run = STRBUF_INIT;
+	int64_t previous_run;
+
+	strbuf_addf(&last_run, "maintenance.%s.lastrun", task->name);
+
+	if (git_config_get_string(last_run.buf, &value))
+		task->scheduled = 1;
+	else {
+		previous_run = git_config_int64(last_run.buf, value);
+		if (now >= previous_run + schedule_delay)
+			task->scheduled = 1;
+	}
+
+	free(value);
+	strbuf_release(&last_run);
+}
+
 static void initialize_task_config(void)
 {
 	int i;
@@ -1290,13 +1318,28 @@  static void initialize_task_config(void)
 
 	for (i = 0; i < TASK__COUNT; i++) {
 		int config_value;
+		char *config_str;
 
-		strbuf_setlen(&config_name, 0);
+		strbuf_reset(&config_name);
 		strbuf_addf(&config_name, "maintenance.%s.enabled",
 			    tasks[i].name);
 
 		if (!git_config_get_bool(config_name.buf, &config_value))
 			tasks[i].enabled = config_value;
+
+		strbuf_reset(&config_name);
+		strbuf_addf(&config_name, "maintenance.%s.schedule",
+			    tasks[i].name);
+
+		if (!git_config_get_string(config_name.buf, &config_str)) {
+			timestamp_t schedule_delay = git_config_int64(
+							config_name.buf,
+							config_str);
+			fill_schedule_info(&tasks[i],
+						config_name.buf,
+						schedule_delay);
+			free(config_str);
+		}
 	}
 
 	strbuf_release(&config_name);
@@ -1340,6 +1383,8 @@  static int maintenance_run(int argc, const char **argv, const char *prefix)
 	struct option builtin_maintenance_run_options[] = {
 		OPT_BOOL(0, "auto", &opts.auto_flag,
 			 N_("run tasks based on the state of the repository")),
+		OPT_BOOL(0, "scheduled", &opts.scheduled,
+			 N_("run tasks based on time intervals")),
 		OPT_BOOL(0, "quiet", &opts.quiet,
 			 N_("do not report progress or other information over stderr")),
 		OPT_CALLBACK_F(0, "task", NULL, N_("task"),
@@ -1360,6 +1405,9 @@  static int maintenance_run(int argc, const char **argv, const char *prefix)
 			     builtin_maintenance_run_usage,
 			     PARSE_OPT_STOP_AT_NON_OPTION);
 
+	if (opts.auto_flag + opts.scheduled > 1)
+		die(_("use at most one of the --auto and --scheduled options"));
+
 	if (argc != 0)
 		usage_with_options(builtin_maintenance_run_usage,
 				   builtin_maintenance_run_options);
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index a985ce3674..3e0c5f1ca8 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -264,6 +264,11 @@  test_expect_success 'maintenance.incremental-repack.auto' '
 	done
 '
 
+test_expect_success '--auto and --scheduled incompatible' '
+	test_must_fail git maintenance run --auto --scheduled 2>err &&
+	test_i18ngrep "at most one" err
+'
+
 test_expect_success 'tasks update maintenance.<task>.lastRun' '
 	git config --unset maintenance.commit-graph.lastrun &&
 	GIT_TRACE2_EVENT="$(pwd)/run.txt" \
@@ -274,4 +279,19 @@  test_expect_success 'tasks update maintenance.<task>.lastRun' '
 	test_cmp_config 1595000000 maintenance.commit-graph.lastrun
 '
 
+test_expect_success '--scheduled with specific time' '
+	git config maintenance.commit-graph.schedule 100 &&
+	GIT_TRACE2_EVENT="$(pwd)/too-soon.txt" \
+		GIT_TEST_DATE_NOW=1595000099 \
+		git maintenance run --scheduled 2>/dev/null &&
+	test_subcommand ! git commit-graph write --split --reachable \
+		--no-progress <too-soon.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/long-enough.txt" \
+		GIT_TEST_DATE_NOW=1595000100 \
+		git maintenance run --scheduled 2>/dev/null &&
+	test_subcommand git commit-graph write --split --reachable \
+		--no-progress <long-enough.txt &&
+	test_cmp_config 1595000100 maintenance.commit-graph.lastrun
+'
+
 test_done