diff mbox series

[v8,1/6] run-command: add duplicate_output_fn to run_processes_parallel_opts

Message ID 20230209000212.1892457-2-calvinwan@google.com (mailing list archive)
State Superseded
Headers show
Series submodule: parallelize diff | expand

Commit Message

Calvin Wan Feb. 9, 2023, 12:02 a.m. UTC
Add duplicate_output_fn as an optionally set function in
run_process_parallel_opts. If set, output from each child process is
copied and passed to the callback function whenever output from the
child process is buffered to allow for separate parsing.

Fix two items in pp_buffer_stderr:
 * strbuf_read_once returns a ssize_t but the variable it is set to is
   an int so fix that.
 * Add missing brackets to "else if" statement

The ungroup/duplicate_output incompatibility check is nested to
prepare for future imcompatibles modes with ungroup.

Signed-off-by: Calvin Wan <calvinwan@google.com>
---
 run-command.c               | 16 ++++++++++++---
 run-command.h               | 25 ++++++++++++++++++++++++
 t/helper/test-run-command.c | 20 +++++++++++++++++++
 t/t0061-run-command.sh      | 39 +++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 3 deletions(-)

Comments

Glen Choo Feb. 13, 2023, 6:34 a.m. UTC | #1
Calvin Wan <calvinwan@google.com> writes:

> @@ -1645,14 +1650,19 @@ static void pp_buffer_stderr(struct parallel_processes *pp,
>  	for (size_t i = 0; i < opts->processes; i++) {
>  		if (pp->children[i].state == GIT_CP_WORKING &&
>  		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
> -			int n = strbuf_read_once(&pp->children[i].err,
> -						 pp->children[i].process.err, 0);
> +			ssize_t n = strbuf_read_once(&pp->children[i].err,
> +						     pp->children[i].process.err, 0);
>  			if (n == 0) {
>  				close(pp->children[i].process.err);
>  				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
> -			} else if (n < 0)
> +			} else if (n < 0) {
>  				if (errno != EAGAIN)
>  					die_errno("read");
> +			} else if (opts->duplicate_output) {
> +				opts->duplicate_output(&pp->children[i].err,
> +					pp->children[i].err.len - n,
> +					opts->data, pp->children[i].data);
> +			}
>  		}
>  	}
>  }

What do we think of the name "duplicate_output"? IMO it made sense in
earlier versions when we were copying the output to a separate buffer (I
believe it was renamed in response to [1]), but now that we're just
calling a callback on the main buffer, it seems misleading. Maybe
"output_buffered" would be better?

Sidenote: One convention from JS that I like is to name such event
listeners as "on_<event_name>", e.g. "on_output_buffered". This makes
naming a lot easier sometimes because you don't have to worry about
having your event listener being mistaken for something else. It
wouldn't be idiomatic for Git today, but I wonder what others think
about adopting this.

[1] https://lore.kernel.org/git/xmqq4jvxpw46.fsf@gitster.g/

> +/**
> + * This callback is called whenever output from a child process is buffered
> + * 
> + * See run_processes_parallel() below for a discussion of the "struct
> + * strbuf *out" parameter.
> + * 
> + * The offset refers to the number of bytes originally in "out" before
> + * the output from the child process was buffered. Therefore, the buffer
> + * range, "out + buf" to the end of "out", would contain the buffer of
> + * the child process output.

Looks like there's extra whitespace on the 'blank' lines.
Junio C Hamano Feb. 13, 2023, 5:52 p.m. UTC | #2
Glen Choo <chooglen@google.com> writes:

> What do we think of the name "duplicate_output"? IMO it made sense in
> earlier versions when we were copying the output to a separate buffer (I
> believe it was renamed in response to [1]), but now that we're just
> calling a callback on the main buffer, it seems misleading. Maybe
> "output_buffered" would be better?

Yeah, we do not even know what the callback does to the data we are
giving it.  The only thing we know is that we have output from the
child, and in addition to the usual buffering we do ourselves, we
are allowing the callback to peek into the buffered data in advance.

If the callback does consume it *and* remove the buffered data it
consumed right away, then as you say, "duplicate" becomes a word
that totally misses the point.  There is no duplication, as the
callback consumed and we no longer has our own copy, either.

If the callback consumes it but leaves the buffered data as-is, and
we would show that once the child finishes anyway, you can say that
we are feeding a duplicate of buffered data to the callback.  The
mechanism could be used merely to count how much output we have
accumulated so far to update the progress-bar, for example, and the
output may be given after the process is done.  But note that we are
not doing an "output" of "buffered" data in such a case.

To me, both "duplicate_output" and "output_buffered" sound like they
are names that are quite specific to the expected use case the
person who proposed the names had in mind, yet it is a bit hard to
guess exactly what the expected use cases they had in mind were,
because the names are not quite specific enough.

> Sidenote: One convention from JS that I like is to name such event
> listeners as "on_<event_name>", e.g. "on_output_buffered".

Thanks for bringing this up.  I agree that "Upon X happening, do
this" is a very good convention to follow.  I think the callback is
made whenever the child emits to the standard error stream, so
"on_error_output" (if we are worried that "error" has a too strong
"something bad happend" connotation, then perhaps "on_stderr_output"
may dampen it) perhaps?
Calvin Wan Feb. 13, 2023, 6:26 p.m. UTC | #3
> > Sidenote: One convention from JS that I like is to name such event
> > listeners as "on_<event_name>", e.g. "on_output_buffered".
>
> Thanks for bringing this up.  I agree that "Upon X happening, do
> this" is a very good convention to follow.  I think the callback is
> made whenever the child emits to the standard error stream, so
> "on_error_output" (if we are worried that "error" has a too strong
> "something bad happend" connotation, then perhaps "on_stderr_output"
> may dampen it) perhaps?

"on_stderr_output" sounds much better than "duplicate_output". I
did spend much time trying to come up with a better name, but
couldn't find anything that conveyed what the expected use case
of this function was. Thanks, I'll rename it on my next reroll.
diff mbox series

Patch

diff --git a/run-command.c b/run-command.c
index 756f1839aa..50f741f2ab 100644
--- a/run-command.c
+++ b/run-command.c
@@ -1526,6 +1526,11 @@  static void pp_init(struct parallel_processes *pp,
 	if (!opts->get_next_task)
 		BUG("you need to specify a get_next_task function");
 
+	if (opts->ungroup) {
+		if (opts->duplicate_output)
+			BUG("duplicate_output and ungroup are incompatible with each other");
+	}
+
 	CALLOC_ARRAY(pp->children, n);
 	if (!opts->ungroup)
 		CALLOC_ARRAY(pp->pfd, n);
@@ -1645,14 +1650,19 @@  static void pp_buffer_stderr(struct parallel_processes *pp,
 	for (size_t i = 0; i < opts->processes; i++) {
 		if (pp->children[i].state == GIT_CP_WORKING &&
 		    pp->pfd[i].revents & (POLLIN | POLLHUP)) {
-			int n = strbuf_read_once(&pp->children[i].err,
-						 pp->children[i].process.err, 0);
+			ssize_t n = strbuf_read_once(&pp->children[i].err,
+						     pp->children[i].process.err, 0);
 			if (n == 0) {
 				close(pp->children[i].process.err);
 				pp->children[i].state = GIT_CP_WAIT_CLEANUP;
-			} else if (n < 0)
+			} else if (n < 0) {
 				if (errno != EAGAIN)
 					die_errno("read");
+			} else if (opts->duplicate_output) {
+				opts->duplicate_output(&pp->children[i].err,
+					pp->children[i].err.len - n,
+					opts->data, pp->children[i].data);
+			}
 		}
 	}
 }
diff --git a/run-command.h b/run-command.h
index 072db56a4d..0c16d7f251 100644
--- a/run-command.h
+++ b/run-command.h
@@ -408,6 +408,25 @@  typedef int (*start_failure_fn)(struct strbuf *out,
 				void *pp_cb,
 				void *pp_task_cb);
 
+/**
+ * This callback is called whenever output from a child process is buffered
+ * 
+ * See run_processes_parallel() below for a discussion of the "struct
+ * strbuf *out" parameter.
+ * 
+ * The offset refers to the number of bytes originally in "out" before
+ * the output from the child process was buffered. Therefore, the buffer
+ * range, "out + buf" to the end of "out", would contain the buffer of
+ * the child process output.
+ *
+ * pp_cb is the callback cookie as passed into run_processes_parallel,
+ * pp_task_cb is the callback cookie as passed into get_next_task_fn.
+ *
+ * This function is incompatible with "ungroup"
+ */
+typedef void (*duplicate_output_fn)(struct strbuf *out, size_t offset,
+				    void *pp_cb, void *pp_task_cb);
+
 /**
  * This callback is called on every child process that finished processing.
  *
@@ -461,6 +480,12 @@  struct run_process_parallel_opts
 	 */
 	start_failure_fn start_failure;
 
+	/**
+	 * duplicate_output: See duplicate_output_fn() above. Unless you need
+	 * to capture output from child processes, leave this as NULL.
+	 */
+	duplicate_output_fn duplicate_output;
+
 	/**
 	 * task_finished: See task_finished_fn() above. This can be
 	 * NULL to omit any special handling.
diff --git a/t/helper/test-run-command.c b/t/helper/test-run-command.c
index 3ecb830f4a..4596ba68a8 100644
--- a/t/helper/test-run-command.c
+++ b/t/helper/test-run-command.c
@@ -52,6 +52,20 @@  static int no_job(struct child_process *cp,
 	return 0;
 }
 
+static void duplicate_output(struct strbuf *out,
+			size_t offset,
+			void *pp_cb UNUSED,
+			void *pp_task_cb UNUSED)
+{
+	struct string_list list = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	string_list_split(&list, out->buf + offset, '\n', -1);
+	for_each_string_list_item(item, &list)
+		fprintf(stderr, "duplicate_output: %s\n", item->string);
+	string_list_clear(&list, 0);
+}
+
 static int task_finished(int result,
 			 struct strbuf *err,
 			 void *pp_cb,
@@ -439,6 +453,12 @@  int cmd__run_command(int argc, const char **argv)
 		opts.ungroup = 1;
 	}
 
+	if (!strcmp(argv[1], "--duplicate-output")) {
+		argv += 1;
+		argc -= 1;
+		opts.duplicate_output = duplicate_output;
+	}
+
 	jobs = atoi(argv[2]);
 	strvec_clear(&proc.args);
 	strvec_pushv(&proc.args, (const char **)argv + 3);
diff --git a/t/t0061-run-command.sh b/t/t0061-run-command.sh
index e2411f6a9b..31f1db96fc 100755
--- a/t/t0061-run-command.sh
+++ b/t/t0061-run-command.sh
@@ -135,6 +135,15 @@  test_expect_success 'run_command runs in parallel with more jobs available than
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more jobs available than tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more jobs available than tasks' '
 	test-tool run-command --ungroup run-command-parallel 5 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -147,6 +156,15 @@  test_expect_success 'run_command runs in parallel with as many jobs as tasks' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with as many jobs as tasks --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with as many jobs as tasks' '
 	test-tool run-command --ungroup run-command-parallel 4 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -159,6 +177,15 @@  test_expect_success 'run_command runs in parallel with more tasks than jobs avai
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command runs in parallel with more tasks than jobs available --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test 4 = $(grep -c "duplicate_output: Hello" err) &&
+	test 4 = $(grep -c "duplicate_output: World" err) &&
+	sed "/duplicate_output/d" err >err1 &&
+	test_cmp expect err1
+'
+
 test_expect_success 'run_command runs ungrouped in parallel with more tasks than jobs available' '
 	test-tool run-command --ungroup run-command-parallel 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_line_count = 8 out &&
@@ -180,6 +207,12 @@  test_expect_success 'run_command is asked to abort gracefully' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command is asked to abort gracefully --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-abort 3 false >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command is asked to abort gracefully (ungroup)' '
 	test-tool run-command --ungroup run-command-abort 3 false >out 2>err &&
 	test_must_be_empty out &&
@@ -196,6 +229,12 @@  test_expect_success 'run_command outputs ' '
 	test_cmp expect actual
 '
 
+test_expect_success 'run_command outputs --duplicate-output' '
+	test-tool run-command --duplicate-output run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
+	test_must_be_empty out &&
+	test_cmp expect err
+'
+
 test_expect_success 'run_command outputs (ungroup) ' '
 	test-tool run-command --ungroup run-command-no-jobs 3 sh -c "printf \"%s\n%s\n\" Hello World" >out 2>err &&
 	test_must_be_empty out &&