diff mbox series

[v4,2/4] trailer: find the end of the log message

Message ID c904caba7e17b6f2784933e9f18634ea66f28537.1695709372.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Trailer readability cleanups | expand

Commit Message

Linus Arver Sept. 26, 2023, 6:22 a.m. UTC
From: Linus Arver <linusa@google.com>

Previously, trailer_info_get() computed the trailer block end position
by

(1) checking for the opts->no_divider flag and optionally calling
    find_patch_start() to find the "patch start" location (patch_start), and
(2) calling find_trailer_end() to find the end of the trailer block
    using patch_start as a guide, saving the return value into
    "trailer_end".

The logic in (1) was awkward because the variable "patch_start" is
misleading if there is no patch in the input. The logic in (2) was
misleading because it could be the case that no trailers are in the
input (yet we are setting a "trailer_end" variable before even searching
for trailers, which happens later in find_trailer_start()). The name
"find_trailer_end" was misleading because that function did not look for
any trailer block itself --- instead it just computed the end position
of the log message in the input where the end of the trailer block (if
it exists) would be (because trailer blocks must always come after the
end of the log message).

Combine the logic in (1) and (2) together into find_patch_start() by
renaming it to find_end_of_log_message(). The end of the log message is
the starting point which find_trailer_start() needs to start searching
backward to parse individual trailers (if any).

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Linus Arver <linusa@google.com>
---
 trailer.c | 61 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 38 insertions(+), 23 deletions(-)

Comments

Jonathan Tan Sept. 28, 2023, 11:16 p.m. UTC | #1
"Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Linus Arver <linusa@google.com>
> 
> Previously, trailer_info_get() computed the trailer block end position
> by
> 
> (1) checking for the opts->no_divider flag and optionally calling
>     find_patch_start() to find the "patch start" location (patch_start), and
> (2) calling find_trailer_end() to find the end of the trailer block
>     using patch_start as a guide, saving the return value into
>     "trailer_end".
> 
> The logic in (1) was awkward because the variable "patch_start" is
> misleading if there is no patch in the input. The logic in (2) was
> misleading because it could be the case that no trailers are in the
> input (yet we are setting a "trailer_end" variable before even searching
> for trailers, which happens later in find_trailer_start()). The name
> "find_trailer_end" was misleading because that function did not look for
> any trailer block itself --- instead it just computed the end position
> of the log message in the input where the end of the trailer block (if
> it exists) would be (because trailer blocks must always come after the
> end of the log message).

I might be biased since I wrote the text in question in 022349c3b0
(trailer: avoid unnecessary splitting on lines, 2016-11-02), but the
concept of patch_start and trailer_end being where the patch would start
and where the trailer would end (if they were present) goes back to
2013d8505d (trailer: parse trailers from file or stdin, 2014-10-13). I
don't remember exactly my thoughts in 2016, but today, this makes sense
to me.

As it is, the reader still needs to know that trailer_start is where
the trailer would start if it was present, and then I think it's quite
natural to have trailer_end be where the trailer would end if it was
present.

I believe the code is simpler this way, because trailer absence now no
longer needs to be special-cased when we use these variables (or maybe
they sometimes do, but not as often, since code that writes to the end
of the trailers, for example, can now just write at trailer_end instead
of having to check whether trailers exist). Same comment for patch 4
regarding using the special value 0 if no trailers are found (I think
the existing code makes more sense).

> Combine the logic in (1) and (2) together into find_patch_start() by
> renaming it to find_end_of_log_message(). The end of the log message is
> the starting point which find_trailer_start() needs to start searching
> backward to parse individual trailers (if any).

Having said that, if patch_start is too confusing for whatever reason,
this refactoring makes sense. (Avoid the confusing name by avoiding
needing to name it in the first place.)

> -static size_t find_patch_start(const char *str)
> +static size_t find_end_of_log_message(const char *input, int no_divider)
>  {
> +	size_t end;
> +
>  	const char *s;
>  
> -	for (s = str; *s; s = next_line(s)) {
> +	/* Assume the naive end of the input is already what we want. */
> +	end = strlen(input);
> +
> +	/* Optionally skip over any patch part ("---" line and below). */
> +	for (s = input; *s; s = next_line(s)) {
>  		const char *v;
>  
> -		if (skip_prefix(s, "---", &v) && isspace(*v))
> -			return s - str;
> +		if (!no_divider && skip_prefix(s, "---", &v) && isspace(*v)) {
> +			end = s - input;
> +			break;
> +		}
>  	}
>  
> -	return s - str;
> +	/* Skip over other ignorable bits. */
> +	return end - ignored_log_message_bytes(input, end);
>  }

This sometimes redundantly calls strlen and sometimes redundantly loops.
I think it's better to do as the code currently does - so, have a big
if/else at the beginning of this function that checks no_divider.
Linus Arver Oct. 20, 2023, 12:24 a.m. UTC | #2
Hi Jonathan, it's been a while because I was on vacation. I've forgotten
about most of the intricacies of this patch (I think this was a good
thing, read on below).

Jonathan Tan <jonathantanmy@google.com> writes:

> "Linus Arver via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> From: Linus Arver <linusa@google.com>
>> 
>> Previously, trailer_info_get() computed the trailer block end position
>> by
>> 
>> (1) checking for the opts->no_divider flag and optionally calling
>>     find_patch_start() to find the "patch start" location (patch_start), and
>> (2) calling find_trailer_end() to find the end of the trailer block
>>     using patch_start as a guide, saving the return value into
>>     "trailer_end".
>> 
>> The logic in (1) was awkward because the variable "patch_start" is
>> misleading if there is no patch in the input. The logic in (2) was
>> misleading because it could be the case that no trailers are in the
>> input (yet we are setting a "trailer_end" variable before even searching
>> for trailers, which happens later in find_trailer_start()). The name
>> "find_trailer_end" was misleading because that function did not look for
>> any trailer block itself --- instead it just computed the end position
>> of the log message in the input where the end of the trailer block (if
>> it exists) would be (because trailer blocks must always come after the
>> end of the log message).
>
> [...]
>
> As it is, the reader still needs to know that trailer_start is where
> the trailer would start if it was present, and then I think it's quite
> natural to have trailer_end be where the trailer would end if it was
> present.
>
> I believe the code is simpler this way, because trailer absence now no
> longer needs to be special-cased when we use these variables (or maybe
> they sometimes do, but not as often, since code that writes to the end
> of the trailers, for example, can now just write at trailer_end instead
> of having to check whether trailers exist). Same comment for patch 4
> regarding using the special value 0 if no trailers are found (I think
> the existing code makes more sense).

I think the root cause of my confusion with this codebase is due to the
variables being named as if the things they refer to exist, but without
any guarantees that they indeed exist. This applies to "patch_start"
(the patch part might not be present in the input),
"trailer_{start,end}" (trailers block might not exist (yet)). IOW these
variables are named as if the intent is to always add new trailers into
the input, which may not be the case (we have "--parse", after all).

Looking again at patch 4, I'm now leaning toward dropping it. Other
than the reasons you cited, we also add a new struct field which by
itself does not add new information (the information can already be
deduced from the other fields). In the near future I want to simplify
the data structures as much as possible, and adding a new field seems to
go against this desire of mine.

>> Combine the logic in (1) and (2) together into find_patch_start() by
>> renaming it to find_end_of_log_message(). The end of the log message is
>> the starting point which find_trailer_start() needs to start searching
>> backward to parse individual trailers (if any).
>
> Having said that, if patch_start is too confusing for whatever reason,
> this refactoring makes sense. (Avoid the confusing name by avoiding
> needing to name it in the first place.)

I think the existing code is confusing, and would prefer to keep this
patch.

>> -static size_t find_patch_start(const char *str)
>> +static size_t find_end_of_log_message(const char *input, int no_divider)
>>  {
>> +	size_t end;
>> +
>>  	const char *s;
>>  
>> -	for (s = str; *s; s = next_line(s)) {
>> +	/* Assume the naive end of the input is already what we want. */
>> +	end = strlen(input);
>> +
>> +	/* Optionally skip over any patch part ("---" line and below). */
>> +	for (s = input; *s; s = next_line(s)) {
>>  		const char *v;
>>  
>> -		if (skip_prefix(s, "---", &v) && isspace(*v))
>> -			return s - str;
>> +		if (!no_divider && skip_prefix(s, "---", &v) && isspace(*v)) {
>> +			end = s - input;
>> +			break;
>> +		}
>>  	}
>>  
>> -	return s - str;
>> +	/* Skip over other ignorable bits. */
>> +	return end - ignored_log_message_bytes(input, end);
>>  }
>
> This sometimes redundantly calls strlen and sometimes redundantly loops.
> I think it's better to do as the code currently does - so, have a big
> if/else at the beginning of this function that checks no_divider.

Will update, thanks.
Junio C Hamano Oct. 20, 2023, 12:36 a.m. UTC | #3
Linus Arver <linusa@google.com> writes:

> Hi Jonathan, it's been a while because I was on vacation. I've forgotten
> about most of the intricacies of this patch (I think this was a good
> thing, read on below).

Welcome back ;-).

> Will update, thanks.

Thanks.
diff mbox series

Patch

diff --git a/trailer.c b/trailer.c
index 3c54b38a85a..96cb285a4ea 100644
--- a/trailer.c
+++ b/trailer.c
@@ -809,21 +809,47 @@  static ssize_t last_line(const char *buf, size_t len)
 }
 
 /*
- * Return the position of the start of the patch or the length of str if there
- * is no patch in the message.
+ * Find the end of the log message as an offset from the start of the input
+ * (where callers of this function are interested in looking for a trailers
+ * block in the same input). We have to consider two categories of content that
+ * can come at the end of the input which we want to ignore (because they don't
+ * belong in the log message):
+ *
+ * (1) the "patch part" which begins with a "---" divider and has patch
+ * information (like the output of git-format-patch), and
+ *
+ * (2) any trailing comment lines, blank lines like in the output of "git
+ * commit -v", or stuff below the "cut" (scissor) line.
+ *
+ * As a formula, the situation looks like this:
+ *
+ *     INPUT = LOG MESSAGE + IGNORED
+ *
+ * where IGNORED can be either of the two categories described above. It may be
+ * that there is nothing to ignore. Now it may be the case that the LOG MESSAGE
+ * contains a trailer block, but that's not the concern of this function.
  */
-static size_t find_patch_start(const char *str)
+static size_t find_end_of_log_message(const char *input, int no_divider)
 {
+	size_t end;
+
 	const char *s;
 
-	for (s = str; *s; s = next_line(s)) {
+	/* Assume the naive end of the input is already what we want. */
+	end = strlen(input);
+
+	/* Optionally skip over any patch part ("---" line and below). */
+	for (s = input; *s; s = next_line(s)) {
 		const char *v;
 
-		if (skip_prefix(s, "---", &v) && isspace(*v))
-			return s - str;
+		if (!no_divider && skip_prefix(s, "---", &v) && isspace(*v)) {
+			end = s - input;
+			break;
+		}
 	}
 
-	return s - str;
+	/* Skip over other ignorable bits. */
+	return end - ignored_log_message_bytes(input, end);
 }
 
 /*
@@ -925,12 +951,6 @@  continue_outer_loop:
 	return len;
 }
 
-/* Return the position of the end of the trailers. */
-static size_t find_trailer_end(const char *buf, size_t len)
-{
-	return len - ignored_log_message_bytes(buf, len);
-}
-
 static int ends_with_blank_line(const char *buf, size_t len)
 {
 	ssize_t ll = last_line(buf, len);
@@ -1101,7 +1121,7 @@  void process_trailers(const char *file,
 void trailer_info_get(struct trailer_info *info, const char *str,
 		      const struct process_trailer_options *opts)
 {
-	int patch_start, trailer_end, trailer_start;
+	int end_of_log_message, trailer_start;
 	struct strbuf **trailer_lines, **ptr;
 	char **trailer_strings = NULL;
 	size_t nr = 0, alloc = 0;
@@ -1109,16 +1129,11 @@  void trailer_info_get(struct trailer_info *info, const char *str,
 
 	ensure_configured();
 
-	if (opts->no_divider)
-		patch_start = strlen(str);
-	else
-		patch_start = find_patch_start(str);
-
-	trailer_end = find_trailer_end(str, patch_start);
-	trailer_start = find_trailer_start(str, trailer_end);
+	end_of_log_message = find_end_of_log_message(str, opts->no_divider);
+	trailer_start = find_trailer_start(str, end_of_log_message);
 
 	trailer_lines = strbuf_split_buf(str + trailer_start,
-					 trailer_end - trailer_start,
+					 end_of_log_message - trailer_start,
 					 '\n',
 					 0);
 	for (ptr = trailer_lines; *ptr; ptr++) {
@@ -1141,7 +1156,7 @@  void trailer_info_get(struct trailer_info *info, const char *str,
 	info->blank_line_before_trailer = ends_with_blank_line(str,
 							       trailer_start);
 	info->trailer_start = str + trailer_start;
-	info->trailer_end = str + trailer_end;
+	info->trailer_end = str + end_of_log_message;
 	info->trailers = trailer_strings;
 	info->trailer_nr = nr;
 }