[v3,7/9] pretty: refactor `format_sanitized_subject()`

Message ID	0ad22c7cdd3c692aa5b46444e64a3b76f1e87b4c.1597687822.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=R55k=B3=vger.kernel.org=git-owner@kernel.org> Message-Id: <0ad22c7cdd3c692aa5b46444e64a3b76f1e87b4c.1597687822.git.gitgitgadget@gmail.com> In-Reply-To: <pull.684.v3.git.1597687822.gitgitgadget@gmail.com> References: <pull.684.v2.git.1596664305.gitgitgadget@gmail.com> <pull.684.v3.git.1597687822.gitgitgadget@gmail.com> From: "Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> Date: Mon, 17 Aug 2020 18:10:20 +0000 Subject: [PATCH v3 7/9] pretty: refactor `format_sanitized_subject()` Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Hariom Verma <hariom18599@gmail.com>, Hariom Verma <hariom18599@gmail.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	[v3,1/9] ref-filter: support different email formats \| expand [v3,1/9] ref-filter: support different email formats [v3,2/9] ref-filter: refactor `grab_objectname()` [v3,3/9] ref-filter: modify error messages in `grab_objectname()` [v3,4/9] ref-filter: rename `objectname` related functions and fields [v3,5/9] ref-filter: add `short` modifier to 'tree' atom [v3,6/9] ref-filter: add `short` modifier to 'parent' atom [v3,7/9] pretty: refactor `format_sanitized_subject()` [v3,8/9] format-support: move `format_sanitized_subject()` from pretty [v3,9/9] ref-filter: add `sanitize` option for 'subject' atom

Koji Nakamaru via GitGitGadget Aug. 17, 2020, 6:10 p.m. UTC

From: Hariom Verma <hariom18599@gmail.com>

The function 'format_sanitized_subject()' is responsible for
sanitized subject line in pretty.c
e.g.
the subject line
the-sanitized-subject-line

It would be a nice enhancement to `subject` atom to have the
same feature. So in the later commits, we plan to add this feature
to ref-filter.

Refactor `format_sanitized_subject()`, so it can be reused in
ref-filter.c for adding new modifier `sanitize` to "subject" atom.

Currently, the loop inside `format_sanitized_subject()` runs
until `\n` is found. But now, we stored the first occurrence
of `\n` in a variable `eol` and passed it in
`format_sanitized_subject()`. And the loop runs upto `eol`.

But this change isn't sufficient to reuse this function in
ref-filter.c because there exist tags with multiline subject.

It's wise to replace `\n` with ' ', if `format_sanitized_subject()`
encounters `\n` before end of subject line, just like `copy_subject()`.
Because we'll be only using `format_sanitized_subject()` for
"%(subject:sanitize)", instead of `copy_subject()` and
`format_sanitized_subject()` both. So, added the code:
```
if (char == '\n') /* never true if called inside pretty.c */
    char = ' ';
```

Now, it's ready to be reused in ref-filter.c

Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Mentored-by: Heba Waly <heba.waly@gmail.com>
Signed-off-by: Hariom Verma <hariom18599@gmail.com>
---
 pretty.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

Junio C Hamano Aug. 17, 2020, 7:29 p.m. UTC | #1

"Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -static void format_sanitized_subject(struct strbuf *sb, const char *msg)
> +static void format_sanitized_subject(struct strbuf *sb, const char *msg, size_t len)
>  {
> +	char *r = xmemdupz(msg, len);
>  	size_t trimlen;
>  	size_t start_len = sb->len;
>  	int space = 2;
> +	int i;
>  
> -	for (; *msg && *msg != '\n'; msg++) {
> -		if (istitlechar(*msg)) {
> +	for (i = 0; i < len; i++) {
> +		if (r[i] == '\n')
> +			r[i] = ' ';

Copying the whole string only for this one looks very wasteful.
Can't you do

	for (i = 0; i < len; i++) {
		char r = msg[i];
		if (isspace(r))
			r = ' ';
		if (istitlechar(r)) {
			...
	}

or something like that instead?  

> +		if (istitlechar(r[i])) {
>  			if (space == 1)
>  				strbuf_addch(sb, '-');
>  			space = 0;
> -			strbuf_addch(sb, *msg);
> -			if (*msg == '.')
> -				while (*(msg+1) == '.')
> -					msg++;
> +			strbuf_addch(sb, r[i]);
> +			if (r[i] == '.')
> +				while (r[i+1] == '.')
> +					i++;
>  		} else
>  			space |= 1;
>  	}
> +	free(r);

Also, because neither LF or SP is a titlechar(), wouldn't the "if
r[i] is LF, replace it with SP" a no-op wrt what will be in sb at
the end?


>  	case 'f':	/* sanitized subject */
> -		format_sanitized_subject(sb, msg + c->subject_off);
> +		eol = strchrnul(msg + c->subject_off, '\n');
> +		format_sanitized_subject(sb, msg + c->subject_off, eol - (msg + c->subject_off));

This original caller expected the helper to stop reading at the end
of the first line, but the updated helper needs to be told where to
stop, so we do so with some extra computation.  Makes sense.

Hariom verma Aug. 19, 2020, 1:36 p.m. UTC | #2

Hi Junio,

On Tue, Aug 18, 2020 at 12:59 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > -static void format_sanitized_subject(struct strbuf *sb, const char *msg)
> > +static void format_sanitized_subject(struct strbuf *sb, const char *msg, size_t len)
> >  {
> > +     char *r = xmemdupz(msg, len);
> >       size_t trimlen;
> >       size_t start_len = sb->len;
> >       int space = 2;
> > +     int i;
> >
> > -     for (; *msg && *msg != '\n'; msg++) {
> > -             if (istitlechar(*msg)) {
> > +     for (i = 0; i < len; i++) {
> > +             if (r[i] == '\n')
> > +                     r[i] = ' ';
>
> Copying the whole string only for this one looks very wasteful.
> Can't you do
>
>         for (i = 0; i < len; i++) {
>                 char r = msg[i];
>                 if (isspace(r))
>                         r = ' ';
>                 if (istitlechar(r)) {
>                         ...
>         }
>
> or something like that instead?

Ok, that sounds better. Noted for the next version.

> > +             if (istitlechar(r[i])) {
> >                       if (space == 1)
> >                               strbuf_addch(sb, '-');
> >                       space = 0;
> > -                     strbuf_addch(sb, *msg);
> > -                     if (*msg == '.')
> > -                             while (*(msg+1) == '.')
> > -                                     msg++;
> > +                     strbuf_addch(sb, r[i]);
> > +                     if (r[i] == '.')
> > +                             while (r[i+1] == '.')
> > +                                     i++;
> >               } else
> >                       space |= 1;
> >       }
> > +     free(r);
>
> Also, because neither LF or SP is a titlechar(), wouldn't the "if
> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at
> the end?

Maybe its better to directly replace LF with hyphen? [Instead of first
replacing LF with SP and then replacing SP with '-'.]

> >       case 'f':       /* sanitized subject */
> > -             format_sanitized_subject(sb, msg + c->subject_off);
> > +             eol = strchrnul(msg + c->subject_off, '\n');
> > +             format_sanitized_subject(sb, msg + c->subject_off, eol - (msg + c->subject_off));
>
> This original caller expected the helper to stop reading at the end
> of the first line, but the updated helper needs to be told where to
> stop, so we do so with some extra computation.  Makes sense.

Yeah.

Thanks,
Hariom

Junio C Hamano Aug. 19, 2020, 4:01 p.m. UTC | #3

Hariom verma <hariom18599@gmail.com> writes:

>> Also, because neither LF or SP is a titlechar(), wouldn't the "if
>> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at
>> the end?
>
> Maybe its better to directly replace LF with hyphen? [Instead of first
> replacing LF with SP and then replacing SP with '-'.]

Why do you think LF is so special?

Everything other than titlechar() including HT, '#', '*', SP is
treated in the same way as the body of that loop.  It does not
directly contribute to the final contents of sb, but just leaves
the marker in the variable "space" the fact that when adding the
next titlechar() to the resulting sb, we need a SP to wordbreak.

LF now happens to be in the set due to the way you extended the
function (it wasn't fed to this function by its sole caller), but
other than that, it is no more special than HT, SP or '*'.  And they
are not replaced with SP or replaced with '-'.

So it would be the most sensible to just drop 'if LF, replace it
with SP before doing anything else' you added.  The existing 'if
titlechar, add it to sb but if we saw non-title, add a SP before
doing so to wordbreak, and if not titlechar, just remember the fact
that we saw one' should work fine as-is without special casing LF at
all.

Or am I missing something subtle?

Junio C Hamano Aug. 19, 2020, 4:08 p.m. UTC | #4

Junio C Hamano <gitster@pobox.com> writes:

> Hariom verma <hariom18599@gmail.com> writes:
>
>>> Also, because neither LF or SP is a titlechar(), wouldn't the "if
>>> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at
>>> the end?
>>
>> Maybe its better to directly replace LF with hyphen? [Instead of first
>> replacing LF with SP and then replacing SP with '-'.]
>
> Why do you think LF is so special?
>
> Everything other than titlechar() including HT, '#', '*', SP is
> treated in the same way as the body of that loop.  It does not
> directly contribute to the final contents of sb, but just leaves
> the marker in the variable "space" the fact that when adding the
> next titlechar() to the resulting sb, we need a SP to wordbreak.

I was undecided between mentioning and not mentioning the variable
name "space" here.  On one hand, one _could_ argue that "space" is
used to remember we saw "space and the like" and if it were named
"seen_non_title_char", then such a confusion to treat LF so
specially might not have occurred.  But on the other hand, "space"
is what the variable exactly keeps track of; it is just the need for
space on the output side, i.e. we remember that "space needed before
the next output" with that variable.

I am inclined not to suggest renaming "space" at all, but it won't
be the end of the world if it were renamed to "need_space" (before
the next output), or "seen_nontitle".  If we were to actually
rename, I have moderately strong preference to the "need_space" over
"seen_nontitle", as it won't have to be renamed again when the logic
to require a space before the next output has to be updated to
include cases other than just "we saw a nontitle character".

Thanks.

Hariom verma Aug. 20, 2020, 5:27 p.m. UTC | #5

Hi,

On Wed, Aug 19, 2020 at 9:31 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Hariom verma <hariom18599@gmail.com> writes:
>
> >> Also, because neither LF or SP is a titlechar(), wouldn't the "if
> >> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at
> >> the end?
> >
> > Maybe its better to directly replace LF with hyphen? [Instead of first
> > replacing LF with SP and then replacing SP with '-'.]
>
> Why do you think LF is so special?
>
> Everything other than titlechar() including HT, '#', '*', SP is
> treated in the same way as the body of that loop.  It does not
> directly contribute to the final contents of sb, but just leaves
> the marker in the variable "space" the fact that when adding the
> next titlechar() to the resulting sb, we need a SP to wordbreak.
>
> LF now happens to be in the set due to the way you extended the
> function (it wasn't fed to this function by its sole caller), but
> other than that, it is no more special than HT, SP or '*'.  And they
> are not replaced with SP or replaced with '-'.
>
> So it would be the most sensible to just drop 'if LF, replace it
> with SP before doing anything else' you added.  The existing 'if
> titlechar, add it to sb but if we saw non-title, add a SP before
> doing so to wordbreak, and if not titlechar, just remember the fact
> that we saw one' should work fine as-is without special casing LF at
> all.
>
> Or am I missing something subtle?

You actually got it all right. Thanks for the insight.

Now, I got it. There is no need to give special attention to LF. I
missed to see that `titlechar()` is already taking care of everything.

Thanks,
Hariom

Hariom verma Aug. 20, 2020, 5:33 p.m. UTC | #6

Hi,

On Wed, Aug 19, 2020 at 9:38 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Hariom verma <hariom18599@gmail.com> writes:
> >
> >>> Also, because neither LF or SP is a titlechar(), wouldn't the "if
> >>> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at
> >>> the end?
> >>
> >> Maybe its better to directly replace LF with hyphen? [Instead of first
> >> replacing LF with SP and then replacing SP with '-'.]
> >
> > Why do you think LF is so special?
> >
> > Everything other than titlechar() including HT, '#', '*', SP is
> > treated in the same way as the body of that loop.  It does not
> > directly contribute to the final contents of sb, but just leaves
> > the marker in the variable "space" the fact that when adding the
> > next titlechar() to the resulting sb, we need a SP to wordbreak.
>
> I was undecided between mentioning and not mentioning the variable
> name "space" here.  On one hand, one _could_ argue that "space" is
> used to remember we saw "space and the like" and if it were named
> "seen_non_title_char", then such a confusion to treat LF so
> specially might not have occurred.  But on the other hand, "space"
> is what the variable exactly keeps track of; it is just the need for
> space on the output side, i.e. we remember that "space needed before
> the next output" with that variable.
>
> I am inclined not to suggest renaming "space" at all, but it won't
> be the end of the world if it were renamed to "need_space" (before
> the next output), or "seen_nontitle".  If we were to actually
> rename, I have moderately strong preference to the "need_space" over
> "seen_nontitle", as it won't have to be renamed again when the logic
> to require a space before the next output has to be updated to
> include cases other than just "we saw a nontitle character".

Yeah, if it was named "seen_non_title_char", I might not get confused.
But now as you have already explained its working pretty well, "space"
makes more sense to me.
Well, I'm okay with both "space" and "need_space".

I wonder what others have to say on this? "space" or "need_space"?

Thanks,
Hariom

[v3,7/9] pretty: refactor `format_sanitized_subject()`

Commit Message

Comments

Patch