Message ID | 0ad22c7cdd3c692aa5b46444e64a3b76f1e87b4c.1597687822.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3,1/9] ref-filter: support different email formats | expand |
"Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> writes: > -static void format_sanitized_subject(struct strbuf *sb, const char *msg) > +static void format_sanitized_subject(struct strbuf *sb, const char *msg, size_t len) > { > + char *r = xmemdupz(msg, len); > size_t trimlen; > size_t start_len = sb->len; > int space = 2; > + int i; > > - for (; *msg && *msg != '\n'; msg++) { > - if (istitlechar(*msg)) { > + for (i = 0; i < len; i++) { > + if (r[i] == '\n') > + r[i] = ' '; Copying the whole string only for this one looks very wasteful. Can't you do for (i = 0; i < len; i++) { char r = msg[i]; if (isspace(r)) r = ' '; if (istitlechar(r)) { ... } or something like that instead? > + if (istitlechar(r[i])) { > if (space == 1) > strbuf_addch(sb, '-'); > space = 0; > - strbuf_addch(sb, *msg); > - if (*msg == '.') > - while (*(msg+1) == '.') > - msg++; > + strbuf_addch(sb, r[i]); > + if (r[i] == '.') > + while (r[i+1] == '.') > + i++; > } else > space |= 1; > } > + free(r); Also, because neither LF or SP is a titlechar(), wouldn't the "if r[i] is LF, replace it with SP" a no-op wrt what will be in sb at the end? > case 'f': /* sanitized subject */ > - format_sanitized_subject(sb, msg + c->subject_off); > + eol = strchrnul(msg + c->subject_off, '\n'); > + format_sanitized_subject(sb, msg + c->subject_off, eol - (msg + c->subject_off)); This original caller expected the helper to stop reading at the end of the first line, but the updated helper needs to be told where to stop, so we do so with some extra computation. Makes sense.
Hi Junio, On Tue, Aug 18, 2020 at 12:59 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > -static void format_sanitized_subject(struct strbuf *sb, const char *msg) > > +static void format_sanitized_subject(struct strbuf *sb, const char *msg, size_t len) > > { > > + char *r = xmemdupz(msg, len); > > size_t trimlen; > > size_t start_len = sb->len; > > int space = 2; > > + int i; > > > > - for (; *msg && *msg != '\n'; msg++) { > > - if (istitlechar(*msg)) { > > + for (i = 0; i < len; i++) { > > + if (r[i] == '\n') > > + r[i] = ' '; > > Copying the whole string only for this one looks very wasteful. > Can't you do > > for (i = 0; i < len; i++) { > char r = msg[i]; > if (isspace(r)) > r = ' '; > if (istitlechar(r)) { > ... > } > > or something like that instead? Ok, that sounds better. Noted for the next version. > > + if (istitlechar(r[i])) { > > if (space == 1) > > strbuf_addch(sb, '-'); > > space = 0; > > - strbuf_addch(sb, *msg); > > - if (*msg == '.') > > - while (*(msg+1) == '.') > > - msg++; > > + strbuf_addch(sb, r[i]); > > + if (r[i] == '.') > > + while (r[i+1] == '.') > > + i++; > > } else > > space |= 1; > > } > > + free(r); > > Also, because neither LF or SP is a titlechar(), wouldn't the "if > r[i] is LF, replace it with SP" a no-op wrt what will be in sb at > the end? Maybe its better to directly replace LF with hyphen? [Instead of first replacing LF with SP and then replacing SP with '-'.] > > case 'f': /* sanitized subject */ > > - format_sanitized_subject(sb, msg + c->subject_off); > > + eol = strchrnul(msg + c->subject_off, '\n'); > > + format_sanitized_subject(sb, msg + c->subject_off, eol - (msg + c->subject_off)); > > This original caller expected the helper to stop reading at the end > of the first line, but the updated helper needs to be told where to > stop, so we do so with some extra computation. Makes sense. Yeah. Thanks, Hariom
Hariom verma <hariom18599@gmail.com> writes: >> Also, because neither LF or SP is a titlechar(), wouldn't the "if >> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at >> the end? > > Maybe its better to directly replace LF with hyphen? [Instead of first > replacing LF with SP and then replacing SP with '-'.] Why do you think LF is so special? Everything other than titlechar() including HT, '#', '*', SP is treated in the same way as the body of that loop. It does not directly contribute to the final contents of sb, but just leaves the marker in the variable "space" the fact that when adding the next titlechar() to the resulting sb, we need a SP to wordbreak. LF now happens to be in the set due to the way you extended the function (it wasn't fed to this function by its sole caller), but other than that, it is no more special than HT, SP or '*'. And they are not replaced with SP or replaced with '-'. So it would be the most sensible to just drop 'if LF, replace it with SP before doing anything else' you added. The existing 'if titlechar, add it to sb but if we saw non-title, add a SP before doing so to wordbreak, and if not titlechar, just remember the fact that we saw one' should work fine as-is without special casing LF at all. Or am I missing something subtle?
Junio C Hamano <gitster@pobox.com> writes: > Hariom verma <hariom18599@gmail.com> writes: > >>> Also, because neither LF or SP is a titlechar(), wouldn't the "if >>> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at >>> the end? >> >> Maybe its better to directly replace LF with hyphen? [Instead of first >> replacing LF with SP and then replacing SP with '-'.] > > Why do you think LF is so special? > > Everything other than titlechar() including HT, '#', '*', SP is > treated in the same way as the body of that loop. It does not > directly contribute to the final contents of sb, but just leaves > the marker in the variable "space" the fact that when adding the > next titlechar() to the resulting sb, we need a SP to wordbreak. I was undecided between mentioning and not mentioning the variable name "space" here. On one hand, one _could_ argue that "space" is used to remember we saw "space and the like" and if it were named "seen_non_title_char", then such a confusion to treat LF so specially might not have occurred. But on the other hand, "space" is what the variable exactly keeps track of; it is just the need for space on the output side, i.e. we remember that "space needed before the next output" with that variable. I am inclined not to suggest renaming "space" at all, but it won't be the end of the world if it were renamed to "need_space" (before the next output), or "seen_nontitle". If we were to actually rename, I have moderately strong preference to the "need_space" over "seen_nontitle", as it won't have to be renamed again when the logic to require a space before the next output has to be updated to include cases other than just "we saw a nontitle character". Thanks.
Hi, On Wed, Aug 19, 2020 at 9:31 PM Junio C Hamano <gitster@pobox.com> wrote: > > Hariom verma <hariom18599@gmail.com> writes: > > >> Also, because neither LF or SP is a titlechar(), wouldn't the "if > >> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at > >> the end? > > > > Maybe its better to directly replace LF with hyphen? [Instead of first > > replacing LF with SP and then replacing SP with '-'.] > > Why do you think LF is so special? > > Everything other than titlechar() including HT, '#', '*', SP is > treated in the same way as the body of that loop. It does not > directly contribute to the final contents of sb, but just leaves > the marker in the variable "space" the fact that when adding the > next titlechar() to the resulting sb, we need a SP to wordbreak. > > LF now happens to be in the set due to the way you extended the > function (it wasn't fed to this function by its sole caller), but > other than that, it is no more special than HT, SP or '*'. And they > are not replaced with SP or replaced with '-'. > > So it would be the most sensible to just drop 'if LF, replace it > with SP before doing anything else' you added. The existing 'if > titlechar, add it to sb but if we saw non-title, add a SP before > doing so to wordbreak, and if not titlechar, just remember the fact > that we saw one' should work fine as-is without special casing LF at > all. > > Or am I missing something subtle? You actually got it all right. Thanks for the insight. Now, I got it. There is no need to give special attention to LF. I missed to see that `titlechar()` is already taking care of everything. Thanks, Hariom
Hi, On Wed, Aug 19, 2020 at 9:38 PM Junio C Hamano <gitster@pobox.com> wrote: > > Junio C Hamano <gitster@pobox.com> writes: > > > Hariom verma <hariom18599@gmail.com> writes: > > > >>> Also, because neither LF or SP is a titlechar(), wouldn't the "if > >>> r[i] is LF, replace it with SP" a no-op wrt what will be in sb at > >>> the end? > >> > >> Maybe its better to directly replace LF with hyphen? [Instead of first > >> replacing LF with SP and then replacing SP with '-'.] > > > > Why do you think LF is so special? > > > > Everything other than titlechar() including HT, '#', '*', SP is > > treated in the same way as the body of that loop. It does not > > directly contribute to the final contents of sb, but just leaves > > the marker in the variable "space" the fact that when adding the > > next titlechar() to the resulting sb, we need a SP to wordbreak. > > I was undecided between mentioning and not mentioning the variable > name "space" here. On one hand, one _could_ argue that "space" is > used to remember we saw "space and the like" and if it were named > "seen_non_title_char", then such a confusion to treat LF so > specially might not have occurred. But on the other hand, "space" > is what the variable exactly keeps track of; it is just the need for > space on the output side, i.e. we remember that "space needed before > the next output" with that variable. > > I am inclined not to suggest renaming "space" at all, but it won't > be the end of the world if it were renamed to "need_space" (before > the next output), or "seen_nontitle". If we were to actually > rename, I have moderately strong preference to the "need_space" over > "seen_nontitle", as it won't have to be renamed again when the logic > to require a space before the next output has to be updated to > include cases other than just "we saw a nontitle character". Yeah, if it was named "seen_non_title_char", I might not get confused. But now as you have already explained its working pretty well, "space" makes more sense to me. Well, I'm okay with both "space" and "need_space". I wonder what others have to say on this? "space" or "need_space"? Thanks, Hariom
diff --git a/pretty.c b/pretty.c index 2a3d46bf42..8d08e8278a 100644 --- a/pretty.c +++ b/pretty.c @@ -839,24 +839,29 @@ static int istitlechar(char c) (c >= '0' && c <= '9') || c == '.' || c == '_'; } -static void format_sanitized_subject(struct strbuf *sb, const char *msg) +static void format_sanitized_subject(struct strbuf *sb, const char *msg, size_t len) { + char *r = xmemdupz(msg, len); size_t trimlen; size_t start_len = sb->len; int space = 2; + int i; - for (; *msg && *msg != '\n'; msg++) { - if (istitlechar(*msg)) { + for (i = 0; i < len; i++) { + if (r[i] == '\n') + r[i] = ' '; + if (istitlechar(r[i])) { if (space == 1) strbuf_addch(sb, '-'); space = 0; - strbuf_addch(sb, *msg); - if (*msg == '.') - while (*(msg+1) == '.') - msg++; + strbuf_addch(sb, r[i]); + if (r[i] == '.') + while (r[i+1] == '.') + i++; } else space |= 1; } + free(r); /* trim any trailing '.' or '-' characters */ trimlen = 0; @@ -1155,7 +1160,7 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */ const struct commit *commit = c->commit; const char *msg = c->message; struct commit_list *p; - const char *arg; + const char *arg, *eol; size_t res; char **slot; @@ -1405,7 +1410,8 @@ static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */ format_subject(sb, msg + c->subject_off, " "); return 1; case 'f': /* sanitized subject */ - format_sanitized_subject(sb, msg + c->subject_off); + eol = strchrnul(msg + c->subject_off, '\n'); + format_sanitized_subject(sb, msg + c->subject_off, eol - (msg + c->subject_off)); return 1; case 'b': /* body */ strbuf_addstr(sb, msg + c->body_off);