[1/5] ref-filter: support different email formats

Message ID	aeb116c5aaaa23dfefbc7a6f4ac743a6f5a3ade8.1595882588.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=S8Id=BG=vger.kernel.org=git-owner@kernel.org> Message-Id: <aeb116c5aaaa23dfefbc7a6f4ac743a6f5a3ade8.1595882588.git.gitgitgadget@gmail.com> In-Reply-To: <pull.684.git.1595882588.gitgitgadget@gmail.com> References: <pull.684.git.1595882588.gitgitgadget@gmail.com> From: "Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> Date: Mon, 27 Jul 2020 20:43:04 +0000 Subject: [PATCH 1/5] ref-filter: support different email formats Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Hariom Verma <hariom18599@gmail.com>, Hariom Verma <hariom18599@gmail.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Improvements to ref-filter \| expand [0/5,GSoC] Improvements to ref-filter [1/5] ref-filter: support different email formats [2/5] ref-filter: add `short` option for 'tree' and 'parent' [3/5] pretty: refactor `format_sanitized_subject()` [4/5] format-support: move `format_sanitized_subject()` from pretty [5/5] ref-filter: add `sanitize` option for 'subject' atom

Johannes Schindelin via GitGitGadget July 27, 2020, 8:43 p.m. UTC

From: Hariom Verma <hariom18599@gmail.com>

Currently, ref-filter only supports printing email with arrow brackets.

Let's add support for two more email options.
- trim : print email without arrow brackets.
- localpart : prints the part before the @ sign

Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Mentored-by: Heba Waly <heba.waly@gmail.com>
Signed-off-by: Hariom Verma <hariom18599@gmail.com>
---
 ref-filter.c            | 36 ++++++++++++++++++++++++++++++++----
 t/t6300-for-each-ref.sh | 16 ++++++++++++++++
 2 files changed, 48 insertions(+), 4 deletions(-)

Junio C Hamano July 27, 2020, 10:51 p.m. UTC | #1

"Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Hariom Verma <hariom18599@gmail.com>
>
> Currently, ref-filter only supports printing email with arrow brackets.

This is the first time I heard the term "arrow bracket".  Aren't
they more commonly called angle brackets?

> Let's add support for two more email options.
> - trim : print email without arrow brackets.

Why would this be useful?

> - localpart : prints the part before the @ sign

Meaning I'd get "<gitster" for me?

Building small pieces of new feature in each patch is good, and
adding tests to each step is also good.  Why not do the same for
docs?

> +static struct email_option{

Missing SP.

> +	enum { EO_INVALID, EO_RAW, EO_TRIM, EO_LOCALPART } option;
> +} email_option;
> +
> @@ -1040,10 +1044,26 @@ static const char *copy_email(const char *buf)
>  	const char *eoemail;
>  	if (!email)
>  		return xstrdup("");
> -	eoemail = strchr(email, '>');

The original code prepares to see NULL from this strchr(); that is
why it checks eoemail for NULL and returns an empty string---the
data is broken (i.e. not an address in angle brackets), which this
code cannot do anything about---in the later part of the code.

> +	switch (email_option.option) {
> +	case EO_RAW:
> +		eoemail = strchr(email, '>') + 1;

And this breaks the carefully laid out error handling by the
original code.  Adding 1 to NULL is quite undefined.

> +		break;
> +	case EO_TRIM:
> +		email++;
> +		eoemail = strchr(email, '>');
> +		break;
> +	case EO_LOCALPART:
> +		email++;
> +		eoemail = strchr(email, '@');

The undocumented design here is that you want to return "hariom" for
"<hariom@gmail.com>", i.e. out of the "trimmed" e-mail, the part
before the at-sign is returned.

If the data were "<hariom>", you'd still want to return "hariom" no?
But because you do not check for NULL, you end up returning an empty
string.

I think you want to cut at the first '@' or '>', whichever comes
first.

> +		break;
> +	case EO_INVALID:
> +	default:

Invalid and unhandled ones are silently ignored and not treated as
an error?  I would have thought that at least the "default" one
would be a BUG(), as you covered all the possible values for the
enum with case arms.  I wouldn't be surprised if seeing EO_INVALID
is also a BUG(), i.e. the control flow that led to the caller to
call this function with EO_INVALID in email_option.option is likely
to be broken.  It's not like you return "" to protect yourself when
fed a bad data from objects---a bad value in .option can only come
here if the parser you wrote for "--format=<string>" produced a
wrong result.

> +		return xstrdup("");
> +	}
> +
>  	if (!eoemail)
>  		return xstrdup("");
> -	return xmemdupz(email, eoemail + 1 - email);
> +	return xmemdupz(email, eoemail - email);
>  }
>  
>  static char *copy_subject(const char *buf, unsigned long len)
> @@ -1113,7 +1133,7 @@ static void grab_person(const char *who, struct atom_value *val, int deref, void
>  			continue;
>  		if (name[wholen] != 0 &&
>  		    strcmp(name + wholen, "name") &&
> -		    strcmp(name + wholen, "email") &&
> +		    !starts_with(name + wholen, "email") &&
>  		    !starts_with(name + wholen, "date"))
>  			continue;
>  		if (!wholine)
> @@ -1124,8 +1144,16 @@ static void grab_person(const char *who, struct atom_value *val, int deref, void
>  			v->s = copy_line(wholine);
>  		else if (!strcmp(name + wholen, "name"))
>  			v->s = copy_name(wholine);
> -		else if (!strcmp(name + wholen, "email"))
> +		else if (starts_with(name + wholen, "email")) {
> +			email_option.option = EO_INVALID;
> +			if (!strcmp(name + wholen, "email"))
> +				email_option.option = EO_RAW;
> +			if (!strcmp(name + wholen, "email:trim"))
> +				email_option.option = EO_TRIM;
> +			if (!strcmp(name + wholen, "email:localpart"))
> +				email_option.option = EO_LOCALPART;

The ref-filter formatting language already knows many "colon plus
modifier" suffix like "refname:short" and "contents:body", but I do
not think we have ugly repetition like the above code to parse them.
Perhaps the addition for "email:<whatever>" can benefit from
studying and mimicking existing practices a bit more?

Đoàn Trần Công Danh July 28, 2020, 1:58 p.m. UTC | #2

[this is a resent, my previous mail couldn't reach the archive]

On 2020-07-27 20:43:04+0000, Hariom Verma via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Hariom Verma <hariom18599@gmail.com>
> 
> Currently, ref-filter only supports printing email with arrow brackets.
> 
> Let's add support for two more email options.
> - trim : print email without arrow brackets.
> - localpart : prints the part before the @ sign
> 
> Mentored-by: Christian Couder <chriscool@tuxfamily.org>
> Mentored-by: Heba Waly <heba.waly@gmail.com>
> Signed-off-by: Hariom Verma <hariom18599@gmail.com>
> ---
>  ref-filter.c            | 36 ++++++++++++++++++++++++++++++++----
>  t/t6300-for-each-ref.sh | 16 ++++++++++++++++
>  2 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/ref-filter.c b/ref-filter.c
> index 8447cb09be..8563088eb1 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -102,6 +102,10 @@ static struct ref_to_worktree_map {
>  	struct worktree **worktrees;
>  } ref_to_worktree_map;
>  
> +static struct email_option{
> +	enum { EO_INVALID, EO_RAW, EO_TRIM, EO_LOCALPART } option;
> +} email_option;
> +
>  /*
>   * An atom is a valid field atom listed below, possibly prefixed with
>   * a "*" to denote deref_tag().
> @@ -1040,10 +1044,26 @@ static const char *copy_email(const char *buf)
>  	const char *eoemail;
>  	if (!email)
>  		return xstrdup("");
> -	eoemail = strchr(email, '>');
> +	switch (email_option.option) {
> +	case EO_RAW:
> +		eoemail = strchr(email, '>') + 1;
> +		break;
> +	case EO_TRIM:
> +		email++;
> +		eoemail = strchr(email, '>');
> +		break;
> +	case EO_LOCALPART:
> +		email++;
> +		eoemail = strchr(email, '@');
> +		break;


This is not correct.
RFC-822 allows @ in local part,
albeit, that localpart must be quoted:

        addr-spec       =       local-part "@" domain
        local-part      =       dot-atom / quoted-string / obs-local-part
        quoted-string   =       [CFWS]
                                DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                                [CFWS]
        qcontent        =       qtext / quoted-pair
        qtext           =       NO-WS-CTL /     ; Non white space
        qtext           =       NO-WS-CTL /     ; Non white space controls
                                %d33 /          ; The rest of the US-ASCII
                                %d35-91 /       ;  characters not including "\"
                                %d93-126        ;  or the quote character
        quoted-pair     =       ("\" text) / obs-qp

IOW, those below email addresses are valid email address,
and the local part is `quoted@local'

        "quoted@local"@example.com
        quoted\@local@example.com

Anyway, it seems like current Git strips first `"'
from `"quoted@local"@example.com'

Junio C Hamano July 28, 2020, 4:45 p.m. UTC | #3

Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

> This is not correct.
> RFC-822 allows @ in local part,
> albeit, that localpart must be quoted:

We do care about truly local e-mail addresses, without '@' anywhere
inside <braket>, simply because they are common in the result of SCM
conversion from CVS/SVN.

But I do not think we are pedantic/academic enough to have cared
about any local part that is unusual enough to require quoting; we
instead relied on the fact that we live in real world with practical
people who would avoid such an address ;-).

Hariom verma July 28, 2020, 8:31 p.m. UTC | #4

Hi,

On Tue, Jul 28, 2020 at 4:21 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Hariom Verma via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Hariom Verma <hariom18599@gmail.com>
> >
> > Currently, ref-filter only supports printing email with arrow brackets.
>
> This is the first time I heard the term "arrow bracket".  Aren't
> they more commonly called angle brackets?

Yeah. Sorry about that.

> > Let's add support for two more email options.
> > - trim : print email without arrow brackets.
>
> Why would this be useful?

It might be useful for using the ref-filter's logic in pretty.c
(especially for `--pretty` formats like `%an` and `%cn`)

> > - localpart : prints the part before the @ sign
>
> Meaning I'd get "<gitster" for me?

No, you'll get "gitster".

> Building small pieces of new feature in each patch is good, and
> adding tests to each step is also good.  Why not do the same for
> docs?

Yeah, I agree with you. I should have focused on documentation too.

> > +static struct email_option{
>
> Missing SP.

I'll fix it.

> > +     enum { EO_INVALID, EO_RAW, EO_TRIM, EO_LOCALPART } option;
> > +} email_option;
> > +
> > @@ -1040,10 +1044,26 @@ static const char *copy_email(const char *buf)
> >       const char *eoemail;
> >       if (!email)
> >               return xstrdup("");
> > -     eoemail = strchr(email, '>');
>
> The original code prepares to see NULL from this strchr(); that is
> why it checks eoemail for NULL and returns an empty string---the
> data is broken (i.e. not an address in angle brackets), which this
> code cannot do anything about---in the later part of the code.

I think this commit still takes care of NULL.
After the switch-case statements, code consists of:
```
if (!eoemail)
    return xstrdup("");
```
Which checks eoemail for NULL. And will return empty string if address
is not in angle brackets.
Same applies for local-part too. If '@' is not present in email
address, it will still return empty string.

> > +     switch (email_option.option) {
> > +     case EO_RAW:
> > +             eoemail = strchr(email, '>') + 1;
>
> And this breaks the carefully laid out error handling by the
> original code.  Adding 1 to NULL is quite undefined.

Yeah. I'll take care of it in the next version.

> > +             break;
> > +     case EO_TRIM:
> > +             email++;
> > +             eoemail = strchr(email, '>');
> > +             break;
> > +     case EO_LOCALPART:
> > +             email++;
> > +             eoemail = strchr(email, '@');
>
> The undocumented design here is that you want to return "hariom" for
> "<hariom@gmail.com>", i.e. out of the "trimmed" e-mail, the part
> before the at-sign is returned.
>
> If the data were "<hariom>", you'd still want to return "hariom" no?
> But because you do not check for NULL, you end up returning an empty
> string.

I never heard of email address without '@' symbol. Thats why I
returned empty string.

Will fix that too.

> I think you want to cut at the first '@' or '>', whichever comes
> first.

If email data can be without '@' symbol, then I guess "yes".

> > +             break;
> > +     case EO_INVALID:
> > +     default:
>
> Invalid and unhandled ones are silently ignored and not treated as
> an error?  I would have thought that at least the "default" one
> would be a BUG(), as you covered all the possible values for the
> enum with case arms.  I wouldn't be surprised if seeing EO_INVALID
> is also a BUG(), i.e. the control flow that led to the caller to
> call this function with EO_INVALID in email_option.option is likely
> to be broken.  It's not like you return "" to protect yourself when
> fed a bad data from objects---a bad value in .option can only come
> here if the parser you wrote for "--format=<string>" produced a
> wrong result.

Christian <chriscool@tuxfamily.org> also suggested me the same. Will
fix this too.

BTW, on master "{author,committer,tagger}email:<xyz>" does not print any error.

> > +             return xstrdup("");
> > +     }
> > +
> >       if (!eoemail)
> >               return xstrdup("");
> > -     return xmemdupz(email, eoemail + 1 - email);
> > +     return xmemdupz(email, eoemail - email);
> >  }
> >
> >  static char *copy_subject(const char *buf, unsigned long len)
> > @@ -1113,7 +1133,7 @@ static void grab_person(const char *who, struct atom_value *val, int deref, void
> >                       continue;
> >               if (name[wholen] != 0 &&
> >                   strcmp(name + wholen, "name") &&
> > -                 strcmp(name + wholen, "email") &&
> > +                 !starts_with(name + wholen, "email") &&
> >                   !starts_with(name + wholen, "date"))
> >                       continue;
> >               if (!wholine)
> > @@ -1124,8 +1144,16 @@ static void grab_person(const char *who, struct atom_value *val, int deref, void
> >                       v->s = copy_line(wholine);
> >               else if (!strcmp(name + wholen, "name"))
> >                       v->s = copy_name(wholine);
> > -             else if (!strcmp(name + wholen, "email"))
> > +             else if (starts_with(name + wholen, "email")) {
> > +                     email_option.option = EO_INVALID;
> > +                     if (!strcmp(name + wholen, "email"))
> > +                             email_option.option = EO_RAW;
> > +                     if (!strcmp(name + wholen, "email:trim"))
> > +                             email_option.option = EO_TRIM;
> > +                     if (!strcmp(name + wholen, "email:localpart"))
> > +                             email_option.option = EO_LOCALPART;
>
> The ref-filter formatting language already knows many "colon plus
> modifier" suffix like "refname:short" and "contents:body", but I do
> not think we have ugly repetition like the above code to parse them.
> Perhaps the addition for "email:<whatever>" can benefit from
> studying and mimicking existing practices a bit more?
>

For "email:<whatever>",
even If I parse that <whatever>. I still make comparison something like:
```
if (!modifier)
    email_option.option = EO_RAW;
else if (!strcmp(modifier, "trim"))
    email_option.option = EO_TRIM;
else if (!strcmp(arg, "localpart"))
    email_option.option = EO_LOCALPART;
```

Thanks,
Hariom

Junio C Hamano July 28, 2020, 8:43 p.m. UTC | #5

Hariom verma <hariom18599@gmail.com> writes:

>> The ref-filter formatting language already knows many "colon plus
>> modifier" suffix like "refname:short" and "contents:body", but I do
>> not think we have ugly repetition like the above code to parse them.
>> Perhaps the addition for "email:<whatever>" can benefit from
>> studying and mimicking existing practices a bit more?
>>
>
> For "email:<whatever>",
> even If I parse that <whatever>. I still make comparison something like:
> ```
> if (!modifier)
>     email_option.option = EO_RAW;
> else if (!strcmp(modifier, "trim"))
>     email_option.option = EO_TRIM;
> else if (!strcmp(arg, "localpart"))
>     email_option.option = EO_LOCALPART;
> ```

Somebody needs to do a comparison, but it should be done at parsing
phase when the --format is grokked, not in grab phase that is run
for each and every ref to be shown.

These patches should only be done after looking at existing
"<basicatom>:<modifiers>" like "objectname:short" etc are handled,
not before, I think.

[1/5] ref-filter: support different email formats

Commit Message

Comments

Patch