diff mbox series

[v2,4/5] mailmap: use case-sensitive comparisons for local-parts and names

Message ID 20210103211849.2691287-5-sandals@crustytoothpaste.net (mailing list archive)
State New, archived
Headers show
Series Hashed mailmap | expand

Commit Message

brian m. carlson Jan. 3, 2021, 9:18 p.m. UTC
RFC 5321 is clear that the local-part of an email address (the part
before the at sign) is case sensitive, and this has been the case since
the original RFC 821.  It directs us that "the local-part MUST be
interpreted and assigned semantics only by the host specified in the
domain part of the address."

Since we are not that party, it's not correct for us to compare them
case insensitively.  However, we do still want to compare the domain
parts case insensitively, so let's add a helper that downcases the
domain and then compare byte by byte.

Similarly, it's not possible for us to correctly case-fold text in a
locale-insensitive way, so our handling of personal names has also been
subject to bugs.  Additionally, we've never handled non-ASCII characters
correctly, which means that our previous comparisons really only worked
well for a fraction of the people on the planet.  Since our code wasn't
right and it's basically impossible to compare personal names without
regard to case, let's also switch our matching of names to be byte by
byte.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 mailmap.c          | 27 ++++++++++++++++++++++++---
 t/t4203-mailmap.sh |  4 ++--
 2 files changed, 26 insertions(+), 5 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Jan. 4, 2021, 4:10 p.m. UTC | #1
On Sun, Jan 03 2021, brian m. carlson wrote:

[For the purposes of this reply I'm also replying to your commit message
in 3/5, which I think makes things less confusing than two E-Mails]

> Currently, Git always looks up entries in the mailmap in a
> case-insensitive way, both for names and addresses, which is, as
> explained below, suboptimal.
>
> First, for email addresses, RFC 5321 is clear that only domains are case
> insensitive; local-parts (the portion before the at sign) are not.  It
> states this:
> 
>   The local-part of a mailbox MUST BE treated as case sensitive.
>   Therefore, SMTP implementations MUST take care to preserve the case of
>   mailbox local-parts.

It seems better to quote the part of this where it goes on to say:

    However, exploiting the case sensitivity of mailbox local-parts
    impedes interoperability and is discouraged.  Mailbox domains follow
    normal DNS rules and are hence not case sensitive.

And also that RFC 5321 says earlier

    "a host that expects to receive mail SHOULD avoid defining mailboxes
    where the Local-part requires (or uses) the Quoted-string form or
    where the Local-part is case- sensitive.  For any purposes that
    require generating or comparing Local-parts (e.g., to specific
    mailbox names), all quoted forms MUST be treated as equivalent,".

So if we're taking RFC 5321 MUST requirements as something we've got to
do we're still not following it. But as I'll go on to note I think
rationale by RFC 5321 is perhaps not the best thing to do here...

> There exist systems today where local-parts remain case sensitive (and
> this author has one), and as such, it's incorrect for us to case fold
> them in any way.  Let's add a failing test that indicates this is a
> problem, while still keeping the test for case-insensitive domains.

I don't really care much about this feature of mailmap, but I'm
struggling a bit to understand this part.

We're not an SMTP server or SMTP server library, so we don't have to
worry about mail mis-routing or whatever.

We just have to worry about cases where you're not all of these people
in one project's commit metadata and/or .mailmap, and thus mailmap rules
would match too generously:

    "brian m. carlson" <sandals@crustytoothpaste.net>
    "brian m. carlson" <SANDALS@crustytoothpaste.net>
    "BRIAN M. CARLSON" <sandals@crustytoothpaste.net>
    "BRIAN M. CARLSON" <SANDALS@crustytoothpaste.net>

Is that really plausible? In any case, neither of these two patches make
reference to us already having changed this in the past in 1.6.2 and &
there being reports on the ML about the bug & us changing it back. See
https://lore.kernel.org/git/f182fb1700e8dea15459fd02ced2a6e5797bec99.1238458535u.git.johannes.schindelin@gmx.de/

Maybe we should still do this, but I think for a v3 it makes sense to
summarize that discussion etc.

> Note that it's also incorrect for us to case-fold names because we don't
> guarantee that we're using the locale of the author, and it's impossible
> to case-fold names in a locale-insensitive way.  Turkish and Azeri
> contain both a dotted and dotless I, and the uppercase ASCII I folds not
> to the lowercase ASCII I, but to a dotless version, and vice versa with
> the lowercase I.  There are many words in Turkish which differ only in
> the dottedness of the I, so it is likely that there are also personal
> names which differ in the same way.
> 
> That would be a problem even if our implementation were perfect, which
> it is not.  We currently fold only ASCII characters, so this feature has
> never worked correctly for the vast majority of the users on the planet,
> regardless of the locale.  For example, on Linux, even in a Spanish
> locale, we don't handle "Simón" properly.  Even if we did handle that,
> we'd probably also want to implement Unicode normalization, which we
> don't.

As one of those people, aren't you confusing two things here. I daresay
that almost everyone with a non-ASCII name doesn't have a non-ASCII
E-Mail address, sure some do, but treating those as one and the same
doesn't seem to make sense.

Yes you can have non-ASCII on the LHS of @ in an E-Mail address, it just
seems to me that's very rare.

Do we have known cases where we're making use of this case-insensitive
matching of the *name* part? Showing some of those non-ASCII cases in
the tests in 3/5 would be nice.

> IN general, case-folding text is extremely language- and locale-specific
> and requires intimacy with the spelling and grammar of the language in
> question and careful attention to the Unicode details in order to
> produce a result that is meaningful to humans and conforms with
> linguistic and societal norms.
> 
> Because we do not have any of the required context with a plain personal
> name, we cannot hope to possibly case-fold personal names correctly.  We
> should stop trying to do so and just treat them as a series of bytes, so
> let's add a test that we don't case-fold personal names as well.

[... end <snip> of commit 3/5 ...]

> RFC 5321 is clear that the local-part of an email address (the part
> before the at sign) is case sensitive, and this has been the case since
> the original RFC 821.  It directs us that "the local-part MUST be
> interpreted and assigned semantics only by the host specified in the
> domain part of the address."
>
> Since we are not that party, it's not correct for us to compare them
> case insensitively.  However, we do still want to compare the domain
> parts case insensitively, so let's add a helper that downcases the
> domain and then compare byte by byte.
>
> Similarly, it's not possible for us to correctly case-fold text in a
> locale-insensitive way, so our handling of personal names has also been
> subject to bugs.  Additionally, we've never handled non-ASCII characters
> correctly, which means that our previous comparisons really only worked
> well for a fraction of the people on the planet.  Since our code wasn't
> right and it's basically impossible to compare personal names without
> regard to case, let's also switch our matching of names to be byte by
> byte.

I'm still undecided about whether we should be calling strcasecmp() to
begin with, but I don't really find this convincing. Just because we
only support brian@ and BRIAN@ being the same but not ævar@ and ÆVAR@ we
should break the existing behavior for everyone?

> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  mailmap.c          | 27 ++++++++++++++++++++++++---
>  t/t4203-mailmap.sh |  4 ++--
>  2 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/mailmap.c b/mailmap.c
> index d3287b409a..5c52dbb7e0 100644
> --- a/mailmap.c
> +++ b/mailmap.c
> @@ -64,7 +64,22 @@ static void free_mailmap_entry(void *p, const char *s)
>   */
>  static int namemap_cmp(const char *a, const char *b)
>  {
> -	return strcasecmp(a, b);
> +	return strcmp(a, b);
> +}

It seems to me if we're not going to use strcasecmp() we can get rid of
this whole namemap_cmp() function. See the comment just above it &
de2f95ebed2 (mailmap: work around implementations with pure inline
strcasecmp, 2013-09-12). I.e. we had a wrapper function here just
because we were using strcasecmp().

> +/*
> + * Lowercases the domain (and only the domain) part of an email address.  The
> + * local-part, which is defined by RFC 5321 to be case sensitive, is not
> + * affected.
> + */
> +static char *lowercase_email(char *s)
> +{
> +	char *end = strchrnul(s, '@');

Speaking of pedantic readings of RFC 5321, aren't we closing the door
here to proper DQUOTE handling? I.e. the local-part containing a @
within quotes :)
Junio C Hamano Jan. 6, 2021, 12:46 a.m. UTC | #2
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Sun, Jan 03 2021, brian m. carlson wrote:
>
> We just have to worry about cases where you're not all of these people
> in one project's commit metadata and/or .mailmap, and thus mailmap rules
> would match too generously:
>
>     "brian m. carlson" <sandals@crustytoothpaste.net>
>     "brian m. carlson" <SANDALS@crustytoothpaste.net>
>     "BRIAN M. CARLSON" <sandals@crustytoothpaste.net>
>     "BRIAN M. CARLSON" <SANDALS@crustytoothpaste.net>
>
> Is that really plausible? In any case, neither of these two patches make
> reference to us already having changed this in the past in 1.6.2 and &
> there being reports on the ML about the bug & us changing it back. See
> https://lore.kernel.org/git/f182fb1700e8dea15459fd02ced2a6e5797bec99.1238458535u.git.johannes.schindelin@gmx.de/
>
> Maybe we should still do this, but I think for a v3 it makes sense to
> summarize that discussion etc.

After reading the old discussion again, I am not sure if this is
worth doing.  To many people, it is a promise we've made and kept
that we treat addresses including the local part case-insensitively
when summarising commits by ident strings.

I'd really wish that this series were structured to have 5/5 early
and 3&4/5 squashed into a single final patch.
Ævar Arnfjörð Bjarmason Jan. 12, 2021, 2:08 p.m. UTC | #3
On Wed, Jan 06 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> On Sun, Jan 03 2021, brian m. carlson wrote:
>>
>> We just have to worry about cases where you're not all of these people
>> in one project's commit metadata and/or .mailmap, and thus mailmap rules
>> would match too generously:
>>
>>     "brian m. carlson" <sandals@crustytoothpaste.net>
>>     "brian m. carlson" <SANDALS@crustytoothpaste.net>
>>     "BRIAN M. CARLSON" <sandals@crustytoothpaste.net>
>>     "BRIAN M. CARLSON" <SANDALS@crustytoothpaste.net>
>>
>> Is that really plausible? In any case, neither of these two patches make
>> reference to us already having changed this in the past in 1.6.2 and &
>> there being reports on the ML about the bug & us changing it back. See
>> https://lore.kernel.org/git/f182fb1700e8dea15459fd02ced2a6e5797bec99.1238458535u.git.johannes.schindelin@gmx.de/
>>
>> Maybe we should still do this, but I think for a v3 it makes sense to
>> summarize that discussion etc.
>
> After reading the old discussion again, I am not sure if this is
> worth doing.  To many people, it is a promise we've made and kept
> that we treat addresses including the local part case-insensitively
> when summarising commits by ident strings.
>
> I'd really wish that this series were structured to have 5/5 early
> and 3&4/5 squashed into a single final patch.

Something that I only realized after I sent
<87czykvg19.fsf@evledraar.gmail.com>: Any problems .mailmap has with
Turkish "dotless I" we have already with "git log --author=<name> -i".

Not to say that there isn't some problem to solve here, just that if we
do it's a more general issue than mailmap.

As can be inferred from my upthread reply I thought that was ASCII-only,
but it turns out we do set LC_CTYPE based on the user's locale, and it
also applies for English-speakers. E.g. in LANG=en_US.UTF-8
"--author=ævar -i" will work.

Of course that doesn't address the point of whether we should be
DWYM-ing E-Mail addresses, just the sub-claim that one reason we
shouldn't is because of impossible-to-solve Unicode edge cases.
diff mbox series

Patch

diff --git a/mailmap.c b/mailmap.c
index d3287b409a..5c52dbb7e0 100644
--- a/mailmap.c
+++ b/mailmap.c
@@ -64,7 +64,22 @@  static void free_mailmap_entry(void *p, const char *s)
  */
 static int namemap_cmp(const char *a, const char *b)
 {
-	return strcasecmp(a, b);
+	return strcmp(a, b);
+}
+
+/*
+ * Lowercases the domain (and only the domain) part of an email address.  The
+ * local-part, which is defined by RFC 5321 to be case sensitive, is not
+ * affected.
+ */
+static char *lowercase_email(char *s)
+{
+	char *end = strchrnul(s, '@');
+	while (*end) {
+		*end = tolower(*end);
+		end++;
+	}
+	return s;
 }
 
 static void add_mapping(struct string_list *map,
@@ -74,9 +89,13 @@  static void add_mapping(struct string_list *map,
 	struct mailmap_entry *me;
 	struct string_list_item *item;
 
+	lowercase_email(new_email);
+
 	if (old_email == NULL) {
 		old_email = new_email;
 		new_email = NULL;
+	} else {
+		lowercase_email(old_email);
 	}
 
 	item = string_list_insert(map, old_email);
@@ -300,7 +319,7 @@  static struct string_list_item *lookup_prefix(struct string_list *map,
 	 * real location of the key if one exists.
 	 */
 	while (0 <= --i && i < map->nr) {
-		int cmp = strncasecmp(map->items[i].string, string, len);
+		int cmp = strncmp(map->items[i].string, string, len);
 		if (cmp < 0)
 			/*
 			 * "i" points at a key definitely below the prefix;
@@ -323,6 +342,7 @@  int map_user(struct mailmap *map,
 	     const char **email, size_t *emaillen,
 	     const char **name, size_t *namelen)
 {
+	char *searchable_email = xstrndup(*email, *emaillen);
 	struct string_list_item *item;
 	struct mailmap_entry *me;
 
@@ -330,7 +350,8 @@  int map_user(struct mailmap *map,
 		 (int)*namelen, debug_str(*name),
 		 (int)*emaillen, debug_str(*email));
 
-	item = lookup_prefix(map->mailmap, *email, *emaillen);
+	item = lookup_prefix(map->mailmap, searchable_email, *emaillen);
+	free(searchable_email);
 	if (item != NULL) {
 		me = (struct mailmap_entry *)item->util;
 		if (me->namemap.nr) {
diff --git a/t/t4203-mailmap.sh b/t/t4203-mailmap.sh
index 32e849504c..df4a0e03cc 100755
--- a/t/t4203-mailmap.sh
+++ b/t/t4203-mailmap.sh
@@ -187,7 +187,7 @@  nick1 (1):
 
 EOF
 
-test_expect_failure 'name entry after email entry, case-sensitive local-part' '
+test_expect_success 'name entry after email entry, case-sensitive local-part' '
 	mkdir -p internal_mailmap &&
 	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
 	echo "Internal Guy <BUGS@company.xx>" >>internal_mailmap/.mailmap &&
@@ -195,7 +195,7 @@  test_expect_failure 'name entry after email entry, case-sensitive local-part' '
 	test_cmp expect actual
 '
 
-test_expect_failure 'name entry after email entry, case-sensitive personal name' '
+test_expect_success 'name entry after email entry, case-sensitive personal name' '
 	mkdir -p internal_mailmap &&
 	echo "<bugs@company.xy> <bugs@company.xx>" >internal_mailmap/.mailmap &&
 	echo "Nick1 <bugs@company.xx> NICK1 <bugs@company.xx>" >internal_mailmap/.mailmap &&