diff mbox series

[PATCH/RFC,1/1] diff.c: When appropriate, use utf8_strwidth()

Message ID 20220814133531.16952-1-tboegi@web.de (mailing list archive)
State Superseded
Headers show
Series [PATCH/RFC,1/1] diff.c: When appropriate, use utf8_strwidth() | expand

Commit Message

Torsten Bögershausen Aug. 14, 2022, 1:35 p.m. UTC
From: Torsten Bögershausen <tboegi@web.de>

When unicode filenames (encoded in UTF-8) are used, the visible width
on the screen is not the same as strlen(filename).

For example, `git log --stat` may produce an output like this:

$ git log --stat

[snip the header]

 Arger.txt  | 1 +
 Ärger.txt | 1 +
 2 files changed, 2 insertions(+)

A side note: the original report was about cyrillic filenames.
After some investigations it turned out that
a) This is not a problem with "ambiguous characters" in unicode
b) The same problem exist for all unicode code points (so we
  can use Latin based Umlauts for demonstrations below)

The 'Ä' takes the same space on the screen as the 'A'.
But needs one more byte in memory, so the the `git log --stat` output
for "Arger.txt" (!) gets mis-aligned:
The maximum length is derived from "Ärger.txt", 10 bytes in memory,
9 positions on the screen. That is why "Arger.txt" gets one extra ' '
for aligment, it needs 9 bytes in memory.
If there was a file "Ö", it would be correctly aligned by chance,
but "Öhö" would not.

The solution is of course, to use utf8_strwidth() instead of strlen()
when dealing with the width on screen.

And then there is another problem: code like this
strbuf_addf(&out, "%-*s", len, name);

(or using the underlying snprintf() function) does not align the
buffer to a minimum of len measured in screen-width, but uses the
memory count, if name is UTF-8 encoded.

We could be tempted to wish that snprintf() was UTF-8 aware.
That doesn't seem to be the case anywhere (tested on Linux and Mac),
probably snprintf() uses the "bytes in memory"/strlen() approach to be
compatible with older versions and this will never change.

The choosen solution is to split code in diff.c like this

strbuf_addf(&out, "%-*s", len, name);

into 2 calls, like this:

strbuf_addf(&out, "%s", name);
if (len > utf8_strwidth(name))
    strbuf_addchars(&out, ' ', len - utf8_strwidth(name));

Reported-by: Alexander Meshcheryakov <alexander.s.m@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---
 diff.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

--
2.34.0

Comments

Junio C Hamano Aug. 14, 2022, 11:12 p.m. UTC | #1
tboegi@web.de writes:

> The choosen solution is to split code in diff.c like this
>
> strbuf_addf(&out, "%-*s", len, name);
>
> into 2 calls, like this:
>
> strbuf_addf(&out, "%s", name);
> if (len > utf8_strwidth(name))
>     strbuf_addchars(&out, ' ', len - utf8_strwidth(name));

Makes sense.  Is utf8_strwidth(name) cheap enough that we can call
it twice in a row on the same string casually, or should we avoid it
with a new variable?

It might be worth doing a helper function, even?

	static inline strbuf_pad(struct strbuf *out, const char *s, size_t width)
	{
		size_t w = utf8_strwidth(s);

		strbuf_addstr(out, s);
		if (w < width)
			strbuf_addchars(out, ' ', width - w);
	}

Other than that, sounds very sensible.
Torsten Bögershausen Aug. 15, 2022, 6:34 a.m. UTC | #2
On Sun, Aug 14, 2022 at 04:12:12PM -0700, Junio C Hamano wrote:
> tboegi@web.de writes:
>
> > The choosen solution is to split code in diff.c like this
> >
> > strbuf_addf(&out, "%-*s", len, name);
> >
> > into 2 calls, like this:
> >
> > strbuf_addf(&out, "%s", name);
> > if (len > utf8_strwidth(name))
> >     strbuf_addchars(&out, ' ', len - utf8_strwidth(name));
>
> Makes sense.  Is utf8_strwidth(name) cheap enough that we can call
> it twice in a row on the same string casually, or should we avoid it
> with a new variable?
>
> It might be worth doing a helper function, even?
>
> 	static inline strbuf_pad(struct strbuf *out, const char *s, size_t width)
> 	{
> 		size_t w = utf8_strwidth(s);
>
> 		strbuf_addstr(out, s);
> 		if (w < width)
> 			strbuf_addchars(out, ' ', width - w);
> 	}
>
> Other than that, sounds very sensible.
>

Thanks for the review.

Actually, the commit message is wrong - after writing it, the code
was changed into

if (len > utf8_strwidth(name))
        num_padding_spaces = len - utf8_strwidth(name);

and later

if (num_padding_spaces)
	strbuf_addchars(&out, ' ', num_padding_spaces);

(And having written this, there is probably room for test cases,
IOW: a V2 will come the next days)
Junio C Hamano Aug. 18, 2022, 9 p.m. UTC | #3
Torsten Bögershausen <tboegi@web.de> writes:

> (And having written this, there is probably room for test cases,
> IOW: a V2 will come the next days)

Yeah, that all sounds sensible.

Thanks for working on this.
diff mbox series

Patch

diff --git a/diff.c b/diff.c
index 974626a621..7fb254c545 100644
--- a/diff.c
+++ b/diff.c
@@ -2620,7 +2620,7 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		fill_print_name(file);
-		len = strlen(file->print_name);
+		len = utf8_strwidth(file->print_name);
 		if (max_len < len)
 			max_len = len;

@@ -2734,6 +2734,7 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		char *name = file->print_name;
 		uintmax_t added = file->added;
 		uintmax_t deleted = file->deleted;
+		size_t num_padding_spaces = 0;
 		int name_len;

 		if (!file->is_interesting && (added + deleted == 0))
@@ -2743,7 +2744,7 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 		 * "scale" the filename
 		 */
 		len = name_width;
-		name_len = strlen(name);
+		name_len = utf8_strwidth(name);
 		if (name_width < name_len) {
 			char *slash;
 			prefix = "...";
@@ -2753,10 +2754,14 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			if (slash)
 				name = slash;
 		}
+		if (len > utf8_strwidth(name))
+			num_padding_spaces = len - utf8_strwidth(name);

 		if (file->is_binary) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addf(&out, " %*s", number_width, "Bin");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addf(&out, "| %*s", number_width, "Bin");
 			if (!added && !deleted) {
 				strbuf_addch(&out, '\n');
 				emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
@@ -2776,8 +2781,10 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 			continue;
 		}
 		else if (file->is_unmerged) {
-			strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-			strbuf_addstr(&out, " Unmerged\n");
+			strbuf_addf(&out, " %s%s ", prefix,  name);
+			if (num_padding_spaces)
+				strbuf_addchars(&out, ' ', num_padding_spaces);
+			strbuf_addstr(&out, "| Unmerged\n");
 			emit_diff_symbol(options, DIFF_SYMBOL_STATS_LINE,
 					 out.buf, out.len, 0);
 			strbuf_reset(&out);
@@ -2803,8 +2810,10 @@  static void show_stats(struct diffstat_t *data, struct diff_options *options)
 				add = total - del;
 			}
 		}
-		strbuf_addf(&out, " %s%-*s |", prefix, len, name);
-		strbuf_addf(&out, " %*"PRIuMAX"%s",
+		strbuf_addf(&out, " %s%s ", prefix,  name);
+		if (num_padding_spaces)
+			strbuf_addchars(&out, ' ', num_padding_spaces);
+		strbuf_addf(&out, "| %*"PRIuMAX"%s",
 			number_width, added + deleted,
 			added + deleted ? " " : "");
 		show_graph(&out, '+', add, add_c, reset);