diff mbox series

[v4] pretty-formats: add hard truncation, without ellipsis, options

Message ID 20221112143616.1429-1-philipoakley@iee.email (mailing list archive)
State New, archived
Headers show
Series [v4] pretty-formats: add hard truncation, without ellipsis, options | expand

Commit Message

Philip Oakley Nov. 12, 2022, 2:36 p.m. UTC
Instead of replacing with "..", replace with the empty string,
implied by passing NULL, and adjust the padding length calculation.

Extend the existing tests for these pretty formats to include
`Trunc` and Ltrunc` options matching the `trunc` and `ltrunc`
tests.

Signed-off-by: Philip Oakley <philipoakley@iee.email>
---

Removed left over comments from locations that had needed QZ 'space'
conversion in the here-docs.

To: GitList <git@vger.kernel.org>
Cc: Taylor Blau <me@ttaylorr.com>
Cc: Junio C Hamano <gitster@pobox.com>
Cc: NSENGIYUMVA WILBERFORCE <nsengiyumvawilberforce@gmail.com>
In-reply-to: <20221102120853.2013-1-philipoakley@iee.email> 

This should complete the patch (cf. [1] <Y2rPAGp96IwrLS1T@nand.local>)
Note: latest What's Cooking <Y2riRSL+NprJt278@nand.local> hadn't been updated. Tests are included.

A final review would be welcomed.

Range-diff against v3:
1:  498439f0375 ! 1:  d26dabc5271 pretty-formats: add hard truncation, without ellipsis, options
    @@ t/t6006-rev-list-format.sh: commit $head1
      added (hinzugef${added_utf8_part}gt..
      EOF
      
    -+# ZZZ for a space?
     +test_format subject-truncated "%<($truncate_count,Trunc)%s" qz_to_tab_space <<EOF
     +commit $head2
     +changed (ge${changed_utf8_part}ndert) f
    @@ t/t6006-rev-list-format.sh: commit $head1
      added (hinzugef${added_utf8_part_iso88591}gt..
      EOF
      
    -+# need qz qz_to_tab_space
     +test_format complex-subject-Trunc "%<($truncate_count,Trunc)%s" qz_to_tab_space <<EOF
     +commit $head3
     +Test printing of com
    @@ t/t6006-rev-list-format.sh: commit $head1
      added (hinzugef${added_utf8_part}gt..
      EOF
      
    -+# need qz_to_tab_space
     +test_format complex-subject-commitencoding-unset-Trunc "%<($truncate_count,Trunc)%s" qz_to_tab_space <<EOF
     +commit $head3
     +Test printing of com

 Documentation/pretty-formats.txt | 11 +++---
 pretty.c                         | 18 ++++++++-
 t/t4205-log-pretty-formats.sh    | 67 ++++++++++++++++++++++++++++++++
 t/t6006-rev-list-format.sh       | 42 ++++++++++++++++++++
 4 files changed, 132 insertions(+), 6 deletions(-)

Comments

Junio C Hamano Nov. 21, 2022, 12:34 a.m. UTC | #1
Philip Oakley <philipoakley@iee.email> writes:

> Instead of replacing with "..", replace with the empty string,
> implied by passing NULL, and adjust the padding length calculation.

What's the point of saying "implied by passing NULL" here?  Is it an
excuse for passing NULL when passing "" would have sufficed and been
more natural, or something?  Also, it is unclear to whom you are
passing the NULL.  I think that it is sufficient that you said
"replace with the empty string" there.

> Extend the existing tests for these pretty formats to include
> `Trunc` and Ltrunc` options matching the `trunc` and `ltrunc`
> tests.

A more important thing to say is that we add Trunc and Ltrunc, than
we test for these new features ;-)

You may also want to explain why there is no matching Mtrunc added.

I also have another comment on the design.

Imagine there are series of wide characters, each occupying two
display columns, and you give 6 display columns to truncate such a
string into.  "trunc" would give you "[][].." (where [] denotes one
such wide letter that occupies two display columns), and "Trunc"
would give you "[][][]".  Now if you give only 5 display columns,
to fill instead of 6, what should happen?

I do not recall how ".."-stuffed truncation works in this case but
it should notice that it cannot stuff 3 wide letters and give you
"[][].".  The current code may be already buggy, but at least at the
design level, it is fairly clear what the feature _should_ do.

As a design question, what should "Trunc" do in such a case now?  I
do not think we can still call it "hard truncate" if the feature
gives "[][]" (i.e. fill only 4 display columns, resulting in a
string that is not wide enough) or "[][][]" (i.e. exceed 5 columns
that are given), but of course chomping a letter in the middle is
not acceptable behaviour, so ...
Philip Oakley Nov. 21, 2022, 6:10 p.m. UTC | #2
Hi Junio

On 21/11/2022 00:34, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> Instead of replacing with "..", replace with the empty string,
>> implied by passing NULL, and adjust the padding length calculation.
> What's the point of saying "implied by passing NULL" here?  Is it an
> excuse for passing NULL when passing "" would have sufficed and been
> more natural, or something?  

Passing the empty string was my first approach, however Taylor had
commented "why pass the empty string, when NULL will do", hence I
checked, and yes, we can pass NULL, so I followed that guidance on the
re-roll.

> Also, it is unclear to whom you are
> passing the NULL.  I think that it is sufficient that you said
> "replace with the empty string" there.

I could drop the commit message comment, and keep the NULL being passed
tostrbuf_utf8_replace to indicate the empty string, though that may
create the same reviewer question that Taylor had.
>
>> Extend the existing tests for these pretty formats to include
>> `Trunc` and Ltrunc` options matching the `trunc` and `ltrunc`
>> tests.
> A more important thing to say is that we add Trunc and Ltrunc, than
> we test for these new features ;-)

That would be to swap the paragraphs about, yes?
>
> You may also want to explain why there is no matching Mtrunc added.

Can do, though it felt obvious that the original mtrunc ellipsis would
be necessary for the mid-case.
>
> I also have another comment on the design.
>
> Imagine there are series of wide characters, each occupying two
> display columns, and you give 6 display columns to truncate such a
> string into.  "trunc" would give you "[][].." (where [] denotes one
> such wide letter that occupies two display columns), and "Trunc"
> would give you "[][][]".  Now if you give only 5 display columns,
> to fill instead of 6, what should happen?

My reading of the existing code for ltrunc/mtrunc/trunc was that all
these padding conditions were already covered. It was simply a matter of
column counting.
>
> I do not recall how ".."-stuffed truncation works in this case but
> it should notice that it cannot stuff 3 wide letters and give you
> "[][].".  The current code may be already buggy, but at least at the
> design level, it is fairly clear what the feature _should_ do.
>
> As a design question, what should "Trunc" do in such a case now?  I
> do not think we can still call it "hard truncate" if the feature
> gives "[][]" (i.e. fill only 4 display columns, resulting in a
> string that is not wide enough) or "[][][]" (i.e. exceed 5 columns
> that are given), but of course chomping a letter in the middle is
> not acceptable behaviour, so ...
The design had already covered those cases. The author already had those
thoughts

--

Philip
Junio C Hamano Nov. 22, 2022, 12:57 a.m. UTC | #3
Philip Oakley <philipoakley@iee.email> writes:

>> As a design question, what should "Trunc" do in such a case now?  I
>> do not think we can still call it "hard truncate" if the feature
>> gives "[][]" (i.e. fill only 4 display columns, resulting in a
>> string that is not wide enough) or "[][][]" (i.e. exceed 5 columns
>> that are given), but of course chomping a letter in the middle is
>> not acceptable behaviour, so ...
> The design had already covered those cases. The author already had those
> thoughts

Sorry, I was saying that none of

 * giving only [][] to fill only 4 display columns, without filling
   the given 5 display columns,

 * giving [][][] to fill 6 display columns, exceeding the given 5
   display columns,

 * giving [][][ that chomps a letter in the middle, in a failed
   attempt to fill exactly 5 displya columns.

would be a sensible design of the behaviour for "Trunc", so I am not
sure what "had already covered" really mean...
Philip Oakley Nov. 23, 2022, 2:26 p.m. UTC | #4
On 22/11/2022 00:57, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>>> As a design question, what should "Trunc" do in such a case now?  I
>>> do not think we can still call it "hard truncate" if the feature
>>> gives "[][]" (i.e. fill only 4 display columns, resulting in a
>>> string that is not wide enough) or "[][][]" (i.e. exceed 5 columns
>>> that are given), but of course chomping a letter in the middle is
>>> not acceptable behaviour, so ...
>> The design had already covered those cases. The author already had those
>> thoughts
> Sorry, I was saying that none of
>
>  * giving only [][] to fill only 4 display columns, without filling
>    the given 5 display columns,
>
>  * giving [][][] to fill 6 display columns, exceeding the given 5
>    display columns,
>
>  * giving [][][ that chomps a letter in the middle, in a failed
>    attempt to fill exactly 5 displya columns.
>
> would be a sensible design of the behaviour for "Trunc", so I am not
> sure what "had already covered" really mean...
>
I'm still unsure what you are trying to say here.

Is this a question about the prior `trunc`, `mtrunc`, and `ltrunc`
design and tests?
e.g. how complete are their tests?

Is it looking for an extended explanation of the 'fit in N-columns'
design approaches that they all have?

I'm happy to add a paragraph saying that an `Mtrunc`, i.e. miss out the
middle characters to fit the set column width, without using the two dot
ellipsis (`..`), was considered a non starter because of potential
confusion when looking at such output when tabulated.

The existing code (and tests) already covers the need to hide the
characters those two dots (ellipsis) consumed win the N-column tabulated
output. The tests also include utf8-encoded characters.

All the previous tests for `trunc` and `'ltrunc` (i.e. with ellipsis)
have been repeated for the `Trunc` and `Ltrunc` (without ellipsis) hard
truncation commands, and their expected outputs updated, including the
use of the qz_to_tab_space  for those case where trailing spaces are now
present.

Does that cover your questions?

--
Philip
Junio C Hamano Nov. 25, 2022, 7:11 a.m. UTC | #5
Philip Oakley <philipoakley@iee.email> writes:

> On 22/11/2022 00:57, Junio C Hamano wrote:
>> Philip Oakley <philipoakley@iee.email> writes:
>>
>>>> As a design question, what should "Trunc" do in such a case now?  I
>>>> do not think we can still call it "hard truncate" if the feature
>>>> gives "[][]" (i.e. fill only 4 display columns, resulting in a
>>>> string that is not wide enough) or "[][][]" (i.e. exceed 5 columns
>>>> that are given), but of course chomping a letter in the middle is
>>>> not acceptable behaviour, so ...
>>> The design had already covered those cases. The author already had those
>>> thoughts
>> Sorry, I was saying that none of
>>
>>  * giving only [][] to fill only 4 display columns, without filling
>>    the given 5 display columns,
>>
>>  * giving [][][] to fill 6 display columns, exceeding the given 5
>>    display columns,
>>
>>  * giving [][][ that chomps a letter in the middle, in a failed
>>    attempt to fill exactly 5 displya columns.
>>
>> would be a sensible design of the behaviour for "Trunc", so I am not
>> sure what "had already covered" really mean...
>>
> I'm still unsure what you are trying to say here.
>
> Is this a question about the prior `trunc`, `mtrunc`, and `ltrunc`
> design and tests?
> e.g. how complete are their tests?

No.  As I said, the existing lowercase ones may already be buggy (I
didn't check), in that they may do "[][].." or "[][][]" when told to
"trunc" fill a string with four or more double-width letters into a
5 display space.  But the point is at least for these with ellipsis
it is fairly clear what the desired behaviour is.  For "trunc" in
the above example, I think the right thing for it to do would be to
do "[][].", i.e. consume exactly 5 display columns, and avoid
exceeding the given space by not giving two dots but just one.

But with "hard truncate", I do not think we can define any sensible
behaviour when we ask it to do the same.  Giving "[][]" leaves one
display space unconsumed and giving "[][][]" would exceed the given
space, so anything you would write after that would be unaligned on
the line.

As to the tests, the question was, whatever the designed behaviour
for the above case, if they record the design choice made by this
series (even though, as I said, I suspect no design choice for the
"hard-fill/trunc odd number of columns with a run of double-width
letters" problem is satisfactory).
Philip Oakley Nov. 26, 2022, 2:32 p.m. UTC | #6
On 25/11/2022 07:11, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> On 22/11/2022 00:57, Junio C Hamano wrote:
>>> Philip Oakley <philipoakley@iee.email> writes:
>>>
>>>>> As a design question, what should "Trunc" do in such a case now?  I
>>>>> do not think we can still call it "hard truncate" if the feature
>>>>> gives "[][]" (i.e. fill only 4 display columns, resulting in a
>>>>> string that is not wide enough) or "[][][]" (i.e. exceed 5 columns
>>>>> that are given), but of course chomping a letter in the middle is
>>>>> not acceptable behaviour, so ...
>>>> The design had already covered those cases. The author already had those
>>>> thoughts
>>> Sorry, I was saying that none of
>>>
>>>  * giving only [][] to fill only 4 display columns, without filling
>>>    the given 5 display columns,
>>>
>>>  * giving [][][] to fill 6 display columns, exceeding the given 5
>>>    display columns,
>>>
>>>  * giving [][][ that chomps a letter in the middle, in a failed
>>>    attempt to fill exactly 5 displya columns.
>>>
>>> would be a sensible design of the behaviour for "Trunc", so I am not
>>> sure what "had already covered" really mean...
>>>
>> I'm still unsure what you are trying to say here.
>>
>> Is this a question about the prior `trunc`, `mtrunc`, and `ltrunc`
>> design and tests?
>> e.g. how complete are their tests?
> No.  As I said, the existing lowercase ones may already be buggy (I
> didn't check),

That question hadn't come across in the previous emails.

>  in that they may do "[][].." or "[][][]" when told to
> "trunc" fill a string with four or more double-width letters into a
> 5 display space.  But the point is at least for these with ellipsis
> it is fairly clear what the desired behaviour is.

That "is fairly clear" is probably the problem. In retrospect it's not
clear in the docs that the "%<(N" format is (would appear to be) about
defining the display width, in terminal character columns, that the
selected parameter is to be displayed within.

The code already pads the displayed parameter with spaces as required if
the parameter is shorter than the display width - the else condition in
pretty.c L1750

>   For "trunc" in
> the above example, I think the right thing for it to do would be to
> do "[][].", i.e. consume exactly 5 display columns, and avoid
> exceeding the given space by not giving two dots but just one.

The existing choice is padding "[][]" with a single space to reach 5
display chars.
For the 6-char "[][][]" truncation it is "[][..", i.e. 3 chars from
"[][][]", then the two ".." dots of the ellipsis.
>
> But with "hard truncate", I do not think we can define any sensible
> behaviour 

Given the existing code and the display column expectation, it felt
rather simple to apply the hard cut at the display width, or pad with
spaces if the chars to display were shorter than the allocated columns.

> when we ask it to do the same.  Giving "[][]" leaves one
> display space unconsumed and giving "[][][]" would exceed the given
> space, so anything you would write after that would be unaligned on
> the line.

A careful read (and testing) of the existing 'mtrunc' does show a
rounding error bug though. Its a confusion between the computed start
and end points of the cut that loses one display column (the string
displayed is short by one when the count is odd, IIUC).

>
> As to the tests, the question was, whatever the designed behaviour
> for the above case, if they record the design choice made by this
> series (even though, as I said, I suspect no design choice for the
> "hard-fill/trunc odd number of columns with a run of double-width
> letters" problem is satisfactory).

The existing tests already cover the trailing space padding issues.
However I'd agree that the documentation (and original commit messages
for the existing code) do not make clear the distinction between display
columns to be filled and characters in the displayed parameters (aka
'placeholders').

Within that, there is also the "%<(N" and "%<|(M" lack of clarity, where
the N count is for a single place holder's character count, while the M
value is the terminal's display column number, of the 24x80 column style.

Finally, the docs use of '<N>'  in the middle the place holders title
that also use `<` and `>` is a readability nightmare. Given that the
text then only talks about `N` it may be sensible to drop the
surrounding `<.>` to reduce confusion for these alignment placeholders.

--

Philip
Philip Oakley Nov. 26, 2022, 10:44 p.m. UTC | #7
On 26/11/2022 14:32, Philip Oakley wrote:
> A careful read (and testing) of the existing 'mtrunc' does show a
> rounding error bug though. Its a confusion between the computed start
> and end points of the cut that loses one display column (the string
> displayed is short by one when the count is odd, IIUC).

I had confused myself.

My test case `git log --graph --format="%<(19,mtrunc)%d=2" -6` had a
2-char step in the graph that I confused with the effects between
repeated runs with the 'N' value adjusted by +1 and -1.

I then jumped to conclusions about the integer division in the mid
string position of the mtrunc case win the code.

Sorry for that.
Junio C Hamano Nov. 26, 2022, 11:19 p.m. UTC | #8
Philip Oakley <philipoakley@iee.email> writes:

>>  in that they may do "[][].." or "[][][]" when told to
>> "trunc" fill a string with four or more double-width letters into a
>> 5 display space.  But the point is at least for these with ellipsis
>> it is fairly clear what the desired behaviour is.
>
> That "is fairly clear" is probably the problem. In retrospect it's not
> clear in the docs that the "%<(N" format is (would appear to be) about
> defining the display width, in terminal character columns, that the
> selected parameter is to be displayed within.
>
> The code already pads the displayed parameter with spaces as required if
> the parameter is shorter than the display width - the else condition in
> pretty.c L1750
>
>>   For "trunc" in
>> the above example, I think the right thing for it to do would be to
>> do "[][].", i.e. consume exactly 5 display columns, and avoid
>> exceeding the given space by not giving two dots but just one.
>
> The existing choice is padding "[][]" with a single space to reach 5
> display chars.
> For the 6-char "[][][]" truncation it is "[][..", i.e. 3 chars from
> "[][][]", then the two ".." dots of the ellipsis.

Here, I realize that I did not explain the scenario well.  The
message you are responding to was meant to be a clarification of my
earlier message and it should have done a better job but apparently
I failed.  Sorry, and let me try again.

The single example I meant to use to illustrate the scenario I worry
about is this.  There is a string, in which there are four (or more)
letters, each of which occupies two display columns.  And '[]' in my
earlier messages stood for a SINGLE such letter (I just wanted to
stick to ASCII, instead of using East Asian script, for
illustration).  So "[][.." is not possible (you are chomping the
second such letter in half).

I could use East Asian 一二三四 (there are four letters, denoting
one, two, three, and four, each occupying two display spaces when
typeset in a fixed width font), but to make it easier to see in
ASCII only text, let's pretend "[1]", "[2]", "[3]", "[4]" are such
letters.  You cannot chomp them in the middle (and please pretend
each of them occupy two, not three, display spaces).

When the given display space is 6 columns, we can fit 2 such letters
plus ".." in the space.  If the original string were [1][2][3][4],
it is clear trunk and ltrunk can do "[1][2].." (remember [n] stands
for a single letter whose width is 2 columns, so that takes 6
columns) and "..[3][4]", respectively.  It also is clear that Trunk
and Ltrunk can do "[1][2][3]" and "[2][3][4]", respectively.  We
truncate the given string so that we fill the alloted display
columns fully.

If the given display space is 5 columns, the desirable behaviour for
trunk and ltrunk is still clear.  Instead of consuming two dots, we
could use a single dot as the filler.  As I said, I suspect that the
implementation of trunk and ltrunc does this correctly, though.

My worry is it is not clear what Trunk and Ltrunk should do in that
case.  There is no way to fit a substring of [1][2][3][4] into 5
columns without any filler.
Philip Oakley Nov. 28, 2022, 1:39 p.m. UTC | #9
On 26/11/2022 23:19, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>>>  in that they may do "[][].." or "[][][]" when told to
>>> "trunc" fill a string with four or more double-width letters into a
>>> 5 display space.  But the point is at least for these with ellipsis
>>> it is fairly clear what the desired behaviour is.
>> That "is fairly clear" is probably the problem. In retrospect it's not
>> clear in the docs that the "%<(N" format is (would appear to be) about
>> defining the display width, in terminal character columns, that the
>> selected parameter is to be displayed within.
>>
>> The code already pads the displayed parameter with spaces as required if
>> the parameter is shorter than the display width - the else condition in
>> pretty.c L1750
>>
>>>   For "trunc" in
>>> the above example, I think the right thing for it to do would be to
>>> do "[][].", i.e. consume exactly 5 display columns, and avoid
>>> exceeding the given space by not giving two dots but just one.
>> The existing choice is padding "[][]" with a single space to reach 5
>> display chars.
>> For the 6-char "[][][]" truncation it is "[][..", i.e. 3 chars from
>> "[][][]", then the two ".." dots of the ellipsis.
> Here, I realize that I did not explain the scenario well.  The
> message you are responding to was meant to be a clarification of my
> earlier message and it should have done a better job but apparently
> I failed.  Sorry, and let me try again.
>
> The single example I meant to use to illustrate the scenario I worry
> about is this.  There is a string, in which there are four (or more)
> letters, each of which occupies two display columns.  And '[]' in my
> earlier messages stood for a SINGLE such letter (I just wanted to
> stick to ASCII, instead of using East Asian script, for
> illustration).  So "[][.." is not possible (you are chomping the
> second such letter in half).
>
> I could use East Asian 一二三四 (there are four letters, denoting
> one, two, three, and four, each occupying two display spaces when
> typeset in a fixed width font),

Thanks for that clarification, I'd been thinking it was about c char
(bytes) such as ASCII and multi-byte characters (code points), e.g.
European umlaut style distinctions.

I hadn't really picked up on the distinction between wide and narrow
'glyphs' (if that's the right term to use).
 
I see that the code does properly count the widths of narrow and wide
code points as 1 and 2 columns of the display, but then doesn't
explicitly try any adjustment for the wide code point problem you noted.
>  but to make it easier to see in
> ASCII only text, let's pretend "[1]", "[2]", "[3]", "[4]" are such
> letters.  You cannot chomp them in the middle (and please pretend
> each of them occupy two, not three, display spaces).
>
> When the given display space is 6 columns, we can fit 2 such letters
> plus ".." in the space.  If the original string were [1][2][3][4],
> it is clear trunk and ltrunk can do "[1][2].." (remember [n] stands
> for a single letter whose width is 2 columns, so that takes 6
> columns) and "..[3][4]", respectively.  It also is clear that Trunk
> and Ltrunk can do "[1][2][3]" and "[2][3][4]", respectively.  We
> truncate the given string so that we fill the alloted display
> columns fully.
>
> If the given display space is 5 columns, the desirable behaviour for
> trunk and ltrunk is still clear.  Instead of consuming two dots, we
> could use a single dot as the filler.  As I said, I suspect that the
> implementation of trunk and ltrunc does this correctly, though.

I believe there is a possible solution that, if we detect a column
over-run, then we can go back and replace the current two column double
dot with a narrow U+2026 Horizontal ellipsis, to regain the needed column.
>
> My worry is it is not clear what Trunk and Ltrunk should do in that
> case.  There is no way to fit a substring of [1][2][3][4] into 5
> columns without any filler.
For this case where the final code point overruns, my solution
could/would be to use the Vertical ellipsis U+22EE "⋮" to re-write that
final character (though the Unicode Replacement Character "�" could be
used, but that's ugly)

I suspect the code would need some close reading to ensure that the
column counting and replacement would correctly cope with the 'off by
one' wide width case inside the strbuf_utf8_replace().

I.e. given the same off-by-one position and replacement length, get back
to the same point to replace either the double dot or the final code
point in an idempotent manner.

The logic feels sound, as long as there are no three wide crocodile
code-points. Either we counted the right number of columns, or we
over-ran by one, so we go back and substitute with a one-for-two
replacement.

Philip

For watchers, https://github.com/microsoft/terminal/issues/4345 shows
some of the issues in the general case.
Junio C Hamano Nov. 29, 2022, 12:18 a.m. UTC | #10
Philip Oakley <philipoakley@iee.email> writes:

>> If the given display space is 5 columns, the desirable behaviour for
>> trunk and ltrunk is still clear.  Instead of consuming two dots, we
>> could use a single dot as the filler.  As I said, I suspect that the
>> implementation of trunk and ltrunc does this correctly, though.
>
> I believe there is a possible solution that, if we detect a column
> over-run, then we can go back and replace the current two column double
> dot with a narrow U+2026 Horizontal ellipsis, to regain the needed column.

It would work, too.  As "trunc" and "ltrunc" (and "mtrunc") are
about "truncate and show the filler to indicate some letters in the
original are not shown, and make the result exactly N columns", it
can use a narrower filler than the originally specified and use the
alloted space.

>> My worry is it is not clear what Trunk and Ltrunk should do in that
>> case.  There is no way to fit a substring of [1][2][3][4] into 5
>> columns without any filler.
> For this case where the final code point overruns, my solution
> could/would be to use the Vertical ellipsis U+22EE "⋮" to re-write that
> final character (though the Unicode Replacement Character "�" could be
> used, but that's ugly)

That would be "trunc" not "Trunc", wouldn't it?  If the capital
letter verions are "hard truncate" without filler, it fundamentally
cannot be done to fill 5 display spaces only with letters each of
which take 2 display spaces.

I am not saying that there is an impossible situation to handle and
"hard truncate" primitives should not be invented.  I just want us
to specify (and document) what happens in such a case, so that it no
longer is an "impossible" situation (we can say "odd/leftover
columns are filled with minimum number of SP to make the result N
display spaces", for example).  And not calling it "hard truncate"
might be a good place to start, as it won't be "hard truncate" with
such a padding.
Philip Oakley Dec. 7, 2022, 12:24 a.m. UTC | #11
Apologies for the delays.
On 26/11/2022 23:19, Junio C Hamano wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>>>  in that they may do "[][].." or "[][][]" when told to
>>> "trunc" fill a string with four or more double-width letters into a
>>> 5 display space.  But the point is at least for these with ellipsis
>>> it is fairly clear what the desired behaviour is.
>> That "is fairly clear" is probably the problem. In retrospect it's not
>> clear in the docs that the "%<(N" format is (would appear to be) about
>> defining the display width, in terminal character columns, that the
>> selected parameter is to be displayed within.
>>
>> The code already pads the displayed parameter with spaces as required if
>> the parameter is shorter than the display width - the else condition in
>> pretty.c L1750
>>
>>>   For "trunc" in
>>> the above example, I think the right thing for it to do would be to
>>> do "[][].", i.e. consume exactly 5 display columns, and avoid
>>> exceeding the given space by not giving two dots but just one.
>> The existing choice is padding "[][]" with a single space to reach 5
>> display chars.
>> For the 6-char "[][][]" truncation it is "[][..", i.e. 3 chars from
>> "[][][]", then the two ".." dots of the ellipsis.
> Here, I realize that I did not explain the scenario well.  The
> message you are responding to was meant to be a clarification of my
> earlier message and it should have done a better job but apparently
> I failed.  Sorry, and let me try again.
>
> The single example I meant to use to illustrate the scenario I worry
> about is this.  There is a string, in which there are four (or more)
> letters, each of which occupies two display columns.

I hadn't appreciated that utf8 has 'wide' characters that are
effectively double width, i.e. use 2 display columns, such as emojis.
I had been well aware of the multi-byte nature of utf8, and vaguely
aware of the potential for combined characters.

>   And '[]' in my
> earlier messages stood for a SINGLE such letter (I just wanted to
> stick to ASCII, instead of using East Asian script, for
> illustration).  So "[][.." is not possible (you are chomping the
> second such letter in half).
>
> I could use East Asian 一二三四 (there are four letters, denoting
> one, two, three, and four, each occupying two display spaces when
> typeset in a fixed width font),

These 4 characters came through ok.
>  but to make it easier to see in
> ASCII only text, let's pretend "[1]", "[2]", "[3]", "[4]" are such
> letters.  You cannot chomp them in the middle (and please pretend
> each of them occupy two, not three, display spaces).
>
> When the given display space is 6 columns, we can fit 2 such letters
> plus ".." in the space.  If the original string were [1][2][3][4],
> it is clear trunk and ltrunk can do "[1][2].." 

> (remember [n] stands
> for a single letter whose width is 2 columns, so that takes 6
> columns) 

Aside: The man pages aren't that clear about the distinction between
display columns and characters, both for these width based formats
(allow this placeholder parameter a width of <N> columns), and the
terminal's column position formats (start this parameter at the
on-screen column <N>) .


> and "..[3][4]", respectively.  It also is clear that Trunk
> and Ltrunk can do "[1][2][3]" and "[2][3][4]", respectively.  We
> truncate the given string so that we fill the alloted display
> columns fully.

While this example is clear, it's not clear what should be done if we
have mixed width strings, e.g. with emojis, as the boundaries in random
text will also be randomly placed.
>
> If the given display space is 5 columns, the desirable behaviour for
> trunk and ltrunk is still clear.  Instead of consuming two dots, we
> could use a single dot as the filler.  As I said, I suspect that the
> implementation of trunk and ltrunc does this correctly, though.

The current implementation is buggy in the case of combining character
code points. We forget to add the (zero space) combining code points
once we have the base character when it is the character before the
split (where the double dot ".." is inserted). I.e. the zero space
characters are added after the ".." double dot.

This can cause problems with the existing code for `mtrunc` where the
'lost' combing code points then become attached to the preceding "two
dots".
>
> My worry is it is not clear what Trunk and Ltrunk should do in that
> case.  There is no way to fit a substring of [1][2][3][4] into 5
> columns without any filler.

I'm not sure that anyone has fully solved that issue of what to do when
a wide character overlaps the end of an available display space, such as
terminal word-wrap when resizing the window (in my mintty terminal
window it displays a 'space' then starts the wide character on a new line).

There's also a 'bug' reported for one of the microsoft terminals that
the user wants to position the cursor at a column that is the middle of
a wide emoji character and wants half of it over written!

For our case (no wordwrap) I'd be minded to to mark the end column with
a single width character to show that some (wide) thing should be here,
but we've had to cut it off, such as the vertical ellipsis. At least
it's explainable.

I'll at least work on the doc clarification regarding the column width,
column position and wide char (2-col) issue, and hopefully a few failing
tests for the combing code point and the wide char fitment issue.

--
Philip
Junio C Hamano Dec. 7, 2022, 12:54 a.m. UTC | #12
Philip Oakley <philipoakley@iee.email> writes:

>> and "..[3][4]", respectively.  It also is clear that Trunk
>> and Ltrunk can do "[1][2][3]" and "[2][3][4]", respectively.  We
>> truncate the given string so that we fill the alloted display
>> columns fully.
>
> While this example is clear, it's not clear what should be done if we
> have mixed width strings, e.g. with emojis, as the boundaries in random
> text will also be randomly placed.

As long as wider letters have widths that is integral of the
narrowest letters (ASCII?), "use N columns, padding with '.' if
needed" has a reasonable solution, no?  "[1]A[2]" occupies 2+1+2
columns, so trunc that is given only 3 (or 4) columns can drop the
last "[2]" and fit "[1]A" in the given columns with padding.

> I'll at least work on the doc clarification regarding the column width,
> column position and wide char (2-col) issue, and hopefully a few failing
> tests for the combing code point and the wide char fitment issue.

Thanks, that would give us a very good starting point.
diff mbox series

Patch

diff --git a/Documentation/pretty-formats.txt b/Documentation/pretty-formats.txt
index 0b4c1c8d98a..0bd339db333 100644
--- a/Documentation/pretty-formats.txt
+++ b/Documentation/pretty-formats.txt
@@ -146,14 +146,15 @@  The placeholders are:
 '%m':: left (`<`), right (`>`) or boundary (`-`) mark
 '%w([<w>[,<i1>[,<i2>]]])':: switch line wrapping, like the -w option of
 			    linkgit:git-shortlog[1].
-'%<(<N>[,trunc|ltrunc|mtrunc])':: make the next placeholder take at
+'%<(<N>[,trunc|ltrunc|mtrunc|Trunc|Ltrunc])':: make the next placeholder take at
 				  least N columns, padding spaces on
 				  the right if necessary.  Optionally
-				  truncate at the beginning (ltrunc),
-				  the middle (mtrunc) or the end
-				  (trunc) if the output is longer than
-				  N columns.  Note that truncating
+				  truncate (with ellipsis '..') at the beginning (ltrunc) `..ft`,
+				  the middle (mtrunc) `mi..le` or the end
+				  (trunc) `rig..` if the output is longer than
+				  N columns.  Note that truncating with ellipsis
 				  only works correctly with N >= 2.
+				  Use (Trunc|Ltrunc) for hard truncation without ellipsis.
 '%<|(<N>)':: make the next placeholder take at least until Nth
 	     columns, padding spaces on the right if necessary
 '%>(<N>)', '%>|(<N>)':: similar to '%<(<N>)', '%<|(<N>)' respectively,
diff --git a/pretty.c b/pretty.c
index 6cb363ae1c9..1e873d3f41a 100644
--- a/pretty.c
+++ b/pretty.c
@@ -857,7 +857,9 @@  enum trunc_type {
 	trunc_none,
 	trunc_left,
 	trunc_middle,
-	trunc_right
+	trunc_right,
+	trunc_left_hard,
+	trunc_right_hard
 };
 
 struct format_commit_context {
@@ -1145,6 +1147,10 @@  static size_t parse_padding_placeholder(const char *placeholder,
 				c->truncate = trunc_left;
 			else if (starts_with(start, "mtrunc)"))
 				c->truncate = trunc_middle;
+			else if (starts_with(start, "Ltrunc)"))
+				c->truncate = trunc_left_hard;
+			else if (starts_with(start, "Trunc)"))
+				c->truncate = trunc_right_hard;
 			else
 				return 0;
 		} else
@@ -1743,6 +1749,16 @@  static size_t format_and_pad_commit(struct strbuf *sb, /* in UTF-8 */
 					    padding - 2, len - (padding - 2),
 					    "..");
 			break;
+		case trunc_left_hard:
+			strbuf_utf8_replace(&local_sb,
+					    0, len - padding,
+					    NULL);
+			break;
+		case trunc_right_hard:
+			strbuf_utf8_replace(&local_sb,
+					    padding, len - padding,
+					    NULL);
+			break;
 		case trunc_none:
 			break;
 		}
diff --git a/t/t4205-log-pretty-formats.sh b/t/t4205-log-pretty-formats.sh
index e448ef2928a..55dca307983 100755
--- a/t/t4205-log-pretty-formats.sh
+++ b/t/t4205-log-pretty-formats.sh
@@ -261,6 +261,17 @@  test_expect_success 'left alignment formatting with trunc' '
 	test_cmp expected actual
 '
 
+test_expect_success 'left alignment formatting with Trunc' '
+	git log --pretty="tformat:%<(10,Trunc)%s" >actual &&
+	qz_to_tab_space <<-\EOF >expected &&
+	message tw
+	message on
+	add bar  Z
+	initial. a
+	EOF
+	test_cmp expected actual
+'
+
 test_expect_success 'left alignment formatting with trunc. i18n.logOutputEncoding' '
 	git -c i18n.logOutputEncoding=$test_encoding log --pretty="tformat:%<(10,trunc)%s" >actual &&
 	qz_to_tab_space <<-\EOF | iconv -f utf-8 -t $test_encoding >expected &&
@@ -272,6 +283,17 @@  test_expect_success 'left alignment formatting with trunc. i18n.logOutputEncodin
 	test_cmp expected actual
 '
 
+test_expect_success 'left alignment formatting with Trunc. i18n.logOutputEncoding' '
+	git -c i18n.logOutputEncoding=$test_encoding log --pretty="tformat:%<(10,Trunc)%s" >actual &&
+	qz_to_tab_space <<-\EOF | iconv -f utf-8 -t $test_encoding >expected &&
+	message tw
+	message on
+	add bar  Z
+	initial. a
+	EOF
+	test_cmp expected actual
+'
+
 test_expect_success 'left alignment formatting with ltrunc' '
 	git log --pretty="tformat:%<(10,ltrunc)%s" >actual &&
 	qz_to_tab_space <<-EOF >expected &&
@@ -283,6 +305,18 @@  test_expect_success 'left alignment formatting with ltrunc' '
 	test_cmp expected actual
 '
 
+test_expect_success 'left alignment formatting with Ltrunc' '
+	git log --pretty="tformat:%<(10,Ltrunc)%s" >actual &&
+	qz_to_tab_space <<-EOF >expected &&
+	essage two
+	essage one
+	add bar  Z
+	an${sample_utf8_part}lich
+	EOF
+	test_cmp expected actual
+'
+
+
 test_expect_success 'left alignment formatting with ltrunc. i18n.logOutputEncoding' '
 	git -c i18n.logOutputEncoding=$test_encoding log --pretty="tformat:%<(10,ltrunc)%s" >actual &&
 	qz_to_tab_space <<-EOF | iconv -f utf-8 -t $test_encoding >expected &&
@@ -294,6 +328,16 @@  test_expect_success 'left alignment formatting with ltrunc. i18n.logOutputEncodi
 	test_cmp expected actual
 '
 
+test_expect_success 'left alignment formatting with Ltrunc. i18n.logOutputEncoding' '
+	git -c i18n.logOutputEncoding=$test_encoding log --pretty="tformat:%<(10,Ltrunc)%s" >actual &&
+	qz_to_tab_space <<-EOF | iconv -f utf-8 -t $test_encoding >expected &&
+	essage two
+	essage one
+	add bar  Z
+	an${sample_utf8_part}lich
+	EOF
+	test_cmp expected actual
+'
 test_expect_success 'left alignment formatting with mtrunc' '
 	git log --pretty="tformat:%<(10,mtrunc)%s" >actual &&
 	qz_to_tab_space <<-\EOF >expected &&
@@ -507,6 +551,19 @@  test_expect_success 'left/right alignment formatting with stealing' '
 	EOF
 	test_cmp expected actual
 '
+
+test_expect_success 'left/right hard alignment formatting with stealing' '
+	git commit --amend -m short --author "long long long <long@me.com>" &&
+	git log --pretty="tformat:%<(10,Trunc)%s%>>(10,Ltrunc)% an" >actual &&
+	cat <<-\EOF >expected &&
+	short long  long long
+	message on   A U Thor
+	add bar      A U Thor
+	initial. a   A U Thor
+	EOF
+	test_cmp expected actual
+'
+
 test_expect_success 'left/right alignment formatting with stealing. i18n.logOutputEncoding' '
 	git -c i18n.logOutputEncoding=$test_encoding log --pretty="tformat:%<(10,trunc)%s%>>(10,ltrunc)% an" >actual &&
 	cat <<-\EOF | iconv -f utf-8 -t $test_encoding >expected &&
@@ -517,6 +574,16 @@  test_expect_success 'left/right alignment formatting with stealing. i18n.logOutp
 	EOF
 	test_cmp expected actual
 '
+test_expect_success 'left/right alignment formatting with stealing. i18n.logOutputEncoding' '
+	git -c i18n.logOutputEncoding=$test_encoding log --pretty="tformat:%<(10,Trunc)%s%>>(10,Ltrunc)% an" >actual &&
+	cat <<-\EOF | iconv -f utf-8 -t $test_encoding >expected &&
+	short long  long long
+	message on   A U Thor
+	add bar      A U Thor
+	initial. a   A U Thor
+	EOF
+	test_cmp expected actual
+'
 
 test_expect_success 'strbuf_utf8_replace() not producing NUL' '
 	git log --color --pretty="tformat:%<(10,trunc)%s%>>(10,ltrunc)%C(auto)%d" |
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index 41d0ca00b1c..c44935b0ab4 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -218,6 +218,13 @@  commit $head1
 added (hinzugef${added_utf8_part}gt..
 EOF
 
+test_format subject-truncated "%<($truncate_count,Trunc)%s" qz_to_tab_space <<EOF
+commit $head2
+changed (ge${changed_utf8_part}ndert) f
+commit $head1
+added (hinzugef${added_utf8_part}gt)Z
+EOF
+
 test_format body %b <<EOF
 commit $head2
 commit $head1
@@ -400,6 +407,15 @@  commit $head1
 added (hinzugef${added_utf8_part_iso88591}gt..
 EOF
 
+test_format complex-subject-Trunc "%<($truncate_count,Trunc)%s" qz_to_tab_space <<EOF
+commit $head3
+Test printing of com
+commit $head2
+changed (ge${changed_utf8_part_iso88591}ndert) f
+commit $head1
+added (hinzugef${added_utf8_part_iso88591}gt)Z
+EOF
+
 test_format complex-subject-mtrunc "%<($truncate_count,mtrunc)%s" <<EOF
 commit $head3
 Test prin..ex bodies
@@ -418,6 +434,14 @@  commit $head1
 .. (hinzugef${added_utf8_part_iso88591}gt) foo
 EOF
 
+test_format complex-subject-Ltrunc "%<($truncate_count,Ltrunc)%s" <<EOF
+commit $head3
+ng of complex bodies
+commit $head2
+anged (ge${changed_utf8_part_iso88591}ndert) foo
+commit $head1
+ed (hinzugef${added_utf8_part_iso88591}gt) foo
+EOF
 test_expect_success 'setup expected messages (for test %b)' '
 	cat <<-EOF >expected.utf-8 &&
 	commit $head3
@@ -455,6 +479,15 @@  commit $head1
 added (hinzugef${added_utf8_part}gt..
 EOF
 
+test_format complex-subject-commitencoding-unset-Trunc "%<($truncate_count,Trunc)%s" qz_to_tab_space <<EOF
+commit $head3
+Test printing of com
+commit $head2
+changed (ge${changed_utf8_part}ndert) f
+commit $head1
+added (hinzugef${added_utf8_part}gt)Z
+EOF
+
 test_format complex-subject-commitencoding-unset-mtrunc "%<($truncate_count,mtrunc)%s" <<EOF
 commit $head3
 Test prin..ex bodies
@@ -473,6 +506,15 @@  commit $head1
 .. (hinzugef${added_utf8_part}gt) foo
 EOF
 
+test_format complex-subject-commitencoding-unset-Ltrunc "%<($truncate_count,Ltrunc)%s" <<EOF
+commit $head3
+ng of complex bodies
+commit $head2
+anged (ge${changed_utf8_part}ndert) foo
+commit $head1
+ed (hinzugef${added_utf8_part}gt) foo
+EOF
+
 test_format complex-body-commitencoding-unset %b <expected.utf-8
 
 test_expect_success '%x00 shows NUL' '