blame: add the ability to ignore commits

Message ID	20190107213013.231514-1-brho@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Mon, 7 Jan 2019 16:30:12 -0500 Message-Id: <20190107213013.231514-1-brho@google.com> Mime-Version: 1.0 Subject: [PATCH] blame: add the ability to ignore commits From: Barret Rhoden <brho@google.com> To: git@vger.kernel.org Cc: Stefan Beller <stefanbeller@gmail.com>, Jeff Smith <whydoubt@gmail.com>, Junio C Hamano <gitster@pobox.com>, Jeff King <peff@peff.net> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	blame: add the ability to ignore commits \| expand blame: add the ability to ignore commits

Barret Rhoden Jan. 7, 2019, 9:30 p.m. UTC

Commits that make formatting changes or renames are often not
interesting when blaming a file.  This commit, similar to
git-hyper-blame, allows one to specify certain revisions to ignore during
the blame process.

To ignore a revision, put its committish in a file specified by
--ignore-file=<file> or use -i <rev>, which can be repeated.  The file
.git-blame-ignore-revs is checked by default.

It's useful to be alerted to the presence of an ignored commit in the
history of a line.  Those lines will be marked with '*' in the
non-porcelain output.  The '*' is attached to the line number to keep
from breaking tools that rely on the whitespace between columns.

A blame_entry attributed to an ignored commit will get passed to its
parent.  If an ignored commit changed a line, an ancestor that changed
the line will get blamed.  However, if an ignored commit added lines, a
commit changing a nearby line may get blamed.  If no commit is found,
the original commit for the file will get blamed.

Signed-off-by: Barret Rhoden <brho@google.com>
---
 Documentation/git-blame.txt | 15 +++++++++
 blame.c                     | 38 +++++++++++++++++----
 blame.h                     |  3 ++
 builtin/blame.c             | 66 +++++++++++++++++++++++++++++++++++--
 4 files changed, 112 insertions(+), 10 deletions(-)

Junio C Hamano Jan. 7, 2019, 11:13 p.m. UTC | #1

Barret Rhoden <brho@google.com> writes:

> Commits that make formatting changes or renames are often not
> interesting when blaming a file.  This commit, similar to
> git-hyper-blame, allows one to specify certain revisions to ignore during
> the blame process.
>
> To ignore a revision, put its committish in a file specified by
> --ignore-file=<file> or use -i <rev>, which can be repeated.

If I read it correctly, this gives a very limited form of -S, in the
sense that anything this can do can be expressed by using -S but the
reverse is not true, but is designed to be easier to use, in the
sense that unlike -S, this does not have to describe the part of the
history you do not have to lie about.  The documentation should say
something about these pros-and-cons to help readers decide which
feature to use.

I somehow feel that this is rare enough that it should not squat on
short-and-sweet '-i'.  We would want to reserve it to something like
"--ignore-case", for example.

> The file .git-blame-ignore-revs is checked by default.

Giving the projects a way to easily help participants to share the
same distorted view of the history is a good idea, but I do not
think we should allow projects to do so to those who merely clone it
without their consent.  IOW, "by default" is a terrible idea.

Perhaps paying attention to blame.skipRevsFile configuration
variable that is set by the end user would be sufficient----the
project can ship .blame-skip-revs (or any random filename of their
choice) in-tree as tracked contents and tell its users that they can
optionally use it.

> It's useful to be alerted to the presence of an ignored commit in the
> history of a line.  Those lines will be marked with '*' in the
> non-porcelain output.  The '*' is attached to the line number to keep
> from breaking tools that rely on the whitespace between columns.

A policy decision like the above two shouldn't be hardcoded in the
feature like this, but should be done as a separate option.  By
default, these shouldn't be marked with '*', as the same tools you
said you are afraid of breaking would be expecting a word with only
digits and no asterisk in the column where line numbers appear and
will get broken by this change if done unconditionally.

In general, I find this patch trying to change too many things at
the same time, without sufficient justification.  Perhaps do these
different things as separate steps in a single series?

> A blame_entry attributed to an ignored commit will get passed to its
> parent.

Obviously, an interesting consideration is what happens when a merge
commit is skipped.  Is it sufficient to change this description to
"...will get passed to its parentS", or would the code do completely
nonsensical things without further tweaks (a possible simple tweak
could be to refuse skipping merges)?

> If an ignored commit changed a line, an ancestor that changed
> the line will get blamed.  However, if an ignored commit added lines, a
> commit changing a nearby line may get blamed.  If no commit is found,
> the original commit for the file will get blamed.

The above somehow does not read as describing a carefully designed
behaviour; rather, it sounds as if it is saying "the code does
whatever random things it happens to do".  For example, when there
is a newly added line how is "A" commit changing a nearby line
chosen (a line has lines before it and after it---they may be
touched by different commits, and before and after that skipped
commit, so there are multiple commits to choose from)?

> diff --git a/Documentation/git-blame.txt b/Documentation/git-blame.txt
> index 16323eb80e31..e41375374892 100644
> --- a/Documentation/git-blame.txt
> +++ b/Documentation/git-blame.txt
> @@ -10,6 +10,7 @@ SYNOPSIS
>  [verse]
>  'git blame' [-c] [-b] [-l] [--root] [-t] [-f] [-n] [-s] [-e] [-p] [-w] [--incremental]
>  	    [-L <range>] [-S <revs-file>] [-M] [-C] [-C] [-C] [--since=<date>]
> +	    [-i <rev>] [--no-default-ignores] [--ignore-file=<file>]
>  	    [--progress] [--abbrev=<n>] [<rev> | --contents <file> | --reverse <rev>..<rev>]
>  	    [--] <file>
>  
> @@ -84,6 +85,20 @@ include::blame-options.txt[]
>  	Ignore whitespace when comparing the parent's version and
>  	the child's to find where the lines came from.
>  
> +-i <rev>::
> +	Ignore revision when assigning blame.  Lines that were changed by an
> +	ignored commit will be marked with a `*` in the blame output.  Lines
> +	that were added by an ignored commit may be attributed commits making
> +	nearby changes or to the first commit touching the file.

It probably deserves to be told that this option can be given
multiple times and used cumulatively (unlike usual "last one wins"
rule).

> +--no-default-ignores::
> +	Do not automatically ignore revisions in the file
> +	`.git-blame-ignore-revs`.

This should not be "opt-out" like this.

> +--ignore-file=<file>::
> +	Ignore revisions listed in `file`, one revision per line.  Whitespace
> +	and comments beginning with `#` are ignored.

Should it be capable to take two or more ignore-files?  Or should we
use the usual "the last one wins" rule?

I think we should support blame.skipRevFile configuration variable
so that the users do not have to constantly give the option from the
command line.  And with that, there is no need to have a hardcoded
filename .git-blame-ignore-revs or anything like that.

If we are to use configuration variable, however, we'd need a way to
override its use from the command line, too.  Perhaps a sane
arrangement would be

    - if one or more --skip-revs-file=<file> are given from the
      command line, use all of them and ignore blame.skipRevsFile
      configuration variable.

    - if no --skip-revs-file=<file> is given from the command line,
      use blame.skipRevsFile configuration variable.

    - regardless of the above two, pay attention to --skip-rev=<rev>
      command line option(s).

Somehow the damage to blame.c codebase looks way too noisy for what
the code wants to do.  If all we want is to pretend in a history,
e.g.

    ---O---A---B---C---D

that commit B does not exist, i.e. use a distorted view of the
history

    ---O---A-------C---D

wouldn't it be sufficient to modify pass_blame(), i.e. the only the
caller of the pass_blame_to_parent(), where we find the parent
commits of "C" and dig the history to pass blame to "B", and have it
pretend as if "B" does not exist and pass blame directly to "A"?

Thanks.  I am personally not all that interested in this yet.

Ævar Arnfjörð Bjarmason Jan. 8, 2019, 1:12 p.m. UTC | #2

On Mon, Jan 07 2019, Barret Rhoden wrote:

> +static int handle_ignore_file(const char *path, struct string_list *ignores)
> +{
> +	FILE *fp = fopen(path, "r");
> +	struct strbuf sb = STRBUF_INIT;
> +
> +	if (!fp)
> +		return -1;
> +	while (!strbuf_getline(&sb, fp)) {
> +		const char *hash;
> +
> +		hash = strchr(sb.buf, '#');
> +		if (hash)
> +			strbuf_setlen(&sb, hash - sb.buf);
> +		strbuf_trim(&sb);
> +		if (!sb.len)
> +			continue;
> +		string_list_append(ignores, sb.buf);
> +	}
> +	fclose(fp);
> +	strbuf_release(&sb);
> +	return 0;
> +}

Aside from other comments on this patch that Junio had either you mostly
copy-pasted this from init_skiplist() or you've come up with almost the
same code on your own.

In any case, if we're going to integrate something like this patch let's
split this "parse file with SHA-1s or comments/whitespace" into a
utility function that both this and init_skiplist() can call.

Then we could split up the description for the fsck.skipList config
variable to reference that format, and say that both it and this new
thing should consult those docs for how it's parsed.

Barret Rhoden Jan. 8, 2019, 4:27 p.m. UTC | #3

On 2019-01-07 at 15:13 Junio C Hamano <gitster@pobox.com> wrote:
> If I read it correctly, this gives a very limited form of -S, in the
> sense that anything this can do can be expressed by using -S but the
> reverse is not true, but is designed to be easier to use, in the
> sense that unlike -S, this does not have to describe the part of the
> history you do not have to lie about.  The documentation should say
> something about these pros-and-cons to help readers decide which
> feature to use.

Yeah, -S lists the revs to use, this lists the revs to *not* use.  I'll
add a note.

> I somehow feel that this is rare enough that it should not squat on
> short-and-sweet '-i'.  We would want to reserve it to something like
> "--ignore-case", for example.

Can do.  I'll change the interface to your suggestion from down below.

> > The file .git-blame-ignore-revs is checked by default.  
> 
> Giving the projects a way to easily help participants to share the
> same distorted view of the history is a good idea, but I do not
> think we should allow projects to do so to those who merely clone it
> without their consent.  IOW, "by default" is a terrible idea.
> 
> Perhaps paying attention to blame.skipRevsFile configuration
> variable that is set by the end user would be sufficient----the
> project can ship .blame-skip-revs (or any random filename of their
> choice) in-tree as tracked contents and tell its users that they can
> optionally use it.

A blame config option works for me.  I'd like the users/cloners of a
repo to not need to do anything extravagant, but a one-time config
would be fine.

> > It's useful to be alerted to the presence of an ignored commit in the
> > history of a line.  Those lines will be marked with '*' in the
> > non-porcelain output.  The '*' is attached to the line number to keep
> > from breaking tools that rely on the whitespace between columns.  
> 
> A policy decision like the above two shouldn't be hardcoded in the
> feature like this, but should be done as a separate option.  By
> default, these shouldn't be marked with '*', as the same tools you
> said you are afraid of breaking would be expecting a word with only
> digits and no asterisk in the column where line numbers appear and
> will get broken by this change if done unconditionally.

Since users are already opting-in to the blame-ignore, do you also want
them to opt-in to the annotation?  I can make a separate config option
to turn on the annotation.  Any preference for how it is marked?

> In general, I find this patch trying to change too many things at
> the same time, without sufficient justification.  Perhaps do these
> different things as separate steps in a single series?
> 
> > A blame_entry attributed to an ignored commit will get passed to its
> > parent.  
> 
> Obviously, an interesting consideration is what happens when a merge
> commit is skipped.  Is it sufficient to change this description to
> "...will get passed to its parentS", or would the code do completely
> nonsensical things without further tweaks (a possible simple tweak
> could be to refuse skipping merges)?

If we skip a merge commit, it might pick the wrong parent.  For
example, this line was brought in via a merge:

$ ~/src/git/git-blame include/linux/mm.h | grep VM_SYNC
b6fb293f2497a (Jan Kara 2017-11-01 16:36:41 +0100  204) #define VM_SYNC

It's from merge: a3841f94c7ec ("Merge tag 'libnvdimm-for-4.15', and if
we ignore it:

$ ~/src/git/git-blame -i a3841f94c7ecb include/linux/mm.h | grep VM_SYNC
cc2383ec06be0 (Konstantin Khlebnikov 2012-10-08 16:28:37 -0700  204*) #define VM_SYNC

The wrong commit is blamed.

I can put a note in the doc about it and die if we're given a merge
commit.  Is there a convenient helper for detecting a merge, or can I
just check for multiple parents?  (something like commit->parents &&
commit->parents->next)
 
> > If an ignored commit changed a line, an ancestor that changed
> > the line will get blamed.  However, if an ignored commit added lines, a
> > commit changing a nearby line may get blamed.  If no commit is found,
> > the original commit for the file will get blamed.  
> 
> The above somehow does not read as describing a carefully designed
> behaviour; rather, it sounds as if it is saying "the code does
> whatever random things it happens to do".  For example, when there
> is a newly added line how is "A" commit changing a nearby line
> chosen (a line has lines before it and after it---they may be
> touched by different commits, and before and after that skipped
> commit, so there are multiple commits to choose from)?

This was more of a commentary about its behavior.  If you ignore a
commit that added lines, it'd be nice to get a hint of what might have
caused it, and picking a commit that affected an adjacent line seemed
fine.  But yeah, it's not doing anything crazy.

> > diff --git a/Documentation/git-blame.txt b/Documentation/git-blame.txt
> > index 16323eb80e31..e41375374892 100644
> > --- a/Documentation/git-blame.txt
> > +++ b/Documentation/git-blame.txt
> > @@ -10,6 +10,7 @@ SYNOPSIS
> >  [verse]
> >  'git blame' [-c] [-b] [-l] [--root] [-t] [-f] [-n] [-s] [-e] [-p] [-w] [--incremental]
> >  	    [-L <range>] [-S <revs-file>] [-M] [-C] [-C] [-C] [--since=<date>]
> > +	    [-i <rev>] [--no-default-ignores] [--ignore-file=<file>]
> >  	    [--progress] [--abbrev=<n>] [<rev> | --contents <file> | --reverse <rev>..<rev>]
> >  	    [--] <file>
> >  
> > @@ -84,6 +85,20 @@ include::blame-options.txt[]
> >  	Ignore whitespace when comparing the parent's version and
> >  	the child's to find where the lines came from.
> >  
> > +-i <rev>::
> > +	Ignore revision when assigning blame.  Lines that were changed by an
> > +	ignored commit will be marked with a `*` in the blame output.  Lines
> > +	that were added by an ignored commit may be attributed commits making
> > +	nearby changes or to the first commit touching the file.  
> 
> It probably deserves to be told that this option can be given
> multiple times and used cumulatively (unlike usual "last one wins"
> rule).
> 
> > +--no-default-ignores::
> > +	Do not automatically ignore revisions in the file
> > +	`.git-blame-ignore-revs`.  
> 
> This should not be "opt-out" like this.
> 
> > +--ignore-file=<file>::
> > +	Ignore revisions listed in `file`, one revision per line.  Whitespace
> > +	and comments beginning with `#` are ignored.  
> 
> Should it be capable to take two or more ignore-files?  Or should we
> use the usual "the last one wins" rule?
> 
> I think we should support blame.skipRevFile configuration variable
> so that the users do not have to constantly give the option from the
> command line.  And with that, there is no need to have a hardcoded
> filename .git-blame-ignore-revs or anything like that.
> 
> If we are to use configuration variable, however, we'd need a way to
> override its use from the command line, too.  Perhaps a sane
> arrangement would be
> 
>     - if one or more --skip-revs-file=<file> are given from the
>       command line, use all of them and ignore blame.skipRevsFile
>       configuration variable.
> 
>     - if no --skip-revs-file=<file> is given from the command line,
>       use blame.skipRevsFile configuration variable.
> 
>     - regardless of the above two, pay attention to --skip-rev=<rev>
>       command line option(s).

Sounds fine to me.

> Somehow the damage to blame.c codebase looks way too noisy for what
> the code wants to do.  If all we want is to pretend in a history,
> e.g.
> 
>     ---O---A---B---C---D
> 
> that commit B does not exist, i.e. use a distorted view of the
> history
> 
>     ---O---A-------C---D
> 
> wouldn't it be sufficient to modify pass_blame(), i.e. the only the
> caller of the pass_blame_to_parent(), where we find the parent
> commits of "C" and dig the history to pass blame to "B", and have it
> pretend as if "B" does not exist and pass blame directly to "A"?

I originally tried to skip 'B' in pass_blame() when B popped up as a
scapegoat.  That broke the offsets of the blame_entries in the
parent.  By running diff_hunks(), we get the opportunity to adjust the
offsets.  Also, when it comes to marking the blame_entries for marking
the output, we want to know the specific diffs and to break them up at
the boundaries of [tlno,same) in blame_chunk().

> Thanks.  I am personally not all that interested in this yet.

Thanks for taking a look.

Barret

Barret Rhoden Jan. 8, 2019, 4:41 p.m. UTC | #4

On 2019-01-08 at 14:12 Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Mon, Jan 07 2019, Barret Rhoden wrote:
> 
> > +static int handle_ignore_file(const char *path, struct string_list *ignores)
> > +{
> > +	FILE *fp = fopen(path, "r");
> > +	struct strbuf sb = STRBUF_INIT;
> > +
> > +	if (!fp)
> > +		return -1;
> > +	while (!strbuf_getline(&sb, fp)) {
> > +		const char *hash;
> > +
> > +		hash = strchr(sb.buf, '#');
> > +		if (hash)
> > +			strbuf_setlen(&sb, hash - sb.buf);
> > +		strbuf_trim(&sb);
> > +		if (!sb.len)
> > +			continue;
> > +		string_list_append(ignores, sb.buf);
> > +	}
> > +	fclose(fp);
> > +	strbuf_release(&sb);
> > +	return 0;
> > +}  
> 
> Aside from other comments on this patch that Junio had either you mostly
> copy-pasted this from init_skiplist() or you've come up with almost the
> same code on your own.
> 
> In any case, if we're going to integrate something like this patch let's
> split this "parse file with SHA-1s or comments/whitespace" into a
> utility function that both this and init_skiplist() can call.

One minor difference is that fsck wants an unabbreviated SHA-1, using
parse_oid_hex() instead of get_oid_committish().  Would you be OK with
also changing fsck to take a committish instead of a full SHA-1?

Is there a good place for the common helper?  Since it's an oidset, I
could put it in oidset.c.  oidset_parse_file() or something.

> Then we could split up the description for the fsck.skipList config
> variable to reference that format, and say that both it and this new
> thing should consult those docs for how it's parsed.

Is there a good spot for the generic skipList documentation?  The only
common text would be: 

	... comments ('#'), empty lines, and any leading and trailing
	whitespace is ignored

Thanks,

Barret

Barret Rhoden Jan. 8, 2019, 6:04 p.m. UTC | #5

On 2019-01-08 at 11:41 Barret Rhoden <brho@google.com> wrote:
> Would you be OK with
> also changing fsck to take a committish instead of a full SHA-1?

Actually, in retrospect, I can keep the unabbreviated SHA-1 for the
file inputs and use get_oid_committish() for the one-off --skip-rev=
cases.

Thanks,

Barret

Junio C Hamano Jan. 8, 2019, 6:26 p.m. UTC | #6

Barret Rhoden <brho@google.com> writes:

>> A policy decision like the above two shouldn't be hardcoded in the
>> feature like this, but should be done as a separate option.  By
>> default, these shouldn't be marked with '*', as the same tools you
>> said you are afraid of breaking would be expecting a word with only
>> digits and no asterisk in the column where line numbers appear and
>> will get broken by this change if done unconditionally.
>
> Since users are already opting-in to the blame-ignore, do you also want
> them to opt-in to the annotation?

Absolutely.

After all, the users of a moral equivalent that is -S
never needed such an extra annotation, which tells us two things.
(1) the claim "It's useful to be alerted to the presence of an
ignored commit" in the proposed log message is merely a personal
preference and universal users' requirement; (2) if it is useful to
mark a blame-source whose parents (used while blaming) do not match
the actual parents, such an annotation would also be useful while
running -S.  So probably it should be a separate option that can be
given when any of the --skip-commit=<rev>, --skip-commits-file=<file>,
r -S<file> option is given.

>> Obviously, an interesting consideration is what happens when a merge
>> commit is skipped.  Is it sufficient to change this description to
>> "...will get passed to its parentS", or would the code do completely
>> nonsensical things without further tweaks (a possible simple tweak
>> could be to refuse skipping merges)?
>
> If we skip a merge commit, it might pick the wrong parent.

Actually after thinking about it a bit more, I no longer think a
merge is so special.  In this topology (as usual, time flows from
left to right), if you are skipping M,

        ---A---M---C---D
              /
         ----B

you'd simply pretend that the ancestry is like this and you'd be
fine, no?

        ---A-------C---D
                  /
         --------B

That is, when inspecting C, pass_blame() would enumerate its true
parents, notice that M to be skipped, which would cause it to
pretend as if C has M's parents instead of M itself as its parents.
If M in the true history is the first parent of C, then M's first
parent in the true history becomes C's first parent, M's second
parent in the true history becomes C's second parent, etc. (if C
were a merge in the true history, such a rewriting would make C an
octopus in the distorted history)

> The wrong commit is blamed.

So... I still suspect that it is merely a bug in your implementation
and there is nothing inherent that forces us to avoid skipping a
merge.

>> Somehow the damage to blame.c codebase looks way too noisy for what
>> the code wants to do.  If all we want is to pretend in a history,
>> e.g.
>> 
>>     ---O---A---B---C---D
>> 
>> that commit B does not exist, i.e. use a distorted view of the
>> history
>> 
>>     ---O---A-------C---D
>> 
>> wouldn't it be sufficient to modify pass_blame(), i.e. the only the
>> caller of the pass_blame_to_parent(), where we find the parent
>> commits of "C" and dig the history to pass blame to "B", and have it
>> pretend as if "B" does not exist and pass blame directly to "A"?
>
> I originally tried to skip 'B' in pass_blame() when B popped up as a
> scapegoat.  That broke the offsets of the blame_entries in the
> parent.

Why?  When inspecting C in order to exonerate it from as many lines
as possible, we'd run "git diff $P C" for each parent of C, but in
the distorted history (i.e. instead of using $P==B, we use $P==A).
As long as the code reads from "git diff A C", I do not see why "the
offsets in the parent" can be broken.  Care to elaborate a bit more?

Barret Rhoden Jan. 9, 2019, 8:48 p.m. UTC | #7

Hi -

On 2019-01-08 at 10:26 Junio C Hamano <gitster@pobox.com> wrote:
> >> Obviously, an interesting consideration is what happens when a merge
> >> commit is skipped.  Is it sufficient to change this description to
> >> "...will get passed to its parentS", or would the code do completely
> >> nonsensical things without further tweaks (a possible simple tweak
> >> could be to refuse skipping merges)?  
> >
> > If we skip a merge commit, it might pick the wrong parent.  
> 
> Actually after thinking about it a bit more, I no longer think a
> merge is so special.  In this topology (as usual, time flows from
> left to right), if you are skipping M,
> 
>         ---A---M---C---D
>               /
>          ----B
> 
> you'd simply pretend that the ancestry is like this and you'd be
> fine, no?
> 
> 
>         ---A-------C---D
>                   /
>          --------B
> 
> That is, when inspecting C, pass_blame() would enumerate its true
> parents, notice that M to be skipped, which would cause it to
> pretend as if C has M's parents instead of M itself as its parents.
> If M in the true history is the first parent of C, then M's first
> parent in the true history becomes C's first parent, M's second
> parent in the true history becomes C's second parent, etc. (if C
> were a merge in the true history, such a rewriting would make C an
> octopus in the distorted history)
> 
> > The wrong commit is blamed.  
> 
> So... I still suspect that it is merely a bug in your implementation
> and there is nothing inherent that forces us to avoid skipping a
> merge.

Probably!  =)  I tried a bunch of things initially, and might have had
things broken for other reasons.  One reason was not calling
parse_commit() at the right time.

I took another look and had first_scapegoat() scan the parent list, and
for any parent in the list, replace it with its own parent list.  e.g.
when removing 'A':

         ---X---A---B---C
               /
              Y 

(B's parent was A, A is the skipped commit, and A has parents X and Y)

becomes:

         ---X---B---C
               /
              Y

But just yanking skipped commits from the scapegoat list doesn't help,
since B gets blamed for the changes made by A, since the file B and
{X,Y} differ, specifically due to the changes A made.  

You'd think that would have worked, since it sounds similar to what -S
revs-file does, but it turns out I want different behavior than -S for
the ignore/skip.  With -S, you blame a nearby commit (at least based on
my testing - what is the desired behavior there?).  With ignore/skip, I
want to blame the last commit that modified the line.

For instance, commit X does this:

-foo(x,y);
+foo(x,y,z);

Then commit Y comes along to reformat it:

-foo(x,y,z);
+foo(x, y, z);

And the history / rev-list for the file looks like:

---O---A---X---B---C---D---Y---E---F

I want to ignore/skip Y and see X in the blame output.  

Sorry that I wasn't clear about that from the beginning; I can see how
one could have different expectations for what happens when a commit is
skipped.

> >> Somehow the damage to blame.c codebase looks way too noisy for what
> >> the code wants to do.  If all we want is to pretend in a history,
> >> e.g.
> >> 
> >>     ---O---A---B---C---D
> >> 
> >> that commit B does not exist, i.e. use a distorted view of the
> >> history
> >> 
> >>     ---O---A-------C---D
> >> 
> >> wouldn't it be sufficient to modify pass_blame(), i.e. the only the
> >> caller of the pass_blame_to_parent(), where we find the parent
> >> commits of "C" and dig the history to pass blame to "B", and have it
> >> pretend as if "B" does not exist and pass blame directly to "A"?  
> >
> > I originally tried to skip 'B' in pass_blame() when B popped up as a
> > scapegoat.  That broke the offsets of the blame_entries in the
> > parent.  
> 
> Why?  When inspecting C in order to exonerate it from as many lines
> as possible, we'd run "git diff $P C" for each parent of C, but in
> the distorted history (i.e. instead of using $P==B, we use $P==A).
> As long as the code reads from "git diff A C", I do not see why "the
> offsets in the parent" can be broken.  Care to elaborate a bit more?

When we run "git diff $P C" with $P == A, the diffs from B (the one we
want to remove) will pop up and be attributed to C.  This is what
happened when I tried the "yank B from the history" approach from
above.  If we ever diff over a skipped commit like that, we'll blame
the wrong one. (C, here).

The first thing I tried was to simply move the blame_entry list from C
to B to A without calling diff_hunks().  That's where the "file offsets
were broken," since e->s_lno should be adjusted when passing an entry
from the target blame_origin to its parent blame_origin.  An example of
this is in blame_chunk() when we hand every blame_entry before tlno to
the parent in the "e->s_lno < tlno" block.

Without diffing commit B, we can't propagate the blame_entries.  And
if we just diff A and C, we'll get the "wrong" commit.

The other reason to run the diffs was to identify and mark the
blame_entry that would have been blamed for the skipped commit.  That's
for the annotations during the output phase.  

If we don't track and taint the blame entries, you could just call
blame_chunk() from blame_chunk_cb() with tlno = same, which in essence
tells the blame code that "this diff belongs to the parent."  But that
loses the blame_entry boundaries, and is almost nearly as much of a
change to blame.c.

When it comes to skipping merges, my existing code couldn't do it, but
I found a way to make it work:

         ---A---M---C---D
               /
       ---X---B

Say the commit changing the line/blame_entry was X, and we try to skip
M.  'A' made no changes to the line.

If B is the first scapegoat, pass_blame_to_parent() will see no diff
between B and M for that line, and the blame will get passed to B, and
we're done.  This is the same both with my change and without.

However, if A is the first scapegoat, the line will have a diff, and
from the perspective of the ---A---M---C graph, it looks like M
introduced the change.  (In fact it did - it was via B though).  The
existing code would keep the blame on M and then check the next
scapegoat: B.  

With my change, it couldn't tell the difference between M introducing
a change that we wanted to skip and M having another parent that
introduced the change.  Both look like a diff from A..M.

I changed my version to handle that case.  In pass_blame(), if it fails
to find a scapegoat in the loop that calls pass_blame_to_parent(),
it'll make another loop through the scapegoats, where it will pass the
blame to the first scapegoat that has a diff.

This way, we get a first attempt to pass the blame for real (e.g. to
B), and then when we failed, we skip the commit.  In other words, we
only skip the commit when we need to.

With that change, from my earlier example:

This was wrong:
$ git-blame --skip-rev a3841f94c7ecb include/linux/mm.h | grep VM_SYNC
cc2383ec06be0 (Konstantin Khlebnikov 2012-10-08 16:28:37 -0700  204*) #define VM_SYNC

This is right:
$ git-blame --skip-rev a3841f94c7ecb include/linux/mm.h | grep VM_SYNC
b6fb293f2497a (Jan Kara              2017-11-01 16:36:41 +0100  204) #define VM_SYNC

Also note that in the 'right' case it didn't annotate it with the *.
That's because we didn't need to ignore the commit.  b6fb29 was the
correct one, and we didn't need to ignore M.

Anyway, thanks for reading and for the feedback.  I'll clean things up
and explain it better in another patchset.

Thanks,

Barret

Stefan Beller Jan. 9, 2019, 9:21 p.m. UTC | #8

On Tue, Jan 8, 2019 at 10:26 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Barret Rhoden <brho@google.com> writes:
>
> >> A policy decision like the above two shouldn't be hardcoded in the
> >> feature like this, but should be done as a separate option.  By
> >> default, these shouldn't be marked with '*', as the same tools you
> >> said you are afraid of breaking would be expecting a word with only
> >> digits and no asterisk in the column where line numbers appear and
> >> will get broken by this change if done unconditionally.
> >
> > Since users are already opting-in to the blame-ignore, do you also want
> > them to opt-in to the annotation?
>
> Absolutely.
>
> After all, the users of a moral equivalent that is -S
> never needed such an extra annotation, which tells us two things.
> (1) the claim "It's useful to be alerted to the presence of an
> ignored commit" in the proposed log message is merely a personal
> preference and universal users' requirement; (2) if it is useful to
> mark a blame-source whose parents (used while blaming) do not match
> the actual parents, such an annotation would also be useful while
> running -S.  So probably it should be a separate option that can be
> given when any of the --skip-commit=<rev>, --skip-commits-file=<file>,
> r -S<file> option is given.

From a users point of view it may be desirable to express all this
in the grafts file, i.e. -S <file> where the syntax of that file is extended.
For example we could introduce
    !<hash>
to make it exclude that commit.
Of course this could lead to confusion, as this puts 2 very different
concepts into the same option/file.

Speaking of the implementation:
This patch proposes an oid-set that is handled specially
in blame itself. I wonder if this could be generalized.

Jonathan Tan (cc'd) refactored and extended revision walking
for git-fetch and its negotiation leading to
7c85ee6c58 (Merge branch 'jt/fetch-negotiator-skipping',
2018-08-02), and 3390e42adb (fetch-pack:
support negotiation tip whitelist, 2018-07-02)
which implements another revision walking
algorithm that can be used to fine-tune revisions
walked when fetching.

I wonder if that work could be generalized more
to have "generic" revision walking algorithms
and then making use of them in either fetch or
blame.

For git-fetch there is a new algorithm that increases
step size between commits, which would be funny to
try for blame here. It would give the wrong blamed
commit, but would speed up blaming a lot.

Omitting some revisions seems to be applicable to
more than just blame/fetch, too.

Stefan

Junio C Hamano Jan. 10, 2019, 10:29 p.m. UTC | #9

Barret Rhoden <brho@google.com> writes:

> For instance, commit X does this:
>
> -foo(x,y);
> +foo(x,y,z);
>
> Then commit Y comes along to reformat it:
>
> -foo(x,y,z);
> +foo(x, y, z);
>
> And the history / rev-list for the file looks like:
>
> ---O---A---X---B---C---D---Y---E---F
>
> I want to ignore/skip Y and see X in the blame output.  

If you skip Y, the altered history would have "foo(x, y, z)" in E,
"foo(x,y,z)" in X, and "foo(x,y)" in A.  If you start blaming from
F, you'd get E as the commit that explains the latest state.  If you
do not skip Y, you'd get Y.  I am not sure how you'd get X in either
case.

Barret Rhoden Jan. 14, 2019, 3:19 p.m. UTC | #10

On 2019-01-10 at 14:29 Junio C Hamano <gitster@pobox.com> wrote:
> > For instance, commit X does this:
> >
> > -foo(x,y);
> > +foo(x,y,z);
> >
> > Then commit Y comes along to reformat it:
> >
> > -foo(x,y,z);
> > +foo(x, y, z);
> >
> > And the history / rev-list for the file looks like:
> >
> > ---O---A---X---B---C---D---Y---E---F
> >
> > I want to ignore/skip Y and see X in the blame output.    
> 
> If you skip Y, the altered history would have "foo(x, y, z)" in E,
> "foo(x,y,z)" in X, and "foo(x,y)" in A.  If you start blaming from
> F, you'd get E as the commit that explains the latest state.  If you
> do not skip Y, you'd get Y.  I am not sure how you'd get X in either
> case.  

The way to do it is to let the blames get passed to Y - don't yank it
from the graph.  Then when trying to pass the blames from Y to its
parent, when we get a diff chunk that Y is responsible for, instead of
keeping it on Y's suspect list, we hand the blame_entry to D.  That
blame_entry will get passed all the way back to X, which also has a
diff that touches that line.

Basically we do the same blame processing as usual, but just don't let
any blames stick to Y.

Barret

Junio C Hamano Jan. 14, 2019, 5:45 p.m. UTC | #11

Barret Rhoden <brho@google.com> writes:

> On 2019-01-10 at 14:29 Junio C Hamano <gitster@pobox.com> wrote:
>> > For instance, commit X does this:
>> >
>> > -foo(x,y);
>> > +foo(x,y,z);
>> >
>> > Then commit Y comes along to reformat it:
>> >
>> > -foo(x,y,z);
>> > +foo(x, y, z);
>> >
>> > And the history / rev-list for the file looks like:
>> >
>> > ---O---A---X---B---C---D---Y---E---F
>> >
>> > I want to ignore/skip Y and see X in the blame output.    
>> 
>> If you skip Y, the altered history would have "foo(x, y, z)" in E,
>> "foo(x,y,z)" in X, and "foo(x,y)" in A.  If you start blaming from
>> F, you'd get E as the commit that explains the latest state.  If you
>> do not skip Y, you'd get Y.  I am not sure how you'd get X in either
>> case.  
>
> The way to do it is ...

Sorry, I made a too-fuzzy statement.  What I meant was, that unless
you are ignoring E, I do not know why you "would want to" attribute
a line "foo(x, y, z)" that appears in F to X.  Starting from X up to
D (and to Y in real history, but you are ignoring Y), the line was
"foo(x,y,z)", after E, it is "foo(x, y, z)".  I didn't mean to ask
how you "would show" such a result---as I do not yet understand why
you would want such a result to begin with.

Randall S. Becker Jan. 14, 2019, 6:26 p.m. UTC | #12

On January 14, 2019 12:46, Junio C Hamano wrote:
> Barret Rhoden <brho@google.com> writes:
> 
> > On 2019-01-10 at 14:29 Junio C Hamano <gitster@pobox.com> wrote:
> >> > For instance, commit X does this:
> >> >
> >> > -foo(x,y);
> >> > +foo(x,y,z);
> >> >
> >> > Then commit Y comes along to reformat it:
> >> >
> >> > -foo(x,y,z);
> >> > +foo(x, y, z);
> >> >
> >> > And the history / rev-list for the file looks like:
> >> >
> >> > ---O---A---X---B---C---D---Y---E---F
> >> >
> >> > I want to ignore/skip Y and see X in the blame output.
> >>
> >> If you skip Y, the altered history would have "foo(x, y, z)" in E,
> >> "foo(x,y,z)" in X, and "foo(x,y)" in A.  If you start blaming from F,
> >> you'd get E as the commit that explains the latest state.  If you do
> >> not skip Y, you'd get Y.  I am not sure how you'd get X in either
> >> case.
> >
> > The way to do it is ...
> 
> Sorry, I made a too-fuzzy statement.  What I meant was, that unless you
are
> ignoring E, I do not know why you "would want to" attribute a line "foo(x,
y,
> z)" that appears in F to X.  Starting from X up to D (and to Y in real
history, but
> you are ignoring Y), the line was "foo(x,y,z)", after E, it is "foo(x, y,
z)".  I
> didn't mean to ask how you "would show" such a result---as I do not yet
> understand why you would want such a result to begin with.

From my own community, this came up also. The intent was to show everyone
who touched a particular line, throughout history, not just the current one.
Perhaps that is what Barret is going for.

Regards,
Randall

Barret Rhoden Jan. 14, 2019, 7:03 p.m. UTC | #13

On 2019-01-14 at 13:26 "Randall S. Becker" <rsbecker@nexbridge.com>
wrote:
> > 
> > Sorry, I made a too-fuzzy statement.  What I meant was, that unless you  
> are
> > ignoring E, I do not know why you "would want to" attribute a line "foo(x,  
> y,
> > z)" that appears in F to X.  Starting from X up to D (and to Y in real  
> history, but
> > you are ignoring Y), the line was "foo(x,y,z)", after E, it is "foo(x, y,  
> z)".  I
> > didn't mean to ask how you "would show" such a result---as I do not yet
> > understand why you would want such a result to begin with.  
> 
> From my own community, this came up also. The intent was to show everyone
> who touched a particular line, throughout history, not just the current one.
> Perhaps that is what Barret is going for.

Yeah.  I want to find the most recent commit that changed a line, minus
a set of commits that are deemed 'not interesting' by the user.  

The primary reason for this is to not see a blame attributed to a
commit that does nothing but reformatting the code base.

One could also use this feature for one-off investigation with the
command-line switch.  Imagine a manually driven git-blame + git-log,
where you do a git-blame, check the commit, decide you want to see the
next one back, and rerun git-blame with --skip-rev=SHA1.

The Chromium depot_tools has a python tool that can do this, called
hyper-blame: 
(https://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/git-hyper-blame.html).

I thought it was a useful feature for all of git's users.  And I also
didn't want people to have to install depot_tools.

Barret

Junio C Hamano Jan. 15, 2019, 6:08 a.m. UTC | #14

"Randall S. Becker" <rsbecker@nexbridge.com> writes:

> On January 14, 2019 12:46, Junio C Hamano wrote:
>> Barret Rhoden <brho@google.com> writes:
>> 
>> > On 2019-01-10 at 14:29 Junio C Hamano <gitster@pobox.com> wrote:
>> >> > For instance, commit X does this:
>> >> >
>> >> > -foo(x,y);
>> >> > +foo(x,y,z);
>> >> >
>> >> > Then commit Y comes along to reformat it:
>> >> >
>> >> > -foo(x,y,z);
>> >> > +foo(x, y, z);
>> >> >
>> >> > And the history / rev-list for the file looks like:
>> >> >
>> >> > ---O---A---X---B---C---D---Y---E---F
>> >> >
>> >> > I want to ignore/skip Y and see X in the blame output.
>> >>
>> >> If you skip Y, the altered history would have "foo(x, y, z)" in E,
>> >> "foo(x,y,z)" in X, and "foo(x,y)" in A.  If you start blaming from F,
>> >> you'd get E as the commit that explains the latest state.  If you do
>> >> not skip Y, you'd get Y.  I am not sure how you'd get X in either
>> >> case.
>> >
>> ...
>
> From my own community, this came up also. The intent was to show everyone
> who touched a particular line, throughout history, not just the current one.
> Perhaps that is what Barret is going for.

I think I now understand what is going on.  In short, what Barret
wanted is way more ambitious than "blame in a hypothetical history
that lacks certain commits".  There is no similiarity with existing
"-S <revs>" option that I asked about in my initial review.  This is
quite a different beast (and I kind-a like it ;-).

Instead of "skipping" Y, Barret wants to pretend that the effect of
commit Y, in addition to the commit Y itself, never existed in the
history.  Even when blaming back from F (where it has "foo(x, y,
z)", which is the same as E and down to Y), the algorithm wants to
pretend as if E and F had that line in "foo(x,y,z)" form, i.e.
before Y touched it, and also pretend that the user started blaming
a line in that shape.

As there is no general solution [*1*] to "how did this line looked
like before this commit", such a blame may not have a meaningful
answer.  When Y added a block of text without removing anything
(i.e. "git show Y" has no line that begins with '-'), and when the
user asked to blame that block of text in F, there is no sensible
answer to the question: "what got replaced by Y in this text?", so a
blame that "ignores" Y would not be able to answer "where did these
lines come from?" question.  But in some cases (e.g. when "git show
Y" has only one '-' line, immediately followed by '+' line, both
with quite a similar contents and no other change in the patch---it
is highly plausible that in such a change, the former became the
latter, like the example in Barret's message I was responding to),
we can say "this was how the line that was modified in Y looked like
before Y" and blaming while ignoring Y can have a reasonable answer.

[Footnote]

*1* Here, I use the word "solution" in the same sense as "solution
to a math equation", not in the sense to describe the steps to
derive that "solution".

blame: add the ability to ignore commits

Commit Message

Comments

Patch