diff mbox series

email as a bona fide git transport

Message ID b9fb52b8-8168-6bf0-9a72-1e6c44a281a5@oracle.com (mailing list archive)
State New, archived
Headers show
Series email as a bona fide git transport | expand

Commit Message

Vegard Nossum Oct. 16, 2019, 10:22 a.m. UTC
(cross-posted to git, LKML, and the kernel workflows mailing lists.)

Hi all,

I've been following Konstantin Ryabitsev's quest for better development
and communication tools for the kernel [1][2][3], and I would like to
propose a relatively straightforward idea which I think could bring a
lot to the table.

Step 1:

* git send-email needs to include parent SHA1s and generally all the
   information needed to perfectly recreate the commit when applied so
   that all the SHA1s remain the same

* git am (or an alternative command) needs to recreate the commit
   perfectly when applied, including applying it to the correct parent

Having these two will allow a perfect mapping between email and git;
essentially email just becomes a transport for git. There are a lot of
advantages to this, particularly that you have a stable way to refer to
a patch or commit (despite it appearing on a mailing list), and there
is no need for "changeset IDs" or whatever, since you can just use the
git SHA1 which is unique, unambiguous, and stable.

As a rough proof of concept I've attached 3 git patches which implement
this. There are issues to work out like exact format, encodings, mail
mangling, error handling, etc., but hopefully the git community can
help out here. (Improvement suggestions are welcome!)

Step 2:

* A bot that follows LKML (and other lists) and imports patchsets into
   a git repository hosted on git.kernel.org

* The bot can add git notes with URLs to lore (and/or other mailing
   list archives) and store them in e.g. refs/notes/lore,
   refs/notes/lkml, etc.

   (For those who don't use git notes yet: they are essentially small
   bits of information you can add to a commit without changing its SHA1,
   and you can configure tools like 'git log' to show these at the bottom
   of a commit. Notes can also exist in a repo completely separate from
   the commits they attach data to, so there is _zero_ overhead for those
   who don't want to use this.)

* Maintainers can either pull patchsets directly from this bot-
   maintained repo OR they can continue to apply patches from their inbox
   (the result should be the same either way) OR they can continue in the
   old-style process (at least for a while) and just not have the
   benefits of the new process.

Step 3:

* Instead of describing a patchset in a separate introduction email, we
   can create a merge commit between the parent of the first commit in
   the series and the last and put the patchset description in the merge
   commit [5]. This means the patchset description also gets to be part
   of git history.

   (This would require support for git send-email/am to be able to send
   and apply merge commits -- at least those which have the same tree as
   one of the parents. This is _not_ yet supported in my proposed git
   patches.)

* stable SHA1s means we can refer to previous versions of a patchset by
   SHA1 rather than archive links. I propose a new changelog tag for
   this, maybe "Previous:" or maybe even a full list of "v1:", "v2:",
   etc. with a SHA1 or ref. Note that these SHA1s do *not* need to exist
   in Linus's repo, but those who want can pull those branches from the
   bot-maintained repo on git.kernel.org.

Advantages:

- we can keep using email to post patches/patchsets

- the process is opt-in (but should be encouraged) for both authors and
   maintainers, and the transition can happen over time

- there is a central repo for convenience, but it is not necessary for
   development to happen and is not a single point of failure -- it's
   more like Linus's repo and can be moved or even replicated from
   scratch by somebody else simply by having mailing list archives

- allows quick lookup of patch/patchset <-> email discussion within git

- allows diffing between versions of a single logical patchset

- patchset descriptions naturally become part of the changelog that ends
   up in Linus's tree

Disadvantages:

- requires patching git

- requires a bot to continuously create branches for patchsets sent to
   mailing lists

- increased storage/bandwidth for git.kernel.org (?)

- may need a couple of new wrapper scripts to automate patchset
   construction/versioning

Thoughts?


Vegard

PS: Eric Wong described something that comes quite close to this idea, 
but AFAICT without actually recreating commits exactly. I've included 
the link for completeness. [4]


[1]: https://lwn.net/Articles/793037/ "Ryabitsev: Patches carved into
developer sigchains"

[2]: https://lwn.net/Articles/799134/ "Defragmenting the kernel
development process"

[3]: 
https://lore.kernel.org/workflows/20190924182536.GC6041@hmswarspite.think-freely.org/

[4]: https://lore.kernel.org/workflows/20191008003931.y4rc2dp64gbhv5ju@dcvr/

[5]: To create this merge commit one could use something like this (bash):

# usage: patchset BASE [PREVIOUS_VERSION]
patchset () {
     start=$1
     prev=$2

     # construct tentative commit message
     commit_editmsg="$(git rev-parse --git-dir)/COMMIT_EDITMSG"
     (
         if [ -z "$prev" ]
         then
             echo 'Patchset title'
             echo
             echo Commits:
             echo
             git log --oneline $start..HEAD
         else
             git show --format=format:%B --no-patch $prev
             echo Previous-version: $(git rev-parse $prev)
         fi
     ) > "${commit_editmsg}"

     ${EDITOR} "${commit_editmsg}"

     merge=$(git commit-tree -p $start -p HEAD -F "${commit_editmsg}" 
$(git rev-parse HEAD^{tree}))
     echo $merge
}

This will open the editor to edit the patchset description and create a
merge commit that encompasses the patches in the patchset (use sha1^- to
view the patches in it).

Comments

Willy Tarreau Oct. 16, 2019, 11:10 a.m. UTC | #1
Hi Vegard,

On Wed, Oct 16, 2019 at 12:22:54PM +0200, Vegard Nossum wrote:
> (cross-posted to git, LKML, and the kernel workflows mailing lists.)
> 
> Hi all,
> 
> I've been following Konstantin Ryabitsev's quest for better development
> and communication tools for the kernel [1][2][3], and I would like to
> propose a relatively straightforward idea which I think could bring a
> lot to the table.
> 
> Step 1:
> 
> * git send-email needs to include parent SHA1s and generally all the
>   information needed to perfectly recreate the commit when applied so
>   that all the SHA1s remain the same
> 
> * git am (or an alternative command) needs to recreate the commit
>   perfectly when applied, including applying it to the correct parent
> 
> Having these two will allow a perfect mapping between email and git;
> essentially email just becomes a transport for git. There are a lot of
> advantages to this, particularly that you have a stable way to refer to
> a patch or commit (despite it appearing on a mailing list), and there
> is no need for "changeset IDs" or whatever, since you can just use the
> git SHA1 which is unique, unambiguous, and stable.

I agree this would be great and have been missing this a number of times,
eventhough I'm aware of git-send-pack/git-receive-pack. The text format
is way more convenient for a lot of reasons. It could also help with
Greg's idea of using the commit IDs to reference bugs, as such IDs could
remain stable within a series before it is merged, and as such referenced
in subsequent commit messages. It could also be useful to avoid losing
notes related to a patch once it's merged.

> Step 3:
> 
> * Instead of describing a patchset in a separate introduction email, we
>   can create a merge commit between the parent of the first commit in
>   the series and the last and put the patchset description in the merge
>   commit [5]. This means the patchset description also gets to be part
>   of git history.
> 
>   (This would require support for git send-email/am to be able to send
>   and apply merge commits -- at least those which have the same tree as
>   one of the parents. This is _not_ yet supported in my proposed git
>   patches.)

That's a good idea, as we've all seen long series with a very detailed
description in patch 0 and much less context in subsequent patches, thus
losing the context once merged.

> * stable SHA1s means we can refer to previous versions of a patchset by
>   SHA1 rather than archive links. I propose a new changelog tag for
>   this, maybe "Previous:" or maybe even a full list of "v1:", "v2:",
>   etc. with a SHA1 or ref. Note that these SHA1s do *not* need to exist
>   in Linus's repo, but those who want can pull those branches from the
>   bot-maintained repo on git.kernel.org.

For me this mainly brings the benefit of finally having a unique identifier
for multiple iterations of a patchset. It then becomes easier to use this
identifier to designate the functional work, regardless of the number of
updates it gets. Of course it's never that black and white since such work
may itself merge multiple other patchsets but for most use cases it can
help.

Willy
Santiago Torres Arias Oct. 16, 2019, 2:45 p.m. UTC | #2
Hi Willy, Vegard.

On Wed, Oct 16, 2019 at 01:10:09PM +0200, Willy Tarreau wrote:
> Hi Vegard,
> 
> On Wed, Oct 16, 2019 at 12:22:54PM +0200, Vegard Nossum wrote:
> > (cross-posted to git, LKML, and the kernel workflows mailing lists.)
> > 
> > Hi all,
> > 
> > I've been following Konstantin Ryabitsev's quest for better development
> > and communication tools for the kernel [1][2][3], and I would like to
> > propose a relatively straightforward idea which I think could bring a
> > lot to the table.
> > 
> > Step 1:
> > 
> > * git send-email needs to include parent SHA1s and generally all the
> >   information needed to perfectly recreate the commit when applied so
> >   that all the SHA1s remain the same
> > 
> > * git am (or an alternative command) needs to recreate the commit
> >   perfectly when applied, including applying it to the correct parent
> > 
> > Having these two will allow a perfect mapping between email and git;
> > essentially email just becomes a transport for git. There are a lot of
> > advantages to this, particularly that you have a stable way to refer to
> > a patch or commit (despite it appearing on a mailing list), and there
> > is no need for "changeset IDs" or whatever, since you can just use the
> > git SHA1 which is unique, unambiguous, and stable.

I wonder if it'd be also possible to then embed gpg signatures over
send-mail payloads so as they can be transparently transferred to the
commit.

-Santiago
Pratyush Yadav Oct. 16, 2019, 3 p.m. UTC | #3
Hi Vegard,

On 16/10/19 12:22PM, Vegard Nossum wrote:
> (cross-posted to git, LKML, and the kernel workflows mailing lists.)
> 
> Hi all,
> 
> I've been following Konstantin Ryabitsev's quest for better development
> and communication tools for the kernel [1][2][3], and I would like to
> propose a relatively straightforward idea which I think could bring a
> lot to the table.
> 
> Step 1:
> 
> * git send-email needs to include parent SHA1s and generally all the
>   information needed to perfectly recreate the commit when applied so
>   that all the SHA1s remain the same
> 
> * git am (or an alternative command) needs to recreate the commit
>   perfectly when applied, including applying it to the correct parent
> 
> Having these two will allow a perfect mapping between email and git;
> essentially email just becomes a transport for git. There are a lot of
> advantages to this, particularly that you have a stable way to refer to
> a patch or commit (despite it appearing on a mailing list), and there
> is no need for "changeset IDs" or whatever, since you can just use the
> git SHA1 which is unique, unambiguous, and stable.
 
FWIW, I like the idea.

> As a rough proof of concept I've attached 3 git patches which implement
> this. There are issues to work out like exact format, encodings, mail
> mangling, error handling, etc., but hopefully the git community can
> help out here. (Improvement suggestions are welcome!)
> 
> Step 2:
> 
> * A bot that follows LKML (and other lists) and imports patchsets into
>   a git repository hosted on git.kernel.org
> 
> * The bot can add git notes with URLs to lore (and/or other mailing
>   list archives) and store them in e.g. refs/notes/lore,
>   refs/notes/lkml, etc.
> 
>   (For those who don't use git notes yet: they are essentially small
>   bits of information you can add to a commit without changing its SHA1,
>   and you can configure tools like 'git log' to show these at the bottom
>   of a commit. Notes can also exist in a repo completely separate from
>   the commits they attach data to, so there is _zero_ overhead for those
>   who don't want to use this.)
> 
> * Maintainers can either pull patchsets directly from this bot-
>   maintained repo OR they can continue to apply patches from their inbox
>   (the result should be the same either way) OR they can continue in the
>   old-style process (at least for a while) and just not have the
>   benefits of the new process.
> 
> Step 3:
> 
> * Instead of describing a patchset in a separate introduction email, we
>   can create a merge commit between the parent of the first commit in
>   the series and the last and put the patchset description in the merge
>   commit [5]. This means the patchset description also gets to be part
>   of git history.
> 
>   (This would require support for git send-email/am to be able to send
>   and apply merge commits -- at least those which have the same tree as
>   one of the parents. This is _not_ yet supported in my proposed git
>   patches.)

Can sending merge commits via email work with your proposed '--exact'? 
Say I'm the maintainer, and you fork off a feature branch off my master, 
add a few commits that introduce your new feature, and then merge it 
into my master, and then send those commits, including the merge.

Now in that scenario, say the tip of your feature branch was X and the 
tip of my 'master' was Y when you sent your patches. Now while your 
patches are still being reviewed, I merge in some other branch creating 
a merge commit Z on my master.

Now your merge's first parent was Y and second parent was X. But now the 
tip of my master is Z, so the first parent of your merge needs to be Z, 
not Y. Changing the first parent would mean a different commit hash.

So, the way I see it, your proposed merge commits via email can't work 
with '--exact'. Do I understand this situation correctly? Am I missing 
something?

Maybe a better idea would be to allow 'am' to create these merges 
locally when applying the patches. That would mean having to merge the 
separate branch along with applying the patches, otherwise the cover 
letter text is lost. This might not be something everyone wants. I for 
one don't. When I apply patches via 'am', I first keep them on a 
separate branch, test them out, and then merge them into 'master'.

So a yet another alternative could be to save the cover letter as the 
branch description. This branch description can then be used to generate 
the merge message. IIRC, Denton Liu is working on generating the cover 
letter text from branch description, so this feature would be like its 
inverse.
 
> * stable SHA1s means we can refer to previous versions of a patchset by
>   SHA1 rather than archive links. I propose a new changelog tag for
>   this, maybe "Previous:" or maybe even a full list of "v1:", "v2:",
>   etc. with a SHA1 or ref. Note that these SHA1s do *not* need to exist
>   in Linus's repo, but those who want can pull those branches from the
>   bot-maintained repo on git.kernel.org.
> 
> Advantages:
> 
> - we can keep using email to post patches/patchsets
> 
> - the process is opt-in (but should be encouraged) for both authors and
>   maintainers, and the transition can happen over time
> 
> - there is a central repo for convenience, but it is not necessary for
>   development to happen and is not a single point of failure -- it's
>   more like Linus's repo and can be moved or even replicated from
>   scratch by somebody else simply by having mailing list archives
> 
> - allows quick lookup of patch/patchset <-> email discussion within git
> 
> - allows diffing between versions of a single logical patchset
> 
> - patchset descriptions naturally become part of the changelog that ends
>   up in Linus's tree
> 
> Disadvantages:
> 
> - requires patching git
> 
> - requires a bot to continuously create branches for patchsets sent to
>   mailing lists
> 
> - increased storage/bandwidth for git.kernel.org (?)
> 
> - may need a couple of new wrapper scripts to automate patchset
>   construction/versioning

Just to play the devil's advocate, even though I'm in favor of something 
like this, I'll add in another disadvantage:

- The maintainer can't make small edits before pushing the changes out. 

I do that every now and then for git-gui, and Junio does that sometimes 
for Git. I don't know if the folks over at Linux do something like this, 
but using '--exact' would mean that contributors would have to send a 
re-roll for even minor changes. Its mostly an inconvenience instead of a 
problem, but I thought I'd point it out.
 
> Thoughts?

One more question, not strictly related to your proposal: right now, 
when I apply patches from contributors, I pass '-s' to 'am', so the 
applied commit would have my sign-off. The way I see it, that sign-off 
is supposed to signify that I have the right to push out the commit to 
the "main" repo, just like the author's sign-off means that they have 
the right to send me that commit.

Looking at git.git, I notice that Junio does the same. The new '--exact' 
would be incompatible with '-s', correct (since the commit message has 
changed, the SHA1 would also change)? So firstly, make sure you account 
for something like that if you haven't already (I haven't found the time 
to read your patches yet). Secondly, is it all right for the maintainer 
to just not sign-off on the commits they push out?
Jonathan Nieder Oct. 16, 2019, 8:57 p.m. UTC | #4
Hi,

A few small points.

Vegard Nossum wrote:

> * git am (or an alternative command) needs to recreate the commit
>   perfectly when applied, including applying it to the correct parent

Interesting.  "git format-patch" has a --base option to do some of
what you're looking for, for the sake of snowpatch
<https://github.com/ruscur/snowpatch>.  Though it's not exactly the
same thing you mean.

We also discussed sending merge commits by mail recently in the
virtual git committer summit[1].

Of course, the devil is in the details.  It's straightforward to use
"git bundle" to use mail as a Git transport today, but presumably you
also want the ability to perform reviews along the way and that's not
so easy with a binary format.  Do you have more details on what you'd
want the format to look like, particularly for merge commits?

[...]
>                                                                 there
> is no need for "changeset IDs" or whatever, since you can just use the
> git SHA1 which is unique, unambiguous, and stable.

In [2] the hope was for some identifier that is preserved by "git
rebase" and "git commit --amend" (so that you can track the evolution
of a change as the author improves it in response to reviews).  Is
that the conversation you're alluding to?

[...]
> Disadvantages:
>
> - requires patching git

That's not a disadvantage.  It means get to work with the Git project,
which is a welcoming bunch of people, working on userspace (seeing how
the other half lives), and improving the lives of everyone using Git.

[...]
> Date: Sat, 5 Oct 2019 16:15:59 +0200
> Subject: [PATCH 1/3] format-patch: add --complete
>
> Include the raw commit data between the changelog and the diffstat.

Oh!  I had missed this on first reading because it was in an
attachment.

I have mixed feelings.  Can you say a bit more about the advantages
and disadvantages relative to sending a git bundle?  What happens if a
mail client or a box along the way mangles whitespace in the commit
message?

Happy hacking,
Jonathan

[1] https://public-inbox.org/git/nycvar.QRO.7.76.6.1909261253400.15067@tvgsbejvaqbjf.bet/
[2] https://lore.kernel.org/ksummit-discuss/CAD=FV=UPjPpUyFTPjF-Ogzj_6LJLE4PTxMhCoCEDmH1LXSSmpQ@mail.gmail.com/
Junio C Hamano Oct. 17, 2019, 3:17 a.m. UTC | #5
Vegard Nossum <vegard.nossum@oracle.com> writes:

> Step 1:
>
> * git send-email needs to include parent SHA1s and generally all the
>   information needed to perfectly recreate the commit when applied so
>   that all the SHA1s remain the same
>
> * git am (or an alternative command) needs to recreate the commit
>   perfectly when applied, including applying it to the correct parent

You can record and convey the commit object name a series is meant
to be applied on top already, and it in general is a good way to
give a wider context in order to explain and justify the series.

On the other hand, "all the information needed to recreate..." is
not very useful.  If you want the commit object to be exactly what
you want to see at the tip of the end result, you are better off
asking your upstream to pull.  Using e-mail for that makes you and
project participants give up a lot of benefits the workflow based on
e-mail gives you, the biggest of which is the ease of giving
suggestions for improvements.  Once you insist "perfectly recreate
the commit", you are not willing to take any input from the
sidelines---worse yet, you are even dictating when the upstream
runs "git am" to turn them into commits, and do so without reading
the patches (there is no point reviewing as the person who runs "git
am" is not even allowed to fix typo or make obvious fixes to the
code, which will fail to perfectly recreate the commit).

In short, one should resist temptation to bring up "perfect
reproduction" when one talks about e-mail workflow.

> * Instead of describing a patchset in a separate introduction email, we
>   can create a merge commit between the parent of the first commit in
>   the series and the last and put the patchset description in the merge
>   commit [5]. This means the patchset description also gets to be part
>   of git history.

This has been done with tools around git-core, and merits a more
official support.  When merging a topic, it is a good idea to
explain in the merge commit that brings in the topic to the mainline
what the topic is about, and at least in the past few years Linus
and other maintainers both within and outside the kernel have been
doing so.  The cover-letter material in [PATCH 00/NN] obviously can
help those integrators when they write the merge message.

>   (This would require support for git send-email/am to be able to send
>   and apply merge commits -- at least those which have the same tree as
>   one of the parents. This is _not_ yet supported in my proposed git
>   patches.)

This does not require much from format-patch and am.  All you need
to do is to ensure that they can handle an empty commit.  What you
need more is a support in merge.  The outline for the workflow would
go like this:

 * The contributor prepares an N patch series 1/N..N/N on a single
   topic branch.

 * The summary of the series, the message that is meant to help the
   integrator, is recorded as (N+1)th commit at the tip of the topic
   branch, as an empty commit (i.e. a commit that records the same
   tree as its parent).

 * "git format-patch" is taught, when told to prepare the patch
   e-mails from such a topic branch, to notice the unusual "an empty
   commit at the tip" layout, and turn that into 0/N of the message.

 * "git am" is taught a new option to cap a topic branch made from
   patches 1/N..N/N from the incoming mbox with an extra empty
   commit, whose message is taken from the 0/N cover-letter
   material, to recreate what the contributor had in the second step
   above.

 * "git merge" is taught, when told to merge a topic branch, to
   notice the unusual "an empty commit at the tip" layout, and

   (1) merge topic~1 instead to excise the empty commit itself,

   (2) take the log message from the empty commit at the tip and use
       it to help prepare the log message of the merge.
Vegard Nossum Oct. 17, 2019, 12:23 p.m. UTC | #6
On 10/16/19 5:00 PM, Pratyush Yadav wrote:
> On 16/10/19 12:22PM, Vegard Nossum wrote:
> Just to play the devil's advocate, even though I'm in favor of something
> like this, I'll add in another disadvantage:
> 
> - The maintainer can't make small edits before pushing the changes out.
> 
> I do that every now and then for git-gui, and Junio does that sometimes
> for Git. I don't know if the folks over at Linux do something like this,
> but using '--exact' would mean that contributors would have to send a
> re-roll for even minor changes. Its mostly an inconvenience instead of a
> problem, but I thought I'd point it out.

I don't think this is a problem.

The point of 'git am --exact' is not for maintainers per se (although
they should use it if they don't have any manual changes to make), but
for the bot that keeps track of patchsets submitted via email.

The important part is that there is a git reference to the patchset that
was submitted in the patchset that was merged. You could see it as the
maintainer rolling a new version of the patchset locally and merging
that instead of merging what was submitted directly.

Of course, this relies strongly on actually having (correct) sha1
references to previous versions inside the changelog. In my original
idea, this reference would only appear inside the merge commit that
binds the patchset together to minimise churn, although maybe it is
feasible to also append it to each patch -- in that case, the "patchset"
command from my first email is not sufficient to create a new version of
a patchset.

> One more question, not strictly related to your proposal: right now,
> when I apply patches from contributors, I pass '-s' to 'am', so the
> applied commit would have my sign-off. The way I see it, that sign-off
> is supposed to signify that I have the right to push out the commit to
> the "main" repo, just like the author's sign-off means that they have
> the right to send me that commit.
> 
> Looking at git.git, I notice that Junio does the same. The new '--exact'
> would be incompatible with '-s', correct (since the commit message has
> changed, the SHA1 would also change)? So firstly, make sure you account
> for something like that if you haven't already (I haven't found the time
> to read your patches yet). Secondly, is it all right for the maintainer
> to just not sign-off on the commits they push out?

In the Linux kernel at least, only the front-line maintainers add their
signoffs; higher-level maintainers take pull requests and don't add
their own sign-offs. Only Linus (and Greg, perhaps) has commit rights to
the main repository, but he only signs off on the patches that he either
writes himself or applies directly from email.

In any case, I don't think this is a concern because of what I wrote
above -- somebody who wants to add their signoff can do it by
essentially rolling a new version of the patchset that has the signoff,
but refers to the sha1 that was submitted to them by the patchset author
so that you can still find the original commits (without the signoffs)
and reviews/discussions.

I don't want to create extra work for maintainers, so I think any
solution should involve having the existing git tools/workflows do the
right thing automatically.

How about this? If 'git am' or 'git am -s' (without --exact) finds an
email patch where the exact commit metadata is present, it automatically
appends a line to the changelog saying where it was taken from:

     Submitted-as: 111122223333444455556666777788889999aaaa

or

     Applied-from: 111122223333444455556666777788889999aaaa

Although, again, this would modify the changelog of the patch itself
rather than just the changelog of the merge commit... Maybe this is
enough and we can have the "patchset" command also add references to
previous versions of each patch rather than the patchset as a whole.


Vegard
Vegard Nossum Oct. 17, 2019, 1:08 p.m. UTC | #7
On 10/16/19 10:57 PM, Jonathan Nieder wrote:
> Hi,
> 
> A few small points.
> 
> Vegard Nossum wrote:
> 
>> * git am (or an alternative command) needs to recreate the commit
>>    perfectly when applied, including applying it to the correct parent
> 
> Interesting.  "git format-patch" has a --base option to do some of
> what you're looking for, for the sake of snowpatch
> <https://github.com/ruscur/snowpatch>.  Though it's not exactly the
> same thing you mean.

Yes, --base is great for importing email patches into git, but does not
allow resulting commits to have the same sha1 (because it doesn't have
the complete author/committer information, mainly), so it doesn't do
enough for what I want with this proposal.

> We also discussed sending merge commits by mail recently in the
> virtual git committer summit[1].
> 
> Of course, the devil is in the details.  It's straightforward to use
> "git bundle" to use mail as a Git transport today, but presumably you
> also want the ability to perform reviews along the way and that's not
> so easy with a binary format.  Do you have more details on what you'd
> want the format to look like, particularly for merge commits?

Yes, one of the goals is to continue to use git-send-email and be able
to view and review patches on a mailing list with minimal intrusion to
existing processes.

I did not envision supporting merge commits where the resulting tree is
different from all of its parents, which means any merge commits run
through 'git-format-patch --complete' would just have multiple "parent"
lines and no diff where the diff would normally be.

>> is no need for "changeset IDs" or whatever, since you can just use the
>> git SHA1 which is unique, unambiguous, and stable.
> 
> In [2] the hope was for some identifier that is preserved by "git
> rebase" and "git commit --amend" (so that you can track the evolution
> of a change as the author improves it in response to reviews).  Is
> that the conversation you're alluding to?

Yes.

(Not the specific email/thread you linked, but yes, this is the general
conversation about having stable identifiers I'm referring to.)

> 
> [...]
>> Disadvantages:
>>
>> - requires patching git
> 
> That's not a disadvantage.  It means get to work with the Git project,
> which is a welcoming bunch of people, working on userspace (seeing how
> the other half lives), and improving the lives of everyone using Git.

True! :-)

>> Date: Sat, 5 Oct 2019 16:15:59 +0200
>> Subject: [PATCH 1/3] format-patch: add --complete
>>
>> Include the raw commit data between the changelog and the diffstat.
> 
> Oh!  I had missed this on first reading because it was in an
> attachment.
> 
> I have mixed feelings.  Can you say a bit more about the advantages
> and disadvantages relative to sending a git bundle?  What happens if a
> mail client or a box along the way mangles whitespace in the commit
> message?

Yes, as we both said above: git bundle is not human readable and does
not lend itself to reviews in mail clients or on mailing lists.

Any kind of mangling is a serious concern, and my thoughts are:

  - The _diff_ itself should already be safe from mangling. If this were
not the case, then sending patches by email would be completely unsafe
and would need to be fixed somehow anyway.

  - I think I remember seeing something in either 'git am' or 'git
mailinfo' about format=flowed apparently allowing whitespace to
disappear from the line endings.

  - Isolated problems like that could be preemptively fixed (or at least
warned about) on the submitter side by e.g. warning about potential
issues when committing or running format-patch/send-email)

  - If some mail software is susceptible to mangling patches, then I
think it would great to have a mechanism for detecting that -- which
"git-format-patch --complete" would be! :-)

  - It is true that the commit message itself (and perhaps particularly
people's names) is especially vulnerable to mangling and/or reencoding
(I noticed that 'git am' seems to convert to utf8 before committing). I
honestly don't know if this would be a problem in practice, as I don't
know too much about mail software and encodings. If most people use
format-patch/send-email to send patches, then maybe it's not actually a
problem because everything is either kept as utf-8 to start with or
encoded/decoded consistently across the git userbase?

  - I was thinking of hex-encoding the raw bytes of the author and
committer lines of the extra commit metadata in the final email to keep
everything as ASCII and avoid character set conversions. It is true that
this does not stop mangling of the changelog...

  - I suspect that if mangling was a problem in practice, then we would
have seen it already and the solution to it is not specific to what I'm
proposing here.

I'd love for somebody who is more knowledgeable about email and encoding
to chime in here. It is definitely one of the biggest potential problems
and it would be good to approach it in a way that doesn't require lots
of red tape after it has already been implemented.


Vegard

> 
> Happy hacking,
> Jonathan
> 
> [1] https://public-inbox.org/git/nycvar.QRO.7.76.6.1909261253400.15067@tvgsbejvaqbjf.bet/
> [2] https://lore.kernel.org/ksummit-discuss/CAD=FV=UPjPpUyFTPjF-Ogzj_6LJLE4PTxMhCoCEDmH1LXSSmpQ@mail.gmail.com/
>
Theodore Ts'o Oct. 17, 2019, 1:11 p.m. UTC | #8
On Thu, Oct 17, 2019 at 02:23:58PM +0200, Vegard Nossum wrote:
> Of course, this relies strongly on actually having (correct) sha1
> references to previous versions inside the changelog. In my original
> idea, this reference would only appear inside the merge commit that
> binds the patchset together to minimise churn, although maybe it is
> feasible to also append it to each patch -- in that case, the "patchset"
> command from my first email is not sufficient to create a new version of
> a patchset.

This also relies on the base of the commit actually being a public
SHA1.  Sometimes developers will cherry-pick in a patch that they need
so that the kernel will actually *boot* (or otherwise fix problems
that have been fixed in other subsystems, but not yet landed in -rc2
or -rc3).

Of course, we could tell people that they should always create their
patches off of the last stable version (but then there may have been
changes pulled in via the last merge window that makes their patch not
apply), or they could be told to develop against -rc2 or -rc3, and
then cherry pick the required fix-up patches on top of -rc2 and -rc3,
but then they have to do a lot more rebuilding.

So there are no perfect solutions here, and while in the ideal world,
-rc2 and -rc3 should be perfectly stable enough for developers so that
they never need to manually patch in stablization patches, we need to
live in the real world.  I believe that Darrick told me that in the
previous development cycle, he had to wait until -rc4 before the tree
was stable enough for him to start building xfs patches on top
mainline.

(This is also true for this development cycle if you enable
CONFIG_KMEMLEAK, although fortunately, the workaround that worked for
me was to just CONFIG_KMEMLEAK --- although of course, if I do have to
run a KMEMLEAK test run, I'll need to cherry-pick the fix which landed
this week on top of the ext4 git tree.)

What this all might mean is that sometimes it will make sense to allow
the user to override the base commit so such stablization patches can
be elided.  Of course, we could force the user to create a separate
branch and rebase, but can be quite painful and slow --- and they
won't be able to test the rebased branch anyway, unless we then want
to tell them to cherry pick the stablization patches on top, and then
remove them before running "git send-email".

						- Ted
Vegard Nossum Oct. 17, 2019, 1:30 p.m. UTC | #9
On 10/17/19 5:17 AM, Junio C Hamano wrote:
> Vegard Nossum <vegard.nossum@oracle.com> writes:
> 
>> Step 1:
>>
>> * git send-email needs to include parent SHA1s and generally all the
>>    information needed to perfectly recreate the commit when applied so
>>    that all the SHA1s remain the same
>>
>> * git am (or an alternative command) needs to recreate the commit
>>    perfectly when applied, including applying it to the correct parent
> 
> You can record and convey the commit object name a series is meant
> to be applied on top already, and it in general is a good way to
> give a wider context in order to explain and justify the series.
> 
> On the other hand, "all the information needed to recreate..." is
> not very useful.  If you want the commit object to be exactly what
> you want to see at the tip of the end result, you are better off
> asking your upstream to pull.  Using e-mail for that makes you and
> project participants give up a lot of benefits the workflow based on
> e-mail gives you, the biggest of which is the ease of giving
> suggestions for improvements.  Once you insist "perfectly recreate
> the commit", you are not willing to take any input from the
> sidelines---worse yet, you are even dictating when the upstream
> runs "git am" to turn them into commits, and do so without reading
> the patches (there is no point reviewing as the person who runs "git
> am" is not even allowed to fix typo or make obvious fixes to the
> code, which will fail to perfectly recreate the commit).
> 
> In short, one should resist temptation to bring up "perfect
> reproduction" when one talks about e-mail workflow.

Please see what I wrote to Pratyush Yadav here:

https://public-inbox.org/git/a1c33600-14e6-be37-c026-8d8b8e4bad92@oracle.com/

TL;DR: the goal is not necessarily for maintainers to be able to merge
the patchset with the same SHA1 that the submitter had, but for the
patchset to have a definite SHA1 that lives in git, and which can be
used by all the participants -- submitter, reviewers, bots (including
potentially testing/CI infrastructure), and maintainers.

I am definitely not proposing to get rid of the email workflow -- on the
contrary, this it the workflow I want to preserve! :-) The "workflows"
mailing list was created for the purpose of discussing this topic (in
the context of Linux kernel development) and right now there are many
proposals that either completely cut out email or reduce it to something
like pull requests. My proposal keeps almost everything the same, except
for a few lines of extra metadata before the actual diff.

(I will answer the rest of your email separately.)


Vegard
Vegard Nossum Oct. 17, 2019, 2:01 p.m. UTC | #10
On 10/17/19 3:11 PM, Theodore Y. Ts'o wrote:
> On Thu, Oct 17, 2019 at 02:23:58PM +0200, Vegard Nossum wrote:
>> Of course, this relies strongly on actually having (correct) sha1
>> references to previous versions inside the changelog. In my original
>> idea, this reference would only appear inside the merge commit that
>> binds the patchset together to minimise churn, although maybe it is
>> feasible to also append it to each patch -- in that case, the "patchset"
>> command from my first email is not sufficient to create a new version of
>> a patchset.
> 
> This also relies on the base of the commit actually being a public
> SHA1.  Sometimes developers will cherry-pick in a patch that they need
> so that the kernel will actually *boot* (or otherwise fix problems
> that have been fixed in other subsystems, but not yet landed in -rc2
> or -rc3).
> 
> Of course, we could tell people that they should always create their
> patches off of the last stable version (but then there may have been
> changes pulled in via the last merge window that makes their patch not
> apply), or they could be told to develop against -rc2 or -rc3, and
> then cherry pick the required fix-up patches on top of -rc2 and -rc3,
> but then they have to do a lot more rebuilding.
> 
> So there are no perfect solutions here, and while in the ideal world,
> -rc2 and -rc3 should be perfectly stable enough for developers so that
> they never need to manually patch in stablization patches, we need to
> live in the real world.  I believe that Darrick told me that in the
> previous development cycle, he had to wait until -rc4 before the tree
> was stable enough for him to start building xfs patches on top
> mainline.
> 
> (This is also true for this development cycle if you enable
> CONFIG_KMEMLEAK, although fortunately, the workaround that worked for
> me was to just CONFIG_KMEMLEAK --- although of course, if I do have to
> run a KMEMLEAK test run, I'll need to cherry-pick the fix which landed
> this week on top of the ext4 git tree.)
> 
> What this all might mean is that sometimes it will make sense to allow
> the user to override the base commit so such stablization patches can
> be elided.  Of course, we could force the user to create a separate
> branch and rebase, but can be quite painful and slow --- and they
> won't be able to test the rebased branch anyway, unless we then want
> to tell them to cherry pick the stablization patches on top, and then
> remove them before running "git send-email".

Good points.

I suspect that you should almost always be able to find a good base
revision to build and test your changes on.

In your example, couldn't Darrick simply base his xfs work on the latest
xfs branch that was pulled by Linus? That should be up to date with all
things xfs without having any of the things that made Linus's tree not
work for him.

Otherwise, you could apply the stabilisation patches and then do your
final testing in a branch that merges that with your patchset, like so:

    rc1 o -----> fixup A ------> fixup B ---->o merge (tested)
(base)  \                                   /
          \                                 /
           ---> patch 001 --> patch 002 -->o patchset (submitted)

It does not seem too hard to me, and it should be pretty safe from a
test-what-you-ship point of view assuming the fixups and your patches
really are independent.

I think the more difficult problem to solve might be how to ensure that
the base commit is actually public/reachable when this is the intention.
A bot watching the mailing list could always respond with a "Hey, I
don't have that, could you rebase the series or push it somewhere?". But
it would be even better if git could tell you when you're about to
submit a patch. Maybe something like:

   git send-email --ensure-reachable-from [remote] rev^^..

In the worst case, I guess the base commit will just not be available --
the email will still have a sha1 on it, though, and which might still be
usable as an identifier for the patch/patchset. If not, it's still not
worse than the current workflow (which would still work).


Vegard
Theodore Ts'o Oct. 17, 2019, 2:47 p.m. UTC | #11
On Thu, Oct 17, 2019 at 04:01:33PM +0200, Vegard Nossum wrote:
> 
> In your example, couldn't Darrick simply base his xfs work on the latest
> xfs branch that was pulled by Linus? That should be up to date with all
> things xfs without having any of the things that made Linus's tree not
> work for him.

Sure, but sometimes there are changes in subsystems which the file
system depends upon that were also merged by Linus.  So for example,
the ext4 branch might be based on v5.3-rc4, and gets pulled after v5.3
is released, along with a huge number of other subsystem trees, so the
delta between v5.3 and v5.3-rc1 is ***huge***.

So while I could base my development on my previous ext4.git branch
(based off of v5.3-rc4), at *some* point I need to be able to sync up
with upstream.  And the usual way to do this is to start a new
development branch based on (for example) v5.4-rc2, or in some cases
v5.4-rc4.

We could keep the development branch on based off of v5.3-rc4, and
wait until things stablize, and *then* merge in v5.4-rcX, when
v5.4-rcX is finally stable, but that makes for a more complex merge,
and so it means that things like "git log origin..master" don't really
work any more.  So the preferred development practice is very much....

   rc2 o --> patch 1 --> patch 2 --> ... --> patch N 
(origin)                                     (master)

Where the "master" branch gets merged into the rewinding "dev" branch
(which works much like git's pu branch), and where the "master" branch
is what Linus will merge at the next merge window.

> Otherwise, you could apply the stabilisation patches and then do your
> final testing in a branch that merges that with your patchset, like so:
> 
>    rc1 o -----> fixup A ------> fixup B ---->o merge (tested)
> (base)  \                                   /
>          \                                 /
>           ---> patch 001 --> patch 002 -->o patchset (submitted)

I cloud do that, but remember that the checked out kernel tree is
about a gigabyte (this assumes using git clone --shared, so it doesn't
include the git pack files, and this is source only; the object files
are another 2.6 GB).  I could keep separate checked out trees, and
separate build trees, but that burns a lot of disk/SSD space.  Or I
could switch back and forth by using "git checkout" between the
development branch and the branch with the stablization patches, but
then I'm constantly having to rebuild the object files, and ccache
only helps so much.

So it's much simpler to put the fixup patches at the on top of the
origin, and then mail them out without having to play git branch
rebasing gymnastics.  When the patch series is finally ready to roll,
then the maintainer will apply the patch series on a clean branch,
since hopefully by then -rc3 or -rc4 is finally stable enough to use
as an origin point.

So the idea is that developer might be sending out revisions of their
patches on top of -rc1 plus fixup patches, but then the final version
of the patch series, after a few rounds of review, gets applied on top
of -rc3 or -rc4.

> I think the more difficult problem to solve might be how to ensure that
> the base commit is actually public/reachable when this is the intention.
> A bot watching the mailing list could always respond with a "Hey, I
> don't have that, could you rebase the series or push it somewhere?". But
> it would be even better if git could tell you when you're about to
> submit a patch. Maybe something like:
> 
>   git send-email --ensure-reachable-from [remote] rev^^..
> 
> In the worst case, I guess the base commit will just not be available --
> the email will still have a sha1 on it, though, and which might still be
> usable as an identifier for the patch/patchset. If not, it's still not
> worse than the current workflow (which would still work).

... or what we can do is allow the developer to specify the intended
base --- e.g., -rc1, even though his patchset was against "-rc1 plus fixups".

	     	       	     	  	     - Ted
Steven Rostedt Oct. 17, 2019, 3:11 p.m. UTC | #12
On Thu, 17 Oct 2019 16:01:33 +0200
Vegard Nossum <vegard.nossum@oracle.com> wrote:

> In your example, couldn't Darrick simply base his xfs work on the latest
> xfs branch that was pulled by Linus? That should be up to date with all
> things xfs without having any of the things that made Linus's tree not
> work for him.

Sure, but why?

I thought this whole exercise is to make the process easier. This seems
to be making it more complex. Now we are going to be demanding
submitters to be basing their work on a specific (older) commit.

I always tell people that submit to me, to base off of one of Linus's
latest tags. That's what I do.

-- Steve
Greg KH Oct. 17, 2019, 8:43 p.m. UTC | #13
On Wed, Oct 16, 2019 at 10:45:19AM -0400, Santiago Torres Arias wrote:
> Hi Willy, Vegard.
> 
> On Wed, Oct 16, 2019 at 01:10:09PM +0200, Willy Tarreau wrote:
> > Hi Vegard,
> > 
> > On Wed, Oct 16, 2019 at 12:22:54PM +0200, Vegard Nossum wrote:
> > > (cross-posted to git, LKML, and the kernel workflows mailing lists.)
> > > 
> > > Hi all,
> > > 
> > > I've been following Konstantin Ryabitsev's quest for better development
> > > and communication tools for the kernel [1][2][3], and I would like to
> > > propose a relatively straightforward idea which I think could bring a
> > > lot to the table.
> > > 
> > > Step 1:
> > > 
> > > * git send-email needs to include parent SHA1s and generally all the
> > >   information needed to perfectly recreate the commit when applied so
> > >   that all the SHA1s remain the same
> > > 
> > > * git am (or an alternative command) needs to recreate the commit
> > >   perfectly when applied, including applying it to the correct parent
> > > 
> > > Having these two will allow a perfect mapping between email and git;
> > > essentially email just becomes a transport for git. There are a lot of
> > > advantages to this, particularly that you have a stable way to refer to
> > > a patch or commit (despite it appearing on a mailing list), and there
> > > is no need for "changeset IDs" or whatever, since you can just use the
> > > git SHA1 which is unique, unambiguous, and stable.
> 
> I wonder if it'd be also possible to then embed gpg signatures over
> send-mail payloads so as they can be transparently transferred to the
> commit.

That's a crazy idea.  It would be nice if we could do that, I like it :)

greg k-h
Konstantin Ryabitsev Oct. 17, 2019, 8:45 p.m. UTC | #14
On Thu, Oct 17, 2019 at 01:43:43PM -0700, Greg KH wrote:
>> I wonder if it'd be also possible to then embed gpg signatures over
>> send-mail payloads so as they can be transparently transferred to the
>> commit.
>
>That's a crazy idea.  It would be nice if we could do that, I like it 
>:)

It could only possibly work if nobody ever adds their own 
"Signed-Off-By" or any other bylines. I expect this is a deal-breaker 
for most maintainers.

-K
Greg KH Oct. 18, 2019, 1:30 a.m. UTC | #15
On Thu, Oct 17, 2019 at 04:45:32PM -0400, Konstantin Ryabitsev wrote:
> On Thu, Oct 17, 2019 at 01:43:43PM -0700, Greg KH wrote:
> > > I wonder if it'd be also possible to then embed gpg signatures over
> > > send-mail payloads so as they can be transparently transferred to the
> > > commit.
> > 
> > That's a crazy idea.  It would be nice if we could do that, I like it :)
> 
> It could only possibly work if nobody ever adds their own "Signed-Off-By" or
> any other bylines. I expect this is a deal-breaker for most maintainers.

Yeah it is :(

But, if we could just have the signature on the code change, not the
changelog text, that would help with that issue.

thanks,

rgeg k-h
Konstantin Ryabitsev Oct. 18, 2019, 1:54 a.m. UTC | #16
On Thu, Oct 17, 2019 at 06:30:29PM -0700, Greg KH wrote:
>> It could only possibly work if nobody ever adds their own 
>> "Signed-Off-By" or
>> any other bylines. I expect this is a deal-breaker for most maintainers.
>
>Yeah it is :(
>
>But, if we could just have the signature on the code change, not the
>changelog text, that would help with that issue.

We totally should, and I even mused on how we would do that here:
https://public-inbox.org/git/20190910121324.GA6867@pure.paranoia.local/

However, since git's PGP signatures are made for the content in the 
actual commit record (tree hash, parent, author, commit message, etc), 
the only way we could preserve them between the email and the git tree 
is if we never modify any of that data. The SOB and other trailers would 
have to only be applied to the merge commit, or migrate into commit 
notes.

-K
Eric Wong Oct. 18, 2019, 2:22 a.m. UTC | #17
Vegard Nossum <vegard.nossum@oracle.com> wrote:
<snip>

> Disadvantages:
> 
> - requires patching git

The bigger disadvantage is this won't work with a historical
patch series (and some folks stay on ancient git).  But maybe
that window for that is only a few years...

The toughest part right now for public-inbox is trying to make
sense of --range-diff (supporting --interdiff would be easy, I
think...).  Also, we've only had --range-diff for a year or
so.

Your proposal would make things 100% easier for public-inbox
to deal with future --range-diff uses, however :)

> - requires a bot to continuously create branches for patchsets sent to
>   mailing lists

Not necessarily, being able to search on commit OIDs would
be pretty handy itself for dealing with --range-diff output
in public-inbox, so there's no real need to actually make
the branch in git.

I also have a parallel solution in the works to make
--range-diff output more amenable for search engines like
public-inbox by adding blob OIDs to its output:

  https://public-inbox.org/git/20191017121045.GA15364@dcvr/
  I shall call myself an "SEO expert" from now on :>

> Thoughts?

Pretty much the same concerns others brought up around exactness
and working on top of cherry-picks.

> PS: Eric Wong described something that comes quite close to this idea, but
> AFAICT without actually recreating commits exactly. I've included the link
> for completeness. [4]

> [4]: https://lore.kernel.org/workflows/20191008003931.y4rc2dp64gbhv5ju@dcvr/

My plan is to work on interdiff support in the next week or so
once bugs are fixed and public-inbox v1.2 is out the door.  Not
sure about range-diff and reverse-mapping blobs -> trees ->
commits, but searching on "git patch-id --stable" output is also
on the table.

PS: Attached patches: I have nothing against using MIME for those,
    (not speaking for anybody else).  public-inbox needs to handle
    those better w.r.t search indexing linkification.  And then
    I found some bugs for --reindex corner cases which I'm still
    working on :x
Willy Tarreau Oct. 18, 2019, 2:52 a.m. UTC | #18
On Thu, Oct 17, 2019 at 09:54:47PM -0400, Konstantin Ryabitsev wrote:
> On Thu, Oct 17, 2019 at 06:30:29PM -0700, Greg KH wrote:
> > > It could only possibly work if nobody ever adds their own
> > > "Signed-Off-By" or
> > > any other bylines. I expect this is a deal-breaker for most maintainers.
> > 
> > Yeah it is :(
> > 
> > But, if we could just have the signature on the code change, not the
> > changelog text, that would help with that issue.
> 
> We totally should, and I even mused on how we would do that here:
> https://public-inbox.org/git/20190910121324.GA6867@pure.paranoia.local/
> 
> However, since git's PGP signatures are made for the content in the actual
> commit record (tree hash, parent, author, commit message, etc), the only way
> we could preserve them between the email and the git tree is if we never
> modify any of that data. The SOB and other trailers would have to only be
> applied to the merge commit, or migrate into commit notes.

There's also the possibility to handle this a bit like we do when adding
comments before the SOB: a PGP signature would apply to the text *before*
it only. We could then have long chains of SOB, PGP, SOB, PGP etc.

Willy
Nicolas Belouin Oct. 18, 2019, 6:34 a.m. UTC | #19
On 10/18/19 4:52 AM, Willy Tarreau wrote:
> On Thu, Oct 17, 2019 at 09:54:47PM -0400, Konstantin Ryabitsev wrote:
>> On Thu, Oct 17, 2019 at 06:30:29PM -0700, Greg KH wrote:
>>>> It could only possibly work if nobody ever adds their own
>>>> "Signed-Off-By" or
>>>> any other bylines. I expect this is a deal-breaker for most maintainers.
>>> Yeah it is :(
>>>
>>> But, if we could just have the signature on the code change, not the
>>> changelog text, that would help with that issue.
>> We totally should, and I even mused on how we would do that here:
>> https://public-inbox.org/git/20190910121324.GA6867@pure.paranoia.local/
>>
>> However, since git's PGP signatures are made for the content in the actual
>> commit record (tree hash, parent, author, commit message, etc), the only way
>> we could preserve them between the email and the git tree is if we never
>> modify any of that data. The SOB and other trailers would have to only be
>> applied to the merge commit, or migrate into commit notes.
> There's also the possibility to handle this a bit like we do when adding
> comments before the SOB: a PGP signature would apply to the text *before*
> it only. We could then have long chains of SOB, PGP, SOB, PGP etc.
>
> Willy

I don't think it can work that easily as the signed content is not just
the message.
It would need git to support nesting signatures and to allow amending a
commit without
touching the signature and to allow adding one to cover the new content
and to have a
way to verify every step.
Moreover you won't be able to reparent the commit as a maintainer (wich
I think is
also a deal-breaker)

Nicolas
Vegard Nossum Oct. 18, 2019, 2:27 p.m. UTC | #20
On 10/16/19 4:45 PM, Santiago Torres Arias wrote:
> Hi Willy, Vegard.
> 
> On Wed, Oct 16, 2019 at 01:10:09PM +0200, Willy Tarreau wrote:
>> Hi Vegard,
>>
>> On Wed, Oct 16, 2019 at 12:22:54PM +0200, Vegard Nossum wrote:
>>> (cross-posted to git, LKML, and the kernel workflows mailing lists.)
>>>
>>> Hi all,
>>>
>>> I've been following Konstantin Ryabitsev's quest for better development
>>> and communication tools for the kernel [1][2][3], and I would like to
>>> propose a relatively straightforward idea which I think could bring a
>>> lot to the table.
>>>
>>> Step 1:
>>>
>>> * git send-email needs to include parent SHA1s and generally all the
>>>    information needed to perfectly recreate the commit when applied so
>>>    that all the SHA1s remain the same
>>>
>>> * git am (or an alternative command) needs to recreate the commit
>>>    perfectly when applied, including applying it to the correct parent
>>>
>>> Having these two will allow a perfect mapping between email and git;
>>> essentially email just becomes a transport for git. There are a lot of
>>> advantages to this, particularly that you have a stable way to refer to
>>> a patch or commit (despite it appearing on a mailing list), and there
>>> is no need for "changeset IDs" or whatever, since you can just use the
>>> git SHA1 which is unique, unambiguous, and stable.
> 
> I wonder if it'd be also possible to then embed gpg signatures over
> send-mail payloads so as they can be transparently transferred to the
> commit.
> 
> -Santiago
> 

I just played a bit with this and with my proposed patch for
git-format-patch the signature is already part of the output:

$ ./git-format-patch --complete HEAD^-
0001-format-patch-add-complete.patch

$ cat 0001-format-patch-add-complete.patch
 From ac30b08065cd55362a7244a3bbc8df3563cefaaa Mon Sep 17 00:00:00 2001
From: Vegard Nossum <vegard.nossum@oracle.com>
Date: Sat, 5 Oct 2019 16:15:59 +0200
Subject: [PATCH] format-patch: add --complete

Include the raw commit data between the changelog and the diffstat.
This will allow 'git am' to reconstruct the commit exactly to the point
where the sha1 will be the same.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
---
commit ac30b08065cd55362a7244a3bbc8df3563cefaaa
tree 8f09d9d6ed78f8617b2fe54fe9712990ba808546
parent 108b97dc372828f0e72e56bbb40cae8e1e83ece6
author Vegard Nossum <vegard.nossum@oracle.com> 1570284959 +0200
committer Vegard Nossum <vegard.nossum@oracle.com> 1571408340 +0200
gpgsig -----BEGIN PGP SIGNATURE-----
  Version: GnuPG v1

  iQIcBAABAgAGBQJdqcnVAAoJEAvO9Nj+mLpYPEMP/0qyUF6U9y6FMM3BQrjteGGY
  IEEwmkvfW8vMBdjXRjmSI1jxRUuW+xJs3kxezNuW79Gzkl63PlS3CdW50yLlWau6
  2gU4R8oSNr7vxpgfAscELxaAuvUSp7Vb1FEPc5kPW06Sprg4PkLkUMD71ALRnGMV
  TxTVMDbMYg2xHpwBFs1ZyF2l0ElqOvRqoQqYfvRql1rgbs5LhF0RevkIN5xswj93
  3Gz1CuB8MURX2lfglfYSTy/05Rx3w/QHxwNlbbPDtXrexySf+a70j/Z6i2/BIzR/
  kxlZJ/k4ZPN931mxFcLPBsV/K51uP378oEH1QdaZyO2jz1rj+AZxXlgXe8J3ZAmt
  XYT/FMze5lukd7EQDO5vPZazp1dnJ6wnmrAd8shCSWe23vybDMCYnjXTuwAXwbA5
  R7ffKxm3MwRn9LKsbHFiV0J8tS1/fHbOIEXQDJ+kFhKqys0RSXipDZU61LnogXaw
  827TcsUYLvkYlQ+LdmSjZ537E+bUTo3Udb/UkGbgwSSm9LTjHnAI34S6dxSZ+1cl
  jBD54v8u9I1hEImWxGbXns7ET1fh17Z4PoTPpA4COt3puAQY7vB7inGY3/kWz+7z
  iRieHyD/W6lba4rqNYHBxacD4JTXN9S9Z7o6F4ijeGQThbA77RWD64SGjuJM0mC7
  mGUqvHz0pn7zOl1ZOS26
  =gCT0
  -----END PGP SIGNATURE-----

---
  builtin/log.c | 12 ++++++++++++
  log-tree.c    | 17 +++++++++++++++++
  revision.h    |  3 ++-
  3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/builtin/log.c b/builtin/log.c
index c4b35fdaf9..81c1164ae5 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1545,6 +1545,7 @@ int cmd_format_patch(int argc, const char **argv, 
const char *prefix)
[...]

There is no corresponding support in 'git am' yet, however.

Seeing how large this signature is, I have to admit that I am partial to
Konstantin's suggestion of using minisign. This seems like something
that could be added to git as an alternative to gpg without too much
trouble, I think.


Vegard
Santiago Torres Arias Oct. 18, 2019, 3:50 p.m. UTC | #21
On Fri, Oct 18, 2019 at 08:34:17AM +0200, Nicolas Belouin wrote:
> On 10/18/19 4:52 AM, Willy Tarreau wrote:
> > On Thu, Oct 17, 2019 at 09:54:47PM -0400, Konstantin Ryabitsev wrote:
> >> On Thu, Oct 17, 2019 at 06:30:29PM -0700, Greg KH wrote:
> >>>> It could only possibly work if nobody ever adds their own
> >>>> "Signed-Off-By" or
> >>>> any other bylines. I expect this is a deal-breaker for most maintainers.
> >>> Yeah it is :(
> >>>
> >>> But, if we could just have the signature on the code change, not the
> >>> changelog text, that would help with that issue.
> >> We totally should, and I even mused on how we would do that here:
> >> https://urldefense.proofpoint.com/v2/url?u=https-3A__public-2Dinbox.org_git_20190910121324.GA6867-40pure.paranoia.local_&d=DwICaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=yZMPY-APGKyVIX7HgQFZJA&m=-7NJMybpa_bV7Y1FxWmqo1cUHOsDXAsRR1vvpQmYhyI&s=iFHNwBfYAPr---qMdv0mvKQAxqjXxvf1mAiAYZG6DIE&e= 
> >>
> >> However, since git's PGP signatures are made for the content in the actual
> >> commit record (tree hash, parent, author, commit message, etc), the only way
> >> we could preserve them between the email and the git tree is if we never
> >> modify any of that data. The SOB and other trailers would have to only be
> >> applied to the merge commit, or migrate into commit notes.
> > There's also the possibility to handle this a bit like we do when adding
> > comments before the SOB: a PGP signature would apply to the text *before*
> > it only. We could then have long chains of SOB, PGP, SOB, PGP etc.
> >
> > Willy
> 
> I don't think it can work that easily as the signed content is not just
> the message.
> It would need git to support nesting signatures and to allow amending a
> commit without
> touching the signature and to allow adding one to cover the new content
> and to have a
> way to verify every step.
> Moreover you won't be able to reparent the commit as a maintainer (wich
> I think is
> also a deal-breaker)

For reference, we did something similar here[1]. I'll acknowledge it's
somewhat of a niche use, and there's a danger with multiple signature
types that could mean many different things...

I do wonder if an over-lying tool could probably provide with more
granular verification over mutiple gpg payloads inside of a commit...

Cheers!
-Santiago.

[1] https://dl.acm.org/citation.cfm?id=3196523
Santiago Torres Arias Oct. 18, 2019, 3:54 p.m. UTC | #22
> Seeing how large this signature is, I have to admit that I am partial to
> Konstantin's suggestion of using minisign. This seems like something
> that could be added to git as an alternative to gpg without too much
> trouble, I think.
> 
> 

I wonder how big the pgp payload would be with ed25519 as the underlying
algorithm. AFAICT, the payload of a minisign signature vs a signature
packet have almost the same fields...
Konstantin Ryabitsev Oct. 18, 2019, 4:03 p.m. UTC | #23
On Fri, Oct 18, 2019 at 11:54:09AM -0400, Santiago Torres Arias wrote:
>> Seeing how large this signature is, I have to admit that I am partial to
>> Konstantin's suggestion of using minisign. This seems like something
>> that could be added to git as an alternative to gpg without too much
>> trouble, I think.
>
>I wonder how big the pgp payload would be with ed25519 as the underlying
>algorithm. AFAICT, the payload of a minisign signature vs a signature
>packet have almost the same fields...

It's smaller, but it's not a one-liner. Here's a comparison using 
ED25519 keys of the same length:

minisign:

RWQ4kF9UdFgeSt3LqnS3WnrLlx2EnuIFW7euw5JnLUHY/79ipftmj7A2ug7FiR2WmnFNoSacWr7llBuyInVmRL/VRovj1LFtvA0=

pgp:

-----BEGIN PGP SIGNATURE-----

iHUEARYIAB0WIQR2vl2yUnHhSB5njDW2xBzjVmSZbAUCXaniFAAKCRC2xBzjVmSZ
bHA5AP46sSPFJfL2tbXwswvj0v2DjLAQ9doxl9bfj9iPZu+3qwEAw5qAMbjw9teL
L7+NbJ0WVniDWTgt+5ruQ2V9vyfYxAc=
=B/St
-----END PGP SIGNATURE-----

-K
Santiago Torres Arias Oct. 18, 2019, 4:11 p.m. UTC | #24
On Fri, Oct 18, 2019 at 12:03:43PM -0400, Konstantin Ryabitsev wrote:
> On Fri, Oct 18, 2019 at 11:54:09AM -0400, Santiago Torres Arias wrote:
> > > Seeing how large this signature is, I have to admit that I am partial to
> > > Konstantin's suggestion of using minisign. This seems like something
> > > that could be added to git as an alternative to gpg without too much
> > > trouble, I think.
> > 
> > I wonder how big the pgp payload would be with ed25519 as the underlying
> > algorithm. AFAICT, the payload of a minisign signature vs a signature
> > packet have almost the same fields...
> 
> It's smaller, but it's not a one-liner. Here's a comparison using ED25519
> keys of the same length:
> 
> minisign:
> 
> RWQ4kF9UdFgeSt3LqnS3WnrLlx2EnuIFW7euw5JnLUHY/79ipftmj7A2ug7FiR2WmnFNoSacWr7llBuyInVmRL/VRovj1LFtvA0=
> 
> pgp:
> 
> -----BEGIN PGP SIGNATURE-----
> 
> iHUEARYIAB0WIQR2vl2yUnHhSB5njDW2xBzjVmSZbAUCXaniFAAKCRC2xBzjVmSZ
> bHA5AP46sSPFJfL2tbXwswvj0v2DjLAQ9doxl9bfj9iPZu+3qwEAw5qAMbjw9teL
> L7+NbJ0WVniDWTgt+5ruQ2V9vyfYxAc=
> =B/St

Yeah, the discrepancy mostly comes from pgp embedding a timestamp and a
longer keyid (+a full keyid fingerprint in pgp 2.1+). Minisign keyids
are 8 random bytes, apparently.

It doesn't seem like an amazing win in terms of succintness, imvho...

Cheers!
-Santiago.
Theodore Ts'o Oct. 18, 2019, 4:15 p.m. UTC | #25
On Fri, Oct 18, 2019 at 04:27:48PM +0200, Vegard Nossum wrote:
> commit ac30b08065cd55362a7244a3bbc8df3563cefaaa
> tree 8f09d9d6ed78f8617b2fe54fe9712990ba808546
> parent 108b97dc372828f0e72e56bbb40cae8e1e83ece6
> author Vegard Nossum <vegard.nossum@oracle.com> 1570284959 +0200
> committer Vegard Nossum <vegard.nossum@oracle.com> 1571408340 +0200
> gpgsig -----BEGIN PGP SIGNATURE-----
	...

Would it perhaps be possible to put some or all of these headers after
the patch, as a set of "trailers"?  That would make it easier for
human readers of the e-mail to get the bits that they most care
about.... namely, the patch itself.  :-)

If we move the PGP signature to the end, then the fact that it is so
big and bulky becomes much less of an issue.  A mini-sig might still
be a cool thing, from a space savings perspective both in the mail
archives, and in the git repo itself, if we start signing all commits.
But that seems like a separable issue.

Thanks,

					- Ted
Vegard Nossum Oct. 18, 2019, 4:50 p.m. UTC | #26
On 10/18/19 6:15 PM, Theodore Y. Ts'o wrote:
> On Fri, Oct 18, 2019 at 04:27:48PM +0200, Vegard Nossum wrote:
>> commit ac30b08065cd55362a7244a3bbc8df3563cefaaa
>> tree 8f09d9d6ed78f8617b2fe54fe9712990ba808546
>> parent 108b97dc372828f0e72e56bbb40cae8e1e83ece6
>> author Vegard Nossum <vegard.nossum@oracle.com> 1570284959 +0200
>> committer Vegard Nossum <vegard.nossum@oracle.com> 1571408340 +0200
>> gpgsig -----BEGIN PGP SIGNATURE-----
> 	...
> 
> Would it perhaps be possible to put some or all of these headers after
> the patch, as a set of "trailers"?  That would make it easier for
> human readers of the e-mail to get the bits that they most care
> about.... namely, the patch itself.  :-)
> 

Yes, agreed.

I started out using this approach, but I changed it because the
implementation was a bit annoying: 'git am' runs 'git mailsplit',
which just splits the email into two parts:

1) headers, changelog, and diffstat;
2) diff and signature.

One of my PoC patches changes mailsplit to split the extra metadata into
a third file.

The problem I ran into with putting the metadata at the end was
detecting where the diff ends. A comment in 'git apply' suggested that
detecting the difference between "--" as a diff/signature separator and
as part of the diff is nontrivial in the sense that you need to actually
do some parsing and keep track of hunk sizes.

I can try to put it at the end, but maybe the git people have some hints
that would make the implementation easier? Is it okay to reimplement a
simple diff parser in mailsplit?

Thanks,


Vegard
Konstantin Ryabitsev Oct. 18, 2019, 6 p.m. UTC | #27
On Fri, Oct 18, 2019 at 12:11:22PM -0400, Santiago Torres Arias wrote:
>> It's smaller, but it's not a one-liner. Here's a comparison using 
>> ED25519
>> keys of the same length:
>>
>> minisign:
>>
>> RWQ4kF9UdFgeSt3LqnS3WnrLlx2EnuIFW7euw5JnLUHY/79ipftmj7A2ug7FiR2WmnFNoSacWr7llBuyInVmRL/VRovj1LFtvA0=
>>
>> pgp:
>>
>> -----BEGIN PGP SIGNATURE-----
>>
>> iHUEARYIAB0WIQR2vl2yUnHhSB5njDW2xBzjVmSZbAUCXaniFAAKCRC2xBzjVmSZ
>> bHA5AP46sSPFJfL2tbXwswvj0v2DjLAQ9doxl9bfj9iPZu+3qwEAw5qAMbjw9teL
>> L7+NbJ0WVniDWTgt+5ruQ2V9vyfYxAc=
>> =B/St
>
>Yeah, the discrepancy mostly comes from pgp embedding a timestamp and a
>longer keyid (+a full keyid fingerprint in pgp 2.1+). Minisign keyids
>are 8 random bytes, apparently.
>
>It doesn't seem like an amazing win in terms of succintness, imvho...

There isn't, but ED25519 subkeys are still very rare among developers.  
Many have 4096-bit RSA subkeys, and you can imagine how large the sigs 
from those are.

I want to underline that my use of minisign was specifically for patches 
sent via email, without the intent of preserving them in git history 
(which is why in my proposal they are put under the `---` cutoff). Git 
itself would continue to use PGP signing.

(This also means that we don't necessarily need to make this a native 
part of git -- it can be accomplished by a combination of wrappers, 
git-format-patch parameters, and a pre-applypatch hook. However, the 
likelihood of adoption in this case would be very low.)

-K
Theodore Ts'o Oct. 18, 2019, 7:14 p.m. UTC | #28
On Fri, Oct 18, 2019 at 06:50:51PM +0200, Vegard Nossum wrote:
> I started out using this approach, but I changed it because the
> implementation was a bit annoying: 'git am' runs 'git mailsplit',
> which just splits the email into two parts:
> 
> 1) headers, changelog, and diffstat;
> 2) diff and signature.
> 
> One of my PoC patches changes mailsplit to split the extra metadata into
> a third file.
> 
> The problem I ran into with putting the metadata at the end was
> detecting where the diff ends. A comment in 'git apply' suggested that
> detecting the difference between "--" as a diff/signature separator and
> as part of the diff is nontrivial in the sense that you need to actually
> do some parsing and keep track of hunk sizes.

Could we cheat by having "git format-patch" add a "Diff-size" in the
header which gives the number of lines in the diff so git am can just
count lines to find the Trailer section?

Thanks,

					- Ted
Willy Tarreau Oct. 20, 2019, 3:17 a.m. UTC | #29
On Fri, Oct 18, 2019 at 03:14:56PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Oct 18, 2019 at 06:50:51PM +0200, Vegard Nossum wrote:
> > The problem I ran into with putting the metadata at the end was
> > detecting where the diff ends. A comment in 'git apply' suggested that
> > detecting the difference between "--" as a diff/signature separator and
> > as part of the diff is nontrivial in the sense that you need to actually
> > do some parsing and keep track of hunk sizes.
> 
> Could we cheat by having "git format-patch" add a "Diff-size" in the
> header which gives the number of lines in the diff so git am can just
> count lines to find the Trailer section?

Be careful with this, it starts like this and ends up with non-editable
patches. I'd rather have git-am use best-effort detection of the end.
Also when dealing with stable backports, I've done a lot of
"cat foo.diff >> bar.patch" to fixup some patches in which I just had
to move some parts around. Having to count lines and edit a counter
somewhere is going to become really painful.

Just my two cents,
Willy
Laurent Pinchart Oct. 20, 2019, 5:50 a.m. UTC | #30
On Thu, Oct 17, 2019 at 06:30:29PM -0700, Greg KH wrote:
> On Thu, Oct 17, 2019 at 04:45:32PM -0400, Konstantin Ryabitsev wrote:
> > On Thu, Oct 17, 2019 at 01:43:43PM -0700, Greg KH wrote:
> >>> I wonder if it'd be also possible to then embed gpg signatures over
> >>> send-mail payloads so as they can be transparently transferred to the
> >>> commit.
> >> 
> >> That's a crazy idea.  It would be nice if we could do that, I like it :)
> > 
> > It could only possibly work if nobody ever adds their own "Signed-Off-By" or
> > any other bylines. I expect this is a deal-breaker for most maintainers.
> 
> Yeah it is :(
> 
> But, if we could just have the signature on the code change, not the
> changelog text, that would help with that issue.

I ran into a related issue recently when thinking about how to implement
server-side workflows (for a non-kernel project). My goal is to ensure a
patch can only be pushed to the master branch if it has received review.
The easy way to do so it to check the Reviewed-by tags, but those can
easily be forged. I was thus wondering if we should have a way to sign
tags (as in commit message tags, not git tags).
Vegard Nossum Oct. 20, 2019, 6:28 a.m. UTC | #31
On 10/20/19 5:17 AM, Willy Tarreau wrote:
> On Fri, Oct 18, 2019 at 03:14:56PM -0400, Theodore Y. Ts'o wrote:
>> On Fri, Oct 18, 2019 at 06:50:51PM +0200, Vegard Nossum wrote:
>>> The problem I ran into with putting the metadata at the end was
>>> detecting where the diff ends. A comment in 'git apply' suggested that
>>> detecting the difference between "--" as a diff/signature separator and
>>> as part of the diff is nontrivial in the sense that you need to actually
>>> do some parsing and keep track of hunk sizes.
>>
>> Could we cheat by having "git format-patch" add a "Diff-size" in the
>> header which gives the number of lines in the diff so git am can just
>> count lines to find the Trailer section?
> 
> Be careful with this, it starts like this and ends up with non-editable
> patches. I'd rather have git-am use best-effort detection of the end.

Expect filesystem developers to come up with a format that uses extents ;-)

> Also when dealing with stable backports, I've done a lot of
> "cat foo.diff >> bar.patch" to fixup some patches in which I just had
> to move some parts around. Having to count lines and edit a counter
> somewhere is going to become really painful.

I almost have some new patches ready for putting the metadata after the
patch using a very bare-bones diff parser (it's actually not that bad),
I just need to fix a few corner cases that are causing breakage in the
git test suite.


Vegard
Vegard Nossum Oct. 22, 2019, 12:11 p.m. UTC | #32
On 10/20/19 8:28 AM, Vegard Nossum wrote:
> 
> On 10/20/19 5:17 AM, Willy Tarreau wrote:
>> On Fri, Oct 18, 2019 at 03:14:56PM -0400, Theodore Y. Ts'o wrote:
>>> On Fri, Oct 18, 2019 at 06:50:51PM +0200, Vegard Nossum wrote:
>>>> The problem I ran into with putting the metadata at the end was
>>>> detecting where the diff ends. A comment in 'git apply' suggested that
>>>> detecting the difference between "--" as a diff/signature separator and
>>>> as part of the diff is nontrivial in the sense that you need to 
>>>> actually
>>>> do some parsing and keep track of hunk sizes.
>>>
>>> Could we cheat by having "git format-patch" add a "Diff-size" in the
>>> header which gives the number of lines in the diff so git am can just
>>> count lines to find the Trailer section?
>>
>> Be careful with this, it starts like this and ends up with non-editable
>> patches. I'd rather have git-am use best-effort detection of the end.
> 
> Expect filesystem developers to come up with a format that uses extents ;-)
> 
>> Also when dealing with stable backports, I've done a lot of
>> "cat foo.diff >> bar.patch" to fixup some patches in which I just had
>> to move some parts around. Having to count lines and edit a counter
>> somewhere is going to become really painful.
> 
> I almost have some new patches ready for putting the metadata after the
> patch using a very bare-bones diff parser (it's actually not that bad),
> I just need to fix a few corner cases that are causing breakage in the
> git test suite.

I sent v2 of the patches (with metadata _after_ the diff) to the git
list here:

https://public-inbox.org/git/20191022114518.32055-1-vegard.nossum@oracle.com/T/#u

As I wrote in there, we could already today start using

   git am --message-id

when applying patches and this would provide something that a bot could
annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
would already be a pretty nice improvement over today's situation.

Sadly, since the beginning of 2018, this was only used for a measly
~0.14% of all non-merge commits in the kernel:

$ git rev-list --count --no-merges --since='2018-01-01' --grep 
'Message-Id: ' linus/master
178

$ git rev-list --count --no-merges --since='2018-01-01' linus/master
130777

So how can we spread the word about --message-id and get maintainers to
actually use it? I don't suppose it's reasonable to change the 'git am'
default setting?


Vegard
Theodore Ts'o Oct. 22, 2019, 1:53 p.m. UTC | #33
On Tue, Oct 22, 2019 at 02:11:22PM +0200, Vegard Nossum wrote:
> 
> As I wrote in there, we could already today start using
> 
>   git am --message-id
> 
> when applying patches and this would provide something that a bot could
> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
> would already be a pretty nice improvement over today's situation.
> 
> Sadly, since the beginning of 2018, this was only used for a measly
> ~0.14% of all non-merge commits in the kernel:
> 
> $ git rev-list --count --no-merges --since='2018-01-01' --grep 'Message-Id:
> ' linus/master
> 178

You might also want to count commits which have a link tag with a
Message-Id:

Link: https://lore.kernel.org/r/c3438dad66a34a7d4e7509a5dd64c2326340a52a.1571647180.git.mbobrowski@mbobrowski.org

That's because some kernel developers have been using a hook script like this:

#!/bin/sh
# For .git/hooks/applypatch-msg
#
# You must have the following in .git/config:
# [am]
#	messageid = true
	
. git-sh-setup
perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
test -x "$GIT_DIR/hooks/commit-msg" &&
	exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
:

.... as we had reached rough consensus that this was the best way to
incorprate the message id (since it could made to be a clickable link
in tools like gitk, for example).  This rough consensus has only been
in place since around the time of the Maintainer's Summit in Lisbon,
so uptake is still probably a bit slow.  I'd expect to see a lot more
of this in the next merge window, though.

						- Ted
Vegard Nossum Oct. 22, 2019, 4:29 p.m. UTC | #34
On 10/22/19 3:53 PM, Theodore Y. Ts'o wrote:
> On Tue, Oct 22, 2019 at 02:11:22PM +0200, Vegard Nossum wrote:
>>
>> As I wrote in there, we could already today start using
>>
>>    git am --message-id
>>
>> when applying patches and this would provide something that a bot could
>> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
>> would already be a pretty nice improvement over today's situation.
>>
>> Sadly, since the beginning of 2018, this was only used for a measly
>> ~0.14% of all non-merge commits in the kernel:
>>
>> $ git rev-list --count --no-merges --since='2018-01-01' --grep 'Message-Id:
>> ' linus/master
>> 178
> 
> You might also want to count commits which have a link tag with a
> Message-Id:
> 
> Link: https://lore.kernel.org/r/c3438dad66a34a7d4e7509a5dd64c2326340a52a.1571647180.git.mbobrowski@mbobrowski.org
> 
> That's because some kernel developers have been using a hook script like this:
> 
> #!/bin/sh
> # For .git/hooks/applypatch-msg
> #
> # You must have the following in .git/config:
> # [am]
> #	messageid = true
> 	
> . git-sh-setup
> perl -pi -e 's|^Message-Id:\s*<?([^>]+)>?$|Link: https://lore.kernel.org/r/$1|g;' "$1"
> test -x "$GIT_DIR/hooks/commit-msg" &&
> 	exec "$GIT_DIR/hooks/commit-msg" ${1+"$@"}
> :
> 
> .... as we had reached rough consensus that this was the best way to
> incorprate the message id (since it could made to be a clickable link
> in tools like gitk, for example).  This rough consensus has only been
> in place since around the time of the Maintainer's Summit in Lisbon,
> so uptake is still probably a bit slow.  I'd expect to see a lot more
> of this in the next merge window, though.

Thanks, I was not aware of this!

Seems like something that should go in Documentation/maintainer/,
right?

The figure is much better, 16.7% on all non-merges since 2018-01-01.
This should help and we can maybe already do some interesting things
with git notes and lore/public-inbox.


Vegard
Eric Wong Oct. 22, 2019, 7:01 p.m. UTC | #35
Vegard Nossum <vegard.nossum@oracle.com> wrote:
> I sent v2 of the patches (with metadata _after_ the diff) to the git
> list here:
> 
> https://public-inbox.org/git/20191022114518.32055-1-vegard.nossum@oracle.com/T/#u
> 
> As I wrote in there, we could already today start using
> 
>    git am --message-id
> 
> when applying patches and this would provide something that a bot could
> annotate with git notes pointing to lore/LKML/LWN/whatever. I think that
> would already be a pretty nice improvement over today's situation.
> 
> Sadly, since the beginning of 2018, this was only used for a measly
> ~0.14% of all non-merge commits in the kernel:

--message-id helps provide a concrete reference, yes.  However,
being able to search for commit subjects in the mail archives is
already implemented via cgit filter.  An example is here:

https://80x24.org/mirrors/git.git/commit/?id=8da56a484800023a545d7a7c022473f5aa9e720f

The link at "userdiff: fix some corner cases in dts regex" makes a link to:

https://public-inbox.org/git/?x=t&q=%22userdiff:+fix+some+corner+cases+in+dts+regex%22
(side note: not sure if that "x=t" to expand the whole message is good...)

That link is generated by examples/cgit-commit-filter.lua in the
 public-inbox source:

https://public-inbox.org/meta/1677253/s/?b=examples/cgit-commit-filter.lua

My longer term plan is to be able to use the post-image blob OIDs
from cgit to generate a search query for public-inbox such as:

https://public-inbox.org/git/?q=dfpost:afc6b5b404+dfpost:072d58b69d+dfpost:4353b8220c+dfpost:333a625c70+dfpost:e187d356f6

Which finds all versions of the userdiff patch posted.  But AFAIK
there's no easy way to get at blob OIDs from cgit to a Lua filter...
diff mbox series

Patch

From 3120370db888889f32e07a082edb4722db8feef1 Mon Sep 17 00:00:00 2001
From: Vegard Nossum <vegard.nossum@oracle.com>
Date: Wed, 16 Oct 2019 02:36:18 +0200
Subject: [PATCH 3/3] am: add --exact

This uses exact metadata when creating the commit object, hopefully
reconstructing the commit with the exact same SHA1.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
---
commit 3120370db888889f32e07a082edb4722db8feef1
tree 61b7556f06fd6fcb0f4a43940ec0cbc29ccf1bcc
parent 51bb531eb57320caf3761680ebf77c25b89b3719
author Vegard Nossum <vegard.nossum@oracle.com> 1571186178 +0200
committer Vegard Nossum <vegard.nossum@oracle.com> 1571219301 +0200

---
 builtin/am.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 94 insertions(+), 9 deletions(-)

diff --git a/builtin/am.c b/builtin/am.c
index 4190383bba..069a625895 100644
--- a/builtin/am.c
+++ b/builtin/am.c
@@ -118,6 +118,7 @@  struct am_state {
 	int allow_rerere_autoupdate;
 	const char *sign_commit;
 	int rebasing;
+	int exact;
 };
 
 /**
@@ -399,6 +400,8 @@  static void am_load(struct am_state *state)
 
 	state->rebasing = !!file_exists(am_path(state, "rebasing"));
 
+	state->exact = read_state_file(&sb, state, "exact", 1);
+
 	strbuf_release(&sb);
 }
 
@@ -1005,6 +1008,8 @@  static void am_setup(struct am_state *state, enum patch_format patch_format,
 	else
 		write_state_text(state, "applying", "");
 
+	write_state_bool(state, "exact", state->exact);
+
 	if (!get_oid("HEAD", &curr_head)) {
 		write_state_text(state, "abort-safety", oid_to_hex(&curr_head));
 		if (!state->rebasing)
@@ -1548,19 +1553,88 @@  static int fall_back_threeway(const struct am_state *state, const char *index_pa
  */
 static void do_commit(const struct am_state *state)
 {
+	struct object_id meta_commit = {};
+	struct object_id meta_tree = {};
+
 	struct object_id tree, parent, commit;
 	const struct object_id *old_oid;
 	struct commit_list *parents = NULL;
 	const char *reflog_msg, *author;
 	struct strbuf sb = STRBUF_INIT;
 
+	if (state->exact) {
+		/*
+		 * Scan meta file for parents + other data
+		 */
+
+		struct strbuf line = STRBUF_INIT;
+		FILE *fp = xfopen(am_path(state, "meta"), "r");
+
+		while (!strbuf_getline_lf(&line, fp)) {
+			const char *rest;
+
+			if (skip_prefix(line.buf, "commit ", &rest)) {
+				if (get_oid_hex(rest, &meta_commit))
+					die("invalid exact metadata (commit)");
+			} else if (skip_prefix(line.buf, "tree ", &rest)) {
+				if (get_oid_hex(rest, &meta_tree))
+					die("invalid exact metadata (tree)");
+			} else if (skip_prefix(line.buf, "parent ", &rest)) {
+				if (get_oid_hex(rest, &parent))
+					die("invalid exact metadata (parent)");
+
+				commit_list_insert(lookup_commit(the_repository, &parent), &parents);
+			} else if (skip_prefix(line.buf, "author ", &rest)) {
+				author = strdup(rest);
+			} else if (skip_prefix(line.buf, "committer ", &rest)) {
+				char *name_copy;
+				char *email;
+				char *email_copy;
+				char *date;
+
+				email = strstr(rest, " <");
+				if (!email)
+					die("invalid exact metadata (committer name)");
+
+				name_copy = xstrndup(rest, email - rest);
+				email += 2;
+				setenv("GIT_COMMITTER_NAME", name_copy, 1);
+				free(name_copy);
+
+				date = strstr(email, "> ");
+				if (!date)
+					die("invalid exact metadata (committer email)");
+
+				email_copy = xstrndup(email, date - email);
+				date += 2;
+				setenv("GIT_COMMITTER_EMAIL", email_copy, 1);
+				free(email_copy);
+
+				setenv("GIT_COMMITTER_DATE", date, 1);
+			} else if (line.len == 0) {
+				break;
+			} else {
+				die("unknown exact metadata: %.*s", line.len, line.buf);
+			}
+		}
+
+		fclose(fp);
+	}
+
 	if (run_hook_le(NULL, "pre-applypatch", NULL))
 		exit(1);
 
 	if (write_cache_as_tree(&tree, 0, NULL))
 		die(_("git write-tree failed to write a tree"));
 
-	if (!get_oid_commit("HEAD", &parent)) {
+	if (state->exact && !oideq(&tree, &meta_tree))
+		die("tree mismatch");
+
+	if (state->exact) {
+		/*
+		 * Already got parents above.
+		 */
+	} else if (!get_oid_commit("HEAD", &parent)) {
 		old_oid = &parent;
 		commit_list_insert(lookup_commit(the_repository, &parent),
 				   &parents);
@@ -1569,19 +1643,28 @@  static void do_commit(const struct am_state *state)
 		say(state, stderr, _("applying to an empty history"));
 	}
 
-	author = fmt_ident(state->author_name, state->author_email,
-		WANT_AUTHOR_IDENT,
-			state->ignore_date ? NULL : state->author_date,
-			IDENT_STRICT);
-
-	if (state->committer_date_is_author_date)
-		setenv("GIT_COMMITTER_DATE",
-			state->ignore_date ? "" : state->author_date, 1);
+	if (state->exact) {
+		/*
+		 * Already got author above.
+		 */
+	} else {
+		author = fmt_ident(state->author_name, state->author_email,
+			WANT_AUTHOR_IDENT,
+				state->ignore_date ? NULL : state->author_date,
+				IDENT_STRICT);
+
+		if (state->committer_date_is_author_date)
+			setenv("GIT_COMMITTER_DATE",
+				state->ignore_date ? "" : state->author_date, 1);
+	}
 
 	if (commit_tree(state->msg, state->msg_len, &tree, parents, &commit,
 			author, state->sign_commit))
 		die(_("failed to write commit object"));
 
+	if (state->exact && !oideq(&commit, &meta_commit))
+		die("sha1 mismatch");
+
 	reflog_msg = getenv("GIT_REFLOG_ACTION");
 	if (!reflog_msg)
 		reflog_msg = "am";
@@ -2182,6 +2265,8 @@  int cmd_am(int argc, const char **argv, const char *prefix)
 			0, PARSE_OPT_NONEG),
 		OPT_BOOL('c', "scissors", &state.scissors,
 			N_("strip everything before a scissors line")),
+		OPT_BOOL('e', "exact", &state.exact,
+			N_("preserve exact metadata, including sha1")),
 		OPT_PASSTHRU_ARGV(0, "whitespace", &state.git_apply_opts, N_("action"),
 			N_("pass it through git-apply"),
 			0),
-- 
2.23.0.718.g3120370db8