
[09/10] fast-export: add a --show-original-ids option to show original names

Message ID 20181111062312.16342-10-newren@gmail.com (mailing list archive)
State New, archived
Series fast export and import fixes and features

Commit Message

Elijah Newren Nov. 11, 2018, 6:23 a.m. UTC
Knowing the original names (hashes) of commits, blobs, and tags can
sometimes enable post-filtering that would otherwise be difficult or
impossible.  In particular, the desire to rewrite commit messages which
refer to other prior commits (on top of whatever other filtering is
being done) is very difficult without knowing the original names of each
commit.

This commit teaches a new --show-original-ids option to fast-export
which will make it add an 'originally <hash>' line to blobs, commits, and
tags.  It also teaches fast-import to parse (and ignore) such lines.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-fast-export.txt |  7 +++++++
 builtin/fast-export.c             | 20 +++++++++++++++-----
 fast-import.c                     | 17 +++++++++++++++++
 t/t9350-fast-export.sh            | 17 +++++++++++++++++
 4 files changed, 56 insertions(+), 5 deletions(-)

Comments

Jeff King Nov. 11, 2018, 7:20 a.m. UTC | #1
On Sat, Nov 10, 2018 at 10:23:11PM -0800, Elijah Newren wrote:

> Knowing the original names (hashes) of commits, blobs, and tags can
> sometimes enable post-filtering that would otherwise be difficult or
> impossible.  In particular, the desire to rewrite commit messages which
> refer to other prior commits (on top of whatever other filtering is
> being done) is very difficult without knowing the original names of each
> commit.
> 
> This commit teaches a new --show-original-ids option to fast-export
> which will make it add a 'originally <hash>' line to blob, commits, and
> tags.  It also teaches fast-import to parse (and ignore) such lines.

Makes sense as a feature; I think filter-branch can make its mappings
available, too.

Do we need to worry about compatibility with other fast-import programs?
I think no, because this is not enabled by default (so if sending the
extra lines to another importer hurts, the answer is "don't do that").

I have a vague feeling that there might be some way to combine this with
--export-marks or --no-data, but I can't really think of a way. They
seem related, but not quite.

> ---
>  Documentation/git-fast-export.txt |  7 +++++++
>  builtin/fast-export.c             | 20 +++++++++++++++-----
>  fast-import.c                     | 17 +++++++++++++++++
>  t/t9350-fast-export.sh            | 17 +++++++++++++++++
>  4 files changed, 56 insertions(+), 5 deletions(-)

The fast-import format is documented in Documentation/git-fast-import.txt.
It might need an update to cover the new format.

> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -121,6 +121,13 @@ marks the same across runs.
>  	used by a repository which already contains the necessary
>  	parent commits.
>  
> +--show-original-ids::
> +	Add an extra directive to the output for commits and blobs,
> +	`originally <SHA1SUM>`.  While such directives will likely be
> +	ignored by importers such as git-fast-import, it may be useful
> +	for intermediary filters (e.g. for rewriting commit messages
> +	which refer to older commits, or for stripping blobs by id).

I'm not quite sure how a blob ends up being rewritten by fast-export (I
get that commits may change due to dropping parents).

The name "originally" doesn't seem great to me. Probably because I would
continually wonder if it has one "l" or two. ;) Perhaps something like
"original-oid" might be better. That's well into bikeshed territory,
though.

-Peff

Elijah Newren Nov. 11, 2018, 8:32 a.m. UTC | #2
On Sat, Nov 10, 2018 at 11:20 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:11PM -0800, Elijah Newren wrote:
>
> > Knowing the original names (hashes) of commits, blobs, and tags can
> > sometimes enable post-filtering that would otherwise be difficult or
> > impossible.  In particular, the desire to rewrite commit messages which
> > refer to other prior commits (on top of whatever other filtering is
> > being done) is very difficult without knowing the original names of each
> > commit.
> >
> > This commit teaches a new --show-original-ids option to fast-export
> > which will make it add a 'originally <hash>' line to blob, commits, and
> > tags.  It also teaches fast-import to parse (and ignore) such lines.
>
> Makes sense as a feature; I think filter-branch can make its mappings
> available, too.
>
> Do we need to worry about compatibility with other fast-import programs?
> I think no, because this is not enabled by default (so if sending the
> extra lines to another importer hurts, the answer is "don't do that").
>
> I have a vague feeling that there might be some way to combine this with
> --export-marks or --no-data, but I can't really think of a way. They
> seem related, but not quite.
>
> > ---
> >  Documentation/git-fast-export.txt |  7 +++++++
> >  builtin/fast-export.c             | 20 +++++++++++++++-----
> >  fast-import.c                     | 17 +++++++++++++++++
> >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> >  4 files changed, 56 insertions(+), 5 deletions(-)
>
> The fast-import format is documented in Documentation/git-fast-import.txt.
> It might need an update to cover the new format.

We document the format in both fast-import.c and
Documentation/git-fast-import.txt?  Maybe we should delete the long
comments in fast-import.c so this isn't duplicated?

> > --- a/Documentation/git-fast-export.txt
> > +++ b/Documentation/git-fast-export.txt
> > @@ -121,6 +121,13 @@ marks the same across runs.
> >       used by a repository which already contains the necessary
> >       parent commits.
> >
> > +--show-original-ids::
> > +     Add an extra directive to the output for commits and blobs,
> > +     `originally <SHA1SUM>`.  While such directives will likely be
> > +     ignored by importers such as git-fast-import, it may be useful
> > +     for intermediary filters (e.g. for rewriting commit messages
> > +     which refer to older commits, or for stripping blobs by id).
>
> I'm not quite sure how a blob ends up being rewritten by fast-export (I
> get that commits may change due to dropping parents).

It doesn't get rewritten by fast-export; it gets rewritten by other
intermediary filters, e.g. in something like this:

   git fast-export --show-original-ids --all | intermediary_filter | git fast-import

The intermediary_filter program may want to strip out blobs by id, or
remove filemodify and filedelete directives unless they touch certain
paths, etc.
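
To make the intermediary step concrete, here is a minimal sketch (not part of the patch) of the first thing such a filter has to do: find the 'originally' directives in the stream without being fooled by blob or message content. The function names `original_id` and `list_original_ids` are hypothetical, the directive name is the one used in this version of the patch, and the sketch assumes the stream only uses the exact-byte-count `data <n>` form and has no command line longer than the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * If `line` is an "originally <id>" directive, return a pointer to the
 * id (the rest of the line); otherwise return NULL.
 */
static const char *original_id(const char *line)
{
	static const char prefix[] = "originally ";
	size_t n = sizeof(prefix) - 1;

	return strncmp(line, prefix, n) ? NULL : line + n;
}

/*
 * Read a fast-export stream from `in` and write one id per line to
 * `out` for every "originally" directive.  Payloads announced by
 * "data <n>" are skipped byte by byte, so content that merely looks
 * like a directive is never parsed as one.  Returns the number of
 * directives seen.
 */
static int list_original_ids(FILE *in, FILE *out)
{
	char line[4096];
	int count = 0;

	assert(in && out);
	while (fgets(line, sizeof(line), in)) {
		const char *id;
		unsigned long len;

		line[strcspn(line, "\n")] = '\0';	/* strip trailing LF */
		if (sscanf(line, "data %lu", &len) == 1) {
			/* exact byte count: consume the raw payload */
			for (; len > 0; len--)
				if (fgetc(in) == EOF)
					break;
		} else if ((id = original_id(line))) {
			fprintf(out, "%s\n", id);
			count++;
		}
	}
	return count;
}
```

A real filter would call something like `list_original_ids(stdin, stdout)` from `main()` and sit in the middle of the pipeline above; one that strips blobs by id would additionally have to drop the matching blob and any filemodify lines that reference its mark, which this sketch does not attempt.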

> The name "originally" doesn't seem great to me. Probably because I would
> continually wonder if it has one "l" or two. ;) Perhaps something like
> "original-oid" might be better. That's well into bikeshed territory,
> though.

I wasn't a huge fan of "originally" either, but I just couldn't come
up with anything else that wasn't really long.  I'd be happy to switch
to original-oid.

Jeff King Nov. 12, 2018, 12:53 p.m. UTC | #3
On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:

> > >  Documentation/git-fast-export.txt |  7 +++++++
> > >  builtin/fast-export.c             | 20 +++++++++++++++-----
> > >  fast-import.c                     | 17 +++++++++++++++++
> > >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> > >  4 files changed, 56 insertions(+), 5 deletions(-)
> >
> > The fast-import format is documented in Documentation/git-fast-import.txt.
> > It might need an update to cover the new format.
> 
> We document the format in both fast-import.c and
> Documentation/git-fast-import.txt?  Maybe we should delete the long
> comments in fast-import.c so this isn't duplicated?

Yes, that is probably worth doing (see the comment at the top of
fast-import.c). Some information might need to be migrated.

If we're going to have just one spot, I think it needs to be the
user-facing documentation. This is a public interface that other people
are building compatible implementations for (including your new tool).

> > > +--show-original-ids::
> > > +     Add an extra directive to the output for commits and blobs,
> > > +     `originally <SHA1SUM>`.  While such directives will likely be
> > > +     ignored by importers such as git-fast-import, it may be useful
> > > +     for intermediary filters (e.g. for rewriting commit messages
> > > +     which refer to older commits, or for stripping blobs by id).
> >
> > I'm not quite sure how a blob ends up being rewritten by fast-export (I
> > get that commits may change due to dropping parents).
> 
> It doesn't get rewritten by fast-export; it gets rewritten by other
> intermediary filters, e.g. in something like this:
> 
>    git fast-export --show-original-ids --all | intermediary_filter |
> git fast-import
> 
> The intermediary_filter program may want to strip out blobs by id, or
> remove filemodify and filedelete directives unless they touch certain
> paths, etc.

OK, that matches my understanding. So why does fast-export need to print
the blob ids? If the intermediary is rewriting blobs, it can then
produce the "originally" line itself, can't it?

The more interesting case I guess is your "strip out blobs by id"
example. There the intermediary _could_ do so itself, but it would
require recomputing the object id of each blob.

If you use "--no-data", then this just works (we specify tree entries by
object id, rather than by mark). But I can see how it would be useful to
have the information even without "--no-data" (i.e., if you are doing
multiple kinds of rewrites on a single stream).

I think the thing that confused me is that this "originally" is doing
two things:

  - mentioning blob ids as an optimization / convenience for the reader

  - mentioning rewritten commit (and presumably tag?) ids that were
    rewritten as part of a partial history export. I suppose even trees
    could be rewritten that way, too, but fast-import doesn't generally
    consider trees to be a first-class item.

So I'm OK with it, but I wonder if there is an easier way to explain it.

-Peff

Elijah Newren Nov. 12, 2018, 3:46 p.m. UTC | #4
On Mon, Nov 12, 2018 at 4:53 AM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 11, 2018 at 12:32:22AM -0800, Elijah Newren wrote:
>
> > > >  Documentation/git-fast-export.txt |  7 +++++++
> > > >  builtin/fast-export.c             | 20 +++++++++++++++-----
> > > >  fast-import.c                     | 17 +++++++++++++++++
> > > >  t/t9350-fast-export.sh            | 17 +++++++++++++++++
> > > >  4 files changed, 56 insertions(+), 5 deletions(-)
> > >
> > > The fast-import format is documented in Documentation/git-fast-import.txt.
> > > It might need an update to cover the new format.
> >
> > We document the format in both fast-import.c and
> > Documentation/git-fast-import.txt?  Maybe we should delete the long
> > comments in fast-import.c so this isn't duplicated?
>
> Yes, that is probably worth doing (see the comment at the top of
> fast-import.c). Some information might need to be migrated.
>
> If we're going to have just one spot, I think it needs to be the
> user-facing documentation. This is a public interface that other people
> are building compatible implementations for (including your new tool).

Okay, I'll work on that.

> OK, that matches my understanding. So why does fast-export need to print
> the blob ids? If the intermediary is rewriting blobs, it can then
> produce the "originally" line itself, can't it?
>
> The more interesting case I guess is your "strip out blobs by id"
> example. There the intermediary _could_ do so itself, but it would
> require recomputing the object id of each blob.
>
> If you use "--no-data", then this just works (we specify tree entries by
> object id, rather than by mark). But I can see how it would be useful to
> have the information even without "--no-data" (i.e., if you are doing
> multiple kinds of rewrites on a single stream).
>
> I think the thing that confused me is that this "originally" is doing
> two things:
>
>   - mentioning blob ids as an optimization / convenience for the reader
>
>   - mentioning rewritten commit (and presumably tag?) ids that were
>     rewritten as part of a partial history export. I suppose even trees
>     could be rewritten that way, too, but fast-import doesn't generally
>     consider trees to be a first-class item.
>
> So I'm OK with it, but I wonder if there is an easier way to explain it.

Yeah, I started out just needing to add the original oids for commits.
Once I added them there, I wondered whether someone would need them
for tags and blobs too (not trees since fast-import doesn't work with
those).  For blobs, it made sense as a small performance optimization
(when running without --no-data), as you pointed out.  I can't think
of a use for them in tags, but once I've included them in blobs and
commits it felt like I might as well include them there for
completeness.  So maybe my commit message should have been something
more like:

"""
Knowing the original names (hashes) of commits can sometimes enable
post-filtering that would otherwise be difficult or impossible.  In
particular, the desire to rewrite commit messages which refer to other
prior commits (on top of whatever other filtering is being done) is
very difficult without knowing the original names of each commit.

In addition, knowing the original names (hashes) of blobs can allow
filtering by blob-id without requiring re-hashing the content of the
blob, and is thus useful as a small optimization.

Once we add original ids for both commits and blobs, we may as well
add them for tags too for completeness.  Perhaps someone will have a
use for them.

This commit teaches a new --show-original-ids option to fast-export
which will make it add an 'original-oid <hash>' line to blobs, commits,
and tags.  It also teaches fast-import to parse (and ignore) such
lines.
"""

?
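
As an illustration of the commit-message use case (again, not part of the patch): once a filter has recorded each commit's original id from the stream and has somehow learned the id it ended up with after the rewrite, fixing up references in later messages is a plain substitution. The sketch below assumes full-length ids only; `struct id_map` and `rewrite_ids` are hypothetical names, and how the new ids are obtained is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a hypothetical old-id -> new-id table kept by a filter. */
struct id_map {
	const char *old_id;
	const char *new_id;
};

/*
 * Copy `msg` into `out`, replacing every occurrence of a known old id
 * with its new id.  Returns 0 on success, -1 if `out` is too small.
 */
static int rewrite_ids(const char *msg, const struct id_map *map, size_t nr,
		       char *out, size_t outlen)
{
	size_t used = 0;

	assert(msg && out && (map || !nr));
	if (!outlen)
		return -1;
	while (*msg) {
		const char *src = msg;		/* default: copy one byte */
		size_t src_len = 1, advance = 1, i;

		for (i = 0; i < nr; i++) {
			size_t old_len = strlen(map[i].old_id);

			if (old_len && !strncmp(msg, map[i].old_id, old_len)) {
				src = map[i].new_id;
				src_len = strlen(src);
				advance = old_len;
				break;
			}
		}
		if (used + src_len + 1 > outlen)	/* keep room for NUL */
			return -1;
		memcpy(out + used, src, src_len);
		used += src_len;
		msg += advance;
	}
	out[used] = '\0';
	return 0;
}
```

This deliberately ignores abbreviated ids and does not check word boundaries, so an old id embedded in a longer hex string would also be replaced; a real filter would want to be stricter on both counts.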

Jeff King Nov. 12, 2018, 4:31 p.m. UTC | #5
On Mon, Nov 12, 2018 at 07:46:14AM -0800, Elijah Newren wrote:

> So maybe my commit message should have been something
> more like:
> 
> """
> Knowing the original names (hashes) of commits can sometimes enable
> post-filtering that would otherwise be difficult or impossible.  In
> particular, the desire to rewrite commit messages which refer to other
> prior commits (on top of whatever other filtering is being done) is
> very difficult without knowing the original names of each commit.
> 
> In addition, knowing the original names (hashes) of blobs can allow
> filtering by blob-id without requiring re-hashing the content of the
> blob, and is thus useful as a small optimization.
> 
> Once we add original ids for both commits and blobs, we may as well
> add them for tags too for completeness.  Perhaps someone will have a
> use for them.
> 
> This commit teaches a new --show-original-ids option to fast-export
> which will make it add a 'original-oid <hash>' line to blob, commits,
> and tags.  It also teaches fast-import to parse (and ignore) such
> lines.
> """
> 
> ?

Yes, that makes much more sense to me (though of course I've also been
discussing it with you, so just about anything would at this point ;) ).

It's possible that somebody would want to filter on tree id's, too. A
fast-import stream just has trees incidentally as part of commit state,
but we could say something like "by the way, this tree is X". You can
even do "fast-export -t" to see subtrees, though I am not sure if that
is intentional or just an artifact of being based on the diff code.

I guess that is not all that useful, though. I was mostly thinking about
it because of your "we may as well add them for tags too for
completeness" above. But the issues around trees are sufficiently subtle
that we're probably better off not trying to handle them here. There's a
good chance we'd get it wrong, making our "let's just add this for
completeness while we're here" totally backfire.

-Peff

Patch

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 2916096bdd..4e40f0b99a 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -121,6 +121,13 @@  marks the same across runs.
 	used by a repository which already contains the necessary
 	parent commits.
 
+--show-original-ids::
+	Add an extra directive to the output for commits and blobs,
+	`originally <SHA1SUM>`.  While such directives will likely be
+	ignored by importers such as git-fast-import, it may be useful
+	for intermediary filters (e.g. for rewriting commit messages
+	which refer to older commits, or for stripping blobs by id).
+
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
 	be specified.
diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index ea9c5b1c00..cc01dcc90c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -38,6 +38,7 @@  static int use_done_feature;
 static int no_data;
 static int full_tree;
 static int reference_excluded_commits;
+static int show_original_ids;
 static struct string_list extra_refs = STRING_LIST_INIT_NODUP;
 static struct string_list tag_refs = STRING_LIST_INIT_NODUP;
 static struct refspec refspecs = REFSPEC_INIT_FETCH;
@@ -271,7 +272,10 @@  static void export_blob(const struct object_id *oid)
 
 	mark_next_object(object);
 
-	printf("blob\nmark :%"PRIu32"\ndata %lu\n", last_idnum, size);
+	printf("blob\nmark :%"PRIu32"\n", last_idnum);
+	if (show_original_ids)
+		printf("originally %s\n", oid_to_hex(oid));
+	printf("data %lu\n", size);
 	if (size && fwrite(buf, size, 1, stdout) != 1)
 		die_errno("could not write blob '%s'", oid_to_hex(oid));
 	printf("\n");
@@ -628,8 +632,10 @@  static void handle_commit(struct commit *commit, struct rev_info *rev,
 		reencoded = reencode_string(message, "UTF-8", encoding);
 	if (!commit->parents)
 		printf("reset %s\n", refname);
-	printf("commit %s\nmark :%"PRIu32"\n%.*s\n%.*s\ndata %u\n%s",
-	       refname, last_idnum,
+	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
+	if (show_original_ids)
+		printf("originally %s\n", oid_to_hex(&commit->object.oid));
+	printf("%.*s\n%.*s\ndata %u\n%s",
 	       (int)(author_end - author), author,
 	       (int)(committer_end - committer), committer,
 	       (unsigned)(reencoded
@@ -807,8 +813,10 @@  static void handle_tag(const char *name, struct tag *tag)
 
 	if (starts_with(name, "refs/tags/"))
 		name += 10;
-	printf("tag %s\nfrom :%d\n%.*s%sdata %d\n%.*s\n",
-	       name, tagged_mark,
+	printf("tag %s\nfrom :%d\n", name, tagged_mark);
+	if (show_original_ids)
+		printf("originally %s\n", oid_to_hex(&tag->object.oid));
+	printf("%.*s%sdata %d\n%.*s\n",
 	       (int)(tagger_end - tagger), tagger,
 	       tagger == tagger_end ? "" : "\n",
 	       (int)message_size, (int)message_size, message ? message : "");
@@ -1089,6 +1097,8 @@  int cmd_fast_export(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "anonymize", &anonymize, N_("anonymize output")),
 		OPT_BOOL(0, "reference-excluded-parents",
 			 &reference_excluded_commits, N_("Reference parents which are not in fast-export stream by sha1sum")),
+		OPT_BOOL(0, "show-original-ids", &show_original_ids,
+			    N_("Show original sha1sums of blobs/commits")),
 
 		OPT_END()
 	};
diff --git a/fast-import.c b/fast-import.c
index 95600c78e0..232b6a8b8d 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -14,11 +14,13 @@  Format of STDIN stream:
 
   new_blob ::= 'blob' lf
     mark?
+    originally?
     file_content;
   file_content ::= data;
 
   new_commit ::= 'commit' sp ref_str lf
     mark?
+    originally?
     ('author' (sp name)? sp '<' email '>' sp when lf)?
     'committer' (sp name)? sp '<' email '>' sp when lf
     commit_msg
@@ -49,6 +51,7 @@  Format of STDIN stream:
 
   new_tag ::= 'tag' sp tag_str lf
     'from' sp commit-ish lf
+    originally?
     ('tagger' (sp name)? sp '<' email '>' sp when lf)?
     tag_msg;
   tag_msg ::= data;
@@ -73,6 +76,8 @@  Format of STDIN stream:
   data ::= (delimited_data | exact_data)
     lf?;
 
+  originally ::= 'originally' sp not_lf+ lf
+
     # note: delim may be any string but must not contain lf.
     # data_line may contain any data but must not be exactly
     # delim.
@@ -1968,6 +1973,13 @@  static void parse_mark(void)
 		next_mark = 0;
 }
 
+static void parse_original_identifier(void)
+{
+	const char *v;
+	if (skip_prefix(command_buf.buf, "originally ", &v))
+		read_next_command();
+}
+
 static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 {
 	const char *data;
@@ -2110,6 +2122,7 @@  static void parse_new_blob(void)
 {
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	parse_and_store_blob(&last_blob, NULL, next_mark);
 }
 
@@ -2733,6 +2746,7 @@  static void parse_new_commit(const char *arg)
 
 	read_next_command();
 	parse_mark();
+	parse_original_identifier();
 	if (skip_prefix(command_buf.buf, "author ", &v)) {
 		author = parse_ident(v);
 		read_next_command();
@@ -2865,6 +2879,9 @@  static void parse_new_tag(const char *arg)
 		die("Invalid ref name or SHA1 expression: %s", from);
 	read_next_command();
 
+	/* originally ... */
+	parse_original_identifier();
+
 	/* tagger ... */
 	if (skip_prefix(command_buf.buf, "tagger ", &v)) {
 		tagger = parse_ident(v);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index c2f40d6a40..5ad6669910 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -77,6 +77,23 @@  test_expect_success 'fast-export --reference-excluded-parents master~2..master'
 		 test $MASTER = $(git rev-parse --verify refs/heads/rewrite))
 '
 
+test_expect_success 'fast-export --show-original-ids' '
+
+	git fast-export --show-original-ids master >output &&
+	grep ^originally output| sed -e s/^originally.// | sort >actual &&
+	git rev-list --objects master muss >objects-and-names &&
+	awk "{print \$1}" objects-and-names | sort >commits-trees-blobs &&
+	comm -23 actual commits-trees-blobs >unfound &&
+	test_must_be_empty unfound
+'
+
+test_expect_success 'fast-export --show-original-ids | git fast-import' '
+
+	git fast-export --show-original-ids master muss | git fast-import --quiet &&
+	test $MASTER = $(git rev-parse --verify refs/heads/master) &&
+	test $MUSS = $(git rev-parse --verify refs/tags/muss)
+'
+
 test_expect_success 'iso-8859-1' '
 
 	git config i18n.commitencoding ISO8859-1 &&