diff mbox series

fast-export: fix surprising behavior with --first-parent

Message ID pull.1084.git.1637666927224.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series fast-export: fix surprising behavior with --first-parent | expand

Commit Message

William Sprent Nov. 23, 2021, 11:28 a.m. UTC
From: William Sprent <williams@unity3d.com>

When invoking git-fast-export with the --first-parent flag on a branch
with merges, fast-export would early-out on processing the first merge
on the branch. If combined with --reverse, fast-export would instead
output all single parent commits on the branch.

This commit makes fast-export output the same commits as rev-list
--first-parent, and makes --reverse not have an effect on which commits
are output.

The fix involves removing logic within fast-export which was responsible
for ensuring that parents are processed before their children, which was
what was exiting early due to missing second parents. This is replaced
by setting 'reverse = 1' before revision walking, which, in conjuction
with topo_order, allows for delegating the ordering of commits to
revision.c. The reverse flag is set after parsing rev-list arguments to
avoid having it disabled.

Signed-off-by: William Sprent <williams@unity3d.com>
---
    fast-export: fix surprising behavior with --first-parent
    
    Hi,
    
    This is my first time patching git, so I probably need some guidance on
    my approach. :)
    
    I've noticed that git fast-export exhibits some odd behavior when passed
    the --first-parent flag. In the repository I was working on, it would
    only output the initial commit before exiting. Moreover, adding the
    --reverse flag causes it to behave differently and instead output all
    commits in the first parent line that only have one parent. My
    expectation is more or less that git fast-export should output the same
    set of commits as git rev-list would output given the same arguments,
    which matches how it acts when given revision ranges.
    
    It seems like this behavior comes from the fact that has_unshown_parents
    will never be false for any merge commits encountered when fast-export
    is called with --first-parent. This causes the while loop to follow the
    pattern of pushing all commits into the "commits" queue until the
    initial commit is encountered, at which point it calls handle_tail which
    falls through on the first merge commit, causing most of the commits to
    be unhandled.
    
    My impression is that this logic only serves to ensure that parents are
    processed before their children, so in my patch I've opted to fix the
    issue by delegating that responsibility to revision.c by adding the
    reverse flag before performing the revision walk. From what I can see,
    this should be equivalent to what the previous logic is trying to
    achieve, but I can also see that there could be risk in these changes.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1084%2Fwilliams-unity%2Ffast-export%2Ffirst-parent-issues-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1084/williams-unity/fast-export/first-parent-issues-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1084

 builtin/fast-export.c  | 36 ++----------------------------------
 t/t9350-fast-export.sh | 30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 34 deletions(-)


base-commit: cd3e606211bb1cf8bc57f7d76bab98cc17a150bc

Comments

Ævar Arnfjörð Bjarmason Nov. 23, 2021, 1:07 p.m. UTC | #1
On Tue, Nov 23 2021, William Sprent via GitGitGadget wrote:

> From: William Sprent <williams@unity3d.com>
>
> When invoking git-fast-export with the --first-parent flag on a branch
> with merges, fast-export would early-out on processing the first merge
> on the branch. If combined with --reverse, fast-export would instead
> output all single parent commits on the branch.
>
> This commit makes fast-export output the same commits as rev-list
> --first-parent, and makes --reverse not have an effect on which commits
> are output.
>
> The fix involves removing logic within fast-export which was responsible
> for ensuring that parents are processed before their children, which was
> what was exiting early due to missing second parents. This is replaced
> by setting 'reverse = 1' before revision walking, which, in conjuction
> with topo_order, allows for delegating the ordering of commits to
> revision.c. The reverse flag is set after parsing rev-list arguments to
> avoid having it disabled.
>
> Signed-off-by: William Sprent <williams@unity3d.com>
> ---
>     fast-export: fix surprising behavior with --first-parent
>     
>     Hi,
>     
>     This is my first time patching git, so I probably need some guidance on
>     my approach. :)

Hi, thanks for your first contribution to git. This is a rather shallow
review, a deeper one is much deserved.

I notice that you're removing code in builtin/fast-export.c, presumably
we have code in revision.c that does the same thing. It would really
help a reviewer for you to dig a bit into the relevant commit history
and note it in the commit message.

I.e. could revision.c always do this, and this was always needless
duplication, or at time X it was needed, but as of Y revision.c learned
to do this, and callers A, B and C were adjusted, but just not this
missed call D? etc.

> -static int has_unshown_parent(struct commit *commit)
> -{
> -	struct commit_list *parent;
> -
> -	for (parent = commit->parents; parent; parent = parent->next)
> -		if (!(parent->item->object.flags & SHOWN) &&
> -		    !(parent->item->object.flags & UNINTERESTING))
> -			return 1;
> -	return 0;
> -}
> -
>  struct anonymized_entry {
>  	struct hashmap_entry hash;
>  	const char *anon;
> @@ -752,20 +740,6 @@ static char *anonymize_tag(void *data)
>  	return strbuf_detach(&out, NULL);
>  }
>  
> -static void handle_tail(struct object_array *commits, struct rev_info *revs,
> -			struct string_list *paths_of_changed_objects)
> -{
> -	struct commit *commit;
> -	while (commits->nr) {
> -		commit = (struct commit *)object_array_pop(commits);
> -		if (has_unshown_parent(commit)) {
> -			/* Queue again, to be handled later */
> -			add_object_array(&commit->object, NULL, commits);
> -			return;
> -		}
> -		handle_commit(commit, revs, paths_of_changed_objects);
> -	}
> -}

...

>  static void handle_tag(const char *name, struct tag *tag)
>  {
> @@ -1185,7 +1159,6 @@ static int parse_opt_anonymize_map(const struct option *opt,
>  int cmd_fast_export(int argc, const char **argv, const char *prefix)
>  {
>  	struct rev_info revs;
> -	struct object_array commits = OBJECT_ARRAY_INIT;
>  	struct commit *commit;
>  	char *export_filename = NULL,
>  	     *import_filename = NULL,
> @@ -1281,19 +1254,14 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>  
>  	get_tags_and_duplicates(&revs.cmdline);
>  
> +	revs.reverse = 1;

Is the placement of revs.reverse = 1 here important, or could it go
earlier after init_revision_sources() when we assign some other values
ir revs?

>  	if (prepare_revision_walk(&revs))
>  		die("revision walk setup failed");

A light reading of prepare_revision_walk() suggests it could come after,
but maybe I'm entirely wrong.

>  	revs.diffopt.format_callback = show_filemodify;
>  	revs.diffopt.format_callback_data = &paths_of_changed_objects;
>  	revs.diffopt.flags.recursive = 1;
>  	while ((commit = get_revision(&revs))) {
> -		if (has_unshown_parent(commit)) {
> -			add_object_array(&commit->object, NULL, &commits);
> -		}
> -		else {
> -			handle_commit(commit, &revs, &paths_of_changed_objects);
> -			handle_tail(&commits, &revs, &paths_of_changed_objects);
> -		}
> +		handle_commit(commit, &revs, &paths_of_changed_objects);
>  	}
>  

Yay code deletion, good if it works (I didn't check).

Since this is just a one-statement while-loop we can also remove its
braces now.

> +test_expect_success 'fast-export --first-parent outputs all revisions output by revision walk' '
> +	git init first-parent &&
> +	cd first-parent &&

Do any such "cd" in a sub-shell:

	git init x &&
	(
    		cd x &&
                ...
	)
Otherwise the next test after you is going to run in anotherdirectory.

> +	test_commit init &&
> +	git checkout -b topic1 &&
> +	test_commit file2 file2.txt &&
> +	git checkout main &&
> +	git merge topic1 --no-ff &&
> +
> +	git checkout -b topic2 &&
> +	test_commit file3 file3.txt &&
> +	git checkout main &&
> +	git merge topic2 --no-ff &&

Just a nit. I'd use "test_commit A", "test_commit B" etc. when the
filenames etc. aren't important. There's no subsequent reference here,
so I assume they're not.

> +	test_commit branch-head &&
> +
> +	git rev-list --format="%ad%B" --first-parent --topo-order --no-commit-header main > expected &&

nit; >expected, not > expected is the usual style.
> +
> +	git fast-export main -- --first-parent > first-parent-export &&
> +	git fast-export main -- --first-parent --reverse > first-parent-reverse-export &&

ditto:

> +	git init import && cd import &&

ditto earlier "cd" comment.

> +	cat ../first-parent-export | git fast-import &&

Instead of "cat x | prog" do "prog <x".

> +	git rev-list --format="%ad%B" --topo-order --all --no-commit-header > actual &&

> +	test $(git rev-list --all | wc -l) -eq 4 &&

Instead:

    git rev-list --all >tmp &&
    test_line_count = 4 tmp

(for some value of tmp)

> +	test_cmp ../expected actual &&
> +	test_cmp ../first-parent-export ../first-parent-reverse-export
> +'

Maybe some of the CD-ing around here wouldu be easier by not doing that
and instead running e.g.:

    git -C subdir fast-export >file-not-in-subdir &&
    ...
Elijah Newren Nov. 23, 2021, 7:14 p.m. UTC | #2
On Tue, Nov 23, 2021 at 5:25 AM William Sprent via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: William Sprent <williams@unity3d.com>
>
> When invoking git-fast-export with the --first-parent flag on a branch
> with merges, fast-export would early-out on processing the first merge
> on the branch. If combined with --reverse, fast-export would instead
> output all single parent commits on the branch.
>
> This commit makes fast-export output the same commits as rev-list
> --first-parent, and makes --reverse not have an effect on which commits
> are output.
>
> The fix involves removing logic within fast-export which was responsible
> for ensuring that parents are processed before their children, which was
> what was exiting early due to missing second parents. This is replaced
> by setting 'reverse = 1' before revision walking, which, in conjuction
> with topo_order, allows for delegating the ordering of commits to
> revision.c. The reverse flag is set after parsing rev-list arguments to
> avoid having it disabled.
>
> Signed-off-by: William Sprent <williams@unity3d.com>
> ---
>     fast-export: fix surprising behavior with --first-parent
>
>     Hi,
>
>     This is my first time patching git, so I probably need some guidance on
>     my approach. :)
>
>     I've noticed that git fast-export exhibits some odd behavior when passed
>     the --first-parent flag. In the repository I was working on, it would
>     only output the initial commit before exiting. Moreover, adding the
>     --reverse flag causes it to behave differently and instead output all
>     commits in the first parent line that only have one parent. My
>     expectation is more or less that git fast-export should output the same
>     set of commits as git rev-list would output given the same arguments,
>     which matches how it acts when given revision ranges.
>
>     It seems like this behavior comes from the fact that has_unshown_parents
>     will never be false for any merge commits encountered when fast-export
>     is called with --first-parent. This causes the while loop to follow the
>     pattern of pushing all commits into the "commits" queue until the
>     initial commit is encountered, at which point it calls handle_tail which
>     falls through on the first merge commit, causing most of the commits to
>     be unhandled.
>
>     My impression is that this logic only serves to ensure that parents are
>     processed before their children, so in my patch I've opted to fix the
>     issue by delegating that responsibility to revision.c by adding the
>     reverse flag before performing the revision walk. From what I can see,
>     this should be equivalent to what the previous logic is trying to
>     achieve, but I can also see that there could be risk in these changes.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1084%2Fwilliams-unity%2Ffast-export%2Ffirst-parent-issues-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1084/williams-unity/fast-export/first-parent-issues-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1084
>
>  builtin/fast-export.c  | 36 ++----------------------------------
>  t/t9350-fast-export.sh | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 32 insertions(+), 34 deletions(-)
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 8e2caf72819..50f8c224b6e 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -107,18 +107,6 @@ static int parse_opt_reencode_mode(const struct option *opt,
>
>  static struct decoration idnums;
>  static uint32_t last_idnum;
> -
> -static int has_unshown_parent(struct commit *commit)
> -{
> -       struct commit_list *parent;
> -
> -       for (parent = commit->parents; parent; parent = parent->next)
> -               if (!(parent->item->object.flags & SHOWN) &&
> -                   !(parent->item->object.flags & UNINTERESTING))
> -                       return 1;
> -       return 0;
> -}
> -
>  struct anonymized_entry {
>         struct hashmap_entry hash;
>         const char *anon;
> @@ -752,20 +740,6 @@ static char *anonymize_tag(void *data)
>         return strbuf_detach(&out, NULL);
>  }
>
> -static void handle_tail(struct object_array *commits, struct rev_info *revs,
> -                       struct string_list *paths_of_changed_objects)
> -{
> -       struct commit *commit;
> -       while (commits->nr) {
> -               commit = (struct commit *)object_array_pop(commits);
> -               if (has_unshown_parent(commit)) {
> -                       /* Queue again, to be handled later */
> -                       add_object_array(&commit->object, NULL, commits);
> -                       return;
> -               }
> -               handle_commit(commit, revs, paths_of_changed_objects);
> -       }
> -}
>
>  static void handle_tag(const char *name, struct tag *tag)
>  {
> @@ -1185,7 +1159,6 @@ static int parse_opt_anonymize_map(const struct option *opt,
>  int cmd_fast_export(int argc, const char **argv, const char *prefix)
>  {
>         struct rev_info revs;
> -       struct object_array commits = OBJECT_ARRAY_INIT;
>         struct commit *commit;
>         char *export_filename = NULL,
>              *import_filename = NULL,
> @@ -1281,19 +1254,14 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>
>         get_tags_and_duplicates(&revs.cmdline);
>
> +       revs.reverse = 1;
>         if (prepare_revision_walk(&revs))
>                 die("revision walk setup failed");
>         revs.diffopt.format_callback = show_filemodify;
>         revs.diffopt.format_callback_data = &paths_of_changed_objects;
>         revs.diffopt.flags.recursive = 1;
>         while ((commit = get_revision(&revs))) {
> -               if (has_unshown_parent(commit)) {
> -                       add_object_array(&commit->object, NULL, &commits);
> -               }
> -               else {
> -                       handle_commit(commit, &revs, &paths_of_changed_objects);
> -                       handle_tail(&commits, &revs, &paths_of_changed_objects);
> -               }
> +               handle_commit(commit, &revs, &paths_of_changed_objects);

Wow, interesting.  I did some related work that I never submitted at
https://github.com/newren/git/commit/08f799b4667de1c755c78e1ea1657f946b588556
-- in that commit, I also noticed that fast-export did not seem
careful enough about making sure to process all commits, and came up
with a much different fix.  However, in my case, I didn't know how to
trigger the problems in fast-export without some additional changes I
was trying to make, and since I never got those other changes working,
I didn't think it was worth submitting my fix.  My solution is more
complex, in part because it was designed to handle ordering
requirements & recommendations other than just ancestry.  However, we
don't currently have any additional ordering requirements (the current
code only considers ancestry), so your solution is much simpler.  If I
do at some point get my other changes working, I'd need to
re-introduce the queue and handle_tail() and my changes to these, but
that can always be done later if and when I get those other changes
working.

Interestingly, Dscho added both the --reverse ability to revision.c
and the commits object_array and the handle_tail() stuff in
fast-export.c.  Both were done in the same year, and --reverse came
first.  See 9c5e66e97da8 (Teach revision machinery about --reverse,
2007-01-20) and f2dc849e9c5f (Add 'git fast-export', the sister of
'git fast-import', 2007-12-02).  It's not clear why revs.reverse = 1
wasn't used previously, although it certainly didn't occur to me
either and I think it would have been an alternative solution to my
first ever git.git contribution -- 784f8affe4df (fast-export: ensure
we traverse commits in topological order, 2009-02-10).  (Though
rev.reverse = 1 wouldn't have provided the same speedups that
rev.topo_order=1 provided.)

I can't think of any cases where this would cause any problems, and it
seems like using rev.reverse is a pretty clean solution.  Just in case
I'm missing something, though, it would be nice to get Dscho to also
comment on this.  I'm cc'ing him.

(Also, I see that Ævar has already provided some good feedback on the
testcase, so I'll stop here.)

>         }
>
>         handle_tags_and_duplicates(&extra_refs);
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index 409b48e2442..bd08101b81d 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -750,4 +750,34 @@ test_expect_success 'merge commit gets exported with --import-marks' '
>         )
>  '
>
> +
> +test_expect_success 'fast-export --first-parent outputs all revisions output by revision walk' '
> +       git init first-parent &&
> +       cd first-parent &&
> +       test_commit init &&
> +       git checkout -b topic1 &&
> +       test_commit file2 file2.txt &&
> +       git checkout main &&
> +       git merge topic1 --no-ff &&
> +
> +       git checkout -b topic2 &&
> +       test_commit file3 file3.txt &&
> +       git checkout main &&
> +       git merge topic2 --no-ff &&
> +
> +       test_commit branch-head &&
> +
> +       git rev-list --format="%ad%B" --first-parent --topo-order --no-commit-header main > expected &&
> +
> +       git fast-export main -- --first-parent > first-parent-export &&
> +       git fast-export main -- --first-parent --reverse > first-parent-reverse-export &&
> +       git init import && cd import &&
> +       cat ../first-parent-export | git fast-import &&
> +
> +       git rev-list --format="%ad%B" --topo-order --all --no-commit-header > actual &&
> +       test $(git rev-list --all | wc -l) -eq 4 &&
> +       test_cmp ../expected actual &&
> +       test_cmp ../first-parent-export ../first-parent-reverse-export
> +'
> +
>  test_done
>
> base-commit: cd3e606211bb1cf8bc57f7d76bab98cc17a150bc
> --
> gitgitgadget
Junio C Hamano Nov. 24, 2021, 12:41 a.m. UTC | #3
"William Sprent via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: William Sprent <williams@unity3d.com>
>
> When invoking git-fast-export with the --first-parent flag on a branch
> with merges, fast-export would early-out on processing the first merge
> on the branch. If combined with --reverse, fast-export would instead
> output all single parent commits on the branch.

I do not doubt we would want to make the command behave sensibly
with all options it accepts, but let me first ask a more basic and
possibly stupid question.

What is "git fast-export --first-parent <options>" supposed to do
differently from "git fast-export <options>" (with the same set of
options other than "--first-parent")?  Should it omit merge commits
altogether, pretending that the first single-parent ancestor it
finds on the first parent chain is a direct parent of a
single-parent descendant, e.g. if the real history were with two
single-parente commits A and B, with two merges M and N, on the
mainline, making the resulting commits into a single strand of two
pearls, with A and B before and after the rewrite to have the same
tree objects?

    ---A---M---N---B             ---A---B
          /   /           ==>
         X   Y

Or should it pretend merge commits have only their first parent as
their parents, i.e.

    ---A---M---N---B             ---A---M---N---B
          /   /           ==>
         X   Y

"git fast-export --help" does not even mention "--first-parent" and
pretend that any and all [<git-rev-list-args>...] are valid requests
to make to the command, but I am wondering if that is what we intend
to support in the first place.  In builtin/fast-export.c, I do not
see any attempt to do anything differently when "--first-parent" is
requested.  Perhaps we shouldn't be even taking "--first-parent" as
an option to begin with.

The "--reverse" feels even more iffy.  Are we reversing the history
with such an export, i.e. pretending that parents are children and
merges are forks?

    ---A---M---N---B             B---N---M---A---
          /   /           ==>         \   \
         X   Y                         X   Y

Or are we supposed to produce the same history in the end, just
spewing older commits first in the output stream?  I am not sure
what purpose such a request serves---the "fast-import" downstream
would need the same set of objects before it can create each commit
anyway, so I am not sure what the point of giving "--reverse" is.

If there is no sensible interpretation for some of the options that
are valid in rev-list in the context of "fast-export" command, should
we just error out when we parse the command line, instead of producing
nonsense output stream, I wonder.
William Sprent Nov. 24, 2021, 11:57 a.m. UTC | #4
On 23/11/2021 14.07, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Nov 23 2021, William Sprent via GitGitGadget wrote:
> 
>> From: William Sprent <williams@unity3d.com>
>>
>> When invoking git-fast-export with the --first-parent flag on a branch
>> with merges, fast-export would early-out on processing the first merge
>> on the branch. If combined with --reverse, fast-export would instead
>> output all single parent commits on the branch.
>>
>> This commit makes fast-export output the same commits as rev-list
>> --first-parent, and makes --reverse not have an effect on which commits
>> are output.
>>
>> The fix involves removing logic within fast-export which was responsible
>> for ensuring that parents are processed before their children, which was
>> what was exiting early due to missing second parents. This is replaced
>> by setting 'reverse = 1' before revision walking, which, in conjuction
>> with topo_order, allows for delegating the ordering of commits to
>> revision.c. The reverse flag is set after parsing rev-list arguments to
>> avoid having it disabled.
>>
>> Signed-off-by: William Sprent <williams@unity3d.com>
>> ---
>>      fast-export: fix surprising behavior with --first-parent
>>      
>>      Hi,
>>      
>>      This is my first time patching git, so I probably need some guidance on
>>      my approach. :)
> 
> Hi, thanks for your first contribution to git. This is a rather shallow
> review, a deeper one is much deserved.
> > I notice that you're removing code in builtin/fast-export.c, presumably
> we have code in revision.c that does the same thing. It would really
> help a reviewer for you to dig a bit into the relevant commit history
> and note it in the commit message.
> 
> I.e. could revision.c always do this, and this was always needless
> duplication, or at time X it was needed, but as of Y revision.c learned
> to do this, and callers A, B and C were adjusted, but just not this
> missed call D? etc.
> 

That's a really good suggestion. I should've done that. I did dig just 
enough to see that the logic has been around since fast-export was 
introduced, but I didn't check whether the 'reverse' option was part of 
revision.c at that point. I see that Elijah has done that homework for 
me later in this thread and discovered that --reverse was introduce a 
year or so before fast-export though.

>> -static int has_unshown_parent(struct commit *commit)
>> -{
>> -	struct commit_list *parent;
>> -
>> -	for (parent = commit->parents; parent; parent = parent->next)
>> -		if (!(parent->item->object.flags & SHOWN) &&
>> -		    !(parent->item->object.flags & UNINTERESTING))
>> -			return 1;
>> -	return 0;
>> -}
>> -
>>   struct anonymized_entry {
>>   	struct hashmap_entry hash;
>>   	const char *anon;
>> @@ -752,20 +740,6 @@ static char *anonymize_tag(void *data)
>>   	return strbuf_detach(&out, NULL);
>>   }
>>   
>> -static void handle_tail(struct object_array *commits, struct rev_info *revs,
>> -			struct string_list *paths_of_changed_objects)
>> -{
>> -	struct commit *commit;
>> -	while (commits->nr) {
>> -		commit = (struct commit *)object_array_pop(commits);
>> -		if (has_unshown_parent(commit)) {
>> -			/* Queue again, to be handled later */
>> -			add_object_array(&commit->object, NULL, commits);
>> -			return;
>> -		}
>> -		handle_commit(commit, revs, paths_of_changed_objects);
>> -	}
>> -}
> 
> ...
> 
>>   static void handle_tag(const char *name, struct tag *tag)
>>   {
>> @@ -1185,7 +1159,6 @@ static int parse_opt_anonymize_map(const struct option *opt,
>>   int cmd_fast_export(int argc, const char **argv, const char *prefix)
>>   {
>>   	struct rev_info revs;
>> -	struct object_array commits = OBJECT_ARRAY_INIT;
>>   	struct commit *commit;
>>   	char *export_filename = NULL,
>>   	     *import_filename = NULL,
>> @@ -1281,19 +1254,14 @@ int cmd_fast_export(int argc, const char **argv, const char *prefix)
>>   
>>   	get_tags_and_duplicates(&revs.cmdline);
>>   
>> +	revs.reverse = 1;
> 
> Is the placement of revs.reverse = 1 here important, or could it go
> earlier after init_revision_sources() when we assign some other values
> ir revs?
> 
>>   	if (prepare_revision_walk(&revs))
>>   		die("revision walk setup failed");
> 
> A light reading of prepare_revision_walk() suggests it could come after,
> but maybe I'm entirely wrong.

It needs to go after the setup_revisions() call as otherwise a --reverse 
passed to fast-export will toggle the option off again. You are right in 
that it can be moved down. I've done that in my working directory for 
the next patch.

Another option would be to move the revs.reverse up as you suggest and 
then then call die() if it was toggled off by setup_revisions(). The 
only downside I can think of is that it would make any current usage of 
'fast-export --reverse' go from a no-op to an error.

>>   	revs.diffopt.format_callback = show_filemodify;
>>   	revs.diffopt.format_callback_data = &paths_of_changed_objects;
>>   	revs.diffopt.flags.recursive = 1;
>>   	while ((commit = get_revision(&revs))) {
>> -		if (has_unshown_parent(commit)) {
>> -			add_object_array(&commit->object, NULL, &commits);
>> -		}
>> -		else {
>> -			handle_commit(commit, &revs, &paths_of_changed_objects);
>> -			handle_tail(&commits, &revs, &paths_of_changed_objects);
>> -		}
>> +		handle_commit(commit, &revs, &paths_of_changed_objects);
>>   	}
>>   
> 
> Yay code deletion, good if it works (I didn't check).
> 
> Since this is just a one-statement while-loop we can also remove its
> braces now.
>  >> +test_expect_success 'fast-export --first-parent outputs all 
revisions output by revision walk' '
>> +	git init first-parent &&
>> +	cd first-parent &&
> 
> Do any such "cd" in a sub-shell:
> 
> 	git init x &&
> 	(
>      		cd x &&
>                  ...
> 	)
> Otherwise the next test after you is going to run in anotherdirectory.
> 
>> +	test_commit init &&
>> +	git checkout -b topic1 &&
>> +	test_commit file2 file2.txt &&
>> +	git checkout main &&
>> +	git merge topic1 --no-ff &&
>> +
>> +	git checkout -b topic2 &&
>> +	test_commit file3 file3.txt &&
>> +	git checkout main &&
>> +	git merge topic2 --no-ff &&
> 
> Just a nit. I'd use "test_commit A", "test_commit B" etc. when the
> filenames etc. aren't important. There's no subsequent reference here,
> so I assume they're not.
> 
>> +	test_commit branch-head &&
>> +
>> +	git rev-list --format="%ad%B" --first-parent --topo-order --no-commit-header main > expected &&
> 
> nit; >expected, not > expected is the usual style.
>> +
>> +	git fast-export main -- --first-parent > first-parent-export &&
>> +	git fast-export main -- --first-parent --reverse > first-parent-reverse-export &&
> 
> ditto:
> 
>> +	git init import && cd import &&
> 
> ditto earlier "cd" comment.
> 
>> +	cat ../first-parent-export | git fast-import &&
> 
> Instead of "cat x | prog" do "prog <x".
> 
>> +	git rev-list --format="%ad%B" --topo-order --all --no-commit-header > actual &&
> 
>> +	test $(git rev-list --all | wc -l) -eq 4 &&
> 
> Instead:
> 
>      git rev-list --all >tmp &&
>      test_line_count = 4 tmp
> 
> (for some value of tmp)
> 
>> +	test_cmp ../expected actual &&
>> +	test_cmp ../first-parent-export ../first-parent-reverse-export
>> +'
> 
> Maybe some of the CD-ing around here wouldu be easier by not doing that
> and instead running e.g.:
> 
>      git -C subdir fast-export >file-not-in-subdir &&
>      ...
> 

Thanks for taking the time to give your feedback. :) I'll remove those 
braces from the while loop and incorporate your comments about the test 
for v2.
William Sprent Nov. 24, 2021, 1:05 p.m. UTC | #5
On 24/11/2021 01.41, Junio C Hamano wrote:
> "William Sprent via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: William Sprent <williams@unity3d.com>
>>
>> When invoking git-fast-export with the --first-parent flag on a branch
>> with merges, fast-export would early-out on processing the first merge
>> on the branch. If combined with --reverse, fast-export would instead
>> output all single parent commits on the branch.
> 
> I do not doubt we would want to make the command behave sensibly
> with all options it accepts, but let me first ask a more basic and
> possibly stupid question.
> 
> What is "git fast-export --first-parent <options>" supposed to do
> differently from "git fast-export <options>" (with the same set of
> options other than "--first-parent")?  Should it omit merge commits
> altogether, pretending that the first single-parent ancestor it
> finds on the first parent chain is a direct parent of a
> single-parent descendant, e.g. if the real history were with two
> single-parente commits A and B, with two merges M and N, on the
> mainline, making the resulting commits into a single strand of two
> pearls, with A and B before and after the rewrite to have the same
> tree objects?
> 
>      ---A---M---N---B             ---A---B
>            /   /           ==>
>           X   Y
> 
> Or should it pretend merge commits have only their first parent as
> their parents, i.e.
> 
>      ---A---M---N---B             ---A---M---N---B
>            /   /           ==>
>           X   Y
> 
> "git fast-export --help" does not even mention "--first-parent" and
> pretend that any and all [<git-rev-list-args>...] are valid requests
> to make to the command, but I am wondering if that is what we intend
> to support in the first place.  In builtin/fast-export.c, I do not
> see any attempt to do anything differently when "--first-parent" is
> requested.  Perhaps we shouldn't be even taking "--first-parent" as
> an option to begin with.
> The "--reverse" feels even more iffy.  Are we reversing the history
> with such an export, i.e. pretending that parents are children and
> merges are forks?
> 
>      ---A---M---N---B             B---N---M---A---
>            /   /           ==>         \   \
>           X   Y                         X   Y
> 
> Or are we supposed to produce the same history in the end, just
> spewing older commits first in the output stream?  I am not sure
> what purpose such a request serves---the "fast-import" downstream
> would need the same set of objects before it can create each commit
> anyway, so I am not sure what the point of giving "--reverse" is.
> 
> If there is no sensible interpretation for some of the options that
> are valid in rev-list in the context of "fast-export" command, should
> we just error out when we parse the command line, instead of producing
> nonsense output stream, I wonder.
> 


I agree with the concerns. I just skimmed the list of flags that git 
rev-list take, and I'm pretty sure that there are both flags that don't 
make sense at all in the context of fast-export, and that there are 
flags where it is unclear what the behavior of fast-export would be when 
passed.

However, I do think that having git fast-export support history limiting 
is useful. And I also think that the workflow of crafting a git rev-list 
command (perhaps using --graph) which outputs the part of history you 
want, and then applying it to git fast-export is fairly straight 
forward. But I also agree that "git fast-export --reverse" is nonsense.

I've thought about this a bit, and I wonder if having "git fast-export" 
accept revisions on stdin in a similar format as "git rev-list 
--parents" outputs would be an API that would be flexible enough, but 
without the oddities of allowing all rev-list flags. Or maybe there 
should be a list of acceptable rev-list flags which fast-export should 
accept. I don't really know.

For the specific question of what "--first-parent" should output, my 
thinking is that I would expect "git fast-export --first-parent" to 
output the same set of commits as "git rev-list --first-parent", which 
would be latter of your examples, i.e.

      ---A---M---N---B

Similarly, I guess if the user wanted

      ---A---B

then they could pass "--no-merges" as well, which would leave out the 
merge commits.

With regards to this patch in particular, we now overrides any 
"--reverse" flag that the user passes, so I can make "git fast-export 
--reverse" cause an error while I'm at it, if that is desired behavior.
William Sprent Nov. 24, 2021, 1:05 p.m. UTC | #6
On 23/11/2021 20.14, Elijah Newren wrote:
> On Tue, Nov 23, 2021 at 5:25 AM William Sprent via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: William Sprent <williams@unity3d.com>
>> ...
> 
> Wow, interesting.  I did some related work that I never submitted at
> https://github.com/newren/git/commit/08f799b4667de1c755c78e1ea1657f946b588556
> -- in that commit, I also noticed that fast-export did not seem
> careful enough about making sure to process all commits, and came up
> with a much different fix.  However, in my case, I didn't know how to
> trigger the problems in fast-export without some additional changes I
> was trying to make, and since I never got those other changes working,
> I didn't think it was worth submitting my fix.  My solution is more
> complex, in part because it was designed to handle ordering
> requirements & recommendations other than just ancestry.  However, we
> don't currently have any additional ordering requirements (the current
> code only considers ancestry), so your solution is much simpler.  If I
> do at some point get my other changes working, I'd need to
> re-introduce the queue and handle_tail() and my changes to these, but
> that can always be done later if and when I get those other changes
> working.
> 

Ah that is interesting. Good to know. I was happy that I could make 
fast-export rely on revision.c for ordering. At surface level, with the 
minimal amount of insight I have, that seems like an alright separation 
of concerns.

> Interestingly, Dscho added both the --reverse ability to revision.c
> and the commits object_array and the handle_tail() stuff in
> fast-export.c.  Both were done in the same year, and --reverse came
> first.  See 9c5e66e97da8 (Teach revision machinery about --reverse,
> 2007-01-20) and f2dc849e9c5f (Add 'git fast-export', the sister of
> 'git fast-import', 2007-12-02).  It's not clear why revs.reverse = 1
> wasn't used previously, although it certainly didn't occur to me
> either and I think it would have been an alternative solution to my
> first ever git.git contribution -- 784f8affe4df (fast-export: ensure
> we traverse commits in topological order, 2009-02-10).  (Though
> rev.reverse = 1 wouldn't have provided the same speedups that
> rev.topo_order=1 provided.)

Ah. Thanks for the insight.

> I can't think of any cases where this would cause any problems, and it
> seems like using rev.reverse is a pretty clean solution.  Just in case
> I'm missing something, though, it would be nice to get Dscho to also
> comment on this.  I'm cc'ing him.

Thanks. Good idea.
diff mbox series

Patch

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index 8e2caf72819..50f8c224b6e 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -107,18 +107,6 @@  static int parse_opt_reencode_mode(const struct option *opt,
 
 static struct decoration idnums;
 static uint32_t last_idnum;
-
-static int has_unshown_parent(struct commit *commit)
-{
-	struct commit_list *parent;
-
-	for (parent = commit->parents; parent; parent = parent->next)
-		if (!(parent->item->object.flags & SHOWN) &&
-		    !(parent->item->object.flags & UNINTERESTING))
-			return 1;
-	return 0;
-}
-
 struct anonymized_entry {
 	struct hashmap_entry hash;
 	const char *anon;
@@ -752,20 +740,6 @@  static char *anonymize_tag(void *data)
 	return strbuf_detach(&out, NULL);
 }
 
-static void handle_tail(struct object_array *commits, struct rev_info *revs,
-			struct string_list *paths_of_changed_objects)
-{
-	struct commit *commit;
-	while (commits->nr) {
-		commit = (struct commit *)object_array_pop(commits);
-		if (has_unshown_parent(commit)) {
-			/* Queue again, to be handled later */
-			add_object_array(&commit->object, NULL, commits);
-			return;
-		}
-		handle_commit(commit, revs, paths_of_changed_objects);
-	}
-}
 
 static void handle_tag(const char *name, struct tag *tag)
 {
@@ -1185,7 +1159,6 @@  static int parse_opt_anonymize_map(const struct option *opt,
 int cmd_fast_export(int argc, const char **argv, const char *prefix)
 {
 	struct rev_info revs;
-	struct object_array commits = OBJECT_ARRAY_INIT;
 	struct commit *commit;
 	char *export_filename = NULL,
 	     *import_filename = NULL,
@@ -1281,19 +1254,14 @@  int cmd_fast_export(int argc, const char **argv, const char *prefix)
 
 	get_tags_and_duplicates(&revs.cmdline);
 
+	revs.reverse = 1;
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	revs.diffopt.format_callback = show_filemodify;
 	revs.diffopt.format_callback_data = &paths_of_changed_objects;
 	revs.diffopt.flags.recursive = 1;
 	while ((commit = get_revision(&revs))) {
-		if (has_unshown_parent(commit)) {
-			add_object_array(&commit->object, NULL, &commits);
-		}
-		else {
-			handle_commit(commit, &revs, &paths_of_changed_objects);
-			handle_tail(&commits, &revs, &paths_of_changed_objects);
-		}
+		handle_commit(commit, &revs, &paths_of_changed_objects);
 	}
 
 	handle_tags_and_duplicates(&extra_refs);
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 409b48e2442..bd08101b81d 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -750,4 +750,34 @@  test_expect_success 'merge commit gets exported with --import-marks' '
 	)
 '
 
+
+test_expect_success 'fast-export --first-parent outputs all revisions output by revision walk' '
+	git init first-parent &&
+	cd first-parent &&
+	test_commit init &&
+	git checkout -b topic1 &&
+	test_commit file2 file2.txt &&
+	git checkout main &&
+	git merge topic1 --no-ff &&
+
+	git checkout -b topic2 &&
+	test_commit file3 file3.txt &&
+	git checkout main &&
+	git merge topic2 --no-ff &&
+
+	test_commit branch-head &&
+
+	git rev-list --format="%ad%B" --first-parent --topo-order --no-commit-header main > expected &&
+
+	git fast-export main -- --first-parent > first-parent-export &&
+	git fast-export main -- --first-parent --reverse > first-parent-reverse-export &&
+	git init import && cd import &&
+	cat ../first-parent-export | git fast-import &&
+
+	git rev-list --format="%ad%B" --topo-order --all --no-commit-header > actual &&
+	test $(git rev-list --all | wc -l) -eq 4 &&
+	test_cmp ../expected actual &&
+	test_cmp ../first-parent-export ../first-parent-reverse-export
+'
+
 test_done