[06/10] fast-export: when using paths, avoid corrupt stream with non-existent mark
diff mbox series

Message ID 20181111062312.16342-7-newren@gmail.com
State New
Headers show
Series
  • fast export and import fixes and features
Related show

Commit Message

Elijah Newren Nov. 11, 2018, 6:23 a.m. UTC
If file paths are specified to fast-export and multiple refs point to a
commit that does not touch any of the relevant file paths, then
fast-export can hit problems.  fast-export has a list of additional refs
that it needs to explicitly set after exporting all blobs and commits,
and when it tries to get_object_mark() on the relevant commit, it can
get a mark of 0, i.e. "not found", because the commit in question did
not touch the relevant paths and thus was not exported.  Trying to
import a stream with a mark corresponding to an unexported object will
cause fast-import to crash.

Avoid this problem by taking the commit the ref points to and finding an
ancestor of it that was exported, and make the ref point to that commit
instead.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 builtin/fast-export.c  | 13 ++++++++++++-
 t/t9350-fast-export.sh | 24 ++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)

Comments

Jeff King Nov. 11, 2018, 6:53 a.m. UTC | #1
On Sat, Nov 10, 2018 at 10:23:08PM -0800, Elijah Newren wrote:

> If file paths are specified to fast-export and multiple refs point to a
> commit that does not touch any of the relevant file paths, then
> fast-export can hit problems.  fast-export has a list of additional refs
> that it needs to explicitly set after exporting all blobs and commits,
> and when it tries to get_object_mark() on the relevant commit, it can
> get a mark of 0, i.e. "not found", because the commit in question did
> not touch the relevant paths and thus was not exported.  Trying to
> import a stream with a mark corresponding to an unexported object will
> cause fast-import to crash.
> 
> Avoid this problem by taking the commit the ref points to and finding an
> ancestor of it that was exported, and make the ref point to that commit
> instead.

As with the earlier tag commit, I wonder if this might depend on the
context in which you're using fast-export. I suppose that if you did not
feed the ref on the command line that we would not be dealing with it at
all (and maybe that is the answer to my question about the tag thing,
too).

It does seem funny that the behavior for the earlier case (bounded
commits) and this case (skipping some commits) are different. Would you
ever want to keep walking backwards to find an ancestor in the earlier
case? Or vice versa, would you ever want to simply delete a tag in a
case like this one?

I'm not sure sure, but I suspect you may have thought about it a lot
harder than I have. :)

> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index a3c044b0af..5648a8ce9c 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -900,7 +900,18 @@ static void handle_tags_and_duplicates(void)
>  			if (anonymize)
>  				name = anonymize_refname(name);
>  			/* create refs pointing to already seen commits */
> -			commit = (struct commit *)object;
> +			commit = rewrite_commit((struct commit *)object);
> +			if (!commit) {
> +				/*
> +				 * Neither this object nor any of its
> +				 * ancestors touch any relevant paths, so
> +				 * it has been filtered to nothing.  Delete
> +				 * it.
> +				 */
> +				printf("reset %s\nfrom %s\n\n",
> +				       name, sha1_to_hex(null_sha1));
> +				continue;
> +			}

This hunk makes sense.

> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -386,6 +386,30 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
>  	grep file0 actual
>  '
>  
> +test_expect_success 'avoid corrupt stream with non-existent mark' '
> +	test_create_repo avoid_non_existent_mark &&
> +	(
> +		cd avoid_non_existent_mark &&
> +
> +		touch important-path &&
> +		git add important-path &&
> +		test_commit initial &&
> +
> +		touch ignored &&
> +		git add ignored &&
> +		test_commit whatever &&
> +
> +		git branch A &&
> +		git branch B &&
> +
> +		echo foo >>important-path &&
> +		git add important-path &&
> +		test_commit more changes &&
> +
> +		git fast-export --all -- important-path | git fast-import --force
> +	)
> +'

Similar comments apply about "touch" and "test_commit" to what I wrote
for the earlier patch.

-Peff
Elijah Newren Nov. 11, 2018, 8:01 a.m. UTC | #2
On Sat, Nov 10, 2018 at 10:53 PM Jeff King <peff@peff.net> wrote:
>
> On Sat, Nov 10, 2018 at 10:23:08PM -0800, Elijah Newren wrote:
>
> > If file paths are specified to fast-export and multiple refs point to a
> > commit that does not touch any of the relevant file paths, then
> > fast-export can hit problems.  fast-export has a list of additional refs
> > that it needs to explicitly set after exporting all blobs and commits,
> > and when it tries to get_object_mark() on the relevant commit, it can
> > get a mark of 0, i.e. "not found", because the commit in question did
> > not touch the relevant paths and thus was not exported.  Trying to
> > import a stream with a mark corresponding to an unexported object will
> > cause fast-import to crash.
> >
> > Avoid this problem by taking the commit the ref points to and finding an
> > ancestor of it that was exported, and make the ref point to that commit
> > instead.
>
> As with the earlier tag commit, I wonder if this might depend on the
> context in which you're using fast-export. I suppose that if you did not
> feed the ref on the command line that we would not be dealing with it at
> all (and maybe that is the answer to my question about the tag thing,
> too).

Right, if you didn't feed the ref on the command line, we're not
dealing with the ref at all, so the code here doesn't affect any such
ref.

> It does seem funny that the behavior for the earlier case (bounded
> commits) and this case (skipping some commits) are different. Would you
> ever want to keep walking backwards to find an ancestor in the earlier
> case? Or vice versa, would you ever want to simply delete a tag in a
> case like this one?
>
> I'm not sure sure, but I suspect you may have thought about it a lot
> harder than I have. :)

I'm not sure why you thought the behavior for the two cases was
different?  For both patches, my testcases used path limiting; it was
you who suggested employing a negative revision to bound the commits.

Anyway, for both patches assuming you haven't bounded the commits, you
can attempt to keep walking backwards to find an earlier ancestor, but
the fundamental fact is you aren't guaranteed that you can find one
(i.e. some tag or branch points to a commit that didn't modify any of
the specified paths, and nor did any of its ancestors back to any root
commits).  I hit that case lots of times.  If the user explicitly
requested a tag or branch for export (and requested tag rewriting),
and limited to certain paths that had never existed in the repository
as of the time of the tag or branch, then you hit the cases these
patches worry about.  Patch 4 was about (annotated and signed) tags,
this patch is about unannotated tags and branches and other refs.

If you think about using negative revisions, for both cases, then
again you can keep walking back history to try to find a commit that
your tag or branch or ref can point to, but if you get back to the
negative revisions, then you are in the range the user requested to be
omitted from the resulting repository.  Sounds like tag/ref deletion
to me.

>
> > diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> > index a3c044b0af..5648a8ce9c 100644
> > --- a/builtin/fast-export.c
> > +++ b/builtin/fast-export.c
> > @@ -900,7 +900,18 @@ static void handle_tags_and_duplicates(void)
> >                       if (anonymize)
> >                               name = anonymize_refname(name);
> >                       /* create refs pointing to already seen commits */
> > -                     commit = (struct commit *)object;
> > +                     commit = rewrite_commit((struct commit *)object);
> > +                     if (!commit) {
> > +                             /*
> > +                              * Neither this object nor any of its
> > +                              * ancestors touch any relevant paths, so
> > +                              * it has been filtered to nothing.  Delete
> > +                              * it.
> > +                              */
> > +                             printf("reset %s\nfrom %s\n\n",
> > +                                    name, sha1_to_hex(null_sha1));
> > +                             continue;
> > +                     }
>
> This hunk makes sense.

Cool, this was the entirety of the code...so does this mean that the
code makes more sense than my commit message summary did?  ...and
perhaps that my attempts to answer your questions in this email
weren't necessary anymore?

> > --- a/t/t9350-fast-export.sh
> > +++ b/t/t9350-fast-export.sh
> > @@ -386,6 +386,30 @@ test_expect_success 'path limiting with import-marks does not lose unmodified fi
> >       grep file0 actual
> >  '
> >
> > +test_expect_success 'avoid corrupt stream with non-existent mark' '
> > +     test_create_repo avoid_non_existent_mark &&
> > +     (
> > +             cd avoid_non_existent_mark &&
> > +
> > +             touch important-path &&
> > +             git add important-path &&
> > +             test_commit initial &&
> > +
> > +             touch ignored &&
> > +             git add ignored &&
> > +             test_commit whatever &&
> > +
> > +             git branch A &&
> > +             git branch B &&
> > +
> > +             echo foo >>important-path &&
> > +             git add important-path &&
> > +             test_commit more changes &&
> > +
> > +             git fast-export --all -- important-path | git fast-import --force
> > +     )
> > +'
>
> Similar comments apply about "touch" and "test_commit" to what I wrote
> for the earlier patch.

Thanks; will fix.
Jeff King Nov. 12, 2018, 12:45 p.m. UTC | #3
On Sun, Nov 11, 2018 at 12:01:43AM -0800, Elijah Newren wrote:

> > It does seem funny that the behavior for the earlier case (bounded
> > commits) and this case (skipping some commits) are different. Would you
> > ever want to keep walking backwards to find an ancestor in the earlier
> > case? Or vice versa, would you ever want to simply delete a tag in a
> > case like this one?
> >
> > I'm not sure sure, but I suspect you may have thought about it a lot
> > harder than I have. :)
> 
> I'm not sure why you thought the behavior for the two cases was
> different?  For both patches, my testcases used path limiting; it was
> you who suggested employing a negative revision to bound the commits.

Sorry, I think I just got confused. I was thinking about the
documentation fixup you started with, which did regard bounded commits.
But that's not relevant here.

> Anyway, for both patches assuming you haven't bounded the commits, you
> can attempt to keep walking backwards to find an earlier ancestor, but
> the fundamental fact is you aren't guaranteed that you can find one
> (i.e. some tag or branch points to a commit that didn't modify any of
> the specified paths, and nor did any of its ancestors back to any root
> commits).  I hit that case lots of times.  If the user explicitly
> requested a tag or branch for export (and requested tag rewriting),
> and limited to certain paths that had never existed in the repository
> as of the time of the tag or branch, then you hit the cases these
> patches worry about.  Patch 4 was about (annotated and signed) tags,
> this patch is about unannotated tags and branches and other refs.

OK, that makes more sense.

So I guess my question is: in patch 4, why do we not walk back to find
an appropriate ancestor pointed to by the signed tag object, as we do
here for the unannotated case?

And I think the answer is: we already do that. It's just that the
unannotated case never learned the same trick. So basically it's:

  1. rewriting annotated tags to ancestors is already known on "master"

  2. patch 4 further teaches it to drop a tag when that fails

  3. patch 6 teaches both (1) and (2) to the unannotated code path,
     which knew neither

Is that right?

> > This hunk makes sense.
> 
> Cool, this was the entirety of the code...so does this mean that the
> code makes more sense than my commit message summary did?  ...and
> perhaps that my attempts to answer your questions in this email
> weren't necessary anymore?

No, it only made sense that the hunk implemented what you claimed in the
commit message. ;)

I think your responses did help me understand that what the commit
message is claiming is a good thing.

-Peff
Elijah Newren Nov. 12, 2018, 3:36 p.m. UTC | #4
On Mon, Nov 12, 2018 at 4:45 AM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 11, 2018 at 12:01:43AM -0800, Elijah Newren wrote:
>
> > > It does seem funny that the behavior for the earlier case (bounded
> > > commits) and this case (skipping some commits) are different. Would you
> > > ever want to keep walking backwards to find an ancestor in the earlier
> > > case? Or vice versa, would you ever want to simply delete a tag in a
> > > case like this one?
> > >
> > > I'm not sure sure, but I suspect you may have thought about it a lot
> > > harder than I have. :)
> >
> > I'm not sure why you thought the behavior for the two cases was
> > different?  For both patches, my testcases used path limiting; it was
> > you who suggested employing a negative revision to bound the commits.
>
> Sorry, I think I just got confused. I was thinking about the
> documentation fixup you started with, which did regard bounded commits.
> But that's not relevant here.
>
> > Anyway, for both patches assuming you haven't bounded the commits, you
> > can attempt to keep walking backwards to find an earlier ancestor, but
> > the fundamental fact is you aren't guaranteed that you can find one
> > (i.e. some tag or branch points to a commit that didn't modify any of
> > the specified paths, and nor did any of its ancestors back to any root
> > commits).  I hit that case lots of times.  If the user explicitly
> > requested a tag or branch for export (and requested tag rewriting),
> > and limited to certain paths that had never existed in the repository
> > as of the time of the tag or branch, then you hit the cases these
> > patches worry about.  Patch 4 was about (annotated and signed) tags,
> > this patch is about unannotated tags and branches and other refs.
>
> OK, that makes more sense.
>
> So I guess my question is: in patch 4, why do we not walk back to find
> an appropriate ancestor pointed to by the signed tag object, as we do
> here for the unannotated case?
>
> And I think the answer is: we already do that. It's just that the
> unannotated case never learned the same trick. So basically it's:
>
>   1. rewriting annotated tags to ancestors is already known on "master"
>
>   2. patch 4 further teaches it to drop a tag when that fails
>
>   3. patch 6 teaches both (1) and (2) to the unannotated code path,
>      which knew neither
>
> Is that right?

Ah, now I see where the slight disconnect was.  And yes, you are correct.

> > > This hunk makes sense.
> >
> > Cool, this was the entirety of the code...so does this mean that the
> > code makes more sense than my commit message summary did?  ...and
> > perhaps that my attempts to answer your questions in this email
> > weren't necessary anymore?
>
> No, it only made sense that the hunk implemented what you claimed in the
> commit message. ;)
>
> I think your responses did help me understand that what the commit
> message is claiming is a good thing.

Patch
diff mbox series

diff --git a/builtin/fast-export.c b/builtin/fast-export.c
index a3c044b0af..5648a8ce9c 100644
--- a/builtin/fast-export.c
+++ b/builtin/fast-export.c
@@ -900,7 +900,18 @@  static void handle_tags_and_duplicates(void)
 			if (anonymize)
 				name = anonymize_refname(name);
 			/* create refs pointing to already seen commits */
-			commit = (struct commit *)object;
+			commit = rewrite_commit((struct commit *)object);
+			if (!commit) {
+				/*
+				 * Neither this object nor any of its
+				 * ancestors touch any relevant paths, so
+				 * it has been filtered to nothing.  Delete
+				 * it.
+				 */
+				printf("reset %s\nfrom %s\n\n",
+				       name, sha1_to_hex(null_sha1));
+				continue;
+			}
 			printf("reset %s\nfrom :%d\n\n", name,
 			       get_object_mark(&commit->object));
 			show_progress();
diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
index 5bf21b4908..dbb560c110 100755
--- a/t/t9350-fast-export.sh
+++ b/t/t9350-fast-export.sh
@@ -386,6 +386,30 @@  test_expect_success 'path limiting with import-marks does not lose unmodified fi
 	grep file0 actual
 '
 
+test_expect_success 'avoid corrupt stream with non-existent mark' '
+	test_create_repo avoid_non_existent_mark &&
+	(
+		cd avoid_non_existent_mark &&
+
+		touch important-path &&
+		git add important-path &&
+		test_commit initial &&
+
+		touch ignored &&
+		git add ignored &&
+		test_commit whatever &&
+
+		git branch A &&
+		git branch B &&
+
+		echo foo >>important-path &&
+		git add important-path &&
+		test_commit more changes &&
+
+		git fast-export --all -- important-path | git fast-import --force
+	)
+'
+
 test_expect_success 'full-tree re-shows unmodified files'        '
 	git checkout -f simple &&
 	git fast-export --full-tree simple >actual &&