[GSOC] cat-file: fix --batch report changed-type bug

Message ID	pull.965.git.1622363366722.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <pull.965.git.1622363366722.gitgitgadget@gmail.com> Date: Sun, 30 May 2021 08:29:26 +0000 Subject: [PATCH] [GSOC] cat-file: fix --batch report changed-type bug Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Christian Couder <christian.couder@gmail.com>, Hariom Verma <hariom18599@gmail.com>, ZheNing Hu <adlternative@gmail.com>, ZheNing Hu <adlternative@gmail.com> Precedence: bulk From: ZheNing Hu <adlternative@gmail.com>
Series	[GSOC] cat-file: fix --batch report changed-type bug \| expand [GSOC] cat-file: fix --batch report changed-type bug

ZheNing Hu May 30, 2021, 8:29 a.m. UTC

From: ZheNing Hu <adlternative@gmail.com>

When we use `--batch` with no atoms formatting and use
`--batch-all-objects` at the same time (e.g.
`git cat-file --batch="batman" --batch-all-objects`),
Git will exit and report "object xxx changed type!?".

This is because we have a format string which does not
contain any atoms, so skip_object_info option will be
set if we also use --batch-all-objects, and then
`oid_object_info_extended()` will be skipped in
`batch_object_write()`, it cause object type to not be
collected. Therefore, it reported object type has changed.

So avoid checking changes in type and size when all_objects
and skip_object_info options are set at the same time.

Signed-off-by: ZheNing Hu <adlternative@gmail.com>
---
    [GSOC] cat-file: fix --batch report changed-type bug
    
    "git cat-file --batch-all-objects --batch=batman" will trigger a bug:
    "fatal: object xxx changed type!?"
    
    Although we will replace the logic here in the future, this patch can
    help the old git maintain its functionality.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-965%2Fadlternative%2Fcat-file-batch-bug-fix-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-965/adlternative/cat-file-batch-bug-fix-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/965

 builtin/cat-file.c  | 11 ++++++-----
 t/t1006-cat-file.sh |  6 ++++++
 2 files changed, 12 insertions(+), 5 deletions(-)


base-commit: 5d5b1473453400224ebb126bf3947e0a3276bdf5

Jeff King May 30, 2021, 9:09 p.m. UTC | #1

On Sun, May 30, 2021 at 08:29:26AM +0000, ZheNing Hu via GitGitGadget wrote:

> From: ZheNing Hu <adlternative@gmail.com>
> 
> When we use `--batch` with no atoms formatting and use
> `--batch-all-objects` at the same time (e.g.
> `git cat-file --batch="batman" --batch-all-objects`),
> Git will exit and report "object xxx changed type!?".
> 
> This is because we have a format string which does not
> contain any atoms, so skip_object_info option will be
> set if we also use --batch-all-objects, and then
> `oid_object_info_extended()` will be skipped in
> `batch_object_write()`, it cause object type to not be
> collected. Therefore, it reported object type has changed.

Good find. I think this bug is mostly my fault, as I added the
skip_object_info flag but never thought to use it with --batch.

> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 5ebf13359e83..5f9578f9b86b 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -345,11 +345,12 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>  		contents = read_object_file(oid, &type, &size);
>  		if (!contents)
>  			die("object %s disappeared", oid_to_hex(oid));
> -		if (type != data->type)
> -			die("object %s changed type!?", oid_to_hex(oid));
> -		if (data->info.sizep && size != data->size)
> -			die("object %s changed size!?", oid_to_hex(oid));
> -
> +		if (!(opt->all_objects && data->skip_object_info)) {
> +			if (type != data->type)
> +				die("object %s changed type!?", oid_to_hex(oid));
> +			if (data->info.sizep && size != data->size)
> +				die("object %s changed size!?", oid_to_hex(oid));
> +		}

Wouldn't checking data->skip_object_info be sufficient? It's only set
when opt->all_objects is set anyway. But more importantly, it is the
most direct root of the problem: we did not find out the type and size
earlier, so comparing anything against data->type is useless.

But that leads me to a follow-up question: what if we did give a format,
so skip_object_info isn't set, but it didn't include the type or size?

In the size code, it looks like we explicitly protect against this by
checking if data->info.sizep is set (i.e., did we request the size from
oid_object_info_extended). But it's not for the type.

So the assumption is that we do always fill in the type, even if the
user didn't ask for it. And that assumption is actually violated much
earlier. These two bits of code from the setup are out of order:

          if (opt->all_objects) {
                  struct object_info empty = OBJECT_INFO_INIT;
                  if (!memcmp(&data.info, &empty, sizeof(empty)))
                          data.skip_object_info = 1;
          }

          /*
           * If we are printing out the object, then always fill in the type,
           * since we will want to decide whether or not to stream.
           */
          if (opt->print_contents)
                  data.info.typep = &data.type;

We should not let skip_object_info kick in at all if opt->print_contents
is requested. And that causes other bugs outside of the spot you found.
We look at data->type earlier in print_object_or_die() to decide whether
or not to stream the contents, but we'll see garbage (fortunately we
zero-initialize the expand_data struct, so it's at least predictably
zero, and not random undefined behavior).

But I think we'd want to solve it by swapping the two conditionals I
showed above, which restores the assumption made in print_object_or_die().

> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 5d2dc99b74ad..9b0f1ae5ef8b 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -586,4 +586,10 @@ test_expect_success 'cat-file --unordered works' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'cat-file --batch="batman" with --batch-all-objects will work' '
> +	git -C all-two cat-file --batch-all-objects --batch="%(objectname)" | wc -l >expect &&
> +	git -C all-two cat-file --batch-all-objects --batch="batman" | wc -l >actual &&
> +	test_cmp expect actual
> +'

Is it worth testing both of these? The %(objectname) one will fail in
the same way (because we do not need to run oid_object_info() to get the
oid, which we already have). I'm OK doing both for better coverage, but
it may be worth mentioning either in a comment or in the commit message
that we expect both to fail, and why.

-Peff

Junio C Hamano May 30, 2021, 9:15 p.m. UTC | #2

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: ZheNing Hu <adlternative@gmail.com>
>
> When we use `--batch` with no atoms formatting and use
> `--batch-all-objects` at the same time (e.g.
> `git cat-file --batch="batman" --batch-all-objects`),
> Git will exit and report "object xxx changed type!?".
>
> This is because we have a format string which does not
> contain any atoms, so skip_object_info option will be
> set if we also use --batch-all-objects, and then
> `oid_object_info_extended()` will be skipped in
> `batch_object_write()`, it cause object type to not be
> collected. Therefore, it reported object type has changed.

The above analysis on how these die()s get hit makes sense, but ...

> So avoid checking changes in type and size when all_objects
> and skip_object_info options are set at the same time.

... it is not immediately clear how the above conclusion follows.

An obvious alternative, however, is to avoid skipping object info
when all objects is in use---but that goes directly against why this
"skip" mechanism was added at 845de33a (cat-file: avoid noop calls
to sha1_object_info_extended, 2016-05-18).

Which makes me suspect that the solution presented here would be
going in the right direction.

There however is one curious thing about this.  The log message of
the original commit that introduced this optimization does use the
batch-check and batch-all-objects at the same time.  Was this
breakage not there when the original was written and we broke it in
a later update?  If so, with what commit?  Can that commit have
broken other places in cat-file in a similar manner?

> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 5ebf13359e83..5f9578f9b86b 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -345,11 +345,12 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>  		contents = read_object_file(oid, &type, &size);
>  		if (!contents)
>  			die("object %s disappeared", oid_to_hex(oid));
> -		if (type != data->type)
> -			die("object %s changed type!?", oid_to_hex(oid));
> -		if (data->info.sizep && size != data->size)
> -			die("object %s changed size!?", oid_to_hex(oid));
> -
> +		if (!(opt->all_objects && data->skip_object_info)) {
> +			if (type != data->type)
> +				die("object %s changed type!?", oid_to_hex(oid));

When skip_object_info bit is set, we know data->type and date->size
are unusable and should not be checked, regardless of the reason why
skip_object_info bit is set, don't we?

> +			if (data->info.sizep && size != data->size)
> +				die("object %s changed size!?", oid_to_hex(oid));

Does this check need to be modified at all?  Would info.sizep ever
be set to non-NULL if skip_object_info is set (hence we are not
calling object-info)?

> +		}

In other words, shouldn't this patch just this one liner?

-		if (type != data->type)
+		if (data->skip_object_info && type != data->type)
			die("object %s changed type!?", oid_to_hex(oid));

> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 5d2dc99b74ad..9b0f1ae5ef8b 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -586,4 +586,10 @@ test_expect_success 'cat-file --unordered works' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'cat-file --batch="batman" with --batch-all-objects will work' '
> +	git -C all-two cat-file --batch-all-objects --batch="%(objectname)" | wc -l >expect &&
> +	git -C all-two cat-file --batch-all-objects --batch="batman" | wc -l >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_done
>
> base-commit: 5d5b1473453400224ebb126bf3947e0a3276bdf5

Jeff King May 30, 2021, 9:36 p.m. UTC | #3

On Mon, May 31, 2021 at 06:15:00AM +0900, Junio C Hamano wrote:

> There however is one curious thing about this.  The log message of
> the original commit that introduced this optimization does use the
> batch-check and batch-all-objects at the same time.  Was this
> breakage not there when the original was written and we broke it in
> a later update?  If so, with what commit?  Can that commit have
> broken other places in cat-file in a similar manner?

The subtlety here is that it is broken with --batch, not --batch-check.
Only the former calls print_object_or_die(), which is after all the
difference between the two. :)

> [...]

Everything else you noted was quite reasonable, but I think we can get
further at the root of the issue; see my other email.

-Peff

ZheNing Hu May 31, 2021, 1:20 p.m. UTC | #4

Jeff King <peff@peff.net> 于2021年5月31日周一 上午5:09写道：

> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index 5ebf13359e83..5f9578f9b86b 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -345,11 +345,12 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
> >               contents = read_object_file(oid, &type, &size);
> >               if (!contents)
> >                       die("object %s disappeared", oid_to_hex(oid));
> > -             if (type != data->type)
> > -                     die("object %s changed type!?", oid_to_hex(oid));
> > -             if (data->info.sizep && size != data->size)
> > -                     die("object %s changed size!?", oid_to_hex(oid));
> > -
> > +             if (!(opt->all_objects && data->skip_object_info)) {
> > +                     if (type != data->type)
> > +                             die("object %s changed type!?", oid_to_hex(oid));
> > +                     if (data->info.sizep && size != data->size)
> > +                             die("object %s changed size!?", oid_to_hex(oid));
> > +             }
>
> Wouldn't checking data->skip_object_info be sufficient? It's only set
> when opt->all_objects is set anyway. But more importantly, it is the
> most direct root of the problem: we did not find out the type and size
> earlier, so comparing anything against data->type is useless.
>
> But that leads me to a follow-up question: what if we did give a format,
> so skip_object_info isn't set, but it didn't include the type or size?
>

Indeed, such a situation may arise with some format atom e.g. %(objectname)
and %(rest). However %(deltabase) and %(objectdisk:size) don't have this
problem because they filled typep in `packed_object_info()`.

> In the size code, it looks like we explicitly protect against this by
> checking if data->info.sizep is set (i.e., did we request the size from
> oid_object_info_extended). But it's not for the type.
>

Yes, because we set typep while setting print_contents for
print_object_or_die() to check different object type case, this
leads to inconsistent behavior between sizep and typep.

> So the assumption is that we do always fill in the type, even if the
> user didn't ask for it. And that assumption is actually violated much
> earlier. These two bits of code from the setup are out of order:
>
>           if (opt->all_objects) {
>                   struct object_info empty = OBJECT_INFO_INIT;
>                   if (!memcmp(&data.info, &empty, sizeof(empty)))
>                           data.skip_object_info = 1;
>           }
>
>           /*
>            * If we are printing out the object, then always fill in the type,
>            * since we will want to decide whether or not to stream.
>            */
>           if (opt->print_contents)
>                   data.info.typep = &data.type;
>
> We should not let skip_object_info kick in at all if opt->print_contents
> is requested. And that causes other bugs outside of the spot you found.
> We look at data->type earlier in print_object_or_die() to decide whether
> or not to stream the contents, but we'll see garbage (fortunately we
> zero-initialize the expand_data struct, so it's at least predictably
> zero, and not random undefined behavior).
>
> But I think we'd want to solve it by swapping the two conditionals I
> showed above, which restores the assumption made in print_object_or_die().
>

This method is correct. This will ensure that skip_object_info will not
be set when print_contents is set.

By the way, maybe we can merge this two "if (opt->all_objects)" block to one.

        if (opt->all_objects) {
                struct object_info empty = OBJECT_INFO_INIT;
                if (!memcmp(&data.info, &empty, sizeof(empty)))
                        data.skip_object_info = 1;
        }

        if (opt->all_objects) {
                struct object_cb_data cb;

                if (has_promisor_remote())
                        warning("This repository uses promisor
remotes. Some objects may not be loaded.");

After the merger:

        if (opt->all_objects) {
                struct object_cb_data cb;
                struct object_info empty = OBJECT_INFO_INIT;

                if (!memcmp(&data.info, &empty, sizeof(empty)))
                        data.skip_object_info = 1;

                if (has_promisor_remote())
                        warning("This repository uses promisor
remotes. Some objects may not be loaded.");


> > +test_expect_success 'cat-file --batch="batman" with --batch-all-objects will work' '
> > +     git -C all-two cat-file --batch-all-objects --batch="%(objectname)" | wc -l >expect &&
> > +     git -C all-two cat-file --batch-all-objects --batch="batman" | wc -l >actual &&
> > +     test_cmp expect actual
> > +'
>
> Is it worth testing both of these? The %(objectname) one will fail in
> the same way (because we do not need to run oid_object_info() to get the
> oid, which we already have). I'm OK doing both for better coverage, but
> it may be worth mentioning either in a comment or in the commit message
> that we expect both to fail, and why.
>

Yes, these damages need to be pointed out in the commit message.

> -Peff

Thanks.
--
ZheNing Hu

ZheNing Hu May 31, 2021, 1:55 p.m. UTC | #5

Junio C Hamano <gitster@pobox.com> 于2021年5月31日周一 上午5:15写道：
> > So avoid checking changes in type and size when all_objects
> > and skip_object_info options are set at the same time.
>
> ... it is not immediately clear how the above conclusion follows.
>

My origin idea is: Now that it is known that the first "data->type" is 0,
and the second "type" obtained is the real type of the object, they are
different, so we can skip the comparison here.

> An obvious alternative, however, is to avoid skipping object info
> when all objects is in use---but that goes directly against why this
> "skip" mechanism was added at 845de33a (cat-file: avoid noop calls
> to sha1_object_info_extended, 2016-05-18).
>
> Which makes me suspect that the solution presented here would be
> going in the right direction.
>

Looking at it now, Peff's solution may be correct.

> There however is one curious thing about this.  The log message of
> the original commit that introduced this optimization does use the
> batch-check and batch-all-objects at the same time.  Was this
> breakage not there when the original was written and we broke it in
> a later update?  If so, with what commit?  Can that commit have
> broken other places in cat-file in a similar manner?
>

Because it's bug of --batch with --batch-all-objects, not the bug of
 --batch-check with --batch-all-objects. The extra broken atoms
are %(rest), %(objectname).

> > diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> > index 5ebf13359e83..5f9578f9b86b 100644
> > --- a/builtin/cat-file.c
> > +++ b/builtin/cat-file.c
> > @@ -345,11 +345,12 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
> >               contents = read_object_file(oid, &type, &size);
> >               if (!contents)
> >                       die("object %s disappeared", oid_to_hex(oid));
> > -             if (type != data->type)
> > -                     die("object %s changed type!?", oid_to_hex(oid));
> > -             if (data->info.sizep && size != data->size)
> > -                     die("object %s changed size!?", oid_to_hex(oid));
> > -
> > +             if (!(opt->all_objects && data->skip_object_info)) {
> > +                     if (type != data->type)
> > +                             die("object %s changed type!?", oid_to_hex(oid));
>
> When skip_object_info bit is set, we know data->type and date->size
> are unusable and should not be checked, regardless of the reason why
> skip_object_info bit is set, don't we?

Yes... We don't need to consider "opt->all_objects".

>
> > +                     if (data->info.sizep && size != data->size)
> > +                             die("object %s changed size!?", oid_to_hex(oid));
>
> Does this check need to be modified at all?  Would info.sizep ever
> be set to non-NULL if skip_object_info is set (hence we are not
> calling object-info)?
>

Indeed, we don’t need to consider sizep.

> > +             }
>
> In other words, shouldn't this patch just this one liner?
>
> -               if (type != data->type)
> +               if (data->skip_object_info && type != data->type)
>                         die("object %s changed type!?", oid_to_hex(oid));
>

With the original patch, this is the best code.

Thanks.
--
ZheNing Hu

Jeff King May 31, 2021, 2:44 p.m. UTC | #6

On Mon, May 31, 2021 at 09:20:34PM +0800, ZheNing Hu wrote:

> > But I think we'd want to solve it by swapping the two conditionals I
> > showed above, which restores the assumption made in print_object_or_die().
> 
> This method is correct. This will ensure that skip_object_info will not
> be set when print_contents is set.

OK, good, it sounds like we are on the same page. :)

> By the way, maybe we can merge this two "if (opt->all_objects)" block to one.
> [...]

Yes, I think that would make sense to do. It might be worth doing as a
separate patch on top of the bug-fix, though, to make it clear that it's
just cosmetic and separate from the bug-fix.

> > > +test_expect_success 'cat-file --batch="batman" with --batch-all-objects will work' '
> > > +     git -C all-two cat-file --batch-all-objects --batch="%(objectname)" | wc -l >expect &&
> > > +     git -C all-two cat-file --batch-all-objects --batch="batman" | wc -l >actual &&
> > > +     test_cmp expect actual
> > > +'
> >
> > Is it worth testing both of these? The %(objectname) one will fail in
> > the same way (because we do not need to run oid_object_info() to get the
> > oid, which we already have). I'm OK doing both for better coverage, but
> > it may be worth mentioning either in a comment or in the commit message
> > that we expect both to fail, and why.
> 
> Yes, these damages need to be pointed out in the commit message.

I think what confused me here is that you are using "%(objectname)" as
the "expect" output, but it also exhibits the bug. So I'd expect this
test to pass even before your patch (though I didn't try it).

Really, the symptom of the bug is that _neither_ of those cat-file
invocations will exit with a success code. But because they're on the
left-hand side of a pipe, we wouldn't even notice.

The simplest test is just:

  git -C all-two cat-file --batch-all-objects --batch="%(objectname)" >/dev/null

which will currently fail. It would be nice to verify that its output is
sensible, but I'm not sure how to easily do that (it will spew a bunch
of binary tree data, and it cannot even be parsed reliably since we
haven't output the sizes).

-Peff

ZheNing Hu May 31, 2021, 3:32 p.m. UTC | #7

Jeff King <peff@peff.net> 于2021年5月31日周一 下午10:44写道：
>
> > > > +test_expect_success 'cat-file --batch="batman" with --batch-all-objects will work' '
> > > > +     git -C all-two cat-file --batch-all-objects --batch="%(objectname)" | wc -l >expect &&
> > > > +     git -C all-two cat-file --batch-all-objects --batch="batman" | wc -l >actual &&
> > > > +     test_cmp expect actual
> > > > +'
> > >
> > > Is it worth testing both of these? The %(objectname) one will fail in
> > > the same way (because we do not need to run oid_object_info() to get the
> > > oid, which we already have). I'm OK doing both for better coverage, but
> > > it may be worth mentioning either in a comment or in the commit message
> > > that we expect both to fail, and why.
> >
> > Yes, these damages need to be pointed out in the commit message.
>
> I think what confused me here is that you are using "%(objectname)" as
> the "expect" output, but it also exhibits the bug. So I'd expect this
> test to pass even before your patch (though I didn't try it).
>

Yes, %(objectname) should not be used as the output of "expect",
it is also a broken part.

> Really, the symptom of the bug is that _neither_ of those cat-file
> invocations will exit with a success code. But because they're on the
> left-hand side of a pipe, we wouldn't even notice.
>
> The simplest test is just:
>
>   git -C all-two cat-file --batch-all-objects --batch="%(objectname)" >/dev/null
>

Yes, this is enough. I even think about use something like
"!test_must_fail xxx" before... that is not necessary.

> which will currently fail. It would be nice to verify that its output is
> sensible, but I'm not sure how to easily do that (it will spew a bunch
> of binary tree data, and it cannot even be parsed reliably since we
> haven't output the sizes).
>
> -Peff

Thanks!
--
ZheNing Hu

Felipe Contreras May 31, 2021, 4:07 p.m. UTC | #8

Jeff King wrote:
> The simplest test is just:
> 
>   git -C all-two cat-file --batch-all-objects --batch="%(objectname)" >/dev/null
> 
> which will currently fail. It would be nice to verify that its output is
> sensible, but I'm not sure how to easily do that (it will spew a bunch
> of binary tree data, and it cannot even be parsed reliably since we
> haven't output the sizes).

I use ruby to parse binary data from git all the time:

        git log --format='%b%x00' |
                ruby -e 'ARGF.each("\0", chomp: true) { |chunk| p chunk }'

Junio C Hamano June 1, 2021, 1:40 a.m. UTC | #9

Jeff King <peff@peff.net> writes:

> On Mon, May 31, 2021 at 06:15:00AM +0900, Junio C Hamano wrote:
>
>> There however is one curious thing about this.  The log message of
>> the original commit that introduced this optimization does use the
>> batch-check and batch-all-objects at the same time.  Was this
>> breakage not there when the original was written and we broke it in
>> a later update?  If so, with what commit?  Can that commit have
>> broken other places in cat-file in a similar manner?
>
> The subtlety here is that it is broken with --batch, not --batch-check.
> Only the former calls print_object_or_die(), which is after all the
> difference between the two. :)
>
>> [...]
>
> Everything else you noted was quite reasonable, but I think we can get
> further at the root of the issue; see my other email.

Yup, I missed the place we depend on type being set without
checking.

Jeff King June 1, 2021, 1:49 a.m. UTC | #10

On Mon, May 31, 2021 at 11:07:26AM -0500, Felipe Contreras wrote:

> Jeff King wrote:
> > The simplest test is just:
> > 
> >   git -C all-two cat-file --batch-all-objects --batch="%(objectname)" >/dev/null
> > 
> > which will currently fail. It would be nice to verify that its output is
> > sensible, but I'm not sure how to easily do that (it will spew a bunch
> > of binary tree data, and it cannot even be parsed reliably since we
> > haven't output the sizes).
> 
> I use ruby to parse binary data from git all the time:
> 
>         git log --format='%b%x00' |
>                 ruby -e 'ARGF.each("\0", chomp: true) { |chunk| p chunk }'

I doubt we'd want to add a ruby dependency to our test suite, but sure,
we could do the same thing with perl.

The trickier part is that without the sizes, the output is ambiguous
(the command above will dump trees that contain arbitrary bytes
including NULs, so there's no solid delimiter). We could probably devise
a hacky perl snippet that checks for an expected sequence of output that
is unlikely to appear otherwise.

Or we could just generate the entire expected output and check it with
"cmp". The most robust way is probably to loop over the objects, running
"git cat-file" for each, but that's slow. Maybe:

  git cat-file --batch-all-objects --batch-check='%(objectname)' >objects &&
  git cat-file --batch='%(objectname)' <objects >expect &&
  git cat-file --batch-all-objects --batch='%(objectname)' >actual &&
  # not test_cmp, as it is not binary clean!
  cmp expect actual

That feels a bit circular in that it's mostly just exercising most of
the same code in the "expect" and "actual" paths. The interesting part
is the combination of the two options, which is why I think that just
making sure we don't hit an error might be enough.

-Peff

Felipe Contreras June 1, 2021, 6:34 a.m. UTC | #11

Jeff King wrote:
> On Mon, May 31, 2021 at 11:07:26AM -0500, Felipe Contreras wrote:
> 
> > Jeff King wrote:
> > > The simplest test is just:
> > > 
> > >   git -C all-two cat-file --batch-all-objects --batch="%(objectname)" >/dev/null
> > > 
> > > which will currently fail. It would be nice to verify that its output is
> > > sensible, but I'm not sure how to easily do that (it will spew a bunch
> > > of binary tree data, and it cannot even be parsed reliably since we
> > > haven't output the sizes).
> > 
> > I use ruby to parse binary data from git all the time:
> > 
> >         git log --format='%b%x00' |
> >                 ruby -e 'ARGF.each("\0", chomp: true) { |chunk| p chunk }'
> 
> I doubt we'd want to add a ruby dependency to our test suite, but sure,
> we could do the same thing with perl.

I don't mean in the final patches, I mean while the patches are
being developed.

Once it's clear what the code should do, and how to verify it's doing
what it's supposed to be doing, we can decide how the test suite should
verify it.

Ruby is great for prototyping.

ZheNing Hu June 1, 2021, 12:46 p.m. UTC | #12

Jeff King <peff@peff.net> 于2021年6月1日周二 上午9:49写道：
>
> Or we could just generate the entire expected output and check it with
> "cmp". The most robust way is probably to loop over the objects, running
> "git cat-file" for each, but that's slow. Maybe:
>
>   git cat-file --batch-all-objects --batch-check='%(objectname)' >objects &&
>   git cat-file --batch='%(objectname)' <objects >expect &&
>   git cat-file --batch-all-objects --batch='%(objectname)' >actual &&
>   # not test_cmp, as it is not binary clean!
>   cmp expect actual
>
> That feels a bit circular in that it's mostly just exercising most of
> the same code in the "expect" and "actual" paths. The interesting part
> is the combination of the two options, which is why I think that just
> making sure we don't hit an error might be enough.
>

Such a test is also acceptable. I will add a test like this.

> -Peff

--
ZheNing Hu

Jeff King June 1, 2021, 3:34 p.m. UTC | #13

On Tue, Jun 01, 2021 at 01:34:32AM -0500, Felipe Contreras wrote:

> > > I use ruby to parse binary data from git all the time:
> > > 
> > >         git log --format='%b%x00' |
> > >                 ruby -e 'ARGF.each("\0", chomp: true) { |chunk| p chunk }'
> > 
> > I doubt we'd want to add a ruby dependency to our test suite, but sure,
> > we could do the same thing with perl.
> 
> I don't mean in the final patches, I mean while the patches are
> being developed.
> 
> Once it's clear what the code should do, and how to verify it's doing
> what it's supposed to be doing, we can decide how the test suite should
> verify it.
> 
> Ruby is great for prototyping.

If we are not worried about the test suite, then I would think viewing
the output in "less" would be the simplest way to see that it's doing
the right thing.

-Peff

Felipe Contreras June 1, 2021, 4:42 p.m. UTC | #14

Jeff King wrote:
> On Tue, Jun 01, 2021 at 01:34:32AM -0500, Felipe Contreras wrote:
> 
> > > > I use ruby to parse binary data from git all the time:
> > > > 
> > > >         git log --format='%b%x00' |
> > > >                 ruby -e 'ARGF.each("\0", chomp: true) { |chunk| p chunk }'
> > > 
> > > I doubt we'd want to add a ruby dependency to our test suite, but sure,
> > > we could do the same thing with perl.
> > 
> > I don't mean in the final patches, I mean while the patches are
> > being developed.
> > 
> > Once it's clear what the code should do, and how to verify it's doing
> > what it's supposed to be doing, we can decide how the test suite should
> > verify it.
> > 
> > Ruby is great for prototyping.
> 
> If we are not worried about the test suite, then I would think viewing
> the output in "less" would be the simplest way to see that it's doing
> the right thing.

No, I meant you write the test case with ruby and once everything works
you decide how to replicate the same with something more widely
available.

Sometimes I decide to delete the test cases I used for prototyping
because they are too convoluted, test something very esoteric, or it's
not easy to do the same in CI infraestructure.

But they helped me write the code.

Cheers.

[GSOC] cat-file: fix --batch report changed-type bug

Commit Message

Comments

Patch