Always check the return value of `repo_read_object_file()`

Message ID	pull.1650.git.1707143753726.gitgitgadget@gmail.com (mailing list archive)
State	Accepted
Commit	568459bf5e97a4f61429e3bdd1f97b54b39a1383
Headers	show Received: from mail-wm1-f43.google.com (mail-wm1-f43.google.com [209.85.128.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 750132C19B for <git@vger.kernel.org>; Mon, 5 Feb 2024 14:35:57 +0000 (UTC) Message-ID: <pull.1650.git.1707143753726.gitgitgadget@gmail.com> Date: Mon, 05 Feb 2024 14:35:53 +0000 Subject: [PATCH] Always check the return value of `repo_read_object_file()` Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk MIME-Version: 1.0 To: git@vger.kernel.org Cc: Johannes Schindelin <johannes.schindelin@gmx.de>, Johannes Schindelin <johannes.schindelin@gmx.de> From: Johannes Schindelin <johannes.schindelin@gmx.de>
Series	Always check the return value of `repo_read_object_file()` \| expand Always check the return value of `repo_read_object_file()`

Johannes Schindelin Feb. 5, 2024, 2:35 p.m. UTC

From: Johannes Schindelin <johannes.schindelin@gmx.de>

There are a couple of places in Git's source code where the return value
is not checked. As a consequence, they are susceptible to segmentation
faults.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    Always check the return value of repo_read_object_file()
    
    I ran into this today, when I had tried git am -3 to import changes from
    a repository into a different repository that has the first repository's
    code vendored in. To make this work, I set
    GIT_ALTERNATE_OBJECT_DIRECTORIES accordingly for the git am -3 call, but
    forgot to set it for a subsequent git diff call, which then segfaulted.
    
    There are still a couple of places left where there are checks but they
    look dubious to me, as they simply continue as if an empty blob had been
    read, for example in builtin/tag.c. However, there are checks that avoid
    segfaults, so I left them alone.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1650%2Fdscho%2Fsafer-object-reads-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1650/dscho/safer-object-reads-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1650

 bisect.c           |  3 +++
 builtin/cat-file.c | 10 ++++++++--
 builtin/grep.c     |  2 ++
 builtin/notes.c    |  6 ++++--
 combine-diff.c     |  2 ++
 rerere.c           |  3 +++
 6 files changed, 22 insertions(+), 4 deletions(-)


base-commit: 2a540e432fe5dff3cfa9d3bf7ca56db2ad12ebb9

Karthik Nayak Feb. 5, 2024, 4:10 p.m. UTC | #1

Hello,

"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com> writes:
> diff --git a/bisect.c b/bisect.c
> index f1273c787d9..f75e50c3397 100644
> --- a/bisect.c
> +++ b/bisect.c
> @@ -158,6 +158,9 @@ static void show_list(const char *debug, int counted, int nr,
>  		const char *subject_start;
>  		int subject_len;
>
> +		if (!buf)
> +			die(_("unable to read %s"), oid_to_hex(&commit->object.oid));
> +

Nit: We know that `repo_read_object_file()` fails on corrupt objects, so
this means that this is only happening when the object doesn't exist. I
wonder if it makes more sense to replace "unable to read %s" which is a
little ambiguous with something like "object %q doesn't exist".

Otherwise, the patch looks good, thanks!

Kyle Lippincott Feb. 6, 2024, 1:13 a.m. UTC | #2

On Mon, Feb 5, 2024 at 6:36 AM Johannes Schindelin via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
>
> There are a couple of places in Git's source code where the return value
> is not checked. As a consequence, they are susceptible to segmentation
> faults.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>     Always check the return value of repo_read_object_file()
>
>     I ran into this today, when I had tried git am -3 to import changes from
>     a repository into a different repository that has the first repository's
>     code vendored in. To make this work, I set
>     GIT_ALTERNATE_OBJECT_DIRECTORIES accordingly for the git am -3 call, but
>     forgot to set it for a subsequent git diff call, which then segfaulted.
>
>     There are still a couple of places left where there are checks but they
>     look dubious to me, as they simply continue as if an empty blob had been
>     read, for example in builtin/tag.c. However, there are checks that avoid
>     segfaults, so I left them alone.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1650%2Fdscho%2Fsafer-object-reads-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1650/dscho/safer-object-reads-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1650
>
>  bisect.c           |  3 +++
>  builtin/cat-file.c | 10 ++++++++--
>  builtin/grep.c     |  2 ++
>  builtin/notes.c    |  6 ++++--
>  combine-diff.c     |  2 ++
>  rerere.c           |  3 +++
>  6 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/builtin/notes.c b/builtin/notes.c
> index e65cae0bcf7..caf20fd5bdd 100644
> --- a/builtin/notes.c
> +++ b/builtin/notes.c
> @@ -716,9 +716,11 @@ static int append_edit(int argc, const char **argv, const char *prefix)
>                 struct strbuf buf = STRBUF_INIT;
>                 char *prev_buf = repo_read_object_file(the_repository, note, &type, &size);
>
> -               if (prev_buf && size)
> +               if (!prev_buf)
> +                       die(_("unable to read %s"), oid_to_hex(note));

This changes the behavior of this function. Previously, it would not
add prev_buf output, but still succeed. This now dies.

> +               if (size)
>                         strbuf_add(&buf, prev_buf, size);
> -               if (d.buf.len && prev_buf && size)
> +               if (d.buf.len && size)
>                         append_separator(&buf);
>                 strbuf_insert(&d.buf, 0, buf.buf, buf.len);
>
> diff --git a/combine-diff.c b/combine-diff.c
> index db94581f724..d6d6fa16894 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -337,6 +337,8 @@ static char *grab_blob(struct repository *r,
>                 free_filespec(df);
>         } else {
>                 blob = repo_read_object_file(r, oid, &type, size);
> +               if (!blob)
> +                       die(_("unable to read %s"), oid_to_hex(oid));

Technically this is changing the output, but I think that's good - I
believe that previously the behavior was undefined, because `type`
wouldn't be modified if the blob didn't exist, and `type` wasn't
assigned a value earlier in the function.

>                 if (type != OBJ_BLOB)
>                         die("object '%s' is not a blob!", oid_to_hex(oid));
>         }
>
> base-commit: 2a540e432fe5dff3cfa9d3bf7ca56db2ad12ebb9
> --
> gitgitgadget
>

Patrick Steinhardt Feb. 6, 2024, 6:51 a.m. UTC | #3

On Mon, Feb 05, 2024 at 02:35:53PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
[snip]
> diff --git a/rerere.c b/rerere.c
> index ca7e77ba68c..13c94ded037 100644
> --- a/rerere.c
> +++ b/rerere.c
> @@ -973,6 +973,9 @@ static int handle_cache(struct index_state *istate,
>  			mmfile[i].ptr = repo_read_object_file(the_repository,
>  							      &ce->oid, &type,
>  							      &size);
> +			if (!mmfile[i].ptr)
> +				die(_("unable to read %s"),
> +				    oid_to_hex(&ce->oid));
>  			mmfile[i].size = size;
>  		}
>  	}

A few lines below this we check whether `mmfile[i].ptr` is `NULL` and
replace it with the empty string if so. So this patch here is basically
a change in behaviour where we now die instead of falling back to the
empty value.

I'm not familiar enough with the code to say whether the old behaviour
is intended or not -- it certainly feels somewhat weird to me. But it
did leave me wondering and could maybe use some explanation.

Patrick

Junio C Hamano Feb. 6, 2024, 6:36 p.m. UTC | #4

Karthik Nayak <karthik.188@gmail.com> writes:

> Hello,
>
> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> diff --git a/bisect.c b/bisect.c
>> index f1273c787d9..f75e50c3397 100644
>> --- a/bisect.c
>> +++ b/bisect.c
>> @@ -158,6 +158,9 @@ static void show_list(const char *debug, int counted, int nr,
>>  		const char *subject_start;
>>  		int subject_len;
>>
>> +		if (!buf)
>> +			die(_("unable to read %s"), oid_to_hex(&commit->object.oid));
>> +
>
> Nit: We know that `repo_read_object_file()` fails on corrupt objects, so
> this means that this is only happening when the object doesn't exist. I
> wonder if it makes more sense to replace "unable to read %s" which is a
> little ambiguous with something like "object %q doesn't exist".

I am not sure if that is a good move in the longer run.  We may
"fix" the called function to return NULL to allow callers to deal
with errors from object corruption better, at which time between
"doesn't exist" and "unable to read", the latter becomes far closer
to what actually happened (it is debatable if a corrupt thing really
exists in the first place, too).

> Otherwise, the patch looks good, thanks!

I haven't read the remainder of the patch, but to me this hunk looks
OK.

Thanks.

Junio C Hamano Feb. 6, 2024, 6:42 p.m. UTC | #5

Patrick Steinhardt <ps@pks.im> writes:

>>  			mmfile[i].ptr = repo_read_object_file(the_repository,
>>  							      &ce->oid, &type,
>>  							      &size);
>> +			if (!mmfile[i].ptr)
>> +				die(_("unable to read %s"),
>> +				    oid_to_hex(&ce->oid));
>>  			mmfile[i].size = size;
>>  		}
>>  	}
>
> A few lines below this we check whether `mmfile[i].ptr` is `NULL` and
> replace it with the empty string if so. So this patch here is basically
> a change in behaviour where we now die instead of falling back to the
> empty value.

I think that one is trying to cope with cases where we genuinely do
not have all three variants, not "we thought we had this variant so
we tried to read it into mmfile[i].{ptr,size}, but it turns out that
the object name we had was bad".  So the fallback code for an entirely
different case was masking the breakage the above hunk fixes, and
this being "rerere", it is better to be cautious than sorry.

Thanks for reading the original code carefully.

Junio C Hamano Feb. 6, 2024, 10:02 p.m. UTC | #6

> diff --git a/builtin/grep.c b/builtin/grep.c
> index c8e33f97755..982bcfc4b1d 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -571,6 +571,8 @@ static int grep_cache(struct grep_opt *opt,
>  
>  			data = repo_read_object_file(the_repository, &ce->oid,
>  						     &type, &size);
> +			if (!data)
> +				die(_("unable to read tree %s"), oid_to_hex(&ce->oid));
>  			init_tree_desc(&tree, data, size);
>  
>  			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);

This caught my attention during today's integration cycle.  Checking
nullness for data certainly is an improvement, but shouldn't we be
checking type as well to make sure it is a tree and not a random
tree-ish or even blob?

Johannes Schindelin Feb. 9, 2024, 8:06 a.m. UTC | #7

Hi Kyle,

On Mon, 5 Feb 2024, Kyle Lippincott wrote:

> On Mon, Feb 5, 2024 at 6:36 AM Johannes Schindelin via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > diff --git a/builtin/notes.c b/builtin/notes.c
> > index e65cae0bcf7..caf20fd5bdd 100644
> > --- a/builtin/notes.c
> > +++ b/builtin/notes.c
> > @@ -716,9 +716,11 @@ static int append_edit(int argc, const char **argv, const char *prefix)
> >                 struct strbuf buf = STRBUF_INIT;
> >                 char *prev_buf = repo_read_object_file(the_repository, note, &type, &size);
> >
> > -               if (prev_buf && size)
> > +               if (!prev_buf)
> > +                       die(_("unable to read %s"), oid_to_hex(note));
>
> This changes the behavior of this function. Previously, it would not
> add prev_buf output, but still succeed. This now dies.

It does change behavior. The previous behavior looked up the note OID,
then tried to read it, and if it was missing just pretended that there had
not been a note.

I'm not quite sure whether we should keep that behavior, as it is unclear
in which scenarios it would be desirable to paper over missing objects.

In GitGitGadget, I am a heavy user of notes and it wouldn't do any good to
have this behavior: It would lose information.

And even in scenarios where the `notes` ref is fetch shallowly, I would
expect all of the actual notes blobs to be present, and I would _want_ the
`git note edit ...` command to error out when that blob is not found.

Does that reasoning make sense to you?

Ciao,
Johannes

Johannes Schindelin Feb. 9, 2024, 8:15 a.m. UTC | #8

Hi Patrick,

On Tue, 6 Feb 2024, Patrick Steinhardt wrote:

> On Mon, Feb 05, 2024 at 02:35:53PM +0000, Johannes Schindelin via GitGitGadget wrote:
> > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> [snip]
> > diff --git a/rerere.c b/rerere.c
> > index ca7e77ba68c..13c94ded037 100644
> > --- a/rerere.c
> > +++ b/rerere.c
> > @@ -973,6 +973,9 @@ static int handle_cache(struct index_state *istate,
> >  			mmfile[i].ptr = repo_read_object_file(the_repository,
> >  							      &ce->oid, &type,
> >  							      &size);
> > +			if (!mmfile[i].ptr)
> > +				die(_("unable to read %s"),
> > +				    oid_to_hex(&ce->oid));
> >  			mmfile[i].size = size;
> >  		}
> >  	}
>
> A few lines below this we check whether `mmfile[i].ptr` is `NULL` and
> replace it with the empty string if so. So this patch here is basically
> a change in behaviour where we now die instead of falling back to the
> empty value.
>
> I'm not familiar enough with the code to say whether the old behaviour
> is intended or not -- it certainly feels somewhat weird to me. But it
> did leave me wondering and could maybe use some explanation.

Hmm. That's a good point. The `mmfile[i].ptr == NULL` situation is indeed
handled specifically.

However, after reading the code I come to the conclusion that the `i`
refers to the stage of an index entry, i.e. that loop
(https://github.com/git/git/blob/v2.43.0/rerere.c#L981-L983) handles the
case where conflicts are due to deletions (where one side of the merge
deleted the file) or double-adds (where both sides of the merge added the
file, with different contents).

Therefore I would suggest that ignoring missing blobs (as is the pre-patch
behavior) would mishandle the available data and paper over a corruption
of the database (the blob is reachable via the Git index, but is missing).

Ciao,
Johannes

Patrick Steinhardt Feb. 9, 2024, 8:17 a.m. UTC | #9

On Fri, Feb 09, 2024 at 09:15:15AM +0100, Johannes Schindelin wrote:
> Hi Patrick,
> 
> On Tue, 6 Feb 2024, Patrick Steinhardt wrote:
> 
> > On Mon, Feb 05, 2024 at 02:35:53PM +0000, Johannes Schindelin via GitGitGadget wrote:
> > > From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > [snip]
> > > diff --git a/rerere.c b/rerere.c
> > > index ca7e77ba68c..13c94ded037 100644
> > > --- a/rerere.c
> > > +++ b/rerere.c
> > > @@ -973,6 +973,9 @@ static int handle_cache(struct index_state *istate,
> > >  			mmfile[i].ptr = repo_read_object_file(the_repository,
> > >  							      &ce->oid, &type,
> > >  							      &size);
> > > +			if (!mmfile[i].ptr)
> > > +				die(_("unable to read %s"),
> > > +				    oid_to_hex(&ce->oid));
> > >  			mmfile[i].size = size;
> > >  		}
> > >  	}
> >
> > A few lines below this we check whether `mmfile[i].ptr` is `NULL` and
> > replace it with the empty string if so. So this patch here is basically
> > a change in behaviour where we now die instead of falling back to the
> > empty value.
> >
> > I'm not familiar enough with the code to say whether the old behaviour
> > is intended or not -- it certainly feels somewhat weird to me. But it
> > did leave me wondering and could maybe use some explanation.
> 
> Hmm. That's a good point. The `mmfile[i].ptr == NULL` situation is indeed
> handled specifically.
> 
> However, after reading the code I come to the conclusion that the `i`
> refers to the stage of an index entry, i.e. that loop
> (https://github.com/git/git/blob/v2.43.0/rerere.c#L981-L983) handles the
> case where conflicts are due to deletions (where one side of the merge
> deleted the file) or double-adds (where both sides of the merge added the
> file, with different contents).
> 
> Therefore I would suggest that ignoring missing blobs (as is the pre-patch
> behavior) would mishandle the available data and paper over a corruption
> of the database (the blob is reachable via the Git index, but is missing).

Yeah, that explanation sounds reasonable to me, thanks!

Patrick

Junio C Hamano Feb. 9, 2024, 4:37 p.m. UTC | #10

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> It does change behavior. The previous behavior looked up the note OID,
> then tried to read it, and if it was missing just pretended that there had
> not been a note.
>
> I'm not quite sure whether we should keep that behavior, as it is unclear
> in which scenarios it would be desirable to paper over missing objects.

Yeah, an object that does not have any notes attached to is a norm
so the calling application must be prepared for it, but this
codepath is different.  The notes tree says it has notes, we try to
read it and it is not there---at least we noticed an inconsistent
notes tree (and object store), and if we were to run "git fsck" at
that point, we would certainly complain about a missing blob object
(can a tree object at an intermediate level be missing and would we
notice, by the way, I wonder).  It is only prudent to report it,
instead of pretending that the notes are not there.

So I think this tightening falls into the "bugfix" category.

Thanks.

Kyle Lippincott Feb. 9, 2024, 7:56 p.m. UTC | #11

On Fri, Feb 9, 2024 at 12:06 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Hi Kyle,
>
> On Mon, 5 Feb 2024, Kyle Lippincott wrote:
>
> > On Mon, Feb 5, 2024 at 6:36 AM Johannes Schindelin via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> > >
> > > diff --git a/builtin/notes.c b/builtin/notes.c
> > > index e65cae0bcf7..caf20fd5bdd 100644
> > > --- a/builtin/notes.c
> > > +++ b/builtin/notes.c
> > > @@ -716,9 +716,11 @@ static int append_edit(int argc, const char **argv, const char *prefix)
> > >                 struct strbuf buf = STRBUF_INIT;
> > >                 char *prev_buf = repo_read_object_file(the_repository, note, &type, &size);
> > >
> > > -               if (prev_buf && size)
> > > +               if (!prev_buf)
> > > +                       die(_("unable to read %s"), oid_to_hex(note));
> >
> > This changes the behavior of this function. Previously, it would not
> > add prev_buf output, but still succeed. This now dies.
>
> It does change behavior. The previous behavior looked up the note OID,
> then tried to read it, and if it was missing just pretended that there had
> not been a note.
>
> I'm not quite sure whether we should keep that behavior, as it is unclear
> in which scenarios it would be desirable to paper over missing objects.
>
> In GitGitGadget, I am a heavy user of notes and it wouldn't do any good to
> have this behavior: It would lose information.
>
> And even in scenarios where the `notes` ref is fetch shallowly, I would
> expect all of the actual notes blobs to be present, and I would _want_ the
> `git note edit ...` command to error out when that blob is not found.
>
> Does that reasoning make sense to you?

Sounds good, thanks for talking through it with me.

>
> Ciao,
> Johannes

Johannes Schindelin Feb. 12, 2024, 11:16 p.m. UTC | #12

Hi Karthik,

On Mon, 5 Feb 2024, Karthik Nayak wrote:

> Hello,
>
> "Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > diff --git a/bisect.c b/bisect.c
> > index f1273c787d9..f75e50c3397 100644
> > --- a/bisect.c
> > +++ b/bisect.c
> > @@ -158,6 +158,9 @@ static void show_list(const char *debug, int counted, int nr,
> >  		const char *subject_start;
> >  		int subject_len;
> >
> > +		if (!buf)
> > +			die(_("unable to read %s"), oid_to_hex(&commit->object.oid));
> > +
>
> Nit: We know that `repo_read_object_file()` fails on corrupt objects, so
> this means that this is only happening when the object doesn't exist. I
> wonder if it makes more sense to replace "unable to read %s" which is a
> little ambiguous with something like "object %q doesn't exist".

I specifically copied this error message from existing code that already
deals with these errors, so as not to cause unnecessary translator
friction.

Ciao,
Johannes

Johannes Schindelin Feb. 12, 2024, 11:19 p.m. UTC | #13

Hi Junio,

On Tue, 6 Feb 2024, Junio C Hamano wrote:

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index c8e33f97755..982bcfc4b1d 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -571,6 +571,8 @@ static int grep_cache(struct grep_opt *opt,
> >
> >  			data = repo_read_object_file(the_repository, &ce->oid,
> >  						     &type, &size);
> > +			if (!data)
> > +				die(_("unable to read tree %s"), oid_to_hex(&ce->oid));
> >  			init_tree_desc(&tree, data, size);
> >
> >  			hit |= grep_tree(opt, pathspec, &tree, &name, 0, 0);
>
> This caught my attention during today's integration cycle.  Checking
> nullness for data certainly is an improvement, but shouldn't we be
> checking type as well to make sure it is a tree and not a random
> tree-ish or even blob?

That sounds right, but I am already stretching this patch beyond what I am
funded to work on, and therefore I need to leave it at the current state.

Ciao,
Johannes

Teng Long Feb. 16, 2024, 6:43 a.m. UTC | #14

Johannes Schindelin <johannes.schindelin@gmx.de> wrote on Mon, 05 Feb 2024:

Hi, when I do zh_CN l10n work for 2.44, I found some check changes like:

    die(_("unable to read tree %s")

in patchset, some old code for this like work is similar but with parentheses
surrounded with the OID parameter:

   die(_("unable to read tree (%s)")

I think it's really a small nit, I don't think it's a requirement to immediately
optimize, they're just some small printing consistency formatting issues, so make
some small tips here.

Thanks.

Johannes Schindelin Feb. 18, 2024, 10:36 p.m. UTC | #15

Hi,

On Fri, 16 Feb 2024, Teng Long wrote:

> Johannes Schindelin <johannes.schindelin@gmx.de> wrote on Mon, 05 Feb 2024:
>
> Hi, when I do zh_CN l10n work for 2.44, I found some check changes like:
>
>     die(_("unable to read tree %s")
>
> in patchset, some old code for this like work is similar but with parentheses
> surrounded with the OID parameter:
>
>    die(_("unable to read tree (%s)")

FWIW I copied the error message from
https://github.com/git/git/blob/v2.43.0/tree-walk.c#L103, but only now
realized that it is untranslated.

> I think it's really a small nit, I don't think it's a requirement to immediately
> optimize, they're just some small printing consistency formatting issues, so make
> some small tips here.

Thank you for paying attention. I agree that it would be good to make
Git's error messages consistent, even if I sadly won't be able to focus on
that due to changes at my dayjob.

Ciao,
Johannes

Always check the return value of `repo_read_object_file()`

Commit Message

Comments

Patch