[6/6] hash-object: use fsck for object checks

Message ID Y8haHL9xIWntSm0/@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 69bbbe484ba10bd88efb9ae3f6a58fcc687df69e
Series hash-object: use fsck to check objects

Commit Message

Jeff King Jan. 18, 2023, 8:44 p.m. UTC
Since c879daa237 (Make hash-object more robust against malformed
objects, 2011-02-05), we've done some rudimentary checks against objects
we're about to write by running them through our usual parsers for
trees, commits, and tags.

These parsers catch some problems, but they are not nearly as careful as
the fsck functions (which makes sense; the parsers are designed to be
fast and forgiving, bailing only when the input is unintelligible). We
are better off doing the more thorough fsck checks when writing objects.
Doing so at write time is much better than writing garbage only to find
out later (after building more history atop it!) that fsck complains
about it, or hosts with transfer.fsckObjects reject it.

This is obviously going to be a user-visible behavior change, and the
test changes earlier in this series show the scope of the impact. But
I'd argue that this is OK:

  - the documentation for hash-object is already vague about which
    checks we might do, saying that --literally will allow "any
    garbage[...] which might not otherwise pass standard object parsing
    or git-fsck checks". So we are already covered under the documented
    behavior.

  - users don't generally run hash-object anyway. There are a lot of
    spots in the tests that needed to be updated because creating
    garbage objects is something that Git's tests disproportionately do.

  - it's hard to imagine anyone thinking the new behavior is worse. Any
    object we reject would be a potential problem down the road for the
    user. And if they really want to create garbage, --literally is
    already the escape hatch they need.

Note that the change here is actually in index_mem(), which handles the
HASH_FORMAT_CHECK flag passed by hash-object. That flag is also used by
"git-replace --edit" to sanity-check the result. Covering that with more
thorough checks likewise seems like a good thing.

Besides being more thorough, there are a few other bonuses:

  - we get rid of some questionable stack allocations of object structs.
    These don't seem to currently cause any problems in practice, but
    they subtly violate some of the assumptions made by the rest of the
    code (e.g., the "struct commit" we put on the stack and
    zero-initialize will not have a proper index from
    alloc_commit_index()).

  - likewise, those parsed object structs are the source of some small
    memory leaks

  - the resulting messages are much better. For example:

      [before]
      $ echo 'tree 123' | git hash-object -t commit --stdin
      error: bogus commit object 0000000000000000000000000000000000000000
      fatal: corrupt commit

      [after]
      $ echo 'tree 123' | git.compile hash-object -t commit --stdin
      error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
      fatal: refusing to create malformed object

Signed-off-by: Jeff King <peff@peff.net>
---
 object-file.c          | 55 ++++++++++++++++++------------------------
 t/t1007-hash-object.sh | 11 +++++++++
 2 files changed, 34 insertions(+), 32 deletions(-)

Comments

Taylor Blau Jan. 18, 2023, 9:34 p.m. UTC | #1
On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote:
> This is obviously going to be a user-visible behavior change, and the
> test changes earlier in this series show the scope of the impact. But
> I'd argue that this is OK:
>
>   - the documentation for hash-object is already vague about which
>     checks we might do, saying that --literally will allow "any
>     garbage[...] which might not otherwise pass standard object parsing
>     or git-fsck checks". So we are already covered under the documented
>     behavior.
>
>   - users don't generally run hash-object anyway. There are a lot of
>     spots in the tests that needed to be updated because creating
>     garbage objects is something that Git's tests disproportionately do.
>
>   - it's hard to imagine anyone thinking the new behavior is worse. Any
>     object we reject would be a potential problem down the road for the
>     user. And if they really want to create garbage, --literally is
>     already the escape hatch they need.

This is the discussion I was pointing out earlier in the series as
evidence for making this behavior the new default without "--literally".

That being said, let me play devil's advocate for a second. Do the new
fsck checks slow anything in hash-object down significantly? If so, then
it's plausible to imagine a hash-object caller who (a) doesn't use
`--literally`, but (b) does care about throughput if they're writing a
large number of objects at once.

I don't know if such a situation exists, or if these new fsck checks
even slow hash-object down enough to care. But I didn't catch a
discussion of this case in your series, so I figured I'd bring it up
here just in case.

>   - the resulting messages are much better. For example:
>
>       [before]
>       $ echo 'tree 123' | git hash-object -t commit --stdin
>       error: bogus commit object 0000000000000000000000000000000000000000
>       fatal: corrupt commit
>
>       [after]
>       $ echo 'tree 123' | git.compile hash-object -t commit --stdin
>       error: object fails fsck: badTreeSha1: invalid 'tree' line format - bad sha1
>       fatal: refusing to create malformed object

Much nicer, well done.

> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  object-file.c          | 55 ++++++++++++++++++------------------------
>  t/t1007-hash-object.sh | 11 +++++++++
>  2 files changed, 34 insertions(+), 32 deletions(-)
>
> diff --git a/object-file.c b/object-file.c
> index 80a0cd3b35..5c96384803 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -33,6 +33,7 @@
>  #include "object-store.h"
>  #include "promisor-remote.h"
>  #include "submodule.h"
> +#include "fsck.h"
>
>  /* The maximum size for an object header. */
>  #define MAX_HEADER_LEN 32
> @@ -2298,32 +2299,21 @@ int repo_has_object_file(struct repository *r,
>  	return repo_has_object_file_with_flags(r, oid, 0);
>  }
>
> -static void check_tree(const void *buf, size_t size)
> -{
> -	struct tree_desc desc;
> -	struct name_entry entry;
> -
> -	init_tree_desc(&desc, buf, size);
> -	while (tree_entry(&desc, &entry))
> -		/* do nothing
> -		 * tree_entry() will die() on malformed entries */
> -		;
> -}
> -
> -static void check_commit(const void *buf, size_t size)
> -{
> -	struct commit c;
> -	memset(&c, 0, sizeof(c));
> -	if (parse_commit_buffer(the_repository, &c, buf, size, 0))
> -		die(_("corrupt commit"));
> -}
> -
> -static void check_tag(const void *buf, size_t size)
> -{
> -	struct tag t;
> -	memset(&t, 0, sizeof(t));
> -	if (parse_tag_buffer(the_repository, &t, buf, size))
> -		die(_("corrupt tag"));

OK, here we're getting rid of all of the lightweight checks that
hash-object used to implement on its own.

> +/*
> + * We can't use the normal fsck_error_function() for index_mem(),
> + * because we don't yet have a valid oid for it to report. Instead,
> + * report the minimal fsck error here, and rely on the caller to
> + * give more context.
> + */
> +static int hash_format_check_report(struct fsck_options *opts,
> +				     const struct object_id *oid,
> +				     enum object_type object_type,
> +				     enum fsck_msg_type msg_type,
> +				     enum fsck_msg_id msg_id,
> +				     const char *message)
> +{
> +	error(_("object fails fsck: %s"), message);
> +	return 1;
>  }
>
>  static int index_mem(struct index_state *istate,
> @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate,
>  		}
>  	}
>  	if (flags & HASH_FORMAT_CHECK) {
> -		if (type == OBJ_TREE)
> -			check_tree(buf, size);
> -		if (type == OBJ_COMMIT)
> -			check_commit(buf, size);
> -		if (type == OBJ_TAG)
> -			check_tag(buf, size);
> +		struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
> +
> +		opts.strict = 1;
> +		opts.error_func = hash_format_check_report;
> +		if (fsck_buffer(null_oid(), type, buf, size, &opts))
> +			die(_("refusing to create malformed object"));
> +		fsck_finish(&opts);
>  	}

And here's the main part of the change, which is delightfully simple and
appears correct to me.

> diff --git a/t/t1007-hash-object.sh b/t/t1007-hash-object.sh
> index 2d2148d8fa..ac3d173767 100755
> --- a/t/t1007-hash-object.sh
> +++ b/t/t1007-hash-object.sh
> @@ -222,6 +222,17 @@ test_expect_success 'empty filename in tree' '
>  	grep "empty filename in tree entry" err
>  '
>
> +test_expect_success 'duplicate filename in tree' '
> +	hex_oid=$(echo foo | git hash-object --stdin -w) &&
> +	bin_oid=$(echo $hex_oid | hex2oct) &&
> +	{
> +		printf "100644 file\0$bin_oid" &&
> +		printf "100644 file\0$bin_oid"
> +	} >tree-with-duplicate-filename &&
> +	test_must_fail git hash-object -t tree tree-with-duplicate-filename 2>err &&
> +	grep "duplicateEntries" err
> +'
> +

For what it's worth, I think that this is sufficient coverage for the
new fsck checks.

Thanks,
Taylor
Jeff King Jan. 19, 2023, 2:31 a.m. UTC | #2
On Wed, Jan 18, 2023 at 04:34:02PM -0500, Taylor Blau wrote:

> That being said, let me play devil's advocate for a second. Do the new
> fsck checks slow anything in hash-object down significantly? If so, then
> it's plausible to imagine a hash-object caller who (a) doesn't use
> `--literally`, but (b) does care about throughput if they're writing a
> large number of objects at once.
> 
> I don't know if such a situation exists, or if these new fsck checks
> even slow hash-object down enough to care. But I didn't catch a
> discussion of this case in your series, so I figured I'd bring it up
> here just in case.

That's a really good point to bring up.

Prior to timing anything, here were my guesses:

  - it won't make a big difference either way because the time is
    dominated by computing sha1 anyway

  - we might actually be a little faster for commits and tags in the new
    code, because they aren't allocating structs for the pointed-to
    objects (trees, parents, etc). Nor stuffing them into obj_hash, so
    our total memory usage would be lower.

  - trees may be a little slower, because we're doing more analysis on
    the filenames (sort order, various filesystem specific checks for
    .git, etc)

And here's what I timed, using linux.git. First I pulled out the raw
object data like so:

  mkdir -p commit tag tree

  git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)' |
  perl -alne 'print $F[1] unless $F[0] eq "blob"' |
  git cat-file --batch |
  perl -ne '
    /(\S+) (\S+) (\d+)/ or die "confusing: $_";
    my $dir = "$2/" . substr($1, 0, 2);
    my $fn = "$dir/" . substr($1, 2);
    mkdir($dir);
    open(my $fh, ">", $fn) or die "open($fn): $!";
    read(STDIN, my $buf, $3) or die "read($3): $!";
    print $fh $buf;
    read(STDIN, $buf, 1); # trailing newline
  '

And then I timed it like this:

  find commit -type f | sort >input
  hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t commit <input'

which yielded:

  Benchmark 1: ./git.old hash-object --stdin-paths -t commit <input
    Time (mean ± σ):      7.264 s ±  0.142 s    [User: 4.129 s, System: 3.043 s]
    Range (min … max):    7.098 s …  7.558 s    10 runs

  Benchmark 2: ./git.new hash-object --stdin-paths -t commit <input
    Time (mean ± σ):      6.832 s ±  0.087 s    [User: 3.848 s, System: 2.901 s]
    Range (min … max):    6.752 s …  7.059 s    10 runs

  Summary
    './git.new hash-object --stdin-paths -t commit <input' ran
      1.06 ± 0.02 times faster than './git.old hash-object --stdin-paths -t commit <input'

So the new code is indeed faster, though really most of the time is
spent reading the data and computing the hash anyway. For comparison,
using --literally drops it to ~6.3s.

And according to massif, peak heap drops from 241MB to 80k. Which is
pretty good.
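
For the curious, a massif run here looks something like this (the exact
invocation may differ):

  valgrind --tool=massif ./git.new hash-object --stdin-paths -t commit <input
  ms_print massif.out.*    # massif writes one massif.out.<pid> per run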

Trees are definitely slower, though. I reduced the number to fit in my
budget of patience:

  find tree -type f | sort | head -n 200000 >input
  hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t tree <input'

And got:

  Benchmark 1: ./git.old hash-object --stdin-paths -t tree <input
    Time (mean ± σ):      2.470 s ±  0.022 s    [User: 1.902 s, System: 0.549 s]
    Range (min … max):    2.442 s …  2.509 s    10 runs
  
  Benchmark 2: ./git.new hash-object --stdin-paths -t tree <input
    Time (mean ± σ):      3.244 s ±  0.026 s    [User: 2.661 s, System: 0.567 s]
    Range (min … max):    3.215 s …  3.295 s    10 runs
  
  Summary
    './git.old hash-object --stdin-paths -t tree <input' ran
      1.31 ± 0.02 times faster than './git.new hash-object --stdin-paths -t tree <input'

So we indeed got a bit slower (and --literally here is ~2.2s). It's
enough that it outweighs the benefits from the commits getting faster
(especially because there tend to be more trees than commits). But those
also get diluted by blobs (which have a lot of data to hash, and whose
fsck checks are essentially free).

So in the end, I think nobody would really care that much. The absolute
numbers are pretty small, and this is already a fairly dumb way to get
objects into your repository. The usual way is via index-pack, and it
already uses the fsck code for its checks. But I do think it was a good
question to explore (plus it found a descriptor leak in hash-object,
which I sent a separate patch for).

-Peff
Jeff King Feb. 1, 2023, 12:50 p.m. UTC | #3
On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote:

> @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate,
>  		}
>  	}
>  	if (flags & HASH_FORMAT_CHECK) {
> -		if (type == OBJ_TREE)
> -			check_tree(buf, size);
> -		if (type == OBJ_COMMIT)
> -			check_commit(buf, size);
> -		if (type == OBJ_TAG)
> -			check_tag(buf, size);
> +		struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
> +
> +		opts.strict = 1;
> +		opts.error_func = hash_format_check_report;
> +		if (fsck_buffer(null_oid(), type, buf, size, &opts))
> +			die(_("refusing to create malformed object"));
> +		fsck_finish(&opts);
>  	}

By the way, I wanted to call out one thing here that nobody mentioned
during review: we are not checking the return value of fsck_finish().

That is a bit of a weird function. We must call it because it cleans up
any resources allocated during the fsck_buffer() call. But it also is
the last chance to fsck any special blobs (like those that are found as
.gitmodules, etc). We only find out the filenames while looking at the
enclosing trees, so we queue them and then check the blobs later.

So if we are hashing a blob, that is mostly fine. We will not have the
blob's name queued as anything special, and so the fsck is a noop.

But if we fsck a tree, and it has a .gitmodules entry pointing to blob
X, then we would also pull X from the odb and fsck it during this
"finish" phase.

Which leads me to two diverging lines of thought:

  1. One of my goals with this series is that one could add objects to
     the repository via "git hash-object -w" and feel confident that no
     fsck rules were violated, because fsck implements some security
     checks. In the past when GitHub rolled out security checks this was
     a major pain, because objects enter repositories not just from
     pushes, but also from web-client activity (e.g., editing a blob on
     the website). And since Git had no way to say "fsck just this
     object", we ended up implementing the fsck checks multiple times,
     in libgit2 and in some of its calling code.

     So I was hoping that just passing the objects to "hash-object"
     would be a viable solution. I'm not sure if it is or not. If you
     just hash a blob, then we'll have no clue it's a .gitmodules file.
     OTOH, you have to get the matching tree which binds the blob to the
     .gitmodules path somehow. So if that tree is fsck'd, and then
     checks the blob during fsck_finish(), that should be enough.
     Assuming that fsck complains when the pointed-to blob cannot be
     accessed, which I think it should (because really, incremental
     pushes face the same problem).

     In which case we really ought to be checking the result of
     fsck_finish() here and complaining.

  2. We're not checking fsck connectivity here, and that's intentional.
     So you can "hash-object" a tree that points to blobs that we don't
     actually have. But if you hash a tree that points a .gitmodules
     entry at a blob that doesn't exist, then that will fail the fsck
     (during the finish step). And respecting the fsck_finish() exit
     code would break that.

     As an addendum, in a regular fsck, many trees might mention the
     same blob as .gitmodules, and we'll queue that blob to be checked
     once. But here, we are potentially running a bunch of individual
     fscks, one per object we hash. So if you had, say, 1000 trees that
     all mentioned the same blob (because other entries were changing),
     and you tried to hash them all with "hash-object --stdin-paths" or
     similar, then we'd fsck that blob 1000 times.

     Which isn't wrong, per se, but seems inefficient. Solving it would
     require keeping track of what has been checked between calls to
     index_mem(). Which is kind of awkward, seeing as how low-level it
     is. It would be a lot more natural if all this checking happened in
     hash-object itself.

So I dunno. The code above is doing (2), albeit with the inefficiency of
checking blobs that we might not care about. I kind of think (1) is the
right thing, though, and anybody who really wants to make trees that
point to bogus .gitmodules can use --literally.
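
To make the tension concrete, here is a rough illustration (not from
the patch; it borrows the hex-to-binary conversion from t1007's
hex2oct helper and assumes a SHA-1 repository): a tree whose
.gitmodules entry names a blob we do not actually have.

  # hash (but do not write) a blob, then point a .gitmodules entry at it
  missing=$(echo phony | git hash-object --stdin) &&
  bin_oid=$(echo $missing | perl -ne 'printf "\\%03o", hex for /../g') &&
  printf "100644 .gitmodules\0$bin_oid" >tree-missing-gitmodules &&
  git hash-object -t tree tree-missing-gitmodules

With the patch as-is, that last command succeeds, because the complaint
about the unreadable .gitmodules blob is only raised from fsck_finish();
respecting fsck_finish()'s exit code, as in (1), would reject it.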

-Peff
Ævar Arnfjörð Bjarmason Feb. 1, 2023, 1:08 p.m. UTC | #4
On Wed, Feb 01 2023, Jeff King wrote:

> On Wed, Jan 18, 2023 at 03:44:12PM -0500, Jeff King wrote:
>
>> @@ -2350,12 +2340,13 @@ static int index_mem(struct index_state *istate,
>>  		}
>>  	}
>>  	if (flags & HASH_FORMAT_CHECK) {
>> -		if (type == OBJ_TREE)
>> -			check_tree(buf, size);
>> -		if (type == OBJ_COMMIT)
>> -			check_commit(buf, size);
>> -		if (type == OBJ_TAG)
>> -			check_tag(buf, size);
>> +		struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
>> +
>> +		opts.strict = 1;
>> +		opts.error_func = hash_format_check_report;
>> +		if (fsck_buffer(null_oid(), type, buf, size, &opts))
>> +			die(_("refusing to create malformed object"));
>> +		fsck_finish(&opts);
>>  	}
>
> By the way, I wanted to call out one thing here that nobody mentioned
> during review: we are not checking the return value of fsck_finish().
>
> That is a bit of a weird function. We must call it because it cleans up
> any resources allocated during the fsck_buffer() call. But it also is
> the last chance to fsck any special blobs (like those that are found as
> .gitmodules, etc). We only find out the filenames while looking at the
> enclosing trees, so we queue them and then check the blobs later.
>
> So if we are hashing a blob, that is mostly fine. We will not have the
> blob's name queued as anything special, and so the fsck is a noop.
>
> But if we fsck a tree, and it has a .gitmodules entry pointing to blob
> X, then we would also pull X from the odb and fsck it during this
> "finish" phase.
>
> Which leads me to two diverging lines of thought:
>
>   1. One of my goals with this series is that one could add objects to
>      the repository via "git hash-object -w" and feel confident that no
>      fsck rules were violated, because fsck implements some security
>      checks. In the past when GitHub rolled out security checks this was
>      a major pain, because objects enter repositories not just from
>      pushes, but also from web-client activity (e.g., editing a blob on
>      the website). And since Git had no way to say "fsck just this
>      object", we ended up implementing the fsck checks multiple times,
>      in libgit2 and in some of its calling code.
>
>      So I was hoping that just passing the objects to "hash-object"
>      would be a viable solution. I'm not sure if it is or not. If you
>      just hash a blob, then we'll have no clue it's a .gitmodules file.
>      OTOH, you have to get the matching tree which binds the blob to the
>      .gitmodules path somehow. So if that tree is fsck'd, and then
>      checks the blob during fsck_finish(), that should be enough.
>      Assuming that fsck complains when the pointed-to blob cannot be
>      accessed, which I think it should (because really, incremental
>      pushes face the same problem).
>
>      In which case we really ought to be checking the result of
>      fsck_finish() here and complaining.
>
>   2. We're not checking fsck connectivity here, and that's intentional.
>      So you can "hash-object" a tree that points to blobs that we don't
>      actually have. But if you hash a tree that points a .gitmodules
>      entry at a blob that doesn't exist, then that will fail the fsck
>      (during the finish step). And respecting the fsck_finish() exit
>      code would break that.
>
>      As an addendum, in a regular fsck, many trees might mention the
>      same blob as .gitmodules, and we'll queue that blob to be checked
>      once. But here, we are potentially running a bunch of individual
>      fscks, one per object we hash. So if you had, say, 1000 trees that
>      all mentioned the same blob (because other entries were changing),
>      and you tried to hash them all with "hash-object --stdin-paths" or
>      similar, then we'd fsck that blob 1000 times.
>
>      Which isn't wrong, per se, but seems inefficient. Solving it would
>      require keeping track of what has been checked between calls to
>      index_mem(). Which is kind of awkward, seeing as how low-level it
>      is. It would be a lot more natural if all this checking happened in
>      hash-object itself.
>
> So I dunno. The code above is doing (2), albeit with the inefficiency of
> checking blobs that we might not care about. I kind of think (1) is the
> right thing, though, and anybody who really wants to make trees that
> point to bogus .gitmodules can use --literally.

Aside from the other things you bring up here, it seems wrong to me to
conflate --literally with some sort of "no fsck" or "don't fsck this
collection yet" mode.

Can't we have a "--no-fsck" or similar, which won't do any sort of full
fsck, but also won't accept bogus object types & the like?

Currently I believe (and I haven't had time to carefully review what you
have here) we only need --literally to produce objects that are truly
corrupt when viewed in isolation. E.g. a tag that refers to a bogus
object type etc.

But we have long supported a narrow view of what the fsck checks mean in
that context. E.g. now with "mktag" we'll use the fsck machinery, but
only skin-deep, so you can be referring to a tree which would in turn
fail our checks.

I tend to think that we should be keeping it like that, but documenting
that if you're creating such objects you either need to do it really
carefully, or follow it up with an operation that's guaranteed to fsck
the sum of the objects you've added recursively.

So, rather than teach e.g. "hash-object" to be smart about that we
should e.g. encourage a manual creation of trees/blobs/commits to be
followed-up with a "git push" to a new ref that refers to them, even if
that "git push" is to the repository located in the $PWD.

By doing that we offload the "what's new?" question to the
pack-over-the-wire machinery, which is well tested.

Anything else seems ultimately to be madness; after all, if I feed a
newly crafted commit to "hash-object", how do we know where to stop,
other than essentially faking up a push negotiation with ourselves?

It's also worth noting that much of the complexity around .gitmodules in
particular is to support packfile-uri's odd notion of applying the
"last" part of the PACK before the "first" part, which nothing else
does.

Which, if we just blindly applied both, and then fsck'd the resulting
combination we'd get rid of that tricky special-case. But I haven't
benchmarked that. It should be a bit slower, particularly on a large
repository that won't fit in memory. But my hunch is that it won't be
too bad, and the resulting simplification may be worth it (particularly
now that we have bundle-uri, which doesn't share that edge-case).
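
Concretely, that follow-up could be spelled something like this (a rough
sketch; the ref name and $tip are placeholders):

  # have the local receive side run the full fsck checks
  git config receive.fsckObjects true &&
  # push the hand-crafted history to a throwaway ref in this very
  # repository; the pack machinery works out "what's new" and the
  # receiving side fscks it as it indexes the pack
  git push . "$tip:refs/heads/verify-new-objects"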
Junio C Hamano Feb. 1, 2023, 8:41 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

>      ... So if that tree is fsck'd, and then
>      checks the blob during fsck_finish(), that should be enough.
>      Assuming that fsck complains when the pointed-to blob cannot be
>      accessed, which I think it should (because really, incremental
>      pushes face the same problem).

Yes.

>   2. We're not checking fsck connectivity here, and that's intentional.
>      So you can "hash-object" a tree that points to blobs that we don't
>      actually have. But if you hash a tree that points a .gitmodules
>      entry at a blob that doesn't exist, then that will fail the fsck
>      (during the finish step). And respecting the fsck_finish() exit
>      code would break that.

That's tricky.  An attack vector to sneak a bad .gitmodules file
into history then becomes (1) hash a tree with a .gitmodules entry
that points at a missing blob and then (2) after that fact is
forgotten, hash a bad blob pointed to by the tree?  We cannot afford
to remember all trees with .gitmodules we didn't find the blob for
forever, so one approach to solve it is to reject trees with missing
blobs.  Legitimate use cases should be able to build up trees bottom
up, hashing blobs before their containing trees.
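
For illustration, the bottom-up order is already the natural one to
script, e.g.

  blob=$(git hash-object -w --stdin <.gitmodules) &&
  tree=$(printf "100644 blob %s\t.gitmodules\n" "$blob" | git mktree)

and "git mktree", for its part, refuses entries for objects it cannot
find unless --missing is given.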

If you hash a commit object, would we want to fsck its tree?  Do we
want to fsck its parent commit and its tree?  Ideally we can stop
when our "traversal" reaches objects that are known to be good, but
how do we decide which objects are "known to be good"?  Being
reachable from our refs, as usual?

> So I dunno. The code above is doing (2), albeit with the inefficiency of
> checking blobs that we might not care about. I kind of think (1) is the
> right thing, though, and anybody who really wants to make trees that
> point to bogus .gitmodules can use --literally.

True.

Patch

diff --git a/object-file.c b/object-file.c
index 80a0cd3b35..5c96384803 100644
--- a/object-file.c
+++ b/object-file.c
@@ -33,6 +33,7 @@ 
 #include "object-store.h"
 #include "promisor-remote.h"
 #include "submodule.h"
+#include "fsck.h"
 
 /* The maximum size for an object header. */
 #define MAX_HEADER_LEN 32
@@ -2298,32 +2299,21 @@  int repo_has_object_file(struct repository *r,
 	return repo_has_object_file_with_flags(r, oid, 0);
 }
 
-static void check_tree(const void *buf, size_t size)
-{
-	struct tree_desc desc;
-	struct name_entry entry;
-
-	init_tree_desc(&desc, buf, size);
-	while (tree_entry(&desc, &entry))
-		/* do nothing
-		 * tree_entry() will die() on malformed entries */
-		;
-}
-
-static void check_commit(const void *buf, size_t size)
-{
-	struct commit c;
-	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(the_repository, &c, buf, size, 0))
-		die(_("corrupt commit"));
-}
-
-static void check_tag(const void *buf, size_t size)
-{
-	struct tag t;
-	memset(&t, 0, sizeof(t));
-	if (parse_tag_buffer(the_repository, &t, buf, size))
-		die(_("corrupt tag"));
+/*
+ * We can't use the normal fsck_error_function() for index_mem(),
+ * because we don't yet have a valid oid for it to report. Instead,
+ * report the minimal fsck error here, and rely on the caller to
+ * give more context.
+ */
+static int hash_format_check_report(struct fsck_options *opts,
+				     const struct object_id *oid,
+				     enum object_type object_type,
+				     enum fsck_msg_type msg_type,
+				     enum fsck_msg_id msg_id,
+				     const char *message)
+{
+	error(_("object fails fsck: %s"), message);
+	return 1;
 }
 
 static int index_mem(struct index_state *istate,
@@ -2350,12 +2340,13 @@  static int index_mem(struct index_state *istate,
 		}
 	}
 	if (flags & HASH_FORMAT_CHECK) {
-		if (type == OBJ_TREE)
-			check_tree(buf, size);
-		if (type == OBJ_COMMIT)
-			check_commit(buf, size);
-		if (type == OBJ_TAG)
-			check_tag(buf, size);
+		struct fsck_options opts = FSCK_OPTIONS_DEFAULT;
+
+		opts.strict = 1;
+		opts.error_func = hash_format_check_report;
+		if (fsck_buffer(null_oid(), type, buf, size, &opts))
+			die(_("refusing to create malformed object"));
+		fsck_finish(&opts);
 	}
 
 	if (write_object)
diff --git a/t/t1007-hash-object.sh b/t/t1007-hash-object.sh
index 2d2148d8fa..ac3d173767 100755
--- a/t/t1007-hash-object.sh
+++ b/t/t1007-hash-object.sh
@@ -222,6 +222,17 @@  test_expect_success 'empty filename in tree' '
 	grep "empty filename in tree entry" err
 '
 
+test_expect_success 'duplicate filename in tree' '
+	hex_oid=$(echo foo | git hash-object --stdin -w) &&
+	bin_oid=$(echo $hex_oid | hex2oct) &&
+	{
+		printf "100644 file\0$bin_oid" &&
+		printf "100644 file\0$bin_oid"
+	} >tree-with-duplicate-filename &&
+	test_must_fail git hash-object -t tree tree-with-duplicate-filename 2>err &&
+	grep "duplicateEntries" err
+'
+
 test_expect_success 'corrupt commit' '
 	test_must_fail git hash-object -t commit --stdin </dev/null
 '