diff mbox series

[v2,2/3] object-file: emit corruption errors when detected

Message ID 9ddfff3585c293c9801570e395b514505796a43f.1670373420.git.jonathantanmy@google.com (mailing list archive)
State Superseded
Headers show
Series Don't lazy-fetch commits when parsing them | expand

Commit Message

Jonathan Tan Dec. 7, 2022, 12:40 a.m. UTC
Instead of relying on errno being preserved across function calls, teach
do_oid_object_info_extended() to itself report object corruption when
it first detects it. There are 3 types of corruption being detected:
 - when a replacement object is missing
 - when a loose object is corrupt
 - when a packed object is corrupt and the object cannot be read
   in another way

Note that in the RHS of this patch's diff, a check for ENOENT that was
introduced in 3ba7a06552 (A loose object is not corrupt if it cannot
be read due to EMFILE, 2010-10-28) is also removed. The purpose of this
check is to avoid a false report of corruption if the errno contains
something like EMFILE (or anything that is not ENOENT), in which case
a more generic report is presented. Because, as of this patch, we no
longer rely on such a heuristic to determine corruption, but surface
the error message at the point when we read something that we did not
expect, this check is no longer necessary.

Besides being more resilient, this also prepares for a future patch in
which an indirect caller of do_oid_object_info_extended() will need
such functionality.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 object-file.c  | 63 ++++++++++++++++++++++++++------------------------
 object-store.h |  3 +++
 2 files changed, 36 insertions(+), 30 deletions(-)

Comments

Junio C Hamano Dec. 7, 2022, 1:16 a.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> Instead of relying on errno being preserved across function calls, teach
> do_oid_object_info_extended() to itself report object corruption when
> it first detects it. There are 3 types of corruption being detected:
>  - when a replacement object is missing
>  - when a loose object is corrupt
>  - when a packed object is corrupt and the object cannot be read
>    in another way
>
> Note that in the RHS of this patch's diff, a check for ENOENT that was
> introduced in 3ba7a06552 (A loose object is not corrupt if it cannot
> be read due to EMFILE, 2010-10-28) is also removed. The purpose of this
> check is to avoid a false report of corruption if the errno contains
> something like EMFILE (or anything that is not ENOENT), in which case
> a more generic report is presented. Because, as of this patch, we no
> longer rely on such a heuristic to determine corruption, but surface
> the error message at the point when we read something that we did not
> expect, this check is no longer necessary.
>
> Besides being more resilient, this also prepares for a future patch in
> which an indirect caller of do_oid_object_info_extended() will need
> such functionality.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  object-file.c  | 63 ++++++++++++++++++++++++++------------------------
>  object-store.h |  3 +++
>  2 files changed, 36 insertions(+), 30 deletions(-)

The implementation looks very straight-forward.  Nicely done.
Ævar Arnfjörð Bjarmason Dec. 7, 2022, 4:05 a.m. UTC | #2
On Tue, Dec 06 2022, Jonathan Tan wrote:

> Instead of relying on errno being preserved across function calls, teach
> do_oid_object_info_extended() to itself report object corruption when
> it first detects it. There are 3 types of corruption being detected:
>  - when a replacement object is missing
>  - when a loose object is corrupt
>  - when a packed object is corrupt and the object cannot be read
>    in another way
>
> Note that in the RHS of this patch's diff, a check for ENOENT that was
> introduced in 3ba7a06552 (A loose object is not corrupt if it cannot
> be read due to EMFILE, 2010-10-28) is also removed. The purpose of this
> check is to avoid a false report of corruption if the errno contains
> something like EMFILE (or anything that is not ENOENT), in which case
> a more generic report is presented. Because, as of this patch, we no
> longer rely on such a heuristic to determine corruption, but surface
> the error message at the point when we read something that we did not
> expect, this check is no longer necessary.
>
> Besides being more resilient, this also prepares for a future patch in
> which an indirect caller of do_oid_object_info_extended() will need
> such functionality.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  object-file.c  | 63 ++++++++++++++++++++++++++------------------------
>  object-store.h |  3 +++
>  2 files changed, 36 insertions(+), 30 deletions(-)
>
> diff --git a/object-file.c b/object-file.c
> index 596dd049fd..c7a513d123 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1215,7 +1215,8 @@ static int quick_has_loose(struct repository *r,
>   * searching for a loose object named "oid".
>   */
>  static void *map_loose_object_1(struct repository *r, const char *path,
> -			     const struct object_id *oid, unsigned long *size)
> +				const struct object_id *oid, unsigned long *size,
> +				char **mapped_path)
>  {
>  	void *map;
>  	int fd;
> @@ -1224,6 +1225,9 @@ static void *map_loose_object_1(struct repository *r, const char *path,
>  		fd = git_open(path);
>  	else
>  		fd = open_loose_object(r, oid, &path);
> +	if (mapped_path)
> +		*mapped_path = xstrdup(path);
> +
>  	map = NULL;
>  	if (fd >= 0) {
>  		struct stat st;

I find this map_loose_object_1() function to be rather "busy". Part of
it's the pre-image.

Callers at the end of this series are:

   1254:        return map_loose_object_1(r, NULL, oid, size, NULL);
   1467:        map = map_loose_object_1(r, NULL, oid, &mapsize, &mapped_path);
   2803:        map = map_loose_object_1(the_repository, path, NULL, &mapsize, NULL);

So, either we know the path already, and we pass it in, or we don't know
the path, and may or may not be interested in what the path ends up
being.

Which is why we pass in both a "path" and a "mapped_path".

Then, somewhat confusingly (maybe I'm the only one who finds this odd")
the "path" variable itself does double-duty within the function. If we
have a "path" already we leave it alone, but if we don't it's NULL, and
then we write our new path to it.

We *might* then have a path already, *and* write to the "mapped_path",
but in that case we'd be xstrdup() ing a string the user passed in. But
this API use would make no sense.

So shouldn't we at least have a:

	if (path && mapped_path)
		BUG("either tell me the path, or ask me, not both!");

But I think it's better to just separate these concerns. Most of this
refactoring is good, but I think this bit went a step too far, and as a
result we now need to memory manage this "mapped_path". I.e. we'd get a
"struct strbuf"'s "buf" before, but now loose_object_info() needs to
have it xstrdup()'d, just to free() it again.

Isn't the below squashed in better? I.e. just always pass the "path",
but maybe pass a "fd=0", in which case the function might need to
git_open() it.

Then have map_loose_object() and loose_object_info() call
open_loose_object(), and pass in the "path" and "fd".

diff --git a/object-file.c b/object-file.c
index c7a513d123e..24793e1b479 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1214,19 +1214,13 @@ static int quick_has_loose(struct repository *r,
  * Map the loose object at "path" if it is not NULL, or the path found by
  * searching for a loose object named "oid".
  */
-static void *map_loose_object_1(struct repository *r, const char *path,
-				const struct object_id *oid, unsigned long *size,
-				char **mapped_path)
+static void *map_loose_object_1(struct repository *r, const char *const path,
+				int fd, unsigned long *size)
 {
 	void *map;
-	int fd;
 
-	if (path)
+	if (!fd)
 		fd = git_open(path);
-	else
-		fd = open_loose_object(r, oid, &path);
-	if (mapped_path)
-		*mapped_path = xstrdup(path);
 
 	map = NULL;
 	if (fd >= 0) {
@@ -1251,7 +1245,10 @@ void *map_loose_object(struct repository *r,
 		       const struct object_id *oid,
 		       unsigned long *size)
 {
-	return map_loose_object_1(r, NULL, oid, size, NULL);
+	const char *path;
+	int fd = open_loose_object(r, oid, &path);
+
+	return map_loose_object_1(r, path,fd, size);
 }
 
 enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
@@ -1432,7 +1429,6 @@ static int loose_object_info(struct repository *r,
 {
 	int status = 0;
 	unsigned long mapsize;
-	char *mapped_path = NULL;
 	void *map;
 	git_zstream stream;
 	char hdr[MAX_HEADER_LEN];
@@ -1440,6 +1436,8 @@ static int loose_object_info(struct repository *r,
 	unsigned long size_scratch;
 	enum object_type type_scratch;
 	int allow_unknown = flags & OBJECT_INFO_ALLOW_UNKNOWN_TYPE;
+	int fd;
+	const char *path;
 
 	if (oi->delta_base_oid)
 		oidclr(oi->delta_base_oid);
@@ -1464,11 +1462,10 @@ static int loose_object_info(struct repository *r,
 		return 0;
 	}
 
-	map = map_loose_object_1(r, NULL, oid, &mapsize, &mapped_path);
-	if (!map) {
-		free(mapped_path);
+	fd = open_loose_object(r, oid, &path);
+	map = map_loose_object_1(r, path, fd, &mapsize);
+	if (!map)
 		return -1;
-	}
 
 	if (!oi->sizep)
 		oi->sizep = &size_scratch;
@@ -1506,11 +1503,10 @@ static int loose_object_info(struct repository *r,
 
 	if (status && (flags & OBJECT_INFO_DIE_IF_CORRUPT))
 		die(_("loose object %s (stored in %s) is corrupt"),
-		    oid_to_hex(oid), mapped_path);
+		    oid_to_hex(oid), path);
 
 	git_inflate_end(&stream);
 cleanup:
-	free(mapped_path);
 	munmap(map, mapsize);
 	if (oi->sizep == &size_scratch)
 		oi->sizep = NULL;
@@ -2800,7 +2796,7 @@ int read_loose_object(const char *path,
 	char hdr[MAX_HEADER_LEN];
 	unsigned long *size = oi->sizep;
 
-	map = map_loose_object_1(the_repository, path, NULL, &mapsize, NULL);
+	map = map_loose_object_1(the_repository, path, 0, &mapsize);
 	if (!map) {
 		error_errno(_("unable to mmap %s"), path);
 		goto out;
Jeff King Dec. 7, 2022, 6:42 a.m. UTC | #3
On Tue, Dec 06, 2022 at 04:40:52PM -0800, Jonathan Tan wrote:

> Note that in the RHS of this patch's diff, a check for ENOENT that was
> introduced in 3ba7a06552 (A loose object is not corrupt if it cannot
> be read due to EMFILE, 2010-10-28) is also removed. The purpose of this
> check is to avoid a false report of corruption if the errno contains
> something like EMFILE (or anything that is not ENOENT), in which case
> a more generic report is presented. Because, as of this patch, we no
> longer rely on such a heuristic to determine corruption, but surface
> the error message at the point when we read something that we did not
> expect, this check is no longer necessary.

You're right that this will not say "oops, the object is corrupted" when
we get EMFILE, etc, which is what 3ba7a06552 was fixing. But I think
after your patch, we would also never actually say "could not open
$path: too many open descriptors". I don't think that kicked in reliably
with the current code, but it seems like something we are losing in this
patch.

I.e., I think you'd want to also complain when map_loose_object()
returns anything but ENOENT. The errno value there is reliable-ish,
though it might be worth spending the extra work to preserve it across
the close() calls, just in case. Or if you split the open/mmap as Ævar
suggested, then it becomes:

  fd = open_loose_object(r, oid, &path);
  if (fd < 0) {
	if (errno != ENOENT)
		error_errno("unable to open loose object %s", path);
	return -1;
  }

  buf = map_loose_object_1(fd, path);
  if (!buf) {
	/* not much point in complaining here, as xmmap will die on
	 * error. In theory this could catch fstat() problems, but
	 * probably map_loose_object_1() itself should do so, since
	 * it already is special-casing empty files.
	 */
        close(fd);
	return -1;
  }

> diff --git a/object-file.c b/object-file.c
> index 596dd049fd..c7a513d123 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1215,7 +1215,8 @@ static int quick_has_loose(struct repository *r,
>   * searching for a loose object named "oid".
>   */
>  static void *map_loose_object_1(struct repository *r, const char *path,
> -			     const struct object_id *oid, unsigned long *size)
> +				const struct object_id *oid, unsigned long *size,
> +				char **mapped_path)
>  {
>  	void *map;
>  	int fd;
> @@ -1224,6 +1225,9 @@ static void *map_loose_object_1(struct repository *r, const char *path,
>  		fd = git_open(path);
>  	else
>  		fd = open_loose_object(r, oid, &path);
> +	if (mapped_path)
> +		*mapped_path = xstrdup(path);
> +

This introduces an extra malloc/free in every object lookup, even in the
success case where we don't even bother using the value. It's probably
not really noticeable, but it kind of feels wrong. Especially when
open_loose_object() already returns a pointer to long-ish term storage.

One solution is...

> @@ -1497,8 +1504,13 @@ static int loose_object_info(struct repository *r,
>  		break;
>  	}
>  
> +	if (status && (flags & OBJECT_INFO_DIE_IF_CORRUPT))
> +		die(_("loose object %s (stored in %s) is corrupt"),
> +		    oid_to_hex(oid), mapped_path);
> +

...we could just say "loose object %s is corrupt", which is generally
sufficient (outside of alternates, you can only have one copy anyway).

But if we want to retain it, we could just redo the lookup in the error
path, which is what the existing code (that you're deleting) does, via
stat_loose_object(). That's even more wasteful than an extra
malloc/free, but you only pay the cost for a corrupted object. It's
racy, of course, but probably not in a meaningful way in practice.

Alternatively, I like Ævar's suggestion to just split
map_loose_object(). Then this code would naturally have the path, via
calling open_loose_object() itself.

> diff --git a/object-store.h b/object-store.h
> index 1be57abaf1..01134ab5ec 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -447,6 +447,9 @@ struct object_info {
>   */
>  #define OBJECT_INFO_FOR_PREFETCH (OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK)
>  
> +/* Die if object corruption (not just an object being missing) was detected. */
> +#define OBJECT_INFO_DIE_IF_CORRUPT 64

I have a suspicion that the world would be a better place if these die()
calls simply went away, in favor of returning -1 up the stack. But I'm
OK leaving it as-is for the sake of trying not to do too many things at
once (I probably wouldn't have even mentioned it, except that if we do
want to end up there in the long run, we'd eventually have to rip out
this new flag and its associated plumbing).

-Peff
Jeff King Dec. 7, 2022, 7:07 a.m. UTC | #4
On Wed, Dec 07, 2022 at 05:05:47AM +0100, Ævar Arnfjörð Bjarmason wrote:

> Isn't the below squashed in better? I.e. just always pass the "path",
> but maybe pass a "fd=0", in which case the function might need to
> git_open() it.
> 
> Then have map_loose_object() and loose_object_info() call
> open_loose_object(), and pass in the "path" and "fd".

I like this direction, though I'd give a few small suggestions. One is
to make it unconditional to pass in a valid "fd". These kind of magic
sentinel values sometimes lead to confusion or bugs, and it's easy
enough for the caller to use git_open() itself.

In fact, in the one caller who cares, it lets us produce a nicer
error message:

diff --git a/object-file.c b/object-file.c
index 24793e1b47..7c2a85132b 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1219,9 +1219,6 @@ static void *map_loose_object_1(struct repository *r, const char *const path,
 {
 	void *map;
 
-	if (!fd)
-		fd = git_open(path);
-
 	map = NULL;
 	if (fd >= 0) {
 		struct stat st;
@@ -2790,13 +2787,18 @@ int read_loose_object(const char *path,
 		      struct object_info *oi)
 {
 	int ret = -1;
+	int fd;
 	void *map = NULL;
 	unsigned long mapsize;
 	git_zstream stream;
 	char hdr[MAX_HEADER_LEN];
 	unsigned long *size = oi->sizep;
 
-	map = map_loose_object_1(the_repository, path, 0, &mapsize);
+	fd = git_open(path);
+	if (fd < 0)
+		error_errno(_("unable to open %s"), path);
+
+	map = map_loose_object_1(the_repository, path, fd, &mapsize);
 	if (!map) {
 		error_errno(_("unable to mmap %s"), path);
 		goto out;

> +static void *map_loose_object_1(struct repository *r, const char *const path,
> +				int fd, unsigned long *size)
>  {
>  	void *map;
> -	int fd;
>  
> -	if (path)
> +	if (!fd)
>  		fd = git_open(path);
> -	else
> -		fd = open_loose_object(r, oid, &path);
> -	if (mapped_path)
> -		*mapped_path = xstrdup(path);

The other weird thing here is ownership of "fd". Now some callers pass
it in, but map_loose_object_1() always closes it. I think that's OK
(since we want it closed even on success), but definitely surprising
enough that we'd want to document that in a comment.

> @@ -1251,7 +1245,10 @@ void *map_loose_object(struct repository *r,
>  		       const struct object_id *oid,
>  		       unsigned long *size)
>  {
> -	return map_loose_object_1(r, NULL, oid, size, NULL);
> +	const char *path;
> +	int fd = open_loose_object(r, oid, &path);
> +
> +	return map_loose_object_1(r, path,fd, size);
>  }

It's also kind of weird that map_loose_object_1() is a noop on a
negative descriptor. That technically makes this correct, but I think it
would be much less surprising to always take a valid descriptor, and
this code should do:

  if (fd)
	return -1;
  return map_loose_object_1(r, path, fd, size);

If we are going to make map_loose_object_1() less confusing (and I think
that is worth doing), let's go all the way.

-Peff
Ævar Arnfjörð Bjarmason Dec. 7, 2022, 10:33 a.m. UTC | #5
On Wed, Dec 07 2022, Jeff King wrote:

> On Wed, Dec 07, 2022 at 05:05:47AM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> Isn't the below squashed in better? I.e. just always pass the "path",
>> but maybe pass a "fd=0", in which case the function might need to
>> git_open() it.
>> 
>> Then have map_loose_object() and loose_object_info() call
>> open_loose_object(), and pass in the "path" and "fd".
>
> I like this direction, though I'd give a few small suggestions. One is
> to make it unconditional to pass in a valid "fd". These kind of magic
> sentinel values sometimes lead to confusion or bugs, and it's easy
> enough for the caller to use git_open() itself.
>
> In fact, in the one caller who cares, it lets us produce a nicer
> error message:
>
> diff --git a/object-file.c b/object-file.c
> index 24793e1b47..7c2a85132b 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1219,9 +1219,6 @@ static void *map_loose_object_1(struct repository *r, const char *const path,
>  {
>  	void *map;
>  
> -	if (!fd)
> -		fd = git_open(path);
> -
>  	map = NULL;
>  	if (fd >= 0) {
>  		struct stat st;
> @@ -2790,13 +2787,18 @@ int read_loose_object(const char *path,
>  		      struct object_info *oi)
>  {
>  	int ret = -1;
> +	int fd;
>  	void *map = NULL;
>  	unsigned long mapsize;
>  	git_zstream stream;
>  	char hdr[MAX_HEADER_LEN];
>  	unsigned long *size = oi->sizep;
>  
> -	map = map_loose_object_1(the_repository, path, 0, &mapsize);
> +	fd = git_open(path);
> +	if (fd < 0)
> +		error_errno(_("unable to open %s"), path);
> +
> +	map = map_loose_object_1(the_repository, path, fd, &mapsize);
>  	if (!map) {
>  		error_errno(_("unable to mmap %s"), path);
>  		goto out;

Yeah, I think that's even better, although...

>> +static void *map_loose_object_1(struct repository *r, const char *const path,
>> +				int fd, unsigned long *size)
>>  {
>>  	void *map;
>> -	int fd;
>>  
>> -	if (path)
>> +	if (!fd)
>>  		fd = git_open(path);
>> -	else
>> -		fd = open_loose_object(r, oid, &path);
>> -	if (mapped_path)
>> -		*mapped_path = xstrdup(path);
>
> The other weird thing here is ownership of "fd". Now some callers pass
> it in, but map_loose_object_1() always closes it. I think that's OK
> (since we want it closed even on success), but definitely surprising
> enough that we'd want to document that in a comment.
>
>> @@ -1251,7 +1245,10 @@ void *map_loose_object(struct repository *r,
>>  		       const struct object_id *oid,
>>  		       unsigned long *size)
>>  {
>> -	return map_loose_object_1(r, NULL, oid, size, NULL);
>> +	const char *path;
>> +	int fd = open_loose_object(r, oid, &path);
>> +
>> +	return map_loose_object_1(r, path,fd, size);
>>  }
>
> It's also kind of weird that map_loose_object_1() is a noop on a
> negative descriptor. That technically makes this correct, but I think it
> would be much less surprising to always take a valid descriptor, and
> this code should do:
>
>   if (fd)
> 	return -1;
>   return map_loose_object_1(r, path, fd, size);
>
> If we are going to make map_loose_object_1() less confusing (and I think
> that is worth doing), let's go all the way.

...maybe we should go further in the other direction. I.e. with my
earlier suggestion we're left with the mess that the "fd" ownership
isn't clear.

But what I was trying to do was fix up the ownership around the
"mapped_path", but we don't need to xstrdup() it in the first place. We
already have the caller of open_loose_object() not doing that, we can
just say that you're not going to open two loose objects at a time.

Then this becomes easier, and we can just pass the maybe-NULL "const
char **oid_path" all the way to open_loose_object():


diff --git a/object-file.c b/object-file.c
index c7a513d123e..6e900737b76 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1176,7 +1176,7 @@ static int stat_loose_object(struct repository *r, const struct object_id *oid,
  * descriptor. See the caveats on the "path" parameter above.
  */
 static int open_loose_object(struct repository *r,
-			     const struct object_id *oid, const char **path)
+			     const struct object_id *oid, const char **oid_path)
 {
 	int fd;
 	struct object_directory *odb;
@@ -1185,8 +1185,12 @@ static int open_loose_object(struct repository *r,
 
 	prepare_alt_odb(r);
 	for (odb = r->objects->odb; odb; odb = odb->next) {
-		*path = odb_loose_path(odb, &buf, oid);
-		fd = git_open(*path);
+		const char *path;
+
+		path = odb_loose_path(odb, &buf, oid);
+		if (oid_path)
+			*oid_path = path;
+		fd = git_open(path);
 		if (fd >= 0)
 			return fd;
 
@@ -1214,19 +1218,22 @@ static int quick_has_loose(struct repository *r,
  * Map the loose object at "path" if it is not NULL, or the path found by
  * searching for a loose object named "oid".
  */
-static void *map_loose_object_1(struct repository *r, const char *path,
+static void *map_loose_object_1(struct repository *r, const char *const path,
 				const struct object_id *oid, unsigned long *size,
-				char **mapped_path)
+				const char **oid_path)
 {
 	void *map;
 	int fd;
 
+	if (path && oid_path)
+		BUG("don't tell me about the path, and ask me what it is!");
+	else if (!(path || oid))
+		BUG("must get an OID or a path!");
+
 	if (path)
 		fd = git_open(path);
 	else
-		fd = open_loose_object(r, oid, &path);
-	if (mapped_path)
-		*mapped_path = xstrdup(path);
+		fd = open_loose_object(r, oid, oid_path);
 
 	map = NULL;
 	if (fd >= 0) {
@@ -1236,7 +1243,8 @@ static void *map_loose_object_1(struct repository *r, const char *path,
 			*size = xsize_t(st.st_size);
 			if (!*size) {
 				/* mmap() is forbidden on empty files */
-				error(_("object file %s is empty"), path);
+				error(_("object file %s is empty"),
+				      path ? path : *oid_path);
 				close(fd);
 				return NULL;
 			}
@@ -1432,7 +1440,7 @@ static int loose_object_info(struct repository *r,
 {
 	int status = 0;
 	unsigned long mapsize;
-	char *mapped_path = NULL;
+	const char *oid_path;
 	void *map;
 	git_zstream stream;
 	char hdr[MAX_HEADER_LEN];
@@ -1464,11 +1472,9 @@ static int loose_object_info(struct repository *r,
 		return 0;
 	}
 
-	map = map_loose_object_1(r, NULL, oid, &mapsize, &mapped_path);
-	if (!map) {
-		free(mapped_path);
+	map = map_loose_object_1(r, NULL, oid, &mapsize, &oid_path);
+	if (!map)
 		return -1;
-	}
 
 	if (!oi->sizep)
 		oi->sizep = &size_scratch;
@@ -1506,11 +1512,10 @@ static int loose_object_info(struct repository *r,
 
 	if (status && (flags & OBJECT_INFO_DIE_IF_CORRUPT))
 		die(_("loose object %s (stored in %s) is corrupt"),
-		    oid_to_hex(oid), mapped_path);
+		    oid_to_hex(oid), oid_path);
 
 	git_inflate_end(&stream);
 cleanup:
-	free(mapped_path);
 	munmap(map, mapsize);
 	if (oi->sizep == &size_scratch)
 		oi->sizep = NULL;
Jonathan Tan Dec. 7, 2022, 11:26 p.m. UTC | #6
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Yeah, I think that's even better, although...

[snip]
 
> > It's also kind of weird that map_loose_object_1() is a noop on a
> > negative descriptor. That technically makes this correct, but I think it
> > would be much less surprising to always take a valid descriptor, and
> > this code should do:
> >
> >   if (fd)
> > 	return -1;
> >   return map_loose_object_1(r, path, fd, size);
> >
> > If we are going to make map_loose_object_1() less confusing (and I think
> > that is worth doing), let's go all the way.
> 
> ...maybe we should go further in the other direction. I.e. with my
> earlier suggestion we're left with the mess that the "fd" ownership
> isn't clear.

With Peff's suggestion I think we can make the function always close
the fd. If we find it not to be clear, we can rename the function
to ..._close_fd() or something like that.

Thanks to both of you for your suggestions.
Ævar Arnfjörð Bjarmason Dec. 7, 2022, 11:50 p.m. UTC | #7
On Wed, Dec 07 2022, Jonathan Tan wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> Yeah, I think that's even better, although...
>
> [snip]
>  
>> > It's also kind of weird that map_loose_object_1() is a noop on a
>> > negative descriptor. That technically makes this correct, but I think it
>> > would be much less surprising to always take a valid descriptor, and
>> > this code should do:
>> >
>> >   if (fd)
>> > 	return -1;
>> >   return map_loose_object_1(r, path, fd, size);
>> >
>> > If we are going to make map_loose_object_1() less confusing (and I think
>> > that is worth doing), let's go all the way.
>> 
>> ...maybe we should go further in the other direction. I.e. with my
>> earlier suggestion we're left with the mess that the "fd" ownership
>> isn't clear.
>
> With Peff's suggestion I think we can make the function always close
> the fd. If we find it not to be clear, we can rename the function
> to ..._close_fd() or something like that.
>
> Thanks to both of you for your suggestions.

I think that was my suggestion.

Peff's on top of that was to never have it *open* the fd, but I'd left
one caller doing that.

I.e. that the ownership would still be passed to it, but at least it
would always be passed, and wouldn't be the mixed affair that my initial
suggestion left it at.

I'll leave it to you to pick through this.

I have a mild preference for my latest suggestion as the ownership of
all the variables seems cleanest in that iteration. I.e. we don't need
to xstrdup(), and the "fd" is always contained within
map_loose_object_1().

We still have the "sometimes a path, sometimes I make a path from an
oid" semantics though, but that seems unavoidable.
Jeff King Dec. 8, 2022, 6:33 a.m. UTC | #8
On Thu, Dec 08, 2022 at 12:50:27AM +0100, Ævar Arnfjörð Bjarmason wrote:

> I have a mild preference for my latest suggestion as the ownership of
> all the variables seems cleanest in that iteration. I.e. we don't need
> to xstrdup(), and the "fd" is always contained within
> map_loose_object_1().
> 
> We still have the "sometimes a path, sometimes I make a path from an
> oid" semantics though, but that seems unavoidable.

Of the two warts, I think "this function consume the fd" is less weird
than the two path variables (one sometimes-in and one sometimes-out).
If the fd thing is too ugly, we could have the function _not_ consume
the fd, but I think that probably makes the callers worse.

At any rate, we can wait and see what Jonathan comes up with.

(As an aside, thank you Jonathan for dealing with some of this
long-standing ugliness; it is not directly related to your goal, but I
think it's adjacent enough to merit doing it as part of the series).

-Peff
diff mbox series

Patch

diff --git a/object-file.c b/object-file.c
index 596dd049fd..c7a513d123 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1215,7 +1215,8 @@  static int quick_has_loose(struct repository *r,
  * searching for a loose object named "oid".
  */
 static void *map_loose_object_1(struct repository *r, const char *path,
-			     const struct object_id *oid, unsigned long *size)
+				const struct object_id *oid, unsigned long *size,
+				char **mapped_path)
 {
 	void *map;
 	int fd;
@@ -1224,6 +1225,9 @@  static void *map_loose_object_1(struct repository *r, const char *path,
 		fd = git_open(path);
 	else
 		fd = open_loose_object(r, oid, &path);
+	if (mapped_path)
+		*mapped_path = xstrdup(path);
+
 	map = NULL;
 	if (fd >= 0) {
 		struct stat st;
@@ -1247,7 +1251,7 @@  void *map_loose_object(struct repository *r,
 		       const struct object_id *oid,
 		       unsigned long *size)
 {
-	return map_loose_object_1(r, NULL, oid, size);
+	return map_loose_object_1(r, NULL, oid, size, NULL);
 }
 
 enum unpack_loose_header_result unpack_loose_header(git_zstream *stream,
@@ -1428,6 +1432,7 @@  static int loose_object_info(struct repository *r,
 {
 	int status = 0;
 	unsigned long mapsize;
+	char *mapped_path = NULL;
 	void *map;
 	git_zstream stream;
 	char hdr[MAX_HEADER_LEN];
@@ -1459,9 +1464,11 @@  static int loose_object_info(struct repository *r,
 		return 0;
 	}
 
-	map = map_loose_object(r, oid, &mapsize);
-	if (!map)
+	map = map_loose_object_1(r, NULL, oid, &mapsize, &mapped_path);
+	if (!map) {
+		free(mapped_path);
 		return -1;
+	}
 
 	if (!oi->sizep)
 		oi->sizep = &size_scratch;
@@ -1497,8 +1504,13 @@  static int loose_object_info(struct repository *r,
 		break;
 	}
 
+	if (status && (flags & OBJECT_INFO_DIE_IF_CORRUPT))
+		die(_("loose object %s (stored in %s) is corrupt"),
+		    oid_to_hex(oid), mapped_path);
+
 	git_inflate_end(&stream);
 cleanup:
+	free(mapped_path);
 	munmap(map, mapsize);
 	if (oi->sizep == &size_scratch)
 		oi->sizep = NULL;
@@ -1608,6 +1620,15 @@  static int do_oid_object_info_extended(struct repository *r,
 			continue;
 		}
 
+		if (flags & OBJECT_INFO_DIE_IF_CORRUPT) {
+			const struct packed_git *p;
+			if ((flags & OBJECT_INFO_LOOKUP_REPLACE) && !oideq(real, oid))
+				die(_("replacement %s not found for %s"),
+				    oid_to_hex(real), oid_to_hex(oid));
+			if ((p = has_packed_and_bad(r, real)))
+				die(_("packed object %s (stored in %s) is corrupt"),
+				    oid_to_hex(real), p->pack_name);
+		}
 		return -1;
 	}
 
@@ -1660,7 +1681,8 @@  int oid_object_info(struct repository *r,
 
 static void *read_object(struct repository *r,
 			 const struct object_id *oid, enum object_type *type,
-			 unsigned long *size)
+			 unsigned long *size,
+			 int die_if_corrupt)
 {
 	struct object_info oi = OBJECT_INFO_INIT;
 	void *content;
@@ -1668,7 +1690,9 @@  static void *read_object(struct repository *r,
 	oi.sizep = size;
 	oi.contentp = &content;
 
-	if (oid_object_info_extended(r, oid, &oi, 0) < 0)
+	if (oid_object_info_extended(r, oid, &oi,
+				     die_if_corrupt ? OBJECT_INFO_DIE_IF_CORRUPT : 0)
+	    < 0)
 		return NULL;
 	return content;
 }
@@ -1704,35 +1728,14 @@  void *read_object_file_extended(struct repository *r,
 				int lookup_replace)
 {
 	void *data;
-	const struct packed_git *p;
-	const char *path;
-	struct stat st;
 	const struct object_id *repl = lookup_replace ?
 		lookup_replace_object(r, oid) : oid;
 
 	errno = 0;
-	data = read_object(r, repl, type, size);
+	data = read_object(r, repl, type, size, 1);
 	if (data)
 		return data;
 
-	obj_read_lock();
-	if (errno && errno != ENOENT)
-		die_errno(_("failed to read object %s"), oid_to_hex(oid));
-
-	/* die if we replaced an object with one that does not exist */
-	if (repl != oid)
-		die(_("replacement %s not found for %s"),
-		    oid_to_hex(repl), oid_to_hex(oid));
-
-	if (!stat_loose_object(r, repl, &st, &path))
-		die(_("loose object %s (stored in %s) is corrupt"),
-		    oid_to_hex(repl), path);
-
-	if ((p = has_packed_and_bad(r, repl)))
-		die(_("packed object %s (stored in %s) is corrupt"),
-		    oid_to_hex(repl), p->pack_name);
-	obj_read_unlock();
-
 	return NULL;
 }
 
@@ -2275,7 +2278,7 @@  int force_object_loose(const struct object_id *oid, time_t mtime)
 
 	if (has_loose_object(oid))
 		return 0;
-	buf = read_object(the_repository, oid, &type, &len);
+	buf = read_object(the_repository, oid, &type, &len, 0);
 	if (!buf)
 		return error(_("cannot read object for %s"), oid_to_hex(oid));
 	hdrlen = format_object_header(hdr, sizeof(hdr), type, len);
@@ -2797,7 +2800,7 @@  int read_loose_object(const char *path,
 	char hdr[MAX_HEADER_LEN];
 	unsigned long *size = oi->sizep;
 
-	map = map_loose_object_1(the_repository, path, NULL, &mapsize);
+	map = map_loose_object_1(the_repository, path, NULL, &mapsize, NULL);
 	if (!map) {
 		error_errno(_("unable to mmap %s"), path);
 		goto out;
diff --git a/object-store.h b/object-store.h
index 1be57abaf1..01134ab5ec 100644
--- a/object-store.h
+++ b/object-store.h
@@ -447,6 +447,9 @@  struct object_info {
  */
 #define OBJECT_INFO_FOR_PREFETCH (OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_QUICK)
 
+/* Die if object corruption (not just an object being missing) was detected. */
+#define OBJECT_INFO_DIE_IF_CORRUPT 64
+
 int oid_object_info_extended(struct repository *r,
 			     const struct object_id *,
 			     struct object_info *, unsigned flags);