diff mbox series

[1/2] blob.c: remove buffer & size arguments to parse_blob_buffer()

Message ID patch-1.3-68a7709fe5-20210409T080534Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series blob/object.c: trivial readability improvements | expand

Commit Message

Ævar Arnfjörð Bjarmason April 9, 2021, 8:07 a.m. UTC
As noted in the comment introduced in 837d395a5c0 (Replace
parse_blob() with an explanatory comment, 2010-01-18) the old
parse_blob() function and the current parse_blob_buffer() exist merely
to provide consistency in the API.

We're not going to parse blobs like we "parse" commits, trees or
tags. So let's not have the parse_blob_buffer() take arguments that
pretends that we do. Its only use is to set the "parsed" flag.

See bd2c39f58f9 ([PATCH] don't load and decompress objects twice with
parse_object(), 2005-05-06) for the introduction of parse_blob_buffer().

I'm moving the prototype of parse_blob_buffer() below the comment
added in 837d395a5c0 while I'm at it. That comment was originally
meant to be a replacement for the missing parse_blob() function, but
it's much less confusing to have it be above the parse_blob_buffer()
function it refers to.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 blob.c   | 2 +-
 blob.h   | 3 +--
 object.c | 4 ++--
 3 files changed, 4 insertions(+), 5 deletions(-)

Comments

Jeff King April 9, 2021, 5:51 p.m. UTC | #1
On Fri, Apr 09, 2021 at 10:07:27AM +0200, Ævar Arnfjörð Bjarmason wrote:

> As noted in the comment introduced in 837d395a5c0 (Replace
> parse_blob() with an explanatory comment, 2010-01-18) the old
> parse_blob() function and the current parse_blob_buffer() exist merely
> to provide consistency in the API.
> 
> We're not going to parse blobs like we "parse" commits, trees or
> tags. So let's not have the parse_blob_buffer() take arguments that
> pretends that we do. Its only use is to set the "parsed" flag.
> 
> See bd2c39f58f9 ([PATCH] don't load and decompress objects twice with
> parse_object(), 2005-05-06) for the introduction of parse_blob_buffer().

OK. Calling it parse_blob_buffer() is a little silly since it doesn't
even take a buffer anymore. But I guess parse_blob() might imply that it
actually loads the contents from disk to check them (which the other
parse_foo() functions do), so that's not a good name.

So this might be the least bad thing. Given that there are only two
callers, just setting blob->object.parsed might not be unreasonable,
either. But I don't think it's worth spending too much time on.

> @@ -266,7 +266,7 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
>  			error(_("hash mismatch %s"), oid_to_hex(oid));
>  			return NULL;
>  		}
> -		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
> +		parse_blob_buffer(lookup_blob(r, oid));
>  		return lookup_object(r, oid);

Not new in your patch, but I wondered if this could cause a segfault
when lookup_blob() returns NULL. I _think_ the answer is "no". We'd hit
this code path when either:

  - lookup_object() returns an object with type OBJ_BLOB, in which case
    lookup_blob() would return that same object

  - lookup_object() returned NULL, in which case lookup_blob() will call
    it again, get NULL again, and then auto-create the blob and return
    it

So I think it is OK. But there are a bunch of duplicate hash lookups in
this code. It would be clearer and more efficient as:

diff --git a/object.c b/object.c
index 2c32691dc4..2dfa038f13 100644
--- a/object.c
+++ b/object.c
@@ -262,12 +262,14 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
+		if (!obj)
+			obj = create_object(r, oid, alloc_blob_node(r));
 		if (check_object_signature(r, repl, NULL, 0, NULL) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
-		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
-		return lookup_object(r, oid);
+		parse_blob_buffer(obj, NULL, 0);
+		return obj;
 	}
 
 	buffer = repo_read_object_file(r, oid, &type, &size);

but I doubt the efficiency matters much in practice. Those hash lookups
will be lost in the noise of computing the hash of the blob contents.

-Peff
Junio C Hamano April 9, 2021, 10:31 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

> OK. Calling it parse_blob_buffer() is a little silly since it doesn't
> even take a buffer anymore. But I guess parse_blob() might imply that it
> actually loads the contents from disk to check them (which the other
> parse_foo() functions do), so that's not a good name.

mark_object_as_parsed(), perhaps?

> So this might be the least bad thing. Given that there are only two
> callers, just setting blob->object.parsed might not be unreasonable,
> either. But I don't think it's worth spending too much time on.

Yup.
Ævar Arnfjörð Bjarmason April 10, 2021, 12:57 p.m. UTC | #3
On Fri, Apr 09 2021, Jeff King wrote:

> On Fri, Apr 09, 2021 at 10:07:27AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> As noted in the comment introduced in 837d395a5c0 (Replace
>> parse_blob() with an explanatory comment, 2010-01-18) the old
>> parse_blob() function and the current parse_blob_buffer() exist merely
>> to provide consistency in the API.
>> 
>> We're not going to parse blobs like we "parse" commits, trees or
>> tags. So let's not have the parse_blob_buffer() take arguments that
>> pretends that we do. Its only use is to set the "parsed" flag.
>> 
>> See bd2c39f58f9 ([PATCH] don't load and decompress objects twice with
>> parse_object(), 2005-05-06) for the introduction of parse_blob_buffer().
>
> OK. Calling it parse_blob_buffer() is a little silly since it doesn't
> even take a buffer anymore. But I guess parse_blob() might imply that it
> actually loads the contents from disk to check them (which the other
> parse_foo() functions do), so that's not a good name.
>
> So this might be the least bad thing. Given that there are only two
> callers, just setting blob->object.parsed might not be unreasonable,
> either. But I don't think it's worth spending too much time on.
>
>> @@ -266,7 +266,7 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
>>  			error(_("hash mismatch %s"), oid_to_hex(oid));
>>  			return NULL;
>>  		}
>> -		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
>> +		parse_blob_buffer(lookup_blob(r, oid));
>>  		return lookup_object(r, oid);
>
> Not new in your patch, but I wondered if this could cause a segfault
> when lookup_blob() returns NULL. I _think_ the answer is "no". We'd hit
> this code path when either:
>
>   - lookup_object() returns an object with type OBJ_BLOB, in which case
>     lookup_blob() would return that same object
>
>   - lookup_object() returned NULL, in which case lookup_blob() will call
>     it again, get NULL again, and then auto-create the blob and return
>     it
>
> So I think it is OK. But there are a bunch of duplicate hash lookups in
> this code. It would be clearer and more efficient as:
>
> diff --git a/object.c b/object.c
> index 2c32691dc4..2dfa038f13 100644
> --- a/object.c
> +++ b/object.c
> @@ -262,12 +262,14 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
>  	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
>  	    (!obj && repo_has_object_file(r, oid) &&
>  	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
> +		if (!obj)
> +			obj = create_object(r, oid, alloc_blob_node(r));
>  		if (check_object_signature(r, repl, NULL, 0, NULL) < 0) {
>  			error(_("hash mismatch %s"), oid_to_hex(oid));
>  			return NULL;
>  		}
> -		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
> -		return lookup_object(r, oid);
> +		parse_blob_buffer(obj, NULL, 0);
> +		return obj;
>  	}
>  
>  	buffer = repo_read_object_file(r, oid, &type, &size);
>
> but I doubt the efficiency matters much in practice. Those hash lookups
> will be lost in the noise of computing the hash of the blob contents.

I was trying to keep the changes smaller, but what about just doing this?:

diff --git a/blob.c b/blob.c
index 182718aba9..69293e7d8e 100644
--- a/blob.c
+++ b/blob.c
@@ -5,16 +5,16 @@
 
 const char *blob_type = "blob";
 
+struct blob *create_blob(struct repository *r, const struct object_id *oid)
+{
+	return create_object(r, oid, alloc_blob_node(r));
+}
+
 struct blob *lookup_blob(struct repository *r, const struct object_id *oid)
 {
 	struct object *obj = lookup_object(r, oid);
 	if (!obj)
-		return create_object(r, oid, alloc_blob_node(r));
-	return object_as_type(obj, OBJ_BLOB, 0);
-}
+		return create_blob(r, oid);
 
-int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size)
-{
-	item->object.parsed = 1;
-	return 0;
+	return object_as_type(obj, OBJ_BLOB, 0);
 }
diff --git a/blob.h b/blob.h
index 1664872055..ad34f0e9cc 100644
--- a/blob.h
+++ b/blob.h
@@ -9,10 +9,9 @@ struct blob {
 	struct object object;
 };
 
+struct blob *create_blob(struct repository *r, const struct object_id *oid);
 struct blob *lookup_blob(struct repository *r, const struct object_id *oid);
 
-int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size);
-
 /**
  * Blobs do not contain references to other objects and do not have
  * structured data that needs parsing. However, code may use the
diff --git a/object.c b/object.c
index 78343781ae..2699431404 100644
--- a/object.c
+++ b/object.c
@@ -195,8 +195,7 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id
 	if (type == OBJ_BLOB) {
 		struct blob *blob = lookup_blob(r, oid);
 		if (blob) {
-			if (parse_blob_buffer(blob, buffer, size))
-				return NULL;
+			blob->object.parsed = 1;
 			obj = &blob->object;
 		}
 	} else if (type == OBJ_TREE) {
@@ -262,12 +261,16 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
+		struct blob *blob;
+		if (!obj)
+			blob = create_blob(r, oid);
 		if (check_object_signature(r, repl, NULL, 0, NULL) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
-		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
-		return lookup_object(r, oid);
+		obj = &blob->object;
+		obj->parsed = 1;
+		return obj;
 	}
 
 	buffer = repo_read_object_file(r, oid, &type, &size);
Ævar Arnfjörð Bjarmason April 10, 2021, 1:01 p.m. UTC | #4
On Sat, Apr 10 2021, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Apr 09 2021, Jeff King wrote:
>
>> On Fri, Apr 09, 2021 at 10:07:27AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>>> As noted in the comment introduced in 837d395a5c0 (Replace
>>> parse_blob() with an explanatory comment, 2010-01-18) the old
>>> parse_blob() function and the current parse_blob_buffer() exist merely
>>> to provide consistency in the API.
>>> 
>>> We're not going to parse blobs like we "parse" commits, trees or
>>> tags. So let's not have the parse_blob_buffer() take arguments that
>>> pretends that we do. Its only use is to set the "parsed" flag.
>>> 
>>> See bd2c39f58f9 ([PATCH] don't load and decompress objects twice with
>>> parse_object(), 2005-05-06) for the introduction of parse_blob_buffer().
>>
>> OK. Calling it parse_blob_buffer() is a little silly since it doesn't
>> even take a buffer anymore. But I guess parse_blob() might imply that it
>> actually loads the contents from disk to check them (which the other
>> parse_foo() functions do), so that's not a good name.
>>
>> So this might be the least bad thing. Given that there are only two
>> callers, just setting blob->object.parsed might not be unreasonable,
>> either. But I don't think it's worth spending too much time on.
>>
>>> @@ -266,7 +266,7 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
>>>  			error(_("hash mismatch %s"), oid_to_hex(oid));
>>>  			return NULL;
>>>  		}
>>> -		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
>>> +		parse_blob_buffer(lookup_blob(r, oid));
>>>  		return lookup_object(r, oid);
>>
>> Not new in your patch, but I wondered if this could cause a segfault
>> when lookup_blob() returns NULL. I _think_ the answer is "no". We'd hit
>> this code path when either:
>>
>>   - lookup_object() returns an object with type OBJ_BLOB, in which case
>>     lookup_blob() would return that same object
>>
>>   - lookup_object() returned NULL, in which case lookup_blob() will call
>>     it again, get NULL again, and then auto-create the blob and return
>>     it
>>
>> So I think it is OK. But there are a bunch of duplicate hash lookups in
>> this code. It would be clearer and more efficient as:
>>
>> diff --git a/object.c b/object.c
>> index 2c32691dc4..2dfa038f13 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -262,12 +262,14 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
>>  	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
>>  	    (!obj && repo_has_object_file(r, oid) &&
>>  	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
>> +		if (!obj)
>> +			obj = create_object(r, oid, alloc_blob_node(r));
>>  		if (check_object_signature(r, repl, NULL, 0, NULL) < 0) {
>>  			error(_("hash mismatch %s"), oid_to_hex(oid));
>>  			return NULL;
>>  		}
>> -		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
>> -		return lookup_object(r, oid);
>> +		parse_blob_buffer(obj, NULL, 0);
>> +		return obj;
>>  	}
>>  
>>  	buffer = repo_read_object_file(r, oid, &type, &size);
>>
>> but I doubt the efficiency matters much in practice. Those hash lookups
>> will be lost in the noise of computing the hash of the blob contents.
>
> I was trying to keep the changes smaller, but what about just doing this?:

Sent a bit too soon...:

> diff --git a/blob.c b/blob.c
> index 182718aba9..69293e7d8e 100644
> --- a/blob.c
> +++ b/blob.c
> @@ -5,16 +5,16 @@
>  
>  const char *blob_type = "blob";
>  
> +struct blob *create_blob(struct repository *r, const struct object_id *oid)
> +{
> +	return create_object(r, oid, alloc_blob_node(r));
> +}
> +
>  struct blob *lookup_blob(struct repository *r, const struct object_id *oid)
>  {
>  	struct object *obj = lookup_object(r, oid);
>  	if (!obj)
> -		return create_object(r, oid, alloc_blob_node(r));
> -	return object_as_type(obj, OBJ_BLOB, 0);
> -}
> +		return create_blob(r, oid);
>  
> -int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size)
> -{
> -	item->object.parsed = 1;
> -	return 0;
> +	return object_as_type(obj, OBJ_BLOB, 0);
>  }
> diff --git a/blob.h b/blob.h
> index 1664872055..ad34f0e9cc 100644
> --- a/blob.h
> +++ b/blob.h
> @@ -9,10 +9,9 @@ struct blob {
>  	struct object object;
>  };
>  
> +struct blob *create_blob(struct repository *r, const struct object_id *oid);
>  struct blob *lookup_blob(struct repository *r, const struct object_id *oid);
>  
> -int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size);
> -
>  /**
>   * Blobs do not contain references to other objects and do not have
>   * structured data that needs parsing. However, code may use the
> diff --git a/object.c b/object.c
> index 78343781ae..2699431404 100644
> --- a/object.c
> +++ b/object.c
> @@ -195,8 +195,7 @@ struct object *parse_object_buffer(struct repository *r, const struct object_id
>  	if (type == OBJ_BLOB) {
>  		struct blob *blob = lookup_blob(r, oid);
>  		if (blob) {
> -			if (parse_blob_buffer(blob, buffer, size))
> -				return NULL;
> +			blob->object.parsed = 1;
>  			obj = &blob->object;
>  		}
>  	} else if (type == OBJ_TREE) {
> @@ -262,12 +261,16 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
>  	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
>  	    (!obj && repo_has_object_file(r, oid) &&
>  	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
> +		struct blob *blob;
> +		if (!obj)
> +			blob = create_blob(r, oid);
>
>  		if (check_object_signature(r, repl, NULL, 0, NULL) < 0) {
>  			error(_("hash mismatch %s"), oid_to_hex(oid));
>  			return NULL;
>  		}
> -		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
> -		return lookup_object(r, oid);
> +		obj = &blob->object;
> +		obj->parsed = 1;
> +		return obj;
>  	}
>  
>  	buffer = repo_read_object_file(r, oid, &type, &size);

Well, aside from this segfault think-o introduced while
experimenting. Needs to be:

@@ -262,12 +261,16 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
+		if (!obj) {
+			struct blob *blob = create_blob(r, oid);
+			obj = &blob->object;
+		}
 		if (check_object_signature(r, repl, NULL, 0, NULL) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
-		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
-		return lookup_object(r, oid);
+		obj->parsed = 1;
+		return obj;
 	}
 
 	buffer = repo_read_object_file(r, oid, &type, &size);
Jeff King April 13, 2021, 8:25 a.m. UTC | #5
On Sat, Apr 10, 2021 at 02:57:12PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Not new in your patch, but I wondered if this could cause a segfault
> > when lookup_blob() returns NULL. I _think_ the answer is "no". We'd hit
> > this code path when either:
> [...]
> I was trying to keep the changes smaller, but what about just doing this?:
> [...]

Yeah, that seems pretty reasonable to me. It cleans up the extra lookups
in parse_object() and gets rid of the funny-named "parse_blob_buffer()"
that takes no buffer.

-Peff
diff mbox series

Patch

diff --git a/blob.c b/blob.c
index 182718aba9..389a7546dc 100644
--- a/blob.c
+++ b/blob.c
@@ -13,7 +13,7 @@  struct blob *lookup_blob(struct repository *r, const struct object_id *oid)
 	return object_as_type(obj, OBJ_BLOB, 0);
 }
 
-int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size)
+int parse_blob_buffer(struct blob *item)
 {
 	item->object.parsed = 1;
 	return 0;
diff --git a/blob.h b/blob.h
index 1664872055..ac1d4804a5 100644
--- a/blob.h
+++ b/blob.h
@@ -11,8 +11,6 @@  struct blob {
 
 struct blob *lookup_blob(struct repository *r, const struct object_id *oid);
 
-int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size);
-
 /**
  * Blobs do not contain references to other objects and do not have
  * structured data that needs parsing. However, code may use the
@@ -21,5 +19,6 @@  int parse_blob_buffer(struct blob *item, void *buffer, unsigned long size);
  * parse_blob_buffer() is used (by object.c) to flag that the object
  * has been read successfully from the database.
  **/
+int parse_blob_buffer(struct blob *item);
 
 #endif /* BLOB_H */
diff --git a/object.c b/object.c
index 78343781ae..63896abf01 100644
--- a/object.c
+++ b/object.c
@@ -195,7 +195,7 @@  struct object *parse_object_buffer(struct repository *r, const struct object_id
 	if (type == OBJ_BLOB) {
 		struct blob *blob = lookup_blob(r, oid);
 		if (blob) {
-			if (parse_blob_buffer(blob, buffer, size))
+			if (parse_blob_buffer(blob))
 				return NULL;
 			obj = &blob->object;
 		}
@@ -266,7 +266,7 @@  struct object *parse_object(struct repository *r, const struct object_id *oid)
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
-		parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
+		parse_blob_buffer(lookup_blob(r, oid));
 		return lookup_object(r, oid);
 	}