[v2,02/10] packfile: allow content-limit for cat-file

Message ID	20240823224630.1180772-3-e@80x24.org (mailing list archive)
State	New
Headers	show Received: from dcvr.yhbt.net (dcvr.yhbt.net [173.255.242.215]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4DB091494D6 for <git@vger.kernel.org>; Fri, 23 Aug 2024 22:46:52 +0000 (UTC) From: Eric Wong <e@80x24.org> To: git@vger.kernel.org Cc: Jeff King <peff@peff.net>, Patrick Steinhardt <ps@pks.im> Subject: [PATCH v2 02/10] packfile: allow content-limit for cat-file Date: Fri, 23 Aug 2024 22:46:22 +0000 Message-ID: <20240823224630.1180772-3-e@80x24.org> In-Reply-To: <20240823224630.1180772-1-e@80x24.org> References: <20240823224630.1180772-1-e@80x24.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	cat-file speedups \| expand [v2,00/10] cat-file speedups [v2,01/10] packfile: move sizep computation [v2,02/10] packfile: allow content-limit for cat-file [v2,03/10] packfile: fix off-by-one in content_limit comparison [v2,04/10] packfile: inline cache_or_unpack_entry [v2,05/10] cat-file: use delta_base_cache entries directly [v2,06/10] packfile: packed_object_info avoids packed_to_object_type [v2,07/10] object_info: content_limit only applies to blobs [v2,08/10] cat-file: batch-command uses content_limit [v2,09/10] cat-file: batch_write: use size_t for length [v2,10/10] cat-file: use writev(2) if available

Message ID

20240823224630.1180772-3-e@80x24.org (mailing list archive)

State

New

Headers

From: Eric Wong <e@80x24.org>
To: git@vger.kernel.org
Cc: Jeff King <peff@peff.net>,
	Patrick Steinhardt <ps@pks.im>
Subject: [PATCH v2 02/10] packfile: allow content-limit for cat-file
Date: Fri, 23 Aug 2024 22:46:22 +0000
Message-ID: <20240823224630.1180772-3-e@80x24.org>
In-Reply-To: <20240823224630.1180772-1-e@80x24.org>
References: <20240823224630.1180772-1-e@80x24.org>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

cat-file speedups | expand

Commit Message

Eric Wong Aug. 23, 2024, 10:46 p.m. UTC

From: Jeff King <peff@peff.net>

Avoid unnecessary round trips to the object store to speed
up cat-file contents retrievals.  The majority of packed objects
don't benefit from the streaming interface at all and we end up
having to load them in core anyways to satisfy our streaming
API.

This drops the runtime of
`git cat-file --batch-all-objects --unordered --batch' on
git.git from ~7.1s to ~6.1s on Jeff's machine.

[ew: commit message]

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Eric Wong <e@80x24.org>
---
 builtin/cat-file.c | 17 +++++++++++++++--
 object-file.c      |  6 ++++++
 object-store-ll.h  |  1 +
 packfile.c         | 13 ++++++++++++-
 4 files changed, 34 insertions(+), 3 deletions(-)

Comments

Junio C Hamano Aug. 26, 2024, 5:10 p.m. UTC | #1

Eric Wong <e@80x24.org> writes:

> From: Jeff King <peff@peff.net>
>
> Avoid unnecessary round trips to the object store to speed
> up cat-file contents retrievals.  The majority of packed objects
> don't benefit from the streaming interface at all and we end up
> having to load them in core anyways to satisfy our streaming
> API.

What I found missing from the description is something like ...

    The new trick used is to teach oid_object_info_extended() that a
    non-NULL oi->contentp that means "grab the contents of the objects
    here" can be told to refrain from grabbing an object that is too
    large.

> diff --git a/object-file.c b/object-file.c
> index 065103be3e..1cc29c3c58 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1492,6 +1492,12 @@ static int loose_object_info(struct repository *r,
>  
>  		if (!oi->contentp)
>  			break;
> +		if (oi->content_limit && *oi->sizep > oi->content_limit) {

I cannot convince myself enough to say "content limit" is a great
name.  It invites "limited by what?  text files are allowed but
images are not?".

> diff --git a/object-store-ll.h b/object-store-ll.h
> index c5f2bb2fc2..b71a15f590 100644
> --- a/object-store-ll.h
> +++ b/object-store-ll.h
> @@ -289,6 +289,7 @@ struct object_info {
>  	struct object_id *delta_base_oid;
>  	struct strbuf *type_name;
>  	void **contentp;
> +	size_t content_limit;
>  
>  	/* Response */
>  	enum {
> diff --git a/packfile.c b/packfile.c
> index 4028763947..c12a0515b3 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1529,7 +1529,7 @@ int packed_object_info(struct repository *r, struct packed_git *p,
>  	 * We always get the representation type, but only convert it to
>  	 * a "real" type later if the caller is interested.
>  	 */
> -	if (oi->contentp) {
> +	if (oi->contentp && !oi->content_limit) {
>  		*oi->contentp = cache_or_unpack_entry(r, p, obj_offset, oi->sizep,
>  						      &type);
>  		if (!*oi->contentp)
> @@ -1555,6 +1555,17 @@ int packed_object_info(struct repository *r, struct packed_git *p,
>  				*oi->sizep = size;
>  			}
>  		}
> +
> +		if (oi->contentp) {
> +			if (oi->sizep && *oi->sizep < oi->content_limit) {

It happens that with the current code structure, at this point,
oi->content_limit is _always_ non-zero.  But it felt somewhat
fragile to rely on it, and I would have appreciated if this was
written with an explicit check for oi->content_limit, just like how
it is done in loose_object_info() function.

Other than that, looking very good.

Thanks.


> +				*oi->contentp = cache_or_unpack_entry(r, p, obj_offset,
> +								      oi->sizep, &type);
> +				if (!*oi->contentp)
> +					type = OBJ_BAD;
> +			} else {
> +				*oi->contentp = NULL;
> +			}
> +		}
>  	}
>  
>  	if (oi->disk_sizep) {

Eric Wong Aug. 27, 2024, 8:23 p.m. UTC | #2

Junio C Hamano <gitster@pobox.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > From: Jeff King <peff@peff.net>
> >
> > Avoid unnecessary round trips to the object store to speed
> > up cat-file contents retrievals.  The majority of packed objects
> > don't benefit from the streaming interface at all and we end up
> > having to load them in core anyways to satisfy our streaming
> > API.
> 
> What I found missing from the description is something like ...
> 
>     The new trick used is to teach oid_object_info_extended() that a
>     non-NULL oi->contentp that means "grab the contents of the objects
>     here" can be told to refrain from grabbing an object that is too
>     large.

OK.

> > diff --git a/object-file.c b/object-file.c
> > index 065103be3e..1cc29c3c58 100644
> > --- a/object-file.c
> > +++ b/object-file.c
> > @@ -1492,6 +1492,12 @@ static int loose_object_info(struct repository *r,
> >  
> >  		if (!oi->contentp)
> >  			break;
> > +		if (oi->content_limit && *oi->sizep > oi->content_limit) {
> 
> I cannot convince myself enough to say "content limit" is a great
> name.  It invites "limited by what?  text files are allowed but
> images are not?".

Hmm... naming is a most difficult problem :<

->slurp_max?  It could be ->content_slurp_max, but I think
that's too long...

Would welcome other suggestions...

> > diff --git a/object-store-ll.h b/object-store-ll.h
> > index c5f2bb2fc2..b71a15f590 100644
> > --- a/object-store-ll.h
> > +++ b/object-store-ll.h
> > @@ -289,6 +289,7 @@ struct object_info {
> >  	struct object_id *delta_base_oid;
> >  	struct strbuf *type_name;
> >  	void **contentp;
> > +	size_t content_limit;
> >  
> >  	/* Response */
> >  	enum {
> > diff --git a/packfile.c b/packfile.c
> > index 4028763947..c12a0515b3 100644
> > --- a/packfile.c
> > +++ b/packfile.c
> > @@ -1529,7 +1529,7 @@ int packed_object_info(struct repository *r, struct packed_git *p,
> >  	 * We always get the representation type, but only convert it to
> >  	 * a "real" type later if the caller is interested.
> >  	 */
> > -	if (oi->contentp) {
> > +	if (oi->contentp && !oi->content_limit) {
> >  		*oi->contentp = cache_or_unpack_entry(r, p, obj_offset, oi->sizep,
> >  						      &type);
> >  		if (!*oi->contentp)
> > @@ -1555,6 +1555,17 @@ int packed_object_info(struct repository *r, struct packed_git *p,
> >  				*oi->sizep = size;
> >  			}
> >  		}
> > +
> > +		if (oi->contentp) {
> > +			if (oi->sizep && *oi->sizep < oi->content_limit) {
> 
> It happens that with the current code structure, at this point,
> oi->content_limit is _always_ non-zero.  But it felt somewhat
> fragile to rely on it, and I would have appreciated if this was
> written with an explicit check for oi->content_limit, just like how
> it is done in loose_object_info() function.

Right.  I actually think something like:

		assert(oi->content_limit); /* see `if' above */
		if (oi->sizep && *oi->sizep < oi->content_limit) {

is good for documentation purposes since this is in the `else'
branch of the `if (oi->contentp && !oi->content_limit) {' condition.

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 18fe58d6b8..bc4bb89610 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -280,6 +280,7 @@  struct expand_data {
 	off_t disk_size;
 	const char *rest;
 	struct object_id delta_base_oid;
+	void *content;
 
 	/*
 	 * If mark_query is true, we do not expand anything, but rather
@@ -383,7 +384,10 @@  static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 
 	assert(data->info.typep);
 
-	if (data->type == OBJ_BLOB) {
+	if (data->content) {
+		batch_write(opt, data->content, data->size);
+		FREE_AND_NULL(data->content);
+	} else if (data->type == OBJ_BLOB) {
 		if (opt->buffer_output)
 			fflush(stdout);
 		if (opt->transform_mode) {
@@ -801,9 +805,18 @@  static int batch_objects(struct batch_options *opt)
 	/*
 	 * If we are printing out the object, then always fill in the type,
 	 * since we will want to decide whether or not to stream.
+	 *
+	 * Likewise, grab the content in the initial request if it's small
+	 * and we're not planning to filter it.
 	 */
-	if (opt->batch_mode == BATCH_MODE_CONTENTS)
+	if (opt->batch_mode == BATCH_MODE_CONTENTS) {
 		data.info.typep = &data.type;
+		if (!opt->transform_mode) {
+			data.info.sizep = &data.size;
+			data.info.contentp = &data.content;
+			data.info.content_limit = big_file_threshold;
+		}
+	}
 
 	if (opt->all_objects) {
 		struct object_cb_data cb;
diff --git a/object-file.c b/object-file.c
index 065103be3e..1cc29c3c58 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1492,6 +1492,12 @@  static int loose_object_info(struct repository *r,
 
 		if (!oi->contentp)
 			break;
+		if (oi->content_limit && *oi->sizep > oi->content_limit) {
+			git_inflate_end(&stream);
+			oi->contentp = NULL;
+			goto cleanup;
+		}
+
 		*oi->contentp = unpack_loose_rest(&stream, hdr, *oi->sizep, oid);
 		if (*oi->contentp)
 			goto cleanup;
diff --git a/object-store-ll.h b/object-store-ll.h
index c5f2bb2fc2..b71a15f590 100644
--- a/object-store-ll.h
+++ b/object-store-ll.h
@@ -289,6 +289,7 @@  struct object_info {
 	struct object_id *delta_base_oid;
 	struct strbuf *type_name;
 	void **contentp;
+	size_t content_limit;
 
 	/* Response */
 	enum {
diff --git a/packfile.c b/packfile.c
index 4028763947..c12a0515b3 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1529,7 +1529,7 @@  int packed_object_info(struct repository *r, struct packed_git *p,
 	 * We always get the representation type, but only convert it to
 	 * a "real" type later if the caller is interested.
 	 */
-	if (oi->contentp) {
+	if (oi->contentp && !oi->content_limit) {
 		*oi->contentp = cache_or_unpack_entry(r, p, obj_offset, oi->sizep,
 						      &type);
 		if (!*oi->contentp)
@@ -1555,6 +1555,17 @@  int packed_object_info(struct repository *r, struct packed_git *p,
 				*oi->sizep = size;
 			}
 		}
+
+		if (oi->contentp) {
+			if (oi->sizep && *oi->sizep < oi->content_limit) {
+				*oi->contentp = cache_or_unpack_entry(r, p, obj_offset,
+								      oi->sizep, &type);
+				if (!*oi->contentp)
+					type = OBJ_BAD;
+			} else {
+				*oi->contentp = NULL;
+			}
+		}
 	}
 
 	if (oi->disk_sizep) {

[v2,02/10] packfile: allow content-limit for cat-file

Commit Message

Comments

Patch