diff mbox series

[v3,3/4] revision: avoid loading object headers multiple times

Message ID b9897e102afbcab3bfee58ed8bda24257d8b54fb.1627896460.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series Speed up connectivity checks | expand

Commit Message

Patrick Steinhardt Aug. 2, 2021, 9:38 a.m. UTC
When loading references, we try to optimize loading of commits by using
the commit graph. To do so, we first need to determine whether the
object actually is a commit or not, which is why we always execute
`oid_object_info()` first. Like this, we'll unpack the object header of
each object first.

This pattern can be quite inefficient in case many references point to
the same commit: if the object didn't end up in the cached objects, then
we'll repeatedly unpack the same object header, even if we've already
seen the object before.

Optimize this pattern by using `lookup_unknown_object()` first in order
to determine whether we've seen the object before. If so, then we don't
need to re-parse the header but can directly use its object information
and thus gain a modest performance improvement. Executed in a real-world
repository with around 2.2 million references:

    Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.771 s ±  0.238 s    [User: 4.440 s, System: 0.330 s]
      Range (min … max):    4.539 s …  5.219 s    10 runs

    Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.454 s ±  0.037 s    [User: 4.122 s, System: 0.332 s]
      Range (min … max):    4.375 s …  4.496 s    10 runs

    Summary
      'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
        1.07 ± 0.05 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

The downside is that `lookup_unknown_object()` is forced to always
allocate an object such that it's big enough to host all object types'
structs and thus we may waste memory here. This tradeoff is probably
worth it though considering the following struct sizes:

    - commit: 72 bytes
    - tree: 56 bytes
    - blob: 40 bytes
    - tag: 64 bytes

Assuming that in almost all repositories, most references will point to
either a tag or a commit, we'd have a modest increase in memory
consumption of about 12.5% here.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 revision.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Aug. 2, 2021, 12:55 p.m. UTC | #1
On Mon, Aug 02 2021, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> When loading references, we try to optimize loading of commits by using
> the commit graph. To do so, we first need to determine whether the
> object actually is a commit or not, which is why we always execute
> `oid_object_info()` first. Like this, we'll unpack the object header of
> each object first.
>
> This pattern can be quite inefficient in case many references point to
> the same commit: if the object didn't end up in the cached objects, then
> we'll repeatedly unpack the same object header, even if we've already
> seen the object before.
>
> Optimize this pattern by using `lookup_unknown_object()` first in order
> to determine whether we've seen the object before. If so, then we don't
> need to re-parse the header but can directly use its object information
> and thus gain a modest performance improvement. Executed in a real-world
> repository with around 2.2 million references:
>
>     Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
>       Time (mean ± σ):      4.771 s ±  0.238 s    [User: 4.440 s, System: 0.330 s]
>       Range (min … max):    4.539 s …  5.219 s    10 runs
>
>     Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
>       Time (mean ± σ):      4.454 s ±  0.037 s    [User: 4.122 s, System: 0.332 s]
>       Range (min … max):    4.375 s …  4.496 s    10 runs
>
>     Summary
>       'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
>         1.07 ± 0.05 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'
>
> The downside is that `lookup_unknown_object()` is forced to always
> allocate an object such that it's big enough to host all object types'
> structs and thus we may waste memory here. This tradeoff is probably
> worth it though considering the following struct sizes:
>
>     - commit: 72 bytes
>     - tree: 56 bytes
>     - blob: 40 bytes
>     - tag: 64 bytes
>
> Assuming that in almost all repositories, most references will point to
> either a tag or a commit, we'd have a modest increase in memory
> consumption of about 12.5% here.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  revision.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index f06a5d63a3..671b6d6513 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -359,14 +359,22 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>  				    const struct object_id *oid,
>  				    unsigned int flags)
>  {
> -	struct object *object;
> +	struct object *object = lookup_unknown_object(revs->repo, oid);
> +
> +	if (object->type == OBJ_NONE) {
> +		int type = oid_object_info(revs->repo, oid, NULL);
> +		if (type < 0 || !object_as_type(object, type, 1)) {

Let's s/int type/enum object_type, personally I think we should never do
"type < 0" either, and check OBJ_BAD explicitly, but I've seemingly lost
that discussion on-list before.

But I think the consensus is that we should not do !type, but rather
type == OBJ_NONE.
Junio C Hamano Aug. 2, 2021, 7:40 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> When loading references, we try to optimize loading of commits by using
> the commit graph. To do so, we first need to determine whether the
> object actually is a commit or not, which is why we always execute
> `oid_object_info()` first. Like this, we'll unpack the object header of
> each object first.
>
> This pattern can be quite inefficient in case many references point to
> the same commit: if the object didn't end up in the cached objects, then
> we'll repeatedly unpack the same object header, even if we've already
> seen the object before.
> ...
> Assuming that in almost all repositories, most references will point to
> either a tag or a commit, we'd have a modest increase in memory
> consumption of about 12.5% here.

I wonder if we can also say almost all repositories, the majority of
refs point at the same object.  If that holds, this would certainly
be a win, but otherwise, it is not so clear.
Patrick Steinhardt Aug. 3, 2021, 9:07 a.m. UTC | #3
On Mon, Aug 02, 2021 at 12:40:56PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > When loading references, we try to optimize loading of commits by using
> > the commit graph. To do so, we first need to determine whether the
> > object actually is a commit or not, which is why we always execute
> > `oid_object_info()` first. Like this, we'll unpack the object header of
> > each object first.
> >
> > This pattern can be quite inefficient in case many references point to
> > the same commit: if the object didn't end up in the cached objects, then
> > we'll repeatedly unpack the same object header, even if we've already
> > seen the object before.
> > ...
> > Assuming that in almost all repositories, most references will point to
> > either a tag or a commit, we'd have a modest increase in memory
> > consumption of about 12.5% here.
> 
> I wonder if we can also say almost all repositories, the majority of
> refs point at the same object.  If that holds, this would certainly
> be a win, but otherwise, it is not so clear.

I doubt that's the case in general. I rather assume that it's typically
going to be a smallish subset that points to the same commit, but for
these cases we at least avoid doing the lookup multiple times. As I
said, it's definitely a tradeoff between memory and performance: in the
worst case (all references point to different blobs) we allocate 33%
more memory without having any speedups. A more realistic scenario would
probably be something like a trunk-based development repo, where there's
a single branch only and the rest is tags. There we'd allocate 11% more
memory without any speedups. In general, it's going to be various shades
of gray, where we allocate something from 0% to 11% more memory while
getting some modest speedups in some cases.

So if we only inspect this commit as a standalone it's definitely
debatable whether we'd want to take it or not. But one important thing
is that it's a prerequisite for patch 4/4: in order to not parse commits
in case they're part of the commit-graph, we need to first obtain an
object such that we can fill it in via the graph. So we have to call
`lookup_unknown_object()` anyway. Might be sensible to document this as
part of the commit message.

Patrick
Patrick Steinhardt Aug. 5, 2021, 10:12 a.m. UTC | #4
On Mon, Aug 02, 2021 at 02:55:09PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Mon, Aug 02 2021, Patrick Steinhardt wrote:
[snip]
> > diff --git a/revision.c b/revision.c
> > index f06a5d63a3..671b6d6513 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -359,14 +359,22 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
> >  				    const struct object_id *oid,
> >  				    unsigned int flags)
> >  {
> > -	struct object *object;
> > +	struct object *object = lookup_unknown_object(revs->repo, oid);
> > +
> > +	if (object->type == OBJ_NONE) {
> > +		int type = oid_object_info(revs->repo, oid, NULL);
> > +		if (type < 0 || !object_as_type(object, type, 1)) {
> 
> Let's s/int type/enum object_type, personally I think we should never do
> "type < 0" either, and check OBJ_BAD explicitly, but I've seemingly lost
> that discussion on-list before.
> 
> But I think the consensus is that we should not do !type, but rather
> type == OBJ_NONE.

`oid_object_info()` does return an `int` though, so it feels kind of
weird to me to stuff it into an enum right away. Furthermore, while
`OBJ_BAD` does map to `-1`, the documentation of `oid_object_info()`
currently asks for what I'm doing: """returns enum object_type or
negative""".

Patrick
Patrick Steinhardt Aug. 6, 2021, 2:17 p.m. UTC | #5
On Tue, Aug 03, 2021 at 11:07:29AM +0200, Patrick Steinhardt wrote:
> On Mon, Aug 02, 2021 at 12:40:56PM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> > > When loading references, we try to optimize loading of commits by using
> > > the commit graph. To do so, we first need to determine whether the
> > > object actually is a commit or not, which is why we always execute
> > > `oid_object_info()` first. Like this, we'll unpack the object header of
> > > each object first.
> > >
> > > This pattern can be quite inefficient in case many references point to
> > > the same commit: if the object didn't end up in the cached objects, then
> > > we'll repeatedly unpack the same object header, even if we've already
> > > seen the object before.
> > > ...
> > > Assuming that in almost all repositories, most references will point to
> > > either a tag or a commit, we'd have a modest increase in memory
> > > consumption of about 12.5% here.
> > 
> > I wonder if we can also say almost all repositories, the majority of
> > refs point at the same object.  If that holds, this would certainly
> > be a win, but otherwise, it is not so clear.
> 
> I doubt that's the case in general. I rather assume that it's typically
> going to be a smallish subset that points to the same commit, but for
> these cases we at least avoid doing the lookup multiple times. As I
> said, it's definitely a tradeoff between memory and performance: in the
> worst case (all references point to different blobs) we allocate 33%
> more memory without having any speedups. A more realistic scenario would
> probably be something like a trunk-based development repo, where there's
> a single branch only and the rest is tags. There we'd allocate 11% more
> memory without any speedups. In general, it's going to be various shades
> of gray, where we allocate something from 0% to 11% more memory while
> getting some modest speedups in some cases.
> 
> So if we only inspect this commit as a standalone it's definitely
> debatable whether we'd want to take it or not. But one important thing
> is that it's a prerequisite for patch 4/4: in order to not parse commits
> in case they're part of the commit-graph, we need to first obtain an
> object such that we can fill it in via the graph. So we have to call
> `lookup_unknown_object()` anyway. Might be sensible to document this as
> part of the commit message.

Scratch that: we can just rewrite this to use `lookup_object()` to check
whether we already know the object, and then we can rewrite the
`parse_commit_in_graph_gently()` to be `lookup_commit_in_graph()`, which
does a `lookup_commit()` to create the object in case the OID was found
in the graph. That's also got nicer calling semantics..

It's negligibly slower (3.021s instead of 2.983s), but it doesn't need
me arguing about the performance/memory tradeoff. Which is a good thing
I guess: there will always be that one repo that's completely different
and where such assumptions don't hold.

Patrick
diff mbox series

Patch

diff --git a/revision.c b/revision.c
index f06a5d63a3..671b6d6513 100644
--- a/revision.c
+++ b/revision.c
@@ -359,14 +359,22 @@  static struct object *get_reference(struct rev_info *revs, const char *name,
 				    const struct object_id *oid,
 				    unsigned int flags)
 {
-	struct object *object;
+	struct object *object = lookup_unknown_object(revs->repo, oid);
+
+	if (object->type == OBJ_NONE) {
+		int type = oid_object_info(revs->repo, oid, NULL);
+		if (type < 0 || !object_as_type(object, type, 1)) {
+			object = NULL;
+			goto out;
+		}
+	}
 
 	/*
 	 * If the repository has commit graphs, repo_parse_commit() avoids
 	 * reading the object buffer, so use it whenever possible.
 	 */
-	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
-		struct commit *c = lookup_commit(revs->repo, oid);
+	if (object->type == OBJ_COMMIT) {
+		struct commit *c = (struct commit *) object;
 		if (!repo_parse_commit(revs->repo, c))
 			object = (struct object *) c;
 		else
@@ -375,6 +383,7 @@  static struct object *get_reference(struct rev_info *revs, const char *name,
 		object = parse_object(revs->repo, oid);
 	}
 
+out:
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;