cache-tree: use ce_namelen() instead of strlen()

Message ID	a8b5bb13-5c88-d744-b47e-7204e12ad05d@web.de (mailing list archive)
State	Accepted
Commit	0b72536a0b6128d2bfd05d633cd2228d7515b53d
Headers	show Return-Path: <git-owner@kernel.org> Subject: [PATCH] cache-tree: use ce_namelen() instead of strlen() To: Derrick Stolee <stolee@gmail.com>, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>, git@vger.kernel.org Cc: newren@gmail.com, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com>, Junio C Hamano <gitster@pobox.com> References: <pull.829.git.1609356413.gitgitgadget@gmail.com> <20ea7050324cdd78b0966f54366b26224dfc7814.1609356414.git.gitgitgadget@gmail.com> <daeb59ee-861d-9c8d-3601-6aef1ac3fc94@web.de> <3ed37ae6-1f09-bd6b-c9d9-c8089da1af92@gmail.com> From: =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de> Message-ID: <a8b5bb13-5c88-d744-b47e-7204e12ad05d@web.de> Date: Sat, 2 Jan 2021 16:19:11 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0 MIME-Version: 1.0 In-Reply-To: <3ed37ae6-1f09-bd6b-c9d9-c8089da1af92@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable Precedence: bulk
Series	cache-tree: use ce_namelen() instead of strlen() \| expand cache-tree: use ce_namelen() instead of strlen()

Message ID

a8b5bb13-5c88-d744-b47e-7204e12ad05d@web.de (mailing list archive)

State

Accepted

Commit

0b72536a0b6128d2bfd05d633cd2228d7515b53d

Headers

Subject: [PATCH] cache-tree: use ce_namelen() instead of strlen()
To: Derrick Stolee <stolee@gmail.com>,
        Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>,
        git@vger.kernel.org
Cc: newren@gmail.com, Derrick Stolee <derrickstolee@github.com>,
        Derrick Stolee <dstolee@microsoft.com>,
        Junio C Hamano <gitster@pobox.com>
References: <pull.829.git.1609356413.gitgitgadget@gmail.com>
 <20ea7050324cdd78b0966f54366b26224dfc7814.1609356414.git.gitgitgadget@gmail.com>
 <daeb59ee-861d-9c8d-3601-6aef1ac3fc94@web.de>
 <3ed37ae6-1f09-bd6b-c9d9-c8089da1af92@gmail.com>
From: =?utf-8?q?Ren=C3=A9_Scharfe?= <l.s.r@web.de>
Message-ID: <a8b5bb13-5c88-d744-b47e-7204e12ad05d@web.de>
Date: Sat, 2 Jan 2021 16:19:11 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.6.0
MIME-Version: 1.0
In-Reply-To: <3ed37ae6-1f09-bd6b-c9d9-c8089da1af92@gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

Series

cache-tree: use ce_namelen() instead of strlen() | expand

Commit Message

René Scharfe Jan. 2, 2021, 3:19 p.m. UTC

Use the name length field of cache entries instead of calculating its
value anew.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
Not sure why it took me so long to spot this.. o_O

 cache-tree.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--
2.30.0

Comments

Derrick Stolee Jan. 4, 2021, 1:26 a.m. UTC | #1

On 1/2/2021 10:19 AM, René Scharfe wrote:
> Use the name length field of cache entries instead of calculating its
> value anew.

Thanks! I didn't know about this macro.

-Stolee

> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
> Not sure why it took me so long to spot this.. o_O
> 
>  cache-tree.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/cache-tree.c b/cache-tree.c
> index a537a806c1..57cacab195 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -185,10 +185,12 @@ static int verify_cache(struct cache_entry **cache,
>  		 * the cache is sorted.  Also path can appear only once,
>  		 * which means conflicting one would immediately follow.
>  		 */
> -		const char *this_name = cache[i]->name;
> -		const char *next_name = cache[i+1]->name;
> -		int this_len = strlen(this_name);
> -		if (this_len < strlen(next_name) &&
> +		const struct cache_entry *this_ce = cache[i];
> +		const struct cache_entry *next_ce = cache[i + 1];
> +		const char *this_name = this_ce->name;
> +		const char *next_name = next_ce->name;
> +		int this_len = ce_namelen(this_ce);
> +		if (this_len < ce_namelen(next_ce) &&
>  		    strncmp(this_name, next_name, this_len) == 0 &&
>  		    next_name[this_len] == '/') {
>  			if (10 < ++funny) {

For what it's worth, this caching of the lengths lowers my test
time from 854ms to 833ms. It goes down again to 816ms with the
swapped conditional, so I'll include that patch here:

-- >8 --

From 8f303000030bd8f9db3466c90eb0d9fea11a7c3b Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Sun, 3 Jan 2021 20:20:05 -0500
Subject: [PATCH] cache-tree: speed up consecutive path comparisons
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous change reduced time spent in strlen() while comparing
consecutive paths in verify_cache(), but we can do better. The
conditional checks the existence of a directory separator at the correct
location, but only after doing a string comparison. Swap the order to be
logically equivalent but perform fewer string comparisons.

To test the effect on performance, I used a repository with over three
million paths in the index. I then ran the following command on repeat:

  git -c index.threads=1 commit --amend --allow-empty --no-edit

Here are the measurements over 10 runs after a 5-run warmup:

  Benchmark #1: v2.30.0
    Time (mean ± σ):     854.5 ms ±  18.2 ms
    Range (min … max):   825.0 ms … 892.8 ms

  Benchmark #2: Previous change
    Time (mean ± σ):     833.2 ms ±  10.3 ms
    Range (min … max):   815.8 ms … 849.7 ms

  Benchmark #3: This change
    Time (mean ± σ):     815.5 ms ±  18.1 ms
    Range (min … max):   795.4 ms … 849.5 ms

This change is 2% faster than the previous change and 5% faster than
v2.30.0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 57cacab195..6eaef21145 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -191,8 +191,8 @@ static int verify_cache(struct cache_entry **cache,
 		const char *next_name = next_ce->name;
 		int this_len = ce_namelen(this_ce);
 		if (this_len < ce_namelen(next_ce) &&
-		    strncmp(this_name, next_name, this_len) == 0 &&
-		    next_name[this_len] == '/') {
+		    next_name[this_len] == '/' &&
+		    strncmp(this_name, next_name, this_len) == 0) {
 			if (10 < ++funny) {
 				fprintf(stderr, "...\n");
 				break;

Junio C Hamano Jan. 5, 2021, 12:05 p.m. UTC | #2

René Scharfe <l.s.r@web.de> writes:

> Use the name length field of cache entries instead of calculating its
> value anew.
>
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
> Not sure why it took me so long to spot this.. o_O

That probably is because this does not cause behaviour change.

It used to be that use of ce_namelen() was more important in
learning the length of the string before 7a51ed66 (Make on-disk
index representation separate from in-core one, 2008-01-14) and
7fec10b7 (index: be careful when handling long names, 2008-01-18).
The on-disk field to store the name length has only 12 bits, and
before b60e188c (Strip namelen out of ce_flags into a ce_namelen
field, 2012-07-11), the convention to learn the length of the name
of an in-core cache entry was to see the field and then if it fully
occupies the full 12-bit field, ask the name string itself its
length with strlen().

These days, ce->namelen is a separate field that always gives the
full length after the on-disk index is read, so with this change,
you won't be running strlen() in this part of the function even with
an entry with a long pathname.

>
>  cache-tree.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index a537a806c1..57cacab195 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -185,10 +185,12 @@ static int verify_cache(struct cache_entry **cache,
>  		 * the cache is sorted.  Also path can appear only once,
>  		 * which means conflicting one would immediately follow.
>  		 */
> -		const char *this_name = cache[i]->name;
> -		const char *next_name = cache[i+1]->name;
> -		int this_len = strlen(this_name);
> -		if (this_len < strlen(next_name) &&
> +		const struct cache_entry *this_ce = cache[i];
> +		const struct cache_entry *next_ce = cache[i + 1];
> +		const char *this_name = this_ce->name;
> +		const char *next_name = next_ce->name;
> +		int this_len = ce_namelen(this_ce);
> +		if (this_len < ce_namelen(next_ce) &&
>  		    strncmp(this_name, next_name, this_len) == 0 &&
>  		    next_name[this_len] == '/') {
>  			if (10 < ++funny) {
> --
> 2.30.0

diff --git a/cache-tree.c b/cache-tree.c
index a537a806c1..57cacab195 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -185,10 +185,12 @@  static int verify_cache(struct cache_entry **cache,
 		 * the cache is sorted.  Also path can appear only once,
 		 * which means conflicting one would immediately follow.
 		 */
-		const char *this_name = cache[i]->name;
-		const char *next_name = cache[i+1]->name;
-		int this_len = strlen(this_name);
-		if (this_len < strlen(next_name) &&
+		const struct cache_entry *this_ce = cache[i];
+		const struct cache_entry *next_ce = cache[i + 1];
+		const char *this_name = this_ce->name;
+		const char *next_name = next_ce->name;
+		int this_len = ce_namelen(this_ce);
+		if (this_len < ce_namelen(next_ce) &&
 		    strncmp(this_name, next_name, this_len) == 0 &&
 		    next_name[this_len] == '/') {
 			if (10 < ++funny) {

cache-tree: use ce_namelen() instead of strlen()

Commit Message

Comments

Patch