diff mbox series

[3/3] object-store: use one oid_array per subdirectory for loose cache

Message ID c8dd851f-0a18-848f-8e58-cc0ee5f8e705@web.de (mailing list archive)
State New, archived
Headers show
Series [1/3] object-store: factor out odb_loose_cache() | expand

Commit Message

René Scharfe Jan. 6, 2019, 4:45 p.m. UTC
The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add
operation.  So when querying a wide range of objects, the partially
filled array needs to be resorted up to 255 times, which takes over 100
times longer than sorting once.

Use one oid_array for each subdirectory.  This ensures that entries have
to only be sorted a single time.  It also avoids eight binary search
steps for each cache lookup as a small bonus.

The cache is used for collision checks for the log placeholders %h, %t
and %p, and we can see the change speeding them up in a repository with
ca. 100 objects per subdirectory:

$ git count-objects
26733 objects, 68808 kilobytes

Test                        HEAD^             HEAD
--------------------------------------------------------------------
4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%

Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
---
 object-store.h | 2 +-
 sha1-file.c    | 9 ++++++---
 2 files changed, 7 insertions(+), 4 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Jan. 6, 2019, 8:38 p.m. UTC | #1
On Sun, Jan 06 2019, René Scharfe wrote:

Thanks. I haven't done my own performance testing but at a glance this
looks good.

> The cache is used for collision checks for the log placeholders %h, %t
> and %p, and we can see the change speeding them up in a repository with
> ca. 100 objects per subdirectory:
>
> $ git count-objects
> 26733 objects, 68808 kilobytes
>
> Test                        HEAD^             HEAD
> --------------------------------------------------------------------
> 4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
> 4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
> 4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
> 4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
> 4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
> 4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
> 4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%

Can you elaborate on the test setup required to get to the point where
you got these numbers for subsequent comparison, i.e. how you generated
the approx 100 objects per dir, what OS/version & storage type etc.
René Scharfe Jan. 6, 2019, 10:58 p.m. UTC | #2
Am 06.01.2019 um 21:38 schrieb Ævar Arnfjörð Bjarmason:
>> $ git count-objects
>> 26733 objects, 68808 kilobytes
>>
>> Test                        HEAD^             HEAD
>> --------------------------------------------------------------------
>> 4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
>> 4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
>> 4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
>> 4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
>> 4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
>> 4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
>> 4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%
> 
> Can you elaborate on the test setup required to get to the point where
> you got these numbers for subsequent comparison, i.e. how you generated
> the approx 100 objects per dir, what OS/version & storage type etc.

I happened to have that many loose objects lying around.  Numbers are
for Debian Testing on a Hyper-V VM on Windows 10 1893 on an SSD.

You could fake object directory entries with something like this:

    for d in .git/objects/??
    do
        for i in $(seq 0 9)
        do
            >"$d/0000000000000000000000000000000000000$i"
        done
    done

René
diff mbox series

Patch

diff --git a/object-store.h b/object-store.h
index 709bf856b6..2fb6c0e4db 100644
--- a/object-store.h
+++ b/object-store.h
@@ -20,7 +20,7 @@  struct object_directory {
 	 * Be sure to call odb_load_loose_cache() before using.
 	 */
 	char loose_objects_subdir_seen[256];
-	struct oid_array loose_objects_cache;
+	struct oid_array loose_objects_cache[256];
 
 	/*
 	 * Path to the alternative object store. If this is a relative path,
diff --git a/sha1-file.c b/sha1-file.c
index 2f965b2688..c3c6e50704 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -2155,7 +2155,7 @@  struct oid_array *odb_loose_cache(struct object_directory *odb,
 {
 	int subdir_nr = oid->hash[0];
 	odb_load_loose_cache(odb, subdir_nr);
-	return &odb->loose_objects_cache;
+	return &odb->loose_objects_cache[subdir_nr];
 }
 
 void odb_load_loose_cache(struct object_directory *odb, int subdir_nr)
@@ -2173,14 +2173,17 @@  void odb_load_loose_cache(struct object_directory *odb, int subdir_nr)
 	for_each_file_in_obj_subdir(subdir_nr, &buf,
 				    append_loose_object,
 				    NULL, NULL,
-				    &odb->loose_objects_cache);
+				    &odb->loose_objects_cache[subdir_nr]);
 	odb->loose_objects_subdir_seen[subdir_nr] = 1;
 	strbuf_release(&buf);
 }
 
 void odb_clear_loose_cache(struct object_directory *odb)
 {
-	oid_array_clear(&odb->loose_objects_cache);
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(odb->loose_objects_cache); i++)
+		oid_array_clear(&odb->loose_objects_cache[i]);
 	memset(&odb->loose_objects_subdir_seen, 0,
 	       sizeof(odb->loose_objects_subdir_seen));
 }