[v5,3/5] read-cache: load cache entries on worker threads
diff mbox series

Message ID 20180912161832.55324-4-benpeart@microsoft.com
State New
Headers show
Series
  • [v5,1/5] eoie: add End of Index Entry (EOIE) extension
Related show

Commit Message

Ben Peart Sept. 12, 2018, 4:18 p.m. UTC
This patch helps address the CPU cost of loading the index by creating
multiple threads to divide the work of loading and converting the cache
entries across all available CPU cores.

It accomplishes this by having the primary thread loop across the index file
tracking the offset and (for V4 indexes) expanding the name. It creates a
thread to process each block of entries as it comes to them.

I used p0002-read-cache.sh to generate some performance data:

Test w/100,000 files                Baseline         Parallel entries
---------------------------------------------------------------------------
read_cache/discard_cache 1000 times 14.08(0.01+0.10) 9.72(0.03+0.06) -31.0%

Test w/1,000,000 files              Baseline         Parallel entries
------------------------------------------------------------------------------
read_cache/discard_cache 1000 times 202.95(0.01+0.07) 154.14(0.03+0.06) -24.1%

Signed-off-by: Ben Peart <Ben.Peart@microsoft.com>
---
 read-cache.c | 242 +++++++++++++++++++++++++++++++++++++++++++++------
 t/README     |   3 +
 2 files changed, 217 insertions(+), 28 deletions(-)

Comments

Duy Nguyen Sept. 15, 2018, 10:31 a.m. UTC | #1
On Wed, Sep 12, 2018 at 6:18 PM Ben Peart <benpeart@microsoft.com> wrote:
>
> This patch helps address the CPU cost of loading the index by creating
> multiple threads to divide the work of loading and converting the cache
> entries across all available CPU cores.
>
> It accomplishes this by having the primary thread loop across the index file
> tracking the offset and (for V4 indexes) expanding the name. It creates a
> thread to process each block of entries as it comes to them.
>
> I used p0002-read-cache.sh to generate some performance data:
>
> Test w/100,000 files                Baseline         Parallel entries
> ---------------------------------------------------------------------------
> read_cache/discard_cache 1000 times 14.08(0.01+0.10) 9.72(0.03+0.06) -31.0%
>
> Test w/1,000,000 files              Baseline         Parallel entries
> ------------------------------------------------------------------------------
> read_cache/discard_cache 1000 times 202.95(0.01+0.07) 154.14(0.03+0.06) -24.1%

The numbers here and the previous patch to load extensions in parallel
are exactly the same. What do these numbers mean? With both changes?
Duy Nguyen Sept. 15, 2018, 11:07 a.m. UTC | #2
On Wed, Sep 12, 2018 at 6:18 PM Ben Peart <benpeart@microsoft.com> wrote:
>
> This patch helps address the CPU cost of loading the index by creating
> multiple threads to divide the work of loading and converting the cache
> entries across all available CPU cores.
>
> It accomplishes this by having the primary thread loop across the index file
> tracking the offset and (for V4 indexes) expanding the name. It creates a
> thread to process each block of entries as it comes to them.

I added a couple trace_printf() to see how time is spent. This is with
a 1m entry index (basically my webkit.git index repeated 4 times)

12:50:00.084237 read-cache.c:1721       start loading index
12:50:00.119941 read-cache.c:1943       performance: 0.034778758 s:
loaded all extensions (1667075 bytes)
12:50:00.185352 read-cache.c:2029       performance: 0.100152079 s:
loaded 367110 entries
12:50:00.189683 read-cache.c:2126       performance: 0.104566615 s:
finished scanning all entries
12:50:00.217900 read-cache.c:2029       performance: 0.082309193 s:
loaded 367110 entries
12:50:00.259969 read-cache.c:2029       performance: 0.070257130 s:
loaded 367108 entries
12:50:00.263662 read-cache.c:2278       performance: 0.179344458 s:
read cache .git/index

Two observations:

- the extension thread finishes up quickly (this is with TREE
extension alone). We could use that spare core to parse some more
entries.

- the main "scanning and allocating" thread does hold up the two
remaining threads. You can see the first index entry thread is
finished even before the scanning thread. And this scanning thread
takes a lot of cpu.

If all index entry threads start at the same time, based on these
numbers we would be finished around 12:50:00.185352 mark, cutting
loading time by half.

Could you go back to your original solution? If you don't want to
spend more time on this, I offer to rewrite this patch.
Duy Nguyen Sept. 15, 2018, 11:09 a.m. UTC | #3
On Sat, Sep 15, 2018 at 01:07:46PM +0200, Duy Nguyen wrote:
> 12:50:00.084237 read-cache.c:1721       start loading index
> 12:50:00.119941 read-cache.c:1943       performance: 0.034778758 s: loaded all extensions (1667075 bytes)
> 12:50:00.185352 read-cache.c:2029       performance: 0.100152079 s: loaded 367110 entries
> 12:50:00.189683 read-cache.c:2126       performance: 0.104566615 s: finished scanning all entries
> 12:50:00.217900 read-cache.c:2029       performance: 0.082309193 s: loaded 367110 entries
> 12:50:00.259969 read-cache.c:2029       performance: 0.070257130 s: loaded 367108 entries
> 12:50:00.263662 read-cache.c:2278       performance: 0.179344458 s: read cache .git/index

The previous mail wraps these lines and make it a bit hard to read. Corrected now.

--
Duy
Duy Nguyen Sept. 15, 2018, 11:29 a.m. UTC | #4
On Wed, Sep 12, 2018 at 6:18 PM Ben Peart <benpeart@microsoft.com> wrote:
>  #ifndef NO_PTHREADS
>         nr_threads = git_config_get_index_threads();
> -       if (!nr_threads)
> -               nr_threads = online_cpus();
> +       if (!nr_threads) {
> +               cpus = online_cpus();
> +               nr_threads = istate->cache_nr / THREAD_COST;
> +               if (nr_threads > cpus)
> +                       nr_threads = cpus;

It seems like overcommitting cpu does reduce time. With this patch
(and a 4 core system), I got

$ test-tool read-cache 100
real    0m36.270s
user    0m54.193s
sys     0m17.346s

if I force nr_threads to 9 (even though cpus is 4)

$ test-tool read-cache 100
real    0m33.592s
user    1m4.230s
sys     0m18.380s

Even though we use more cpus, real time is shorter. I guess these
threads still sleep a bit due to I/O and having more threads than
cores will utilize those idle cycles.
--
Duy
Ben Peart Sept. 17, 2018, 5:25 p.m. UTC | #5
On 9/15/2018 6:31 AM, Duy Nguyen wrote:
> On Wed, Sep 12, 2018 at 6:18 PM Ben Peart <benpeart@microsoft.com> wrote:
>>
>> This patch helps address the CPU cost of loading the index by creating
>> multiple threads to divide the work of loading and converting the cache
>> entries across all available CPU cores.
>>
>> It accomplishes this by having the primary thread loop across the index file
>> tracking the offset and (for V4 indexes) expanding the name. It creates a
>> thread to process each block of entries as it comes to them.
>>
>> I used p0002-read-cache.sh to generate some performance data:
>>
>> Test w/100,000 files                Baseline         Parallel entries
>> ---------------------------------------------------------------------------
>> read_cache/discard_cache 1000 times 14.08(0.01+0.10) 9.72(0.03+0.06) -31.0%
>>
>> Test w/1,000,000 files              Baseline         Parallel entries
>> ------------------------------------------------------------------------------
>> read_cache/discard_cache 1000 times 202.95(0.01+0.07) 154.14(0.03+0.06) -24.1%
> 
> The numbers here and the previous patch to load extensions in parallel
> are exactly the same. What do these numbers mean? With both changes?
> 

It means I messed up when creating my commit message for the extension 
patch and copy/pasted the wrong numbers.  Yes, these numbers are with 
both changes (the correct numbers for the extension only are not as good).
Ben Peart Sept. 17, 2018, 6:52 p.m. UTC | #6
On 9/15/2018 7:09 AM, Duy Nguyen wrote:
> On Sat, Sep 15, 2018 at 01:07:46PM +0200, Duy Nguyen wrote:
>> 12:50:00.084237 read-cache.c:1721       start loading index
>> 12:50:00.119941 read-cache.c:1943       performance: 0.034778758 s: loaded all extensions (1667075 bytes)
>> 12:50:00.185352 read-cache.c:2029       performance: 0.100152079 s: loaded 367110 entries
>> 12:50:00.189683 read-cache.c:2126       performance: 0.104566615 s: finished scanning all entries
>> 12:50:00.217900 read-cache.c:2029       performance: 0.082309193 s: loaded 367110 entries
>> 12:50:00.259969 read-cache.c:2029       performance: 0.070257130 s: loaded 367108 entries
>> 12:50:00.263662 read-cache.c:2278       performance: 0.179344458 s: read cache .git/index
> 
> The previous mail wraps these lines and make it a bit hard to read. Corrected now.
> 
> --
> Duy
> 

Interesting!  Clearly the data shape makes a big difference here as I 
had run a similar test but in my case, the extensions thread actually 
finished last (and it's cost is what drove me to move that onto a 
separate thread that starts first).

Purpose	    			First	Last	Duration
load_index_extensions_thread	719.40	968.50	249.10
load_cache_entries_thread	718.89	738.65	19.76
load_cache_entries_thread	730.39	753.83	23.43
load_cache_entries_thread	741.23	751.23	10.00
load_cache_entries_thread	751.93	780.88	28.95
load_cache_entries_thread	763.60	791.31	27.72
load_cache_entries_thread	773.46	783.46	10.00
load_cache_entries_thread	783.96	794.28	10.32
load_cache_entries_thread	795.61	805.52	9.91
load_cache_entries_thread	805.99	827.21	21.22
load_cache_entries_thread	816.85	826.85	10.00
load_cache_entries_thread	827.03	837.96	10.93

In my tests, the scanning thread clearly delayed the later ce threads 
but given the extension was so slow, it didn't impact the overall time 
nearly as much as your case.

I completely agree that the optimal solution would be to go back to my 
original patch/design.  It eliminates the overhead of the scanning 
thread entirely and allows all threads to start at the same time. This 
would ensure the best performance whether the extensions were the 
longest thread or the cache entry threads took the longest.

I ran out of time and energy last year so dropped it to work on other 
tasks.  I appreciate your offer of help. Perhaps between the two of us 
we could successfully get it through the mailing list this time. :-) 
Let me go back and see what it would take to combine the current EOIE 
patch with the older IEOT patch.

I'm also intrigued with your observation that over committing the cpu 
actually results in time savings.  I hadn't tested that.  It looks like 
that could have a positive impact on the overall time and warrant a 
change to the default nr_threads logic.

Patch
diff mbox series

diff --git a/read-cache.c b/read-cache.c
index b203eebb44..880f627b4c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1942,20 +1942,212 @@  static void *load_index_extensions(void *_data)
 	return NULL;
 }
 
+/*
+ * A helper function that will load the specified range of cache entries
+ * from the memory mapped file and add them to the given index.
+ */
+static unsigned long load_cache_entry_block(struct index_state *istate,
+			struct mem_pool *ce_mem_pool, int offset, int nr, void *mmap,
+			unsigned long start_offset, struct strbuf *previous_name)
+{
+	int i;
+	unsigned long src_offset = start_offset;
+
+	for (i = offset; i < offset + nr; i++) {
+		struct ondisk_cache_entry *disk_ce;
+		struct cache_entry *ce;
+		unsigned long consumed;
+
+		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
+		ce = create_from_disk(ce_mem_pool, disk_ce, &consumed, previous_name);
+		set_index_entry(istate, i, ce);
+
+		src_offset += consumed;
+	}
+	return src_offset - start_offset;
+}
+
+static unsigned long load_all_cache_entries(struct index_state *istate,
+			void *mmap, size_t mmap_size, unsigned long src_offset)
+{
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+	unsigned long consumed;
+
+	if (istate->version == 4) {
+		previous_name = &previous_name_buf;
+		mem_pool_init(&istate->ce_mem_pool,
+				estimate_cache_size_from_compressed(istate->cache_nr));
+	} else {
+		previous_name = NULL;
+		mem_pool_init(&istate->ce_mem_pool,
+				estimate_cache_size(mmap_size, istate->cache_nr));
+	}
+
+	consumed = load_cache_entry_block(istate, istate->ce_mem_pool,
+					0, istate->cache_nr, mmap, src_offset, previous_name);
+	strbuf_release(&previous_name_buf);
+	return consumed;
+}
+
+#ifndef NO_PTHREADS
+
+/*
+ * Mostly randomly chosen maximum thread counts: we
+ * cap the parallelism to online_cpus() threads, and we want
+ * to have at least 10000 cache entries per thread for it to
+ * be worth starting a thread.
+ */
+#define THREAD_COST		(10000)
+
+struct load_cache_entries_thread_data
+{
+	pthread_t pthread;
+	struct index_state *istate;
+	struct mem_pool *ce_mem_pool;
+	int offset, nr;
+	void *mmap;
+	unsigned long start_offset;
+	struct strbuf previous_name_buf;
+	struct strbuf *previous_name;
+	unsigned long consumed;	/* return # of bytes in index file processed */
+};
+
+/*
+ * A thread proc to run the load_cache_entries() computation
+ * across multiple background threads.
+ */
+static void *load_cache_entries_thread(void *_data)
+{
+	struct load_cache_entries_thread_data *p = _data;
+
+	p->consumed += load_cache_entry_block(p->istate, p->ce_mem_pool,
+		p->offset, p->nr, p->mmap, p->start_offset, p->previous_name);
+	return NULL;
+}
+
+static unsigned long load_cache_entries_threaded(int nr_threads, struct index_state *istate,
+			void *mmap, size_t mmap_size, unsigned long src_offset)
+{
+	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
+	struct load_cache_entries_thread_data *data;
+	int ce_per_thread;
+	unsigned long consumed;
+	int i, thread;
+
+	/* a little sanity checking */
+	if (istate->name_hash_initialized)
+		BUG("the name hash isn't thread safe");
+
+	mem_pool_init(&istate->ce_mem_pool, 0);
+	if (istate->version == 4)
+		previous_name = &previous_name_buf;
+	else
+		previous_name = NULL;
+
+	ce_per_thread = DIV_ROUND_UP(istate->cache_nr, nr_threads);
+	data = xcalloc(nr_threads, sizeof(struct load_cache_entries_thread_data));
+
+	/*
+	 * Loop through index entries starting a thread for every ce_per_thread
+	 * entries. Exit the loop when we've created the final thread (no need
+	 * to parse the remaining entries.
+	 */
+	consumed = thread = 0;
+	for (i = 0; ; i++) {
+		struct ondisk_cache_entry *ondisk;
+		const char *name;
+		unsigned int flags;
+
+		/*
+		 * we've reached the beginning of a block of cache entries,
+		 * kick off a thread to process them
+		 */
+		if (i % ce_per_thread == 0) {
+			struct load_cache_entries_thread_data *p = &data[thread];
+
+			p->istate = istate;
+			p->offset = i;
+			p->nr = ce_per_thread < istate->cache_nr - i ? ce_per_thread : istate->cache_nr - i;
+
+			/* create a mem_pool for each thread */
+			if (istate->version == 4)
+				mem_pool_init(&p->ce_mem_pool,
+					estimate_cache_size_from_compressed(p->nr));
+			else
+				mem_pool_init(&p->ce_mem_pool,
+					estimate_cache_size(mmap_size, p->nr));
+
+			p->mmap = mmap;
+			p->start_offset = src_offset;
+			if (previous_name) {
+				strbuf_addbuf(&p->previous_name_buf, previous_name);
+				p->previous_name = &p->previous_name_buf;
+			}
+
+			if (pthread_create(&p->pthread, NULL, load_cache_entries_thread, p))
+				die("unable to create load_cache_entries_thread");
+
+			/* exit the loop when we've created the last thread */
+			if (++thread == nr_threads)
+				break;
+		}
+
+		ondisk = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
+
+		/* On-disk flags are just 16 bits */
+		flags = get_be16(&ondisk->flags);
+
+		if (flags & CE_EXTENDED) {
+			struct ondisk_cache_entry_extended *ondisk2;
+			ondisk2 = (struct ondisk_cache_entry_extended *)ondisk;
+			name = ondisk2->name;
+		} else
+			name = ondisk->name;
+
+		if (!previous_name) {
+			size_t len;
+
+			/* v3 and earlier */
+			len = flags & CE_NAMEMASK;
+			if (len == CE_NAMEMASK)
+				len = strlen(name);
+			src_offset += (flags & CE_EXTENDED) ?
+				ondisk_cache_entry_extended_size(len) :
+				ondisk_cache_entry_size(len);
+		} else
+			src_offset += (name - ((char *)ondisk)) + expand_name_field(previous_name, name);
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		struct load_cache_entries_thread_data *p = data + i;
+		if (pthread_join(p->pthread, NULL))
+			die("unable to join load_cache_entries_thread");
+		mem_pool_combine(istate->ce_mem_pool, p->ce_mem_pool);
+		strbuf_release(&p->previous_name_buf);
+		consumed += p->consumed;
+	}
+
+	free(data);
+	strbuf_release(&previous_name_buf);
+
+	return consumed;
+}
+
+#endif
+
 /* remember to discard_cache() before reading a different cache! */
 int do_read_index(struct index_state *istate, const char *path, int must_exist)
 {
-	int fd, i;
+	int fd;
 	struct stat st;
 	unsigned long src_offset;
 	struct cache_header *hdr;
 	void *mmap;
 	size_t mmap_size;
-	struct strbuf previous_name_buf = STRBUF_INIT, *previous_name;
 	struct load_index_extensions p = { 0 };
 	unsigned long extension_offset = 0;
 #ifndef NO_PTHREADS
-	int nr_threads;
+	int cpus, nr_threads;
 #endif
 
 	if (istate->initialized)
@@ -1997,10 +2189,20 @@  int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	p.mmap = mmap;
 	p.mmap_size = mmap_size;
 
+	src_offset = sizeof(*hdr);
+
 #ifndef NO_PTHREADS
 	nr_threads = git_config_get_index_threads();
-	if (!nr_threads)
-		nr_threads = online_cpus();
+	if (!nr_threads) {
+		cpus = online_cpus();
+		nr_threads = istate->cache_nr / THREAD_COST;
+		if (nr_threads > cpus)
+			nr_threads = cpus;
+	}
+
+	/* enable testing with fewer than default minimum of entries */
+	if (istate->cache_nr > 1 && nr_threads < 3 && git_env_bool("GIT_TEST_INDEX_THREADS", 0))
+		nr_threads = 3;
 
 	if (nr_threads >= 2) {
 		extension_offset = read_eoie_extension(mmap, mmap_size);
@@ -2009,33 +2211,17 @@  int do_read_index(struct index_state *istate, const char *path, int must_exist)
 			p.src_offset = extension_offset;
 			if (pthread_create(&p.pthread, NULL, load_index_extensions, &p))
 				die(_("unable to create load_index_extensions_thread"));
+			nr_threads--;
 		}
 	}
+	if (nr_threads >= 2)
+		src_offset += load_cache_entries_threaded(nr_threads, istate, mmap, mmap_size, src_offset);
+	else
+		src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset);
+#else
+	src_offset += load_all_cache_entries(istate, mmap, mmap_size, src_offset);
 #endif
 
-	if (istate->version == 4) {
-		previous_name = &previous_name_buf;
-		mem_pool_init(&istate->ce_mem_pool,
-			      estimate_cache_size_from_compressed(istate->cache_nr));
-	} else {
-		previous_name = NULL;
-		mem_pool_init(&istate->ce_mem_pool,
-			      estimate_cache_size(mmap_size, istate->cache_nr));
-	}
-
-	src_offset = sizeof(*hdr);
-	for (i = 0; i < istate->cache_nr; i++) {
-		struct ondisk_cache_entry *disk_ce;
-		struct cache_entry *ce;
-		unsigned long consumed;
-
-		disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
-		ce = create_from_disk(istate->ce_mem_pool, disk_ce, &consumed, previous_name);
-		set_index_entry(istate, i, ce);
-
-		src_offset += consumed;
-	}
-	strbuf_release(&previous_name_buf);
 	istate->timestamp.sec = st.st_mtime;
 	istate->timestamp.nsec = ST_MTIME_NSEC(st);
 
diff --git a/t/README b/t/README
index d8754dd23a..69c695ad8e 100644
--- a/t/README
+++ b/t/README
@@ -324,6 +324,9 @@  This is used to allow tests 1, 4-9 in t1700-split-index.sh to succeed
 as they currently hard code SHA values for the index which are no longer
 valid due to the addition of the EOIE extension.
 
+GIT_TEST_INDEX_THREADS=<boolean> forces multi-threaded loading of
+the index cache entries and extensions for the whole test suite.
+
 Naming Tests
 ------------