
[v3,0/4] read-cache: speed up index load through parallelization

Message ID 20180906210227.54368-1-benpeart@microsoft.com (mailing list archive)

Message

Ben Peart Sept. 6, 2018, 9:03 p.m. UTC
On further investigation with the previous patch, I noticed that my test
repos didn't contain the cache tree extension in their index. After doing a
commit to ensure they existed, I realized that in some instances, the time
to load the cache tree exceeded the time to load all the cache entries in
parallel.  Because the thread to read the cache tree was started last (due
to having to parse through all the cache entries first) we weren't always
getting optimal performance.

To better optimize for this case, I decided to write the EOIE extension
as suggested by Junio [1] in response to my earlier multithreading patch
series [2].  This enables me to spin up the thread to load the extensions
earlier as it no longer has to parse through all the cache entries first.

The big changes in this iteration are:

- add the EOIE extension
- update the index extension worker thread to start first

The absolute perf numbers don't look as good as in the previous iteration
because not loading the cache tree at all is much faster than loading it in
parallel. These numbers were measured with a V4 index that included a cache
tree extension.

I used p0002-read-cache.sh to generate performance data showing how the three
optimization patches help:

p0002-read-cache.sh w/100,000 files
Baseline         expand_name_field()    Thread extensions      Thread entries
---------------------------------------------------------------------------------------
22.34(0.01+0.12) 21.14(0.03+0.01) -5.4% 20.71(0.03+0.03) -7.3% 13.93(0.04+0.04) -37.6%

p0002-read-cache.sh w/1,000,000 files
Baseline          expand_name_field()     Thread extensions        Thread entries
-------------------------------------------------------------------------------------------
306.44(0.04+0.07) 295.42(0.01+0.07) -3.6% 217.60(0.03+0.04) -29.0% 199.00(0.00+0.10) -35.1%
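The percentage columns are deltas of elapsed time relative to the baseline; for example, the "Thread entries" figure in the 100,000-file run:

```latex
\frac{13.93 - 22.34}{22.34} \approx -0.376 \;\Rightarrow\; -37.6\%
```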

This series conflicts with Duy's patch to remove the double memory copy and
pass in the previous ce instead.  The two will need to be merged/reconciled
once they settle down a bit.

[1] https://public-inbox.org/git/xmqq1sl017dw.fsf@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20171109141737.47976-1-benpeart@microsoft.com/


Base Ref: master
Web-Diff: https://github.com/benpeart/git/commit/325ec69299
Checkout: git fetch https://github.com/benpeart/git read-index-multithread-v3 && git checkout 325ec69299


### Patches

Ben Peart (4):
  read-cache: optimize expand_name_field() to speed up V4 index parsing.
  eoie: add End of Index Entry (EOIE) extension
  read-cache: load cache extensions on a worker thread
  read-cache: speed up index load through parallelization

 Documentation/config.txt                 |   6 +
 Documentation/technical/index-format.txt |  23 ++
 config.c                                 |  18 +
 config.h                                 |   1 +
 read-cache.c                             | 476 ++++++++++++++++++++---
 t/README                                 |  11 +
 t/t1700-split-index.sh                   |   1 +
 7 files changed, 487 insertions(+), 49 deletions(-)


base-commit: 29d9e3e2c47dd4b5053b0a98c891878d398463e3

Comments

Junio C Hamano Sept. 7, 2018, 5:21 p.m. UTC | #1
Ben Peart <benpeart@microsoft.com> writes:

> On further investigation with the previous patch, I noticed that my test
> repos didn't contain the cache tree extension in their index. After doing a
> commit to ensure they existed, I realized that in some instances, the time
> to load the cache tree exceeded the time to load all the cache entries in
> parallel.  Because the thread to read the cache tree was started last (due
> to having to parse through all the cache entries first) we weren't always
> getting optimal performance.
>
> To better optimize for this case, I decided to write the EOIE extension
> as suggested by Junio [1] in response to my earlier multithreading patch
> series [2].  This enables me to spin up the thread to load the extensions
> earlier as it no longer has to parse through all the cache entries first.

Hmph. I kinda liked the simplicity of the previous one, but if we
need to start reading the extension sections sooner by eliminating
the overhead to scan the cache entries, perhaps we should bite the
bullet now.

> The big changes in this iteration are:
>
> - add the EOIE extension
> - update the index extension worker thread to start first

I guess I'd need to see the actual patch to find this out, but once
we rely on a new extension, then we could omit scanning the main
index even to partition the work among workers (i.e. like the topic
long ago, you can have a list of pointers into the main index to help
partitioning, plus reset the prefix compression used in v4).  I
think you didn't get that far in this round, which is good.  If the
gain with EOIE alone (and starting the worker for the extension
section early) is large enough without such a pre-computed work
partition table, the simplicity of this round may give us a good
stopping point.

> This patch conflicts with Duy's patch to remove the double memory copy and
> pass in the previous ce instead.  The two will need to be merged/reconciled
> once they settle down a bit.

Thanks.  I have a feeling that 67922abb ("read-cache.c: optimize
reading index format v4", 2018-09-02) is already 'next'-worthy
and ready to be built on, but I'd prefer to hear from Duy to double
check.
Ben Peart Sept. 7, 2018, 6:31 p.m. UTC | #2
On 9/7/2018 1:21 PM, Junio C Hamano wrote:
> Ben Peart <benpeart@microsoft.com> writes:
> 
>> On further investigation with the previous patch, I noticed that my test
>> repos didn't contain the cache tree extension in their index. After doing a
>> commit to ensure they existed, I realized that in some instances, the time
>> to load the cache tree exceeded the time to load all the cache entries in
>> parallel.  Because the thread to read the cache tree was started last (due
>> to having to parse through all the cache entries first) we weren't always
>> getting optimal performance.
>>
>> To better optimize for this case, I decided to write the EOIE extension
>> as suggested by Junio [1] in response to my earlier multithreading patch
>> series [2].  This enables me to spin up the thread to load the extensions
>> earlier as it no longer has to parse through all the cache entries first.
> 
> Hmph. I kinda liked the simplicity of the previous one, but if we
> need to start reading the extension sections sooner by eliminating
> the overhead to scan the cache entries, perhaps we should bite the
> bullet now.
> 

I preferred the simplicity as well, but when I was profiling the code and 
found that loading the extensions was most often the last thread to 
complete, I took this intermediate step to speed things up.

>> The big changes in this iteration are:
>>
>> - add the EOIE extension
>> - update the index extension worker thread to start first
> 
> I guess I'd need to see the actual patch to find this out, but once
> we rely on a new extension, then we could omit scanning the main
> index even to partition the work among workers (i.e. like the topic
> long ago, you can have a list of pointers into the main index to help
> partitioning, plus reset the prefix compression used in v4).  I
> think you didn't get that far in this round, which is good.  If the
> gain with EOIE alone (and starting the worker for the extension
> section early) is large enough without such a pre-computed work
> partition table, the simplicity of this round may give us a good
> stopping point.
> 

Agreed.  I didn't go that far in this series as it doesn't appear to be 
necessary.  We could always add that later if it turns out to be worth 
the additional complexity.

>> This patch conflicts with Duy's patch to remove the double memory copy and
>> pass in the previous ce instead.  The two will need to be merged/reconciled
>> once they settle down a bit.
> 
> Thanks.  I have a feeling that 67922abb ("read-cache.c: optimize
> reading index format v4", 2018-09-02) is already 'next'-worthy
> and ready to be built on, but I'd prefer to hear from Duy to double
> check.
> 

I'll take a closer look at what this will entail. It gets more 
complicated as we don't actually have a previous expanded cache entry 
when starting each thread.  I'll see how complex it makes the code and 
how much additional performance it gives.
Duy Nguyen Sept. 8, 2018, 1:18 p.m. UTC | #3
On Fri, Sep 7, 2018 at 7:21 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ben Peart <benpeart@microsoft.com> writes:
>
> > On further investigation with the previous patch, I noticed that my test
> > repos didn't contain the cache tree extension in their index. After doing a
> > commit to ensure they existed, I realized that in some instances, the time
> > to load the cache tree exceeded the time to load all the cache entries in
> > parallel.  Because the thread to read the cache tree was started last (due
> > to having to parse through all the cache entries first) we weren't always
> > getting optimal performance.
> >
> > To better optimize for this case, I decided to write the EOIE extension
> > as suggested by Junio [1] in response to my earlier multithreading patch
> > series [2].  This enables me to spin up the thread to load the extensions
> > earlier as it no longer has to parse through all the cache entries first.
>
> Hmph. I kinda liked the simplicity of the previous one, but if we
> need to start reading the extension sections sooner by eliminating
> the overhead to scan the cache entries, perhaps we should bite the
> bullet now.

My view is slightly different. If we have to optimize, we might as well
try to squeeze the best out of it. Simplicity is already out of the
window at this point (but maintainability remains).

> > The big changes in this iteration are:
> >
> > - add the EOIE extension
> > - update the index extension worker thread to start first
>
> I guess I'd need to see the actual patch to find this out, but once
> we rely on a new extension, then we could omit scanning the main
> index even to partition the work among workers (i.e. like the topic
> long ago, you can have a list of pointers into the main index to help
> partitioning, plus reset the prefix compression used in v4).  I
> think you didn't get that far in this round, which is good.  If the
> gain with EOIE alone (and starting the worker for the extension
> section early) is large enough without such a pre-computed work
> partition table, the simplicity of this round may give us a good
> stopping point.

I suspect the reduced gain in the 1M files case compared to 100k files in
4/4 is because of scanning the index to split work among worker threads.
Since the index is now larger, the scanning takes more time before we
can start worker threads, so we gain less from parallelization. I have
not experimented to see whether this is true or something else is going on.

> > This patch conflicts with Duy's patch to remove the double memory copy and
> > pass in the previous ce instead.  The two will need to be merged/reconciled
> > once they settle down a bit.
>
> Thanks.  I have a feeling that 67922abb ("read-cache.c: optimize
> reading index format v4", 2018-09-02) is already 'next'-worthy
> and ready to be built on, but I'd prefer to hear from Duy to double
> check.

Yes, I think it's good. I ran the entire test suite with v4 just to
double check (and am thinking of testing the v4 version in Travis too).