mbox series

[RFC,0/4] Speed booting by sorting ORC unwind tables at build time

Message ID 20191107143205.206606-1-shile.zhang@linux.alibaba.com (mailing list archive)
Headers show
Series Speed booting by sorting ORC unwind tables at build time | expand

Message

Shile Zhang Nov. 7, 2019, 2:32 p.m. UTC
From: Shile Zhang <shile.zhang@linux.alibaba.com>

Hi,

I found the unwind_init taken long time (more than 90ms) in kernel
booting, mainly spent on sorting the two ORC unwind tables, orc_unwind
and orc_unwind_ip.

I also noticed that this issued has reported and discussed last year:
https://lkml.org/lkml/2018/10/8/342
But seems no final solution until now, I tried to sort the ORC tables at
build time, followed the helpful hints from Josh and Ingo in that thread.
And mainly referred the implementation of 'sortextable' tool:
https://lore.kernel.org/linux-mips/1334872799-14589-1-git-send-email-ddaney.cavm@gmail.com/

What I did:

- Add a Kconfig to control build-time sorting or runtime sorting;
- Referred 'sortextable', create a similar helper tool 'sortorctable',
  help to sort the ORC unwind tables at vmlinux link process.

One potential improvement is to sort the module ORC tables in future.

Thanks!

Shile Zhang (4):
  scripts: Add sortorctable to sort ORC unwind tables
  kbuild: Sort ORC unwind tables in vmlinux link process
  x86/unwind/orc: Skip sorting if BUILDTIME_ORCTABLE_SORT is configured
  x86/Kconfig: Add a Kconfig option to sort ORC tables at build time

 arch/x86/Kconfig.debug       |   9 ++
 arch/x86/kernel/unwind_orc.c |   2 +
 scripts/.gitignore           |   1 +
 scripts/Makefile             |   2 +
 scripts/link-vmlinux.sh      |  10 ++
 scripts/sortorctable.c       | 246 +++++++++++++++++++++++++++++++++++
 scripts/sortorctable.h       |  25 ++++
 7 files changed, 295 insertions(+)
 create mode 100644 scripts/sortorctable.c
 create mode 100644 scripts/sortorctable.h

Comments

Peter Zijlstra Nov. 7, 2019, 3:22 p.m. UTC | #1
On Thu, Nov 07, 2019 at 10:32:01PM +0800, shile.zhang@linux.alibaba.com wrote:
> From: Shile Zhang <shile.zhang@linux.alibaba.com>
> 
> Hi,
> 
> I found the unwind_init taken long time (more than 90ms) in kernel
> booting, mainly spent on sorting the two ORC unwind tables, orc_unwind
> and orc_unwind_ip.
> 
> I also noticed that this issued has reported and discussed last year:
> https://lkml.org/lkml/2018/10/8/342
> But seems no final solution until now, I tried to sort the ORC tables at
> build time, followed the helpful hints from Josh and Ingo in that thread.
> And mainly referred the implementation of 'sortextable' tool:
> https://lore.kernel.org/linux-mips/1334872799-14589-1-git-send-email-ddaney.cavm@gmail.com/
> 
> What I did:
> 
> - Add a Kconfig to control build-time sorting or runtime sorting;
> - Referred 'sortextable', create a similar helper tool 'sortorctable',
>   help to sort the ORC unwind tables at vmlinux link process.

What is the build-time cost for doing this? The link phase is already a
fairly big bottleneck for building a kernel.

Can sort{ex,orc}table() be ran concurrently? Do they want to be the same
(threaded) tool?
Josh Poimboeuf Nov. 7, 2019, 3:46 p.m. UTC | #2
On Thu, Nov 07, 2019 at 10:32:01PM +0800, shile.zhang@linux.alibaba.com wrote:
> From: Shile Zhang <shile.zhang@linux.alibaba.com>
> 
> Hi,
> 
> I found the unwind_init taken long time (more than 90ms) in kernel
> booting, mainly spent on sorting the two ORC unwind tables, orc_unwind
> and orc_unwind_ip.
> 
> I also noticed that this issued has reported and discussed last year:
> https://lkml.org/lkml/2018/10/8/342
> But seems no final solution until now, I tried to sort the ORC tables at
> build time, followed the helpful hints from Josh and Ingo in that thread.
> And mainly referred the implementation of 'sortextable' tool:
> https://lore.kernel.org/linux-mips/1334872799-14589-1-git-send-email-ddaney.cavm@gmail.com/
> 
> What I did:
> 
> - Add a Kconfig to control build-time sorting or runtime sorting;
> - Referred 'sortextable', create a similar helper tool 'sortorctable',
>   help to sort the ORC unwind tables at vmlinux link process.
> 
> One potential improvement is to sort the module ORC tables in future.
> 
> Thanks!

Thanks a lot for working on this!

I'd say the new config option isn't needed.  The runtime ORC sorting
logic is unconditionally bad and the code should just be removed.  I saw
recently that it's one of the main offenders for boot time latency.

I also agree with Peter that we should try to reduce the link-time
penalty as much as possible.  But it's a necessary evil to a certain
extent.
Shile Zhang Nov. 8, 2019, 1:42 a.m. UTC | #3
On 2019/11/7 23:22, Peter Zijlstra wrote:
> On Thu, Nov 07, 2019 at 10:32:01PM +0800, shile.zhang@linux.alibaba.com wrote:
>> From: Shile Zhang <shile.zhang@linux.alibaba.com>
>>
>> Hi,
>>
>> I found the unwind_init taken long time (more than 90ms) in kernel
>> booting, mainly spent on sorting the two ORC unwind tables, orc_unwind
>> and orc_unwind_ip.
>>
>> I also noticed that this issued has reported and discussed last year:
>> https://lkml.org/lkml/2018/10/8/342
>> But seems no final solution until now, I tried to sort the ORC tables at
>> build time, followed the helpful hints from Josh and Ingo in that thread.
>> And mainly referred the implementation of 'sortextable' tool:
>> https://lore.kernel.org/linux-mips/1334872799-14589-1-git-send-email-ddaney.cavm@gmail.com/
>>
>> What I did:
>>
>> - Add a Kconfig to control build-time sorting or runtime sorting;
>> - Referred 'sortextable', create a similar helper tool 'sortorctable',
>>    help to sort the ORC unwind tables at vmlinux link process.
> What is the build-time cost for doing this? The link phase is already a
> fairly big bottleneck for building a kernel.
Hi, Peter, Thanks for your kindly reply!
On my test env, the build-time sort spend about 100ms, which is similar 
to runtime sorting due to the same sorting code. Of course there are few 
compiling cost in sortorctable tool itself. I think the overall cost of 
this build-time sorting is not so much.

I agree with you that the link time of vmlinux shoud be optimized.
But IMHO, for one kernel product release, the kernel building is once 
for all. So put the sorting in build time can save boot time for 
customer, for each booting. I think this is significant for boot time 
sensitive products, such as embedded devices in IoT, or VM in Cloud.
> Can sort{ex,orc}table() be ran concurrently? Do they want to be the same
> (threaded) tool?
I think it is possible to do those sort work concurrently, likes 
deferred memory init which is big boot time speed up.
But I don't know if the exception table and ORC unwind tables can be 
deferred, due to those tables might be used in early boot time, for 
early exception handling and early debugging. I'm not familiar with that.

IMO, the init works in kernel boot time should do the necessary 
runtime-depends initialization, which cannot done out-of-bands. For 
exception, ORC unwind like tables, which does not depends on the runtime 
ENV. It can/should be ready in building time. IOW, this kind of "setup 
cost" should not put on customer.

Thanks again!
Shile Zhang Nov. 8, 2019, 1:43 a.m. UTC | #4
On 2019/11/7 23:46, Josh Poimboeuf wrote:
> On Thu, Nov 07, 2019 at 10:32:01PM +0800, shile.zhang@linux.alibaba.com wrote:
>> From: Shile Zhang <shile.zhang@linux.alibaba.com>
>>
>> Hi,
>>
>> I found the unwind_init taken long time (more than 90ms) in kernel
>> booting, mainly spent on sorting the two ORC unwind tables, orc_unwind
>> and orc_unwind_ip.
>>
>> I also noticed that this issued has reported and discussed last year:
>> https://lkml.org/lkml/2018/10/8/342
>> But seems no final solution until now, I tried to sort the ORC tables at
>> build time, followed the helpful hints from Josh and Ingo in that thread.
>> And mainly referred the implementation of 'sortextable' tool:
>> https://lore.kernel.org/linux-mips/1334872799-14589-1-git-send-email-ddaney.cavm@gmail.com/
>>
>> What I did:
>>
>> - Add a Kconfig to control build-time sorting or runtime sorting;
>> - Referred 'sortextable', create a similar helper tool 'sortorctable',
>>    help to sort the ORC unwind tables at vmlinux link process.
>>
>> One potential improvement is to sort the module ORC tables in future.
>>
>> Thanks!
> Thanks a lot for working on this!
>
> I'd say the new config option isn't needed.  The runtime ORC sorting
> logic is unconditionally bad and the code should just be removed.  I saw
> recently that it's one of the main offenders for boot time latency.
Hi, Josh, Thanks very much for your quickly response!

I'll refactor the code as your advice.
Yes, the run-time sorting cost is bigger for currently Cloud products, 
such as serverless products, which needs boot up ASAP.
>
> I also agree with Peter that we should try to reduce the link-time
> penalty as much as possible.  But it's a necessary evil to a certain
> extent.
>
agree!

Thanks again and looking forwards more advice from you!
Peter Zijlstra Nov. 8, 2019, 9:21 a.m. UTC | #5
On Fri, Nov 08, 2019 at 09:42:55AM +0800, Shile Zhang wrote:

> > Can sort{ex,orc}table() be ran concurrently? Do they want to be the same
> > (threaded) tool?

> I think it is possible to do those sort work concurrently, likes deferred
> memory init which is big boot time speed up.
> But I don't know if the exception table and ORC unwind tables can be
> deferred, due to those tables might be used in early boot time, for early
> exception handling and early debugging. I'm not familiar with that.

I meant at link time, run both sorts concurrently such that we only have
to wait for the longest, instead of the sum of them.

They're not changing the same part of the ELF file, so it should be
possible to have one tool have multiple threads, each sorting a
different table.

Aside from the .ex_table and ORC there's also .jump_table that wants
sorting (see jump_label_sort_entries()).

I agree that doing it at link time makes sense, I just hate to do all
this sorting in sequence and blowing up the link time. I don't build for
customers, I build for single use boot and linking _SUCKS_.
Peter Zijlstra Nov. 8, 2019, 9:25 a.m. UTC | #6
On Fri, Nov 08, 2019 at 10:21:36AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 08, 2019 at 09:42:55AM +0800, Shile Zhang wrote:
> 
> > > Can sort{ex,orc}table() be ran concurrently? Do they want to be the same
> > > (threaded) tool?
> 
> > I think it is possible to do those sort work concurrently, likes deferred
> > memory init which is big boot time speed up.
> > But I don't know if the exception table and ORC unwind tables can be
> > deferred, due to those tables might be used in early boot time, for early
> > exception handling and early debugging. I'm not familiar with that.
> 
> I meant at link time, run both sorts concurrently such that we only have
> to wait for the longest, instead of the sum of them.
> 
> They're not changing the same part of the ELF file, so it should be
> possible to have one tool have multiple threads, each sorting a
> different table.
> 
> Aside from the .ex_table and ORC there's also .jump_table that wants
> sorting (see jump_label_sort_entries()).

Oh, and I'll be adding .static_call_sites soon, see:

  https://lkml.kernel.org/r/20191007082708.013939311@infradead.org

(I should repost that)

That gives us 4 tables to sort which we can do concurrently in 4
threads.

> I agree that doing it at link time makes sense, I just hate to do all
> this sorting in sequence and blowing up the link time. I don't build for
> customers, I build for single use boot and linking _SUCKS_.
Shile Zhang Nov. 11, 2019, 2:44 a.m. UTC | #7
On 2019/11/8 17:25, Peter Zijlstra wrote:
> On Fri, Nov 08, 2019 at 10:21:36AM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 08, 2019 at 09:42:55AM +0800, Shile Zhang wrote:
>>
>>>> Can sort{ex,orc}table() be ran concurrently? Do they want to be the same
>>>> (threaded) tool?
>>> I think it is possible to do those sort work concurrently, likes deferred
>>> memory init which is big boot time speed up.
>>> But I don't know if the exception table and ORC unwind tables can be
>>> deferred, due to those tables might be used in early boot time, for early
>>> exception handling and early debugging. I'm not familiar with that.
>> I meant at link time, run both sorts concurrently such that we only have
>> to wait for the longest, instead of the sum of them.
>>
>> They're not changing the same part of the ELF file, so it should be
>> possible to have one tool have multiple threads, each sorting a
>> different table.
>>
>> Aside from the .ex_table and ORC there's also .jump_table that wants
>> sorting (see jump_label_sort_entries()).
> Oh, and I'll be adding .static_call_sites soon, see:
>
>    https://lkml.kernel.org/r/20191007082708.013939311@infradead.org
>
> (I should repost that)
>
> That gives us 4 tables to sort which we can do concurrently in 4
> threads.

I got your point now.
I'll try to rework the sort tool to sort all tables concurrently in one 
tool with multiple-threads.
Thanks for your advice!

>> I agree that doing it at link time makes sense, I just hate to do all
>> this sorting in sequence and blowing up the link time. I don't build for
>> customers, I build for single use boot and linking _SUCKS_.