mbox series

[v7,00/11] kallsyms: Optimizes the performance of lookup symbols

Message ID 20221017064950.2038-1-thunder.leizhen@huawei.com (mailing list archive)
Headers show
Series kallsyms: Optimizes the performance of lookup symbols | expand

Message

Zhen Lei Oct. 17, 2022, 6:49 a.m. UTC
v6 --> v7:
1. Improve the performance of kallsyms_lookup_name() when CONFIG_LTO_CLANG=y
   To achieve this, restrict '.' to be at the beginning of a substring, not in
   the middle or end.
2. kallsyms_selftest.c adds support for CONFIG_LTO_CLANG=y.
3. Patches 4-6 are rearranged, centralize implementations of the same
   functionality in one patch, rather than split it based on whether it
   belongs to the tool or kernel.
4. Due to the impact of the following patches, some adaptations are made.
   aa221f2ea58655f kallsyms: take the input file instead of reading stdin
   73bbb94466fd3f8 kallsyms: support "big" kernel symbols
   dfb352ab1162f73 kallsyms: Drop CONFIG_CFI_CLANG workarounds


v5 --> v6:
1. Add patch 6/11, kallsyms: Add helper kallsyms_lookup_clang_name()
2. Update commit message of patch 9/11.

v4 --> v5:
1. In scripts/kallsyms.c, we use an extra field to hold type and eventually
   put it together with name in write_src().
2. Generate a new table kallsyms_best_token_table[], so that we compress a
   symbol in the kernel using a process similar to compress_symbol().
3. Remove helper sym_name(), and rename field 'sym[]' to 'name[]' in
   scripts/kallsyms.c
4. Add helper __kallsyms_lookup_compressed_name() to avoid duplicate code in
   functions kallsyms_lookup_name() and kallsyms_on_each_match_symbol().
5. Add a new parameter "const char *modname" to module_kallsyms_on_each_symbol(),
   this makes the code logic clearer.
6. Delete the parameter 'struct module *' in the hook function associated with
   kallsyms_on_each_symbol(), it's unused now.

v3 --> v4:
1. Move the declaration of function kallsyms_sym_address() to linux/kallsyms.h,
   fix a build warning.

v2 --> v3:
1. Improve test cases, perform complete functional tests on functions
   kallsyms_lookup_name(), kallsyms_on_each_symbol() and
   kallsyms_on_each_match_symbol().
2. Add patch [PATCH v3 2/8] scripts/kallsyms: ensure that all possible
   combinations are compressed.
3. The symbol type is not compressed regardless of whether
   CONFIG_KALLSYMS_ALL is set or not. The memory overhead is increased
   by less than 20KiB if CONFIG_KALLSYMS_ALL=n.
4. Discard [PATCH v2 3/8] kallsyms: Adjust the types of some local variables

v1 --> v2:
Add self-test facility

v1:
Currently, to search for a symbol, we need to expand the symbols in
'kallsyms_names' one by one, and then use the expanded string for
comparison. This is very slow.

In fact, we can first compress the name being looked up and then use
it for comparison when traversing 'kallsyms_names'.

This patch series optimizes the performance of function kallsyms_lookup_name(),
and function klp_find_object_symbol() in the livepatch module. Based on the
test results, the performance overhead is reduced to 5%. That is, the
performance of these functions is improved by 20 times.

To avoid increasing the kernel size in non-debug mode, the optimization is only
for the case CONFIG_KALLSYMS_ALL=y.

Zhen Lei (11):
  scripts/kallsyms: rename build_initial_tok_table()
  scripts/kallsyms: don't compress symbol types
  scripts/kallsyms: remove helper sym_name() and cleanup
  kallsyms: Add helper kallsyms_compress_symbol_name()
  kallsyms: Improve the performance of kallsyms_lookup_name()
  kallsyms: Improve the performance of kallsyms_lookup_name() when
    CONFIG_LTO_CLANG=y
  kallsyms: Add helper kallsyms_on_each_match_symbol()
  livepatch: Use kallsyms_on_each_match_symbol() to improve performance
  livepatch: Improve the search performance of
    module_kallsyms_on_each_symbol()
  kallsyms: Delete an unused parameter related to
    kallsyms_on_each_symbol()
  kallsyms: Add self-test facility

 include/linux/kallsyms.h   |  12 +-
 include/linux/module.h     |   4 +-
 init/Kconfig               |  13 ++
 kernel/Makefile            |   1 +
 kernel/kallsyms.c          | 197 ++++++++++++++--
 kernel/kallsyms_internal.h |   1 +
 kernel/kallsyms_selftest.c | 464 +++++++++++++++++++++++++++++++++++++
 kernel/livepatch/core.c    |  31 ++-
 kernel/module/kallsyms.c   |  15 +-
 kernel/trace/ftrace.c      |   3 +-
 scripts/kallsyms.c         | 161 ++++++++++---
 scripts/link-vmlinux.sh    |   4 +
 12 files changed, 837 insertions(+), 69 deletions(-)
 create mode 100644 kernel/kallsyms_selftest.c

Comments

Luis Chamberlain Oct. 19, 2022, 12:01 p.m. UTC | #1
On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
> Currently, to search for a symbol, we need to expand the symbols in
> 'kallsyms_names' one by one, and then use the expanded string for
> comparison. This is very slow.
> 
> In fact, we can first compress the name being looked up and then use
> it for comparison when traversing 'kallsyms_names'.
> 
> This patch series optimizes the performance of function kallsyms_lookup_name(),
> and function klp_find_object_symbol() in the livepatch module. Based on the
> test results, the performance overhead is reduced to 5%. That is, the
> performance of these functions is improved by 20 times.

Stupid question, is a hash table in order?

  Luis
Zhen Lei Oct. 19, 2022, 2:11 p.m. UTC | #2
On 2022/10/19 20:01, Luis Chamberlain wrote:
> On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
>> Currently, to search for a symbol, we need to expand the symbols in
>> 'kallsyms_names' one by one, and then use the expanded string for
>> comparison. This is very slow.
>>
>> In fact, we can first compress the name being looked up and then use
>> it for comparison when traversing 'kallsyms_names'.
>>
>> This patch series optimizes the performance of function kallsyms_lookup_name(),
>> and function klp_find_object_symbol() in the livepatch module. Based on the
>> test results, the performance overhead is reduced to 5%. That is, the
>> performance of these functions is improved by 20 times.
> 
> Stupid question, is a hash table in order?

No hash table.

All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms

The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols
are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one
relationship. For any symbol, it has the same index in both arrays.

Therefore, when we look up a symbolic name based on an address, we use a binary lookup.
However, when we look up an address based on a symbol name, we can only traverse array
kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory.

> 
>   Luis
> .
>
Luis Chamberlain Oct. 25, 2022, 5:53 p.m. UTC | #3
On Wed, Oct 19, 2022 at 10:11:58PM +0800, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/10/19 20:01, Luis Chamberlain wrote:
> > On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
> >> Currently, to search for a symbol, we need to expand the symbols in
> >> 'kallsyms_names' one by one, and then use the expanded string for
> >> comparison. This is very slow.
> >>
> >> In fact, we can first compress the name being looked up and then use
> >> it for comparison when traversing 'kallsyms_names'.
> >>
> >> This patch series optimizes the performance of function kallsyms_lookup_name(),
> >> and function klp_find_object_symbol() in the livepatch module. Based on the
> >> test results, the performance overhead is reduced to 5%. That is, the
> >> performance of these functions is improved by 20 times.
> > 
> > Stupid question, is a hash table in order?
> 
> No hash table.
> 
> All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms
> 
> The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols
> are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one
> relationship. For any symbol, it has the same index in both arrays.
> 
> Therefore, when we look up a symbolic name based on an address, we use a binary lookup.
> However, when we look up an address based on a symbol name, we can only traverse array
> kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory.

This answers how we don't use a hash table, the question was *should* we
use one?

  Luis
Zhen Lei Oct. 26, 2022, 6:44 a.m. UTC | #4
On 2022/10/26 1:53, Luis Chamberlain wrote:
> On Wed, Oct 19, 2022 at 10:11:58PM +0800, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/19 20:01, Luis Chamberlain wrote:
>>> On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
>>>> Currently, to search for a symbol, we need to expand the symbols in
>>>> 'kallsyms_names' one by one, and then use the expanded string for
>>>> comparison. This is very slow.
>>>>
>>>> In fact, we can first compress the name being looked up and then use
>>>> it for comparison when traversing 'kallsyms_names'.
>>>>
>>>> This patch series optimizes the performance of function kallsyms_lookup_name(),
>>>> and function klp_find_object_symbol() in the livepatch module. Based on the
>>>> test results, the performance overhead is reduced to 5%. That is, the
>>>> performance of these functions is improved by 20 times.
>>>
>>> Stupid question, is a hash table in order?
>>
>> No hash table.
>>
>> All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms
>>
>> The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols
>> are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one
>> relationship. For any symbol, it has the same index in both arrays.
>>
>> Therefore, when we look up a symbolic name based on an address, we use a binary lookup.
>> However, when we look up an address based on a symbol name, we can only traverse array
>> kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory.
> 
> This answers how we don't use a hash table, the question was *should* we
> use one?

I'm not the original author, and I can only answer now based on my understanding. Maybe
the original author didn't think of the hash method, or he has weighed it out.

Hash is a good solution if only performance is required and memory overhead is not
considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.

Because I don't know what hash algorithm will be used, the cost of generating the
hash value corresponding to the symbol name is unknown now. But I think it's gonna
be small. But it definitely needs a simpler algorithm, the tool needs to implement
the same hash algorithm.

If the hash is not very uniform or ARRAY_SIZE(hashtable) is small, then my current
approach still makes sense. So maybe hash can be deferred to the next phase of
improvement.

> 
>   Luis
> .
>
Luis Chamberlain Oct. 26, 2022, 7:03 p.m. UTC | #5
On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
> On 2022/10/26 1:53, Luis Chamberlain wrote:
> > This answers how we don't use a hash table, the question was *should* we
> > use one?
> 
> I'm not the original author, and I can only answer now based on my understanding. Maybe
> the original author didn't think of the hash method, or he has weighed it out.
> 
> Hash is a good solution if only performance is required and memory overhead is not
> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
> 
> Because I don't know what hash algorithm will be used, the cost of generating the
> hash value corresponding to the symbol name is unknown now. But I think it's gonna
> be small. But it definitely needs a simpler algorithm, the tool needs to implement
> the same hash algorithm.

For instance, you can look at evaluating if alloc_large_system_hash() would help.

  Luis
Zhen Lei Oct. 27, 2022, 3:26 a.m. UTC | #6
On 2022/10/27 3:03, Luis Chamberlain wrote:
> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>> This answers how we don't use a hash table, the question was *should* we
>>> use one?
>>
>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>> the original author didn't think of the hash method, or he has weighed it out.
>>
>> Hash is a good solution if only performance is required and memory overhead is not
>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>>
>> Because I don't know what hash algorithm will be used, the cost of generating the
>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>> the same hash algorithm.
> 
> For instance, you can look at evaluating if alloc_large_system_hash() would help.

OK, I found the right hash function. In this way, the tool does not need to consider
the byte order.

include/linux/stringhash.h

/*
 * Version 1: one byte at a time.  Example of use:
 *
 * unsigned long hash = init_name_hash;
 * while (*p)
 *      hash = partial_name_hash(tolower(*p++), hash);
 * hash = end_name_hash(hash);


> 
>   Luis
> .
>
Zhen Lei Oct. 27, 2022, 6:27 a.m. UTC | #7
On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/10/27 3:03, Luis Chamberlain wrote:
>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>> This answers how we don't use a hash table, the question was *should* we
>>>> use one?
>>>
>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>> the original author didn't think of the hash method, or he has weighed it out.
>>>
>>> Hash is a good solution if only performance is required and memory overhead is not
>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.

Sorry, 1-2 million ==> 0.1~0.2 million

>>>
>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>> the same hash algorithm.
>>
>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
> 
> OK, I found the right hash function. In this way, the tool does not need to consider
> the byte order.

https://en.wikipedia.org/wiki/Jenkins_hash_function

Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
have to think about sizeof(long). It seems to be closest to our current needs.

uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
	size_t i = 0;
	uint32_t hash = 0;

	while (i != length) {
		hash += key[i++];
		hash += hash << 10;
		hash ^= hash >> 6;
	}
	hash += hash << 3;
	hash ^= hash >> 11;
	hash += hash << 15;

	return hash;
}

> 
> include/linux/stringhash.h
> 
> /*
>  * Version 1: one byte at a time.  Example of use:
>  *
>  * unsigned long hash = init_name_hash;
>  * while (*p)
>  *      hash = partial_name_hash(tolower(*p++), hash);
>  * hash = end_name_hash(hash);
> 
> 
>>
>>   Luis
>> .
>>
>
Zhen Lei Oct. 29, 2022, 8:10 a.m. UTC | #8
On 2022/10/27 14:27, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>> use one?
>>>>
>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>>> the original author didn't think of the hash method, or he has weighed it out.
>>>>
>>>> Hash is a good solution if only performance is required and memory overhead is not
>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
> 
> Sorry, 1-2 million ==> 0.1~0.2 million
> 
>>>>
>>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>>> the same hash algorithm.
>>>
>>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>>

The following three hash algorithms are compared. The kernel is compiled by defconfig
on arm64.

|---------------------------------------------------------------------------------------|
|                           |             hash &= HASH_TABLE_SIZE - 1                   |
|                           |             number of conflicts >= 1000                   |
|---------------------------------------------------------------------------------------|
|   ARRAY_SIZE(hash_table)  |       crc16        | jhash_one_at_a_time | string hash_32 |
|---------------------------------------------------------------------------------------|
|                           |     345b: 3905     |     0d40: 1013      |   4a4c: 6548   |
|                           |     35bb: 1016     |     35ce: 6549      |   883a: 1015   |
|        0x10000            |     385b: 6548     |     4440: 19126     |   d05f: 19129  |
|                           |     f0ba: 19127    |     7ebe: 3916      |   ecda: 3903   |
|---------------------------------------------------------------------------------------|
|                           |      0ba: 19168    |      440: 19165     |    05f: 19170  |
|                           |      45b: 3955     |      5ce: 6577      |    83a: 1066   |
|        0x1000             |      5bb: 1069     |      d40: 1052      |    a4c: 6609   |
|                           |      85b: 6582     |      ebe: 3938      |    cda: 3924   |
|---------------------------------------------------------------------------------------|

Based on the above test results, I conclude that:
1. For the worst-case scenario, the three algorithms are not much different. But the kernel
   only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so
   it is coupled with byte order and sizeof(long). So crc16 is the best choice.
2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash
   value,just over 1/10 of the total. And with my current compression-then-comparison
   approach, it's 25-30 times faster. So there's still a need for my current approach, and
   they can be combined.
   if (nr_conflicts(key) >= CONST_N) {
       newname = compress(name);
       for_each_name_in_slot(key): compare(new_name)
   } else {
       for_each_name_in_slot(key): compare(name)
   }

   Above CONST_N can be roughly calculated:
   time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name)
3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table)
   0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough.
   Statistic information:
   |------------------------------------------------------|
   |   nr_conflicts(key)  |     ARRAY_SIZE(hash_table)    |
   |------------------------------------------------------|
   |         <= ?         |     0x1000    |    0x10000    |
   |------------------------------------------------------|
   |             0        |         0     |      7821     |
   |            20        |        19     |     57375     |
   |            40        |      2419     |       124     |
   |            60        |      1343     |        70     |
   |            80        |       149     |        73     |
   |           100        |        38     |        49     |
   |           200        |       108     |        16     |
   |           400        |        14     |         2     |
   |           600        |         2     |         2     |
   |           800        |         0     |         0     |
   |          1000        |         0     |         0     |
   |        100000        |         4     |         4     |
   |------------------------------------------------------|


Also, I re-calculated:
Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)"
                                                   |---- What I said earlier was 4
The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y.

Hi, Luis:
  For the reasons of the above-mentioned second conclusion. And except for patches 4-6,
even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11
are also needed. If you don't mind, I'd like to use hash at the next stage. The current
patch set is already huge.


>> OK, I found the right hash function. In this way, the tool does not need to consider
>> the byte order.
> 
> https://en.wikipedia.org/wiki/Jenkins_hash_function
> 
> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
> have to think about sizeof(long). It seems to be closest to our current needs.
> 
> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
> 	size_t i = 0;
> 	uint32_t hash = 0;
> 
> 	while (i != length) {
> 		hash += key[i++];
> 		hash += hash << 10;
> 		hash ^= hash >> 6;
> 	}
> 	hash += hash << 3;
> 	hash ^= hash >> 11;
> 	hash += hash << 15;
> 
> 	return hash;
> }
> 
>>
>> include/linux/stringhash.h
>>
>> /*
>>  * Version 1: one byte at a time.  Example of use:
>>  *
>>  * unsigned long hash = init_name_hash;
>>  * while (*p)
>>  *      hash = partial_name_hash(tolower(*p++), hash);
>>  * hash = end_name_hash(hash);
>>
>>
>>>
>>>   Luis
>>> .
>>>
>>
>
David Laight Oct. 29, 2022, 12:49 p.m. UTC | #9
> >> On 2022/10/27 3:03, Luis Chamberlain wrote:
> >>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
> >>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
> >>>>> This answers how we don't use a hash table, the question was *should* we
> >>>>> use one?

(Probably brainfart) thought...

Is the current table (effectively) a sorted list of strings?
So the lookup is a binary chop - so O(log(n)).

But your hashes are having 'trouble' stopping one chain
being very long?
So a linear search of that hash chain is slow.
In fact that sort of hashed lookup in O(n).

What if the symbols were sorted by hash then name?
(Without putting the hash into each entry.)
Then the code could do a binary chop search over
the symbols with the same hash value.
The additional data is then an array of symbol numbers
indexed by the hash - 32 bits for each bucket.

If the hash table has 0x1000 entries it saves 12 compares.
(All of which are likely to be data cache misses.)

If you add the hash to each table entry then you can do
a binary chop search for the hash itself.
While this is the same search as is done for the strings
the comparison (just a number) will be faster than a
string compare.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Zhen Lei Oct. 31, 2022, 2:55 a.m. UTC | #10
On 2022/10/29 20:49, David Laight wrote:
>>>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>>>> use one?
> 
> (Probably brainfart) thought...
> 
> Is the current table (effectively) a sorted list of strings?
> So the lookup is a binary chop - so O(log(n)).

Currently not sorted.

> 
> But your hashes are having 'trouble' stopping one chain
> being very long?
> So a linear search of that hash chain is slow.
> In fact that sort of hashed lookup in O(n).

You've analyzed it very well. The hash method is not good for sorting names
and then searching in binary mode. I figured it out when I was working on
the design these days.

Current Implementation:
---------------------------------------
| idx | addresses | markers |  names  |
---------------------------------------
|  0  |    addr0  |         |  name0  |
|  1  |    addr1  |         |  name1  |
| ... |    addrx  |   [0]   |  namex  |
| 255 |    addrx  |         |  name255|
---------------------------------------
| 256 |  addr256  |         |  name256|
| ... |    addrx  |   [1]   |  namex  |
| 511 |  addr511  |         |  name511|
---------------------------------------

markers[0] = offset_of(name0)
markers[1] = offset_of(name256)

1. Find name by address
   binary search addresses[], get idx, traverse names[] from  markers[idx>>8] to markers[(idx>>8) + 1], return name

2. Find address by name
   traverse names[], get idx, return addresses[idx]

Hash Implementation:
Add two new arrays: hash_table[] and names_offsets[]

-----------------------------------------------------------------
| key |      hash_table       |         names_offsets           |
|---------------------------------------------------------------|
|  0  |  names_offsets[key=0] | offsets of all names with key=0 |
|  1  |  names_offsets[key=1] | offsets of all names with key=1 |
| ... |          ...          | offsets of all names with key=k |
|---------------------------------------------------------------|

hash_table[0] = 0
hash_table[1] = hash_table[0] + sizeof(names_offsets[0]) * number_of_names(key=0)
hash_table[2] = hash_table[1] + sizeof(names_offsets[0]) * number_of_names(key=1)

1. Find address by name
   hash name, get key, traverse names_offsets[] from index=hash_table[key] to
   index=hash_table[key+1], get the offset of name in names[]. binary search markers[],
   get index, then traverse names[] from  markers[index] to markers[index + 1], until
   match the offset of name, return addresses[idx].
2. Find address by name
   No change.

Sorted names Implementation:
Add two new arrays: offsets_of_addr_to_name[] and offsets_of_name[]

offsets_of_addr_to_name[i] = offset of name i in names[]
offsets_of_name[i]         = offset of sorted name i in names[]

1. Find name by address
   binary search addresses[], get idx, return names[offsets_of_addr_to_name[idx]]

2. Find address by name
   binary search offsets_of_name[], get idx, return addresses[idx]

> 
> What if the symbols were sorted by hash then name?
> (Without putting the hash into each entry.)
> Then the code could do a binary chop search over
> the symbols with the same hash value.
> The additional data is then an array of symbol numbers
> indexed by the hash - 32 bits for each bucket.
> 
> If the hash table has 0x1000 entries it saves 12 compares.
> (All of which are likely to be data cache misses.)
> 
> If you add the hash to each table entry then you can do
> a binary chop search for the hash itself.
> While this is the same search as is done for the strings
> the comparison (just a number) will be faster than a
> string compare.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
Zhen Lei Oct. 31, 2022, 4:55 a.m. UTC | #11
On 2022/10/29 16:10, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/10/27 14:27, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>>>
>>>
>>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>>> use one?
>>>>>
>>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>>>> the original author didn't think of the hash method, or he has weighed it out.
>>>>>
>>>>> Hash is a good solution if only performance is required and memory overhead is not
>>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>>
>> Sorry, 1-2 million ==> 0.1~0.2 million
>>
>>>>>
>>>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>>>> the same hash algorithm.
>>>>
>>>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>>>
> 
> The following three hash algorithms are compared. The kernel is compiled by defconfig
> on arm64.
> 
> |---------------------------------------------------------------------------------------|
> |                           |             hash &= HASH_TABLE_SIZE - 1                   |
> |                           |             number of conflicts >= 1000                   |
> |---------------------------------------------------------------------------------------|
> |   ARRAY_SIZE(hash_table)  |       crc16        | jhash_one_at_a_time | string hash_32 |
> |---------------------------------------------------------------------------------------|
> |                           |     345b: 3905     |     0d40: 1013      |   4a4c: 6548   |
> |                           |     35bb: 1016     |     35ce: 6549      |   883a: 1015   |
> |        0x10000            |     385b: 6548     |     4440: 19126     |   d05f: 19129  |
> |                           |     f0ba: 19127    |     7ebe: 3916      |   ecda: 3903   |
> |---------------------------------------------------------------------------------------|
> |                           |      0ba: 19168    |      440: 19165     |    05f: 19170  |
> |                           |      45b: 3955     |      5ce: 6577      |    83a: 1066   |
> |        0x1000             |      5bb: 1069     |      d40: 1052      |    a4c: 6609   |
> |                           |      85b: 6582     |      ebe: 3938      |    cda: 3924   |
> |---------------------------------------------------------------------------------------|
> 
> Based on the above test results, I conclude that:
> 1. For the worst-case scenario, the three algorithms are not much different. But the kernel
>    only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so
>    it is coupled with byte order and sizeof(long). So crc16 is the best choice.
> 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash
>    value,just over 1/10 of the total. And with my current compression-then-comparison
>    approach, it's 25-30 times faster. So there's still a need for my current approach, and
>    they can be combined.
>    if (nr_conflicts(key) >= CONST_N) {
>        newname = compress(name);
>        for_each_name_in_slot(key): compare(new_name)
>    } else {
>        for_each_name_in_slot(key): compare(name)
>    }
> 
>    Above CONST_N can be roughly calculated:
>    time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name)
> 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table)
>    0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough.
>    Statistic information:
>    |------------------------------------------------------|
>    |   nr_conflicts(key)  |     ARRAY_SIZE(hash_table)    |
>    |------------------------------------------------------|
>    |         <= ?         |     0x1000    |    0x10000    |
>    |------------------------------------------------------|
>    |             0        |         0     |      7821     |
>    |            20        |        19     |     57375     |
>    |            40        |      2419     |       124     |
>    |            60        |      1343     |        70     |
>    |            80        |       149     |        73     |
>    |           100        |        38     |        49     |
>    |           200        |       108     |        16     |
>    |           400        |        14     |         2     |
>    |           600        |         2     |         2     |
>    |           800        |         0     |         0     |
>    |          1000        |         0     |         0     |
>    |        100000        |         4     |         4     |
>    |------------------------------------------------------|
> 
> 
> Also, I re-calculated:
> Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)"
>                                                    |---- What I said earlier was 4
> The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y.
> 
> Hi, Luis:
>   For the reasons of the above-mentioned second conclusion. And except for patches 4-6,
> even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11
> are also needed. If you don't mind, I'd like to use hash at the next stage. The current
> patch set is already huge.

I just had an update in response to David Laight's email. The hash solution is like
a centrist. It doesn't seem very feasible.

Now, we need to make a decision. Choose one of the two:
1. Continue with my current approach. Improve the average performance of
   kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by:
   arm64 (defconfig):
     73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y.
     19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n.
   x86 (defconfig):
     49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y.
     16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n.
2. Sort names, binary search (The static function causes duplicate names. Additional work is required)
   2^18=262144, only up to 18 symbol expansions and comparisons are required.
   The performance is definitely excellent, although I haven't tested it yet.
   The memory overhead is increased by: 6 * kallsyms_num_syms
   arm64 (defconfig):
       1MiB if CONFIG_KALLSYMS_ALL=y.
     362KiB if CONFIG_KALLSYMS_ALL=n.
   x86 (defconfig):
     770KiB if CONFIG_KALLSYMS_ALL=y.
     356KiB if CONFIG_KALLSYMS_ALL=n.




> 
> 
>>> OK, I found the right hash function. In this way, the tool does not need to consider
>>> the byte order.
>>
>> https://en.wikipedia.org/wiki/Jenkins_hash_function
>>
>> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
>> have to think about sizeof(long). It seems to be closest to our current needs.
>>
>> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
>> 	size_t i = 0;
>> 	uint32_t hash = 0;
>>
>> 	while (i != length) {
>> 		hash += key[i++];
>> 		hash += hash << 10;
>> 		hash ^= hash >> 6;
>> 	}
>> 	hash += hash << 3;
>> 	hash ^= hash >> 11;
>> 	hash += hash << 15;
>>
>> 	return hash;
>> }
>>
>>>
>>> include/linux/stringhash.h
>>>
>>> /*
>>>  * Version 1: one byte at a time.  Example of use:
>>>  *
>>>  * unsigned long hash = init_name_hash;
>>>  * while (*p)
>>>  *      hash = partial_name_hash(tolower(*p++), hash);
>>>  * hash = end_name_hash(hash);
>>>
>>>
>>>>
>>>>   Luis
>>>> .
>>>>
>>>
>>
>
Zhen Lei Oct. 31, 2022, 3:04 p.m. UTC | #12
On 2022/10/31 12:55, Leizhen (ThunderTown) wrote:
> 
> 
> On 2022/10/29 16:10, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/27 14:27, Leizhen (ThunderTown) wrote:
>>>
>>>
>>> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>>>>
>>>>
>>>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>>>> use one?
>>>>>>
>>>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>>>>> the original author didn't think of the hash method, or he has weighed it out.
>>>>>>
>>>>>> Hash is a good solution if only performance is required and memory overhead is not
>>>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>>>
>>> Sorry, 1-2 million ==> 0.1~0.2 million
>>>
>>>>>>
>>>>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>>>>> the same hash algorithm.
>>>>>
>>>>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>>>>
>>
>> The following three hash algorithms are compared. The kernel is compiled by defconfig
>> on arm64.
>>
>> |---------------------------------------------------------------------------------------|
>> |                           |             hash &= HASH_TABLE_SIZE - 1                   |
>> |                           |             number of conflicts >= 1000                   |
>> |---------------------------------------------------------------------------------------|
>> |   ARRAY_SIZE(hash_table)  |       crc16        | jhash_one_at_a_time | string hash_32 |
>> |---------------------------------------------------------------------------------------|
>> |                           |     345b: 3905     |     0d40: 1013      |   4a4c: 6548   |
>> |                           |     35bb: 1016     |     35ce: 6549      |   883a: 1015   |
>> |        0x10000            |     385b: 6548     |     4440: 19126     |   d05f: 19129  |
>> |                           |     f0ba: 19127    |     7ebe: 3916      |   ecda: 3903   |
>> |---------------------------------------------------------------------------------------|
>> |                           |      0ba: 19168    |      440: 19165     |    05f: 19170  |
>> |                           |      45b: 3955     |      5ce: 6577      |    83a: 1066   |
>> |        0x1000             |      5bb: 1069     |      d40: 1052      |    a4c: 6609   |
>> |                           |      85b: 6582     |      ebe: 3938      |    cda: 3924   |
>> |---------------------------------------------------------------------------------------|
>>
>> Based on the above test results, I conclude that:
>> 1. For the worst-case scenario, the three algorithms are not much different. But the kernel
>>    only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so
>>    it is coupled with byte order and sizeof(long). So crc16 is the best choice.
>> 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash
>>    value,just over 1/10 of the total. And with my current compression-then-comparison
>>    approach, it's 25-30 times faster. So there's still a need for my current approach, and
>>    they can be combined.
>>    if (nr_conflicts(key) >= CONST_N) {
>>        newname = compress(name);
>>        for_each_name_in_slot(key): compare(new_name)
>>    } else {
>>        for_each_name_in_slot(key): compare(name)
>>    }
>>
>>    Above CONST_N can be roughly calculated:
>>    time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name)
>> 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table)
>>    0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough.
>>    Statistic information:
>>    |------------------------------------------------------|
>>    |   nr_conflicts(key)  |     ARRAY_SIZE(hash_table)    |
>>    |------------------------------------------------------|
>>    |         <= ?         |     0x1000    |    0x10000    |
>>    |------------------------------------------------------|
>>    |             0        |         0     |      7821     |
>>    |            20        |        19     |     57375     |
>>    |            40        |      2419     |       124     |
>>    |            60        |      1343     |        70     |
>>    |            80        |       149     |        73     |
>>    |           100        |        38     |        49     |
>>    |           200        |       108     |        16     |
>>    |           400        |        14     |         2     |
>>    |           600        |         2     |         2     |
>>    |           800        |         0     |         0     |
>>    |          1000        |         0     |         0     |
>>    |        100000        |         4     |         4     |
>>    |------------------------------------------------------|
>>
>>
>> Also, I re-calculated:
>> Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)"
>>                                                    |---- What I said earlier was 4
>> The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y.
>>
>> Hi, Luis:
>>   For the reasons of the above-mentioned second conclusion. And except for patches 4-6,
>> even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11
>> are also needed. If you don't mind, I'd like to use hash at the next stage. The current
>> patch set is already huge.
> 
> I just had an update in response to David Laight's email. The hash solution is like
> a centrist. It doesn't seem very feasible.
> 
> Now, we need to make a decision. Choose one of the two:
> 1. Continue with my current approach. Improve the average performance of
>    kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by:
>    arm64 (defconfig):
>      73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y.
>      19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n.
>    x86 (defconfig):
>      49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y.
>      16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n.
> 2. Sort names, binary search (The static function causes duplicate names. Additional work is required)
>    2^18=262144, only up to 18 symbol expansions and comparisons are required.
>    The performance is definitely excellent, although I haven't tested it yet.
>    The memory overhead is increased by: 6 * kallsyms_num_syms
>    arm64 (defconfig):
>        1MiB if CONFIG_KALLSYMS_ALL=y.
>      362KiB if CONFIG_KALLSYMS_ALL=n.
>    x86 (defconfig):
>      770KiB if CONFIG_KALLSYMS_ALL=y.
>      356KiB if CONFIG_KALLSYMS_ALL=n.
> 

Preliminary Test Results: (On Qemu arm64)
[   73.049249] kallsyms_selftest: kallsyms_lookup_name() looked up 151880 symbols
[   73.049331] kallsyms_selftest: The time spent on each symbol is (ns): min=1088, max=46848, avg=6629

> 
> 
> 
>>
>>
>>>> OK, I found the right hash function. In this way, the tool does not need to consider
>>>> the byte order.
>>>
>>> https://en.wikipedia.org/wiki/Jenkins_hash_function
>>>
>>> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
>>> have to think about sizeof(long). It seems to be closest to our current needs.
>>>
>>> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
>>> 	size_t i = 0;
>>> 	uint32_t hash = 0;
>>>
>>> 	while (i != length) {
>>> 		hash += key[i++];
>>> 		hash += hash << 10;
>>> 		hash ^= hash >> 6;
>>> 	}
>>> 	hash += hash << 3;
>>> 	hash ^= hash >> 11;
>>> 	hash += hash << 15;
>>>
>>> 	return hash;
>>> }
>>>
>>>>
>>>> include/linux/stringhash.h
>>>>
>>>> /*
>>>>  * Version 1: one byte at a time.  Example of use:
>>>>  *
>>>>  * unsigned long hash = init_name_hash;
>>>>  * while (*p)
>>>>  *      hash = partial_name_hash(tolower(*p++), hash);
>>>>  * hash = end_name_hash(hash);
>>>>
>>>>
>>>>>
>>>>>   Luis
>>>>> .
>>>>>
>>>>
>>>
>>
>
Zhen Lei Nov. 2, 2022, 9:18 a.m. UTC | #13
On 2022/10/31 23:04, Leizhen (ThunderTown) wrote:
> Now, we need to make a decision. Choose one of the two:
> 1. Continue with my current approach. Improve the average performance of
>    kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by:
>    arm64 (defconfig):
>      73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y.
>      19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n.
>    x86 (defconfig):
>      49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y.
>      16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n.
> 2. Sort names, binary search (The static function causes duplicate names. Additional work is required)
>    2^18=262144, only up to 18 symbol expansions and comparisons are required.
>    The performance is definitely excellent, although I haven't tested it yet.
>    The memory overhead is increased by: 6 * kallsyms_num_syms
>    arm64 (defconfig):
>        1MiB if CONFIG_KALLSYMS_ALL=y.
>      362KiB if CONFIG_KALLSYMS_ALL=n.
>    x86 (defconfig):
>      770KiB if CONFIG_KALLSYMS_ALL=y.
>      356KiB if CONFIG_KALLSYMS_ALL=n.

Hi, Luis:
  I've implemented v8 based on method 2(Sort names, binary search).
The memory overhead is increased by: 3 * kallsyms_num_syms.
kallsyms_offsets_of_names[] is not added in v8 because it does not help
much in performance improvement, so save (3 * kallsyms_num_syms) bytes.
For details about the performance data, please see the commit message.