[RFC,0/2] improve vmalloc allocation

Message ID	20181019173538.590-1-urezki@gmail.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of urezki@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; From: "Uladzislau Rezki (Sony)" <urezki@gmail.com> To: Matthew Wilcox <willy@infradead.org>, Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org Cc: LKML <linux-kernel@vger.kernel.org>, Michal Hocko <mhocko@suse.com>, Thomas Garnier <thgarnie@google.com>, Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>, Steven Rostedt <rostedt@goodmis.org>, Joel Fernandes <joelaf@google.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu>, Tejun Heo <tj@kernel.org>, "Uladzislau Rezki (Sony)" <urezki@gmail.com> Subject: [RFC PATCH 0/2] improve vmalloc allocation Date: Fri, 19 Oct 2018 19:35:36 +0200 Message-Id: <20181019173538.590-1-urezki@gmail.com> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	improve vmalloc allocation \| expand [RFC,0/2] improve vmalloc allocation [RFC,1/2] mm/vmalloc: keep track of free blocks for allocation [RFC,2/2] mm: add priority threshold to __purge_vmap_area_lazy()

Uladzislau Rezki Oct. 19, 2018, 5:35 p.m. UTC

Objective
---------
Initiative of improving vmalloc allocator comes from getting many issues
related to allocation time, i.e. sometimes it is terribly slow. As a result
many workloads which are sensitive for long (more than 1 millisecond) preemption
off scenario are affected by that slowness(test cases like UI or audio, etc.).

The problem is that, currently an allocation of the new VA area is done over
busy list iteration until a suitable hole is found between two busy areas.
Therefore each new allocation causes the list being grown. Due to long list
and different permissive parameters an allocation can take a long time on
embedded devices(milliseconds).

Description
-----------
This approach keeps track of free blocks and allocation is done over free list
iteration instead. During initialization phase the vmalloc memory layout is
organized into one free area(can be more) building free double linked list
within VMALLOC_START-VMALLOC_END range.

Proposed algorithm uses red-black tree that keeps blocks sorted by their offsets
in pair with linked list keeping the free space in order of increasing addresses.

Allocation. It uses a first-fit policy. To allocate a new block a search is done
over free list areas until a first suitable block is large enough to encompass
the requested size. If the block is bigger than requested size - it is split.

A free block can be split by three different ways. Their names are FL_FIT_TYPE,
LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested
size and alignment fit to a free block.

FL_FIT_TYPE - in this case a free block is just removed from the free list/tree
because it fully fits. Comparing with current design there is an extra work with
rb-tree updating.

LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is
just cutting a free block. It is as fast as a current design. Most of the vmalloc
allocations just end up with this case, because the edge is always aligned to 1.

NE_FIT_TYPE - Is much less common case. Basically it happens when requested size
and alignment does not fit left nor right edges, i.e. it is between them. In this
case during splitting we have to build a remaining left free area and place it
back to the free list/tree.

Comparing with current design there are two extra steps. First one is we have to
allocate a new vmap_area structure. Second one we have to insert that remaining
free block to the address sorted list/tree.

In order to optimize a first case there is a cache with free_vmap objects. Instead
of allocating from slab we just take an object from the cache and reuse it.

Second one is pretty optimized. Since we know a start point in the tree we do not
do a search from the top. Instead a traversal begins from a rb-tree node we split.

De-allocation. Red-black tree allows efficiently find a spot in the tree whereas
a linked list allows fast access to neighbors, thus a fast merge of de-allocated
memory chunks with existing free blocks creating large coalesced areas. It means
comparing with current design there is an extra step we have to done when a block
is freed.

In order to optimize a merge logic a free vmap area is not inserted straight
away into free structures. Instead we find a place in the rbtree where a free
block potentially can be inserted. Its parent node and left or right direction.
Knowing that, we can identify future next/prev list nodes, thus at this point
it becomes possible to check if a block can be merged. If not, just link it.

Test environment
----------------
I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another
is HiKey960(arm64) board. i5-3320M runs on 4.18 kernel, whereas Hikey960
uses 4.15 linaro kernel.

i5-3320M:
set performance governor
echo 0 > /proc/sys/kernel/nmi_watchdog
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Stability and stress tests
--------------------------
As for stress testing of the new approach, i wrote a small patch with
different test cases and allocations methods to make a pressure on
vmalloc subsystem. In short, it runs different kind of allocation
tests simultaneously on each online CPU with different run-time(few days).

A test-suite patch you can find here, it is based on 4.18 kernel.
ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch

Below are some issues i run into during stability check phase(default kernel).

1) This Kernel BUG can be triggered by align_shift_alloc_test() stress test.
See it in test-suite patch:

<snip>
[66970.279289] kernel BUG at mm/vmalloc.c:512!
[66970.279363] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[66970.279411] CPU: 1 PID: 652 Comm: insmod Tainted: G O 4.18.0+ #741
[66970.279463] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[66970.279525] RIP: 0010:alloc_vmap_area+0x358/0x370
<snip>

Patched version does not suffer from that BUG().

2) This one has been introduced by commit 763b218ddfaf. Which introduced some
preemption points into __purge_vmap_area_lazy(). Under heavy simultaneous
allocations/releases number of lazy pages can easily go beyond millions
hitting out_of_memory -> panic or making operations like: allocation, lookup,
unmap, remove, etc. much slower.

It is fixed by second commit in this series. Please see more description in
the commit message of the patch.

3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
i.e. there is a memory leek somewhere in percpu allocator. It sounds like
a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
Resulting in memory leaking or "Kernel panic":

---[ end Kernel panic - not syncing: Out of memory and no killable processes...

There is no fix for that.

Performance test results
------------------------
I run 5 different tests to compare the performance between the new approach
and current one. Since there are three different type of allocations i wanted
to compare them with default version, apart of those there are two extra.
One allocates in long busy list condition. Another one does random number
of pages allocation(will not post here to keep it short).

- reboot the system;
- do three iteration of full run. One run is 5 tests;
- calculate average of three run(microseconds).

i5-3320M:
le_fit_alloc_test b_fit_alloc_test
1218459 vs 1146597 diff 5% 972322 vs 1008655 diff -3.74%
1219721 vs 1145212 diff 6% 1013817 vs 994195 diff 1.94%
1226255 vs 1142134 diff 6% 1002057 vs 993364 diff 0.87%
1239828 vs 1144809 diff 7% 985092 vs 977549 diff 0.77%
1232131 vs 1144775 diff 7% 1031320 vs 999952 diff 3.04%

ne_fit_alloc_test long_busy_list_alloc_test
2056336 vs 2043121 diff 0.64% 55866315 vs 15037680 diff 73%
2043136 vs 2041888 diff 0.06% 57601435 vs 14809454 diff 74%
2042384 vs 2040181 diff 0.11% 52612371 vs 14550292 diff 72%
2041287 vs 2038905 diff 0.12% 48894648 vs 14769538 diff 69%
2039014 vs 2038632 diff 0.02% 55718063 vs 14727350 diff 73%

Hikey960:
le_fit_alloc_test b_fit_alloc_test
2382168 vs 2115617 diff 11.19% 2864080 vs 2081309 diff 27.33%
2772909 vs 2114988 diff 23.73% 2968612 vs 2062716 diff 30.52%
2772579 vs 2113069 diff 23.79% 2748495 vs 2106658 diff 23.35%
2770596 vs 2111823 diff 23.78% 2966023 vs 2071139 diff 30.17%
2759768 vs 2111868 diff 23.48% 2765568 vs 2125216 diff 23.15%

ne_fit_alloc_test long_busy_list_alloc_test
4353846 vs 4241838 diff 2.57 239789754 vs 33364305 diff 86%
4133506 vs 4241615 diff -2.62 778283461 vs 34551548 diff 95%
4134769 vs 4240714 diff -2.56 210244212 vs 33467529 diff 84%
4132224 vs 4242281 diff -2.66 429232377 vs 33307173 diff 92%
4410969 vs 4240864 diff 3.86 527560967 vs 33661115 diff 93%

Almost all results are better. Full data and the test module you can find here:

ftp://vps418301.ovh.net/incoming/vmalloc_test_module.tar.bz2
ftp://vps418301.ovh.net/incoming/HiKey960_test_result.txt
ftp://vps418301.ovh.net/incoming/i5-3320M_test_result.txt

Conclusion
----------
According to provided results and my subjective opinion, it is worth to organize
and maintain a free list and do an allocation based on it.

Appreciate for any valuable comments and sorry for the long description :)

Best Regards,
Uladzislau Rezki

Uladzislau Rezki (Sony) (2):
mm/vmalloc: keep track of free blocks for allocation
mm: add priority threshold to __purge_vmap_area_lazy()

include/linux/vmalloc.h | 2 +-
mm/vmalloc.c | 850 ++++++++++++++++++++++++++++++++++++++----------
2 files changed, 676 insertions(+), 176 deletions(-)

Roman Gushchin Oct. 19, 2018, 10:44 p.m. UTC | #1

On Fri, Oct 19, 2018 at 07:35:36PM +0200, Uladzislau Rezki (Sony) wrote:
> Objective
> ---------
> Initiative of improving vmalloc allocator comes from getting many issues
> related to allocation time, i.e. sometimes it is terribly slow. As a result
> many workloads which are sensitive for long (more than 1 millisecond) preemption
> off scenario are affected by that slowness(test cases like UI or audio, etc.).
> 
> The problem is that, currently an allocation of the new VA area is done over
> busy list iteration until a suitable hole is found between two busy areas.
> Therefore each new allocation causes the list being grown. Due to long list
> and different permissive parameters an allocation can take a long time on
> embedded devices(milliseconds).
...
> 3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
> stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
> i.e. there is a memory leek somewhere in percpu allocator. It sounds like
> a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
> Resulting in memory leaking or "Kernel panic":
> 

Can you, please, try the following patch:
6685b357363b ("percpu: stop leaking bitmap metadata blocks") ?

BTW, with growing number of vmalloc users (per-cpu allocator and bpf stuff are
big drivers), I find the patchset very interesting.

Thanks!

Joel Fernandes Oct. 20, 2018, 12:11 a.m. UTC | #2

On Fri, Oct 19, 2018 at 07:35:36PM +0200, Uladzislau Rezki (Sony) wrote:
> Objective
> ---------
> Initiative of improving vmalloc allocator comes from getting many issues
> related to allocation time, i.e. sometimes it is terribly slow. As a result
> many workloads which are sensitive for long (more than 1 millisecond) preemption
> off scenario are affected by that slowness(test cases like UI or audio, etc.).
> 
> The problem is that, currently an allocation of the new VA area is done over
> busy list iteration until a suitable hole is found between two busy areas.
> Therefore each new allocation causes the list being grown. Due to long list
> and different permissive parameters an allocation can take a long time on
> embedded devices(milliseconds).

I am not super familiar with the vmap allocation code, it has been some
years. But I have 2 comments:

(1) It seems the issue you are reporting is the walking of the list in
alloc_vmap_area().

Can we not solve this by just simplifying the following code?

	/* from the starting point, walk areas until a suitable hole is found
	 */
	while (addr + size > first->va_start && addr + size <= vend) {
		if (addr + cached_hole_size < first->va_start)
			cached_hole_size = first->va_start - addr;
		addr = ALIGN(first->va_end, align);
		if (addr + size < addr)
			goto overflow;

		if (list_is_last(&first->list, &vmap_area_list))
			goto found;

		first = list_next_entry(first, list);
	}

Instead of going through the vmap_area_list, can we not just binary search
the existing address-sorted vmap_area_root rbtree to find a hole? If yes,
that would bring down the linear search overhead. If not, why not?

(2) I am curious, do you have any measurements of how much time
alloc_vmap_area() is taking? You mentioned it takes milliseconds but I was
wondering if you had more finer grained function profiling measurements. And
also any data on how big are the lists at the time you see this issue.

Good to see an effort on improving this, thanks!

 - Joel


> Description
> -----------
> This approach keeps track of free blocks and allocation is done over free list
> iteration instead. During initialization phase the vmalloc memory layout is
> organized into one free area(can be more) building free double linked list
> within VMALLOC_START-VMALLOC_END range.
> 
> Proposed algorithm uses red-black tree that keeps blocks sorted by their offsets
> in pair with linked list keeping the free space in order of increasing addresses.
> 
> Allocation. It uses a first-fit policy. To allocate a new block a search is done
> over free list areas until a first suitable block is large enough to encompass
> the requested size. If the block is bigger than requested size - it is split.
> 
> A free block can be split by three different ways. Their names are FL_FIT_TYPE,
> LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested
> size and alignment fit to a free block.
> 
> FL_FIT_TYPE - in this case a free block is just removed from the free list/tree
> because it fully fits. Comparing with current design there is an extra work with
> rb-tree updating.
> 
> LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is
> just cutting a free block. It is as fast as a current design. Most of the vmalloc
> allocations just end up with this case, because the edge is always aligned to 1.
> 
> NE_FIT_TYPE - Is much less common case. Basically it happens when requested size
> and alignment does not fit left nor right edges, i.e. it is between them. In this
> case during splitting we have to build a remaining left free area and place it
> back to the free list/tree.
> 
> Comparing with current design there are two extra steps. First one is we have to
> allocate a new vmap_area structure. Second one we have to insert that remaining 
> free block to the address sorted list/tree.
> 
> In order to optimize a first case there is a cache with free_vmap objects. Instead
> of allocating from slab we just take an object from the cache and reuse it.
> 
> Second one is pretty optimized. Since we know a start point in the tree we do not
> do a search from the top. Instead a traversal begins from a rb-tree node we split.
> 
> De-allocation. Red-black tree allows efficiently find a spot in the tree whereas
> a linked list allows fast access to neighbors, thus a fast merge of de-allocated
> memory chunks with existing free blocks creating large coalesced areas. It means
> comparing with current design there is an extra step we have to done when a block
> is freed.
> 
> In order to optimize a merge logic a free vmap area is not inserted straight
> away into free structures. Instead we find a place in the rbtree where a free
> block potentially can be inserted. Its parent node and left or right direction.
> Knowing that, we can identify future next/prev list nodes, thus at this point
> it becomes possible to check if a block can be merged. If not, just link it.
> 
> Test environment
> ----------------
> I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another
> is HiKey960(arm64) board. i5-3320M runs on 4.18 kernel, whereas Hikey960
> uses 4.15 linaro kernel.
> 
> i5-3320M:
> set performance governor
> echo 0 > /proc/sys/kernel/nmi_watchdog
> echo -1 > /proc/sys/kernel/perf_event_paranoid
> echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> 
> Stability and stress tests
> --------------------------
> As for stress testing of the new approach, i wrote a small patch with
> different test cases and allocations methods to make a pressure on
> vmalloc subsystem. In short, it runs different kind of allocation
> tests simultaneously on each online CPU with different run-time(few days).
> 
> A test-suite patch you can find here, it is based on 4.18 kernel.
> ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
> 
> Below are some issues i run into during stability check phase(default kernel).
> 
> 1) This Kernel BUG can be triggered by align_shift_alloc_test() stress test.
> See it in test-suite patch:
> 
> <snip>
> [66970.279289] kernel BUG at mm/vmalloc.c:512!
> [66970.279363] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [66970.279411] CPU: 1 PID: 652 Comm: insmod Tainted: G           O      4.18.0+ #741
> [66970.279463] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [66970.279525] RIP: 0010:alloc_vmap_area+0x358/0x370
> <snip>
> 
> Patched version does not suffer from that BUG().
> 
> 2) This one has been introduced by commit 763b218ddfaf. Which introduced some
> preemption points into __purge_vmap_area_lazy(). Under heavy simultaneous
> allocations/releases number of lazy pages can easily go beyond millions
> hitting out_of_memory -> panic or making operations like: allocation, lookup,
> unmap, remove, etc. much slower.
> 
> It is fixed by second commit in this series. Please see more description in
> the commit message of the patch.
> 
> 3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
> stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
> i.e. there is a memory leek somewhere in percpu allocator. It sounds like
> a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
> Resulting in memory leaking or "Kernel panic":
> 
> ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
> 
> There is no fix for that.
> 
> Performance test results
> ------------------------
> I run 5 different tests to compare the performance between the new approach
> and current one. Since there are three different type of allocations i wanted
> to compare them with default version, apart of those there are two extra.
> One allocates in long busy list condition. Another one does random number
> of pages allocation(will not post here to keep it short).
> 
> - reboot the system;
> - do three iteration of full run. One run is 5 tests;
> - calculate average of three run(microseconds).
> 
> i5-3320M:
> le_fit_alloc_test                b_fit_alloc_test
> 1218459 vs 1146597 diff 5%       972322 vs 1008655 diff -3.74%
> 1219721 vs 1145212 diff 6%      1013817 vs  994195 diff  1.94%
> 1226255 vs 1142134 diff 6%      1002057 vs  993364 diff  0.87%
> 1239828 vs 1144809 diff 7%       985092 vs  977549 diff  0.77%
> 1232131 vs 1144775 diff 7%      1031320 vs  999952 diff  3.04%
> 
> ne_fit_alloc_test                long_busy_list_alloc_test
> 2056336 vs 2043121 diff 0.64%    55866315 vs 15037680 diff 73%
> 2043136 vs 2041888 diff 0.06%    57601435 vs 14809454 diff 74%
> 2042384 vs 2040181 diff 0.11%    52612371 vs 14550292 diff 72%
> 2041287 vs 2038905 diff 0.12%    48894648 vs 14769538 diff 69%
> 2039014 vs 2038632 diff 0.02%    55718063 vs 14727350 diff 73%
> 
> Hikey960:
> le_fit_alloc_test                b_fit_alloc_test
> 2382168 vs 2115617 diff 11.19%   2864080 vs 2081309 diff 27.33%
> 2772909 vs 2114988 diff 23.73%   2968612 vs 2062716 diff 30.52%
> 2772579 vs 2113069 diff 23.79%   2748495 vs 2106658 diff 23.35%
> 2770596 vs 2111823 diff 23.78%   2966023 vs 2071139 diff 30.17%
> 2759768 vs 2111868 diff 23.48%   2765568 vs 2125216 diff 23.15%
> 
> ne_fit_alloc_test                long_busy_list_alloc_test
> 4353846 vs 4241838 diff  2.57    239789754 vs 33364305 diff 86%
> 4133506 vs 4241615 diff -2.62    778283461 vs 34551548 diff 95%
> 4134769 vs 4240714 diff -2.56    210244212 vs 33467529 diff 84%
> 4132224 vs 4242281 diff -2.66    429232377 vs 33307173 diff 92%
> 4410969 vs 4240864 diff  3.86    527560967 vs 33661115 diff 93%
> 
> Almost all results are better. Full data and the test module you can find here:
> 
> ftp://vps418301.ovh.net/incoming/vmalloc_test_module.tar.bz2
> ftp://vps418301.ovh.net/incoming/HiKey960_test_result.txt
> ftp://vps418301.ovh.net/incoming/i5-3320M_test_result.txt
> 
> Conclusion
> ----------
> According to provided results and my subjective opinion, it is worth to organize
> and maintain a free list and do an allocation based on it.
> 
> Appreciate for any valuable comments and sorry for the long description :)
> 
> Best Regards,
> Uladzislau Rezki
> 
> Uladzislau Rezki (Sony) (2):
>   mm/vmalloc: keep track of free blocks for allocation
>   mm: add priority threshold to __purge_vmap_area_lazy()
> 
>  include/linux/vmalloc.h |   2 +-
>  mm/vmalloc.c            | 850 ++++++++++++++++++++++++++++++++++++++----------
>  2 files changed, 676 insertions(+), 176 deletions(-)
> 
> -- 
> 2.11.0
>

Michal Hocko Oct. 22, 2018, 12:51 p.m. UTC | #3

Hi,
I haven't read through the implementation yet but I have say that I
really love this cover letter. It is clear on intetion, it covers design
from high level enough to start discussion and provides a very nice
testing coverage. Nice work!

I also think that we need a better performing vmalloc implementation
long term because of the increasing number of kvmalloc users.

I just have two mostly workflow specific comments.

> A test-suite patch you can find here, it is based on 4.18 kernel.
> ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch

Can you fit this stress test into the standard self test machinery?

> It is fixed by second commit in this series. Please see more description in
> the commit message of the patch.

Bug fixes should go first and new functionality should be built on top.
A kernel crash sounds serious enough to have a fix marked for stable. If
the fix is too hard/complex then we might consider a revert of the
faulty commit.
> 
> 3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
> stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
> i.e. there is a memory leek somewhere in percpu allocator. It sounds like
> a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
> Resulting in memory leaking or "Kernel panic":
> 
> ---[ end Kernel panic - not syncing: Out of memory and no killable processes...

It would be great to pin point this one down before the rework as well.

Thanks a lot!

Uladzislau Rezki Oct. 22, 2018, 2:01 p.m. UTC | #4

On Fri, Oct 19, 2018 at 10:44:39PM +0000, Roman Gushchin wrote:
> On Fri, Oct 19, 2018 at 07:35:36PM +0200, Uladzislau Rezki (Sony) wrote:
> > Objective
> > ---------
> > Initiative of improving vmalloc allocator comes from getting many issues
> > related to allocation time, i.e. sometimes it is terribly slow. As a result
> > many workloads which are sensitive for long (more than 1 millisecond) preemption
> > off scenario are affected by that slowness(test cases like UI or audio, etc.).
> > 
> > The problem is that, currently an allocation of the new VA area is done over
> > busy list iteration until a suitable hole is found between two busy areas.
> > Therefore each new allocation causes the list being grown. Due to long list
> > and different permissive parameters an allocation can take a long time on
> > embedded devices(milliseconds).
> ...
> > 3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
> > stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
> > i.e. there is a memory leek somewhere in percpu allocator. It sounds like
> > a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
> > Resulting in memory leaking or "Kernel panic":
> > 
> 
> Can you, please, try the following patch:
> 6685b357363b ("percpu: stop leaking bitmap metadata blocks") ?
>
I have tested that patch. It fixes the leak for sure. Thank you for a
good point.

> 
> BTW, with growing number of vmalloc users (per-cpu allocator and bpf stuff are
> big drivers), I find the patchset very interesting.
> 
> Thanks!
Thank you!

--
Vlad Rezki

Uladzislau Rezki Oct. 22, 2018, 2:50 p.m. UTC | #5

On Fri, Oct 19, 2018 at 05:11:45PM -0700, Joel Fernandes wrote:
> On Fri, Oct 19, 2018 at 07:35:36PM +0200, Uladzislau Rezki (Sony) wrote:
> > Objective
> > ---------
> > Initiative of improving vmalloc allocator comes from getting many issues
> > related to allocation time, i.e. sometimes it is terribly slow. As a result
> > many workloads which are sensitive for long (more than 1 millisecond) preemption
> > off scenario are affected by that slowness(test cases like UI or audio, etc.).
> > 
> > The problem is that, currently an allocation of the new VA area is done over
> > busy list iteration until a suitable hole is found between two busy areas.
> > Therefore each new allocation causes the list being grown. Due to long list
> > and different permissive parameters an allocation can take a long time on
> > embedded devices(milliseconds).
> 
> I am not super familiar with the vmap allocation code, it has been some
> years. But I have 2 comments:
> 
> (1) It seems the issue you are reporting is the walking of the list in
> alloc_vmap_area().
> 
> Can we not solve this by just simplifying the following code?
> 
> 	/* from the starting point, walk areas until a suitable hole is found
> 	 */
> 	while (addr + size > first->va_start && addr + size <= vend) {
> 		if (addr + cached_hole_size < first->va_start)
> 			cached_hole_size = first->va_start - addr;
> 		addr = ALIGN(first->va_end, align);
> 		if (addr + size < addr)
> 			goto overflow;
> 
> 		if (list_is_last(&first->list, &vmap_area_list))
> 			goto found;
> 
> 		first = list_next_entry(first, list);
> 	}
> 
> Instead of going through the vmap_area_list, can we not just binary search
> the existing address-sorted vmap_area_root rbtree to find a hole? If yes,
> that would bring down the linear search overhead. If not, why not?
>
vmap_area_root rb-tree is used for fast access to vmap_area knowing
the address(any va_start). That is why we use the tree. To use that tree
in order to check holes will require to start from the left most node or
specified "vstart" and move forward by rb_next(). What is much slower
than regular(list_next_entry O(1)) access in this case. 

> 
> (2) I am curious, do you have any measurements of how much time
> alloc_vmap_area() is taking? You mentioned it takes milliseconds but I was
> wondering if you had more finer grained function profiling measurements. And
> also any data on how big are the lists at the time you see this issue.
> 
Basically it depends on how much or heavily your system uses vmalloc
allocations. I was using CONFIG_DEBUG_PREEMPT with an extra patch. See it
here: ftp://vps418301.ovh.net/incoming/0001-tracing-track-preemption-disable-callers.patch

As for list size. It can be easily thousands.

> Good to see an effort on improving this, thanks!
> 
>  - Joel
Thank you!

> 
> 
> > Description
> > -----------
> > This approach keeps track of free blocks and allocation is done over free list
> > iteration instead. During initialization phase the vmalloc memory layout is
> > organized into one free area(can be more) building free double linked list
> > within VMALLOC_START-VMALLOC_END range.
> > 
> > Proposed algorithm uses red-black tree that keeps blocks sorted by their offsets
> > in pair with linked list keeping the free space in order of increasing addresses.
> > 
> > Allocation. It uses a first-fit policy. To allocate a new block a search is done
> > over free list areas until a first suitable block is large enough to encompass
> > the requested size. If the block is bigger than requested size - it is split.
> > 
> > A free block can be split by three different ways. Their names are FL_FIT_TYPE,
> > LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they correspond to how requested
> > size and alignment fit to a free block.
> > 
> > FL_FIT_TYPE - in this case a free block is just removed from the free list/tree
> > because it fully fits. Comparing with current design there is an extra work with
> > rb-tree updating.
> > 
> > LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case what we do is
> > just cutting a free block. It is as fast as a current design. Most of the vmalloc
> > allocations just end up with this case, because the edge is always aligned to 1.
> > 
> > NE_FIT_TYPE - Is much less common case. Basically it happens when requested size
> > and alignment does not fit left nor right edges, i.e. it is between them. In this
> > case during splitting we have to build a remaining left free area and place it
> > back to the free list/tree.
> > 
> > Comparing with current design there are two extra steps. First one is we have to
> > allocate a new vmap_area structure. Second one we have to insert that remaining 
> > free block to the address sorted list/tree.
> > 
> > In order to optimize a first case there is a cache with free_vmap objects. Instead
> > of allocating from slab we just take an object from the cache and reuse it.
> > 
> > Second one is pretty optimized. Since we know a start point in the tree we do not
> > do a search from the top. Instead a traversal begins from a rb-tree node we split.
> > 
> > De-allocation. Red-black tree allows efficiently find a spot in the tree whereas
> > a linked list allows fast access to neighbors, thus a fast merge of de-allocated
> > memory chunks with existing free blocks creating large coalesced areas. It means
> > comparing with current design there is an extra step we have to done when a block
> > is freed.
> > 
> > In order to optimize a merge logic a free vmap area is not inserted straight
> > away into free structures. Instead we find a place in the rbtree where a free
> > block potentially can be inserted. Its parent node and left or right direction.
> > Knowing that, we can identify future next/prev list nodes, thus at this point
> > it becomes possible to check if a block can be merged. If not, just link it.
> > 
> > Test environment
> > ----------------
> > I have used two systems to test. One is i5-3320M CPU @ 2.60GHz and another
> > is HiKey960(arm64) board. i5-3320M runs on 4.18 kernel, whereas Hikey960
> > uses 4.15 linaro kernel.
> > 
> > i5-3320M:
> > set performance governor
> > echo 0 > /proc/sys/kernel/nmi_watchdog
> > echo -1 > /proc/sys/kernel/perf_event_paranoid
> > echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
> > 
> > Stability and stress tests
> > --------------------------
> > As for stress testing of the new approach, i wrote a small patch with
> > different test cases and allocations methods to make a pressure on
> > vmalloc subsystem. In short, it runs different kind of allocation
> > tests simultaneously on each online CPU with different run-time(few days).
> > 
> > A test-suite patch you can find here, it is based on 4.18 kernel.
> > ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
> > 
> > Below are some issues i run into during stability check phase(default kernel).
> > 
> > 1) This Kernel BUG can be triggered by align_shift_alloc_test() stress test.
> > See it in test-suite patch:
> > 
> > <snip>
> > [66970.279289] kernel BUG at mm/vmalloc.c:512!
> > [66970.279363] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [66970.279411] CPU: 1 PID: 652 Comm: insmod Tainted: G           O      4.18.0+ #741
> > [66970.279463] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > [66970.279525] RIP: 0010:alloc_vmap_area+0x358/0x370
> > <snip>
> > 
> > Patched version does not suffer from that BUG().
> > 
> > 2) This one has been introduced by commit 763b218ddfaf. Which introduced some
> > preemption points into __purge_vmap_area_lazy(). Under heavy simultaneous
> > allocations/releases number of lazy pages can easily go beyond millions
> > hitting out_of_memory -> panic or making operations like: allocation, lookup,
> > unmap, remove, etc. much slower.
> > 
> > It is fixed by second commit in this series. Please see more description in
> > the commit message of the patch.
> > 
> > 3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
> > stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
> > i.e. there is a memory leek somewhere in percpu allocator. It sounds like
> > a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
> > Resulting in memory leaking or "Kernel panic":
> > 
> > ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
> > 
> > There is no fix for that.
> > 
> > Performance test results
> > ------------------------
> > I run 5 different tests to compare the performance between the new approach
> > and current one. Since there are three different type of allocations i wanted
> > to compare them with default version, apart of those there are two extra.
> > One allocates in long busy list condition. Another one does random number
> > of pages allocation(will not post here to keep it short).
> > 
> > - reboot the system;
> > - do three iteration of full run. One run is 5 tests;
> > - calculate average of three run(microseconds).
> > 
> > i5-3320M:
> > le_fit_alloc_test                b_fit_alloc_test
> > 1218459 vs 1146597 diff 5%       972322 vs 1008655 diff -3.74%
> > 1219721 vs 1145212 diff 6%      1013817 vs  994195 diff  1.94%
> > 1226255 vs 1142134 diff 6%      1002057 vs  993364 diff  0.87%
> > 1239828 vs 1144809 diff 7%       985092 vs  977549 diff  0.77%
> > 1232131 vs 1144775 diff 7%      1031320 vs  999952 diff  3.04%
> > 
> > ne_fit_alloc_test                long_busy_list_alloc_test
> > 2056336 vs 2043121 diff 0.64%    55866315 vs 15037680 diff 73%
> > 2043136 vs 2041888 diff 0.06%    57601435 vs 14809454 diff 74%
> > 2042384 vs 2040181 diff 0.11%    52612371 vs 14550292 diff 72%
> > 2041287 vs 2038905 diff 0.12%    48894648 vs 14769538 diff 69%
> > 2039014 vs 2038632 diff 0.02%    55718063 vs 14727350 diff 73%
> > 
> > Hikey960:
> > le_fit_alloc_test                b_fit_alloc_test
> > 2382168 vs 2115617 diff 11.19%   2864080 vs 2081309 diff 27.33%
> > 2772909 vs 2114988 diff 23.73%   2968612 vs 2062716 diff 30.52%
> > 2772579 vs 2113069 diff 23.79%   2748495 vs 2106658 diff 23.35%
> > 2770596 vs 2111823 diff 23.78%   2966023 vs 2071139 diff 30.17%
> > 2759768 vs 2111868 diff 23.48%   2765568 vs 2125216 diff 23.15%
> > 
> > ne_fit_alloc_test                long_busy_list_alloc_test
> > 4353846 vs 4241838 diff  2.57    239789754 vs 33364305 diff 86%
> > 4133506 vs 4241615 diff -2.62    778283461 vs 34551548 diff 95%
> > 4134769 vs 4240714 diff -2.56    210244212 vs 33467529 diff 84%
> > 4132224 vs 4242281 diff -2.66    429232377 vs 33307173 diff 92%
> > 4410969 vs 4240864 diff  3.86    527560967 vs 33661115 diff 93%
> > 
> > Almost all results are better. Full data and the test module you can find here:
> > 
> > ftp://vps418301.ovh.net/incoming/vmalloc_test_module.tar.bz2
> > ftp://vps418301.ovh.net/incoming/HiKey960_test_result.txt
> > ftp://vps418301.ovh.net/incoming/i5-3320M_test_result.txt
> > 
> > Conclusion
> > ----------
> > According to provided results and my subjective opinion, it is worth to organize
> > and maintain a free list and do an allocation based on it.
> > 
> > Appreciate for any valuable comments and sorry for the long description :)
> > 
> > Best Regards,
> > Uladzislau Rezki
> > 
> > Uladzislau Rezki (Sony) (2):
> >   mm/vmalloc: keep track of free blocks for allocation
> >   mm: add priority threshold to __purge_vmap_area_lazy()
> > 
> >  include/linux/vmalloc.h |   2 +-
> >  mm/vmalloc.c            | 850 ++++++++++++++++++++++++++++++++++++++----------
> >  2 files changed, 676 insertions(+), 176 deletions(-)
> > 
> > -- 
> > 2.11.0
> >

Uladzislau Rezki Oct. 22, 2018, 4:52 p.m. UTC | #6

On Mon, Oct 22, 2018 at 02:51:42PM +0200, Michal Hocko wrote:
> Hi,
> I haven't read through the implementation yet but I have say that I
> really love this cover letter. It is clear on intetion, it covers design
> from high level enough to start discussion and provides a very nice
> testing coverage. Nice work!
> 
> I also think that we need a better performing vmalloc implementation
> long term because of the increasing number of kvmalloc users.
> 
> I just have two mostly workflow specific comments.
> 
> > A test-suite patch you can find here, it is based on 4.18 kernel.
> > ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
> 
> Can you fit this stress test into the standard self test machinery?
> 
If you mean "tools/testing/selftests", then i can fit that as a kernel module.
But not all the tests i can trigger from kernel module, because 3 of 8 tests
use __vmalloc_node_range() function that is not marked as EXPORT_SYMBOL.

> > It is fixed by second commit in this series. Please see more description in
> > the commit message of the patch.
> 
> Bug fixes should go first and new functionality should be built on top.
>
Thanks for the good point.

> A kernel crash sounds serious enough to have a fix marked for stable. If
> the fix is too hard/complex then we might consider a revert of the
> faulty commit.
>
The fix is straightforward and easy. It adds a threshold passing which we
forbid cond_resched_lock() and continue draining of lazy pages.

> >
> > 3) This one is related to PCPU allocator(see pcpu_alloc_test()). In that
> > stress test case i see that SUnreclaim(/proc/meminfo) parameter gets increased,
> > i.e. there is a memory leek somewhere in percpu allocator. It sounds like
> > a memory that is allocated by pcpu_get_vm_areas() sometimes is not freed.
> > Resulting in memory leaking or "Kernel panic":
> > 
> > ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
> 
> It would be great to pin point this one down before the rework as well.
> 
Actually it has been fixed recently. Roman Gushchin pointed to the:

6685b357363b ("percpu: stop leaking bitmap metadata blocks")

i have checked, it works fine and fixes a leak i see.

Thank you!

--
Vlad Rezki

> Thanks a lot!
> -- 
> Michal Hocko
> SUSE Labs

Joel Fernandes Oct. 23, 2018, 6:36 a.m. UTC | #7

On Mon, Oct 22, 2018 at 04:50:06PM +0200, Uladzislau Rezki wrote:
> On Fri, Oct 19, 2018 at 05:11:45PM -0700, Joel Fernandes wrote:
> > On Fri, Oct 19, 2018 at 07:35:36PM +0200, Uladzislau Rezki (Sony) wrote:
> > > Objective
> > > ---------
> > > Initiative of improving vmalloc allocator comes from getting many issues
> > > related to allocation time, i.e. sometimes it is terribly slow. As a result
> > > many workloads which are sensitive for long (more than 1 millisecond) preemption
> > > off scenario are affected by that slowness(test cases like UI or audio, etc.).
> > > 
> > > The problem is that, currently an allocation of the new VA area is done over
> > > busy list iteration until a suitable hole is found between two busy areas.
> > > Therefore each new allocation causes the list being grown. Due to long list
> > > and different permissive parameters an allocation can take a long time on
> > > embedded devices(milliseconds).
> > 
> > I am not super familiar with the vmap allocation code, it has been some
> > years. But I have 2 comments:
> > 
> > (1) It seems the issue you are reporting is the walking of the list in
> > alloc_vmap_area().
> > 
> > Can we not solve this by just simplifying the following code?
> > 
> > 	/* from the starting point, walk areas until a suitable hole is found
> > 	 */
> > 	while (addr + size > first->va_start && addr + size <= vend) {
> > 		if (addr + cached_hole_size < first->va_start)
> > 			cached_hole_size = first->va_start - addr;
> > 		addr = ALIGN(first->va_end, align);
> > 		if (addr + size < addr)
> > 			goto overflow;
> > 
> > 		if (list_is_last(&first->list, &vmap_area_list))
> > 			goto found;
> > 
> > 		first = list_next_entry(first, list);
> > 	}
> > 
> > Instead of going through the vmap_area_list, can we not just binary search
> > the existing address-sorted vmap_area_root rbtree to find a hole? If yes,
> > that would bring down the linear search overhead. If not, why not?
> >
> vmap_area_root rb-tree is used for fast access to vmap_area knowing
> the address(any va_start). That is why we use the tree. To use that tree
> in order to check holes will require to start from the left most node or
> specified "vstart" and move forward by rb_next(). What is much slower
> than regular(list_next_entry O(1)) access in this case. 

Ah, sorry. Don't know what I was thinking, you are right.  By the way the
binder driver does something similar too for buffer allocations, maintains an
rb tree of free areas:
https://github.com/torvalds/linux/blob/master/drivers/android/binder_alloc.c#L415

> > (2) I am curious, do you have any measurements of how much time
> > alloc_vmap_area() is taking? You mentioned it takes milliseconds but I was
> > wondering if you had more finer grained function profiling measurements. And
> > also any data on how big are the lists at the time you see this issue.
> > 
> Basically it depends on how much or heavily your system uses vmalloc
> allocations. I was using CONFIG_DEBUG_PREEMPT with an extra patch. See it
> here: ftp://vps418301.ovh.net/incoming/0001-tracing-track-preemption-disable-callers.patch
> 
> As for list size. It can be easily thousands.

Understood. I will go through your patches more in the coming days, thanks!

 - Joel

Michal Hocko Oct. 23, 2018, 7:23 a.m. UTC | #8

Hi Shuah,

On Mon 22-10-18 18:52:53, Uladzislau Rezki wrote:
> On Mon, Oct 22, 2018 at 02:51:42PM +0200, Michal Hocko wrote:
> > Hi,
> > I haven't read through the implementation yet but I have say that I
> > really love this cover letter. It is clear on intetion, it covers design
> > from high level enough to start discussion and provides a very nice
> > testing coverage. Nice work!
> > 
> > I also think that we need a better performing vmalloc implementation
> > long term because of the increasing number of kvmalloc users.
> > 
> > I just have two mostly workflow specific comments.
> > 
> > > A test-suite patch you can find here, it is based on 4.18 kernel.
> > > ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
> > 
> > Can you fit this stress test into the standard self test machinery?
> > 
> If you mean "tools/testing/selftests", then i can fit that as a kernel module.
> But not all the tests i can trigger from kernel module, because 3 of 8 tests
> use __vmalloc_node_range() function that is not marked as EXPORT_SYMBOL.

Is there any way to conditionally export these internal symbols just for
kselftests? Or is there any other standard way how to test internal
functionality that is not exported to modules?

Shuah Oct. 23, 2018, 3:02 p.m. UTC | #9

Hi Michal,

On 10/23/2018 01:23 AM, Michal Hocko wrote:
> Hi Shuah,
> 
> On Mon 22-10-18 18:52:53, Uladzislau Rezki wrote:
>> On Mon, Oct 22, 2018 at 02:51:42PM +0200, Michal Hocko wrote:
>>> Hi,
>>> I haven't read through the implementation yet but I have say that I
>>> really love this cover letter. It is clear on intetion, it covers design
>>> from high level enough to start discussion and provides a very nice
>>> testing coverage. Nice work!
>>>
>>> I also think that we need a better performing vmalloc implementation
>>> long term because of the increasing number of kvmalloc users.
>>>
>>> I just have two mostly workflow specific comments.
>>>
>>>> A test-suite patch you can find here, it is based on 4.18 kernel.
>>>> ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
>>>
>>> Can you fit this stress test into the standard self test machinery?
>>>
>> If you mean "tools/testing/selftests", then i can fit that as a kernel module.
>> But not all the tests i can trigger from kernel module, because 3 of 8 tests
>> use __vmalloc_node_range() function that is not marked as EXPORT_SYMBOL.
> 
> Is there any way to conditionally export these internal symbols just for
> kselftests? Or is there any other standard way how to test internal
> functionality that is not exported to modules?
> 

The way it can be handled is by adding a test module under lib. test_kmod,
test_sysctl, test_user_copy etc.

There is a corresponding test script e.g selftests/kmod/kmod.sh that loads
the module and runs tests.

Take a look at lib/test_overflow.c - It is running some vmalloc_node tests
test_overflow.c:DEFINE_TEST_ALLOC(vmalloc_node,  vfree,	     0, 0, 1);
test_overflow.c:DEFINE_TEST_ALLOC(kvmalloc_node, kvfree,     0, 1, 1);
test_overflow.c:	err |= test_kvmalloc_node(NULL);
test_overflow.c:	err |= test_vmalloc_node(NULL);

This module could be extended to tun these stress tests perhaps? I don't see a
selftests test script for test_overflow, one could be added.

Adding Kees Cook to the thread for input on test_overflow.

thanks,
-- Shuah

Matthew Wilcox Oct. 23, 2018, 3:26 p.m. UTC | #10

On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> Hi Michal,
> 
> On 10/23/2018 01:23 AM, Michal Hocko wrote:
> > Hi Shuah,
> > 
> > On Mon 22-10-18 18:52:53, Uladzislau Rezki wrote:
> >> On Mon, Oct 22, 2018 at 02:51:42PM +0200, Michal Hocko wrote:
> >>> Hi,
> >>> I haven't read through the implementation yet but I have say that I
> >>> really love this cover letter. It is clear on intetion, it covers design
> >>> from high level enough to start discussion and provides a very nice
> >>> testing coverage. Nice work!
> >>>
> >>> I also think that we need a better performing vmalloc implementation
> >>> long term because of the increasing number of kvmalloc users.
> >>>
> >>> I just have two mostly workflow specific comments.
> >>>
> >>>> A test-suite patch you can find here, it is based on 4.18 kernel.
> >>>> ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
> >>>
> >>> Can you fit this stress test into the standard self test machinery?
> >>>
> >> If you mean "tools/testing/selftests", then i can fit that as a kernel module.
> >> But not all the tests i can trigger from kernel module, because 3 of 8 tests
> >> use __vmalloc_node_range() function that is not marked as EXPORT_SYMBOL.
> > 
> > Is there any way to conditionally export these internal symbols just for
> > kselftests? Or is there any other standard way how to test internal
> > functionality that is not exported to modules?
> > 
> 
> The way it can be handled is by adding a test module under lib. test_kmod,
> test_sysctl, test_user_copy etc.

The problem is that said module can only invoke functions which are
exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
which I don't think we're willing to pay, purely to get test coverage.

Based on my own experience with the IDA & XArray test suites, I would
like to propose a solution which does not require exporting all of
these symbols:

Create a new kernel module in mm/test_vmalloc.c

Towards the top of that file,

#include <linux/export.h>
#undef EXPORT_SYMBOL
#define EXPORT_SYMBOL(x)	/* */
#include "vmalloc.c"

Now you can invoke even static functions from your test harness.

Michal Hocko Oct. 23, 2018, 5:05 p.m. UTC | #11

On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
[...]
> > The way it can be handled is by adding a test module under lib. test_kmod,
> > test_sysctl, test_user_copy etc.
> 
> The problem is that said module can only invoke functions which are
> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> which I don't think we're willing to pay, purely to get test coverage.

Yes, I think we do not want to export internal functionality which might
be still interesting for the testing coverage. Maybe we want something
like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
kselftest machinery but it wouldn't allow the same for general modules
and will not give any API promisses.

I wouldn't be surprised if we found some cases of EXPORT_SYMBOL* just to
make a symbol available for testing which would be unfortunate.

Shuah Oct. 23, 2018, 5:13 p.m. UTC | #12

On 10/23/2018 11:05 AM, Michal Hocko wrote:
> On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
>> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> [...]
>>> The way it can be handled is by adding a test module under lib. test_kmod,
>>> test_sysctl, test_user_copy etc.
>>
>> The problem is that said module can only invoke functions which are
>> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
>> which I don't think we're willing to pay, purely to get test coverage.
> 
> Yes, I think we do not want to export internal functionality which might
> be still interesting for the testing coverage. Maybe we want something
> like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
> kselftest machinery but it wouldn't allow the same for general modules
> and will not give any API promisses.
> 

I like this proposal. I think we will open up lot of test opportunities with
this approach.

Maybe we can use this stress test as a pilot and see where it takes us.

thanks,
-- Shuah

Joel Fernandes Oct. 23, 2018, 7:30 p.m. UTC | #13

On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> On 10/23/2018 11:05 AM, Michal Hocko wrote:
> > On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
> >> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> > [...]
> >>> The way it can be handled is by adding a test module under lib. test_kmod,
> >>> test_sysctl, test_user_copy etc.
> >>
> >> The problem is that said module can only invoke functions which are
> >> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> >> which I don't think we're willing to pay, purely to get test coverage.
> > 
> > Yes, I think we do not want to export internal functionality which might
> > be still interesting for the testing coverage. Maybe we want something
> > like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
> > kselftest machinery but it wouldn't allow the same for general modules
> > and will not give any API promisses.
> > 
> 
> I like this proposal. I think we will open up lot of test opportunities with
> this approach.
> 
> Maybe we can use this stress test as a pilot and see where it takes us.

I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
out-of-tree module writers to call internal functionality.

How would you prevent that?

thanks,

 - Joel

Shuah Oct. 23, 2018, 7:48 p.m. UTC | #14

On 10/23/2018 01:30 PM, Joel Fernandes wrote:
> On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
>> On 10/23/2018 11:05 AM, Michal Hocko wrote:
>>> On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
>>>> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
>>> [...]
>>>>> The way it can be handled is by adding a test module under lib. test_kmod,
>>>>> test_sysctl, test_user_copy etc.
>>>>
>>>> The problem is that said module can only invoke functions which are
>>>> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
>>>> which I don't think we're willing to pay, purely to get test coverage.
>>>
>>> Yes, I think we do not want to export internal functionality which might
>>> be still interesting for the testing coverage. Maybe we want something
>>> like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
>>> kselftest machinery but it wouldn't allow the same for general modules
>>> and will not give any API promisses.
>>>
>>
>> I like this proposal. I think we will open up lot of test opportunities with
>> this approach.
>>
>> Maybe we can use this stress test as a pilot and see where it takes us.
> 
> I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> out-of-tree module writers to call internal functionality.

That is  valid concern to consider before we go forward with the proposal.

We could wrap EXPORT_SYMBOL_KSELFTEST this in an existing debug option. This could
be fine grained for each sub-system for its debug option. We do have a few of these
now

# CONFIG_STATIC_KEYS_SELFTEST is not set
# CONFIG_BT_SELFTEST is not set
# CONFIG_DRM_DEBUG_SELFTEST is not set
# CONFIG_CHASH_SELFTEST is not set
# CONFIG_DRM_I915_SELFTEST is not set
# CONFIG_CRC32_SELFTEST is not set
# CONFIG_RANDOM32_SELFTEST is not set
# CONFIG_GLOB_SELFTEST is not set
# CONFIG_STRING_SELFTEST is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_X86_DECODER_SELFTEST is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set

thanks,
-- Shuah

Matthew Wilcox Oct. 23, 2018, 8:09 p.m. UTC | #15

On Tue, Oct 23, 2018 at 01:48:32PM -0600, Shuah Khan wrote:
> On 10/23/2018 01:30 PM, Joel Fernandes wrote:
> > On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> >> I like this proposal. I think we will open up lot of test opportunities with
> >> this approach.
> >>
> >> Maybe we can use this stress test as a pilot and see where it takes us.
> > 
> > I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> > out-of-tree module writers to call internal functionality.
> 
> That is  valid concern to consider before we go forward with the proposal.
> 
> We could wrap EXPORT_SYMBOL_KSELFTEST this in an existing debug option. This could
> be fine grained for each sub-system for its debug option. We do have a few of these
> now

This all seems far more complicated than my proposed solution.

Shuah Oct. 23, 2018, 8:50 p.m. UTC | #16

On 10/23/2018 02:09 PM, Matthew Wilcox wrote:
> On Tue, Oct 23, 2018 at 01:48:32PM -0600, Shuah Khan wrote:
>> On 10/23/2018 01:30 PM, Joel Fernandes wrote:
>>> On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
>>>> I like this proposal. I think we will open up lot of test opportunities with
>>>> this approach.
>>>>
>>>> Maybe we can use this stress test as a pilot and see where it takes us.
>>>
>>> I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
>>> out-of-tree module writers to call internal functionality.
>>
>> That is  valid concern to consider before we go forward with the proposal.
>>
>> We could wrap EXPORT_SYMBOL_KSELFTEST this in an existing debug option. This could
>> be fine grained for each sub-system for its debug option. We do have a few of these
>> now
> 
> This all seems far more complicated than my proposed solution.
> 

Not sure if it that complicated. But it is more involved. It dies have the
advantage of fitting in with the rest of the debug/test type framework we
already have.

The option you proposed sounds simpler, however it sounds a bit adhoc to me.

In any case I went looking for EXPORT_SYMBOL defines and found them in 

tools/include/asm/export.h
tools/include/linux/export.h
tools/virtio/linux/export.h

selftests/powerpc/copyloops/asm/export.h
selftests/powerpc/stringloops/asm/export.h

thanks,
-- Shuah

Joel Fernandes Oct. 23, 2018, 9:01 p.m. UTC | #17

On Tue, Oct 23, 2018 at 01:09:23PM -0700, Matthew Wilcox wrote:
> On Tue, Oct 23, 2018 at 01:48:32PM -0600, Shuah Khan wrote:
> > On 10/23/2018 01:30 PM, Joel Fernandes wrote:
> > > On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> > >> I like this proposal. I think we will open up lot of test opportunities with
> > >> this approach.
> > >>
> > >> Maybe we can use this stress test as a pilot and see where it takes us.
> > > 
> > > I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> > > out-of-tree module writers to call internal functionality.
> > 
> > That is  valid concern to consider before we go forward with the proposal.
> > 
> > We could wrap EXPORT_SYMBOL_KSELFTEST this in an existing debug option. This could
> > be fine grained for each sub-system for its debug option. We do have a few of these
> > now
> 
> This all seems far more complicated than my proposed solution.

Matthew's solution seems Ok to me where it works. A problem could be that it
will not always work.

As an example, recently I wanted to directly set the sysctl_sched_rt_runtime
variable from the rcutorture test, just for forcing some conditions. This
symbol is internal and inaccessible from modules. This can also be done by
calling the internal sched_rt_handler with some parameters.  However I don't
think including an internal source file in a test source file can achieve the
objective of setting it since access to the internal symbol is not possible
without exporting it somehow. This could be a "special" case too but is an
example where the include trick may fall apart.

I do think its a cool trick though ;-)

 - Joel

Michal Hocko Oct. 24, 2018, 6:22 a.m. UTC | #18

On Tue 23-10-18 12:30:44, Joel Fernandes wrote:
> On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> > On 10/23/2018 11:05 AM, Michal Hocko wrote:
> > > On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
> > >> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> > > [...]
> > >>> The way it can be handled is by adding a test module under lib. test_kmod,
> > >>> test_sysctl, test_user_copy etc.
> > >>
> > >> The problem is that said module can only invoke functions which are
> > >> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> > >> which I don't think we're willing to pay, purely to get test coverage.
> > > 
> > > Yes, I think we do not want to export internal functionality which might
> > > be still interesting for the testing coverage. Maybe we want something
> > > like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
> > > kselftest machinery but it wouldn't allow the same for general modules
> > > and will not give any API promisses.
> > > 
> > 
> > I like this proposal. I think we will open up lot of test opportunities with
> > this approach.
> > 
> > Maybe we can use this stress test as a pilot and see where it takes us.
> 
> I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> out-of-tree module writers to call internal functionality.
> 
> How would you prevent that?

There is no way to prevent non-exported symbols abuse by 3rd party
AFAIK. EXPORT_SYMBOL_* is not there to prohibid abuse. It is a mere
signal of what is, well, an exported API.

Uladzislau Rezki Oct. 24, 2018, 4:36 p.m. UTC | #19

On Tue, Oct 23, 2018 at 08:26:40AM -0700, Matthew Wilcox wrote:
> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> > Hi Michal,
> > 
> > On 10/23/2018 01:23 AM, Michal Hocko wrote:
> > > Hi Shuah,
> > > 
> > > On Mon 22-10-18 18:52:53, Uladzislau Rezki wrote:
> > >> On Mon, Oct 22, 2018 at 02:51:42PM +0200, Michal Hocko wrote:
> > >>> Hi,
> > >>> I haven't read through the implementation yet but I have say that I
> > >>> really love this cover letter. It is clear on intetion, it covers design
> > >>> from high level enough to start discussion and provides a very nice
> > >>> testing coverage. Nice work!
> > >>>
> > >>> I also think that we need a better performing vmalloc implementation
> > >>> long term because of the increasing number of kvmalloc users.
> > >>>
> > >>> I just have two mostly workflow specific comments.
> > >>>
> > >>>> A test-suite patch you can find here, it is based on 4.18 kernel.
> > >>>> ftp://vps418301.ovh.net/incoming/0001-mm-vmalloc-stress-test-suite-v4.18.patch
> > >>>
> > >>> Can you fit this stress test into the standard self test machinery?
> > >>>
> > >> If you mean "tools/testing/selftests", then i can fit that as a kernel module.
> > >> But not all the tests i can trigger from kernel module, because 3 of 8 tests
> > >> use __vmalloc_node_range() function that is not marked as EXPORT_SYMBOL.
> > > 
> > > Is there any way to conditionally export these internal symbols just for
> > > kselftests? Or is there any other standard way how to test internal
> > > functionality that is not exported to modules?
> > > 
> > 
> > The way it can be handled is by adding a test module under lib. test_kmod,
> > test_sysctl, test_user_copy etc.
> 
> The problem is that said module can only invoke functions which are
> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> which I don't think we're willing to pay, purely to get test coverage.
> 
> Based on my own experience with the IDA & XArray test suites, I would
> like to propose a solution which does not require exporting all of
> these symbols:
> 
> Create a new kernel module in mm/test_vmalloc.c
> 
> Towards the top of that file,
> 
> #include <linux/export.h>
> #undef EXPORT_SYMBOL
> #define EXPORT_SYMBOL(x)	/* */
> #include "vmalloc.c"
> 
> Now you can invoke even static functions from your test harness.
I see your point. But i also think that it would not be so easy to go.

<snip>
#undef CONFIG_HAVE_ARCH_HUGE_VMAP
#undef CONFIG_PROC_FS

#include "../hikey_linux.git/mm/vmalloc.c"
#include <linux/random.h>
<snip>

<snip>
  LD [M]  /mnt/coding/vmalloc_performance_test/vmalloc_test.o
  Building modules, stage 2.
  MODPOST 1 modules
WARNING: "__pud_alloc" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "__sync_icache_dcache" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "warn_alloc" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "pmd_clear_bad" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "pgd_clear_bad" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "swapper_pg_dir" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "__pmd_alloc" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "pud_clear_bad" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
WARNING: "__pte_alloc_kernel" [/mnt/coding/vmalloc_performance_test/vmalloc_test.ko] undefined!
 LD [M]  /mnt/coding/vmalloc_performance_test/vmalloc_test.ko
<snip>

i.e. i will need either link with objects where those functions are or
wrap them up somehow. 

Thanks!

--
Vlad Rezki

Uladzislau Rezki Oct. 24, 2018, 5:34 p.m. UTC | #20

Hi.

On Wed, Oct 24, 2018 at 08:22:52AM +0200, Michal Hocko wrote:
> On Tue 23-10-18 12:30:44, Joel Fernandes wrote:
> > On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> > > On 10/23/2018 11:05 AM, Michal Hocko wrote:
> > > > On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
> > > >> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> > > > [...]
> > > >>> The way it can be handled is by adding a test module under lib. test_kmod,
> > > >>> test_sysctl, test_user_copy etc.
> > > >>
> > > >> The problem is that said module can only invoke functions which are
> > > >> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> > > >> which I don't think we're willing to pay, purely to get test coverage.
> > > > 
> > > > Yes, I think we do not want to export internal functionality which might
> > > > be still interesting for the testing coverage. Maybe we want something
> > > > like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
> > > > kselftest machinery but it wouldn't allow the same for general modules
> > > > and will not give any API promisses.
> > > > 
> > > 
> > > I like this proposal. I think we will open up lot of test opportunities with
> > > this approach.
> > > 
> > > Maybe we can use this stress test as a pilot and see where it takes us.
> > 
> > I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> > out-of-tree module writers to call internal functionality.
> > 
> > How would you prevent that?
> 
> There is no way to prevent non-exported symbols abuse by 3rd party
> AFAIK. EXPORT_SYMBOL_* is not there to prohibid abuse. It is a mere
> signal of what is, well, an exported API.
> -- 
> Michal Hocko
> SUSE Labs

Can we just use kallsyms_lookup_name()?

<snip>
static void *((*__my_vmalloc_node_range)(unsigned long size,
    unsigned long align,unsigned long start, unsigned long end,
    gfp_t gfp_mask,pgprot_t prot, unsigned long vm_flags,
    int node, const void *caller));

__my_vmalloc_node_range = (void *) kallsyms_lookup_name("__vmalloc_node_range");
<snip>

--
Vlad Rezki

Andrew Morton Oct. 24, 2018, 11:01 p.m. UTC | #21

On Fri, 19 Oct 2018 19:35:36 +0200 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:

> improving vmalloc allocator

It's about time ;)

Are you aware of https://lwn.net/Articles/285341/ ?  If not, please do
take a look through Nick's work and see if there are any good things
there which can be borrowed.

Michal Hocko Oct. 25, 2018, 8:43 a.m. UTC | #22

On Wed 24-10-18 19:34:18, Uladzislau Rezki wrote:
> Hi.
> 
> On Wed, Oct 24, 2018 at 08:22:52AM +0200, Michal Hocko wrote:
> > On Tue 23-10-18 12:30:44, Joel Fernandes wrote:
> > > On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> > > > On 10/23/2018 11:05 AM, Michal Hocko wrote:
> > > > > On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
> > > > >> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> > > > > [...]
> > > > >>> The way it can be handled is by adding a test module under lib. test_kmod,
> > > > >>> test_sysctl, test_user_copy etc.
> > > > >>
> > > > >> The problem is that said module can only invoke functions which are
> > > > >> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> > > > >> which I don't think we're willing to pay, purely to get test coverage.
> > > > > 
> > > > > Yes, I think we do not want to export internal functionality which might
> > > > > be still interesting for the testing coverage. Maybe we want something
> > > > > like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
> > > > > kselftest machinery but it wouldn't allow the same for general modules
> > > > > and will not give any API promisses.
> > > > > 
> > > > 
> > > > I like this proposal. I think we will open up lot of test opportunities with
> > > > this approach.
> > > > 
> > > > Maybe we can use this stress test as a pilot and see where it takes us.
> > > 
> > > I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> > > out-of-tree module writers to call internal functionality.
> > > 
> > > How would you prevent that?
> > 
> > There is no way to prevent non-exported symbols abuse by 3rd party
> > AFAIK. EXPORT_SYMBOL_* is not there to prohibid abuse. It is a mere
> > signal of what is, well, an exported API.
> 
> Can we just use kallsyms_lookup_name()?

Heh, this is the abuse I've had in mind ;)
> <snip>
> static void *((*__my_vmalloc_node_range)(unsigned long size,
>     unsigned long align,unsigned long start, unsigned long end,
>     gfp_t gfp_mask,pgprot_t prot, unsigned long vm_flags,
>     int node, const void *caller));
> 
> __my_vmalloc_node_range = (void *) kallsyms_lookup_name("__vmalloc_node_range");
> <snip>

This is just too ugly to live. So I would go with it only if there is no
reasonable way to export what tests need with a sane interface.

Uladzislau Rezki Oct. 25, 2018, 10:33 a.m. UTC | #23

On Wed, Oct 24, 2018 at 04:01:06PM -0700, Andrew Morton wrote:
> On Fri, 19 Oct 2018 19:35:36 +0200 "Uladzislau Rezki (Sony)" <urezki@gmail.com> wrote:
> 
> > improving vmalloc allocator
> 
> It's about time ;)
> 
> Are you aware of https://lwn.net/Articles/285341/ ?  If not, please do
> take a look through Nick's work and see if there are any good things
> there which can be borrowed.
> 
No, i have not known about that. I will go through that work to see if
there is something there!

Thanks :)

--
Vlad Rezki

Uladzislau Rezki Oct. 25, 2018, 10:42 a.m. UTC | #24

On Thu, Oct 25, 2018 at 10:43:27AM +0200, Michal Hocko wrote:
> On Wed 24-10-18 19:34:18, Uladzislau Rezki wrote:
> > Hi.
> > 
> > On Wed, Oct 24, 2018 at 08:22:52AM +0200, Michal Hocko wrote:
> > > On Tue 23-10-18 12:30:44, Joel Fernandes wrote:
> > > > On Tue, Oct 23, 2018 at 11:13:36AM -0600, Shuah Khan wrote:
> > > > > On 10/23/2018 11:05 AM, Michal Hocko wrote:
> > > > > > On Tue 23-10-18 08:26:40, Matthew Wilcox wrote:
> > > > > >> On Tue, Oct 23, 2018 at 09:02:56AM -0600, Shuah Khan wrote:
> > > > > > [...]
> > > > > >>> The way it can be handled is by adding a test module under lib. test_kmod,
> > > > > >>> test_sysctl, test_user_copy etc.
> > > > > >>
> > > > > >> The problem is that said module can only invoke functions which are
> > > > > >> exported using EXPORT_SYMBOL.  And there's a cost to exporting them,
> > > > > >> which I don't think we're willing to pay, purely to get test coverage.
> > > > > > 
> > > > > > Yes, I think we do not want to export internal functionality which might
> > > > > > be still interesting for the testing coverage. Maybe we want something
> > > > > > like EXPORT_SYMBOL_KSELFTEST which would allow to link within the
> > > > > > kselftest machinery but it wouldn't allow the same for general modules
> > > > > > and will not give any API promisses.
> > > > > > 
> > > > > 
> > > > > I like this proposal. I think we will open up lot of test opportunities with
> > > > > this approach.
> > > > > 
> > > > > Maybe we can use this stress test as a pilot and see where it takes us.
> > > > 
> > > > I am a bit worried that such an EXPORT_SYMBOL_KSELFTEST mechanism can be abused by
> > > > out-of-tree module writers to call internal functionality.
> > > > 
> > > > How would you prevent that?
> > > 
> > > There is no way to prevent non-exported symbols abuse by 3rd party
> > > AFAIK. EXPORT_SYMBOL_* is not there to prohibid abuse. It is a mere
> > > signal of what is, well, an exported API.
> > 
> > Can we just use kallsyms_lookup_name()?
> 
> Heh, this is the abuse I've had in mind ;)
> > <snip>
> > static void *((*__my_vmalloc_node_range)(unsigned long size,
> >     unsigned long align,unsigned long start, unsigned long end,
> >     gfp_t gfp_mask,pgprot_t prot, unsigned long vm_flags,
> >     int node, const void *caller));
> > 
> > __my_vmalloc_node_range = (void *) kallsyms_lookup_name("__vmalloc_node_range");
> > <snip>
> 
> This is just too ugly to live. So I would go with it only if there is no
> reasonable way to export what tests need with a sane interface.
Agree, that is a bit ugly and not generic even though it is easy :)

--
Vlad Rezki

[RFC,0/2] improve vmalloc allocation

Message

Comments