
[bpf-next,v1,RESEND,1/5] vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec

Message ID 20221031222541.1773452-2-song@kernel.org (mailing list archive)
State New
Series vmalloc_exec for modules and BPF programs

Commit Message

Song Liu Oct. 31, 2022, 10:25 p.m. UTC
vmalloc_exec is used to allocate memory to host dynamic kernel text
(modules, BPF programs, etc.) with huge pages. This is similar to the
proposal by Peter in [1].

A new tree of vmap_area, free_text_area_* tree, is introduced in addition
to free_vmap_area_* and vmap_area_*. vmalloc_exec allocates pages from
free_text_area_*. When there isn't enough space left in free_text_area_*,
new PMD_SIZE page(s) is allocated from free_vmap_area_* and added to
free_text_area_*. To be more accurate, the vmap_area is first added to
vmap_area_* tree and then moved to free_text_area_*. This extra move
simplifies the logic of vmalloc_exec.

vmap_areas in the free_text_area_* tree are backed with memory, but tree
operations need subtree_max_size, which shares a union with the vm pointer
in struct vmap_area. Therefore, the vm_struct for these vmap_areas is
stored in a separate list, all_text_vm.

The new tree allows separate handling of < PAGE_SIZE allocations, as
current vmalloc code mostly assumes PAGE_SIZE aligned allocations. This
version of vmalloc_exec can handle BPF programs, which use 64-byte
aligned allocations, and modules, which use PAGE_SIZE aligned
allocations.

Memory allocated by vmalloc_exec() is set to RO+X before returning to the
caller. Therefore, the caller cannot write directly to the memory.
Instead, the caller is required to use vcopy_exec() to update it.
For the safety and security of X memory, vcopy_exec() checks that the data
being updated always falls within memory allocated by a single
vmalloc_exec() call. vcopy_exec() uses a text_poke-like mechanism and
requires arch support. Specifically, the arch needs to implement
arch_vcopy_exec().

In vfree_exec(), the memory is first erased with arch_invalidate_exec().
Then, the memory is added to free_text_area_*. If this free creates a big
enough contiguous free range (> PMD_SIZE), vfree_exec() will try to free
the backing vm_struct.
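
For illustration, a JIT-style caller would be expected to combine the three
calls roughly as in the sketch below. This is not part of the patch; the
exact return convention of vcopy_exec() and the function/variable names in
the sketch are assumptions.

	#include <linux/err.h>
	#include <linux/vmalloc.h>

	/* Hypothetical caller sketch: publish JIT'ed text via the new API. */
	static void *publish_text(void *image, unsigned long len)
	{
		/* Memory comes back RO+X; BPF uses 64-byte aligned allocations. */
		void *ro_x = vmalloc_exec(len, 64);

		if (!ro_x)
			return NULL;

		/*
		 * No direct writes to ro_x are allowed; the update goes through
		 * vcopy_exec(), i.e. arch_vcopy_exec() under the hood. ERR_PTR
		 * on failure is assumed here.
		 */
		if (IS_ERR(vcopy_exec(ro_x, image, len))) {
			vfree_exec(ro_x);	/* erased via arch_invalidate_exec() */
			return NULL;
		}
		return ro_x;
	}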

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Signed-off-by: Song Liu <song@kernel.org>
---
 include/linux/vmalloc.h |   5 +
 mm/nommu.c              |  12 ++
 mm/vmalloc.c            | 318 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 335 insertions(+)

Comments

Luis Chamberlain Nov. 2, 2022, 11:41 p.m. UTC | #1
On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> vmalloc_exec is used to allocate memory to host dynamic kernel text
> (modules, BPF programs, etc.) with huge pages. This is similar to the
> proposal by Peter in [1].

This is all great but we need to clarify *why* we would go through the
trouble.  So if folks are not too excited about this series, that's
probably why. IMHO it lacks substance for rationale, **and** implies a few
gains without any *clear* performance metrics. I have 0 experience with
mm so I'd like others' feedback on this -- I'm just trying to
decipher rationale from prior "bpf prog pack" efforts.

I'm sensing that the cables in messaging are a bit crossed here and we need
to provide a bit better full picture for rationale and this is being
completely missed and this work is being undersold.  If my assessment is
accurate though, the bpf prog pack strategy with sharing huge pages may prove
useful long term for other things than just modules / ftrace / kprobes.

I was surprised to see this entire patch series upgrade from RFC to proper
PATCH form now completely fails to mention any of the original motivations
behind the "BPF prog pack", which you are doing a true heroic effort to try to
generalize as the problem is hard. Let me try to help with that. The rationale
for the old BPF prog pack is documented as follows:

* Most BPF programs are pretty small. Allocating a hole page for each
* program is sometime a waste. Many small bpf program also adds pressure
* to instruction TLB. To solve this issue, we introduce a BPF program pack
* allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
* to host BPF programs.

Previously you have also stated in earlier versions of this patch set:

  "Most BPF programs are small, but they consume a page each. For systems
   with busy traffic and many BPF programs, this could also add significant
   pressure to instruction TLB. High iTLB pressure usually causes slow down
   for the whole system, which includes visible performance
   degradation for production workloads."

So it is implied here that one of the benefits is to help reduce iTLB misses.
But that's it. We have no visible numbers to look at and for what... But
reducing iTLB misses doesn't always have a complete direct correlation
with improving things, but if the code change is small enough it obviously
makes sense to apply. If the change is a bit more intrusive, as in this
patch series a bit more rationale should be provided.

Other than the "performance aspects" of your patchset, the *main* reason
I am engaged and like it is it reduces the nasty mess of semantics on
dealing with special permissions on pages which we see in modules and a
few other places which today completely open code it. That proves error
prone and I'm glad to see efforts to generalize that nastiness. So
please ensure this is added as part of the documented rationale. Even
if the iTLB miss ratio improvement is not astronomical I believe that
the gains in sanity on improving semantics on special pages and sharing code
make it well worthwhile. The iTLB miss ratio improvement is just a small
cherry on top.

Going back to performance aspects, when Linus had poked for more details
about this you have elaborated further:

  "we have seen direct map fragmentation causing visible
   performance drop for our major services. This is the shadow 
   production benchmark, so it is not possible to run it out of 
   our data centers. Tracing showed that BPF program was the top 
   trigger of these direct map splits."

And the only other metric we have is:

  "For our web service production benchmark, bpf_prog_pack on 4kB pages
   gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."

These metrics are completely arbitrary and opaque to us. We need
something tangible and reproducible and I have been suggesting that
from early on...

I'm under the impression that the real missed, undocumented, major value-add
here is that the old "BPF prog pack" strategy helps to reduce the direct map
fragmentation caused by heavy use of the eBPF JIT programs and this in
turn helps your overall random system performance (regardless of what
it is you do). As I see it then the eBPF prog pack is just one strategy to
try to mitigate memory fragmentation on the direct map caused by the eBPF
JIT programs, so the "slow down" your team has observed should be due to the
eventual fragmentation caused on the direct map *while* eBPF programs
get heavily used.

Mike Rapoport had presented about the Direct map fragmentation problem
at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
evaluation on whether using 2M/1G pages aggressively for the kernel direct map
helps performance [1] ends up generally recommending huge pages. The work by Xing
though was about using huge pages *alone*, not using a strategy such as in the
"bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
and that I think is the real golden nugget here.

I contend therefore that the theoretical reduction of iTLB misses by using
huge pages for "bpf prog pack" is not what gets your systems to perform
somehow better. It should be simply that it reduces fragmentation and
*this* generally can help with performance long term. If this is accurate
then let's please separate the two aspects to this.

There's two aspects to what I would like to see from a performance
perspective then actually mentioned in the commit logs:

1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution
   Vs not using it at all:

   I'd tried the below but didn't see much difference:

   perf stat --repeat 10 -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses,page-faults \
   	--pre 'modprobe -r test_bpf' -- modprobe test_bpf

   This is likely because the system I have might have a huge cache and I
   was not contending it with another heavy memory-intensive workload.
   So it may be worth running something like the above *as* some other
   memory-intensive benchmark kicks into gear in a loop. Another reason
   may be that test_bpf runs tests sequentially instead of in parallel,
   and so it would be good to get eBPF folks' feedback as to an idea
   of what other things could be done to really eBPF JIT the hell out of
   a system. It is why I had suggested long ago maybe a custom new
   selftest which stresses the hell out of only eBPF JIT and had given
   an example multithreaded selftest.

   If we ever get modules support, I was hoping to see if something like
   the below (if the fs module .text section does end up shared) *may*
   reduce the iTLB miss ratio.

   perf stat --repeat 10 -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-loads,iTLB-load-misses,page-faults \
	--pre 'make -s mrproper defconfig' -- make -s -j$(nproc) bzImage

   Maybe something as simple as systemd-analyze may show time reduction
   if iTLB miss ratio is reduced by sharing most module code in a huge
   page.

2) Estimate in reduction on direct map fragmentation by using the "bpf
   prog pack" or this generalized solution:

   For this I'd expect a benchmark similar to the workload you guys
   run or something memory intensive, as eBPF JITs are heavily used,
   and after a certain amount of time somehow compute how fragmented
   memory is. The only sensible thing I can think of to measure memory
   fragmentation is to look at the external fragmentation index
   /sys/kernel/debug/extfrag/extfrag_index, but I highly welcome others'
   ideas as I'm a mm n00b.
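
   A rough sketch of capturing both numbers around a JIT-heavy run could
   look like the below (assuming CONFIG_COMPACTION and a mounted debugfs,
   and with "stress_ebpf.sh" as a stand-in for whatever eBPF JIT stressor
   ends up being used):

   grep DirectMap /proc/meminfo                 > frag.before
   cat /sys/kernel/debug/extfrag/extfrag_index >> frag.before
   perf stat -a -e iTLB-load-misses,dTLB-load-misses -- ./stress_ebpf.sh
   grep DirectMap /proc/meminfo                 > frag.after
   cat /sys/kernel/debug/extfrag/extfrag_index >> frag.after
   diff -u frag.before frag.after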

[0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

> +static void move_vmap_to_free_text_tree(void *addr)
> +{
> +	struct vmap_area *va;
> +
> +	/* remove from vmap_area_root */
> +	spin_lock(&vmap_area_lock);
> +	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
> +	if (WARN_ON_ONCE(!va)) {

This seems to be hopeful but we purposely are allowing for the allocated
memory to be used elsewhere are we not? Can you trigger the WARN_ON_ONCE()
with stress-ng as described in commit 82dd23e84be3e ("mm/vmalloc.c:
preload a CPU with one object for split purpose") while using eBPF JIT
or loading/unloading a module in a loop?

> +void *vmalloc_exec(unsigned long size, unsigned long align)
> +{
> +	struct vmap_area *va, *tmp;
> +	unsigned long addr;
> +	enum fit_type type;
> +	int ret;
> +
> +	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
> +	if (unlikely(!va))
> +		return NULL;
> +
> +again:
> +	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
> +	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
> +
> +	if (!tmp) {
> +		unsigned long alloc_size;
> +		void *ptr;
> +
> +		spin_unlock(&free_text_area_lock);
> +
> +		/*
> +		 * Not enough continuous space in free_text_area_root, try
> +		 * allocate more memory. The memory is first added to
> +		 * vmap_area_root, and then moved to free_text_area_root.
> +		 */
> +		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
> +		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
> +					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
> +					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
> +					   NUMA_NO_NODE, __builtin_return_address(0));

I thought Peter had suggested keeping the guard?

Once you get modules going you may want to update the comment about
modules on __vmalloc_node_range().

  Luis
Mike Rapoport Nov. 3, 2022, 3:51 p.m. UTC | #2
Hi Luis,

Thanks for looping me in.

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > proposal by Peter in [1].
> 
> This is all great but we need to clarify *why* we would go through the
> trouble.  So if folks are not too excited about this series, that's
> probably why. IMHO it lacks substance for rationale, **and** implies a few
> gains without any *clear* performance metrics. I have 0 experience with
> mm so I'd like others' feedback on this -- I'm just trying to
> decipher rationale from prior "bpf prog pack" efforts.
> 
> I'm sensing that the cables in messaging are a bit crossed here and we need
> to provide a bit better full picture for rationale and this is being
> completely missed and this work is being undersold.  If my assessment is
> accurate though, the bpf prog pack strategy with sharing huge pages may prove
> useful long term for other things than just modules / ftrace / kprobes.
> 
> I was surprised to see this entire patch series upgrade from RFC to proper
> PATCH form now completely fails to mention any of the original motivations
> behind the "BPF prog pack", which you are doing a true heroic effort to try to
> generalize as the problem is hard. Let me try to help with that. The rationale
> for the old BPF prog pack is documented as follows:
> 
> * Most BPF programs are pretty small. Allocating a hole page for each
> * program is sometime a waste. Many small bpf program also adds pressure
> * to instruction TLB. To solve this issue, we introduce a BPF program pack
> * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> * to host BPF programs.
> 
> Previously you have also stated in earlier versions of this patch set:
> 
>   "Most BPF programs are small, but they consume a page each. For systems
>    with busy traffic and many BPF programs, this could also add significant
>    pressure to instruction TLB. High iTLB pressure usually causes slow down
>    for the whole system, which includes visible performance
>    degradation for production workloads."
> 
> So it is implied here that one of the benefits is to help reduce iTLB misses.
> But that's it. We have no visible numbers to look at and for what... But
> reducing iTLB misses doesn't always have a complete direct correlation
> with improving things, but if the code change is small enough it obviously
> makes sense to apply. If the change is a bit more intrusive, as in this
> patch series a bit more rationale should be provided.
> 
> Other than the "performance aspects" of your patchset, the *main* reason
> I am engaged and like it is it reduces the nasty mess of semantics on
> dealing with special permissions on pages which we see in modules and a
> few other places which today completely open code it. That proves error
> prone and I'm glad to see efforts to generalize that nastiness. So
> please ensure this is added as part of the documented rationale. Even
> if the iTLB miss ratio improvement is not astronomical I believe that
> the gains in sanity on improving semantics on special pages and sharing code
> make it well worthwhile. The iTLB miss ratio improvement is just a small
> cherry on top.
> 
> Going back to performance aspects, when Linus had poked for more details
> about this you have elaborated further:
> 
>   "we have seen direct map fragmentation causing visible
>    performance drop for our major services. This is the shadow 
>    production benchmark, so it is not possible to run it out of 
>    our data centers. Tracing showed that BPF program was the top 
>    trigger of these direct map splits."
> 
> And the only other metric we have is:
> 
>   "For our web service production benchmark, bpf_prog_pack on 4kB pages
>    gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> 
> These metrics are completely arbitrary and opaque to us. We need
> something tangible and reproducible and I have been suggesting that
> from early on...
> 
> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of the eBPF JIT programs and this in
> turn helps your overall random system performance (regardless of what
> it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the eBPF
> JIT programs, so the "slow down" your team has observed should be due to the
> eventual fragmentation caused on the direct map *while* eBPF programs
> get heavily used.

I believe that while the eBPF prog pack is helpful in mitigation of the
direct map fragmentation caused by the eBPF JIT programs, the same strategy
of allocating a large page, splitting its PMD entry and then reusing the
memory for smaller allocations can be (and should be) generalized to other
use cases that require non-default permissions in the page table.  Most
prominent use cases are those that allocate memory for code, but the same
approach is relevant for other cases, like secretmem or page table
protection with PKS.

A while ago I've suggested to handle such caching of large pages at the
page allocator level, but when we discussed it at LSF/MM/BPF, prevailing
opinion was that added value does not justify changes to the page
allocator and it was suggested to handle such caching elsewhere. 

I had to put this project on a backburner for $VARIOUS_REASONS, but I still
think that we need a generic allocator for memory with non-default
permissions in the direct map and that code allocation should build on that
allocator.

All that said, the direct map fragmentation problem is currently relevant
only to x86 because it's the only architecture that supports splitting of
the large pages in the direct map.
 
> Mike Rapoport had presented about the Direct map fragmentation problem
> at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> help performance [1] ends up generally recommending huge pages. The work by Xing
> though was about using huge pages *alone*, not using a strategy such as in the
> "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> and that I think is the real golden nugget here.
> 
> I contend therefore that the theoretical reduction of iTLB misses by using
> huge pages for "bpf prog pack" is not what gets your systems to perform
> somehow better. It should be simply that it reduces fragmentation and
> *this* generally can help with performance long term. If this is accurate
> then let's please separate the two aspects to this.

The direct map fragmentation is the reason for higher TLB miss rate, both
for iTLB and dTLB. Whenever a large page in the direct map is split, all
kernel accesses via the direct map will use small pages which requires
dealing with 512 page table entries instead of one for 2M range.

Since small pages in the direct map are never collapsed back to large
pages, a long-living system that heavily uses eBPF programs will have its
direct map severely fragmented, a higher TLB miss rate and worse overall
performance. 

> There's two aspects to what I would like to see from a performance
> perspective then actually mentioned in the commit logs:
> 
> 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution
>    Vs not using it at all:

... 
 
> 2) Estimate in reduction on direct map fragmentation by using the "bpf
>    prog pack" or this generalized solution:
> 
>    For this I'd expect a benchmark similar to the workload you guys
>    run or something memory intensive, as eBPF JITs are heavily used,
>    and after a certain amount of time somehow compute how fragmented
>    memory is. The only sensible thing I can think to measure memory
>    fragmentation is to look at the memory compaction index
>    /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's
>    ideas as I'm a mm n00b.

The direct map fragmentation can be tracked with 

	grep DirectMap /proc/meminfo
	grep direct_map /proc/vmstat

and by looking at /sys/kernel/debug/page_tables/kernel
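
A small before/after wrapper along those lines might look like the sketch
below (the vmstat counters assume a recent x86 kernel that exposes the
direct_map_level*_splits counters):

	grep DirectMap /proc/meminfo  > dm.before
	grep direct_map /proc/vmstat >> dm.before
	# ... run the eBPF / module heavy workload here ...
	grep DirectMap /proc/meminfo  > dm.after
	grep direct_map /proc/vmstat >> dm.after
	diff -u dm.before dm.after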
 
> [0] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
> 
>   Luis
Luis Chamberlain Nov. 3, 2022, 6:59 p.m. UTC | #3
On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> Hi Luis,
> 
> Thanks for looping me in.
> 
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> > On Mon, Oct 31, 2022 at 03:25:37PM -0700, Song Liu wrote:
> > > vmalloc_exec is used to allocate memory to host dynamic kernel text
> > > (modules, BPF programs, etc.) with huge pages. This is similar to the
> > > proposal by Peter in [1].
> > 
> > This is all great but we need to clarify *why* we would go through the
> > trouble.  So if folks are not too excited about this series, that's
> > probably why. IMHO it lacks substance for rationale, **and** implies a few
> > gains without any *clear* performance metrics. I have 0 experience with
> > mm so I'd like others' feedback on this -- I'm just trying to
> > decipher rationale from prior "bpf prog pack" efforts.
> > 
> > I'm sensing that the cables in messaging are a bit crossed here and we need
> > to provide a bit better full picture for rationale and this is being
> > completely missed and this work is being undersold.  If my assessment is
> > accurate though, the bpf prog pack strategy with sharing huge pages may prove
> > useful long term for other things than just modules / ftrace / kprobes.
> > 
> > I was surprised to see this entire patch series upgrade from RFC to proper
> > PATCH form now completely fails to mention any of the original motivations
> > behind the "BPF prog pack", which you are doing a true heroic effort to try to
> > generalize as the problem is hard. Let me try to help with that. The rationale
> > for the old BPF prog pack is documented as follows:
> > 
> > * Most BPF programs are pretty small. Allocating a hole page for each
> > * program is sometime a waste. Many small bpf program also adds pressure
> > * to instruction TLB. To solve this issue, we introduce a BPF program pack
> > * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> > * to host BPF programs.
> > 
> > Previously you have also stated in earlier versions of this patch set:
> > 
> >   "Most BPF programs are small, but they consume a page each. For systems
> >    with busy traffic and many BPF programs, this could also add significant
> >    pressure to instruction TLB. High iTLB pressure usually causes slow down
> >    for the whole system, which includes visible performance
> >    degradation for production workloads."
> > 
> > So it is implied here that one of the benefits is to help reduce iTLB misses.
> > But that's it. We have no visible numbers to look at and for what... But
> > reducing iTLB misses doesn't always have a complete direct correlation
> > with improving things, but if the code change is small enough it obviously
> > makes sense to apply. If the change is a bit more intrusive, as in this
> > patch series a bit more rationale should be provided.
> > 
> > Other than the "performance aspects" of your patchset, the *main* reason
> > I am engaged and like it is it reduces the nasty mess of semantics on
> > dealing with special permissions on pages which we see in modules and a
> > few other places which today completely open code it. That proves error
> > prone and I'm glad to see efforts to generalize that nastiness. So
> > please ensure this is added as part of the documented rationale. Even
> > if the iTLB miss ratio improvement is not astronomical I believe that
> > the gains in sanity on improving semantics on special pages and sharing code
> > make it well worthwhile. The iTLB miss ratio improvement is just a small
> > cherry on top.
> > 
> > Going back to performance aspects, when Linus had poked for more details
> > about this you have elaborated further:
> > 
> >   "we have seen direct map fragmentation causing visible
> >    performance drop for our major services. This is the shadow 
> >    production benchmark, so it is not possible to run it out of 
> >    our data centers. Tracing showed that BPF program was the top 
> >    trigger of these direct map splits."
> > 
> > And the only other metric we have is:
> > 
> >   "For our web service production benchmark, bpf_prog_pack on 4kB pages
> >    gives 0.5% to 0.7% more throughput than not using bpf_prog_pack."
> > 
> > These metrics are completely arbitrary and opaque to us. We need
> > something tangible and reproducible and I have been suggesting that
> > from early on...
> > 
> > I'm under the impression that the real missed, undocumented, major value-add
> > here is that the old "BPF prog pack" strategy helps to reduce the direct map
> > fragmentation caused by heavy use of the eBPF JIT programs and this in
> > turn helps your overall random system performance (regardless of what
> > it is you do). As I see it then the eBPF prog pack is just one strategy to
> > try to mitigate memory fragmentation on the direct map caused by the eBPF
> > JIT programs, so the "slow down" your team has observed should be due to the
> > eventual fragmentation caused on the direct map *while* eBPF programs
> > get heavily used.
> 
> I believe that while the eBPF prog pack is helpful in mitigation of the
> direct map fragmentation caused by the eBPF JIT programs, the same strategy
> of allocating a large page, splitting its PMD entry and then reusing the
> memory for smaller allocations can be (and should be) generalized to other
> use cases that require non-default permissions in the page table.  Most
> prominent use cases are those that allocate memory for code, but the same
> approach is relevant for other cases, like secretmem or page table
> protection with PKS.
> 
> A while ago I've suggested to handle such caching of large pages at the
> page allocator level, but when we discussed it at LSF/MM/BPF, prevailing
> opinion was that added value does not justify changes to the page
> allocator and it was suggested to handle such caching elsewhere. 

I saw that on the lwn coverage.

> I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> think that we need a generic allocator for memory with non-default
> permissions in the direct map and that code allocation should build on that
> allocator.

It seems this generalization of the bpf prog pack to possibly be used
for modules / kprobes / ftrace is a small step in that direction.

> All that said, the direct map fragmentation problem is currently relevant
> only to x86 because it's the only architecture that supports splitting of
> the large pages in the direct map.

I was thinking even more long term too, using this as a proof of concept. If
this practice in general helps with fragmentation, could it be used for
experimentation with compound pages later, as a way to reduce possible
fragmentation.

> > Mike Rapoport had presented about the Direct map fragmentation problem
> > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> > evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> > help performance [1] ends up generally recommending huge pages. The work by Xing
> > though was about using huge pages *alone*, not using a strategy such as in the
> > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> > and that I think is the real golden nugget here.
> > 
> > I contend therefore that the theoretical reduction of iTLB misses by using
> > huge pages for "bpf prog pack" is not what gets your systems to perform
> > somehow better. It should be simply that it reduces fragmentation and
> > *this* generally can help with performance long term. If this is accurate
> > then let's please separate the two aspects to this.
> 
> The direct map fragmentation is the reason for higher TLB miss rate, both
> for iTLB and dTLB.

OK so then whatever benchmark is running in tandem as eBPF JIT is hammered
should *also* be measured with perf for iTLB and dTLB. ie, the patch can
provide such results as justification.

> Whenever a large page in the direct map is split, all
> kernel accesses via the direct map will use small pages which requires
> dealing with 512 page table entries instead of one for 2M range.
> 
> Since small pages in the direct map are never collapsed back to large
> pages, long living system that heavily uses eBPF programs will have its
> direct map severely fragmented, higher TLB miss rate and worse overall
> performance. 

Shouldn't compaction help with those situations?

> > There's two aspects to what I would like to see from a performance
> > perspective then actually mentioned in the commit logs:
> > 
> > 1) iTLB miss loss ratio with "bpf prog pack" or this generalized solution
> >    Vs not using it at all:
> 
> ... 
>  
> > 2) Estimate in reduction on direct map fragmentation by using the "bpf
> >    prog pack" or this generalized solution:
> > 
> >    For this I'd expect a benchmark similar to the workload you guys
> >    run or something memory intensive, as eBPF JITs are heavily used,
> >    and after a certain amount of time somehow compute how fragmented
> >    memory is. The only sensible thing I can think to measure memory
> >    fragmentation is to look at the memory compaction index
> >    /sys/kernel/debug/extfrag/extfrag_index , but I highly welcome other's
> >    ideas as I'm a mm n00b.
> 
> The direct map fragmentation can be tracked with 
> 
> 	grep DirectMap /proc/meminfo
> 	grep direct_map /proc/vmstat
> 
> and by looking at /sys/kernel/debug/page_tables/kernel

Thanks!

  Luis
Rick Edgecombe Nov. 3, 2022, 9:19 p.m. UTC | #4
On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > Mike Rapoport had presented about the Direct map fragmentation
> > > problem
> > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > ftrace /
> > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > performance
> > > evaluation on whether using 2M/1G pages aggressively for the
> > > kernel direct map
> > > help performance [1] ends up generally recommending huge pages.
> > > The work by Xing
> > > though was about using huge pages *alone*, not using a strategy
> > > such as in the
> > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > programs,
> > > and that I think is the real golden nugget here.
> > > 
> > > I contend therefore that the theoretical reduction of iTLB misses
> > > by using
> > > huge pages for "bpf prog pack" is not what gets your systems to
> > > perform
> > > somehow better. It should be simply that it reduces fragmentation
> > > and
> > > *this* generally can help with performance long term. If this is
> > > accurate
> > > then let's please separate the two aspects to this.
> > 
> > The direct map fragmentation is the reason for higher TLB miss
> > rate, both
> > for iTLB and dTLB.
> 
> OK so then whatever benchmark is running in tandem as eBPF JIT is
> hammered
> should *also* be measured with perf for iTLB and dTLB. ie, the patch
> can
> provide such results as a justifications.

Song had done some tests on the old prog pack version that to me seemed
to indicate most (or possibly all) of the benefit was direct map
fragmentation reduction. This surprised me, since 2MB kernel text
has been shown to be beneficial.

Otherwise +1 to all these comments. This should be clear about what the
benefits are. I would add, that this is also much nicer about TLB
shootdowns than the existing way of loading text and saves some memory.

So I think there are sort of four areas of improvements:
1. Direct map fragmentation reduction (dTLB miss improvements). This
sort of does it as a side effect in this series, and the solution Mike
is talking about is a more general, probably better one.
2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
solution for this, but surprisingly it doesn't seem to be useful for
JITs. (modules testing TBD)
3. Loading text to reused allocation with per-cpu mappings. This
reduces TLB shootdowns, which are a short term load and teardown time
performance drag. My understanding is this is more of a problem on
bigger systems with many CPUs. This series does a decent job at this,
but the solution is not compatible with modules. Maybe ok since modules
don't load as often as JITs.
4. Having BPF progs share pages. This saves memory. This series could
probably easily get a number for how much.
Song Liu Nov. 3, 2022, 9:41 p.m. UTC | #5
On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > problem
> > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > ftrace /
> > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > performance
> > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > kernel direct map
> > > > help performance [1] ends up generally recommending huge pages.
> > > > The work by Xing
> > > > though was about using huge pages *alone*, not using a strategy
> > > > such as in the
> > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > programs,
> > > > and that I think is the real golden nugget here.
> > > >
> > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > by using
> > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > perform
> > > > somehow better. It should be simply that it reduces fragmentation
> > > > and
> > > > *this* generally can help with performance long term. If this is
> > > > accurate
> > > > then let's please separate the two aspects to this.
> > >
> > > The direct map fragmentation is the reason for higher TLB miss
> > > rate, both
> > > for iTLB and dTLB.
> >
> > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > hammered
> > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > can
> > provide such results as a justifications.
>
> Song had done some tests on the old prog pack version that to me seemed
> to indicate most (or possibly all) of the benefit was direct map
> fragmentation reduction. This surprised me, since 2MB kernel text
> has been shown to be beneficial.
>
> Otherwise +1 to all these comments. This should be clear about what the
> benefits are. I would add, that this is also much nicer about TLB
> shootdowns than the existing way of loading text and saves some memory.
>
> So I think there are sort of four areas of improvements:
> 1. Direct map fragmentation reduction (dTLB miss improvements). This
> sort of does it as a side effect in this series, and the solution Mike
> is talking about is a more general, probably better one.
> 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
> solution for this, but surprisingly it doesn't seem to be useful for
> JITs. (modules testing TBD)
> 3. Loading text to reused allocation with per-cpu mappings. This
> reduces TLB shootdowns, which are a short term load and teardown time
> performance drag. My understanding is this is more of a problem on
> bigger systems with many CPUs. This series does a decent job at this,
> but the solution is not compatible with modules. Maybe ok since modules
> don't load as often as JITs.
> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.
>

Hi Luis, Rick, and Mike,

Thanks a lot for helping me organize this information. Totally agree with
all these comments. I will add more data to the next revision.

Besides the motivation improvement, could you please also share your
comments on:
1. The logic/design of the vmalloc_exec() et al. APIs;
2. The naming of these functions. Does execmem_[alloc|free|fill|cpy]
  (as suggested by Christoph) sound good?

Thanks,
Song
Luis Chamberlain Nov. 3, 2022, 11:33 p.m. UTC | #6
On Thu, Nov 03, 2022 at 02:41:42PM -0700, Song Liu wrote:
> On Thu, Nov 3, 2022 at 2:19 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> Besides the motivation improvement, could you please also share your
> comments on:
> 1. The logic/design of the vmalloc_exec() et. al. APIs;

I've provided the feedback that I can so far, as I'm new to mm.
Best I can do is provide a better rationale so that mm folks
can understand the motivation.

> 2. The naming of these functions. Does execmem_[alloc|free|fill|cpy]
>   (as suggested by Christoph) sound good?

That seems sensible.

There may be other users later, secmm_alloc() for instance. So we just
gotta keep that in mind.

  Luis
Luis Chamberlain Nov. 4, 2022, 12:18 a.m. UTC | #7
On Thu, Nov 03, 2022 at 09:19:25PM +0000, Edgecombe, Rick P wrote:
> On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > problem
> > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > ftrace /
> > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > performance
> > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > kernel direct map
> > > > help performance [1] ends up generally recommending huge pages.
> > > > The work by Xing
> > > > though was about using huge pages *alone*, not using a strategy
> > > > such as in the
> > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > programs,
> > > > and that I think is the real golden nugget here.
> > > > 
> > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > by using
> > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > perform
> > > > somehow better. It should be simply that it reduces fragmentation
> > > > and
> > > > *this* generally can help with performance long term. If this is
> > > > accurate
> > > > then let's please separate the two aspects to this.
> > > 
> > > The direct map fragmentation is the reason for higher TLB miss
> > > rate, both
> > > for iTLB and dTLB.
> > 
> > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > hammered
> > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > can
> > provide such results as a justifications.
> 
> Song had done some tests on the old prog pack version that to me seemed
> to indicate most (or possibly all) of the benefit was direct map
> fragmentation reduction.

Matches my observations but I also provided quite a few hints as to
*why* I think that is. I suggested lib/test_kmod.c as an example of a beefy
multithreaded selftest which really kicks the hell out of the kernel
with whatever crap you want to run. That is precisely how I uncovered
some odd kmod bug lingering for years.

> This surprised me, since 2MB kernel text
> has been shown to be beneficial.
> 
> Otherwise +1 to all these comments. This should be clear about what the
> benefits are. I would add, that this is also much nicer about TLB
> shootdowns than the existing way of loading text and saves some memory.
> 
> So I think there are sort of four areas of improvements:
> 1. Direct map fragmentation reduction (dTLB miss improvements).

The dTLB gains should be on the benchmark which runs in tandem to the
ebpf-JIT-monster-selftest, not on the ebpf-JIT-monster-selftest, right?

> This
> sort of does it as a side effect in this series, and the solution Mike
> is talking about is a more general, probably better one.
> 2. 2MB mapped JITs. This is the iTLB side. I think this is a decent
> solution for this, but surprisingly it doesn't seem to be useful for
> JITs. (modules testing TBD)

Yes I'm super eager to get this tested. In fact I wonder if one can
boot Linux with less memory too...

> 3. Loading text to reused allocation with per-cpu mappings. This
> reduces TLB shootdowns, which are a short term load and teardown time
> performance drag. My understanding is this is more of a problem on
> bigger systems with many CPUs. This series does a decent job at this,
> but the solution is not compatible with modules. Maybe ok since modules
> don't load as often as JITs.

There are some tests like fstests which make heavy use of module
removal. But as a side effect, indeed I like to reboot to have
a fresh system before running fstests. I guess fstests should run
with a heavily fragmented memory too as a side corner case thing.

> 4. Having BPF progs share pages. This saves memory. This series could
> probably easily get a number for how much.

Once that does hit modules / kprobes / ftrace, the impact is much
much greater obviously.

  Luis
Luis Chamberlain Nov. 4, 2022, 3:29 a.m. UTC | #8
On Thu, Nov 03, 2022 at 05:18:51PM -0700, Luis Chamberlain wrote:
> On Thu, Nov 03, 2022 at 09:19:25PM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2022-11-03 at 11:59 -0700, Luis Chamberlain wrote:
> > > > > Mike Rapoport had presented about the Direct map fragmentation
> > > > > problem
> > > > > at Plumbers 2021 [0], and clearly mentioned modules / BPF /
> > > > > ftrace /
> > > > > kprobes as possible sources for this. Then Xing Zhengjun's 2021
> > > > > performance
> > > > > evaluation on whether using 2M/1G pages aggressively for the
> > > > > kernel direct map
> > > > > help performance [1] ends up generally recommending huge pages.
> > > > > The work by Xing
> > > > > though was about using huge pages *alone*, not using a strategy
> > > > > such as in the
> > > > > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF
> > > > > programs,
> > > > > and that I think is the real golden nugget here.
> > > > > 
> > > > > I contend therefore that the theoretical reduction of iTLB misses
> > > > > by using
> > > > > huge pages for "bpf prog pack" is not what gets your systems to
> > > > > perform
> > > > > somehow better. It should be simply that it reduces fragmentation
> > > > > and
> > > > > *this* generally can help with performance long term. If this is
> > > > > accurate
> > > > > then let's please separate the two aspects to this.
> > > > 
> > > > The direct map fragmentation is the reason for higher TLB miss
> > > > rate, both
> > > > for iTLB and dTLB.
> > > 
> > > OK so then whatever benchmark is running in tandem as eBPF JIT is
> > > hammered
> > > should *also* be measured with perf for iTLB and dTLB. ie, the patch
> > > can
> > > provide such results as a justifications.
> > 
> > Song had done some tests on the old prog pack version that to me seemed
> > to indicate most (or possibly all) of the benefit was direct map
> > fragmentation reduction.
> 
> Matches my observations but I also provided quite a bit of hints as to
> *why* I think that is. I suggested lib/test_kmod.c as an example beefy
> multithreaded selftests which really kicks the hell out of the kernel
> with whatever crap you want to run. That is precisely how I uncovered
> some odd kmod bug lingering for years.

*and*, *perhaps*... it may be that you need another memory intensive benchmark
to run in tandem, one which mimics the behaviour of the internal "shadow
production benchmark", whatever that is.

  Luis
Aaron Lu Nov. 7, 2022, 6:40 a.m. UTC | #9
Hello,

On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:

... ...

> I'm under the impression that the real missed, undocumented, major value-add
> here is that the old "BPF prog pack" strategy helps to reduce the direct map
> fragmentation caused by heavy use of the eBPF JIT programs and this in
> turn helps your overall random system performance (regardless of what
> it is you do). As I see it then the eBPF prog pack is just one strategy to
> try to mitigate memory fragmentation on the direct map caused by the eBPF
> JIT programs, so the "slow down" your team has observed should be due to the
> eventual fragmentation caused on the direct map *while* eBPF programs
> get heavily used.
> 
> Mike Rapoport had presented about the Direct map fragmentation problem
> at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> evaluation on whether using 2M/1G pages aggressively for the kernel direct map
> help performance [1] ends up generally recommending huge pages. The work by Xing
> though was about using huge pages *alone*, not using a strategy such as in the
> "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> and that I think is the real golden nugget here.

I'm interested in how this patchset (further) improves direct map
fragmentation so would like to evaluate it to see if my previous work to
merge small mappings back in architecture layer[1] is still necessary.

I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed,
so I took one step back and evaluated the existing bpf_prog_pack. I'm
aware this patchset would make things even better by using order-9
pages to back the vmalloced range.

I used the sample bpf prog samples/bpf/sockex1 because it looks easy to
run; feel free to let me know a better way to evaluate this.
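
A rough sketch of how such a run can be driven (assumptions: samples/bpf
built in-tree, run as root from the kernel tree, and the exact sockex1
invocation may differ):

$ grep Direct /proc/meminfo            # baseline
$ for i in $(seq 512); do ./samples/bpf/sockex1 & done
$ wait
$ grep Direct /proc/meminfo            # compare DirectMap4k/2M/1G deltas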

- In kernels before bpf_prog_pack (v5.17 and earlier), this prog would
cause 3 pages to change protection to RO+X from RW+NX; and if the three
pages are far apart, each would cause a level 3 split then a level 2
split. In reality, allocated pages tend to stay close physically, so the
actual result will not be this bad.

- After bpf_prog_pack, the load of this prog will most likely require
no new page protection change as long as the existing pack pool has space
for it (the common case). The actual space required for this bpf prog that
needs special protection is 64 * 2 + 192 bytes; it would need 6144 such
progs to use up the cache and trigger another 2MB alloc which can
potentially cause a direct map split. 6144 seems to be a pretty large number
to me, so I think the direct map split problem due to bpf is greatly improved
(if not totally solved).

Here are test results on an 8G x86_64 VM.
(on x86_64, 4k is PTE mapping, 2M is PMD mapping and 1G is PUD mapping)
- v5.17
1) right after boot
$ grep Direct /proc/meminfo
DirectMap4k:       87900 kB
DirectMap2M:     5154816 kB
DirectMap1G:     5242880 kB

2) after running 512 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:      462684 kB
DirectMap2M:     4780032 kB
DirectMap1G:     5242880 kB
PUD mappings survived, some PMD mappings were split.

3) after running 1024 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:      884572 kB
DirectMap2M:     6455296 kB
DirectMap1G:     3145728 kB
2 PUD mappings and some PMD mappings were split.

4) after running 2048 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:     1654620 kB
DirectMap2M:     6733824 kB
DirectMap1G:     2097152 kB
Another PUD mapping and some PMD mappings were split.
At the end, 2 PUD mappings survived.

- v6.1-rc3
The direct map numbers don't change for "right after boot" or "after
running 512/1024/2048 sockex1 instances". I also tried running 5120
instances, but it would consume all available memory at about
4000 instances, and even then the direct map numbers don't change, i.e.
not a single split happened. This is understandable because, as I
calculated above, it would need 6144 such progs to cause another
alloc/split. Here are its numbers:
$ grep Direct /proc/meminfo
DirectMap4k:       22364 kB
DirectMap2M:     3123200 kB
DirectMap1G:     7340032 kB

Consider that a production system will have its memory mostly consumed, and
when the CPU allocates pages, the pages can be farther apart than on a system
right after boot, causing more mapping splits. So I also tested first
consuming all memory into the page cache by reading some sparse files and
then running this test. I expect v5.17 to become worse this time. Here it is:

- v5.17
1) right after boot
$ grep Direct /proc/meminfo
DirectMap4k:       94044 kB
DirectMap2M:     4100096 kB
DirectMap1G:     6291456 kB
More mappings are in PUD for this boot.

2) after running 512 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:      538460 kB
DirectMap2M:     7849984 kB
DirectMap1G:     2097152 kB
4 PUD mappings and some PMD mappings were split this time, more than
last time.

3) after running 1024 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:     1083228 kB
DirectMap2M:     7305216 kB
DirectMap1G:     2097152 kB
Some PMD mappings split.

4) after running 2048 sockex1 instances concurrently
$ grep Direct /proc/meminfo
DirectMap4k:     2340700 kB
DirectMap2M:     6047744 kB
DirectMap1G:     2097152 kB
The end result is about the same as before.

- v6.1-rc3
There is no difference because I can't trigger another pack alloc before
the system is OOMed.

Conclusion: I think bpf_prog_pack is very good at reducing direct map
fragmentation and this patchset can further improve this situation on
large machines (with huge amounts of memory) or with more large bpf progs
loaded, etc.

Some imperfect things I can think of are (not related to this patchset):
1 Once a split has happened, it stays split. This may not be a big deal
now with bpf_prog_pack and this patchset because the need to allocate a
new order-9 page, and thus cause a potential split, should happen much,
much less;
2 When a new order-9 page has to be allocated, there is no way to tell
the allocator to allocate this order-9 page from an already split PUD
range to avoid another PUD mapping split;
3 As Mike and others have mentioned, there are other users that can also
cause direct map splits.

[1]: https://lore.kernel.org/lkml/20220808145649.2261258-1-aaron.lu@intel.com/

Regards,
Aaron
Mike Rapoport Nov. 7, 2022, 6:58 a.m. UTC | #10
On Thu, Nov 03, 2022 at 11:59:48AM -0700, Luis Chamberlain wrote:
> On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> 
> > I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> > think that we need a generic allocator for memory with non-default
> > permissions in the direct map and that code allocation should build on that
> > allocator.
> 
> It seems this generalization of the bpf prog pack to possibly be used
> for modules / kprobes / ftrace is a small step in that direction.
> 
> > All that said, the direct map fragmentation problem is currently relevant
> > only to x86 because it's the only architecture that supports splitting of
> > the large pages in the direct map.
> 
> I was thinking even more long term too, using this as a proof of concept. If
> this practice in general helps with fragmentation, could it be used for
> experimentation with compound pages later, as a way to reduce possible
> fragmentation.

As Rick already mentioned, these patches help with the direct map
fragmentation only indirectly. With these patches memory is freed in
PMD_SIZE chunks and this makes the changes to the direct map in
vm_remove_mappings() happen in PMD_SIZE units, and this is pretty much
the only effect of this series on the direct map layout.

A bit unrelated, but I'm wondering now if we want to have the direct map
alias of the pages allocated for code also to be read-only...

> > Whenever a large page in the direct map is split, all
> > kernel accesses via the direct map will use small pages which requires
> > dealing with 512 page table entries instead of one for 2M range.
> > 
> > Since small pages in the direct map are never collapsed back to large
> > pages, long living system that heavily uses eBPF programs will have its
> > direct map severely fragmented, higher TLB miss rate and worse overall
> > performance. 
> 
> Shouldn't compaction help with those situations?

Compaction helps to reduce fragmentation of the physical memory: it tries
to bring free physical pages next to each other to create large contiguous
chunks, but it does not change the virtual addresses the users of the
underlying data see.

Changing permissions of a small page in the direct map causes
"discontinuity" in the virtual space. E.g. if we have 2M mapped RW with a
single PMD changing several page in the middle of those 2M to R+X will
require to remap that range with 512 PTEs.
 
> Thanks!
> 
>   Luis
Luis Chamberlain Nov. 7, 2022, 5:26 p.m. UTC | #11
On Mon, Nov 07, 2022 at 08:58:17AM +0200, Mike Rapoport wrote:
> On Thu, Nov 03, 2022 at 11:59:48AM -0700, Luis Chamberlain wrote:
> > On Thu, Nov 03, 2022 at 05:51:57PM +0200, Mike Rapoport wrote:
> > 
> > > I had to put this project on a backburner for $VARIOUS_REASONS, but I still
> > > think that we need a generic allocator for memory with non-default
> > > permissions in the direct map and that code allocation should build on that
> > > allocator.
> > 
> > It seems this generalization of the bpf prog pack to possibly be used
> > for modules / kprobes / ftrace is a small step in that direction.
> > 
> > > All that said, the direct map fragmentation problem is currently relevant
> > > only to x86 because it's the only architecture that supports splitting of
> > > the large pages in the direct map.
> > 
> > I was thinking even more long term too, using this as a proof of concept. If
> > this practice in general helps with fragmentation, could it be used for
> > experimentation with compound pages later, as a way to reduce possible
> > fragmentation.
> 
> As Rick already mentioned, these patches help with the direct map
> fragmentation only indirectly. With these patches memory is freed in
> PMD_SIZE chunks and this makes the changes to the direct map in
> vm_remove_mappings() happen in PMD_SIZE units, and this is pretty much
> the only effect of this series on the direct map layout.

I understand that is what *this* series does. I was wondering if a similar
scheme may be useful to study to see if it helps with aggregating say
something like 32 x 64 kb for compound page allocations of order 16 (64 Kib)
to see if it may help with possible fragmentation concerns for that world
where that may be useful in the future (completely unrelated to page
permissions).

> A bit unrelated, but I'm wondering now if we want to have the direct map
> alias of the pages allocated for code also to be read-only...
> 
> > > Whenever a large page in the direct map is split, all
> > > kernel accesses via the direct map will use small pages which requires
> > > dealing with 512 page table entries instead of one for 2M range.
> > > 
> > > Since small pages in the direct map are never collapsed back to large
> > > pages, long living system that heavily uses eBPF programs will have its
> > > direct map severely fragmented, higher TLB miss rate and worse overall
> > > performance. 
> > 
> > Shouldn't compaction help with those situations?
> 
> Compaction helps to reduce fragmentation of the physical memory, it tries
> to bring free physical pages next to each other to create large contiguous
> chunks, but it does not change the virtual addresses the users of the
> underlying data see.

Sorry, I understood that 'bpf prog pack' only used *one* 2 MiB huge
page for *all* eBPF JIT programs, and so the fragmentation issue prior
to 'bpf prog pack' I thought was the fact that as eBPF programs are
still alive, we have fragmentation. In the world without 'bpf prog pack'
I thought no huge pages were used.

> Changing permissions of a small page in the direct map causes
> "discontinuity" in the virtual space. E.g. if we have 2M mapped RW with a
> single PMD, changing several pages in the middle of those 2M to R+X will
> require remapping that range with 512 PTEs.

Makes sense, thanks.

  Luis
Luis Chamberlain Nov. 7, 2022, 5:39 p.m. UTC | #12
On Mon, Nov 07, 2022 at 02:40:14PM +0800, Aaron Lu wrote:
> Hello,
> 
> On Wed, Nov 02, 2022 at 04:41:59PM -0700, Luis Chamberlain wrote:
> 
> ... ...
> 
> > I'm under the impression that the real missed, undocumented, major value-add
> > here is that the old "BPF prog pack" strategy helps to reduce the direct map
> > fragmentation caused by heavy use of the eBPF JIT programs and this in
> > turn helps your overall random system performance (regardless of what
> > it is you do). As I see it then the eBPF prog pack is just one strategy to
> > try to mitigate memory fragmentation on the direct map caused by the eBPF
> > JIT programs, so the "slow down" your team has observed should be due to the
> > eventual fragmentation caused on the direct map *while* eBPF programs
> > get heavily used.
> > 
> > Mike Rapoport presented the Direct map fragmentation problem
> > at Plumbers 2021 [0], and clearly mentioned modules / BPF / ftrace /
> > kprobes as possible sources for this. Then Xing Zhengjun's 2021 performance
> > evaluation of whether using 2M/1G pages aggressively for the kernel direct map
> > helps performance [1] ends up generally recommending huge pages. The work by Xing
> > though was about using huge pages *alone*, not using a strategy such as in the
> > "bpf prog pack" to share one 2 MiB huge page for *all* small eBPF programs,
> > and that I think is the real golden nugget here.
> 
> I'm interested in how this patchset (further) improves direct map
> fragmentation so would like to evaluate it to see if my previous work to
> merge small mappings back in the architecture layer [1] is still necessary.

You gotta apply it to 6.0.5, which had a large eBPF change go in that
was not present in 6.0.

> Conclusion: I think bpf_prog_pack is very good at reducing direct map
> fragmentation, and this patchset can further improve this situation on
> large machines (with a huge amount of memory) or with more large bpf progs
> loaded, etc.

Fantastic. Thanks for the analysis; that is yet another set of metrics which
I'd hope can be applied to this patch set as this effort is generalized.

Now imagine the effort with modules / ftrace / kprobes also taken into consideration.

> Some imperfect things I can think of are (not related to this patchset):
> 1 Once a split has happened, it remains split. This may not be a big deal
> now with bpf_prog_pack and this patchset because the need to allocate a
> new order-9 page and thus cause a potential split should happen much, much
> less;

Not sure I follow, are you suggesting an order-9 (512 bytes) allocation would
trigger a split of the reserved say 2 MiB huge page?

> 2 When a new order-9 page has to be allocated, there is no way to tell
> the allocator to allocate this order-9 page from an already-split PUD
> range to avoid another PUD mapping split;
> 3 As Mike and others have mentioned, there are other users that can also
> cause direct map splits.

Hence the effort to generalize.

  Luis
Song Liu Nov. 7, 2022, 6:30 p.m. UTC | #13
Hi Aaron,

On Sun, Nov 6, 2022 at 10:40 PM Aaron Lu <aaron.lu@intel.com> wrote:
>
[...]
> > and that I think is the real golden nugget here.
>
> I'm interested in how this patchset (further) improves direct map
> fragmentation so would like to evaluate it to see if my previous work to
> merge small mappings back in the architecture layer [1] is still necessary.
>
> I tried to apply this patchset on v6.1-rc3/2/1 and v6.0 but all failed,
> so I took one step back and evaluated the existing bpf_prog_pack. I'm
> aware that this patchset would make things even better by using order-9
> pages to back the vmalloced range.

The patchset was based on bpf-next tree:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/

>
> I used the sample bpf prog samples/bpf/sockex1 because it looks easy to
> run; feel free to let me know a better way to evaluate this.
>
> - In kernels before bpf_prog_pack (v5.17 and earlier), this prog would
> cause 3 pages to change protection to RO+X from RW+NX; and if the three
> pages are far apart, each would cause a level 3 split then a level 2 split.
> In reality, allocated pages tend to stay close physically, so the actual
> result will not be this bad.
[...]
>
> - v6.1-rc3
> There is no difference because I can't trigger another pack alloc before
> the system is OOMed.
>
> Conclusion: I think bpf_prog_pack is very good at reducing direct map
> fragmentation, and this patchset can further improve this situation on
> large machines (with a huge amount of memory) or with more large bpf progs
> loaded, etc.

Thanks a lot for these experiments! I will include the data in the next
version of the set.

>
> Some imperfect things I can think of are (not related to this patchset):
> 1 Once a split has happened, it remains split. This may not be a big deal
> now with bpf_prog_pack and this patchset because the need to allocate a
> new order-9 page and thus cause a potential split should happen much, much
> less;

I think we will need to regroup the direct map for some scenarios. But I
am not aware of such workloads.

> 2 When a new order-9 page has to be allocated, there is no way to tell
> the allocator to allocate this order-9 page from an already-split PUD
> range to avoid another PUD mapping split;

This would be a good improvement.

> 3 As Mike and others have mentioned, there are other users that can also
> cause direct map splits.

Thanks,
Song
Song Liu Nov. 7, 2022, 6:35 p.m. UTC | #14
On Mon, Nov 7, 2022 at 9:39 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
>
> > Some imperfect things I can think of are (not related to this patchset):
> > 1 Once a split has happened, it remains split. This may not be a big deal
> > now with bpf_prog_pack and this patchset because the need to allocate a
> > new order-9 page and thus cause a potential split should happen much, much
> > less;
>
> Not sure I follow, are you suggesting an order-9 (512 bytes) allocation would
> trigger a split of the reserved say 2 MiB huge page?

I think by order-9 allocation, Aaron meant 2MiB huge pages. The issue is that
a direct map split is a one-way operation. If we set_memory_x() on one 4kB page
out of a 1GB direct map range, we will split it into 511x 2MiB pages and 512x 4kB
pages. There is currently no way to regroup the 1GB page after
set_memory_nx() on the page.
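
Spelling out the arithmetic: the 1GB range starts out as a single PUD entry;
after the split it is mapped by 511 PMD entries (511 x 2MiB) plus 512 PTE
entries (512 x 4kB), i.e. 1022MiB + 2MiB = 1GiB, and nothing merges those
entries back into one PUD.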

Thanks,
Song

>
> > 2 When a new order-9 page has to be allocated, there is no way to tell
> > the allocator to allocate this order-9 page from an already-split PUD
> > range to avoid another PUD mapping split;
> > 3 As Mike and others have mentioned, there are other users that can also
> > cause direct map splits.
>
> Hence the effort to generalize.
>
>   Luis
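
A minimal usage sketch of the proposed API, for reference (the function name
load_text(), and the insns/prog_size/image names are hypothetical; error
handling is abbreviated; only vmalloc_exec()/vcopy_exec()/vfree_exec() as
declared in the patch below are assumed):

	static int load_text(void *insns, size_t prog_size)
	{
		void *image, *ret;

		/* RO+X memory, 64-byte aligned, carved out of free_text_area_* */
		image = vmalloc_exec(prog_size, 64);
		if (!image)
			return -ENOMEM;

		/* the region cannot be written directly; go through vcopy_exec() */
		ret = vcopy_exec(image, insns, prog_size);
		if (IS_ERR(ret)) {
			vfree_exec(image);
			return PTR_ERR(ret);
		}

		/* ... use the text; call vfree_exec(image) when it is unloaded ... */
		return 0;
	}

vcopy_exec() only works once an architecture implements the hooks. On x86, a
sketch built on the existing text_poke_copy()/text_poke_set() helpers could
look like the following (the actual arch patch in this series may differ):

	void *arch_vcopy_exec(void *dst, void *src, size_t len)
	{
		return text_poke_copy(dst, src, len);
	}

	int arch_invalidate_exec(void *ptr, size_t len)
	{
		/* wipe stale instructions; 0xcc (int3) may be a better filler on x86 */
		return text_poke_set(ptr, 0, len) ? 0 : -EINVAL;
	}
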
diff mbox series

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..9b2042313c12 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -154,6 +154,11 @@  extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 		int node, const void *caller) __alloc_size(1);
 void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+void *vmalloc_exec(unsigned long size, unsigned long align) __alloc_size(1);
+void *vcopy_exec(void *dst, void *src, size_t len);
+void vfree_exec(void *addr);
+void *arch_vcopy_exec(void *dst, void *src, size_t len);
+int arch_invalidate_exec(void *ptr, size_t len);
 
 extern void *__vmalloc_array(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
 extern void *vmalloc_array(size_t n, size_t size) __alloc_size(1, 2);
diff --git a/mm/nommu.c b/mm/nommu.c
index 214c70e1d059..8a1317247ef0 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -371,6 +371,18 @@  int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
 }
 EXPORT_SYMBOL(vm_map_pages_zero);
 
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+	return NULL;
+}
+
+void *vcopy_exec(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+void vfree_exec(void *addr) { }
+
 /*
  *  sys_brk() for the most part doesn't need the global kernel
  *  lock, except when an application is doing something nasty
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..6f4c73e67191 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -72,6 +72,9 @@  early_param("nohugevmalloc", set_nohugevmalloc);
 static const bool vmap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
 
+#define PMD_ALIGN(addr) ALIGN(addr, PMD_SIZE)
+#define PMD_ALIGN_DOWN(addr) ALIGN_DOWN(addr, PMD_SIZE)
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)kasan_reset_tag(x);
@@ -769,6 +772,38 @@  static LIST_HEAD(free_vmap_area_list);
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
 
+/*
+ * free_text_area for vmalloc_exec()
+ */
+static DEFINE_SPINLOCK(free_text_area_lock);
+/*
+ * This linked list is used in pair with free_text_area_root.
+ * It gives O(1) access to prev/next to perform fast coalescing.
+ */
+static LIST_HEAD(free_text_area_list);
+
+/*
+ * This augment red-black tree represents the free text space.
+ * All vmap_area objects in this tree are sorted by va->va_start
+ * address. It is used for allocation and merging when a vmap
+ * object is released.
+ *
+ * Each vmap_area node contains a maximum available free block
+ * of its sub-tree, right or left. Therefore it is possible to
+ * find a lowest match of free area.
+ *
+ * vmap_area objects in this tree are backed by RO+X memory, but they
+ * do not have a valid vm pointer (because we need subtree_max_size).
+ * The vm_structs for these vmap_area objects are stored in all_text_vm.
+ */
+static struct rb_root free_text_area_root = RB_ROOT;
+
+/*
+ * List of vm_struct for free_text_area_root. This list is rarely
+ * accessed, so the O(N) complexity is not likely a real issue.
+ */
+struct vm_struct *all_text_vm;
+
 /*
  * Preload a CPU with one object for "no edge" split case. The
  * aim is to get rid of allocations from the atomic context, thus
@@ -3313,6 +3348,289 @@  void *vmalloc(unsigned long size)
 }
 EXPORT_SYMBOL(vmalloc);
 
+#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
+#define VMALLOC_EXEC_START MODULES_VADDR
+#define VMALLOC_EXEC_END MODULES_END
+#else
+#define VMALLOC_EXEC_START VMALLOC_START
+#define VMALLOC_EXEC_END VMALLOC_END
+#endif
+
+static void move_vmap_to_free_text_tree(void *addr)
+{
+	struct vmap_area *va;
+
+	/* remove from vmap_area_root */
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* make the memory RO+X */
+	memset(addr, 0, va->va_end - va->va_start);
+	set_memory_ro(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+	set_memory_x(va->va_start, (va->va_end - va->va_start) >> PAGE_SHIFT);
+
+	/* add to all_text_vm */
+	va->vm->next = all_text_vm;
+	all_text_vm = va->vm;
+
+	/* add to free_text_area_root */
+	spin_lock(&free_text_area_lock);
+	merge_or_add_vmap_area_augment(va, &free_text_area_root, &free_text_area_list);
+	spin_unlock(&free_text_area_lock);
+}
+
+/**
+ * vmalloc_exec - allocate virtually contiguous RO+X memory
+ * @size:    allocation size
+ * @align:   desired alignment of the allocated memory
+ *
+ * This is used to allocate dynamic kernel text, such as module text, BPF
+ * programs, etc. Users need to use vcopy_exec() to update the memory
+ * allocated by vmalloc_exec().
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_exec(unsigned long size, unsigned long align)
+{
+	struct vmap_area *va, *tmp;
+	unsigned long addr;
+	enum fit_type type;
+	int ret;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_KERNEL, NUMA_NO_NODE);
+	if (unlikely(!va))
+		return NULL;
+
+again:
+	preload_this_cpu_lock(&free_text_area_lock, GFP_KERNEL, NUMA_NO_NODE);
+	tmp = find_vmap_lowest_match(&free_text_area_root, size, align, 1, false);
+
+	if (!tmp) {
+		unsigned long alloc_size;
+		void *ptr;
+
+		spin_unlock(&free_text_area_lock);
+
+		/*
+		 * Not enough continuous space in free_text_area_root, try
+		 * allocate more memory. The memory is first added to
+		 * vmap_area_root, and then moved to free_text_area_root.
+		 */
+		alloc_size = roundup(size, PMD_SIZE * num_online_nodes());
+		ptr = __vmalloc_node_range(alloc_size, PMD_SIZE, VMALLOC_EXEC_START,
+					   VMALLOC_EXEC_END, GFP_KERNEL, PAGE_KERNEL,
+					   VM_ALLOW_HUGE_VMAP | VM_NO_GUARD,
+					   NUMA_NO_NODE, __builtin_return_address(0));
+		if (unlikely(!ptr)) {
+			/* free_text_area_lock was already dropped above */
+			kmem_cache_free(vmap_area_cachep, va);
+			return NULL;
+		}
+
+		move_vmap_to_free_text_tree(ptr);
+		goto again;
+	}
+
+	addr = roundup(tmp->va_start, align);
+	type = classify_va_fit_type(tmp, addr, size);
+	if (WARN_ON_ONCE(type == NOTHING_FIT))
+		goto err_out;
+
+	ret = adjust_va_to_fit_type(&free_text_area_root, &free_text_area_list,
+				    tmp, addr, size);
+	if (ret)
+		goto err_out;
+
+	spin_unlock(&free_text_area_lock);
+
+	va->va_start = addr;
+	va->va_end = addr + size;
+	va->vm = NULL;
+
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	return (void *)addr;
+
+err_out:
+	spin_unlock(&free_text_area_lock);
+	kmem_cache_free(vmap_area_cachep, va);
+	return NULL;
+}
+
+void __weak *arch_vcopy_exec(void *dst, void *src, size_t len)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+int __weak arch_invalidate_exec(void *ptr, size_t len)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * vcopy_exec - Copy text to RO+X memory allocated by vmalloc_exec()
+ * @dst:  pointer to memory allocated by vmalloc_exec()
+ * @src:  pointer to data being copied from
+ * @len:  number of bytes to be copied
+ *
+ * vcopy_exec() will only update memory allocated by a single vmalloc_exec()
+ * call. If dst + len goes beyond the boundary of one allocation,
+ * vcopy_exec() is aborted.
+ */
+void *vcopy_exec(void *dst, void *src, size_t len)
+{
+	struct vmap_area *va;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)dst, &vmap_area_root);
+
+	/*
+	 * If no va, or va has a vm attached, this memory is not allocated
+	 * by vmalloc_exec().
+	 */
+	if (WARN_ON_ONCE(!va) || WARN_ON_ONCE(va->vm))
+		goto err_out;
+	if (WARN_ON_ONCE((unsigned long)dst + len > va->va_end))
+		goto err_out;
+
+	spin_unlock(&vmap_area_lock);
+
+	return arch_vcopy_exec(dst, src, len);
+
+err_out:
+	spin_unlock(&vmap_area_lock);
+	return ERR_PTR(-EINVAL);
+}
+
+static struct vm_struct *find_and_unlink_text_vm(unsigned long start, unsigned long end)
+{
+	struct vm_struct *vm, *prev_vm;
+
+	lockdep_assert_held(&free_text_area_lock);
+
+	vm = all_text_vm;
+	while (vm) {
+		unsigned long vm_addr = (unsigned long)vm->addr;
+
+		/* vm is within this free space, we can free it */
+		if ((vm_addr >= start) && ((vm_addr + vm->size) <= end))
+			goto unlink_vm;
+		vm = vm->next;
+	}
+	return NULL;
+
+unlink_vm:
+	if (all_text_vm == vm) {
+		all_text_vm = vm->next;
+	} else {
+		prev_vm = all_text_vm;
+		while (prev_vm->next != vm)
+			prev_vm = prev_vm->next;
+		prev_vm->next = vm->next;
+	}
+	return vm;
+}
+
+/**
+ * vfree_exec - Release memory allocated by vmalloc_exec()
+ * @addr:  Memory base address
+ *
+ * If @addr is NULL, no operation is performed.
+ */
+void vfree_exec(void *addr)
+{
+	unsigned long free_start, free_end, free_addr;
+	struct vm_struct *vm;
+	struct vmap_area *va;
+
+	might_sleep();
+
+	if (!addr)
+		return;
+
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
+	if (WARN_ON_ONCE(!va)) {
+		spin_unlock(&vmap_area_lock);
+		return;
+	}
+	WARN_ON_ONCE(va->vm);
+
+	unlink_va(va, &vmap_area_root);
+	spin_unlock(&vmap_area_lock);
+
+	/* Invalidate text in the region */
+	arch_invalidate_exec(addr, va->va_end - va->va_start);
+
+	spin_lock(&free_text_area_lock);
+	va = merge_or_add_vmap_area_augment(va,
+		&free_text_area_root, &free_text_area_list);
+
+	if (WARN_ON_ONCE(!va))
+		goto out;
+
+	free_start = PMD_ALIGN(va->va_start);
+	free_end = PMD_ALIGN_DOWN(va->va_end);
+
+	/*
+	 * Only try to free vm when there is at least one PMD_SIZE free
+	 * continuous memory.
+	 */
+	if (free_start >= free_end)
+		goto out;
+
+	/*
+	 * TODO: It is possible that multiple vm are ready to be freed
+	 * after one vfree_exec(). But we free at most one vm for now.
+	 */
+	vm = find_and_unlink_text_vm(free_start, free_end);
+	if (!vm)
+		goto out;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, GFP_ATOMIC, NUMA_NO_NODE);
+	if (unlikely(!va))
+		goto out_save_vm;
+
+	free_addr = __alloc_vmap_area(&free_text_area_root, &free_text_area_list,
+				      vm->size, 1, (unsigned long)vm->addr,
+				      (unsigned long)vm->addr + vm->size);
+
+	if (WARN_ON_ONCE(free_addr != (unsigned long)vm->addr))
+		goto out_save_vm;
+
+	va->va_start = (unsigned long)vm->addr;
+	va->va_end = va->va_start + vm->size;
+	va->vm = vm;
+	spin_unlock(&free_text_area_lock);
+
+	set_memory_nx(va->va_start, vm->size >> PAGE_SHIFT);
+	set_memory_rw(va->va_start, vm->size >> PAGE_SHIFT);
+
+	/* put the va to vmap_area_root, and then free it with vfree */
+	spin_lock(&vmap_area_lock);
+	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+	spin_unlock(&vmap_area_lock);
+
+	vfree(vm->addr);
+	return;
+
+out_save_vm:
+	/*
+	 * vm is removed from all_text_vm, but not freed. Add it back,
+	 * so that we can use or free it later.
+	 */
+	vm->next = all_text_vm;
+	all_text_vm = vm;
+out:
+	spin_unlock(&free_text_area_lock);
+}
+
 /**
  * vmalloc_huge - allocate virtually contiguous memory, allow huge pages
  * @size:      allocation size