[bpf-next,v2,5/5] x86: use register_text_tail_vm

Message ID 20221107223921.3451913-6-song@kernel.org (mailing list archive)
State New
Series execmem_alloc for BPF programs

Commit Message

Song Liu Nov. 7, 2022, 10:39 p.m. UTC
Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
[round_up(_etext, 4kb), round_up(_etext, 2MB)] with register_text_tail_vm
so that we can use this part of memory for dynamic kernel text (BPF
programs, etc.).

Here is an example:

[root@eth50-1 ~]# grep _etext /proc/kallsyms
ffffffff82202a08 T _etext

[root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]

[root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd

ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, and
bpf programs.
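
As a rough check of the numbers above, the registered range can be recomputed
from the _etext value in the example. This is a minimal user-space sketch, not
kernel code: align_up(), text_tail_start(), text_tail_end() and in_text_tail()
are hypothetical helper names, the 4KB and 2MB sizes are the usual x86-64
PAGE_SIZE and PMD_SIZE, and treating the range as half-open is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 sizes: 4KB base pages, 2MB PMD-level huge pages. */
#define SKETCH_PAGE_SIZE (UINT64_C(1) << 12)
#define SKETCH_PMD_SIZE  (UINT64_C(1) << 21)

/* Round addr up to the next multiple of size (size must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr + size - 1) & ~(size - 1);
}

/* Start of the text tail: round_up(_etext, 4KB). */
static uint64_t text_tail_start(uint64_t etext)
{
	return align_up(etext, SKETCH_PAGE_SIZE);
}

/* End of the text tail: round_up(_etext, 2MB). */
static uint64_t text_tail_end(uint64_t etext)
{
	return align_up(etext, SKETCH_PMD_SIZE);
}

/* Hypothetical check: does addr fall in [tail start, tail end)? */
static int in_text_tail(uint64_t addr, uint64_t etext)
{
	return addr >= text_tail_start(etext) && addr < text_tail_end(etext);
}
```

With _etext = 0xffffffff82202a08, text_tail_start() gives 0xffffffff82203000
and text_tail_end() gives 0xffffffff82400000, so the BPF program at
0xffffffff8220f920 lies inside the tail of the 2MB page shown above.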

Signed-off-by: Song Liu <song@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h | 1 +
 arch/x86/mm/init_64.c                   | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

Comments

Edgecombe, Rick P Nov. 8, 2022, 7:04 p.m. UTC | #1
On Mon, 2022-11-07 at 14:39 -0800, Song Liu wrote:
> Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
> [round_up(_etext, 4kb), round_up(_etext, 2MB)] with
> register_text_tail_vm
> so that we can use this part of memory for dynamic kernel text (BPF
> programs, etc.).
> 
> Here is an example:
> 
> [root@eth50-1 ~]# grep _etext /proc/kallsyms
> ffffffff82202a08 T _etext
> 
> [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
>
> [root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
> 0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd
>
> ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, and
> bpf programs.
> 
> Signed-off-by: Song Liu <song@kernel.org>

Please update Documentation/x86/x86_64/mm.txt and teach places that
check if an address is text about it.
Song Liu Nov. 8, 2022, 10:15 p.m. UTC | #2
On Tue, Nov 8, 2022 at 11:04 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-11-07 at 14:39 -0800, Song Liu wrote:
> > Allocate 2MB pages up to round_up(_etext, 2MB), and register memory
> > [round_up(_etext, 4kb), round_up(_etext, 2MB)] with
> > register_text_tail_vm
> > so that we can use this part of memory for dynamic kernel text (BPF
> > programs, etc.).
> >
> > Here is an example:
> >
> > [root@eth50-1 ~]# grep _etext /proc/kallsyms
> > ffffffff82202a08 T _etext
> >
> > [root@eth50-1 ~]# grep bpf_prog_ /proc/kallsyms  | tail -n 3
> > ffffffff8220f920 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup       [bpf]
> > ffffffff8220fa28 t bpf_prog_cc61a5364ac11d93_handle__sched_wakeup_new   [bpf]
> > ffffffff8220fad4 t bpf_prog_3bf73fa16f5e3d92_handle__sched_switch       [bpf]
> >
> > [root@eth50-1 ~]#  grep 0xffffffff82200000 /sys/kernel/debug/page_tables/kernel
> > 0xffffffff82200000-0xffffffff82400000     2M     ro   PSE         x  pmd
> >
> > ffffffff82200000-ffffffff82400000 is a 2MB page, serving kernel text, and
> > bpf programs.
> >
> > Signed-off-by: Song Liu <song@kernel.org>
>
> Please update Documentation/x86/x86_64/mm.txt and teach places that
> check if an address is text about it.

For mm.rst, I got something like:

=========================== 8< ===========================

diff --git i/Documentation/x86/x86_64/mm.rst w/Documentation/x86/x86_64/mm.rst
index 9798676bb0bf..ac041b7d3965 100644
--- i/Documentation/x86/x86_64/mm.rst
+++ w/Documentation/x86/x86_64/mm.rst
@@ -62,7 +62,7 @@ Complete virtual memory map with 4-level page tables
    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel and module text mapping, mapped to physical address 0
    ffffffff80000000 |-2048    MB |                  |         |
    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
    ffffffffff000000 |  -16    MB |                  |         |
@@ -121,7 +121,7 @@ Complete virtual memory map with 5-level page tables
    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
-   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel and module text mapping, mapped to physical address 0
    ffffffff80000000 |-2048    MB |                  |         |
    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
    ffffffffff000000 |  -16    MB |                  |         |

=========================== 8< ===========================

Is this good enough?

I added an extra check in is_vmalloc_or_module_addr() (4/5). Where else do we
need similar logic?

Thanks,
Song
Edgecombe, Rick P Nov. 15, 2022, 5:28 p.m. UTC | #3
On Tue, 2022-11-08 at 14:15 -0800, Song Liu wrote:
> diff --git i/Documentation/x86/x86_64/mm.rst w/Documentation/x86/x86_64/mm.rst
> index 9798676bb0bf..ac041b7d3965 100644
> --- i/Documentation/x86/x86_64/mm.rst
> +++ w/Documentation/x86/x86_64/mm.rst
> @@ -62,7 +62,7 @@ Complete virtual memory map with 4-level page tables
>     ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
>     ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
>     ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
> -   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
> +   ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel and module text mapping, mapped to physical address 0

It's not really "module text mapping" yet, right? Because it doesn't get
used by modules. I might just call it execmem or whatever you call the
component. Otherwise it is outdated when the next user starts using
the API. Otherwise looks ok, thanks.

Patch

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04f36063ad54..c0f9cceb109a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -101,6 +101,7 @@  extern unsigned int ptrs_per_p4d;
 #define PUD_MASK	(~(PUD_SIZE - 1))
 #define PGDIR_SIZE	(_AC(1, UL) << PGDIR_SHIFT)
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
+#define PMD_ALIGN(x)	(((unsigned long)(x) + (PMD_SIZE - 1)) & PMD_MASK)
 
 /*
  * See Documentation/x86/x86_64/mm.rst for a description of the memory map.
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3f040c6e5d13..5b42fc0c6099 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1373,7 +1373,7 @@  void mark_rodata_ro(void)
 	unsigned long start = PFN_ALIGN(_text);
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
 	unsigned long end = (unsigned long)__end_rodata_hpage_align;
-	unsigned long text_end = PFN_ALIGN(_etext);
+	unsigned long text_end = PMD_ALIGN(_etext);
 	unsigned long rodata_end = PFN_ALIGN(__end_rodata);
 	unsigned long all_end;
 
@@ -1414,6 +1414,8 @@  void mark_rodata_ro(void)
 				(void *)rodata_end, (void *)_sdata);
 
 	debug_checkwx();
+	register_text_tail_vm(PFN_ALIGN((unsigned long)_etext),
+			      PMD_ALIGN((unsigned long)_etext));
 }
 
 int kern_addr_valid(unsigned long addr)