[v2,bpf,3/3] bpf: use vmalloc with VM_ALLOW_HUGE_VMAP for bpf_prog_pack

Message ID 20220411233549.740157-5-song@kernel.org (mailing list archive)
State New
Series vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Commit Message

Song Liu April 11, 2022, 11:35 p.m. UTC
Use __vmalloc_node_range with VM_ALLOW_HUGE_VMAP for bpf_prog_pack so that
BPF programs sit on PMD_SIZE pages. This benefits system performance by
reducing iTLB miss rate.

Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
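
For context, the pack allocation in this version goes through module_alloc_huge(), a helper added earlier in the series (its patch is not part of this message). A minimal sketch of what such a helper could look like on x86, assuming it simply mirrors module_alloc() but additionally passes VM_ALLOW_HUGE_VMAP, and leaving out the KASAN/kmemleak and load-offset handling that the real module_alloc() performs:

void *module_alloc_huge(unsigned long size)
{
	if (PAGE_ALIGN(size) > MODULES_LEN)
		return NULL;

	/* Same range and protections as module_alloc(), but allow PMD mappings. */
	return __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR,
				    MODULES_END, GFP_KERNEL, PAGE_KERNEL,
				    VM_ALLOW_HUGE_VMAP, NUMA_NO_NODE,
				    __builtin_return_address(0));
}

This indirection appears to be what Christoph's comment below refers to: the commit message names __vmalloc_node_range(), but in v2 that call only happens inside the module_alloc_huge() wrapper.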

Comments

Christoph Hellwig April 12, 2022, 4:20 a.m. UTC | #1
On Mon, Apr 11, 2022 at 04:35:49PM -0700, Song Liu wrote:
> Use __vmalloc_node_range with VM_ALLOW_HUGE_VMAP for bpf_prog_pack so that

That is only very indirectly true now.
Song Liu April 12, 2022, 6:12 a.m. UTC | #2
On Mon, Apr 11, 2022 at 9:20 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Apr 11, 2022 at 04:35:49PM -0700, Song Liu wrote:
> > Use __vmalloc_node_range with VM_ALLOW_HUGE_VMAP for bpf_prog_pack so that
>
> That is only very indirectly true now.

Yeah, I realized I missed this part after sending it. Will fix.

Thanks,
Song
Edgecombe, Rick P April 12, 2022, 5:20 p.m. UTC | #3
On Mon, 2022-04-11 at 16:35 -0700, Song Liu wrote:
> @@ -889,7 +889,6 @@ static struct bpf_prog_pack *alloc_new_pack(void)
>         bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
>         list_add_tail(&pack->list, &pack_list);
>
> -       set_vm_flush_reset_perms(pack->ptr);
>         set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
>         set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
>         return pack;

Dropping set_vm_flush_reset_perms() is not mentioned in the commit log.
It is kind of a fix for a different issue.

Now that x86 supports vmalloc huge pages, but VM_FLUSH_RESET_PERMS does
not work with them, we should have some comments or warnings to that
effect somewhere. Someone may try to pass the flags in together.
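
One possible shape for such a guard, sketched purely as an illustration (the placement and exact form are assumptions, nothing in this thread settles them), would be to warn and fall back to small pages inside __vmalloc_node_range() when both flags are requested:

	/*
	 * Illustrative only: VM_FLUSH_RESET_PERMS is not expected to work
	 * on PMD-mapped vmalloc areas, so refuse the combination and fall
	 * back to small pages rather than silently misbehaving.
	 */
	if (WARN_ON_ONCE((vm_flags & VM_FLUSH_RESET_PERMS) &&
			 (vm_flags & VM_ALLOW_HUGE_VMAP)))
		vm_flags &= ~VM_ALLOW_HUGE_VMAP;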

> @@ -970,7 +969,9 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
>         if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
>                                        bpf_prog_chunk_count(), 0) == 0) {
>                 list_del(&pack->list);
> -               module_memfree(pack->ptr);
>
> +               set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
> +               set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
> +               vfree(pack->ptr);
>                 kfree(pack);

Now that it calls module_alloc_huge() instead of vmalloc_node_range(),
should it call module_memfree() instead of vfree()?
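
A sketch of the more symmetric free path being suggested here (illustrative; whether module_alloc_huge()/module_memfree() ends up being the final pairing was still open at this point in the thread):

		list_del(&pack->list);
		/* Restore default permissions before handing the memory back. */
		set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
		set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
		/* Pair the module_alloc_huge() allocation with module_memfree(). */
		module_memfree(pack->ptr);
		kfree(pack);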



Since there are bugs, simple, immediate fixes seem like the right thing
to do, but I had a couple long term focused comments on this new
feature:

It would be nice if bpf and the other module_alloc() callers could
share the same large pages. Meaning, ultimately that this whole thing
should probably live outside of bpf. BPF tracing usages might benefit
for example, and kprobes and ftrace are not too different than bpf
progs from a text allocation perspective.

I agree that the module's part is non-trivial. A while back I had tried
to do something like bpf_prog_pack() that worked for all the
module_alloc() callers. It had some modules changes to allow different
permissions to go to different allocations so they could be made to
share large pages:

https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/

I thought the existing kernel special permission allocation methods
were just too brittle and intertwined to improve without a new
interface. The hope was the new interface could wrap all the arch
intricacies instead of leaving them exposed in the cross-arch callers.

I wonder what you think of that general direction or if you have any
follow up plans for this?
Song Liu April 12, 2022, 9 p.m. UTC | #4
On Tue, Apr 12, 2022 at 10:21 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-04-11 at 16:35 -0700, Song Liu wrote:
> > @@ -889,7 +889,6 @@ static struct bpf_prog_pack *alloc_new_pack(void)
> >         bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
> >         list_add_tail(&pack->list, &pack_list);
> >
> > -       set_vm_flush_reset_perms(pack->ptr);
> >         set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
> >         set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
> >         return pack;
>
> Dropping set_vm_flush_reset_perms() is not mentioned in the commit log.
> It is kind of a fix for a different issue.
>
> Now that x86 supports vmalloc huge pages, but VM_FLUSH_RESET_PERMS does
> not work with them, we should have some comments or warnings to that
> effect somewhere. Someone may try to pass the flags in together.

Good catch! I will add it in the next version.

>
> > @@ -970,7 +969,9 @@ static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
> >         if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
> >                                        bpf_prog_chunk_count(), 0) == 0) {
> >                 list_del(&pack->list);
> > -               module_memfree(pack->ptr);
> >
> > +               set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
> > +               set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
> > +               vfree(pack->ptr);
> >                 kfree(pack);
>
> Now that it calls module_alloc_huge() instead of vmalloc_node_range(),
> should it call module_memfree() instead of vfree()?

Right. Let me sort that out. (Also, whether we introduce module_alloc_huge()
or not).

>
>
>
> Since there are bugs, simple, immediate fixes seem like the right thing
> to do, but I had a couple long term focused comments on this new
> feature:
>
> It would be nice if bpf and the other module_alloc() callers could
> share the same large pages. Meaning, ultimately that this whole thing
> should probably live outside of bpf. BPF tracing usages might benefit
> for example, and kprobes and ftrace are not too different than bpf
> progs from a text allocation perspective.

Agreed.

>
> I agree that the module's part is non-trivial. A while back I had tried
> to do something like bpf_prog_pack() that worked for all the
> module_alloc() callers. It had some modules changes to allow different
> permissions to go to different allocations so they could be made to
> share large pages:
>
> https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
>
> I thought the existing kernel special permission allocation methods
> were just too brittle and intertwined to improve without a new
> interface. The hope was the new interface could wrap all the arch
> intricacies instead of leaving them exposed in the cross-arch callers.
>
> I wonder what you think of that general direction or if you have any
> follow up plans for this?

Since I am still learning the vmalloc/module_alloc code, I think I am
not really capable of commenting on the direction. From our use
cases, we do see a performance hit due to a large number of BPF
programs fragmenting the page table. Kernel modules, OTOH, are not
too big an issue for us, as we usually build hot modules into the
kernel. That being said, we are interested in making the huge page
interface general for BPF programs and kernel modules. We can
commit resources to this effort.

Thanks,
Song
Edgecombe, Rick P April 13, 2022, 3:51 p.m. UTC | #5
CC Mike, who has been working on a direct map fragmentation solution.
[0]

On Tue, 2022-04-12 at 14:00 -0700, Song Liu wrote:
> Since I am still learning the vmalloc/module_alloc code, I think I am
> not really capable of commenting on the direction. From our use
> cases, we do see a performance hit due to a large number of BPF
> programs fragmenting the page table. Kernel modules, OTOH, are not
> too big an issue for us, as we usually build hot modules into the
> kernel. That being said, we are interested in making the huge page
> interface general for BPF programs and kernel modules. We can
> commit resources to this effort.

That sounds great. Please feel free to loop me in if you do.


[0] 
https://lore.kernel.org/lkml/20220127085608.306306-1-rppt@kernel.org/

Patch

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 13e9dbeeedf3..fd45bdd80a75 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -857,7 +857,7 @@  static size_t select_bpf_prog_pack_size(void)
 	void *ptr;
 
 	size = BPF_HPAGE_SIZE * num_online_nodes();
-	ptr = module_alloc(size);
+	ptr = module_alloc_huge(size);
 
 	/* Test whether we can get huge pages. If not just use PAGE_SIZE
 	 * packs.
@@ -881,7 +881,7 @@  static struct bpf_prog_pack *alloc_new_pack(void)
 		       GFP_KERNEL);
 	if (!pack)
 		return NULL;
-	pack->ptr = module_alloc(bpf_prog_pack_size);
+	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
 	if (!pack->ptr) {
 		kfree(pack);
 		return NULL;
@@ -889,7 +889,6 @@  static struct bpf_prog_pack *alloc_new_pack(void)
 	bitmap_zero(pack->bitmap, bpf_prog_pack_size / BPF_PROG_CHUNK_SIZE);
 	list_add_tail(&pack->list, &pack_list);
 
-	set_vm_flush_reset_perms(pack->ptr);
 	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
 	return pack;
@@ -970,7 +969,9 @@  static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
 	if (bitmap_find_next_zero_area(pack->bitmap, bpf_prog_chunk_count(), 0,
 				       bpf_prog_chunk_count(), 0) == 0) {
 		list_del(&pack->list);
-		module_memfree(pack->ptr);
+		set_memory_nx((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+		set_memory_rw((unsigned long)pack->ptr, bpf_prog_pack_size / PAGE_SIZE);
+		vfree(pack->ptr);
 		kfree(pack);
 	}
 out:
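
The hunk in select_bpf_prog_pack_size() only shows the allocation call; the probe that decides between huge and PAGE_SIZE packs sits in surrounding context the diff does not reproduce. As a hedged sketch of how such a probe can be written against the existing vmalloc API (an illustration, not necessarily the exact code in this function), is_vm_area_hugepages() can be checked on the trial allocation:

	size = BPF_HPAGE_SIZE * num_online_nodes();
	ptr = module_alloc_huge(size);

	/* Fall back to PAGE_SIZE packs if the trial allocation is not PMD-mapped. */
	if (!ptr || !is_vm_area_hugepages(ptr))
		size = PAGE_SIZE;

	module_memfree(ptr);	/* free the probe either way; NULL is a no-op */
	return size;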