[bpf-next,v1,RESEND,0/5] vmalloc_exec for modules and BPF programs

Message ID: 20221031222541.1773452-1-song@kernel.org
Series: vmalloc_exec for modules and BPF programs

Message

Song Liu Oct. 31, 2022, 10:25 p.m. UTC
The original email didn't make it to lore; resending with a shorter CC list.

This set enables bpf programs and bpf dispatchers to share huge pages with
a new API:
  vmalloc_exec()
  vfree_exec()
  vcopy_exec()

The idea is similar to Peter's suggestion in [1].

vmalloc_exec() manages a set of PMD_SIZE RO+X memory pages and allocates
this memory to its users. vfree_exec() frees memory allocated by
vmalloc_exec(). vcopy_exec() updates memory allocated by vmalloc_exec().

Memory allocated by vmalloc_exec() is RO+X, so this does not violate W^X.
The caller has to update the content with a text_poke()-like mechanism.
Specifically, vcopy_exec() is provided to update memory allocated by
vmalloc_exec(); it also makes sure the update stays within the boundary of
one chunk allocated by vmalloc_exec(). Please refer to patch 1/5 for more
details.
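
For illustration, a caller such as a BPF JIT would be expected to use the
three functions roughly as in the sketch below. This is only a sketch based
on the description above; the exact prototypes, the alignment argument, and
the return conventions are defined in patch 1/5, so treat them here as
assumptions.

/*
 * Minimal usage sketch; prototypes and return conventions are assumed,
 * see patch 1/5 for the real definitions.
 */
static void *publish_jited_text(void *rw_image, size_t size)
{
	void *ro_image;

	/* carve an RO+X chunk out of a shared PMD_SIZE page */
	ro_image = vmalloc_exec(size, sizeof(long));
	if (!ro_image)
		return NULL;

	/*
	 * Publish the instructions with a text_poke()-like copy.
	 * vcopy_exec() refuses to write outside the chunk returned by
	 * the vmalloc_exec() call above.
	 */
	vcopy_exec(ro_image, rw_image, size);

	return ro_image;
}

/* ... and when the program is unloaded: vfree_exec(ro_image); */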

Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.

Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to share
PMD_SIZE pages with dynamic kernel text on x86_64. This is achieved by
allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE), and then using
the range from _etext to roundup(_etext, PMD_SIZE) for dynamic kernel text.
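
As a concrete illustration of the arithmetic on x86_64 (where PMD_SIZE is
2MB): _etext usually ends somewhere inside the last huge page of kernel
text, so the tail of that page up to roundup(_etext, PMD_SIZE) is already
mapped RO+X and would otherwise be wasted. The snippet below sketches how
that tail could be handed to the dynamic text allocator; the exact
prototype of register_text_tail_vm(), the call site, and any rounding of
the start address are in patches 4/5 and 5/5, so this is illustrative only.

/* Sketch only; see patches 4/5 and 5/5 for the real code. */
unsigned long start = round_up((unsigned long)_etext, PAGE_SIZE);
unsigned long end = round_up((unsigned long)_etext, PMD_SIZE);

if (end > start)
	/* reuse the already-mapped RO+X tail of the last huge page */
	register_text_tail_vm(start, end);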

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/
[2] RFC v1: https://lore.kernel.org/linux-mm/20220818224218.2399791-3-song@kernel.org/T/

Changes RFC v2 => PATCH v1:
1. Add vcopy_exec(), which updates memory allocated by vmalloc_exec(). It
   also ensures that vcopy_exec() only updates memory from a single
   vmalloc_exec() call. (Christoph Hellwig)
2. Add arch_vcopy_exec() and arch_invalidate_exec() as wrappers for the
   text_poke()-like logic (a sketch of the x86 side follows this list).
3. Drop changes for kernel modules and focus on BPF side changes.
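
For reference, here is a minimal sketch of what the x86 wrappers mentioned
in item 2 might look like, assuming they simply forward to the existing
text_poke_copy()/text_poke_set() helpers and fill invalidated text with
INT3 (0xcc); the actual signatures and implementation are in patch 2/5.

void *arch_vcopy_exec(void *dst, void *src, size_t len)
{
	/* copy into RO+X memory via the kernel's text-poking machinery */
	return text_poke_copy(dst, src, len);
}

void arch_invalidate_exec(void *ptr, size_t len)
{
	/* overwrite freed text with INT3 so stale references trap */
	text_poke_set(ptr, 0xcc, len);
}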

Changes RFC v1 => RFC v2:
1. Major rewrite of the logic of vmalloc_exec and vfree_exec. They now
   work fine with BPF programs (patches 1, 2, 4), but the module side
   (patch 3) still needs some work.

Song Liu (5):
  vmalloc: introduce vmalloc_exec, vfree_exec, and vcopy_exec
  x86/alternative: support vmalloc_exec() and vfree_exec()
  bpf: use vmalloc_exec for bpf program and bpf dispatcher
  vmalloc: introduce register_text_tail_vm()
  x86: use register_text_tail_vm

 arch/x86/include/asm/pgtable_64_types.h |   1 +
 arch/x86/kernel/alternative.c           |  12 +
 arch/x86/mm/init_64.c                   |   4 +-
 arch/x86/net/bpf_jit_comp.c             |  23 +-
 include/linux/bpf.h                     |   3 -
 include/linux/filter.h                  |   5 -
 include/linux/vmalloc.h                 |   9 +
 kernel/bpf/core.c                       | 180 +-----------
 kernel/bpf/dispatcher.c                 |  11 +-
 mm/nommu.c                              |   7 +
 mm/vmalloc.c                            | 351 ++++++++++++++++++++++++
 11 files changed, 404 insertions(+), 202 deletions(-)

--
2.30.2

Comments

Christoph Hellwig Nov. 1, 2022, 11:26 a.m. UTC | #1
On Mon, Oct 31, 2022 at 03:25:36PM -0700, Song Liu wrote:
> This set enables bpf programs and bpf dispatchers to share huge pages with
> new API:
>   vmalloc_exec()
>   vfree_exec()
>   vcopy_exec()

Maybe it's just me, but I don't like the names very much.  They imply
a slight extension to the vmalloc API, but while they use the vmalloc
mechanisms internally, the API is actually quite different.

So why not something like:

   execmem_alloc
   execmem_free
   execmem_fill or execmem_set or copy_to_execmem

?
Song Liu Nov. 1, 2022, 3:10 p.m. UTC | #2
On Tue, Nov 1, 2022 at 4:26 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Oct 31, 2022 at 03:25:36PM -0700, Song Liu wrote:
> > This set enables bpf programs and bpf dispatchers to share huge pages with
> > new API:
> >   vmalloc_exec()
> >   vfree_exec()
> >   vcopy_exec()
>
> Maybe it's just me, but I don't like the names very much.  They imply
> a slight extension to the vmalloc API, but while they use the vmalloc
> mechanisms internally, the API is actually quite different.
>
> So why not something like:
>
>    execmem_alloc
>    execmem_free
>    execmem_fill or execmem_set or copy_to_execmem
>
> ?

I don't have a strong preference on names. We can change the name
to whatever we agree on.

Thanks,
Song
Luis Chamberlain Nov. 2, 2022, 8:45 p.m. UTC | #3
On Mon, Oct 31, 2022 at 03:25:36PM -0700, Song Liu wrote:
> Changes RFC v2 => PATCH v1:

<-- snip -->

> 3. Drop changes for kernel modules and focus on BPF side changes.

Yeah, because in my testing of RFC v2 things blow up pretty hard; I
haven't been able to successfully boot *any* system with those patches,
virtual or bare metal.

I'm very happy to help test new iterations if you have anything.

  Luis
Rick Edgecombe Nov. 2, 2022, 10:29 p.m. UTC | #4
On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> This set enables bpf programs and bpf dispatchers to share huge pages
> with a new API:
>   vmalloc_exec()
>   vfree_exec()
>   vcopy_exec()
> 
> The idea is similar to Peter's suggestion in [1].
> 
> vmalloc_exec() manages a set of PMD_SIZE RO+X memory pages and
> allocates this memory to its users. vfree_exec() frees memory
> allocated by vmalloc_exec(). vcopy_exec() updates memory allocated by
> vmalloc_exec().
> 
> Memory allocated by vmalloc_exec() is RO+X, so this does not violate
> W^X. The caller has to update the content with a text_poke()-like
> mechanism. Specifically, vcopy_exec() is provided to update memory
> allocated by vmalloc_exec(); it also makes sure the update stays
> within the boundary of one chunk allocated by vmalloc_exec(). Please
> refer to patch 1/5 for more details.
> 
> Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> 
> Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to
> share PMD_SIZE pages with dynamic kernel text on x86_64. This is
> achieved by allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE),
> and then using the range from _etext to roundup(_etext, PMD_SIZE) for
> dynamic kernel text.

It might help to spell out what the benefits of this are. My
understanding is that (to my surprise) we actually haven't seen a
performance improvement from using 2MB pages for JITs. The main
performance benefit you saw on your previous version was from reduced
fragmentation of the direct map IIUC. This comes from reusing the same
pages for JITs, so that new direct map pages don't need to be broken.

The other benefit of this thing is reduced TLB shootdowns. It can load
a JIT with, on average, only about a local TLB flush, which should help
systems with really high CPU counts by some unknown amount.
Song Liu Nov. 3, 2022, 9:13 p.m. UTC | #5
On Wed, Nov 2, 2022 at 3:30 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Mon, 2022-10-31 at 15:25 -0700, Song Liu wrote:
> > This set enables bpf programs and bpf dispatchers to share huge pages
> > with a new API:
> >   vmalloc_exec()
> >   vfree_exec()
> >   vcopy_exec()
> >
> > The idea is similar to Peter's suggestion in [1].
> >
> > vmalloc_exec() manages a set of PMD_SIZE RO+X memory pages and
> > allocates this memory to its users. vfree_exec() frees memory
> > allocated by vmalloc_exec(). vcopy_exec() updates memory allocated by
> > vmalloc_exec().
> >
> > Memory allocated by vmalloc_exec() is RO+X, so this does not violate
> > W^X. The caller has to update the content with a text_poke()-like
> > mechanism. Specifically, vcopy_exec() is provided to update memory
> > allocated by vmalloc_exec(); it also makes sure the update stays
> > within the boundary of one chunk allocated by vmalloc_exec(). Please
> > refer to patch 1/5 for more details.
> >
> > Patch 3/5 uses these new APIs in bpf program and bpf dispatcher.
> >
> > Patches 4/5 and 5/5 allow static kernel text (_stext to _etext) to
> > share PMD_SIZE pages with dynamic kernel text on x86_64. This is
> > achieved by allocating PMD_SIZE pages up to roundup(_etext, PMD_SIZE),
> > and then using the range from _etext to roundup(_etext, PMD_SIZE) for
> > dynamic kernel text.
>
> It might help to spell out what the benefits of this are. My
> understanding is that (to my surprise) we actually haven't seen a
> performance improvement from using 2MB pages for JITs. The main
> performance benefit you saw on your previous version was from reduced
> fragmentation of the direct map IIUC. This comes from reusing the same
> pages for JITs, so that new direct map pages don't need to be broken.
>
> The other benefit of this thing is reduced TLB shootdowns. It can load
> a JIT with, on average, only about a local TLB flush, which should help
> systems with really high CPU counts by some unknown amount.

Thanks for pointing out the missing information.

I don't have a benchmark that uses very big BPF programs, so the
results I have don't show much benefit from fewer iTLB misses.

Song