mbox series

[RFC,0/5] vmalloc_exec for modules and BPF programs

Message ID 20220818224218.2399791-1-song@kernel.org (mailing list archive)
Headers show
Series vmalloc_exec for modules and BPF programs | expand

Message

Song Liu Aug. 18, 2022, 10:42 p.m. UTC
This set is a prototype that allows dynamic kernel text (modules, bpf
programs, various trampolines, etc.) to share huge pages. The idea is
similar to Peter's suggestion in [1]. Please refer to each patch for
more detais.

The ultimate goal is to only host kernel text in 2MB pages (for x86_64).

Please share your comments on this.

Thanks!

[1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Song Liu (5):
  vmalloc: introduce vmalloc_exec and vfree_exec
  bpf: use vmalloc_exec
  modules, x86: use vmalloc_exec for module core
  vmalloc_exec: share a huge page with kernel text
  vmalloc: vfree_exec: free unused vm_struct

 arch/x86/Kconfig              |   1 +
 arch/x86/kernel/alternative.c |  30 ++++-
 arch/x86/kernel/module.c      |   1 +
 arch/x86/mm/init_64.c         |   3 +-
 include/linux/vmalloc.h       |  16 +--
 kernel/bpf/core.c             | 155 ++------------------------
 kernel/module/main.c          |  23 ++--
 kernel/module/strict_rwx.c    |   3 -
 kernel/trace/ftrace.c         |   3 +-
 mm/nommu.c                    |   7 ++
 mm/vmalloc.c                  | 200 +++++++++++++++++++++++++++++-----
 11 files changed, 239 insertions(+), 203 deletions(-)

--
2.30.2

Comments

Song Liu Aug. 22, 2022, 3:46 p.m. UTC | #1
> On Aug 18, 2022, at 3:42 PM, Song Liu <song@kernel.org> wrote:
> 
> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.
> 
> The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
> 
> Please share your comments on this.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@hirez.programming.kicks-ass.net/

Hi Peter, 

Could you please share your feedback on this? 

Thanks,
Song

PS: I guess vger dropped my patch again. :( The set is also available at

https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git 

branch vmalloc_exec. 

> 
> Song Liu (5):
>  vmalloc: introduce vmalloc_exec and vfree_exec
>  bpf: use vmalloc_exec
>  modules, x86: use vmalloc_exec for module core
>  vmalloc_exec: share a huge page with kernel text
>  vmalloc: vfree_exec: free unused vm_struct
> 
> arch/x86/Kconfig              |   1 +
> arch/x86/kernel/alternative.c |  30 ++++-
> arch/x86/kernel/module.c      |   1 +
> arch/x86/mm/init_64.c         |   3 +-
> include/linux/vmalloc.h       |  16 +--
> kernel/bpf/core.c             | 155 ++------------------------
> kernel/module/main.c          |  23 ++--
> kernel/module/strict_rwx.c    |   3 -
> kernel/trace/ftrace.c         |   3 +-
> mm/nommu.c                    |   7 ++
> mm/vmalloc.c                  | 200 +++++++++++++++++++++++++++++-----
> 11 files changed, 239 insertions(+), 203 deletions(-)
> 
> --
> 2.30.2
Peter Zijlstra Aug. 22, 2022, 4:34 p.m. UTC | #2
On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
> Could you please share your feedback on this? 

I've looked at it all of 5 minutes, so perhaps I've missed something.

However, I'm a little surprised you went with a second tree instead of
doing the top-down thing for data. The way you did it makes it hard to
have guard pages between text and data.
Song Liu Aug. 22, 2022, 4:56 p.m. UTC | #3
> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>> Could you please share your feedback on this? 
> 
> I've looked at it all of 5 minutes, so perhaps I've missed something.
> 
> However, I'm a little surprised you went with a second tree instead of
> doing the top-down thing for data. The way you did it makes it hard to
> have guard pages between text and data.

I didn't realize the importance of the guard pages. But it is not too
hard to do it with this approach. For each 2MB text page, we can reserve
4kB on the beginning and end of it. Would this work?

There are a couple benefits from a second tree:

1. It allows text allocations to go below PAGE_SIZE granularity, while 
   data allocations would still use PAGE_SIZE granularity, which is the
   same as current code. 
2. Text allocate requires mapping one vm_struct to many vmap_area. Putting
   text allocations in a separate tree make it easier to handle this. 
   (Well, I haven't finished this logic yet). 
3. A separate tree makes it easier to use text tail page, 
   [_etext, roundup(_etext, PMD_SIZE)], for modules and BPF programs. 

Does this make sense? Do you see other downsides with a second tree?

Thanks,
Song
Peter Zijlstra Aug. 23, 2022, 5:42 a.m. UTC | #4
On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
> 
> 
> > On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
> >> Could you please share your feedback on this? 
> > 
> > I've looked at it all of 5 minutes, so perhaps I've missed something.
> > 
> > However, I'm a little surprised you went with a second tree instead of
> > doing the top-down thing for data. The way you did it makes it hard to
> > have guard pages between text and data.
> 
> I didn't realize the importance of the guard pages. But it is not too

I'm not sure how important it is, just seems like a good idea to trap
anybody trying to cross that divide. Also, to me it seems like a good
idea to have a single large contiguous text region instead of splintered
2M pages.

> hard to do it with this approach. For each 2MB text page, we can reserve
> 4kB on the beginning and end of it. Would this work?

Typically a guard page has different protections (as in none what so
ever) so that every access goes *splat*.
Christophe Leroy Aug. 23, 2022, 6:39 a.m. UTC | #5
Le 23/08/2022 à 07:42, Peter Zijlstra a écrit :
> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>>
>>
>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>>> Could you please share your feedback on this?
>>>
>>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>>
>>> However, I'm a little surprised you went with a second tree instead of
>>> doing the top-down thing for data. The way you did it makes it hard to
>>> have guard pages between text and data.
>>
>> I didn't realize the importance of the guard pages. But it is not too
> 
> I'm not sure how important it is, just seems like a good idea to trap
> anybody trying to cross that divide. Also, to me it seems like a good
> idea to have a single large contiguous text region instead of splintered
> 2M pages.
> 
>> hard to do it with this approach. For each 2MB text page, we can reserve
>> 4kB on the beginning and end of it. Would this work?
> 
> Typically a guard page has different protections (as in none what so
> ever) so that every access goes *splat*. >

Text is RO-X, on some architectures even only X. So the only real thing 
to protect against is bad execution, isn't it ?. So I guess having some 
areas with invalid or trap instructions would be enough ?
Song Liu Aug. 23, 2022, 6:55 a.m. UTC | #6
> On Aug 22, 2022, at 10:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>> 
>> 
>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>>> Could you please share your feedback on this? 
>>> 
>>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>> 
>>> However, I'm a little surprised you went with a second tree instead of
>>> doing the top-down thing for data. The way you did it makes it hard to
>>> have guard pages between text and data.
>> 
>> I didn't realize the importance of the guard pages. But it is not too
> 
> I'm not sure how important it is, just seems like a good idea to trap
> anybody trying to cross that divide. Also, to me it seems like a good
> idea to have a single large contiguous text region instead of splintered
> 2M pages.

A single large contiguous text region is great. However, it is not easy to
keep it contiguous. For example, when we load a big module, and then unload
it. It is not easy to recycle the space. Say we load module-x-v1, which is 
4MB, and uses 2 huge pages. Then we load a small BPF program after it. The 
address space looks like:

MODULE_VADDR to MODULE_VADDR + 4MB:			module-x-v1
MODULE_VADDR + 4MB to MODULE_VADDR + 4MB + 4kB:		bpf_prog_xxxx

When we unload module-x-v1, there will be 4MB hole in the address space. 
If we then load module-x-v2, which is 4.1MB in size, we cannot reuse that
hole, because the module is a little too big for the hole. 

AFAICT, to use the space efficiently, we will have to deal with splintered
2MB pages. 

Does this make sense?

Thanks,
Song

> 
>> hard to do it with this approach. For each 2MB text page, we can reserve
>> 4kB on the beginning and end of it. Would this work?
> 
> Typically a guard page has different protections (as in none what so
> ever) so that every access goes *splat*.
Song Liu Aug. 23, 2022, 6:57 a.m. UTC | #7
> On Aug 22, 2022, at 11:39 PM, Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> 
> 
> 
> Le 23/08/2022 à 07:42, Peter Zijlstra a écrit :
>> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>>> 
>>> 
>>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> 
>>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>>>> Could you please share your feedback on this?
>>>> 
>>>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>>> 
>>>> However, I'm a little surprised you went with a second tree instead of
>>>> doing the top-down thing for data. The way you did it makes it hard to
>>>> have guard pages between text and data.
>>> 
>>> I didn't realize the importance of the guard pages. But it is not too
>> 
>> I'm not sure how important it is, just seems like a good idea to trap
>> anybody trying to cross that divide. Also, to me it seems like a good
>> idea to have a single large contiguous text region instead of splintered
>> 2M pages.
>> 
>>> hard to do it with this approach. For each 2MB text page, we can reserve
>>> 4kB on the beginning and end of it. Would this work?
>> 
>> Typically a guard page has different protections (as in none what so
>> ever) so that every access goes *splat*. >
> 
> Text is RO-X, on some architectures even only X. So the only real thing 
> to protect against is bad execution, isn't it ?. So I guess having some 
> areas with invalid or trap instructions would be enough ?

Agreed that filling with trap instructions should be enough. 

Thanks,
Song
Song Liu Aug. 24, 2022, 5:06 p.m. UTC | #8
Hi Peter, 

> On Aug 22, 2022, at 9:56 AM, Song Liu <songliubraving@fb.com> wrote:
> 
>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>> Could you please share your feedback on this? 
>> 
>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>> 
>> However, I'm a little surprised you went with a second tree instead of
>> doing the top-down thing for data. The way you did it makes it hard to
>> have guard pages between text and data.
> 
> I didn't realize the importance of the guard pages. But it is not too
> hard to do it with this approach. For each 2MB text page, we can reserve
> 4kB on the beginning and end of it. Would this work?
> 
> There are a couple benefits from a second tree:
> 
> 1. It allows text allocations to go below PAGE_SIZE granularity, while 
>   data allocations would still use PAGE_SIZE granularity, which is the
>   same as current code. 
> 2. Text allocate requires mapping one vm_struct to many vmap_area. Putting
>   text allocations in a separate tree make it easier to handle this. 
>   (Well, I haven't finished this logic yet). 
> 3. A separate tree makes it easier to use text tail page, 
>   [_etext, roundup(_etext, PMD_SIZE)], for modules and BPF programs. 
> 
> Does this make sense? Do you see other downsides with a second tree?

Did these make sense? Do you have future comments that I would address in 
future versions?

Thanks,
Song