[v6,bpf-next,0/5] bpf_prog_pack followup

Message ID 20220707223546.4124919-1-song@kernel.org
Message

Song Liu July 7, 2022, 10:35 p.m. UTC
This set is the second half of v4 [1].

Changes v5 => v6:
1. Rebase and extend CC list.

Changes v4 => v5:
1. Rebase and resolve conflicts due to module.c split.
2. Update experiment results (below).

For our web service production benchmark, bpf_prog_pack on 4kB pages
gives 0.5% to 0.7% more throughput than not using bpf_prog_pack.
bpf_prog_pack on 2MB pages gives 0.6% to 0.9% more throughput than not
using bpf_prog_pack. Note that 0.5% is a huge improvement for our fleet. I
believe this is also significant for other companies with many thousands
of servers.

Update: Further experiments (suggested by Rick Edgecombe) showed that most
of the benefit on the web service benchmark came from reduced direct map
fragmentation. The experiment is as follows:

Side A: 2MB bpf prog pack on a single 2MB page;
Side B: 2MB bpf prog pack on 512x 4kB pages;

The system only uses about 200kB for BPF programs, but 2MB is allocated
for bpf_prog_pack (for both A and B). Therefore, direct map fragmentation
caused by BPF programs is eliminated, and we are only measuring the
performance difference of 1x 2MB page vs. ~50x 4kB pages (we only use
about 50 of the 512 pages). For these two sides, the difference in
system throughput is within the noise. I also measured iTLB-load-misses
caused by BPF programs: ~300/s for side A, and ~1600/s for side B.
The overall iTLB-load-miss rate is about 1.5M/s on these hosts. Therefore,
we can clearly see that the 2MB page reduces iTLB misses, but the
difference is not enough to have a visible impact on system throughput.

Of course, the impact of iTLB misses will be more significant for systems
with more BPF programs loaded.

[1] https://lore.kernel.org/bpf/20220520235758.1858153-1-song@kernel.org/

Song Liu (5):
  module: introduce module_alloc_huge
  bpf: use module_alloc_huge for bpf_prog_pack
  vmalloc: WARN for set_vm_flush_reset_perms() on huge pages
  vmalloc: introduce huge_vmalloc_supported
  bpf: simplify select_bpf_prog_pack_size

 arch/x86/kernel/module.c     | 21 +++++++++++++++++++++
 include/linux/moduleloader.h |  5 +++++
 include/linux/vmalloc.h      |  7 +++++++
 kernel/bpf/core.c            | 25 ++++++++++---------------
 kernel/module/main.c         |  8 ++++++++
 mm/vmalloc.c                 |  5 +++++
 6 files changed, 56 insertions(+), 15 deletions(-)

--
2.30.2

Comments

Luis Chamberlain July 7, 2022, 10:59 p.m. UTC | #1
On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> This set is the second half of v4 [1].
> 
> Changes v5 => v6:
> 1. Rebase and extend CC list.

Why post a new iteration so soon without completing the discussion we
had? It seems like we were at least going somewhere. If it's just
to include mm as I requested, sure, that's fine, but this does not
provide context as to what we last were talking about.

  Luis
Song Liu July 7, 2022, 11:52 p.m. UTC | #2
> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>> This set is the second half of v4 [1].
>> 
>> Changes v5 => v6:
>> 1. Rebase and extend CC list.
> 
> Why post a new iteration so soon without completing the discussion we
> had? It seems like we were at least going somewhere. If it's just
> to include mm as I requested, sure, that's fine, but this does not
> provide context as to what we last were talking about.

Sorry for sending v6 too soon. The primary reason was to extend the CC
list and add it back to patchwork (v5 somehow got archived). 

Also, I think vmalloc_exec_ work would be a separate project, while this 
set is the followup work of bpf_prog_pack. Does this make sense? 

Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
more efficient to discuss this in person. 

Thanks,
Song
Luis Chamberlain July 8, 2022, 12:53 a.m. UTC | #3
On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> > On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> >> This set is the second half of v4 [1].
> >> 
> >> Changes v5 => v6:
> >> 1. Rebase and extend CC list.
> > 
> > Why post a new iteration so soon without completing the discussion we
> > had? It seems like we were at least going somewhere. If it's just
> > to include mm as I requested, sure, that's fine, but this does not
> > provide context as to what we last were talking about.
> 
> Sorry for sending v6 too soon. The primary reason was to extend the CC
> list and add it back to patchwork (v5 somehow got archived). 
> 
> Also, I think vmalloc_exec_ work would be a separate project, while this 
> set is the followup work of bpf_prog_pack. Does this make sense? 
> 
> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
> more efficient to discuss this in person. 

What we need is input from mm / arch folks. What is not done here is
spelling out what we were last talking about, so mm folks can't guess. My
preference is to address that.

I don't think in-person discussion is needed if the only folks
discussing this topic so far are just you and me.

  Luis
Song Liu July 8, 2022, 1:36 a.m. UTC | #4
> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>>>> This set is the second half of v4 [1].
>>>> 
>>>> Changes v5 => v6:
>>>> 1. Rebase and extend CC list.
>>> 
>>> Why post a new iteration so soon without completing the discussion we
>>> had? It seems like we were at least going somewhere. If it's just
>>> to include mm as I requested, sure, that's fine, but this does not
>>> provide context as to what we last were talking about.
>> 
>> Sorry for sending v6 too soon. The primary reason was to extend the CC
>> list and add it back to patchwork (v5 somehow got archived). 
>> 
>> Also, I think vmalloc_exec_ work would be a separate project, while this 
>> set is the followup work of bpf_prog_pack. Does this make sense? 
>> 
>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
>> more efficient to discuss this in person. 
> 
> What we need is input from mm / arch folks. What is not done here is
> what that stuff we're talking about is and so mm folks can't guess. My
> preference is to address that.
> 
> I don't think in person discussion is needed if the only folks
> discussing this topic so far is just you and me.

How about we start a thread with mm / arch folks for the vmalloc_exec_*
topic? I will summarize previous discussions and include pointers to 
these discussions. If necessary, we can continue the discussion at LPC.

OTOH, I guess the outcome of that discussion should not change this set? 
If we have concern about module_alloc_huge(), maybe we can have bpf code 
call vmalloc directly (until we have vmalloc_exec_)? 

What do you think about this plan?

Thanks,
Song
Luis Chamberlain July 8, 2022, 3:58 p.m. UTC | #5
On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
> 
> 
> > On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> >>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>> 
> >>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> >>>> This set is the second half of v4 [1].
> >>>> 
> >>>> Changes v5 => v6:
> >>>> 1. Rebase and extend CC list.
> >>> 
> >>> Why post a new iteration so soon without completing the discussion we
> >>> had? It seems like we were at least going somewhere. If it's just
> >>> to include mm as I requested, sure, that's fine, but this does not
> >>> provide context as to what we last were talking about.
> >> 
> >> Sorry for sending v6 too soon. The primary reason was to extend the CC
> >> list and add it back to patchwork (v5 somehow got archived). 
> >> 
> >> Also, I think vmalloc_exec_ work would be a separate project, while this 
> >> set is the followup work of bpf_prog_pack. Does this make sense? 
> >> 
> >> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
> >> more efficient to discuss this in person. 
> > 
> > What we need is input from mm / arch folks. What is not done here is
> > what that stuff we're talking about is and so mm folks can't guess. My
> > preference is to address that.
> > 
> > I don't think in person discussion is needed if the only folks
> > discussing this topic so far is just you and me.
> 
> How about we start a thread with mm / arch folks for the vmalloc_exec_*
> topic? I will summarize previous discussions and include pointers to 
> these discussions. If necessary, we can continue the discussion at LPC.

This sounds like a nice thread to use as this is why we are talking
about that topic.

> OTOH, I guess the outcome of that discussion should not change this set? 

If the above is done right then actually I think it would show similar
considerations for a respective free for module_alloc_huge().

> If we have concern about module_alloc_huge(), maybe we can have bpf code 
> call vmalloc directly (until we have vmalloc_exec_)? 

You'd still need to open code, in a similar way, the same things
which we are trying to reach consensus on.

> What do you think about this plan?

I think we should strive to not be lazy and sloppy, and prevent growth
of sloppy code. So long as we do that I think this is all reasonable.

  Luis
Song Liu July 8, 2022, 7:58 p.m. UTC | #6
> On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
>>>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>>> 
>>>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>>>>>> This set is the second half of v4 [1].
>>>>>> 
>>>>>> Changes v5 => v6:
>>>>>> 1. Rebase and extend CC list.
>>>>> 
>>>>> Why post a new iteration so soon without completing the discussion we
>>>>> had? It seems like we were at least going somewhere. If it's just
>>>>> to include mm as I requested, sure, that's fine, but this does not
>>>>> provide context as to what we last were talking about.
>>>> 
>>>> Sorry for sending v6 too soon. The primary reason was to extend the CC
>>>> list and add it back to patchwork (v5 somehow got archived). 
>>>> 
>>>> Also, I think vmalloc_exec_ work would be a separate project, while this 
>>>> set is the followup work of bpf_prog_pack. Does this make sense? 
>>>> 
>>>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
>>>> more efficient to discuss this in person. 
>>> 
>>> What we need is input from mm / arch folks. What is not done here is
>>> what that stuff we're talking about is and so mm folks can't guess. My
>>> preference is to address that.
>>> 
>>> I don't think in person discussion is needed if the only folks
>>> discussing this topic so far is just you and me.
>> 
>> How about we start a thread with mm / arch folks for the vmalloc_exec_*
>> topic? I will summarize previous discussions and include pointers to 
>> these discussions. If necessary, we can continue the discussion at LPC.
> 
> This sounds like a nice thread to use as this is why we are talking
> about that topic.
> 
>> OTOH, I guess the outcome of that discussion should not change this set? 
> 
> If the above is done right then actually I think it would show similar
> considerations for a respective free for module_alloc_huge().
> 
>> If we have concern about module_alloc_huge(), maybe we can have bpf code 
>> call vmalloc directly (until we have vmalloc_exec_)? 
> 
> You'd need to then still open code in a similar way the same things
> which we are trying to reach consensus on.

>> What do you think about this plan?
> 
> I think we should strive to not be lazy and sloppy, and prevent growth
> of sloppy code. So long as we do that I think this is all reasonable.

Let me try to understand your concerns here. Say we want module code
to be a temporary home for module_alloc_huge() before we move it to mm
code. Would you consider it ready to ship if we:

1) Rename module_alloc_huge as module_alloc_text_huge();
2) Add module_free_text_huge();
3) Move set_memory_* and fill_ill_insn logic into module_alloc_text_huge()
  and module_free_text_huge(). 

Is this the right direction? Did I miss anything important?
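
For concreteness, here is a minimal sketch of the interfaces 1) and 2)
would introduce; the signatures are illustrative only, not an existing
kernel API:

```c
/*
 * Illustrative prototypes only (per points 1-3 above), not actual
 * kernel interfaces. The free side takes the size so a shared huge
 * page can tell which range is being released.
 */
void *module_alloc_text_huge(unsigned long size);
void module_free_text_huge(void *ptr, unsigned long size);
```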

Thanks,
Song
Luis Chamberlain July 8, 2022, 10:24 p.m. UTC | #7
On Fri, Jul 08, 2022 at 07:58:44PM +0000, Song Liu wrote:
> 
> 
> > On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
> >> 
> >> 
> >>> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>> 
> >>> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
> >>>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >>>>> 
> >>>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
> >>>>>> This set is the second half of v4 [1].
> >>>>>> 
> >>>>>> Changes v5 => v6:
> >>>>>> 1. Rebase and extend CC list.
> >>>>> 
> >>>>> Why post a new iteration so soon without completing the discussion we
> >>>>> had? It seems like we were at least going somewhere. If it's just
> >>>>> to include mm as I requested, sure, that's fine, but this does not
> >>>>> provide context as to what we last were talking about.
> >>>> 
> >>>> Sorry for sending v6 too soon. The primary reason was to extend the CC
> >>>> list and add it back to patchwork (v5 somehow got archived). 
> >>>> 
> >>>> Also, I think vmalloc_exec_ work would be a separate project, while this 
> >>>> set is the followup work of bpf_prog_pack. Does this make sense? 
> >>>> 
> >>>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
> >>>> more efficient to discuss this in person. 
> >>> 
> >>> What we need is input from mm / arch folks. What is not done here is
> >>> what that stuff we're talking about is and so mm folks can't guess. My
> >>> preference is to address that.
> >>> 
> >>> I don't think in person discussion is needed if the only folks
> >>> discussing this topic so far is just you and me.
> >> 
> >> How about we start a thread with mm / arch folks for the vmalloc_exec_*
> >> topic? I will summarize previous discussions and include pointers to 
> >> these discussions. If necessary, we can continue the discussion at LPC.
> > 
> > This sounds like a nice thread to use as this is why we are talking
> > about that topic.
> > 
> >> OTOH, I guess the outcome of that discussion should not change this set? 
> > 
> > If the above is done right then actually I think it would show similar
> > considerations for a respective free for module_alloc_huge().
> > 
> >> If we have concern about module_alloc_huge(), maybe we can have bpf code 
> >> call vmalloc directly (until we have vmalloc_exec_)? 
> > 
> > You'd need to then still open code in a similar way the same things
> > which we are trying to reach consensus on.
> 
> >> What do you think about this plan?
> > 
> > I think we should strive to not be lazy and sloppy, and prevent growth
> > of sloppy code. So long as we do that I think this is all reasoanble.
> 
> Let me try to understand your concerns here. Say if we want module code
> to be a temporary home for module_alloc_huge before we move it to mm
> code. Would you think it is ready to ship if we:

Please CC Christoph and linux-modules@vger.kernel.org on future patches
and discussions around this, and all others now CC'd.

> 1) Rename module_alloc_huge as module_alloc_text_huge();

module_alloc_text_huge() is too long, but I've suggested names before
which are short and generic, and also suggested that if modules are
not the only users this needs to go outside of modules and so
vmalloc_text_huge() or whatever.

To do this right it begs the question why we don't do that for the
existing module_alloc(), as the users of this code are well outside of
modules now. Last time a similar generic name was used, all the special
arch stuff was still left to the module code, yet non-modules were
using that allocator. From my perspective the right thing to do is to
deal with all the arch stuff as well in the generic handler, and have
the module code *and* the other users which use module_alloc() today
use that new caller as well.

> 2) Add module_free_text_huge();

Right, we have special handling for how we free this special code for regular
module_alloc() and so similar considerations would be needed here for
the huge stuff.

> 3) Move set_memory_* and fill_ill_insn logic into module_alloc_text_huge()
>   and module_free_text_huge(). 

Yes, that's a bit hairy now, and so a saner and consistent way to do
this would be best.

> Are these on the right direction? Did I miss anything important?

I've also hinted before that another way to help here is to draw
up a simple lib/test_vmalloc_text.c or something like that which would
enable a selftest to ensure correctness of this code on different archs
and maybe even let you do performance analysis using perf [0]. You have
good reasons to move to the huge allocator, and the performance metrics
come from an abstract load; however, perf measurements can also give you
real raw data which you can reproduce, enabling others to do similar
comparisons later.

The last thing I'd ask is to ensure you Cc folks who have already been
in these discussions.

[0] https://lkml.kernel.org/r/Yog+d+oR5TtPp2cs@bombadil.infradead.org

  Luis
Song Liu July 9, 2022, 1:14 a.m. UTC | #8
> On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Fri, Jul 08, 2022 at 07:58:44PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 8, 2022, at 8:58 AM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Fri, Jul 08, 2022 at 01:36:25AM +0000, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Jul 7, 2022, at 5:53 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>>> 
>>>>> On Thu, Jul 07, 2022 at 11:52:58PM +0000, Song Liu wrote:
>>>>>>> On Jul 7, 2022, at 3:59 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>>>>> 
>>>>>>> On Thu, Jul 07, 2022 at 03:35:41PM -0700, Song Liu wrote:
>>>>>>>> This set is the second half of v4 [1].
>>>>>>>> 
>>>>>>>> Changes v5 => v6:
>>>>>>>> 1. Rebase and extend CC list.
>>>>>>> 
>>>>>>> Why post a new iteration so soon without completing the discussion we
>>>>>>> had? It seems like we were at least going somewhere. If it's just
>>>>>>> to include mm as I requested, sure, that's fine, but this does not
>>>>>>> provide context as to what we last were talking about.
>>>>>> 
>>>>>> Sorry for sending v6 too soon. The primary reason was to extend the CC
>>>>>> list and add it back to patchwork (v5 somehow got archived). 
>>>>>> 
>>>>>> Also, I think vmalloc_exec_ work would be a separate project, while this 
>>>>>> set is the followup work of bpf_prog_pack. Does this make sense? 
>>>>>> 
>>>>>> Btw, vmalloc_exec_ work could be a good topic for LPC. It will be much
>>>>>> more efficient to discuss this in person. 
>>>>> 
>>>>> What we need is input from mm / arch folks. What is not done here is
>>>>> what that stuff we're talking about is and so mm folks can't guess. My
>>>>> preference is to address that.
>>>>> 
>>>>> I don't think in person discussion is needed if the only folks
>>>>> discussing this topic so far is just you and me.
>>>> 
>>>> How about we start a thread with mm / arch folks for the vmalloc_exec_*
>>>> topic? I will summarize previous discussions and include pointers to 
>>>> these discussions. If necessary, we can continue the discussion at LPC.
>>> 
>>> This sounds like a nice thread to use as this is why we are talking
>>> about that topic.
>>> 
>>>> OTOH, I guess the outcome of that discussion should not change this set? 
>>> 
>>> If the above is done right then actually I think it would show similar
>>> considerations for a respective free for module_alloc_huge().
>>> 
>>>> If we have concern about module_alloc_huge(), maybe we can have bpf code 
>>>> call vmalloc directly (until we have vmalloc_exec_)? 
>>> 
>>> You'd need to then still open code in a similar way the same things
>>> which we are trying to reach consensus on.
>> 
>>>> What do you think about this plan?
>>> 
>>> I think we should strive to not be lazy and sloppy, and prevent growth
>>> of sloppy code. So long as we do that I think this is all reasonable.
>> 
>> Let me try to understand your concerns here. Say if we want module code
>> to be a temporary home for module_alloc_huge before we move it to mm
>> code. Would you think it is ready to ship if we:
> 
> Please CC Christoph and linux-modules@vger.kernel.org on future patches
> and discussions around this, and all others now CC'd.

Sometimes, vger drops my patch because the CC list is too long. That's 
the reason I often trim the CC list. I will try to keep folks in this
thread CC'ed. 

> 
>> 1) Rename module_alloc_huge as module_alloc_text_huge();
> 
> module_alloc_text_huge() is too long, but I've suggested names before
> which are short and generic, and also suggested that if modules are
> not the only users this needs to go outside of modules and so
> vmalloc_text_huge() or whatever.
> 
> To do this right it begs the question why we don't do that for the
> existing module_alloc(), as the users of this code is well outside of
> modules now. Last time a similar generic name was used all the special
> arch stuff was left to be done by the module code still, but still
> non-modules were still using that allocator. From my perspective the
> right thing to do is to deal with all the arch stuff as well in the
> generic handler, and have the module code *and* the other users which
> use module_alloc() to use that new caller as well.

The key difference between module_alloc() and the new API is that the
new API will return RO+X memory, and the user needs a text_poke()-like
API to modify this buffer. Archs that do not support text_poke() will
not be able to use the new API. Does this sound like a reasonable design?

> 
>> 2) Add module_free_text_huge();
> 
> Right, we have special handling for how we free this special code for regular
> module_alloc() and so similar considerations would be needed here for
> the huge stuff.
> 
>> 3) Move set_memory_* and fill_ill_insn logic into module_alloc_text_huge()
>> and module_free_text_huge(). 
> 
> Yes, that's a bit hairy now, and so a saner and consistent way to do
> this would be best.

Thanks for this information. I will try to go in this direction. 

> 
>> Are these on the right direction? Did I miss anything important?
> 
> I've also hinted before that another way to help here is to have draw
> up a simple lib/test_vmalloc_text.c or something like that which would
> enable a selftest to ensure correctness of this code on different archs
> and maybe even let you do performance analysis using perf [0]. You have
> good reasons to move to the huge allocator and the performance metrics
> are an abstract load, however perf measurements can also give you real
> raw data which you can reproduce and enable others to do similar
> comparisons later.
> 
> The last thing I'd ask is just ensure you Cc folks who have already been in
> these discussions.
> 
> [0] https://lkml.kernel.org/r/Yog+d+oR5TtPp2cs@bombadil.infradead.org

Let me see how we can test it. 

Thanks,
Song
Luis Chamberlain July 12, 2022, 4:18 a.m. UTC | #9
On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:
> > On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> >> 1) Rename module_alloc_huge as module_alloc_text_huge();
> > 
> > module_alloc_text_huge() is too long, but I've suggested names before
> > which are short and generic, and also suggested that if modules are
> > not the only users this needs to go outside of modules and so
> > vmalloc_text_huge() or whatever.
> > 
> > To do this right it begs the question why we don't do that for the
> > existing module_alloc(), as the users of this code is well outside of
> > modules now. Last time a similar generic name was used all the special
> > arch stuff was left to be done by the module code still, but still
> > non-modules were still using that allocator. From my perspective the
> > right thing to do is to deal with all the arch stuff as well in the
> > generic handler, and have the module code *and* the other users which
> > use module_alloc() to use that new caller as well.
> 
> The key difference between module_alloc() and the new API is that the 
> API will return RO+X memory, and the user need text-poke like API to
> modify this buffer. Archs that do not support text-poke will not be
> able to use the new API. Does this sound like a reasonable design?

I'm adding kprobe + ftrace folks.

I can't see why we need to *require* text_poke for just a
module_alloc_huge(). Enhancements on module_alloc() are just
enhancements, not requirements. So we have these for instance:

``` from arch/Kconfig
config ARCH_OPTIONAL_KERNEL_RWX
	def_bool n

config ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	def_bool n

config ARCH_HAS_STRICT_KERNEL_RWX
	def_bool n

config STRICT_KERNEL_RWX
	bool "Make kernel text and rodata read-only" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_KERNEL_RWX
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, kernel text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. executing the heap
	  or modifying text)

	  These features are considered standard security practice these days.
	  You should say Y here in almost all cases.

config ARCH_HAS_STRICT_MODULE_RWX
	def_bool n

config STRICT_MODULE_RWX
	bool "Set loadable kernel module data as NX and text as RO" if ARCH_OPTIONAL_KERNEL_RWX
	depends on ARCH_HAS_STRICT_MODULE_RWX && MODULES
	default !ARCH_OPTIONAL_KERNEL_RWX || ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
	help
	  If this is set, module text and rodata memory will be made read-only,
	  and non-text memory will be made non-executable. This provides
	  protection against certain security exploits (e.g. writing to text)
```

With module_alloc() we have the above symbols to tell us when we *can*
support strict module rwx. So the way the kernel's modules are allocated
and used is:

for each module section:
	module_alloc()
module_enable_ro()
module_enable_nx()
module_enable_x()

The above can be read in the code as:

load_module() -->
	layout_and_allocate()
	complete_formation()

Then there is the consideration of set_vm_flush_reset_perms() for
freeing. In the module code we use this for the RO+X stuff (core_layout,
init_layout), but now that is a bit obfuscated due to the placement of
the call. It would seem the other users use it for the same:

 * ebpf
 * kprobes
 * ftrace
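
That idiom looks roughly like this today (a rough sketch of the common
pattern, not verbatim from any of those users; alloc_rox() is a made-up
name, and error handling is trimmed):

```c
/*
 * Sketch of the module_alloc() + set_vm_flush_reset_perms() idiom
 * that the users above follow; a dedicated RO+X allocator would
 * absorb these steps behind a single call.
 */
static void *alloc_rox(const void *code, unsigned long size)
{
	void *buf = module_alloc(size);

	if (!buf)
		return NULL;
	memcpy(buf, code, size);		/* buffer is still RW here */
	set_vm_flush_reset_perms(buf);		/* reset perms when vfree()'d */
	set_memory_ro((unsigned long)buf, size >> PAGE_SHIFT);
	set_memory_x((unsigned long)buf, size >> PAGE_SHIFT);
	return buf;
}
```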

I believe you are mentioning requiring text_poke() because the way
eBPF code uses module_alloc() is different. Correct me if I'm
wrong, but from what I gather, you use text_poke_copy() as the data
is already RO+X, contrary to module_alloc() use cases. You do this since your
bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
module_alloc() and before you can use this memory. This is a different type
of allocator. And, again, please correct me if I'm wrong, but now you want to
share *one* 2 MiB huge-page for multiple BPF programs to help with the
impact of TLB misses.

A vmalloc_ro_exec() by definition would imply a text_poke().

Can kprobes, ftrace and modules use it too? It would be nice
not to have to deal with the loose semantics that require the user
to call set_vm_flush_reset_perms() on RO+X memory later, but
I think this can be addressed separately on a case-by-case basis.

But a vmalloc_ro_exec() with a respective free can remove the
requirement to do set_vm_flush_reset_perms().

  Luis
Luis Chamberlain July 12, 2022, 4:24 a.m. UTC | #10
On Mon, Jul 11, 2022 at 09:18:53PM -0700, Luis Chamberlain wrote:
> A vmalloc_ro_exec() by definition would imply a text_poke().
> 
> Can kprobes, ftrace and modules use it too? It would be nice
> so to not have to deal with the loose semantics on the user to
> have to use set_vm_flush_reset_perms() on ro+x later, but
> I think this can be addressed separately on a case by case basis.

And who knows, if they can, and can also share a huge page allocator
then they may also share similar performance improvements.

  Luis
Song Liu July 12, 2022, 5:49 a.m. UTC | #11
> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Sat, Jul 09, 2022 at 01:14:23AM +0000, Song Liu wrote:
>>> On Jul 8, 2022, at 3:24 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>>> 1) Rename module_alloc_huge as module_alloc_text_huge();
>>> 
>>> module_alloc_text_huge() is too long, but I've suggested names before
>>> which are short and generic, and also suggested that if modules are
>>> not the only users this needs to go outside of modules and so
>>> vmalloc_text_huge() or whatever.
>>> 
>>> To do this right it begs the question why we don't do that for the
>>> existing module_alloc(), as the users of this code is well outside of
>>> modules now. Last time a similar generic name was used all the special
>>> arch stuff was left to be done by the module code still, but still
>>> non-modules were still using that allocator. From my perspective the
>>> right thing to do is to deal with all the arch stuff as well in the
>>> generic handler, and have the module code *and* the other users which
>>> use module_alloc() to use that new caller as well.
>> 
>> The key difference between module_alloc() and the new API is that the 
>> API will return RO+X memory, and the user need text-poke like API to
>> modify this buffer. Archs that do not support text-poke will not be
>> able to use the new API. Does this sound like a reasonable design?

[...]

> I believe you are mentioning requiring text_poke() because the way
> eBPF code uses the module_alloc() is different. Correct me if I'm
> wrong, but from what I gather is you use the text_poke_copy() as the data
> is already RO+X, contrary module_alloc() use cases. You do this since your
> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> module_alloc() and before you can use this memory. This is a different type
> of allocator. And, again please correct me if I'm wrong but now you want to
> share *one* 2 MiB huge-page for multiple BPF programs to help with the
> impact of TLB misses.

Yes, sharing 1x 2MiB huge page is the main reason to require text_poke(). 
OTOH, 2MiB huge pages without sharing are not really useful. Both kprobes
and ftrace use only a fraction of a 4kB page. Most BPF programs and 
modules cannot fill 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
much value on top of the current module_alloc(). 

> A vmalloc_ro_exec() by definition would imply a text_poke().
> 
> Can kprobes, ftrace and modules use it too? It would be nice
> so to not have to deal with the loose semantics on the user to
> have to use set_vm_flush_reset_perms() on ro+x later, but
> I think this can be addressed separately on a case by case basis.

I am pretty confident that kprobes and ftrace can share huge pages with 
BPF programs. I haven't looked into all the details with modules, but 
given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
possible. 

Once this is done, a regular system (without huge BPF programs or huge
modules) will just use 1x 2MB page for text from modules, ftrace, kprobes, 
and BPF programs. 

> 
> But a vmalloc_ro_exec() with a respective free can remove the
> requirement to do set_vm_flush_reset_perms().

Removing the requirement to call set_vm_flush_reset_perms() is the other
reason to go directly to vmalloc_ro_exec(). 

My current version looks like this:

void *vmalloc_exec(unsigned long size);
void vfree_exec(void *ptr, unsigned int size);

The "ro" is dropped from the name, as there is no rw version of the API. 

The ugly part is @size for vfree_exec(). We need it to share huge 
pages. 

Under the hood, it looks similar to current bpf_prog_pack_alloc
and bpf_prog_pack_free. 
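
As a usage sketch (assuming an arch with text_poke_copy(); load_prog()
and unload_prog() are made-up names, and error handling is trimmed):

```c
/*
 * Hypothetical caller of the API above. vmalloc_exec() hands back
 * RO+X memory, so the buffer can only be filled through a
 * text_poke()-style helper, never written directly.
 */
static void *load_prog(const void *insns, unsigned long len)
{
	void *image = vmalloc_exec(len);

	if (!image)
		return NULL;
	text_poke_copy(image, insns, len);	/* write via text_poke */
	return image;
}

static void unload_prog(void *image, unsigned long len)
{
	vfree_exec(image, len);	/* @size identifies the range in a shared page */
}
```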

Thanks,
Song
Luis Chamberlain July 12, 2022, 7:04 p.m. UTC | #12
On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
> > On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> > I believe you are mentioning requiring text_poke() because the way
> > eBPF code uses the module_alloc() is different. Correct me if I'm
> > wrong, but from what I gather is you use the text_poke_copy() as the data
> > is already RO+X, contrary module_alloc() use cases. You do this since your
> > bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> > module_alloc() and before you can use this memory. This is a different type
> > of allocator. And, again please correct me if I'm wrong but now you want to
> > share *one* 2 MiB huge-page for multiple BPF programs to help with the
> > impact of TLB misses.
> 
> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
> much value on top of current module_alloc(). 

Thanks for the clarification.

> > A vmalloc_ro_exec() by definition would imply a text_poke().
> > 
> > Can kprobes, ftrace and modules use it too? It would be nice
> > so to not have to deal with the loose semantics on the user to
> > have to use set_vm_flush_reset_perms() on ro+x later, but
> > I think this can be addressed separately on a case by case basis.
> 
> I am pretty confident that kprobe and ftrace can share huge pages with 
> BPF programs.

Then wonderful, we know where to go in terms of a new API then as it
can be shared in the future for sure and there are gains.

> I haven't looked into all the details with modules, but 
> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
> possible.

Sure.

> Once this is done, a regular system (without huge BPF program or huge
> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
> and bpf programs. 

That would be nice, if possible; however, modules will likely require
their own thing. On my system I see about 57 MiB used on coresize alone.

lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename | \
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
60001272

And so perhaps we need such a pool size to be configurable.

> > But a vmalloc_ro_exec() with a respective free can remove the
> > requirement to do set_vm_flush_reset_perms().
> 
> Removing the requirement to set_vm_flush_reset_perms() is the other
> reason to go directly to vmalloc_ro_exec(). 

Yes fantastic.

> My current version looks like this:
> 
> void *vmalloc_exec(unsigned long size);
> void vfree_exec(void *ptr, unsigned int size);
> 
> ro is eliminated as there is no rw version of the API. 

Alright.

I am not sure if 2 MiB will suffice given what I mentioned above, and
what to do to ensure this grows at a reasonable pace. Then, for
usage across all architectures, since not all will support text_poke(), we
will want to consider a way to make it easy for users to fall back to
non-huge pages, but that would be up to those users, so we can wait for
that.

> The ugly part is @size for vfree_exec(). We need it to share huge 
> pages. 

I suppose this will become evident during patch review.

> Under the hood, it looks similar to current bpf_prog_pack_alloc
> and bpf_prog_pack_free. 

Groovy.

  Luis
Song Liu July 12, 2022, 11:12 p.m. UTC | #13
> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>> 
>>> I believe you are mentioning requiring text_poke() because the way
>>> eBPF code uses the module_alloc() is different. Correct me if I'm
>>> wrong, but from what I gather is you use the text_poke_copy() as the data
>>> is already RO+X, contrary module_alloc() use cases. You do this since your
>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>> module_alloc() and before you can use this memory. This is a different type
>>> of allocator. And, again please correct me if I'm wrong but now you want to
>>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
>>> impact of TLB misses.
>> 
>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
>> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
>> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>> much value on top of current module_alloc(). 
> 
> Thanks for the clarification.
> 
>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>> 
>>> Can kprobes, ftrace and modules use it too? It would be nice
>>> so to not have to deal with the loose semantics on the user to
>>> have to use set_vm_flush_reset_perms() on ro+x later, but
>>> I think this can be addressed separately on a case by case basis.
>> 
>> I am pretty confident that kprobe and ftrace can share huge pages with 
>> BPF programs.
> 
> Then wonderful, we know where to go in terms of a new API then as it
> can be shared in the future for sure and there are gains.
> 
>> I haven't looked into all the details with modules, but 
>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
>> possible.
> 
> Sure.
> 
>> Once this is done, a regular system (without huge BPF program or huge
>> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
>> and bpf programs. 
> 
> That would be nice, if possible, however modules will require likely its
> own thing, on my system I see about 57 MiB used on coresize alone.
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | \
> 	xargs sudo modinfo | grep filename | \
> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> 60001272
> 
> And so perhaps we need such a pool size to be configurable.
> 
>>> But a vmalloc_ro_exec() with a respective free can remove the
>>> requirement to do set_vm_flush_reset_perms().
>> 
>> Removing the requirement to set_vm_flush_reset_perms() is the other
>> reason to go directly to vmalloc_ro_exec(). 
> 
> Yes fantastic.
> 
>> My current version looks like this:
>> 
>> void *vmalloc_exec(unsigned long size);
>> void vfree_exec(void *ptr, unsigned int size);
>> 
>> ro is eliminated as there is no rw version of the API. 
> 
> Alright.
> 
> I am not sure if 2 MiB will suffice given what I mentioned above, and
> what to do to ensure this grows at a reasonable pace. Then, at least for
> usage for all architectures since not all will support text_poke() we
> will want to consider a way to make it easy to users to use non huge
> page fallbacks, but that would be up to those users, so we can wait for
> that.

We are not limited to 2MiB total. The logic is: 

1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
   these pages are in use. 
3. For objects smaller than 2MiB, we will try to fit them in one of these
   pages. 
   a) If there isn't a page with a large enough contiguous free range, we
      will allocate a new 2MiB page. 

(For a system with n NUMA nodes, multiply the 2MiB pages above by n.) 

So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages. 
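
The bookkeeping would look roughly like bpf_prog_pack's in
kernel/bpf/core.c (a sketch; the struct and field names here are
illustrative, not a final interface):

```c
/*
 * Per-2MiB-pack bookkeeping for the scheme described above, modeled
 * on struct bpf_prog_pack; names are illustrative.
 */
struct text_pack {
	struct list_head list;		/* all packs, walked on allocation */
	void *ptr;			/* the shared 2MiB RO+X region */
	unsigned long bitmap[];		/* chunk-granularity usage map */
};
```

On allocation, a bitmap_find_next_zero_area()-style scan looks for a
contiguous free range in an existing pack, and a new pack is allocated
only when no pack has room; requests above 2MiB bypass the packs
entirely.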

Thanks,
Song
Luis Chamberlain July 12, 2022, 11:42 p.m. UTC | #14
On Tue, Jul 12, 2022 at 11:12:22PM +0000, Song Liu wrote:
> 
> 
> > On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> > 
> > On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
> >>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> >> 
> >>> I believe you are mentioning requiring text_poke() because the way
> >>> eBPF code uses the module_alloc() is different. Correct me if I'm
> >>> wrong, but from what I gather is you use the text_poke_copy() as the data
> >>> is already RO+X, contrary module_alloc() use cases. You do this since your
> >>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
> >>> module_alloc() and before you can use this memory. This is a different type
> >>> of allocator. And, again please correct me if I'm wrong but now you want to
> >>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
> >>> impact of TLB misses.
> >> 
> >> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
> >> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
> >> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
> >> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
> >> much value on top of current module_alloc(). 
> > 
> > Thanks for the clarification.
> > 
> >>> A vmalloc_ro_exec() by definition would imply a text_poke().
> >>> 
> >>> Can kprobes, ftrace and modules use it too? It would be nice
> >>> so to not have to deal with the loose semantics on the user to
> >>> have to use set_vm_flush_reset_perms() on ro+x later, but
> >>> I think this can be addressed separately on a case by case basis.
> >> 
> >> I am pretty confident that kprobe and ftrace can share huge pages with 
> >> BPF programs.
> > 
> > Then wonderful, we know where to go in terms of a new API then as it
> > can be shared in the future for sure and there are gains.
> > 
> >> I haven't looked into all the details with modules, but 
> >> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
> >> possible.
> > 
> > Sure.
> > 
> >> Once this is done, a regular system (without huge BPF program or huge
> >> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
> >> and bpf programs. 
> > 
> > That would be nice, if possible, however modules will require likely its
> > own thing, on my system I see about 57 MiB used on coresize alone.
> > 
> > lsmod | grep -v Module | cut -f1 -d ' ' | \
> > 	xargs sudo modinfo | grep filename | \
> > 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> > 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> > 60001272
> > 
> > And so perhaps we need such a pool size to be configurable.
> > 
> >>> But a vmalloc_ro_exec() with a respective free can remove the
> >>> requirement to do set_vm_flush_reset_perms().
> >> 
> >> Removing the requirement to set_vm_flush_reset_perms() is the other
> >> reason to go directly to vmalloc_ro_exec(). 
> > 
> > Yes fantastic.
> > 
> >> My current version looks like this:
> >> 
> >> void *vmalloc_exec(unsigned long size);
> >> void vfree_exec(void *ptr, unsigned int size);
> >> 
> >> ro is eliminated as there is no rw version of the API. 
> > 
> > Alright.
> > 
> > I am not sure if 2 MiB will suffice given what I mentioned above, and
> > what to do to ensure this grows at a reasonable pace. Then, at least for
> > usage for all architectures since not all will support text_poke() we
> > will want to consider a way to make it easy to users to use non huge
> > page fallbacks, but that would be up to those users, so we can wait for
> > that.
> 
> We are not limited to 2MiB total. The logic is like: 
> 
> 1. Anything bigger than 2MiB gets its own allocation.

And does that allocation get split up into a few huge 2 MiB pages?
When freed, does that go back into the pool of available 2 MiB pages
to use?

> 2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
>    these pages are in use. 

How many 2 MiB huge pages are allocated initially? Do we have a cap?

> 3. For objects smaller than 2MiB, we will try to fit it in one of these
>    pages. 
>    3. a) If there isn't a page with big enough continuous free space, we
>         will allocate a new 2MiB page. 
> 
> (For system with n NUMA nodes, multiple 2MiB above by n). 
> 
> So, if we have 100 kernel modules using 1MiB each, they will share 50x
> 2MiB pages. 

lsmod | grep -v Module | cut -f1 -d ' ' | \
	xargs sudo modinfo | grep filename |\
	grep -o '/.*' | xargs stat -c "%s - %n" | \
	awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 
271.273

On average my system's modules are 271 KiB.

Then I only have 6 out of 216 modules which use more than 2 MiB of
memory for coresize. So roughly 97% of my modules would be covered
with this. Not bad.

The monsters:

lsmod | grep -v Module | cut -f1 -d ' ' | xargs sudo modinfo \
	| grep filename |grep -o '/.*' | xargs stat -c "%s %n" | \
	sort -n -k 1 -r | head -10 | \
	awk '{print $1/1024/1024" "$2}'
6.50775 /lib/modules/5.17.0-1-amd64/kernel/drivers/gpu/drm/i915/i915.ko
3.6847 /lib/modules/5.17.0-1-amd64/kernel/fs/xfs/xfs.ko
3.34252 /lib/modules/5.17.0-1-amd64/kernel/fs/btrfs/btrfs.ko
2.37677 /lib/modules/5.17.0-1-amd64/kernel/net/mac80211/mac80211.ko
2.2972 /lib/modules/5.17.0-1-amd64/kernel/net/wireless/cfg80211.ko
2.05754 /lib/modules/5.17.0-1-amd64/kernel/arch/x86/kvm/kvm.ko
1.96126 /lib/modules/5.17.0-1-amd64/kernel/net/bluetooth/bluetooth.ko
1.83429 /lib/modules/5.17.0-1-amd64/kernel/fs/ext4/ext4.ko
1.7724 /lib/modules/5.17.0-1-amd64/kernel/fs/nfsd/nfsd.ko
1.60539 /lib/modules/5.17.0-1-amd64/kernel/net/sunrpc/sunrpc.ko

On a big iron server I have 149 modules and the situation is better
there:

3.69791 /lib/modules/5.16.0-6-amd64/kernel/fs/xfs/xfs.ko
3.35575 /lib/modules/5.16.0-6-amd64/kernel/fs/btrfs/btrfs.ko
3.21056 /lib/modules/5.16.0-6-amd64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
2.02773 /lib/modules/5.16.0-6-amd64/kernel/arch/x86/kvm/kvm.ko
1.82574 /lib/modules/5.16.0-6-amd64/kernel/fs/ext4/ext4.ko
1.36571 /lib/modules/5.16.0-6-amd64/kernel/net/sunrpc/sunrpc.ko
1.32686 /lib/modules/5.16.0-6-amd64/kernel/fs/nfsd/nfsd.ko
1.12648 /lib/modules/5.16.0-6-amd64/kernel/drivers/gpu/drm/drm.ko
0.898623 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
0.86922 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/core/ib_core.ko

So this may just work nicely.

  Luis
Song Liu July 13, 2022, 1 a.m. UTC | #15
> On Jul 12, 2022, at 4:42 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
> 
> On Tue, Jul 12, 2022 at 11:12:22PM +0000, Song Liu wrote:
>> 
>> 
>>> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>> 
>>> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@kernel.org> wrote:
>>>> 
>>>>> I believe you are mentioning requiring text_poke() because the way
>>>>> eBPF code uses the module_alloc() is different. Correct me if I'm
>>>>> wrong, but from what I gather is you use the text_poke_copy() as the data
>>>>> is already RO+X, contrary module_alloc() use cases. You do this since your
>>>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>>>> module_alloc() and before you can use this memory. This is a different type
>>>>> of allocator. And, again please correct me if I'm wrong but now you want to
>>>>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
>>>>> impact of TLB misses.
>>>> 
>>>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
>>>> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
>>>> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
>>>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>>>> much value on top of current module_alloc(). 
>>> 
>>> Thanks for the clarification.
>>> 
>>>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>>>> 
>>>>> Can kprobes, ftrace and modules use it too? It would be nice
>>>>> so to not have to deal with the loose semantics on the user to
>>>>> have to use set_vm_flush_reset_perms() on ro+x later, but
>>>>> I think this can be addressed separately on a case by case basis.
>>>> 
>>>> I am pretty confident that kprobe and ftrace can share huge pages with 
>>>> BPF programs.
>>> 
>>> Then wonderful, we know where to go in terms of a new API then as it
>>> can be shared in the future for sure and there are gains.
>>> 
>>>> I haven't looked into all the details with modules, but 
>>>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
>>>> possible.
>>> 
>>> Sure.
>>> 
>>>> Once this is done, a regular system (without huge BPF program or huge
>>>> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
>>>> and bpf programs. 
>>> 
>>> That would be nice, if possible, however modules will require likely its
>>> own thing, on my system I see about 57 MiB used on coresize alone.
>>> 
>>> lsmod | grep -v Module | cut -f1 -d ' ' | \
>>> 	xargs sudo modinfo | grep filename | \
>>> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
>>> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
>>> 60001272
>>> 
>>> And so perhaps we need such a pool size to be configurable.
>>> 
>>>>> But a vmalloc_ro_exec() with a respective free can remove the
>>>>> requirement to do set_vm_flush_reset_perms().
>>>> 
>>>> Removing the requirement to set_vm_flush_reset_perms() is the other
>>>> reason to go directly to vmalloc_ro_exec(). 
>>> 
>>> Yes fantastic.
>>> 
>>>> My current version looks like this:
>>>> 
>>>> void *vmalloc_exec(unsigned long size);
>>>> void vfree_exec(void *ptr, unsigned int size);
>>>> 
>>>> ro is eliminated as there is no rw version of the API. 
>>> 
>>> Alright.
>>> 
>>> I am not sure if 2 MiB will suffice given what I mentioned above, and
>>> what to do to ensure this grows at a reasonable pace. Then, at least for
>>> usage for all architectures since not all will support text_poke() we
>>> will want to consider a way to make it easy to users to use non huge
>>> page fallbacks, but that would be up to those users, so we can wait for
>>> that.
>> 
>> We are not limited to 2MiB total. The logic is like: 
>> 
>> 1. Anything bigger than 2MiB gets its own allocation.
> 
> And does that allocation get split up into a few huge 2 MiB pages?
> When freed does that go into the pool of available list of 2 MiB pages
> to use?

This would have some 2MiB pages and some 4kB pages. For example, if we 
need 4MiB + 5kB, it will allocate 2x 2MiB pages, and 2x 4kB pages (5kB
rounds up to 8kB). 

On free, these will not go back to the pool. Instead, they will be
vfree()'ed. 

> 
>> 2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
>>   these pages are in use. 
> 
> How many 2 MiB huge pages are allocated initially? Do we have a cap?

Current logic just allocates 1 huge page at a time. No cap. 

> 
>> 3. For objects smaller than 2MiB, we will try to fit it in one of these
>>   pages. 
>>   3. a) If there isn't a page with big enough continuous free space, we
>>        will allocate a new 2MiB page. 
>> 
>> (For system with n NUMA nodes, multiple 2MiB above by n). 
>> 
>> So, if we have 100 kernel modules using 1MiB each, they will share 50x
>> 2MiB pages. 
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | \
> 	xargs sudo modinfo | grep filename |\
> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 
> 271.273
> 
> On average my system's modules are 271 KiB.
> 
> Then I only have 6 out of 216 modules which are use more than 2 MiB or
> memory for coresize. So roughly 97% of my modules would be covered
> with this. Not bad.

Are these all the modules we have in tree? ;)

Thanks,
Song

> 
> The monsters:
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | xargs sudo modinfo \
> 	| grep filename |grep -o '/.*' | xargs stat -c "%s %n" | \
> 	sort -n -k 1 -r | head -10 | \
> 	awk '{print $1/1024/1024" "$2}'
> 6.50775 /lib/modules/5.17.0-1-amd64/kernel/drivers/gpu/drm/i915/i915.ko
> 3.6847 /lib/modules/5.17.0-1-amd64/kernel/fs/xfs/xfs.ko
> 3.34252 /lib/modules/5.17.0-1-amd64/kernel/fs/btrfs/btrfs.ko
> 2.37677 /lib/modules/5.17.0-1-amd64/kernel/net/mac80211/mac80211.ko
> 2.2972 /lib/modules/5.17.0-1-amd64/kernel/net/wireless/cfg80211.ko
> 2.05754 /lib/modules/5.17.0-1-amd64/kernel/arch/x86/kvm/kvm.ko
> 1.96126 /lib/modules/5.17.0-1-amd64/kernel/net/bluetooth/bluetooth.ko
> 1.83429 /lib/modules/5.17.0-1-amd64/kernel/fs/ext4/ext4.ko
> 1.7724 /lib/modules/5.17.0-1-amd64/kernel/fs/nfsd/nfsd.ko
> 1.60539 /lib/modules/5.17.0-1-amd64/kernel/net/sunrpc/sunrpc.ko
> 
> On a big iron server I have 149 modules and the situation is better
> there:
> 
> 3.69791 /lib/modules/5.16.0-6-amd64/kernel/fs/xfs/xfs.ko
> 3.35575 /lib/modules/5.16.0-6-amd64/kernel/fs/btrfs/btrfs.ko
> 3.21056 /lib/modules/5.16.0-6-amd64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
> 2.02773 /lib/modules/5.16.0-6-amd64/kernel/arch/x86/kvm/kvm.ko
> 1.82574 /lib/modules/5.16.0-6-amd64/kernel/fs/ext4/ext4.ko
> 1.36571 /lib/modules/5.16.0-6-amd64/kernel/net/sunrpc/sunrpc.ko
> 1.32686 /lib/modules/5.16.0-6-amd64/kernel/fs/nfsd/nfsd.ko
> 1.12648 /lib/modules/5.16.0-6-amd64/kernel/drivers/gpu/drm/drm.ko
> 0.898623 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
> 0.86922 /lib/modules/5.16.0-6-amd64/kernel/drivers/infiniband/core/ib_core.ko
> 
> So this may just work nicely.
> 
>  Luis