[bpf-next,5/5] bpf: use module_alloc_huge for bpf_prog_pack

Message ID	20220516054051.114490-6-song@kernel.org (mailing list archive)
State	Superseded
Delegated to:	BPF
Headers	show Return-Path: <bpf-owner@kernel.org> From: Song Liu <song@kernel.org> To: <linux-kernel@vger.kernel.org>, <bpf@vger.kernel.org> CC: <ast@kernel.org>, <daniel@iogearbox.net>, <peterz@infradead.org>, <mcgrof@kernel.org>, <torvalds@linux-foundation.org>, <rick.p.edgecombe@intel.com>, <kernel-team@fb.com>, Song Liu <song@kernel.org> Subject: [PATCH bpf-next 5/5] bpf: use module_alloc_huge for bpf_prog_pack Date: Sun, 15 May 2022 22:40:51 -0700 Message-ID: <20220516054051.114490-6-song@kernel.org> In-Reply-To: <20220516054051.114490-1-song@kernel.org> References: <20220516054051.114490-1-song@kernel.org> Content-Type: text/plain Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 Precedence: bulk
Series	bpf_prog_pack followup \| expand [bpf-next,0/5] bpf_prog_pack followup [bpf-next,1/5] bpf: fill new bpf_prog_pack with illegal instructions [bpf-next,2/5] x86/alternative: introduce text_poke_set [bpf-next,3/5] bpf: introduce bpf_arch_text_invalidate for bpf_prog_pack [bpf-next,4/5] module: introduce module_alloc_huge [bpf-next,5/5] bpf: use module_alloc_huge for bpf_prog_pack

Context	Check	Description
netdev/tree_selection	success	Clearly marked for bpf-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Series has a cover letter
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 9 this patch: 9
netdev/cc_maintainers	warning	7 maintainers not CCed: netdev@vger.kernel.org kafai@fb.com john.fastabend@gmail.com yhs@fb.com kpsingh@kernel.org andrii@kernel.org songliubraving@fb.com
netdev/build_clang	success	Errors and warnings before: 9 this patch: 9
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 9 this patch: 9
netdev/checkpatch	warning	WARNING: line length of 88 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1	fail	Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2	fail	Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-PR	fail	PR summary
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for Kernel LATEST on z15 with gcc

Song Liu May 16, 2022, 5:40 a.m. UTC

Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
PMD_SIZE pages. This benefits system performance by reducing iTLB miss
rate. Benchmark of a real web service workload shows this change gives
another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
(which improve system throughput by ~0.5%).

Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]

[1] https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Song Liu <song@kernel.org>
---
 kernel/bpf/core.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Edgecombe, Rick P May 17, 2022, 7:15 p.m. UTC | #1

On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
> PMD_SIZE pages. This benefits system performance by reducing iTLB
> miss
> rate. Benchmark of a real web service workload shows this change
> gives
> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
> (which improve system throughput by ~0.5%).

0.7% sounds good as a whole. How sure are you of that +0.2%? Was this a
big averaged test?

> 
> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
> 
> [1] 
> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

As I said before, I think this will work functionally. But I meant it
as a quick fix when we were talking about patching this up to keep it
enabled upstream.

So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
pages? The main benefit would be to keep the tear down of these types
of allocations consistent for correctness reasons. The TLB flush
minimizing differences are probably less impactful given the caching
introduced here. At the very least though, we should have (or have
already had) some WARN if people try to use it with huge pages.

> Signed-off-by: Song Liu <song@kernel.org>
> ---
>  kernel/bpf/core.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index cacd8684c3c4..b64d91fcb0ba 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>  	void *ptr;
>  
>  	size = BPF_HPAGE_SIZE * num_online_nodes();
> -	ptr = module_alloc(size);
> +	ptr = module_alloc_huge(size);

This select_bpf_prog_pack_size() function always seemed weird - doing a
big allocation and then immediately freeing. Can't it check a config
for vmalloc huge page support?

>  
>  	/* Test whether we can get huge pages. If not just use
> PAGE_SIZE
>  	 * packs.
> @@ -881,7 +881,7 @@ static struct bpf_prog_pack
> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>  		       GFP_KERNEL);
>  	if (!pack)
>  		return NULL;
> -	pack->ptr = module_alloc(bpf_prog_pack_size);
> +	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
>  	if (!pack->ptr) {
>  		kfree(pack);
>  		return NULL;
> @@ -890,7 +890,6 @@ static struct bpf_prog_pack
> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>  	bitmap_zero(pack->bitmap, bpf_prog_pack_size /
> BPF_PROG_CHUNK_SIZE);
>  	list_add_tail(&pack->list, &pack_list);
>  
> -	set_vm_flush_reset_perms(pack->ptr);
>  	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size /
> PAGE_SIZE);
>  	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size /
> PAGE_SIZE);
>  	return pack;
> @@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size,
> bpf_jit_fill_hole_t bpf_fill_ill_insn
>  
>  	if (size > bpf_prog_pack_size) {
>  		size = round_up(size, PAGE_SIZE);
> -		ptr = module_alloc(size);
> +		ptr = module_alloc_huge(size);
>  		if (ptr) {
>  			bpf_fill_ill_insns(ptr, size);
> -			set_vm_flush_reset_perms(ptr);
>  			set_memory_ro((unsigned long)ptr, size /
> PAGE_SIZE);
>  			set_memory_x((unsigned long)ptr, size /
> PAGE_SIZE);
>  		}
> @@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct
> bpf_binary_header *hdr)
>  
>  	mutex_lock(&pack_mutex);
>  	if (hdr->size > bpf_prog_pack_size) {
> +		set_memory_nx((unsigned long)hdr, hdr->size /
> PAGE_SIZE);
> +		set_memory_rw((unsigned long)hdr, hdr->size /
> PAGE_SIZE);
>  		module_memfree(hdr);
>  		goto out;
>  	}
> @@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct
> bpf_binary_header *hdr)
>  	if (bitmap_find_next_zero_area(pack->bitmap,
> bpf_prog_chunk_count(), 0,
>  				       bpf_prog_chunk_count(), 0) == 0)
> {
>  		list_del(&pack->list);
> +		set_memory_nx((unsigned long)pack->ptr,
> bpf_prog_pack_size / PAGE_SIZE);
> +		set_memory_rw((unsigned long)pack->ptr,
> bpf_prog_pack_size / PAGE_SIZE);
>  		module_memfree(pack->ptr);
>  		kfree(pack);
>  	}

Song Liu May 17, 2022, 9:08 p.m. UTC | #2

> On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit on
>> PMD_SIZE pages. This benefits system performance by reducing iTLB
>> miss
>> rate. Benchmark of a real web service workload shows this change
>> gives
>> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
>> (which improve system throughput by ~0.5%).
> 
> 0.7% sounds good as a whole. How sure are you of that +0.2%? Was this a
> big averaged test?

Yes, this was a test between two tiers with 10+ servers on each tier.  
We took the average performance over a few hours of shadow workload. 

> 
>> 
>> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and use
>> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
>> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
>> 
>> [1] 
>> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
>> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> As I said before, I think this will work functionally. But I meant it
> as a quick fix when we were talking about patching this up to keep it
> enabled upstream.
> 
> So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
> pages? The main benefit would be to keep the tear down of these types
> of allocations consistent for correctness reasons. The TLB flush
> minimizing differences are probably less impactful given the caching
> introduced here. At the very least though, we should have (or have
> already had) some WARN if people try to use it with huge pages.

I am not quite sure the exact work needed here. Rick, would you have
time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the merge 
window is coming soon, I guess we need current work around in 5.19. 

> 
>> Signed-off-by: Song Liu <song@kernel.org>
>> ---
>> kernel/bpf/core.c | 12 +++++++-----
>> 1 file changed, 7 insertions(+), 5 deletions(-)
>> 
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index cacd8684c3c4..b64d91fcb0ba 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>> 	void *ptr;
>> 
>> 	size = BPF_HPAGE_SIZE * num_online_nodes();
>> -	ptr = module_alloc(size);
>> +	ptr = module_alloc_huge(size);
> 
> This select_bpf_prog_pack_size() function always seemed weird - doing a
> big allocation and then immediately freeing. Can't it check a config
> for vmalloc huge page support?

Yes, it is weird. Checking a config is not enough here. We also need to 
check vmap_allow_huge, which is controlled by boot parameter nohugeiomap. 
I haven’t got a better solution for this. 

Thanks,
Song

> 
>> 
>> 	/* Test whether we can get huge pages. If not just use
>> PAGE_SIZE
>> 	 * packs.
>> @@ -881,7 +881,7 @@ static struct bpf_prog_pack
>> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>> 		       GFP_KERNEL);
>> 	if (!pack)
>> 		return NULL;
>> -	pack->ptr = module_alloc(bpf_prog_pack_size);
>> +	pack->ptr = module_alloc_huge(bpf_prog_pack_size);
>> 	if (!pack->ptr) {
>> 		kfree(pack);
>> 		return NULL;
>> @@ -890,7 +890,6 @@ static struct bpf_prog_pack
>> *alloc_new_pack(bpf_jit_fill_hole_t bpf_fill_ill_ins
>> 	bitmap_zero(pack->bitmap, bpf_prog_pack_size /
>> BPF_PROG_CHUNK_SIZE);
>> 	list_add_tail(&pack->list, &pack_list);
>> 
>> -	set_vm_flush_reset_perms(pack->ptr);
>> 	set_memory_ro((unsigned long)pack->ptr, bpf_prog_pack_size /
>> PAGE_SIZE);
>> 	set_memory_x((unsigned long)pack->ptr, bpf_prog_pack_size /
>> PAGE_SIZE);
>> 	return pack;
>> @@ -909,10 +908,9 @@ static void *bpf_prog_pack_alloc(u32 size,
>> bpf_jit_fill_hole_t bpf_fill_ill_insn
>> 
>> 	if (size > bpf_prog_pack_size) {
>> 		size = round_up(size, PAGE_SIZE);
>> -		ptr = module_alloc(size);
>> +		ptr = module_alloc_huge(size);
>> 		if (ptr) {
>> 			bpf_fill_ill_insns(ptr, size);
>> -			set_vm_flush_reset_perms(ptr);
>> 			set_memory_ro((unsigned long)ptr, size /
>> PAGE_SIZE);
>> 			set_memory_x((unsigned long)ptr, size /
>> PAGE_SIZE);
>> 		}
>> @@ -949,6 +947,8 @@ static void bpf_prog_pack_free(struct
>> bpf_binary_header *hdr)
>> 
>> 	mutex_lock(&pack_mutex);
>> 	if (hdr->size > bpf_prog_pack_size) {
>> +		set_memory_nx((unsigned long)hdr, hdr->size /
>> PAGE_SIZE);
>> +		set_memory_rw((unsigned long)hdr, hdr->size /
>> PAGE_SIZE);
>> 		module_memfree(hdr);
>> 		goto out;
>> 	}
>> @@ -975,6 +975,8 @@ static void bpf_prog_pack_free(struct
>> bpf_binary_header *hdr)
>> 	if (bitmap_find_next_zero_area(pack->bitmap,
>> bpf_prog_chunk_count(), 0,
>> 				       bpf_prog_chunk_count(), 0) == 0)
>> {
>> 		list_del(&pack->list);
>> +		set_memory_nx((unsigned long)pack->ptr,
>> bpf_prog_pack_size / PAGE_SIZE);
>> +		set_memory_rw((unsigned long)pack->ptr,
>> bpf_prog_pack_size / PAGE_SIZE);
>> 		module_memfree(pack->ptr);
>> 		kfree(pack);
>> 	}

Edgecombe, Rick P May 17, 2022, 11:58 p.m. UTC | #3

On Tue, 2022-05-17 at 21:08 +0000, Song Liu wrote:
> > On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
> > > Use module_alloc_huge for bpf_prog_pack so that BPF programs sit
> > > on
> > > PMD_SIZE pages. This benefits system performance by reducing iTLB
> > > miss
> > > rate. Benchmark of a real web service workload shows this change
> > > gives
> > > another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
> > > (which improve system throughput by ~0.5%).
> > 
> > 0.7% sounds good as a whole. How sure are you of that +0.2%? Was
> > this a
> > big averaged test?
> 
> Yes, this was a test between two tiers with 10+ servers on each
> tier.  
> We took the average performance over a few hours of shadow workload. 

Awesome. Sounds great.

> 
> > 
> > > 
> > > Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and
> > > use
> > > set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
> > > VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
> > > 
> > > [1] 
> > > 
https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
> > > Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > As I said before, I think this will work functionally. But I meant
> > it
> > as a quick fix when we were talking about patching this up to keep
> > it
> > enabled upstream.
> > 
> > So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
> > pages? The main benefit would be to keep the tear down of these
> > types
> > of allocations consistent for correctness reasons. The TLB flush
> > minimizing differences are probably less impactful given the
> > caching
> > introduced here. At the very least though, we should have (or have
> > already had) some WARN if people try to use it with huge pages.
> 
> I am not quite sure the exact work needed here. Rick, would you have
> time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the merge 
> window is coming soon, I guess we need current work around in 5.19. 

I would have hard time squeezing that in now. The vmalloc part is easy,
I think I already posted a diff. But first hibernate needs to be
changed to not care about direct map page sizes.

> 
> > 
> > > Signed-off-by: Song Liu <song@kernel.org>
> > > ---
> > > kernel/bpf/core.c | 12 +++++++-----
> > > 1 file changed, 7 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > index cacd8684c3c4..b64d91fcb0ba 100644
> > > --- a/kernel/bpf/core.c
> > > +++ b/kernel/bpf/core.c
> > > @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
> > >       void *ptr;
> > > 
> > >       size = BPF_HPAGE_SIZE * num_online_nodes();
> > > -    ptr = module_alloc(size);
> > > +    ptr = module_alloc_huge(size);
> > 
> > This select_bpf_prog_pack_size() function always seemed weird -
> > doing a
> > big allocation and then immediately freeing. Can't it check a
> > config
> > for vmalloc huge page support?
> 
> Yes, it is weird. Checking a config is not enough here. We also need
> to 
> check vmap_allow_huge, which is controlled by boot parameter
> nohugeiomap. 
> I haven’t got a better solution for this. 

It's too weird. We should expose whats needed in vmalloc.
huge_vmalloc_supported() or something.

I'm also not clear why we wouldn't want to use the prog pack allocator
even if vmalloc huge pages was disabled. Doesn't it improve performance
even with small page sizes, per your benchmarks? What is the downside
to just always using it?

Song Liu May 18, 2022, 6:25 a.m. UTC | #4

> On May 17, 2022, at 4:58 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Tue, 2022-05-17 at 21:08 +0000, Song Liu wrote:
>>> On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>>>> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit
>>>> on
>>>> PMD_SIZE pages. This benefits system performance by reducing iTLB
>>>> miss
>>>> rate. Benchmark of a real web service workload shows this change
>>>> gives
>>>> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
>>>> (which improve system throughput by ~0.5%).
>>> 
>>> 0.7% sounds good as a whole. How sure are you of that +0.2%? Was
>>> this a
>>> big averaged test?
>> 
>> Yes, this was a test between two tiers with 10+ servers on each
>> tier.  
>> We took the average performance over a few hours of shadow workload. 
> 
> Awesome. Sounds great.
> 
>> 
>>> 
>>>> 
>>>> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and
>>>> use
>>>> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
>>>> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
>>>> 
>>>> [1] 
>>>> 
> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
>>>> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> 
>>> As I said before, I think this will work functionally. But I meant
>>> it
>>> as a quick fix when we were talking about patching this up to keep
>>> it
>>> enabled upstream.
>>> 
>>> So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
>>> pages? The main benefit would be to keep the tear down of these
>>> types
>>> of allocations consistent for correctness reasons. The TLB flush
>>> minimizing differences are probably less impactful given the
>>> caching
>>> introduced here. At the very least though, we should have (or have
>>> already had) some WARN if people try to use it with huge pages.
>> 
>> I am not quite sure the exact work needed here. Rick, would you have
>> time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the merge 
>> window is coming soon, I guess we need current work around in 5.19. 
> 
> I would have hard time squeezing that in now. The vmalloc part is easy,
> I think I already posted a diff. But first hibernate needs to be
> changed to not care about direct map page sizes.

I guess I missed the diff, could you please send a link to it?

> 
>> 
>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> kernel/bpf/core.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>> index cacd8684c3c4..b64d91fcb0ba 100644
>>>> --- a/kernel/bpf/core.c
>>>> +++ b/kernel/bpf/core.c
>>>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>>>>      void *ptr;
>>>> 
>>>>      size = BPF_HPAGE_SIZE * num_online_nodes();
>>>> -    ptr = module_alloc(size);
>>>> +    ptr = module_alloc_huge(size);
>>> 
>>> This select_bpf_prog_pack_size() function always seemed weird -
>>> doing a
>>> big allocation and then immediately freeing. Can't it check a
>>> config
>>> for vmalloc huge page support?
>> 
>> Yes, it is weird. Checking a config is not enough here. We also need
>> to 
>> check vmap_allow_huge, which is controlled by boot parameter
>> nohugeiomap. 
>> I haven’t got a better solution for this. 
> 
> It's too weird. We should expose whats needed in vmalloc.
> huge_vmalloc_supported() or something.

Yeah, this should work. I will get something like this in the next 
version.

> 
> I'm also not clear why we wouldn't want to use the prog pack allocator
> even if vmalloc huge pages was disabled. Doesn't it improve performance
> even with small page sizes, per your benchmarks? What is the downside
> to just always using it?

With current version, when huge page is disabled, the prog pack allocator
will use 4kB pages for each pack. We still get about 0.5% performance
improvement with 4kB prog packs. 

Thanks,
Song

Song Liu May 18, 2022, 6:34 a.m. UTC | #5

> On May 17, 2022, at 4:58 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Tue, 2022-05-17 at 21:08 +0000, Song Liu wrote:
>>> On May 17, 2022, at 12:15 PM, Edgecombe, Rick P <
>>> rick.p.edgecombe@intel.com> wrote:
>>> 
>>> On Sun, 2022-05-15 at 22:40 -0700, Song Liu wrote:
>>>> Use module_alloc_huge for bpf_prog_pack so that BPF programs sit
>>>> on
>>>> PMD_SIZE pages. This benefits system performance by reducing iTLB
>>>> miss
>>>> rate. Benchmark of a real web service workload shows this change
>>>> gives
>>>> another ~0.2% performance boost on top of PAGE_SIZE bpf_prog_pack
>>>> (which improve system throughput by ~0.5%).
>>> 
>>> 0.7% sounds good as a whole. How sure are you of that +0.2%? Was
>>> this a
>>> big averaged test?
>> 
>> Yes, this was a test between two tiers with 10+ servers on each
>> tier.  
>> We took the average performance over a few hours of shadow workload. 
> 
> Awesome. Sounds great.
> 
>> 
>>> 
>>>> 
>>>> Also, remove set_vm_flush_reset_perms() from alloc_new_pack() and
>>>> use
>>>> set_memory_[nx|rw] in bpf_prog_pack_free(). This is because
>>>> VM_FLUSH_RESET_PERMS does not work with huge pages yet. [1]
>>>> 
>>>> [1] 
>>>> 
> https://lore.kernel.org/bpf/aeeeaf0b7ec63fdba55d4834d2f524d8bf05b71b.camel@intel.com/
>>>> Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> 
>>> As I said before, I think this will work functionally. But I meant
>>> it
>>> as a quick fix when we were talking about patching this up to keep
>>> it
>>> enabled upstream.
>>> 
>>> So now, should we make VM_FLUSH_RESET_PERMS work properly with huge
>>> pages? The main benefit would be to keep the tear down of these
>>> types
>>> of allocations consistent for correctness reasons. The TLB flush
>>> minimizing differences are probably less impactful given the
>>> caching
>>> introduced here. At the very least though, we should have (or have
>>> already had) some WARN if people try to use it with huge pages.
>> 
>> I am not quite sure the exact work needed here. Rick, would you have
>> time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the merge 
>> window is coming soon, I guess we need current work around in 5.19. 
> 
> I would have hard time squeezing that in now. The vmalloc part is easy,
> I think I already posted a diff. But first hibernate needs to be
> changed to not care about direct map page sizes.

I guess I missed the diff, could you please send a link to it?

> 
>> 
>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> kernel/bpf/core.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>> index cacd8684c3c4..b64d91fcb0ba 100644
>>>> --- a/kernel/bpf/core.c
>>>> +++ b/kernel/bpf/core.c
>>>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>>>>      void *ptr;
>>>> 
>>>>      size = BPF_HPAGE_SIZE * num_online_nodes();
>>>> -    ptr = module_alloc(size);
>>>> +    ptr = module_alloc_huge(size);
>>> 
>>> This select_bpf_prog_pack_size() function always seemed weird -
>>> doing a
>>> big allocation and then immediately freeing. Can't it check a
>>> config
>>> for vmalloc huge page support?
>> 
>> Yes, it is weird. Checking a config is not enough here. We also need
>> to 
>> check vmap_allow_huge, which is controlled by boot parameter
>> nohugeiomap. 
>> I haven’t got a better solution for this. 
> 
> It's too weird. We should expose whats needed in vmalloc.
> huge_vmalloc_supported() or something.

Yeah, this should work. I will get something like this in the next 
version.

> 
> I'm also not clear why we wouldn't want to use the prog pack allocator
> even if vmalloc huge pages was disabled. Doesn't it improve performance
> even with small page sizes, per your benchmarks? What is the downside
> to just always using it?

With current version, when huge page is disabled, the prog pack allocator
will use 4kB pages for each pack. We still get about 0.5% performance
improvement with 4kB prog packs. 

Thanks,
Song

Edgecombe, Rick P May 18, 2022, 4:49 p.m. UTC | #6

On Wed, 2022-05-18 at 06:34 +0000, Song Liu wrote:
> > > I am not quite sure the exact work needed here. Rick, would you
> > > have
> > > time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
> > > merge 
> > > window is coming soon, I guess we need current work around in
> > > 5.19. 
> > 
> > I would have hard time squeezing that in now. The vmalloc part is
> > easy,
> > I think I already posted a diff. But first hibernate needs to be
> > changed to not care about direct map page sizes.
> 
> I guess I missed the diff, could you please send a link to it?

https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/

The remaining problem is that hibernate may encounter NP pages when
saving memory to disk. It resets them with CPA calls 4k at a time. So
if a page is NP, hibernate needs it to be already be 4k or it might
need to split. I think hibernate should just utilize a different
mapping to get at the page when it encounters this rare scenario. In
that diff I put some locking so that hibernate couldn't race with a
huge NP page, but then I thought we should just change hibernate.

> 
> > 
> > > 
> > > > 
> > > > > Signed-off-by: Song Liu <song@kernel.org>
> > > > > ---
> > > > > kernel/bpf/core.c | 12 +++++++-----
> > > > > 1 file changed, 7 insertions(+), 5 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > > > index cacd8684c3c4..b64d91fcb0ba 100644
> > > > > --- a/kernel/bpf/core.c
> > > > > +++ b/kernel/bpf/core.c
> > > > > @@ -857,7 +857,7 @@ static size_t
> > > > > select_bpf_prog_pack_size(void)
> > > > >       void *ptr;
> > > > > 
> > > > >       size = BPF_HPAGE_SIZE * num_online_nodes();
> > > > > -    ptr = module_alloc(size);
> > > > > +    ptr = module_alloc_huge(size);
> > > > 
> > > > This select_bpf_prog_pack_size() function always seemed weird -
> > > > doing a
> > > > big allocation and then immediately freeing. Can't it check a
> > > > config
> > > > for vmalloc huge page support?
> > > 
> > > Yes, it is weird. Checking a config is not enough here. We also
> > > need
> > > to 
> > > check vmap_allow_huge, which is controlled by boot parameter
> > > nohugeiomap. 
> > > I haven’t got a better solution for this. 
> > 
> > It's too weird. We should expose whats needed in vmalloc.
> > huge_vmalloc_supported() or something.
> 
> Yeah, this should work. I will get something like this in the next 
> version.
> 
> > 
> > I'm also not clear why we wouldn't want to use the prog pack
> > allocator
> > even if vmalloc huge pages was disabled. Doesn't it improve
> > performance
> > even with small page sizes, per your benchmarks? What is the
> > downside
> > to just always using it?
> 
> With current version, when huge page is disabled, the prog pack
> allocator
> will use 4kB pages for each pack. We still get about 0.5% performance
> improvement with 4kB prog packs. 

Oh, I thought you were comparing a 2MB sized, small page mapped
allocation to a 2MB sized, huge page mapped allocation.

It looks like the logic is to free a pack if it is empty, so then for
smaller packs you are more likely to let the pages go back to the page
allocator. Then future allocations would break more pages.

So I think that is not a fully apples to apples test of huge mapping
benefits. I'd be surprised if there really was no huge mapping benefit,
since its been seen with core kernel text. Did you notice if the direct
map breakage was different between the tests?

Song Liu May 18, 2022, 6:31 p.m. UTC | #7

> On May 18, 2022, at 9:49 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-05-18 at 06:34 +0000, Song Liu wrote:
>>>> I am not quite sure the exact work needed here. Rick, would you
>>>> have
>>>> time to enable VM_FLUSH_RESET_PERMS for huge pages? Given the
>>>> merge 
>>>> window is coming soon, I guess we need current work around in
>>>> 5.19. 
>>> 
>>> I would have hard time squeezing that in now. The vmalloc part is
>>> easy,
>>> I think I already posted a diff. But first hibernate needs to be
>>> changed to not care about direct map page sizes.
>> 
>> I guess I missed the diff, could you please send a link to it?
> 
> 
> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> 
> The remaining problem is that hibernate may encounter NP pages when
> saving memory to disk. It resets them with CPA calls 4k at a time. So
> if a page is NP, hibernate needs it to be already be 4k or it might
> need to split. I think hibernate should just utilize a different
> mapping to get at the page when it encounters this rare scenario. In
> that diff I put some locking so that hibernate couldn't race with a
> huge NP page, but then I thought we should just change hibernate.

I am not quite sure how to test the hibernate path. Given the merge
window is coming soon, how about we ship this patch in 5.19, and fix
VM_FLUSH_RESET_PERMS in a later release?

> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>> 
>>> 
>>> I'm also not clear why we wouldn't want to use the prog pack
>>> allocator
>>> even if vmalloc huge pages was disabled. Doesn't it improve
>>> performance
>>> even with small page sizes, per your benchmarks? What is the
>>> downside
>>> to just always using it?
>> 
>> With current version, when huge page is disabled, the prog pack
>> allocator
>> will use 4kB pages for each pack. We still get about 0.5% performance
>> improvement with 4kB prog packs. 
> 
> Oh, I thought you were comparing a 2MB sized, small page mapped
> allocation to a 2MB sized, huge page mapped allocation.
> 
> It looks like the logic is to free a pack if it is empty, so then for
> smaller packs you are more likely to let the pages go back to the page
> allocator. Then future allocations would break more pages.

This is correct. This is the current version we have with 5.18-rc7. 

> 
> So I think that is not a fully apples to apples test of huge mapping
> benefits. I'd be surprised if there really was no huge mapping benefit,
> since its been seen with core kernel text. Did you notice if the direct
> map breakage was different between the tests?

I didn’t check specifically, but it is expected that the 4kB prog pack
will cause more direct map breakage. 

Thanks,
Song

Song Liu May 19, 2022, 6:42 a.m. UTC | #8

> On May 17, 2022, at 4:58 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
>>> 
>>>> Signed-off-by: Song Liu <song@kernel.org>
>>>> ---
>>>> kernel/bpf/core.c | 12 +++++++-----
>>>> 1 file changed, 7 insertions(+), 5 deletions(-)
>>>> 
>>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>>> index cacd8684c3c4..b64d91fcb0ba 100644
>>>> --- a/kernel/bpf/core.c
>>>> +++ b/kernel/bpf/core.c
>>>> @@ -857,7 +857,7 @@ static size_t select_bpf_prog_pack_size(void)
>>>>      void *ptr;
>>>> 
>>>>      size = BPF_HPAGE_SIZE * num_online_nodes();
>>>> -    ptr = module_alloc(size);
>>>> +    ptr = module_alloc_huge(size);
>>> 
>>> This select_bpf_prog_pack_size() function always seemed weird -
>>> doing a
>>> big allocation and then immediately freeing. Can't it check a
>>> config
>>> for vmalloc huge page support?
>> 
>> Yes, it is weird. Checking a config is not enough here. We also need
>> to 
>> check vmap_allow_huge, which is controlled by boot parameter
>> nohugeiomap. 
>> I haven’t got a better solution for this. 
> 
> It's too weird. We should expose whats needed in vmalloc.
> huge_vmalloc_supported() or something.

Thinking more on this. Even huge page is not supported, we can allocate
2MB worth of 4kB pages and keep using it. This would help direct map
fragmentation. And the code would also be simpler. 

Rick, I guess this is inline with some of your ideas?

Thanks,
Song

> 
> I'm also not clear why we wouldn't want to use the prog pack allocator
> even if vmalloc huge pages was disabled. Doesn't it improve performance
> even with small page sizes, per your benchmarks? What is the downside
> to just always using it?

Edgecombe, Rick P May 19, 2022, 4:56 p.m. UTC | #9

On Thu, 2022-05-19 at 06:42 +0000, Song Liu wrote:
> Thinking more on this. Even huge page is not supported, we can
> allocate
> 2MB worth of 4kB pages and keep using it. This would help direct map
> fragmentation. And the code would also be simpler. 
> 
> Rick, I guess this is inline with some of your ideas?

Yea, that is what I wondering. Potential benefits are just speculative
though. There is a memory overhead cost, so it's not free.

As for the other question of whether to fix VM_FLUSH_RESET_PERMS. If
there really is an intention to create a more general module_alloc()
replacement soon, then I think it is ok to side step it. An optimal
replacement might not need it and it could be removed in that case.
Let's at least add a WARN about it not working with huge pages though.

I also think the benchmarking so far is not sufficient to make the case
that huge page mappings help your workload since the direct map splits
were also different between the tests. I was expecting it to help
though. Others were the ones that asked for that, so just commenting my
analysis here.

Song Liu May 19, 2022, 6:25 p.m. UTC | #10

> On May 19, 2022, at 9:56 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Thu, 2022-05-19 at 06:42 +0000, Song Liu wrote:
>> Thinking more on this. Even huge page is not supported, we can
>> allocate
>> 2MB worth of 4kB pages and keep using it. This would help direct map
>> fragmentation. And the code would also be simpler. 
>> 
>> Rick, I guess this is inline with some of your ideas?
> 
> Yea, that is what I wondering. Potential benefits are just speculative
> though. There is a memory overhead cost, so it's not free.

Yeah, I had the same concern with memory overhead. The benefit should 
not exceed 0.2% (when 2MB page is supported), so I think we can use
4kB pages for now.  

> 
> As for the other question of whether to fix VM_FLUSH_RESET_PERMS. If
> there really is an intention to create a more general module_alloc()
> replacement soon, then I think it is ok to side step it. An optimal
> replacement might not need it and it could be removed in that case.
> Let's at least add a WARN about it not working with huge pages though.

IIUC, it will take some effort to let kernel modules use a solution 
like this. But there are some low hanging fruits, like ftrace and BPF
trampolines. Maybe that's enough to promote a more general solution. 

For the WARN, I guess we need something like this?

diff --git i/include/linux/vmalloc.h w/include/linux/vmalloc.h
index b159c2789961..5e0d0a60d9d5 100644
--- i/include/linux/vmalloc.h
+++ w/include/linux/vmalloc.h
@@ -238,6 +238,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 {
        struct vm_struct *vm = find_vm_area(addr);

+       WARN_ON_ONCE(is_vm_area_hugepages(addr));
        if (vm)
                vm->flags |= VM_FLUSH_RESET_PERMS;
 }


Thanks,
Song

> 
> I also think the benchmarking so far is not sufficient to make the case
> that huge page mappings help your workload since the direct map splits
> were also different between the tests. I was expecting it to help
> though. Others were the ones that asked for that, so just commenting my
> analysis here.

[bpf-next,5/5] bpf: use module_alloc_huge for bpf_prog_pack

Checks

Commit Message

Comments

Patch