
[bpf,0/4] introduce HAVE_ARCH_HUGE_VMALLOC_FLAG for bpf_prog_pack

Message ID 20220330225642.1163897-1-song@kernel.org

Message

Song Liu March 30, 2022, 10:56 p.m. UTC
We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
issues [1], [2]. Fix this by introducing HAVE_ARCH_HUGE_VMALLOC_FLAG, which
allows users of __vmalloc_node_range() to request huge pages with a new flag,
VM_TRY_HUGE_VMAP. bpf_prog_pack is updated to call __vmalloc_node_range()
with VM_TRY_HUGE_VMAP.

[1] https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
[2] https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
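
As a rough sketch of the intended call site (the helper name and the use of
the module address range are illustrative, not taken from the patches; only
__vmalloc_node_range() and the new VM_TRY_HUGE_VMAP flag are from this
series), bpf_prog_pack would allocate along these lines:

#include <linux/mm.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

/* Illustrative sketch: place the pack in the module address range and ask
 * vmalloc to *try* huge pages; vmalloc may still fall back to 4k pages. */
static void *bpf_prog_pack_vmalloc_sketch(unsigned long size)
{
        return __vmalloc_node_range(size, PAGE_SIZE, MODULES_VADDR,
                                    MODULES_END, GFP_KERNEL, PAGE_KERNEL,
                                    VM_TRY_HUGE_VMAP, NUMA_NO_NODE,
                                    __builtin_return_address(0));
}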

Song Liu (4):
  x86: disable HAVE_ARCH_HUGE_VMALLOC
  vmalloc: introduce HAVE_ARCH_HUGE_VMALLOC_FLAG
  x86: select HAVE_ARCH_HUGE_VMALLOC_FLAG for X86_64
  bpf: use __vmalloc_node_range() with VM_TRY_HUGE_VMAP for
    bpf_prog_pack

 arch/Kconfig            |  9 +++++++++
 arch/x86/Kconfig        |  2 +-
 include/linux/vmalloc.h |  9 +++++++--
 kernel/bpf/core.c       | 21 ++++++++++++++++++---
 mm/vmalloc.c            | 28 +++++++++++++++++++---------
 5 files changed, 54 insertions(+), 15 deletions(-)

--
2.30.2

Comments

Edgecombe, Rick P March 31, 2022, 12:04 a.m. UTC | #1
On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> [1] 
> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/

The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in
this series. And I think the solution I proposed is kind of wonky with
respect to hibernate. So I think maybe hibernate should be fixed to not
impose restrictions on the direct map, so the wonkiness is not needed.
But then this "fixes" series becomes quite extensive.

I wonder, why not just push patch 1 here, then re-enable this thing
when it is all properly fixed up? It looked like your code could handle
the allocation not actually getting large pages.

Another solution that would keep large pages but still need fixing up
later: Just don't use VM_FLUSH_RESET_PERMS for now. Call
set_memory_nx() and then set_memory_rw() on the module space address
before vfree(). This will clean up everything that's needed with
respect to direct map permissions. Have vmalloc warn if it sees
VM_FLUSH_RESET_PERMS and huge pages together.
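
In code, the suggested teardown would look roughly like this (a minimal
sketch; the function name and size handling are illustrative, not from the
series):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <linux/vmalloc.h>

/* Restore default permissions on the vmalloc alias before freeing, so the
 * direct map is cleaned up without relying on VM_FLUSH_RESET_PERMS. */
static void prog_pack_free_sketch(void *addr, unsigned long size)
{
        int npages = size >> PAGE_SHIFT;

        set_memory_nx((unsigned long)addr, npages);
        set_memory_rw((unsigned long)addr, npages);
        vfree(addr);
}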
Song Liu March 31, 2022, 12:46 a.m. UTC | #2
> On Mar 30, 2022, at 5:04 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
>> [1] 
>> https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> 
> The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed in
> this series. And I think the solution I proposed is kind of wonky with
> respect to hibernate. So I think maybe hibernate should be fixed to not
> impose restrictions on the direct map, so the wonkiness is not needed.
> But then this "fixes" series becomes quite extensive.
> 
> I wonder, why not just push the patch 1 here, then re-enable this thing
> when it is all properly fixed up. It looked like your code could handle
> the allocation not actually getting large pages.

Only shipping patch 1 should eliminate the issues. But that will also
reduce the benefit in iTLB efficiency (I don't know by how much yet.)

> 
> Another solution that would keep large pages but still need fixing up
> later: Just don't use VM_FLUSH_RESET_PERMS for now. Call
> set_memory_nx() and then set_memory_rw() on the module space address
> before vfree(). This will clean up everything that's needed with
> respect to direct map permissions. Have vmalloc warn if is sees
> VM_FLUSH_RESET_PERMS and huge pages together.

Do you mean we should remove set_vm_flush_reset_perms() from
alloc_new_pack() and do set_memory_nx() and set_memory_rw() before
we call vfree() in bpf_prog_pack_free()? If this works, I would prefer
we go this way.

Thanks,
Song
Christoph Hellwig March 31, 2022, 5:37 a.m. UTC | #3
On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
> issues [1], [2].
>

Please fix the underlying issues instead of papering over them and
creating a huge maintenance burden for others.
Edgecombe, Rick P March 31, 2022, 4:19 p.m. UTC | #4
On Thu, 2022-03-31 at 00:46 +0000, Song Liu wrote:
> > On Mar 30, 2022, at 5:04 PM, Edgecombe, Rick P <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > On Wed, 2022-03-30 at 15:56 -0700, Song Liu wrote:
> > > [1] 
> > > 
https://lore.kernel.org/lkml/5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com/
> > 
> > The issues I brought up around VM_FLUSH_RESET_PERMS are not fixed
> > in
> > this series. And I think the solution I proposed is kind of wonky
> > with
> > respect to hibernate. So I think maybe hibernate should be fixed to
> > not
> > impose restrictions on the direct map, so the wonkiness is not
> > needed.
> > But then this "fixes" series becomes quite extensive.
> > 
> > I wonder, why not just push the patch 1 here, then re-enable this
> > thing
> > when it is all properly fixed up. It looked like your code could
> > handle
> > the allocation not actually getting large pages.
> 
> Only shipping patch 1 should eliminate the issues. But that will also
> reduce the benefit in iTLB efficiency (I don't know by how much yet.)

Yea, it's just a matter of what order/timeline things get done in. This
change didn't get enough mm attention ahead of time. Now there are two
issues. One where the root cause is not fully clear and one that
properly needs a wider fix. Just thinking it could be nice to take some
time on it, rather than rush to finish what was already too rushed.

> 
> > 
> > Another solution that would keep large pages but still need fixing
> > up
> > later: Just don't use VM_FLUSH_RESET_PERMS for now. Call
> > set_memory_nx() and then set_memory_rw() on the module space
> > address
> > before vfree(). This will clean up everything that's needed with
> > respect to direct map permissions. Have vmalloc warn if is sees
> > VM_FLUSH_RESET_PERMS and huge pages together.
> 
> Do you mean we should remove set_vm_flush_reset_perms() from 
> alloc_new_pack() and do set_memory_nx() and set_memory_rw() before
> we call vfree() in bpf_prog_pack_free()? If this works, I would
> prefer
> we go with this way. 

I believe this would work functionally.
Song Liu March 31, 2022, 11:59 p.m. UTC | #5
Hi Christoph, 

> On Mar 30, 2022, at 10:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
>> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
>> issues [1], [2].
>> 
> 
> Please fix the underlying issues instead of papering over them and
> creating a huge maintainance burden for others.

I agree that this set is papering over the issue. And I would like 
your recommendations here. 

The biggest problem to me is that we (or at least I) don't know
all the issues HAVE_ARCH_HUGE_VMALLOC will trigger on x86_64. Right
now we have a bug report from Paul, and the warning from Rick, but
I am afraid there might be some other issues.

How about we approach it like this:

Since it is still early in the release cycle (pre rc1), we can keep 
HAVE_ARCH_HUGE_VMALLOC on for x86_64 for now and try to fix all the 
reported issues and warnings. If things don't go very well, we can
turn HAVE_ARCH_HUGE_VMALLOC off after rc4 or rc5. 

Does this make sense?

Thanks,
Song
Song Liu April 1, 2022, 10:22 p.m. UTC | #6
+ Nicholas and Claudio,


> On Mar 31, 2022, at 4:59 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Hi Christoph, 
> 
>> On Mar 30, 2022, at 10:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
>> 
>> On Wed, Mar 30, 2022 at 03:56:38PM -0700, Song Liu wrote:
>>> We prematurely enabled HAVE_ARCH_HUGE_VMALLOC for x86_64, which could cause
>>> issues [1], [2].
>>> 
>> 
>> Please fix the underlying issues instead of papering over them and
>> creating a huge maintainance burden for others.

After reading the code a little more, I wonder what would be the best strategy.
IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
For example, none of the module_alloc() implementations can work with huge
pages at the moment. And the error Paul Menzel reported in drm_fb_helper.c
will probably hit powerpc with the 5.17 kernel as-is? (trace attached below)

Right now, we have VM_NO_HUGE_VMAP to let a user opt out of huge pages.
However, given there are so many users of vmalloc, vzalloc, etc., we
probably do need a flag for the user to opt in?

Does this make sense? Any recommendations are really appreciated.
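
To make the opt-out vs. opt-in difference concrete, here is a simplified,
purely illustrative sketch (the real decision lives inside
__vmalloc_node_range() and also checks arch support; VM_TRY_HUGE_VMAP stands
in for whatever opt-in flag we settle on):

#include <linux/pgtable.h>
#include <linux/types.h>
#include <linux/vmalloc.h>

/* opt_in == false models today's behaviour (huge pages unless the caller
 * passes VM_NO_HUGE_VMAP); opt_in == true models the proposed behaviour
 * (huge pages only when the caller asks for them). */
static bool want_huge_vmap_sketch(unsigned long vm_flags, unsigned long size,
                                  bool opt_in)
{
        if (size < PMD_SIZE)
                return false;
        if (opt_in)
                return vm_flags & VM_TRY_HUGE_VMAP;
        return !(vm_flags & VM_NO_HUGE_VMAP);
}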

Thanks,
Song 




[    1.687983] BUG: Bad page state in process systemd-udevd  pfn:102e03
[    1.687992] fbcon: Taking over console
[    1.688007] page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x3 pfn:0x102e03
[    1.688011] head:(____ptrval____) order:9 compound_mapcount:0 compound_pincount:0
[    1.688013] flags: 0x2fffc000010000(head|node=0|zone=2|lastcpupid=0x3fff)
[    1.688018] raw: 002fffc000000000 ffffe815040b8001 ffffe815040b80c8 0000000000000000
[    1.688020] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.688022] head: 002fffc000010000 0000000000000000 dead000000000122 0000000000000000
[    1.688023] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.688024] page dumped because: corrupted mapping in tail page
[    1.688025] Modules linked in: r8169(+) k10temp snd_pcm(+) xhci_hcd snd_timer realtek ohci_hcd ehci_pci(+) i2c_piix4 ehci_hcd radeon(+) snd sg drm_ttm_helper ttm soundcore coreboot_table acpi_cpufreq fuse ipv6 autofs4
[    1.688045] CPU: 1 PID: 151 Comm: systemd-udevd Not tainted 5.16.0-11615-gfac54e2bfb5b #319
[    1.688048] Hardware name: ASUS F2A85-M_PRO/F2A85-M_PRO, BIOS 4.16-337-gb87986e67b 03/25/2022
[    1.688050] Call Trace:
[    1.688051]  <TASK>
[    1.688053]  dump_stack_lvl+0x34/0x44
[    1.688059]  bad_page.cold+0x63/0x94
[    1.688063]  free_tail_pages_check+0xd1/0x110
[    1.688067]  ? _raw_spin_lock+0x13/0x30
[    1.688071]  free_pcp_prepare+0x251/0x2e0
[    1.688075]  free_unref_page+0x1d/0x110
[    1.688078]  __vunmap+0x28a/0x380
[    1.688082]  drm_fbdev_cleanup+0x5f/0xb0
[    1.688085]  drm_fbdev_fb_destroy+0x15/0x30
[    1.688087]  unregister_framebuffer+0x1d/0x30
[    1.688091]  drm_client_dev_unregister+0x69/0xe0
[    1.688095]  drm_dev_unregister+0x2e/0x80
[    1.688098]  drm_dev_unplug+0x21/0x40
[    1.688100]  simpledrm_remove+0x11/0x20
[    1.688103]  platform_remove+0x1f/0x40
[    1.688106]  __device_release_driver+0x17a/0x240
[    1.688109]  device_release_driver+0x24/0x30
[    1.688112]  bus_remove_device+0xd8/0x140
[    1.688115]  device_del+0x18b/0x3f0
[    1.688118]  ? _raw_spin_unlock_irqrestore+0x1b/0x30
[    1.688121]  ? try_to_wake_up+0x94/0x5b0
[    1.688124]  platform_device_del.part.0+0x13/0x70
[    1.688127]  platform_device_unregister+0x1c/0x30
[    1.688130]  drm_aperture_detach_drivers+0xa1/0xd0
[    1.688134]  drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x60
[    1.688137]  radeon_pci_probe+0x54/0xf0 [radeon]
[    1.688212]  local_pci_probe+0x45/0x80
[    1.688216]  ? pci_match_device+0xd7/0x130
[    1.688219]  pci_device_probe+0xc2/0x1d0
[    1.688223]  really_probe+0x1f5/0x3d0
[    1.688226]  __driver_probe_device+0xfe/0x180
[    1.688229]  driver_probe_device+0x1e/0x90
[    1.688232]  __driver_attach+0xc0/0x1c0
[    1.688235]  ? __device_attach_driver+0xe0/0xe0
[    1.688237]  ? __device_attach_driver+0xe0/0xe0
[    1.688239]  bus_for_each_dev+0x78/0xc0
[    1.688242]  bus_add_driver+0x149/0x1e0
[    1.688245]  driver_register+0x8f/0xe0
[    1.688248]  ? 0xffffffffc051d000
[    1.688250]  do_one_initcall+0x44/0x200
[    1.688254]  ? kmem_cache_alloc_trace+0x170/0x2c0
[    1.688257]  do_init_module+0x5c/0x260
[    1.688262]  __do_sys_finit_module+0xb4/0x120
[    1.688266]  __do_fast_syscall_32+0x6b/0xe0
[    1.688270]  do_fast_syscall_32+0x2f/0x70
[    1.688272]  entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[    1.688275] RIP: 0023:0xf7e51549
[    1.688278] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 cd 0f 05 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[    1.688281] RSP: 002b:00000000ffa1666c EFLAGS: 00200292 ORIG_RAX: 000000000000015e
[    1.688285] RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00000000f7e30e09
[    1.688287] RDX: 0000000000000000 RSI: 00000000f9a705d0 RDI: 00000000f9a6f6a0
[    1.688288] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[    1.688290] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    1.688291] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    1.688294]  </TASK>
[    1.688355] Disabling lock debugging due to kernel taint
Christoph Hellwig April 5, 2022, 7:07 a.m. UTC | #7
On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
> >> Please fix the underlying issues instead of papering over them and
> >> creating a huge maintainance burden for others.
> 
> After reading the code a little more, I wonder what would be best strategy. 
> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
> For example, all the module_alloc cannot work with huge pages at the moment.
> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
> powerpc with 5.17 kernel as-is? (trace attached below) 
> 
> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
> However, given there are so many users of vmalloc, vzalloc, etc., we 
> probably do need a flag for the user to opt-in? 
> 
> Does this make sense? Any recommendations are really appreciated. 

I think there are multiple aspects here:

 - if we think that the kernel is not ready for hugepage backed vmalloc
   in general we need to disable it in powerpc for now.
 - if we think even in the longer run only some users can cope with
   hugepage backed vmalloc we need to turn it into an opt-in in
   general and not just for x86
 - there still appear to be various unresolved underlying x86-specific
   issues that need to be fixed either way
Song Liu April 5, 2022, 11:54 p.m. UTC | #8
> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
>>>> Please fix the underlying issues instead of papering over them and
>>>> creating a huge maintainance burden for others.
>> 
>> After reading the code a little more, I wonder what would be best strategy. 
>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
>> For example, all the module_alloc cannot work with huge pages at the moment.
>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
>> powerpc with 5.17 kernel as-is? (trace attached below) 
>> 
>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
>> However, given there are so many users of vmalloc, vzalloc, etc., we 
>> probably do need a flag for the user to opt-in? 
>> 
>> Does this make sense? Any recommendations are really appreciated. 
> 
> I think there is multiple aspects here:
> 
> - if we think that the kernel is not ready for hugepage backed vmalloc
>   in general we need to disable it in powerpc for now.

Nicholas and Claudio, 

What do you think about the status of hugepage backed vmalloc on powerpc? 
I found that module_alloc() and kvm_s390_pv_alloc_vm() opt out of huge pages.
But I am not aware of users that benefit from huge pages (except the vfs hash,
which was mentioned in 8abddd968a30). Does an opt-in flag (instead of the
current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?

Thanks,
Song

> - if we think even in the longer run only some users can cope with
>   hugepage backed vmalloc we need to turn it into an opt-in in
>   general and not just for x86
> - there still to appear various unresolved underlying x86 specific
>   issues that need to be fixed either way
Song Liu April 7, 2022, 7:57 p.m. UTC | #9
Hi Nicholas and Claudio, 

> On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
> 
>> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> 
>> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:
>>>>> Please fix the underlying issues instead of papering over them and
>>>>> creating a huge maintainance burden for others.
>>> 
>>> After reading the code a little more, I wonder what would be best strategy. 
>>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
>>> For example, all the module_alloc cannot work with huge pages at the moment.
>>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
>>> powerpc with 5.17 kernel as-is? (trace attached below) 
>>> 
>>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
>>> However, given there are so many users of vmalloc, vzalloc, etc., we 
>>> probably do need a flag for the user to opt-in? 
>>> 
>>> Does this make sense? Any recommendations are really appreciated. 
>> 
>> I think there is multiple aspects here:
>> 
>> - if we think that the kernel is not ready for hugepage backed vmalloc
>>  in general we need to disable it in powerpc for now.
> 
> Nicholas and Claudio, 
> 
> What do you think about the status of hugepage backed vmalloc on powerpc? 
> I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
> But I am not aware of users that benefit from huge pages (except vfs hash,
> which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
> current opt-out flag, VM_NO_HUGE_VMAP) make sense to you? 

Could you please share your comments on this? Specifically, does it make
sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think the current
opt-out flag is the better approach, what would be the best practice to find
all the cases that need to opt out?

Thanks,
Song


> Thanks,
> Song
> 
>> - if we think even in the longer run only some users can cope with
>>  hugepage backed vmalloc we need to turn it into an opt-in in
>>  general and not just for x86
>> - there still to appear various unresolved underlying x86 specific
>>  issues that need to be fixed either way
>
Claudio Imbrenda April 8, 2022, 10:08 a.m. UTC | #10
On Thu, 7 Apr 2022 19:57:25 +0000
Song Liu <songliubraving@fb.com> wrote:

> Hi Nicholas and Claudio, 
> 
> > On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
> >   
> >> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >> 
> >> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:  
> >>>>> Please fix the underlying issues instead of papering over them and
> >>>>> creating a huge maintainance burden for others.  
> >>> 
> >>> After reading the code a little more, I wonder what would be best strategy. 
> >>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
> >>> For example, all the module_alloc cannot work with huge pages at the moment.
> >>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
> >>> powerpc with 5.17 kernel as-is? (trace attached below) 
> >>> 
> >>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
> >>> However, given there are so many users of vmalloc, vzalloc, etc., we 
> >>> probably do need a flag for the user to opt-in? 
> >>> 
> >>> Does this make sense? Any recommendations are really appreciated.   
> >> 
> >> I think there is multiple aspects here:
> >> 
> >> - if we think that the kernel is not ready for hugepage backed vmalloc
> >>  in general we need to disable it in powerpc for now.  
> > 
> > Nicholas and Claudio, 
> > 
> > What do you think about the status of hugepage backed vmalloc on powerpc? 
> > I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
> > But I am not aware of users that benefit from huge pages (except vfs hash,
> > which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
> > current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?   
> 
> Could you please share your comments on this? Specifically, does it make 
> sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think current
> opt-out flag is better approach, what would be the best practice to find 
> all the cases to opt-out?

An opt-in flag would surely make sense, and it would be more backwards
compatible with existing code. That way each user can decide whether to
fix the code to allow for hugepages, if possible at all. For example,
the case you mentioned for s390 (kvm_s390_pv_alloc_vm) would not be
fixable, because of a hardware limitation (the whole area _must_ be
mapped with 4k pages).

If the consensus were to keep the current opt-out, then I guess each
user would have to check each usage of vmalloc and similar, and see if
anything breaks. To be honest, I think an opt-out would only be
possible after having the opt-in for a (long) while, when most users
would have fixed their code.

In short, I fully support opt-in.

> 
> Thanks,
> Song
> 
> 
> > Thanks,
> > Song
> >   
> >> - if we think even in the longer run only some users can cope with
> >>  hugepage backed vmalloc we need to turn it into an opt-in in
> >>  general and not just for x86
> >> - there still to appear various unresolved underlying x86 specific
> >>  issues that need to be fixed either way  
> >   
>
Song Liu April 8, 2022, 9:22 p.m. UTC | #11
> On Apr 8, 2022, at 3:08 AM, Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> 
> On Thu, 7 Apr 2022 19:57:25 +0000
> Song Liu <songliubraving@fb.com> wrote:
> 
>> Hi Nicholas and Claudio, 
>> 
>>> On Apr 5, 2022, at 4:54 PM, Song Liu <songliubraving@fb.com> wrote:
>>> 
>>>> On Apr 5, 2022, at 12:07 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>>> 
>>>> On Fri, Apr 01, 2022 at 10:22:00PM +0000, Song Liu wrote:  
>>>>>>> Please fix the underlying issues instead of papering over them and
>>>>>>> creating a huge maintainance burden for others.  
>>>>> 
>>>>> After reading the code a little more, I wonder what would be best strategy. 
>>>>> IIUC, most of the kernel is not ready for huge page backed vmalloc memory.
>>>>> For example, all the module_alloc cannot work with huge pages at the moment.
>>>>> And the error Paul Menzel reported in drm_fb_helper.c will probably hit 
>>>>> powerpc with 5.17 kernel as-is? (trace attached below) 
>>>>> 
>>>>> Right now, we have VM_NO_HUGE_VMAP to let a user to opt out of huge pages. 
>>>>> However, given there are so many users of vmalloc, vzalloc, etc., we 
>>>>> probably do need a flag for the user to opt-in? 
>>>>> 
>>>>> Does this make sense? Any recommendations are really appreciated.   
>>>> 
>>>> I think there is multiple aspects here:
>>>> 
>>>> - if we think that the kernel is not ready for hugepage backed vmalloc
>>>> in general we need to disable it in powerpc for now.  
>>> 
>>> Nicholas and Claudio, 
>>> 
>>> What do you think about the status of hugepage backed vmalloc on powerpc? 
>>> I found module_alloc and kvm_s390_pv_alloc_vm() opt-out of huge pages.
>>> But I am not aware of users that benefit from huge pages (except vfs hash,
>>> which was mentioned in 8abddd968a30). Does an opt-in flag (instead of 
>>> current opt-out flag, VM_NO_HUGE_VMAP) make sense to you?   
>> 
>> Could you please share your comments on this? Specifically, does it make 
>> sense to replace VM_NO_HUGE_VMAP with an opt-in flag? If we think current
>> opt-out flag is better approach, what would be the best practice to find 
>> all the cases to opt-out?
> 
> An opt in flag would surely make sense, and it would be more backwards
> compatible with existing code. That way each user can decide whether to
> fix the code to allow for hugepages, if possible at all. For example,
> the case you mentioned for s390 (kvm_s390_pv_alloc_vm) would not be
> fixable, because of a hardware limitation (the whole area _must_ be
> mapped with 4k pages)
> 
> If the consensus were to keep the current opt-put, then I guess each
> user would have to check each usage of vmalloc and similar, and see if
> anything breaks. To be honest, I think an opt-out would only be
> possible after having the opt-in for a (long) while, when most users
> would have fixed their code.
> 
> In short, I fully support opt-in.

Thanks Claudio!

I will prepare patches to replace VM_NO_HUGE_VMAP with an opt-in flag,
and use the new flag in BPF. Please let me know any comments/suggestions
on this direction.

Song