Message ID | 20230601101257.530867-1-rppt@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | mm: jit/text allocator | expand |
Hi Mike, On Thu, Jun 01, 2023 at 01:12:44PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > Hi, > > module_alloc() is used everywhere as a mean to allocate memory for code. > > Beside being semantically wrong, this unnecessarily ties all subsystmes > that need to allocate code, such as ftrace, kprobes and BPF to modules > and puts the burden of code allocation to the modules code. I agree this is a problem, and one key issue here is that these can have different requirements. For example, on arm64 we need modules to be placed within a 128M or 2G window containing the kernel, whereas it would be safe for the kprobes XOL area to be placed arbitrarily far from the kernel image (since we don't allow PC-relative insns to be stepped out-of-line). Likewise arm64 doesn't have ftrace trampolines, and DIRECT_CALL trampolines can safely be placed arbitarily far from the kernel image. For a while I have wanted to give kprobes its own allocator so that it can work even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in the modules area. Given that, I think these should have their own allocator functions that can be provided independently, even if those happen to use common infrastructure. > Several architectures override module_alloc() because of various > constraints where the executable memory can be located and this causes > additional obstacles for improvements of code allocation. > > This set splits code allocation from modules by introducing > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > sites of module_alloc() and module_memfree() with the new APIs and > implements core text and related allocation in a central place. > > Instead of architecture specific overrides for module_alloc(), the > architectures that require non-default behaviour for text allocation must > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > returns a pointer to that structure. If an architecture does not implement > jit_alloc_arch_params(), the defaults compatible with the current > modules::module_alloc() are used. As above, I suspect that each of the callsites should probably be using common infrastructure, but I don't think that a single jit_alloc_arch_params() makes sense, since the parameters for each case may need to be distinct. > The new jitalloc infrastructure allows decoupling of kprobes and ftrace > from modules, and most importantly it enables ROX allocations for > executable memory. > > A centralized infrastructure for code allocation allows future > optimizations for allocations of executable memory, caching large pages for > better iTLB performance and providing sub-page allocations for users that > only need small jit code snippets. This sounds interesting, but I think this can be achieved without requiring a single jit_alloc_arch_params() shared by all users? Thanks, Mark. > > patches 1-5: split out the code allocation from modules and arch > patch 6: add dedicated API for data allocations with constraints similar to > code allocations > patches 7-9: decouple dynamic ftrace and kprobes form CONFIG_MODULES > patches 10-13: enable ROX allocations for executable memory on x86 > > Mike Rapoport (IBM) (11): > nios2: define virtual address space for modules > mm: introduce jit_text_alloc() and use it instead of module_alloc() > mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc > mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc > module, jitalloc: drop module_alloc > mm/jitalloc: introduce jit_data_alloc() > x86/ftrace: enable dynamic ftrace without CONFIG_MODULES > arch: make jitalloc setup available regardless of CONFIG_MODULES > kprobes: remove dependcy on CONFIG_MODULES > modules, jitalloc: prepare to allocate executable memory as ROX > x86/jitalloc: make memory allocated for code ROX > > Song Liu (2): > ftrace: Add swap_func to ftrace_process_locs() > x86/jitalloc: prepare to allocate exectuatble memory as ROX > > arch/Kconfig | 5 +- > arch/arm/kernel/module.c | 32 ------ > arch/arm/mm/init.c | 35 ++++++ > arch/arm64/kernel/module.c | 47 -------- > arch/arm64/mm/init.c | 42 +++++++ > arch/loongarch/kernel/module.c | 6 - > arch/loongarch/mm/init.c | 16 +++ > arch/mips/kernel/module.c | 9 -- > arch/mips/mm/init.c | 19 ++++ > arch/nios2/include/asm/pgtable.h | 5 +- > arch/nios2/kernel/module.c | 24 ++-- > arch/parisc/kernel/module.c | 11 -- > arch/parisc/mm/init.c | 21 +++- > arch/powerpc/kernel/kprobes.c | 4 +- > arch/powerpc/kernel/module.c | 37 ------- > arch/powerpc/mm/mem.c | 41 +++++++ > arch/riscv/kernel/module.c | 10 -- > arch/riscv/mm/init.c | 18 +++ > arch/s390/kernel/ftrace.c | 4 +- > arch/s390/kernel/kprobes.c | 4 +- > arch/s390/kernel/module.c | 46 +------- > arch/s390/mm/init.c | 35 ++++++ > arch/sparc/kernel/module.c | 34 +----- > arch/sparc/mm/Makefile | 2 + > arch/sparc/mm/jitalloc.c | 21 ++++ > arch/sparc/net/bpf_jit_comp_32.c | 8 +- > arch/x86/Kconfig | 2 + > arch/x86/kernel/alternative.c | 43 ++++--- > arch/x86/kernel/ftrace.c | 59 +++++----- > arch/x86/kernel/kprobes/core.c | 4 +- > arch/x86/kernel/module.c | 75 +------------ > arch/x86/kernel/static_call.c | 10 +- > arch/x86/kernel/unwind_orc.c | 13 ++- > arch/x86/mm/init.c | 52 +++++++++ > arch/x86/net/bpf_jit_comp.c | 22 +++- > include/linux/ftrace.h | 2 + > include/linux/jitalloc.h | 69 ++++++++++++ > include/linux/moduleloader.h | 15 --- > kernel/bpf/core.c | 14 +-- > kernel/kprobes.c | 51 +++++---- > kernel/module/Kconfig | 1 + > kernel/module/main.c | 56 ++++------ > kernel/trace/ftrace.c | 13 ++- > kernel/trace/trace_kprobe.c | 11 ++ > mm/Kconfig | 3 + > mm/Makefile | 1 + > mm/jitalloc.c | 185 +++++++++++++++++++++++++++++++ > mm/mm_init.c | 2 + > 48 files changed, 777 insertions(+), 462 deletions(-) > create mode 100644 arch/sparc/mm/jitalloc.c > create mode 100644 include/linux/jitalloc.h > create mode 100644 mm/jitalloc.c > > > base-commit: 44c026a73be8038f03dbdeef028b642880cf1511 > -- > 2.35.1 > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > For a while I have wanted to give kprobes its own allocator so that it can work > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > the modules area. > > Given that, I think these should have their own allocator functions that can be > provided independently, even if those happen to use common infrastructure. How much memory can kprobes conceivably use? I think we also want to try to push back on combinatorial new allocators, if we can. > > Several architectures override module_alloc() because of various > > constraints where the executable memory can be located and this causes > > additional obstacles for improvements of code allocation. > > > > This set splits code allocation from modules by introducing > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > sites of module_alloc() and module_memfree() with the new APIs and > > implements core text and related allocation in a central place. > > > > Instead of architecture specific overrides for module_alloc(), the > > architectures that require non-default behaviour for text allocation must > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > returns a pointer to that structure. If an architecture does not implement > > jit_alloc_arch_params(), the defaults compatible with the current > > modules::module_alloc() are used. > > As above, I suspect that each of the callsites should probably be using common > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > sense, since the parameters for each case may need to be distinct. I don't see how that follows. The whole point of function parameters is that they may be different :) Can you give more detail on what parameters you need? If the only extra parameter is just "does this allocation need to live close to kernel text", that's not that big of a deal.
On Thu, Jun 1, 2023 at 3:13 AM Mike Rapoport <rppt@kernel.org> wrote: > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > Hi, > > module_alloc() is used everywhere as a mean to allocate memory for code. > > Beside being semantically wrong, this unnecessarily ties all subsystmes > that need to allocate code, such as ftrace, kprobes and BPF to modules > and puts the burden of code allocation to the modules code. > > Several architectures override module_alloc() because of various > constraints where the executable memory can be located and this causes > additional obstacles for improvements of code allocation. > > This set splits code allocation from modules by introducing > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > sites of module_alloc() and module_memfree() with the new APIs and > implements core text and related allocation in a central place. > > Instead of architecture specific overrides for module_alloc(), the > architectures that require non-default behaviour for text allocation must > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > returns a pointer to that structure. If an architecture does not implement > jit_alloc_arch_params(), the defaults compatible with the current > modules::module_alloc() are used. > > The new jitalloc infrastructure allows decoupling of kprobes and ftrace > from modules, and most importantly it enables ROX allocations for > executable memory. This set does look cleaner than my version [1]. However, this is partially because this set only separates text and data; while [1] also separates rw data, ro data, and ro_after_init data. We need such separation to fully cover module usage, and to remove VM_FLUSH_RESET_PERMS. Once we add these logic to this set, the two versions will look similar. OTOH, I do like the fact this version enables kprobes (and potentially ftrace and bpf) without CONFIG_MODULES. And mm/ seems a better home for the logic. That being said, besides comments in a few patches, this version looks good to me. With the fix I suggested for patch 12/13, it passed my tests on x86_64 with modules, kprobes, ftrace, and BPF. If we decided to ship this version, I would appreciate it if I could get more credit for my work in [1] and research work before that. Thanks, Song [1] https://lore.kernel.org/lkml/20230526051529.3387103-1-song@kernel.org/
On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > For a while I have wanted to give kprobes its own allocator so that it can work > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > the modules area. > > > > Given that, I think these should have their own allocator functions that can be > > provided independently, even if those happen to use common infrastructure. > > How much memory can kprobes conceivably use? I think we also want to try > to push back on combinatorial new allocators, if we can. That depends on who's using it, and how (e.g. via BPF). To be clear, I'm not necessarily asking for entirely different allocators, but I do thinkg that we want wrappers that can at least pass distinct start+end parameters to a common allocator, and for arm64's modules code I'd expect that we'd keep the range falblack logic out of the common allcoator, and just call it twice. > > > Several architectures override module_alloc() because of various > > > constraints where the executable memory can be located and this causes > > > additional obstacles for improvements of code allocation. > > > > > > This set splits code allocation from modules by introducing > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > sites of module_alloc() and module_memfree() with the new APIs and > > > implements core text and related allocation in a central place. > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > architectures that require non-default behaviour for text allocation must > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > returns a pointer to that structure. If an architecture does not implement > > > jit_alloc_arch_params(), the defaults compatible with the current > > > modules::module_alloc() are used. > > > > As above, I suspect that each of the callsites should probably be using common > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > sense, since the parameters for each case may need to be distinct. > > I don't see how that follows. The whole point of function parameters is > that they may be different :) What I mean is that jit_alloc_arch_params() tries to aggregate common parameters, but they aren't actually common (e.g. the actual start+end range for allocation). > Can you give more detail on what parameters you need? If the only extra > parameter is just "does this allocation need to live close to kernel > text", that's not that big of a deal. My thinking was that we at least need the start + end for each caller. That might be it, tbh. Thanks, Mark.
On Fri, Jun 2, 2023 at 2:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > > For a while I have wanted to give kprobes its own allocator so that it can work > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > > the modules area. > > > > > > Given that, I think these should have their own allocator functions that can be > > > provided independently, even if those happen to use common infrastructure. > > > > How much memory can kprobes conceivably use? I think we also want to try > > to push back on combinatorial new allocators, if we can. > > That depends on who's using it, and how (e.g. via BPF). > > To be clear, I'm not necessarily asking for entirely different allocators, but > I do thinkg that we want wrappers that can at least pass distinct start+end > parameters to a common allocator, and for arm64's modules code I'd expect that > we'd keep the range falblack logic out of the common allcoator, and just call > it twice. > > > > > Several architectures override module_alloc() because of various > > > > constraints where the executable memory can be located and this causes > > > > additional obstacles for improvements of code allocation. > > > > > > > > This set splits code allocation from modules by introducing > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > > sites of module_alloc() and module_memfree() with the new APIs and > > > > implements core text and related allocation in a central place. > > > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > > architectures that require non-default behaviour for text allocation must > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > > returns a pointer to that structure. If an architecture does not implement > > > > jit_alloc_arch_params(), the defaults compatible with the current > > > > modules::module_alloc() are used. > > > > > > As above, I suspect that each of the callsites should probably be using common > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > > sense, since the parameters for each case may need to be distinct. > > > > I don't see how that follows. The whole point of function parameters is > > that they may be different :) > > What I mean is that jit_alloc_arch_params() tries to aggregate common > parameters, but they aren't actually common (e.g. the actual start+end range > for allocation). > > > Can you give more detail on what parameters you need? If the only extra > > parameter is just "does this allocation need to live close to kernel > > text", that's not that big of a deal. > > My thinking was that we at least need the start + end for each caller. That > might be it, tbh. IIUC, arm64 uses VMALLOC address space for BPF programs. The reason is each BPF program uses at least 64kB (one page) out of the 128MB address space. Puranjay Mohan (CC'ed) is working on enabling bpf_prog_pack for arm64. Once this work is done, multiple BPF programs will be able to share a page. Will this improvement remove the need to specify a different address range for BPF programs? Thanks, Song
On Fri, Jun 2, 2023 at 8:21 PM Song Liu <song@kernel.org> wrote: > > On Fri, Jun 2, 2023 at 2:35 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > > > For a while I have wanted to give kprobes its own allocator so that it can work > > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > > > the modules area. > > > > > > > > Given that, I think these should have their own allocator functions that can be > > > > provided independently, even if those happen to use common infrastructure. > > > > > > How much memory can kprobes conceivably use? I think we also want to try > > > to push back on combinatorial new allocators, if we can. > > > > That depends on who's using it, and how (e.g. via BPF). > > > > To be clear, I'm not necessarily asking for entirely different allocators, but > > I do thinkg that we want wrappers that can at least pass distinct start+end > > parameters to a common allocator, and for arm64's modules code I'd expect that > > we'd keep the range falblack logic out of the common allcoator, and just call > > it twice. > > > > > > > Several architectures override module_alloc() because of various > > > > > constraints where the executable memory can be located and this causes > > > > > additional obstacles for improvements of code allocation. > > > > > > > > > > This set splits code allocation from modules by introducing > > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > > > sites of module_alloc() and module_memfree() with the new APIs and > > > > > implements core text and related allocation in a central place. > > > > > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > > > architectures that require non-default behaviour for text allocation must > > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > > > returns a pointer to that structure. If an architecture does not implement > > > > > jit_alloc_arch_params(), the defaults compatible with the current > > > > > modules::module_alloc() are used. > > > > > > > > As above, I suspect that each of the callsites should probably be using common > > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > > > sense, since the parameters for each case may need to be distinct. > > > > > > I don't see how that follows. The whole point of function parameters is > > > that they may be different :) > > > > What I mean is that jit_alloc_arch_params() tries to aggregate common > > parameters, but they aren't actually common (e.g. the actual start+end range > > for allocation). > > > > > Can you give more detail on what parameters you need? If the only extra > > > parameter is just "does this allocation need to live close to kernel > > > text", that's not that big of a deal. > > > > My thinking was that we at least need the start + end for each caller. That > > might be it, tbh. > > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason > is each BPF program uses at least 64kB (one page) out of the 128MB > address space. Puranjay Mohan (CC'ed) is working on enabling > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs > will be able to share a page. Will this improvement remove the need to > specify a different address range for BPF programs? Hi, Thanks for adding me to the conversation. The ARM64 BPF JIT used to allocate the memory using module_alloc but it was not optimal because BPF programs and modules were sharing the 128 MB module region. This was fixed by 91fc957c9b1d ("arm64/bpf: don't allocate BPF JIT programs in module memory") It created a dedicated 128 MB region set aside for BPF programs. But 128MB could get exhausted especially where PAGE_SIZE is 64KB - one page is needed per program. This restriction was removed by b89ddf4cca43 ("arm64/bpf: Remove 128MB limit for BPF JIT programs") So, currently BPF programs are using a full page from vmalloc (4 KB, 16 KB, or 64 KB). This wastes memory and also causes iTLB pressure. Enabling bpf_prog_pack for ARM64 would fix it. I am doing some final tests and will send the patches in 1-2 days. Thanks, Puranjay
On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote: > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason > is each BPF program uses at least 64kB (one page) out of the 128MB > address space. Puranjay Mohan (CC'ed) is working on enabling > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs > will be able to share a page. Will this improvement remove the need to > specify a different address range for BPF programs? Can we please stop working on BPF specific sub page allocation and focus on doing this in mm/? This never should have been in BPF in the first place.
On Sun, Jun 4, 2023 at 11:02 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote: > > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason > > is each BPF program uses at least 64kB (one page) out of the 128MB > > address space. Puranjay Mohan (CC'ed) is working on enabling > > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs > > will be able to share a page. Will this improvement remove the need to > > specify a different address range for BPF programs? > > Can we please stop working on BPF specific sub page allocation and focus > on doing this in mm/? This never should have been in BPF in the first > place. That work is mostly independent of the allocator work we are discussing here. The goal Puranjay's work is to enable the arm64 BPF JIT engine to use a ROX allocator. The allocator could be the bpf_prog_pack allocator, or jitalloc, or module_alloc_type. Puranjay is using bpf_prog_alloc for now. But once jitalloc or module_alloc_type (either one) is merged, we will migrate BPF JIT engines (x86_64 and arm64) to the new allocator and then tear down bpf_prog_pack. Does this make sense? Thanks, Song
On Sun, Jun 04, 2023 at 02:22:30PM -0700, Song Liu wrote: > On Sun, Jun 4, 2023 at 11:02 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote: > > > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason > > > is each BPF program uses at least 64kB (one page) out of the 128MB > > > address space. Puranjay Mohan (CC'ed) is working on enabling > > > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs > > > will be able to share a page. Will this improvement remove the need to > > > specify a different address range for BPF programs? > > > > Can we please stop working on BPF specific sub page allocation and focus > > on doing this in mm/? This never should have been in BPF in the first > > place. > > That work is mostly independent of the allocator work we are discussing here. > The goal Puranjay's work is to enable the arm64 BPF JIT engine to use a > ROX allocator. The allocator could be the bpf_prog_pack allocator, or jitalloc, > or module_alloc_type. Puranjay is using bpf_prog_alloc for now. But once > jitalloc or module_alloc_type (either one) is merged, we will migrate BPF > JIT engines (x86_64 and arm64) to the new allocator and then tear down > bpf_prog_pack. > > Does this make sense? Yeah, as long as that's the plan. Maybe one of you could tell us what issues were preventing prog_pack from being used in the first place, it might be relevant - this is the time to get the new allocator API right.
On Sun, Jun 4, 2023 at 2:40 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Sun, Jun 04, 2023 at 02:22:30PM -0700, Song Liu wrote: > > On Sun, Jun 4, 2023 at 11:02 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Fri, Jun 02, 2023 at 11:20:58AM -0700, Song Liu wrote: > > > > IIUC, arm64 uses VMALLOC address space for BPF programs. The reason > > > > is each BPF program uses at least 64kB (one page) out of the 128MB > > > > address space. Puranjay Mohan (CC'ed) is working on enabling > > > > bpf_prog_pack for arm64. Once this work is done, multiple BPF programs > > > > will be able to share a page. Will this improvement remove the need to > > > > specify a different address range for BPF programs? > > > > > > Can we please stop working on BPF specific sub page allocation and focus > > > on doing this in mm/? This never should have been in BPF in the first > > > place. > > > > That work is mostly independent of the allocator work we are discussing here. > > The goal Puranjay's work is to enable the arm64 BPF JIT engine to use a > > ROX allocator. The allocator could be the bpf_prog_pack allocator, or jitalloc, > > or module_alloc_type. Puranjay is using bpf_prog_alloc for now. But once > > jitalloc or module_alloc_type (either one) is merged, we will migrate BPF > > JIT engines (x86_64 and arm64) to the new allocator and then tear down > > bpf_prog_pack. > > > > Does this make sense? > > Yeah, as long as that's the plan. Maybe one of you could tell us what > issues were preventing prog_pack from being used in the first place, it > might be relevant - this is the time to get the new allocator API right. The JIT engine does a lot of writes. Instead of doing many text_poke(), we are using a temporary RW write buffer, and then do text_poke_copy() at the end. To make this work, we need the JIT engine to be able to handle an RW temporary buffer and an RO final memory region. There is nothing preventing prog_pack to work. It is just we need to do the work. Thanks, Song
On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote: > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > > For a while I have wanted to give kprobes its own allocator so that it can work > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > > the modules area. > > > > > > Given that, I think these should have their own allocator functions that can be > > > provided independently, even if those happen to use common infrastructure. > > > > How much memory can kprobes conceivably use? I think we also want to try > > to push back on combinatorial new allocators, if we can. > > That depends on who's using it, and how (e.g. via BPF). > > To be clear, I'm not necessarily asking for entirely different allocators, but > I do thinkg that we want wrappers that can at least pass distinct start+end > parameters to a common allocator, and for arm64's modules code I'd expect that > we'd keep the range falblack logic out of the common allcoator, and just call > it twice. > > > > > Several architectures override module_alloc() because of various > > > > constraints where the executable memory can be located and this causes > > > > additional obstacles for improvements of code allocation. > > > > > > > > This set splits code allocation from modules by introducing > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > > sites of module_alloc() and module_memfree() with the new APIs and > > > > implements core text and related allocation in a central place. > > > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > > architectures that require non-default behaviour for text allocation must > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > > returns a pointer to that structure. If an architecture does not implement > > > > jit_alloc_arch_params(), the defaults compatible with the current > > > > modules::module_alloc() are used. > > > > > > As above, I suspect that each of the callsites should probably be using common > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > > sense, since the parameters for each case may need to be distinct. > > > > I don't see how that follows. The whole point of function parameters is > > that they may be different :) > > What I mean is that jit_alloc_arch_params() tries to aggregate common > parameters, but they aren't actually common (e.g. the actual start+end range > for allocation). jit_alloc_arch_params() tries to aggregate architecture constraints and requirements for allocations of executable memory and this exactly what the first 6 patches of this set do. A while ago Thomas suggested to use a structure that parametrizes architecture constraints by the memory type used in modules [1] and Song implemented the infrastructure for it and x86 part [2]. I liked the idea of defining parameters in a single structure, but I thought that approaching the problem from the arch side rather than from modules perspective will be better starting point, hence these patches. I don't see a fundamental reason why a single structure cannot describe what is needed for different code allocation cases, be it modules, kprobes or bpf. There is of course an assumption that the core allocations will be the same for all the users, and it seems to me that something like * allocate physical memory if allocator caches are empty * map it in vmalloc or modules address space * return memory from the allocator cache to the caller will work for all usecases. We might need separate caches for different cases on different architectures, and a way to specify what cache should be used in the allocator API, but that does not contradict a single structure for arch specific parameters, but only makes it more elaborate, e.g. something like enum jit_type { JIT_MODULES_TEXT, JIT_MODULES_DATA, JIT_KPROBES, JIT_FTRACE, JIT_BPF, JIT_TYPE_MAX, }; struct jit_alloc_params { struct jit_range ranges[JIT_TYPE_MAX]; /* ... */ }; > > Can you give more detail on what parameters you need? If the only extra > > parameter is just "does this allocation need to live close to kernel > > text", that's not that big of a deal. > > My thinking was that we at least need the start + end for each caller. That > might be it, tbh. Do you mean that modules will have something like jit_text_alloc(size, MODULES_START, MODULES_END); and kprobes will have jit_text_alloc(size, KPROBES_START, KPROBES_END); ? It sill can be achieved with a single jit_alloc_arch_params(), just by adding enum jit_type parameter to jit_text_alloc(). [1] https://lore.kernel.org/linux-mm/87v8mndy3y.ffs@tglx/ [2] https://lore.kernel.org/all/20230526051529.3387103-1-song@kernel.org > Thanks, > Mark.
On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote: > On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote: > > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > > > For a while I have wanted to give kprobes its own allocator so that it can work > > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > > > the modules area. > > > > > > > > Given that, I think these should have their own allocator functions that can be > > > > provided independently, even if those happen to use common infrastructure. > > > > > > How much memory can kprobes conceivably use? I think we also want to try > > > to push back on combinatorial new allocators, if we can. > > > > That depends on who's using it, and how (e.g. via BPF). > > > > To be clear, I'm not necessarily asking for entirely different allocators, but > > I do thinkg that we want wrappers that can at least pass distinct start+end > > parameters to a common allocator, and for arm64's modules code I'd expect that > > we'd keep the range falblack logic out of the common allcoator, and just call > > it twice. > > > > > > > Several architectures override module_alloc() because of various > > > > > constraints where the executable memory can be located and this causes > > > > > additional obstacles for improvements of code allocation. > > > > > > > > > > This set splits code allocation from modules by introducing > > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > > > sites of module_alloc() and module_memfree() with the new APIs and > > > > > implements core text and related allocation in a central place. > > > > > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > > > architectures that require non-default behaviour for text allocation must > > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > > > returns a pointer to that structure. If an architecture does not implement > > > > > jit_alloc_arch_params(), the defaults compatible with the current > > > > > modules::module_alloc() are used. > > > > > > > > As above, I suspect that each of the callsites should probably be using common > > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > > > sense, since the parameters for each case may need to be distinct. > > > > > > I don't see how that follows. The whole point of function parameters is > > > that they may be different :) > > > > What I mean is that jit_alloc_arch_params() tries to aggregate common > > parameters, but they aren't actually common (e.g. the actual start+end range > > for allocation). > > jit_alloc_arch_params() tries to aggregate architecture constraints and > requirements for allocations of executable memory and this exactly what > the first 6 patches of this set do. > > A while ago Thomas suggested to use a structure that parametrizes > architecture constraints by the memory type used in modules [1] and Song > implemented the infrastructure for it and x86 part [2]. > > I liked the idea of defining parameters in a single structure, but I > thought that approaching the problem from the arch side rather than from > modules perspective will be better starting point, hence these patches. > > I don't see a fundamental reason why a single structure cannot describe > what is needed for different code allocation cases, be it modules, kprobes > or bpf. There is of course an assumption that the core allocations will be > the same for all the users, and it seems to me that something like > > * allocate physical memory if allocator caches are empty > * map it in vmalloc or modules address space > * return memory from the allocator cache to the caller > > will work for all usecases. > > We might need separate caches for different cases on different > architectures, and a way to specify what cache should be used in the > allocator API, but that does not contradict a single structure for arch > specific parameters, but only makes it more elaborate, e.g. something like > > enum jit_type { > JIT_MODULES_TEXT, > JIT_MODULES_DATA, > JIT_KPROBES, > JIT_FTRACE, > JIT_BPF, > JIT_TYPE_MAX, > }; > > struct jit_alloc_params { > struct jit_range ranges[JIT_TYPE_MAX]; > /* ... */ > }; > > > > Can you give more detail on what parameters you need? If the only extra > > > parameter is just "does this allocation need to live close to kernel > > > text", that's not that big of a deal. > > > > My thinking was that we at least need the start + end for each caller. That > > might be it, tbh. > > Do you mean that modules will have something like > > jit_text_alloc(size, MODULES_START, MODULES_END); > > and kprobes will have > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > ? Yes. > It sill can be achieved with a single jit_alloc_arch_params(), just by > adding enum jit_type parameter to jit_text_alloc(). That feels backwards to me; it centralizes a bunch of information about distinct users to be able to shove that into a static array, when the callsites can pass that information. What's *actually* common after separating out the ranges? Is it just the permissions? If we want this to be able to share allocations and so on, why can't we do this like a kmem_cache, and have the callsite pass a pointer to the allocator data? That would make it easy for callsites to share an allocator or use a distinct one. Thanks, Mark. > [1] https://lore.kernel.org/linux-mm/87v8mndy3y.ffs@tglx/ > [2] https://lore.kernel.org/all/20230526051529.3387103-1-song@kernel.org > > > Thanks, > > Mark. > > -- > Sincerely yours, > Mike.
On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote: > On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote: > > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > > > For a while I have wanted to give kprobes its own allocator so that it can work > > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > > > the modules area. > > > > > > > > Given that, I think these should have their own allocator functions that can be > > > > provided independently, even if those happen to use common infrastructure. > > > > > > How much memory can kprobes conceivably use? I think we also want to try > > > to push back on combinatorial new allocators, if we can. > > > > That depends on who's using it, and how (e.g. via BPF). > > > > To be clear, I'm not necessarily asking for entirely different allocators, but > > I do thinkg that we want wrappers that can at least pass distinct start+end > > parameters to a common allocator, and for arm64's modules code I'd expect that > > we'd keep the range falblack logic out of the common allcoator, and just call > > it twice. > > > > > > > Several architectures override module_alloc() because of various > > > > > constraints where the executable memory can be located and this causes > > > > > additional obstacles for improvements of code allocation. > > > > > > > > > > This set splits code allocation from modules by introducing > > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > > > sites of module_alloc() and module_memfree() with the new APIs and > > > > > implements core text and related allocation in a central place. > > > > > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > > > architectures that require non-default behaviour for text allocation must > > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > > > returns a pointer to that structure. If an architecture does not implement > > > > > jit_alloc_arch_params(), the defaults compatible with the current > > > > > modules::module_alloc() are used. > > > > > > > > As above, I suspect that each of the callsites should probably be using common > > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > > > sense, since the parameters for each case may need to be distinct. > > > > > > I don't see how that follows. The whole point of function parameters is > > > that they may be different :) > > > > What I mean is that jit_alloc_arch_params() tries to aggregate common > > parameters, but they aren't actually common (e.g. the actual start+end range > > for allocation). > > jit_alloc_arch_params() tries to aggregate architecture constraints and > requirements for allocations of executable memory and this exactly what > the first 6 patches of this set do. > > A while ago Thomas suggested to use a structure that parametrizes > architecture constraints by the memory type used in modules [1] and Song > implemented the infrastructure for it and x86 part [2]. > > I liked the idea of defining parameters in a single structure, but I > thought that approaching the problem from the arch side rather than from > modules perspective will be better starting point, hence these patches. > > I don't see a fundamental reason why a single structure cannot describe > what is needed for different code allocation cases, be it modules, kprobes > or bpf. There is of course an assumption that the core allocations will be > the same for all the users, and it seems to me that something like > > * allocate physical memory if allocator caches are empty > * map it in vmalloc or modules address space > * return memory from the allocator cache to the caller > > will work for all usecases. > > We might need separate caches for different cases on different > architectures, and a way to specify what cache should be used in the > allocator API, but that does not contradict a single structure for arch > specific parameters, but only makes it more elaborate, e.g. something like > > enum jit_type { > JIT_MODULES_TEXT, > JIT_MODULES_DATA, > JIT_KPROBES, > JIT_FTRACE, > JIT_BPF, > JIT_TYPE_MAX, > }; Why would we actually need different enums for modules_text, kprobes, ftrace and bpf? Why can't we treat all text allocations the same? The reason we can't do that currently is because modules need to go in a 128Mb region on some archs, and without sub page allocation bpf/kprobes/etc. burn a full page for each allocation. But we're doing sub page allocation - right? That leaves module data - which really needs to be split out into rw, ro, ro_after_init - but I'm not sure we'd even want the same API for those, they need fairly different page permissions handling.
On Mon, Jun 05, 2023 at 11:09:34AM +0100, Mark Rutland wrote: > On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote: > > On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote: > > > > It sill can be achieved with a single jit_alloc_arch_params(), just by > > adding enum jit_type parameter to jit_text_alloc(). > > That feels backwards to me; it centralizes a bunch of information about > distinct users to be able to shove that into a static array, when the callsites > can pass that information. The goal was not to shove everything into an array, but centralize architecture requirements for code allocations. The callsites don't have that information per se, they get it from the arch code, so having this information in a single place per arch is better than spreading MODULE_START, KPROBES_START etc all over. I'd agree though that having types for jit_text_alloc is ugly and this should be handled differently. > What's *actually* common after separating out the ranges? Is it just the > permissions? On x86 everything, on arm64 apparently just the permissions. I've started to summarize what are the restrictions for code placement for modules, kprobes and bpf on different architectures, that's roughly what I've got so far: * x86 and s390 need everything within modules address space because of PC-relative * arm, arm64, loongarch, sparc64, riscv64, some of mips and powerpc32 configurations require a dedicated modules address space; the rest just use vmalloc address space * all architectures that support kprobes except x86 and s390 don't use relative jumps, so they don't care where kprobes insn_page will live * not sure yet about BPF. Looks like on arm and arm64 it does not use relative jumps, so it can be anywhere, didn't dig enough about the others. > If we want this to be able to share allocations and so on, why can't we do this > like a kmem_cache, and have the callsite pass a pointer to the allocator data? > That would make it easy for callsites to share an allocator or use a distinct > one. This maybe something worth exploring. > Thanks, > Mark.
On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote: [...] > > > > Can you give more detail on what parameters you need? If the only extra > > > > parameter is just "does this allocation need to live close to kernel > > > > text", that's not that big of a deal. > > > > > > My thinking was that we at least need the start + end for each caller. That > > > might be it, tbh. > > > > Do you mean that modules will have something like > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > and kprobes will have > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > ? > > Yes. How about we start with two APIs: jit_text_alloc(size); jit_text_alloc_range(size, start, end); AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am not quite convinced it is needed. > > > It sill can be achieved with a single jit_alloc_arch_params(), just by > > adding enum jit_type parameter to jit_text_alloc(). > > That feels backwards to me; it centralizes a bunch of information about > distinct users to be able to shove that into a static array, when the callsites > can pass that information. I think we only two type of users: module and everything else (ftrace, kprobe, bpf stuff). The key differences are: 1. module uses text and data; while everything else only uses text. 2. module code is generated by the compiler, and thus has stronger requirements in address ranges; everything else are generated via some JIT or manual written assembly, so they are more flexible with address ranges (in JIT, we can avoid using instructions that requires a specific address range). The next question is, can we have the two types of users share the same address ranges? If not, we can reserve the preferred range for modules, and let everything else use the other range. I don't see reasons to further separate users in the "everything else" group. > > What's *actually* common after separating out the ranges? Is it just the > permissions? I believe permission is the key, as we need the hardware to enforce permission. > > If we want this to be able to share allocations and so on, why can't we do this > like a kmem_cache, and have the callsite pass a pointer to the allocator data? > That would make it easy for callsites to share an allocator or use a distinct > one. Sharing among different call sites will give us more benefit (in TLB misses rate, etc.). For example, a 2MB page may host text of two kernel modules, 4 kprobes, 6 ftrace trampolines, and 10 BPF programs. All of these only require one entry in the iTLB. Thanks, Song
On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote: > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote: > > [...] > > > > > > Can you give more detail on what parameters you need? If the only extra > > > > > parameter is just "does this allocation need to live close to kernel > > > > > text", that's not that big of a deal. > > > > > > > > My thinking was that we at least need the start + end for each caller. That > > > > might be it, tbh. > > > > > > Do you mean that modules will have something like > > > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > > > and kprobes will have > > > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > > ? > > > > Yes. > > How about we start with two APIs: > jit_text_alloc(size); > jit_text_alloc_range(size, start, end); > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am > not quite convinced it is needed. Right now arm64 and riscv override bpf and kprobes allocations to use the entire vmalloc address space, but having the ability to allocate generated code outside of modules area may be useful for other architectures. Still the start + end for the callers feels backwards to me because the callers do not define the ranges, but rather the architectures, so we still need a way for architectures to define how they want allocate memory for the generated code. > > > It sill can be achieved with a single jit_alloc_arch_params(), just by > > > adding enum jit_type parameter to jit_text_alloc(). > > > > That feels backwards to me; it centralizes a bunch of information about > > distinct users to be able to shove that into a static array, when the callsites > > can pass that information. > > I think we only two type of users: module and everything else (ftrace, kprobe, > bpf stuff). The key differences are: > > 1. module uses text and data; while everything else only uses text. > 2. module code is generated by the compiler, and thus has stronger > requirements in address ranges; everything else are generated via some > JIT or manual written assembly, so they are more flexible with address > ranges (in JIT, we can avoid using instructions that requires a specific > address range). > > The next question is, can we have the two types of users share the same > address ranges? If not, we can reserve the preferred range for modules, > and let everything else use the other range. I don't see reasons to further > separate users in the "everything else" group. I agree that we can define only two types: modules and everything else and let the architectures define if they need different ranges for these two types, or want the same range for everything. With only two types we can have two API calls for alloc, and a single structure that defines the ranges etc from the architecture side rather than spread all over. Like something along these lines: struct execmem_range { unsigned long start; unsigned long end; unsigned long fallback_start; unsigned long fallback_end; pgprot_t pgprot; unsigned int alignment; }; struct execmem_modules_range { enum execmem_module_flags flags; struct execmem_range text; struct execmem_range data; }; struct execmem_jit_range { struct execmem_range text; }; struct execmem_params { struct execmem_modules_range modules; struct execmem_jit_range jit; }; struct execmem_params *execmem_arch_params(void); void *execmem_text_alloc(size_t size); void *execmem_data_alloc(size_t size); void execmem_free(void *ptr); void *jit_text_alloc(size_t size); void jit_free(void *ptr); Modules or anything that must live close to the kernel image can use execmem_*_alloc() and the callers that don't generally care about relative addressing will use jit_text_alloc(), presuming that arch will restrict jit range if necessary, like e.g. below for arm64 jit can be anywhere in vmalloc and for x86 and s390 it will share the modules range. struct execmem_params arm64_execmem = { .modules = { .flags = KASAN, .text = { .start = MODULES_VADDR, .end = MODULES_END, .pgprot = PAGE_KERNEL_ROX, .fallback_start = VMALLOC_START, .fallback_start = VMALLOC_END, }, }, .jit = { .text = { .start = VMALLOC_START, .end = VMALLOC_END, .pgprot = PAGE_KERNEL_ROX, }, }, }; /* x86 and s390 */ struct execmem_params cisc_execmem = { .modules = { .flags = KASAN, .text = { .start = MODULES_VADDR, .end = MODULES_END, .pgprot = PAGE_KERNEL_ROX, }, }, .jit_range = {}, /* impplies reusing .modules */ }; struct execmem_params default_execmem = { .modules = { .flags = KASAN, .text = { .start = VMALLOC_START, .end = VMALLOC_END, .pgprot = PAGE_KERNEL_EXEC, }, }, };
On Thu, Jun 8, 2023 at 11:41 AM Mike Rapoport <rppt@kernel.org> wrote: > > On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote: > > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > [...] > > > > > > > > Can you give more detail on what parameters you need? If the only extra > > > > > > parameter is just "does this allocation need to live close to kernel > > > > > > text", that's not that big of a deal. > > > > > > > > > > My thinking was that we at least need the start + end for each caller. That > > > > > might be it, tbh. > > > > > > > > Do you mean that modules will have something like > > > > > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > > > > > and kprobes will have > > > > > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > > > ? > > > > > > Yes. > > > > How about we start with two APIs: > > jit_text_alloc(size); > > jit_text_alloc_range(size, start, end); > > > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am > > not quite convinced it is needed. > > Right now arm64 and riscv override bpf and kprobes allocations to use the > entire vmalloc address space, but having the ability to allocate generated > code outside of modules area may be useful for other architectures. > > Still the start + end for the callers feels backwards to me because the > callers do not define the ranges, but rather the architectures, so we still > need a way for architectures to define how they want allocate memory for > the generated code. Yeah, this makes sense. > > > > > It sill can be achieved with a single jit_alloc_arch_params(), just by > > > > adding enum jit_type parameter to jit_text_alloc(). > > > > > > That feels backwards to me; it centralizes a bunch of information about > > > distinct users to be able to shove that into a static array, when the callsites > > > can pass that information. > > > > I think we only two type of users: module and everything else (ftrace, kprobe, > > bpf stuff). The key differences are: > > > > 1. module uses text and data; while everything else only uses text. > > 2. module code is generated by the compiler, and thus has stronger > > requirements in address ranges; everything else are generated via some > > JIT or manual written assembly, so they are more flexible with address > > ranges (in JIT, we can avoid using instructions that requires a specific > > address range). > > > > The next question is, can we have the two types of users share the same > > address ranges? If not, we can reserve the preferred range for modules, > > and let everything else use the other range. I don't see reasons to further > > separate users in the "everything else" group. > > I agree that we can define only two types: modules and everything else and > let the architectures define if they need different ranges for these two > types, or want the same range for everything. > > With only two types we can have two API calls for alloc, and a single > structure that defines the ranges etc from the architecture side rather > than spread all over. > > Like something along these lines: > > struct execmem_range { > unsigned long start; > unsigned long end; > unsigned long fallback_start; > unsigned long fallback_end; > pgprot_t pgprot; > unsigned int alignment; > }; > > struct execmem_modules_range { > enum execmem_module_flags flags; > struct execmem_range text; > struct execmem_range data; > }; > > struct execmem_jit_range { > struct execmem_range text; > }; > > struct execmem_params { > struct execmem_modules_range modules; > struct execmem_jit_range jit; > }; > > struct execmem_params *execmem_arch_params(void); > > void *execmem_text_alloc(size_t size); > void *execmem_data_alloc(size_t size); > void execmem_free(void *ptr); With the jit variation, maybe we can just call these module_[text|data]_alloc()? btw: Depending on the implementation of the allocator, we may also need separate free()s for text and data. > > void *jit_text_alloc(size_t size); > void jit_free(void *ptr); > [...] How should we move ahead from here? AFAICT, all these changes can be easily extended and refactored in the future, so we don't have to make it perfect the first time. OTOH, having the interface committed (either this set or my module_alloc_type version) can unblock works in the binpack allocator and the users side. Therefore, I think we can move relatively fast here? Thanks, Song
On Fri, Jun 09, 2023 at 10:02:16AM -0700, Song Liu wrote: > On Thu, Jun 8, 2023 at 11:41 AM Mike Rapoport <rppt@kernel.org> wrote: > > > > On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote: > > > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > [...] > > > > > > > > > > Can you give more detail on what parameters you need? If the only extra > > > > > > > parameter is just "does this allocation need to live close to kernel > > > > > > > text", that's not that big of a deal. > > > > > > > > > > > > My thinking was that we at least need the start + end for each caller. That > > > > > > might be it, tbh. > > > > > > > > > > Do you mean that modules will have something like > > > > > > > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > > > > > > > and kprobes will have > > > > > > > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > > > > ? > > > > > > > > Yes. > > > > > > How about we start with two APIs: > > > jit_text_alloc(size); > > > jit_text_alloc_range(size, start, end); > > > > > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am > > > not quite convinced it is needed. > > > > Right now arm64 and riscv override bpf and kprobes allocations to use the > > entire vmalloc address space, but having the ability to allocate generated > > code outside of modules area may be useful for other architectures. > > > > Still the start + end for the callers feels backwards to me because the > > callers do not define the ranges, but rather the architectures, so we still > > need a way for architectures to define how they want allocate memory for > > the generated code. > > Yeah, this makes sense. > > > > > > > > It sill can be achieved with a single jit_alloc_arch_params(), just by > > > > > adding enum jit_type parameter to jit_text_alloc(). > > > > > > > > That feels backwards to me; it centralizes a bunch of information about > > > > distinct users to be able to shove that into a static array, when the callsites > > > > can pass that information. > > > > > > I think we only two type of users: module and everything else (ftrace, kprobe, > > > bpf stuff). The key differences are: > > > > > > 1. module uses text and data; while everything else only uses text. > > > 2. module code is generated by the compiler, and thus has stronger > > > requirements in address ranges; everything else are generated via some > > > JIT or manual written assembly, so they are more flexible with address > > > ranges (in JIT, we can avoid using instructions that requires a specific > > > address range). > > > > > > The next question is, can we have the two types of users share the same > > > address ranges? If not, we can reserve the preferred range for modules, > > > and let everything else use the other range. I don't see reasons to further > > > separate users in the "everything else" group. > > > > I agree that we can define only two types: modules and everything else and > > let the architectures define if they need different ranges for these two > > types, or want the same range for everything. > > > > With only two types we can have two API calls for alloc, and a single > > structure that defines the ranges etc from the architecture side rather > > than spread all over. > > > > Like something along these lines: > > > > struct execmem_range { > > unsigned long start; > > unsigned long end; > > unsigned long fallback_start; > > unsigned long fallback_end; > > pgprot_t pgprot; > > unsigned int alignment; > > }; > > > > struct execmem_modules_range { > > enum execmem_module_flags flags; > > struct execmem_range text; > > struct execmem_range data; > > }; > > > > struct execmem_jit_range { > > struct execmem_range text; > > }; > > > > struct execmem_params { > > struct execmem_modules_range modules; > > struct execmem_jit_range jit; > > }; > > > > struct execmem_params *execmem_arch_params(void); > > > > void *execmem_text_alloc(size_t size); > > void *execmem_data_alloc(size_t size); > > void execmem_free(void *ptr); > > With the jit variation, maybe we can just call these > module_[text|data]_alloc()? I was thinking about "execmem_*_alloc()" for allocations that must be close to kernel image, like modules, ftrace on x86 and s390 and maybe something else in the future. And jit_text_alloc() for allocations that can reside anywhere. I tried to find a different name for 'struct execmem_modules_range' but couldn't think of anything better than 'struct execmem_close_to_kernel', so I've left modules in the name. > btw: Depending on the implementation of the allocator, we may also > need separate free()s for text and data. > > > > > void *jit_text_alloc(size_t size); > > void jit_free(void *ptr); > > Let's just add jit_free() for completeness even if it will be the same as execmem_free() for now. > [...] > > How should we move ahead from here? > > AFAICT, all these changes can be easily extended and refactored > in the future, so we don't have to make it perfect the first time. > OTOH, having the interface committed (either this set or my > module_alloc_type version) can unblock works in the binpack > allocator and the users side. Therefore, I think we can move > relatively fast here? Once the interface and architecture abstraction is ready we can work on the allocator and the users. We also need to update text_poking/alternatives on architectures that would allocate executable memory as ROX. I did some quick tests and with these patches 'modprobe xfs' takes tens time more than before. > Thanks, > Song
On Thu, Jun 08, 2023 at 09:41:16PM +0300, Mike Rapoport wrote: > On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote: > > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > [...] > > > > > > > > Can you give more detail on what parameters you need? If the only extra > > > > > > parameter is just "does this allocation need to live close to kernel > > > > > > text", that's not that big of a deal. > > > > > > > > > > My thinking was that we at least need the start + end for each caller. That > > > > > might be it, tbh. > > > > > > > > Do you mean that modules will have something like > > > > > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > > > > > and kprobes will have > > > > > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > > > ? > > > > > > Yes. > > > > How about we start with two APIs: > > jit_text_alloc(size); > > jit_text_alloc_range(size, start, end); > > > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am > > not quite convinced it is needed. > > Right now arm64 and riscv override bpf and kprobes allocations to use the > entire vmalloc address space, but having the ability to allocate generated > code outside of modules area may be useful for other architectures. > > Still the start + end for the callers feels backwards to me because the > callers do not define the ranges, but rather the architectures, so we still > need a way for architectures to define how they want allocate memory for > the generated code. So, the start + end just comes from the need to keep relative pointers under a certain size. I think this could be just a flag, I see no reason to expose actual addresses here.
On Tue, Jun 13, 2023 at 02:56:14PM -0400, Kent Overstreet wrote: > On Thu, Jun 08, 2023 at 09:41:16PM +0300, Mike Rapoport wrote: > > On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote: > > > On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > [...] > > > > > > > > > > Can you give more detail on what parameters you need? If the only extra > > > > > > > parameter is just "does this allocation need to live close to kernel > > > > > > > text", that's not that big of a deal. > > > > > > > > > > > > My thinking was that we at least need the start + end for each caller. That > > > > > > might be it, tbh. > > > > > > > > > > Do you mean that modules will have something like > > > > > > > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > > > > > > > and kprobes will have > > > > > > > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > > > > ? > > > > > > > > Yes. > > > > > > How about we start with two APIs: > > > jit_text_alloc(size); > > > jit_text_alloc_range(size, start, end); > > > > > > AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am > > > not quite convinced it is needed. > > > > Right now arm64 and riscv override bpf and kprobes allocations to use the > > entire vmalloc address space, but having the ability to allocate generated > > code outside of modules area may be useful for other architectures. > > > > Still the start + end for the callers feels backwards to me because the > > callers do not define the ranges, but rather the architectures, so we still > > need a way for architectures to define how they want allocate memory for > > the generated code. > > So, the start + end just comes from the need to keep relative pointers > under a certain size. I think this could be just a flag, I see no reason > to expose actual addresses here. It's the other way around. The start + end comes from the need to restrict allocation to certain range because of the relative addressing. I don't see how a flag can help here.
On Mon, Jun 05, 2023 at 11:09:34AM +0100, Mark Rutland wrote: > On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote: > > On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote: > > > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote: > > > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote: > > > > > For a while I have wanted to give kprobes its own allocator so that it can work > > > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space in > > > > > the modules area. > > > > > > > > > > Given that, I think these should have their own allocator functions that can be > > > > > provided independently, even if those happen to use common infrastructure. > > > > > > > > How much memory can kprobes conceivably use? I think we also want to try > > > > to push back on combinatorial new allocators, if we can. > > > > > > That depends on who's using it, and how (e.g. via BPF). > > > > > > To be clear, I'm not necessarily asking for entirely different allocators, but > > > I do thinkg that we want wrappers that can at least pass distinct start+end > > > parameters to a common allocator, and for arm64's modules code I'd expect that > > > we'd keep the range falblack logic out of the common allcoator, and just call > > > it twice. > > > > > > > > > Several architectures override module_alloc() because of various > > > > > > constraints where the executable memory can be located and this causes > > > > > > additional obstacles for improvements of code allocation. > > > > > > > > > > > > This set splits code allocation from modules by introducing > > > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call > > > > > > sites of module_alloc() and module_memfree() with the new APIs and > > > > > > implements core text and related allocation in a central place. > > > > > > > > > > > > Instead of architecture specific overrides for module_alloc(), the > > > > > > architectures that require non-default behaviour for text allocation must > > > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() that > > > > > > returns a pointer to that structure. If an architecture does not implement > > > > > > jit_alloc_arch_params(), the defaults compatible with the current > > > > > > modules::module_alloc() are used. > > > > > > > > > > As above, I suspect that each of the callsites should probably be using common > > > > > infrastructure, but I don't think that a single jit_alloc_arch_params() makes > > > > > sense, since the parameters for each case may need to be distinct. > > > > > > > > I don't see how that follows. The whole point of function parameters is > > > > that they may be different :) > > > > > > What I mean is that jit_alloc_arch_params() tries to aggregate common > > > parameters, but they aren't actually common (e.g. the actual start+end range > > > for allocation). > > > > jit_alloc_arch_params() tries to aggregate architecture constraints and > > requirements for allocations of executable memory and this exactly what > > the first 6 patches of this set do. > > > > A while ago Thomas suggested to use a structure that parametrizes > > architecture constraints by the memory type used in modules [1] and Song > > implemented the infrastructure for it and x86 part [2]. > > > > I liked the idea of defining parameters in a single structure, but I > > thought that approaching the problem from the arch side rather than from > > modules perspective will be better starting point, hence these patches. > > > > I don't see a fundamental reason why a single structure cannot describe > > what is needed for different code allocation cases, be it modules, kprobes > > or bpf. There is of course an assumption that the core allocations will be > > the same for all the users, and it seems to me that something like > > > > * allocate physical memory if allocator caches are empty > > * map it in vmalloc or modules address space > > * return memory from the allocator cache to the caller > > > > will work for all usecases. > > > > We might need separate caches for different cases on different > > architectures, and a way to specify what cache should be used in the > > allocator API, but that does not contradict a single structure for arch > > specific parameters, but only makes it more elaborate, e.g. something like > > > > enum jit_type { > > JIT_MODULES_TEXT, > > JIT_MODULES_DATA, > > JIT_KPROBES, > > JIT_FTRACE, > > JIT_BPF, > > JIT_TYPE_MAX, > > }; > > > > struct jit_alloc_params { > > struct jit_range ranges[JIT_TYPE_MAX]; > > /* ... */ > > }; > > > > > > Can you give more detail on what parameters you need? If the only extra > > > > parameter is just "does this allocation need to live close to kernel > > > > text", that's not that big of a deal. > > > > > > My thinking was that we at least need the start + end for each caller. That > > > might be it, tbh. > > > > Do you mean that modules will have something like > > > > jit_text_alloc(size, MODULES_START, MODULES_END); > > > > and kprobes will have > > > > jit_text_alloc(size, KPROBES_START, KPROBES_END); > > ? > > Yes. > > > It sill can be achieved with a single jit_alloc_arch_params(), just by > > adding enum jit_type parameter to jit_text_alloc(). > > That feels backwards to me; it centralizes a bunch of information about > distinct users to be able to shove that into a static array, when the callsites > can pass that information. > > What's *actually* common after separating out the ranges? Is it just the > permissions? Even if for some architecture the only common thing are the permissions, having a definition for code allocations in a single place an improvement. The diffstat of the patches is indeed positive (even without comments), but having a single structure that specifies how the code should be allocated would IMHO actually reduce the maintenance burden. And features like caching of large pages and sub-page size allocations are surely will be easier to opt-in this way. > If we want this to be able to share allocations and so on, why can't we do this > like a kmem_cache, and have the callsite pass a pointer to the allocator data? > That would make it easy for callsites to share an allocator or use a distinct > one. I've looked into doing this like a kmem_cache with call sites passing the allocator data, and this gets really hairy. For each user we need to pass the arch specific parameters to that user, create a cache there and only then the cache can be used. Since we don't have hooks to setup any of the users in the arch code, the initialization gets more complex than shoving everything into an array. I think that jit_alloc(type, size) is the best way to move forward to let different users choose their ranges and potentially caches. Differentiation by the API name will explode even now and it'll get worse if/when new users will show up and we can't even force users to avoid using PC-relative addressing because, e.g. RISC-V explicitly switched their BPF JIT to use that. > Thanks, > Mark.
From: "Mike Rapoport (IBM)" <rppt@kernel.org> Hi, module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystmes that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. This set splits code allocation from modules by introducing jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call sites of module_alloc() and module_memfree() with the new APIs and implements core text and related allocation in a central place. Instead of architecture specific overrides for module_alloc(), the architectures that require non-default behaviour for text allocation must fill jit_alloc_params structure and implement jit_alloc_arch_params() that returns a pointer to that structure. If an architecture does not implement jit_alloc_arch_params(), the defaults compatible with the current modules::module_alloc() are used. The new jitalloc infrastructure allows decoupling of kprobes and ftrace from modules, and most importantly it enables ROX allocations for executable memory. A centralized infrastructure for code allocation allows future optimizations for allocations of executable memory, caching large pages for better iTLB performance and providing sub-page allocations for users that only need small jit code snippets. patches 1-5: split out the code allocation from modules and arch patch 6: add dedicated API for data allocations with constraints similar to code allocations patches 7-9: decouple dynamic ftrace and kprobes form CONFIG_MODULES patches 10-13: enable ROX allocations for executable memory on x86 Mike Rapoport (IBM) (11): nios2: define virtual address space for modules mm: introduce jit_text_alloc() and use it instead of module_alloc() mm/jitalloc, arch: convert simple overrides of module_alloc to jitalloc mm/jitalloc, arch: convert remaining overrides of module_alloc to jitalloc module, jitalloc: drop module_alloc mm/jitalloc: introduce jit_data_alloc() x86/ftrace: enable dynamic ftrace without CONFIG_MODULES arch: make jitalloc setup available regardless of CONFIG_MODULES kprobes: remove dependcy on CONFIG_MODULES modules, jitalloc: prepare to allocate executable memory as ROX x86/jitalloc: make memory allocated for code ROX Song Liu (2): ftrace: Add swap_func to ftrace_process_locs() x86/jitalloc: prepare to allocate exectuatble memory as ROX arch/Kconfig | 5 +- arch/arm/kernel/module.c | 32 ------ arch/arm/mm/init.c | 35 ++++++ arch/arm64/kernel/module.c | 47 -------- arch/arm64/mm/init.c | 42 +++++++ arch/loongarch/kernel/module.c | 6 - arch/loongarch/mm/init.c | 16 +++ arch/mips/kernel/module.c | 9 -- arch/mips/mm/init.c | 19 ++++ arch/nios2/include/asm/pgtable.h | 5 +- arch/nios2/kernel/module.c | 24 ++-- arch/parisc/kernel/module.c | 11 -- arch/parisc/mm/init.c | 21 +++- arch/powerpc/kernel/kprobes.c | 4 +- arch/powerpc/kernel/module.c | 37 ------- arch/powerpc/mm/mem.c | 41 +++++++ arch/riscv/kernel/module.c | 10 -- arch/riscv/mm/init.c | 18 +++ arch/s390/kernel/ftrace.c | 4 +- arch/s390/kernel/kprobes.c | 4 +- arch/s390/kernel/module.c | 46 +------- arch/s390/mm/init.c | 35 ++++++ arch/sparc/kernel/module.c | 34 +----- arch/sparc/mm/Makefile | 2 + arch/sparc/mm/jitalloc.c | 21 ++++ arch/sparc/net/bpf_jit_comp_32.c | 8 +- arch/x86/Kconfig | 2 + arch/x86/kernel/alternative.c | 43 ++++--- arch/x86/kernel/ftrace.c | 59 +++++----- arch/x86/kernel/kprobes/core.c | 4 +- arch/x86/kernel/module.c | 75 +------------ arch/x86/kernel/static_call.c | 10 +- arch/x86/kernel/unwind_orc.c | 13 ++- arch/x86/mm/init.c | 52 +++++++++ arch/x86/net/bpf_jit_comp.c | 22 +++- include/linux/ftrace.h | 2 + include/linux/jitalloc.h | 69 ++++++++++++ include/linux/moduleloader.h | 15 --- kernel/bpf/core.c | 14 +-- kernel/kprobes.c | 51 +++++---- kernel/module/Kconfig | 1 + kernel/module/main.c | 56 ++++------ kernel/trace/ftrace.c | 13 ++- kernel/trace/trace_kprobe.c | 11 ++ mm/Kconfig | 3 + mm/Makefile | 1 + mm/jitalloc.c | 185 +++++++++++++++++++++++++++++++ mm/mm_init.c | 2 + 48 files changed, 777 insertions(+), 462 deletions(-) create mode 100644 arch/sparc/mm/jitalloc.c create mode 100644 include/linux/jitalloc.h create mode 100644 mm/jitalloc.c base-commit: 44c026a73be8038f03dbdeef028b642880cf1511