
[bpf-next] mm: Introduce vm_area_[un]map_pages().

Message ID 20240220192613.8840-1-alexei.starovoitov@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series: [bpf-next] mm: Introduce vm_area_[un]map_pages().

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 14193 this patch: 14193
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 3076 this patch: 3076
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 15305 this patch: 15305
netdev/checkpatch warning WARNING: line length of 85 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexei Starovoitov Feb. 20, 2024, 7:26 p.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

The vmap() API is used to map a set of pages into contiguous kernel virtual space.

BPF would like to extend the vmap API to implement a lazily-populated
contiguous kernel virtual space whose size and start address are fixed early.

The vmap API has functions to request and release areas of kernel address space:
get_vm_area() and free_vm_area().

Introduce vm_area_map_pages(area, start_addr, count, pages)
to map a set of pages within a given area.
It performs the same sanity checks as vmap() does.
In addition it also checks that the area was created by get_vm_area() with the
VM_MAP flag (as all users of vmap() should be doing).

Also add vm_area_unmap_pages(), which is a safer alternative to the
existing vunmap_range() API.

The next commits will introduce bpf_arena, which is a sparsely populated shared
memory region between a bpf program and a user space process. It will map
privately-managed pages into an existing vm area with the following steps:

  area = get_vm_area(area_size, VM_MAP | VM_USERMAP); // at bpf prog verification time
  vm_area_map_pages(area, kaddr, 1, page);            // on demand
  vm_area_unmap_pages(area, kaddr, 1);
  free_vm_area(area);                                 // after bpf prog is unloaded

For the BPF use case the area_size will be 4Gbyte plus 64Kbyte of guard pages,
and area->addr will be known and fixed at program verification time.
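
For illustration, here is a minimal sketch of how a caller might use the new
API (hypothetical caller code, not part of this patch; error handling
abbreviated, size constants from <linux/sizes.h>):

  struct vm_struct *area;
  struct page *page;
  int err;

  /* reserve 4GB + 64KB of kernel virtual space up front */
  area = get_vm_area((4UL * SZ_1G) + SZ_64K, VM_MAP | VM_USERMAP);
  if (!area)
          return -ENOMEM;

  /* later, on demand: back one page of the area with real memory */
  page = alloc_page(GFP_KERNEL | __GFP_ZERO);
  if (!page)
          return -ENOMEM;
  err = vm_area_map_pages(area, (unsigned long)area->addr, 1, &page);
  if (err)
          __free_page(page);

  /* ... use the mapping ... */

  vm_area_unmap_pages(area, (unsigned long)area->addr, 1);
  __free_page(page);
  free_vm_area(area);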

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/vmalloc.h |  3 +++
 mm/vmalloc.c            | 46 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

Comments

Christoph Hellwig Feb. 21, 2024, 5:52 a.m. UTC | #1
On Tue, Feb 20, 2024 at 11:26:13AM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> vmap() API is used to map a set of pages into contiguous kernel virtual space.
> 
> BPF would like to extend the vmap API to implement a lazily-populated
> contiguous kernel virtual space which size and start address is fixed early.
> 
> The vmap API has functions to request and release areas of kernel address space:
> get_vm_area() and free_vm_area().

As said before I really hate growing more get_vm_area and
free_vm_area outside the core vmalloc code.  We have a few of those
mostly due to ioremap (which is being consolidated) and executable code
allocation (which there have been various attempts at consolidation,
and hopefully one finally succeeds..).  So let's take a step back and
think how we can do that without it.

For the dynamically growing part do you need a special allocator or
can we just go straight to the page allocator and implement this
in common code?

> For BPF use case the area_size will be 4Gbyte plus 64Kbyte of guard pages and
> area->addr known and fixed at the program verification time.

How is this ever going to work on 32-bit platforms?
Alexei Starovoitov Feb. 21, 2024, 7:05 p.m. UTC | #2
On Tue, Feb 20, 2024 at 9:52 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Feb 20, 2024 at 11:26:13AM -0800, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > vmap() API is used to map a set of pages into contiguous kernel virtual space.
> >
> > BPF would like to extend the vmap API to implement a lazily-populated
> > contiguous kernel virtual space which size and start address is fixed early.
> >
> > The vmap API has functions to request and release areas of kernel address space:
> > get_vm_area() and free_vm_area().
>
> As said before I really hate growing more get_vm_area and
> free_vm_area outside the core vmalloc code.  We have a few of those
> mostly due to ioremap (which is being consolidated) and executable code
> allocation (which there have been various attempts at consolidation,
> and hopefully one finally succeeds..).  So let's take a step back and
> think how we can do that without it.

There are also xen grant tables that grab the range with get_vm_area(),
but manage it on their own. It's not an ioremap case.
It looks to me the vmalloc address range has different kinds of areas
already: vmalloc, vmap, ioremap, xen.

Maybe we can do:
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 7d112cc5f2a3..633c7b643daa 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -28,6 +28,7 @@ struct iov_iter;              /* in uio.h */
 #define VM_MAP_PUT_PAGES       0x00000200      /* put pages and free array in vfree */
 #define VM_ALLOW_HUGE_VMAP     0x00000400      /* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
+#define VM_BPF                 0x00000800      /* bpf_arena pages */

+static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
+{
+       return get_vm_area(size, VM_BPF);
+}

and enforce that flag in vm_area_[un]map_pages() ?

vmallocinfo can display it or skip it.
Things like find_vm_area() can do something different with such an area
(if that was the concern).

> For the dynamically growing part do you need a special allocator or
> can we just go straight to the page allocator and implement this
> in common code?

It's a bit of a special allocator that uses a maple tree to manage the
range within the 4G region and
alloc_pages_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT)
to grab pages.
With extra dance:
        memcg = bpf_map_get_memcg(map);
        old_memcg = set_active_memcg(memcg);
to make sure memcg accounting is done the common way for all bpf maps.

The tricky bpf-specific part is the computation of pgoff,
since it's a shared memory region between user space and the bpf prog.
The lower 32 bits of the pointer have to be the same for user space and bpf.

Not much changed in the patch since the earlier thread.
Either find it in your email or here:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=364c9b5d233d775728ec2bf3b4168fa6909e58d1
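
Roughly, the allocation path described above looks like this (a condensed
sketch only, not the actual arena.c code; 'map', 'area' and 'kaddr' stand for
the arena's bpf map, its vm area and the kernel address reserved in the maple
tree):

  struct mem_cgroup *memcg, *old_memcg;
  struct page *page;
  int err;

  memcg = bpf_map_get_memcg(map);         /* charge pages to the map's memcg */
  old_memcg = set_active_memcg(memcg);
  page = alloc_pages_node(NUMA_NO_NODE,
                          GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT, 0);
  set_active_memcg(old_memcg);
  mem_cgroup_put(memcg);
  if (!page)
          return -ENOMEM;

  /* kaddr was reserved from the maple-tree-managed 4G range */
  err = vm_area_map_pages(area, kaddr, 1, &page);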

Are you suggesting an API like:

struct vm_struct *area = get_sparse_vm_area(size);
vm_area_alloc_pages(struct vm_struct *area, ulong addr, int page_cnt,
                    int numa_id);

and vm_area_alloc_pages() will allocate pages and vmap_pages_range()
them, with all the code in mm/vmalloc.c?

I can give it a shot.

The ugly part is bpf_map_get_memcg() would need to be passed in somehow.

Another bpf specific bit is the guard pages before and after 4G range
and such vm_area_alloc_pages() would need to skip them.

> > For BPF use case the area_size will be 4Gbyte plus 64Kbyte of guard pages and
> > area->addr known and fixed at the program verification time.
>
> How is this ever going to work on 32-bit platforms?

bpf_arena requires 64bit and mmu.

ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
obj-$(CONFIG_BPF_SYSCALL) += arena.o
endif

and special JIT support too.

With bpf_arena we can finally deprecate a bunch of things like bloom
filter bpf map, etc.
Alexei Starovoitov Feb. 22, 2024, 11:25 p.m. UTC | #3
On Wed, Feb 21, 2024 at 11:05 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 9:52 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Tue, Feb 20, 2024 at 11:26:13AM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > vmap() API is used to map a set of pages into contiguous kernel virtual space.
> > >
> > > BPF would like to extend the vmap API to implement a lazily-populated
> > > contiguous kernel virtual space which size and start address is fixed early.
> > >
> > > The vmap API has functions to request and release areas of kernel address space:
> > > get_vm_area() and free_vm_area().
> >
> > As said before I really hate growing more get_vm_area and
> > free_vm_area outside the core vmalloc code.  We have a few of those
> > mostly due to ioremap (which is being consolidated) and executable code
> > allocation (which there have been various attempts at consolidation,
> > and hopefully one finally succeeds..).  So let's take a step back and
> > think how we can do that without it.
>
> There are also xen grant tables that grab the range with get_vm_area(),
> but manage it on their own. It's not an ioremap case.
> It looks to me the vmalloc address range has different kinds of areas
> already: vmalloc, vmap, ioremap, xen.
>
> Maybe we can do:
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 7d112cc5f2a3..633c7b643daa 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -28,6 +28,7 @@ struct iov_iter;              /* in uio.h */
>  #define VM_MAP_PUT_PAGES       0x00000200      /* put pages and free array in vfree */
>  #define VM_ALLOW_HUGE_VMAP     0x00000400      /* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
> +#define VM_BPF                 0x00000800      /* bpf_arena pages */
>
> +static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
> +{
> +       return get_vm_area(size, VM_BPF);
> +}
>
> and enforce that flag in vm_area_[un]map_pages() ?
>
> vmallocinfo can display it or skip it.
> Things like find_vm_area() can do something different with such an area
> (if that was the concern).
>
> > For the dynamically growing part do you need a special allocator or
> > can we just go straight to the page allocator and implement this
> > in common code?
>
> It's a bit special allocator that is using maple tree to manage
> range within 4G region and
> alloc_pages_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT)
> to grab pages.
> With extra dance:
>         memcg = bpf_map_get_memcg(map);
>         old_memcg = set_active_memcg(memcg);
> to make sure memcg accounting is done the common way for all bpf maps.
>
> The tricky bpf specific part is a computation of pgoff,
> since it's a shared memory region between user space and bpf prog.
> The lower 32-bits of the pointer have to be the same for user space and bpf.
>
> Not much changed in the patch since the earlier thread.
> Either find it in your email or here:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=364c9b5d233d775728ec2bf3b4168fa6909e58d1
>
> Are you suggesting the api like:
>
> struct vm_struct *area = get_sparse_vm_area(size);
> vm_area_alloc_pages(struct vm_struct *area, ulong addr, int page_cnt,
> int numa_id);
>
> and vm_area_alloc_pages() will allocate pages and vmap_pages_range()
> them while all code in mm/vmalloc.c ?
>
> I can give it a shot.
>
> The ugly part is bpf_map_get_memcg() would need to be passed in somehow.
>
> Another bpf specific bit is the guard pages before and after 4G range
> and such vm_area_alloc_pages() would need to skip them.

I've looked at this approach more.
The somewhat generic-ish api for mm/vmalloc.c may look like:
struct vm_sparse_struct *area;

area = get_sparse_vm_area(vm_area_size, guard_size,
                          pgoff_offset, max_pages, memcg, ...);

vm_area_size is what get_vm_area() will reserve out of the kernel
vmalloc region. For the bpf_arena case it will be 4gb+64k.
guard_size is the size of the guard area. 64k for bpf_arena.
pgoff_offset is the offset, after the guard area, at which pages would
need to start being allocated.
For any normal vma, pgoff==0 is the first page at vma->vm_start.
bpf_arena is a bpf/user shared sparse region and it needs to keep the lower
32 bits of the address that user space received from mmap(),
so that the first allocated page with pgoff=0 will be the first
page at the _user_ vma->vm_start.
Hence for the kernel vmalloc range the page allocator needs that
pgoff_offset.
max_pages is easy. It's the max number of pages that
this sparse_vm_area is allowed to allocate.
It's also driven by user space. When the user does
mmap(NULL, bpf_arena_size, ..., bpf_arena_map_fd)
it gets an address; that address determines pgoff_offset
and arena_size determines max_pages.
That arena_size can be 1 page or 1000 pages, but always less than 4Gb.
vm_area_size will be 4gb+64k regardless.

vm_area_alloc_pages(struct vm_sparse_struct *area, ulong addr,
                    int page_cnt, int numa_id);
is semantically similar to the user's mmap().
If addr == 0 the kernel will find a free range after pgoff_offset,
allocate page_cnt pages from there and vmap them into the
kernel's vm_sparse_struct area.
If addr is specified it would have to be >= pgoff_offset
and page_cnt <= max_pages.
All pages are accounted to the memcg specified at vm_sparse_struct
creation time.
And it will use a maple tree to track all these range allocations
within the vm_sparse_struct.
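
Put as declarations, the proposal above would look roughly like this (all of
it hypothetical, nothing of this exists yet; the return convention of
vm_area_alloc_pages() is assumed here to be the chosen kernel address or a
negative errno):

  /* hypothetical sketch of the API proposed in this thread */
  struct vm_sparse_struct;

  struct vm_sparse_struct *get_sparse_vm_area(unsigned long vm_area_size,
                                              unsigned long guard_size,
                                              unsigned long pgoff_offset,
                                              unsigned long max_pages,
                                              struct mem_cgroup *memcg);

  /* mmap()-like: addr == 0 means "pick a free range after pgoff_offset" */
  long vm_area_alloc_pages(struct vm_sparse_struct *area, unsigned long addr,
                           int page_cnt, int numa_id);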

So far it looks like the bigger half of kernel/bpf/arena.c
will migrate to mm/vmalloc.c and will be very bpf specific.

So I don't particularly like this direction. Feels like a burden
for mm and bpf folks.

btw LWN just posted a nice article describing the motivation
https://lwn.net/Articles/961941/

So far doing:

+#define VM_BPF                 0x00000800      /* bpf_arena pages */
or VM_SPARSE ?

+static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
+{
+       return get_vm_area(size, VM_BPF);
+}
and enforcing that flag where appropriate in mm/vmalloc.c
is the easiest for everyone.
We probably should add
#define VM_XEN 0x00001000
and use it in xen use cases to differentiate
vmalloc vs vmap vs ioremap vs bpf vs xen users.
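
Enforcing the flag in the new helpers would be a one-line check (sketch only;
whether the flag ends up being called VM_BPF or VM_SPARSE is exactly what is
being discussed here):

  /* in vm_area_map_pages() / vm_area_unmap_pages() */
  if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
          return -EINVAL;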

Please share your opinions.
Christoph Hellwig Feb. 23, 2024, 5:14 p.m. UTC | #4
On Wed, Feb 21, 2024 at 11:05:09AM -0800, Alexei Starovoitov wrote:
> +#define VM_BPF                 0x00000800      /* bpf_arena pages */
> 
> +static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
> +{
> +       return get_vm_area(size, VM_BPF);
> +}
> 
> and enforce that flag in vm_area_[un]map_pages() ?
> 
> vmallocinfo can display it or skip it.
> Things like find_vm_area() can do something different with such an area
> (if that was the concern).

Well, a growing allocation is a generally useful feature.  I'd
rather not limit it to bpf if we can.

> > For the dynamically growing part do you need a special allocator or
> > can we just go straight to the page allocator and implement this
> > in common code?
> 
> It's a bit special allocator that is using maple tree to manage
> range within 4G region and
> alloc_pages_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT)
> to grab pages.
> With extra dance:
>         memcg = bpf_map_get_memcg(map);
>         old_memcg = set_active_memcg(memcg);
> to make sure memcg accounting is done the common way for all bpf maps.

Ok, so it's not just a growing allocation but actually sparse and
all over the place?  That doesn't really make it easier to come
up with a good enough interface.  How do you decide what gets placed
where?

> struct vm_struct *area = get_sparse_vm_area(size);
> vm_area_alloc_pages(struct vm_struct *area, ulong addr, int page_cnt,
> int numa_id);
> 
> and vm_area_alloc_pages() will allocate pages and vmap_pages_range()
> them while all code in mm/vmalloc.c ?

My vague hope was that we could just start out with an area and
grow it.  But it sounds like you need something much more complex
than that.

But yes, a more specific API is probably a better idea.  And maybe
the cookie shouldn't be a VM area either, but a structure dedicated to
this.
Alexei Starovoitov Feb. 23, 2024, 5:27 p.m. UTC | #5
On Fri, Feb 23, 2024 at 9:14 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Feb 21, 2024 at 11:05:09AM -0800, Alexei Starovoitov wrote:
> > +#define VM_BPF                 0x00000800      /* bpf_arena pages */
> >
> > +static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
> > +{
> > +       return get_vm_area(size, VM_BPF);
> > +}
> >
> > and enforce that flag in vm_area_[un]map_pages() ?
> >
> > vmallocinfo can display it or skip it.
> > Things like find_vm_area() can do something different with such an area
> > (if that was the concern).
>
> Well, a growing allocation is a generally useful feature.  I'd
> rather not limit it to bpf if we can.

sure. See VM_SPARSE proposal in the other email.

> > > For the dynamically growing part do you need a special allocator or
> > > can we just go straight to the page allocator and implement this
> > > in common code?
> >
> > It's a bit special allocator that is using maple tree to manage
> > range within 4G region and
> > alloc_pages_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT)
> > to grab pages.
> > With extra dance:
> >         memcg = bpf_map_get_memcg(map);
> >         old_memcg = set_active_memcg(memcg);
> > to make sure memcg accounting is done the common way for all bpf maps.
>
> Ok, so it's not just a growing allocation but actually sparse and
> all over the place?  That doesn't really make it easier to come
> up with a good enough interface.

yep.

> How do you decide what gets placed
> where?

See the proposal in the other email in this thread.
tl;dr: it's a user-space mmap()-like interface:
either "give me N pages at any addr" or
"give me N pages at this addr if that range is still free".
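
With the signature sketched in the other email (hypothetical, as noted there),
the two modes would be:

  /* any addr: kernel picks a free range after pgoff_offset */
  kaddr = vm_area_alloc_pages(area, 0, page_cnt, NUMA_NO_NODE);

  /* fixed addr: succeeds only if [addr, addr + page_cnt * PAGE_SIZE) is free */
  kaddr = vm_area_alloc_pages(area, addr, page_cnt, NUMA_NO_NODE);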

> > struct vm_struct *area = get_sparse_vm_area(size);
> > vm_area_alloc_pages(struct vm_struct *area, ulong addr, int page_cnt,
> > int numa_id);
> >
> > and vm_area_alloc_pages() will allocate pages and vmap_pages_range()
> > them while all code in mm/vmalloc.c ?
>
> My vague hope was that we could just start out with an area and
> grow it.  But it sounds like you need something much more complex
> than that.

yes. With bpf specific tricks due to lower 32-bit wrap around.

> But yes, a more specific API is probably a better idea.  And maybe
> the cookie shouldn't be a VM area either, but a structure dedicated to
> this.

Right. See the 'struct vm_sparse_struct' proposal in the other email.
Alexei Starovoitov Feb. 24, 2024, midnight UTC | #6
On Thu, Feb 22, 2024 at 3:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> >
> > I can give it a shot.
> >
> > The ugly part is bpf_map_get_memcg() would need to be passed in somehow.
> >
> > Another bpf specific bit is the guard pages before and after 4G range
> > and such vm_area_alloc_pages() would need to skip them.
>
> I've looked at this approach more.
> The somewhat generic-ish api for mm/vmalloc.c may look like:
> struct vm_sparse_struct *area;
>
> area = get_sparse_vm_area(vm_area_size, guard_size,
>                           pgoff_offset, max_pages, memcg, ...);
>
> vm_area_size is what get_vm_area() will reserve out of the kernel
> vmalloc region. For bpf_arena case it will be 4gb+64k.
> guard_size is the size of the guard area. 64k for bpf_arena.
> pgoff_offset is the offset where pages would need to start allocating
> after the guard area.
> For any normal vma the pgoff==0 is the first page after vma->vm_start.
> bpf_arena is bpf/user shared sparse region and it needs to keep lower 32-bit
> from the address that user space received from mmap().
> So that the first allocated page with pgoff=0 will be the first
> page for _user_ vma->vm_start.
> Hence for kernel vmalloc range the page allocator needs that
> pgoff_offset.
> max_pages is easy. It's the max number of pages that
> this sparse_vm_area is allowed to allocate.
> It's also driven by user space. When user does
> mmap(NULL, bpf_arena_size, ..., bpf_arena_map_fd)
> it gets an address and that address determines pgoff_offset
> and arena_size determines the max_pages.
> That arena_size can be 1 page or 1000 pages. Always less than 4Gb.
> But vm_area_size will be 4gb+64k regardless.
>
> vm_area_alloc_pages(struct vm_sparse_struct *area, ulong addr,
>                     int page_cnt, int numa_id);
> is semantically similar to user's mmap().
> If addr == 0 the kernel will find a free range after pgoff_offset
> and will allocate page_cnt pages from there and vmap to
> kernel's vm_sparse_struct area.
> If addr is specified it would have to be >= pgoff_offset
> and page_cnt <= max_pages.
> All pages are accounted into memcg specified at vm_sparse_struct
> creation time.
> And it will use maple tree to track all these range allocation
> within vm_sparse_struct.
>
> So far it looks like the bigger half of kernel/bpf/arena.c
> will migrate to mm/vmalloc.c and will be very bpf specific.
>
> So I don't particularly like this direction. Feels like a burden
> for mm and bpf folks.
>
> btw LWN just posted a nice article describing the motivation
> https://lwn.net/Articles/961941/
>
> So far doing:
>
> +#define VM_BPF                 0x00000800      /* bpf_arena pages */
> or VM_SPARSE ?
>
> and enforcing that flag where appropriate in mm/vmalloc.c
> is the easiest for everyone.
> We probably should add
> #define VM_XEN 0x00001000
> and use it in xen use cases to differentiate
> vmalloc vs vmap vs ioremap vs bpf vs xen users.

Here is what I had in mind:
https://lore.kernel.org/bpf/20240223235728.13981-1-alexei.starovoitov@gmail.com/

Patch

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..7d112cc5f2a3 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -232,6 +232,9 @@  static inline bool is_vm_area_hugepages(const void *addr)
 }
 
 #ifdef CONFIG_MMU
+int vm_area_map_pages(struct vm_struct *area, unsigned long addr, unsigned int count,
+		      struct page **pages);
+int vm_area_unmap_pages(struct vm_struct *area, unsigned long addr, unsigned int count);
 void vunmap_range(unsigned long addr, unsigned long end);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..d6337d46f1d8 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -635,6 +635,52 @@  static int vmap_pages_range(unsigned long addr, unsigned long end,
 	return err;
 }
 
+/**
+ * vm_area_map_pages - map pages inside given vm_area
+ * @area: vm_area
+ * @addr: start address inside vm_area
+ * @count: number of pages
+ * @pages: pages to map (always PAGE_SIZE pages)
+ */
+int vm_area_map_pages(struct vm_struct *area, unsigned long addr, unsigned int count,
+		      struct page **pages)
+{
+	unsigned long size = ((unsigned long)count) * PAGE_SIZE;
+	unsigned long end = addr + size;
+
+	might_sleep();
+	if (WARN_ON_ONCE(area->flags & VM_FLUSH_RESET_PERMS))
+		return -EINVAL;
+	if (WARN_ON_ONCE(area->flags & VM_NO_GUARD))
+		return -EINVAL;
+	if (WARN_ON_ONCE(!(area->flags & VM_MAP)))
+		return -EINVAL;
+	if (count > totalram_pages())
+		return -E2BIG;
+	if (addr < (unsigned long)area->addr || (void *)end > area->addr + area->size)
+		return -ERANGE;
+
+	return vmap_pages_range(addr, end, PAGE_KERNEL, pages, PAGE_SHIFT);
+}
+
+/**
+ * vm_area_unmap_pages - unmap pages inside given vm_area
+ * @area: vm_area
+ * @addr: start address inside vm_area
+ * @count: number of pages to unmap
+ */
+int vm_area_unmap_pages(struct vm_struct *area, unsigned long addr, unsigned int count)
+{
+	unsigned long size = ((unsigned long)count) * PAGE_SIZE;
+	unsigned long end = addr + size;
+
+	if (addr < (unsigned long)area->addr || (void *)end > area->addr + area->size)
+		return -ERANGE;
+
+	vunmap_range(addr, end);
+	return 0;
+}
+
 int is_vmalloc_or_module_addr(const void *x)
 {
 	/*