Message ID: 20220819214232.18784-1-alexei.starovoitov@gmail.com (mailing list archive)
Series: bpf: BPF specific memory allocator.
On Fri, 19 Aug 2022 at 23:42, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce any context BPF specific memory allocator.
>
> Tracing BPF programs can attach to kprobe and fentry. Hence they
> run in unknown context where calling plain kmalloc() might not be safe.
> Front-end kmalloc() with per-cpu cache of free elements.
> Refill this cache asynchronously from irq_work.
>
> Major achievements enabled by bpf_mem_alloc:
> - Dynamically allocated hash maps used to be 10 times slower than fully
>   preallocated. With bpf_mem_alloc and subsequent optimizations the speed
>   of dynamic maps is equal to full prealloc.
> - Tracing bpf programs can use dynamically allocated hash maps.
>   Potentially saving lots of memory. Typical hash map is sparsely populated.
> - Sleepable bpf programs can use dynamically allocated hash maps.

From my side, for the whole series:
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

> v2->v3:
> - Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
> - Allowed sleepable bpf progs to use dynamically allocated maps. Patches 13 and 14.
> - Added sysctl to force bpf_mem_alloc in hash map even if pre-alloc is
>   requested to reduce memory consumption. Patch 15.
> - Fix: zero-fill percpu allocation
> - Single rcu_barrier at the end instead of each cpu during bpf_mem_alloc destruction
>
> v2 thread:
> https://lore.kernel.org/bpf/20220817210419.95560-1-alexei.starovoitov@gmail.com/
>
> v1->v2:
> - Moved unsafe direct call_rcu() from hash map into safe place inside
>   bpf_mem_alloc. Patches 7 and 9.
> - Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
> - Tuned watermarks per allocation size. Patch 8
> - Adopted this approach to per-cpu allocation. Patch 10.
> - Fully converted hash map to bpf_mem_alloc. Patch 11.
> - Removed tracing prog restriction on map types. Combination of all patches
>   and final patch 12.
>
> v1 thread:
> https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/
>
> LWN article:
> https://lwn.net/Articles/899274/
>
> Future work:
> - expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
> - convert lru map to bpf_mem_alloc
>
> Alexei Starovoitov (15):
>   bpf: Introduce any context BPF specific memory allocator.
>   bpf: Convert hash map to bpf_mem_alloc.
>   selftests/bpf: Improve test coverage of test_maps
>   samples/bpf: Reduce syscall overhead in map_perf_test.
>   bpf: Relax the requirement to use preallocated hash maps in tracing progs.
>   bpf: Optimize element count in non-preallocated hash map.
>   bpf: Optimize call_rcu in non-preallocated hash map.
>   bpf: Adjust low/high watermarks in bpf_mem_cache
>   bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
>   bpf: Add percpu allocation support to bpf_mem_alloc.
>   bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
>   bpf: Remove tracing program restriction on map types
>   bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
>   bpf: Remove prealloc-only restriction for sleepable bpf programs.
>   bpf: Introduce sysctl kernel.bpf_force_dyn_alloc.
>
>  include/linux/bpf_mem_alloc.h             |  26 +
>  include/linux/filter.h                    |   2 +
>  kernel/bpf/Makefile                       |   2 +-
>  kernel/bpf/core.c                         |   2 +
>  kernel/bpf/hashtab.c                      | 132 +++--
>  kernel/bpf/memalloc.c                     | 601 ++++++++++++++++++++++
>  kernel/bpf/syscall.c                      |  14 +-
>  kernel/bpf/verifier.c                     |  52 --
>  samples/bpf/map_perf_test_kern.c          |  44 +-
>  samples/bpf/map_perf_test_user.c          |   2 +-
>  tools/testing/selftests/bpf/progs/timer.c |  11 -
>  tools/testing/selftests/bpf/test_maps.c   |  38 +-
>  12 files changed, 795 insertions(+), 131 deletions(-)
>  create mode 100644 include/linux/bpf_mem_alloc.h
>  create mode 100644 kernel/bpf/memalloc.c
>
> --
> 2.30.2
>
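A minimal sketch of the mechanism the cover letter describes (a per-cpu cache of free elements in front of kmalloc(), refilled asynchronously from irq_work). The names (obj_cache, cache_alloc, cache_refill) and the assumption of one such struct per CPU are illustrative, not the actual bpf_mem_alloc internals:

#include <linux/irq_work.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/slab.h>

struct obj_cache {
	struct llist_head free_llist;	/* per-cpu list of free elements */
	int free_cnt;			/* number of cached elements */
	int low_watermark;		/* queue a refill below this */
	int unit_size;			/* size of each element */
	struct irq_work refill_work;	/* deferred refill */
};

/* Fast path: pop from the cache; never calls kmalloc() directly,
 * so it is usable from kprobe/fentry context. */
static void *cache_alloc(struct obj_cache *c)
{
	struct llist_node *n = __llist_del_first(&c->free_llist);

	if (!n)
		return NULL;		/* cache empty, caller must cope */
	if (--c->free_cnt < c->low_watermark)
		irq_work_queue(&c->refill_work);
	return n;
}

/* Slow path: runs from irq_work, where calling kmalloc() is safe. */
static void cache_refill(struct irq_work *work)
{
	struct obj_cache *c = container_of(work, struct obj_cache, refill_work);
	void *obj;

	while (c->free_cnt < 2 * c->low_watermark) {
		obj = kmalloc(c->unit_size, GFP_ATOMIC);
		if (!obj)
			break;
		__llist_add(obj, &c->free_llist);
		c->free_cnt++;
	}
}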
On Wed, Aug 24, 2022 at 1:03 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Fri, 19 Aug 2022 at 23:42, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > From: Alexei Starovoitov <ast@kernel.org> > > > > Introduce any context BPF specific memory allocator. > > > > Tracing BPF programs can attach to kprobe and fentry. Hence they > > run in unknown context where calling plain kmalloc() might not be safe. > > Front-end kmalloc() with per-cpu cache of free elements. > > Refill this cache asynchronously from irq_work. > > > > Major achievements enabled by bpf_mem_alloc: > > - Dynamically allocated hash maps used to be 10 times slower than fully preallocated. > > With bpf_mem_alloc and subsequent optimizations the speed of dynamic maps is equal to full prealloc. > > - Tracing bpf programs can use dynamically allocated hash maps. > > Potentially saving lots of memory. Typical hash map is sparsely populated. > > - Sleepable bpf programs can used dynamically allocated hash maps. > > > > From my side, for the whole series: > Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Thanks a bunch for all the suggestions. Especially for ideas that led to rewrite of patch 1. It looks much simpler now. I've missed #include <asm/local.h> in the patch 1. In the respin I'm going to keep your Ack if you don't mind. Thanks!
Alexei and I spent some time today going back and forth on what the uapi to this
allocator should look like in a BPF program. To both of our surprise, the problem
space became far more complicated than we anticipated.

There are three primary problems we have to solve:
1) Knowing which allocator an object came from, so we can safely reclaim it when
   necessary (e.g., freeing a map).
2) Type confusion between local and kernel types. (I.e., a program allocating
   kernel types and passing them to helpers/kfuncs that don't expect them.) This is
   especially important because the existing kptr mechanism assumes kernel types
   everywhere.
3) Allocated objects' lifetimes, allocator refcounting, etc. It all gets very hairy
   when you allow allocated objects in pinned maps.

This is the proposed design that we landed on:

1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial
   capacity at creation time. value_size > 0 takes the kmem_cache path, probably
   with btf_value_type_id enforcement for the kmem_cache path.

2. The helper APIs are just bpf_obj_alloc(bpf_map *,
   bpf_core_type_id_local(struct foo)) and bpf_obj_free(void *). Note that
   obj_free() only takes an object pointer.

3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local)
   annotates fields that can hold values with verifier type
   `PTR_TO_BTF_ID | BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs
   and only ever resolves against program-local btf (in the verifier; at runtime it
   only gets an allocation size).

3.1. If eventually we need to pass these objects to kfuncs/helpers, we can
   introduce a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and
   returns the corresponding PTR_TO_BTF_ID, after verifying against an allowlist
   of some kind. This would be the only place these objects can leak out of bpf
   land. If there's no runtime aspect (and there likely wouldn't be), we might
   consider doing this transparently, still against an allowlist of types.

4. To ensure the allocator stays alive while objects from it are alive, we must be
   able to identify which allocator each __kptr_local pointer came from, and we
   must keep the refcount up while any such values are alive. One concern here is
   that doing the refcount manipulation in kptr_xchg would be too expensive. The
   proposed solution is to:
4.1. Keep a struct bpf_mem_alloc* in the header before the returned object pointer
   from bpf_mem_alloc(). This way we never lose track of which bpf_mem_alloc to
   return the object to and can simplify the bpf_obj_free() call.
4.2. Track used_allocators in each bpf_map. When unloading a program, we would
   walk all maps that the program has access to (that have kptr_local fields), walk
   each value and ensure that any allocators not already in the map's
   used_allocators are refcount_inc'd and added to the list. Do note that
   allocators are also kept alive by their bpf_map wrapper, but after that's gone,
   used_allocators is the main mechanism. Once the bpf_map is gone, the allocator
   cannot be used to allocate new objects; we can only return objects to it.
4.3. On map free, we walk and obj_free() all the __kptr_local fields, then
   refcount_dec all the used_allocators.

Overall, we think this handles all the nasty corners - objects escaping into
kfuncs/helpers when they shouldn't, pinned maps containing pointers to
allocations, programs accessing multiple allocators having deterministic freelist
behaviors - while keeping the API and complexity sane. The used_allocators
approach can certainly be less conservative (or can even be precise), but for a v1
that's probably overkill.

Please, feel free to shoot holes in this design! We tried to capture everything but
I'd love confirmation that we didn't miss anything.

--Delyan
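To make the proposed program-facing API concrete, here is a rough sketch of how a BPF program might use it. BPF_MAP_TYPE_ALLOCATOR, bpf_obj_alloc(), bpf_obj_free() and the __kptr_local tag are the names proposed in this thread, not an existing uapi, so the map type, the helper prototypes and the tag definition below are hypothetical:

/* Proposed, not real, uapi pieces are marked as such in the comments. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

#define __kptr_local __attribute__((btf_type_tag("kptr_local")))	/* proposed tag */

/* hypothetical prototypes for the proposed helpers */
extern void *bpf_obj_alloc(void *allocator_map, u32 local_type_id) __ksym;
extern void bpf_obj_free(void *obj) __ksym;

struct foo {
	u64 counter;
};

struct {
	__uint(type, BPF_MAP_TYPE_ALLOCATOR);	/* proposed map type */
	__uint(max_entries, 256);		/* initial capacity */
	__type(value, struct foo);		/* value_size > 0: kmem_cache path */
} allocator SEC(".maps");

struct map_value {
	struct foo __kptr_local *obj;		/* proposed local-kptr field */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u32);
	__type(value, struct map_value);
} storage SEC(".maps");

SEC("fentry/do_nanosleep")
int store_obj(void *ctx)
{
	struct foo *obj, *old;
	struct map_value *v;
	u32 key = 0;

	v = bpf_map_lookup_elem(&storage, &key);
	if (!v)
		return 0;

	/* allocate a program-local struct foo from the allocator map */
	obj = bpf_obj_alloc(&allocator, bpf_core_type_id_local(struct foo));
	if (!obj)
		return 0;
	obj->counter = 1;

	/* move it into the map; free whatever was there before */
	old = bpf_kptr_xchg(&v->obj, obj);
	if (old)
		bpf_obj_free(old);	/* obj_free() only takes the object pointer */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";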
On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote: > > Alexei and I spent some time today going back and forth on what the uapi to this > allocator should look like in a BPF program. To both of our surprise, the problem > space became far more complicated than we anticipated. > > There are three primary problems we have to solve: > 1) Knowing which allocator an object came from, so we can safely reclaim it when > necessary (e.g., freeing a map). > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel > types and passing them to helpers/kfuncs that don't expect them). This is especially > important because the existing kptr mechanism assumes kernel types everywhere. Why is the btf_is_kernel(reg->btf) check not enough to distinguish local vs kernel kptr? We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now. Fun fact: I added a similar check on purpose in map_kptr_match_type, since Alexei mentioned back then he was working on a local type allocator, so forgetting to add it later would have been a problem. > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy > when you allow allocated objects in pinned maps. > > This is the proposed design that we landed on: > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity > at creation time. Value_size > 0 takes the kmem_cache path. Probably with > btf_value_type_id enforcement for the kmem_cache path. > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer. > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local) > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID | > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves > against program-local btf (in the verifier, at runtime it only gets an allocation > size). This is ok too, but I think just gating everywhere with btf_is_kernel would be fine as well. > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This It would be fine to allow passing if it is just plain data (e.g. what scalar_struct check does for kfuncs). There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID, etc. so it was necessary to restrict the kind of type to LCD. But we don't have to do it from day 1, just listing what should be ok. > would be the only place these objects can leak out of bpf land. If there's no runtime > aspect (and there likely wouldn't be), we might consider doing this transparently, > still against an allowlist of types. > > 4. To ensure the allocator stays alive while objects from it are alive, we must be > able to identify which allocator each __kptr_local pointer came from, and we must > keep the refcount up while any such values are alive. One concern here is that doing > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution > is to: > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the > object to and can simplify the bpf_obj_free() call. > 4.2. Tracking used_allocators in each bpf_map. 
When unloading a program, we would > walk all maps that the program has access to (that have kptr_local fields), walk each > value and ensure that any allocators not already in the map's used_allocators are > refcount_inc'd and added to the list. Do note that allocators are also kept alive by > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism. > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we > can only return objects to it. > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then > refcount_dec all the used_allocators. > So to summarize your approach: Each allocation has a bpf_mem_alloc pointer before it to track its owner allocator. We know used_maps of each prog, so during unload of program, walk all local kptrs in each used_maps map values, and that map takes a reference to the allocator stashing it in used_allocators list, because prog is going to relinquish its ref to allocator_map (which if it were the last one would release allocator reference as well for local kptrs held by those maps). Once prog is gone, the allocator is kept alive by other maps holding objects allocated from it. References to the allocator are taken lazily when required. Did I get it right? I see two problems: the first is concurrency. When walking each value, it is going to be hard to ensure the kptr field remains stable while you load and take ref to its allocator. Some other programs may also have access to the map value and may concurrently change the kptr field (xchg and even release it). How do we safely do a refcount_inc of its allocator? For the second problem, consider this: obj = bpf_obj_alloc(&alloc_map, ...); inner_map = bpf_map_lookup_elem(&map_in_map, ...); map_val = bpf_map_lookup_elem(inner_map, ...); kptr_xchg(&map_val->kptr, obj); Now delete the entry having that inner_map, but keep its fd open. Unload the program, since it is map-in-map, no way to fill used_allocators. alloc_map is freed, releases reference on allocator, allocator is freed. Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak. Is there a way to prevent this in your scheme? -- I had another idea, but it's not _completely_ 0 overhead. Heavy prototyping so I might be missing corner cases. It is to take reference on each allocation and deallocation. Yes, naive and slow if using atomics, but instead we can use percpu_ref instead of atomic refcount for the allocator. percpu_ref_get/put on each unit_alloc/unit_free. The problem though is that once initial reference is killed, it downgrades to atomic, which will kill performance. So we need to be smart about how that initial reference is managed. My idea is that the initial ref is taken and killed by the allocator bpf_map pinning the allocator. Once that bpf_map is gone, you cannot do any more allocations anyway (since you need to pass the map pointer to bpf_obj_alloc), so once it downgrades to atomics at that point we will only be releasing the references after freeing its allocated objects. Yes, then the free path becomes a bit costly after the allocator map is gone. We might be able to remove the cost on free path as well using the used_allocators scheme from above (to delay percpu_ref_kill), but it is not clear how to safely increment the ref of the allocator from map value... wdyt? 
> Overall, we think this handles all the nasty corners - objects escaping into > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations, > programs accessing multiple allocators having deterministic freelist behaviors - > while keeping the API and complexity sane. The used_allocators approach can certainly > be less conservative (or can be even precise) but for a v1 that's probably overkill. > > Please, feel free to shoot holes in this design! We tried to capture everything but > I'd love confirmation that we didn't miss anything. > > --Delyan
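A rough sketch of the percpu_ref idea Kumar describes above, using the kernel's existing percpu_ref API. Everything other than the percpu_ref_* calls (struct my_alloc, unit_alloc/unit_free and the kmalloc/kfree standing in for the per-cpu caches) is illustrative:

#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/slab.h>

struct my_alloc {
	struct percpu_ref ref;		/* tracks the allocator's lifetime */
	/* per-cpu caches would live here */
};

static void my_alloc_release(struct percpu_ref *ref)
{
	struct my_alloc *a = container_of(ref, struct my_alloc, ref);

	/* every allocated object has been returned; tear everything down */
	kfree(a);
}

static struct my_alloc *my_alloc_create(void)
{
	struct my_alloc *a = kzalloc(sizeof(*a), GFP_KERNEL);

	if (!a)
		return NULL;
	/* the initial reference is owned by the allocator bpf_map */
	if (percpu_ref_init(&a->ref, my_alloc_release, 0, GFP_KERNEL)) {
		kfree(a);
		return NULL;
	}
	return a;
}

static void *unit_alloc(struct my_alloc *a, size_t size)
{
	void *obj = kmalloc(size, GFP_ATOMIC);	/* stands in for the cache pop */

	if (obj)
		percpu_ref_get(&a->ref);	/* cheap percpu op while map is alive */
	return obj;
}

static void unit_free(struct my_alloc *a, void *obj)
{
	kfree(obj);				/* stands in for the cache push */
	percpu_ref_put(&a->ref);		/* atomic only after percpu_ref_kill() */
}

/* Called when the allocator bpf_map goes away: no new allocations are
 * possible, so drop the initial ref; release runs once all objects return. */
static void my_alloc_map_free(struct my_alloc *a)
{
	percpu_ref_kill(&a->ref);
}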
Thanks for taking a look, Kumar! On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote: > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote: > > > > Alexei and I spent some time today going back and forth on what the uapi to this > > allocator should look like in a BPF program. To both of our surprise, the problem > > space became far more complicated than we anticipated. > > > > There are three primary problems we have to solve: > > 1) Knowing which allocator an object came from, so we can safely reclaim it when > > necessary (e.g., freeing a map). > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel > > types and passing them to helpers/kfuncs that don't expect them). This is especially > > important because the existing kptr mechanism assumes kernel types everywhere. > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish > local vs kernel kptr? Answered below. > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now. > > Fun fact: I added a similar check on purpose in map_kptr_match_type, > since Alexei mentioned back then he was working on a local type > allocator, so forgetting to add it later would have been a problem. > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy > > when you allow allocated objects in pinned maps. > > > > This is the proposed design that we landed on: > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with > > btf_value_type_id enforcement for the kmem_cache path. > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer. > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local) > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID | > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves > > against program-local btf (in the verifier, at runtime it only gets an allocation > > size). > > This is ok too, but I think just gating everywhere with btf_is_kernel > would be fine as well. Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that in the btf field of the register/stack slot/kptr/helper proto. That said, we still need the new type tag to tell the map btf parsing code to use the local btf in the kptr descriptor. > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This > > It would be fine to allow passing if it is just plain data (e.g. what > scalar_struct check does for kfuncs). > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID, > etc. so it was necessary to restrict the kind of type to LCD. > > But we don't have to do it from day 1, just listing what should be ok. That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic. > > > would be the only place these objects can leak out of bpf land. If there's no runtime > > aspect (and there likely wouldn't be), we might consider doing this transparently, > > still against an allowlist of types. > > > > 4. 
To ensure the allocator stays alive while objects from it are alive, we must be > > able to identify which allocator each __kptr_local pointer came from, and we must > > keep the refcount up while any such values are alive. One concern here is that doing > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution > > is to: > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the > > object to and can simplify the bpf_obj_free() call. > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would > > walk all maps that the program has access to (that have kptr_local fields), walk each > > value and ensure that any allocators not already in the map's used_allocators are > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism. > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we > > can only return objects to it. > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then > > refcount_dec all the used_allocators. > > > > So to summarize your approach: > Each allocation has a bpf_mem_alloc pointer before it to track its > owner allocator. > We know used_maps of each prog, so during unload of program, walk all > local kptrs in each used_maps map values, and that map takes a > reference to the allocator stashing it in used_allocators list, > because prog is going to relinquish its ref to allocator_map (which if > it were the last one would release allocator reference as well for > local kptrs held by those maps). > Once prog is gone, the allocator is kept alive by other maps holding > objects allocated from it. References to the allocator are taken > lazily when required. > Did I get it right? That's correct! > > I see two problems: the first is concurrency. When walking each value, > it is going to be hard to ensure the kptr field remains stable while > you load and take ref to its allocator. Some other programs may also > have access to the map value and may concurrently change the kptr > field (xchg and even release it). How do we safely do a refcount_inc > of its allocator? Fair question. You can think of that pointer as immutable for the entire time that the allocator is able to interact with the object. Once the object makes it on a freelist, it won't be released until an rcu qs. Therefore, the first time that value can change - when we return the object to the global kmalloc pool - it has provably no bpf-side concurrent observers. Alexei, please correct me if I misunderstood how the design is supposed to work. > > For the second problem, consider this: > obj = bpf_obj_alloc(&alloc_map, ...); > inner_map = bpf_map_lookup_elem(&map_in_map, ...); > map_val = bpf_map_lookup_elem(inner_map, ...); > kptr_xchg(&map_val->kptr, obj); > > Now delete the entry having that inner_map, but keep its fd open. > Unload the program, since it is map-in-map, no way to fill used_allocators. > alloc_map is freed, releases reference on allocator, allocator is freed. > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak. > Is there a way to prevent this in your scheme? This is fair, inner maps not being tracked in used_maps is a wrench in that plan. > - > > I had another idea, but it's not _completely_ 0 overhead. 
Heavy > prototyping so I might be missing corner cases. > It is to take reference on each allocation and deallocation. Yes, > naive and slow if using atomics, but instead we can use percpu_ref > instead of atomic refcount for the allocator. percpu_ref_get/put on > each unit_alloc/unit_free. > The problem though is that once initial reference is killed, it > downgrades to atomic, which will kill performance. So we need to be > smart about how that initial reference is managed. > My idea is that the initial ref is taken and killed by the allocator > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot > do any more allocations anyway (since you need to pass the map pointer > to bpf_obj_alloc), so once it downgrades to atomics at that point we > will only be releasing the references after freeing its allocated > objects. Yes, then the free path becomes a bit costly after the > allocator map is gone. > > We might be able to remove the cost on free path as well using the > used_allocators scheme from above (to delay percpu_ref_kill), but it > is not clear how to safely increment the ref of the allocator from map > value... As explained above, the values are already rcu-protected, so we can use that to coordinate refcounting of the allocator. That said, percpu_ref could work (I was considering something similar within the allocator itself) but I'm not convinced about the cost. My concern is that once it becomes atomic_t, it erases the benefits of all the work in the allocator that maintains percpu data structures. I wonder if the allocator should maintain percpu live counts (with underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then, we can have explicit "sum up all the counts to discover if you should be destroyed" calls. If we keep the used_allocators scheme, these calls can be inserted at program unload for maps in used_maps and at map free time for maps that escape that mechanism - map goes over all its used_allocators and have them confirm the liveness count is > 0. I think doing it this way we cover the hole with map-in-map without regressing any path. Thoughts? > > wdyt? > > > Overall, we think this handles all the nasty corners - objects escaping into > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations, > > programs accessing multiple allocators having deterministic freelist behaviors - > > while keeping the API and complexity sane. The used_allocators approach can certainly > > be less conservative (or can be even precise) but for a v1 that's probably overkill. > > > > Please, feel free to shoot holes in this design! We tried to capture everything but > > I'd love confirmation that we didn't miss anything. > > > > --Delyan
Thanks for taking a look, Kumar! On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote: > > > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > Alexei and I spent some time today going back and forth on what the uapi to this > > > > allocator should look like in a BPF program. To both of our surprise, the problem > > > > space became far more complicated than we anticipated. > > > > > > > > There are three primary problems we have to solve: > > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when > > > > necessary (e.g., freeing a map). > > > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel > > > > types and passing them to helpers/kfuncs that don't expect them). This is especially > > > > important because the existing kptr mechanism assumes kernel types everywhere. > > > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish > > local vs kernel kptr? Answered below. > > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now. > > > > Fun fact: I added a similar check on purpose in map_kptr_match_type, > > since Alexei mentioned back then he was working on a local type > > allocator, so forgetting to add it later would have been a problem. > > > > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy > > > > when you allow allocated objects in pinned maps. > > > > > > > > This is the proposed design that we landed on: > > > > > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity > > > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with > > > > btf_value_type_id enforcement for the kmem_cache path. > > > > > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct > > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer. > > > > > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local) > > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID | > > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves > > > > against program-local btf (in the verifier, at runtime it only gets an allocation > > > > size). > > > > This is ok too, but I think just gating everywhere with btf_is_kernel > > would be fine as well. Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that in the btf field of the register/stack slot/kptr/helper proto. That said, we still need the new type tag to tell the map btf parsing code to use the local btf in the kptr descriptor. > > > > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce > > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the > > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This > > > > It would be fine to allow passing if it is just plain data (e.g. what > > scalar_struct check does for kfuncs). > > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID, > > etc. so it was necessary to restrict the kind of type to LCD. > > > > But we don't have to do it from day 1, just listing what should be ok. That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic. > > > > > > would be the only place these objects can leak out of bpf land. 
If there's no runtime > > > > aspect (and there likely wouldn't be), we might consider doing this transparently, > > > > still against an allowlist of types. > > > > > > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be > > > > able to identify which allocator each __kptr_local pointer came from, and we must > > > > keep the refcount up while any such values are alive. One concern here is that doing > > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution > > > > is to: > > > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer > > > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the > > > > object to and can simplify the bpf_obj_free() call. > > > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would > > > > walk all maps that the program has access to (that have kptr_local fields), walk each > > > > value and ensure that any allocators not already in the map's used_allocators are > > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by > > > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism. > > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we > > > > can only return objects to it. > > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then > > > > refcount_dec all the used_allocators. > > > > > > > > So to summarize your approach: > > Each allocation has a bpf_mem_alloc pointer before it to track its > > owner allocator. > > We know used_maps of each prog, so during unload of program, walk all > > local kptrs in each used_maps map values, and that map takes a > > reference to the allocator stashing it in used_allocators list, > > because prog is going to relinquish its ref to allocator_map (which if > > it were the last one would release allocator reference as well for > > local kptrs held by those maps). > > Once prog is gone, the allocator is kept alive by other maps holding > > objects allocated from it. References to the allocator are taken > > lazily when required. > > Did I get it right? That's correct! > > > > I see two problems: the first is concurrency. When walking each value, > > it is going to be hard to ensure the kptr field remains stable while > > you load and take ref to its allocator. Some other programs may also > > have access to the map value and may concurrently change the kptr > > field (xchg and even release it). How do we safely do a refcount_inc > > of its allocator? Fair question. You can think of that pointer as immutable for the entire time that the allocator is able to interact with the object. Once the object makes it on a freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time that value can change - when we return the object to the global kmalloc pool - it has provably no bpf-side concurrent observers. Alexei, please correct me if I misunderstood how the design is supposed to work. > > > > For the second problem, consider this: > > obj = bpf_obj_alloc(&alloc_map, ...); > > inner_map = bpf_map_lookup_elem(&map_in_map, ...); > > map_val = bpf_map_lookup_elem(inner_map, ...); > > kptr_xchg(&map_val->kptr, obj); > > > > Now delete the entry having that inner_map, but keep its fd open. > > Unload the program, since it is map-in-map, no way to fill used_allocators. 
> > alloc_map is freed, releases reference on allocator, allocator is freed. > > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak. > > Is there a way to prevent this in your scheme? This is fair, inner maps not being tracked in used_maps is a wrench in that plan. However, we can have the parent map propagate its used_allocators on inner map removal. > > - > > > > I had another idea, but it's not _completely_ 0 overhead. Heavy > > prototyping so I might be missing corner cases. > > It is to take reference on each allocation and deallocation. Yes, > > naive and slow if using atomics, but instead we can use percpu_ref > > instead of atomic refcount for the allocator. percpu_ref_get/put on > > each unit_alloc/unit_free. > > The problem though is that once initial reference is killed, it > > downgrades to atomic, which will kill performance. So we need to be > > smart about how that initial reference is managed. > > My idea is that the initial ref is taken and killed by the allocator > > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot > > do any more allocations anyway (since you need to pass the map pointer > > to bpf_obj_alloc), so once it downgrades to atomics at that point we > > will only be releasing the references after freeing its allocated > > objects. Yes, then the free path becomes a bit costly after the > > allocator map is gone. > > > > We might be able to remove the cost on free path as well using the > > used_allocators scheme from above (to delay percpu_ref_kill), but it > > is not clear how to safely increment the ref of the allocator from map > > value... As explained above, the values are already rcu-protected, so we can use that to coordinate refcounting of the allocator. That said, percpu_ref could work (I was considering something similar within the allocator itself) but I'm not convinced the cost is right. My main concern would be that once it becomes atomic_t, it erases the benefits of all the work in the allocator that maintains percpu data structures. If we want to go down this path, the allocator can maintain percpu live counts (with underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then, we can have explicit "sum up all the counts to discover if you should be destroyed" calls. If we keep the used_allocators scheme, these calls can be inserted at program unload for maps in used_maps and at map free time for maps that escape that mechanism. Or, we just extend the map-in-map mechanism to propagate used_allocators as needed. There are nice debug properties of the allocator knowing the liveness counts but we don't have to put them on the path to correctness. > > > > wdyt? > > > > > > Overall, we think this handles all the nasty corners - objects escaping into > > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations, > > > > programs accessing multiple allocators having deterministic freelist behaviors - > > > > while keeping the API and complexity sane. The used_allocators approach can certainly > > > > be less conservative (or can be even precise) but for a v1 that's probably overkill. > > > > > > > > Please, feel free to shoot holes in this design! We tried to capture everything but > > > > I'd love confirmation that we didn't miss anything. > > > > > > > > --Delyan
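A sketch of the per-cpu live-count alternative mentioned above: each CPU keeps a signed local count that may go negative when frees outnumber allocations on that CPU, and an explicit summation pass decides whether the allocator can be torn down. All structure and function names here are illustrative:

#include <linux/percpu.h>

struct my_cache {
	long live_cnt;		/* allocs minus frees on this CPU; may go negative */
	/* per-cpu freelist would live here */
};

struct my_alloc {
	struct my_cache __percpu *caches;
};

static void count_alloc(struct my_alloc *a)
{
	this_cpu_inc(a->caches->live_cnt);
}

static void count_free(struct my_alloc *a)
{
	this_cpu_dec(a->caches->live_cnt);
}

/* Explicit "sum up all the counts" pass, e.g. at prog unload or map free;
 * only the total across all CPUs is meaningful. */
static bool my_alloc_is_idle(struct my_alloc *a)
{
	long total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += per_cpu_ptr(a->caches, cpu)->live_cnt;
	return total == 0;
}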
On Mon, 29 Aug 2022 at 23:29, Delyan Kratunov <delyank@fb.com> wrote: > > Thanks for taking a look, Kumar! > > On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote: > > > > > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > > > Alexei and I spent some time today going back and forth on what the uapi to this > > > > > allocator should look like in a BPF program. To both of our surprise, the problem > > > > > space became far more complicated than we anticipated. > > > > > > > > > > There are three primary problems we have to solve: > > > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when > > > > > necessary (e.g., freeing a map). > > > > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel > > > > > types and passing them to helpers/kfuncs that don't expect them). This is especially > > > > > important because the existing kptr mechanism assumes kernel types everywhere. > > > > > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish > > > local vs kernel kptr? > > Answered below. > > > > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now. > > > > > > Fun fact: I added a similar check on purpose in map_kptr_match_type, > > > since Alexei mentioned back then he was working on a local type > > > allocator, so forgetting to add it later would have been a problem. > > > > > > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy > > > > > when you allow allocated objects in pinned maps. > > > > > > > > > > This is the proposed design that we landed on: > > > > > > > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity > > > > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with > > > > > btf_value_type_id enforcement for the kmem_cache path. > > > > > > > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct > > > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer. > > > > > > > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local) > > > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID | > > > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves > > > > > against program-local btf (in the verifier, at runtime it only gets an allocation > > > > > size). > > > > > > This is ok too, but I think just gating everywhere with btf_is_kernel > > > would be fine as well. > > > Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that > in the btf field of the register/stack slot/kptr/helper proto. That said, we still > need the new type tag to tell the map btf parsing code to use the local btf in the > kptr descriptor. > Agreed, the new __local type tag looks necessary to make it search in map BTF instead. > > > > > > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce > > > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the > > > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This > > > > > > It would be fine to allow passing if it is just plain data (e.g. what > > > scalar_struct check does for kfuncs). > > > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID, > > > etc. so it was necessary to restrict the kind of type to LCD. 
> > > > > > But we don't have to do it from day 1, just listing what should be ok. > > That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic. > > > > > > > > > would be the only place these objects can leak out of bpf land. If there's no runtime > > > > > aspect (and there likely wouldn't be), we might consider doing this transparently, > > > > > still against an allowlist of types. > > > > > > > > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be > > > > > able to identify which allocator each __kptr_local pointer came from, and we must > > > > > keep the refcount up while any such values are alive. One concern here is that doing > > > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution > > > > > is to: > > > > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer > > > > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the > > > > > object to and can simplify the bpf_obj_free() call. > > > > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would > > > > > walk all maps that the program has access to (that have kptr_local fields), walk each > > > > > value and ensure that any allocators not already in the map's used_allocators are > > > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by > > > > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism. > > > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we > > > > > can only return objects to it. > > > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then > > > > > refcount_dec all the used_allocators. > > > > > > > > > > > So to summarize your approach: > > > Each allocation has a bpf_mem_alloc pointer before it to track its > > > owner allocator. > > > We know used_maps of each prog, so during unload of program, walk all > > > local kptrs in each used_maps map values, and that map takes a > > > reference to the allocator stashing it in used_allocators list, > > > because prog is going to relinquish its ref to allocator_map (which if > > > it were the last one would release allocator reference as well for > > > local kptrs held by those maps). > > > Once prog is gone, the allocator is kept alive by other maps holding > > > objects allocated from it. References to the allocator are taken > > > lazily when required. > > > Did I get it right? > > That's correct! > > > > > > > I see two problems: the first is concurrency. When walking each value, > > > it is going to be hard to ensure the kptr field remains stable while > > > you load and take ref to its allocator. Some other programs may also > > > have access to the map value and may concurrently change the kptr > > > field (xchg and even release it). How do we safely do a refcount_inc > > > of its allocator? > > Fair question. You can think of that pointer as immutable for the entire time that > the allocator is able to interact with the object. Once the object makes it on a > freelist, it won't be released until an rcu gp has elapsed. Therefore, the first time > that value can change - when we return the object to the global kmalloc pool - it has > provably no bpf-side concurrent observers. > I don't think that assumption will hold. 
Requiring RCU protection for all local kptrs means allocator cache reuse becomes harder, as elements are stuck in freelist until the next grace period. It necessitates use of larger caches. For some use cases where they can tolerate reuse, it might not be optimal. IMO the allocator should be independent of how the lifetime of elements is managed. That said, even if you assume RCU protection, that still doesn't address the real problem. Yes, you can access the value without worrying about it moving to another map, but consider this case: Prog unloading, populate_used_allocators -> map walks map_values, tries to take reference to local kptr whose backing allocator is A. Loads kptr, meanwhile some other prog sharing access to the map value exchanges (kptr_xchg) another pointer into that field. Now you take reference to allocator A in used_allocators, while actual value in map is for allocator B. So you either have to cmpxchg and then retry if it fails (which is not a wait-free operation, and honestly not great imo), or if you don't handle this: Someone moved an allocated local kptr backed by B into your map, but you don't hold reference to it. That other program may release allocator map -> allocator, and then when this map is destroyed, on unit_free it will be use-after-free of bpf_mem_alloc *. I don't see an easy way around these kinds of problems. And this is just one specific scenario. > Alexei, please correct me if I misunderstood how the design is supposed to work. > > > > > > > For the second problem, consider this: > > > obj = bpf_obj_alloc(&alloc_map, ...); > > > inner_map = bpf_map_lookup_elem(&map_in_map, ...); > > > map_val = bpf_map_lookup_elem(inner_map, ...); > > > kptr_xchg(&map_val->kptr, obj); > > > > > > Now delete the entry having that inner_map, but keep its fd open. > > > Unload the program, since it is map-in-map, no way to fill used_allocators. > > > alloc_map is freed, releases reference on allocator, allocator is freed. > > > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak. > > > Is there a way to prevent this in your scheme? > > This is fair, inner maps not being tracked in used_maps is a wrench in that plan. > However, we can have the parent map propagate its used_allocators on inner map > removal. > But used_allocators are not populated until unload? inner_map removal can happen while the program is loaded/attached. Or will you populate it before unloading, everytime during inner_map removal? Then that would be very costly for a bpf_map_delete_elem... > > > - > > > > > > I had another idea, but it's not _completely_ 0 overhead. Heavy > > > prototyping so I might be missing corner cases. > > > It is to take reference on each allocation and deallocation. Yes, > > > naive and slow if using atomics, but instead we can use percpu_ref > > > instead of atomic refcount for the allocator. percpu_ref_get/put on > > > each unit_alloc/unit_free. > > > The problem though is that once initial reference is killed, it > > > downgrades to atomic, which will kill performance. So we need to be > > > smart about how that initial reference is managed. > > > My idea is that the initial ref is taken and killed by the allocator > > > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot > > > do any more allocations anyway (since you need to pass the map pointer > > > to bpf_obj_alloc), so once it downgrades to atomics at that point we > > > will only be releasing the references after freeing its allocated > > > objects. 
Yes, then the free path becomes a bit costly after the > > > allocator map is gone. > > > > > > We might be able to remove the cost on free path as well using the > > > used_allocators scheme from above (to delay percpu_ref_kill), but it > > > is not clear how to safely increment the ref of the allocator from map > > > value... > > As explained above, the values are already rcu-protected, so we can use that to > coordinate refcounting of the allocator. That said, percpu_ref could work (I was > considering something similar within the allocator itself) but I'm not convinced the > cost is right. My main concern would be that once it becomes atomic_t, it erases the > benefits of all the work in the allocator that maintains percpu data structures. > > If we want to go down this path, the allocator can maintain percpu live counts (with > underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then, > we can have explicit "sum up all the counts to discover if you should be destroyed" > calls. If we keep the used_allocators scheme, these calls can be inserted at program > unload for maps in used_maps and at map free time for maps that escape that > mechanism. Yes, it would be a good idea to put the percpu refcount for this case inside the already percpu bpf_mem_cache struct, since that will be much better than allocating it separately. The increment will then be a 100% cache hit. The main question is how this "sum up all the count" operation needs to be done. Once that initial reference of bpf_map is gone, you need to track the final owner who will be responsible to release the allocator. You will need to do something similar to percpu_ref's atomic count upgrade unless I'm missing something. Once you establish that used_allocators cannot be safely populated on unload (which you can correct me on), the only utility I see for it is delaying the atomic upgrade for this idea. So another approach (though I don't like this too much): One solution to delay the upgrade can be that when you have allocator maps and other normal maps in used_maps, it always incs ref on the allocator pinned by allocator map on unload for maps that have local kptr, so each map having local kptrs speculatively takes ref on allocator maps of this prog, assuming allocations from that allocator map are more likely to be in these maps. Same with other progs having different allocator map but sharing these maps. It is not very precise, but until those maps are gone it delays release of the allocator (we can empty all percpu caches to save memory once bpf_map pinning the allocator is gone, because allocations are not going to be served). But it allows unit_free to be relatively less costly as long as those 'candidate' maps are around. > > Or, we just extend the map-in-map mechanism to propagate used_allocators as needed. > There are nice debug properties of the allocator knowing the liveness counts but we > don't have to put them on the path to correctness. > > > > > > > wdyt? > > > > > > > > Overall, we think this handles all the nasty corners - objects escaping into > > > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations, > > > > > programs accessing multiple allocators having deterministic freelist behaviors - > > > > > while keeping the API and complexity sane. The used_allocators approach can certainly > > > > > be less conservative (or can be even precise) but for a v1 that's probably overkill. > > > > > > > > > > Please, feel free to shoot holes in this design! 
We tried to capture everything but > > > > > I'd love confirmation that we didn't miss anything. > > > > > > > > > > --Delyan > >
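The bpf_mem_alloc use-after-free being debated here hinges on point 4.1 of the proposal: every object carries a pointer to its owning allocator in a hidden header just in front of it, which is what bpf_obj_free(void *) would dereference. A minimal sketch of that layout, with illustrative names and kmalloc/kfree standing in for the real freelists:

#include <linux/bug.h>
#include <linux/slab.h>

struct my_alloc;			/* stands in for struct bpf_mem_alloc */

struct obj_hdr {
	struct my_alloc *owner;		/* allocator to return the object to */
};

static void *obj_alloc(struct my_alloc *a, size_t size)
{
	struct obj_hdr *hdr = kmalloc(sizeof(*hdr) + size, GFP_ATOMIC);

	if (!hdr)
		return NULL;
	hdr->owner = a;
	return hdr + 1;			/* the program only sees the payload */
}

/* In the real design this would push the element back onto a per-cpu
 * freelist of hdr->owner; if that allocator has already been freed, the
 * dereference is exactly the use-after-free discussed above. */
static void obj_free(void *obj)
{
	struct obj_hdr *hdr = (struct obj_hdr *)obj - 1;

	WARN_ON_ONCE(!hdr->owner);
	kfree(hdr);			/* stands in for returning to hdr->owner */
}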
On Tue, 2022-08-30 at 00:07 +0200, Kumar Kartikeya Dwivedi wrote: > > On Mon, 29 Aug 2022 at 23:29, Delyan Kratunov <delyank@fb.com> wrote: > > > > Thanks for taking a look, Kumar! > > > > On Fri, 2022-08-26 at 06:03 +0200, Kumar Kartikeya Dwivedi wrote: > > > > > > > > On Thu, 25 Aug 2022 at 02:56, Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > > > > > Alexei and I spent some time today going back and forth on what the uapi to this > > > > > > allocator should look like in a BPF program. To both of our surprise, the problem > > > > > > space became far more complicated than we anticipated. > > > > > > > > > > > > There are three primary problems we have to solve: > > > > > > 1) Knowing which allocator an object came from, so we can safely reclaim it when > > > > > > necessary (e.g., freeing a map). > > > > > > 2) Type confusion between local and kernel types. (I.e., a program allocating kernel > > > > > > types and passing them to helpers/kfuncs that don't expect them). This is especially > > > > > > important because the existing kptr mechanism assumes kernel types everywhere. > > > > > > > > Why is the btf_is_kernel(reg->btf) check not enough to distinguish > > > > local vs kernel kptr? > > > > Answered below. > > > > > > We add that wherever kfunc/helpers verify the PTR_TO_BTF_ID right now. > > > > > > > > Fun fact: I added a similar check on purpose in map_kptr_match_type, > > > > since Alexei mentioned back then he was working on a local type > > > > allocator, so forgetting to add it later would have been a problem. > > > > > > > > > > 3) Allocated objects lifetimes, allocator refcounting, etc. It all gets very hairy > > > > > > when you allow allocated objects in pinned maps. > > > > > > > > > > > > This is the proposed design that we landed on: > > > > > > > > > > > > 1. Allocators get their own MAP_TYPE_ALLOCATOR, so you can specify initial capacity > > > > > > at creation time. Value_size > 0 takes the kmem_cache path. Probably with > > > > > > btf_value_type_id enforcement for the kmem_cache path. > > > > > > > > > > > > 2. The helper APIs are just bpf_obj_alloc(bpf_map *, bpf_core_type_id_local(struct > > > > > > foo)) and bpf_obj_free(void *). Note that obj_free() only takes an object pointer. > > > > > > > > > > > > 3. To avoid mixing BTF type domains, a new type tag (provisionally __kptr_local) > > > > > > annotates fields that can hold values with verifier type `PTR_TO_BTF_ID | > > > > > > BTF_ID_LOCAL`. obj_alloc only ever returns these local kptrs and only ever resolves > > > > > > against program-local btf (in the verifier, at runtime it only gets an allocation > > > > > > size). > > > > > > > > This is ok too, but I think just gating everywhere with btf_is_kernel > > > > would be fine as well. > > > > > > Yeah, I can get behind not using BTF_LOCAL_ID as a type flag and just encoding that > > in the btf field of the register/stack slot/kptr/helper proto. That said, we still > > need the new type tag to tell the map btf parsing code to use the local btf in the > > kptr descriptor. > > > > Agreed, the new __local type tag looks necessary to make it search in > map BTF instead. > > > > > > > > > > > 3.1. If eventually we need to pass these objects to kfuncs/helpers, we can introduce > > > > > > a new bpf_obj_export helper that takes a PTR_TO_LOCAL_BTF_ID and returns the > > > > > > corresponding PTR_TO_BTF_ID, after verifying against an allowlist of some kind. This > > > > > > > > It would be fine to allow passing if it is just plain data (e.g. 
what > > > > scalar_struct check does for kfuncs). > > > > There we had the issue where it can take PTR_TO_MEM, PTR_TO_BTF_ID, > > > > etc. so it was necessary to restrict the kind of type to LCD. > > > > > > > > But we don't have to do it from day 1, just listing what should be ok. > > > > That's a good call, I'll add it to the initial can-transition-to-kernel-kptr logic. > > > > > > > > > > > > would be the only place these objects can leak out of bpf land. If there's no runtime > > > > > > aspect (and there likely wouldn't be), we might consider doing this transparently, > > > > > > still against an allowlist of types. > > > > > > > > > > > > 4. To ensure the allocator stays alive while objects from it are alive, we must be > > > > > > able to identify which allocator each __kptr_local pointer came from, and we must > > > > > > keep the refcount up while any such values are alive. One concern here is that doing > > > > > > the refcount manipulation in kptr_xchg would be too expensive. The proposed solution > > > > > > is to: > > > > > > 4.1 Keep a struct bpf_mem_alloc* in the header before the returned object pointer > > > > > > from bpf_mem_alloc(). This way we never lose track which bpf_mem_alloc to return the > > > > > > object to and can simplify the bpf_obj_free() call. > > > > > > 4.2. Tracking used_allocators in each bpf_map. When unloading a program, we would > > > > > > walk all maps that the program has access to (that have kptr_local fields), walk each > > > > > > value and ensure that any allocators not already in the map's used_allocators are > > > > > > refcount_inc'd and added to the list. Do note that allocators are also kept alive by > > > > > > their bpf_map wrapper but after that's gone, used_allocators is the main mechanism. > > > > > > Once the bpf_map is gone, the allocator cannot be used to allocate new objects, we > > > > > > can only return objects to it. > > > > > > 4.3. On map free, we walk and obj_free() all the __kptr_local fields, then > > > > > > refcount_dec all the used_allocators. > > > > > > > > > > > > > > So to summarize your approach: > > > > Each allocation has a bpf_mem_alloc pointer before it to track its > > > > owner allocator. > > > > We know used_maps of each prog, so during unload of program, walk all > > > > local kptrs in each used_maps map values, and that map takes a > > > > reference to the allocator stashing it in used_allocators list, > > > > because prog is going to relinquish its ref to allocator_map (which if > > > > it were the last one would release allocator reference as well for > > > > local kptrs held by those maps). > > > > Once prog is gone, the allocator is kept alive by other maps holding > > > > objects allocated from it. References to the allocator are taken > > > > lazily when required. > > > > Did I get it right? > > > > That's correct! > > > > > > > > > > I see two problems: the first is concurrency. When walking each value, > > > > it is going to be hard to ensure the kptr field remains stable while > > > > you load and take ref to its allocator. Some other programs may also > > > > have access to the map value and may concurrently change the kptr > > > > field (xchg and even release it). How do we safely do a refcount_inc > > > > of its allocator? > > > > Fair question. You can think of that pointer as immutable for the entire time that > > the allocator is able to interact with the object. Once the object makes it on a > > freelist, it won't be released until an rcu gp has elapsed. 
Therefore, the first time > > that value can change - when we return the object to the global kmalloc pool - it has > > provably no bpf-side concurrent observers. > > > > I don't think that assumption will hold. Requiring RCU protection for > all local kptrs means allocator cache reuse becomes harder, as > elements are stuck in freelist until the next grace period. It > necessitates use of larger caches. > For some use cases where they can tolerate reuse, it might not be > optimal. IMO the allocator should be independent of how the lifetime > of elements is managed. All maps and allocations are already rcu-protected, we're not adding new here. We're only relying on this rcu protection (c.f. the chain of call_rcu_task_trace and call_rcu in the patchset) to prove that no program can observe an allocator pointer that is corrupted or stale. > > That said, even if you assume RCU protection, that still doesn't > address the real problem. Yes, you can access the value without > worrying about it moving to another map, but consider this case: > Prog unloading, > populate_used_allocators -> map walks map_values, tries to take > reference to local kptr whose backing allocator is A. > Loads kptr, meanwhile some other prog sharing access to the map value > exchanges (kptr_xchg) another pointer into that field. > Now you take reference to allocator A in used_allocators, while actual > value in map is for allocator B. This is fine. The algorithm is conservative, it may keep allocators around longer than they're needed. Eventually there will exist a time that this map won't be accessible to any program, at which point both allocator A and B will be released. It is possible to make a more precise algorithm but given that this behavior is only really a problem with (likely pinned) long-lived maps, it's imo not worth it for v1. > > So you either have to cmpxchg and then retry if it fails (which is not > a wait-free operation, and honestly not great imo), or if you don't > handle this: > Someone moved an allocated local kptr backed by B into your map, but > you don't hold reference to it. You don't need a reference while something else is holding the allocator alive. The references exist to extend the lifetime past the lifetimes of programs that can directly reference the allocator. > That other program may release > allocator map -> allocator, The allocator map cannot be removed while it's in used_maps of the program. On program unload, we'll add B to the used_allocators list, as a value from B is in the map. Only then will the allocator map be releasable. > and then when this map is destroyed, on > unit_free it will be use-after-free of bpf_mem_alloc *. > > I don't see an easy way around these kinds of problems. And this is > just one specific scenario. > > > Alexei, please correct me if I misunderstood how the design is supposed to work. > > > > > > > > > > For the second problem, consider this: > > > > obj = bpf_obj_alloc(&alloc_map, ...); > > > > inner_map = bpf_map_lookup_elem(&map_in_map, ...); > > > > map_val = bpf_map_lookup_elem(inner_map, ...); > > > > kptr_xchg(&map_val->kptr, obj); > > > > > > > > Now delete the entry having that inner_map, but keep its fd open. > > > > Unload the program, since it is map-in-map, no way to fill used_allocators. > > > > alloc_map is freed, releases reference on allocator, allocator is freed. > > > > Now close(inner_map_fd), inner_map is free. Either bad unit_free or memory leak. > > > > Is there a way to prevent this in your scheme? 
> > > > This is fair, inner maps not being tracked in used_maps is a wrench in that plan. > > However, we can have the parent map propagate its used_allocators on inner map > > removal. > > > > But used_allocators are not populated until unload? inner_map removal > can happen while the program is loaded/attached. > Or will you populate it before unloading, everytime during inner_map > removal? Then that would be very costly for a bpf_map_delete_elem... It's not free, granted, but it only kicks if the map-of-maps has already acquired a used_allocators list. I'd be okay with handling map-of-map via liveness counts too. > > > > > - > > > > > > > > I had another idea, but it's not _completely_ 0 overhead. Heavy > > > > prototyping so I might be missing corner cases. > > > > It is to take reference on each allocation and deallocation. Yes, > > > > naive and slow if using atomics, but instead we can use percpu_ref > > > > instead of atomic refcount for the allocator. percpu_ref_get/put on > > > > each unit_alloc/unit_free. > > > > The problem though is that once initial reference is killed, it > > > > downgrades to atomic, which will kill performance. So we need to be > > > > smart about how that initial reference is managed. > > > > My idea is that the initial ref is taken and killed by the allocator > > > > bpf_map pinning the allocator. Once that bpf_map is gone, you cannot > > > > do any more allocations anyway (since you need to pass the map pointer > > > > to bpf_obj_alloc), so once it downgrades to atomics at that point we > > > > will only be releasing the references after freeing its allocated > > > > objects. Yes, then the free path becomes a bit costly after the > > > > allocator map is gone. > > > > > > > > We might be able to remove the cost on free path as well using the > > > > used_allocators scheme from above (to delay percpu_ref_kill), but it > > > > is not clear how to safely increment the ref of the allocator from map > > > > value... > > > > As explained above, the values are already rcu-protected, so we can use that to > > coordinate refcounting of the allocator. That said, percpu_ref could work (I was > > considering something similar within the allocator itself) but I'm not convinced the > > cost is right. My main concern would be that once it becomes atomic_t, it erases the > > benefits of all the work in the allocator that maintains percpu data structures. > > > > If we want to go down this path, the allocator can maintain percpu live counts (with > > underflow for unbalanced alloc-free pairs on a cpu) in its percpu structures. Then, > > we can have explicit "sum up all the counts to discover if you should be destroyed" > > calls. If we keep the used_allocators scheme, these calls can be inserted at program > > unload for maps in used_maps and at map free time for maps that escape that > > mechanism. > > Yes, it would be a good idea to put the percpu refcount for this case > inside the already percpu bpf_mem_cache struct, since that will be > much better than allocating it separately. The increment will then be > a 100% cache hit. > > The main question is how this "sum up all the count" operation needs > to be done. Once that initial reference of bpf_map is gone, you need > to track the final owner who will be responsible to release the > allocator. You will need to do something similar to percpu_ref's > atomic count upgrade unless I'm missing something. 
It's actually easier since we know we can limit the checks to only the there's-no- reference-from-an-allocator-map case. We can also postpone the work to the rcu gp to make it even easier to sequence with the individual elements' free()s. > > Once you establish that used_allocators cannot be safely populated on > unload (which you can correct me on), the only utility I see for it is > delaying the atomic upgrade for this idea. > > So another approach (though I don't like this too much): > One solution to delay the upgrade can be that when you have allocator > maps and other normal maps in used_maps, it always incs ref on the > allocator pinned by allocator map on unload for maps that have local > kptr, so each map having local kptrs speculatively takes ref on > allocator maps of this prog, assuming allocations from that allocator > map are more likely to be in these maps. > > Same with other progs having different allocator map but sharing these maps. > > It is not very precise, but until those maps are gone it delays > release of the allocator (we can empty all percpu caches to save > memory once bpf_map pinning the allocator is gone, because allocations > are not going to be served). But it allows unit_free to be relatively > less costly as long as those 'candidate' maps are around. Yes, we considered this but it's much easier to get to pathological behaviors, by just loading and unloading programs that can access an allocator in a loop. The freelists being empty help but it's still quite easy to hold a lot of memory for nothing. The pointer walk was proposed to prune most such pathological cases while still being conservative enough to be easy to implement. Only races with the pointer walk can extend the lifetime unnecessarily. > > > > > > > Or, we just extend the map-in-map mechanism to propagate used_allocators as needed. > > There are nice debug properties of the allocator knowing the liveness counts but we > > don't have to put them on the path to correctness. > > > > > > > > > > wdyt? > > > > > > > > > > Overall, we think this handles all the nasty corners - objects escaping into > > > > > > kfuncs/helpers when they shouldn't, pinned maps containing pointers to allocations, > > > > > > programs accessing multiple allocators having deterministic freelist behaviors - > > > > > > while keeping the API and complexity sane. The used_allocators approach can certainly > > > > > > be less conservative (or can be even precise) but for a v1 that's probably overkill. > > > > > > > > > > > > Please, feel free to shoot holes in this design! We tried to capture everything but > > > > > > I'd love confirmation that we didn't miss anything. > > > > > > > > > > > > --Delyan > > > >
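To make the per-cpu live-count idea above a bit more concrete, here is a minimal sketch against the bpf_mem_cache/bpf_mem_alloc shapes from the patchset (single-size cache case). The live_cnt field and the bpf_mem_alloc_in_use() helper are hypothetical, only illustrating the "sum up all the counts" check discussed above; nothing like this is in the posted series.

struct bpf_mem_cache {
        /* freelist fields from the patchset elided */
        long live_cnt;  /* allocs minus frees issued on this cpu; may go
                         * negative when an object is freed on another cpu */
};

static void *unit_alloc(struct bpf_mem_cache *c)
{
        void *obj = NULL;

        /* ... pop from the per-cpu freelist as in the patchset ... */
        if (obj)
                c->live_cnt++;          /* plain per-cpu add, no atomics */
        return obj;
}

static void unit_free(struct bpf_mem_cache *c, void *obj)
{
        c->live_cnt--;
        /* ... push obj back onto the per-cpu freelist ... */
}

/* The "sum up all the counts" check. Only meaningful once the allocator
 * bpf_map is gone (no new allocations can be served) and after an RCU gp,
 * so in-flight unit_free()s have finished. */
static bool bpf_mem_alloc_in_use(struct bpf_mem_alloc *ma)
{
        long sum = 0;
        int cpu;

        for_each_possible_cpu(cpu)
                sum += per_cpu_ptr(ma->cache, cpu)->live_cnt;
        return sum != 0;
}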
On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > It is not very precise, but until those maps are gone it delays > > release of the allocator (we can empty all percpu caches to save > > memory once bpf_map pinning the allocator is gone, because allocations > > are not going to be served). But it allows unit_free to be relatively > > less costly as long as those 'candidate' maps are around. > > Yes, we considered this but it's much easier to get to pathological behaviors, by > just loading and unloading programs that can access an allocator in a loop. The > freelists being empty help but it's still quite easy to hold a lot of memory for > nothing. > > The pointer walk was proposed to prune most such pathological cases while still being > conservative enough to be easy to implement. Only races with the pointer walk can > extend the lifetime unnecessarily. I'm getting lost in this thread. Here is my understanding so far: We don't free kernel kptrs from map in release_uref, but we should for local kptrs, since such objs are not much different from timers. So release_uref will xchg all such kptrs and free them into the allocator without touching allocator's refcnt. So there is no concurrency issue that Kumar was concerned about. We might need two arrays though. prog->used_allocators[] and map->used_allocators[] The verifier would populate both at load time. At prog unload dec refcnt in one array. At map free dec refcnt in the other array. Map-in-map insert/delete of new map would copy allocators[] from outer map. As the general suggestion to solve this problem I think we really need to avoid run-time refcnt changes at alloc/free even when they're per-cpu 'fast'.
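A rough sketch of the two-array idea, just to pin down the shape under discussion; the fields and the bpf_mem_alloc_get/put refcount helpers are invented for illustration and are not in the posted series.

struct bpf_prog_aux {
        /* ... existing fields ... */
        struct bpf_mem_alloc **used_allocators; /* filled by the verifier at load time */
        u32 used_allocator_cnt;
};

struct bpf_map {
        /* ... existing fields ... */
        struct bpf_mem_alloc **used_allocators; /* also filled at prog load time */
        u32 used_allocator_cnt;
};

/* One ref per entry was taken at load time; called from prog unload for
 * prog->aux->used_allocators and from map free for map->used_allocators. */
static void put_used_allocators(struct bpf_mem_alloc **allocs, u32 cnt)
{
        u32 i;

        for (i = 0; i < cnt; i++)
                bpf_mem_alloc_put(allocs[i]);
        kfree(allocs);
}

/* Map-in-map insert: a newly inserted inner map inherits the outer map's list. */
static int copy_used_allocators(struct bpf_map *inner, const struct bpf_map *outer)
{
        u32 i;

        inner->used_allocators = kmemdup(outer->used_allocators,
                                         outer->used_allocator_cnt *
                                         sizeof(*outer->used_allocators),
                                         GFP_KERNEL);
        if (!inner->used_allocators)
                return -ENOMEM;
        inner->used_allocator_cnt = outer->used_allocator_cnt;
        for (i = 0; i < inner->used_allocator_cnt; i++)
                bpf_mem_alloc_get(inner->used_allocators[i]);
        return 0;
}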
On Tue, 30 Aug 2022 at 01:18, Delyan Kratunov <delyank@fb.com> wrote: > > > [...] > > > > I don't think that assumption will hold. Requiring RCU protection for > > all local kptrs means allocator cache reuse becomes harder, as > > elements are stuck in freelist until the next grace period. It > > necessitates use of larger caches. > > For some use cases where they can tolerate reuse, it might not be > > optimal. IMO the allocator should be independent of how the lifetime > > of elements is managed. > > All maps and allocations are already rcu-protected, we're not adding new here. We're > only relying on this rcu protection (c.f. the chain of call_rcu_task_trace and > call_rcu in the patchset) to prove that no program can observe an allocator pointer > that is corrupted or stale. > > > > > That said, even if you assume RCU protection, that still doesn't > > address the real problem. Yes, you can access the value without > > worrying about it moving to another map, but consider this case: > > Prog unloading, > > populate_used_allocators -> map walks map_values, tries to take > > reference to local kptr whose backing allocator is A. > > Loads kptr, meanwhile some other prog sharing access to the map value > > exchanges (kptr_xchg) another pointer into that field. > > Now you take reference to allocator A in used_allocators, while actual > > value in map is for allocator B. > > This is fine. The algorithm is conservative, it may keep allocators around longer > than they're needed. Eventually there will exist a time that this map won't be > accessible to any program, at which point both allocator A and B will be released. > > It is possible to make a more precise algorithm but given that this behavior is only > really a problem with (likely pinned) long-lived maps, it's imo not worth it for v1. > > > > > So you either have to cmpxchg and then retry if it fails (which is not > > a wait-free operation, and honestly not great imo), or if you don't > > handle this: > > Someone moved an allocated local kptr backed by B into your map, but > > you don't hold reference to it. > > You don't need a reference while something else is holding the allocator alive. The > references exist to extend the lifetime past the lifetimes of programs that can > directly reference the allocator. > > > That other program may release > > allocator map -> allocator, > > The allocator map cannot be removed while it's in used_maps of the program. On > program unload, we'll add B to the used_allocators list, as a value from B is in the > map. Only then will the allocator map be releasable. > Ahh, so prog1 and prog2 both share map M, but not allocator map A and B, so if it swaps in pointer of B into M while prog1 is unloading, it will take unneeded ref to A (it it sees kptr to A just before swap). But when prog2 will unload, it will then see that ptr value is from B so it will also take ref to B in M's used_allocators. Then A stays alive for a little while longer, but only when this race happens. But there won't be any correctness problem as both A and B are kept alive. Makes sense, but IIUC this only takes care of this specific case. It can also see 'NULL' and miss taking reference to A. prog1 uses A, M prog2 uses B, M A and B are allocator maps, M is normal hashmap, M is shared between both. prog1 is unloading: M holds kptr from A. during walk, before loading field, prog2 xchg it to NULL, M does not take ref to A. 
// 1 Right after it is done processing this field during its walk, prog2 now xchg it back in, now M is holding A but did not take ref to A. // 2 prog1 unloads. M's used_allocators list is empty. M is still kept alive by prog2. Now prog2 has already moved its result of kptr_xchg back into the map in step 2. Hence, prog2 execution can terminate, this ends its RCU section. Allocator A is waiting for all prior RCU readers, eventually it can be freed. Now prog2 runs again, starts a new RCU section, kptr_xchg this ptr from M, and tries to free it. Allocator A is already gone. If this is also taken care of, I'll only be poking at the code next when you post it, so that I don't waste any more of your time. But IIUC this can still cause issues, right?
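For concreteness, the prog2 side of that interleaving would look roughly like this in BPF C. The map layout, the __kptr __local annotation and bpf_obj_free() are the provisional shapes from this thread, so treat the whole snippet as an illustrative sketch rather than compilable code against current headers.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern void bpf_obj_free(void *obj) __ksym;     /* provisional helper from this thread */

struct foo {
        int var;
};

struct map_val {
        struct foo *ptr __kptr __local;         /* provisional annotation */
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct map_val);
} M SEC(".maps");

SEC("tc")
int prog2(struct __sk_buff *skb)
{
        __u32 key = 0;
        struct map_val *v;
        struct foo *p;

        v = bpf_map_lookup_elem(&M, &key);
        if (!v)
                return 0;
        /* step 1: prog1's unload-time walk sees NULL in this field */
        p = bpf_kptr_xchg(&v->ptr, NULL);
        if (!p)
                return 0;
        /* step 2: the object allocated from A goes back into M after the
         * walk already passed this field, so M never took a ref on A */
        p = bpf_kptr_xchg(&v->ptr, p);
        if (p)
                bpf_obj_free(p);
        return 0;
}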
On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > It is not very precise, but until those maps are gone it delays > > > release of the allocator (we can empty all percpu caches to save > > > memory once bpf_map pinning the allocator is gone, because allocations > > > are not going to be served). But it allows unit_free to be relatively > > > less costly as long as those 'candidate' maps are around. > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by > > just loading and unloading programs that can access an allocator in a loop. The > > freelists being empty help but it's still quite easy to hold a lot of memory for > > nothing. > > > > The pointer walk was proposed to prune most such pathological cases while still being > > conservative enough to be easy to implement. Only races with the pointer walk can > > extend the lifetime unnecessarily. > > I'm getting lost in this thread. > > Here is my understanding so far: > We don't free kernel kptrs from map in release_uref, > but we should for local kptrs, since such objs are > not much different from timers. > So release_uref will xchg all such kptrs and free them > into the allocator without touching allocator's refcnt. > So there is no concurrency issue that Kumar was concerned about. Haven't really thought through whether this will fix the concurrent kptr swap problem, but then with this I think you need: - New helper bpf_local_kptr_xchg(map, map_value, kptr) - Associating map_uid of map, map_value - Always doing atomic_inc_not_zero(map->usercnt) for each call to local_kptr_xchg 1 and 2 because of inner_maps, 3 because of release_uref. But maybe not a deal breaker? > We might need two arrays though. > prog->used_allocators[] and map->used_allocators[] > The verifier would populate both at load time. > At prog unload dec refcnt in one array. > At map free dec refcnt in the other array. > Map-in-map insert/delete of new map would copy allocators[] from > outer map. > As the general suggestion to solve this problem I think > we really need to avoid run-time refcnt changes at alloc/free > even when they're per-cpu 'fast'.
On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > > > > It is not very precise, but until those maps are gone it delays > > > > release of the allocator (we can empty all percpu caches to save > > > > memory once bpf_map pinning the allocator is gone, because allocations > > > > are not going to be served). But it allows unit_free to be relatively > > > > less costly as long as those 'candidate' maps are around. > > > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by > > > just loading and unloading programs that can access an allocator in a loop. The > > > freelists being empty help but it's still quite easy to hold a lot of memory for > > > nothing. > > > > > > The pointer walk was proposed to prune most such pathological cases while still being > > > conservative enough to be easy to implement. Only races with the pointer walk can > > > extend the lifetime unnecessarily. > > > > I'm getting lost in this thread. > > > > Here is my understanding so far: > > We don't free kernel kptrs from map in release_uref, > > but we should for local kptrs, since such objs are > > not much different from timers. > > So release_uref will xchg all such kptrs and free them > > into the allocator without touching allocator's refcnt. > > So there is no concurrency issue that Kumar was concerned about. > > Haven't really thought through whether this will fix the concurrent > kptr swap problem, but then with this I think you need: > - New helper bpf_local_kptr_xchg(map, map_value, kptr) no. why? current bpf_kptr_xchg(void *map_value, void *ptr) should work. The verifier knows map ptr from map_value. > - Associating map_uid of map, map_value > - Always doing atomic_inc_not_zero(map->usercnt) for each call to > local_kptr_xchg > 1 and 2 because of inner_maps, 3 because of release_uref. > But maybe not a deal breaker? No run-time refcnts. All possible allocators will be added to map->used_allocators at prog load time and allocator's refcnt incremented. At run-time bpf_kptr_xchg(map_value, ptr) will be happening with an allocator A which was added to that map->used_allocators already.
On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > > > > > > > It is not very precise, but until those maps are gone it delays > > > > > release of the allocator (we can empty all percpu caches to save > > > > > memory once bpf_map pinning the allocator is gone, because allocations > > > > > are not going to be served). But it allows unit_free to be relatively > > > > > less costly as long as those 'candidate' maps are around. > > > > > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by > > > > just loading and unloading programs that can access an allocator in a loop. The > > > > freelists being empty help but it's still quite easy to hold a lot of memory for > > > > nothing. > > > > > > > > The pointer walk was proposed to prune most such pathological cases while still being > > > > conservative enough to be easy to implement. Only races with the pointer walk can > > > > extend the lifetime unnecessarily. > > > > > > I'm getting lost in this thread. > > > > > > Here is my understanding so far: > > > We don't free kernel kptrs from map in release_uref, > > > but we should for local kptrs, since such objs are > > > not much different from timers. > > > So release_uref will xchg all such kptrs and free them > > > into the allocator without touching allocator's refcnt. > > > So there is no concurrency issue that Kumar was concerned about. > > > > Haven't really thought through whether this will fix the concurrent > > kptr swap problem, but then with this I think you need: > > - New helper bpf_local_kptr_xchg(map, map_value, kptr) > > no. why? > current bpf_kptr_xchg(void *map_value, void *ptr) should work. > The verifier knows map ptr from map_value. > > > - Associating map_uid of map, map_value > > - Always doing atomic_inc_not_zero(map->usercnt) for each call to > > local_kptr_xchg > > 1 and 2 because of inner_maps, 3 because of release_uref. > > But maybe not a deal breaker? > > No run-time refcnts. How is future kptr_xchg prevented for the map after its usercnt drops to 0? If we don't check it at runtime we can xchg in non-NULL kptr after release_uref callback. For timer you are taking timer spinlock and reading map->usercnt in timer_set_callback. Or do you mean this case can never happen with your approach? > All possible allocators will be added to map->used_allocators > at prog load time and allocator's refcnt incremented. > At run-time bpf_kptr_xchg(map_value, ptr) will be happening > with an allocator A which was added to that map->used_allocators > already.
On Mon, Aug 29, 2022 at 5:45 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi > > <memxor@gmail.com> wrote: > > > > > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > > > > > > > > > > It is not very precise, but until those maps are gone it delays > > > > > > release of the allocator (we can empty all percpu caches to save > > > > > > memory once bpf_map pinning the allocator is gone, because allocations > > > > > > are not going to be served). But it allows unit_free to be relatively > > > > > > less costly as long as those 'candidate' maps are around. > > > > > > > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by > > > > > just loading and unloading programs that can access an allocator in a loop. The > > > > > freelists being empty help but it's still quite easy to hold a lot of memory for > > > > > nothing. > > > > > > > > > > The pointer walk was proposed to prune most such pathological cases while still being > > > > > conservative enough to be easy to implement. Only races with the pointer walk can > > > > > extend the lifetime unnecessarily. > > > > > > > > I'm getting lost in this thread. > > > > > > > > Here is my understanding so far: > > > > We don't free kernel kptrs from map in release_uref, > > > > but we should for local kptrs, since such objs are > > > > not much different from timers. > > > > So release_uref will xchg all such kptrs and free them > > > > into the allocator without touching allocator's refcnt. > > > > So there is no concurrency issue that Kumar was concerned about. > > > > > > Haven't really thought through whether this will fix the concurrent > > > kptr swap problem, but then with this I think you need: > > > - New helper bpf_local_kptr_xchg(map, map_value, kptr) > > > > no. why? > > current bpf_kptr_xchg(void *map_value, void *ptr) should work. > > The verifier knows map ptr from map_value. > > > > > - Associating map_uid of map, map_value > > > - Always doing atomic_inc_not_zero(map->usercnt) for each call to > > > local_kptr_xchg > > > 1 and 2 because of inner_maps, 3 because of release_uref. > > > But maybe not a deal breaker? > > > > No run-time refcnts. > > How is future kptr_xchg prevented for the map after its usercnt drops to 0? > If we don't check it at runtime we can xchg in non-NULL kptr after > release_uref callback. > For timer you are taking timer spinlock and reading map->usercnt in > timer_set_callback. Sorry I confused myself and others with release_uref. I meant map_poke_untrack-like call. When we drop refs from used maps in __bpf_free_used_maps we walk all elements. Similar idea here. When prog is unloaded it cleans up all objects it allocated and stored into maps before dropping refcnt-s in prog->used_allocators.
On Mon, 2022-08-29 at 18:05 -0700, Alexei Starovoitov wrote: > > On Mon, Aug 29, 2022 at 5:45 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > On Tue, 30 Aug 2022 at 02:26, Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Mon, Aug 29, 2022 at 5:20 PM Kumar Kartikeya Dwivedi > > > <memxor@gmail.com> wrote: > > > > > > > > On Tue, 30 Aug 2022 at 01:45, Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Mon, Aug 29, 2022 at 4:18 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > > > > > > > > > > > > > > > > It is not very precise, but until those maps are gone it delays > > > > > > > release of the allocator (we can empty all percpu caches to save > > > > > > > memory once bpf_map pinning the allocator is gone, because allocations > > > > > > > are not going to be served). But it allows unit_free to be relatively > > > > > > > less costly as long as those 'candidate' maps are around. > > > > > > > > > > > > Yes, we considered this but it's much easier to get to pathological behaviors, by > > > > > > just loading and unloading programs that can access an allocator in a loop. The > > > > > > freelists being empty help but it's still quite easy to hold a lot of memory for > > > > > > nothing. > > > > > > > > > > > > The pointer walk was proposed to prune most such pathological cases while still being > > > > > > conservative enough to be easy to implement. Only races with the pointer walk can > > > > > > extend the lifetime unnecessarily. > > > > > > > > > > I'm getting lost in this thread. > > > > > > > > > > Here is my understanding so far: > > > > > We don't free kernel kptrs from map in release_uref, > > > > > but we should for local kptrs, since such objs are > > > > > not much different from timers. > > > > > So release_uref will xchg all such kptrs and free them > > > > > into the allocator without touching allocator's refcnt. > > > > > So there is no concurrency issue that Kumar was concerned about. Yes, if we populate used_allocators on load and copy them to inner maps, this might work. It requires the most conservative approach where loading and unloading a program in a loop would add its allocators to the visible pinned maps, accumulating allocators we can't release until the map is gone. However, I thought you wanted to walk the values instead to prevent this abuse. At least that's the understanding I was operating under. If a program has the max number of possible allocators and we just load/unload it in a loop, with a visible pinned map, the used_allocators list of that map can easily skyrocket. This is a problem in itself, in that it needs to grow dynamically (program load/unload is a good context to do that from but inner map insert/delete can also grow the list and that's in a bad context potentially). > > > > > > > > Haven't really thought through whether this will fix the concurrent > > > > kptr swap problem, but then with this I think you need: > > > > - New helper bpf_local_kptr_xchg(map, map_value, kptr) > > > > > > no. why? > > > current bpf_kptr_xchg(void *map_value, void *ptr) should work. > > > The verifier knows map ptr from map_value. 
> > > > > > > - Associating map_uid of map, map_value > > > > - Always doing atomic_inc_not_zero(map->usercnt) for each call to > > > > local_kptr_xchg > > > > 1 and 2 because of inner_maps, 3 because of release_uref. > > > > But maybe not a deal breaker? > > > > > > No run-time refcnts. > > > > How is future kptr_xchg prevented for the map after its usercnt drops to 0? > > If we don't check it at runtime we can xchg in non-NULL kptr after > > release_uref callback. > > For timer you are taking timer spinlock and reading map->usercnt in > > timer_set_callback. > > Sorry I confused myself and others with release_uref. > I meant map_poke_untrack-like call. > When we drop refs from used maps in __bpf_free_used_maps > we walk all elements. > Similar idea here. > When prog is unloaded it cleans up all objects it allocated > and stored into maps before dropping refcnt-s > in prog->used_allocators. For an allocator that's visible from multiple programs (say, it's in a map-of-maps), how would we know which values were allocated by which program? Do we forbid shared allocators outright? I can imagine container agent-like software wanting to reuse its allocator caches from a previous run. Besides, cleaning up the values is the easy part - so long as the map is extending the allocator's lifetime, this is safe, we can even use the kptr destructor mechanism (though I'd rather not). There are three main cases of lifetime extension: 1) Directly visible maps - normal used_allocators with or without a walk works here. 2) Inner maps. I plan to straight up disallow their insertion if they have a kptr_local field. If someone needs this, we can solve it then, it's too hard to do the union of lists logic cleanly for a v1. 3) Stack of another program. I'm not sure how to require that programs with kptr_local do not use the same btf structs but if that's possible, this can naturally be disallowed via the btf comparisons failing (it's another case of btf domain crossover). Maybe we tag the btf struct as containing local kptr type tags and disallow its use for more than one program? Separately, I think I just won't allow allocators as inner maps, that's for another day too (though it may work just fine). Perfect, enemy of the good, something-something. -- Delyan
On Mon, Aug 29, 2022 at 6:40 PM Delyan Kratunov <delyank@fb.com> wrote: > > Yes, if we populate used_allocators on load and copy them to inner maps, this might > work. It requires the most conservative approach where loading and unloading a > program in a loop would add its allocators to the visible pinned maps, accumulating > allocators we can't release until the map is gone. > > However, I thought you wanted to walk the values instead to prevent this abuse. At > least that's the understanding I was operating under. > > If a program has the max number of possible allocators and we just load/unload it in > a loop, with a visible pinned map, the used_allocators list of that map can easily > skyrocket. Right. That will be an issue if we don't trim the list at prog unload. > > Sorry I confused myself and others with release_uref. > > I meant map_poke_untrack-like call. > > When we drop refs from used maps in __bpf_free_used_maps > > we walk all elements. > > Similar idea here. > > When prog is unloaded it cleans up all objects it allocated > > and stored into maps before dropping refcnt-s > > in prog->used_allocators. > > For an allocator that's visible from multiple programs (say, it's in a map-of-maps), > how would we know which values were allocated by which program? Do we forbid shared > allocators outright? Hopefully we don't need to forbid shared allocators and allow map-in-map to contain kptr local. > Separately, I think I just won't allow allocators as inner maps, that's for another > day too (though it may work just fine). > > Perfect, enemy of the good, something-something. Right, but if we can allow that with something simple it would be nice. After a lot of head scratching realized that walk-all-elems-and-kptr_xchg approach doesn't work, because prog_A and prog_B might share a map and when prog_A is unloaded and trying to do kptr_xchg on all elems the prog_B might kptr_xchg as well and walk_all loop will miss kptrs. prog->used_allocators[] approach is broken too. Since the prog_B (in the example above) might see objects that were allocated from prog_A's allocators. prog->used_allocators at load time is incorrect. To prevent all of these issues how about we restrict kptr local to contain a pointer only from one allocator. When parsing map's BTF if there is only one kptr local in the map value the equivalent of map->used_allocators[] will guarantee to contain only one allocator. Two kptr locals in the map value -> potentially two allocators. So here is new proposal: At load time the verifier walks all kptr_xchg(map_value, obj) and adds obj's allocator to map->used_allocators <- {kptr_offset, allocator}; If kptr_offset already exists -> failure to load. Allocator can probably be a part of struct bpf_map_value_off_desc. In other words the pairs of {kptr_offset, allocator} say 'there could be an object from that allocator in that kptr in some map values'. Do nothing at prog unload. At map free time walk all elements and free kptrs. Finally drop allocator refcnts. This approach allows sharing of allocators. kptr local in map-in-map also should be fine. If not we have a problem with bpf_map_value_off_desc and map-in-map then. The prog doesn't need to have a special used_allocator list, since if bpf prog doesn't do kptr_xchg all allocated objects will be freed during prog execution. Instead since allocator is a different type of map it should go into existing used_maps[] to make sure we don't free allocator when prog is executing. 
Maybe with this approach we won't even need to hide the allocator pointer into the first 8 bytes. For all pointers returned from kptr_xchg the verifier will know which allocator is supposed to be used for freeing. Thoughts?
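Restated as code, the proposal would roughly extend the existing kptr descriptors like this; the allocator member and the bpf_mem_alloc_put() refcount helper are hypothetical, and the snippet only illustrates where the load-time reference would live and when it would be dropped.

struct bpf_map_value_off_desc {
        u32 offset;
        enum bpf_kptr_type type;
        /* existing kernel-kptr fields (btf, btf_id, module, dtor) elided */
        struct bpf_mem_alloc *allocator;        /* set by the verifier the first time it
                                                 * sees kptr_xchg(map_value, obj) for this
                                                 * offset; a second, different allocator
                                                 * for the same offset fails the load */
};

/* Map free time: after all elements were walked and their local kptrs
 * returned to their allocator, drop the refs recorded at load time. */
static void bpf_map_put_used_allocators(struct bpf_map *map)
{
        struct bpf_map_value_off *tab = map->kptr_off_tab;
        int i;

        if (!tab)
                return;
        for (i = 0; i < tab->nr_off; i++)
                if (tab->off[i].allocator)
                        bpf_mem_alloc_put(tab->off[i].allocator);
}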
On Tue, 30 Aug 2022 at 05:35, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Aug 29, 2022 at 6:40 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > Yes, if we populate used_allocators on load and copy them to inner maps, this might > > work. It requires the most conservative approach where loading and unloading a > > program in a loop would add its allocators to the visible pinned maps, accumulating > > allocators we can't release until the map is gone. > > > > However, I thought you wanted to walk the values instead to prevent this abuse. At > > least that's the understanding I was operating under. > > > > If a program has the max number of possible allocators and we just load/unload it in > > a loop, with a visible pinned map, the used_allocators list of that map can easily > > skyrocket. > > Right. That will be an issue if we don't trim the list > at prog unload. > > > > Sorry I confused myself and others with release_uref. > > > I meant map_poke_untrack-like call. > > > When we drop refs from used maps in __bpf_free_used_maps > > > we walk all elements. > > > Similar idea here. > > > When prog is unloaded it cleans up all objects it allocated > > > and stored into maps before dropping refcnt-s > > > in prog->used_allocators. > > > > For an allocator that's visible from multiple programs (say, it's in a map-of-maps), > > how would we know which values were allocated by which program? Do we forbid shared > > allocators outright? > > Hopefully we don't need to forbid shared allocators and > allow map-in-map to contain kptr local. > > > Separately, I think I just won't allow allocators as inner maps, that's for another > > day too (though it may work just fine). > > > > Perfect, enemy of the good, something-something. > > Right, but if we can allow that with something simple > it would be nice. > > After a lot of head scratching realized that > walk-all-elems-and-kptr_xchg approach doesn't work, > because prog_A and prog_B might share a map and when > prog_A is unloaded and trying to do kptr_xchg on all elems > the prog_B might kptr_xchg as well and walk_all loop > will miss kptrs. Agreed, I can't see it working either. > prog->used_allocators[] approach is broken too. > Since the prog_B (in the example above) might see > objects that were allocated from prog_A's allocators. > prog->used_allocators at load time is incorrect. > > To prevent all of these issues how about we > restrict kptr local to contain a pointer only > from one allocator. > When parsing map's BTF if there is only one kptr local > in the map value the equivalent of map->used_allocators[] > will guarantee to contain only one allocator. > Two kptr locals in the map value -> potentially two allocators. > > So here is new proposal: > Thanks for the proposal, Alexei. I think we're getting close to a solution, but still some comments below. > At load time the verifier walks all kptr_xchg(map_value, obj) > and adds obj's allocator to > map->used_allocators <- {kptr_offset, allocator}; > If kptr_offset already exists -> failure to load. > Allocator can probably be a part of struct bpf_map_value_off_desc. > > In other words the pairs of {kptr_offset, allocator} > say 'there could be an object from that allocator in > that kptr in some map values'. > > Do nothing at prog unload. > > At map free time walk all elements and free kptrs. > Finally drop allocator refcnts. > Yes, this should be possible. It's quite easy to capture the map_ptr for the allocated local kptr. 
Limiting each local kptr to one allocator is probably fine, at least for a v1. One problem I see is how it works when the allocator map is an inner map. Then, it is not possible to find the backing allocator instance at verification time, hence not possible to take the reference to it in map->used_allocators. But let's just assume that is disallowed for now. The other problem I see is that when the program just does kptr_xchg(map_value, NULL), we may not have allocator info from kptr_offset at that moment. Allocating prog which fills used_allocators may be verified later. We _can_ reject this, but it makes everything fragile (dependent on which order you load programs in), which won't be great. You can then use this lost info to make kptr disjoint from allocator lifetime. Let me explain through an example. Consider this order to set up the programs: One allocator map A. Two hashmaps M1, M2. Three programs P1, P2, P3. P1 uses M1, M2. P2 uses A, M1. P3 uses M2. Sequence: map_create A, M1, M2. Load P1, uses M1, M2. What this P1 does is: p = kptr_xchg(M1.value, NULL); kptr_xchg(M2.value, p); So it moves the kptr in M1 into M2. The problem is at this point kptr_offset is not populated, so we cannot fill used_allocators of M2 as we cannot track which allocator is used to fill M1.value. We saw nothing filling it yet. Next, load P3. It does: p = kptr_xchg(M2.value, NULL); unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so this is ok if allocator is alive. Again, M2.used_allocators is empty. Nothing is filled into it. Next, load P2. p = alloc(&A, ...); kptr_xchg(M1.value, p); Now, M1.used_allocators is filled with allocator ref and kptr_offset. But M2.used_allocators will remain unfilled. Now, run programs in sequence of P2, then P1. This will allocate from A, and move the ref to M1, then to M2. But only P1 and P2 have references to M1 so it keeps the allocator alive. However, now unload both P1 and P2. P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2 is still held by loaded P3. Now, M2.used_allocators is empty. P3 is using it, and it is holding allocation from allocator A. Both M1 and A are freed. When P3 runs now, it can kptr_xchg and try to free it, and the same uaf happens again. If not that, uaf when M2 is freed and it does unit_free on the alive local kptr. -- Will this case be covered by your approach? Did I miss something? The main issue is that this allocator info can be lost depending on how you verify a set of programs. It would not be lost if we verified in order P2, P1, P3 instead of the current P1, P3, P2. So we might have to teach the verifier to identify kptr_xchg edges between maps, and propagate any used_allocators to the other map? But it's becoming too complicated. You _can_ reject loads of programs when you don't find kptr_offset populated on seeing kptr_xchg(..., NULL), but I don't think this is practical either. It makes the things sensitive to program verification order, which would be confusing for users.
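In BPF C the programs above would look roughly as below (only P1 and P2 shown; P3 is just the free half of P1 against M2). Everything uses the provisional shapes from this thread (BPF_MAP_TYPE_ALLOCATOR, __kptr __local, bpf_obj_alloc/bpf_obj_free), so this is illustrative only; the point it shows is that allocator map A appears in no program other than P2, so the verifier learns about A last.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

extern void *bpf_obj_alloc(void *allocator_map, __u64 type_id) __ksym; /* provisional */
extern void bpf_obj_free(void *obj) __ksym;                            /* provisional */

struct foo { int var; };

struct map_val {
        struct foo *ptr __kptr __local;         /* provisional annotation */
};

struct {
        __uint(type, BPF_MAP_TYPE_ALLOCATOR);   /* provisional map type */
        __type(value, struct foo);
} A SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct map_val);
} M1 SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct map_val);
} M2 SEC(".maps");

SEC("tc")
int P1(struct __sk_buff *skb)   /* verified first: no allocator in sight */
{
        __u32 key = 0;
        struct map_val *v1 = bpf_map_lookup_elem(&M1, &key);
        struct map_val *v2 = bpf_map_lookup_elem(&M2, &key);
        struct foo *p;

        if (!v1 || !v2)
                return 0;
        p = bpf_kptr_xchg(&v1->ptr, NULL);      /* whose object is this? unknown */
        if (!p)
                return 0;
        p = bpf_kptr_xchg(&v2->ptr, p);         /* M2.used_allocators stays empty */
        if (p)
                bpf_obj_free(p);
        return 0;
}

SEC("tc")
int P2(struct __sk_buff *skb)   /* verified last: only here does A become known */
{
        __u32 key = 0;
        struct map_val *v1 = bpf_map_lookup_elem(&M1, &key);
        struct foo *p;

        if (!v1)
                return 0;
        p = bpf_obj_alloc(&A, bpf_core_type_id_local(struct foo));
        if (!p)
                return 0;
        p = bpf_kptr_xchg(&v1->ptr, p);         /* M1.used_allocators <- A */
        if (p)
                bpf_obj_free(p);
        return 0;
}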
On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > So here is new proposal: > > > > Thanks for the proposal, Alexei. I think we're getting close to a > solution, but still some comments below. > > > At load time the verifier walks all kptr_xchg(map_value, obj) > > and adds obj's allocator to > > map->used_allocators <- {kptr_offset, allocator}; > > If kptr_offset already exists -> failure to load. > > Allocator can probably be a part of struct bpf_map_value_off_desc. > > > > In other words the pairs of {kptr_offset, allocator} > > say 'there could be an object from that allocator in > > that kptr in some map values'. > > > > Do nothing at prog unload. > > > > At map free time walk all elements and free kptrs. > > Finally drop allocator refcnts. > > > > Yes, this should be possible. > It's quite easy to capture the map_ptr for the allocated local kptr. > Limiting each local kptr to one allocator is probably fine, at least for a v1. > > One problem I see is how it works when the allocator map is an inner map. > Then, it is not possible to find the backing allocator instance at > verification time, hence not possible to take the reference to it in > map->used_allocators. > But let's just assume that is disallowed for now. > > The other problem I see is that when the program just does > kptr_xchg(map_value, NULL), we may not have allocator info from > kptr_offset at that moment. Allocating prog which fills > used_allocators may be verified later. We _can_ reject this, but it > makes everything fragile (dependent on which order you load programs > in), which won't be great. You can then use this lost info to make > kptr disjoint from allocator lifetime. > > Let me explain through an example. > > Consider this order to set up the programs: > One allocator map A. > Two hashmaps M1, M2. > Three programs P1, P2, P3. > > P1 uses M1, M2. > P2 uses A, M1. > P3 uses M2. > > Sequence: > map_create A, M1, M2. > > Load P1, uses M1, M2. What this P1 does is: > p = kptr_xchg(M1.value, NULL); > kptr_xchg(M2.value, p); > > So it moves the kptr in M1 into M2. The problem is at this point > kptr_offset is not populated, so we cannot fill used_allocators of M2 > as we cannot track which allocator is used to fill M1.value. We saw > nothing filling it yet. > > Next, load P3. It does: > p = kptr_xchg(M2.value, NULL); > unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so > this is ok if allocator is alive. > > Again, M2.used_allocators is empty. Nothing is filled into it. > > Next, load P2. > p = alloc(&A, ...); > kptr_xchg(M1.value, p); > > Now, M1.used_allocators is filled with allocator ref and kptr_offset. > But M2.used_allocators will remain unfilled. > > Now, run programs in sequence of P2, then P1. This will allocate from > A, and move the ref to M1, then to M2. But only P1 and P2 have > references to M1 so it keeps the allocator alive. However, now unload > both P1 and P2. > P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2 > is still held by loaded P3. > > Now, M2.used_allocators is empty. P3 is using it, and it is holding > allocation from allocator A. Both M1 and A are freed. > When P3 runs now, it can kptr_xchg and try to free it, and the same > uaf happens again. > If not that, uaf when M2 is freed and it does unit_free on the alive local kptr. > > -- > > Will this case be covered by your approach? Did I miss something? > > The main issue is that this allocator info can be lost depending on > how you verify a set of programs. 
It would not be lost if we verified > in order P2, P1, P3 instead of the current P1, P3, P2. > > So we might have to teach the verifier to identify kptr_xchg edges > between maps, and propagate any used_allocators to the other map? But > it's becoming too complicated. > > You _can_ reject loads of programs when you don't find kptr_offset > populated on seeing kptr_xchg(..., NULL), but I don't think this is > practical either. It makes the things sensitive to program > verification order, which would be confusing for users. Right. Thanks for brainstorming and coming up with the case where it breaks. Let me explain the thought process behind the proposal. The way the progs will be written will be something like: struct foo { int var; }; struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); __type(value, struct foo); } ma SEC(".maps"); struct map_val { struct foo * ptr __kptr __local; }; struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 123); __type(key, __u32); __type(value, struct map_val); } hash SEC(".maps"); struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); bpf_kptr_xchg(&map_val->ptr, p); Even if prog doesn't allocate and only does kptr_xchg like your P1 and P3 do the C code has to have a full definition 'struct foo' to compile P1 and P3. P1 and P3 don't need to see definition of 'ma' to be compiled, but 'struct foo' has to be seen. BTF reference will be taken by both 'ma' and by 'hash'. The btf_id will come from that BTF. The type is tied to BTF and tied to kptr in map value. The type is also tied to the allocator. The type creates a dependency chain between allocator and map. So the restriction of one allocator per kptr feels reasonable and doesn't feel restrictive at all. That dependency chain is there in the C code of the program. Hence the proposal to discover this dependency in the verifier through tracking of allocator from mem_alloc into kptr_xchg. But you're correct that it's not working for P1 and P3. I can imagine a few ways to solve it. 1. Ask users to annotate kptr local with the allocator that will be used. It's easy for progs P1 and P3. All definitions are likely available. It's only an extra tag of some form. 2. move 'used_allocator' from map further into BTF, since BTF is the root of this dependency chain. When the verifier sees bpf_mem_alloc in P2 it will add {allocator, btf_id} pair to BTF. If P1 loads first and the verifier see: p = kptr_xchg(M1.value, NULL); it will add {unique_allocator_placeholder, btf_id} to BTF. Then kptr_xchg(M2.value, p); does nothing. The verifier checks that M1's BTF == M2's BTF and id-s are same. Then P3 loads with: p = kptr_xchg(M2.value, NULL); unit_free(p); since unique_allocator_placholder is already there for that btf_id the verifier does nothing. Eventually it will see bpf_mem_alloc for that btf_id and will replace the placeholder with the actual allocator. We can even allow P1 and P3 to be runnable after load right away. Since nothing can allocate into that kptr local those kptr_xchg() in P1 and P3 will be returning NULL. If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue. Ideally for unit_free(p); in P3 the verifier would add a hidden 'ma' argument, so that allocator doesn't need to be stored dynamically. We can either insns of P3 after P2 was verified or pass a pointer to a place in BTF->used_allocator list of pairs where actual allocator pointer will be written later. Then no patching is needed. 
If P2 never loads, the unit_free(*addr /* here it will load the value of unique_allocator_placeholder */, ...) would see the placeholder, but that's fine since unit_free() will never execute with a valid obj to be freed. "At map free time walk all elements and free kptrs" step stays the same, but we decrement the allocator refcnt only when BTF is freed, which should be after map free time. So btf_put(map->btf); would need to move after ops->map_free. Maybe the unique_allocator_placeholder approach can be used without moving the list into BTF. Need to think more about it tomorrow.
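A sketch of the placeholder bookkeeping, just to make the resolution rule concrete; the per-BTF list and every function/field name here are invented.

/* One entry per local type that some verified prog has used with
 * bpf_mem_alloc()/kptr_xchg()/unit_free(); attached to the BTF object. */
struct btf_alloc_binding {
        struct list_head node;
        u32 btf_id;                     /* the local type */
        struct bpf_mem_alloc *ma;       /* NULL == unique_allocator_placeholder */
};

/* Verifier: on bpf_mem_alloc(&A, btf_id) call this with A's allocator;
 * on kptr_xchg(map_value, NULL) or unit_free() seen before any allocation
 * site, call it with ma == NULL to record the placeholder. */
static int btf_bind_allocator(struct btf *btf, u32 btf_id, struct bpf_mem_alloc *ma)
{
        struct btf_alloc_binding *b = btf_find_alloc_binding(btf, btf_id); /* hypothetical lookup */

        if (!b)
                return btf_add_alloc_binding(btf, btf_id, ma);  /* hypothetical insert */
        if (!ma || b->ma == ma)         /* free/xchg-only prog, or same allocator again */
                return 0;
        if (b->ma)
                return -EINVAL;         /* a second, different allocator for this btf_id */
        b->ma = ma;                     /* resolve the placeholder */
        return 0;
}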
On Mon, 2022-08-29 at 23:03 -0700, Alexei Starovoitov wrote: > > On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: > > > > > > So here is new proposal: > > > > > > > Thanks for the proposal, Alexei. I think we're getting close to a > > solution, but still some comments below. > > > > > At load time the verifier walks all kptr_xchg(map_value, obj) > > > and adds obj's allocator to > > > map->used_allocators <- {kptr_offset, allocator}; > > > If kptr_offset already exists -> failure to load. > > > Allocator can probably be a part of struct bpf_map_value_off_desc. > > > > > > In other words the pairs of {kptr_offset, allocator} > > > say 'there could be an object from that allocator in > > > that kptr in some map values'. > > > > > > Do nothing at prog unload. > > > > > > At map free time walk all elements and free kptrs. > > > Finally drop allocator refcnts. > > > > > > > Yes, this should be possible. > > It's quite easy to capture the map_ptr for the allocated local kptr. > > Limiting each local kptr to one allocator is probably fine, at least for a v1. > > > > One problem I see is how it works when the allocator map is an inner map. > > Then, it is not possible to find the backing allocator instance at > > verification time, hence not possible to take the reference to it in > > map->used_allocators. > > But let's just assume that is disallowed for now. > > > > The other problem I see is that when the program just does > > kptr_xchg(map_value, NULL), we may not have allocator info from > > kptr_offset at that moment. Allocating prog which fills > > used_allocators may be verified later. We _can_ reject this, but it > > makes everything fragile (dependent on which order you load programs > > in), which won't be great. You can then use this lost info to make > > kptr disjoint from allocator lifetime. > > > > Let me explain through an example. > > > > Consider this order to set up the programs: > > One allocator map A. > > Two hashmaps M1, M2. > > Three programs P1, P2, P3. > > > > P1 uses M1, M2. > > P2 uses A, M1. > > P3 uses M2. > > > > Sequence: > > map_create A, M1, M2. > > > > Load P1, uses M1, M2. What this P1 does is: > > p = kptr_xchg(M1.value, NULL); > > kptr_xchg(M2.value, p); > > > > So it moves the kptr in M1 into M2. The problem is at this point > > kptr_offset is not populated, so we cannot fill used_allocators of M2 > > as we cannot track which allocator is used to fill M1.value. We saw > > nothing filling it yet. > > > > Next, load P3. It does: > > p = kptr_xchg(M2.value, NULL); > > unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so > > this is ok if allocator is alive. > > > > Again, M2.used_allocators is empty. Nothing is filled into it. > > > > Next, load P2. > > p = alloc(&A, ...); > > kptr_xchg(M1.value, p); > > > > Now, M1.used_allocators is filled with allocator ref and kptr_offset. > > But M2.used_allocators will remain unfilled. > > > > Now, run programs in sequence of P2, then P1. This will allocate from > > A, and move the ref to M1, then to M2. But only P1 and P2 have > > references to M1 so it keeps the allocator alive. However, now unload > > both P1 and P2. > > P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2 > > is still held by loaded P3. > > > > Now, M2.used_allocators is empty. P3 is using it, and it is holding > > allocation from allocator A. Both M1 and A are freed. > > When P3 runs now, it can kptr_xchg and try to free it, and the same > > uaf happens again. 
> > If not that, uaf when M2 is freed and it does unit_free on the alive local kptr. > > > > -- > > > > Will this case be covered by your approach? Did I miss something? > > > > The main issue is that this allocator info can be lost depending on > > how you verify a set of programs. It would not be lost if we verified > > in order P2, P1, P3 instead of the current P1, P3, P2. > > > > So we might have to teach the verifier to identify kptr_xchg edges > > between maps, and propagate any used_allocators to the other map? But > > it's becoming too complicated. > > > > You _can_ reject loads of programs when you don't find kptr_offset > > populated on seeing kptr_xchg(..., NULL), but I don't think this is > > practical either. It makes the things sensitive to program > > verification order, which would be confusing for users. > > Right. Thanks for brainstorming and coming up with the case > where it breaks. > > Let me explain the thought process behind the proposal. > The way the progs will be written will be something like: > > struct foo { > int var; > }; > > struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > __type(value, struct foo); > } ma SEC(".maps"); > > struct map_val { > struct foo * ptr __kptr __local; > }; > > struct { > __uint(type, BPF_MAP_TYPE_HASH); > __uint(max_entries, 123); > __type(key, __u32); > __type(value, struct map_val); > } hash SEC(".maps"); > > struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); > bpf_kptr_xchg(&map_val->ptr, p); > > Even if prog doesn't allocate and only does kptr_xchg like > your P1 and P3 do the C code has to have a full > definition 'struct foo' to compile P1 and P3. > P1 and P3 don't need to see definition of 'ma' to be compiled, > but 'struct foo' has to be seen. > BTF reference will be taken by both 'ma' and by 'hash'. > The btf_id will come from that BTF. > > The type is tied to BTF and tied to kptr in map value. > The type is also tied to the allocator. > The type creates a dependency chain between allocator and map. > So the restriction of one allocator per kptr feels > reasonable and doesn't feel restrictive at all. > That dependency chain is there in the C code of the program. > Hence the proposal to discover this dependency in the verifier > through tracking of allocator from mem_alloc into kptr_xchg. > But you're correct that it's not working for P1 and P3. Encoding the allocator in the runtime type system is fine but it comes with its own set of tradeoffs. > > I can imagine a few ways to solve it. > 1. Ask users to annotate kptr local with the allocator > that will be used. > It's easy for progs P1 and P3. All definitions are likely available. > It's only an extra tag of some form. This would introduce maps referring to other maps and would thus require substantial work in libbpf. In order to encode the link specific to field instead of the whole map object, we'd have to resort to something like map names as type tags, which is not a great design (arbitrary strings etc). > 2. move 'used_allocator' from map further into BTF, > since BTF is the root of this dependency chain. This would _maybe_ work. The hole I can see in this plan is that once a slot is claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the local kptr fields useless (i.e. the allocator is around but no allocator map for it exists and thus no program can use these fields again, they can only be free()). 
The reason it can't be unclaimed is that programs were verified with a specific allocator value in mind and we can't change that after they're loaded. To be able to unclaim a slot, we'd need to remove everything using that system (i.e. btf object) and load it again, which is not great. > When the verifier sees bpf_mem_alloc in P2 it will add > {allocator, btf_id} pair to BTF. > > If P1 loads first and the verifier see: > p = kptr_xchg(M1.value, NULL); > it will add {unique_allocator_placeholder, btf_id} to BTF. > Then > kptr_xchg(M2.value, p); does nothing. > The verifier checks that M1's BTF == M2's BTF and id-s are same. Note to self: don't allow setting base_btf from userspace without auditing all these checks. > > Then P3 loads with: > p = kptr_xchg(M2.value, NULL); > unit_free(p); > since unique_allocator_placholder is already there for that btf_id > the verifier does nothing. > > Eventually it will see bpf_mem_alloc for that btf_id and will > replace the placeholder with the actual allocator. > We can even allow P1 and P3 to be runnable after load right away. > Since nothing can allocate into that kptr local those > kptr_xchg() in P1 and P3 will be returning NULL. > If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue. > > Ideally for unit_free(p); in P3 the verifier would add a hidden > 'ma' argument, so that allocator doesn't need to be stored dynamically. > We can either insns of P3 after P2 was verified or > pass a pointer to a place in BTF->used_allocator list of pairs > where actual allocator pointer will be written later. > Then no patching is needed. > If P2 never loads the unit_free(*addr /* here it will load the > value of unique_allocator_placeholder */, ...) > but since unit_free() will never execute with valid obj to be freed. > > "At map free time walk all elements and free kptrs" step stays the same. > but we decrement allocator refcnt only when BTF is freed > which should be after map free time. > So btf_put(map->btf); would need to move after ops->map_free. > > Maybe unique_allocator_placeholder approach can be used > without moving the list into BTF. Need to think more about it tomorrow. I don't think we have to resort to storing the type-allocator mappings in the BTF at all. In fact, we can encode them where you wanted to encode them with tags - on the fields themselves. We can put the mem_alloc reference in the kptr field descriptors and have it transition to a specific allocator irreversibly. We would need to record where any equivalences between allocators we need for the currently loaded programs but it's not impossible. Making the transition reversible is the hard part of course. Taking a step back, we're trying to convert a runtime property (this object came from this allocator) into a static property. The _cleanest_ way to do this would be to store the dependencies explicitly and propagate/dissolve them on program load/unload. Note that only programs introduce new dependency edges, maps on their own do not mandate dependencies but stored values can extend the lifetime of a dependency chain. There are only a couple of types of dependencies: Program X stores allocation from Y into map Z, field A Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for inner maps) Program X uses field Z.A (therefore Z.A values may live on stack, so you can't walk the map yet) On program load, we materialize these in a table. They can have placeholder values or take values from existing maps. 
When a program loads that makes a placeholder a concrete allocator instance, we propagate that to all related dependencies (it's kinda like constant propagation). Calls to bpf_obj_free can receive a bpf_mem_alloc** to a field in that program's dependency in this table. On program unload, we can tag the relationships the program introduced as ready to discard. Once the entire chain is ready to discard, no programs require this relationship, so we can potentially walk the map and if all the values are NULL, dissolve the relationship. The map field can now be used with a different allocator. If there's a non-NULL value in the map, we can't do much - we need a program to load that uses this map and on that program's unload, we can check again. On map free, we can free the values, of course, but we can't remove the dependency edges from the table, since values may have propagated to other tables (this depends on the concrete implementation - we might be able to have the map remove all edges that reference it). Once all relationships for allocator A are gone, we can destroy it safely. This scheme allows for reuse of maps so long as the values get cleared before attempts to store in that field from a second allocator. It does not allow mixing allocators for the same field. I may have missed relationships but they too can follow this pattern - you store them explicitly, then reason about them at program load/unload or map free. Anything less explicit is doing this anyway, either less precisely or less clearly. All this machinery _does_ create the closest approximation of the runtime property (per field instead of per value) but it's also approximating something that the allocator can easily keep track of itself, precisely, at runtime. I don't think it's worth the complexity, explicit or not. -- Delyan
On Tue, Aug 30, 2022 at 08:31:55PM +0000, Delyan Kratunov wrote: > On Mon, 2022-08-29 at 23:03 -0700, Alexei Starovoitov wrote: > > > > On Mon, Aug 29, 2022 at 10:03 PM Kumar Kartikeya Dwivedi > > <memxor@gmail.com> wrote: > > > > > > > > So here is new proposal: > > > > > > > > > > Thanks for the proposal, Alexei. I think we're getting close to a > > > solution, but still some comments below. > > > > > > > At load time the verifier walks all kptr_xchg(map_value, obj) > > > > and adds obj's allocator to > > > > map->used_allocators <- {kptr_offset, allocator}; > > > > If kptr_offset already exists -> failure to load. > > > > Allocator can probably be a part of struct bpf_map_value_off_desc. > > > > > > > > In other words the pairs of {kptr_offset, allocator} > > > > say 'there could be an object from that allocator in > > > > that kptr in some map values'. > > > > > > > > Do nothing at prog unload. > > > > > > > > At map free time walk all elements and free kptrs. > > > > Finally drop allocator refcnts. > > > > > > > > > > Yes, this should be possible. > > > It's quite easy to capture the map_ptr for the allocated local kptr. > > > Limiting each local kptr to one allocator is probably fine, at least for a v1. > > > > > > One problem I see is how it works when the allocator map is an inner map. > > > Then, it is not possible to find the backing allocator instance at > > > verification time, hence not possible to take the reference to it in > > > map->used_allocators. > > > But let's just assume that is disallowed for now. > > > > > > The other problem I see is that when the program just does > > > kptr_xchg(map_value, NULL), we may not have allocator info from > > > kptr_offset at that moment. Allocating prog which fills > > > used_allocators may be verified later. We _can_ reject this, but it > > > makes everything fragile (dependent on which order you load programs > > > in), which won't be great. You can then use this lost info to make > > > kptr disjoint from allocator lifetime. > > > > > > Let me explain through an example. > > > > > > Consider this order to set up the programs: > > > One allocator map A. > > > Two hashmaps M1, M2. > > > Three programs P1, P2, P3. > > > > > > P1 uses M1, M2. > > > P2 uses A, M1. > > > P3 uses M2. > > > > > > Sequence: > > > map_create A, M1, M2. > > > > > > Load P1, uses M1, M2. What this P1 does is: > > > p = kptr_xchg(M1.value, NULL); > > > kptr_xchg(M2.value, p); > > > > > > So it moves the kptr in M1 into M2. The problem is at this point > > > kptr_offset is not populated, so we cannot fill used_allocators of M2 > > > as we cannot track which allocator is used to fill M1.value. We saw > > > nothing filling it yet. > > > > > > Next, load P3. It does: > > > p = kptr_xchg(M2.value, NULL); > > > unit_free(p); // let's assume p has bpf_mem_alloc ptr behind itself so > > > this is ok if allocator is alive. > > > > > > Again, M2.used_allocators is empty. Nothing is filled into it. > > > > > > Next, load P2. > > > p = alloc(&A, ...); > > > kptr_xchg(M1.value, p); > > > > > > Now, M1.used_allocators is filled with allocator ref and kptr_offset. > > > But M2.used_allocators will remain unfilled. > > > > > > Now, run programs in sequence of P2, then P1. This will allocate from > > > A, and move the ref to M1, then to M2. But only P1 and P2 have > > > references to M1 so it keeps the allocator alive. However, now unload > > > both P1 and P2. > > > P1, P2, A, allocator of A, M1 all can be freed after RCU gp wait. M2 > > > is still held by loaded P3. 
> > > > > > Now, M2.used_allocators is empty. P3 is using it, and it is holding > > > allocation from allocator A. Both M1 and A are freed. > > > When P3 runs now, it can kptr_xchg and try to free it, and the same > > > uaf happens again. > > > If not that, uaf when M2 is freed and it does unit_free on the alive local kptr. > > > > > > -- > > > > > > Will this case be covered by your approach? Did I miss something? > > > > > > The main issue is that this allocator info can be lost depending on > > > how you verify a set of programs. It would not be lost if we verified > > > in order P2, P1, P3 instead of the current P1, P3, P2. > > > > > > So we might have to teach the verifier to identify kptr_xchg edges > > > between maps, and propagate any used_allocators to the other map? But > > > it's becoming too complicated. > > > > > > You _can_ reject loads of programs when you don't find kptr_offset > > > populated on seeing kptr_xchg(..., NULL), but I don't think this is > > > practical either. It makes the things sensitive to program > > > verification order, which would be confusing for users. > > > > Right. Thanks for brainstorming and coming up with the case > > where it breaks. > > > > Let me explain the thought process behind the proposal. > > The way the progs will be written will be something like: > > > > struct foo { > > int var; > > }; > > > > struct { > > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > > __type(value, struct foo); > > } ma SEC(".maps"); > > > > struct map_val { > > struct foo * ptr __kptr __local; > > }; > > > > struct { > > __uint(type, BPF_MAP_TYPE_HASH); > > __uint(max_entries, 123); > > __type(key, __u32); > > __type(value, struct map_val); > > } hash SEC(".maps"); > > > > struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); > > bpf_kptr_xchg(&map_val->ptr, p); > > > > Even if prog doesn't allocate and only does kptr_xchg like > > your P1 and P3 do the C code has to have a full > > definition 'struct foo' to compile P1 and P3. > > P1 and P3 don't need to see definition of 'ma' to be compiled, > > but 'struct foo' has to be seen. > > BTF reference will be taken by both 'ma' and by 'hash'. > > The btf_id will come from that BTF. > > > > The type is tied to BTF and tied to kptr in map value. > > The type is also tied to the allocator. > > The type creates a dependency chain between allocator and map. > > So the restriction of one allocator per kptr feels > > reasonable and doesn't feel restrictive at all. > > That dependency chain is there in the C code of the program. > > Hence the proposal to discover this dependency in the verifier > > through tracking of allocator from mem_alloc into kptr_xchg. > > But you're correct that it's not working for P1 and P3. > > Encoding the allocator in the runtime type system is fine but it comes with its own > set of tradeoffs. > > > > > I can imagine a few ways to solve it. > > 1. Ask users to annotate kptr local with the allocator > > that will be used. > > It's easy for progs P1 and P3. All definitions are likely available. > > It's only an extra tag of some form. > > This would introduce maps referring to other maps and would thus require substantial > work in libbpf. In order to encode the link specific to field instead of the whole > map object, we'd have to resort to something like map names as type tags, which is > not a great design (arbitrary strings etc). > > > 2. move 'used_allocator' from map further into BTF, > > since BTF is the root of this dependency chain. > > This would _maybe_ work. 
The hole I can see in this plan is that once a slot is > claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the > local kptr fields useless (i.e. the allocator is around but no allocator map for it > exists and thus no program can use these fields again, they can only be free()). That's correct if we think of allocators as property of programs, but see below. > The reason it can't be unclaimed is that programs were verified with a specific > allocator value in mind and we can't change that after they're loaded. To be able to > unclaim a slot, we'd need to remove everything using that system (i.e. btf object) > and load it again, which is not great. That's also correct. > > > When the verifier sees bpf_mem_alloc in P2 it will add > > {allocator, btf_id} pair to BTF. > > > > If P1 loads first and the verifier see: > > p = kptr_xchg(M1.value, NULL); > > it will add {unique_allocator_placeholder, btf_id} to BTF. > > Then > > kptr_xchg(M2.value, p); does nothing. > > The verifier checks that M1's BTF == M2's BTF and id-s are same. > > Note to self: don't allow setting base_btf from userspace without auditing all these > checks. > > > > > Then P3 loads with: > > p = kptr_xchg(M2.value, NULL); > > unit_free(p); > > since unique_allocator_placholder is already there for that btf_id > > the verifier does nothing. > > > > Eventually it will see bpf_mem_alloc for that btf_id and will > > replace the placeholder with the actual allocator. > > We can even allow P1 and P3 to be runnable after load right away. > > Since nothing can allocate into that kptr local those > > kptr_xchg() in P1 and P3 will be returning NULL. > > If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue. > > > > Ideally for unit_free(p); in P3 the verifier would add a hidden > > 'ma' argument, so that allocator doesn't need to be stored dynamically. > > We can either insns of P3 after P2 was verified or > > pass a pointer to a place in BTF->used_allocator list of pairs > > where actual allocator pointer will be written later. > > Then no patching is needed. > > If P2 never loads the unit_free(*addr /* here it will load the > > value of unique_allocator_placeholder */, ...) > > but since unit_free() will never execute with valid obj to be freed. > > > > "At map free time walk all elements and free kptrs" step stays the same. > > but we decrement allocator refcnt only when BTF is freed > > which should be after map free time. > > So btf_put(map->btf); would need to move after ops->map_free. > > > > Maybe unique_allocator_placeholder approach can be used > > without moving the list into BTF. Need to think more about it tomorrow. > > I don't think we have to resort to storing the type-allocator mappings in the BTF at > all. > > In fact, we can encode them where you wanted to encode them with tags - on the fields > themselves. We can put the mem_alloc reference in the kptr field descriptors and have > it transition to a specific allocator irreversibly. We would need to record where any > equivalences between allocators we need for the currently loaded programs but it's > not impossible. > > Making the transition reversible is the hard part of course. > > Taking a step back, we're trying to convert a runtime property (this object came from > this allocator) into a static property. The _cleanest_ way to do this would be to > store the dependencies explicitly and propagate/dissolve them on program load/unload. Agree with above. 
> Note that only programs introduce new dependency edges, maps on their own do not > mandate dependencies but stored values can extend the lifetime of a dependency chain. I think we're getting to the bottom of the issues with this api. The api is designed around the concept of an allocator being another map type. That felt correct in the beginning, but see below. > There are only a couple of types of dependencies: > Program X stores allocation from Y into map Z, field A > Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for > inner maps) What does it mean for two allocators to be equivalent? Like mergeable slabs? If so the kernel gave the answer to that question long ago: merge them whenever possible. bpf-speak if two allocators have the same type they should be mergeable (== equivalent). Now let's come up with a use case when bpf core would need to recognize two equivalent allocators. Take networking service, like katran. It has a bunch of progs that are using pinned maps. The maps are pinned because progs can be reloaded due to live-upgrade or due to user space restart. The map contents are preserved, but the progs are reloaded. That's a common scenario. If we design allocator api in a way that allocator is another map. The prog reload case has two options: - create new allocator map and use it in reloaded progs - pin allocator map along with other maps at the very beginning and make reloaded progs use it In the later case there is no new allocator. No need to do equivalence checks. In the former case there is a new allocator and bpf core needs to recognize equivalence otherwise pinned maps with kptrs won't be usable out of reloaded progs. But what are the benefits of creating new allocators on reload? I don't see any. Old allocator had warm caches populated with cache friendly objects. New allocator has none of it, but it's going to be used for exactly the same objects. What's worse. New allocator will likely bring a new BTF object making equivalence work even harder. I think it all points to the design issue where allocator is another map and dependency between maps is a run-time property. As you pointed out earlier the cleanest way is to make this dependency static. Turned out we have such static dependency already. Existing bpf maps depend on kernel slab allocator. After bpf_mem_alloc patches each hash map will have a hidden dependency on that allocator. What we've designed so far is: struct foo { int var; }; struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); __type(value, struct foo); } ma SEC(".maps"); struct map_val { struct foo * ptr __kptr __local; }; struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 123); __type(key, __u32); __type(value, struct map_val); } hash SEC(".maps"); struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); bpf_kptr_xchg(&map_val->ptr, p); /* this is a copy-paste of my earlier example. no changes */ but what wasn't obvious that we have two allocators here. One explicit 'ma' and another hidden allocator in 'hash'. Both are based on 'struct bpf_mem_alloc'. One will be allocating 'struct foo'. Another 'struct map_val'. But we missed opportunity to make them mergeable and bpf prog cannot customize them. I think the way to fix the api is to recognize: - every map has an allocator - expose that allocator to progs - allow sharing of allocators between maps so an allocator is a mandatory part of the map. If it's not specified an implicit one will be used. 
Thinking along these lines we probably need map-in-map-like specification of explicit allocator(s) for maps: struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 123); __type(key, __u32); __type(value, struct map_val); __array(elem_allocator, struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); }); __array(kptr_allocator, struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); }); } hash SEC(".maps"); That would be explicit and static declartion of allocators that hash map should use. The benefits: - the prog writers can share the same allocator across multiple hash maps by specifying the same 'elem_allocator'. (struct bpf_mem_alloc already supports more than one size) The most memory concious authors can use the same allocator across all of their maps achieving the best memory savings. Not talking about kptr yet. This is just plain hash maps. - the program can influence hash map allocator. Once we have bpf_mem_prefill() helper the bpf program can add reserved elements to a hash map and guarantee that bpf_map_update_elem() will succeed later. - kptr allocator is specified statically. At map creation time the map will take reference of allocator and will keep it until destruction. The verifier will make sure that kptr_xchg-ed objects come only from that allocator. So all of the refcnt issues discussed in this thread are gone. The prog will look like: struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); } ma SEC(".maps"); struct { __uint(type, BPF_MAP_TYPE_HASH); __type(value, struct map_val); __array(kptr_allocator, struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); }); } hash SEC(".maps") = { .kptr_allocator = { (void *)&ma }, }; struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); If allocated kptr-s don't need to be in multiple maps the prog can be: struct { __uint(type, BPF_MAP_TYPE_HASH); __type(value, struct map_val); __array(kptr_allocator, struct { __uint(type, BPF_MAP_TYPE_ALLOCATOR); }); } hash SEC(".maps"); struct foo * p = bpf_mem_alloc(&hash->kptr_allocator, type_id_local(struct foo)); The same hash->kptr_allocator can be used to allocate different types. We can also allow the same allocator to be specified in hash->elem_allocator and hash->kptr_allocator. We can allow progs to skip declaring explicit BPF_MAP_TYPE_ALLOCATOR. Like: struct { __uint(type, BPF_MAP_TYPE_HASH); __type(value, struct map_val); } hash SEC(".maps"); struct foo * p = bpf_mem_alloc(&hash->elem_allocator, type_id_local(struct foo)); In this case both kptr objects and map elements come from the same allocator that was implicitly created at hash map creation time. The prog reload use case is naturally solved. By pinning map the user space pinned allocators and prog reload will reuse the same maps with the same allocators. If this design revision makes sense the bpf_mem_alloc needs a bit of work to support the above cleanly. It should be straightforward. > If there's a non-NULL value in the map, we can't do much - we need a program to load > that uses this map and on that program's unload, we can check again. On map free, we > can free the values, of course, but we can't remove the dependency edges from the > table, since values may have propagated to other tables (this depends on the concrete > implementation - we might be able to have the map remove all edges that reference > it). ... > I don't think it's worth the complexity, explicit or not. The edge tracking dependency graph sounds quite complex and I agree that it's not worth it.
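As a side note on the bpf_mem_prefill() idea above: the helper does not exist yet, so the call below (and the choice of section) is purely an illustrative sketch of the intended guarantee, reserving elements from a sleepable context so a later update from a restricted context cannot fail. It reuses the 'hash' map with an elem_allocator declared above.

    /* Hypothetical usage of the proposed bpf_mem_prefill() helper;
     * the signature is made up for illustration.
     */
    SEC("syscall") /* sleepable context */
    int reserve_elems(void *ctx)
    {
            /* keep 128 elements in the map's element allocator free list,
             * so bpf_map_update_elem() from NMI/tracing context won't fail
             * due to allocation
             */
            return bpf_mem_prefill(&hash->elem_allocator, 128);
    }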
On Tue, 2022-08-30 at 18:52 -0700, Alexei Starovoitov wrote: > > [SNIP] > > This would _maybe_ work. The hole I can see in this plan is that once a slot is > > claimed, it cannot be unclaimed and thus maps can remain in a state that leaves the > > local kptr fields useless (i.e. the allocator is around but no allocator map for it > > exists and thus no program can use these fields again, they can only be free()). > > That's correct if we think of allocators as property of programs, but see below. Allocators are not properties of programs, allocator _access_ is. Allocators are properties of the objects allocated from them, programs and maps only participate in the lifecycle of the objects and thus extend the lifetime of the allocator. > > > The reason it can't be unclaimed is that programs were verified with a specific > > allocator value in mind and we can't change that after they're loaded. To be able to > > unclaim a slot, we'd need to remove everything using that system (i.e. btf object) > > and load it again, which is not great. > > That's also correct. > > > > > > When the verifier sees bpf_mem_alloc in P2 it will add > > > {allocator, btf_id} pair to BTF. > > > > > > If P1 loads first and the verifier see: > > > p = kptr_xchg(M1.value, NULL); > > > it will add {unique_allocator_placeholder, btf_id} to BTF. > > > Then > > > kptr_xchg(M2.value, p); does nothing. > > > The verifier checks that M1's BTF == M2's BTF and id-s are same. > > > > Note to self: don't allow setting base_btf from userspace without auditing all these > > checks. > > > > > > > > Then P3 loads with: > > > p = kptr_xchg(M2.value, NULL); > > > unit_free(p); > > > since unique_allocator_placholder is already there for that btf_id > > > the verifier does nothing. > > > > > > Eventually it will see bpf_mem_alloc for that btf_id and will > > > replace the placeholder with the actual allocator. > > > We can even allow P1 and P3 to be runnable after load right away. > > > Since nothing can allocate into that kptr local those > > > kptr_xchg() in P1 and P3 will be returning NULL. > > > If P2 with bpf_mem_alloc never loads it's fine. Not a safety issue. > > > > > > Ideally for unit_free(p); in P3 the verifier would add a hidden > > > 'ma' argument, so that allocator doesn't need to be stored dynamically. > > > We can either insns of P3 after P2 was verified or > > > pass a pointer to a place in BTF->used_allocator list of pairs > > > where actual allocator pointer will be written later. > > > Then no patching is needed. > > > If P2 never loads the unit_free(*addr /* here it will load the > > > value of unique_allocator_placeholder */, ...) > > > but since unit_free() will never execute with valid obj to be freed. > > > > > > "At map free time walk all elements and free kptrs" step stays the same. > > > but we decrement allocator refcnt only when BTF is freed > > > which should be after map free time. > > > So btf_put(map->btf); would need to move after ops->map_free. > > > > > > Maybe unique_allocator_placeholder approach can be used > > > without moving the list into BTF. Need to think more about it tomorrow. > > > > I don't think we have to resort to storing the type-allocator mappings in the BTF at > > all. > > > > In fact, we can encode them where you wanted to encode them with tags - on the fields > > themselves. We can put the mem_alloc reference in the kptr field descriptors and have > > it transition to a specific allocator irreversibly. 
We would need to record where any > > equivalences between allocators we need for the currently loaded programs but it's > > not impossible. > > > > Making the transition reversible is the hard part of course. > > > > Taking a step back, we're trying to convert a runtime property (this object came from > > this allocator) into a static property. The _cleanest_ way to do this would be to > > store the dependencies explicitly and propagate/dissolve them on program load/unload. > > Agree with above. > > > Note that only programs introduce new dependency edges, maps on their own do not > > mandate dependencies but stored values can extend the lifetime of a dependency chain. > > I think we're getting to the bottom of the issues with this api. > The api is designed around the concept of an allocator being another map type. > That felt correct in the beginning, but see below. > > > There are only a couple of types of dependencies: > > Program X stores allocation from Y into map Z, field A > > Program X requires allocator for Y.A to equal allocator for Z.A (+ a version for > > inner maps) > > What does it mean for two allocators to be equivalent? > Like mergeable slabs? If so the kernel gave the answer to that question long ago: > merge them whenever possible. > bpf-speak if two allocators have the same type they should be mergeable (== equivalent). The requirement to track (allocator, type) tuples came from the desire to manage allocator lifetime statically, it's not inherent in the problem. It allows us to narrow down the problem space while retaining some flexibility in that multiple allocators can participate in the same map. > Now let's come up with a use case when bpf core would need to recognize two > equivalent allocators. > Take networking service, like katran. It has a bunch of progs that are using > pinned maps. The maps are pinned because progs can be reloaded due to > live-upgrade or due to user space restart. The map contents are preserved, but > the progs are reloaded. That's a common scenario. > If we design allocator api in a way that allocator is another map. The prog > reload case has two options: > - create new allocator map and use it in reloaded progs > - pin allocator map along with other maps at the very beginning and > make reloaded progs use it > > In the later case there is no new allocator. No need to do equivalence checks. > In the former case there is a new allocator and bpf core needs to recognize > equivalence otherwise pinned maps with kptrs won't be usable out of reloaded > progs. But what are the benefits of creating new allocators on reload? > I don't see any. Old allocator had warm caches populated with cache friendly > objects. New allocator has none of it, but it's going to be used for exactly > the same objects. What's worse. New allocator will likely bring a new BTF object > making equivalence work even harder. From an optimal behavior point of view, I agree that starting a new allocator is not ideal. However, if a design allows suboptimal behavior, it _will_ be used by users, so it must be correct. > > I think it all points to the design issue where allocator is another map and > dependency between maps is a run-time property. As you pointed out earlier the > cleanest way is to make this dependency static. Turned out we have such static > dependency already. Existing bpf maps depend on kernel slab allocator. After > bpf_mem_alloc patches each hash map will have a hidden dependency on that > allocator. 
> > What we've designed so far is: > > struct foo { > int var; > }; > > struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > __type(value, struct foo); > } ma SEC(".maps"); > > struct map_val { > struct foo * ptr __kptr __local; > }; > > struct { > __uint(type, BPF_MAP_TYPE_HASH); > __uint(max_entries, 123); > __type(key, __u32); > __type(value, struct map_val); > } hash SEC(".maps"); > > struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); > bpf_kptr_xchg(&map_val->ptr, p); > > /* this is a copy-paste of my earlier example. no changes */ > > but what wasn't obvious that we have two allocators here. > One explicit 'ma' and another hidden allocator in 'hash'. > Both are based on 'struct bpf_mem_alloc'. > One will be allocating 'struct foo'. Another 'struct map_val'. > But we missed opportunity to make them mergeable and > bpf prog cannot customize them. > > I think the way to fix the api is to recognize: > - every map has an allocator Array maps don't or at least not in a way that matters. But sure. > - expose that allocator to progs Good idea on its own. > - allow sharing of allocators between maps Hidden assumption here (see below). > > so an allocator is a mandatory part of the map. > If it's not specified an implicit one will be used. > Thinking along these lines we probably need map-in-map-like > specification of explicit allocator(s) for maps: > > struct { > __uint(type, BPF_MAP_TYPE_HASH); > __uint(max_entries, 123); > __type(key, __u32); > __type(value, struct map_val); > __array(elem_allocator, struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > }); > __array(kptr_allocator, struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > }); > } hash SEC(".maps"); > > That would be explicit and static declartion of allocators that > hash map should use. Adding a kptr_allocator customization mechanism to all maps is fine, though pretty heavy-handed in terms of abstraction leakage. elem_allocator doesn't make sense for all maps. > The benefits: > - the prog writers can share the same allocator across multiple > hash maps by specifying the same 'elem_allocator'. > (struct bpf_mem_alloc already supports more than one size) > The most memory concious authors can use the same allocator > across all of their maps achieving the best memory savings. > Not talking about kptr yet. This is just plain hash maps. > > - the program can influence hash map allocator. > Once we have bpf_mem_prefill() helper the bpf program can add > reserved elements to a hash map and guarantee that > bpf_map_update_elem() will succeed later. I like these extensions, they make sense. > > - kptr allocator is specified statically. At map creation time > the map will take reference of allocator and will keep it until > destruction. The verifier will make sure that kptr_xchg-ed objects > come only from that allocator. > So all of the refcnt issues discussed in this thread are gone. > The prog will look like: > > struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > } ma SEC(".maps"); > > struct { > __uint(type, BPF_MAP_TYPE_HASH); > __type(value, struct map_val); > __array(kptr_allocator, struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > }); > } hash SEC(".maps") = { > .kptr_allocator = { (void *)&ma }, > }; This part is going to be painful to implement but plausible. I also have to verify that this will emit relocations and not global initializers (I don't recall the exact rules). 
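For comparison, the existing BTF-defined map-in-map declarations in the selftests already rely on this style of static initialization. The sketch below is from memory and may differ in detail from the actual btf_map_in_map selftest, but it is the pattern a kptr_allocator slot would presumably piggyback on:

    struct inner_map {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, int);
            __type(value, int);
    } inner_map1 SEC(".maps");

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
            __uint(max_entries, 1);
            __uint(key_size, sizeof(int));
            __uint(value_size, sizeof(int));
            __array(values, struct inner_map);
    } outer_map SEC(".maps") = {
            .values = { [0] = &inner_map1 },
    };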
> struct foo * p = bpf_mem_alloc(&ma, type_id_local(struct foo)); > > If allocated kptr-s don't need to be in multiple maps the prog can be: > > struct { > __uint(type, BPF_MAP_TYPE_HASH); > __type(value, struct map_val); > __array(kptr_allocator, struct { > __uint(type, BPF_MAP_TYPE_ALLOCATOR); > }); > } hash SEC(".maps"); > > struct foo * p = bpf_mem_alloc(&hash->kptr_allocator, type_id_local(struct foo)); > > The same hash->kptr_allocator can be used to allocate different types. > We can also allow the same allocator to be specified in hash->elem_allocator > and hash->kptr_allocator. The need to extract the allocator into its own map in order to have kptrs of the same type in more than one map is definitely a convenience downgrade. We're now to a point where we're not even putting the onus of dependency tracking on the verifier but entirely on the developer, while making the dependency tracking less granular. On the spectrum where on one side we have the allocator tracking its lifetime at runtime in a way where allocators can be easily mixed in the same map, through verifier smarts with _some_ restrictions (per field), to this design, this feels like the most restrictive version from a developer experience pov. I personally also don't like APIs where easy things are easy but the moment you want to do something slightly out of the blessed path, you hit a cliff and have to learn about a bunch of things you don't care about. "I just want to move a pointer from here to over there, why do I have to refactor all my maps?" > > We can allow progs to skip declaring explicit BPF_MAP_TYPE_ALLOCATOR. > Like: > struct { > __uint(type, BPF_MAP_TYPE_HASH); > __type(value, struct map_val); > } hash SEC(".maps"); > > struct foo * p = bpf_mem_alloc(&hash->elem_allocator, type_id_local(struct foo)); > > In this case both kptr objects and map elements come from the same allocator > that was implicitly created at hash map creation time. Overall, this design (or maybe the way it's presented here) conflates a few ideas. 1) The extensions to expose and customize map's internal element allocator are fine independently of even this patchset. 2) The idea that kptrs in a map need to have a statically identifiable allocator is taken as an axiom, and then expanded to its extreme (single allocator per map as opposed to the smarter verifier schemes). I still contest that this is not the case and the runtime overhead it avoids is paid back in bad developer experience multiple times over. 3) The idea that allocators can be merged between elements and kptrs is independent of the static requirements. If the map's internal allocator is exposed per 1), we can still use it to allocate kptrs but not require that all kptrs in a map are from the same allocator. Going this coarse in the API is easy for us but fundamentally more limiting for developers. It's not hard to imagine situations where the verifier dependency tracking or runtime lifetime tracking would allow for pinned maps to be retained but this scheme would require new maps entirely. (Any situation where you just refactored the implicit allocator out to share it, for example) I also don't think that simplicity for us (a one time implementation cost + continuous maintenance cost) trumps over long term developer experience (a much bigger implementation cost over a much bigger time span). > > The prog reload use case is naturally solved. > By pinning map the user space pinned allocators and prog reload will reuse > the same maps with the same allocators. 
> > If this design revision makes sense the bpf_mem_alloc needs a bit of work > to support the above cleanly. It should be straightforward. > > > If there's a non-NULL value in the map, we can't do much - we need a program to load > > that uses this map and on that program's unload, we can check again. On map free, we > > can free the values, of course, but we can't remove the dependency edges from the > > table, since values may have propagated to other tables (this depends on the concrete > > implementation - we might be able to have the map remove all edges that reference > > it). > ... > > I don't think it's worth the complexity, explicit or not. > > The edge tracking dependency graph sounds quite complex and I agree that it's not worth it. So far, my ranked choice vote is: 1) maximum flexibility and runtime live object counts (with exposed allocators, I like the merging) 2) medium flexibility with per-field allocator tracking in the verifier and the ability to lose the association once programs are unloaded and values are gone. This also works better with exposed allocators since they are implicitly pinned and would be usable to store values in another map. 3) minimum flexibility with static whole-map kptr allocators -- Delyan
On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote: > > Overall, this design (or maybe the way it's presented here) conflates a few ideas. > > 1) The extensions to expose and customize map's internal element allocator are fine > independently of even this patchset. > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is > taken as an axiom, and then expanded to its extreme (single allocator per map as > opposed to the smarter verifier schemes). I still contest that this is not the case > and the runtime overhead it avoids is paid back in bad developer experience multiple > times over. > > 3) The idea that allocators can be merged between elements and kptrs is independent > of the static requirements. If the map's internal allocator is exposed per 1), we can > still use it to allocate kptrs but not require that all kptrs in a map are from the > same allocator. > > Going this coarse in the API is easy for us but fundamentally more limiting for > developers. It's not hard to imagine situations where the verifier dependency > tracking or runtime lifetime tracking would allow for pinned maps to be retained but > this scheme would require new maps entirely. (Any situation where you just refactored > the implicit allocator out to share it, for example) > > I also don't think that simplicity for us (a one time implementation cost + > continuous maintenance cost) trumps over long term developer experience (a much > bigger implementation cost over a much bigger time span). It feels we're thinking about scope and use cases for the allocator quite differently and what you're seeing as 'limiting developer choices' to me looks like 'not a limiting issue at all'. To me the allocator is one jemalloc/tcmalloc instance. One user space application with multiple threads, lots of maps and code is using exactly one such allocator. The allocator manages all the memory of user space process. In bpf land we don't have a bpf process. We don't have a bpf name space either. A loose analogy would be a set of programs and maps managed by one user space agent. The bpf allocator would manage all the memory of these maps and programs and provide a "memory namespace" for this set of programs. Another user space agent with its own programs would never want to share the same allocator. In user space a chunk of memory could be mmap-ed between different process to share the data, but you would never put a tcmalloc over such memory to be an allocator for different processes. More below. > So far, my ranked choice vote is: > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I > like the merging) > 2) medium flexibility with per-field allocator tracking in the verifier and the > ability to lose the association once programs are unloaded and values are gone. This > also works better with exposed allocators since they are implicitly pinned and would > be usable to store values in another map. > 3) minimum flexibility with static whole-map kptr allocators The option 1 flexibility is necessary when allocator is seen as a private pool of objects of given size. Like kernel's kmem_cache instance. I don't think we quite there yet. There is a need to "preallocate this object from sleepable context, so the prog has a guaranteed chunk of memory to use in restricted context", but, arguably, it's not a job of bpf allocator. bpf prog can allocate an object, stash it into kptr, and use it later. So option 3 doesn't feel less flexible to me. 
imo the whole-map-allocator is more than we need. Ideally it would be easy to specify one single allocator for all maps and progs in a set of .c files. Sort-of a bpf package. In other words one bpf allocator per bpf "namespace" is more than enough. Program authors shouldn't be creating allocators left and right. All these free lists will waste memory. btw I've added an extra patch to bpf_mem_alloc series: https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1 that removes kmem_cache usage. Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea. When free_lists are not shared they will similarly waste memory. In user space the C code just does malloc() and the memory is isolated per process. Ideally in bpf world the programs would just do: bpf_mem_alloc(btf_type_id_local(struct foo)); without specifying an allocator, but that would require one global allocator for all bpf programs in the kernel, which is probably not a direction we should go? So the programs have to specify an allocator to use in bpf_mem_alloc(), but it should be one for all progs, maps in a bpf-package/set/namespace. If it's easy for programs to specify a bunch of allocators, like one for each program, or one for each btf_type_id, the bpf kernel infra would be required to merge these allocators from day one. (The proliferation of kmem_cache-s in the past forced merging of them). By restricting bpf program choices with allocator-per-map (this option 3) we're not only making the kernel side do less work (no run-time ref counts, no merging is required today), we're also pushing bpf progs to use memory-conscious choices. Having said all that, maybe one global allocator is not such a bad idea.
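To illustrate the allocate-and-stash pattern referred to above ("bpf prog can allocate an object, stash it into kptr, and use it later"), here is a rough sketch. bpf_mem_alloc(), type_id_local(), unit_free(), BPF_MAP_TYPE_ALLOCATOR and the __kptr __local tag are the interfaces proposed in this thread (not merged); bpf_kptr_xchg() and the map helpers are existing ones.

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    struct foo { int var; };

    struct {
            __uint(type, BPF_MAP_TYPE_ALLOCATOR);  /* proposed map type */
            __type(value, struct foo);
    } ma SEC(".maps");

    struct stash_val {
            struct foo *obj __kptr __local;        /* proposed tag */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, struct stash_val);
    } stash SEC(".maps");

    /* sleepable context: reserve one object ahead of time */
    SEC("syscall")
    int reserve(void *ctx)
    {
            __u32 key = 0;
            struct stash_val *v = bpf_map_lookup_elem(&stash, &key);
            struct foo *p;

            if (!v)
                    return 0;
            p = bpf_mem_alloc(&ma, type_id_local(struct foo));
            if (!p)
                    return 0;
            p = bpf_kptr_xchg(&v->obj, p);
            if (p)                  /* slot was already occupied */
                    unit_free(p);
            return 0;
    }

    /* restricted context: take the reserved object, no allocation here */
    SEC("perf_event")
    int use(void *ctx)
    {
            __u32 key = 0;
            struct stash_val *v = bpf_map_lookup_elem(&stash, &key);
            struct foo *p;

            if (!v)
                    return 0;
            p = bpf_kptr_xchg(&v->obj, NULL);
            if (!p)
                    return 0;       /* nothing was reserved */
            p->var = 1;             /* use the object */
            unit_free(p);           /* or stash it elsewhere */
            return 0;
    }

This is essentially the program-managed reservation that the free-list discussion further down in the thread is about.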
On Wed, 31 Aug 2022 at 20:57, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote: > > > > Overall, this design (or maybe the way it's presented here) conflates a few ideas. > > > > 1) The extensions to expose and customize map's internal element allocator are fine > > independently of even this patchset. > > > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is > > taken as an axiom, and then expanded to its extreme (single allocator per map as > > opposed to the smarter verifier schemes). I still contest that this is not the case > > and the runtime overhead it avoids is paid back in bad developer experience multiple > > times over. > > > > 3) The idea that allocators can be merged between elements and kptrs is independent > > of the static requirements. If the map's internal allocator is exposed per 1), we can > > still use it to allocate kptrs but not require that all kptrs in a map are from the > > same allocator. > > > > Going this coarse in the API is easy for us but fundamentally more limiting for > > developers. It's not hard to imagine situations where the verifier dependency > > tracking or runtime lifetime tracking would allow for pinned maps to be retained but > > this scheme would require new maps entirely. (Any situation where you just refactored > > the implicit allocator out to share it, for example) > > > > I also don't think that simplicity for us (a one time implementation cost + > > continuous maintenance cost) trumps over long term developer experience (a much > > bigger implementation cost over a much bigger time span). > > It feels we're thinking about scope and use cases for the allocator quite > differently and what you're seeing as 'limiting developer choices' to me looks > like 'not a limiting issue at all'. To me the allocator is one I went over the proposal multiple times, just to make sure I understood it properly, but I still can't see this working ideally for the inner map case, even if we ignore the rest of the things for a moment. But maybe you prefer to just forbid them there? Please correct me if I'm wrong. You won't be able to know the allocator statically for inner maps (and hence not be able to enforce the kptr_xchg requirement to be from the same allocator as map). To know it, we will have to force all inner_maps to use the same allocator, either the one specified for inner map fd, or the one in map-in-map definition, or elsewhere. But to be able to know it statically the information will have to come from map-in-map somehow. That seems like a very weird limitation just to use local kptrs, and doesn't even make sense for map-in-map use cases IMO. And unless I'm missing something there isn't an easy way to accommodate it in the 'statically known allocator' proposal, because such inner_map allocators (and inner_maps) are themselves not static. > jemalloc/tcmalloc instance. One user space application with multiple threads, > lots of maps and code is using exactly one such allocator. The allocator > manages all the memory of user space process. In bpf land we don't have a bpf > process. We don't have a bpf name space either. A loose analogy would be a set > of programs and maps managed by one user space agent. The bpf allocator would > manage all the memory of these maps and programs and provide a "memory namespace" > for this set of programs. Another user space agent with its own programs > would never want to share the same allocator. 
In user space a chunk of memory > could be mmap-ed between different process to share the data, but you would never > put a tcmalloc over such memory to be an allocator for different processes. > But just saying "would never" or "should never" doesn't work right? Hyrum's law and all. libbpf style "bpf package" deployments may not be the only consumers of this infra in the future. So designing around this specific idea that people will never or shouldn't dynamically share their allocator objects between maps which don't share the same allocator seems destined to only serve us in the short run IMO. People may come up with cases where they are passing ownership of objects between such bpf packages, and after coming up with multiple examples before it doesn't seem likely static dependencies will be able to capture such dynamic runtime relationships, e.g. static dependencies don't even work in the inner_map case without more restrictions. > More below. > > > So far, my ranked choice vote is: > > > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I > > like the merging) > > 2) medium flexibility with per-field allocator tracking in the verifier and the > > ability to lose the association once programs are unloaded and values are gone. This > > also works better with exposed allocators since they are implicitly pinned and would > > be usable to store values in another map. > > 3) minimum flexibility with static whole-map kptr allocators > > The option 1 flexibility is necessary when allocator is seen as a private pool > of objects of given size. Like kernel's kmem_cache instance. > I don't think we quite there yet. > There is a need to "preallocate this object from sleepable context, > so the prog has a guaranteed chunk of memory to use in restricted context", > but, arguably, it's not a job of bpf allocator. bpf prog can allocate an object, > stash it into kptr, and use it later. At least if not adding support for it all now, I think this kind of flexibility in option 1 needs to be given some more consideration, as in whether this proposal to encode things statically would be able to accommodate such cases in the future. To me it seems pretty hard (and unless I missed something, it already won't work for inner_maps case without requiring all to use the same allocator). We might actually be able to do a hybrid of the options by utilizing the statically known allocator info to acquire references and runtime object counts, which may help eliminate/delay the actual cost we pay for it - the atomic upgrade, when initial reference goes away. So I think I lean towards option 1 as well, and then the same order as Delyan. It seems to cover all kinds of corner cases (allocator known vs unknown, normal vs inner maps, etc.) we've been going over in this thread, and would also be future proof in terms of permitting unforeseen patterns of usage. > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is > more than we need. Ideally it would be easy to specifiy one single > allocator for all maps and progs in a set of .c files. Sort-of a bpf package. > In other words one bpf allocator per bpf "namespace" is more than enough. > Program authors shouldn't be creating allocators left and right. All these > free lists will waste memory. > btw I've added an extra patch to bpf_mem_alloc series: > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1 > that removes kmem_cache usage. 
> Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea. > When free_lists are not shared they will similarly waste memory. > In user space the C code just does malloc() and the memory is isolated per process. > Ideally in bpf world the programs would just do: > bpf_mem_alloc(btf_type_id_local(struct foo)); > without specifying an allocator, but that would require one global allocator > for all bpf programs in the kernel which is probably not a direction we should go ? > So the programs have to specify an allocator to use in bpf_mem_alloc(), > but it should be one for all progs, maps in a bpf-package/set/namespace.| But "all progs" doesn't mean all of them are running in the same context and having the same kinds of memory allocation requirements, right? While having too many allocators is also bad, having just one single one per package would also be limiting. A bpf object/package is very different from a userspace process. So the analogy doesn't exactly fit IMO. > If it's easy for programs to specify a bunch of allocators, like one for each program, > or one for each btf_type_id the bpf kernel infra would be required to merge > these allocators from day one. (The profileration of kmem_cache-s in the past > forced merging of them). By restricting bpf program choices with allocator-per-map > (this option 3) we're not only making the kernel side to do less work > (no run-time ref counts, no merging is required today), we're also pushing > bpf progs to use memory concious choices. > Having said all that maybe one global allocator is not such a bad idea.
On Wed, Aug 31, 2022 at 10:12:28PM +0200, Kumar Kartikeya Dwivedi wrote: > On Wed, 31 Aug 2022 at 20:57, Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote: > > > > > > Overall, this design (or maybe the way it's presented here) conflates a few ideas. > > > > > > 1) The extensions to expose and customize map's internal element allocator are fine > > > independently of even this patchset. > > > > > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is > > > taken as an axiom, and then expanded to its extreme (single allocator per map as > > > opposed to the smarter verifier schemes). I still contest that this is not the case > > > and the runtime overhead it avoids is paid back in bad developer experience multiple > > > times over. > > > > > > 3) The idea that allocators can be merged between elements and kptrs is independent > > > of the static requirements. If the map's internal allocator is exposed per 1), we can > > > still use it to allocate kptrs but not require that all kptrs in a map are from the > > > same allocator. > > > > > > Going this coarse in the API is easy for us but fundamentally more limiting for > > > developers. It's not hard to imagine situations where the verifier dependency > > > tracking or runtime lifetime tracking would allow for pinned maps to be retained but > > > this scheme would require new maps entirely. (Any situation where you just refactored > > > the implicit allocator out to share it, for example) > > > > > > I also don't think that simplicity for us (a one time implementation cost + > > > continuous maintenance cost) trumps over long term developer experience (a much > > > bigger implementation cost over a much bigger time span). > > > > It feels we're thinking about scope and use cases for the allocator quite > > differently and what you're seeing as 'limiting developer choices' to me looks > > like 'not a limiting issue at all'. To me the allocator is one > > I went over the proposal multiple times, just to make sure I > understood it properly, but I still can't see this working ideally for > the inner map case, even if we ignore the rest of the things for a > moment. > But maybe you prefer to just forbid them there? Please correct me if I'm wrong. > > You won't be able to know the allocator statically for inner maps (and > hence not be able to enforce the kptr_xchg requirement to be from the > same allocator as map). To know it, we will have to force all > inner_maps to use the same allocator, Of course. That's the idea. I don't see a practical use case to use different allocators in different inner maps. > either the one specified for > inner map fd, or the one in map-in-map definition, or elsewhere. But > to be able to know it statically the information will have to come > from map-in-map somehow. > > That seems like a very weird limitation just to use local kptrs, and > doesn't even make sense for map-in-map use cases IMO. > And unless I'm missing something there isn't an easy way to > accommodate it in the 'statically known allocator' proposal, because > such inner_map allocators (and inner_maps) are themselves not static. It doesn't look difficult. The inner map template has to be defined in the outer map. All inner maps must fit this template. Currently it requires key/value to be exactly the same. We used to enforce max_entries too, but it was relaxed later. The same allocator for all inner maps would be a part of the requirement. 
Easy to enforce and easy for progs to comply. It doesn't look limiting or weird to me. > > jemalloc/tcmalloc instance. One user space application with multiple threads, > > lots of maps and code is using exactly one such allocator. The allocator > > manages all the memory of user space process. In bpf land we don't have a bpf > > process. We don't have a bpf name space either. A loose analogy would be a set > > of programs and maps managed by one user space agent. The bpf allocator would > > manage all the memory of these maps and programs and provide a "memory namespace" > > for this set of programs. Another user space agent with its own programs > > would never want to share the same allocator. In user space a chunk of memory > > could be mmap-ed between different process to share the data, but you would never > > put a tcmalloc over such memory to be an allocator for different processes. > > > > But just saying "would never" or "should never" doesn't work right? > Hyrum's law and all. Agree to disagree. Vanilla C allows null pointer dereferences too. BPF C doesn't. > libbpf style "bpf package" deployments may not be the only consumers > of this infra in the future. So designing around this specific idea > that people will never or shouldn't dynamically share their allocator > objects between maps which don't share the same allocator seems > destined to only serve us in the short run IMO. > > People may come up with cases where they are passing ownership of > objects between such bpf packages, and after coming up with multiple > examples before it doesn't seem likely static dependencies will be > able to capture such dynamic runtime relationships, e.g. static > dependencies don't even work in the inner_map case without more > restrictions. The maps are such shared objects. They are shared between bpf programs and between progs and user space. Not proposing anything new here. An allocator connected to a map preserves this sharing ability. > > > More below. > > > > > So far, my ranked choice vote is: > > > > > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I > > > like the merging) > > > 2) medium flexibility with per-field allocator tracking in the verifier and the > > > ability to lose the association once programs are unloaded and values are gone. This > > > also works better with exposed allocators since they are implicitly pinned and would > > > be usable to store values in another map. > > > 3) minimum flexibility with static whole-map kptr allocators > > > > The option 1 flexibility is necessary when allocator is seen as a private pool > > of objects of given size. Like kernel's kmem_cache instance. > > I don't think we quite there yet. > > There is a need to "preallocate this object from sleepable context, > > so the prog has a guaranteed chunk of memory to use in restricted context", > > but, arguably, it's not a job of bpf allocator. bpf prog can allocate an object, > > stash it into kptr, and use it later. > > At least if not adding support for it all now, I think this kind of > flexibility in option 1 needs to be given some more consideration, as > in whether this proposal to encode things statically would be able to > accommodate such cases in the future. To me it seems pretty hard (and > unless I missed something, it already won't work for inner_maps case > without requiring all to use the same allocator). What use case are we walking about? So far I hear 'ohh it will be limiting', but nothing concrete. 
> > We might actually be able to do a hybrid of the options by utilizing > the statically known allocator info to acquire references and runtime > object counts, which may help eliminate/delay the actual cost we pay > for it - the atomic upgrade, when initial reference goes away. > > So I think I lean towards option 1 as well, and then the same order as > Delyan. It seems to cover all kinds of corner cases (allocator known > vs unknown, normal vs inner maps, etc.) we've been going over in this > thread, and would also be future proof in terms of permitting > unforeseen patterns of usage. These "unforeseen patterns" sounds as "lets overdesign now because we cannot predict the future". > > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is > > more than we need. Ideally it would be easy to specifiy one single > > allocator for all maps and progs in a set of .c files. Sort-of a bpf package. > > In other words one bpf allocator per bpf "namespace" is more than enough. > > Program authors shouldn't be creating allocators left and right. All these > > free lists will waste memory. > > btw I've added an extra patch to bpf_mem_alloc series: > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1 > > that removes kmem_cache usage. > > Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea. > > When free_lists are not shared they will similarly waste memory. > > In user space the C code just does malloc() and the memory is isolated per process. > > Ideally in bpf world the programs would just do: > > bpf_mem_alloc(btf_type_id_local(struct foo)); > > without specifying an allocator, but that would require one global allocator > > for all bpf programs in the kernel which is probably not a direction we should go ? > > So the programs have to specify an allocator to use in bpf_mem_alloc(), > > but it should be one for all progs, maps in a bpf-package/set/namespace.| > > But "all progs" doesn't mean all of them are running in the same > context and having the same kinds of memory allocation requirements, > right? While having too many allocators is also bad, having just one > single one per package would also be limiting. A bpf object/package is > very different from a userspace process. So the analogy doesn't > exactly fit IMO. It's not exact fit, for sure. bpf progs are more analogous to kernel modules. The modules just do kmalloc. The more we discuss the more I'm leaning towards the same model as well: Just one global allocator for all bpf progs.
On Wed, 2022-08-31 at 11:57 -0700, Alexei Starovoitov wrote: > On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote: > > > > Overall, this design (or maybe the way it's presented here) conflates a few ideas. > > > > 1) The extensions to expose and customize map's internal element allocator are fine > > independently of even this patchset. > > > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is > > taken as an axiom, and then expanded to its extreme (single allocator per map as > > opposed to the smarter verifier schemes). I still contest that this is not the case > > and the runtime overhead it avoids is paid back in bad developer experience multiple > > times over. > > > > 3) The idea that allocators can be merged between elements and kptrs is independent > > of the static requirements. If the map's internal allocator is exposed per 1), we can > > still use it to allocate kptrs but not require that all kptrs in a map are from the > > same allocator. > > > > Going this coarse in the API is easy for us but fundamentally more limiting for > > developers. It's not hard to imagine situations where the verifier dependency > > tracking or runtime lifetime tracking would allow for pinned maps to be retained but > > this scheme would require new maps entirely. (Any situation where you just refactored > > the implicit allocator out to share it, for example) > > > > I also don't think that simplicity for us (a one time implementation cost + > > continuous maintenance cost) trumps over long term developer experience (a much > > bigger implementation cost over a much bigger time span). > > It feels we're thinking about scope and use cases for the allocator quite > differently and what you're seeing as 'limiting developer choices' to me looks > like 'not a limiting issue at all'. To me the allocator is one > jemalloc/tcmalloc instance. One user space application with multiple threads, > lots of maps and code is using exactly one such allocator. The allocator > manages all the memory of user space process. In bpf land we don't have a bpf > process. We don't have a bpf name space either. A loose analogy would be a set > of programs and maps managed by one user space agent. The bpf allocator would > manage all the memory of these maps and programs and provide a "memory namespace" > for this set of programs. Another user space agent with its own programs > would never want to share the same allocator. In user space a chunk of memory > could be mmap-ed between different process to share the data, but you would never > put a tcmalloc over such memory to be an allocator for different processes. > > More below. > > > So far, my ranked choice vote is: > > > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I > > like the merging) > > 2) medium flexibility with per-field allocator tracking in the verifier and the > > ability to lose the association once programs are unloaded and values are gone. This > > also works better with exposed allocators since they are implicitly pinned and would > > be usable to store values in another map. > > 3) minimum flexibility with static whole-map kptr allocators > > The option 1 flexibility is necessary when allocator is seen as a private pool > of objects of given size. Like kernel's kmem_cache instance.
> I don't think we quite there yet. If we're not there, we should aim to get there :) > There is a need to "preallocate this object from sleepable context, > so the prog has a guaranteed chunk of memory to use in restricted context", > but, arguably, it's not a job of bpf allocator. Leaving it to the programs is worse for memory usage (discussed below). > bpf prog can allocate an object, stash it into kptr, and use it later. Given that tracing programs can't really maintain their own freelists safely (I think they're missing the building blocks - you can't cmpxchg kptrs), I do feel like isolated allocators are a requirement here. Without them, allocations can fail and there's no way to write a reliable program. *If* we ensure that you can build a usable freelist out of allocator-backed memory for (a set of) nmi programs, then I can maybe get behind this (but there's other reasons not to do this). > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is > more than we need. Ideally it would be easy to specifiy one single > allocator for all maps and progs in a set of .c files. Sort-of a bpf package. > In other words one bpf allocator per bpf "namespace" is more than enough. _Potentially_. Programs need to know that when they reserved X objects, they'll have them available at a later time and any sharing with other programs can remove this property. A _set_ of programs can in theory determine the right prefill levels, but this is certainly easier to reason about on a per-program basis, given that programs will run at different rates. > Program authors shouldn't be creating allocators left and right. All these > free lists will waste memory. > btw I've added an extra patch to bpf_mem_alloc series: > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1 > that removes kmem_cache usage. > Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea. > When free_lists are not shared they will similarly waste memory. > In user space the C code just does malloc() and the memory is isolated per process. > Ideally in bpf world the programs would just do: > bpf_mem_alloc(btf_type_id_local(struct foo)); > without specifying an allocator, but that would require one global allocator > for all bpf programs in the kernel which is probably not a direction we should go ? Why does it require a global allocator? For example, you can have each program have its own internal allocator and with runtime live counts, this API is very achievable. Once the program unloads, you can drain the freelists, so most allocator memory does not have to live as long as the longest-lived object from that allocator. In addition, all allocators can share a global freelist too, so chunks released after the program unloads get a chance to be reused. > So the programs have to specify an allocator to use in bpf_mem_alloc(), > but it should be one for all progs, maps in a bpf-package/set/namespace. > If it's easy for programs to specify a bunch of allocators, like one for each program, > or one for each btf_type_id the bpf kernel infra would be required to merge > these allocators from day one. How is having one allocator per program different from having one allocator per set of programs, with per-program bpf-side freelists? The requirement that some (most?) programs need deterministic access to their freelists is still there, no matter the number of allocators. 
If we fear that the default freelist behavior will waste memory, then the defaults need to be aggressively conservative, with programs being able to adjust them. Besides, if we punt the freelists to bpf, then we get absolutely no control over the memory usage, which is strictly worse for us (and worse developer experience on top). > (The profileration of kmem_cache-s in the past > forced merging of them). By restricting bpf program choices with allocator-per-map > (this option 3) we're not only making the kernel side to do less work > (no run-time ref counts, no merging is required today), we're also pushing > bpf progs to use memory concious choices. This is conflating "there needs to be a limit on memory stuck in freelists" with "you can only store kptrs from one allocator in each map." The former practically advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree with the latter - it's coming strictly from the desire to have static mappings between object storage and allocators; it's not coming from a memory usage need, it only avoids runtime live object counts. > Having said all that maybe one global allocator is not such a bad idea. It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however, think there is value in having precise and conservative freelist policies, along with a global freelist for overflow and draining after program unload. The latter would allow us to share memory between allocators without sacrificing per-allocator freelist determinism, especially if paired with very static (but configurable) freelist thresholds. -- Delyan
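One way to picture the layering argued for here — per-allocator freelists with fixed, prefill-set thresholds, plus one shared overflow list that absorbs elements when an allocator's owner goes away — is the rough kernel-side sketch below. It is not the code in memalloc.c; all demo_* names and fields are invented for illustration.

#include <linux/llist.h>
#include <linux/percpu.h>

/* Illustrative only: per-allocator per-cpu cache with static thresholds,
 * fronting a single kernel-wide overflow list.
 */
struct demo_cache {
	struct llist_head free;		/* this allocator's per-cpu freelist */
	int cnt;			/* elements currently cached */
	int low_wm, high_wm;		/* thresholds fixed at prefill time */
};

struct demo_alloc {
	struct demo_cache __percpu *cache;
};

/* Shared overflow list: elements drained here on allocator destruction
 * can be reused by everyone else instead of sitting idle.
 */
static LLIST_HEAD(demo_global_overflow);

static void demo_drain_to_global(struct demo_alloc *a)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct demo_cache *c = per_cpu_ptr(a->cache, cpu);
		struct llist_node *n, *t;

		llist_for_each_safe(n, t, llist_del_all(&c->free))
			llist_add(n, &demo_global_overflow);
	}
}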
On Wed, 31 Aug 2022 at 23:02, Delyan Kratunov <delyank@fb.com> wrote: > > On Wed, 2022-08-31 at 11:57 -0700, Alexei Starovoitov wrote: > > !-------------------------------------------------------------------| > > This Message Is From an External Sender > > > > > -------------------------------------------------------------------! > > > > On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote: > > > > > > Overall, this design (or maybe the way it's presented here) conflates a few ideas. > > > > > > 1) The extensions to expose and customize map's internal element allocator are fine > > > independently of even this patchset. > > > > > > 2) The idea that kptrs in a map need to have a statically identifiable allocator is > > > taken as an axiom, and then expanded to its extreme (single allocator per map as > > > opposed to the smarter verifier schemes). I still contest that this is not the case > > > and the runtime overhead it avoids is paid back in bad developer experience multiple > > > times over. > > > > > > 3) The idea that allocators can be merged between elements and kptrs is independent > > > of the static requirements. If the map's internal allocator is exposed per 1), we can > > > still use it to allocate kptrs but not require that all kptrs in a map are from the > > > same allocator. > > > > > > Going this coarse in the API is easy for us but fundamentally more limiting for > > > developers. It's not hard to imagine situations where the verifier dependency > > > tracking or runtime lifetime tracking would allow for pinned maps to be retained but > > > this scheme would require new maps entirely. (Any situation where you just refactored > > > the implicit allocator out to share it, for example) > > > > > > I also don't think that simplicity for us (a one time implementation cost + > > > continuous maintenance cost) trumps over long term developer experience (a much > > > bigger implementation cost over a much bigger time span). > > > > It feels we're thinking about scope and use cases for the allocator quite > > differently and what you're seeing as 'limiting developer choices' to me looks > > like 'not a limiting issue at all'. To me the allocator is one > > jemalloc/tcmalloc instance. One user space application with multiple threads, > > lots of maps and code is using exactly one such allocator. The allocator > > manages all the memory of user space process. In bpf land we don't have a bpf > > process. We don't have a bpf name space either. A loose analogy would be a set > > of programs and maps managed by one user space agent. The bpf allocator would > > manage all the memory of these maps and programs and provide a "memory namespace" > > for this set of programs. Another user space agent with its own programs > > would never want to share the same allocator. In user space a chunk of memory > > could be mmap-ed between different process to share the data, but you would never > > put a tcmalloc over such memory to be an allocator for different processes. > > > > More below. > > > > > So far, my ranked choice vote is: > > > > > > 1) maximum flexibility and runtime live object counts (with exposed allocators, I > > > like the merging) > > > 2) medium flexibility with per-field allocator tracking in the verifier and the > > > ability to lose the association once programs are unloaded and values are gone. This > > > also works better with exposed allocators since they are implicitly pinned and would > > > be usable to store values in another map. 
> > > 3) minimum flexibility with static whole-map kptr allocators > > > > The option 1 flexibility is necessary when allocator is seen as a private pool > > of objects of given size. Like kernel's kmem_cache instance. > > I don't think we quite there yet. > > If we're not there, we should aim to get there :) > > > There is a need to "preallocate this object from sleepable context, > > so the prog has a guaranteed chunk of memory to use in restricted context", > > but, arguably, it's not a job of bpf allocator. > > Leaving it to the programs is worse for memory usage (discussed below). > > > bpf prog can allocate an object, stash it into kptr, and use it later. > > Given that tracing programs can't really maintain their own freelists safely (I think > they're missing the building blocks - you can't cmpxchg kptrs), I do feel like > isolated allocators are a requirement here. Without them, allocations can fail and > there's no way to write a reliable program. > > *If* we ensure that you can build a usable freelist out of allocator-backed memory > for (a set of) nmi programs, then I can maybe get behind this (but there's other > reasons not to do this). > > > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is > > more than we need. Ideally it would be easy to specifiy one single > > allocator for all maps and progs in a set of .c files. Sort-of a bpf package. > > In other words one bpf allocator per bpf "namespace" is more than enough. > > _Potentially_. Programs need to know that when they reserved X objects, they'll have > them available at a later time and any sharing with other programs can remove this > property. A _set_ of programs can in theory determine the right prefill levels, but > this is certainly easier to reason about on a per-program basis, given that programs > will run at different rates. > > > Program authors shouldn't be creating allocators left and right. All these > > free lists will waste memory. > > btw I've added an extra patch to bpf_mem_alloc series: > > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1 > > that removes kmem_cache usage. > > Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea. > > When free_lists are not shared they will similarly waste memory. > > In user space the C code just does malloc() and the memory is isolated per process. > > Ideally in bpf world the programs would just do: > > bpf_mem_alloc(btf_type_id_local(struct foo)); > > without specifying an allocator, but that would require one global allocator > > for all bpf programs in the kernel which is probably not a direction we should go ? > > Why does it require a global allocator? For example, you can have each program have > its own internal allocator and with runtime live counts, this API is very achievable. > Once the program unloads, you can drain the freelists, so most allocator memory does > not have to live as long as the longest-lived object from that allocator. In > addition, all allocators can share a global freelist too, so chunks released after > the program unloads get a chance to be reused. > > > So the programs have to specify an allocator to use in bpf_mem_alloc(), > > but it should be one for all progs, maps in a bpf-package/set/namespace. > > If it's easy for programs to specify a bunch of allocators, like one for each program, > > or one for each btf_type_id the bpf kernel infra would be required to merge > > these allocators from day one. 
> > How is having one allocator per program different from having one allocator per set > of programs, with per-program bpf-side freelists? The requirement that some (most?) > programs need deterministic access to their freelists is still there, no matter the > number of allocators. If we fear that the default freelist behavior will waste > memory, then the defaults need to be aggressively conservative, with programs being > able to adjust them. > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the > memory usage, which is strictly worse for us (and worse developer experience on top). > > > (The profileration of kmem_cache-s in the past > > forced merging of them). By restricting bpf program choices with allocator-per-map > > (this option 3) we're not only making the kernel side to do less work > > (no run-time ref counts, no merging is required today), we're also pushing > > bpf progs to use memory concious choices. > > This is conflating "there needs to be a limit on memory stuck in freelists" with "you > can only store kptrs from one allocator in each map." The former practically > advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree > with the latter - it's coming strictly from the desire to have static mappings > between object storage and allocators; it's not coming from a memory usage need, it > only avoids runtime live object counts. > > > Having said all that maybe one global allocator is not such a bad idea. > > It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however, > think there is value in having precise and conservative freelist policies, along with > a global freelist for overflow and draining after program unload. The latter would > allow us to share memory between allocators without sacrificing per-allocator > freelist determinism, especially if paired with very static (but configurable) > freelist thresholds. > These are all good points. Sharing an allocator between all programs means bpf_mem_prefill request cannot really guarantee much, it does hurt determinism. The prefilled items can be drained by some other program with an inconsistent allocation pattern. But going back to what Alexei replied in the other thread: > bpf progs are more analogous to kernel modules. > The modules just do kmalloc. > The more we discuss the more I'm leaning towards the same model as well: > Just one global allocator for all bpf progs. There does seem to be one big benefit in having a global allocator (not per program, but actually globally in the kernel, basically a percpu freelist cache fronting kmalloc) usable safely in any context. We don't have to do any allocator lifetime tracking at all, that case reduces to basically how we handle kernel kptrs currently. I am wondering if we can go with such an approach: by default, the global allocator in the kernel serves bpf_mem_alloc requests, which allows freelist sharing between all programs, it is basically kmalloc but safe in NMI and with reentrancy protection. When determinism is needed, use the percpu refcount approach with option 1 from Delyan for the custom allocator case. 
Now by default you have conservative global freelist sharing (percpu), and when required, a program can use a custom allocator and prefill it to keep the cache ready to serve requests; its own allocator freelist will be unaffected by other allocations. That kind of control will be very useful for progs in NMI/hardirq context, where depletion of the cache means NULL from unit_alloc. Any kptr from the bpf_mem_alloc allocator can go into any map, no problem at all. The only extra cost is maintaining the percpu live counts for non-global allocators; it is basically free for the global case. It would probably also be fine to choose and share allocators between maps, as proposed by Alexei before. That has no effect on the kptrs being stored in them, as most commonly they would come from the global allocator. Thoughts on this?
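The "percpu live counts" with an atomic fallback once the initial reference goes away map closely onto the kernel's existing percpu_ref machinery. A rough, illustrative sketch of that idea (not code from this series; all demo_* names are invented):

#include <linux/percpu-refcount.h>
#include <linux/slab.h>

/* Illustrative sketch: one percpu_ref per non-global allocator counts
 * live objects cheaply while the owner is alive and switches to a
 * single atomic counter only after the initial reference is killed.
 */
struct demo_allocator {
	struct percpu_ref live_objs;	/* one get per outstanding object */
	/* ... caches, watermarks ... */
};

static void demo_allocator_release(struct percpu_ref *ref)
{
	struct demo_allocator *a = container_of(ref, struct demo_allocator, live_objs);

	kfree(a);			/* last outstanding object returned */
}

static struct demo_allocator *demo_allocator_create(void)
{
	struct demo_allocator *a = kzalloc(sizeof(*a), GFP_KERNEL);

	if (!a)
		return NULL;
	if (percpu_ref_init(&a->live_objs, demo_allocator_release, 0, GFP_KERNEL)) {
		kfree(a);
		return NULL;
	}
	return a;
}

/* alloc/free fast paths only touch a per-cpu counter */
static void demo_note_alloc(struct demo_allocator *a) { percpu_ref_get(&a->live_objs); }
static void demo_note_free(struct demo_allocator *a)  { percpu_ref_put(&a->live_objs); }

/* Owner (map or prog) goes away: drop the initial ref.  The allocator
 * itself is freed only once every object allocated from it is freed.
 */
static void demo_allocator_detach(struct demo_allocator *a)
{
	percpu_ref_kill(&a->live_objs);
}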
On Thu, Sep 01, 2022 at 12:32:33AM +0200, Kumar Kartikeya Dwivedi wrote: > > bpf progs are more analogous to kernel modules. > > The modules just do kmalloc. > > The more we discuss the more I'm leaning towards the same model as well: > > Just one global allocator for all bpf progs. > > There does seem to be one big benefit in having a global allocator > (not per program, but actually globally in the kernel, basically a > percpu freelist cache fronting kmalloc) usable safely in any context. > We don't have to do any allocator lifetime tracking at all, that case > reduces to basically how we handle kernel kptrs currently. > > I am wondering if we can go with such an approach: by default, the > global allocator in the kernel serves bpf_mem_alloc requests, which > allows freelist sharing between all programs, it is basically kmalloc > but safe in NMI and with reentrancy protection. Right. That's what I was proposing. > When determinism is > needed, use the percpu refcount approach with option 1 from Delyan for > the custom allocator case. I wasn't rejecting that part. I was suggesting to table that discussion. The best way to achieve guaranteed allocation is still an open question. So far we've only talked about a new map type with an "allocator" type... Is this really the best design? > Now by default you have conservative global freelist sharing (percpu), > and when required program can use a custom allocator and prefill to > keep the cache ready to serve requests (where that kind of control > will be very useful for progs in NMI/hardirq context, where depletion > of cache means NULL from unit_alloc), where its own allocator freelist > will be unaffected by other allocations. The custom allocator is not necessarily the right answer. It could be. Maybe it should be an open-coded free list of preallocated items that the bpf prog takes from the global allocator and pushes onto a list? We'll have locks and native link lists in bpf soon. So why should the "allocator" concept do the double job of allocating and keeping a link list for prefill reasons? > Any kptr from the bpf_mem_alloc allocator can go to any map, no problem at all. > The only extra cost is maintaining the percpu live counts for > non-global allocators, it is basically free for the global case. > And it would also be allowed to probably choose and share allocators > between maps, as proposed by Alexei before. That has no effect on > kptrs being stored in them, as most commonly they would be from the > global allocator. It still feels to me that doing only the global allocator for now will be good enough. The prefill use case for one element can already be solved without any extra work (just kptr_xchg in and out). Prefill of multiple objects might get nicely solved with native link lists too.
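The "kptr_xchg in and out" pattern can already be sketched with today's referenced-kptr support; bpf_kptr_xchg() is a real helper, while the allocation/free kfuncs below are hypothetical stand-ins for whatever prog-facing interface eventually emerges, and the __kptr_ref tag is spelled the way the selftests of this era spell it.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Referenced-kptr field tag. */
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

struct foo { long payload; };

/* Hypothetical kfuncs standing in for a prog-facing bpf_mem_alloc. */
extern struct foo *bpf_mem_alloc_foo(void) __ksym;
extern void bpf_mem_free_foo(struct foo *p) __ksym;

struct stash_val {
	struct foo __kptr_ref *obj;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct stash_val);
} stash SEC(".maps");

/* Safe context (user-triggered): allocate and park one object. */
SEC("syscall")
int refill(void *ctx)
{
	int key = 0;
	struct stash_val *v = bpf_map_lookup_elem(&stash, &key);
	struct foo *old, *p;

	if (!v)
		return 0;
	p = bpf_mem_alloc_foo();
	if (!p)
		return 0;
	old = bpf_kptr_xchg(&v->obj, p);	/* stash; returns previous value */
	if (old)
		bpf_mem_free_foo(old);
	return 0;
}

/* Restricted context (e.g. NMI): consume the reserved object, never allocate. */
SEC("perf_event")
int consume(void *ctx)
{
	int key = 0;
	struct stash_val *v = bpf_map_lookup_elem(&stash, &key);
	struct foo *p;

	if (!v)
		return 0;
	p = bpf_kptr_xchg(&v->obj, NULL);
	if (!p)
		return 0;			/* reserve already used up */
	/* ... use p ... */
	bpf_mem_free_foo(p);
	return 0;
}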
On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote: > Given that tracing programs can't really maintain their own freelists safely (I think > they're missing the building blocks - you can't cmpxchg kptrs), Today? yes, but soon we will have link lists supported natively. > I do feel like > isolated allocators are a requirement here. Without them, allocations can fail and > there's no way to write a reliable program. Completely agree that there should be a way for programs to guarantee availability of the element. Inline allocation can fail regardless whether allocation pool is shared by multiple programs or a single program owns an allocator. In that sense, allowing multiple programs to create an instance of an allocator doesn't solve this problem. Short free list inside bpf_mem_cache is an implementation detail. "prefill to guarantee successful alloc" is a bit out of scope of an allocator. "allocate a set and stash it" should be a separate building block available to bpf progs when step "allocate" can fail and efficient "stash it" can probably be done on top of the link list. > *If* we ensure that you can build a usable freelist out of allocator-backed memory > for (a set of) nmi programs, then I can maybe get behind this (but there's other > reasons not to do this). Agree that nmi adds another quirk to "stash it" step. If native link list is not going to work then something else would have to be designed. > > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is > > more than we need. Ideally it would be easy to specifiy one single > > allocator for all maps and progs in a set of .c files. Sort-of a bpf package. > > In other words one bpf allocator per bpf "namespace" is more than enough. > > _Potentially_. Programs need to know that when they reserved X objects, they'll have > them available at a later time and any sharing with other programs can remove this > property. Agree. > A _set_ of programs can in theory determine the right prefill levels, but > this is certainly easier to reason about on a per-program basis, given that programs > will run at different rates. Agree as well. > Why does it require a global allocator? For example, you can have each program have > its own internal allocator and with runtime live counts, this API is very achievable. > Once the program unloads, you can drain the freelists, so most allocator memory does > not have to live as long as the longest-lived object from that allocator. In > addition, all allocators can share a global freelist too, so chunks released after > the program unloads get a chance to be reused. All makes sense to me except that the kernel can provide that global allocator and per-program "allocators" can hopefully be implemented as native bpf code. > How is having one allocator per program different from having one allocator per set > of programs, with per-program bpf-side freelists? The requirement that some (most?) > programs need deterministic access to their freelists is still there, no matter the > number of allocators. If we fear that the default freelist behavior will waste > memory, then the defaults need to be aggressively conservative, with programs being > able to adjust them. I think the disagreement here is that per-prog allocator based on bpf_mem_alloc isn't going to be any more deterministic than one global bpf_mem_alloc for all progs. Per-prog short free list of ~64 elements vs global free list of ~64 elements. 
In both cases these lists will have to do irq_work and refill out of global slabs. > Besides, if we punt the freelists to bpf, then we get absolutely no control over the > memory usage, which is strictly worse for us (and worse developer experience on top). I don't understand this point. All allocations are still coming out of bpf_mem_alloc. We can have debug mode with memleak support and other debug mechanisms. > > (The profileration of kmem_cache-s in the past > > forced merging of them). By restricting bpf program choices with allocator-per-map > > (this option 3) we're not only making the kernel side to do less work > > (no run-time ref counts, no merging is required today), we're also pushing > > bpf progs to use memory concious choices. > > This is conflating "there needs to be a limit on memory stuck in freelists" with "you > can only store kptrs from one allocator in each map." The former practically > advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree > with the latter - it's coming strictly from the desire to have static mappings > between object storage and allocators; it's not coming from a memory usage need, it > only avoids runtime live object counts. > > > Having said all that maybe one global allocator is not such a bad idea. > > It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however, > think there is value in having precise and conservative freelist policies, along with > a global freelist for overflow and draining after program unload. The latter would > allow us to share memory between allocators without sacrificing per-allocator > freelist determinism, especially if paired with very static (but configurable) > freelist thresholds. What is 'freelist determinism' ? Are you talking about some other freelist on top of bpf_mem_alloc's free lists ?
On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote: > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote: > > Given that tracing programs can't really maintain their own freelists safely (I think > > they're missing the building blocks - you can't cmpxchg kptrs), > > Today? yes, but soon we will have link lists supported natively. > > > I do feel like > > isolated allocators are a requirement here. Without them, allocations can fail and > > there's no way to write a reliable program. > > Completely agree that there should be a way for programs > to guarantee availability of the element. > Inline allocation can fail regardless whether allocation pool > is shared by multiple programs or a single program owns an allocator. I'm not sure I understand this point. If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs for struct foo to succeed. In what situations can this fail if I'm the only program using it _and_ I've calculated the prefill amount correctly? Unless you're saying that the prefill wouldn't adjust the freelist limits, in which case, I think that's a mistake - prefill should effectively _set_ the freelist limits. > In that sense, allowing multiple programs to create an instance > of an allocator doesn't solve this problem. > Short free list inside bpf_mem_cache is an implementation detail. > "prefill to guarantee successful alloc" is a bit out of scope > of an allocator. I disagree that it's out of scope. This is the only access to dynamic memory from a bpf program, it makes sense that it covers the requirements of bpf programs, including prefill/freelist behavior, so all programs can safely use it. > "allocate a set and stash it" should be a separate building block > available to bpf progs when step "allocate" can fail and > efficient "stash it" can probably be done on top of the link list. Do you imagine a BPF object that every user has to link into their programs (yuck), or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things this way. If it's distributed separately, I think that's an unexpected burden on developers (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool skels). There are no other bpf features that require a userspace support library like this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful without a library) > > > *If* we ensure that you can build a usable freelist out of allocator-backed memory > > for (a set of) nmi programs, then I can maybe get behind this (but there's other > > reasons not to do this). > > Agree that nmi adds another quirk to "stash it" step. > If native link list is not going to work then something > else would have to be designed. Given that I'll be making the deferred work mechanism on top of this allocator, we definitely need to cover nmi usage cleanly. It would be almost unusable to have your deferred work fail to submit randomly. > > > > So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is > > > more than we need. Ideally it would be easy to specifiy one single > > > allocator for all maps and progs in a set of .c files. Sort-of a bpf package. > > > In other words one bpf allocator per bpf "namespace" is more than enough. > > > > _Potentially_.
Programs need to know that when they reserved X objects, they'll have > > them available at a later time and any sharing with other programs can remove this > > property. > > Agree. > > > A _set_ of programs can in theory determine the right prefill levels, but > > this is certainly easier to reason about on a per-program basis, given that programs > > will run at different rates. > > Agree as well. > > > Why does it require a global allocator? For example, you can have each program have > > its own internal allocator and with runtime live counts, this API is very achievable. > > Once the program unloads, you can drain the freelists, so most allocator memory does > > not have to live as long as the longest-lived object from that allocator. In > > addition, all allocators can share a global freelist too, so chunks released after > > the program unloads get a chance to be reused. > > All makes sense to me except that the kernel can provide that > global allocator and per-program "allocators" can hopefully be > implemented as native bpf code. > > > How is having one allocator per program different from having one allocator per set > > of programs, with per-program bpf-side freelists? The requirement that some (most?) > > programs need deterministic access to their freelists is still there, no matter the > > number of allocators. If we fear that the default freelist behavior will waste > > memory, then the defaults need to be aggressively conservative, with programs being > > able to adjust them. > > I think the disagreement here is that per-prog allocator based > on bpf_mem_alloc isn't going to be any more deterministic than > one global bpf_mem_alloc for all progs. > Per-prog short free list of ~64 elements vs > global free list of ~64 elements. Right, I think I had a hidden assumption here that we've exposed. Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be 1000 struct foos on the freelist and the freelist thresholds would be adjusted accordingly (i.e., you can free all of them and allocate them again, all from the freelist). Ultimately, that's what nmi programs actually need but I see why that's not an obvious behavior. > In both cases these lists will have to do irq_work and refill > out of global slabs. If a tracing program needs irq_work to refill, then hasn't the API already failed the program writer? I'd have to remind myself how irq_work actually works but given that it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before irq_work has a chance to refill it. I don't see how you'd write reliable programs this way. > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the > > memory usage, which is strictly worse for us (and worse developer experience on top). > > I don't understand this point. > All allocations are still coming out of bpf_mem_alloc. > We can have debug mode with memleak support and other debug > mechanisms. I mostly mean accounting here. If we segment the allocated objects by finer-grained allocators, we can attribute them to individual programs better. But, I agree, this can be implemented in other ways, it can just be counts/tables on bpf_prog. > > > > (The profileration of kmem_cache-s in the past > > > forced merging of them). 
By restricting bpf program choices with allocator-per-map > > > (this option 3) we're not only making the kernel side to do less work > > > (no run-time ref counts, no merging is required today), we're also pushing > > > bpf progs to use memory concious choices. > > > > This is conflating "there needs to be a limit on memory stuck in freelists" with "you > > can only store kptrs from one allocator in each map." The former practically > > advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree > > with the latter - it's coming strictly from the desire to have static mappings > > between object storage and allocators; it's not coming from a memory usage need, it > > only avoids runtime live object counts. > > > > > Having said all that maybe one global allocator is not such a bad idea. > > > > It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however, > > think there is value in having precise and conservative freelist policies, along with > > a global freelist for overflow and draining after program unload. The latter would > > allow us to share memory between allocators without sacrificing per-allocator > > freelist determinism, especially if paired with very static (but configurable) > > freelist thresholds. > > What is 'freelist determinism' ? The property that prefill keeps all objects on the freelist, so the following sequence doesn't observe allocation failures: bpf_mem_prefill(1000, struct foo); run_1000_times { alloc(struct foo); } run_1000_times { free(struct foo); } run_1000_times { alloc(struct foo); } alloc(struct foo) // this can observe a failure > Are you talking about some other freelist on top of bpf_mem_alloc's > free lists ? Well, that's the question, isn't it? I think it should be part of the bpf kernel ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's instantiated per-program, that's easiest to reason about. -- Delyan
On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote: > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote: > > !-------------------------------------------------------------------| > > This Message Is From an External Sender > > > > > -------------------------------------------------------------------! > > > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote: > > > Given that tracing programs can't really maintain their own freelists safely (I think > > > they're missing the building blocks - you can't cmpxchg kptrs), > > > > Today? yes, but soon we will have link lists supported natively. > > > > > I do feel like > > > isolated allocators are a requirement here. Without them, allocations can fail and > > > there's no way to write a reliable program. > > > > Completely agree that there should be a way for programs > > to guarantee availability of the element. > > Inline allocation can fail regardless whether allocation pool > > is shared by multiple programs or a single program owns an allocator. > > I'm not sure I understand this point. > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs > for struct foo to succeed. In what situations can this fail if I'm the only program > using it _and_ I've calculated the prefill amount correctly? > > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which > case, I think that's a mistake - prefill should effectively _set_ the freelist > limits. There is no prefill implementation today, so we're just guessing, but let's try. prefill would probably have to adjust high-watermark limit. That makes sense, but for how long? Should the watermark go back after time or after N objects were consumed? What prefill is going to do? Prefill on current cpu only ? but it doesn't help the prog to be deterministic in consuming them. Prefill on all cpu-s? That's possible, but for_each_possible_cpu() {irq_work_queue_on(cpu);} looks to be a significant memory and run-time overhead. When freelist is managed by the program it may contain just N elements that progs needs. > > In that sense, allowing multiple programs to create an instance > > of an allocator doesn't solve this problem. > > Short free list inside bpf_mem_cache is an implementation detail. > > "prefill to guarantee successful alloc" is a bit out of scope > > of an allocator. > > I disagree that it's out of scope. This is the only access to dynamic memory from a > bpf program, it makes sense that it covers the requirements of bpf programs, > including prefill/freelist behavior, so all programs can safely use it. > > > "allocate a set and stash it" should be a separate building block > > available to bpf progs when step "allocate" can fail and > > efficient "stash it" can probably be done on top of the link list. > > Do you imagine a BPF object that every user has to link into their programs (yuck), > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things > this way. I'm assuming Kumar's proposed list api: struct bpf_list_head head; struct bpf_list_node node; bpf_list_insert(&node, &head); will work out. > If it's distributed separately, I think that's an unexpected burden on developers > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool > skels). There are no other bpf features that require a userspace support library like > this. 
(USDT doesn't count, uprobes are the underlying bpf feature and that is useful > without a library) bpf progs must not pay for what they don't use. Hence all building blocks should be small. We will have libc-like bpf libraries with bigger blocks eventually. > > I think the disagreement here is that per-prog allocator based > > on bpf_mem_alloc isn't going to be any more deterministic than > > one global bpf_mem_alloc for all progs. > > Per-prog short free list of ~64 elements vs > > global free list of ~64 elements. > > Right, I think I had a hidden assumption here that we've exposed. > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be > 1000 struct foos on the freelist and the freelist thresholds would be adjusted > accordingly (i.e., you can free all of them and allocate them again, all from the > freelist). > > Ultimately, that's what nmi programs actually need but I see why that's not an > obvious behavior. How prefill is going to work is still to-be-designed. In addition to the current-cpu vs on-all-cpus question above... Will the prefill() helper just do irq_work? If so, then it doesn't help nmi and irq-disabled progs at all. A prefill helper working asynchronously doesn't guarantee availability of objects to the program later. prefill() becomes a hint and probably useless as such. So it should probably be synchronous and fail when in-nmi or in-irq? But a bpf prog cannot know its context, so the only safe synchronous prefill() would be from sleepable progs. > > In both cases these lists will have to do irq_work and refill > > out of global slabs. > > If a tracing program needs irq_work to refill, then hasn't the API already failed the > program writer? I'd have to remind myself how irq_work actually works but given that > it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before > irq_work has a chance to refill it. I don't see how you'd write reliable programs > this way. The only way an nmi prog can guarantee availability is if it allocates and reserves objects in a different non-nmi program. If everything runs in nmi there is nothing that can be done. > > > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the > > > memory usage, which is strictly worse for us (and worse developer experience on top). > > > > I don't understand this point. > > All allocations are still coming out of bpf_mem_alloc. > > We can have debug mode with memleak support and other debug > > mechanisms. > > I mostly mean accounting here. If we segment the allocated objects by finer-grained > allocators, we can attribute them to individual programs better. But, I agree, this > can be implemented in other ways, it can just be counts/tables on bpf_prog. Mem accounting is a whole different, huge and largely unsolved topic. The thread about memcg and bpf is still ongoing. The fine-grained bpf_mem_alloc isn't going to magically solve it. > > > > What is 'freelist determinism' ? > > The property that prefill keeps all objects on the freelist, so the following > sequence doesn't observe allocation failures: > > bpf_mem_prefill(1000, struct foo); > run_1000_times { alloc(struct foo); } > run_1000_times { free(struct foo); } > run_1000_times { alloc(struct foo); } > alloc(struct foo) // this can observe a failure We cannot evaluate the above until we answer the current-cpu vs on-all-cpus question and whether bpf_mem_prefill is sync or async. I still think designing prefill and guaranteed availability is out of scope of the allocator.
> > Are you talking about some other freelist on top of bpf_mem_alloc's > > free lists ? > > Well, that's the question, isn't it? I think it should be part of the bpf kernel > ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's > instantiated per-program, that's easiest to reason about. There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list is still a question. Bundling this feature together with an allocator feels artificial. In user space C you wouldn't combine tcmalloc with a custom free list. During the early days of bpf, bundling would totally have made sense, but right now we're able to create much smaller building blocks than in the past. I don't think we'll be adding any more new map types. We probably won't be adding any new program types either.
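If the link-list building blocks pan out, the "allocate a set and stash it" step could in principle be open-coded in the program itself. The sketch below is purely speculative: bpf_list_head/bpf_list_node/bpf_list_insert are only proposed in this thread, bpf_list_pop is an invented name for the matching removal operation, and bpf_mem_alloc() as a kfunc is hypothetical.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Speculative sketch of an open-coded reserve built on the proposed
 * list API; none of these kfuncs/types exist yet.
 */
struct elem {
	struct bpf_list_node node;	/* proposed type; first member on purpose */
	u64 data;
};

struct reserve {
	struct bpf_list_head head;	/* proposed type */
};

extern void *bpf_mem_alloc(__u32 btf_type_id) __ksym;		/* hypothetical */
extern void bpf_list_insert(struct bpf_list_node *n,
			    struct bpf_list_head *h) __ksym;	/* proposed */
extern struct bpf_list_node *bpf_list_pop(struct bpf_list_head *h) __ksym; /* invented */

/* Non-restricted prog: take N elements from the global allocator and park them. */
static __always_inline void prefill_reserve(struct reserve *r)
{
	for (int i = 0; i < 64; i++) {
		struct elem *e = bpf_mem_alloc(bpf_core_type_id_local(struct elem));

		if (!e)
			break;
		bpf_list_insert(&e->node, &r->head);
	}
}

/* NMI prog: pop from the private reserve instead of allocating inline. */
static __always_inline struct elem *take_from_reserve(struct reserve *r)
{
	struct bpf_list_node *n = bpf_list_pop(&r->head);

	return (struct elem *)n;	/* node is the first member of struct elem */
}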
On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote: > On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote: > > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote: > > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > Given that tracing programs can't really maintain their own freelists safely (I think > > > > they're missing the building blocks - you can't cmpxchg kptrs), > > > > > > Today? yes, but soon we will have link lists supported natively. > > > > > > > I do feel like > > > > isolated allocators are a requirement here. Without them, allocations can fail and > > > > there's no way to write a reliable program. > > > > > > Completely agree that there should be a way for programs > > > to guarantee availability of the element. > > > Inline allocation can fail regardless whether allocation pool > > > is shared by multiple programs or a single program owns an allocator. > > > > I'm not sure I understand this point. > > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs > > for struct foo to succeed. In what situations can this fail if I'm the only program > > using it _and_ I've calculated the prefill amount correctly? > > > > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which > > case, I think that's a mistake - prefill should effectively _set_ the freelist > > limits. > > There is no prefill implementation today, so we're just guessing, but let's try. Well, initial capacity was going to be part of the map API, so I always considered it part of the same work. > prefill would probably have to adjust high-watermark limit. > That makes sense, but for how long? Should the watermark go back after time > or after N objects were consumed? Neither, if you want your pool of objects to not vanish from under you. > What prefill is going to do? Prefill on current cpu only ? > but it doesn't help the prog to be deterministic in consuming them. > Prefill on all cpu-s? That's possible, > but for_each_possible_cpu() {irq_work_queue_on(cpu);} > looks to be a significant memory and run-time overhead. No, that's overkill imo, especially on 100+ core systems. I was imagining the allocator consuming the current cpu freelist first, then stealing from other cpus, and only if they are empty, giving up and scheduling irq_work. A little complex to implement but it's possible. It does require atomics everywhere though. > When freelist is managed by the program it may contain just N elements > that progs needs. > > > > In that sense, allowing multiple programs to create an instance > > > of an allocator doesn't solve this problem. > > > Short free list inside bpf_mem_cache is an implementation detail. > > > "prefill to guarantee successful alloc" is a bit out of scope > > > of an allocator. > > > > I disagree that it's out of scope. This is the only access to dynamic memory from a > > bpf program, it makes sense that it covers the requirements of bpf programs, > > including prefill/freelist behavior, so all programs can safely use it. > > > > > "allocate a set and stash it" should be a separate building block > > > available to bpf progs when step "allocate" can fail and > > > efficient "stash it" can probably be done on top of the link list. > > > > Do you imagine a BPF object that every user has to link into their programs (yuck), > > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things > > this way. 
> > I'm assuming Kumar's proposed list api: > struct bpf_list_head head; > struct bpf_list_node node; > bpf_list_insert(&node, &head); > > will work out. Given the assumed locking in that design, I don't see how it would help nmi programs tbh. This is list_head, we need llist_head, relatively speaking. > > > If it's distributed separately, I think that's an unexpected burden on developers > > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool > > skels). There are no other bpf features that require a userspace support library like > > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful > > without a library) > > bpf progs must not pay for what they don't use. Hence all building blocks should be > small. We will have libc-like bpf libraries with bigger blocks eventually. I'm not sure I understand how having the mechanism in helpers and managed by the kernel is paying for something they don't use? > > > > I think the disagreement here is that per-prog allocator based > > > on bpf_mem_alloc isn't going to be any more deterministic than > > > one global bpf_mem_alloc for all progs. > > > Per-prog short free list of ~64 elements vs > > > global free list of ~64 elements. > > > > Right, I think I had a hidden assumption here that we've exposed. > > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be > > 1000 struct foos on the freelist and the freelist thresholds would be adjusted > > accordingly (i.e., you can free all of them and allocate them again, all from the > > freelist). > > > > Ultimately, that's what nmi programs actually need but I see why that's not an > > obvious behavior. > > How prefill is going to work is still to-be-designed. That's the new part for me, though - the maps design had a mechanism to specify initial capacity, and it worked for nmi programs. That's why I'm pulling on this thread, it's the hardest thing to get right _and_ it needs to exist before deferred work can be useful. > In addition to current-cpu vs on-all-cpu question above... > Will prefill() helper just do irq_work ? > If so then it doesn't help nmi and irq-disabled progs at all. > prefill helper working asynchronously doesn't guarantee availability of objects > later to the program. > prefill() becomes a hint and probably useless as such. Agreed. > So it should probably be synchronous and fail when in-nmi or in-irq? > But bpf prog cannot know its context, so only safe synchronous prefill() > would be out of sleepable progs. Initial maps capacity would've come from the syscall, so the program itself would not contain a prefill() call. We covered this in our initial discussions - I also think that requiring every perf_event program to setup a uprobe or syscall program to fill the object pool (internal or external) is also a bad design. If we're going for a global allocator, I suppose we could encode these requirements in BTF and satisfy them on program load? .alloc map with some predefined names or something? > > > > In both cases these lists will have to do irq_work and refill > > > out of global slabs. > > > > If a tracing program needs irq_work to refill, then hasn't the API already failed the > > program writer? I'd have to remind myself how irq_work actually works but given that > > it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before > > irq_work has a chance to refill it. I don't see how you'd write reliable programs > > this way. 
> > The only way nmi-prog can guarantee availability if it allocates and reserves > objects in a different non-nmi program. > If everything runs in nmi there is nothing can be done. See above, we were using the map load syscall to satisfy this before. We could probably do the same here but it's just documenting requirements as opposed to also introducing ownership/lifetime problems. > > > > > > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the > > > > memory usage, which is strictly worse for us (and worse developer experience on top). > > > > > > I don't understand this point. > > > All allocations are still coming out of bpf_mem_alloc. > > > We can have debug mode with memleak support and other debug > > > mechanisms. > > > > I mostly mean accounting here. If we segment the allocated objects by finer-grained > > allocators, we can attribute them to individual programs better. But, I agree, this > > can be implemented in other ways, it can just be counts/tables on bpf_prog. > > mem accounting is the whole different, huge and largely unsovled topic. > The thread about memcg and bpf is still ongoing. > The fine-grained bpf_mem_alloc isn't going to magically solve it. > > > > > > > What is 'freelist determinism' ? > > > > The property that prefill keeps all objects on the freelist, so the following > > sequence doesn't observe allocation failures: > > > > bpf_mem_prefill(1000, struct foo); > > run_1000_times { alloc(struct foo); } > > run_1000_times { free(struct foo); } > > run_1000_times { alloc(struct foo); } > > alloc(struct foo) // this can observe a failure > > we cannot evalute the above until we answer current-cpu vs on-all-cpus question > and whether bpf_mem_prefill is sync or async. > > I still think designing prefill and guaranteed availability is out of scope > of allocator. It was in the maps design on purpose though, I need it for deferred work to be useful (remember that build id EFAULT thread? only way to fix it for good requires that work submission never fails, which needs allocations from nmi to never fail). > > > > Are you talking about some other freelist on top of bpf_mem_alloc's > > > free lists ? > > > > Well, that's the question, isn't it? I think it should be part of the bpf kernel > > ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's > > instantiated per-program, that's easiest to reason about. > > There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list > is still a question. Bundling this feature together with an allocator feels artificial. > In user space C you wouldn't combine tcmalloc with custom free list. Userspace doesn't have nmi or need allocators that work from signal handlers, for a more appropriate analogy. We actually need this to work reliably from nmi, so we can shift work _away_ from nmi. If we didn't have this use case, I would've folded on the entire issue and kicked the can down the road (plenty of helpers don't work in nmi). -- Delyan > During early days of bpf bundling would totally make sense, but right now we're able to > create much smaller building blocks than in the past. I don't think we'll be adding > any more new map types. We probably won't be adding any new program types either.
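The list_head vs llist_head distinction drawn above is essentially about lock-freedom: the kernel's lockless list can be pushed to from NMI without taking a lock, which is what a never-fails work-submission path needs. A small illustrative sketch using the existing kernel llist/irq_work APIs (the struct and function names around them are invented):

#include <linux/llist.h>
#include <linux/irq_work.h>

/* Illustrative only: push work items from NMI with a lock-free llist,
 * drain them later from irq_work context.
 */
struct deferred_item {
	struct llist_node llnode;
	/* ... payload ... */
};

static LLIST_HEAD(pending);

static void drain_pending(struct irq_work *work)
{
	struct llist_node *list = llist_del_all(&pending);
	struct deferred_item *item, *tmp;

	llist_for_each_entry_safe(item, tmp, list, llnode) {
		/* ... process and free item ... */
	}
}

static DEFINE_IRQ_WORK(drain_work, drain_pending);

/* NMI side: one lock-free push; kick the drainer only when the list was empty. */
static void submit_from_nmi(struct deferred_item *item)
{
	if (llist_add(&item->llnode, &pending))
		irq_work_queue(&drain_work);
}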
On Fri, Sep 02, 2022 at 01:40:29AM +0000, Delyan Kratunov wrote: > On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote: > > On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote: > > > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote: > > > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > Given that tracing programs can't really maintain their own freelists safely (I think > > > > > they're missing the building blocks - you can't cmpxchg kptrs), > > > > > > > > Today? yes, but soon we will have link lists supported natively. > > > > > > > > > I do feel like > > > > > isolated allocators are a requirement here. Without them, allocations can fail and > > > > > there's no way to write a reliable program. > > > > > > > > Completely agree that there should be a way for programs > > > > to guarantee availability of the element. > > > > Inline allocation can fail regardless whether allocation pool > > > > is shared by multiple programs or a single program owns an allocator. > > > > > > I'm not sure I understand this point. > > > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs > > > for struct foo to succeed. In what situations can this fail if I'm the only program > > > using it _and_ I've calculated the prefill amount correctly? > > > > > > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which > > > case, I think that's a mistake - prefill should effectively _set_ the freelist > > > limits. > > > > There is no prefill implementation today, so we're just guessing, but let's try. > > Well, initial capacity was going to be part of the map API, so I always considered it > part of the same work. > > > prefill would probably have to adjust high-watermark limit. > > That makes sense, but for how long? Should the watermark go back after time > > or after N objects were consumed? > > Neither, if you want your pool of objects to not vanish from under you. > > > What prefill is going to do? Prefill on current cpu only ? > > but it doesn't help the prog to be deterministic in consuming them. > > Prefill on all cpu-s? That's possible, > > but for_each_possible_cpu() {irq_work_queue_on(cpu);} > > looks to be a significant memory and run-time overhead. > > No, that's overkill imo, especially on 100+ core systems. > I was imagining the allocator consuming the current cpu freelist first, then stealing > from other cpus, and only if they are empty, giving up and scheduling irq_work. stealing from other cpus?! That's certainly out of scope for bpf_mem_alloc as it's implemented. Stealing from other cpus would require a redesign. > A little complex to implement but it's possible. It does require atomics everywhere > though. atomic everywhere and many more weeks of thinking and debugging. kernel/bpf/percpu_freelist.c does stealing from other cpus and it wasn't trivial to do. > > > When freelist is managed by the program it may contain just N elements > > that progs needs. > > > > > > In that sense, allowing multiple programs to create an instance > > > > of an allocator doesn't solve this problem. > > > > Short free list inside bpf_mem_cache is an implementation detail. > > > > "prefill to guarantee successful alloc" is a bit out of scope > > > > of an allocator. > > > > > > I disagree that it's out of scope. 
This is the only access to dynamic memory from a > > > bpf program, it makes sense that it covers the requirements of bpf programs, > > > including prefill/freelist behavior, so all programs can safely use it. > > > > > > > "allocate a set and stash it" should be a separate building block > > > > available to bpf progs when step "allocate" can fail and > > > > efficient "stash it" can probably be done on top of the link list. > > > > > > Do you imagine a BPF object that every user has to link into their programs (yuck), > > > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things > > > this way. > > > > I'm assuming Kumar's proposed list api: > > struct bpf_list_head head; > > struct bpf_list_node node; > > bpf_list_insert(&node, &head); > > > > will work out. > > Given the assumed locking in that design, I don't see how it would help nmi programs > tbh. This is list_head, we need llist_head, relatively speaking. Of course. bpf-native link list could be per-cpu and based on llist. bpf_list vs bpf_llist. SMOP :) > > > > > > If it's distributed separately, I think that's an unexpected burden on developers > > > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool > > > skels). There are no other bpf features that require a userspace support library like > > > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful > > > without a library) > > > > bpf progs must not pay for what they don't use. Hence all building blocks should be > > small. We will have libc-like bpf libraries with bigger blocks eventually. > > I'm not sure I understand how having the mechanism in helpers and managed by the > kernel is paying for something they don't use? every feature adds up.. like stealing from cpus. > > > > > > I think the disagreement here is that per-prog allocator based > > > > on bpf_mem_alloc isn't going to be any more deterministic than > > > > one global bpf_mem_alloc for all progs. > > > > Per-prog short free list of ~64 elements vs > > > > global free list of ~64 elements. > > > > > > Right, I think I had a hidden assumption here that we've exposed. > > > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be > > > 1000 struct foos on the freelist and the freelist thresholds would be adjusted > > > accordingly (i.e., you can free all of them and allocate them again, all from the > > > freelist). > > > > > > Ultimately, that's what nmi programs actually need but I see why that's not an > > > obvious behavior. > > > > How prefill is going to work is still to-be-designed. > > That's the new part for me, though - the maps design had a mechanism to specify > initial capacity, and it worked for nmi programs. That's why I'm pulling on this > thread, it's the hardest thing to get right _and_ it needs to exist before deferred > work can be useful. Specifying initial capacity sounds great in theory, but what does it mean in practice? N elements on each cpu or evenly distributed across all? > > > In addition to current-cpu vs on-all-cpu question above... > > Will prefill() helper just do irq_work ? > > If so then it doesn't help nmi and irq-disabled progs at all. > > prefill helper working asynchronously doesn't guarantee availability of objects > > later to the program. > > prefill() becomes a hint and probably useless as such. > > Agreed. > > > So it should probably be synchronous and fail when in-nmi or in-irq? 
> > But bpf prog cannot know its context, so only safe synchronous prefill() > > would be out of sleepable progs. > > Initial maps capacity would've come from the syscall, so the program itself would not > contain a prefill() call. > > We covered this in our initial discussions - I also think that requiring every > perf_event program to setup a uprobe or syscall program to fill the object pool > (internal or external) is also a bad design. right. we did. prefill from user space makes sense. > If we're going for a global allocator, I suppose we could encode these requirements > in BTF and satisfy them on program load? .alloc map with some predefined names or > something? ohh. When I was saying 'global allocator' I meant an allocator that is not exposed to bpf progs at all. It's just there for all programs. It has hidden watermarks and prefill for it doesn't make sense. Pretty much kmalloc equivalent. > > > > > > In both cases these lists will have to do irq_work and refill > > > > out of global slabs. > > > > > > If a tracing program needs irq_work to refill, then hasn't the API already failed the > > > program writer? I'd have to remind myself how irq_work actually works but given that > > > it's a soft/hardirq, an nmi program can trivially exhaust the entire allocator before > > > irq_work has a chance to refill it. I don't see how you'd write reliable programs > > > this way. > > > > The only way nmi-prog can guarantee availability if it allocates and reserves > > objects in a different non-nmi program. > > If everything runs in nmi there is nothing can be done. > > See above, we were using the map load syscall to satisfy this before. We could > probably do the same here but it's just documenting requirements as opposed to also > introducing ownership/lifetime problems. > > > > > > > > > > > > Besides, if we punt the freelists to bpf, then we get absolutely no control over the > > > > > memory usage, which is strictly worse for us (and worse developer experience on top). > > > > > > > > I don't understand this point. > > > > All allocations are still coming out of bpf_mem_alloc. > > > > We can have debug mode with memleak support and other debug > > > > mechanisms. > > > > > > I mostly mean accounting here. If we segment the allocated objects by finer-grained > > > allocators, we can attribute them to individual programs better. But, I agree, this > > > can be implemented in other ways, it can just be counts/tables on bpf_prog. > > > > mem accounting is the whole different, huge and largely unsovled topic. > > The thread about memcg and bpf is still ongoing. > > The fine-grained bpf_mem_alloc isn't going to magically solve it. > > > > > > > > > > What is 'freelist determinism' ? > > > > > > The property that prefill keeps all objects on the freelist, so the following > > > sequence doesn't observe allocation failures: > > > > > > bpf_mem_prefill(1000, struct foo); > > > run_1000_times { alloc(struct foo); } > > > run_1000_times { free(struct foo); } > > > run_1000_times { alloc(struct foo); } > > > alloc(struct foo) // this can observe a failure > > > > we cannot evalute the above until we answer current-cpu vs on-all-cpus question > > and whether bpf_mem_prefill is sync or async. > > > > I still think designing prefill and guaranteed availability is out of scope > > of allocator. > > It was in the maps design on purpose though, I need it for deferred work to be useful > (remember that build id EFAULT thread? 
only way to fix it for good requires that work > submission never fails, which needs allocations from nmi to never fail). IIRC in the build_id EFAULT-ing thread the main issue was: moving build_id collection from nmi into exit_to_user context so that build_id logic can do copy_from_user. In that context it can allocate with GFP_KERNEL too. We've discussed combining kernel stack with later user+build_id, ringbufs, etc. Lots of things. > > > > > Are you talking about some other freelist on top of bpf_mem_alloc's > > > > free lists ? > > > > > > Well, that's the question, isn't it? I think it should be part of the bpf kernel > > > ecosystem (helper/kfunc) but it doesn't have to be bpf_mem_alloc itself. And, if it's > > > instantiated per-program, that's easiest to reason about. > > > > There should be a way. For sure. Helper/kfunc or trivial stuff on top of bpf_link_list > > is still a question. Bundling this feature together with an allocator feels artificial. > > In user space C you wouldn't combine tcmalloc with custom free list. > > Userspace doesn't have nmi or need allocators that work from signal handlers, for a > more appropriate analogy. We actually need this to work reliably from nmi, so we can > shift work _away_ from nmi. If we didn't have this use case, I would've folded on the > entire issue and kicked the can down the road (plenty of helpers don't work in nmi). Sure. I think all the arguments against global mem_alloc come from the assumption that run-time percpu_ref_get/put in bpf_mem_alloc/free will work. Kumar mentioned that we have to carefully think when to do percpu_ref_exit() since it will convert percpu_ref to atomic and performance will suffer. Also there could be yet another solution to refcounting that will enable per-program custom bpf_mem_alloc. For example: - bpf_mem_alloc is a new map type. It's added to prog->used_maps[] - no run-time refcnt-ing - don't store mem_alloc into hidden 8 bytes - since __kptr __local enforces type and size we can allow: obj = bpf_mem_alloc(alloc_A, btf_type_id_local(struct foo)); kptr_xchg(val, obj); .. // on different cpu in a different prog obj = kptr_xchg(val, NULL); bpf_mem_free(alloc_B, obj); The verifier will need to make sure that alloc_A and alloc_B can service the same type. If allocators can service any type size, no checks are necessary. - during hash map free we do: obj = xchg(val) bpf_mem_free(global_alloc, obj); Where global_alloc is the global allocator I was talking about. It's always there. Cannot get any simpler. My main point is let's postpone the discussions about features that will happen ten steps from now. Let's start with global allocator first and don't expose bpf_mem_alloc as a map just yet. It will enable plenty of new use cases and unblock other work.
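Spelled out as a program, the flow proposed above would look roughly like the sketch below. Everything allocator-related here is an assumption drawn from the proposal, not existing UAPI: BPF_MAP_TYPE_MEM_ALLOC, the __local kptr tag and the program-side bpf_mem_alloc()/bpf_mem_free() helpers (and their signatures) are hypothetical; only bpf_kptr_xchg(), bpf_map_lookup_elem() and bpf_core_type_id_local() exist today.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Hypothetical program-side helpers for the proposed allocator-as-map. */
extern void *bpf_mem_alloc(void *allocator_map, __u32 local_type_id) __ksym;
extern void bpf_mem_free(void *allocator_map, void *obj) __ksym;

struct foo {
	__u64 payload;
};

struct {
	__uint(type, BPF_MAP_TYPE_MEM_ALLOC);	/* proposed map type, lives in prog->used_maps[] */
} alloc_A SEC(".maps");

struct map_val {
	struct foo __kptr __local *obj;		/* proposed "local" kptr: type and size are known */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct map_val);
} stash SEC(".maps");

SEC("fentry/bpf_fentry_test1")
int producer(void *ctx)
{
	__u32 key = 0;
	struct map_val *val = bpf_map_lookup_elem(&stash, &key);
	struct foo *p, *old;

	if (!val)
		return 0;
	p = bpf_mem_alloc(&alloc_A, bpf_core_type_id_local(struct foo));
	if (!p)
		return 0;
	old = bpf_kptr_xchg(&val->obj, p);	/* stash it; no runtime allocator refcounting */
	if (old)
		bpf_mem_free(&alloc_A, old);
	return 0;
}

/* A different program may later take the object out and free it into its own
 * allocator, provided the verifier proves that allocator can service struct foo:
 *	old = bpf_kptr_xchg(&val->obj, NULL);
 *	if (old)
 *		bpf_mem_free(&alloc_B, old);
 */
char LICENSE[] SEC("license") = "GPL";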
On Fri, 2 Sept 2022 at 05:29, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Sep 02, 2022 at 01:40:29AM +0000, Delyan Kratunov wrote: > > On Thu, 2022-09-01 at 17:12 -0700, Alexei Starovoitov wrote: > > > On Thu, Sep 01, 2022 at 10:46:09PM +0000, Delyan Kratunov wrote: > > > > On Wed, 2022-08-31 at 20:55 -0700, Alexei Starovoitov wrote: > > > > > On Wed, Aug 31, 2022 at 2:02 PM Delyan Kratunov <delyank@fb.com> wrote: > > > > > > Given that tracing programs can't really maintain their own freelists safely (I think > > > > > > they're missing the building blocks - you can't cmpxchg kptrs), > > > > > > > > > > Today? yes, but soon we will have link lists supported natively. > > > > > > > > > > > I do feel like > > > > > > isolated allocators are a requirement here. Without them, allocations can fail and > > > > > > there's no way to write a reliable program. > > > > > > > > > > Completely agree that there should be a way for programs > > > > > to guarantee availability of the element. > > > > > Inline allocation can fail regardless whether allocation pool > > > > > is shared by multiple programs or a single program owns an allocator. > > > > > > > > I'm not sure I understand this point. > > > > If I call bpf_mem_prefill(20, type_id(struct foo)), I would expect the next 20 allocs > > > > for struct foo to succeed. In what situations can this fail if I'm the only program > > > > using it _and_ I've calculated the prefill amount correctly? > > > > > > > > Unless you're saying that the prefill wouldn't adjust the freelist limits, in which > > > > case, I think that's a mistake - prefill should effectively _set_ the freelist > > > > limits. > > > > > > There is no prefill implementation today, so we're just guessing, but let's try. > > > > Well, initial capacity was going to be part of the map API, so I always considered it > > part of the same work. > > > > > prefill would probably have to adjust high-watermark limit. > > > That makes sense, but for how long? Should the watermark go back after time > > > or after N objects were consumed? > > > > Neither, if you want your pool of objects to not vanish from under you. > > > > > What prefill is going to do? Prefill on current cpu only ? > > > but it doesn't help the prog to be deterministic in consuming them. > > > Prefill on all cpu-s? That's possible, > > > but for_each_possible_cpu() {irq_work_queue_on(cpu);} > > > looks to be a significant memory and run-time overhead. > > > > No, that's overkill imo, especially on 100+ core systems. > > I was imagining the allocator consuming the current cpu freelist first, then stealing > > from other cpus, and only if they are empty, giving up and scheduling irq_work. > > stealing from other cpus?! > That's certainly out of scope for bpf_mem_alloc as it's implemented. > Stealing from other cpus would require a redesign. > Yes, stealing would most likely force us to use a spinlock, concurrent llist_del_first doesn't work, so that is the only option that comes to mind unless you have something fancy in mind (and I would be genuinely interested to know how :)). It will then be some more verifier side work if you want to make it work in tracing and perf_event progs. Essentially, we would need to teach it to treat bpf_in_nmi() branch specially and force it to use spin_trylock, otherwise spin_lock can be used (since bpf's is an irqsave variant so lower contexts are exclusive). Even then there might be some corner cases I don't remember right now. 
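For reference, the trylock shape described above is the same pattern kernel/bpf/percpu_freelist.c already uses; transplanted onto a per-cpu cache it would look roughly like this (the lock field and the function are assumptions for illustration, not existing bpf_mem_alloc code):

#include <linux/llist.h>
#include <linux/spinlock.h>
#include <linux/hardirq.h>

struct stealable_cache {
	raw_spinlock_t lock;		/* hypothetical: serializes cross-cpu pops */
	struct llist_head free_llist;
};

static struct llist_node *steal_one(struct stealable_cache *remote)
{
	struct llist_node *llnode;
	unsigned long flags;

	if (in_nmi()) {
		/* The NMI may have interrupted the lock holder, so spinning
		 * could deadlock; give up and report failure instead.
		 */
		if (!raw_spin_trylock_irqsave(&remote->lock, flags))
			return NULL;
	} else {
		raw_spin_lock_irqsave(&remote->lock, flags);
	}
	/* llist_del_first() is not safe when several cpus pop concurrently,
	 * which is why stealing needs the lock at all.
	 */
	llnode = __llist_del_first(&remote->free_llist);
	raw_spin_unlock_irqrestore(&remote->lock, flags);
	return llnode;
}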
> > A little complex to implement but it's possible. It does require atomics everywhere > > though. > > atomic everywhere and many more weeks of thinking and debugging. > kernel/bpf/percpu_freelist.c does stealing from other cpus and it wasn't > trivial to do. > > > > > > When freelist is managed by the program it may contain just N elements > > > that progs needs. > > > > > > > > In that sense, allowing multiple programs to create an instance > > > > > of an allocator doesn't solve this problem. > > > > > Short free list inside bpf_mem_cache is an implementation detail. > > > > > "prefill to guarantee successful alloc" is a bit out of scope > > > > > of an allocator. > > > > > > > > I disagree that it's out of scope. This is the only access to dynamic memory from a > > > > bpf program, it makes sense that it covers the requirements of bpf programs, > > > > including prefill/freelist behavior, so all programs can safely use it. > > > > > > > > > "allocate a set and stash it" should be a separate building block > > > > > available to bpf progs when step "allocate" can fail and > > > > > efficient "stash it" can probably be done on top of the link list. > > > > > > > > Do you imagine a BPF object that every user has to link into their programs (yuck), > > > > or a different set of helpers? If it's helpers/kfuncs, I'm all for splitting things > > > > this way. > > > > > > I'm assuming Kumar's proposed list api: > > > struct bpf_list_head head; > > > struct bpf_list_node node; > > > bpf_list_insert(&node, &head); > > > > > > will work out. > > > > Given the assumed locking in that design, I don't see how it would help nmi programs > > tbh. This is list_head, we need llist_head, relatively speaking. > > Of course. bpf-native link list could be per-cpu and based on llist. > bpf_list vs bpf_llist. SMOP :) +1. percpu NMI safe list using local_t style protection should work out well. It will hook into the same infra for locked linked lists, but use local_t lock for protection. percpu maps only have local_t, non-percpu has bpf_spin_lock. Also need to limit remote percpu access (using bpf_lookup_elem_percpu). The only labor needed is doing the trylock part for it (since it can fail, inc_return != 1), so only one branch with the checked result of trylock has the lock. The lock section is already limited to current bpf_func_state, and unlocking always happens in the same frame. Other than that, it is trivial to support with most basic infra already there in [0]. [0]: https://lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com/ > > > > > > > > > If it's distributed separately, I think that's an unexpected burden on developers > > > > (I'm thinking especially of tools not writing programs in C or using libbpf/bpftool > > > > skels). There are no other bpf features that require a userspace support library like > > > > this. (USDT doesn't count, uprobes are the underlying bpf feature and that is useful > > > > without a library) > > > > > > bpf progs must not pay for what they don't use. Hence all building blocks should be > > > small. We will have libc-like bpf libraries with bigger blocks eventually. > > > > I'm not sure I understand how having the mechanism in helpers and managed by the > > kernel is paying for something they don't use? > > every feature adds up.. like stealing from cpus. > > > > > > > > > I think the disagreement here is that per-prog allocator based > > > > > on bpf_mem_alloc isn't going to be any more deterministic than > > > > > one global bpf_mem_alloc for all progs. 
> > > > > Per-prog short free list of ~64 elements vs > > > > > global free list of ~64 elements. > > > > > > > > Right, I think I had a hidden assumption here that we've exposed. > > > > Namely, I imagined that after a mem_prefill(1000, struct foo) call, there would be > > > > 1000 struct foos on the freelist and the freelist thresholds would be adjusted > > > > accordingly (i.e., you can free all of them and allocate them again, all from the > > > > freelist). > > > > > > > > Ultimately, that's what nmi programs actually need but I see why that's not an > > > > obvious behavior. > > > > > > How prefill is going to work is still to-be-designed. > > > > That's the new part for me, though - the maps design had a mechanism to specify > > initial capacity, and it worked for nmi programs. That's why I'm pulling on this > > thread, it's the hardest thing to get right _and_ it needs to exist before deferred > > work can be useful. > > Specifying initial capacity sounds great in theory, but what does it mean in practice? > N elements on each cpu or evenly distributed across all? > > > > > > In addition to current-cpu vs on-all-cpu question above... > > > Will prefill() helper just do irq_work ? > > > If so then it doesn't help nmi and irq-disabled progs at all. > > > prefill helper working asynchronously doesn't guarantee availability of objects > > > later to the program. > > > prefill() becomes a hint and probably useless as such. > > > > Agreed. > > > > > So it should probably be synchronous and fail when in-nmi or in-irq? > > > But bpf prog cannot know its context, so only safe synchronous prefill() > > > would be out of sleepable progs. > > > > Initial maps capacity would've come from the syscall, so the program itself would not > > contain a prefill() call. > > > > We covered this in our initial discussions - I also think that requiring every > > perf_event program to setup a uprobe or syscall program to fill the object pool > > (internal or external) is also a bad design. > > right. we did. prefill from user space makes sense. > > > If we're going for a global allocator, I suppose we could encode these requirements > > in BTF and satisfy them on program load? .alloc map with some predefined names or > > something? > > ohh. When I was saying 'global allocator' I meant an allocator that is not exposed > to bpf progs at all. It's just there for all programs. It has hidden watermarks > and prefill for it doesn't make sense. Pretty much kmalloc equivalent. > > > > > [...] > > > > Userspace doesn't have nmi or need allocators that work from signal handlers, for a > > more appropriate analogy. We actually need this to work reliably from nmi, so we can > > shift work _away_ from nmi. If we didn't have this use case, I would've folded on the > > entire issue and kicked the can down the road (plenty of helpers don't work in nmi). > > Sure. > I think all the arguments against global mem_alloc come from the assumption that > run-time percpu_ref_get/put in bpf_mem_alloc/free will work. > Kumar mentioned that we have to carefully think when to do percpu_ref_exit() > since it will convert percpu_ref to atomic and performance will suffer. > > Also there could be yet another solution to refcounting that will enable > per-program custom bpf_mem_alloc. > For example: > - bpf_mem_alloc is a new map type. 
It's added to prog->used_maps[] > - no run-time refcnt-ing > - don't store mem_alloc into hidden 8 bytes > - since __kptr __local enforces type and size we can allow: > obj = bpf_mem_alloc(alloc_A, btf_type_id_local(struct foo)); > kptr_xchg(val, obj); > .. > // on different cpu in a different prog > obj = kptr_xchg(val, NULL); > bpf_mem_free(alloc_B, obj); > The verifier will need to make sure that alloc_A and alloc_B can service the same type. > If allocators can service any type size, no checks are necessary. > > - during hash map free we do: > obj = xchg(val) > bpf_mem_free(global_alloc, obj); > Where global_alloc is the global allocator I was talking about. It's always there. > Cannot get any simpler. Neat idea. The size of kptr is always known (already in the map value type). Also realized a fun tidbit: we can technically do a sized delete [1] as well. C, C++, Rust all support this, the first one manually (actually will in C23 or whenever implementers get around to adopting it), and the latter two statically, using the type's size to transform the delete call into a sized delete. No need to do size class lookup using bpf_mem_cache_idx in bpf_mem_free. The verifier can always pass a hidden argument. So probably a minor perf improvement on the free path that comes for free. [1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2699.htm > > My main point is let's postpone the discussions about features that will happen ten steps > from now. Let's start with global allocator first and don't expose bpf_mem_alloc > as a map just yet. It will enable plenty of new use cases and unblock other work. +1, I think we should go with global bpf_mem_alloc, even if it only serves as a fallback in the future. It makes things much simpler and pretty much follows the existing kmalloc idea with an extra cache. Also, I'm confident the BPF percpu linked list case is going to be pretty much equivalent or at least very close to bpf_mem_alloc freelist. But famous last words and all, we'll see.
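For reference, the hidden-argument idea is a sized free. Inside kernel/bpf/memalloc.c, where bpf_mem_cache_idx() and unit_free() already live, a sized variant could look roughly like this; the unsized path and the field layout are approximations of the series, not its actual code:

/* Unsized free: the size class has to be recovered from the object at
 * runtime (e.g. via ksize() or a size stashed next to the object).
 */
void bpf_mem_free(struct bpf_mem_alloc *ma, void *ptr)
{
	int idx = bpf_mem_cache_idx(ksize(ptr));

	unit_free(this_cpu_ptr(ma->caches)->cache + idx, ptr);
}

/* Hypothetical sized variant: 'size' would be a constant the verifier derives
 * from the kptr's BTF type, so the size-class lookup folds away at the call site.
 */
void bpf_mem_free_sized(struct bpf_mem_alloc *ma, void *ptr, size_t size)
{
	unit_free(this_cpu_ptr(ma->caches)->cache + bpf_mem_cache_idx(size), ptr);
}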