Message ID | 20221230041151.1231169-1-houtao@huaweicloud.com (mailing list archive)
---|---
Series | bpf: Handle reuse in bpf memory alloc
On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: > From: Hou Tao <houtao1@huawei.com> > > Hi, > > The patchset tries to fix the problems found when checking how htab map > handles element reuse in bpf memory allocator. The immediate reuse of > freed elements may lead to two problems in htab map: > > (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in > htab map value and it may corrupt lookup procedure with BFP_F_LOCK > flag which acquires bpf-spin-lock during value copying. The > corruption of bpf-spin-lock may result in hard lock-up. > (2) lookup procedure may get incorrect map value if the found element is > freed and then reused. > > Because the type of htab map elements are the same, so problem #1 can be > fixed by supporting ctor in bpf memory allocator. The ctor initializes > these special fields in map element only when the map element is newly > allocated. If it is just a reused element, there will be no > reinitialization. Instead of adding the overhead of ctor callback let's just add __GFP_ZERO to flags in __alloc(). That will address the issue 1 and will make bpf_mem_alloc behave just like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default will behave the same way. > Problem #2 exists for both non-preallocated and preallocated htab map. > By adding seq in htab element, doing reuse check and retrying the > lookup procedure may be a feasible solution, but it will make the > lookup API being hard to use, because the user needs to check whether > the found element is reused or not and repeat the lookup procedure if it > is reused. A simpler solution would be just disabling freed elements > reuse and freeing these elements after lookup procedure ends. You've proposed this 'solution' twice already in qptrie thread and both times the answer was 'no, we cannot do this' with reasons explained. The 3rd time the answer is still the same. This 'issue 2' existed in hashmap since very beginning for many years. It's a known quirk. There is nothing to fix really. The graph apis (aka new gen data structs) with link list and rbtree are in active development. Soon bpf progs will be able to implement their own hash maps with explicit bpf_rcu_read_lock. At that time the progs will be making the trade off between performance and lookup/delete race. So please respin with just __GFP_ZERO and update the patch 6 to check for lockup only.
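For reference, a minimal sketch of the __GFP_ZERO suggestion above, assuming it is applied where __alloc() in kernel/bpf/memalloc.c builds its gfp flags; the real function also handles the per-cpu cache case, which is omitted here:

```c
/* Sketch only, not the exact upstream function body. */
static void *__alloc(struct bpf_mem_cache *c, int node)
{
	/* GFP_NOWAIT: this runs from irq_work and must not sleep or
	 * deplete atomic reserves.  __GFP_ZERO makes a freshly
	 * slab-allocated object start out zeroed, so special fields
	 * (bpf_spin_lock, bpf_timer, ...) are clean on first use,
	 * matching the preallocated percpu_freelist behaviour.
	 * Objects recycled from the cache's own free list are NOT
	 * re-zeroed, so a reused element keeps its field state.
	 */
	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT | __GFP_ZERO;

	return kmalloc_node(c->unit_size, flags, node);
}
```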
On 12/31/22 5:26 PM, Alexei Starovoitov wrote: > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >> From: Hou Tao <houtao1@huawei.com> >> >> Hi, >> >> The patchset tries to fix the problems found when checking how htab map >> handles element reuse in bpf memory allocator. The immediate reuse of >> freed elements may lead to two problems in htab map: >> >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >> flag which acquires bpf-spin-lock during value copying. The >> corruption of bpf-spin-lock may result in hard lock-up. >> (2) lookup procedure may get incorrect map value if the found element is >> freed and then reused. >> >> Because the type of htab map elements are the same, so problem #1 can be >> fixed by supporting ctor in bpf memory allocator. The ctor initializes >> these special fields in map element only when the map element is newly >> allocated. If it is just a reused element, there will be no >> reinitialization. > > Instead of adding the overhead of ctor callback let's just > add __GFP_ZERO to flags in __alloc(). > That will address the issue 1 and will make bpf_mem_alloc behave just > like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default > will behave the same way. Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ tried to address a similar issue for lru hash table. Maybe we need to do similar things after bpf_mem_cache_alloc() for hash table? > >> Problem #2 exists for both non-preallocated and preallocated htab map. >> By adding seq in htab element, doing reuse check and retrying the >> lookup procedure may be a feasible solution, but it will make the >> lookup API being hard to use, because the user needs to check whether >> the found element is reused or not and repeat the lookup procedure if it >> is reused. A simpler solution would be just disabling freed elements >> reuse and freeing these elements after lookup procedure ends. > > You've proposed this 'solution' twice already in qptrie thread and both > times the answer was 'no, we cannot do this' with reasons explained. > The 3rd time the answer is still the same. > This 'issue 2' existed in hashmap since very beginning for many years. > It's a known quirk. There is nothing to fix really. > > The graph apis (aka new gen data structs) with link list and rbtree are > in active development. Soon bpf progs will be able to implement their own > hash maps with explicit bpf_rcu_read_lock. At that time the progs will > be making the trade off between performance and lookup/delete race. > So please respin with just __GFP_ZERO and update the patch 6 > to check for lockup only.
Hi, On 1/1/2023 9:26 AM, Alexei Starovoitov wrote: > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >> From: Hou Tao <houtao1@huawei.com> >> >> Hi, >> >> The patchset tries to fix the problems found when checking how htab map >> handles element reuse in bpf memory allocator. The immediate reuse of >> freed elements may lead to two problems in htab map: >> >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >> flag which acquires bpf-spin-lock during value copying. The >> corruption of bpf-spin-lock may result in hard lock-up. >> (2) lookup procedure may get incorrect map value if the found element is >> freed and then reused. >> >> Because the type of htab map elements are the same, so problem #1 can be >> fixed by supporting ctor in bpf memory allocator. The ctor initializes >> these special fields in map element only when the map element is newly >> allocated. If it is just a reused element, there will be no >> reinitialization. > Instead of adding the overhead of ctor callback let's just > add __GFP_ZERO to flags in __alloc(). > That will address the issue 1 and will make bpf_mem_alloc behave just > like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default > will behave the same way. Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated object may be bigger than ctor callback when the size of allocated object is large. And it also introduces unnecessary memory zeroing if there is no special field in the map value. >> Problem #2 exists for both non-preallocated and preallocated htab map. >> By adding seq in htab element, doing reuse check and retrying the >> lookup procedure may be a feasible solution, but it will make the >> lookup API being hard to use, because the user needs to check whether >> the found element is reused or not and repeat the lookup procedure if it >> is reused. A simpler solution would be just disabling freed elements >> reuse and freeing these elements after lookup procedure ends. > You've proposed this 'solution' twice already in qptrie thread and both > times the answer was 'no, we cannot do this' with reasons explained. > The 3rd time the answer is still the same. This time a workable demo which calls call_rcu_task_trace() in batch is provided :) Also because I can not find a better solution for the reuse problem. But you are right, although don't reuse the freed element will make the implementation of map simpler, the potential OOM problem is hard to solve specially when RCU tasks trace grace period is slow. Hope Paul can provide some insight about the problem. > This 'issue 2' existed in hashmap since very beginning for many years. > It's a known quirk. There is nothing to fix really. Do we need to document the unexpected behavior somewhere, because I really don't know nothing about the quirk ? > > The graph apis (aka new gen data structs) with link list and rbtree are > in active development. Soon bpf progs will be able to implement their own > hash maps with explicit bpf_rcu_read_lock. At that time the progs will > be making the trade off between performance and lookup/delete race. It seems these new gen data struct also need to solve the reuse problem because a global bpf memory allocator is used. > So please respin with just __GFP_ZERO and update the patch 6 > to check for lockup only. > .
Hi, On 1/2/2023 2:48 AM, Yonghong Song wrote: > > > On 12/31/22 5:26 PM, Alexei Starovoitov wrote: >> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >>> From: Hou Tao <houtao1@huawei.com> >>> >>> Hi, >>> >>> The patchset tries to fix the problems found when checking how htab map >>> handles element reuse in bpf memory allocator. The immediate reuse of >>> freed elements may lead to two problems in htab map: >>> >>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >>> flag which acquires bpf-spin-lock during value copying. The >>> corruption of bpf-spin-lock may result in hard lock-up. >>> (2) lookup procedure may get incorrect map value if the found element is >>> freed and then reused. >>> >>> Because the type of htab map elements are the same, so problem #1 can be >>> fixed by supporting ctor in bpf memory allocator. The ctor initializes >>> these special fields in map element only when the map element is newly >>> allocated. If it is just a reused element, there will be no >>> reinitialization. >> >> Instead of adding the overhead of ctor callback let's just >> add __GFP_ZERO to flags in __alloc(). >> That will address the issue 1 and will make bpf_mem_alloc behave just >> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default >> will behave the same way. > > Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ > tried to address a similar issue for lru hash table. > Maybe we need to do similar things after bpf_mem_cache_alloc() for > hash table? IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? > > >> >>> Problem #2 exists for both non-preallocated and preallocated htab map. >>> By adding seq in htab element, doing reuse check and retrying the >>> lookup procedure may be a feasible solution, but it will make the >>> lookup API being hard to use, because the user needs to check whether >>> the found element is reused or not and repeat the lookup procedure if it >>> is reused. A simpler solution would be just disabling freed elements >>> reuse and freeing these elements after lookup procedure ends. >> >> You've proposed this 'solution' twice already in qptrie thread and both >> times the answer was 'no, we cannot do this' with reasons explained. >> The 3rd time the answer is still the same. >> This 'issue 2' existed in hashmap since very beginning for many years. >> It's a known quirk. There is nothing to fix really. >> >> The graph apis (aka new gen data structs) with link list and rbtree are >> in active development. Soon bpf progs will be able to implement their own >> hash maps with explicit bpf_rcu_read_lock. At that time the progs will >> be making the trade off between performance and lookup/delete race. >> So please respin with just __GFP_ZERO and update the patch 6 >> to check for lockup only.
On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 1/1/2023 9:26 AM, Alexei Starovoitov wrote: > > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: > >> From: Hou Tao <houtao1@huawei.com> > >> > >> Hi, > >> > >> The patchset tries to fix the problems found when checking how htab map > >> handles element reuse in bpf memory allocator. The immediate reuse of > >> freed elements may lead to two problems in htab map: > >> > >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in > >> htab map value and it may corrupt lookup procedure with BFP_F_LOCK > >> flag which acquires bpf-spin-lock during value copying. The > >> corruption of bpf-spin-lock may result in hard lock-up. > >> (2) lookup procedure may get incorrect map value if the found element is > >> freed and then reused. > >> > >> Because the type of htab map elements are the same, so problem #1 can be > >> fixed by supporting ctor in bpf memory allocator. The ctor initializes > >> these special fields in map element only when the map element is newly > >> allocated. If it is just a reused element, there will be no > >> reinitialization. > > Instead of adding the overhead of ctor callback let's just > > add __GFP_ZERO to flags in __alloc(). > > That will address the issue 1 and will make bpf_mem_alloc behave just > > like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default > > will behave the same way. > Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated > object may be bigger than ctor callback when the size of allocated object is > large. And it also introduces unnecessary memory zeroing if there is no special > field in the map value. Small memset is often faster than an indirect call. I doubt that adding GFP_ZERO will have a measurable perf difference in map_perf_test and other benchmarks. > >> Problem #2 exists for both non-preallocated and preallocated htab map. > >> By adding seq in htab element, doing reuse check and retrying the > >> lookup procedure may be a feasible solution, but it will make the > >> lookup API being hard to use, because the user needs to check whether > >> the found element is reused or not and repeat the lookup procedure if it > >> is reused. A simpler solution would be just disabling freed elements > >> reuse and freeing these elements after lookup procedure ends. > > You've proposed this 'solution' twice already in qptrie thread and both > > times the answer was 'no, we cannot do this' with reasons explained. > > The 3rd time the answer is still the same. > This time a workable demo which calls call_rcu_task_trace() in batch is provided > :) Also because I can not find a better solution for the reuse problem. But you > are right, although don't reuse the freed element will make the implementation > of map simpler, the potential OOM problem is hard to solve specially when RCU > tasks trace grace period is slow. Hope Paul can provide some insight about the > problem. OOM is exactly the reason why we cannot do this delaying logic in the general case. We don't control what progs do and rcu tasks trace may take a long time. > > This 'issue 2' existed in hashmap since very beginning for many years. > > It's a known quirk. There is nothing to fix really. > Do we need to document the unexpected behavior somewhere, because I really don't > know nothing about the quirk ? Yeah. It's not documented in Documentation/bpf/map_hash.rst. Please send a patch to add it. 
> > > > The graph apis (aka new gen data structs) with link list and rbtree are > > in active development. Soon bpf progs will be able to implement their own > > hash maps with explicit bpf_rcu_read_lock. At that time the progs will > > be making the trade off between performance and lookup/delete race. > It seems these new gen data struct also need to solve the reuse problem because > a global bpf memory allocator is used. Currently the graph api is single owner and kptr_xchg is used to allow single cpu at a time to observe the object. In the future we will add bpf_refcount and multi owner. Then multiple cpus will be able to access the same object concurrently. They will race to read/write the fields and it will be prog decision to arbitrate the access. In other words the bpf prog needs to have mechanisms to deal with reuse.
On 1/3/23 5:47 AM, Hou Tao wrote: > Hi, > > On 1/2/2023 2:48 AM, Yonghong Song wrote: >> >> >> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: >>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >>>> From: Hou Tao <houtao1@huawei.com> >>>> >>>> Hi, >>>> >>>> The patchset tries to fix the problems found when checking how htab map >>>> handles element reuse in bpf memory allocator. The immediate reuse of >>>> freed elements may lead to two problems in htab map: >>>> >>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >>>> flag which acquires bpf-spin-lock during value copying. The >>>> corruption of bpf-spin-lock may result in hard lock-up. >>>> (2) lookup procedure may get incorrect map value if the found element is >>>> freed and then reused. >>>> >>>> Because the type of htab map elements are the same, so problem #1 can be >>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes >>>> these special fields in map element only when the map element is newly >>>> allocated. If it is just a reused element, there will be no >>>> reinitialization. >>> >>> Instead of adding the overhead of ctor callback let's just >>> add __GFP_ZERO to flags in __alloc(). >>> That will address the issue 1 and will make bpf_mem_alloc behave just >>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default >>> will behave the same way. >> >> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ >> tried to address a similar issue for lru hash table. >> Maybe we need to do similar things after bpf_mem_cache_alloc() for >> hash table? > IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? The following is my understanding: in function alloc_htab_elem() (hashtab.c), we have if (is_map_full(htab)) if (!old_elem) /* when map is full and update() is replacing * old element, it's ok to allocate, since * old element will be freed immediately. * Otherwise return an error */ return ERR_PTR(-E2BIG); inc_elem_count(htab); l_new = bpf_mem_cache_alloc(&htab->ma); if (!l_new) { l_new = ERR_PTR(-ENOMEM); goto dec_count; } check_and_init_map_value(&htab->map, l_new->key + round_up(key_size, 8)); In the above check_and_init_map_value() intends to do initializing for an element from bpf_mem_cache_alloc (could be reused from the free list). The check_and_init_map_value() looks like below (in include/linux/bpf.h) static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) { int i; if (!foffs) return; for (i = 0; i < foffs->cnt; i++) memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); } static inline void check_and_init_map_value(struct bpf_map *map, void *dst) { bpf_obj_init(map->field_offs, dst); } IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, list_head, list_node, etc. This is the problem for above problem #1. Maybe I missed something? >> >> >>> >>>> Problem #2 exists for both non-preallocated and preallocated htab map. >>>> By adding seq in htab element, doing reuse check and retrying the >>>> lookup procedure may be a feasible solution, but it will make the >>>> lookup API being hard to use, because the user needs to check whether >>>> the found element is reused or not and repeat the lookup procedure if it >>>> is reused. A simpler solution would be just disabling freed elements >>>> reuse and freeing these elements after lookup procedure ends. 
>>> >>> You've proposed this 'solution' twice already in qptrie thread and both >>> times the answer was 'no, we cannot do this' with reasons explained. >>> The 3rd time the answer is still the same. >>> This 'issue 2' existed in hashmap since very beginning for many years. >>> It's a known quirk. There is nothing to fix really. >>> >>> The graph apis (aka new gen data structs) with link list and rbtree are >>> in active development. Soon bpf progs will be able to implement their own >>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will >>> be making the trade off between performance and lookup/delete race. >>> So please respin with just __GFP_ZERO and update the patch 6 >>> to check for lockup only. >
Hi, On 1/4/2023 2:10 PM, Yonghong Song wrote: > > > On 1/3/23 5:47 AM, Hou Tao wrote: >> Hi, >> >> On 1/2/2023 2:48 AM, Yonghong Song wrote: >>> >>> >>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: >>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >>>>> From: Hou Tao <houtao1@huawei.com> >>>>> >>>>> Hi, >>>>> >>>>> The patchset tries to fix the problems found when checking how htab map >>>>> handles element reuse in bpf memory allocator. The immediate reuse of >>>>> freed elements may lead to two problems in htab map: >>>>> >>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >>>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >>>>> flag which acquires bpf-spin-lock during value copying. The >>>>> corruption of bpf-spin-lock may result in hard lock-up. >>>>> (2) lookup procedure may get incorrect map value if the found element is >>>>> freed and then reused. >>>>> >>>>> Because the type of htab map elements are the same, so problem #1 can be >>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes >>>>> these special fields in map element only when the map element is newly >>>>> allocated. If it is just a reused element, there will be no >>>>> reinitialization. >>>> >>>> Instead of adding the overhead of ctor callback let's just >>>> add __GFP_ZERO to flags in __alloc(). >>>> That will address the issue 1 and will make bpf_mem_alloc behave just >>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default >>>> will behave the same way. >>> >>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ >>> tried to address a similar issue for lru hash table. >>> Maybe we need to do similar things after bpf_mem_cache_alloc() for >>> hash table? >> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? > > The following is my understanding: > in function alloc_htab_elem() (hashtab.c), we have > > if (is_map_full(htab)) > if (!old_elem) > /* when map is full and update() is replacing > * old element, it's ok to allocate, since > * old element will be freed immediately. > * Otherwise return an error > */ > return ERR_PTR(-E2BIG); > inc_elem_count(htab); > l_new = bpf_mem_cache_alloc(&htab->ma); > if (!l_new) { > l_new = ERR_PTR(-ENOMEM); > goto dec_count; > } > check_and_init_map_value(&htab->map, > l_new->key + round_up(key_size, 8)); > > In the above check_and_init_map_value() intends to do initializing > for an element from bpf_mem_cache_alloc (could be reused from the free list). > > The check_and_init_map_value() looks like below (in include/linux/bpf.h) > > static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) > { > int i; > > if (!foffs) > return; > for (i = 0; i < foffs->cnt; i++) > memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); > } > > static inline void check_and_init_map_value(struct bpf_map *map, void *dst) > { > bpf_obj_init(map->field_offs, dst); > } > > IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, > list_head, list_node, etc. > > This is the problem for above problem #1. > Maybe I missed something? Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the problem by only calling check_and_init_map_value() once for the newly-allocated element, so if a freed element is reused, its special fields will not be zeroed again. Is there any other cases which are not covered by the solution or any other similar problems in hash-tab ? 
> >>> >>> >>>> >>>>> Problem #2 exists for both non-preallocated and preallocated htab map. >>>>> By adding seq in htab element, doing reuse check and retrying the >>>>> lookup procedure may be a feasible solution, but it will make the >>>>> lookup API being hard to use, because the user needs to check whether >>>>> the found element is reused or not and repeat the lookup procedure if it >>>>> is reused. A simpler solution would be just disabling freed elements >>>>> reuse and freeing these elements after lookup procedure ends. >>>> >>>> You've proposed this 'solution' twice already in qptrie thread and both >>>> times the answer was 'no, we cannot do this' with reasons explained. >>>> The 3rd time the answer is still the same. >>>> This 'issue 2' existed in hashmap since very beginning for many years. >>>> It's a known quirk. There is nothing to fix really. >>>> >>>> The graph apis (aka new gen data structs) with link list and rbtree are >>>> in active development. Soon bpf progs will be able to implement their own >>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will >>>> be making the trade off between performance and lookup/delete race. >>>> So please respin with just __GFP_ZERO and update the patch 6 >>>> to check for lockup only. >>
On 1/3/23 10:30 PM, Hou Tao wrote: > Hi, > > On 1/4/2023 2:10 PM, Yonghong Song wrote: >> >> >> On 1/3/23 5:47 AM, Hou Tao wrote: >>> Hi, >>> >>> On 1/2/2023 2:48 AM, Yonghong Song wrote: >>>> >>>> >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >>>>>> From: Hou Tao <houtao1@huawei.com> >>>>>> >>>>>> Hi, >>>>>> >>>>>> The patchset tries to fix the problems found when checking how htab map >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of >>>>>> freed elements may lead to two problems in htab map: >>>>>> >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >>>>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >>>>>> flag which acquires bpf-spin-lock during value copying. The >>>>>> corruption of bpf-spin-lock may result in hard lock-up. >>>>>> (2) lookup procedure may get incorrect map value if the found element is >>>>>> freed and then reused. >>>>>> >>>>>> Because the type of htab map elements are the same, so problem #1 can be >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes >>>>>> these special fields in map element only when the map element is newly >>>>>> allocated. If it is just a reused element, there will be no >>>>>> reinitialization. >>>>> >>>>> Instead of adding the overhead of ctor callback let's just >>>>> add __GFP_ZERO to flags in __alloc(). >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default >>>>> will behave the same way. >>>> >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ >>>> tried to address a similar issue for lru hash table. >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for >>>> hash table? >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? >> >> The following is my understanding: >> in function alloc_htab_elem() (hashtab.c), we have >> >> if (is_map_full(htab)) >> if (!old_elem) >> /* when map is full and update() is replacing >> * old element, it's ok to allocate, since >> * old element will be freed immediately. >> * Otherwise return an error >> */ >> return ERR_PTR(-E2BIG); >> inc_elem_count(htab); >> l_new = bpf_mem_cache_alloc(&htab->ma); >> if (!l_new) { >> l_new = ERR_PTR(-ENOMEM); >> goto dec_count; >> } >> check_and_init_map_value(&htab->map, >> l_new->key + round_up(key_size, 8)); >> >> In the above check_and_init_map_value() intends to do initializing >> for an element from bpf_mem_cache_alloc (could be reused from the free list). >> >> The check_and_init_map_value() looks like below (in include/linux/bpf.h) >> >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) >> { >> int i; >> >> if (!foffs) >> return; >> for (i = 0; i < foffs->cnt; i++) >> memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); >> } >> >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst) >> { >> bpf_obj_init(map->field_offs, dst); >> } >> >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, >> list_head, list_node, etc. >> >> This is the problem for above problem #1. >> Maybe I missed something? > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the > problem by only calling check_and_init_map_value() once for the newly-allocated > element, so if a freed element is reused, its special fields will not be zeroed > again. 
Is there any other cases which are not covered by the solution or any > other similar problems in hash-tab ? No, I checked all cases of check_and_init_map_value() and didn't find any other instances. >> >>>> >>>> >>>>> >>>>>> Problem #2 exists for both non-preallocated and preallocated htab map. >>>>>> By adding seq in htab element, doing reuse check and retrying the >>>>>> lookup procedure may be a feasible solution, but it will make the >>>>>> lookup API being hard to use, because the user needs to check whether >>>>>> the found element is reused or not and repeat the lookup procedure if it >>>>>> is reused. A simpler solution would be just disabling freed elements >>>>>> reuse and freeing these elements after lookup procedure ends. >>>>> >>>>> You've proposed this 'solution' twice already in qptrie thread and both >>>>> times the answer was 'no, we cannot do this' with reasons explained. >>>>> The 3rd time the answer is still the same. >>>>> This 'issue 2' existed in hashmap since very beginning for many years. >>>>> It's a known quirk. There is nothing to fix really. >>>>> >>>>> The graph apis (aka new gen data structs) with link list and rbtree are >>>>> in active development. Soon bpf progs will be able to implement their own >>>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will >>>>> be making the trade off between performance and lookup/delete race. >>>>> So please respin with just __GFP_ZERO and update the patch 6 >>>>> to check for lockup only. >>> >
On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote: > > > > On 1/3/23 10:30 PM, Hou Tao wrote: > > Hi, > > > > On 1/4/2023 2:10 PM, Yonghong Song wrote: > >> > >> > >> On 1/3/23 5:47 AM, Hou Tao wrote: > >>> Hi, > >>> > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote: > >>>> > >>>> > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: > >>>>>> From: Hou Tao <houtao1@huawei.com> > >>>>>> > >>>>>> Hi, > >>>>>> > >>>>>> The patchset tries to fix the problems found when checking how htab map > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of > >>>>>> freed elements may lead to two problems in htab map: > >>>>>> > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in > >>>>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK > >>>>>> flag which acquires bpf-spin-lock during value copying. The > >>>>>> corruption of bpf-spin-lock may result in hard lock-up. > >>>>>> (2) lookup procedure may get incorrect map value if the found element is > >>>>>> freed and then reused. > >>>>>> > >>>>>> Because the type of htab map elements are the same, so problem #1 can be > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes > >>>>>> these special fields in map element only when the map element is newly > >>>>>> allocated. If it is just a reused element, there will be no > >>>>>> reinitialization. > >>>>> > >>>>> Instead of adding the overhead of ctor callback let's just > >>>>> add __GFP_ZERO to flags in __alloc(). > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default > >>>>> will behave the same way. > >>>> > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ > >>>> tried to address a similar issue for lru hash table. > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for > >>>> hash table? > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? > >> > >> The following is my understanding: > >> in function alloc_htab_elem() (hashtab.c), we have > >> > >> if (is_map_full(htab)) > >> if (!old_elem) > >> /* when map is full and update() is replacing > >> * old element, it's ok to allocate, since > >> * old element will be freed immediately. > >> * Otherwise return an error > >> */ > >> return ERR_PTR(-E2BIG); > >> inc_elem_count(htab); > >> l_new = bpf_mem_cache_alloc(&htab->ma); > >> if (!l_new) { > >> l_new = ERR_PTR(-ENOMEM); > >> goto dec_count; > >> } > >> check_and_init_map_value(&htab->map, > >> l_new->key + round_up(key_size, 8)); > >> > >> In the above check_and_init_map_value() intends to do initializing > >> for an element from bpf_mem_cache_alloc (could be reused from the free list). > >> > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h) > >> > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) > >> { > >> int i; > >> > >> if (!foffs) > >> return; > >> for (i = 0; i < foffs->cnt; i++) > >> memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); > >> } > >> > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst) > >> { > >> bpf_obj_init(map->field_offs, dst); > >> } > >> > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, > >> list_head, list_node, etc. > >> > >> This is the problem for above problem #1. > >> Maybe I missed something? > > Yes. 
It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the > > problem by only calling check_and_init_map_value() once for the newly-allocated > > element, so if a freed element is reused, its special fields will not be zeroed > > again. Is there any other cases which are not covered by the solution or any > > other similar problems in hash-tab ? > > No, I checked all cases of check_and_init_map_value() and didn't find > any other instances. check_and_init_map_value() is called in two other cases: lookup_and_delete[_batch]. There the zeroing of the fields is necessary because the 'value' is a temp buffer that is going to be copied to user space. I think the way forward is to add GFP_ZERO to mem_alloc (to make it equivalent to prealloc), remove one case of check_and_init_map_value from hashmap, add short comments to two other cases and add a big comment to check_and_init_map_value() that should say that 'dst' must be a temp buffer and should not point to memory that could be used in parallel by a bpf prog. It feels like we've dealt with this issue a couple times already and keep repeating this mistake, so the more comments the better.
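A possible shape for the alloc_htab_elem() part of that plan (an illustrative fragment under the assumptions above, not a literal patch): with zeroed allocations from bpf_mem_alloc, the re-initialization after bpf_mem_cache_alloc() goes away, while lookup_and_delete[_batch] keep theirs because they fill a temporary buffer that is copied to user space.

```c
/* kernel/bpf/hashtab.c, alloc_htab_elem() -- illustrative fragment of
 * the change discussed above, not a diff against any particular tree.
 */
	l_new = bpf_mem_cache_alloc(&htab->ma);
	if (!l_new) {
		l_new = ERR_PTR(-ENOMEM);
		goto dec_count;
	}
	/* Was:
	 *	check_and_init_map_value(&htab->map,
	 *				 l_new->key + round_up(key_size, 8));
	 * With __GFP_ZERO in bpf_mem_alloc a fresh element is already
	 * zeroed, and a reused element must keep its bpf_spin_lock /
	 * bpf_timer state: zeroing the special fields of a reused
	 * element here can corrupt a bpf_spin_lock that a concurrent
	 * BPF_F_LOCK lookup still holds through a stale reference.
	 */
```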
On 1/3/23 11:38 AM, Alexei Starovoitov wrote: > On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote: >> >> Hi, >> >> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote: >>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >>>> From: Hou Tao <houtao1@huawei.com> >>>> >>>> Hi, >>>> >>>> The patchset tries to fix the problems found when checking how htab map >>>> handles element reuse in bpf memory allocator. The immediate reuse of >>>> freed elements may lead to two problems in htab map: >>>> >>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >>>> flag which acquires bpf-spin-lock during value copying. The >>>> corruption of bpf-spin-lock may result in hard lock-up. >>>> (2) lookup procedure may get incorrect map value if the found element is >>>> freed and then reused. >>>> >>>> Because the type of htab map elements are the same, so problem #1 can be >>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes >>>> these special fields in map element only when the map element is newly >>>> allocated. If it is just a reused element, there will be no >>>> reinitialization. >>> Instead of adding the overhead of ctor callback let's just >>> add __GFP_ZERO to flags in __alloc(). >>> That will address the issue 1 and will make bpf_mem_alloc behave just >>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default >>> will behave the same way. >> Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated >> object may be bigger than ctor callback when the size of allocated object is >> large. And it also introduces unnecessary memory zeroing if there is no special >> field in the map value. > > Small memset is often faster than an indirect call. > I doubt that adding GFP_ZERO will have a measurable perf difference > in map_perf_test and other benchmarks. > >>>> Problem #2 exists for both non-preallocated and preallocated htab map. >>>> By adding seq in htab element, doing reuse check and retrying the >>>> lookup procedure may be a feasible solution, but it will make the >>>> lookup API being hard to use, because the user needs to check whether >>>> the found element is reused or not and repeat the lookup procedure if it >>>> is reused. A simpler solution would be just disabling freed elements >>>> reuse and freeing these elements after lookup procedure ends. >>> You've proposed this 'solution' twice already in qptrie thread and both >>> times the answer was 'no, we cannot do this' with reasons explained. >>> The 3rd time the answer is still the same. >> This time a workable demo which calls call_rcu_task_trace() in batch is provided >> :) Also because I can not find a better solution for the reuse problem. But you >> are right, although don't reuse the freed element will make the implementation >> of map simpler, the potential OOM problem is hard to solve specially when RCU >> tasks trace grace period is slow. Hope Paul can provide some insight about the >> problem. > > OOM is exactly the reason why we cannot do this delaying logic > in the general case. We don't control what progs do and rcu tasks trace > may take a long time. I haven't looked at the details of this patch yet. Since Hou was asking in https://lore.kernel.org/bpf/6e4ec7a4-9ac9-417c-c11a-de59e72a6e42@huawei.com/ for the local storage use case (thanks!), so continue the discussion in this thread. 
Some of my current thoughts, the local storage map is a little different from the more generic bpf map (eg htab). The common use case in local storage map is less demanding on the alloc/free path. The storage is usually allocated once and then stays for the whole lifetime with its owner (sk, task, cgrp, or inode). There is no update helper, so it encourages a get() and then followed by an in-place modification. Beside, the current local storage is also freed after going through a rcu tasks trace gp. That said, I am checking to see if the common local storage use case can be modified to safely reuse without waiting the gp. It will then be an improvement on the existing implementation. The bottom line is not slowing down the current get (ie lookup) performance. Not 100% sure yet if it is doable. The delete() helper likely still need to go through gp. Being able to avoid immediate reuse in bpf_mem_alloc will at least be needed for the local storage delete() code path. > >>> This 'issue 2' existed in hashmap since very beginning for many years. >>> It's a known quirk. There is nothing to fix really. >> Do we need to document the unexpected behavior somewhere, because I really don't >> know nothing about the quirk ? > > Yeah. It's not documented in Documentation/bpf/map_hash.rst. > Please send a patch to add it. > >>> >>> The graph apis (aka new gen data structs) with link list and rbtree are >>> in active development. Soon bpf progs will be able to implement their own >>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will >>> be making the trade off between performance and lookup/delete race. >> It seems these new gen data struct also need to solve the reuse problem because >> a global bpf memory allocator is used. > > Currently the graph api is single owner and kptr_xchg is used to allow > single cpu at a time to observe the object. > In the future we will add bpf_refcount and multi owner. > Then multiple cpus will be able to access the same object concurrently. > They will race to read/write the fields and it will be prog decision > to arbitrate the access. > In other words the bpf prog needs to have mechanisms to deal with reuse.
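For readers unfamiliar with the local storage pattern described above, a small illustrative sockops program (the map, value layout, and field names are invented for this example): the element is created once by the get() helper with the F_CREATE flag and then modified in place for the socket's lifetime, with no update() helper involved.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct sk_stat {
	__u64 rx_segs;
};

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct sk_stat);
} sk_stat_map SEC(".maps");

SEC("sockops")
int count_segs(struct bpf_sock_ops *skops)
{
	struct bpf_sock *sk = skops->sk;
	struct sk_stat *stat;

	if (!sk)
		return 1;

	/* Allocates the per-socket storage on first use; every later
	 * call returns the same element, which lives as long as the
	 * socket does.
	 */
	stat = bpf_sk_storage_get(&sk_stat_map, sk, NULL,
				  BPF_SK_STORAGE_GET_F_CREATE);
	if (stat)
		stat->rx_segs++;	/* in-place modification, no update() */

	return 1;
}
```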
On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote: > On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote: > > > > > > > > On 1/3/23 10:30 PM, Hou Tao wrote: > > > Hi, > > > > > > On 1/4/2023 2:10 PM, Yonghong Song wrote: > > >> > > >> > > >> On 1/3/23 5:47 AM, Hou Tao wrote: > > >>> Hi, > > >>> > > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote: > > >>>> > > >>>> > > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: > > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: > > >>>>>> From: Hou Tao <houtao1@huawei.com> > > >>>>>> > > >>>>>> Hi, > > >>>>>> > > >>>>>> The patchset tries to fix the problems found when checking how htab map > > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of > > >>>>>> freed elements may lead to two problems in htab map: > > >>>>>> > > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in > > >>>>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK > > >>>>>> flag which acquires bpf-spin-lock during value copying. The > > >>>>>> corruption of bpf-spin-lock may result in hard lock-up. > > >>>>>> (2) lookup procedure may get incorrect map value if the found element is > > >>>>>> freed and then reused. > > >>>>>> > > >>>>>> Because the type of htab map elements are the same, so problem #1 can be > > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes > > >>>>>> these special fields in map element only when the map element is newly > > >>>>>> allocated. If it is just a reused element, there will be no > > >>>>>> reinitialization. > > >>>>> > > >>>>> Instead of adding the overhead of ctor callback let's just > > >>>>> add __GFP_ZERO to flags in __alloc(). > > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just > > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default > > >>>>> will behave the same way. > > >>>> > > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ > > >>>> tried to address a similar issue for lru hash table. > > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for > > >>>> hash table? > > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? > > >> > > >> The following is my understanding: > > >> in function alloc_htab_elem() (hashtab.c), we have > > >> > > >> if (is_map_full(htab)) > > >> if (!old_elem) > > >> /* when map is full and update() is replacing > > >> * old element, it's ok to allocate, since > > >> * old element will be freed immediately. > > >> * Otherwise return an error > > >> */ > > >> return ERR_PTR(-E2BIG); > > >> inc_elem_count(htab); > > >> l_new = bpf_mem_cache_alloc(&htab->ma); > > >> if (!l_new) { > > >> l_new = ERR_PTR(-ENOMEM); > > >> goto dec_count; > > >> } > > >> check_and_init_map_value(&htab->map, > > >> l_new->key + round_up(key_size, 8)); > > >> > > >> In the above check_and_init_map_value() intends to do initializing > > >> for an element from bpf_mem_cache_alloc (could be reused from the free list). 
> > >> > > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h) > > >> > > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) > > >> { > > >> int i; > > >> > > >> if (!foffs) > > >> return; > > >> for (i = 0; i < foffs->cnt; i++) > > >> memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); > > >> } > > >> > > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst) > > >> { > > >> bpf_obj_init(map->field_offs, dst); > > >> } > > >> > > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, > > >> list_head, list_node, etc. > > >> > > >> This is the problem for above problem #1. > > >> Maybe I missed something? > > > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the > > > problem by only calling check_and_init_map_value() once for the newly-allocated > > > element, so if a freed element is reused, its special fields will not be zeroed > > > again. Is there any other cases which are not covered by the solution or any > > > other similar problems in hash-tab ? > > > > No, I checked all cases of check_and_init_map_value() and didn't find > > any other instances. > > check_and_init_map_value() is called in two other cases: > lookup_and_delete[_batch]. > There the zeroing of the fields is necessary because the 'value' > is a temp buffer that is going to be copied to user space. > I think the way forward is to add GFP_ZERO to mem_alloc > (to make it equivalent to prealloc), remove one case > of check_and_init_map_value from hashmap, add short comments > to two other cases and add a big comment to check_and_init_map_value() > that should say that 'dst' must be a temp buffer and should not > point to memory that could be used in parallel by a bpf prog. > It feels like we've dealt with this issue a couple times already > and keep repeating this mistake, so the more comments the better. Hou, are you plannning to resubmit this change? I also hit this while testing my changes on bpf-next.
On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote: > > On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote: > > > > > > > > > > > > On 1/3/23 10:30 PM, Hou Tao wrote: > > > > Hi, > > > > > > > > On 1/4/2023 2:10 PM, Yonghong Song wrote: > > > >> > > > >> > > > >> On 1/3/23 5:47 AM, Hou Tao wrote: > > > >>> Hi, > > > >>> > > > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote: > > > >>>> > > > >>>> > > > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: > > > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: > > > >>>>>> From: Hou Tao <houtao1@huawei.com> > > > >>>>>> > > > >>>>>> Hi, > > > >>>>>> > > > >>>>>> The patchset tries to fix the problems found when checking how htab map > > > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of > > > >>>>>> freed elements may lead to two problems in htab map: > > > >>>>>> > > > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in > > > >>>>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK > > > >>>>>> flag which acquires bpf-spin-lock during value copying. The > > > >>>>>> corruption of bpf-spin-lock may result in hard lock-up. > > > >>>>>> (2) lookup procedure may get incorrect map value if the found element is > > > >>>>>> freed and then reused. > > > >>>>>> > > > >>>>>> Because the type of htab map elements are the same, so problem #1 can be > > > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes > > > >>>>>> these special fields in map element only when the map element is newly > > > >>>>>> allocated. If it is just a reused element, there will be no > > > >>>>>> reinitialization. > > > >>>>> > > > >>>>> Instead of adding the overhead of ctor callback let's just > > > >>>>> add __GFP_ZERO to flags in __alloc(). > > > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just > > > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default > > > >>>>> will behave the same way. > > > >>>> > > > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ > > > >>>> tried to address a similar issue for lru hash table. > > > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for > > > >>>> hash table? > > > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? > > > >> > > > >> The following is my understanding: > > > >> in function alloc_htab_elem() (hashtab.c), we have > > > >> > > > >> if (is_map_full(htab)) > > > >> if (!old_elem) > > > >> /* when map is full and update() is replacing > > > >> * old element, it's ok to allocate, since > > > >> * old element will be freed immediately. > > > >> * Otherwise return an error > > > >> */ > > > >> return ERR_PTR(-E2BIG); > > > >> inc_elem_count(htab); > > > >> l_new = bpf_mem_cache_alloc(&htab->ma); > > > >> if (!l_new) { > > > >> l_new = ERR_PTR(-ENOMEM); > > > >> goto dec_count; > > > >> } > > > >> check_and_init_map_value(&htab->map, > > > >> l_new->key + round_up(key_size, 8)); > > > >> > > > >> In the above check_and_init_map_value() intends to do initializing > > > >> for an element from bpf_mem_cache_alloc (could be reused from the free list). 
> > > >> > > > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h) > > > >> > > > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) > > > >> { > > > >> int i; > > > >> > > > >> if (!foffs) > > > >> return; > > > >> for (i = 0; i < foffs->cnt; i++) > > > >> memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); > > > >> } > > > >> > > > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst) > > > >> { > > > >> bpf_obj_init(map->field_offs, dst); > > > >> } > > > >> > > > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, > > > >> list_head, list_node, etc. > > > >> > > > >> This is the problem for above problem #1. > > > >> Maybe I missed something? > > > > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the > > > > problem by only calling check_and_init_map_value() once for the newly-allocated > > > > element, so if a freed element is reused, its special fields will not be zeroed > > > > again. Is there any other cases which are not covered by the solution or any > > > > other similar problems in hash-tab ? > > > > > > No, I checked all cases of check_and_init_map_value() and didn't find > > > any other instances. > > > > check_and_init_map_value() is called in two other cases: > > lookup_and_delete[_batch]. > > There the zeroing of the fields is necessary because the 'value' > > is a temp buffer that is going to be copied to user space. > > I think the way forward is to add GFP_ZERO to mem_alloc > > (to make it equivalent to prealloc), remove one case > > of check_and_init_map_value from hashmap, add short comments > > to two other cases and add a big comment to check_and_init_map_value() > > that should say that 'dst' must be a temp buffer and should not > > point to memory that could be used in parallel by a bpf prog. > > It feels like we've dealt with this issue a couple times already > > and keep repeating this mistake, so the more comments the better. > > Hou, are you plannning to resubmit this change? I also hit this while testing my > changes on bpf-next. Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? The former will take a long time to settle. The latter is trivial. To unblock yourself just add GFP_ZERO in an extra patch?
On 2/11/2023 5:06 AM, Alexei Starovoitov wrote: > On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi > <memxor@gmail.com> wrote: >> On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote: >>> On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote: >>>> >>>> >>>> On 1/3/23 10:30 PM, Hou Tao wrote: >>>>> Hi, >>>>> >>>>> On 1/4/2023 2:10 PM, Yonghong Song wrote: >>>>>> >>>>>> On 1/3/23 5:47 AM, Hou Tao wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On 1/2/2023 2:48 AM, Yonghong Song wrote: >>>>>>>> >>>>>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote: >>>>>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote: >>>>>>>>>> From: Hou Tao <houtao1@huawei.com> >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> The patchset tries to fix the problems found when checking how htab map >>>>>>>>>> handles element reuse in bpf memory allocator. The immediate reuse of >>>>>>>>>> freed elements may lead to two problems in htab map: >>>>>>>>>> >>>>>>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in >>>>>>>>>> htab map value and it may corrupt lookup procedure with BFP_F_LOCK >>>>>>>>>> flag which acquires bpf-spin-lock during value copying. The >>>>>>>>>> corruption of bpf-spin-lock may result in hard lock-up. >>>>>>>>>> (2) lookup procedure may get incorrect map value if the found element is >>>>>>>>>> freed and then reused. >>>>>>>>>> >>>>>>>>>> Because the type of htab map elements are the same, so problem #1 can be >>>>>>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes >>>>>>>>>> these special fields in map element only when the map element is newly >>>>>>>>>> allocated. If it is just a reused element, there will be no >>>>>>>>>> reinitialization. >>>>>>>>> Instead of adding the overhead of ctor callback let's just >>>>>>>>> add __GFP_ZERO to flags in __alloc(). >>>>>>>>> That will address the issue 1 and will make bpf_mem_alloc behave just >>>>>>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default >>>>>>>>> will behave the same way. >>>>>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/ >>>>>>>> tried to address a similar issue for lru hash table. >>>>>>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for >>>>>>>> hash table? >>>>>>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ? >>>>>> The following is my understanding: >>>>>> in function alloc_htab_elem() (hashtab.c), we have >>>>>> >>>>>> if (is_map_full(htab)) >>>>>> if (!old_elem) >>>>>> /* when map is full and update() is replacing >>>>>> * old element, it's ok to allocate, since >>>>>> * old element will be freed immediately. >>>>>> * Otherwise return an error >>>>>> */ >>>>>> return ERR_PTR(-E2BIG); >>>>>> inc_elem_count(htab); >>>>>> l_new = bpf_mem_cache_alloc(&htab->ma); >>>>>> if (!l_new) { >>>>>> l_new = ERR_PTR(-ENOMEM); >>>>>> goto dec_count; >>>>>> } >>>>>> check_and_init_map_value(&htab->map, >>>>>> l_new->key + round_up(key_size, 8)); >>>>>> >>>>>> In the above check_and_init_map_value() intends to do initializing >>>>>> for an element from bpf_mem_cache_alloc (could be reused from the free list). 
>>>>>> >>>>>> The check_and_init_map_value() looks like below (in include/linux/bpf.h) >>>>>> >>>>>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj) >>>>>> { >>>>>> int i; >>>>>> >>>>>> if (!foffs) >>>>>> return; >>>>>> for (i = 0; i < foffs->cnt; i++) >>>>>> memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]); >>>>>> } >>>>>> >>>>>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst) >>>>>> { >>>>>> bpf_obj_init(map->field_offs, dst); >>>>>> } >>>>>> >>>>>> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer, >>>>>> list_head, list_node, etc. >>>>>> >>>>>> This is the problem for above problem #1. >>>>>> Maybe I missed something? >>>>> Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the >>>>> problem by only calling check_and_init_map_value() once for the newly-allocated >>>>> element, so if a freed element is reused, its special fields will not be zeroed >>>>> again. Is there any other cases which are not covered by the solution or any >>>>> other similar problems in hash-tab ? >>>> No, I checked all cases of check_and_init_map_value() and didn't find >>>> any other instances. >>> check_and_init_map_value() is called in two other cases: >>> lookup_and_delete[_batch]. >>> There the zeroing of the fields is necessary because the 'value' >>> is a temp buffer that is going to be copied to user space. >>> I think the way forward is to add GFP_ZERO to mem_alloc >>> (to make it equivalent to prealloc), remove one case >>> of check_and_init_map_value from hashmap, add short comments >>> to two other cases and add a big comment to check_and_init_map_value() >>> that should say that 'dst' must be a temp buffer and should not >>> point to memory that could be used in parallel by a bpf prog. >>> It feels like we've dealt with this issue a couple times already >>> and keep repeating this mistake, so the more comments the better. >> Hou, are you plannning to resubmit this change? I also hit this while testing my >> changes on bpf-next. > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? > The former will take a long time to settle. > The latter is trivial. > To unblock yourself just add GFP_ZERO in an extra patch? Sorry for the long delay. Just find find out time to do some tests to compare the performance of bzero and ctor. After it is done, will resubmit on next week. > .
On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: > >> Hou, are you plannning to resubmit this change? I also hit this while testing my > >> changes on bpf-next. > > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? > > The former will take a long time to settle. > > The latter is trivial. > > To unblock yourself just add GFP_ZERO in an extra patch? > Sorry for the long delay. Just find find out time to do some tests to compare > the performance of bzero and ctor. After it is done, will resubmit on next week. I still don't like ctor as a concept. In general the callbacks in the critical path are guaranteed to be slow due to retpoline overhead. Please send a patch to add GFP_ZERO. Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace. This approach will work for inner nodes of qptrie, since bpf progs never see pointers to them. It will work for local storage converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu. It's also safe without uaf caveat in sleepable progs and sleepable progs can use explicit bpf_rcu_read_lock() when they want to avoid uaf. So please respin the set with rcu gp only and that new flag.
On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: > > >> Hou, are you plannning to resubmit this change? I also hit this while testing my > > >> changes on bpf-next. > > > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? > > > The former will take a long time to settle. > > > The latter is trivial. > > > To unblock yourself just add GFP_ZERO in an extra patch? > > Sorry for the long delay. Just find find out time to do some tests to compare > > the performance of bzero and ctor. After it is done, will resubmit on next week. > > I still don't like ctor as a concept. In general the callbacks in the critical > path are guaranteed to be slow due to retpoline overhead. > Please send a patch to add GFP_ZERO. > > Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable > without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace. > This approach will work for inner nodes of qptrie, since bpf progs > never see pointers to them. It will work for local storage > converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu. > It's also safe without uaf caveat in sleepable progs and sleepable progs I meant 'safe with uaf caveat'. Safe because we wait for rcu_task_trace later before returning to kernel memory. > can use explicit bpf_rcu_read_lock() when they want to avoid uaf. > So please respin the set with rcu gp only and that new flag.
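To make the "safe with uaf caveat" trade-off concrete, here is a rough program-side sketch of the explicit opt-out a sleepable program could use, written against the kfunc declarations used in the BPF selftests. The section name, the map and what exactly may be dereferenced inside the region are illustrative assumptions, not a statement of current verifier behaviour.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 128);
	__type(key, u32);
	__type(value, u64);
} counters SEC(".maps");

SEC("fentry.s/__x64_sys_getpgid")
int BPF_PROG(sleepable_reader)
{
	u32 key = 0;
	u64 *val;

	/* Inside the explicit RCU read-side section a reuse-after-rcu-gp
	 * allocator cannot recycle the element under us, so the access
	 * below is not subject to the uaf caveat. Without the lock the
	 * program would still be memory-safe, but could observe a value
	 * that has already been reused by another update.
	 */
	bpf_rcu_read_lock();
	val = bpf_map_lookup_elem(&counters, &key);
	if (val)
		__sync_fetch_and_add(val, 1);
	bpf_rcu_read_unlock();
	return 0;
}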
On 2/11/23 8:34 AM, Alexei Starovoitov wrote: > On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my >>>>> changes on bpf-next. >>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? >>>> The former will take a long time to settle. >>>> The latter is trivial. >>>> To unblock yourself just add GFP_ZERO in an extra patch? >>> Sorry for the long delay. Just find find out time to do some tests to compare >>> the performance of bzero and ctor. After it is done, will resubmit on next week. >> >> I still don't like ctor as a concept. In general the callbacks in the critical >> path are guaranteed to be slow due to retpoline overhead. >> Please send a patch to add GFP_ZERO. >> >> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable >> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace. >> This approach will work for inner nodes of qptrie, since bpf progs >> never see pointers to them. It will work for local storage >> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu. >> It's also safe without uaf caveat in sleepable progs and sleepable progs > > I meant 'safe with uaf caveat'. > Safe because we wait for rcu_task_trace later before returning to kernel memory. For local storage, when its owner (sk/task/inode/cgrp) is going away, the memory can be reused immediately. No rcu gp is needed. The local storage delete case (eg. bpf_sk_storage_delete) is the only one that needs to be freed by tasks_trace gp because another bpf prog (reader) may be under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp can be extended to the local storage delete case. I think we can extend the assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock() when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also. I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and the BPF_REUSE_AFTER_RCU_GP idea. Probably will get the GFP_ZERO out first. > >> can use explicit bpf_rcu_read_lock() when they want to avoid uaf. >> So please respin the set with rcu gp only and that new flag.
Hi, On 2/12/2023 12:33 AM, Alexei Starovoitov wrote: > On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>> Hou, are you plannning to resubmit this change? I also hit this while testing my >>>> changes on bpf-next. >>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? >>> The former will take a long time to settle. >>> The latter is trivial. >>> To unblock yourself just add GFP_ZERO in an extra patch? >> Sorry for the long delay. Just find find out time to do some tests to compare >> the performance of bzero and ctor. After it is done, will resubmit on next week. > I still don't like ctor as a concept. In general the callbacks in the critical > path are guaranteed to be slow due to retpoline overhead. > Please send a patch to add GFP_ZERO.

I see. Will do. But I think it is better to know the coarse overhead of these two methods first, so I hacked map_perf_test to support a customizable value size for hash_map_alloc and ran some benchmarks to show the overheads of ctor and GFP_ZERO. The benchmarks were run on a KVM VM with 8 CPUs. When the number of allocated elements is small, the overheads of ctor and bzero are basically the same, but when the number of allocated elements increases (e.g., the table is half full), the overhead of ctor becomes bigger. For big value sizes, the overheads of ctor and bzero are also basically the same, apparently because the main overhead comes from slab allocation. The detailed results are:

* ./map_perf_test 4 8 8192 10000 $value_size
  The key of the htab is the thread pid, so only 8 elements are allocated.

  | value_size | 8      | 256    | 4K     | 16K   | 64K   | 256K |
  | --         | --     | --     | --     | --    | --    | --   |
  | base       | 256604 | 261112 | 173646 | 74195 | 23138 | 6275 |
  | bzero      | 253362 | 257563 | 171445 | 73303 | 22949 | 6249 |
  | ctor       | 264570 | 258246 | 175048 | 72511 | 23004 | 6270 |

* ./map_perf_test 4 8 8192 100 $value_size
  The key is still the thread pid, so only 8 elements are allocated. The loop count is decreased to 100 to show the overhead of the first allocation.

  | value_size | 8      | 256    | 4K    | 16K   | 64K   | 256K |
  | --         | --     | --     | --    | --    | --    | --   |
  | base       | 135662 | 137742 | 87043 | 36265 | 12501 | 4450 |
  | bzero      | 139993 | 134920 | 94570 | 37190 | 12543 | 4131 |
  | ctor       | 147949 | 141825 | 94321 | 38240 | 13131 | 4248 |

* ./map_perf_test 4 8 8192 1000 $value_size
  Create 512 different keys per thread, so the hash table will be half full. Also decrease the loop count to 1K.

  | value_size | 8    | 256  | 4K   | 16K | 64K | 256K |
  | --         | --   | --   | --   | --  | --  | --   |
  | base       | 4234 | 4289 | 1478 | 510 | 168 | 46   |
  | bzero      | 3792 | 4002 | 1473 | 515 | 161 | 37   |
  | ctor       | 3846 | 2198 | 1269 | 499 | 161 | 42   |

* ./map_perf_test 4 8 8192 100 $value_size
  Create 512 different keys per thread, so the hash table will be half full. Also decrease the loop count to 100.

  | value_size | 8    | 256  | 4K   | 16K | 64K | 256K |
  | --         | --   | --   | --   | --  | --  | --   |
  | base       | 3669 | 3419 | 1272 | 476 | 168 | 44   |
  | bzero      | 3468 | 3499 | 1274 | 476 | 150 | 36   |
  | ctor       | 2235 | 2312 | 1128 | 452 | 145 | 35   |

> > Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable > without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace. > This approach will work for inner nodes of qptrie, since bpf progs > never see pointers to them. It will work for local storage > converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
> It's also safe without uaf caveat in sleepable progs and sleepable progs > can use explicit bpf_rcu_read_lock() when they want to avoid uaf. > So please respin the set with rcu gp only and that new flag. > .
On Tue, Feb 14, 2023 at 6:36 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 2/12/2023 12:33 AM, Alexei Starovoitov wrote: > > On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: > >>>> Hou, are you plannning to resubmit this change? I also hit this while testing my > >>>> changes on bpf-next. > >>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? > >>> The former will take a long time to settle. > >>> The latter is trivial. > >>> To unblock yourself just add GFP_ZERO in an extra patch? > >> Sorry for the long delay. Just find find out time to do some tests to compare > >> the performance of bzero and ctor. After it is done, will resubmit on next week. > > I still don't like ctor as a concept. In general the callbacks in the critical > > path are guaranteed to be slow due to retpoline overhead. > > Please send a patch to add GFP_ZERO. > I see. Will do. But i think it is better to know the coarse overhead of these > two methods, so I hack map_perf_test to support customizable value size for > hash_map_alloc and do some benchmarks to show the overheads of ctor and > GFP_ZERO. These benchmark are conducted on a KVM-VM with 8-cpus, it seems when > the number of allocated elements is small, the overheads of ctor and bzero are > basically the same, but when the number of allocated element increases (e.g., > half full), the overhead of ctor will be bigger. For big value size, the > overhead of ctor and zero are basically the same, and it seems due to the main > overhead comes from slab allocation. The following is the detailed results: and with retpoline?
Hi, On 2/15/2023 10:42 AM, Alexei Starovoitov wrote: > On Tue, Feb 14, 2023 at 6:36 PM Hou Tao <houtao@huaweicloud.com> wrote: >> Hi, >> >> On 2/12/2023 12:33 AM, Alexei Starovoitov wrote: >>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my >>>>>> changes on bpf-next. >>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? >>>>> The former will take a long time to settle. >>>>> The latter is trivial. >>>>> To unblock yourself just add GFP_ZERO in an extra patch? >>>> Sorry for the long delay. Just find find out time to do some tests to compare >>>> the performance of bzero and ctor. After it is done, will resubmit on next week. >>> I still don't like ctor as a concept. In general the callbacks in the critical >>> path are guaranteed to be slow due to retpoline overhead. >>> Please send a patch to add GFP_ZERO. >> I see. Will do. But i think it is better to know the coarse overhead of these >> two methods, so I hack map_perf_test to support customizable value size for >> hash_map_alloc and do some benchmarks to show the overheads of ctor and >> GFP_ZERO. These benchmark are conducted on a KVM-VM with 8-cpus, it seems when >> the number of allocated elements is small, the overheads of ctor and bzero are >> basically the same, but when the number of allocated element increases (e.g., >> half full), the overhead of ctor will be bigger. For big value size, the >> overhead of ctor and zero are basically the same, and it seems due to the main >> overhead comes from slab allocation. The following is the detailed results: > and with retpoline?

Yes, forgot to mention that. The following is the output of the vulnerabilities directory:

# cd /sys/devices/system/cpu/vulnerabilities
# grep . *
itlb_multihit:Processor vulnerable
l1tf:Mitigation: PTE Inversion
mds:Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
meltdown:Mitigation: PTI
mmio_stale_data:Unknown: No mitigations
retbleed:Not affected
spec_store_bypass:Vulnerable
spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
spectre_v2:Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS: Not affected
srbds:Not affected
tsx_async_abort:Not affected
Hi, On 2/15/2023 9:54 AM, Martin KaFai Lau wrote: > On 2/11/23 8:34 AM, Alexei Starovoitov wrote: >> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >>> >>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>>>> Hou, are you plannning to resubmit this change? I also hit this while >>>>>> testing my >>>>>> changes on bpf-next. >>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? >>>>> The former will take a long time to settle. >>>>> The latter is trivial. >>>>> To unblock yourself just add GFP_ZERO in an extra patch? >>>> Sorry for the long delay. Just find find out time to do some tests to compare >>>> the performance of bzero and ctor. After it is done, will resubmit on next >>>> week. >>> >>> I still don't like ctor as a concept. In general the callbacks in the critical >>> path are guaranteed to be slow due to retpoline overhead. >>> Please send a patch to add GFP_ZERO. >>> >>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable >>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace. >>> This approach will work for inner nodes of qptrie, since bpf progs >>> never see pointers to them. It will work for local storage >>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu. >>> It's also safe without uaf caveat in sleepable progs and sleepable progs >> >> I meant 'safe with uaf caveat'. >> Safe because we wait for rcu_task_trace later before returning to kernel memory. For qp-trie, I had added reuse checking for qp-trie inner node by adding a version in both child pointer and child node itself and it seemed works, but using BPF_REUSE_AFTER_RCU_GP for inner node will be much simpler for the implementation. And it seems for qp-trie leaf node, BPF_REUSE_AFTER_RCU_GP is needed as well, else the value returned to caller in bpf program or syscall may be reused just like hash-table during its usage. We can change qp-trie to act as a set only (e.g., no value), but that will limit its usage scenario. Maybe requiring the caller to use bpf_rcu_read_lock() is solution as well. What do you think ? > > For local storage, when its owner (sk/task/inode/cgrp) is going away, the > memory can be reused immediately. No rcu gp is needed. Now it seems it will wait for RCU GP and i think it is still necessary, because when the process exits, other processes may still access the local storage through pidfd or task_struct of the exited process. > > The local storage delete case (eg. bpf_sk_storage_delete) is the only one that > needs to be freed by tasks_trace gp because another bpf prog (reader) may be > under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on > allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp > can be extended to the local storage delete case. I think we can extend the > assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock() > when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also. > It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not only bpf_task_storage_get(), but also the dereferences of the returned local storage ptr, right ? I think qp-trie may also need this. > I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and > the BPF_REUSE_AFTER_RCU_GP idea. Probably will get the GFP_ZERO out first. I will continue work on this patchset for GFP_ZERO and reuse flag. 
Do you mean that you want to work together to implement BPF_REUSE_AFTER_RCU_GP? If so, how should we coordinate to accomplish that? > >> >>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf. >>> So please respin the set with rcu gp only and that new flag.
On 2/14/23 8:02 PM, Hou Tao wrote: >> For local storage, when its owner (sk/task/inode/cgrp) is going away, the >> memory can be reused immediately. No rcu gp is needed. > Now it seems it will wait for RCU GP and i think it is still necessary, because > when the process exits, other processes may still access the local storage > through pidfd or task_struct of the exited process. When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0 and will be kfree immediately next. eg. bpf_sk_storage_free is called just before the sk is about to be kfree. No bpf prog should have a hold on this sk. The same should go for the task. The current rcu gp waiting during bpf_{sk,task,cgrp...}_storage_free is because the racing with the map destruction bpf_local_storage_map_free(). >> >> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that >> needs to be freed by tasks_trace gp because another bpf prog (reader) may be >> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on >> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp >> can be extended to the local storage delete case. I think we can extend the >> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock() >> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also. >> > It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not > only bpf_task_storage_get(), but also the dereferences of the returned local > storage ptr, right ? I think qp-trie may also need this. I think bpf_rcu_read_lock() is primarily for bpf prog. The bpf_{sk,task,...}_storage_get() internal is easier to handle and probably will need to do its own rcu_read_lock() instead of depending on the bpf prog doing the bpf_rcu_read_lock() because the bpf prog may decide uaf is fine. >> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and >> the BPF_REUSE_AFTER_RCU_GP idea. Probably will get the GFP_ZERO out first. > I will continue work on this patchset for GFP_ZERO and reuse flag. Do you mean > that you want to work together to implement BPF_REUSE_AFTER_RCU_GP ? How do we > cooperate together to accomplish that ? Please submit the GFP_ZERO patch first. Kumar and I can use it immediately. I have been hacking to make bpf's memalloc safe for the bpf_{sk,task,cgrp..}_storage_delete() and this safe-on-reuse piece still need works. The whole thing is getting pretty long, so my current plan is to put the safe-on-reuse piece aside for now, focus back on the immediate goal and make the common case deadlock free first. Meaning the bpf_*_storage_get(BPF_*_STORAGE_GET_F_CREATE) and the bpf_*_storage_free() will use the bpf_mem_cache_{alloc,free}. The bpf_*_storage_delete() will stay as-is to go through the call_rcu_tasks_trace() for now since delete is not the common use case. In parallel, if you can post the BPF_REUSE_AFTER_RCU_GP, we can discuss based on your work. That should speed up the progress. If I finished the immediate goal for local storage and this piece is still pending, I will ping you first. Thoughts?
Hi, On 2/15/2023 3:22 PM, Martin KaFai Lau wrote: > On 2/14/23 8:02 PM, Hou Tao wrote: >>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the >>> memory can be reused immediately. No rcu gp is needed. >> Now it seems it will wait for RCU GP and i think it is still necessary, because >> when the process exits, other processes may still access the local storage >> through pidfd or task_struct of the exited process. > > When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0 > and will be kfree immediately next. eg. bpf_sk_storage_free is called just > before the sk is about to be kfree. No bpf prog should have a hold on this sk. > The same should go for the task. A bpf syscall may have already found the task local storage through a pidfd, then the target task exits and the local storage is free immediately, then bpf syscall starts to copy the local storage and there will be a UAF, right ? Did I missing something here ? > > The current rcu gp waiting during bpf_{sk,task,cgrp...}_storage_free is > because the racing with the map destruction bpf_local_storage_map_free(). > >>> >>> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that >>> needs to be freed by tasks_trace gp because another bpf prog (reader) may be >>> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on >>> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp >>> can be extended to the local storage delete case. I think we can extend the >>> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock() >>> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also. >>> >> It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not >> only bpf_task_storage_get(), but also the dereferences of the returned local >> storage ptr, right ? I think qp-trie may also need this. > > I think bpf_rcu_read_lock() is primarily for bpf prog. Yes. I mean the bpf program which uses qp-trie will need bpf_rcu_read_lock(). > > The bpf_{sk,task,...}_storage_get() internal is easier to handle and probably > will need to do its own rcu_read_lock() instead of depending on the bpf prog > doing the bpf_rcu_read_lock() because the bpf prog may decide uaf is fine. > >>> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and >>> the BPF_REUSE_AFTER_RCU_GP idea. Probably will get the GFP_ZERO out first. >> I will continue work on this patchset for GFP_ZERO and reuse flag. Do you mean >> that you want to work together to implement BPF_REUSE_AFTER_RCU_GP ? How do we >> cooperate together to accomplish that ? > Please submit the GFP_ZERO patch first. Kumar and I can use it immediately. > > I have been hacking to make bpf's memalloc safe for the > bpf_{sk,task,cgrp..}_storage_delete() and this safe-on-reuse piece still need > works. The whole thing is getting pretty long, so my current plan is to put > the safe-on-reuse piece aside for now, focus back on the immediate goal and > make the common case deadlock free first. Meaning the > bpf_*_storage_get(BPF_*_STORAGE_GET_F_CREATE) and the bpf_*_storage_free() > will use the bpf_mem_cache_{alloc,free}. The bpf_*_storage_delete() will stay > as-is to go through the call_rcu_tasks_trace() for now since delete is not the > common use case. > > In parallel, if you can post the BPF_REUSE_AFTER_RCU_GP, we can discuss based > on your work. That should speed up the progress. 
> If I finished the immediate goal for local storage and this piece is still pending, I will ping you first. Thoughts? I am fine with the proposal, thanks.
On 2/15/23 6:11 PM, Hou Tao wrote: >>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the >>>> memory can be reused immediately. No rcu gp is needed. >>> Now it seems it will wait for RCU GP and i think it is still necessary, because >>> when the process exits, other processes may still access the local storage >>> through pidfd or task_struct of the exited process. >> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0 >> and will be kfree immediately next. eg. bpf_sk_storage_free is called just >> before the sk is about to be kfree. No bpf prog should have a hold on this sk. >> The same should go for the task. > A bpf syscall may have already found the task local storage through a pidfd, > then the target task exits and the local storage is free immediately, then bpf > syscall starts to copy the local storage and there will be a UAF, right ? Did I > missing something here ? bpf syscall like bpf_pid_task_storage_lookup_elem and you meant __put_task_struct() will be called while the syscall's bpf_map_copy_value() is still under rcu_read_lock()?
Hi, On 2/16/2023 3:47 PM, Martin KaFai Lau wrote: > On 2/15/23 6:11 PM, Hou Tao wrote: >>>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the >>>>> memory can be reused immediately. No rcu gp is needed. >>>> Now it seems it will wait for RCU GP and i think it is still necessary, >>>> because >>>> when the process exits, other processes may still access the local storage >>>> through pidfd or task_struct of the exited process. >>> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0 >>> and will be kfree immediately next. eg. bpf_sk_storage_free is called just >>> before the sk is about to be kfree. No bpf prog should have a hold on this sk. >>> The same should go for the task. >> A bpf syscall may have already found the task local storage through a pidfd, >> then the target task exits and the local storage is free immediately, then bpf >> syscall starts to copy the local storage and there will be a UAF, right ? Did I >> missing something here ? > bpf syscall like bpf_pid_task_storage_lookup_elem and you meant > __put_task_struct() will be called while the syscall's bpf_map_copy_value() is > still under rcu_read_lock()? Yes, but now I see that it is impossible here. I missed that the task_struct is released through call_rcu(), so the call to __put_task_struct() must happen after bpf_map_copy_value() in the bpf syscall has completed. Thanks for the explanation.
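For readers following this subthread, the ordering that makes the syscall path safe can be sketched as below; the call chains are paraphrased from memory and are only meant to show why the RCU read-side section on the syscall side is enough, not to quote the exact code.

/*
 * Syscall side (roughly what bpf_map_copy_value() does for task storage):
 *
 *	rcu_read_lock();
 *	ptr = map->ops->map_lookup_elem(map, key);
 *	copy_map_value(map, value, ptr);	// copy into a temp buffer
 *	rcu_read_unlock();
 *
 * Task exit side:
 *
 *	release_task()
 *	  put_task_struct_rcu_user()
 *	    call_rcu(&task->rcu, delayed_put_task_struct)
 *	      -> after a full RCU grace period:
 *	         __put_task_struct()
 *	           bpf_task_storage_free(task)	// local storage goes away here
 *
 * Because the final put of the task_struct is deferred through call_rcu(),
 * it cannot run until every rcu_read_lock() section that found the task
 * (including the copy above) has finished, so the lookup cannot race with
 * the storage being freed.
 */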
Hi, On 2/12/2023 12:34 AM, Alexei Starovoitov wrote: > On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote: >>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my >>>>> changes on bpf-next. >>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc? >>>> The former will take a long time to settle. >>>> The latter is trivial. >>>> To unblock yourself just add GFP_ZERO in an extra patch? >>> Sorry for the long delay. Just find find out time to do some tests to compare >>> the performance of bzero and ctor. After it is done, will resubmit on next week. >> I still don't like ctor as a concept. In general the callbacks in the critical >> path are guaranteed to be slow due to retpoline overhead. >> Please send a patch to add GFP_ZERO. >> >> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable >> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace. >> This approach will work for inner nodes of qptrie, since bpf progs >> never see pointers to them. It will work for local storage >> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu. >> It's also safe without uaf caveat in sleepable progs and sleepable progs > I meant 'safe with uaf caveat'. > Safe because we wait for rcu_task_trace later before returning to kernel memory. > >> can use explicit bpf_rcu_read_lock() when they want to avoid uaf. >> So please respin the set with rcu gp only and that new flag. Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ? Its downside is that it will enforce sleep-able program to use bpf_rcu_read_{lock,unlock}() to access these returned pointers ? > .
On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote: > > Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ? The idea is for bpf_mem_free to wait normal RCU GP before adding the elements back to the free list and free the elem to global kernel memory only after both rcu and rcu_tasks_trace GPs as it's doing now. > Its downside is that it will enforce sleep-able program to use > bpf_rcu_read_{lock,unlock}() to access these returned pointers ? sleepable can access elems without kptrs/spin_locks even when not using rcu_read_lock, since it's safe, but there is uaf. Some progs might be fine with it. When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.
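A compressed sketch of that ordering follows, with hypothetical structure and helper names (bpf_reuse_elem, add_to_free_llist, should_return_to_kernel are not from any posted patch); it only serves to make the two grace periods explicit.

#include <linux/rcupdate.h>
#include <linux/rcupdate_trace.h>
#include <linux/slab.h>

struct bpf_reuse_elem {
	struct rcu_head rcu;
	/* object payload follows */
};

/* hypothetical helpers standing in for bpf_mem_alloc internals */
extern void add_to_free_llist(struct bpf_reuse_elem *e);
extern bool should_return_to_kernel(struct bpf_reuse_elem *e);

static void elem_free_tasks_trace_cb(struct rcu_head *rcu)
{
	/* Both a normal RCU GP and a tasks-trace RCU GP have passed:
	 * the memory may now go back to the kernel allocator.
	 */
	kfree(container_of(rcu, struct bpf_reuse_elem, rcu));
}

static void elem_reuse_rcu_cb(struct rcu_head *rcu)
{
	struct bpf_reuse_elem *e = container_of(rcu, struct bpf_reuse_elem, rcu);

	if (should_return_to_kernel(e)) {
		/* also wait for sleepable (rcu_tasks_trace) readers */
		call_rcu_tasks_trace(&e->rcu, elem_free_tasks_trace_cb);
	} else {
		/* One normal RCU GP has passed: the element may be handed
		 * back to the free list for reuse. Sleepable programs not
		 * under bpf_rcu_read_lock() may still observe it -- the
		 * "uaf caveat" discussed above.
		 */
		add_to_free_llist(e);
	}
}

/* BPF_REUSE_AFTER_RCU_GP-style free path */
static void bpf_elem_free(struct bpf_reuse_elem *e)
{
	call_rcu(&e->rcu, elem_reuse_rcu_cb);
}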
Hi, On 2/17/2023 12:35 AM, Alexei Starovoitov wrote: > On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote: >> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ? > The idea is for bpf_mem_free to wait normal RCU GP before adding > the elements back to the free list and free the elem to global kernel memory > only after both rcu and rcu_tasks_trace GPs as it's doing now. > >> Its downside is that it will enforce sleep-able program to use >> bpf_rcu_read_{lock,unlock}() to access these returned pointers ? > sleepable can access elems without kptrs/spin_locks > even when not using rcu_read_lock, since it's safe, but there is uaf. > Some progs might be fine with it. > When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.

Thanks for the explanation of BPF_REUSE_AFTER_RCU_GP. It seems BPF_REUSE_AFTER_RCU_GP may easily incur OOM: before one RCU GP expires, the freed elements are available to neither the bpf ma nor the slab subsystem, and after the GP expires they are only available to the single bpf ma that freed them. The number of freed elements may be too large for that one bpf ma, so part of them will be freed through call_rcu_tasks_trace(), and these freed-again elements will not be available to the slab subsystem until the tasks-trace RCU GP expires. In brief, after one RCU GP only part of the freed elements will be reused; the majority will still go through call_rcu_tasks_trace(). Because of that concern, I proposed BPF_FREE_AFTER_RCU_GP, which directly frees these elements after one RCU GP and enforces that sleepable programs use bpf_rcu_read_lock() to access them, but that enforcement would break existing sleepable programs, so BPF_FREE_AFTER_RCU_GP is still not a good idea. I will check whether there is still an OOM risk with BPF_REUSE_AFTER_RCU_GP and try to mitigate it if possible (e.g., by sharing the freed elements between all bpf ma instances instead of only the bpf ma which freed them).
On Thu, Feb 16, 2023 at 5:19 PM Hou Tao <houtao@huaweicloud.com> wrote: > > Hi, > > On 2/17/2023 12:35 AM, Alexei Starovoitov wrote: > > On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote: > >> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ? > > The idea is for bpf_mem_free to wait normal RCU GP before adding > > the elements back to the free list and free the elem to global kernel memory > > only after both rcu and rcu_tasks_trace GPs as it's doing now. > > > >> Its downside is that it will enforce sleep-able program to use > >> bpf_rcu_read_{lock,unlock}() to access these returned pointers ? > > sleepable can access elems without kptrs/spin_locks > > even when not using rcu_read_lock, since it's safe, but there is uaf. > > Some progs might be fine with it. > > When sleepable needs to avoid uaf they will use bpf_rcu_read_lock. > Thanks for the explanation for BPF_REUSE_AFTER_RCU_GP. It seems that > BPF_REUSE_AFTER_RCU_GP may incur OOM easily, because before the expiration of > one RCU GP, these freed elements will not available to both bpf ma or slab > subsystem and after the expiration of RCU GP, these freed elements are only > available for one bpf ma but the number of these freed elements maybe too many > for one bpf ma, so part of these freed elements will be freed through > call_rcu_tasks_trace() and these freed-again elements will not be available for > slab subsystem untill the expiration of tasks trace RCU. In brief, after one RCU > GP, part of these freed elements will be reused, but the majority of these > elements will still be freed through call_rcu_tasks_trace(). Due to the doubt > above, I proposed BPF_FREE_AFTER_RCU to directly free these elements after one > RCU GP and enforce sleepable program to use bpf_rcu_read_lock() to access these > elements, but the enforcement will break the existing sleepable programs, so > BPF_FREE_AFTER_GP is still not a good idea. I will check whether or not these is > still OOM risk for BPF_REUSE_AFTER_RCU_GP and try to mitigate if it is possible > (e.g., share these freed elements between all bpf ma instead of one bpf ma which > free it). Since BPF_REUSE_AFTER_RCU_GP is a new thing that will be used by qptrie map and maybe? local storage, there is no sleepable breakage. If we start using BPF_REUSE_AFTER_RCU_GP for hashmaps with kptrs and enforce bpf_rcu_read_lock() this is also ok, since kptrs are unstable. I prefer to avoid complicating bpf ma with sharing free lists across all ma-s. Unless this is really trivial code that is easy to review.
From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset tries to fix the problems found when checking how htab map handles element reuse in the bpf memory allocator. The immediate reuse of freed elements may lead to two problems in htab map:

(1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in the htab map value and may corrupt a lookup procedure using the BPF_F_LOCK flag, which acquires the bpf-spin-lock during value copying. The corruption of the bpf-spin-lock may result in a hard lock-up.

(2) the lookup procedure may get an incorrect map value if the found element is freed and then reused.

Because all elements of a htab map have the same type, problem #1 can be fixed by supporting a ctor in the bpf memory allocator. The ctor initializes these special fields in a map element only when the element is newly allocated; if it is a reused element, there is no reinitialization.

Problem #2 exists for both non-preallocated and preallocated htab maps. Adding a seq to the htab element and doing a reuse check plus a retry of the lookup procedure may be a feasible solution, but it would make the lookup API hard to use, because the user would need to check whether the found element had been reused and repeat the lookup if so. A simpler solution is to disable the reuse of freed elements and free these elements only after the lookup procedure ends.

In order to reduce the overhead of calling call_rcu_tasks_trace() for each freed element, the elements are freed in batches: freed elements are first moved into a global per-cpu free list, and once their number reaches a threshold they are moved into a dynamically allocated object and freed by a global per-cpu worker through call_rcu_tasks_trace(). Because this solution frees memory by allocating new memory, if no memory is available the global per-cpu worker calls rcu_barrier_tasks_trace() to wait for the expiration of the RCU grace period and then frees the elements which have been spliced into a temporary list. The newly freed elements are freed after another round of rcu_barrier_tasks_trace() if there is still no memory. It may be necessary to reserve some bpf_ma_free_batch objects to speed up the freeing. The current version also doesn't consider the scenario where the RCU grace period is slow: because the newly allocated memory (aka bpf_ma_free_batch) is only freed after the RCU grace period expires, a slow grace period may lead to too many bpf_ma_free_batch objects being allocated.

After applying BPF_MA_NO_REUSE to the htab map, the performance of "./map_perf_test 4 18 8192" drops from 520K to 330K events per second on one CPU. It is a big performance degradation, so I hope to get some feedback on whether it is necessary and how to better fix the reuse problem in htab map (globally allocated objects may have the same problems as htab map). Comments are always welcome.
Regards,
Hou

Hou Tao (6):
  bpf: Support ctor in bpf memory allocator
  bpf: Factor out a common helper free_llist()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator
  bpf: Use BPF_MA_NO_REUSE in htab map
  selftests/bpf: Add test case for element reuse in htab map

 include/linux/bpf_mem_alloc.h                 |  12 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/hashtab.c                          |  17 +-
 kernel/bpf/memalloc.c                         | 218 ++++++++++++++++--
 .../selftests/bpf/prog_tests/htab_reuse.c     | 111 +++++++++
 .../testing/selftests/bpf/progs/htab_reuse.c  |  19 ++
 6 files changed, 353 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_reuse.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_reuse.c
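To make the batching scheme described in the cover letter easier to follow, here is a condensed sketch of the free path it describes. It is reconstructed from the text above rather than taken from the patches, so the names (bpf_ma_free_batch, the threshold, the helpers) and the exact division of work between the free path and the per-cpu worker are assumptions; the rcu_barrier_tasks_trace() fallback for allocation failure is only hinted at in a comment.

#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/rcupdate_trace.h>
#include <linux/slab.h>

#define BPF_MA_FREE_BATCH_CNT	64	/* assumed threshold */

struct bpf_ma_free_batch {
	struct rcu_head rcu;
	struct llist_node *elems;	/* snapshot of the per-cpu free list */
};

static DEFINE_PER_CPU(struct llist_head, bpf_ma_free_list);
static DEFINE_PER_CPU(unsigned int, bpf_ma_free_cnt);

static void bpf_ma_batch_free_cb(struct rcu_head *rcu)
{
	struct bpf_ma_free_batch *batch =
		container_of(rcu, struct bpf_ma_free_batch, rcu);
	struct llist_node *pos, *n;

	/* A tasks-trace RCU GP has passed: no bpf program can still be
	 * looking at these elements, so really free them now.
	 */
	llist_for_each_safe(pos, n, batch->elems)
		kfree(pos);
	kfree(batch);
}

/* Called instead of putting a freed element back on the reuse list. */
static void bpf_ma_no_reuse_free(void *obj)
{
	struct llist_head *list = this_cpu_ptr(&bpf_ma_free_list);
	struct bpf_ma_free_batch *batch;

	/* As in bpf_mem_alloc today, the first bytes of a freed element
	 * are reused as a llist_node.
	 */
	llist_add((struct llist_node *)obj, list);
	if (this_cpu_inc_return(bpf_ma_free_cnt) < BPF_MA_FREE_BATCH_CNT)
		return;

	batch = kmalloc(sizeof(*batch), GFP_NOWAIT | __GFP_NOWARN);
	if (!batch)
		return;	/* worker falls back to rcu_barrier_tasks_trace() */

	this_cpu_write(bpf_ma_free_cnt, 0);
	batch->elems = llist_del_all(list);
	call_rcu_tasks_trace(&batch->rcu, bpf_ma_batch_free_cb);
}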