[RFC,bpf-next,0/6] bpf: Handle reuse in bpf memory alloc

Message ID 20221230041151.1231169-1-houtao@huaweicloud.com (mailing list archive)
Series bpf: Handle reuse in bpf memory alloc

Message

Hou Tao Dec. 30, 2022, 4:11 a.m. UTC
From: Hou Tao <houtao1@huawei.com>

Hi,

The patchset tries to fix the problems found when checking how the htab map
handles element reuse in the bpf memory allocator. The immediate reuse of
freed elements may lead to two problems in the htab map:

(1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in the
    htab map value, and it may corrupt a lookup done with the BPF_F_LOCK
    flag, which acquires the bpf_spin_lock while copying the value. The
    corruption of the bpf_spin_lock may result in a hard lock-up.
(2) a lookup may return an incorrect map value if the found element is
    freed and then reused during the lookup.

Because all elements of an htab map have the same type, problem #1 can be
fixed by supporting a ctor in the bpf memory allocator. The ctor initializes
the special fields in a map element only when the element is newly
allocated; a reused element is not reinitialized.
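
Below is a minimal sketch of the ctor idea, written as if it lived in
kernel/bpf/hashtab.c. The extra ctor argument to bpf_mem_alloc_init() and the
htab_elem_ctor() helper are names assumed here for illustration only; they are
not the exact API of the patches.

#include <linux/bpf.h>
#include <linux/bpf_mem_alloc.h>

/* Hypothetical ctor: runs only when a brand-new object is obtained from
 * the slab, never when an element is reused from the per-cpu free list,
 * so the special fields (bpf_spin_lock, bpf_timer, kptr, ...) are zeroed
 * exactly once per object.
 */
static void htab_elem_ctor(struct bpf_mem_alloc *ma, void *obj)
{
        struct bpf_htab *htab = container_of(ma, struct bpf_htab, ma);
        struct htab_elem *elem = obj;

        check_and_init_map_value(&htab->map,
                                 elem->key + round_up(htab->map.key_size, 8));
}

static int htab_init_mem_alloc(struct bpf_htab *htab)
{
        /* assumed extended signature: the last argument registers the ctor */
        return bpf_mem_alloc_init(&htab->ma, htab->elem_size,
                                  false /* percpu */, htab_elem_ctor);
}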

Problem #2 exists for both non-preallocated and preallocated htab maps.
Adding a seq count to each htab element, checking for reuse and retrying
the lookup may be a feasible solution, but it would make the lookup API
hard to use, because the user would need to check whether the found
element has been reused and repeat the lookup if so. A simpler solution
is to just disable the reuse of freed elements and free these elements
only after the lookup procedure ends.

To reduce the overhead of calling call_rcu_tasks_trace() for each freed
element, elements are freed in batches: freed elements are first moved
into a global per-cpu free list, and once the number of freed elements
reaches a threshold, they are moved into a dynamically allocated object
and freed by a global per-cpu worker through call_rcu_tasks_trace().

Because the solution frees memory by allocating new memory, if no memory
is available the global per-cpu worker calls rcu_barrier_tasks_trace() to
wait for the expiration of the RCU tasks trace grace period and then
frees the elements which have been spliced into a temporary list. If
there is still no memory, the newly freed elements will be freed after
another round of rcu_barrier_tasks_trace(). It may be necessary to
reserve some bpf_ma_free_batch objects to speed up the freeing. The
current code also does not consider the scenario where the RCU grace
period is slow: because the newly allocated memory (aka bpf_ma_free_batch)
is only freed after the grace period expires, a slow grace period may
lead to too many bpf_ma_free_batch objects being allocated.
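
For reference, the free path described above looks roughly like the sketch
below. The names (bpf_ma_free_batch, the per-cpu list, the worker) follow the
description, but the details are assumptions for illustration, not the actual
patches; it also assumes the first bytes of a freed object can be reused as a
struct llist_node, which is how the bpf memory allocator's free lists already
work.

#include <linux/llist.h>
#include <linux/workqueue.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* one batch of freed elements handed to call_rcu_tasks_trace() */
struct bpf_ma_free_batch {
        struct rcu_head rcu;
        struct llist_node *elems;
};

struct bpf_ma_free_cpu {
        struct llist_head free_llist;   /* freed elements waiting to be batched */
        atomic_t free_cnt;              /* compared against a threshold before
                                         * scheduling the worker
                                         */
        struct work_struct work;        /* the global per-cpu worker */
};

static void batch_free_rcu(struct rcu_head *rcu)
{
        struct bpf_ma_free_batch *batch =
                container_of(rcu, struct bpf_ma_free_batch, rcu);
        struct llist_node *pos, *t;

        /* RCU tasks trace readers are done with these elements now */
        llist_for_each_safe(pos, t, batch->elems)
                kfree(pos);
        kfree(batch);
}

static void bpf_ma_free_work(struct work_struct *work)
{
        struct bpf_ma_free_cpu *c = container_of(work, struct bpf_ma_free_cpu, work);
        struct llist_node *elems = llist_del_all(&c->free_llist);
        struct bpf_ma_free_batch *batch;

        batch = kmalloc(sizeof(*batch), GFP_KERNEL | __GFP_NOWARN);
        if (batch) {
                batch->elems = elems;
                /* one RCU callback per batch instead of one per element */
                call_rcu_tasks_trace(&batch->rcu, batch_free_rcu);
                return;
        }

        /* no memory for a batch: wait for the grace period and free inline */
        rcu_barrier_tasks_trace();
        while (elems) {
                struct llist_node *next = elems->next;

                kfree(elems);
                elems = next;
        }
}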

After applying BPF_MA_NO_REUSE to the htab map, the performance of
"./map_perf_test 4 18 8192" drops from 520K to 330K events per second on
one CPU. That is a big performance degradation, so I hope to get some
feedback on whether this is necessary and how to better fix the reuse
problem in the htab map (globally allocated objects may have the same
problems as the htab map). Comments are always welcome.

Regards,
Hou

Hou Tao (6):
  bpf: Support ctor in bpf memory allocator
  bpf: Factor out a common helper free_llist()
  bpf: Pass bitwise flags to bpf_mem_alloc_init()
  bpf: Introduce BPF_MA_NO_REUSE for bpf memory allocator
  bpf: Use BPF_MA_NO_REUSE in htab map
  selftests/bpf: Add test case for element reuse in htab map

 include/linux/bpf_mem_alloc.h                 |  12 +-
 kernel/bpf/core.c                             |   2 +-
 kernel/bpf/hashtab.c                          |  17 +-
 kernel/bpf/memalloc.c                         | 218 ++++++++++++++++--
 .../selftests/bpf/prog_tests/htab_reuse.c     | 111 +++++++++
 .../testing/selftests/bpf/progs/htab_reuse.c  |  19 ++
 6 files changed, 353 insertions(+), 26 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_reuse.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_reuse.c

Comments

Alexei Starovoitov Jan. 1, 2023, 1:26 a.m. UTC | #1
On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> The patchset tries to fix the problems found when checking how htab map
> handles element reuse in bpf memory allocator. The immediate reuse of
> freed elements may lead to two problems in htab map:
> 
> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>     flag which acquires bpf-spin-lock during value copying. The
>     corruption of bpf-spin-lock may result in hard lock-up.
> (2) lookup procedure may get incorrect map value if the found element is
>     freed and then reused.
> 
> Because the type of htab map elements are the same, so problem #1 can be
> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> these special fields in map element only when the map element is newly
> allocated. If it is just a reused element, there will be no
> reinitialization.

Instead of adding the overhead of a ctor callback let's just
add __GFP_ZERO to the flags in __alloc().
That will address issue 1 and will make bpf_mem_alloc behave just
like percpu_freelist, so hashmaps with BPF_F_NO_PREALLOC and the default
will behave the same way.
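
Concretely, that amounts to something like the following in
kernel/bpf/memalloc.c (a simplified sketch: the real __alloc() also has to
handle the per-cpu case, and the exact flags may differ from bpf-next):

static void *__alloc(struct bpf_mem_cache *c, int node)
{
        /* GFP_NOWAIT because this runs from irq_work; adding __GFP_ZERO
         * makes a freshly allocated element look like a preallocated one,
         * so bpf_spin_lock/bpf_timer fields start out zeroed.
         */
        gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT | __GFP_ZERO;

        return kmalloc_node(c->unit_size, flags, node);
}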

> Problem #2 exists for both non-preallocated and preallocated htab map.
> By adding seq in htab element, doing reuse check and retrying the
> lookup procedure may be a feasible solution, but it will make the
> lookup API being hard to use, because the user needs to check whether
> the found element is reused or not and repeat the lookup procedure if it
> is reused. A simpler solution would be just disabling freed elements
> reuse and freeing these elements after lookup procedure ends.

You've proposed this 'solution' twice already in the qptrie thread and both
times the answer was 'no, we cannot do this' with reasons explained.
The 3rd time the answer is still the same.
This 'issue 2' has existed in the hashmap since the very beginning, for many years.
It's a known quirk. There is nothing to fix really.

The graph apis (aka new gen data structs) with linked list and rbtree are
in active development. Soon bpf progs will be able to implement their own
hash maps with explicit bpf_rcu_read_lock. At that time the progs will
be making the trade-off between performance and the lookup/delete race.
So please respin with just __GFP_ZERO and update patch 6
to check for lockup only.
Yonghong Song Jan. 1, 2023, 6:48 p.m. UTC | #2
On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Hi,
>>
>> The patchset tries to fix the problems found when checking how htab map
>> handles element reuse in bpf memory allocator. The immediate reuse of
>> freed elements may lead to two problems in htab map:
>>
>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>      htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>      flag which acquires bpf-spin-lock during value copying. The
>>      corruption of bpf-spin-lock may result in hard lock-up.
>> (2) lookup procedure may get incorrect map value if the found element is
>>      freed and then reused.
>>
>> Because the type of htab map elements are the same, so problem #1 can be
>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>> these special fields in map element only when the map element is newly
>> allocated. If it is just a reused element, there will be no
>> reinitialization.
> 
> Instead of adding the overhead of ctor callback let's just
> add __GFP_ZERO to flags in __alloc().
> That will address the issue 1 and will make bpf_mem_alloc behave just
> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> will behave the same way.

Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
tried to address a similar issue for the LRU hash table.
Maybe we need to do similar things after bpf_mem_cache_alloc() for the
hash table?


> 
>> Problem #2 exists for both non-preallocated and preallocated htab map.
>> By adding seq in htab element, doing reuse check and retrying the
>> lookup procedure may be a feasible solution, but it will make the
>> lookup API being hard to use, because the user needs to check whether
>> the found element is reused or not and repeat the lookup procedure if it
>> is reused. A simpler solution would be just disabling freed elements
>> reuse and freeing these elements after lookup procedure ends.
> 
> You've proposed this 'solution' twice already in qptrie thread and both
> times the answer was 'no, we cannot do this' with reasons explained.
> The 3rd time the answer is still the same.
> This 'issue 2' existed in hashmap since very beginning for many years.
> It's a known quirk. There is nothing to fix really.
> 
> The graph apis (aka new gen data structs) with link list and rbtree are
> in active development. Soon bpf progs will be able to implement their own
> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> be making the trade off between performance and lookup/delete race.
> So please respin with just __GFP_ZERO and update the patch 6
> to check for lockup only.
Hou Tao Jan. 3, 2023, 1:40 p.m. UTC | #3
Hi,

On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Hi,
>>
>> The patchset tries to fix the problems found when checking how htab map
>> handles element reuse in bpf memory allocator. The immediate reuse of
>> freed elements may lead to two problems in htab map:
>>
>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>     flag which acquires bpf-spin-lock during value copying. The
>>     corruption of bpf-spin-lock may result in hard lock-up.
>> (2) lookup procedure may get incorrect map value if the found element is
>>     freed and then reused.
>>
>> Because the type of htab map elements are the same, so problem #1 can be
>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>> these special fields in map element only when the map element is newly
>> allocated. If it is just a reused element, there will be no
>> reinitialization.
> Instead of adding the overhead of ctor callback let's just
> add __GFP_ZERO to flags in __alloc().
> That will address the issue 1 and will make bpf_mem_alloc behave just
> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> will behave the same way.
Using __GFP_ZERO would be simpler, but the overhead of memset() for every
allocated object may be bigger than a ctor callback when the size of the
allocated object is large. It also introduces unnecessary memory zeroing if
there is no special field in the map value.

>> Problem #2 exists for both non-preallocated and preallocated htab map.
>> By adding seq in htab element, doing reuse check and retrying the
>> lookup procedure may be a feasible solution, but it will make the
>> lookup API being hard to use, because the user needs to check whether
>> the found element is reused or not and repeat the lookup procedure if it
>> is reused. A simpler solution would be just disabling freed elements
>> reuse and freeing these elements after lookup procedure ends.
> You've proposed this 'solution' twice already in qptrie thread and both
> times the answer was 'no, we cannot do this' with reasons explained.
> The 3rd time the answer is still the same.
This time a workable demo which calls call_rcu_tasks_trace() in batches is
provided :) Also because I can not find a better solution for the reuse
problem. But you are right: although not reusing freed elements makes the
implementation of the map simpler, the potential OOM problem is hard to
solve, especially when the RCU tasks trace grace period is slow. I hope Paul
can provide some insight into the problem.
> This 'issue 2' existed in hashmap since very beginning for many years.
> It's a known quirk. There is nothing to fix really.
Do we need to document the unexpected behavior somewhere? I really didn't
know anything about this quirk.
>
> The graph apis (aka new gen data structs) with link list and rbtree are
> in active development. Soon bpf progs will be able to implement their own
> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> be making the trade off between performance and lookup/delete race.
It seems these new gen data structs also need to solve the reuse problem
because a global bpf memory allocator is used.
> So please respin with just __GFP_ZERO and update the patch 6
> to check for lockup only.
> .
Hou Tao Jan. 3, 2023, 1:47 p.m. UTC | #4
Hi,

On 1/2/2023 2:48 AM, Yonghong Song wrote:
>
>
> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> Hi,
>>>
>>> The patchset tries to fix the problems found when checking how htab map
>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>> freed elements may lead to two problems in htab map:
>>>
>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>      htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>      flag which acquires bpf-spin-lock during value copying. The
>>>      corruption of bpf-spin-lock may result in hard lock-up.
>>> (2) lookup procedure may get incorrect map value if the found element is
>>>      freed and then reused.
>>>
>>> Because the type of htab map elements are the same, so problem #1 can be
>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>> these special fields in map element only when the map element is newly
>>> allocated. If it is just a reused element, there will be no
>>> reinitialization.
>>
>> Instead of adding the overhead of ctor callback let's just
>> add __GFP_ZERO to flags in __alloc().
>> That will address the issue 1 and will make bpf_mem_alloc behave just
>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>> will behave the same way.
>
> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> tried to address a similar issue for lru hash table.
> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> hash table?
IMO either the ctor or __GFP_ZERO will fix the issue. Did I miss something here?
>
>
>>
>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>> By adding seq in htab element, doing reuse check and retrying the
>>> lookup procedure may be a feasible solution, but it will make the
>>> lookup API being hard to use, because the user needs to check whether
>>> the found element is reused or not and repeat the lookup procedure if it
>>> is reused. A simpler solution would be just disabling freed elements
>>> reuse and freeing these elements after lookup procedure ends.
>>
>> You've proposed this 'solution' twice already in qptrie thread and both
>> times the answer was 'no, we cannot do this' with reasons explained.
>> The 3rd time the answer is still the same.
>> This 'issue 2' existed in hashmap since very beginning for many years.
>> It's a known quirk. There is nothing to fix really.
>>
>> The graph apis (aka new gen data structs) with link list and rbtree are
>> in active development. Soon bpf progs will be able to implement their own
>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>> be making the trade off between performance and lookup/delete race.
>> So please respin with just __GFP_ZERO and update the patch 6
>> to check for lockup only.
Alexei Starovoitov Jan. 3, 2023, 7:38 p.m. UTC | #5
On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
> > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Hi,
> >>
> >> The patchset tries to fix the problems found when checking how htab map
> >> handles element reuse in bpf memory allocator. The immediate reuse of
> >> freed elements may lead to two problems in htab map:
> >>
> >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> >>     htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> >>     flag which acquires bpf-spin-lock during value copying. The
> >>     corruption of bpf-spin-lock may result in hard lock-up.
> >> (2) lookup procedure may get incorrect map value if the found element is
> >>     freed and then reused.
> >>
> >> Because the type of htab map elements are the same, so problem #1 can be
> >> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> >> these special fields in map element only when the map element is newly
> >> allocated. If it is just a reused element, there will be no
> >> reinitialization.
> > Instead of adding the overhead of ctor callback let's just
> > add __GFP_ZERO to flags in __alloc().
> > That will address the issue 1 and will make bpf_mem_alloc behave just
> > like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > will behave the same way.
> Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated
> object may be bigger than ctor callback when the size of allocated object is
> large. And it also introduces unnecessary memory zeroing if there is no special
> field in the map value.

Small memset is often faster than an indirect call.
I doubt that adding GFP_ZERO will have a measurable perf difference
in map_perf_test and other benchmarks.

> >> Problem #2 exists for both non-preallocated and preallocated htab map.
> >> By adding seq in htab element, doing reuse check and retrying the
> >> lookup procedure may be a feasible solution, but it will make the
> >> lookup API being hard to use, because the user needs to check whether
> >> the found element is reused or not and repeat the lookup procedure if it
> >> is reused. A simpler solution would be just disabling freed elements
> >> reuse and freeing these elements after lookup procedure ends.
> > You've proposed this 'solution' twice already in qptrie thread and both
> > times the answer was 'no, we cannot do this' with reasons explained.
> > The 3rd time the answer is still the same.
> This time a workable demo which calls call_rcu_task_trace() in batch is provided
> :) Also because I can not find a better solution for the reuse problem. But you
> are right, although don't reuse the freed element will make the implementation
> of map simpler, the potential OOM problem is hard to solve specially when RCU
> tasks trace grace period is slow. Hope Paul can provide some insight about the
> problem.

OOM is exactly the reason why we cannot do this delaying logic
in the general case. We don't control what progs do and rcu tasks trace
may take a long time.

> > This 'issue 2' existed in hashmap since very beginning for many years.
> > It's a known quirk. There is nothing to fix really.
> Do we need to document the unexpected behavior somewhere, because I really don't
> know nothing about the quirk ?

Yeah. It's not documented in Documentation/bpf/map_hash.rst.
Please send a patch to add it.

> >
> > The graph apis (aka new gen data structs) with link list and rbtree are
> > in active development. Soon bpf progs will be able to implement their own
> > hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> > be making the trade off between performance and lookup/delete race.
> It seems these new gen data struct also need to solve the reuse problem because
> a global bpf memory allocator is used.

Currently the graph api is single owner and kptr_xchg is used to allow
single cpu at a time to observe the object.
In the future we will add bpf_refcount and multi owner.
Then multiple cpus will be able to access the same object concurrently.
They will race to read/write the fields and it will be prog decision
to arbitrate the access.
In other words the bpf prog needs to have mechanisms to deal with reuse.
Yonghong Song Jan. 4, 2023, 6:10 a.m. UTC | #6
On 1/3/23 5:47 AM, Hou Tao wrote:
> Hi,
> 
> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>
>>
>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> The patchset tries to fix the problems found when checking how htab map
>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>> freed elements may lead to two problems in htab map:
>>>>
>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>       htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>       flag which acquires bpf-spin-lock during value copying. The
>>>>       corruption of bpf-spin-lock may result in hard lock-up.
>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>       freed and then reused.
>>>>
>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>> these special fields in map element only when the map element is newly
>>>> allocated. If it is just a reused element, there will be no
>>>> reinitialization.
>>>
>>> Instead of adding the overhead of ctor callback let's just
>>> add __GFP_ZERO to flags in __alloc().
>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>> will behave the same way.
>>
>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>> tried to address a similar issue for lru hash table.
>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>> hash table?
> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?

The following is my understanding:
in function alloc_htab_elem() (hashtab.c), we have

                if (is_map_full(htab))
                        if (!old_elem)
                                /* when map is full and update() is replacing
                                 * old element, it's ok to allocate, since
                                 * old element will be freed immediately.
                                 * Otherwise return an error
                                 */
                                return ERR_PTR(-E2BIG);
                inc_elem_count(htab);
                l_new = bpf_mem_cache_alloc(&htab->ma);
                if (!l_new) {
                        l_new = ERR_PTR(-ENOMEM);
                        goto dec_count;
                }
                check_and_init_map_value(&htab->map,
                                         l_new->key + round_up(key_size, 8));

In the above, check_and_init_map_value() intends to initialize an element
returned by bpf_mem_cache_alloc() (which could be reused from the free list).

check_and_init_map_value() looks like below (in include/linux/bpf.h):

static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
{
        int i;

        if (!foffs)
                return;
        for (i = 0; i < foffs->cnt; i++)
                memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
}

static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
{
        bpf_obj_init(map->field_offs, dst);
}

IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
list_head, list_node, etc.

This is the cause of problem #1 above.
Maybe I missed something?

>>
>>
>>>
>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>> By adding seq in htab element, doing reuse check and retrying the
>>>> lookup procedure may be a feasible solution, but it will make the
>>>> lookup API being hard to use, because the user needs to check whether
>>>> the found element is reused or not and repeat the lookup procedure if it
>>>> is reused. A simpler solution would be just disabling freed elements
>>>> reuse and freeing these elements after lookup procedure ends.
>>>
>>> You've proposed this 'solution' twice already in qptrie thread and both
>>> times the answer was 'no, we cannot do this' with reasons explained.
>>> The 3rd time the answer is still the same.
>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>> It's a known quirk. There is nothing to fix really.
>>>
>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>> in active development. Soon bpf progs will be able to implement their own
>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>> be making the trade off between performance and lookup/delete race.
>>> So please respin with just __GFP_ZERO and update the patch 6
>>> to check for lockup only.
>
Hou Tao Jan. 4, 2023, 6:30 a.m. UTC | #7
Hi,

On 1/4/2023 2:10 PM, Yonghong Song wrote:
>
>
> On 1/3/23 5:47 AM, Hou Tao wrote:
>> Hi,
>>
>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>
>>>
>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> Hi,
>>>>>
>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>> freed elements may lead to two problems in htab map:
>>>>>
>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>       htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>       flag which acquires bpf-spin-lock during value copying. The
>>>>>       corruption of bpf-spin-lock may result in hard lock-up.
>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>       freed and then reused.
>>>>>
>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>> these special fields in map element only when the map element is newly
>>>>> allocated. If it is just a reused element, there will be no
>>>>> reinitialization.
>>>>
>>>> Instead of adding the overhead of ctor callback let's just
>>>> add __GFP_ZERO to flags in __alloc().
>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>> will behave the same way.
>>>
>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>> tried to address a similar issue for lru hash table.
>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>> hash table?
>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>
> The following is my understanding:
> in function alloc_htab_elem() (hashtab.c), we have
>
>                 if (is_map_full(htab))
>                         if (!old_elem)
>                                 /* when map is full and update() is replacing
>                                  * old element, it's ok to allocate, since
>                                  * old element will be freed immediately.
>                                  * Otherwise return an error
>                                  */
>                                 return ERR_PTR(-E2BIG);
>                 inc_elem_count(htab);
>                 l_new = bpf_mem_cache_alloc(&htab->ma);
>                 if (!l_new) {
>                         l_new = ERR_PTR(-ENOMEM);
>                         goto dec_count;
>                 }
>                 check_and_init_map_value(&htab->map,
>                                          l_new->key + round_up(key_size, 8));
>
> In the above check_and_init_map_value() intends to do initializing
> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>
> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>
> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> {
>         int i;
>
>         if (!foffs)
>                 return;
>         for (i = 0; i < foffs->cnt; i++)
>                 memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> }
>
> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> {
>         bpf_obj_init(map->field_offs, dst);
> }
>
> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> list_head, list_node, etc.
>
> This is the problem for above problem #1.
> Maybe I missed something?
Yes, that is exactly the problem patch #1 tries to fix. Patch #1 fixes the
problem by calling check_and_init_map_value() only once, for the newly-allocated
element, so if a freed element is reused, its special fields will not be zeroed
again. Are there any other cases which are not covered by the solution, or any
other similar problems in hashtab?
>
>>>
>>>
>>>>
>>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>>> By adding seq in htab element, doing reuse check and retrying the
>>>>> lookup procedure may be a feasible solution, but it will make the
>>>>> lookup API being hard to use, because the user needs to check whether
>>>>> the found element is reused or not and repeat the lookup procedure if it
>>>>> is reused. A simpler solution would be just disabling freed elements
>>>>> reuse and freeing these elements after lookup procedure ends.
>>>>
>>>> You've proposed this 'solution' twice already in qptrie thread and both
>>>> times the answer was 'no, we cannot do this' with reasons explained.
>>>> The 3rd time the answer is still the same.
>>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>>> It's a known quirk. There is nothing to fix really.
>>>>
>>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>>> in active development. Soon bpf progs will be able to implement their own
>>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>>> be making the trade off between performance and lookup/delete race.
>>>> So please respin with just __GFP_ZERO and update the patch 6
>>>> to check for lockup only.
>>
Yonghong Song Jan. 4, 2023, 7:14 a.m. UTC | #8
On 1/3/23 10:30 PM, Hou Tao wrote:
> Hi,
> 
> On 1/4/2023 2:10 PM, Yonghong Song wrote:
>>
>>
>> On 1/3/23 5:47 AM, Hou Tao wrote:
>>> Hi,
>>>
>>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>>> freed elements may lead to two problems in htab map:
>>>>>>
>>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>>        flag which acquires bpf-spin-lock during value copying. The
>>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
>>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>>        freed and then reused.
>>>>>>
>>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>>> these special fields in map element only when the map element is newly
>>>>>> allocated. If it is just a reused element, there will be no
>>>>>> reinitialization.
>>>>>
>>>>> Instead of adding the overhead of ctor callback let's just
>>>>> add __GFP_ZERO to flags in __alloc().
>>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>>> will behave the same way.
>>>>
>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>>> tried to address a similar issue for lru hash table.
>>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>>> hash table?
>>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>>
>> The following is my understanding:
>> in function alloc_htab_elem() (hashtab.c), we have
>>
>>                  if (is_map_full(htab))
>>                          if (!old_elem)
>>                                  /* when map is full and update() is replacing
>>                                   * old element, it's ok to allocate, since
>>                                   * old element will be freed immediately.
>>                                   * Otherwise return an error
>>                                   */
>>                                  return ERR_PTR(-E2BIG);
>>                  inc_elem_count(htab);
>>                  l_new = bpf_mem_cache_alloc(&htab->ma);
>>                  if (!l_new) {
>>                          l_new = ERR_PTR(-ENOMEM);
>>                          goto dec_count;
>>                  }
>>                  check_and_init_map_value(&htab->map,
>>                                           l_new->key + round_up(key_size, 8));
>>
>> In the above check_and_init_map_value() intends to do initializing
>> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>>
>> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>>
>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
>> {
>>          int i;
>>
>>          if (!foffs)
>>                  return;
>>          for (i = 0; i < foffs->cnt; i++)
>>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
>> }
>>
>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>> {
>>          bpf_obj_init(map->field_offs, dst);
>> }
>>
>> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
>> list_head, list_node, etc.
>>
>> This is the problem for above problem #1.
>> Maybe I missed something?
> Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> problem by only calling check_and_init_map_value() once for the newly-allocated
> element, so if a freed element is reused, its special fields will not be zeroed
> again. Is there any other cases which are not covered by the solution or any
> other similar problems in hash-tab ?

No, I checked all cases of check_and_init_map_value() and didn't find
any other instances.

>>
>>>>
>>>>
>>>>>
>>>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>>>> By adding seq in htab element, doing reuse check and retrying the
>>>>>> lookup procedure may be a feasible solution, but it will make the
>>>>>> lookup API being hard to use, because the user needs to check whether
>>>>>> the found element is reused or not and repeat the lookup procedure if it
>>>>>> is reused. A simpler solution would be just disabling freed elements
>>>>>> reuse and freeing these elements after lookup procedure ends.
>>>>>
>>>>> You've proposed this 'solution' twice already in qptrie thread and both
>>>>> times the answer was 'no, we cannot do this' with reasons explained.
>>>>> The 3rd time the answer is still the same.
>>>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>>>> It's a known quirk. There is nothing to fix really.
>>>>>
>>>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>>>> in active development. Soon bpf progs will be able to implement their own
>>>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>>>> be making the trade off between performance and lookup/delete race.
>>>>> So please respin with just __GFP_ZERO and update the patch 6
>>>>> to check for lockup only.
>>>
>
Alexei Starovoitov Jan. 4, 2023, 6:26 p.m. UTC | #9
On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
>
>
>
> On 1/3/23 10:30 PM, Hou Tao wrote:
> > Hi,
> >
> > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> >>
> >>
> >> On 1/3/23 5:47 AM, Hou Tao wrote:
> >>> Hi,
> >>>
> >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> >>>>
> >>>>
> >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >>>>>> From: Hou Tao <houtao1@huawei.com>
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> The patchset tries to fix the problems found when checking how htab map
> >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> >>>>>> freed elements may lead to two problems in htab map:
> >>>>>>
> >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> >>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> >>>>>>        flag which acquires bpf-spin-lock during value copying. The
> >>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
> >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> >>>>>>        freed and then reused.
> >>>>>>
> >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> >>>>>> these special fields in map element only when the map element is newly
> >>>>>> allocated. If it is just a reused element, there will be no
> >>>>>> reinitialization.
> >>>>>
> >>>>> Instead of adding the overhead of ctor callback let's just
> >>>>> add __GFP_ZERO to flags in __alloc().
> >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> >>>>> will behave the same way.
> >>>>
> >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> >>>> tried to address a similar issue for lru hash table.
> >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> >>>> hash table?
> >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> >>
> >> The following is my understanding:
> >> in function alloc_htab_elem() (hashtab.c), we have
> >>
> >>                  if (is_map_full(htab))
> >>                          if (!old_elem)
> >>                                  /* when map is full and update() is replacing
> >>                                   * old element, it's ok to allocate, since
> >>                                   * old element will be freed immediately.
> >>                                   * Otherwise return an error
> >>                                   */
> >>                                  return ERR_PTR(-E2BIG);
> >>                  inc_elem_count(htab);
> >>                  l_new = bpf_mem_cache_alloc(&htab->ma);
> >>                  if (!l_new) {
> >>                          l_new = ERR_PTR(-ENOMEM);
> >>                          goto dec_count;
> >>                  }
> >>                  check_and_init_map_value(&htab->map,
> >>                                           l_new->key + round_up(key_size, 8));
> >>
> >> In the above check_and_init_map_value() intends to do initializing
> >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> >>
> >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> >>
> >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> >> {
> >>          int i;
> >>
> >>          if (!foffs)
> >>                  return;
> >>          for (i = 0; i < foffs->cnt; i++)
> >>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> >> }
> >>
> >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> >> {
> >>          bpf_obj_init(map->field_offs, dst);
> >> }
> >>
> >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> >> list_head, list_node, etc.
> >>
> >> This is the problem for above problem #1.
> >> Maybe I missed something?
> > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > problem by only calling check_and_init_map_value() once for the newly-allocated
> > element, so if a freed element is reused, its special fields will not be zeroed
> > again. Is there any other cases which are not covered by the solution or any
> > other similar problems in hash-tab ?
>
> No, I checked all cases of check_and_init_map_value() and didn't find
> any other instances.

check_and_init_map_value() is called in two other cases:
lookup_and_delete[_batch].
There the zeroing of the fields is necessary because the 'value'
is a temp buffer that is going to be copied to user space.
I think the way forward is to add GFP_ZERO to mem_alloc
(to make it equivalent to prealloc), remove one case
of check_and_init_map_value from hashmap, add short comments
to two other cases and add a big comment to check_and_init_map_value()
that should say that 'dst' must be a temp buffer and should not
point to memory that could be used in parallel by a bpf prog.
It feels like we've dealt with this issue a couple times already
and keep repeating this mistake, so the more comments the better.
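
A sketch of what that big comment could look like (the wording below is only
illustrative, not the eventual patch):

/* include/linux/bpf.h */

/* NOTE: 'dst' must point to a temporary buffer, e.g. the value about to be
 * copied to user space by lookup_and_delete[_batch]. It must never point
 * to memory that a bpf prog may read or write in parallel, because zeroing
 * a live bpf_spin_lock, bpf_timer or kptr field corrupts it.
 */
static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
{
        bpf_obj_init(map->field_offs, dst);
}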
Martin KaFai Lau Jan. 10, 2023, 6:26 a.m. UTC | #10
On 1/3/23 11:38 AM, Alexei Starovoitov wrote:
> On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> The patchset tries to fix the problems found when checking how htab map
>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>> freed elements may lead to two problems in htab map:
>>>>
>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>      htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>      flag which acquires bpf-spin-lock during value copying. The
>>>>      corruption of bpf-spin-lock may result in hard lock-up.
>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>      freed and then reused.
>>>>
>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>> these special fields in map element only when the map element is newly
>>>> allocated. If it is just a reused element, there will be no
>>>> reinitialization.
>>> Instead of adding the overhead of ctor callback let's just
>>> add __GFP_ZERO to flags in __alloc().
>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>> will behave the same way.
>> Use __GPF_ZERO will be simpler, but the overhead of memset() for every allocated
>> object may be bigger than ctor callback when the size of allocated object is
>> large. And it also introduces unnecessary memory zeroing if there is no special
>> field in the map value.
> 
> Small memset is often faster than an indirect call.
> I doubt that adding GFP_ZERO will have a measurable perf difference
> in map_perf_test and other benchmarks.
> 
>>>> Problem #2 exists for both non-preallocated and preallocated htab map.
>>>> By adding seq in htab element, doing reuse check and retrying the
>>>> lookup procedure may be a feasible solution, but it will make the
>>>> lookup API being hard to use, because the user needs to check whether
>>>> the found element is reused or not and repeat the lookup procedure if it
>>>> is reused. A simpler solution would be just disabling freed elements
>>>> reuse and freeing these elements after lookup procedure ends.
>>> You've proposed this 'solution' twice already in qptrie thread and both
>>> times the answer was 'no, we cannot do this' with reasons explained.
>>> The 3rd time the answer is still the same.
>> This time a workable demo which calls call_rcu_task_trace() in batch is provided
>> :) Also because I can not find a better solution for the reuse problem. But you
>> are right, although don't reuse the freed element will make the implementation
>> of map simpler, the potential OOM problem is hard to solve specially when RCU
>> tasks trace grace period is slow. Hope Paul can provide some insight about the
>> problem.
> 
> OOM is exactly the reason why we cannot do this delaying logic
> in the general case. We don't control what progs do and rcu tasks trace
> may take a long time.

I haven't looked at the details of this patch yet. Since Hou was asking in
https://lore.kernel.org/bpf/6e4ec7a4-9ac9-417c-c11a-de59e72a6e42@huawei.com/
about the local storage use case (thanks!), let me continue the discussion in
this thread.

Some of my current thoughts: the local storage map is a little different from
the more generic bpf maps (e.g. htab). The common use case in the local storage
map is less demanding on the alloc/free path. The storage is usually allocated
once and then stays for the whole lifetime of its owner (sk, task, cgrp, or
inode). There is no update helper, so it encourages a get() followed by an
in-place modification. Besides, the current local storage is also freed only
after going through an RCU tasks trace gp.

That said, I am checking to see if the common local storage use case can be
modified to safely reuse elements without waiting for the gp. It would then be
an improvement on the existing implementation. The bottom line is not to slow
down the current get (i.e. lookup) performance. Not 100% sure yet if it is
doable.

The delete() helper likely still needs to go through a gp. Being able to avoid
immediate reuse in bpf_mem_alloc will at least be needed for the local storage
delete() code path.
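
For context, the usage pattern described above looks roughly like this from
the bpf prog side (a made-up example for illustration, not part of this patch
set):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct counters {
        __u64 syscalls;
};

struct {
        __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct counters);
} task_ctrs SEC(".maps");

SEC("tp_btf/sys_enter")
int BPF_PROG(count_syscalls)
{
        struct counters *c;

        /* allocated on first access, then lives as long as the task */
        c = bpf_task_storage_get(&task_ctrs, bpf_get_current_task_btf(), 0,
                                 BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (c)
                /* in-place modification, no map_update_elem() */
                __sync_fetch_and_add(&c->syscalls, 1);
        return 0;
}

char _license[] SEC("license") = "GPL";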

> 
>>> This 'issue 2' existed in hashmap since very beginning for many years.
>>> It's a known quirk. There is nothing to fix really.
>> Do we need to document the unexpected behavior somewhere, because I really don't
>> know nothing about the quirk ?
> 
> Yeah. It's not documented in Documentation/bpf/map_hash.rst.
> Please send a patch to add it.
> 
>>>
>>> The graph apis (aka new gen data structs) with link list and rbtree are
>>> in active development. Soon bpf progs will be able to implement their own
>>> hash maps with explicit bpf_rcu_read_lock. At that time the progs will
>>> be making the trade off between performance and lookup/delete race.
>> It seems these new gen data struct also need to solve the reuse problem because
>> a global bpf memory allocator is used.
> 
> Currently the graph api is single owner and kptr_xchg is used to allow
> single cpu at a time to observe the object.
> In the future we will add bpf_refcount and multi owner.
> Then multiple cpus will be able to access the same object concurrently.
> They will race to read/write the fields and it will be prog decision
> to arbitrate the access.
> In other words the bpf prog needs to have mechanisms to deal with reuse.
Kumar Kartikeya Dwivedi Feb. 10, 2023, 4:32 p.m. UTC | #11
On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
> On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
> >
> >
> >
> > On 1/3/23 10:30 PM, Hou Tao wrote:
> > > Hi,
> > >
> > > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> > >>
> > >>
> > >> On 1/3/23 5:47 AM, Hou Tao wrote:
> > >>> Hi,
> > >>>
> > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> > >>>>
> > >>>>
> > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> > >>>>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> The patchset tries to fix the problems found when checking how htab map
> > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> > >>>>>> freed elements may lead to two problems in htab map:
> > >>>>>>
> > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> > >>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> > >>>>>>        flag which acquires bpf-spin-lock during value copying. The
> > >>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
> > >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> > >>>>>>        freed and then reused.
> > >>>>>>
> > >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> > >>>>>> these special fields in map element only when the map element is newly
> > >>>>>> allocated. If it is just a reused element, there will be no
> > >>>>>> reinitialization.
> > >>>>>
> > >>>>> Instead of adding the overhead of ctor callback let's just
> > >>>>> add __GFP_ZERO to flags in __alloc().
> > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > >>>>> will behave the same way.
> > >>>>
> > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> > >>>> tried to address a similar issue for lru hash table.
> > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> > >>>> hash table?
> > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> > >>
> > >> The following is my understanding:
> > >> in function alloc_htab_elem() (hashtab.c), we have
> > >>
> > >>                  if (is_map_full(htab))
> > >>                          if (!old_elem)
> > >>                                  /* when map is full and update() is replacing
> > >>                                   * old element, it's ok to allocate, since
> > >>                                   * old element will be freed immediately.
> > >>                                   * Otherwise return an error
> > >>                                   */
> > >>                                  return ERR_PTR(-E2BIG);
> > >>                  inc_elem_count(htab);
> > >>                  l_new = bpf_mem_cache_alloc(&htab->ma);
> > >>                  if (!l_new) {
> > >>                          l_new = ERR_PTR(-ENOMEM);
> > >>                          goto dec_count;
> > >>                  }
> > >>                  check_and_init_map_value(&htab->map,
> > >>                                           l_new->key + round_up(key_size, 8));
> > >>
> > >> In the above check_and_init_map_value() intends to do initializing
> > >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> > >>
> > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> > >>
> > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> > >> {
> > >>          int i;
> > >>
> > >>          if (!foffs)
> > >>                  return;
> > >>          for (i = 0; i < foffs->cnt; i++)
> > >>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> > >> }
> > >>
> > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > >> {
> > >>          bpf_obj_init(map->field_offs, dst);
> > >> }
> > >>
> > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> > >> list_head, list_node, etc.
> > >>
> > >> This is the problem for above problem #1.
> > >> Maybe I missed something?
> > > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > > problem by only calling check_and_init_map_value() once for the newly-allocated
> > > element, so if a freed element is reused, its special fields will not be zeroed
> > > again. Is there any other cases which are not covered by the solution or any
> > > other similar problems in hash-tab ?
> >
> > No, I checked all cases of check_and_init_map_value() and didn't find
> > any other instances.
>
> check_and_init_map_value() is called in two other cases:
> lookup_and_delete[_batch].
> There the zeroing of the fields is necessary because the 'value'
> is a temp buffer that is going to be copied to user space.
> I think the way forward is to add GFP_ZERO to mem_alloc
> (to make it equivalent to prealloc), remove one case
> of check_and_init_map_value from hashmap, add short comments
> to two other cases and add a big comment to check_and_init_map_value()
> that should say that 'dst' must be a temp buffer and should not
> point to memory that could be used in parallel by a bpf prog.
> It feels like we've dealt with this issue a couple times already
> and keep repeating this mistake, so the more comments the better.

Hou, are you planning to resubmit this change? I also hit this while testing my
changes on bpf-next.
Alexei Starovoitov Feb. 10, 2023, 9:06 p.m. UTC | #12
On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
> > On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
> > >
> > >
> > >
> > > On 1/3/23 10:30 PM, Hou Tao wrote:
> > > > Hi,
> > > >
> > > > On 1/4/2023 2:10 PM, Yonghong Song wrote:
> > > >>
> > > >>
> > > >> On 1/3/23 5:47 AM, Hou Tao wrote:
> > > >>> Hi,
> > > >>>
> > > >>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
> > > >>>>
> > > >>>>
> > > >>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
> > > >>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> > > >>>>>> From: Hou Tao <houtao1@huawei.com>
> > > >>>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> The patchset tries to fix the problems found when checking how htab map
> > > >>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
> > > >>>>>> freed elements may lead to two problems in htab map:
> > > >>>>>>
> > > >>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
> > > >>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
> > > >>>>>>        flag which acquires bpf-spin-lock during value copying. The
> > > >>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
> > > >>>>>> (2) lookup procedure may get incorrect map value if the found element is
> > > >>>>>>        freed and then reused.
> > > >>>>>>
> > > >>>>>> Because the type of htab map elements are the same, so problem #1 can be
> > > >>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
> > > >>>>>> these special fields in map element only when the map element is newly
> > > >>>>>> allocated. If it is just a reused element, there will be no
> > > >>>>>> reinitialization.
> > > >>>>>
> > > >>>>> Instead of adding the overhead of ctor callback let's just
> > > >>>>> add __GFP_ZERO to flags in __alloc().
> > > >>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
> > > >>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
> > > >>>>> will behave the same way.
> > > >>>>
> > > >>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
> > > >>>> tried to address a similar issue for lru hash table.
> > > >>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
> > > >>>> hash table?
> > > >>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
> > > >>
> > > >> The following is my understanding:
> > > >> in function alloc_htab_elem() (hashtab.c), we have
> > > >>
> > > >>                  if (is_map_full(htab))
> > > >>                          if (!old_elem)
> > > >>                                  /* when map is full and update() is replacing
> > > >>                                   * old element, it's ok to allocate, since
> > > >>                                   * old element will be freed immediately.
> > > >>                                   * Otherwise return an error
> > > >>                                   */
> > > >>                                  return ERR_PTR(-E2BIG);
> > > >>                  inc_elem_count(htab);
> > > >>                  l_new = bpf_mem_cache_alloc(&htab->ma);
> > > >>                  if (!l_new) {
> > > >>                          l_new = ERR_PTR(-ENOMEM);
> > > >>                          goto dec_count;
> > > >>                  }
> > > >>                  check_and_init_map_value(&htab->map,
> > > >>                                           l_new->key + round_up(key_size, 8));
> > > >>
> > > >> In the above check_and_init_map_value() intends to do initializing
> > > >> for an element from bpf_mem_cache_alloc (could be reused from the free list).
> > > >>
> > > >> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
> > > >>
> > > >> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
> > > >> {
> > > >>          int i;
> > > >>
> > > >>          if (!foffs)
> > > >>                  return;
> > > >>          for (i = 0; i < foffs->cnt; i++)
> > > >>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
> > > >> }
> > > >>
> > > >> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > > >> {
> > > >>          bpf_obj_init(map->field_offs, dst);
> > > >> }
> > > >>
> > > >> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
> > > >> list_head, list_node, etc.
> > > >>
> > > >> This is the problem for above problem #1.
> > > >> Maybe I missed something?
> > > > Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
> > > > problem by only calling check_and_init_map_value() once for the newly-allocated
> > > > element, so if a freed element is reused, its special fields will not be zeroed
> > > > again. Is there any other cases which are not covered by the solution or any
> > > > other similar problems in hash-tab ?
> > >
> > > No, I checked all cases of check_and_init_map_value() and didn't find
> > > any other instances.
> >
> > check_and_init_map_value() is called in two other cases:
> > lookup_and_delete[_batch].
> > There the zeroing of the fields is necessary because the 'value'
> > is a temp buffer that is going to be copied to user space.
> > I think the way forward is to add GFP_ZERO to mem_alloc
> > (to make it equivalent to prealloc), remove one case
> > of check_and_init_map_value from hashmap, add short comments
> > to two other cases and add a big comment to check_and_init_map_value()
> > that should say that 'dst' must be a temp buffer and should not
> > point to memory that could be used in parallel by a bpf prog.
> > It feels like we've dealt with this issue a couple times already
> > and keep repeating this mistake, so the more comments the better.
>
> Hou, are you plannning to resubmit this change? I also hit this while testing my
> changes on bpf-next.

Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
The former will take a long time to settle.
The latter is trivial.
To unblock yourself just add GFP_ZERO in an extra patch?
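
Roughly the following (untested sketch; assuming __alloc() in
kernel/bpf/memalloc.c still builds its gfp flags in one place, percpu
case omitted):

static void *__alloc(struct bpf_mem_cache *c, int node)
{
	/* Fresh objects from the slab come back zeroed, so they look the
	 * same as preallocated elements and hashtab no longer needs to
	 * call check_and_init_map_value() on every update. Reused
	 * elements come from the per-cpu free list and never reach this
	 * function, so their special fields stay intact.
	 */
	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_ACCOUNT | __GFP_ZERO;

	return kmalloc_node(c->unit_size, flags, node);
}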
Hou Tao Feb. 11, 2023, 1:09 a.m. UTC | #13
On 2/11/2023 5:06 AM, Alexei Starovoitov wrote:
> On Fri, Feb 10, 2023 at 8:33 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>> On Wed, Jan 04, 2023 at 07:26:12PM CET, Alexei Starovoitov wrote:
>>> On Tue, Jan 3, 2023 at 11:14 PM Yonghong Song <yhs@meta.com> wrote:
>>>>
>>>>
>>>> On 1/3/23 10:30 PM, Hou Tao wrote:
>>>>> Hi,
>>>>>
>>>>> On 1/4/2023 2:10 PM, Yonghong Song wrote:
>>>>>>
>>>>>> On 1/3/23 5:47 AM, Hou Tao wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 1/2/2023 2:48 AM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>> On 12/31/22 5:26 PM, Alexei Starovoitov wrote:
>>>>>>>>> On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
>>>>>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> The patchset tries to fix the problems found when checking how htab map
>>>>>>>>>> handles element reuse in bpf memory allocator. The immediate reuse of
>>>>>>>>>> freed elements may lead to two problems in htab map:
>>>>>>>>>>
>>>>>>>>>> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in
>>>>>>>>>>        htab map value and it may corrupt lookup procedure with BFP_F_LOCK
>>>>>>>>>>        flag which acquires bpf-spin-lock during value copying. The
>>>>>>>>>>        corruption of bpf-spin-lock may result in hard lock-up.
>>>>>>>>>> (2) lookup procedure may get incorrect map value if the found element is
>>>>>>>>>>        freed and then reused.
>>>>>>>>>>
>>>>>>>>>> Because the type of htab map elements are the same, so problem #1 can be
>>>>>>>>>> fixed by supporting ctor in bpf memory allocator. The ctor initializes
>>>>>>>>>> these special fields in map element only when the map element is newly
>>>>>>>>>> allocated. If it is just a reused element, there will be no
>>>>>>>>>> reinitialization.
>>>>>>>>> Instead of adding the overhead of ctor callback let's just
>>>>>>>>> add __GFP_ZERO to flags in __alloc().
>>>>>>>>> That will address the issue 1 and will make bpf_mem_alloc behave just
>>>>>>>>> like percpu_freelist, so hashmap with BPF_F_NO_PREALLOC and default
>>>>>>>>> will behave the same way.
>>>>>>>> Patch https://lore.kernel.org/all/20220809213033.24147-3-memxor@gmail.com/
>>>>>>>> tried to address a similar issue for lru hash table.
>>>>>>>> Maybe we need to do similar things after bpf_mem_cache_alloc() for
>>>>>>>> hash table?
>>>>>>> IMO ctor or __GFP_ZERO will fix the issue. Did I miss something here ?
>>>>>> The following is my understanding:
>>>>>> in function alloc_htab_elem() (hashtab.c), we have
>>>>>>
>>>>>>                  if (is_map_full(htab))
>>>>>>                          if (!old_elem)
>>>>>>                                  /* when map is full and update() is replacing
>>>>>>                                   * old element, it's ok to allocate, since
>>>>>>                                   * old element will be freed immediately.
>>>>>>                                   * Otherwise return an error
>>>>>>                                   */
>>>>>>                                  return ERR_PTR(-E2BIG);
>>>>>>                  inc_elem_count(htab);
>>>>>>                  l_new = bpf_mem_cache_alloc(&htab->ma);
>>>>>>                  if (!l_new) {
>>>>>>                          l_new = ERR_PTR(-ENOMEM);
>>>>>>                          goto dec_count;
>>>>>>                  }
>>>>>>                  check_and_init_map_value(&htab->map,
>>>>>>                                           l_new->key + round_up(key_size, 8));
>>>>>>
>>>>>> In the above check_and_init_map_value() intends to do initializing
>>>>>> for an element from bpf_mem_cache_alloc (could be reused from the free list).
>>>>>>
>>>>>> The check_and_init_map_value() looks like below (in include/linux/bpf.h)
>>>>>>
>>>>>> static inline void bpf_obj_init(const struct btf_field_offs *foffs, void *obj)
>>>>>> {
>>>>>>          int i;
>>>>>>
>>>>>>          if (!foffs)
>>>>>>                  return;
>>>>>>          for (i = 0; i < foffs->cnt; i++)
>>>>>>                  memset(obj + foffs->field_off[i], 0, foffs->field_sz[i]);
>>>>>> }
>>>>>>
>>>>>> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
>>>>>> {
>>>>>>          bpf_obj_init(map->field_offs, dst);
>>>>>> }
>>>>>>
>>>>>> IIUC, bpf_obj_init() will bzero those fields like spin_lock, timer,
>>>>>> list_head, list_node, etc.
>>>>>>
>>>>>> This is the problem for above problem #1.
>>>>>> Maybe I missed something?
>>>>> Yes. It is the problem patch #1 tries to fix exactly. Patch #1 tries to fix the
>>>>> problem by only calling check_and_init_map_value() once for the newly-allocated
>>>>> element, so if a freed element is reused, its special fields will not be zeroed
>>>>> again. Is there any other cases which are not covered by the solution or any
>>>>> other similar problems in hash-tab ?
>>>> No, I checked all cases of check_and_init_map_value() and didn't find
>>>> any other instances.
>>> check_and_init_map_value() is called in two other cases:
>>> lookup_and_delete[_batch].
>>> There the zeroing of the fields is necessary because the 'value'
>>> is a temp buffer that is going to be copied to user space.
>>> I think the way forward is to add GFP_ZERO to mem_alloc
>>> (to make it equivalent to prealloc), remove one case
>>> of check_and_init_map_value from hashmap, add short comments
>>> to two other cases and add a big comment to check_and_init_map_value()
>>> that should say that 'dst' must be a temp buffer and should not
>>> point to memory that could be used in parallel by a bpf prog.
>>> It feels like we've dealt with this issue a couple times already
>>> and keep repeating this mistake, so the more comments the better.
>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>> changes on bpf-next.
> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> The former will take a long time to settle.
> The latter is trivial.
> To unblock yourself just add GFP_ZERO in an extra patch?
Sorry for the long delay. I just found time to do some tests comparing the
performance of bzero and ctor. After they are done, I will resubmit next week.
Alexei Starovoitov Feb. 11, 2023, 4:33 p.m. UTC | #14
On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hou, are you plannning to resubmit this change? I also hit this while testing my
> >> changes on bpf-next.
> > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> > The former will take a long time to settle.
> > The latter is trivial.
> > To unblock yourself just add GFP_ZERO in an extra patch?
> Sorry for the long delay. Just find find out time to do some tests to compare
> the performance of bzero and ctor. After it is done, will resubmit on next week.

I still don't like ctor as a concept. In general the callbacks in the critical
path are guaranteed to be slow due to retpoline overhead.
Please send a patch to add GFP_ZERO.

Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
This approach will work for inner nodes of qptrie, since bpf progs
never see pointers to them. It will work for local storage
converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
It's also safe without uaf caveat in sleepable progs and sleepable progs
can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
So please respin the set with rcu gp only and that new flag.
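
For the sleepable case the opt-in would look something like this (just a
sketch, not part of this set; assumes the bpf_rcu_read_lock()/
bpf_rcu_read_unlock() kfuncs and a plain hash map):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} htab SEC(".maps");

SEC("lsm.s/socket_listen")
int BPF_PROG(lookup_no_uaf, struct socket *sock, int backlog)
{
	__u32 key = 0;
	__u64 *val;

	bpf_rcu_read_lock();
	val = bpf_map_lookup_elem(&htab, &key);
	if (val) {
		/* under the RCU read lock the element cannot be handed
		 * back to the kernel allocator, so no use-after-free
		 */
	}
	bpf_rcu_read_unlock();
	return 0;
}

char _license[] SEC("license") = "GPL";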
Alexei Starovoitov Feb. 11, 2023, 4:34 p.m. UTC | #15
On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hou, are you plannning to resubmit this change? I also hit this while testing my
> > >> changes on bpf-next.
> > > Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> > > The former will take a long time to settle.
> > > The latter is trivial.
> > > To unblock yourself just add GFP_ZERO in an extra patch?
> > Sorry for the long delay. Just find find out time to do some tests to compare
> > the performance of bzero and ctor. After it is done, will resubmit on next week.
>
> I still don't like ctor as a concept. In general the callbacks in the critical
> path are guaranteed to be slow due to retpoline overhead.
> Please send a patch to add GFP_ZERO.
>
> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
> This approach will work for inner nodes of qptrie, since bpf progs
> never see pointers to them. It will work for local storage
> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
> It's also safe without uaf caveat in sleepable progs and sleepable progs

I meant 'safe with uaf caveat'.
Safe because we wait for rcu_tasks_trace later, before returning the elements to kernel memory.

> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
> So please respin the set with rcu gp only and that new flag.
Martin KaFai Lau Feb. 15, 2023, 1:54 a.m. UTC | #16
On 2/11/23 8:34 AM, Alexei Starovoitov wrote:
> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>>> changes on bpf-next.
>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>> The former will take a long time to settle.
>>>> The latter is trivial.
>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>> the performance of bzero and ctor. After it is done, will resubmit on next week.
>>
>> I still don't like ctor as a concept. In general the callbacks in the critical
>> path are guaranteed to be slow due to retpoline overhead.
>> Please send a patch to add GFP_ZERO.
>>
>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
>> This approach will work for inner nodes of qptrie, since bpf progs
>> never see pointers to them. It will work for local storage
>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
>> It's also safe without uaf caveat in sleepable progs and sleepable progs
> 
> I meant 'safe with uaf caveat'.
> Safe because we wait for rcu_task_trace later before returning to kernel memory.

For local storage, when its owner (sk/task/inode/cgrp) is going away, the memory 
can be reused immediately. No rcu gp is needed.

The local storage delete case (e.g. bpf_sk_storage_delete) is the only one that 
needs to be freed by the tasks_trace gp, because another bpf prog (reader) may be 
under rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) of 
allowing reuse after a vanilla rcu gp and freeing (if needed) after the tasks_trace gp 
can be extended to the local storage delete case. I think we can extend the 
assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock() 
when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.

I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and the 
BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.

> 
>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>> So please respin the set with rcu gp only and that new flag.
Hou Tao Feb. 15, 2023, 2:35 a.m. UTC | #17
Hi,

On 2/12/2023 12:33 AM, Alexei Starovoitov wrote:
> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>> changes on bpf-next.
>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>> The former will take a long time to settle.
>>> The latter is trivial.
>>> To unblock yourself just add GFP_ZERO in an extra patch?
>> Sorry for the long delay. Just find find out time to do some tests to compare
>> the performance of bzero and ctor. After it is done, will resubmit on next week.
> I still don't like ctor as a concept. In general the callbacks in the critical
> path are guaranteed to be slow due to retpoline overhead.
> Please send a patch to add GFP_ZERO.
I see. Will do. But I think it is better to know the coarse overhead of these
two methods, so I hacked map_perf_test to support a customizable value size for
hash_map_alloc and ran some benchmarks to show the overheads of ctor and
GFP_ZERO. These benchmarks were conducted on a KVM VM with 8 CPUs. When the
number of allocated elements is small, the overheads of ctor and bzero are
basically the same, but when the number of allocated elements increases (e.g.,
the table is half full), the overhead of ctor becomes bigger. For big value sizes,
the overheads of ctor and bzero are basically the same, seemingly because the main
overhead comes from slab allocation. The following are the detailed results:

* ./map_perf_test 4 8 8192 10000 $value_size

Key of htab is thread pid, so only 8 elements are allocated.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 256604 | 261112 | 173646 | 74195  | 23138  | 6275   |
| bzero      | 253362 | 257563 | 171445 | 73303  | 22949  | 6249   |
| ctor       | 264570 | 258246 | 175048 | 72511  | 23004  | 6270   |

* ./map_perf_test 4 8 8192 100 $value_size

The key is still the thread pid, so only 8 elements are allocated. The loop
count is decreased to 100 to show the overhead of the first allocation.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 135662 | 137742 | 87043  | 36265  | 12501  | 4450   |
| bzero      | 139993 | 134920 | 94570  | 37190  | 12543  | 4131   |
| ctor       | 147949 | 141825 | 94321  | 38240  | 13131  | 4248   |

* ./map_perf_test 4 8 8192 1000 $value_size

Each thread creates 512 different keys, so the hash table will be half full. The
loop count is also decreased to 1K.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 4234   | 4289   | 1478   | 510    | 168    | 46     |
| bzero      | 3792   | 4002   | 1473   | 515    | 161    | 37     |
| ctor       | 3846   | 2198   | 1269   | 499    | 161    | 42     |

* ./map_perf_test 4 8 8192 100 $value_size

Each thread creates 512 different keys, so the hash table will be half full. The
loop count is also decreased to 100.

| value_size | 8      | 256    | 4K     | 16K    | 64K    | 256K   |
| --         | --     | --     | --     | --     | --     | --     |
| base       | 3669   | 3419   | 1272   | 476    | 168    | 44     |
| bzero      | 3468   | 3499   | 1274   | 476    | 150    | 36     |
| ctor       | 2235   | 2312   | 1128   | 452    | 145    | 35     |
>
> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
> This approach will work for inner nodes of qptrie, since bpf progs
> never see pointers to them. It will work for local storage
> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
> It's also safe without uaf caveat in sleepable progs and sleepable progs
> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
> So please respin the set with rcu gp only and that new flag.
Alexei Starovoitov Feb. 15, 2023, 2:42 a.m. UTC | #18
On Tue, Feb 14, 2023 at 6:36 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/12/2023 12:33 AM, Alexei Starovoitov wrote:
> > On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
> >>>> changes on bpf-next.
> >>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
> >>> The former will take a long time to settle.
> >>> The latter is trivial.
> >>> To unblock yourself just add GFP_ZERO in an extra patch?
> >> Sorry for the long delay. Just find find out time to do some tests to compare
> >> the performance of bzero and ctor. After it is done, will resubmit on next week.
> > I still don't like ctor as a concept. In general the callbacks in the critical
> > path are guaranteed to be slow due to retpoline overhead.
> > Please send a patch to add GFP_ZERO.
> I see. Will do. But i think it is better to know the coarse overhead of these
> two methods, so I hack map_perf_test to support customizable value size for
> hash_map_alloc and do some benchmarks to show the overheads of ctor and
> GFP_ZERO. These benchmark are conducted on a KVM-VM with 8-cpus, it seems when
> the number of allocated elements is small, the overheads of ctor and bzero are
> basically the same, but when the number of allocated element increases (e.g.,
> half full), the overhead of ctor will be bigger. For big value size, the
> overhead of ctor and zero are basically the same, and it seems due to the main
> overhead comes from slab allocation. The following is the detailed results:

and with retpoline?
Hou Tao Feb. 15, 2023, 3 a.m. UTC | #19
Hi,

On 2/15/2023 10:42 AM, Alexei Starovoitov wrote:
> On Tue, Feb 14, 2023 at 6:36 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 2/12/2023 12:33 AM, Alexei Starovoitov wrote:
>>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>>>> changes on bpf-next.
>>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>>> The former will take a long time to settle.
>>>>> The latter is trivial.
>>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>>> the performance of bzero and ctor. After it is done, will resubmit on next week.
>>> I still don't like ctor as a concept. In general the callbacks in the critical
>>> path are guaranteed to be slow due to retpoline overhead.
>>> Please send a patch to add GFP_ZERO.
>> I see. Will do. But i think it is better to know the coarse overhead of these
>> two methods, so I hack map_perf_test to support customizable value size for
>> hash_map_alloc and do some benchmarks to show the overheads of ctor and
>> GFP_ZERO. These benchmark are conducted on a KVM-VM with 8-cpus, it seems when
>> the number of allocated elements is small, the overheads of ctor and bzero are
>> basically the same, but when the number of allocated element increases (e.g.,
>> half full), the overhead of ctor will be bigger. For big value size, the
>> overhead of ctor and zero are basically the same, and it seems due to the main
>> overhead comes from slab allocation. The following is the detailed results:
> and with retpoline?
Yes. Forgot to mention that. The following is the output of the vulnerabilities
directory:

# cd /sys/devices/system/cpu/vulnerabilities
# grep . *
itlb_multihit:Processor vulnerable
l1tf:Mitigation: PTE Inversion
mds:Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
meltdown:Mitigation: PTI
mmio_stale_data:Unknown: No mitigations
retbleed:Not affected
spec_store_bypass:Vulnerable
spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
spectre_v2:Mitigation: Retpolines, STIBP: disabled, RSB filling, PBRSB-eIBRS: Not affected
srbds:Not affected
tsx_async_abort:Not affected
Hou Tao Feb. 15, 2023, 4:02 a.m. UTC | #20
Hi,

On 2/15/2023 9:54 AM, Martin KaFai Lau wrote:
> On 2/11/23 8:34 AM, Alexei Starovoitov wrote:
>> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>> Hou, are you plannning to resubmit this change? I also hit this while
>>>>>> testing my
>>>>>> changes on bpf-next.
>>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>>> The former will take a long time to settle.
>>>>> The latter is trivial.
>>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>>> the performance of bzero and ctor. After it is done, will resubmit on next
>>>> week.
>>>
>>> I still don't like ctor as a concept. In general the callbacks in the critical
>>> path are guaranteed to be slow due to retpoline overhead.
>>> Please send a patch to add GFP_ZERO.
>>>
>>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
>>> This approach will work for inner nodes of qptrie, since bpf progs
>>> never see pointers to them. It will work for local storage
>>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
>>> It's also safe without uaf caveat in sleepable progs and sleepable progs
>>
>> I meant 'safe with uaf caveat'.
>> Safe because we wait for rcu_task_trace later before returning to kernel memory.
For qp-trie, I had added reuse checking for inner nodes by adding a version in
both the child pointer and the child node itself, and it seemed to work, but using
BPF_REUSE_AFTER_RCU_GP for inner nodes will make the implementation much simpler.
It seems BPF_REUSE_AFTER_RCU_GP is needed for qp-trie leaf nodes as well,
otherwise the value returned to the caller in a bpf program or syscall may be
reused while it is still in use, just like in the hash table. We could change
qp-trie to act as a set only (i.e., no value), but that would limit its usage
scenarios. Maybe requiring the caller to use bpf_rcu_read_lock() is a solution as
well. What do you think?
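
For reference, the reuse check I prototyped looked roughly like the
following (simplified sketch, the names are not from the actual patches):

#include <linux/rcupdate.h>

/* The parent records the seq it saw when installing the child and the
 * node bumps its seq every time it is recycled, so a lookup can detect
 * that the node it just dereferenced has been freed and reused, and
 * restart the walk.
 */
struct qp_trie_node {
	unsigned int seq;	/* bumped when the node is reused */
	/* ... bitmap / twigs / leaf data ... */
};

struct qp_trie_child {
	struct qp_trie_node __rcu *node;
	unsigned int seq;	/* node->seq at installation time */
};

static bool qp_trie_node_reused(const struct qp_trie_child *child,
				const struct qp_trie_node *node)
{
	/* pairs with the seq update in the reuse path */
	return READ_ONCE(node->seq) != child->seq;
}
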
>
> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
> memory can be reused immediately. No rcu gp is needed.
Now it seems it will wait for an RCU GP, and I think that is still necessary,
because when the process exits, other processes may still access the local
storage through a pidfd or the task_struct of the exited process.
>
> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that
> needs to be freed by tasks_trace gp because another bpf prog (reader) may be
> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on
> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp
> can be extended to the local storage delete case. I think we can extend the
> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock()
> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.
>
It seems bpf_rcu_read_lock() and bpf_rcu_read_unlock() will be used to protect not
only bpf_task_storage_get(), but also the dereferences of the returned local
storage pointer, right? I think qp-trie may also need this.
> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
> the BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
I will continue working on this patchset for GFP_ZERO and the reuse flag. Do you
mean that you want to work together to implement BPF_REUSE_AFTER_RCU_GP? How do
we cooperate to accomplish that?


>
>>
>>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>>> So please respin the set with rcu gp only and that new flag.
Martin KaFai Lau Feb. 15, 2023, 7:22 a.m. UTC | #21
On 2/14/23 8:02 PM, Hou Tao wrote:
>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>> memory can be reused immediately. No rcu gp is needed.
> Now it seems it will wait for RCU GP and i think it is still necessary, because
> when the process exits, other processes may still access the local storage
> through pidfd or task_struct of the exited process.

When its owner (sk/task/cgrp...) is going away, the owner has reached refcnt 0 
and will be kfreed immediately afterwards. e.g. bpf_sk_storage_free is called just 
before the sk is about to be kfreed. No bpf prog should have a hold on this sk. 
The same goes for the task.

The current rcu gp waiting during bpf_{sk,task,cgrp...}_storage_free is because 
of the racing with the map destruction in bpf_local_storage_map_free().

>>
>> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that
>> needs to be freed by tasks_trace gp because another bpf prog (reader) may be
>> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on
>> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp
>> can be extended to the local storage delete case. I think we can extend the
>> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock()
>> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.
>>
> It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not
> only bpf_task_storage_get(), but also the dereferences of the returned local
> storage ptr, right ? I think qp-trie may also need this.

I think bpf_rcu_read_lock() is primarily for bpf prog.

The bpf_{sk,task,...}_storage_get() internals are easier to handle and probably 
will need to do their own rcu_read_lock() instead of depending on the bpf prog 
doing the bpf_rcu_read_lock(), because the bpf prog may decide uaf is fine.

>> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
>> the BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
> I will continue work on this patchset for GFP_ZERO and reuse flag. Do you mean
> that you want to work together to implement BPF_REUSE_AFTER_RCU_GP ? How do we
> cooperate together to accomplish that ?
Please submit the GFP_ZERO patch first. Kumar and I can use it immediately.

I have been hacking to make bpf's memalloc safe for 
bpf_{sk,task,cgrp..}_storage_delete(), and this safe-on-reuse piece still needs 
work. The whole thing is getting pretty long, so my current plan is to put the 
safe-on-reuse piece aside for now, focus back on the immediate goal and make the 
common case deadlock free first. Meaning 
bpf_*_storage_get(BPF_*_STORAGE_GET_F_CREATE) and bpf_*_storage_free() will use 
bpf_mem_cache_{alloc,free}. The bpf_*_storage_delete() path will stay as-is and 
go through call_rcu_tasks_trace() for now, since delete is not the common use 
case.

In parallel, if you can post the BPF_REUSE_AFTER_RCU_GP changes, we can discuss 
based on your work. That should speed up the progress. If I finish the immediate 
goal for local storage and this piece is still pending, I will ping you first. 
Thoughts?
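
To make the direction concrete, the common-case change is roughly the
following (untested sketch of the selem allocation path in
bpf_local_storage.c; "selem_ma" is a made-up name for a bpf_mem_alloc
embedded in bpf_local_storage_map):

	/* was: bpf_map_kzalloc(&smap->map, smap->elem_size, ...) */
	selem = bpf_mem_cache_alloc(&smap->selem_ma);
	if (selem) {
		if (value)
			copy_map_value(&smap->map, SDATA(selem)->data, value);
		return selem;
	}

The selem free path would then call bpf_mem_cache_free() on the same
cache, while bpf_*_storage_delete() keeps going through
call_rcu_tasks_trace().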
Hou Tao Feb. 16, 2023, 2:11 a.m. UTC | #22
Hi,

On 2/15/2023 3:22 PM, Martin KaFai Lau wrote:
> On 2/14/23 8:02 PM, Hou Tao wrote:
>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>>> memory can be reused immediately. No rcu gp is needed.
>> Now it seems it will wait for RCU GP and i think it is still necessary, because
>> when the process exits, other processes may still access the local storage
>> through pidfd or task_struct of the exited process.
>
> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0
> and will be kfree immediately next. eg. bpf_sk_storage_free is called just
> before the sk is about to be kfree. No bpf prog should have a hold on this sk.
> The same should go for the task.
A bpf syscall may have already found the task local storage through a pidfd,
then the target task exits and the local storage is freed immediately, and then
the bpf syscall starts to copy the local storage and there will be a UAF, right?
Did I miss something here?
>
> The current rcu gp waiting during bpf_{sk,task,cgrp...}_storage_free is
> because the racing with the map destruction bpf_local_storage_map_free().
>
>>>
>>> The local storage delete case (eg. bpf_sk_storage_delete) is the only one that
>>> needs to be freed by tasks_trace gp because another bpf prog (reader) may be
>>> under the rcu_read_lock_trace(). I think the idea (BPF_REUSE_AFTER_RCU_GP) on
>>> allowing reuse after vanilla rcu gp and free (if needed) after tasks_trace gp
>>> can be extended to the local storage delete case. I think we can extend the
>>> assumption that "sleepable progs (reader) can use explicit bpf_rcu_read_lock()
>>> when they want to avoid uaf" to bpf_{sk,task,inode,cgrp}_storage_get() also.
>>>
>> It seems bpf_rcu_read_lock() & bpf_rcu_read_unlock() will be used to protect not
>> only bpf_task_storage_get(), but also the dereferences of the returned local
>> storage ptr, right ? I think qp-trie may also need this.
>
> I think bpf_rcu_read_lock() is primarily for bpf prog.
Yes. I mean the bpf program which uses qp-trie will need bpf_rcu_read_lock().
>
> The bpf_{sk,task,...}_storage_get() internal is easier to handle and probably
> will need to do its own rcu_read_lock() instead of depending on the bpf prog
> doing the bpf_rcu_read_lock() because the bpf prog may decide uaf is fine.
>
>>> I also need the GFP_ZERO in bpf_mem_alloc, so will work on the GFP_ZERO and
>>> the BPF_REUSE_AFTER_RCU_GP idea.  Probably will get the GFP_ZERO out first.
>> I will continue work on this patchset for GFP_ZERO and reuse flag. Do you mean
>> that you want to work together to implement BPF_REUSE_AFTER_RCU_GP ? How do we
>> cooperate together to accomplish that ?
> Please submit the GFP_ZERO patch first. Kumar and I can use it immediately.
>
> I have been hacking to make bpf's memalloc safe for the
> bpf_{sk,task,cgrp..}_storage_delete() and this safe-on-reuse piece still need
> works. The whole thing is getting pretty long, so my current plan is to put
> the safe-on-reuse piece aside for now, focus back on the immediate goal and
> make the common case deadlock free first. Meaning the
> bpf_*_storage_get(BPF_*_STORAGE_GET_F_CREATE) and the bpf_*_storage_free()
> will use the bpf_mem_cache_{alloc,free}. The bpf_*_storage_delete() will stay
> as-is to go through the call_rcu_tasks_trace() for now since delete is not the
> common use case.
>
> In parallel, if you can post the BPF_REUSE_AFTER_RCU_GP, we can discuss based
> on your work. That should speed up the progress. If I finished the immediate
> goal for local storage and this piece is still pending, I will ping you
> first.  Thoughts?
I am fine with the proposal, thanks.
Martin KaFai Lau Feb. 16, 2023, 7:47 a.m. UTC | #23
On 2/15/23 6:11 PM, Hou Tao wrote:
>>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>>>> memory can be reused immediately. No rcu gp is needed.
>>> Now it seems it will wait for RCU GP and i think it is still necessary, because
>>> when the process exits, other processes may still access the local storage
>>> through pidfd or task_struct of the exited process.
>> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0
>> and will be kfree immediately next. eg. bpf_sk_storage_free is called just
>> before the sk is about to be kfree. No bpf prog should have a hold on this sk.
>> The same should go for the task.
> A bpf syscall may have already found the task local storage through a pidfd,
> then the target task exits and the local storage is free immediately, then bpf
> syscall starts to copy the local storage and there will be a UAF, right ? Did I
> missing something here ?
A bpf syscall like bpf_pid_task_storage_lookup_elem? You mean 
__put_task_struct() will be called while the syscall's bpf_map_copy_value() is 
still under rcu_read_lock()?
Hou Tao Feb. 16, 2023, 8:18 a.m. UTC | #24
Hi,

On 2/16/2023 3:47 PM, Martin KaFai Lau wrote:
> On 2/15/23 6:11 PM, Hou Tao wrote:
>>>>> For local storage, when its owner (sk/task/inode/cgrp) is going away, the
>>>>> memory can be reused immediately. No rcu gp is needed.
>>>> Now it seems it will wait for RCU GP and i think it is still necessary,
>>>> because
>>>> when the process exits, other processes may still access the local storage
>>>> through pidfd or task_struct of the exited process.
>>> When its owner (sk/task/cgrp...) is going away, its owner has reached refcnt 0
>>> and will be kfree immediately next. eg. bpf_sk_storage_free is called just
>>> before the sk is about to be kfree. No bpf prog should have a hold on this sk.
>>> The same should go for the task.
>> A bpf syscall may have already found the task local storage through a pidfd,
>> then the target task exits and the local storage is free immediately, then bpf
>> syscall starts to copy the local storage and there will be a UAF, right ? Did I
>> missing something here ?
> bpf syscall like bpf_pid_task_storage_lookup_elem and you meant
> __put_task_struct() will be called while the syscall's bpf_map_copy_value() is
> still under rcu_read_lock()?
Yes, but now I know it is impossible here. I missed that task_struct is released
through call_rcu(), so the call to __put_task_struct() must happen after the
completion of bpf_map_copy_value() in the bpf syscall. Thanks for the explanation.
Hou Tao Feb. 16, 2023, 1:55 p.m. UTC | #25
Hi,

On 2/12/2023 12:34 AM, Alexei Starovoitov wrote:
> On Sat, Feb 11, 2023 at 8:33 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Fri, Feb 10, 2023 at 5:10 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hou, are you plannning to resubmit this change? I also hit this while testing my
>>>>> changes on bpf-next.
>>>> Are you talking about the whole patch set or just GFP_ZERO in mem_alloc?
>>>> The former will take a long time to settle.
>>>> The latter is trivial.
>>>> To unblock yourself just add GFP_ZERO in an extra patch?
>>> Sorry for the long delay. Just find find out time to do some tests to compare
>>> the performance of bzero and ctor. After it is done, will resubmit on next week.
>> I still don't like ctor as a concept. In general the callbacks in the critical
>> path are guaranteed to be slow due to retpoline overhead.
>> Please send a patch to add GFP_ZERO.
>>
>> Also I realized that we can make the BPF_REUSE_AFTER_RCU_GP flag usable
>> without risking OOM by only waiting for normal rcu GP and not rcu_tasks_trace.
>> This approach will work for inner nodes of qptrie, since bpf progs
>> never see pointers to them. It will work for local storage
>> converted to bpf_mem_alloc too. It wouldn't need to use its own call_rcu.
>> It's also safe without uaf caveat in sleepable progs and sleepable progs
> I meant 'safe with uaf caveat'.
> Safe because we wait for rcu_task_trace later before returning to kernel memory.
>
>> can use explicit bpf_rcu_read_lock() when they want to avoid uaf.
>> So please respin the set with rcu gp only and that new flag.
Besides BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution?
Its downside is that it would force sleepable programs to use
bpf_rcu_read_{lock,unlock}() to access the returned pointers.
Alexei Starovoitov Feb. 16, 2023, 4:35 p.m. UTC | #26
On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ?

The idea is for bpf_mem_free to wait for a normal RCU GP before adding
the elements back to the free list, and to free the elem to global kernel memory
only after both rcu and rcu_tasks_trace GPs, as it's doing now.
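
i.e. roughly (untested sketch with made-up helpers, not the real
memalloc.c internals):

#include <linux/rcupdate.h>
#include <linux/llist.h>
#include <linux/container_of.h>

/* Hypothetical element layout, just to show the two stages. */
struct ma_elem {
	struct rcu_head rcu;
	struct llist_node lnode;
};

static struct llist_head reuse_ready;	/* safe to hand out again */

static void elem_reusable(struct rcu_head *head)
{
	struct ma_elem *e = container_of(head, struct ma_elem, rcu);

	/* one vanilla RCU GP has passed: the element may be reused by
	 * this bpf_mem_cache, but must not go back to the slab yet
	 */
	llist_add(&e->lnode, &reuse_ready);
}

static void ma_unit_free(struct ma_elem *e)
{
	call_rcu(&e->rcu, elem_reusable);
}

Draining surplus elements back to the kernel keeps going through
call_rcu_tasks_trace(), as memalloc.c already does today.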

> Its downside is that it will enforce sleep-able program to use
> bpf_rcu_read_{lock,unlock}() to access these returned pointers ?

sleepable can access elems without kptrs/spin_locks
even when not using rcu_read_lock, since it's safe, but there is uaf.
Some progs might be fine with it.
When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.
Hou Tao Feb. 17, 2023, 1:19 a.m. UTC | #27
Hi,

On 2/17/2023 12:35 AM, Alexei Starovoitov wrote:
> On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ?
> The idea is for bpf_mem_free to wait normal RCU GP before adding
> the elements back to the free list and free the elem to global kernel memory
> only after both rcu and rcu_tasks_trace GPs as it's doing now.
>
>> Its downside is that it will enforce sleep-able program to use
>> bpf_rcu_read_{lock,unlock}() to access these returned pointers ?
> sleepable can access elems without kptrs/spin_locks
> even when not using rcu_read_lock, since it's safe, but there is uaf.
> Some progs might be fine with it.
> When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.
Thanks for the explanation of BPF_REUSE_AFTER_RCU_GP. It seems that
BPF_REUSE_AFTER_RCU_GP may incur OOM easily, because before the expiration of
one RCU GP, these freed elements will not be available to either the bpf ma or
the slab subsystem, and after the expiration of the RCU GP, these freed elements
are only available to one bpf ma, but the number of freed elements may be too
many for one bpf ma, so part of these freed elements will be freed through
call_rcu_tasks_trace() and these freed-again elements will not be available to
the slab subsystem until the expiration of the tasks trace RCU GP. In brief,
after one RCU GP, part of these freed elements will be reused, but the majority
of these elements will still be freed through call_rcu_tasks_trace(). Due to the
doubt above, I proposed BPF_FREE_AFTER_RCU_GP to directly free these elements
after one RCU GP and force sleepable programs to use bpf_rcu_read_lock() to
access these elements, but that enforcement will break existing sleepable
programs, so BPF_FREE_AFTER_RCU_GP is still not a good idea. I will check whether
there is still an OOM risk for BPF_REUSE_AFTER_RCU_GP and try to mitigate it if
possible (e.g., share these freed elements between all bpf ma-s instead of only
the bpf ma which freed them).
Alexei Starovoitov Feb. 22, 2023, 7:30 p.m. UTC | #28
On Thu, Feb 16, 2023 at 5:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/17/2023 12:35 AM, Alexei Starovoitov wrote:
> > On Thu, Feb 16, 2023 at 5:55 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Beside BPF_REUSE_AFTER_RCU_GP, is BPF_FREE_AFTER_RCU_GP a feasible solution ?
> > The idea is for bpf_mem_free to wait normal RCU GP before adding
> > the elements back to the free list and free the elem to global kernel memory
> > only after both rcu and rcu_tasks_trace GPs as it's doing now.
> >
> >> Its downside is that it will enforce sleep-able program to use
> >> bpf_rcu_read_{lock,unlock}() to access these returned pointers ?
> > sleepable can access elems without kptrs/spin_locks
> > even when not using rcu_read_lock, since it's safe, but there is uaf.
> > Some progs might be fine with it.
> > When sleepable needs to avoid uaf they will use bpf_rcu_read_lock.
> Thanks for the explanation for BPF_REUSE_AFTER_RCU_GP. It seems that
> BPF_REUSE_AFTER_RCU_GP may incur OOM easily, because before the expiration of
> one RCU GP, these freed elements will not available to both bpf ma or slab
> subsystem and after the expiration of RCU GP, these freed elements are only
> available for one bpf ma but the number of these freed elements maybe too many
> for one bpf ma, so part of these freed elements will be freed through
> call_rcu_tasks_trace() and these freed-again elements will not be available for
> slab subsystem untill the expiration of tasks trace RCU. In brief, after one RCU
> GP, part of these freed elements will be reused, but the majority of these
> elements will still be freed through call_rcu_tasks_trace(). Due to the doubt
> above, I proposed BPF_FREE_AFTER_RCU to directly free these elements after one
> RCU GP and enforce sleepable program to use bpf_rcu_read_lock() to access these
> elements, but the enforcement will break the existing sleepable programs, so
> BPF_FREE_AFTER_GP is still not a good idea. I will check whether or not these is
> still OOM risk for BPF_REUSE_AFTER_RCU_GP and try to mitigate if it is possible
> (e.g., share these freed elements between all bpf ma instead of one bpf ma which
> free it).

Since BPF_REUSE_AFTER_RCU_GP is a new thing that will be used
by qptrie map and maybe? local storage, there is no sleepable breakage.
If we start using BPF_REUSE_AFTER_RCU_GP for hashmaps with kptrs
and enforce bpf_rcu_read_lock() this is also ok, since kptrs are unstable.
I prefer to avoid complicating bpf ma with sharing free lists across all ma-s.
Unless this is really trivial code that is easy to review.