
[bpf-next,v3,2/3] bpf: Support kptrs in local storage maps

Message ID 20230225154010.391965-3-memxor@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series Add support for kptrs in more BPF maps

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 98 this patch: 99
netdev/cc_maintainers warning 7 maintainers not CCed: jolsa@kernel.org john.fastabend@gmail.com martin.lau@linux.dev yhs@fb.com song@kernel.org haoluo@google.com sdf@google.com
netdev/build_clang success Errors and warnings before: 12 this patch: 12
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 98 this patch: 99
netdev/checkpatch warning WARNING: Block comments should align the * on each line WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-14 fail Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32 on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_progs_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_progs_parallel on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on aarch64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-37 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for test_verifier on x86_64 with llvm-17

Commit Message

Kumar Kartikeya Dwivedi Feb. 25, 2023, 3:40 p.m. UTC
Enable support for kptrs in local storage maps by wiring up the freeing
of these kptrs from map value. Freeing of bpf_local_storage_map is only
delayed in case there are special fields, therefore bpf_selem_free_*
path can also only dereference smap safely in that case. This is
recorded using a bool utilizing a hole in bpF_local_storage_elem. It
could have been tagged in the pointer value smap using the lowest bit
(since alignment > 1), but since there was already a hole I went with
the simpler option. Only the map structure freeing is delayed using RCU
barriers, as the buckets aren't used when selem is being freed, so they
can be freed once all readers of the bucket lists can no longer access
it.

Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
 include/linux/bpf_local_storage.h |  6 ++++
 kernel/bpf/bpf_local_storage.c    | 48 ++++++++++++++++++++++++++++---
 kernel/bpf/syscall.c              |  6 +++-
 kernel/bpf/verifier.c             | 12 +++++---
 4 files changed, 63 insertions(+), 9 deletions(-)
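For context, a minimal sketch of the kind of BPF program this enables: storing a referenced kptr in a task local storage map value via bpf_kptr_xchg(). The map/program names, attach point and kfunc declarations below are illustrative assumptions, and the exact kptr tag macro (__kptr vs. __kptr_ref) depends on the libbpf/kernel version; this is not part of the patch itself.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Illustrative declarations of existing task kfuncs, not added by this patch */
struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
void bpf_task_release(struct task_struct *p) __ksym;

struct map_value {
	struct task_struct __kptr *task;	/* kptr field in the map value */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct map_value);
} task_map SEC(".maps");

SEC("tp_btf/task_newtask")
int BPF_PROG(on_newtask, struct task_struct *task, u64 clone_flags)
{
	struct task_struct *acquired, *old;
	struct map_value *v;

	v = bpf_task_storage_get(&task_map, task, NULL,
				 BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (!v)
		return 0;

	acquired = bpf_task_acquire(task);
	if (!acquired)
		return 0;

	/* bpf_kptr_xchg on a local storage value is what the verifier
	 * change in this patch permits; the old kptr, if any, must be
	 * released by the program.
	 */
	old = bpf_kptr_xchg(&v->task, acquired);
	if (old)
		bpf_task_release(old);
	return 0;
}

char _license[] SEC("license") = "GPL";

If the storage element is later destroyed while still holding a kptr, the kernel-side freeing wired up below releases it via bpf_obj_free_fields().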

Comments

Martin KaFai Lau Feb. 27, 2023, 9:16 p.m. UTC | #1
On 2/25/23 7:40 AM, Kumar Kartikeya Dwivedi wrote:
> Enable support for kptrs in local storage maps by wiring up the freeing
> of these kptrs from map value. Freeing of bpf_local_storage_map is only
> delayed in case there are special fields, therefore bpf_selem_free_*
> path can also only dereference smap safely in that case. This is
> recorded using a bool utilizing a hole in bpF_local_storage_elem. It
> could have been tagged in the pointer value smap using the lowest bit
> (since alignment > 1), but since there was already a hole I went with
> the simpler option. Only the map structure freeing is delayed using RCU
> barriers, as the buckets aren't used when selem is being freed, so they
> can be freed once all readers of the bucket lists can no longer access
> it.
> 
> Cc: Martin KaFai Lau <martin.lau@kernel.org>
> Cc: KP Singh <kpsingh@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>   include/linux/bpf_local_storage.h |  6 ++++
>   kernel/bpf/bpf_local_storage.c    | 48 ++++++++++++++++++++++++++++---
>   kernel/bpf/syscall.c              |  6 +++-
>   kernel/bpf/verifier.c             | 12 +++++---
>   4 files changed, 63 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 6d37a40cd90e..0fe92986412b 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -74,6 +74,12 @@ struct bpf_local_storage_elem {
>   	struct hlist_node snode;	/* Linked to bpf_local_storage */
>   	struct bpf_local_storage __rcu *local_storage;
>   	struct rcu_head rcu;
> +	bool can_use_smap; /* Is it safe to access smap in bpf_selem_free_* RCU
> +			    * callbacks? bpf_local_storage_map_free only
> +			    * executes rcu_barrier when there are special
> +			    * fields, this field remembers that to ensure we
> +			    * don't access already freed smap in sdata.
> +			    */
>   	/* 8 bytes hole */
>   	/* The data is stored in another cacheline to minimize
>   	 * the number of cachelines access during a cache hit.
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 58da17ae5124..2bdd722fe293 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -85,6 +85,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>   	if (selem) {
>   		if (value)
>   			copy_map_value(&smap->map, SDATA(selem)->data, value);
> +		/* No need to call check_and_init_map_value as memory is zero init */
>   		return selem;
>   	}
>   
> @@ -113,10 +114,25 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
>   	struct bpf_local_storage_elem *selem;
>   
>   	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> +	/* The can_use_smap bool is set whenever we need to free additional
> +	 * fields in selem data before freeing selem. bpf_local_storage_map_free
> +	 * only executes rcu_barrier to wait for RCU callbacks when it has
> +	 * special fields, hence we can only conditionally dereference smap, as
> +	 * by this time the map might have already been freed without waiting
> +	 * for our call_rcu callback if it did not have any special fields.
> +	 */
> +	if (selem->can_use_smap)
> +		bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
> +	kfree(selem);
> +}
> +
> +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
nit. May be a shorter name, bpf_selem_free_rcu_trace() ?

> +{
> +	/* Free directly if Tasks Trace RCU GP also implies RCU GP */
>   	if (rcu_trace_implies_rcu_gp())
> -		kfree(selem);
> +		bpf_selem_free_rcu(rcu);
>   	else
> -		kfree_rcu(selem, rcu);
> +		call_rcu(rcu, bpf_selem_free_rcu);
>   }
>   
>   /* local_storage->lock must be held and selem->local_storage == local_storage.
> @@ -170,9 +186,9 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
>   		RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
>   
>   	if (use_trace_rcu)
> -		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> +		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
>   	else
> -		kfree_rcu(selem, rcu);
> +		call_rcu(&selem->rcu, bpf_selem_free_rcu);

Instead of adding 'bool can_use_smap' to 'struct bpf_local_storage_elem', can it 
be a different rcu call back when smap->map.record is not NULL and only that new 
rcu call back can use smap?
I have a use on this 8-byte hole when using bpf_mem_alloc in bpf_local_storage.
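For illustration, a minimal sketch of that alternative (function names and placement assumed, tasks-trace variant omitted):

static void bpf_selem_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct bpf_local_storage_elem, rcu));
}

/* Only queued when the map has a btf_record, so bpf_local_storage_map_free
 * knows it must rcu_barrier() before freeing the record; no per-selem flag.
 */
static void bpf_selem_free_fields_rcu(struct rcu_head *rcu)
{
	struct bpf_local_storage_elem *selem;

	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
	bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
	kfree(selem);
}

/* Caller picks the callback based on the map's record instead of a bool */
static void bpf_selem_free(struct bpf_local_storage_elem *selem,
			   struct bpf_local_storage_map *smap)
{
	if (!IS_ERR_OR_NULL(smap->map.record))
		call_rcu(&selem->rcu, bpf_selem_free_fields_rcu);
	else
		call_rcu(&selem->rcu, bpf_selem_free_rcu);
}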

>   
>   	return free_local_storage;
>   }
> @@ -240,6 +256,11 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
>   	RCU_INIT_POINTER(SDATA(selem)->smap, smap);
>   	hlist_add_head_rcu(&selem->map_node, &b->list);
>   	raw_spin_unlock_irqrestore(&b->lock, flags);
> +
> +	/* If our data will have special fields, smap will wait for us to use
> +	 * its record in bpf_selem_free_* RCU callbacks before freeing itself.
> +	 */
> +	selem->can_use_smap = !IS_ERR_OR_NULL(smap->map.record);
>   }
>   
>   void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu)
> @@ -723,6 +744,25 @@ void bpf_local_storage_map_free(struct bpf_map *map,
>   	 */
>   	synchronize_rcu();
>   
> +	/* Only delay freeing of smap, buckets are not needed anymore */
>   	kvfree(smap->buckets);
> +
> +	/* When local storage has special fields, callbacks for
> +	 * bpf_selem_free_rcu and bpf_selem_free_tasks_trace_rcu will keep using
> +	 * the map BTF record, we need to execute an RCU barrier to wait for
> +	 * them as the record will be freed right after our map_free callback.
> +	 */
> +	if (!IS_ERR_OR_NULL(smap->map.record)) {
> +		rcu_barrier_tasks_trace();
> +		/* We cannot skip rcu_barrier() when rcu_trace_implies_rcu_gp()
> +		 * is true, because while call_rcu invocation is skipped in that
> +		 * case in bpf_selem_free_tasks_trace_rcu (and all local storage
> +		 * maps pass use_trace_rcu = true), there can be call_rcu
> +		 * callbacks based on use_trace_rcu = false in the earlier while
> +		 * ((selem = ...)) loop or from bpf_local_storage_unlink_nolock
> +		 * called from owner's free path.
> +		 */
> +		rcu_barrier();

Others lgtm.
KP Singh Feb. 28, 2023, 3:03 a.m. UTC | #2
On Mon, Feb 27, 2023 at 10:16 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/25/23 7:40 AM, Kumar Kartikeya Dwivedi wrote:
> > Enable support for kptrs in local storage maps by wiring up the freeing
> > of these kptrs from map value. Freeing of bpf_local_storage_map is only

from a map value.

> > delayed in case there are special fields, therefore bpf_selem_free_*

This needs a bit of explanation here, can you add a bit more
description in this commit's log as to what these special fields are?
It would be great if the commit descriptions are hermetic and self
explanatory.

> > path can also only dereference smap safely in that case. This is
> > recorded using a bool utilizing a hole in bpF_local_storage_elem. It

nit: bpf_local_storage_elem

> > could have been tagged in the pointer value smap using the lowest bit
> > (since alignment > 1), but since there was already a hole I went with
> > the simpler option. Only the map structure freeing is delayed using RCU
> > barriers, as the buckets aren't used when selem is being freed, so they
> > can be freed once all readers of the bucket lists can no longer access
> > it.
> >
> > Cc: Martin KaFai Lau <martin.lau@kernel.org>
> > Cc: KP Singh <kpsingh@kernel.org>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> >   include/linux/bpf_local_storage.h |  6 ++++
> >   kernel/bpf/bpf_local_storage.c    | 48 ++++++++++++++++++++++++++++---
> >   kernel/bpf/syscall.c              |  6 +++-
> >   kernel/bpf/verifier.c             | 12 +++++---
> >   4 files changed, 63 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > index 6d37a40cd90e..0fe92986412b 100644
> > --- a/include/linux/bpf_local_storage.h
> > +++ b/include/linux/bpf_local_storage.h
> > @@ -74,6 +74,12 @@ struct bpf_local_storage_elem {
> >       struct hlist_node snode;        /* Linked to bpf_local_storage */
> >       struct bpf_local_storage __rcu *local_storage;
> >       struct rcu_head rcu;
> > +     bool can_use_smap; /* Is it safe to access smap in bpf_selem_free_* RCU
> > +                         * callbacks? bpf_local_storage_map_free only
> > +                         * executes rcu_barrier when there are special
> > +                         * fields, this field remembers that to ensure we
> > +                         * don't access already freed smap in sdata.
> > +                         */
> >       /* 8 bytes hole */
> >       /* The data is stored in another cacheline to minimize
> >        * the number of cachelines access during a cache hit.
> > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> > index 58da17ae5124..2bdd722fe293 100644
> > --- a/kernel/bpf/bpf_local_storage.c
> > +++ b/kernel/bpf/bpf_local_storage.c
> > @@ -85,6 +85,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> >       if (selem) {
> >               if (value)
> >                       copy_map_value(&smap->map, SDATA(selem)->data, value);
> > +             /* No need to call check_and_init_map_value as memory is zero init */
> >               return selem;
> >       }
> >
> > @@ -113,10 +114,25 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
> >       struct bpf_local_storage_elem *selem;
> >
> >       selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> > +     /* The can_use_smap bool is set whenever we need to free additional
> > +      * fields in selem data before freeing selem. bpf_local_storage_map_free
> > +      * only executes rcu_barrier to wait for RCU callbacks when it has
> > +      * special fields, hence we can only conditionally dereference smap, as
> > +      * by this time the map might have already been freed without waiting
> > +      * for our call_rcu callback if it did not have any special fields.
> > +      */
> > +     if (selem->can_use_smap)
> > +             bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
> > +     kfree(selem);
> > +}
> > +
> > +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
> nit. May be a shorter name, bpf_selem_free_rcu_trace() ?
>
> > +{
> > +     /* Free directly if Tasks Trace RCU GP also implies RCU GP */
> >       if (rcu_trace_implies_rcu_gp())
> > -             kfree(selem);
> > +             bpf_selem_free_rcu(rcu);
> >       else
> > -             kfree_rcu(selem, rcu);
> > +             call_rcu(rcu, bpf_selem_free_rcu);
> >   }
> >
> >   /* local_storage->lock must be held and selem->local_storage == local_storage.
> > @@ -170,9 +186,9 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
> >               RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
> >
> >       if (use_trace_rcu)
> > -             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> > +             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
> >       else
> > -             kfree_rcu(selem, rcu);
> > +             call_rcu(&selem->rcu, bpf_selem_free_rcu);
>
> Instead of adding 'bool can_use_smap' to 'struct bpf_local_storage_elem', can it
> be a different rcu call back when smap->map.record is not NULL and only that new
> rcu call back can use smap?
> I have a use on this 8-byte hole when using bpf_mem_alloc in bpf_local_storage.
>
> >
> >       return free_local_storage;
> >   }
> > @@ -240,6 +256,11 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
> >       RCU_INIT_POINTER(SDATA(selem)->smap, smap);
> >       hlist_add_head_rcu(&selem->map_node, &b->list);
> >       raw_spin_unlock_irqrestore(&b->lock, flags);
> > +
> > +     /* If our data will have special fields, smap will wait for us to use
> > +      * its record in bpf_selem_free_* RCU callbacks before freeing itself.
> > +      */
> > +     selem->can_use_smap = !IS_ERR_OR_NULL(smap->map.record);
> >   }
> >
> >   void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu)
> > @@ -723,6 +744,25 @@ void bpf_local_storage_map_free(struct bpf_map *map,
> >        */
> >       synchronize_rcu();
> >
> > +     /* Only delay freeing of smap, buckets are not needed anymore */
> >       kvfree(smap->buckets);
> > +
> > +     /* When local storage has special fields, callbacks for
> > +      * bpf_selem_free_rcu and bpf_selem_free_tasks_trace_rcu will keep using
> > +      * the map BTF record, we need to execute an RCU barrier to wait for
> > +      * them as the record will be freed right after our map_free callback.
> > +      */
> > +     if (!IS_ERR_OR_NULL(smap->map.record)) {
> > +             rcu_barrier_tasks_trace();
> > +             /* We cannot skip rcu_barrier() when rcu_trace_implies_rcu_gp()
> > +              * is true, because while call_rcu invocation is skipped in that
> > +              * case in bpf_selem_free_tasks_trace_rcu (and all local storage
> > +              * maps pass use_trace_rcu = true), there can be call_rcu
> > +              * callbacks based on use_trace_rcu = false in the earlier while
> > +              * ((selem = ...)) loop or from bpf_local_storage_unlink_nolock
> > +              * called from owner's free path.
> > +              */
> > +             rcu_barrier();
>
> Others lgtm.

+1

>
Alexei Starovoitov March 1, 2023, 6:29 p.m. UTC | #3
On Mon, Feb 27, 2023 at 7:04 PM KP Singh <kpsingh@kernel.org> wrote:
>
> > >
> > >       if (use_trace_rcu)
> > > -             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> > > +             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
> > >       else
> > > -             kfree_rcu(selem, rcu);
> > > +             call_rcu(&selem->rcu, bpf_selem_free_rcu);
> >
> > Instead of adding 'bool can_use_smap' to 'struct bpf_local_storage_elem', can it
> > be a different rcu call back when smap->map.record is not NULL and only that new
> > rcu call back can use smap?
> > I have a use on this 8-byte hole when using bpf_mem_alloc in bpf_local_storage.

I've decided to apply it as-is to speed things up.
Kumar, please follow up addressing Martin's and KP's suggestions.
Martin KaFai Lau March 4, 2023, 7:52 a.m. UTC | #4
On 2/25/23 7:40 AM, Kumar Kartikeya Dwivedi wrote:
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 6d37a40cd90e..0fe92986412b 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -74,6 +74,12 @@ struct bpf_local_storage_elem {
>   	struct hlist_node snode;	/* Linked to bpf_local_storage */
>   	struct bpf_local_storage __rcu *local_storage;
>   	struct rcu_head rcu;
> +	bool can_use_smap; /* Is it safe to access smap in bpf_selem_free_* RCU
> +			    * callbacks? bpf_local_storage_map_free only
> +			    * executes rcu_barrier when there are special
> +			    * fields, this field remembers that to ensure we
> +			    * don't access already freed smap in sdata.
> +			    */
>   	/* 8 bytes hole */
>   	/* The data is stored in another cacheline to minimize
>   	 * the number of cachelines access during a cache hit.
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 58da17ae5124..2bdd722fe293 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -85,6 +85,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>   	if (selem) {
>   		if (value)
>   			copy_map_value(&smap->map, SDATA(selem)->data, value);
> +		/* No need to call check_and_init_map_value as memory is zero init */
>   		return selem;
>   	}
>   
> @@ -113,10 +114,25 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
>   	struct bpf_local_storage_elem *selem;
>   
>   	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> +	/* The can_use_smap bool is set whenever we need to free additional
> +	 * fields in selem data before freeing selem. bpf_local_storage_map_free
> +	 * only executes rcu_barrier to wait for RCU callbacks when it has
> +	 * special fields, hence we can only conditionally dereference smap, as
> +	 * by this time the map might have already been freed without waiting
> +	 * for our call_rcu callback if it did not have any special fields.
> +	 */
> +	if (selem->can_use_smap)
> +		bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
> +	kfree(selem);
> +}
> +
> +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
> +{
> +	/* Free directly if Tasks Trace RCU GP also implies RCU GP */
>   	if (rcu_trace_implies_rcu_gp())
> -		kfree(selem);
> +		bpf_selem_free_rcu(rcu);
>   	else
> -		kfree_rcu(selem, rcu);
> +		call_rcu(rcu, bpf_selem_free_rcu);
>   }
>   
>   /* local_storage->lock must be held and selem->local_storage == local_storage.
> @@ -170,9 +186,9 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
>   		RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
>   
>   	if (use_trace_rcu)
> -		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> +		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
>   	else
> -		kfree_rcu(selem, rcu);
> +		call_rcu(&selem->rcu, bpf_selem_free_rcu);

After another thought, bpf_obj_free_fields() does not need to go through the rcu 
gp, right?

bpf_obj_free_fields() can be done just before the call_rcu_tasks_trace() and the 
call_rcu() here. In hashtab, bpf_obj_free_fields() is also done just before 
bpf_mem_cache_free().
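A rough sketch of that ordering, for illustration only (this is the reviewer's suggestion, not what the patch above does):

	/* Free special fields synchronously at unlink time so the RCU
	 * callbacks never need to dereference smap and map_free needs no
	 * extra rcu_barrier(); bpf_obj_free_fields() is a no-op for a
	 * NULL/ERR record.
	 */
	bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
	if (use_trace_rcu)
		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
	else
		call_rcu(&selem->rcu, bpf_selem_free_rcu);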

>   
>   	return free_local_storage;
>   }
> @@ -240,6 +256,11 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
>   	RCU_INIT_POINTER(SDATA(selem)->smap, smap);
>   	hlist_add_head_rcu(&selem->map_node, &b->list);
>   	raw_spin_unlock_irqrestore(&b->lock, flags);
> +
> +	/* If our data will have special fields, smap will wait for us to use
> +	 * its record in bpf_selem_free_* RCU callbacks before freeing itself.
> +	 */
> +	selem->can_use_smap = !IS_ERR_OR_NULL(smap->map.record);
>   }
>   
>   void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu)
> @@ -723,6 +744,25 @@ void bpf_local_storage_map_free(struct bpf_map *map,
>   	 */
>   	synchronize_rcu();
>   
> +	/* Only delay freeing of smap, buckets are not needed anymore */
>   	kvfree(smap->buckets);
> +
> +	/* When local storage has special fields, callbacks for
> +	 * bpf_selem_free_rcu and bpf_selem_free_tasks_trace_rcu will keep using
> +	 * the map BTF record, we need to execute an RCU barrier to wait for
> +	 * them as the record will be freed right after our map_free callback.
> +	 */
> +	if (!IS_ERR_OR_NULL(smap->map.record)) {
> +		rcu_barrier_tasks_trace();
> +		/* We cannot skip rcu_barrier() when rcu_trace_implies_rcu_gp()
> +		 * is true, because while call_rcu invocation is skipped in that
> +		 * case in bpf_selem_free_tasks_trace_rcu (and all local storage
> +		 * maps pass use_trace_rcu = true), there can be call_rcu
> +		 * callbacks based on use_trace_rcu = false in the earlier while
> +		 * ((selem = ...)) loop or from bpf_local_storage_unlink_nolock
> +		 * called from owner's free path.
> +		 */
> +		rcu_barrier();
> +	}
>   	bpf_map_area_free(smap);
>   }
Kumar Kartikeya Dwivedi March 4, 2023, 3:34 p.m. UTC | #5
On Sat, 4 Mar 2023 at 08:53, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 2/25/23 7:40 AM, Kumar Kartikeya Dwivedi wrote:
> > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > index 6d37a40cd90e..0fe92986412b 100644
> > --- a/include/linux/bpf_local_storage.h
> > +++ b/include/linux/bpf_local_storage.h
> > @@ -74,6 +74,12 @@ struct bpf_local_storage_elem {
> >       struct hlist_node snode;        /* Linked to bpf_local_storage */
> >       struct bpf_local_storage __rcu *local_storage;
> >       struct rcu_head rcu;
> > +     bool can_use_smap; /* Is it safe to access smap in bpf_selem_free_* RCU
> > +                         * callbacks? bpf_local_storage_map_free only
> > +                         * executes rcu_barrier when there are special
> > +                         * fields, this field remembers that to ensure we
> > +                         * don't access already freed smap in sdata.
> > +                         */
> >       /* 8 bytes hole */
> >       /* The data is stored in another cacheline to minimize
> >        * the number of cachelines access during a cache hit.
> > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> > index 58da17ae5124..2bdd722fe293 100644
> > --- a/kernel/bpf/bpf_local_storage.c
> > +++ b/kernel/bpf/bpf_local_storage.c
> > @@ -85,6 +85,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> >       if (selem) {
> >               if (value)
> >                       copy_map_value(&smap->map, SDATA(selem)->data, value);
> > +             /* No need to call check_and_init_map_value as memory is zero init */
> >               return selem;
> >       }
> >
> > @@ -113,10 +114,25 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
> >       struct bpf_local_storage_elem *selem;
> >
> >       selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> > +     /* The can_use_smap bool is set whenever we need to free additional
> > +      * fields in selem data before freeing selem. bpf_local_storage_map_free
> > +      * only executes rcu_barrier to wait for RCU callbacks when it has
> > +      * special fields, hence we can only conditionally dereference smap, as
> > +      * by this time the map might have already been freed without waiting
> > +      * for our call_rcu callback if it did not have any special fields.
> > +      */
> > +     if (selem->can_use_smap)
> > +             bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
> > +     kfree(selem);
> > +}
> > +
> > +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
> > +{
> > +     /* Free directly if Tasks Trace RCU GP also implies RCU GP */
> >       if (rcu_trace_implies_rcu_gp())
> > -             kfree(selem);
> > +             bpf_selem_free_rcu(rcu);
> >       else
> > -             kfree_rcu(selem, rcu);
> > +             call_rcu(rcu, bpf_selem_free_rcu);
> >   }
> >
> >   /* local_storage->lock must be held and selem->local_storage == local_storage.
> > @@ -170,9 +186,9 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
> >               RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
> >
> >       if (use_trace_rcu)
> > -             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> > +             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
> >       else
> > -             kfree_rcu(selem, rcu);
> > +             call_rcu(&selem->rcu, bpf_selem_free_rcu);
>
> After another thought, bpf_obj_free_fields() does not need to go through the rcu
> gp, right?
>
> bpf_obj_free_fields() can be done just before the call_rcu_tasks_trace() and the
> call_rcu() here. In hashtab, bpf_obj_free_fields() is also done just before
> bpf_mem_cache_free().

Perhaps not. But the original code for hashtab prior to conversion to
use bpf_mem_cache actually freed timers and kptrs after waiting for a
complete RCU grace period for the kmalloc case. My main idea was to
try to delay it until the last point, where memory is returned for
reuse. Now that does not include a RCU grace period for hashtab
anymore because memory can be reused as soon as it is returned to
bpf_mem_cache. Same for array maps where update does the freeing.

bpf_obj_free_fields can possibly do a lot of work, try to acquire the
bpf_spin_lock in map value, etc. Even moreso now that we have lists
and rbtree that could be in map values, where they have to drain all
elements (which might have fields of their own). Not doing that in the
context of the program calling update or delete is usually better if
we have a choice, since it might introduce unexpected delays. Here we
are doing an RCU callback in all cases, so I think it's better to
delay freeing the fields and do it in RCU callback, since we are doing
call_rcu anyway.
Martin KaFai Lau March 4, 2023, 4:17 p.m. UTC | #6
On 3/4/23 7:34 AM, Kumar Kartikeya Dwivedi wrote:
>>> @@ -113,10 +114,25 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
>>>        struct bpf_local_storage_elem *selem;
>>>
>>>        selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
>>> +     /* The can_use_smap bool is set whenever we need to free additional
>>> +      * fields in selem data before freeing selem. bpf_local_storage_map_free
>>> +      * only executes rcu_barrier to wait for RCU callbacks when it has
>>> +      * special fields, hence we can only conditionally dereference smap, as
>>> +      * by this time the map might have already been freed without waiting
>>> +      * for our call_rcu callback if it did not have any special fields.
>>> +      */
>>> +     if (selem->can_use_smap)
>>> +             bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
>>> +     kfree(selem);
>>> +}
>>> +
>>> +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
>>> +{
>>> +     /* Free directly if Tasks Trace RCU GP also implies RCU GP */
>>>        if (rcu_trace_implies_rcu_gp())
>>> -             kfree(selem);
>>> +             bpf_selem_free_rcu(rcu);
>>>        else
>>> -             kfree_rcu(selem, rcu);
>>> +             call_rcu(rcu, bpf_selem_free_rcu);
>>>    }
>>>
>>>    /* local_storage->lock must be held and selem->local_storage == local_storage.
>>> @@ -170,9 +186,9 @@ static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
>>>                RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
>>>
>>>        if (use_trace_rcu)
>>> -             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
>>> +             call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
>>>        else
>>> -             kfree_rcu(selem, rcu);
>>> +             call_rcu(&selem->rcu, bpf_selem_free_rcu);
>> After another thought, bpf_obj_free_fields() does not need to go through the rcu
>> gp, right?
>>
>> bpf_obj_free_fields() can be done just before the call_rcu_tasks_trace() and the
>> call_rcu() here. In hashtab, bpf_obj_free_fields() is also done just before
>> bpf_mem_cache_free().
> Perhaps not. But the original code for hashtab prior to conversion to
> use bpf_mem_cache actually freed timers and kptrs after waiting for a
> complete RCU grace period for the kmalloc case. My main idea was to
> try to delay it until the last point, where memory is returned for
> reuse. Now that does not include a RCU grace period for hashtab
> anymore because memory can be reused as soon as it is returned to
> bpf_mem_cache. Same for array maps where update does the freeing.
> 
> bpf_obj_free_fields can possibly do a lot of work, try to acquire the
> bpf_spin_lock in map value, etc. Even moreso now that we have lists
> and rbtree that could be in map values, where they have to drain all
> elements (which might have fields of their own). Not doing that in the
> context of the program calling update or delete is usually better if
> we have a choice, since it might introduce unexpected delays. Here we
> are doing an RCU callback in all cases, so I think it's better to
> delay freeing the fields and do it in RCU callback, since we are doing
> call_rcu anyway.

The delete_elem for local storage is not the common use case. The usage is 
usually to have the storage stay with its owner lifetime until the bpf storage 
is destroyed by bpf_{sk,task,inode,cgrp}_storage_free. The userspace does not 
need to track the lifetime of its owner which could be fragile.

More importantly, I am moving local storage to bpf_mem_cache_alloc/free because 
of potential deadlock during the kmalloc time: 
https://lore.kernel.org/bpf/dea8c3c5-0739-58c1-9a88-b989878a9b8f@linux.dev/
Thus, bpf_obj_free_fields() needs to be done before freeing the selem. I have 
already made this change in my set and will post shortly.

Thanks for the prompt reply!

Patch

diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 6d37a40cd90e..0fe92986412b 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -74,6 +74,12 @@  struct bpf_local_storage_elem {
 	struct hlist_node snode;	/* Linked to bpf_local_storage */
 	struct bpf_local_storage __rcu *local_storage;
 	struct rcu_head rcu;
+	bool can_use_smap; /* Is it safe to access smap in bpf_selem_free_* RCU
+			    * callbacks? bpf_local_storage_map_free only
+			    * executes rcu_barrier when there are special
+			    * fields, this field remembers that to ensure we
+			    * don't access already freed smap in sdata.
+			    */
 	/* 8 bytes hole */
 	/* The data is stored in another cacheline to minimize
 	 * the number of cachelines access during a cache hit.
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 58da17ae5124..2bdd722fe293 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -85,6 +85,7 @@  bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
 	if (selem) {
 		if (value)
 			copy_map_value(&smap->map, SDATA(selem)->data, value);
+		/* No need to call check_and_init_map_value as memory is zero init */
 		return selem;
 	}
 
@@ -113,10 +114,25 @@  static void bpf_selem_free_rcu(struct rcu_head *rcu)
 	struct bpf_local_storage_elem *selem;
 
 	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
+	/* The can_use_smap bool is set whenever we need to free additional
+	 * fields in selem data before freeing selem. bpf_local_storage_map_free
+	 * only executes rcu_barrier to wait for RCU callbacks when it has
+	 * special fields, hence we can only conditionally dereference smap, as
+	 * by this time the map might have already been freed without waiting
+	 * for our call_rcu callback if it did not have any special fields.
+	 */
+	if (selem->can_use_smap)
+		bpf_obj_free_fields(SDATA(selem)->smap->map.record, SDATA(selem)->data);
+	kfree(selem);
+}
+
+static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
+{
+	/* Free directly if Tasks Trace RCU GP also implies RCU GP */
 	if (rcu_trace_implies_rcu_gp())
-		kfree(selem);
+		bpf_selem_free_rcu(rcu);
 	else
-		kfree_rcu(selem, rcu);
+		call_rcu(rcu, bpf_selem_free_rcu);
 }
 
 /* local_storage->lock must be held and selem->local_storage == local_storage.
@@ -170,9 +186,9 @@  static bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_stor
 		RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
 
 	if (use_trace_rcu)
-		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
+		call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
 	else
-		kfree_rcu(selem, rcu);
+		call_rcu(&selem->rcu, bpf_selem_free_rcu);
 
 	return free_local_storage;
 }
@@ -240,6 +256,11 @@  void bpf_selem_link_map(struct bpf_local_storage_map *smap,
 	RCU_INIT_POINTER(SDATA(selem)->smap, smap);
 	hlist_add_head_rcu(&selem->map_node, &b->list);
 	raw_spin_unlock_irqrestore(&b->lock, flags);
+
+	/* If our data will have special fields, smap will wait for us to use
+	 * its record in bpf_selem_free_* RCU callbacks before freeing itself.
+	 */
+	selem->can_use_smap = !IS_ERR_OR_NULL(smap->map.record);
 }
 
 void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool use_trace_rcu)
@@ -723,6 +744,25 @@  void bpf_local_storage_map_free(struct bpf_map *map,
 	 */
 	synchronize_rcu();
 
+	/* Only delay freeing of smap, buckets are not needed anymore */
 	kvfree(smap->buckets);
+
+	/* When local storage has special fields, callbacks for
+	 * bpf_selem_free_rcu and bpf_selem_free_tasks_trace_rcu will keep using
+	 * the map BTF record, we need to execute an RCU barrier to wait for
+	 * them as the record will be freed right after our map_free callback.
+	 */
+	if (!IS_ERR_OR_NULL(smap->map.record)) {
+		rcu_barrier_tasks_trace();
+		/* We cannot skip rcu_barrier() when rcu_trace_implies_rcu_gp()
+		 * is true, because while call_rcu invocation is skipped in that
+		 * case in bpf_selem_free_tasks_trace_rcu (and all local storage
+		 * maps pass use_trace_rcu = true), there can be call_rcu
+		 * callbacks based on use_trace_rcu = false in the earlier while
+		 * ((selem = ...)) loop or from bpf_local_storage_unlink_nolock
+		 * called from owner's free path.
+		 */
+		rcu_barrier();
+	}
 	bpf_map_area_free(smap);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index da117a2a83b2..eb50025b03c1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1063,7 +1063,11 @@  static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
 				    map->map_type != BPF_MAP_TYPE_LRU_PERCPU_HASH &&
 				    map->map_type != BPF_MAP_TYPE_ARRAY &&
-				    map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY) {
+				    map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY &&
+				    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
+				    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
+				    map->map_type != BPF_MAP_TYPE_TASK_STORAGE &&
+				    map->map_type != BPF_MAP_TYPE_CGRP_STORAGE) {
 					ret = -EOPNOTSUPP;
 					goto free_map_tab;
 				}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5cb8b623f639..f5e2813e22de 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7151,22 +7151,26 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_MAP_TYPE_SK_STORAGE:
 		if (func_id != BPF_FUNC_sk_storage_get &&
-		    func_id != BPF_FUNC_sk_storage_delete)
+		    func_id != BPF_FUNC_sk_storage_delete &&
+		    func_id != BPF_FUNC_kptr_xchg)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_INODE_STORAGE:
 		if (func_id != BPF_FUNC_inode_storage_get &&
-		    func_id != BPF_FUNC_inode_storage_delete)
+		    func_id != BPF_FUNC_inode_storage_delete &&
+		    func_id != BPF_FUNC_kptr_xchg)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_TASK_STORAGE:
 		if (func_id != BPF_FUNC_task_storage_get &&
-		    func_id != BPF_FUNC_task_storage_delete)
+		    func_id != BPF_FUNC_task_storage_delete &&
+		    func_id != BPF_FUNC_kptr_xchg)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_CGRP_STORAGE:
 		if (func_id != BPF_FUNC_cgrp_storage_get &&
-		    func_id != BPF_FUNC_cgrp_storage_delete)
+		    func_id != BPF_FUNC_cgrp_storage_delete &&
+		    func_id != BPF_FUNC_kptr_xchg)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_BLOOM_FILTER: