[v5,bpf-next,2/3] bpf: implement CAP_BPF

Message ID 20200508215340.41921-3-alexei.starovoitov@gmail.com (mailing list archive)
State New, archived
Series Introduce CAP_BPF

Commit Message

Alexei Starovoitov May 8, 2020, 9:53 p.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

Implement permissions as stated in uapi/linux/capability.h
In order to do that, the verifier's allow_ptr_leaks flag is split
into allow_ptr_leaks and bpf_capable flags, which are set as:
  env->allow_ptr_leaks = perfmon_capable();
  env->bpf_capable = bpf_capable();

bpf_capable enables bounded loops, variable stack access and other verifier features.
allow_ptr_leaks enables ptr leaks, ptr conversions, subtraction of pointers, etc.
It also disables side channel mitigations.

That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will
have speculative checks done by the verifier and other spectre mitigations applied.
Such a networking BPF program will not be able to leak kernel pointers.
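
For reference, the capability helpers introduced in patch 1/3 resolve
roughly as below (a sketch; CAP_SYS_ADMIN keeps implying both):

	static inline bool bpf_capable(void)
	{
		return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
	}

	static inline bool perfmon_capable(void)
	{
		return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
	}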

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 drivers/media/rc/bpf-lirc.c   |  2 +-
 include/linux/bpf_verifier.h  |  1 +
 kernel/bpf/arraymap.c         |  2 +-
 kernel/bpf/bpf_struct_ops.c   |  2 +-
 kernel/bpf/core.c             |  4 +-
 kernel/bpf/cpumap.c           |  2 +-
 kernel/bpf/hashtab.c          |  4 +-
 kernel/bpf/helpers.c          |  4 +-
 kernel/bpf/lpm_trie.c         |  2 +-
 kernel/bpf/queue_stack_maps.c |  2 +-
 kernel/bpf/reuseport_array.c  |  2 +-
 kernel/bpf/stackmap.c         |  2 +-
 kernel/bpf/syscall.c          | 87 ++++++++++++++++++++++++++---------
 kernel/bpf/verifier.c         | 24 +++++-----
 kernel/trace/bpf_trace.c      |  3 ++
 net/core/bpf_sk_storage.c     |  4 +-
 net/core/filter.c             |  4 +-
 17 files changed, 102 insertions(+), 49 deletions(-)

Comments

Stanislav Fomichev May 12, 2020, 12:12 a.m. UTC | #1
On 05/08, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
[..]
> @@ -3932,7 +3977,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
>   	union bpf_attr attr;
>   	int err;

> -	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
>   		return -EPERM;
This is awesome, thanks for reviving the effort!

One question I have about this particular snippet:
Does it make sense to drop bpf_capable checks for the operations
that work on a provided fd?

The use-case I have in mind is as follows:
* privileged (CAP_BPF) process loads the programs/maps and pins
   them at some known location
* unprivileged process opens up those pins and does the following:
   * prepares the maps (and will later on read them)
   * does SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF which afaik don't
     require any capabilities

This essentially pushes some of the permission checks into a fs layer. So
whoever has a file descriptor (via unix sock or open) can do BPF operations
on the object that represents it.
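
For illustration, a minimal sketch of the unprivileged side of that flow
(the pin path here is made up):

	#include <sys/socket.h>
	#include <bpf/bpf.h>

	static int attach_pinned_filter(int sock)
	{
		/* open a prog that a CAP_BPF process pinned earlier */
		int prog_fd = bpf_obj_get("/sys/fs/bpf/my_filter");

		if (prog_fd < 0)
			return -1;
		/* SO_ATTACH_BPF takes the prog fd and needs no capability */
		return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
				  &prog_fd, sizeof(prog_fd));
	}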

Thoughts? Am I missing something important?
Alexei Starovoitov May 12, 2020, 2:36 a.m. UTC | #2
On Mon, May 11, 2020 at 05:12:10PM -0700, sdf@google.com wrote:
> On 05/08, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> [..]
> > @@ -3932,7 +3977,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
> >   	union bpf_attr attr;
> >   	int err;
> 
> > -	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> >   		return -EPERM;
> This is awesome, thanks for reviving the effort!
> 
> One question I have about this particular snippet:
> Does it make sense to drop bpf_capable checks for the operations
> that work on a provided fd?

The above snippet is for the case when the sysctl switches unpriv off.
It was a big hammer and stays a big hammer.
I certainly would like to improve the situation, but I suspect
the folks who turn that sysctl knob on are simply paranoid about bpf
and no amount of reasoning would turn them around.

> The use-case I have in mind is as follows:
> * privileged (CAP_BPF) process loads the programs/maps and pins
>   them at some known location
> * unprivileged process opens up those pins and does the following:
>   * prepares the maps (and will later on read them)
>   * does SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF which afaik don't
>     require any capabilities
> 
> This essentially pushes some of the permission checks into a fs layer. So
> whoever has a file descriptor (via unix sock or open) can do BPF operations
> on the object that represents it.

cap_bpf doesn't change things in that regard.
Two cases here:
sysctl_unprivileged_bpf_disabled==0:
  Unpriv can load socket_filter prog type and unpriv can attach it
  via SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF.
sysctl_unprivileged_bpf_disabled==1:
  cap_sys_admin can load socket_filter and unpriv can attach it.

With the addition of cap_bpf, in the second case a cap_bpf process can
load socket_filter too.
It doesn't mean that permissions are pushed into fs layer.
I'm not sure that relaxing of sysctl_unprivileged_bpf_disabled
will be well received.
Are you proposing to selectively allow certain bpf syscall commands
even when sysctl_unprivileged_bpf_disabled==1 ?
Like allow unpriv to do BPF_OBJ_GET to get an fd from bpffs ?
And allow unpriv to do map_update ? 
It makes complete sense to me, but I'd like to argue about that
independently from this cap_bpf set.
We can relax that sysctl later.
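
Something along these lines (purely hypothetical, not part of this set)
in SYSCALL_DEFINE3(bpf, ...):

	/* let fd-based commands through even when unpriv bpf is off */
	if (sysctl_unprivileged_bpf_disabled && !bpf_capable() &&
	    cmd != BPF_OBJ_GET && cmd != BPF_MAP_UPDATE_ELEM)
		return -EPERM;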
Jordan Glover May 12, 2020, 12:50 p.m. UTC | #3
On Tuesday, May 12, 2020 2:36 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, May 11, 2020 at 05:12:10PM -0700, sdf@google.com wrote:
>
> > On 05/08, Alexei Starovoitov wrote:
> >
> > > From: Alexei Starovoitov <ast@kernel.org>
> > [..]
> > > @@ -3932,7 +3977,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
> > >   	union bpf_attr attr;
> > >   	int err;
> >
> > > -	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> > >   		return -EPERM;
> > This is awesome, thanks for reviving the effort!
> >
> > One question I have about this particular snippet:
> > Does it make sense to drop bpf_capable checks for the operations
> > that work on a provided fd?
>
> The above snippet is for the case when the sysctl switches unpriv off.
> It was a big hammer and stays a big hammer.
> I certainly would like to improve the situation, but I suspect
> the folks who turn that sysctl knob on are simply paranoid about bpf
> and no amount of reasoning would turn them around.
>

Without CAP_BPF, the sysctl was the only option to keep you safe from the flow
of bpf vulns. You didn't have to be paranoid about that.

Jordan
Daniel Borkmann May 12, 2020, 2:35 p.m. UTC | #4
On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Implement permissions as stated in uapi/linux/capability.h
> In order to do that, the verifier's allow_ptr_leaks flag is split
> into allow_ptr_leaks and bpf_capable flags, which are set as:
>    env->allow_ptr_leaks = perfmon_capable();
>    env->bpf_capable = bpf_capable();
> 
> bpf_capable enables bounded loops, variable stack access and other verifier features.
> allow_ptr_leaks enables ptr leaks, ptr conversions, subtraction of pointers, etc.
> It also disables side channel mitigations.
> 
> That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will
> have speculative checks done by the verifier and other spectre mitigations applied.
> Such a networking BPF program will not be able to leak kernel pointers.

I don't quite follow this part in the code below yet, see my comments.

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
[...]
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 6abd5a778fcd..c32a7880fa62 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -375,6 +375,7 @@ struct bpf_verifier_env {
>   	u32 used_map_cnt;		/* number of used maps */
>   	u32 id_gen;			/* used to generate unique reg IDs */
>   	bool allow_ptr_leaks;
> +	bool bpf_capable;
>   	bool seen_direct_write;
>   	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
>   	const struct bpf_line_info *prev_linfo;
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 95d77770353c..264a9254dc39 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -77,7 +77,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
>   	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
>   	int ret, numa_node = bpf_map_attr_numa_node(attr);
>   	u32 elem_size, index_mask, max_entries;
> -	bool unpriv = !capable(CAP_SYS_ADMIN);
> +	bool unpriv = !bpf_capable();

So here progs loaded with CAP_BPF will have spectre mitigations bypassed which
is the opposite of above statement, no?

>   	u64 cost, array_size, mask64;
>   	struct bpf_map_memory mem;
>   	struct bpf_array *array;
[...]
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 6aa11de67315..8f421dd0c4cf 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>   void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>   {
>   	if (!bpf_prog_kallsyms_candidate(fp) ||
> -	    !capable(CAP_SYS_ADMIN))
> +	    !bpf_capable())
>   		return;
>   
>   	bpf_prog_ksym_set_addr(fp);
> @@ -824,7 +824,7 @@ static int bpf_jit_charge_modmem(u32 pages)
>   {
>   	if (atomic_long_add_return(pages, &bpf_jit_current) >
>   	    (bpf_jit_limit >> PAGE_SHIFT)) {
> -		if (!capable(CAP_SYS_ADMIN)) {
> +		if (!bpf_capable()) {

Should there still be an upper charge on module mem for !CAP_SYS_ADMIN?

>   			atomic_long_sub(pages, &bpf_jit_current);
>   			return -EPERM;
>   		}
[...]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 70ad009577f8..a6893746cd87 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[...]
> @@ -3428,7 +3429,7 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
>   		 * Spectre masking for stack ALU.
>   		 * See also retrieve_ptr_limit().
>   		 */
> -		if (!env->allow_ptr_leaks) {
> +		if (!env->bpf_capable) {

This needs to stay on env->allow_ptr_leaks, the can_skip_alu_sanitation() does
check on env->allow_ptr_leaks as well, otherwise this breaks spectre mitgation
when masking alu.

>   			char tn_buf[48];
>   
>   			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> @@ -7229,7 +7230,7 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env,
>   		insn_stack[env->cfg.cur_stack++] = w;
>   		return 1;
>   	} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
> -		if (loop_ok && env->allow_ptr_leaks)
> +		if (loop_ok && env->bpf_capable)
>   			return 0;
>   		verbose_linfo(env, t, "%d: ", t);
>   		verbose_linfo(env, w, "%d: ", w);
> @@ -8338,7 +8339,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
>   	if (env->max_states_per_insn < states_cnt)
>   		env->max_states_per_insn = states_cnt;
>   
> -	if (!env->allow_ptr_leaks && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
> +	if (!env->bpf_capable && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
>   		return push_jmp_history(env, cur);
>   
>   	if (!add_new_state)
> @@ -9998,7 +9999,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>   			insn->code = BPF_JMP | BPF_TAIL_CALL;
>   
>   			aux = &env->insn_aux_data[i + delta];
> -			if (env->allow_ptr_leaks && !expect_blinding &&
> +			if (env->bpf_capable && !expect_blinding &&
>   			    prog->jit_requested &&
>   			    !bpf_map_key_poisoned(aux) &&
>   			    !bpf_map_ptr_poisoned(aux) &&
> @@ -10725,7 +10726,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
>   		env->insn_aux_data[i].orig_idx = i;
>   	env->prog = *prog;
>   	env->ops = bpf_verifier_ops[env->prog->type];
> -	is_priv = capable(CAP_SYS_ADMIN);
> +	is_priv = bpf_capable();
>   
>   	if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {
>   		mutex_lock(&bpf_verifier_lock);
> @@ -10766,7 +10767,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
>   	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
>   		env->strict_alignment = false;
>   
> -	env->allow_ptr_leaks = is_priv;
> +	env->allow_ptr_leaks = perfmon_capable();
> +	env->bpf_capable = bpf_capable();
>
Daniel Borkmann May 12, 2020, 3:05 p.m. UTC | #5
On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Implement permissions as stated in uapi/linux/capability.h
> In order to do that, the verifier's allow_ptr_leaks flag is split
> into allow_ptr_leaks and bpf_capable flags, which are set as:
>    env->allow_ptr_leaks = perfmon_capable();
>    env->bpf_capable = bpf_capable();

[...]
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 70ad009577f8..a6893746cd87 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1293,7 +1293,7 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env,
>   	reg->type = SCALAR_VALUE;
>   	reg->var_off = tnum_unknown;
>   	reg->frameno = 0;
> -	reg->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks;
> +	reg->precise = env->subprog_cnt > 1 || !env->bpf_capable;
>   	__mark_reg_unbounded(reg);
>   }
>   
> @@ -1425,8 +1425,9 @@ static int check_subprogs(struct bpf_verifier_env *env)
>   			continue;
>   		if (insn[i].src_reg != BPF_PSEUDO_CALL)
>   			continue;
> -		if (!env->allow_ptr_leaks) {
> -			verbose(env, "function calls to other bpf functions are allowed for root only\n");
> +		if (!env->bpf_capable) {
> +			verbose(env,
> +				"function calls to other bpf functions are allowed for CAP_BPF and CAP_SYS_ADMIN\n");
>   			return -EPERM;
>   		}
>   		ret = add_subprog(env, i + insn[i].imm + 1);
> @@ -1960,7 +1961,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
>   	bool new_marks = false;
>   	int i, err;
>   
> -	if (!env->allow_ptr_leaks)
> +	if (!env->bpf_capable)
>   		/* backtracking is root only for now */
>   		return 0;
>   
> @@ -2208,7 +2209,7 @@ static int check_stack_write(struct bpf_verifier_env *env,
>   		reg = &cur->regs[value_regno];
>   
>   	if (reg && size == BPF_REG_SIZE && register_is_const(reg) &&
> -	    !register_is_null(reg) && env->allow_ptr_leaks) {
> +	    !register_is_null(reg) && env->bpf_capable) {
>   		if (dst_reg != BPF_REG_FP) {
>   			/* The backtracking logic can only recognize explicit
>   			 * stack slot address like [fp - 8]. Other spill of
> @@ -3428,7 +3429,7 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
>   		 * Spectre masking for stack ALU.
>   		 * See also retrieve_ptr_limit().
>   		 */
> -		if (!env->allow_ptr_leaks) {
> +		if (!env->bpf_capable) {
>   			char tn_buf[48];
>   
>   			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
> @@ -7229,7 +7230,7 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env,
>   		insn_stack[env->cfg.cur_stack++] = w;
>   		return 1;
>   	} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
> -		if (loop_ok && env->allow_ptr_leaks)
> +		if (loop_ok && env->bpf_capable)
>   			return 0;
>   		verbose_linfo(env, t, "%d: ", t);
>   		verbose_linfo(env, w, "%d: ", w);
> @@ -8338,7 +8339,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
>   	if (env->max_states_per_insn < states_cnt)
>   		env->max_states_per_insn = states_cnt;
>   
> -	if (!env->allow_ptr_leaks && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
> +	if (!env->bpf_capable && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
>   		return push_jmp_history(env, cur);
>   
>   	if (!add_new_state)
> @@ -9998,7 +9999,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>   			insn->code = BPF_JMP | BPF_TAIL_CALL;
>   
>   			aux = &env->insn_aux_data[i + delta];
> -			if (env->allow_ptr_leaks && !expect_blinding &&
> +			if (env->bpf_capable && !expect_blinding &&
>   			    prog->jit_requested &&
>   			    !bpf_map_key_poisoned(aux) &&
>   			    !bpf_map_ptr_poisoned(aux) &&
> @@ -10725,7 +10726,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
>   		env->insn_aux_data[i].orig_idx = i;
>   	env->prog = *prog;
>   	env->ops = bpf_verifier_ops[env->prog->type];
> -	is_priv = capable(CAP_SYS_ADMIN);
> +	is_priv = bpf_capable();
>   
>   	if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {
>   		mutex_lock(&bpf_verifier_lock);
> @@ -10766,7 +10767,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
>   	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
>   		env->strict_alignment = false;
>   
> -	env->allow_ptr_leaks = is_priv;
> +	env->allow_ptr_leaks = perfmon_capable();
> +	env->bpf_capable = bpf_capable();

Probably more of a detail, but it feels weird to tie perfmon_capable() into the BPF
core and use it in various places there. I would rather make this a proper bpf_*
prefixed helper and add a more descriptive name (what does it have to do with perf
or monitoring directly?). For example, all the main functionality could be under
`bpf_base_capable()` and everything with potential to leak pointers or mem to user
space as `bpf_leak_capable()`. Then inside include/linux/capability.h this can still
resolve under the hood to something like:

static inline bool bpf_base_capable(void)
{
	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
}

static inline bool bpf_leak_capable(void)
{
	return perfmon_capable();
}

Thanks,
Daniel
Alexei Starovoitov May 12, 2020, 3:46 p.m. UTC | #6
On Tue, May 12, 2020 at 12:50:05PM +0000, Jordan Glover wrote:
> On Tuesday, May 12, 2020 2:36 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Mon, May 11, 2020 at 05:12:10PM -0700, sdf@google.com wrote:
> >
> > > On 05/08, Alexei Starovoitov wrote:
> > >
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > [..]
> > > > @@ -3932,7 +3977,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
> > > >   	union bpf_attr attr;
> > > >   	int err;
> > >
> > > > -	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > > +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> > > >   		return -EPERM;
> > > This is awesome, thanks for reviving the effort!
> > >
> > > One question I have about this particular snippet:
> > > Does it make sense to drop bpf_capable checks for the operations
> > > that work on a provided fd?
> >
> > The above snippet is for the case when the sysctl switches unpriv off.
> > It was a big hammer and stays a big hammer.
> > I certainly would like to improve the situation, but I suspect
> > the folks who turn that sysctl knob on are simply paranoid about bpf
> > and no amount of reasoning would turn them around.
> >
> 
> Without CAP_BPF, the sysctl was the only option to keep you safe from the flow
> of bpf vulns. You didn't have to be paranoid about that.

In the year 2020 there were three verifier bugs that could have been exploited
through unpriv. All three were found by the new kBdysch fuzzer. In 2019 there was
nothing. Not because people didn't try, but because the syzbot fuzzer reached its
limit. This cap_bpf will help fuzzers find a new set of bugs.

The pace of bpf development is accelerating, so there will be more bugs found
and introduced in the verifier. Folks that run the very latest kernel are
taking that risk along with the risk associated with other new kernel features.
Yet other features don't have sysctls to disable them.
Stanislav Fomichev May 12, 2020, 3:54 p.m. UTC | #7
On 05/11, Alexei Starovoitov wrote:
> On Mon, May 11, 2020 at 05:12:10PM -0700, sdf@google.com wrote:
> > On 05/08, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > [..]
> > > @@ -3932,7 +3977,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
> > >   	union bpf_attr attr;
> > >   	int err;
> >
> > > -	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> > >   		return -EPERM;
> > This is awesome, thanks for reviving the effort!
> >
> > One question I have about this particular snippet:
> > Does it make sense to drop bpf_capable checks for the operations
> > that work on a provided fd?

> The above snippet is for the case when the sysctl switches unpriv off.
> It was a big hammer and stays a big hammer.
> I certainly would like to improve the situation, but I suspect
> the folks who turn that sysctl knob on are simply paranoid about bpf
> and no amount of reasoning would turn them around.
Yeah, and we do use it unfortunately :-( I suppose we still would
like to keep it that way for a while, but maybe start relaxing
some operations a bit.

> > The use-case I have in mind is as follows:
> > * privileged (CAP_BPF) process loads the programs/maps and pins
> >   them at some known location
> > * unprivileged process opens up those pins and does the following:
> >   * prepares the maps (and will later on read them)
> >   * does SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF which afaik don't
> >     require any capabilities
> >
> > > This essentially pushes some of the permission checks into a fs layer. So
> > > whoever has a file descriptor (via unix sock or open) can do BPF operations
> > > on the object that represents it.

> cap_bpf doesn't change things in that regard.
> Two cases here:
> sysctl_unprivileged_bpf_disabled==0:
>    Unpriv can load socket_filter prog type and unpriv can attach it
>    via SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF.
> sysctl_unprivileged_bpf_disabled==1:
>    cap_sys_admin can load socket_filter and unpriv can attach it.
Sorry, I wasn't clear enough, I was talking about the unpriv_bpf_disabled=1
case.

> With the addition of cap_bpf, in the second case a cap_bpf process can
> load socket_filter too.
> It doesn't mean that permissions are pushed into fs layer.
> I'm not sure that relaxing of sysctl_unprivileged_bpf_disabled
> will be well received.
> Are you proposing to selectively allow certain bpf syscall commands
> even when sysctl_unprivileged_bpf_disabled==1 ?
> Like allow unpriv to do BPF_OBJ_GET to get an fd from bpffs ?
> And allow unpriv to do map_update ?
Yes, that's the gist of what I'm proposing. Allow the operations that
work on an fd even with unpriv_bpf_disabled=1. The assumption is that
obtaining the fd requires a privileged operation on its own and
should give enough protection.
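
As a minimal sketch of what that would allow (the pin path and map
layout are made up):

	#include <linux/types.h>
	#include <bpf/bpf.h>

	static int bump_counter(void)
	{
		/* BPF_OBJ_GET on a map pinned by the privileged loader */
		int map_fd = bpf_obj_get("/sys/fs/bpf/my_map");
		__u32 key = 0;
		__u64 val = 1;

		if (map_fd < 0)
			return -1;
		/* BPF_MAP_UPDATE_ELEM only needs the fd */
		return bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
	}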

> It makes complete sense to me, but I'd like to argue about that
> independently from this cap_bpf set.
> We can relax that sysctl later.
Ack, thanks, let me bring it up again later, when we get to the cap_bpf
state.
Alexei Starovoitov May 12, 2020, 6:25 p.m. UTC | #8
On Tue, May 12, 2020 at 04:35:41PM +0200, Daniel Borkmann wrote:
> On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > Implement permissions as stated in uapi/linux/capability.h
> > In order to do that, the verifier's allow_ptr_leaks flag is split
> > into allow_ptr_leaks and bpf_capable flags, which are set as:
> >    env->allow_ptr_leaks = perfmon_capable();
> >    env->bpf_capable = bpf_capable();
> > 
> > bpf_capable enables bounded loops, variable stack access and other verifier features.
> > allow_ptr_leaks enables ptr leaks, ptr conversions, subtraction of pointers, etc.
> > It also disables side channel mitigations.
> > 
> > That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will
> > have speculative checks done by the verifier and other spectre mitigations applied.
> > Such a networking BPF program will not be able to leak kernel pointers.
> 
> I don't quite follow this part in the code below yet, see my comments.
> 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> [...]
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 6abd5a778fcd..c32a7880fa62 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -375,6 +375,7 @@ struct bpf_verifier_env {
> >   	u32 used_map_cnt;		/* number of used maps */
> >   	u32 id_gen;			/* used to generate unique reg IDs */
> >   	bool allow_ptr_leaks;
> > +	bool bpf_capable;
> >   	bool seen_direct_write;
> >   	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
> >   	const struct bpf_line_info *prev_linfo;
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index 95d77770353c..264a9254dc39 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -77,7 +77,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> >   	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
> >   	int ret, numa_node = bpf_map_attr_numa_node(attr);
> >   	u32 elem_size, index_mask, max_entries;
> > -	bool unpriv = !capable(CAP_SYS_ADMIN);
> > +	bool unpriv = !bpf_capable();
> 
> So here progs loaded with CAP_BPF will have spectre mitigations bypassed which
> is the opposite of above statement, no?

right. good catch, but now I'm not sure it was such a good call to toss
spectre into cap_perfmon. It probably should be disabled under cap_bpf.

> >   	u64 cost, array_size, mask64;
> >   	struct bpf_map_memory mem;
> >   	struct bpf_array *array;
> [...]
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 6aa11de67315..8f421dd0c4cf 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
> >   void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> >   {
> >   	if (!bpf_prog_kallsyms_candidate(fp) ||
> > -	    !capable(CAP_SYS_ADMIN))
> > +	    !bpf_capable())
> >   		return;
> >   	bpf_prog_ksym_set_addr(fp);
> > @@ -824,7 +824,7 @@ static int bpf_jit_charge_modmem(u32 pages)
> >   {
> >   	if (atomic_long_add_return(pages, &bpf_jit_current) >
> >   	    (bpf_jit_limit >> PAGE_SHIFT)) {
> > -		if (!capable(CAP_SYS_ADMIN)) {
> > +		if (!bpf_capable()) {
> 
> Should there still be an upper charge on module mem for !CAP_SYS_ADMIN?

hmm. cap_bpf is a subset of cap_sys_admin. I don't see a reason
to keep requiring cap_sys_admin here.

> 
> >   			atomic_long_sub(pages, &bpf_jit_current);
> >   			return -EPERM;
> >   		}
> [...]
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 70ad009577f8..a6893746cd87 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> [...]
> > @@ -3428,7 +3429,7 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
> >   		 * Spectre masking for stack ALU.
> >   		 * See also retrieve_ptr_limit().
> >   		 */
> > -		if (!env->allow_ptr_leaks) {
> > +		if (!env->bpf_capable) {
> 
> This needs to stay on env->allow_ptr_leaks, the can_skip_alu_sanitation() does
> check on env->allow_ptr_leaks as well, otherwise this breaks spectre mitgation
> when masking alu.

The patch kept it in can_skip_alu_sanitation(), but I missed it here.
I don't really recall the details of the discussion around
commit 088ec26d9c2d ("bpf: Reject indirect var_off stack access in unpriv mode").

So thinking it all over, this bit will effectively disable variable
stack access, which is one of the main usability features.
So for v6 I'm thinking of putting the spectre bypass into cap_bpf.
allow_ptr_leaks will mean only what the name says: pointer leaks only.
cap_bpf should not be given to user processes that want to become root
via spectre side channels.
I think it's a usability trade-off for cap_bpf.
Without indirect var access under cap_bpf, too many networking progs will be forced to use
cap_bpf+cap_net_admin+cap_perfmon only to pass the verifier
while they don't really care about reading arbitrary memory via cap_perfmon.
Alexei Starovoitov May 12, 2020, 6:29 p.m. UTC | #9
On Tue, May 12, 2020 at 05:05:12PM +0200, Daniel Borkmann wrote:
> > -	env->allow_ptr_leaks = is_priv;
> > +	env->allow_ptr_leaks = perfmon_capable();
> > +	env->bpf_capable = bpf_capable();
> 
> Probably more of a detail, but it feels weird to tie perfmon_capable() into the BPF
> core and use it in various places there. I would rather make this a proper bpf_*
> prefixed helper and add a more descriptive name (what does it have to do with perf
> or monitoring directly?). For example, all the main functionality could be under
> `bpf_base_capable()` and everything with potential to leak pointers or mem to user
> space as `bpf_leak_capable()`. Then inside include/linux/capability.h this can still
> resolve under the hood to something like:
> 
> static inline bool bpf_base_capable(void)
> {
> 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
> }

I don't like the 'base' in the name, since 'base' implies a common subset,
but that's not the case. Also, 'base' implies that something else is additive,
but that's not the case either. The real base is unpriv. cap_bpf adds to it.
So bpf_capable() in capability.h is the most appropriate.
It also matches perfmon_capable() and other *_capable() helpers.

> static inline bool bpf_leak_capable(void)
> {
> 	return perfmon_capable();
> }

This is ok, but not in capability.h. I can put it into bpf_verifier.h
Alexei Starovoitov May 12, 2020, 6:39 p.m. UTC | #10
On Tue, May 12, 2020 at 08:54:11AM -0700, sdf@google.com wrote:
> On 05/11, Alexei Starovoitov wrote:
> > On Mon, May 11, 2020 at 05:12:10PM -0700, sdf@google.com wrote:
> > > On 05/08, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > [..]
> > > > @@ -3932,7 +3977,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
> > > >   	union bpf_attr attr;
> > > >   	int err;
> > >
> > > > -	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
> > > > +	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
> > > >   		return -EPERM;
> > > This is awesome, thanks for reviving the effort!
> > >
> > > One question I have about this particular snippet:
> > > Does it make sense to drop bpf_capable checks for the operations
> > > that work on a provided fd?
> 
> > The above snippet is for the case when the sysctl switches unpriv off.
> > It was a big hammer and stays a big hammer.
> > I certainly would like to improve the situation, but I suspect
> > the folks who turn that sysctl knob on are simply paranoid about bpf
> > and no amount of reasoning would turn them around.
> Yeah, and we do use it unfortunately :-( I suppose we still would
> like to keep it that way for a while, but maybe start relaxing
> some operations a bit.

I suspect that was done a couple of years ago when spectre was just discovered and
the verifier wasn't doing speculative analysis.
If your security folks still insist on that sysctl they either didn't
follow bpf development or are paranoid.
If the former is the case I would love to openly discuss all the advances in
the verification logic to prevent side channels.
The verifier is doing an amazing job finding bad assembly code.
There is no other tool that is similarly capable.
All compilers and static analyzers are nowhere close to the level
of sophistication that the verifier has in detecting bad speculation.

> > > The use-case I have in mind is as follows:
> > > * privileged (CAP_BPF) process loads the programs/maps and pins
> > >   them at some known location
> > > * unprivileged process opens up those pins and does the following:
> > >   * prepares the maps (and will later on read them)
> > >   * does SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF which afaik don't
> > >     require any capabilities
> > >
> > > This essentially pushes some of the permission checks into a fs layer. So
> > > whoever has a file descriptor (via unix sock or open) can do BPF operations
> > > on the object that represents it.
> 
> > cap_bpf doesn't change things in that regard.
> > Two cases here:
> > sysctl_unprivileged_bpf_disabled==0:
> >    Unpriv can load socket_filter prog type and unpriv can attach it
> >    via SO_ATTACH_BPF/SO_ATTACH_REUSEPORT_EBPF.
> > sysctl_unprivileged_bpf_disabled==1:
> >    cap_sys_admin can load socket_filter and unpriv can attach it.
> Sorry, I wasn't clear enough, I was talking about unpriv_bpf_disabled=1
> case.
> 
> > With the addition of cap_bpf, in the second case a cap_bpf process can
> > load socket_filter too.
> > It doesn't mean that permissions are pushed into fs layer.
> > I'm not sure that relaxing of sysctl_unprivileged_bpf_disabled
> > will be well received.
> > Are you proposing to selectively allow certain bpf syscall commands
> > even when sysctl_unprivileged_bpf_disabled==1 ?
> > Like allow unpriv to do BPF_OBJ_GET to get an fd from bpffs ?
> > And allow unpriv to do map_update ?
> Yes, that's the gist of what I'm proposing. Allow the operations that
> work on an fd even with unpriv_bpf_disabled=1. The assumption is that
> obtaining the fd requires a privileged operation on its own and
> should give enough protection.

I agree.

> 
> > It makes complete sense to me, but I'd like to argue about that
> > independently from this cap_bpf set.
> > We can relax that sysctl later.
> Ack, thanks, let me bring it up again later, when we get to the cap_bpf
> state.

Thanks for the feedback.
Just to make sure we're on the same page, let me clarify one more thing.
The state of cap_bpf in this patch set is not the final state of bpf
security in general. We were stuck on the cap_bpf proposal since September.
The bpf community lost many months of what could have been gradual
improvements in bpf safety and security.
This cap_bpf is a way to get us unstuck. There will be many more
security-related patches that improve safety, security and usability.
Daniel Borkmann May 12, 2020, 8:07 p.m. UTC | #11
On 5/12/20 8:25 PM, Alexei Starovoitov wrote:
> On Tue, May 12, 2020 at 04:35:41PM +0200, Daniel Borkmann wrote:
>> On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> Implement permissions as stated in uapi/linux/capability.h
>>> In order to do that, the verifier's allow_ptr_leaks flag is split
>>> into allow_ptr_leaks and bpf_capable flags, which are set as:
>>>     env->allow_ptr_leaks = perfmon_capable();
>>>     env->bpf_capable = bpf_capable();
>>>
>>> bpf_capable enables bounded loops, variable stack access and other verifier features.
>>> allow_ptr_leaks enables ptr leaks, ptr conversions, subtraction of pointers, etc.
>>> It also disables side channel mitigations.
>>>
>>> That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will
>>> have speculative checks done by the verifier and other spectre mitigations applied.
>>> Such a networking BPF program will not be able to leak kernel pointers.
>>
>> I don't quite follow this part in the code below yet, see my comments.
>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> [...]
>>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>>> index 6abd5a778fcd..c32a7880fa62 100644
>>> --- a/include/linux/bpf_verifier.h
>>> +++ b/include/linux/bpf_verifier.h
>>> @@ -375,6 +375,7 @@ struct bpf_verifier_env {
>>>    	u32 used_map_cnt;		/* number of used maps */
>>>    	u32 id_gen;			/* used to generate unique reg IDs */
>>>    	bool allow_ptr_leaks;
>>> +	bool bpf_capable;
>>>    	bool seen_direct_write;
>>>    	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
>>>    	const struct bpf_line_info *prev_linfo;
>>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>>> index 95d77770353c..264a9254dc39 100644
>>> --- a/kernel/bpf/arraymap.c
>>> +++ b/kernel/bpf/arraymap.c
>>> @@ -77,7 +77,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
>>>    	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
>>>    	int ret, numa_node = bpf_map_attr_numa_node(attr);
>>>    	u32 elem_size, index_mask, max_entries;
>>> -	bool unpriv = !capable(CAP_SYS_ADMIN);
>>> +	bool unpriv = !bpf_capable();
>>
>> So here progs loaded with CAP_BPF will have spectre mitigations bypassed which
>> is the opposite of above statement, no?
> 
> right. good catch, but now I'm not sure it was such a good call to toss
> spectre into cap_perfmon. It probably should be disabled under cap_bpf.

Right. :( Too bad CAP_*s are not more fine-grained today for more descriptive
policy. I would presume granting the CAP_BPF + CAP_PERFMON combination is not
always desired either, so it probably makes sense to leave it out with a clear
description in patch 1/3 for CAP_BPF about the implications.

>>>    	u64 cost, array_size, mask64;
>>>    	struct bpf_map_memory mem;
>>>    	struct bpf_array *array;
>> [...]
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 6aa11de67315..8f421dd0c4cf 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>>>    void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>>    {
>>>    	if (!bpf_prog_kallsyms_candidate(fp) ||
>>> -	    !capable(CAP_SYS_ADMIN))
>>> +	    !bpf_capable())
>>>    		return;
>>>    	bpf_prog_ksym_set_addr(fp);
>>> @@ -824,7 +824,7 @@ static int bpf_jit_charge_modmem(u32 pages)
>>>    {
>>>    	if (atomic_long_add_return(pages, &bpf_jit_current) >
>>>    	    (bpf_jit_limit >> PAGE_SHIFT)) {
>>> -		if (!capable(CAP_SYS_ADMIN)) {
>>> +		if (!bpf_capable()) {
>>
>> Should there still be an upper charge on module mem for !CAP_SYS_ADMIN?
> 
> hmm. cap_bpf is a subset of cap_sys_admin. I don't see a reason
> to keep requiring cap_sys_admin here.

It should probably be described in the CAP_BPF comment as well since this
is different compared to plain unpriv.

>>>    			atomic_long_sub(pages, &bpf_jit_current);
>>>    			return -EPERM;
>>>    		}
>> [...]
>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>> index 70ad009577f8..a6893746cd87 100644
>>> --- a/kernel/bpf/verifier.c
>>> +++ b/kernel/bpf/verifier.c
>> [...]
>>> @@ -3428,7 +3429,7 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
>>>    		 * Spectre masking for stack ALU.
>>>    		 * See also retrieve_ptr_limit().
>>>    		 */
>>> -		if (!env->allow_ptr_leaks) {
>>> +		if (!env->bpf_capable) {
>>
>> This needs to stay on env->allow_ptr_leaks, the can_skip_alu_sanitation() does
>> check on env->allow_ptr_leaks as well, otherwise this breaks spectre mitgation
>> when masking alu.
> 
> The patch kept it in can_skip_alu_sanitation(), but I missed it here.
> I don't really recall the details of the discussion around
> commit 088ec26d9c2d ("bpf: Reject indirect var_off stack access in unpriv mode").
> 
> So thinking it all over, this bit will effectively disable variable
> stack access, which is one of the main usability features.

The reason is that we otherwise cannot derive a fixed limit for the masking
in order to avoid oob access under speculation.

> So for v6 I'm thinking of putting the spectre bypass into cap_bpf.
> allow_ptr_leaks will mean only what the name says: pointer leaks only.
> cap_bpf should not be given to user processes that want to become root
> via spectre side channels.

Yeah, I think it needs to be made crystal clear that, from a security point of
view, CAP_BPF is effectively very close to CAP_SYS_ADMIN minus the remaining
non-BPF stuff in there, so this should not be handed out loosely.

> I think it's a usability trade-off for cap_bpf.
> Without indirect var access under cap_bpf, too many networking progs will be forced to use
> cap_bpf+cap_net_admin+cap_perfmon only to pass the verifier
> while they don't really care about reading arbitrary memory via cap_perfmon.

If I recall correctly, at least for Cilium programs the var access restriction
was not an issue - we don't use/need them in our code today, but it might differ
on your side, for example. This brings us back to the point that while CAP_BPF
would solve the issue of not having to hand out the even wider CAP_SYS_ADMIN,
it's still not the end of the tunnel, and we'll see the need for something more
fine-grained coming next.

Thanks,
Daniel
Daniel Borkmann May 12, 2020, 8:09 p.m. UTC | #12
On 5/12/20 8:29 PM, Alexei Starovoitov wrote:
> On Tue, May 12, 2020 at 05:05:12PM +0200, Daniel Borkmann wrote:
>>> -	env->allow_ptr_leaks = is_priv;
>>> +	env->allow_ptr_leaks = perfmon_capable();
>>> +	env->bpf_capable = bpf_capable();
>>
>> Probably more of a detail, but it feels weird to tie perfmon_capable() into the BPF
>> core and use it in various places there. I would rather make this a proper bpf_*
>> prefixed helper and add a more descriptive name (what does it have to do with perf
>> or monitoring directly?). For example, all the main functionality could be under
>> `bpf_base_capable()` and everything with potential to leak pointers or mem to user
>> space as `bpf_leak_capable()`. Then inside include/linux/capability.h this can still
>> resolve under the hood to something like:
>>
>> static inline bool bpf_base_capable(void)
>> {
>> 	return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
>> }
> 
> I don't like the 'base' in the name, since 'base' implies a common subset,
> but that's not the case. Also, 'base' implies that something else is additive,
> but that's not the case either. The real base is unpriv. cap_bpf adds to it.
> So bpf_capable() in capability.h is the most appropriate.
> It also matches perfmon_capable() and other *_capable() helpers.

That's okay with me, naming is usually hardest. :)

>> static inline bool bpf_leak_capable(void)
>> {
>> 	return perfmon_capable();
>> }
> 
> This is ok, but not in capability.h. I can put it into bpf_verifier.h

Makes sense.
Daniel Borkmann May 12, 2020, 8:27 p.m. UTC | #13
On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
[...]
> @@ -2880,8 +2933,6 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
>   	struct bpf_prog *prog;
>   	int ret = -ENOTSUPP;
>   
> -	if (!capable(CAP_SYS_ADMIN))
> -		return -EPERM;

Should the above be under bpf_capable() as well, or is the intention to really let
(fully) unpriv users run sk_filter test progs here? I would assume only progs
that have previously been loaded under bpf_capable() should suffice, so no need to
lower the bar for now, no?

>   	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
>   		return -EINVAL;
>   
> @@ -3163,7 +3214,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
>   	info.run_time_ns = stats.nsecs;
>   	info.run_cnt = stats.cnt;
>   
> -	if (!capable(CAP_SYS_ADMIN)) {
> +	if (!bpf_capable()) {

Given the JIT dump this also exposes addresses when bpf_dump_raw_ok() passes.
I presume that's okay, but it should probably be documented given CAP_SYS_ADMIN
isn't required anymore?

>   		info.jited_prog_len = 0;
>   		info.xlated_prog_len = 0;
>   		info.nr_jited_ksyms = 0;
> @@ -3522,7 +3573,7 @@ static int bpf_btf_load(const union bpf_attr *attr)
>   	if (CHECK_ATTR(BPF_BTF_LOAD))
>   		return -EINVAL;
>   
> -	if (!capable(CAP_SYS_ADMIN))
> +	if (!bpf_capable())
>   		return -EPERM;
>   
>   	return btf_new_fd(attr);
> @@ -3736,9 +3787,6 @@ static int link_create(union bpf_attr *attr)
>   	struct bpf_prog *prog;
>   	int ret;
>   
> -	if (!capable(CAP_NET_ADMIN))
> -		return -EPERM;
> -
>   	if (CHECK_ATTR(BPF_LINK_CREATE))
>   		return -EINVAL;
>   
> @@ -3784,9 +3832,6 @@ static int link_update(union bpf_attr *attr)
>   	u32 flags;
>   	int ret;
>   
> -	if (!capable(CAP_NET_ADMIN))
> -		return -EPERM;
> -
>   	if (CHECK_ATTR(BPF_LINK_UPDATE))
>   		return -EINVAL;
>
Alexei Starovoitov May 12, 2020, 10:56 p.m. UTC | #14
On Tue, May 12, 2020 at 10:07:08PM +0200, Daniel Borkmann wrote:
> On 5/12/20 8:25 PM, Alexei Starovoitov wrote:
> > On Tue, May 12, 2020 at 04:35:41PM +0200, Daniel Borkmann wrote:
> > > On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > 
> > > > Implement permissions as stated in uapi/linux/capability.h
> > > > In order to do that, the verifier's allow_ptr_leaks flag is split
> > > > into allow_ptr_leaks and bpf_capable flags, which are set as:
> > > >     env->allow_ptr_leaks = perfmon_capable();
> > > >     env->bpf_capable = bpf_capable();
> > > > 
> > > > bpf_capable enables bounded loops, variable stack access and other verifier features.
> > > > allow_ptr_leaks enables ptr leaks, ptr conversions, subtraction of pointers, etc.
> > > > It also disables side channel mitigations.
> > > > 
> > > > That means that a networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will
> > > > have speculative checks done by the verifier and other spectre mitigations applied.
> > > > Such a networking BPF program will not be able to leak kernel pointers.
> > > 
> > > I don't quite follow this part in the code below yet, see my comments.
> > > 
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > [...]
> > > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > > index 6abd5a778fcd..c32a7880fa62 100644
> > > > --- a/include/linux/bpf_verifier.h
> > > > +++ b/include/linux/bpf_verifier.h
> > > > @@ -375,6 +375,7 @@ struct bpf_verifier_env {
> > > >    	u32 used_map_cnt;		/* number of used maps */
> > > >    	u32 id_gen;			/* used to generate unique reg IDs */
> > > >    	bool allow_ptr_leaks;
> > > > +	bool bpf_capable;
> > > >    	bool seen_direct_write;
> > > >    	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
> > > >    	const struct bpf_line_info *prev_linfo;
> > > > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > > > index 95d77770353c..264a9254dc39 100644
> > > > --- a/kernel/bpf/arraymap.c
> > > > +++ b/kernel/bpf/arraymap.c
> > > > @@ -77,7 +77,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
> > > >    	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
> > > >    	int ret, numa_node = bpf_map_attr_numa_node(attr);
> > > >    	u32 elem_size, index_mask, max_entries;
> > > > -	bool unpriv = !capable(CAP_SYS_ADMIN);
> > > > +	bool unpriv = !bpf_capable();
> > > 
> > > So here progs loaded with CAP_BPF will have spectre mitigations bypassed which
> > > is the opposite of above statement, no?
> > 
> > right. good catch, but now I'm not sure it was such a good call to toss
> > spectre into cap_perfmon. It probably should be disabled under cap_bpf.
> 
> Right. :( Too bad CAP_*s are not more fine-grained today for more descriptive
> policy. I would presume granting both CAP_BPF + CAP_PERFMON combination is not
> always desired either, so probably makes sense to leave it out with a clear
> description in patch 1/3 for CAP_BPF about the implications.

My reasoning to bypass spectre mitigation under cap_perfmon was
that cap_perfmon allows reading kernel memory and side channels are
doing the same thing just in a much more complicated way.
Whereas cap_bpf is about enabling advanced bpf features in the verifier
and other places that could be security sensitive only because of lack
of exposure to unpriv in the past, but by itself cap_bpf doesn't hold
any known security issues. I think it makes sense to stick to that.
It sucks that indirect var access will be disallowed in the verifier
without cap_perfmon, but to make extensions easier in the future I'll add
env->bypass_spec_v1 and env->bypass_spec_v4 internal verifier knobs that will
deal with these two cases.
In the next revision I'll init them with perfmon_capable() and eventually
relax it to:
env->bypass_spec_v1 = perfmon_capable() || spectre_v1_mitigation == SPECTRE_V1_MITIGATION_NONE;
env->bypass_spec_v4 = perfmon_capable() || ssb_mode == SPEC_STORE_BYPASS_NONE;

The latter conditions are x86-specific and I don't want to mess with x86 bits
at this moment and argue about generalization of this for other archs.
Just doing env->bypass_spec_v1 = perfmon_capable(); is good enough for now
and can be improved like above in the follow-up.
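
As a rough sketch, the bpf_check() init would then read (bypass_spec_v1/v4
being the planned knobs from above, not in this patch yet):

	env->allow_ptr_leaks = perfmon_capable();
	env->bpf_capable = bpf_capable();
	/* to be relaxed later with the arch-specific mitigation checks */
	env->bypass_spec_v1 = perfmon_capable();
	env->bypass_spec_v4 = perfmon_capable();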

> > > >    	u64 cost, array_size, mask64;
> > > >    	struct bpf_map_memory mem;
> > > >    	struct bpf_array *array;
> > > [...]
> > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > > > index 6aa11de67315..8f421dd0c4cf 100644
> > > > --- a/kernel/bpf/core.c
> > > > +++ b/kernel/bpf/core.c
> > > > @@ -646,7 +646,7 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
> > > >    void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> > > >    {
> > > >    	if (!bpf_prog_kallsyms_candidate(fp) ||
> > > > -	    !capable(CAP_SYS_ADMIN))
> > > > +	    !bpf_capable())
> > > >    		return;
> > > >    	bpf_prog_ksym_set_addr(fp);
> > > > @@ -824,7 +824,7 @@ static int bpf_jit_charge_modmem(u32 pages)
> > > >    {
> > > >    	if (atomic_long_add_return(pages, &bpf_jit_current) >
> > > >    	    (bpf_jit_limit >> PAGE_SHIFT)) {
> > > > -		if (!capable(CAP_SYS_ADMIN)) {
> > > > +		if (!bpf_capable()) {
> > > 
> > > Should there still be an upper charge on module mem for !CAP_SYS_ADMIN?
> > 
> > hmm. cap_bpf is a subset of cap_sys_admin. I don't see a reason
> > to keep requiring cap_sys_admin here.
> 
> It should probably be described in the CAP_BPF comment as well since this
> is different compared to plain unpriv.

I'm going to keep it under CAP_SYS_ADMIN for now. Just to avoid arguing.

> > > >    			atomic_long_sub(pages, &bpf_jit_current);
> > > >    			return -EPERM;
> > > >    		}
> > > [...]
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index 70ad009577f8..a6893746cd87 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > [...]
> > > > @@ -3428,7 +3429,7 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
> > > >    		 * Spectre masking for stack ALU.
> > > >    		 * See also retrieve_ptr_limit().
> > > >    		 */
> > > > -		if (!env->allow_ptr_leaks) {
> > > > +		if (!env->bpf_capable) {
> > > 
> > > This needs to stay on env->allow_ptr_leaks, the can_skip_alu_sanitation() does
> > > check on env->allow_ptr_leaks as well, otherwise this breaks spectre mitgation
> > > when masking alu.
> > 
> > The patch kept it in can_skip_alu_sanitation(), but I missed it here.
> > I don't really recall the details of the discussion around
> > commit 088ec26d9c2d ("bpf: Reject indirect var_off stack access in unpriv mode").
> > 
> > So thinking it all over, this bit will effectively disable variable
> > stack access, which is one of the main usability features.
> 
> The reason is that we otherwise cannot derive a fixed limit for the masking
> in order to avoid oob access under speculation.

I guess we can revisit this and come up with better handling of this case
in the verifier.
In the next revision the above will be "if (!env->bypass_spec_v1)".
Alexei Starovoitov May 12, 2020, 11:01 p.m. UTC | #15
On Tue, May 12, 2020 at 10:27:33PM +0200, Daniel Borkmann wrote:
> On 5/8/20 11:53 PM, Alexei Starovoitov wrote:
> [...]
> > @@ -2880,8 +2933,6 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
> >   	struct bpf_prog *prog;
> >   	int ret = -ENOTSUPP;
> > -	if (!capable(CAP_SYS_ADMIN))
> > -		return -EPERM;
> 
> Should the above be under bpf_capable() as well, or is the intention to really let
> (fully) unpriv users run sk_filter test progs here? I would assume only progs
> that have previously been loaded under bpf_capable() should suffice, so no need to
> lower the bar for now, no?

Unpriv can load an sk_filter prog and attach it to a socket, then send data
through the socket to trigger execution.
bpf_prog_test_run is doing the same prog execution without creating a socket.
What is the concern?
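
To illustrate the equivalence, a sketch (error handling omitted; prog_fd
is an already loaded sk_filter prog):

	#include <sys/socket.h>
	#include <unistd.h>
	#include <bpf/bpf.h>

	static void trigger_both_ways(int prog_fd)
	{
		int sv[2];
		char pkt[64] = {};
		struct bpf_prog_test_run_attr tattr = {
			.prog_fd = prog_fd,
			.repeat = 1,
			.data_in = pkt,
			.data_size_in = sizeof(pkt),
		};

		/* unpriv path: attach the filter, push data through a socket */
		socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
		setsockopt(sv[1], SOL_SOCKET, SO_ATTACH_BPF,
			   &prog_fd, sizeof(prog_fd));
		write(sv[0], pkt, sizeof(pkt));

		/* same prog execution via BPF_PROG_TEST_RUN, no socket */
		bpf_prog_test_run_xattr(&tattr);
	}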

> >   	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
> >   		return -EINVAL;
> > @@ -3163,7 +3214,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
> >   	info.run_time_ns = stats.nsecs;
> >   	info.run_cnt = stats.cnt;
> > -	if (!capable(CAP_SYS_ADMIN)) {
> > +	if (!bpf_capable()) {
> 
> Given the JIT dump this also exposes addresses when bpf_dump_raw_ok() passes.
> I presume that's okay, but it should probably be documented given CAP_SYS_ADMIN
> isn't required anymore?

Exactly. dump_raw_ok() is there. I'm not even sure why this cap_sys_admin
check is there. It looks like it can be completely removed, but I didn't
want to go that far in this set.

Patch

diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index 069c42f22a8c..5bb144435c16 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -110,7 +110,7 @@  lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_get_prandom_u32:
 		return &bpf_get_prandom_u32_proto;
 	case BPF_FUNC_trace_printk:
-		if (capable(CAP_SYS_ADMIN))
+		if (perfmon_capable())
 			return bpf_get_trace_printk_proto();
 		/* fall through */
 	default:
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 6abd5a778fcd..c32a7880fa62 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -375,6 +375,7 @@  struct bpf_verifier_env {
 	u32 used_map_cnt;		/* number of used maps */
 	u32 id_gen;			/* used to generate unique reg IDs */
 	bool allow_ptr_leaks;
+	bool bpf_capable;
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	const struct bpf_line_info *prev_linfo;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 95d77770353c..264a9254dc39 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -77,7 +77,7 @@  static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 	bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY;
 	int ret, numa_node = bpf_map_attr_numa_node(attr);
 	u32 elem_size, index_mask, max_entries;
-	bool unpriv = !capable(CAP_SYS_ADMIN);
+	bool unpriv = !bpf_capable();
 	u64 cost, array_size, mask64;
 	struct bpf_map_memory mem;
 	struct bpf_array *array;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 26cb51f2db72..c6b0decaa46a 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -557,7 +557,7 @@  static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 	struct bpf_map *map;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6aa11de67315..8f421dd0c4cf 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -646,7 +646,7 @@  static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
 void bpf_prog_kallsyms_add(struct bpf_prog *fp)
 {
 	if (!bpf_prog_kallsyms_candidate(fp) ||
-	    !capable(CAP_SYS_ADMIN))
+	    !bpf_capable())
 		return;
 
 	bpf_prog_ksym_set_addr(fp);
@@ -824,7 +824,7 @@  static int bpf_jit_charge_modmem(u32 pages)
 {
 	if (atomic_long_add_return(pages, &bpf_jit_current) >
 	    (bpf_jit_limit >> PAGE_SHIFT)) {
-		if (!capable(CAP_SYS_ADMIN)) {
+		if (!bpf_capable()) {
 			atomic_long_sub(pages, &bpf_jit_current);
 			return -EPERM;
 		}
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3fe0b006d2d2..be8fdceefe55 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -85,7 +85,7 @@  static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	u64 cost;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index d541c8486c95..b4b288a3c3c9 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -359,9 +359,9 @@  static int htab_map_alloc_check(union bpf_attr *attr)
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
 		     offsetof(struct htab_elem, hash_node.pprev));
 
-	if (lru && !capable(CAP_SYS_ADMIN))
+	if (lru && !bpf_capable())
 		/* LRU implementation is much complicated than other
-		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
+		 * maps.  Hence, limit to CAP_BPF.
 		 */
 		return -EPERM;
 
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5c0290e0696e..886949fdcece 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -633,7 +633,7 @@  bpf_base_func_proto(enum bpf_func_id func_id)
 		break;
 	}
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return NULL;
 
 	switch (func_id) {
@@ -642,6 +642,8 @@  bpf_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_spin_unlock:
 		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
+		if (!perfmon_capable())
+			return NULL;
 		return bpf_get_trace_printk_proto();
 	case BPF_FUNC_jiffies64:
 		return &bpf_jiffies64_proto;
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 65c236cf341e..c8cc4e4cf98d 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -543,7 +543,7 @@  static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	u64 cost = sizeof(*trie), cost_per_node;
 	int ret;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c
index f697647ceb54..24d244daceff 100644
--- a/kernel/bpf/queue_stack_maps.c
+++ b/kernel/bpf/queue_stack_maps.c
@@ -45,7 +45,7 @@  static bool queue_stack_map_is_full(struct bpf_queue_stack *qs)
 /* Called from syscall */
 static int queue_stack_map_alloc_check(union bpf_attr *attr)
 {
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return -EPERM;
 
 	/* check sanity of attributes */
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 01badd3eda7a..21cde24386db 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -154,7 +154,7 @@  static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr)
 	struct bpf_map_memory mem;
 	u64 array_size;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	array_size = sizeof(*array);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index db76339fe358..7b8381ce40a0 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -93,7 +93,7 @@  static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	u64 cost, n_buckets;
 	int err;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	if (attr->map_flags & ~STACK_CREATE_FLAG_MASK)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index bb1ab7da6103..5b8782c29cf9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1534,7 +1534,7 @@  static int map_freeze(const union bpf_attr *attr)
 		err = -EBUSY;
 		goto err_put;
 	}
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!bpf_capable()) {
 		err = -EPERM;
 		goto err_put;
 	}
@@ -2009,6 +2009,53 @@  bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 	}
 }
 
+static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
+{
+	switch (prog_type) {
+	case BPF_PROG_TYPE_SCHED_CLS:
+	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_XDP:
+	case BPF_PROG_TYPE_LWT_IN:
+	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_XMIT:
+	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
+	case BPF_PROG_TYPE_SK_SKB:
+	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_LIRC_MODE2:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+	case BPF_PROG_TYPE_CGROUP_DEVICE:
+	case BPF_PROG_TYPE_CGROUP_SOCK:
+	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
+	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_SOCK_OPS:
+	case BPF_PROG_TYPE_EXT: /* extends any prog */
+		return true;
+	case BPF_PROG_TYPE_CGROUP_SKB: /* always unpriv */
+	default:
+		return false;
+	}
+}
+
+static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
+{
+	switch (prog_type) {
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
+	case BPF_PROG_TYPE_TRACING:
+	case BPF_PROG_TYPE_LSM:
+	case BPF_PROG_TYPE_STRUCT_OPS: /* has access to struct sock */
+	case BPF_PROG_TYPE_EXT: /* extends any prog */
+		return true;
+	default:
+		return false;
+	}
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define	BPF_PROG_LOAD_LAST_FIELD attach_prog_fd
 
@@ -2031,7 +2078,7 @@  static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !capable(CAP_SYS_ADMIN))
+	    !bpf_capable())
 		return -EPERM;
 
 	/* copy eBPF program license from user space */
@@ -2044,11 +2091,16 @@  static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	is_gpl = license_is_gpl_compatible(license);
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
+	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
 		return -E2BIG;
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !capable(CAP_SYS_ADMIN))
+	    !bpf_capable())
+		return -EPERM;
+
+	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN))
+		return -EPERM;
+	if (is_perfmon_prog_type(type) && !perfmon_capable())
 		return -EPERM;
 
 	bpf_prog_load_fixup_attach_type(attr);
@@ -2682,6 +2734,11 @@  static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
+		if (!capable(CAP_NET_ADMIN))
+			/* cg-skb progs can be loaded by unpriv user.
+			 * check permissions at attach time.
+			 */
+			return -EPERM;
 		return prog->enforce_expected_attach_type &&
 			prog->expected_attach_type != attach_type ?
 			-EINVAL : 0;
@@ -2745,9 +2802,6 @@  static int bpf_prog_attach(const union bpf_attr *attr)
 	struct bpf_prog *prog;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_PROG_ATTACH))
 		return -EINVAL;
 
@@ -2802,9 +2856,6 @@  static int bpf_prog_detach(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_PROG_DETACH))
 		return -EINVAL;
 
@@ -2817,6 +2868,8 @@  static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_LIRC_MODE2:
 		return lirc_prog_detach(attr);
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+		if (!capable(CAP_NET_ADMIN))
+			return -EPERM;
 		return skb_flow_dissector_bpf_prog_detach(attr);
 	case BPF_PROG_TYPE_CGROUP_DEVICE:
 	case BPF_PROG_TYPE_CGROUP_SKB:
@@ -2880,8 +2933,6 @@  static int bpf_prog_test_run(const union bpf_attr *attr,
 	struct bpf_prog *prog;
 	int ret = -ENOTSUPP;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_TEST_RUN))
 		return -EINVAL;
 
@@ -3163,7 +3214,7 @@  static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	info.run_time_ns = stats.nsecs;
 	info.run_cnt = stats.cnt;
 
-	if (!capable(CAP_SYS_ADMIN)) {
+	if (!bpf_capable()) {
 		info.jited_prog_len = 0;
 		info.xlated_prog_len = 0;
 		info.nr_jited_ksyms = 0;
@@ -3522,7 +3573,7 @@  static int bpf_btf_load(const union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_BTF_LOAD))
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return -EPERM;
 
 	return btf_new_fd(attr);
@@ -3736,9 +3787,6 @@  static int link_create(union bpf_attr *attr)
 	struct bpf_prog *prog;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_LINK_CREATE))
 		return -EINVAL;
 
@@ -3784,9 +3832,6 @@  static int link_update(union bpf_attr *attr)
 	u32 flags;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
 	if (CHECK_ATTR(BPF_LINK_UPDATE))
 		return -EINVAL;
 
@@ -3932,7 +3977,7 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	union bpf_attr attr;
 	int err;
 
-	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
+	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
 		return -EPERM;
 
 	err = bpf_check_uarg_tail_zero(uattr, sizeof(attr), size);
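
Putting the syscall.c changes together: loading one of the
is_net_admin_prog_type() program types now needs bpf_capable() plus
CAP_NET_ADMIN, but no longer full CAP_SYS_ADMIN.  A minimal sketch of a
loader that exercises this (illustrative only, error handling trimmed;
run it with CAP_BPF + CAP_NET_ADMIN):

	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* r0 = 0; exit */
	static struct bpf_insn insns[] = {
		{ .code = BPF_ALU64 | BPF_MOV | BPF_K,
		  .dst_reg = BPF_REG_0, .imm = 0 },
		{ .code = BPF_JMP | BPF_EXIT },
	};

	static int load_sched_cls(void)
	{
		union bpf_attr attr;

		/* the kernel rejects attrs with non-zero tail bytes */
		memset(&attr, 0, sizeof(attr));
		attr.prog_type = BPF_PROG_TYPE_SCHED_CLS; /* net-admin type */
		attr.insns = (__u64)(unsigned long)insns;
		attr.insn_cnt = sizeof(insns) / sizeof(insns[0]);
		attr.license = (__u64)(unsigned long)"GPL";

		return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
	}
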
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 70ad009577f8..a6893746cd87 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1293,7 +1293,7 @@  static void __mark_reg_unknown(const struct bpf_verifier_env *env,
 	reg->type = SCALAR_VALUE;
 	reg->var_off = tnum_unknown;
 	reg->frameno = 0;
-	reg->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks;
+	reg->precise = env->subprog_cnt > 1 || !env->bpf_capable;
 	__mark_reg_unbounded(reg);
 }
 
@@ -1425,8 +1425,9 @@  static int check_subprogs(struct bpf_verifier_env *env)
 			continue;
 		if (insn[i].src_reg != BPF_PSEUDO_CALL)
 			continue;
-		if (!env->allow_ptr_leaks) {
-			verbose(env, "function calls to other bpf functions are allowed for root only\n");
+		if (!env->bpf_capable) {
+			verbose(env,
+				"function calls to other bpf functions are allowed for CAP_BPF and CAP_SYS_ADMIN\n");
 			return -EPERM;
 		}
 		ret = add_subprog(env, i + insn[i].imm + 1);
@@ -1960,7 +1961,7 @@  static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
 	bool new_marks = false;
 	int i, err;
 
-	if (!env->allow_ptr_leaks)
+	if (!env->bpf_capable)
 		/* backtracking is root only for now */
 		return 0;
 
@@ -2208,7 +2209,7 @@  static int check_stack_write(struct bpf_verifier_env *env,
 		reg = &cur->regs[value_regno];
 
 	if (reg && size == BPF_REG_SIZE && register_is_const(reg) &&
-	    !register_is_null(reg) && env->allow_ptr_leaks) {
+	    !register_is_null(reg) && env->bpf_capable) {
 		if (dst_reg != BPF_REG_FP) {
 			/* The backtracking logic can only recognize explicit
 			 * stack slot address like [fp - 8]. Other spill of
@@ -3428,7 +3429,7 @@  static int check_stack_boundary(struct bpf_verifier_env *env, int regno,
 		 * Spectre masking for stack ALU.
 		 * See also retrieve_ptr_limit().
 		 */
-		if (!env->allow_ptr_leaks) {
+		if (!env->bpf_capable) {
 			char tn_buf[48];
 
 			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
@@ -7229,7 +7230,7 @@  static int push_insn(int t, int w, int e, struct bpf_verifier_env *env,
 		insn_stack[env->cfg.cur_stack++] = w;
 		return 1;
 	} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
-		if (loop_ok && env->allow_ptr_leaks)
+		if (loop_ok && env->bpf_capable)
 			return 0;
 		verbose_linfo(env, t, "%d: ", t);
 		verbose_linfo(env, w, "%d: ", w);
@@ -8338,7 +8339,7 @@  static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 	if (env->max_states_per_insn < states_cnt)
 		env->max_states_per_insn = states_cnt;
 
-	if (!env->allow_ptr_leaks && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
+	if (!env->bpf_capable && states_cnt > BPF_COMPLEXITY_LIMIT_STATES)
 		return push_jmp_history(env, cur);
 
 	if (!add_new_state)
@@ -9998,7 +9999,7 @@  static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			insn->code = BPF_JMP | BPF_TAIL_CALL;
 
 			aux = &env->insn_aux_data[i + delta];
-			if (env->allow_ptr_leaks && !expect_blinding &&
+			if (env->bpf_capable && !expect_blinding &&
 			    prog->jit_requested &&
 			    !bpf_map_key_poisoned(aux) &&
 			    !bpf_map_ptr_poisoned(aux) &&
@@ -10725,7 +10726,7 @@  int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 		env->insn_aux_data[i].orig_idx = i;
 	env->prog = *prog;
 	env->ops = bpf_verifier_ops[env->prog->type];
-	is_priv = capable(CAP_SYS_ADMIN);
+	is_priv = bpf_capable();
 
 	if (!btf_vmlinux && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) {
 		mutex_lock(&bpf_verifier_lock);
@@ -10766,7 +10767,8 @@  int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 	if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
 		env->strict_alignment = false;
 
-	env->allow_ptr_leaks = is_priv;
+	env->allow_ptr_leaks = perfmon_capable();
+	env->bpf_capable = bpf_capable();
 
 	if (is_priv)
 		env->test_state_freq = attr->prog_flags & BPF_F_TEST_STATE_FREQ;
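
The resulting split in the verifier is two orthogonal flags, set at the
end of the hunk above: bpf_capable gates the "expensive but safe"
features (bpf-to-bpf calls, bounded loops, precise backtracking, the
higher state and instruction limits), while allow_ptr_leaks gates
pointer leaks and turns off the Spectre masking seen in
check_stack_boundary().  For orientation, both flags live in
struct bpf_verifier_env (the one-line bpf_verifier.h change in the
diffstat); schematically:

	struct bpf_verifier_env {
		...
		bool allow_ptr_leaks;	/* set from perfmon_capable() */
		bool bpf_capable;	/* set from bpf_capable() */
		...
	};

So a program loaded with CAP_BPF but without CAP_PERFMON gets the
relaxed complexity limits yet keeps all side-channel mitigations.
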
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e875c95d3ced..58c51837fa18 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -315,6 +315,9 @@  static const struct bpf_func_proto bpf_probe_write_user_proto = {
 
 static const struct bpf_func_proto *bpf_get_probe_write_proto(void)
 {
+	if (!capable(CAP_SYS_ADMIN))
+		return NULL;
+
 	pr_warn_ratelimited("%s[%d] is installing a program with bpf_probe_write_user helper that may corrupt user memory!",
 			    current->comm, task_pid_nr(current));
 
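Note the contrast with the trace_printk handling in helpers.c above:
bpf_probe_write_user stays behind a plain capable(CAP_SYS_ADMIN) check,
so even CAP_BPF + CAP_PERFMON together cannot reach a helper that can
corrupt user memory.  An illustrative sketch (not from this series) of
a program that such a loader can no longer verify:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("kprobe/do_sys_open")
	int scribble(void *ctx)
	{
		long v = 0;

		/* proto is NULL unless the loader holds full
		 * CAP_SYS_ADMIN, so this call fails verification
		 */
		bpf_probe_write_user((void *)0, &v, sizeof(v));
		return 0;
	}

	char _license[] SEC("license") = "GPL";
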
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 756b63b6f7b3..d2c4d16dadba 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -625,7 +625,7 @@  static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr)
 	    !attr->btf_key_type_id || !attr->btf_value_type_id)
 		return -EINVAL;
 
-	if (!capable(CAP_SYS_ADMIN))
+	if (!bpf_capable())
 		return -EPERM;
 
 	if (attr->value_size > MAX_VALUE_SIZE)
@@ -978,7 +978,7 @@  bpf_sk_storage_diag_alloc(const struct nlattr *nla_stgs)
-	/* bpf_sk_storage_map is currently limited to CAP_SYS_ADMIN as
-	 * the map_alloc_check() side also does.
-	 */
-	if (!capable(CAP_SYS_ADMIN))
+	/* bpf_sk_storage_map is currently limited to CAP_BPF,
+	 * matching the map_alloc_check() side.
+	 */
+	if (!bpf_capable())
 		return ERR_PTR(-EPERM);
 
 	nla_for_each_nested(nla, nla_stgs, rem) {
diff --git a/net/core/filter.c b/net/core/filter.c
index dfaf5df13722..121a7ad404a8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6626,7 +6626,7 @@  static bool cg_skb_is_valid_access(int off, int size,
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-		if (!capable(CAP_SYS_ADMIN))
+		if (!bpf_capable())
 			return false;
 		break;
 	}
@@ -6638,7 +6638,7 @@  static bool cg_skb_is_valid_access(int off, int size,
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 			break;
 		case bpf_ctx_range(struct __sk_buff, tstamp):
-			if (!capable(CAP_SYS_ADMIN))
+			if (!bpf_capable())
 				return false;
 			break;
 		default: