
[RFC,bpf-next,1/1] bpf: Add a BPF helper for getting the cgroup path of current task

Message ID 20210512095823.99162-2-yunbo.xufeng@linux.alibaba.com (mailing list archive)
State RFC
Delegated to: BPF
Series Implement getting cgroup path bpf helper

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success
netdev/cc_maintainers warning 2 maintainers not CCed: netdev@vger.kernel.org andrii@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 10106 this patch: 10106
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch warning WARNING: line length of 95 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 10520 this patch: 10520
netdev/header_inline success

Commit Message

xufeng zhang May 12, 2021, 9:58 a.m. UTC
To implement security rules for application containers by utilizing
bpf LSM, the container to which the currently running task belongs needs
to be known in bpf context. Consider this scenario: kubernetes
schedules a pod onto a host; before the application container can run,
the security rules for this application need to be loaded into bpf
maps first, so that LSM bpf programs can make decisions based on
these rule maps.

However, there is no effective bpf helper to achieve this goal,
especially for cgroup v1. In the above case, the only information
available from the user side is the container id, and the cgroup path
for this container is determined by the container id. So, to bridge the
user side and bpf programs, bpf programs also need to know the
cgroup path of the currently running task.

This change adds a new bpf helper: bpf_get_current_cpuset_cgroup_path().
Since cgroup_path_ns() can sleep, this helper is only allowed for
sleepable LSM hooks.

Signed-off-by: Xufeng Zhang <yunbo.xufeng@linux.alibaba.com>
---
 include/uapi/linux/bpf.h       | 13 +++++++++++++
 kernel/bpf/bpf_lsm.c           | 28 ++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 13 +++++++++++++
 3 files changed, 54 insertions(+)
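
As a purely illustrative sketch (not part of the submitted patch), a sleepable
LSM program could consume the proposed helper roughly as follows, assuming the
patch is applied and the helper declaration has been regenerated for libbpf;
the hook, map name and policy check below are hypothetical:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct path_key {
	char path[128];
};

/* Hypothetical policy map keyed by the container's cpuset cgroup path,
 * filled from user space (which knows the path from the container id)
 * before the container starts. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, struct path_key);
	__type(value, __u32);		/* non-zero means "deny exec" */
} exec_rules SEC(".maps");

SEC("lsm.s/bprm_check_security")	/* sleepable hook, as the helper requires */
int BPF_PROG(container_exec_check, struct linux_binprm *bprm)
{
	struct path_key key = {};
	__u32 *deny;

	/* Proposed helper from this patch: copy the cpuset cgroup path of
	 * the current task into key.path; a negative return means failure. */
	if (bpf_get_current_cpuset_cgroup_path(key.path, sizeof(key.path)) < 0)
		return 0;

	deny = bpf_map_lookup_elem(&exec_rules, &key);
	if (deny && *deny)
		return -EPERM;
	return 0;
}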

Comments

Alexei Starovoitov May 12, 2021, 10:55 p.m. UTC | #1
On Wed, May 12, 2021 at 05:58:23PM +0800, Xufeng Zhang wrote:
> To implement security rules for application containers by utilizing
> bpf LSM, the container to which the current running task belongs need
> to be known in bpf context. Think about this scenario: kubernetes
> schedules a pod into one host, before the application container can run,
> the security rules for this application need to be loaded into bpf
> maps firstly, so that LSM bpf programs can make decisions based on
> this rule maps.
> 
> However, there is no effective bpf helper to achieve this goal,
> especially for cgroup v1. In the above case, the only available information
> from user side is container-id, and the cgroup path for this container
> is certain based on container-id, so in order to make a bridge between
> user side and bpf programs, bpf programs also need to know the current
> cgroup path of running task.
...
> +#ifdef CONFIG_CGROUPS
> +BPF_CALL_2(bpf_get_current_cpuset_cgroup_path, char *, buf, u32, buf_len)
> +{
> +	struct cgroup_subsys_state *css;
> +	int retval;
> +
> +	css = task_get_css(current, cpuset_cgrp_id);
> +	retval = cgroup_path_ns(css->cgroup, buf, buf_len, &init_cgroup_ns);
> +	css_put(css);
> +	if (retval >= buf_len)
> +		retval = -ENAMETOOLONG;

Manipulating string path to check the hierarchy will be difficult to do
inside bpf prog. It seems to me this helper will be useful only for
simplest cgroup setups where there is no additional cgroup nesting
within containers.
Have you looked at *ancestor_cgroup_id and *cgroup_id helpers?
They're a bit more flexible when dealing with hierarchy and
can be used to achieve the same correlation between kernel and user cgroup ids.
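
For comparison, a rough sketch of the cgroup-id approach suggested above
(editorial addition, cgroup v2 only): user space learns the container's
cgroup id (for example from the cgroup directory's file handle) and stores
it in a map, and the LSM program matches the current task against it.
Map and program names below are hypothetical:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Hypothetical map of cgroup ids that have a policy attached, populated
 * from user space before the container starts. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);		/* cgroup id */
	__type(value, __u32);		/* policy flags */
} policy_by_cgid SEC(".maps");

SEC("lsm/file_open")
int BPF_PROG(restrict_open, struct file *file)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u32 *policy = bpf_map_lookup_elem(&policy_by_cgid, &cgid);

	if (policy && (*policy & 1))
		return -EPERM;

	/* If containers create nested cgroups internally, the id known to
	 * user space can instead be matched at a fixed ancestor level via
	 * bpf_get_current_ancestor_cgroup_id(). */
	return 0;
}
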
xufeng zhang May 13, 2021, 8:57 a.m. UTC | #2
On 2021/5/13 6:55 AM, Alexei Starovoitov wrote:

> On Wed, May 12, 2021 at 05:58:23PM +0800, Xufeng Zhang wrote:
>> To implement security rules for application containers by utilizing
>> bpf LSM, the container to which the current running task belongs need
>> to be known in bpf context. Think about this scenario: kubernetes
>> schedules a pod into one host, before the application container can run,
>> the security rules for this application need to be loaded into bpf
>> maps firstly, so that LSM bpf programs can make decisions based on
>> this rule maps.
>>
>> However, there is no effective bpf helper to achieve this goal,
>> especially for cgroup v1. In the above case, the only available information
>> from user side is container-id, and the cgroup path for this container
>> is certain based on container-id, so in order to make a bridge between
>> user side and bpf programs, bpf programs also need to know the current
>> cgroup path of running task.
> ...
>> +#ifdef CONFIG_CGROUPS
>> +BPF_CALL_2(bpf_get_current_cpuset_cgroup_path, char *, buf, u32, buf_len)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +	int retval;
>> +
>> +	css = task_get_css(current, cpuset_cgrp_id);
>> +	retval = cgroup_path_ns(css->cgroup, buf, buf_len, &init_cgroup_ns);
>> +	css_put(css);
>> +	if (retval >= buf_len)
>> +		retval = -ENAMETOOLONG;
> Manipulating string path to check the hierarchy will be difficult to do
> inside bpf prog. It seems to me this helper will be useful only for
> simplest cgroup setups where there is no additional cgroup nesting
> within containers.
> Have you looked at *ancestor_cgroup_id and *cgroup_id helpers?
> They're a bit more flexible when dealing with hierarchy and
> can be used to achieve the same correlation between kernel and user cgroup ids.


Thanks for your timely reply, Alexei!

Yes, this helper is not very general; it does not work for nested cgroups
within containers.

Regarding your suggestion, the *cgroup_id helpers only work for cgroup v2;
however, we're still using cgroup v1 in production. And even for cgroup v2,
I'm not sure whether there is any way for user space to get this cgroup id
in time (after the container is created, but before it starts to run).

So is there any effective way that works for cgroup v1?


Many thanks!

Xufeng
xufeng zhang May 14, 2021, 4:06 a.m. UTC | #3
On 2021/5/13 6:55 AM, Alexei Starovoitov wrote:
> On Wed, May 12, 2021 at 05:58:23PM +0800, Xufeng Zhang wrote:
>> To implement security rules for application containers by utilizing
>> bpf LSM, the container to which the current running task belongs need
>> to be known in bpf context. Think about this scenario: kubernetes
>> schedules a pod into one host, before the application container can run,
>> the security rules for this application need to be loaded into bpf
>> maps firstly, so that LSM bpf programs can make decisions based on
>> this rule maps.
>>
>> However, there is no effective bpf helper to achieve this goal,
>> especially for cgroup v1. In the above case, the only available information
>> from user side is container-id, and the cgroup path for this container
>> is certain based on container-id, so in order to make a bridge between
>> user side and bpf programs, bpf programs also need to know the current
>> cgroup path of running task.
> ...
>> +#ifdef CONFIG_CGROUPS
>> +BPF_CALL_2(bpf_get_current_cpuset_cgroup_path, char *, buf, u32, buf_len)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +	int retval;
>> +
>> +	css = task_get_css(current, cpuset_cgrp_id);
>> +	retval = cgroup_path_ns(css->cgroup, buf, buf_len, &init_cgroup_ns);
>> +	css_put(css);
>> +	if (retval >= buf_len)
>> +		retval = -ENAMETOOLONG;
> Manipulating string path to check the hierarchy will be difficult to do
> inside bpf prog. It seems to me this helper will be useful only for
> simplest cgroup setups where there is no additional cgroup nesting
> within containers.
> Have you looked at *ancestor_cgroup_id and *cgroup_id helpers?
> They're a bit more flexible when dealing with hierarchy and
> can be used to achieve the same correlation between kernel and user cgroup ids.


KP,

do you have any suggestions?

What I am thinking is that the internal kernel objects (cgroup id or ns.inum)
are not very user friendly; we can derive the container context from them in
tracing scenarios, but not for LSM blocking cases. I'm not sure how
Google resolves similar issues internally.


Thanks!

Xufeng
Alexei Starovoitov May 14, 2021, 4:20 a.m. UTC | #4
On Thu, May 13, 2021 at 1:57 AM xufeng zhang
<yunbo.xufeng@linux.alibaba.com> wrote:
>
> On 2021/5/13 6:55 AM, Alexei Starovoitov wrote:
>
> > On Wed, May 12, 2021 at 05:58:23PM +0800, Xufeng Zhang wrote:
> >> To implement security rules for application containers by utilizing
> >> bpf LSM, the container to which the current running task belongs need
> >> to be known in bpf context. Think about this scenario: kubernetes
> >> schedules a pod into one host, before the application container can run,
> >> the security rules for this application need to be loaded into bpf
> >> maps firstly, so that LSM bpf programs can make decisions based on
> >> this rule maps.
> >>
> >> However, there is no effective bpf helper to achieve this goal,
> >> especially for cgroup v1. In the above case, the only available information
> >> from user side is container-id, and the cgroup path for this container
> >> is certain based on container-id, so in order to make a bridge between
> >> user side and bpf programs, bpf programs also need to know the current
> >> cgroup path of running task.
> > ...
> >> +#ifdef CONFIG_CGROUPS
> >> +BPF_CALL_2(bpf_get_current_cpuset_cgroup_path, char *, buf, u32, buf_len)
> >> +{
> >> +    struct cgroup_subsys_state *css;
> >> +    int retval;
> >> +
> >> +    css = task_get_css(current, cpuset_cgrp_id);
> >> +    retval = cgroup_path_ns(css->cgroup, buf, buf_len, &init_cgroup_ns);
> >> +    css_put(css);
> >> +    if (retval >= buf_len)
> >> +            retval = -ENAMETOOLONG;
> > Manipulating string path to check the hierarchy will be difficult to do
> > inside bpf prog. It seems to me this helper will be useful only for
> > simplest cgroup setups where there is no additional cgroup nesting
> > within containers.
> > Have you looked at *ancestor_cgroup_id and *cgroup_id helpers?
> > They're a bit more flexible when dealing with hierarchy and
> > can be used to achieve the same correlation between kernel and user cgroup ids.
>
>
> Thanks for your timely reply, Alexei!
>
> Yes, this helper is not so common, it does not works for nesting cgroup
> within containers.
>
> About your suggestion, the *cgroup_id helpers only works for cgroup v2,
> however, we're still using cgroup v1 in product,and even for cgroup v2,
> I'm not sure if there is any way for user space to get this cgroup id
> timely(after container created, but before container start to run)。
>
> So if there is any effective way works for cgroup v1?

https://github.com/systemd/systemd/blob/main/NEWS#L379
KP Singh May 14, 2021, 11:20 a.m. UTC | #5
On Fri, May 14, 2021 at 6:06 AM xufeng zhang
<yunbo.xufeng@linux.alibaba.com> wrote:
>
>
> On 2021/5/13 6:55 AM, Alexei Starovoitov wrote:
> > On Wed, May 12, 2021 at 05:58:23PM +0800, Xufeng Zhang wrote:
> >> To implement security rules for application containers by utilizing
> >> bpf LSM, the container to which the current running task belongs need
> >> to be known in bpf context. Think about this scenario: kubernetes
> >> schedules a pod into one host, before the application container can run,
> >> the security rules for this application need to be loaded into bpf
> >> maps firstly, so that LSM bpf programs can make decisions based on
> >> this rule maps.
> >>
> >> However, there is no effective bpf helper to achieve this goal,
> >> especially for cgroup v1. In the above case, the only available information
> >> from user side is container-id, and the cgroup path for this container
> >> is certain based on container-id, so in order to make a bridge between
> >> user side and bpf programs, bpf programs also need to know the current
> >> cgroup path of running task.
> > ...
> >> +#ifdef CONFIG_CGROUPS
> >> +BPF_CALL_2(bpf_get_current_cpuset_cgroup_path, char *, buf, u32, buf_len)
> >> +{
> >> +    struct cgroup_subsys_state *css;
> >> +    int retval;
> >> +
> >> +    css = task_get_css(current, cpuset_cgrp_id);
> >> +    retval = cgroup_path_ns(css->cgroup, buf, buf_len, &init_cgroup_ns);
> >> +    css_put(css);
> >> +    if (retval >= buf_len)
> >> +            retval = -ENAMETOOLONG;
> > Manipulating string path to check the hierarchy will be difficult to do
> > inside bpf prog. It seems to me this helper will be useful only for
> > simplest cgroup setups where there is no additional cgroup nesting
> > within containers.
> > Have you looked at *ancestor_cgroup_id and *cgroup_id helpers?
> > They're a bit more flexible when dealing with hierarchy and
> > can be used to achieve the same correlation between kernel and user cgroup ids.
>
>
> KP,
>
> do you have any suggestion?

I haven't really tried this yet, but have you considered using task local
storage to identify the container?

- Add a task local storage entry with the container ID somewhere in the
  container manager
- Propagate this ID to all the tasks within a container using the task security
  blob management hooks (task_alloc, task_free, etc.)
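
A rough sketch of that idea (editorial addition, untested; map and program
names are made up): the container manager seeds a task local storage entry for
the container's initial task, and a task_alloc LSM program propagates it to
every descendant, so later LSM hooks can identify the container without
looking at cgroup paths at all:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct container_tag {
	__u64 id;			/* container id assigned by the manager */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct container_tag);
} task_tags SEC(".maps");

SEC("lsm/task_alloc")
int BPF_PROG(tag_child, struct task_struct *task, unsigned long clone_flags)
{
	struct container_tag *parent_tag, *child_tag;

	/* Inherit the parent's tag (if the parent is inside a container) so
	 * every task in the container carries the same container id. */
	parent_tag = bpf_task_storage_get(&task_tags, bpf_get_current_task_btf(),
					  NULL, 0);
	if (!parent_tag)
		return 0;

	child_tag = bpf_task_storage_get(&task_tags, task, NULL,
					 BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (child_tag)
		child_tag->id = parent_tag->id;
	return 0;
}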

>
> what I am thinking is the internal kernel object(cgroup id or ns.inum)
> is not so user friendly, we can get the container-context from them for
> tracing scenario, but not for LSM blocking cases, I'm not sure how
> Google internally resolve similar issue.
>
>
> Thanks!
>
> Xufeng
>
KP Singh May 14, 2021, 11:21 a.m. UTC | #6
> > About your suggestion, the *cgroup_id helpers only works for cgroup v2,
> > however, we're still using cgroup v1 in product,and even for cgroup v2,
> > I'm not sure if there is any way for user space to get this cgroup id
> > timely(after container created, but before container start to run)。
> >
> > So if there is any effective way works for cgroup v1?
>
> https://github.com/systemd/systemd/blob/main/NEWS#L379

I agree that we should not focus on cgroup v1 if we do add a helper.

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ec6d85a81744..e8295101b865 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4735,6 +4735,18 @@  union bpf_attr {
  *		be zero-terminated except when **str_size** is 0.
  *
  *		Or **-EBUSY** if the per-CPU memory copy buffer is busy.
+ *
+ * int bpf_get_current_cpuset_cgroup_path(char *buf, u32 buf_len)
+ *	Description
+ *		Get the cpuset cgroup path of current task from kernel memory,
+ *		this path can be used to identify in which container is the
+ *		current task running.
+ *		*buf* memory is pre-allocated, and *buf_len* indicates the size
+ *		of this memory.
+ *
+ *	Return
+ *		The cpuset cgroup path is copied into *buf* on success,
+ *		or a negative integer error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4903,6 +4915,7 @@  union bpf_attr {
 	FN(check_mtu),			\
 	FN(for_each_map_elem),		\
 	FN(snprintf),			\
+	FN(get_current_cpuset_cgroup_path),     \
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 5efb2b24012c..5e62e3875df1 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c
@@ -99,6 +99,30 @@  static const struct bpf_func_proto bpf_ima_inode_hash_proto = {
 	.allowed	= bpf_ima_inode_hash_allowed,
 };
 
+#ifdef CONFIG_CGROUPS
+BPF_CALL_2(bpf_get_current_cpuset_cgroup_path, char *, buf, u32, buf_len)
+{
+	struct cgroup_subsys_state *css;
+	int retval;
+
+	css = task_get_css(current, cpuset_cgrp_id);
+	retval = cgroup_path_ns(css->cgroup, buf, buf_len, &init_cgroup_ns);
+	css_put(css);
+	if (retval >= buf_len)
+		retval = -ENAMETOOLONG;
+	return retval;
+}
+
+static const struct bpf_func_proto bpf_get_current_cpuset_cgroup_path_proto = {
+	.func           = bpf_get_current_cpuset_cgroup_path,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type      = ARG_CONST_SIZE,
+	.allowed        = bpf_ima_inode_hash_allowed,
+};
+#endif
+
 static const struct bpf_func_proto *
 bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -119,6 +143,10 @@  bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_bprm_opts_set_proto;
 	case BPF_FUNC_ima_inode_hash:
 		return prog->aux->sleepable ? &bpf_ima_inode_hash_proto : NULL;
+#ifdef CONFIG_CGROUPS
+	case BPF_FUNC_get_current_cpuset_cgroup_path:
+		return prog->aux->sleepable ? &bpf_get_current_cpuset_cgroup_path_proto : NULL;
+#endif
 	default:
 		return tracing_prog_func_proto(func_id, prog);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ec6d85a81744..fe31252d92e3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4735,6 +4735,18 @@  union bpf_attr {
  *		be zero-terminated except when **str_size** is 0.
  *
  *		Or **-EBUSY** if the per-CPU memory copy buffer is busy.
+ *
+ * int bpf_get_current_cpuset_cgroup_path(char *buf, u32 buf_len)
+ *	Description
+ *		Get the cpuset cgroup path of current task from kernel memory,
+ *		this path can be used to identify in which container is the
+ *		current task running.
+ *		*buf* memory is pre-allocated, and *buf_len* indicates the size
+ *		of this memory.
+ *
+ *	Return
+ *		The cpuset cgroup path is copied into *buf* on success,
+ *		or a negative integer error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4903,6 +4915,7 @@  union bpf_attr {
 	FN(check_mtu),			\
 	FN(for_each_map_elem),		\
 	FN(snprintf),			\
+	FN(get_current_cpuset_cgroup_path),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper