
[bpf] bpf: fix potentially incorrect results with bpf_get_local_storage()

Message ID 20210810010413.1976277-1-yhs@fb.com (mailing list archive)
State Accepted
Commit a2baf4e8bb0f306fbed7b5e6197c02896a638ab5
Delegated to: BPF
Series [bpf] bpf: fix potentially incorrect results with bpf_get_local_storage()

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/cc_maintainers warning 5 maintainers not CCed: netdev@vger.kernel.org john.fastabend@gmail.com songliubraving@fb.com kpsingh@kernel.org kafai@fb.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 11785 this patch: 11785
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 86 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 11307 this patch: 11307
netdev/header_inline success Link

Commit Message

Yonghong Song Aug. 10, 2021, 1:04 a.m. UTC
Commit b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in
bpf_get_local_storage() helper") fixed a bug in the bpf_get_local_storage()
helper so that different tasks won't interfere with each other's
percpu local storage.

The percpu data contains 8 slots, so it can hold up to 8 contexts
(from the same or different tasks) for 8 different program runs
at the same time. This is sufficient in general, but our internal
testing showed the following warning multiple times:

  warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
     __cgroup_bpf_run_filter_sock_ops+0x13e/0x180
  RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
  <IRQ>
   tcp_call_bpf.constprop.99+0x93/0xc0
   tcp_conn_request+0x41e/0xa50
   ? tcp_rcv_state_process+0x203/0xe00
   tcp_rcv_state_process+0x203/0xe00
   ? sk_filter_trim_cap+0xbc/0x210
   ? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
   tcp_v6_do_rcv+0x181/0x3e0
   tcp_v6_rcv+0xc65/0xcb0
   ip6_protocol_deliver_rcu+0xbd/0x450
   ip6_input_finish+0x11/0x20
   ip6_input+0xb5/0xc0
   ip6_sublist_rcv_finish+0x37/0x50
   ip6_sublist_rcv+0x1dc/0x270
   ipv6_list_rcv+0x113/0x140
   __netif_receive_skb_list_core+0x1a0/0x210
   netif_receive_skb_list_internal+0x186/0x2a0
   gro_normal_list.part.170+0x19/0x40
   napi_complete_done+0x65/0x150
   mlx5e_napi_poll+0x1ae/0x680
   __napi_poll+0x25/0x120
   net_rx_action+0x11e/0x280
   __do_softirq+0xbb/0x271
   irq_exit_rcu+0x97/0xa0
   common_interrupt+0x7f/0xa0
   </IRQ>
   asm_common_interrupt+0x1e/0x40
  RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
   ? __cgroup_bpf_run_filter_skb+0x378/0x4e0
   ? do_softirq+0x34/0x70
   ? ip6_finish_output2+0x266/0x590
   ? ip6_finish_output+0x66/0xa0
   ? ip6_output+0x6c/0x130
   ? ip6_xmit+0x279/0x550
   ? ip6_dst_check+0x61/0xd0
  ...

Using drgn to dump the percpu buffer contents showed that
on this cpu slot 0 was still available, but
slots 1-7 were occupied, and the tasks recorded in slots 1-7 mostly no longer
existed. So we suspected an issue in bpf_cgroup_storage_unset().

Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
Currently, it tries to unset the "current" slot by searching from the start,
so the following sequence is possible:
  1. a task is running and claims slot 0.
  2. the running bpf program finishes; unset() sees that slot 0 holds the task
     and is about to reset it to NULL (but has not done so yet).
  3. an interrupt happens, another bpf program runs, and it claims slot 1
     with the *same* task.
  4. the unset() in interrupt context releases slot 0, since it matches the task.
  5. the interrupt is done; the task in process context resets slot 0.

In the end, slot 1 is never reset, and the same process can go on to occupy
slots 2-7. Eventually, when steps 1-5 above repeat, the bpf program in step 3
won't be able to claim an empty slot and a warning will be issued.

To fix the issue, the unset() function should traverse from the last slot
to the first, so the most recently claimed slot is released first. This way,
the above issue is avoided.

The same reverse traversal should also be done in the bpf_get_local_storage()
helper itself; otherwise, incorrect local storage may be returned to the bpf
program.
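
To see why the traversal direction matters: the corresponding set() helper
(bpf_cgroup_storage_set()) claims the lowest free slot, so a nested program
run on the same task always occupies a higher slot index than the outer run.
Releasing the highest matching slot therefore always frees the innermost
claim first. A minimal userspace model of the slot logic (a sketch for
illustration only, not the kernel code; the set() behavior is paraphrased
from bpf-cgroup.h):

  #include <stddef.h>

  #define NEST_MAX 8                /* models BPF_CGROUP_STORAGE_NEST_MAX */

  static void *slot_task[NEST_MAX]; /* a per-cpu array in the kernel */

  /* set(): claim the lowest free slot, as the kernel's set() does. */
  static int slot_set(void *task)
  {
          int i;

          for (i = 0; i < NEST_MAX; i++) {
                  if (slot_task[i])
                          continue;
                  slot_task[i] = task;
                  return i;
          }
          return -1;                /* all 8 slots busy: the WARNING case */
  }

  /* unset(): release the highest slot owned by this task. Nested runs
   * of the same task sit above earlier ones, so scanning downward frees
   * the innermost claim first and the claim/release pairing stays LIFO,
   * even when an interrupt runs another program on the same task. */
  static void slot_unset(void *task)
  {
          int i;

          for (i = NEST_MAX - 1; i >= 0; i--) {
                  if (slot_task[i] != task)
                          continue;
                  slot_task[i] = NULL;
                  return;
          }
  }

With the buggy forward scan, step 4 above frees slot 0 (the outer claim)
instead of slot 1 (the inner one), leaking slot 1. This is presumably also
why the patch flips the branch hint from unlikely() to likely(): scanning
down from slot 7, most slots examined do not belong to current.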

Cc: Roman Gushchin <guro@fb.com>
Fixes: b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/bpf-cgroup.h | 4 ++--
 kernel/bpf/helpers.c       | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

This patch targets the bpf tree. In bpf-next,
Andrii's c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")
should have fixed the issue too. I am also okay with backporting Andrii's patch
to the bpf tree if that is viable.
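
For context, the bpf-next approach mentioned above avoids the per-cpu slot
array entirely by hanging the run context off the task itself, so nesting
becomes a plain save/restore. A rough sketch of that idea (simplified from
c7603cfa04e7; the exact struct and field names there may differ):

  /* task_struct gains a pointer: struct bpf_run_ctx *bpf_ctx; */
  struct bpf_run_ctx {};

  struct bpf_cg_run_ctx {
          struct bpf_run_ctx run_ctx;
          const struct bpf_prog_array_item *prog_item;
  };

  /* On program entry, install the new context and remember the old one;
   * on exit, restore the old one. Save/restore is strictly LIFO per
   * task, so no slot search (and none of the race above) is needed. */
  static inline struct bpf_run_ctx *bpf_set_run_ctx(struct bpf_run_ctx *new_ctx)
  {
          struct bpf_run_ctx *old_ctx = current->bpf_ctx;

          current->bpf_ctx = new_ctx;
          return old_ctx;
  }

  static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
  {
          current->bpf_ctx = old_ctx;
  }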

Comments

Andrii Nakryiko Aug. 10, 2021, 5:32 a.m. UTC | #1
On Mon, Aug 9, 2021 at 6:04 PM Yonghong Song <yhs@fb.com> wrote:
>
> Commit b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in
> bpf_get_local_storage() helper") fixed a bug in the bpf_get_local_storage()
> helper so that different tasks won't interfere with each other's
> percpu local storage.
>
> The percpu data contains 8 slots, so it can hold up to 8 contexts
> (from the same or different tasks) for 8 different program runs
> at the same time. This is sufficient in general, but our internal
> testing showed the following warning multiple times:
>
> [...]
>
> Using drgn to dump the percpu buffer contents showed that

It's worth putting in a reference to drgn for people who are not familiar with what it is:

  [0] https://github.com/osandov/drgn

> on this cpu slot 0 was still available, but
> slots 1-7 were occupied, and the tasks recorded in slots 1-7 mostly no longer
> existed. So we suspected an issue in bpf_cgroup_storage_unset().
>
> [...]

LGTM.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>  include/linux/bpf-cgroup.h | 4 ++--
>  kernel/bpf/helpers.c       | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> This patch targets the bpf tree. In bpf-next,
> Andrii's c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")
> should have fixed the issue too. I am also okay with backporting Andrii's patch
> to the bpf tree if that is viable.
>

I don't have a preference, but my patch might be a bit harder to
backport, so doing this as a fix is fine with me.

> [...]
patchwork-bot+netdevbpf@kernel.org Aug. 10, 2021, 8:30 a.m. UTC | #2
Hello:

This patch was applied to bpf/bpf.git (refs/heads/master):

On Mon, 9 Aug 2021 18:04:13 -0700 you wrote:
> Commit b910eaaaa4b8 ("bpf: Fix NULL pointer dereference in
> bpf_get_local_storage() helper") fixed a bug in the bpf_get_local_storage()
> helper so that different tasks won't interfere with each other's
> percpu local storage.
> 
> The percpu data contains 8 slots, so it can hold up to 8 contexts
> (from the same or different tasks) for 8 different program runs
> at the same time. This is sufficient in general, but our internal
> testing showed the following warning multiple times:
> 
> [...]

Here is the summary with links:
  - [bpf] bpf: fix potentially incorrect results with bpf_get_local_storage()
    https://git.kernel.org/bpf/bpf/c/a2baf4e8bb0f

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

Patch

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 8b77d08d4b47..6c9b10d82c80 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -201,8 +201,8 @@ static inline void bpf_cgroup_storage_unset(void)
 {
 	int i;
 
-	for (i = 0; i < BPF_CGROUP_STORAGE_NEST_MAX; i++) {
-		if (unlikely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
+	for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
+		if (likely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
 			continue;
 
 		this_cpu_write(bpf_cgroup_storage_info[i].task, NULL);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 62cf00383910..9b3f16eee21f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -397,8 +397,8 @@ BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags)
 	void *ptr;
 	int i;
 
-	for (i = 0; i < BPF_CGROUP_STORAGE_NEST_MAX; i++) {
-		if (unlikely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
+	for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
+		if (likely(this_cpu_read(bpf_cgroup_storage_info[i].task) != current))
 			continue;
 
 		storage = this_cpu_read(bpf_cgroup_storage_info[i].storage[stype]);