[bpf,v6,03/12] bpf: sockmap, improved check for empty queue

Message ID 20230407171654.107311-4-john.fastabend@gmail.com (mailing list archive)
State Accepted
Delegated to: BPF
Series: bpf sockmap fixes

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf, async
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 20 this patch: 20
netdev/cc_maintainers fail 1 blamed authors not CCed: cong.wang@bytedance.com; 4 maintainers not CCed: kuba@kernel.org cong.wang@bytedance.com pabeni@redhat.com davem@davemloft.net
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 20 this patch: 20
netdev/checkpatch warning WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")' WARNING: line length of 82 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-VM_Test-8 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-11 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-12 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-13 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-14 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-16 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17 fail Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-18 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-19 success Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-21 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22 fail Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-23 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-25 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-26 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-27 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-28 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-29 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-30 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-31 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-32 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-34 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-35 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-20 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-33 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-15 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-10 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-PR success PR summary
bpf/vmtest-bpf-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-7 success Logs for set-matrix

Commit Message

John Fastabend April 7, 2023, 5:16 p.m. UTC
We noticed some rare sk_buffs were stepping past the queue when the
system was under memory pressure. The general approach is to skip
enqueueing sk_buffs when it's not necessary, which is the normal case
on a system that is properly provisioned for the task: no memory
pressure and enough CPU assigned.

But if enqueueing the sk_buff into the sockmap receive queue fails with
ENOMEM, we push it onto a delayed workqueue to retry later. When a new
sk_buff is received we then check if that queue is empty. However,
there is a problem with simply checking the queue length. While a
sk_buff is being processed from the ingress queue but is not yet on the
sockmap msg receive queue, it's possible to also receive a sk_buff
through the normal path. That path will check the ingress queue, see it
is empty, and skip ahead of the packet still being processed.

Previously we took the sock lock in both contexts, which made the
problem harder to hit, but not impossible.

To fix this, also check the 'state' variable where we cache a partially
processed sk_buff. This catches the majority of cases. But we also need
to take the mutex around this check, because we can't have both code
paths running and still check sensibly. We could perhaps do this with
atomic bit checks, but we are already here due to memory pressure, so
slowing things down a bit seems OK and it is simpler to just grab a
lock.

To reproduce the issue we ran an NGINX compliance test with sockmap
enabled and observed some flakes in our testing that we attributed to
this issue.

Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()")
Tested-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)
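
As an aside for readers following the race described above, here is a
minimal stand-alone C model of the two checks. The names (ingress_len,
state.skb, the can_submit_* helpers) are invented stand-ins for
psock->ingress_skb and psock->work_state in net/core/skmsg.c, not the
kernel code itself.

/* Minimal userspace model of the race (hypothetical names). The
 * backlog worker has dequeued skb #1 and parked it in state.skb; the
 * buggy check sees an empty queue and lets a newer skb overtake it,
 * producing out-of-order data.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct work_state { int skb; };	/* stand-in for a cached, half-done skb */

static pthread_mutex_t work_mutex = PTHREAD_MUTEX_INITIALIZER;
static int ingress_len;		/* stand-in for the ingress_skb queue length */
static struct work_state state;

/* Buggy: only the queue length is consulted. */
static bool can_submit_directly_buggy(void)
{
	return ingress_len == 0;
}

/* Fixed: also consult state.skb, under the mutex, so an skb held by
 * the backlog worker cannot be overtaken. */
static bool can_submit_directly_fixed(void)
{
	bool ok;

	pthread_mutex_lock(&work_mutex);
	ok = ingress_len == 0 && state.skb == 0;
	pthread_mutex_unlock(&work_mutex);
	return ok;
}

int main(void)
{
	state.skb = 1;		/* backlog worker holds skb #1 off-queue */
	ingress_len = 0;

	printf("buggy check overtakes: %s\n",
	       can_submit_directly_buggy() ? "yes (OOO data)" : "no");
	printf("fixed check overtakes: %s\n",
	       can_submit_directly_fixed() ? "yes (OOO data)" : "no");
	return 0;
}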

Comments

Jakub Sitnicki April 11, 2023, 5:53 p.m. UTC | #1
On Fri, Apr 07, 2023 at 10:16 AM -07, John Fastabend wrote:
[...]

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Alexei Starovoitov April 13, 2023, 5:51 p.m. UTC | #2
On Fri, Apr 7, 2023 at 10:17 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
[...]
> +               mutex_lock(&psock->work_mutex);
> +               state = &psock->work_state;

This splat says that above is wrong:

[   98.732763] BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:580
[   98.733483] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid:
370, name: test_progs
[   98.734103] preempt_count: 102, expected: 0
[   98.734416] RCU nest depth: 4, expected: 0
[   98.734739] 6 locks held by test_progs/370:
[   98.735046]  #0: ffff888106475530 (sk_lock-AF_INET){+.+.}-{0:0},
at: inet_shutdown+0x43/0x150
[   98.735695]  #1: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
__ip_queue_xmit+0x5/0xa00
[   98.736325]  #2: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
process_backlog+0xc0/0x360
[   98.736971]  #3: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
ip_local_deliver_finish+0xbb/0x220
[   98.737668]  #4: ffff8881064748b0 (slock-AF_INET/1){+.-.}-{2:2},
at: tcp_v4_rcv+0x1b72/0x1d80
[   98.738297]  #5: ffffffff84250ba0 (rcu_read_lock){....}-{1:2}, at:
sk_psock_verdict_recv+0x5/0x3a0
[   98.738973] Preemption disabled at:
[   98.738976] [<ffffffff8238bb41>] ip_finish_output2+0x171/0xfa0
[   98.739687] CPU: 1 PID: 370 Comm: test_progs Tainted: G           O
      6.3.0-rc5-00193-g9149a3b041d2 #942
[   98.740379] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[   98.741164] Call Trace:
[   98.741338]  <IRQ>
[   98.741483]  dump_stack_lvl+0x60/0x70
[   98.741735]  __might_resched+0x21c/0x340
[   98.742005]  __mutex_lock+0xb4/0x12a0
[   98.743721]  ? mutex_lock_io_nested+0x1070/0x1070
[   98.744053]  ? lock_is_held_type+0xda/0x130
[   98.744337]  ? preempt_count_sub+0x14/0xc0
[   98.744624]  ? sk_psock_verdict_apply+0x1a3/0x2f0
[   98.744936]  sk_psock_verdict_apply+0x1a3/0x2f0
[   98.745242]  ? preempt_count_sub+0x14/0xc0
[   98.745530]  sk_psock_verdict_recv+0x1e7/0x3a0
[   98.745858]  ? preempt_count_sub+0x14/0xc0
[   98.746168]  tcp_read_skb+0x19c/0x2d0
[   98.746447]  ? sk_psock_strp_read+0x390/0x390
[   98.746774]  ? tcp_alloc_md5sig_pool+0x230/0x230
[   98.747116]  ? rcu_read_lock_held+0x91/0xa0
[   98.747427]  ? rcu_read_lock_sched_held+0xc0/0xc0
[   98.747772]  ? __rcu_read_unlock+0x6b/0x2a0
[   98.748087]  sk_psock_verdict_data_ready+0x99/0x2d0
[   98.748446]  tcp_data_queue+0xd39/0x19b0
[   98.748749]  ? tcp_send_rcvq+0x280/0x280
[   98.749038]  ? tcp_urg+0x7f/0x4c0
[   98.749298]  ? tcp_ack_update_rtt.isra.55+0x910/0x910
[   98.749644]  ? lockdep_hardirqs_on+0x79/0x100
[   98.749940]  ? ktime_get+0x112/0x120
[   98.750225]  ? ktime_get+0x86/0x120
[   98.750498]  tcp_rcv_established+0x3fb/0xcc0
[   98.752087]  tcp_v4_do_rcv+0x34a/0x4c0
[   98.752400]  tcp_v4_rcv+0x1c9a/0x1d80
[   98.754137]  ip_protocol_deliver_rcu+0x4f/0x4d0
[   98.754562]  ip_local_deliver_finish+0x146/0x220
[   98.754928]  ip_local_deliver+0x100/0x2e0
[   98.756055]  ip_rcv+0xb6/0x2b0
[   98.757689]  __netif_receive_skb_one_core+0xd2/0x110
[   98.759121]  process_backlog+0x160/0x360
[   98.759446]  __napi_poll+0x57/0x300
[   98.759725]  net_rx_action+0x555/0x600
[   98.760029]  ? napi_threaded_poll+0x2b0/0x2b0
[   98.760444]  __do_softirq+0xeb/0x4e7
[   98.760728]  ? ip_finish_output2+0x391/0xfa0
[   98.761059]  do_softirq+0xa1/0xd0

I'm afraid I have to revert this set from bpf tree.
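
For context on the splat: sk_psock_verdict_recv() runs in the softirq
receive path under rcu_read_lock(), i.e. atomic context, where a mutex
(which may sleep) is forbidden; __might_resched() inside __mutex_lock()
is what fires. A rough sketch of the constraint, using the real psock
fields but illustrative function names and trimmed bodies:

/* Illustrative only, not a patch: the context rule the splat enforces. */
static void verdict_path_broken(struct sk_psock *psock)
{
	rcu_read_lock();			/* atomic context begins */
	mutex_lock(&psock->work_mutex);		/* may sleep -> BUG splat */
	/* ... check work_state and ingress_skb ... */
	mutex_unlock(&psock->work_mutex);
	rcu_read_unlock();
}

/* A BH-disabling spinlock never sleeps, so it is legal here; the code
 * already owns such a lock, psock->ingress_lock, for the skb queue. */
static void verdict_path_atomic_safe(struct sk_psock *psock)
{
	rcu_read_lock();
	spin_lock_bh(&psock->ingress_lock);	/* safe in softirq */
	/* ... check work_state and ingress_skb ... */
	spin_unlock_bh(&psock->ingress_lock);
	rcu_read_unlock();
}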
John Fastabend April 13, 2023, 8:06 p.m. UTC | #3
Alexei Starovoitov wrote:
> On Fri, Apr 7, 2023 at 10:17 AM John Fastabend <john.fastabend@gmail.com> wrote:
> [...]
> 
> I'm afraid I have to revert this set from bpf tree.

OK, thanks. We also got a syzkaller complaint, so we will fix these two things.
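
The commit message above already floats "atomic bit checks" as an
alternative that would sidestep the sleeping-lock problem entirely. A
hypothetical sketch of that direction (SK_PSOCK_BACKLOG_BUSY is an
invented flag, and this is not necessarily what the respun series did):

#define SK_PSOCK_BACKLOG_BUSY	7	/* hypothetical bit in psock->state */

/* Backlog worker: mark itself busy *before* dequeueing so there is no
 * window where an skb is off-queue but invisible to the check below.
 * (Memory ordering between the bit and the queue is glossed over.) */
static struct sk_buff *backlog_take_skb(struct sk_psock *psock)
{
	set_bit(SK_PSOCK_BACKLOG_BUSY, &psock->state);
	return skb_dequeue(&psock->ingress_skb);
}

static void backlog_done_skb(struct sk_psock *psock)
{
	clear_bit(SK_PSOCK_BACKLOG_BUSY, &psock->state);
}

/* Verdict path: both tests are atomic, so nothing here can sleep. */
static bool psock_may_submit_directly(struct sk_psock *psock)
{
	return skb_queue_empty(&psock->ingress_skb) &&
	       !test_bit(SK_PSOCK_BACKLOG_BUSY, &psock->state);
}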

Patch

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 198bed303c51..f8731818b5c3 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -987,6 +987,7 @@  EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);
 static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 				  int verdict)
 {
+	struct sk_psock_work_state *state;
 	struct sock *sk_other;
 	int err = 0;
 	u32 len, off;
@@ -1003,13 +1004,28 @@  static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 
 		skb_bpf_set_ingress(skb);
 
+		/* We need to grab the mutex here because an in-flight skb can
+		 * be in one of several states: on ingress_skb, cached in
+		 * state->skb, or being processed by the backlog; it is also
+		 * possible that both state->skb and ingress_skb are empty.
+		 * The troublesome case is an skb just dequeued from ingress_skb
+		 * or taken out of state->skb, because we can not easily test
+		 * for that. Maybe we could be clever with flags, but being
+		 * clever got us here in the first place. Note this runs under
+		 * the sock lock, and backlog conditions mean we are already
+		 * hitting ENOMEM or other performance hindering cases, so
+		 * let's do the obvious thing and grab the mutex.
+		 */
+		mutex_lock(&psock->work_mutex);
+		state = &psock->work_state;
+
 		/* If the queue is empty then we can submit directly
 		 * into the msg queue. If its not empty we have to
 		 * queue work otherwise we may get OOO data. Otherwise,
 		 * if sk_psock_skb_ingress errors will be handled by
 		 * retrying later from workqueue.
 		 */
-		if (skb_queue_empty(&psock->ingress_skb)) {
+		if (skb_queue_empty(&psock->ingress_skb) && likely(!state->skb)) {
 			len = skb->len;
 			off = 0;
 			if (skb_bpf_strparser(skb)) {
@@ -1030,9 +1046,11 @@  static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 			spin_unlock_bh(&psock->ingress_lock);
 			if (err < 0) {
 				skb_bpf_redirect_clear(skb);
+				mutex_unlock(&psock->work_mutex);
 				goto out_free;
 			}
 		}
+		mutex_unlock(&psock->work_mutex);
 		break;
 	case __SK_REDIRECT:
 		err = sk_psock_skb_redirect(psock, skb);
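
For readers skimming the hunk, a condensed sketch of the ingress branch
this patch creates (helper names taken from the surrounding skmsg.c
code, bodies simplified; note the thread above shows the mutex is not
actually safe in this calling context, which is why the set was
reverted):

/* Condensed sketch, not the literal hunk: submit directly only when
 * nothing is queued or cached; otherwise queue the skb and let the
 * delayed workqueue retry later, preserving ordering. */
static void ingress_branch_sketch(struct sk_psock *psock, struct sk_buff *skb,
				  u32 off, u32 len)
{
	struct sk_psock_work_state *state = &psock->work_state;

	mutex_lock(&psock->work_mutex);		/* unsafe here, see splat */
	if (skb_queue_empty(&psock->ingress_skb) && likely(!state->skb)) {
		/* fast path: straight onto the msg receive queue */
		sk_psock_skb_ingress_self(psock, skb, off, len);
	} else {
		/* slow path: keep ordering, retry from the workqueue */
		spin_lock_bh(&psock->ingress_lock);
		skb_queue_tail(&psock->ingress_skb, skb);
		schedule_delayed_work(&psock->work, 1);
		spin_unlock_bh(&psock->ingress_lock);
	}
	mutex_unlock(&psock->work_mutex);
}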