[bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog

Message ID	20230731134536.4058181-1-xukuohai@huaweicloud.com (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers	show Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A37B64429; Mon, 31 Jul 2023 13:24:33 +0000 (UTC) From: Xu Kuohai <xukuohai@huaweicloud.com> To: bpf@vger.kernel.org, netdev@vger.kernel.org Cc: John Fastabend <john.fastabend@gmail.com>, Jakub Sitnicki <jakub@cloudflare.com>, "David S . Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Daniel Borkmann <daniel@iogearbox.net>, Alexei Starovoitov <ast@kernel.org>, Cong Wang <cong.wang@bytedance.com> Subject: [PATCH bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog Date: Mon, 31 Jul 2023 09:45:36 -0400 Message-Id: <20230731134536.4058181-1-xukuohai@huaweicloud.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog \| expand [bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog

Message ID

20230731134536.4058181-1-xukuohai@huaweicloud.com (mailing list archive)

State

Changes Requested

Delegated to:

BPF

Headers

From: Xu Kuohai <xukuohai@huaweicloud.com>
To: bpf@vger.kernel.org,
	netdev@vger.kernel.org
Cc: John Fastabend <john.fastabend@gmail.com>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	"David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Cong Wang <cong.wang@bytedance.com>
Subject: [PATCH bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog
Date: Mon, 31 Jul 2023 09:45:36 -0400
Message-Id: <20230731134536.4058181-1-xukuohai@huaweicloud.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

[bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog | expand

Context	Check	Description
bpf/vmtest-bpf-VM_Test-26	success	Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-8	success	Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-12	fail	Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-PR	fail	PR summary
bpf/vmtest-bpf-VM_Test-16	fail	Logs for test_progs_no_alu32 on s390x with gcc
netdev/series_format	success	Single patches do not need cover letters
netdev/tree_selection	success	Clearly marked for bpf
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1344 this patch: 1344
netdev/cc_maintainers	success	CCed 10 of 10 maintainers
netdev/build_clang	success	Errors and warnings before: 1365 this patch: 1365
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/deprecated_api	success	None detected
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1367 this patch: 1367
netdev/checkpatch	warning	WARNING: line length of 81 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-2	success	Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-5	success	Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-6	success	Logs for set-matrix
bpf/vmtest-bpf-VM_Test-4	success	Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-3	success	Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-25	success	Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-29	success	Logs for veristat
bpf/vmtest-bpf-VM_Test-7	success	Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9	success	Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-10	success	Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-27	success	Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-28	success	Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-11	fail	Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-13	fail	Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-15	fail	Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-17	fail	Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-18	fail	Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-19	success	Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-20	success	Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-21	success	Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-22	success	Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-23	success	Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24	success	Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-14	fail	Logs for test_progs on x86_64 with llvm-16

Context

Check

Description

bpf/vmtest-bpf-VM_Test-26

success

Logs for test_verifier on s390x with gcc

bpf/vmtest-bpf-VM_Test-8

success

Logs for test_maps on s390x with gcc

bpf/vmtest-bpf-VM_Test-12

fail

Logs for test_progs on s390x with gcc

bpf/vmtest-bpf-PR

fail

PR summary

bpf/vmtest-bpf-VM_Test-16

fail

Logs for test_progs_no_alu32 on s390x with gcc

netdev/series_format

success

Single patches do not need cover letters

netdev/tree_selection

success

Clearly marked for bpf

netdev/fixes_present

success

Fixes tag present in non-next series

netdev/header_inline

success

No static functions without inline keyword in header files

netdev/build_32bit

success

Errors and warnings before: 1344 this patch: 1344

netdev/cc_maintainers

success

CCed 10 of 10 maintainers

netdev/build_clang

success

Errors and warnings before: 1365 this patch: 1365

netdev/verify_signedoff

success

Signed-off-by tag matches author and committer

netdev/deprecated_api

success

None detected

netdev/check_selftest

success

No net selftest shell script

netdev/verify_fixes

success

Fixes tag looks correct

netdev/build_allmodconfig_warn

success

Errors and warnings before: 1367 this patch: 1367

netdev/checkpatch

warning

WARNING: line length of 81 exceeds 80 columns

netdev/kdoc

success

Errors and warnings before: 0 this patch: 0

netdev/source_inline

success

Was 0 now: 0

bpf/vmtest-bpf-VM_Test-1

success

Logs for ShellCheck

bpf/vmtest-bpf-VM_Test-2

success

Logs for build for aarch64 with gcc

bpf/vmtest-bpf-VM_Test-5

success

Logs for build for x86_64 with llvm-16

bpf/vmtest-bpf-VM_Test-6

success

Logs for set-matrix

bpf/vmtest-bpf-VM_Test-4

success

Logs for build for x86_64 with gcc

bpf/vmtest-bpf-VM_Test-3

success

Logs for build for s390x with gcc

bpf/vmtest-bpf-VM_Test-25

success

Logs for test_verifier on aarch64 with gcc

bpf/vmtest-bpf-VM_Test-29

success

Logs for veristat

bpf/vmtest-bpf-VM_Test-7

success

Logs for test_maps on aarch64 with gcc

bpf/vmtest-bpf-VM_Test-9

success

Logs for test_maps on x86_64 with gcc

bpf/vmtest-bpf-VM_Test-10

success

Logs for test_maps on x86_64 with llvm-16

bpf/vmtest-bpf-VM_Test-27

success

Logs for test_verifier on x86_64 with gcc

bpf/vmtest-bpf-VM_Test-28

success

Logs for test_verifier on x86_64 with llvm-16

bpf/vmtest-bpf-VM_Test-11

fail

Logs for test_progs on aarch64 with gcc

bpf/vmtest-bpf-VM_Test-13

fail

Logs for test_progs on x86_64 with gcc

bpf/vmtest-bpf-VM_Test-15

fail

Logs for test_progs_no_alu32 on aarch64 with gcc

bpf/vmtest-bpf-VM_Test-17

fail

Logs for test_progs_no_alu32 on x86_64 with gcc

bpf/vmtest-bpf-VM_Test-18

fail

Logs for test_progs_no_alu32 on x86_64 with llvm-16

bpf/vmtest-bpf-VM_Test-19

success

Logs for test_progs_no_alu32_parallel on aarch64 with gcc

bpf/vmtest-bpf-VM_Test-20

success

Logs for test_progs_no_alu32_parallel on x86_64 with gcc

bpf/vmtest-bpf-VM_Test-21

success

Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16

bpf/vmtest-bpf-VM_Test-22

success

Logs for test_progs_parallel on aarch64 with gcc

bpf/vmtest-bpf-VM_Test-23

success

Logs for test_progs_parallel on x86_64 with gcc

bpf/vmtest-bpf-VM_Test-24

success

Logs for test_progs_parallel on x86_64 with llvm-16

bpf/vmtest-bpf-VM_Test-14

fail

Logs for test_progs on x86_64 with llvm-16

Commit Message

Xu Kuohai July 31, 2023, 1:45 p.m. UTC

From: Xu Kuohai <xukuohai@huawei.com>

sk_psock_backlog triggers a NULL dereference:

 BUG: kernel NULL pointer dereference, address: 000000000000000e
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 0 PID: 70 Comm: kworker/0:3 Not tainted 6.5.0-rc2-00585-gb11bbbe4c66e #26
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-p4
 Workqueue: events sk_psock_backlog
 RIP: 0010:0xffffffffc0205254
 Code: 00 00 48 89 94 24 a0 00 00 00 41 5f 41 5e 41 5d 41 5c 5d 5b 41 5b 41 5a 41 59 41 50
 RSP: 0018:ffffc90000acbcb8 EFLAGS: 00010246
 RAX: ffffffff81c5ee10 RBX: ffff888018260000 RCX: 0000000000000001
 RDX: 0000000000000003 RSI: ffffc90000acbd58 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000080100005
 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000003
 R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000003
 FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000000e CR3: 000000000b0de002 CR4: 0000000000170ef0
 Call Trace:
  <TASK>
  ? __die+0x24/0x70
  ? page_fault_oops+0x15d/0x480
  ? fixup_exception+0x26/0x330
  ? exc_page_fault+0x72/0x1d0
  ? asm_exc_page_fault+0x26/0x30
  ? __pfx_inet_sendmsg+0x10/0x10
  ? 0xffffffffc0205254
  ? inet_sendmsg+0x20/0x80
  ? sock_sendmsg+0x8f/0xa0
  ? __skb_send_sock+0x315/0x360
  ? __pfx_sendmsg_unlocked+0x10/0x10
  ? sk_psock_backlog+0xb4/0x300
  ? process_one_work+0x292/0x560
  ? worker_thread+0x53/0x3e0
  ? __pfx_worker_thread+0x10/0x10
  ? kthread+0x102/0x130
  ? __pfx_kthread+0x10/0x10
  ? ret_from_fork+0x34/0x50
  ? __pfx_kthread+0x10/0x10
  ? ret_from_fork_asm+0x1b/0x30
  </TASK>

The bug flow is as follows:

thread 1                                   thread 2

sk_psock_backlog                           sock_close
  sk_psock_handle_skb                        __sock_release
    __skb_send_sock                            inet_release
      sendmsg_unlocked                           tcp_close
        sock_sendmsg                               lock_sock
                                                     __tcp_close
                                                   release_sock
                                                 sock->sk = NULL // (1)
          inet_sendmsg
            sk = sock->sk // (2)
            inet_send_prepare
              inet_sk(sk)->inet_num // (3)

sock->sk is set to NULL by thread 2 at time (1), then fetched by
thread 1 at time (2), and used by thread 1 to access memory at
time (3), resulting in NULL pointer dereference.

To fix it, add lock_sock back on the egress path for sk_psock_handle_skb.

Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
---
 net/core/skmsg.c | 44 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 10 deletions(-)

Comments

John Fastabend Aug. 2, 2023, 3:04 a.m. UTC | #1

Xu Kuohai wrote:
> From: Xu Kuohai <xukuohai@huawei.com>
> 
> sk_psock_backlog triggers a NULL dereference:
> 
>  BUG: kernel NULL pointer dereference, address: 000000000000000e
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x0000) - not-present page
>  PGD 0 P4D 0
>  Oops: 0000 [#1] PREEMPT SMP PTI
>  CPU: 0 PID: 70 Comm: kworker/0:3 Not tainted 6.5.0-rc2-00585-gb11bbbe4c66e #26
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-p4
>  Workqueue: events sk_psock_backlog
>  RIP: 0010:0xffffffffc0205254
>  Code: 00 00 48 89 94 24 a0 00 00 00 41 5f 41 5e 41 5d 41 5c 5d 5b 41 5b 41 5a 41 59 41 50
>  RSP: 0018:ffffc90000acbcb8 EFLAGS: 00010246
>  RAX: ffffffff81c5ee10 RBX: ffff888018260000 RCX: 0000000000000001
>  RDX: 0000000000000003 RSI: ffffc90000acbd58 RDI: 0000000000000000
>  RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000080100005
>  R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000003
>  R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000003
>  FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000000000000000e CR3: 000000000b0de002 CR4: 0000000000170ef0
>  Call Trace:
>   <TASK>
>   ? __die+0x24/0x70
>   ? page_fault_oops+0x15d/0x480
>   ? fixup_exception+0x26/0x330
>   ? exc_page_fault+0x72/0x1d0
>   ? asm_exc_page_fault+0x26/0x30
>   ? __pfx_inet_sendmsg+0x10/0x10
>   ? 0xffffffffc0205254
>   ? inet_sendmsg+0x20/0x80
>   ? sock_sendmsg+0x8f/0xa0
>   ? __skb_send_sock+0x315/0x360
>   ? __pfx_sendmsg_unlocked+0x10/0x10
>   ? sk_psock_backlog+0xb4/0x300
>   ? process_one_work+0x292/0x560
>   ? worker_thread+0x53/0x3e0
>   ? __pfx_worker_thread+0x10/0x10
>   ? kthread+0x102/0x130
>   ? __pfx_kthread+0x10/0x10
>   ? ret_from_fork+0x34/0x50
>   ? __pfx_kthread+0x10/0x10
>   ? ret_from_fork_asm+0x1b/0x30
>   </TASK>
> 
> The bug flow is as follows:
> 
> thread 1                                   thread 2
> 
> sk_psock_backlog                           sock_close
>   sk_psock_handle_skb                        __sock_release
>     __skb_send_sock                            inet_release
>       sendmsg_unlocked                           tcp_close
>         sock_sendmsg                               lock_sock
>                                                      __tcp_close
>                                                    release_sock
>                                                  sock->sk = NULL // (1)
>           inet_sendmsg
>             sk = sock->sk // (2)
>             inet_send_prepare
>               inet_sk(sk)->inet_num // (3)

We are doing a lot of hoping through calls here to find something we
should already know. We know the psock we are sending has a protocol
of tcp, udp, ... and could call the send directly instead of walking
back into the sk_socket and so on. For tcp example we could simply
call tcp_sendmsg(sk, msg, size).

I haven't tried it yet, but I wonder if a lot of this logic gets
easier to reason about if we have per protocol backlog logic. Its
just a hunch at this point though.

> 
> sock->sk is set to NULL by thread 2 at time (1), then fetched by
> thread 1 at time (2), and used by thread 1 to access memory at
> time (3), resulting in NULL pointer dereference.
> 
> To fix it, add lock_sock back on the egress path for sk_psock_handle_skb.
> 
> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> ---
>  net/core/skmsg.c | 44 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 34 insertions(+), 10 deletions(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 7c2764beeb04..8b758c51aa0d 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -609,15 +609,42 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>  	return err;
>  }
>  
> +static int sk_psock_handle_ingress_skb(struct sk_psock *psock,
> +				       struct sk_buff *skb,
> +				       u32 off, u32 len)
> +{
> +	if (sock_flag(psock->sk, SOCK_DEAD))
> +		return -EIO;

We didn't previously have the SOCK_DEAD check on ingress which
looks fine because we will come along and flush the ingress
queue when psock is being torn down. Adding it looks fine
though because __tcp_close is flushing the sk_receive_queue
and detaching the user from the socket so we have no way
to read the data anyways. This will then abort the backlog
which moves the psock destruct op along a bit faster.

> +	return sk_psock_skb_ingress(psock, skb, off, len);
> +}
> +
> +static int sk_psock_handle_egress_skb(struct sk_psock *psock,
> +				      struct sk_buff *skb,
> +				      u32 off, u32 len)
> +{
> +	int ret;
> +
> +	lock_sock(psock->sk);
> +
> +	if (sock_flag(psock->sk, SOCK_DEAD))
> +		ret = -EIO;

OK, the sock_orphan() call from tcp_close adjudge_to_death block will set
the SOCK_DEAD flag and ensure we abort the send here. EIO then forces
backlog to abort. This looks correct to me.

> +	else if (!sock_writeable(psock->sk))
> +		ret = -EAGAIN;
> +	else
> +		ret = skb_send_sock_locked(psock->sk, skb, off, len);
> +
> +	release_sock(psock->sk);
> +
> +	return ret;
> +}
> +
>  static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>  			       u32 off, u32 len, bool ingress)
>  {
> -	if (!ingress) {
> -		if (!sock_writeable(psock->sk))
> -			return -EAGAIN;
> -		return skb_send_sock(psock->sk, skb, off, len);
> -	}
> -	return sk_psock_skb_ingress(psock, skb, off, len);
> +	if (ingress)
> +		return sk_psock_handle_ingress_skb(psock, skb, off, len);
> +	else
> +		return sk_psock_handle_egress_skb(psock, skb, off, len);
>  }
>  
>  static void sk_psock_skb_state(struct sk_psock *psock,
> @@ -660,10 +687,7 @@ static void sk_psock_backlog(struct work_struct *work)
>  		ingress = skb_bpf_ingress(skb);
>  		skb_bpf_redirect_clear(skb);
>  		do {
> -			ret = -EIO;
> -			if (!sock_flag(psock->sk, SOCK_DEAD))
> -				ret = sk_psock_handle_skb(psock, skb, off,
> -							  len, ingress);
> +			ret = sk_psock_handle_skb(psock, skb, off, len, ingress);
>  			if (ret <= 0) {
>  				if (ret == -EAGAIN) {
>  					sk_psock_skb_state(psock, state, len, off);

OK LGTM nice catch I left my commentary above that helped as I reviewed it. I
guess we need more stress testing along this path all of our testing is on
ingress path at the moment. Do you happen to have something coded up that
stress tests the redirect send paths?

Reviewed-by: John Fastabend <john.fastabend@gmail.com>

Xu Kuohai Aug. 3, 2023, 10:05 a.m. UTC | #2

On 8/2/2023 11:04 AM, John Fastabend wrote:
> Xu Kuohai wrote:
>> From: Xu Kuohai <xukuohai@huawei.com>
>>
>> sk_psock_backlog triggers a NULL dereference:
>>
>>   BUG: kernel NULL pointer dereference, address: 000000000000000e
>>   #PF: supervisor read access in kernel mode
>>   #PF: error_code(0x0000) - not-present page
>>   PGD 0 P4D 0
>>   Oops: 0000 [#1] PREEMPT SMP PTI
>>   CPU: 0 PID: 70 Comm: kworker/0:3 Not tainted 6.5.0-rc2-00585-gb11bbbe4c66e #26
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-p4
>>   Workqueue: events sk_psock_backlog
>>   RIP: 0010:0xffffffffc0205254
>>   Code: 00 00 48 89 94 24 a0 00 00 00 41 5f 41 5e 41 5d 41 5c 5d 5b 41 5b 41 5a 41 59 41 50
>>   RSP: 0018:ffffc90000acbcb8 EFLAGS: 00010246
>>   RAX: ffffffff81c5ee10 RBX: ffff888018260000 RCX: 0000000000000001
>>   RDX: 0000000000000003 RSI: ffffc90000acbd58 RDI: 0000000000000000
>>   RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000080100005
>>   R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000003
>>   R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000003
>>   FS:  0000000000000000(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: 000000000000000e CR3: 000000000b0de002 CR4: 0000000000170ef0
>>   Call Trace:
>>    <TASK>
>>    ? __die+0x24/0x70
>>    ? page_fault_oops+0x15d/0x480
>>    ? fixup_exception+0x26/0x330
>>    ? exc_page_fault+0x72/0x1d0
>>    ? asm_exc_page_fault+0x26/0x30
>>    ? __pfx_inet_sendmsg+0x10/0x10
>>    ? 0xffffffffc0205254
>>    ? inet_sendmsg+0x20/0x80
>>    ? sock_sendmsg+0x8f/0xa0
>>    ? __skb_send_sock+0x315/0x360
>>    ? __pfx_sendmsg_unlocked+0x10/0x10
>>    ? sk_psock_backlog+0xb4/0x300
>>    ? process_one_work+0x292/0x560
>>    ? worker_thread+0x53/0x3e0
>>    ? __pfx_worker_thread+0x10/0x10
>>    ? kthread+0x102/0x130
>>    ? __pfx_kthread+0x10/0x10
>>    ? ret_from_fork+0x34/0x50
>>    ? __pfx_kthread+0x10/0x10
>>    ? ret_from_fork_asm+0x1b/0x30
>>    </TASK>
>>
>> The bug flow is as follows:
>>
>> thread 1                                   thread 2
>>
>> sk_psock_backlog                           sock_close
>>    sk_psock_handle_skb                        __sock_release
>>      __skb_send_sock                            inet_release
>>        sendmsg_unlocked                           tcp_close
>>          sock_sendmsg                               lock_sock
>>                                                       __tcp_close
>>                                                     release_sock
>>                                                   sock->sk = NULL // (1)
>>            inet_sendmsg
>>              sk = sock->sk // (2)
>>              inet_send_prepare
>>                inet_sk(sk)->inet_num // (3)
> 
> We are doing a lot of hoping through calls here to find something we
> should already know. We know the psock we are sending has a protocol
> of tcp, udp, ... and could call the send directly instead of walking
> back into the sk_socket and so on. For tcp example we could simply
> call tcp_sendmsg(sk, msg, size).
> 

Sorry, the fix method in this patch is not correct:

1. though it works on tcp, it fails on udp and unix sockets due to the
    lack of sendmsg_locked callback, which only exists on tcp.

2. inet_release sets socket->sk = NULL outside lock_sock, so lock_sock
    cannot protect us from accessing a NULL socket->sk.

To fix it correctly, calling tcp/udp/unix sendmsg directly without
touching sk_socket seems a good idea, I'll try it. Thanks.

> I haven't tried it yet, but I wonder if a lot of this logic gets
> easier to reason about if we have per protocol backlog logic. Its
> just a hunch at this point though.
> 
>>
>> sock->sk is set to NULL by thread 2 at time (1), then fetched by
>> thread 1 at time (2), and used by thread 1 to access memory at
>> time (3), resulting in NULL pointer dereference.
>>
>> To fix it, add lock_sock back on the egress path for sk_psock_handle_skb.
>>
>> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
>> ---
>>   net/core/skmsg.c | 44 ++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 34 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 7c2764beeb04..8b758c51aa0d 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -609,15 +609,42 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>>   	return err;
>>   }
>>   
>> +static int sk_psock_handle_ingress_skb(struct sk_psock *psock,
>> +				       struct sk_buff *skb,
>> +				       u32 off, u32 len)
>> +{
>> +	if (sock_flag(psock->sk, SOCK_DEAD))
>> +		return -EIO;
> 
> We didn't previously have the SOCK_DEAD check on ingress which
> looks fine because we will come along and flush the ingress
> queue when psock is being torn down. Adding it looks fine
> though because __tcp_close is flushing the sk_receive_queue
> and detaching the user from the socket so we have no way
> to read the data anyways. This will then abort the backlog
> which moves the psock destruct op along a bit faster.
> 
>> +	return sk_psock_skb_ingress(psock, skb, off, len);
>> +}
>> +
>> +static int sk_psock_handle_egress_skb(struct sk_psock *psock,
>> +				      struct sk_buff *skb,
>> +				      u32 off, u32 len)
>> +{
>> +	int ret;
>> +
>> +	lock_sock(psock->sk);
>> +
>> +	if (sock_flag(psock->sk, SOCK_DEAD))
>> +		ret = -EIO;
> 
> OK, the sock_orphan() call from tcp_close adjudge_to_death block will set
> the SOCK_DEAD flag and ensure we abort the send here. EIO then forces
> backlog to abort. This looks correct to me.
> 
>> +	else if (!sock_writeable(psock->sk))
>> +		ret = -EAGAIN;
>> +	else
>> +		ret = skb_send_sock_locked(psock->sk, skb, off, len);
>> +
>> +	release_sock(psock->sk);
>> +
>> +	return ret;
>> +}
>> +
>>   static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>>   			       u32 off, u32 len, bool ingress)
>>   {
>> -	if (!ingress) {
>> -		if (!sock_writeable(psock->sk))
>> -			return -EAGAIN;
>> -		return skb_send_sock(psock->sk, skb, off, len);
>> -	}
>> -	return sk_psock_skb_ingress(psock, skb, off, len);
>> +	if (ingress)
>> +		return sk_psock_handle_ingress_skb(psock, skb, off, len);
>> +	else
>> +		return sk_psock_handle_egress_skb(psock, skb, off, len);
>>   }
>>   
>>   static void sk_psock_skb_state(struct sk_psock *psock,
>> @@ -660,10 +687,7 @@ static void sk_psock_backlog(struct work_struct *work)
>>   		ingress = skb_bpf_ingress(skb);
>>   		skb_bpf_redirect_clear(skb);
>>   		do {
>> -			ret = -EIO;
>> -			if (!sock_flag(psock->sk, SOCK_DEAD))
>> -				ret = sk_psock_handle_skb(psock, skb, off,
>> -							  len, ingress);
>> +			ret = sk_psock_handle_skb(psock, skb, off, len, ingress);
>>   			if (ret <= 0) {
>>   				if (ret == -EAGAIN) {
>>   					sk_psock_skb_state(psock, state, len, off);
> 
> OK LGTM nice catch I left my commentary above that helped as I reviewed it. I
> guess we need more stress testing along this path all of our testing is on
> ingress path at the moment. Do you happen to have something coded up that
> stress tests the redirect send paths?
>
Not yet, this bug was triggered in one of our http pressure tests.

> Reviewed-by: John Fastabend <john.fastabend@gmail.com>
> 
> .

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 7c2764beeb04..8b758c51aa0d 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -609,15 +609,42 @@  static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	return err;
 }
 
+static int sk_psock_handle_ingress_skb(struct sk_psock *psock,
+				       struct sk_buff *skb,
+				       u32 off, u32 len)
+{
+	if (sock_flag(psock->sk, SOCK_DEAD))
+		return -EIO;
+	return sk_psock_skb_ingress(psock, skb, off, len);
+}
+
+static int sk_psock_handle_egress_skb(struct sk_psock *psock,
+				      struct sk_buff *skb,
+				      u32 off, u32 len)
+{
+	int ret;
+
+	lock_sock(psock->sk);
+
+	if (sock_flag(psock->sk, SOCK_DEAD))
+		ret = -EIO;
+	else if (!sock_writeable(psock->sk))
+		ret = -EAGAIN;
+	else
+		ret = skb_send_sock_locked(psock->sk, skb, off, len);
+
+	release_sock(psock->sk);
+
+	return ret;
+}
+
 static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 			       u32 off, u32 len, bool ingress)
 {
-	if (!ingress) {
-		if (!sock_writeable(psock->sk))
-			return -EAGAIN;
-		return skb_send_sock(psock->sk, skb, off, len);
-	}
-	return sk_psock_skb_ingress(psock, skb, off, len);
+	if (ingress)
+		return sk_psock_handle_ingress_skb(psock, skb, off, len);
+	else
+		return sk_psock_handle_egress_skb(psock, skb, off, len);
 }
 
 static void sk_psock_skb_state(struct sk_psock *psock,
@@ -660,10 +687,7 @@  static void sk_psock_backlog(struct work_struct *work)
 		ingress = skb_bpf_ingress(skb);
 		skb_bpf_redirect_clear(skb);
 		do {
-			ret = -EIO;
-			if (!sock_flag(psock->sk, SOCK_DEAD))
-				ret = sk_psock_handle_skb(psock, skb, off,
-							  len, ingress);
+			ret = sk_psock_handle_skb(psock, skb, off, len, ingress);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
 					sk_psock_skb_state(psock, state, len, off);

[bpf] bpf, sockmap: Fix NULL deref in sk_psock_backlog

Checks

Commit Message

Comments

Patch