
[bpf] bpf: sockmap, fix skb refcnt race after locking changes

Message ID 20230901202137.214666-1-john.fastabend@gmail.com (mailing list archive)
State Accepted
Commit a454d84ee20baf7bd7be90721b9821f73c7d23d9
Delegated to: BPF
Series [bpf] bpf: sockmap, fix skb refcnt race after locking changes

Checks

Context                         Check    Description
bpf/vmtest-bpf-VM_Test-0        success  Logs for ShellCheck
bpf/vmtest-bpf-PR               success  PR summary
bpf/vmtest-bpf-VM_Test-5        success  Logs for set-matrix
netdev/series_format            success  Single patches do not need cover letters
netdev/tree_selection           success  Clearly marked for bpf
netdev/fixes_present            success  Fixes tag present in non-next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 1332 this patch: 1332
netdev/cc_maintainers           fail     1 blamed authors not CCed: ast@kernel.org; 5 maintainers not CCed: kuba@kernel.org jakub@cloudflare.com davem@davemloft.net pabeni@redhat.com ast@kernel.org
netdev/build_clang              success  Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  Fixes tag looks correct
netdev/build_allmodconfig_warn  success  Errors and warnings before: 1355 this patch: 1355
netdev/checkpatch               warning  WARNING: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")'
netdev/kdoc                     success  Errors and warnings before: 0 this patch: 0
netdev/source_inline            success  Was 0 now: 0
bpf/vmtest-bpf-VM_Test-1        success  Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-3        success  Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-4        success  Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-2        success  Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-8        success  Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-20       success  Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-19       success  Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-24       success  Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-26       success  Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-27       success  Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-28       success  Logs for veristat
bpf/vmtest-bpf-VM_Test-6        success  Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9        success  Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-10       success  Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-12       success  Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-13       success  Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-14       success  Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-16       success  Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-18       success  Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-22       success  Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17       success  Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-21       success  Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-25       success  Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-23       success  Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-15       success  Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-11       success  Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-7        success  Logs for test_maps on s390x with gcc

Commit Message

John Fastabend Sept. 1, 2023, 8:21 p.m. UTC
There is a race where skbs from the sk_psock_backlog can be referenced
after the userspace side has already called consume_skb() on the sk_buff
and its refcnt has dropped to zero, causing a use-after-free.

The flow is the following:

  while ((skb = skb_peek(&psock->ingress_skb)))
    sk_psock_handle_skb(psock, skb, ..., ingress)
    if (!ingress) ...
    sk_psock_skb_ingress
       sk_psock_skb_ingress_enqueue(skb)
          msg->skb = skb
          sk_psock_queue_msg(psock, msg)
    skb_dequeue(&psock->ingress_skb)

sk_psock_queue_msg() puts the msg on the ingress_msg queue, which is
what the application reads when recvmsg() is called. An application can
read this any time after the msg is placed on the queue. The recvmsg
hook also reads msg->skb and, once user space has read the msg, calls
consume_skb(skb) on it, effectively freeing it.

The race is in the flow above: the backlog queue still has a reference to
the skb and calls skb_dequeue(). If the skb_dequeue() happens after the
user reads and frees the skb, we have a use-after-free.

The !ingress case does not suffer from this problem because it uses
sendmsg_*(sk, msg) which does not pass the sk_buff further down the
stack.
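
To make the ordering concrete, here is a minimal single-threaded sketch
of the pre-fix sequence. It is a userspace toy model, not kernel code:
struct toy_skb, toy_put() and toy_replay_unfixed() are hypothetical
names invented for illustration, and the 'freed' flag stands in for the
real free so that the state can be inspected safely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct sk_buff: only the reference count matters here. */
struct toy_skb {
	int users;	/* models skb->users */
	bool freed;	/* set instead of really freeing, so it can be inspected */
};

/* Models consume_skb()/kfree_skb(): drop one reference, "free" at zero. */
static void toy_put(struct toy_skb *skb)
{
	assert(skb->users > 0);
	if (--skb->users == 0)
		skb->freed = true;
}

/*
 * Replays the pre-fix ordering in a single thread:
 *   1. backlog peeks the skb (still on ingress_skb, one reference in total)
 *   2. the skb is attached to a msg and queued on ingress_msg
 *   3. the reader consumes the msg and drops the only reference
 *   4. backlog gets around to dequeueing the skb
 * Returns true if the skb was already freed when step 4 runs.
 */
static bool toy_replay_unfixed(void)
{
	struct toy_skb skb = { .users = 1, .freed = false };
	struct toy_skb *peeked = &skb;		/* step 1 */
	struct toy_skb *msg_skb = peeked;	/* step 2: msg->skb = skb */

	toy_put(msg_skb);			/* step 3: recvmsg path */
	return peeked->freed;			/* step 4: state seen by dequeue */
}
```

In this model toy_replay_unfixed() returns true, i.e. the dequeue step
observes a buffer whose last reference is already gone.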

The following splat was observed with 'test_progs -t sockmap_listen':

[ 1022.710250][ T2556] general protection fault, ...
 ...
[ 1022.712830][ T2556] Workqueue: events sk_psock_backlog
[ 1022.713262][ T2556] RIP: 0010:skb_dequeue+0x4c/0x80
[ 1022.713653][ T2556] Code: ...
 ...
[ 1022.720699][ T2556] Call Trace:
[ 1022.720984][ T2556]  <TASK>
[ 1022.721254][ T2556]  ? die_addr+0x32/0x80
[ 1022.721589][ T2556]  ? exc_general_protection+0x25a/0x4b0
[ 1022.722026][ T2556]  ? asm_exc_general_protection+0x22/0x30
[ 1022.722489][ T2556]  ? skb_dequeue+0x4c/0x80
[ 1022.722854][ T2556]  sk_psock_backlog+0x27a/0x300
[ 1022.723243][ T2556]  process_one_work+0x2a7/0x5b0
[ 1022.723633][ T2556]  worker_thread+0x4f/0x3a0
[ 1022.723998][ T2556]  ? __pfx_worker_thread+0x10/0x10
[ 1022.724386][ T2556]  kthread+0xfd/0x130
[ 1022.724709][ T2556]  ? __pfx_kthread+0x10/0x10
[ 1022.725066][ T2556]  ret_from_fork+0x2d/0x50
[ 1022.725409][ T2556]  ? __pfx_kthread+0x10/0x10
[ 1022.725799][ T2556]  ret_from_fork_asm+0x1b/0x30
[ 1022.726201][ T2556]  </TASK>

To fix this, we add an skb_get() before passing the skb to be enqueued in
the ingress queue. This bumps the skb->users refcnt so that consume_skb()
and kfree_skb() will not immediately free the sk_buff. With this we can
be sure the skb is still around when we do the dequeue. Then we just
need to decrement the refcnt or free the skb in the backlog case, which
we do by calling kfree_skb() in the ingress case as well as the sendmsg
case.
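
The refcnt transitions described above can be sketched with a small
userspace model. This is only an illustration under stated assumptions:
struct model_skb and the model_*() helpers are hypothetical names, and
the ingress_err parameter is a made-up knob standing in for the return
value of sk_psock_skb_ingress().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct sk_buff. */
struct model_skb {
	int users;	/* models skb->users */
	bool freed;	/* set instead of really freeing */
};

/* Models skb_get(): take one more reference. */
static struct model_skb *model_get(struct model_skb *skb)
{
	skb->users++;
	return skb;
}

/* Models kfree_skb()/consume_skb(): drop one reference, "free" at zero. */
static void model_put(struct model_skb *skb)
{
	assert(skb->users > 0);
	if (--skb->users == 0)
		skb->freed = true;
}

/*
 * Models the ingress branch of the patched sk_psock_handle_skb().
 * On success the queued msg owns the extra reference; on error nobody
 * took it, so it is dropped again right away.
 */
static int model_handle_ingress(struct model_skb *skb, int ingress_err)
{
	model_get(skb);
	if (ingress_err < 0)
		model_put(skb);
	return ingress_err;
}

/* Success path: true if the skb survives until the backlog's own put. */
static bool model_replay_fixed(void)
{
	struct model_skb skb = { .users = 1, .freed = false };
	bool alive_at_dequeue;

	model_handle_ingress(&skb, 0);	/* users: 1 -> 2 */
	model_put(&skb);		/* reader's consume_skb(): 2 -> 1 */
	alive_at_dequeue = !skb.freed;	/* what skb_dequeue() would see */
	model_put(&skb);		/* backlog's kfree_skb(): 1 -> 0 */
	return alive_at_dequeue && skb.freed;
}

/* Error path: refcount left over after a failed ingress attempt. */
static int model_replay_error(void)
{
	struct model_skb skb = { .users = 1, .freed = false };

	model_handle_ingress(&skb, -1);	/* users: 1 -> 2 -> 1 */
	return skb.users;
}
```

In this model model_replay_fixed() returns true (the buffer is alive at
dequeue time and freed only by the backlog's final put), and
model_replay_error() returns 1, so a failed ingress attempt leaves the
queue's original reference untouched.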

Before the locking change from the Fixes tag, we had the sock locked, so we
couldn't race with the user and there was no issue here.

Fixes: 799aa7f98d53e (skmsg: Avoid lock_sock() in sk_psock_backlog())
Reported-by: Jiri Olsa  <jolsa@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Jiri Olsa Sept. 1, 2023, 9:20 p.m. UTC | #1
On Fri, Sep 01, 2023 at 01:21:37PM -0700, John Fastabend wrote:
> [...]

there's no crash with the fix, but I noticed I occasionally get FAIL

#212/78  sockmap_listen/sockmap Unix test_unix_redir:OK
./test_progs:vsock_unix_redir_connectible:1501: ingress: write: Transport endpoint is not connected
vsock_unix_redir_connectible:FAIL:1501
./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected
vsock_unix_redir_connectible:FAIL:1501
#212/79  sockmap_listen/sockmap VSOCK test_vsock_redir:FAIL
#212/80  sockmap_listen/sockhash IPv4 TCP test_insert_invalid:OK

no idea if it's related

jirka
Eduard Zingerman Sept. 1, 2023, 9:24 p.m. UTC | #2
On Fri, 2023-09-01 at 23:20 +0200, Jiri Olsa wrote:
> On Fri, Sep 01, 2023 at 01:21:37PM -0700, John Fastabend wrote:
> > [...]
> 
> there's no crash with the fix, but I noticed I occasionally get FAIL
> 

Please note this patch:
https://lore.kernel.org/bpf/20230901031037.3314007-1-xukuohai@huaweicloud.com/
Which should fix the test in question.

> #212/78  sockmap_listen/sockmap Unix test_unix_redir:OK
> ./test_progs:vsock_unix_redir_connectible:1501: ingress: write: Transport endpoint is not connected
> vsock_unix_redir_connectible:FAIL:1501
> ./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected
> vsock_unix_redir_connectible:FAIL:1501
> #212/79  sockmap_listen/sockmap VSOCK test_vsock_redir:FAIL
> #212/80  sockmap_listen/sockhash IPv4 TCP test_insert_invalid:OK
> 
> no idea if it's related
> 
> jirka
Xu Kuohai Sept. 2, 2023, 8:13 a.m. UTC | #3
On 9/2/2023 4:21 AM, John Fastabend wrote:
> [...]

Tested-by: Xu Kuohai <xukuohai@huawei.com>
Simon Horman Sept. 2, 2023, 9 a.m. UTC | #4
On Fri, Sep 01, 2023 at 01:21:37PM -0700, John Fastabend wrote:
> [...]
> 
> Fixes: 799aa7f98d53e (skmsg: Avoid lock_sock() in sk_psock_backlog())

Hi John,

A minor nit from my side.
I think the usual format for a Fixes tag is as follows.

Fixes: 799aa7f98d53e ("skmsg: Avoid lock_sock() in sk_psock_backlog()")

> Reported-by: Jiri Olsa  <jolsa@kernel.org>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

...
Jiri Olsa Sept. 2, 2023, 5:28 p.m. UTC | #5
On Sat, Sep 02, 2023 at 12:24:01AM +0300, Eduard Zingerman wrote:

SNIP

> > >  static void sk_psock_skb_state(struct sk_psock *psock,
> > > @@ -685,9 +691,7 @@ static void sk_psock_backlog(struct work_struct *work)
> > >  		} while (len);
> > >  
> > >  		skb = skb_dequeue(&psock->ingress_skb);
> > > -		if (!ingress) {
> > > -			kfree_skb(skb);
> > > -		}
> > > +		kfree_skb(skb);
> > >  	}
> > >  end:
> > >  	mutex_unlock(&psock->work_mutex);
> > > -- 
> > > 2.33.0
> > > 
> > 
> > there's no crash with the fix, but I noticed I occasionally get FAIL
> > 
> 
> Please note this patch:
> https://lore.kernel.org/bpf/20230901031037.3314007-1-xukuohai@huaweicloud.com/
> Which should fix the test in question.

ah right it does, thanks

Tested-by: Jiri Olsa <jolsa@kernel.org>

jirka

> 
> > #212/78  sockmap_listen/sockmap Unix test_unix_redir:OK
> > ./test_progs:vsock_unix_redir_connectible:1501: ingress: write: Transport endpoint is not connected
> > vsock_unix_redir_connectible:FAIL:1501
> > ./test_progs:vsock_unix_redir_connectible:1501: egress: write: Transport endpoint is not connected
> > vsock_unix_redir_connectible:FAIL:1501
> > #212/79  sockmap_listen/sockmap VSOCK test_vsock_redir:FAIL
> > #212/80  sockmap_listen/sockhash IPv4 TCP test_insert_invalid:OK
> > 
> > no idea if it's related
> > 
> > jirka
>
patchwork-bot+netdevbpf@kernel.org Sept. 4, 2023, 8:13 a.m. UTC | #6
Hello:

This patch was applied to bpf/bpf.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Fri,  1 Sep 2023 13:21:37 -0700 you wrote:
> There is a race where skbs from the sk_psock_backlog can be referenced
> after the userspace side has already called consume_skb() on the sk_buff
> and its refcnt has dropped to zero, causing a use-after-free.
> 
> The flow is the following:
> 
>   while ((skb = skb_peek(&psock->ingress_skb)))
>     sk_psock_handle_skb(psock, skb, ..., ingress)
>     if (!ingress) ...
>     sk_psock_skb_ingress
>        sk_psock_skb_ingress_enqueue(skb)
>           msg->skb = skb
>           sk_psock_queue_msg(psock, msg)
>     skb_dequeue(&psock->ingress_skb)
> 
> [...]

Here is the summary with links:
  - [bpf] bpf: sockmap, fix skb refcnt race after locking changes
    https://git.kernel.org/bpf/bpf/c/a454d84ee20b

You are awesome, thank you!

Patch

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a0659fc29bcc..6c31eefbd777 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -612,12 +612,18 @@  static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 			       u32 off, u32 len, bool ingress)
 {
+	int err = 0;
+
 	if (!ingress) {
 		if (!sock_writeable(psock->sk))
 			return -EAGAIN;
 		return skb_send_sock(psock->sk, skb, off, len);
 	}
-	return sk_psock_skb_ingress(psock, skb, off, len);
+	skb_get(skb);
+	err = sk_psock_skb_ingress(psock, skb, off, len);
+	if (err < 0)
+		kfree_skb(skb);
+	return err;
 }
 
 static void sk_psock_skb_state(struct sk_psock *psock,
@@ -685,9 +691,7 @@  static void sk_psock_backlog(struct work_struct *work)
 		} while (len);
 
 		skb = skb_dequeue(&psock->ingress_skb);
-		if (!ingress) {
-			kfree_skb(skb);
-		}
+		kfree_skb(skb);
 	}
 end:
 	mutex_unlock(&psock->work_mutex);