[bpf,v7,04/13] bpf: sockmap, improved check for empty queue

Message ID	20230502155159.305437-5-john.fastabend@gmail.com (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers
Return-Path: <bpf-owner@vger.kernel.org>
From: John Fastabend <john.fastabend@gmail.com>
To: jakub@cloudflare.com, daniel@iogearbox.net, lmb@isovalent.com, edumazet@google.com
Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org, andrii@kernel.org, will@isovalent.com
Subject: [PATCH bpf v7 04/13] bpf: sockmap, improved check for empty queue
Date: Tue, 2 May 2023 08:51:50 -0700
Message-Id: <20230502155159.305437-5-john.fastabend@gmail.com>
In-Reply-To: <20230502155159.305437-1-john.fastabend@gmail.com>
References: <20230502155159.305437-1-john.fastabend@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series	bpf sockmap fixes
[bpf,v7,00/13] bpf sockmap fixes
[bpf,v7,01/13] bpf: sockmap, pass skb ownership through read_skb
[bpf,v7,02/13] bpf: sockmap, convert schedule_work into delayed_work
[bpf,v7,03/13] bpf: sockmap, reschedule is now done through backlog
[bpf,v7,04/13] bpf: sockmap, improved check for empty queue
[bpf,v7,05/13] bpf: sockmap, handle fin correctly
[bpf,v7,06/13] bpf: sockmap, TCP data stall on recv before accept
[bpf,v7,07/13] bpf: sockmap, wake up polling after data copy
[bpf,v7,08/13] bpf: sockmap, incorrectly handling copied_seq
[bpf,v7,09/13] bpf: sockmap, pull socket helpers out of listen test for general use
[bpf,v7,10/13] bpf: sockmap, build helper to create connected socket pair
[bpf,v7,11/13] bpf: sockmap, test shutdown() correctly exits epoll and recv()=0
[bpf,v7,12/13] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer
[bpf,v7,13/13] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer with drops

Context	Check	Description
netdev/tree_selection	success	Clearly marked for bpf, async
netdev/apply	fail	Patch does not apply to bpf
bpf/vmtest-bpf-PR	fail	merge-conflict
bpf/vmtest-bpf-VM_Test-1	success	Logs for ShellCheck
bpf/vmtest-bpf-VM_Test-2	success	Logs for build for aarch64 with gcc
bpf/vmtest-bpf-VM_Test-3	success	Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-4	success	Logs for build for s390x with gcc
bpf/vmtest-bpf-VM_Test-5	success	Logs for build for x86_64 with gcc
bpf/vmtest-bpf-VM_Test-6	success	Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-7	success	Logs for set-matrix
bpf/vmtest-bpf-VM_Test-8	success	Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-9	success	Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-10	success	Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-VM_Test-11	success	Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-12	success	Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-13	fail	Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-14	fail	Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-15	fail	Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-VM_Test-16	fail	Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-17	fail	Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-18	fail	Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-19	fail	Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-20	fail	Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-VM_Test-21	fail	Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-22	fail	Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-23	success	Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-24	success	Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-25	success	Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-26	success	Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-27	success	Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-28	success	Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-29	success	Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-30	success	Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-31	success	Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-VM_Test-32	success	Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-VM_Test-33	success	Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-VM_Test-34	success	Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-VM_Test-35	success	Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-VM_Test-36	success	Logs for veristat

Commit Message

John Fastabend May 2, 2023, 3:51 p.m. UTC

We noticed that some sk_buffs were occasionally stepping past the queue
when the system was under memory pressure. The general theory is to skip
enqueueing sk_buffs when it's not necessary, which is the normal case
with a system that is properly provisioned for the task: no memory
pressure and enough CPU assigned.

But if we can't allocate memory, due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue, we push it onto a delayed
workqueue to retry later. When a new sk_buff is received, we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. While a sk_buff is being processed from the ingress
queue but is not yet on the sockmap msg receive queue, it's possible to
also receive a sk_buff through the normal path. It will check the
ingress queue length, which is zero, and then skip ahead of the packet
being processed.

Previously we used the sock lock from both contexts, which made the
problem harder to hit, but not impossible.

To fix this, instead of popping the skb from the queue entirely, we peek
the skb from the queue and do the copy there. This ensures checks of the
queue length are non-zero while the skb is being processed. Then,
finally, when the entire skb has been copied to the user space queue or
another socket, we pop it off the queue. This way the queue length check
allows bypassing the queue only after the list has been completely
processed.

To reproduce the issue, we run the NGINX compliance test with sockmap
running and observe some flakes in our testing that we attributed to
this issue.

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Tested-by: William Findlay <will@isovalent.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |  1 -
 net/core/skmsg.c      | 32 ++++++++------------------------
 2 files changed, 8 insertions(+), 25 deletions(-)

Comments

Jakub Sitnicki May 4, 2023, 4:53 p.m. UTC | #1

On Tue, May 02, 2023 at 08:51 AM -07, John Fastabend wrote:
> We noticed that some sk_buffs were occasionally stepping past the queue
> when the system was under memory pressure. The general theory is to skip
> enqueueing sk_buffs when it's not necessary, which is the normal case
> with a system that is properly provisioned for the task: no memory
> pressure and enough CPU assigned.
>
> But if we can't allocate memory, due to an ENOMEM error when enqueueing
> the sk_buff into the sockmap receive queue, we push it onto a delayed
> workqueue to retry later. When a new sk_buff is received, we then check
> if that queue is empty. However, there is a problem with simply checking
> the queue length. While a sk_buff is being processed from the ingress
> queue but is not yet on the sockmap msg receive queue, it's possible to
> also receive a sk_buff through the normal path. It will check the
> ingress queue length, which is zero, and then skip ahead of the packet
> being processed.
>
> Previously we used the sock lock from both contexts, which made the
> problem harder to hit, but not impossible.
>
> To fix this, instead of popping the skb from the queue entirely, we peek
> the skb from the queue and do the copy there. This ensures checks of the
> queue length are non-zero while the skb is being processed. Then,
> finally, when the entire skb has been copied to the user space queue or
> another socket, we pop it off the queue. This way the queue length check
> allows bypassing the queue only after the list has been completely
> processed.
>
> To reproduce the issue, we run the NGINX compliance test with sockmap
> running and observe some flakes in our testing that we attributed to
> this issue.
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Tested-by: William Findlay <will@isovalent.com>
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h |  1 -
>  net/core/skmsg.c      | 32 ++++++++------------------------
>  2 files changed, 8 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 904ff9a32ad6..054d7911bfc9 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -71,7 +71,6 @@ struct sk_psock_link {
>  };
>  
>  struct sk_psock_work_state {
> -	struct sk_buff			*skb;
>  	u32				len;
>  	u32				off;
>  };
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 3f95c460c261..bc5ca973400c 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -622,16 +622,12 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
>  
>  static void sk_psock_skb_state(struct sk_psock *psock,
>  			       struct sk_psock_work_state *state,
> -			       struct sk_buff *skb,
>  			       int len, int off)
>  {
>  	spin_lock_bh(&psock->ingress_lock);
>  	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> -		state->skb = skb;
>  		state->len = len;
>  		state->off = off;
> -	} else {
> -		sock_drop(psock->sk, skb);
>  	}
>  	spin_unlock_bh(&psock->ingress_lock);
>  }
> @@ -642,23 +638,17 @@ static void sk_psock_backlog(struct work_struct *work)
>  	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
>  	struct sk_psock_work_state *state = &psock->work_state;
>  	struct sk_buff *skb = NULL;
> +	u32 len = 0, off = 0;
>  	bool ingress;
> -	u32 len, off;
>  	int ret;
>  
>  	mutex_lock(&psock->work_mutex);
> -	if (unlikely(state->skb)) {
> -		spin_lock_bh(&psock->ingress_lock);
> -		skb = state->skb;
> +	if (unlikely(state->len)) {
>  		len = state->len;
>  		off = state->off;
> -		state->skb = NULL;
> -		spin_unlock_bh(&psock->ingress_lock);
>  	}
> -	if (skb)
> -		goto start;
>  
> -	while ((skb = skb_dequeue(&psock->ingress_skb))) {
> +	while ((skb = skb_peek(&psock->ingress_skb))) {
>  		len = skb->len;
>  		off = 0;
>  		if (skb_bpf_strparser(skb)) {
> @@ -667,7 +657,6 @@ static void sk_psock_backlog(struct work_struct *work)
>  			off = stm->offset;
>  			len = stm->full_len;
>  		}
> -start:
>  		ingress = skb_bpf_ingress(skb);
>  		skb_bpf_redirect_clear(skb);
>  		do {
> @@ -677,8 +666,7 @@ static void sk_psock_backlog(struct work_struct *work)
>  							  len, ingress);
>  			if (ret <= 0) {
>  				if (ret == -EAGAIN) {
> -					sk_psock_skb_state(psock, state, skb,
> -							   len, off);
> +					sk_psock_skb_state(psock, state, len, off);
>  
>  					/* Delay slightly to prioritize any
>  					 * other work that might be here.

I've been staring at this bit and I think it doesn't matter if we update
psock->work_state when SK_PSOCK_TX_ENABLED has been cleared.

But what I think we shouldn't be doing here is scheduling
sk_psock_backlog again if SK_PSOCK_TX_ENABLED got cleared by
sk_psock_stop.

> @@ -689,15 +677,16 @@ static void sk_psock_backlog(struct work_struct *work)
>  				/* Hard errors break pipe and stop xmit. */
>  				sk_psock_report_error(psock, ret ? -ret : EPIPE);
>  				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
> -				sock_drop(psock->sk, skb);
>  				goto end;
>  			}
>  			off += ret;
>  			len -= ret;
>  		} while (len);
>  
> -		if (!ingress)
> +		skb = skb_dequeue(&psock->ingress_skb);
> +		if (!ingress) {
>  			kfree_skb(skb);
> +		}
>  	}
>  end:
>  	mutex_unlock(&psock->work_mutex);

John Fastabend May 4, 2023, 5:42 p.m. UTC | #2

Jakub Sitnicki wrote:
> On Tue, May 02, 2023 at 08:51 AM -07, John Fastabend wrote:
> > We noticed that some sk_buffs were occasionally stepping past the queue
> > when the system was under memory pressure. The general theory is to skip
> > enqueueing sk_buffs when it's not necessary, which is the normal case
> > with a system that is properly provisioned for the task: no memory
> > pressure and enough CPU assigned.
> >
> > But if we can't allocate memory, due to an ENOMEM error when enqueueing
> > the sk_buff into the sockmap receive queue, we push it onto a delayed
> > workqueue to retry later. When a new sk_buff is received, we then check
> > if that queue is empty. However, there is a problem with simply checking
> > the queue length. While a sk_buff is being processed from the ingress
> > queue but is not yet on the sockmap msg receive queue, it's possible to
> > also receive a sk_buff through the normal path. It will check the
> > ingress queue length, which is zero, and then skip ahead of the packet
> > being processed.
> >
> > Previously we used the sock lock from both contexts, which made the
> > problem harder to hit, but not impossible.
> >
> > To fix this, instead of popping the skb from the queue entirely, we peek
> > the skb from the queue and do the copy there. This ensures checks of the
> > queue length are non-zero while the skb is being processed. Then,
> > finally, when the entire skb has been copied to the user space queue or
> > another socket, we pop it off the queue. This way the queue length check
> > allows bypassing the queue only after the list has been completely
> > processed.
> >
> > To reproduce the issue, we run the NGINX compliance test with sockmap
> > running and observe some flakes in our testing that we attributed to
> > this issue.
> >
> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> > Tested-by: William Findlay <will@isovalent.com>
> > Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---

[...]

> > @@ -677,8 +666,7 @@ static void sk_psock_backlog(struct work_struct *work)
> >  							  len, ingress);
> >  			if (ret <= 0) {
> >  				if (ret == -EAGAIN) {
> > -					sk_psock_skb_state(psock, state, skb,
> > -							   len, off);
> > +					sk_psock_skb_state(psock, state, len, off);
> >  
> >  					/* Delay slightly to prioritize any
> >  					 * other work that might be here.
> 
> I've been staring at this bit and I think it doesn't matter if we update
> psock->work_state when SK_PSOCK_TX_ENABLED has been cleared.
> 
> But what I think we shouldn't be doing here is scheduling
> sk_psock_backlog again if SK_PSOCK_TX_ENABLED got cleared by
> sk_psock_stop.

Yeah, I agree we shouldn't be scheduling with TX_ENABLED cleared. Otherwise,
while we cancel and sync the worker from the destroy path, we could queue
up more work here.

Also spotted another case of this where it's not wrapped in a check. I guess
we should fix it. Nice catch.

v8 it is, I guess. Thanks.

Patch

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 904ff9a32ad6..054d7911bfc9 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -71,7 +71,6 @@  struct sk_psock_link {
 };
 
 struct sk_psock_work_state {
-	struct sk_buff			*skb;
 	u32				len;
 	u32				off;
 };
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 3f95c460c261..bc5ca973400c 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -622,16 +622,12 @@  static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 
 static void sk_psock_skb_state(struct sk_psock *psock,
 			       struct sk_psock_work_state *state,
-			       struct sk_buff *skb,
 			       int len, int off)
 {
 	spin_lock_bh(&psock->ingress_lock);
 	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
-		state->skb = skb;
 		state->len = len;
 		state->off = off;
-	} else {
-		sock_drop(psock->sk, skb);
 	}
 	spin_unlock_bh(&psock->ingress_lock);
 }
@@ -642,23 +638,17 @@  static void sk_psock_backlog(struct work_struct *work)
 	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
 	struct sk_psock_work_state *state = &psock->work_state;
 	struct sk_buff *skb = NULL;
+	u32 len = 0, off = 0;
 	bool ingress;
-	u32 len, off;
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
-	if (unlikely(state->skb)) {
-		spin_lock_bh(&psock->ingress_lock);
-		skb = state->skb;
+	if (unlikely(state->len)) {
 		len = state->len;
 		off = state->off;
-		state->skb = NULL;
-		spin_unlock_bh(&psock->ingress_lock);
 	}
-	if (skb)
-		goto start;
 
-	while ((skb = skb_dequeue(&psock->ingress_skb))) {
+	while ((skb = skb_peek(&psock->ingress_skb))) {
 		len = skb->len;
 		off = 0;
 		if (skb_bpf_strparser(skb)) {
@@ -667,7 +657,6 @@  static void sk_psock_backlog(struct work_struct *work)
 			off = stm->offset;
 			len = stm->full_len;
 		}
-start:
 		ingress = skb_bpf_ingress(skb);
 		skb_bpf_redirect_clear(skb);
 		do {
@@ -677,8 +666,7 @@  static void sk_psock_backlog(struct work_struct *work)
 							  len, ingress);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
-					sk_psock_skb_state(psock, state, skb,
-							   len, off);
+					sk_psock_skb_state(psock, state, len, off);
 
 					/* Delay slightly to prioritize any
 					 * other work that might be here.
@@ -689,15 +677,16 @@  static void sk_psock_backlog(struct work_struct *work)
 				/* Hard errors break pipe and stop xmit. */
 				sk_psock_report_error(psock, ret ? -ret : EPIPE);
 				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
-				sock_drop(psock->sk, skb);
 				goto end;
 			}
 			off += ret;
 			len -= ret;
 		} while (len);
 
-		if (!ingress)
+		skb = skb_dequeue(&psock->ingress_skb);
+		if (!ingress) {
 			kfree_skb(skb);
+		}
 	}
 end:
 	mutex_unlock(&psock->work_mutex);
@@ -790,11 +779,6 @@  static void __sk_psock_zap_ingress(struct sk_psock *psock)
 		skb_bpf_redirect_clear(skb);
 		sock_drop(psock->sk, skb);
 	}
-	kfree_skb(psock->work_state.skb);
-	/* We null the skb here to ensure that calls to sk_psock_backlog
-	 * do not pick up the free'd skb.
-	 */
-	psock->work_state.skb = NULL;
 	__sk_psock_purge_ingress_msg(psock);
 }
 
@@ -813,7 +797,6 @@  void sk_psock_stop(struct sk_psock *psock)
 	spin_lock_bh(&psock->ingress_lock);
 	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
 	sk_psock_cork_free(psock);
-	__sk_psock_zap_ingress(psock);
 	spin_unlock_bh(&psock->ingress_lock);
 }
 
@@ -828,6 +811,7 @@  static void sk_psock_destroy(struct work_struct *work)
 	sk_psock_done_strp(psock);
 
 	cancel_delayed_work_sync(&psock->work);
+	__sk_psock_zap_ingress(psock);
 	mutex_destroy(&psock->work_mutex);
 
 	psock_progs_drop(&psock->progs);
