[v4,bpf,1/2] bpf: fix skb_do_redirect return values

Message ID	e5d05e56bf41de82f10d33229b8a8f6b49290e98.1690332693.git.yan@cloudflare.com
State	New
Headers
  Return-Path: <linux-kselftest-owner@vger.kernel.org>
  Date: Tue, 25 Jul 2023 18:08:17 -0700
  From: Yan Zhai <yan@cloudflare.com>
  To: bpf@vger.kernel.org
  Cc: Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>,
      Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <martin.lau@linux.dev>,
      Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
      John Fastabend <john.fastabend@gmail.com>, KP Singh <kpsingh@kernel.org>,
      Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
      Jiri Olsa <jolsa@kernel.org>, "David S. Miller" <davem@davemloft.net>,
      Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>,
      Paolo Abeni <pabeni@redhat.com>, Mykola Lysenko <mykolal@fb.com>,
      Shuah Khan <shuah@kernel.org>, Yan Zhai <yan@cloudflare.com>,
      linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
      linux-kselftest@vger.kernel.org, kernel-team@cloudflare.com,
      Jordan Griege <jgriege@cloudflare.com>, Markus Elfring <Markus.Elfring@web.de>,
      Jakub Sitnicki <jakub@cloudflare.com>
  Subject: [PATCH v4 bpf 1/2] bpf: fix skb_do_redirect return values
  Message-ID: <e5d05e56bf41de82f10d33229b8a8f6b49290e98.1690332693.git.yan@cloudflare.com>
  References: <cover.1690332693.git.yan@cloudflare.com>
  MIME-Version: 1.0
  Content-Type: text/plain; charset=us-ascii
  Content-Disposition: inline
  In-Reply-To: <cover.1690332693.git.yan@cloudflare.com>
  Precedence: bulk
Series	bpf: return proper error codes for lwt redirect
  [v4,bpf,0/2] bpf: return proper error codes for lwt redirect
  [v4,bpf,1/2] bpf: fix skb_do_redirect return values
  [v4,bpf,2/2] bpf: selftests: add lwt redirect regression test cases

Yan Zhai July 26, 2023, 1:08 a.m. UTC

skb_do_redirect returns a variety of values: error codes (negative),
0 (success), and some positive status codes, e.g. NET_XMIT_CN,
NET_RX_DROP. Commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel
infrastructure") didn't check the return code correctly, so positive
values are propagated back along the call chain:

  ip_finish_output2
    -> bpf_xmit
      -> run_lwt_bpf
        -> skb_do_redirect

Inside ip_finish_output2, the redirected skb will continue to the neighbor
subsystem as if LWTUNNEL_XMIT_CONTINUE had been returned, even though the
skb could have been freed. The bug can trigger a use-after-free warning
and crash the kernel afterwards:

https://gist.github.com/zhaiyan920/8fbac245b261fe316a7ef04c9b1eba48

Converting positive statuses from skb_do_redirect eliminates this issue.

Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Markus Elfring <Markus.Elfring@web.de>
Suggested-by: Stanislav Fomichev <sdf@google.com>
Reported-by: Jordan Griege <jgriege@cloudflare.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
---
 include/linux/netdevice.h | 2 ++
 net/core/filter.c         | 9 +++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)
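
To illustrate the collision described in the commit message, here is a minimal userspace sketch (not kernel code). The constants are copies of the values discussed in this thread. The helper names old_check_stops() and old_check_demo() are hypothetical and only model the "res < 0 || res == LWTUNNEL_XMIT_DONE" test that ip_finish_output2() applies to the lwtunnel_xmit() result.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace copies of the values discussed in this thread. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01	/* skb dropped */
#define NET_XMIT_CN		0x02	/* congestion notification */

enum {
	LWTUNNEL_XMIT_DONE,	/* 0: skb consumed, caller must stop */
	LWTUNNEL_XMIT_CONTINUE,	/* 1: skb still valid, caller continues */
};

/* Hypothetical helper modelling the pre-fix test in ip_finish_output2().
 * Returns true when the caller returns early and never touches the skb
 * again, false when it carries on to the neighbour output path. */
bool old_check_stops(int res)
{
	return res < 0 || res == LWTUNNEL_XMIT_DONE;
}

/* Hypothetical self-check: which statuses slip past the old test? */
void old_check_demo(void)
{
	assert(old_check_stops(-1));			/* errno: stops */
	assert(old_check_stops(LWTUNNEL_XMIT_DONE));	/* done: stops */
	assert(!old_check_stops(NET_XMIT_DROP));	/* 1: read as CONTINUE */
	assert(!old_check_stops(NET_XMIT_CN));		/* 2: falls through too */
}
```

In this model a positive status such as NET_XMIT_DROP is indistinguishable from LWTUNNEL_XMIT_CONTINUE, so the caller keeps using an skb that the redirect path may already have freed.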

Jakub Sitnicki July 26, 2023, 12:25 p.m. UTC | #1

On Tue, Jul 25, 2023 at 06:08 PM -07, Yan Zhai wrote:
> skb_do_redirect returns various of values: error code (negative),
> 0 (success), and some positive status code, e.g. NET_XMIT_CN,
> NET_RX_DROP. Commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel
> infrastructure") didn't check the return code correctly, so positive
> values are propagated back along call chain:
>
>   ip_finish_output2
>     -> bpf_xmit
>       -> run_lwt_bpf
>         -> skb_do_redirect
>
> Inside ip_finish_output2, redirected skb will continue to neighbor
> subsystem as if LWTUNNEL_XMIT_CONTINUE is returned, despite that this
> skb could have been freed. The bug can trigger use-after-free warning
> and crashes kernel afterwards:
>
> https://gist.github.com/zhaiyan920/8fbac245b261fe316a7ef04c9b1eba48
>
> Convert positive statuses from skb_do_redirect eliminates this issue.
>
> Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
> Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Suggested-by: Markus Elfring <Markus.Elfring@web.de>
> Suggested-by: Stanislav Fomichev <sdf@google.com>
> Reported-by: Jordan Griege <jgriege@cloudflare.com>
> Signed-off-by: Yan Zhai <yan@cloudflare.com>
> ---

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>

Dan Carpenter July 26, 2023, 1:39 p.m. UTC | #2

I'm not positive I understand the code in ip_finish_output2().  I think
instead of looking for LWTUNNEL_XMIT_DONE it should look for
!= LWTUNNEL_XMIT_CONTINUE.  It's unfortunate that NET_XMIT_DROP and
LWTUNNEL_XMIT_CONTINUE are both 0x1.  Why don't we just change that
instead?

Also there seems to be a leak in lwtunnel_xmit().  Should that return
LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning?

Something like the following?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 11652e464f5d..375790b672bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
 #define NET_XMIT_CN		0x02	/* congestion notification	*/
 #define NET_XMIT_MASK		0x0f	/* qdisc flags in net/sch_generic.h */
 
+#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS
+#define LWTUNNEL_XMIT_CONTINUE 0x3
+
 /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
  * indicates that the device will soon be dropping packets, or already drops
  * some packets of the same priority; prompting us to send less aggressively. */
diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 6f15e6fa154e..8ab032ee04d0 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -16,12 +16,6 @@
 #define LWTUNNEL_STATE_INPUT_REDIRECT	BIT(1)
 #define LWTUNNEL_STATE_XMIT_REDIRECT	BIT(2)
 
-enum {
-	LWTUNNEL_XMIT_DONE,
-	LWTUNNEL_XMIT_CONTINUE,
-};
-
-
 struct lwtunnel_state {
 	__u16		type;
 	__u16		flags;
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 711cd3b4347a..732415d1287d 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb)
 
 	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
 	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
-		return 0;
+		return LWTUNNEL_XMIT_CONTINUE;
 
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 6e70839257f7..4be50a211b14 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
 		int res = lwtunnel_xmit(skb);
 
-		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
+		if (res != LWTUNNEL_XMIT_CONTINUE)
 			return res;
 	}
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 1e8c90e97608..016b0a513259 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -113,7 +113,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
 		int res = lwtunnel_xmit(skb);
 
-		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
+		if (res != LWTUNNEL_XMIT_CONTINUE)
 			return res;
 	}
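
The effect of the proposed diff can be sketched in userspace as follows. This is only an illustration, assuming the values from the diff above (LWTUNNEL_XMIT_DONE equal to NET_XMIT_SUCCESS, LWTUNNEL_XMIT_CONTINUE equal to 0x3). The helper names proposed_check_stops() and proposed_check_demo() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace copies of the values discussed in this thread. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01
#define NET_XMIT_CN		0x02

/* Values taken from the proposed diff above. */
#define LWTUNNEL_XMIT_DONE	NET_XMIT_SUCCESS
#define LWTUNNEL_XMIT_CONTINUE	0x3

/* Hypothetical helper modelling the proposed test: anything other than
 * an explicit CONTINUE makes the caller return early. */
bool proposed_check_stops(int res)
{
	return res != LWTUNNEL_XMIT_CONTINUE;
}

/* Hypothetical self-check: only CONTINUE reaches the neighbour path. */
void proposed_check_demo(void)
{
	assert(proposed_check_stops(-1));			/* errno */
	assert(proposed_check_stops(LWTUNNEL_XMIT_DONE));	/* 0 */
	assert(proposed_check_stops(NET_XMIT_DROP));		/* 1 */
	assert(proposed_check_stops(NET_XMIT_CN));		/* 2 */
	assert(!proposed_check_stops(LWTUNNEL_XMIT_CONTINUE));	/* 3 */
}
```

The design choice here is to make "continue" the only opt-in value, so an unexpected status fails safe by stopping rather than by reusing the skb.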

Yan Zhai July 26, 2023, 2:14 p.m. UTC | #3

On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote:
> I'm not positive I understand the code in ip_finish_output2().  I think
> instead of looking for LWTUNNEL_XMIT_DONE it should instead look for
> != LWTUNNEL_XMIT_CONTINUE.  It's unfortunate that NET_XMIT_DROP and
> LWTUNNEL_XMIT_CONTINUE are the both 0x1.  Why don't we just change that
> instead?
> 
I considered changing the lwt side logic, but it would have a larger
impact since there are multiple types of encaps on this hook, not just
bpf redirect. Changing the bpf return values, on the other hand, is a
minimal change. In addition, the values of NET_RX_DROP and
NET_XMIT_CN are the same, so if we don't do something in bpf redirect,
there is no way to distinguish them later: the former is considered
an error, while "CN" is considered a non-error.

> Also there seems to be a leak in lwtunnel_xmit().  Should that return
> LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning?
> 
> Something like the following?
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 11652e464f5d..375790b672bc 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
>  #define NET_XMIT_CN		0x02	/* congestion notification	*/
>  #define NET_XMIT_MASK		0x0f	/* qdisc flags in net/sch_generic.h */
>  
> +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS
> +#define LWTUNNEL_XMIT_CONTINUE 0x3
> +
>  /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
>   * indicates that the device will soon be dropping packets, or already drops
>   * some packets of the same priority; prompting us to send less aggressively. */
> diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
> index 6f15e6fa154e..8ab032ee04d0 100644
> --- a/include/net/lwtunnel.h
> +++ b/include/net/lwtunnel.h
> @@ -16,12 +16,6 @@
>  #define LWTUNNEL_STATE_INPUT_REDIRECT	BIT(1)
>  #define LWTUNNEL_STATE_XMIT_REDIRECT	BIT(2)
>  
> -enum {
> -	LWTUNNEL_XMIT_DONE,
> -	LWTUNNEL_XMIT_CONTINUE,
> -};
> -
> -
>  struct lwtunnel_state {
>  	__u16		type;
>  	__u16		flags;
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index 711cd3b4347a..732415d1287d 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb)
>  
>  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
>  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> -		return 0;
> +		return LWTUNNEL_XMIT_CONTINUE;

You are correct, this path would leak the skb. Returning continue (or drop)
would avoid the leak. Personally I'd prefer drop, to signal the
erroneous setup. Since this is a separate issue, do you want to send a
separate patch for it? Or I am happy to do it if you prefer.

>  
>  	ret = -EOPNOTSUPP;
>  	rcu_read_lock();
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 6e70839257f7..4be50a211b14 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
>  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
>  		int res = lwtunnel_xmit(skb);
>  
> -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> +		if (res != LWTUNNEL_XMIT_CONTINUE)
>  			return res;

Unfortunately we cannot return res directly here when res > 0. This is
the final reason why I didn't patch here. Return values here can be
propagated back to sendmsg syscall, so returning a positive value
would break the syscall convention.


best,
Yan

>  	}
>  
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 1e8c90e97608..016b0a513259 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -113,7 +113,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
>  		int res = lwtunnel_xmit(skb);
>  
> -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> +		if (res != LWTUNNEL_XMIT_CONTINUE)
>  			return res;
>  	}
>

Dan Carpenter July 26, 2023, 3:01 p.m. UTC | #4

On Wed, Jul 26, 2023 at 07:14:56AM -0700, Yan Zhai wrote:
> On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote:
> > I'm not positive I understand the code in ip_finish_output2().  I think
> > instead of looking for LWTUNNEL_XMIT_DONE it should instead look for
> > != LWTUNNEL_XMIT_CONTINUE.  It's unfortunate that NET_XMIT_DROP and
> > LWTUNNEL_XMIT_CONTINUE are the both 0x1.  Why don't we just change that
> > instead?
> > 
> I considered about changing lwt side logic. But it would bring larger
> impact since there are multiple types of encaps on this hook, not just
> bpf redirect. Changing bpf return values is a minimum change on the
> other hand. In addition, returning value of NET_RX_DROP and
> NET_XMIT_CN are the same, so if we don't do something in bpf redirect,
> there is no way to distinguish them later: the former is considered as
> an error, while "CN" is considered as non-error.

Uh, NET_RX/XMIT_DROP values are 1.  NET_XMIT_CN is 2.

I'm not an expert but I think what happens is that we treat NET_XMIT_CN
as success so that it takes a while for the resend to happen.
Eventually the TCP layer will detect it as a dropped packet.

> 
> > Also there seems to be a leak in lwtunnel_xmit().  Should that return
> > LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning?
> > 
> > Something like the following?
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 11652e464f5d..375790b672bc 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
> >  #define NET_XMIT_CN		0x02	/* congestion notification	*/
> >  #define NET_XMIT_MASK		0x0f	/* qdisc flags in net/sch_generic.h */
> >  
> > +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS
> > +#define LWTUNNEL_XMIT_CONTINUE 0x3
> > +
> >  /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
> >   * indicates that the device will soon be dropping packets, or already drops
> >   * some packets of the same priority; prompting us to send less aggressively. */
> > diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
> > index 6f15e6fa154e..8ab032ee04d0 100644
> > --- a/include/net/lwtunnel.h
> > +++ b/include/net/lwtunnel.h
> > @@ -16,12 +16,6 @@
> >  #define LWTUNNEL_STATE_INPUT_REDIRECT	BIT(1)
> >  #define LWTUNNEL_STATE_XMIT_REDIRECT	BIT(2)
> >  
> > -enum {
> > -	LWTUNNEL_XMIT_DONE,
> > -	LWTUNNEL_XMIT_CONTINUE,
> > -};
> > -
> > -
> >  struct lwtunnel_state {
> >  	__u16		type;
> >  	__u16		flags;
> > diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> > index 711cd3b4347a..732415d1287d 100644
> > --- a/net/core/lwtunnel.c
> > +++ b/net/core/lwtunnel.c
> > @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb)
> >  
> >  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> >  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> > -		return 0;
> > +		return LWTUNNEL_XMIT_CONTINUE;
> 
> You are correct this path would leak skb. Return continue (or drop)
> would avoid the leak. Personally I'd prefer drop instead to signal the
> error setup. Since this is a separate issue, do you want to send a
> separate patch on this? Or I am happy to do it if you prefer.
> 

I don't know which makes sense so I'll leave that up to you.

> >  
> >  	ret = -EOPNOTSUPP;
> >  	rcu_read_lock();
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 6e70839257f7..4be50a211b14 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
> >  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
> >  		int res = lwtunnel_xmit(skb);
> >  
> > -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> > +		if (res != LWTUNNEL_XMIT_CONTINUE)
> >  			return res;
> 
> Unfortunately we cannot return res directly here when res > 0. This is
> the final reason why I didn't patch here. Return values here can be
> propagated back to sendmsg syscall, so returning a positive value
> would break the syscall convention.

The neigh_output() function is going to return NET_XMIT_DROP so this
already happens.  Is that not what we want to happen?

I guess my concern is that people will eventually introduce new
bugs.  Fixing incorrect error codes is something that I do
several times per week.  :P

regards,
dan carpenter
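
The point about NET_XMIT_CN being treated as success can be sketched in userspace. The two functions below are modelled on the kernel's net_xmit_eval() and net_xmit_errno() macros as I understand them. The names xmit_eval_model() and xmit_errno_model() are hypothetical and this is not the kernel source.

```c
#include <assert.h>
#include <errno.h>

/* Userspace copies of the values discussed in this thread. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01
#define NET_XMIT_CN		0x02

/* Hypothetical model: congestion notification is folded into success,
 * every other status is passed through unchanged. */
int xmit_eval_model(int e)
{
	return e == NET_XMIT_CN ? 0 : e;
}

/* Hypothetical model: map a non-zero xmit status to an errno.
 * Only meaningful for e != 0; CN maps to 0, the rest to -ENOBUFS. */
int xmit_errno_model(int e)
{
	return e != NET_XMIT_CN ? -ENOBUFS : 0;
}

/* Hypothetical self-check of the two mappings. */
void xmit_status_demo(void)
{
	assert(xmit_eval_model(NET_XMIT_CN) == 0);
	assert(xmit_eval_model(NET_XMIT_DROP) == NET_XMIT_DROP);
	assert(xmit_errno_model(NET_XMIT_CN) == 0);
	assert(xmit_errno_model(NET_XMIT_DROP) == -ENOBUFS);
}
```

A caller that wants an errno-style result would apply the second mapping only after checking that the status is non-zero.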

Yan Zhai July 26, 2023, 4:10 p.m. UTC | #5

On Wed, Jul 26, 2023 at 06:01:00PM +0300, Dan Carpenter wrote:
> On Wed, Jul 26, 2023 at 07:14:56AM -0700, Yan Zhai wrote:
> > On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote:
> > > I'm not positive I understand the code in ip_finish_output2().  I think
> > > instead of looking for LWTUNNEL_XMIT_DONE it should instead look for
> > > != LWTUNNEL_XMIT_CONTINUE.  It's unfortunate that NET_XMIT_DROP and
> > > LWTUNNEL_XMIT_CONTINUE are the both 0x1.  Why don't we just change that
> > > instead?
> > > 
> > I considered about changing lwt side logic. But it would bring larger
> > impact since there are multiple types of encaps on this hook, not just
> > bpf redirect. Changing bpf return values is a minimum change on the
> > other hand. In addition, returning value of NET_RX_DROP and
> > NET_XMIT_CN are the same, so if we don't do something in bpf redirect,
> > there is no way to distinguish them later: the former is considered as
> > an error, while "CN" is considered as non-error.
> 
> Uh, NET_RX/XMIT_DROP values are 1.  NET_XMIT_CN is 2.
> 
> I'm not an expert but I think what happens is that we treat NET_XMIT_CN
> as success so that it takes a while for the resend to happen.
> Eventually the TCP layer will detect it as a dropped packet.
> 
My eyes slipped a line. CN is 2. But the fact that an RX return value can be
returned on a TX path still makes me feel unclean. The odds are low that
we will have new statuses in the future, but it is a risk. I'd hope to contain
these values inside the BPF redirect code, as it is the reason why
such RX values can show up there. Meanwhile, your argument does make
good sense to me: the same problem may occur for other stuff. It
is true. In fact, I just re-examined the BPF-REROUTE path, and it has the
exact same issue by directly sending the dst_output value back.

So I would propose to do two things:
1. Still convert the BPF redirect ingress code to contain the propagation
of mixed return values. Return only TX side values instead, which is also
what the majority of those local senders are expecting. (I was wrong about
positive values being returned to sendmsg below btw, they are not.)

2. Change LWTUNNEL_XMIT_CONTINUE and check for it at the xmit hook.

> > 
> > > Also there seems to be a leak in lwtunnel_xmit().  Should that return
> > > LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning?
> > > 
> > > Something like the following?
> > > 
> > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > index 11652e464f5d..375790b672bc 100644
> > > --- a/include/linux/netdevice.h
> > > +++ b/include/linux/netdevice.h
> > > @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
> > >  #define NET_XMIT_CN		0x02	/* congestion notification	*/
> > >  #define NET_XMIT_MASK		0x0f	/* qdisc flags in net/sch_generic.h */
> > >  
> > > +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS
> > > +#define LWTUNNEL_XMIT_CONTINUE 0x3
> > > +
> > >  /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
> > >   * indicates that the device will soon be dropping packets, or already drops
> > >   * some packets of the same priority; prompting us to send less aggressively. */
> > > diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
> > > index 6f15e6fa154e..8ab032ee04d0 100644
> > > --- a/include/net/lwtunnel.h
> > > +++ b/include/net/lwtunnel.h
> > > @@ -16,12 +16,6 @@
> > >  #define LWTUNNEL_STATE_INPUT_REDIRECT	BIT(1)
> > >  #define LWTUNNEL_STATE_XMIT_REDIRECT	BIT(2)
> > >  
> > > -enum {
> > > -	LWTUNNEL_XMIT_DONE,
> > > -	LWTUNNEL_XMIT_CONTINUE,
> > > -};
> > > -
> > > -
> > >  struct lwtunnel_state {
> > >  	__u16		type;
> > >  	__u16		flags;
> > > diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> > > index 711cd3b4347a..732415d1287d 100644
> > > --- a/net/core/lwtunnel.c
> > > +++ b/net/core/lwtunnel.c
> > > @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb)
> > >  
> > >  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> > >  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> > > -		return 0;
> > > +		return LWTUNNEL_XMIT_CONTINUE;
> > 
> > You are correct this path would leak skb. Return continue (or drop)
> > would avoid the leak. Personally I'd prefer drop instead to signal the
> > error setup. Since this is a separate issue, do you want to send a
> > separate patch on this? Or I am happy to do it if you prefer.
> > 
> 
> I don't know which makes sense so I'll leave that up to you.
> 
This conversation is juicy, I think we discovered two potential new
problem sites (the leak here and the reroute path) :)

> > >  
> > >  	ret = -EOPNOTSUPP;
> > >  	rcu_read_lock();
> > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > > index 6e70839257f7..4be50a211b14 100644
> > > --- a/net/ipv4/ip_output.c
> > > +++ b/net/ipv4/ip_output.c
> > > @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
> > >  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
> > >  		int res = lwtunnel_xmit(skb);
> > >  
> > > -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> > > +		if (res != LWTUNNEL_XMIT_CONTINUE)
> > >  			return res;
> > 
> > Unfortunately we cannot return res directly here when res > 0. This is
> > the final reason why I didn't patch here. Return values here can be
> > propagated back to sendmsg syscall, so returning a positive value
> > would break the syscall convention.
> 
> The neigh_output() function is going to return NET_XMIT_DROP so this
> already happens.  Is that not what we want to happen?
> 
My bad, those return values are processed at ip_send_skb etc, while I
was staring only at ip_local_out and beneath with my sleepy eyes.

> I guess my concern is that eventually people will eventually new
> introduce bugs.  Fixing incorrect error codes is something that I do
> several times per week.  :P
> 
> regards,
> dan carpenter
> 
>

Dan Carpenter July 26, 2023, 4:53 p.m. UTC | #6

On Wed, Jul 26, 2023 at 09:10:20AM -0700, Yan Zhai wrote:
> On Wed, Jul 26, 2023 at 06:01:00PM +0300, Dan Carpenter wrote:
> > On Wed, Jul 26, 2023 at 07:14:56AM -0700, Yan Zhai wrote:
> > > On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote:
> > > > I'm not positive I understand the code in ip_finish_output2().  I think
> > > > instead of looking for LWTUNNEL_XMIT_DONE it should instead look for
> > > > != LWTUNNEL_XMIT_CONTINUE.  It's unfortunate that NET_XMIT_DROP and
> > > > LWTUNNEL_XMIT_CONTINUE are the both 0x1.  Why don't we just change that
> > > > instead?
> > > > 
> > > I considered about changing lwt side logic. But it would bring larger
> > > impact since there are multiple types of encaps on this hook, not just
> > > bpf redirect. Changing bpf return values is a minimum change on the
> > > other hand. In addition, returning value of NET_RX_DROP and
> > > NET_XMIT_CN are the same, so if we don't do something in bpf redirect,
> > > there is no way to distinguish them later: the former is considered as
> > > an error, while "CN" is considered as non-error.
> > 
> > Uh, NET_RX/XMIT_DROP values are 1.  NET_XMIT_CN is 2.
> > 
> > I'm not an expert but I think what happens is that we treat NET_XMIT_CN
> > as success so that it takes a while for the resend to happen.
> > Eventually the TCP layer will detect it as a dropped packet.
> > 
> My eyes slipped lines. CN is 2. But the fact RX return value can be
> returned on a TX path still makes me feel unclean. Odds are low that
> we will have new statuses in future, it is a risk. I'd hope to contain
> these values only inside BPF redirect code as they are the reason why
> such rx values can show up there. Meanwhile, your argument do make
> good sense to me that the same problem may occur for other stuff. It
> is true. In fact, I just re-examined BPF-REROUTE path, it has the
> exact same issue by directly sending dst_output value back.
> 
> So I would propose to do two things:
> 1. still convert BPF redirect ingress code to contain the propagation
> of mixed return. Return only TX side value instead, which is also what
> majority of those local senders are expecting. (I was wrong about
> positive values returned to sendmsg below btw, they are not).
> 
> 2. change LWTUNNEL_XMIT_CONTINUE and check for this at xmit hook.
> 

Sounds good!

regards,
dan carpenter

Martin KaFai Lau July 28, 2023, 10:02 p.m. UTC | #7

On 7/25/23 6:08 PM, Yan Zhai wrote:
> skb_do_redirect returns various of values: error code (negative),
> 0 (success), and some positive status code, e.g. NET_XMIT_CN,
> NET_RX_DROP. Commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel
> infrastructure") didn't check the return code correctly, so positive
> values are propagated back along call chain:
> 
>    ip_finish_output2
>      -> bpf_xmit
>        -> run_lwt_bpf
>          -> skb_do_redirect

From looking at skb_do_redirect, it should have consumed the
skb except for the -EAGAIN return value. afaik, -EAGAIN could only happen by
using the bpf_redirect_peer helper. lwt does not have the bpf_redirect_peer
helper available, so there is no -EAGAIN case in lwt. iow, skb_do_redirect
should have always consumed the skb in lwt. or did I miss something?

If that is the case, it feels like the fix should be in run_lwt_bpf() and the 
"if (ret == 0)" test in run_lwt_bpf() is unnecessary?

			ret = skb_do_redirect(skb);
			if (ret == 0)
				ret = BPF_REDIRECT;

> 
> Inside ip_finish_output2, redirected skb will continue to neighbor
> subsystem as if LWTUNNEL_XMIT_CONTINUE is returned, despite that this
> skb could have been freed. The bug can trigger use-after-free warning
> and crashes kernel afterwards:
> 
> https://gist.github.com/zhaiyan920/8fbac245b261fe316a7ef04c9b1eba48

Dan Carpenter July 31, 2023, 2:26 p.m. UTC | #8

I'm not a networking person, but I was looking at some use after free
static checker warnings.

Apparently the rule with xmit functions is that if they return a value
> 15 then that means the skb was not freed.  Otherwise it's supposed to
be freed.  So like NETDEV_TX_BUSY is 0x10 so it's not freed.

This is checked using the dev_xmit_complete() function.  So I feel
like it would make sense for LWTUNNEL_XMIT_CONTINUE to be higher
than 15.

Because that's the bug right?  The original code was assuming that
everything besides LWTUNNEL_XMIT_DONE was freed.

regards,
dan carpenter
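
The rule described above can be sketched in userspace. xmit_complete_model() is a hypothetical name modelled on dev_xmit_complete() as I recall it (a status below NET_XMIT_MASK means the skb was consumed). EXAMPLE_XMIT_CONTINUE is an assumed value chosen only for illustration, not something stated in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace copies of the values discussed in this thread. */
#define NET_XMIT_SUCCESS	0x00
#define NET_XMIT_DROP		0x01
#define NET_XMIT_CN		0x02
#define NET_XMIT_MASK		0x0f
#define NETDEV_TX_BUSY		0x10

/* Assumption for illustration only: some "continue" value above the mask. */
#define EXAMPLE_XMIT_CONTINUE	0x100

/* Hypothetical model: true means the skb was consumed (sent, dropped or
 * failed with an errno); false means the caller still owns the skb. */
bool xmit_complete_model(int rc)
{
	return rc < NET_XMIT_MASK;
}

/* Hypothetical self-check of the ownership rule. */
void xmit_complete_demo(void)
{
	assert(xmit_complete_model(NET_XMIT_SUCCESS));
	assert(xmit_complete_model(NET_XMIT_DROP));
	assert(xmit_complete_model(NET_XMIT_CN));
	assert(xmit_complete_model(-ENOBUFS));
	assert(!xmit_complete_model(NETDEV_TX_BUSY));
	assert(!xmit_complete_model(EXAMPLE_XMIT_CONTINUE));
}
```

With a "continue" value above the mask, the same ownership test that already covers NETDEV_TX_BUSY would also classify the lwtunnel case as "skb not freed".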

Yan Zhai July 31, 2023, 9:35 p.m. UTC | #9

On Fri, Jul 28, 2023 at 5:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/25/23 6:08 PM, Yan Zhai wrote:
> > skb_do_redirect returns various of values: error code (negative),
> > 0 (success), and some positive status code, e.g. NET_XMIT_CN,
> > NET_RX_DROP. Commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel
> > infrastructure") didn't check the return code correctly, so positive
> > values are propagated back along call chain:
> >
> >    ip_finish_output2
> >      -> bpf_xmit
> >        -> run_lwt_bpf
> >          -> skb_do_redirect
>
>  From looking at skb_do_redirect, the skb_do_redirect should have consumed the
> skb except for the -EAGAIN return value. afaik, -EAGAIN could only happen by
> using the bpf_redirect_peer helper. lwt does not have the bpf_redirect_peer
> helper available, so there is no -EAGAIN case in lwt. iow, skb_do_redirect
> should have always consumed the skb in lwt. or did I miss something?
>
> If that is the case, it feels like the fix should be in run_lwt_bpf() and the
> "if (ret == 0)" test in run_lwt_bpf() is unnecessary?
>
>                         ret = skb_do_redirect(skb);
>                         if (ret == 0)
>                                 ret = BPF_REDIRECT;
>
>
Just fixing skb redirect return code won't be sufficient. I realized
there are other return paths that need to be treated, e.g. bpf reroute
path also directly returns dev_queue_xmit status. I plan to check for
LWTUNNEL_XMIT_CONTINUE (and change it to a value that does not
conflict with NET_RX_DROP and NET_XMIT_DROP) in the next revision. On
the other hand, the return value of NETDEV_TX_BUSY is another hassle.
As Dan suggested, packets might not have been freed when this is
returned from drivers. The caller of dev_queue_xmit might need to free
skb when this happens.

Yan

Martin KaFai Lau July 31, 2023, 10:11 p.m. UTC | #10

On 7/31/23 2:35 PM, Yan Zhai wrote:
> On Fri, Jul 28, 2023 at 5:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 7/25/23 6:08 PM, Yan Zhai wrote:
>>> skb_do_redirect returns various of values: error code (negative),
>>> 0 (success), and some positive status code, e.g. NET_XMIT_CN,
>>> NET_RX_DROP. Commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel
>>> infrastructure") didn't check the return code correctly, so positive
>>> values are propagated back along call chain:
>>>
>>>     ip_finish_output2
>>>       -> bpf_xmit
>>>         -> run_lwt_bpf
>>>           -> skb_do_redirect
>>
>>   From looking at skb_do_redirect, the skb_do_redirect should have consumed the
>> skb except for the -EAGAIN return value. afaik, -EAGAIN could only happen by
>> using the bpf_redirect_peer helper. lwt does not have the bpf_redirect_peer
>> helper available, so there is no -EAGAIN case in lwt. iow, skb_do_redirect
>> should have always consumed the skb in lwt. or did I miss something?
>>
>> If that is the case, it feels like the fix should be in run_lwt_bpf() and the
>> "if (ret == 0)" test in run_lwt_bpf() is unnecessary?
>>
>>                          ret = skb_do_redirect(skb);
>>                          if (ret == 0)
>>                                  ret = BPF_REDIRECT;
>>
>>
> Just fixing skb redirect return code won't be sufficient. I realized
> there are other return paths that need to be treated, e.g. bpf reroute
> path also directly returns dev_queue_xmit status. I plan to check for
> LWTUNNEL_XMIT_CONTINUE (and change it to a value that does not
> conflict with NET_RX_DROP and NET_XMIT_DROP) in the next revision. On
> the other hand, the return value of NETDEV_TX_BUSY is another hassle.

I suspect we are talking about different things or I am still missing something.

I was thinking skb_do_redirect() should have always consumed the skb and
bpf_xmit should also always return LWTUNNEL_XMIT_DONE (instead of the
LWTUNNEL_XMIT_CONTINUE described in this patch's commit message). It is what
sch_handle_egress() is doing as well. Could you explain how it is different
from the skb_do_redirect usage in sch_handle_egress(), or are you suggesting
the current sch_handle_egress() has the issue too?


> As Dan suggested, packets might not have been freed when this is
> returned from drivers. The caller of dev_queue_xmit might need to free
> skb when this happens.
> 
> Yan

Yan Zhai July 31, 2023, 11:01 p.m. UTC | #11

On Mon, Jul 31, 2023 at 5:11 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 7/31/23 2:35 PM, Yan Zhai wrote:
> > On Fri, Jul 28, 2023 at 5:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 7/25/23 6:08 PM, Yan Zhai wrote:
> >>> skb_do_redirect returns various of values: error code (negative),
> >>> 0 (success), and some positive status code, e.g. NET_XMIT_CN,
> >>> NET_RX_DROP. Commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel
> >>> infrastructure") didn't check the return code correctly, so positive
> >>> values are propagated back along call chain:
> >>>
> >>>     ip_finish_output2
> >>>       -> bpf_xmit
> >>>         -> run_lwt_bpf
> >>>           -> skb_do_redirect
> >>
> >>   From looking at skb_do_redirect, the skb_do_redirect should have consumed the
> >> skb except for the -EAGAIN return value. afaik, -EAGAIN could only happen by
> >> using the bpf_redirect_peer helper. lwt does not have the bpf_redirect_peer
> >> helper available, so there is no -EAGAIN case in lwt. iow, skb_do_redirect
> >> should have always consumed the skb in lwt. or did I miss something?
> >>
> >> If that is the case, it feels like the fix should be in run_lwt_bpf() and the
> >> "if (ret == 0)" test in run_lwt_bpf() is unnecessary?
> >>
> >>                          ret = skb_do_redirect(skb);
> >>                          if (ret == 0)
> >>                                  ret = BPF_REDIRECT;
> >>
> >>
> > Just fixing skb redirect return code won't be sufficient. I realized
> > there are other return paths that need to be treated, e.g. bpf reroute
> > path also directly returns dev_queue_xmit status. I plan to check for
> > LWTUNNEL_XMIT_CONTINUE (and change it to a value that does not
> > conflict with NET_RX_DROP and NET_XMIT_DROP) in the next revision. On
> > the other hand, the return value of NETDEV_TX_BUSY is another hassle.
>
> I suspect we are talking about different things or I am still missing something.
>
> I was thinking skb_do_redirect() should have always consumed the skb and
> bpf_xmit should always return LWTUNNEL_XMIT_DONE also (instead of
> LWTUNNEL_XMIT_CONTINUE described in the this patch commit message). It is what
> sch_handle_egress() is doing also. Could you explain how is it different from
> the skb_do_redirect usage in sch_handle_egress() or you are suggesting the
> current sch_handle_egress() has the issue too also?
>
I think we were not on the same page. You are absolutely right that
skb_do_redirect should consume the packet anyway. The difference
between your proposal and this patch is that this patch returns errno
or LWTUNNEL_XMIT_DONE, and yours does not even return errno. Both
approaches fix the issue of "redirect to down device crashes the
kernel".

What I commented on was the exact same issue at a different location: BPF
reroute may trigger the crash as well, since it also returns the
dev_queue_xmit status in bpf_xmit. We need to fix this, or instead fix the
LWTUNNEL_XMIT_CONTINUE value and correct the behavior at lwtunnel_xmit
rather than bpf_xmit.

Yan

>
> > As Dan suggested, packets might not have been freed when this is
> > returned from drivers. The caller of dev_queue_xmit might need to free
> > skb when this happens.
> >
> > Yan
>

Martin KaFai Lau July 31, 2023, 11:52 p.m. UTC | #12

On 7/31/23 4:01 PM, Yan Zhai wrote:
> What I commented was an exact same issue at different location: BPF
> reroute may trigger the crash as well, since it also returns
> dev_queue_xmit status in bpf_xmit. Need to fix this, or instead fixing
> LWTUNNEL_XMIT_CONTINUE value and correct the behavior at lwtunnel_xmit
> rather than bpf_xmit.

Ah. I think I got it. You meant the bpf_lwt_xmit_reroute() / BPF_LWT_REROUTE 
case? It would be clearer if some of these names were quoted instead. "reroute" 
could mean many things.

Please put a detailed comment in v5. Thanks.

> 
> Yan
> 
>>
>>> As Dan suggested, packets might not have been freed when this is
>>> returned from drivers. The caller of dev_queue_xmit might need to free
>>> skb when this happens.
>>>
>>> Yan
>>

Yan Zhai Aug. 1, 2023, 10:18 p.m. UTC | #13

On Mon, Jul 31, 2023 at 9:26 AM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> I'm not a networking person, but I was looking at some use after free
> static checker warnings.

Were you referring to the gist I posted, or to something new?

>
> Apparently the rule with xmit functions is that if they return a value
> > 15 then that means the skb was not freed.  Otherwise it's supposed to
> be freed.  So like NETDEV_TX_BUSY is 0x10 so it's not freed.
>
> This is checked with using the dev_xmit_complete() function.  So I feel
> like it would make sense for LWTUNNEL_XMIT_CONTINUE to return higher
> than 15.

Yes, I am adopting your suggestion in v5. Dealing with NETDEV_TX_BUSY
will be left as another item (potentially more suited for netdev
than bpf). It would be great to find a reproduction of the memleak.

>
> Because that's the bug right?  The original code was assuming that
> everything besides LWTUNNEL_XMIT_DONE was freed.
>
> regards,
> dan carpenter
>
