
[v1,net,1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.

Message ID 20250323231016.74813-2-kuniyu@amazon.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX.

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 6 this patch: 6
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0

Commit Message

Kuniyuki Iwashima March 23, 2025, 11:09 p.m. UTC
__udp_enqueue_schedule_skb() has the following condition:

  if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
          goto drop;

sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
or SO_RCVBUFFORCE.

If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
as sk->sk_rmem_alloc is also signed int.

Then, the size of the incoming skb is added to sk->sk_rmem_alloc
unconditionally.

This results in integer overflow (possibly multiple times) on
sk->sk_rmem_alloc and allows a single socket to have skb up to
net.core.udp_mem[1].
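
Roughly, the relevant part of __udp_enqueue_schedule_skb() looks like
this (simplified sketch, intermediate steps omitted):

  rmem = atomic_read(&sk->sk_rmem_alloc);
  rcvbuf = READ_ONCE(sk->sk_rcvbuf);
  if (rmem > rcvbuf)          /* signed compare: never true if rcvbuf == INT_MAX */
          goto drop;
  ...
  atomic_add(size, &sk->sk_rmem_alloc);   /* charged even past INT_MAX */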

For example, if we set a large value to udp_mem[1] and INT_MAX to
sk->sk_rcvbuf and flood packets to the socket, we can see multiple
overflows:

  # cat /proc/net/sockstat | grep UDP:
  UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
                                             ^- PAGE_SHIFT
  # ss -uam
  State  Recv-Q      ...
  UNCONN -1757018048 ...    <-- flipping the sign repeatedly
         skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)

Previously, we had a boundary check for INT_MAX, which was removed by
commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").

A complete fix would be to revert it and cap the right operand by
INT_MAX:

  rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
  if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
          goto uncharge_drop;

but we do not want to add the expensive atomic_add_return() back just
for the corner case.

So, let's perform the first check as unsigned int to detect the
integer overflow.
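
With a wrapped (negative) sk->sk_rmem_alloc, the unsigned comparison
sees a value above INT_MAX and starts dropping; a minimal illustration
using the value from the example above (not part of the patch itself):

  int rmem = -1757018048;            /* sk_rmem_alloc after wrapping */
  unsigned int rcvbuf = 2147483646;  /* rb2147483646 reported by ss */

  if ((unsigned int)rmem > rcvbuf)   /* 2537949248 > 2147483646 -> drop */
          goto drop;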

Note that we still allow a single wraparound, which can be observed
from userspace, but it's acceptable considering it's unlikely that
no recv() is called for a long period, and the negative value will
soon flip back to positive after a few recv() calls.

  # cat /proc/net/sockstat | grep UDP:
  UDP: inuse 3 mem 524288  <-- (INT_MAX + 1) >> 12

  # ss -uam
  State  Recv-Q      ...
  UNCONN -2147482816 ...   <-- INT_MAX + 831 bytes
         skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)

Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet March 24, 2025, 10:01 a.m. UTC | #1
On Mon, Mar 24, 2025 at 12:11 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> __udp_enqueue_schedule_skb() has the following condition:
>
>   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
>           goto drop;
>
> sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> or SO_RCVBUFFORCE.
>
> If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> as sk->sk_rmem_alloc is also signed int.
>
> Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> unconditionally.
>
> This results in integer overflow (possibly multiple times) on
> sk->sk_rmem_alloc and allows a single socket to have skb up to
> net.core.udp_mem[1].
>
> For example, if we set a large value to udp_mem[1] and INT_MAX to
> sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> overflows:
>
>   # cat /proc/net/sockstat | grep UDP:
>   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
>                                              ^- PAGE_SHIFT
>   # ss -uam
>   State  Recv-Q      ...
>   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
>          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
>
> Previously, we had a boundary check for INT_MAX, which was removed by
> commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
>
> A complete fix would be to revert it and cap the right operand by
> INT_MAX:
>
>   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
>   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
>           goto uncharge_drop;
>
> but we do not want to add the expensive atomic_add_return() back just
> for the corner case.
>
> So, let's perform the first check as unsigned int to detect the
> integer overflow.
>
> Note that we still allow a single wraparound, which can be observed
> from userspace, but it's acceptable considering it's unlikely that
> no recv() is called for a long period, and the negative value will
> soon flip back to positive after a few recv() calls.
>
>   # cat /proc/net/sockstat | grep UDP:
>   UDP: inuse 3 mem 524288  <-- (INT_MAX + 1) >> 12
>
>   # ss -uam
>   State  Recv-Q      ...
>   UNCONN -2147482816 ...   <-- INT_MAX + 831 bytes
>          skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
>
> Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/udp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a9bb9ce5438e..a1e60aab29b5 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>          */
>         rmem = atomic_read(&sk->sk_rmem_alloc);
>         rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> -       if (rmem > rcvbuf)
> +       if ((unsigned int)rmem > rcvbuf)

SGTM, but maybe make rmem and rcvbuf  'unsigned int ' to avoid casts ?

BTW piling 2GB worth of skbs in a single UDP receive queue means a
latency spike when
__skb_queue_purge(&sk->sk_receive_queue) is called, say from
inet_sock_destruct(),
which is a problem on its own.


diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index db606f7e4163809d8220be1c1a4adb5662fc914e..575baac391e8af911fc1eff3f2d8e64bb9aa4c70
100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1725,9 +1725,9 @@ static int udp_rmem_schedule(struct sock *sk, int size)
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
        struct sk_buff_head *list = &sk->sk_receive_queue;
-       int rmem, err = -ENOMEM;
+       unsigned int rmem, rcvbuf;
+       int size, err = -ENOMEM;
        spinlock_t *busy = NULL;
-       int size, rcvbuf;

        /* Immediately drop when the receive queue is full.
         * Always allow at least one packet.
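
With both declared unsigned int, the check could presumably stay
textually the same and still catch a wrapped counter, roughly (sketch
only, not an actual patch):

        rmem = atomic_read(&sk->sk_rmem_alloc);
        rcvbuf = READ_ONCE(sk->sk_rcvbuf);
        if (rmem > rcvbuf)      /* unsigned compare: a wrapped counter is > INT_MAX */
                goto drop;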
Willem de Bruijn March 24, 2025, 2:59 p.m. UTC | #2
Kuniyuki Iwashima wrote:
> __udp_enqueue_schedule_skb() has the following condition:
> 
>   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
>           goto drop;
> 
> sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> or SO_RCVBUFFORCE.
> 
> If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> as sk->sk_rmem_alloc is also signed int.
> 
> Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> unconditionally.
> 
> This results in integer overflow (possibly multiple times) on
> sk->sk_rmem_alloc and allows a single socket to have skb up to
> net.core.udp_mem[1].
> 
> For example, if we set a large value to udp_mem[1] and INT_MAX to
> sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> overflows:
> 
>   # cat /proc/net/sockstat | grep UDP:
>   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
>                                              ^- PAGE_SHIFT
>   # ss -uam
>   State  Recv-Q      ...
>   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
>          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> 
> Previously, we had a boundary check for INT_MAX, which was removed by
> commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> 
> A complete fix would be to revert it and cap the right operand by
> INT_MAX:
> 
>   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
>   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
>           goto uncharge_drop;
> 
> but we do not want to add the expensive atomic_add_return() back just
> for the corner case.
> 
> So, let's perform the first check as unsigned int to detect the
> integer overflow.
> 
> Note that we still allow a single wraparound, which can be observed
> from userspace, but it's acceptable considering it's unlikely that
> no recv() is called for a long period, and the negative value will
> soon flip back to positive after a few recv() calls.

Can we do better than this?
Is this because of the "Always allow at least one packet" below, and
due to testing the value of the counter without skb->truesize added?

        /* Immediately drop when the receive queue is full.
         * Always allow at least one packet.
         */
        rmem = atomic_read(&sk->sk_rmem_alloc);
        rcvbuf = READ_ONCE(sk->sk_rcvbuf);
        if (rmem > rcvbuf)
                goto drop;

> 
>   # cat /proc/net/sockstat | grep UDP:
>   UDP: inuse 3 mem 524288  <-- (INT_MAX + 1) >> 12
> 
>   # ss -uam
>   State  Recv-Q      ...
>   UNCONN -2147482816 ...   <-- INT_MAX + 831 bytes
>          skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
> 
> Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/udp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a9bb9ce5438e..a1e60aab29b5 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	 */
>  	rmem = atomic_read(&sk->sk_rmem_alloc);
>  	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> -	if (rmem > rcvbuf)
> +	if ((unsigned int)rmem > rcvbuf)
>  		goto drop;
>  
>  	/* Under mem pressure, it might be helpful to help udp_recvmsg()
> -- 
> 2.48.1
>
Kuniyuki Iwashima March 24, 2025, 6 p.m. UTC | #3
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 24 Mar 2025 11:01:15 +0100
> On Mon, Mar 24, 2025 at 12:11 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > __udp_enqueue_schedule_skb() has the following condition:
> >
> >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> >           goto drop;
> >
> > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > or SO_RCVBUFFORCE.
> >
> > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > as sk->sk_rmem_alloc is also signed int.
> >
> > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > unconditionally.
> >
> > This results in integer overflow (possibly multiple times) on
> > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > net.core.udp_mem[1].
> >
> > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > overflows:
> >
> >   # cat /proc/net/sockstat | grep UDP:
> >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> >                                              ^- PAGE_SHIFT
> >   # ss -uam
> >   State  Recv-Q      ...
> >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> >
> > Previously, we had a boundary check for INT_MAX, which was removed by
> > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> >
> > A complete fix would be to revert it and cap the right operand by
> > INT_MAX:
> >
> >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> >           goto uncharge_drop;
> >
> > but we do not want to add the expensive atomic_add_return() back just
> > for the corner case.
> >
> > So, let's perform the first check as unsigned int to detect the
> > integer overflow.
> >
> > Note that we still allow a single wraparound, which can be observed
> > from userspace, but it's acceptable considering it's unlikely that
> > no recv() is called for a long period, and the negative value will
> > soon flip back to positive after a few recv() calls.
> >
> >   # cat /proc/net/sockstat | grep UDP:
> >   UDP: inuse 3 mem 524288  <-- (INT_MAX + 1) >> 12
> >
> >   # ss -uam
> >   State  Recv-Q      ...
> >   UNCONN -2147482816 ...   <-- INT_MAX + 831 bytes
> >          skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
> >
> > Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > ---
> >  net/ipv4/udp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index a9bb9ce5438e..a1e60aab29b5 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
> >          */
> >         rmem = atomic_read(&sk->sk_rmem_alloc);
> >         rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> > -       if (rmem > rcvbuf)
> > +       if ((unsigned int)rmem > rcvbuf)
> 
> SGTM, but maybe make rmem and rcvbuf  'unsigned int ' to avoid casts ?

That's cleaner.  I'll add a small comment above the comparison
so that we don't lose the boundary check if these two are ever
redefined as int in the future.


> 
> BTW piling 2GB worth of skbs in a single UDP receive queue means a
> latency spike when
> __skb_queue_purge(&sk->sk_receive_queue) is called, say from
> inet_sock_destruct(),
> which is a problem on its own.

Yes, we need to improve our application a lot :)

Thanks!

> 
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index db606f7e4163809d8220be1c1a4adb5662fc914e..575baac391e8af911fc1eff3f2d8e64bb9aa4c70
> 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1725,9 +1725,9 @@ static int udp_rmem_schedule(struct sock *sk, int size)
>  int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  {
>         struct sk_buff_head *list = &sk->sk_receive_queue;
> -       int rmem, err = -ENOMEM;
> +       unsigned int rmem, rcvbuf;
> +       int size, err = -ENOMEM;
>         spinlock_t *busy = NULL;
> -       int size, rcvbuf;
> 
>         /* Immediately drop when the receive queue is full.
>          * Always allow at least one packet.
>
Kuniyuki Iwashima March 24, 2025, 6:10 p.m. UTC | #4
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 10:59:49 -0400
> Kuniyuki Iwashima wrote:
> > __udp_enqueue_schedule_skb() has the following condition:
> > 
> >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> >           goto drop;
> > 
> > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > or SO_RCVBUFFORCE.
> > 
> > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > as sk->sk_rmem_alloc is also signed int.
> > 
> > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > unconditionally.
> > 
> > This results in integer overflow (possibly multiple times) on
> > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > net.core.udp_mem[1].
> > 
> > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > overflows:
> > 
> >   # cat /proc/net/sockstat | grep UDP:
> >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> >                                              ^- PAGE_SHIFT
> >   # ss -uam
> >   State  Recv-Q      ...
> >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > 
> > Previously, we had a boundary check for INT_MAX, which was removed by
> > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > 
> > A complete fix would be to revert it and cap the right operand by
> > INT_MAX:
> > 
> >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> >           goto uncharge_drop;
> > 
> > but we do not want to add the expensive atomic_add_return() back just
> > for the corner case.
> > 
> > So, let's perform the first check as unsigned int to detect the
> > integer overflow.
> > 
> > Note that we still allow a single wraparound, which can be observed
> > from userspace, but it's acceptable considering it's unlikely that
> > no recv() is called for a long period, and the negative value will
> > soon flip back to positive after a few recv() calls.
> 
> Can we do better than this?

Another approach I had in mind was to restore the original validation
under the recvq lock but without atomic ops like

  1. add another u32 as union of sk_rmem_alloc (only for UDP)
  2. access it with READ_ONCE() or under the recvq lock
  3. perform the validation under the lock
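
Something like this, very roughly (the u32 alias below is hypothetical
and does not exist today):

        /* enqueue path, with sk_receive_queue.lock already held */
        rmem = READ_ONCE(sk->sk_rmem_alloc_u32);    /* hypothetical alias */
        if (rmem + size > READ_ONCE(sk->sk_rcvbuf))
                goto unlock_drop;
        WRITE_ONCE(sk->sk_rmem_alloc_u32, rmem + size);
        __skb_queue_tail(&sk->sk_receive_queue, skb);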

But it requires more changes around the error queue handling and
the general socket impl, so will be too invasive for net.git but
maybe worth a try for net-next ?


> Is this because of the "Always allow at least one packet" below, and
> due to testing the value of the counter without skb->truesize added?

Yes, that's the reason although we don't receive a single >INT_MAX
packet.


> 
>         /* Immediately drop when the receive queue is full.
>          * Always allow at least one packet.
>          */
>         rmem = atomic_read(&sk->sk_rmem_alloc);
>         rcvbuf = READ_ONCE(sk->sk_rcvbuf);
>         if (rmem > rcvbuf)
>                 goto drop;
Willem de Bruijn March 24, 2025, 7:44 p.m. UTC | #5
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Mon, 24 Mar 2025 10:59:49 -0400
> > Kuniyuki Iwashima wrote:
> > > __udp_enqueue_schedule_skb() has the following condition:
> > > 
> > >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > >           goto drop;
> > > 
> > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > or SO_RCVBUFFORCE.
> > > 
> > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > as sk->sk_rmem_alloc is also signed int.
> > > 
> > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > unconditionally.
> > > 
> > > This results in integer overflow (possibly multiple times) on
> > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > net.core.udp_mem[1].
> > > 
> > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > overflows:
> > > 
> > >   # cat /proc/net/sockstat | grep UDP:
> > >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> > >                                              ^- PAGE_SHIFT
> > >   # ss -uam
> > >   State  Recv-Q      ...
> > >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> > >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > 
> > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > 
> > > A complete fix would be to revert it and cap the right operand by
> > > INT_MAX:
> > > 
> > >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > >           goto uncharge_drop;
> > > 
> > > but we do not want to add the expensive atomic_add_return() back just
> > > for the corner case.
> > > 
> > > So, let's perform the first check as unsigned int to detect the
> > > integer overflow.
> > > 
> > > Note that we still allow a single wraparound, which can be observed
> > > from userspace, but it's acceptable considering it's unlikely that
> > > no recv() is called for a long period, and the negative value will
> > > soon flip back to positive after a few recv() calls.
> > 
> > Can we do better than this?
> 
> Another approach I had in mind was to restore the original validation
> under the recvq lock but without atomic ops like
> 
>   1. add another u32 as union of sk_rmem_alloc (only for UDP)
>   2. access it with READ_ONCE() or under the recvq lock
>   3. perform the validation under the lock
> 
> But it requires more changes around the error queue handling and
> the general socket impl, so will be too invasive for net.git but
> maybe worth a try for net-next ?

Definitely not net material. Adding more complexity here
would also need some convincing benchmark data probably.

> 
> > Is this because of the "Always allow at least one packet" below, and
> > due to testing the value of the counter without skb->truesize added?
> 
> Yes, that's the reason although we don't receive a single >INT_MAX
> packet.

I was surprised that we don't take the current skb size into
account when doing this calculation.

Turns out that this code used to do that.

commit 363dc73acacb ("udp: be less conservative with sock rmem
accounting") made this change:

-       if (rmem && (rmem + size > sk->sk_rcvbuf))
+       if (rmem > sk->sk_rcvbuf)
                goto drop;

The special consideration to allow one packet is to avoid starvation
with small rcvbuf, judging also from this review comment:

https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/

That clearly doesn't apply when rcvbuf is near INT_MAX.
Can we separate the tiny budget case and hard drop including the
skb->truesize for normal buffer sizes?

> 
> > 
> >         /* Immediately drop when the receive queue is full.
> >          * Always allow at least one packet.
> >          */
> >         rmem = atomic_read(&sk->sk_rmem_alloc);
> >         rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> >         if (rmem > rcvbuf)
> >                 goto drop;
Kuniyuki Iwashima March 24, 2025, 8:46 p.m. UTC | #6
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Mon, 24 Mar 2025 15:44:40 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > Kuniyuki Iwashima wrote:
> > > > __udp_enqueue_schedule_skb() has the following condition:
> > > > 
> > > >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > >           goto drop;
> > > > 
> > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > > or SO_RCVBUFFORCE.
> > > > 
> > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > > as sk->sk_rmem_alloc is also signed int.
> > > > 
> > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > unconditionally.
> > > > 
> > > > This results in integer overflow (possibly multiple times) on
> > > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > > net.core.udp_mem[1].
> > > > 
> > > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > > overflows:
> > > > 
> > > >   # cat /proc/net/sockstat | grep UDP:
> > > >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> > > >                                              ^- PAGE_SHIFT
> > > >   # ss -uam
> > > >   State  Recv-Q      ...
> > > >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> > > >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > > 
> > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > > 
> > > > A complete fix would be to revert it and cap the right operand by
> > > > INT_MAX:
> > > > 
> > > >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > >           goto uncharge_drop;
> > > > 
> > > > but we do not want to add the expensive atomic_add_return() back just
> > > > for the corner case.
> > > > 
> > > > So, let's perform the first check as unsigned int to detect the
> > > > integer overflow.
> > > > 
> > > > Note that we still allow a single wraparound, which can be observed
> > > > from userspace, but it's acceptable considering it's unlikely that
> > > > no recv() is called for a long period, and the negative value will
> > > > soon flip back to positive after a few recv() calls.
> > > 
> > > Can we do better than this?
> > 
> > Another approach I had in mind was to restore the original validation
> > under the recvq lock but without atomic ops like
> > 
> >   1. add another u32 as union of sk_rmem_alloc (only for UDP)
> >   2. access it with READ_ONCE() or under the recvq lock
> >   3. perform the validation under the lock
> > 
> > But it requires more changes around the error queue handling and
> > the general socket impl, so will be too invasive for net.git but
> > maybe worth a try for net-next ?
> 
> Definitely not net material. Adding more complexity here
> would also need some convincing benchmark data probably.
> 
> > 
> > > Is this because of the "Always allow at least one packet" below, and
> > > due to testing the value of the counter without skb->truesize added?
> > 
> > Yes, that's the reason although we don't receive a single >INT_MAX
> > packet.
> 
> I was surprised that we don't take the current skb size into
> account when doing this calculation.
> 
> Turns out that this code used to do that.
> 
> commit 363dc73acacb ("udp: be less conservative with sock rmem
> accounting") made this change:
> 
> -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> +       if (rmem > sk->sk_rcvbuf)
>                 goto drop;
> 
> The special consideration to allow one packet is to avoid starvation
> with small rcvbuf, judging also from this review comment:
> 
> https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/

Interesting, thanks for the info !

Now it's allowed to exceed by the total size of the incoming skb
on every CPU, and a user may notice that rmem > rcvbuf via ss,
but I guess it's allowed because the fast recovery is expected.


> 
> That clearly doesn't apply when rcvbuf is near INT_MAX.
> Can we separate the tiny budget case and hard drop including the
> skb->truesize for normal buffer sizes?

Maybe like this ?

        if (rcvbuf < UDP_MIN_RCVBUF) {
                if (rmem > rcvbuf)
                        goto drop;
        } else {
                if (rmem + size > rcvbuf)
                        goto drop;
        }

SOCK_MIN_RCVBUF is 2K + skb since 2013, but the regression was
reported after that in 2016, so UDP_MIN_RCVBUF would be more ?

But I wonder if adding new branches in the fast path is worth it for
the corner case, and that's why I chose integrating the cast into
the existing branch, allowing a small overflow, which is observable
only when no thread calls recv() and more than INT_MAX bytes of skbs
are queued.
Willem de Bruijn March 25, 2025, 2:18 p.m. UTC | #7
Kuniyuki Iwashima wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Mon, 24 Mar 2025 15:44:40 -0400
> > Kuniyuki Iwashima wrote:
> > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > > Kuniyuki Iwashima wrote:
> > > > > __udp_enqueue_schedule_skb() has the following condition:
> > > > > 
> > > > >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > >           goto drop;
> > > > > 
> > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > > > or SO_RCVBUFFORCE.
> > > > > 
> > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > > > as sk->sk_rmem_alloc is also signed int.
> > > > > 
> > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > > unconditionally.
> > > > > 
> > > > > This results in integer overflow (possibly multiple times) on
> > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > > > net.core.udp_mem[1].
> > > > > 
> > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > > > overflows:
> > > > > 
> > > > >   # cat /proc/net/sockstat | grep UDP:
> > > > >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> > > > >                                              ^- PAGE_SHIFT
> > > > >   # ss -uam
> > > > >   State  Recv-Q      ...
> > > > >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> > > > >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > > > 
> > > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > > > 
> > > > > A complete fix would be to revert it and cap the right operand by
> > > > > INT_MAX:
> > > > > 
> > > > >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > > >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > > >           goto uncharge_drop;
> > > > > 
> > > > > but we do not want to add the expensive atomic_add_return() back just
> > > > > for the corner case.
> > > > > 
> > > > > So, let's perform the first check as unsigned int to detect the
> > > > > integer overflow.
> > > > > 
> > > > > Note that we still allow a single wraparound, which can be observed
> > > > > from userspace, but it's acceptable considering it's unlikely that
> > > > > no recv() is called for a long period, and the negative value will
> > > > > soon flip back to positive after a few recv() calls.
> > > > 
> > > > Can we do better than this?
> > > 
> > > Another approach I had in mind was to restore the original validation
> > > under the recvq lock but without atomic ops like
> > > 
> > >   1. add another u32 as union of sk_rmem_alloc (only for UDP)
> > >   2. access it with READ_ONCE() or under the recvq lock
> > >   3. perform the validation under the lock
> > > 
> > > But it requires more changes around the error queue handling and
> > > the general socket impl, so will be too invasive for net.git but
> > > maybe worth a try for net-next ?
> > 
> > Definitely not net material. Adding more complexity here
> > would also need some convincing benchmark data probably.
> > 
> > > 
> > > > Is this because of the "Always allow at least one packet" below, and
> > > > due to testing the value of the counter without skb->truesize added?
> > > 
> > > Yes, that's the reason although we don't receive a single >INT_MAX
> > > packet.
> > 
> > I was surprised that we don't take the current skb size into
> > account when doing this calculation.
> > 
> > Turns out that this code used to do that.
> > 
> > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > accounting") made this change:
> > 
> > -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> > +       if (rmem > sk->sk_rcvbuf)
> >                 goto drop;
> > 
> > The special consideration to allow one packet is to avoid starvation
> > with small rcvbuf, judging also from this review comment:
> > 
> > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
> 
> Interesting, thanks for the info !
> 
> Now it's allowed to exceed by the total size of the incoming skb
> > on every CPU, and a user may notice that rmem > rcvbuf via ss,
> but I guess it's allowed because the fast recovery is expected.
> 
> 
> > 
> > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > Can we separate the tiny budget case and hard drop including the
> > skb->truesize for normal buffer sizes?
> 
> Maybe like this ?
> 
>         if (rcvbuf < UDP_MIN_RCVBUF) {
>                 if (rmem > rcvbuf)
>                         goto drop;
>         } else {
>                 if (rmem + size > rcvbuf)
>                         goto drop;
>         }
> 
> SOCK_MIN_RCVBUF is 2K + skb since 2013, but the regression was
> reported after that in 2016, so UDP_MIN_RCVBUF would be more ?

Since the only issue is the overflow, could use a higher bound like
INT_MAX >> 1.
 
> But I wonder if adding new branches in the fast path is worth for
> the corner case, and that's why I chose integrating the cast into
> the exisintg branch, allowing a small overflow, which is observable
> only when no thread calls recv() and skbs are queued more than INT_MAX.

Okay. Though it can probably be structured so that the likely path does
not even see this?

    if (rmem + size > rcvbuf) {
            if (rcvbuf > INT_MAX >> 1)
                    goto drop;
            if (rmem > rcvbuf)
                    goto drop;
    }
Kuniyuki Iwashima March 25, 2025, 5:53 p.m. UTC | #8
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 25 Mar 2025 10:18:30 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Mon, 24 Mar 2025 15:44:40 -0400
> > > Kuniyuki Iwashima wrote:
> > > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > > > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > > > Kuniyuki Iwashima wrote:
> > > > > > __udp_enqueue_schedule_skb() has the following condition:
> > > > > > 
> > > > > >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > > >           goto drop;
> > > > > > 
> > > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> > > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> > > > > > or SO_RCVBUFFORCE.
> > > > > > 
> > > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> > > > > > as sk->sk_rmem_alloc is also signed int.
> > > > > > 
> > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > > > unconditionally.
> > > > > > 
> > > > > > This results in integer overflow (possibly multiple times) on
> > > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to
> > > > > > net.core.udp_mem[1].
> > > > > > 
> > > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to
> > > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> > > > > > overflows:
> > > > > > 
> > > > > >   # cat /proc/net/sockstat | grep UDP:
> > > > > >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> > > > > >                                              ^- PAGE_SHIFT
> > > > > >   # ss -uam
> > > > > >   State  Recv-Q      ...
> > > > > >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> > > > > >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
> > > > > > 
> > > > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > > > > 
> > > > > > A complete fix would be to revert it and cap the right operand by
> > > > > > INT_MAX:
> > > > > > 
> > > > > >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > > > >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > > > >           goto uncharge_drop;
> > > > > > 
> > > > > > but we do not want to add the expensive atomic_add_return() back just
> > > > > > for the corner case.
> > > > > > 
> > > > > > So, let's perform the first check as unsigned int to detect the
> > > > > > integer overflow.
> > > > > > 
> > > > > > Note that we still allow a single wraparound, which can be observed
> > > > > > from userspace, but it's acceptable considering it's unlikely that
> > > > > > no recv() is called for a long period, and the negative value will
> > > > > > soon flip back to positive after a few recv() calls.
> > > > > 
> > > > > Can we do better than this?
> > > > 
> > > > Another approach I had in mind was to restore the original validation
> > > > under the recvq lock but without atomic ops like
> > > > 
> > > >   1. add another u32 as union of sk_rmem_alloc (only for UDP)
> > > >   2. access it with READ_ONCE() or under the recvq lock
> > > >   3. perform the validation under the lock
> > > > 
> > > > But it requires more changes around the error queue handling and
> > > > the general socket impl, so will be too invasive for net.git but
> > > > maybe worth a try for net-next ?
> > > 
> > > Definitely not net material. Adding more complexity here
> > > would also need some convincing benchmark data probably.
> > > 
> > > > 
> > > > > Is this because of the "Always allow at least one packet" below, and
> > > > > due to testing the value of the counter without skb->truesize added?
> > > > 
> > > > Yes, that's the reason although we don't receive a single >INT_MAX
> > > > packet.
> > > 
> > > I was surprised that we don't take the current skb size into
> > > account when doing this calculation.
> > > 
> > > Turns out that this code used to do that.
> > > 
> > > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > > accounting") made this change:
> > > 
> > > -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> > > +       if (rmem > sk->sk_rcvbuf)
> > >                 goto drop;
> > > 
> > > The special consideration to allow one packet is to avoid starvation
> > > with small rcvbuf, judging also from this review comment:
> > > 
> > > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
> > 
> > Interesting, thanks for the info !
> > 
> > Now it's allowed to exceed by the total size of the incoming skb
> > on every CPU, and a user may notice that rmem > rcvbuf via ss,
> > but I guess it's allowed because the fast recovery is expected.
> > 
> > 
> > > 
> > > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > > Can we separate the tiny budget case and hard drop including the
> > > skb->truesize for normal buffer sizes?
> > 
> > Maybe like this ?
> > 
> >         if (rcvbuf < UDP_MIN_RCVBUF) {
> >                 if (rmem > rcvbuf)
> >                         goto drop;
> >         } else {
> >                 if (rmem + size > rcvbuf)
> >                         goto drop;
> >         }
> > 
> > SOCK_MIN_RCVBUF is 2K + skb since 2013, but the regression was
> > reported after that in 2016, so UDP_MIN_RCVBUF would be more ?
> 
> Since the only issue is the overflow, could use a higher bound like
> INT_MAX >> 1.
>  
> > But I wonder if adding new branches in the fast path is worth it for
> > the corner case, and that's why I chose integrating the cast into
> > the existing branch, allowing a small overflow, which is observable
> > only when no thread calls recv() and more than INT_MAX bytes of skbs
> > are queued.
> 
> Okay. Though it can probably be structured so that the likely path does
> not even see this?

Ah exactly, will use this and update commit message in v2.

Thanks!

> 
>     if (rmem + size > rcvbuf) {
>             if (rcvbuf > INT_MAX >> 1)
>                     goto drop;
>             if (rmem > rcvbuf)
>                     goto drop;
>     }
>

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a9bb9ce5438e..a1e60aab29b5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1735,7 +1735,7 @@  int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
 	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
-	if (rmem > rcvbuf)
+	if ((unsigned int)rmem > rcvbuf)
 		goto drop;
 
 	/* Under mem pressure, it might be helpful to help udp_recvmsg()