Message ID | 20250323231016.74813-2-kuniyu@amazon.com (mailing list archive)
---|---
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | udp: Fix two integer overflows when sk->sk_rcvbuf is close to INT_MAX.
On Mon, Mar 24, 2025 at 12:11 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> __udp_enqueue_schedule_skb() has the following condition:
>
>   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
>           goto drop;
>
> sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
> be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
> or SO_RCVBUFFORCE.
>
> If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
> as sk->sk_rmem_alloc is also signed int.
>
> Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> unconditionally.
>
> This results in integer overflow (possibly multiple times) on
> sk->sk_rmem_alloc and allows a single socket to have skb up to
> net.core.udp_mem[1].
>
> For example, if we set a large value to udp_mem[1] and INT_MAX to
> sk->sk_rcvbuf and flood packets to the socket, we can see multiple
> overflows:
>
>   # cat /proc/net/sockstat | grep UDP:
>   UDP: inuse 3 mem 7956736   <-- (7956736 << 12) bytes > INT_MAX * 15
>                                             ^- PAGE_SHIFT
>   # ss -uam
>   State  Recv-Q      ...
>   UNCONN -1757018048 ...     <-- flipping the sign repeatedly
>          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
>
> Previously, we had a boundary check for INT_MAX, which was removed by
> commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
>
> A complete fix would be to revert it and cap the right operand by
> INT_MAX:
>
>   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
>   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
>           goto uncharge_drop;
>
> but we do not want to add the expensive atomic_add_return() back just
> for the corner case.
>
> So, let's perform the first check as unsigned int to detect the
> integer overflow.
>
> Note that we still allow a single wraparound, which can be observed
> from userspace, but it's acceptable considering it's unlikely that
> no recv() is called for a long period, and the negative value will
> soon flip back to positive after a few recv() calls.
>
>   # cat /proc/net/sockstat | grep UDP:
>   UDP: inuse 3 mem 524288    <-- (INT_MAX + 1) >> 12
>
>   # ss -uam
>   State  Recv-Q      ...
>   UNCONN -2147482816 ...     <-- INT_MAX + 831 bytes
>          skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)
>
> Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
>  net/ipv4/udp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index a9bb9ce5438e..a1e60aab29b5 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>   */
>  rmem = atomic_read(&sk->sk_rmem_alloc);
>  rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> - if (rmem > rcvbuf)
> + if ((unsigned int)rmem > rcvbuf)

SGTM, but maybe make rmem and rcvbuf 'unsigned int' to avoid casts ?

BTW piling 2GB worth of skbs in a single UDP receive queue means a
latency spike when __skb_queue_purge(&sk->sk_receive_queue) is called,
say from inet_sock_destruct(), which is a problem on its own.
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index db606f7e4163809d8220be1c1a4adb5662fc914e..575baac391e8af911fc1eff3f2d8e64bb9aa4c70 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1725,9 +1725,9 @@ static int udp_rmem_schedule(struct sock *sk, int size)
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff_head *list = &sk->sk_receive_queue;
-	int rmem, err = -ENOMEM;
+	unsigned int rmem, rcvbuf;
+	int size, err = -ENOMEM;
 	spinlock_t *busy = NULL;
-	int size, rcvbuf;
 
 	/* Immediately drop when the receive queue is full.
 	 * Always allow at least one packet.
Kuniyuki Iwashima wrote: > __udp_enqueue_schedule_skb() has the following condition: > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > goto drop; > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > or SO_RCVBUFFORCE. > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > as sk->sk_rmem_alloc is also signed int. > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > unconditionally. > > This results in integer overflow (possibly multiple times) on > sk->sk_rmem_alloc and allows a single socket to have skb up to > net.core.udp_mem[1]. > > For example, if we set a large value to udp_mem[1] and INT_MAX to > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > overflows: > > # cat /proc/net/sockstat | grep UDP: > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > ^- PAGE_SHIFT > # ss -uam > State Recv-Q ... > UNCONN -1757018048 ... <-- flipping the sign repeatedly > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > Previously, we had a boundary check for INT_MAX, which was removed by > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > A complete fix would be to revert it and cap the right operand by > INT_MAX: > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > goto uncharge_drop; > > but we do not want to add the expensive atomic_add_return() back just > for the corner case. > > So, let's perform the first check as unsigned int to detect the > integer overflow. > > Note that we still allow a single wraparound, which can be observed > from userspace, but it's acceptable considering it's unlikely that > no recv() is called for a long period, and the negative value will > soon flip back to positive after a few recv() calls. Can we do better than this? Is this because of the "Always allow at least one packet" below, and due to testing the value of the counter without skb->truesize added? /* Immediately drop when the receive queue is full. * Always allow at least one packet. */ rmem = atomic_read(&sk->sk_rmem_alloc); rcvbuf = READ_ONCE(sk->sk_rcvbuf); if (rmem > rcvbuf) goto drop; > > # cat /proc/net/sockstat | grep UDP: > UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> 12 > > # ss -uam > State Recv-Q ... > UNCONN -2147482816 ... <-- INT_MAX + 831 bytes > skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947) > > Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc") > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > --- > net/ipv4/udp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index a9bb9ce5438e..a1e60aab29b5 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb) > */ > rmem = atomic_read(&sk->sk_rmem_alloc); > rcvbuf = READ_ONCE(sk->sk_rcvbuf); > - if (rmem > rcvbuf) > + if ((unsigned int)rmem > rcvbuf) > goto drop; > > /* Under mem pressure, it might be helpful to help udp_recvmsg() > -- > 2.48.1 >
From: Eric Dumazet <edumazet@google.com> Date: Mon, 24 Mar 2025 11:01:15 +0100 > On Mon, Mar 24, 2025 at 12:11 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > __udp_enqueue_schedule_skb() has the following condition: > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > goto drop; > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > > or SO_RCVBUFFORCE. > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > > as sk->sk_rmem_alloc is also signed int. > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > > unconditionally. > > > > This results in integer overflow (possibly multiple times) on > > sk->sk_rmem_alloc and allows a single socket to have skb up to > > net.core.udp_mem[1]. > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > > overflows: > > > > # cat /proc/net/sockstat | grep UDP: > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > > ^- PAGE_SHIFT > > # ss -uam > > State Recv-Q ... > > UNCONN -1757018048 ... <-- flipping the sign repeatedly > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > > > Previously, we had a boundary check for INT_MAX, which was removed by > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > > > A complete fix would be to revert it and cap the right operand by > > INT_MAX: > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > > goto uncharge_drop; > > > > but we do not want to add the expensive atomic_add_return() back just > > for the corner case. > > > > So, let's perform the first check as unsigned int to detect the > > integer overflow. > > > > Note that we still allow a single wraparound, which can be observed > > from userspace, but it's acceptable considering it's unlikely that > > no recv() is called for a long period, and the negative value will > > soon flip back to positive after a few recv() calls. > > > > # cat /proc/net/sockstat | grep UDP: > > UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> 12 > > > > # ss -uam > > State Recv-Q ... > > UNCONN -2147482816 ... <-- INT_MAX + 831 bytes > > skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947) > > > > Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc") > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > > --- > > net/ipv4/udp.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index a9bb9ce5438e..a1e60aab29b5 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb) > > */ > > rmem = atomic_read(&sk->sk_rmem_alloc); > > rcvbuf = READ_ONCE(sk->sk_rcvbuf); > > - if (rmem > rcvbuf) > > + if ((unsigned int)rmem > rcvbuf) > > SGTM, but maybe make rmem and rcvbuf 'unsigned int ' to avoid casts ? That's cleaner. I'll add a small comment above the comparison not to lose the boundary check by defining these two as int in the future. > > BTW piling 2GB worth of skbs in a single UDP receive queue means a > latency spike when > __skb_queue_purge(&sk->sk_receive_queue) is called, say from > inet_sock_destruct(), > which is a problem on its own. Yes, we need to improve our application a lot :) Thanks! 
>
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index db606f7e4163809d8220be1c1a4adb5662fc914e..575baac391e8af911fc1eff3f2d8e64bb9aa4c70
> 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1725,9 +1725,9 @@ static int udp_rmem_schedule(struct sock *sk, int size)
>  int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct sk_buff_head *list = &sk->sk_receive_queue;
> -	int rmem, err = -ENOMEM;
> +	unsigned int rmem, rcvbuf;
> +	int size, err = -ENOMEM;
>  	spinlock_t *busy = NULL;
> -	int size, rcvbuf;
>
>  	/* Immediately drop when the receive queue is full.
>  	 * Always allow at least one packet.
>
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Date: Mon, 24 Mar 2025 10:59:49 -0400 > Kuniyuki Iwashima wrote: > > __udp_enqueue_schedule_skb() has the following condition: > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > goto drop; > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > > or SO_RCVBUFFORCE. > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > > as sk->sk_rmem_alloc is also signed int. > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > > unconditionally. > > > > This results in integer overflow (possibly multiple times) on > > sk->sk_rmem_alloc and allows a single socket to have skb up to > > net.core.udp_mem[1]. > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > > overflows: > > > > # cat /proc/net/sockstat | grep UDP: > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > > ^- PAGE_SHIFT > > # ss -uam > > State Recv-Q ... > > UNCONN -1757018048 ... <-- flipping the sign repeatedly > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > > > Previously, we had a boundary check for INT_MAX, which was removed by > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > > > A complete fix would be to revert it and cap the right operand by > > INT_MAX: > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > > goto uncharge_drop; > > > > but we do not want to add the expensive atomic_add_return() back just > > for the corner case. > > > > So, let's perform the first check as unsigned int to detect the > > integer overflow. > > > > Note that we still allow a single wraparound, which can be observed > > from userspace, but it's acceptable considering it's unlikely that > > no recv() is called for a long period, and the negative value will > > soon flip back to positive after a few recv() calls. > > Can we do better than this? Another approach I had in mind was to restore the original validation under the recvq lock but without atomic ops like 1. add another u32 as union of sk_rmem_alloc (only for UDP) 2. access it with READ_ONCE() or under the recvq lock 3. perform the validation under the lock But it requires more changes around the error queue handling and the general socket impl, so will be too invasive for net.git but maybe worth a try for net-next ? > Is this because of the "Always allow at least one packet" below, and > due to testing the value of the counter without skb->truesize added? Yes, that's the reason although we don't receive a single >INT_MAX packet. > > /* Immediately drop when the receive queue is full. > * Always allow at least one packet. > */ > rmem = atomic_read(&sk->sk_rmem_alloc); > rcvbuf = READ_ONCE(sk->sk_rcvbuf); > if (rmem > rcvbuf) > goto drop;
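As a generic illustration of the alternative floated above (a plain u32 counter
validated under the receive-queue lock instead of a racy atomic), here is a
small userspace pthread sketch; the struct and names are made up for the
example and are not the kernel data structures:

#include <pthread.h>
#include <stdio.h>

struct rcvqueue {
	pthread_spinlock_t lock;  /* stands in for sk_receive_queue.lock */
	unsigned int rmem_alloc;  /* the proposed non-atomic u32 counter  */
	unsigned int rcvbuf;
};

/* Returns 0 and charges `size` on success, -1 if the packet must be dropped. */
static int charge(struct rcvqueue *q, unsigned int size)
{
	int ret = -1;

	pthread_spin_lock(&q->lock);
	/* Check and update happen in one critical section, so the counter
	 * cannot race past the limit and overflow the way the lockless
	 * atomic variant can.
	 */
	if (q->rmem_alloc <= q->rcvbuf) {
		q->rmem_alloc += size;
		ret = 0;
	}
	pthread_spin_unlock(&q->lock);
	return ret;
}

int main(void)
{
	struct rcvqueue q = { .rmem_alloc = 0, .rcvbuf = 212992 };

	pthread_spin_init(&q.lock, PTHREAD_PROCESS_PRIVATE);
	printf("first 1500B skb: %s\n", charge(&q, 1500) ? "drop" : "queue");
	q.rmem_alloc = q.rcvbuf + 1;	/* pretend the queue is over the limit */
	printf("next 1500B skb:  %s\n", charge(&q, 1500) ? "drop" : "queue");
	pthread_spin_destroy(&q.lock);
	return 0;
}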
Kuniyuki Iwashima wrote: > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > Date: Mon, 24 Mar 2025 10:59:49 -0400 > > Kuniyuki Iwashima wrote: > > > __udp_enqueue_schedule_skb() has the following condition: > > > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > goto drop; > > > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > > > or SO_RCVBUFFORCE. > > > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > > > as sk->sk_rmem_alloc is also signed int. > > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > > > unconditionally. > > > > > > This results in integer overflow (possibly multiple times) on > > > sk->sk_rmem_alloc and allows a single socket to have skb up to > > > net.core.udp_mem[1]. > > > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > > > overflows: > > > > > > # cat /proc/net/sockstat | grep UDP: > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > > > ^- PAGE_SHIFT > > > # ss -uam > > > State Recv-Q ... > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > > > > > Previously, we had a boundary check for INT_MAX, which was removed by > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > > > > > A complete fix would be to revert it and cap the right operand by > > > INT_MAX: > > > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > > > goto uncharge_drop; > > > > > > but we do not want to add the expensive atomic_add_return() back just > > > for the corner case. > > > > > > So, let's perform the first check as unsigned int to detect the > > > integer overflow. > > > > > > Note that we still allow a single wraparound, which can be observed > > > from userspace, but it's acceptable considering it's unlikely that > > > no recv() is called for a long period, and the negative value will > > > soon flip back to positive after a few recv() calls. > > > > Can we do better than this? > > Another approach I had in mind was to restore the original validation > under the recvq lock but without atomic ops like > > 1. add another u32 as union of sk_rmem_alloc (only for UDP) > 2. access it with READ_ONCE() or under the recvq lock > 3. perform the validation under the lock > > But it requires more changes around the error queue handling and > the general socket impl, so will be too invasive for net.git but > maybe worth a try for net-next ? Definitely not net material. Adding more complexity here would also need some convincing benchmark data probably. > > > Is this because of the "Always allow at least one packet" below, and > > due to testing the value of the counter without skb->truesize added? > > Yes, that's the reason although we don't receive a single >INT_MAX > packet. I was surprised that we don't take the current skb size into account when doing this calculation. Turns out that this code used to do that. 
commit 363dc73acacb ("udp: be less conservative with sock rmem
accounting") made this change:

-       if (rmem && (rmem + size > sk->sk_rcvbuf))
+       if (rmem > sk->sk_rcvbuf)
                goto drop;

The special consideration to allow one packet is to avoid starvation
with small rcvbuf, judging also from this review comment:

https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/

That clearly doesn't apply when rcvbuf is near INT_MAX.
Can we separate the tiny budget case and hard drop including the
skb->truesize for normal buffer sizes?

> >
> >         /* Immediately drop when the receive queue is full.
> >          * Always allow at least one packet.
> >          */
> >         rmem = atomic_read(&sk->sk_rmem_alloc);
> >         rcvbuf = READ_ONCE(sk->sk_rcvbuf);
> >         if (rmem > rcvbuf)
> >                 goto drop;
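To see the starvation problem that the one-packet allowance avoids with a tiny
rcvbuf, here is a small standalone C sketch of the two checks; the sizes and
function names are made up for illustration and this is not the kernel path:

#include <stdio.h>

/* Strict check: drop whenever the new skb would exceed rcvbuf. */
static int drop_strict(unsigned int rmem, unsigned int size, unsigned int rcvbuf)
{
	return rmem + size > rcvbuf;
}

/* Lenient check (what the kernel does today): only look at what is already
 * queued, so an empty queue always admits at least one packet.
 */
static int drop_lenient(unsigned int rmem, unsigned int size, unsigned int rcvbuf)
{
	(void)size;
	return rmem > rcvbuf;
}

int main(void)
{
	unsigned int rcvbuf = 2048;	/* tiny SO_RCVBUF */
	unsigned int truesize = 2304;	/* skb->truesize larger than rcvbuf */

	/* With an empty queue, the strict check drops every packet
	 * (starvation), while the lenient one still queues the first skb.
	 */
	printf("strict:  %s\n", drop_strict(0, truesize, rcvbuf) ? "drop" : "queue");
	printf("lenient: %s\n", drop_lenient(0, truesize, rcvbuf) ? "drop" : "queue");
	return 0;
}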
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Date: Mon, 24 Mar 2025 15:44:40 -0400 > Kuniyuki Iwashima wrote: > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > Date: Mon, 24 Mar 2025 10:59:49 -0400 > > > Kuniyuki Iwashima wrote: > > > > __udp_enqueue_schedule_skb() has the following condition: > > > > > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > > goto drop; > > > > > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > > > > or SO_RCVBUFFORCE. > > > > > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > > > > as sk->sk_rmem_alloc is also signed int. > > > > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > > > > unconditionally. > > > > > > > > This results in integer overflow (possibly multiple times) on > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to > > > > net.core.udp_mem[1]. > > > > > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > > > > overflows: > > > > > > > > # cat /proc/net/sockstat | grep UDP: > > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > > > > ^- PAGE_SHIFT > > > > # ss -uam > > > > State Recv-Q ... > > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly > > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > > > > > > > Previously, we had a boundary check for INT_MAX, which was removed by > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > > > > > > > A complete fix would be to revert it and cap the right operand by > > > > INT_MAX: > > > > > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > > > > goto uncharge_drop; > > > > > > > > but we do not want to add the expensive atomic_add_return() back just > > > > for the corner case. > > > > > > > > So, let's perform the first check as unsigned int to detect the > > > > integer overflow. > > > > > > > > Note that we still allow a single wraparound, which can be observed > > > > from userspace, but it's acceptable considering it's unlikely that > > > > no recv() is called for a long period, and the negative value will > > > > soon flip back to positive after a few recv() calls. > > > > > > Can we do better than this? > > > > Another approach I had in mind was to restore the original validation > > under the recvq lock but without atomic ops like > > > > 1. add another u32 as union of sk_rmem_alloc (only for UDP) > > 2. access it with READ_ONCE() or under the recvq lock > > 3. perform the validation under the lock > > > > But it requires more changes around the error queue handling and > > the general socket impl, so will be too invasive for net.git but > > maybe worth a try for net-next ? > > Definitely not net material. Adding more complexity here > would also need some convincing benchmark data probably. > > > > > > Is this because of the "Always allow at least one packet" below, and > > > due to testing the value of the counter without skb->truesize added? > > > > Yes, that's the reason although we don't receive a single >INT_MAX > > packet. > > I was surprised that we don't take the current skb size into > account when doing this calculation. > > Turns out that this code used to do that. 
> commit 363dc73acacb ("udp: be less conservative with sock rmem
> accounting") made this change:
>
> -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> +       if (rmem > sk->sk_rcvbuf)
>                 goto drop;
>
> The special consideration to allow one packet is to avoid starvation
> with small rcvbuf, judging also from this review comment:
>
> https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/

Interesting, thanks for the info !

Now it's allowed to exceed by the total size of the incoming skb
on each CPU, and a user may notice that rmem > rcvbuf via ss,
but I guess it's allowed because fast recovery is expected.

> That clearly doesn't apply when rcvbuf is near INT_MAX.
> Can we separate the tiny budget case and hard drop including the
> skb->truesize for normal buffer sizes?

Maybe like this ?

        if (rcvbuf < UDP_MIN_RCVBUF) {
                if (rmem > rcvbuf)
                        goto drop;
        } else {
                if (rmem + size > rcvbuf)
                        goto drop;
        }

SOCK_MIN_RCVBUF has been 2K + skb since 2013, but the regression was
reported after that, in 2016, so UDP_MIN_RCVBUF would need to be larger ?

But I wonder if adding new branches in the fast path is worth it for
the corner case, and that's why I chose integrating the cast into
the existing branch, allowing a small overflow, which is observable
only when no thread calls recv() and skbs are queued more than INT_MAX.
Kuniyuki Iwashima wrote: > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > Date: Mon, 24 Mar 2025 15:44:40 -0400 > > Kuniyuki Iwashima wrote: > > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > > Date: Mon, 24 Mar 2025 10:59:49 -0400 > > > > Kuniyuki Iwashima wrote: > > > > > __udp_enqueue_schedule_skb() has the following condition: > > > > > > > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > > > goto drop; > > > > > > > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > > > > > or SO_RCVBUFFORCE. > > > > > > > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > > > > > as sk->sk_rmem_alloc is also signed int. > > > > > > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > > > > > unconditionally. > > > > > > > > > > This results in integer overflow (possibly multiple times) on > > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to > > > > > net.core.udp_mem[1]. > > > > > > > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to > > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > > > > > overflows: > > > > > > > > > > # cat /proc/net/sockstat | grep UDP: > > > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > > > > > ^- PAGE_SHIFT > > > > > # ss -uam > > > > > State Recv-Q ... > > > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly > > > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > > > > > > > > > Previously, we had a boundary check for INT_MAX, which was removed by > > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > > > > > > > > > A complete fix would be to revert it and cap the right operand by > > > > > INT_MAX: > > > > > > > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > > > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > > > > > goto uncharge_drop; > > > > > > > > > > but we do not want to add the expensive atomic_add_return() back just > > > > > for the corner case. > > > > > > > > > > So, let's perform the first check as unsigned int to detect the > > > > > integer overflow. > > > > > > > > > > Note that we still allow a single wraparound, which can be observed > > > > > from userspace, but it's acceptable considering it's unlikely that > > > > > no recv() is called for a long period, and the negative value will > > > > > soon flip back to positive after a few recv() calls. > > > > > > > > Can we do better than this? > > > > > > Another approach I had in mind was to restore the original validation > > > under the recvq lock but without atomic ops like > > > > > > 1. add another u32 as union of sk_rmem_alloc (only for UDP) > > > 2. access it with READ_ONCE() or under the recvq lock > > > 3. perform the validation under the lock > > > > > > But it requires more changes around the error queue handling and > > > the general socket impl, so will be too invasive for net.git but > > > maybe worth a try for net-next ? > > > > Definitely not net material. Adding more complexity here > > would also need some convincing benchmark data probably. > > > > > > > > > Is this because of the "Always allow at least one packet" below, and > > > > due to testing the value of the counter without skb->truesize added? > > > > > > Yes, that's the reason although we don't receive a single >INT_MAX > > > packet. 
> >
> > I was surprised that we don't take the current skb size into
> > account when doing this calculation.
> >
> > Turns out that this code used to do that.
> >
> > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > accounting") made this change:
> >
> > -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> > +       if (rmem > sk->sk_rcvbuf)
> >                 goto drop;
> >
> > The special consideration to allow one packet is to avoid starvation
> > with small rcvbuf, judging also from this review comment:
> >
> > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
>
> Interesting, thanks for the info !
>
> Now it's allowed to exceed by the total size of the incoming skb
> on each CPU, and a user may notice that rmem > rcvbuf via ss,
> but I guess it's allowed because fast recovery is expected.
>
> > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > Can we separate the tiny budget case and hard drop including the
> > skb->truesize for normal buffer sizes?
>
> Maybe like this ?
>
>         if (rcvbuf < UDP_MIN_RCVBUF) {
>                 if (rmem > rcvbuf)
>                         goto drop;
>         } else {
>                 if (rmem + size > rcvbuf)
>                         goto drop;
>         }
>
> SOCK_MIN_RCVBUF has been 2K + skb since 2013, but the regression was
> reported after that, in 2016, so UDP_MIN_RCVBUF would need to be larger ?

Since the only issue is the overflow, could use a higher bound like
INT_MAX >> 1.

> But I wonder if adding new branches in the fast path is worth it for
> the corner case, and that's why I chose integrating the cast into
> the existing branch, allowing a small overflow, which is observable
> only when no thread calls recv() and skbs are queued more than INT_MAX.

Okay. Though it can probably be structured so that the likely path does
not even see this?

        if (rmem + size > rcvbuf) {
                if (rcvbuf > INT_MAX >> 1)
                        goto drop;
                if (rmem > rcvbuf)
                        goto drop;
        }
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Date: Tue, 25 Mar 2025 10:18:30 -0400 > Kuniyuki Iwashima wrote: > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > Date: Mon, 24 Mar 2025 15:44:40 -0400 > > > Kuniyuki Iwashima wrote: > > > > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > > > Date: Mon, 24 Mar 2025 10:59:49 -0400 > > > > > Kuniyuki Iwashima wrote: > > > > > > __udp_enqueue_schedule_skb() has the following condition: > > > > > > > > > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > > > > > goto drop; > > > > > > > > > > > > sk->sk_rcvbuf is initialised by net.core.rmem_default and later can > > > > > > be configured by SO_RCVBUF, which is limited by net.core.rmem_max, > > > > > > or SO_RCVBUFFORCE. > > > > > > > > > > > > If we set INT_MAX to sk->sk_rcvbuf, the condition is always false > > > > > > as sk->sk_rmem_alloc is also signed int. > > > > > > > > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc > > > > > > unconditionally. > > > > > > > > > > > > This results in integer overflow (possibly multiple times) on > > > > > > sk->sk_rmem_alloc and allows a single socket to have skb up to > > > > > > net.core.udp_mem[1]. > > > > > > > > > > > > For example, if we set a large value to udp_mem[1] and INT_MAX to > > > > > > sk->sk_rcvbuf and flood packets to the socket, we can see multiple > > > > > > overflows: > > > > > > > > > > > > # cat /proc/net/sockstat | grep UDP: > > > > > > UDP: inuse 3 mem 7956736 <-- (7956736 << 12) bytes > INT_MAX * 15 > > > > > > ^- PAGE_SHIFT > > > > > > # ss -uam > > > > > > State Recv-Q ... > > > > > > UNCONN -1757018048 ... <-- flipping the sign repeatedly > > > > > > skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0) > > > > > > > > > > > > Previously, we had a boundary check for INT_MAX, which was removed by > > > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc"). > > > > > > > > > > > > A complete fix would be to revert it and cap the right operand by > > > > > > INT_MAX: > > > > > > > > > > > > rmem = atomic_add_return(size, &sk->sk_rmem_alloc); > > > > > > if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX)) > > > > > > goto uncharge_drop; > > > > > > > > > > > > but we do not want to add the expensive atomic_add_return() back just > > > > > > for the corner case. > > > > > > > > > > > > So, let's perform the first check as unsigned int to detect the > > > > > > integer overflow. > > > > > > > > > > > > Note that we still allow a single wraparound, which can be observed > > > > > > from userspace, but it's acceptable considering it's unlikely that > > > > > > no recv() is called for a long period, and the negative value will > > > > > > soon flip back to positive after a few recv() calls. > > > > > > > > > > Can we do better than this? > > > > > > > > Another approach I had in mind was to restore the original validation > > > > under the recvq lock but without atomic ops like > > > > > > > > 1. add another u32 as union of sk_rmem_alloc (only for UDP) > > > > 2. access it with READ_ONCE() or under the recvq lock > > > > 3. perform the validation under the lock > > > > > > > > But it requires more changes around the error queue handling and > > > > the general socket impl, so will be too invasive for net.git but > > > > maybe worth a try for net-next ? > > > > > > Definitely not net material. Adding more complexity here > > > would also need some convincing benchmark data probably. 
>
> > > > > Is this because of the "Always allow at least one packet" below, and
> > > > > due to testing the value of the counter without skb->truesize added?
> > > >
> > > > Yes, that's the reason although we don't receive a single >INT_MAX
> > > > packet.
> > >
> > > I was surprised that we don't take the current skb size into
> > > account when doing this calculation.
> > >
> > > Turns out that this code used to do that.
> > >
> > > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > > accounting") made this change:
> > >
> > > -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> > > +       if (rmem > sk->sk_rcvbuf)
> > >                 goto drop;
> > >
> > > The special consideration to allow one packet is to avoid starvation
> > > with small rcvbuf, judging also from this review comment:
> > >
> > > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
> >
> > Interesting, thanks for the info !
> >
> > Now it's allowed to exceed by the total size of the incoming skb
> > on each CPU, and a user may notice that rmem > rcvbuf via ss,
> > but I guess it's allowed because fast recovery is expected.
> >
> > > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > > Can we separate the tiny budget case and hard drop including the
> > > skb->truesize for normal buffer sizes?
> >
> > Maybe like this ?
> >
> >         if (rcvbuf < UDP_MIN_RCVBUF) {
> >                 if (rmem > rcvbuf)
> >                         goto drop;
> >         } else {
> >                 if (rmem + size > rcvbuf)
> >                         goto drop;
> >         }
> >
> > SOCK_MIN_RCVBUF has been 2K + skb since 2013, but the regression was
> > reported after that, in 2016, so UDP_MIN_RCVBUF would need to be larger ?
>
> Since the only issue is the overflow, could use a higher bound like
> INT_MAX >> 1.
>
> > But I wonder if adding new branches in the fast path is worth it for
> > the corner case, and that's why I chose integrating the cast into
> > the existing branch, allowing a small overflow, which is observable
> > only when no thread calls recv() and skbs are queued more than INT_MAX.
>
> Okay. Though it can probably be structured so that the likely path does
> not even see this?

Ah exactly, will use this and update commit message in v2.

Thanks!

>
>         if (rmem + size > rcvbuf) {
>                 if (rcvbuf > INT_MAX >> 1)
>                         goto drop;
>                 if (rmem > rcvbuf)
>                         goto drop;
>         }
>
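Putting the two ideas from this thread together (Eric's unsigned locals and
Willem's two-step branch), a standalone userspace model of the discussed check
could look like the sketch below; the names, numbers, and structure are
illustrative only and this is not the actual v2 patch:

#include <limits.h>
#include <stdio.h>

/* Models the enqueue decision with unsigned arithmetic, so an overflowed
 * counter compares as a huge value, plus a slow path that only runs when
 * the buffer is already over budget.
 */
static int should_drop(unsigned int rmem, unsigned int size, unsigned int rcvbuf)
{
	if (rmem + size > rcvbuf) {
		/* Near-INT_MAX rcvbuf: drop strictly to avoid wrapping. */
		if (rcvbuf > INT_MAX >> 1)
			return 1;
		/* Small rcvbuf: keep the "always allow one packet" slack. */
		if (rmem > rcvbuf)
			return 1;
	}
	return 0;
}

int main(void)
{
	/* Queue already holds ~INT_MAX bytes: the next packet is dropped
	 * instead of pushing the counter past INT_MAX.
	 */
	printf("%s\n", should_drop(INT_MAX - 100, 1500, INT_MAX) ? "drop" : "queue");

	/* Tiny buffer, empty queue: one oversized packet is still allowed. */
	printf("%s\n", should_drop(0, 2304, 2048) ? "drop" : "queue");
	return 0;
}

The common case (rmem + size <= rcvbuf) never evaluates the two inner checks,
which is the "likely path does not even see this" property discussed above.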
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a9bb9ce5438e..a1e60aab29b5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1735,7 +1735,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
 	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
-	if (rmem > rcvbuf)
+	if ((unsigned int)rmem > rcvbuf)
 		goto drop;
 
 	/* Under mem pressure, it might be helpful to help udp_recvmsg()
__udp_enqueue_schedule_skb() has the following condition:

  if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
          goto drop;

sk->sk_rcvbuf is initialised by net.core.rmem_default and later can
be configured by SO_RCVBUF, which is limited by net.core.rmem_max,
or SO_RCVBUFFORCE.

If we set INT_MAX to sk->sk_rcvbuf, the condition is always false
as sk->sk_rmem_alloc is also signed int.

Then, the size of the incoming skb is added to sk->sk_rmem_alloc
unconditionally.

This results in integer overflow (possibly multiple times) on
sk->sk_rmem_alloc and allows a single socket to have skb up to
net.core.udp_mem[1].

For example, if we set a large value to udp_mem[1] and INT_MAX to
sk->sk_rcvbuf and flood packets to the socket, we can see multiple
overflows:

  # cat /proc/net/sockstat | grep UDP:
  UDP: inuse 3 mem 7956736   <-- (7956736 << 12) bytes > INT_MAX * 15
                                            ^- PAGE_SHIFT
  # ss -uam
  State  Recv-Q      ...
  UNCONN -1757018048 ...     <-- flipping the sign repeatedly
         skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)

Previously, we had a boundary check for INT_MAX, which was removed by
commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").

A complete fix would be to revert it and cap the right operand by
INT_MAX:

  rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
  if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
          goto uncharge_drop;

but we do not want to add the expensive atomic_add_return() back just
for the corner case.

So, let's perform the first check as unsigned int to detect the
integer overflow.

Note that we still allow a single wraparound, which can be observed
from userspace, but it's acceptable considering it's unlikely that
no recv() is called for a long period, and the negative value will
soon flip back to positive after a few recv() calls.

  # cat /proc/net/sockstat | grep UDP:
  UDP: inuse 3 mem 524288    <-- (INT_MAX + 1) >> 12

  # ss -uam
  State  Recv-Q      ...
  UNCONN -2147482816 ...     <-- INT_MAX + 831 bytes
         skmem:(r2147484480,rb2147483646,t0,tb212992,f3264,w0,o0,bl0,d14468947)

Fixes: 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/udp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
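As a closing illustration of the wraparound described in the commit message,
here is a small standalone C model; the per-skb charge and iteration cap are
made up, and it only mimics the behaviour of a 32-bit counter on the usual
two's-complement targets, not the kernel code itself:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	const int rcvbuf = INT_MAX;		/* SO_RCVBUFFORCE'd to the max */
	const unsigned int truesize = 65536;	/* per-skb charge, made up */
	unsigned int alloc = 0;			/* models sk->sk_rmem_alloc */
	unsigned long skbs = 0;

	/* Old check: signed compare. (int)alloc can never exceed INT_MAX,
	 * so nothing is dropped and the counter silently wraps; the 100000
	 * cap only exists to end the demo.
	 */
	while ((int)alloc <= rcvbuf && skbs < 100000) {
		alloc += truesize;
		skbs++;
	}
	printf("signed check:   queued %lu skbs, counter reads %d\n",
	       skbs, (int)alloc);

	/* Patched check: unsigned compare. Once alloc crosses INT_MAX it
	 * compares as a huge value and further packets are dropped.
	 */
	alloc = 0;
	skbs = 0;
	while (alloc <= (unsigned int)rcvbuf && skbs < 100000) {
		alloc += truesize;
		skbs++;
	}
	printf("unsigned check: queued %lu skbs, counter reads %d\n",
	       skbs, (int)alloc);
	return 0;
}

The second loop stops right after crossing INT_MAX, which is the single
observable wraparound the commit message says is still allowed.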