Message ID | 20230221110344.82818-1-kerneljasonxing@gmail.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | [net] udp: fix memory schedule error
On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
>
> Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> and sk_rmem_schedule() errors"):
>
> "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> we want to allocate 1 byte more (rounded up to one page),
> instead of 150001"

I'm wondering if this would cause measurable (even small) performance
regression? Specifically under high packet rate, with BH and user-space
processing happening on different CPUs.

Could you please provide the relevant performance figures?

Thanks!

Paolo
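To make the quoted arithmetic concrete, here is a small user-space sketch of the two accounting schemes. mem_pages() below is a stand-in for the kernel's sk_mem_pages(), 4K pages are assumed, and the numbers are the ones from the quoted commit message.

```c
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1 << PAGE_SHIFT)

/* Stand-in for the kernel's sk_mem_pages(): whole pages needed to cover size. */
static int mem_pages(int size)
{
	return (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

int main(void)
{
	int forward_alloc = 150000;	/* bytes already reserved on the socket */
	int size = 150001;		/* truesize of the incoming skb */

	/* Old UDP path: round the whole skb size up to pages. */
	int old_charge = mem_pages(size) << PAGE_SHIFT;

	/* Patched path: only round up the shortfall. */
	int new_charge = mem_pages(size - forward_alloc) << PAGE_SHIFT;

	printf("old charge: %d bytes, new charge: %d bytes\n",
	       old_charge, new_charge);	/* prints 151552 vs 4096 */
	return 0;
}
```

With sk_forward_alloc at 150000 and a 150001-byte skb, the old path charges 37 pages (151552 bytes) while the patched path charges a single page (4096 bytes).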
On Tue, Feb 21, 2023 at 1:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > and sk_rmem_schedule() errors"):
> >
> > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > we want to allocate 1 byte more (rounded up to one page),
> > instead of 150001"
>
> I'm wondering if this would cause measurable (even small) performance
> regression? Specifically under high packet rate, with BH and user-space
> processing happening on different CPUs.
>
> Could you please provide the relevant performance figures?
>
> Thanks!
>
> Paolo
>

Probably not a big deal.

TCP skb truesize can easily reach 180 KB, but for UDP it's 99% below
or close to a 4K page.

I doubt this change makes any difference for UDP.
On Tue, Feb 21, 2023 at 8:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > and sk_rmem_schedule() errors"):
> >
> > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > we want to allocate 1 byte more (rounded up to one page),
> > instead of 150001"
>
> I'm wondering if this would cause measurable (even small) performance
> regression? Specifically under high packet rate, with BH and user-space
> processing happening on different CPUs.
>
> Could you please provide the relevant performance figures?

Sure, I've done some basic tests on my machine as below.

Environment: 16 cpus, 60G memory
Server: run "iperf3 -s -p [port]" command and start 500 processes.
Client: run "iperf3 -u -c 127.0.0.1 -p [port]" command and start 500 processes.

Running these tests keeps the utilization of every CPU above 15%, as
observed with the top command.

Here are some before/after numbers collected with the "sar -n DEV 10 2"
command.
Before: rxpck/s 2000, txpck/s 2000, rxkB/s 64054.69, txkB/s 64054.69
After:  rxpck/s 2000, txpck/s 2000, rxkB/s 64054.58, txkB/s 64054.58

So I don't see much impact on the results. In theory, I have no idea
why it would cause any regression. Maybe the memory allocation is not
as generous as in the original code?

Thanks,
Jason

>
> Thanks!
>
> Paolo
>
On Tue, Feb 21, 2023 at 8:35 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Feb 21, 2023 at 1:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > > and sk_rmem_schedule() errors"):
> > >
> > > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > > we want to allocate 1 byte more (rounded up to one page),
> > > instead of 150001"
> >
> > I'm wondering if this would cause measurable (even small) performance
> > regression? Specifically under high packet rate, with BH and user-space
> > processing happening on different CPUs.
> >
> > Could you please provide the relevant performance figures?
> >
> > Thanks!
> >
> > Paolo
> >
>
> Probably not a big deal.
>
> TCP skb truesize can easily reach 180 KB, but for UDP it's 99% below
> or close to a 4K page.

Yes.

>
> I doubt this change makes any difference for UDP.

Based on my understanding of this part, it should neither cause much
regression nor improve performance much. I think what you've done in
the TCP stack is the right way to go, so UDP can probably follow it.
Charging the extra memory is a little odd because we don't actually
need that much every time we receive an skb.

Thanks,
Jason
On Tue, 2023-02-21 at 21:39 +0800, Jason Xing wrote:
> On Tue, Feb 21, 2023 at 8:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > > and sk_rmem_schedule() errors"):
> > >
> > > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > > we want to allocate 1 byte more (rounded up to one page),
> > > instead of 150001"
> >
> > I'm wondering if this would cause measurable (even small) performance
> > regression? Specifically under high packet rate, with BH and user-space
> > processing happening on different CPUs.
> >
> > Could you please provide the relevant performance figures?
>
> Sure, I've done some basic tests on my machine as below.
>
> Environment: 16 cpus, 60G memory
> Server: run "iperf3 -s -p [port]" command and start 500 processes.
> Client: run "iperf3 -u -c 127.0.0.1 -p [port]" command and start 500 processes.

Just for the record, with the above command each process will send
packets at 1 Mb/s - not very relevant performance-wise.

Instead you could do:

taskset 0x2 iperf -s &
iperf -u -c 127.0.0.1 -b 0 -l 64

> In theory, I have no idea why it would cause any regression. Maybe
> the memory allocation is not as generous as in the original code?

As Eric noted, for UDP traffic, due to the expected average packet
size, sk_forward_alloc is touched quite frequently, both with and
without this patch, so there is little chance it will have any
performance impact.

Cheers,

Paolo
On Tue, Feb 21, 2023 at 10:46 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Tue, 2023-02-21 at 21:39 +0800, Jason Xing wrote:
> > On Tue, Feb 21, 2023 at 8:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > > > and sk_rmem_schedule() errors"):
> > > >
> > > > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > > > we want to allocate 1 byte more (rounded up to one page),
> > > > instead of 150001"
> > >
> > > I'm wondering if this would cause measurable (even small) performance
> > > regression? Specifically under high packet rate, with BH and user-space
> > > processing happening on different CPUs.
> > >
> > > Could you please provide the relevant performance figures?
> >
> > Sure, I've done some basic tests on my machine as below.
> >
> > Environment: 16 cpus, 60G memory
> > Server: run "iperf3 -s -p [port]" command and start 500 processes.
> > Client: run "iperf3 -u -c 127.0.0.1 -p [port]" command and start 500 processes.
>
> Just for the record, with the above command each process will send
> packets at 1 Mb/s - not very relevant performance-wise.
>
> Instead you could do:
>
> taskset 0x2 iperf -s &
> iperf -u -c 127.0.0.1 -b 0 -l 64
>

Thanks for your guidance.

Here are some numbers according to what you suggested, which I tested
several times.

----------|IFACE   rxpck/s    txpck/s    rxkB/s    txkB/s
Before:     lo   411073.41  411073.41  36932.38  36932.38
After:      lo   410308.73  410308.73  36863.81  36863.81

The above is one of many results and does not mean that the original
code absolutely outperforms the patched one. The output is not that
constant and stable, I think.

Please help me review those numbers.

>
> > In theory, I have no idea why it would cause any regression. Maybe
> > the memory allocation is not as generous as in the original code?
>
> As Eric noted, for UDP traffic, due to the expected average packet
> size, sk_forward_alloc is touched quite frequently, both with and
> without this patch, so there is little chance it will have any
> performance impact.

Well, I see.

Thanks,
Jason

>
> Cheers,
>
> Paolo
>
On Tue, Feb 21, 2023 at 11:46 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Feb 21, 2023 at 10:46 PM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Tue, 2023-02-21 at 21:39 +0800, Jason Xing wrote:
> > > On Tue, Feb 21, 2023 at 8:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > > > > and sk_rmem_schedule() errors"):
> > > > >
> > > > > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > > > > we want to allocate 1 byte more (rounded up to one page),
> > > > > instead of 150001"
> > > >
> > > > I'm wondering if this would cause measurable (even small) performance
> > > > regression? Specifically under high packet rate, with BH and user-space
> > > > processing happening on different CPUs.
> > > >
> > > > Could you please provide the relevant performance figures?
> > >
> > > Sure, I've done some basic tests on my machine as below.
> > >
> > > Environment: 16 cpus, 60G memory
> > > Server: run "iperf3 -s -p [port]" command and start 500 processes.
> > > Client: run "iperf3 -u -c 127.0.0.1 -p [port]" command and start 500 processes.
> >
> > Just for the record, with the above command each process will send
> > packets at 1 Mb/s - not very relevant performance-wise.
> >
> > Instead you could do:
> >
> > taskset 0x2 iperf -s &
> > iperf -u -c 127.0.0.1 -b 0 -l 64
> >
>
> Thanks for your guidance.
>
> Here are some numbers according to what you suggested, which I tested
> several times.
>
> ----------|IFACE   rxpck/s    txpck/s    rxkB/s    txkB/s
> Before:     lo   411073.41  411073.41  36932.38  36932.38
> After:      lo   410308.73  410308.73  36863.81  36863.81
>
> The above is one of many results and does not mean that the original
> code absolutely outperforms the patched one. The output is not that
> constant and stable, I think.

Today I ran the same test on other servers and it looks the same as
above. The results fluctuate within ~2%.

Oh, one more thing I forgot to mention is the output of iperf itself,
which doesn't show any difference:
Before: Bitrate is 211 - 212 Mbits/sec
After:  Bitrate is 211 - 212 Mbits/sec
So this result is relatively constant, especially if we keep the test
running for over 2 minutes.

Jason

>
> Please help me review those numbers.
>
> >
> > > In theory, I have no idea why it would cause any regression. Maybe
> > > the memory allocation is not as generous as in the original code?
> >
> > As Eric noted, for UDP traffic, due to the expected average packet
> > size, sk_forward_alloc is touched quite frequently, both with and
> > without this patch, so there is little chance it will have any
> > performance impact.
>
> Well, I see.
>
> Thanks,
> Jason
>
> >
> > Cheers,
> >
> > Paolo
> >
On Wed, 2023-02-22 at 11:47 +0800, Jason Xing wrote:
> On Tue, Feb 21, 2023 at 11:46 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Tue, Feb 21, 2023 at 10:46 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On Tue, 2023-02-21 at 21:39 +0800, Jason Xing wrote:
> > > > On Tue, Feb 21, 2023 at 8:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > >
> > > > > On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > > > > > and sk_rmem_schedule() errors"):
> > > > > >
> > > > > > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > > > > > we want to allocate 1 byte more (rounded up to one page),
> > > > > > instead of 150001"
> > > > >
> > > > > I'm wondering if this would cause measurable (even small) performance
> > > > > regression? Specifically under high packet rate, with BH and user-space
> > > > > processing happening on different CPUs.
> > > > >
> > > > > Could you please provide the relevant performance figures?
> > > >
> > > > Sure, I've done some basic tests on my machine as below.
> > > >
> > > > Environment: 16 cpus, 60G memory
> > > > Server: run "iperf3 -s -p [port]" command and start 500 processes.
> > > > Client: run "iperf3 -u -c 127.0.0.1 -p [port]" command and start 500 processes.
> > >
> > > Just for the record, with the above command each process will send
> > > packets at 1 Mb/s - not very relevant performance-wise.
> > >
> > > Instead you could do:
> > >
> > > taskset 0x2 iperf -s &
> > > iperf -u -c 127.0.0.1 -b 0 -l 64
> > >
> >
> > Thanks for your guidance.
> >
> > Here are some numbers according to what you suggested, which I tested
> > several times.
> >
> > ----------|IFACE   rxpck/s    txpck/s    rxkB/s    txkB/s
> > Before:     lo   411073.41  411073.41  36932.38  36932.38
> > After:      lo   410308.73  410308.73  36863.81  36863.81
> >
> > The above is one of many results and does not mean that the original
> > code absolutely outperforms the patched one. The output is not that
> > constant and stable, I think.
>
> Today I ran the same test on other servers and it looks the same as
> above. The results fluctuate within ~2%.
>
> Oh, one more thing I forgot to mention is the output of iperf itself,
> which doesn't show any difference:
> Before: Bitrate is 211 - 212 Mbits/sec
> After:  Bitrate is 211 - 212 Mbits/sec
> So this result is relatively constant, especially if we keep the test
> running for over 2 minutes.

Thanks for the testing. My personal take on this one is that it is
more a refactor than a bug fix, as the amount of forward-allocated
memory should always be negligible for UDP.

Still, it could make sense to keep the accounting scheme consistent
across different protocols. I suggest reposting for net-next when it
re-opens, additionally introducing __sk_mem_schedule() usage to avoid
code duplication.

Thanks,

Paolo
On Thu, Feb 23, 2023 at 4:39 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2023-02-22 at 11:47 +0800, Jason Xing wrote:
> > On Tue, Feb 21, 2023 at 11:46 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Tue, Feb 21, 2023 at 10:46 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > >
> > > > On Tue, 2023-02-21 at 21:39 +0800, Jason Xing wrote:
> > > > > On Tue, Feb 21, 2023 at 8:27 PM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, 2023-02-21 at 19:03 +0800, Jason Xing wrote:
> > > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > > >
> > > > > > > Quoting from the commit 7c80b038d23e ("net: fix sk_wmem_schedule()
> > > > > > > and sk_rmem_schedule() errors"):
> > > > > > >
> > > > > > > "If sk->sk_forward_alloc is 150000, and we need to schedule 150001 bytes,
> > > > > > > we want to allocate 1 byte more (rounded up to one page),
> > > > > > > instead of 150001"
> > > > > >
> > > > > > I'm wondering if this would cause measurable (even small) performance
> > > > > > regression? Specifically under high packet rate, with BH and user-space
> > > > > > processing happening on different CPUs.
> > > > > >
> > > > > > Could you please provide the relevant performance figures?
> > > > >
> > > > > Sure, I've done some basic tests on my machine as below.
> > > > >
> > > > > Environment: 16 cpus, 60G memory
> > > > > Server: run "iperf3 -s -p [port]" command and start 500 processes.
> > > > > Client: run "iperf3 -u -c 127.0.0.1 -p [port]" command and start 500 processes.
> > > >
> > > > Just for the record, with the above command each process will send
> > > > packets at 1 Mb/s - not very relevant performance-wise.
> > > >
> > > > Instead you could do:
> > > >
> > > > taskset 0x2 iperf -s &
> > > > iperf -u -c 127.0.0.1 -b 0 -l 64
> > > >
> > >
> > > Thanks for your guidance.
> > >
> > > Here are some numbers according to what you suggested, which I tested
> > > several times.
> > >
> > > ----------|IFACE   rxpck/s    txpck/s    rxkB/s    txkB/s
> > > Before:     lo   411073.41  411073.41  36932.38  36932.38
> > > After:      lo   410308.73  410308.73  36863.81  36863.81
> > >
> > > The above is one of many results and does not mean that the original
> > > code absolutely outperforms the patched one. The output is not that
> > > constant and stable, I think.
> >
> > Today I ran the same test on other servers and it looks the same as
> > above. The results fluctuate within ~2%.
> >
> > Oh, one more thing I forgot to mention is the output of iperf itself,
> > which doesn't show any difference:
> > Before: Bitrate is 211 - 212 Mbits/sec
> > After:  Bitrate is 211 - 212 Mbits/sec
> > So this result is relatively constant, especially if we keep the test
> > running for over 2 minutes.
>
> Thanks for the testing. My personal take on this one is that it is
> more a refactor than a bug fix, as the amount of forward-allocated
> memory should always be negligible for UDP.
>
> Still, it could make sense to keep the accounting scheme consistent
> across different protocols. I suggest reposting for net-next when it
> re-opens, additionally introducing __sk_mem_schedule() usage to avoid
> code duplication.
>

Thanks for the review. I will replace this part with __sk_mem_schedule()
and then repost it after Mar 6th.

Thanks,
Jason

> Thanks,
>
> Paolo
>
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9592fe3e444a..a13f622cfa36 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1567,16 +1567,16 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 		goto uncharge_drop;
 
 	spin_lock(&list->lock);
-	if (size >= sk->sk_forward_alloc) {
-		amt = sk_mem_pages(size);
-		delta = amt << PAGE_SHIFT;
+	if (size > sk->sk_forward_alloc) {
+		delta = size - sk->sk_forward_alloc;
+		amt = sk_mem_pages(delta);
 		if (!__sk_mem_raise_allocated(sk, delta, amt, SK_MEM_RECV)) {
 			err = -ENOBUFS;
 			spin_unlock(&list->lock);
 			goto uncharge_drop;
 		}
 
-		sk->sk_forward_alloc += delta;
+		sk->sk_forward_alloc += amt << PAGE_SHIFT;
 	}
 
 	sk->sk_forward_alloc -= size;
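For reference, below is a minimal sketch of what Paolo's __sk_mem_schedule() suggestion could look like for the hunk above. It assumes __sk_mem_schedule() keeps its current semantics (it rounds the requested size up to whole pages, adjusts sk->sk_forward_alloc itself, and returns non-zero on success); this is an illustration of the suggestion, not the actual follow-up patch.

```c
	spin_lock(&list->lock);
	if (size > sk->sk_forward_alloc) {
		/* Let the core helper round the shortfall up to whole pages,
		 * charge it against the protocol memory accounting and bump
		 * sk->sk_forward_alloc on success.
		 */
		if (!__sk_mem_schedule(sk, size - sk->sk_forward_alloc,
				       SK_MEM_RECV)) {
			err = -ENOBUFS;
			spin_unlock(&list->lock);
			goto uncharge_drop;
		}
	}

	sk->sk_forward_alloc -= size;
```

The error handling stays the same; only the open-coded page rounding and sk_forward_alloc update move into the helper.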