
[v4,net-next,2/2] net: introduce budget_squeeze to help us tune rx behavior

Message ID 20230315092041.35482-3-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series add some detailed data when reading softnet_stat

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4349 this patch: 4349
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 1003 this patch: 1003
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4558 this patch: 4558
netdev/checkpatch warning WARNING: quoted string split across lines
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jason Xing March 15, 2023, 9:20 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

In our production environment, hundreds of machines often hit the old
time_squeeze limit, and from that counter alone we cannot tell what
exactly causes the issue. The limit is hit anywhere from 400 to 2000
times per second, and it is especially easy to hit when users run in a
guest OS with a veth policy configured. After several tries without
this patch, I found that it is the real time_squeeze, excluding
budget_squeeze, that hinders the receive process.

So when we encounter a related performance issue and get lost on how
to tune the budget limit and the time limit in net_rx_action(), we can
count the two cases separately to avoid the confusion.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
---
v4:
1) also avoid the inconsistency by caching variables suggested by Eric.
2) add more details about the real issue that happened on our servers,
as suggested by Jakub.

v3:
1) drop the comment suggested by Simon
Link: https://lore.kernel.org/lkml/20230314030532.9238-3-kerneljasonxing@gmail.com/

v2:
1) change the coding style suggested by Stephen and Simon
2) Keep the display of the old data (time_squeeze) untouched suggested
by Kui-Feng
Link: https://lore.kernel.org/lkml/20230311163614.92296-1-kerneljasonxing@gmail.com/
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 12 ++++++++----
 net/core/net-procfs.c     |  9 ++++++---
 3 files changed, 15 insertions(+), 7 deletions(-)

Comments

Jakub Kicinski March 17, 2023, 12:20 a.m. UTC | #1
On Wed, 15 Mar 2023 17:20:41 +0800 Jason Xing wrote:
> In our production environment, there're hundreds of machines hitting the
> old time_squeeze limit often from which we cannot tell what exactly causes
> such issues. Hitting limits aranged from 400 to 2000 times per second,
> Especially, when users are running on the guest OS with veth policy
> configured, it is relatively easier to hit the limit. After several tries
> without this patch, I found it is only real time_squeeze not including
> budget_squeeze that hinders the receive process.

That is the common case, and it can be understood from the napi trace
point and by probing the kernel with bpftrace. We should only add
uAPI for statistics which must be maintained continuously. For
investigations, tracing will always be orders of magnitude more
powerful :(
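
As an illustration of that tracing route, here is a minimal sketch
that enables the existing napi:napi_poll trace event through tracefs
and prints only the polls that consumed their full NAPI weight; the
tracefs mount point (/sys/kernel/tracing), root privileges and the
default weight of 64 are assumptions, not details from this thread.

/* Sketch: watch napi:napi_poll and print polls that used the whole weight.
 * Assumes tracefs mounted at /sys/kernel/tracing and the default NAPI
 * weight of 64. */
#include <stdio.h>
#include <string.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char line[1024];
	FILE *pipe;

	if (write_str("/sys/kernel/tracing/events/napi/napi_poll/enable", "1"))
		return 1;

	pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
	if (!pipe)
		return 1;

	/* The event prints "... work <n> budget <n>"; a poll reporting work
	 * equal to budget is the "budget consumed" case discussed here. */
	while (fgets(line, sizeof(line), pipe))
		if (strstr(line, "work 64 budget 64"))
			fputs(line, stdout);

	return 0;
}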

On the time squeeze BTW, have you found out what the problem was?
In workloads I've seen the time problems are often because of noise 
in how jiffies are accounted (cgroup code disables interrupts
for long periods of time, for example, making jiffies increment 
by 2, 3 or 4 rather than by 1).

> So when we encounter some related performance issue and then get lost on
> how to tune the budget limit and time limit in net_rx_action() function,
> we can separately counting both of them to avoid the confusion.
Jason Xing March 17, 2023, 2:27 a.m. UTC | #2
On Fri, Mar 17, 2023 at 8:20 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 15 Mar 2023 17:20:41 +0800 Jason Xing wrote:
> > In our production environment, there're hundreds of machines hitting the
> > old time_squeeze limit often from which we cannot tell what exactly causes
> > such issues. Hitting limits aranged from 400 to 2000 times per second,
> > Especially, when users are running on the guest OS with veth policy
> > configured, it is relatively easier to hit the limit. After several tries
> > without this patch, I found it is only real time_squeeze not including
> > budget_squeeze that hinders the receive process.
>
[...]
> That is the common case, and can be understood from the napi trace

Thanks for your reply. This happens every day on many servers.

> point and probing the kernel with bpftrace. We should only add

We can probably deduce (or guess) which one causes the latency,
because trace_napi_poll() only reports the budget consumed per poll.

Besides, tracing napi poll is fine on a testbed but not on heavily
loaded servers, where bpftrace-based tools capturing data from the hot
path can have a noticeable impact, especially on machines equipped
with high-speed cards, say, 100G NICs. Falling back to the legacy
softnet_stat file is comparatively feasible, based on my limited
knowledge.

Paolo also added the backlog queues to this file in 2020 (see commit
7d58e6555870d). I believe that after this patch there will be little
or no new data that needs to be printed for the next few years.

> uAPI for statistics which must be maintained contiguously. For

In this patch, I didn't touch the old data, as suggested in the
previous emails, and only split the old @time_squeeze counting into
two parts (time_squeeze and budget_squeeze). Using budget_squeeze can
help us profile the server and tune it more effectively.
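
To make that concrete, here is a tiny, purely illustrative sketch of
the decision the split enables; the numbers are made up, and the two
knobs referred to are the existing net.core.netdev_budget and
net.core.netdev_budget_usecs sysctls read by net_rx_action().

/* Sketch only: given the two counters, point at the knob most likely
 * worth raising.  The values below are illustrative. */
#include <stdio.h>

static void suggest(unsigned int time_sq, unsigned int budget_sq)
{
	if (budget_sq > time_sq)
		printf("mostly budget squeezed: consider raising "
		       "net.core.netdev_budget\n");
	else
		printf("mostly time squeezed: consider raising "
		       "net.core.netdev_budget_usecs\n");
}

int main(void)
{
	suggest(200, 1800);	/* example values read from softnet_stat */
	return 0;
}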

> investigations tracing will always be orders of magnitude more
> powerful :(


>
> On the time squeeze BTW, have you found out what the problem was?
> In workloads I've seen the time problems are often because of noise
> in how jiffies are accounted (cgroup code disables interrupts
> for long periods of time, for example, making jiffies increment
> by 2, 3 or 4 rather than by 1).

Yes! The jiffies-increment issue troubles those servers more often
than not. For a small group of servers, the budget limit is also a
problem. Sometimes we might treat the guest OS differently.

Thanks,
Jason

>
> > So when we encounter some related performance issue and then get lost on
> > how to tune the budget limit and time limit in net_rx_action() function,
> > we can separately counting both of them to avoid the confusion.
Jakub Kicinski March 17, 2023, 3:26 a.m. UTC | #3
On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> > That is the common case, and can be understood from the napi trace  
> 
> Thanks for your reply. It is commonly happening every day on many servers.

Right but the common issue is the time squeeze, not budget squeeze,
and either way the budget squeeze doesn't really matter because 
the softirq loop will call us again soon, if softirq itself is 
not scheduled out.

So if you want to monitor a meaningful event in your fleet, I think 
a better event to monitor is the number of times ksoftirqd was woken 
up and latency of it getting onto the CPU.

Did you try to measure that?
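
One rough way to approximate that measurement is sketched below; it
assumes CONFIG_SCHED_INFO is enabled so that /proc/<pid>/schedstat is
available, where the second field is the cumulative run-queue wait
time in nanoseconds.

/* Rough sketch: sample how long a given ksoftirqd thread sat runnable on
 * the run queue over one second.  Pass the ksoftirqd PID on the command
 * line (e.g. taken from "ps -e | grep ksoftirqd"). */
#include <stdio.h>
#include <unistd.h>

static int read_run_delay(const char *path, unsigned long long *delay_ns)
{
	unsigned long long on_cpu, run_delay, pcount;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%llu %llu %llu", &on_cpu, &run_delay, &pcount) != 3) {
		fclose(f);
		return -1;
	}
	fclose(f);
	*delay_ns = run_delay;
	return 0;
}

int main(int argc, char **argv)
{
	unsigned long long before, after;
	char path[64];

	if (argc != 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

	if (read_run_delay(path, &before))
		return 1;
	sleep(1);
	if (read_run_delay(path, &after))
		return 1;

	printf("ksoftirqd waited %llu ns for the CPU in the last second\n",
	       after - before);
	return 0;
}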

(Please do *not* send patches to touch softirq code right now, just
measure first. We are trying to improve the situation but the core
kernel maintainers are weary of changes:
https://lwn.net/Articles/925540/
so if both of us start sending code they will probably take neither
patches :()

> > point and probing the kernel with bpftrace. We should only add  
> 
> We probably can deduce (or guess) which one causes the latency because
> trace_napi_poll() only counts the budget consumed per poll.
> 
> Besides, tracing napi poll is totally ok with the testbed but not ok
> with those servers with heavy load which bpftrace related tools
> capturing the data from the hot path may cause some bad impact,
> especially with special cards equipped, say, 100G nic card. Resorting
> to legacy file softnet_stat is relatively feasible based on my limited
> knowledge.

Right, but we're still measuring something relatively irrelevant.
As I said the softirq loop will call us again. In my experience
network queues get long when ksoftirqd is woken up but not scheduled
for a long time. That is the source of latency. You may have the same
problem (high latency) without consuming the entire budget.

I think if we wanna make new stats we should try to come up with a way
of capturing the problem rather than one of the symptoms.

> Paolo also added backlog queues into this file in 2020 (see commit:
> 7d58e6555870d). I believe that after this patch, there are few or no
> more new data that is needed to print for the next few years.
> 
> > uAPI for statistics which must be maintained contiguously. For  
> 
> In this patch, I didn't touch the old data as suggested in the
> previous emails and only separated the old way of counting
> @time_squeeze into two parts (time_squeeze and budget_squeeze). Using
> budget_squeeze can help us profile the server and tune it more
> usefully.
> 
> > investigations tracing will always be orders of magnitude more
> > powerful :(  
> 
> > On the time squeeze BTW, have you found out what the problem was?
> > In workloads I've seen the time problems are often because of noise
> > in how jiffies are accounted (cgroup code disables interrupts
> > for long periods of time, for example, making jiffies increment
> > by 2, 3 or 4 rather than by 1).  
> 
> Yes ! The issue of jiffies increment troubles those servers more often
> than not. For a small group of servers, budget limit is also a
> problem. Sometimes we might treat guest OS differently.
Jason Xing March 17, 2023, 4:11 a.m. UTC | #4
On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> > > That is the common case, and can be understood from the napi trace
> >
> > Thanks for your reply. It is commonly happening every day on many servers.
>
> Right but the common issue is the time squeeze, not budget squeeze,

Most of them are about time, so yes.

> and either way the budget squeeze doesn't really matter because
> the softirq loop will call us again soon, if softirq itself is
> not scheduled out.
>
> So if you want to monitor a meaningful event in your fleet, I think
> a better event to monitor is the number of times ksoftirqd was woken
> up and latency of it getting onto the CPU.

It's a good point. Thanks for your advice.

>
> Did you try to measure that?
>
> (Please do *not* send patches to touch softirq code right now, just
> measure first. We are trying to improve the situation but the core
> kernel maintainers are weary of changes:
> https://lwn.net/Articles/925540/
> so if both of us start sending code they will probably take neither
> patches :()

I understand. One more thing I would like to know is about the state
of 1/2 patch.

Thanks,
Jason

>
> > > point and probing the kernel with bpftrace. We should only add
> >
> > We probably can deduce (or guess) which one causes the latency because
> > trace_napi_poll() only counts the budget consumed per poll.
> >
> > Besides, tracing napi poll is totally ok with the testbed but not ok
> > with those servers with heavy load which bpftrace related tools
> > capturing the data from the hot path may cause some bad impact,
> > especially with special cards equipped, say, 100G nic card. Resorting
> > to legacy file softnet_stat is relatively feasible based on my limited
> > knowledge.
>
> Right, but we're still measuring something relatively irrelevant.
> As I said the softirq loop will call us again. In my experience
> network queues get long when ksoftirqd is woken up but not scheduled
> for a long time. That is the source of latency. You may have the same
> problem (high latency) without consuming the entire budget.
>
> I think if we wanna make new stats we should try to come up with a way
> of capturing the problem rather than one of the symptoms.
>
> > Paolo also added backlog queues into this file in 2020 (see commit:
> > 7d58e6555870d). I believe that after this patch, there are few or no
> > more new data that is needed to print for the next few years.
> >
> > > uAPI for statistics which must be maintained contiguously. For
> >
> > In this patch, I didn't touch the old data as suggested in the
> > previous emails and only separated the old way of counting
> > @time_squeeze into two parts (time_squeeze and budget_squeeze). Using
> > budget_squeeze can help us profile the server and tune it more
> > usefully.
> >
> > > investigations tracing will always be orders of magnitude more
> > > powerful :(
> >
> > > On the time squeeze BTW, have you found out what the problem was?
> > > In workloads I've seen the time problems are often because of noise
> > > in how jiffies are accounted (cgroup code disables interrupts
> > > for long periods of time, for example, making jiffies increment
> > > by 2, 3 or 4 rather than by 1).
> >
> > Yes ! The issue of jiffies increment troubles those servers more often
> > than not. For a small group of servers, budget limit is also a
> > problem. Sometimes we might treat guest OS differently.
Jakub Kicinski March 17, 2023, 4:30 a.m. UTC | #5
On Fri, 17 Mar 2023 12:11:46 +0800 Jason Xing wrote:
> I understand. One more thing I would like to know is about the state
> of 1/2 patch.

That one seems fine, we already collect the information so we can
expose it.
Jason Xing March 18, 2023, 4 a.m. UTC | #6
On Fri, Mar 17, 2023 at 12:30 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 17 Mar 2023 12:11:46 +0800 Jason Xing wrote:
> > I understand. One more thing I would like to know is about the state
> > of 1/2 patch.
>
> That one seems fine, we already collect the information so we can
> expose it.

Thanks, I got it.
Jesper Dangaard Brouer March 20, 2023, 1:30 p.m. UTC | #7
On 17/03/2023 05.11, Jason Xing wrote:
> On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>
>> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
>>>> That is the common case, and can be understood from the napi trace
>>>
>>> Thanks for your reply. It is commonly happening every day on many servers.
>>
>> Right but the common issue is the time squeeze, not budget squeeze,
> 
> Most of them are about time, so yes.
> 
>> and either way the budget squeeze doesn't really matter because
>> the softirq loop will call us again soon, if softirq itself is
>> not scheduled out.
>>

I agree, the budget squeeze count doesn't provide much value as it
doesn't indicate something critical (softirq loop will call us again
soon).  The time squeeze event is more critical and something that is
worth monitoring.

I see value in this patch, because it makes it possible to monitor the
time squeeze events.  Currently the counter is "polluted" by the budget
squeeze, making it impossible to get a proper time squeeze signal.
Thus, I see this patch as a fix for an old problem.

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

That said (see below), besides monitoring the time squeeze counter, I
recommend adding some BPF monitoring to capture latency issues...

>> So if you want to monitor a meaningful event in your fleet, I think
>> a better event to monitor is the number of times ksoftirqd was woken
>> up and latency of it getting onto the CPU.
> 
> It's a good point. Thanks for your advice.

I'm willing to help you out by writing a BPF-based tool that can help
you identify the issue Jakub describes above: high latency from when
the softIRQ is raised until softIRQ processing runs on the CPU.

I have this bpftrace script[1] available that does just that:

  [1] 
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt

Perhaps you can take the latency histograms and then plot a heatmap[2]
in your monitoring platform.

  [2] https://www.brendangregg.com/heatmaps.html

--Jesper
Jakub Kicinski March 20, 2023, 6:46 p.m. UTC | #8
On Mon, 20 Mar 2023 14:30:27 +0100 Jesper Dangaard Brouer wrote:
> >> So if you want to monitor a meaningful event in your fleet, I think
> >> a better event to monitor is the number of times ksoftirqd was woken
> >> up and latency of it getting onto the CPU.  
> > 
> > It's a good point. Thanks for your advice.  
> 
> I'm willing to help you out writing a BPF-based tool that can help you
> identify the issue Jakub describe above. Of high latency from when
> softIRQ is raised until softIRQ processing runs on the CPU.
> 
> I have this bpftrace script[1] available that does just that:
> 
>   [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
> 
> Perhaps you can take the latency historgrams and then plot a heatmap[2]
> in your monitoring platform.
> 
>   [2] https://www.brendangregg.com/heatmaps.html

FWIW we have this little kludge of code in prod kernels:

https://github.com/kuba-moo/linux/commit/e09006bc08847a218276486817a84e38e82841a6

it tries to measure the latency from xmit to napi reaping completions.
So it covers both NIC IRQs being busted and the noise introduced by
the scheduler. Not great, those should really be separate.
Jason Xing March 21, 2023, 2:08 a.m. UTC | #9
On Mon, Mar 20, 2023 at 9:30 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 17/03/2023 05.11, Jason Xing wrote:
> > On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> >>>> That is the common case, and can be understood from the napi trace
> >>>
> >>> Thanks for your reply. It is commonly happening every day on many servers.
> >>
> >> Right but the common issue is the time squeeze, not budget squeeze,
> >
> > Most of them are about time, so yes.
> >
> >> and either way the budget squeeze doesn't really matter because
> >> the softirq loop will call us again soon, if softirq itself is
> >> not scheduled out.
> >>
>
[...]
> I agree, the budget squeeze count doesn't provide much value as it
> doesn't indicate something critical (softirq loop will call us again
> soon).  The time squeeze event is more critical and something that is
> worth monitoring.
>
> I see value in this patch, because it makes it possible monitor the time
> squeeze events.  Currently the counter is "polluted" by the budget
> squeeze, making it impossible to get a proper time squeeze signal.
> Thus, I see this patch as a fix to a old problem.
>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

Thanks for your acknowledgement. As you said, I didn't add any
functional or performance-related change, only a small change to the
previous output so that it becomes more accurate. Even though I would
like to get it merged, I have to drop this patch and resend another
one. If the maintainers really think it matters, I hope it will be
picked up someday :)

>
> That said (see below), besides monitoring time squeeze counter, I
> recommend adding some BPF monitoring to capture latency issues...
>
> >> So if you want to monitor a meaningful event in your fleet, I think
> >> a better event to monitor is the number of times ksoftirqd was woken
> >> up and latency of it getting onto the CPU.
> >
> > It's a good point. Thanks for your advice.
>
> I'm willing to help you out writing a BPF-based tool that can help you
> identify the issue Jakub describe above. Of high latency from when
> softIRQ is raised until softIRQ processing runs on the CPU.
>
> I have this bpftrace script[1] available that does just that:
>
>   [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>

A few days ago, I did the same thing with bcc tools to handle those
complicated issues. Your bt script looks much more concise. Thanks
anyway.

> Perhaps you can take the latency historgrams and then plot a heatmap[2]
> in your monitoring platform.
>
>   [2] https://www.brendangregg.com/heatmaps.html
>
> --Jesper
>
Jason Xing March 30, 2023, 9:59 a.m. UTC | #10
On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> > > That is the common case, and can be understood from the napi trace
> >
> > Thanks for your reply. It is commonly happening every day on many servers.
>
> Right but the common issue is the time squeeze, not budget squeeze,
> and either way the budget squeeze doesn't really matter because
> the softirq loop will call us again soon, if softirq itself is
> not scheduled out.
>
> So if you want to monitor a meaningful event in your fleet, I think
> a better event to monitor is the number of times ksoftirqd was woken
> up and latency of it getting onto the CPU.
>
> Did you try to measure that?
>
[...]
> (Please do *not* send patches to touch softirq code right now, just
> measure first. We are trying to improve the situation but the core
> kernel maintainers are weary of changes:
> https://lwn.net/Articles/925540/
> so if both of us start sending code they will probably take neither
> patches :()

Hello Jakub,

I'm wondering whether I can now update and resend this patch to get
better monitoring (which we really do need) of this part, since
net_rx_action() has already been touched by the RPS optimization patch
series. Also, just as Jesper mentioned before, it can be considered a
'fix' for an old problem, and targeting net-next is just fine. What do
you think?

Thanks,
Jason

>
> > > point and probing the kernel with bpftrace. We should only add
> >
> > We probably can deduce (or guess) which one causes the latency because
> > trace_napi_poll() only counts the budget consumed per poll.
> >
> > Besides, tracing napi poll is totally ok with the testbed but not ok
> > with those servers with heavy load which bpftrace related tools
> > capturing the data from the hot path may cause some bad impact,
> > especially with special cards equipped, say, 100G nic card. Resorting
> > to legacy file softnet_stat is relatively feasible based on my limited
> > knowledge.
>
> Right, but we're still measuring something relatively irrelevant.
> As I said the softirq loop will call us again. In my experience
> network queues get long when ksoftirqd is woken up but not scheduled
> for a long time. That is the source of latency. You may have the same
> problem (high latency) without consuming the entire budget.
>
> I think if we wanna make new stats we should try to come up with a way
> of capturing the problem rather than one of the symptoms.
>
> > Paolo also added backlog queues into this file in 2020 (see commit:
> > 7d58e6555870d). I believe that after this patch, there are few or no
> > more new data that is needed to print for the next few years.
> >
> > > uAPI for statistics which must be maintained contiguously. For
> >
> > In this patch, I didn't touch the old data as suggested in the
> > previous emails and only separated the old way of counting
> > @time_squeeze into two parts (time_squeeze and budget_squeeze). Using
> > budget_squeeze can help us profile the server and tune it more
> > usefully.
> >
> > > investigations tracing will always be orders of magnitude more
> > > powerful :(
> >
> > > On the time squeeze BTW, have you found out what the problem was?
> > > In workloads I've seen the time problems are often because of noise
> > > in how jiffies are accounted (cgroup code disables interrupts
> > > for long periods of time, for example, making jiffies increment
> > > by 2, 3 or 4 rather than by 1).
> >
> > Yes ! The issue of jiffies increment troubles those servers more often
> > than not. For a small group of servers, budget limit is also a
> > problem. Sometimes we might treat guest OS differently.
Jakub Kicinski March 30, 2023, 4:23 p.m. UTC | #11
On Thu, 30 Mar 2023 17:59:46 +0800 Jason Xing wrote:
> I'm wondering for now if I can update and resend this patch to have a
> better monitor (actually we do need one) on this part since we have
> touched the net_rx_action() in the rps optimization patch series?
> Also, just like Jesper mentioned before, it can be considered as one
> 'fix' to a old problem but targetting to net-next is just fine. What
> do you think about it ?

Sorry, I don't understand what you're trying to say :(
Jason Xing March 31, 2023, 12:48 a.m. UTC | #12
On Fri, Mar 31, 2023 at 12:23 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 30 Mar 2023 17:59:46 +0800 Jason Xing wrote:
> > I'm wondering for now if I can update and resend this patch to have a
> > better monitor (actually we do need one) on this part since we have
> > touched the net_rx_action() in the rps optimization patch series?
> > Also, just like Jesper mentioned before, it can be considered as one
> > 'fix' to a old problem but targetting to net-next is just fine. What
> > do you think about it ?
>
> Sorry, I don't understand what you're trying to say :(

Previously this patch was not accepted because we did not want to
touch softirqs (actually net_rx_action()). Since that code was touched
by commit [1] in recent days, I would like to ask your permission:
could I resend this patch to the mailing list? I hope we can get it
merged.

This patch can be considered a 'fix' for the old problem. It's
beneficial and harmless, I think :)

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=8b43fd3d1d7d

Thanks,
Jason
Jakub Kicinski March 31, 2023, 2:20 a.m. UTC | #13
On Fri, 31 Mar 2023 08:48:07 +0800 Jason Xing wrote:
> On Fri, Mar 31, 2023 at 12:23 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Thu, 30 Mar 2023 17:59:46 +0800 Jason Xing wrote:  
> > > I'm wondering for now if I can update and resend this patch to have a
> > > better monitor (actually we do need one) on this part since we have
> > > touched the net_rx_action() in the rps optimization patch series?
> > > Also, just like Jesper mentioned before, it can be considered as one
> > > 'fix' to a old problem but targetting to net-next is just fine. What
> > > do you think about it ?  
> >
> > Sorry, I don't understand what you're trying to say :(  
> 
> Previously this patch was not accepted because we do not want to touch
> softirqs (actually which is net_rx_action()). Since it is touched in
> the commit [1] in recent days, I would like to ask your permission:
> could I resend this patch to the mailing list? I hope we can get it
> merged.
> 
> This patch can be considered as a 'fix' to the old problem. It's
> beneficial and harmless, I think :)

The "not touching" part was about softirqs, i.e. kernel/softirq.c;
this patch was rejected because it's not useful.
Jason Xing March 31, 2023, 2:33 a.m. UTC | #14
On Fri, Mar 31, 2023 at 10:20 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 31 Mar 2023 08:48:07 +0800 Jason Xing wrote:
> > On Fri, Mar 31, 2023 at 12:23 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Thu, 30 Mar 2023 17:59:46 +0800 Jason Xing wrote:
> > > > I'm wondering for now if I can update and resend this patch to have a
> > > > better monitor (actually we do need one) on this part since we have
> > > > touched the net_rx_action() in the rps optimization patch series?
> > > > Also, just like Jesper mentioned before, it can be considered as one
> > > > 'fix' to a old problem but targetting to net-next is just fine. What
> > > > do you think about it ?
> > >
> > > Sorry, I don't understand what you're trying to say :(
> >
> > Previously this patch was not accepted because we do not want to touch
> > softirqs (actually which is net_rx_action()). Since it is touched in
> > the commit [1] in recent days, I would like to ask your permission:
> > could I resend this patch to the mailing list? I hope we can get it
> > merged.
> >
> > This patch can be considered as a 'fix' to the old problem. It's
> > beneficial and harmless, I think :)
>
> The not touching part was about softirqs which is kernel/softirq.c,
> this patch was rejected because it's not useful.

But... time_squeeze here is really vague and doesn't provide any
useful information to user-side tools, which means we cannot conclude
anything from this output when we have to trace back through a
server's history.

Right now, time_squeeze looks abandoned, yet many applications still
rely on it. That doesn't seem right :(

Thanks,
Jason

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6a14b7b11766..5736311a2133 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3157,6 +3157,7 @@  struct softnet_data {
 	/* stats */
 	unsigned int		processed;
 	unsigned int		time_squeeze;
+	unsigned int		budget_squeeze;
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 253584777101..1518a366783b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6637,6 +6637,7 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 	unsigned long time_limit = jiffies +
 		usecs_to_jiffies(READ_ONCE(netdev_budget_usecs));
 	int budget = READ_ONCE(netdev_budget);
+	bool done = false;
 	LIST_HEAD(list);
 	LIST_HEAD(repoll);
 
@@ -6644,7 +6645,7 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 	list_splice_init(&sd->poll_list, &list);
 	local_irq_enable();
 
-	for (;;) {
+	while (!done) {
 		struct napi_struct *n;
 
 		skb_defer_free_flush(sd);
@@ -6662,10 +6663,13 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 		 * Allow this to run for 2 jiffies since which will allow
 		 * an average latency of 1.5/HZ.
 		 */
-		if (unlikely(budget <= 0 ||
-			     time_after_eq(jiffies, time_limit))) {
+		if (unlikely(budget <= 0)) {
+			sd->budget_squeeze++;
+			done = true;
+		}
+		if (unlikely(time_after_eq(jiffies, time_limit))) {
 			sd->time_squeeze++;
-			break;
+			done = true;
 		}
 	}
 
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 09f7ed1a04e8..b748e85952b0 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -158,6 +158,8 @@  static int softnet_seq_show(struct seq_file *seq, void *v)
 	struct softnet_data *sd = v;
 	u32 input_qlen = softnet_input_pkt_queue_len(sd);
 	u32 process_qlen = softnet_process_queue_len(sd);
+	unsigned int budget_sq = sd->budget_squeeze;
+	unsigned int time_sq = sd->time_squeeze;
 	unsigned int flow_limit_count = 0;
 
 #ifdef CONFIG_NET_FLOW_LIMIT
@@ -176,13 +178,14 @@  static int softnet_seq_show(struct seq_file *seq, void *v)
 	 */
 	seq_printf(seq,
 		   "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x "
-		   "%08x %08x\n",
-		   sd->processed, sd->dropped, sd->time_squeeze, 0,
+		   "%08x %08x %08x %08x\n",
+		   sd->processed, sd->dropped, time_sq + budget_sq, 0,
 		   0, 0, 0, 0, /* was fastroute */
 		   0,	/* was cpu_collision */
 		   sd->received_rps, flow_limit_count,
 		   input_qlen + process_qlen, (int)seq->index,
-		   input_qlen, process_qlen);
+		   input_qlen, process_qlen,
+		   time_sq, budget_sq);
 	return 0;
 }
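
For completeness, below is a small user-space sketch of how the two
new trailing columns could be consumed. The field positions follow the
seq_printf() format above (17 hex fields with this patch, 15 without);
the program itself is only an illustration and is not part of the
series.

/* Sketch: print per-line time_squeeze/budget_squeeze from
 * /proc/net/softnet_stat.  Falls back to the combined third column on
 * kernels without this patch. */
#include <stdio.h>

int main(void)
{
	char line[512];
	int row = 0;
	FILE *f = fopen("/proc/net/softnet_stat", "r");

	if (!f) {
		perror("softnet_stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned int v[17] = { 0 };
		int n = sscanf(line,
			       "%x %x %x %x %x %x %x %x %x %x %x %x %x %x %x %x %x",
			       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
			       &v[6], &v[7], &v[8], &v[9], &v[10], &v[11],
			       &v[12], &v[13], &v[14], &v[15], &v[16]);

		if (n == 17)	/* patched kernel: separate counters at the end */
			printf("cpu line %d: time_squeeze=%u budget_squeeze=%u\n",
			       row, v[15], v[16]);
		else		/* unpatched kernel: only the combined counter */
			printf("cpu line %d: squeeze (combined)=%u\n", row, v[2]);
		row++;
	}
	fclose(f);
	return 0;
}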