[next] softirq: enable MAX_SOFTIRQ_TIME tuning with sysctl max_softirq_time_usecs
diff mbox series

Message ID f274f85a-bbb6-3e32-b293-1d5d7f27a98f@huawei.com
State New
Headers show
Series
  • [next] softirq: enable MAX_SOFTIRQ_TIME tuning with sysctl max_softirq_time_usecs
Related show

Commit Message

Zhiqiang Liu June 20, 2019, 3:14 p.m. UTC
From: Zhiqiang liu <liuzhiqiang26@huawei.com>

In __do_softirq func, MAX_SOFTIRQ_TIME was set to 2ms via experimentation by
commit c10d73671 ("softirq: reduce latencies") in 2013, which was designed
to reduce latencies for various network workloads. The key reason is that the
maximum number of microseconds in one NAPI polling cycle in net_rx_action func
was set to 2 jiffies, so different HZ settting will lead to different latencies.

However, commit 7acf8a1e8 ("Replace 2 jiffies with sysctl netdev_budget_usecs
to enable softirq tuning") adopts netdev_budget_usecs to tun maximum number of
microseconds in one NAPI polling cycle. So the latencies of net_rx_action can be
controlled by sysadmins to copy with hardware changes over time.

Correspondingly, the MAX_SOFTIRQ_TIME should be able to be tunned by sysadmins,
who knows best about hardware performance, for excepted tradeoff between latence
and fairness.

Here, we add sysctl variable max_softirq_time_usecs to replace MAX_SOFTIRQ_TIME
with 2ms default value.

Signed-off-by: Zhiqiang liu <liuzhiqiang26@huawei.com>
---
 Documentation/sysctl/kernel.txt |  7 +++++++
 kernel/softirq.c                | 10 ++++++----
 kernel/sysctl.c                 |  9 +++++++++
 3 files changed, 22 insertions(+), 4 deletions(-)

Comments

Thomas Gleixner June 23, 2019, 4:38 p.m. UTC | #1
Zhiqiang,

On Thu, 20 Jun 2019, Zhiqiang Liu wrote:

> From: Zhiqiang liu <liuzhiqiang26@huawei.com>
> 
> In __do_softirq func, MAX_SOFTIRQ_TIME was set to 2ms via experimentation by
> commit c10d73671 ("softirq: reduce latencies") in 2013, which was designed
> to reduce latencies for various network workloads. The key reason is that the
> maximum number of microseconds in one NAPI polling cycle in net_rx_action func
> was set to 2 jiffies, so different HZ settting will lead to different latencies.
> 
> However, commit 7acf8a1e8 ("Replace 2 jiffies with sysctl netdev_budget_usecs
> to enable softirq tuning") adopts netdev_budget_usecs to tun maximum number of
> microseconds in one NAPI polling cycle. So the latencies of net_rx_action can be
> controlled by sysadmins to copy with hardware changes over time.

So much for the theory. See below.

> Correspondingly, the MAX_SOFTIRQ_TIME should be able to be tunned by sysadmins,
> who knows best about hardware performance, for excepted tradeoff between latence
> and fairness.
> 
> Here, we add sysctl variable max_softirq_time_usecs to replace MAX_SOFTIRQ_TIME
> with 2ms default value.

...

>   */
> -#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
> +unsigned int __read_mostly max_softirq_time_usecs = 2000;
>  #define MAX_SOFTIRQ_RESTART 10
> 
>  #ifdef CONFIG_TRACE_IRQFLAGS
> @@ -248,7 +249,8 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }
> 
>  asmlinkage __visible void __softirq_entry __do_softirq(void)
>  {
> -	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> +	unsigned long end = jiffies +
> +		usecs_to_jiffies(max_softirq_time_usecs);

That's still jiffies based and therefore depends on CONFIG_HZ. Any budget
value will be rounded up to the next jiffie. So in case of HZ=100 and
time=1000us this will still result in 10ms of allowed loop time.

I'm not saying that we must use a more fine grained time source, but both
the changelog and the sysctl documentation are misleading.

If we keep it jiffies based, then microseconds do not make any sense. They
just give a false sense of controlability.

Keep also in mind that with jiffies the accuracy depends also on the
distance to the next tick when 'end' is evaluated. The next tick might be
imminent.

That's all information which needs to be in the documentation.

> +	{
> +		.procname	= "max_softirq_time_usecs",
> +		.data		= &max_softirq_time_usecs,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +		.extra1		= &zero,
> +	},

Zero as the lower limit? That means it allows a single loop. Fine, but
needs to be documented as well.

Thanks,

	tglx
Zhiqiang Liu June 24, 2019, 4:01 a.m. UTC | #2
在 2019/6/24 0:38, Thomas Gleixner 写道:
> Zhiqiang,
>> controlled by sysadmins to copy with hardware changes over time.
> 
> So much for the theory. See below.

Thanks for your reply.
> 
>> Correspondingly, the MAX_SOFTIRQ_TIME should be able to be tunned by sysadmins,
>> who knows best about hardware performance, for excepted tradeoff between latence
>> and fairness.
>>
>> Here, we add sysctl variable max_softirq_time_usecs to replace MAX_SOFTIRQ_TIME
>> with 2ms default value.
> 
> ...
> 
>>   */
>> -#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
>> +unsigned int __read_mostly max_softirq_time_usecs = 2000;
>>  #define MAX_SOFTIRQ_RESTART 10
>>
>>  #ifdef CONFIG_TRACE_IRQFLAGS
>> @@ -248,7 +249,8 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }
>>
>>  asmlinkage __visible void __softirq_entry __do_softirq(void)
>>  {
>> -	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>> +	unsigned long end = jiffies +
>> +		usecs_to_jiffies(max_softirq_time_usecs);
> 
> That's still jiffies based and therefore depends on CONFIG_HZ. Any budget
> value will be rounded up to the next jiffie. So in case of HZ=100 and
> time=1000us this will still result in 10ms of allowed loop time.
> 
> I'm not saying that we must use a more fine grained time source, but both
> the changelog and the sysctl documentation are misleading.
> 
> If we keep it jiffies based, then microseconds do not make any sense. They
> just give a false sense of controlability.
> 
> Keep also in mind that with jiffies the accuracy depends also on the
> distance to the next tick when 'end' is evaluated. The next tick might be
> imminent.
> 
> That's all information which needs to be in the documentation.
> 

Thanks again for your detailed advice.
As your said, the max_softirq_time_usecs setting without explaining the
relationship with CONFIG_HZ will give a false sense of controlability. And
the time accuracy of jiffies will result in a certain difference between the
max_softirq_time_usecs set value and the actual value, which is in one jiffies
range.

I will add these infomation in the sysctl documentation and changelog in v2 patch.

>> +	{
>> +		.procname	= "max_softirq_time_usecs",
>> +		.data		= &max_softirq_time_usecs,
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler   = proc_dointvec_minmax,
>> +		.extra1		= &zero,
>> +	},
> 
> Zero as the lower limit? That means it allows a single loop. Fine, but
> needs to be documented as well.
> 
> Thanks,
> 
> 	tglx
> 
> .
>
Thomas Gleixner June 24, 2019, 9:45 a.m. UTC | #3
Zhiqiang,

On Mon, 24 Jun 2019, Zhiqiang Liu wrote:
> 在 2019/6/24 0:38, Thomas Gleixner 写道:
> > If we keep it jiffies based, then microseconds do not make any sense. They
> > just give a false sense of controlability.
> > 
> > Keep also in mind that with jiffies the accuracy depends also on the
> > distance to the next tick when 'end' is evaluated. The next tick might be
> > imminent.
> > 
> > That's all information which needs to be in the documentation.
> > 
> 
> Thanks again for your detailed advice.
> As your said, the max_softirq_time_usecs setting without explaining the
> relationship with CONFIG_HZ will give a false sense of controlability. And
> the time accuracy of jiffies will result in a certain difference between the
> max_softirq_time_usecs set value and the actual value, which is in one jiffies
> range.
> 
> I will add these infomation in the sysctl documentation and changelog in v2 patch.

Please make the sysctl milliseconds based. That's the closest approximation
of useful units for this. This still has the same issues as explained
before but it's not off by 3 orders of magitude anymore.

Thanks,

	tglx
Zhiqiang Liu June 24, 2019, 1:32 p.m. UTC | #4
On 2019/6/24 17:45, Thomas Gleixner wrote:
> Zhiqiang,
> 
> On Mon, 24 Jun 2019, Zhiqiang Liu wrote:
>> 在 2019/6/24 0:38, Thomas Gleixner 写道:
>>
>> Thanks again for your detailed advice.
>> As your said, the max_softirq_time_usecs setting without explaining the
>> relationship with CONFIG_HZ will give a false sense of controlability. And
>> the time accuracy of jiffies will result in a certain difference between the
>> max_softirq_time_usecs set value and the actual value, which is in one jiffies
>> range.
>>
>> I will add these infomation in the sysctl documentation and changelog in v2 patch.
> 
> Please make the sysctl milliseconds based. That's the closest approximation
> of useful units for this. This still has the same issues as explained
> before but it's not off by 3 orders of magitude anymore.
> 
> Thanks,
> 
> 	tglx
> 
Thanks for your suggestion.
I will adopt max_softirq_time_ms to replace MAX_SOFTIRQ_TIME in v2.
Zhiqiang Liu June 25, 2019, 2:46 p.m. UTC | #5
Dear Thomas,

On 2019/6/24 17:45, Thomas Gleixner wrote:
> Zhiqiang,
> 
> On Mon, 24 Jun 2019, Zhiqiang Liu wrote:
>>
>> Thanks again for your detailed advice.
>> As your said, the max_softirq_time_usecs setting without explaining the
>> relationship with CONFIG_HZ will give a false sense of controlability. And
>> the time accuracy of jiffies will result in a certain difference between the
>> max_softirq_time_usecs set value and the actual value, which is in one jiffies
>> range.
>>
>> I will add these infomation in the sysctl documentation and changelog in v2 patch.
> 
> Please make the sysctl milliseconds based. That's the closest approximation
> of useful units for this. This still has the same issues as explained
> before but it's not off by 3 orders of magitude anymore.
>

I have a doubt about _msecs_to_jiffies funcs, especially when input m is equal to 0.

For different HZ setttings, different _msecs_to_jiffies funcs will be chosen for
msecs_to_jiffies func. However, the performance of different _msecs_to_jiffies is
inconsistent with input m is equal to 0.
If HZ satisfies the condition: HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ), the return
value of _msecs_to_jiffies func with m=0 is different with different HZ setting.
------------------------------------
| HZ |	MSEC_PER_SEC / HZ | return |
------------------------------------
|1000|		1	  |   0	   |
|500 |		2	  |   1	   |
|200 |		5	  |   1	   |
|100 |		10	  |   1	   |
------------------------------------

Why only the return value of HZ=1000 is equal to 0 with m=0 ?

Codes are given as follows,
    #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
    static inline unsigned long _msecs_to_jiffies(const unsigned int m)
    {
            return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
    }
    #elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
    static inline unsigned long _msecs_to_jiffies(const unsigned int m)
    {
            if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
                    return MAX_JIFFY_OFFSET;
            return m * (HZ / MSEC_PER_SEC);
    }
    #else
    static inline unsigned long _msecs_to_jiffies(const unsigned int m)
    {
            if (HZ > MSEC_PER_SEC && m > jiffies_to_msecs(MAX_JIFFY_OFFSET))
                    return MAX_JIFFY_OFFSET;

            return (MSEC_TO_HZ_MUL32 * m + MSEC_TO_HZ_ADJ32) >> MSEC_TO_HZ_SHR32;
    }








> Thanks,
> 
> 	tglx
>
Thomas Gleixner July 8, 2019, 2:14 p.m. UTC | #6
Zhiqiang,

On Tue, 25 Jun 2019, Zhiqiang Liu wrote:

> I have a doubt about _msecs_to_jiffies funcs, especially when input m is
> equal to 0.
>
> For different HZ setttings, different _msecs_to_jiffies funcs will be
> chosen for msecs_to_jiffies func. However, the performance of different
> _msecs_to_jiffies is inconsistent with input m is equal to 0.
>
> If HZ satisfies the condition: HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC %
> HZ), the return value of _msecs_to_jiffies func with m=0 is different
> with different HZ setting.

> ------------------------------------
> | HZ |	MSEC_PER_SEC / HZ | return |
> ------------------------------------
> |1000|		1	  |   0	   |
> |500 |		2	  |   1	   |
> |200 |		5	  |   1	   |
> |100 |		10	  |   1	   |
> ------------------------------------
> 
> Why only the return value of HZ=1000 is equal to 0 with m=0 ?

I don't know how you tested that, but obviously all four HZ values use
this variant:

>     #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
>     static inline unsigned long _msecs_to_jiffies(const unsigned int m)
>     {
>             return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
>     }

and for all four HZ values the result is 0. Why?

For m = 0 the calculation reduces to:

      ((MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)

i.e.

	(x - 1) / x	where x = [1, 2, 5, 10]

which is guaranteed to be 0 for integer math. If not, you have a compiler
problem.

Thanks,

	tglx
Zhiqiang Liu July 9, 2019, 1:25 a.m. UTC | #7
On 2019/7/8 22:14, Thomas Gleixner wrote:
> Zhiqiang,
> 
>> If HZ satisfies the condition: HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC %
>> HZ), the return value of _msecs_to_jiffies func with m=0 is different
>> with different HZ setting.
> 
>> ------------------------------------
>> | HZ |	MSEC_PER_SEC / HZ | return |
>> ------------------------------------
>> |1000|		1	  |   0	   |
>> |500 |		2	  |   1	   |
>> |200 |		5	  |   1	   |
>> |100 |		10	  |   1	   |
>> ------------------------------------
>>
>> Why only the return value of HZ=1000 is equal to 0 with m=0 ?
> 
> I don't know how you tested that, but obviously all four HZ values use
> this variant:
> 
>>     #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
>>     static inline unsigned long _msecs_to_jiffies(const unsigned int m)
>>     {
>>             return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
>>     }
> 
> and for all four HZ values the result is 0. Why?
> 
> For m = 0 the calculation reduces to:
> 
>       ((MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ)
> 
> i.e.
> 
> 	(x - 1) / x	where x = [1, 2, 5, 10]
> 
> which is guaranteed to be 0 for integer math. If not, you have a compiler
> problem.
> 
> Thanks,
> 
> 	tglx
Thanks for your reply. Actually, I have made a low-level mistake.
I am really sorry for that.
Thanks again.

Patch
diff mbox series

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index f0c86fbb3b48..647233faf896 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -44,6 +44,7 @@  show up in /proc/sys/kernel:
 - kexec_load_disabled
 - kptr_restrict
 - l2cr                        [ PPC only ]
+- max_softirq_time_usecs
 - modprobe                    ==> Documentation/debugging-modules.txt
 - modules_disabled
 - msg_next_id		      [ sysv ipc ]
@@ -445,6 +446,12 @@  This flag controls the L2 cache of G3 processor boards. If

 ==============================================================

+max_softirq_time_usecs:
+Maximum number of microseconds to break the loop of restarting softirq
+processing for at most MAX_SOFTIRQ_RESTART times in __do_softirq().
+
+==============================================================
+
 modules_disabled:

 A toggle value indicating if modules are allowed to be loaded
diff --git a/kernel/softirq.c b/kernel/softirq.c
index a6b81c6b6bff..32f93d82e2e8 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -199,8 +199,9 @@  EXPORT_SYMBOL(__local_bh_enable_ip);

 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
- * but break the loop if need_resched() is set or after 2 ms.
- * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
+ * but break the loop if need_resched() is set or after
+ * max_softirq_time_usecs usecs.
+ * The max_softirq_time_usecs provides a nice upper bound in most cases, but in
  * certain cases, such as stop_machine(), jiffies may cease to
  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
  * well to make sure we eventually return from this method.
@@ -210,7 +211,7 @@  EXPORT_SYMBOL(__local_bh_enable_ip);
  * we want to handle softirqs as soon as possible, but they
  * should not be able to lock up the box.
  */
-#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
+unsigned int __read_mostly max_softirq_time_usecs = 2000;
 #define MAX_SOFTIRQ_RESTART 10

 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -248,7 +249,8 @@  static inline void lockdep_softirq_end(bool in_hardirq) { }

 asmlinkage __visible void __softirq_entry __do_softirq(void)
 {
-	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
+	unsigned long end = jiffies +
+		usecs_to_jiffies(max_softirq_time_usecs);
 	unsigned long old_flags = current->flags;
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	struct softirq_action *h;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1beca96fb625..db4bc18f84de 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -118,6 +118,7 @@  extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+extern unsigned int max_softirq_time_usecs;

 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1276,6 +1277,14 @@  static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "max_softirq_time_usecs",
+		.data		= &max_softirq_time_usecs,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
 	{ }
 };