mbox series

[0/6] sched,delayacct: Some cleanups

Message ID 20210505105940.190490250@infradead.org (mailing list archive)
Headers show
Series sched,delayacct: Some cleanups | expand

Message

Peter Zijlstra May 5, 2021, 10:59 a.m. UTC
Hi,

Due to:

  https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com

and general principle, delayacct really shouldn't be using ktime (pvclock also
really shouldn't be doing what it does, but that's another story). This lead me
to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.

The rest of the patches are an attempt at simplifying all that a little. All
that crud is enabled by default for distros which is leading to a death by a
thousand cuts.

The last patch is an attempt at default disabling DELAYACCT, because I don't
think anybody actually uses that much, but what do I know, there were no ill
effects on my testbox. Perhaps we should mirror
/proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
frobbing.

Comments

Balbir Singh May 5, 2021, 10:29 p.m. UTC | #1
On Wed, May 05, 2021 at 12:59:40PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> Due to:
> 
>   https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com
> 
> and general principle, delayacct really shouldn't be using ktime (pvclock also
> really shouldn't be doing what it does, but that's another story). This lead me
> to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.
> 
> The rest of the patches are an attempt at simplifying all that a little. All
> that crud is enabled by default for distros which is leading to a death by a
> thousand cuts.
> 
> The last patch is an attempt at default disabling DELAYACCT, because I don't
> think anybody actually uses that much, but what do I know, there were no ill
> effects on my testbox. Perhaps we should mirror
> /proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
> frobbing.
>

There are tools like iotop that use delayacct to display information. When the
code was checked in, we did run SPEC* back in the day 2006 to find overheads,
nothing significant showed. Do we have any date on the overhead your seeing?

I'll look at the patches

Balbir Singh.
Peter Zijlstra May 6, 2021, 9:13 a.m. UTC | #2
On Thu, May 06, 2021 at 08:29:40AM +1000, Balbir Singh wrote:
> On Wed, May 05, 2021 at 12:59:40PM +0200, Peter Zijlstra wrote:
> > Hi,
> > 
> > Due to:
> > 
> >   https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com
> > 
> > and general principle, delayacct really shouldn't be using ktime (pvclock also
> > really shouldn't be doing what it does, but that's another story). This lead me
> > to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.
> > 
> > The rest of the patches are an attempt at simplifying all that a little. All
> > that crud is enabled by default for distros which is leading to a death by a
> > thousand cuts.
> > 
> > The last patch is an attempt at default disabling DELAYACCT, because I don't
> > think anybody actually uses that much, but what do I know, there were no ill
> > effects on my testbox. Perhaps we should mirror
> > /proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
> > frobbing.
> >
> 
> There are tools like iotop that use delayacct to display information. 

Right, but how many actual people use that? Does that justify saddling
the whole sodding world with the overhead?

> When the
> code was checked in, we did run SPEC* back in the day 2006 to find overheads,
> nothing significant showed. Do we have any date on the overhead your seeing?

I've not looked, but having it disabled saves that per-task allocation
and that spinlock in delayacct_end() for iowait wakeups and a bunch of
cache misses ofcourse.

I doubt SPEC is a benchmark that tickles those paths much if at all.

The thing is; we can't just keep growing more and more stats, that'll
kill us quite dead.
Thomas Gleixner May 7, 2021, 9:05 a.m. UTC | #3
On Wed, May 05 2021 at 12:59, Peter Zijlstra wrote:
> Due to:
>
>   https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com
>
> and general principle, delayacct really shouldn't be using ktime (pvclock also
> really shouldn't be doing what it does, but that's another story). This lead me
> to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.
>
> The rest of the patches are an attempt at simplifying all that a little. All
> that crud is enabled by default for distros which is leading to a death by a
> thousand cuts.
>
> The last patch is an attempt at default disabling DELAYACCT, because I don't
> think anybody actually uses that much, but what do I know, there were no ill
> effects on my testbox. Perhaps we should mirror
> /proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
> frobbing.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Balbir Singh May 7, 2021, 12:38 p.m. UTC | #4
On Thu, May 06, 2021 at 11:13:52AM +0200, Peter Zijlstra wrote:
> On Thu, May 06, 2021 at 08:29:40AM +1000, Balbir Singh wrote:
> > On Wed, May 05, 2021 at 12:59:40PM +0200, Peter Zijlstra wrote:
> > > Hi,
> > > 
> > > Due to:
> > > 
> > >   https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com
> > > 
> > > and general principle, delayacct really shouldn't be using ktime (pvclock also
> > > really shouldn't be doing what it does, but that's another story). This lead me
> > > to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.
> > > 
> > > The rest of the patches are an attempt at simplifying all that a little. All
> > > that crud is enabled by default for distros which is leading to a death by a
> > > thousand cuts.
> > > 
> > > The last patch is an attempt at default disabling DELAYACCT, because I don't
> > > think anybody actually uses that much, but what do I know, there were no ill
> > > effects on my testbox. Perhaps we should mirror
> > > /proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
> > > frobbing.
> > >
> > 
> > There are tools like iotop that use delayacct to display information. 
> 
> Right, but how many actual people use that? Does that justify saddling
> the whole sodding world with the overhead?
>

Not sure I have that data.
 
> > When the
> > code was checked in, we did run SPEC* back in the day 2006 to find overheads,
> > nothing significant showed. Do we have any date on the overhead your seeing?
> 
> I've not looked, but having it disabled saves that per-task allocation
> and that spinlock in delayacct_end() for iowait wakeups and a bunch of
> cache misses ofcourse.
> 
> I doubt SPEC is a benchmark that tickles those paths much if at all.
> 
> The thing is; we can't just keep growing more and more stats, that'll
> kill us quite dead.


I don't disagree, we've had these around for a while and I know of users
that use these stats to find potential starvation. I am OK with default
disabled. I suspect distros will have the final say.

Balbir Singh.
Ingo Molnar May 10, 2021, 7:08 a.m. UTC | #5
* Peter Zijlstra <peterz@infradead.org> wrote:

> Hi,
> 
> Due to:
> 
>   https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com
> 
> and general principle, delayacct really shouldn't be using ktime (pvclock also
> really shouldn't be doing what it does, but that's another story). This lead me
> to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.
> 
> The rest of the patches are an attempt at simplifying all that a little. All
> that crud is enabled by default for distros which is leading to a death by a
> thousand cuts.
> 
> The last patch is an attempt at default disabling DELAYACCT, because I don't
> think anybody actually uses that much, but what do I know, there were no ill
> effects on my testbox. Perhaps we should mirror
> /proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
> frobbing.

Reviewed-by: Ingo Molnar <mingo@kernel.org>

(This includes the #6 RFC patch.)

Thanks,

	Ingo
Mel Gorman May 12, 2021, 11:34 a.m. UTC | #6
On Fri, May 07, 2021 at 10:38:10PM +1000, Balbir Singh wrote:
> On Thu, May 06, 2021 at 11:13:52AM +0200, Peter Zijlstra wrote:
> > On Thu, May 06, 2021 at 08:29:40AM +1000, Balbir Singh wrote:
> > > On Wed, May 05, 2021 at 12:59:40PM +0200, Peter Zijlstra wrote:
> > > > Hi,
> > > > 
> > > > Due to:
> > > > 
> > > >   https://lkml.kernel.org/r/0000000000001d43ac05c0f5c6a0@google.com
> > > > 
> > > > and general principle, delayacct really shouldn't be using ktime (pvclock also
> > > > really shouldn't be doing what it does, but that's another story). This lead me
> > > > to looking at the SCHED_INFO, SCHEDSTATS, DELAYACCT (and PSI) accounting hell.
> > > > 
> > > > The rest of the patches are an attempt at simplifying all that a little. All
> > > > that crud is enabled by default for distros which is leading to a death by a
> > > > thousand cuts.
> > > > 
> > > > The last patch is an attempt at default disabling DELAYACCT, because I don't
> > > > think anybody actually uses that much, but what do I know, there were no ill
> > > > effects on my testbox. Perhaps we should mirror
> > > > /proc/sys/kernel/sched_schedstats and provide a delayacct sysctl for runtime
> > > > frobbing.
> > > >
> > > 
> > > There are tools like iotop that use delayacct to display information. 
> > 
> > Right, but how many actual people use that? Does that justify saddling
> > the whole sodding world with the overhead?
> >
> 
> Not sure I have that data.
>  

It's used occasionally as a debugging tool when looking at IO problems
in general. Like sched_schedstats, it seems unnecesary to incur overhead
unless someone is debugging.

Minimally disabling had a small positive impact -- tbench 0-8% gain
depending on thread count and machine, positive impact in general to
specjbb2005, neutral on hackbench and most of the other workloads checked.

> > > When the
> > > code was checked in, we did run SPEC* back in the day 2006 to find overheads,
> > > nothing significant showed. Do we have any date on the overhead your seeing?
> > 
> > I've not looked, but having it disabled saves that per-task allocation
> > and that spinlock in delayacct_end() for iowait wakeups and a bunch of
> > cache misses ofcourse.
> > 
> > I doubt SPEC is a benchmark that tickles those paths much if at all.
> > 
> > The thing is; we can't just keep growing more and more stats, that'll
> > kill us quite dead.
> 
> 
> I don't disagree, we've had these around for a while and I know of users
> that use these stats to find potential starvation. I am OK with default
> disabled. I suspect distros will have the final say.
> 

I think default disabled should be fine. At worst when dealing with a bug
there would be a need to either reboot or enable at runtime with patch
7 included and add that instruction to the bug report when requesting
iotop data. At worst, a distro could revert the patch if iotop generated
too many bug reports or patch iotop in the distro package.

Alternatively, I've added Paul Wise to the cc who is the latest
committer to iotop.  Maybe he knows who could add/commit a check for
sysctl.sched_delayacct and if it exists then check if it's 1 and display
an error suggesting corrective action (add delayacct to the kernel command
line or sysctl sched.sched_delayacct=1). iotop appears to be in maintenance
mode but gets occasional commits even if it has not had a new version
since 2013 so maybe it could get a 0.7 tag if such a check was added.
Peter Zijlstra May 12, 2021, 11:38 a.m. UTC | #7
On Wed, May 12, 2021 at 12:34:19PM +0100, Mel Gorman wrote:
> > I don't disagree, we've had these around for a while and I know of users
> > that use these stats to find potential starvation. I am OK with default
> > disabled. I suspect distros will have the final say.
> > 
> 
> I think default disabled should be fine. At worst when dealing with a bug
> there would be a need to either reboot or enable at runtime with patch
> 7 included and add that instruction to the bug report when requesting
> iotop data. At worst, a distro could revert the patch if iotop generated
> too many bug reports or patch iotop in the distro package.
> 
> Alternatively, I've added Paul Wise to the cc who is the latest
> committer to iotop.  Maybe he knows who could add/commit a check for
> sysctl.sched_delayacct and if it exists then check if it's 1 and display
> an error suggesting corrective action (add delayacct to the kernel command
> line or sysctl sched.sched_delayacct=1). iotop appears to be in maintenance
> mode but gets occasional commits even if it has not had a new version
> since 2013 so maybe it could get a 0.7 tag if such a check was added.

In the final commit I made the knob be known as task_delayacct. I
figured that was a slightly better name.

But yes, a check in iotop, if that really is the sole user of all this,
would be very nice.

Thanks!
Paul Wise May 12, 2021, 12:23 p.m. UTC | #8
On Wed, 2021-05-12 at 12:34 +0100, Mel Gorman wrote:

> Alternatively, I've added Paul Wise to the cc who is the latest
> committer to iotop.  Maybe he knows who could add/commit a check for
> sysctl.sched_delayacct and if it exists then check if it's 1 and display
> an error suggesting corrective action (add delayacct to the kernel command
> line or sysctl sched.sched_delayacct=1). iotop appears to be in maintenance
> mode but gets occasional commits even if it has not had a new version
> since 2013 so maybe it could get a 0.7 tag if such a check was added.

I am able to commit to the iotop repository but I don't have permission
from the author to make releases nor do I have access to the website.

I am happy to apply any patches anyone has for iotop and upload the
result to Debian, or I'll be willing to write patches to cope with
changes in Linux behaviour, if given a succinct explanation of what
changes are needed in iotop, once the Linux changes are fully merged.

As well as the Python iotop implementation, there is one written in C 
with more features, so please also file an issue or pull request there.
Please note that I don't have commit access to that repository though.

https://github.com/Tomas-M/iotop
Mel Gorman May 12, 2021, 1 p.m. UTC | #9
On Wed, May 12, 2021 at 08:23:51PM +0800, Paul Wise wrote:
> On Wed, 2021-05-12 at 12:34 +0100, Mel Gorman wrote:
> 
> > Alternatively, I've added Paul Wise to the cc who is the latest
> > committer to iotop.  Maybe he knows who could add/commit a check for
> > sysctl.sched_delayacct and if it exists then check if it's 1 and display
> > an error suggesting corrective action (add delayacct to the kernel command
> > line or sysctl sched.sched_delayacct=1). iotop appears to be in maintenance
> > mode but gets occasional commits even if it has not had a new version
> > since 2013 so maybe it could get a 0.7 tag if such a check was added.
> 
> I am able to commit to the iotop repository but I don't have permission
> from the author to make releases nor do I have access to the website.
> 
> I am happy to apply any patches anyone has for iotop and upload the
> result to Debian, or I'll be willing to write patches to cope with
> changes in Linux behaviour, if given a succinct explanation of what
> changes are needed in iotop, once the Linux changes are fully merged.
> 

If you send me the same patch, I can do submit a request to the devel
package for openSUSE. I don't have commit access but I would be surprised
if the package maintainer didn't accept the request. Obviously, I'll
build+boot a kernel that includes the final version of this series in
case of any naming changes or other oddities.

> As well as the Python iotop implementation, there is one written in C 
> with more features, so please also file an issue or pull request there.
> Please note that I don't have commit access to that repository though.
> 

Good thinking. I'll open a bug on github when I've tested your iotop
patch so that the bug report is more coherent.

Thanks for the quick response.
Paul Wise May 13, 2021, 1:29 a.m. UTC | #10
On Wed, 2021-05-12 at 14:00 +0100, Mel Gorman wrote:

> If you send me the same patch, I can do submit a request to the devel
> package for openSUSE. I don't have commit access but I would be surprised
> if the package maintainer didn't accept the request. Obviously, I'll
> build+boot a kernel that includes the final version of this series in
> case of any naming changes or other oddities.

At this point I'm not clear exactly what needs to be done and whether
or not the details have been nailed down enough that it is time to
commit the change to the iotop-py and iotop-c git repositories.

I recommend upgrading the openSUSE iotop package to the latest git
commit rather than just applying the latest patch on top.

Alternatively, once the patch is applied I can probably overstep my
permissions and add a tag to the iotop-py git repository, in case folks
are happy to pull from the git repository instead of the website.

> Good thinking. I'll open a bug on github when I've tested your iotop
> patch so that the bug report is more coherent.

OK, sounds good.

PS: does Linux have a facility for userspace processes to convert
syscall names to numbers for the currently running Linux kernel? I
noticed that iotop-py just hard-codes the syscall numbers for
ioprio_set and ioprio_get on common arches, missing newer arches.
Paul Wise June 25, 2021, 12:50 a.m. UTC | #11
On Wed, 2021-05-12 at 12:34 +0100, Mel Gorman wrote:

> Alternatively, I've added Paul Wise to the cc who is the latest
> committer to iotop.  Maybe he knows who could add/commit a check for
> sysctl.sched_delayacct and if it exists then check if it's 1 and display
> an error suggesting corrective action (add delayacct to the kernel command
> line or sysctl sched.sched_delayacct=1). iotop appears to be in maintenance
> mode but gets occasional commits even if it has not had a new version
> since 2013 so maybe it could get a 0.7 tag if such a check was added.

Did the proposed changes get merged?

If so, please let me know the details of what needs to happen in iotop
and iotop-c to cope with the changes in the Linux kernel.