mbox series

[RFC,0/4] Fix excessive CPU usage during compaction

Message ID 20230125134434.18017-1-mgorman@techsingularity.net (mailing list archive)
Headers show
Series Fix excessive CPU usage during compaction | expand

Message

Mel Gorman Jan. 25, 2023, 1:44 p.m. UTC
Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
fixed a problem where pageblocks found by fast_find_migrateblock() were
ignored. Unfortunately there were numerous bug reports complaining about high
CPU usage and massive stalls once 6.1 was released. Due to the severity,
the patch was reverted by Vlastimil as a short-term fix[1] to -stable and
is currently sitting in the Andrew's git branch mm/mm-hotfixes-unstable.

The underlying problem for each of the bugs is suspected to be the
repeated scanning of the same pageblocks. This series should guarantee
forward progress even with commit 7efc3b726103. More information is in
the changelog for patch 4.

If this series is accepted and merged after the revert of 7efc3b726103
then a "revert of the revert" will be needed.

[1] http://lore.kernel.org/r/20230113173345.9692-1-vbabka@suse.cz

 mm/compaction.c | 73 +++++++++++++++++++++++++++++++------------------
 mm/internal.h   |  6 +++-
 2 files changed, 52 insertions(+), 27 deletions(-)

Comments

Andrew Morton Jan. 26, 2023, 1:11 a.m. UTC | #1
On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
> fixed a problem where pageblocks found by fast_find_migrateblock() were
> ignored. Unfortunately there were numerous bug reports complaining about high
> CPU usage and massive stalls once 6.1 was released. Due to the severity,
> the patch was reverted by Vlastimil as a short-term fix[1] to -stable and
> is currently sitting in the Andrew's git branch mm/mm-hotfixes-unstable.
> 
> The underlying problem for each of the bugs is suspected to be the
> repeated scanning of the same pageblocks. This series should guarantee
> forward progress even with commit 7efc3b726103. More information is in
> the changelog for patch 4.
> 
> If this series is accepted and merged after the revert of 7efc3b726103
> then a "revert of the revert" will be needed.

If we drop Vlastimil's reversion and apply this, the whole series
should be cc:stable and it isn't really designed for that.

So I think either

a) drop Vlastimil's reversion and persuade Mel to send us a minimal
   version of patch #4 for -stable consumption.  Patches 1-3 of this
   series come later.

b) go ahead with Vlastimil's revert for -stable, queue up this
   series for 6.3-rc1 and redo the original "fix set skip in
   fast_find_migrateblock" some time in the future.

If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction:
Finish pageblocks on complete migration failure" is inappropriate -
fixing a reverted commit which Vlastimil's revert already fixed.

I'll plan on b) for now.
Mel Gorman Jan. 26, 2023, 9:04 a.m. UTC | #2
On Wed, Jan 25, 2023 at 05:11:59PM -0800, Andrew Morton wrote:
> On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
> > fixed a problem where pageblocks found by fast_find_migrateblock() were
> > ignored. Unfortunately there were numerous bug reports complaining about high
> > CPU usage and massive stalls once 6.1 was released. Due to the severity,
> > the patch was reverted by Vlastimil as a short-term fix[1] to -stable and
> > is currently sitting in the Andrew's git branch mm/mm-hotfixes-unstable.
> > 
> > The underlying problem for each of the bugs is suspected to be the
> > repeated scanning of the same pageblocks. This series should guarantee
> > forward progress even with commit 7efc3b726103. More information is in
> > the changelog for patch 4.
> > 
> > If this series is accepted and merged after the revert of 7efc3b726103
> > then a "revert of the revert" will be needed.
> 
> If we drop Vlastimil's reversion and apply this, the whole series
> should be cc:stable and it isn't really designed for that.
> 
> So I think either
> 
> a) drop Vlastimil's reversion and persuade Mel to send us a minimal
>    version of patch #4 for -stable consumption.  Patches 1-3 of this
>    series come later.
> 
> b) go ahead with Vlastimil's revert for -stable, queue up this
>    series for 6.3-rc1 and redo the original "fix set skip in
>    fast_find_migrateblock" some time in the future.
> 
> If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction:
> Finish pageblocks on complete migration failure" is inappropriate -
> fixing a reverted commit which Vlastimil's revert already fixed.
> 
> I'll plan on b) for now.

I think plan b) is more appropriate because Vlastimil's revert contains
useful information on this class of bug that may be useful again in the
future. There is no harm preserving that in both the mainline and -stable
git history.

The series worked for me but that was a limited test case and high CPU
usage bugs could crop up again in 6.2. If such bugs show up then reverting
("mm/compaction: fix set skip in fast_find_migrateblock") becomes a
short-term option again while it is analysed.

I disagee that the Fixes tag is inappropriate. Commit 7efc3b726103 is simply
a messenger that fast_find_migrateblock() was broken and while it fixed one
problem, it triggered many more because the fix was incomplete. Vlastimil's
fix was a short-term "fix" to mitigate the fallout until a proper fix was
available. This series is a more comprehensive fix but both are "Fixes"
to commit 7efc3b726103 in terms of what kernel has very obvious problems
that needs to be fixed. Any kernel including 7efc3b726103 should include
either Vlastimil's fix, mine or both if git history is being preserved by the
backports. While there is an older commit that broke fast_find_migrateblock,
most people simply didn't notice where as commit 7efc3b726103 was noticeable
by a number of people very quickly.

However, if you disagree then the appropriate fixes would be

Fixes: 70b44595eafe9 ("mm, compaction: use free lists to quickly locate a migration source")

That would at least indicate to anyone doing the backport that they need
to ensure the backport does not have any hidden dependencies.
Vlastimil Babka Jan. 29, 2023, 6:03 p.m. UTC | #3
On 1/26/23 02:11, Andrew Morton wrote:
> On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> If we drop Vlastimil's reversion and apply this, the whole series
> should be cc:stable and it isn't really designed for that.
> 
> So I think either
> 
> a) drop Vlastimil's reversion and persuade Mel to send us a minimal
>    version of patch #4 for -stable consumption.  Patches 1-3 of this
>    series come later.
> 
> b) go ahead with Vlastimil's revert for -stable, queue up this
>    series for 6.3-rc1 and redo the original "fix set skip in
>    fast_find_migrateblock" some time in the future.
> 
> If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction:
> Finish pageblocks on complete migration failure" is inappropriate -
> fixing a reverted commit which Vlastimil's revert already fixed.
> 
> I'll plan on b) for now.

Agreed with the plan b). I couldn't review this yet due to being sick,
but I doubt I would have enough confidence to fast-track the series to
6.2 and 6.1-stable. It's subtle enough area and extra time in -next and
full -rc cycle will help.
Mel Gorman Jan. 29, 2023, 9 p.m. UTC | #4
On Sun, Jan 29, 2023 at 07:03:54PM +0100, Vlastimil Babka wrote:
> On 1/26/23 02:11, Andrew Morton wrote:
> > On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > If we drop Vlastimil's reversion and apply this, the whole series
> > should be cc:stable and it isn't really designed for that.
> > 
> > So I think either
> > 
> > a) drop Vlastimil's reversion and persuade Mel to send us a minimal
> >    version of patch #4 for -stable consumption.  Patches 1-3 of this
> >    series come later.
> > 
> > b) go ahead with Vlastimil's revert for -stable, queue up this
> >    series for 6.3-rc1 and redo the original "fix set skip in
> >    fast_find_migrateblock" some time in the future.
> > 
> > If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction:
> > Finish pageblocks on complete migration failure" is inappropriate -
> > fixing a reverted commit which Vlastimil's revert already fixed.
> > 
> > I'll plan on b) for now.
> 
> Agreed with the plan b). I couldn't review this yet due to being sick,
> but I doubt I would have enough confidence to fast-track the series to
> 6.2 and 6.1-stable. It's subtle enough area and extra time in -next and
> full -rc cycle will help.

I hope you feel better soon but for what it's worth, I think it also
deserves a full -rc cycle. I've been running it on my own machine for the
last few days using an openSUSE stable kernel with this series applied and
I haven't had problems with kcompactd or khugepaged getting out of control
(monitored via top -b -i). However, I have noticed at least one audio glitch
and I'm not sure if that is related to the series or not. Compaction is
more active than I would have expected from intuition but I've also never
had reason to monitor compaction on my desktop so that's not very useful
in itself. My desktop is a very basic environment (awesome WM, no fancy
animations) and the times when my machine starts thrashing, I also expect
it to because it's the weekly "run git gc on every git tree Friday evening
if X is idle more than an hour".