Message ID | 20230125134434.18017-1-mgorman@techsingularity.net (mailing list archive) |
---|---|
Headers | show |
Series | Fix excessive CPU usage during compaction | expand |
On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock") > fixed a problem where pageblocks found by fast_find_migrateblock() were > ignored. Unfortunately there were numerous bug reports complaining about high > CPU usage and massive stalls once 6.1 was released. Due to the severity, > the patch was reverted by Vlastimil as a short-term fix[1] to -stable and > is currently sitting in the Andrew's git branch mm/mm-hotfixes-unstable. > > The underlying problem for each of the bugs is suspected to be the > repeated scanning of the same pageblocks. This series should guarantee > forward progress even with commit 7efc3b726103. More information is in > the changelog for patch 4. > > If this series is accepted and merged after the revert of 7efc3b726103 > then a "revert of the revert" will be needed. If we drop Vlastimil's reversion and apply this, the whole series should be cc:stable and it isn't really designed for that. So I think either a) drop Vlastimil's reversion and persuade Mel to send us a minimal version of patch #4 for -stable consumption. Patches 1-3 of this series come later. b) go ahead with Vlastimil's revert for -stable, queue up this series for 6.3-rc1 and redo the original "fix set skip in fast_find_migrateblock" some time in the future. If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction: Finish pageblocks on complete migration failure" is inappropriate - fixing a reverted commit which Vlastimil's revert already fixed. I'll plan on b) for now.
On Wed, Jan 25, 2023 at 05:11:59PM -0800, Andrew Morton wrote: > On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > > > Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock") > > fixed a problem where pageblocks found by fast_find_migrateblock() were > > ignored. Unfortunately there were numerous bug reports complaining about high > > CPU usage and massive stalls once 6.1 was released. Due to the severity, > > the patch was reverted by Vlastimil as a short-term fix[1] to -stable and > > is currently sitting in the Andrew's git branch mm/mm-hotfixes-unstable. > > > > The underlying problem for each of the bugs is suspected to be the > > repeated scanning of the same pageblocks. This series should guarantee > > forward progress even with commit 7efc3b726103. More information is in > > the changelog for patch 4. > > > > If this series is accepted and merged after the revert of 7efc3b726103 > > then a "revert of the revert" will be needed. > > If we drop Vlastimil's reversion and apply this, the whole series > should be cc:stable and it isn't really designed for that. > > So I think either > > a) drop Vlastimil's reversion and persuade Mel to send us a minimal > version of patch #4 for -stable consumption. Patches 1-3 of this > series come later. > > b) go ahead with Vlastimil's revert for -stable, queue up this > series for 6.3-rc1 and redo the original "fix set skip in > fast_find_migrateblock" some time in the future. > > If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction: > Finish pageblocks on complete migration failure" is inappropriate - > fixing a reverted commit which Vlastimil's revert already fixed. > > I'll plan on b) for now. I think plan b) is more appropriate because Vlastimil's revert contains useful information on this class of bug that may be useful again in the future. There is no harm preserving that in both the mainline and -stable git history. The series worked for me but that was a limited test case and high CPU usage bugs could crop up again in 6.2. If such bugs show up then reverting ("mm/compaction: fix set skip in fast_find_migrateblock") becomes a short-term option again while it is analysed. I disagee that the Fixes tag is inappropriate. Commit 7efc3b726103 is simply a messenger that fast_find_migrateblock() was broken and while it fixed one problem, it triggered many more because the fix was incomplete. Vlastimil's fix was a short-term "fix" to mitigate the fallout until a proper fix was available. This series is a more comprehensive fix but both are "Fixes" to commit 7efc3b726103 in terms of what kernel has very obvious problems that needs to be fixed. Any kernel including 7efc3b726103 should include either Vlastimil's fix, mine or both if git history is being preserved by the backports. While there is an older commit that broke fast_find_migrateblock, most people simply didn't notice where as commit 7efc3b726103 was noticeable by a number of people very quickly. However, if you disagree then the appropriate fixes would be Fixes: 70b44595eafe9 ("mm, compaction: use free lists to quickly locate a migration source") That would at least indicate to anyone doing the backport that they need to ensure the backport does not have any hidden dependencies.
On 1/26/23 02:11, Andrew Morton wrote: > On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > > If we drop Vlastimil's reversion and apply this, the whole series > should be cc:stable and it isn't really designed for that. > > So I think either > > a) drop Vlastimil's reversion and persuade Mel to send us a minimal > version of patch #4 for -stable consumption. Patches 1-3 of this > series come later. > > b) go ahead with Vlastimil's revert for -stable, queue up this > series for 6.3-rc1 and redo the original "fix set skip in > fast_find_migrateblock" some time in the future. > > If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction: > Finish pageblocks on complete migration failure" is inappropriate - > fixing a reverted commit which Vlastimil's revert already fixed. > > I'll plan on b) for now. Agreed with the plan b). I couldn't review this yet due to being sick, but I doubt I would have enough confidence to fast-track the series to 6.2 and 6.1-stable. It's subtle enough area and extra time in -next and full -rc cycle will help.
On Sun, Jan 29, 2023 at 07:03:54PM +0100, Vlastimil Babka wrote: > On 1/26/23 02:11, Andrew Morton wrote: > > On Wed, 25 Jan 2023 13:44:30 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > > > > If we drop Vlastimil's reversion and apply this, the whole series > > should be cc:stable and it isn't really designed for that. > > > > So I think either > > > > a) drop Vlastimil's reversion and persuade Mel to send us a minimal > > version of patch #4 for -stable consumption. Patches 1-3 of this > > series come later. > > > > b) go ahead with Vlastimil's revert for -stable, queue up this > > series for 6.3-rc1 and redo the original "fix set skip in > > fast_find_migrateblock" some time in the future. > > > > If we go with b) then the Fixes: tag in "[PATCH 4/4] mm, compaction: > > Finish pageblocks on complete migration failure" is inappropriate - > > fixing a reverted commit which Vlastimil's revert already fixed. > > > > I'll plan on b) for now. > > Agreed with the plan b). I couldn't review this yet due to being sick, > but I doubt I would have enough confidence to fast-track the series to > 6.2 and 6.1-stable. It's subtle enough area and extra time in -next and > full -rc cycle will help. I hope you feel better soon but for what it's worth, I think it also deserves a full -rc cycle. I've been running it on my own machine for the last few days using an openSUSE stable kernel with this series applied and I haven't had problems with kcompactd or khugepaged getting out of control (monitored via top -b -i). However, I have noticed at least one audio glitch and I'm not sure if that is related to the series or not. Compaction is more active than I would have expected from intuition but I've also never had reason to monitor compaction on my desktop so that's not very useful in itself. My desktop is a very basic environment (awesome WM, no fancy animations) and the times when my machine starts thrashing, I also expect it to because it's the weekly "run git gc on every git tree Friday evening if X is idle more than an hour".