[RFC,0/3] Cgroup-based THP control

Message ID 20241030083311.965933-1-gutierrez.asier@huawei-partners.com (mailing list archive)

Gutierrez Asier Oct. 30, 2024, 8:33 a.m. UTC
From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>

Currently THP modes are set globally. This can be overkill if only a specific
app or set of apps needs to benefit from THP. Moreover, different apps might
need different THP settings. Here we propose a cgroup-based THP control
mechanism.

A THP interface is added to the memory cgroup subsystem. The existing global THP
control semantics are preserved for backward compatibility. When THP modes are
set globally, all the changes are propagated to the memory cgroups. However, when
a particular cgroup changes its THP policy, the global THP policy in sysfs
remains the same.

New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
use exactly the same format as the global THP enabled/defrag files.

Child cgroups inherit the THP settings of their parent cgroup upon creation.
Mode changes in a particular cgroup are not propagated to its child cgroups.

During the memory cgroup attachment stage, the corresponding slots are added to
or removed from khugepaged according to the THP policy.

Usage examples:

Set globally "madvise" mode:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

All the settings are propagated:
# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never

# cat /sys/fs/cgroup/test/memory.thp_enabled
always [madvise] never

Set "always" for some specific cgroup:
# echo always > /sys/fs/cgroup/test/memory.thp_enabled
# cat /sys/fs/cgroup/test/memory.thp_enabled
[always] madvise never

Root cgroup remains with "madvise" mode:
# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never

When attempting to read global settings we get "mixed state" warning as the
THP-mode isn't the same for every cgroup:
# cat /sys/kernel/mm/transparent_hugepage/enabled
Mixed state: see particular memcg flags! 

Set the THP mode globally again and check that it is propagated everywhere:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# cat /sys/fs/cgroup/memory.thp_enabled
always madvise [never]

# cat /sys/fs/cgroup/test/memory.thp_enabled
always madvise [never]
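
The read/write behavior shown in the transcript above can be summarized with a
small toy model. This is only a sketch of the semantics as described in this
cover letter, not the kernel implementation: the class and function names are
made up for illustration, and it assumes the global read is derived purely from
the per-memcg modes.

```python
MODES = ("always", "madvise", "never")
MIXED = "Mixed state: see particular memcg flags!"


def render(mode):
    """Format a mode like the sysfs 'enabled' file, e.g. 'always [madvise] never'."""
    return " ".join(f"[{m}]" if m == mode else m for m in MODES)


class ThpModel:
    """Toy model (hypothetical): one THP mode per memcg path."""

    def __init__(self, cgroups, mode="madvise"):
        self.modes = {path: mode for path in cgroups}

    def write_global(self, mode):
        # A global write is propagated to every memcg.
        if mode not in MODES:
            raise ValueError(mode)
        for path in self.modes:
            self.modes[path] = mode

    def write_memcg(self, path, mode):
        # A memcg write changes only that memcg.
        if mode not in MODES:
            raise ValueError(mode)
        self.modes[path] = mode

    def read_memcg(self, path):
        return render(self.modes[path])

    def read_global(self):
        # Assumption: the global file reports a mode only if all memcgs agree.
        distinct = set(self.modes.values())
        return render(distinct.pop()) if len(distinct) == 1 else MIXED


if __name__ == "__main__":
    model = ThpModel(["/", "/test"])
    model.write_global("madvise")
    model.write_memcg("/test", "always")
    print(model.read_memcg("/"))      # always [madvise] never
    print(model.read_memcg("/test"))  # [always] madvise never
    print(model.read_global())        # Mixed state: see particular memcg flags!
```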

Here is a simple demo with a test that performs an anonymous mmap() followed by
a series of random reads. The system is rebooted between the cases.
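
The test program itself is not included in this posting. A rough Python sketch
of that kind of test is given here for orientation only: the sizes, the
function names and the initial write pass that populates the pages are all
assumptions, not the original test. It also shows how the per-process
AnonHugePages figure can be summed from smaps.

```python
import mmap
import os
import random
import re

PAGE = mmap.PAGESIZE


def anon_huge_kb(smaps_text):
    """Sum every 'AnonHugePages:' field (in kB) found in smaps-formatted text."""
    fields = re.findall(r"^AnonHugePages:\s+(\d+) kB", smaps_text, re.M)
    return sum(int(kb) for kb in fields)


def touch_and_read(size, reads, seed=0):
    """Map private anonymous memory, populate it, then do random 1-byte reads."""
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for off in range(0, size, PAGE):
        buf[off] = 1  # assumption: write once so the pages are really allocated
    rng = random.Random(seed)
    checksum = sum(buf[rng.randrange(size)] for _ in range(reads))
    return buf, checksum


if __name__ == "__main__" and os.path.exists("/proc/self/smaps"):
    mapping, _ = touch_and_read(32 << 20, 100_000)  # keep the mapping alive
    with open("/proc/self/smaps") as f:
        print("AnonHugePages:", anon_huge_kb(f.read()), "kB")
```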

Case 1: Global THP - always. No cgroup.

// Global THP stats:
AnonHugePages:    391168 kB
FileHugePages:    120832 kB
FilePmdMapped:     67584 kB

// THP stats from *smaps* of the testing process
AnonHugePages:     12288 kB

Case 2: Global THP - never. Cgroup - always.

// Global THP stats:
AnonHugePages:     12288 kB
FileHugePages:      2048 kB
FilePmdMapped:      2048 kB

// THP stats from *smaps* of the testing process
AnonHugePages:     12288 kB

// The cgroup THP stats
anon_thp 12582912
file_thp 2097152

Global THP usage differs substantially between the two cases (391168 kB vs.
12288 kB of AnonHugePages), while the test process gets the same 12288 kB in
both. This shows that the cgroup approach is beneficial when a specific app or
set of apps needs THP and changing the application code is not an option.
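
As a quick consistency check on the Case 2 numbers: the cgroup counters are in
bytes while meminfo and smaps report kB. The short Python sketch below converts
between them. The helper names are illustrative, and the 2 MiB THP size is an
assumption (PMD-sized THP on a system with 4 KiB base pages).

```python
KIB = 1024
PMD_THP_BYTES = 2 * 1024 * 1024  # assumption: 2 MiB PMD-sized THP


def bytes_to_kb(nbytes):
    """Convert a byte counter to the kB unit used by meminfo and smaps."""
    return nbytes // KIB


def thp_count(nbytes):
    """Number of PMD-sized huge pages covering nbytes (under the 2 MiB assumption)."""
    return nbytes // PMD_THP_BYTES


anon_thp = 12582912  # cgroup anon_thp from Case 2
file_thp = 2097152   # cgroup file_thp from Case 2

print(bytes_to_kb(anon_thp), thp_count(anon_thp))  # 12288 6
print(bytes_to_kb(file_thp), thp_count(file_thp))  # 2048 1
```

Under that assumption, the cgroup's anon_thp matches the 12288 kB seen both in
the process smaps and in the global AnonHugePages counter for Case 2.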

TODO list:

1. Anonymous mTHP
2. Fine-grained mode selection for different VMA types: "anon|exec|ro|file", to
   be able to support combinations such as "always + exec", "always + anon", etc.
3. Per-cgroup limit for the THP usage


Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>

Asier Gutierrez, Anatoly Stepanov (3):
  mm: Add thp_flags control for cgroup
  mm: Support for huge pages in cgroups
  mm: Add thp_defrag control for cgroup


 include/linux/huge_mm.h    |  23 +++-
 include/linux/khugepaged.h |   2 +-
 include/linux/memcontrol.h |  28 ++++
 mm/huge_memory.c           | 207 ++++++++++++++++++-----------
 mm/khugepaged.c            |   8 +-
 mm/memcontrol.c            | 262 +++++++++++++++++++++++++++++++++++++
 6 files changed, 449 insertions(+), 81 deletions(-)

Comments

Michal Hocko Oct. 30, 2024, 8:38 a.m. UTC | #1
On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> 
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
> 
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.

Do you have any specific examples where this would be beneficial?

> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
> 
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.

So this breaks hierarchical property, doesn't it? In other words if a
parent cgroup would like to enforce a certain policy to all descendants
then this is not really possible.
Gutierrez Asier Oct. 30, 2024, 12:51 p.m. UTC | #2
On 10/30/2024 11:38 AM, Michal Hocko wrote:
> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>>
>> THP interface is added to memory cgroup subsystem. Existing global THP control
>> semantics is supported for backward compatibility. When THP modes are set
>> globally all the changes are propagated to memory cgroups. However, when a
>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>> the same.
> 
> Do you have any specific examples where this would be beneficial?

For now we're mostly focused on database scenarios (MySQL, Redis).

The main idea is to avoid relying on a global THP setting, which can waste
resources system-wide, and to get per-cgroup granularity instead.

Although THP is beneficial for DB performance, we observe high THP
"over-usage" by unrelated apps/services when "always" mode is enabled
globally.

With cgroup-THP, we're able to specify the exact "THP users", and we plan to
introduce the ability to limit the amount of THP per cgroup.

We expect it to be beneficial for some container-based workloads as well, where
different containers can have different THP policies, but we haven't looked into
this case yet.

>> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>> have completely the same format as global THP enabled/defrag.
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.
> 
> So this breaks hierarchical property, doesn't it? In other words if a
> parent cgroup would like to enforce a certain policy to all descendants
> then this is not really possible. 

The initial idea was to allow some flexibility when changing THP policies.

I will submit a new patch set that enforces the cgroup hierarchy and changes all
the children recursively.
Matthew Wilcox Oct. 30, 2024, 1:14 p.m. UTC | #3
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> 
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.

Or maybe we should stop making the sysadmin's life so damned hard and
figure out how to do without all of these settings?
David Hildenbrand Oct. 30, 2024, 1:16 p.m. UTC | #4
On 30.10.24 14:14, Matthew Wilcox wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
> 
> Or maybe we should stop making the sysadmin's life so damned hard and
> figure out how to do without all of these settings?

In particular if there is no proper problem description / use case.
Michal Hocko Oct. 30, 2024, 1:27 p.m. UTC | #5
On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
> 
> 
> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> > On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> >>
> >> Currently THP modes are set globally. It can be an overkill if only some
> >> specific app/set of apps need to get benefits from THP usage. Moreover, various
> >> apps might need different THP settings. Here we propose a cgroup-based THP
> >> control mechanism.
> >>
> >> THP interface is added to memory cgroup subsystem. Existing global THP control
> >> semantics is supported for backward compatibility. When THP modes are set
> >> globally all the changes are propagated to memory cgroups. However, when a
> >> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> >> the same.
> > 
> > Do you have any specific examples where this would be beneficial?
> 
> Now we're mostly focused on database scenarios (MySQL, Redis).  

That seems to be more process than workload oriented. Why doesn't the existing
per-process tuning work?

[...]
> >> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >> cgroup mode changes aren't propagated to child cgroups.
> > 
> > So this breaks hierarchical property, doesn't it? In other words if a
> > parent cgroup would like to enforce a certain policy to all descendants
> > then this is not really possible. 
> 
> The first idea was to have some flexibility when changing THP policies. 
> 
> I will submit a new patch set which will enforce the cgroup hierarchy and change all
> the children recursively.

What is the expected semantics then?
Chris Down Oct. 30, 2024, 2:45 p.m. UTC | #6
gutierrez.asier@huawei-partners.com writes:
>New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>have completely the same format as global THP enabled/defrag.

cgroup controls exist because there are things we want to do for an entire 
class of processes (group OOM, resource control, etc). Enabling or disabling 
some specific setting is generally not one of them, hence why we got rid of 
things like per-cgroup vm.swappiness. We know that these controls do not 
compose well and have caused a lot of pain in the past. So my immediate 
reaction is a nack on the general concept, unless there's some absolutely 
compelling case here.

I talked a little at Kernel Recipes last year about moving away from sysctl and 
other global interfaces and making things more granular. Don't get me wrong, I 
think that is a good thing (although, of course, a very large undertaking) -- 
but it is a mistake to overload the amount of controls we expose as part of the 
cgroup interface.

I am up for thinking overall about how we can improve the state of global 
tunables to make them more granular overall, but this can't set a precedent as 
the way to do it.
Gutierrez Asier Oct. 30, 2024, 2:58 p.m. UTC | #7
On 10/30/2024 4:27 PM, Michal Hocko wrote:
> On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>>
>>
>> On 10/30/2024 11:38 AM, Michal Hocko wrote:
>>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>>>
>>>> Currently THP modes are set globally. It can be an overkill if only some
>>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>>>> apps might need different THP settings. Here we propose a cgroup-based THP
>>>> control mechanism.
>>>>
>>>> THP interface is added to memory cgroup subsystem. Existing global THP control
>>>> semantics is supported for backward compatibility. When THP modes are set
>>>> globally all the changes are propagated to memory cgroups. However, when a
>>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>>>> the same.
>>>
>>> Do you have any specific examples where this would be beneficial?
>>
>> Now we're mostly focused on database scenarios (MySQL, Redis).  
> 
> That seems to be more process than workload oriented. Why the existing
> per-process tuning doesn't work?
> 
> [...]

1st Point

We're trying to provide a transparent mechanism, but all the existing per-process
methods require modifying the app itself (MADV_HUGEPAGE, MADV_COLLAPSE, hugetlbfs).

Moreover, we're using file-backed THPs too (mostly for .text), which makes things
even more complicated for user-space developers.
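
To make "modifying the app itself" concrete: opting a region into THP with
madvise() means code along these lines has to live inside the application. This
is a minimal Python sketch; the function name and the 2 MiB THP size are
illustrative assumptions, and it uses the standard mmap module's madvise()
wrapper rather than anything from this series.

```python
import mmap

THP_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB PMD-sized THP


def map_with_thp_hint(size):
    """Return (mapping, hinted): an anonymous private mapping and whether the
    MADV_HUGEPAGE hint was accepted by the kernel."""
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf.madvise(mmap.MADV_HUGEPAGE)
        return buf, True
    except (OSError, AttributeError):
        # Kernel built without THP support, or a platform lacking MADV_HUGEPAGE.
        return buf, False


if __name__ == "__main__":
    mapping, hinted = map_with_thp_hint(4 * THP_SIZE)
    print("MADV_HUGEPAGE accepted:", hinted)
```

With the global mode set to "madvise", only regions hinted this way are
eligible for THP, which is why this route needs source changes in every app.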

>>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>>>> cgroup mode changes aren't propagated to child cgroups.
>>>
>>> So this breaks hierarchical property, doesn't it? In other words if a
>>> parent cgroup would like to enforce a certain policy to all descendants
>>> then this is not really possible. 
>>
>> The first idea was to have some flexibility when changing THP policies. 
>>
>> I will submit a new patch set which will enforce the cgroup hierarchy and change all
>> the children recursively.
> 
> What is the expected semantics then?

2nd point (on semantics)
1. Children inherit the THP policy upon creation
2. Parent's policy changes are propagated to all the children
3. Children can set the policy independently
Michal Hocko Oct. 30, 2024, 3:04 p.m. UTC | #8
On Wed 30-10-24 14:45:24, Chris Down wrote:
> gutierrez.asier@huawei-partners.com writes:
> > New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> > have completely the same format as global THP enabled/defrag.
> 
> cgroup controls exist because there are things we want to do for an entire
> class of processes (group OOM, resource control, etc). Enabling or disabling
> some specific setting is generally not one of them, hence why we got rid of
> things like per-cgroup vm.swappiness. We know that these controls do not
> compose well and have caused a lot of pain in the past. So my immediate
> reaction is a nack on the general concept, unless there's some absolutely
> compelling case here.
> 
> I talked a little at Kernel Recipes last year about moving away from sysctl
> and other global interfaces and making things more granular. Don't get me
> wrong, I think that is a good thing (although, of course, a very large
> undertaking) -- but it is a mistake to overload the amount of controls we
> expose as part of the cgroup interface.

Completely agreed!
Johannes Weiner Oct. 30, 2024, 3:08 p.m. UTC | #9
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> 
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
> 
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.
> 
> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
> 
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.

Cgroups are for hierarchical resource distribution. It's tempting to
add parameters you would want for flat collections of processes, but
it gets weird when it comes to inheritance and hierarchical semantics
inside the cgroup tree - like it does here. So this is not a good fit.

On this particular issue, I agree with Willy and David: let's not
proliferate THP knobs; let's focus on making them truly transparent.
Michal Hocko Oct. 30, 2024, 3:15 p.m. UTC | #10
On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
> 
> 
> On 10/30/2024 4:27 PM, Michal Hocko wrote:
> > On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
> >>
> >>
> >> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> >>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> >>>>
> >>>> Currently THP modes are set globally. It can be an overkill if only some
> >>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
> >>>> apps might need different THP settings. Here we propose a cgroup-based THP
> >>>> control mechanism.
> >>>>
> >>>> THP interface is added to memory cgroup subsystem. Existing global THP control
> >>>> semantics is supported for backward compatibility. When THP modes are set
> >>>> globally all the changes are propagated to memory cgroups. However, when a
> >>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> >>>> the same.
> >>>
> >>> Do you have any specific examples where this would be beneficial?
> >>
> >> Now we're mostly focused on database scenarios (MySQL, Redis).  
> > 
> > That seems to be more process than workload oriented. Why the existing
> > per-process tuning doesn't work?
> > 
> > [...]
> 
> 1st Point
> 
> We're trying to provide a transparent mechanism, but all the existing per-process
> methods require modifying the app itself (MADV_HUGEPAGE, MADV_COLLAPSE, hugetlbfs)

There is also prctl to define a per-process policy. We currently have
means to disable THP for the process to override the default behavior.
That would be mostly transparent for the application.
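
The prctl route can be kept outside the application with a small launcher that
sets the flag and then execs the unmodified program, since the "THP disable"
flag is preserved across execve(). A minimal Python sketch via ctypes follows.
The helper names are illustrative; the option numbers are the
PR_SET_THP_DISABLE / PR_GET_THP_DISABLE values from <linux/prctl.h>, and a
Linux system is assumed.

```python
import ctypes
import os
import sys

PR_SET_THP_DISABLE = 41  # from <linux/prctl.h>
PR_GET_THP_DISABLE = 42

_libc = ctypes.CDLL(None, use_errno=True)
_libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
_libc.prctl.restype = ctypes.c_int


def _prctl(option, arg2=0):
    """Call prctl(option, arg2, 0, 0, 0) and raise OSError on failure."""
    ret = _libc.prctl(option, arg2, 0, 0, 0)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def set_thp_disable(disable=True):
    """Set or clear the per-process 'THP disable' flag."""
    _prctl(PR_SET_THP_DISABLE, 1 if disable else 0)


def thp_disabled():
    """Return True if the 'THP disable' flag is set for this process."""
    return _prctl(PR_GET_THP_DISABLE) == 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 no_thp.py <program> [args...]
    set_thp_disable(True)
    os.execvp(sys.argv[1], sys.argv[1:])
```

Note that this only covers opting a process out of THP; it does not by itself
provide the opt-in behavior the series is after.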

You have not really answered a more fundamental question though. Why
should the THP behavior be at the cgroup scope? From a practical POV that
would represent containers which are a mixed bag of applications to
support the workload. Why does the same THP policy apply to all of them?
Doesn't this reproduce the sub-optimal global behavior at the cgroup
level, where some parts will benefit while others will not?

> Moreover we're using file-backed THPs too (for .text mostly), which make it for
> user-space developers even more complicated.
> 
> >>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >>>> cgroup mode changes aren't propagated to child cgroups.
> >>>
> >>> So this breaks hierarchical property, doesn't it? In other words if a
> >>> parent cgroup would like to enforce a certain policy to all descendants
> >>> then this is not really possible. 
> >>
> >> The first idea was to have some flexibility when changing THP policies. 
> >>
> >> I will submit a new patch set which will enforce the cgroup hierarchy and change all
> >> the children recursively.
> > 
> > What is the expected semantics then?
> 
> 2nd point (on semantics)
> 1. Children inherit the THP policy upon creation
> 2. Parent's policy changes are propagated to all the children
> 3. Children can set the policy independently

So if the parent decides that none of the children should be using THP,
the children can override that, so the tuning at the parent has no
imperative control. This breaks the hierarchical property that is
expected from cgroup control files.
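
The objection can be illustrated with a toy model of the three proposed rules.
This is only a sketch for discussion, with made-up class and function names; it
is not how the patches implement anything.

```python
class Cgroup:
    """Toy cgroup node (hypothetical) following the three proposed rules."""

    def __init__(self, name, parent=None, policy="madvise"):
        self.name = name
        self.children = []
        # Rule 1: a child inherits the parent's policy at creation time.
        self.policy = parent.policy if parent else policy
        if parent:
            parent.children.append(self)

    def set_policy(self, policy):
        # Rule 3: any cgroup may set its own policy independently...
        self.policy = policy
        # Rule 2: ...and a change is propagated to all descendants.
        for child in self.children:
            child.set_policy(policy)


def demo():
    root = Cgroup("root")
    parent = Cgroup("parent", root)
    child = Cgroup("child", parent)
    parent.set_policy("never")   # parent wants no THP anywhere below it
    child.set_policy("always")   # child overrides it afterwards
    return parent.policy, child.policy


if __name__ == "__main__":
    print(demo())  # ('never', 'always')
```

The parent's "never" holds only until a descendant writes its own file, which
is the loss of imperative control described above.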