[RFC] mm, oom: introduce vm.sacrifice_hugepage_on_oom

Message ID 20210216030713.79101-1-eiichi.tsukata@nutanix.com (mailing list archive)
State New

Commit Message

Eiichi Tsukata Feb. 16, 2021, 3:07 a.m. UTC
Hugepages can be preallocated to avoid unpredictable allocation latency.
If we run into a 4k page shortage, the kernel can trigger OOM even though
there are free hugepages. When OOM is triggered by the user address page
fault handler, we can use an oom notifier to free hugepages in user space,
but if it is triggered by a kernel memory allocation, there is no way
to synchronously handle it in user space.

This patch introduces a new sysctl, vm.sacrifice_hugepage_on_oom. If
enabled, it first tries to free a hugepage, if one is available, before
invoking the oom-killer. The default value is disabled so as not to change
the current behavior.

Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 12 ++++++++++++
 include/linux/hugetlb.h                 |  2 ++
 include/linux/oom.h                     |  1 +
 kernel/sysctl.c                         |  9 +++++++++
 mm/hugetlb.c                            |  4 ++--
 mm/oom_kill.c                           | 23 +++++++++++++++++++++++
 6 files changed, 49 insertions(+), 2 deletions(-)

Comments

Michal Hocko Feb. 16, 2021, 8:12 a.m. UTC | #1
On Tue 16-02-21 03:07:13, Eiichi Tsukata wrote:
> Hugepages can be preallocated to avoid unpredictable allocation latency.
> If we run into 4k page shortage, the kernel can trigger OOM even though
> there were free hugepages. When OOM is triggered by user address page
> fault handler, we can use oom notifier to free hugepages in user space
> but if it's triggered by memory allocation for kernel, there is no way
> to synchronously handle it in user space.

Can you expand some more on what kind of problem you see?
Hugetlb pages are, by definition, a preallocated, unreclaimable and
admin controlled pool of pages. Under those conditions it is expected
and required that the sizing would be done very carefully. Why is that a
problem in your particular setup/scenario?

If the sizing is really done properly and a random process can then
trigger OOM, this can lead to malfunctioning of those workloads
which do depend on the hugetlb pool, right? So isn't this a kinda DoS
scenario?

> This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> enabled, it first tries to free a hugepage if available before invoking
> the oom-killer. The default value is disabled not to change the current
> behavior.

Why is this interface not hugepage size aware? It is quite different to
release a GB huge page or 2MB one. Or is it expected to release the
smallest one? To the implementation...

[...]
> +static int sacrifice_hugepage(void)
> +{
> +	int ret;
> +
> +	spin_lock(&hugetlb_lock);
> +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);

... no it is going to release the default huge page. This will be 2MB in
most cases but this is not given.

Unless I am mistaken this will also free up reserved hugetlb pages. This
would mean that a page fault would SIGBUS, which is very likely not
something we want to do, right? You also want to use the oom nodemask rather
than a full one.

Overall, I am not really happy about this feature even when the above is
fixed, but let's hear more about the actual problem first.
Chris Down Feb. 16, 2021, 1:38 p.m. UTC | #2
Hi Eiichi,

I agree with Michal's points, and I think there are also some other design 
questions which don't quite make sense to me. Perhaps you can clear them up?  
:-)

Eiichi Tsukata writes:
>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>index 4bdb58ab14cb..e2d57200fd00 100644
>--- a/mm/hugetlb.c
>+++ b/mm/hugetlb.c
>@@ -1726,8 +1726,8 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  * balanced over allowed nodes.
>  * Called with hugetlb_lock locked.
>  */
>-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>-							 bool acct_surplus)
>+int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>+			bool acct_surplus)
> {
> 	int nr_nodes, node;
> 	int ret = 0;

The immediate red flag to me is that we're investing further mm knowledge into 
hugetlb. For the vast majority of intents and purposes, hugetlb exists outside 
of the typical memory management lifecycle, and historic behaviour has been to 
treat it as a separate reserve that we don't touch. We expect that hugetlb is a 
reserve which is by and large explicitly managed by the system administrator, 
not by us, and this seems to violate that.

Shoehorning shrink-on-OOM support into it seems a little suspicious to me, 
because we already have a modernised system for huge pages that handles not 
only this, but many other memory management situations: THP. THP not only has 
support for this particular case, but also many other features which are 
necessary to coherently manage it as part of the mm lifecycle. For that reason, 
I'm not convinced that this composes into a sensible interface.

As some example questions which appear unresolved to me: if hugetlb pages are 
lost, what mechanisms will we provide to tell automation or the system 
administrator what to do in that scenario? How should the interface for 
resolving hugepage starvation due to repeated OOMs look? By what metrics will 
you decide if releasing the hugepage is worse for the system than selecting a 
victim for OOM? Why can't the system use the existing THP mechanisms to resolve 
this ahead of time?

Thanks,

Chris
David Rientjes Feb. 16, 2021, 9:53 p.m. UTC | #3
On Tue, 16 Feb 2021, Michal Hocko wrote:

> > Hugepages can be preallocated to avoid unpredictable allocation latency.
> > If we run into 4k page shortage, the kernel can trigger OOM even though
> > there were free hugepages. When OOM is triggered by user address page
> > fault handler, we can use oom notifier to free hugepages in user space
> > but if it's triggered by memory allocation for kernel, there is no way
> > to synchronously handle it in user space.
> 
> Can you expand some more on what kind of problem do you see?
> Hugetlb pages are, by definition, a preallocated, unreclaimable and
> admin controlled pool of pages.

Small nit: true of non-surplus hugetlb pages.

> Under those conditions it is expected
> and required that the sizing would be done very carefully. Why is that a
> problem in your particular setup/scenario?
> 
> If the sizing is really done properly and then a random process can
> trigger OOM then this can lead to malfunctioning of those workloads
> which do depend on hugetlb pool, right? So isn't this a kinda DoS
> scenario?
> 
> > This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> > enabled, it first tries to free a hugepage if available before invoking
> > the oom-killer. The default value is disabled not to change the current
> > behavior.
> 
> Why is this interface not hugepage size aware? It is quite different to
> release a GB huge page or 2MB one. Or is it expected to release the
> smallest one? To the implementation...
> 
> [...]
> > +static int sacrifice_hugepage(void)
> > +{
> > +	int ret;
> > +
> > +	spin_lock(&hugetlb_lock);
> > +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
> 
> ... no it is going to release the default huge page. This will be 2MB in
> most cases but this is not given.
> 
> Unless I am mistaken this will free up also reserved hugetlb pages. This
> would mean that a page fault would SIGBUS which is very likely not
> something we want to do right? You also want to use oom nodemask rather
> than a full one.
> 
> Overall, I am not really happy about this feature even when above is
> fixed, but let's hear more the actual problem first.

Shouldn't this behavior be possible as an oomd plugin instead, perhaps 
triggered by psi?  I'm not sure if oomd is intended only to kill something 
(oomkilld? lol) or if it can be made to do sysadmin level behavior, such 
as shrinking the hugetlb pool, to solve the oom condition.

If so, it seems like we want to do this at the absolute last minute.  In 
other words, reclaim has failed to free memory by other means so we would 
like to shrink the hugetlb pool.  (It's the reason why it's implemented as 
a predecessor to oom as opposed to part of reclaim in general.)

Do we have the ability to suppress the oom killer until oomd has a chance 
to react in this scenario?
Mike Kravetz Feb. 16, 2021, 10:30 p.m. UTC | #4
On 2/16/21 12:12 AM, Michal Hocko wrote:
> On Tue 16-02-21 03:07:13, Eiichi Tsukata wrote:
>> Hugepages can be preallocated to avoid unpredictable allocation latency.
>> If we run into 4k page shortage, the kernel can trigger OOM even though
>> there were free hugepages. When OOM is triggered by user address page
>> fault handler, we can use oom notifier to free hugepages in user space
>> but if it's triggered by memory allocation for kernel, there is no way
>> to synchronously handle it in user space.
> 
> Can you expand some more on what kind of problem do you see?
> Hugetlb pages are, by definition, a preallocated, unreclaimable and
> admin controlled pool of pages. Under those conditions it is expected
> and required that the sizing would be done very carefully. Why is that a
> problem in your particular setup/scenario?
> 
> If the sizing is really done properly and then a random process can
> trigger OOM then this can lead to malfunctioning of those workloads
> which do depend on hugetlb pool, right? So isn't this a kinda DoS
> scenario?

I spent a bunch of time last year looking at OOMs or near-OOMs on systems
where there were a bunch of free hugetlb pages.  The number of hugetlb pages
was carefully chosen by the DB for its expected needs.  Some other services
running on the system were actually driving/causing the OOM situations.
If a feature like this was in place, it could have caused a DoS scenario
as Michal suggested.

However, this is an 'opt in' feature.  So, I would not expect anyone who
carefully plans the size of their hugetlb pool to enable such a feature.
If there is a use case where hugetlb pages are used in a non-essential
application, this might be of use.
Michal Hocko Feb. 17, 2021, 7:54 a.m. UTC | #5
On Tue 16-02-21 13:53:12, David Rientjes wrote:
> On Tue, 16 Feb 2021, Michal Hocko wrote:
[...]
> > Overall, I am not really happy about this feature even when above is
> > fixed, but let's hear more the actual problem first.
> 
> Shouldn't this behavior be possible as an oomd plugin instead, perhaps 
> triggered by psi?  I'm not sure if oomd is intended only to kill something 
> (oomkilld? lol) or if it can be made to do sysadmin level behavior, such 
> as shrinking the hugetlb pool, to solve the oom condition.

It should be under control of an admin who knows what the pool is
preallocated for and whether a decrease (e.g. a temporary one) is
tolerable.
 
> If so, it seems like we want to do this at the absolute last minute.  In 
> other words, reclaim has failed to free memory by other means so we would 
> like to shrink the hugetlb pool.  (It's the reason why it's implemented as 
> a predecessor to oom as opposed to part of reclaim in general.)
> 
> Do we have the ability to suppress the oom killer until oomd has a chance 
> to react in this scenario?

We don't and I do not think we want to bind the kernel oom behavior to
any userspace process. We have extensively discussed things like this in
the past IIRC.
Michal Hocko Feb. 17, 2021, 7:57 a.m. UTC | #6
On Tue 16-02-21 14:30:15, Mike Kravetz wrote:
[...]
> However, this is an 'opt in' feature.  So, I would not expect anyone who
> carefully plans the size of their hugetlb pool to enable such a feature.
> If there is a use case where hugetlb pages are used in a non-essential
> application, this might be of use.

I would really like to hear about the specific usecase. Because it
smells more like a misconfiguration. What would be non-essential hugetlb
pages? This is not a resource to be pre-allocated just in case, right?
David Hildenbrand Feb. 17, 2021, 9:09 a.m. UTC | #7
On 16.02.21 04:07, Eiichi Tsukata wrote:
> Hugepages can be preallocated to avoid unpredictable allocation latency.
> If we run into 4k page shortage, the kernel can trigger OOM even though
> there were free hugepages. When OOM is triggered by user address page
> fault handler, we can use oom notifier to free hugepages in user space
> but if it's triggered by memory allocation for kernel, there is no way
> to synchronously handle it in user space.
> 
> This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> enabled, it first tries to free a hugepage if available before invoking
> the oom-killer. The default value is disabled not to change the current
> behavior.

In addition to the other comments, some more thoughts:

What if you're low on kernel memory but you end up freeing huge pages 
residing in ZONE_MOVABLE? IOW, this is not zone aware.
Eiichi Tsukata Feb. 17, 2021, 10:42 a.m. UTC | #8
Hi All,

Firstly, thank you for your careful review and attention to my patch
(and apologies for top-posting!).  Let me first explain why our use
case requires hugetlb over THP, then elaborate on the difficulty we
have maintaining the correct number of hugepages in the pool, and finally
conclude with why the proposed approach would help us. Hopefully you
can extend it to other use cases and justify the proposal.

We use Linux to operate a KVM-based hypervisor. Using hugepages to
back VM memory significantly increases performance and density. Each
VM incurs a 4k regular page overhead which can vary drastically even
at runtime (eg. depending on network traffic). In addition, the
software doesn't know upfront if users will power on one large VM or
several small VMs.

To manage the varying balance of 4k pages vs. hugepages, we originally
leveraged THP. However, constant fragmentation due to VM power cycles,
the varying overhead I mentioned above, and other operations like
reconfiguration of NIC RX buffers resulted in two problems:
1) There were no guarantees hugepages would be used; and
2) Constant memory compaction incurred a measurable overhead.

Having a userspace service managing hugetlb gave us significant
performance advantages and much needed determinism. It chooses when to
try and create more hugepages as well as how many hugepages to go
after. Elements like how many hugepages it actually gets, combined
with what operations are happening on the host, allow our service to
make educated decisions about when to compact memory, drop caches, and
retry growing (or shrinking) the pool.

But that comes with a challenge: despite listening on cgroups for
pressure notifications (which result from runtime events we do
not control), the service is not guaranteed to sacrifice hugepages
fast enough, and that causes an OOM. The killer will normally take out
a VM even if there are plenty of unused hugepages and that's obviously
disruptive for users. For us, free hugepages are almost always expendable.

For the bloat cases which are predictable, a memory management service
can adjust the hugepage pool size ahead of time. But it can be hard to
anticipate all scenarios, and some can be very volatile. Having a
failsafe mechanism as proposed in this patch offers invaluable
protection when things are missed.

The proposal solves this problem by sacrificing hugepages inline even
when the pressure comes from kernel allocations. The userspace service
can later readjust the pool size without being under pressure. Given
this is configurable, and defaults to being off, we thought it would
be a nice addition to the kernel and appreciated by other users that
may have similar requirements.

I welcome your comments and thank you again for your time!

Eiichi

> On Feb 17, 2021, at 16:57, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Tue 16-02-21 14:30:15, Mike Kravetz wrote:
> [...]
>> However, this is an 'opt in' feature.  So, I would not expect anyone who
>> carefully plans the size of their hugetlb pool to enable such a feature.
>> If there is a use case where hugetlb pages are used in a non-essential
>> application, this might be of use.
> 
> I would really like to hear about the specific usecase. Because it
> smells more like a misconfiguration. What would be non-essential hugetlb
> pages? This is not a resource to be pre-allocated just in case, right?
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Feb. 17, 2021, 12:31 p.m. UTC | #9
On Wed 17-02-21 10:42:24, Eiichi Tsukata wrote:
> Hi All,
> 
> Firstly, thank you for your careful review and attention to my patch
> (and apologies for top-posting!).  Let me first explain why our use
> case requires hugetlb over THP and then elaborate on the difficulty we
> have to maintain the correct number of hugepages in the pool, finally
> concluding with why the proposed approach would help us. Hopefully you
> can extend it to other use cases and justify the proposal.
> 
> We use Linux to operate a KVM-based hypervisor. Using hugepages to
> back VM memory significantly increases performance and density. Each
> VM incurs a 4k regular page overhead which can vary drastically even
> at runtime (eg. depending on network traffic). In addition, the
> software doesn't know upfront if users will power on one large VM or
> several small VMs.
> 
> To manage the varying balance of 4k pages vs. hugepages, we originally
> leveraged THP. However, constant fragmentation due to VM power cycles,
> the varying overhead I mentioned above, and other operations like
> reconfiguration of NIC RX buffers resulted in two problems:
> 1) There were no guarantees hugepages would be used; and
> 2) Constant memory compaction incurred a measurable overhead.
> 
> Having a userspace service managing hugetlb gave us significant
> performance advantages and much needed determinism. It chooses when to
> try and create more hugepages as well as how many hugepages to go
> after. Elements like how many hugepages it actually gets, combined
> with what operations are happening on the host, allow our service to
> make educated decisions about when to compact memory, drop caches, and
> retry growing (or shrinking) the pool.

OK, thanks for the clarification. Just to make sure I understand. This
means that you pro-actively and optimistically pre-allocate hugetlb
pages even when there is no immediate need for those, right?

> But that comes with a challenge: despite listening on cgroup for
> pressure notifications (which happen from those runtime events we do
> not control),

We do also have global pressure (PSI) counters. Have you tried to look
into those and try to back off even when the situation becomes critical?

> the service is not guaranteed to sacrifice hugepages
> fast enough and that causes an OOM. The killer will normally take out
> a VM even if there are plenty of unused hugepages and that's obviously
> disruptive for users. For us, free hugepages are almost always expendable.
> 
> For the bloat cases which are predictable, a memory management service
> can adjust the hugepage pool size ahead of time. But it can be hard to
> anticipate all scenarios, and some can be very volatile. Having a
> failsafe mechanism as proposed in this patch offers invaluable
> protection when things are missed.
> 
> The proposal solves this problem by sacrificing hugepages inline even
> when the pressure comes from kernel allocations. The userspace service
> can later readjust the pool size without being under pressure. Given
> this is configurable, and defaults to being off, we thought it would
> be a nice addition to the kernel and appreciated by other users that
> may have similar requirements.

Thanks for your usecase description. It helped me to understand what you
are doing and how this can be really useful for your particular setup.
This is really a very specific situation from my POV. I am not yet sure
this is generic enough to warrant yet another tunable. One thing
you can do [1] is to
hook into the oom notifiers interface (register_oom_notifier) and release
pages from the callback. Why is that better than a global tunable?
For one thing, you can make the implementation tailored to your specific
usecase. As the review feedback has shown, this would be trickier to
do in the general case. Unlike a generic solution, it would allow you
to coordinate with your userspace if you need to. Would something like that
work for you?

---
[1] and I have to say I hate myself for suggesting that because I was
really hoping this interface would go away. But the reality disagrees so
I gave up on that goal...
Michal Hocko Feb. 17, 2021, 12:40 p.m. UTC | #10
On Wed 17-02-21 13:31:07, Michal Hocko wrote:
[...]
> Thanks for your usecase description. It helped me to understand what you
> are doing and how this can be really useful for your particular setup.
> This is really a very specific situation from my POV. I am not yet sure
> this is generic enough to warrant for a yet another tunable. One thing
> you can do [1] is to
> hook into oom notifiers interface (register_oom_notifier) and release
> pages from the callback.

Forgot to mention that this would be done from a kernel module.

> Why is that better than a global tunable?
> For one thing you can make the implementation tailored to your specific
> usecase. As the review feedback has shown this would be more tricky to
> be done in a general case. Unlike a generic solution it would allow you
> to coordinate with your userspace if you need. Would something like that
> work for you?
> 
> ---
> [1] and I have to say I hate myself for suggesting that because I was
> really hoping this interface would go away. But the reality disagrees so
> I gave up on that goal...
> -- 
> Michal Hocko
> SUSE Labs
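[Michal's suggestion above, a module hooking register_oom_notifier, could look
roughly like the untested outline below. It is a sketch added for illustration,
not a working module: it assumes the RFC patch's export of
free_pool_huge_page() and mirrors the locking of the patch's
sacrifice_hugepage().]

```
/* Sketch of an out-of-tree module using the oom notifier chain. The
 * callback is invoked from out_of_memory() with a pointer to the
 * "freed" page count; reporting progress there suppresses the kill. */
#include <linux/module.h>
#include <linux/oom.h>
#include <linux/hugetlb.h>

static int sacrifice_oom_notify(struct notifier_block *nb,
				unsigned long unused, void *freed)
{
	unsigned long *nr_freed = freed;

	spin_lock(&hugetlb_lock);
	/* Assumes free_pool_huge_page() is exported as in the RFC patch. */
	if (free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0))
		*nr_freed += 1;	/* tell the OOM path we made progress */
	spin_unlock(&hugetlb_lock);
	return NOTIFY_OK;
}

static struct notifier_block sacrifice_nb = {
	.notifier_call = sacrifice_oom_notify,
};

static int __init sacrifice_init(void)
{
	return register_oom_notifier(&sacrifice_nb);
}

static void __exit sacrifice_exit(void)
{
	unregister_oom_notifier(&sacrifice_nb);
}

module_init(sacrifice_init);
module_exit(sacrifice_exit);
MODULE_LICENSE("GPL");
```

As noted in the thread, such a module would still inherit the review concerns
(reserved pages, nodemask, hstate selection) and would need to address them
for the specific deployment.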
Shakeel Butt Feb. 17, 2021, 2:59 p.m. UTC | #11
On Tue, Feb 16, 2021 at 5:25 PM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 16 Feb 2021, Michal Hocko wrote:
>
> > > Hugepages can be preallocated to avoid unpredictable allocation latency.
> > > If we run into 4k page shortage, the kernel can trigger OOM even though
> > > there were free hugepages. When OOM is triggered by user address page
> > > fault handler, we can use oom notifier to free hugepages in user space
> > > but if it's triggered by memory allocation for kernel, there is no way
> > > to synchronously handle it in user space.
> >
> > Can you expand some more on what kind of problem do you see?
> > Hugetlb pages are, by definition, a preallocated, unreclaimable and
> > admin controlled pool of pages.
>
> Small nit: true of non-surplus hugetlb pages.
>
> > Under those conditions it is expected
> > and required that the sizing would be done very carefully. Why is that a
> > problem in your particular setup/scenario?
> >
> > If the sizing is really done properly and then a random process can
> > trigger OOM then this can lead to malfunctioning of those workloads
> > which do depend on hugetlb pool, right? So isn't this a kinda DoS
> > scenario?
> >
> > > This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> > > enabled, it first tries to free a hugepage if available before invoking
> > > the oom-killer. The default value is disabled not to change the current
> > > behavior.
> >
> > Why is this interface not hugepage size aware? It is quite different to
> > release a GB huge page or 2MB one. Or is it expected to release the
> > smallest one? To the implementation...
> >
> > [...]
> > > +static int sacrifice_hugepage(void)
> > > +{
> > > +   int ret;
> > > +
> > > +   spin_lock(&hugetlb_lock);
> > > +   ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
> >
> > ... no it is going to release the default huge page. This will be 2MB in
> > most cases but this is not given.
> >
> > Unless I am mistaken this will free up also reserved hugetlb pages. This
> > would mean that a page fault would SIGBUS which is very likely not
> > something we want to do right? You also want to use oom nodemask rather
> > than a full one.
> >
> > Overall, I am not really happy about this feature even when above is
> > fixed, but let's hear more the actual problem first.
>
> Shouldn't this behavior be possible as an oomd plugin instead, perhaps
> triggered by psi?  I'm not sure if oomd is intended only to kill something
> (oomkilld? lol) or if it can be made to do sysadmin level behavior, such
> as shrinking the hugetlb pool, to solve the oom condition.

The senpai plugin of oomd actually is a proactive reclaimer, so oomd
is being used for more than oom-killing.

>
> If so, it seems like we want to do this at the absolute last minute.  In
> other words, reclaim has failed to free memory by other means so we would
> like to shrink the hugetlb pool.  (It's the reason why it's implemented as
> a predecessor to oom as opposed to part of reclaim in general.)
>
> Do we have the ability to suppress the oom killer until oomd has a chance
> to react in this scenario?

There is no explicit knob but there are indirect ways to delay the
kernel oom killer. In the presence of reclaimable memory the kernel is
very conservative about triggering the oom-kill. I think the way Facebook is
achieving this in oomd is by using swap to have enough
reclaimable memory and then using memory.swap.high to throttle the
workload's allocation rates, which will increase the PSI as well. Since
oomd polls PSI, it will be able to react before the kernel oom-killer.
Eiichi Tsukata Feb. 18, 2021, 12:22 p.m. UTC | #12
Hi Michal

> On Feb 17, 2021, at 21:31, Michal Hocko <mhocko@suse.com> wrote:
> 
> On Wed 17-02-21 10:42:24, Eiichi Tsukata wrote:
>> Hi All,
>> 
>> Firstly, thank you for your careful review and attention to my patch
>> (and apologies for top-posting!).  Let me first explain why our use
>> case requires hugetlb over THP and then elaborate on the difficulty we
>> have to maintain the correct number of hugepages in the pool, finally
>> concluding with why the proposed approach would help us. Hopefully you
>> can extend it to other use cases and justify the proposal.
>> 
>> We use Linux to operate a KVM-based hypervisor. Using hugepages to
>> back VM memory significantly increases performance and density. Each
>> VM incurs a 4k regular page overhead which can vary drastically even
>> at runtime (eg. depending on network traffic). In addition, the
>> software doesn't know upfront if users will power on one large VM or
>> several small VMs.
>> 
>> To manage the varying balance of 4k pages vs. hugepages, we originally
>> leveraged THP. However, constant fragmentation due to VM power cycles,
>> the varying overhead I mentioned above, and other operations like
>> reconfiguration of NIC RX buffers resulted in two problems:
>> 1) There were no guarantees hugepages would be used; and
>> 2) Constant memory compaction incurred a measurable overhead.
>> 
>> Having a userspace service managing hugetlb gave us significant
>> performance advantages and much needed determinism. It chooses when to
>> try and create more hugepages as well as how many hugepages to go
>> after. Elements like how many hugepages it actually gets, combined
>> with what operations are happening on the host, allow our service to
>> make educated decisions about when to compact memory, drop caches, and
>> retry growing (or shrinking) the pool.
> 
> OK, thanks for the clarification. Just to make sure I understand. This
> means that you pro-actively and optimistically pre-allocate hugetlb
> pages even when there is no immediate need for those, right?

Right, but this is not a "pre-allocation just in case". We need to
know how many hugepages are available for VM memory upfront. That
allows us to plan for disaster scenarios where a host goes down and we
need to restart VMs on other hosts. In addition, going from zero to
TBs worth of hugepages may take a long time and makes VM power-on
times too slow. Of course in bloat conditions we could lose hugepages
we pre-allocated, but our placement models can react to that.


> 
>> But that comes with a challenge: despite listening on cgroup for
>> pressure notifications (which happen from those runtime events we do
>> not control),
> 
> We do also have global pressure (PSI) counters. Have you tried to look
> into those and try to back off even when the situation becomes critical?

Yes. PSI counters help us to some extent. But we've found that in some cases
OOM can happen before we observe memory pressure if memory bloat occurs
rapidly. The proposed failsafe mechanism can cover even such a situation.
Also, as I mentioned in the commit message, oom notifiers don't work if OOM
is triggered by a kernel memory allocation.

> 
>> the service is not guaranteed to sacrifice hugepages
>> fast enough and that causes an OOM. The killer will normally take out
>> a VM even if there are plenty of unused hugepages and that's obviously
>> disruptive for users. For us, free hugepages are almost always expendable.
>> 
>> For the bloat cases which are predictable, a memory management service
>> can adjust the hugepage pool size ahead of time. But it can be hard to
>> anticipate all scenarios, and some can be very volatile. Having a
>> failsafe mechanism as proposed in this patch offers invaluable
>> protection when things are missed.
>> 
>> The proposal solves this problem by sacrificing hugepages inline even
>> when the pressure comes from kernel allocations. The userspace service
>> can later readjust the pool size without being under pressure. Given
>> this is configurable, and defaults to being off, we thought it would
>> be a nice addition to the kernel and appreciated by other users that
>> may have similar requirements.
> 
> Thanks for your usecase description. It helped me to understand what you
> are doing and how this can be really useful for your particular setup.
> This is really a very specific situation from my POV. I am not yet sure
> this is generic enough to warrant for a yet another tunable. One thing
> you can do [1] is to
> hook into oom notifiers interface (register_oom_notifier) and release
> pages from the callback. Why is that better than a global tunable?
> For one thing you can make the implementation tailored to your specific
> usecase. As the review feedback has shown this would be more tricky to
> be done in a general case. Unlike a generic solution it would allow you
> to coordinate with your userspace if you need. Would something like that
> work for you?

Thanks for your suggestion. Implementing our own oom handler using
register_oom_notifier in an out-of-tree kernel module is actually one of our
options. The intention of this RFC patch is to share the idea and learn
about the needs of other users who may have similar requirements.

As for the implementation, I'm considering making the behavior of
sacrifice_hugepage() correspond to decrementing the vm.nr_hugepages param.
Of course any suggestions are always welcome.

Eiichi

> 
> ---
> [1] and I have to say I hate myself for suggesting that because I was
> really hoping this interface would go away. But the reality disagrees so
> I gave up on that goal...
> -- 
> Michal Hocko
> SUSE Labs
Chris Down Feb. 18, 2021, 12:39 p.m. UTC | #13
Eiichi Tsukata writes:
>>> But that comes with a challenge: despite listening on cgroup for
>>> pressure notifications (which happen from those runtime events we do
>>> not control),
>>
>> We do also have global pressure (PSI) counters. Have you tried to look
>> into those and try to back off even when the situation becomes critical?
>
>Yes. PSI counters help us to some extent. But we've found that in some cases
>OOM can happen before we observe memory pressure if memory bloat occurred
>rapidly. The proposed failsafe mechanism can cover even such a situation.
>Also, as I mentioned in commit message, oom notifiers doesn't work if OOM
>is triggered by memory allocation for kernel.

Hmm, do you have free swap? Without it, we can trivially go from fine to OOM in 
a totally binary fashion. As long as there's some swap space available, there 
should be a clear period where pressure is rising prior to OOM.
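To illustrate the userspace side of the back-off approach Chris describes, here is a small sketch that parses the global PSI memory counters. The `/proc/pressure/memory` line format ("some avg10=0.00 avg60=0.00 avg300=0.00 total=0") is the documented PSI interface; the 10.0% threshold below is an arbitrary example value, not a recommendation:

```python
# Sketch of a userspace watchdog check against the PSI memory counters.
# parse_psi_line() handles one line of /proc/pressure/memory;
# memory_pressure_high() applies an example threshold to the "some" avg10
# field (percentage of time in the last 10s that some task stalled on memory).

def parse_psi_line(line):
    """Parse one PSI line into (kind, {field: value}), e.g. ('some', {...})."""
    kind, _, rest = line.strip().partition(" ")
    fields = {}
    for token in rest.split():
        key, _, val = token.partition("=")
        fields[key] = float(val)
    return kind, fields

def memory_pressure_high(psi_text, threshold=10.0):
    """Return True if the 'some' avg10 stall percentage exceeds threshold."""
    for line in psi_text.splitlines():
        kind, fields = parse_psi_line(line)
        if kind == "some" and fields.get("avg10", 0.0) > threshold:
            return True
    return False

# In practice psi_text would come from open("/proc/pressure/memory").read();
# a fixed sample is used here so the sketch is self-contained.
sample = ("some avg10=12.50 avg60=3.20 avg300=0.80 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n")
print(memory_pressure_high(sample))
```

As the thread notes, this only helps when pressure rises gradually; a sudden binary transition to OOM (e.g. with no swap) can outrun any polling-based watchdog.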
Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e35a3f2fb006..f2f195524be6 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -65,6 +65,7 @@  Currently, these files are in /proc/sys/vm:
 - page-cluster
 - panic_on_oom
 - percpu_pagelist_fraction
+- sacrifice_hugepage_on_oom
 - stat_interval
 - stat_refresh
 - numa_stat
@@ -807,6 +808,17 @@  The initial value is zero.  Kernel does not use this value at boot time to set
 the high water marks for each per cpu page list.  If the user writes '0' to this
 sysctl, it will revert to this default behavior.
 
+sacrifice_hugepage_on_oom
+=========================
+
+This value controls whether the kernel should attempt to free a hugepage
+when out of memory. An OOM that happens under a memory cgroup does not
+invoke this.
+
+If set to 0 (default), the kernel doesn't touch the hugepage pool during OOM
+conditions.
+If set to 1, the kernel frees one hugepage at a time, if available, before
+invoking the oom-killer.
 
 stat_interval
 =============
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b5807f23caf8..8aad2f2ab6e6 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -145,6 +145,8 @@  int hugetlb_reserve_pages(struct inode *inode, long from, long to,
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
 bool isolate_huge_page(struct page *page, struct list_head *list);
+int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+			bool acct_surplus);
 void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..0bfae027ec16 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -127,4 +127,5 @@  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern int sysctl_sacrifice_hugepage_on_oom;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd848138..d2e3ec625f5f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2708,6 +2708,15 @@  static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "sacrifice_hugepage_on_oom",
+		.data		= &sysctl_sacrifice_hugepage_on_oom,
+		.maxlen		= sizeof(sysctl_sacrifice_hugepage_on_oom),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "overcommit_ratio",
 		.data		= &sysctl_overcommit_ratio,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..e2d57200fd00 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1726,8 +1726,8 @@  static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
  * balanced over allowed nodes.
  * Called with hugetlb_lock locked.
  */
-static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-							 bool acct_surplus)
+int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+			bool acct_surplus)
 {
 	int nr_nodes, node;
 	int ret = 0;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04b19b7b5435..fd2c1f427926 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,7 @@ 
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hugetlb.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -52,6 +53,7 @@ 
 #include <trace/events/oom.h>
 
 int sysctl_panic_on_oom;
+int sysctl_sacrifice_hugepage_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 
@@ -1023,6 +1025,22 @@  static void check_panic_on_oom(struct oom_control *oc)
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
 }
 
+static int sacrifice_hugepage(void)
+{
+	int ret;
+
+	spin_lock(&hugetlb_lock);
+	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
+	spin_unlock(&hugetlb_lock);
+	if (ret) {
+		pr_warn("Out of memory: Successfully sacrificed a hugepage\n");
+		hugetlb_show_meminfo();
+	} else {
+		pr_warn("Out of memory: No free hugepage available\n");
+	}
+	return ret;
+}
+
 static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
 
 int register_oom_notifier(struct notifier_block *nb)
@@ -1100,6 +1118,11 @@  bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	if (!is_memcg_oom(oc) && sysctl_sacrifice_hugepage_on_oom) {
+		if (sacrifice_hugepage())
+			return true;
+	}
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {