x86/pod: Do not fragment PoD memory allocations

Message ID 202101242308.10ON8Umj004866@m5p.com (mailing list archive)
State New, archived

Commit Message

Elliott Mitchell Jan. 24, 2021, 4:47 a.m. UTC
Previously p2m_pod_set_cache_target() would fall back to allocating 4KB
pages if 2MB pages ran out.  This is counterproductive since it suggests
severe memory pressure and is likely a precursor to a memory exhaustion
panic.  As such don't try to fill requests for 2MB pages from 4KB pages
if 2MB pages run out.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>

---
Changes in v2:
- Include the obvious removal of the goto target.  Always realize you're
  at the wrong place when you press "send".

I'm not including a separate cover message since this is a single hunk.
This really needs some checking in `xl`.  If one has a domain which
sometimes gets started on different hosts and is sometimes modified with
slightly differing settings, one can run into trouble.

In this case the particular domain is most often used as
PV/PVH, but every so often is used as a template for HVM.  Starting it
HVM will trigger PoD mode.  If it is started on a machine with less
memory than others, PoD may well exhaust all memory and then trigger a
panic.

`xl` should likely fail HVM domain creation when the maximum memory
exceeds available memory (never mind total memory).

For example try a domain with the following settings:

memory = 8192
maxmem = 2147483648

If type is PV or PVH, it will likely boot successfully.  Change type to
HVM and unless your hardware budget is impressive, Xen will soon panic.

Really, this is an example of where Xen should be robust against user
error.  Certainly as a HVM domain this is wrong, yet a nominally valid
domain configuration shouldn't be able to panic Xen.

In other news, I would suggest on ia32 Xen should only support domain
memory allocations in multiples of 2MB.  There is a need to toss around 4KB
pages for I/O and ballooning superpages may be difficult.  Yet allocating
or deallocating less than an entire superpage worth at a time seems
dubious.
---
 xen/arch/x86/mm/p2m-pod.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

Comments

Jan Beulich Jan. 25, 2021, 9:56 a.m. UTC | #1
On 24.01.2021 05:47, Elliott Mitchell wrote:
> Previously p2m_pod_set_cache_target() would fall back to allocating 4KB
> pages if 2MB pages ran out.  This is counterproductive since it suggests
> severe memory pressure and is likely a precursor to a memory exhaustion
> panic.  As such don't try to fill requests for 2MB pages from 4KB pages
> if 2MB pages run out.

I disagree - there may be ample 4k pages available, yet no 2Mb
ones at all. I only agree that this _may_ be counterproductive
_if indeed_ the system is short on memory.
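
The fragmentation case described above can be illustrated with a toy
model.  This is a minimal sketch, not Xen code: the bitmap allocator and
every toy_* name are hypothetical, and toy_alloc_with_fallback() only
imitates the retry logic that the patch removes.  One 4KB page is pinned
in each 2MB-aligned run, so nearly all memory is free yet no 2MB
allocation can succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES    2048u  /* toy heap: 2048 x 4KB pages = 8MB */
#define TOY_ORDER_4K 0u
#define TOY_ORDER_2M 9u     /* 2^9 x 4KB pages = 2MB */

/* true means the page is free.  Hypothetical stand-in for a real heap. */
static bool toy_free[TOY_PAGES];

/* Free every page, then pin the first page of each 2MB-aligned run. */
static void toy_fragment_heap(void)
{
    size_t run_2m = (size_t)1 << TOY_ORDER_2M;

    for (size_t i = 0; i < TOY_PAGES; i++)
        toy_free[i] = true;
    for (size_t i = 0; i < TOY_PAGES; i += run_2m)
        toy_free[i] = false;
}

/* Count the 4KB pages that are currently free. */
static size_t toy_count_free(void)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_PAGES; i++)
        n += toy_free[i] ? 1 : 0;
    return n;
}

/*
 * Allocate a naturally aligned run of 2^order pages.  Returns the index
 * of the first page, or -1 if no fully free aligned run exists.
 */
static long toy_alloc(unsigned int order)
{
    size_t run = (size_t)1 << order;

    assert(order <= TOY_ORDER_2M);
    for (size_t base = 0; base + run <= TOY_PAGES; base += run) {
        size_t i = 0;

        while (i < run && toy_free[base + i])
            i++;
        if (i < run)
            continue;   /* at least one page in this run is in use */
        for (i = 0; i < run; i++)
            toy_free[base + i] = false;
        return (long)base;
    }
    return -1;
}

/* Imitates the pre-patch behaviour: try the requested order, then 4KB. */
static long toy_alloc_with_fallback(unsigned int *order)
{
    long page = toy_alloc(*order);

    if (page < 0 && *order == TOY_ORDER_2M) {
        *order = TOY_ORDER_4K;
        page = toy_alloc(*order);
    }
    return page;
}
```

After toy_fragment_heap(), 2044 of the 2048 pages are free, a direct
2MB request fails, and the fallback path still obtains a 4KB page.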

> Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
> 
> ---
> Changes in v2:
> - Include the obvious removal of the goto target.  Always realize you're
>   at the wrong place when you press "send".

Please could you also label the submission then accordingly? I
got puzzled by two identically titled messages side by side,
until I noticed the difference.

> I'm not including a separate cover message since this is a single hunk.
> This really needs some checking in `xl`.  If one has a domain which
> sometimes gets started on different hosts and is sometimes modified with
> slightly differing settings, one can run into trouble.
> 
> In this case the particular domain is most often used as
> PV/PVH, but every so often is used as a template for HVM.  Starting it
> HVM will trigger PoD mode.  If it is started on a machine with less
> memory than others, PoD may well exhaust all memory and then trigger a
> panic.
> 
> `xl` should likely fail HVM domain creation when the maximum memory
> exceeds available memory (never mind total memory).

I don't think so, no - it's the purpose of PoD to allow starting
a guest despite there not being enough memory available to
satisfy its "max", as such guests are expected to balloon down
immediately, rather than triggering an oom condition.

> For example try a domain with the following settings:
> 
> memory = 8192
> maxmem = 2147483648
> 
> If type is PV or PVH, it will likely boot successfully.  Change type to
> HVM and unless your hardware budget is impressive, Xen will soon panic.

Xen will panic? That would need fixing if so. Also I'd consider
an excessively high maxmem (compared to memory) a configuration
error. According to my experiments long, long ago I seem to
recall that a factor beyond 32 is almost never going to lead to
anything good, irrespective of guest type. (But as said, badness
here should be restricted to the guest; Xen itself should limp
on fine.)
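
The rule of thumb above can be written as a simple check.  This is a
sketch only: the function name and the constant are hypothetical, the
factor of 32 is the recollection stated above rather than a documented
limit, and both values are assumed to be in MiB as in an xl
configuration file.  For the quoted example, 2147483648 / 8192 gives a
factor of 262144.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rule-of-thumb ceiling on maxmem / memory mentioned in the thread. */
#define TOY_MAX_RATIO 32u

/*
 * Return true if maxmem exceeds TOY_MAX_RATIO times memory.
 * Both arguments are in MiB; memory_mib must be non-zero.
 */
static bool toy_maxmem_ratio_excessive(uint64_t memory_mib,
                                       uint64_t maxmem_mib)
{
    assert(memory_mib > 0);

    /* If the product would overflow, maxmem cannot possibly exceed it. */
    if (memory_mib > UINT64_MAX / TOY_MAX_RATIO)
        return false;

    return maxmem_mib > memory_mib * TOY_MAX_RATIO;
}
```

A toolstack could use a check like this to print a warning; whether it
should refuse such a configuration outright is a separate question.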

> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -212,16 +212,13 @@ p2m_pod_set_cache_target(struct p2m_domain *p2m, unsigned long pod_target, int p
>              order = PAGE_ORDER_2M;
>          else
>              order = PAGE_ORDER_4K;
> -    retry:
>          page = alloc_domheap_pages(d, order, 0);
>          if ( unlikely(page == NULL) )
>          {
> -            if ( order == PAGE_ORDER_2M )
> -            {
> -                /* If we can't allocate a superpage, try singleton pages */
> -                order = PAGE_ORDER_4K;
> -                goto retry;
> -            }
> +            /* Superpages allocation failures likely indicate severe memory
> +            ** pressure.  Continuing to try to fulfill attempts using 4KB pages
> +            ** is likely to exhaust memory and trigger a panic.  As such it is
> +            ** NOT worth trying to use 4KB pages to fulfill 2MB page requests.*/

Just in case my arguments against this change get overridden:
This comment is malformed - please see ./CODING_STYLE.

Jan
Andrew Cooper Jan. 25, 2021, 10:20 a.m. UTC | #2
On 25/01/2021 09:56, Jan Beulich wrote:
> On 24.01.2021 05:47, Elliott Mitchell wrote:
>> Previously p2m_pod_set_cache_target() would fall back to allocating 4KB
>> pages if 2MB pages ran out.  This is counterproductive since it suggests
>> severe memory pressure and is likely a precursor to a memory exhaustion
>> panic.  As such don't try to fill requests for 2MB pages from 4KB pages
>> if 2MB pages run out.
> I disagree - there may be ample 4k pages available, yet no 2Mb
> ones at all. I only agree that this _may_ be counterproductive
> _if indeed_ the system is short on memory.

Further to this, PoD is very frequently used in combination with
ballooning operations, in which case there are (or can be made to be)
plenty of 4k pages, even without a single 2M range in sight.

~Andrew
Elliott Mitchell Jan. 25, 2021, 5:46 p.m. UTC | #3
On Mon, Jan 25, 2021 at 10:56:25AM +0100, Jan Beulich wrote:
> On 24.01.2021 05:47, Elliott Mitchell wrote:
> > 
> > ---
> > Changes in v2:
> > - Include the obvious removal of the goto target.  Always realize you're
> >   at the wrong place when you press "send".
> 
> Please could you also label the submission then accordingly? I
> got puzzled by two identically titled messages side by side,
> until I noticed the difference.

Sorry about that.  Would you have preferred a third message mentioning
this mistake?

> > I'm not including a separate cover message since this is a single hunk.
> > This really needs some checking in `xl`.  If one has a domain which
> > sometimes gets started on different hosts and is sometimes modified with
> > slightly differing settings, one can run into trouble.
> > 
> > In this case the particular domain is most often used as
> > PV/PVH, but every so often is used as a template for HVM.  Starting it
> > HVM will trigger PoD mode.  If it is started on a machine with less
> > memory than others, PoD may well exhaust all memory and then trigger a
> > panic.
> > 
> > `xl` should likely fail HVM domain creation when the maximum memory
> > exceeds available memory (never mind total memory).
> 
> I don't think so, no - it's the purpose of PoD to allow starting
> a guest despite there not being enough memory available to
> satisfy its "max", as such guests are expected to balloon down
> immediately, rather than triggering an oom condition.

Even Qemu/OVMF is expected to handle ballooning for a *HVM* domain?

> > For example try a domain with the following settings:
> > 
> > memory = 8192
> > maxmem = 2147483648
> > 
> > If type is PV or PVH, it will likely boot successfully.  Change type to
> > HVM and unless your hardware budget is impressive, Xen will soon panic.
> 
> Xen will panic? That would need fixing if so. Also I'd consider
> an excessively high maxmem (compared to memory) a configuration
> error. According to my experiments long, long ago I seem to
> recall that a factor beyond 32 is almost never going to lead to
> anything good, irrespective of guest type. (But as said, badness
> here should be restricted to the guest; Xen itself should limp
> on fine.)

I'll confess I haven't confirmed the panic is in Xen itself.  Problem is
when this gets triggered, by the time the situation is clear and I can
get to the console the computer is already restarting, thus no error
message has been observed.

This is most certainly a configuration error.  Problem is this is a very
small delta between a perfectly valid configuration and the one which
reliably triggers a panic.

The memory:maxmem ratio isn't the problem.  My example had a maxmem of
2147483648 since that is enough to exceed the memory of sub-$100K
computers.  The crucial features are maxmem >= machine memory,
memory < free memory (thus potentially bootable PV/PVH) and type = "hvm".

When was the last time you tried running a Xen machine with near zero
free memory?  Perhaps in the past Xen kept the promise of never panicking
on memory exhaustion, but this feels like this hasn't held for some time.
Jan Beulich Jan. 26, 2021, 11:08 a.m. UTC | #4
On 25.01.2021 18:46, Elliott Mitchell wrote:
> On Mon, Jan 25, 2021 at 10:56:25AM +0100, Jan Beulich wrote:
>> On 24.01.2021 05:47, Elliott Mitchell wrote:
>>>
>>> ---
>>> Changes in v2:
>>> - Include the obvious removal of the goto target.  Always realize you're
>>>   at the wrong place when you press "send".
>>
>> Please could you also label the submission then accordingly? I
>> got puzzled by two identically titled messages side by side,
>> until I noticed the difference.
> 
> Sorry about that.  Would you have preferred a third message mentioning
> this mistake?

No. But I'd have expected v2 to have its subject start with
"[PATCH v2] ...", making it relatively clear that one might
save looking at the one labeled just "[PATCH] ...".

>>> I'm not including a separate cover message since this is a single hunk.
>>> This really needs some checking in `xl`.  If one has a domain which
>>> sometimes gets started on different hosts and is sometimes modified with
>>> slightly differing settings, one can run into trouble.
>>>
>>> In this case the particular domain is most often used as
>>> PV/PVH, but every so often is used as a template for HVM.  Starting it
>>> HVM will trigger PoD mode.  If it is started on a machine with less
>>> memory than others, PoD may well exhaust all memory and then trigger a
>>> panic.
>>>
>>> `xl` should likely fail HVM domain creation when the maximum memory
>>> exceeds available memory (never mind total memory).
>>
>> I don't think so, no - it's the purpose of PoD to allow starting
>> a guest despite there not being enough memory available to
>> satisfy its "max", as such guests are expected to balloon down
>> immediately, rather than triggering an oom condition.
> 
> Even Qemu/OVMF is expected to handle ballooning for a *HVM* domain?

No idea how qemu comes into play here. Any preboot environment
aware of possibly running under Xen of course is expected to
tolerate running with maxmem > memory (or clearly document its
inability, in which case it may not be suitable for certain
use cases). For example, I don't see why a preboot environment
would need to touch all of the memory in a VM, except maybe
for the purpose of zeroing it (which PoD can deal with fine).

>>> For example try a domain with the following settings:
>>>
>>> memory = 8192
>>> maxmem = 2147483648
>>>
>>> If type is PV or PVH, it will likely boot successfully.  Change type to
>>> HVM and unless your hardware budget is impressive, Xen will soon panic.
>>
>> Xen will panic? That would need fixing if so. Also I'd consider
>> an excessively high maxmem (compared to memory) a configuration
>> error. According to my experiments long, long ago I seem to
>> recall that a factor beyond 32 is almost never going to lead to
>> anything good, irrespective of guest type. (But as said, badness
>> here should be restricted to the guest; Xen itself should limp
>> on fine.)
> 
> I'll confess I haven't confirmed the panic is in Xen itself.  Problem is
> when this gets triggered, by the time the situation is clear and I can
> get to the console the computer is already restarting, thus no error
> message has been observed.

If the panic isn't in Xen itself, why would the computer be
restarting?

> This is most certainly a configuration error.  Problem is this is a very
> small delta between a perfectly valid configuration and the one which
> reliably triggers a panic.
> 
> The memory:maxmem ratio isn't the problem.  My example had a maxmem of
> 2147483648 since that is enough to exceed the memory of sub-$100K
> computers.  The crucial features are maxmem >= machine memory,
> memory < free memory (thus potentially bootable PV/PVH) and type = "hvm".
> 
> When was the last time you tried running a Xen machine with near zero
> free memory?  Perhaps in the past Xen kept the promise of never panicking
> on memory exhaustion, but this feels like this hasn't held for some time.

Such bugs need fixing. Which first of all requires properly
pointing them out. A PoD guest misbehaving when there's not
enough memory to fill its pages (i.e. before it manages to
balloon down) is expected behavior. If you can't guarantee the
guest ballooning down quickly enough, don't configure it to
use PoD. A PoD guest causing a Xen crash, otoh, is very likely
even a security issue. Which needs to be treated as such: It
needs fixing, not avoiding by "curing" one of perhaps many
possible sources.

As an aside - if the PoD code had proper 1Gb page support,
would you then propose to only allocate in 1Gb chunks? And if
there was a 512Gb page feature in hardware, in 512Gb chunks
(leaving aside the fact that scanning 512Gb of memory to be
all zero would simply take too long with today's computers)?

Jan
Elliott Mitchell Jan. 26, 2021, 5:51 p.m. UTC | #5
On Tue, Jan 26, 2021 at 12:08:15PM +0100, Jan Beulich wrote:
> On 25.01.2021 18:46, Elliott Mitchell wrote:
> > On Mon, Jan 25, 2021 at 10:56:25AM +0100, Jan Beulich wrote:
> >> On 24.01.2021 05:47, Elliott Mitchell wrote:
> >>>
> >>> ---
> >>> Changes in v2:
> >>> - Include the obvious removal of the goto target.  Always realize you're
> >>>   at the wrong place when you press "send".
> >>
> >> Please could you also label the submission then accordingly? I
> >> got puzzled by two identically titled messages side by side,
> >> until I noticed the difference.
> > 
> > Sorry about that.  Would you have preferred a third message mentioning
> > this mistake?
> 
> No. But I'd have expected v2 to have its subject start with
> "[PATCH v2] ...", making it relatively clear that one might
> save looking at the one labeled just "[PATCH] ...".

Yes, in fact I spotted this just after.  I was in a situation of, "does
this deserve sending an additional message out?"  (ugh, yet more damage
from that issue...)


> >>> I'm not including a separate cover message since this is a single hunk.
> >>> This really needs some checking in `xl`.  If one has a domain which
> >>> sometimes gets started on different hosts and is sometimes modified with
> >>> slightly differing settings, one can run into trouble.
> >>>
> >>> In this case the particular domain is most often used as
> >>> PV/PVH, but every so often is used as a template for HVM.  Starting it
> >>> HVM will trigger PoD mode.  If it is started on a machine with less
> >>> memory than others, PoD may well exhaust all memory and then trigger a
> >>> panic.
> >>>
> >>> `xl` should likely fail HVM domain creation when the maximum memory
> >>> exceeds available memory (never mind total memory).
> >>
> >> I don't think so, no - it's the purpose of PoD to allow starting
> >> a guest despite there not being enough memory available to
> >> satisfy its "max", as such guests are expected to balloon down
> >> immediately, rather than triggering an oom condition.
> > 
> > Even Qemu/OVMF is expected to handle ballooning for a *HVM* domain?
> 
> No idea how qemu comes into play here. Any preboot environment
> aware of possibly running under Xen of course is expected to
> tolerate running with maxmem > memory (or clearly document its
> inability, in which case it may not be suitable for certain
> use cases). For example, I don't see why a preboot environment
> would need to touch all of the memory in a VM, except maybe
> for the purpose of zeroing it (which PoD can deal with fine).

I'm reading that as your answer to the above question is "yes".


> >>> For example try a domain with the following settings:
> >>>
> >>> memory = 8192
> >>> maxmem = 2147483648
> >>>
> >>> If type is PV or PVH, it will likely boot successfully.  Change type to
> >>> HVM and unless your hardware budget is impressive, Xen will soon panic.
> >>
> >> Xen will panic? That would need fixing if so. Also I'd consider
> >> an excessively high maxmem (compared to memory) a configuration
> >> error. According to my experiments long, long ago I seem to
> >> recall that a factor beyond 32 is almost never going to lead to
> >> anything good, irrespective of guest type. (But as said, badness
> >> here should be restricted to the guest; Xen itself should limp
> >> on fine.)
> > 
> > I'll confess I haven't confirmed the panic is in Xen itself.  Problem is
> > when this gets triggered, by the time the situation is clear and I can
> > get to the console the computer is already restarting, thus no error
> > message has been observed.
> 
> If the panic isn't in Xen itself, why would the computer be
> restarting?

I was thinking there was a possibility it is actually Domain 0 which is
panicking.


> > This is most certainly a configuration error.  Problem is this is a very
> > small delta between a perfectly valid configuration and the one which
> > reliably triggers a panic.
> > 
> > The memory:maxmem ratio isn't the problem.  My example had a maxmem of
> > 2147483648 since that is enough to exceed the memory of sub-$100K
> > computers.  The crucial features are maxmem >= machine memory,
> > memory < free memory (thus potentially bootable PV/PVH) and type = "hvm".
> > 
> > When was the last time you tried running a Xen machine with near zero
> > free memory?  Perhaps in the past Xen kept the promise of never panicking
> > on memory exhaustion, but this feels like this hasn't held for some time.
> 
> Such bugs need fixing. Which first of all requires properly
> pointing them out. A PoD guest misbehaving when there's not
> enough memory to fill its pages (i.e. before it manages to
> balloon down) is expected behavior. If you can't guarantee the
> guest ballooning down quickly enough, don't configure it to
> use PoD. A PoD guest causing a Xen crash, otoh, is very likely
> even a security issue. Which needs to be treated as such: It
> needs fixing, not avoiding by "curing" one of perhaps many
> possible sources.

Okay, this has been reliably reproducing for a while.  I had originally
thought it was a problem of HVM plus memory != maxmem, but the
non-immediate restart disagrees with that assessment.

> As an aside - if the PoD code had proper 1Gb page support,
> would you then propose to only allocate in 1Gb chunks? And if
> there was a 512Gb page feature in hardware, in 512Gb chunks
> (leaving aside the fact that scanning 512Gb of memory to be
> all zero would simply take too long with today's computers)?

That answer would vary over time.  Today or tomorrow, certainly not.
In a decade's time (or several) when a typical pocket computer^W^W
cellphone has 4TB of memory and a $30K server has a minimum of 128TB then
doing allocations in 1GB chunks would be worthy of consideration.
Jan Beulich Jan. 27, 2021, 9:47 a.m. UTC | #6
On 26.01.2021 18:51, Elliott Mitchell wrote:
> On Tue, Jan 26, 2021 at 12:08:15PM +0100, Jan Beulich wrote:
>> On 25.01.2021 18:46, Elliott Mitchell wrote:
>>> On Mon, Jan 25, 2021 at 10:56:25AM +0100, Jan Beulich wrote:
>>>> On 24.01.2021 05:47, Elliott Mitchell wrote:
>>>>>
>>>>> ---
>>>>> Changes in v2:
>>>>> - Include the obvious removal of the goto target.  Always realize you're
>>>>>   at the wrong place when you press "send".
>>>>
>>>> Please could you also label the submission then accordingly? I
>>>> got puzzled by two identically titled messages side by side,
>>>> until I noticed the difference.
>>>
>>> Sorry about that.  Would you have preferred a third message mentioning
>>> this mistake?
>>
>> No. But I'd have expected v2 to have its subject start with
>> "[PATCH v2] ...", making it relatively clear that one might
>> save looking at the one labeled just "[PATCH] ...".
> 
> Yes, in fact I spotted this just after.  I was in a situation of, "does
> this deserve sending an additional message out?"  (ugh, yet more damage
> from that issue...)
> 
> 
>>>>> I'm not including a separate cover message since this is a single hunk.
>>>>> This really needs some checking in `xl`.  If one has a domain which
>>>>> sometimes gets started on different hosts and is sometimes modified with
>>>>> slightly differing settings, one can run into trouble.
>>>>>
>>>>> In this case the particular domain is most often used as
>>>>> PV/PVH, but every so often is used as a template for HVM.  Starting it
>>>>> HVM will trigger PoD mode.  If it is started on a machine with less
>>>>> memory than others, PoD may well exhaust all memory and then trigger a
>>>>> panic.
>>>>>
>>>>> `xl` should likely fail HVM domain creation when the maximum memory
>>>>> exceeds available memory (never mind total memory).
>>>>
>>>> I don't think so, no - it's the purpose of PoD to allow starting
>>>> a guest despite there not being enough memory available to
>>>> satisfy its "max", as such guests are expected to balloon down
>>>> immediately, rather than triggering an oom condition.
>>>
>>> Even Qemu/OVMF is expected to handle ballooning for a *HVM* domain?
>>
>> No idea how qemu comes into play here. Any preboot environment
>> aware of possibly running under Xen of course is expected to
>> tolerate running with maxmem > memory (or clearly document its
>> inability, in which case it may not be suitable for certain
>> use cases). For example, I don't see why a preboot environment
>> would need to touch all of the memory in a VM, except maybe
>> for the purpose of zeroing it (which PoD can deal with fine).
> 
> I'm reading that as your answer to the above question is "yes".

For the OVMF part of your question.

>>>>> For example try a domain with the following settings:
>>>>>
>>>>> memory = 8192
>>>>> maxmem = 2147483648
>>>>>
>>>>> If type is PV or PVH, it will likely boot successfully.  Change type to
>>>>> HVM and unless your hardware budget is impressive, Xen will soon panic.
>>>>
>>>> Xen will panic? That would need fixing if so. Also I'd consider
>>>> an excessively high maxmem (compared to memory) a configuration
>>>> error. According to my experiments long, long ago I seem to
>>>> recall that a factor beyond 32 is almost never going to lead to
>>>> anything good, irrespective of guest type. (But as said, badness
>>>> here should be restricted to the guest; Xen itself should limp
>>>> on fine.)
>>>
>>> I'll confess I haven't confirmed the panic is in Xen itself.  Problem is
>>> when this gets triggered, by the time the situation is clear and I can
>>> get to the console the computer is already restarting, thus no error
>>> message has been observed.
>>
>> If the panic isn't in Xen itself, why would the computer be
>> restarting?
> 
> I was thinking there was a possibility it is actually Domain 0 which is
> panicking.

Which wouldn't be any different in how it would need dealing
with.

>>> This is most certainly a configuration error.  Problem is this is a very
>>> small delta between a perfectly valid configuration and the one which
>>> reliably triggers a panic.
>>>
>>> The memory:maxmem ratio isn't the problem.  My example had a maxmem of
>>> 2147483648 since that is enough to exceed the memory of sub-$100K
>>> computers.  The crucial features are maxmem >= machine memory,
>>> memory < free memory (thus potentially bootable PV/PVH) and type = "hvm".
>>>
>>> When was the last time you tried running a Xen machine with near zero
>>> free memory?  Perhaps in the past Xen kept the promise of never panicking
>>> on memory exhaustion, but this feels like this hasn't held for some time.
>>
>> Such bugs need fixing. Which first of all requires properly
>> pointing them out. A PoD guest misbehaving when there's not
>> enough memory to fill its pages (i.e. before it manages to
>> balloon down) is expected behavior. If you can't guarantee the
>> guest ballooning down quickly enough, don't configure it to
>> use PoD. A PoD guest causing a Xen crash, otoh, is very likely
>> even a security issue. Which needs to be treated as such: It
>> needs fixing, not avoiding by "curing" one of perhaps many
>> possible sources.
> 
> Okay, this has been reliably reproducing for a while.  I had originally
> thought it was a problem of HVM plus memory != maxmem, but the
> non-immediate restart disagrees with that assessment.

I guess it's not really clear what you mean with this, but anyway:
The important aspect here that I'm concerned about is what the
manifestations of the issue are. I'm still hoping that you would
provide such information, so we can then start thinking about how
to solve these. If, of course, there is anything worse than the
expected effects which use of PoD can have on the guest itself.

Jan
Elliott Mitchell Jan. 27, 2021, 8:12 p.m. UTC | #7
On Wed, Jan 27, 2021 at 10:47:19AM +0100, Jan Beulich wrote:
> On 26.01.2021 18:51, Elliott Mitchell wrote:
> > Okay, this has been reliably reproducing for a while.  I had originally
> > thought it was a problem of HVM plus memory != maxmem, but the
> > non-immediate restart disagrees with that assessment.
> 
> I guess it's not really clear what you mean with this, but anyway:
> The important aspect here that I'm concerned about is what the
> manifestations of the issue are. I'm still hoping that you would
> provide such information, so we can then start thinking about how
> to solve these. If, of course, there is anything worse than the
> expected effects which use of PoD can have on the guest itself.

Manifestation is domain 0 and/or Xen panic a few seconds after the
domain.cfg file is loaded via `xl`.  Everything on the host is lost and
the host restarts.  Any VMs which were present are lost and need to
restart, similar to power loss without UPS.

Upon pressing return for `xl create domain.cfg` there is a short period
of apparently normal behavior in domain 0.  After this there is a short
period of very laggy behavior in domain 0.  Finally domain 0 goes
unresponsive and so far by the time I've gotten to the host's console it
has already started to reboot.

The periods of apparently normal and laggy behavior are perhaps 5-10
seconds each.

The configurations I've reproduced with have had maxmem substantially
larger than the total host memory (this is intended as a prototype of a
future larger VM).  The first recorded observation of this was with
Debian's build of Xen 4.8, though I recall running into it with Xen 4.4
too.

Part of the problem might also be attributable to QEMU touching all
memory on start (thus causing PoD to try to populate *all* memory) or
OVMF.
Andrew Cooper Jan. 27, 2021, 9:03 p.m. UTC | #8
On 27/01/2021 20:12, Elliott Mitchell wrote:
> On Wed, Jan 27, 2021 at 10:47:19AM +0100, Jan Beulich wrote:
>> On 26.01.2021 18:51, Elliott Mitchell wrote:
>>> Okay, this has been reliably reproducing for a while.  I had originally
>>> thought it was a problem of HVM plus memory != maxmem, but the
>>> non-immediate restart disagrees with that assessment.
>> I guess it's not really clear what you mean with this, but anyway:
>> The important aspect here that I'm concerned about is what the
>> manifestations of the issue are. I'm still hoping that you would
>> provide such information, so we can then start thinking about how
>> to solve these. If, of course, there is anything worse than the
>> expected effects which use of PoD can have on the guest itself.
> Manifestation is domain 0 and/or Xen panic a few seconds after the
> domain.cfg file is loaded via `xl`.  Everything on the host is lost and
> the host restarts.  Any VMs which were present are lost and need to
> restart, similar to power loss without UPS.
>
> Upon pressing return for `xl create domain.cfg` there is a short period
> of apparently normal behavior in domain 0.  After this there is a short
> period of very laggy behavior in domain 0.  Finally domain 0 goes
> unresponsive and so far by the time I've gotten to the host's console it
> has already started to reboot.
>
> The periods of apparently normal and laggy behavior are perhaps 5-10
> seconds each.
>
> The configurations I've reproduced with have had maxmem substantially
> larger than the total host memory (this is intended as a prototype of a
> future larger VM).  The first recorded observation of this was with
> Debian's build of Xen 4.8, though I recall running into it with Xen 4.4
> too.
>
> Part of the problem might also be attributable to QEMU touching all
> memory on start (thus causing PoD to try to populate *all* memory) or
> OVMF.

So.  What *should* happen is that if QEMU/OVMF dirties more memory than
exists in the PoD cache, the domain gets terminated.
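
The expected behaviour described above can be modelled roughly as
follows.  This is a toy sketch, not Xen's PoD implementation: the
struct, its fields and the function name are all hypothetical.  Each
touched page is backed from a fixed-size cache, and once the cache is
empty the next touch terminates the guest rather than affecting the
host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-domain state for the toy model. */
struct toy_pod_domain {
    uint64_t cache_pages;      /* pages still available to back touches */
    uint64_t populated_pages;  /* pages already handed to the guest */
    bool     terminated;       /* set once the guest outruns the cache */
};

/*
 * The guest dirties a page that is not yet populated.  Returns true if
 * the page could be backed from the cache.  When the cache is empty the
 * domain is marked terminated and false is returned.
 */
static bool toy_pod_touch_page(struct toy_pod_domain *d)
{
    assert(d != NULL);

    if (d->terminated)
        return false;

    if (d->cache_pages == 0) {
        d->terminated = true;  /* only the guest is affected */
        return false;
    }

    d->cache_pages--;
    d->populated_pages++;
    return true;
}
```

With a cache of three pages, three touches succeed and the fourth marks
the domain as terminated; nothing in this model touches host state.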

Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
action like this.

Do you have a serial log of the crash?  If not, can you set up a crash
kernel environment to capture the logs, or alternatively reproduce the
issue on a different box which does have serial?

Whatever the underlying bug is, avoiding 2M degrading to 4K allocations
isn't a real fix, and is at best, sidestepping the problem.

~Andrew
Elliott Mitchell Jan. 27, 2021, 10:28 p.m. UTC | #9
On Wed, Jan 27, 2021 at 09:03:32PM +0000, Andrew Cooper wrote:
> So.  What *should* happen is that if QEMU/OVMF dirties more memory than
> exists in the PoD cache, the domain gets terminated.
> 
> Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
> action like this.
> 
> Do you have a serial log of the crash?  If not, can you set up a crash
> kernel environment to capture the logs, or alternatively reproduce the
> issue on a different box which does have serial?

No, I don't.  I'm set up to debug ARM failures, not x86 ones.
Jan Beulich Jan. 28, 2021, 10:19 a.m. UTC | #10
On 27.01.2021 23:28, Elliott Mitchell wrote:
> On Wed, Jan 27, 2021 at 09:03:32PM +0000, Andrew Cooper wrote:
>> So.  What *should* happen is that if QEMU/OVMF dirties more memory than
>> exists in the PoD cache, the domain gets terminated.
>>
>> Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
>> action like this.
>>
>> Do you have a serial log of the crash?  If not, can you set up a crash
>> kernel environment to capture the logs, or alternatively reproduce the
>> issue on a different box which does have serial?
> 
> No, I don't.  I'm set up to debug ARM failures, not x86 ones.

Then alternatively can you at least give conditions that need to
be met to observe the problem, for someone to repro and then
debug? (The less complex the better, of course.)

Jan
Elliott Mitchell Jan. 28, 2021, 6:26 p.m. UTC | #11
On Thu, Jan 28, 2021 at 11:19:41AM +0100, Jan Beulich wrote:
> On 27.01.2021 23:28, Elliott Mitchell wrote:
> > On Wed, Jan 27, 2021 at 09:03:32PM +0000, Andrew Cooper wrote:
> >> So.  What *should* happen is that if QEMU/OVMF dirties more memory than
> >> exists in the PoD cache, the domain gets terminated.
> >>
> >> Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
> >> action like this.
> >>
> >> Do you have a serial log of the crash?  If not, can you set up a crash
> >> kernel environment to capture the logs, or alternatively reproduce the
> >> issue on a different box which does have serial?
> > 
> > No, I don't.  I'm setup to debug ARM failures, not x86 ones.
> 
> Then alternatively can you at least give conditions that need to
> be met to observe the problem, for someone to repro and then
> debug? (The less complex the better, of course.)

Multiple prior messages have included statements of what I believed to be
the minimal case to reproduce.  Presently I believe the minimal
constraints are, maxmem >= host memory, memory < free Xen memory, type
HVM.  A minimal kr45hme.cfg file:

type = "hvm"
memory = 1024
maxmem = 1073741824

I suspect maxmem > free Xen memory may be sufficient.  The instances I
can be certain of have been maxmem = total host memory *7.
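
Purely as a sketch, the constraints I believe are needed can be written as
a predicate.  The struct and function names below are made up for
illustration and are not part of Xen or libxl; all sizes are assumed to be
in MiB, the unit used by "memory" and "maxmem" in an xl configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one attempt; every size is in MiB. */
struct repro_input {
    bool     is_hvm;          /* type = "hvm" */
    uint64_t memory_mib;      /* "memory" from the domain config */
    uint64_t maxmem_mib;      /* "maxmem" from the domain config */
    uint64_t host_total_mib;  /* total_memory from `xl info` */
    uint64_t xen_free_mib;    /* free_memory from `xl info` */
};

/*
 * True when the input matches the constraints believed (not proven) to be
 * minimal: HVM, maxmem >= host memory, memory < free Xen memory.
 */
static bool matches_reported_constraints(const struct repro_input *in)
{
    return in->is_hvm &&
           in->maxmem_mib >= in->host_total_mib &&
           in->memory_mib < in->xen_free_mib;
}
```

This only restates the conditions; it says nothing about why they lead to
a crash.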
George Dunlap Jan. 28, 2021, 10:42 p.m. UTC | #12
> On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
> 
> On Thu, Jan 28, 2021 at 11:19:41AM +0100, Jan Beulich wrote:
>> On 27.01.2021 23:28, Elliott Mitchell wrote:
>>> On Wed, Jan 27, 2021 at 09:03:32PM +0000, Andrew Cooper wrote:
>>>> So.  What *should* happen is that if QEMU/OVMF dirties more memory than
>>>> exists in the PoD cache, the domain gets terminated.
>>>> 
>>>> Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
>>>> action like this.
>>>> 
>>>> Do you have a serial log of the crash?  If not, can you set up a crash
>>>> kernel environment to capture the logs, or alternatively reproduce the
>>>> issue on a different box which does have serial?
>>> 
>>> No, I don't.  I'm setup to debug ARM failures, not x86 ones.
>> 
>> Then alternatively can you at least give conditions that need to
>> be met to observe the problem, for someone to repro and then
>> debug? (The less complex the better, of course.)
> 
> Multiple prior messages have included statements of what I believed to be
> the minimal case to reproduce.  Presently I believe the minimal
> constraints are, maxmem >= host memory, memory < free Xen memory, type
> HVM.  A minimal kr45hme.cfg file:
> 
> type = "hvm"
> memory = 1024
> maxmem = 1073741824
> 
> I suspect maxmem > free Xen memory may be sufficient.  The instances I
> can be certain of have been maxmem = total host memory *7.

Can you include your Xen version and dom0 command-line?

For me, domain creation fails with an error like this:

root@immortal:/images# xl create c6-01.cfg maxmem=1073741824
Parsing config from c6-01.cfg
xc: error: panic: xc_dom_boot.c:120: xc_dom_boot_mem_init: can't allocate low memory for domain: Out of memory
libxl: error: libxl_dom.c:593:libxl__build_dom: xc_dom_boot_mem_init failed: Cannot allocate memory
libxl: error: libxl_create.c:1576:domcreate_rebuild_done: Domain 9:cannot (re-)build domain: -3
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 9:Non-existant domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 9:Unable to destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 9:Destruction of domain failed

This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.

 -George
George Dunlap Jan. 28, 2021, 10:56 p.m. UTC | #13
> On Jan 28, 2021, at 10:42 PM, George Dunlap <george.dunlap@citrix.com> wrote:
> 
> 
> 
>> On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
>> 
>> On Thu, Jan 28, 2021 at 11:19:41AM +0100, Jan Beulich wrote:
>>> On 27.01.2021 23:28, Elliott Mitchell wrote:
>>>> On Wed, Jan 27, 2021 at 09:03:32PM +0000, Andrew Cooper wrote:
>>>>> So.  What *should* happen is that if QEMU/OVMF dirties more memory than
>>>>> exists in the PoD cache, the domain gets terminated.
>>>>> 
>>>>> Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
>>>>> action like this.
>>>>> 
>>>>> Do you have a serial log of the crash?  If not, can you set up a crash
>>>>> kernel environment to capture the logs, or alternatively reproduce the
>>>>> issue on a different box which does have serial?
>>>> 
>>>> No, I don't.  I'm setup to debug ARM failures, not x86 ones.
>>> 
>>> Then alternatively can you at least give conditions that need to
>>> be met to observe the problem, for someone to repro and then
>>> debug? (The less complex the better, of course.)
>> 
>> Multiple prior messages have included statements of what I believed to be
>> the minimal case to reproduce.  Presently I believe the minimal
>> constraints are, maxmem >= host memory, memory < free Xen memory, type
>> HVM.  A minimal kr45hme.cfg file:
>> 
>> type = "hvm"
>> memory = 1024
>> maxmem = 1073741824
>> 
>> I suspect maxmem > free Xen memory may be sufficient.  The instances I
>> can be certain of have been maxmem = total host memory *7.
> 
> Can you include your Xen version and dom0 command-line?
> 
> For me, domain creation fails with an error like this:
> 
> root@immortal:/images# xl create c6-01.cfg maxmem=1073741824
> Parsing config from c6-01.cfg
> xc: error: panic: xc_dom_boot.c:120: xc_dom_boot_mem_init: can't allocate low memory for domain: Out of memory
> libxl: error: libxl_dom.c:593:libxl__build_dom: xc_dom_boot_mem_init failed: Cannot allocate memory
> libxl: error: libxl_create.c:1576:domcreate_rebuild_done: Domain 9:cannot (re-)build domain: -3
> libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 9:Non-existant domain
> libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 9:Unable to destroy guest
> libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 9:Destruction of domain failed
> 
> This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.

Hmm, but with that line removed, I get this:

root@immortal:/images# xl create c6-01.cfg maxmem=1073741824
Parsing config from c6-01.cfg
libxl: error: libxl_mem.c:279:libxl_set_memory_target: New target 0 for dom0 is below the minimum threshold
failed to free memory for the domain

Maybe the issue you’re facing is that the “minimum threshold” safety catch either isn’t triggering, or is set low enough that your dom0 is OOMing trying to make enough memory for your VM?

That 1TiB of empty space isn’t actually free after all, even for Xen — you have to actually allocate p2m memory for the domain to hold all of those PoD entries.
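
As a rough back-of-the-envelope sketch of that cost (not taken from the Xen source): assume each p2m mapping needs one 8-byte leaf entry and ignore the upper-level tables.  The helper name and the 8-byte figure are assumptions for illustration only.

```c
#include <stdint.h>

#define P2M_ENTRY_BYTES 8ULL  /* assumed: one 64-bit entry per mapping */

/*
 * Estimate the bytes of leaf p2m entries needed to cover span_bytes of
 * guest-physical address space when every mapping has the given order
 * (0 = 4KiB, 9 = 2MiB, 18 = 1GiB).  Upper-level tables are ignored.
 */
static uint64_t p2m_leaf_bytes(uint64_t span_bytes, unsigned int order)
{
    uint64_t mapping_bytes = 4096ULL << order;
    uint64_t entries = (span_bytes + mapping_bytes - 1) / mapping_bytes;

    return entries * P2M_ENTRY_BYTES;
}
```

Under those assumptions, `p2m_leaf_bytes(1ULL << 40, 0)` comes to 2GiB for 1TiB tracked at 4KiB granularity, versus 4MiB with `order` 9 (2MiB entries), so how much this matters depends heavily on the order at which the PoD entries are recorded.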

 -George
George Dunlap Jan. 29, 2021, 10:56 a.m. UTC | #14
> On Jan 28, 2021, at 10:56 PM, George Dunlap <george.dunlap@citrix.com> wrote:
> 
> 
> 
>> On Jan 28, 2021, at 10:42 PM, George Dunlap <george.dunlap@citrix.com> wrote:
>> 
>> 
>> 
>>> On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
>>> 
>>> On Thu, Jan 28, 2021 at 11:19:41AM +0100, Jan Beulich wrote:
>>>> On 27.01.2021 23:28, Elliott Mitchell wrote:
>>>>> On Wed, Jan 27, 2021 at 09:03:32PM +0000, Andrew Cooper wrote:
>>>>>> So.  What *should* happen is that if QEMU/OVMF dirties more memory than
>>>>>> exists in the PoD cache, the domain gets terminated.
>>>>>> 
>>>>>> Irrespective, Xen/dom0 dying isn't an expected consequence of any normal
>>>>>> action like this.
>>>>>> 
>>>>>> Do you have a serial log of the crash?  If not, can you set up a crash
>>>>>> kernel environment to capture the logs, or alternatively reproduce the
>>>>>> issue on a different box which does have serial?
>>>>> 
>>>>> No, I don't.  I'm setup to debug ARM failures, not x86 ones.
>>>> 
>>>> Then alternatively can you at least give conditions that need to
>>>> be met to observe the problem, for someone to repro and then
>>>> debug? (The less complex the better, of course.)
>>> 
>>> Multiple prior messages have included statements of what I believed to be
>>> the minimal case to reproduce.  Presently I believe the minimal
>>> constraints are, maxmem >= host memory, memory < free Xen memory, type
>>> HVM.  A minimal kr45hme.cfg file:
>>> 
>>> type = "hvm"
>>> memory = 1024
>>> maxmem = 1073741824
>>> 
>>> I suspect maxmem > free Xen memory may be sufficient.  The instances I
>>> can be certain of have been maxmem = total host memory *7.
>> 
>> Can you include your Xen version and dom0 command-line?
>> 
>> For me, domain creation fails with an error like this:
>> 
>> root@immortal:/images# xl create c6-01.cfg maxmem=1073741824
>> Parsing config from c6-01.cfg
>> xc: error: panic: xc_dom_boot.c:120: xc_dom_boot_mem_init: can't allocate low memory for domain: Out of memory
>> libxl: error: libxl_dom.c:593:libxl__build_dom: xc_dom_boot_mem_init failed: Cannot allocate memory
>> libxl: error: libxl_create.c:1576:domcreate_rebuild_done: Domain 9:cannot (re-)build domain: -3
>> libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 9:Non-existant domain
>> libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 9:Unable to destroy guest
>> libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 9:Destruction of domain failed
>> 
>> This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.
> 
> Hmm, but with that line removed, I get this:
> 
> root@immortal:/images# xl create c6-01.cfg maxmem=1073741824
> Parsing config from c6-01.cfg
> libxl: error: libxl_mem.c:279:libxl_set_memory_target: New target 0 for dom0 is below the minimum threshold
> failed to free memory for the domain
> 
> Maybe the issue you’re probably facing is that “minimum threshold” safety catch either isn’t triggering, or is set low enough that your dom0 is OOMing trying to make enough memory for your VM?

Looks like LIBXL_MIN_DOM0_MEM is hard-coded to 128MiB, which is not going to be enough on a lot of systems.  At the very least that should be something that can be set in a global config somewhere.  Ideally we’d have a more sophisticated way of calculating the minimum value that wouldn’t trip so easily.

Elliott, as a short-term fix, I suggest considering setting aside a fixed amount of memory for dom0, as recommended in https://wiki.xenproject.org/wiki/Xen_Project_Best_Practices.

 -George
Elliott Mitchell Jan. 31, 2021, 6:13 p.m. UTC | #15
On Thu, Jan 28, 2021 at 10:42:27PM +0000, George Dunlap wrote:
> 
> > On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
> > type = "hvm"
> > memory = 1024
> > maxmem = 1073741824
> > 
> > I suspect maxmem > free Xen memory may be sufficient.  The instances I
> > can be certain of have been maxmem = total host memory *7.
> 
> Can you include your Xen version and dom0 command-line?

> This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.
> 

As this is a server, not a development target, Debian's build of 4.11 is
in use.  Your domain 0 memory allocation is extremely generous compared
to mine.  One thing which is on the command-line though is
"watchdog=true".

I've got 3 candidates which presently concern me:

1> There is a limited range of maxmem values where this occurs.  Perhaps
1TB is too high on your machine for the problem to reproduce.  As
previously stated my sample configuration has maxmem being roughly 7
times actual machine memory.

2> Between issuing the `xl create` command and the machine rebooting a
few moments of slow response have been observed.  Perhaps the memory
allocator loop is hogging processor cores long enough for the watchdog to
trigger?

3> Perhaps one of the patches on Debian broke things?  This seems
unlikely since nearly all of Debian's patches are either strictly for
packaging or else picks from Xen's main branch, but this is certainly
possible.
Roger Pau Monné Feb. 1, 2021, 8:15 a.m. UTC | #16
On Sun, Jan 31, 2021 at 10:13:49AM -0800, Elliott Mitchell wrote:
> On Thu, Jan 28, 2021 at 10:42:27PM +0000, George Dunlap wrote:
> > 
> > > On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
> > > type = "hvm"
> > > memory = 1024
> > > maxmem = 1073741824
> > > 
> > > I suspect maxmem > free Xen memory may be sufficient.  The instances I
> > > can be certain of have been maxmem = total host memory *7.
> > 
> > Can you include your Xen version and dom0 command-line?
> 
> > This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.
> > 
> 
> As this is a server, not a development target, Debian's build of 4.11 is
> in use.  Your domain 0 memory allocation is extremely generous compared
> to mine.  One thing which is on the command-line though is
> "watchdog=true".
> 
> I've got 3 candidates which presently concern me:
> 
> 1> There is a limited range of maxmem values where this occurs.  Perhaps
> 1TB is too high on your machine for the problem to reproduce.  As
> previously stated my sample configuration has maxmem being roughly 7
> times actual machine memory.
> 
> 2> Between issuing the `xl create` command and the machine rebooting a
> few moments of slow response have been observed.  Perhaps the memory
> allocator loop is hogging processor cores long enough for the watchdog to
> trigger?
> 
> 3> Perhaps one of the patches on Debian broke things?  This seems
> unlikely since nearly all of Debian's patches are either strictly for
> packaging or else picks from Xen's main branch, but this is certainly
> possible.

If you have a reliable way to reproduce this, and the error happens
on one of your server boxes, would it be possible for you to connect to
the serial console and get the output of the crash?

Since this is a server, I would assume it has some kind of serial
console support, even if only Serial over LAN.

That way we could remove all the speculation about what has gone
wrong.

Thanks, Roger.
George Dunlap Feb. 1, 2021, 10:35 a.m. UTC | #17
> On Jan 31, 2021, at 6:13 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
> 
> On Thu, Jan 28, 2021 at 10:42:27PM +0000, George Dunlap wrote:
>> 
>>> On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
>>> type = "hvm"
>>> memory = 1024
>>> maxmem = 1073741824
>>> 
>>> I suspect maxmem > free Xen memory may be sufficient.  The instances I
>>> can be certain of have been maxmem = total host memory *7.
>> 
>> Can you include your Xen version and dom0 command-line?
> 
>> This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.
>> 
> 
> As this is a server, not a development target, Debian's build of 4.11 is
> in use.  Your domain 0 memory allocation is extremely generous compared
> to mine.  One thing which is on the command-line though is
> "watchdog=true".

staging-4.14 is just the stable 4.14 branch which our CI loop tests before pushing to stable-4.14, which is essentially tagged 3 times a year for point releases.  It’s quite stable.  I’ll give 4.11 a try if I get a chance.

It’s not clear from your response — are you allocating a fixed amount to dom0?  How much is it?  In fact, probably the simplest thing to do would be to attach the output of `xl info` and `xl dmesg`; that will save a lot of potential future back-and-forth.

1GiB isn’t particularly generous if you’re running a large number of guests.  My understanding is that XenServer now defaults to 4GiB of RAM for dom0.

> I've got 3 candidates which presently concern me:
> 
> 1> There is a limited range of maxmem values where this occurs.  Perhaps
> 1TB is too high on your machine for the problem to reproduce.  As
> previously stated my sample configuration has maxmem being roughly 7
> times actual machine memory.

In fact I did a number of binary-search-style experiments to try to find out boundary behavior.  I don’t think I did 7x memory, but I certainly did 2x or 3x host memory, and the exact number you gave that caused you problems.  In all cases for me, it either worked or failed with a cryptic error message (the specific message depending on whether I had fixed dom0 memory or autoballooned memory).

> 2> Between issuing the `xl create` command and the machine rebooting a
> few moments of slow response have been observed.  Perhaps the memory
> allocator loop is hogging processor cores long enough for the watchdog to
> trigger?

I don’t know the balloon driver very well, but I’d hope it yielded pretty regularly.  It seems more likely to me that your dom0 is swapping due to low memory / struggling with having to work with no file cache.  Or the OOM killer is doing its calculation trying to figure out which process to shoot?  

> 3> Perhaps one of the patches on Debian broke things?  This seems
> unlikely since nearly all of Debian's patches are either strictly for
> packaging or else picks from Xen's main branch, but this is certainly
> possible.

Indeed, I’d consider that unlikely.  Some things I’d consider more likely to cause the difference:

1. The amount of host memory (my test box had only 6GiB)

2. The amount of memory assigned to dom0 

3. The number of other VMs running in the background

4. A difference in the version of Linux (I’m also running Debian, but debian-testing)

5. A bug in 4.11 that was fixed by 4.14.

If you’re already allocating a fixed amount of memory to dom0, but it’s significantly less than 1GiB, the first thing I’d try is increasing that to 1GiB.  Also make sure that you’re specifying a ‘max’ for dom0 memory: If you simply put `dom0_mem=X`, dom0 will start with X amount of memory, but allocate enough frame tables such that it could balloon up to the full host memory if requested.  (And frame tables are not free.)  `dom0_mem=X,max:X` will cause dom0 to only make frame tables for X memory.  (At least, so I’m guessing; I haven’t checked.)
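
To put a rough number on “not free”, here is a sketch that assumes one fixed-size descriptor per 4KiB page.  The 64-byte descriptor size used in the example is an assumption (roughly what a Linux `struct page` takes on x86-64), and the helper name is made up; this is not code from Xen or Linux.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL

/*
 * Hypothetical helper: bytes of per-page metadata needed to describe
 * max_bytes of memory, given an assumed per-page descriptor size.
 */
static uint64_t page_metadata_bytes(uint64_t max_bytes, uint64_t desc_bytes)
{
    return (max_bytes / PAGE_SIZE_BYTES) * desc_bytes;
}
```

With a 64-byte descriptor, `page_metadata_bytes(16ULL << 30, 64)` is 256MiB for a 16GiB ceiling, while `page_metadata_bytes(1ULL << 30, 64)` is 16MiB for a 1GiB ceiling.  If that assumption holds, an uncapped ceiling would eat a large share of a small dom0 allocation.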

If that doesn’t work, please include the output of `xl info` and `xl dmesg`; that will give us a lot more information to work with.

Peace,
 -George
Elliott Mitchell Feb. 2, 2021, 5:58 a.m. UTC | #18
On Mon, Feb 01, 2021 at 10:35:15AM +0000, George Dunlap wrote:
> 
> 
> > On Jan 31, 2021, at 6:13 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
> > 
> > On Thu, Jan 28, 2021 at 10:42:27PM +0000, George Dunlap wrote:
> >> 
> >>> On Jan 28, 2021, at 6:26 PM, Elliott Mitchell <ehem+xen@m5p.com> wrote:
> >>> type = "hvm"
> >>> memory = 1024
> >>> maxmem = 1073741824
> >>> 
> >>> I suspect maxmem > free Xen memory may be sufficient.  The instances I
> >>> can be certain of have been maxmem = total host memory *7.
> >> 
> >> Can you include your Xen version and dom0 command-line?
> > 
> >> This is on staging-4.14 from a month or two ago (i.e., what I happened to have on a random test  box), and `dom0_mem=1024M,max:1024M` in my command-line.  That rune will give dom0 only 1GiB of RAM, but also prevent it from auto-ballooning down to free up memory for the guest.
> >> 
> > 
> > As this is a server, not a development target, Debian's build of 4.11 is
> > in use.  Your domain 0 memory allocation is extremely generous compared
> > to mine.  One thing which is on the command-line though is
> > "watchdog=true".
> 
> staging-4.14 is just the stable 4.14 branch which our CI loop tests before pushing to stable-4.14, which is essentially tagged 3 times a year for point releases.  It's quite stable.  I'll give 4.11 a try if I get a chance.
> 
> It's not clear from your response -- are you allocating a fixed amount to dom0?  How much is it?  In fact, probably the simplest thing to do would be to attach the output of `xl info` and `xl dmesg`; that will save a lot of potential future back-and-forth.
> 
> 1GiB isn't particularly generous if you're running a large number of guests.  My understanding is that XenServer now defaults to 4GiB of RAM for dom0.
> 

I guess it comes to setup, how careful one is at pruning unneeded
services and whether one takes steps to ensure there aren't extra qemu
processes hanging around (avoiding hvm VMs in most cases).


release                : 4.19.160-2
version                : #5 SMP Sat Dec 5 09:58:41 PST 2020
machine                : x86_64
nr_cpus                : 8
max_cpu_id             : 7
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 4018.086
hw_caps                : 178bf3ff:b698320b:2e500800:0069bfff:00000000:00000008:00000000:00000500
virt_caps              : hvm
total_memory           : 16110
free_memory            : 781
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 11
xen_extra              : .4
xen_version            : 4.11.4
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : 
xen_commandline        : placeholder watchdog=true loglvl=info iommu=verbose cpuidle dom0_mem=384M,max:640M dom0_max_vcpus=8
cc_compiler            : gcc (Debian 8.3.0-6) 8.3.0
cc_compile_by          : pkg-xen-devel
cc_compile_domain      : lists.alioth.debian.org
cc_compile_date        : Fri Dec 11 21:33:51 UTC 2020
build_id               : 6d8e0fa3ddb825695eb6c6832631b4fa2331fe41
xend_config_format     : 4


> > I've got 3 candidates which presently concern me:
> > 
> > 1> There is a limited range of maxmem values where this occurs.  Perhaps
> > 1TB is too high on your machine for the problem to reproduce.  As
> > previously stated my sample configuration has maxmem being roughly 7
> > times actual machine memory.
> 
> In fact I did a number of binary-search-style experiments to try to find out boundary behavior.  I don't think I did 7x memory, but I certainly did 2x or 3x host memory, and the exact number you gave that caused you problems.  In all cases for me, it either worked or failed with a cryptic error message (the specific message depending on whether I had fixed dom0 memory or autoballooned memory).
> 

Hmm, may have to mem-set Dom0 to max then retry the crash configuration
with maxmem just greater than machine memory...    Do have that downtime
due to kernel update...


> > 2> Between issuing the `xl create` command and the machine rebooting a
> > few moments of slow response have been observed.  Perhaps the memory
> > allocator loop is hogging processor cores long enough for the watchdog to
> > trigger?
> 
> I don't know the balloon driver very well, but I'd hope it yielded pretty regularly.  It seems more likely to me that your dom0 is swapping due to low memory / struggling with having to work with no file cache.  Or the OOM killer is doing its calculation trying to figure out which process to shoot?
> 

I know which process it shoots.  One ideal is to have memory just high
enough for the OOM-killer not to trigger.  Under this idea you *want* to
use some swap, as some portions of process address space are left idle
99.99% of the time and don't need to waste RAM.  This is a bit awkward
now that SSDs, on which writes are more precious, are taking over.  The
difficulty then becomes some of Xen's odd Dom0 memory behavior.

According to `xl list` it isn't possible to set Dom0's memory to maximum.
I've been theorizing this might be due to memory used for DMA needing to
be inside the maxmem region, but isn't counted in `xl list`...


> > 3> Perhaps one of the patches on Debian broke things?  This seems
> > unlikely since nearly all of Debian's patches are either strictly for
> > packaging or else picks from Xen's main branch, but this is certainly
> > possible.
> 
> Indeed, I'd consider that unlikely.  Some things I'd consider more likely to cause the difference:
> 
> 1. The amount of host memory (my test box had only 6GiB)
> 
> 2. The amount of memory assigned to dom0 

I consider this unlikely.  Due to a downtime I got a chance to try this
issue from the console and *nothing* appeared.  If there was a memory
issue with domain 0 then I would have expected messages from the
OOM-killer before restart.


> 3. The number of other VMs running in the background

During that downtime other VMs had been saved to storage (I didn't want
to lose their runtimes).  As such all memory was available to domain 0
and the problematic VM configuration.


> 4. A difference in the version of Linux (I'm also running Debian, but debian-testing)
> 

Not impossible, but seems improbable to me.  This has also been observed
when domain 0 had a 4.9 kernel.  Perhaps 5.x includes some fix which
works around the issue, but I'm very doubtful of this.

> 5. A bug in 4.11 that was fixed by 4.14.

This isn't confined to 4.11.  I observed this with 4.8 and I recall
running into suspiciously similar things with 4.4.  The bug may well have
been fixed between 4.11 and 4.14 though.

Patch

diff --git a/xen/arch/x86/mm/p2m-pod.c b/xen/arch/x86/mm/p2m-pod.c
index 48e609d1ed..1baa1404e8 100644
--- a/xen/arch/x86/mm/p2m-pod.c
+++ b/xen/arch/x86/mm/p2m-pod.c
@@ -212,16 +212,13 @@  p2m_pod_set_cache_target(struct p2m_domain *p2m, unsigned long pod_target, int p
             order = PAGE_ORDER_2M;
         else
             order = PAGE_ORDER_4K;
-    retry:
         page = alloc_domheap_pages(d, order, 0);
         if ( unlikely(page == NULL) )
         {
-            if ( order == PAGE_ORDER_2M )
-            {
-                /* If we can't allocate a superpage, try singleton pages */
-                order = PAGE_ORDER_4K;
-                goto retry;
-            }
+            /* Superpages allocation failures likely indicate severe memory
+            ** pressure.  Continuing to try to fulfill attempts using 4KB pages
+            ** is likely to exhaust memory and trigger a panic.  As such it is
+            ** NOT worth trying to use 4KB pages to fulfill 2MB page requests.*/
 
             printk("%s: Unable to allocate page for PoD cache (target=%lu cache=%ld)\n",
                    __func__, pod_target, p2m->pod.count);