
[1/2] xen/x86: return partial memory map in case of not enough space

Message ID 20161205163409.16714-2-jgross@suse.com (mailing list archive)
State New, archived

Commit Message

Jürgen Groß Dec. 5, 2016, 4:34 p.m. UTC
For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
caller's buffer can't hold all entries.

This is a problem, as the caller normally has a static buffer defined
and, at the time of the call, no dynamic memory allocation is possible
because nothing is yet known about the system's memory layout.

Instead of just failing, deliver as many memory map entries as possible
and return E2BIG to indicate that the result is incomplete. The caller
is then able to use at least some of the memory reported to exist to
allocate a larger buffer for the complete memory map.

As E2BIG wasn't returned before, a caller not prepared for this case
will still see just a failure as before, while a caller prepared for
this error code running on an old hypervisor won't run into any
problems it wouldn't have run into without this change.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/mm.c           | 22 ++++++++++++----------
 xen/include/public/memory.h |  2 ++
 2 files changed, 14 insertions(+), 10 deletions(-)
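
As an illustration of the intended usage, here is a minimal caller-side
sketch (hypothetical code, not part of this patch): struct xen_memory_map,
set_xen_guest_handle() and HYPERVISOR_memory_op() follow the existing
public interface, while alloc_from_partial_map() is an invented stand-in
for an early allocator fed with the partial map.

/* Small static buffer, as a boot-time caller would typically have. */
static struct e820entry static_map[32];

static int get_machine_memory_map(void)
{
    struct xen_memory_map memmap = { .nr_entries = ARRAY_SIZE(static_map) };
    struct e820entry *big;
    int rc;

    set_xen_guest_handle(memmap.buffer, static_map);
    rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
    if (rc == 0)
        return 0;                /* the complete map fit into the buffer */
    if (rc != -E2BIG)
        return rc;               /* genuine failure */

    /*
     * The buffer was too small, but memmap.nr_entries entries were
     * filled in.  Use that partial map to find room for a larger
     * buffer, then retry with it.
     */
    big = alloc_from_partial_map(static_map, memmap.nr_entries);
    if (!big)
        return -ENOMEM;

    memmap.nr_entries = 128;     /* arbitrary larger size */
    set_xen_guest_handle(memmap.buffer, big);
    return HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
}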

Comments

Jan Beulich Dec. 5, 2016, 5:17 p.m. UTC | #1
>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
> caller's buffer can't hold all entries.
> 
> This is a problem, as the caller normally has a static buffer defined
> and, at the time of the call, no dynamic memory allocation is possible
> because nothing is yet known about the system's memory layout.
> 
> Instead of just failing, deliver as many memory map entries as possible
> and return E2BIG to indicate that the result is incomplete. The caller
> is then able to use at least some of the memory reported to exist to
> allocate a larger buffer for the complete memory map.

This makes no sense, as what we're talking about here is the
machine memory map, and the calling Dom0 kernel can't allocate
from that pool directly. Instead it would need its own memory
map to know where to place such a larger buffer, and this map
is usually just one or two entries large.

For that reason I'm not convinced we need or want this change.

Jan
Jürgen Groß Dec. 6, 2016, 7:43 a.m. UTC | #2
On 05/12/16 18:17, Jan Beulich wrote:
>>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
>> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
>> caller's buffer can't hold all entries.
>>
>> This is a problem, as the caller normally has a static buffer defined
>> and, at the time of the call, no dynamic memory allocation is possible
>> because nothing is yet known about the system's memory layout.
>>
>> Instead of just failing, deliver as many memory map entries as possible
>> and return E2BIG to indicate that the result is incomplete. The caller
>> is then able to use at least some of the memory reported to exist to
>> allocate a larger buffer for the complete memory map.
> 
> This makes no sense, as what we're talking about here is the
> machine memory map, and the calling Dom0 kernel can't allocate
> from that pool directly. Instead it would need its own memory
> map to know where to place such a larger buffer, and this map
> is usually just one or two entries large.

This is true. In practice, however, things are a little bit more
complicated:

Linux, when started as dom0, tries to rearrange its memory layout to
match that of the physical machine. It will only add memory to its
allocator which is known either not to need to be moved or to have
already been moved. And this decision is based on the machine memory
map.

I admit it is the Linux kernel's private decision how to handle boot and
the adding of memory to its allocator. OTOH the "all or nothing" approach
of the hypervisor regarding delivery of the machine memory map is a little
bit strange, especially as the BIOS returns the E820 map one entry at a
time.

> For that reason I'm not convinced we need or want this change.

It would certainly make it easier for the Linux kernel.


Juergen
Jan Beulich Dec. 6, 2016, 8:15 a.m. UTC | #3
>>> On 06.12.16 at 08:43, <JGross@suse.com> wrote:
> On 05/12/16 18:17, Jan Beulich wrote:
>>>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
>>> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
>>> caller's buffer can't hold all entries.
>>>
>>> This is a problem, as the caller normally has a static buffer defined
>>> and, at the time of the call, no dynamic memory allocation is possible
>>> because nothing is yet known about the system's memory layout.
>>>
>>> Instead of just failing, deliver as many memory map entries as possible
>>> and return E2BIG to indicate that the result is incomplete. The caller
>>> is then able to use at least some of the memory reported to exist to
>>> allocate a larger buffer for the complete memory map.
>> 
>> This makes no sense, as what we're talking about here is the
>> machine memory map, and the calling Dom0 kernel can't allocate
>> from that pool directly. Instead it would need its own memory
>> map to know where to place such a larger buffer, and this map
>> is usually just one or two entries large.
> 
> This is true. In practice, however, things are a little bit more
> complicated:
> 
> Linux, when started as dom0, tries to rearrange its memory layout to
> match that of the physical machine. It will only add memory to its
> allocator which is known either not to need to be moved or to have
> already been moved. And this decision is based on the machine memory
> map.

Right, I did recall this oddity only after having sent the initial reply
(I'm still much more into how the non-pvops kernel works, and there's
no such dependency there).

> I admit it is the Linux kernel's private decision how to handle boot and
> the adding of memory to its allocator. OTOH the "all or nothing" approach
> of the hypervisor regarding delivery of the machine memory map is a little
> bit strange, especially as the BIOS returns the E820 map one entry at a
> time.
> 
>> For that reason I'm not convinced we need or want this change.
> 
> It would certainly make it easier for the Linux kernel.

I'd like us to at least consider alternatives:

1) Have a new sub-op behaving BIOS like (one entry at a time).

2) Make the full map available inside the initial mapping, pointed to
by a new entry in the start info structure.

3) Have pvops Linux make use of the extra space available at the
end of the initial mapping. The minimum of 512k there should be
more than enough.

4) Others I can't think of right now.
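
For reference, a sketch of what such a BIOS-like sub-op could look like
from the caller's side (the sub-op name XENMEM_machine_memory_map_entry
and its argument structure are purely invented here for illustration,
mirroring the INT 15h/E820 continuation-value protocol):

/* Invented argument structure for a hypothetical one-entry sub-op. */
struct xen_machine_memmap_entry {
    unsigned int index;       /* in: entry to fetch, out: next index */
    struct e820entry entry;   /* out: the requested map entry */
};

static int fetch_machine_map(struct e820entry *map, unsigned int max)
{
    struct xen_machine_memmap_entry op = { .index = 0 };
    unsigned int n = 0;
    int rc;

    do {
        rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map_entry, &op);
        if (rc < 0)
            return rc;
        if (n < max)
            map[n++] = op.entry;
    } while (op.index != 0);  /* index 0 means the last entry was delivered */

    return n;
}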

Jan
Jürgen Groß Dec. 6, 2016, 8:33 a.m. UTC | #4
On 06/12/16 09:15, Jan Beulich wrote:
>>>> On 06.12.16 at 08:43, <JGross@suse.com> wrote:
>> On 05/12/16 18:17, Jan Beulich wrote:
>>>>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
>>>> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
>>>> caller's buffer can't hold all entries.
>>>>
>>>> This is a problem, as the caller normally has a static buffer defined
>>>> and, at the time of the call, no dynamic memory allocation is possible
>>>> because nothing is yet known about the system's memory layout.
>>>>
>>>> Instead of just failing, deliver as many memory map entries as possible
>>>> and return E2BIG to indicate that the result is incomplete. The caller
>>>> is then able to use at least some of the memory reported to exist to
>>>> allocate a larger buffer for the complete memory map.
>>>
>>> This makes no sense, as what we're talking about here is the
>>> machine memory map, and the calling Dom0 kernel can't allocate
>>> from that pool directly. Instead it would need its own memory
>>> map to know where to place such a larger buffer, and this map
>>> is usually just one or two entries large.
>>
>> This is true. In practice, however, things are a little bit more
>> complicated:
>>
>> Linux, when started as dom0, tries to rearrange its memory layout to
>> match that of the physical machine. It will only add memory to its
>> allocator which is known either not to need to be moved or to have
>> already been moved. And this decision is based on the machine memory
>> map.
> 
> Right, I did recall this oddity only after having sent the initial reply
> (I'm still much more into how the non-pvops kernel works, and there's
> no such dependency there).
> 
>> I admit it is the Linux kernel's private decision how to handle boot and
>> the adding of memory to its allocator. OTOH the "all or nothing" approach
>> of the hypervisor regarding delivery of the machine memory map is a little
>> bit strange, especially as the BIOS returns the E820 map one entry at a
>> time.
>>
>>> For that reason I'm not convinced we need or want this change.
>>
>> It would certainly make it easier for the Linux kernel.
> 
> I'd like us to at least consider alternatives:
> 
> 1) Have a new sub-op behaving BIOS like (one entry at a time).

No objections from me.

> 2) Make the full map available inside the initial mapping, pointed to
> by a new entry in the start info structure.

What about PVH?

> 3) Have pvops Linux make use of the extra space available at the
> end of the initial mapping. The minimum of 512k there should be
> more than enough.

This would mean using this area as a temporary buffer and copying the
obtained map somewhere else (either just as much as fits into the static
buffer, or after reorganizing the memory according to the map). I guess
this would work, too.
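
A rough sketch of how that could look (hypothetical;
xen_initial_mapping_scratch() is a stand-in for however the kernel locates
the free space at the end of the initial mapping, and SZ_512K matches the
minimum size mentioned above):

static int fetch_map_via_scratch(struct e820entry *dst, unsigned int max)
{
    /* Use the spare area behind the initial mapping as a temporary buffer. */
    struct e820entry *scratch = xen_initial_mapping_scratch();
    struct xen_memory_map memmap = {
        .nr_entries = SZ_512K / sizeof(struct e820entry),
    };
    unsigned int i;
    int rc;

    set_xen_guest_handle(memmap.buffer, scratch);
    rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
    if (rc)
        return rc;

    /* Copy only as many entries as the kernel's own table can hold. */
    for (i = 0; i < memmap.nr_entries && i < max; i++)
        dst[i] = scratch[i];

    return i;
}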

> 4) Others I can't think of right now.

My preference would be 1). In case we choose this approach I don't
see any need for the other patch I sent for getting the number of
entries; it can easily be replaced by a loop using the single entry
variant.


Juergen
Jan Beulich Dec. 6, 2016, 8:51 a.m. UTC | #5
>>> On 06.12.16 at 09:33, <JGross@suse.com> wrote:
> On 06/12/16 09:15, Jan Beulich wrote:
>> I'd like us to at least consider alternatives:
>> 
>> 1) Have a new sub-op behaving BIOS like (one entry at a time).
> 
> No objections from me.
> 
>> 2) Make the full map available inside the initial mapping, pointed to
>> by a new entry in the start info structure.
> 
> What about PVH?

Does PVH re-arrange its physical memory too? I didn't think
so ...

>> 3) Have pvops Linux make use of the extra space available at the
>> end of the initial mapping. The minimum of 512k there should be
>> more than enough.
> 
> This would mean using this area as a temporary buffer and copying the
> obtained map somewhere else (either just as much as fits into the static
> buffer, or after reorganizing the memory according to the map). I guess
> this would work, too.
> 
>> 4) Others I can't think of right now.
> 
> My preference would be 1). In case we choose this approach I don't
> see any need for the other patch I sent for getting the number of
> entries; it can easily be replaced by a loop using the single entry
> variant.

Well, my preference would be 3), then 2), then 1).

Regardless of that I would welcome patch 2 to be re-submitted
without depending on patch 1, as I think this is a worthwhile
adjustment in any case (bringing the interface in line with what
we do elsewhere).

Jan
Jürgen Groß Dec. 6, 2016, 9:44 a.m. UTC | #6
On 06/12/16 09:51, Jan Beulich wrote:
>>>> On 06.12.16 at 09:33, <JGross@suse.com> wrote:
>> On 06/12/16 09:15, Jan Beulich wrote:
>>> I'd like us to at least consider alternatives:
>>>
>>> 1) Have a new sub-op behaving BIOS like (one entry at a time).
>>
>> No objections from me.
>>
>>> 2) Make the full map available inside the initial mapping, pointed to
>>> by a new entry in the start info structure.
>>
>> What about PVH?
> 
> Does PVH re-arrange its physical memory too? I didn't think
> so ...

Not yet. Are you sure it won't be needed in the future? Why was it done
in the pvops kernel? I thought the main reason was to have the same PCI
address space layout as on the native system in order to avoid any bad
surprises due to buggy firmware and/or BIOS. The same could apply to
PVH.

>>> 3) Have pvops Linux make use of the extra space available at the
>>> end of the initial mapping. The minimum of 512k there should be
>>> more than enough.
>>
>> This would mean using this area as a temporary buffer and copying the
>> obtained map somewhere else (either just as much as fits into the static
>> buffer, or after reorganizing the memory according to the map). I guess
>> this would work, too.
>>
>>> 4) Others I can't think of right now.
>>
>> My preference would be 1). In case we choose this approach I don't
>> see any need for the other patch I sent for getting the number of
>> entries; it can easily be replaced by a loop using the single entry
>> variant.
> 
> Well, my preference would be 3), then 2), then 1).

If nobody objects I can go that route (3).

> Regardless of that I would welcome patch 2 to be re-submitted
> without depending on patch 1, as I think this is a worthwhile
> adjustment in any case (bringing the interface in line with what
> we do elsewhere).

Sure, will do.


Juergen
Jan Beulich Dec. 6, 2016, 9:51 a.m. UTC | #7
>>> On 06.12.16 at 10:44, <JGross@suse.com> wrote:
> On 06/12/16 09:51, Jan Beulich wrote:
>>>>> On 06.12.16 at 09:33, <JGross@suse.com> wrote:
>>> On 06/12/16 09:15, Jan Beulich wrote:
>>>> I'd like us to at least consider alternatives:
>>>>
>>>> 1) Have a new sub-op behaving BIOS like (one entry at a time).
>>>
>>> No objections from me.
>>>
>>>> 2) Make the full map available inside the initial mapping, pointed to
>>>> by a new entry in the start info structure.
>>>
>>> What about PVH?
>> 
>> Does PVH re-arrange its physical memory too? I didn't think
>> so ...
> 
> Not yet. Are you sure it won't be needed in the future? Why was it done
> in the pvops kernel? I thought the main reason was to have the same PCI
> address space layout as on the native system in order to avoid any bad
> surprises due to buggy firmware and/or BIOS. The same could apply to
> PVH.

PVH Dom0 (iirc) gets its memory arranged around MMIO holes
already (by the hypervisor).

Jan

Patch

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 14552a1..f8e679d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4737,7 +4737,7 @@  static int _handle_iomem_range(unsigned long s, unsigned long e,
         XEN_GUEST_HANDLE(e820entry_t) buffer;
 
         if ( ctxt->n + 1 >= ctxt->map.nr_entries )
-            return -EINVAL;
+            return -E2BIG;
         ent.addr = (uint64_t)ctxt->s << PAGE_SHIFT;
         ent.size = (uint64_t)(s - ctxt->s) << PAGE_SHIFT;
         ent.type = E820_RESERVED;
@@ -4985,8 +4985,6 @@  long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         if ( copy_from_guest(&ctxt.map, arg, 1) )
             return -EFAULT;
-        if ( ctxt.map.nr_entries < e820.nr_map + 1 )
-            return -EINVAL;
 
         buffer_param = guest_handle_cast(ctxt.map.buffer, e820entry_t);
         buffer = guest_handle_from_param(buffer_param, e820entry_t);
@@ -5005,31 +5003,35 @@  long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 if ( !rc )
                     rc = handle_iomem_range(s, s, &ctxt);
                 if ( rc )
-                    return rc;
+                    break;
+            }
+            if ( ctxt.map.nr_entries <= ctxt.n + 1 )
+            {
+                rc = -E2BIG;
+                break;
             }
-            if ( ctxt.map.nr_entries <= ctxt.n + (e820.nr_map - i) )
-                return -EINVAL;
             if ( __copy_to_guest_offset(buffer, ctxt.n, e820.map + i, 1) )
                 return -EFAULT;
             ctxt.s = PFN_UP(e820.map[i].addr + e820.map[i].size);
         }
 
-        if ( ctxt.s )
+        if ( !rc && ctxt.s )
         {
             rc = rangeset_report_ranges(current->domain->iomem_caps, ctxt.s,
                                         ~0UL, handle_iomem_range, &ctxt);
             if ( !rc && ctxt.s )
                 rc = handle_iomem_range(~0UL, ~0UL, &ctxt);
-            if ( rc )
-                return rc;
         }
 
+        if ( rc && rc != -E2BIG )
+            return rc;
+
         ctxt.map.nr_entries = ctxt.n;
 
         if ( __copy_to_guest(arg, &ctxt.map, 1) )
             return -EFAULT;
 
-        return 0;
+        return rc;
     }
 
     case XENMEM_machphys_mapping:
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 5bf840f..20df769 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -339,6 +339,8 @@  DEFINE_XEN_GUEST_HANDLE(xen_memory_map_t);
 /*
  * Returns the real physical memory map. Passes the same structure as
  * XENMEM_memory_map.
+ * In case of a buffer not capable to hold all entries of the physical
+ * memory map -E2BIG is returned and the buffer is filled completely.
  * arg == addr of xen_memory_map_t.
  */
 #define XENMEM_machine_memory_map   10