Message ID | 20161205163409.16714-2-jgross@suse.com (mailing list archive)
---|---
State | New, archived
>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
> caller's buffer can't hold all entries.
>
> This is a problem as the caller normally has a static buffer defined,
> and when it makes the call no dynamic memory allocation is possible
> as nothing is yet known about the system's memory layout.
>
> Instead of just failing, deliver as many memory map entries as
> possible and return E2BIG to indicate the result is incomplete. The
> caller can then use at least some of the memory reported to exist to
> allocate a larger buffer for the complete memory map.

This makes no sense, as what we're talking about here is the machine
memory map, and the calling Dom0 kernel can't allocate from that pool
directly. Instead it would need its own memory map to know where to
place such a larger buffer, and this map is usually just one or two
entries large.

For that reason I'm not convinced we need or want this change.

Jan
On 05/12/16 18:17, Jan Beulich wrote:
>>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
>> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
>> caller's buffer can't hold all entries.
>>
>> This is a problem as the caller normally has a static buffer defined,
>> and when it makes the call no dynamic memory allocation is possible
>> as nothing is yet known about the system's memory layout.
>>
>> Instead of just failing, deliver as many memory map entries as
>> possible and return E2BIG to indicate the result is incomplete. The
>> caller can then use at least some of the memory reported to exist to
>> allocate a larger buffer for the complete memory map.
>
> This makes no sense, as what we're talking about here is the
> machine memory map, and the calling Dom0 kernel can't allocate
> from that pool directly. Instead it would need its own memory
> map to know where to place such a larger buffer, and this map
> is usually just one or two entries large.

This is true. In practice, however, things are a little bit more
complicated:

Linux started as dom0 tries to rearrange its memory layout to match
that of the physical machine. It will only add memory to its allocator
which is known either not to need to be moved or to have already been
moved, and this decision is based on the machine memory map.

I admit it is a Linux kernel private decision how to handle booting and
adding memory to its allocator. OTOH the "all or nothing" approach of
the hypervisor regarding delivery of the machine memory map is a little
bit strange, especially as the BIOS returns the E820 map one entry at
a time.

> For that reason I'm not convinced we need or want this change.

It would certainly make it easier for the Linux kernel.

Juergen
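To make the constraint concrete, the call site in question looks
roughly like the following. This is a minimal sketch modeled on the
pvops kernel's arch/x86/xen/setup.c; the buffer size and the BUG_ON()
error handling are illustrative, not the kernel's actual choices.

    #include <xen/interface/memory.h>   /* struct xen_memory_map, XENMEM_* */
    #include <asm/xen/hypercall.h>      /* HYPERVISOR_memory_op() */
    #include <asm/e820.h>               /* struct e820entry */

    /* Static buffer: at this point in boot no allocator is available. */
    static struct e820entry xen_e820_map[128] __initdata;

    static void __init xen_get_machine_e820(void)
    {
        struct xen_memory_map memmap;
        int rc;

        memmap.nr_entries = ARRAY_SIZE(xen_e820_map);
        set_xen_guest_handle(memmap.buffer, xen_e820_map);

        rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
        /* Unpatched hypervisor: -EINVAL if the static buffer is too
         * small, with no entries delivered at all.  No allocator is up
         * at this point, so the kernel cannot simply retry with a
         * dynamically allocated, larger buffer. */
        BUG_ON(rc);
    }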
>>> On 06.12.16 at 08:43, <JGross@suse.com> wrote:
> On 05/12/16 18:17, Jan Beulich wrote:
>>>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
>>> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
>>> caller's buffer can't hold all entries.
>>>
>>> This is a problem as the caller normally has a static buffer defined,
>>> and when it makes the call no dynamic memory allocation is possible
>>> as nothing is yet known about the system's memory layout.
>>>
>>> Instead of just failing, deliver as many memory map entries as
>>> possible and return E2BIG to indicate the result is incomplete. The
>>> caller can then use at least some of the memory reported to exist to
>>> allocate a larger buffer for the complete memory map.
>>
>> This makes no sense, as what we're talking about here is the
>> machine memory map, and the calling Dom0 kernel can't allocate
>> from that pool directly. Instead it would need its own memory
>> map to know where to place such a larger buffer, and this map
>> is usually just one or two entries large.
>
> This is true. In practice, however, things are a little bit more
> complicated:
>
> Linux started as dom0 tries to rearrange its memory layout to match
> that of the physical machine. It will only add memory to its allocator
> which is known either not to need to be moved or to have already been
> moved, and this decision is based on the machine memory map.

Right, I did recall this oddity only after having sent the initial reply
(I'm still much more into how the non-pvops kernel works, and there's
no such dependency there).

> I admit it is a Linux kernel private decision how to handle booting and
> adding memory to its allocator. OTOH the "all or nothing" approach of
> the hypervisor regarding delivery of the machine memory map is a little
> bit strange, especially as the BIOS returns the E820 map one entry at
> a time.
>
>> For that reason I'm not convinced we need or want this change.
>
> It would certainly make it easier for the Linux kernel.

I'd like us to at least consider alternatives:

1) Have a new sub-op behaving BIOS like (one entry at a time).

2) Make the full map available inside the initial mapping, pointed to
   by a new entry in the start info structure.

3) Have pvops Linux make use of the extra space available at the
   end of the initial mapping. The minimum of 512k there should be
   more than enough.

4) Others I can't think of right now.

Jan
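From the guest side, alternative 1) could look something like the
sketch below. Both the sub-op XENMEM_machine_memory_map_entry and its
argument structure are invented here purely to illustrate the BIOS-like
one-entry-at-a-time idea; no such sub-op exists in the hypercall
interface.

    /* Hypothetical one-entry-at-a-time interface, mirroring how the
     * BIOS INT 15h/E820 call hands back the map entry by entry.  Both
     * the sub-op and the structure are invented for illustration. */
    struct xen_machine_memory_map_entry {
        unsigned int idx;        /* in: entry to fetch; out: next index, 0 = last */
        struct e820entry entry;  /* out: the requested map entry */
    };

    static unsigned int __init xen_fetch_machine_e820(struct e820entry *map,
                                                      unsigned int limit)
    {
        struct xen_machine_memory_map_entry op = { .idx = 0 };
        unsigned int n = 0;

        do {
            if (HYPERVISOR_memory_op(XENMEM_machine_memory_map_entry, &op))
                break;           /* error: keep what was fetched so far */
            if (n < limit)
                map[n++] = op.entry;
        } while (op.idx != 0);   /* 0 signals the final entry, as with INT 15h */

        return n;
    }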
On 06/12/16 09:15, Jan Beulich wrote:
>>>> On 06.12.16 at 08:43, <JGross@suse.com> wrote:
>> On 05/12/16 18:17, Jan Beulich wrote:
>>>>>> On 05.12.16 at 17:34, <JGross@suse.com> wrote:
>>>> For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
>>>> caller's buffer can't hold all entries.
>>>>
>>>> This is a problem as the caller normally has a static buffer defined,
>>>> and when it makes the call no dynamic memory allocation is possible
>>>> as nothing is yet known about the system's memory layout.
>>>>
>>>> Instead of just failing, deliver as many memory map entries as
>>>> possible and return E2BIG to indicate the result is incomplete. The
>>>> caller can then use at least some of the memory reported to exist to
>>>> allocate a larger buffer for the complete memory map.
>>>
>>> This makes no sense, as what we're talking about here is the
>>> machine memory map, and the calling Dom0 kernel can't allocate
>>> from that pool directly. Instead it would need its own memory
>>> map to know where to place such a larger buffer, and this map
>>> is usually just one or two entries large.
>>
>> This is true. In practice, however, things are a little bit more
>> complicated:
>>
>> Linux started as dom0 tries to rearrange its memory layout to match
>> that of the physical machine. It will only add memory to its allocator
>> which is known either not to need to be moved or to have already been
>> moved, and this decision is based on the machine memory map.
>
> Right, I did recall this oddity only after having sent the initial reply
> (I'm still much more into how the non-pvops kernel works, and there's
> no such dependency there).
>
>> I admit it is a Linux kernel private decision how to handle booting and
>> adding memory to its allocator. OTOH the "all or nothing" approach of
>> the hypervisor regarding delivery of the machine memory map is a little
>> bit strange, especially as the BIOS returns the E820 map one entry at
>> a time.
>>
>>> For that reason I'm not convinced we need or want this change.
>>
>> It would certainly make it easier for the Linux kernel.
>
> I'd like us to at least consider alternatives:
>
> 1) Have a new sub-op behaving BIOS like (one entry at a time).

No objections from me.

> 2) Make the full map available inside the initial mapping, pointed to
>    by a new entry in the start info structure.

What about PVH?

> 3) Have pvops Linux make use of the extra space available at the
>    end of the initial mapping. The minimum of 512k there should be
>    more than enough.

This would mean using this area as a temporary buffer and copying the
obtained map somewhere else (either only in the size of the static
buffer or after reorganizing the memory according to the map). I guess
this would work, too.

> 4) Others I can't think of right now.

My preference would be 1). In case we choose this approach I don't see
any need for the other patch I sent for getting the number of entries:
it can easily be replaced by a loop over the single-entry call.

Juergen
>>> On 06.12.16 at 09:33, <JGross@suse.com> wrote:
> On 06/12/16 09:15, Jan Beulich wrote:
>> I'd like us to at least consider alternatives:
>>
>> 1) Have a new sub-op behaving BIOS like (one entry at a time).
>
> No objections from me.
>
>> 2) Make the full map available inside the initial mapping, pointed to
>>    by a new entry in the start info structure.
>
> What about PVH?

Does PVH do re-arrangement of its physical memory too? I didn't think
so ...

>> 3) Have pvops Linux make use of the extra space available at the
>>    end of the initial mapping. The minimum of 512k there should be
>>    more than enough.
>
> This would mean using this area as a temporary buffer and copying the
> obtained map somewhere else (either only in the size of the static
> buffer or after reorganizing the memory according to the map). I guess
> this would work, too.
>
>> 4) Others I can't think of right now.
>
> My preference would be 1). In case we choose this approach I don't see
> any need for the other patch I sent for getting the number of entries:
> it can easily be replaced by a loop over the single-entry call.

Well, my preference would be 3), then 2), then 1).

Regardless of that I would welcome patch 2 being re-submitted without
depending on patch 1, as I think this is a worthwhile adjustment in any
case (bringing the interface in line with what we do elsewhere).

Jan
On 06/12/16 09:51, Jan Beulich wrote:
>>>> On 06.12.16 at 09:33, <JGross@suse.com> wrote:
>> On 06/12/16 09:15, Jan Beulich wrote:
>>> I'd like us to at least consider alternatives:
>>>
>>> 1) Have a new sub-op behaving BIOS like (one entry at a time).
>>
>> No objections from me.
>>
>>> 2) Make the full map available inside the initial mapping, pointed to
>>>    by a new entry in the start info structure.
>>
>> What about PVH?
>
> Does PVH do re-arrangement of its physical memory too? I didn't think
> so ...

Not yet. Are you sure it won't be needed in future? Why was it done in
the pvops kernel? I thought the main reason was to have the same PCI
address space layout as on the native system in order to avoid any bad
surprises due to buggy firmware and/or BIOS. The same could apply to
PVH.

>>> 3) Have pvops Linux make use of the extra space available at the
>>>    end of the initial mapping. The minimum of 512k there should be
>>>    more than enough.
>>
>> This would mean using this area as a temporary buffer and copying the
>> obtained map somewhere else (either only in the size of the static
>> buffer or after reorganizing the memory according to the map). I guess
>> this would work, too.
>>
>>> 4) Others I can't think of right now.
>>
>> My preference would be 1). In case we choose this approach I don't see
>> any need for the other patch I sent for getting the number of entries:
>> it can easily be replaced by a loop over the single-entry call.
>
> Well, my preference would be 3), then 2), then 1).

If nobody objects I can go that route (3).

> Regardless of that I would welcome patch 2 being re-submitted without
> depending on patch 1, as I think this is a worthwhile adjustment in
> any case (bringing the interface in line with what we do elsewhere).

Sure, will do.

Juergen
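Route 3) would amount to something like the following rough sketch. The
scratch_va parameter, SCRATCH_SIZE, MAX_STATIC_E820 and the assumption
that the extra space can be used directly as a hypercall buffer are all
illustrative, not taken from any actual patch.

    /* Rough sketch of route 3): treat the guaranteed extra space at the
     * end of the initial mapping as a one-shot hypercall buffer, then
     * copy into the kernel's static map.  The real address and size of
     * the scratch area would come from the boot layout. */
    #define SCRATCH_SIZE      (512 * 1024)
    #define MAX_STATIC_E820   128

    static struct e820entry xen_e820_map[MAX_STATIC_E820] __initdata;

    static int __init xen_get_machine_e820_via_scratch(void *scratch_va)
    {
        struct e820entry *tmp = scratch_va;  /* 512k fits thousands of entries */
        struct xen_memory_map memmap;
        int rc;

        memmap.nr_entries = SCRATCH_SIZE / sizeof(*tmp);
        set_xen_guest_handle(memmap.buffer, tmp);

        rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
        if (rc)
            return rc;

        /* Keep only what the static buffer can hold; the scratch area
         * can be reused once the map has been consumed. */
        if (memmap.nr_entries > MAX_STATIC_E820)
            memmap.nr_entries = MAX_STATIC_E820;
        memcpy(xen_e820_map, tmp, memmap.nr_entries * sizeof(*tmp));

        return 0;
    }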
>>> On 06.12.16 at 10:44, <JGross@suse.com> wrote:
> On 06/12/16 09:51, Jan Beulich wrote:
>>>>> On 06.12.16 at 09:33, <JGross@suse.com> wrote:
>>> On 06/12/16 09:15, Jan Beulich wrote:
>>>> I'd like us to at least consider alternatives:
>>>>
>>>> 1) Have a new sub-op behaving BIOS like (one entry at a time).
>>>
>>> No objections from me.
>>>
>>>> 2) Make the full map available inside the initial mapping, pointed
>>>>    to by a new entry in the start info structure.
>>>
>>> What about PVH?
>>
>> Does PVH do re-arrangement of its physical memory too? I didn't think
>> so ...
>
> Not yet. Are you sure it won't be needed in future? Why was it done in
> the pvops kernel? I thought the main reason was to have the same PCI
> address space layout as on the native system in order to avoid any bad
> surprises due to buggy firmware and/or BIOS. The same could apply to
> PVH.

PVH Dom0 (iirc) gets its memory arranged around MMIO holes already (by
the hypervisor).

Jan
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 14552a1..f8e679d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4737,7 +4737,7 @@ static int _handle_iomem_range(unsigned long s, unsigned long e,
         XEN_GUEST_HANDLE(e820entry_t) buffer;
 
         if ( ctxt->n + 1 >= ctxt->map.nr_entries )
-            return -EINVAL;
+            return -E2BIG;
         ent.addr = (uint64_t)ctxt->s << PAGE_SHIFT;
         ent.size = (uint64_t)(s - ctxt->s) << PAGE_SHIFT;
         ent.type = E820_RESERVED;
@@ -4985,8 +4985,6 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         if ( copy_from_guest(&ctxt.map, arg, 1) )
             return -EFAULT;
-        if ( ctxt.map.nr_entries < e820.nr_map + 1 )
-            return -EINVAL;
 
         buffer_param = guest_handle_cast(ctxt.map.buffer, e820entry_t);
         buffer = guest_handle_from_param(buffer_param, e820entry_t);
@@ -5005,31 +5003,35 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 if ( !rc )
                     rc = handle_iomem_range(s, s, &ctxt);
                 if ( rc )
-                    return rc;
+                    break;
+            }
+            if ( ctxt.map.nr_entries <= ctxt.n + 1 )
+            {
+                rc = -E2BIG;
+                break;
             }
-            if ( ctxt.map.nr_entries <= ctxt.n + (e820.nr_map - i) )
-                return -EINVAL;
             if ( __copy_to_guest_offset(buffer, ctxt.n, e820.map + i, 1) )
                 return -EFAULT;
             ctxt.s = PFN_UP(e820.map[i].addr + e820.map[i].size);
         }
 
-        if ( ctxt.s )
+        if ( !rc && ctxt.s )
         {
             rc = rangeset_report_ranges(current->domain->iomem_caps, ctxt.s,
                                         ~0UL, handle_iomem_range, &ctxt);
             if ( !rc && ctxt.s )
                 rc = handle_iomem_range(~0UL, ~0UL, &ctxt);
-            if ( rc )
-                return rc;
         }
 
+        if ( rc && rc != -E2BIG )
+            return rc;
+
         ctxt.map.nr_entries = ctxt.n;
 
         if ( __copy_to_guest(arg, &ctxt.map, 1) )
             return -EFAULT;
 
-        return 0;
+        return rc;
     }
 
     case XENMEM_machphys_mapping:
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 5bf840f..20df769 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -339,6 +339,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_memory_map_t);
 /*
  * Returns the real physical memory map. Passes the same structure as
  * XENMEM_memory_map.
+ * In case of a buffer not capable to hold all entries of the physical
+ * memory map -E2BIG is returned and the buffer is filled completely.
  * arg == addr of xen_memory_map_t.
  */
 #define XENMEM_machine_memory_map 10
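With the -E2BIG semantics introduced by this patch, a caller could
bootstrap itself roughly as follows. This is a sketch only: early_alloc()
is a stand-in for whatever early allocation the partially filled map
makes possible, not a real Xen or Linux function.

    /* Sketch of a caller using the new -E2BIG semantics: the buffer is
     * filled completely even on overflow, so the partial map can seed
     * an early allocator, after which a full-sized retry succeeds. */
    struct e820entry *early_alloc(size_t bytes, const struct e820entry *hint,
                                  unsigned int nr_hint);  /* stand-in */

    static int __init xen_get_full_machine_e820(void)
    {
        static struct e820entry small[32] __initdata;
        struct xen_memory_map memmap;
        struct e820entry *big;
        int rc;

        memmap.nr_entries = ARRAY_SIZE(small);
        set_xen_guest_handle(memmap.buffer, small);
        rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
        if (rc != -E2BIG)
            return rc;      /* 0 means 'small' already holds the whole map */

        /* 'small' now holds the first entries of the map: enough
         * information about usable RAM to place a larger buffer. */
        big = early_alloc(1024 * sizeof(*big), small, memmap.nr_entries);

        memmap.nr_entries = 1024;
        set_xen_guest_handle(memmap.buffer, big);
        return HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);
    }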
For XENMEM_machine_memory_map the hypervisor returns EINVAL if the
caller's buffer can't hold all entries.

This is a problem as the caller normally has a static buffer defined,
and when it makes the call no dynamic memory allocation is possible
as nothing is yet known about the system's memory layout.

Instead of just failing, deliver as many memory map entries as
possible and return E2BIG to indicate the result is incomplete. The
caller can then use at least some of the memory reported to exist to
allocate a larger buffer for the complete memory map.

As E2BIG wasn't returned before, a caller not prepared for this case
will still see just a failure as before, while a caller prepared for
this error code running on an old hypervisor won't run into any
problems it wouldn't have had without this change.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/arch/x86/mm.c           | 22 ++++++++++++----------
 xen/include/public/memory.h |  2 ++
 2 files changed, 14 insertions(+), 10 deletions(-)