diff mbox series

[v2.1] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820

Message ID 1598928634-30849-1-git-send-email-igor.druzhinin@citrix.com (mailing list archive)
State New, archived
Headers show
Series [v2.1] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820 | expand

Commit Message

Igor Druzhinin Sept. 1, 2020, 2:50 a.m. UTC
Guest kernel does need to know in some cases where the tables are located
to treat these regions properly. One example is kexec process where
the first kernel needs to pass firmware region locations to the second
kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
table and count immovable memory regions").

The memory that hvmloader allocates in the reserved region mostly contains
these useful tables and could be safely indicated as ACPI without the need
to designate a sub-region specially for that. Making it non-reclaimable
(ACPI NVS) in contrast with ACPI reclaim (ACPI table) memory would avoid
potential reuse of this memory by the guest taking into account this region
may contain runtime structures like VM86 TSS, etc. If necessary, those
can be moved away later and the region marked as reclaimable.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---
Changes in v2.1:
- fixed previously missed uint32_t occurence

Changes in v2:
- gave more information on NVS type selection and potential alternatives
  in the description
- minor type fixes suggested

---
 tools/firmware/hvmloader/e820.c | 21 +++++++++++++++++----
 tools/firmware/hvmloader/util.c |  6 ++++++
 tools/firmware/hvmloader/util.h |  3 +++
 3 files changed, 26 insertions(+), 4 deletions(-)

Comments

Roger Pau Monné Sept. 1, 2020, 9:28 a.m. UTC | #1
On Tue, Sep 01, 2020 at 03:50:34AM +0100, Igor Druzhinin wrote:
> Guest kernel does need to know in some cases where the tables are located
> to treat these regions properly. One example is kexec process where
> the first kernel needs to pass firmware region locations to the second
> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
> table and count immovable memory regions").

Can you add a note that this is a Linux commit? Albeit there's a
reference to kexec above I don't it's entirely clear it's a Linux
commit.

> 
> The memory that hvmloader allocates in the reserved region mostly contains
> these useful tables and could be safely indicated as ACPI without the need
> to designate a sub-region specially for that. Making it non-reclaimable
> (ACPI NVS) in contrast with ACPI reclaim (ACPI table) memory would avoid
> potential reuse of this memory by the guest taking into account this region
> may contain runtime structures like VM86 TSS, etc. If necessary, those
> can be moved away later and the region marked as reclaimable.

By looking at domain_construct_memmap from libxl I think the same
problem is not present on that case as regions are properly marked as
RESERVED or ACPI as required?

> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>

Just one question below and one nit.

> ---
> Changes in v2.1:
> - fixed previously missed uint32_t occurence
> 
> Changes in v2:
> - gave more information on NVS type selection and potential alternatives
>   in the description
> - minor type fixes suggested
> 
> ---
>  tools/firmware/hvmloader/e820.c | 21 +++++++++++++++++----
>  tools/firmware/hvmloader/util.c |  6 ++++++
>  tools/firmware/hvmloader/util.h |  3 +++
>  3 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
> index 4d1c955..0ad2f05 100644
> --- a/tools/firmware/hvmloader/e820.c
> +++ b/tools/firmware/hvmloader/e820.c
> @@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
>  {
>      unsigned int nr = 0, i, j;
>      uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
> +    unsigned long firmware_mem_end =
> +        RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << PAGE_SHIFT);
>  
>      if ( !lowmem_reserved_base )
>              lowmem_reserved_base = 0xA0000;
> @@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
>      nr++;
>  
>      /*
> +     * Mark populated reserved memory that contains ACPI and other tables as
> +     * ACPI NVS (non-reclaimable) space - that should help the guest to treat
> +     * it correctly later (e.g. pass to the next kernel on kexec).
> +     */
> +
> +    e820[nr].addr = RESERVED_MEMBASE;
> +    e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
> +    e820[nr].type = E820_NVS;
> +    nr++;
> +
> +    /*
>       * Explicitly reserve space for special pages.
> -     * This space starts at RESERVED_MEMBASE an extends to cover various
> +     * This space starts after ACPI region and extends to cover various
>       * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
>       *
>       * If igd_opregion_pgbase we need to split the RESERVED region in two.
> @@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
>      {
>          uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
>  
> -        e820[nr].addr = RESERVED_MEMBASE;
> -        e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
> +        e820[nr].addr = firmware_mem_end;
> +        e820[nr].size = igd_opregion_base - firmware_mem_end;

Is there anything between firmware_mem_end and igd_opregion_base now?
You already account for RESERVED_MEMBASE to firmware_mem_end.

>          e820[nr].type = E820_RESERVED;
>          nr++;
>  
> @@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
>      }
>      else
>      {
> -        e820[nr].addr = RESERVED_MEMBASE;
> +        e820[nr].addr = firmware_mem_end;
>          e820[nr].size = (uint32_t)-e820[nr].addr;
>          e820[nr].type = E820_RESERVED;
>          nr++;
> diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
> index 0c3f2d2..59cde4a 100644
> --- a/tools/firmware/hvmloader/util.c
> +++ b/tools/firmware/hvmloader/util.c
> @@ -444,6 +444,12 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
>  static uint32_t alloc_up = RESERVED_MEMORY_DYNAMIC_START - 1;
>  static uint32_t alloc_down = RESERVED_MEMORY_DYNAMIC_END;
>  
> +unsigned long mem_mfns_allocated(void)

mem_pages_allocated might be better.

Thanks, Roger.
Igor Druzhinin Sept. 1, 2020, 10:29 a.m. UTC | #2
On 01/09/2020 10:28, Roger Pau Monné wrote:
> On Tue, Sep 01, 2020 at 03:50:34AM +0100, Igor Druzhinin wrote:
>> Guest kernel does need to know in some cases where the tables are located
>> to treat these regions properly. One example is kexec process where
>> the first kernel needs to pass firmware region locations to the second
>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>> table and count immovable memory regions").
> 
> Can you add a note that this is a Linux commit? Albeit there's a
> reference to kexec above I don't it's entirely clear it's a Linux
> commit.

Ok.

>>
>> The memory that hvmloader allocates in the reserved region mostly contains
>> these useful tables and could be safely indicated as ACPI without the need
>> to designate a sub-region specially for that. Making it non-reclaimable
>> (ACPI NVS) in contrast with ACPI reclaim (ACPI table) memory would avoid
>> potential reuse of this memory by the guest taking into account this region
>> may contain runtime structures like VM86 TSS, etc. If necessary, those
>> can be moved away later and the region marked as reclaimable.
> 
> By looking at domain_construct_memmap from libxl I think the same
> problem is not present on that case as regions are properly marked as
> RESERVED or ACPI as required?

Uhh, I simply forgot that PVH also constructs e820 - it seems it's doing it
properly though. I want to makes e820 map as close as possible between PVH and
HVM - let me see if I can improve my HVM version.

>>
>> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> 
> Just one question below and one nit.
> 
>> ---
>> Changes in v2.1:
>> - fixed previously missed uint32_t occurence
>>
>> Changes in v2:
>> - gave more information on NVS type selection and potential alternatives
>>   in the description
>> - minor type fixes suggested
>>
>> ---
>>  tools/firmware/hvmloader/e820.c | 21 +++++++++++++++++----
>>  tools/firmware/hvmloader/util.c |  6 ++++++
>>  tools/firmware/hvmloader/util.h |  3 +++
>>  3 files changed, 26 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
>> index 4d1c955..0ad2f05 100644
>> --- a/tools/firmware/hvmloader/e820.c
>> +++ b/tools/firmware/hvmloader/e820.c
>> @@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>      unsigned int nr = 0, i, j;
>>      uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
>> +    unsigned long firmware_mem_end =
>> +        RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << PAGE_SHIFT);
>>  
>>      if ( !lowmem_reserved_base )
>>              lowmem_reserved_base = 0xA0000;
>> @@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
>>      nr++;
>>  
>>      /*
>> +     * Mark populated reserved memory that contains ACPI and other tables as
>> +     * ACPI NVS (non-reclaimable) space - that should help the guest to treat
>> +     * it correctly later (e.g. pass to the next kernel on kexec).
>> +     */
>> +
>> +    e820[nr].addr = RESERVED_MEMBASE;
>> +    e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
>> +    e820[nr].type = E820_NVS;
>> +    nr++;
>> +
>> +    /*
>>       * Explicitly reserve space for special pages.
>> -     * This space starts at RESERVED_MEMBASE an extends to cover various
>> +     * This space starts after ACPI region and extends to cover various
>>       * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
>>       *
>>       * If igd_opregion_pgbase we need to split the RESERVED region in two.
>> @@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
>>      {
>>          uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
>>  
>> -        e820[nr].addr = RESERVED_MEMBASE;
>> -        e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
>> +        e820[nr].addr = firmware_mem_end;
>> +        e820[nr].size = igd_opregion_base - firmware_mem_end;
> 
> Is there anything between firmware_mem_end and igd_opregion_base now?
> You already account for RESERVED_MEMBASE to firmware_mem_end.

It's possible that there is something in between. IGD opregion is allocated
dynamically from above and occupies a couple of pages - there is a gap between the
top of a populated region and it where other structures could be located.
I don't want to tie e820 to the current allocation order.

>>          e820[nr].type = E820_RESERVED;
>>          nr++;
>>  
>> @@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
>>      }
>>      else
>>      {
>> -        e820[nr].addr = RESERVED_MEMBASE;
>> +        e820[nr].addr = firmware_mem_end;
>>          e820[nr].size = (uint32_t)-e820[nr].addr;
>>          e820[nr].type = E820_RESERVED;
>>          nr++;
>> diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
>> index 0c3f2d2..59cde4a 100644
>> --- a/tools/firmware/hvmloader/util.c
>> +++ b/tools/firmware/hvmloader/util.c
>> @@ -444,6 +444,12 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
>>  static uint32_t alloc_up = RESERVED_MEMORY_DYNAMIC_START - 1;
>>  static uint32_t alloc_down = RESERVED_MEMORY_DYNAMIC_END;
>>  
>> +unsigned long mem_mfns_allocated(void)
> 
> mem_pages_allocated might be better.

Ok, I'll see if I keep this function in a new version.

Igor
Jan Beulich Sept. 4, 2020, 8:33 a.m. UTC | #3
On 01.09.2020 04:50, Igor Druzhinin wrote:
> Guest kernel does need to know in some cases where the tables are located
> to treat these regions properly. One example is kexec process where
> the first kernel needs to pass firmware region locations to the second
> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
> table and count immovable memory regions").

I'm still struggling with the connection here: Reserved regions
surely are "immovable" too, aren't they? Where's the connection to
the E820 map in the first place - the change cited above is entirely
about SRAT? And I can't imagine kexec getting away with passing on
ACPI NVS regions, but not reserved ones.

> The memory that hvmloader allocates in the reserved region mostly contains
> these useful tables and could be safely indicated as ACPI without the need
> to designate a sub-region specially for that. Making it non-reclaimable
> (ACPI NVS) in contrast with ACPI reclaim (ACPI table) memory would avoid
> potential reuse of this memory by the guest taking into account this region
> may contain runtime structures like VM86 TSS, etc. If necessary, those
> can be moved away later and the region marked as reclaimable.

Some of this may want reflecting also ...

> @@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
>      nr++;
>  
>      /*
> +     * Mark populated reserved memory that contains ACPI and other tables as
> +     * ACPI NVS (non-reclaimable) space - that should help the guest to treat
> +     * it correctly later (e.g. pass to the next kernel on kexec).
> +     */
> +
> +    e820[nr].addr = RESERVED_MEMBASE;
> +    e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
> +    e820[nr].type = E820_NVS;
> +    nr++;

... in the comment you introduce here.

Jan
Igor Druzhinin Sept. 4, 2020, 11:49 a.m. UTC | #4
On 04/09/2020 09:33, Jan Beulich wrote:
> On 01.09.2020 04:50, Igor Druzhinin wrote:
>> Guest kernel does need to know in some cases where the tables are located
>> to treat these regions properly. One example is kexec process where
>> the first kernel needs to pass firmware region locations to the second
>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>> table and count immovable memory regions").
> 
> I'm still struggling with the connection here: Reserved regions
> surely are "immovable" too, aren't they? 

"Immovable" regions here are RAM that doesn't go away by hot-unplug. That change
was necessary in Linux to avoid image randomized placement to these regions.

> Where's the connection to
> the E820 map in the first place - the change cited above is entirely
> about SRAT? And I can't imagine kexec getting away with passing on
> ACPI NVS regions, but not reserved ones.
> 

They got away with it for as long as kexec exists I think. The point was that
those reserved regions were not accessed during early boot as long as kexec kernel stays
at transition tables. Now ACPI portion of it is accessed which highlighted our
imprecise reporting of memory layout to the guest - which I think should be fixed
either way.

I'm not going to argue if reserved regions should be mapped to transition tables or
not - I don't think it's important in context related to this patch. There were
already several kernel releases without that mappings and those also should be able
to invoke kdump.

Igor
Jan Beulich Sept. 4, 2020, 2:40 p.m. UTC | #5
On 04.09.2020 13:49, Igor Druzhinin wrote:
> On 04/09/2020 09:33, Jan Beulich wrote:
>> On 01.09.2020 04:50, Igor Druzhinin wrote:
>>> Guest kernel does need to know in some cases where the tables are located
>>> to treat these regions properly. One example is kexec process where
>>> the first kernel needs to pass firmware region locations to the second
>>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>>> table and count immovable memory regions").
>>
>> I'm still struggling with the connection here: Reserved regions
>> surely are "immovable" too, aren't they? 
> 
> "Immovable" regions here are RAM that doesn't go away by hot-unplug. That change
> was necessary in Linux to avoid image randomized placement to these regions.
> 
>> Where's the connection to
>> the E820 map in the first place - the change cited above is entirely
>> about SRAT? And I can't imagine kexec getting away with passing on
>> ACPI NVS regions, but not reserved ones.
>>
> 
> They got away with it for as long as kexec exists I think. The point was that
> those reserved regions were not accessed during early boot as long as kexec kernel stays
> at transition tables. Now ACPI portion of it is accessed which highlighted our
> imprecise reporting of memory layout to the guest - which I think should be fixed
> either way.

Is this to mean they map ACPI regions into the transition page tables,
but not reserved regions? If so, perhaps that's what the description
wants to say (and then possibly with a reference to the commit
introducing this into Linux, instead of the seemingly unrelated SRAT
one)?

Jan

> I'm not going to argue if reserved regions should be mapped to transition tables or
> not - I don't think it's important in context related to this patch. There were
> already several kernel releases without that mappings and those also should be able
> to invoke kdump.
> 
> Igor
>
Igor Druzhinin Sept. 4, 2020, 2:47 p.m. UTC | #6
On 04/09/2020 15:40, Jan Beulich wrote:
> On 04.09.2020 13:49, Igor Druzhinin wrote:
>> On 04/09/2020 09:33, Jan Beulich wrote:
>>> On 01.09.2020 04:50, Igor Druzhinin wrote:
>>>> Guest kernel does need to know in some cases where the tables are located
>>>> to treat these regions properly. One example is kexec process where
>>>> the first kernel needs to pass firmware region locations to the second
>>>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>>>> table and count immovable memory regions").
>>>
>>> I'm still struggling with the connection here: Reserved regions
>>> surely are "immovable" too, aren't they? 
>>
>> "Immovable" regions here are RAM that doesn't go away by hot-unplug. That change
>> was necessary in Linux to avoid image randomized placement to these regions.
>>
>>> Where's the connection to
>>> the E820 map in the first place - the change cited above is entirely
>>> about SRAT? And I can't imagine kexec getting away with passing on
>>> ACPI NVS regions, but not reserved ones.
>>>
>>
>> They got away with it for as long as kexec exists I think. The point was that
>> those reserved regions were not accessed during early boot as long as kexec kernel stays
>> at transition tables. Now ACPI portion of it is accessed which highlighted our
>> imprecise reporting of memory layout to the guest - which I think should be fixed
>> either way.
> 
> Is this to mean they map ACPI regions into the transition page tables,
> but not reserved regions?

Yes.

> If so, perhaps that's what the description
> wants to say (and then possibly with a reference to the commit
> introducing this into Linux, instead of the seemingly unrelated SRAT
> one)?

The referenced commit is not unrelated - it's the commit introducing the access
causing kexec transition to fail. But I can add another one pointing to the mapping
of ACPI tables that was supposed to fix the failure caused by the first one.

Igor
Jan Beulich Sept. 4, 2020, 3:29 p.m. UTC | #7
On 04.09.2020 16:47, Igor Druzhinin wrote:
> The referenced commit is not unrelated - it's the commit introducing the access
> causing kexec transition to fail. But I can add another one pointing to the mapping
> of ACPI tables that was supposed to fix the failure caused by the first one.

Oh, okay, I finally understand. The purpose of the change is
unrelated, it's what get_acpi_srat_table() does which causes the
problem. Hopefully this will be more obvious with the other commit
also referenced. How about additionally changing "which is now a
requirement after" to "which is now a requirement in order for ...
to work" or "which is a prerequisite for ... actually functioning"
or some such?

Jan
diff mbox series

Patch

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 4d1c955..0ad2f05 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -155,6 +155,8 @@  int build_e820_table(struct e820entry *e820,
 {
     unsigned int nr = 0, i, j;
     uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
+    unsigned long firmware_mem_end =
+        RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << PAGE_SHIFT);
 
     if ( !lowmem_reserved_base )
             lowmem_reserved_base = 0xA0000;
@@ -199,8 +201,19 @@  int build_e820_table(struct e820entry *e820,
     nr++;
 
     /*
+     * Mark populated reserved memory that contains ACPI and other tables as
+     * ACPI NVS (non-reclaimable) space - that should help the guest to treat
+     * it correctly later (e.g. pass to the next kernel on kexec).
+     */
+
+    e820[nr].addr = RESERVED_MEMBASE;
+    e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
+    e820[nr].type = E820_NVS;
+    nr++;
+
+    /*
      * Explicitly reserve space for special pages.
-     * This space starts at RESERVED_MEMBASE an extends to cover various
+     * This space starts after ACPI region and extends to cover various
      * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
      *
      * If igd_opregion_pgbase we need to split the RESERVED region in two.
@@ -210,8 +223,8 @@  int build_e820_table(struct e820entry *e820,
     {
         uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
 
-        e820[nr].addr = RESERVED_MEMBASE;
-        e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
+        e820[nr].addr = firmware_mem_end;
+        e820[nr].size = igd_opregion_base - firmware_mem_end;
         e820[nr].type = E820_RESERVED;
         nr++;
 
@@ -227,7 +240,7 @@  int build_e820_table(struct e820entry *e820,
     }
     else
     {
-        e820[nr].addr = RESERVED_MEMBASE;
+        e820[nr].addr = firmware_mem_end;
         e820[nr].size = (uint32_t)-e820[nr].addr;
         e820[nr].type = E820_RESERVED;
         nr++;
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 0c3f2d2..59cde4a 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -444,6 +444,12 @@  void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
 static uint32_t alloc_up = RESERVED_MEMORY_DYNAMIC_START - 1;
 static uint32_t alloc_down = RESERVED_MEMORY_DYNAMIC_END;
 
+unsigned long mem_mfns_allocated(void)
+{
+    return (alloc_up >> PAGE_SHIFT) -
+            ((RESERVED_MEMORY_DYNAMIC_START - 1) >> PAGE_SHIFT);
+}
+
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns)
 {
     alloc_down -= nr_mfns << PAGE_SHIFT;
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 7bca641..acd673a 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -200,6 +200,9 @@  void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns);
 /* Allocate a memory hole below 4GB. */
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns);
 
+/* Return number of pages allocated */
+unsigned long mem_mfns_allocated(void);
+
 /* Allocate memory in a reserved region below 4GB. */
 void *mem_alloc(uint32_t size, uint32_t align);
 #define virt_to_phys(v) ((unsigned long)(v))