x86/mmcfg: add "force" option for MCFG
diff mbox series

Message ID 1567005862-18540-1-git-send-email-igor.druzhinin@citrix.com
State New, archived
Headers show
Series
  • x86/mmcfg: add "force" option for MCFG
Related show

Commit Message

Igor Druzhinin Aug. 28, 2019, 3:24 p.m. UTC
If MCFG area is not reserved in E820 Xen by default will defer its usage
until Dom0 registers it explicitly after ACPI parser recognizes it as
a reserved resource in DSDT. Having it reserved in E820 is not
mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
and firmware is free to keep a hole E820 in that place.

Unfortunately, keeping it disabled at this point makes Xen fail to
recognize many of PCIe extended capabilities early enough for newly added
devices. Namely, (1) some of VT-d quirks are not applied during Dom0
device handoff, (2) currently MCFG reservation report comes
too late from Dom0 after some of PCI devices being registered in Xen
by Dom0 kernel that break initialization of a number of PCIe capabilities
(e.g. SR-IOV VF BAR sizing).

Since introduction of ACPI parser in Xen is not possible add a "force"
option that will allow Xen to use MCFG area even it's not properly
reserved in E820.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
---

I think we need to have this option to at least have a way to workaround
problem (1). Problem (2) could be solved in Dom0 kernel by invoking
xen_mcfg_late() earlier but before the first PCI bus enumertaion which
currently happens somwhere during ACPI scan I think.

The question is what the defult value for this option should be?

---
 docs/misc/xen-command-line.pandoc     |  8 +++++---
 xen/arch/x86/e820.c                   | 20 ++++++++++++++++++++
 xen/arch/x86/x86_64/mmconfig-shared.c | 34 ++++++++++++++++++++--------------
 xen/include/asm-x86/e820.h            |  1 +
 4 files changed, 46 insertions(+), 17 deletions(-)

Comments

Roger Pau Monné Aug. 29, 2019, 8 a.m. UTC | #1
On Wed, Aug 28, 2019 at 04:24:22PM +0100, Igor Druzhinin wrote:
> If MCFG area is not reserved in E820 Xen by default will defer its usage
> until Dom0 registers it explicitly after ACPI parser recognizes it as
> a reserved resource in DSDT. Having it reserved in E820 is not
> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
> and firmware is free to keep a hole E820 in that place.
> 
> Unfortunately, keeping it disabled at this point makes Xen fail to
> recognize many of PCIe extended capabilities early enough for newly added
> devices. Namely, (1) some of VT-d quirks are not applied during Dom0
> device handoff, (2) currently MCFG reservation report comes
> too late from Dom0 after some of PCI devices being registered in Xen
> by Dom0 kernel that break initialization of a number of PCIe capabilities
> (e.g. SR-IOV VF BAR sizing).
> 
> Since introduction of ACPI parser in Xen is not possible add a "force"
> option that will allow Xen to use MCFG area even it's not properly
> reserved in E820.
> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>

Thanks! I think this is the best way to solve the issue.

> ---
> 
> I think we need to have this option to at least have a way to workaround
> problem (1). Problem (2) could be solved in Dom0 kernel by invoking
> xen_mcfg_late() earlier but before the first PCI bus enumertaion which
> currently happens somwhere during ACPI scan I think.
> 
> The question is what the defult value for this option should be?

Have you tested this against a variety of hardware?

Have you found any box that has a wrong MMCFG area in the MCFG static
table? (ie: one that doesn't match the motherboard resource
reservation in the dynamic ACPI tables?)

> 
> ---
>  docs/misc/xen-command-line.pandoc     |  8 +++++---
>  xen/arch/x86/e820.c                   | 20 ++++++++++++++++++++
>  xen/arch/x86/x86_64/mmconfig-shared.c | 34 ++++++++++++++++++++--------------
>  xen/include/asm-x86/e820.h            |  1 +
>  4 files changed, 46 insertions(+), 17 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index 7c72e31..f13b10c 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1424,11 +1424,13 @@ ordinary DomU, control domain, hardware domain, and - when supported
>  by the platform - DomU with pass-through device assigned).
>  
>  ### mmcfg (x86)
> -> `= <boolean>[,amd-fam10]`
> +> `= List of [ <boolean>, force, amd-fam10 ]`
>  
> -> Default: `1`
> +> Default: `true,force`
>  
> -Specify if the MMConfig space should be enabled.
> +Specify if the MMConfig space should be enabled. If force option is specified
> +(default) MMConfig space will be used by Xen early in boot even if it's
> +not reserved by firmware in the E820 table.
>  
>  ### mmio-relax (x86)
>  > `= <boolean> | all`
> diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
> index 8e8a2c4..edef72a 100644
> --- a/xen/arch/x86/e820.c
> +++ b/xen/arch/x86/e820.c
> @@ -37,6 +37,26 @@ struct e820map e820;
>  struct e820map __initdata e820_raw;
>  
>  /*
> + * This function checks if any part of the range <start,end> is mapped
> + * with type.
> + */
> +int __init e820_any_mapped(u64 start, u64 end, unsigned type)

Please use uint64_t and unsigned int. Also it looks like this
function wants to return a bool instead of an int.

> +{
> +       int i;

unsigned int.

> +
> +       for (i = 0; i < e820.nr_map; i++) {

Coding style. Some parts of this file are using the Linux coding
style I think, but new additions should be using the Xen coding
style, see e820_change_range_type for example.

> +               struct e820entry *ei = &e820.map[i];

const.

> +
> +               if (type && ei->type != type)
> +                       continue;
> +               if (ei->addr >= end || ei->addr + ei->size <= start)
> +                       continue;
> +               return 1;
> +       }
> +       return 0;
> +}
> +
> +/*
>   * This function checks if the entire range <start,end> is mapped with type.
>   *
>   * Note: this function only works correct if the e820 table is sorted and
> diff --git a/xen/arch/x86/x86_64/mmconfig-shared.c b/xen/arch/x86/x86_64/mmconfig-shared.c
> index cc08b52..9fc0865 100644
> --- a/xen/arch/x86/x86_64/mmconfig-shared.c
> +++ b/xen/arch/x86/x86_64/mmconfig-shared.c
> @@ -26,33 +26,34 @@
>  
>  #include "mmconfig.h"
>  
> +static bool_t __read_mostly force_mmcfg = true;
>  unsigned int pci_probe = PCI_PROBE_CONF1 | PCI_PROBE_MMCONF;
>  
>  static int __init parse_mmcfg(const char *s)
>  {
>      const char *ss;
> -    int rc = 0;
> +    int val, rc = 0;
>  
>      do {
>          ss = strchr(s, ',');
>          if ( !ss )
>              ss = strchr(s, '\0');
>  
> -        switch ( parse_bool(s, ss) )
> -        {
> -        case 0:
> -            pci_probe &= ~PCI_PROBE_MMCONF;
> -            break;
> -        case 1:
> -            break;
> -        default:
> -            if ( !cmdline_strcmp(s, "amd_fam10") ||
> -                 !cmdline_strcmp(s, "amd-fam10") )
> +        if ( (val = parse_bool(s, ss)) >= 0 ) {
> +            if ( val )
> +               pci_probe |= PCI_PROBE_MMCONF;
> +            else
> +               pci_probe &= ~PCI_PROBE_MMCONF;
> +        } else if ( (val = parse_boolean("amd_fam10", s, ss)) >= 0 ||
> +                    (val = parse_boolean("amd-fam10", s, ss)) >= 0 ) {
> +            if ( val )
>                  pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
>              else
> -                rc = -EINVAL;
> -            break;
> -        }
> +                pci_probe &= ~PCI_CHECK_ENABLE_AMD_MMCONF;
> +        } else if ( (val = parse_boolean("force", s, ss)) >= 0)
> +            force_mmcfg = val;

You could also consider adding a new flag to pci_probe, ie:
PCI_PROBE_FORCE_MMCFG.

> +        else
> +            rc = -EINVAL;
>  
>          s = ss + 1;
>      } while ( *ss );
> @@ -355,6 +356,11 @@ static int __init is_mmconf_reserved(
>                     (unsigned int)cfg->start_bus_number,
>                     (unsigned int)cfg->end_bus_number);
>          }
> +    } else if (!e820_any_mapped(addr, addr + old_size - 1, 0)) {

I think it should be fine to also accept a MMCFG area that's partially
reserved and partially a hole in the memory map?

AFAICT the proposed code will only accept MMCFG regions that are
fully in a memory map hole.

Thanks, Roger.
Jan Beulich Aug. 29, 2019, 9:12 a.m. UTC | #2
On 29.08.2019 10:00, Roger Pau Monné  wrote:
> On Wed, Aug 28, 2019 at 04:24:22PM +0100, Igor Druzhinin wrote:
>> --- a/xen/arch/x86/x86_64/mmconfig-shared.c
>> +++ b/xen/arch/x86/x86_64/mmconfig-shared.c
>> @@ -26,33 +26,34 @@
>>  
>>  #include "mmconfig.h"
>>  
>> +static bool_t __read_mostly force_mmcfg = true;
>>  unsigned int pci_probe = PCI_PROBE_CONF1 | PCI_PROBE_MMCONF;
>>  
>>  static int __init parse_mmcfg(const char *s)
>>  {
>>      const char *ss;
>> -    int rc = 0;
>> +    int val, rc = 0;
>>  
>>      do {
>>          ss = strchr(s, ',');
>>          if ( !ss )
>>              ss = strchr(s, '\0');
>>  
>> -        switch ( parse_bool(s, ss) )
>> -        {
>> -        case 0:
>> -            pci_probe &= ~PCI_PROBE_MMCONF;
>> -            break;
>> -        case 1:
>> -            break;
>> -        default:
>> -            if ( !cmdline_strcmp(s, "amd_fam10") ||
>> -                 !cmdline_strcmp(s, "amd-fam10") )
>> +        if ( (val = parse_bool(s, ss)) >= 0 ) {
>> +            if ( val )
>> +               pci_probe |= PCI_PROBE_MMCONF;
>> +            else
>> +               pci_probe &= ~PCI_PROBE_MMCONF;
>> +        } else if ( (val = parse_boolean("amd_fam10", s, ss)) >= 0 ||
>> +                    (val = parse_boolean("amd-fam10", s, ss)) >= 0 ) {
>> +            if ( val )
>>                  pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
>>              else
>> -                rc = -EINVAL;
>> -            break;
>> -        }
>> +                pci_probe &= ~PCI_CHECK_ENABLE_AMD_MMCONF;
>> +        } else if ( (val = parse_boolean("force", s, ss)) >= 0)
>> +            force_mmcfg = val;
> 
> You could also consider adding a new flag to pci_probe, ie:
> PCI_PROBE_FORCE_MMCFG.

Yes please, albeit to be in sync with the other flag perhaps
better PCI_PROBE_FORCE_MMCONF or PCI_PROBE_MMCONF_FORCE.

Jan
Jan Beulich Aug. 29, 2019, 9:49 a.m. UTC | #3
On 28.08.2019 17:24, Igor Druzhinin wrote:
> If MCFG area is not reserved in E820 Xen by default will defer its usage
> until Dom0 registers it explicitly after ACPI parser recognizes it as
> a reserved resource in DSDT. Having it reserved in E820 is not
> mandatory according to "PCI Firmware Specification, rev 3.2" (par. 4.1.2)
> and firmware is free to keep a hole E820 in that place.
> 
> Unfortunately, keeping it disabled at this point makes Xen fail to
> recognize many of PCIe extended capabilities early enough for newly added
> devices. Namely, (1) some of VT-d quirks are not applied during Dom0
> device handoff, (2) currently MCFG reservation report comes
> too late from Dom0 after some of PCI devices being registered in Xen
> by Dom0 kernel that break initialization of a number of PCIe capabilities
> (e.g. SR-IOV VF BAR sizing).
> 
> Since introduction of ACPI parser in Xen is not possible add a "force"
> option that will allow Xen to use MCFG area even it's not properly
> reserved in E820.
> 
> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
> ---
> 
> I think we need to have this option to at least have a way to workaround
> problem (1). Problem (2) could be solved in Dom0 kernel by invoking
> xen_mcfg_late() earlier but before the first PCI bus enumertaion which
> currently happens somwhere during ACPI scan I think.

Indeed (2) should be fixed as you say here. (1) should imo be fixed
differently too: pci_vtd_quirk() (and anything else that relies on
extended config space accesses) should be re-invoked once MCFG becomes
available for a given bus (range).

Nevertheless I think the command line option is a good thing to have.

> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1424,11 +1424,13 @@ ordinary DomU, control domain, hardware domain, and - when supported
>  by the platform - DomU with pass-through device assigned).
>  
>  ### mmcfg (x86)
> -> `= <boolean>[,amd-fam10]`
> +> `= List of [ <boolean>, force, amd-fam10 ]`
>  
> -> Default: `1`
> +> Default: `true,force`
>  
> -Specify if the MMConfig space should be enabled.
> +Specify if the MMConfig space should be enabled. If force option is specified
> +(default) MMConfig space will be used by Xen early in boot even if it's
> +not reserved by firmware in the E820 table.

The term "force" is too heavy for my taste: You still require the
range to be at least in an E820 hole. "allow-e820-hole" otoh is a
little longish. Perhaps "relaxed"?

> --- a/xen/arch/x86/x86_64/mmconfig-shared.c
> +++ b/xen/arch/x86/x86_64/mmconfig-shared.c
> @@ -26,33 +26,34 @@
>  
>  #include "mmconfig.h"
>  
> +static bool_t __read_mostly force_mmcfg = true;

Just bool please.

>  unsigned int pci_probe = PCI_PROBE_CONF1 | PCI_PROBE_MMCONF;
>  
>  static int __init parse_mmcfg(const char *s)
>  {
>      const char *ss;
> -    int rc = 0;
> +    int val, rc = 0;
>  
>      do {
>          ss = strchr(s, ',');
>          if ( !ss )
>              ss = strchr(s, '\0');
>  
> -        switch ( parse_bool(s, ss) )
> -        {
> -        case 0:
> -            pci_probe &= ~PCI_PROBE_MMCONF;
> -            break;
> -        case 1:
> -            break;
> -        default:
> -            if ( !cmdline_strcmp(s, "amd_fam10") ||
> -                 !cmdline_strcmp(s, "amd-fam10") )
> +        if ( (val = parse_bool(s, ss)) >= 0 ) {
> +            if ( val )
> +               pci_probe |= PCI_PROBE_MMCONF;
> +            else
> +               pci_probe &= ~PCI_PROBE_MMCONF;
> +        } else if ( (val = parse_boolean("amd_fam10", s, ss)) >= 0 ||
> +                    (val = parse_boolean("amd-fam10", s, ss)) >= 0 ) {
> +            if ( val )
>                  pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
>              else
> -                rc = -EINVAL;
> -            break;
> -        }
> +                pci_probe &= ~PCI_CHECK_ENABLE_AMD_MMCONF;
> +        } else if ( (val = parse_boolean("force", s, ss)) >= 0)
> +            force_mmcfg = val;
> +        else
> +            rc = -EINVAL;

Brace placement is wrong in your additions. I also don't really see
why you do away with the switch().

> @@ -355,6 +356,11 @@ static int __init is_mmconf_reserved(
>                     (unsigned int)cfg->start_bus_number,
>                     (unsigned int)cfg->end_bus_number);
>          }
> +    } else if (!e820_any_mapped(addr, addr + old_size - 1, 0)) {
> +        if (!force_mmcfg)
> +            printk(KERN_WARNING "PCI: MCFG area at %lx is not reserved in E820, "
> +                   "use mmcfg=force option\n", addr);

I don't think saying "use" is a good idea. If anything, make it less
strict - either by attaching "if ..." or by saying "consider using".

Jan
Igor Druzhinin Aug. 29, 2019, 7:25 p.m. UTC | #4
On 29/08/2019 09:00, Roger Pau Monné wrote:
>>
>> I think we need to have this option to at least have a way to workaround
>> problem (1). Problem (2) could be solved in Dom0 kernel by invoking
>> xen_mcfg_late() earlier but before the first PCI bus enumertaion which
>> currently happens somwhere during ACPI scan I think.
>>
>> The question is what the defult value for this option should be?
> 
> Have you tested this against a variety of hardware?
Not yet, I presume it's possible in theory that there is such a box in
the wild that will misbehave if we attempt to access MCFG earlier it'd
be discovered through ACPI. Other than that I don't see a reason why it
would cause any problem. But you're right - we need to run some tests
with this option set to true on our fleet before deciding.

> Have you found any box that has a wrong MMCFG area in the MCFG static
> table? (ie: one that doesn't match the motherboard resource
> reservation in the dynamic ACPI tables?)

I think it's possible that size of the table reported from ACPI is
smaller which would be a problem if we access out-of-bounds preliminary
- hence the ability to fall back. But I'm not aware of any hardware like
that.

>>
>> ---
>>  docs/misc/xen-command-line.pandoc     |  8 +++++---
>>  xen/arch/x86/e820.c                   | 20 ++++++++++++++++++++
>>  xen/arch/x86/x86_64/mmconfig-shared.c | 34 ++++++++++++++++++++--------------
>>  xen/include/asm-x86/e820.h            |  1 +
>>  4 files changed, 46 insertions(+), 17 deletions(-)
>>
>> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
>> index 7c72e31..f13b10c 100644
>> --- a/docs/misc/xen-command-line.pandoc
>> +++ b/docs/misc/xen-command-line.pandoc
>> @@ -1424,11 +1424,13 @@ ordinary DomU, control domain, hardware domain, and - when supported
>>  by the platform - DomU with pass-through device assigned).
>>  
>>  ### mmcfg (x86)
>> -> `= <boolean>[,amd-fam10]`
>> +> `= List of [ <boolean>, force, amd-fam10 ]`
>>  
>> -> Default: `1`
>> +> Default: `true,force`
>>  
>> -Specify if the MMConfig space should be enabled.
>> +Specify if the MMConfig space should be enabled. If force option is specified
>> +(default) MMConfig space will be used by Xen early in boot even if it's
>> +not reserved by firmware in the E820 table.
>>  
>>  ### mmio-relax (x86)
>>  > `= <boolean> | all`
>> diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
>> index 8e8a2c4..edef72a 100644
>> --- a/xen/arch/x86/e820.c
>> +++ b/xen/arch/x86/e820.c
>> @@ -37,6 +37,26 @@ struct e820map e820;
>>  struct e820map __initdata e820_raw;
>>  
>>  /*
>> + * This function checks if any part of the range <start,end> is mapped
>> + * with type.
>> + */
>> +int __init e820_any_mapped(u64 start, u64 end, unsigned type)
> 
> Please use uint64_t and unsigned int. Also it looks like this
> function wants to return a bool instead of an int.
> 

The problem here is that I want this function be similar to the one
below (e820_all_mapped) which is copied from Linux as well as the rest
of the file. At the same time I don't want to introduce code churn
fixing unrelated style issues. Perhaps it's better to keep style of this
function as is? Or do you think it's still better to unify the style
across both of them (perhaps in a separate commit)?

>> +{
>> +       int i;
> 
> unsigned int.
> 
>> +
>> +       for (i = 0; i < e820.nr_map; i++) {
> 
> Coding style. Some parts of this file are using the Linux coding
> style I think, but new additions should be using the Xen coding
> style, see e820_change_range_type for example.
> 
>> +               struct e820entry *ei = &e820.map[i];
> 
> const.
> 
>> +
>> +               if (type && ei->type != type)
>> +                       continue;
>> +               if (ei->addr >= end || ei->addr + ei->size <= start)
>> +                       continue;
>> +               return 1;
>> +       }
>> +       return 0;
>> +}
>> +
>> +/*
>>   * This function checks if the entire range <start,end> is mapped with type.
>>   *
>>   * Note: this function only works correct if the e820 table is sorted and
>> diff --git a/xen/arch/x86/x86_64/mmconfig-shared.c b/xen/arch/x86/x86_64/mmconfig-shared.c
>> index cc08b52..9fc0865 100644
>> --- a/xen/arch/x86/x86_64/mmconfig-shared.c
>> +++ b/xen/arch/x86/x86_64/mmconfig-shared.c
>> @@ -26,33 +26,34 @@
>>  
>>  #include "mmconfig.h"
>>  
>> +static bool_t __read_mostly force_mmcfg = true;
>>  unsigned int pci_probe = PCI_PROBE_CONF1 | PCI_PROBE_MMCONF;
>>  
>>  static int __init parse_mmcfg(const char *s)
>>  {
>>      const char *ss;
>> -    int rc = 0;
>> +    int val, rc = 0;
>>  
>>      do {
>>          ss = strchr(s, ',');
>>          if ( !ss )
>>              ss = strchr(s, '\0');
>>  
>> -        switch ( parse_bool(s, ss) )
>> -        {
>> -        case 0:
>> -            pci_probe &= ~PCI_PROBE_MMCONF;
>> -            break;
>> -        case 1:
>> -            break;
>> -        default:
>> -            if ( !cmdline_strcmp(s, "amd_fam10") ||
>> -                 !cmdline_strcmp(s, "amd-fam10") )
>> +        if ( (val = parse_bool(s, ss)) >= 0 ) {
>> +            if ( val )
>> +               pci_probe |= PCI_PROBE_MMCONF;
>> +            else
>> +               pci_probe &= ~PCI_PROBE_MMCONF;
>> +        } else if ( (val = parse_boolean("amd_fam10", s, ss)) >= 0 ||
>> +                    (val = parse_boolean("amd-fam10", s, ss)) >= 0 ) {
>> +            if ( val )
>>                  pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
>>              else
>> -                rc = -EINVAL;
>> -            break;
>> -        }
>> +                pci_probe &= ~PCI_CHECK_ENABLE_AMD_MMCONF;
>> +        } else if ( (val = parse_boolean("force", s, ss)) >= 0)
>> +            force_mmcfg = val;
> 
> You could also consider adding a new flag to pci_probe, ie:
> PCI_PROBE_FORCE_MMCFG.

I was thinking about that too. Will switch to that method if you think
that is better.

>> +        else
>> +            rc = -EINVAL;
>>  
>>          s = ss + 1;
>>      } while ( *ss );
>> @@ -355,6 +356,11 @@ static int __init is_mmconf_reserved(
>>                     (unsigned int)cfg->start_bus_number,
>>                     (unsigned int)cfg->end_bus_number);
>>          }
>> +    } else if (!e820_any_mapped(addr, addr + old_size - 1, 0)) {
> 
> I think it should be fine to also accept a MMCFG area that's partially
> reserved and partially a hole in the memory map?
> 
> AFAICT the proposed code will only accept MMCFG regions that are
> fully in a memory map hole.

What case particularly are you referring to? Partially covered case is
handled above where the beginning of reserved region corresponds to the
beginning of the table. Other than that (if a reserved region is
randomly located within a reported area) I think is totally ambiguous
and should be considered broken.

Igor
Jan Beulich Aug. 30, 2019, 7:44 a.m. UTC | #5
On 29.08.2019 21:25, Igor Druzhinin wrote:
> On 29/08/2019 09:00, Roger Pau Monné wrote:
>>>
>>> I think we need to have this option to at least have a way to workaround
>>> problem (1). Problem (2) could be solved in Dom0 kernel by invoking
>>> xen_mcfg_late() earlier but before the first PCI bus enumertaion which
>>> currently happens somwhere during ACPI scan I think.
>>>
>>> The question is what the defult value for this option should be?
>>
>> Have you tested this against a variety of hardware?
> Not yet, I presume it's possible in theory that there is such a box in
> the wild that will misbehave if we attempt to access MCFG earlier it'd
> be discovered through ACPI. Other than that I don't see a reason why it
> would cause any problem. But you're right - we need to run some tests
> with this option set to true on our fleet before deciding.
> 
>> Have you found any box that has a wrong MMCFG area in the MCFG static
>> table? (ie: one that doesn't match the motherboard resource
>> reservation in the dynamic ACPI tables?)
> 
> I think it's possible that size of the table reported from ACPI is
> smaller which would be a problem if we access out-of-bounds preliminary
> - hence the ability to fall back. But I'm not aware of any hardware like
> that.

The reason why Linux had put these two checks (E820 and ACPI) in place
(and we cloned them) was that in the old days (at least until around
the first x86-64 systems appearing) there were many such systems (iirc
more than there were well behaved ones). The bad situation with BIOSes
was also why the various chipset probing methods had been introduced.

Anyway - the main risk with using MCFG without knowing the range is
suitably reserved is that other things may live in that range (e.g.
BARs may have been allocated there). Therefore I don't think we can
reasonably default this option to "true", irrespective whether
perhaps _all_ systems in your testing fleet are well behaved in this
regard.

>>> --- a/xen/arch/x86/e820.c
>>> +++ b/xen/arch/x86/e820.c
>>> @@ -37,6 +37,26 @@ struct e820map e820;
>>>  struct e820map __initdata e820_raw;
>>>  
>>>  /*
>>> + * This function checks if any part of the range <start,end> is mapped
>>> + * with type.
>>> + */
>>> +int __init e820_any_mapped(u64 start, u64 end, unsigned type)
>>
>> Please use uint64_t and unsigned int. Also it looks like this
>> function wants to return a bool instead of an int.
>>
> 
> The problem here is that I want this function be similar to the one
> below (e820_all_mapped) which is copied from Linux as well as the rest
> of the file. At the same time I don't want to introduce code churn
> fixing unrelated style issues. Perhaps it's better to keep style of this
> function as is? Or do you think it's still better to unify the style
> across both of them (perhaps in a separate commit)?

Since we're trying to gradually switch to uint<N>_t, I think new code
should by default use the intended types. Exceptions would be imports
of entire files from Linux. If you followed up your change with one
converting the other function(s) to the intended types as well, that'd
be much appreciated, but is imo not a requirement.

Jan

Patch
diff mbox series

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 7c72e31..f13b10c 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1424,11 +1424,13 @@  ordinary DomU, control domain, hardware domain, and - when supported
 by the platform - DomU with pass-through device assigned).
 
 ### mmcfg (x86)
-> `= <boolean>[,amd-fam10]`
+> `= List of [ <boolean>, force, amd-fam10 ]`
 
-> Default: `1`
+> Default: `true,force`
 
-Specify if the MMConfig space should be enabled.
+Specify if the MMConfig space should be enabled. If force option is specified
+(default) MMConfig space will be used by Xen early in boot even if it's
+not reserved by firmware in the E820 table.
 
 ### mmio-relax (x86)
 > `= <boolean> | all`
diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index 8e8a2c4..edef72a 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -37,6 +37,26 @@  struct e820map e820;
 struct e820map __initdata e820_raw;
 
 /*
+ * This function checks if any part of the range <start,end> is mapped
+ * with type.
+ */
+int __init e820_any_mapped(u64 start, u64 end, unsigned type)
+{
+       int i;
+
+       for (i = 0; i < e820.nr_map; i++) {
+               struct e820entry *ei = &e820.map[i];
+
+               if (type && ei->type != type)
+                       continue;
+               if (ei->addr >= end || ei->addr + ei->size <= start)
+                       continue;
+               return 1;
+       }
+       return 0;
+}
+
+/*
  * This function checks if the entire range <start,end> is mapped with type.
  *
  * Note: this function only works correct if the e820 table is sorted and
diff --git a/xen/arch/x86/x86_64/mmconfig-shared.c b/xen/arch/x86/x86_64/mmconfig-shared.c
index cc08b52..9fc0865 100644
--- a/xen/arch/x86/x86_64/mmconfig-shared.c
+++ b/xen/arch/x86/x86_64/mmconfig-shared.c
@@ -26,33 +26,34 @@ 
 
 #include "mmconfig.h"
 
+static bool_t __read_mostly force_mmcfg = true;
 unsigned int pci_probe = PCI_PROBE_CONF1 | PCI_PROBE_MMCONF;
 
 static int __init parse_mmcfg(const char *s)
 {
     const char *ss;
-    int rc = 0;
+    int val, rc = 0;
 
     do {
         ss = strchr(s, ',');
         if ( !ss )
             ss = strchr(s, '\0');
 
-        switch ( parse_bool(s, ss) )
-        {
-        case 0:
-            pci_probe &= ~PCI_PROBE_MMCONF;
-            break;
-        case 1:
-            break;
-        default:
-            if ( !cmdline_strcmp(s, "amd_fam10") ||
-                 !cmdline_strcmp(s, "amd-fam10") )
+        if ( (val = parse_bool(s, ss)) >= 0 ) {
+            if ( val )
+               pci_probe |= PCI_PROBE_MMCONF;
+            else
+               pci_probe &= ~PCI_PROBE_MMCONF;
+        } else if ( (val = parse_boolean("amd_fam10", s, ss)) >= 0 ||
+                    (val = parse_boolean("amd-fam10", s, ss)) >= 0 ) {
+            if ( val )
                 pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
             else
-                rc = -EINVAL;
-            break;
-        }
+                pci_probe &= ~PCI_CHECK_ENABLE_AMD_MMCONF;
+        } else if ( (val = parse_boolean("force", s, ss)) >= 0)
+            force_mmcfg = val;
+        else
+            rc = -EINVAL;
 
         s = ss + 1;
     } while ( *ss );
@@ -355,6 +356,11 @@  static int __init is_mmconf_reserved(
                    (unsigned int)cfg->start_bus_number,
                    (unsigned int)cfg->end_bus_number);
         }
+    } else if (!e820_any_mapped(addr, addr + old_size - 1, 0)) {
+        if (!force_mmcfg)
+            printk(KERN_WARNING "PCI: MCFG area at %lx is not reserved in E820, "
+                   "use mmcfg=force option\n", addr);
+        valid = force_mmcfg;
     }
 
     return valid;
diff --git a/xen/include/asm-x86/e820.h b/xen/include/asm-x86/e820.h
index 52916fb..8babb4b 100644
--- a/xen/include/asm-x86/e820.h
+++ b/xen/include/asm-x86/e820.h
@@ -24,6 +24,7 @@  struct e820map {
 };
 
 extern int sanitize_e820_map(struct e820entry *biosmap, unsigned int *pnr_map);
+extern int e820_any_mapped(u64 start, u64 end, unsigned type);
 extern int e820_all_mapped(u64 start, u64 end, unsigned type);
 extern int reserve_e820_ram(struct e820map *e820, uint64_t s, uint64_t e);
 extern int e820_change_range_type(