diff mbox series

[v2] x86/PCI: Disable exclusion of E820 reserved addressed in some cases

Message ID 20220228105259.230903-2-hdegoede@redhat.com (mailing list archive)
State Superseded
Headers show
Series [v2] x86/PCI: Disable exclusion of E820 reserved addressed in some cases | expand

Commit Message

Hans de Goede Feb. 28, 2022, 10:52 a.m. UTC
Some fw has a bug where the PCI bridge window returned by the ACPI
resources partly overlaps with some other address range, causing issues.
To workaround this Linux excludes E820 reserved addresses when allocating
addresses from the PCI bridge window. 2 known examples of such fw bugs are:

1. The returned window contains addresses which map to system RAM,
see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating
address space").

2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
reserved range and the ACPI provided PCI bridge window:
 efi: mem46: [MMIO] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
 BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
 pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
then the system fails to resume after a suspend.

Recently (2019) some systems have shown-up with EFI memmap MMIO entries
covering the entire ACPI provided PCI bridge window. These memmap entries
get converted into e820_table entries, causing all attempts to assign
memory to PCI BARs which have not been setup by the BIOS to fail.
For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
 efi: mem63: [MMIO] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
 BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
 pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
 pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
 pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]

To fix this, check if the ACPI provided PCI bridge window is fully
contained within in EFI memmap MMIO region and in that case disable
the "exclude E820 reserved addresses" workaround, fixing the problem
of not being able to find free space for unassigned BARs.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=206459
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2029207
BugLink: https://bugs.launchpad.net/bugs/1878279
BugLink: https://bugs.launchpad.net/bugs/1931715
BugLink: https://bugs.launchpad.net/bugs/1932069
BugLink: https://bugs.launchpad.net/bugs/1921649
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
Changes in v2:
- Add a couple of missing includes to arch/x86/include/asm/pci_x86.h
  to fix i386 build errors Reported-by: kernel test robot <lkp@intel.com>
- Do not call resource_is_efi_mmio_region() on resource-list entries which
  have just been destroyed because they match resource_is_pcicfg_ioport()
  Reported-by: kernel test robot <oliver.sang@intel.com>
- Add (res->flags & IORESOURCE_MEM) check to resource_is_efi_mmio_region()
---
 arch/x86/include/asm/pci_x86.h | 10 +++++
 arch/x86/kernel/resource.c     |  4 ++
 arch/x86/pci/acpi.c            | 68 +++++++++++++++++++++++++++++++++-
 3 files changed, 81 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki March 1, 2022, 12:31 p.m. UTC | #1
On Mon, Feb 28, 2022 at 11:55 AM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Some fw has a bug where the PCI bridge window returned by the ACPI
> resources partly overlaps with some other address range, causing issues.
> To workaround this Linux excludes E820 reserved addresses when allocating
> addresses from the PCI bridge window. 2 known examples of such fw bugs are:
>
> 1. The returned window contains addresses which map to system RAM,
> see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating
> address space").
>
> 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
> reserved range and the ACPI provided PCI bridge window:
>  efi: mem46: [MMIO] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>  BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
>  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
> then the system fails to resume after a suspend.
>
> Recently (2019) some systems have shown-up with EFI memmap MMIO entries
> covering the entire ACPI provided PCI bridge window. These memmap entries
> get converted into e820_table entries, causing all attempts to assign
> memory to PCI BARs which have not been setup by the BIOS to fail.
> For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
>  efi: mem63: [MMIO] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
>  BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
>  pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
>  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
>
> To fix this, check if the ACPI provided PCI bridge window is fully
> contained within in EFI memmap MMIO region and in that case disable
> the "exclude E820 reserved addresses" workaround, fixing the problem
> of not being able to find free space for unassigned BARs.
>
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=206459
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2029207
> BugLink: https://bugs.launchpad.net/bugs/1878279
> BugLink: https://bugs.launchpad.net/bugs/1931715
> BugLink: https://bugs.launchpad.net/bugs/1932069
> BugLink: https://bugs.launchpad.net/bugs/1921649
> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> ---
> Changes in v2:
> - Add a couple of missing includes to arch/x86/include/asm/pci_x86.h
>   to fix i386 build errors Reported-by: kernel test robot <lkp@intel.com>
> - Do not call resource_is_efi_mmio_region() on resource-list entries which
>   have just been destroyed because they match resource_is_pcicfg_ioport()
>   Reported-by: kernel test robot <oliver.sang@intel.com>
> - Add (res->flags & IORESOURCE_MEM) check to resource_is_efi_mmio_region()

Applied (subject fixed up), thanks!

> ---
>  arch/x86/include/asm/pci_x86.h | 10 +++++
>  arch/x86/kernel/resource.c     |  4 ++
>  arch/x86/pci/acpi.c            | 68 +++++++++++++++++++++++++++++++++-
>  3 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
> index 490411dba438..4ce61ab01a4f 100644
> --- a/arch/x86/include/asm/pci_x86.h
> +++ b/arch/x86/include/asm/pci_x86.h
> @@ -5,7 +5,9 @@
>   *     (c) 1999 Martin Mares <mj@ucw.cz>
>   */
>
> +#include <linux/init.h>
>  #include <linux/ioport.h>
> +#include <linux/spinlock.h>
>
>  #undef DEBUG
>
> @@ -64,6 +66,8 @@ void pcibios_scan_specific_bus(int busn);
>
>  /* pci-irq.c */
>
> +struct pci_dev;
> +
>  struct irq_info {
>         u8 bus, devfn;                  /* Bus, device and function */
>         struct {
> @@ -232,3 +236,9 @@ static inline void mmio_config_writel(void __iomem *pos, u32 val)
>  # define x86_default_pci_init_irq      NULL
>  # define x86_default_pci_fixup_irqs    NULL
>  #endif
> +
> +#if defined CONFIG_PCI && defined CONFIG_ACPI
> +extern bool pci_use_e820;
> +#else
> +#define pci_use_e820 true
> +#endif
> diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
> index 9b9fb7882c20..e8dc9bc327bd 100644
> --- a/arch/x86/kernel/resource.c
> +++ b/arch/x86/kernel/resource.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/ioport.h>
>  #include <asm/e820/api.h>
> +#include <asm/pci_x86.h>
>
>  static void resource_clip(struct resource *res, resource_size_t start,
>                           resource_size_t end)
> @@ -28,6 +29,9 @@ static void remove_e820_regions(struct resource *avail)
>         int i;
>         struct e820_entry *entry;
>
> +       if (!pci_use_e820)
> +               return;
> +
>         for (i = 0; i < e820_table->nr_entries; i++) {
>                 entry = &e820_table->entries[i];
>
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index 052f1d78a562..fce05e03ba9e 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -1,4 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0
> +#include <linux/efi.h>
>  #include <linux/pci.h>
>  #include <linux/acpi.h>
>  #include <linux/init.h>
> @@ -21,6 +22,7 @@ struct pci_root_info {
>
>  static bool pci_use_crs = true;
>  static bool pci_ignore_seg;
> +bool pci_use_e820 = true;
>
>  static int __init set_use_crs(const struct dmi_system_id *id)
>  {
> @@ -291,6 +293,63 @@ static bool resource_is_pcicfg_ioport(struct resource *res)
>                 res->start == 0xCF8 && res->end == 0xCFF;
>  }
>
> +/*
> + * Some fw has a bug where the PCI bridge window returned by the ACPI resources
> + * partly overlaps with some other address range, causing issues. To workaround
> + * this Linux excludes E820 reserved addresses when allocating addresses from
> + * the PCI bridge window. 2 known examples of such firmware bugs are:
> + *
> + * 1. The returned window contains addresses which map to system RAM, see
> + * commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
> + *
> + * 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
> + * reserved range and the ACPI provided PCI bridge window:
> + *  efi: mem46: [MMIO] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> + *  BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
> + *  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> + * If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
> + * then the system fails to resume after a suspend.
> + *
> + * Recently (2019) some systems have shown-up with EFI memmap MMIO entries
> + * covering the entire ACPI provided PCI bridge window. These memmap entries
> + * get converted into e820_table entries, causing all attempts to assign
> + * memory to PCI BARs which have not been setup by the BIOS to fail.
> + * For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
> + *  efi: mem63: [MMIO] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
> + *  BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
> + *  pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
> + *  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> + *  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> + *
> + * To code below checks if the ACPI provided PCI bridge window is fully
> + * contained within in EFI memmap MMIO region and in that case disables
> + * the "exclude E820 reserved addresses" workaround to avoid this issue.
> + */
> +static bool resource_is_efi_mmio_region(const struct resource *res)
> +{
> +       unsigned long long start, end;
> +       efi_memory_desc_t *md;
> +
> +       if (!(res->flags & IORESOURCE_MEM))
> +               return false;
> +
> +       if (!efi_enabled(EFI_MEMMAP))
> +               return false;
> +
> +       for_each_efi_memory_desc(md) {
> +               if (md->type != EFI_MEMORY_MAPPED_IO)
> +                       continue;
> +
> +               start = md->phys_addr;
> +               end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> +
> +               if (res->start >= start && res->end <= end)
> +                       return true;
> +       }
> +
> +       return false;
> +}
> +
>  static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
>  {
>         struct acpi_device *device = ci->bridge;
> @@ -300,9 +359,16 @@ static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
>
>         status = acpi_pci_probe_root_resources(ci);
>         if (pci_use_crs) {
> -               resource_list_for_each_entry_safe(entry, tmp, &ci->resources)
> +               resource_list_for_each_entry_safe(entry, tmp, &ci->resources) {
>                         if (resource_is_pcicfg_ioport(entry->res))
>                                 resource_list_destroy_entry(entry);
> +                       else if (resource_is_efi_mmio_region(entry->res)) {
> +                               dev_info(&device->dev,
> +                                       "host bridge window %pR is marked by EFI as MMIO\n",
> +                                       entry->res);
> +                               pci_use_e820 = false;
> +                       }
> +               }
>                 return status;
>         }
>
> --
> 2.35.1
>
Bjorn Helgaas March 3, 2022, 12:40 a.m. UTC | #2
On Mon, Feb 28, 2022 at 11:52:59AM +0100, Hans de Goede wrote:

I know Rafael has already applied this, but I'm still trying to
understand this because it looks like a very complicated maintenance
problem.

> Some fw has a bug where the PCI bridge window returned by the ACPI
> resources partly overlaps with some other address range, causing issues.
> To workaround this Linux excludes E820 reserved addresses when allocating
> addresses from the PCI bridge window. 2 known examples of such fw bugs are:
>
> 1. The returned window contains addresses which map to system RAM,
> see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating
> address space").

Bug report is https://bugzilla.kernel.org/show_bug.cgi?id=16228
First dmesg log https://bugzilla.kernel.org/attachment.cgi?id=26811
shows:

  BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
  pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
  pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
  pci 0000:00:1f.2: BAR 5: assigned [mem 0xbff00000-0xbff007ff]
  ahci 0000:00:1f.2: controller reset failed (0xffffffff)
  ahci 0000:00:1f.2: failed to stop engine (-5)

The problem is that _CRS advertises [mem 0xbff00000-0xdfffffff], and
we assigned [mem 0xbff00000-0xbff007ff] to 00:1f.2, but
0xbff00000-0xbfffffff is not usable for PCI devices.  My guess is that
it contains host bridge registers, but all we really know is that it
doesn't work.

I think the _CRS that includes non-usable space is clearly a BIOS
defect.

The fix from 4dc2287c1805 was to avoid that region based on the
0xbfe4dc00-0xc0000000 E820 entry, and the result is in
https://bugzilla.kernel.org/attachment.cgi?id=30662:

  BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
  pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xf7ffffff]
  pci_root PNP0A03:00: host bridge window [mem 0xff980000-0xff980fff]
  pci 0000:00:1f.2: reg 24: [mem 0xff970000-0xff9707ff]   # BAR 5
  pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
  pci 0000:00:1f.2: BAR 5: assigned [mem 0xff980800-0xff980fff]

The patch below doesn't affect this workaround.

> 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
> reserved range and the ACPI provided PCI bridge window:
>  efi: mem46: [MMIO] range=              [0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>  BIOS-e820:                         [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
>  pci_bus 0000:00: root bus resource [mem         0xdfa00000-        0xfebfffff window]
> If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
> then the system fails to resume after a suspend.

I think this is from https://bugzilla.redhat.com/show_bug.cgi?id=2029207

If I understand correctly, the log in comment 23 from Ivan
(https://bugzilla.redhat.com/attachment.cgi?id=1859801) is a
case where resume doesn't work:

  BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
  pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]

And the log in comment 38, also from Ivan,
(https://bugzilla.redhat.com/attachment.cgi?id=1861539) is a case
where resume *does* work:

  BIOS-e820: [mem 0x00000000dcf00000-0x00000000dfa0ffff] reserved
  efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
  pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff]

_CRS advertises [mem 0xdfa00000-0xfebfffff], but when we assign the
0xdfa00000-0xdfafffff region to the 00:1c.0 MMIO window, resume fails.

I don't see a theory about what the root cause is.  It's possible this
is similar to case 1 above, where _CRS is defective.  But here we're
only assigning the 00:1c.0 MMIO window, and 00:1c.0 is a bridge to
[bus 02], and there are no devices on bus 02.  There should be no
transactions that use that MMIO window, so it's not clear why this
should matter.

If this is a _CRS defect similar to case 1, it's possible there are
host bridge registers in 0xdfa00000-0xdfafffff, and BIOS might use
those during resume, and assigning that area to the 00:1c.0 MMIO
window might interfere with that.  But that's a lot of speculation.

If you have a good Lenovo contact, they might be able to confirm or
deny this.

I'm missing some things here that should be obvious; can you help me
out?

  - Why did the 4dc2287c1805 workaround not apply here?  The E820
    region overlaps the _CRS window just like in case 1, so why did we
    assign 0xdfa00000?

  - How does this patch work around the problem?  This patch checks if
    a host bridge window is completely contained in an EFI MMIO
    region, but I don't see such an EFI region here.

It might be interesting to poke around in the 0xdfa00000-0xdfafffff
range.  For example, if Ivan tried things like this, it's possible
we'd find something we could identify as registers:

  # ./rdwrmem -b 4 -m -s 0xdfa00000 -l 256
  # ./rdwrmem -b 4 -m -s 0xdfa40000 -l 256
  # ./rdwrmem -b 4 -m -s 0xdfa80000 -l 256
  # ./rdwrmem -b 4 -m -s 0xdfac0000 -l 256

This tool is from https://cmp.felk.cvut.cz/~pisa/linux/rdwrmem.c

> Recently (2019) some systems have shown-up with EFI memmap MMIO entries
> covering the entire ACPI provided PCI bridge window. These memmap entries
> get converted into e820_table entries, causing all attempts to assign
> memory to PCI BARs which have not been setup by the BIOS to fail.
> For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
>  efi: mem63: [MMIO] range=              [0x0000000065400000-0x00000000cfffffff] (1708MB)
>  BIOS-e820:                         [mem 0x000000004bc50000-0x00000000cfffffff] reserved
>  pci_bus 0000:00: root bus resource [mem         0x65400000-        0xbfffffff window]
>  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> 
> To fix this, check if the ACPI provided PCI bridge window is fully
> contained within in EFI memmap MMIO region and in that case disable
> the "exclude E820 reserved addresses" workaround, fixing the problem
> of not being able to find free space for unassigned BARs.
> 
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=206459
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2029207
> BugLink: https://bugs.launchpad.net/bugs/1878279
> BugLink: https://bugs.launchpad.net/bugs/1931715
> BugLink: https://bugs.launchpad.net/bugs/1932069
> BugLink: https://bugs.launchpad.net/bugs/1921649
> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> ---
> Changes in v2:
> - Add a couple of missing includes to arch/x86/include/asm/pci_x86.h
>   to fix i386 build errors Reported-by: kernel test robot <lkp@intel.com>
> - Do not call resource_is_efi_mmio_region() on resource-list entries which
>   have just been destroyed because they match resource_is_pcicfg_ioport()
>   Reported-by: kernel test robot <oliver.sang@intel.com>
> - Add (res->flags & IORESOURCE_MEM) check to resource_is_efi_mmio_region()
> ---
>  arch/x86/include/asm/pci_x86.h | 10 +++++
>  arch/x86/kernel/resource.c     |  4 ++
>  arch/x86/pci/acpi.c            | 68 +++++++++++++++++++++++++++++++++-
>  3 files changed, 81 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
> index 490411dba438..4ce61ab01a4f 100644
> --- a/arch/x86/include/asm/pci_x86.h
> +++ b/arch/x86/include/asm/pci_x86.h
> @@ -5,7 +5,9 @@
>   *	(c) 1999 Martin Mares <mj@ucw.cz>
>   */
>  
> +#include <linux/init.h>
>  #include <linux/ioport.h>
> +#include <linux/spinlock.h>
>  
>  #undef DEBUG
>  
> @@ -64,6 +66,8 @@ void pcibios_scan_specific_bus(int busn);
>  
>  /* pci-irq.c */
>  
> +struct pci_dev;

Both the above look like fixes to an unrelated asm/pci_x86.h
problem that happened to be exposed by including it in
arch/x86/kernel/resource.c.  Most users of asm/pci_x86.h
include linux/pci.h first, which covers up the problem.

>  struct irq_info {
>  	u8 bus, devfn;			/* Bus, device and function */
>  	struct {
> @@ -232,3 +236,9 @@ static inline void mmio_config_writel(void __iomem *pos, u32 val)
>  # define x86_default_pci_init_irq	NULL
>  # define x86_default_pci_fixup_irqs	NULL
>  #endif
> +
> +#if defined CONFIG_PCI && defined CONFIG_ACPI
> +extern bool pci_use_e820;
> +#else
> +#define pci_use_e820 true
> +#endif
> diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
> index 9b9fb7882c20..e8dc9bc327bd 100644
> --- a/arch/x86/kernel/resource.c
> +++ b/arch/x86/kernel/resource.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  #include <linux/ioport.h>
>  #include <asm/e820/api.h>
> +#include <asm/pci_x86.h>
>  
>  static void resource_clip(struct resource *res, resource_size_t start,
>  			  resource_size_t end)
> @@ -28,6 +29,9 @@ static void remove_e820_regions(struct resource *avail)
>  	int i;
>  	struct e820_entry *entry;
>  
> +	if (!pci_use_e820)
> +		return;
> +
>  	for (i = 0; i < e820_table->nr_entries; i++) {
>  		entry = &e820_table->entries[i];
>  
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index 052f1d78a562..fce05e03ba9e 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -1,4 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0
> +#include <linux/efi.h>
>  #include <linux/pci.h>
>  #include <linux/acpi.h>
>  #include <linux/init.h>
> @@ -21,6 +22,7 @@ struct pci_root_info {
>  
>  static bool pci_use_crs = true;
>  static bool pci_ignore_seg;
> +bool pci_use_e820 = true;
>  
>  static int __init set_use_crs(const struct dmi_system_id *id)
>  {
> @@ -291,6 +293,63 @@ static bool resource_is_pcicfg_ioport(struct resource *res)
>  		res->start == 0xCF8 && res->end == 0xCFF;
>  }
>  
> +/*
> + * Some fw has a bug where the PCI bridge window returned by the ACPI resources
> + * partly overlaps with some other address range, causing issues. To workaround
> + * this Linux excludes E820 reserved addresses when allocating addresses from
> + * the PCI bridge window. 2 known examples of such firmware bugs are:
> + *
> + * 1. The returned window contains addresses which map to system RAM, see
> + * commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
> + *
> + * 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
> + * reserved range and the ACPI provided PCI bridge window:
> + *  efi: mem46: [MMIO] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> + *  BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
> + *  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> + * If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
> + * then the system fails to resume after a suspend.
> + *
> + * Recently (2019) some systems have shown-up with EFI memmap MMIO entries
> + * covering the entire ACPI provided PCI bridge window. These memmap entries
> + * get converted into e820_table entries, causing all attempts to assign
> + * memory to PCI BARs which have not been setup by the BIOS to fail.
> + * For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
> + *  efi: mem63: [MMIO] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
> + *  BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
> + *  pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
> + *  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
> + *  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
> + *
> + * To code below checks if the ACPI provided PCI bridge window is fully
> + * contained within in EFI memmap MMIO region and in that case disables
> + * the "exclude E820 reserved addresses" workaround to avoid this issue.
> + */
> +static bool resource_is_efi_mmio_region(const struct resource *res)
> +{
> +	unsigned long long start, end;
> +	efi_memory_desc_t *md;
> +
> +	if (!(res->flags & IORESOURCE_MEM))
> +		return false;
> +
> +	if (!efi_enabled(EFI_MEMMAP))
> +		return false;
> +
> +	for_each_efi_memory_desc(md) {
> +		if (md->type != EFI_MEMORY_MAPPED_IO)
> +			continue;
> +
> +		start = md->phys_addr;
> +		end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> +
> +		if (res->start >= start && res->end <= end)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
>  {
>  	struct acpi_device *device = ci->bridge;
> @@ -300,9 +359,16 @@ static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
>  
>  	status = acpi_pci_probe_root_resources(ci);
>  	if (pci_use_crs) {
> -		resource_list_for_each_entry_safe(entry, tmp, &ci->resources)
> +		resource_list_for_each_entry_safe(entry, tmp, &ci->resources) {
>  			if (resource_is_pcicfg_ioport(entry->res))
>  				resource_list_destroy_entry(entry);
> +			else if (resource_is_efi_mmio_region(entry->res)) {
> +				dev_info(&device->dev,
> +					"host bridge window %pR is marked by EFI as MMIO\n",
> +					entry->res);
> +				pci_use_e820 = false;

This message suggests that marking the host bridge window as MMIO in
EFI is a defect, or at least something unusual and worthy of being
flagged.  But I think it's perfectly legal.

UEFI v2.8, sec 7.2, says EfiMemoryMappedIO means:

  Used by system firmware to request that a memory-mapped IO region be
  mapped by the OS to a virtual address so it can be accessed by EFI
  runtime services.


> +			}
> +		}
>  		return status;
>  	}
>  
> -- 
> 2.35.1
>
Hans de Goede March 3, 2022, 1:29 p.m. UTC | #3
Hi Bjorn,

On 3/3/22 01:40, Bjorn Helgaas wrote:
> On Mon, Feb 28, 2022 at 11:52:59AM +0100, Hans de Goede wrote:
> 
> I know Rafael has already applied this, but I'm still trying to
> understand this because it looks like a very complicated maintenance
> problem.
> 
>> Some fw has a bug where the PCI bridge window returned by the ACPI
>> resources partly overlaps with some other address range, causing issues.
>> To workaround this Linux excludes E820 reserved addresses when allocating
>> addresses from the PCI bridge window. 2 known examples of such fw bugs are:
>>
>> 1. The returned window contains addresses which map to system RAM,
>> see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating
>> address space").
> 
> Bug report is https://bugzilla.kernel.org/show_bug.cgi?id=16228
> First dmesg log https://bugzilla.kernel.org/attachment.cgi?id=26811
> shows:
> 
>   BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
>   pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
>   pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
>   pci 0000:00:1f.2: BAR 5: assigned [mem 0xbff00000-0xbff007ff]
>   ahci 0000:00:1f.2: controller reset failed (0xffffffff)
>   ahci 0000:00:1f.2: failed to stop engine (-5)
> 
> The problem is that _CRS advertises [mem 0xbff00000-0xdfffffff], and
> we assigned [mem 0xbff00000-0xbff007ff] to 00:1f.2, but
> 0xbff00000-0xbfffffff is not usable for PCI devices.  My guess is that
> it contains host bridge registers, but all we really know is that it
> doesn't work.
> 
> I think the _CRS that includes non-usable space is clearly a BIOS
> defect.
> 
> The fix from 4dc2287c1805 was to avoid that region based on the
> 0xbfe4dc00-0xc0000000 E820 entry, and the result is in
> https://bugzilla.kernel.org/attachment.cgi?id=30662:
> 
>   BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
>   pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xf7ffffff]
>   pci_root PNP0A03:00: host bridge window [mem 0xff980000-0xff980fff]
>   pci 0000:00:1f.2: reg 24: [mem 0xff970000-0xff9707ff]   # BAR 5
>   pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
>   pci 0000:00:1f.2: BAR 5: assigned [mem 0xff980800-0xff980fff]
> 
> The patch below doesn't affect this workaround.
> 
>> 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
>> reserved range and the ACPI provided PCI bridge window:
>>  efi: mem46: [MMIO] range=              [0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>>  BIOS-e820:                         [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
>>  pci_bus 0000:00: root bus resource [mem         0xdfa00000-        0xfebfffff window]
>> If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
>> then the system fails to resume after a suspend.
> 
> I think this is from https://bugzilla.redhat.com/show_bug.cgi?id=2029207

Correct.

> If I understand correctly, the log in comment 23 from Ivan
> (https://bugzilla.redhat.com/attachment.cgi?id=1859801) is a
> case where resume doesn't work:
> 
>   BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
> 
> And the log in comment 38, also from Ivan,
> (https://bugzilla.redhat.com/attachment.cgi?id=1861539) is a case
> where resume *does* work:
> 
>   BIOS-e820: [mem 0x00000000dcf00000-0x00000000dfa0ffff] reserved
>   efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff]
> 
> _CRS advertises [mem 0xdfa00000-0xfebfffff], but when we assign the
> 0xdfa00000-0xdfafffff region to the 00:1c.0 MMIO window, resume fails.

Correct this "when we assign the 0xdfa00000-0xdfafffff region to
the 00:1c.0 MMIO window, resume fails" is my conclusion too.

> I don't see a theory about what the root cause is.

Correct, because I have no theory, I merely observed that:

1. 0xdfa00000-0xdfa0ffff (note the 0 between the a and ffff)
is reserved in the E820 ranges (this was first seen with
classic BIOS boot, but also reproduces with EFI bootiung)

2. If the kernel assigns PCI resources to it despite it being
reserved the reporter reports resume being broken.

I indeed do not know the exact cause of this.

> It's possible this
> is similar to case 1 above, where _CRS is defective.

I definitely consider this another case of _CRS being defective
and it is similar to 1 in that regards, yes.

> But here we're
> only assigning the 00:1c.0 MMIO window, and 00:1c.0 is a bridge to
> [bus 02], and there are no devices on bus 02.  There should be no
> transactions that use that MMIO window, so it's not clear why this
> should matter.
> 
> If this is a _CRS defect similar to case 1, it's possible there are
> host bridge registers in 0xdfa00000-0xdfafffff, and BIOS might use
> those during resume, and assigning that area to the 00:1c.0 MMIO
> window might interfere with that.  But that's a lot of speculation.
> 
> If you have a good Lenovo contact, they might be able to confirm or
> deny this.
> 
> I'm missing some things here that should be obvious; can you help me
> out?
> 
>   - Why did the 4dc2287c1805 workaround not apply here?  The E820
>     region overlaps the _CRS window just like in case 1, so why did we
>     assign 0xdfa00000?

The 4dc2287c1805 workaround does apply here. The problem is that
my earlier patch which Rafael merged to not honor e820 reservations
on newer BIOS-es broke suspend/resume since that patch disables
the 4dc2287c1805 workaround on this machine.

The comment tries to mention this issue as a second case of
why the 4dc2287c1805 workaround is necessary.

The goal of this part of the comment is to explain why we
cannot just disable the 4dc2287c1805 workaround.

>   - How does this patch work around the problem?  This patch checks if
>     a host bridge window is completely contained in an EFI MMIO
>     region, but I don't see such an EFI region here.

It does not help with case 2 from the comment, since that
case is already effectively fixed by the 4dc2287c1805 workaround.

This patch is a replacement for my earlier fix to disable
the 4dc2287c1805 workaround on newer systems. Since that patch
was causing the regression reported in the Red Hat 2029207
the patch to disable the 4dc2287c1805 workaround on newer systems
has been reverted.

So this is yet another attempt to fix the case where
the PCI subsystem cannot assign resources to some PCI
devices because the entire root bridge window is covered
by an EFI memtable MMIO entry, which gets translated into
an e820_table reserved entry.

What this patch does is identify systems which have the
problem of the entire root bridge window being covered
by an EFI memtable MMIO entry and only disable the
4dc2287c1805 workaround there.

The idea here is to only disable arch_remove_reservations()
taking e820 reservations into account on as narrow a set of
systems as possible.

Keeping it enabled on all other systems, including recent
systems. Since recent systems also benefit from the
4dc2287c1805 workaround as the regression from my patch
to disable it on recent systems has shown us.

I hope this helps understand things better.

Regards,

Hans





>> Recently (2019) some systems have shown-up with EFI memmap MMIO entries
>> covering the entire ACPI provided PCI bridge window. These memmap entries
>> get converted into e820_table entries, causing all attempts to assign
>> memory to PCI BARs which have not been setup by the BIOS to fail.
>> For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
>>  efi: mem63: [MMIO] range=              [0x0000000065400000-0x00000000cfffffff] (1708MB)
>>  BIOS-e820:                         [mem 0x000000004bc50000-0x00000000cfffffff] reserved
>>  pci_bus 0000:00: root bus resource [mem         0x65400000-        0xbfffffff window]
>>  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>>  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
>>
>> To fix this, check if the ACPI provided PCI bridge window is fully
>> contained within in EFI memmap MMIO region and in that case disable
>> the "exclude E820 reserved addresses" workaround, fixing the problem
>> of not being able to find free space for unassigned BARs.
>>
>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=206459
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1868899
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1871793
>> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2029207
>> BugLink: https://bugs.launchpad.net/bugs/1878279
>> BugLink: https://bugs.launchpad.net/bugs/1931715
>> BugLink: https://bugs.launchpad.net/bugs/1932069
>> BugLink: https://bugs.launchpad.net/bugs/1921649
>> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
>> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
>> ---
>> Changes in v2:
>> - Add a couple of missing includes to arch/x86/include/asm/pci_x86.h
>>   to fix i386 build errors Reported-by: kernel test robot <lkp@intel.com>
>> - Do not call resource_is_efi_mmio_region() on resource-list entries which
>>   have just been destroyed because they match resource_is_pcicfg_ioport()
>>   Reported-by: kernel test robot <oliver.sang@intel.com>
>> - Add (res->flags & IORESOURCE_MEM) check to resource_is_efi_mmio_region()
>> ---
>>  arch/x86/include/asm/pci_x86.h | 10 +++++
>>  arch/x86/kernel/resource.c     |  4 ++
>>  arch/x86/pci/acpi.c            | 68 +++++++++++++++++++++++++++++++++-
>>  3 files changed, 81 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
>> index 490411dba438..4ce61ab01a4f 100644
>> --- a/arch/x86/include/asm/pci_x86.h
>> +++ b/arch/x86/include/asm/pci_x86.h
>> @@ -5,7 +5,9 @@
>>   *	(c) 1999 Martin Mares <mj@ucw.cz>
>>   */
>>  
>> +#include <linux/init.h>
>>  #include <linux/ioport.h>
>> +#include <linux/spinlock.h>
>>  
>>  #undef DEBUG
>>  
>> @@ -64,6 +66,8 @@ void pcibios_scan_specific_bus(int busn);
>>  
>>  /* pci-irq.c */
>>  
>> +struct pci_dev;
> 
> Both the above look like fixes to an unrelated asm/pci_x86.h
> problem that happened to be exposed by including it in
> arch/x86/kernel/resource.c.  Most users of asm/pci_x86.h
> include linux/pci.h first, which covers up the problem.
> 
>>  struct irq_info {
>>  	u8 bus, devfn;			/* Bus, device and function */
>>  	struct {
>> @@ -232,3 +236,9 @@ static inline void mmio_config_writel(void __iomem *pos, u32 val)
>>  # define x86_default_pci_init_irq	NULL
>>  # define x86_default_pci_fixup_irqs	NULL
>>  #endif
>> +
>> +#if defined CONFIG_PCI && defined CONFIG_ACPI
>> +extern bool pci_use_e820;
>> +#else
>> +#define pci_use_e820 true
>> +#endif
>> diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
>> index 9b9fb7882c20..e8dc9bc327bd 100644
>> --- a/arch/x86/kernel/resource.c
>> +++ b/arch/x86/kernel/resource.c
>> @@ -1,6 +1,7 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include <linux/ioport.h>
>>  #include <asm/e820/api.h>
>> +#include <asm/pci_x86.h>
>>  
>>  static void resource_clip(struct resource *res, resource_size_t start,
>>  			  resource_size_t end)
>> @@ -28,6 +29,9 @@ static void remove_e820_regions(struct resource *avail)
>>  	int i;
>>  	struct e820_entry *entry;
>>  
>> +	if (!pci_use_e820)
>> +		return;
>> +
>>  	for (i = 0; i < e820_table->nr_entries; i++) {
>>  		entry = &e820_table->entries[i];
>>  
>> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
>> index 052f1d78a562..fce05e03ba9e 100644
>> --- a/arch/x86/pci/acpi.c
>> +++ b/arch/x86/pci/acpi.c
>> @@ -1,4 +1,5 @@
>>  // SPDX-License-Identifier: GPL-2.0
>> +#include <linux/efi.h>
>>  #include <linux/pci.h>
>>  #include <linux/acpi.h>
>>  #include <linux/init.h>
>> @@ -21,6 +22,7 @@ struct pci_root_info {
>>  
>>  static bool pci_use_crs = true;
>>  static bool pci_ignore_seg;
>> +bool pci_use_e820 = true;
>>  
>>  static int __init set_use_crs(const struct dmi_system_id *id)
>>  {
>> @@ -291,6 +293,63 @@ static bool resource_is_pcicfg_ioport(struct resource *res)
>>  		res->start == 0xCF8 && res->end == 0xCFF;
>>  }
>>  
>> +/*
>> + * Some fw has a bug where the PCI bridge window returned by the ACPI resources
>> + * partly overlaps with some other address range, causing issues. To workaround
>> + * this Linux excludes E820 reserved addresses when allocating addresses from
>> + * the PCI bridge window. 2 known examples of such firmware bugs are:
>> + *
>> + * 1. The returned window contains addresses which map to system RAM, see
>> + * commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
>> + *
>> + * 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
>> + * reserved range and the ACPI provided PCI bridge window:
>> + *  efi: mem46: [MMIO] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
>> + *  BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
>> + *  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
>> + * If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
>> + * then the system fails to resume after a suspend.
>> + *
>> + * Recently (2019) some systems have shown-up with EFI memmap MMIO entries
>> + * covering the entire ACPI provided PCI bridge window. These memmap entries
>> + * get converted into e820_table entries, causing all attempts to assign
>> + * memory to PCI BARs which have not been setup by the BIOS to fail.
>> + * For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
>> + *  efi: mem63: [MMIO] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
>> + *  BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
>> + *  pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
>> + *  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
>> + *  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
>> + *
>> + * To code below checks if the ACPI provided PCI bridge window is fully
>> + * contained within in EFI memmap MMIO region and in that case disables
>> + * the "exclude E820 reserved addresses" workaround to avoid this issue.
>> + */
>> +static bool resource_is_efi_mmio_region(const struct resource *res)
>> +{
>> +	unsigned long long start, end;
>> +	efi_memory_desc_t *md;
>> +
>> +	if (!(res->flags & IORESOURCE_MEM))
>> +		return false;
>> +
>> +	if (!efi_enabled(EFI_MEMMAP))
>> +		return false;
>> +
>> +	for_each_efi_memory_desc(md) {
>> +		if (md->type != EFI_MEMORY_MAPPED_IO)
>> +			continue;
>> +
>> +		start = md->phys_addr;
>> +		end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
>> +
>> +		if (res->start >= start && res->end <= end)
>> +			return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>>  static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
>>  {
>>  	struct acpi_device *device = ci->bridge;
>> @@ -300,9 +359,16 @@ static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
>>  
>>  	status = acpi_pci_probe_root_resources(ci);
>>  	if (pci_use_crs) {
>> -		resource_list_for_each_entry_safe(entry, tmp, &ci->resources)
>> +		resource_list_for_each_entry_safe(entry, tmp, &ci->resources) {
>>  			if (resource_is_pcicfg_ioport(entry->res))
>>  				resource_list_destroy_entry(entry);
>> +			else if (resource_is_efi_mmio_region(entry->res)) {
>> +				dev_info(&device->dev,
>> +					"host bridge window %pR is marked by EFI as MMIO\n",
>> +					entry->res);
>> +				pci_use_e820 = false;
> 
> This message suggests that marking the host bridge window as MMIO in
> EFI is a defect, or at least something unusual and worthy of being
> flagged.  But I think it's perfectly legal.
> 
> UEFI v2.8, sec 7.2, says EfiMemoryMappedIO means:
> 
>   Used by system firmware to request that a memory-mapped IO region be
>   mapped by the OS to a virtual address so it can be accessed by EFI
>   runtime services.
> 
> 
>> +			}
>> +		}
>>  		return status;
>>  	}
>>  
>> -- 
>> 2.35.1
>>
>
Bjorn Helgaas March 4, 2022, 3:21 a.m. UTC | #4
On Thu, Mar 03, 2022 at 02:29:30PM +0100, Hans de Goede wrote:
> On 3/3/22 01:40, Bjorn Helgaas wrote:
> > On Mon, Feb 28, 2022 at 11:52:59AM +0100, Hans de Goede wrote:
> > 
> > I know Rafael has already applied this, but I'm still trying to
> > understand this because it looks like a very complicated maintenance
> > problem.
> > 
> >> Some fw has a bug where the PCI bridge window returned by the ACPI
> >> resources partly overlaps with some other address range, causing issues.
> >> To workaround this Linux excludes E820 reserved addresses when allocating
> >> addresses from the PCI bridge window. 2 known examples of such fw bugs are:
> >>
> >> 1. The returned window contains addresses which map to system RAM,
> >> see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating
> >> address space").
> > 
> > Bug report is https://bugzilla.kernel.org/show_bug.cgi?id=16228
> > First dmesg log https://bugzilla.kernel.org/attachment.cgi?id=26811
> > shows:
> > 
> >   BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
> >   pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
> >   pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
> >   pci 0000:00:1f.2: BAR 5: assigned [mem 0xbff00000-0xbff007ff]
> >   ahci 0000:00:1f.2: controller reset failed (0xffffffff)
> >   ahci 0000:00:1f.2: failed to stop engine (-5)
> > 
> > The problem is that _CRS advertises [mem 0xbff00000-0xdfffffff], and
> > we assigned [mem 0xbff00000-0xbff007ff] to 00:1f.2, but
> > 0xbff00000-0xbfffffff is not usable for PCI devices.  My guess is that
> > it contains host bridge registers, but all we really know is that it
> > doesn't work.
> > 
> > I think the _CRS that includes non-usable space is clearly a BIOS
> > defect.
> > 
> > The fix from 4dc2287c1805 was to avoid that region based on the
> > 0xbfe4dc00-0xc0000000 E820 entry, and the result is in
> > https://bugzilla.kernel.org/attachment.cgi?id=30662:
> > 
> >   BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
> >   pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xf7ffffff]
> >   pci_root PNP0A03:00: host bridge window [mem 0xff980000-0xff980fff]
> >   pci 0000:00:1f.2: reg 24: [mem 0xff970000-0xff9707ff]   # BAR 5
> >   pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
> >   pci 0000:00:1f.2: BAR 5: assigned [mem 0xff980800-0xff980fff]
> > 
> > The patch below doesn't affect this workaround.
> > 
> >> 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
> >> reserved range and the ACPI provided PCI bridge window:
> >>  efi: mem46: [MMIO] range=              [0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> >>  BIOS-e820:                         [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
> >>  pci_bus 0000:00: root bus resource [mem         0xdfa00000-        0xfebfffff window]
> >> If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
> >> then the system fails to resume after a suspend.
> > 
> > I think this is from https://bugzilla.redhat.com/show_bug.cgi?id=2029207
> 
> Correct.
> 
> > If I understand correctly, the log in comment 23 from Ivan
> > (https://bugzilla.redhat.com/attachment.cgi?id=1859801) is a
> > case where resume doesn't work:
> > 
> >   BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
> >   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> >   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
> > 
> > And the log in comment 38, also from Ivan,
> > (https://bugzilla.redhat.com/attachment.cgi?id=1861539) is a case
> > where resume *does* work:
> > 
> >   BIOS-e820: [mem 0x00000000dcf00000-0x00000000dfa0ffff] reserved
> >   efi: mem46: [MMIO        |RUN|  |  |  |  |  |  |  |  |   |  |  |  |  ] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
> >   pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
> >   pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff]
> > 
> > _CRS advertises [mem 0xdfa00000-0xfebfffff], but when we assign the
> > 0xdfa00000-0xdfafffff region to the 00:1c.0 MMIO window, resume fails.
> 
> Correct this "when we assign the 0xdfa00000-0xdfafffff region to
> the 00:1c.0 MMIO window, resume fails" is my conclusion too.
> 
> > I don't see a theory about what the root cause is.
> 
> Correct, because I have no theory, I merely observed that:
> 
> 1. 0xdfa00000-0xdfa0ffff (note the 0 between the a and ffff)
> is reserved in the E820 ranges (this was first seen with
> classic BIOS boot, but also reproduces with EFI bootiung)
> 
> 2. If the kernel assigns PCI resources to it despite it being
> reserved the reporter reports resume being broken.
> 
> I indeed do not know the exact cause of this.
> 
> > It's possible this
> > is similar to case 1 above, where _CRS is defective.
> 
> I definitely consider this another case of _CRS being defective
> and it is similar to 1 in that regards, yes.
> 
> > But here we're
> > only assigning the 00:1c.0 MMIO window, and 00:1c.0 is a bridge to
> > [bus 02], and there are no devices on bus 02.  There should be no
> > transactions that use that MMIO window, so it's not clear why this
> > should matter.
> > 
> > If this is a _CRS defect similar to case 1, it's possible there are
> > host bridge registers in 0xdfa00000-0xdfafffff, and BIOS might use
> > those during resume, and assigning that area to the 00:1c.0 MMIO
> > window might interfere with that.  But that's a lot of speculation.
> > 
> > If you have a good Lenovo contact, they might be able to confirm or
> > deny this.
> > 
> > I'm missing some things here that should be obvious; can you help me
> > out?
> > 
> >   - Why did the 4dc2287c1805 workaround not apply here?  The E820
> >     region overlaps the _CRS window just like in case 1, so why did we
> >     assign 0xdfa00000?
> 
> The 4dc2287c1805 workaround does apply here. The problem is that
> my earlier patch which Rafael merged to not honor e820 reservations
> on newer BIOS-es broke suspend/resume since that patch disables
> the 4dc2287c1805 workaround on this machine.

Ah!  That makes sense now, thank you!  So this is merely an example of
a newer machine for which 4dc2287c1805 is important.  It's not a case
where *this* patch is relevant.

> The comment tries to mention this issue as a second case of
> why the 4dc2287c1805 workaround is necessary.
> 
> The goal of this part of the comment is to explain why we
> cannot just disable the 4dc2287c1805 workaround.
> 
> >   - How does this patch work around the problem?  This patch checks if
> >     a host bridge window is completely contained in an EFI MMIO
> >     region, but I don't see such an EFI region here.
> 
> It does not help with case 2 from the comment, since that
> case is already effectively fixed by the 4dc2287c1805 workaround.
> 
> This patch is a replacement for my earlier fix to disable
> the 4dc2287c1805 workaround on newer systems. Since that patch
> was causing the regression reported in the Red Hat 2029207
> the patch to disable the 4dc2287c1805 workaround on newer systems
> has been reverted.
> 
> So this is yet another attempt to fix the case where
> the PCI subsystem cannot assign resources to some PCI
> devices because the entire root bridge window is covered
> by an EFI memtable MMIO entry, which gets translated into
> an e820_table reserved entry.
> 
> What this patch does is identify systems which have the
> problem of the entire root bridge window being covered
> by an EFI memtable MMIO entry and only disable the
> 4dc2287c1805 workaround there.

I think having the entire root bridge window covered by an EFI MMIO
entry is not actually a problem per spec.  My previous response to the
log message applies:

> > This message suggests that marking the host bridge window as MMIO in
> > EFI is a defect, or at least something unusual and worthy of being
> > flagged.  But I think it's perfectly legal.
> > 
> > UEFI v2.8, sec 7.2, says EfiMemoryMappedIO means:
> > 
> >   Used by system firmware to request that a memory-mapped IO region be
> >   mapped by the OS to a virtual address so it can be accessed by EFI
> >   runtime services.

The problem is only that Linux doesn't handle this correctly because
of 4dc2287c1805, which makes it think the entire window is
unavailable.

> The idea here is to only disable arch_remove_reservations()
> taking e820 reservations into account on as narrow a set of
> systems as possible.
> 
> Keeping it enabled on all other systems, including recent
> systems. Since recent systems also benefit from the
> 4dc2287c1805 workaround as the regression from my patch
> to disable it on recent systems has shown us.
> 
> I hope this helps understand things better.

It does, it was very helpful, thanks!

I have an alternate proposal which does basically the same thing
(skipping the 4dc2287c1805 workaround when an E820 entry covers the
entire window), but I think it's slightly simpler.

And more importantly, it adds a little bit of logging and updates the
window itself instead of clipping on every alloc, which I should have
done in 4dc2287c1805 from the beginning, and should make issues easier
to debug.

I'll post it shortly.

Bjorn
diff mbox series

Patch

diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index 490411dba438..4ce61ab01a4f 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -5,7 +5,9 @@ 
  *	(c) 1999 Martin Mares <mj@ucw.cz>
  */
 
+#include <linux/init.h>
 #include <linux/ioport.h>
+#include <linux/spinlock.h>
 
 #undef DEBUG
 
@@ -64,6 +66,8 @@  void pcibios_scan_specific_bus(int busn);
 
 /* pci-irq.c */
 
+struct pci_dev;
+
 struct irq_info {
 	u8 bus, devfn;			/* Bus, device and function */
 	struct {
@@ -232,3 +236,9 @@  static inline void mmio_config_writel(void __iomem *pos, u32 val)
 # define x86_default_pci_init_irq	NULL
 # define x86_default_pci_fixup_irqs	NULL
 #endif
+
+#if defined CONFIG_PCI && defined CONFIG_ACPI
+extern bool pci_use_e820;
+#else
+#define pci_use_e820 true
+#endif
diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
index 9b9fb7882c20..e8dc9bc327bd 100644
--- a/arch/x86/kernel/resource.c
+++ b/arch/x86/kernel/resource.c
@@ -1,6 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/ioport.h>
 #include <asm/e820/api.h>
+#include <asm/pci_x86.h>
 
 static void resource_clip(struct resource *res, resource_size_t start,
 			  resource_size_t end)
@@ -28,6 +29,9 @@  static void remove_e820_regions(struct resource *avail)
 	int i;
 	struct e820_entry *entry;
 
+	if (!pci_use_e820)
+		return;
+
 	for (i = 0; i < e820_table->nr_entries; i++) {
 		entry = &e820_table->entries[i];
 
diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index 052f1d78a562..fce05e03ba9e 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -1,4 +1,5 @@ 
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/efi.h>
 #include <linux/pci.h>
 #include <linux/acpi.h>
 #include <linux/init.h>
@@ -21,6 +22,7 @@  struct pci_root_info {
 
 static bool pci_use_crs = true;
 static bool pci_ignore_seg;
+bool pci_use_e820 = true;
 
 static int __init set_use_crs(const struct dmi_system_id *id)
 {
@@ -291,6 +293,63 @@  static bool resource_is_pcicfg_ioport(struct resource *res)
 		res->start == 0xCF8 && res->end == 0xCFF;
 }
 
+/*
+ * Some fw has a bug where the PCI bridge window returned by the ACPI resources
+ * partly overlaps with some other address range, causing issues. To workaround
+ * this Linux excludes E820 reserved addresses when allocating addresses from
+ * the PCI bridge window. 2 known examples of such firmware bugs are:
+ *
+ * 1. The returned window contains addresses which map to system RAM, see
+ * commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
+ *
+ * 2. The Lenovo X1 carbon gen 2 BIOS has an overlap between an EFI/E820
+ * reserved range and the ACPI provided PCI bridge window:
+ *  efi: mem46: [MMIO] range=[0x00000000dfa00000-0x00000000dfa0ffff] (0MB)
+ *  BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved
+ *  pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]
+ * If Linux assigns the overlapping 0xdfa00000-0xdfa0ffff range to a PCI BAR
+ * then the system fails to resume after a suspend.
+ *
+ * Recently (2019) some systems have shown-up with EFI memmap MMIO entries
+ * covering the entire ACPI provided PCI bridge window. These memmap entries
+ * get converted into e820_table entries, causing all attempts to assign
+ * memory to PCI BARs which have not been setup by the BIOS to fail.
+ * For example see these dmesg snippets from a Lenovo IdeaPad 3 15IIL 81WE:
+ *  efi: mem63: [MMIO] range=[0x0000000065400000-0x00000000cfffffff] (1708MB)
+ *  BIOS-e820: [mem 0x000000004bc50000-0x00000000cfffffff] reserved
+ *  pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window]
+ *  pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit]
+ *  pci 0000:00:15.0: BAR 0: failed to assign [mem size 0x00001000 64bit]
+ *
+ * To code below checks if the ACPI provided PCI bridge window is fully
+ * contained within in EFI memmap MMIO region and in that case disables
+ * the "exclude E820 reserved addresses" workaround to avoid this issue.
+ */
+static bool resource_is_efi_mmio_region(const struct resource *res)
+{
+	unsigned long long start, end;
+	efi_memory_desc_t *md;
+
+	if (!(res->flags & IORESOURCE_MEM))
+		return false;
+
+	if (!efi_enabled(EFI_MEMMAP))
+		return false;
+
+	for_each_efi_memory_desc(md) {
+		if (md->type != EFI_MEMORY_MAPPED_IO)
+			continue;
+
+		start = md->phys_addr;
+		end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
+
+		if (res->start >= start && res->end <= end)
+			return true;
+	}
+
+	return false;
+}
+
 static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
 {
 	struct acpi_device *device = ci->bridge;
@@ -300,9 +359,16 @@  static int pci_acpi_root_prepare_resources(struct acpi_pci_root_info *ci)
 
 	status = acpi_pci_probe_root_resources(ci);
 	if (pci_use_crs) {
-		resource_list_for_each_entry_safe(entry, tmp, &ci->resources)
+		resource_list_for_each_entry_safe(entry, tmp, &ci->resources) {
 			if (resource_is_pcicfg_ioport(entry->res))
 				resource_list_destroy_entry(entry);
+			else if (resource_is_efi_mmio_region(entry->res)) {
+				dev_info(&device->dev,
+					"host bridge window %pR is marked by EFI as MMIO\n",
+					entry->res);
+				pci_use_e820 = false;
+			}
+		}
 		return status;
 	}