
[v9,4/5] x86/PCI: Enable a 64bit BAR on AMD Family 15h (Models 30h-3fh) Processors v5

Message ID 0ab28334-3ccc-2a93-9991-396ee01a3662@amd.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Christian König Nov. 28, 2017, 9:12 a.m. UTC
Am 27.11.2017 um 19:30 schrieb Boris Ostrovsky:
> On 11/23/2017 09:12 AM, Boris Ostrovsky wrote:
>>
>> On 11/23/2017 03:11 AM, Christian König wrote:
>>> Am 22.11.2017 um 18:27 schrieb Boris Ostrovsky:
>>>> On 11/22/2017 11:54 AM, Christian König wrote:
>>>>> Am 22.11.2017 um 17:24 schrieb Boris Ostrovsky:
>>>>>> On 11/22/2017 05:09 AM, Christian König wrote:
>>>>>>> Am 21.11.2017 um 23:26 schrieb Boris Ostrovsky:
>>>>>>>> On 11/21/2017 08:34 AM, Christian König wrote:
>>>>>>>>> Hi Boris,
>>>>>>>>>
>>>>>>>>> attached are two patches.
>>>>>>>>>
>>>>>>>>> The first one is a trivial fix for the infinite loop issue, it now
>>>>>>>>> correctly aborts the fixup when it can't find address space for
>>>>>>>>> the
>>>>>>>>> root window.
>>>>>>>>>
>>>>>>>>> The second is a workaround for your board. It simply checks if
>>>>>>>>> there
>>>>>>>>> is exactly one Processor Function to apply this fix on.
>>>>>>>>>
>>>>>>>>> Both are based on linus current master branch. Please test if they
>>>>>>>>> fix
>>>>>>>>> your issue.
>>>>>>>> Yes, they do fix it but that's because the feature is disabled.
>>>>>>>>
>>>>>>>> Do you know what the actual problem was (on Xen)?
>>>>>>> I still haven't understood what you actually did with Xen.
>>>>>>>
>>>>>>> When you used PCI pass through with those devices then you have
>>>>>>> made a
>>>>>>> major configuration error.
>>>>>>>
>>>>>>> When the problem happened on dom0 then the explanation is most
>>>>>>> likely
>>>>>>> that some PCI device ended up in the configured space, but the
>>>>>>> routing
>>>>>>> was only setup correctly on one CPU socket.
>>>>>> The problem is that dom0 can be (and was in my case) booted with
>>>>>> less than full physical memory, and so the "rest" of the host
>>>>>> memory is not necessarily reflected in iomem. Your patch then
>>>>>> tried to configure that memory for MMIO and the system hung.
>>>>>>
>>>>>> And so my guess is that this patch will break dom0 on a single-socket
>>>>>> system as well.
>>>>> Oh, thanks!
>>>>>
>>>>> I've thought about that possibility before, but wasn't able to find a
>>>>> system which actually does that.
>>>>>
>>>>> May I ask why the rest of the memory isn't reported to the OS?
>>>> That memory doesn't belong to the OS (dom0), it is owned by the
>>>> hypervisor.
>>>>
>>>>> Sounds like I can't trust Linux resource management and probably need
>>>>> to read the DRAM config to figure things out after all.
>>>> My question is whether what you are trying to do should ever be done
>>>> for
>>>> a guest at all (any guest, not necessarily Xen).
>>> The issue is probably that I don't know enough about Xen: What
>>> exactly is dom0? My understanding was that dom0 is the hypervisor,
>>> but that seems to be incorrect.
>>>
>>> The issue is that under no circumstances should a virtualized guest
>>> *EVER* have access to the PCI devices marked as "Processor Function"
>>> on AMD platforms. Otherwise it is trivial to break out of the
>>> virtualization.
>>>
>>> When dom0 is something like the system domain with all hardware
>>> access then the approach seems legitimate, but then the hypervisor
>>> should report the stolen memory to the OS using the e820 table.
>>>
>>> When the hypervisor doesn't do that and the Linux kernel isn't aware
>>> that there is memory at a given location, mapping PCI space there
>>> will obviously crash the hypervisor.
>>>
>>> Possible solutions, as far as I can see, are either disabling this
>>> feature when we detect that we are running as a Xen dom0, scanning
>>> the DRAM settings to update Linux resource handling, or fixing Xen
>>> to report the stolen memory to the dom0 OS as reserved.
>>>
>>> Opinions?
>> You are right, these functions are not exposed to a regular guest.
>>
>> I think for dom0 (which is a special Xen guest, with additional
>> privileges) we may be able to add a reserved e820 region for host
>> memory that is not assigned to dom0. Let me try it on Monday (I am out
>> until then).
>
> One thing I realized while looking at a solution for Xen dom0 is that this
> patch may cause problems for memory hotplug.

Good point. My assumption was that a BIOS which can handle memory
hotplug is also a BIOS which doesn't need this fixup. But I've never
validated that assumption.

> What happens if new memory
> is added to the system and we have everything above current memory set
> to MMIO?

In theory the BIOS would search for address space and wouldn't find
anything, so the hotplug operation should fail even before it reaches 
the kernel in the first place.

In practice I think that nobody ever tested if that works correctly. So 
I'm pretty sure the system would just crash.

How about the attached patch? It limits the newly added MMIO space to
the upper 256GB of the address space. That should still be enough for
most devices, but we avoid both the issue with Xen dom0 and, most
likely, the problems with memory hotplug as well.
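
Just as a sanity check of the 256GB figure, the window from the attached
patch works out as follows (standalone user-space sketch, not kernel code):

/* Sanity check of the root window size used in the attached patch. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t start = 0xbd00000000ull;	/* new res->start */
	uint64_t end   = 0xfd00000000ull - 1;	/* unchanged res->end */

	/* 0xfd00000000 - 0xbd00000000 = 0x4000000000 bytes = 256 GiB */
	printf("window size: %llu GiB\n",
	       (unsigned long long)((end - start + 1) >> 30));
	return 0;
}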

Christian.

>
> -boris
>

Comments

Jan Beulich Nov. 28, 2017, 9:46 a.m. UTC | #1
>>> On 28.11.17 at 10:12, <christian.koenig@amd.com> wrote:
> In theory the BIOS would search for address space and wouldn't find 
> anything, so the hotplug operation should fail even before it reaches 
> the kernel in the first place.

How would the BIOS know what the OS does or plans to do? I think
it's the other way around - the OS needs to avoid using any regions
for MMIO which are marked as hotpluggable in SRAT. Since there is
no vNUMA yet for Xen Dom0, that would need special handling.
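
Roughly, the check I have in mind would iterate over the SRAT memory
affinity entries, something like the sketch below (ACPICA structure and
flag names; purely illustrative, not existing kernel code):

#include <linux/acpi.h>

/*
 * Sketch: does a candidate MMIO window [start, end) overlap an SRAT
 * memory affinity entry that is marked hot-pluggable?  If so, the OS
 * should not place MMIO there.
 */
static bool mmio_overlaps_hotplug_srat(struct acpi_srat_mem_affinity *ma,
				       u64 start, u64 end)
{
	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
		return false;
	if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE))
		return false;

	return ma->base_address < end &&
	       ma->base_address + ma->length > start;
}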

Jan
Christian König Nov. 28, 2017, 10:17 a.m. UTC | #2
Am 28.11.2017 um 10:46 schrieb Jan Beulich:
>>>> On 28.11.17 at 10:12, <christian.koenig@amd.com> wrote:
>> In theory the BIOS would search for address space and wouldn't find
>> anything, so the hotplug operation should fail even before it reaches
>> the kernel in the first place.
> How would the BIOS know what the OS does or plans to do?

As far as I know the ACPI BIOS should work directly with the register 
content.

So when we update the register content to enable the MMIO decoding the 
BIOS should know that as well.

> I think
> it's the other way around - the OS needs to avoid using any regions
> for MMIO which are marked as hotpluggable in SRAT.

I was under the impression that this is exactly what 
acpi_numa_memory_affinity_init() does.

> Since there is
> no vNUMA yet for Xen Dom0, that would need special handling.

I think that the problem is rather that SRAT is NUMA specific and if I'm 
not totally mistaken the content is ignored when NUMA support isn't 
compiled into the kernel.

When Xen steals some memory from Dom0 by hooking itself into the e820 
call then I would say the cleanest way is to report this memory in e820 
as reserved as well. But take that with a grain of salt, I'm seriously 
not a Xen expert.
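
Very roughly, something like the two lines below is what I mean, though
the hook point (somewhere in the Xen dom0 e820 setup) and the two size
values are pure assumptions on my side:

/*
 * Sketch only: mark host memory that is not handed to dom0 as reserved
 * so that later MMIO placement stays away from it.  host_ram_start and
 * host_ram_size are hypothetical values dom0 would need to learn from
 * the hypervisor.
 */
e820__range_add(host_ram_start, host_ram_size, E820_TYPE_RESERVED);
e820__update_table(e820_table);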

Regards,
Christian.

>
> Jan
>
Jan Beulich Nov. 28, 2017, 10:53 a.m. UTC | #3
>>> On 28.11.17 at 11:17, <christian.koenig@amd.com> wrote:
> Am 28.11.2017 um 10:46 schrieb Jan Beulich:
>>>>> On 28.11.17 at 10:12, <christian.koenig@amd.com> wrote:
>>> In theory the BIOS would search for address space and wouldn't find
>>> anything, so the hotplug operation should fail even before it reaches
>>> the kernel in the first place.
>> How would the BIOS know what the OS does or plans to do?
> 
> As far as I know the ACPI BIOS should work directly with the register 
> content.
> 
> So when we update the register content to enable the MMIO decoding the 
> BIOS should know that as well.

I'm afraid I don't follow: During memory hotplug, surely you don't
expect the BIOS to do a PCI bus scan? Plus even if it did, it would
be racy - some device could, at this very moment, have memory
decoding disabled, just for the OS to re-enable it a millisecond
later. Yet looking at BAR values is meaningless when memory
decode of a device is disabled.

>> I think
>> it's the other way around - the OS needs to avoid using any regions
>> for MMIO which are marked as hotpluggable in SRAT.
> 
> I was under the impression that this is exactly what 
> acpi_numa_memory_affinity_init() does.

Perhaps, except that (when I last looked) insufficient state is
(was) being recorded to have that information readily available
at the time MMIO space above 4Gb needs to be allocated for
some device.

>> Since there is
>> no vNUMA yet for Xen Dom0, that would need special handling.
> 
> I think that the problem is rather that SRAT is NUMA specific and if I'm 
> not totally mistaken the content is ignored when NUMA support isn't 
> compiled into the kernel.
> 
> When Xen steals some memory from Dom0 by hooking itself into the e820 
> call then I would say the cleanest way is to report this memory in e820 
> as reserved as well. But take that with a grain of salt, I'm seriously 
> not a Xen expert.

The E820 handling in PV Linux is all fake anyway - there's a single
chunk of memory given to a PV guest (including Dom0), contiguous
in what PV guests know as "physical address space" (not to be
mixed up with "machine address space", which is where MMIO
needs to be allocated from). Xen code in the kernel then mimics
an E820 matching the host one, moving around pieces of memory
in physical address space if necessary.

Since Dom0 knows the machine E820, MMIO allocation shouldn't
need to be much different there from the non-Xen case.

Jan
Christian König Nov. 28, 2017, 11:59 a.m. UTC | #4
Am 28.11.2017 um 11:53 schrieb Jan Beulich:
>>>> On 28.11.17 at 11:17, <christian.koenig@amd.com> wrote:
>> Am 28.11.2017 um 10:46 schrieb Jan Beulich:
>>>>>> On 28.11.17 at 10:12, <christian.koenig@amd.com> wrote:
>>>> In theory the BIOS would search for address space and wouldn't find
>>>> anything, so the hotplug operation should fail even before it reaches
>>>> the kernel in the first place.
>>> How would the BIOS know what the OS does or plans to do?
>> As far as I know the ACPI BIOS should work directly with the register
>> content.
>>
>> So when we update the register content to enable the MMIO decoding the
>> BIOS should know that as well.
> I'm afraid I don't follow: During memory hotplug, surely you don't
> expect the BIOS to do a PCI bus scan? Plus even if it did, it would
> be racy - some device could, at this very moment, have memory
> decoding disabled, just for the OS to re-enable it a millisecond
> later. Yet looking at BAR values is meaningless when memory
> decode of a device is disabled.

No, sorry, you misunderstood me. The PCI bus is not even involved here.

On AMD Family 15h CPUs you have four main types of address space routed
by the NB:
1.  Memory space targeting system DRAM.
2.  Memory space targeting IO (MMIO).
3.  IO space.
4.  Configuration space.

See section "2.8.2 NB Routing" in the BIOS and Kernel Developer’s Guide 
(https://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf).

Long story short, you have fixed addresses for configuration and legacy
IO (VGA for example) and then configurable memory space for DRAM and MMIO.

What the ACPI BIOS does (or at least should do) is take a look at these
registers to find free space during memory hotplug.

Now in theory the MMIO space should be configurable by similar ACPI BIOS
functions, but unfortunately most BIOSes don't enable that functionality
because it can break some older Windows versions.

So what we do here is just what the BIOS should have provided in the 
first place.
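
To make that concrete, here is a rough sketch of how those routing
registers could be inspected (offsets and bit layout taken from the
D18F1 MMIO base/limit registers in the BKDG; illustrative only, this is
not the actual fixup code):

/*
 * Walk the eight D18F1 MMIO base/limit register pairs of the NB
 * (Processor Function 1) and print which MMIO ranges are already
 * routed.  Offsets 0x80/0x84/0x180 follow the BKDG linked above.
 */
static void sketch_dump_nb_mmio_routing(struct pci_dev *nb_f1)
{
	unsigned int i;
	u32 base, limit, high;
	u64 start, end;

	for (i = 0; i < 8; i++) {
		pci_read_config_dword(nb_f1, 0x80 + i * 0x8, &base);
		pci_read_config_dword(nb_f1, 0x84 + i * 0x8, &limit);
		pci_read_config_dword(nb_f1, 0x180 + i * 0x4, &high);

		/* Bits 1:0 are WE/RE; if neither is set the pair is unused. */
		if (!(base & 0x3))
			continue;

		start = ((u64)(high & 0xff) << 40) |
			((u64)(base & 0xffffff00) << 8);
		end   = ((u64)((high >> 16) & 0xff) << 40) |
			((u64)(limit & 0xffffff00) << 8) | 0xffff;

		dev_info(&nb_f1->dev, "MMIO pair %u: %#llx-%#llx\n",
			 i, start, end);
	}
}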

>>> I think
>>> it's the other way around - the OS needs to avoid using any regions
>>> for MMIO which are marked as hotpluggable in SRAT.
>> I was under the impression that this is exactly what
>> acpi_numa_memory_affinity_init() does.
> Perhaps, except that (when I last looked) insufficient state is
> (was) being recorded to have that information readily available
> at the time MMIO space above 4Gb needs to be allocated for
> some device.

That was also my concern, but in the most recent version I'm 
intentionally doing this fixup very late after all the PCI setup is 
already done.

This way the extra address space is only available for devices which are 
added by PCI hotplug or for resizing BARs on the fly (which is the use 
case I'm interested in).
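
(For illustration, running the fixup at the final quirk stage, i.e. after
the core PCI resource assignment has completed, looks roughly like this;
the device ID is just a placeholder for the relevant Processor Function:)

/* Sketch: hook pci_amd_enable_64bit_bar() in at the "final" fixup stage
 * so it only runs after core resource assignment.  0x141b is used here
 * purely as a placeholder device ID. */
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, 0x141b, pci_amd_enable_64bit_bar);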

>>> Since there is
>>> no vNUMA yet for Xen Dom0, that would need special handling.
>> I think that the problem is rather that SRAT is NUMA specific and if I'm
>> not totally mistaken the content is ignored when NUMA support isn't
>> compiled into the kernel.
>>
>> When Xen steals some memory from Dom0 by hooking itself into the e820
>> call then I would say the cleanest way is to report this memory in e820
>> as reserved as well. But take that with a grain of salt, I'm seriously
>> not a Xen expert.
> The E820 handling in PV Linux is all fake anyway - there's a single
> chunk of memory given to a PV guest (including Dom0), contiguous
> in what PV guests know as "physical address space" (not to be
> mixed up with "machine address space", which is where MMIO
> needs to be allocated from). Xen code in the kernel then mimics
> an E820 matching the host one, moving around pieces of memory
> in physical address space if necessary.

Good to know.

> Since Dom0 knows the machine E820, MMIO allocation shouldn't
> need to be much different there from the non-Xen case.

Yes, completely agree.

I think that even if we don't do the MMIO allocation with this fixup,
letting the kernel in Dom0 know which memory/address space regions are
in use is still a good idea.

Otherwise we will run into exactly the same problem when we do the MMIO
allocation with an ACPI method, and that is certainly going to come in
the next BIOS generations because Microsoft is pushing for it.

Regards,
Christian.

>
> Jan
>
Boris Ostrovsky Nov. 28, 2017, 6:55 p.m. UTC | #5
On 11/28/2017 04:12 AM, Christian König wrote:
>
>
> How about the attached patch? It limits the newly added MMIO space to
> the upper 256GB of the address space. That should still be enough for
> most devices, but we avoid both the issue with Xen dom0 and, most
> likely, the problems with memory hotplug as well.

It certainly makes the problem less likely, so I guess we could do this
for now.

Thanks.
-boris

Patch

From 586bd9d67ebb6ca48bd0a6b1bd9203e94093cc8e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Tue, 28 Nov 2017 10:02:35 +0100
Subject: [PATCH] x86/PCI: limit the size of the 64bit BAR to 256GB
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This avoids problems with Xen which hides some memory resources from the
OS and potentially also allows memory hotplug while this fixup is
enabled.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 arch/x86/pci/fixup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index 56c39a7bd080..6dffdee8a2de 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -690,7 +690,7 @@  static void pci_amd_enable_64bit_bar(struct pci_dev *dev)
 	res->name = "PCI Bus 0000:00";
 	res->flags = IORESOURCE_PREFETCH | IORESOURCE_MEM |
 		IORESOURCE_MEM_64 | IORESOURCE_WINDOW;
-	res->start = 0x100000000ull;
+	res->start = 0xbd00000000ull;
 	res->end = 0xfd00000000ull - 1;
 
 	/* Just grab the free area behind system memory for this */
-- 
2.11.0