Message ID | 628e2b58-b16b-5792-b4ef-88bec15ab779@amd.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
[+cc Alex, Dave because this affects the AMDGPU large BAR support]

On Tue, Jan 09, 2018 at 11:37:55AM +0100, Christian König wrote:
> Am 09.01.2018 um 00:23 schrieb Bjorn Helgaas:
> > [+cc Boris, Juergen, linux-pci]
> >
> > On Fri, Jan 5, 2018 at 6:00 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Fri, Jan 5, 2018 at 2:04 PM, Aaro Koskinen <aaro.koskinen@iki.fi> wrote:
> > > > After v4.14, I've been unable to boot my AMD compilation box with the
> > > > v4.15-rc mainline Linux. It just ends up in a silent reboot loop.
> > > >
> > > > I bisected this to:
> > > >
> > > > commit fa564ad9636651fd11ec2c79c48dee844066f73a
> > > > Author: Christian König <christian.koenig@amd.com>
> > > > Date:   Tue Oct 24 14:40:29 2017 -0500
> > > >
> > > >     x86/PCI: Enable a 64bit BAR on AMD Family 15h (Models 00-1f, 30-3f, 60-7f)
> > >
> > > Hmm. That was reported to break boot earlier already.
> > >
> > > The breakage was supposedly fixed by three patches from Christian:
> > >
> > >   a19e2696135e: "x86/PCI: Only enable a 64bit BAR on single-socket AMD
> > > Family 15h"
> > >
> > >   470195f82e4e: "x86/PCI: Fix infinite loop in search for 64bit BAR placement"
> > >
> > > and a third one that was apparently never applied.
> > >
> > > I'm not sure why that third patch was never applied, I'm including it here.
> > >
> > > Does the system work for you if you apply that patch (instead of
> > > reverting all of them)?
> > >
> > > I wonder why that patch wasn't applied, but if it doesn't fix things,
> > > I think we do need to revert it all.
> > >
> > > Christian? Bjorn?
> >
> > I didn't apply the third patch ("x86/PCI: limit the size of the 64bit
> > BAR to 256GB") because (a) we thought it was optional ("just a
> > precaution against eventual problems"), (b) we didn't have a good
> > explanation of why 256GB was the correct number, and (c) it seemed to
> > be a workaround for a Xen issue that we hoped to fix in a better way.
>
> Just for the record, I completely agree on that.
>
> > It does apparently make Aaro's system work, but I still hesitate to
> > apply it because it's magical -- avoiding the address space from
> > 0x1_00000000 to 0xbd_00000000 makes things work, but we don't know
> > why. I assume there's some unreported device in that area, but I
> > don't think we have any real assurance that the
> > 0xbd_00000000-0xfd_00000000 area we now use is any safer.
>
> Well, I know why it's not working. The BIOS is not telling us the
> truth about how much memory is installed.
>
> A device above 4GB would actually be handled correctly by the code
> (see the check when we walk over all the existing IO regions).
>
> I tested a bit with Aaro and came up with the attached patch; it
> adds a 16GB guard between the end of memory and the new window for
> the PCIe root hub. But I agree with you that this is just a hack and
> not a real solution.
>
> > I would feel better about this if we made it opt-in via a kernel
> > parameter and/or some kind of whitelist. I still don't really *like*
> > it, since ACPI does provide a mechanism (_PRS/_SRS) for doing this
> > safely, and we could just say "if you want to use big BARs, the BIOS
> > should enable big windows or at least make them available via ACPI
> > resources." The only problem is that BIOSes don't do that and we
> > don't yet have Linux support for _PRS/_SRS for host bridges.
>
> Well, that is the point I disagree on. When the memory map we get
> from the BIOS is not correct, it makes no difference whether we enable
> the window with the BIOS or by directly programming the hardware.
>
> I will work with Aaro some more to come up with a solution which
> reads the memory map directly from the hardware as well and checks
> if that is valid before doing anything else.

I don't agree with this approach. One goal of UEFI/ACPI is to remove
the need for the OS to directly peek at the hardware for things like
this.

In Aaro's case, the BIOS apparently does not describe all of system
memory in E820. ACPI r5.0, sec 15.1 describes E820, and it does say
that the E820 interface "provides a memory map for all of the
installed RAM, and of physical memory ranges reserved by the BIOS," so
one could argue that Aaro's BIOS is defective because it omits
something.

But I think the intent of the spec is that all non-discoverable
resources available to the OS will be described via E820, UEFI
GetMemoryMap(), ACPI tables, etc. So I think a BIOS engineer would be
on pretty safe ground to say "Mr. OS, we didn't tell you about X, so
you should not use X." And, again with a BIOS engineer's hat on, I
would argue that the BIOS is entitled to use unreported things for its
own purposes and rely on the assumption that the OS will not use them.
So I'm not convinced this is necessarily a BIOS bug.

If we peek at the hardware directly for things the spec can express, I
think we're making the OS less portable and opening the door to
conflicts with the BIOS.

> > I'll prepare a revert as a back-up plan in case we don't come up with
> > a better solution.
>
> Either that or only enable it when pci=add-root-window is given on
> the kernel commandline.

I don't like to add new parameters because I think it's an
unreasonable burden on users and it makes it hard for distros, but I
understand the desire to use this functionality. What would you think
of adding a "pci=big_root_window" parameter that logs an info message
and taints the kernel with TAINT_FIRMWARE_WORKAROUND? Then there's at
least some clue that we're in uncharted territory.

Bjorn
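[As a rough illustration of the opt-in switch Bjorn is proposing here: the sketch below is an assumption, not the actual patches. Only the "pci=big_root_window" option name and the taint flag come from the thread; the variable and function names are invented, and a real implementation would presumably hang off the existing "pci=" option parsing rather than a standalone early_param().]

    /*
     * Minimal sketch of an opt-in "big root window" switch. Names other
     * than the taint API are illustrative and may differ from the real
     * patches.
     */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/pci.h>

    static bool pci_big_root_window;	/* off unless the user opts in */

    static int __init parse_big_root_window(char *arg)
    {
    	pci_big_root_window = true;
    	return 0;
    }
    early_param("pci_big_root_window", parse_big_root_window);

    static void maybe_add_big_root_window(struct pci_dev *dev,
    					  struct resource *res)
    {
    	if (!pci_big_root_window)
    		return;		/* default: leave the firmware setup alone */

    	dev_info(&dev->dev,
    		 "adding root bus resource %pR (window not reported by firmware)\n",
    		 res);
    	/* Leave a persistent hint in bug reports that we went off-spec. */
    	add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
    }

The point of the taint is exactly what Bjorn describes: any oops or bug report from such a system carries a visible marker that the kernel programmed a host bridge window the firmware never described.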
On Tue, Jan 9, 2018 at 2:37 AM, Christian König
<christian.koenig@amd.com> wrote:
>
> I tested a bit with Aaro and came up with the attached patch, it adds a 16GB
> guard between the end of memory and the new window for the PCIe root hub.
> But I agree with you that this is just a hack and not a real solution.

Guys, that last statement makes no sense.

The *real* hack was that original patch that caused problems.

I mean, just look at it. It has a completely made up - and bad -
default start, and then it tries to forcibly just create the maximum
window it possibly can. Well, not quite, but almost.

Now *THAT* is hacky, and fragile, and a bad idea. It's a fundamentally
bad idea exactly because it assumes

 (a) we have perfect knowledge

 (b) that window that wasn't even enabled or configured in the first
place should be the maximum window.

both of those assumptions seem completely bogus, and seem to have no
actual reason.

This comment in that code really does say it all:

        /* Just grab the free area behind system memory for this */

very lackadaisical.

I agree that the 16GB guard is _also_ random, but it's certainly not
less random or hacky.

But I really wonder why you want to be that close to memory at all.
What was wrong with the patch that just put it the hell out of any
normal memory range, and just changed the default start from one
random (and clearly bad) value to _another_ random but at least
out-of-the-way value?

IOW, this change

 -       res->start = 0x100000000ull;
 +       res->start = 0xbd00000000ull;

really has a relatively solid explanation for it: "pick a high address
that is likely out of the way". That's *much* better than "pick an
address that is right after memory".

Now, could there be a better high address to pick? Probably. It would
be really nice to have a *reason* for the address to be picked.

But honestly, even "it doesn't break Aaro's machine" is a better
reason than many, in the absence of other reasons.

For example, was there a reason for that random 756GB address? Is the
limit of the particular AMD 64-bit BAR perhaps at the 1TB mark (and
that "res->end" value is because "close to it, but not at the top")?

So I think "just above RAM" is a _horrible_ default starting point.
The random 16GB guard is _better_, but it honestly doesn't seem any
better than the simpler original patch.

A starting point like "halfway to the hardware limit" would actually
be a better reason. Or just "we picked an end-point, let's pick a
starting point that gives us a _sufficient_ - but not excessive -
window".

Or any number of other possible starting points. Almost _anything_ is
better than "end of RAM".

That "above end of RAM" might be a worst-case fall-back value (and in
fact, I think that _is_ pretty close to what the PCI code uses for the
case of "we don't have any parent at all, so we'll just have to assume
it's a transparent bridge"), but I don't think it's necessarily what
you should _strive_ for.

So the hackiness of whatever the fix is really should be balanced with
the hackiness of the original patch that introduced this problem.

Linus
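[For orientation, the addresses under discussion work out as follows; these are plain unit conversions, not figures stated in the patches themselves.]

    0x0100000000 =    4 GiB   (old default start, right above 32-bit space)
    0xbd00000000 =  756 GiB   (proposed start, the "756GB address")
    0xfd00000000 = 1012 GiB   (window end, just below the 1 TiB mark)

    0xfd00000000 - 0xbd00000000 = 0x4000000000 = 256 GiB of window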
Am 09.01.2018 um 20:18 schrieb Linus Torvalds:
> On Tue, Jan 9, 2018 at 2:37 AM, Christian König
> <christian.koenig@amd.com> wrote:
>> I tested a bit with Aaro and came up with the attached patch, it adds a 16GB
>> guard between the end of memory and the new window for the PCIe root hub.
>> But I agree with you that this is just a hack and not a real solution.
>
> Guys, that last statement makes no sense.
>
> The *real* hack was that original patch that caused problems.
>
> I mean, just look at it. It has a completely made up - and bad -
> default start, and then it tries to forcibly just create the maximum
> window it possibly can. Well, not quite, but almost.
>
> Now *THAT* is hacky, and fragile, and a bad idea. It's a fundamentally
> bad idea exactly because it assumes
>
>  (a) we have perfect knowledge
>
>  (b) that window that wasn't even enabled or configured in the first
> place should be the maximum window.
>
> both of those assumptions seem completely bogus, and seem to have no
> actual reason.
>
> This comment in that code really does say it all:
>
>         /* Just grab the free area behind system memory for this */
>
> very lackadaisical.
>
> I agree that the 16GB guard is _also_ random, but it's certainly not
> less random or hacky.
>
> But I really wonder why you want to be that close to memory at all.

Actually I don't want to be close to the end of memory at all. It's
just what I found a certain other OS is doing and I thought: Hey, that
has the best chance of working...

But yeah, thinking about it I agree with you that this was probably
not a good idea.

> What was wrong with the patch that just put it the hell out of any
> normal memory range, and just changed the default start from one
> random (and clearly bad) value to _another_ random but at least
> out-of-the-way value?

Well, Bjorn didn't like it because I couldn't come up with a good
explanation of why 256GB is a good value in general (it is a good
value for our particular use case).

> IOW, this change
>
>  -       res->start = 0x100000000ull;
>  +       res->start = 0xbd00000000ull;
>
> really has a relatively solid explanation for it: "pick a high address
> that is likely out of the way". That's *much* better than "pick an
> address that is right after memory".
>
> Now, could there be a better high address to pick? Probably. It would
> be really nice to have a *reason* for the address to be picked.
>
> But honestly, even "it doesn't break Aaro's machine" is a better
> reason than many, in the absence of other reasons.
>
> For example, was there a reason for that random 756GB address? Is the
> limit of the particular AMD 64-bit BAR perhaps at the 1TB mark (and
> that "res->end" value is because "close to it, but not at the top")?

That is actually a hardware limit documented in the BIOS and Kernel
Developer's Guide for AMD CPUs
(https://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf).

I should probably add a comment explaining this.

> So I think "just above RAM" is a _horrible_ default starting point.
> The random 16GB guard is _better_, but it honestly doesn't seem any
> better than the simpler original patch.
>
> A starting point like "halfway to the hardware limit" would actually
> be a better reason. Or just "we picked an end-point, let's pick a
> starting point that gives us a _sufficient_ - but not excessive -
> window".

Well, that is exactly what the 256GB patch was doing. So as long as
you are fine with that, I'm perfectly fine with using that one.

Christian.

> Or any number of other possible starting points. Almost _anything_ is
> better than "end of RAM".
>
> That "above end of RAM" might be a worst-case fall-back value (and in
> fact, I think that _is_ pretty close to what the PCI code uses for the
> case of "we don't have any parent at all, so we'll just have to assume
> it's a transparent bridge"), but I don't think it's necessarily what
> you should _strive_ for.
>
> So the hackiness of whatever the fix is really should be balanced with
> the hackiness of the original patch that introduced this problem.
>
> Linus
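[To make the "anchor the end point at the hardware limit and size the window backwards" approach concrete, a sketch along these lines; the constants are the ones discussed in this thread, but amd_pick_big_window() is an invented helper, not real kernel code.]

    #include <linux/ioport.h>

    #define AMD_MMIO_LIMIT		0xfd00000000ull	/* per the BKDG, see above */
    #define BIG_WINDOW_SIZE		0x4000000000ull	/* 256 GiB */

    /* Sketch only: derive the start from the documented limit, not from RAM. */
    static void amd_pick_big_window(struct resource *res)
    {
    	res->end   = AMD_MMIO_LIMIT - 1;
    	res->start = AMD_MMIO_LIMIT - BIG_WINDOW_SIZE;	/* 0xbd00000000 */
    }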
On Tue, Jan 9, 2018 at 11:31 AM, Christian König
<christian.koenig@amd.com> wrote:
>>
>> For example, was there a reason for that random 756GB address? Is the
>> limit of the particular AMD 64-bit BAR perhaps at the 1TB mark (and
>> that "res->end" value is because "close to it, but not at the top")?
>
> That is actually a hardware limit documented in the BIOS and Kernel
> Developer's Guide for AMD CPUs
> (https://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf).
>
> I should probably add a comment explaining this.

Ok, good. So at least some of those values have reasons. And yes,
documenting it would be great.

>> A starting point like "halfway to the hardware limit" would actually
>> be a better reason. Or just "we picked an end-point, let's pick a
>> starting point that gives us a _sufficient_ - but not excessive -
>> window".
>
> Well, that is exactly what the 256GB patch was doing.

Ahh, so 0xbd00000000ull is basically 256GB from the end point, and the
end point is basically specified by hardware.

So yes, I think that's a starting point that can at least be
explained: let's try to make it something "big enough" (and 256GB
seems to be big enough).

Of course, since this is all "pick a random number", maybe that breaks
something _else_.

We do traditionally have a very similar issue for the 32-bit PCI
starting address, where we used to have tons of issues with "we don't
know all resources, we want to try to come up with a range that is out
of the way".

There we try to find a big enough gap in the e820 memory map
(e820_search_gap()), and the end result is pci_mem_start (which is
then exposed as PCIBIOS_MIN_MEM to the PCI resource allocation).

If worst comes to worst, maybe we should look at having something
similar for the full 64-bit range.

Linus
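[A rough, self-contained illustration of what an e820_search_gap()-style search extended to the 64-bit range could look like. This is not existing kernel code; the types are deliberately simplified and the input is assumed to be a sorted, non-overlapping list of ranges the firmware reported as used.]

    #include <stdint.h>
    #include <stddef.h>

    struct used_range {
    	uint64_t start;
    	uint64_t end;	/* exclusive */
    };

    /* Find the first hole at or above min_start that can hold `size` bytes. */
    static uint64_t find_64bit_gap(const struct used_range *used, size_t n,
    			       uint64_t min_start, uint64_t size)
    {
    	uint64_t candidate = min_start;
    	size_t i;

    	for (i = 0; i < n; i++) {
    		if (used[i].end <= candidate)
    			continue;		/* range lies entirely below us */
    		if (used[i].start >= candidate + size)
    			break;			/* hole before this range is big enough */
    		candidate = used[i].end;	/* bump past the conflict */
    	}
    	return candidate;			/* caller still checks the HW limit */
    }

The interesting part is that, unlike "end of RAM plus a guard", such a search naturally skips anything the firmware did report, and the only remaining risk is the one Bjorn describes: regions the firmware uses but never reports at all.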
Am 09.01.2018 um 19:38 schrieb Bjorn Helgaas:
> I don't like to add new parameters because I think it's an
> unreasonable burden on users and it makes it hard for distros, but I
> understand the desire to use this functionality. What would you think
> of adding a "pci=big_root_window" parameter that logs an info message
> and taints the kernel with TAINT_FIRMWARE_WORKAROUND? Then there's at
> least some clue that we're in uncharted territory.

Done, patches for this are in your inbox.

In addition to that, I cleaned up the 256GB patch a bit more, e.g.
added a comment explaining where the 0xfd00000000 limit comes from and
updated the commit message.

Please review and/or comment,
Christian.
From 101e157babcef10b91edf91e7e6f03826c2f8ade Mon Sep 17 00:00:00 2001
From: Christian König <christian.koenig@amd.com>
Date: Tue, 28 Nov 2017 10:02:35 +0100
Subject: [PATCH] x86/PCI: add 16GB guard between end of memory and new PCI window
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a workaround for buggy BIOS implementations that steal memory for
iGPUs from the OS without reporting it as reserved.

Signed-off-by: Christian König <christian.koenig@amd.com>
Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
---
 arch/x86/pci/fixup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index e663d6bf1328..e1bdae2cebb6 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -713,6 +713,10 @@ static void pci_amd_enable_64bit_bar(struct pci_dev *dev)
 		}
 		res->start = conflict->end + 1;
 	}
 
+	/* Add 16GB guard between end of memory and new PCI window to work
+	 * around buggy BIOS implementations.
+	 */
+	res->start += 0x400000000ull;
 	dev_info(&dev->dev, "adding root bus resource %pR\n", res);
 
-- 
2.11.0
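[For scale, illustrative arithmetic only, not part of the patch: 0x400000000 is 2^34 bytes, i.e. 16 GiB. The guard is applied after the walk over existing iomem regions, so on a machine whose last reported or conflicting range ends at, say, 0x800000000 (32 GiB), the new window would start no lower than 0xC00000000 (48 GiB), leaving headroom for RAM the BIOS reserved for the iGPU but did not report.]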