| Message ID | 20190214170028.27862-1-logang@deltatee.com (mailing list archive) |
|---|---|
| State | New, archived |
| Delegated to: | Bjorn Helgaas |
| Series | [1/2] PCI: Prevent 64-bit resources from being counted in 32-bit bridge region |
Hi Logan,

Sorry for the delay. This code gives a headache. I still remember my headache from the last time we touched it. Help me understand what's going on here.

On Thu, Feb 14, 2019 at 10:00:27AM -0700, Logan Gunthorpe wrote:

> When using the pci=realloc command line argument, with hpmemsize not equal to zero, some hierarchies of 32-bit resources can fail to be assigned in some situations. When this happens, the user will see some PCI BAR resources being ignored and some PCI Bridge windows being left unset. In lspci this may look like:
>
>     Memory behind bridge: fff00000-000fffff
>
> or
>
>     Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]
>
> Ignored BARs mean the underlying device will not be usable.
>
> The possible situations where this can happen will be quite varied and depend highly on the exact hierarchy and how the realloc code ends up trying to assign the regions. It's known to at least require a large 64-bit BAR (>1GB) below a PCI bridge.

I guess the bug is that some BAR or window is unset when we actually have space for it? We need to make this more concrete, e.g., with a minimal example of a failure case, and then connect this code change specifically with that.

"Ignored BARs" doesn't seem like the best terminology here. Can we just say they're "unset" as you do for windows? Even that's a little squishy because there's really no such thing as a clearly "unset" or invalid value for a BAR. All we can say is that Linux *thinks* it's unset because it happens to be zero (technically still a valid BAR value) or it conflicts with another device.

Strictly speaking, the result is that we can't enable decoding for that BAR type. Often that does mean the device is unusable, but in some cases, e.g., an I/O BAR being unset and a driver using pci_enable_device_mem(), the device *is* usable.

Surely realloc can fail even without a large 64-bit BAR? I don't think there's a magic threshold at 1GB.
Maybe an example would illustrate the problem better.

> The cause of this bug is in __pci_bus_size_bridges() which tries to calculate the total resource space required for each of the bridge windows (typically IO, 64-bit, and 32-bit / non-prefetchable). The code, as written, tries to allocate all the 64-bit prefetchable resources followed by all the remaining resources. It uses two calls to pbus_size_mem() for this. If the first call to pbus_size_mem() fails it tries to fit all resources into the 32-bit bridge window and it expects the size of the 32-bit bridge window to be multiple GBs which will never be assignable under the 4GB limit imposed on it.

There are actually three calls to pbus_size_mem():

  1) If bridge has a 64-bit prefetchable window, find the size of all 64-bit prefetchable resources below the bridge

  2) If bridge has no 64-bit prefetchable window, find the size of all prefetchable resources below the bridge

  3) Find the size of everything else (non-prefetchable resources plus any prefetchable ones that couldn't be accommodated above)

Sorry again for being so literal and unimaginative, but I don't understand how the code "expects the size of the ... window to be multiple GBs which will never be assignable ...". Whether things are assignable just depends on what resources are available. It's not a matter of "expecting" the window to be big enough; it just is big enough or it isn't.

> There are only two reasons for pbus_size_mem() to fail: if there is no 64-bit/prefetchable bridge window, or if that window is already assigned (in other words, its resource already has a parent set). We know the former case can't be true because, in __pci_bus_size_bridges(), its existence is checked before making the call. So if the pbus_size_mem() call in question fails, the window must already be assigned, and in this case, we still do not want 64-bit resources trying to be sized into the 32-bit catch-all resource.
I guess this question of putting a 64-bit resource in the 32-bit non-prefetchable window (legal but undesirable) is a secondary thing, not the chief complaint you're fixing?

> So to fix the bug, we must always set mask, type2 and type3 in cases where a 64-bit resource exists even if pbus_size_mem() fails.
>
> Reported-by: Kit Chow <kchow@gigaio.com>
> Fixes: 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> ---
>  drivers/pci/setup-bus.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index ed960436df5e..56b7077f37ff 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1265,21 +1265,20 @@ void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
>  	prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
>  	if (b_res[2].flags & IORESOURCE_MEM_64) {
>  		prefmask |= IORESOURCE_MEM_64;
> -		ret = pbus_size_mem(bus, prefmask, prefmask,
> +		pbus_size_mem(bus, prefmask, prefmask,
>  				prefmask, prefmask,
>  				realloc_head ? 0 : additional_mem_size,
>  				additional_mem_size, realloc_head);
>
>  		/*
> -		 * If successful, all non-prefetchable resources
> -		 * and any 32-bit prefetchable resources will go in
> -		 * the non-prefetchable window.
> +		 * Given the existence of a 64-bit resource for this
> +		 * bus, all non-prefetchable resources and any 32-bit
> +		 * prefetchable resources will go in the
> +		 * non-prefetchable window.
>  		 */
> -		if (ret == 0) {
> -			mask = prefmask;
> -			type2 = prefmask & ~IORESOURCE_MEM_64;
> -			type3 = prefmask & ~IORESOURCE_PREFETCH;
> -		}
> +		mask = prefmask;
> +		type2 = prefmask & ~IORESOURCE_MEM_64;
> +		type3 = prefmask & ~IORESOURCE_PREFETCH;
>  	}
>
>  	/*
> --
> 2.19.0
On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:

> Sorry for the delay. This code gives a headache. I still remember my headache from the last time we touched it. Help me understand what's going on here.

Yes, this code gave me a headache debugging it too. And it's not the first time I've tried to figure out what's going on with it because it often just prints noisy messages that look like errors. I think I understand it better now but it's something that's a bit fleeting and easy to forget the details of. There may also be other solutions to this problem.

> On Thu, Feb 14, 2019 at 10:00:27AM -0700, Logan Gunthorpe wrote:
>> The possible situations where this can happen will be quite varied and depend highly on the exact hierarchy and how the realloc code ends up trying to assign the regions. It's known to at least require a large 64-bit BAR (>1GB) below a PCI bridge.
>
> I guess the bug is that some BAR or window is unset when we actually have space for it? We need to make this more concrete, e.g., with a minimal example of a failure case, and then connect this code change specifically with that.

The system we hit this bug on is quite large and complex with multiple layers of switches, though I suspect I might have seen it on a completely different system but never had time to dig into it. I guess I could try to find a case in which QEMU can hit it.

> "Ignored BARs" doesn't seem like the best terminology here. Can we just say they're "unset" as you do for windows? Even that's a little squishy because there's really no such thing as a clearly "unset" or invalid value for a BAR. All we can say is that Linux *thinks* it's unset because it happens to be zero (technically still a valid BAR value) or it conflicts with another device.

Yes. I used the "ignored" term because that's the terminology lspci uses when it sees a resource like this.
I'm not sure I like the "unset" term because, in fact, the BAR registers in the configuration space are typically still set to whatever the BIOS set them to [*1]. It's just that the kernel doesn't create a struct resource for them and thus you won't see a corresponding entry in /sys/bus/pci/.../resources or /proc/bus/pci/devices.

For example, here's the lspci and hex dump of the config space for an example case; as you can see, Region 0 is "ignored", but the BAR register is still set to 0xf7100000:

    05:00.0 PCI bridge: PLX Technology, Inc. Device 8733 (rev ca) (prog-if 00 [Normal decode])
            Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Latency: 0, Cache Line Size: 32 bytes
            Interrupt: pin A routed to IRQ 24
            NUMA node: 0
            Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]
            Bus: primary=05, secondary=06, subordinate=0a, sec-latency=0

    05:00.0 PCI bridge: PLX Technology, Inc. Device 8733 (rev ca)
    00: b5 10 33 87 07 01 10 00 ca 00 04 06 08 00 81 00
    10: 00 00 10 f7 00 00 00 00 05 06 0a 00 11 21 00 00    <- f7100000
    20: f0 ff 00 00 01 00 f1 ff 20 02 00 00 7f 02 00 00

> Strictly speaking, the result is that we can't enable decoding for that BAR type. Often that does mean the device is unusable, but in some cases, e.g., an I/O BAR being unset and a driver using pci_enable_device_mem(), the device *is* usable.

Yes, though I think this bug will always apply to non-prefetchable 32-bit MEM BARs and not I/O BARs. So if the driver needs such a BAR (which I expect is common) it will not function.

> Surely realloc can fail even without a large 64-bit BAR? I don't think there's a magic threshold at 1GB. Maybe an example would illustrate the problem better.

No, to hit this bug, we do require a large 64-bit BAR. Though the threshold isn't necessarily 1GB, it's more complicated (see below).
This is just really hard to explain because the code is so tricky. I'll try to clarify it more below. Once we can clear it up so you understand it, I'll try to update my commit message to be clearer.

> There are actually three calls to pbus_size_mem():
>
>   1) If bridge has a 64-bit prefetchable window, find the size of all 64-bit prefetchable resources below the bridge
>
>   2) If bridge has no 64-bit prefetchable window, find the size of all prefetchable resources below the bridge
>
>   3) Find the size of everything else (non-prefetchable resources plus any prefetchable ones that couldn't be accommodated above)

Yes, this is technically correct. I was just over-simplifying because, to hit this bug, there must be a 64-bit prefetchable window and there are no prefetchable 32-bit resources, so (2) is irrelevant.

>> There are only two reasons for pbus_size_mem() to fail: if there is no 64-bit/prefetchable bridge window, or if that window is already assigned (in other words, its resource already has a parent set). We know the former case can't be true because, in __pci_bus_size_bridges(), its existence is checked before making the call. So if the pbus_size_mem() call in question fails, the window must already be assigned, and in this case, we still do not want 64-bit resources trying to be sized into the 32-bit catch-all resource.
>
> I guess this question of putting a 64-bit resource in the 32-bit non-prefetchable window (legal but undesirable) is a secondary thing, not the chief complaint you're fixing?

Uh, no. The 64-bit resource(s) are typically correctly assigned to the 64-bit window. However, the bug causes victim 32-bit non-prefetchable resources to not be assigned, because when we calculate the size, the code inadvertently thinks the 64-bit resource must fit into the 32-bit non-prefetchable window.

--

Ok, so let me try to walk through this a bit more step by step.
Let's make the following assumptions:

1) Let's say we have a bridge that has a 64-bit prefetchable window, such that (b_res[2].flags & IORESOURCE_MEM_64) is true. So the bridge has three windows: the I/O window, the non-prefetchable 32-bit window and the prefetchable 64-bit window.

2) Let's also say that, under the bridge, we have a resource that's larger than we'd expect to fit into the 32-bit window. (The actual limit depends on the maximum size of that window, which is hardware dependent, and the size of everything else that goes in it, but for simplicity I estimated this to be at least 1GB.) For the purposes of this example, let's say it's very large: 64GB.

3) Now, the tricky thing is that this code tries to do things in multiple passes, unassigning resources that didn't seem to fit and recalculating window sizes on each pass. So let's say we are on the second pass where, previously, the 64-bit prefetchable window on this bridge was successfully assigned but, for whatever reason, this bridge failed and the resources were released (see pci_bus_release_bridge_resources()). In this case (b_res[2].parent != NULL) and the large 64-bit resource now has (res->parent == NULL).

Walking through __pci_bus_size_bridges():

i) Per (1), we do have a prefetchable window, so the first call to pbus_size_mem() happens. However, the first thing that function does is call find_free_bus_resource(), which should find b_res[2], but does not because, per (3), its parent is not NULL (see the comment above find_free_bus_resource() which makes it seem important that parent not be set). Thus this call to pbus_size_mem() fails with -ENOSPC, and 'type2' and 'type3' remain unset.

ii) Seeing type2 is unset, we go to the second call to pbus_size_mem(). This call fails because, per (1), there is no 32-bit prefetchable resource to find. So find_free_bus_resource() fails, as it should. 'type2' and 'type3' are now set to IORESOURCE_MEM.
iii) Next we do the third call to pbus_size_mem() for the non-prefetchable window. However, because the first two calls failed, it will calculate the size for the 32-bit window to be greater than 64GB. As the code recurses up into the rest of the PCI tree, nothing will fit into the 32-bit window, seeing it's calculated to be much larger than it needs to be, so none of the 32-bit non-prefetchable BARs will be assigned. Thus they will not have struct resources and they will be reported as ignored by lspci. Drivers that try to use these resources will also fail, and there will be addresses in the IOVA space that may get routed to the wrong PCI address.

My solution to this bug was to notice that pbus_size_mem() can only fail for one of two reasons: either the resource doesn't exist at all or it is already assigned (i.e. the resource's parent is still set). However, for the first call in (i), we know the resource does exist because we check for it (i.e. the condition in (1)). Therefore we can say that if pbus_size_mem() fails in (i), there is a 64-bit window and we should not try to size the 32-bit window to fit 64-bit resources. We do this simply by setting 'mask', 'type2' and 'type3' even when pbus_size_mem() fails.

Another potential solution would be to always unassign the windows for the failing bridges by setting their resources' parents to NULL. But this makes me much more nervous because I don't understand what all the implications would be, and the comment above find_free_bus_resource() makes me think this is important.

Let me know if this makes more sense or you have further questions.

Also, the second patch in this series is a similar bug with *very* similar symptoms, but I think it's easier to understand. It has a completely different cause and only happens on one particular piece of hardware that I'm aware of; and the driver for that hardware doesn't use the BAR at all right now, so the only real negative effect is on the IOVA address space.
Thanks,

Logan

[*1] This is probably another big bug due to its effect on the IOVA address space, but I'm not sure what to do about it, and with these patches we don't have any further issues. However, maybe we should be scanning the tree after everything is said and done and unset any BARs that do not have corresponding resources. That way, if there are other bugs that can cause ignored regions, or if our algorithm does something that doesn't match the BIOS, there aren't random failures when using the IOMMU.
On Mon, Mar 4, 2019 at 1:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:

> On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
>> Sorry for the delay. This code gives a headache. I still remember my headache from the last time we touched it. Help me understand what's going on here.
>
> Yes, this code gave me a headache debugging it too. And it's not the first time I've tried to figure out what's going on with it because it often just prints noisy messages that look like errors. I think I understand it better now but it's something that's a bit fleeting and easy to forget the details of. There may also be other solutions to this problem.

Thanks for the explanation below. I haven't worked through it yet, but I will.

Obviously it would be far better than an explanation if we could simplify the code (and the noisy messages) such that it didn't *require* so much explanation.

Bjorn
On 2019-03-04 1:11 p.m., Bjorn Helgaas wrote:

> On Mon, Mar 4, 2019 at 1:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>> On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
>>> Sorry for the delay. This code gives a headache. I still remember my headache from the last time we touched it. Help me understand what's going on here.
>>
>> Yes, this code gave me a headache debugging it too. And it's not the first time I've tried to figure out what's going on with it because it often just prints noisy messages that look like errors. I think I understand it better now but it's something that's a bit fleeting and easy to forget the details of. There may also be other solutions to this problem.
>
> Thanks for the explanation below. I haven't worked through it yet, but I will.
>
> Obviously it would be far better than an explanation if we could simplify the code (and the noisy messages) such that it didn't *require* so much explanation.

I agree, but reworking this code scares me and I suspect it was designed this way for a reason. I'm guessing there are a lot of corner cases and unusual BIOS issues this stuff works around. We might end up fixing some cases and breaking a bunch of other cases.

It would probably be a lot simpler if (for 'realloc', at least) it unassigned everything and then only did one pass. It wouldn't make the code itself much simpler, but it might make it easier to reason about and debug; and it would also remove a lot of the noisy messages. I suspect the multi-pass setup would still be required for cases where the BIOS doesn't assign a device or whatever it's doing, and it was probably designed this way to try and keep as many of the BIOS assignments the same as possible, though I'm not really sure if that would be necessary.

Logan
On Mon, Mar 4, 2019 at 2:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:

> On 2019-03-04 1:11 p.m., Bjorn Helgaas wrote:
>> Obviously it would be far better than an explanation if we could simplify the code (and the noisy messages) such that it didn't *require* so much explanation.
>
> I agree, but reworking this code scares me and I suspect it was designed this way for a reason. I'm guessing there are a lot of corner cases and unusual BIOS issues this stuff works around. We might end up fixing some cases and breaking a bunch of other cases.

Scares me too, which is one reason I haven't done anything about it.

I didn't mean to suggest that you should rework it for *this* issue. I just keep hoping that we can chip away at teensy pieces and in ten or twenty years maybe make some headway.

Bjorn
On 2019-03-04 1:29 p.m., Bjorn Helgaas wrote:

>> I agree, but reworking this code scares me and I suspect it was designed this way for a reason. I'm guessing there are a lot of corner cases and unusual BIOS issues this stuff works around. We might end up fixing some cases and breaking a bunch of other cases.
>
> Scares me too, which is one reason I haven't done anything about it.
>
> I didn't mean to suggest that you should rework it for *this* issue. I just keep hoping that we can chip away at teensy pieces and in ten or twenty years maybe make some headway.

Sure. Just trying to brainstorm some ideas.

Another idea to chip away at things might be that, instead of pbus_size_mem() trying to find an appropriate bridge resource by looping, we simply pass in the resource it's supposed to use (which I suspect __pci_bus_size_bridges() should be able to figure out ahead of time). So instead of guessing and testing for a bunch of different resource window types we might have, just loop through the actually available windows and group the resources into what we have.

Once we've sorted out these patches, when I have some free time, I might try working out a cleanup patch in this direction that we could test and merge slowly.

Logan
On 2019-03-04 12:21 p.m., Logan Gunthorpe wrote:

> The system we hit this bug on is quite large and complex with multiple layers of switches, though I suspect I might have seen it on a completely different system but never had time to dig into it. I guess I could try to find a case in which QEMU can hit it.

Ok, I was able to hit this bug with QEMU and I found out a bunch of minutiae about it.

For starters, pci=realloc is not really what I (or others) had expected: it really just tries *harder* to re-assign any resources the BIOS failed to assign. So really all this mess in setup-bus.c is strictly there to work around BIOS issues. I was also expecting it to be able to insert gaps for hotplug with the hpmemsize parameter, but if the BIOS assigns all the devices correctly, adding parameters such as "pci=realloc,hpmemsize=8M" will actually do nothing, on x86 at least.

So to hit this bug in real life, the BIOS has to fail to assign a large 64-bit BAR as well as somehow have 32-bit non-prefetchable resources that need to be re-allocated (possibly as a result of other failed assignments). So I suspect on the system we hit this bug on, the BIOS failed to allocate a bunch of resources, which pci=realloc was *mostly* able to fix up; but this bug caused it to "ignore" a bunch of unrelated resources.

As QEMU does not have a broken BIOS, and always seems to do sensible things, reproducing the bug is a bit difficult. To hit the bug, I had to apply the hack patch given at the end of this email to release a large BAR as well as the 32-bit non-prefetchable memory for the entire bus before executing the rest of the code in pci_assign_unassigned_root_bus_resources().
Using the following QEMU command:

    qemu-system-x86_64 -enable-kvm -s -m 2048 $IMAGE \
        -device pcie-root-port,id=root_port1,chassis=1,slot=1 \
        -device x3130-upstream,id=upstream_port1,bus=root_port1 \
        -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=2,slot=1 \
        -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=2,slot=2 \
        -drive file=${IMGDIR}/nvme.qcow2,if=none,id=nvme1,snapshot=on \
        -device nvme,drive=nvme1,serial=nvme1,cmb_size_mb=2048,bus=downstream_port1 \
        -drive file=${IMGDIR}/nvme2.qcow2,if=none,id=nvme2,snapshot=on \
        -device nvme,drive=nvme2,serial=nvme1,bus=downstream_port2 \
        -virtfs local,id=home,path=/home/,security_model=mapped,mount_tag=home \
        -nographic \
        -serial mon:stdio \
        -kernel $KERNEL \
        -append "root=/dev/sda2 rootfstype=ext4 console=ttyS0,38400n8"

we can emulate an NVMe device with a 2GB CMB BAR underneath a PCIe switch, with a sibling basic NVMe device for demonstration purposes. The hack releases the 2GB BAR and all the 32-bit resources in the bus, which *should* be easily reassigned correctly by the code in setup-bus.c because QEMU had originally found addresses for them. (Note: we do not even need "pci=realloc" on the command line to hit the bug with this setup.)
When booted, lspci for the NVMe devices shows that all non-prefetchable resources under the switch were ignored, but the large 2GB BAR was correctly re-assigned:

    03:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
            Subsystem: Red Hat, Inc QEMU Virtual Machine
            Flags: fast devsel, IRQ 11
            Memory at <ignored> (64-bit, non-prefetchable)
            Memory at 100000000 (64-bit, prefetchable) [size=2G]
            Memory at <ignored> (32-bit, non-prefetchable)
            Capabilities: [40] MSI-X: Enable- Count=64 Masked-
            Capabilities: [80] Express Endpoint, MSI 00

    04:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
            Subsystem: Red Hat, Inc QEMU Virtual Machine
            Flags: fast devsel, IRQ 10
            Memory at <ignored> (64-bit, non-prefetchable)
            Memory at <ignored> (32-bit, non-prefetchable)
            Capabilities: [40] MSI-X: Enable- Count=64 Masked-
            Capabilities: [80] Express Endpoint, MSI 00

After applying patch 1 in this series, the same setup correctly assigns all resources:

    03:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
            Subsystem: Red Hat, Inc QEMU Virtual Machine
            Flags: bus master, fast devsel, latency 0, IRQ 11
            Memory at 80000000 (64-bit, non-prefetchable) [size=8K]
            Memory at 100000000 (64-bit, prefetchable) [size=2G]
            Memory at 80002000 (32-bit, non-prefetchable) [size=4K]
            Capabilities: [40] MSI-X: Enable+ Count=64 Masked-
            Capabilities: [80] Express Endpoint, MSI 00
            Kernel driver in use: nvme

    04:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
            Subsystem: Red Hat, Inc QEMU Virtual Machine
            Flags: fast devsel, IRQ 10
            Memory at 80200000 (64-bit, non-prefetchable) [size=8K]
            Memory at 80202000 (32-bit, non-prefetchable) [size=4K]
            Capabilities: [40] MSI-X: Enable- Count=64 Masked-
            Capabilities: [80] Express Endpoint, MSI 00

Logan

--
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ed960436df5e..3ab1b9f9df49 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1744,6 +1744,24 @@ static enum enable_type pci_realloc_detect(struct pci_bus *bus,
 }
 #endif

+static void unassign_qemu_device_hack(struct pci_bus *bus)
+{
+	struct device *dev;
+	struct pci_dev *pdev;
+
+	dev = bus_find_device_by_name(&pci_bus_type, NULL, "0000:03:00.0");
+	if (!dev)
+		return;
+
+	pdev = to_pci_dev(dev);
+
+	pci_info(pdev, "---Releasing BAR 2\n");
+	release_resource(&pdev->resource[2]);
+
+	pci_bus_release_bridge_resources(bus, IORESOURCE_PREFETCH,
+					 whole_subtree);
+}
+
 /*
  * first try will not touch pci bridge res
  * second and later try will clear small leaf bridge res
@@ -1761,6 +1779,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
 	int pci_try_num = 1;
 	enum enable_type enable_local;

+	unassign_qemu_device_hack(bus);
+
 	/* don't realloc if asked to do so */
 	enable_local = pci_realloc_detect(bus, pci_realloc_enable);
 	if (pci_realloc_enabled(enable_local)) {
Hey Bjorn,

On 2019-03-04 1:11 p.m., Bjorn Helgaas wrote:

> Thanks for the explanation below. I haven't worked through it yet, but I will.

Any chance you've had a chance to go through this issue yet? Is there anything I should do with it? Do you want me to try to improve the commit messages and resend the patches?

Thanks,

Logan
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ed960436df5e..56b7077f37ff 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1265,21 +1265,20 @@ void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
 	prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
 	if (b_res[2].flags & IORESOURCE_MEM_64) {
 		prefmask |= IORESOURCE_MEM_64;
-		ret = pbus_size_mem(bus, prefmask, prefmask,
+		pbus_size_mem(bus, prefmask, prefmask,
 				prefmask, prefmask,
 				realloc_head ? 0 : additional_mem_size,
 				additional_mem_size, realloc_head);

 		/*
-		 * If successful, all non-prefetchable resources
-		 * and any 32-bit prefetchable resources will go in
-		 * the non-prefetchable window.
+		 * Given the existence of a 64-bit resource for this
+		 * bus, all non-prefetchable resources and any 32-bit
+		 * prefetchable resources will go in the
+		 * non-prefetchable window.
 		 */
-		if (ret == 0) {
-			mask = prefmask;
-			type2 = prefmask & ~IORESOURCE_MEM_64;
-			type3 = prefmask & ~IORESOURCE_PREFETCH;
-		}
+		mask = prefmask;
+		type2 = prefmask & ~IORESOURCE_MEM_64;
+		type3 = prefmask & ~IORESOURCE_PREFETCH;
 	}

 	/*
When using the pci=realloc command line argument, with hpmemsize not equal to zero, some hierarchies of 32-bit resources can fail to be assigned in some situations. When this happens, the user will see some PCI BAR resources being ignored and some PCI Bridge windows being left unset. In lspci this may look like:

    Memory behind bridge: fff00000-000fffff

or

    Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]

Ignored BARs mean the underlying device will not be usable.

The possible situations where this can happen will be quite varied and depend highly on the exact hierarchy and how the realloc code ends up trying to assign the regions. It's known to at least require a large 64-bit BAR (>1GB) below a PCI bridge.

The cause of this bug is in __pci_bus_size_bridges() which tries to calculate the total resource space required for each of the bridge windows (typically IO, 64-bit, and 32-bit / non-prefetchable). The code, as written, tries to allocate all the 64-bit prefetchable resources followed by all the remaining resources. It uses two calls to pbus_size_mem() for this. If the first call to pbus_size_mem() fails it tries to fit all resources into the 32-bit bridge window and it expects the size of the 32-bit bridge window to be multiple GBs which will never be assignable under the 4GB limit imposed on it.

There are only two reasons for pbus_size_mem() to fail: if there is no 64-bit/prefetchable bridge window, or if that window is already assigned (in other words, its resource already has a parent set). We know the former case can't be true because, in __pci_bus_size_bridges(), its existence is checked before making the call. So if the pbus_size_mem() call in question fails, the window must already be assigned, and in this case, we still do not want 64-bit resources trying to be sized into the 32-bit catch-all resource.

So to fix the bug, we must always set mask, type2 and type3 in cases where a 64-bit resource exists even if pbus_size_mem() fails.
Reported-by: Kit Chow <kchow@gigaio.com>
Fixes: 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/setup-bus.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)