
[1/2] PCI: Prevent 64-bit resources from being counted in 32-bit bridge region

Message ID 20190214170028.27862-1-logang@deltatee.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Series [1/2] PCI: Prevent 64-bit resources from being counted in 32-bit bridge region

Commit Message

Logan Gunthorpe Feb. 14, 2019, 5 p.m. UTC
When using the pci=realloc command line argument with hpmemsize not
equal to zero, 32-bit resources in some hierarchies can fail to be
assigned. When this happens, the user will see some PCI BAR resources
being ignored and some PCI bridge windows being left unset. In lspci
this may look like:

  Memory behind bridge: fff00000-000fffff

or

  Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]

Ignored BARs mean the underlying device will not be usable.

The possible situations where this can happen will be quite varied and
depend highly on the exact hierarchy and how the realloc code ends up
trying to assign the regions. It's known to at least require a
large 64-bit BAR (>1GB) below a PCI bridge.

The cause of this bug is in __pci_bus_size_bridges() which tries to
calculate the total resource space required for each of the bridge windows
(typically IO, 64-bit, and 32-bit / non-prefetchable). The code, as
written, tries to allocate all the 64-bit prefetchable resources
followed by all the remaining resources. It uses two calls to
pbus_size_mem() for this. If the first call to pbus_size_mem() fails,
it tries to fit all resources into the 32-bit bridge window, and it
expects the size of the 32-bit bridge window to be multiple GBs, which
will never be assignable under the 4GB limit imposed on it.

There are only two reasons for pbus_size_mem() to fail: if there is no
64-bit/prefetchable bridge window, or if that window is already
assigned (in other words, its resource already has a parent set). We know
the former case can't be true because, in __pci_bus_size_bridges(), its
existence is checked before making the call. So if the pbus_size_mem()
call in question fails, the window must already be assigned, and in this
case we still do not want 64-bit resources to be sized into the 32-bit
catch-all resource.

So to fix the bug, we must always set mask, type2 and type3 in cases
where a 64-bit resource exists even if pbus_size_mem() fails.

Reported-by: Kit Chow <kchow@gigaio.com>
Fixes: 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/setup-bus.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

Comments

Bjorn Helgaas March 4, 2019, 12:23 a.m. UTC | #1
Hi Logan,

Sorry for the delay.  This code gives a headache.  I still remember
my headache from the last time we touched it.  Help me understand
what's going on here.

On Thu, Feb 14, 2019 at 10:00:27AM -0700, Logan Gunthorpe wrote:
> When using the pci=realloc command line argument, with hpmemsize not
> equal to zero, some hierarchies of 32-bit resources can fail to be
> assigned in some situations. When this happens, the user will see
> some PCI BAR resources being ignored and some PCI Bridge windows
> being left unset. In lspci this may look like:
> 
>   Memory behind bridge: fff00000-000fffff
> 
> or
> 
>   Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]
> 
> Ignored BARs mean the underlying device will not be usable.
> 
> The possible situations where this can happen will be quite varied and
> depend highly on the exact hierarchy and how the realloc code ends up
> trying to assign the regions. It's known to at least require a
> large 64-bit BAR (>1GB) below a PCI bridge.

I guess the bug is that some BAR or window is unset when we actually
have space for it?  We need to make this more concrete, e.g., with a
minimal example of a failure case, and then connect this code change
specifically with that.

"Ignored BARs" doesn't seem like the best terminology here.  Can we
just say they're "unset" as you do for windows?  Even that's a little
squishy because there's really no such thing as a clearly "unset" or
invalid value for a BAR.  All we can say is that Linux *thinks* it's
unset because it happens to be zero (technically still a valid BAR
value) or it conflicts with another device.

Strictly speaking, the result is that we can't enable decoding for
that BAR type.  Often that does mean the device is unusable, but in
some cases, e.g., an I/O BAR being unset and a driver using
pci_enable_device_mem(), the device *is* usable.

Surely realloc can fail even without a large 64-bit BAR?  I don't
think there's a magic threshold at 1GB.  Maybe an example would
illustrate the problem better.

> The cause of this bug is in __pci_bus_size_bridges() which tries to
> calculate the total resource space required for each of the bridge windows
> (typically IO, 64-bit, and 32-bit / non-prefetchable). The code, as
> written, tries to allocate all the 64-bit prefetchable resources
> followed by all the remaining resources. It uses two calls to
> pbus_size_mem() for this. If the first call to pbus_size_mem() fails
> it tries to fit all resources into the 32-bit bridge window and it
> expects the size of the 32-bit bridge window to be multiple GBs which
> will never be assignable under the 4GB limit imposed on it.

There are actually three calls to pbus_size_mem():

  1) If bridge has a 64-bit prefetchable window, find the size of all
     64-bit prefetchable resources below the bridge

  2) If bridge has no 64-bit prefetchable window, find the size of all
     prefetchable resources below the bridge

  3) Find the size of everything else (non-prefetchable resources plus
     any prefetchable ones that couldn't be accommodated above)

Sorry again for being so literal and unimaginative, but I don't
understand how the code "expects the size of the ... window to be
multiple GBs which will never be assignable ...".  Whether things are
assignable just depends on what resources are available.  It's not a
matter of "expecting" the window to be big enough; it just is big
enough or it isn't.

> There are only two reasons for pbus_size_mem() to fail: if there is no
> 64-bit/prefetchable bridge window, or if that window is already
> assigned (in other words, its resource already has a parent set). We know
> the former case can't be true because, in __pci_bus_size_bridges(), it's
> existence is checked before making the call. So if the pbus_size_mem()
> call in question fails, the window must already be assigned, and in this
> case, we still do not want 64-bit resources trying to be sized into the
> 32-bit catch-all resource.

I guess this question of putting a 64-bit resource in the 32-bit
non-prefetchable window (legal but undesirable) is a secondary thing,
not the chief complaint you're fixing?

> So to fix the bug, we must always set mask, type2 and type3 in cases
> where a 64-bit resource exists even if pbus_size_mem() fails.
> 
> Reported-by: Kit Chow <kchow@gigaio.com>
> Fixes: 5b28541552ef ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Yinghai Lu <yinghai@kernel.org>
> ---
>  drivers/pci/setup-bus.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index ed960436df5e..56b7077f37ff 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1265,21 +1265,20 @@ void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
>  		prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
>  		if (b_res[2].flags & IORESOURCE_MEM_64) {
>  			prefmask |= IORESOURCE_MEM_64;
> -			ret = pbus_size_mem(bus, prefmask, prefmask,
> +			pbus_size_mem(bus, prefmask, prefmask,
>  				  prefmask, prefmask,
>  				  realloc_head ? 0 : additional_mem_size,
>  				  additional_mem_size, realloc_head);
>  
>  			/*
> -			 * If successful, all non-prefetchable resources
> -			 * and any 32-bit prefetchable resources will go in
> -			 * the non-prefetchable window.
> +			 * Given the existence of a 64-bit resource for this
> +			 * bus, all non-prefetchable resources and any 32-bit
> +			 * prefetchable resources will go in the
> +			 * non-prefetchable window.
>  			 */
> -			if (ret == 0) {
> -				mask = prefmask;
> -				type2 = prefmask & ~IORESOURCE_MEM_64;
> -				type3 = prefmask & ~IORESOURCE_PREFETCH;
> -			}
> +			mask = prefmask;
> +			type2 = prefmask & ~IORESOURCE_MEM_64;
> +			type3 = prefmask & ~IORESOURCE_PREFETCH;
>  		}
>  
>  		/*
> -- 
> 2.19.0
>
Logan Gunthorpe March 4, 2019, 7:21 p.m. UTC | #2
On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
> Sorry for the delay.  This code gives a headache.  I still remember
> my headache from the last time we touched it.  Help me understand
> what's going on here.

Yes, this code gave me a headache debugging it too. And it's not the
first time I've tried to figure out what's going on with it because it
often just prints noisy messages that look like errors. I think I
understand it better now but it's something that's a bit fleeting and
easy to forget the details of. There may also be other solutions to this
problem.

> On Thu, Feb 14, 2019 at 10:00:27AM -0700, Logan Gunthorpe wrote:
>> The possible situations where this can happen will be quite varied and
>> depend highly on the exact hierarchy and how the realloc code ends up
>> trying to assign the regions. It's known to at least require a
>> large 64-bit BAR (>1GB) below a PCI bridge.
> 
> I guess the bug is that some BAR or window is unset when we actually
> have space for it?  We need to make this more concrete, e.g., with a
> minimal example of a failure case, and then connect this code change
> specifically with that.

The system we hit this bug on is quite large and complex, with multiple
layers of switches, though I suspect I might have seen it on a completely
different system but never had time to dig into it. I guess I could try
to find a case in which qemu can hit it.

> "Ignored BARs" doesn't seem like the best terminology here.  Can we
> just say they're "unset" as you do for windows?  Even that's a little
> squishy because there's really no such thing as a clearly "unset" or
> invalid value for a BAR.  All we can say is that Linux *thinks* it's
> unset because it happens to be zero (technically still a valid BAR
> value) or it conflicts with another device.

Yes. I used the "ignored" term because that's the terminology lspci uses
when it sees a resource like this. I'm not sure I like the "unset" term
because, in fact, the BAR registers in the configuration space are
typically still set to whatever the bios set them to[*1]. It's just that
the kernel doesn't create a struct resource for them and thus you won't
see a corresponding entry in /sys/bus/pci/.../resources or
/proc/bus/pci/devices. For example, here's the lspci output and a hex
dump of the config space for one such case; as you can see, Region 0 is
"ignored", but the BAR register is still set to 0xf7100000.

05:00.0 PCI bridge: PLX Technology, Inc. Device 8733 (rev ca) (prog-if 00 [Normal decode])
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 32 bytes
         Interrupt: pin A routed to IRQ 24
         NUMA node: 0
         Region 0: Memory at <ignored> (32-bit, non-prefetchable) [size=256K]
         Bus: primary=05, secondary=06, subordinate=0a, sec-latency=0

05:00.0 PCI bridge: PLX Technology, Inc. Device 8733 (rev ca)
00: b5 10 33 87 07 01 10 00 ca 00 04 06 08 00 81 00
10: 00 00 10 f7 00 00 00 00 05 06 0a 00 11 21 00 00		f7100000
20: f0 ff 00 00 01 00 f1 ff 20 02 00 00 7f 02 00 00


> Strictly speaking, the result is that we can't enable decoding for
> that BAR type.  Often that does mean the device is unusable, but in
> some cases, e.g., an I/O BAR being unset and a driver using
> pci_enable_device_mem(), the device *is* usable.

Yes, though I think this bug will always apply to non-prefetchable
32-bit MEM BARs and not I/O BARs. So if the driver needs such a BAR
(which I expect is common) it will not function.

> Surely realloc can fail even without a large 64-bit BAR?  I don't
> think there's a magic threshold at 1GB.  Maybe an example would
> illustrate the problem better.

No, to hit this bug, we do require a large 64-bit BAR. Though the
threshold isn't necessarily 1GB, it's more complicated (see below).

This is just really hard to explain because the code is so tricky. I'll
try to clarify it more below. Once we can clear it up so you understand
it I'll try to update my commit message to be clearer.

> There are actually three calls to pbus_size_mem():
> 
>   1) If bridge has a 64-bit prefetchable window, find the size of all
>      64-bit prefetchable resources below the bridge
> 
>   2) If bridge has no 64-bit prefetchable window, find the size of all
>      prefetchable resources below the bridge
> 
>   3) Find the size of everything else (non-prefetchable resources plus
>      any prefetchable ones that couldn't be accommodated above)

Yes, this is technically correct. I was just over-simplifying because,
to hit this bug, there must be a 64-bit prefetchable window and there
are no prefetchable 32-bit resources so (2) is irrelevant.

>> There are only two reasons for pbus_size_mem() to fail: if there is no
>> 64-bit/prefetchable bridge window, or if that window is already
>> assigned (in other words, its resource already has a parent set). We know
>> the former case can't be true because, in __pci_bus_size_bridges(), it's
>> existence is checked before making the call. So if the pbus_size_mem()
>> call in question fails, the window must already be assigned, and in this
>> case, we still do not want 64-bit resources trying to be sized into the
>> 32-bit catch-all resource.
> 
> I guess this question of putting a 64-bit resource in the 32-bit
> non-prefetchable window (legal but undesirable) is a secondary thing,
> not the chief complaint you're fixing?

Uh, yeah no. The 64-bit resource(s) are typically correctly assigned to
the 64-bit window. However, the bug causes victim 32-bit non-prefetchable
resources to not be assigned because, when we calculate the size, the
code inadvertently thinks the 64-bit resource must fit into the 32-bit
non-prefetchable window.

--

Ok, so let me try to walk through this a bit more step by step. Let's
make the following assumptions:

1) Let's say we have a bridge that has a 64-bit prefetchable window,
such that (b_res[2].flags & IORESOURCE_MEM_64) is true. So the bridge
has three windows: the I/O window, the non-prefetchable 32-bit window
and the prefetchable 64-bit window.

2) Let's also say that, under the bridge, we have a resource that's
larger than we'd expect to fit into the 32-bit window. (So, the actual
limit depends on the maximum size of that window, which is hardware
dependent, and the size of everything else that goes in it, but for
simplicity I estimated this to be at least 1GB.) For the purposes of
this example, let's say it's very large: 64GB.

3) Now, the tricky thing is that this code tries to do things in
multiple passes, unassigning resources that didn't seem to fit and
recalculating window sizes on multiple passes. So let's say we are on the
second pass where, previously, the 64-bit prefetchable window on this
bridge was successfully assigned but, for whatever reason, this bridge
failed and the resources were released (see
pci_bus_release_bridge_resources()). In this case (b_res[2].parent !=
NULL) and the large 64-bit resource now has (res->parent == NULL).

Walking through __pci_bus_size_bridges():

i) Per (1), we do have a prefetchable window, so the first call to
pbus_size_mem() happens. However, the first thing that function does is
call find_free_bus_resource(), which should find b_res[2] but does not
because, per (3), its parent is not NULL (see the comment above
find_free_bus_resource() which makes it seem important that parent not
be set). Thus this call to pbus_size_mem() fails with -ENOSPC, and
'type2' and 'type3' remain unset.
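
For reference, find_free_bus_resource() is roughly the following
(paraphrased from memory of setup-bus.c, not a verbatim copy):

	static struct resource *find_free_bus_resource(struct pci_bus *bus,
				unsigned long type_mask, unsigned long type)
	{
		int i;
		struct resource *r;

		pci_bus_for_each_resource(bus, r, i) {
			if (r == &ioport_resource || r == &iomem_resource)
				continue;
			/* a window only counts as "free" if its parent is
			 * not set, even when the type matches */
			if (r && (r->flags & type_mask) == type &&
			    !r->parent)
				return r;
		}

		return NULL;
	}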

ii) Seeing type2 is unset, we go to the second call to pbus_size_mem().
This call fails because per (1), there is no 32-bit prefetchable
resource to find. So find_free_bus_resource() fails, as it should.
'type2' and 'type3' are now set to IORESOURCE_MEM.

iii) Next we do the third call to pbus_size_mem() for the
non-prefetchable window, however, because the first two calls failed, it
will calculate the size for the 32-bit window to be greater than 64GB.
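
Putting (i) through (iii) together, the relevant part of
__pci_bus_size_bridges() is roughly the following (simplified sketch;
only the first call is spelled out, comments stand in for the rest):

	mask = IORESOURCE_MEM;
	prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;

	if (b_res[2].flags & IORESOURCE_MEM_64) {
		prefmask |= IORESOURCE_MEM_64;

		/* (i) size the 64-bit prefetchable window */
		ret = pbus_size_mem(bus, prefmask, prefmask,
				    prefmask, prefmask,
				    realloc_head ? 0 : additional_mem_size,
				    additional_mem_size, realloc_head);

		/* the types are only narrowed on success -- this is the
		 * guard the patch removes */
		if (ret == 0) {
			mask = prefmask;
			type2 = prefmask & ~IORESOURCE_MEM_64;
			type3 = prefmask & ~IORESOURCE_PREFETCH;
		}
	}

	/* (ii) if type2 is still unset, try a 32-bit prefetchable window
	 * instead; per (1) there isn't one, so this fails and leaves
	 * type2 = type3 = IORESOURCE_MEM */

	/* (iii) size the non-prefetchable window for everything not
	 * matched by mask/type2/type3 -- if (i) failed, that now
	 * includes the huge 64-bit prefetchable BAR */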

As the code recurses up into the rest of the PCI tree, nothing will fit
into the 32-bit window, seeing as it's calculated to be much larger than
it needs to be, so none of the 32-bit non-prefetchable BARs will be
assigned. They thus will not have struct resources, and they will be
reported as ignored by lspci. Drivers that try to use these resources
will also fail, and there will be addresses in the IOVA space that may
get routed to the wrong PCI address.

My solution to this bug was to notice that pbus_size_mem() can only fail
for one of two reasons: either the resource doesn't exist at all or it
is already assigned (ie. the resource's parent is still set). However,
for the first call in (i), we know the resource does exist because we
check for it (ie. the condition in (1)). Therefore we can say that if
pbus_size_mem() fails in (i) there is a 64-bit window and we should not
try to size the 32-bit window to fit 64-bit resources. We do this simply
by setting 'mask', 'type2' and 'type3' even when pbus_size_mem() fails.

Another potential solution would be to always unassign the windows for
the failing bridges by setting its resources' parents to NULL. But this
makes me much more nervous because I don't understand what all the
implications would be and the comment above find_free_bus_resource()
makes me think this is important.
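
To be concrete, that alternative would be roughly something like this
(purely hypothetical and untested):

	/* Hypothetical alternative, NOT what this patch does: detach the
	 * already-assigned 64-bit window so find_free_bus_resource() can
	 * see it again on the next pass. */
	if (b_res[2].parent)
		release_resource(&b_res[2]);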

Let me know if this makes more sense or you have further questions.

Also, the second patch in this series addresses a similar bug with
*very* similar symptoms, but I think it's easier to understand. It has a
completely different cause and only happens on one particular piece of
hardware that I'm aware of; and the driver for that hardware doesn't use
the BAR at all right now, so the only real negative effect is on the
IOVA address space.

Thanks,

Logan

[*1] This is probably another big bug due to its effect on the IOVA
address space, but I'm not sure what to do about it, and with these
patches we don't have any further issues. However, maybe we should be
scanning the tree after everything is said and done and unsetting any
BARs that do not have corresponding resources. That way, if there are
other bugs that can cause ignored regions, or if our algorithm does
something that doesn't match the bios, there aren't random failures when
using the IOMMU.
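
Something vaguely like the following, say -- completely untested and
ignoring 64-bit upper halves, ROM BARs, decode enables and so on, so
treat it as a sketch of the idea only:

	static int clear_unassigned_bars(struct pci_dev *dev, void *data)
	{
		int i;

		for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
			struct resource *res = &dev->resource[i];

			/* skip BARs that don't exist or were assigned */
			if (!(res->flags & IORESOURCE_MEM) || res->parent)
				continue;

			/* wipe the stale bios value so nothing decodes it */
			pci_write_config_dword(dev,
					       PCI_BASE_ADDRESS_0 + i * 4, 0);
		}

		return 0;
	}

	/* e.g. pci_walk_bus(bus, clear_unassigned_bars, NULL); */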
Bjorn Helgaas March 4, 2019, 8:11 p.m. UTC | #3
On Mon, Mar 4, 2019 at 1:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
> > Sorry for the delay.  This code gives a headache.  I still remember
> > my headache from the last time we touched it.  Help me understand
> > what's going on here.
>
> Yes, this code gave me a headache debugging it too. And it's not the
> first time I've tried to figure out what's going on with it because it
> often just prints noisy messages that look like errors. I think I
> understand it better now but it's something that's a bit fleeting and
> easy to forget the details of. There may also be other solutions to this
> problem.

Thanks for the explanation below.  I haven't worked through it yet, but I will.

Obviously it would be far better than an explanation if we could
simplify the code (and the noisy messages) such that it didn't
*require* so much explanation.

Bjorn
Logan Gunthorpe March 4, 2019, 8:21 p.m. UTC | #4
On 2019-03-04 1:11 p.m., Bjorn Helgaas wrote:
> On Mon, Mar 4, 2019 at 1:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>> On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
>>> Sorry for the delay.  This code gives a headache.  I still remember
>>> my headache from the last time we touched it.  Help me understand
>>> what's going on here.
>>
>> Yes, this code gave me a headache debugging it too. And it's not the
>> first time I've tried to figure out what's going on with it because it
>> often just prints noisy messages that look like errors. I think I
>> understand it better now but it's something that's a bit fleeting and
>> easy to forget the details of. There may also be other solutions to this
>> problem.
> 
> Thanks for the explanation below.  I haven't worked through it yet, but I will.
> 
> Obviously it would be far better than an explanation if we could
> simplify the code (and the noisy messages) such that it didn't
> *require* so much explanation.

I agree, but reworking this code scares me and I suspect it was designed
this way for a reason. I'm guessing there are a lot of corner cases and
unusual bios issues this stuff works around. We might end up fixing
some cases and breaking a bunch of other cases.

It would probably be a lot simpler if (for 'realloc', at least) it
unassigned everything and then only did one pass. It wouldn't make the
code itself much simpler, but it might make it easier to reason about
and debug; and would also remove a lot of the noisy messages. I suspect
the multi-pass setup would still be required for cases where the bios
doesn't assign a device or whatever it's doing and it was probably
designed this way to try and keep as many of the bios assignments the
same as possible, though I'm not really sure if that would be necessary.

Logan
Bjorn Helgaas March 4, 2019, 8:29 p.m. UTC | #5
On Mon, Mar 4, 2019 at 2:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> On 2019-03-04 1:11 p.m., Bjorn Helgaas wrote:
> > On Mon, Mar 4, 2019 at 1:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> >> On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
> >>> Sorry for the delay.  This code gives a headache.  I still remember
> >>> my headache from the last time we touched it.  Help me understand
> >>> what's going on here.
> >>
> >> Yes, this code gave me a headache debugging it too. And it's not the
> >> first time I've tried to figure out what's going on with it because it
> >> often just prints noisy messages that look like errors. I think I
> >> understand it better now but it's something that's a bit fleeting and
> >> easy to forget the details of. There may also be other solutions to this
> >> problem.
> >
> > Thanks for the explanation below.  I haven't worked through it yet, but I will.
> >
> > Obviously it would be far better than an explanation if we could
> > simplify the code (and the noisy messages) such that it didn't
> > *require* so much explanation.
>
> I agree, but reworking this code scares me and I suspect it was designed
> this way for a reason. I'm guessing there are a lot of corner cases and
> unusual bios issues this stuff works around. We might end up fixing a
> some cases and breaking a bunch of other cases.

Scares me too, which is one reason I haven't done anything about it.

I didn't mean to suggest that you should rework it for *this* issue.
I just keep hoping that we can chip away at teensy pieces and in ten
or twenty years maybe make some headway.

Bjorn
Logan Gunthorpe March 4, 2019, 8:39 p.m. UTC | #6
On 2019-03-04 1:29 p.m., Bjorn Helgaas wrote:
>> I agree, but reworking this code scares me and I suspect it was designed
>> this way for a reason. I'm guessing there are a lot of corner cases and
>> unusual bios issues this stuff works around. We might end up fixing a
>> some cases and breaking a bunch of other cases.
> 
> Scares me too, which is one reason I haven't done anything about it.
> 
> I didn't mean to suggest that you should rework it for *this* issue.
> I just keep hoping that we can chip away at teensy pieces and in ten
> or twenty years maybe make some headway.

Sure. Just trying to brainstorm some ideas.

Another idea to chip away at things might be that, instead of
pbus_size_mem() trying to find an appropriate bridge resource by
looping, we simply pass the resource it's supposed to use (which I
suspect __pci_bus_size_bridges() should be able to figure out ahead of
time). So instead of guessing and testing for a bunch of different
resource window types we might have, we'd just loop through the actually
available windows and group the resources into what we have. Once we've
sorted out these patches, when I have some free time, I might try
working out a cleanup patch in this direction that we could test and
merge slowly.
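
In code terms I'm imagining an interface along these lines (purely
hypothetical, name invented for illustration only):

	/*
	 * Hypothetical: the caller (__pci_bus_size_bridges()) picks the
	 * bridge window and the resource types that should be counted
	 * into it, so the sizing function no longer has to guess via
	 * find_free_bus_resource().
	 */
	static int pbus_size_window(struct pci_bus *bus,
				    struct resource *b_res,
				    unsigned long type_mask,
				    resource_size_t min_size,
				    resource_size_t add_size,
				    struct list_head *realloc_head);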

Logan
Logan Gunthorpe March 4, 2019, 11:58 p.m. UTC | #7
On 2019-03-04 12:21 p.m., Logan Gunthorpe wrote:
> The system we hit this bug on is quite large and complex with multiple
> layers of switches though I suspect I might have seen it on a completely
> different system but never had time to dig into it. I guess I could try
> to find a case in which qemu can hit it.

Ok, I was able to hit this bug with QEMU and found out a bunch of
minutiae about it.

For starters, pci=realloc is not really what I (or others) had expected:
it really just tries *harder* to re-assign any resources the bios failed
to assign. So really all this mess in setup-bus is strictly there to
work around bios issues. I was also expecting it to be able to insert
gaps for hotplug with the hpmemsize parameter, but if the bios assigns
all the devices correctly, adding parameters such as
"pci=realloc,hpmemsize=8M" will actually do nothing, on x86 at least.

So to hit this bug in real life, the bios has to fail to assign a large
64-bit BAR and also somehow leave 32-bit non-prefetchable resources
needing to be re-allocated (possibly as a result of other failed
assignments). So I suspect that on the system we hit this bug on, the
bios failed to allocate a bunch of resources, which pci=realloc was
*mostly* able to fix up; but this bug caused it to "ignore" a bunch of
unrelated resources.

As QEMU does not have a broken bios, and always seems to do sensible
things, reproducing the bug is a bit difficult. To hit the bug, I had to
apply the hack patch given at the end of the email to release a large
BAR as well as the 32-bit non-prefetchable memory for the entire bus
before executing the rest of the code in
pci_assign_unassigned_root_bus_resources().

Using the following QEMU command:

  qemu-system-x86_64 -enable-kvm -s -m 2048 $IMAGE \
    -device pcie-root-port,id=root_port1,chassis=1,slot=1 \
    -device x3130-upstream,id=upstream_port1,bus=root_port1 \
    -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=2,slot=1 \
    -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=2,slot=2 \
    -drive file=${IMGDIR}/nvme.qcow2,if=none,id=nvme1,snapshot=on \
    -device nvme,drive=nvme1,serial=nvme1,cmb_size_mb=2048,bus=downstream_port1 \
    -drive file=${IMGDIR}/nvme2.qcow2,if=none,id=nvme2,snapshot=on \
    -device nvme,drive=nvme2,serial=nvme1,bus=downstream_port2 \
    -virtfs local,id=home,path=/home/,security_model=mapped,mount_tag=home \
    -nographic \
    -serial mon:stdio \
    -kernel $KERNEL \
    -append "root=/dev/sda2 rootfstype=ext4 console=ttyS0,38400n8"

we can emulate an NVMe device with a 2GB CMB BAR underneath a PCIe
switch with a sibling basic NVMe device for demonstration purposes. The
hack releases the 2GB BAR and all the 32-bit resources in the bus which
*should* be easily reassigned correctly by the code in setup-bus.c
because QEMU had originally found addresses for them. (Note: we do not
even need "pci=realloc" on the command line to hit the bug with this setup.)

When booted, lspci for the NVMe devices shows that all non-prefetchable
resources under the switch were now ignored, but the large 2GB bar was
able to be correctly re-assigned:

  03:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
	Subsystem: Red Hat, Inc QEMU Virtual Machine
	Flags: fast devsel, IRQ 11
	Memory at <ignored> (64-bit, non-prefetchable)
	Memory at 100000000 (64-bit, prefetchable) [size=2G]
	Memory at <ignored> (32-bit, non-prefetchable)
	Capabilities: [40] MSI-X: Enable- Count=64 Masked-
	Capabilities: [80] Express Endpoint, MSI 00

  04:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
	Subsystem: Red Hat, Inc QEMU Virtual Machine
	Flags: fast devsel, IRQ 10
	Memory at <ignored> (64-bit, non-prefetchable)
	Memory at <ignored> (32-bit, non-prefetchable)
	Capabilities: [40] MSI-X: Enable- Count=64 Masked-
	Capabilities: [80] Express Endpoint, MSI 00

After applying patch 1 in this series, the same setup correctly assigns
all resources:

  03:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
	Subsystem: Red Hat, Inc QEMU Virtual Machine
	Flags: bus master, fast devsel, latency 0, IRQ 11
	Memory at 80000000 (64-bit, non-prefetchable) [size=8K]
	Memory at 100000000 (64-bit, prefetchable) [size=2G]
	Memory at 80002000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [40] MSI-X: Enable+ Count=64 Masked-
	Capabilities: [80] Express Endpoint, MSI 00
	Kernel driver in use: nvme

  04:00.0 Non-Volatile memory controller: Intel Corporation QEMU NVM Express Controller (rev 02) (prog-if 02 [NVM Express])
	Subsystem: Red Hat, Inc QEMU Virtual Machine
	Flags: fast devsel, IRQ 10
	Memory at 80200000 (64-bit, non-prefetchable) [size=8K]
	Memory at 80202000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [40] MSI-X: Enable- Count=64 Masked-
	Capabilities: [80] Express Endpoint, MSI 00


Logan

--

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ed960436df5e..3ab1b9f9df49 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
 @@ -1744,6 +1744,24 @@ static enum enable_type pci_realloc_detect(struct pci_bus *bus,
 }
 #endif

+static void unassign_qemu_device_hack(struct pci_bus *bus)
+{
+       struct device *dev;
+       struct pci_dev *pdev;
+
+       dev = bus_find_device_by_name(&pci_bus_type, NULL, "0000:03:00.0");
+       if (!dev)
+               return;
+
+       pdev = to_pci_dev(dev);
+
+       pci_info(pdev, "---Releasing BAR 2\n");
+       release_resource(&pdev->resource[2]);
+
+       pci_bus_release_bridge_resources(bus, IORESOURCE_PREFETCH,
+                                        whole_subtree);
+}
+
 /*
  * first try will not touch pci bridge res
  * second and later try will clear small leaf bridge res
 @@ -1761,6 +1779,8 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
        int pci_try_num = 1;
        enum enable_type enable_local;

+       unassign_qemu_device_hack(bus);
+
        /* don't realloc if asked to do so */
        enable_local = pci_realloc_detect(bus, pci_realloc_enable);
        if (pci_realloc_enabled(enable_local)) {
Logan Gunthorpe April 1, 2019, 7:22 p.m. UTC | #8
Hey Bjorn,

On 2019-03-04 1:11 p.m., Bjorn Helgaas wrote:
> On Mon, Mar 4, 2019 at 1:21 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>> On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
>>> Sorry for the delay.  This code gives a headache.  I still remember
>>> my headache from the last time we touched it.  Help me understand
>>> what's going on here.
>>
>> Yes, this code gave me a headache debugging it too. And it's not the
>> first time I've tried to figure out what's going on with it because it
>> often just prints noisy messages that look like errors. I think I
>> understand it better now but it's something that's a bit fleeting and
>> easy to forget the details of. There may also be other solutions to this
>> problem.
> 
> Thanks for the explanation below.  I haven't worked through it yet, but I will.

Any chance you've had time to go through this issue yet? Is there
anything I should do with it? Do you want me to try to improve the
commit messages and resend the patches?

Thanks,

Logan

Patch

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ed960436df5e..56b7077f37ff 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1265,21 +1265,20 @@  void __pci_bus_size_bridges(struct pci_bus *bus, struct list_head *realloc_head)
 		prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
 		if (b_res[2].flags & IORESOURCE_MEM_64) {
 			prefmask |= IORESOURCE_MEM_64;
-			ret = pbus_size_mem(bus, prefmask, prefmask,
+			pbus_size_mem(bus, prefmask, prefmask,
 				  prefmask, prefmask,
 				  realloc_head ? 0 : additional_mem_size,
 				  additional_mem_size, realloc_head);
 
 			/*
-			 * If successful, all non-prefetchable resources
-			 * and any 32-bit prefetchable resources will go in
-			 * the non-prefetchable window.
+			 * Given the existence of a 64-bit resource for this
+			 * bus, all non-prefetchable resources and any 32-bit
+			 * prefetchable resources will go in the
+			 * non-prefetchable window.
 			 */
-			if (ret == 0) {
-				mask = prefmask;
-				type2 = prefmask & ~IORESOURCE_MEM_64;
-				type3 = prefmask & ~IORESOURCE_PREFETCH;
-			}
+			mask = prefmask;
+			type2 = prefmask & ~IORESOURCE_MEM_64;
+			type3 = prefmask & ~IORESOURCE_PREFETCH;
 		}
 
 		/*