
[8/9] PCI: Ignore BAR contents when firmware left decoding disabled

Message ID 20140226193757.10125.81865.stgit@bhelgaas-glaptop.roam.corp.google.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Bjorn Helgaas Feb. 26, 2014, 7:37 p.m. UTC
Don't rely on BAR contents when the command register says the BAR is
disabled.

If we receive a PCI device from firmware (or a hot-added device that was
just powered up) with the MEMORY or IO enable bits in the PCI command
register cleared, there's no reason to believe the BARs contain valid
addresses.

In that case, we still know the type and size of the BAR, but this
patch marks the resource as "unset" so we have a chance to reassign it.

Historically, we often used "BAR == 0" to decide the BAR is invalid.  But 0
is a legal BAR value, especially if the host bridge translates addresses,
so I think it's better to decide based on the PCI command register, and
store the conclusion in the IORESOURCE_UNSET bit.

Reference: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679545
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=48451
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/probe.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
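
For illustration only, the rule described above can be sketched as a
standalone helper. bar_flags_from_command() is a hypothetical name, not a
function in drivers/pci/probe.c, and the constants are copied here on the
assumption that they match the kernel headers:

```c
#include <stdint.h>

/* Values mirrored from the kernel headers for illustration only. */
#define PCI_COMMAND_IO      0x1u        /* respond to I/O space */
#define PCI_COMMAND_MEMORY  0x2u        /* respond to memory space */
#define IORESOURCE_IO       0x00000100ul
#define IORESOURCE_MEM      0x00000200ul
#define IORESOURCE_UNSET    0x20000000ul

/*
 * Hypothetical helper: given the command register as firmware left it
 * and the resource type flags, mark the resource "unset" when decoding
 * for that type was disabled.  The BAR value itself is never consulted,
 * so a BAR containing 0 is not treated as invalid by this rule.
 */
static unsigned long bar_flags_from_command(uint16_t orig_cmd,
                                            unsigned long flags)
{
        if (flags & IORESOURCE_IO) {
                if (!(orig_cmd & PCI_COMMAND_IO))
                        flags |= IORESOURCE_UNSET;
        } else if (flags & IORESOURCE_MEM) {
                if (!(orig_cmd & PCI_COMMAND_MEMORY))
                        flags |= IORESOURCE_UNSET;
        }
        return flags;
}
```

The type bits (IO vs. MEM) survive untouched, which is what lets a later
pass reassign the resource with the right type.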


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ming Lei March 13, 2014, 8:51 a.m. UTC | #1
Hi Bjorn,

I found this patch broke virtio-pci devices.

On Thu, Feb 27, 2014 at 3:37 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> Don't rely on BAR contents when the command register says the BAR is
> disabled.
>
> If we receive a PCI device from firmware (or a hot-added device that was
> just powered up) with the MEMORY or IO enable bits in the PCI command
> register cleared, there's no reason to believe the BARs contain valid
> addresses.

According to the PCI Local Bus Specification, Rev. 3.0, both
PCI_COMMAND_IO and PCI_COMMAND_MEMORY should be
cleared after reset, so it looks like the patch sets IORESOURCE_UNSET
too early, because PCI drivers may call pci_enable_device()
(->pci_enable_resources()) to set those two bits of
PCI_COMMAND explicitly.

With this patch, a driver can no longer enable the device or its
resources with pci_enable_device(), because IORESOURCE_UNSET
has already been set.
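
A simplified model of the check I mean is below. It is not the real
pci_enable_resources(); struct toy_res and toy_enable_resources() are
made up for illustration, and the flag values are assumed to match the
kernel's:

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND_IO      0x1u
#define PCI_COMMAND_MEMORY  0x2u
#define IORESOURCE_IO       0x00000100ul
#define IORESOURCE_MEM      0x00000200ul
#define IORESOURCE_UNSET    0x20000000ul

/* Hypothetical stand-in for the flags of one BAR resource. */
struct toy_res {
        unsigned long flags;
};

/*
 * Returns 0 and stores the new command value in *cmd on success, or -1
 * (leaving *cmd alone) if any I/O or memory resource is still unset.
 */
static int toy_enable_resources(const struct toy_res *res, size_t n,
                                uint16_t *cmd)
{
        uint16_t new_cmd = *cmd;
        size_t i;

        for (i = 0; i < n; i++) {
                if (!(res[i].flags & (IORESOURCE_IO | IORESOURCE_MEM)))
                        continue;       /* unimplemented BAR */
                if (res[i].flags & IORESOURCE_UNSET)
                        return -1;      /* no address assigned: refuse */
                if (res[i].flags & IORESOURCE_IO)
                        new_cmd |= PCI_COMMAND_IO;
                if (res[i].flags & IORESOURCE_MEM)
                        new_cmd |= PCI_COMMAND_MEMORY;
        }
        *cmd = new_cmd;
        return 0;
}
```

In this model the enable only fails while a resource is still marked
unset; once an address has been assigned and the flag cleared, the same
call succeeds.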

>
> In that case, we still know the type and size of the BAR, but this
> patch marks the resource as "unset" so we have a chance to reassign it.
>
> Historically, we often used "BAR == 0" to decide the BAR is invalid.  But 0
> is a legal BAR value, especially if the host bridge translates addresses,
> so I think it's better to decide based on the PCI command register, and
> store the conclusion in the IORESOURCE_UNSET bit.
>
> Reference: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=679545
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=48451
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/probe.c |    8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 6e34498ec9f0..02654b5ec1b9 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -177,9 +177,10 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>
>         mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
>
> +       pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> +
>         /* No printks while decoding is disabled! */
>         if (!dev->mmio_always_on) {
> -               pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
>                 if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
>                         pci_write_config_word(dev, PCI_COMMAND,
>                                 orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> @@ -215,9 +216,13 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>                 if (res->flags & IORESOURCE_IO) {
>                         l &= PCI_BASE_ADDRESS_IO_MASK;
>                         mask = PCI_BASE_ADDRESS_IO_MASK & (u32) IO_SPACE_LIMIT;
> +                       if (!(orig_cmd & PCI_COMMAND_IO))
> +                               res->flags |= IORESOURCE_UNSET;
>                 } else {
>                         l &= PCI_BASE_ADDRESS_MEM_MASK;
>                         mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
> +                       if (!(orig_cmd & PCI_COMMAND_MEMORY))
> +                               res->flags |= IORESOURCE_UNSET;
>                 }
>         } else {
>                 res->flags |= (l & IORESOURCE_ROM_ENABLE);
> @@ -252,6 +257,7 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>                         /* Address above 32-bit boundary; disable the BAR */
>                         pci_write_config_dword(dev, pos, 0);
>                         pci_write_config_dword(dev, pos + 4, 0);
> +                       res->flags |= IORESOURCE_UNSET;
>                         region.start = 0;
>                         region.end = sz64;
>                         bar_disabled = true;
>


Thanks,
Bjorn Helgaas March 13, 2014, 4:08 p.m. UTC | #2
On Thu, Mar 13, 2014 at 2:51 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> Hi Bjorn,
>
> I found this patch broke virtio-pci devices.

Thanks a lot for testing this.

> On Thu, Feb 27, 2014 at 3:37 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> Don't rely on BAR contents when the command register says the BAR is
>> disabled.
>>
>> If we receive a PCI device from firmware (or a hot-added device that was
>> just powered up) with the MEMORY or IO enable bits in the PCI command
>> register cleared, there's no reason to believe the BARs contain valid
>> addresses.
>
> From PCI LOCAL BUS SPECIFICATION, REV. 3.0, both
> PCI_COMMAND_IO and PCI_COMMAND_MEM should be
> cleared after reset, so looks the patch sets IORESOURCE_UNSET
> too early because PCI drivers may call pci_enable_device()
> (->pci_enable_resources()) to enable the two bits of
> PCI_COMMAND explicitly.

The point is that it's not safe to enable those two bits unless we're
certain that the BARs they control contain valid values that don't
conflict with anything else in the system.

Maybe we should only set IORESOURCE_UNSET when we find a conflict or a
BAR that's not contained by an upstream bridge window, and we should
try to reallocate then.  I'm pretty sure we do that at least in some
cases, but it would probably simplify things if we did it more
consistently, and maybe we shouldn't set it at all here in
__pci_read_base().
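
Roughly, the alternative policy would look like the sketch below. This is
only an outline under my own assumptions: struct range and
bar_needs_reassignment() are hypothetical names, not existing PCI core
interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

struct range {
        uint64_t start;
        uint64_t end;   /* inclusive */
};

/* True if bar lies entirely inside the bridge window. */
static bool range_contains(const struct range *window,
                           const struct range *bar)
{
        return bar->start >= window->start && bar->end <= window->end;
}

/* True if the two inclusive ranges share at least one address. */
static bool range_overlaps(const struct range *a, const struct range *b)
{
        return a->start <= b->end && b->start <= a->end;
}

/*
 * Hypothetical policy: treat the firmware's BAR value as unusable only
 * if it falls outside the upstream window or collides with a resource
 * that has already been claimed.
 */
static bool bar_needs_reassignment(const struct range *window,
                                   const struct range *bar,
                                   const struct range *claimed,
                                   int nclaimed)
{
        int i;

        if (!range_contains(window, bar))
                return true;
        for (i = 0; i < nclaimed; i++)
                if (range_overlaps(bar, &claimed[i]))
                        return true;
        return false;
}
```

With a policy like this, the command register state would not matter at
all when deciding whether to keep a BAR value.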

But I'd like to understand your situation better, so can you provide
more details, please?  Complete before/after dmesg logs would go a
long way toward illustrating the problem you're seeing.

> With this patch, driver can't enable device/resource with
> pci_enable_device() any more because IORESOURCE_UNSET
> has been set already.

Ming Lei March 14, 2014, 1:48 a.m. UTC | #3
On Fri, Mar 14, 2014 at 12:08 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Thu, Mar 13, 2014 at 2:51 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> Hi Bjorn,
>>
>> I found this patch broke virtio-pci devices.
>
> Thanks a lot for testing this.
>
>> On Thu, Feb 27, 2014 at 3:37 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>> Don't rely on BAR contents when the command register says the BAR is
>>> disabled.
>>>
>>> If we receive a PCI device from firmware (or a hot-added device that was
>>> just powered up) with the MEMORY or IO enable bits in the PCI command
>>> register cleared, there's no reason to believe the BARs contain valid
>>> addresses.
>>
>> From PCI LOCAL BUS SPECIFICATION, REV. 3.0, both
>> PCI_COMMAND_IO and PCI_COMMAND_MEM should be
>> cleared after reset, so looks the patch sets IORESOURCE_UNSET
>> too early because PCI drivers may call pci_enable_device()
>> (->pci_enable_resources()) to enable the two bits of
>> PCI_COMMAND explicitly.
>
> The point is that it's not safe to enable those two bits unless we're
> certain that the BARs they control contain valid values that don't
> conflict with anything else in the system.
>
> Maybe we should only set IORESOURCE_UNSET when we find a conflict or a
> BAR that's not contained by an upstream bridge window, and we should
> try to reallocate then.  I'm pretty sure we do that at least in some
> cases, but it would probably simplify things if we did it more
> consistently, and maybe we shouldn't set it at all here in
> __pci_read_base().

I think so, because __pci_read_base() is called in the device emulation
path.

>
> But I'd like to understand your situation better, so can you provide
> more details, please?  Complete before/after dmesg logs would go a
> long way toward illustrating the problem you're seeing.

Please see the two attached logs. The memory allocation failure
is caused by mistaken value read from pci address after the device
is failed to enable.


Thanks,
Bjorn Helgaas March 18, 2014, 12:27 a.m. UTC | #4
On Fri, Mar 14, 2014 at 09:48:35AM +0800, Ming Lei wrote:
> On Fri, Mar 14, 2014 at 12:08 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Thu, Mar 13, 2014 at 2:51 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> >> Hi Bjorn,
> >>
> >> I found this patch broke virtio-pci devices.
> >
> > Thanks a lot for testing this.
> >
> >> On Thu, Feb 27, 2014 at 3:37 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >>> Don't rely on BAR contents when the command register says the BAR is
> >>> disabled.
> >>>
> >>> If we receive a PCI device from firmware (or a hot-added device that was
> >>> just powered up) with the MEMORY or IO enable bits in the PCI command
> >>> register cleared, there's no reason to believe the BARs contain valid
> >>> addresses.
> >>
> >> From PCI LOCAL BUS SPECIFICATION, REV. 3.0, both
> >> PCI_COMMAND_IO and PCI_COMMAND_MEM should be
> >> cleared after reset, so looks the patch sets IORESOURCE_UNSET
> >> too early because PCI drivers may call pci_enable_device()
> >> (->pci_enable_resources()) to enable the two bits of
> >> PCI_COMMAND explicitly.
> >
> > The point is that it's not safe to enable those two bits unless we're
> > certain that the BARs they control contain valid values that don't
> > conflict with anything else in the system.
> >
> > Maybe we should only set IORESOURCE_UNSET when we find a conflict or a
> > BAR that's not contained by an upstream bridge window, and we should
> > try to reallocate then.  I'm pretty sure we do that at least in some
> > cases, but it would probably simplify things if we did it more
> > consistently, and maybe we shouldn't set it at all here in
> > __pci_read_base().
> 
> I think so because __pci_read_base() is called in device emulation
> path.

Which path is this?  I don't know anything about virtio-pci, and I only see
calls to __pci_read_base() from:

  sriov_init()
  pci_sriov_resource_alignment()
  pci_read_bases()

> > But I'd like to understand your situation better, so can you provide
> > more details, please?  Complete before/after dmesg logs would go a
> > long way toward illustrating the problem you're seeing.
> 
> Please see the two attachment log. The memory allocation failure
> is caused by mistaken value read from pci address after the device
> is failed to enable.

Your logs are harder than necessary to compare because one has a lot more
debug turned on than the other.

In the failing case, we ignore all the initial BAR values, but we do assign
values to all of them later:

  pci 0000:00:00.0: can't claim BAR 0 [mem size 0x00000400]: no address assigned
  pci 0000:00:00.0: can't claim BAR 1 [io  size 0x0400]: no address assigned
  ...
  pci 0000:00:00.0: BAR 0: assigned [mem 0x40000000-0x400003ff]
  pci 0000:00:00.0: BAR 1: assigned [io  0x1000-0x13ff]
  ...

The newly-assigned values look valid, and as far as I can tell, they should
work.  Do you know why they don't?  Is there an assumption somewhere that
we never change BAR values?
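
By "look valid" I mean checks along the lines of this sketch.
range_size() and assignment_plausible() are hypothetical helpers, not
kernel functions; the only fact relied on is that a BAR decodes a
power-of-two sized region aligned to its own size:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of an inclusive [start, end] range. */
static uint64_t range_size(uint64_t start, uint64_t end)
{
        assert(end >= start);
        return end - start + 1;
}

/*
 * An assignment is plausible if the assigned range has exactly the
 * BAR's size and its start is aligned to that size.
 */
static bool assignment_plausible(uint64_t start, uint64_t end,
                                 uint64_t bar_size)
{
        if (bar_size == 0 || (bar_size & (bar_size - 1)))
                return false;           /* not a power of two */
        return range_size(start, end) == bar_size &&
               (start & (bar_size - 1)) == 0;
}
```

Both ranges in the log above pass: [mem 0x40000000-0x400003ff] and
[io 0x1000-0x13ff] are each 0x400 bytes long and 0x400-aligned.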

Even if we don't need to ignore BAR values in as many cases as we do, it
should be legal to ignore them and reassign them, so I want to understand
what's going on here before reverting this.

Is there an easy way I can reproduce the problem on my own box?

Bjorn
Ming Lei March 19, 2014, 3:32 a.m. UTC | #5
On Tue, Mar 18, 2014 at 8:27 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, Mar 14, 2014 at 09:48:35AM +0800, Ming Lei wrote:
>> On Fri, Mar 14, 2014 at 12:08 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> > On Thu, Mar 13, 2014 at 2:51 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> >> Hi Bjorn,
>> >>
>> >> I found this patch broke virtio-pci devices.
>> >
>> > Thanks a lot for testing this.
>> >
>> >> On Thu, Feb 27, 2014 at 3:37 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> >>> Don't rely on BAR contents when the command register says the BAR is
>> >>> disabled.
>> >>>
>> >>> If we receive a PCI device from firmware (or a hot-added device that was
>> >>> just powered up) with the MEMORY or IO enable bits in the PCI command
>> >>> register cleared, there's no reason to believe the BARs contain valid
>> >>> addresses.
>> >>
>> >> From PCI LOCAL BUS SPECIFICATION, REV. 3.0, both
>> >> PCI_COMMAND_IO and PCI_COMMAND_MEM should be
>> >> cleared after reset, so looks the patch sets IORESOURCE_UNSET
>> >> too early because PCI drivers may call pci_enable_device()
>> >> (->pci_enable_resources()) to enable the two bits of
>> >> PCI_COMMAND explicitly.
>> >
>> > The point is that it's not safe to enable those two bits unless we're
>> > certain that the BARs they control contain valid values that don't
>> > conflict with anything else in the system.
>> >
>> > Maybe we should only set IORESOURCE_UNSET when we find a conflict or a
>> > BAR that's not contained by an upstream bridge window, and we should
>> > try to reallocate then.  I'm pretty sure we do that at least in some
>> > cases, but it would probably simplify things if we did it more
>> > consistently, and maybe we shouldn't set it at all here in
>> > __pci_read_base().
>>
>> I think so because __pci_read_base() is called in device emulation
>> path.
>
> Which path is this?  I don't know anything about virtio-pci, and I only see
> calls to __pci_read_base() from:
>
>   sriov_init()
>   pci_sriov_resource_alignment()
>   pci_read_bases()
>
>> > But I'd like to understand your situation better, so can you provide
>> > more details, please?  Complete before/after dmesg logs would go a
>> > long way toward illustrating the problem you're seeing.
>>
>> Please see the two attachment log. The memory allocation failure
>> is caused by mistaken value read from pci address after the device
>> is failed to enable.
>
> Your logs are harder than necessary to compare because one has a lot more
> debug turned on than the other.
>
> In the failing case, we ignore all the initial BAR values, but we do assign
> values to all of them later:
>
>   pci 0000:00:00.0: can't claim BAR 0 [mem size 0x00000400]: no address assigned
>   pci 0000:00:00.0: can't claim BAR 1 [io  size 0x0400]: no address assigned
>   ...
>   pci 0000:00:00.0: BAR 0: assigned [mem 0x40000000-0x400003ff]
>   pci 0000:00:00.0: BAR 1: assigned [io  0x1000-0x13ff]
>   ...
>
> The newly-assigned values look valid, and as far as I can tell, they should
> work.  Do you know why they don't?  Is there an assumption somewhere that
> we never change BAR values?

I don't know the cause; maybe it is related to the hypervisor
implementation.

>
> Even if we don't need to ignore BAR values in as many cases as we do, it
> should be legal to ignore them and reassign them, so I want to understand
> what's going on here before reverting this.
>
> Is there an easy way I can reproduce the problem on my own box?

It is not difficult: you can build lkvm by following the README at the
link below and test the -next tree on that small KVM hypervisor:

     https://github.com/penberg/linux-kvm/blob/master/tools/kvm/README

Thanks,
Ming Lei March 19, 2014, 4:52 a.m. UTC | #6
Hi,

It looks like Sasha fixed the problem in the lkvm tool [1].

Sasha, it looks like we both saw the problem, but from a technical
point of view I wonder whether the fix is correct: the PCI spec
requires that the IO/MMIO bits in the COMMAND register be
cleared after reset, so there may be a latent problem
in the lkvm PCI emulation.


[1] commit 6478ce1416aacf1ce35530f79ea035f89fb21e90
Author: Sasha Levin <sasha.levin@oracle.com>
Date:   Wed Mar 5 23:08:16 2014 -0500

    kvm tools: mark our PCI card as PIO and MMIO able


Thanks,
--
Ming Lei

Bjorn Helgaas March 19, 2014, 4:45 p.m. UTC | #7
On Tue, Mar 18, 2014 at 10:52 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> Hi,
>
> Looks Sasha fixed the problem in lkvm tool[1].
>
> Sasha, looks we both saw the problem, but from technical
> view, I am wondering if the fix is correct, because PCI spec.
> requires that the IO/MMIO bits in COMMAND register should
> be cleared after reset, maybe there are some potential problem
> in lkvm pci emulation.

I think I'm going to revert this patch ([2], "Ignore BAR contents when
firmware left decoding disabled").  The main reason for that patch was
to try for a consistent way of figuring out whether BARs are valid
that we could use on all architectures, but I think we can do it in a
better way.

That said, this kvm change should not be necessary.  We *should* be
able to take any PCI device and initialize it from power-on state
without any dependencies on what the BIOS left in the BARs or the
command register.  As far as I can tell, the PCI core actually worked
fine in this case (we assigned valid addresses to the devices), but
something else blew up.  If I revert that patch, it will cover up
whatever this other bug is, but it would be much better to figure out
what it is and fix it.
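
To make that expectation concrete, here is a toy model of one I/O BAR on
an emulated device.  It reflects my assumption of how a well-behaved
device model acts, not lkvm's actual code, and every name in it is
invented:

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_IO  0x1u

/* Toy model of one I/O BAR on an emulated device (not lkvm code). */
struct toy_dev {
        uint16_t command;       /* PCI command register */
        uint32_t io_base;       /* address currently programmed in the BAR */
        uint32_t io_size;       /* power-of-two size of the region */
};

/* The guest writes a new address into the BAR; low bits are read-only. */
static void toy_write_bar(struct toy_dev *d, uint32_t val)
{
        d->io_base = val & ~(d->io_size - 1);
}

/* The device claims an access only if decoding is on and addr is in range. */
static bool toy_claims_io(const struct toy_dev *d, uint32_t addr)
{
        if (!(d->command & PCI_COMMAND_IO))
                return false;
        return addr >= d->io_base && addr - d->io_base < d->io_size;
}
```

The device starts with decoding off, the OS programs the BAR and then
sets the enable bit, and from then on the device has to respond at the
new address rather than at whatever the BAR held before.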

You said earlier that "The memory allocation failure is caused by
mistaken value read from pci address after the device is failed to
enable."  Can you elaborate on that?  Are you saying that something
tried to read from a region mapped by a BAR even though
pci_enable_device() failed?  That would be a programming error, of
course.  If you have any more details about exactly where this
happened, that would help a lot in finding the problem.

Bjorn

[2] http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?id=5c89a9ee943d5e

> [1],  commit 6478ce1416aacf1ce35530f79ea035f89fb21e90
> Author: Sasha Levin <sasha.levin@oracle.com>
> Date:   Wed Mar 5 23:08:16 2014 -0500
>
>     kvm tools: mark our PCI card as PIO and MMIO able
>
>
> Thanks,
> --
> Ming Lei
Ming Lei March 20, 2014, 1:32 a.m. UTC | #8
On Thu, Mar 20, 2014 at 12:45 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Mar 18, 2014 at 10:52 PM, Ming Lei <tom.leiming@gmail.com> wrote:
>> Hi,
>>
>> Looks Sasha fixed the problem in lkvm tool[1].
>>
>> Sasha, looks we both saw the problem, but from technical
>> view, I am wondering if the fix is correct, because PCI spec.
>> requires that the IO/MMIO bits in COMMAND register should
>> be cleared after reset, maybe there are some potential problem
>> in lkvm pci emulation.
>
> I think I'm going to revert this patch ([2], "Ignore BAR contents when
> firmware left decoding disabled").  The main reason for that patch was
> to try for a consistent way of figuring out whether BARs are valid
> that we could use on all architectures, but I think we can do it in a
> better way.
>
> That said, this kvm change should not be necessary.  We *should* be
> able to take any PCI device and initialize it from power-on state
> without any dependencies on what the BIOS left in the BARs or the
> command register.  As far as I can tell, the PCI core actually worked
> fine in this case (we assigned valid addresses to the devices), but
> something else blew up.  If I revert that patch, it will cover up
> whatever this other bug is, but it would be much better to figure out
> what it is and fix is.
>
> You said earlier that "The memory allocation failure is caused by
> mistaken value read from pci address after the device is failed to
> enable."  Can you elaborate on that?  Are you saying that something

Sorry, I took that for granted.

> tried to read from a region mapped by a BAR even though
> pci_enable_device() failed?  That would be a programming error, of
> course.  If you have any more details about exactly where this
> happened, that would help a lot in finding the problem.

When I checked again, as you saw in the dmesg log after reverting, the
virtio device was enabled successfully and there was no obvious PCI
failure. The only problem is that the virtio driver reads a queue number
of zero from a region mapped by a BAR:

ioread16(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NUM)
         <- setup_vq(): drivers/virtio/virtio_pci.c

That causes the memory allocation failure.

Thanks,
Bjorn Helgaas March 21, 2014, 8:07 p.m. UTC | #9
[+cc kvm list]

On Wed, Mar 19, 2014 at 7:32 PM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 12:45 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Tue, Mar 18, 2014 at 10:52 PM, Ming Lei <tom.leiming@gmail.com> wrote:
>>> Hi,
>>>
>>> Looks Sasha fixed the problem in lkvm tool[1].
>>>
>>> Sasha, looks we both saw the problem, but from technical
>>> view, I am wondering if the fix is correct, because PCI spec.
>>> requires that the IO/MMIO bits in COMMAND register should
>>> be cleared after reset, maybe there are some potential problem
>>> in lkvm pci emulation.
>>
>> I think I'm going to revert this patch ([2], "Ignore BAR contents when
>> firmware left decoding disabled").  The main reason for that patch was
>> to try for a consistent way of figuring out whether BARs are valid
>> that we could use on all architectures, but I think we can do it in a
>> better way.
>>
>> That said, this kvm change should not be necessary.  We *should* be
>> able to take any PCI device and initialize it from power-on state
>> without any dependencies on what the BIOS left in the BARs or the
>> command register.  As far as I can tell, the PCI core actually worked
>> fine in this case (we assigned valid addresses to the devices), but
>> something else blew up.  If I revert that patch, it will cover up
>> whatever this other bug is, but it would be much better to figure out
>> what it is and fix it.

I think I figured out what the problem is.  In virtio_pci__init(), we
allocate some address space with pci_get_io_space_block(), save its
address in vpci->mmio_addr, and hook that address space up to
virtio_pci__io_mmio_callback with kvm__register_mmio().

But when we update the BAR value in pci__config_wr(), the address
space mapping is never updated.  I think this means that virtio-pci
can't tolerate its devices being moved by the OS.

In my opinion, this is a bug in linux-kvm.  We've managed to avoid
triggering this bug by preventing Linux from moving the BAR (either by
me reverting my patch, or by Sasha's linux-kvm change [1]).  But it's
not very robust to assume that the OS will never change the BAR, so
it's quite possible that you'll trip over this again in the future.
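
To illustrate the failure mode described above, here is a minimal,
self-contained sketch in C.  It is not the lkvm code: struct toy_dev and
the toy_* helpers are hypothetical names invented for this example, and
the rule that an unclaimed read returns 0 is an assumption of the toy
model (it would be consistent with the zero queue number Ming reported,
but I have not verified that this is what lkvm does).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of an emulated device with a single memory BAR. */
struct toy_dev {
	uint32_t bar;		/* BAR value the guest sees in config space */
	uint32_t size;		/* region size, must be a power of two */
	uint32_t mapped_at;	/* base address the emulator actually traps */
	uint16_t queue_num;	/* register at offset 0 of the region */
};

/* Buggy variant: record the new BAR value, leave the trapped range alone. */
static void toy_config_wr_stale(struct toy_dev *d, uint32_t val)
{
	assert(d->size && !(d->size & (d->size - 1)));
	d->bar = val & ~(d->size - 1);
}

/* Fixed variant: record the new BAR value and move the trapped range. */
static void toy_config_wr_remap(struct toy_dev *d, uint32_t val)
{
	assert(d->size && !(d->size & (d->size - 1)));
	d->bar = val & ~(d->size - 1);
	d->mapped_at = d->bar;
}

/*
 * Guest read of the register at offset 0.  The device only answers when
 * the address falls inside the range the emulator traps; in this toy
 * model (an assumption) any other address reads back as 0.
 */
static uint16_t toy_guest_read16(const struct toy_dev *d, uint32_t addr)
{
	if (addr >= d->mapped_at && addr - d->mapped_at < d->size)
		return d->queue_num;
	return 0;
}
```

With the stale variant, a guest that moves the BAR and then reads through
the new address never reaches the device, while the old address still
does.  The remap variant keeps the config-space value and the trapped
range in step, which is what an OS that reassigns BARs needs.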

Bjorn

[1] 6478ce1416aa kvm tools: mark our PCI card as PIO and MMIO able
Sasha Levin March 21, 2014, 8:25 p.m. UTC | #10
On 03/21/2014 04:07 PM, Bjorn Helgaas wrote:
> I think I figured out what the problem is.  In virtio_pci__init(), we
> allocate some address space with pci_get_io_space_block(), save its
> address in vpci->mmio_addr, and hook that address space up to
> virtio_pci__io_mmio_callback with kvm__register_mmio().
>
> But when we update the BAR value in pci__config_wr(), the address
> space mapping is never updated.  I think this means that virtio-pci
> can't tolerate its devices being moved by the OS.
>
> In my opinion, this is a bug in linux-kvm.  We've managed to avoid
> triggering this bug by preventing Linux from moving the BAR (either by
> me reverting my patch, or by Sasha's linux-kvm change [1]).  But it's
> not very robust to assume that the OS will never change the BAR, so
> it's quite possible that you'll trip over this again in the future.

The purpose of the KVM tool is to implement as much as possible of the KVM
interface and the virtio spec so that we'll have a good development/testing
environment with a very simple to understand codebase.

The issue you've mentioned is the "evil" side of the KVM tool. It never
tried (or claimed) to implement anything close to legacy hardware
interfaces. This means, for example, that it doesn't run any BIOS, APIC
support is very limited, and the kernel is just injected into the virtual
RAM and run from there.

It also means that we went into the PCI spec only deep enough to get the
code to work with the kernel. The only reason we implemented MSI interrupts,
for example, is that they provide improved performance with KVM, not
that we were trying to get a complete implementation of the PCI spec.

So yes, the PCI implementation in the KVM tool is lacking and what we
have there might be broken by making the kernel conform more closely
to the spec, but we are always happy to adapt and improve our code to
work with any changes in the kernel.

To sum it up, if you end up adding a change to the kernel that is
valid according to the spec but breaks the KVM tool, we'll just go ahead
and fix the tool. You really don't need to worry about breaking it.


Thanks,
Sasha
Bjorn Helgaas March 21, 2014, 8:40 p.m. UTC | #11
On Fri, Mar 21, 2014 at 2:25 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 03/21/2014 04:07 PM, Bjorn Helgaas wrote:
>>
>> I think I figured out what the problem is.  In virtio_pci__init(), we
>> allocate some address space with pci_get_io_space_block(), save its
>> address in vpci->mmio_addr, and hook that address space up to
>> virtio_pci__io_mmio_callback with kvm__register_mmio().
>>
>> But when we update the BAR value in pci__config_wr(), the address
>> space mapping is never updated.  I think this means that virtio-pci
>> can't tolerate its devices being moved by the OS.
>>
>> In my opinion, this is a bug in linux-kvm.  We've managed to avoid
>> triggering this bug by preventing Linux from moving the BAR (either by
>> me reverting my patch, or by Sasha's linux-kvm change [1]).  But it's
>> not very robust to assume that the OS will never change the BAR, so
>> it's quite possible that you'll trip over this again in the future.
>
>
> The purpose of KVM tool is to implement as much as possible of the KVM
> interface and the virtio spec so that we'll have a good development/testing
> environment with a very simple to understand codebase.
>
> The issue you've mentioned is the "evil" side of the KVM tool. It never
> tried (or claimed) to implement anything close to legacy hardware
> interfaces. This means, for example, that it doesn't run any BIOS, there's
> very lacking APIC support and the kernel is just injected into the virtual
> RAM and gets run from there.
>
> It also means that we went into the PCI spec deep enough to get the code
> to work with the kernel. The only reason we implemented MSI interrupts
> for example is because they provide improved performance with KVM, not
> because we were trying to get a complete implementation of the PCI spec.
>
> So yes, the PCI implementation in the KVM tool is lacking and what we
> have there might be broken by making the kernel conform more closely
> to the spec, but we are always happy to adapt and improve our code to
> work with any changes in the kernel.
>
> To sum it up, If you'll end up adding a change to the kernel that is
> valid according to the spec but breaks the KVM tool we'll just go ahead
> and fix the tool. You really don't need to worry about breaking it.

That makes sense, and I'm glad I had a chance to get acquainted with
the KVM tool.  If I get another problem report related to it, I'll try
to remember that I don't need to worry about breaking it :)

Bjorn

Patch

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 6e34498ec9f0..02654b5ec1b9 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -177,9 +177,10 @@  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 
 	mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
 
+	pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
+
 	/* No printks while decoding is disabled! */
 	if (!dev->mmio_always_on) {
-		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
 		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
 			pci_write_config_word(dev, PCI_COMMAND,
 				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
@@ -215,9 +216,13 @@  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 		if (res->flags & IORESOURCE_IO) {
 			l &= PCI_BASE_ADDRESS_IO_MASK;
 			mask = PCI_BASE_ADDRESS_IO_MASK & (u32) IO_SPACE_LIMIT;
+			if (!(orig_cmd & PCI_COMMAND_IO))
+				res->flags |= IORESOURCE_UNSET;
 		} else {
 			l &= PCI_BASE_ADDRESS_MEM_MASK;
 			mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
+			if (!(orig_cmd & PCI_COMMAND_MEMORY))
+				res->flags |= IORESOURCE_UNSET;
 		}
 	} else {
 		res->flags |= (l & IORESOURCE_ROM_ENABLE);
@@ -252,6 +257,7 @@  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 			/* Address above 32-bit boundary; disable the BAR */
 			pci_write_config_dword(dev, pos, 0);
 			pci_write_config_dword(dev, pos + 4, 0);
+			res->flags |= IORESOURCE_UNSET;
 			region.start = 0;
 			region.end = sz64;
 			bar_disabled = true;