diff mbox

[2/2] PCI: Stop clearing bridge Secondary Status when setting up I/O aperture

Message ID 20131206001954.27659.78163.stgit@bhelgaas-glaptop.roam.corp.google.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Bjorn Helgaas Dec. 6, 2013, 12:19 a.m. UTC
pci_setup_bridge_io() accessed PCI_IO_BASE and PCI_IO_LIMIT using dword
(32-bit) reads and writes, which also access the Secondary Status register.
Since the Secondary Status register is in the upper 16 bits of the dword,
and we preserved those upper 16 bits, this had the effect of clearing any
of the write-1-to-clear bits that happened to be set in the Secondary
Status register.

That's not what we want, so use word (16-bit) accesses to update only
PCI_IO_BASE and PCI_IO_LIMIT.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/setup-bus.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jason Gunthorpe Dec. 9, 2013, 7:33 p.m. UTC | #1
On Thu, Dec 05, 2013 at 05:19:55PM -0700, Bjorn Helgaas wrote:
> pci_setup_bridge_io() accessed PCI_IO_BASE and PCI_IO_LIMIT using dword
> (32-bit) reads and writes, which also access the Secondary Status register.
> Since the Secondary Status register is in the upper 16 bits of the dword,
> and we preserved those upper 16 bits, this had the effect of clearing any
> of the write-1-to-clear bits that happened to be set in the Secondary
> Status register.

This is a good catch!

> -	pci_write_config_dword(bridge, PCI_IO_BASE, l);
> +	pci_write_config_word(bridge, PCI_IO_BASE, l);

But this is a problem :(

tegra and mvebu at least do not have HW to do non-32 bit writes, so
their implementation of pci_write_config_word does the RMW internally
and will still have this same bug.

I think you have to keep the 32 bit write here, but zero the
write-one-to-clear bits :(

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 9, 2013, 8:15 p.m. UTC | #2
On Mon, Dec 9, 2013 at 12:33 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Dec 05, 2013 at 05:19:55PM -0700, Bjorn Helgaas wrote:
>> pci_setup_bridge_io() accessed PCI_IO_BASE and PCI_IO_LIMIT using dword
>> (32-bit) reads and writes, which also access the Secondary Status register.
>> Since the Secondary Status register is in the upper 16 bits of the dword,
>> and we preserved those upper 16 bits, this had the effect of clearing any
>> of the write-1-to-clear bits that happened to be set in the Secondary
>> Status register.
>
> This is a good catch!
>
>> -     pci_write_config_dword(bridge, PCI_IO_BASE, l);
>> +     pci_write_config_word(bridge, PCI_IO_BASE, l);
>
> But this is a problem :(
>
> tegra and mvebu at least do not have HW to do non-32 bit writes, so
> their implementation of pci_write_config_word does the RMW internally
> and will still have this same bug.
>
> I think you have to keep the 32 bit write here, but zero the
> write-one-to-clear bits :(

Is there actually some spec requirement about the access sizes that
must be supported by the hardware?  If there is something, I'll gladly
keep the 32-bit access, but if it's only a tegra/mvebu-specific
restriction, then I think it needs to be handled in the PCI config
accessors for that hardware.

This spec statement (PCI (not PCIe) r3.0, sec 3.2.2.3.4) makes it
sound like tegra/mvebu is non-conforming and should deal with this
specially:

"A function must not restrict the size of the access it supports in
Configuration Space. The configuration commands, like other commands,
allow data to be accessed using any combination of bytes (including a
byte, word, DWORD, or non-contiguous bytes) and multiple data phases
in a burst. The target is required to handle any combination of byte
enables."

I think the tegra/mvebu accessors should be able to use the address
and access size to figure out what we're trying to do.  If we're doing
a WORD write to PCI_IO_BASE, it can turn that into: 32-bit read, clear
the upper 16 bits (secondary status), insert the PCI_IO_BASE part,
32-bit write.  If we're doing a 16-bit write to PCI_SEC_STATUS, it can
turn that into: 32-bit read, insert upper 16 bits (secondary status),
leave lower 16 bits alone, 32-bit write.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Dec. 9, 2013, 9:27 p.m. UTC | #3
On Mon, Dec 09, 2013 at 01:15:52PM -0700, Bjorn Helgaas wrote:

> Is there actually some spec requirement about the access sizes that
> must be supported by the hardware?  If there is something, I'll gladly
> keep the 32-bit access, but if it's only a tegra/mvebu-specific
> restriction, then I think it needs to be handled in the PCI config
> accessors for that hardware.

The host register interface is not specified by any standard.

I looked a for a bit and found other drivers with the same problem:
  arch/arm/mach-cns3xxx/pcie.c
  arch/arm/mach-iop13xx/pci.c
  arch/arm/mach-ks8695/pci.c
  arch/arm/mach-sa1100/pci-nanoengine.c
 [..]

I stopped looking after that point..

> This spec statement (PCI (not PCIe) r3.0, sec 3.2.2.3.4) makes it
> sound like tegra/mvebu is non-conforming and should deal with this
> specially:
> 
> "A function must not restrict the size of the access it supports in
> Configuration Space. The configuration commands, like other commands,
> allow data to be accessed using any combination of bytes (including a
> byte, word, DWORD, or non-contiguous bytes) and multiple data phases
> in a burst. The target is required to handle any combination of byte
> enables."

This verbage is talking about the target/responder, not the host. Any
target must respond to all possible sizes for all possible config
registers, including combinations that even x86 can't issue (like a BE
pattern of b0110)

> I think the tegra/mvebu accessors should be able to use the address
> and access size to figure out what we're trying to do.  If we're doing
> a WORD write to PCI_IO_BASE, it can turn that into: 32-bit read, clear
> the upper 16 bits (secondary status), insert the PCI_IO_BASE part,
> 32-bit write.  If we're doing a 16-bit write to PCI_SEC_STATUS, it can
> turn that into: 32-bit read, insert upper 16 bits (secondary status),
> leave lower 16 bits alone, 32-bit write.

Seems reasonable, but this obviously shouldn't be open coded in every
driver..

Maybe pci_ops should gain a read32/write32 and HW that can only do 32
bit access should not define read/write. The core code could provide
read/write functions to do the bit mangling for all possibilities?
That would be very clear what is going on.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 9, 2013, 9:58 p.m. UTC | #4
On Mon, Dec 9, 2013 at 2:27 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 09, 2013 at 01:15:52PM -0700, Bjorn Helgaas wrote:
>
>> Is there actually some spec requirement about the access sizes that
>> must be supported by the hardware?  If there is something, I'll gladly
>> keep the 32-bit access, but if it's only a tegra/mvebu-specific
>> restriction, then I think it needs to be handled in the PCI config
>> accessors for that hardware.
>
> The host register interface is not specified by any standard.
>
> I looked a for a bit and found other drivers with the same problem:
>   arch/arm/mach-cns3xxx/pcie.c
>   arch/arm/mach-iop13xx/pci.c
>   arch/arm/mach-ks8695/pci.c
>   arch/arm/mach-sa1100/pci-nanoengine.c
>  [..]
>
> I stopped looking after that point..
>
>> This spec statement (PCI (not PCIe) r3.0, sec 3.2.2.3.4) makes it
>> sound like tegra/mvebu is non-conforming and should deal with this
>> specially:
>>
>> "A function must not restrict the size of the access it supports in
>> Configuration Space. The configuration commands, like other commands,
>> allow data to be accessed using any combination of bytes (including a
>> byte, word, DWORD, or non-contiguous bytes) and multiple data phases
>> in a burst. The target is required to handle any combination of byte
>> enables."
>
> This verbage is talking about the target/responder, not the host. Any
> target must respond to all possible sizes for all possible config
> registers, including combinations that even x86 can't issue (like a BE
> pattern of b0110)

Now I'm confused.  From the PCI core's point of view,
pci_setup_bridge_io() is talking to a target, namely a PCI-to-PCI
bridge.  PCI config space accessors are also for talking to targets.

I think you're talking about host bridges, and I agree that they are
not covered by this spec statement and can have arbitrary register
layout and access size restrictions.

Some arches set up their host bridges so they appear in PCI config
space and have register layouts that look like PCI-to-PCI bridges.  In
my opinion, those arches then have the responsibility of following all
the PCI-to-PCI bridge rules, including access size restrictions,
either directly in hardware or in their config accessors.

I don't think it makes sense to try to make the core know about this
stuff.  There are many places that touch bridges, and it would be
pretty hard to manage exceptions like this.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Dec. 10, 2013, 12:10 a.m. UTC | #5
On Mon, Dec 09, 2013 at 02:58:06PM -0700, Bjorn Helgaas wrote:

> >> "A function must not restrict the size of the access it supports in
> >> Configuration Space. The configuration commands, like other commands,
> >> allow data to be accessed using any combination of bytes (including a
> >> byte, word, DWORD, or non-contiguous bytes) and multiple data phases
> >> in a burst. The target is required to handle any combination of byte
> >> enables."
> >
> > This verbage is talking about the target/responder, not the host. Any
> > target must respond to all possible sizes for all possible config
> > registers, including combinations that even x86 can't issue (like a BE
> > pattern of b0110)
> 
> Now I'm confused.  From the PCI core's point of view,
> pci_setup_bridge_io() is talking to a target, namely a PCI-to-PCI
> bridge.  PCI config space accessors are also for talking to targets.

The fundamental issue is that the PCI host drivers I pointed out have
hardware that cannot construct a PCI-E ConfigWr TLP (or PCI/PCI-X
eqiuv) with any content other than a 32 bit DWORD write.

So even though pci_setup_bridge_io is talking to a target, and the
target is required to handle all possible ConfigWr TLPs, the host
driver (struct pci_ops) can only generate one possibility.

The problematic host drivers are all emulating non-32 bit ConfigWr
support by doing RMW, and you have observed that RMW is not 100%
correct when dealing with write 1 to clear bits.

Basically, your fix to pci_setup_bridge_io is fine - but your
observation led me to realize that the HW drivers implementing RMW for
8 and 16 bit ops under their struct pci_ops.read function have exactly
the same flaw you are fixing here - they will silently wipe out write
1 to clear bits.

> Some arches set up their host bridges so they appear in PCI config
> space and have register layouts that look like PCI-to-PCI bridges.  In
> my opinion, those arches then have the responsibility of following all
> the PCI-to-PCI bridge rules, including access size restrictions,
> either directly in hardware or in their config accessors.

Sure, but this isn't the problem :)

IMHO, the pci core needs to help host drivers implement a correct
pci_ops.write function for the case where the host hardware can only
produce 32 bit ConfigWR TLPs. That seems to be common enough and
subtle enough it shouldn't be inlined into every driver.

Just to be clear, I am talking about implementations like this:

static int cns3xxx_pci_write_config(struct pci_bus *bus, unsigned int devfn,
                                    int where, int size, u32 val)
{
        u32 v;
        void __iomem *base;
        u32 mask = (0x1ull << (size * 8)) - 1;
        int shift = (where % 4) * 8;

        base = cns3xxx_pci_cfg_base(bus, devfn, where);
        if (!base)
                return PCIBIOS_SUCCESSFUL;

        v = __raw_readl(base);

        v &= ~(mask << shift);
        v |= (val & mask) << shift;

        __raw_writel(v, base);

        return PCIBIOS_SUCCESSFUL;
}

static struct pci_ops cns3xxx_pcie_ops = {
        .read = cns3xxx_pci_read_config,
        .write = cns3xxx_pci_write_config,
};

Which we now know is subtly broken as it will clear write 1 to clear
bits when doing sub dword operations.

I found 6 copies of this pattern in 5 mins of searching :(

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 10, 2013, 12:18 a.m. UTC | #6
On Mon, Dec 9, 2013 at 5:10 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 09, 2013 at 02:58:06PM -0700, Bjorn Helgaas wrote:
>
>> >> "A function must not restrict the size of the access it supports in
>> >> Configuration Space. The configuration commands, like other commands,
>> >> allow data to be accessed using any combination of bytes (including a
>> >> byte, word, DWORD, or non-contiguous bytes) and multiple data phases
>> >> in a burst. The target is required to handle any combination of byte
>> >> enables."
>> >
>> > This verbage is talking about the target/responder, not the host. Any
>> > target must respond to all possible sizes for all possible config
>> > registers, including combinations that even x86 can't issue (like a BE
>> > pattern of b0110)
>>
>> Now I'm confused.  From the PCI core's point of view,
>> pci_setup_bridge_io() is talking to a target, namely a PCI-to-PCI
>> bridge.  PCI config space accessors are also for talking to targets.
>
> The fundamental issue is that the PCI host drivers I pointed out have
> hardware that cannot construct a PCI-E ConfigWr TLP (or PCI/PCI-X
> eqiuv) with any content other than a 32 bit DWORD write.

Yep.

> So even though pci_setup_bridge_io is talking to a target, and the
> target is required to handle all possible ConfigWr TLPs, the host
> driver (struct pci_ops) can only generate one possibility.

Yep.

> The problematic host drivers are all emulating non-32 bit ConfigWr
> support by doing RMW, and you have observed that RMW is not 100%
> correct when dealing with write 1 to clear bits.

Yep.

> Basically, your fix to pci_setup_bridge_io is fine - but your
> observation led me to realize that the HW drivers implementing RMW for
> 8 and 16 bit ops under their struct pci_ops.read function have exactly
> the same flaw you are fixing here - they will silently wipe out write
> 1 to clear bits.

Ah, OK.  I thought you were saying that we couldn't change
pci_setup_bridge_io() to use 16-bit accesses because of this problem.

>> Some arches set up their host bridges so they appear in PCI config
>> space and have register layouts that look like PCI-to-PCI bridges.  In
>> my opinion, those arches then have the responsibility of following all
>> the PCI-to-PCI bridge rules, including access size restrictions,
>> either directly in hardware or in their config accessors.
>
> Sure, but this isn't the problem :)
>
> IMHO, the pci core needs to help host drivers implement a correct
> pci_ops.write function for the case where the host hardware can only
> produce 32 bit ConfigWR TLPs. That seems to be common enough and
> subtle enough it shouldn't be inlined into every driver.

Sure, if somebody can come up with a reasonable way to share this sort
of code, that sounds like a good thing.  Maybe extending struct
pci_ops is the way to do this, but I hope not, because it seems like
it will require quite a bit of specific knowledge about which
registers have RW1C bits.  They're not all at constant offsets because
they can be in capabilities, too.

> Just to be clear, I am talking about implementations like this:
>
> static int cns3xxx_pci_write_config(struct pci_bus *bus, unsigned int devfn,
>                                     int where, int size, u32 val)
> {
>         u32 v;
>         void __iomem *base;
>         u32 mask = (0x1ull << (size * 8)) - 1;
>         int shift = (where % 4) * 8;
>
>         base = cns3xxx_pci_cfg_base(bus, devfn, where);
>         if (!base)
>                 return PCIBIOS_SUCCESSFUL;
>
>         v = __raw_readl(base);
>
>         v &= ~(mask << shift);
>         v |= (val & mask) << shift;
>
>         __raw_writel(v, base);
>
>         return PCIBIOS_SUCCESSFUL;
> }
>
> static struct pci_ops cns3xxx_pcie_ops = {
>         .read = cns3xxx_pci_read_config,
>         .write = cns3xxx_pci_write_config,
> };
>
> Which we now know is subtly broken as it will clear write 1 to clear
> bits when doing sub dword operations.
>
> I found 6 copies of this pattern in 5 mins of searching :(

Yes, I bet there are a bunch of them.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Dec. 10, 2013, 1:31 a.m. UTC | #7
On Mon, Dec 09, 2013 at 05:18:30PM -0700, Bjorn Helgaas wrote:

> > Basically, your fix to pci_setup_bridge_io is fine - but your
> > observation led me to realize that the HW drivers implementing RMW for
> > 8 and 16 bit ops under their struct pci_ops.read function have exactly
> > the same flaw you are fixing here - they will silently wipe out write
> > 1 to clear bits.
> 
> Ah, OK.  I thought you were saying that we couldn't change
> pci_setup_bridge_io() to use 16-bit accesses because of this problem.

No, not really - just that this bug you discovered is broader than just
that one place :)

> Sure, if somebody can come up with a reasonable way to share this sort
> of code, that sounds like a good thing.  Maybe extending struct
> pci_ops is the way to do this, but I hope not, because it seems like

Well, sharing the code is no problem, it is exactly the same for every
32 bit only host driver.

But supporting RW1C bits in capability lists seems fairly hard once
the context at the call site is lost.

Which makes me wonder if supporting sub-dword writes to dwords with
RW1C bits makes sense at all :(

That was why my initial reaction was to not do sub dword writes if you
know it will conflict with a RW1C bit.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Dec. 10, 2013, 6:22 p.m. UTC | #8
On Mon, Dec 9, 2013 at 6:31 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Dec 09, 2013 at 05:18:30PM -0700, Bjorn Helgaas wrote:
>
>> > Basically, your fix to pci_setup_bridge_io is fine - but your
>> > observation led me to realize that the HW drivers implementing RMW for
>> > 8 and 16 bit ops under their struct pci_ops.read function have exactly
>> > the same flaw you are fixing here - they will silently wipe out write
>> > 1 to clear bits.
>>
>> Ah, OK.  I thought you were saying that we couldn't change
>> pci_setup_bridge_io() to use 16-bit accesses because of this problem.
>
> No, not really - just that this bug you discovered is broader than just
> that one place :)
>
>> Sure, if somebody can come up with a reasonable way to share this sort
>> of code, that sounds like a good thing.  Maybe extending struct
>> pci_ops is the way to do this, but I hope not, because it seems like
>
> Well, sharing the code is no problem, it is exactly the same for every
> 32 bit only host driver.
>
> But supporting RW1C bits in capability lists seems fairly hard once
> the context at the call site is lost.

Strictly speaking, fixing this only requires knowledge of the hardware
register layout, so at least in principle, it can be done without the
call site context.  It does sound like it might be messy, though.
QEMU deals with this issue somehow; maybe there's something there we
can copy.

> Which makes me wonder if supporting sub-dword writes to dwords with
> RW1C bits makes sense at all :(
>
> That was why my initial reaction was to not do sub dword writes if you
> know it will conflict with a RW1C bit.

Keeping the dword read/write and clearing the upper 16 bits before the
write is an option.  But I'm inclined to do the 16-bit accesses, i.e.,
use the patch as posted, because that makes the issue more visible.
If we keep the dword access, we can fix this particular issue, but it
does nothing to help fix other similar cases.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 80350299a6ea..2e344a5581ae 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -538,7 +538,8 @@  static void pci_setup_bridge_io(struct pci_bus *bus)
 	struct pci_bus_region region;
 	unsigned long io_mask;
 	u8 io_base_lo, io_limit_lo;
-	u32 l, io_upper16;
+	u16 l;
+	u32 io_upper16;
 
 	io_mask = PCI_IO_RANGE_MASK;
 	if (bridge->io_window_1k)
@@ -548,11 +549,10 @@  static void pci_setup_bridge_io(struct pci_bus *bus)
 	res = bus->resource[0];
 	pcibios_resource_to_bus(bridge, &region, res);
 	if (res->flags & IORESOURCE_IO) {
-		pci_read_config_dword(bridge, PCI_IO_BASE, &l);
-		l &= 0xffff0000;
+		pci_read_config_word(bridge, PCI_IO_BASE, &l);
 		io_base_lo = (region.start >> 8) & io_mask;
 		io_limit_lo = (region.end >> 8) & io_mask;
-		l |= ((u32) io_limit_lo << 8) | io_base_lo;
+		l = ((u16) io_limit_lo << 8) | io_base_lo;
 		/* Set up upper 16 bits of I/O base/limit. */
 		io_upper16 = (region.end & 0xffff0000) | (region.start >> 16);
 		dev_info(&bridge->dev, "  bridge window %pR\n", res);
@@ -564,7 +564,7 @@  static void pci_setup_bridge_io(struct pci_bus *bus)
 	/* Temporarily disable the I/O range before updating PCI_IO_BASE. */
 	pci_write_config_dword(bridge, PCI_IO_BASE_UPPER16, 0x0000ffff);
 	/* Update lower 16 bits of I/O base/limit. */
-	pci_write_config_dword(bridge, PCI_IO_BASE, l);
+	pci_write_config_word(bridge, PCI_IO_BASE, l);
 	/* Update upper 16 bits of I/O base/limit. */
 	pci_write_config_dword(bridge, PCI_IO_BASE_UPPER16, io_upper16);
 }