
PCI: Avoid handing out address 0 to devices

Message ID alpine.DEB.2.21.2202260044180.25061@angie.orcam.me.uk (mailing list archive)
State Superseded

Commit Message

Maciej W. Rozycki Feb. 26, 2022, 10:47 a.m. UTC
We have numerous platforms that permit assigning addresses from 0 to PCI 
devices, both in the memory and the I/O bus space, and we happily do so 
if there is no conflict, e.g.:

pci 0000:07:00.0: BAR 0: assigned [io  0x0000-0x0007]
pci 0000:07:00.1: BAR 0: assigned [io  0x0008-0x000f]
pci 0000:06:01.0: PCI bridge to [bus 07]
pci 0000:06:01.0:   bridge window [io  0x0000-0x0fff]

(with the SiFive HiFive Unmatched RISC-V board and a dual serial port 
option card based on the OxSemi OXPCIe952 device wired for the legacy 
UART mode).

Address 0 is treated specially however in many places, for example in 
`pci_iomap_range' and `pci_iomap_wc_range' we require that the start 
address is non-zero, and even if we let such an address through, then 
individual device drivers could reject a request to handle a device at 
such an address, such as in `uart_configure_port'.  Consequently given
devices configured as shown above only one is actually usable:

Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
serial 0000:07:00.0: enabling device (0000 -> 0001)
serial: probe of 0000:07:00.0 failed with error -12
serial 0000:07:00.1: enabling device (0000 -> 0001)
serial 0000:07:00.1: detected caps 00000700 should be 00000500
0000:07:00.1: ttyS0 at I/O 0x8 (irq = 39, base_baud = 15625000) is a 16C950/954
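
For reference, this is where `pci_iomap_range' rejects such a BAR; an 
abridged excerpt from lib/pci_iomap.c:

	void __iomem *pci_iomap_range(struct pci_dev *dev, int bar,
				      unsigned long offset,
				      unsigned long maxlen)
	{
		resource_size_t start = pci_resource_start(dev, bar);
		resource_size_t len = pci_resource_len(dev, bar);
		...
		if (len <= offset || !start)	/* address 0 -> reject */
			return NULL;
		...
	}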

I/O space ranges are especially valuable, because bridges only decode 
address bits from 12 up; consequently, where 16-bit addressing is in 
effect, as few as 16 separate ranges can be assigned to individual buses.
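
To spell the arithmetic out (register layout as per the PCI-to-PCI 
bridge specification):

	/*
	 * PCI_IO_BASE and PCI_IO_LIMIT hold address bits 15:12 only (in
	 * register bits 7:4), so bridge I/O windows have 4KiB granularity
	 * of both size and alignment, and a 16-bit I/O space fits at most
	 * 0x10000 / 0x1000 == 16 such windows.
	 */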

Therefore avoid handing out address 0, however rather than bumping the 
lowest address available to PCI via PCIBIOS_MIN_IO and PCIBIOS_MIN_MEM, 
or doing an equivalent arrangement in `__pci_assign_resource', let the 
whole range assigned to a bus start from that address and instead only 
avoid it for actual devices.  Do it in `pci_bus_alloc_from_region', 
observing that bridge resources will have the IORESOURCE_STARTALIGN flag 
set rather than IORESOURCE_SIZEALIGN, and make the lowest address handed 
out equal to the least significant address bit decoded by a BAR of the 
given resource type, either memory or I/O.
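
With the standard BAR masks from include/uapi/linux/pci_regs.h this makes 
the lowest address handed out to a device 4 for I/O and 16 for memory 
resources:

	#define PCI_BASE_ADDRESS_IO_MASK	(~0x03UL)
	#define PCI_BASE_ADDRESS_MEM_MASK	(~0x0fUL)

	~PCI_BASE_ADDRESS_IO_MASK + 1	/* == 0x04 */
	~PCI_BASE_ADDRESS_MEM_MASK + 1	/* == 0x10 */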

With this in place, on the system in question we have:

pci 0000:07:00.0: BAR 0: assigned [io  0x0008-0x000f]
pci 0000:07:00.1: BAR 0: assigned [io  0x0010-0x0017]
pci 0000:06:01.0: PCI bridge to [bus 07]
pci 0000:06:01.0:   bridge window [io  0x0000-0x0fff]

and then devices work correctly:

Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
serial 0000:07:00.0: enabling device (0000 -> 0001)
serial 0000:07:00.0: detected caps 00000700 should be 00000500
0000:07:00.0: ttyS0 at I/O 0x8 (irq = 38, base_baud = 15625000) is a 16C950/954
serial 0000:07:00.1: enabling device (0000 -> 0001)
serial 0000:07:00.1: detected caps 00000700 should be 00000500
0000:07:00.1: ttyS1 at I/O 0x10 (irq = 39, base_baud = 15625000) is a 16C950/954

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
Hi,

 NB I have an OxSemi OXPCIe952-based card that can be wired to either the 
native or the legacy mode via a jumper block, and I am glad that I checked 
whether it works in the legacy mode as well.  I guess there are still so 
few legacy-free platforms that nobody else has noticed this issue yet.

 I think I've chosen the right solution, but I'll be happy to hear any 
suggestions for an alternative one.  Otherwise please apply.

  Maciej
---
 drivers/pci/bus.c |    9 +++++++++
 1 file changed, 9 insertions(+)

linux-pci-bus-alloc-from-region-min.diff

Comments

Maciej W. Rozycki March 31, 2022, 7:11 a.m. UTC | #1
On Sat, 26 Feb 2022, Maciej W. Rozycki wrote:

> We have numerous platforms that permit assigning addresses from 0 to PCI 
> devices, both in the memory and the I/O bus space, and we happily do so 
> if there is no conflict, e.g.:

 Ping for:
<https://lore.kernel.org/lkml/alpine.DEB.2.21.2202260044180.25061@angie.orcam.me.uk/>

  Maciej
Maciej W. Rozycki April 13, 2022, 10:53 p.m. UTC | #2
On Sat, 26 Feb 2022, Maciej W. Rozycki wrote:

> We have numerous platforms that permit assigning addresses from 0 to PCI 
> devices, both in the memory and the I/O bus space, and we happily do so 
> if there is no conflict, e.g.:

 Ping for:
<https://lore.kernel.org/lkml/alpine.DEB.2.21.2202260044180.25061@angie.orcam.me.uk/>

  Maciej
Bjorn Helgaas April 14, 2022, 12:06 a.m. UTC | #3
On Sat, Feb 26, 2022 at 10:47:10AM +0000, Maciej W. Rozycki wrote:
> We have numerous platforms that permit assigning addresses from 0 to PCI 
> devices, both in the memory and the I/O bus space, and we happily do so 
> if there is no conflict, e.g.:
> 
> pci 0000:07:00.0: BAR 0: assigned [io  0x0000-0x0007]
> pci 0000:07:00.1: BAR 0: assigned [io  0x0008-0x000f]
> pci 0000:06:01.0: PCI bridge to [bus 07]
> pci 0000:06:01.0:   bridge window [io  0x0000-0x0fff]
> 
> (with the SiFive HiFive Unmatched RISC-V board and a dual serial port 
> option card based on the OxSemi OXPCIe952 device wired for the legacy 
> UART mode).
> 
> Address 0 is treated specially however in many places, for example in 
> `pci_iomap_range' and `pci_iomap_wc_range' we require that the start 
> address is non-zero, and even if we let such an address through, then 
> individual device drivers could reject a request to handle a device at 
> such an address, such as in `uart_configure_port'.  Consequently given
> devices configured as shown above only one is actually usable:

pci_iomap_range() tests the resource start, i.e., the CPU address.  I
guess the implication is that on RISC-V, the CPU-side port address is
the same as the PCI bus port address?

Is that actually a requirement?  Maybe you could also avoid this by
remapping the ports in CPU address space?

Is the same true for PCI memory addresses?  They are identical to CPU
addresses, i.e., no translation is applied?

On the PCI side, zero is a perfectly valid address, so it's a shame to
throw it away if we don't have to, especially since throwing away even
16 bytes of MMIO space means a 4GB 32-bit BAR cannot be mapped at all.

Bjorn
Maciej W. Rozycki April 14, 2022, 1:10 a.m. UTC | #4
On Wed, 13 Apr 2022, Bjorn Helgaas wrote:

> > Address 0 is treated specially however in many places, for example in 
> > `pci_iomap_range' and `pci_iomap_wc_range' we require that the start 
> > address is non-zero, and even if we let such an address through, then 
> > individual device drivers could reject a request to handle a device at 
> > such an address, such as in `uart_configure_port'.  Consequently given
> > devices configured as shown above only one is actually usable:
> 
> pci_iomap_range() tests the resource start, i.e., the CPU address.  I
> guess the implication is that on RISC-V, the CPU-side port address is
> the same as the PCI bus port address?

 Umm, for all systems I came across except x86, which have native port I/O 
access machine instructions, a port I/O resource records PCI bus addresses 
of the device rather than its CPU addresses, which encode the location of 
an MMIO window the PCI port I/O space is accessed through.  E.g. my RISC-V 
SiFive Unmatched has this:

# cat /proc/ioports
00000000-0000ffff : pcie@e00000000
  00000000-00001fff : PCI Bus 0000:01
    00000000-00001fff : PCI Bus 0000:02
      00000000-00001fff : PCI Bus 0000:05
        00000000-00001fff : PCI Bus 0000:06
          00000000-00000fff : PCI Bus 0000:07
            00000004-00000007 : 0000:07:00.0
              00000004-00000006 : parport0
            00000008-0000000f : 0000:07:00.0
              00000008-0000000a : parport0
              0000000b-0000000f : parport0
          00001000-00001fff : PCI Bus 0000:08
            00001000-00001fff : PCI Bus 0000:09
              00001000-000010ff : 0000:09:02.0
              00001100-0000117f : 0000:09:01.0
# 

and my MIPS MTI Malta has this:

# cat /proc/ioports
00000000-0000001f : dma1
00000020-00000021 : pic1
00000040-0000005f : timer
00000060-0000006f : keyboard
00000070-00000077 : rtc0
00000080-0000008f : dma page reg
000000a0-000000a1 : pic2
000000c0-000000df : dma2
00000170-00000177 : ata_piix
000001f0-000001f7 : ata_piix
000002f8-000002ff : serial
00000376-00000376 : ata_piix
00000378-0000037a : parport0
0000037b-0000037f : parport0
000003f6-000003f6 : ata_piix
000003f8-000003ff : serial
00001000-001fffff : GT-64120 PCI I/O
  00001000-0000103f : 0000:00:0a.3
  00001040-0000105f : 0000:00:0a.2
  00001060-0000107f : 0000:00:0b.0
    00001060-0000107f : pcnet32_probe_pci
  00001080-000010ff : 0000:00:14.0
  00001100-0000110f : 0000:00:0a.3
  00001400-000014ff : 0000:00:13.0
  00001800-0000180f : 0000:00:0a.1
    00001800-0000180f : ata_piix
# 

(though this is not strictly correctly reported, because the legacy junk 
is also behind the GT-64120).  It is especially clear with the Malta that 
PCI port I/O addresses have nothing to do with CPU addresses given this:

# cat /proc/iomem
00000000-0fffbfff : System RAM
  00100000-0076e9bf : Kernel code
  0076e9c0-0097665f : Kernel data
  00ab0000-00ae6ccf : Kernel bss
10000000-17ffffff : GT-64120 PCI MEM
  10000000-100fffff : 0000:00:0b.0
  10100000-101fffff : PCI Bus 0000:01
    10100000-10101fff : 0000:01:00.0
      10100000-10101fff : xhci-hcd
  10200000-1021ffff : 0000:00:13.0
  10220000-1022ffff : 0000:00:0c.0
  10230000-1023ffff : 0000:00:14.0
  10240000-10240fff : 0000:00:0c.0
  10241000-10241fff : 0000:00:13.0
  10242000-1024207f : 0000:00:14.0
    10242000-1024207f : defxx
  10242080-1024209f : 0000:00:0b.0
1e000000-1e3fffff : 1e000000.flash flash@1e000000
1f000900-1f00093f : serial
# 

where we have system RAM from CPU address 0 onwards (of course the Malta 
has legacy PC/ISA junk in the southbridge, so it only allocates native PCI 
port I/O resources from 0x1000 up and therefore it avoids the problem with 
port I/O address 0).

 Maybe there are systems that don't do that and use CPU addresses for port 
I/O resources, but I haven't come across one.

> Is that actually a requirement?  Maybe you could also avoid this by
> remapping the ports in CPU address space?

 Sadly it's not recorded in /proc/iomem, but from my understanding of the 
Unmatched DTS the CPU address of PCI I/O port 0 is 0x60080000, and for the 
Malta likewise it's 0x18000000, so the remapping is already there.
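
 For illustration, such a window would typically come from a host bridge 
`ranges' property along these lines (a made-up fragment for the standard 
OF PCI binding, not copied from the actual Unmatched DTS):

	ranges = <0x81000000 0x0 0x00000000	/* port I/O, bus address 0 */
		  0x0 0x60080000		/* CPU address of the window */
		  0x0 0x00010000>;		/* 64KiB of I/O space */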

> Is the same true for PCI memory addresses?  They are identical to CPU
> addresses, i.e., no translation is applied?

 For MMIO I guess this isn't a problem for the systems I know of, but it 
would be if the PCI MMIO access window was decoded at 0 somewhere.  For 
the Unmatched and the Malta the windows are at 0x60090000 and 0x10000000 
respectively.

> On the PCI side, zero is a perfectly valid address, so it's a shame to
> throw it away if we don't have to, especially since throwing away even
> 16 bytes of MMIO space means a 4GB 32-bit BAR cannot be mapped at all.

 A problem with considering an address special, be it 0 or another value, 
is that such a designated location is thrown away.  Buses usually do not 
treat any addresses specially; it's merely a software convention.

  Maciej
Bjorn Helgaas April 14, 2022, 5:07 p.m. UTC | #5
On Thu, Apr 14, 2022 at 02:10:43AM +0100, Maciej W. Rozycki wrote:
> On Wed, 13 Apr 2022, Bjorn Helgaas wrote:
> 
> > > Address 0 is treated specially however in many places, for example in 
> > > `pci_iomap_range' and `pci_iomap_wc_range' we require that the start 
> > > address is non-zero, and even if we let such an address through, then 
> > > individual device drivers could reject a request to handle a device at 
> > > such an address, such as in `uart_configure_port'.  Consequently given
> > > devices configured as shown above only one is actually usable:
> > 
> > pci_iomap_range() tests the resource start, i.e., the CPU address.  I
> > guess the implication is that on RISC-V, the CPU-side port address is
> > the same as the PCI bus port address?
> 
>  Umm, for all systems I came across except x86, which have native port I/O 
> access machine instructions, a port I/O resource records PCI bus addresses 
> of the device rather than its CPU addresses, which encode the location of 
> an MMIO window the PCI port I/O space is accessed through.

My point is only that it is not necessary for the PCI bus address and
the struct resource address, i.e., the argument to inb(), to be the
same.

I tried to find the RISC-V definition of inb(), but it's obfuscated
too much to be easily discoverable.

Bjorn
Maciej W. Rozycki April 14, 2022, 8:22 p.m. UTC | #6
On Thu, 14 Apr 2022, Bjorn Helgaas wrote:

> > > > Address 0 is treated specially however in many places, for example in 
> > > > `pci_iomap_range' and `pci_iomap_wc_range' we require that the start 
> > > > address is non-zero, and even if we let such an address through, then 
> > > > individual device drivers could reject a request to handle a device at 
> > > > such an address, such as in `uart_configure_port'.  Consequently given
> > > > devices configured as shown above only one is actually usable:
> > > 
> > > pci_iomap_range() tests the resource start, i.e., the CPU address.  I
> > > guess the implication is that on RISC-V, the CPU-side port address is
> > > the same as the PCI bus port address?
> > 
> >  Umm, for all systems I came across except x86, which have native port I/O 
> > access machine instructions, a port I/O resource records PCI bus addresses 
> > of the device rather than its CPU addresses, which encode the location of 
> > an MMIO window the PCI port I/O space is accessed through.
> 
> My point is only that it is not necessary for the PCI bus address and
> the struct resource address, i.e., the argument to inb(), to be the
> same.

 Sure, but I have yet to see a system where it is the case.

 Also in principle peer PCI buses could have their own port I/O address 
spaces each mapped via distinct MMIO windows in the CPU address space, but 
I haven't heard of such a system.  That of course doesn't mean there's no 
such system in existence.

> I tried to find the RISC-V definition of inb(), but it's obfuscated
> too much to be easily discoverable.

 AFAICT the RISC-V port uses definitions from include/asm-generic/io.h.  
Palmer, did I get this right?

  Maciej
Palmer Dabbelt April 14, 2022, 10:12 p.m. UTC | #7
On Thu, 14 Apr 2022 13:22:42 PDT (-0700), macro@orcam.me.uk wrote:
> On Thu, 14 Apr 2022, Bjorn Helgaas wrote:
>
>> > > > Address 0 is treated specially however in many places, for example in
>> > > > `pci_iomap_range' and `pci_iomap_wc_range' we require that the start
>> > > > address is non-zero, and even if we let such an address through, then
>> > > > individual device drivers could reject a request to handle a device at
>> > > > such an address, such as in `uart_configure_port'.  Consequently given
>> > > > devices configured as shown above only one is actually usable:
>> > >
>> > > pci_iomap_range() tests the resource start, i.e., the CPU address.  I
>> > > guess the implication is that on RISC-V, the CPU-side port address is
>> > > the same as the PCI bus port address?
>> >
>> >  Umm, for all systems I came across except x86, which have native port I/O
>> > access machine instructions, a port I/O resource records PCI bus addresses
>> > of the device rather than its CPU addresses, which encode the location of
>> > an MMIO window the PCI port I/O space is accessed through.
>>
>> My point is only that it is not necessary for the PCI bus address and
>> the struct resource address, i.e., the argument to inb(), to be the
>> same.
>
>  Sure, but I have yet to see a system where it is the case.
>
>  Also in principle peer PCI buses could have their own port I/O address
> spaces each mapped via distinct MMIO windows in the CPU address space, but
> I haven't heard of such a system.  That of course doesn't mean there's no
> such system in existence.
>
>> I tried to find the RISC-V definition of inb(), but it's obfuscated
>> too much to be easily discoverable.
>
>  AFAICT the RISC-V port uses definitions from include/asm-generic/io.h.
> Palmer, did I get this right?

I'd argue that asm-generic/io.h uses the definitions from RISC-V, but 
the result is the same ;).

The general idea is that the IO itself is pretty generic for a handful 
of ports, they just need to be decorated with some fences (or whatever 
the ISA calls them) before/after the load/store.  Those are the 
__io_p{b,a}{r,w}() macros, which are in riscv's io.h.  IIRC they stand 
for something like Port{Before,After}{Read,Write}.
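
From memory they amount to something like this in 
arch/riscv/include/asm/io.h, so take the exact fence arguments with a 
grain of salt:

	/* Port{Before,After}{Read,Write}: order port I/O against other
	   I/O and against regular memory accesses.  */
	#define __io_pbr()	__asm__ __volatile__ ("fence io,i"  : : : "memory");
	#define __io_par(v)	__asm__ __volatile__ ("fence i,ior" : : : "memory");
	#define __io_pbw()	__asm__ __volatile__ ("fence iow,o" : : : "memory");
	#define __io_paw()	__asm__ __volatile__ ("fence o,io"  : : : "memory");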
Bjorn Helgaas April 14, 2022, 11:23 p.m. UTC | #8
On Thu, Apr 14, 2022 at 09:22:42PM +0100, Maciej W. Rozycki wrote:
> On Thu, 14 Apr 2022, Bjorn Helgaas wrote:
> 
> > > > > Address 0 is treated specially however in many places, for
> > > > > example in `pci_iomap_range' and `pci_iomap_wc_range' we
> > > > > require that the start address is non-zero, and even if we
> > > > > let such an address through, then individual device drivers
> > > > > could reject a request to handle a device at such an
> > > > > address, such as in `uart_configure_port'.  Consequently
> > > > > given devices configured as shown above only one is actually
> > > > > usable:
> > > > 
> > > > pci_iomap_range() tests the resource start, i.e., the CPU
> > > > address.  I guess the implication is that on RISC-V, the
> > > > CPU-side port address is the same as the PCI bus port address?
> > > 
> > >  Umm, for all systems I came across except x86, which have
> > >  native port I/O access machine instructions, a port I/O
> > >  resource records PCI bus addresses of the device rather than
> > >  its CPU addresses, which encode the location of an MMIO window
> > >  the PCI port I/O space is accessed through.
> > 
> > My point is only that it is not necessary for the PCI bus address
> > and the struct resource address, i.e., the argument to inb(), to
> > be the same.
> 
>  Sure, but I have yet to see a system where it is the case.
> 
>  Also in principle peer PCI buses could have their own port I/O
>  address spaces each mapped via distinct MMIO windows in the CPU
>  address space, but I haven't heard of such a system.  That of
>  course doesn't mean there's no such system in existence.

They do exist, but are probably rare.  Even on x86 where multiple host
bridges are now fairly common, and the hardware probably supports a
separate 64K port space for each, the ones I've seen split up a single
64K I/O port space so each bridge only gets a fraction of it.  I'm not
sure Linux would even support multiple spaces.  I do know ia64
supports multiple port spaces (see __ia64_mk_io_addr()), so we could
have something like this:

  pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
  pci_bus 0001:00: root bus resource [io  0x10000-0x1ffff] (bus address [0x0000-0xffff])

I guess the question is whether we want to reserve port 0 and MMIO
address 0 as being "invalid".  That makes the first space smaller than
the others, but it's not *much* smaller and it's an unlikely
configuration to begin with.

But at the same time, it adds another slightly weird special case in
the already full-of-special-cases alloc code, and I'm somewhat averse
to things like that.

We do have the IORESOURCE_UNSET flag bit that could possibly be used
in pci_iomap_range() instead of testing for "!start".  Or maybe
there's a way to allocate address 0 instead of special-casing the
allocator?  Just thinking out loud here.
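
Say, something along these lines in pci_iomap_range() (an untested
sketch of the idea, not a patch):

  struct resource *res = &dev->resource[bar];

  if ((res->flags & IORESOURCE_UNSET) || len <= offset)
          return NULL;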

Bjorn
Maciej W. Rozycki April 15, 2022, 12:27 p.m. UTC | #9
On Thu, 14 Apr 2022, Bjorn Helgaas wrote:

> > > > > > Address 0 is treated specially however in many places, for
> > > > > > example in `pci_iomap_range' and `pci_iomap_wc_range' we
> > > > > > require that the start address is non-zero, and even if we
> > > > > > let such an address through, then individual device drivers
> > > > > > could reject a request to handle a device at such an
> > > > > > address, such as in `uart_configure_port'.  Consequently
> > > > > > given devices configured as shown above only one is actually
> > > > > > usable:
> > > > > 
> > > > > pci_iomap_range() tests the resource start, i.e., the CPU
> > > > > address.  I guess the implication is that on RISC-V, the
> > > > > CPU-side port address is the same as the PCI bus port address?
> > > > 
> > > >  Umm, for all systems I came across except x86, which have
> > > >  native port I/O access machine instructions, a port I/O
> > > >  resource records PCI bus addresses of the device rather than
> > > >  its CPU addresses, which encode the location of an MMIO window
> > > >  the PCI port I/O space is accessed through.
> > > 
> > > My point is only that it is not necessary for the PCI bus address
> > > and the struct resource address, i.e., the argument to inb(), to
> > > be the same.
> > 
> >  Sure, but I have yet to see a system where it is the case.
> > 
> >  Also in principle peer PCI buses could have their own port I/O
> >  address spaces each mapped via distinct MMIO windows in the CPU
> >  address space, but I haven't heard of such a system.  That of
> >  course doesn't mean there's no such system in existence.
> 
> They do exist, but are probably rare.  Even on x86 where multiple host
> bridges are now fairly common, and the hardware probably supports a
> separate 64K port space for each, the ones I've seen split up a single
> 64K I/O port space so each bridge only gets a fraction of it.  I'm not
> sure Linux would even support multiple spaces.  I do know ia64
> supports multiple port spaces (see __ia64_mk_io_addr()), so we could
> have something like this:
> 
>   pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
>   pci_bus 0001:00: root bus resource [io  0x10000-0x1ffff] (bus address [0x0000-0xffff])

 Yeah, I guess if anything, it *had* to be IA64!

 Conversely Alpha systems decode the full 32-bit address range for port 
I/O and happily assign I/O BARs beyond 64K in their firmware, however as 
a uniform address space even across several peer PCI buses.

 As to x86 systems as I mentioned they have native port I/O access machine
instructions and they only support 16-bit addressing, so I wouldn't expect 
more than one 64K of port I/O space implemented with them.  There is no 
problem at the CPU bus level of course with presenting port I/O addresses 
beyond 64K and as a matter of interest the original 80386 CPU did make use 
of them internally for communicating with the 80387 FPU, just because they 
cannot be produced with machine code and therefore a programmer could not 
interfere with the CPU-to-FPU communication protocol.  Port I/O locations
0x800000f8 through 0x800000ff were actually used in that protocol[1][2].

> I guess the question is whether we want to reserve port 0 and MMIO
> address 0 as being "invalid".  That makes the first space smaller than
> the others, but it's not *much* smaller and it's an unlikely
> configuration to begin with.

 Unfortunately just as IRQ 0 is considered special and barring the 8254 
special exception for PC-style legacy junk it means "no IRQ", similarly 
I/O port or MMIO address 0 is considered "no device" in several places.  
One I have identified as noted above is `uart_configure_port':

	/*
	 * If there isn't a port here, don't do anything further.
	 */
	if (!port->iobase && !port->mapbase && !port->membase)
		return;

So even if we let address 0 through it will be rejected downstream here 
and there and the device won't work.

> But at the same time, it adds another slightly weird special case in
> the already full-of-special-cases alloc code, and I'm somewhat averse
> to things like that.

 I can sympathise with that.

> We do have the IORESOURCE_UNSET flag bit that could possibly be used
> in pci_iomap_range() instead of testing for "!start".  Or maybe
> there's a way to allocate address 0 instead of special-casing the
> allocator?  Just thinking out loud here.

 I have discovered it with an 8250-compatible UART option card and when I 
patched up `pci_iomap_range' to let address 0 through, I have learnt the 
hard way it is not going to work, as described above.  And I think the 
special semantics of bus address 0 has been there with us since forever, 
so I dare not change it as people surely have relied on it.

 I could see if the parport_pc driver for the parallel port option card I
now have installed instead would be happy with one of its decode ranges 
set to 0, but I'm not sure if there's any value in such a check given my 
observation.

 We're still doing better with my proposal than systems that have a legacy 
southbridge do, as we're only losing one increment of the BAR decode range 
rather than the 4K of I/O port address space that those systems lose.

References:

[1] "Intel386 DX Microprocessor High Performance 32-bit CHMOS
    Microprocessor with Integrated Memory Management", Intel Corporation, 
    Order Number: 231630-010, December 1992, Section 5.2.4 "Address Bus 
    (BE0# through BE3#, A2 through A31)", p.63

[2] same, Table 5-11 "Numeric Coprocessor Port Addresses", p.94

  Maciej
Bjorn Helgaas April 15, 2022, 6:39 p.m. UTC | #10
On Fri, Apr 15, 2022 at 01:27:17PM +0100, Maciej W. Rozycki wrote:
> On Thu, 14 Apr 2022, Bjorn Helgaas wrote:
> 
> > > > > > > Address 0 is treated specially however in many places, for
> > > > > > > example in `pci_iomap_range' and `pci_iomap_wc_range' we
> > > > > > > require that the start address is non-zero, and even if we
> > > > > > > let such an address through, then individual device drivers
> > > > > > > could reject a request to handle a device at such an
> > > > > > > address, such as in `uart_configure_port'.  Consequently
> > > > > > > given devices configured as shown above only one is actually
> > > > > > > usable:
> > > > > > 
> > > > > > pci_iomap_range() tests the resource start, i.e., the CPU
> > > > > > address.  I guess the implication is that on RISC-V, the
> > > > > > CPU-side port address is the same as the PCI bus port address?
> > > > > 
> > > > >  Umm, for all systems I came across except x86, which have
> > > > >  native port I/O access machine instructions, a port I/O
> > > > >  resource records PCI bus addresses of the device rather than
> > > > >  its CPU addresses, which encode the location of an MMIO window
> > > > >  the PCI port I/O space is accessed through.
> > > > 
> > > > My point is only that it is not necessary for the PCI bus address
> > > > and the struct resource address, i.e., the argument to inb(), to
> > > > be the same.
> > > 
> > >  Sure, but I have yet to see a system where it is the case.
> > > 
> > >  Also in principle peer PCI buses could have their own port I/O
> > >  address spaces each mapped via distinct MMIO windows in the CPU
> > >  address space, but I haven't heard of such a system.  That of
> > >  course doesn't mean there's no such system in existence.
> > 
> > They do exist, but are probably rare.  Even on x86 where multiple host
> > bridges are now fairly common, and the hardware probably supports a
> > separate 64K port space for each, the ones I've seen split up a single
> > 64K I/O port space so each bridge only gets a fraction of it.  I'm not
> > sure Linux would even support multiple spaces.  I do know ia64
> > supports multiple port spaces (see __ia64_mk_io_addr()), so we could
> > have something like this:
> > 
> >   pci_bus 0000:00: root bus resource [io  0x0000-0xffff]
> >   pci_bus 0001:00: root bus resource [io  0x10000-0x1ffff] (bus address [0x0000-0xffff])
> 
>  Yeah, I guess if anything, it *had* to be IA64!

:)

>  Conversely Alpha systems decode the full 32-bit address range for port 
> I/O and happily assign I/O BARs beyond 64K in their firmware, however as 
> a uniform address space even across several peer PCI buses.
> 
>  As to x86 systems as I mentioned they have native port I/O access machine
> instructions and they only support 16-bit addressing, so I wouldn't expect 
> more than one 64K of port I/O space implemented with them.  There is no 
> problem at the CPU bus level of course with presenting port I/O addresses 
> beyond 64K and as a matter of interest the original 80386 CPU did make use 
> of them internally for communicating with the 80387 FPU, just because they 
> cannot be produced with machine code and therefore a programmer could not 
> interfere with the CPU-to-FPU communication protocol.  Port I/O locations
> 0x800000f8 through 0x800000ff were actually used in that protocol[1][2].

> > I guess the question is whether we want to reserve port 0 and MMIO
> > address 0 as being "invalid".  That makes the first space smaller than
> > the others, but it's not *much* smaller and it's an unlikely
> > configuration to begin with.
> 
>  Unfortunately just as IRQ 0 is considered special and barring the 8254 
> special exception for PC-style legacy junk it means "no IRQ", similarly 
> I/O port or MMIO address 0 is considered "no device" in several places.  
> One I have identified as noted above is `uart_configure_port':
> 
> 	/*
> 	 * If there isn't a port here, don't do anything further.
> 	 */
> 	if (!port->iobase && !port->mapbase && !port->membase)
> 		return;
> 
> So even if we let address 0 through it will be rejected downstream here 
> and there and the device won't work.

This is a driver question, which I think is secondary.  If necessary,
we can fix drivers after figuring out what the PCI core should do.

> > We do have the IORESOURCE_UNSET flag bit that could possibly be used
> > in pci_iomap_range() instead of testing for "!start".  Or maybe
> > there's a way to allocate address 0 instead of special-casing the
> > allocator?  Just thinking out loud here.

Another possibility is PCIBIOS_MIN_IO.  It's also kind of an ugly
special case, but at least it already exists.  Most arches define it
to be non-zero, which should avoid this issue.

Defining PCIBIOS_MIN_IO would be simple; what would we lose compared
to adding code in pci_bus_alloc_from_region()?
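
I.e. something like this in the arch's asm/pci.h (values made up for
illustration):

  #define PCIBIOS_MIN_IO	0x1000
  #define PCIBIOS_MIN_MEM	0

__pci_assign_resource() already applies these as the allocation floor:

  min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM;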

Bjorn
Maciej W. Rozycki April 16, 2022, 2:02 p.m. UTC | #11
On Fri, 15 Apr 2022, Bjorn Helgaas wrote:

> > > I guess the question is whether we want to reserve port 0 and MMIO
> > > address 0 as being "invalid".  That makes the first space smaller than
> > > the others, but it's not *much* smaller and it's an unlikely
> > > configuration to begin with.
> > 
> >  Unfortunately just as IRQ 0 is considered special and barring the 8254 
> > special exception for PC-style legacy junk it means "no IRQ", similarly 
> > I/O port or MMIO address 0 is considered "no device" in several places.  
> > One I have identified as noted above is `uart_configure_port':
> > 
> > 	/*
> > 	 * If there isn't a port here, don't do anything further.
> > 	 */
> > 	if (!port->iobase && !port->mapbase && !port->membase)
> > 		return;
> > 
> > So even if we let address 0 through it will be rejected downstream here 
> > and there and the device won't work.
> 
> This is a driver question, which I think is secondary.  If necessary,
> we can fix drivers after figuring out what the PCI core should do.

 This goes back to ISA and user-supplied driver options given as kernel 
parameters, so I am fairly sure it's been a part of our user interface 
since forever.  Changing these assumptions would surely break something 
for someone out there, so I would be very wary of making such changes.

> > > We do have the IORESOURCE_UNSET flag bit that could possibly be used
> > > in pci_iomap_range() instead of testing for "!start".  Or maybe
> > > there's a way to allocate address 0 instead of special-casing the
> > > allocator?  Just thinking out loud here.
> 
> Another possibility is PCIBIOS_MIN_IO.  It's also kind of an ugly
> special case, but at least it already exists.  Most arches define it
> to be non-zero, which should avoid this issue.
> 
> Defining PCIBIOS_MIN_IO would be simple; what would we lose compared
> to adding code in pci_bus_alloc_from_region()?

 As I explained in the change description:

> I/O space ranges are especially valuable, because bridges only decode 
> address bits from 12 up; consequently, where 16-bit addressing is in 
> effect, as few as 16 separate ranges can be assigned to individual buses.
> 
> Therefore avoid handing out address 0, however rather than bumping the
> lowest address available to PCI via PCIBIOS_MIN_IO and PCIBIOS_MIN_MEM,
> or doing an equivalent arrangement in `__pci_assign_resource', let the
> whole range assigned to a bus start from that address and instead only
> avoid it for actual devices.  [...]

Yes, just bumping up PCIBIOS_MIN_IO was my first thought and the path of 
least resistance.  However with the strictly hierarchical topology of PCIe 
systems the limit of 16 ranges feels so frighteningly low to me already as 
to make me rather unwilling to reduce it even further for a system that is 
free from PC legacy junk (no southbridge let alone ISA) and therefore does 
not require it.  So I've reconsidered my initial approach and came up with 
this proposal instead.  I think it is a good compromise.

  Maciej
Bjorn Helgaas April 19, 2022, 3:37 a.m. UTC | #12
On Sat, Apr 16, 2022 at 03:02:35PM +0100, Maciej W. Rozycki wrote:
> On Fri, 15 Apr 2022, Bjorn Helgaas wrote:
> ...

> > Another possibility is PCIBIOS_MIN_IO.  It's also kind of an ugly
> > special case, but at least it already exists.  Most arches define it
> > to be non-zero, which should avoid this issue.
> > 
> > Defining PCIBIOS_MIN_IO would be simple; what would we lose compared
> > to adding code in pci_bus_alloc_from_region()?
> 
>  As I explained in the change description:
> 
>   I/O space ranges are especially valuable, because bridges only
>   decode address bits from 12 up; consequently, where 16-bit
>   addressing is in effect, as few as 16 separate ranges can be
>   assigned to individual buses.
> 
>   Therefore avoid handing out address 0, however rather than bumping
>   the lowest address available to PCI via PCIBIOS_MIN_IO and
>   PCIBIOS_MIN_MEM, or doing an equivalent arrangement in
>   `__pci_assign_resource', let the whole range assigned to a bus
>   start from that address and instead only avoid it for actual
>   devices.  [...]
> 
> Yes, just bumping up PCIBIOS_MIN_IO was my first thought and the
> path of least resistance.  However with the strictly hierarchical
> topology of PCIe systems the limit of 16 ranges feels so
> frighteningly low to me already as to make me rather unwilling to
> reduce it even further for a system that is free from PC legacy junk
> (no southbridge let alone ISA) and therefore does not require it.
> So I've reconsidered my initial approach and came up with this
> proposal instead.  I think it is a good compromise.

Sorry for being dense here, I think it's finally sinking in.

The problem is that making PCIBIOS_MIN_IO greater than zero would keep
us from assigning a [io 0x0000- ] window, so instead of 16 I/O bridge
windows, we could only have 15 (unless bridges support 32-bit I/O
windows).  Right?

Are you running into that limit?  Most devices don't need I/O port
space, and almost all other arches define PCIBIOS_MIN_IO, so they live
without that window today.

Sparc uses the MMIO I/O port address directly in the struct resource,
so it will never try to allocate [io 0x0000], and there's no problem
with using PCI I/O port 0:

  pci_bus 0000:00: root bus resource [io  0x804000000000-0x80400fffffff] (bus address [0x0000-0xfffffff])
  mpt2sas0: ioport(0x0000804000001000), size(256)

The sparc ioport interfaces are basically:

  ioport_map(port)  { return port; }
  ioread8(addr)     { return readb(addr); }
  inb(addr)         { return readb(addr); }

RISC-V uses the generic ones, i.e.,

  ioport_map(port)  { return PIO_OFFSET + port; }
  ioread8(addr)     { if (addr >= PIO_RESERVED)
                        return readb(addr);
                      else
                        return inb(addr & PIO_MASK); }
  inb(addr)         { return __raw_readb(PCI_IOBASE + addr); }

Obviously RISC-V gives you prettier I/O port addresses, but at the
cost of having PCI I/O port 0 be 0 in the struct resource as well,
which makes it basically unusable.  Is it worth it?

Bjorn
Maciej W. Rozycki April 27, 2022, 10:18 p.m. UTC | #13
On Mon, 18 Apr 2022, Bjorn Helgaas wrote:

> > Yes, just bumping up PCIBIOS_MIN_IO was my first thought and the
> > path of least resistance.  However with the strictly hierarchical
> > topology of PCIe systems the limit of 16 ranges feels so
> > frighteningly low to me already as to make me rather unwilling to
> > reduce it even further for a system that is free from PC legacy junk
> > (no southbridge let alone ISA) and therefore does not require it.
> > So I've reconsidered my initial approach and came up with this
> > proposal instead.  I think it is a good compromise.
> 
> Sorry for being dense here, I think it's finally sinking in.

 No worries, I'm glad we're on the same page now.

> The problem is that making PCIBIOS_MIN_IO greater than zero would keep
> us from assigning a [io 0x0000- ] window, so instead of 16 I/O bridge
> windows, we could only have 15 (unless bridges support 32-bit I/O
> windows).  Right?

 Yes, that's been my concern.

> Are you running into that limit?  Most devices don't need I/O port
> space, and almost all other arches define PCIBIOS_MIN_IO, so they live
> without that window today.

 I haven't run into this limit, but then whoever runs into it is likely 
not going to be someone who's able to come up with a solution like the one 
proposed here, or possibly even to understand it.  Granted, it's only one 
extra range, but still IMHO an unnecessary extra limitation beyond the one 
built into the hardware.

 Also there are devices out there that have alternative MMIO and port I/O 
BARs available for mapping their CSRs, and while our drivers tend to use 
MMIO in that case those devices still claim a port I/O address range.  If 
you are unlucky they will get an allocation and a device that actually 
needs port I/O to work won't.

 Such unexpected quirks can be very frustrating to IT folk.

> Sparc uses the MMIO I/O port address directly in the struct resource,
> so it will never try to allocate [io 0x0000], and there's no problem
> with using PCI I/O port 0:
> 
>   pci_bus 0000:00: root bus resource [io  0x804000000000-0x80400fffffff] (bus address [0x0000-0xfffffff])
>   mpt2sas0: ioport(0x0000804000001000), size(256)
> 
> The sparc ioport interfaces are basically:
> 
>   ioport_map(port)  { return port; }
>   ioread8(addr)     { return readb(addr); }
>   inb(addr)         { return readb(addr); }
> 
> RISC-V uses the generic ones, i.e.,
> 
>   ioport_map(port)  { return PIO_OFFSET + port; }
>   ioread8(addr)     { if (addr >= PIO_RESERVED)
>                         return readb(addr);
>                       else
>                         return inb(addr & PIO_MASK); }
>   inb(addr)         { return __raw_readb(PCI_IOBASE + addr); }
> 
> Obviously RISC-V gives you prettier I/O port addresses, but at the
> cost of having PCI I/O port 0 be 0 in the struct resource as well,
> which makes it basically unusable.  Is it worth it?

 Well, the SPARC port may be able to get away with that, but overall I 
think using PCI bus addresses for port I/O resources is the only sane 
thing to do.  In fact I think for MMIO resources we probably ought to do 
the same, though it may be actually more difficult to implement, because 
it's more likely there are systems out there with multiple per-bus MMIO 
address spaces.

 The reason why I think using bus addresses here is the right thing to do 
is that systems may have multiple decode windows exposed to the CPU(s) 
mapping to the same I/O bus addresses, often for specific purposes.  E.g. 
Alpha has the sparse and the dense address space and some systems (I have 
one with a MIPS CPU) have spaces with different endianness swap policies 
(for matching either bit or byte lanes) wired in the bus logic hardware.  
So the same port I/O location can be mapped at different MMIO addresses 
simultaneously in one system.

 I did some further experimentation with a parallel port option card now, 
and it seems to operate just fine at address 0, likely because plain port 
I/O accessors (`inb', `outb', and friends) have no special intepretation 
for address 0, unlike the iomap handlers:

parport_pc 0000:07:00.0: enabling device (0000 -> 0001)
PCI parallel port detected: 1415:c118, I/O at 0x0(0x8), IRQ 38
parport0: PC-style at 0x0 (0x8), irq 38, using FIFO [PCSPP,TRISTATE,COMPAT,EPP,ECP]
lp0: using parport0 (interrupt-driven).

So it seems we're quite inconsistent already.

 Since we need a way to move forward and we have a real issue with PCI BAR 
allocations that cause devices to break, I have now posted a fix for RISC-V 
systems only which solves the problem at hand, however wasting one whole 
4KiB I/O address range.  If we ever have a better generic solution, either 
one proposed here or an alternative one, then we can revert the RISC-V 
change.  Cf. 
<https://lore.kernel.org/r/alpine.DEB.2.21.2204271207590.9383@angie.orcam.me.uk>.

  Maciej
Bjorn Helgaas April 28, 2022, 6:55 p.m. UTC | #14
On Wed, Apr 27, 2022 at 11:18:12PM +0100, Maciej W. Rozycki wrote:
> On Mon, 18 Apr 2022, Bjorn Helgaas wrote:
> ...

> > Sparc uses the MMIO I/O port address directly in the struct resource,
> > so it will never try to allocate [io 0x0000], and there's no problem
> > with using PCI I/O port 0:
> > 
> >   pci_bus 0000:00: root bus resource [io  0x804000000000-0x80400fffffff] (bus address [0x0000-0xfffffff])
> >   mpt2sas0: ioport(0x0000804000001000), size(256)
> > 
> > The sparc ioport interfaces are basically:
> > 
> >   ioport_map(port)  { return port; }
> >   ioread8(addr)     { return readb(addr); }
> >   inb(addr)         { return readb(addr); }
> > 
> > RISC-V uses the generic ones, i.e.,
> > 
> >   ioport_map(port)  { return PIO_OFFSET + port; }
> >   ioread8(addr)     { if (addr >= PIO_RESERVED)
> >                         return readb(addr);
> >                       else
> >                         return inb(addr & PIO_MASK); }
> >   inb(addr)         { return __raw_readb(PCI_IOBASE + addr); }
> > 
> > Obviously RISC-V gives you prettier I/O port addresses, but at the
> > cost of having PCI I/O port 0 be 0 in the struct resource as well,
> > which makes it basically unusable.  Is it worth it?
> 
>  Well, the SPARC port may be able to get away with that, but overall I 
> think using PCI bus addresses for port I/O resources is the only sane 
> thing to do.

That only works if you have a single host bridge where you care about
I/O ports, or if you're willing to split up the single space across
the multiple host bridges, e.g., [io 0x0000-0x7fff] to one host
bridge, [0x8000-0xffff] to another.

> In fact I think for MMIO resources we probably ought to do 
> the same, though it may be actually more difficult to implement, because 
> it's more likely there are systems out there with multiple per-bus MMIO 
> address spaces.

I might be missing your point here, but multiple host bridges are
common on ia64, sparc, and (I think) powerpc.  It probably would be
feasible to make the struct resource address identical to the PCI bus
address for 64-bit MMIO space because they're both constrained to the
same-sized space.

But doing it for the PCI 32-bit non-prefetchable space would be an
issue because there's usually RAM in that area of CPU address space,
so we'd have to reserve a hole for PCI, put the MMIO in the hole, then
split the MMIO space across all the host bridges.  This sparc system
has 16 host bridges, so even if we had no hole, each one could only
have 256MB of identity-mapped non-prefetchable space:

  pci_bus 0000:00: root bus resource [mem 0x800000000000-0x80007effffff] (bus address [0x00000000-0x7effffff])
  pci_bus 0001:00: root bus resource [mem 0x801000000000-0x80107effffff] (bus address [0x00000000-0x7effffff])
  ...
  pci_bus 000f:00: root bus resource [mem 0x839000000000-0x83907effffff] (bus address [0x00000000-0x7effffff])

(This is from https://bugzilla.kernel.org/show_bug.cgi?id=96241)

>  The reason why I think using bus addresses here is the right thing to do 
> is that systems may have multiple decode windows exposed to the CPU(s) 
> mapping to the same I/O bus addresses, often for specific purposes. E.g. 
> Alpha has the sparse and the dense address space and some systems (I have 
> one with a MIPS CPU) have spaces with different endianness swap policies 
> (for matching either bit or byte lanes) wired in the bus logic hardware.  
> So the same port I/O location can be mapped at different MMIO addresses 
> simultaneously in one system.

Is this feature used by Linux?  I guess it would require some generic
mechanism drivers could use to get the desired endianness?  I don't
know whether such a thing exists.

Bjorn

Patch

Index: linux-macro/drivers/pci/bus.c
===================================================================
--- linux-macro.orig/drivers/pci/bus.c
+++ linux-macro/drivers/pci/bus.c
@@ -194,6 +194,15 @@  static int pci_bus_alloc_from_region(str
 		 */
 		if (avail.start)
 			min_used = avail.start;
+		/*
+		 * For non-bridge resources avoid assigning address 0 as
+		 * we assume that to mean no assignment in many places,
+		 * starting from `pci_iomap_range'.
+		 */
+		if (min_used == 0 && (res->flags & IORESOURCE_SIZEALIGN))
+			min_used = res->flags & IORESOURCE_IO ?
+				   ~PCI_BASE_ADDRESS_IO_MASK + 1 :
+				   ~PCI_BASE_ADDRESS_MEM_MASK + 1;
 
 		max = avail.end;