diff mbox series

[03/11] PCI: of_property: Sanitize 32 bit PCI address parsed from DT

Message ID 8b4fa91380fc4754ea80f47330c613e4f6b6592c.1724159867.git.andrea.porta@suse.com (mailing list archive)
State Superseded
Headers show
Series Add support for RaspberryPi RP1 PCI device using a DT overlay | expand

Commit Message

Andrea della Porta Aug. 20, 2024, 2:36 p.m. UTC
The of_pci_set_address() function parse devicetree PCI range specifier
assuming the address is 'sanitized' at the origin, i.e. without checking
whether the incoming address is 32 or 64 bit has specified in the flags.
In this way an address with no OF_PCI_ADDR_SPACE_MEM64 set in the flagss
could leak through and the upper 32 bits of the address will be set too,
and this violates the PCI specs stating that ion 32 bit address the upper
bit should be zero.
This could cause mapping translation mismatch on PCI devices (e.g. RP1)
that are expected to be addressed with a 64 bit address while advertising
a 32 bit address in the PCI config region.
Add a check in of_pci_set_address() to set upper 32 bits to zero in case
the address has no 64 bit flag set.

Signed-off-by: Andrea della Porta <andrea.porta@suse.com>
---
 drivers/pci/of_property.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas Aug. 21, 2024, 3:24 p.m. UTC | #1
On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> The of_pci_set_address() function parse devicetree PCI range specifier

s/parse/parses/ ? 

> assuming the address is 'sanitized' at the origin, i.e. without checking
> whether the incoming address is 32 or 64 bit has specified in the flags.
> In this way an address with no OF_PCI_ADDR_SPACE_MEM64 set in the flagss

s/flagss/flags/

> could leak through and the upper 32 bits of the address will be set too,
> and this violates the PCI specs stating that ion 32 bit address the upper

s/ion/in/

> bit should be zero.

I don't understand this code, so I'm probably missing something.  It
looks like the interesting path here is:

  of_pci_prop_ranges
    res = &pdev->resource[...];
    for (j = 0; j < num; j++) {
      val64 = res[j].start;
      of_pci_set_address(..., val64, 0, flags, false);
 +      if (OF_PCI_ADDR_SPACE_MEM64)
 +        prop[1] = upper_32_bits(val64);
 +      else
 +        prop[1] = 0;

OF_PCI_ADDR_SPACE_MEM64 tells us about the size of the PCI bus
address, but the address (val64) is a CPU physical address, not a PCI
bus address, so I don't understand why of_pci_set_address() should use
OF_PCI_ADDR_SPACE_MEM64 to clear part of the CPU address.

Add blank lines between paragraphs.

> This could cause mapping translation mismatch on PCI devices (e.g. RP1)
> that are expected to be addressed with a 64 bit address while advertising
> a 32 bit address in the PCI config region.
> Add a check in of_pci_set_address() to set upper 32 bits to zero in case
> the address has no 64 bit flag set.

Is this an indication of a DT error?  Have you seen this cause a
problem?  If so, what does it look like to a user?  I.e., how could a
user find this patch if they saw a problem?

> Signed-off-by: Andrea della Porta <andrea.porta@suse.com>
> ---
>  drivers/pci/of_property.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/of_property.c b/drivers/pci/of_property.c
> index 5a0b98e69795..77865facdb4a 100644
> --- a/drivers/pci/of_property.c
> +++ b/drivers/pci/of_property.c
> @@ -60,7 +60,10 @@ static void of_pci_set_address(struct pci_dev *pdev, u32 *prop, u64 addr,
>  	prop[0] |= flags | reg_num;
>  	if (!reloc) {
>  		prop[0] |= OF_PCI_ADDR_FIELD_NONRELOC;
> -		prop[1] = upper_32_bits(addr);
> +		if (FIELD_GET(OF_PCI_ADDR_FIELD_SS, flags) == OF_PCI_ADDR_SPACE_MEM64)
> +			prop[1] = upper_32_bits(addr);
> +		else
> +			prop[1] = 0;
>  		prop[2] = lower_32_bits(addr);
>  	}
>  }
> -- 
> 2.35.3
>
Andrea della Porta Aug. 26, 2024, 7:51 p.m. UTC | #2
Hi Bjorn,

On 10:24 Wed 21 Aug     , Bjorn Helgaas wrote:
> On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> > The of_pci_set_address() function parse devicetree PCI range specifier
> 
> s/parse/parses/ ? 

Ack.

> 
> > assuming the address is 'sanitized' at the origin, i.e. without checking
> > whether the incoming address is 32 or 64 bit has specified in the flags.
> > In this way an address with no OF_PCI_ADDR_SPACE_MEM64 set in the flagss
> 
> s/flagss/flags/

Ack.

> 
> > could leak through and the upper 32 bits of the address will be set too,
> > and this violates the PCI specs stating that ion 32 bit address the upper
> 
> s/ion/in/

Ack.

> 
> > bit should be zero.
> 
> I don't understand this code, so I'm probably missing something.  It
> looks like the interesting path here is:
> 
>   of_pci_prop_ranges
>     res = &pdev->resource[...];
>     for (j = 0; j < num; j++) {
>       val64 = res[j].start;
>       of_pci_set_address(..., val64, 0, flags, false);
>  +      if (OF_PCI_ADDR_SPACE_MEM64)
>  +        prop[1] = upper_32_bits(val64);
>  +      else
>  +        prop[1] = 0;
> 
> OF_PCI_ADDR_SPACE_MEM64 tells us about the size of the PCI bus
> address, but the address (val64) is a CPU physical address, not a PCI
> bus address, so I don't understand why of_pci_set_address() should use
> OF_PCI_ADDR_SPACE_MEM64 to clear part of the CPU address.
>

It all starts from of_pci_prop_ranges(), that is the caller of of_pci_set_address().
val64 (i.e. res[j].start) is the address part of a struct resource that has
its own flags.  Those flags are directly translated to of_pci_range flags by
of_pci_get_addr_flags(), so any IORESOURCE_MEM_64 / IORESOURCE_MEM in the
resource flag will respectively become OF_PCI_ADDR_SPACE_MEM64 / OF_PCI_ADDR_SPACE_MEM32
in pci range.
What is advertised as 32 bit at the origin (val64) should not become a 64
bit PCI address at the output of of_pci_set_address(), so the upper 32 bit
portion should be dropped. 
This is explicitly stated in [1] (see page 5), where a space code of 0b10
implies that the upper 32 bit of the address must be zeroed out.
Please note that of_pci_prop_ranges() will be called only in case 
CONFIG_PCI_DYNAMIC_OF_NODES is enabled to populate ranges for pci bridges and
pci endpoint for which a quirk is declared, so I would say not a very
often recurring use case.
 
[1] - https://www.devicetree.org/open-firmware/bindings/pci/pci2_1.pdf

> Add blank lines between paragraphs.

Ack.

> 
> > This could cause mapping translation mismatch on PCI devices (e.g. RP1)
> > that are expected to be addressed with a 64 bit address while advertising
> > a 32 bit address in the PCI config region.
> > Add a check in of_pci_set_address() to set upper 32 bits to zero in case
> > the address has no 64 bit flag set.
> 
> Is this an indication of a DT error?  Have you seen this cause a
> problem?  If so, what does it look like to a user?  I.e., how could a
> user find this patch if they saw a problem?

Not neccessarily a DT error, but an inconsistent representation of addresses
wrt the specs. I incidentally encountered this on RaspberryPi 5, where
the PCI config space for the RP1 endpoint shows 32 bit BARs (basically an
offset from zero) but the effective address to which the CPU can access the
device is 64 bit nonetheless (0x1f_00000000).  I believe this is backed by
some non standard hw wiring.

Without this patch the range translation chain is broken, like this:

pcie@120000: <0x2000000 0x00 0x00    0x1f 0x00                0x00 0xfffffffc>;
~~~ chain breaks here ~~~
pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;

while with the patch applied the chain correctly become:

pcie@120000: <0x2000000 0x00 0x00    0x1f 0x00                0x00 0xfffffffc>;
pci@0      : <0x82000000 0x00 0x00   0x82000000 0x00 0x00     0x00 0x600000>;
dev@0,0    : <0x01 0x00 0x00         0x82010000 0x00 0x00     0x00 0x400000>;
rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;

I'm not advocating here that this patch is fixing the behaviour on Rpi5, this
is just a nice side effect of the correct address representation. I think we can
also probably fix it by patching the pcie node in the devicetree like this:

pcie@120000: <0x2000000 0xi1f 0x00    0x1f 0x00                0x00 0xfffffffc>;

but this is of course just a 1:1 mapping, while the address will still be at
least 'virtually' unconsistent, showing 64 bit address wihile the 32 bit flag is
set.
The net symptoms to the user would be, in the case of the RP1, a platform driver
of one of its sub-peripheral that fails to be probed.

Many thanks,
Andrea

> 
> > Signed-off-by: Andrea della Porta <andrea.porta@suse.com>
> > ---
> >  drivers/pci/of_property.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/of_property.c b/drivers/pci/of_property.c
> > index 5a0b98e69795..77865facdb4a 100644
> > --- a/drivers/pci/of_property.c
> > +++ b/drivers/pci/of_property.c
> > @@ -60,7 +60,10 @@ static void of_pci_set_address(struct pci_dev *pdev, u32 *prop, u64 addr,
> >  	prop[0] |= flags | reg_num;
> >  	if (!reloc) {
> >  		prop[0] |= OF_PCI_ADDR_FIELD_NONRELOC;
> > -		prop[1] = upper_32_bits(addr);
> > +		if (FIELD_GET(OF_PCI_ADDR_FIELD_SS, flags) == OF_PCI_ADDR_SPACE_MEM64)
> > +			prop[1] = upper_32_bits(addr);
> > +		else
> > +			prop[1] = 0;
> >  		prop[2] = lower_32_bits(addr);
> >  	}
> >  }
> > -- 
> > 2.35.3
> >
Bjorn Helgaas Sept. 3, 2024, 10:26 p.m. UTC | #3
On Mon, Aug 26, 2024 at 09:51:02PM +0200, Andrea della Porta wrote:
> On 10:24 Wed 21 Aug     , Bjorn Helgaas wrote:
> > On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> > > The of_pci_set_address() function parses devicetree PCI range
> > > specifier assuming the address is 'sanitized' at the origin,
> > > i.e. without checking whether the incoming address is 32 or 64
> > > bit has specified in the flags.  In this way an address with no
> > > OF_PCI_ADDR_SPACE_MEM64 set in the flags could leak through and
> > > the upper 32 bits of the address will be set too, and this
> > > violates the PCI specs stating that in 32 bit address the upper
> > > bit should be zero.

> > I don't understand this code, so I'm probably missing something.  It
> > looks like the interesting path here is:
> > 
> >   of_pci_prop_ranges
> >     res = &pdev->resource[...];
> >     for (j = 0; j < num; j++) {
> >       val64 = res[j].start;
> >       of_pci_set_address(..., val64, 0, flags, false);
> >  +      if (OF_PCI_ADDR_SPACE_MEM64)
> >  +        prop[1] = upper_32_bits(val64);
> >  +      else
> >  +        prop[1] = 0;
> > 
> > OF_PCI_ADDR_SPACE_MEM64 tells us about the size of the PCI bus
> > address, but the address (val64) is a CPU physical address, not a PCI
> > bus address, so I don't understand why of_pci_set_address() should use
> > OF_PCI_ADDR_SPACE_MEM64 to clear part of the CPU address.
> 
> It all starts from of_pci_prop_ranges(), that is the caller of
> of_pci_set_address().

> val64 (i.e. res[j].start) is the address part of a struct resource
> that has its own flags.  Those flags are directly translated to
> of_pci_range flags by of_pci_get_addr_flags(), so any
> IORESOURCE_MEM_64 / IORESOURCE_MEM in the resource flag will
> respectively become OF_PCI_ADDR_SPACE_MEM64 /
> OF_PCI_ADDR_SPACE_MEM32 in pci range.

> What is advertised as 32 bit at the origin (val64) should not become
> a 64 bit PCI address at the output of of_pci_set_address(), so the
> upper 32 bit portion should be dropped. 

> This is explicitly stated in [1] (see page 5), where a space code of 0b10
> implies that the upper 32 bit of the address must be zeroed out.

OK, I was confused and thought IORESOURCE_MEM_64 was telling us
something about the *CPU* address, but it's actually telling us
something about what *PCI bus* addresses are possible, i.e., whether
it's a 32-bit BAR or a 64-bit BAR.

However, the CPU physical address space and the PCI bus address are
not the same.  Generic code paths should account for that different by
applying an offset (the offset will be zero on many platforms where
CPU and PCI bus addresses *look* the same).

So a generic code path like of_pci_prop_ranges() that basically copies
a CPU physical address to a PCI bus address looks broken to me.

Maybe my expectation of this being described in DT is mistaken.

Bjorn
Andrea della Porta Sept. 5, 2024, 4:43 p.m. UTC | #4
Hi Bjorn,

On 17:26 Tue 03 Sep     , Bjorn Helgaas wrote:
> On Mon, Aug 26, 2024 at 09:51:02PM +0200, Andrea della Porta wrote:
> > On 10:24 Wed 21 Aug     , Bjorn Helgaas wrote:
> > > On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> > > > The of_pci_set_address() function parses devicetree PCI range
> > > > specifier assuming the address is 'sanitized' at the origin,
> > > > i.e. without checking whether the incoming address is 32 or 64
> > > > bit has specified in the flags.  In this way an address with no
> > > > OF_PCI_ADDR_SPACE_MEM64 set in the flags could leak through and
> > > > the upper 32 bits of the address will be set too, and this
> > > > violates the PCI specs stating that in 32 bit address the upper
> > > > bit should be zero.
> 
> > > I don't understand this code, so I'm probably missing something.  It
> > > looks like the interesting path here is:
> > > 
> > >   of_pci_prop_ranges
> > >     res = &pdev->resource[...];
> > >     for (j = 0; j < num; j++) {
> > >       val64 = res[j].start;
> > >       of_pci_set_address(..., val64, 0, flags, false);
> > >  +      if (OF_PCI_ADDR_SPACE_MEM64)
> > >  +        prop[1] = upper_32_bits(val64);
> > >  +      else
> > >  +        prop[1] = 0;
> > > 
> > > OF_PCI_ADDR_SPACE_MEM64 tells us about the size of the PCI bus
> > > address, but the address (val64) is a CPU physical address, not a PCI
> > > bus address, so I don't understand why of_pci_set_address() should use
> > > OF_PCI_ADDR_SPACE_MEM64 to clear part of the CPU address.
> > 
> > It all starts from of_pci_prop_ranges(), that is the caller of
> > of_pci_set_address().
> 
> > val64 (i.e. res[j].start) is the address part of a struct resource
> > that has its own flags.  Those flags are directly translated to
> > of_pci_range flags by of_pci_get_addr_flags(), so any
> > IORESOURCE_MEM_64 / IORESOURCE_MEM in the resource flag will
> > respectively become OF_PCI_ADDR_SPACE_MEM64 /
> > OF_PCI_ADDR_SPACE_MEM32 in pci range.
> 
> > What is advertised as 32 bit at the origin (val64) should not become
> > a 64 bit PCI address at the output of of_pci_set_address(), so the
> > upper 32 bit portion should be dropped. 
> 
> > This is explicitly stated in [1] (see page 5), where a space code of 0b10
> > implies that the upper 32 bit of the address must be zeroed out.
> 
> OK, I was confused and thought IORESOURCE_MEM_64 was telling us
> something about the *CPU* address, but it's actually telling us
> something about what *PCI bus* addresses are possible, i.e., whether
> it's a 32-bit BAR or a 64-bit BAR.
> 
> However, the CPU physical address space and the PCI bus address are
> not the same.  Generic code paths should account for that different by
> applying an offset (the offset will be zero on many platforms where
> CPU and PCI bus addresses *look* the same).
> 
> So a generic code path like of_pci_prop_ranges() that basically copies
> a CPU physical address to a PCI bus address looks broken to me.

Hmmm, I'd say that a translation from one bus type to the other is
going on nonetheless, and this is done in the current upstream function
as well. This patch of course does not add the translation (which is
already in place), just to do it avoiding generating inconsistent address.


> 
> Maybe my expectation of this being described in DT is mistaken.

Not sure what you mean here, the address being translated are coming from
DT, in fact they are described by "ranges" properties.

Many thanks,
Andrea

> Bjorn
Bjorn Helgaas Sept. 5, 2024, 8:16 p.m. UTC | #5
[+cc Lizhi]

On Thu, Sep 05, 2024 at 06:43:35PM +0200, Andrea della Porta wrote:
> On 17:26 Tue 03 Sep     , Bjorn Helgaas wrote:
> > On Mon, Aug 26, 2024 at 09:51:02PM +0200, Andrea della Porta wrote:
> > > On 10:24 Wed 21 Aug     , Bjorn Helgaas wrote:
> > > > On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> > > > > The of_pci_set_address() function parses devicetree PCI range
> > > > > specifier assuming the address is 'sanitized' at the origin,
> > > > > i.e. without checking whether the incoming address is 32 or 64
> > > > > bit has specified in the flags.  In this way an address with no
> > > > > OF_PCI_ADDR_SPACE_MEM64 set in the flags could leak through and
> > > > > the upper 32 bits of the address will be set too, and this
> > > > > violates the PCI specs stating that in 32 bit address the upper
> > > > > bit should be zero.
> > 
> > > > I don't understand this code, so I'm probably missing something.  It
> > > > looks like the interesting path here is:
> > > > 
> > > >   of_pci_prop_ranges
> > > >     res = &pdev->resource[...];
> > > >     for (j = 0; j < num; j++) {
> > > >       val64 = res[j].start;
> > > >       of_pci_set_address(..., val64, 0, flags, false);
> > > >  +      if (OF_PCI_ADDR_SPACE_MEM64)
> > > >  +        prop[1] = upper_32_bits(val64);
> > > >  +      else
> > > >  +        prop[1] = 0;
> ...
> > However, the CPU physical address space and the PCI bus address are
> > not the same.  Generic code paths should account for that different by
> > applying an offset (the offset will be zero on many platforms where
> > CPU and PCI bus addresses *look* the same).
> > 
> > So a generic code path like of_pci_prop_ranges() that basically copies
> > a CPU physical address to a PCI bus address looks broken to me.
> 
> Hmmm, I'd say that a translation from one bus type to the other is
> going on nonetheless, and this is done in the current upstream function
> as well. This patch of course does not add the translation (which is
> already in place), just to do it avoiding generating inconsistent address.

I think I was looking at this backwards.  I assumed we were *parsing"
a "ranges" property, but I think in fact we're *building* a "ranges"
property to describe an existing PCI device (either a PCI-to-PCI
bridge or an endpoint).  For such devices there is no address
translation.

Any address translation would only occur at a PCI host bridge that has
CPU address space on the upstream side and PCI address space on the
downstream side.

Since (IIUC), we're building "ranges" for a device in the interior of
a PCI hierarchy where address translation doesn't happen, I think both
the parent and child addresses in "ranges" should be in the PCI
address space.

But right now, I think they're both in the CPU address space, and we
basically do this:

  of_pci_prop_ranges(struct pci_dev *pdev, ...)
    res = &pdev->resource[...];
    for (j = 0; j < num; j++) {   # iterate through BARs or windows
      val64 = res[j].start;       # CPU physical address
      # <convert to PCI address space>
      of_pci_set_address(..., rp[i].parent_addr, val64, ...)
        rp[i].parent_addr = val64
      if (pci_is_bridge(pdev))
        memcpy(rp[i].child_addr, rp[i].parent_addr)
      else
        rp[i].child_addr[0] = j   # child addr unset/unused

Here "res" is a PCI BAR or bridge window, and it contains CPU physical
addresses, so "val64" is a CPU physical address.  It looks to me like
we should convert to a PCI bus address at the point noted above, based
on any translation described by the PCI host bridge.  That *should*
naturally result in a 32-bit value if OF_PCI_ADDR_SPACE_MEM64 is not
set.

> > Maybe my expectation of this being described in DT is mistaken.
> 
> Not sure what you mean here, the address being translated are coming from
> DT, in fact they are described by "ranges" properties.

Right, for my own future reference since I couldn't find a generic
description of "ranges" in Documentation/devicetree/:

[1] https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#ranges
Andrea della Porta Sept. 27, 2024, 6:48 a.m. UTC | #6
Hi Bjorn,

On 15:16 Thu 05 Sep     , Bjorn Helgaas wrote:
> [+cc Lizhi]
> 
> On Thu, Sep 05, 2024 at 06:43:35PM +0200, Andrea della Porta wrote:
> > On 17:26 Tue 03 Sep     , Bjorn Helgaas wrote:
> > > On Mon, Aug 26, 2024 at 09:51:02PM +0200, Andrea della Porta wrote:
> > > > On 10:24 Wed 21 Aug     , Bjorn Helgaas wrote:
> > > > > On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> > > > > > The of_pci_set_address() function parses devicetree PCI range
> > > > > > specifier assuming the address is 'sanitized' at the origin,
> > > > > > i.e. without checking whether the incoming address is 32 or 64
> > > > > > bit has specified in the flags.  In this way an address with no
> > > > > > OF_PCI_ADDR_SPACE_MEM64 set in the flags could leak through and
> > > > > > the upper 32 bits of the address will be set too, and this
> > > > > > violates the PCI specs stating that in 32 bit address the upper
> > > > > > bit should be zero.
> > > 
> > > > > I don't understand this code, so I'm probably missing something.  It
> > > > > looks like the interesting path here is:
> > > > > 
> > > > >   of_pci_prop_ranges
> > > > >     res = &pdev->resource[...];
> > > > >     for (j = 0; j < num; j++) {
> > > > >       val64 = res[j].start;
> > > > >       of_pci_set_address(..., val64, 0, flags, false);
> > > > >  +      if (OF_PCI_ADDR_SPACE_MEM64)
> > > > >  +        prop[1] = upper_32_bits(val64);
> > > > >  +      else
> > > > >  +        prop[1] = 0;
> > ...
> > > However, the CPU physical address space and the PCI bus address are
> > > not the same.  Generic code paths should account for that different by
> > > applying an offset (the offset will be zero on many platforms where
> > > CPU and PCI bus addresses *look* the same).
> > > 
> > > So a generic code path like of_pci_prop_ranges() that basically copies
> > > a CPU physical address to a PCI bus address looks broken to me.
> > 
> > Hmmm, I'd say that a translation from one bus type to the other is
> > going on nonetheless, and this is done in the current upstream function
> > as well. This patch of course does not add the translation (which is
> > already in place), just to do it avoiding generating inconsistent address.
> 
> I think I was looking at this backwards.  I assumed we were *parsing"
> a "ranges" property, but I think in fact we're *building* a "ranges"
> property to describe an existing PCI device (either a PCI-to-PCI
> bridge or an endpoint).  For such devices there is no address
> translation.
> 
> Any address translation would only occur at a PCI host bridge that has
> CPU address space on the upstream side and PCI address space on the
> downstream side.
> 
> Since (IIUC), we're building "ranges" for a device in the interior of
> a PCI hierarchy where address translation doesn't happen, I think both
> the parent and child addresses in "ranges" should be in the PCI
> address space.
> 
> But right now, I think they're both in the CPU address space, and we
> basically do this:
> 
>   of_pci_prop_ranges(struct pci_dev *pdev, ...)
>     res = &pdev->resource[...];
>     for (j = 0; j < num; j++) {   # iterate through BARs or windows
>       val64 = res[j].start;       # CPU physical address
>       # <convert to PCI address space>
>       of_pci_set_address(..., rp[i].parent_addr, val64, ...)
>         rp[i].parent_addr = val64
>       if (pci_is_bridge(pdev))
>         memcpy(rp[i].child_addr, rp[i].parent_addr)
>       else
>         rp[i].child_addr[0] = j   # child addr unset/unused
> 
> Here "res" is a PCI BAR or bridge window, and it contains CPU physical
> addresses, so "val64" is a CPU physical address.  It looks to me like
> we should convert to a PCI bus address at the point noted above, based
> on any translation described by the PCI host bridge.  That *should*
> naturally result in a 32-bit value if OF_PCI_ADDR_SPACE_MEM64 is not
> set.

That's exactly the point, ecxept that right now a 64 bit address would
"unnaturally" pass through even if OF_PCI_ADDR_SPACE_MEM64 is not set.
Hence the purpose of this patch.

Many thanks,
Andrea

> 
> > > Maybe my expectation of this being described in DT is mistaken.
> > 
> > Not sure what you mean here, the address being translated are coming from
> > DT, in fact they are described by "ranges" properties.
> 
> Right, for my own future reference since I couldn't find a generic
> description of "ranges" in Documentation/devicetree/:
> 
> [1] https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#ranges
Bjorn Helgaas Sept. 28, 2024, 8:17 p.m. UTC | #7
On Fri, Sep 27, 2024 at 08:48:28AM +0200, Andrea della Porta wrote:
> On 15:16 Thu 05 Sep     , Bjorn Helgaas wrote:
> > On Thu, Sep 05, 2024 at 06:43:35PM +0200, Andrea della Porta wrote:
> > > On 17:26 Tue 03 Sep     , Bjorn Helgaas wrote:
> > > > On Mon, Aug 26, 2024 at 09:51:02PM +0200, Andrea della Porta wrote:
> > > > > On 10:24 Wed 21 Aug     , Bjorn Helgaas wrote:
> > > > > > On Tue, Aug 20, 2024 at 04:36:05PM +0200, Andrea della Porta wrote:
> > > > > > > The of_pci_set_address() function parses devicetree PCI range
> > > > > > > specifier assuming the address is 'sanitized' at the origin,
> > > > > > > i.e. without checking whether the incoming address is 32 or 64
> > > > > > > bit has specified in the flags.  In this way an address with no
> > > > > > > OF_PCI_ADDR_SPACE_MEM64 set in the flags could leak through and
> > > > > > > the upper 32 bits of the address will be set too, and this
> > > > > > > violates the PCI specs stating that in 32 bit address the upper
> > > > > > > bit should be zero.
> > > > 
> > > > > > I don't understand this code, so I'm probably missing something.  It
> > > > > > looks like the interesting path here is:
> > > > > > 
> > > > > >   of_pci_prop_ranges
> > > > > >     res = &pdev->resource[...];
> > > > > >     for (j = 0; j < num; j++) {
> > > > > >       val64 = res[j].start;
> > > > > >       of_pci_set_address(..., val64, 0, flags, false);
> > > > > >  +      if (OF_PCI_ADDR_SPACE_MEM64)
> > > > > >  +        prop[1] = upper_32_bits(val64);
> > > > > >  +      else
> > > > > >  +        prop[1] = 0;
> > > ...
> > > > However, the CPU physical address space and the PCI bus address are
> > > > not the same.  Generic code paths should account for that different by
> > > > applying an offset (the offset will be zero on many platforms where
> > > > CPU and PCI bus addresses *look* the same).
> > > > 
> > > > So a generic code path like of_pci_prop_ranges() that basically copies
> > > > a CPU physical address to a PCI bus address looks broken to me.
> > > 
> > > Hmmm, I'd say that a translation from one bus type to the other is
> > > going on nonetheless, and this is done in the current upstream function
> > > as well. This patch of course does not add the translation (which is
> > > already in place), just to do it avoiding generating inconsistent address.
> > 
> > I think I was looking at this backwards.  I assumed we were *parsing"
> > a "ranges" property, but I think in fact we're *building* a "ranges"
> > property to describe an existing PCI device (either a PCI-to-PCI
> > bridge or an endpoint).  For such devices there is no address
> > translation.
> > 
> > Any address translation would only occur at a PCI host bridge that has
> > CPU address space on the upstream side and PCI address space on the
> > downstream side.
> > 
> > Since (IIUC), we're building "ranges" for a device in the interior of
> > a PCI hierarchy where address translation doesn't happen, I think both
> > the parent and child addresses in "ranges" should be in the PCI
> > address space.
> > 
> > But right now, I think they're both in the CPU address space, and we
> > basically do this:
> > 
> >   of_pci_prop_ranges(struct pci_dev *pdev, ...)
> >     res = &pdev->resource[...];
> >     for (j = 0; j < num; j++) {   # iterate through BARs or windows
> >       val64 = res[j].start;       # CPU physical address
> >       # <convert to PCI address space>
> >       of_pci_set_address(..., rp[i].parent_addr, val64, ...)
> >         rp[i].parent_addr = val64
> >       if (pci_is_bridge(pdev))
> >         memcpy(rp[i].child_addr, rp[i].parent_addr)
> >       else
> >         rp[i].child_addr[0] = j   # child addr unset/unused
> > 
> > Here "res" is a PCI BAR or bridge window, and it contains CPU physical
> > addresses, so "val64" is a CPU physical address.  It looks to me like
> > we should convert to a PCI bus address at the point noted above, based
> > on any translation described by the PCI host bridge.  That *should*
> > naturally result in a 32-bit value if OF_PCI_ADDR_SPACE_MEM64 is not
> > set.
> 
> That's exactly the point, except that right now a 64 bit address would
> "unnaturally" pass through even if OF_PCI_ADDR_SPACE_MEM64 is not set.
> Hence the purpose of this patch.

From your earlier email
(https://lore.kernel.org/r/Zszcps6bnCcdFa54@apocalypse):

> Without this patch the range translation chain is broken, like this:

> pcie@120000: <0x2000000 0x00 0x00    0x1f 0x00                0x00 0xfffffffc>;
> ~~~ chain breaks here ~~~
> pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
> dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
> rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;

The cover letter said "RP1 is an MFD chipset that acts as a
south-bridge PCIe endpoint .. the RP1 as an endpoint itself is
discoverable via usual PCI enumeration".

I assume pcie@120000 is the PCI host bridge and is already in the
original DT describing the platform.  I assume pci@0 is a Root Port
and dev@0,0 is the RP1 Endpoint, and the existing code already adds
them as they are enumerated when pci_bus_add_device() calls
of_pci_make_dev_node(), and I think this series adds the rp1@0
description.

And the "ranges" properties are built when of_pci_make_dev_node()
eventually calls of_pci_prop_ranges().  With reference to sec 2.2.1.1
of https://www.devicetree.org/open-firmware/bindings/pci/pci2_1.pdf
and
https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#ranges,
I *think* your example says:

pcie@120000 has:
  child phys.hi	      0x02000000    n=0 p=0 t=0 ss=10b
  child phys.mid,lo   0x00000000_00000000
  parent phys.hi,lo   0x0000001f_00000000
  length hi,lo        0x00000000_fffffffc

which would make it a bridge where the child (PCI) address space is
relocatable non-prefetchable 32-bit memory space at
0x00000000-0xfffffffc, and the corresponding parent address space is
0x1f_00000000-0x1f_fffffffc.  That means the host bridge applies an
address translation of "child_addr = parent_addr - 0x1f_00000000".

pci@0 has:
  child phys.hi	      0x82000000    n=1 p=0 t=0 ss=10b
  child phys.mid,lo   0x0000001f_00000000
  parent phys.hi      0x82000000    n=1 p=0 t=0 ss=10b
  parent phys.mid,lo  0x0000001f_00000000
  length hi,lo        0x00000000_00600000

which would make it a PCI-to-PCI bridge (I assume a PCIe Root Port),
where the child (secondary bus) address space is the non-relocatable
non-prefetchable 32-bit memory space 0x1f_00000000-0x1f_005fffff and
the parent (primary bus) address space is also non-relocatable
non-prefetchable 32-bit memory space at 0x1f_00000000-0x1f_005fffff.

This looks wrong to me because the pci@0 parent address space
(0x1f_00000000-0x1f_005fffff) should be inside the pcie@120000 child
address space (0x00000000-0xfffffffc), but it's not.

IIUC, this patch clears the upper 32 bits in the pci@0 parent address
space.  That would make things work correctly in this case because
that happens to be the exact translation of pcie@120000, so it results
in pci@0 parent address space of 0x00000000-0x005fffff.

But I don't think it works in general because there's no requirement
that the host bridge address translation be that simple.  For example,
if we have two host bridges, and we want each to have 2GB of 32-bit
PCI address space starting at 0x0, it might look like this:

  0x00000002_00000000 -> PCI 0x00000000 (subtract 0x00000002_00000000)
  0x00000002_80000000 -> PCI 0x00000000 (subtract 0x00000002_80000000)

In this case simply ignoring the high 32 bits of the CPU address isn't
the correct translation for the second host bridge.  I think we should
look at each host bridge's "ranges", find the difference between its
parent and child addresses, and apply the same difference to
everything below that bridge.

> while with the patch applied the chain correctly become:

> pcie@120000: <0x2000000 0x00 0x00    0x1f 0x00                0x00 0xfffffffc>;
> pci@0      : <0x82000000 0x00 0x00   0x82000000 0x00 0x00     0x00 0x600000>;
> dev@0,0    : <0x01 0x00 0x00         0x82010000 0x00 0x00     0x00 0x400000>;
> rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;

> > > > Maybe my expectation of this being described in DT is mistaken.
> > > 
> > > Not sure what you mean here, the address being translated are coming from
> > > DT, in fact they are described by "ranges" properties.
> > 
> > Right, for my own future reference since I couldn't find a generic
> > description of "ranges" in Documentation/devicetree/:
> > 
> > [1] https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#ranges
Andrea della Porta Oct. 6, 2024, 11:20 a.m. UTC | #8
Hi Bjorn,

On 15:17 Sat 28 Sep     , Bjorn Helgaas wrote:
...
> From your earlier email
> (https://lore.kernel.org/r/Zszcps6bnCcdFa54@apocalypse):
> 
> > Without this patch the range translation chain is broken, like this:
> 
> > pcie@120000: <0x2000000 0x00 0x00    0x1f 0x00                0x00 0xfffffffc>;
> > ~~~ chain breaks here ~~~
> > pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
> > dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
> > rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;
> 
> The cover letter said "RP1 is an MFD chipset that acts as a
> south-bridge PCIe endpoint .. the RP1 as an endpoint itself is
> discoverable via usual PCI enumeration".
> 
> I assume pcie@120000 is the PCI host bridge and is already in the
> original DT describing the platform.  I assume pci@0 is a Root Port
> and dev@0,0 is the RP1 Endpoint, and the existing code already adds
> them as they are enumerated when pci_bus_add_device() calls
> of_pci_make_dev_node(), and I think this series adds the rp1@0
> description.

Correct.

> 
> And the "ranges" properties are built when of_pci_make_dev_node()
> eventually calls of_pci_prop_ranges().  With reference to sec 2.2.1.1
> of https://www.devicetree.org/open-firmware/bindings/pci/pci2_1.pdf
> and
> https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#ranges,
> I *think* your example says:
> 
> pcie@120000 has:
>   child phys.hi	      0x02000000    n=0 p=0 t=0 ss=10b
>   child phys.mid,lo   0x00000000_00000000
>   parent phys.hi,lo   0x0000001f_00000000
>   length hi,lo        0x00000000_fffffffc
> 
> which would make it a bridge where the child (PCI) address space is
> relocatable non-prefetchable 32-bit memory space at
> 0x00000000-0xfffffffc, and the corresponding parent address space is
> 0x1f_00000000-0x1f_fffffffc.  That means the host bridge applies an
> address translation of "child_addr = parent_addr - 0x1f_00000000".
> 
> pci@0 has:
>   child phys.hi	      0x82000000    n=1 p=0 t=0 ss=10b
>   child phys.mid,lo   0x0000001f_00000000
>   parent phys.hi      0x82000000    n=1 p=0 t=0 ss=10b
>   parent phys.mid,lo  0x0000001f_00000000
>   length hi,lo        0x00000000_00600000
> 
> which would make it a PCI-to-PCI bridge (I assume a PCIe Root Port),
> where the child (secondary bus) address space is the non-relocatable
> non-prefetchable 32-bit memory space 0x1f_00000000-0x1f_005fffff and
> the parent (primary bus) address space is also non-relocatable
> non-prefetchable 32-bit memory space at 0x1f_00000000-0x1f_005fffff.
> 
> This looks wrong to me because the pci@0 parent address space
> (0x1f_00000000-0x1f_005fffff) should be inside the pcie@120000 child
> address space (0x00000000-0xfffffffc), but it's not.

Exactly, that example refers to the 'uncorrected' case, i.e. without the
patch applied.

> 
> IIUC, this patch clears the upper 32 bits in the pci@0 parent address
> space.  That would make things work correctly in this case because
> that happens to be the exact translation of pcie@120000, so it results
> in pci@0 parent address space of 0x00000000-0x005fffff.

Right. I think we sould split it into two issues:

[1] RP1 acknowledges a 32 bit BAR address from its config space while the
device must be accessed using a 64 bit address (that is cpu address
0x1f_00000000), which sounds strange to me but I guess that is how
the hw interconnect has been designed, so we need to cope with it.

[2] I still think that the of_pci_set_address() function should be amended
to avoid generating invalid 64 address when 32 bit flag is set.

As you noted, fixing [2] will incidentally also let [1] work: I think
we can try to solve [1] the proper way and maybe defer [2] for a separate
patch.
To solve [1] I've dropped this patch and tried to solve it from devicetree,
modifying the following mapping:

pcie@120000: <0x3000000 0x1f 0x00    0x1f 0x00                0x00 0xfffffffc>;

so we now have a 1:1 64 bit mapping from 0x1f_00000000 to 0x1f_00000000.
I thought it would result in something like this:

pcie@120000: <0x3000000 0x1f 0x00    0x1f 0x00                0x00 0xfffffffc>;
pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;

but it fails instead (err: "can't assign; no space") in pci_assign_resource()
function trying to match the size using pci_clip_resource_to_region(). It turned
out that the clipping is done against 32 bit memory region 'pci_32_bit',and
this is failing because the original region addresses to be clipped wxxiereas 64
bit wide. The 'culprit' seems to be the function devm_of_pci_get_host_bridge_resources()
dropping IORESOURCE_MEM_64 on any memory resource, which seems to be a change
somewhat specific to a RK3399 case (see commit 3bd6b8271ee66), but I'm not sure
whether it can be considered generic.

So, I'm actually at an empasse here.

Also, while taking a look at the resulting devicetree, I'm a bit confused by the
fact that the parent address generated by of_pci_prop_ranges() for the pci@0,0
bridge seems to be taken from the parent address of the pcie@120000 node. Shouldn't
it be taken from the child address of pcie@120000, instead?

> 
> But I don't think it works in general because there's no requirement
> that the host bridge address translation be that simple.  For example,
> if we have two host bridges, and we want each to have 2GB of 32-bit
> PCI address space starting at 0x0, it might look like this:
> 
>   0x00000002_00000000 -> PCI 0x00000000 (subtract 0x00000002_00000000)
>   0x00000002_80000000 -> PCI 0x00000000 (subtract 0x00000002_80000000)
> 
> In this case simply ignoring the high 32 bits of the CPU address isn't
> the correct translation for the second host bridge.  I think we should
> look at each host bridge's "ranges", find the difference between its
> parent and child addresses, and apply the same difference to
> everything below that bridge.

Not sure I've got this scenario straight: can you please provide the topology
and the bit setting (32/64 bit) for those ranges? Also, is this scenario coming
from a real use case or is it hypothetical?

Many thanks,
Andrea

...
Bjorn Helgaas Oct. 8, 2024, 1:08 a.m. UTC | #9
On Sun, Oct 06, 2024 at 01:20:51PM +0200, Andrea della Porta wrote:
> On 15:17 Sat 28 Sep, Bjorn Helgaas wrote:
> ...
> > From your earlier email
> > (https://lore.kernel.org/r/Zszcps6bnCcdFa54@apocalypse):
> > 
> > > Without this patch the range translation chain is broken, like this:
> > 
> > > pcie@120000: <0x2000000 0x00 0x00    0x1f 0x00                0x00 0xfffffffc>;
> > > ~~~ chain breaks here ~~~
> > > pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
> > > dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
> > > rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;
> > 
> > The cover letter said "RP1 is an MFD chipset that acts as a
> > south-bridge PCIe endpoint .. the RP1 as an endpoint itself is
> > discoverable via usual PCI enumeration".
> > 
> > I assume pcie@120000 is the PCI host bridge and is already in the
> > original DT describing the platform.  I assume pci@0 is a Root Port
> > and dev@0,0 is the RP1 Endpoint, and the existing code already adds
> > them as they are enumerated when pci_bus_add_device() calls
> > of_pci_make_dev_node(), and I think this series adds the rp1@0
> > description.
> 
> Correct.
> 
> > And the "ranges" properties are built when of_pci_make_dev_node()
> > eventually calls of_pci_prop_ranges().  With reference to sec 2.2.1.1
> > of https://www.devicetree.org/open-firmware/bindings/pci/pci2_1.pdf
> > and
> > https://devicetree-specification.readthedocs.io/en/latest/chapter2-devicetree-basics.html#ranges,
> > I *think* your example says:
> > 
> > pcie@120000 has:
> >   child phys.hi       0x02000000    n=0 p=0 t=0 ss=10b
> >   child phys.mid,lo   0x00000000_00000000
> >   parent phys.hi,lo   0x0000001f_00000000
> >   length hi,lo        0x00000000_fffffffc
> > 
> > which would make it a bridge where the child (PCI) address space is
> > relocatable non-prefetchable 32-bit memory space at
> > 0x00000000-0xfffffffc, and the corresponding parent address space is
> > 0x1f_00000000-0x1f_fffffffc.  That means the host bridge applies an
> > address translation of "child_addr = parent_addr - 0x1f_00000000".
> > 
> > pci@0 has:
> >   child phys.hi       0x82000000    n=1 p=0 t=0 ss=10b
> >   child phys.mid,lo   0x0000001f_00000000
> >   parent phys.hi      0x82000000    n=1 p=0 t=0 ss=10b
> >   parent phys.mid,lo  0x0000001f_00000000
> >   length hi,lo        0x00000000_00600000
> > 
> > which would make it a PCI-to-PCI bridge (I assume a PCIe Root Port),
> > where the child (secondary bus) address space is the non-relocatable
> > non-prefetchable 32-bit memory space 0x1f_00000000-0x1f_005fffff and
> > the parent (primary bus) address space is also non-relocatable
> > non-prefetchable 32-bit memory space at 0x1f_00000000-0x1f_005fffff.
> > 
> > This looks wrong to me because the pci@0 parent address space
> > (0x1f_00000000-0x1f_005fffff) should be inside the pcie@120000 child
> > address space (0x00000000-0xfffffffc), but it's not.
> 
> Exactly, that example refers to the 'uncorrected' case, i.e. without the
> patch applied.
> 
> > IIUC, this patch clears the upper 32 bits in the pci@0 parent address
> > space.  That would make things work correctly in this case because
> > that happens to be the exact translation of pcie@120000, so it results
> > in pci@0 parent address space of 0x00000000-0x005fffff.
> 
> Right. I think we should split it into two issues:
> 
> [1] RP1 acknowledges a 32 bit BAR address from its config space while the
> device must be accessed using a 64 bit address (that is cpu address
> 0x1f_00000000), which sounds strange to me but I guess that is how
> the hw interconnect has been designed, so we need to cope with it.

It's common that PCI bus addresses are identical to CPU physical
addresses, but by no means universal.  More details in
Documentation/core-api/dma-api-howto.rst

> [2] I still think that the of_pci_set_address() function should be amended
> to avoid generating invalid 64 address when 32 bit flag is set.
> 
> As you noted, fixing [2] will incidentally also let [1] work: I think
> we can try to solve [1] the proper way and maybe defer [2] for a separate
> patch.
> To solve [1] I've dropped this patch and tried to solve it from devicetree,
> modifying the following mapping:
> 
> pcie@120000: <0x3000000 0x1f 0x00    0x1f 0x00                0x00 0xfffffffc>;
> 
> so we now have a 1:1 64 bit mapping from 0x1f_00000000 to 0x1f_00000000.

That's the wrong thing to change.  pcie@120000 is fine; it's pci@0
that's incorrect.

pcie@120000 is the host bridge, and its "ranges" must describe the
address translation it performs between the primary (CPU) side and the
secondary (PCI) side.  Either this offset is built into the hardware
and can't be changed, or the offset is configured by firmware and the
DT has to match.

So I think this description is correct:

  pcie@120000: <0x2000000 0x0 0x00000000 0x1f 0x00000000 0x0 0xfffffffc>;

which means we have an aperture from CPU physical addresses to PCI bus
addresses like this:

  Host bridge: [mem 0x1f_00000000-0x1f_fffffffb window] (bus address 0x00000000-0xfffffffb)

> I thought it would result in something like this:
> 
> pcie@120000: <0x3000000 0x1f 0x00    0x1f 0x00                0x00 0xfffffffc>;
> pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
> dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
> rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;
> 
> but it fails instead (err: "can't assign; no space") in pci_assign_resource()
> function trying to match the size using pci_clip_resource_to_region(). It turned
> out that the clipping is done against 32 bit memory region 'pci_32_bit',and
> this is failing because the original region addresses to be clipped wxxiereas 64
> bit wide. The 'culprit' seems to be the function devm_of_pci_get_host_bridge_resources()
> dropping IORESOURCE_MEM_64 on any memory resource, which seems to be a change
> somewhat specific to a RK3399 case (see commit 3bd6b8271ee66), but I'm not sure
> whether it can be considered generic.

I think the problem is that we're building the pci@0 (Root Port)
"ranges" incorrectly.  pci@0 is a PCI-PCI bridge, which cannot do any
address translation, so its parent and child address spaces must both
be inside the pcie@120000 *child* address space.

> Also, while taking a look at the resulting devicetree, I'm a bit confused by the
> fact that the parent address generated by of_pci_prop_ranges() for the pci@0,0
> bridge seems to be taken from the parent address of the pcie@120000 node. Shouldn't
> it be taken from the child address of pcie@120000, instead?

Yes, this is exactly the problem.  The pci@0 parent and child
addresses in "ranges" are both in the PCI address space.  But we
start with pdev->resource[N], which is a CPU address.  To get the PCI
address, we need to apply pci_bus_address().  If the host bridge
windows are set up correctly, the window->offset used in
pcibios_resource_to_bus() should yield the PCI bus address.

I think it should look like this:

  pci@0: <0x82000000 0x0 0x00000000 0x82000000 0x0 0x00000000 0x0 0x600000>;

By default lspci shows you the CPU addresses for BARs, so you should
see something like this:

  00:02.0 PCI bridge
    Memory behind bridge: 1f00000000-1ffffffffb
    Capabilities: [40] Express Root Port

If you run "lspci -b", it will show you PCI bus addresses instead,
which should look like this:

  00:02.0 PCI bridge
    Memory behind bridge: 00000000-fffffffb
    Capabilities: [40] Express Root Port

> > But I don't think it works in general because there's no requirement
> > that the host bridge address translation be that simple.  For example,
> > if we have two host bridges, and we want each to have 2GB of 32-bit
> > PCI address space starting at 0x0, it might look like this:
> > 
> >   0x00000002_00000000 -> PCI 0x00000000 (subtract 0x00000002_00000000)
> >   0x00000002_80000000 -> PCI 0x00000000 (subtract 0x00000002_80000000)
> > 
> > In this case simply ignoring the high 32 bits of the CPU address isn't
> > the correct translation for the second host bridge.  I think we should
> > look at each host bridge's "ranges", find the difference between its
> > parent and child addresses, and apply the same difference to
> > everything below that bridge.
> 
> Not sure I've got this scenario straight: can you please provide the topology
> and the bit setting (32/64 bit) for those ranges? Also, is this scenario coming
> from a real use case or is it hypothetical?

This scenario is purely hypothetical, but it's a legal topology that
we should handle correctly.  It's two host bridges, with independent
PCI hierarchies below them:

  Host bridge A: [mem 0x2_00000000-0x2_7fffffff window] (bus address 0x00000000-0x7fffffff)
  Host bridge B: [mem 0x2_80000000-0x2_ffffffff window] (bus address 0x00000000-0x7fffffff)

Bridge A has an MMIO aperture at CPU addresses
0x2_00000000-0x2_7fffffff, and when it initiates PCI transactions on
its secondary side, the PCI address is CPU_addr - 0x2_00000000.

Similarly, bridge B has an MMIO aperture at CPU addresses 
0x2_80000000-0x2_ffffffff, and when it initiates PCI transactions on 
its secondary side, the PCI address is CPU_addr - 0x2_80000000.

Both hierarchies use PCI bus addresses in the 0x00000000-0x7fffffff
range.  In a topology like this, you can't convert a bus address back
to a CPU address unless you know which hierarchy it's in.
pcibios_bus_to_resource() takes a pci_bus pointer, which tells you
which hierarchy (and which host bridge address translation) to use.

Bjorn
Andrea della Porta Oct. 18, 2024, 12:41 p.m. UTC | #10
Hi Bjorn,

On 20:08 Mon 07 Oct     , Bjorn Helgaas wrote:
... 
> It's common that PCI bus addresses are identical to CPU physical
> addresses, but by no means universal.  More details in
> Documentation/core-api/dma-api-howto.rst
> 
> > [2] I still think that the of_pci_set_address() function should be amended
> > to avoid generating invalid 64 address when 32 bit flag is set.
> > 
> > As you noted, fixing [2] will incidentally also let [1] work: I think
> > we can try to solve [1] the proper way and maybe defer [2] for a separate
> > patch.
> > To solve [1] I've dropped this patch and tried to solve it from devicetree,
> > modifying the following mapping:
> > 
> > pcie@120000: <0x3000000 0x1f 0x00    0x1f 0x00                0x00 0xfffffffc>;
> > 
> > so we now have a 1:1 64 bit mapping from 0x1f_00000000 to 0x1f_00000000.
> 
> That's the wrong thing to change.  pcie@120000 is fine; it's pci@0
> that's incorrect.
> 
> pcie@120000 is the host bridge, and its "ranges" must describe the
> address translation it performs between the primary (CPU) side and the
> secondary (PCI) side.  Either this offset is built into the hardware
> and can't be changed, or the offset is configured by firmware and the
> DT has to match.
> 
> So I think this description is correct:
> 
>   pcie@120000: <0x2000000 0x0 0x00000000 0x1f 0x00000000 0x0 0xfffffffc>;
> 
> which means we have an aperture from CPU physical addresses to PCI bus
> addresses like this:
> 
>   Host bridge: [mem 0x1f_00000000-0x1f_fffffffb window] (bus address 0x00000000-0xfffffffb)
> 
> > I thought it would result in something like this:
> > 
> > pcie@120000: <0x3000000 0x1f 0x00    0x1f 0x00                0x00 0xfffffffc>;
> > pci@0      : <0x82000000 0x1f 0x00   0x82000000 0x1f 0x00     0x00 0x600000>;
> > dev@0,0    : <0x01 0x00 0x00         0x82010000 0x1f 0x00     0x00 0x400000>;
> > rp1@0      : <0xc0 0x40000000        0x01 0x00 0x00           0x00 0x400000>;
> > 
> > but it fails instead (err: "can't assign; no space") in pci_assign_resource()
> > function trying to match the size using pci_clip_resource_to_region(). It turned
> > out that the clipping is done against 32 bit memory region 'pci_32_bit',and
> > this is failing because the original region addresses to be clipped wxxiereas 64
> > bit wide. The 'culprit' seems to be the function devm_of_pci_get_host_bridge_resources()
> > dropping IORESOURCE_MEM_64 on any memory resource, which seems to be a change
> > somewhat specific to a RK3399 case (see commit 3bd6b8271ee66), but I'm not sure
> > whether it can be considered generic.
> 
> I think the problem is that we're building the pci@0 (Root Port)
> "ranges" incorrectly.  pci@0 is a PCI-PCI bridge, which cannot do any
> address translation, so its parent and child address spaces must both
> be inside the pcie@120000 *child* address space.
> 
> > Also, while taking a look at the resulting devicetree, I'm a bit confused by the
> > fact that the parent address generated by of_pci_prop_ranges() for the pci@0,0
> > bridge seems to be taken from the parent address of the pcie@120000 node. Shouldn't
> > it be taken from the child address of pcie@120000, instead?
> 
> Yes, this is exactly the problem.  The pci@0 parent and child
> addresses in "ranges" are both in the PCI address space.  But we
> start with pdev->resource[N], which is a CPU address.  To get the PCI
> address, we need to apply pci_bus_address().  If the host bridge
> windows are set up correctly, the window->offset used in
> pcibios_resource_to_bus() should yield the PCI bus address.

You mean something like this, I think:

@@ -129,7 +129,7 @@ static int of_pci_prop_ranges(struct pci_dev *pdev, struct of_changeset *ocs,
                if (of_pci_get_addr_flags(&res[j], &flags))
                        continue;
 
-               val64 = res[j].start;
+               val64 = pci_bus_address(pdev, &res[j] - pdev->resource);
                of_pci_set_address(pdev, rp[i].parent_addr, val64, 0, flags,
                                   false);
                if (pci_is_bridge(pdev)) {

> 
> I think it should look like this:
> 
>   pci@0: <0x82000000 0x0 0x00000000 0x82000000 0x0 0x00000000 0x0 0x600000>;

indeed, with the above patch applied, the result is exactly as you expected.

> 
> By default lspci shows you the CPU addresses for BARs, so you should
> see something like this:
> 
>   00:02.0 PCI bridge
>     Memory behind bridge: 1f00000000-1ffffffffb
>     Capabilities: [40] Express Root Port
> 
> If you run "lspci -b", it will show you PCI bus addresses instead,
> which should look like this:
> 
>   00:02.0 PCI bridge
>     Memory behind bridge: 00000000-fffffffb
>     Capabilities: [40] Express Root Port
> 
> > > But I don't think it works in general because there's no requirement
> > > that the host bridge address translation be that simple.  For example,
> > > if we have two host bridges, and we want each to have 2GB of 32-bit
> > > PCI address space starting at 0x0, it might look like this:
> > > 
> > >   0x00000002_00000000 -> PCI 0x00000000 (subtract 0x00000002_00000000)
> > >   0x00000002_80000000 -> PCI 0x00000000 (subtract 0x00000002_80000000)
> > > 
> > > In this case simply ignoring the high 32 bits of the CPU address isn't
> > > the correct translation for the second host bridge.  I think we should
> > > look at each host bridge's "ranges", find the difference between its
> > > parent and child addresses, and apply the same difference to
> > > everything below that bridge.
> > 
> > Not sure I've got this scenario straight: can you please provide the topology
> > and the bit setting (32/64 bit) for those ranges? Also, is this scenario coming
> > from a real use case or is it hypothetical?
> 
> This scenario is purely hypothetical, but it's a legal topology that
> we should handle correctly.  It's two host bridges, with independent
> PCI hierarchies below them:
> 
>   Host bridge A: [mem 0x2_00000000-0x2_7fffffff window] (bus address 0x00000000-0x7fffffff)
>   Host bridge B: [mem 0x2_80000000-0x2_ffffffff window] (bus address 0x00000000-0x7fffffff)
> 
> Bridge A has an MMIO aperture at CPU addresses
> 0x2_00000000-0x2_7fffffff, and when it initiates PCI transactions on
> its secondary side, the PCI address is CPU_addr - 0x2_00000000.
> 
> Similarly, bridge B has an MMIO aperture at CPU addresses 
> 0x2_80000000-0x2_ffffffff, and when it initiates PCI transactions on 
> its secondary side, the PCI address is CPU_addr - 0x2_80000000.
> 
> Both hierarchies use PCI bus addresses in the 0x00000000-0x7fffffff
> range.  In a topology like this, you can't convert a bus address back
> to a CPU address unless you know which hierarchy it's in.
> pcibios_bus_to_resource() takes a pci_bus pointer, which tells you
> which hierarchy (and which host bridge address translation) to use.

Agreed. While I think about how to adjust that specific patch,i let's drop it from
this patchset since the aforementioned change is properly fixing the translation
issue.

> 
> Bjora

Many thanks,
Andrea
Bjorn Helgaas Oct. 18, 2024, 10:28 p.m. UTC | #11
On Fri, Oct 18, 2024 at 02:41:11PM +0200, Andrea della Porta wrote:
> On 20:08 Mon 07 Oct     , Bjorn Helgaas wrote:
> ... 

> > Yes, this is exactly the problem.  The pci@0 parent and child
> > addresses in "ranges" are both in the PCI address space.  But we
> > start with pdev->resource[N], which is a CPU address.  To get the PCI
> > address, we need to apply pci_bus_address().  If the host bridge
> > windows are set up correctly, the window->offset used in
> > pcibios_resource_to_bus() should yield the PCI bus address.
> 
> You mean something like this, I think:
> 
> @@ -129,7 +129,7 @@ static int of_pci_prop_ranges(struct pci_dev *pdev, struct of_changeset *ocs,
>                 if (of_pci_get_addr_flags(&res[j], &flags))
>                         continue;
>  
> -               val64 = res[j].start;
> +               val64 = pci_bus_address(pdev, &res[j] - pdev->resource);
>                 of_pci_set_address(pdev, rp[i].parent_addr, val64, 0, flags,
>                                    false);
>                 if (pci_is_bridge(pdev)) {

Yes.

> > I think it should look like this:
> > 
> >   pci@0: <0x82000000 0x0 0x00000000 0x82000000 0x0 0x00000000 0x0 0x600000>;
> 
> indeed, with the above patch applied, the result is exactly as you expected.
> ...

> > > > But I don't think it works in general because there's no
> > > > requirement that the host bridge address translation be that
> > > > simple.  For example, if we have two host bridges, and we want
> > > > each to have 2GB of 32-bit PCI address space starting at 0x0,
> > > > it might look like this:
> > > > 
> > > >   0x00000002_00000000 -> PCI 0x00000000 (subtract 0x00000002_00000000)
> > > >   0x00000002_80000000 -> PCI 0x00000000 (subtract 0x00000002_80000000)
> > > > 
> > > > In this case simply ignoring the high 32 bits of the CPU
> > > > address isn't the correct translation for the second host
> > > > bridge.  I think we should look at each host bridge's
> > > > "ranges", find the difference between its parent and child
> > > > addresses, and apply the same difference to everything below
> > > > that bridge.
> > > 
> > > Not sure I've got this scenario straight: can you please provide
> > > the topology and the bit setting (32/64 bit) for those ranges?
> > > Also, is this scenario coming from a real use case or is it
> > > hypothetical?
> > 
> > This scenario is purely hypothetical, but it's a legal topology
> > that we should handle correctly.  It's two host bridges, with
> > independent PCI hierarchies below them:
> > 
> >   Host bridge A: [mem 0x2_00000000-0x2_7fffffff window] (bus address 0x00000000-0x7fffffff)
> >   Host bridge B: [mem 0x2_80000000-0x2_ffffffff window] (bus address 0x00000000-0x7fffffff)
> > 
> > Bridge A has an MMIO aperture at CPU addresses
> > 0x2_00000000-0x2_7fffffff, and when it initiates PCI transactions on
> > its secondary side, the PCI address is CPU_addr - 0x2_00000000.
> > 
> > Similarly, bridge B has an MMIO aperture at CPU addresses 
> > 0x2_80000000-0x2_ffffffff, and when it initiates PCI transactions on 
> > its secondary side, the PCI address is CPU_addr - 0x2_80000000.
> > 
> > Both hierarchies use PCI bus addresses in the 0x00000000-0x7fffffff
> > range.  In a topology like this, you can't convert a bus address back
> > to a CPU address unless you know which hierarchy it's in.
> > pcibios_bus_to_resource() takes a pci_bus pointer, which tells you
> > which hierarchy (and which host bridge address translation) to use.
> 
> Agreed. While I think about how to adjust that specific patch,i
> let's drop it from this patchset since the aforementioned change is
> properly fixing the translation issue.

OK.  I assume you mean to drop the "PCI: of_property: Sanitize 32 bit
PCI address parsed from DT" patch?  Or replace it with the
pci_bus_address() addition above?

Bjorn
Andrea della Porta Oct. 19, 2024, 8:46 a.m. UTC | #12
Hi Bjorn,

On 17:28 Fri 18 Oct     , Bjorn Helgaas wrote:
> On Fri, Oct 18, 2024 at 02:41:11PM +0200, Andrea della Porta wrote:
> > On 20:08 Mon 07 Oct     , Bjorn Helgaas wrote:
> > ... 
> 
> > > Yes, this is exactly the problem.  The pci@0 parent and child
> > > addresses in "ranges" are both in the PCI address space.  But we
> > > start with pdev->resource[N], which is a CPU address.  To get the PCI
> > > address, we need to apply pci_bus_address().  If the host bridge
> > > windows are set up correctly, the window->offset used in
> > > pcibios_resource_to_bus() should yield the PCI bus address.
> > 
> > You mean something like this, I think:
> > 
> > @@ -129,7 +129,7 @@ static int of_pci_prop_ranges(struct pci_dev *pdev, struct of_changeset *ocs,
> >                 if (of_pci_get_addr_flags(&res[j], &flags))
> >                         continue;
> >  
> > -               val64 = res[j].start;
> > +               val64 = pci_bus_address(pdev, &res[j] - pdev->resource);
> >                 of_pci_set_address(pdev, rp[i].parent_addr, val64, 0, flags,
> >                                    false);
> >                 if (pci_is_bridge(pdev)) {
> 
> Yes.
> 
> > > I think it should look like this:
> > > 
> > >   pci@0: <0x82000000 0x0 0x00000000 0x82000000 0x0 0x00000000 0x0 0x600000>;
> > 
> > indeed, with the above patch applied, the result is exactly as you expected.
> > ...
> 
> > > > > But I don't think it works in general because there's no
> > > > > requirement that the host bridge address translation be that
> > > > > simple.  For example, if we have two host bridges, and we want
> > > > > each to have 2GB of 32-bit PCI address space starting at 0x0,
> > > > > it might look like this:
> > > > > 
> > > > >   0x00000002_00000000 -> PCI 0x00000000 (subtract 0x00000002_00000000)
> > > > >   0x00000002_80000000 -> PCI 0x00000000 (subtract 0x00000002_80000000)
> > > > > 
> > > > > In this case simply ignoring the high 32 bits of the CPU
> > > > > address isn't the correct translation for the second host
> > > > > bridge.  I think we should look at each host bridge's
> > > > > "ranges", find the difference between its parent and child
> > > > > addresses, and apply the same difference to everything below
> > > > > that bridge.
> > > > 
> > > > Not sure I've got this scenario straight: can you please provide
> > > > the topology and the bit setting (32/64 bit) for those ranges?
> > > > Also, is this scenario coming from a real use case or is it
> > > > hypothetical?
> > > 
> > > This scenario is purely hypothetical, but it's a legal topology
> > > that we should handle correctly.  It's two host bridges, with
> > > independent PCI hierarchies below them:
> > > 
> > >   Host bridge A: [mem 0x2_00000000-0x2_7fffffff window] (bus address 0x00000000-0x7fffffff)
> > >   Host bridge B: [mem 0x2_80000000-0x2_ffffffff window] (bus address 0x00000000-0x7fffffff)
> > > 
> > > Bridge A has an MMIO aperture at CPU addresses
> > > 0x2_00000000-0x2_7fffffff, and when it initiates PCI transactions on
> > > its secondary side, the PCI address is CPU_addr - 0x2_00000000.
> > > 
> > > Similarly, bridge B has an MMIO aperture at CPU addresses 
> > > 0x2_80000000-0x2_ffffffff, and when it initiates PCI transactions on 
> > > its secondary side, the PCI address is CPU_addr - 0x2_80000000.
> > > 
> > > Both hierarchies use PCI bus addresses in the 0x00000000-0x7fffffff
> > > range.  In a topology like this, you can't convert a bus address back
> > > to a CPU address unless you know which hierarchy it's in.
> > > pcibios_bus_to_resource() takes a pci_bus pointer, which tells you
> > > which hierarchy (and which host bridge address translation) to use.
> > 
> > Agreed. While I think about how to adjust that specific patch,i
> > let's drop it from this patchset since the aforementioned change is
> > properly fixing the translation issue.
> 
> OK.  I assume you mean to drop the "PCI: of_property: Sanitize 32 bit
> PCI address parsed from DT" patch?  Or replace it with the
> pci_bus_address() addition above?

I'm planning to replace that patch with the above mentioned pci_bus_address()
addition. However, I think the 32 bit sanitization is still useful to prevent
wrongly encoded address to linger around, but I defer it to a subsequent standalone
patch, after figuring out the dual bridge scenario that you proposed.

> 
> Bjorn

Many thanks,
Andrea
diff mbox series

Patch

diff --git a/drivers/pci/of_property.c b/drivers/pci/of_property.c
index 5a0b98e69795..77865facdb4a 100644
--- a/drivers/pci/of_property.c
+++ b/drivers/pci/of_property.c
@@ -60,7 +60,10 @@  static void of_pci_set_address(struct pci_dev *pdev, u32 *prop, u64 addr,
 	prop[0] |= flags | reg_num;
 	if (!reloc) {
 		prop[0] |= OF_PCI_ADDR_FIELD_NONRELOC;
-		prop[1] = upper_32_bits(addr);
+		if (FIELD_GET(OF_PCI_ADDR_FIELD_SS, flags) == OF_PCI_ADDR_SPACE_MEM64)
+			prop[1] = upper_32_bits(addr);
+		else
+			prop[1] = 0;
 		prop[2] = lower_32_bits(addr);
 	}
 }