
[V13,2/5] PCI: Create device tree node for bridge

Message ID 1692120000-46900-3-git-send-email-lizhi.hou@amd.com (mailing list archive)
State Accepted
Commit 407d1a51921e9f28c1bcec647c2205925bd1fdab
Delegated to: Rob Herring
Series Generate device tree node for pci devices

Commit Message

Lizhi Hou Aug. 15, 2023, 5:19 p.m. UTC
A PCI endpoint device such as the Xilinx Alveo PCI card maps the register
spaces of multiple hardware peripherals to its PCI BARs. Normally, the PCI
core discovers devices and BARs through the PCI enumeration process, but
there is no infrastructure to discover the hardware peripherals that are
present in a PCI device and accessible through its PCI BARs.

The device tree framework requires a device tree node for the PCI device
before it can generate device tree nodes for the hardware peripherals
underneath. Because PCI is a self-discoverable bus, a device tree node might
not have been created for a PCI device. Furthermore, if the PCI device is
hot-pluggable, device tree nodes for its parent bridges are required when it
is plugged in. Add support for generating device tree nodes for PCI bridges.

Add an of_pci_make_dev_node() interface that can be used to create device
tree nodes for PCI devices.

Add a PCI_DYNAMIC_OF_NODES config option. When the option is turned on,
the kernel will generate device tree nodes for PCI bridges unconditionally.

Initially, add the basic properties for the dynamically generated device
tree nodes: #address-cells, #size-cells, device_type, compatible, ranges,
and reg.

Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
---
 drivers/pci/Kconfig       |  12 ++
 drivers/pci/Makefile      |   1 +
 drivers/pci/bus.c         |   2 +
 drivers/pci/of.c          |  79 +++++++++
 drivers/pci/of_property.c | 355 ++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h         |  12 ++
 drivers/pci/remove.c      |   1 +
 7 files changed, 462 insertions(+)
 create mode 100644 drivers/pci/of_property.c

Comments

Guenter Roeck Aug. 31, 2023, 1:57 p.m. UTC | #1
On Tue, Aug 15, 2023 at 10:19:57AM -0700, Lizhi Hou wrote:
> ...

This patch results in the following build error.

Building sparc64:allmodconfig ... failed
--------------
Error log:
<stdin>:1519:2: warning: #warning syscall clone3 not implemented [-Wcpp]
sparc64-linux-ld: drivers/pci/of_property.o: in function `of_pci_prop_intr_map':
of_property.c:(.text+0xc4): undefined reference to `of_irq_parse_raw'

Guenter
Jonathan Cameron Sept. 11, 2023, 2:48 p.m. UTC | #2
On Tue, 15 Aug 2023 10:19:57 -0700
Lizhi Hou <lizhi.hou@amd.com> wrote:

> ...

I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
machine.

There are some missing parts that were present in Clement's series, but not this
one, particularly creation of the root pci object.

Anyhow, hit an intermittent crash...


> ---
> +static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
> +				struct device_node *np)
> +{
> +	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
> +	u32 i, addr_sz[OF_PCI_MAX_INT_PIN], map_sz = 0;
> +	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
> +	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
> +	struct device_node *pnode;
> +	struct pci_dev *child;
> +	u32 *int_map, *mapp;
> +	int ret;
> +	u8 pin;
> +
> +	pnode = pci_device_to_OF_node(pdev->bus->self);
> +	if (!pnode)
> +		pnode = pci_bus_to_OF_node(pdev->bus);
> +
> +	if (!pnode) {
> +		pci_err(pdev, "failed to get parent device node");
> +		return -EINVAL;
> +	}
> +
> +	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
> +	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN;  pin++) {
> +		i = pin - 1;
> +		out_irq[i].np = pnode;
> +		out_irq[i].args_count = 1;
> +		out_irq[i].args[0] = pin;
> +		ret = of_irq_parse_raw(laddr, &out_irq[i]);
> +		if (ret) {
> +			pci_err(pdev, "parse irq %d failed, ret %d", pin, ret);
> +			continue;

If all the interrupt parsing fails, we continue every time...

> +		}
> +		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
> +					   &addr_sz[i]);
> +		if (ret)
> +			addr_sz[i] = 0;

...so this is never reached and addr_sz[i] is never set.

> +	}
> +
> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
> +			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;

and here we end up reading uninitialized memory, which in my case sometimes
causes a massive allocation that fails one of the assertions in the
allocator.

I'd suggest just setting addr_sz[xxx] = {}; above
to ensure it's initialized. Then the if (ret) handling should not be needed
either, since of_property_read_u32() should be side-effect free, I hope!

> +		}
> +	}
> +
> +	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
> +	mapp = int_map;
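
For readers following along, here is a minimal userspace sketch of the
suggestion above. It is not kernel code: struct pin_probe, collect_addr_sz()
and sum_addr_sz() are hypothetical names that only model the first loop of
of_pci_prop_intr_map(). The point is that a pin skipped with "continue" keeps
whatever value addr_sz[i] had on entry, so the array has to start out zeroed.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INT_PIN 4	/* stands in for OF_PCI_MAX_INT_PIN */

/*
 * Hypothetical per-pin probe result. It models of_irq_parse_raw() followed
 * by the "#address-cells" read; a non-zero ret means the parse failed.
 */
struct pin_probe {
	int ret;
	uint32_t addr_cells;
};

/*
 * Mirror of the first loop in the patch: failed pins are skipped with
 * "continue", so addr_sz[i] keeps whatever value it had on entry.
 */
static void collect_addr_sz(const struct pin_probe probe[MAX_INT_PIN],
			    uint32_t addr_sz[MAX_INT_PIN])
{
	assert(probe && addr_sz);

	for (unsigned int pin = 1; pin <= MAX_INT_PIN; pin++) {
		unsigned int i = pin - 1;

		if (probe[i].ret)
			continue;
		addr_sz[i] = probe[i].addr_cells;
	}
}

/* Total of the address sizes, as the map sizing loop would consume them. */
static uint32_t sum_addr_sz(const uint32_t addr_sz[MAX_INT_PIN])
{
	uint32_t total = 0;

	assert(addr_sz);

	for (unsigned int i = 0; i < MAX_INT_PIN; i++)
		total += addr_sz[i];
	return total;
}
```

With "uint32_t addr_sz[MAX_INT_PIN] = { 0 };" in the caller, an all-failed
probe contributes nothing to the total. Without the initializer the result
depends on stack contents, which matches the intermittent crash reported here.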
Herve Codina Sept. 11, 2023, 3:13 p.m. UTC | #3
Hi Lizhi,

On Tue, 15 Aug 2023 10:19:57 -0700
Lizhi Hou <lizhi.hou@amd.com> wrote:
...
> +void of_pci_make_dev_node(struct pci_dev *pdev)
> +{
> +	struct device_node *ppnode, *np = NULL;
> +	const char *pci_type;
> +	struct of_changeset *cset;
> +	const char *name;
> +	int ret;
> +
> +	/*
> +	 * If there is already a device tree node linked to this device,
> +	 * return immediately.
> +	 */
> +	if (pci_device_to_OF_node(pdev))
> +		return;
> +
> +	/* Check if there is device tree node for parent device */
> +	if (!pdev->bus->self)
> +		ppnode = pdev->bus->dev.of_node;
> +	else
> +		ppnode = pdev->bus->self->dev.of_node;
> +	if (!ppnode)
> +		return;
> +
> +	if (pci_is_bridge(pdev))
> +		pci_type = "pci";
> +	else
> +		pci_type = "dev";
> +
> +	name = kasprintf(GFP_KERNEL, "%s@%x,%x", pci_type,
> +			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +	if (!name)
> +		return;
> +
> +	cset = kmalloc(sizeof(*cset), GFP_KERNEL);
> +	if (!cset)
> +		goto failed;
> +	of_changeset_init(cset);
> +
> +	np = of_changeset_create_node(ppnode, name, cset);
> +	if (!np)
> +		goto failed;

The "goto failed" will leak the cset previously allocated.

np->data = cset; (next line) allows the cset to be freed when the node is
destroyed (of_node_put() calls). When the node cannot be created, the
allocated cset should be freed.
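
A minimal userspace sketch of the ownership rule described above, for
illustration only. struct changeset, struct node, cset_alloc(), node_create()
and node_put() are hypothetical stand-ins for the kernel objects, and
live_csets exists purely so the leak can be observed. In the kernel the
cleanup would presumably also involve of_changeset_destroy(); that part is
an assumption and is not modelled here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct of_changeset and struct device_node. */
struct changeset { int dummy; };
struct node { struct changeset *data; };

static int live_csets;	/* demo-only counter of changesets not yet freed */

static struct changeset *cset_alloc(void)
{
	struct changeset *cset = calloc(1, sizeof(*cset));

	if (cset)
		live_csets++;
	return cset;
}

static void cset_free(struct changeset *cset)
{
	if (!cset)
		return;
	live_csets--;
	free(cset);
}

/* Models the final of_node_put(): destroying the node also frees np->data. */
static void node_put(struct node *np)
{
	assert(np);
	cset_free(np->data);
	free(np);
}

/* Models of_changeset_create_node(); a non-zero fail forces the error path. */
static struct node *node_create(int fail)
{
	return fail ? NULL : calloc(1, sizeof(struct node));
}

/* Returns the new node, or NULL with nothing left allocated. */
static struct node *make_dev_node(int fail_create, int fail_props)
{
	struct changeset *cset = cset_alloc();
	struct node *np;

	if (!cset)
		return NULL;

	np = node_create(fail_create);
	if (!np) {
		cset_free(cset);	/* the node never took ownership */
		return NULL;
	}
	np->data = cset;		/* from here on node_put() releases cset */

	if (fail_props) {
		node_put(np);
		return NULL;
	}
	return np;
}
```

The only difference from the patch's error handling is the explicit
cset_free() on the path where no node exists yet to own the changeset.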

> +	np->data = cset;
> +
> +	ret = of_pci_add_properties(pdev, cset, np);
> +	if (ret)
> +		goto failed;
> +
> +	ret = of_changeset_apply(cset);
> +	if (ret)
> +		goto failed;
> +
> +	pdev->dev.of_node = np;
> +	kfree(name);
> +
> +	return;
> +
> +failed:
> +	if (np)
> +		of_node_put(np);
> +	kfree(name);
> +}
> +#endif
> +
>  #endif /* CONFIG_PCI */
>  
...
> +static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
> +				struct device_node *np)
> +{
> +	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
> +	u32 i, addr_sz[OF_PCI_MAX_INT_PIN], map_sz = 0;
> +	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
> +	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
> +	struct device_node *pnode;
> +	struct pci_dev *child;
> +	u32 *int_map, *mapp;
> +	int ret;
> +	u8 pin;
> +
> +	pnode = pci_device_to_OF_node(pdev->bus->self);
> +	if (!pnode)
> +		pnode = pci_bus_to_OF_node(pdev->bus);
> +
> +	if (!pnode) {
> +		pci_err(pdev, "failed to get parent device node");
> +		return -EINVAL;
> +	}
> +
> +	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
> +	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN;  pin++) {
> +		i = pin - 1;
> +		out_irq[i].np = pnode;
> +		out_irq[i].args_count = 1;
> +		out_irq[i].args[0] = pin;
> +		ret = of_irq_parse_raw(laddr, &out_irq[i]);
> +		if (ret) {
> +			pci_err(pdev, "parse irq %d failed, ret %d", pin, ret);
> +			continue;
> +		}
> +		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
> +					   &addr_sz[i]);
> +		if (ret)
> +			addr_sz[i] = 0;
> +	}

If of_irq_parse_raw() fails, addr_sz[i] is not initialized and map_sz below is
computed with uninitialized values.
In the test I did, this led to a kernel crash because the following kcalloc()
was called with incorrect values.

Are the interrupt-map and interrupt-map-mask properties needed in all cases?
I mean, are they mandatory for the host PCI bridge?

> +
> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
> +			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;

of_irq_parse_raw() can fail on some pins.
Is it correct to set map_sz based on information related to all pins even if
of_irq_parse_raw() previously failed on some of them?

> +		}
> +	}
> +
> +	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
> +	mapp = int_map;
> +
> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
> +			*mapp = (child->bus->number << 16) |
> +				(child->devfn << 8);
> +			mapp += OF_PCI_ADDRESS_CELLS;
> +			*mapp = pin;
> +			mapp++;
> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
> +			*mapp = out_irq[i].np->phandle;
> +			mapp++;
> +			if (addr_sz[i]) {
> +				ret = of_property_read_u32_array(out_irq[i].np,
> +								 "reg", mapp,
> +								 addr_sz[i]);
> +				if (ret)
> +					goto failed;
> +			}
> +			mapp += addr_sz[i];
> +			memcpy(mapp, out_irq[i].args,
> +			       out_irq[i].args_count * sizeof(u32));
> +			mapp += out_irq[i].args_count;
> +		}
> +	}
> +
> +	ret = of_changeset_add_prop_u32_array(ocs, np, "interrupt-map", int_map,
> +					      map_sz);
> +	if (ret)
> +		goto failed;
> +
> +	ret = of_changeset_add_prop_u32(ocs, np, "#interrupt-cells", 1);
> +	if (ret)
> +		goto failed;
> +
> +	ret = of_changeset_add_prop_u32_array(ocs, np, "interrupt-map-mask",
> +					      int_map_mask,
> +					      ARRAY_SIZE(int_map_mask));
> +	if (ret)
> +		goto failed;
> +
> +	kfree(int_map);
> +	return 0;
> +
> +failed:
> +	kfree(int_map);
> +	return ret;
> +}
> +
...

Regards,
Hervé
Herve Codina Sept. 11, 2023, 3:35 p.m. UTC | #4
Hi Jonathan,

On Mon, 11 Sep 2023 15:48:56 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Tue, 15 Aug 2023 10:19:57 -0700
> Lizhi Hou <lizhi.hou@amd.com> wrote:
> 
> > ...
> 
> I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
> machine.
> 
> There are some missing parts that were present in Clement's series, but not this
> one, particularly creation of the root pci object.
> 
> Anyhow, hit an intermittent crash...

I am facing the same issues.

I use a custom PCIe board too, but on an x86 ACPI machine.

In order to have a working system, I also need to build a DT node for the PCI
host bridge (previously done by Clement's patch), and I am a bit stuck with
interrupts.

How do you handle this on your side (ACPI machine)?
I mean, is your PCI host bridge provided by ACPI? If so, you probably need
to build a DT node for this PCI host bridge and add interrupt-map and
interrupt-map-mask properties to that DT node.

Best regards,
Hervé
Jonathan Cameron Sept. 11, 2023, 3:47 p.m. UTC | #5
On Mon, 11 Sep 2023 17:35:03 +0200
Herve Codina <herve.codina@bootlin.com> wrote:

> Hi Jonathan,
> 
> On Mon, 11 Sep 2023 15:48:56 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
> > On Tue, 15 Aug 2023 10:19:57 -0700
> > Lizhi Hou <lizhi.hou@amd.com> wrote:
> >   
> > > ...
> > 
> > I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
> > machine.
> > 
> > There are some missing parts that were present in Clement's series, but not this
> > one, particularly creation of the root pci object.
> > 
> > Anyhow, hit an intermittent crash...  
> 
> I am facing the same issues.
>
> I use a custom PCIe board too, but on an x86 ACPI machine.
>
> In order to have a working system, I also need to build a DT node for the PCI
> host bridge (previously done by Clement's patch), and I am a bit stuck with
> interrupts.
>
> How do you handle this on your side (ACPI machine)?

That was next on my list to look at now I've gotten the device tree stuff
to show up.

> I mean, is your PCI host bridge provided by ACPI? If so, you probably need
> to build a DT node for this PCI host bridge and add interrupt-map and
> interrupt-map-mask properties to that DT node.

Agreed. Potentially some other stuff, but interrupts are the thing that
showed up first as an issue.

Given the only reason I'm looking at this is to potentially solve
a long-term CXL / MCTP over I2C upstreaming problem on the QEMU side, I have
only limited time to throw at this (I thought it was a short activity
for a Friday afternoon :). I will see whether it turns out not to be
too hard to build the rest.

I can at least boot the same system with device tree and check that I'm
matching what is being generated by QEMU.

Jonathan


> 
> Best regards,
> Hervé
>
Jonathan Cameron Sept. 11, 2023, 4:22 p.m. UTC | #6
On Mon, 11 Sep 2023 16:47:41 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Mon, 11 Sep 2023 17:35:03 +0200
> Herve Codina <herve.codina@bootlin.com> wrote:
> 
> > Hi Jonathan,
> > 
> > On Mon, 11 Sep 2023 15:48:56 +0100
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >   
> > > On Tue, 15 Aug 2023 10:19:57 -0700
> > > Lizhi Hou <lizhi.hou@amd.com> wrote:
> > >     
> > > > ...
> > > 
> > > I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
> > > machine.
> > > 
> > > There are some missing parts that were present in Clement's series, but not this
> > > one, particularly creation of the root pci object.
> > > 
> > > Anyhow, hit an intermittent crash...    
> > 
> > I am facing the same issues.
> >
> > I use a custom PCIe board too, but on an x86 ACPI machine.
> >
> > In order to have a working system, I also need to build a DT node for the PCI
> > host bridge (previously done by Clement's patch), and I am a bit stuck with
> > interrupts.
> >
> > How do you handle this on your side (ACPI machine)?
> 
> That was next on my list to look at now I've gotten the device tree stuff
> to show up.
> 
> > I mean, is your PCI host bridge provided by ACPI? If so, you probably need
> > to build a DT node for this PCI host bridge and add interrupt-map and
> > interrupt-map-mask properties to that DT node.
> 
> Agreed. Potentially some other stuff, but interrupts are the thing that
> showed up first as an issue.
> 
> Given the only reason I'm looking at this is to potentially solve
> a long-term CXL / MCTP over I2C upstreaming problem on the QEMU side, I have
> only limited time to throw at this (I thought it was a short activity
> for a Friday afternoon :). I will see whether it turns out not to be
> too hard to build the rest.
>
> I can at least boot the same system with device tree and check that I'm
> matching what is being generated by QEMU.

So, I'm not really sure how to approach this. It seems 'unwise'/'unworkable' to
instantiate the device tree blob for the interrupt controller we already have
ACPI for, and without that I have nothing to route to.

Or can we just ignore the interrupt map stuff completely and instead
rely on instantiating an interrupt controller on the card (one that, under
the hood, uses non-DT paths to make interrupts actually happen)?

That path seems workable to me, and it keeps the ACPI vs DT boundary
within the card-specific driver.

Suggestions welcome!

Jonathan

> 
> Jonathan
> 
> 
> > 
> > Best regards,
> > Hervé
> >   
>
Lizhi Hou Sept. 11, 2023, 4:58 p.m. UTC | #7
On 9/11/23 07:48, Jonathan Cameron wrote:
> On Tue, 15 Aug 2023 10:19:57 -0700
> Lizhi Hou <lizhi.hou@amd.com> wrote:
>
>> ...
> I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
> machine.
>
> There are some missing parts that were present in Clement's series, but not this
> one, particularly creation of the root pci object.
Thanks for trying this. The entire effort was split into separate
phases. That is why this patchset does not include creating of_root.
>
> Anyhow, hit an intermittent crash...
>
>
>> ---
>> +static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
>> +				struct device_node *np)
>> +{
>> +	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
>> +	u32 i, addr_sz[OF_PCI_MAX_INT_PIN], map_sz = 0;
>> +	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
>> +	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
>> +	struct device_node *pnode;
>> +	struct pci_dev *child;
>> +	u32 *int_map, *mapp;
>> +	int ret;
>> +	u8 pin;
>> +
>> +	pnode = pci_device_to_OF_node(pdev->bus->self);
>> +	if (!pnode)
>> +		pnode = pci_bus_to_OF_node(pdev->bus);
>> +
>> +	if (!pnode) {
>> +		pci_err(pdev, "failed to get parent device node");
>> +		return -EINVAL;
>> +	}
>> +
>> +	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
>> +	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN;  pin++) {
>> +		i = pin - 1;
>> +		out_irq[i].np = pnode;
>> +		out_irq[i].args_count = 1;
>> +		out_irq[i].args[0] = pin;
>> +		ret = of_irq_parse_raw(laddr, &out_irq[i]);
>> +		if (ret) {
>> +			pci_err(pdev, "parse irq %d failed, ret %d", pin, ret);
>> +			continue;
> If all the interrupt parsing fails, we continue every time...

Did you use Clement's patch to create of_root? I am just wondering whether
parsing the irq could fail on a DT-based system.

And yes, the failure case should be handled without a crash. I think that if
irq parsing fails, generation of the interrupt-map pair should be skipped.


Thanks,

Lizhi
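
A minimal userspace sketch of what "skipping the pair" could look like when
sizing the map, under stated assumptions. struct pin_route and
intr_map_cells() are hypothetical names, and the per-entry constant 5 is read
off the fill loop in the patch as 3 child address cells, 1 pin cell and
1 parent phandle cell.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INT_PIN 4	/* stands in for OF_PCI_MAX_INT_PIN */
#define FIXED_CELLS 5	/* 3 child address cells + 1 pin cell + 1 phandle */

/* Hypothetical record of what the parse step learned about a parent pin. */
struct pin_route {
	int parsed;		/* non-zero if the (modelled) parse succeeded */
	uint32_t addr_sz;	/* parent "#address-cells" */
	uint32_t args_count;	/* length of the parent interrupt specifier */
};

/*
 * Number of u32 cells needed for the interrupt-map of nchildren devices.
 * For a given child the swizzle maps its four pins one-to-one onto the four
 * parent pins, so walking the parent pins directly yields the same total.
 * Parent pins whose parse failed contribute no entry at all.
 */
static uint32_t intr_map_cells(const struct pin_route routes[MAX_INT_PIN],
			       size_t nchildren)
{
	uint32_t per_child = 0;

	assert(routes);

	for (unsigned int i = 0; i < MAX_INT_PIN; i++) {
		if (!routes[i].parsed)
			continue;
		per_child += FIXED_CELLS + routes[i].addr_sz +
			     routes[i].args_count;
	}
	return (uint32_t)nchildren * per_child;
}
```

The loop that fills the map would need the same "parsed" check, otherwise
the buffer and the entries written into it no longer agree in size.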

>
>> +		}
>> +		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
>> +					   &addr_sz[i]);
>> +		if (ret)
>> +			addr_sz[i] = 0;
> ...so this is never reached and addr_sz[i] is never set.
>
>> +	}
>> +
>> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
>> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
>> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
>> +			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;
> and here we end up reading uninitialized memory, which in my case sometimes
> causes a massive allocation that fails one of the assertions in the
> allocator.
>
> I'd suggest just setting addr_sz[xxx] = {}; above
> to ensure it's initialized. Then the if (ret) handling should not be needed
> either, since of_property_read_u32() should be side-effect free, I hope!
>
>> +		}
>> +	}
>> +
>> +	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
>> +	mapp = int_map;
Lizhi Hou Sept. 11, 2023, 5:53 p.m. UTC | #8
On 9/11/23 08:13, Herve Codina wrote:
> Hi Lizhi,
>
> On Tue, 15 Aug 2023 10:19:57 -0700
> Lizhi Hou <lizhi.hou@amd.com> wrote:
> ...
>> +void of_pci_make_dev_node(struct pci_dev *pdev)
>> +{
>> +	struct device_node *ppnode, *np = NULL;
>> +	const char *pci_type;
>> +	struct of_changeset *cset;
>> +	const char *name;
>> +	int ret;
>> +
>> +	/*
>> +	 * If there is already a device tree node linked to this device,
>> +	 * return immediately.
>> +	 */
>> +	if (pci_device_to_OF_node(pdev))
>> +		return;
>> +
>> +	/* Check if there is device tree node for parent device */
>> +	if (!pdev->bus->self)
>> +		ppnode = pdev->bus->dev.of_node;
>> +	else
>> +		ppnode = pdev->bus->self->dev.of_node;
>> +	if (!ppnode)
>> +		return;
>> +
>> +	if (pci_is_bridge(pdev))
>> +		pci_type = "pci";
>> +	else
>> +		pci_type = "dev";
>> +
>> +	name = kasprintf(GFP_KERNEL, "%s@%x,%x", pci_type,
>> +			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
>> +	if (!name)
>> +		return;
>> +
>> +	cset = kmalloc(sizeof(*cset), GFP_KERNEL);
>> +	if (!cset)
>> +		goto failed;
>> +	of_changeset_init(cset);
>> +
>> +	np = of_changeset_create_node(ppnode, name, cset);
>> +	if (!np)
>> +		goto failed;
> The "goto failed" will leak the cset previously allocated.
>
> np->data = cset; (next line) allows the cset to be freed when the node is
> destroyed (of_node_put() calls). When the node cannot be created, the
> allocated cset should be freed.
Thanks for pointing this out.
>
>> +	np->data = cset;
>> +
>> +	ret = of_pci_add_properties(pdev, cset, np);
>> +	if (ret)
>> +		goto failed;
>> +
>> +	ret = of_changeset_apply(cset);
>> +	if (ret)
>> +		goto failed;
>> +
>> +	pdev->dev.of_node = np;
>> +	kfree(name);
>> +
>> +	return;
>> +
>> +failed:
>> +	if (np)
>> +		of_node_put(np);
>> +	kfree(name);
>> +}
>> +#endif
>> +
>>   #endif /* CONFIG_PCI */
>>   
> ...
>> +static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
>> +				struct device_node *np)
>> +{
>> +	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
>> +	u32 i, addr_sz[OF_PCI_MAX_INT_PIN], map_sz = 0;
>> +	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
>> +	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
>> +	struct device_node *pnode;
>> +	struct pci_dev *child;
>> +	u32 *int_map, *mapp;
>> +	int ret;
>> +	u8 pin;
>> +
>> +	pnode = pci_device_to_OF_node(pdev->bus->self);
>> +	if (!pnode)
>> +		pnode = pci_bus_to_OF_node(pdev->bus);
>> +
>> +	if (!pnode) {
>> +		pci_err(pdev, "failed to get parent device node");
>> +		return -EINVAL;
>> +	}
>> +
>> +	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
>> +	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN;  pin++) {
>> +		i = pin - 1;
>> +		out_irq[i].np = pnode;
>> +		out_irq[i].args_count = 1;
>> +		out_irq[i].args[0] = pin;
>> +		ret = of_irq_parse_raw(laddr, &out_irq[i]);
>> +		if (ret) {
>> +			pci_err(pdev, "parse irq %d failed, ret %d", pin, ret);
>> +			continue;
>> +		}
>> +		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
>> +					   &addr_sz[i]);
>> +		if (ret)
>> +			addr_sz[i] = 0;
>> +	}
> If of_irq_parse_raw() fails, addr_sz[i] is not initialized and map_sz below is
> computed with uninitialized values.
> In the test I did, this led to a kernel crash because the following kcalloc()
> was called with incorrect values.
>
> Are the interrupt-map and interrupt-map-mask properties needed in all cases?
> I mean, are they mandatory for the host PCI bridge?
interrupt-map is required for bridges when a legacy interrupt device is
underneath. Otherwise, of_irq_parse_pci() needs to be changed for legacy
interrupt devices. Please see my previous patch and comments.
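
For reference, a minimal sketch of the conventional INTx swizzle that the
patch relies on through pci_swizzle_interrupt_pin() when it picks the parent
pin for each child pin. slot_of() and swizzle_pin() are illustrative names,
not kernel functions, and the ARI case (where the slot is treated as 0, as
far as I understand it) is left out.

```c
#include <assert.h>

/* Slot number from a devfn value: bits 7:3, the same split PCI_SLOT() uses. */
static unsigned int slot_of(unsigned int devfn)
{
	return (devfn >> 3) & 0x1f;
}

/*
 * pin is 1..4 for INTA..INTD on the child side. The return value is the pin
 * number seen on the upstream side of the bridge, also in the range 1..4.
 */
static unsigned int swizzle_pin(unsigned int devfn, unsigned int pin)
{
	assert(pin >= 1 && pin <= 4);
	return ((pin - 1 + slot_of(devfn)) % 4) + 1;
}
```

A device in slot 0 keeps its pins unchanged, while a device in slot 1 has
INTA routed to the parent's INTB and INTD wrapped around to INTA. This is why
each interrupt-map entry has to be keyed on both the child address and the
child pin.
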
>
>> +
>> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
>> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
>> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
>> +			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;
> of_irq_parse_raw() can fail on some pins.
> Is it correct to set map_sz based on information related to all pins even if
> of_irq_parse_raw() previously failed on some of them?

I think the interrupt-map pair should be skipped if of_irq_parse_raw()
fails. Thanks.


Lizhi

>
>> +		}
>> +	}
>> +
>> +	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
>> +	mapp = int_map;
>> +
>> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
>> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
>> +			*mapp = (child->bus->number << 16) |
>> +				(child->devfn << 8);
>> +			mapp += OF_PCI_ADDRESS_CELLS;
>> +			*mapp = pin;
>> +			mapp++;
>> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
>> +			*mapp = out_irq[i].np->phandle;
>> +			mapp++;
>> +			if (addr_sz[i]) {
>> +				ret = of_property_read_u32_array(out_irq[i].np,
>> +								 "reg", mapp,
>> +								 addr_sz[i]);
>> +				if (ret)
>> +					goto failed;
>> +			}
>> +			mapp += addr_sz[i];
>> +			memcpy(mapp, out_irq[i].args,
>> +			       out_irq[i].args_count * sizeof(u32));
>> +			mapp += out_irq[i].args_count;
>> +		}
>> +	}
>> +
>> +	ret = of_changeset_add_prop_u32_array(ocs, np, "interrupt-map", int_map,
>> +					      map_sz);
>> +	if (ret)
>> +		goto failed;
>> +
>> +	ret = of_changeset_add_prop_u32(ocs, np, "#interrupt-cells", 1);
>> +	if (ret)
>> +		goto failed;
>> +
>> +	ret = of_changeset_add_prop_u32_array(ocs, np, "interrupt-map-mask",
>> +					      int_map_mask,
>> +					      ARRAY_SIZE(int_map_mask));
>> +	if (ret)
>> +		goto failed;
>> +
>> +	kfree(int_map);
>> +	return 0;
>> +
>> +failed:
>> +	kfree(int_map);
>> +	return ret;
>> +}
>> +
> ...
>
> Regards,
> Hervé
>
Andy Shevchenko Sept. 11, 2023, 9:06 p.m. UTC | #9
On Tue, Aug 15, 2023 at 10:19:57AM -0700, Lizhi Hou wrote:
> The PCI endpoint device such as Xilinx Alveo PCI card maps the register
> spaces from multiple hardware peripherals to its PCI BAR. Normally,
> the PCI core discovers devices and BARs using the PCI enumeration process.
> There is no infrastructure to discover the hardware peripherals that are
> present in a PCI device, and which can be accessed through the PCI BARs.
> 
> Apparently, the device tree framework requires a device tree node for the
> PCI device. Thus, it can generate the device tree nodes for hardware
> peripherals underneath. Because PCI is self discoverable bus, there might
> not be a device tree node created for PCI devices. Furthermore, if the PCI
> device is hot pluggable, when it is plugged in, the device tree nodes for
> its parent bridges are required. Add support to generate device tree node
> for PCI bridges.
> 
> Add an of_pci_make_dev_node() interface that can be used to create device
> tree node for PCI devices.
> 
> Add a PCI_DYNAMIC_OF_NODES config option. When the option is turned on,
> the kernel will generate device tree nodes for PCI bridges unconditionally.
> 
> Initially, add the basic properties for the dynamically generated device
> tree nodes which include #address-cells, #size-cells, device_type,
> compatible, ranges, reg.

...

> @@ -32,6 +32,7 @@ obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
>  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
>  obj-$(CONFIG_PCI_DOE)		+= doe.o

> +obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o

Maybe a bit ordered?

...

> +void of_pci_remove_node(struct pci_dev *pdev)
> +{
> +	struct device_node *np;
> +
> +	np = pci_device_to_OF_node(pdev);

CamelCase all of a sudden?!

> +	if (!np || !of_node_check_flag(np, OF_DYNAMIC))

Do you need the first check? Shouldn't the second return false for you in such a
case?

> +		return;

> +	pdev->dev.of_node = NULL;

This will mess up the fwnode, won't it?


> +	of_changeset_revert(np->data);
> +	of_changeset_destroy(np->data);
> +	of_node_put(np);
> +}

...

> +void of_pci_make_dev_node(struct pci_dev *pdev)
> +{
> +	struct device_node *ppnode, *np = NULL;
> +	const char *pci_type;
> +	struct of_changeset *cset;
> +	const char *name;
> +	int ret;
> +
> +	/*
> +	 * If there is already a device tree node linked to this device,
> +	 * return immediately.
> +	 */
> +	if (pci_device_to_OF_node(pdev))
> +		return;
> +
> +	/* Check if there is device tree node for parent device */
> +	if (!pdev->bus->self)

Why not a positive conditional?

> +		ppnode = pdev->bus->dev.of_node;
> +	else
> +		ppnode = pdev->bus->self->dev.of_node;

What about firmware nodes?

> +	if (!ppnode)
> +		return;
> +
> +	if (pci_is_bridge(pdev))
> +		pci_type = "pci";
> +	else
> +		pci_type = "dev";
> +
> +	name = kasprintf(GFP_KERNEL, "%s@%x,%x", pci_type,
> +			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
> +	if (!name)
> +		return;
> +
> +	cset = kmalloc(sizeof(*cset), GFP_KERNEL);
> +	if (!cset)
> +		goto failed;
> +	of_changeset_init(cset);
> +
> +	np = of_changeset_create_node(ppnode, name, cset);
> +	if (!np)
> +		goto failed;
> +	np->data = cset;
> +
> +	ret = of_pci_add_properties(pdev, cset, np);
> +	if (ret)
> +		goto failed;
> +
> +	ret = of_changeset_apply(cset);
> +	if (ret)
> +		goto failed;
> +
> +	pdev->dev.of_node = np;

Firmware node?

> +	kfree(name);
> +
> +	return;
> +
> +failed:

> +	if (np)

Dup check.

> +		of_node_put(np);
> +	kfree(name);
> +}

...

> +#include <linux/pci.h>
> +#include <linux/of.h>
> +#include <linux/of_irq.h>
> +#include <linux/bitfield.h>
> +#include <linux/bits.h>

Can it be ordered?

...

> +struct of_pci_addr_pair {
> +	u32		phys_addr[OF_PCI_ADDRESS_CELLS];
> +	u32		size[OF_PCI_SIZE_CELLS];
> +};

Why not

struct foo {
	u32 phys_addr; // why not 64-bit?
	u32 size; // same Q, btw
};

struct _pairs {
	struct foo pairs[...];
};

?

...

> +struct of_pci_range {
> +	u32		child_addr[OF_PCI_ADDRESS_CELLS];
> +	u32		parent_addr[OF_PCI_ADDRESS_CELLS];
> +	u32		size[OF_PCI_SIZE_CELLS];
> +};

In a similar way?

...

> +enum of_pci_prop_compatible {
> +	PROP_COMPAT_PCI_VVVV_DDDD,
> +	PROP_COMPAT_PCICLASS_CCSSPP,
> +	PROP_COMPAT_PCICLASS_CCSS,
> +	PROP_COMPAT_NUM,

No comma for the terminator entry (as far as I got it).

> +};

...

> +static void of_pci_set_address(struct pci_dev *pdev, u32 *prop, u64 addr,
> +			       u32 reg_num, u32 flags, bool reloc)
> +{
> +	prop[0] = FIELD_PREP(OF_PCI_ADDR_FIELD_BUS, pdev->bus->number) |
> +		FIELD_PREP(OF_PCI_ADDR_FIELD_DEV, PCI_SLOT(pdev->devfn)) |
> +		FIELD_PREP(OF_PCI_ADDR_FIELD_FUNC, PCI_FUNC(pdev->devfn));
> +	prop[0] |= flags | reg_num;

No checks? No masks? flags or reg_num may easily / mistakenly rewrite the above.
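
A minimal userspace sketch of what masking each input could look like. The PHYS_HI_* names and phys_hi() are made up for this example; the bit positions are copied from the OF_PCI_ADDR_FIELD_* definitions in the patch. Whether out-of-range bits should be masked silently, as here, or rejected with an error is a design choice this sketch does not settle.

```c
#include <stdint.h>

/* First PCI address cell (phys.hi); layout as in OF_PCI_ADDR_FIELD_*. */
#define PHYS_HI_NONRELOC	0x80000000u	/* bit 31 */
#define PHYS_HI_PREFETCH	0x40000000u	/* bit 30 */
#define PHYS_HI_SS		0x03000000u	/* bits 25:24 */
#define PHYS_HI_BUS		0x00ff0000u	/* bits 23:16 */
#define PHYS_HI_DEV		0x0000f800u	/* bits 15:11 */
#define PHYS_HI_FUNC		0x00000700u	/* bits 10:8 */
#define PHYS_HI_REG		0x000000ffu	/* bits 7:0 */

/* The only bits a caller-supplied flags word may set. */
#define PHYS_HI_FLAGS	(PHYS_HI_NONRELOC | PHYS_HI_PREFETCH | PHYS_HI_SS)

/* Compose phys.hi so that no input can spill into another field. */
static uint32_t phys_hi(uint8_t bus, uint8_t devfn, uint32_t reg_num,
			uint32_t flags)
{
	uint32_t hi = 0;

	hi |= ((uint32_t)bus << 16) & PHYS_HI_BUS;
	/* devfn is (slot << 3) | func, so one shift places both fields. */
	hi |= ((uint32_t)devfn << 8) & (PHYS_HI_DEV | PHYS_HI_FUNC);
	hi |= reg_num & PHYS_HI_REG;
	hi |= flags & PHYS_HI_FLAGS;

	return hi;
}
```

With this shape, a reg_num wider than 8 bits or a flags word carrying stray bus bits can no longer rewrite the bus/device/function fields.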

> +	if (!reloc) {

	if (reloc)
		return;

?

> +		prop[0] |= OF_PCI_ADDR_FIELD_NONRELOC;
> +		prop[1] = upper_32_bits(addr);
> +		prop[2] = lower_32_bits(addr);
> +	}
> +}

...

> +static int of_pci_get_addr_flags(struct resource *res, u32 *flags)
> +{
> +	u32 ss;

> +	if (res->flags & IORESOURCE_IO)
> +		ss = OF_PCI_ADDR_SPACE_IO;
> +	else if (res->flags & IORESOURCE_MEM_64)
> +		ss = OF_PCI_ADDR_SPACE_MEM64;
> +	else if (res->flags & IORESOURCE_MEM)
> +		ss = OF_PCI_ADDR_SPACE_MEM32;
> +	else
> +		return -EINVAL;

We have ioport.h and respective helpers, can you use them?
resource_type(), for example.

> +	*flags = 0;
> +	if (res->flags & IORESOURCE_PREFETCH)
> +		*flags |= OF_PCI_ADDR_FIELD_PREFETCH;
> +
> +	*flags |= FIELD_PREP(OF_PCI_ADDR_FIELD_SS, ss);
> +
> +	return 0;
> +}

...

> +static int of_pci_prop_bus_range(struct pci_dev *pdev,
> +				 struct of_changeset *ocs,
> +				 struct device_node *np)
> +{
> +	u32 bus_range[] = { pdev->subordinate->busn_res.start,
> +			    pdev->subordinate->busn_res.end };

Wrong. It won't work on 64-bit resources.

> +	return of_changeset_add_prop_u32_array(ocs, np, "bus-range", bus_range,
> +					       ARRAY_SIZE(bus_range));
> +}

...

> +	if (pci_is_bridge(pdev)) {
> +		num = PCI_BRIDGE_RESOURCE_NUM;
> +		res = &pdev->resource[PCI_BRIDGE_RESOURCES];
> +	} else {
> +		num = PCI_STD_NUM_BARS;
> +		res = &pdev->resource[PCI_STD_RESOURCES];
> +	}

Don't we have pci_resource() macro?

...

> +	for (i = 0, j = 0; j < num; j++) {
> +		if (!resource_size(&res[j]))
> +			continue;
> +
> +		if (of_pci_get_addr_flags(&res[j], &flags))
> +			continue;
> +
> +		val64 = res[j].start;
> +		of_pci_set_address(pdev, rp[i].parent_addr, val64, 0, flags,
> +				   false);
> +		if (pci_is_bridge(pdev)) {

> +			memcpy(rp[i].child_addr, rp[i].parent_addr,
> +			       sizeof(rp[i].child_addr));

Why is a simple assignment not good enough here?
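
For context: with child_addr and parent_addr declared as bare u32 arrays, C does not allow one to be assigned to the other, which may be why the patch reaches for memcpy(). A plain assignment becomes possible once the cells are wrapped in a struct. A small userspace sketch, where struct pci_addr_cells, struct range_entry and mirror_parent() are hypothetical names used only here:

```c
#include <stdint.h>

#define ADDRESS_CELLS	3	/* mirrors OF_PCI_ADDRESS_CELLS */

/* Wrapping the cells in a struct makes the whole address assignable. */
struct pci_addr_cells {
	uint32_t cell[ADDRESS_CELLS];
};

struct range_entry {
	struct pci_addr_cells child_addr;
	struct pci_addr_cells parent_addr;
};

/* Bridge case: the child address is identical to the parent address. */
static void mirror_parent(struct range_entry *r)
{
	r->child_addr = r->parent_addr;	/* copies all three cells */
}
```

The in-memory layout stays three consecutive 32-bit cells, so a table of such entries could still be handed over as a flat u32 array.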

> +		} else {
> +			/*
> +			 * For endpoint device, the lower 64-bits of child
> +			 * address is always zero.
> +			 */
> +			rp[i].child_addr[0] = j;
> +		}

> +		val64 = resource_size(&res[j]);

Dup. You already called this at the top of the loop, why repeat it?

> +		rp[i].size[0] = upper_32_bits(val64);
> +		rp[i].size[1] = lower_32_bits(val64);
> +
> +		i++;
> +	}

...

> +static int of_pci_prop_reg(struct pci_dev *pdev, struct of_changeset *ocs,
> +			   struct device_node *np)
> +{
> +	struct of_pci_addr_pair reg = { 0 };

0 is redundant.

> +
> +	/* configuration space */
> +	of_pci_set_address(pdev, reg.phys_addr, 0, 0, 0, true);
> +
> +	return of_changeset_add_prop_u32_array(ocs, np, "reg", (u32 *)&reg,
> +					       sizeof(reg) / sizeof(u32));
> +}

...

> +	ret = pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
> +	if (ret != 0)

Why this pattern?

> +		return ret;

You are aware that the above can return positive codes, aren't you?
You probably want to translate them to the Linux error codes.
The same applies to all generic PCI config space accessors used in
the code.
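
A userspace sketch of such a translation, modelled on the kernel's pcibios_err_to_errno() helper. The MODEL_PCIBIOS_* values and the errno chosen for each are written from memory of include/linux/pci.h and should be checked against the tree; model_err_to_errno() is a name invented for this example.

```c
#include <errno.h>

/* Assumed PCIBIOS_* status values (positive, not errno). */
#define MODEL_PCIBIOS_SUCCESSFUL		0x00
#define MODEL_PCIBIOS_FUNC_NOT_SUPPORTED	0x81
#define MODEL_PCIBIOS_BAD_VENDOR_ID		0x83
#define MODEL_PCIBIOS_DEVICE_NOT_FOUND		0x86
#define MODEL_PCIBIOS_BAD_REGISTER_NUMBER	0x87
#define MODEL_PCIBIOS_SET_FAILED		0x88
#define MODEL_PCIBIOS_BUFFER_TOO_SMALL		0x89

/* Turn a config accessor status into zero or a negative errno. */
static int model_err_to_errno(int err)
{
	if (err <= MODEL_PCIBIOS_SUCCESSFUL)
		return err;	/* zero or already a negative errno */

	switch (err) {
	case MODEL_PCIBIOS_FUNC_NOT_SUPPORTED:
		return -ENOENT;
	case MODEL_PCIBIOS_BAD_VENDOR_ID:
		return -ENOTTY;
	case MODEL_PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case MODEL_PCIBIOS_BAD_REGISTER_NUMBER:
		return -EFAULT;
	case MODEL_PCIBIOS_SET_FAILED:
		return -EIO;
	case MODEL_PCIBIOS_BUFFER_TOO_SMALL:
		return -ENOSPC;
	}

	return -ERANGE;	/* unknown positive status */
}
```

In the patch, the equivalent would be to pass the return value of pci_read_config_byte() through pcibios_err_to_errno() before returning it.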

> +	if (!pin)
> +		return 0;
> +
> +	return of_changeset_add_prop_u32(ocs, np, "interrupts", (u32)pin);

Why casting?

> +}
Andy Shevchenko Sept. 11, 2023, 9:13 p.m. UTC | #10
On Mon, Sep 11, 2023 at 05:22:56PM +0100, Jonathan Cameron wrote:
> On Mon, 11 Sep 2023 16:47:41 +0100
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > On Mon, 11 Sep 2023 17:35:03 +0200
> > Herve Codina <herve.codina@bootlin.com> wrote:
> > > On Mon, 11 Sep 2023 15:48:56 +0100
> > > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> > > > On Tue, 15 Aug 2023 10:19:57 -0700
> > > > Lizhi Hou <lizhi.hou@amd.com> wrote:

> > > > > The PCI endpoint device such as Xilinx Alveo PCI card maps the register
> > > > > spaces from multiple hardware peripherals to its PCI BAR. Normally,
> > > > > the PCI core discovers devices and BARs using the PCI enumeration process.
> > > > > There is no infrastructure to discover the hardware peripherals that are
> > > > > present in a PCI device, and which can be accessed through the PCI BARs.
> > > > > 
> > > > > Apparently, the device tree framework requires a device tree node for the
> > > > > PCI device. Thus, it can generate the device tree nodes for hardware
> > > > > peripherals underneath. Because PCI is self discoverable bus, there might
> > > > > not be a device tree node created for PCI devices. Furthermore, if the PCI
> > > > > device is hot pluggable, when it is plugged in, the device tree nodes for
> > > > > its parent bridges are required. Add support to generate device tree node
> > > > > for PCI bridges.
> > > > > 
> > > > > Add an of_pci_make_dev_node() interface that can be used to create device
> > > > > tree node for PCI devices.
> > > > > 
> > > > > Add a PCI_DYNAMIC_OF_NODES config option. When the option is turned on,
> > > > > the kernel will generate device tree nodes for PCI bridges unconditionally.
> > > > > 
> > > > > Initially, add the basic properties for the dynamically generated device
> > > > > tree nodes which include #address-cells, #size-cells, device_type,
> > > > > compatible, ranges, reg.
> > > > > 
> > > > > Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> > > > > Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>      
> > > > 
> > > > I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
> > > > machine.
> > > > 
> > > > There are some missing parts that were present in Clements series, but not this
> > > > one, particularly creation of the root pci object.
> > > > 
> > > > Anyhow, hit an intermittent crash...    
> > > 
> > > I am facing the same issues.
> > > 
> > > I use a custom PCIe board too but on x86 ACPI machine.
> > > 
> > > In order to have a working system, I need also to build a DT node for the PCI
> > > Host bridge (previously done by Clement's patch) and I am a bit stuck with
> > > interrupts.
> > > 
> > > On your side (ACPI machine) how do you handle this ?  
> > 
> > That was next on my list to look at now I've gotten the device tree stuff
> > to show up.
> > 
> > > I mean is your PCI host bridge provided by ACPI ? And if so, you probably need
> > > to build a DT node for this PCI host bridge and add some interrupt-map,
> > > interrupt-map-mask properties in the DT node.  
> > 
> > Agreed. Potentially some other stuff, but interrupts are the thing that
> > showed up first as an issue.
> > 
> > Given the only reason I'm looking at this is to potentially solve
> > a long term CXL / MCTP over I2C upstreaming problem on QEMU side, I've only
> > limited time to throw at this (thought it was a short activity
> > for a Friday afternoon :)  Will see if it turns out not to be
> > too hard to build the rest.
> > 
> > I can at least boot same system with device tree and check I'm matching
> > what is being generated by QEMU.
> 
> So, I'm not really sure how to approach this.  It seems 'unwise'/'unworkable' to
> instantiate the device tree blob for the interrupt controller we already have
> ACPI for and without that I have nothing to route to.
> 
> Or can we just ignore the interrupt map stuff completely and instead
> rely on instantiating an interrupt controller on the card (that under
> the hood uses non DT paths to make interrupts actually happen?)
> 
> That path to me seems workable and keeps the boundary of ACPI vs DT
> actually getting used within the card specific driver.
> 
> Suggestions welcome!

Interestingly I haven't got your message in the thread via `b4`.
Anyway, I think this has been discussed at some point, and
DT appears to be just a handy blob format to be supplied along with
the device as a "description of its configuration". Whatever format
is chosen, it should be available for ACPI/DT/etc. platforms and
be uniform. ACPI also supports overlays (as a debug feature, though),
but would it make sense to have AML (compiled ASL) for ACPI, DTB
for DT platforms, and so on for other platforms, with duplicated data
inside and with all the limitations of each of those formats and their
respective parsers/interpreters?
Jonathan Cameron Sept. 12, 2023, 10:10 a.m. UTC | #11
On Mon, 11 Sep 2023 09:58:04 -0700
Lizhi Hou <lizhi.hou@amd.com> wrote:

> On 9/11/23 07:48, Jonathan Cameron wrote:
> > On Tue, 15 Aug 2023 10:19:57 -0700
> > Lizhi Hou <lizhi.hou@amd.com> wrote:
> >  
> >> The PCI endpoint device such as Xilinx Alveo PCI card maps the register
> >> spaces from multiple hardware peripherals to its PCI BAR. Normally,
> >> the PCI core discovers devices and BARs using the PCI enumeration process.
> >> There is no infrastructure to discover the hardware peripherals that are
> >> present in a PCI device, and which can be accessed through the PCI BARs.
> >>
> >> Apparently, the device tree framework requires a device tree node for the
> >> PCI device. Thus, it can generate the device tree nodes for hardware
> >> peripherals underneath. Because PCI is self discoverable bus, there might
> >> not be a device tree node created for PCI devices. Furthermore, if the PCI
> >> device is hot pluggable, when it is plugged in, the device tree nodes for
> >> its parent bridges are required. Add support to generate device tree node
> >> for PCI bridges.
> >>
> >> Add an of_pci_make_dev_node() interface that can be used to create device
> >> tree node for PCI devices.
> >>
> >> Add a PCI_DYNAMIC_OF_NODES config option. When the option is turned on,
> >> the kernel will generate device tree nodes for PCI bridges unconditionally.
> >>
> >> Initially, add the basic properties for the dynamically generated device
> >> tree nodes which include #address-cells, #size-cells, device_type,
> >> compatible, ranges, reg.
> >>
> >> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> >> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>  
> > I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
> > machine.
> >
> > There are some missing parts that were present in Clements series, but not this
> > one, particularly creation of the root pci object.  
> Thanks for trying this. The entire effort was separated in different 
> phases. That is why this patchset does not include creating of_root.
> >
> > Anyhow, hit an intermittent crash...
> >
> >  
> >> ---
> >> +static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
> >> +				struct device_node *np)
> >> +{
> >> +	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
> >> +	u32 i, addr_sz[OF_PCI_MAX_INT_PIN], map_sz = 0;
> >> +	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
> >> +	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
> >> +	struct device_node *pnode;
> >> +	struct pci_dev *child;
> >> +	u32 *int_map, *mapp;
> >> +	int ret;
> >> +	u8 pin;
> >> +
> >> +	pnode = pci_device_to_OF_node(pdev->bus->self);
> >> +	if (!pnode)
> >> +		pnode = pci_bus_to_OF_node(pdev->bus);
> >> +
> >> +	if (!pnode) {
> >> +		pci_err(pdev, "failed to get parent device node");
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
> >> +	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN;  pin++) {
> >> +		i = pin - 1;
> >> +		out_irq[i].np = pnode;
> >> +		out_irq[i].args_count = 1;
> >> +		out_irq[i].args[0] = pin;
> >> +		ret = of_irq_parse_raw(laddr, &out_irq[i]);
> >> +		if (ret) {
> >> +			pci_err(pdev, "parse irq %d failed, ret %d", pin, ret);
> >> +			continue;  
> > > If all the interrupt parsing fails we continue every time...
> 
> Did you use Clement's patch to create of_root? I am just wondering if 
> parsing irq could fail on a dt-based system.

For QEMU, we already have of_root as there is still a device tree; it's just
used to pass some stuff to EDK2, I think. I was carrying Frank's
series adding a bare device tree; it's just not doing anything on these
systems.

I used Clement's patch to add the pci root (cleaned up a bit to
match the style of your series more closely).

However, my interest is in ACPI based systems, not DT based ones.

Jonathan


> 
> And yes, the failure case should be handled without crash. I think if 
> irq parsing failed,  the interrupt-map pair generation should be skipped.
> 
> 
> Thanks,
> 
> Lizhi
> 
> >  
> >> +		}
> >> +		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
> >> +					   &addr_sz[i]);
> >> +		if (ret)
> >> +			addr_sz[i] = 0;  
> > This never happens.
> >  
> >> +	}
> >> +
> >> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
> >> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
> >> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
> >> +			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;  
> > > and here we end up dereferencing random memory which happens in my case to cause
> > a massive allocation sometimes and that fails one of the assertions in the
> > allocator.
> >
> > I'd suggest just setting addr_sz[xxx] = {}; above
> > to ensure it's initialized. Then the if(ret) handling should not be needed
> > as well as of_property_read_u32 should be side effect free I hope!
> >  
> >> +		}
> >> +	}
> >> +
> >> +	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
> >> +	mapp = int_map;
Lizhi Hou Sept. 12, 2023, 5:05 p.m. UTC | #12
On 9/12/23 03:10, Jonathan Cameron wrote:
> On Mon, 11 Sep 2023 09:58:04 -0700
> Lizhi Hou <lizhi.hou@amd.com> wrote:
>
>> On 9/11/23 07:48, Jonathan Cameron wrote:
>>> On Tue, 15 Aug 2023 10:19:57 -0700
>>> Lizhi Hou <lizhi.hou@amd.com> wrote:
>>>   
>>>> The PCI endpoint device such as Xilinx Alveo PCI card maps the register
>>>> spaces from multiple hardware peripherals to its PCI BAR. Normally,
>>>> the PCI core discovers devices and BARs using the PCI enumeration process.
>>>> There is no infrastructure to discover the hardware peripherals that are
>>>> present in a PCI device, and which can be accessed through the PCI BARs.
>>>>
>>>> Apparently, the device tree framework requires a device tree node for the
>>>> PCI device. Thus, it can generate the device tree nodes for hardware
>>>> peripherals underneath. Because PCI is self discoverable bus, there might
>>>> not be a device tree node created for PCI devices. Furthermore, if the PCI
>>>> device is hot pluggable, when it is plugged in, the device tree nodes for
>>>> its parent bridges are required. Add support to generate device tree node
>>>> for PCI bridges.
>>>>
>>>> Add an of_pci_make_dev_node() interface that can be used to create device
>>>> tree node for PCI devices.
>>>>
>>>> Add a PCI_DYNAMIC_OF_NODES config option. When the option is turned on,
>>>> the kernel will generate device tree nodes for PCI bridges unconditionally.
>>>>
>>>> Initially, add the basic properties for the dynamically generated device
>>>> tree nodes which include #address-cells, #size-cells, device_type,
>>>> compatible, ranges, reg.
>>>>
>>>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>>>> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
>>> I tried to bring this up for a custom PCIe card emulated in QEMU on an ARM ACPI
>>> machine.
>>>
>>> There are some missing parts that were present in Clements series, but not this
>>> one, particularly creation of the root pci object.
>> Thanks for trying this. The entire effort was separated in different
>> phases. That is why this patchset does not include creating of_root.
>>> Anyhow, hit an intermittent crash...
>>>
>>>   
>>>> ---
>>>> +static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
>>>> +				struct device_node *np)
>>>> +{
>>>> +	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
>>>> +	u32 i, addr_sz[OF_PCI_MAX_INT_PIN], map_sz = 0;
>>>> +	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
>>>> +	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
>>>> +	struct device_node *pnode;
>>>> +	struct pci_dev *child;
>>>> +	u32 *int_map, *mapp;
>>>> +	int ret;
>>>> +	u8 pin;
>>>> +
>>>> +	pnode = pci_device_to_OF_node(pdev->bus->self);
>>>> +	if (!pnode)
>>>> +		pnode = pci_bus_to_OF_node(pdev->bus);
>>>> +
>>>> +	if (!pnode) {
>>>> +		pci_err(pdev, "failed to get parent device node");
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
>>>> +	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN;  pin++) {
>>>> +		i = pin - 1;
>>>> +		out_irq[i].np = pnode;
>>>> +		out_irq[i].args_count = 1;
>>>> +		out_irq[i].args[0] = pin;
>>>> +		ret = of_irq_parse_raw(laddr, &out_irq[i]);
>>>> +		if (ret) {
>>>> +			pci_err(pdev, "parse irq %d failed, ret %d", pin, ret);
>>>> +			continue;
>>> If all the interrupt parsing fails we continue every time...
>> Did you use Clement's patch to create of_root? I am just wondering if
>> parsing irq could fail on a dt-based system.
> For qemu already have of_root as there is still a device tree, it's just
> used to pass some stuff to EDK2 I think. I was carrying the Frank's
> series adding a bare device tree, it's just not doing anything on these
> systems
>
> I used Clements patch to add the pci root (cleaned up a bit to
> match the style of your series more closely).
>
> However, my interest is in ACPI based systems, not DT based ones.

Thanks for the clarification. I am also more interested in ACPI-based
systems. After discussing with Rob, creating PCI nodes on DT-based systems
is the first step towards this.


Lizhi

>
> Jonathan
>
>
>> And yes, the failure case should be handled without crash. I think if
>> irq parsing failed,  the interrupt-map pair generation should be skipped.
>>
>>
>> Thanks,
>>
>> Lizhi
>>
>>>   
>>>> +		}
>>>> +		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
>>>> +					   &addr_sz[i]);
>>>> +		if (ret)
>>>> +			addr_sz[i] = 0;
>>> This never happens.
>>>   
>>>> +	}
>>>> +
>>>> +	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
>>>> +		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
>>>> +			i = pci_swizzle_interrupt_pin(child, pin) - 1;
>>>> +			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;
>>> and here we end up dereferencing random memory which happens in my case to cause
>>> a massive allocation sometimes and that fails one of the assertions in the
>>> allocator.
>>>
>>> I'd suggest just setting addr_sz[xxx] = {}; above
>>> to ensure it's initialized. Then the if(ret) handling should not be needed
>>> as well as of_property_read_u32 should be side effect free I hope!
>>>   
>>>> +		}
>>>> +	}
>>>> +
>>>> +	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
>>>> +	mapp = int_map;

Patch

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 3c07d8d214b3..49bd09c7dd0a 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -194,6 +194,18 @@  config PCI_HYPERV
 	  The PCI device frontend driver allows the kernel to import arbitrary
 	  PCI devices from a PCI backend to support PCI driver domains.
 
+config PCI_DYNAMIC_OF_NODES
+	bool "Create Device tree nodes for PCI devices"
+	depends on OF
+	select OF_DYNAMIC
+	help
+	  This option enables support for generating device tree nodes for some
+	  PCI devices. Thus, the driver of this kind can load and overlay
+	  flattened device tree for its downstream devices.
+
+	  Once this option is selected, the device tree nodes will be generated
+	  for all PCI bridges.
+
 choice
 	prompt "PCI Express hierarchy optimization setting"
 	default PCIE_BUS_DEFAULT
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 2680e4c92f0a..cc8b4e01e29d 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -32,6 +32,7 @@  obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
 obj-$(CONFIG_PCI_DOE)		+= doe.o
+obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
 
 # Endpoint library must be initialized before its users
 obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 5bc81cc0a2de..ab7d06cd0099 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -340,6 +340,8 @@  void pci_bus_add_device(struct pci_dev *dev)
 	 */
 	pcibios_bus_add_device(dev);
 	pci_fixup_device(pci_fixup_final, dev);
+	if (pci_is_bridge(dev))
+		of_pci_make_dev_node(dev);
 	pci_create_sysfs_dev_files(dev);
 	pci_proc_attach_device(dev);
 	pci_bridge_d3_update(dev);
diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index e51219f9f523..ec132fbf5c69 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -611,6 +611,85 @@  int devm_of_pci_bridge_init(struct device *dev, struct pci_host_bridge *bridge)
 	return pci_parse_request_of_pci_ranges(dev, bridge);
 }
 
+#ifdef CONFIG_PCI_DYNAMIC_OF_NODES
+
+void of_pci_remove_node(struct pci_dev *pdev)
+{
+	struct device_node *np;
+
+	np = pci_device_to_OF_node(pdev);
+	if (!np || !of_node_check_flag(np, OF_DYNAMIC))
+		return;
+	pdev->dev.of_node = NULL;
+
+	of_changeset_revert(np->data);
+	of_changeset_destroy(np->data);
+	of_node_put(np);
+}
+
+void of_pci_make_dev_node(struct pci_dev *pdev)
+{
+	struct device_node *ppnode, *np = NULL;
+	const char *pci_type;
+	struct of_changeset *cset;
+	const char *name;
+	int ret;
+
+	/*
+	 * If there is already a device tree node linked to this device,
+	 * return immediately.
+	 */
+	if (pci_device_to_OF_node(pdev))
+		return;
+
+	/* Check if there is device tree node for parent device */
+	if (!pdev->bus->self)
+		ppnode = pdev->bus->dev.of_node;
+	else
+		ppnode = pdev->bus->self->dev.of_node;
+	if (!ppnode)
+		return;
+
+	if (pci_is_bridge(pdev))
+		pci_type = "pci";
+	else
+		pci_type = "dev";
+
+	name = kasprintf(GFP_KERNEL, "%s@%x,%x", pci_type,
+			 PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+	if (!name)
+		return;
+
+	cset = kmalloc(sizeof(*cset), GFP_KERNEL);
+	if (!cset)
+		goto failed;
+	of_changeset_init(cset);
+
+	np = of_changeset_create_node(ppnode, name, cset);
+	if (!np)
+		goto failed;
+	np->data = cset;
+
+	ret = of_pci_add_properties(pdev, cset, np);
+	if (ret)
+		goto failed;
+
+	ret = of_changeset_apply(cset);
+	if (ret)
+		goto failed;
+
+	pdev->dev.of_node = np;
+	kfree(name);
+
+	return;
+
+failed:
+	if (np)
+		of_node_put(np);
+	kfree(name);
+}
+#endif
+
 #endif /* CONFIG_PCI */
 
 /**
diff --git a/drivers/pci/of_property.c b/drivers/pci/of_property.c
new file mode 100644
index 000000000000..710ec35ba4a1
--- /dev/null
+++ b/drivers/pci/of_property.c
@@ -0,0 +1,355 @@ 
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2022-2023, Advanced Micro Devices, Inc.
+ */
+
+#include <linux/pci.h>
+#include <linux/of.h>
+#include <linux/of_irq.h>
+#include <linux/bitfield.h>
+#include <linux/bits.h>
+#include "pci.h"
+
+#define OF_PCI_ADDRESS_CELLS		3
+#define OF_PCI_SIZE_CELLS		2
+#define OF_PCI_MAX_INT_PIN		4
+
+struct of_pci_addr_pair {
+	u32		phys_addr[OF_PCI_ADDRESS_CELLS];
+	u32		size[OF_PCI_SIZE_CELLS];
+};
+
+/*
+ * Each entry in the ranges table is a tuple containing the child address,
+ * the parent address, and the size of the region in the child address space.
+ * Thus, for PCI, in each entry parent address is an address on the primary
+ * side and the child address is the corresponding address on the secondary
+ * side.
+ */
+struct of_pci_range {
+	u32		child_addr[OF_PCI_ADDRESS_CELLS];
+	u32		parent_addr[OF_PCI_ADDRESS_CELLS];
+	u32		size[OF_PCI_SIZE_CELLS];
+};
+
+#define OF_PCI_ADDR_SPACE_IO		0x1
+#define OF_PCI_ADDR_SPACE_MEM32		0x2
+#define OF_PCI_ADDR_SPACE_MEM64		0x3
+
+#define OF_PCI_ADDR_FIELD_NONRELOC	BIT(31)
+#define OF_PCI_ADDR_FIELD_SS		GENMASK(25, 24)
+#define OF_PCI_ADDR_FIELD_PREFETCH	BIT(30)
+#define OF_PCI_ADDR_FIELD_BUS		GENMASK(23, 16)
+#define OF_PCI_ADDR_FIELD_DEV		GENMASK(15, 11)
+#define OF_PCI_ADDR_FIELD_FUNC		GENMASK(10, 8)
+#define OF_PCI_ADDR_FIELD_REG		GENMASK(7, 0)
+
+enum of_pci_prop_compatible {
+	PROP_COMPAT_PCI_VVVV_DDDD,
+	PROP_COMPAT_PCICLASS_CCSSPP,
+	PROP_COMPAT_PCICLASS_CCSS,
+	PROP_COMPAT_NUM,
+};
+
+static void of_pci_set_address(struct pci_dev *pdev, u32 *prop, u64 addr,
+			       u32 reg_num, u32 flags, bool reloc)
+{
+	prop[0] = FIELD_PREP(OF_PCI_ADDR_FIELD_BUS, pdev->bus->number) |
+		FIELD_PREP(OF_PCI_ADDR_FIELD_DEV, PCI_SLOT(pdev->devfn)) |
+		FIELD_PREP(OF_PCI_ADDR_FIELD_FUNC, PCI_FUNC(pdev->devfn));
+	prop[0] |= flags | reg_num;
+	if (!reloc) {
+		prop[0] |= OF_PCI_ADDR_FIELD_NONRELOC;
+		prop[1] = upper_32_bits(addr);
+		prop[2] = lower_32_bits(addr);
+	}
+}
+
+static int of_pci_get_addr_flags(struct resource *res, u32 *flags)
+{
+	u32 ss;
+
+	if (res->flags & IORESOURCE_IO)
+		ss = OF_PCI_ADDR_SPACE_IO;
+	else if (res->flags & IORESOURCE_MEM_64)
+		ss = OF_PCI_ADDR_SPACE_MEM64;
+	else if (res->flags & IORESOURCE_MEM)
+		ss = OF_PCI_ADDR_SPACE_MEM32;
+	else
+		return -EINVAL;
+
+	*flags = 0;
+	if (res->flags & IORESOURCE_PREFETCH)
+		*flags |= OF_PCI_ADDR_FIELD_PREFETCH;
+
+	*flags |= FIELD_PREP(OF_PCI_ADDR_FIELD_SS, ss);
+
+	return 0;
+}
+
+static int of_pci_prop_bus_range(struct pci_dev *pdev,
+				 struct of_changeset *ocs,
+				 struct device_node *np)
+{
+	u32 bus_range[] = { pdev->subordinate->busn_res.start,
+			    pdev->subordinate->busn_res.end };
+
+	return of_changeset_add_prop_u32_array(ocs, np, "bus-range", bus_range,
+					       ARRAY_SIZE(bus_range));
+}
+
+static int of_pci_prop_ranges(struct pci_dev *pdev, struct of_changeset *ocs,
+			      struct device_node *np)
+{
+	struct of_pci_range *rp;
+	struct resource *res;
+	int i, j, ret;
+	u32 flags, num;
+	u64 val64;
+
+	if (pci_is_bridge(pdev)) {
+		num = PCI_BRIDGE_RESOURCE_NUM;
+		res = &pdev->resource[PCI_BRIDGE_RESOURCES];
+	} else {
+		num = PCI_STD_NUM_BARS;
+		res = &pdev->resource[PCI_STD_RESOURCES];
+	}
+
+	rp = kcalloc(num, sizeof(*rp), GFP_KERNEL);
+	if (!rp)
+		return -ENOMEM;
+
+	for (i = 0, j = 0; j < num; j++) {
+		if (!resource_size(&res[j]))
+			continue;
+
+		if (of_pci_get_addr_flags(&res[j], &flags))
+			continue;
+
+		val64 = res[j].start;
+		of_pci_set_address(pdev, rp[i].parent_addr, val64, 0, flags,
+				   false);
+		if (pci_is_bridge(pdev)) {
+			memcpy(rp[i].child_addr, rp[i].parent_addr,
+			       sizeof(rp[i].child_addr));
+		} else {
+			/*
+			 * For endpoint device, the lower 64-bits of child
+			 * address is always zero.
+			 */
+			rp[i].child_addr[0] = j;
+		}
+
+		val64 = resource_size(&res[j]);
+		rp[i].size[0] = upper_32_bits(val64);
+		rp[i].size[1] = lower_32_bits(val64);
+
+		i++;
+	}
+
+	ret = of_changeset_add_prop_u32_array(ocs, np, "ranges", (u32 *)rp,
+					      i * sizeof(*rp) / sizeof(u32));
+	kfree(rp);
+
+	return ret;
+}
+
+static int of_pci_prop_reg(struct pci_dev *pdev, struct of_changeset *ocs,
+			   struct device_node *np)
+{
+	struct of_pci_addr_pair reg = { 0 };
+
+	/* configuration space */
+	of_pci_set_address(pdev, reg.phys_addr, 0, 0, 0, true);
+
+	return of_changeset_add_prop_u32_array(ocs, np, "reg", (u32 *)&reg,
+					       sizeof(reg) / sizeof(u32));
+}
+
+static int of_pci_prop_interrupts(struct pci_dev *pdev,
+				  struct of_changeset *ocs,
+				  struct device_node *np)
+{
+	int ret;
+	u8 pin;
+
+	ret = pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
+	if (ret != 0)
+		return ret;
+
+	if (!pin)
+		return 0;
+
+	return of_changeset_add_prop_u32(ocs, np, "interrupts", (u32)pin);
+}
+
+static int of_pci_prop_intr_map(struct pci_dev *pdev, struct of_changeset *ocs,
+				struct device_node *np)
+{
+	struct of_phandle_args out_irq[OF_PCI_MAX_INT_PIN];
+	u32 i, addr_sz[OF_PCI_MAX_INT_PIN] = { 0 }, map_sz = 0;
+	__be32 laddr[OF_PCI_ADDRESS_CELLS] = { 0 };
+	u32 int_map_mask[] = { 0xffff00, 0, 0, 7 };
+	struct device_node *pnode;
+	struct pci_dev *child;
+	u32 *int_map, *mapp;
+	int ret;
+	u8 pin;
+
+	pnode = pci_device_to_OF_node(pdev->bus->self);
+	if (!pnode)
+		pnode = pci_bus_to_OF_node(pdev->bus);
+
+	if (!pnode) {
+		pci_err(pdev, "failed to get parent device node\n");
+		return -EINVAL;
+	}
+
+	laddr[0] = cpu_to_be32((pdev->bus->number << 16) | (pdev->devfn << 8));
+	for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
+		i = pin - 1;
+		out_irq[i].np = pnode;
+		out_irq[i].args_count = 1;
+		out_irq[i].args[0] = pin;
+		ret = of_irq_parse_raw(laddr, &out_irq[i]);
+		if (ret) {
+			pci_err(pdev, "parse irq %d failed, ret %d\n", pin, ret);
+			continue;
+		}
+		ret = of_property_read_u32(out_irq[i].np, "#address-cells",
+					   &addr_sz[i]);
+		if (ret)
+			addr_sz[i] = 0;
+	}
+
+	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
+		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
+			i = pci_swizzle_interrupt_pin(child, pin) - 1;
+			map_sz += 5 + addr_sz[i] + out_irq[i].args_count;
+		}
+	}
+
+	int_map = kcalloc(map_sz, sizeof(u32), GFP_KERNEL);
+	if (!int_map)
+		return -ENOMEM;
+
+	mapp = int_map;
+
+	list_for_each_entry(child, &pdev->subordinate->devices, bus_list) {
+		for (pin = 1; pin <= OF_PCI_MAX_INT_PIN; pin++) {
+			*mapp = (child->bus->number << 16) |
+				(child->devfn << 8);
+			mapp += OF_PCI_ADDRESS_CELLS;
+			*mapp = pin;
+			mapp++;
+			i = pci_swizzle_interrupt_pin(child, pin) - 1;
+			*mapp = out_irq[i].np->phandle;
+			mapp++;
+			if (addr_sz[i]) {
+				ret = of_property_read_u32_array(out_irq[i].np,
+								 "reg", mapp,
+								 addr_sz[i]);
+				if (ret)
+					goto failed;
+			}
+			mapp += addr_sz[i];
+			memcpy(mapp, out_irq[i].args,
+			       out_irq[i].args_count * sizeof(u32));
+			mapp += out_irq[i].args_count;
+		}
+	}
+
+	ret = of_changeset_add_prop_u32_array(ocs, np, "interrupt-map", int_map,
+					      map_sz);
+	if (ret)
+		goto failed;
+
+	ret = of_changeset_add_prop_u32(ocs, np, "#interrupt-cells", 1);
+	if (ret)
+		goto failed;
+
+	ret = of_changeset_add_prop_u32_array(ocs, np, "interrupt-map-mask",
+					      int_map_mask,
+					      ARRAY_SIZE(int_map_mask));
+	if (ret)
+		goto failed;
+
+	kfree(int_map);
+	return 0;
+
+failed:
+	kfree(int_map);
+	return ret;
+}
+
+static int of_pci_prop_compatible(struct pci_dev *pdev,
+				  struct of_changeset *ocs,
+				  struct device_node *np)
+{
+	const char *compat_strs[PROP_COMPAT_NUM] = { 0 };
+	int i, ret;
+
+	compat_strs[PROP_COMPAT_PCI_VVVV_DDDD] =
+		kasprintf(GFP_KERNEL, "pci%x,%x", pdev->vendor, pdev->device);
+	compat_strs[PROP_COMPAT_PCICLASS_CCSSPP] =
+		kasprintf(GFP_KERNEL, "pciclass,%06x", pdev->class);
+	compat_strs[PROP_COMPAT_PCICLASS_CCSS] =
+		kasprintf(GFP_KERNEL, "pciclass,%04x", pdev->class >> 8);
+
+	ret = of_changeset_add_prop_string_array(ocs, np, "compatible",
+						 compat_strs, PROP_COMPAT_NUM);
+	for (i = 0; i < PROP_COMPAT_NUM; i++)
+		kfree(compat_strs[i]);
+
+	return ret;
+}
+
+int of_pci_add_properties(struct pci_dev *pdev, struct of_changeset *ocs,
+			  struct device_node *np)
+{
+	int ret;
+
+	/*
+	 * The added properties will be released when the
+	 * changeset is destroyed.
+	 */
+	if (pci_is_bridge(pdev)) {
+		ret = of_changeset_add_prop_string(ocs, np, "device_type",
+						   "pci");
+		if (ret)
+			return ret;
+
+		ret = of_pci_prop_bus_range(pdev, ocs, np);
+		if (ret)
+			return ret;
+
+		ret = of_pci_prop_intr_map(pdev, ocs, np);
+		if (ret)
+			return ret;
+	}
+
+	ret = of_pci_prop_ranges(pdev, ocs, np);
+	if (ret)
+		return ret;
+
+	ret = of_changeset_add_prop_u32(ocs, np, "#address-cells",
+					OF_PCI_ADDRESS_CELLS);
+	if (ret)
+		return ret;
+
+	ret = of_changeset_add_prop_u32(ocs, np, "#size-cells",
+					OF_PCI_SIZE_CELLS);
+	if (ret)
+		return ret;
+
+	ret = of_pci_prop_reg(pdev, ocs, np);
+	if (ret)
+		return ret;
+
+	ret = of_pci_prop_compatible(pdev, ocs, np);
+	if (ret)
+		return ret;
+
+	ret = of_pci_prop_interrupts(pdev, ocs, np);
+	if (ret)
+		return ret;
+
+	return 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a4c397434057..ba717bdd700d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -679,6 +679,18 @@  static inline int devm_of_pci_bridge_init(struct device *dev, struct pci_host_br
 
 #endif /* CONFIG_OF */
 
+struct of_changeset;
+
+#ifdef CONFIG_PCI_DYNAMIC_OF_NODES
+void of_pci_make_dev_node(struct pci_dev *pdev);
+void of_pci_remove_node(struct pci_dev *pdev);
+int of_pci_add_properties(struct pci_dev *pdev, struct of_changeset *ocs,
+			  struct device_node *np);
+#else
+static inline void of_pci_make_dev_node(struct pci_dev *pdev) { }
+static inline void of_pci_remove_node(struct pci_dev *pdev) { }
+#endif
+
 #ifdef CONFIG_PCIEAER
 void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index d68aee29386b..d749ea8250d6 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -22,6 +22,7 @@  static void pci_stop_dev(struct pci_dev *dev)
 		device_release_driver(&dev->dev);
 		pci_proc_detach_device(dev);
 		pci_remove_sysfs_dev_files(dev);
+		of_pci_remove_node(dev);
 
 		pci_dev_assign_added(dev, false);
 	}