[RFC,KERNEL,v4,3/3] PCI/sysfs: Add gsi sysfs for pci_dev

Message ID	20240105062217.349645-4-Jiqian.Chen@amd.com (mailing list archive)
State	Superseded
Delegated to:	Krzysztof Wilczyński
Headers	show Received: from NAM11-CO1-obe.outbound.protection.outlook.com (mail-co1nam11on2078.outbound.protection.outlook.com [40.107.220.78]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5C9D91DFCB; Fri, 5 Jan 2024 06:22:56 +0000 (UTC) Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: Jiqian Chen <Jiqian.Chen@amd.com> To: Juergen Gross <jgross@suse.com>, Stefano Stabellini <sstabellini@kernel.org>, Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>, Boris Ostrovsky <boris.ostrovsky@oracle.com>, Bjorn Helgaas <bhelgaas@google.com>, "Rafael J . Wysocki" <rafael@kernel.org>, Len Brown <lenb@kernel.org>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com> CC: <xen-devel@lists.xenproject.org>, <linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linux-acpi@vger.kernel.org>, "Stewart Hildebrand" <Stewart.Hildebrand@amd.com>, Huang Rui <Ray.Huang@amd.com>, Jiqian Chen <Jiqian.Chen@amd.com>, Huang Rui <ray.huang@amd.com> Subject: [RFC KERNEL PATCH v4 3/3] PCI/sysfs: Add gsi sysfs for pci_dev Date: Fri, 5 Jan 2024 14:22:17 +0800 Message-ID: <20240105062217.349645-4-Jiqian.Chen@amd.com> In-Reply-To: <20240105062217.349645-1-Jiqian.Chen@amd.com> References: <20240105062217.349645-1-Jiqian.Chen@amd.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	Support device passthrough when dom0 is PVH on Xen \| expand [RFC,KERNEL,v4,0/3] Support device passthrough when dom0 is PVH on Xen [RFC,KERNEL,v4,1/3] xen/pci: Add xen_reset_device_state function [RFC,KERNEL,v4,2/3] xen/pvh: Setup gsi for passthrough device [RFC,KERNEL,v4,3/3] PCI/sysfs: Add gsi sysfs for pci_dev

Jiqian Chen Jan. 5, 2024, 6:22 a.m. UTC

There is a need for some scenarios to use gsi sysfs.
For example, when xen passthrough a device to dumU, it will
use gsi to map pirq, but currently userspace can't get gsi
number.
So, add gsi sysfs for that and for other potential scenarios.

Co-developed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
---
 drivers/acpi/pci_irq.c  |  1 +
 drivers/pci/pci-sysfs.c | 11 +++++++++++
 include/linux/pci.h     |  2 ++
 3 files changed, 14 insertions(+)

Jiqian Chen Jan. 22, 2024, 6:36 a.m. UTC | #1

Hi Bjorn Helgaas,

Do you have any comments on this patch?

On 2024/1/5 14:22, Chen, Jiqian wrote:
> There is a need for some scenarios to use gsi sysfs.
> For example, when xen passthrough a device to dumU, it will
> use gsi to map pirq, but currently userspace can't get gsi
> number.
> So, add gsi sysfs for that and for other potential scenarios.
> 
> Co-developed-by: Huang Rui <ray.huang@amd.com>
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> ---
>  drivers/acpi/pci_irq.c  |  1 +
>  drivers/pci/pci-sysfs.c | 11 +++++++++++
>  include/linux/pci.h     |  2 ++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
> index 630fe0a34bc6..739a58755df2 100644
> --- a/drivers/acpi/pci_irq.c
> +++ b/drivers/acpi/pci_irq.c
> @@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
>  		kfree(entry);
>  		return 0;
>  	}
> +	dev->gsi = gsi;
>  
>  	rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
>  	if (rc < 0) {
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 2321fdfefd7d..c51df88d079e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(irq);
>  
> +static ssize_t gsi_show(struct device *dev,
> +			struct device_attribute *attr,
> +			char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sysfs_emit(buf, "%u\n", pdev->gsi);
> +}
> +static DEVICE_ATTR_RO(gsi);
> +
>  static ssize_t broken_parity_status_show(struct device *dev,
>  					 struct device_attribute *attr,
>  					 char *buf)
> @@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = {
>  	&dev_attr_revision.attr,
>  	&dev_attr_class.attr,
>  	&dev_attr_irq.attr,
> +	&dev_attr_gsi.attr,
>  	&dev_attr_local_cpus.attr,
>  	&dev_attr_local_cpulist.attr,
>  	&dev_attr_modalias.attr,
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index dea043bc1e38..0618d4a87a50 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -529,6 +529,8 @@ struct pci_dev {
>  
>  	/* These methods index pci_reset_fn_methods[] */
>  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> +
> +	unsigned int	gsi;
>  };
>  
>  static inline struct pci_dev *pci_physfn(struct pci_dev *dev)

Bjorn Helgaas Jan. 22, 2024, 11:37 p.m. UTC | #2

On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> There is a need for some scenarios to use gsi sysfs.
> For example, when xen passthrough a device to dumU, it will
> use gsi to map pirq, but currently userspace can't get gsi
> number.
> So, add gsi sysfs for that and for other potential scenarios.

Isn't GSI really an ACPI-specific concept?

I don't know enough about Xen to know why it needs the GSI in
userspace.  Is this passthrough brand new functionality that can't be
done today because we don't expose the GSI yet?

How does userspace use the GSI?  I see "to map pirq", but I think we
should have more concrete details about exactly what is needed and how
it is used before adding something new in sysfs.

Is there some more generic kernel interface we could use
for this?

s/dumU/DomU/ ?  (I dunno, but https://www.google.com/search?q=xen+dumu
suggests it :))

> Co-developed-by: Huang Rui <ray.huang@amd.com>
> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
> ---
>  drivers/acpi/pci_irq.c  |  1 +
>  drivers/pci/pci-sysfs.c | 11 +++++++++++
>  include/linux/pci.h     |  2 ++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
> index 630fe0a34bc6..739a58755df2 100644
> --- a/drivers/acpi/pci_irq.c
> +++ b/drivers/acpi/pci_irq.c
> @@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
>  		kfree(entry);
>  		return 0;
>  	}
> +	dev->gsi = gsi;
>  
>  	rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
>  	if (rc < 0) {
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 2321fdfefd7d..c51df88d079e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(irq);
>  
> +static ssize_t gsi_show(struct device *dev,
> +			struct device_attribute *attr,
> +			char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sysfs_emit(buf, "%u\n", pdev->gsi);
> +}
> +static DEVICE_ATTR_RO(gsi);
> +
>  static ssize_t broken_parity_status_show(struct device *dev,
>  					 struct device_attribute *attr,
>  					 char *buf)
> @@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = {
>  	&dev_attr_revision.attr,
>  	&dev_attr_class.attr,
>  	&dev_attr_irq.attr,
> +	&dev_attr_gsi.attr,
>  	&dev_attr_local_cpus.attr,
>  	&dev_attr_local_cpulist.attr,
>  	&dev_attr_modalias.attr,
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index dea043bc1e38..0618d4a87a50 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -529,6 +529,8 @@ struct pci_dev {
>  
>  	/* These methods index pci_reset_fn_methods[] */
>  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> +
> +	unsigned int	gsi;
>  };
>  
>  static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
> -- 
> 2.34.1
> 
>

Jiqian Chen Jan. 23, 2024, 10:13 a.m. UTC | #3

On 2024/1/23 07:37, Bjorn Helgaas wrote:
> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>> There is a need for some scenarios to use gsi sysfs.
>> For example, when xen passthrough a device to dumU, it will
>> use gsi to map pirq, but currently userspace can't get gsi
>> number.
>> So, add gsi sysfs for that and for other potential scenarios.
> 
> Isn't GSI really an ACPI-specific concept?
I also added the Maintains of ACPI to get some inputs.
Hi Rafael J. Wysocki and Len Brown, do you have any suggestions about this patch?

> 
> I don't know enough about Xen to know why it needs the GSI in
> userspace.  Is this passthrough brand new functionality that can't be
> done today because we don't expose the GSI yet?
In Xen architecture, there is a privileged domain named Dom0 that has ACPI support and is responsible for detecting and controlling the hardware, also it performs privileged operations such as the creation of normal (unprivileged) domains DomUs. When we give to a DomU direct access to a device, we need also to route the physical interrupts to the DomU. In order to do so Xen needs to setup and map the interrupts appropriately. For the case of GSI interrupts, since Xen does not have support to get the ACPI routing info in the hypervisor itself, it needs to get this info from Dom0. One way would be for this info to be exposed in sysfs and the xen toolstack that runs in Dom0's userspace to get this info reading sysfs and then pass it to Xen.

And I have tried another approach in the past version patches that keeping irq to gsi mappings and then xen tool was consulting the map via a syscall and was passing the info to Xen. But it was rejected by Xen maintainers because they thought the mappings and translations were all Linux internal actions, and has nothing to do with Xen, so they suggested me to expose the GSI in sysfs because it is cleaner and easier to retrieve it in userspace.
This is my past version:
Kernel: https://lore.kernel.org/lkml/20231124103123.3263471-1-Jiqian.Chen@amd.com/T/#m8d20edd326cf7735c2804f0371e8a63b6beec60c
Xen: https://lore.kernel.org/xen-devel/20231124104136.3263722-1-Jiqian.Chen@amd.com/T/#m9f9068d558822af0a5b28cd241cab4d779e36974

> 
> How does userspace use the GSI?  I see "to map pirq", but I think we
> should have more concrete details about exactly what is needed and how
> it is used before adding something new in sysfs.
As above reason.

> 
> Is there some more generic kernel interface we could use
> for this?
No, there is no method for now, I think.

> 
> s/dumU/DomU/ ?  (I dunno, but https://www.google.com/search?q=xen+dumu
> suggests it :))
> 
>> Co-developed-by: Huang Rui <ray.huang@amd.com>
>> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
>> ---
>>  drivers/acpi/pci_irq.c  |  1 +
>>  drivers/pci/pci-sysfs.c | 11 +++++++++++
>>  include/linux/pci.h     |  2 ++
>>  3 files changed, 14 insertions(+)
>>
>> diff --git a/drivers/acpi/pci_irq.c b/drivers/acpi/pci_irq.c
>> index 630fe0a34bc6..739a58755df2 100644
>> --- a/drivers/acpi/pci_irq.c
>> +++ b/drivers/acpi/pci_irq.c
>> @@ -449,6 +449,7 @@ int acpi_pci_irq_enable(struct pci_dev *dev)
>>  		kfree(entry);
>>  		return 0;
>>  	}
>> +	dev->gsi = gsi;
>>  
>>  	rc = acpi_register_gsi(&dev->dev, gsi, triggering, polarity);
>>  	if (rc < 0) {
>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index 2321fdfefd7d..c51df88d079e 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -71,6 +71,16 @@ static ssize_t irq_show(struct device *dev,
>>  }
>>  static DEVICE_ATTR_RO(irq);
>>  
>> +static ssize_t gsi_show(struct device *dev,
>> +			struct device_attribute *attr,
>> +			char *buf)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> +	return sysfs_emit(buf, "%u\n", pdev->gsi);
>> +}
>> +static DEVICE_ATTR_RO(gsi);
>> +
>>  static ssize_t broken_parity_status_show(struct device *dev,
>>  					 struct device_attribute *attr,
>>  					 char *buf)
>> @@ -596,6 +606,7 @@ static struct attribute *pci_dev_attrs[] = {
>>  	&dev_attr_revision.attr,
>>  	&dev_attr_class.attr,
>>  	&dev_attr_irq.attr,
>> +	&dev_attr_gsi.attr,
>>  	&dev_attr_local_cpus.attr,
>>  	&dev_attr_local_cpulist.attr,
>>  	&dev_attr_modalias.attr,
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index dea043bc1e38..0618d4a87a50 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -529,6 +529,8 @@ struct pci_dev {
>>  
>>  	/* These methods index pci_reset_fn_methods[] */
>>  	u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
>> +
>> +	unsigned int	gsi;
>>  };
>>  
>>  static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
>> -- 
>> 2.34.1
>>
>>

Bjorn Helgaas Jan. 23, 2024, 4:02 p.m. UTC | #4

On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> >> There is a need for some scenarios to use gsi sysfs.
> >> For example, when xen passthrough a device to dumU, it will
> >> use gsi to map pirq, but currently userspace can't get gsi
> >> number.
> >> So, add gsi sysfs for that and for other potential scenarios.
> ...

> > I don't know enough about Xen to know why it needs the GSI in
> > userspace.  Is this passthrough brand new functionality that can't be
> > done today because we don't expose the GSI yet?
>
> In Xen architecture, there is a privileged domain named Dom0 that
> has ACPI support and is responsible for detecting and controlling
> the hardware, also it performs privileged operations such as the
> creation of normal (unprivileged) domains DomUs. When we give to a
> DomU direct access to a device, we need also to route the physical
> interrupts to the DomU. In order to do so Xen needs to setup and map
> the interrupts appropriately.

What kernel interfaces are used for this setup and mapping?

Jiqian Chen Jan. 25, 2024, 7:17 a.m. UTC | #5

On 2024/1/24 00:02, Bjorn Helgaas wrote:
> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>> There is a need for some scenarios to use gsi sysfs.
>>>> For example, when xen passthrough a device to dumU, it will
>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>> number.
>>>> So, add gsi sysfs for that and for other potential scenarios.
>> ...
> 
>>> I don't know enough about Xen to know why it needs the GSI in
>>> userspace.  Is this passthrough brand new functionality that can't be
>>> done today because we don't expose the GSI yet?
>>
>> In Xen architecture, there is a privileged domain named Dom0 that
>> has ACPI support and is responsible for detecting and controlling
>> the hardware, also it performs privileged operations such as the
>> creation of normal (unprivileged) domains DomUs. When we give to a
>> DomU direct access to a device, we need also to route the physical
>> interrupts to the DomU. In order to do so Xen needs to setup and map
>> the interrupts appropriately.
> 
> What kernel interfaces are used for this setup and mapping?
For passthrough devices, the setup and mapping of routing physical interrupts to DomU are done on Xen hypervisor side, hypervisor only need userspace to provide the GSI info, see Xen code: xc_physdev_map_pirq require GSI and then will call hypercall to pass GSI into hypervisor and then hypervisor will do the mapping and routing, kernel doesn't do the setup and mapping.
For devices on PVH Dom0, Dom0 setups interrupts for devices as the baremetal Linux kernel does, through using acpi_pci_irq_enable-> acpi_register_gsi-> __acpi_register_gsi->acpi_register_gsi_ioapic.

Bjorn Helgaas Jan. 29, 2024, 10:01 p.m. UTC | #6

On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> >>>> There is a need for some scenarios to use gsi sysfs.
> >>>> For example, when xen passthrough a device to dumU, it will
> >>>> use gsi to map pirq, but currently userspace can't get gsi
> >>>> number.
> >>>> So, add gsi sysfs for that and for other potential scenarios.
> >> ...
> > 
> >>> I don't know enough about Xen to know why it needs the GSI in
> >>> userspace.  Is this passthrough brand new functionality that can't be
> >>> done today because we don't expose the GSI yet?

I assume this must be new functionality, i.e., this kind of
passthrough does not work today, right?

> >> has ACPI support and is responsible for detecting and controlling
> >> the hardware, also it performs privileged operations such as the
> >> creation of normal (unprivileged) domains DomUs. When we give to a
> >> DomU direct access to a device, we need also to route the physical
> >> interrupts to the DomU. In order to do so Xen needs to setup and map
> >> the interrupts appropriately.
> > 
> > What kernel interfaces are used for this setup and mapping?
>
> For passthrough devices, the setup and mapping of routing physical
> interrupts to DomU are done on Xen hypervisor side, hypervisor only
> need userspace to provide the GSI info, see Xen code:
> xc_physdev_map_pirq require GSI and then will call hypercall to pass
> GSI into hypervisor and then hypervisor will do the mapping and
> routing, kernel doesn't do the setup and mapping.

So we have to expose the GSI to userspace not because userspace itself
uses it, but so userspace can turn around and pass it back into the
kernel?

It seems like it would be better for userspace to pass an identifier
of the PCI device itself back into the hypervisor.  Then the interface
could be generic and potentially work even on non-ACPI systems where
the GSI concept doesn't apply.

> For devices on PVH Dom0, Dom0 setups interrupts for devices as the
> baremetal Linux kernel does, through using acpi_pci_irq_enable->
> acpi_register_gsi-> __acpi_register_gsi->acpi_register_gsi_ioapic.

This case sounds like it's all inside Linux, so I assume there's no
problem with this one?  If you can call acpi_pci_irq_enable(), you
have the pci_dev, so I assume there's no need to expose the GSI in
sysfs?

Bjorn

Roger Pau Monne Jan. 30, 2024, 9:07 a.m. UTC | #7

On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > >>>> There is a need for some scenarios to use gsi sysfs.
> > >>>> For example, when xen passthrough a device to dumU, it will
> > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > >>>> number.
> > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > >> ...
> > > 
> > >>> I don't know enough about Xen to know why it needs the GSI in
> > >>> userspace.  Is this passthrough brand new functionality that can't be
> > >>> done today because we don't expose the GSI yet?
> 
> I assume this must be new functionality, i.e., this kind of
> passthrough does not work today, right?
> 
> > >> has ACPI support and is responsible for detecting and controlling
> > >> the hardware, also it performs privileged operations such as the
> > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > >> DomU direct access to a device, we need also to route the physical
> > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > >> the interrupts appropriately.
> > > 
> > > What kernel interfaces are used for this setup and mapping?
> >
> > For passthrough devices, the setup and mapping of routing physical
> > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > need userspace to provide the GSI info, see Xen code:
> > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > GSI into hypervisor and then hypervisor will do the mapping and
> > routing, kernel doesn't do the setup and mapping.
> 
> So we have to expose the GSI to userspace not because userspace itself
> uses it, but so userspace can turn around and pass it back into the
> kernel?

No, the point is to pass it back to Xen, which doesn't know the
mapping between GSIs and PCI devices because it can't execute the ACPI
AML resource methods that provide such information.

The (Linux) kernel is just a proxy that forwards the hypercalls from
user-space tools into Xen.

> It seems like it would be better for userspace to pass an identifier
> of the PCI device itself back into the hypervisor.  Then the interface
> could be generic and potentially work even on non-ACPI systems where
> the GSI concept doesn't apply.

We would still need a way to pass the GSI to PCI device relation to
the hypervisor, and then cache such data in the hypervisor.

I don't think we have any preference of where such information should
be exposed, but given GSIs are an ACPI concept not specific to Xen
they should be exposed by a non-Xen specific interface.

Thanks, Roger.

Bjorn Helgaas Jan. 30, 2024, 8:44 p.m. UTC | #8

On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > >>>> For example, when xen passthrough a device to dumU, it will
> > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > >>>> number.
> > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > >> ...
> > > > 
> > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > >>> done today because we don't expose the GSI yet?
> > 
> > I assume this must be new functionality, i.e., this kind of
> > passthrough does not work today, right?
> > 
> > > >> has ACPI support and is responsible for detecting and controlling
> > > >> the hardware, also it performs privileged operations such as the
> > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > >> DomU direct access to a device, we need also to route the physical
> > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > >> the interrupts appropriately.
> > > > 
> > > > What kernel interfaces are used for this setup and mapping?
> > >
> > > For passthrough devices, the setup and mapping of routing physical
> > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > need userspace to provide the GSI info, see Xen code:
> > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > GSI into hypervisor and then hypervisor will do the mapping and
> > > routing, kernel doesn't do the setup and mapping.
> > 
> > So we have to expose the GSI to userspace not because userspace itself
> > uses it, but so userspace can turn around and pass it back into the
> > kernel?
> 
> No, the point is to pass it back to Xen, which doesn't know the
> mapping between GSIs and PCI devices because it can't execute the ACPI
> AML resource methods that provide such information.
> 
> The (Linux) kernel is just a proxy that forwards the hypercalls from
> user-space tools into Xen.

But I guess Xen knows how to interpret a GSI even though it doesn't
have access to AML?

> > It seems like it would be better for userspace to pass an identifier
> > of the PCI device itself back into the hypervisor.  Then the interface
> > could be generic and potentially work even on non-ACPI systems where
> > the GSI concept doesn't apply.
> 
> We would still need a way to pass the GSI to PCI device relation to
> the hypervisor, and then cache such data in the hypervisor.
> 
> I don't think we have any preference of where such information should
> be exposed, but given GSIs are an ACPI concept not specific to Xen
> they should be exposed by a non-Xen specific interface.

AFAIK Linux doesn't expose GSIs directly to userspace yet.  The GSI
concept relies on ACPI MADT, _MAT, _PRT, etc.  A GSI is associated
with some device (PCI in this case) and some interrupt controller
entry.  I don't understand how a GSI value is useful without knowing
something about that framework in which GSIs exist.

Obviously I know less than nothing about Xen, so I apologize for
asking all these stupid questions, but it just doesn't all make sense
to me yet.

Bjorn

Roger Pau Monne Jan. 31, 2024, 8:58 a.m. UTC | #9

On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > >>>> number.
> > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > >> ...
> > > > > 
> > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > >>> done today because we don't expose the GSI yet?
> > > 
> > > I assume this must be new functionality, i.e., this kind of
> > > passthrough does not work today, right?
> > > 
> > > > >> has ACPI support and is responsible for detecting and controlling
> > > > >> the hardware, also it performs privileged operations such as the
> > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > >> DomU direct access to a device, we need also to route the physical
> > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > >> the interrupts appropriately.
> > > > > 
> > > > > What kernel interfaces are used for this setup and mapping?
> > > >
> > > > For passthrough devices, the setup and mapping of routing physical
> > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > need userspace to provide the GSI info, see Xen code:
> > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > routing, kernel doesn't do the setup and mapping.
> > > 
> > > So we have to expose the GSI to userspace not because userspace itself
> > > uses it, but so userspace can turn around and pass it back into the
> > > kernel?
> > 
> > No, the point is to pass it back to Xen, which doesn't know the
> > mapping between GSIs and PCI devices because it can't execute the ACPI
> > AML resource methods that provide such information.
> > 
> > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > user-space tools into Xen.
> 
> But I guess Xen knows how to interpret a GSI even though it doesn't
> have access to AML?

On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
configure the RTE as requested.

> > > It seems like it would be better for userspace to pass an identifier
> > > of the PCI device itself back into the hypervisor.  Then the interface
> > > could be generic and potentially work even on non-ACPI systems where
> > > the GSI concept doesn't apply.
> > 
> > We would still need a way to pass the GSI to PCI device relation to
> > the hypervisor, and then cache such data in the hypervisor.
> > 
> > I don't think we have any preference of where such information should
> > be exposed, but given GSIs are an ACPI concept not specific to Xen
> > they should be exposed by a non-Xen specific interface.
> 
> AFAIK Linux doesn't expose GSIs directly to userspace yet.  The GSI
> concept relies on ACPI MADT, _MAT, _PRT, etc.  A GSI is associated
> with some device (PCI in this case) and some interrupt controller
> entry.  I don't understand how a GSI value is useful without knowing
> something about that framework in which GSIs exist.

I wouldn't say it's strictly associated with PCI.  A GSI is a way for
ACPI to have a single space that unifies all possible IO-APICs pins in
the system in a flat way.  A GSI is useful in itself because there's
a single GSI space for the whole host.

> Obviously I know less than nothing about Xen, so I apologize for
> asking all these stupid questions, but it just doesn't all make sense
> to me yet.

That's all fine, maybe there's a better path or way to expose this ACPI
information.  Maybe introduce a per-device acpi directory and expose
it there?  Or rename the entry to acpi_gsi?

Thanks, Roger.

Bjorn Helgaas Jan. 31, 2024, 7 p.m. UTC | #10

On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > >>>> number.
> > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > >> ...
> > > > > > 
> > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > > >>> done today because we don't expose the GSI yet?
> > > > 
> > > > I assume this must be new functionality, i.e., this kind of
> > > > passthrough does not work today, right?
> > > > 
> > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > >> the hardware, also it performs privileged operations such as the
> > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > >> the interrupts appropriately.
> > > > > > 
> > > > > > What kernel interfaces are used for this setup and mapping?
> > > > >
> > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > need userspace to provide the GSI info, see Xen code:
> > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > routing, kernel doesn't do the setup and mapping.
> > > > 
> > > > So we have to expose the GSI to userspace not because userspace itself
> > > > uses it, but so userspace can turn around and pass it back into the
> > > > kernel?
> > > 
> > > No, the point is to pass it back to Xen, which doesn't know the
> > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > AML resource methods that provide such information.
> > > 
> > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > user-space tools into Xen.
> > 
> > But I guess Xen knows how to interpret a GSI even though it doesn't
> > have access to AML?
> 
> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> configure the RTE as requested.

IIUC, mapping a GSI to an IO-APIC pin requires information from the
MADT.  So I guess Xen does use the static ACPI tables, but not the AML
_PRT methods that would connect a GSI with a PCI device?

I guess this means Xen would not be able to deal with _MAT methods,
which also contains MADT entries?  I don't know the implications of
this -- maybe it means Xen might not be able to use with hot-added
devices?

The tables (including DSDT and SSDTS that contain the AML) are exposed
to userspace via /sys/firmware/acpi/tables/, but of course that
doesn't mean Xen knows how to interpret the AML, and even if it did,
Xen probably wouldn't be able to *evaluate* it since that could
conflict with the host kernel's use of AML.

Bjorn

Roger Pau Monne Feb. 1, 2024, 8:39 a.m. UTC | #11

On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > >>>> number.
> > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > >> ...
> > > > > > > 
> > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > > > >>> done today because we don't expose the GSI yet?
> > > > > 
> > > > > I assume this must be new functionality, i.e., this kind of
> > > > > passthrough does not work today, right?
> > > > > 
> > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > >> the interrupts appropriately.
> > > > > > > 
> > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > >
> > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > 
> > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > kernel?
> > > > 
> > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > AML resource methods that provide such information.
> > > > 
> > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > user-space tools into Xen.
> > > 
> > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > have access to AML?
> > 
> > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > configure the RTE as requested.
> 
> IIUC, mapping a GSI to an IO-APIC pin requires information from the
> MADT.  So I guess Xen does use the static ACPI tables, but not the AML
> _PRT methods that would connect a GSI with a PCI device?

Yes, Xen can parse the static tables, and knows the base GSI of
IO-APICs from the MADT.

> I guess this means Xen would not be able to deal with _MAT methods,
> which also contains MADT entries?  I don't know the implications of
> this -- maybe it means Xen might not be able to use with hot-added
> devices?

It's my understanding _MAT will only be present on some very specific
devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
IO-APICs, but hotplug of CPUs should in principle be supported with
cooperation from the control domain OS (albeit it's not something that
we tests on our CI).  I don't expect however that a CPU object _MAT
method will return IO APIC entries.

> The tables (including DSDT and SSDTS that contain the AML) are exposed
> to userspace via /sys/firmware/acpi/tables/, but of course that
> doesn't mean Xen knows how to interpret the AML, and even if it did,
> Xen probably wouldn't be able to *evaluate* it since that could
> conflict with the host kernel's use of AML.

Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
in our context).

Getting back to our context though, what would be a suitable place for
exposing the GSI assigned to each device?

Thanks, Roger.

Bjorn Helgaas Feb. 9, 2024, 9:05 p.m. UTC | #12

On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > >>>> number.
> > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > >> ...
> > > > > > > > 
> > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > 
> > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > passthrough does not work today, right?
> > > > > > 
> > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > >> the interrupts appropriately.
> > > > > > > > 
> > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > >
> > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > 
> > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > kernel?
> > > > > 
> > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > AML resource methods that provide such information.
> > > > > 
> > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > user-space tools into Xen.
> > > > 
> > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > have access to AML?
> > > 
> > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > configure the RTE as requested.
> > 
> > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > MADT.  So I guess Xen does use the static ACPI tables, but not the AML
> > _PRT methods that would connect a GSI with a PCI device?
> 
> Yes, Xen can parse the static tables, and knows the base GSI of
> IO-APICs from the MADT.
> 
> > I guess this means Xen would not be able to deal with _MAT methods,
> > which also contains MADT entries?  I don't know the implications of
> > this -- maybe it means Xen might not be able to use with hot-added
> > devices?
> 
> It's my understanding _MAT will only be present on some very specific
> devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
> IO-APICs, but hotplug of CPUs should in principle be supported with
> cooperation from the control domain OS (albeit it's not something that
> we tests on our CI).  I don't expect however that a CPU object _MAT
> method will return IO APIC entries.
> 
> > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > to userspace via /sys/firmware/acpi/tables/, but of course that
> > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > Xen probably wouldn't be able to *evaluate* it since that could
> > conflict with the host kernel's use of AML.
> 
> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> in our context).
> 
> Getting back to our context though, what would be a suitable place for
> exposing the GSI assigned to each device?

IIUC, the Xen hypervisor:

  - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
    something running on the Dom0 kernel) to find the physical base
    address and GSI base, e.g., from I/O APIC, I/O SAPIC.

  - Needs the GSI to locate the APIC and pin within the APIC.  The
    Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
    learn the PCI device -> GSI mapping.

  - Has direct access to the APIC physical base address to program the
    Redirection Table.

The patch seems a little messy to me because the PCI core has to keep
track of the GSI even though it doesn't need it itself.  And the
current patch exposes it on all arches, even non-ACPI ones or when
ACPI is disabled (easily fixable).

We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
we don't know the GSI unless a Dom0 driver has claimed the device and
called pci_enable_device() for it, which seems like it might not be
desirable.

I was hoping we could put it in /sys/firmware/acpi/interrupts, but
that looks like it's only for SCI statistics.  I guess we could moot a
new /sys/firmware/acpi/gsi/ directory, but then each file there would
have to identify a device, which might not be as convenient as the
/sys/devices/ directory that already exists.  I guess there may be
GSIs for things other than PCI devices; will you ever care about any
of those?

Bjorn

Roger Pau Monne Feb. 12, 2024, 9:13 a.m. UTC | #13

On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> > On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > > >>>> number.
> > > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > > >> ...
> > > > > > > > > 
> > > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > > 
> > > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > > passthrough does not work today, right?
> > > > > > > 
> > > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > > >> the interrupts appropriately.
> > > > > > > > > 
> > > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > > >
> > > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > > 
> > > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > > kernel?
> > > > > > 
> > > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > > AML resource methods that provide such information.
> > > > > > 
> > > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > > user-space tools into Xen.
> > > > > 
> > > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > > have access to AML?
> > > > 
> > > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > > configure the RTE as requested.
> > > 
> > > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > > MADT.  So I guess Xen does use the static ACPI tables, but not the AML
> > > _PRT methods that would connect a GSI with a PCI device?
> > 
> > Yes, Xen can parse the static tables, and knows the base GSI of
> > IO-APICs from the MADT.
> > 
> > > I guess this means Xen would not be able to deal with _MAT methods,
> > > which also contains MADT entries?  I don't know the implications of
> > > this -- maybe it means Xen might not be able to use with hot-added
> > > devices?
> > 
> > It's my understanding _MAT will only be present on some very specific
> > devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
> > IO-APICs, but hotplug of CPUs should in principle be supported with
> > cooperation from the control domain OS (albeit it's not something that
> > we tests on our CI).  I don't expect however that a CPU object _MAT
> > method will return IO APIC entries.
> > 
> > > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > > to userspace via /sys/firmware/acpi/tables/, but of course that
> > > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > > Xen probably wouldn't be able to *evaluate* it since that could
> > > conflict with the host kernel's use of AML.
> > 
> > Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> > in our context).
> > 
> > Getting back to our context though, what would be a suitable place for
> > exposing the GSI assigned to each device?
> 
> IIUC, the Xen hypervisor:
> 
>   - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
>     something running on the Dom0 kernel) to find the physical base
>     address and GSI base, e.g., from I/O APIC, I/O SAPIC.

No, Xen parses the MADT directly from memory, before stating dom0.
That's a static table so it's fine for Xen to parse it and obtain the
I/O APIC GSI base.

>   - Needs the GSI to locate the APIC and pin within the APIC.  The
>     Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
>     learn the PCI device -> GSI mapping.

Yes, Xen doesn't know the PCI device -> GSI mapping.  Dom0 needs to
parse the ACPI methods and signal Xen to configure a GSI with a
given trigger and polarity.

>   - Has direct access to the APIC physical base address to program the
>     Redirection Table.

Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
accessible by dom0.

> The patch seems a little messy to me because the PCI core has to keep
> track of the GSI even though it doesn't need it itself.  And the
> current patch exposes it on all arches, even non-ACPI ones or when
> ACPI is disabled (easily fixable).
> 
> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
> we don't know the GSI unless a Dom0 driver has claimed the device and
> called pci_enable_device() for it, which seems like it might not be
> desirable.

I think that's always the case, as on dom0 devices to be passed
through are handled by pciback which does enable them.

I agree it might be best to not tie exposing the node to
pci_enable_device() having been called.  Is _PRT only evaluated as
part of acpi_pci_irq_enable()? (or pci_enable_device()).

> I was hoping we could put it in /sys/firmware/acpi/interrupts, but
> that looks like it's only for SCI statistics.  I guess we could moot a
> new /sys/firmware/acpi/gsi/ directory, but then each file there would
> have to identify a device, which might not be as convenient as the
> /sys/devices/ directory that already exists.  I guess there may be
> GSIs for things other than PCI devices; will you ever care about any
> of those?

We only support passthrough of PCI devices so far, but I guess if any
of such non-PCI devices ever appear and those use a GSI, and Xen
supports passthrough for them, then yes, we would need to fetch such
GSI somehow.

Thanks, Roger.

Bjorn Helgaas Feb. 12, 2024, 7:18 p.m. UTC | #14

On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
> > On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> > > On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > > > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > > > >>>> number.
> > > > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > > > >> ...
> > > > > > > > > > 
> > > > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > > > 
> > > > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > > > passthrough does not work today, right?
> > > > > > > > 
> > > > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > > > >> the interrupts appropriately.
> > > > > > > > > > 
> > > > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > > > >
> > > > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > > > 
> > > > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > > > kernel?
> > > > > > > 
> > > > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > > > AML resource methods that provide such information.
> > > > > > > 
> > > > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > > > user-space tools into Xen.
> > > > > > 
> > > > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > > > have access to AML?
> > > > > 
> > > > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > > > configure the RTE as requested.
> > > > 
> > > > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > > > MADT.  So I guess Xen does use the static ACPI tables, but not the AML
> > > > _PRT methods that would connect a GSI with a PCI device?
> > > 
> > > Yes, Xen can parse the static tables, and knows the base GSI of
> > > IO-APICs from the MADT.
> > > 
> > > > I guess this means Xen would not be able to deal with _MAT methods,
> > > > which also contains MADT entries?  I don't know the implications of
> > > > this -- maybe it means Xen might not be able to use with hot-added
> > > > devices?
> > > 
> > > It's my understanding _MAT will only be present on some very specific
> > > devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
> > > IO-APICs, but hotplug of CPUs should in principle be supported with
> > > cooperation from the control domain OS (albeit it's not something that
> > > we tests on our CI).  I don't expect however that a CPU object _MAT
> > > method will return IO APIC entries.
> > > 
> > > > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > > > to userspace via /sys/firmware/acpi/tables/, but of course that
> > > > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > > > Xen probably wouldn't be able to *evaluate* it since that could
> > > > conflict with the host kernel's use of AML.
> > > 
> > > Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> > > in our context).
> > > 
> > > Getting back to our context though, what would be a suitable place for
> > > exposing the GSI assigned to each device?
> > 
> > IIUC, the Xen hypervisor:
> > 
> >   - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
> >     something running on the Dom0 kernel) to find the physical base
> >     address and GSI base, e.g., from I/O APIC, I/O SAPIC.
> 
> No, Xen parses the MADT directly from memory, before stating dom0.
> That's a static table so it's fine for Xen to parse it and obtain the
> I/O APIC GSI base.

It's an interesting split to consume ACPI static tables directly but
put the AML interpreter elsewhere.  I doubt the ACPI spec envisioned
that, which makes me wonder what other things we could trip over, but
that's just a tangent.

> >   - Needs the GSI to locate the APIC and pin within the APIC.  The
> >     Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
> >     learn the PCI device -> GSI mapping.
> 
> Yes, Xen doesn't know the PCI device -> GSI mapping.  Dom0 needs to
> parse the ACPI methods and signal Xen to configure a GSI with a
> given trigger and polarity.
> 
> >   - Has direct access to the APIC physical base address to program the
> >     Redirection Table.
> 
> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
> accessible by dom0.
> 
> > The patch seems a little messy to me because the PCI core has to keep
> > track of the GSI even though it doesn't need it itself.  And the
> > current patch exposes it on all arches, even non-ACPI ones or when
> > ACPI is disabled (easily fixable).
> > 
> > We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
> > we don't know the GSI unless a Dom0 driver has claimed the device and
> > called pci_enable_device() for it, which seems like it might not be
> > desirable.
> 
> I think that's always the case, as on dom0 devices to be passed
> through are handled by pciback which does enable them.

pcistub_init_device() labels the pci_enable_device() as a "HACK"
related to determining the IRQ, which makes me think there's not
really a requirement for the device to be *enabled* (BAR decoding
enabled) by dom0.

> I agree it might be best to not tie exposing the node to
> pci_enable_device() having been called.  Is _PRT only evaluated as
> part of acpi_pci_irq_enable()? (or pci_enable_device()).

Yes.  AFAICT, acpi_pci_irq_enable() is the only path that evaluates
_PRT (except for a debugger interface).  I don't think it *needs* to
be that way, and the fact that we do it per-device like that means we
evaluate _PRT many times even though I think the results never change.

I could imagine evaluating _PRT once as part of enumerating a PCI host
bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
comment), but that looks like a fair bit of work to implement.  And of
course it doesn't really affect the question of how to expose the
result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
a possible location.

Bjorn

Roger Pau Monne Feb. 15, 2024, 8:37 a.m. UTC | #15

On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
> > On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
> > > On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
> > > > On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
> > > > > On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
> > > > > > On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
> > > > > > > > On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
> > > > > > > > > On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
> > > > > > > > > > On 2024/1/24 00:02, Bjorn Helgaas wrote:
> > > > > > > > > > > On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
> > > > > > > > > > >> On 2024/1/23 07:37, Bjorn Helgaas wrote:
> > > > > > > > > > >>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
> > > > > > > > > > >>>> There is a need for some scenarios to use gsi sysfs.
> > > > > > > > > > >>>> For example, when xen passthrough a device to dumU, it will
> > > > > > > > > > >>>> use gsi to map pirq, but currently userspace can't get gsi
> > > > > > > > > > >>>> number.
> > > > > > > > > > >>>> So, add gsi sysfs for that and for other potential scenarios.
> > > > > > > > > > >> ...
> > > > > > > > > > > 
> > > > > > > > > > >>> I don't know enough about Xen to know why it needs the GSI in
> > > > > > > > > > >>> userspace.  Is this passthrough brand new functionality that can't be
> > > > > > > > > > >>> done today because we don't expose the GSI yet?
> > > > > > > > > 
> > > > > > > > > I assume this must be new functionality, i.e., this kind of
> > > > > > > > > passthrough does not work today, right?
> > > > > > > > > 
> > > > > > > > > > >> has ACPI support and is responsible for detecting and controlling
> > > > > > > > > > >> the hardware, also it performs privileged operations such as the
> > > > > > > > > > >> creation of normal (unprivileged) domains DomUs. When we give to a
> > > > > > > > > > >> DomU direct access to a device, we need also to route the physical
> > > > > > > > > > >> interrupts to the DomU. In order to do so Xen needs to setup and map
> > > > > > > > > > >> the interrupts appropriately.
> > > > > > > > > > > 
> > > > > > > > > > > What kernel interfaces are used for this setup and mapping?
> > > > > > > > > >
> > > > > > > > > > For passthrough devices, the setup and mapping of routing physical
> > > > > > > > > > interrupts to DomU are done on Xen hypervisor side, hypervisor only
> > > > > > > > > > need userspace to provide the GSI info, see Xen code:
> > > > > > > > > > xc_physdev_map_pirq require GSI and then will call hypercall to pass
> > > > > > > > > > GSI into hypervisor and then hypervisor will do the mapping and
> > > > > > > > > > routing, kernel doesn't do the setup and mapping.
> > > > > > > > > 
> > > > > > > > > So we have to expose the GSI to userspace not because userspace itself
> > > > > > > > > uses it, but so userspace can turn around and pass it back into the
> > > > > > > > > kernel?
> > > > > > > > 
> > > > > > > > No, the point is to pass it back to Xen, which doesn't know the
> > > > > > > > mapping between GSIs and PCI devices because it can't execute the ACPI
> > > > > > > > AML resource methods that provide such information.
> > > > > > > > 
> > > > > > > > The (Linux) kernel is just a proxy that forwards the hypercalls from
> > > > > > > > user-space tools into Xen.
> > > > > > > 
> > > > > > > But I guess Xen knows how to interpret a GSI even though it doesn't
> > > > > > > have access to AML?
> > > > > > 
> > > > > > On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
> > > > > > configure the RTE as requested.
> > > > > 
> > > > > IIUC, mapping a GSI to an IO-APIC pin requires information from the
> > > > > MADT.  So I guess Xen does use the static ACPI tables, but not the AML
> > > > > _PRT methods that would connect a GSI with a PCI device?
> > > > 
> > > > Yes, Xen can parse the static tables, and knows the base GSI of
> > > > IO-APICs from the MADT.
> > > > 
> > > > > I guess this means Xen would not be able to deal with _MAT methods,
> > > > > which also contains MADT entries?  I don't know the implications of
> > > > > this -- maybe it means Xen might not be able to use with hot-added
> > > > > devices?
> > > > 
> > > > It's my understanding _MAT will only be present on some very specific
> > > > devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
> > > > IO-APICs, but hotplug of CPUs should in principle be supported with
> > > > cooperation from the control domain OS (albeit it's not something that
> > > > we tests on our CI).  I don't expect however that a CPU object _MAT
> > > > method will return IO APIC entries.
> > > > 
> > > > > The tables (including DSDT and SSDTS that contain the AML) are exposed
> > > > > to userspace via /sys/firmware/acpi/tables/, but of course that
> > > > > doesn't mean Xen knows how to interpret the AML, and even if it did,
> > > > > Xen probably wouldn't be able to *evaluate* it since that could
> > > > > conflict with the host kernel's use of AML.
> > > > 
> > > > Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
> > > > in our context).
> > > > 
> > > > Getting back to our context though, what would be a suitable place for
> > > > exposing the GSI assigned to each device?
> > > 
> > > IIUC, the Xen hypervisor:
> > > 
> > >   - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
> > >     something running on the Dom0 kernel) to find the physical base
> > >     address and GSI base, e.g., from I/O APIC, I/O SAPIC.
> > 
> > No, Xen parses the MADT directly from memory, before stating dom0.
> > That's a static table so it's fine for Xen to parse it and obtain the
> > I/O APIC GSI base.
> 
> It's an interesting split to consume ACPI static tables directly but
> put the AML interpreter elsewhere.

Well, static tables can be consumed by Xen, because thye don't require
an AML parser (obviously), and parsing them doesn't have any
side-effects that would prevent dom0 from being the OSPM (no methods
or similar are evaluated).

> I doubt the ACPI spec envisioned
> that, which makes me wonder what other things we could trip over, but
> that's just a tangent.

Indeed, ACPI is not be best interface for the Xen/dom0 split model.

> > >   - Needs the GSI to locate the APIC and pin within the APIC.  The
> > >     Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
> > >     learn the PCI device -> GSI mapping.
> > 
> > Yes, Xen doesn't know the PCI device -> GSI mapping.  Dom0 needs to
> > parse the ACPI methods and signal Xen to configure a GSI with a
> > given trigger and polarity.
> > 
> > >   - Has direct access to the APIC physical base address to program the
> > >     Redirection Table.
> > 
> > Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
> > accessible by dom0.
> > 
> > > The patch seems a little messy to me because the PCI core has to keep
> > > track of the GSI even though it doesn't need it itself.  And the
> > > current patch exposes it on all arches, even non-ACPI ones or when
> > > ACPI is disabled (easily fixable).
> > > 
> > > We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
> > > we don't know the GSI unless a Dom0 driver has claimed the device and
> > > called pci_enable_device() for it, which seems like it might not be
> > > desirable.
> > 
> > I think that's always the case, as on dom0 devices to be passed
> > through are handled by pciback which does enable them.
> 
> pcistub_init_device() labels the pci_enable_device() as a "HACK"
> related to determining the IRQ, which makes me think there's not
> really a requirement for the device to be *enabled* (BAR decoding
> enabled) by dom0.

No, there's no need for memory decoding to be enabled for getting the
GSI from the ACPI method I would assume.  I'm confused by that
pci_enable_device() call.  Is maybe the purpose to make sure the
device is powered up so that reading the PCI header Interrupt Line and
Pin fields returns valid values?  No idea whether reading those fields
requires the device to be in certain (active) power states.

> > I agree it might be best to not tie exposing the node to
> > pci_enable_device() having been called.  Is _PRT only evaluated as
> > part of acpi_pci_irq_enable()? (or pci_enable_device()).
> 
> Yes.  AFAICT, acpi_pci_irq_enable() is the only path that evaluates
> _PRT (except for a debugger interface).  I don't think it *needs* to
> be that way, and the fact that we do it per-device like that means we
> evaluate _PRT many times even though I think the results never change.
> 
> I could imagine evaluating _PRT once as part of enumerating a PCI host
> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
> comment), but that looks like a fair bit of work to implement.  And of
> course it doesn't really affect the question of how to expose the
> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
> a possible location.

So you suggest exposing the GSI as part of the PCI host bridge?  I'm
afraid I'm not following how we could then map PCI SBDFs from devices
to their assigned GSI.

Thanks, Roger.

Jiqian Chen March 1, 2024, 7:57 a.m. UTC | #16

Hi Bjorn,
Looking forward to getting your more inputs and suggestions.
It seems /sys/bus/acpi/devices/PNP0A03:00/ is not a good place to create gsi sysfs.

On 2024/2/15 16:37, Roger Pau Monné wrote:
> On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
>> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
>>> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
>>>> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
>>>>> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
>>>>>> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
>>>>>>> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
>>>>>>>> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>> On 2024/1/24 00:02, Bjorn Helgaas wrote:
>>>>>>>>>>>> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>>>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>>>>>>>>>>>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>>>>>>>>>>>>> There is a need for some scenarios to use gsi sysfs.
>>>>>>>>>>>>>>> For example, when xen passthrough a device to dumU, it will
>>>>>>>>>>>>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>>>>>>>>>>>>> number.
>>>>>>>>>>>>>>> So, add gsi sysfs for that and for other potential scenarios.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know enough about Xen to know why it needs the GSI in
>>>>>>>>>>>>>> userspace.  Is this passthrough brand new functionality that can't be
>>>>>>>>>>>>>> done today because we don't expose the GSI yet?
>>>>>>>>>>
>>>>>>>>>> I assume this must be new functionality, i.e., this kind of
>>>>>>>>>> passthrough does not work today, right?
>>>>>>>>>>
>>>>>>>>>>>>> has ACPI support and is responsible for detecting and controlling
>>>>>>>>>>>>> the hardware, also it performs privileged operations such as the
>>>>>>>>>>>>> creation of normal (unprivileged) domains DomUs. When we give to a
>>>>>>>>>>>>> DomU direct access to a device, we need also to route the physical
>>>>>>>>>>>>> interrupts to the DomU. In order to do so Xen needs to setup and map
>>>>>>>>>>>>> the interrupts appropriately.
>>>>>>>>>>>>
>>>>>>>>>>>> What kernel interfaces are used for this setup and mapping?
>>>>>>>>>>>
>>>>>>>>>>> For passthrough devices, the setup and mapping of routing physical
>>>>>>>>>>> interrupts to DomU are done on Xen hypervisor side, hypervisor only
>>>>>>>>>>> need userspace to provide the GSI info, see Xen code:
>>>>>>>>>>> xc_physdev_map_pirq require GSI and then will call hypercall to pass
>>>>>>>>>>> GSI into hypervisor and then hypervisor will do the mapping and
>>>>>>>>>>> routing, kernel doesn't do the setup and mapping.
>>>>>>>>>>
>>>>>>>>>> So we have to expose the GSI to userspace not because userspace itself
>>>>>>>>>> uses it, but so userspace can turn around and pass it back into the
>>>>>>>>>> kernel?
>>>>>>>>>
>>>>>>>>> No, the point is to pass it back to Xen, which doesn't know the
>>>>>>>>> mapping between GSIs and PCI devices because it can't execute the ACPI
>>>>>>>>> AML resource methods that provide such information.
>>>>>>>>>
>>>>>>>>> The (Linux) kernel is just a proxy that forwards the hypercalls from
>>>>>>>>> user-space tools into Xen.
>>>>>>>>
>>>>>>>> But I guess Xen knows how to interpret a GSI even though it doesn't
>>>>>>>> have access to AML?
>>>>>>>
>>>>>>> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
>>>>>>> configure the RTE as requested.
>>>>>>
>>>>>> IIUC, mapping a GSI to an IO-APIC pin requires information from the
>>>>>> MADT.  So I guess Xen does use the static ACPI tables, but not the AML
>>>>>> _PRT methods that would connect a GSI with a PCI device?
>>>>>
>>>>> Yes, Xen can parse the static tables, and knows the base GSI of
>>>>> IO-APICs from the MADT.
>>>>>
>>>>>> I guess this means Xen would not be able to deal with _MAT methods,
>>>>>> which also contains MADT entries?  I don't know the implications of
>>>>>> this -- maybe it means Xen might not be able to use with hot-added
>>>>>> devices?
>>>>>
>>>>> It's my understanding _MAT will only be present on some very specific
>>>>> devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
>>>>> IO-APICs, but hotplug of CPUs should in principle be supported with
>>>>> cooperation from the control domain OS (albeit it's not something that
>>>>> we tests on our CI).  I don't expect however that a CPU object _MAT
>>>>> method will return IO APIC entries.
>>>>>
>>>>>> The tables (including DSDT and SSDTS that contain the AML) are exposed
>>>>>> to userspace via /sys/firmware/acpi/tables/, but of course that
>>>>>> doesn't mean Xen knows how to interpret the AML, and even if it did,
>>>>>> Xen probably wouldn't be able to *evaluate* it since that could
>>>>>> conflict with the host kernel's use of AML.
>>>>>
>>>>> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
>>>>> in our context).
>>>>>
>>>>> Getting back to our context though, what would be a suitable place for
>>>>> exposing the GSI assigned to each device?
>>>>
>>>> IIUC, the Xen hypervisor:
>>>>
>>>>   - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
>>>>     something running on the Dom0 kernel) to find the physical base
>>>>     address and GSI base, e.g., from I/O APIC, I/O SAPIC.
>>>
>>> No, Xen parses the MADT directly from memory, before stating dom0.
>>> That's a static table so it's fine for Xen to parse it and obtain the
>>> I/O APIC GSI base.
>>
>> It's an interesting split to consume ACPI static tables directly but
>> put the AML interpreter elsewhere.
> 
> Well, static tables can be consumed by Xen, because thye don't require
> an AML parser (obviously), and parsing them doesn't have any
> side-effects that would prevent dom0 from being the OSPM (no methods
> or similar are evaluated).
> 
>> I doubt the ACPI spec envisioned
>> that, which makes me wonder what other things we could trip over, but
>> that's just a tangent.
> 
> Indeed, ACPI is not be best interface for the Xen/dom0 split model.
> 
>>>>   - Needs the GSI to locate the APIC and pin within the APIC.  The
>>>>     Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
>>>>     learn the PCI device -> GSI mapping.
>>>
>>> Yes, Xen doesn't know the PCI device -> GSI mapping.  Dom0 needs to
>>> parse the ACPI methods and signal Xen to configure a GSI with a
>>> given trigger and polarity.
>>>
>>>>   - Has direct access to the APIC physical base address to program the
>>>>     Redirection Table.
>>>
>>> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
>>> accessible by dom0.
>>>
>>>> The patch seems a little messy to me because the PCI core has to keep
>>>> track of the GSI even though it doesn't need it itself.  And the
>>>> current patch exposes it on all arches, even non-ACPI ones or when
>>>> ACPI is disabled (easily fixable).
>>>>
>>>> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
>>>> we don't know the GSI unless a Dom0 driver has claimed the device and
>>>> called pci_enable_device() for it, which seems like it might not be
>>>> desirable.
>>>
>>> I think that's always the case, as on dom0 devices to be passed
>>> through are handled by pciback which does enable them.
>>
>> pcistub_init_device() labels the pci_enable_device() as a "HACK"
>> related to determining the IRQ, which makes me think there's not
>> really a requirement for the device to be *enabled* (BAR decoding
>> enabled) by dom0.
> 
> No, there's no need for memory decoding to be enabled for getting the
> GSI from the ACPI method I would assume.  I'm confused by that
> pci_enable_device() call.  Is maybe the purpose to make sure the
> device is powered up so that reading the PCI header Interrupt Line and
> Pin fields returns valid values?  No idea whether reading those fields
> requires the device to be in certain (active) power states.
> 
>>> I agree it might be best to not tie exposing the node to
>>> pci_enable_device() having been called.  Is _PRT only evaluated as
>>> part of acpi_pci_irq_enable()? (or pci_enable_device()).
>>
>> Yes.  AFAICT, acpi_pci_irq_enable() is the only path that evaluates
>> _PRT (except for a debugger interface).  I don't think it *needs* to
>> be that way, and the fact that we do it per-device like that means we
>> evaluate _PRT many times even though I think the results never change.
>>
>> I could imagine evaluating _PRT once as part of enumerating a PCI host
>> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
>> comment), but that looks like a fair bit of work to implement.  And of
>> course it doesn't really affect the question of how to expose the
>> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
>> a possible location.
> 
> So you suggest exposing the GSI as part of the PCI host bridge?  I'm
> afraid I'm not following how we could then map PCI SBDFs from devices
> to their assigned GSI.
> 
> Thanks, Roger.

Jiqian Chen April 8, 2024, 6:42 a.m. UTC | #17

Hi Bjorn,
It has been almost two months since we received your reply last time.
This series are blocking on this patch, since there are patches on Xen and Qemu side depending on it.
Do you still have any confusion about this patch? Or do you have other suggestions?
If no, may I get your Reviewed-by?

On 2024/3/1 15:57, Chen, Jiqian wrote:
> Hi Bjorn,
> Looking forward to getting your more inputs and suggestions.
> It seems /sys/bus/acpi/devices/PNP0A03:00/ is not a good place to create gsi sysfs.
> 
> On 2024/2/15 16:37, Roger Pau Monné wrote:
>> On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
>>> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
>>>> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
>>>>> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
>>>>>> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
>>>>>>> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
>>>>>>>> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
>>>>>>>>> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
>>>>>>>>>> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
>>>>>>>>>>> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>>> On 2024/1/24 00:02, Bjorn Helgaas wrote:
>>>>>>>>>>>>> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>>>>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>>>>>>>>>>>>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>>>>>>>>>>>>>> There is a need for some scenarios to use gsi sysfs.
>>>>>>>>>>>>>>>> For example, when xen passthrough a device to dumU, it will
>>>>>>>>>>>>>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>>>>>>>>>>>>>> number.
>>>>>>>>>>>>>>>> So, add gsi sysfs for that and for other potential scenarios.
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't know enough about Xen to know why it needs the GSI in
>>>>>>>>>>>>>>> userspace.  Is this passthrough brand new functionality that can't be
>>>>>>>>>>>>>>> done today because we don't expose the GSI yet?
>>>>>>>>>>>
>>>>>>>>>>> I assume this must be new functionality, i.e., this kind of
>>>>>>>>>>> passthrough does not work today, right?
>>>>>>>>>>>
>>>>>>>>>>>>>> has ACPI support and is responsible for detecting and controlling
>>>>>>>>>>>>>> the hardware, also it performs privileged operations such as the
>>>>>>>>>>>>>> creation of normal (unprivileged) domains DomUs. When we give to a
>>>>>>>>>>>>>> DomU direct access to a device, we need also to route the physical
>>>>>>>>>>>>>> interrupts to the DomU. In order to do so Xen needs to setup and map
>>>>>>>>>>>>>> the interrupts appropriately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What kernel interfaces are used for this setup and mapping?
>>>>>>>>>>>>
>>>>>>>>>>>> For passthrough devices, the setup and mapping of routing physical
>>>>>>>>>>>> interrupts to DomU are done on Xen hypervisor side, hypervisor only
>>>>>>>>>>>> need userspace to provide the GSI info, see Xen code:
>>>>>>>>>>>> xc_physdev_map_pirq require GSI and then will call hypercall to pass
>>>>>>>>>>>> GSI into hypervisor and then hypervisor will do the mapping and
>>>>>>>>>>>> routing, kernel doesn't do the setup and mapping.
>>>>>>>>>>>
>>>>>>>>>>> So we have to expose the GSI to userspace not because userspace itself
>>>>>>>>>>> uses it, but so userspace can turn around and pass it back into the
>>>>>>>>>>> kernel?
>>>>>>>>>>
>>>>>>>>>> No, the point is to pass it back to Xen, which doesn't know the
>>>>>>>>>> mapping between GSIs and PCI devices because it can't execute the ACPI
>>>>>>>>>> AML resource methods that provide such information.
>>>>>>>>>>
>>>>>>>>>> The (Linux) kernel is just a proxy that forwards the hypercalls from
>>>>>>>>>> user-space tools into Xen.
>>>>>>>>>
>>>>>>>>> But I guess Xen knows how to interpret a GSI even though it doesn't
>>>>>>>>> have access to AML?
>>>>>>>>
>>>>>>>> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
>>>>>>>> configure the RTE as requested.
>>>>>>>
>>>>>>> IIUC, mapping a GSI to an IO-APIC pin requires information from the
>>>>>>> MADT.  So I guess Xen does use the static ACPI tables, but not the AML
>>>>>>> _PRT methods that would connect a GSI with a PCI device?
>>>>>>
>>>>>> Yes, Xen can parse the static tables, and knows the base GSI of
>>>>>> IO-APICs from the MADT.
>>>>>>
>>>>>>> I guess this means Xen would not be able to deal with _MAT methods,
>>>>>>> which also contains MADT entries?  I don't know the implications of
>>>>>>> this -- maybe it means Xen might not be able to use with hot-added
>>>>>>> devices?
>>>>>>
>>>>>> It's my understanding _MAT will only be present on some very specific
>>>>>> devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
>>>>>> IO-APICs, but hotplug of CPUs should in principle be supported with
>>>>>> cooperation from the control domain OS (albeit it's not something that
>>>>>> we tests on our CI).  I don't expect however that a CPU object _MAT
>>>>>> method will return IO APIC entries.
>>>>>>
>>>>>>> The tables (including DSDT and SSDTS that contain the AML) are exposed
>>>>>>> to userspace via /sys/firmware/acpi/tables/, but of course that
>>>>>>> doesn't mean Xen knows how to interpret the AML, and even if it did,
>>>>>>> Xen probably wouldn't be able to *evaluate* it since that could
>>>>>>> conflict with the host kernel's use of AML.
>>>>>>
>>>>>> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
>>>>>> in our context).
>>>>>>
>>>>>> Getting back to our context though, what would be a suitable place for
>>>>>> exposing the GSI assigned to each device?
>>>>>
>>>>> IIUC, the Xen hypervisor:
>>>>>
>>>>>   - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
>>>>>     something running on the Dom0 kernel) to find the physical base
>>>>>     address and GSI base, e.g., from I/O APIC, I/O SAPIC.
>>>>
>>>> No, Xen parses the MADT directly from memory, before stating dom0.
>>>> That's a static table so it's fine for Xen to parse it and obtain the
>>>> I/O APIC GSI base.
>>>
>>> It's an interesting split to consume ACPI static tables directly but
>>> put the AML interpreter elsewhere.
>>
>> Well, static tables can be consumed by Xen, because thye don't require
>> an AML parser (obviously), and parsing them doesn't have any
>> side-effects that would prevent dom0 from being the OSPM (no methods
>> or similar are evaluated).
>>
>>> I doubt the ACPI spec envisioned
>>> that, which makes me wonder what other things we could trip over, but
>>> that's just a tangent.
>>
>> Indeed, ACPI is not be best interface for the Xen/dom0 split model.
>>
>>>>>   - Needs the GSI to locate the APIC and pin within the APIC.  The
>>>>>     Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
>>>>>     learn the PCI device -> GSI mapping.
>>>>
>>>> Yes, Xen doesn't know the PCI device -> GSI mapping.  Dom0 needs to
>>>> parse the ACPI methods and signal Xen to configure a GSI with a
>>>> given trigger and polarity.
>>>>
>>>>>   - Has direct access to the APIC physical base address to program the
>>>>>     Redirection Table.
>>>>
>>>> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
>>>> accessible by dom0.
>>>>
>>>>> The patch seems a little messy to me because the PCI core has to keep
>>>>> track of the GSI even though it doesn't need it itself.  And the
>>>>> current patch exposes it on all arches, even non-ACPI ones or when
>>>>> ACPI is disabled (easily fixable).
>>>>>
>>>>> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
>>>>> we don't know the GSI unless a Dom0 driver has claimed the device and
>>>>> called pci_enable_device() for it, which seems like it might not be
>>>>> desirable.
>>>>
>>>> I think that's always the case, as on dom0 devices to be passed
>>>> through are handled by pciback which does enable them.
>>>
>>> pcistub_init_device() labels the pci_enable_device() as a "HACK"
>>> related to determining the IRQ, which makes me think there's not
>>> really a requirement for the device to be *enabled* (BAR decoding
>>> enabled) by dom0.
>>
>> No, there's no need for memory decoding to be enabled for getting the
>> GSI from the ACPI method I would assume.  I'm confused by that
>> pci_enable_device() call.  Is maybe the purpose to make sure the
>> device is powered up so that reading the PCI header Interrupt Line and
>> Pin fields returns valid values?  No idea whether reading those fields
>> requires the device to be in certain (active) power states.
>>
>>>> I agree it might be best to not tie exposing the node to
>>>> pci_enable_device() having been called.  Is _PRT only evaluated as
>>>> part of acpi_pci_irq_enable()? (or pci_enable_device()).
>>>
>>> Yes.  AFAICT, acpi_pci_irq_enable() is the only path that evaluates
>>> _PRT (except for a debugger interface).  I don't think it *needs* to
>>> be that way, and the fact that we do it per-device like that means we
>>> evaluate _PRT many times even though I think the results never change.
>>>
>>> I could imagine evaluating _PRT once as part of enumerating a PCI host
>>> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
>>> comment), but that looks like a fair bit of work to implement.  And of
>>> course it doesn't really affect the question of how to expose the
>>> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
>>> a possible location.
>>
>> So you suggest exposing the GSI as part of the PCI host bridge?  I'm
>> afraid I'm not following how we could then map PCI SBDFs from devices
>> to their assigned GSI.
>>
>> Thanks, Roger.
>

Bjorn Helgaas April 9, 2024, 8:03 p.m. UTC | #18

[+to Rafael]

On Mon, Apr 08, 2024 at 06:42:31AM +0000, Chen, Jiqian wrote:
> Hi Bjorn,
> It has been almost two months since we received your reply last time.
> This series are blocking on this patch, since there are patches on Xen and Qemu side depending on it.
> Do you still have any confusion about this patch? Or do you have other suggestions?
> If no, may I get your Reviewed-by?

  - This is ACPI-specific, but exposes /sys/.../gsi for all systems,
    including non-ACPI systems.  I don't think we want that.

  - Do you care about similar Xen configurations on non-ACPI systems?
    If so, maybe the commit log could mention how you learn about PCI
    INTx routing on them in case there's some way to unify this in the
    future.

  - Missing an update to Documentation/ABI/.

  - A nit: I asked about s/dumU/DomU/ in the commit log earlier,
    haven't seen any response.

  - Commit log mentions "and for other potential scenarios."  It's
    another nit, but unless you have another concrete use for this,
    that phrase is meaningless hand waving and should be dropped.

  - A _PRT entry may refer directly to a GSI or to an interrupt link
    device (PNP0C0F) that can be routed to one of several GSIs:

      ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11 12 14 15)
 
    I don't think the kernel reconfigures interrupt links after
    enumeration, but if they are reconfigured at run-time (via _SRS),
    the cached GSI will be wrong.  I think setpnp could do this, but
    that tool is dead.  So maybe this isn't a concern anymore, but I
    *would* like to get Rafael's take on this.  If we don't care
    enough, I think we should mention it in the commit log just in
    case.

Bjorn

[RFC,KERNEL,v4,3/3] PCI/sysfs: Add gsi sysfs for pci_dev

Commit Message

Comments

Patch