
[V4,1/3] driver core: mark device as irq affinity managed if any irq is managed

Message ID 20210715120844.636968-2-ming.lei@redhat.com (mailing list archive)
State New, archived
Series blk-mq: fix blk_mq_alloc_request_hctx

Commit Message

Ming Lei July 15, 2021, 12:08 p.m. UTC
irq vector allocation with managed affinity may be used by driver, and
blk-mq needs this info because managed irq will be shutdown when all
CPUs in the affinity mask are offline.

The info of using managed irq is often produced by drivers(pci subsystem,
platform device, ...), and it is consumed by blk-mq, so different subsystems
are involved in this info flow

Address this issue by adding one field of .irq_affinity_managed into
'struct device'.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/base/platform.c | 7 +++++++
 drivers/pci/msi.c       | 3 +++
 include/linux/device.h  | 1 +
 3 files changed, 11 insertions(+)

Comments

Greg Kroah-Hartman July 15, 2021, 12:40 p.m. UTC | #1
On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>  #ifdef CONFIG_DMA_OPS_BYPASS
>  	bool			dma_ops_bypass : 1;
>  #endif
> +	bool			irq_affinity_managed : 1;
>  };

No documentation for this new field?

:(
Ming Lei July 16, 2021, 2:17 a.m. UTC | #2
On Thu, Jul 15, 2021 at 02:40:55PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -569,6 +569,7 @@ struct device {
> >  #ifdef CONFIG_DMA_OPS_BYPASS
> >  	bool			dma_ops_bypass : 1;
> >  #endif
> > +	bool			irq_affinity_managed : 1;
> >  };
> 
> No documentation for this new field?

OK, will fix it in next version.
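
For example, something along these lines in the struct device kernel-doc
block in include/linux/device.h (just a sketch of the idea, not the final
wording):

 * @irq_affinity_managed: %true if any of the device's irq vectors was
 *		allocated with a kernel-managed affinity mask (e.g. via
 *		PCI_IRQ_AFFINITY).  Such irqs are shut down by the irq core
 *		when all CPUs in their affinity mask go offline, which irq
 *		consumers like blk-mq need to know about.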


Thanks,
Ming
Bjorn Helgaas July 16, 2021, 8:01 p.m. UTC | #3
On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> irq vector allocation with managed affinity may be used by driver, and
> blk-mq needs this info because managed irq will be shutdown when all
> CPUs in the affinity mask are offline.
> 
> The info of using managed irq is often produced by drivers(pci subsystem,

Add space between "drivers" and "(".
s/pci/PCI/

Does this "managed IRQ" (or "managed affinity", not sure what the
correct terminology is here) have something to do with devm?

> platform device, ...), and it is consumed by blk-mq, so different subsystems
> are involved in this info flow

Add period at end of sentence.

> Address this issue by adding one field of .irq_affinity_managed into
> 'struct device'.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/base/platform.c | 7 +++++++
>  drivers/pci/msi.c       | 3 +++
>  include/linux/device.h  | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 8640578f45e9..d28cb91d5cf9 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
>  				ptr->irq[i], ret);
>  			goto err_free_desc;
>  		}
> +
> +		/*
> +		 * mark the device as irq affinity managed if any irq affinity
> +		 * descriptor is managed
> +		 */
> +		if (desc[i].is_managed)
> +			dev->dev.irq_affinity_managed = true;
>  	}
>  
>  	devres_add(&dev->dev, ptr);
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3d6db20d1b2b..7ddec90b711d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> +		dev->dev.irq_affinity_managed = true;

This is really opaque to me.  I can't tell what the connection between
PCI_IRQ_AFFINITY and irq_affinity_managed is.

AFAICT the only place irq_affinity_managed is ultimately used is
blk_mq_hctx_notify_offline(), and there's no obvious connection
between that and this code.

>  	} else {
>  		if (WARN_ON(affd))
>  			affd = NULL;
> @@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  			return nvecs;
>  	}
>  
> +	dev->dev.irq_affinity_managed = false;
> +
>  	/* use legacy IRQ if allowed */
>  	if (flags & PCI_IRQ_LEGACY) {
>  		if (min_vecs == 1 && dev->irq) {
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 59940f1744c1..9ec6e671279e 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>  #ifdef CONFIG_DMA_OPS_BYPASS
>  	bool			dma_ops_bypass : 1;
>  #endif
> +	bool			irq_affinity_managed : 1;
>  };
>  
>  /**
> -- 
> 2.31.1
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
Ming Lei July 17, 2021, 9:30 a.m. UTC | #4
On Fri, Jul 16, 2021 at 03:01:54PM -0500, Bjorn Helgaas wrote:
> On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > irq vector allocation with managed affinity may be used by driver, and
> > blk-mq needs this info because managed irq will be shutdown when all
> > CPUs in the affinity mask are offline.
> > 
> > The info of using managed irq is often produced by drivers(pci subsystem,
> 
> Add space between "drivers" and "(".
> s/pci/PCI/

OK.

> 
> Does this "managed IRQ" (or "managed affinity", not sure what the
> correct terminology is here) have something to do with devm?
> 
> > platform device, ...), and it is consumed by blk-mq, so different subsystems
> > are involved in this info flow
> 
> Add period at end of sentence.

OK.

> 
> > Address this issue by adding one field of .irq_affinity_managed into
> > 'struct device'.
> > 
> > Suggested-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/base/platform.c | 7 +++++++
> >  drivers/pci/msi.c       | 3 +++
> >  include/linux/device.h  | 1 +
> >  3 files changed, 11 insertions(+)
> > 
> > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > index 8640578f45e9..d28cb91d5cf9 100644
> > --- a/drivers/base/platform.c
> > +++ b/drivers/base/platform.c
> > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
> >  				ptr->irq[i], ret);
> >  			goto err_free_desc;
> >  		}
> > +
> > +		/*
> > +		 * mark the device as irq affinity managed if any irq affinity
> > +		 * descriptor is managed
> > +		 */
> > +		if (desc[i].is_managed)
> > +			dev->dev.irq_affinity_managed = true;
> >  	}
> >  
> >  	devres_add(&dev->dev, ptr);
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 3d6db20d1b2b..7ddec90b711d 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> >  	if (flags & PCI_IRQ_AFFINITY) {
> >  		if (!affd)
> >  			affd = &msi_default_affd;
> > +		dev->dev.irq_affinity_managed = true;
> 
> This is really opaque to me.  I can't tell what the connection between
> PCI_IRQ_AFFINITY and irq_affinity_managed is.

Comment for PCI_IRQ_AFFINITY is 'Auto-assign affinity',
'irq_affinity_managed' basically means that irq's affinity is managed by
kernel.

What blk-mq needs is exactly if PCI_IRQ_AFFINITY is applied when
allocating irq vectors. When PCI_IRQ_AFFINITY is used, genirq will
shutdown the irq when all CPUs in the assigned affinity are offline,
then blk-mq has to drain all in-flight IOs which will be completed
via this irq and prevent new IO. That is the connection.
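
To make the consumer side concrete, here is a rough sketch of how the flag
could gate hctx deactivation on CPU offline. This is illustrative only and
not patch 3/3 of this series; use_managed_irq and
blk_mq_hctx_drain_inflight() are placeholder names:

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);
	struct blk_mq_queue_map *qmap =
		&hctx->queue->tag_set->map[HCTX_TYPE_DEFAULT];

	/*
	 * A non-managed irq stays up when its CPUs go offline, so the hctx
	 * can keep completing requests and nothing needs to be drained.
	 */
	if (!qmap->use_managed_irq)
		return 0;

	/*
	 * Managed irq: genirq will shut the vector down, so stop new
	 * requests on this hctx and wait for the in-flight ones first.
	 */
	blk_mq_hctx_drain_inflight(hctx);
	return 0;
}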

Or you think 'irq_affinity_managed' isn't named well?

> 
> AFAICT the only place irq_affinity_managed is ultimately used is
> blk_mq_hctx_notify_offline(), and there's no obvious connection
> between that and this code.

I believe the connection is described in comment.


Thanks, 
Ming
John Garry July 19, 2021, 7:51 a.m. UTC | #5
On 15/07/2021 13:08, Ming Lei wrote:
> irq vector allocation with managed affinity may be used by driver, and
> blk-mq needs this info because managed irq will be shutdown when all
> CPUs in the affinity mask are offline.
> 
> The info of using managed irq is often produced by drivers(pci subsystem,
> platform device, ...), and it is consumed by blk-mq, so different subsystems
> are involved in this info flow
> 
> Address this issue by adding one field of .irq_affinity_managed into
> 'struct device'.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Did you consider that for PCI device we effectively have this info already:

bool dev_has_managed_msi_irq(struct device *dev)
{
	struct msi_desc *desc;

	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
		if (desc->affinity && desc->affinity->is_managed)
			return true;
	}

	return false;
}

Thanks,
John

> ---
>   drivers/base/platform.c | 7 +++++++
>   drivers/pci/msi.c       | 3 +++
>   include/linux/device.h  | 1 +
>   3 files changed, 11 insertions(+)
> 
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 8640578f45e9..d28cb91d5cf9 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
>   				ptr->irq[i], ret);
>   			goto err_free_desc;
>   		}
> +
> +		/*
> +		 * mark the device as irq affinity managed if any irq affinity
> +		 * descriptor is managed
> +		 */
> +		if (desc[i].is_managed)
> +			dev->dev.irq_affinity_managed = true;
>   	}
>   
>   	devres_add(&dev->dev, ptr);
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3d6db20d1b2b..7ddec90b711d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>   	if (flags & PCI_IRQ_AFFINITY) {
>   		if (!affd)
>   			affd = &msi_default_affd;
> +		dev->dev.irq_affinity_managed = true;
>   	} else {
>   		if (WARN_ON(affd))
>   			affd = NULL;
> @@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>   			return nvecs;
>   	}
>   
> +	dev->dev.irq_affinity_managed = false;
> +
>   	/* use legacy IRQ if allowed */
>   	if (flags & PCI_IRQ_LEGACY) {
>   		if (min_vecs == 1 && dev->irq) {
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 59940f1744c1..9ec6e671279e 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>   #ifdef CONFIG_DMA_OPS_BYPASS
>   	bool			dma_ops_bypass : 1;
>   #endif
> +	bool			irq_affinity_managed : 1;
>   };
>   
>   /**
>
Christoph Hellwig July 19, 2021, 9:44 a.m. UTC | #6
On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>> Address this issue by adding one field of .irq_affinity_managed into
>> 'struct device'.
>>
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Did you consider that for PCI device we effectively have this info already:
>
> bool dev_has_managed_msi_irq(struct device *dev)
> {
> 	struct msi_desc *desc;
>
> 	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
> 		if (desc->affinity && desc->affinity->is_managed)
> 			return true;
> 	}
>
> 	return false;

Just walking the list seems fine to me given that this is not a
performance critical path.  But what are the locking implications?

Also does the above imply this won't work for your platform MSI case?
John Garry July 19, 2021, 10:39 a.m. UTC | #7
On 19/07/2021 10:44, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>>> Address this issue by adding one field of .irq_affinity_managed into
>>> 'struct device'.
>>>
>>> Suggested-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> Did you consider that for PCI device we effectively have this info already:
>>
>> bool dev_has_managed_msi_irq(struct device *dev)
>> {
>> 	struct msi_desc *desc;
>>
>> 	list_for_each_entry(desc, dev_to_msi_list(dev), list)

I just noticed for_each_msi_entry(), which is the same


>> 		if (desc->affinity && desc->affinity->is_managed)
>> 			return true;
>> 	}
>>
>> 	return false;
> 
> Just walking the list seems fine to me given that this is not a
> performance critical path.  But what are the locking implications?

Since it would be used for sequential setup code, I didn't think any 
locking was required. But would need to consider where that function 
lived and whether it's public.
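
If it did end up public, a minimal sketch could be an exported helper next
to the other msi_desc walkers, e.g. in kernel/irq/msi.c (placement and
export are assumptions, nothing settled):

/* Return true if any MSI descriptor of @dev has a managed affinity mask. */
bool dev_has_managed_msi_irq(struct device *dev)
{
	struct msi_desc *desc;

	for_each_msi_entry(desc, dev) {
		if (desc->affinity && desc->affinity->is_managed)
			return true;
	}

	return false;
}
EXPORT_SYMBOL_GPL(dev_has_managed_msi_irq);

plus a declaration in include/linux/msi.h.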

> 
> Also does the above imply this won't work for your platform MSI case?
> .
> 

Right. I think that it may be possible to reach into the platform msi 
descriptors to get this info, but I am not sure it's worth it. There is 
only 1x user there and there is no generic .map_queues function, so 
could set the flag directly:

int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev 
*pdev,
  		for_each_cpu(cpu, mask)
  			qmap->mq_map[cpu] = qmap->queue_offset + queue;
  	}
+	qmap->use_managed_irq = dev_has_managed_msi_irq(&pdev->dev);
}

--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3563,6 +3563,8 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
                        qmap->mq_map[cpu] = qmap->queue_offset + queue;
        }

+       qmap->use_managed_irq = 1;
+
        return 0;
}
Ming Lei July 20, 2021, 2:38 a.m. UTC | #8
On Mon, Jul 19, 2021 at 11:39:53AM +0100, John Garry wrote:
> On 19/07/2021 10:44, Christoph Hellwig wrote:
> > On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
> > > > Address this issue by adding one field of .irq_affinity_managed into
> > > > 'struct device'.
> > > > 
> > > > Suggested-by: Christoph Hellwig <hch@lst.de>
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > 
> > > Did you consider that for PCI device we effectively have this info already:
> > > 
> > > bool dev_has_managed_msi_irq(struct device *dev)
> > > {
> > > 	struct msi_desc *desc;
> > > 
> > > 	list_for_each_entry(desc, dev_to_msi_list(dev), list)
> 
> I just noticed for_each_msi_entry(), which is the same
> 
> 
> > > 		if (desc->affinity && desc->affinity->is_managed)
> > > 			return true;
> > > 	}
> > > 
> > > 	return false;
> > 
> > Just walking the list seems fine to me given that this is not a
> > performance critical path.  But what are the locking implications?
> 
> Since it would be used for sequential setup code, I didn't think any locking
> was required. But would need to consider where that function lived and
> whether it's public.

Yeah, the allocated irq vectors should be live when running map queues.

> 
> > 
> > Also does the above imply this won't work for your platform MSI case?
> > .
> > 
> 
> Right. I think that it may be possible to reach into the platform msi
> descriptors to get this info, but I am not sure it's worth it. There is only
> 1x user there and there is no generic .map_queues function, so could set the
> flag directly:
> 
> int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev
> *pdev,
>  		for_each_cpu(cpu, mask)
>  			qmap->mq_map[cpu] = qmap->queue_offset + queue;
>  	}
> +	qmap->use_managed_irq = dev_has_managed_msi_irq(&pdev->dev);
> }
> 
> --- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> @@ -3563,6 +3563,8 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
>                        qmap->mq_map[cpu] = qmap->queue_offset + queue;
>        }
> 
> +       qmap->use_managed_irq = 1;
> +
>        return 0;

virtio can be populated via a platform device too, but managed irq affinity
isn't used there, so it seems dev_has_managed_msi_irq() is fine.


Thanks,
Ming
Bjorn Helgaas July 21, 2021, 12:30 a.m. UTC | #9
On Sat, Jul 17, 2021 at 05:30:43PM +0800, Ming Lei wrote:
> On Fri, Jul 16, 2021 at 03:01:54PM -0500, Bjorn Helgaas wrote:
> > On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > > irq vector allocation with managed affinity may be used by driver, and
> > > blk-mq needs this info because managed irq will be shutdown when all
> > > CPUs in the affinity mask are offline.
> > > 
> > > The info of using managed irq is often produced by drivers(pci subsystem,
> > 
> > Add space between "drivers" and "(".
> > s/pci/PCI/
> 
> OK.
> 
> > Does this "managed IRQ" (or "managed affinity", not sure what the
> > correct terminology is here) have something to do with devm?
> > 
> > > platform device, ...), and it is consumed by blk-mq, so different subsystems
> > > are involved in this info flow
> > 
> > Add period at end of sentence.
> 
> OK.
> 
> > > Address this issue by adding one field of .irq_affinity_managed into
> > > 'struct device'.
> > > 
> > > Suggested-by: Christoph Hellwig <hch@lst.de>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/base/platform.c | 7 +++++++
> > >  drivers/pci/msi.c       | 3 +++
> > >  include/linux/device.h  | 1 +
> > >  3 files changed, 11 insertions(+)
> > > 
> > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > > index 8640578f45e9..d28cb91d5cf9 100644
> > > --- a/drivers/base/platform.c
> > > +++ b/drivers/base/platform.c
> > > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
> > >  				ptr->irq[i], ret);
> > >  			goto err_free_desc;
> > >  		}
> > > +
> > > +		/*
> > > +		 * mark the device as irq affinity managed if any irq affinity
> > > +		 * descriptor is managed
> > > +		 */
> > > +		if (desc[i].is_managed)
> > > +			dev->dev.irq_affinity_managed = true;
> > >  	}
> > >  
> > >  	devres_add(&dev->dev, ptr);
> > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > index 3d6db20d1b2b..7ddec90b711d 100644
> > > --- a/drivers/pci/msi.c
> > > +++ b/drivers/pci/msi.c
> > > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> > >  	if (flags & PCI_IRQ_AFFINITY) {
> > >  		if (!affd)
> > >  			affd = &msi_default_affd;
> > > +		dev->dev.irq_affinity_managed = true;
> > 
> > This is really opaque to me.  I can't tell what the connection between
> > PCI_IRQ_AFFINITY and irq_affinity_managed is.
> 
> Comment for PCI_IRQ_AFFINITY is 'Auto-assign affinity',
> 'irq_affinity_managed' basically means that irq's affinity is managed by
> kernel.
> 
> What blk-mq needs is exactly if PCI_IRQ_AFFINITY is applied when
> allocating irq vectors. When PCI_IRQ_AFFINITY is used, genirq will
> shutdown the irq when all CPUs in the assigned affinity are offline,
> then blk-mq has to drain all in-flight IOs which will be completed
> via this irq and prevent new IO. That is the connection.
> 
> Or you think 'irq_affinity_managed' isn't named well?

I've been looking at "devm" management where there is a concept of
devm-managed IRQs, and you mentioned "managed IRQ" in the commit log.
But IIUC this is a completely different sort of management.

> > AFAICT the only place irq_affinity_managed is ultimately used is
> > blk_mq_hctx_notify_offline(), and there's no obvious connection
> > between that and this code.
> 
> I believe the connection is described in comment.

You mean the comment that says "hctx needn't to be deactivated in case
managed irq isn't used"?  Sorry, that really doesn't explain to me why
pci_alloc_irq_vectors_affinity() needs to set irq_affinity_managed.
There's just a lot of magic here that cannot be deduced by reading the
code.

Nit: s/needn't to be/needn't be/ in that comment.  Or maybe even
"Deactivate hctx only when managed IRQ is used"?  I still have no idea
what that means, but at least it's easier to read :)  Or maybe this is
actually "draining a queue" instead of "deactivating"?

Bjorn
Thomas Gleixner July 21, 2021, 7:20 a.m. UTC | #10
On Mon, Jul 19 2021 at 11:44, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>>> Address this issue by adding one field of .irq_affinity_managed into
>>> 'struct device'.
>>>
>>> Suggested-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> Did you consider that for PCI device we effectively have this info already:
>>
>> bool dev_has_managed_msi_irq(struct device *dev)
>> {
>> 	struct msi_desc *desc;
>>
>> 	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
>> 		if (desc->affinity && desc->affinity->is_managed)
>> 			return true;
>> 	}
>>
>> 	return false;
>
> Just walking the list seems fine to me given that this is not a
> performance critical path.  But what are the locking implications?

At the moment there are none because the list is initialized in the
setup path and never modified afterwards. Though that might change
sooner than later to fix the virtio wreckage vs. MSI-X.

> Also does the above imply this won't work for your platform MSI case?

The msi descriptors are attached to struct device and independent of
platform/PCI/whatever.

Thanks,

        tglx
Christoph Hellwig July 21, 2021, 7:24 a.m. UTC | #11
On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
> > Just walking the list seems fine to me given that this is not a
> > performance critical path.  But what are the locking implications?
> 
> At the moment there are none because the list is initialized in the
> setup path and never modified afterwards. Though that might change
> sooner than later to fix the virtio wreckage vs. MSI-X.

What is the issue there?  Either way, if we keep the helper in the
IRQ code it should be easy to spot for anyone adding the locking.

> > Also does the above imply this won't work for your platform MSI case?
> 
> The msi descriptors are attached to struct device and independent of
> platform/PCI/whatever.

That's what I assumed, but this text from John suggested there is
something odd about the platform case:

"Did you consider that for PCI .."
John Garry July 21, 2021, 9:44 a.m. UTC | #12
+ Marc

On 21/07/2021 08:24, Christoph Hellwig wrote:
> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
>>> Just walking the list seems fine to me given that this is not a
>>> performance critical path.  But what are the locking implications?
>> At the moment there are none because the list is initialized in the
>> setup path and never modified afterwards. Though that might change
>> sooner than later to fix the virtio wreckage vs. MSI-X.
> What is the issue there?  Either way, if we keep the helper in the
> IRQ code it should be easy to spot for anyone adding the locking.
> 
>>> Also does the above imply this won't work for your platform MSI case?
>> The msi descriptors are attached to struct device and independent of
>> platform/PCI/whatever.
> That's what I assumed, but this text from John suggested there is
> something odd about the platform case:
> 
> "Did you consider that for PCI .."
> .

For this special platform MSI case there is a secondary interrupt 
controller (called mbigen) which generates the MSI on behalf of the 
device, which I think the MSI belongs to (and not the device, itself).

See "latter case" mentioned in commit 91f90daa4fb2.

I think Marc and Thomas can explain this much better than I could.

Anyway, as I mentioned earlier, I think that this specific problem is 
unique and can be solved without using a function which examines the 
struct device.msi_list.

Thanks,
John
Thomas Gleixner July 21, 2021, 8:14 p.m. UTC | #13
On Wed, Jul 21 2021 at 09:24, Christoph Hellwig wrote:

> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
>> > Just walking the list seems fine to me given that this is not a
>> > performance critical path.  But what are the locking implications?
>> 
>> At the moment there are none because the list is initialized in the
>> setup path and never modified afterwards. Though that might change
>> sooner than later to fix the virtio wreckage vs. MSI-X.
>
> What is the issue there?  Either way, if we keep the helper in the
> IRQ code it should be easy to spot for anyone adding the locking.

  https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de

TLDR: virtio allocates ONE irq on msix_enable() and then when the guest
actually unmasks another entry (e.g. request_irq()), it tears down the
allocated one and sets up two. On the third one this repeats ....

There are only two options:

  1) allocate everything upfront, which is undesired
  2) append entries, which might need locking, but I'm still trying to
     avoid that

There is another problem vs. vector exhaustion which can't be fixed that
way, but that's a different story.

Thanks,

        tglx
Thomas Gleixner July 21, 2021, 8:22 p.m. UTC | #14
On Wed, Jul 21 2021 at 10:44, John Garry wrote:
> On 21/07/2021 08:24, Christoph Hellwig wrote:
>> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
>>>> Also does the above imply this won't work for your platform MSI case?
>>> The msi descriptors are attached to struct device and independent of
>>> platform/PCI/whatever.
>> That's what I assumed, but this text from John suggested there is
>> something odd about the platform case:
>> 
>> "Did you consider that for PCI .."
>> .
>
> For this special platform MSI case there is a secondary interrupt 
> controller (called mbigen) which generates the MSI on behalf of the 
> device, which I think the MSI belongs to (and not the device, itself).

MBIGEN is a different story because it converts wired interrupts into
MSI interrupts, IOW a MSI based interrupt pin extender.

I might be wrong, but I seriously doubt that any multiqueue device which
wants to use affinity managed interrupts is built on top of that.

Thanks,

        tglx
Christoph Hellwig July 21, 2021, 8:32 p.m. UTC | #15
On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote:
>   https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de
> 
> TLDR: virtio allocates ONE irq on msix_enable() and then when the guest
> actually unmasks another entry (e.g. request_irq()), it tears down the
> allocated one and set's up two. On the third one this repeats ....
> 
> There are only two options:
> 
>   1) allocate everything upfront, which is undesired
>   2) append entries, which might need locking, but I'm still trying to
>      avoid that
> 
> There is another problem vs. vector exhaustion which can't be fixed that
> way, but that's a different story.

FYI, NVMe is similar.  We need one IRQ to set up the admin queue,
which is used to query/set how many I/O queues are supported.  Just
two steps though and not unbound.
Thomas Gleixner July 21, 2021, 10:38 p.m. UTC | #16
On Wed, Jul 21 2021 at 22:32, Christoph Hellwig wrote:
> On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote:
>>   https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de
>> 
>> TLDR: virtio allocates ONE irq on msix_enable() and then when the
>> guest

Oops, sorry, that should have been VFIO, not virtio.

>> actually unmasks another entry (e.g. request_irq()), it tears down the
>> allocated one and sets up two. On the third one this repeats ....
>> 
>> There are only two options:
>> 
>>   1) allocate everything upfront, which is undesired
>>   2) append entries, which might need locking, but I'm still trying to
>>      avoid that
>> 
>> There is another problem vs. vector exhaustion which can't be fixed that
>> way, but that's a different story.
>
> FYI, NVMe is similar.  We need one IRQ to set up the admin queue,
> which is used to query/set how many I/O queues are supported.  Just
> two steps though and not unbound.

That's fine because that's controlled by the driver consistently and it
(hopefully) makes sure that the admin queue is quiesced before
everything is torn down after the initial query.

But that's not the case for VFIO. It tears down all in use interrupts
and the guest driver is completely oblivious of that.

Assume the following situation:

 1) VM boots with 8 present CPUs and 16 possible CPUs

 2) The passed through card (PF or VF) supports multiqueue and the
    driver uses managed interrupts which e.g. allocates one queue and
    one interrupt per possible CPU.

    Initial setup requests all the interrupts, but only the first 8
    queue interrupts are unmasked and therefore reallocated by the host
    which works by some definition of works because the device is quiet
    at that point.

 3) Host admin plugs the other 8 CPUs into the guest

    Onlining these CPUs in the guest will unmask the dormant managed
    queue interrupts and cause the host to allocate the remaining 8 per
    queue interrupts one by one thereby tearing down _all_ previously
    allocated ones and then allocating one more than before.

    Assume that while this goes on the guest has I/O running on the
    already online CPUs and their associated queues. Depending on the
    device this either will lose interrupts or reroute them to the
    legacy INTx which is not handled. This might in the best case result
    in a few "timedout" requests, but I managed it at least once to make
    the device go into lala land state, i.e. it did not recover.

The above can be fixed by adding an 'append' mode to the MSI code.

But that does not fix the overcommit issue where the host runs out of
vector space. The result is simply that the guest does not know and just
continues to work on device/queues which will never ever receive an
interrupt (again).

I got educated that all of this is considered unlikely and my argument
that the concept of unlikely simply does not exist at cloud scale got
ignored. Sure, I know it's VIRT and therefore not subject to common
sense.

Thanks,

        tglx
Christoph Hellwig July 22, 2021, 7:46 a.m. UTC | #17
On Thu, Jul 22, 2021 at 12:38:07AM +0200, Thomas Gleixner wrote:
> That's fine because that's controlled by the driver consistently and it
> (hopefully) makes sure that the admin queue is quiesced before
> everything is torn down after the initial query.

Yes, it is.

> The above can be fixed by adding an 'append' mode to the MSI code.

So IFF we get that append mode I think it would help to simplify
drivers that have unmanaged pre and post vectors, and/or do the above
probing.

So instead of what we currently do (request a single unmanaged vector,
do basic setup, tear it down, then request N managed vectors with an
unmanaged pre-vector), we could just keep the unmanaged vector, never
tear it down and just append the post vectors.  In the long run this
could remove the need to do the pre and post vectors entirely.
John Garry July 22, 2021, 7:48 a.m. UTC | #18
On 21/07/2021 21:22, Thomas Gleixner wrote:
>>> That's what I assumed, but this text from John suggested there is
>>> something odd about the platform case:
>>>
>>> "Did you consider that for PCI .."
>>> .
>> For this special platform MSI case there is a secondary interrupt
>> controller (called mbigen) which generates the MSI on behalf of the
>> device, which I think the MSI belongs to (and not the device, itself).
> MBIGEN is a different story because it converts wired interrupts into
> MSI interrupts, IOW a MSI based interrupt pin extender.
> 
> I might be wrong, but I seriously doubt that any multiqueue device which
> wants to use affinity managed interrupts is built on top of that.

Actually, the SCSI device for which platform-device managed-interrupt
support was added in commit e15f2fa959f2 does use mbigen.

Thanks,
John

Patch

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 8640578f45e9..d28cb91d5cf9 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -388,6 +388,13 @@  int devm_platform_get_irqs_affinity(struct platform_device *dev,
 				ptr->irq[i], ret);
 			goto err_free_desc;
 		}
+
+		/*
+		 * mark the device as irq affinity managed if any irq affinity
+		 * descriptor is managed
+		 */
+		if (desc[i].is_managed)
+			dev->dev.irq_affinity_managed = true;
 	}
 
 	devres_add(&dev->dev, ptr);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3d6db20d1b2b..7ddec90b711d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1197,6 +1197,7 @@  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
 			affd = &msi_default_affd;
+		dev->dev.irq_affinity_managed = true;
 	} else {
 		if (WARN_ON(affd))
 			affd = NULL;
@@ -1215,6 +1216,8 @@  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 			return nvecs;
 	}
 
+	dev->dev.irq_affinity_managed = false;
+
 	/* use legacy IRQ if allowed */
 	if (flags & PCI_IRQ_LEGACY) {
 		if (min_vecs == 1 && dev->irq) {
diff --git a/include/linux/device.h b/include/linux/device.h
index 59940f1744c1..9ec6e671279e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -569,6 +569,7 @@  struct device {
 #ifdef CONFIG_DMA_OPS_BYPASS
 	bool			dma_ops_bypass : 1;
 #endif
+	bool			irq_affinity_managed : 1;
 };
 
 /**