[net-next,v4,1/2] driver core: auxiliary bus: show auxiliary device IRQs

Message ID 20240509091411.627775-2-shayd@nvidia.com (mailing list archive)
State Superseded
Series Introduce auxiliary bus IRQs sysfs

Commit Message

Shay Drori May 9, 2024, 9:14 a.m. UTC
PCI subfunctions (SF) are anchored on the auxiliary bus. PCI physical
and virtual functions are anchored on the PCI bus;  the irq information
of each such function is visible to users via sysfs directory "msi_irqs"
containing file for each irq entry. However, for PCI SFs such information
is unavailable. Due to this users have no visibility on IRQs used by the
SFs.
Secondly, an SF is a multi function device supporting rdma, netdevice
and more. Without irq information at the bus level, the user is unable
to view or use the affinity of the SF IRQs.

Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
for supporting auxiliary devices, containing file for each irq entry.

Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
information is also not available to the users. To overcome this
limitation, each irq sysfs entry shows if irq is exclusive or shared.

For example:
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
50  51  52  53  54  55  56  57  58
$ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
exclusive

Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>

---
v3->4:
- remove global mutex (Przemek)
v2->v3:
- fix function declaration in case SYSFS isn't defined (Parav)
- convert auxdev->groups array with auxiliary_irqs_groups (Przemek)
v1->v2:
- move #ifdefs from drivers/base/auxiliary.c to
  include/linux/auxiliary_bus.h (Greg)
- use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
- Fix kzalloc(ref) to kzalloc(*ref) (Simon)
- Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
- Fix auxiliary_irq_mode_show doc (kernel test boot)
---
 Documentation/ABI/testing/sysfs-bus-auxiliary |  14 ++
 drivers/base/auxiliary.c                      | 178 +++++++++++++++++-
 include/linux/auxiliary_bus.h                 |  24 ++-
 3 files changed, 213 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
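
For readers who want to try the interface described in the commit message, the following is a minimal userspace sketch, not part of the patch. It assumes only what the example above shows: an "irqs" directory whose files are named after IRQ numbers and each contain one line, "exclusive" or "shared". The helper names read_irq_mode() and print_irq_modes() are hypothetical, chosen for illustration.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the one-line mode string from <irqs_dir>/<name> into mode.
 * Returns 0 on success, -1 if the path is too long or the file
 * cannot be opened or read.
 */
int read_irq_mode(const char *irqs_dir, const char *name,
		  char *mode, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	if (len == 0)
		return -1;
	n = snprintf(path, sizeof(path), "%s/%s", irqs_dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(mode, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	mode[strcspn(mode, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}

/*
 * Print "<irq>: <mode>" for every readable entry in irqs_dir.
 * Returns the number of entries printed, or -1 if the directory
 * cannot be opened.
 */
int print_irq_modes(const char *irqs_dir)
{
	DIR *d = opendir(irqs_dir);
	struct dirent *de;
	char mode[32];
	int count = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')	/* skip "." and ".." */
			continue;
		if (read_irq_mode(irqs_dir, de->d_name, mode, sizeof(mode)) == 0) {
			printf("%s: %s\n", de->d_name, mode);
			count++;
		}
	}
	closedir(d);
	return count;
}
```

On a system carrying this patch, one would presumably call print_irq_modes("/sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs"). The device name is taken from the example above and will differ per system.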

Comments

Greg KH May 10, 2024, 8:15 a.m. UTC | #1
On Thu, May 09, 2024 at 12:14:10PM +0300, Shay Drory wrote:
> PCI subfunctions (SF) are anchored on the auxiliary bus.

"Some PCI subfunctions can be on the auxiliary bus"

Or maybe "Sometimes the auxiliary bus interface is used for PCI
subfunctions."

Either way, the text here as-is is not correct as that is not how the
auxbus code is always used, sorry.

> PCI physical
> and virtual functions are anchored on the PCI bus;  the irq information

Odd use of ';'?  And an extra ' '?

> of each such function is visible to users via sysfs directory "msi_irqs"
> containing file for each irq entry. However, for PCI SFs such information
> is unavailable. Due to this users have no visibility on IRQs used by the
> SFs.

Not even in /proc/irq/ ?

> Secondly, an SF is a multi function device supporting rdma, netdevice

Not "is", it should be "can be"  Not all the world is your crazy
hardware :)

> and more. Without irq information at the bus level, the user is unable
> to view or use the affinity of the SF IRQs.

How would affinity be relevant here?  You are just allowing them to be
viewed, not set.

> Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
> for supporting auxiliary devices, containing file for each irq entry.
> 
> Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
> information is also not available to the users. To overcome this
> limitation, each irq sysfs entry shows if irq is exclusive or shared.
> 
> For example:
> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
> 50  51  52  53  54  55  56  57  58
> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
> exclusive
> 
> Reviewed-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> 
> ---
> v3->4:
> - remove global mutex (Przemek)
> v2->v3:
> - fix function declaration in case SYSFS isn't defined (Parav)
> - convert auxdev->groups array with auxiliary_irqs_groups (Przemek)
> v1->v2:
> - move #ifdefs from drivers/base/auxiliary.c to
>   include/linux/auxiliary_bus.h (Greg)
> - use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
> - Fix kzalloc(ref) to kzalloc(*ref) (Simon)
> - Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
> - Fix auxiliary_irq_mode_show doc (kernel test boot)
> ---
>  Documentation/ABI/testing/sysfs-bus-auxiliary |  14 ++
>  drivers/base/auxiliary.c                      | 178 +++++++++++++++++-
>  include/linux/auxiliary_bus.h                 |  24 ++-
>  3 files changed, 213 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
> new file mode 100644
> index 000000000000..3b8299d49d9e
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
> @@ -0,0 +1,14 @@
> +What:		/sys/bus/auxiliary/devices/.../irqs/
> +Date:		April, 2024
> +Contact:	Shay Drory <shayd@nvidia.com>
> +Description:
> +		The /sys/devices/.../irqs directory contains a variable set of
> +		files, with each file is named as irq number similar to PCI PF
> +		or VF's irq number located in msi_irqs directory.

So this can be msi irqs?  Or not msi irqs?  How do we know?


> +
> +What:		/sys/bus/auxiliary/devices/.../irqs/<N>
> +Date:		April, 2024
> +Contact:	Shay Drory <shayd@nvidia.com>
> +Description:
> +		auxiliary devices can share IRQs. This attribute indicates if
> +		the irq is shared with other SFs or exclusively used by the SF.
> diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
> index d3a2c40c2f12..def02f5f1220 100644
> --- a/drivers/base/auxiliary.c
> +++ b/drivers/base/auxiliary.c
> @@ -158,6 +158,176 @@
>   *	};
>   */
>  
> +#ifdef CONFIG_SYSFS
> +/* Xarray of irqs to determine if irq is exclusive or shared. */
> +static DEFINE_XARRAY(irqs);
> +
> +struct auxiliary_irq_info {
> +	struct device_attribute sysfs_attr;
> +	int irq;
> +};
> +
> +static struct attribute *auxiliary_irq_attrs[] = {
> +	NULL
> +};
> +
> +static const struct attribute_group auxiliary_irqs_group = {
> +	.name = "irqs",
> +	.attrs = auxiliary_irq_attrs,
> +};
> +
> +static const struct attribute_group *auxiliary_irqs_groups[2] = {

Why list the array size?

> +	&auxiliary_irqs_group,
> +	NULL
> +};
> +
> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
> + * shared or exclusive.
> + */
> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
> +				       struct device_attribute *attr, char *buf)
> +{
> +	struct auxiliary_irq_info *info =
> +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
> +
> +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)

refcount combined with xa?  That feels wrong, why is refcount used for
this at all?

> +		return sysfs_emit(buf, "%s\n", "shared");
> +	else
> +		return sysfs_emit(buf, "%s\n", "exclusive");
> +}
> +
> +static void auxiliary_irq_destroy(int irq)
> +{
> +	refcount_t *ref;
> +
> +	xa_lock(&irqs);
> +	ref = xa_load(&irqs, irq);
> +	if (refcount_dec_and_test(ref)) {
> +		__xa_erase(&irqs, irq);
> +		kfree(ref);
> +	}
> +	xa_unlock(&irqs);
> +}
> +
> +static int auxiliary_irq_create(int irq)
> +{
> +	refcount_t *new_ref = kzalloc(sizeof(*new_ref), GFP_KERNEL);
> +	refcount_t *ref;
> +	int ret = 0;
> +
> +	if (!new_ref)
> +		return -ENOMEM;
> +
> +	xa_lock(&irqs);
> +	ref = xa_load(&irqs, irq);
> +	if (ref) {
> +		kfree(new_ref);
> +		refcount_inc(ref);

Why do you need to use refcounts for these?  What does that help out
with?

> +		goto out;
> +	}
> +
> +	refcount_set(new_ref, 1);
> +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
> +	if (ref) {
> +		kfree(new_ref);
> +		if (xa_is_err(ref)) {
> +			ret = xa_err(ref);
> +			goto out;
> +		}
> +
> +		/* Another thread beat us to creating the enrtry. */
> +		refcount_inc(ref);

How can that happen?  Why not just use a normal simple lock for all of
this so you don't have to mess with refcounts at all?  This is not
performance-relevant code at all, but yet with a refcount you cause
almost the same issues that a normal lock would have, plus the increased
complexity of all of the surrounding code (like this, and the crazy
__xa_cmpxchg() call)

Make this simple please.


> +	}
> +
> +out:
> +	xa_unlock(&irqs);
> +	return ret;
> +}
> +
> +/**
> + * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
> + * @auxdev: auxiliary bus device to add the sysfs entry.
> + * @irq: The associated Linux interrupt number.
> + *
> + * This function should be called after auxiliary device have successfully
> + * received the irq.
> + *
> + * Return: zero on success or an error code on failure.
> + */
> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
> +{
> +	struct device *dev = &auxdev->dev;
> +	struct auxiliary_irq_info *info;
> +	int ret;
> +
> +	ret = auxiliary_irq_create(irq);
> +	if (ret)
> +		return ret;
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info) {
> +		ret = -ENOMEM;
> +		goto info_err;
> +	}
> +
> +	sysfs_attr_init(&info->sysfs_attr.attr);
> +	info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
> +	if (!info->sysfs_attr.attr.name) {
> +		ret = -ENOMEM;
> +		goto name_err;
> +	}
> +	info->irq = irq;
> +	info->sysfs_attr.attr.mode = 0444;
> +	info->sysfs_attr.show = auxiliary_irq_mode_show;
> +
> +	ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
> +	if (ret)
> +		goto auxdev_xa_err;
> +
> +	ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
> +				      auxiliary_irqs_group.name);

Adding dynamic sysfs attributes like this means that you normally just
raced with userspace and lost.  How are you ensuring that you did not
just do that?

> +/**
> + * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
> + * @auxdev: auxiliary bus device to add the sysfs entry.
> + * @irq: the IRQ to remove.
> + *
> + * This function should be called to remove an IRQ sysfs entry.
> + */
> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
> +{
> +	struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
> +	struct device *dev = &auxdev->dev;
> +
> +	if (WARN_ON(!info))

How can this ever happen?  If not, don't check for it please.  If it can
happen, properly handle it and move on, don't reboot the box.

thanks,

greg k-h
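
As an illustration of the "normal simple lock" suggestion above, here is a minimal userspace sketch, not taken from the patch or from any later revision. It assumes a pthread mutex as a stand-in for a kernel mutex and a fixed-size table as a stand-in for whatever container the kernel code would use. SKETCH_MAX_IRQS and the irq_track_*() and irq_is_shared() names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

#define SKETCH_MAX_IRQS 1024	/* hypothetical bound, for illustration only */

/* One lock guards the whole table; no refcount_t, no cmpxchg. */
static pthread_mutex_t irq_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int irq_users[SKETCH_MAX_IRQS];	/* users per IRQ */

/* Record one more user of irq. Returns 0, or -1 if irq is out of range. */
int irq_track_add(int irq)
{
	if (irq < 0 || irq >= SKETCH_MAX_IRQS)
		return -1;

	pthread_mutex_lock(&irq_lock);
	irq_users[irq]++;
	pthread_mutex_unlock(&irq_lock);
	return 0;
}

/*
 * Drop one user of irq. Returns 0, or -1 if irq is out of range or
 * has no users; the caller handles the error, nothing warns or crashes.
 */
int irq_track_remove(int irq)
{
	int ret = -1;

	if (irq < 0 || irq >= SKETCH_MAX_IRQS)
		return -1;

	pthread_mutex_lock(&irq_lock);
	if (irq_users[irq] > 0) {
		irq_users[irq]--;
		ret = 0;
	}
	pthread_mutex_unlock(&irq_lock);
	return ret;
}

/* True if more than one user currently holds irq. */
bool irq_is_shared(int irq)
{
	bool shared;

	if (irq < 0 || irq >= SKETCH_MAX_IRQS)
		return false;

	pthread_mutex_lock(&irq_lock);
	shared = irq_users[irq] > 1;
	pthread_mutex_unlock(&irq_lock);
	return shared;
}
```

Because every read and update happens under the same lock, a plain integer is enough here, and the "another thread beat us" branch of the cmpxchg version has no counterpart. Whether the counter should be global or live in the auxiliary device structure is a separate question that this sketch does not settle.
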
Przemek Kitszel May 10, 2024, 12:54 p.m. UTC | #2
On 5/10/24 10:15, Greg KH wrote:
> On Thu, May 09, 2024 at 12:14:10PM +0300, Shay Drory wrote:
>> PCI subfunctions (SF) are anchored on the auxiliary bus.
> 
> "Some PCI subfunctions can be on the auxiliary bus"
> 
> Or maybe "Sometimes the auxiliary bus interface is used for PCI
> subfunctions."
> 
> Either way, the text here as-is is not correct as that is not how the
> auxbus code is always used, sorry.
> 
>> PCI physical
>> and virtual functions are anchored on the PCI bus;  the irq information
> 
> Odd use of ';'?  And an extra ' '?
> 
>> of each such function is visible to users via sysfs directory "msi_irqs"
>> containing file for each irq entry. However, for PCI SFs such information
>> is unavailable. Due to this users have no visibility on IRQs used by the
>> SFs.
> 
> Not even in /proc/irq/ ?
> 
>> Secondly, an SF is a multi function device supporting rdma, netdevice
> 
> Not "is", it should be "can be"  Not all the world is your crazy
> hardware :)
> 
>> and more. Without irq information at the bus level, the user is unable
>> to view or use the affinity of the SF IRQs.
> 
> How would affinity be relevant here?  You are just allowing them to be
> viewed, not set.
> 
>> Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
>> for supporting auxiliary devices, containing file for each irq entry.
>>
>> Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
>> information is also not available to the users. To overcome this
>> limitation, each irq sysfs entry shows if irq is exclusive or shared.
>>
>> For example:
>> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
>> 50  51  52  53  54  55  56  57  58
>> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
>> exclusive
>>
>> Reviewed-by: Parav Pandit <parav@nvidia.com>
>> Signed-off-by: Shay Drory <shayd@nvidia.com>
>>
>> ---
>> v3->4:
>> - remove global mutex (Przemek)

Thanks, and sorry for not getting back in time on the v3 discussion.

>> v2->v3:
>> - fix function declaration in case SYSFS isn't defined (Parav)
>> - convert auxdev->groups array with auxiliary_irqs_groups (Przemek)
>> v1->v2:
>> - move #ifdefs from drivers/base/auxiliary.c to
>>    include/linux/auxiliary_bus.h (Greg)
>> - use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
>> - Fix kzalloc(ref) to kzalloc(*ref) (Simon)
>> - Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
>> - Fix auxiliary_irq_mode_show doc (kernel test boot)
>> ---
>>   Documentation/ABI/testing/sysfs-bus-auxiliary |  14 ++
>>   drivers/base/auxiliary.c                      | 178 +++++++++++++++++-
>>   include/linux/auxiliary_bus.h                 |  24 ++-
>>   3 files changed, 213 insertions(+), 3 deletions(-)
>>   create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> new file mode 100644
>> index 000000000000..3b8299d49d9e
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> @@ -0,0 +1,14 @@
>> +What:		/sys/bus/auxiliary/devices/.../irqs/
>> +Date:		April, 2024
>> +Contact:	Shay Drory <shayd@nvidia.com>
>> +Description:
>> +		The /sys/devices/.../irqs directory contains a variable set of
>> +		files, with each file is named as irq number similar to PCI PF
>> +		or VF's irq number located in msi_irqs directory.
> 
> So this can be msi irqs?  Or not msi irqs?  How do we know?
> 
> 
>> +
>> +What:		/sys/bus/auxiliary/devices/.../irqs/<N>
>> +Date:		April, 2024
>> +Contact:	Shay Drory <shayd@nvidia.com>
>> +Description:
>> +		auxiliary devices can share IRQs. This attribute indicates if
>> +		the irq is shared with other SFs or exclusively used by the SF.
>> diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
>> index d3a2c40c2f12..def02f5f1220 100644
>> --- a/drivers/base/auxiliary.c
>> +++ b/drivers/base/auxiliary.c
>> @@ -158,6 +158,176 @@
>>    *	};
>>    */
>>   
>> +#ifdef CONFIG_SYSFS
>> +/* Xarray of irqs to determine if irq is exclusive or shared. */
>> +static DEFINE_XARRAY(irqs);
>> +
>> +struct auxiliary_irq_info {
>> +	struct device_attribute sysfs_attr;
>> +	int irq;
>> +};
>> +
>> +static struct attribute *auxiliary_irq_attrs[] = {
>> +	NULL
>> +};
>> +
>> +static const struct attribute_group auxiliary_irqs_group = {
>> +	.name = "irqs",
>> +	.attrs = auxiliary_irq_attrs,
>> +};
>> +
>> +static const struct attribute_group *auxiliary_irqs_groups[2] = {
> 
> Why list the array size?
> 
>> +	&auxiliary_irqs_group,
>> +	NULL
>> +};
>> +
>> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
>> + * shared or exclusive.
>> + */
>> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
>> +				       struct device_attribute *attr, char *buf)
>> +{
>> +	struct auxiliary_irq_info *info =
>> +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
>> +
>> +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
> 
> refcount combined with xa?  That feels wrong, why is refcount used for
> this at all?

Not long ago I commented on similar usage for ice driver,
~"since you are locking anyway this could be a plain counter",
and author replied
~"additional semantics (like saturation) of refcount make me feel warm
and fuzzy" (sorry if misquoting too much).
That convinced me back then, so I kept quiet about that here.

The "use least powerful option" rule of thumb is perhaps more important.

@Greg, WDYT?

> 
>> +		return sysfs_emit(buf, "%s\n", "shared");
>> +	else
>> +		return sysfs_emit(buf, "%s\n", "exclusive");
>> +}
>> +
>> +static void auxiliary_irq_destroy(int irq)
>> +{
>> +	refcount_t *ref;
>> +
>> +	xa_lock(&irqs);
>> +	ref = xa_load(&irqs, irq);
>> +	if (refcount_dec_and_test(ref)) {
>> +		__xa_erase(&irqs, irq);
>> +		kfree(ref);
>> +	}
>> +	xa_unlock(&irqs);
>> +}
>> +
>> +static int auxiliary_irq_create(int irq)
>> +{
>> +	refcount_t *new_ref = kzalloc(sizeof(*new_ref), GFP_KERNEL);
>> +	refcount_t *ref;
>> +	int ret = 0;
>> +
>> +	if (!new_ref)
>> +		return -ENOMEM;
>> +
>> +	xa_lock(&irqs);
>> +	ref = xa_load(&irqs, irq);
>> +	if (ref) {
>> +		kfree(new_ref);
>> +		refcount_inc(ref);
> 
> Why do you need to use refcounts for these?  What does that help out
> with?
> 
>> +		goto out;
>> +	}
>> +
>> +	refcount_set(new_ref, 1);
>> +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
>> +	if (ref) {
>> +		kfree(new_ref);
>> +		if (xa_is_err(ref)) {
>> +			ret = xa_err(ref);
>> +			goto out;
>> +		}
>> +
>> +		/* Another thread beat us to creating the enrtry. */
>> +		refcount_inc(ref);
> 
> How can that happen?  Why not just use a normal simple lock for all of
> this so you don't have to mess with refcounts at all?  This is not
> performance-relevant code at all, but yet with a refcount you cause
> almost the same issues that a normal lock would have, plus the increased
> complexity of all of the surrounding code (like this, and the crazy
> __xa_cmpxchg() call)
> 
> Make this simple please.

I find the current xarray API not ideal for this use case and would like
to fix it, but let me write a proper RFC so as not to derail (or slow
down) this series.

> 
> 
>> +	}
>> +
>> +out:
>> +	xa_unlock(&irqs);
>> +	return ret;
>> +}
>> +
>> +/**
>> + * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: The associated Linux interrupt number.
>> + *
>> + * This function should be called after auxiliary device have successfully
>> + * received the irq.
>> + *
>> + * Return: zero on success or an error code on failure.
>> + */
>> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
>> +{
>> +	struct device *dev = &auxdev->dev;
>> +	struct auxiliary_irq_info *info;
>> +	int ret;
>> +
>> +	ret = auxiliary_irq_create(irq);
>> +	if (ret)
>> +		return ret;
>> +
>> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
>> +	if (!info) {
>> +		ret = -ENOMEM;
>> +		goto info_err;
>> +	}
>> +
>> +	sysfs_attr_init(&info->sysfs_attr.attr);
>> +	info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
>> +	if (!info->sysfs_attr.attr.name) {
>> +		ret = -ENOMEM;
>> +		goto name_err;
>> +	}
>> +	info->irq = irq;
>> +	info->sysfs_attr.attr.mode = 0444;
>> +	info->sysfs_attr.show = auxiliary_irq_mode_show;
>> +
>> +	ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
>> +	if (ret)
>> +		goto auxdev_xa_err;
>> +
>> +	ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
>> +				      auxiliary_irqs_group.name);
> 
> Adding dynamic sysfs attributes like this means that you normally just
> raced with userspace and lost.  How are you ensuring that you did not
> just do that?
> 
>> +/**
>> + * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: the IRQ to remove.
>> + *
>> + * This function should be called to remove an IRQ sysfs entry.
>> + */
>> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
>> +{
>> +	struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
>> +	struct device *dev = &auxdev->dev;
>> +
>> +	if (WARN_ON(!info))
> 
> How can this ever happen?  If not, don't check for it please.  If it can
> happen, properly handle it and move on, don't reboot the box.
> 
> thanks,
> 
> greg k-h
>
Greg KH May 10, 2024, 1:07 p.m. UTC | #3
On Fri, May 10, 2024 at 02:54:49PM +0200, Przemek Kitszel wrote:
> > > +static ssize_t auxiliary_irq_mode_show(struct device *dev,
> > > +				       struct device_attribute *attr, char *buf)
> > > +{
> > > +	struct auxiliary_irq_info *info =
> > > +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
> > > +
> > > +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
> > 
> > refcount combined with xa?  That feels wrong, why is refcount used for
> > this at all?
> 
> Not long ago I commented on similar usage for ice driver,
> ~"since you are locking anyway this could be a plain counter",
> and author replied
> ~"additional semantics (like saturation) of refcount make me feel warm
> and fuzzy" (sorry if misquoting too much).
> That convinced me back then, so I kept quiet about that here.

But why is this being incremented / decremented at all?  What is that
for?

> The "use least powerful option" rule of thumb is perhaps more important.

Yes, but use a refcount properly if needed, I can't figure out why a
refcount is needed here at all, which is not a good sign.

> > > +	refcount_set(new_ref, 1);
> > > +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
> > > +	if (ref) {
> > > +		kfree(new_ref);
> > > +		if (xa_is_err(ref)) {
> > > +			ret = xa_err(ref);
> > > +			goto out;
> > > +		}
> > > +
> > > +		/* Another thread beat us to creating the enrtry. */
> > > +		refcount_inc(ref);
> > 
> > How can that happen?  Why not just use a normal simple lock for all of
> > this so you don't have to mess with refcounts at all?  This is not
> > > performance-relevant code at all, but yet with a refcount you cause
> > almost the same issues that a normal lock would have, plus the increased
> > complexity of all of the surrounding code (like this, and the crazy
> > __xa_cmpxchg() call)
> > 
> > Make this simple please.
> 
> I find the current xarray API not ideal for this use case and would like
> to fix it, but let me write a proper RFC so as not to derail (or slow
> down) this series.

Why do you need to use an xarray here at all?  Why isn't this just tied
directly to the aux device instead?

thanks,

greg k-h
Przemek Kitszel May 10, 2024, 2:01 p.m. UTC | #4
On 5/10/24 15:07, Greg KH wrote:
> On Fri, May 10, 2024 at 02:54:49PM +0200, Przemek Kitszel wrote:
>>>> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
>>>> +				       struct device_attribute *attr, char *buf)
>>>> +{
>>>> +	struct auxiliary_irq_info *info =
>>>> +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
>>>> +
>>>> +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
>>>
>>> refcount combined with xa?  That feels wrong, why is refcount used for
>>> this at all?
>>
>> Not long ago I commented on similar usage for ice driver,
>> ~"since you are locking anyway this could be a plain counter",
>> and author replied
>> ~"additional semantics (like saturation) of refcount make me feel warm
>> and fuzzy" (sorry if misquoting too much).
>> That convinced me back then, so I kept quiet about that here.
> 
> But why is this being incremented / decremented at all?  What is that
> for?

[global]
This is just a counter; it is used to tell if a given IRQ is shared or
exclusive. Hence there is a global xarray for that.
And my argument is for this case precisely.

[other]
There is also a separate xarray for each auxdev (IIRC), which is used as
a generic dynamic container [that stores sysfs attrs]. Any other
container would work (with different characteristics), but I see no
problems with picking xarray here.

> 
>> The "use least powerful option" rule of thumb is perhaps more important.
> 
> Yes, but use a refcount properly if needed, I can't figure out why a
> refcount is needed here at all, which is not a good sign.
> 
>>>> +	refcount_set(new_ref, 1);
>>>> +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
>>>> +	if (ref) {
>>>> +		kfree(new_ref);
>>>> +		if (xa_is_err(ref)) {
>>>> +			ret = xa_err(ref);
>>>> +			goto out;
>>>> +		}
>>>> +
>>>> +		/* Another thread beat us to creating the enrtry. */
>>>> +		refcount_inc(ref);
>>>
>>> How can that happen?  Why not just use a normal simple lock for all of
>>> this so you don't have to mess with refcounts at all?  This is not
>>> performance-relevant code at all, but yet with a refcount you cause
>>> almost the same issues that a normal lock would have, plus the increased
>>> complexity of all of the surrounding code (like this, and the crazy
>>> __xa_cmpxchg() call)
>>>
>>> Make this simple please.
>>
>> I find the current xarray API not ideal for this use case and would like
>> to fix it, but let me write a proper RFC so as not to derail (or slow
>> down) this series.
> 
> Why do you need to use an xarray here at all?  Why isn't this just tied
> directly to the aux device instead?

For [global] above I find xarray a suitable solution; for [other] I'll
leave defending it to @Shay :)

> 
> thanks,
> 
> greg k-h
Greg KH May 11, 2024, 7:44 a.m. UTC | #5
On Fri, May 10, 2024 at 04:01:01PM +0200, Przemek Kitszel wrote:
> On 5/10/24 15:07, Greg KH wrote:
> > On Fri, May 10, 2024 at 02:54:49PM +0200, Przemek Kitszel wrote:
> > > > > +static ssize_t auxiliary_irq_mode_show(struct device *dev,
> > > > > +				       struct device_attribute *attr, char *buf)
> > > > > +{
> > > > > +	struct auxiliary_irq_info *info =
> > > > > +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
> > > > > +
> > > > > +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
> > > > 
> > > > refcount combined with xa?  That feels wrong, why is refcount used for
> > > > this at all?
> > > 
> > > Not long ago I commented on similar usage for ice driver,
> > > ~"since you are locking anyway this could be a plain counter",
> > > and author replied
> > > ~"additional semantics (like saturation) of refcount make me feel warm
> > > and fuzzy" (sorry if misquoting too much).
> > > That convinced me back then, so I kept quiet about that here.
> > 
> > But why is this being incremented / decremented at all?  What is that
> > for?
> 
> [global]
> This is just a counter; it is used to tell if a given IRQ is shared or
> exclusive. Hence there is a global xarray for that.
> And my argument is for this case precisely.
> 
> [other]
> There is also a separate xarray for each auxdev (IIRC), which is used as
> a generic dynamic container [that stores sysfs attrs]. Any other
> container would work (with different characteristics), but I see no
> problems with picking xarray here.

Again, why is an xarray needed, why isn't this part of the auxdevice
structure to start with?

thanks,

greg k-h
Shay Drori May 12, 2024, 7:27 a.m. UTC | #6
On 10/05/2024 11:15, Greg KH wrote:
> On Thu, May 09, 2024 at 12:14:10PM +0300, Shay Drory wrote:
>> PCI subfunctions (SF) are anchored on the auxiliary bus.
> 
> "Some PCI subfunctions can be on the auxiliary bus"
> 
> Or maybe "Sometimes the auxiliary bus interface is used for PCI
> subfunctions."
> 
> Either way, the text here as-is is not correct as that is not how the
> auxbus code is always used, sorry.

You are right, the suggested text is better; I will fix it in the next
version.

> 
>> PCI physical
>> and virtual functions are anchored on the PCI bus;  the irq information
> 
> Odd use of ';'?  And an extra ' '?

Correct, will change to '.' and remove the extra ' '.

> 
>> of each such function is visible to users via sysfs directory "msi_irqs"
>> containing file for each irq entry. However, for PCI SFs such information
>> is unavailable. Due to this users have no visibility on IRQs used by the
>> SFs.
> 
> Not even in /proc/irq/ ?
> 

mlx5 PCI SFs don't have IRQs of their own. The mlx5 PF provides IRQs for
the mlx5 SFs to use, and /proc/irq/ shows the IRQs of the PF device.
There is no vendor-agnostic way to identify the IRQs of a device using
/proc/irq. We ruled out adding a new file under /proc/irq for the very
narrow use case of SFs.
In addition, we ruled out using the IRQ name, since these PCI SFs might
share IRQs with peer SFs, as written below.

>> Secondly, an SF is a multi function device supporting rdma, netdevice
> 
> Not "is", it should be "can be"  Not all the world is your crazy
> hardware :)

correct, will change in next version

> 
>> and more. Without irq information at the bus level, the user is unable
>> to view or use the affinity of the SF IRQs.
> 
> How would affinity be relevant here?  You are just allowing them to be
> viewed, not set.

Correct, affinity setting is already controlled via /proc/irq. The
motivation here is to show the mapping between PCI SFs and their IRQs,
as is already done for PCI PFs/VFs.

> 
>> Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
>> for supporting auxiliary devices, containing file for each irq entry.
>>
>> Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
>> information is also not available to the users. To overcome this
>> limitation, each irq sysfs entry shows if irq is exclusive or shared.
>>
>> For example:
>> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
>> 50  51  52  53  54  55  56  57  58
>> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
>> exclusive
>>
>> Reviewed-by: Parav Pandit <parav@nvidia.com>
>> Signed-off-by: Shay Drory <shayd@nvidia.com>
>>
>> ---
>> v3->4:
>> - remove global mutex (Przemek)
>> v2->v3:
>> - fix function declaration in case SYSFS isn't defined (Parav)
>> - convert auxdev->groups array with auxiliary_irqs_groups (Przemek)
>> v1->v2:
>> - move #ifdefs from drivers/base/auxiliary.c to
>>    include/linux/auxiliary_bus.h (Greg)
>> - use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
>> - Fix kzalloc(ref) to kzalloc(*ref) (Simon)
>> - Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
>> - Fix auxiliary_irq_mode_show doc (kernel test boot)
>> ---
>>   Documentation/ABI/testing/sysfs-bus-auxiliary |  14 ++
>>   drivers/base/auxiliary.c                      | 178 +++++++++++++++++-
>>   include/linux/auxiliary_bus.h                 |  24 ++-
>>   3 files changed, 213 insertions(+), 3 deletions(-)
>>   create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> new file mode 100644
>> index 000000000000..3b8299d49d9e
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> @@ -0,0 +1,14 @@
>> +What:                /sys/bus/auxiliary/devices/.../irqs/
>> +Date:                April, 2024
>> +Contact:     Shay Drory <shayd@nvidia.com>
>> +Description:
>> +             The /sys/devices/.../irqs directory contains a variable set of
>> +             files, with each file is named as irq number similar to PCI PF
>> +             or VF's irq number located in msi_irqs directory.
> 
> So this can be msi irqs?  Or not msi irqs?  How do we know?

For PCI PFs/VFs, showing the IRQ type (msi/msix) for each IRQ is a lot
of duplication, probably for historical reasons. As per the PCI spec,
only one of msi or msix can be enabled at a given time, not both, and
this is visible through the PCI capabilities.
Showing msix is not useful for PCI SFs.

> 
> 
>> +
>> +What:                /sys/bus/auxiliary/devices/.../irqs/<N>
>> +Date:                April, 2024
>> +Contact:     Shay Drory <shayd@nvidia.com>
>> +Description:
>> +             auxiliary devices can share IRQs. This attribute indicates if
>> +             the irq is shared with other SFs or exclusively used by the SF.
>> diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
>> index d3a2c40c2f12..def02f5f1220 100644
>> --- a/drivers/base/auxiliary.c
>> +++ b/drivers/base/auxiliary.c
>> @@ -158,6 +158,176 @@
>>    *   };
>>    */
>>
>> +#ifdef CONFIG_SYSFS
>> +/* Xarray of irqs to determine if irq is exclusive or shared. */
>> +static DEFINE_XARRAY(irqs);
>> +
>> +struct auxiliary_irq_info {
>> +     struct device_attribute sysfs_attr;
>> +     int irq;
>> +};
>> +
>> +static struct attribute *auxiliary_irq_attrs[] = {
>> +     NULL
>> +};
>> +
>> +static const struct attribute_group auxiliary_irqs_group = {
>> +     .name = "irqs",
>> +     .attrs = auxiliary_irq_attrs,
>> +};
>> +
>> +static const struct attribute_group *auxiliary_irqs_groups[2] = {
> 
> Why list the array size?

Correct, I will drop it in the next version.

> 
>> +     &auxiliary_irqs_group,
>> +     NULL
>> +};
>> +
>> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
>> + * shared or exclusive.
>> + */
>> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
>> +                                    struct device_attribute *attr, char *buf)
>> +{
>> +     struct auxiliary_irq_info *info =
>> +             container_of(attr, struct auxiliary_irq_info, sysfs_attr);
>> +
>> +     if (refcount_read(xa_load(&irqs, info->irq)) > 1)
> 
> refcount combined with xa?  That feels wrong,

Correct, I will split it in the next version.

> why is refcount used for this at all?
> 
>> +             return sysfs_emit(buf, "%s\n", "shared");
>> +     else
>> +             return sysfs_emit(buf, "%s\n", "exclusive");
>> +}
>> +
>> +static void auxiliary_irq_destroy(int irq)
>> +{
>> +     refcount_t *ref;
>> +
>> +     xa_lock(&irqs);
>> +     ref = xa_load(&irqs, irq);
>> +     if (refcount_dec_and_test(ref)) {
>> +             __xa_erase(&irqs, irq);
>> +             kfree(ref);
>> +     }
>> +     xa_unlock(&irqs);
>> +}
>> +
>> +static int auxiliary_irq_create(int irq)
>> +{
>> +     refcount_t *new_ref = kzalloc(sizeof(*new_ref), GFP_KERNEL);
>> +     refcount_t *ref;
>> +     int ret = 0;
>> +
>> +     if (!new_ref)
>> +             return -ENOMEM;
>> +
>> +     xa_lock(&irqs);
>> +     ref = xa_load(&irqs, irq);
>> +     if (ref) {
>> +             kfree(new_ref);
>> +             refcount_inc(ref);
> 
> Why do you need to use refcounts for these?  What does that help out
> with?

Please see my answer below.

> 
>> +             goto out;
>> +     }
>> +
>> +     refcount_set(new_ref, 1);
>> +     ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
>> +     if (ref) {
>> +             kfree(new_ref);
>> +             if (xa_is_err(ref)) {
>> +                     ret = xa_err(ref);
>> +                     goto out;
>> +             }
>> +
>> +             /* Another thread beat us to creating the enrtry. */
>> +             refcount_inc(ref);
> 
> How can that happen?  Why not just use a normal simple lock for all of
> this so you don't have to mess with refcounts at all?  This is not
> performance-relevent code at all, but yet with a refcount you cause
> almost the same issues that a normal lock would have, plus the increased
> complexity of all of the surrounding code (like this, and the crazy
> __xa_cmpxchg() call)
> 
> Make this simple please.

There was a global mutex; I will restore it and replace the refcount
with a simple counter.
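
To make the "global mutex plus simple counter" idea concrete, here is a
minimal userspace sketch. It is only a model under stated assumptions, not
the code planned for the next version: a pthread mutex stands in for the
kernel mutex, a fixed-size array stands in for the global xarray, and every
model_* name and MODEL_MAX_IRQ is hypothetical.

```c
#include <pthread.h>

#define MODEL_MAX_IRQ 1024	/* hypothetical bound, for this model only */

/* A single lock guards every counter; this is not a hot path. */
static pthread_mutex_t model_irq_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int model_irq_users[MODEL_MAX_IRQ];

static int model_irq_valid(int irq)
{
	return irq >= 0 && irq < MODEL_MAX_IRQ;
}

/* Record one more user of @irq. Returns 0 on success, -1 if out of range. */
static int model_irq_get(int irq)
{
	if (!model_irq_valid(irq))
		return -1;

	pthread_mutex_lock(&model_irq_lock);
	model_irq_users[irq]++;
	pthread_mutex_unlock(&model_irq_lock);
	return 0;
}

/* Drop one user of @irq; a count of zero means the IRQ is untracked. */
static void model_irq_put(int irq)
{
	if (!model_irq_valid(irq))
		return;

	pthread_mutex_lock(&model_irq_lock);
	if (model_irq_users[irq] > 0)
		model_irq_users[irq]--;
	pthread_mutex_unlock(&model_irq_lock);
}

/* What the sysfs show callback would report for @irq. */
static const char *model_irq_mode(int irq)
{
	unsigned int users = 0;

	if (model_irq_valid(irq)) {
		pthread_mutex_lock(&model_irq_lock);
		users = model_irq_users[irq];
		pthread_mutex_unlock(&model_irq_lock);
	}
	return users > 1 ? "shared" : "exclusive";
}
```

With one lock around a plain integer there is no allocation on the add path
and no cmpxchg retry case, which is the simplification asked for above.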

> 
> 
>> +     }
>> +
>> +out:
>> +     xa_unlock(&irqs);
>> +     return ret;
>> +}
>> +
>> +/**
>> + * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: The associated Linux interrupt number.
>> + *
>> + * This function should be called after auxiliary device have successfully
>> + * received the irq.
>> + *
>> + * Return: zero on success or an error code on failure.
>> + */
>> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
>> +{
>> +     struct device *dev = &auxdev->dev;
>> +     struct auxiliary_irq_info *info;
>> +     int ret;
>> +
>> +     ret = auxiliary_irq_create(irq);
>> +     if (ret)
>> +             return ret;
>> +
>> +     info = kzalloc(sizeof(*info), GFP_KERNEL);
>> +     if (!info) {
>> +             ret = -ENOMEM;
>> +             goto info_err;
>> +     }
>> +
>> +     sysfs_attr_init(&info->sysfs_attr.attr);
>> +     info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
>> +     if (!info->sysfs_attr.attr.name) {
>> +             ret = -ENOMEM;
>> +             goto name_err;
>> +     }
>> +     info->irq = irq;
>> +     info->sysfs_attr.attr.mode = 0444;
>> +     info->sysfs_attr.show = auxiliary_irq_mode_show;
>> +
>> +     ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
>> +     if (ret)
>> +             goto auxdev_xa_err;
>> +
>> +     ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
>> +                                   auxiliary_irqs_group.name);
> 
> Adding dynamic sysfs attributes like this means that you normally just
> raced with userspace and lost.  How are you ensuring that you did not
> just do that?

I am not sure I understand, but the IRQs and their sysfs entries are added
dynamically during SF probe, so a user interested in the mapping will
probe the SF before using the sysfs.
Does this answer your question?

> 
>> +/**
>> + * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: the IRQ to remove.
>> + *
>> + * This function should be called to remove an IRQ sysfs entry.
>> + */
>> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
>> +{
>> +     struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
>> +     struct device *dev = &auxdev->dev;
>> +
>> +     if (WARN_ON(!info))
> 
> How can this ever happen?  If not, don't check for it please.  If it can
> happen, properly handle it and move on, don't reboot the box.

Correct, this cannot happen. I will drop it.

> 
> thanks,
> 
> greg k-h
Shay Drori May 12, 2024, 7:30 a.m. UTC | #7
On 11/05/2024 10:44, Greg KH wrote:
> On Fri, May 10, 2024 at 04:01:01PM +0200, Przemek Kitszel wrote:
>> On 5/10/24 15:07, Greg KH wrote:
>>> On Fri, May 10, 2024 at 02:54:49PM +0200, Przemek Kitszel wrote:
>>>>>> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
>>>>>> +				       struct device_attribute *attr, char *buf)
>>>>>> +{
>>>>>> +	struct auxiliary_irq_info *info =
>>>>>> +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
>>>>>> +
>>>>>> +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
>>>>>
>>>>> refcount combined with xa?  That feels wrong, why is refcount used for
>>>>> this at all?
>>>>
>>>> Not long ago I commented on similar usage for ice driver,
>>>> ~"since you are locking anyway this could be a plain counter",
>>>> and author replied
>>>> ~"additional semantics (like saturation) of refcount make me feel warm
>>>> and fuzzy" (sorry if misquoting too much).
>>>> That convinced me back then, so I kept quiet about that here.
>>>
>>> But why is this being incremented / decremented at all?  What is that
>>> for?
>>
>> [global]
>> This is just a counter, it is used to tell if given IRQ is shared or
>> exclusive. Hence there is a global xarray for that.
>> And my argument is for this case precisely.
>>
>> [other]
>> There is also a separate xarray for each auxdev (IIRC) which is used as
>> generic dynamic container [that stores sysfs attrs], any other would
>> work (with different characteristics), but I see no problems with
>> picking xarray here.
> 
> Again, why is an xarray needed, why isn't this part of the auxdevice
> structure to start with?

If I understand you correctly, you are referring to the xarray of the
auxdevice (not the global one).
If so, what can the auxdevice use instead of an xarray?

> 
> thanks,
> 
> greg k-h
>
Jason Gunthorpe May 12, 2024, 3:32 p.m. UTC | #8
On Fri, May 10, 2024 at 02:54:49PM +0200, Przemek Kitszel wrote:
> > > +	refcount_set(new_ref, 1);
> > > +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
> > > +	if (ref) {
> > > +		kfree(new_ref);
> > > +		if (xa_is_err(ref)) {
> > > +			ret = xa_err(ref);
> > > +			goto out;
> > > +		}
> > > +
> > > +		/* Another thread beat us to creating the enrtry. */
> > > +		refcount_inc(ref);
> > 
> > How can that happen?  Why not just use a normal simple lock for all of
> > this so you don't have to mess with refcounts at all?  This is not
> > performance-relevent code at all, but yet with a refcount you cause
> > almost the same issues that a normal lock would have, plus the increased
> > complexity of all of the surrounding code (like this, and the crazy
> > __xa_cmpxchg() call)
> > 
> > Make this simple please.
> 
> I find current API of xarray not ideal for this use case, and would like
> to fix it, but let me write a proper RFC to don't derail (or slow down)
> this series.

I think xarray can do this just fine already??

xa_lock(&irqs);
used = xa_to_value(xa_load(&irqs, irq));
used++;
ret = xa_store(&irqs, irq, xa_mk_value(used));
xa_unlock(&irqs);

And you can safely read the value using the typical xa_load RCU locking.
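
The snippet above is shorthand. As far as I understand the xarray API, with
xa_lock() already held the store would go through the locked variant,
__xa_store(), with a gfp argument, because xa_store() takes the lock itself.
The point is that the counter lives in the slot as a tagged value entry, so
nothing is allocated per IRQ and an empty slot reads as zero. The userspace
model below only illustrates that encoding. The model_* names, MODEL_SLOTS
and the plain array are assumptions for the sketch, and locking is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTS 1024	/* hypothetical bound, for this model only */

/* Stand-in for the global xarray: one pointer-sized slot per IRQ. */
static void *model_slots[MODEL_SLOTS];

/* Same encoding as xa_mk_value(): shift left, set the low tag bit. */
static void *model_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* Same decoding as xa_to_value(); an empty (NULL) slot decodes to 0. */
static unsigned long model_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* Add one user of @irq and return the new count (0 if out of range). */
static unsigned long model_value_get(unsigned int irq)
{
	unsigned long used;

	if (irq >= MODEL_SLOTS)
		return 0;

	used = model_to_value(model_slots[irq]);
	used++;
	model_slots[irq] = model_mk_value(used);
	return used;
}

/* Drop one user of @irq; clear the slot when the count reaches zero. */
static unsigned long model_value_put(unsigned int irq)
{
	unsigned long used;

	if (irq >= MODEL_SLOTS)
		return 0;

	used = model_to_value(model_slots[irq]);
	if (used == 0)
		return 0;

	used--;
	model_slots[irq] = used ? model_mk_value(used) : NULL;
	return used;
}
```

A reader then reports "shared" when the decoded value is greater than one.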

Jason
Przemek Kitszel May 13, 2024, 8:33 a.m. UTC | #9
On 5/12/24 17:32, Jason Gunthorpe wrote:
> On Fri, May 10, 2024 at 02:54:49PM +0200, Przemek Kitszel wrote:
>>>> +	refcount_set(new_ref, 1);
>>>> +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
>>>> +	if (ref) {
>>>> +		kfree(new_ref);
>>>> +		if (xa_is_err(ref)) {
>>>> +			ret = xa_err(ref);
>>>> +			goto out;
>>>> +		}
>>>> +
>>>> +		/* Another thread beat us to creating the enrtry. */
>>>> +		refcount_inc(ref);
>>>
>>> How can that happen?  Why not just use a normal simple lock for all of
>>> this so you don't have to mess with refcounts at all?  This is not
>>> performance-relevent code at all, but yet with a refcount you cause
>>> almost the same issues that a normal lock would have, plus the increased
>>> complexity of all of the surrounding code (like this, and the crazy
>>> __xa_cmpxchg() call)
>>>
>>> Make this simple please.
>>
>> I find current API of xarray not ideal for this use case, and would like
>> to fix it, but let me write a proper RFC to don't derail (or slow down)
>> this series.
> 
> I think xarray can do this just fine already??
> 
> xa_lock(&irqs);
> used = xa_to_value(xa_load(&irqs, irq));
> used++;
> ret = xa_store(&irqs, irq, xa_mk_value(used));
> xa_unlock(&irqs);
> 
> And you can safely read the value using the typical xa_load RCU locking.
> 
> Jason

What if I want to store some struct, potentially with need of some init
call (say, there will be a spinlock there)?

I believe the solution is to extend xarray so it will alloc the struct
(think flex array or user/priv data for "entry") and even init it
(so two new functions).
Jason Gunthorpe May 13, 2024, 11:06 p.m. UTC | #10
On Mon, May 13, 2024 at 10:33:18AM +0200, Przemek Kitszel wrote:

> What if I want to store some struct, potentially with need of some init
> call (say, there will be a spinlock there)?

Yes, that specific pattern is definitely a bit tricky. I have a
version in iommufd and I think at least someplace else... A helper of
some sort would be nice and could do a bit more work to be optimal.
 
Jason

Patch

diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
new file mode 100644
index 000000000000..3b8299d49d9e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
@@ -0,0 +1,14 @@ 
+What:		/sys/bus/auxiliary/devices/.../irqs/
+Date:		April, 2024
+Contact:	Shay Drory <shayd@nvidia.com>
+Description:
+		The /sys/devices/.../irqs directory contains a variable set of
+		files, with each file is named as irq number similar to PCI PF
+		or VF's irq number located in msi_irqs directory.
+
+What:		/sys/bus/auxiliary/devices/.../irqs/<N>
+Date:		April, 2024
+Contact:	Shay Drory <shayd@nvidia.com>
+Description:
+		auxiliary devices can share IRQs. This attribute indicates if
+		the irq is shared with other SFs or exclusively used by the SF.
diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
index d3a2c40c2f12..def02f5f1220 100644
--- a/drivers/base/auxiliary.c
+++ b/drivers/base/auxiliary.c
@@ -158,6 +158,176 @@ 
  *	};
  */
 
+#ifdef CONFIG_SYSFS
+/* Xarray of irqs to determine if irq is exclusive or shared. */
+static DEFINE_XARRAY(irqs);
+
+struct auxiliary_irq_info {
+	struct device_attribute sysfs_attr;
+	int irq;
+};
+
+static struct attribute *auxiliary_irq_attrs[] = {
+	NULL
+};
+
+static const struct attribute_group auxiliary_irqs_group = {
+	.name = "irqs",
+	.attrs = auxiliary_irq_attrs,
+};
+
+static const struct attribute_group *auxiliary_irqs_groups[2] = {
+	&auxiliary_irqs_group,
+	NULL
+};
+
+/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
+ * shared or exclusive.
+ */
+static ssize_t auxiliary_irq_mode_show(struct device *dev,
+				       struct device_attribute *attr, char *buf)
+{
+	struct auxiliary_irq_info *info =
+		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
+
+	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
+		return sysfs_emit(buf, "%s\n", "shared");
+	else
+		return sysfs_emit(buf, "%s\n", "exclusive");
+}
+
+static void auxiliary_irq_destroy(int irq)
+{
+	refcount_t *ref;
+
+	xa_lock(&irqs);
+	ref = xa_load(&irqs, irq);
+	if (refcount_dec_and_test(ref)) {
+		__xa_erase(&irqs, irq);
+		kfree(ref);
+	}
+	xa_unlock(&irqs);
+}
+
+static int auxiliary_irq_create(int irq)
+{
+	refcount_t *new_ref = kzalloc(sizeof(*new_ref), GFP_KERNEL);
+	refcount_t *ref;
+	int ret = 0;
+
+	if (!new_ref)
+		return -ENOMEM;
+
+	xa_lock(&irqs);
+	ref = xa_load(&irqs, irq);
+	if (ref) {
+		kfree(new_ref);
+		refcount_inc(ref);
+		goto out;
+	}
+
+	refcount_set(new_ref, 1);
+	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
+	if (ref) {
+		kfree(new_ref);
+		if (xa_is_err(ref)) {
+			ret = xa_err(ref);
+			goto out;
+		}
+
+		/* Another thread beat us to creating the entry. */
+		refcount_inc(ref);
+	}
+
+out:
+	xa_unlock(&irqs);
+	return ret;
+}
+
+/**
+ * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
+ * @auxdev: auxiliary bus device to add the sysfs entry.
+ * @irq: The associated Linux interrupt number.
+ *
+ * This function should be called after auxiliary device have successfully
+ * received the irq.
+ *
+ * Return: zero on success or an error code on failure.
+ */
+int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
+{
+	struct device *dev = &auxdev->dev;
+	struct auxiliary_irq_info *info;
+	int ret;
+
+	ret = auxiliary_irq_create(irq);
+	if (ret)
+		return ret;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		ret = -ENOMEM;
+		goto info_err;
+	}
+
+	sysfs_attr_init(&info->sysfs_attr.attr);
+	info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
+	if (!info->sysfs_attr.attr.name) {
+		ret = -ENOMEM;
+		goto name_err;
+	}
+	info->irq = irq;
+	info->sysfs_attr.attr.mode = 0444;
+	info->sysfs_attr.show = auxiliary_irq_mode_show;
+
+	ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
+	if (ret)
+		goto auxdev_xa_err;
+
+	ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
+				      auxiliary_irqs_group.name);
+	if (ret)
+		goto sysfs_add_err;
+
+	return 0;
+
+sysfs_add_err:
+	xa_erase(&auxdev->irqs, irq);
+auxdev_xa_err:
+	kfree(info->sysfs_attr.attr.name);
+name_err:
+	kfree(info);
+info_err:
+	auxiliary_irq_destroy(irq);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_add);
+
+/**
+ * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
+ * @auxdev: auxiliary bus device to remove the sysfs entry from.
+ * @irq: the IRQ to remove.
+ *
+ * This function should be called to remove an IRQ sysfs entry.
+ */
+void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
+{
+	struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
+	struct device *dev = &auxdev->dev;
+
+	if (WARN_ON(!info))
+		return;
+
+	sysfs_remove_file_from_group(&dev->kobj, &info->sysfs_attr.attr,
+				     auxiliary_irqs_group.name);
+	xa_erase(&auxdev->irqs, irq);
+	kfree(info->sysfs_attr.attr.name);
+	kfree(info);
+	auxiliary_irq_destroy(irq);
+}
+EXPORT_SYMBOL_GPL(auxiliary_device_sysfs_irq_remove);
+#endif
+
 static const struct auxiliary_device_id *auxiliary_match_id(const struct auxiliary_device_id *id,
 							    const struct auxiliary_device *auxdev)
 {
@@ -295,6 +465,7 @@  EXPORT_SYMBOL_GPL(auxiliary_device_init);
  * __auxiliary_device_add - add an auxiliary bus device
  * @auxdev: auxiliary bus device to add to the bus
  * @modname: name of the parent device's driver module
+ * @irqs_sysfs_enable: whether to enable IRQs sysfs
  *
  * This is the third step in the three-step process to register an
  * auxiliary_device.
@@ -310,7 +481,8 @@  EXPORT_SYMBOL_GPL(auxiliary_device_init);
  * parameter.  Only if a user requires a custom name would this version be
  * called directly.
  */
-int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname)
+int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname,
+			   bool irqs_sysfs_enable)
 {
 	struct device *dev = &auxdev->dev;
 	int ret;
@@ -325,6 +497,10 @@  int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname)
 		dev_err(dev, "auxiliary device dev_set_name failed: %d\n", ret);
 		return ret;
 	}
+	if (irqs_sysfs_enable) {
+		dev->groups = auxiliary_irqs_groups;
+		xa_init(&auxdev->irqs);
+	}
 
 	ret = device_add(dev);
 	if (ret)
diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
index de21d9d24a95..760fadb26620 100644
--- a/include/linux/auxiliary_bus.h
+++ b/include/linux/auxiliary_bus.h
@@ -58,6 +58,7 @@ 
  *       in
  * @name: Match name found by the auxiliary device driver,
  * @id: unique identitier if multiple devices of the same name are exported,
+ * @irqs: irqs xarray contains irq indices which are used by the device,
  *
  * An auxiliary_device represents a part of its parent device's functionality.
  * It is given a name that, combined with the registering drivers
@@ -138,6 +139,7 @@ 
 struct auxiliary_device {
 	struct device dev;
 	const char *name;
+	struct xarray irqs;
 	u32 id;
 };
 
@@ -209,8 +211,26 @@  static inline struct auxiliary_driver *to_auxiliary_drv(struct device_driver *dr
 }
 
 int auxiliary_device_init(struct auxiliary_device *auxdev);
-int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname);
-#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev, KBUILD_MODNAME)
+int __auxiliary_device_add(struct auxiliary_device *auxdev, const char *modname,
+			   bool irqs_sysfs_enable);
+#define auxiliary_device_add(auxdev) __auxiliary_device_add(auxdev, KBUILD_MODNAME, false)
+#define auxiliary_device_add_with_irqs(auxdev) \
+	__auxiliary_device_add(auxdev, KBUILD_MODNAME, true)
+
+#ifdef CONFIG_SYSFS
+int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq);
+void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev,
+				       int irq);
+#else /* CONFIG_SYSFS */
+static inline int
+auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
+{
+	return 0;
+}
+
+static inline void
+auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq) {}
+#endif
 
 static inline void auxiliary_device_uninit(struct auxiliary_device *auxdev)
 {