[v2,0/3] crypto: hisilicon - supports device isolation feature

Message ID 20220614122943.1406-1-yekai13@huawei.com (mailing list archive)

Message

yekai (A) June 14, 2022, 12:29 p.m. UTC
Add the hardware error isolation feature for ACC. Define a driver debugfs
node that is used to configure the hardware error frequency. When the error
frequency is exceeded, the device is isolated. The isolation strategy
can be defined in each driver module. For example, under the isolation
strategy for ACC, if the AER error frequency exceeds the configured value
within a certain period of time, the device becomes unavailable to user
space. A VF device uses the isolation strategy of its PF device. The
isolation strategy should not be set while the device is in use.

changes v1->v2:
	1. deleted dev_to_uacce api.
	2. add vfs node doc.
	3. move uacce->ref to driver.

Kai Ye (3):
  uacce: supports device isolation feature
  Documentation: add a isolation strategy vfs node for uacce
  crypto: hisilicon/qm - defining the device isolation strategy

 Documentation/ABI/testing/sysfs-driver-uacce |  17 ++
 drivers/crypto/hisilicon/qm.c                | 157 +++++++++++++++++--
 drivers/misc/uacce/uacce.c                   |  37 +++++
 include/linux/hisi_acc_qm.h                  |   9 ++
 include/linux/uacce.h                        |  16 +-
 5 files changed, 219 insertions(+), 17 deletions(-)

Comments

Greg Kroah-Hartman June 14, 2022, 12:41 p.m. UTC | #1
On Tue, Jun 14, 2022 at 08:29:39PM +0800, Kai Ye wrote:
> Update documentation describing DebugFS that could help to
> configure hard error frequency for users in th user space.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> ---
>  Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> index 08f2591138af..0c4226364182 100644
> --- a/Documentation/ABI/testing/sysfs-driver-uacce
> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> @@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
>  Description:    Available instances left of the device
>                  Return -ENODEV if uacce_ops get_available_instances is not provided
>  
> +What:           /sys/class/uacce/<dev_name>/isolate_strategy
> +Date:           Jun 2022
> +KernelVersion:  5.19
> +Contact:        linux-accelerators@lists.ozlabs.org
> +Description:    A vfs node that used to configures the hardware

What is a "vfs node"?

> +                error frequency. This frequency is abstract. Like once an hour
> +                or once a day. The specific isolation strategy can be defined in
> +                each driver module.

No, you need to be specific here and describe the units and the format.
Otherwise it is no description at all :(

> +
> +What:           /sys/class/uacce/<dev_name>/isolate
> +Date:           Jun 2022
> +KernelVersion:  5.19

5.19 will not have this change.

> +Contact:        linux-accelerators@lists.ozlabs.org
> +Description:    A vfs node that show the device isolated state. The value 0
> +                means that the device is working. The value 1 means that the
> +                device has been isolated.

What does "working" or "isolated" mean?

thanks,

greg k-h
Greg Kroah-Hartman June 14, 2022, 12:42 p.m. UTC | #2
On Tue, Jun 14, 2022 at 08:29:38PM +0800, Kai Ye wrote:
> UACCE add the hardware error isolation API. Users can configure
> the error frequency threshold by this vfs node. This API interface
> certainly supports the configuration of user protocol strategy. Then
> parse it inside the device driver. UACCE only reports the device
> isolate state. When the error frequency is exceeded, the device
> will be isolated. The isolation strategy should be defined in each
> driver module.
> 
> Signed-off-by: Kai Ye <yekai13@huawei.com>
> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com>
> ---
>  drivers/misc/uacce/uacce.c | 37 +++++++++++++++++++++++++++++++++++++
>  include/linux/uacce.h      | 16 +++++++++++++---
>  2 files changed, 50 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
> index b6219c6bfb48..525623215132 100644
> --- a/drivers/misc/uacce/uacce.c
> +++ b/drivers/misc/uacce/uacce.c
> @@ -346,12 +346,47 @@ static ssize_t region_dus_size_show(struct device *dev,
>  		       uacce->qf_pg_num[UACCE_QFRT_DUS] << PAGE_SHIFT);
>  }
>  
> +static ssize_t isolate_show(struct device *dev,
> +			    struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%d\n", uacce->ops->get_isolate_state(uacce));
> +}
> +
> +static ssize_t isolate_strategy_show(struct device *dev,
> +				     struct device_attribute *attr, char *buf)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +
> +	return sysfs_emit(buf, "%s\n", uacce->isolate_strategy);
> +}
> +
> +static ssize_t isolate_strategy_store(struct device *dev,
> +				      struct device_attribute *attr,
> +				      const char *buf, size_t count)
> +{
> +	struct uacce_device *uacce = to_uacce_device(dev);
> +	int ret;
> +
> +	if (!buf || sizeof(buf) > UACCE_MAX_ISOLATE_STRATEGY_LEN)
> +		return -EINVAL;
> +
> +	memcpy(uacce->isolate_strategy, buf, strlen(buf));
> +
> +	ret = uacce->ops->isolate_strategy_write(uacce, buf);
> +
> +	return ret ? ret : count;
> +}
> +
>  static DEVICE_ATTR_RO(api);
>  static DEVICE_ATTR_RO(flags);
>  static DEVICE_ATTR_RO(available_instances);
>  static DEVICE_ATTR_RO(algorithms);
>  static DEVICE_ATTR_RO(region_mmio_size);
>  static DEVICE_ATTR_RO(region_dus_size);
> +static DEVICE_ATTR_RO(isolate);
> +static DEVICE_ATTR_RW(isolate_strategy);
>  
>  static struct attribute *uacce_dev_attrs[] = {
>  	&dev_attr_api.attr,
> @@ -360,6 +395,8 @@ static struct attribute *uacce_dev_attrs[] = {
>  	&dev_attr_algorithms.attr,
>  	&dev_attr_region_mmio_size.attr,
>  	&dev_attr_region_dus_size.attr,
> +	&dev_attr_isolate.attr,
> +	&dev_attr_isolate_strategy.attr,
>  	NULL,
>  };
>  
> diff --git a/include/linux/uacce.h b/include/linux/uacce.h
> index 48e319f40275..0f7668bfa645 100644
> --- a/include/linux/uacce.h
> +++ b/include/linux/uacce.h
> @@ -8,6 +8,7 @@
>  #define UACCE_NAME		"uacce"
>  #define UACCE_MAX_REGION	2
>  #define UACCE_MAX_NAME_SIZE	64
> +#define UACCE_MAX_ISOLATE_STRATEGY_LEN	256

So it's a random string of characters?  What format?

>  
>  struct uacce_queue;
>  struct uacce_device;
> @@ -30,6 +31,8 @@ struct uacce_qfile_region {
>   * @is_q_updated: check whether the task is finished
>   * @mmap: mmap addresses of queue to user space
>   * @ioctl: ioctl for user space users of the queue
> + * @get_isolate_state: get the device state after set the isolate strategy
> + * @isolate_strategy_store: stored the isolate strategy to the device
>   */
>  struct uacce_ops {
>  	int (*get_available_instances)(struct uacce_device *uacce);
> @@ -43,6 +46,8 @@ struct uacce_ops {
>  		    struct uacce_qfile_region *qfr);
>  	long (*ioctl)(struct uacce_queue *q, unsigned int cmd,
>  		      unsigned long arg);
> +	enum uacce_dev_state (*get_isolate_state)(struct uacce_device *uacce);
> +	int (*isolate_strategy_write)(struct uacce_device *uacce, const char *buf);

Length of the buffer?
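
One way to read the question above: the callback receives only a pointer, so it has no way to bound what it reads. Note also that sizeof(buf) in isolate_strategy_store() is the size of a pointer, not of the data, and that the memcpy() there does not write a terminating NUL. Below is a minimal userspace sketch of a length-aware copy. It is not the uacce code: strategy_store() and MAX_STRATEGY_LEN are hypothetical names used only for illustration, and stripping the trailing newline is an assumption about how the node would be written from a shell.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MAX_STRATEGY_LEN 256	/* stands in for UACCE_MAX_ISOLATE_STRATEGY_LEN */

/*
 * Hypothetical helper: copy a sysfs-style write (buf, count) into a
 * fixed-size destination and always leave it NUL-terminated.
 * Returns count on success, or -EINVAL if the input is missing or
 * does not fit together with its terminator.
 */
static ssize_t strategy_store(char dst[MAX_STRATEGY_LEN],
			      const char *buf, size_t count)
{
	/* Check the byte count passed by the caller, not sizeof(buf). */
	if (!buf || count >= MAX_STRATEGY_LEN)
		return -EINVAL;

	memcpy(dst, buf, count);
	dst[count] = '\0';

	/* Drop the trailing newline that "echo" appends. */
	if (count && dst[count - 1] == '\n')
		dst[count - 1] = '\0';

	return (ssize_t)count;
}
```

With this shape, a driver callback could take (buf, count) or an already terminated copy, so neither side has to guess where the data ends.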

>  };
>  
>  /**
> @@ -57,6 +62,12 @@ struct uacce_interface {
>  	const struct uacce_ops *ops;
>  };
>  
> +enum uacce_dev_state {
> +	UACCE_DEV_ERR = -1,
> +	UACCE_DEV_NORMAL,
> +	UACCE_DEV_ISOLATE,
> +};
> +
>  enum uacce_q_state {
>  	UACCE_Q_ZOMBIE = 0,
>  	UACCE_Q_INIT,
> @@ -117,6 +128,7 @@ struct uacce_device {
>  	struct list_head queues;
>  	struct mutex queues_lock;
>  	struct inode *inode;
> +	char isolate_strategy[UACCE_MAX_ISOLATE_STRATEGY_LEN];
>  };
>  
>  #if IS_ENABLED(CONFIG_UACCE)
> @@ -125,7 +137,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
>  				 struct uacce_interface *interface);
>  int uacce_register(struct uacce_device *uacce);
>  void uacce_remove(struct uacce_device *uacce);
> -
> +struct uacce_device *dev_to_uacce(struct device *dev);

Why is this moved to the .h file yet the function is not exported?

thanks,

greg k-h
Jonathan Cameron June 15, 2022, 8:48 a.m. UTC | #3
On Tue, 14 Jun 2022 14:41:52 +0200
Greg KH <gregkh@linuxfoundation.org> wrote:

> On Tue, Jun 14, 2022 at 08:29:39PM +0800, Kai Ye wrote:
> > Update documentation describing DebugFS that could help to
> > configure hard error frequency for users in th user space.
> > 
> > Signed-off-by: Kai Ye <yekai13@huawei.com>
> > ---
> >  Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
> > index 08f2591138af..0c4226364182 100644
> > --- a/Documentation/ABI/testing/sysfs-driver-uacce
> > +++ b/Documentation/ABI/testing/sysfs-driver-uacce
> > @@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
> >  Description:    Available instances left of the device
> >                  Return -ENODEV if uacce_ops get_available_instances is not provided
> >  
> > +What:           /sys/class/uacce/<dev_name>/isolate_strategy
> > +Date:           Jun 2022
> > +KernelVersion:  5.19
> > +Contact:        linux-accelerators@lists.ozlabs.org
> > +Description:    A vfs node that used to configures the hardware  
> 
> What is a "vfs node"?
> 
> > +                error frequency. This frequency is abstract. Like once an hour
> > +                or once a day. The specific isolation strategy can be defined in
> > +                each driver module.  
> 
> No, you need to be specific here and describe the units and the format.
> Otherwise it is no description at all :(

Also, rename it.   A frequency isn't a strategy.  Strategy would be something
like:

* First fault
* Faults in moving time window.
* Faults in fixed time window.

some of which would then need separate controls for the threshold and the
time window - those should be in separate sysfs attributes.
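
To make the "faults in moving time window" case concrete, here is a rough userspace sketch, not taken from the patch. The struct, the function name, the ring size and the use of seconds as the unit are all assumptions made for illustration. The threshold and the window length are kept as separate fields, matching the idea of separate sysfs attributes, and timestamps are assumed to be monotonic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAULT_LOG_MAX 32	/* hypothetical cap on remembered faults */

struct fault_window {
	uint64_t stamps[FAULT_LOG_MAX];	/* ring of recent fault times, in seconds */
	size_t head;			/* next slot to overwrite */
	size_t count;			/* number of valid entries in the ring */
	uint32_t threshold;		/* isolate once this many faults are in the window */
	uint64_t window_secs;		/* length of the moving window, in seconds */
};

/*
 * Record a fault at time 'now' and report whether at least 'threshold'
 * faults, including this one, occurred within the last 'window_secs'.
 * Thresholds above FAULT_LOG_MAX can never trigger in this sketch.
 */
static bool fault_window_record(struct fault_window *fw, uint64_t now)
{
	size_t i, recent = 0;

	fw->stamps[fw->head] = now;
	fw->head = (fw->head + 1) % FAULT_LOG_MAX;
	if (fw->count < FAULT_LOG_MAX)
		fw->count++;

	/* Count every remembered fault that is still inside the window. */
	for (i = 0; i < fw->count; i++) {
		if (now - fw->stamps[i] < fw->window_secs)
			recent++;
	}

	return recent >= fw->threshold;
}
```

In this sketch the "first fault" strategy is simply a threshold of 1. A fixed window would instead reset a counter at each window boundary rather than ageing out individual timestamps.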

> 
> > +
> > +What:           /sys/class/uacce/<dev_name>/isolate
> > +Date:           Jun 2022
> > +KernelVersion:  5.19  
> 
> 5.19 will not have this change.
> 
> > +Contact:        linux-accelerators@lists.ozlabs.org
> > +Description:    A vfs node that show the device isolated state. The value 0
> > +                means that the device is working. The value 1 means that the
> > +                device has been isolated.  
> 
> What does "working" or "isolated" mean?
> 
> thanks,
> 
> greg k-h
yekai (A) June 15, 2022, 9:18 a.m. UTC | #4
On 2022/6/15 16:48, Jonathan Cameron wrote:
> On Tue, 14 Jun 2022 14:41:52 +0200
> Greg KH <gregkh@linuxfoundation.org> wrote:
>
>> On Tue, Jun 14, 2022 at 08:29:39PM +0800, Kai Ye wrote:
>>> Update documentation describing DebugFS that could help to
>>> configure hard error frequency for users in th user space.
>>>
>>> Signed-off-by: Kai Ye <yekai13@huawei.com>
>>> ---
>>>  Documentation/ABI/testing/sysfs-driver-uacce | 17 +++++++++++++++++
>>>  1 file changed, 17 insertions(+)
>>>
>>> diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
>>> index 08f2591138af..0c4226364182 100644
>>> --- a/Documentation/ABI/testing/sysfs-driver-uacce
>>> +++ b/Documentation/ABI/testing/sysfs-driver-uacce
>>> @@ -19,6 +19,23 @@ Contact:        linux-accelerators@lists.ozlabs.org
>>>  Description:    Available instances left of the device
>>>                  Return -ENODEV if uacce_ops get_available_instances is not provided
>>>
>>> +What:           /sys/class/uacce/<dev_name>/isolate_strategy
>>> +Date:           Jun 2022
>>> +KernelVersion:  5.19
>>> +Contact:        linux-accelerators@lists.ozlabs.org
>>> +Description:    A vfs node that used to configures the hardware
>>
>> What is a "vfs node"?
>>
>>> +                error frequency. This frequency is abstract. Like once an hour
>>> +                or once a day. The specific isolation strategy can be defined in
>>> +                each driver module.
>>
>> No, you need to be specific here and describe the units and the format.
>> Otherwise it is no description at all :(
>
> Also, rename it.   A frequency isn't a strategy.  Strategy would be something
> like:
>
> * First fault
> * Faults in moving time window.
> * Faults in fixed time window.
>
> some of which would then need separate controls for the threshold and the
> time window - those should be in separate sysfs attributes.
>

I will describe the units and the format here.

Thanks

Kai
>>
>>> +
>>> +What:           /sys/class/uacce/<dev_name>/isolate
>>> +Date:           Jun 2022
>>> +KernelVersion:  5.19
>>
>> 5.19 will not have this change.
>>
>>> +Contact:        linux-accelerators@lists.ozlabs.org
>>> +Description:    A vfs node that show the device isolated state. The value 0
>>> +                means that the device is working. The value 1 means that the
>>> +                device has been isolated.
>>
>> What does "working" or "isolated" mean?
>>
>> thanks,
>>
>> greg k-h