Message ID | 20250203152735.825010-3-avri.altman@wdc.com (mailing list archive) |
---|---|
State | Superseded |
Series | scsi: ufs: critical health condition |
On 2/3/25 07:27, Avri Altman wrote:
> The UFS 4.1 standard, released on January 8, 2025, introduces several
> new features, including a new exception event: HEALTH_CRITICAL. This
> event notifies the host of a device's critical health condition,
> indicating that the device is approaching the end of its lifetime based
> on the number of program/erase cycles performed.
>
> We utilize the hwmon (hardware monitoring) subsystem to propagate this
> information via the chip alarm channel.
>

That is outside the scope of the hardware monitoring subsystem, the
"alarms" attribute is deprecated and must not be used in new drivers, and
it isn't actually implemented by this code.

I can't control what is submitted into the ufs code, but from a hardware
monitoring perspective this is a NACK.

Guenter

> On 2/3/25 07:27, Avri Altman wrote:
> > The UFS 4.1 standard, released on January 8, 2025, introduces several
> > new features, including a new exception event: HEALTH_CRITICAL. This
> > event notifies the host of a device's critical health condition,
> > indicating that the device is approaching the end of its lifetime
> > based on the number of program/erase cycles performed.
> >
> > We utilize the hwmon (hardware monitoring) subsystem to propagate this
> > information via the chip alarm channel.
> >
>
> That is outside the scope of the hardware monitoring subsystem, the
> "alarms" attribute is deprecated and must not be used in new drivers, and it
> isn't actually implemented by this code.
OK. Thanks for letting me know.
Do you see any other path I can take within hwmon
to let the upper stack / HAL know that the ufs device is reaching its EOL?
Or should I look elsewhere?

Thanks,
Avri

>
> I can't control what is submitted into the ufs code, but from a hardware
> monitoring perspective this is a NACK.
>
> Guenter

On 2/3/25 09:25, Avri Altman wrote:
>> On 2/3/25 07:27, Avri Altman wrote:
>>> The UFS 4.1 standard, released on January 8, 2025, introduces several
>>> new features, including a new exception event: HEALTH_CRITICAL. This
>>> event notifies the host of a device's critical health condition,
>>> indicating that the device is approaching the end of its lifetime
>>> based on the number of program/erase cycles performed.
>>>
>>> We utilize the hwmon (hardware monitoring) subsystem to propagate this
>>> information via the chip alarm channel.
>>>
>>
>> That is outside the scope of the hardware monitoring subsystem, the
>> "alarms" attribute is deprecated and must not be used in new drivers, and it
>> isn't actually implemented by this code.
> OK. Thanks for letting me know.
> Do you see any other path I can take within hwmon
> to let the upper stack / HAL know that the ufs device is reaching its EOL?
> Or should I look elsewhere?
>

Again, this is not a hardware monitoring attribute. Normally I'd assume
that information like this is reported, for example, via smartctl or
whatever similar mechanism is available for ufs devices.

Just to give an example: smartctl reports for one of the nvme drives
in my system:

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    10,835,485 [5.54 TB]
Data Units Written:                 4,931,062 [2.52 TB]
Host Read Commands:                 149,936,032
Host Write Commands:                36,799,659
Controller Busy Time:               318
Power Cycles:                       12
Power On Hours:                     326
Unsafe Shutdowns:                   4
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               41 Celsius

Per your logic, all of that could be declared to be "hardware monitoring".
That simply doesn't make sense. All that information is reported by
smartctl, and it can and should be monitored using smartd or a similar
tool. There is no need to invent a new mechanism to do the same.
If smartmontools don't support ufs, such support should be added there,
and not be pressed into some unrelated kernel subsystem.

Thanks,
Guenter

diff --git a/drivers/ufs/core/ufs-hwmon.c b/drivers/ufs/core/ufs-hwmon.c
index db28f456b923..410dc6568de5 100644
--- a/drivers/ufs/core/ufs-hwmon.c
+++ b/drivers/ufs/core/ufs-hwmon.c
@@ -149,6 +149,7 @@ static umode_t ufs_hwmon_is_visible(const void *data,
 
 static const struct hwmon_channel_info *const ufs_hwmon_info[] = {
 	HWMON_CHANNEL_INFO(temp, HWMON_T_ENABLE | HWMON_T_INPUT | HWMON_T_CRIT | HWMON_T_LCRIT),
+	HWMON_CHANNEL_INFO(chip, HWMON_C_ALARMS),
 	NULL
 };
 
@@ -209,4 +210,7 @@ void ufs_hwmon_notify_event(struct ufs_hba *hba, u16 ee_mask)
 
 	if (ee_mask & MASK_EE_TOO_LOW_TEMP)
 		hwmon_notify_event(hba->hwmon_device, hwmon_temp, hwmon_temp_min_alarm, 0);
+
+	if (ee_mask & MASK_EE_HEALTH_CRITICAL)
+		hwmon_notify_event(hba->hwmon_device, hwmon_chip, hwmon_chip_alarms, 0);
 }
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 9fbaf74b0fef..407dc1acca0f 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6198,6 +6198,9 @@ static void ufshcd_exception_event_handler(struct work_struct *work)
 	if (status & hba->ee_drv_mask & MASK_EE_URGENT_TEMP)
 		ufs_hwmon_notify_event(hba, status & MASK_EE_URGENT_TEMP);
 
+	if (status & hba->ee_drv_mask & MASK_EE_HEALTH_CRITICAL)
+		ufs_hwmon_notify_event(hba, status & MASK_EE_HEALTH_CRITICAL);
+
 	ufs_debugfs_exception_event(hba, status);
 }
 
@@ -8091,12 +8094,24 @@ static void ufshcd_temp_notif_probe(struct ufs_hba *hba, const u8 *desc_buf, u16
 		*mask |= MASK_EE_TOO_HIGH_TEMP;
 }
 
+static void ufshcd_critical_health_probe(struct ufs_hba *hba, u16 *mask)
+{
+	struct ufs_dev_info *dev_info = &hba->dev_info;
+
+	if (dev_info->wspecversion < 0x410)
+		return;
+
+	*mask |= MASK_EE_HEALTH_CRITICAL;
+}
+
 static void ufshcd_hwmon_probe(struct ufs_hba *hba, const u8 *desc_buf)
 {
 	u16 mask = 0;
 
 	ufshcd_temp_notif_probe(hba, desc_buf, &mask);
 
+	ufshcd_critical_health_probe(hba, &mask);
+
 	if (mask) {
 		ufshcd_enable_ee(hba, mask);
 		ufs_hwmon_probe(hba, mask);
diff --git a/include/ufs/ufs.h b/include/ufs/ufs.h
index f151feb0ca8c..8a24ed59ec46 100644
--- a/include/ufs/ufs.h
+++ b/include/ufs/ufs.h
@@ -419,6 +419,7 @@ enum {
 	MASK_EE_TOO_LOW_TEMP		= BIT(4),
 	MASK_EE_WRITEBOOSTER_EVENT	= BIT(5),
 	MASK_EE_PERFORMANCE_THROTTLING	= BIT(6),
+	MASK_EE_HEALTH_CRITICAL		= BIT(9),
 };
 #define MASK_EE_URGENT_TEMP (MASK_EE_TOO_HIGH_TEMP | MASK_EE_TOO_LOW_TEMP)
 
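The gating logic in the diff above can be tried outside the kernel. What follows is a minimal user-space sketch, not the driver code: the bit positions and the 0x410 version threshold are taken from the patch, while the BIT() definition, the lowercase helper names, and the use of uint16_t in place of the kernel's u16 are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's BIT() macro (assumption for user space). */
#define BIT(n) (1U << (n))

/* Exception-event bits as they appear in the patch to include/ufs/ufs.h. */
enum {
	MASK_EE_TOO_LOW_TEMP           = BIT(4),
	MASK_EE_WRITEBOOSTER_EVENT     = BIT(5),
	MASK_EE_PERFORMANCE_THROTTLING = BIT(6),
	MASK_EE_HEALTH_CRITICAL        = BIT(9),
};

/*
 * Hypothetical helper mirroring ufshcd_critical_health_probe(): the event
 * is only requested when the device reports spec version 4.1 or later.
 */
static uint16_t critical_health_mask(uint16_t wspecversion)
{
	return wspecversion < 0x410 ? 0 : MASK_EE_HEALTH_CRITICAL;
}

/*
 * Hypothetical helper mirroring the test in the exception handler: the
 * event counts only if the device raised it AND the driver enabled it.
 */
static bool health_critical_pending(uint16_t status, uint16_t ee_drv_mask)
{
	return (status & ee_drv_mask & MASK_EE_HEALTH_CRITICAL) != 0;
}
```

With these helpers, a status word of 0x200 is reported only when bit 9 is also set in the driver mask; a device that reports version 0x400 never gets the bit enabled in the first place.
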
The UFS 4.1 standard, released on January 8, 2025, introduces several
new features, including a new exception event: HEALTH_CRITICAL. This
event notifies the host of a device's critical health condition,
indicating that the device is approaching the end of its lifetime based
on the number of program/erase cycles performed.

We utilize the hwmon (hardware monitoring) subsystem to propagate this
information via the chip alarm channel.

The host can gain further insight into the specific issue by reading one
of the following attributes: bPreEOLInfo, bDeviceLifeTimeEstA,
bDeviceLifeTimeEstB, bWriteBoosterBufferLifeTimeEst, and
bRPMBLifeTimeEst. However, we do not provide the corresponding .read
method in the hwmon subsystem. This is intentional: all other
end-of-life (EOL) signals are available for reading via the driver's
sysfs entries or through an applicable utility. It is up to user-space
to read these attributes if needed. It is not the kernel's
responsibility to interpret any EOL signals, as they may vary from
vendor to vendor.

Signed-off-by: Avri Altman <avri.altman@wdc.com>
---
 drivers/ufs/core/ufs-hwmon.c |  4 ++++
 drivers/ufs/core/ufshcd.c    | 15 +++++++++++++++
 include/ufs/ufs.h            |  1 +
 3 files changed, 20 insertions(+)
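
Since the commit message leaves reading the EOL attributes to user space, a monitoring tool might do something like the following. This is a minimal sketch under stated assumptions: both helper names are hypothetical, the caller supplies the sysfs path (the exact location of the health attributes is not given in this thread), and the attribute is assumed to hold a single integer printed in decimal or 0x-prefixed hex.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one attribute value.
 * Returns 0 on success, -1 if the text is not a single integer.
 */
static int parse_health_value(const char *text, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(text, &end, 0);	/* base 0 accepts decimal and 0x hex */
	if (errno != 0 || end == text)
		return -1;

	/* Allow trailing spaces and the newline sysfs files usually end with. */
	while (*end == '\n' || *end == ' ')
		end++;
	if (*end != '\0')
		return -1;

	*out = v;
	return 0;
}

/*
 * Hypothetical helper: read one attribute file (path chosen by the caller)
 * and parse its first line. Returns 0 on success, -1 on any failure.
 */
static int read_health_attr(const char *path, unsigned long *out)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;
	ret = fgets(buf, sizeof(buf), f) ? parse_health_value(buf, out) : -1;
	fclose(f);
	return ret;
}
```

A caller would pass the path of, say, the bPreEOLInfo attribute and decide for itself how to act on the value, in line with the commit message's point that interpreting EOL signals is left to user space.
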