Message ID | 746e059782953fca6c21945297151d2bb73d3370.1631519695.git.kwmad.kim@samsung.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | scsi: ufs: introduce vendor isr | expand |
On 9/13/21 12:55 AM, Kiwoong Kim wrote: > This patch is to raise recovery in some abnormal > conditions using an vendor specific interrupt > for some cases, such as a situation that some > contexts of a pending request in the host isn't > the same with those of its corresponding UPIUs > if they should have been the same exactly. > > The representative case is shown like below. > In the case, a broken UTRD entry, for internal > coherent problem or whatever, that had smaller value > of PRDT length than expected was transferred to the host. > So, the host raised an interrupt of transfer complete > even if device didn't finish its data transfer because > the host sees a fetched version of UTRD to determine > if data tranfer is over or not. Then the application level > seemed to recogize this as a sort of corruption and this > symptom led to boot failure. How can a UTRD entry be broken? Does that perhaps indicate memory corruption at the host side? Working around host-side memory corruption in a driver seems wrong to me. I think the root cause of the memory corruption should be fixed. > +static irqreturn_t exynos_ufs_isr(struct ufs_hba *hba) > +{ > + struct exynos_ufs *ufs = ufshcd_get_variant(hba); > + u32 status; > + unsigned long flags; > + > + if (!hba->priv) return IRQ_HANDLED; Please verify patches with checkpatch before posting these on the linux-scsi mailing list. The above if-statement does not follow the Linux kernel coding style. > + if (status & RX_UPIU_HIT_ERROR) { > + pr_err("%s: status: 0x%08x\n", __func__, status); > + hba->force_reset = true; > + hba->force_requeue = true; > + scsi_schedule_eh(hba->host); > + spin_unlock_irqrestore(hba->host->host_lock, flags); > + return IRQ_HANDLED; > + } > + return IRQ_NONE; > +} So the above code unlocks the host_lock depending on whether or not status & RX_UPIU_HIT_ERROR is true? Yikes ... Additionally, in the above code I found the following pattern: unsigned long flags; [ ... ] spin_unlock_irqrestore(hba->host->host_lock, flags); Such code is ALWAYS wrong. The value of the 'flags' argument passed to spin_unlock_irqrestore() must come from spin_lock_irqsave(). Bart.
> On 9/13/21 12:55 AM, Kiwoong Kim wrote: > > This patch is to raise recovery in some abnormal conditions using an > > vendor specific interrupt for some cases, such as a situation that > > some contexts of a pending request in the host isn't the same with > > those of its corresponding UPIUs if they should have been the same > > exactly. > > > > The representative case is shown like below. > > In the case, a broken UTRD entry, for internal coherent problem or > > whatever, that had smaller value of PRDT length than expected was > > transferred to the host. > > So, the host raised an interrupt of transfer complete even if device > > didn't finish its data transfer because the host sees a fetched > > version of UTRD to determine if data tranfer is over or not. Then the > > application level seemed to recogize this as a sort of corruption and > > this symptom led to boot failure. > > How can a UTRD entry be broken? Does that perhaps indicate memory > corruption at the host side? Working around host-side memory corruption in > a driver seems wrong to me. I think the root cause of the memory > corruption should be fixed. For SoC internal problems, yes, of course, they should be fixed. But I don't think the causes always come from inside the system. They could be outside the system or a device, such as sending DATA IN with a tag that a host has ever submitted a command with because of some bugs of the device. You might think putting this sort of code doesn't make sense but there could be various events that can't be understood in the point of view of the spec. And chips that is already fab-outed can't be fixed. That's why I put the details into Exynos. I'm not talking about the spec. I think Exynos isn't required to contain only things related with the spec and it can have realistic part. > > > +static irqreturn_t exynos_ufs_isr(struct ufs_hba *hba) { > > + struct exynos_ufs *ufs = ufshcd_get_variant(hba); > > + u32 status; > > + unsigned long flags; > > + > > + if (!hba->priv) return IRQ_HANDLED; > > Please verify patches with checkpatch before posting these on the linux- > scsi mailing list. The above if-statement does not follow the Linux kernel > coding style. > > > + if (status & RX_UPIU_HIT_ERROR) { > > + pr_err("%s: status: 0x%08x\n", __func__, status); > > + hba->force_reset = true; > > + hba->force_requeue = true; > > + scsi_schedule_eh(hba->host); > > + spin_unlock_irqrestore(hba->host->host_lock, flags); > > + return IRQ_HANDLED; > > + } > > + return IRQ_NONE; > > +} > > So the above code unlocks the host_lock depending on whether or not status > & RX_UPIU_HIT_ERROR is true? Yikes ... > > Additionally, in the above code I found the following pattern: > > unsigned long flags; > [ ... ] > spin_unlock_irqrestore(hba->host->host_lock, flags); > > Such code is ALWAYS wrong. The value of the 'flags' argument passed to > spin_unlock_irqrestore() must come from spin_lock_irqsave(). > > Bart. I missed for two things. Thanks, I'll look more carefully.
Hi, > +static irqreturn_t exynos_ufs_isr(struct ufs_hba *hba) { > + struct exynos_ufs *ufs = ufshcd_get_variant(hba); > + u32 status; > + unsigned long flags; > + > + if (!hba->priv) return IRQ_HANDLED; > + status = hci_readl(ufs, VENDOR_SPECIFIC_IS); > + hci_writel(ufs, status, VENDOR_SPECIFIC_IS); > + /* > + * If host doesn't guarantee integrity of UTP transmission, > + * it needs to be reset immediately to make it unable to > + * proceed requests. Because w/o this, if UTP functions > + * in host doesn't work properly for such system power margins, > + * DATA IN from broken devices or whatever in the real world, > + * some unexpected events could happen, such as transferring > + * a broken DATA IN to a device. It could be various types of > + * problems on the level of file system. In this case, I think > + * blocking the host's functionality is the best strategy. > + * Perhaps, if its root cause is temporary, system could recover. > + */ > + if (status & RX_UPIU_HIT_ERROR) { > + pr_err("%s: status: 0x%08x\n", __func__, status); > + hba->force_reset = true; > + hba->force_requeue = true; If force_reset is true, isn't force_requeue redundant? Thanks, Avri > + scsi_schedule_eh(hba->host); > + spin_unlock_irqrestore(hba->host->host_lock, flags); > + return IRQ_HANDLED; > + } > + return IRQ_NONE; > +} > + > static struct ufs_hba_variant_ops ufs_hba_exynos_ops = { > .name = "exynos_ufs", > .init = exynos_ufs_init, > @@ -1209,6 +1268,7 @@ static struct ufs_hba_variant_ops > ufs_hba_exynos_ops = { > .hibern8_notify = exynos_ufs_hibern8_notify, > .suspend = exynos_ufs_suspend, > .resume = exynos_ufs_resume, > + .intr = exynos_ufs_isr, > }; > > static int exynos_ufs_probe(struct platform_device *pdev) > -- > 2.7.4
diff --git a/drivers/scsi/ufs/ufs-exynos.c b/drivers/scsi/ufs/ufs-exynos.c index a14dd8c..b0c8843 100644 --- a/drivers/scsi/ufs/ufs-exynos.c +++ b/drivers/scsi/ufs/ufs-exynos.c @@ -21,6 +21,7 @@ #include "unipro.h" #include "ufs-exynos.h" +#include "../scsi_transport_api.h" /* * Exynos's Vendor specific registers for UFSHCI @@ -31,6 +32,8 @@ #define HCI_RXPRDT_ENTRY_SIZE 0x04 #define HCI_1US_TO_CNT_VAL 0x0C #define CNT_VAL_1US_MASK 0x3FF +#define VENDOR_SPECIFIC_IS 0x38 +#define VENDOR_SPECIFIC_IE 0x3C #define HCI_UTRL_NEXUS_TYPE 0x40 #define HCI_UTMRL_NEXUS_TYPE 0x44 #define HCI_SW_RST 0x50 @@ -74,6 +77,18 @@ UIC_TRANSPORT_NO_CONNECTION_RX |\ UIC_TRANSPORT_BAD_TC) +enum exynos_ufs_vs_interrupt { + /* + * This occurs when information of a pending request isn't + * the same with incoming UPIU for the request. For example, + * if UFS driver rings with task tag #1, subsequential UPIUs + * for this must have one as the value of task tag. But if + * it's corrutped until the host receives it or incoming UPIUs + * has an unexpected value for task tag, this raises. + */ + RX_UPIU_HIT_ERROR = 1 << 19, +}; + enum { UNIPRO_L1_5 = 0,/* PHY Adapter */ UNIPRO_L2, /* Data Link */ @@ -996,6 +1011,10 @@ static int exynos_ufs_init(struct ufs_hba *hba) goto out; exynos_ufs_specify_phy_time_attr(ufs); exynos_ufs_config_smu(ufs); + + /* enable vendor interrupts */ + hci_writel(ufs, VENDOR_SPECIFIC_IE, RX_UPIU_HIT_ERROR); + return 0; phy_off: @@ -1005,26 +1024,34 @@ static int exynos_ufs_init(struct ufs_hba *hba) return ret; } -static int exynos_ufs_host_reset(struct ufs_hba *hba) +static int exynos_ufs_host_reset(struct exynos_ufs *ufs) { - struct exynos_ufs *ufs = ufshcd_get_variant(hba); unsigned long timeout = jiffies + msecs_to_jiffies(1); - u32 val; - int ret = 0; - - exynos_ufs_disable_auto_ctrl_hcc_save(ufs, &val); + /* host reset */ hci_writel(ufs, UFS_SW_RST_MASK, HCI_SW_RST); - do { if (!(hci_readl(ufs, HCI_SW_RST) & UFS_SW_RST_MASK)) - goto out; + return 0; + cpu_relax(); } while (time_before(jiffies, timeout)); - dev_err(hba->dev, "timeout host sw-reset\n"); - ret = -ETIMEDOUT; + pr_err("timeout host sw-reset\n"); + return -ETIMEDOUT; +} + +static int exynos_ufs_host_init(struct ufs_hba *hba) +{ + struct exynos_ufs *ufs = ufshcd_get_variant(hba); + u32 val; + int ret; + + exynos_ufs_disable_auto_ctrl_hcc_save(ufs, &val); + + ret = exynos_ufs_host_reset(ufs); + if (ret) + return ret; -out: exynos_ufs_auto_ctrl_hcc_restore(ufs, &val); return ret; } @@ -1110,7 +1137,7 @@ static int exynos_ufs_hce_enable_notify(struct ufs_hba *hba, switch (status) { case PRE_CHANGE: - ret = exynos_ufs_host_reset(hba); + ret = exynos_ufs_host_init(hba); if (ret) return ret; exynos_ufs_dev_hw_reset(hba); @@ -1198,6 +1225,38 @@ static int exynos_ufs_resume(struct ufs_hba *hba, enum ufs_pm_op pm_op) return 0; } +static irqreturn_t exynos_ufs_isr(struct ufs_hba *hba) +{ + struct exynos_ufs *ufs = ufshcd_get_variant(hba); + u32 status; + unsigned long flags; + + if (!hba->priv) return IRQ_HANDLED; + status = hci_readl(ufs, VENDOR_SPECIFIC_IS); + hci_writel(ufs, status, VENDOR_SPECIFIC_IS); + /* + * If host doesn't guarantee integrity of UTP transmission, + * it needs to be reset immediately to make it unable to + * proceed requests. Because w/o this, if UTP functions + * in host doesn't work properly for such system power margins, + * DATA IN from broken devices or whatever in the real world, + * some unexpected events could happen, such as transferring + * a broken DATA IN to a device. It could be various types of + * problems on the level of file system. In this case, I think + * blocking the host's functionality is the best strategy. + * Perhaps, if its root cause is temporary, system could recover. + */ + if (status & RX_UPIU_HIT_ERROR) { + pr_err("%s: status: 0x%08x\n", __func__, status); + hba->force_reset = true; + hba->force_requeue = true; + scsi_schedule_eh(hba->host); + spin_unlock_irqrestore(hba->host->host_lock, flags); + return IRQ_HANDLED; + } + return IRQ_NONE; +} + static struct ufs_hba_variant_ops ufs_hba_exynos_ops = { .name = "exynos_ufs", .init = exynos_ufs_init, @@ -1209,6 +1268,7 @@ static struct ufs_hba_variant_ops ufs_hba_exynos_ops = { .hibern8_notify = exynos_ufs_hibern8_notify, .suspend = exynos_ufs_suspend, .resume = exynos_ufs_resume, + .intr = exynos_ufs_isr, }; static int exynos_ufs_probe(struct platform_device *pdev)
This patch is to raise recovery in some abnormal conditions using an vendor specific interrupt for some cases, such as a situation that some contexts of a pending request in the host isn't the same with those of its corresponding UPIUs if they should have been the same exactly. The representative case is shown like below. In the case, a broken UTRD entry, for internal coherent problem or whatever, that had smaller value of PRDT length than expected was transferred to the host. So, the host raised an interrupt of transfer complete even if device didn't finish its data transfer because the host sees a fetched version of UTRD to determine if data tranfer is over or not. Then the application level seemed to recogize this as a sort of corruption and this symptom led to boot failure. [5: init: 1] init: Failed to copy /avb into /metadata/gsi/dsu/avb/: No such file or directory [5: init: 1] init: [libfs_mgr]Created logical partition system on device /dev/block/dm-0 [4: init: 1] init: [libfs_mgr]Created logical partition vendor on device /dev/block/dm-1 [4: init: 1] init: [libfs_avb]total vbmeta size mismatch: 13504 (expected: 15616) [4: init: 1] init: [libfs_avb]VerifyVbmetaImages failed [4: init: 1] init: Failed to open AvbHandle: No such file or directory If the issued command would be write, it could be a disaster because some parts of its entire data chunk may be not written or more than the chunk may be written. Using the VENDOR_SPECIFIC_IS.RX_UPIU_HIT_ERROR, the host can detect the abnormal condition right when an DATA IN for offset bigger than expected for the broken value (i.e. PRDT length in this case) is received, do something for recovery and system will keep going safe. Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com> --- drivers/scsi/ufs/ufs-exynos.c | 84 ++++++++++++++++++++++++++++++++++++------- 1 file changed, 72 insertions(+), 12 deletions(-)