[v3] ACPI, APEI, EINJ: Relax platform response timeout to 1 second.

Message ID 20211026072829.94262-1-xueshuai@linux.alibaba.com (mailing list archive)
State Mainlined, archived

Commit Message

Shuai Xue Oct. 26, 2021, 7:28 a.m. UTC
When injecting an error into the platform, the OSPM executes an
EXECUTE_OPERATION action to instruct the platform to begin the injection
operation. The OSPM then busy-waits, repeatedly executing the
CHECK_BUSY_STATUS action until the platform indicates that the operation
is complete. Currently, the platform must respond within 1 millisecond,
which is too strict for some platforms.

For example, on an Arm platform, when injecting a Processor Correctable
error, the OSPM warns:
    Firmware does not respond in time.

And the following message is printed on the console:
    echo: write error: Input/output error

We observe that, on an Arm platform, the waiting time is about 10 ms for
DDR error injection and about 500 ms for PCIe error injection.

This patch relaxes the response timeout to 1 second.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
Changelog v2 -> v3:
- Implemented the timeout in usleep_range instead of msleep.
- Dropped command line interface of timeout.
- Link to the v1 patch: https://lkml.org/lkml/2021/10/14/1402
---
 drivers/acpi/apei/einj.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)
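
To make the numbers in the commit message concrete, here is a minimal
userspace sketch of the poll-until-not-busy loop. It is not the kernel code:
poll_completes() and budget_exhausted() are hypothetical names, the platform
is modelled as staying busy for a fixed latency, and every sleep is assumed
to last exactly one unit. All arguments to one call must use the same time
unit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch, in microseconds. */
#define SLEEP_UNIT_MIN   1000ULL      /* 1 ms */
#define FIRMWARE_TIMEOUT 1000000ULL   /* 1 s  */

/* Same budget check as einj_timedout(), without the actual sleep. */
static bool budget_exhausted(uint64_t *budget, uint64_t unit)
{
	if ((int64_t)*budget < (int64_t)unit)
		return true;
	*budget -= unit;
	return false;
}

/*
 * Hypothetical model: the platform reports "busy" until `latency` has
 * elapsed, and each wait advances time by exactly `unit`.
 * Returns true if the operation completes before the budget runs out.
 */
static bool poll_completes(uint64_t latency, uint64_t budget, uint64_t unit)
{
	uint64_t elapsed = 0;

	while (elapsed < latency) {           /* CHECK_BUSY_STATUS: still busy */
		if (budget_exhausted(&budget, unit))
			return false;         /* reported as an I/O error */
		elapsed += unit;              /* stands in for the delay */
	}
	return true;
}
```

Under these assumptions, poll_completes(10000, FIRMWARE_TIMEOUT,
SLEEP_UNIT_MIN) and poll_completes(500000, FIRMWARE_TIMEOUT, SLEEP_UNIT_MIN)
both succeed, covering the DDR and PCIe cases above. With the old values
expressed in nanoseconds, poll_completes(10000000, 1000000, 100) fails, which
matches the reported warning.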

Comments

Tony Luck Oct. 26, 2021, 5:05 p.m. UTC | #1
On Tue, Oct 26, 2021 at 03:28:29PM +0800, Shuai Xue wrote:
> When injecting an error into the platform, the OSPM executes an
> EXECUTE_OPERATION action to instruct the platform to begin the injection
> operation. And then, the OSPM busy waits for a while by continually
> executing CHECK_BUSY_STATUS action until the platform indicates that the
> operation is complete. More specifically, the platform is limited to
> respond within 1 millisecond right now. This is too strict for some
> platforms.
> 
> For example, in Arm platform, when injecting a Processor Correctable error,
> the OSPM will warn:
>     Firmware does not respond in time.
> 
> And a message is printed on the console:
>     echo: write error: Input/output error
> 
> We observe that the waiting time for DDR error injection is about 10 ms and
> that for PCIe error injection is about 500 ms in Arm platform.
> 
> In this patch, we relax the response timeout to 1 second.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>

Reviewed-by: Tony Luck <tony.luck@intel.com>

Rafael: Do you want to take this in the acpi tree? If not, I can
apply it to the RAS tree (already at -rc7, so in next merge cycle
after 5.16-rc1 comes out).

Shuai Xue Oct. 27, 2021, 2:18 a.m. UTC | #2
Hi Tony,

Thank you for your patient review. :)

Cheers,
Shuai

Rafael J. Wysocki Oct. 27, 2021, 6:24 p.m. UTC | #3
On Tue, Oct 26, 2021 at 7:05 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Tue, Oct 26, 2021 at 03:28:29PM +0800, Shuai Xue wrote:
> > When injecting an error into the platform, the OSPM executes an
> > EXECUTE_OPERATION action to instruct the platform to begin the injection
> > operation. And then, the OSPM busy waits for a while by continually
> > executing CHECK_BUSY_STATUS action until the platform indicates that the
> > operation is complete. More specifically, the platform is limited to
> > respond within 1 millisecond right now. This is too strict for some
> > platforms.
> >
> > For example, in Arm platform, when injecting a Processor Correctable error,
> > the OSPM will warn:
> >     Firmware does not respond in time.
> >
> > And a message is printed on the console:
> >     echo: write error: Input/output error
> >
> > We observe that the waiting time for DDR error injection is about 10 ms and
> > that for PCIe error injection is about 500 ms in Arm platform.
> >
> > In this patch, we relax the response timeout to 1 second.
> >
> > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> Reviewed-by: Tony Luck <tony.luck@intel.com>
>
> Rafael: Do you want to take this in the acpi tree? If not, I can
> apply it to the RAS tree (already at -rc7, so in next merge cycle
> after 5.16-rc1 comes out).

I'll queue it up for 5.16.

Thanks!

Patch

diff --git a/drivers/acpi/apei/einj.c b/drivers/acpi/apei/einj.c
index 133156759551..6e1ff4b62a8f 100644
--- a/drivers/acpi/apei/einj.c
+++ b/drivers/acpi/apei/einj.c
@@ -28,9 +28,10 @@ 
 #undef pr_fmt
 #define pr_fmt(fmt) "EINJ: " fmt
 
-#define SPIN_UNIT		100			/* 100ns */
-/* Firmware should respond within 1 milliseconds */
-#define FIRMWARE_TIMEOUT	(1 * NSEC_PER_MSEC)
+#define SLEEP_UNIT_MIN		1000			/* 1ms */
+#define SLEEP_UNIT_MAX		5000			/* 5ms */
+/* Firmware should respond within 1 seconds */
+#define FIRMWARE_TIMEOUT	(1 * USEC_PER_SEC)
 #define ACPI5_VENDOR_BIT	BIT(31)
 #define MEM_ERROR_MASK		(ACPI_EINJ_MEMORY_CORRECTABLE | \
 				ACPI_EINJ_MEMORY_UNCORRECTABLE | \
@@ -171,13 +172,13 @@  static int einj_get_available_error_type(u32 *type)
 
 static int einj_timedout(u64 *t)
 {
-	if ((s64)*t < SPIN_UNIT) {
+	if ((s64)*t < SLEEP_UNIT_MIN) {
 		pr_warn(FW_WARN "Firmware does not respond in time\n");
 		return 1;
 	}
-	*t -= SPIN_UNIT;
-	ndelay(SPIN_UNIT);
-	touch_nmi_watchdog();
+	*t -= SLEEP_UNIT_MIN;
+	usleep_range(SLEEP_UNIT_MIN, SLEEP_UNIT_MAX);
+
 	return 0;
 }
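
One consequence of the hunk above is that the budget is decremented by
SLEEP_UNIT_MIN on every call, while usleep_range() may sleep for up to
SLEEP_UNIT_MAX. The following userspace sketch works out the resulting
nominal bounds. einj_wait_bounds() and struct wait_bounds are hypothetical
names used only for this illustration, and the figures ignore scheduling
delays and the time spent executing CHECK_BUSY_STATUS itself.

```c
#include <stdint.h>

/* Values taken from the patch, in microseconds. */
#define SLEEP_UNIT_MIN   1000ULL      /* 1 ms */
#define SLEEP_UNIT_MAX   5000ULL      /* 5 ms */
#define FIRMWARE_TIMEOUT 1000000ULL   /* 1 s  */

struct wait_bounds {
	uint64_t polls;   /* sleeps performed before the timeout fires */
	uint64_t min_us;  /* total wait if every sleep takes SLEEP_UNIT_MIN */
	uint64_t max_us;  /* total wait if every sleep takes SLEEP_UNIT_MAX */
};

/* Replays the budget accounting of einj_timedout() without sleeping. */
static struct wait_bounds einj_wait_bounds(void)
{
	uint64_t t = FIRMWARE_TIMEOUT;
	struct wait_bounds b = { 0, 0, 0 };

	while ((int64_t)t >= (int64_t)SLEEP_UNIT_MIN) {
		t -= SLEEP_UNIT_MIN;
		b.polls++;
		b.min_us += SLEEP_UNIT_MIN;
		b.max_us += SLEEP_UNIT_MAX;
	}
	return b;
}
```

With the values from the patch this gives 1000 polls, a nominal minimum wait
of 1 second, and a nominal maximum of 5 seconds before the "Firmware does not
respond in time" warning is printed.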