diff mbox series

[4/4] target/arm: Retry pushing CPER error if necessary

Message ID 20250214041635.608012-5-gshan@redhat.com (mailing list archive)
State New
Series target/arm: Improvement on memory error handling

Commit Message

Gavin Shan Feb. 14, 2025, 4:16 a.m. UTC
The error -1 is returned when the previously reported CPER error hasn't
been claimed yet, and the virtual machine is then terminated by abort().
This conflicts with the ideal behaviour, which is for the affected vCPU
to retry pushing the CPER error in that case, since the vCPU can't
proceed with its execution anyway.

Move the chunk of code that pushes the CPER error into a separate helper,
report_memory_error(), and retry the request when the return value from
acpi_ghes_memory_errors() is greater than zero.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 target/arm/kvm.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

Comments

Igor Mammedov Feb. 19, 2025, 5:55 p.m. UTC | #1
On Fri, 14 Feb 2025 14:16:35 +1000
Gavin Shan <gshan@redhat.com> wrote:

> The error -1 is returned if the previously reported CPER error
> hasn't been claimed. The virtual machine is terminated due to
> abort(). It's conflicting to the ideal behaviour that the affected
> vCPU retries pushing the CPER error in this case since the vCPU
> can't proceed its execution.
> 
> Move the chunk of code to push CPER error to a separate helper
> report_memory_errors() and retry the request when the return
> value from acpi_ghes_memory_errors() is greater than zero.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  target/arm/kvm.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 5c0bf99aec..9f063f6053 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -2362,6 +2362,30 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
>      return ret;
>  }
>  
> +static void report_memory_error(CPUState *c, hwaddr paddr)
> +{
> +    int ret;
> +
> +    while (true) {
> +        /* Retry if the previously report error hasn't been claimed */
> +        ret = acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
> +        if (ret <= 0) {
> +            break;
> +        }
> +
> +        bql_unlock();
> +        g_usleep(1000);
even with the bql released it's not safe to loop in here.
Consider a guest with 2 vcpus:
    * vcpu 1 gets SIGBUS due to error
    * vcpu 2 trips over the same error and gets into this loop
    * on guest side vcpu 1 continues to run to handle SEA but
      might need to acquire a lock that vcpu 2 holds

The GHESv2 error source we support can report several errors;
currently QEMU supports only one 'error status block', which
can hold several error records (CPERs), though its storage size is limited.

1:
We can potentially add support for more GHESv2 error sources
with their own Read ACK registers (let's say =max_cpus)
(that is under assumption that no other error will be
triggered while guest VCPUs handle their own SEA (upto clearing Read ACK))

2:
Another way could be for QEMU to allocate more error status _blocks_
for the only one error source it has now and try to find
empty status block to inject new error(s).
 * it can be saturated by a high rate of errors (so what do we do if that happens?)
 * subject to a race between clearing/setting Read ACK
    (maybe that can be dealt with by keeping an internal read_ack counter)

3:
And alternatively, queue incoming errors until the read ack is cleared,
and then inject the pending errors in one go.
(The problem with that is that at the moment QEMU doesn't monitor the
read ack register memory, so it won't notice the guest clearing it.)


Given that the spec has provision for multiple error status blocks/error data entries,
it seems that #2 is the expected way to deal with the problem.
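The block-allocation part of #2 could look something like the sketch below; this is purely illustrative (`ErrorStatusBlock`, `NUM_BLOCKS` and the busy flag are made-up names, not the real GHES layout), and real code would also have to deal with the Read ACK race mentioned above:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4

/* Hypothetical per-block state: one slot per in-flight error. */
typedef struct {
    bool     busy;   /* set on inject, cleared when the guest acks */
    uint64_t paddr;  /* faulting physical address recorded as CPER */
} ErrorStatusBlock;

static ErrorStatusBlock blocks[NUM_BLOCKS];

/* Find a free status block and record the error; -1 if all blocks are
 * in use (the saturation case mentioned above). */
static int inject_error(uint64_t paddr)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (!blocks[i].busy) {
            blocks[i].busy = true;
            blocks[i].paddr = paddr;
            return i;  /* caller would point the status address register here */
        }
    }
    return -1;
}

/* Guest acknowledged block @i via its Read ACK register. */
static void read_ack(int i)
{
    blocks[i].busy = false;
}
```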

PS:
I'd prefer Mauro's series to be merged first (once it's re-split),
since it refactors a bunch of the original code and hopefully makes
the code easier to follow/extend.

> +        bql_lock();
> +    }
> +
> +    if (ret == 0) {
> +        kvm_inject_arm_sea(c);
> +    } else {
> +        error_report("Error %d to report memory error", ret);
> +        abort();
> +    }
> +}
> +
>  void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>  {
>      ram_addr_t ram_addr;
> @@ -2387,12 +2411,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>               */
>              if (code == BUS_MCEERR_AR) {
>                  kvm_cpu_synchronize_state(c);
> -                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, false)) {
> -                    kvm_inject_arm_sea(c);
> -                } else {
> -                    error_report("failed to record the error");
> -                    abort();
> -                }
> +                report_memory_error(c, paddr);
>              }
>              return;
>          }
Gavin Shan Feb. 21, 2025, 5:27 a.m. UTC | #2
On 2/20/25 3:55 AM, Igor Mammedov wrote:
> On Fri, 14 Feb 2025 14:16:35 +1000
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> The error -1 is returned if the previously reported CPER error
>> hasn't been claimed. The virtual machine is terminated due to
>> abort(). It's conflicting to the ideal behaviour that the affected
>> vCPU retries pushing the CPER error in this case since the vCPU
>> can't proceed its execution.
>>
>> Move the chunk of code to push CPER error to a separate helper
>> report_memory_errors() and retry the request when the return
>> value from acpi_ghes_memory_errors() is greater than zero.
>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   target/arm/kvm.c | 31 +++++++++++++++++++++++++------
>>   1 file changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
>> index 5c0bf99aec..9f063f6053 100644
>> --- a/target/arm/kvm.c
>> +++ b/target/arm/kvm.c
>> @@ -2362,6 +2362,30 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
>>       return ret;
>>   }
>>   
>> +static void report_memory_error(CPUState *c, hwaddr paddr)
>> +{
>> +    int ret;
>> +
>> +    while (true) {
>> +        /* Retry if the previously report error hasn't been claimed */
>> +        ret = acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
>> +        if (ret <= 0) {
>> +            break;
>> +        }
>> +
>> +        bql_unlock();
>> +        g_usleep(1000);

Igor, thanks for the detailed comments. Sorry for the slight delay in replying; I
was checking the code to understand it better :)

> even with bql released it's not safe to loop in here.
> consider,
>    a guest with 2 vcpus
>      * vcpu 1 gets SIGBUS due to error
>      * vcpu 2 trips over the same error and gets into this loop
>      * on guest side vcpu 1 continues to run to handle SEA but
>        might need to acquire a lock that vcpu 2 holds
> 

Agreed.

> GHESv2 error source we support, can report several errors,
> currently QEMU supports only 1 'error status block' which
> can hold several error records (CPER) (though storage size is limited)
> 
> 1:
> We can potentially add support for more GHESv2 error sources
> with their own Read ACK registers (let's say =max_cpus)
> (that is under assumption that no other error will be
> triggered while guest VCPUs handle their own SEA (upto clearing Read ACK))
> 
> 2:
> Another way could be for QEMU to allocate more error status _blocks_
> for the only one error source it has now and try to find
> empty status block to inject new error(s).
>   * it can be saturated with high rate of errors (so what do we do in case it happens?)
>   * subject to race between clearing/setting Read ACK
>      (maybe it can dealt with that on side by keeping internal read_ack counter)
> 
> 3:
> And alternatively, queue incoming errors until read ack is cleared
> and then inject pending errors in one go.
> (problem with that is that at the moment QEMU doesn't monitor
> read ack register memory so it won't notice guest clearing that)
> 
> 
> Given spec has provision for multiple error status blocks/error data entries
> it seems that #2 is an expected way to deal with the problem.
> 

I would say #1 is the ideal model, because the read_ack_register is the bottleneck
and it should be scaled up to max_cpus. That way, the bottleneck is removed at
its root. Another benefit of #1 is that the error can be delivered immediately
to the vCPU where it was raised, which matches the semantics of SEA to me.

#2 still has the risk of saturating the multiple error status blocks under a
high rate of errors, as you said. Besides, the vCPU where the read_ack_register
is acknowledged can be different from the vCPU where the error was raised,
violating the semantics of SEA.

#3's drawback is that it violates the semantics of SEA, similar to #2.

However, #2/#3 wouldn't be as complicated as #1. I didn't expect big surgery on
the GHES module, but it seems there is no perfect solution without one.
I would vote for #1 to resolve the issue from the ground up. What do you think, Igor?
I'm also hoping Jonathan and Mauro can share their preferences.

> PS:
> I'd prefer Mauro's series being merged 1st (once it's resplit),
> for it refactors a bunch of original code and hopefully makes
> code easier to follow/extend.
> 

Sure. I won't start coding until the solution is confirmed. All the follow-up
work will be based on Mauro's series.

>> +        bql_lock();
>> +    }
>> +
>> +    if (ret == 0) {
>> +        kvm_inject_arm_sea(c);
>> +    } else {
>> +        error_report("Error %d to report memory error", ret);
>> +        abort();
>> +    }
>> +}
>> +
>>   void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>   {
>>       ram_addr_t ram_addr;
>> @@ -2387,12 +2411,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
>>                */
>>               if (code == BUS_MCEERR_AR) {
>>                   kvm_cpu_synchronize_state(c);
>> -                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, false)) {
>> -                    kvm_inject_arm_sea(c);
>> -                } else {
>> -                    error_report("failed to record the error");
>> -                    abort();
>> -                }
>> +                report_memory_error(c, paddr);
>>               }
>>               return;
>>           }
> 

Thanks,
Gavin
Jonathan Cameron Feb. 21, 2025, 11:04 a.m. UTC | #3
On Fri, 21 Feb 2025 15:27:36 +1000
Gavin Shan <gshan@redhat.com> wrote:

> On 2/20/25 3:55 AM, Igor Mammedov wrote:
> > On Fri, 14 Feb 2025 14:16:35 +1000
> > Gavin Shan <gshan@redhat.com> wrote:
> >   
> >> The error -1 is returned if the previously reported CPER error
> >> hasn't been claimed. The virtual machine is terminated due to
> >> abort(). It's conflicting to the ideal behaviour that the affected
> >> vCPU retries pushing the CPER error in this case since the vCPU
> >> can't proceed its execution.
> >>
> >> Move the chunk of code to push CPER error to a separate helper
> >> report_memory_errors() and retry the request when the return
> >> value from acpi_ghes_memory_errors() is greater than zero.
> >>
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   target/arm/kvm.c | 31 +++++++++++++++++++++++++------
> >>   1 file changed, 25 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> >> index 5c0bf99aec..9f063f6053 100644
> >> --- a/target/arm/kvm.c
> >> +++ b/target/arm/kvm.c
> >> @@ -2362,6 +2362,30 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
> >>       return ret;
> >>   }
> >>   
> >> +static void report_memory_error(CPUState *c, hwaddr paddr)
> >> +{
> >> +    int ret;
> >> +
> >> +    while (true) {
> >> +        /* Retry if the previously report error hasn't been claimed */
> >> +        ret = acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
> >> +        if (ret <= 0) {
> >> +            break;
> >> +        }
> >> +
> >> +        bql_unlock();
> >> +        g_usleep(1000);  
> 
> Igor, thanks for the detailed comments. Sorry for a bit delay of the reply, I
> was checking the code to understand it better :)

This is moderately tricky stuff, so I'm not 100% sure of some of the things
I said below, but I will be traveling for the next few weeks so I want to get
some comments out before that!

> 
> > even with bql released it's not safe to loop in here.
> > consider,
> >    a guest with 2 vcpus
> >      * vcpu 1 gets SIGBUS due to error
> >      * vcpu 2 trips over the same error and gets into this loop
> >      * on guest side vcpu 1 continues to run to handle SEA but
> >        might need to acquire a lock that vcpu 2 holds
> >   
> 
> Agreed.
> 
> > GHESv2 error source we support, can report several errors,
> > currently QEMU supports only 1 'error status block' which
> > can hold several error records (CPER) (though storage size is limited)
> > 
> > 1:
> > We can potentially add support for more GHESv2 error sources
> > with their own Read ACK registers (let's say =max_cpus)
> > (that is under assumption that no other error will be
> > triggered while guest VCPUs handle their own SEA (upto clearing Read ACK))

This one seems straightforward, but I'd kind of like to know whether real systems
do this (I'll try to find out about ours). I don't think there is
any association available between a CPU and an SEA source, so Linux at
least will just go looking for any sources that are active on each SEA.
Locking looks fine, but it won't help with performance.

> > 
> > 2:
> > Another way could be for QEMU to allocate more error status _blocks_
> > for the only one error source it has now and try to find
> > empty status block to inject new error(s).

Let me try to get my head around this one...

Each GHESv2 entry points, indirectly, to a single error status block at a time
(there is only one address to read it from). A curious quirk is that the length
of that error status block is fixed, since it's just a value in GHESv2 rather
than an indirection via a register; however, I think you can just make it 'big'.
So what I think you are proposing here is that on a read_ack write (which we
would need to monitor for), the value of the error status address register is
updated to point to the next in a queue of error blocks.

That can work. I'm not sure it actually gains us anything over just queueing in
QEMU and writing to the same error status block. Those status blocks can contain
multiple Generic Error Data entries, but unless we have a load of them gathered
up at the time of first notifying the guest, I'm not sure that helps us.

One thing I'm nervous about is that I can't actually find spec language
that says the OS 'must' re-read the error status address register on
each event. It isn't mentioned in the GHESv2 flow description, which just says:
"
These are the steps the OS must take once detecting an error from a particular GHESv2 error source:
• OSPM detects error (via interrupt/exception or polling the block status)
• OSPM copies the error status block
• OSPM clears the block status field of the error status block
• OSPM acknowledges the error via Read Ack register. For example:
– OSPM reads the Read Ack register –> X
– OSPM writes –> (( X & ReadAckPreserve) | ReadAckWrite)
"

The Linux code is confusing me, but I think it wonderfully changes
the fixmap on every access, since it needs to do an ioremap-type operation
under NMI conditions.

> >   * it can be saturated with high rate of errors (so what do we do in case it happens?)

Make it big :)  But sure big just makes the condition unlikely rather than solving it.

> >   * subject to race between clearing/setting Read ACK
> >      (maybe it can dealt with that on side by keeping internal read_ack counter)

I don't think there are any races as long as we update the register only on clear,
which I think should happen before the next SEA can occur. My understanding, which
may be wrong, is that the OS must just take a copy of the error status block and
set the read_ack, all in the exception handler.

> > 
> > 3:
> > And alternatively, queue incoming errors until read ack is cleared
> > and then inject pending errors in one go.
> > (problem with that is that at the moment QEMU doesn't monitor
> > read ack register memory so it won't notice guest clearing that)

We'd definitely need to monitor it. Injecting everything we have queued up
in one go here seems like a reasonable optimization over doing them
one at a time.

> > 
> > 
> > Given spec has provision for multiple error status blocks/error data entries
> > it seems that #2 is an expected way to deal with the problem.
> >   
> 
> I would say #1 is the ideal model because the read_ack_register is the bottleneck
> and it should be scaled up to max_cpus. In that way, the bottleneck can be avoided
> from the bottom. Another benefit with #1 is the error can be delivered immediately
> to the vCPU where the error was raised. This matches with the syntax of SEA to me.

I don't think it helps with the bottleneck, in Linux at least. A whole bunch of locks
are taken on each SEA because of the novel use of the fixmap. There is only one
VA ever used to access the error status blocks; we just change which PA it points to
under a spinlock. Maybe that can be improved on if we can persuade people that
error-handling performance is a thing to care about!

> 
> #2 still has the risk to saturate the multiple error status blocks if there are
> high rate of errors as you said. Besides, the vCPU where read_ack_register is acknoledged
> can be different from the vCPU where the error is raised, violating the syntax of
> SEA.
> 
> #3's drawback is to violate the syntax of SEA, similar to #2.
> 
> However, #2/#3 wouldn't be that complicated to #1. I didn't expect big surgery to
> GHES module, but it seems there isn't perfect solution without a big surgery.
> I would vote for #1 to resolve the issue from the ground. What do you think, Igor?
> I'm also hoping Jonathan and Mauro can provide their preference.

Ideally I'd like whatever we choose to look like what a bare-metal machine
does, mostly because we are then less likely to hit untested OS paths.

> 
> > PS:
> > I'd prefer Mauro's series being merged 1st (once it's resplit),
> > for it refactors a bunch of original code and hopefully makes
> > code easier to follow/extend.
> >   
> 
> Sure. I won't start the coding until the solution is confirmed. All the followup
> work will base on Mauro's series.
> 
> >> +        bql_lock();
> >> +    }
> >> +
> >> +    if (ret == 0) {
> >> +        kvm_inject_arm_sea(c);
> >> +    } else {
> >> +        error_report("Error %d to report memory error", ret);
> >> +        abort();
> >> +    }
> >> +}
> >> +
> >>   void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> >>   {
> >>       ram_addr_t ram_addr;
> >> @@ -2387,12 +2411,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> >>                */
> >>               if (code == BUS_MCEERR_AR) {
> >>                   kvm_cpu_synchronize_state(c);
> >> -                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, false)) {
> >> -                    kvm_inject_arm_sea(c);
> >> -                } else {
> >> -                    error_report("failed to record the error");
> >> -                    abort();
> >> -                }
> >> +                report_memory_error(c, paddr);
> >>               }
> >>               return;
> >>           }  
> >   
> 
> Thanks,
> Gavin
> 
>
Patch

diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 5c0bf99aec..9f063f6053 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -2362,6 +2362,30 @@  int kvm_arch_get_registers(CPUState *cs, Error **errp)
     return ret;
 }
 
+static void report_memory_error(CPUState *c, hwaddr paddr)
+{
+    int ret;
+
+    while (true) {
+        /* Retry if the previously reported error hasn't been claimed */
+        ret = acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
+        if (ret <= 0) {
+            break;
+        }
+
+        bql_unlock();
+        g_usleep(1000);
+        bql_lock();
+    }
+
+    if (ret == 0) {
+        kvm_inject_arm_sea(c);
+    } else {
+        error_report("failed to report memory error: %d", ret);
+        abort();
+    }
+}
+
 void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
 {
     ram_addr_t ram_addr;
@@ -2387,12 +2411,7 @@  void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
              */
             if (code == BUS_MCEERR_AR) {
                 kvm_cpu_synchronize_state(c);
-                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, false)) {
-                    kvm_inject_arm_sea(c);
-                } else {
-                    error_report("failed to record the error");
-                    abort();
-                }
+                report_memory_error(c, paddr);
             }
             return;
         }