Message ID | 1503916701-13516-7-git-send-email-gengdongjiu@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Dongjiu Geng, On 28/08/17 11:38, Dongjiu Geng wrote: > when userspace gets SIGBUS signal, it does not know whether > this is a synchronous external abort or SError, Why would Qemu/kvmtool need to know if the original notification (if there was one) was synchronous or asynchronous? This is between firmware and the kernel. I think I can see why you need this: to choose whether to emulate SEA or SEI, but what if the guest wasn't running? Or the guest was running, but it wasn't guest-memory that is affected. What happens if the dram-scrub hardware spots an error in guest memory, but the guest wasn't running? KVM won't have a relevant ESR value to give you. What happens if we start swapping a page of guest memory to disk, and discover the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM still can't give you an ESR. What about CPER records discovered through the polled interface? What happens if I write a PFN into the corrupt-pfn sysfs interface? I think what you need is some way of knowing if the BUS_MCEERR_A* was directly caused by a user-space (or guest) access, and if so was it a data or instruction fetch. These can become SEA notifications. KVM's user-space shouldn't be a special-case where the kernel behaves differently: if we tinker with this it needs to make sense for all user space processes and mean something on all architectures. I think this information could be useful to other users of these signals, e.g. a JVM could silently regenerate/reload code/data for a non-direct-access fault instead of exit-ing (or throwing an exception) for a direct access. For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an access or not. When the mm code gets -EHWPOISON when trying to resolve a user-space fault we know it was due to a direct-access. (I don't know if/how x86 can know if it was code or data). Faulting guest accesses through KVM are just a special version of this where KVM fixes-up stage2. ... but for any of this to work we need the address of the corrupt memory. (-> cover letter) Thanks, James
Hi James, On 2017/9/8 0:30, James Morse wrote: > Hi Dongjiu Geng, > > On 28/08/17 11:38, Dongjiu Geng wrote: >> when userspace gets SIGBUS signal, it does not know whether >> this is a synchronous external abort or SError, > > Why would Qemu/kvmtool need to know if the original notification (if there was > one) was synchronous or asynchronous? This is between firmware and the kernel. there are two reasons: 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors. 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source, so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table. etc/acpi/tables etc/hardware_errors ==================== ========================================== + +--------------------------+ +------------------+ | | HEST | | address | +--------------+ | +--------------------------+ | registers | | Error Status | | | GHES0 | | +----------------+ | Data Block 0 | | +--------------------------+ +--------->| |status_address0 |------------->| +------------+ | | ................. | | | +----------------+ | | CPER | | | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER | | | ................. | | | +----------------+ | | | .... | | | read_ack_register--------+-+ | | ............. | | | | CPER | | | read_ack_preserve | | | +------------------+ | | +-+------------+ | | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status | + +--------------------------+ | | | | +----------------+ | | | Data Block 1 | | | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+ + +--------------------------+ | | | +----------------+ | | | CPER | | | ................. | | | +--->| | ack_value1 | | | | CPER | | | error_status_address-----+---+ | | | +----------------+ | | | .... | | | ................. | | | | | ............. | | | | CPER | | | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+ | | read_ack_preserve | | +->| | ack_value10 | | | |.......... | | | read_ack_write | | | | +----------------+ | | +------------+ + +--------------------------| | | | | Error Status | | | ............... | | | | | Data Block 10| + +--------------------------+ | | +---->| +------------+ | | GHES10 | | | | | CPER | + +--------------------------+ | | | | CPER | | | ................. | | | | | .... | | | error_status_address-----+-----+ | | | CPER | | | ................. | | +-+------------+ | | read_ack_register--------+---------+ | | read_ack_preserve | | | read_ack_write | + +--------------------------+ > > > I think I can see why you need this: to choose whether to emulate SEA or SEI, emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI. > but what if the guest wasn't running? Or the guest was running, but it wasn't > guest-memory that is affected. If the guest was not running, host firmware will directly notify EL1 host kernel to handle the error, not notify hypervisor only if the guest was running host firmware can notify the Error to hypervisor. If the user space is Qemu, and the error is from Qemu, and guest-memory is not involve. I will not handle it, please see the code for arm64. void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr) { ram_addr_t ram_addr; hwaddr paddr; ARMCPU *cpu = ARM_CPU(c); CPUARMState *env = &cpu->env; assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO); if (addr) { ram_addr = qemu_ram_addr_from_host(addr); if (ram_addr != RAM_ADDR_INVALID && kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) { kvm_cpu_synchronize_state(c); kvm_hwpoison_page_add(ram_addr); if (is_abort_sea(env->exception.syndrome)) { ghes_update_guest(ACPI_HEST_NOTIFY_SEA, paddr); kvm_inject_arm_sea(c); } else if (is_abort_sei(env->exception.syndrome)) { ghes_update_guest(ACPI_HEST_NOTIFY_SEI, paddr); kvm_inject_arm_sei(c); } return; } fprintf(stderr, "Hardware memory error for memory used by " "QEMU itself instead of guest system!\n"); } if (code == BUS_MCEERR_AR) { fprintf(stderr, "Hardware memory error!\n"); exit(1); } } For the x86, it also does not handle it, it only print "Hardware memory error for memory used by QEMU itself instead of guest system!" void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr) { X86CPU *cpu = X86_CPU(c); CPUX86State *env = &cpu->env; ram_addr_t ram_addr; hwaddr paddr; /* If we get an action required MCE, it has been injected by KVM * while the VM was running. An action optional MCE instead should * be coming from the main thread, which qemu_init_sigbus identifies * as the "early kill" thread. */ assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO); if ((env->mcg_cap & MCG_SER_P) && addr) { ram_addr = qemu_ram_addr_from_host(addr); if (ram_addr != RAM_ADDR_INVALID && kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) { kvm_hwpoison_page_add(ram_addr); kvm_mce_inject(cpu, paddr, code); return; } fprintf(stderr, "Hardware memory error for memory used by " "QEMU itself instead of guest system!\n"); } if (code == BUS_MCEERR_AR) { hardware_memory_error(); } /* Hope we are lucky for AO MCE */ } > > What happens if the dram-scrub hardware spots an error in guest memory, but the > guest wasn't running? KVM won't have a relevant ESR value to give you. if the dram-scrub hardware spots an error in guest memory, it will generate IRQ in DDR controller, not SEA or SEI exception. I still do not consider the GSIV. For GSIV, may be we can only handle it in the host OS. > > What happens if we start swapping a page of guest memory to disk, and discover > the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM > still can't give you an ESR. I think this Error is reported by IRQ(GSIV), GSIV is not SEA/SEI, we should not give the ESR to them. > > What about CPER records discovered through the polled interface? What happens if > I write a PFN into the corrupt-pfn sysfs interface? I do not understand this question. I think in the process it should report SEA notification when CPU consume the error page. > > > I think what you need is some way of knowing if the BUS_MCEERR_A* was directly > caused by a user-space (or guest) access, and if so was it a data or instruction when user space received the signal, it will judge whether the memory address is user-space (or guest) address > fetch. These can become SEA notifications. In fact, it can be SEI, not always SEA, why it will always SEA notifications? If the memory properties of data is device type, it may become SEI notification. > > KVM's user-space shouldn't be a special-case where the kernel behaves > differently: if we tinker with this it needs to make sense for all user space > processes and mean something on all architectures. > > I think this information could be useful to other users of these signals, e.g. a > JVM could silently regenerate/reload code/data for a non-direct-access fault > instead of exit-ing (or throwing an exception) for a direct access. > > For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an > access or not. When the mm code gets -EHWPOISON when trying to resolve a Because of that, so I allow userspace getting exception information > user-space fault we know it was due to a direct-access. (I don't know if/how x86 > can know if it was code or data). Faulting guest accesses through KVM are just a > special version of this where KVM fixes-up stage2. > > ... but for any of this to work we need the address of the corrupt memory. > (-> cover letter) > > > Thanks, > > James > > . >
Hi gengdongjiu, (re-ordered hunks) On 13/09/17 08:32, gengdongjiu wrote: > On 2017/9/8 0:30, James Morse wrote: >> On 28/08/17 11:38, Dongjiu Geng wrote: >> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by >> an access or not. Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is x86's kernel-first handling, which nicely matches this 'direct access' problem. BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc also triggers these directly, both from what look to be synchronous paths, so I think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO to something_else. I don't think we need anything else. >> When the mm code gets -EHWPOISON when trying to resolve a > > Because of that, so I allow userspace getting exception information ... and there are cases where you can't get the exception information, and other cases where it wasn't an exception at all. [...] >> What happens if the dram-scrub hardware spots an error in guest memory, but >> the guest wasn't running? KVM won't have a relevant ESR value to give you. > if the dram-scrub hardware spots an error in guest memory, it will generate > IRQ in DDR controller, not SEA or SEI exception. I still do not consider the > GSIV. For GSIV, may be we can only handle it in the host OS. Great example: this IRQ pulls us out of a guest, we tromp through APEI and then memory_failure(), the memory happened to belong to the same guest (coincidence!), we send it some signal and now its user-space's problem. Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though the notification interrupted the guest, and it was guest memory that was affected. KVM doesn't have a relevant ESR. I'm strongly against exposing 'which notification type' this error originally came from because: * it doesn't matter once we've got the CPER records, * there isn't always an answer (there are/will-be other ways of tripping memory_failure()) * it creates ABI between firwmare, host userspace and guest userspace. Firmware's choice of notification type shouldn't affect anything other than the host kernel. On 13/09/17 08:32, gengdongjiu wrote: > On 2017/9/8 0:30, James Morse wrote: >> On 28/08/17 11:38, Dongjiu Geng wrote: >>> when userspace gets SIGBUS signal, it does not know whether >>> this is a synchronous external abort or SError, >> >> Why would Qemu/kvmtool need to know if the original notification (if there was >> one) was synchronous or asynchronous? This is between firmware and the kernel. > there are two reasons: > > 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors. > 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source, > so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table. user-space can choose whether to use SEA or SEI, it doesn't have to choose the same notification type that firmware used, which in turn doesn't have to be the same as that used by the CPU to notify firmware. The choice only matters because these notifications hang on an existing pieces of the Arm-architecture, so the notification can only add to the architecturally defined meaning. (i.e. You can only send an SEA for something that can already be described as a synchronous external abort). Once we get to user-space, for memory_failure() notifications, (which so far is all we are talking about here), the only thing that could matter is whether the guest hit a PG_hwpoison page as a stage2 fault. These can be described as Synchronous-External-Abort. The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU because it can't always make an error synchronous. For memory_failure() notifications to a KVM guest we really can do this, and we already have this behaviour for free. An example: A guest touches some hardware:poisoned memory, for whatever reason the CPU can't put the world back together to make this a synchronous exception, so it reports it to firmware as an SError-interrupt. Linux gets an APEI notification and memory_failure() causes the affected page to be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> action optional, probably asynchronous. But in our example it wasn't really asynchronous, that was just a property of the original CPU->firmware notification. What happens? The guest vcpu is re-run, it re-runs the same instructions (this was a contained error so KVM's ELR points at/before the instruction that steps in the problem). This time KVM takes a stage2 fault, which the mm code will refuse to fixup because the relevant page was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA. > etc/acpi/tables etc/hardware_errors > ==================== ========================================== > + +--------------------------+ +------------------+ > | | HEST | | address | +--------------+ > | +--------------------------+ | registers | | Error Status | > | | GHES0 | | +----------------+ | Data Block 0 | > | +--------------------------+ +--------->| |status_address0 |------------->| +------------+ > | | ................. | | | +----------------+ | | CPER | > | | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER | > | | ................. | | | +----------------+ | | | .... | > | | read_ack_register--------+-+ | | ............. | | | | CPER | > | | read_ack_preserve | | | +------------------+ | | +-+------------+ > | | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status | > + +--------------------------+ | | | | +----------------+ | | | Data Block 1 | > | | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+ > + +--------------------------+ | | | +----------------+ | | | CPER | > | | ................. | | | +--->| | ack_value1 | | | | CPER | > | | error_status_address-----+---+ | | | +----------------+ | | | .... | > | | ................. | | | | | ............. | | | | CPER | > | | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+ > | | read_ack_preserve | | +->| | ack_value10 | | | |.......... | > | | read_ack_write | | | | +----------------+ | | +------------+ > + +--------------------------| | | | | Error Status | > | | ............... | | | | | Data Block 10| > + +--------------------------+ | | +---->| +------------+ > | | GHES10 | | | | | CPER | > + +--------------------------+ | | | | CPER | > | | ................. | | | | | .... | > | | error_status_address-----+-----+ | | | CPER | > | | ................. | | +-+------------+ > | | read_ack_register--------+---------+ > | | read_ack_preserve | > | | read_ack_write | > + +--------------------------+ > (nice ascii art!) >> I think I can see why you need this: to choose whether to emulate SEA or SEI, > emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI. (This doesn't matter: Generate the CPER records after you've chosen the notification and this isn't a problem. Or map your 'Error Status Data Blocks' to status_address* depending on usage not in a fixed 1:1 way) >> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly >> caused by a user-space (or guest) access, and if so was it a data or instruction > when user space received the signal, it will judge whether the memory address is user-space (or guest) address >> fetch. These can become SEA notifications. > In fact, it can be SEI, not always SEA, why it will always SEA notifications? > If the memory properties of data is device type, it may become SEI notification. Let's take a step back: in what scenario should we use an emulated-SEA instead of an emulated-SEI? (forget what the CPU and firmware did, this is up to Qemu to decide). It can use SEA if this is a valid Synchronous-external-abort. Stage 2 faults are synchronous exceptions, if you hit a PG_hwpoision page on this path you can report this back to the guest as a Synchronous-external-abort/SEA. How do you know? You get SIGBUS_MCEERR_AR from KVM. Thanks, James
James, Thanks for your comments, hope we can make the solution better. On 2017/9/14 21:00, James Morse wrote: > Hi gengdongjiu, > > (re-ordered hunks) > > On 13/09/17 08:32, gengdongjiu wrote: >> On 2017/9/8 0:30, James Morse wrote: >>> On 28/08/17 11:38, Dongjiu Geng wrote: >>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by >>> an access or not. > > Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via > some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is > x86's kernel-first handling, which nicely matches this 'direct access' problem. > BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc > also triggers these directly, both from what look to be synchronous paths, so I > think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO > to something_else. James, thanks for your explanation. can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access? Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error? In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest; if it is "BUS_MCEERR_AO", we use SEI notification type for the guest. Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type? > > I don't think we need anything else. > > >>> When the mm code gets -EHWPOISON when trying to resolve a >> >> Because of that, so I allow userspace getting exception information > > ... and there are cases where you can't get the exception information, and other > cases where it wasn't an exception at all. > > [...] > > >>> What happens if the dram-scrub hardware spots an error in guest memory, but >>> the guest wasn't running? KVM won't have a relevant ESR value to give you. > >> if the dram-scrub hardware spots an error in guest memory, it will generate >> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the >> GSIV. For GSIV, may be we can only handle it in the host OS. > > Great example: this IRQ pulls us out of a guest, we tromp through APEI and then > memory_failure(), the memory happened to belong to the same guest > (coincidence!), we send it some signal and now its user-space's problem. > > Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though > the notification interrupted the guest, and it was guest memory that was > affected. KVM doesn't have a relevant ESR. > > > I'm strongly against exposing 'which notification type' this error originally > came from because: > * it doesn't matter once we've got the CPER records, > * there isn't always an answer (there are/will-be other ways of tripping > memory_failure()) > * it creates ABI between firwmare, host userspace and guest userspace. > Firmware's choice of notification type shouldn't affect anything other than > the host kernel. > > > On 13/09/17 08:32, gengdongjiu wrote: >> On 2017/9/8 0:30, James Morse wrote: >>> On 28/08/17 11:38, Dongjiu Geng wrote: >>>> when userspace gets SIGBUS signal, it does not know whether >>>> this is a synchronous external abort or SError, >>> >>> Why would Qemu/kvmtool need to know if the original notification (if there was >>> one) was synchronous or asynchronous? This is between firmware and the kernel. > >> there are two reasons: >> >> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors. >> 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source, >> so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table. > > user-space can choose whether to use SEA or SEI, it doesn't have to choose the > same notification type that firmware used, which in turn doesn't have to be the > same as that used by the CPU to notify firmware. > > The choice only matters because these notifications hang on an existing pieces > of the Arm-architecture, so the notification can only add to the architecturally > defined meaning. (i.e. You can only send an SEA for something that can already > be described as a synchronous external abort). > > Once we get to user-space, for memory_failure() notifications, (which so far is > all we are talking about here), the only thing that could matter is whether the > guest hit a PG_hwpoison page as a stage2 fault. These can be described as > Synchronous-External-Abort. > > The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU > because it can't always make an error synchronous. For memory_failure() > notifications to a KVM guest we really can do this, and we already have this > behaviour for free. An example: > > A guest touches some hardware:poisoned memory, for whatever reason the CPU can't > put the world back together to make this a synchronous exception, so it reports > it to firmware as an SError-interrupt. > Linux gets an APEI notification and memory_failure() causes the affected page to > be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. > > Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> > action optional, probably asynchronous. If so, in this case, Qemu/kvmtool only got a little information(receive a SIGBUS), for this SIGBUS, it only include the SIGBUS_MCEERR_AO, error address. not include other information, only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification. for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification. so user space will be confused to use which method. I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough. how we provide other extra information to let it choose the proper notification? > > But in our example it wasn't really asynchronous, that was just a property of > the original CPU->firmware notification. What happens? The guest vcpu is re-run, > it re-runs the same instructions (this was a contained error so KVM's ELR points > at/before the instruction that steps in the problem). This time KVM takes a > stage2 fault, which the mm code will refuse to fixup because the relevant page > was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with > SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA. > > > > >> etc/acpi/tables etc/hardware_errors >> ==================== ========================================== >> + +--------------------------+ +------------------+ >> | | HEST | | address | +--------------+ >> | +--------------------------+ | registers | | Error Status | >> | | GHES0 | | +----------------+ | Data Block 0 | >> | +--------------------------+ +--------->| |status_address0 |------------->| +------------+ >> | | ................. | | | +----------------+ | | CPER | >> | | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER | >> | | ................. | | | +----------------+ | | | .... | >> | | read_ack_register--------+-+ | | ............. | | | | CPER | >> | | read_ack_preserve | | | +------------------+ | | +-+------------+ >> | | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status | >> + +--------------------------+ | | | | +----------------+ | | | Data Block 1 | >> | | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+ >> + +--------------------------+ | | | +----------------+ | | | CPER | >> | | ................. | | | +--->| | ack_value1 | | | | CPER | >> | | error_status_address-----+---+ | | | +----------------+ | | | .... | >> | | ................. | | | | | ............. | | | | CPER | >> | | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+ >> | | read_ack_preserve | | +->| | ack_value10 | | | |.......... | >> | | read_ack_write | | | | +----------------+ | | +------------+ >> + +--------------------------| | | | | Error Status | >> | | ............... | | | | | Data Block 10| >> + +--------------------------+ | | +---->| +------------+ >> | | GHES10 | | | | | CPER | >> + +--------------------------+ | | | | CPER | >> | | ................. | | | | | .... | >> | | error_status_address-----+-----+ | | | CPER | >> | | ................. | | +-+------------+ >> | | read_ack_register--------+---------+ >> | | read_ack_preserve | >> | | read_ack_write | >> + +--------------------------+ >> > > (nice ascii art!) > >>> I think I can see why you need this: to choose whether to emulate SEA or SEI, > >> emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI. > > (This doesn't matter: Generate the CPER records after you've chosen the > notification and this isn't a problem. Or map your 'Error Status Data Blocks' > to status_address* depending on usage not in a fixed 1:1 way) > > >>> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly >>> caused by a user-space (or guest) access, and if so was it a data or instruction > >> when user space received the signal, it will judge whether the memory address is user-space (or guest) address > >>> fetch. These can become SEA notifications. > >> In fact, it can be SEI, not always SEA, why it will always SEA notifications? >> If the memory properties of data is device type, it may become SEI notification. > > Let's take a step back: in what scenario should we use an emulated-SEA instead > of an emulated-SEI? (forget what the CPU and firmware did, this is up to Qemu to > decide). > > It can use SEA if this is a valid Synchronous-external-abort. Stage 2 faults are > synchronous exceptions, if you hit a PG_hwpoision page on this path you can > report this back to the guest as a Synchronous-external-abort/SEA. > How do you know? You get SIGBUS_MCEERR_AR from KVM. I understand your idea, I agree we can use SEA notification for guest if it is SIGBUS_MCEERR_AR. I am worried about the SIGBUS_MCEERR_AO. if it is SIGBUS_MCEERR_AO, we can choose SEI, IRQ or POLLed notification. so according to what principles to choose that? In my platform, there is another issue. for the stage2 fault, we get the IPA from the HPFAR_EL2 register, but for huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000), not translation error(DFSC[5:0] is 0b0101xx), the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM code, we get the IPA from the HPFAR_EL2, so we can not get the right IPA value, because its value is zero.I do not know whether you have same issue. Although hpfar_el2 does not record IPA, but host firmware can still record the PA If call memory_failure(), memory_failure can translate the PA to host VA, then deliver host VA to Qemu. Qemu can translate the host VA to IPA. so we rely on memory_failure() to get the IPA. > > > Thanks, > > James > > . >
Hi James On 2017/9/14 21:00, James Morse wrote: > Hi gengdongjiu, > user-space can choose whether to use SEA or SEI, it doesn't have to choose the > same notification type that firmware used, which in turn doesn't have to be the > same as that used by the CPU to notify firmware. > > The choice only matters because these notifications hang on an existing pieces > of the Arm-architecture, so the notification can only add to the architecturally > defined meaning. (i.e. You can only send an SEA for something that can already > be described as a synchronous external abort). > > Once we get to user-space, for memory_failure() notifications, (which so far is > all we are talking about here), the only thing that could matter is whether the > guest hit a PG_hwpoison page as a stage2 fault. These can be described as > Synchronous-External-Abort. > > The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU > because it can't always make an error synchronous. For memory_failure() > notifications to a KVM guest we really can do this, and we already have this > behaviour for free. An example: > > A guest touches some hardware:poisoned memory, for whatever reason the CPU can't > put the world back together to make this a synchronous exception, so it reports > it to firmware as an SError-interrupt. > Linux gets an APEI notification and memory_failure() causes the affected page to > be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. > > Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> > action optional, probably asynchronous. > > But in our example it wasn't really asynchronous, that was just a property of > the original CPU->firmware notification. What happens? The guest vcpu is re-run, > it re-runs the same instructions (this was a contained error so KVM's ELR points > at/before the instruction that steps in the problem). This time KVM takes a > stage2 fault, which the mm code will refuse to fixup because the relevant page > was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with > SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA. CC Achin I have some personal opinion, if you think it is not right, hope you can point out. Synchronous External Abort and SError Interrupt are hardware exception(hardware concept), which is independent of software notification, in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to better describe the two exceptions, so use SEA and SEI notification to stand for them. SEA notification stands for Synchronous External Abort, so may be it is not only a notification, it also stands for a hardware error type. SEI notification stands for SError Interrupt, so may be it is not only a notification, it also stands for a hardware error type. In the OS, it has different handling flow to the two exception(two notification): when the guest OS running, if the hardware generates a Synchronous External Abort, we told the guest OS this error is SError Interrupt instead of Synchronous External Abort. guest OS uses SEI notification handling flow to deal with it, I am not sure whether it will have problem, because the true hardware exception is Synchronous External Abort, but software treats it as SError interrupt to handle. In the mainline code, it does not have SEI notification support, the reason I think it is because of the error address record by firmware is not accurate(SError Interrupt is asynchronous exception). so if treat a hardware Synchronous External Abort as SError interrupt(SEI). The default OS behavior for SEI is PANIC, that is to say, when hardware triggers a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI), the OS will be panic. in fact, it can be recoverable instead of Panic. I ever added a patch to support the SEI notification, but not sure whether it is can be accepted by open source, until now, not receive response.
Hi gengdongjiu, On 18/09/17 14:36, gengdongjiu wrote: > On 2017/9/14 21:00, James Morse wrote: >> On 13/09/17 08:32, gengdongjiu wrote: >>> On 2017/9/8 0:30, James Morse wrote: >>>> On 28/08/17 11:38, Dongjiu Geng wrote: >>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by >>>> an access or not. >> >> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via >> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is >> x86's kernel-first handling, which nicely matches this 'direct access' problem. >> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc >> also triggers these directly, both from what look to be synchronous paths, so I >> think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO >> to something_else. > > James, thanks for your explanation. > can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access? Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point was I can't find a case where Action-Required is used for an error that isn't synchronous. We should run this past the people who maintain the existing BUS_MCEERR_AR users, in case its just a severity to them. > Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error? How would userspace get one of these memory errors for a PCIe error? > In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest; > if it is "BUS_MCEERR_AO", we use SEI notification type for the guest. > Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type? This is for Qemu/kvmtool to decide, it depends on what sort of machine they are emulating. For example, the physical machine's memory-controller may notify the CPU about memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also routed to EL3. By the time this gets to the host kernel the distinction doesn't matter. The host has handled the error. For a guest, your memory-controller is effectively the host kernel. It will give you an BUS_MCEERR_AO signal for any guest memory that is affected, and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory. What Qemu/kvmtool do with this is up to them. If they're emulating a machine with no RAS features, printing an error and exit. Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless the affected vcpu has interrupts masked, in which case an SEA notification gives you some NMI-like behaviour. For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would be IRQ, as you can't know if the guest supports SEI and it would be a shame to kill it with an SError if the affected memory was free. SEA for synchronous errors is still a good choice even if the guest doesn't support it as that memory is still gone so its still a valid guest:Synchronous-external-abort. [...] >>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors. >> user-space can choose whether to use SEA or SEI, it doesn't have to choose the >> same notification type that firmware used, which in turn doesn't have to be the >> same as that used by the CPU to notify firmware. >> >> The choice only matters because these notifications hang on an existing pieces >> of the Arm-architecture, so the notification can only add to the architecturally >> defined meaning. (i.e. You can only send an SEA for something that can already >> be described as a synchronous external abort). >> >> Once we get to user-space, for memory_failure() notifications, (which so far is >> all we are talking about here), the only thing that could matter is whether the >> guest hit a PG_hwpoison page as a stage2 fault. These can be described as >> Synchronous-External-Abort. >> >> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU >> because it can't always make an error synchronous. For memory_failure() >> notifications to a KVM guest we really can do this, and we already have this >> behaviour for free. An example: >> >> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't >> put the world back together to make this a synchronous exception, so it reports >> it to firmware as an SError-interrupt. > >> Linux gets an APEI notification and memory_failure() causes the affected page to >> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. >> >> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> >> action optional, probably asynchronous. > If so, in this case, Qemu/kvmtool only got a little information(receive a SIGBUS), for this SIGBUS, > it only include the SIGBUS_MCEERR_AO, error address. not include other information, > only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification. The kernel can't tell it which to use: user space has to decide. This has to be a property of the machine you are emulating, not the machine you happen to be running on. What happens if the notification came using a future notification type that user space doesn't know about. What if user space does know about this type, but the guest doesn't. What if you migrate to a machine that uses a new notification type that you didn't advertise to the guest via the HEST at boot time. These dependencies have to break somewhere, and the right place is between the host kernel and host user-space. This way whatever Qemu/kvmtool do will work in the above 'what-ifs'. > for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification. > so user space will be confused to use which method. There isn't a wrong choice here. I suggest always-use-IRQ. Its faster than POLLed, but won't kill a guest that doesn't support NOTIFY_SEI. > I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough. > how we provide other extra information to let it choose the proper notification? Forget the original notification. This physical machine's hardware configuration and how its memory controller is wired up to report errors should not be relevant to Qemu/kvmtool. You need to decide how your emulated platform reports errors, you may want to make it a configuration option which defaults to something safe. [...] > In my platform, there is another issue. > for the stage2 fault, we get the IPA from the HPFAR_EL2 register, > but for huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000), 'Synchronous External Abort, not on a translation table walk' > not translation error(DFSC[5:0] is 0b0101xx), (the set of external abort, parity or ECC errors that we get from the page-table-walker) > the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM code, we get the IPA from the HPFAR_EL2, so > we can not get the right IPA value, because its value is zero.I do not know whether you have same issue. This is something the ARM-ARM allows, so we have to live with it in software. For external aborts the ESR has a 'FnV' bit 10 that for your first DSFSC 'Synchronous External Abort, not on a translation table walk' indicates there is no FAR, (or presumably HPFAR). I assume you have this bit set in the ESR. This shouldn't be a problem, for firmware-first we should take the address from the CPER records as this also gives us a range. For kernel-first we'd take whatever is in the v8.2 RAS ERR records. Its only if this wasn't a RAS error that we're likely to print out this address as we kill-the-task/panic. > Although hpfar_el2 does not record IPA, but host firmware can still record the PA I agree, it can get the PA from the v8.2 RAS ERR registers and hand it to the OS using CPER. > If call memory_failure(), memory_failure can translate the PA to host VA, then deliver > host VA to Qemu. Yes, this is how it works for any user-space process as two processes sharing the same page may map it in different locations. > Qemu can translate the host VA to IPA. so we rely on memory_failure() to get > the IPA. Yes. I don't see why this is a problem: The kernel isn't going to pass RAS events into the guest, so it never needs to know the IPA. Instead we notify user-space about ranges of memory affected by memory_failure(), KVM's user-space isn't a special case here. As you point out, if Qemu wants to notify the guest it can calculate the IPA and either use CPER for firmware-first, or in the future, update some representation of the v8.2 ERR records once we can virtualise kernel-first. (I'm not sure I understand your point here, but I don't think we disagree,) Thanks, James
Hi gengdongjiu, On 21/09/17 08:55, gengdongjiu wrote: > On 2017/9/14 21:00, James Morse wrote: >> user-space can choose whether to use SEA or SEI, it doesn't have to choose the >> same notification type that firmware used, which in turn doesn't have to be the >> same as that used by the CPU to notify firmware. >> >> The choice only matters because these notifications hang on an existing pieces >> of the Arm-architecture, so the notification can only add to the architecturally >> defined meaning. (i.e. You can only send an SEA for something that can already >> be described as a synchronous external abort). >> >> Once we get to user-space, for memory_failure() notifications, (which so far is >> all we are talking about here), the only thing that could matter is whether the >> guest hit a PG_hwpoison page as a stage2 fault. These can be described as >> Synchronous-External-Abort. >> >> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU >> because it can't always make an error synchronous. For memory_failure() >> notifications to a KVM guest we really can do this, and we already have this >> behaviour for free. An example: >> >> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't >> put the world back together to make this a synchronous exception, so it reports >> it to firmware as an SError-interrupt. >> Linux gets an APEI notification and memory_failure() causes the affected page to >> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. >> >> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO-> >> action optional, probably asynchronous. >> >> But in our example it wasn't really asynchronous, that was just a property of >> the original CPU->firmware notification. What happens? The guest vcpu is re-run, >> it re-runs the same instructions (this was a contained error so KVM's ELR points >> at/before the instruction that steps in the problem). This time KVM takes a >> stage2 fault, which the mm code will refuse to fixup because the relevant page >> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with >> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA. > > CC Achin > > I have some personal opinion, if you think it is not right, hope you can point out. > > Synchronous External Abort and SError Interrupt are hardware exception(hardware concept), > which is independent of software notification, > in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to > better describe the two exceptions, so use SEA and SEI notification to stand for them. > SEA notification stands for Synchronous External Abort, so may be it is not only a > notification, it also stands for a hardware error type. > SEI notification stands for SError Interrupt, so may be it is not only a notification, > it also stands for a hardware error type. > In the OS, it has different handling flow to the two exception(two notification): > when the guest OS running, if the hardware generates a Synchronous External Abort, we > told the guest OS this error is SError Interrupt instead of Synchronous External Abort. This should only happen when APEI doesn't claim the external-abort as a RAS notification. If there were CPER records to process then the error is handled by the host, and we can return to the guest. If this wasn't a firmware-first notification, then you're right KVM hands the guest an asynchronous external abort. This could be considered a bug in KVM. (we can discuss with Marc and Christoffer what it should do), but: I'm not sure what scenario you could see this in: surely all your CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first notifications. So they should always be claimed by APEI. > guest OS uses SEI notification handling flow to deal with it, I am not sure whether it > will have problem, because the true hardware exception is Synchronous External > Abort, but software treats it as SError interrupt to handle. Once you're into a guest the original 'true hardware exception' shouldn't matter. In this scenario KVM has handed the guest an SError, our question is 'is it an SEI notification?': For firmware first the guest OS should poke around in the CPER buffers, find nothing to do, and return to the arch code for (future) kernel-first. For kernel first the guest OS should trawl through the v8.2 ERR registers, find nothing to do, and continue to the default case: By default, we should panic on SError, unless its classified as a non-fatal RAS error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is no work to do). What you may be seeing is some awkwardness with the change in the SError ESR with v8.2. Previously the VSE mechanism injected an impdef SError, (but they were all impdef so it didn't matter). With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this means 'classified as a RAS error ... unknown!'. I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort code to specify an impdef ESR for this path. > In the mainline code, it does not have SEI notification support, the reason I > think it is because of the error address record by firmware is not accurate > (SError Interrupt is asynchronous exception). Yes, while we don't expect a FAR with an SError, but we do expect a valid representation of the RAS error in either the CPER records or the v8.2. ERR registers (or both). If we have neither of those, its not a RAS error and we should panic. > so if treat a hardware Synchronous External Abort as SError interrupt(SEI). > The default OS behavior for SEI is PANIC, that is to say, when hardware triggers > a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI), > the OS will be panic. in fact, it can be recoverable instead of Panic. If its a RAS error APEI (or in the future, the kernel-first handler), should claim the error, so the guest never sees it. If you are hitting this behaviour in KVM, then it wasn't a RAS error. > I ever added a patch to support the SEI notification, but not sure whether > it is can be accepted by open source, until now, not receive response. The patch you posted during the merge window made no sense on its own, so must replace $one_of the other patches in your v5 (or was it v6)... I'll get to it... Because the SEI notification depends on v8.2 I'd like to get the SError/RAS series posted (currently re-testing), then I'll pick up enough of the patches you've posted for a consolidated version of the series, and we can take the discussion from there. I'd still like to know what your firmware does if the normal-world believes its masked physical-SError and you want to hand it an SEI notification. Thanks, James
Hi James, Thank you for your reply. On 2017/9/23 0:39, James Morse wrote: > Hi gengdongjiu, > > On 18/09/17 14:36, gengdongjiu wrote: >> On 2017/9/14 21:00, James Morse wrote: >>> On 13/09/17 08:32, gengdongjiu wrote: >>>> On 2017/9/8 0:30, James Morse wrote: >>>>> On 28/08/17 11:38, Dongjiu Geng wrote: >>>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are >>>>> caused by an access or not. >>> >>> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be >>> triggered via some CPER flags, but its not. The only code that flags >>> MF_ACTION_REQUIRED is x86's kernel-first handling, which nicely matches this 'direct access' problem. >>> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 >>> equivalent). Powerpc also triggers these directly, both from what >>> look to be synchronous paths, so I think its fair to equate >>> BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO to something_else. >> >> James, thanks for your explanation. >> can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access? > > Not 'stands for', as the AR is Action-Required and AO Action-Optional. > My point was I can't find a case where Action-Required is used for an > error that isn't synchronous. Ok, understand it. Thanks for your explanation. > > We should run this past the people who maintain the existing > BUS_MCEERR_AR users, in case its just a severity to them. Ok. > > >> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error? > > How would userspace get one of these memory errors for a PCIe error? seems Ok. Now I only add the support for the host SEI and SEA virtualization. For the PCIe error, I still do not consider much it. maybe we need to consider that afterwards. > > >> In the user space, we can check the si_code, if it is >> "BUS_MCEERR_AR", we use SEA notification type for the guest; if it is "BUS_MCEERR_AO", we use SEI notification type for the guest. >> Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type? > > This is for Qemu/kvmtool to decide, it depends on what sort of machine > they are emulating. > > For example, the physical machine's memory-controller may notify the > CPU about memory errors by triggering SError trapped to EL3, or with a > dedicated FIQ, also routed to EL3. By the time this gets to the host > kernel the distinction doesn't matter. The host has handled the error. > > For a guest, your memory-controller is effectively the host kernel. It > will give you an BUS_MCEERR_AO signal for any guest memory that is > affected, and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory. > > What Qemu/kvmtool do with this is up to them. If they're emulating a > machine with no RAS features, printing an error and exit. > > Otherwise BUS_MCEERR_AR could be notified as one of the flavours of > IRQ, unless the affected vcpu has interrupts masked, in which case an > SEA notification gives you some NMI-like behaviour. Thanks for explanation. Now that SEA notification can provide NMI-like behaviour. How about we use it for BUS_MCEERR_AR to avoid check the interrupts mask status? Even though guest OS not support SEA notification, It is still a valid guest:Synchronous-external-abort > > For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My > choice would be IRQ, as you can't know if the guest supports SEI and > it would be a shame to How about we first check whether user space can specify the virtual SError Exception Syndrome(have vsesr_el2 register)? If can specify, using SEI notification, otherwise use IRQ notification. The advantage is that it can pass more error information than IRQ if can specify Syndrome information. > kill it with an SError if the affected memory was free. SEA for > synchronous errors is still a good choice even if the guest doesn't > support it as that memory is still gone so its still a valid guest:Synchronous-external-abort. Yes, thanks > > > [...] > >>>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors. > >>> user-space can choose whether to use SEA or SEI, it doesn't have to >>> choose the same notification type that firmware used, which in turn >>> doesn't have to be the same as that used by the CPU to notify firmware. >>> >>> The choice only matters because these notifications hang on an >>> existing pieces of the Arm-architecture, so the notification can >>> only add to the architecturally defined meaning. (i.e. You can only >>> send an SEA for something that can already be described as a synchronous external abort). >>> >>> Once we get to user-space, for memory_failure() notifications, >>> (which so far is all we are talking about here), the only thing that >>> could matter is whether the guest hit a PG_hwpoison page as a stage2 >>> fault. These can be described as Synchronous-External-Abort. >>> >>> The Synchronous-External-Abort/SError-Interrupt distinction matters >>> for the CPU because it can't always make an error synchronous. For >>> memory_failure() notifications to a KVM guest we really can do this, >>> and we already have this behaviour for free. An example: >>> >>> A guest touches some hardware:poisoned memory, for whatever reason >>> the CPU can't put the world back together to make this a synchronous >>> exception, so it reports it to firmware as an SError-interrupt. >> >>> Linux gets an APEI notification and memory_failure() causes the >>> affected page to be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space. >>> >>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed >>> notification. AO-> action optional, probably asynchronous. > >> If so, in this case, Qemu/kvmtool only got a little >> information(receive a SIGBUS), for this SIGBUS, it only include the >> SIGBUS_MCEERR_AO, error address. not include other information, only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification. > > The kernel can't tell it which to use: user space has to decide. This > has to be a property of the machine you are emulating, not the machine > you happen to be running on. > > What happens if the notification came using a future notification type > that user space doesn't know about. > What if user space does know about this type, but the guest doesn't. > What if you migrate to a machine that uses a new notification type > that you didn't advertise to the guest via the HEST at boot time. > > These dependencies have to break somewhere, and the right place is > between the host kernel and host user-space. This way whatever > Qemu/kvmtool do will work in the above 'what-ifs'. > > >> for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification. >> so user space will be confused to use which method. > > There isn't a wrong choice here. I suggest always-use-IRQ. Its faster > than POLLed, but won't kill a guest that doesn't support NOTIFY_SEI. As I said above, how about we first check we can specify the virtual SError Exception Syndrome(have vsesr_el2 register)? If can specify, using SEI notification, otherwise use IRQ notification. The advantage is that it can pass more Syndrome information to guest. > > >> I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough. >> how we provide other extra information to let it choose the proper notification? > > Forget the original notification. This physical machine's hardware > configuration and how its memory controller is wired up to report > errors should not be relevant to Qemu/kvmtool. > > You need to decide how your emulated platform reports errors, you may > want to make it a configuration option which defaults to something safe. Ok, thanks. > > [...] > >> In my platform, there is another issue. >> for the stage2 fault, we get the IPA from the HPFAR_EL2 register, but >> for huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000), > > 'Synchronous External Abort, not on a translation table walk' > >> not translation error(DFSC[5:0] is 0b0101xx), > > (the set of external abort, parity or ECC errors that we get from the > page-table-walker) > >> the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM >> code, we get the IPA from the HPFAR_EL2, so we can not get the right IPA value, because its value is zero.I do not know whether you have same issue. > > This is something the ARM-ARM allows, so we have to live with it in software. > > For external aborts the ESR has a 'FnV' bit 10 that for your first > DSFSC 'Synchronous External Abort, not on a translation table walk' > indicates there is no FAR, (or presumably HPFAR). I assume you have this bit set in the ESR. > > This shouldn't be a problem, for firmware-first we should take the > address from the CPER records as this also gives us a range. For > kernel-first we'd take whatever is in the v8.2 RAS ERR records. Its > only if this wasn't a RAS error that we're likely to print out this address as we kill-the-task/panic. > > >> Although hpfar_el2 does not record IPA, but host firmware can still >> record the PA > > I agree, it can get the PA from the v8.2 RAS ERR registers and hand it > to the OS using CPER. > > >> If call memory_failure(), memory_failure can translate the PA to host >> VA, then deliver host VA to Qemu. > > Yes, this is how it works for any user-space process as two processes > sharing the same page may map it in different locations. > > >> Qemu can translate the host VA to IPA. so we rely on memory_failure() >> to get the IPA. > > Yes. I don't see why this is a problem: The kernel isn't going to pass > RAS events into the guest, so it never needs to know the IPA. > > Instead we notify user-space about ranges of memory affected by > memory_failure(), KVM's user-space isn't a special case here. > > As you point out, if Qemu wants to notify the guest it can calculate > the IPA and either use CPER for firmware-first, or in the future, > update some representation of the v8.2 ERR records once we can virtualise kernel-first. > > (I'm not sure I understand your point here, but I don't think we > disagree,) Yes, I only describe the workflow, not think we do not disagree. If not pass exception information to user space, there is another issue. As our agreement, if we want to inject a Synchronous-external-abort, we let Qemu/kvmtool injects it. when Qemu injecting it, it needs to set the value of FAR_EL1 with the value of FAR_EL2. but if we do not pass the far_el2's value to user space, Qemu will have to set the FAR_EL1 to 0, then FAR_EL1's value is invalid. The FAR_EL1 usually is used to save the fault guest VA. Of course, if guest cannot get the fault VA from the FAR_EL1. it still can read the CPER to get the guest fault PA and translate it to fault VA. > > > Thanks, > > James > > . >
Hi James, Sorry for my late response, thank you very much for comments. On 2017/9/23 0:51, James Morse wrote: [.....] >> >> CC Achin >> >> I have some personal opinion, if you think it is not right, hope you can point out. >> >> Synchronous External Abort and SError Interrupt are hardware exception(hardware concept), >> which is independent of software notification, >> in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to >> better describe the two exceptions, so use SEA and SEI notification to stand > for them. > >> SEA notification stands for Synchronous External Abort, so may be it is not only a >> notification, it also stands for a hardware error type. >> SEI notification stands for SError Interrupt, so may be it is not only a notification, >> it also stands for a hardware error type. > >> In the OS, it has different handling flow to the two exception(two notification): >> when the guest OS running, if the hardware generates a Synchronous External Abort, we >> told the guest OS this error is SError Interrupt instead of Synchronous > External Abort. > > This should only happen when APEI doesn't claim the external-abort as a RAS > notification. If there were CPER records to process then the error is handled by > the host, and we can return to the guest. consider again. I think you should be right. In the firmware-first solution, firmware will shield all kinds of errors and record them to the CPER buffer. > > If this wasn't a firmware-first notification, then you're right KVM hands the > guest an asynchronous external abort. This could be considered a bug in KVM. (we > can discuss with Marc and Christoffer what it should do), but: > > I'm not sure what scenario you could see this in: surely all your > CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first > notifications. So they should always be claimed by APEI. Yes, if it is firmware-first we should not exist such issue. > > >> guest OS uses SEI notification handling flow to deal with it, I am not sure whether it >> will have problem, because the true hardware exception is Synchronous External >> Abort, but software treats it as SError interrupt to handle. > > Once you're into a guest the original 'true hardware exception' shouldn't > matter. In this scenario KVM has handed the guest an SError, our question is 'is > it an SEI notification?': > > For firmware first the guest OS should poke around in the CPER buffers, find > nothing to do, and return to the arch code for (future) kernel-first. > For kernel first the guest OS should trawl through the v8.2 ERR registers, find > nothing to do, and continue to the default case: > > By default, we should panic on SError, unless its classified as a non-fatal RAS > error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is > no work to do). understand, thanks. > > > What you may be seeing is some awkwardness with the change in the SError ESR > with v8.2. Previously the VSE mechanism injected an impdef SError, (but they > were all impdef so it didn't matter). > With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this > means 'classified as a RAS error ... unknown!'. > > I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort > code to specify an impdef ESR for this path. Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in the userspace, now do you want to specify an impdef ESR in KVM instead of usrspace? if setting the vsesr_el2 in the KVM, whether user-space need to specify? May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one. > > >> In the mainline code, it does not have SEI notification support, the reason I >> think it is because of the error address record by firmware is not accurate >> (SError Interrupt is asynchronous exception). > > Yes, while we don't expect a FAR with an SError, but we do expect a valid > representation of the RAS error in either the CPER records or the v8.2. ERR > registers (or both). If we have neither of those, its not a RAS error and we > should panic. > > >> so if treat a hardware Synchronous External Abort as SError interrupt(SEI). >> The default OS behavior for SEI is PANIC, that is to say, when hardware triggers >> a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI), >> the OS will be panic. in fact, it can be recoverable instead of Panic. > > If its a RAS error APEI (or in the future, the kernel-first handler), should > claim the error, so the guest never sees it. If you are hitting this behaviour > in KVM, then it wasn't a RAS error. > > >> I ever added a patch to support the SEI notification, but not sure whether >> it is can be accepted by open source, until now, not receive response. > > The patch you posted during the merge window made no sense on its own, so must > replace $one_of the other patches in your v5 (or was it v6)... I'll get to it... yes, thanks. Because SEI notification support does not depend on the RAS virtualization, I break them out of this series. they are here(Today Tyler have some comments, I will update it) https://patchwork.kernel.org/patch/9950953/ https://patchwork.kernel.org/patch/9952493/ > > Because the SEI notification depends on v8.2 I'd like to get the SError/RAS > series posted (currently re-testing), then I'll pick up enough of the patches > you've posted for a consolidated version of the series, and we can take the > discussion from there. James, it is great, we can make a consolidated version of the series. > > I'd still like to know what your firmware does if the normal-world believes its > masked physical-SError and you want to hand it an SEI notification. firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set. when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers. (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL2. if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point), execute "ERET", then jump to EL2 hypervisor. (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL, set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point), execute "ERET", then jump to EL2 hypervisor. (2) if HCR_EL2.TEA is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1 if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580 if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380 execute "ERET", then jump to host EL1. > > Thanks, > > James > > . >
>> What you may be seeing is some awkwardness with the change in the SError ESR >> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they >> were all impdef so it didn't matter). >> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this >> means 'classified as a RAS error ... unknown!'. >> >> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort >> code to specify an impdef ESR for this path. > Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in > the userspace, I pasted Marc's propose and your suggestion that set VSESR_EL2(specify virtual SError syndrome) by the user space. https://lkml.org/lkml/2017/3/20/441 https://lkml.org/lkml/2017/3/20/516 > now do you want to specify an impdef ESR in KVM instead of usrspace? > if setting the vsesr_el2 in the KVM, whether user-space need to specify? > May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one. >
Hi gengdongjiu, On 25/09/17 16:13, gengdongjiu wrote: > On 2017/9/23 0:39, James Morse wrote: >> On 18/09/17 14:36, gengdongjiu wrote: >>> On 2017/9/14 21:00, James Morse wrote: >>>> On 13/09/17 08:32, gengdongjiu wrote: >>> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error? >> >> How would userspace get one of these memory errors for a PCIe error? > > seems Ok. > Now I only add the support for the host SEI and SEA virtualization. For the PCIe error, I still do not consider much it. > maybe we need to consider that afterwards. Any PCIe AER error should be handled by the device driver. If user-space is interacting with the device-driver directly, its up to them to come up with a way of reporting errors. >>> In the user space, we can check the si_code, if it is >>> "BUS_MCEERR_AR", we use SEA notification type for the guest; if it is "BUS_MCEERR_AO", we use SEI notification type for the guest. >>> Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type? >> >> This is for Qemu/kvmtool to decide, it depends on what sort of machine >> they are emulating. >> >> For example, the physical machine's memory-controller may notify the >> CPU about memory errors by triggering SError trapped to EL3, or with a >> dedicated FIQ, also routed to EL3. By the time this gets to the host >> kernel the distinction doesn't matter. The host has handled the error. >> >> For a guest, your memory-controller is effectively the host kernel. It >> will give you an BUS_MCEERR_AO signal for any guest memory that is >> affected, and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory. >> >> What Qemu/kvmtool do with this is up to them. If they're emulating a >> machine with no RAS features, printing an error and exit. >> >> Otherwise BUS_MCEERR_AR could be notified as one of the flavours of >> IRQ, unless the affected vcpu has interrupts masked, in which case an >> SEA notification gives you some NMI-like behaviour. > Thanks for explanation. > Now that SEA notification can provide NMI-like behaviour. How about we use it for > BUS_MCEERR_AR to avoid check the interrupts mask status? ... yes, that's the suggestion. Qemu/kvmtool can take MCEERR_AR from KVM and SET_ONE_REG the vcpu into taking a synchronous-external-abort instead. (but there is trouble ahead where this suggestion unravels) > Even though guest OS not support SEA notification, It is still a valid > guest:Synchronous-external-abort When it comes from KVM, Yes. If you think of linux+KVM's stage2 translation as your guests memory controller. It is effectively guaranteeing you to: 1. 'interrupt' userspace with MCEERR_AO when it detects an error, 2. contain it (by unmapping and marking the page PG_poison) then, 3. give you a synchronous external abort with MCEERR_AR if the guest touches the contained memory. This is fine when the MCEERR_AR came from KVM, what we need to find out is whether this will still be true for some future kernel code that sends MCEERR_AR... >> For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My >> choice would be IRQ, as you can't know if the guest supports SEI and >> it would be a shame to > > How about we first check whether user space can specify the virtual SError Exception Syndrome(have vsesr_el2 register)? > If can specify, using SEI notification, otherwise use IRQ notification. > The advantage is that it can pass more error information than IRQ if can specify Syndrome information. If this is for APEI firmware-first, its the severity in the CPER records that matters, so the SEI/IRQ choice doesn't matter if the guest supports both. But its unlikely the guest supports both. (v4.14 doesn't). If you pick IRQ and the guest doesn't support it, nothing happens, the guest never takes the IRQ. You take the memory-error as a MCEERR_AR some time later. If you pick SEI and the guest doesn't support it, the guest dies of an unknown SError. I would pick POLLed by default, as this would let the Qemu/kvmtool code that does this work be portable between x86 and arm64. [...] >>> If call memory_failure(), memory_failure can translate the PA to host >>> VA, then deliver host VA to Qemu. >> >> Yes, this is how it works for any user-space process as two processes >> sharing the same page may map it in different locations. >> >> >>> Qemu can translate the host VA to IPA. so we rely on memory_failure() >>> to get the IPA. >> >> Yes. I don't see why this is a problem: The kernel isn't going to pass >> RAS events into the guest, so it never needs to know the IPA. >> >> Instead we notify user-space about ranges of memory affected by >> memory_failure(), KVM's user-space isn't a special case here. >> >> As you point out, if Qemu wants to notify the guest it can calculate >> the IPA and either use CPER for firmware-first, or in the future, >> update some representation of the v8.2 ERR records once we can virtualise kernel-first. >> >> (I'm not sure I understand your point here, but I don't think we >> disagree,) > Yes, I only describe the workflow, not think we do not disagree. (double negative?) > If not pass exception information to user space, there is another issue. > As our agreement, if we want to inject a Synchronous-external-abort, we let Qemu/kvmtool injects it. Yes, because all we do for user space is send it MCEERR_AR when this happens. KVM only does this to make guest-accesses the same as user-space accesses. (See arm64's do_page_fault() and x86's do_sigbus() for the regular path) > when Qemu injecting it, it needs to set the value of FAR_EL1 with the value of FAR_EL2. but if we do not > pass the far_el2's value to user space, Qemu will have to set the FAR_EL1 to 0, then FAR_EL1's value is invalid. > The FAR_EL1 usually is used to save the fault guest VA. Yes, Qemu doesn't know the the guest VA, but it doesn't need to for this mechanism to work: Set the DFSC to '0b010000 Synchronous external abort, not on translation table walk', then set 'FnV - Far Not Valid'. ... But there is one piece of information Qemu needs here but doesn't have: whether this was an instruction or a data abort, which you need to pick the correct EC. Ideally, this would come with the signal, but that doesn't happen today. I don't think user-space generally needs to know this, only a JIT is likely to be able to handle these errors, and it would know which regions were code or data. So it looks like we really do need more information from KVM, and guests have to be a special-case here. Bother. I'll try and put together an RFC to let user-space generate a believable architectural exception when it gets this signal. > Of course, if guest cannot get the fault VA from the FAR_EL1. it still can read the CPER to get > the guest fault PA and translate it to fault VA. _A_ fault VA, there may be a set of them. This is why memory_failure() walks the rmap to find all the users of the page, and why the faulting VA isn't really relevant for this sort of RAS error. Thanks, James
Hi gengdongjiu, On 27/09/17 12:07, gengdongjiu wrote: > On 2017/9/23 0:51, James Morse wrote: >> If this wasn't a firmware-first notification, then you're right KVM hands the >> guest an asynchronous external abort. This could be considered a bug in KVM. (we >> can discuss with Marc and Christoffer what it should do), but: >> >> I'm not sure what scenario you could see this in: surely all your >> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first >> notifications. So they should always be claimed by APEI. > Yes, if it is firmware-first we should not exist such issue. [...] >> What you may be seeing is some awkwardness with the change in the SError ESR >> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they >> were all impdef so it didn't matter). >> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this >> means 'classified as a RAS error ... unknown!'. >> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort >> code to specify an impdef ESR for this path. https://www.spinics.net/lists/arm-kernel/msg609589.html > Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in > the userspace, If Qemu/kvmtool wants to emulate a machine that notifies the guest about firmware-first RAS Errors using SError/SEI, it needs to decide when to send these SError and what ESR to specify. > now do you want to specify an impdef ESR in KVM instead of usrspace? No, that patch is just trying to fixup the existing cases where KVM already injects an SError. I'm just trying to keep the behaviour the-same: Before the RAS Extensions the guest would always get an impdef SError ESR. After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets the hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a RAS encoding. That patch changes it to still be an impdef SError ESR. > if setting the vsesr_el2 in the KVM, whether user-space need to specify? I think we need a KVM CAP API to do this. With that patch it can be wired into pend_guest_serror(), which takes the ESR to make pending if the CPU supports it. It's not clear to me whether its useful for user-space to make an SError pending even if it can't set the ESR.... [...] >> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS >> series posted (currently re-testing), then I'll pick up enough of the patches >> you've posted for a consolidated version of the series, and we can take the >> discussion from there. > James, it is great, we can make a consolidated version of the series. We need to get some of the wider issues fixed first, the big-ugly one is memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as this would only become re-entrant if the kernel text was corrupt). It is a problem for SEI and SDEI. >> I'd still like to know what your firmware does if the normal-world believes its >> masked physical-SError and you want to hand it an SEI notification. Aha, thanks for this: > firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set. Masked for the CPU because the CPU can deliver the SError to EL3. What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may have set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your firmware respect the PSTATE.A value of the exception level that SError are routed to? > when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers. > > (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check HCR_EL2.AMO. Some crazy hypervisor may set one and not the other. > SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to > ESR_EL2. The EC value in the ELR describes current/lower exception level, you need to re-encode this for EL2 if the exception came from EL2. > if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point), > > execute "ERET", then jump to EL2 hypervisor. > > (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL, > set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point), > > execute "ERET", then jump to EL2 hypervisor. This SError came from EL2. You _must_ check SPSR_EL3.A is clear before returning to the EL2 SError vector. EL2 believes it has masked SError, it does this because it can't handle one right now. If your firmware jumps in anyway - its game over. We mask SError in entry.S when we take an exception and when we return from an exception. This is so that we can read/write the ELR/SPSR without them changing under our feet. If your firmware overwrites these values - we've lost them, and can never return to the context we interrupted. > (2) if HCR_EL2.TEA It's an SError, these are routed by HCR_EL2.AMO. > is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1 Again, the EC needs re-encoding. > if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580 > if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380 > > execute "ERET", then jump to host EL1. What if the SError came from EL2 and HCR_EL2.AMO is clear. This means SError are routed to EL1, and are implicitly masked when executing at EL2. This is the same SError-is-masked issue as above. What does your firmware do when physical SError is masked and you want to hand it an SEI notification? You need to route your fake-SError based on HCR_EL2.AMO and your fake-External-Abort based on HCR_EL2.TEA. You also need a triage step at EL3 before you try and notify the normal world. If you take an uncontainable error to EL3 there is very little point telling the OS - its probably already on fire. In this case firmware needs to reboot the machine, and generate a summary of the error for the BERT. Thanks, James
Hi james, thanks for the mail, and sorry for my late response. On 2017/10/7 0:46, James Morse wrote: >>> will give you an BUS_MCEERR_AO signal for any guest memory that is >>> affected, and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory. >>> >>> What Qemu/kvmtool do with this is up to them. If they're emulating a >>> machine with no RAS features, printing an error and exit. >>> >>> Otherwise BUS_MCEERR_AR could be notified as one of the flavours of >>> IRQ, unless the affected vcpu has interrupts masked, in which case an >>> SEA notification gives you some NMI-like behaviour. >> Thanks for explanation. >> Now that SEA notification can provide NMI-like behaviour. How about we use it for >> BUS_MCEERR_AR to avoid check the interrupts mask status? > ... yes, that's the suggestion. Qemu/kvmtool can take MCEERR_AR from KVM and > SET_ONE_REG the vcpu into taking a synchronous-external-abort instead. > > (but there is trouble ahead where this suggestion unravels) > > >> Even though guest OS not support SEA notification, It is still a valid >> guest:Synchronous-external-abort > When it comes from KVM, Yes. > > If you think of linux+KVM's stage2 translation as your guests memory controller. > It is effectively guaranteeing you to: > 1. 'interrupt' userspace with MCEERR_AO when it detects an error, > 2. contain it (by unmapping and marking the page PG_poison) then, > 3. give you a synchronous external abort with MCEERR_AR if the guest touches the > contained memory. > > This is fine when the MCEERR_AR came from KVM, what we need to find out is > whether this will still be true for some future kernel code that sends MCEERR_AR... Ok > > >>> For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My >>> choice would be IRQ, as you can't know if the guest supports SEI and >>> it would be a shame to >> How about we first check whether user space can specify the virtual SError Exception Syndrome(have vsesr_el2 register)? >> If can specify, using SEI notification, otherwise use IRQ notification. >> The advantage is that it can pass more error information than IRQ if can specify Syndrome information. > If this is for APEI firmware-first, its the severity in the CPER records that > matters, so the SEI/IRQ choice doesn't matter if the guest supports both. > > But its unlikely the guest supports both. (v4.14 doesn't). > If you pick IRQ and the guest doesn't support it, nothing happens, the guest > never takes the IRQ. You take the memory-error as a MCEERR_AR some time later. > > If you pick SEI and the guest doesn't support it, the guest dies of an unknown > SError. > > I would pick POLLed by default, as this would let the Qemu/kvmtool code that > does this work be portable between x86 and arm64. > > [...] AS we disused in another mail, we can use GPIO-signal or GPIO interrupts which is also suggestion by APEI spec and does not let to guest dies, as shown in [1] [1] HW-reduced ACPI platforms signal the error using a GPIO interrupt or another interrupt declared under a generic event device (Section 5.6.9). In the case of GPIO-signaled events, an _AEI object lists the appropriate GPIO pin, while for Interrupt-signaled events a _CRS object is used to list the interrupt: > >>>> If call memory_failure(), memory_failure can translate the PA to host >>>> VA, then deliver host VA to Qemu. >>> Yes, this is how it works for any user-space process as two processes >>> sharing the same page may map it in different locations. >>> >>> >>>> Qemu can translate the host VA to IPA. so we rely on memory_failure() >>>> to get the IPA. >>> Yes. I don't see why this is a problem: The kernel isn't going to pass >>> RAS events into the guest, so it never needs to know the IPA. >>> >>> Instead we notify user-space about ranges of memory affected by >>> memory_failure(), KVM's user-space isn't a special case here. >>> >>> As you point out, if Qemu wants to notify the guest it can calculate >>> the IPA and either use CPER for firmware-first, or in the future, >>> update some representation of the v8.2 ERR records once we can virtualise kernel-first. >>> >>> (I'm not sure I understand your point here, but I don't think we >>> disagree,) >> Yes, I only describe the workflow, not think we do not disagree. > (double negative?) > > >> If not pass exception information to user space, there is another issue. >> As our agreement, if we want to inject a Synchronous-external-abort, we let Qemu/kvmtool injects it. > Yes, because all we do for user space is send it MCEERR_AR when this happens. > > KVM only does this to make guest-accesses the same as user-space accesses. > (See arm64's do_page_fault() and x86's do_sigbus() for the regular path) > > >> when Qemu injecting it, it needs to set the value of FAR_EL1 with the value of FAR_EL2. but if we do not >> pass the far_el2's value to user space, Qemu will have to set the FAR_EL1 to 0, then FAR_EL1's value is invalid. >> The FAR_EL1 usually is used to save the fault guest VA. > Yes, Qemu doesn't know the the guest VA, but it doesn't need to for this > mechanism to work: Set the DFSC to '0b010000 Synchronous external abort, not on > translation table walk', then set 'FnV - Far Not Valid'. yes, I think so. thanks for your good suggestion. > > ... > > But there is one piece of information Qemu needs here but doesn't have: whether > this was an instruction or a data abort, which you need to pick the correct EC. so this is patch aim to pass more information to user space. but you have some concern > > Ideally, this would come with the signal, but that doesn't happen today. I ever mentioned that the information is not enough, but you reply user space should not need that. > > I don't think user-space generally needs to know this, only a JIT is likely to > be able to handle these errors, and it would know which regions were code or data. why it is related with a JIT? > > So it looks like we really do need more information from KVM, and guests have to > be a special-case here. Bother. I think my current patch can pass the needed information. in fact, Qemu will judge whether the fault address is belong to guest, only belong to guest, it will get these information. not exist information stale issue. how about we keep this patch? For the guest APEI CPER recording, if want to make it better, it also need to pass more information. > > I'll try and put together an RFC to let user-space generate a believable > architectural exception when it gets this signal. I think this patch can pass the enough information for guest, why you will put another patch? > > >> Of course, if guest cannot get the fault VA from the FAR_EL1. it still can read the CPER to get >> the guest fault PA and translate it to fault VA. > _A_ fault VA, there may be a set of them. This is why memory_failure() walks the > rmap to find all the users of the page, and why the faulting VA isn't really > relevant for this sort of RAS error. yes. > > > Thanks,
On 2017/10/7 1:31, James Morse wrote: > Hi gengdongjiu, > > On 27/09/17 12:07, gengdongjiu wrote: >> On 2017/9/23 0:51, James Morse wrote: >>> If this wasn't a firmware-first notification, then you're right KVM hands the >>> guest an asynchronous external abort. This could be considered a bug in KVM. (we >>> can discuss with Marc and Christoffer what it should do), but: >>> >>> I'm not sure what scenario you could see this in: surely all your >>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first >>> notifications. So they should always be claimed by APEI. > >> Yes, if it is firmware-first we should not exist such issue. > > [...] > >>> What you may be seeing is some awkwardness with the change in the SError ESR >>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they >>> were all impdef so it didn't matter). >>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this >>> means 'classified as a RAS error ... unknown!'. > >>> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort >>> code to specify an impdef ESR for this path. > > https://www.spinics.net/lists/arm-kernel/msg609589.html > > >> Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in >> the userspace, > > If Qemu/kvmtool wants to emulate a machine that notifies the guest about > firmware-first RAS Errors using SError/SEI, it needs to decide when to send > these SError and what ESR to specify. yes, it is. I agree. > > >> now do you want to specify an impdef ESR in KVM instead of usrspace? > > No, that patch is just trying to fixup the existing cases where KVM already > injects an SError. I'm just trying to keep the behaviour the-same: > > Before the RAS Extensions the guest would always get an impdef SError ESR. > After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets the > hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a RAS > encoding. That patch changes it to still be an impdef SError ESR. > > >> if setting the vsesr_el2 in the KVM, whether user-space need to specify? > > I think we need a KVM CAP API to do this. With that patch it can be wired into > pend_guest_serror(), which takes the ESR to make pending if the CPU supports it. For this CAP API, I have made a patch in the new series patches. > > It's not clear to me whether its useful for user-space to make an SError pending > even if it can't set the ESR.... why it can not set the ESR? In the KVM, we can return a encoding fault info to userspace, then user space can specify its own ESR for guest. I already made a patch for it. > > [...] > >>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS >>> series posted (currently re-testing), then I'll pick up enough of the patches >>> you've posted for a consolidated version of the series, and we can take the >>> discussion from there. > >> James, it is great, we can make a consolidated version of the series. > > We need to get some of the wider issues fixed first, the big-ugly one is > memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as this > would only become re-entrant if the kernel text was corrupt). It is a problem > for SEI and SDEI. For memory_failure_queue(), I think the big problem is it is in a process context, not error handling context. there are two context. and the memory_failure_queue() is scheduled later than the error handling. > > >>> I'd still like to know what your firmware does if the normal-world believes its >>> masked physical-SError and you want to hand it an SEI notification. > > Aha, thanks for this: > >> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set. > > Masked for the CPU because the CPU can deliver the SError to EL3. > > What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may have > set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your firmware > respect the PSTATE.A value of the exception level that SError are routed to? Before route to the target EL, software set the spsr_el3.A to 1, then "eret", the PSTATE.A will be to 1. Note: PSTATE.A is shared by different EL, in the hardware, it is one register, not many registers. spsr_elx has more registers, such as spsr_el1, spsr_el2, spsr_el3. > > >> when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers. >> >> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this > > HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check > HCR_EL2.AMO. Some crazy hypervisor may set one and not the other. sorry, it is typo issue, should check HCR_EL2.AMO > > >> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to >> ESR_EL2. > > The EC value in the ELR describes current/lower exception level, you need to > re-encode this for EL2 if the exception came from EL2. Regardless it was trapped from EL1 or EL2. they are all from lower exception level. it should be not encoding, such as Instruction Abort from a lower Exception level. For both EL1 or EL2, the EC is 0b100000 EC == 0b100000: Instruction Abort from a lower Exception level. Of course, if need re-encode for some cases, will do it. > > >> if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point), >> >> execute "ERET", then jump to EL2 hypervisor. >> >> (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL, >> set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point), >> >> execute "ERET", then jump to EL2 hypervisor. > > This SError came from EL2. You _must_ check SPSR_EL3.A is clear before returning > to the EL2 SError vector. before jumping, it will set the SPSR_EL3.A to 1. > > EL2 believes it has masked SError, it does this because it can't handle one > right now. If your firmware jumps in anyway - its game over. > > We mask SError in entry.S when we take an exception and when we return from an > exception. This is so that we can read/write the ELR/SPSR without them changing > under our feet. If your firmware overwrites these values - we've lost them, and > can never return to the context we interrupted. In fact, we do not want to return to the context which is interrupted in Hawei's platform. we will kill the APP or VM, not return to the context that interrupted. > > >> (2) if HCR_EL2.TEA > > It's an SError, these are routed by HCR_EL2.AMO. typo issue, should be HCR_EL2.AMO. > > >> is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1 > > Again, the EC needs re-encoding. For EL3, the EL1 and EL0 are all low level, why the EC needs re-encoding? > > >> if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580 >> if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380 >> >> execute "ERET", then jump to host EL1. > > What if the SError came from EL2 and HCR_EL2.AMO is clear. This means SError are > routed to EL1, and are implicitly masked when executing at EL2. No, before jump, it will also check the spsr_elx.M to determine jump to which level. > > This is the same SError-is-masked issue as above. > > What does your firmware do when physical SError is masked and you want to hand > it an SEI notification? Do you mean spsr_el3.A =1 or PSTATE.A = 1? we just jump to the corresponding vector entry, not care spsr_el3.A or PSTATE.A is set or not set. > > > You need to route your fake-SError based on HCR_EL2.AMO and your > fake-External-Abort based on HCR_EL2.TEA. yes, it should. above is typo issue. > > You also need a triage step at EL3 before you try and notify the normal world. > If you take an uncontainable error to EL3 there is very little point telling the > OS - its probably already on fire. In this case firmware needs to reboot the > machine, and generate a summary of the error for the BERT. yes, it should and we already done that. For some errors, BIOS will firstly handle it, and then return to the interrupted context, OS even does not know this error happen. > > > Thanks, > > James > > . >
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index 9f3ca24bbcc6..514261f682b8 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -181,6 +181,11 @@ struct kvm_arch_memory_slot { #define KVM_REG_ARM64_SYSREG_OP2_MASK 0x0000000000000007 #define KVM_REG_ARM64_SYSREG_OP2_SHIFT 0 +/* AArch64 fault registers */ +#define KVM_REG_ARM64_FAULT (0x0014 << KVM_REG_ARM_COPROC_SHIFT) +#define KVM_REG_ARM64_FAULT_ESR_EC_ISS (0) +#define KVM_REG_ARM64_FAULT_FAR (1) + #define ARM64_SYS_REG_SHIFT_MASK(x,n) \ (((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) & \ KVM_REG_ARM64_SYSREG_ ## n ## _MASK) diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index 5c7f657dd207..020a644b20d7 100644 --- a/arch/arm64/kvm/guest.c +++ b/arch/arm64/kvm/guest.c @@ -128,6 +128,39 @@ static int set_core_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) out: return err; } +static int get_fault_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) +{ + void __user *uaddr = (void __user *)(unsigned long)reg->addr; + u32 value; + u32 id = reg->id & ~(KVM_REG_ARCH_MASK | + KVM_REG_SIZE_MASK | KVM_REG_ARM64_FAULT); + + switch (id) { + case KVM_REG_ARM64_FAULT_ESR_EC_ISS: + /* + * The user space may need to know the fault exception class + * ESR_ELx.EC and instruction specific syndrome ESR_ELx.ISS + */ + value = kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_EC_MASK | ESR_ELx_ISS_MASK); + if (copy_to_user(uaddr, &value, KVM_REG_SIZE(reg->id)) != 0) + return -EFAULT; + break; + case KVM_REG_ARM64_FAULT_FAR: + /* + * When user space injects synchronized abort, it needs + * to inject the faulting virtual address to guest OS, so + * needs to get it from kvm. + */ + if (copy_to_user(uaddr, &(vcpu->arch.fault.far_el2), + KVM_REG_SIZE(reg->id)) != 0) + return -EFAULT; + break; + default: + return -ENOENT; + } + return 0; +} + int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) { @@ -243,6 +276,9 @@ int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg) if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM_CORE) return get_core_reg(vcpu, reg); + if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM64_FAULT) + return get_fault_reg(vcpu, reg); + if (is_timer_reg(reg->id)) return get_timer_reg(vcpu, reg);