Message ID | 20200406150355.4859-1-james.morse@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: arm64: arch_timer shouldn't assume the vcpu is loaded | expand |
Hi James, Thanks for looking into this. On Mon, 6 Apr 2020 16:03:55 +0100 James Morse <james.morse@arm.com> wrote: > kvm_arch_timer_get_input_level() needs to get the arch_timer_context for > a particular vcpu, and uses kvm_get_running_vcpu() to find it. > > kvm_arch_timer_get_input_level() may be called to handle a user-space > write to the redistributor, where the vcpu is not loaded. This causes > kvm_get_running_vcpu() to return NULL: > | Unable to handle kernel paging request at virtual address 0000000000001ec0 > | Mem abort info: > | ESR = 0x96000004 > | EC = 0x25: DABT (current EL), IL = 32 bits > | SET = 0, FnV = 0 > | EA = 0, S1PTW = 0 > | Data abort info: > | ISV = 0, ISS = 0x00000004 > | CM = 0, WnR = 0 > | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 > | [0000000000001ec0] pgd=0000000000000000 > | Internal error: Oops: 96000004 [#1] PREEMPT SMP > | Modules linked in: r8169 realtek efivarfs ip_tables x_tables > | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 > | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 > | pstate: 00000085 (nzcv daIf -PAN -UAO) > | pc : kvm_arch_timer_get_input_level+0x1c/0x68 > | lr : kvm_arch_timer_get_input_level+0x1c/0x68 > > | Call trace: > | kvm_arch_timer_get_input_level+0x1c/0x68 > | vgic_get_phys_line_level+0x3c/0x90 > | vgic_mmio_write_senable+0xe4/0x130 > | vgic_uaccess+0xe0/0x100 > | vgic_v3_redist_uaccess+0x5c/0x80 > | vgic_v3_attr_regs_access+0xf0/0x200 > | nvgic_v3_set_attr+0x234/0x250 > | kvm_device_ioctl_attr+0xa4/0xf8 > | kvm_device_ioctl+0x7c/0xc0 > | ksys_ioctl+0x1fc/0xc18 > | __arm64_sys_ioctl+0x24/0x30 > | do_el0_svc+0x7c/0x148 > | el0_sync_handler+0x138/0x258 > | el0_sync+0x140/0x180 > | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) > | ---[ end trace 81287612d93f1e70 ]--- > | note: qemu-system-aar[2615] exited with preempt_count 1 > > Loading the vcpu doesn't make a lot of sense for handling a device ioctl(), > so instead pass the vcpu through to kvm_arch_timer_get_input_level(). Its > not clear that an intid makes much sense without the paired vcpu. I don't fully agree with the analysis, Remember we are looking at the state of the physical interrupt associated with a virtual interrupt, so the vcpu doesn't quite make sense here if it isn't loaded. What does it mean to look at the HW timer when we are not in the right context? For all we know, it is completely random (the only guarantee we have is that it is disabled, actually). My gut feeling is that this is another instance where we should provide specific userspace accessors that would only deal with the virtual state, and leave anything that deals with the physical state of the interrupt to be exercised only by the guest. Does it make sense? Thanks, M.
Hi Marc, On 08/04/2020 11:07, Marc Zyngier wrote: > On Mon, 6 Apr 2020 16:03:55 +0100 > James Morse <james.morse@arm.com> wrote: > >> kvm_arch_timer_get_input_level() needs to get the arch_timer_context for >> a particular vcpu, and uses kvm_get_running_vcpu() to find it. >> >> kvm_arch_timer_get_input_level() may be called to handle a user-space >> write to the redistributor, where the vcpu is not loaded. This causes >> kvm_get_running_vcpu() to return NULL: >> | Unable to handle kernel paging request at virtual address 0000000000001ec0 >> | Mem abort info: >> | ESR = 0x96000004 >> | EC = 0x25: DABT (current EL), IL = 32 bits >> | SET = 0, FnV = 0 >> | EA = 0, S1PTW = 0 >> | Data abort info: >> | ISV = 0, ISS = 0x00000004 >> | CM = 0, WnR = 0 >> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 >> | [0000000000001ec0] pgd=0000000000000000 >> | Internal error: Oops: 96000004 [#1] PREEMPT SMP >> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables >> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 >> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 >> | pstate: 00000085 (nzcv daIf -PAN -UAO) >> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 >> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 >> >> | Call trace: >> | kvm_arch_timer_get_input_level+0x1c/0x68 >> | vgic_get_phys_line_level+0x3c/0x90 >> | vgic_mmio_write_senable+0xe4/0x130 >> | vgic_uaccess+0xe0/0x100 >> | vgic_v3_redist_uaccess+0x5c/0x80 >> | vgic_v3_attr_regs_access+0xf0/0x200 >> | nvgic_v3_set_attr+0x234/0x250 >> | kvm_device_ioctl_attr+0xa4/0xf8 >> | kvm_device_ioctl+0x7c/0xc0 >> | ksys_ioctl+0x1fc/0xc18 >> | __arm64_sys_ioctl+0x24/0x30 >> | do_el0_svc+0x7c/0x148 >> | el0_sync_handler+0x138/0x258 >> | el0_sync+0x140/0x180 >> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) >> | ---[ end trace 81287612d93f1e70 ]--- >> | note: qemu-system-aar[2615] exited with preempt_count 1 >> >> Loading the vcpu doesn't make a lot of sense for handling a device ioctl(), >> so instead pass the vcpu through to kvm_arch_timer_get_input_level(). Its >> not clear that an intid makes much sense without the paired vcpu. > > I don't fully agree with the analysis, Remember we are looking at the > state of the physical interrupt associated with a virtual interrupt, so > the vcpu doesn't quite make sense here if it isn't loaded. > > What does it mean to look at the HW timer when we are not in the right > context? For all we know, it is completely random (the only guarantee > we have is that it is disabled, actually). > My gut feeling is that this is another instance where we should provide > specific userspace accessors that would only deal with the virtual > state, and leave anything that deals with the physical state of the > interrupt to be exercised only by the guest. > Does it make sense? Broadly, yes. Specifically ... I'm not familiar enough with this code to work out where such a change should go! ~20 mins of grepping later~ Remove REGISTER_DESC_WITH_LENGTH() so that uaccess helpers have to be provided, and forbid NULL for the ur/uw values in REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED()...? Or if that is too invasive, something like, (totally, untested): ----------------%<---------------- diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c index 97fb2a40e6ba..30ae5f29e429 100644 --- a/virt/kvm/arm/vgic/vgic-mmio.c +++ b/virt/kvm/arm/vgic/vgic-mmio.c @@ -113,10 +113,11 @@ void vgic_mmio_write_senable(struct kvm_vcpu *vcpu, struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); raw_spin_lock_irqsave(&irq->irq_lock, flags); - if (vgic_irq_is_mapped_level(irq)) { + if (kvm_running_vcpu() && vgic_irq_is_mapped_level(irq)) { bool was_high = irq->line_level; /* + * Unless we are running due to a user-space access, * We need to update the state of the interrupt because * the guest might have changed the state of the device * while the interrupt was disabled at the VGIC level. ----------------%<---------------- Thanks, James
On 08/04/2020 11:07, Marc Zyngier wrote: Hi Marc, > Hi James, > > Thanks for looking into this. > > On Mon, 6 Apr 2020 16:03:55 +0100 > James Morse <james.morse@arm.com> wrote: > >> kvm_arch_timer_get_input_level() needs to get the arch_timer_context for >> a particular vcpu, and uses kvm_get_running_vcpu() to find it. >> >> kvm_arch_timer_get_input_level() may be called to handle a user-space >> write to the redistributor, where the vcpu is not loaded. This causes >> kvm_get_running_vcpu() to return NULL: >> | Unable to handle kernel paging request at virtual address 0000000000001ec0 >> | Mem abort info: >> | ESR = 0x96000004 >> | EC = 0x25: DABT (current EL), IL = 32 bits >> | SET = 0, FnV = 0 >> | EA = 0, S1PTW = 0 >> | Data abort info: >> | ISV = 0, ISS = 0x00000004 >> | CM = 0, WnR = 0 >> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 >> | [0000000000001ec0] pgd=0000000000000000 >> | Internal error: Oops: 96000004 [#1] PREEMPT SMP >> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables >> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 >> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 >> | pstate: 00000085 (nzcv daIf -PAN -UAO) >> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 >> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 >> >> | Call trace: >> | kvm_arch_timer_get_input_level+0x1c/0x68 >> | vgic_get_phys_line_level+0x3c/0x90 >> | vgic_mmio_write_senable+0xe4/0x130 >> | vgic_uaccess+0xe0/0x100 >> | vgic_v3_redist_uaccess+0x5c/0x80 >> | vgic_v3_attr_regs_access+0xf0/0x200 >> | nvgic_v3_set_attr+0x234/0x250 >> | kvm_device_ioctl_attr+0xa4/0xf8 >> | kvm_device_ioctl+0x7c/0xc0 >> | ksys_ioctl+0x1fc/0xc18 >> | __arm64_sys_ioctl+0x24/0x30 >> | do_el0_svc+0x7c/0x148 >> | el0_sync_handler+0x138/0x258 >> | el0_sync+0x140/0x180 >> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) >> | ---[ end trace 81287612d93f1e70 ]--- >> | note: qemu-system-aar[2615] exited with preempt_count 1 >> >> Loading the vcpu doesn't make a lot of sense for handling a device ioctl(), >> so instead pass the vcpu through to kvm_arch_timer_get_input_level(). Its >> not clear that an intid makes much sense without the paired vcpu. > > I don't fully agree with the analysis, Remember we are looking at the > state of the physical interrupt associated with a virtual interrupt, so > the vcpu doesn't quite make sense here if it isn't loaded. But wasn't it that this function is meant to specifically deal with this *without* going to the hardware (which is costly, hence this optimisation)? Because for the timer we *can* work out the logical IRQ line state by examining our saved state? And this is what we do in kvm_timer_should_fire(), when timer_ctx->loaded is false. Which for me this sounds like the right thing to do in this situation: the VCPU (and the timer) is not loaded, so we check our saved state and construct the logical line level. We just need a valid VCPU struct to achieve this, and hope for the virtual timer to be already initialised. Do I miss something here? Also to me it sound like the interface for this function is slightly lacking, because just an intid is not enough to uniquely identify an IRQ. It was just fine so far because of this special use case. Cheers, Andre > > What does it mean to look at the HW timer when we are not in the right > context? For all we know, it is completely random (the only guarantee > we have is that it is disabled, actually). > > My gut feeling is that this is another instance where we should provide > specific userspace accessors that would only deal with the virtual > state, and leave anything that deals with the physical state of the > interrupt to be exercised only by the guest. > > Does it make sense? > > Thanks, > > M. >
Hi Andre, On 2020-04-08 13:13, André Przywara wrote: > On 08/04/2020 11:07, Marc Zyngier wrote: > > Hi Marc, > >> Hi James, >> >> Thanks for looking into this. >> >> On Mon, 6 Apr 2020 16:03:55 +0100 >> James Morse <james.morse@arm.com> wrote: >> >>> kvm_arch_timer_get_input_level() needs to get the arch_timer_context >>> for >>> a particular vcpu, and uses kvm_get_running_vcpu() to find it. >>> >>> kvm_arch_timer_get_input_level() may be called to handle a user-space >>> write to the redistributor, where the vcpu is not loaded. This causes >>> kvm_get_running_vcpu() to return NULL: >>> | Unable to handle kernel paging request at virtual address >>> 0000000000001ec0 >>> | Mem abort info: >>> | ESR = 0x96000004 >>> | EC = 0x25: DABT (current EL), IL = 32 bits >>> | SET = 0, FnV = 0 >>> | EA = 0, S1PTW = 0 >>> | Data abort info: >>> | ISV = 0, ISS = 0x00000004 >>> | CM = 0, WnR = 0 >>> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 >>> | [0000000000001ec0] pgd=0000000000000000 >>> | Internal error: Oops: 96000004 [#1] PREEMPT SMP >>> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables >>> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 >>> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS >>> 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 >>> | pstate: 00000085 (nzcv daIf -PAN -UAO) >>> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 >>> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 >>> >>> | Call trace: >>> | kvm_arch_timer_get_input_level+0x1c/0x68 >>> | vgic_get_phys_line_level+0x3c/0x90 >>> | vgic_mmio_write_senable+0xe4/0x130 >>> | vgic_uaccess+0xe0/0x100 >>> | vgic_v3_redist_uaccess+0x5c/0x80 >>> | vgic_v3_attr_regs_access+0xf0/0x200 >>> | nvgic_v3_set_attr+0x234/0x250 >>> | kvm_device_ioctl_attr+0xa4/0xf8 >>> | kvm_device_ioctl+0x7c/0xc0 >>> | ksys_ioctl+0x1fc/0xc18 >>> | __arm64_sys_ioctl+0x24/0x30 >>> | do_el0_svc+0x7c/0x148 >>> | el0_sync_handler+0x138/0x258 >>> | el0_sync+0x140/0x180 >>> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) >>> | ---[ end trace 81287612d93f1e70 ]--- >>> | note: qemu-system-aar[2615] exited with preempt_count 1 >>> >>> Loading the vcpu doesn't make a lot of sense for handling a device >>> ioctl(), >>> so instead pass the vcpu through to kvm_arch_timer_get_input_level(). >>> Its >>> not clear that an intid makes much sense without the paired vcpu. >> >> I don't fully agree with the analysis, Remember we are looking at the >> state of the physical interrupt associated with a virtual interrupt, >> so >> the vcpu doesn't quite make sense here if it isn't loaded. > > But wasn't it that this function is meant to specifically deal with > this > *without* going to the hardware (which is costly, hence this > optimisation)? Because for the timer we *can* work out the logical IRQ > line state by examining our saved state? And this is what we do in > kvm_timer_should_fire(), when timer_ctx->loaded is false. Yes, but that's just a specialization of a more generic interface, which is "inspect the state of this *physical* intid". The fact that we are able to do it in a special way for the timer doesn't change the nature of the interface. > Which for me this sounds like the right thing to do in this situation: > the VCPU (and the timer) is not loaded, so we check our saved state and > construct the logical line level. We just need a valid VCPU struct to > achieve this, and hope for the virtual timer to be already initialised. > > Do I miss something here? Yes. You are missing that the *interface* is generic, and you can replace it with anything you want. Case in point, what we do when get_input_level is NULL. > Also to me it sound like the interface for this function is slightly > lacking, because just an intid is not enough to uniquely identify an > IRQ. It was just fine so far because of this special use case. This is a *physical* intid. It can only mean one single thing, and it only makes sense in the context of a vcpu if the device gets context-switched. I can remove the above fast path entirely, and everything will still work the same way, without having to pass any vcpu, because the *context* is what matters. Thanks, M.
On 08/04/2020 15:19, Marc Zyngier wrote: Hi Marc, > On 2020-04-08 13:13, André Przywara wrote: >> On 08/04/2020 11:07, Marc Zyngier wrote: >> >> Hi Marc, >> >>> Hi James, >>> >>> Thanks for looking into this. >>> >>> On Mon, 6 Apr 2020 16:03:55 +0100 >>> James Morse <james.morse@arm.com> wrote: >>> >>>> kvm_arch_timer_get_input_level() needs to get the arch_timer_context >>>> for >>>> a particular vcpu, and uses kvm_get_running_vcpu() to find it. >>>> >>>> kvm_arch_timer_get_input_level() may be called to handle a user-space >>>> write to the redistributor, where the vcpu is not loaded. This causes >>>> kvm_get_running_vcpu() to return NULL: >>>> | Unable to handle kernel paging request at virtual address >>>> 0000000000001ec0 >>>> | Mem abort info: >>>> | ESR = 0x96000004 >>>> | EC = 0x25: DABT (current EL), IL = 32 bits >>>> | SET = 0, FnV = 0 >>>> | EA = 0, S1PTW = 0 >>>> | Data abort info: >>>> | ISV = 0, ISS = 0x00000004 >>>> | CM = 0, WnR = 0 >>>> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 >>>> | [0000000000001ec0] pgd=0000000000000000 >>>> | Internal error: Oops: 96000004 [#1] PREEMPT SMP >>>> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables >>>> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 >>>> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS >>>> 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 >>>> | pstate: 00000085 (nzcv daIf -PAN -UAO) >>>> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 >>>> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 >>>> >>>> | Call trace: >>>> | kvm_arch_timer_get_input_level+0x1c/0x68 >>>> | vgic_get_phys_line_level+0x3c/0x90 >>>> | vgic_mmio_write_senable+0xe4/0x130 >>>> | vgic_uaccess+0xe0/0x100 >>>> | vgic_v3_redist_uaccess+0x5c/0x80 >>>> | vgic_v3_attr_regs_access+0xf0/0x200 >>>> | nvgic_v3_set_attr+0x234/0x250 >>>> | kvm_device_ioctl_attr+0xa4/0xf8 >>>> | kvm_device_ioctl+0x7c/0xc0 >>>> | ksys_ioctl+0x1fc/0xc18 >>>> | __arm64_sys_ioctl+0x24/0x30 >>>> | do_el0_svc+0x7c/0x148 >>>> | el0_sync_handler+0x138/0x258 >>>> | el0_sync+0x140/0x180 >>>> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) >>>> | ---[ end trace 81287612d93f1e70 ]--- >>>> | note: qemu-system-aar[2615] exited with preempt_count 1 >>>> >>>> Loading the vcpu doesn't make a lot of sense for handling a device >>>> ioctl(), >>>> so instead pass the vcpu through to >>>> kvm_arch_timer_get_input_level(). Its >>>> not clear that an intid makes much sense without the paired vcpu. >>> >>> I don't fully agree with the analysis, Remember we are looking at the >>> state of the physical interrupt associated with a virtual interrupt, so >>> the vcpu doesn't quite make sense here if it isn't loaded. >> >> But wasn't it that this function is meant to specifically deal with this >> *without* going to the hardware (which is costly, hence this >> optimisation)? Because for the timer we *can* work out the logical IRQ >> line state by examining our saved state? And this is what we do in >> kvm_timer_should_fire(), when timer_ctx->loaded is false. > > Yes, but that's just a specialization of a more generic interface, which is > "inspect the state of this *physical* intid". The fact that we are able > to do > it in a special way for the timer doesn't change the nature of the > interface. > >> Which for me this sounds like the right thing to do in this situation: >> the VCPU (and the timer) is not loaded, so we check our saved state and >> construct the logical line level. We just need a valid VCPU struct to >> achieve this, and hope for the virtual timer to be already initialised. >> >> Do I miss something here? > > Yes. You are missing that the *interface* is generic, and you can replace > it with anything you want. Case in point, what we do when get_input_level > is NULL. > >> Also to me it sound like the interface for this function is slightly >> lacking, because just an intid is not enough to uniquely identify an >> IRQ. It was just fine so far because of this special use case. > > This is a *physical* intid. Wait, I am confused, the type declaration in struct vgic_irq says: ... bool (*get_input_level)(int vintid); ^^^ Also in vgic.c:vgic_get_phys_line_level() we call irq->get_input_level(irq->intid), which is the virtual intid. But I see that the physical intid makes more sense here (in the spirit of: provide a shortcut for poking the GIC for the associated hwirq), but shouldn't we then pass at least irq->hwintid (which just happens to be the same in the arch timer case)? > It can only mean one single thing, and it > only makes sense in the context of a vcpu if the device gets > context-switched. I see, it's "this PPI on the current CPU, or this SPI/LPI in the system", and this call is always expected to happen during a context switch. And then indeed passing a VCPU doesn't make sense. Thanks for the explanation, I guess we should clarify this in the code then (because my suggestion was based on the idea that this is was a virtual IRQ). Cheers, Andre > I can remove the above fast path entirely, and everything will still work > the same way, without having to pass any vcpu, because the *context* is > what matters. > > Thanks, > > M.
On Wed, 8 Apr 2020 17:50:09 +0100 André Przywara <andre.przywara@arm.com> wrote: > On 08/04/2020 15:19, Marc Zyngier wrote: > > Hi Marc, > > > On 2020-04-08 13:13, André Przywara wrote: > >> On 08/04/2020 11:07, Marc Zyngier wrote: > >> > >> Hi Marc, > >> > >>> Hi James, > >>> > >>> Thanks for looking into this. > >>> > >>> On Mon, 6 Apr 2020 16:03:55 +0100 > >>> James Morse <james.morse@arm.com> wrote: > >>> > >>>> kvm_arch_timer_get_input_level() needs to get the arch_timer_context > >>>> for > >>>> a particular vcpu, and uses kvm_get_running_vcpu() to find it. > >>>> > >>>> kvm_arch_timer_get_input_level() may be called to handle a user-space > >>>> write to the redistributor, where the vcpu is not loaded. This causes > >>>> kvm_get_running_vcpu() to return NULL: > >>>> | Unable to handle kernel paging request at virtual address > >>>> 0000000000001ec0 > >>>> | Mem abort info: > >>>> | ESR = 0x96000004 > >>>> | EC = 0x25: DABT (current EL), IL = 32 bits > >>>> | SET = 0, FnV = 0 > >>>> | EA = 0, S1PTW = 0 > >>>> | Data abort info: > >>>> | ISV = 0, ISS = 0x00000004 > >>>> | CM = 0, WnR = 0 > >>>> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 > >>>> | [0000000000001ec0] pgd=0000000000000000 > >>>> | Internal error: Oops: 96000004 [#1] PREEMPT SMP > >>>> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables > >>>> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 > >>>> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS > >>>> 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 > >>>> | pstate: 00000085 (nzcv daIf -PAN -UAO) > >>>> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 > >>>> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 > >>>> > >>>> | Call trace: > >>>> | kvm_arch_timer_get_input_level+0x1c/0x68 > >>>> | vgic_get_phys_line_level+0x3c/0x90 > >>>> | vgic_mmio_write_senable+0xe4/0x130 > >>>> | vgic_uaccess+0xe0/0x100 > >>>> | vgic_v3_redist_uaccess+0x5c/0x80 > >>>> | vgic_v3_attr_regs_access+0xf0/0x200 > >>>> | nvgic_v3_set_attr+0x234/0x250 > >>>> | kvm_device_ioctl_attr+0xa4/0xf8 > >>>> | kvm_device_ioctl+0x7c/0xc0 > >>>> | ksys_ioctl+0x1fc/0xc18 > >>>> | __arm64_sys_ioctl+0x24/0x30 > >>>> | do_el0_svc+0x7c/0x148 > >>>> | el0_sync_handler+0x138/0x258 > >>>> | el0_sync+0x140/0x180 > >>>> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) > >>>> | ---[ end trace 81287612d93f1e70 ]--- > >>>> | note: qemu-system-aar[2615] exited with preempt_count 1 > >>>> > >>>> Loading the vcpu doesn't make a lot of sense for handling a device > >>>> ioctl(), > >>>> so instead pass the vcpu through to > >>>> kvm_arch_timer_get_input_level(). Its > >>>> not clear that an intid makes much sense without the paired vcpu. > >>> > >>> I don't fully agree with the analysis, Remember we are looking at the > >>> state of the physical interrupt associated with a virtual interrupt, so > >>> the vcpu doesn't quite make sense here if it isn't loaded. > >> > >> But wasn't it that this function is meant to specifically deal with this > >> *without* going to the hardware (which is costly, hence this > >> optimisation)? Because for the timer we *can* work out the logical IRQ > >> line state by examining our saved state? And this is what we do in > >> kvm_timer_should_fire(), when timer_ctx->loaded is false. > > > > Yes, but that's just a specialization of a more generic interface, which is > > "inspect the state of this *physical* intid". The fact that we are able > > to do > > it in a special way for the timer doesn't change the nature of the > > interface. > > > > > >> Which for me this sounds like the right thing to do in this situation: > >> the VCPU (and the timer) is not loaded, so we check our saved state and > >> construct the logical line level. We just need a valid VCPU struct to > >> achieve this, and hope for the virtual timer to be already initialised. > >> > >> Do I miss something here? > > > > Yes. You are missing that the *interface* is generic, and you can replace > > it with anything you want. Case in point, what we do when get_input_level > > is NULL. > > > >> Also to me it sound like the interface for this function is slightly > >> lacking, because just an intid is not enough to uniquely identify an > >> IRQ. It was just fine so far because of this special use case. > > > > This is a *physical* intid. > > Wait, I am confused, the type declaration in struct vgic_irq says: > ... > bool (*get_input_level)(int vintid); > ^^^ > Also in vgic.c:vgic_get_phys_line_level() we call > irq->get_input_level(irq->intid), which is the virtual intid. Yeah, that's not great indeed. It is a cunning shortcut to get to the timer, but that really should be the host irq. > But I see that the physical intid makes more sense here (in the spirit > of: provide a shortcut for poking the GIC for the associated hwirq), but > shouldn't we then pass at least irq->hwintid (which just happens to be > the same in the arch timer case)? hwintid isn't really something we should consider, as it is an implementation detail of the physical GIC and list registers. It is too low-level to be generally useful. The host_irq field, on the other hand, is a better information source, and the timer already has this stashed. Overall, we could just pass the pointer to the vgic_irq, and let the helper do whatever it needs to sort it out. After all, it is supposed to be faster than going to the GIC, so we can have a bit of leeway here. Not a big deal, as this isn't the part that is broken ATM. Thanks, M.
Hi James, On Wed, 8 Apr 2020 12:16:01 +0100 James Morse <james.morse@arm.com> wrote: > Hi Marc, > > On 08/04/2020 11:07, Marc Zyngier wrote: > > On Mon, 6 Apr 2020 16:03:55 +0100 > > James Morse <james.morse@arm.com> wrote: > > > >> kvm_arch_timer_get_input_level() needs to get the arch_timer_context for > >> a particular vcpu, and uses kvm_get_running_vcpu() to find it. > >> > >> kvm_arch_timer_get_input_level() may be called to handle a user-space > >> write to the redistributor, where the vcpu is not loaded. This causes > >> kvm_get_running_vcpu() to return NULL: > >> | Unable to handle kernel paging request at virtual address 0000000000001ec0 > >> | Mem abort info: > >> | ESR = 0x96000004 > >> | EC = 0x25: DABT (current EL), IL = 32 bits > >> | SET = 0, FnV = 0 > >> | EA = 0, S1PTW = 0 > >> | Data abort info: > >> | ISV = 0, ISS = 0x00000004 > >> | CM = 0, WnR = 0 > >> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 > >> | [0000000000001ec0] pgd=0000000000000000 > >> | Internal error: Oops: 96000004 [#1] PREEMPT SMP > >> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables > >> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 > >> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 > >> | pstate: 00000085 (nzcv daIf -PAN -UAO) > >> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 > >> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 > >> > >> | Call trace: > >> | kvm_arch_timer_get_input_level+0x1c/0x68 > >> | vgic_get_phys_line_level+0x3c/0x90 > >> | vgic_mmio_write_senable+0xe4/0x130 > >> | vgic_uaccess+0xe0/0x100 > >> | vgic_v3_redist_uaccess+0x5c/0x80 > >> | vgic_v3_attr_regs_access+0xf0/0x200 > >> | nvgic_v3_set_attr+0x234/0x250 > >> | kvm_device_ioctl_attr+0xa4/0xf8 > >> | kvm_device_ioctl+0x7c/0xc0 > >> | ksys_ioctl+0x1fc/0xc18 > >> | __arm64_sys_ioctl+0x24/0x30 > >> | do_el0_svc+0x7c/0x148 > >> | el0_sync_handler+0x138/0x258 > >> | el0_sync+0x140/0x180 > >> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) > >> | ---[ end trace 81287612d93f1e70 ]--- > >> | note: qemu-system-aar[2615] exited with preempt_count 1 > >> > >> Loading the vcpu doesn't make a lot of sense for handling a device ioctl(), > >> so instead pass the vcpu through to kvm_arch_timer_get_input_level(). Its > >> not clear that an intid makes much sense without the paired vcpu. > > > > I don't fully agree with the analysis, Remember we are looking at the > > state of the physical interrupt associated with a virtual interrupt, so > > the vcpu doesn't quite make sense here if it isn't loaded. > > > > What does it mean to look at the HW timer when we are not in the right > > context? For all we know, it is completely random (the only guarantee > > we have is that it is disabled, actually). > > > My gut feeling is that this is another instance where we should provide > > specific userspace accessors that would only deal with the virtual > > state, and leave anything that deals with the physical state of the > > interrupt to be exercised only by the guest. > > > Does it make sense? > > Broadly, yes. Specifically ... I'm not familiar enough with this code to work out where > such a change should go! > > ~20 mins of grepping later~ > > Remove REGISTER_DESC_WITH_LENGTH() so that uaccess helpers have to be provided, and forbid > NULL for the ur/uw values in REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED()...? I'd suggest something like this (untested, of course): diff --git a/virt/kvm/arm/vgic/vgic-mmio-v2.c b/virt/kvm/arm/vgic/vgic-mmio-v2.c index d63881f60e1a..f51c6e939c76 100644 --- a/virt/kvm/arm/vgic/vgic-mmio-v2.c +++ b/virt/kvm/arm/vgic/vgic-mmio-v2.c @@ -409,10 +409,12 @@ static const struct vgic_register_region vgic_v2_dist_registers[] = { NULL, vgic_mmio_uaccess_write_v2_group, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ENABLE_SET, - vgic_mmio_read_enable, vgic_mmio_write_senable, NULL, NULL, 1, + vgic_mmio_read_enable, vgic_mmio_write_senable, + NULL, vgic_uaccess_write_senable, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_ENABLE_CLEAR, - vgic_mmio_read_enable, vgic_mmio_write_cenable, NULL, NULL, 1, + vgic_mmio_read_enable, vgic_mmio_write_cenable, + NULL, vgic_uaccess_write_cenable, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ(GIC_DIST_PENDING_SET, vgic_mmio_read_pending, vgic_mmio_write_spending, NULL, NULL, 1, diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c index 77c8ba1a2535..a9c45048fadb 100644 --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c @@ -538,10 +538,12 @@ static const struct vgic_register_region vgic_v3_dist_registers[] = { vgic_mmio_read_group, vgic_mmio_write_group, NULL, NULL, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ISENABLER, - vgic_mmio_read_enable, vgic_mmio_write_senable, NULL, NULL, 1, + vgic_mmio_read_enable, vgic_mmio_write_senable, + NULL, vgic_uaccess_write_senable, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ICENABLER, - vgic_mmio_read_enable, vgic_mmio_write_cenable, NULL, NULL, 1, + vgic_mmio_read_enable, vgic_mmio_write_cenable, + NULL, vgic_uaccess_write_cenable, 1, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED(GICD_ISPENDR, vgic_mmio_read_pending, vgic_mmio_write_spending, @@ -609,11 +611,13 @@ static const struct vgic_register_region vgic_v3_rd_registers[] = { REGISTER_DESC_WITH_LENGTH(SZ_64K + GICR_IGROUPR0, vgic_mmio_read_group, vgic_mmio_write_group, 4, VGIC_ACCESS_32bit), - REGISTER_DESC_WITH_LENGTH(SZ_64K + GICR_ISENABLER0, - vgic_mmio_read_enable, vgic_mmio_write_senable, 4, + REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ISENABLER0, + vgic_mmio_read_enable, vgic_mmio_write_senable, + NULL, vgic_uaccess_write_senable, 4, VGIC_ACCESS_32bit), - REGISTER_DESC_WITH_LENGTH(SZ_64K + GICR_ICENABLER0, - vgic_mmio_read_enable, vgic_mmio_write_cenable, 4, + REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ICENABLER0, + vgic_mmio_read_enable, vgic_mmio_write_cenable, + NULL, vgic_uaccess_write_cenable, 4, VGIC_ACCESS_32bit), REGISTER_DESC_WITH_LENGTH_UACCESS(SZ_64K + GICR_ISPENDR0, vgic_mmio_read_pending, vgic_mmio_write_spending, diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c index 4012cd68ac93..2ca11b05b17b 100644 --- a/virt/kvm/arm/vgic/vgic-mmio.c +++ b/virt/kvm/arm/vgic/vgic-mmio.c @@ -184,6 +184,48 @@ void vgic_mmio_write_cenable(struct kvm_vcpu *vcpu, } } +int vgic_uaccess_write_senable(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len, + unsigned long val) +{ + u32 intid = VGIC_ADDR_TO_INTID(addr, 1); + int i; + unsigned long flags; + + for_each_set_bit(i, &val, len * 8) { + struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); + + raw_spin_lock_irqsave(&irq->irq_lock, flags); + irq->enabled = true; + vgic_queue_irq_unlock(vcpu->kvm, irq, flags); + + vgic_put_irq(vcpu->kvm, irq); + } + + return 0; +} + +int vgic_uaccess_write_cenable(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len, + unsigned long val) +{ + u32 intid = VGIC_ADDR_TO_INTID(addr, 1); + int i; + unsigned long flags; + + for_each_set_bit(i, &val, len * 8) { + struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); + + raw_spin_lock_irqsave(&irq->irq_lock, flags); + irq->enabled = false; + raw_spin_unlock_irqrestore(&irq->irq_lock, flags); + + vgic_put_irq(vcpu->kvm, irq); + } + + return 0; +} + unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len) { diff --git a/virt/kvm/arm/vgic/vgic-mmio.h b/virt/kvm/arm/vgic/vgic-mmio.h index 30713a44e3fa..327d0a6938e4 100644 --- a/virt/kvm/arm/vgic/vgic-mmio.h +++ b/virt/kvm/arm/vgic/vgic-mmio.h @@ -138,6 +138,14 @@ void vgic_mmio_write_cenable(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len, unsigned long val); +int vgic_uaccess_write_senable(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len, + unsigned long val); + +int vgic_uaccess_write_cenable(struct kvm_vcpu *vcpu, + gpa_t addr, unsigned int len, + unsigned long val); + unsigned long vgic_mmio_read_pending(struct kvm_vcpu *vcpu, gpa_t addr, unsigned int len); > > Or if that is too invasive, something like, (totally, untested): > ----------------%<---------------- > diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c > index 97fb2a40e6ba..30ae5f29e429 100644 > --- a/virt/kvm/arm/vgic/vgic-mmio.c > +++ b/virt/kvm/arm/vgic/vgic-mmio.c > @@ -113,10 +113,11 @@ void vgic_mmio_write_senable(struct kvm_vcpu *vcpu, > struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); > > raw_spin_lock_irqsave(&irq->irq_lock, flags); > - if (vgic_irq_is_mapped_level(irq)) { > + if (kvm_running_vcpu() && vgic_irq_is_mapped_level(irq)) { > bool was_high = irq->line_level; > > /* > + * Unless we are running due to a user-space access, > * We need to update the state of the interrupt because > * the guest might have changed the state of the device > * while the interrupt was disabled at the VGIC level. > ----------------%<---------------- Huh, nice try! ;-) Unfortunately, I think there is more than the enable register that suffers from a similar problem (see how the pending register write is also accessing the HW state, even if accessed from userspace). Thanks, M.
On 09/04/2020 09:08, Marc Zyngier wrote: Hi Marc, > On Wed, 8 Apr 2020 17:50:09 +0100 > André Przywara <andre.przywara@arm.com> wrote: > >> On 08/04/2020 15:19, Marc Zyngier wrote: >> >> Hi Marc, >> >>> On 2020-04-08 13:13, André Przywara wrote: >>>> On 08/04/2020 11:07, Marc Zyngier wrote: >>>> >>>> Hi Marc, >>>> >>>>> Hi James, >>>>> >>>>> Thanks for looking into this. >>>>> >>>>> On Mon, 6 Apr 2020 16:03:55 +0100 >>>>> James Morse <james.morse@arm.com> wrote: >>>>> >>>>>> kvm_arch_timer_get_input_level() needs to get the arch_timer_context >>>>>> for >>>>>> a particular vcpu, and uses kvm_get_running_vcpu() to find it. >>>>>> >>>>>> kvm_arch_timer_get_input_level() may be called to handle a user-space >>>>>> write to the redistributor, where the vcpu is not loaded. This causes >>>>>> kvm_get_running_vcpu() to return NULL: >>>>>> | Unable to handle kernel paging request at virtual address >>>>>> 0000000000001ec0 >>>>>> | Mem abort info: >>>>>> | ESR = 0x96000004 >>>>>> | EC = 0x25: DABT (current EL), IL = 32 bits >>>>>> | SET = 0, FnV = 0 >>>>>> | EA = 0, S1PTW = 0 >>>>>> | Data abort info: >>>>>> | ISV = 0, ISS = 0x00000004 >>>>>> | CM = 0, WnR = 0 >>>>>> | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 >>>>>> | [0000000000001ec0] pgd=0000000000000000 >>>>>> | Internal error: Oops: 96000004 [#1] PREEMPT SMP >>>>>> | Modules linked in: r8169 realtek efivarfs ip_tables x_tables >>>>>> | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 >>>>>> | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS >>>>>> 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 >>>>>> | pstate: 00000085 (nzcv daIf -PAN -UAO) >>>>>> | pc : kvm_arch_timer_get_input_level+0x1c/0x68 >>>>>> | lr : kvm_arch_timer_get_input_level+0x1c/0x68 >>>>>> >>>>>> | Call trace: >>>>>> | kvm_arch_timer_get_input_level+0x1c/0x68 >>>>>> | vgic_get_phys_line_level+0x3c/0x90 >>>>>> | vgic_mmio_write_senable+0xe4/0x130 >>>>>> | vgic_uaccess+0xe0/0x100 >>>>>> | vgic_v3_redist_uaccess+0x5c/0x80 >>>>>> | vgic_v3_attr_regs_access+0xf0/0x200 >>>>>> | nvgic_v3_set_attr+0x234/0x250 >>>>>> | kvm_device_ioctl_attr+0xa4/0xf8 >>>>>> | kvm_device_ioctl+0x7c/0xc0 >>>>>> | ksys_ioctl+0x1fc/0xc18 >>>>>> | __arm64_sys_ioctl+0x24/0x30 >>>>>> | do_el0_svc+0x7c/0x148 >>>>>> | el0_sync_handler+0x138/0x258 >>>>>> | el0_sync+0x140/0x180 >>>>>> | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) >>>>>> | ---[ end trace 81287612d93f1e70 ]--- >>>>>> | note: qemu-system-aar[2615] exited with preempt_count 1 >>>>>> >>>>>> Loading the vcpu doesn't make a lot of sense for handling a device >>>>>> ioctl(), >>>>>> so instead pass the vcpu through to >>>>>> kvm_arch_timer_get_input_level(). Its >>>>>> not clear that an intid makes much sense without the paired vcpu. >>>>> >>>>> I don't fully agree with the analysis, Remember we are looking at the >>>>> state of the physical interrupt associated with a virtual interrupt, so >>>>> the vcpu doesn't quite make sense here if it isn't loaded. >>>> >>>> But wasn't it that this function is meant to specifically deal with this >>>> *without* going to the hardware (which is costly, hence this >>>> optimisation)? Because for the timer we *can* work out the logical IRQ >>>> line state by examining our saved state? And this is what we do in >>>> kvm_timer_should_fire(), when timer_ctx->loaded is false. >>> >>> Yes, but that's just a specialization of a more generic interface, which is >>> "inspect the state of this *physical* intid". The fact that we are able >>> to do >>> it in a special way for the timer doesn't change the nature of the >>> interface. >> >> >>> >>>> Which for me this sounds like the right thing to do in this situation: >>>> the VCPU (and the timer) is not loaded, so we check our saved state and >>>> construct the logical line level. We just need a valid VCPU struct to >>>> achieve this, and hope for the virtual timer to be already initialised. >>>> >>>> Do I miss something here? >>> >>> Yes. You are missing that the *interface* is generic, and you can replace >>> it with anything you want. Case in point, what we do when get_input_level >>> is NULL. >>> >>>> Also to me it sound like the interface for this function is slightly >>>> lacking, because just an intid is not enough to uniquely identify an >>>> IRQ. It was just fine so far because of this special use case. >>> >>> This is a *physical* intid. >> >> Wait, I am confused, the type declaration in struct vgic_irq says: >> ... >> bool (*get_input_level)(int vintid); >> ^^^ >> Also in vgic.c:vgic_get_phys_line_level() we call >> irq->get_input_level(irq->intid), which is the virtual intid. > > Yeah, that's not great indeed. It is a cunning shortcut to get to the > timer, but that really should be the host irq. >> But I see that the physical intid makes more sense here (in the spirit >> of: provide a shortcut for poking the GIC for the associated hwirq), but >> shouldn't we then pass at least irq->hwintid (which just happens to be >> the same in the arch timer case)? > > hwintid isn't really something we should consider, as it is an > implementation detail of the physical GIC and list registers. It is > too low-level to be generally useful. The host_irq field, on the other > hand, is a better information source, and the timer already has this > stashed. That sounds indeed more logical, given that we use that number for the default handling (irq_get_irqchip_state()). > Overall, we could just pass the pointer to the vgic_irq, and let the > helper do whatever it needs to sort it out. After all, it is supposed > to be faster than going to the GIC, so we can have a bit of leeway here. That sounds tempting, but didn't we consider struct vgic_irq a data structure private to the VGIC? I remember we jumped through some hoops to get rid of internal VGIC data in arch_timer.c, for instance. A quick grep for vgic_irq returns only hits in the virt/kvm/arm/vgic directory. So maybe we should not try to sell this get_input_level() as a generic solution, but demote it to what it really is: a "cunning shortcut"(TM) to optimise the arch timer? What about we change the interface to use the Linux IRQ number, rewrite the implementation in arch_timer.c to match that against host_[vp]timer_irq and adjust the comments accordingly? Then the next user (shall there be one) can extend this interface as needed. Plus your code to provide user space accessors. Cheers, Andre > Not a big deal, as this isn't the part that is broken ATM. > > Thanks, > > M. >
Hi Marc, On 09/04/2020 09:27, Marc Zyngier wrote: > On Wed, 8 Apr 2020 12:16:01 +0100 > James Morse <james.morse@arm.com> wrote: >> On 08/04/2020 11:07, Marc Zyngier wrote: >>> I don't fully agree with the analysis, Remember we are looking at the >>> state of the physical interrupt associated with a virtual interrupt, so >>> the vcpu doesn't quite make sense here if it isn't loaded. >>> >>> What does it mean to look at the HW timer when we are not in the right >>> context? For all we know, it is completely random (the only guarantee >>> we have is that it is disabled, actually). >> >>> My gut feeling is that this is another instance where we should provide >>> specific userspace accessors that would only deal with the virtual >>> state, and leave anything that deals with the physical state of the >>> interrupt to be exercised only by the guest. >> >>> Does it make sense? >> >> Broadly, yes. Specifically ... I'm not familiar enough with this code to work out where >> such a change should go! >> >> ~20 mins of grepping later~ >> >> Remove REGISTER_DESC_WITH_LENGTH() so that uaccess helpers have to be provided, and forbid >> NULL for the ur/uw values in REGISTER_DESC_WITH_BITS_PER_IRQ_SHARED()...? > > I'd suggest something like this (untested, of course): [...] >> Or if that is too invasive, something like, (totally, untested): >> ----------------%<---------------- >> diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c >> index 97fb2a40e6ba..30ae5f29e429 100644 >> --- a/virt/kvm/arm/vgic/vgic-mmio.c >> +++ b/virt/kvm/arm/vgic/vgic-mmio.c >> @@ -113,10 +113,11 @@ void vgic_mmio_write_senable(struct kvm_vcpu *vcpu, >> struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, intid + i); >> >> raw_spin_lock_irqsave(&irq->irq_lock, flags); >> - if (vgic_irq_is_mapped_level(irq)) { >> + if (kvm_running_vcpu() && vgic_irq_is_mapped_level(irq)) { >> bool was_high = irq->line_level; >> >> /* >> + * Unless we are running due to a user-space access, >> * We need to update the state of the interrupt because >> * the guest might have changed the state of the device >> * while the interrupt was disabled at the VGIC level. >> ----------------%<---------------- > > Huh, nice try! ;-) Unfortunately, I think there is more than the enable > register that suffers from a similar problem (see how the pending > register write is also accessing the HW state, even if accessed from > userspace). Yeah, I'd expect to play wack-a-mole if I actually tested it. It was just the smallest, er, hack I could get my head round given your explanation. I've blindly tested your version, it works for me on a gicv2 machine: Tested-by: James Morse <james.morse@arm.com> I'll test on the gicv3 espressobin that I originally saw this on with rc1 on Tuesday. Do you want me to post it back to you as a tested patch? You can judge whether I understand it from the commit message... (I'd need your Signed-off-by...) Have a good extended weekend! Thanks, James
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h index d120e6c..42a016a 100644 --- a/include/kvm/arm_arch_timer.h +++ b/include/kvm/arm_arch_timer.h @@ -92,7 +92,7 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu); void kvm_timer_init_vhe(void); -bool kvm_arch_timer_get_input_level(int vintid); +bool kvm_arch_timer_get_input_level(int vintid, struct kvm_vcpu *vcpu); #define vcpu_timer(v) (&(v)->arch.timer_cpu) #define vcpu_get_timer(v,t) (&vcpu_timer(v)->timers[(t)]) diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 9d53f54..41e91b3 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -130,11 +130,9 @@ struct vgic_irq { * state of the input level of mapped level-triggered IRQ faster than * peaking into the physical GIC. * - * Always called in non-preemptible section and the functions can use - * kvm_arm_get_running_vcpu() to get the vcpu pointer for private - * IRQs. + * Always called in non-preemptible section. */ - bool (*get_input_level)(int vintid); + bool (*get_input_level)(int vintid, struct kvm_vcpu *vcpu); void *owner; /* Opaque pointer to reserve an interrupt for in-kernel devices. */ @@ -344,7 +342,7 @@ void kvm_vgic_init_cpu_hardware(void); int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid, bool level, void *owner); int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, - u32 vintid, bool (*get_input_level)(int vindid)); + u32 vintid, bool (*get_input_level)(int vindid, struct kvm_vcpu *vcpu)); int kvm_vgic_unmap_phys_irq(struct kvm_vcpu *vcpu, unsigned int vintid); bool kvm_vgic_map_is_active(struct kvm_vcpu *vcpu, unsigned int vintid); diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c index 0d9438e..ca0e87b 100644 --- a/virt/kvm/arm/arch_timer.c +++ b/virt/kvm/arm/arch_timer.c @@ -1021,9 +1021,8 @@ static bool timer_irqs_are_valid(struct kvm_vcpu *vcpu) return true; } -bool kvm_arch_timer_get_input_level(int vintid) +bool kvm_arch_timer_get_input_level(int vintid, struct kvm_vcpu *vcpu) { - struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); struct arch_timer_context *timer; if (vintid == vcpu_vtimer(vcpu)->irq.irq) diff --git a/virt/kvm/arm/vgic/vgic-mmio.c b/virt/kvm/arm/vgic/vgic-mmio.c index 97fb2a4..37ee2f8 100644 --- a/virt/kvm/arm/vgic/vgic-mmio.c +++ b/virt/kvm/arm/vgic/vgic-mmio.c @@ -121,7 +121,7 @@ void vgic_mmio_write_senable(struct kvm_vcpu *vcpu, * the guest might have changed the state of the device * while the interrupt was disabled at the VGIC level. */ - irq->line_level = vgic_get_phys_line_level(irq); + irq->line_level = vgic_get_phys_line_level(irq, vcpu); /* * Deactivate the physical interrupt so the GIC will let * us know when it is asserted again. diff --git a/virt/kvm/arm/vgic/vgic-v2.c b/virt/kvm/arm/vgic/vgic-v2.c index 621cc16..e126f25 100644 --- a/virt/kvm/arm/vgic/vgic-v2.c +++ b/virt/kvm/arm/vgic/vgic-v2.c @@ -110,7 +110,7 @@ void vgic_v2_fold_lr_state(struct kvm_vcpu *vcpu) * told when the interrupt becomes asserted again. */ if (vgic_irq_is_mapped_level(irq) && (val & GICH_LR_PENDING_BIT)) { - irq->line_level = vgic_get_phys_line_level(irq); + irq->line_level = vgic_get_phys_line_level(irq, vcpu); if (!irq->line_level) vgic_irq_set_phys_active(irq, false); diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c index f45635a..ff861fa 100644 --- a/virt/kvm/arm/vgic/vgic-v3.c +++ b/virt/kvm/arm/vgic/vgic-v3.c @@ -101,7 +101,7 @@ void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu) * told when the interrupt becomes asserted again. */ if (vgic_irq_is_mapped_level(irq) && (val & ICH_LR_PENDING_BIT)) { - irq->line_level = vgic_get_phys_line_level(irq); + irq->line_level = vgic_get_phys_line_level(irq, vcpu); if (!irq->line_level) vgic_irq_set_phys_active(irq, false); diff --git a/virt/kvm/arm/vgic/vgic.c b/virt/kvm/arm/vgic/vgic.c index 99b02ca..d113b5b 100644 --- a/virt/kvm/arm/vgic/vgic.c +++ b/virt/kvm/arm/vgic/vgic.c @@ -176,14 +176,14 @@ void vgic_irq_set_phys_pending(struct vgic_irq *irq, bool pending) pending)); } -bool vgic_get_phys_line_level(struct vgic_irq *irq) +bool vgic_get_phys_line_level(struct vgic_irq *irq, struct kvm_vcpu *vcpu) { bool line_level; BUG_ON(!irq->hw); if (irq->get_input_level) - return irq->get_input_level(irq->intid); + return irq->get_input_level(irq->intid, vcpu); WARN_ON(irq_get_irqchip_state(irq->host_irq, IRQCHIP_STATE_PENDING, @@ -479,7 +479,7 @@ int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int intid, /* @irq->irq_lock must be held */ static int kvm_vgic_map_irq(struct kvm_vcpu *vcpu, struct vgic_irq *irq, unsigned int host_irq, - bool (*get_input_level)(int vindid)) + bool (*get_input_level)(int vindid, struct kvm_vcpu *vcpu)) { struct irq_desc *desc; struct irq_data *data; @@ -512,7 +512,7 @@ static inline void kvm_vgic_unmap_irq(struct vgic_irq *irq) } int kvm_vgic_map_phys_irq(struct kvm_vcpu *vcpu, unsigned int host_irq, - u32 vintid, bool (*get_input_level)(int vindid)) + u32 vintid, bool (*get_input_level)(int vindid, struct kvm_vcpu *vcpu)) { struct vgic_irq *irq = vgic_get_irq(vcpu->kvm, vcpu, vintid); unsigned long flags; diff --git a/virt/kvm/arm/vgic/vgic.h b/virt/kvm/arm/vgic/vgic.h index c7fefd6..622865e 100644 --- a/virt/kvm/arm/vgic/vgic.h +++ b/virt/kvm/arm/vgic/vgic.h @@ -163,7 +163,7 @@ struct vgic_irq *vgic_get_irq(struct kvm *kvm, struct kvm_vcpu *vcpu, u32 intid); void __vgic_put_lpi_locked(struct kvm *kvm, struct vgic_irq *irq); void vgic_put_irq(struct kvm *kvm, struct vgic_irq *irq); -bool vgic_get_phys_line_level(struct vgic_irq *irq); +bool vgic_get_phys_line_level(struct vgic_irq *irq, struct kvm_vcpu *vcpu); void vgic_irq_set_phys_pending(struct vgic_irq *irq, bool pending); void vgic_irq_set_phys_active(struct vgic_irq *irq, bool active); bool vgic_queue_irq_unlock(struct kvm *kvm, struct vgic_irq *irq,
kvm_arch_timer_get_input_level() needs to get the arch_timer_context for a particular vcpu, and uses kvm_get_running_vcpu() to find it. kvm_arch_timer_get_input_level() may be called to handle a user-space write to the redistributor, where the vcpu is not loaded. This causes kvm_get_running_vcpu() to return NULL: | Unable to handle kernel paging request at virtual address 0000000000001ec0 | Mem abort info: | ESR = 0x96000004 | EC = 0x25: DABT (current EL), IL = 32 bits | SET = 0, FnV = 0 | EA = 0, S1PTW = 0 | Data abort info: | ISV = 0, ISS = 0x00000004 | CM = 0, WnR = 0 | user pgtable: 4k pages, 48-bit VAs, pgdp=000000003cbf9000 | [0000000000001ec0] pgd=0000000000000000 | Internal error: Oops: 96000004 [#1] PREEMPT SMP | Modules linked in: r8169 realtek efivarfs ip_tables x_tables | CPU: 1 PID: 2615 Comm: qemu-system-aar Not tainted 5.6.0-rc7 #30 | Hardware name: Marvell mvebu_armada-37xx/mvebu_armada-37xx, BIOS 2018.03-devel-18.12.3-gc9aa92c-armbian 02/20/2019 | pstate: 00000085 (nzcv daIf -PAN -UAO) | pc : kvm_arch_timer_get_input_level+0x1c/0x68 | lr : kvm_arch_timer_get_input_level+0x1c/0x68 | Call trace: | kvm_arch_timer_get_input_level+0x1c/0x68 | vgic_get_phys_line_level+0x3c/0x90 | vgic_mmio_write_senable+0xe4/0x130 | vgic_uaccess+0xe0/0x100 | vgic_v3_redist_uaccess+0x5c/0x80 | vgic_v3_attr_regs_access+0xf0/0x200 | nvgic_v3_set_attr+0x234/0x250 | kvm_device_ioctl_attr+0xa4/0xf8 | kvm_device_ioctl+0x7c/0xc0 | ksys_ioctl+0x1fc/0xc18 | __arm64_sys_ioctl+0x24/0x30 | do_el0_svc+0x7c/0x148 | el0_sync_handler+0x138/0x258 | el0_sync+0x140/0x180 | Code: 910003fd f9000bf3 2a0003f3 97ff650c (b95ec001) | ---[ end trace 81287612d93f1e70 ]--- | note: qemu-system-aar[2615] exited with preempt_count 1 Loading the vcpu doesn't make a lot of sense for handling a device ioctl(), so instead pass the vcpu through to kvm_arch_timer_get_input_level(). Its not clear that an intid makes much sense without the paired vcpu. Suggested-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: James Morse <james.morse@arm.com> --- include/kvm/arm_arch_timer.h | 2 +- include/kvm/arm_vgic.h | 8 +++----- virt/kvm/arm/arch_timer.c | 3 +-- virt/kvm/arm/vgic/vgic-mmio.c | 2 +- virt/kvm/arm/vgic/vgic-v2.c | 2 +- virt/kvm/arm/vgic/vgic-v3.c | 2 +- virt/kvm/arm/vgic/vgic.c | 8 ++++---- virt/kvm/arm/vgic/vgic.h | 2 +- 8 files changed, 13 insertions(+), 16 deletions(-)