Message ID | 20230316200219.42673-2-joao.m.martins@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | iommu/amd: Fix GAM IRTEs affinity and GALog restart |
On Thu, Mar 16, 2023, Joao Martins wrote:
> On KVM GSI routing table updates, specially those where they have vIOMMUs
> with interrupt remapping enabled (to boot >255vcpus setups without relying
> on KVM_FEATURE_MSI_EXT_DEST_ID), a VMM may update the backing VF MSIs
> with a new VCPU affinity.
>
> On AMD with AVIC enabled, the new vcpu affinity info is updated via:
> 	avic_pi_update_irte()
> 		irq_set_vcpu_affinity()
> 			amd_ir_set_vcpu_affinity()
> 				amd_iommu_{de}activate_guest_mode()
>
> Where the IRTE[GATag] is updated with the new vcpu affinity. The GATag
> contains VM ID and VCPU ID, and is used by IOMMU hardware to signal KVM
> (via GALog) when interrupt cannot be delivered due to vCPU is in
> blocking state.
>
> The issue is that amd_iommu_activate_guest_mode() will essentially
> only change IRTE fields on transitions from non-guest-mode to guest-mode
> and otherwise returns *with no changes to IRTE* on already configured
> guest-mode interrupts. To the guest this means that the VF interrupts
> remain affined to the first vCPU they were first configured, and guest
> will be unable to either VF interrupts and receive messages like this
> from spuruious interrupts (e.g. from waking the wrong vCPU in GALog):
>
> [ 167.759472] __common_interrupt: 3.34 No irq handler for vector
> [ 230.680927] mlx5_core 0000:00:02.0: mlx5_cmd_eq_recover:247:(pid
> 3122): Recovered 1 EQEs on cmd_eq
> [ 230.681799] mlx5_core 0000:00:02.0:
> wait_func_handle_exec_timeout:1113:(pid 3122): cmd[0]: CREATE_CQ(0x400)
> recovered after timeout
> [ 230.683266] __common_interrupt: 3.34 No irq handler for vector
>
> Given the fact that amd_ir_set_vcpu_affinity() uses
> amd_iommu_activate_guest_mode() underneath it essentially means that VCPU
> affinity changes of IRTEs are nops. Fix it by dropping the check for
> guest-mode at amd_iommu_activate_guest_mode(). Same thing is applicable to
> amd_iommu_deactivate_guest_mode() although, even if the IRTE doesn't change
> underlying DestID on the host, the VFIO IRQ handler will still be able to
> poke at the right guest-vCPU.

Is there any harm in giving deactivate the same treatment? If the worst case
scenario is a few wasted cycles, having symmetric flows and eliminating benign
bugs seems like a worthwhile tradeoff (assuming this is indeed a relatively slow
path like I think it is).

> Fixes: b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC (de-)activation code")
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> ---
>  drivers/iommu/amd/iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 5a505ba5467e..bf3ebc9d6cde 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3485,7 +3485,7 @@ int amd_iommu_activate_guest_mode(void *data)

Any chance you (or anyone) would want to create a follow-up series to rename and/or
rework these flows to make it more obvious that the helpers handle updates as well
as transitions between "guest mode" and "host mode"? E.g. I can see KVM getting
clever and skipping the "activation" when KVM knows AVIC is already active (though
I can't tell for certain whether or not that would actually be problematic).

>  	u64 valid;
>
>  	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
> -	    !entry || entry->lo.fields_vapic.guest_mode)
> +	    !entry)

This can easily fit on the previous line.

	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) || !entry)
		return 0;
On 16/03/2023 21:01, Sean Christopherson wrote:
> On Thu, Mar 16, 2023, Joao Martins wrote:
>> On KVM GSI routing table updates, specially those where they have vIOMMUs
>> with interrupt remapping enabled (to boot >255vcpus setups without relying
>> on KVM_FEATURE_MSI_EXT_DEST_ID), a VMM may update the backing VF MSIs
>> with a new VCPU affinity.
>>
>> On AMD with AVIC enabled, the new vcpu affinity info is updated via:
>> 	avic_pi_update_irte()
>> 		irq_set_vcpu_affinity()
>> 			amd_ir_set_vcpu_affinity()
>> 				amd_iommu_{de}activate_guest_mode()
>>
>> Where the IRTE[GATag] is updated with the new vcpu affinity. The GATag
>> contains VM ID and VCPU ID, and is used by IOMMU hardware to signal KVM
>> (via GALog) when interrupt cannot be delivered due to vCPU is in
>> blocking state.
>>
>> The issue is that amd_iommu_activate_guest_mode() will essentially
>> only change IRTE fields on transitions from non-guest-mode to guest-mode
>> and otherwise returns *with no changes to IRTE* on already configured
>> guest-mode interrupts. To the guest this means that the VF interrupts
>> remain affined to the first vCPU they were first configured, and guest
>> will be unable to either VF interrupts and receive messages like this
>> from spuruious interrupts (e.g. from waking the wrong vCPU in GALog):
>>
>> [ 167.759472] __common_interrupt: 3.34 No irq handler for vector
>> [ 230.680927] mlx5_core 0000:00:02.0: mlx5_cmd_eq_recover:247:(pid
>> 3122): Recovered 1 EQEs on cmd_eq
>> [ 230.681799] mlx5_core 0000:00:02.0:
>> wait_func_handle_exec_timeout:1113:(pid 3122): cmd[0]: CREATE_CQ(0x400)
>> recovered after timeout
>> [ 230.683266] __common_interrupt: 3.34 No irq handler for vector
>>
>> Given the fact that amd_ir_set_vcpu_affinity() uses
>> amd_iommu_activate_guest_mode() underneath it essentially means that VCPU
>> affinity changes of IRTEs are nops. Fix it by dropping the check for
>> guest-mode at amd_iommu_activate_guest_mode(). Same thing is applicable to
>> amd_iommu_deactivate_guest_mode() although, even if the IRTE doesn't change
>> underlying DestID on the host, the VFIO IRQ handler will still be able to
>> poke at the right guest-vCPU.
>
> Is there any harm in giving deactivate the same treatment? If the worst case
> scenario is a few wasted cycles, having symmetric flows and eliminating benign
> bugs seems like a worthwhile tradeoff (assuming this is indeed a relatively slow
> path like I think it is).
>

I wanna say there's no harm, but initially I had such a patch, and on testing it
broke the classic interrupt remapping case but I didn't investigate further --
my suspicion is that the only case that should care is the updates (not the
actual deactivation of guest-mode).

>> Fixes: b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC (de-)activation code")
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>> ---
>>  drivers/iommu/amd/iommu.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 5a505ba5467e..bf3ebc9d6cde 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -3485,7 +3485,7 @@ int amd_iommu_activate_guest_mode(void *data)
>
> Any chance you (or anyone) would want to create a follow-up series to rename and/or
> rework these flows to make it more obvious that the helpers handle updates as well
> as transitions between "guest mode" and "host mode"? E.g. I can see KVM getting
> clever and skipping the "activation" when KVM knows AVIC is already active (though
> I can't tell for certain whether or not that would actually be problematic).
>

To be honest, I think the function naming is correct. Part of the problem here
(as you also hint) is instead the reusal of the helpers used in the (correct)
transition to/from guest-mode *externally* by callers mixed from *internal*
usage in amd iommu code for IRQ vcpu affinity using the same said helpers. And
that's also the reason I put the Fixes tag as that patch introduced such
"reusal" and which could be useful for stable trees. Here we are mainly
concerned with the updates (the internal usage) and actually exercising the
IRTE update instead of skipping it such that when you have interrupts on
blocked vCPUS that you actually wakeup the right one (and not doing so has a
rather drastic effect for VFs within the guest).

>>  	u64 valid;
>>
>>  	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>> -	    !entry || entry->lo.fields_vapic.guest_mode)
>> +	    !entry)
>
> This can easily fit on the previous line.
>
> 	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) || !entry)
> 		return 0;

True, I can move it to the previous line.
On Thu, Mar 16, 2023, Joao Martins wrote:
> On 16/03/2023 21:01, Sean Christopherson wrote:
> > Is there any harm in giving deactivate the same treatment? If the worst case
> > scenario is a few wasted cycles, having symmetric flows and eliminating benign
> > bugs seems like a worthwhile tradeoff (assuming this is indeed a relatively slow
> > path like I think it is).
> >
>
> I wanna say there's no harm, but initially I had such a patch, and on testing it
> broke the classic interrupt remapping case but I didn't investigate further --
> my suspicion is that the only case that should care is the updates (not the
> actual deactivation of guest-mode).

Ugh, I bet this is due to KVM invoking irq_set_vcpu_affinity() with garbage when
AVIC is enabled, but KVM can't use a posted interrupt due to how the IRQ is
configured. I vaguely recall a bug report about uninitialized data in "pi" being
consumed, but I can't find it at the moment.

	if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
	    kvm_vcpu_apicv_active(&svm->vcpu)) {

		...

	} else {
		/* Use legacy mode in IRTE */
		struct amd_iommu_pi_data pi;

		/**
		 * Here, pi is used to:
		 * - Tell IOMMU to use legacy mode for this interrupt.
		 * - Retrieve ga_tag of prior interrupt remapping data.
		 */
		pi.prev_ga_tag = 0;
		pi.is_guest_mode = false;
		ret = irq_set_vcpu_affinity(host_irq, &pi);
	}

> > Any chance you (or anyone) would want to create a follow-up series to rename and/or
> > rework these flows to make it more obvious that the helpers handle updates as well
> > as transitions between "guest mode" and "host mode"? E.g. I can see KVM getting
> > clever and skipping the "activation" when KVM knows AVIC is already active (though
> > I can't tell for certain whether or not that would actually be problematic).
> >
>
> To be honest, I think the function naming is correct.

After looking more closely at the KVM code, I agree. I was thinking KVM invoked
the (de)activate helpers somewhat spuriously, but that's not actually the case,
KVM just has a few less-than-perfect names due to conflicting requirements.

Thanks!
On 17/3/23 07:02, Joao Martins wrote:
> On KVM GSI routing table updates, specially those where they have vIOMMUs
> with interrupt remapping enabled (to boot >255vcpus setups without relying
> on KVM_FEATURE_MSI_EXT_DEST_ID), a VMM may update the backing VF MSIs
> with a new VCPU affinity.
>
> On AMD with AVIC enabled, the new vcpu affinity info is updated via:
> 	avic_pi_update_irte()
> 		irq_set_vcpu_affinity()
> 			amd_ir_set_vcpu_affinity()
> 				amd_iommu_{de}activate_guest_mode()
>
> Where the IRTE[GATag] is updated with the new vcpu affinity. The GATag
> contains VM ID and VCPU ID, and is used by IOMMU hardware to signal KVM
> (via GALog) when interrupt cannot be delivered due to vCPU is in
> blocking state.
>
> The issue is that amd_iommu_activate_guest_mode() will essentially
> only change IRTE fields on transitions from non-guest-mode to guest-mode
> and otherwise returns *with no changes to IRTE* on already configured
> guest-mode interrupts. To the guest this means that the VF interrupts
> remain affined to the first vCPU they were first configured, and guest
> will be unable to either VF interrupts and receive messages like this
> from spuruious interrupts (e.g. from waking the wrong vCPU in GALog):

The "either" above sounds like there should be a verb which it is not, or is it?
(my english skills are meh). I kinda get the idea anyway (I hope).

btw s/spuruious/spurious/, says my vim. Thanks,

>
> [ 167.759472] __common_interrupt: 3.34 No irq handler for vector
> [ 230.680927] mlx5_core 0000:00:02.0: mlx5_cmd_eq_recover:247:(pid
> 3122): Recovered 1 EQEs on cmd_eq
> [ 230.681799] mlx5_core 0000:00:02.0:
> wait_func_handle_exec_timeout:1113:(pid 3122): cmd[0]: CREATE_CQ(0x400)
> recovered after timeout
> [ 230.683266] __common_interrupt: 3.34 No irq handler for vector
>
> Given the fact that amd_ir_set_vcpu_affinity() uses
> amd_iommu_activate_guest_mode() underneath it essentially means that VCPU
> affinity changes of IRTEs are nops. Fix it by dropping the check for
> guest-mode at amd_iommu_activate_guest_mode(). Same thing is applicable to
> amd_iommu_deactivate_guest_mode() although, even if the IRTE doesn't change
> underlying DestID on the host, the VFIO IRQ handler will still be able to
> poke at the right guest-vCPU.
>
> Fixes: b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC (de-)activation code")
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> ---
>  drivers/iommu/amd/iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 5a505ba5467e..bf3ebc9d6cde 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3485,7 +3485,7 @@ int amd_iommu_activate_guest_mode(void *data)
>  	u64 valid;
>
>  	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
> -	    !entry || entry->lo.fields_vapic.guest_mode)
> +	    !entry)
>  		return 0;
>
>  	valid = entry->lo.fields_vapic.valid;
On 28/03/2023 10:07, Alexey Kardashevskiy wrote:
> On 17/3/23 07:02, Joao Martins wrote:
>> On KVM GSI routing table updates, specially those where they have vIOMMUs
>> with interrupt remapping enabled (to boot >255vcpus setups without relying
>> on KVM_FEATURE_MSI_EXT_DEST_ID), a VMM may update the backing VF MSIs
>> with a new VCPU affinity.
>>
>> On AMD with AVIC enabled, the new vcpu affinity info is updated via:
>> 	avic_pi_update_irte()
>> 		irq_set_vcpu_affinity()
>> 			amd_ir_set_vcpu_affinity()
>> 				amd_iommu_{de}activate_guest_mode()
>>
>> Where the IRTE[GATag] is updated with the new vcpu affinity. The GATag
>> contains VM ID and VCPU ID, and is used by IOMMU hardware to signal KVM
>> (via GALog) when interrupt cannot be delivered due to vCPU is in
>> blocking state.
>>
>> The issue is that amd_iommu_activate_guest_mode() will essentially
>> only change IRTE fields on transitions from non-guest-mode to guest-mode
>> and otherwise returns *with no changes to IRTE* on already configured
>> guest-mode interrupts. To the guest this means that the VF interrupts
>> remain affined to the first vCPU they were first configured, and guest
>> will be unable to either VF interrupts and receive messages like this
>> from spuruious interrupts (e.g. from waking the wrong vCPU in GALog):
>
> The "either" above sounds like there should be a verb which it is not, or is it?
> (my english skills are meh). I kinda get the idea anyway (I hope).
>

It should be 'issue'. I'll delete the 'either'

> btw s/spuruious/spurious/, says my vim. Thanks,
>

/me nods
[I was out sick, hence the delay]

On 24/03/2023 14:31, Sean Christopherson wrote:
> On Thu, Mar 16, 2023, Joao Martins wrote:
>> On 16/03/2023 21:01, Sean Christopherson wrote:
>>> Is there any harm in giving deactivate the same treatment? If the worst case
>>> scenario is a few wasted cycles, having symmetric flows and eliminating benign
>>> bugs seems like a worthwhile tradeoff (assuming this is indeed a relatively slow
>>> path like I think it is).
>>>
>>
>> I wanna say there's no harm, but initially I had such a patch, and on testing it
>> broke the classic interrupt remapping case but I didn't investigate further --
>> my suspicion is that the only case that should care is the updates (not the
>> actual deactivation of guest-mode).
>
> Ugh, I bet this is due to KVM invoking irq_set_vcpu_affinity() with garbage when
> AVIC is enabled, but KVM can't use a posted interrupt due to how the IRQ is
> configured. I vaguely recall a bug report about uninitialized data in "pi" being
> consumed, but I can't find it at the moment.
>
> 	if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
> 	    kvm_vcpu_apicv_active(&svm->vcpu)) {
>
> 		...
>
> 	} else {
> 		/* Use legacy mode in IRTE */
> 		struct amd_iommu_pi_data pi;
>
> 		/**
> 		 * Here, pi is used to:
> 		 * - Tell IOMMU to use legacy mode for this interrupt.
> 		 * - Retrieve ga_tag of prior interrupt remapping data.
> 		 */
> 		pi.prev_ga_tag = 0;
> 		pi.is_guest_mode = false;
> 		ret = irq_set_vcpu_affinity(host_irq, &pi);
> 	}
>
>

I recall one instance of the 'garbage pi data' issue but this was due to
prev_ga_tag not being initialized (see commit f6426ab9c957).

As far as I understand, the AMD implementation of irq_set_vcpu_affinity() will
write back to the caller the following fields of pi:

- prev_ga_tag
- ir_data
- guest_mode (sometimes when it is unsupported or disabled by the host via cmdline)

On the legacy interrupt remap path (no iommu avic) the IRQ update just uses irq data
mostly. It's the avic path that uses more things (vcpu_data, ga_tag, base,
ga_root_ptr, ga_vector), but all of which are initialized by KVM properly already.
On Tue, Mar 28, 2023, Joao Martins wrote:
> [I was out sick, hence the delay]
>
> On 24/03/2023 14:31, Sean Christopherson wrote:
> > On Thu, Mar 16, 2023, Joao Martins wrote:
> >> On 16/03/2023 21:01, Sean Christopherson wrote:
> >>> Is there any harm in giving deactivate the same treatment? If the worst case
> >>> scenario is a few wasted cycles, having symmetric flows and eliminating benign
> >>> bugs seems like a worthwhile tradeoff (assuming this is indeed a relatively slow
> >>> path like I think it is).
> >>>
> >>
> >> I wanna say there's no harm, but initially I had such a patch, and on testing it
> >> broke the classic interrupt remapping case but I didn't investigate further --
> >> my suspicion is that the only case that should care is the updates (not the
> >> actual deactivation of guest-mode).
> >
> > Ugh, I bet this is due to KVM invoking irq_set_vcpu_affinity() with garbage when
> > AVIC is enabled, but KVM can't use a posted interrupt due to how the IRQ is
> > configured. I vaguely recall a bug report about uninitialized data in "pi" being
> > consumed, but I can't find it at the moment.
> >
> > 	if (!get_pi_vcpu_info(kvm, e, &vcpu_info, &svm) && set &&
> > 	    kvm_vcpu_apicv_active(&svm->vcpu)) {
> >
> > 		...
> >
> > 	} else {
> > 		/* Use legacy mode in IRTE */
> > 		struct amd_iommu_pi_data pi;
> >
> > 		/**
> > 		 * Here, pi is used to:
> > 		 * - Tell IOMMU to use legacy mode for this interrupt.
> > 		 * - Retrieve ga_tag of prior interrupt remapping data.
> > 		 */
> > 		pi.prev_ga_tag = 0;
> > 		pi.is_guest_mode = false;
> > 		ret = irq_set_vcpu_affinity(host_irq, &pi);
> > 	}
> >
> >
>
> I recall one instance of the 'garbage pi data' issue but this was due to
> prev_ga_tag not being initialized (see commit f6426ab9c957).

Yep, that's the one I was trying to recall.

> As far as I understand, AMD implementation on irq_set_vcpu_affinity will
> write back to caller the following fields of pi:
>
> - prev_ga_tag
> - ir_data
> - guest_mode (sometimes when it is unsupported or disabled by the host via cmdline)
>
> On legacy interrupt remap path (no iommu avic) the IRQ update just uses irq data
> mostly. It's the avic path that uses more things (vcpu_data, ga_tag, base,
> ga_root_ptr, ga_vector), but all of which are initialized by KVM properly already.

Ya, on my Nth read through, I don't see any issues with KVM's behavior. I was
thinking that KVM's "pi" could bleed into amd_iommu_deactivate_guest_mode(), but
I had just gotten turned around by the many "data" variables. Bummer.
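For readers following the "garbage pi data" tangent: the hazard discussed above is a stack-allocated request struct whose fields are only partly assigned before it is handed to a callee that reads or writes back other fields. The sketch below is purely illustrative. `struct toy_pi_data` and `toy_legacy_pi_request()` are hypothetical names, not the kernel's `struct amd_iommu_pi_data` or any KVM helper, and this is not how commit f6426ab9c957 addressed the issue. It only shows that a C designated initializer zeroes every member that is not named, which is one way to avoid leaving a field such as `prev_ga_tag` indeterminate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a posted-interrupt request; NOT the kernel struct. */
struct toy_pi_data {
	uint32_t ga_tag;       /* tag for the new target (unused in legacy mode) */
	uint32_t prev_ga_tag;  /* written back by the callee in the real flow */
	bool is_guest_mode;    /* false requests legacy remapping */
	void *ir_data;         /* written back by the callee in the real flow */
};

/*
 * Build a "use legacy mode" request. Only is_guest_mode is named; C
 * guarantees that all other members of an object initialized this way
 * are zero-initialized, so no member holds an indeterminate stack value.
 */
struct toy_pi_data toy_legacy_pi_request(void)
{
	struct toy_pi_data pi = {
		.is_guest_mode = false,
	};

	return pi;
}
```

The snippet quoted in the thread instead assigns `pi.prev_ga_tag = 0` and `pi.is_guest_mode = false` explicitly, which is equally valid for the fields it names; the initializer form simply covers any member added later.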
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 5a505ba5467e..bf3ebc9d6cde 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3485,7 +3485,7 @@ int amd_iommu_activate_guest_mode(void *data)
 	u64 valid;
 
 	if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
-	    !entry || entry->lo.fields_vapic.guest_mode)
+	    !entry)
 		return 0;
 
 	valid = entry->lo.fields_vapic.valid;
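The effect of the one-line change can be shown with a toy model. This is a minimal sketch under stated assumptions: `struct toy_irte`, `toy_activate_old()` and `toy_activate_new()` are hypothetical names invented here, the real IRTE has many more fields, and the real helper takes a `void *data` and does considerably more work. The model keeps only the two things the commit message talks about, the guest-mode bit and the GATag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified IRTE; NOT the kernel's struct irte_ga. */
struct toy_irte {
	bool guest_mode;  /* entry already delivers to a guest vCPU */
	uint32_t ga_tag;  /* encodes VM ID and vCPU ID in the real hardware */
};

/* Pre-patch shape: return early when the entry is already in guest mode. */
int toy_activate_old(struct toy_irte *entry, uint32_t new_ga_tag)
{
	if (!entry || entry->guest_mode)
		return 0;  /* an affinity update on a live entry is dropped */

	entry->guest_mode = true;
	entry->ga_tag = new_ga_tag;
	return 0;
}

/* Post-patch shape: always rewrite the entry, so updates take effect. */
int toy_activate_new(struct toy_irte *entry, uint32_t new_ga_tag)
{
	if (!entry)
		return 0;

	entry->guest_mode = true;
	entry->ga_tag = new_ga_tag;
	return 0;
}
```

In this model, calling `toy_activate_old()` twice with tags 1 and then 2 leaves `ga_tag` at 1, which corresponds to the interrupt staying affined to the vCPU it was first configured for. The same sequence through `toy_activate_new()` ends with `ga_tag` at 2.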