mbox series

[for-6.0,0/8] spapr: Address the confusion between IPI numbers and vCPU ids

Message ID 20201120174646.619395-1-groug@kaod.org (mailing list archive)
Headers show
Series spapr: Address the confusion between IPI numbers and vCPU ids | expand

Message

Greg Kurz Nov. 20, 2020, 5:46 p.m. UTC
A regression was recently fixed in the sPAPR XIVE code for QEMU 5.2
RC3 [1]. It boiled down to a confusion between IPI numbers and vCPU
ids, which happen to be numerically equal in general, but are really
different entities that can diverge in some setups. When this happens,
we end up misconfiguring XIVE in a way that is almost fatal for the
guest.

The confusion comes from XICS which has historically assumed equality
between interrupt server numbers and vCPU ids, as mentionned in a
comment back from 2011 in the linux kernel icp_native_init_one_node()
function:

    /* This code does the theorically broken assumption that the interrupt
     * server numbers are the same as the hard CPU numbers.
     * This happens to be the case so far but we are playing with fire...
     * should be fixed one of these days. -BenH.
     */

This assumption crept into QEMU through the "ibm,interrupt-server-ranges"
property of the "interrupt-controller" node in the DT. This property
contains ranges of consecutive vCPU ids defined as (first id, # of ids).
In the case of QEMU, we define a single range starting from 0 up to the
highest vCPU id, as returned by spapr_max_server_number(). This has
always been associated to the "nr_servers" wording when naming variables
or function arguments. When XIVE got added, we introduced an sPAPR IRQ
abstraction to be able to control several interrupt controller backends.
The sPAPR IRQ base class provides a dt() handler used to populate the
"interrupt-controller" node in the DT. This handler takes an "nr_server"
argument inherited from XICS and we ended up using it to populate the
"ibm,xive-lisn-ranges" property used with XIVE, which has completely
different semantics. It contain ranges of interrupt numbers that the
guest can use for any purpose. Since one obvious purpose is IPI, its
first range should just be able to accomodate all possible vCPUs.
In the case of QEMU, we define a single range starting from 0 up
to "nr_server" but we should rather size it to the number of vCPUs
actually (ie. smp.max_cpus).

This series aims at getting rid of the "nr_server" argument in the
sPAPR IC handlers. Since both the highest possible vCPU id and the
maximum number of vCPUs are invariants for XICS and XIVE respectively,
let's make them device properties to be configured by the machine when
it creates the interrupt controllers and use them where needed.

This doesn't cause any visible change to setups using the default
VSMT machine settings. This changes "ibm,xive-lisn-ranges" for
setups that mess with VSMT, but this is acceptable since linux
only allocates one interrupt per vCPU and the higher part of the
range was never used.

[1] https://git.qemu.org/?p=qemu.git;a=commit;h=6d24795ee7e3199d199d3c415312c93382ad1807

Greg Kurz (8):
  spapr/xive: Turn some sanity checks into assertions
  spapr/xive: Introduce spapr_xive_nr_ends()
  spapr/xive: Add "nr-servers" property
  spapr/xive: Add "nr-ipis" property
  spapr/xics: Drop unused argument to xics_kvm_has_broken_disconnect()
  spapr/xics: Add "nr-servers" property
  spapr: Drop "nr_servers" argument of the sPAPR IC activate() operation
  spapr: Drop "nr_servers" argument of the sPAPR IC dt() operation

 include/hw/ppc/spapr.h      |  4 +--
 include/hw/ppc/spapr_irq.h  |  9 ++---
 include/hw/ppc/spapr_xive.h | 25 ++++++++++++-
 include/hw/ppc/xics_spapr.h | 23 +++++++++---
 hw/intc/spapr_xive.c        | 72 ++++++++++++++++++++++---------------
 hw/intc/spapr_xive_kvm.c    |  4 +--
 hw/intc/xics_kvm.c          |  4 +--
 hw/intc/xics_spapr.c        | 45 ++++++++++++++---------
 hw/ppc/spapr.c              |  7 ++--
 hw/ppc/spapr_irq.c          | 27 +++++++-------
 10 files changed, 141 insertions(+), 79 deletions(-)

Comments

Cédric Le Goater Nov. 23, 2020, 8:04 a.m. UTC | #1
On 11/20/20 6:46 PM, Greg Kurz wrote:
> A regression was recently fixed in the sPAPR XIVE code for QEMU 5.2
> RC3 [1]. It boiled down to a confusion between IPI numbers and vCPU
> ids, which happen to be numerically equal in general, but are really
> different entities that can diverge in some setups. When this happens,
> we end up misconfiguring XIVE in a way that is almost fatal for the
> guest.
> 
> The confusion comes from XICS which has historically assumed equality
> between interrupt server numbers and vCPU ids, as mentionned in a
> comment back from 2011 in the linux kernel icp_native_init_one_node()
> function:
> 
>     /* This code does the theorically broken assumption that the interrupt
>      * server numbers are the same as the hard CPU numbers.
>      * This happens to be the case so far but we are playing with fire...
>      * should be fixed one of these days. -BenH.
>      */
> 
> This assumption crept into QEMU through the "ibm,interrupt-server-ranges"
> property of the "interrupt-controller" node in the DT. This property
> contains ranges of consecutive vCPU ids defined as (first id, # of ids).
> In the case of QEMU, we define a single range starting from 0 up to the
> highest vCPU id, as returned by spapr_max_server_number(). This has
> always been associated to the "nr_servers" wording when naming variables
> or function arguments. When XIVE got added, we introduced an sPAPR IRQ
> abstraction to be able to control several interrupt controller backends.
> The sPAPR IRQ base class provides a dt() handler used to populate the
> "interrupt-controller" node in the DT. This handler takes an "nr_server"
> argument inherited from XICS and we ended up using it to populate the
> "ibm,xive-lisn-ranges" property used with XIVE, which has completely
> different semantics. It contain ranges of interrupt numbers that the
> guest can use for any purpose. Since one obvious purpose is IPI, its
> first range should just be able to accomodate all possible vCPUs.

To clarify, PAPR says it is a requirement :

  "The first range will contain at least one per possible thread in the 
   partition."

The regression showed that we were not initializing correctly this range 
in QEMU/KVM. I an not even sure it had the correct size either since
we were anyhow initializing 4096 IPIs.

> In the case of QEMU, we define a single range starting from 0 up
> to "nr_server" but we should rather size it to the number of vCPUs
> actually (ie. smp.max_cpus).

ah. And so spapr_max_server_number(spapr) is crap ? This is starting
to be complex to follow :/
 
> This series aims at getting rid of the "nr_server" argument in the
> sPAPR IC handlers. Since both the highest possible vCPU id and the
> maximum number of vCPUs are invariants for XICS and XIVE respectively,

What XIVE cares about is the number of possible IPIs and the number
of vCPUs since we deduced from that the number of event queue 
descriptors, which is another XIVE structure.

> let's make them device properties to be configured by the machine when
> it creates the interrupt controllers and use them where needed.
> 
> This doesn't cause any visible change to setups using the default
> VSMT machine settings. This changes "ibm,xive-lisn-ranges" for
> setups that mess with VSMT, but this is acceptable since linux
> only allocates one interrupt per vCPU and the higher part of the
> range was never used.

This range is only used for vCPUs. 

C.

> [1] https://git.qemu.org/?p=qemu.git;a=commit;h=6d24795ee7e3199d199d3c415312c93382ad1807
> 
> Greg Kurz (8):
>   spapr/xive: Turn some sanity checks into assertions
>   spapr/xive: Introduce spapr_xive_nr_ends()
>   spapr/xive: Add "nr-servers" property
>   spapr/xive: Add "nr-ipis" property
>   spapr/xics: Drop unused argument to xics_kvm_has_broken_disconnect()
>   spapr/xics: Add "nr-servers" property
>   spapr: Drop "nr_servers" argument of the sPAPR IC activate() operation
>   spapr: Drop "nr_servers" argument of the sPAPR IC dt() operation
> 
>  include/hw/ppc/spapr.h      |  4 +--
>  include/hw/ppc/spapr_irq.h  |  9 ++---
>  include/hw/ppc/spapr_xive.h | 25 ++++++++++++-
>  include/hw/ppc/xics_spapr.h | 23 +++++++++---
>  hw/intc/spapr_xive.c        | 72 ++++++++++++++++++++++---------------
>  hw/intc/spapr_xive_kvm.c    |  4 +--
>  hw/intc/xics_kvm.c          |  4 +--
>  hw/intc/xics_spapr.c        | 45 ++++++++++++++---------
>  hw/ppc/spapr.c              |  7 ++--
>  hw/ppc/spapr_irq.c          | 27 +++++++-------
>  10 files changed, 141 insertions(+), 79 deletions(-)
>
Greg Kurz Nov. 23, 2020, 10:07 a.m. UTC | #2
On Mon, 23 Nov 2020 09:04:42 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 11/20/20 6:46 PM, Greg Kurz wrote:
> > A regression was recently fixed in the sPAPR XIVE code for QEMU 5.2
> > RC3 [1]. It boiled down to a confusion between IPI numbers and vCPU
> > ids, which happen to be numerically equal in general, but are really
> > different entities that can diverge in some setups. When this happens,
> > we end up misconfiguring XIVE in a way that is almost fatal for the
> > guest.
> > 
> > The confusion comes from XICS which has historically assumed equality
> > between interrupt server numbers and vCPU ids, as mentionned in a
> > comment back from 2011 in the linux kernel icp_native_init_one_node()
> > function:
> > 
> >     /* This code does the theorically broken assumption that the interrupt
> >      * server numbers are the same as the hard CPU numbers.
> >      * This happens to be the case so far but we are playing with fire...
> >      * should be fixed one of these days. -BenH.
> >      */
> > 
> > This assumption crept into QEMU through the "ibm,interrupt-server-ranges"
> > property of the "interrupt-controller" node in the DT. This property
> > contains ranges of consecutive vCPU ids defined as (first id, # of ids).
> > In the case of QEMU, we define a single range starting from 0 up to the
> > highest vCPU id, as returned by spapr_max_server_number(). This has
> > always been associated to the "nr_servers" wording when naming variables
> > or function arguments. When XIVE got added, we introduced an sPAPR IRQ
> > abstraction to be able to control several interrupt controller backends.
> > The sPAPR IRQ base class provides a dt() handler used to populate the
> > "interrupt-controller" node in the DT. This handler takes an "nr_server"
> > argument inherited from XICS and we ended up using it to populate the
> > "ibm,xive-lisn-ranges" property used with XIVE, which has completely
> > different semantics. It contain ranges of interrupt numbers that the
> > guest can use for any purpose. Since one obvious purpose is IPI, its
> > first range should just be able to accomodate all possible vCPUs.
> 
> To clarify, PAPR says it is a requirement :
> 
>   "The first range will contain at least one per possible thread in the 
>    partition."
> 
> The regression showed that we were not initializing correctly this range 
> in QEMU/KVM. I an not even sure it had the correct size either since
> we were anyhow initializing 4096 IPIs.
> 

The bad thing was that each online vCPU would reset it's IPI in
KVM using a bogus IPI number (the vCPU id), and thus doesn't reset
the interrupt the guest is really going to use for the IPI.

> > In the case of QEMU, we define a single range starting from 0 up
> > to "nr_server" but we should rather size it to the number of vCPUs
> > actually (ie. smp.max_cpus).
> 
> ah. And so spapr_max_server_number(spapr) is crap ? This is starting
> to be complex to follow :/
>  

No. spapr_max_server_number(spapr) gives the highest vCPU id that
we end over to KVM in order to optimize VP id allocation in the HW.
But it definitely has nothing to do with "ibm,xive-lisn-ranges".

David suggested in some other mail that we could maybe pass
both spapr_max_server_number(spapr) and smp.max_cpus to the
activate() handler.

> > This series aims at getting rid of the "nr_server" argument in the
> > sPAPR IC handlers. Since both the highest possible vCPU id and the
> > maximum number of vCPUs are invariants for XICS and XIVE respectively,
> 
> What XIVE cares about is the number of possible IPIs and the number
> of vCPUs since we deduced from that the number of event queue 
> descriptors, which is another XIVE structure.
> 
> > let's make them device properties to be configured by the machine when
> > it creates the interrupt controllers and use them where needed.
> > 
> > This doesn't cause any visible change to setups using the default
> > VSMT machine settings. This changes "ibm,xive-lisn-ranges" for
> > setups that mess with VSMT, but this is acceptable since linux
> > only allocates one interrupt per vCPU and the higher part of the
> > range was never used.
> 
> This range is only used for vCPUs. 
> 
> C.
> 
> > [1] https://git.qemu.org/?p=qemu.git;a=commit;h=6d24795ee7e3199d199d3c415312c93382ad1807
> > 
> > Greg Kurz (8):
> >   spapr/xive: Turn some sanity checks into assertions
> >   spapr/xive: Introduce spapr_xive_nr_ends()
> >   spapr/xive: Add "nr-servers" property
> >   spapr/xive: Add "nr-ipis" property
> >   spapr/xics: Drop unused argument to xics_kvm_has_broken_disconnect()
> >   spapr/xics: Add "nr-servers" property
> >   spapr: Drop "nr_servers" argument of the sPAPR IC activate() operation
> >   spapr: Drop "nr_servers" argument of the sPAPR IC dt() operation
> > 
> >  include/hw/ppc/spapr.h      |  4 +--
> >  include/hw/ppc/spapr_irq.h  |  9 ++---
> >  include/hw/ppc/spapr_xive.h | 25 ++++++++++++-
> >  include/hw/ppc/xics_spapr.h | 23 +++++++++---
> >  hw/intc/spapr_xive.c        | 72 ++++++++++++++++++++++---------------
> >  hw/intc/spapr_xive_kvm.c    |  4 +--
> >  hw/intc/xics_kvm.c          |  4 +--
> >  hw/intc/xics_spapr.c        | 45 ++++++++++++++---------
> >  hw/ppc/spapr.c              |  7 ++--
> >  hw/ppc/spapr_irq.c          | 27 +++++++-------
> >  10 files changed, 141 insertions(+), 79 deletions(-)
> > 
>