
[5/5] KVM: arm64: Exclude FP ownership from kvm_vcpu_arch

Message ID 20240302111935.129994-6-maz@kernel.org (mailing list archive)
State New, archived
Series KVM: arm64: Move host-specific data out of kvm_vcpu_arch

Commit Message

Marc Zyngier March 2, 2024, 11:19 a.m. UTC
In retrospect, it is fairly obvious that the FP state ownership
is only meaningful for a given CPU, and that locating this
information in the vcpu was just a mistake.

Move the ownership tracking into the host data structure, and
rename it from fp_state to fp_owner, which is a better description
(name suggested by Mark Brown).

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_emulate.h    |  4 ++--
 arch/arm64/include/asm/kvm_host.h       | 14 +++++++-------
 arch/arm64/kvm/arm.c                    |  6 ------
 arch/arm64/kvm/fpsimd.c                 | 10 +++++-----
 arch/arm64/kvm/hyp/include/hyp/switch.h |  6 +++---
 arch/arm64/kvm/hyp/nvhe/hyp-main.c      |  2 --
 arch/arm64/kvm/hyp/nvhe/switch.c        |  2 +-
 arch/arm64/kvm/hyp/vhe/switch.c         |  2 +-
 8 files changed, 19 insertions(+), 27 deletions(-)

Comments

Mark Brown March 4, 2024, 7:10 p.m. UTC | #1
On Sat, Mar 02, 2024 at 11:19:35AM +0000, Marc Zyngier wrote:
> In retrospect, it is fairly obvious that the FP state ownership
> is only meaningful for a given CPU, and that locating this
> information in the vcpu was just a mistake.
> 
> Move the ownership tracking into the host data structure, and
> rename it from fp_state to fp_owner, which is a better description
> (name suggested by Mark Brown).

The SME patch series proposes adding an additional state to this
enumeration which would indicate whether the registers are stored in a
format suitable for exchange with userspace; that would make this state
part of the vCPU state.  With the addition of SME we can have two vector
lengths in play, so the series proposes picking the larger as the format
for userspace registers.

We could store this separately from fp_state/owner, but it'd still be a
value stored in the vCPU.  Storing in a format suitable for userspace
usage all the time when we've got SME would most likely result in
performance overhead if nothing else, and feels more complicated than
rewriting the data in the relatively unusual case where userspace looks
at it.  Trying to convert userspace writes into the current layout would
have issues if the current layout uses the smaller vector length, and
would create fragility with ordering issues when loading the guest state.
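To make the proposed userspace format concrete, a rough sketch in Python
(purely illustrative - the helper names and the pad/truncate rules here
are placeholders, not the proposed kernel code):

```python
# Illustrative model of presenting Z registers to userspace at the
# maximum vector length (VL), regardless of the guest's current VL.

def to_userspace(z: bytes, max_vl: int) -> bytes:
    """Present a guest-native Z register image to userspace at the
    maximum VL by zero-padding the upper bytes."""
    return z + bytes(max_vl - len(z))

def from_userspace(buf: bytes, cur_vl: int) -> bytes:
    """Convert a userspace write (max-VL format) back to the guest's
    current VL by dropping the upper bytes."""
    return buf[:cur_vl]

z = bytes([1, 2, 3, 4]) * 8          # a 256-bit (32-byte) guest image
u = to_userspace(z, 64)              # presented as 512-bit to userspace
assert len(u) == 64 and u[:32] == z
assert from_userspace(u, 32) == z    # round-trips while the VL is unchanged
```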

The proposal is not the most lovely idea ever but given the architecture
I think some degree of clunkiness would be unavoidable.
Marc Zyngier March 6, 2024, 9:43 a.m. UTC | #2
On Mon, 04 Mar 2024 19:10:08 +0000,
Mark Brown <broonie@kernel.org> wrote:
> 
> On Sat, Mar 02, 2024 at 11:19:35AM +0000, Marc Zyngier wrote:
> > In retrospect, it is fairly obvious that the FP state ownership
> > is only meaningful for a given CPU, and that locating this
> > information in the vcpu was just a mistake.
> > 
> > Move the ownership tracking into the host data structure, and
> > rename it from fp_state to fp_owner, which is a better description
> > (name suggested by Mark Brown).
> 
> The SME patch series proposes adding an additional state to this
> enumeration which would say if the registers are stored in a format
> suitable for exchange with userspace, that would make this state part of
> the vCPU state.  With the addition of SME we can have two vector lengths
> in play so the series proposes picking the larger to be the format for
> userspace registers.

What does this addition have to do with the ownership of the
physical register file? Not a lot, it seems.

Especially as there had better be no state resident on the CPU when
userspace messes with it.

> 
> We could store this separately to fp_state/owner but it'd still be a
> value stored in the vCPU.

I totally disagree.

> Storing in a format suitable for userspace
> usage all the time when we've got SME would most likely result in
> performance overhead

What performance overhead? Why should we care?

> if nothing else and feels more complicated than
> rewriting the data in the relatively unusual case where userspace looks
> at it.  Trying to convert userspace writes into the current layout would
> have issues if the current layout uses the smaller vector length and
> create fragility with ordering issues when loading the guest state.

What ordering issues? If userspace manipulates the guest state, the
guest isn't running. If it is, all bets are off.

> 
> The proposal is not the most lovely idea ever but given the architecture
> I think some degree of clunkiness would be unavoidable.

It is only unavoidable if we decide to make a bad job of it.

	M.
Mark Brown March 6, 2024, 10:19 p.m. UTC | #3
On Wed, Mar 06, 2024 at 09:43:13AM +0000, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Sat, Mar 02, 2024 at 11:19:35AM +0000, Marc Zyngier wrote:

> > > Move the ownership tracking into the host data structure, and
> > > rename it from fp_state to fp_owner, which is a better description
> > > (name suggested by Mark Brown).

> > The SME patch series proposes adding an additional state to this
> > enumeration which would say if the registers are stored in a format
> > suitable for exchange with userspace, that would make this state part of
> > the vCPU state.  With the addition of SME we can have two vector lengths
> > in play so the series proposes picking the larger to be the format for
> > userspace registers.

> What does this addition have anything to do with the ownership of the
> physical register file? Not a lot, it seems.

> Specially as there better be no state resident on the CPU when
> userspace messes up with it.

If we have a situation where the state might be stored in memory in
multiple formats it seems reasonable to consider the metadata which
indicates which format is currently in use as part of the state.

> > We could store this separately to fp_state/owner but it'd still be a
> > value stored in the vCPU.

> I totally disagree.

Where would you expect to see the state stored?

> > Storing in a format suitable for userspace
> > usage all the time when we've got SME would most likely result in
> > performance overhead

> What performance overhead? Why should we care?

Since in situations where we're not using the larger VL we would need to
load and store the registers using a vector length other than the
currently configured one, we would not be able to use the architecture's
ability to load and store to a location based on a multiple of the
vector length:

   LDR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
   LDR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
   
   STR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
   STR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]

and would instead need to manually compute the memory locations where
values are stored.  As well as the extra instructions when using the
smaller vector length, we'd also be working with sparser data, likely
spread over more cache lines.
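For illustration, the difference between the two layouts can be modelled
like this (Python rather than the actual assembly; the numbers are
examples only):

```python
# Illustrative model of SVE Z-register storage layouts (not kernel code).
# There are 32 Z registers; each holds VL bytes (VL = vector length in bytes).

NUM_ZREGS = 32

def packed_offset(reg, vl):
    """The offset the hardware computes for free with
    LDR/STR <Zt>, [Xn, #imm, MUL VL]: register n lives at n * VL bytes."""
    return reg * vl

def max_vl_offset(reg, max_vl):
    """A fixed layout sized for the largest VL: offsets must be computed
    manually and are independent of the current VL."""
    return reg * max_vl

cur_vl, max_vl = 32, 64   # e.g. 256-bit current VL, 512-bit maximum VL

packed_size = NUM_ZREGS * cur_vl   # contiguous: 32 * 32 = 1024 bytes
sparse_size = NUM_ZREGS * max_vl   # 32 * 64 = 2048 bytes at the same VL
hole_bytes = sparse_size - packed_size  # half of the layout is holes

print(packed_size, sparse_size, hole_bytes)   # 1024 2048 1024
```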

We would also need to consider whether we need to zero the holes in the
data when saving; we'd only potentially be leaking information from the
guest, but it might cause nasty surprises given that transitioning
to/from streaming mode is expected to zero values.  If we do need to
zero then that would be additional work that would need doing.

Exactly what the performance hit would be will be system and use case
dependent.  *Hopefully* we won't need to save and load the guest
state too often, but I would be very surprised if we didn't have people
considering any cost in the guest context switch path worth paying
attention to.

As well as the performance overhead there would be some code complexity
cost, if nothing else we'd not be using the same format as fpsimd_save()
and would need to rearrange how we handle saving the register state.

Spending more effort to implement something which also has more runtime
performance overhead for the case of saving and restoring guest state
which I expect to be vastly more common than the VMM accessing the guest
registers just doesn't seem like an appealing choice.

> > if nothing else and feels more complicated than
> > rewriting the data in the relatively unusual case where userspace looks
> > at it.  Trying to convert userspace writes into the current layout would
> > have issues if the current layout uses the smaller vector length and
> > create fragility with ordering issues when loading the guest state.

> What ordering issues? If userspace manipulates the guest state, the
> guest isn't running. If it is, all bets are off.

If we were storing the data in the native format for the guest then that
format will change if streaming mode is changed via a write to SVCR.
This would mean that the host would need to understand that when writing
values SVCR needs to be written before the Z and P registers.  To be
clear I don't think this is a good idea.

> > The proposal is not the most lovely idea ever but given the architecture
> > I think some degree of clunkiness would be unavoidable.

> It is only unavoidable if we decide to make a bad job of it.

I don't think the handling of the vector registers for KVM with SME is
something where there is a clear good and bad job we can do - I don't
see how we can reasonably avoid, at some point, needing to translate
vector lengths or to/from FPSIMD format (in the case of a system with
SME but not SVE), which is just inherently a sharp edge.  It's just a
question of when and how we do that.
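As an illustration of the FPSIMD-format translation mentioned above (a
Python sketch with made-up helper names, not the proposed
implementation):

```python
# Illustrative: on a system with SME but no SVE, guest state saved as
# FPSIMD V registers may need to be presented in SVE format and back.

def zregs_from_vregs(vregs, vl):
    """Each 16-byte V register becomes the low 128 bits of a VL-byte
    Z register image, with the upper bits zero."""
    return [v + bytes(vl - 16) for v in vregs]

def vregs_from_zregs(zregs):
    """The reverse direction just takes the low 128 bits."""
    return [z[:16] for z in zregs]

vregs = [bytes([i]) * 16 for i in range(32)]
zregs = zregs_from_vregs(vregs, 64)       # present at a 512-bit VL
assert all(len(z) == 64 for z in zregs)
assert vregs_from_zregs(zregs) == vregs   # lossless round trip
```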
Marc Zyngier March 7, 2024, 11:10 a.m. UTC | #4
On Wed, 06 Mar 2024 22:19:03 +0000,
Mark Brown <broonie@kernel.org> wrote:
> 
> On Wed, Mar 06, 2024 at 09:43:13AM +0000, Marc Zyngier wrote:
> > Mark Brown <broonie@kernel.org> wrote:
> > > On Sat, Mar 02, 2024 at 11:19:35AM +0000, Marc Zyngier wrote:
> 
> > > > Move the ownership tracking into the host data structure, and
> > > > rename it from fp_state to fp_owner, which is a better description
> > > > (name suggested by Mark Brown).
> 
> > > The SME patch series proposes adding an additional state to this
> > > enumeration which would say if the registers are stored in a format
> > > suitable for exchange with userspace, that would make this state part of
> > > the vCPU state.  With the addition of SME we can have two vector lengths
> > > in play so the series proposes picking the larger to be the format for
> > > userspace registers.
> 
> > What does this addition have anything to do with the ownership of the
> > physical register file? Not a lot, it seems.
> 
> > Specially as there better be no state resident on the CPU when
> > userspace messes up with it.
> 
> If we have a situation where the state might be stored in memory in
> multiple formats it seems reasonable to consider the metadata which
> indicates which format is currently in use as part of the state.

There is no reason why the state should be in multiple formats
*simultaneously*. All the FP/SIMD/SVE/SME state is largely
overlapping, and we only need to correctly invalidate the state that
isn't relevant to writes from userspace.

> 
> > > We could store this separately to fp_state/owner but it'd still be a
> > > value stored in the vCPU.
> 
> > I totally disagree.
> 
> Where would you expect to see the state stored?

Sorry, that came out wrong. I expect *some* vcpu state to describe the
current use of the FP/vector registers, and that's about it. Not the
ownership information.

> 
> > > Storing in a format suitable for userspace
> > > usage all the time when we've got SME would most likely result in
> > > performance overhead
> 
> > What performance overhead? Why should we care?
> 
> Since in situations where we're not using the larger VL we would need to
> load and store the registers using a vector length other than the
> currently configured vector length we would not be able to use the
> ability to load and store to a location based on a multiple of the
> vector length that the architecture has:
> 
>    LDR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
>    LDR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
>    
>    STR <Zt>, [<Xn|SP>{, #<imm>, MUL VL}]
>    STR <Pt>, [<Xn|SP>{, #<imm>, MUL VL}]
> 
> and would instead need to manually compute the memory locations where
> values are stored.  As well as the extra instructions when using the
> smaller vector length we'd also be working with sparser data likely over
> more cache lines.

Are you talking about a context switch? or userspace accesses? I don't
give a damn about the latter, as it statistically never happens. The
former is of course of interest, but you still don't explain why the
above is a problem.

Nothing prevents you from storing the registers using the *current* VL,
since there is no data sharing between the SVE registers and the
streaming-SVE ones. All you need to do is to make sure you don't mix
the two.

> We would also need to consider if we need to zero the holes in the data
> when saving, we'd only potentially be leaking information from the guest
> but it might cause nasty surprises given that transitioning to/from
> streaming mode is expected to zero values.  If we do need to zero then
> that would be additional work that would need doing.

The zeroing is mandated by the architecture, AFAIU. That's not optional.

> 
> Exactly what the performance hit would be will be system and use case
> dependent.  *Hopefully* we aren't needing to save and load the guest
> state too often but I would be very surprised if we didn't have people
> considering any cost in the guest context switch path worth paying
> attention to.

These people can come out of the woodwork with numbers and reproducible
workloads. Until they do, their concerns do not really exist.

> As well as the performance overhead there would be some code complexity
> cost, if nothing else we'd not be using the same format as fpsimd_save()
> and would need to rearrange how we handle saving the register state.

And I'm fine with that. The way we store things is nobody's business
but ours, and I'm not sentimentally attached to 15 year old code.

> Spending more effort to implement something which also has more runtime
> performance overhead for the case of saving and restoring guest state
> which I expect to be vastly more common than the VMM accessing the guest
> registers just doesn't seem like an appealing choice.

I don't buy the runtime performance aspect at all. As long as you have
the space to dump the largest possible VL, you can always dump it in
the native format.

> > > if nothing else and feels more complicated than
> > > rewriting the data in the relatively unusual case where userspace looks
> > > at it.  Trying to convert userspace writes into the current layout would
> > > have issues if the current layout uses the smaller vector length and
> > > create fragility with ordering issues when loading the guest state.
> 
> > What ordering issues? If userspace manipulates the guest state, the
> > guest isn't running. If it is, all bets are off.
> 
> If we were storing the data in the native format for the guest then that
> format will change if streaming mode is changed via a write to SVCR.
> This would mean that the host would need to understand that when writing
> values SVCR needs to be written before the Z and P registers.  To be
> clear I don't think this is a good idea.

The architecture is crystal clear: you flip SVCR.SM, you lose all
data in both Z and P regs. If userspace doesn't understand the
architecture, that's their problem. The only thing we need to provide
is a faithful emulation of the architecture.
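A toy model of that architectural rule (illustrative Python, not KVM
code; the vector lengths are examples):

```python
class VcpuFpState:
    """Toy model: flipping SVCR.SM destroys all Z and P register data,
    and the current VL changes with the mode."""
    def __init__(self, vl: int, svl: int):
        self.vl, self.svl = vl, svl        # non-streaming / streaming VLs
        self.streaming = False
        self._zero_regs()

    def cur_vl(self) -> int:
        return self.svl if self.streaming else self.vl

    def _zero_regs(self):
        vl = self.cur_vl()
        self.z = [bytes(vl) for _ in range(32)]       # Z regs: VL bytes
        self.p = [bytes(vl // 8) for _ in range(16)]  # P regs: VL/8 bytes

    def write_svcr_sm(self, sm: bool):
        if sm != self.streaming:
            self.streaming = sm
            self._zero_regs()              # all Z/P data is lost

st = VcpuFpState(vl=32, svl=64)
st.z[0] = bytes([0xff]) * 32
st.write_svcr_sm(True)         # enter streaming mode: registers zeroed...
assert st.z[0] == bytes(64)    # ...and now sized by the streaming VL
```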

> 
> > > The proposal is not the most lovely idea ever but given the architecture
> > > I think some degree of clunkiness would be unavoidable.
> 
> > It is only unavoidable if we decide to make a bad job of it.
> 
> I don't think the handling of the vector registers for KVM with SME is
> something where there is a clear good and bad job we can do - I don't
> see how we can reasonably avoid at some point needing to translate
> vector lengths or to/from FPSIMD format (in the case of a system with
> SME but not SVE) which is just inherently a sharp edge.  It's just a
> question of when and how we do that.

My point is that there is no reason to translate the vector registers.
As long as your vcpu is in a given mode, all storage is done in that
mode. You switch mode, you lose data, as per the architecture. And
yes, there is some zeroing and invalidation to perform if the vcpu has
switched mode behind your back.

	M.
Mark Brown March 7, 2024, 2:26 p.m. UTC | #5
On Thu, Mar 07, 2024 at 11:10:40AM +0000, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Wed, Mar 06, 2024 at 09:43:13AM +0000, Marc Zyngier wrote:
> > > Mark Brown <broonie@kernel.org> wrote:
> > > > On Sat, Mar 02, 2024 at 11:19:35AM +0000, Marc Zyngier wrote:

> > > > The SME patch series proposes adding an additional state to this
> > > > enumeration which would say if the registers are stored in a format
> > > > suitable for exchange with userspace, that would make this state part of
> > > > the vCPU state.  With the addition of SME we can have two vector lengths
> > > > in play so the series proposes picking the larger to be the format for
> > > > userspace registers.

> > > What does this addition have anything to do with the ownership of the
> > > physical register file? Not a lot, it seems.

> > > Specially as there better be no state resident on the CPU when
> > > userspace messes up with it.

> > If we have a situation where the state might be stored in memory in
> > multiple formats it seems reasonable to consider the metadata which
> > indicates which format is currently in use as part of the state.

> There is no reason why the state should be in multiple formats
> *simultaneously*. All the FP/SIMD/SVE/SME state is largely
> overlapping, and we only need to correctly invalidate the state that
> isn't relevant to writes from userspace.

I agree that we don't want to store multiple copies, and the proposed
implementation for SME avoids that - when converting between guest native
and userspace formats we discard the original format after conversion
and update the metadata which says which format is stored.  That
metadata is modeled as adding a new owner.

> > > > We could store this separately to fp_state/owner but it'd still be a
> > > > value stored in the vCPU.

> > > I totally disagree.

> > Where would you expect to see the state stored?

> Sorry, that came out wrong. I expect *some* vcpu state to describe the
> current use of the FP/vector registers, and that's about it. Not the
> ownership information.

Ah, I think the distinction here is that you're modeling the state
tracking updated by this patch as purely tracking the registers in which
case yes, we should just add a separate variable in the vCPU for
tracking the format of the data.  I was modeling the state tracking as
covering both the state in the registers and the state of the storage
backing them (since the guest state being in the registers means that
the state in memory is invalid).

> > > > Storing in a format suitable for userspace
> > > > usage all the time when we've got SME would most likely result in
> > > > performance overhead

> > > What performance overhead? Why should we care?
> > 
> > Since in situations where we're not using the larger VL we would need to
> > load and store the registers using a vector length other than the

...

> > and would instead need to manually compute the memory locations where
> > values are stored.  As well as the extra instructions when using the
> > smaller vector length we'd also be working with sparser data likely over
> > more cache lines.

> Are you talking about a context switch? or userspace accesses? I don't
> give a damn about the latter, as it statistically never happens. The
> former is of course of interest, but you still don't explain why the
> above is a problem.

Well, there will be a cost in any rewriting, so I'm trying to shift it
away from context switch to userspace access since, as you say, userspace
accesses are negligibly frequent.

> Nothing prevent you from storing the registers using the *current* VL,
> since there is no data sharing between the SVE registers and the
> streaming-SVE ones. All you need to do is to make sure you don't mix
> the two.

You seem to be suggesting modeling the streaming mode registers as a
completely different set of registers in the KVM ABI.  That would
certainly be another option, though it's a bit unclear from an
architecture point of view and it didn't seem to entirely fit with the
handling of the FPSIMD registers when SVE is in use.  From an
architecture point of view the streaming mode registers aren't really a
separate set of registers.  When in streaming mode it is not possible to
observe the non-streaming register state and vice versa, and entering or
exiting streaming mode zeros the register state so functionally you just
have floating point registers.

You'd need to handle what happens with access to the inactive register
set from userspace, the simplest thing to implement would be to read the
logical value of zero and either discard or return an error on writes.

> > We would also need to consider if we need to zero the holes in the data
> > when saving, we'd only potentially be leaking information from the guest
> > but it might cause nasty surprises given that transitioning to/from
> > streaming mode is expected to zero values.  If we do need to zero then
> > that would be additional work that would need doing.

> The zeroing is mandated by the architecture, AFAIU. That's not optional.

Yes, we need to zero at some point - though I think we could just do it
on userspace read rather than having to do it when saving.

> > Spending more effort to implement something which also has more runtime
> > performance overhead for the case of saving and restoring guest state
> > which I expect to be vastly more common than the VMM accessing the guest
> > registers just doesn't seem like an appealing choice.

> I don't buy the runtime performance aspect at all. As long as you have
> the space to dump the largest possible VL, you can always dump it in
> the native format.

Sure, my point was that you appeared to be asking to dump in a
non-native format.  

> > If we were storing the data in the native format for the guest then that
> > format will change if streaming mode is changed via a write to SVCR.
> > This would mean that the host would need to understand that when writing
> > values SVCR needs to be written before the Z and P registers.  To be
> > clear I don't think this is a good idea.

> The architecture is crystal clear: you flip SVCR.SM, you loose all
> data in both Z and P regs. If userspace doesn't understand the
> architecture, that's their problem. The only thing we need to provide
> is a faithful emulation of the architecture.

It is not clear to me that the intention behind the KVM register ABI is
that it should have all the ordering requirements that the architecture
has, and decisions like hiding the V registers when SVE is active do
take us away from just a natural architectural point of view.  My
understanding was that it was intended that userspace should be able to
do something like just enumerate all the registers, save them and then
later on restore them without really understanding them.
Marc Zyngier March 9, 2024, 11:01 a.m. UTC | #6
On Thu, 07 Mar 2024 14:26:30 +0000,
Mark Brown <broonie@kernel.org> wrote:
> 
> On Thu, Mar 07, 2024 at 11:10:40AM +0000, Marc Zyngier wrote:
> > Mark Brown <broonie@kernel.org> wrote:
> > > On Wed, Mar 06, 2024 at 09:43:13AM +0000, Marc Zyngier wrote:
> > > > Mark Brown <broonie@kernel.org> wrote:
> > > > > On Sat, Mar 02, 2024 at 11:19:35AM +0000, Marc Zyngier wrote:
> 
> > > > > The SME patch series proposes adding an additional state to this
> > > > > enumeration which would say if the registers are stored in a format
> > > > > suitable for exchange with userspace, that would make this state part of
> > > > > the vCPU state.  With the addition of SME we can have two vector lengths
> > > > > in play so the series proposes picking the larger to be the format for
> > > > > userspace registers.
> 
> > > > What does this addition have anything to do with the ownership of the
> > > > physical register file? Not a lot, it seems.
> 
> > > > Specially as there better be no state resident on the CPU when
> > > > userspace messes up with it.
> 
> > > If we have a situation where the state might be stored in memory in
> > > multiple formats it seems reasonable to consider the metadata which
> > > indicates which format is currently in use as part of the state.
> 
> > There is no reason why the state should be in multiple formats
> > *simultaneously*. All the FP/SIMD/SVE/SME state is largely
> > overlapping, and we only need to correctly invalidate the state that
> > isn't relevant to writes from userspace.
> 
> I agree that we don't want to store multiple copies, the proposed
> implementation for SME does that - when converting between guest native
> and userspace formats we discard the original format after conversion
> and updates the metadata which says which format is stored.  That
> metadata is modeled as adding a new owner.

What conversion? If userspace writes to FP, the upper bits of the SVE
registers should get zeroed. If it writes to SVE, it writes using the
current VL for the current streaming/non-streaming mode.

If the current userspace API doesn't give us the tools to do so, we
rev it.

>
> > > > > We could store this separately to fp_state/owner but it'd still be a
> > > > > value stored in the vCPU.
> 
> > > > I totally disagree.
> 
> > > Where would you expect to see the state stored?
> 
> > Sorry, that came out wrong. I expect *some* vcpu state to describe the
> > current use of the FP/vector registers, and that's about it. Not the
> > ownership information.
> 
> Ah, I think the distinction here is that you're modeling the state
> tracking updated by this patch as purely tracking the registers in which
> case yes, we should just add a separate variable in the vCPU for
> tracking the format of the data.  I was modeling the state tracking as
> covering both the state in the registers and the state of the storage
> backing them (since the guest state being in the registers means that
> the state in memory is invalid).
> 
> > > > > Storing in a format suitable for userspace
> > > > > usage all the time when we've got SME would most likely result in
> > > > > performance overhead
> 
> > > > What performance overhead? Why should we care?
> > > 
> > > Since in situations where we're not using the larger VL we would need to
> > > load and store the registers using a vector length other than the
> 
> ...
> 
> > > and would instead need to manually compute the memory locations where
> > > values are stored.  As well as the extra instructions when using the
> > > smaller vector length we'd also be working with sparser data likely over
> > > more cache lines.
> 
> > Are you talking about a context switch? or userspace accesses? I don't
> > give a damn about the latter, as it statistically never happens. The
> > former is of course of interest, but you still don't explain why the
> > above is a problem.
> 
> Well, there will be a cost in any rewriting so I'm trying to shift it
> away from context switch to userspace access since as you say they are
> negligably frequent.

So we're in violent agreement.

> > Nothing prevent you from storing the registers using the *current* VL,
> > since there is no data sharing between the SVE registers and the
> > streaming-SVE ones. All you need to do is to make sure you don't mix
> > the two.
> 
> You seem to be suggesting modeling the streaming mode registers as a
> completely different set of registers in the KVM ABI.

Where are you reading this? I *never* said anything of the sort. There
is only one set of SVE registers. The fact that they change size and
get randomly zeroed when userspace flips bits is a property of the
architecture. The HW *can* implement them as two sets of registers if
SME is a separate accelerator, but that's not architectural.

All I'm saying is that you can have a *single* register file, spanning
both FPSIMD and SVE, using the maximum VL achievable in the guest, and
be done with it. Yes, you'll have to refactor the code so that FPSIMD
lands at the right spot in the SVE register file. Big deal.
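A minimal sketch of that single-register-file idea (illustrative Python;
the layout and helper names are assumptions, not the actual
refactoring):

```python
# One backing store spanning FPSIMD and SVE state, with each Z slot
# sized for the guest's maximum VL (sizes here are examples).

MAX_VL = 64            # bytes per Z slot, sized for the guest's max VL
NUM_ZREGS = 32

regfile = bytearray(NUM_ZREGS * MAX_VL)

def save_fpsimd(reg: int, v: bytes):
    """An FPSIMD save writes the 16-byte V register into the low 128
    bits of the corresponding Z slot (the upper bits are governed by the
    architectural zeroing rules, not modelled here)."""
    assert len(v) == 16
    off = reg * MAX_VL
    regfile[off:off + 16] = v

def save_sve(reg: int, z: bytes, vl: int):
    """An SVE save at the current VL lands at the same base offset, so
    both save paths share one layout."""
    assert len(z) == vl <= MAX_VL
    off = reg * MAX_VL
    regfile[off:off + vl] = z

save_fpsimd(1, bytes([0xab]) * 16)
assert regfile[MAX_VL:MAX_VL + 16] == bytes([0xab]) * 16
```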

> That would
> certainly be another option, though it's a bit unclear from an
> architecture point of view and it didn't seem to entirely fit with the
> handling of the FPSIMD registers when SVE is in use.  From an
> architecture point of view the streaming mode registers aren't really a
> separate set of registers.  When in streaming mode it is not possible to
> observe the non-streaming register state and vice versa, and entering or
> exiting streaming mode zeros the register state so functionally you just
> have floating point registers.
> 
> You'd need to handle what happens with access to the inactive register
> set from userspace, the simplest thing to implement would be to read the
> logical value of zero and either discard or return an error on writes.
> 
> > > We would also need to consider if we need to zero the holes in the data
> > > when saving, we'd only potentially be leaking information from the guest
> > > but it might cause nasty surprises given that transitioning to/from
> > > streaming mode is expected to zero values.  If we do need to zero then
> > > that would be additional work that would need doing.
> 
> > The zeroing is mandated by the architecture, AFAIU. That's not optional.
> 
> Yes, we need to zero at some point - we could just do it on userspace
> read though I think rather than having to do it when saving.

As long as the guest only observes an architecturally compliant state,
I'm fine with that.

> 
> > > Spending more effort to implement something which also has more runtime
> > > performance overhead for the case of saving and restoring guest state
> > > which I expect to be vastly more common than the VMM accessing the guest
> > > registers just doesn't seem like an appealing choice.
> 
> > I don't buy the runtime performance aspect at all. As long as you have
> > the space to dump the largest possible VL, you can always dump it in
> > the native format.
> 
> Sure, my point was that you appeared to be asking to dump in a
> non-native format.  

Again, what makes you think that? You keep putting words in my mouth.

>
> > > If we were storing the data in the native format for the guest then that
> > > format will change if streaming mode is changed via a write to SVCR.
> > > This would mean that the host would need to understand that when writing
> > > values SVCR needs to be written before the Z and P registers.  To be
> > > clear I don't think this is a good idea.
> 
> > The architecture is crystal clear: you flip SVCR.SM, you loose all
> > data in both Z and P regs. If userspace doesn't understand the
> > architecture, that's their problem. The only thing we need to provide
> > is a faithful emulation of the architecture.
> 
> It is not clear to me that the intention behind the KVM register ABI is
> that it should have all the ordering requirements that the architecture
> has, and decisions like hiding the V registers when SVE is active do
> take us away from just a natural architectural point of view.

My intention is that userspace accesses should be as close as possible
to those of a guest. If the current ABI doesn't allow this, I'm all for
changing it. Won't be the first time, won't be the last.

> My
> understanding was that it was intended that userspace should be able to
> do something like just enumerate all the registers, save them and then
> later on restore them without really understanding them.

And this statement doesn't get in the way of anything above. We own
the ABI, and can change it as we see fit when SME is enabled.

	M.
Mark Brown March 11, 2024, 6:42 p.m. UTC | #7
On Sat, Mar 09, 2024 at 11:01:56AM +0000, Marc Zyngier wrote:
> Mark Brown <broonie@kernel.org> wrote:

> > I agree that we don't want to store multiple copies, the proposed
> > implementation for SME does that - when converting between guest native
> > and userspace formats we discard the original format after conversion
> > and update the metadata which says which format is stored.  That
> > metadata is modeled as adding a new owner.

> What conversion? If userspace writes to FP, the upper bits of the SVE
> registers should get zeroed. If it writes to SVE, it writes using the
> current VL for the current streaming/non-streaming mode.

> If the current userspace API doesn't give us the tools to do so, we
> rev it.

This is the conversion of vector lengths rather than something resulting
from accessing the V registers via the sysreg interface
(which is not supported by either upstream or the currently posted SME
patches once either vector extension is enabled).  As previously
mentioned the current implementation of SME support for KVM always
presents the vector length sized registers to userspace with the maximum
vector length, not the vector length currently selected by SVCR.
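To make the "maximum vector length" presentation concrete, here is a
minimal sketch of what exposing a guest-native Z register to userspace
at the maximum VL could look like: the live bytes are copied and the
remainder zero-padded.  All names here are invented for illustration;
this is not the actual KVM/SME implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hedged sketch (not kernel code): present a Z register stored at the
 * guest's current vector length as a max-VL sized blob for userspace,
 * zero-padding the upper bytes.
 */
static void present_zreg(const uint8_t *native, unsigned int cur_vl,
			 uint8_t *out, unsigned int max_vl)
{
	assert(cur_vl <= max_vl);
	memcpy(out, native, cur_vl);		  /* live low bytes */
	memset(out + cur_vl, 0, max_vl - cur_vl); /* pad to max VL */
}
```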

> > > Nothing prevent you from storing the registers using the *current* VL,
> > > since there is no data sharing between the SVE registers and the
> > > streaming-SVE ones. All you need to do is to make sure you don't mix
> > > the two.

> > You seem to be suggesting modeling the streaming mode registers as a
> > completely different set of registers in the KVM ABI.

> Where are you reading this? I *never* said anything of the sort. There
> is only one set of SVE registers. The fact that they change size and
> get randomly zeroed when userspace flips bits is a property of the
> architecture. The HW *can* implement them as two sets of registers if
> SME is a separate accelerator, but that's not architectural.

When you talk about there being "no data sharing" and "mixing the two"
that sounds a lot like there might be two separate sets of data.  I've
been inferring a lot of what you are looking for from statements that
you make about other ideas and from the existing code and documentation
so things are less clear than they might be and it's likely there's been
some assumptions missed in places.  In order to try to clarify things
I've written down a summary of what I currently understand you're
looking for below.

> All I'm saying is that you can have a *single* register file, spanning
> both FPSIMD and SVE, using the maximum VL achievable in the guest, and

When you say "using" there I take it that you mean "sized to store"
rather than "written/accessed in the format for" since elsewhere you've
been talking about storing the current native format and changing the VL
in the ABI based on SVCR.SM?  I would have read "using the maximum VL
achievable in the guest" as meaning we store in a format reflecting the
larger VL but I'm pretty sure it's just the allocation you're
referencing there.

To be clear what I'm currently understanding for the userspace ABI is:

 - Create a requirement for userspace to set SVCR prior to setting any
   vector-impacted register, to ensure the correct format and that data
   isn't zeroed when SVCR is set.
 - Use the value of SVCR.SM and the guest maximum SVE and SME VLs to
   select the currently visible vector length for the Z, P and FFR
   registers.
 - This also implies discarding or failing all writes to ZA and ZT0
   unless SVCR.ZA is set for consistency, though that's much less
   impactful in terms of the implementation.
 - Add support for the V registers in the sysreg interface when SVE is
   enabled.

then the implementation can do what it likes to achieve that, the most
obvious thing being to store in native format for the current hardware
mode based on SVCR.{SM,ZA}.  Does that sound about right?
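The second bullet above can be sketched as a trivial selector: the
visible vector length for the Z, P and FFR registers follows SVCR.SM.
The helper name, parameters and bit definitions below are invented for
illustration only; the real SVCR field encodings and KVM internals may
differ.

```c
#include <assert.h>
#include <stdint.h>

#define SVCR_SM_BIT	(1ULL << 0)	/* streaming mode (assumed encoding) */
#define SVCR_ZA_BIT	(1ULL << 1)	/* ZA storage enabled (assumed encoding) */

/*
 * Hedged sketch: pick the vector length the Z/P/FFR registers are
 * exposed with, from SVCR.SM and the guest's maximum SVE/SME vector
 * lengths.
 */
static unsigned int visible_vl(uint64_t svcr, unsigned int max_sve_vl,
			       unsigned int max_sme_vl)
{
	/* Streaming mode uses the SME VL, otherwise the SVE VL */
	return (svcr & SVCR_SM_BIT) ? max_sme_vl : max_sve_vl;
}
```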

If that's all good the main thing I'm unclear on your expectations for
is what should happen with registers that are inaccessible in the
current mode (Z, P and FFR when in non-streaming mode without SVE, FFR
when in streaming mode without FA64, and ZA and ZT0 when SVCR.ZA is 0).
Having these registers cause errors on access would match the
architecture, but I worry about complicating and potentially breaking
the ABI by having enumerable but inaccessible registers, though I think
that's more in line with your general thinking.  The main alternative
would be RAZ/WI style behavior which isn't quite what the architecture
does but does give what the guest would observe when it changes modes
and looks at the contents of those registers.
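The two options just described (error-on-access vs RAZ/WI) can be
modelled side by side.  The sketch below uses FFR in streaming mode
without FA64 as the example; it is entirely illustrative and does not
describe any real KVM interface, helper or error code choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SVCR_SM_BIT	(1ULL << 0)	/* assumed encoding for illustration */

/*
 * Hedged sketch of the two behaviours for a register that is
 * inaccessible in the current mode: either fail the access
 * (architecture-like) or treat the register as RAZ/WI.
 */
static int read_ffr(uint64_t svcr, bool has_fa64, bool raz_wi,
		    uint64_t stored_ffr, uint64_t *val)
{
	bool streaming = svcr & SVCR_SM_BIT;

	if (streaming && !has_fa64) {
		if (!raz_wi)
			return -EINVAL;	/* error-on-access option */
		*val = 0;		/* RAZ option */
		return 0;
	}
	*val = stored_ffr;
	return 0;
}
```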

> be done with it. Yes, you'll have to refactor the code so that FPSIMD
> lands at the right spot in the SVE register file. Big deal.

At present the ABI just disables the V registers when SVE is enabled, I
was slightly surprised at that implementation and was concerned there
might be something that was considered desirable about the removal of
ambiguity (the changelog for the relevant commit did mention that it was
convenient to implement).

> > My
> > understanding was that it was intended that userspace should be able to
> > do something like just enumerate all the registers, save them and then
> > later on restore them without really understanding them.

> And this statement doesn't get in the way of anything above. We own
> the ABI, and can change it as we see fit when SME is enabled.

If you're willing to require specific ordering knowledge to do guest
state load then that makes life a bit easier for the implementation.  It
had appeared that VMMs could have existing ABI expectations about how
they can enumerate, store and load the state the guest implements, and
breaking those expectations would cause problems.
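The ordering knowledge in question can be shown with a toy model: if an
SVCR write is architecture-faithful, flipping SVCR.SM discards the
vector data, so a VMM that restores the Z registers before SVCR loses
what it just wrote.  The struct, sizes and helper below are made up
purely to model that behaviour; they are not KVM code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SVCR_SM_BIT	(1ULL << 0)	/* assumed encoding for illustration */

/* Toy vector state; sizes are arbitrary. */
struct toy_state {
	uint64_t svcr;
	uint8_t z[32];
};

/*
 * Hedged sketch: an architecture-faithful SVCR write zeroes the Z
 * (and P) data when SVCR.SM flips, so SVCR must be restored before
 * the vector registers.
 */
static void toy_write_svcr(struct toy_state *s, uint64_t val)
{
	if ((s->svcr ^ val) & SVCR_SM_BIT)
		memset(s->z, 0, sizeof(s->z));	/* mode flip discards data */
	s->svcr = val;
}
```

Restoring in the wrong order (Z first, then SVCR) zeroes the freshly
written data, which is exactly the ordering requirement a VMM would
have to learn.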
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index b804fe832184..9cd418d2bee0 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -593,7 +593,7 @@  static __always_inline u64 kvm_get_reset_cptr_el2(struct kvm_vcpu *vcpu)
 		val = (CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN);
 
 		if (!vcpu_has_sve(vcpu) ||
-		    (vcpu->arch.fp_state != FP_STATE_GUEST_OWNED))
+		    (*host_data_ptr(fp_owner) != FP_STATE_GUEST_OWNED))
 			val |= CPACR_EL1_ZEN_EL1EN | CPACR_EL1_ZEN_EL0EN;
 		if (cpus_have_final_cap(ARM64_SME))
 			val |= CPACR_EL1_SMEN_EL1EN | CPACR_EL1_SMEN_EL0EN;
@@ -601,7 +601,7 @@  static __always_inline u64 kvm_get_reset_cptr_el2(struct kvm_vcpu *vcpu)
 		val = CPTR_NVHE_EL2_RES1;
 
 		if (vcpu_has_sve(vcpu) &&
-		    (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED))
+		    (*host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED))
 			val |= CPTR_EL2_TZ;
 		if (cpus_have_final_cap(ARM64_SME))
 			val &= ~CPTR_EL2_TSM;
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 39b39da3e61b..e72e25702b9a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -507,6 +507,13 @@  struct kvm_host_data {
 	struct kvm_cpu_context host_ctxt;
 	struct user_fpsimd_state *fpsimd_state;	/* hyp VA */
 
+	/* Ownership of the FP regs */
+	enum {
+		FP_STATE_FREE,
+		FP_STATE_HOST_OWNED,
+		FP_STATE_GUEST_OWNED,
+	} fp_owner;
+
 	/*
 	 * host_debug_state contains the host registers which are
 	 * saved and restored during world switches.
@@ -582,13 +589,6 @@  struct kvm_vcpu_arch {
 	/* Exception Information */
 	struct kvm_vcpu_fault_info fault;
 
-	/* Ownership of the FP regs */
-	enum {
-		FP_STATE_FREE,
-		FP_STATE_HOST_OWNED,
-		FP_STATE_GUEST_OWNED,
-	} fp_state;
-
 	/* Configuration flags, set once and for all before the vcpu can run */
 	u8 cflags;
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d3d14cc41cb5..0c9720c7462e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -373,12 +373,6 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
 
-	/*
-	 * Default value for the FP state, will be overloaded at load
-	 * time if we support FP (pretty likely)
-	 */
-	vcpu->arch.fp_state = FP_STATE_FREE;
-
 	/* Set up the timer */
 	kvm_timer_vcpu_init(vcpu);
 
diff --git a/arch/arm64/kvm/fpsimd.c b/arch/arm64/kvm/fpsimd.c
index f650e46d4bea..4f51293db173 100644
--- a/arch/arm64/kvm/fpsimd.c
+++ b/arch/arm64/kvm/fpsimd.c
@@ -84,7 +84,7 @@  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
 	 * guest in kvm_arch_vcpu_ctxflush_fp() and override this to
 	 * FP_STATE_FREE if the flag set.
 	 */
-	vcpu->arch.fp_state = FP_STATE_HOST_OWNED;
+	*host_data_ptr(fp_owner) = FP_STATE_HOST_OWNED;
 	*host_data_ptr(fpsimd_state) = kern_hyp_va(&current->thread.uw.fpsimd_state);
 
 	vcpu_clear_flag(vcpu, HOST_SVE_ENABLED);
@@ -109,7 +109,7 @@  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
 		 * been saved, this is very unlikely to happen.
 		 */
 		if (read_sysreg_s(SYS_SVCR) & (SVCR_SM_MASK | SVCR_ZA_MASK)) {
-			vcpu->arch.fp_state = FP_STATE_FREE;
+			*host_data_ptr(fp_owner) = FP_STATE_FREE;
 			fpsimd_save_and_flush_cpu_state();
 		}
 	}
@@ -125,7 +125,7 @@  void kvm_arch_vcpu_load_fp(struct kvm_vcpu *vcpu)
 void kvm_arch_vcpu_ctxflush_fp(struct kvm_vcpu *vcpu)
 {
 	if (test_thread_flag(TIF_FOREIGN_FPSTATE))
-		vcpu->arch.fp_state = FP_STATE_FREE;
+		*host_data_ptr(fp_owner) = FP_STATE_FREE;
 }
 
 /*
@@ -141,7 +141,7 @@  void kvm_arch_vcpu_ctxsync_fp(struct kvm_vcpu *vcpu)
 
 	WARN_ON_ONCE(!irqs_disabled());
 
-	if (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED) {
+	if (*host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED) {
 
 		/*
 		 * Currently we do not support SME guests so SVCR is
@@ -194,7 +194,7 @@  void kvm_arch_vcpu_put_fp(struct kvm_vcpu *vcpu)
 		isb();
 	}
 
-	if (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED) {
+	if (*host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED) {
 		if (vcpu_has_sve(vcpu)) {
 			__vcpu_sys_reg(vcpu, ZCR_EL1) = read_sysreg_el1(SYS_ZCR);
 
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index f67d8eafc245..e124c4686376 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -42,7 +42,7 @@  extern struct kvm_exception_table_entry __stop___kvm_ex_table;
 /* Check whether the FP regs are owned by the guest */
 static inline bool guest_owns_fp_regs(struct kvm_vcpu *vcpu)
 {
-	return vcpu->arch.fp_state == FP_STATE_GUEST_OWNED;
+	return *host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED;
 }
 
 /* Save the 32-bit only FPSIMD system register state */
@@ -369,7 +369,7 @@  static bool kvm_hyp_handle_fpsimd(struct kvm_vcpu *vcpu, u64 *exit_code)
 	isb();
 
 	/* Write out the host state if it's in the registers */
-	if (vcpu->arch.fp_state == FP_STATE_HOST_OWNED)
+	if (*host_data_ptr(fp_owner) == FP_STATE_HOST_OWNED)
 		__fpsimd_save_state(*host_data_ptr(fpsimd_state));
 
 	/* Restore the guest state */
@@ -382,7 +382,7 @@  static bool kvm_hyp_handle_fpsimd(struct kvm_vcpu *vcpu, u64 *exit_code)
 	if (!(read_sysreg(hcr_el2) & HCR_RW))
 		write_sysreg(__vcpu_sys_reg(vcpu, FPEXC32_EL2), fpexc32_el2);
 
-	vcpu->arch.fp_state = FP_STATE_GUEST_OWNED;
+	*host_data_ptr(fp_owner) = FP_STATE_GUEST_OWNED;
 
 	return true;
 }
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index c5f625dc1f07..26561c562f7a 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -39,7 +39,6 @@  static void flush_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu)
 	hyp_vcpu->vcpu.arch.cptr_el2	= host_vcpu->arch.cptr_el2;
 
 	hyp_vcpu->vcpu.arch.iflags	= host_vcpu->arch.iflags;
-	hyp_vcpu->vcpu.arch.fp_state	= host_vcpu->arch.fp_state;
 
 	hyp_vcpu->vcpu.arch.debug_ptr	= kern_hyp_va(host_vcpu->arch.debug_ptr);
 
@@ -63,7 +62,6 @@  static void sync_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu)
 	host_vcpu->arch.fault		= hyp_vcpu->vcpu.arch.fault;
 
 	host_vcpu->arch.iflags		= hyp_vcpu->vcpu.arch.iflags;
-	host_vcpu->arch.fp_state	= hyp_vcpu->vcpu.arch.fp_state;
 
 	host_cpu_if->vgic_hcr		= hyp_cpu_if->vgic_hcr;
 	for (i = 0; i < hyp_cpu_if->used_lrs; ++i)
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 544a419b9a39..1f82d531a494 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -337,7 +337,7 @@  int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	__sysreg_restore_state_nvhe(host_ctxt);
 
-	if (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED)
+	if (*host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED)
 		__fpsimd_save_fpexc32(vcpu);
 
 	__debug_switch_to_host(vcpu);
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 14b7a6bc5909..b92f9fe2d50e 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -258,7 +258,7 @@  static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 
 	sysreg_restore_host_state_vhe(host_ctxt);
 
-	if (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED)
+	if (*host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED)
 		__fpsimd_save_fpexc32(vcpu);
 
 	__debug_switch_to_host(vcpu);