Message ID | 20201119153901.53705-1-steven.price@arm.com (mailing list archive)
---|---
Series | MTE support for KVM guest
On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>
> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).

I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

thanks
-- PMM
On 19/11/2020 15:45, Peter Maydell wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>
>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>
> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

I agree this needs some investigation - I'm hoping Haibo will be able to provide some feedback here as he has been looking at the QEMU support. However the VMM is likely going to require some significant changes to ensure that migration doesn't break, so either way there's work to be done.

Fundamentally most memory will need a mapping with PROT_MTE just so the VMM can get at the tags for migration purposes, so QEMU is going to have to learn how to treat guest memory specially if it wants to be able to enable MTE for both itself and the guest.

I'll also hunt down what's happening with my attempts to fix the set_pte_at() handling for swap and I'll post that as an alternative if it turns out to be a reasonable approach. But I don't think that solves the QEMU issue above.

The other alternative would be to implement a new kernel interface to fetch tags from the guest and not require the VMM to maintain a PROT_MTE mapping. But we need some real feedback from someone familiar with QEMU to know what that interface should look like. So I'm holding off on that until there's a 'real' PoC implementation.

Thanks,
Steve
On Thu, 19 Nov 2020 at 15:57, Steven Price <steven.price@arm.com> wrote:
> On 19/11/2020 15:45, Peter Maydell wrote:
>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>
> I agree this needs some investigation - I'm hoping Haibo will be able to provide some feedback here as he has been looking at the QEMU support. However the VMM is likely going to require some significant changes to ensure that migration doesn't break, so either way there's work to be done.
>
> Fundamentally most memory will need a mapping with PROT_MTE just so the VMM can get at the tags for migration purposes, so QEMU is going to have to learn how to treat guest memory specially if it wants to be able to enable MTE for both itself and the guest.

If the only reason the VMM needs tag access is for migration it feels like there must be a nicer way to do it than by requiring it to map the whole of the guest address space twice (once for normal use and once to get the tags)...

Anyway, maybe "must map PROT_MTE" is workable, but it seems a bit premature to fix the kernel ABI as working that way until we are at least reasonably sure that it is the right design.

thanks
-- PMM
On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>
>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>
> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

There are two statements being made here:

1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.

2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.

I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.

I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.

I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.

Thanks,
drew
On 2020-11-19 18:42, Andrew Jones wrote:
> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>
>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>
>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>
> There are two statements being made here:
>
> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>
> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>
> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>
> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.
>
> I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.

While I agree with the sentiment, the reality is pretty bleak.

I'm pretty sure nobody will ever run a migration on emulation. I also doubt there is much overlap between MTE users and migration users, unfortunately.

No HW is available today, and when it becomes available, it will be in the form of a closed system on which QEMU doesn't run, either because we are locked out of EL2 (as usual), or because migration is not part of the use case (like KVM on Android, for example).

So we can wait another two (five?) years until general purpose HW becomes available, or we start merging what we can test today. I'm inclined to do the latter.

And I think it is absolutely fine for QEMU to say "no MTE support with KVM" (we can remove all userspace visibility, except for the capability).

M.
On 19/11/2020 19:11, Marc Zyngier wrote:
> On 2020-11-19 18:42, Andrew Jones wrote:
>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>>
>>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>>
>>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>>
>> There are two statements being made here:
>>
>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>>
>> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>>
>> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>>
>> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.

(1) seems to be contentious whichever way we go. Either PROT_MTE isn't required, in which case it's easy to accidentally screw up migration, or it is required, in which case it's difficult to handle normal guest memory from the VMM. I get the impression that probably I should go back to the previous approach - sorry for the distraction with this change.

(2) isn't something I'm trying to skip, but I'm limited in what I can do myself so would appreciate help here. Haibo is looking into this.

>> I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.

As Marc says, hardware isn't available. Testing is either via the Arm FVP model (that I've been using for most of my testing) or QEMU full system emulation.

> While I agree with the sentiment, the reality is pretty bleak.
>
> I'm pretty sure nobody will ever run a migration on emulation. I also doubt there is much overlap between MTE users and migration users, unfortunately.
>
> No HW is available today, and when it becomes available, it will be in the form of a closed system on which QEMU doesn't run, either because we are locked out of EL2 (as usual), or because migration is not part of the use case (like KVM on Android, for example).
>
> So we can wait another two (five?) years until general purpose HW becomes available, or we start merging what we can test today. I'm inclined to do the latter.
>
> And I think it is absolutely fine for QEMU to say "no MTE support with KVM" (we can remove all userspace visibility, except for the capability).

What I'm trying to achieve is a situation where KVM+MTE without migration works and we leave ourselves a clear path where migration can be added. With hindsight I think this version of the series was a wrong turn - if we return to not requiring PROT_MTE then we have the following two potential options to explore for migration in the future:

* The VMM can choose to enable PROT_MTE if it needs to, and if desired we can add a flag to enforce this in the kernel.

* If needed a new kernel interface can be provided to fetch/set tags from guest memory which isn't mapped PROT_MTE.

Does this sound reasonable?

I'll clean up the set_pte_at() change and post a v6 later today.
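[For the second option, a tag fetch/set interface would need to name a guest range and a user buffer for the tags. The sketch below is purely hypothetical - the struct name, fields, and one-tag-byte-per-granule layout are assumptions for illustration, not a proposed ABI:]

```c
#include <stdint.h>

/* Hypothetical argument structure for a KVM ioctl that copies the MTE
 * tags of a range of guest memory to/from a user buffer -- a sketch of
 * the "new kernel interface" option above, NOT an existing ABI.
 * MTE stores one 4-bit allocation tag per 16-byte granule; storing one
 * tag per byte keeps the user buffer format simple. */
struct mte_copy_tags {
    uint64_t guest_ipa;   /* base IPA of the range, 16-byte aligned */
    uint64_t length;      /* bytes of guest memory to cover */
    uint64_t user_addr;   /* buffer of one tag byte per granule */
    uint64_t flags;       /* direction: e.g. a hypothetical GET vs SET */
};

/* User buffer size needed for a range: one byte per 16-byte granule. */
static inline uint64_t mte_tag_buf_size(uint64_t length)
{
    return length / 16;
}
```

Such an interface would let the VMM migrate tags without ever holding a PROT_MTE mapping of guest memory, at the cost of an extra copy per page.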
On 2020-11-20 09:50, Steven Price wrote:
> On 19/11/2020 19:11, Marc Zyngier wrote:
>
> Does this sound reasonable?
>
> I'll clean up the set_pte_at() change and post a v6 later today.

Please hold on. I still haven't reviewed your v5, nor have I had time to read your reply to my comments on v4.

Thanks,
M.
On 20/11/2020 09:56, Marc Zyngier wrote:
> On 2020-11-20 09:50, Steven Price wrote:
>> On 19/11/2020 19:11, Marc Zyngier wrote:
>>
>> Does this sound reasonable?
>>
>> I'll clean up the set_pte_at() change and post a v6 later today.
>
> Please hold on. I still haven't reviewed your v5, nor have I had time to read your reply to my comments on v4.

Sure, no problem ;)

Steve
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>
>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>
> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

Although that is also changing because of the encrypted/protected memory in things like SEV.

Dave

> thanks
> -- PMM
On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote:
> On 19/11/2020 19:11, Marc Zyngier wrote:
>> On 2020-11-19 18:42, Andrew Jones wrote:
>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>>>
>>>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>>>
>>>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>>>
>>> There are two statements being made here:
>>>
>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>>>
>>> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>>>
>>> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>>>
>>> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.
>
> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't required, in which case it's easy to accidentally screw up migration, or it is required, in which case it's difficult to handle normal guest memory from the VMM. I get the impression that probably I should go back to the previous approach - sorry for the distraction with this change.
>
> (2) isn't something I'm trying to skip, but I'm limited in what I can do myself so would appreciate help here. Haibo is looking into this.

Hi Steven,

Sorry for the late reply!

I have finished the POC for the MTE migration support with the assumption that all the memory is mapped with PROT_MTE. But I got stuck in the test with an FVP setup. Previously, I successfully compiled a test case to verify the basic function of MTE in a guest. But these days, the re-compiled test can't be executed by the guest (very weird). The short plan to verify the migration is to set the MTE tags on one page in the guest, and try to dump the migrated memory contents.

I will update the status later next week!

Regards,
Haibo

>>> I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.
>
> As Marc says, hardware isn't available. Testing is either via the Arm FVP model (that I've been using for most of my testing) or QEMU full system emulation.
>
>> While I agree with the sentiment, the reality is pretty bleak.
>>
>> I'm pretty sure nobody will ever run a migration on emulation. I also doubt there is much overlap between MTE users and migration users, unfortunately.
>>
>> No HW is available today, and when it becomes available, it will be in the form of a closed system on which QEMU doesn't run, either because we are locked out of EL2 (as usual), or because migration is not part of the use case (like KVM on Android, for example).
>>
>> So we can wait another two (five?) years until general purpose HW becomes available, or we start merging what we can test today. I'm inclined to do the latter.
>>
>> And I think it is absolutely fine for QEMU to say "no MTE support with KVM" (we can remove all userspace visibility, except for the capability).
>
> What I'm trying to achieve is a situation where KVM+MTE without migration works and we leave ourselves a clear path where migration can be added. With hindsight I think this version of the series was a wrong turn - if we return to not requiring PROT_MTE then we have the following two potential options to explore for migration in the future:
>
> * The VMM can choose to enable PROT_MTE if it needs to, and if desired we can add a flag to enforce this in the kernel.
>
> * If needed a new kernel interface can be provided to fetch/set tags from guest memory which isn't mapped PROT_MTE.
>
> Does this sound reasonable?
>
> I'll clean up the set_pte_at() change and post a v6 later today.
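[For a migration PoC like the one described above, the data the VMM actually has to move is small: one 4-bit allocation tag per 16-byte granule, i.e. 128 bytes of tags per 4KiB page. A hedged sketch of packing and unpacking tags for a migration stream - the nibble layout is an illustrative choice, not QEMU's wire format:]

```c
#include <stddef.h>
#include <stdint.h>

#define MTE_GRANULE_SIZE 16  /* one 4-bit allocation tag per 16 bytes */

/* Pack one-tag-per-byte (low 4 bits used) into two tags per byte for
 * a migration stream: even-numbered granule in the low nibble. */
void mte_pack_tags(const uint8_t *tags, size_t ngranules, uint8_t *out)
{
    for (size_t i = 0; i < ngranules; i += 2) {
        uint8_t lo = tags[i] & 0xf;
        uint8_t hi = (i + 1 < ngranules) ? (tags[i + 1] & 0xf) : 0;
        out[i / 2] = (uint8_t)(hi << 4 | lo);
    }
}

/* Inverse transform, run on the migration destination. */
void mte_unpack_tags(const uint8_t *packed, size_t ngranules, uint8_t *tags)
{
    for (size_t i = 0; i < ngranules; i++) {
        uint8_t byte = packed[i / 2];
        tags[i] = (i & 1) ? (byte >> 4) : (byte & 0xf);
    }
}
```

A 4KiB page has 4096/16 = 256 granules, so its tags pack into 128 bytes alongside the page data in the stream.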
On 04/12/2020 08:25, Haibo Xu wrote:
> On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote:
>> On 19/11/2020 19:11, Marc Zyngier wrote:
>>> On 2020-11-19 18:42, Andrew Jones wrote:
>>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>>>>
>>>>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>>>>
>>>>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>>>>
>>>> There are two statements being made here:
>>>>
>>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>>>>
>>>> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>>>>
>>>> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>>>>
>>>> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.
>>
>> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't required, in which case it's easy to accidentally screw up migration, or it is required, in which case it's difficult to handle normal guest memory from the VMM. I get the impression that probably I should go back to the previous approach - sorry for the distraction with this change.
>>
>> (2) isn't something I'm trying to skip, but I'm limited in what I can do myself so would appreciate help here. Haibo is looking into this.
>
> Hi Steven,
>
> Sorry for the late reply!
>
> I have finished the POC for the MTE migration support with the assumption that all the memory is mapped with PROT_MTE. But I got stuck in the test with an FVP setup. Previously, I successfully compiled a test case to verify the basic function of MTE in a guest. But these days, the re-compiled test can't be executed by the guest (very weird). The short plan to verify the migration is to set the MTE tags on one page in the guest, and try to dump the migrated memory contents.

Hi Haibo,

Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.

> I will update the status later next week!

Great, I look forward to hearing how it goes.

Thanks,
Steve
On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote:
> Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.

It's not entirely the same as the "protected VM" case. For that the patches currently on list basically special case "this is a debug access (eg from gdbstub/monitor)" which then either gets to go via "decrypt guest RAM for debug" or gets failed depending on whether the VM has a debug-is-ok flag enabled. For an MTE guest the common case will be guests doing standard DMA operations to or from guest memory. The ideal API for that from QEMU's point of view would be "accesses to guest RAM don't do tag checks, even if tag checks are enabled for accesses QEMU does to memory it has allocated itself as a normal userspace program".

thanks
-- PMM
On 07/12/2020 15:27, Peter Maydell wrote:
> On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote:
>> Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.
>
> It's not entirely the same as the "protected VM" case. For that the patches currently on list basically special case "this is a debug access (eg from gdbstub/monitor)" which then either gets to go via "decrypt guest RAM for debug" or gets failed depending on whether the VM has a debug-is-ok flag enabled. For an MTE guest the common case will be guests doing standard DMA operations to or from guest memory. The ideal API for that from QEMU's point of view would be "accesses to guest RAM don't do tag checks, even if tag checks are enabled for accesses QEMU does to memory it has allocated itself as a normal userspace program".

Sorry, I know I simplified it rather by saying it's similar to protected VM. Basically as I see it there are three types of memory access:

1) Debug case - has to go via a special case for decryption or ignoring the MTE tag value. Hopefully this can be abstracted in the same way.

2) Migration - for a protected VM there's likely to be a special method to allow the VMM access to the encrypted memory (AFAIK memory is usually kept inaccessible to the VMM). For MTE this again has to be special cased as we actually want both the data and the tag values.

3) Device DMA - for a protected VM it's usual to decrypt a small area of memory (with the permission of the guest) and use that as a bounce buffer. This is possible with MTE: have an area the VMM purposefully maps with PROT_MTE. The issue is that this has a performance overhead and we can do better with MTE because it's trivial for the VMM to disable the protection for any memory.

The part I'm unsure on is how easy it is for QEMU to deal with (3) without the overhead of bounce buffers. Ideally there'd already be a wrapper for guest memory accesses and that could just be wrapped with setting TCO during the access. I suspect the actual situation is more complex though, and I'm hoping Haibo's investigations will help us understand this.

Thanks,
Steve
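[The "wrapper... setting TCO during the access" idea can be sketched as follows. The function names are illustrative; the inline asm assumes an arm64 assembler that accepts the MTE extension, and on other architectures the sketch degrades to a plain memcpy:]

```c
#include <string.h>

/* Suspend tag checking for the current thread by setting the
 * PSTATE.TCO (Tag Check Override) bit.  arm64-only; a no-op on other
 * architectures so the sketch still compiles there. */
static inline void tag_checks_suspend(void)
{
#if defined(__aarch64__)
    __asm__ volatile(".arch armv8.5-a+memtag\n\tmsr tco, #1" ::: "memory");
#endif
}

static inline void tag_checks_resume(void)
{
#if defined(__aarch64__)
    __asm__ volatile(".arch armv8.5-a+memtag\n\tmsr tco, #0" ::: "memory");
#endif
}

/* Copy data out of guest memory without risking a tag check fault,
 * even if the VMM has MTE checking enabled for its own allocations. */
void copy_from_guest(void *dst, const void *guest_src, size_t len)
{
    tag_checks_suspend();
    memcpy(dst, guest_src, len);
    tag_checks_resume();
}
```

Since TCO is per-thread state and cheap to toggle, a wrapper like this avoids the bounce-buffer overhead described in (3), at the cost of losing tag protection for the duration of the access.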
On 2020-12-07 15:45, Steven Price wrote:
> On 07/12/2020 15:27, Peter Maydell wrote:
>> On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote:
>>> Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.
>>
>> It's not entirely the same as the "protected VM" case. For that the patches currently on list basically special case "this is a debug access (eg from gdbstub/monitor)" which then either gets to go via "decrypt guest RAM for debug" or gets failed depending on whether the VM has a debug-is-ok flag enabled. For an MTE guest the common case will be guests doing standard DMA operations to or from guest memory. The ideal API for that from QEMU's point of view would be "accesses to guest RAM don't do tag checks, even if tag checks are enabled for accesses QEMU does to memory it has allocated itself as a normal userspace program".
>
> Sorry, I know I simplified it rather by saying it's similar to protected VM. Basically as I see it there are three types of memory access:
>
> 1) Debug case - has to go via a special case for decryption or ignoring the MTE tag value. Hopefully this can be abstracted in the same way.
>
> 2) Migration - for a protected VM there's likely to be a special method to allow the VMM access to the encrypted memory (AFAIK memory is usually kept inaccessible to the VMM). For MTE this again has to be special cased as we actually want both the data and the tag values.
>
> 3) Device DMA - for a protected VM it's usual to decrypt a small area of memory (with the permission of the guest) and use that as a bounce buffer. This is possible with MTE: have an area the VMM purposefully maps with PROT_MTE. The issue is that this has a performance overhead and we can do better with MTE because it's trivial for the VMM to disable the protection for any memory.
>
> The part I'm unsure on is how easy it is for QEMU to deal with (3) without the overhead of bounce buffers. Ideally there'd already be a wrapper for guest memory accesses and that could just be wrapped with setting TCO during the access. I suspect the actual situation is more complex though, and I'm hoping Haibo's investigations will help us understand this.

What I'd really like to see is a description of how shared memory is, in general, supposed to work with MTE. My gut feeling is that it doesn't, and that you need to turn MTE off when sharing memory (either implicitly or explicitly).

Thanks,
M.
On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: > On 2020-12-07 15:45, Steven Price wrote: > > On 07/12/2020 15:27, Peter Maydell wrote: > > > On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> > > > wrote: > > > > Sounds like you are making good progress - thanks for the > > > > update. Have > > > > you thought about how the PROT_MTE mappings might work if QEMU itself > > > > were to use MTE? My worry is that we end up with MTE in a guest > > > > preventing QEMU from using MTE itself (because of the PROT_MTE > > > > mappings). I'm hoping QEMU can wrap its use of guest memory in a > > > > sequence which disables tag checking (something similar will be > > > > needed > > > > for the "protected VM" use case anyway), but this isn't > > > > something I've > > > > looked into. > > > > > > It's not entirely the same as the "protected VM" case. For that > > > the patches currently on list basically special case "this is a > > > debug access (eg from gdbstub/monitor)" which then either gets > > > to go via "decrypt guest RAM for debug" or gets failed depending > > > on whether the VM has a debug-is-ok flag enabled. For an MTE > > > guest the common case will be guests doing standard DMA operations > > > to or from guest memory. The ideal API for that from QEMU's > > > point of view would be "accesses to guest RAM don't do tag > > > checks, even if tag checks are enabled for accesses QEMU does to > > > memory it has allocated itself as a normal userspace program". > > > > Sorry, I know I simplified it rather by saying it's similar to > > protected VM. Basically as I see it there are three types of memory > > access: > > > > 1) Debug case - has to go via a special case for decryption or > > ignoring the MTE tag value. Hopefully this can be abstracted in the > > same way. 
> > > > 2) Migration - for a protected VM there's likely to be a special > > method to allow the VMM access to the encrypted memory (AFAIK memory > > is usually kept inaccessible to the VMM). For MTE this again has to be > > special cased as we actually want both the data and the tag values. > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small > > area of memory (with the permission of the guest) and use that as a > > bounce buffer. This is possible with MTE: have an area the VMM > > purposefully maps with PROT_MTE. The issue is that this has a > > performance overhead and we can do better with MTE because it's > > trivial for the VMM to disable the protection for any memory. > > > > The part I'm unsure on is how easy it is for QEMU to deal with (3) > > without the overhead of bounce buffers. Ideally there'd already be a > > wrapper for guest memory accesses and that could just be wrapped with > > setting TCO during the access. I suspect the actual situation is more > > complex though, and I'm hoping Haibo's investigations will help us > > understand this. > > What I'd really like to see is a description of how shared memory > is, in general, supposed to work with MTE. My gut feeling is that > it doesn't, and that you need to turn MTE off when sharing memory > (either implicitly or explicitly). The allocation tag (in-memory tag) is a property assigned to a physical address range and it can be safely shared between different processes as long as they access it via pointers with the same allocation tag (bits 59:56). The kernel enables such tagged shared memory for user processes (anonymous, tmpfs, shmem). What we don't have in the architecture is a memory type which allows access to tags but no tag checking. To access the data when the tags aren't known, the tag checking would have to be disabled via either a prctl() or by setting the PSTATE.TCO bit. 
The kernel accesses user memory via the linear map using the match-all tag 0xf, so no TCO bit toggling is needed. For user space, however, we disabled this match-all tag and it cannot be enabled at run-time (at least not easily, as it's cached in the TLB). That said, we already have two modes to disable tag checking which QEMU could use when migrating data+tags.
* Steven Price (steven.price@arm.com) wrote: > On 07/12/2020 15:27, Peter Maydell wrote: > > On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote: > > > Sounds like you are making good progress - thanks for the update. Have > > > you thought about how the PROT_MTE mappings might work if QEMU itself > > > were to use MTE? My worry is that we end up with MTE in a guest > > > preventing QEMU from using MTE itself (because of the PROT_MTE > > > mappings). I'm hoping QEMU can wrap its use of guest memory in a > > > sequence which disables tag checking (something similar will be needed > > > for the "protected VM" use case anyway), but this isn't something I've > > > looked into. > > > > It's not entirely the same as the "protected VM" case. For that > > the patches currently on list basically special case "this is a > > debug access (eg from gdbstub/monitor)" which then either gets > > to go via "decrypt guest RAM for debug" or gets failed depending > > on whether the VM has a debug-is-ok flag enabled. For an MTE > > guest the common case will be guests doing standard DMA operations > > to or from guest memory. The ideal API for that from QEMU's > > point of view would be "accesses to guest RAM don't do tag > > checks, even if tag checks are enabled for accesses QEMU does to > > memory it has allocated itself as a normal userspace program". > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > Basically as I see it there are three types of memory access: > > 1) Debug case - has to go via a special case for decryption or ignoring the > MTE tag value. Hopefully this can be abstracted in the same way. > > 2) Migration - for a protected VM there's likely to be a special method to > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > inaccessible to the VMM). For MTE this again has to be special cased as we > actually want both the data and the tag values. 
> > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > memory (with the permission of the guest) and use that as a bounce buffer. > This is possible with MTE: have an area the VMM purposefully maps with > PROT_MTE. The issue is that this has a performance overhead and we can do > better with MTE because it's trivial for the VMM to disable the protection > for any memory. Those all sound very similar to the AMD SEV world; there's the special case for Debug that Peter mentioned; migration is ...complicated and needs special case that's still being figured out, and as I understand Device DMA also uses a bounce buffer (and swiotlb in the guest to make that happen). I'm not sure about the stories for the IBM hardware equivalents. Dave > The part I'm unsure on is how easy it is for QEMU to deal with (3) without > the overhead of bounce buffers. Ideally there'd already be a wrapper for > guest memory accesses and that could just be wrapped with setting TCO during > the access. I suspect the actual situation is more complex though, and I'm > hoping Haibo's investigations will help us understand this. > > Thanks, > > Steve >
On Mon, 7 Dec 2020 at 16:44, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > * Steven Price (steven.price@arm.com) wrote: > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > > Basically as I see it there are three types of memory access: > > > > 1) Debug case - has to go via a special case for decryption or ignoring the > > MTE tag value. Hopefully this can be abstracted in the same way. > > > > 2) Migration - for a protected VM there's likely to be a special method to > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > > inaccessible to the VMM). For MTE this again has to be special cased as we > > actually want both the data and the tag values. > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > > memory (with the permission of the guest) and use that as a bounce buffer. > > This is possible with MTE: have an area the VMM purposefully maps with > > PROT_MTE. The issue is that this has a performance overhead and we can do > > better with MTE because it's trivial for the VMM to disable the protection > > for any memory. > > Those all sound very similar to the AMD SEV world; there's the special > case for Debug that Peter mentioned; migration is ...complicated and > needs special case that's still being figured out, and as I understand > Device DMA also uses a bounce buffer (and swiotlb in the guest to make > that happen). Mmm, but for encrypted VMs the VMM has to jump through all these hoops because "don't let the VMM directly access arbitrary guest RAM" is the whole point of the feature. For MTE, we don't want in general to be doing tag-checked accesses to guest RAM and there is nothing in the feature "allow guests to use MTE" that requires that the VMM's guest RAM accesses must do tag-checking. So we should avoid having a design that requires us to jump through all the hoops. 
Even if it happens that handling encrypted VMs means that QEMU has to grow some infrastructure for carefully positioning hoops in appropriate places, we shouldn't use it unnecessarily... All we actually need is a mechanism for migrating the tags: I don't think there's ever a situation where you want tag-checking enabled for the VMM's accesses to the guest RAM. thanks -- PMM
* Peter Maydell (peter.maydell@linaro.org) wrote: > On Mon, 7 Dec 2020 at 16:44, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Steven Price (steven.price@arm.com) wrote: > > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > > > Basically as I see it there are three types of memory access: > > > > > > 1) Debug case - has to go via a special case for decryption or ignoring the > > > MTE tag value. Hopefully this can be abstracted in the same way. > > > > > > 2) Migration - for a protected VM there's likely to be a special method to > > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > > > inaccessible to the VMM). For MTE this again has to be special cased as we > > > actually want both the data and the tag values. > > > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > > > memory (with the permission of the guest) and use that as a bounce buffer. > > > This is possible with MTE: have an area the VMM purposefully maps with > > > PROT_MTE. The issue is that this has a performance overhead and we can do > > > better with MTE because it's trivial for the VMM to disable the protection > > > for any memory. > > > > Those all sound very similar to the AMD SEV world; there's the special > > case for Debug that Peter mentioned; migration is ...complicated and > > needs special case that's still being figured out, and as I understand > > Device DMA also uses a bounce buffer (and swiotlb in the guest to make > > that happen). > > Mmm, but for encrypted VMs the VM has to jump through all these > hoops because "don't let the VM directly access arbitrary guest RAM" > is the whole point of the feature. For MTE, we don't want in general > to be doing tag-checked accesses to guest RAM and there is nothing > in the feature "allow guests to use MTE" that requires that the VMM's > guest RAM accesses must do tag-checking. 
> So we should avoid having > a design that require us to jump through all the hoops. Yes agreed, that's a fair distinction. Dave > Even if > it happens that handling encrypted VMs means that QEMU has to grow > some infrastructure for carefully positioning hoops in appropriate > places, we shouldn't use it unnecessarily... All we actually need is > a mechanism for migrating the tags: I don't think there's ever a > situation where you want tag-checking enabled for the VMM's accesses > to the guest RAM. > > thanks > -- PMM >
On Mon, 07 Dec 2020 16:34:05 +0000, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: > > > What I'd really like to see is a description of how shared memory > > > is, in general, supposed to work with MTE. My gut feeling is that > > > it doesn't, and that you need to turn MTE off when sharing memory > > > (either implicitly or explicitly). > > > > The allocation tag (in-memory tag) is a property assigned to a physical > > address range and it can be safely shared between different processes as > > long as they access it via pointers with the same allocation tag (bits > > 59:56). The kernel enables such tagged shared memory for user processes > > (anonymous, tmpfs, shmem). I think that's one case where the shared memory scheme breaks, as we have two kernels in charge of their own tags, and they obviously can't be synchronised. > > What we don't have in the architecture is a memory type which allows > > access to tags but no tag checking. To access the data when the tags > > aren't known, the tag checking would have to be disabled via either a > > prctl() or by setting the PSTATE.TCO bit. I guess that's point (3) in Steven's taxonomy. It's still a bit ugly to fit into an existing piece of userspace, especially if it wants to use MTE for its own benefit. > > The kernel accesses the user memory via the linear map using a match-all > > tag 0xf, so no TCO bit toggling. For user, however, we disabled such > > match-all tag and it cannot be enabled at run-time (at least not easily, > > it's cached in the TLB). However, we already have two modes to disable > > tag checking which Qemu could use when migrating data+tags. I wonder whether we will have to have something kernel side to dump/reload tags in a way that matches the patterns used by live migration. M.
On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: > > On 04/12/2020 08:25, Haibo Xu wrote: > > On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote: > >> > >> On 19/11/2020 19:11, Marc Zyngier wrote: > >>> On 2020-11-19 18:42, Andrew Jones wrote: > >>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote: > >>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote: > >>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to > >>>>>> KVM, allowing KVM guests to make use of it. This builds on the > >>>>> existing > >>>>>> user space support already in v5.10-rc1, see [1] for an overview. > >>>>> > >>>>>> The change to require the VMM to map all guest memory PROT_MTE is > >>>>>> significant as it means that the VMM has to deal with the MTE tags > >>>>> even > >>>>>> if it doesn't care about them (e.g. for virtual devices or if the VMM > >>>>>> doesn't support migration). Also unfortunately because the VMM can > >>>>>> change the memory layout at any time the check for PROT_MTE/VM_MTE has > >>>>>> to be done very late (at the point of faulting pages into stage 2). > >>>>> > >>>>> I'm a bit dubious about requring the VMM to map the guest memory > >>>>> PROT_MTE unless somebody's done at least a sketch of the design > >>>>> for how this would work on the QEMU side. Currently QEMU just > >>>>> assumes the guest memory is guest memory and it can access it > >>>>> without special precautions... > >>>>> > >>>> > >>>> There are two statements being made here: > >>>> > >>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit > >>>> QEMU well. > >>>> > >>>> 2) New KVM features should be accompanied with supporting QEMU code in > >>>> order to prove that the APIs make sense. > >>>> > >>>> I strongly agree with (2). While kvmtool supports some quick testing, it > >>>> doesn't support migration. We must test all new features with a migration > >>>> supporting VMM. 
> >>>> > >>>> I'm not sure about (1). I don't feel like it should be a major problem, > >>>> but (2). > >> > >> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't > >> required in which case it's easy to accidentally screw up migration, or > >> it is required in which case it's difficult to handle normal guest > >> memory from the VMM. I get the impression that probably I should go back > >> to the previous approach - sorry for the distraction with this change. > >> > >> (2) isn't something I'm trying to skip, but I'm limited in what I can do > >> myself so would appreciate help here. Haibo is looking into this. > >> > > > > Hi Steven, > > > > Sorry for the later reply! > > > > I have finished the POC for the MTE migration support with the assumption > > that all the memory is mapped with PROT_MTE. But I got stuck in the test > > with a FVP setup. Previously, I successfully compiled a test case to verify > > the basic function of MTE in a guest. But these days, the re-compiled test > > can't be executed by the guest(very weird). The short plan to verify > > the migration > > is to set the MTE tags on one page in the guest, and try to dump the migrated > > memory contents. > > Hi Haibo, > > Sounds like you are making good progress - thanks for the update. Have > you thought about how the PROT_MTE mappings might work if QEMU itself > were to use MTE? My worry is that we end up with MTE in a guest > preventing QEMU from using MTE itself (because of the PROT_MTE > mappings). I'm hoping QEMU can wrap its use of guest memory in a > sequence which disables tag checking (something similar will be needed > for the "protected VM" use case anyway), but this isn't something I've > looked into. As far as I can see, to map all the guest memory with PROT_MTE in VMM is a little weird, and lots of APIs have to be changed to include this flag. 
IMHO, it would be better if KVM could provide new APIs to load/store the guest memory tags, which may make it easier to enable QEMU migration support. > > > I will update the status later next week! > > Great, I look forward to hearing how it goes. > > Thanks, > > Steve
On 2020-12-08 09:51, Haibo Xu wrote: > On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: >> [...] >> Sounds like you are making good progress - thanks for the update. Have >> you thought about how the PROT_MTE mappings might work if QEMU itself >> were to use MTE? My worry is that we end up with MTE in a guest >> preventing QEMU from using MTE itself (because of the PROT_MTE >> mappings). I'm hoping QEMU can wrap its use of guest memory in a >> sequence which disables tag checking (something similar will be needed >> for the "protected VM" use case anyway), but this isn't something I've >> looked into. > > As far as I can see, to map all the guest memory with PROT_MTE in VMM > is a little weird, and lots of APIs have to be changed to include this > flag. > IMHO, it would be better if the KVM can provide new APIs to load/store > the > guest memory tag which may make it easier to enable the Qemu migration > support. On what granularity? To what storage? How do you plan to synchronise this with the dirty-log interface? Thanks, M.
On Tue, 8 Dec 2020 at 00:44, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Steven Price (steven.price@arm.com) wrote: > > On 07/12/2020 15:27, Peter Maydell wrote: > > > On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote: > > > > Sounds like you are making good progress - thanks for the update. Have > > > > you thought about how the PROT_MTE mappings might work if QEMU itself > > > > were to use MTE? My worry is that we end up with MTE in a guest > > > > preventing QEMU from using MTE itself (because of the PROT_MTE > > > > mappings). I'm hoping QEMU can wrap its use of guest memory in a > > > > sequence which disables tag checking (something similar will be needed > > > > for the "protected VM" use case anyway), but this isn't something I've > > > > looked into. > > > > > > It's not entirely the same as the "protected VM" case. For that > > > the patches currently on list basically special case "this is a > > > debug access (eg from gdbstub/monitor)" which then either gets > > > to go via "decrypt guest RAM for debug" or gets failed depending > > > on whether the VM has a debug-is-ok flag enabled. For an MTE > > > guest the common case will be guests doing standard DMA operations > > > to or from guest memory. The ideal API for that from QEMU's > > > point of view would be "accesses to guest RAM don't do tag > > > checks, even if tag checks are enabled for accesses QEMU does to > > > memory it has allocated itself as a normal userspace program". > > > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > > Basically as I see it there are three types of memory access: > > > > 1) Debug case - has to go via a special case for decryption or ignoring the > > MTE tag value. Hopefully this can be abstracted in the same way. > > > > 2) Migration - for a protected VM there's likely to be a special method to > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > > inaccessible to the VMM). 
For MTE this again has to be special cased as we > > actually want both the data and the tag values. > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > > memory (with the permission of the guest) and use that as a bounce buffer. > > This is possible with MTE: have an area the VMM purposefully maps with > > PROT_MTE. The issue is that this has a performance overhead and we can do > > better with MTE because it's trivial for the VMM to disable the protection > > for any memory. > > Those all sound very similar to the AMD SEV world; there's the special > case for Debug that Peter mentioned; migration is ...complicated and > needs special case that's still being figured out, and as I understand > Device DMA also uses a bounce buffer (and swiotlb in the guest to make > that happen). > > > I'm not sure about the stories for the IBM hardware equivalents. Like the s390-skeys (storage keys) support in QEMU? I have read the s390-skeys migration support in QEMU and found that the logic is very similar to that of MTE, with the difference that the s390-skeys are migrated separately from the guest memory data, while for MTE I think the guest memory tags should go with the memory data. > > Dave > > > The part I'm unsure on is how easy it is for QEMU to deal with (3) without > > the overhead of bounce buffers. Ideally there'd already be a wrapper for > > guest memory accesses and that could just be wrapped with setting TCO during > > the access. I suspect the actual situation is more complex though, and I'm > > hoping Haibo's investigations will help us understand this. > > > > Thanks, > > > > Steve > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >
On Tue, 8 Dec 2020 at 18:01, Marc Zyngier <maz@kernel.org> wrote: > > On 2020-12-08 09:51, Haibo Xu wrote: > > On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: > >> > > [...] > > >> Sounds like you are making good progress - thanks for the update. Have > >> you thought about how the PROT_MTE mappings might work if QEMU itself > >> were to use MTE? My worry is that we end up with MTE in a guest > >> preventing QEMU from using MTE itself (because of the PROT_MTE > >> mappings). I'm hoping QEMU can wrap its use of guest memory in a > >> sequence which disables tag checking (something similar will be needed > >> for the "protected VM" use case anyway), but this isn't something I've > >> looked into. > > > > As far as I can see, to map all the guest memory with PROT_MTE in VMM > > is a little weird, and lots of APIs have to be changed to include this > > flag. > > IMHO, it would be better if the KVM can provide new APIs to load/store > > the > > guest memory tag which may make it easier to enable the Qemu migration > > support. > > On what granularity? To what storage? How do you plan to synchronise > this > with the dirty-log interface? QEMU migrates memory page by page; if a page that has already been migrated becomes dirty again, the migration process re-sends it. The current MTE migration POC code sends the page tags just after the page data, so if a page becomes dirty again, both the data and the tags are re-sent. > > Thanks, > > M. > -- > Jazz is not dead. It just smells funny...
On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: > On Mon, 07 Dec 2020 16:34:05 +0000, > Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: > > > What I'd really like to see is a description of how shared memory > > > is, in general, supposed to work with MTE. My gut feeling is that > > > it doesn't, and that you need to turn MTE off when sharing memory > > > (either implicitly or explicitly). > > > > The allocation tag (in-memory tag) is a property assigned to a physical > > address range and it can be safely shared between different processes as > > long as they access it via pointers with the same allocation tag (bits > > 59:56). The kernel enables such tagged shared memory for user processes > > (anonymous, tmpfs, shmem). > > I think that's one case where the shared memory scheme breaks, as we > have two kernels in charge of their own tags, and they obviously can't > be synchronised Yes, if you can't trust the other entity to not change the tags, the only option is to do an untagged access. > > What we don't have in the architecture is a memory type which allows > > access to tags but no tag checking. To access the data when the tags > > aren't known, the tag checking would have to be disabled via either a > > prctl() or by setting the PSTATE.TCO bit. > > I guess that's point (3) in Steven's taxonomy. It still a bit ugly to > fit in an existing piece of userspace, specially if it wants to use > MTE for its own benefit. I agree it's ugly. For the device DMA emulation case, the only sane way is to mimic what a real device does - no tag checking. For a generic implementation, this means that such shared memory should not be mapped with PROT_MTE on the VMM side. I guess this leads to your point that sharing doesn't work for this scenario ;). > > The kernel accesses the user memory via the linear map using a match-all > > tag 0xf, so no TCO bit toggling. 
For user, however, we disabled such > > match-all tag and it cannot be enabled at run-time (at least not easily, > > it's cached in the TLB). However, we already have two modes to disable > > tag checking which Qemu could use when migrating data+tags. > > I wonder whether we will have to have something kernel side to > dump/reload tags in a way that matches the patterns used by live > migration. We have something related - ptrace dumps/restores the tags. Can the same concept be expanded to a KVM ioctl?
On 2020-12-08 17:21, Catalin Marinas wrote: > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: >> On Mon, 07 Dec 2020 16:34:05 +0000, >> Catalin Marinas <catalin.marinas@arm.com> wrote: >> > On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: >> > > What I'd really like to see is a description of how shared memory >> > > is, in general, supposed to work with MTE. My gut feeling is that >> > > it doesn't, and that you need to turn MTE off when sharing memory >> > > (either implicitly or explicitly). >> > >> > The allocation tag (in-memory tag) is a property assigned to a physical >> > address range and it can be safely shared between different processes as >> > long as they access it via pointers with the same allocation tag (bits >> > 59:56). The kernel enables such tagged shared memory for user processes >> > (anonymous, tmpfs, shmem). >> >> I think that's one case where the shared memory scheme breaks, as we >> have two kernels in charge of their own tags, and they obviously can't >> be synchronised > > Yes, if you can't trust the other entity to not change the tags, the > only option is to do an untagged access. > >> > What we don't have in the architecture is a memory type which allows >> > access to tags but no tag checking. To access the data when the tags >> > aren't known, the tag checking would have to be disabled via either a >> > prctl() or by setting the PSTATE.TCO bit. >> >> I guess that's point (3) in Steven's taxonomy. It still a bit ugly to >> fit in an existing piece of userspace, specially if it wants to use >> MTE for its own benefit. > > I agree it's ugly. For the device DMA emulation case, the only sane way > is to mimic what a real device does - no tag checking. For a generic > implementation, this means that such shared memory should not be mapped > with PROT_MTE on the VMM side. I guess this leads to your point that > sharing doesn't work for this scenario ;). 
Exactly ;-) >> > The kernel accesses the user memory via the linear map using a match-all >> > tag 0xf, so no TCO bit toggling. For user, however, we disabled such >> > match-all tag and it cannot be enabled at run-time (at least not easily, >> > it's cached in the TLB). However, we already have two modes to disable >> > tag checking which Qemu could use when migrating data+tags. >> >> I wonder whether we will have to have something kernel side to >> dump/reload tags in a way that matches the patterns used by live >> migration. > > We have something related - ptrace dumps/resores the tags. Can the same > concept be expanded to a KVM ioctl? Yes, although I wonder whether we should integrate this deeply into the dirty-log mechanism: it would be really interesting to dump the tags at the point where the page is flagged as clean from a dirty-log point of view. As the page is dirtied, discard the saved tags. It is probably expensive, but it ensures that the VMM sees consistent tags (if the page is clean, the tags are valid). Of course, it comes with the added requirement that the VMM allocates enough memory to store the tags, which may be a tall order. I'm not sure how to give a consistent view to userspace otherwise. It'd be worth looking at how much we can reuse from the ptrace (and I expect swap?) code to implement this. Thanks, M.
On Tue, Dec 08, 2020 at 06:21:12PM +0000, Marc Zyngier wrote: > On 2020-12-08 17:21, Catalin Marinas wrote: > > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: > > > I wonder whether we will have to have something kernel side to > > > dump/reload tags in a way that matches the patterns used by live > > > migration. > > > > We have something related - ptrace dumps/resores the tags. Can the same > > concept be expanded to a KVM ioctl? > > Yes, although I wonder whether we should integrate this deeply into > the dirty-log mechanism: it would be really interesting to dump the > tags at the point where the page is flagged as clean from a dirty-log > point of view. As the page is dirtied, discard the saved tags. From the VMM perspective, the tags can be treated just like additional (meta)data in a page. We'd only need the tags when copying over. It can race with the VM dirtying the page (writing tags would dirty it) but I don't think the current migration code cares about this. If dirtied, it copies it again. The only downside I see is an extra syscall per page both on the origin VMM and the destination one to dump/restore the tags. Is this a performance issue?
On 2020-12-09 12:44, Catalin Marinas wrote: > On Tue, Dec 08, 2020 at 06:21:12PM +0000, Marc Zyngier wrote: >> On 2020-12-08 17:21, Catalin Marinas wrote: >> > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: >> > > I wonder whether we will have to have something kernel side to >> > > dump/reload tags in a way that matches the patterns used by live >> > > migration. >> > >> > We have something related - ptrace dumps/resores the tags. Can the same >> > concept be expanded to a KVM ioctl? >> >> Yes, although I wonder whether we should integrate this deeply into >> the dirty-log mechanism: it would be really interesting to dump the >> tags at the point where the page is flagged as clean from a dirty-log >> point of view. As the page is dirtied, discard the saved tags. > > From the VMM perspective, the tags can be treated just like additional > (meta)data in a page. We'd only need the tags when copying over. It can > race with the VM dirtying the page (writing tags would dirty it) but I > don't think the current migration code cares about this. If dirtied, it > copies it again. > > The only downside I see is an extra syscall per page both on the origin > VMM and the destination one to dump/restore the tags. Is this a > performance issue? I'm not sure. Migrating VMs already has a massive overhead, so an extra syscall per page isn't terrifying. But that's the point where I admit not knowing enough about what the VMM expects, nor whether that matches what happens on other architectures that deal with per-page metadata. Would this syscall operate on the guest address space? Or on the VMM's own mapping? M.
On Wed, Dec 09, 2020 at 01:25:18PM +0000, Marc Zyngier wrote: > On 2020-12-09 12:44, Catalin Marinas wrote: > > On Tue, Dec 08, 2020 at 06:21:12PM +0000, Marc Zyngier wrote: > > > On 2020-12-08 17:21, Catalin Marinas wrote: > > > > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: > > > > > I wonder whether we will have to have something kernel side to > > > > > dump/reload tags in a way that matches the patterns used by live > > > > > migration. > > > > > > > > We have something related - ptrace dumps/resores the tags. Can the same > > > > concept be expanded to a KVM ioctl? > > > > > > Yes, although I wonder whether we should integrate this deeply into > > > the dirty-log mechanism: it would be really interesting to dump the > > > tags at the point where the page is flagged as clean from a dirty-log > > > point of view. As the page is dirtied, discard the saved tags. > > > > From the VMM perspective, the tags can be treated just like additional > > (meta)data in a page. We'd only need the tags when copying over. It can > > race with the VM dirtying the page (writing tags would dirty it) but I > > don't think the current migration code cares about this. If dirtied, it > > copies it again. > > > > The only downside I see is an extra syscall per page both on the origin > > VMM and the destination one to dump/restore the tags. Is this a > > performance issue? > > I'm not sure. Migrating VMs already has a massive overhead, so an extra > syscall per page isn't terrifying. But that's the point where I admit > not knowing enough about what the VMM expects, nor whether that matches > what happens on other architectures that deal with per-page metadata. > > Would this syscall operate on the guest address space? Or on the VMM's > own mapping? Whatever is easier for the VMM, I don't think it matters as long as the host kernel can get the actual physical address (and linear map correspondent). 
Maybe simpler if it's the VMM address space as the kernel can check the access permissions in case you want to hide the guest memory from the VMM for other reasons (migration is also off the table). Without syscalls, an option would be for the VMM to create two mappings: one with PROT_MTE for migration and the other without for normal DMA etc. That's achievable using memfd_create() or shm_open() and two mmap() calls, only one having PROT_MTE. The VMM address space should be sufficiently large to map two guest IPAs.
On 12/9/20 9:27 AM, Catalin Marinas wrote: > On Wed, Dec 09, 2020 at 01:25:18PM +0000, Marc Zyngier wrote: >> Would this syscall operate on the guest address space? Or on the VMM's >> own mapping? ... > Whatever is easier for the VMM, I don't think it matters as long as the > host kernel can get the actual physical address (and linear map > correspondent). Maybe simpler if it's the VMM address space as the > kernel can check the access permissions in case you want to hide the > guest memory from the VMM for other reasons (migration is also off the > table). Indeed, such a syscall is no longer specific to VMMs and may be used for any bulk move of tags that userland might want. > Without syscalls, an option would be for the VMM to create two mappings: > one with PROT_MTE for migration and the other without for normal DMA > etc. That's achievable using memfd_create() or shm_open() and two mmap() > calls, only one having PROT_MTE. The VMM address space should be > sufficiently large to map two guest IPAs. I would have thought that the best way is to use TCO, so that we don't have to have dual mappings (and however many MB of extra page tables that might imply). r~
On Wed, Dec 09, 2020 at 12:27:59PM -0600, Richard Henderson wrote: > On 12/9/20 9:27 AM, Catalin Marinas wrote: > > On Wed, Dec 09, 2020 at 01:25:18PM +0000, Marc Zyngier wrote: > >> Would this syscall operate on the guest address space? Or on the VMM's > >> own mapping? > ... > > Whatever is easier for the VMM, I don't think it matters as long as the > > host kernel can get the actual physical address (and linear map > > correspondent). Maybe simpler if it's the VMM address space as the > > kernel can check the access permissions in case you want to hide the > > guest memory from the VMM for other reasons (migration is also off the > > table). > > Indeed, such a syscall is no longer specific to vmm's and may be used for any > bulk move of tags that userland might want. For CRIU, I think the current ptrace interface would do. With VMMs, the same remote VM model doesn't apply (the "remote" VM is actually the guest memory). I'd keep this under a KVM ioctl() number rather than a new, specific syscall. > > Without syscalls, an option would be for the VMM to create two mappings: > > one with PROT_MTE for migration and the other without for normal DMA > > etc. That's achievable using memfd_create() or shm_open() and two mmap() > > calls, only one having PROT_MTE. The VMM address space should be > > sufficiently large to map two guest IPAs. > > I would have thought that the best way is to use TCO, so that we don't have to > have dual mappings (and however many MB of extra page tables that might imply). The problem appears when the VMM wants to use MTE itself (e.g. linked against an MTE-aware glibc), toggling TCO is no longer generic enough, especially when it comes to device emulation.
On 12/9/20 12:39 PM, Catalin Marinas wrote: >> I would have thought that the best way is to use TCO, so that we don't have to >> have dual mappings (and however many MB of extra page tables that might imply). > > The problem appears when the VMM wants to use MTE itself (e.g. linked > against an MTE-aware glibc), toggling TCO is no longer generic enough, > especially when it comes to device emulation. But we do know exactly when we're manipulating guest memory -- we have special routines for that. So the special routines gain a toggle of TCO around the exact guest memory manipulation, not a blanket disable of MTE across large swaths of QEMU. r~
On Wed, 9 Dec 2020 at 20:13, Richard Henderson <richard.henderson@linaro.org> wrote: > > On 12/9/20 12:39 PM, Catalin Marinas wrote: > >> I would have thought that the best way is to use TCO, so that we don't have to > >> have dual mappings (and however many MB of extra page tables that might imply). > > > > The problem appears when the VMM wants to use MTE itself (e.g. linked > > against an MTE-aware glibc), toggling TCO is no longer generic enough, > > especially when it comes to device emulation. > > But we do know exactly when we're manipulating guest memory -- we have special > routines for that. Well, yes and no. It's not like every access to guest memory is through a specific set of "load from guest"/"store from guest" functions, and in some situations there's a "get a pointer to guest RAM, keep using it over a long-ish sequence of QEMU code, then be done with it" pattern. It's because it's not that trivial to isolate when something is accessing guest RAM that I don't want to just have it be mapped PROT_MTE into QEMU. I think we'd end up spending a lot of time hunting down "whoops, turns out this is accessing guest RAM and sometimes it trips over the tags in a hard-to-debug way" bugs. I'd much rather the kernel just provided us with an API for what we want, which is (1) the guest RAM as just RAM with no tag checking and separately (2) some mechanism yet-to-be-designed which lets us bulk copy a page's worth of tags for migration. thanks -- PMM
On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: > > On 04/12/2020 08:25, Haibo Xu wrote: > > On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote: > >> > >> On 19/11/2020 19:11, Marc Zyngier wrote: > >>> On 2020-11-19 18:42, Andrew Jones wrote: > >>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote: > >>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote: > >>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to > >>>>>> KVM, allowing KVM guests to make use of it. This builds on the > >>>>> existing > >>>>>> user space support already in v5.10-rc1, see [1] for an overview. > >>>>> > >>>>>> The change to require the VMM to map all guest memory PROT_MTE is > >>>>>> significant as it means that the VMM has to deal with the MTE tags > >>>>> even > >>>>>> if it doesn't care about them (e.g. for virtual devices or if the VMM > >>>>>> doesn't support migration). Also unfortunately because the VMM can > >>>>>> change the memory layout at any time the check for PROT_MTE/VM_MTE has > >>>>>> to be done very late (at the point of faulting pages into stage 2). > >>>>> > >>>>> I'm a bit dubious about requring the VMM to map the guest memory > >>>>> PROT_MTE unless somebody's done at least a sketch of the design > >>>>> for how this would work on the QEMU side. Currently QEMU just > >>>>> assumes the guest memory is guest memory and it can access it > >>>>> without special precautions... > >>>>> > >>>> > >>>> There are two statements being made here: > >>>> > >>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit > >>>> QEMU well. > >>>> > >>>> 2) New KVM features should be accompanied with supporting QEMU code in > >>>> order to prove that the APIs make sense. > >>>> > >>>> I strongly agree with (2). While kvmtool supports some quick testing, it > >>>> doesn't support migration. We must test all new features with a migration > >>>> supporting VMM. 
> >>>> > >>>> I'm not sure about (1). I don't feel like it should be a major problem, > >>>> but (2). > >> > >> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't > >> required in which case it's easy to accidentally screw up migration, or > >> it is required in which case it's difficult to handle normal guest > >> memory from the VMM. I get the impression that probably I should go back > >> to the previous approach - sorry for the distraction with this change. > >> > >> (2) isn't something I'm trying to skip, but I'm limited in what I can do > >> myself so would appreciate help here. Haibo is looking into this. > >> > > > > Hi Steven, > > > > Sorry for the later reply! > > > > I have finished the POC for the MTE migration support with the assumption > > that all the memory is mapped with PROT_MTE. But I got stuck in the test > > with a FVP setup. Previously, I successfully compiled a test case to verify > > the basic function of MTE in a guest. But these days, the re-compiled test > > can't be executed by the guest(very weird). The short plan to verify > > the migration > > is to set the MTE tags on one page in the guest, and try to dump the migrated > > memory contents. > > Hi Haibo, > > Sounds like you are making good progress - thanks for the update. Have > you thought about how the PROT_MTE mappings might work if QEMU itself > were to use MTE? My worry is that we end up with MTE in a guest > preventing QEMU from using MTE itself (because of the PROT_MTE > mappings). I'm hoping QEMU can wrap its use of guest memory in a > sequence which disables tag checking (something similar will be needed > for the "protected VM" use case anyway), but this isn't something I've > looked into. > > > I will update the status later next week! > > Great, I look forward to hearing how it goes. Hi Steve, I have finished verifying the POC on a FVP setup, and the MTE test case can be migrated from one VM to another successfully. 
Since the test case is very simple (it just maps one page with MTE enabled and does some memory access), I can't say it's OK for other cases. BTW, I noticed that you have sent out patch set v6, which mentions that mapping all the guest memory with PROT_MTE was not feasible. So what's the plan for the next step? Will new KVM APIs that facilitate storing and restoring the tags be available? Regards, Haibo > > Thanks, > > Steve
On 16/12/2020 07:31, Haibo Xu wrote: [...] > Hi Steve, Hi Haibo > I have finished verifying the POC on a FVP setup, and the MTE test case can > be migrated from one VM to another successfully. Since the test case is very > simple which just maps one page with MTE enabled and does some memory > access, so I can't say it's OK for other cases. That's great progress. > > BTW, I noticed that you have sent out patch set v6 which mentions that mapping > all the guest memory with PROT_MTE was not feasible. So what's the plan for the > next step? Will new KVM APIs which can facilitate the tag store and recover be > available? I'm currently rebasing on top of the KASAN MTE patch series. My plan for now is to switch back to not requiring the VMM to supply PROT_MTE (so KVM 'upgrades' the pages as necessary) and I'll add an RFC patch at the end of the series to add a KVM API for doing bulk read/write of tags. That way the VMM can map guest memory without PROT_MTE (so device 'DMA' accesses will be unchecked), and use the new API for migration. Thanks, Steve
On Wed, 16 Dec 2020 at 18:23, Steven Price <steven.price@arm.com> wrote: > > On 16/12/2020 07:31, Haibo Xu wrote: > [...] > > Hi Steve, > > Hi Haibo > > > I have finished verifying the POC on a FVP setup, and the MTE test case can > > be migrated from one VM to another successfully. Since the test case is very > > simple which just maps one page with MTE enabled and does some memory > > access, so I can't say it's OK for other cases. > > That's great progress. > > > > > BTW, I noticed that you have sent out patch set v6 which mentions that mapping > > all the guest memory with PROT_MTE was not feasible. So what's the plan for the > > next step? Will new KVM APIs which can facilitate the tag store and recover be > > available? > > I'm currently rebasing on top of the KASAN MTE patch series. My plan for > now is to switch back to not requiring the VMM to supply PROT_MTE (so > KVM 'upgrades' the pages as necessary) and I'll add an RFC patch on the > end of the series to add an KVM API for doing bulk read/write of tags. > That way the VMM can map guest memory without PROT_MTE (so device 'DMA' > accesses will be unchecked), and use the new API for migration. > Great! Will have a try with the new API in my POC! > Thanks, > > Steve