Message ID | 20201119153901.53705-1-steven.price@arm.com (mailing list archive)
---|---
Series | MTE support for KVM guest
On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>
> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).

I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

thanks
-- PMM
On 19/11/2020 15:45, Peter Maydell wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>
>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>
> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

I agree this needs some investigation - I'm hoping Haibo will be able to provide some feedback here as he has been looking at the QEMU support. However the VMM is likely going to require some significant changes to ensure that migration doesn't break, so either way there's work to be done.

Fundamentally most memory will need a mapping with PROT_MTE just so the VMM can get at the tags for migration purposes, so QEMU is going to have to learn how to treat guest memory specially if it wants to be able to enable MTE for both itself and the guest.

I'll also hunt down what's happening with my attempts to fix the set_pte_at() handling for swap and I'll post that as an alternative if it turns out to be a reasonable approach. But I don't think that solves the QEMU issue above.

The other alternative would be to implement a new kernel interface to fetch tags from the guest and not require the VMM to maintain a PROT_MTE mapping. But we need some real feedback from someone familiar with QEMU to know what that interface should look like. So I'm holding off on that until there's a 'real' PoC implementation.

Thanks,
Steve
On Thu, 19 Nov 2020 at 15:57, Steven Price <steven.price@arm.com> wrote:
> On 19/11/2020 15:45, Peter Maydell wrote:
>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>
> I agree this needs some investigation - I'm hoping Haibo will be able to provide some feedback here as he has been looking at the QEMU support. However the VMM is likely going to require some significant changes to ensure that migration doesn't break, so either way there's work to be done.
>
> Fundamentally most memory will need a mapping with PROT_MTE just so the VMM can get at the tags for migration purposes, so QEMU is going to have to learn how to treat guest memory specially if it wants to be able to enable MTE for both itself and the guest.

If the only reason the VMM needs tag access is for migration it feels like there must be a nicer way to do it than by requiring it to map the whole of the guest address space twice (once for normal use and once to get the tags)...

Anyway, maybe "must map PROT_MTE" is workable, but it seems a bit premature to fix the kernel ABI as working that way until we are at least reasonably sure that it is the right design.

thanks
-- PMM
On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>
>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>
> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

There are two statements being made here:

1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.

2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.

I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.

I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.

I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.

Thanks,
drew
On 2020-11-19 18:42, Andrew Jones wrote:
> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>
>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>
>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>
> There are two statements being made here:
>
> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>
> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>
> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>
> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.
>
> I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.

While I agree with the sentiment, the reality is pretty bleak.

I'm pretty sure nobody will ever run a migration on emulation. I also doubt there is much overlap between MTE users and migration users, unfortunately.

No HW is available today, and when it becomes available, it will be in the form of a closed system on which QEMU doesn't run, either because we are locked out of EL2 (as usual), or because migration is not part of the use case (like KVM on Android, for example).

So we can wait another two (five?) years until general purpose HW becomes available, or we start merging what we can test today. I'm inclined to do the latter.

And I think it is absolutely fine for QEMU to say "no MTE support with KVM" (we can remove all userspace visibility, except for the capability).

M.
On 19/11/2020 19:11, Marc Zyngier wrote:
> On 2020-11-19 18:42, Andrew Jones wrote:
>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>>
>>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>>
>>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>>
>> There are two statements being made here:
>>
>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>>
>> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>>
>> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>>
>> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.

(1) seems to be contentious whichever way we go. Either PROT_MTE isn't required, in which case it's easy to accidentally screw up migration, or it is required, in which case it's difficult to handle normal guest memory from the VMM. I get the impression that probably I should go back to the previous approach - sorry for the distraction with this change.

(2) isn't something I'm trying to skip, but I'm limited in what I can do myself so would appreciate help here. Haibo is looking into this.

>> I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.

As Marc says, hardware isn't available. Testing is either via the Arm FVP model (that I've been using for most of my testing) or QEMU full system emulation.

> While I agree with the sentiment, the reality is pretty bleak.
>
> I'm pretty sure nobody will ever run a migration on emulation. I also doubt there is much overlap between MTE users and migration users, unfortunately.
>
> No HW is available today, and when it becomes available, it will be in the form of a closed system on which QEMU doesn't run, either because we are locked out of EL2 (as usual), or because migration is not part of the use case (like KVM on Android, for example).
>
> So we can wait another two (five?) years until general purpose HW becomes available, or we start merging what we can test today. I'm inclined to do the latter.
>
> And I think it is absolutely fine for QEMU to say "no MTE support with KVM" (we can remove all userspace visibility, except for the capability).

What I'm trying to achieve is a situation where KVM+MTE without migration works and we leave ourselves a clear path where migration can be added. With hindsight I think this version of the series was a wrong turn - if we return to not requiring PROT_MTE then we have the following two potential options to explore for migration in the future:

* The VMM can choose to enable PROT_MTE if it needs to, and if desired we can add a flag to enforce this in the kernel.

* If needed a new kernel interface can be provided to fetch/set tags from guest memory which isn't mapped PROT_MTE.

Does this sound reasonable?

I'll clean up the set_pte_at() change and post a v6 later today.
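[For the second option, a tag fetch/set interface would need to name a guest range and a user buffer for the tags. The sketch below is purely hypothetical - the struct name, fields, and one-tag-byte-per-granule layout are assumptions for illustration, not a proposed ABI:]

```c
#include <stdint.h>

/* Hypothetical argument structure for a KVM ioctl that copies the MTE
 * tags of a range of guest memory to/from a user buffer -- a sketch of
 * the "new kernel interface" option above, NOT an existing ABI.
 * MTE stores one 4-bit allocation tag per 16-byte granule; storing one
 * tag per byte keeps the user buffer format simple. */
struct mte_copy_tags {
    uint64_t guest_ipa;   /* base IPA of the range, 16-byte aligned */
    uint64_t length;      /* bytes of guest memory to cover */
    uint64_t user_addr;   /* buffer of one tag byte per granule */
    uint64_t flags;       /* direction: e.g. a hypothetical GET vs SET */
};

/* User buffer size needed for a range: one byte per 16-byte granule. */
static inline uint64_t mte_tag_buf_size(uint64_t length)
{
    return length / 16;
}
```

Such an interface would let the VMM migrate tags without ever holding a PROT_MTE mapping of guest memory, at the cost of an extra copy per page.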
On 2020-11-20 09:50, Steven Price wrote:
> On 19/11/2020 19:11, Marc Zyngier wrote:
>
> Does this sound reasonable?
>
> I'll clean up the set_pte_at() change and post a v6 later today.

Please hold on. I still haven't reviewed your v5, nor have I had time to read your reply to my comments on v4.

Thanks,
M.
On 20/11/2020 09:56, Marc Zyngier wrote:
> On 2020-11-20 09:50, Steven Price wrote:
>> On 19/11/2020 19:11, Marc Zyngier wrote:
>>
>> Does this sound reasonable?
>>
>> I'll clean up the set_pte_at() change and post a v6 later today.
>
> Please hold on. I still haven't reviewed your v5, nor have I had time to read your reply to my comments on v4.

Sure, no problem ;)

Steve
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>
>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>
> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...

Although that is also changing because of the encrypted/protected memory in things like SEV.

Dave

> thanks
> -- PMM
On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote:
> On 19/11/2020 19:11, Marc Zyngier wrote:
>> On 2020-11-19 18:42, Andrew Jones wrote:
>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>>>
>>>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>>>
>>>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>>>
>>> There are two statements being made here:
>>>
>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>>>
>>> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>>>
>>> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>>>
>>> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.
>
> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't required, in which case it's easy to accidentally screw up migration, or it is required, in which case it's difficult to handle normal guest memory from the VMM. I get the impression that probably I should go back to the previous approach - sorry for the distraction with this change.
>
> (2) isn't something I'm trying to skip, but I'm limited in what I can do myself so would appreciate help here. Haibo is looking into this.

Hi Steven,

Sorry for the late reply!

I have finished the POC for the MTE migration support with the assumption that all the memory is mapped with PROT_MTE. But I got stuck in the test with an FVP setup. Previously, I successfully compiled a test case to verify the basic function of MTE in a guest. But these days, the re-compiled test can't be executed by the guest (very weird). The short plan to verify the migration is to set the MTE tags on one page in the guest, and try to dump the migrated memory contents.

I will update the status later next week!

Regards,
Haibo

>>> I'd be happy to help with the QEMU prototype, but preferably when there's hardware available. Has all the current MTE testing just been done on simulators? And, if so, are there regression tests regularly running on the simulators too? And can they test migration? If hardware doesn't show up quickly and simulators aren't used for regression tests, then all this code will start rotting from day one.
>
> As Marc says, hardware isn't available. Testing is either via the Arm FVP model (that I've been using for most of my testing) or QEMU full system emulation.
>
>> While I agree with the sentiment, the reality is pretty bleak.
>>
>> I'm pretty sure nobody will ever run a migration on emulation. I also doubt there is much overlap between MTE users and migration users, unfortunately.
>>
>> No HW is available today, and when it becomes available, it will be in the form of a closed system on which QEMU doesn't run, either because we are locked out of EL2 (as usual), or because migration is not part of the use case (like KVM on Android, for example).
>>
>> So we can wait another two (five?) years until general purpose HW becomes available, or we start merging what we can test today. I'm inclined to do the latter.
>>
>> And I think it is absolutely fine for QEMU to say "no MTE support with KVM" (we can remove all userspace visibility, except for the capability).
>
> What I'm trying to achieve is a situation where KVM+MTE without migration works and we leave ourselves a clear path where migration can be added. With hindsight I think this version of the series was a wrong turn - if we return to not requiring PROT_MTE then we have the following two potential options to explore for migration in the future:
>
> * The VMM can choose to enable PROT_MTE if it needs to, and if desired we can add a flag to enforce this in the kernel.
>
> * If needed a new kernel interface can be provided to fetch/set tags from guest memory which isn't mapped PROT_MTE.
>
> Does this sound reasonable?
>
> I'll clean up the set_pte_at() change and post a v6 later today.
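[For a migration PoC like the one described above, the data the VMM actually has to move is small: one 4-bit allocation tag per 16-byte granule, i.e. 128 bytes of tags per 4KiB page. A hedged sketch of packing and unpacking tags for a migration stream - the nibble layout is an illustrative choice, not QEMU's wire format:]

```c
#include <stddef.h>
#include <stdint.h>

#define MTE_GRANULE_SIZE 16  /* one 4-bit allocation tag per 16 bytes */

/* Pack one-tag-per-byte (low 4 bits used) into two tags per byte for
 * a migration stream: even-numbered granule in the low nibble. */
void mte_pack_tags(const uint8_t *tags, size_t ngranules, uint8_t *out)
{
    for (size_t i = 0; i < ngranules; i += 2) {
        uint8_t lo = tags[i] & 0xf;
        uint8_t hi = (i + 1 < ngranules) ? (tags[i + 1] & 0xf) : 0;
        out[i / 2] = (uint8_t)(hi << 4 | lo);
    }
}

/* Inverse transform, run on the migration destination. */
void mte_unpack_tags(const uint8_t *packed, size_t ngranules, uint8_t *tags)
{
    for (size_t i = 0; i < ngranules; i++) {
        uint8_t byte = packed[i / 2];
        tags[i] = (i & 1) ? (byte >> 4) : (byte & 0xf);
    }
}
```

A 4KiB page has 4096/16 = 256 granules, so its tags pack into 128 bytes alongside the page data in the stream.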
On 04/12/2020 08:25, Haibo Xu wrote:
> On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote:
>> On 19/11/2020 19:11, Marc Zyngier wrote:
>>> On 2020-11-19 18:42, Andrew Jones wrote:
>>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote:
>>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote:
>>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to KVM, allowing KVM guests to make use of it. This builds on the existing user space support already in v5.10-rc1, see [1] for an overview.
>>>>>>
>>>>>> The change to require the VMM to map all guest memory PROT_MTE is significant as it means that the VMM has to deal with the MTE tags even if it doesn't care about them (e.g. for virtual devices or if the VMM doesn't support migration). Also unfortunately because the VMM can change the memory layout at any time the check for PROT_MTE/VM_MTE has to be done very late (at the point of faulting pages into stage 2).
>>>>>
>>>>> I'm a bit dubious about requiring the VMM to map the guest memory PROT_MTE unless somebody's done at least a sketch of the design for how this would work on the QEMU side. Currently QEMU just assumes the guest memory is guest memory and it can access it without special precautions...
>>>>
>>>> There are two statements being made here:
>>>>
>>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit QEMU well.
>>>>
>>>> 2) New KVM features should be accompanied with supporting QEMU code in order to prove that the APIs make sense.
>>>>
>>>> I strongly agree with (2). While kvmtool supports some quick testing, it doesn't support migration. We must test all new features with a migration-supporting VMM.
>>>>
>>>> I'm not sure about (1). I don't feel like it should be a major problem, but that's exactly what (2) would tell us.
>>
>> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't required, in which case it's easy to accidentally screw up migration, or it is required, in which case it's difficult to handle normal guest memory from the VMM. I get the impression that probably I should go back to the previous approach - sorry for the distraction with this change.
>>
>> (2) isn't something I'm trying to skip, but I'm limited in what I can do myself so would appreciate help here. Haibo is looking into this.
>
> Hi Steven,
>
> Sorry for the late reply!
>
> I have finished the POC for the MTE migration support with the assumption that all the memory is mapped with PROT_MTE. But I got stuck in the test with an FVP setup. Previously, I successfully compiled a test case to verify the basic function of MTE in a guest. But these days, the re-compiled test can't be executed by the guest (very weird). The short plan to verify the migration is to set the MTE tags on one page in the guest, and try to dump the migrated memory contents.

Hi Haibo,

Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.

> I will update the status later next week!

Great, I look forward to hearing how it goes.

Thanks,
Steve
On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote:
> Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.

It's not entirely the same as the "protected VM" case. For that the patches currently on list basically special case "this is a debug access (eg from gdbstub/monitor)" which then either gets to go via "decrypt guest RAM for debug" or gets failed depending on whether the VM has a debug-is-ok flag enabled. For an MTE guest the common case will be guests doing standard DMA operations to or from guest memory. The ideal API for that from QEMU's point of view would be "accesses to guest RAM don't do tag checks, even if tag checks are enabled for accesses QEMU does to memory it has allocated itself as a normal userspace program".

thanks
-- PMM
On 07/12/2020 15:27, Peter Maydell wrote:
> On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote:
>> Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.
>
> It's not entirely the same as the "protected VM" case. For that the patches currently on list basically special case "this is a debug access (eg from gdbstub/monitor)" which then either gets to go via "decrypt guest RAM for debug" or gets failed depending on whether the VM has a debug-is-ok flag enabled. For an MTE guest the common case will be guests doing standard DMA operations to or from guest memory. The ideal API for that from QEMU's point of view would be "accesses to guest RAM don't do tag checks, even if tag checks are enabled for accesses QEMU does to memory it has allocated itself as a normal userspace program".

Sorry, I know I simplified it rather by saying it's similar to protected VM. Basically as I see it there are three types of memory access:

1) Debug case - has to go via a special case for decryption or ignoring the MTE tag value. Hopefully this can be abstracted in the same way.

2) Migration - for a protected VM there's likely to be a special method to allow the VMM access to the encrypted memory (AFAIK memory is usually kept inaccessible to the VMM). For MTE this again has to be special cased as we actually want both the data and the tag values.

3) Device DMA - for a protected VM it's usual to decrypt a small area of memory (with the permission of the guest) and use that as a bounce buffer. This is possible with MTE: have an area the VMM purposefully maps with PROT_MTE. The issue is that this has a performance overhead and we can do better with MTE because it's trivial for the VMM to disable the protection for any memory.

The part I'm unsure on is how easy it is for QEMU to deal with (3) without the overhead of bounce buffers. Ideally there'd already be a wrapper for guest memory accesses and that could just be wrapped with setting TCO during the access. I suspect the actual situation is more complex though, and I'm hoping Haibo's investigations will help us understand this.

Thanks,
Steve
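[The "wrapper... setting TCO during the access" idea can be sketched as follows. The function names are illustrative; the inline asm assumes an arm64 assembler that accepts the MTE extension, and on other architectures the sketch degrades to a plain memcpy:]

```c
#include <string.h>

/* Suspend tag checking for the current thread by setting the
 * PSTATE.TCO (Tag Check Override) bit.  arm64-only; a no-op on other
 * architectures so the sketch still compiles there. */
static inline void tag_checks_suspend(void)
{
#if defined(__aarch64__)
    __asm__ volatile(".arch armv8.5-a+memtag\n\tmsr tco, #1" ::: "memory");
#endif
}

static inline void tag_checks_resume(void)
{
#if defined(__aarch64__)
    __asm__ volatile(".arch armv8.5-a+memtag\n\tmsr tco, #0" ::: "memory");
#endif
}

/* Copy data out of guest memory without risking a tag check fault,
 * even if the VMM has MTE checking enabled for its own allocations. */
void copy_from_guest(void *dst, const void *guest_src, size_t len)
{
    tag_checks_suspend();
    memcpy(dst, guest_src, len);
    tag_checks_resume();
}
```

Since TCO is per-thread state and cheap to toggle, a wrapper like this avoids the bounce-buffer overhead described in (3), at the cost of losing tag protection for the duration of the access.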
On 2020-12-07 15:45, Steven Price wrote:
> On 07/12/2020 15:27, Peter Maydell wrote:
>> On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote:
>>> Sounds like you are making good progress - thanks for the update. Have you thought about how the PROT_MTE mappings might work if QEMU itself were to use MTE? My worry is that we end up with MTE in a guest preventing QEMU from using MTE itself (because of the PROT_MTE mappings). I'm hoping QEMU can wrap its use of guest memory in a sequence which disables tag checking (something similar will be needed for the "protected VM" use case anyway), but this isn't something I've looked into.
>>
>> It's not entirely the same as the "protected VM" case. For that the patches currently on list basically special case "this is a debug access (eg from gdbstub/monitor)" which then either gets to go via "decrypt guest RAM for debug" or gets failed depending on whether the VM has a debug-is-ok flag enabled. For an MTE guest the common case will be guests doing standard DMA operations to or from guest memory. The ideal API for that from QEMU's point of view would be "accesses to guest RAM don't do tag checks, even if tag checks are enabled for accesses QEMU does to memory it has allocated itself as a normal userspace program".
>
> Sorry, I know I simplified it rather by saying it's similar to protected VM. Basically as I see it there are three types of memory access:
>
> 1) Debug case - has to go via a special case for decryption or ignoring the MTE tag value. Hopefully this can be abstracted in the same way.
>
> 2) Migration - for a protected VM there's likely to be a special method to allow the VMM access to the encrypted memory (AFAIK memory is usually kept inaccessible to the VMM). For MTE this again has to be special cased as we actually want both the data and the tag values.
>
> 3) Device DMA - for a protected VM it's usual to decrypt a small area of memory (with the permission of the guest) and use that as a bounce buffer. This is possible with MTE: have an area the VMM purposefully maps with PROT_MTE. The issue is that this has a performance overhead and we can do better with MTE because it's trivial for the VMM to disable the protection for any memory.
>
> The part I'm unsure on is how easy it is for QEMU to deal with (3) without the overhead of bounce buffers. Ideally there'd already be a wrapper for guest memory accesses and that could just be wrapped with setting TCO during the access. I suspect the actual situation is more complex though, and I'm hoping Haibo's investigations will help us understand this.

What I'd really like to see is a description of how shared memory is, in general, supposed to work with MTE. My gut feeling is that it doesn't, and that you need to turn MTE off when sharing memory (either implicitly or explicitly).

Thanks,
M.
On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: > On 2020-12-07 15:45, Steven Price wrote: > > On 07/12/2020 15:27, Peter Maydell wrote: > > > On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> > > > wrote: > > > > Sounds like you are making good progress - thanks for the > > > > update. Have > > > > you thought about how the PROT_MTE mappings might work if QEMU itself > > > > were to use MTE? My worry is that we end up with MTE in a guest > > > > preventing QEMU from using MTE itself (because of the PROT_MTE > > > > mappings). I'm hoping QEMU can wrap its use of guest memory in a > > > > sequence which disables tag checking (something similar will be > > > > needed > > > > for the "protected VM" use case anyway), but this isn't > > > > something I've > > > > looked into. > > > > > > It's not entirely the same as the "protected VM" case. For that > > > the patches currently on list basically special case "this is a > > > debug access (eg from gdbstub/monitor)" which then either gets > > > to go via "decrypt guest RAM for debug" or gets failed depending > > > on whether the VM has a debug-is-ok flag enabled. For an MTE > > > guest the common case will be guests doing standard DMA operations > > > to or from guest memory. The ideal API for that from QEMU's > > > point of view would be "accesses to guest RAM don't do tag > > > checks, even if tag checks are enabled for accesses QEMU does to > > > memory it has allocated itself as a normal userspace program". > > > > Sorry, I know I simplified it rather by saying it's similar to > > protected VM. Basically as I see it there are three types of memory > > access: > > > > 1) Debug case - has to go via a special case for decryption or > > ignoring the MTE tag value. Hopefully this can be abstracted in the > > same way. 
> > > > 2) Migration - for a protected VM there's likely to be a special > > method to allow the VMM access to the encrypted memory (AFAIK memory > > is usually kept inaccessible to the VMM). For MTE this again has to be > > special cased as we actually want both the data and the tag values. > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small > > area of memory (with the permission of the guest) and use that as a > > bounce buffer. This is possible with MTE: have an area the VMM > > purposefully maps with PROT_MTE. The issue is that this has a > > performance overhead and we can do better with MTE because it's > > trivial for the VMM to disable the protection for any memory. > > > > The part I'm unsure on is how easy it is for QEMU to deal with (3) > > without the overhead of bounce buffers. Ideally there'd already be a > > wrapper for guest memory accesses and that could just be wrapped with > > setting TCO during the access. I suspect the actual situation is more > > complex though, and I'm hoping Haibo's investigations will help us > > understand this. > > What I'd really like to see is a description of how shared memory > is, in general, supposed to work with MTE. My gut feeling is that > it doesn't, and that you need to turn MTE off when sharing memory > (either implicitly or explicitly). The allocation tag (in-memory tag) is a property assigned to a physical address range and it can be safely shared between different processes as long as they access it via pointers with the same allocation tag (bits 59:56). The kernel enables such tagged shared memory for user processes (anonymous, tmpfs, shmem). What we don't have in the architecture is a memory type which allows access to tags but no tag checking. To access the data when the tags aren't known, the tag checking would have to be disabled via either a prctl() or by setting the PSTATE.TCO bit. 
The kernel accesses user memory via the linear map using the match-all tag 0xf, so no TCO bit toggling is needed. For user space, however, we disabled this match-all tag and it cannot be enabled at run-time (at least not easily, as it's cached in the TLB). That said, we already have two modes to disable tag checking which QEMU could use when migrating data+tags.
* Steven Price (steven.price@arm.com) wrote: > On 07/12/2020 15:27, Peter Maydell wrote: > > On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote: > > > Sounds like you are making good progress - thanks for the update. Have > > > you thought about how the PROT_MTE mappings might work if QEMU itself > > > were to use MTE? My worry is that we end up with MTE in a guest > > > preventing QEMU from using MTE itself (because of the PROT_MTE > > > mappings). I'm hoping QEMU can wrap its use of guest memory in a > > > sequence which disables tag checking (something similar will be needed > > > for the "protected VM" use case anyway), but this isn't something I've > > > looked into. > > > > It's not entirely the same as the "protected VM" case. For that > > the patches currently on list basically special case "this is a > > debug access (eg from gdbstub/monitor)" which then either gets > > to go via "decrypt guest RAM for debug" or gets failed depending > > on whether the VM has a debug-is-ok flag enabled. For an MTE > > guest the common case will be guests doing standard DMA operations > > to or from guest memory. The ideal API for that from QEMU's > > point of view would be "accesses to guest RAM don't do tag > > checks, even if tag checks are enabled for accesses QEMU does to > > memory it has allocated itself as a normal userspace program". > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > Basically as I see it there are three types of memory access: > > 1) Debug case - has to go via a special case for decryption or ignoring the > MTE tag value. Hopefully this can be abstracted in the same way. > > 2) Migration - for a protected VM there's likely to be a special method to > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > inaccessible to the VMM). For MTE this again has to be special cased as we > actually want both the data and the tag values. 
> > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > memory (with the permission of the guest) and use that as a bounce buffer. > This is possible with MTE: have an area the VMM purposefully maps with > PROT_MTE. The issue is that this has a performance overhead and we can do > better with MTE because it's trivial for the VMM to disable the protection > for any memory. Those all sound very similar to the AMD SEV world; there's the special case for Debug that Peter mentioned; migration is ...complicated and needs special case that's still being figured out, and as I understand Device DMA also uses a bounce buffer (and swiotlb in the guest to make that happen). I'm not sure about the stories for the IBM hardware equivalents. Dave > The part I'm unsure on is how easy it is for QEMU to deal with (3) without > the overhead of bounce buffers. Ideally there'd already be a wrapper for > guest memory accesses and that could just be wrapped with setting TCO during > the access. I suspect the actual situation is more complex though, and I'm > hoping Haibo's investigations will help us understand this. > > Thanks, > > Steve >
On Mon, 7 Dec 2020 at 16:44, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > * Steven Price (steven.price@arm.com) wrote: > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > > Basically as I see it there are three types of memory access: > > > > 1) Debug case - has to go via a special case for decryption or ignoring the > > MTE tag value. Hopefully this can be abstracted in the same way. > > > > 2) Migration - for a protected VM there's likely to be a special method to > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > > inaccessible to the VMM). For MTE this again has to be special cased as we > > actually want both the data and the tag values. > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > > memory (with the permission of the guest) and use that as a bounce buffer. > > This is possible with MTE: have an area the VMM purposefully maps with > > PROT_MTE. The issue is that this has a performance overhead and we can do > > better with MTE because it's trivial for the VMM to disable the protection > > for any memory. > > Those all sound very similar to the AMD SEV world; there's the special > case for Debug that Peter mentioned; migration is ...complicated and > needs special case that's still being figured out, and as I understand > Device DMA also uses a bounce buffer (and swiotlb in the guest to make > that happen). Mmm, but for encrypted VMs the VMM has to jump through all these hoops because "don't let the VMM directly access arbitrary guest RAM" is the whole point of the feature. For MTE, we don't want in general to be doing tag-checked accesses to guest RAM and there is nothing in the feature "allow guests to use MTE" that requires that the VMM's guest RAM accesses must do tag-checking. So we should avoid having a design that requires us to jump through all the hoops. 
Even if it happens that handling encrypted VMs means that QEMU has to grow some infrastructure for carefully positioning hoops in appropriate places, we shouldn't use it unnecessarily... All we actually need is a mechanism for migrating the tags: I don't think there's ever a situation where you want tag-checking enabled for the VMM's accesses to the guest RAM. thanks -- PMM
* Peter Maydell (peter.maydell@linaro.org) wrote: > On Mon, 7 Dec 2020 at 16:44, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Steven Price (steven.price@arm.com) wrote: > > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > > > Basically as I see it there are three types of memory access: > > > > > > 1) Debug case - has to go via a special case for decryption or ignoring the > > > MTE tag value. Hopefully this can be abstracted in the same way. > > > > > > 2) Migration - for a protected VM there's likely to be a special method to > > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > > > inaccessible to the VMM). For MTE this again has to be special cased as we > > > actually want both the data and the tag values. > > > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > > > memory (with the permission of the guest) and use that as a bounce buffer. > > > This is possible with MTE: have an area the VMM purposefully maps with > > > PROT_MTE. The issue is that this has a performance overhead and we can do > > > better with MTE because it's trivial for the VMM to disable the protection > > > for any memory. > > > > Those all sound very similar to the AMD SEV world; there's the special > > case for Debug that Peter mentioned; migration is ...complicated and > > needs special case that's still being figured out, and as I understand > > Device DMA also uses a bounce buffer (and swiotlb in the guest to make > > that happen). > > Mmm, but for encrypted VMs the VM has to jump through all these > hoops because "don't let the VM directly access arbitrary guest RAM" > is the whole point of the feature. For MTE, we don't want in general > to be doing tag-checked accesses to guest RAM and there is nothing > in the feature "allow guests to use MTE" that requires that the VMM's > guest RAM accesses must do tag-checking. 
> So we should avoid having > a design that require us to jump through all the hoops. Yes agreed, that's a fair distinction. Dave > Even if > it happens that handling encrypted VMs means that QEMU has to grow > some infrastructure for carefully positioning hoops in appropriate > places, we shouldn't use it unnecessarily... All we actually need is > a mechanism for migrating the tags: I don't think there's ever a > situation where you want tag-checking enabled for the VMM's accesses > to the guest RAM. > > thanks > -- PMM >
On Mon, 07 Dec 2020 16:34:05 +0000, Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: > > > What I'd really like to see is a description of how shared memory > > > is, in general, supposed to work with MTE. My gut feeling is that > > > it doesn't, and that you need to turn MTE off when sharing memory > > > (either implicitly or explicitly). > > > > The allocation tag (in-memory tag) is a property assigned to a physical > > address range and it can be safely shared between different processes as > > long as they access it via pointers with the same allocation tag (bits > > 59:56). The kernel enables such tagged shared memory for user processes > > (anonymous, tmpfs, shmem). I think that's one case where the shared memory scheme breaks, as we have two kernels in charge of their own tags, and they obviously can't be synchronised. > > What we don't have in the architecture is a memory type which allows > > access to tags but no tag checking. To access the data when the tags > > aren't known, the tag checking would have to be disabled via either a > > prctl() or by setting the PSTATE.TCO bit. I guess that's point (3) in Steven's taxonomy. It's still a bit ugly to fit into an existing piece of userspace, especially if it wants to use MTE for its own benefit. > > The kernel accesses the user memory via the linear map using a match-all > > tag 0xf, so no TCO bit toggling. For user, however, we disabled such > > match-all tag and it cannot be enabled at run-time (at least not easily, > > it's cached in the TLB). However, we already have two modes to disable > > tag checking which Qemu could use when migrating data+tags. I wonder whether we will have to have something kernel side to dump/reload tags in a way that matches the patterns used by live migration. M.
On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: > > On 04/12/2020 08:25, Haibo Xu wrote: > > On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote: > >> > >> On 19/11/2020 19:11, Marc Zyngier wrote: > >>> On 2020-11-19 18:42, Andrew Jones wrote: > >>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote: > >>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote: > >>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to > >>>>>> KVM, allowing KVM guests to make use of it. This builds on the > >>>>> existing > >>>>>> user space support already in v5.10-rc1, see [1] for an overview. > >>>>> > >>>>>> The change to require the VMM to map all guest memory PROT_MTE is > >>>>>> significant as it means that the VMM has to deal with the MTE tags > >>>>> even > >>>>>> if it doesn't care about them (e.g. for virtual devices or if the VMM > >>>>>> doesn't support migration). Also unfortunately because the VMM can > >>>>>> change the memory layout at any time the check for PROT_MTE/VM_MTE has > >>>>>> to be done very late (at the point of faulting pages into stage 2). > >>>>> > >>>>> I'm a bit dubious about requring the VMM to map the guest memory > >>>>> PROT_MTE unless somebody's done at least a sketch of the design > >>>>> for how this would work on the QEMU side. Currently QEMU just > >>>>> assumes the guest memory is guest memory and it can access it > >>>>> without special precautions... > >>>>> > >>>> > >>>> There are two statements being made here: > >>>> > >>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit > >>>> QEMU well. > >>>> > >>>> 2) New KVM features should be accompanied with supporting QEMU code in > >>>> order to prove that the APIs make sense. > >>>> > >>>> I strongly agree with (2). While kvmtool supports some quick testing, it > >>>> doesn't support migration. We must test all new features with a migration > >>>> supporting VMM. 
> >>>> > >>>> I'm not sure about (1). I don't feel like it should be a major problem, > >>>> but (2). > >> > >> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't > >> required in which case it's easy to accidentally screw up migration, or > >> it is required in which case it's difficult to handle normal guest > >> memory from the VMM. I get the impression that probably I should go back > >> to the previous approach - sorry for the distraction with this change. > >> > >> (2) isn't something I'm trying to skip, but I'm limited in what I can do > >> myself so would appreciate help here. Haibo is looking into this. > >> > > > > Hi Steven, > > > > Sorry for the later reply! > > > > I have finished the POC for the MTE migration support with the assumption > > that all the memory is mapped with PROT_MTE. But I got stuck in the test > > with a FVP setup. Previously, I successfully compiled a test case to verify > > the basic function of MTE in a guest. But these days, the re-compiled test > > can't be executed by the guest(very weird). The short plan to verify > > the migration > > is to set the MTE tags on one page in the guest, and try to dump the migrated > > memory contents. > > Hi Haibo, > > Sounds like you are making good progress - thanks for the update. Have > you thought about how the PROT_MTE mappings might work if QEMU itself > were to use MTE? My worry is that we end up with MTE in a guest > preventing QEMU from using MTE itself (because of the PROT_MTE > mappings). I'm hoping QEMU can wrap its use of guest memory in a > sequence which disables tag checking (something similar will be needed > for the "protected VM" use case anyway), but this isn't something I've > looked into. As far as I can see, to map all the guest memory with PROT_MTE in VMM is a little weird, and lots of APIs have to be changed to include this flag. 
IMHO, it would be better if KVM could provide new APIs to load/store the guest memory tags, which may make it easier to enable QEMU migration support. > > > I will update the status later next week! > > Great, I look forward to hearing how it goes. > > Thanks, > > Steve
On 2020-12-08 09:51, Haibo Xu wrote: > On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: >> [...] >> Sounds like you are making good progress - thanks for the update. Have >> you thought about how the PROT_MTE mappings might work if QEMU itself >> were to use MTE? My worry is that we end up with MTE in a guest >> preventing QEMU from using MTE itself (because of the PROT_MTE >> mappings). I'm hoping QEMU can wrap its use of guest memory in a >> sequence which disables tag checking (something similar will be needed >> for the "protected VM" use case anyway), but this isn't something I've >> looked into. > > As far as I can see, to map all the guest memory with PROT_MTE in VMM > is a little weird, and lots of APIs have to be changed to include this > flag. > IMHO, it would be better if the KVM can provide new APIs to load/store > the > guest memory tag which may make it easier to enable the Qemu migration > support. On what granularity? To what storage? How do you plan to synchronise this with the dirty-log interface? Thanks, M.
On Tue, 8 Dec 2020 at 00:44, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Steven Price (steven.price@arm.com) wrote: > > On 07/12/2020 15:27, Peter Maydell wrote: > > > On Mon, 7 Dec 2020 at 14:48, Steven Price <steven.price@arm.com> wrote: > > > > Sounds like you are making good progress - thanks for the update. Have > > > > you thought about how the PROT_MTE mappings might work if QEMU itself > > > > were to use MTE? My worry is that we end up with MTE in a guest > > > > preventing QEMU from using MTE itself (because of the PROT_MTE > > > > mappings). I'm hoping QEMU can wrap its use of guest memory in a > > > > sequence which disables tag checking (something similar will be needed > > > > for the "protected VM" use case anyway), but this isn't something I've > > > > looked into. > > > > > > It's not entirely the same as the "protected VM" case. For that > > > the patches currently on list basically special case "this is a > > > debug access (eg from gdbstub/monitor)" which then either gets > > > to go via "decrypt guest RAM for debug" or gets failed depending > > > on whether the VM has a debug-is-ok flag enabled. For an MTE > > > guest the common case will be guests doing standard DMA operations > > > to or from guest memory. The ideal API for that from QEMU's > > > point of view would be "accesses to guest RAM don't do tag > > > checks, even if tag checks are enabled for accesses QEMU does to > > > memory it has allocated itself as a normal userspace program". > > > > Sorry, I know I simplified it rather by saying it's similar to protected VM. > > Basically as I see it there are three types of memory access: > > > > 1) Debug case - has to go via a special case for decryption or ignoring the > > MTE tag value. Hopefully this can be abstracted in the same way. > > > > 2) Migration - for a protected VM there's likely to be a special method to > > allow the VMM access to the encrypted memory (AFAIK memory is usually kept > > inaccessible to the VMM). 
For MTE this again has to be special cased as we > > actually want both the data and the tag values. > > > > 3) Device DMA - for a protected VM it's usual to unencrypt a small area of > > memory (with the permission of the guest) and use that as a bounce buffer. > > This is possible with MTE: have an area the VMM purposefully maps with > > PROT_MTE. The issue is that this has a performance overhead and we can do > > better with MTE because it's trivial for the VMM to disable the protection > > for any memory. > > Those all sound very similar to the AMD SEV world; there's the special > case for Debug that Peter mentioned; migration is ...complicated and > needs special case that's still being figured out, and as I understand > Device DMA also uses a bounce buffer (and swiotlb in the guest to make > that happen). > > > I'm not sure about the stories for the IBM hardware equivalents. Like the s390-skeys (storage keys) support in QEMU? I have read the s390-skeys migration support in QEMU and found that the logic is very similar to that of MTE, with the difference that the s390-skeys are migrated separately from the guest memory data, while for MTE I think the guest memory tags should go with the memory data. > > Dave > > > The part I'm unsure on is how easy it is for QEMU to deal with (3) without > > the overhead of bounce buffers. Ideally there'd already be a wrapper for > > guest memory accesses and that could just be wrapped with setting TCO during > > the access. I suspect the actual situation is more complex though, and I'm > > hoping Haibo's investigations will help us understand this. > > > > Thanks, > > > > Steve > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >
On Tue, 8 Dec 2020 at 18:01, Marc Zyngier <maz@kernel.org> wrote: > > On 2020-12-08 09:51, Haibo Xu wrote: > > On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: > >> > > [...] > > >> Sounds like you are making good progress - thanks for the update. Have > >> you thought about how the PROT_MTE mappings might work if QEMU itself > >> were to use MTE? My worry is that we end up with MTE in a guest > >> preventing QEMU from using MTE itself (because of the PROT_MTE > >> mappings). I'm hoping QEMU can wrap its use of guest memory in a > >> sequence which disables tag checking (something similar will be needed > >> for the "protected VM" use case anyway), but this isn't something I've > >> looked into. > > > > As far as I can see, to map all the guest memory with PROT_MTE in VMM > > is a little weird, and lots of APIs have to be changed to include this > > flag. > > IMHO, it would be better if the KVM can provide new APIs to load/store > > the > > guest memory tag which may make it easier to enable the Qemu migration > > support. > > On what granularity? To what storage? How do you plan to synchronise > this > with the dirty-log interface? QEMU migrates memory page by page; if a page that has already been migrated becomes dirty again, the migration process re-sends it. The current MTE migration POC code sends the page tags just after the page data, so if a page becomes dirty again, both the data and the tags are re-sent. > > Thanks, > > M. > -- > Jazz is not dead. It just smells funny...
On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: > On Mon, 07 Dec 2020 16:34:05 +0000, > Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: > > > What I'd really like to see is a description of how shared memory > > > is, in general, supposed to work with MTE. My gut feeling is that > > > it doesn't, and that you need to turn MTE off when sharing memory > > > (either implicitly or explicitly). > > > > The allocation tag (in-memory tag) is a property assigned to a physical > > address range and it can be safely shared between different processes as > > long as they access it via pointers with the same allocation tag (bits > > 59:56). The kernel enables such tagged shared memory for user processes > > (anonymous, tmpfs, shmem). > > I think that's one case where the shared memory scheme breaks, as we > have two kernels in charge of their own tags, and they obviously can't > be synchronised Yes, if you can't trust the other entity to not change the tags, the only option is to do an untagged access. > > What we don't have in the architecture is a memory type which allows > > access to tags but no tag checking. To access the data when the tags > > aren't known, the tag checking would have to be disabled via either a > > prctl() or by setting the PSTATE.TCO bit. > > I guess that's point (3) in Steven's taxonomy. It still a bit ugly to > fit in an existing piece of userspace, specially if it wants to use > MTE for its own benefit. I agree it's ugly. For the device DMA emulation case, the only sane way is to mimic what a real device does - no tag checking. For a generic implementation, this means that such shared memory should not be mapped with PROT_MTE on the VMM side. I guess this leads to your point that sharing doesn't work for this scenario ;). > > The kernel accesses the user memory via the linear map using a match-all > > tag 0xf, so no TCO bit toggling. 
For user, however, we disabled such > > match-all tag and it cannot be enabled at run-time (at least not easily, > > it's cached in the TLB). However, we already have two modes to disable > > tag checking which Qemu could use when migrating data+tags. > > I wonder whether we will have to have something kernel side to > dump/reload tags in a way that matches the patterns used by live > migration. We have something related - ptrace dumps/restores the tags. Can the same concept be expanded to a KVM ioctl?
On 2020-12-08 17:21, Catalin Marinas wrote: > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: >> On Mon, 07 Dec 2020 16:34:05 +0000, >> Catalin Marinas <catalin.marinas@arm.com> wrote: >> > On Mon, Dec 07, 2020 at 04:05:55PM +0000, Marc Zyngier wrote: >> > > What I'd really like to see is a description of how shared memory >> > > is, in general, supposed to work with MTE. My gut feeling is that >> > > it doesn't, and that you need to turn MTE off when sharing memory >> > > (either implicitly or explicitly). >> > >> > The allocation tag (in-memory tag) is a property assigned to a physical >> > address range and it can be safely shared between different processes as >> > long as they access it via pointers with the same allocation tag (bits >> > 59:56). The kernel enables such tagged shared memory for user processes >> > (anonymous, tmpfs, shmem). >> >> I think that's one case where the shared memory scheme breaks, as we >> have two kernels in charge of their own tags, and they obviously can't >> be synchronised > > Yes, if you can't trust the other entity to not change the tags, the > only option is to do an untagged access. > >> > What we don't have in the architecture is a memory type which allows >> > access to tags but no tag checking. To access the data when the tags >> > aren't known, the tag checking would have to be disabled via either a >> > prctl() or by setting the PSTATE.TCO bit. >> >> I guess that's point (3) in Steven's taxonomy. It still a bit ugly to >> fit in an existing piece of userspace, specially if it wants to use >> MTE for its own benefit. > > I agree it's ugly. For the device DMA emulation case, the only sane way > is to mimic what a real device does - no tag checking. For a generic > implementation, this means that such shared memory should not be mapped > with PROT_MTE on the VMM side. I guess this leads to your point that > sharing doesn't work for this scenario ;). 
Exactly ;-) >> > The kernel accesses the user memory via the linear map using a match-all >> > tag 0xf, so no TCO bit toggling. For user, however, we disabled such >> > match-all tag and it cannot be enabled at run-time (at least not easily, >> > it's cached in the TLB). However, we already have two modes to disable >> > tag checking which Qemu could use when migrating data+tags. >> >> I wonder whether we will have to have something kernel side to >> dump/reload tags in a way that matches the patterns used by live >> migration. > > We have something related - ptrace dumps/resores the tags. Can the same > concept be expanded to a KVM ioctl? Yes, although I wonder whether we should integrate this deeply into the dirty-log mechanism: it would be really interesting to dump the tags at the point where the page is flagged as clean from a dirty-log point of view. As the page is dirtied, discard the saved tags. It is probably expensive, but it ensures that the VMM sees consistent tags (if the page is clean, the tags are valid). Of course, it comes with the added requirement that the VMM allocates enough memory to store the tags, which may be a tall order. I'm not sure how to give a consistent view to userspace otherwise. It'd be worth looking at how much we can reuse from the ptrace (and I expect swap?) code to implement this. Thanks, M.
On Tue, Dec 08, 2020 at 06:21:12PM +0000, Marc Zyngier wrote: > On 2020-12-08 17:21, Catalin Marinas wrote: > > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: > > > I wonder whether we will have to have something kernel side to > > > dump/reload tags in a way that matches the patterns used by live > > > migration. > > > > We have something related - ptrace dumps/resores the tags. Can the same > > concept be expanded to a KVM ioctl? > > Yes, although I wonder whether we should integrate this deeply into > the dirty-log mechanism: it would be really interesting to dump the > tags at the point where the page is flagged as clean from a dirty-log > point of view. As the page is dirtied, discard the saved tags. From the VMM perspective, the tags can be treated just like additional (meta)data in a page. We'd only need the tags when copying over. It can race with the VM dirtying the page (writing tags would dirty it) but I don't think the current migration code cares about this. If dirtied, it copies it again. The only downside I see is an extra syscall per page both on the origin VMM and the destination one to dump/restore the tags. Is this a performance issue?
On 2020-12-09 12:44, Catalin Marinas wrote: > On Tue, Dec 08, 2020 at 06:21:12PM +0000, Marc Zyngier wrote: >> On 2020-12-08 17:21, Catalin Marinas wrote: >> > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: >> > > I wonder whether we will have to have something kernel side to >> > > dump/reload tags in a way that matches the patterns used by live >> > > migration. >> > >> > We have something related - ptrace dumps/resores the tags. Can the same >> > concept be expanded to a KVM ioctl? >> >> Yes, although I wonder whether we should integrate this deeply into >> the dirty-log mechanism: it would be really interesting to dump the >> tags at the point where the page is flagged as clean from a dirty-log >> point of view. As the page is dirtied, discard the saved tags. > > From the VMM perspective, the tags can be treated just like additional > (meta)data in a page. We'd only need the tags when copying over. It can > race with the VM dirtying the page (writing tags would dirty it) but I > don't think the current migration code cares about this. If dirtied, it > copies it again. > > The only downside I see is an extra syscall per page both on the origin > VMM and the destination one to dump/restore the tags. Is this a > performance issue? I'm not sure. Migrating VMs already has a massive overhead, so an extra syscall per page isn't terrifying. But that's the point where I admit not knowing enough about what the VMM expects, nor whether that matches what happens on other architectures that deal with per-page metadata. Would this syscall operate on the guest address space? Or on the VMM's own mapping? M.
On Wed, Dec 09, 2020 at 01:25:18PM +0000, Marc Zyngier wrote: > On 2020-12-09 12:44, Catalin Marinas wrote: > > On Tue, Dec 08, 2020 at 06:21:12PM +0000, Marc Zyngier wrote: > > > On 2020-12-08 17:21, Catalin Marinas wrote: > > > > On Mon, Dec 07, 2020 at 07:03:13PM +0000, Marc Zyngier wrote: > > > > > I wonder whether we will have to have something kernel side to > > > > > dump/reload tags in a way that matches the patterns used by live > > > > > migration. > > > > > > > > We have something related - ptrace dumps/resores the tags. Can the same > > > > concept be expanded to a KVM ioctl? > > > > > > Yes, although I wonder whether we should integrate this deeply into > > > the dirty-log mechanism: it would be really interesting to dump the > > > tags at the point where the page is flagged as clean from a dirty-log > > > point of view. As the page is dirtied, discard the saved tags. > > > > From the VMM perspective, the tags can be treated just like additional > > (meta)data in a page. We'd only need the tags when copying over. It can > > race with the VM dirtying the page (writing tags would dirty it) but I > > don't think the current migration code cares about this. If dirtied, it > > copies it again. > > > > The only downside I see is an extra syscall per page both on the origin > > VMM and the destination one to dump/restore the tags. Is this a > > performance issue? > > I'm not sure. Migrating VMs already has a massive overhead, so an extra > syscall per page isn't terrifying. But that's the point where I admit > not knowing enough about what the VMM expects, nor whether that matches > what happens on other architectures that deal with per-page metadata. > > Would this syscall operate on the guest address space? Or on the VMM's > own mapping? Whatever is easier for the VMM, I don't think it matters as long as the host kernel can get the actual physical address (and linear map correspondent). 
Maybe simpler if it's the VMM address space as the kernel can check the access permissions in case you want to hide the guest memory from the VMM for other reasons (migration is also off the table). Without syscalls, an option would be for the VMM to create two mappings: one with PROT_MTE for migration and the other without for normal DMA etc. That's achievable using memfd_create() or shm_open() and two mmap() calls, only one having PROT_MTE. The VMM address space should be sufficiently large to map two guest IPAs.
On 12/9/20 9:27 AM, Catalin Marinas wrote: > On Wed, Dec 09, 2020 at 01:25:18PM +0000, Marc Zyngier wrote: >> Would this syscall operate on the guest address space? Or on the VMM's >> own mapping? ... > Whatever is easier for the VMM, I don't think it matters as long as the > host kernel can get the actual physical address (and linear map > correspondent). Maybe simpler if it's the VMM address space as the > kernel can check the access permissions in case you want to hide the > guest memory from the VMM for other reasons (migration is also off the > table). Indeed, such a syscall is no longer specific to VMMs and may be used for any bulk move of tags that userland might want. > Without syscalls, an option would be for the VMM to create two mappings: > one with PROT_MTE for migration and the other without for normal DMA > etc. That's achievable using memfd_create() or shm_open() and two mmap() > calls, only one having PROT_MTE. The VMM address space should be > sufficiently large to map two guest IPAs. I would have thought that the best way is to use TCO, so that we don't have to have dual mappings (and however many MB of extra page tables that might imply). r~
On Wed, Dec 09, 2020 at 12:27:59PM -0600, Richard Henderson wrote: > On 12/9/20 9:27 AM, Catalin Marinas wrote: > > On Wed, Dec 09, 2020 at 01:25:18PM +0000, Marc Zyngier wrote: > >> Would this syscall operate on the guest address space? Or on the VMM's > >> own mapping? > ... > > Whatever is easier for the VMM, I don't think it matters as long as the > > host kernel can get the actual physical address (and linear map > > correspondent). Maybe simpler if it's the VMM address space as the > > kernel can check the access permissions in case you want to hide the > > guest memory from the VMM for other reasons (migration is also off the > > table). > > Indeed, such a syscall is no longer specific to vmm's and may be used for any > bulk move of tags that userland might want. For CRIU, I think the current ptrace interface would do. With VMMs, the same remote VM model doesn't apply (the "remote" VM is actually the guest memory). I'd keep this under a KVM ioctl() number rather than a new, specific syscall. > > Without syscalls, an option would be for the VMM to create two mappings: > > one with PROT_MTE for migration and the other without for normal DMA > > etc. That's achievable using memfd_create() or shm_open() and two mmap() > > calls, only one having PROT_MTE. The VMM address space should be > > sufficiently large to map two guest IPAs. > > I would have thought that the best way is to use TCO, so that we don't have to > have dual mappings (and however many MB of extra page tables that might imply). The problem appears when the VMM wants to use MTE itself (e.g. linked against an MTE-aware glibc), toggling TCO is no longer generic enough, especially when it comes to device emulation.
On 12/9/20 12:39 PM, Catalin Marinas wrote: >> I would have thought that the best way is to use TCO, so that we don't have to >> have dual mappings (and however many MB of extra page tables that might imply). > > The problem appears when the VMM wants to use MTE itself (e.g. linked > against an MTE-aware glibc), toggling TCO is no longer generic enough, > especially when it comes to device emulation. But we do know exactly when we're manipulating guest memory -- we have special routines for that. So the special routines gain a toggle of TCO around the exact guest memory manipulation, not a blanket disable of MTE across large swaths of QEMU. r~
On Wed, 9 Dec 2020 at 20:13, Richard Henderson <richard.henderson@linaro.org> wrote: > > On 12/9/20 12:39 PM, Catalin Marinas wrote: > >> I would have thought that the best way is to use TCO, so that we don't have to > >> have dual mappings (and however many MB of extra page tables that might imply). > > > > The problem appears when the VMM wants to use MTE itself (e.g. linked > > against an MTE-aware glibc), toggling TCO is no longer generic enough, > > especially when it comes to device emulation. > > But we do know exactly when we're manipulating guest memory -- we have special > routines for that. Well, yes and no. It's not like every access to guest memory is through a specific set of "load from guest"/"store from guest" functions, and in some situations there's a "get a pointer to guest RAM, keep using it over a long-ish sequence of QEMU code, then be done with it" pattern. It's because it's not that trivial to isolate when something is accessing guest RAM that I don't want to just have it be mapped PROT_MTE into QEMU. I think we'd end up spending a lot of time hunting down "whoops, turns out this is accessing guest RAM and sometimes it trips over the tags in a hard-to-debug way" bugs. I'd much rather the kernel just provided us with an API for what we want, which is (1) the guest RAM as just RAM with no tag checking and separately (2) some mechanism yet-to-be-designed which lets us bulk copy a page's worth of tags for migration. thanks -- PMM
On Mon, 7 Dec 2020 at 22:48, Steven Price <steven.price@arm.com> wrote: > > On 04/12/2020 08:25, Haibo Xu wrote: > > On Fri, 20 Nov 2020 at 17:51, Steven Price <steven.price@arm.com> wrote: > >> > >> On 19/11/2020 19:11, Marc Zyngier wrote: > >>> On 2020-11-19 18:42, Andrew Jones wrote: > >>>> On Thu, Nov 19, 2020 at 03:45:40PM +0000, Peter Maydell wrote: > >>>>> On Thu, 19 Nov 2020 at 15:39, Steven Price <steven.price@arm.com> wrote: > >>>>>> This series adds support for Arm's Memory Tagging Extension (MTE) to > >>>>>> KVM, allowing KVM guests to make use of it. This builds on the > >>>>> existing > >>>>>> user space support already in v5.10-rc1, see [1] for an overview. > >>>>> > >>>>>> The change to require the VMM to map all guest memory PROT_MTE is > >>>>>> significant as it means that the VMM has to deal with the MTE tags > >>>>> even > >>>>>> if it doesn't care about them (e.g. for virtual devices or if the VMM > >>>>>> doesn't support migration). Also unfortunately because the VMM can > >>>>>> change the memory layout at any time the check for PROT_MTE/VM_MTE has > >>>>>> to be done very late (at the point of faulting pages into stage 2). > >>>>> > >>>>> I'm a bit dubious about requring the VMM to map the guest memory > >>>>> PROT_MTE unless somebody's done at least a sketch of the design > >>>>> for how this would work on the QEMU side. Currently QEMU just > >>>>> assumes the guest memory is guest memory and it can access it > >>>>> without special precautions... > >>>>> > >>>> > >>>> There are two statements being made here: > >>>> > >>>> 1) Requiring the use of PROT_MTE when mapping guest memory may not fit > >>>> QEMU well. > >>>> > >>>> 2) New KVM features should be accompanied with supporting QEMU code in > >>>> order to prove that the APIs make sense. > >>>> > >>>> I strongly agree with (2). While kvmtool supports some quick testing, it > >>>> doesn't support migration. We must test all new features with a migration > >>>> supporting VMM. 
> >>>> > >>>> I'm not sure about (1). I don't feel like it should be a major problem, > >>>> but (2). > >> > >> (1) seems to be contentious whichever way we go. Either PROT_MTE isn't > >> required in which case it's easy to accidentally screw up migration, or > >> it is required in which case it's difficult to handle normal guest > >> memory from the VMM. I get the impression that probably I should go back > >> to the previous approach - sorry for the distraction with this change. > >> > >> (2) isn't something I'm trying to skip, but I'm limited in what I can do > >> myself so would appreciate help here. Haibo is looking into this. > >> > > > > Hi Steven, > > > > Sorry for the later reply! > > > > I have finished the POC for the MTE migration support with the assumption > > that all the memory is mapped with PROT_MTE. But I got stuck in the test > > with a FVP setup. Previously, I successfully compiled a test case to verify > > the basic function of MTE in a guest. But these days, the re-compiled test > > can't be executed by the guest(very weird). The short plan to verify > > the migration > > is to set the MTE tags on one page in the guest, and try to dump the migrated > > memory contents. > > Hi Haibo, > > Sounds like you are making good progress - thanks for the update. Have > you thought about how the PROT_MTE mappings might work if QEMU itself > were to use MTE? My worry is that we end up with MTE in a guest > preventing QEMU from using MTE itself (because of the PROT_MTE > mappings). I'm hoping QEMU can wrap its use of guest memory in a > sequence which disables tag checking (something similar will be needed > for the "protected VM" use case anyway), but this isn't something I've > looked into. > > > I will update the status later next week! > > Great, I look forward to hearing how it goes. Hi Steve, I have finished verifying the POC on a FVP setup, and the MTE test case can be migrated from one VM to another successfully. 
Since the test case is very simple (it just maps one page with MTE enabled and does some memory access), I can't say it's OK for other cases. BTW, I noticed that you have sent out patch set v6, which mentions that mapping all the guest memory with PROT_MTE was not feasible. So what's the plan for the next step? Will new KVM APIs that facilitate storing and restoring the tags be available? Regards, Haibo > > Thanks, > > Steve
On 16/12/2020 07:31, Haibo Xu wrote: [...] > Hi Steve, Hi Haibo > I have finished verifying the POC on a FVP setup, and the MTE test case can > be migrated from one VM to another successfully. Since the test case is very > simple which just maps one page with MTE enabled and does some memory > access, so I can't say it's OK for other cases. That's great progress. > > BTW, I noticed that you have sent out patch set v6 which mentions that mapping > all the guest memory with PROT_MTE was not feasible. So what's the plan for the > next step? Will new KVM APIs which can facilitate the tag store and recover be > available? I'm currently rebasing on top of the KASAN MTE patch series. My plan for now is to switch back to not requiring the VMM to supply PROT_MTE (so KVM 'upgrades' the pages as necessary) and I'll add an RFC patch at the end of the series to add a KVM API for doing bulk read/write of tags. That way the VMM can map guest memory without PROT_MTE (so device 'DMA' accesses will be unchecked), and use the new API for migration. Thanks, Steve
On Wed, 16 Dec 2020 at 18:23, Steven Price <steven.price@arm.com> wrote: > > On 16/12/2020 07:31, Haibo Xu wrote: > [...] > > Hi Steve, > > Hi Haibo > > > I have finished verifying the POC on a FVP setup, and the MTE test case can > > be migrated from one VM to another successfully. Since the test case is very > > simple which just maps one page with MTE enabled and does some memory > > access, so I can't say it's OK for other cases. > > That's great progress. > > > > > BTW, I noticed that you have sent out patch set v6 which mentions that mapping > > all the guest memory with PROT_MTE was not feasible. So what's the plan for the > > next step? Will new KVM APIs which can facilitate the tag store and recover be > > available? > > I'm currently rebasing on top of the KASAN MTE patch series. My plan for > now is to switch back to not requiring the VMM to supply PROT_MTE (so > KVM 'upgrades' the pages as necessary) and I'll add an RFC patch on the > end of the series to add an KVM API for doing bulk read/write of tags. > That way the VMM can map guest memory without PROT_MTE (so device 'DMA' > accesses will be unchecked), and use the new API for migration. > Great! Will have a try with the new API in my POC! > Thanks, > > Steve