Message ID | 20240222-gunyah-v17-19-1e9da6763d38@quicinc.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Drivers for Gunyah hypervisor | expand |
On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > Firmware and hypervisor drivers can donate system heap memory to their > respective firmware/hypervisor entities. Those drivers should unmap the > pages from the kernel's logical map before doing so. > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > set_direct_map_default_noflush. Err, not they should not. And not using such super low-level interfaces from modular code.
On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote: > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > > Firmware and hypervisor drivers can donate system heap memory to their > > respective firmware/hypervisor entities. Those drivers should unmap the > > pages from the kernel's logical map before doing so. > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > > set_direct_map_default_noflush. > > Err, not they should not. And not using such super low-level interfaces > from modular code. Hi Cristoph, We've observed a few times that Linux can unintentionally access a page we've unmapped from host's stage 2 page table via an unaligned load from an adjacent page. The stage 2 is managed by Gunyah. There are few scenarios where even though we allocate and own a page from buddy, someone else could try to access the page without going through the hypervisor driver. One such instance we know about is load_unaligned_zeropad() via pathlookup_at() [1]. load_unaligned_zeropad() could be called near the end of a page. If the next page isn't mapped by the kernel in the stage one page tables, then the access from to the unmapped page from load_unaligned_zeropad() will land in __do_kernel_fault(), call fixup_exception(), and fill the remainder of the load with zeroes. If the page in question is mapped in stage 1 but was unmapped from stage 2, then the access lands back in Linux in do_sea(), leading to a panic(). Our preference would be to add fixup_exception() to S2 PTW errors for two reasons: 1. It's cheaper to do performance wise: we've already manipulated S2 page table and prevent intentional access to the page because pKVM/Gunyah drivers know that access to the page has been lost. 2. Page-granular S1 mappings only happen on arm64 with rodata=full. In an off-list discussion with the Android pkvm folks, their preference was to have the pages unmapped from stage 1. I've gone with that approach to get started but welcome discussion on the best approach. The Android (downstream) implementation of arm64 pkvm is currently implementing a hack where s2 ptw faults are given back to the host as s1 ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) -- allowing the kernel to fixup the exception. arm64 pKVM will also face this issue when implementing guest_memfd or when donating more memory to the hyp for s2 page tables, etc. As far as I can tell, this isn't an issue for arm64 pKVM today because memory isn't being dynamically donated to the hypervisor. Thanks, Elliot [1]: path_lookupat+0x340/0x3228 filename_lookup+0xbc/0x1c0 __arm64_sys_newfstatat+0xb0/0x4a0 invoke_syscall+0x58/0x118
The point is that we can't we just allow modules to unmap data from the kernel mapping, no matter how noble your intentions are.
On 26.02.24 12:06, Christoph Hellwig wrote: > The point is that we can't we just allow modules to unmap data from > the kernel mapping, no matter how noble your intentions are. I absolutely agree.
On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote: > On 26.02.24 12:06, Christoph Hellwig wrote: > > The point is that we can't we just allow modules to unmap data from > > the kernel mapping, no matter how noble your intentions are. > > I absolutely agree. > Hi David and Chirstoph, Are your preferences that we should make Gunyah builtin only or should add fixing up S2 PTW errors (or something else)? Also, do you extend that preference to modifying S2 mappings? This would require any hypervisor driver that supports confidential compute usecases to only ever be builtin. Is your concern about unmapping data from kernel mapping, then module being unloaded, and then having no way to recover the mapping? Would a permanent module be better? The primary reason we were wanting to have it as module was to avoid having driver in memory if you're not a Gunyah guest. Thanks, Elliot
On 26.02.24 18:27, Elliot Berman wrote: > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote: >> On 26.02.24 12:06, Christoph Hellwig wrote: >>> The point is that we can't we just allow modules to unmap data from >>> the kernel mapping, no matter how noble your intentions are. >> >> I absolutely agree. >> > > Hi David and Chirstoph, > > Are your preferences that we should make Gunyah builtin only or should add > fixing up S2 PTW errors (or something else)? Having that built into the kernel certainly does sound better than exposing that functionality to arbitrary OOT modules. But still, this feels like it is using a "too-low-level" interface. > > Also, do you extend that preference to modifying S2 mappings? This would > require any hypervisor driver that supports confidential compute > usecases to only ever be builtin. > > Is your concern about unmapping data from kernel mapping, then module > being unloaded, and then having no way to recover the mapping? Would a > permanent module be better? The primary reason we were wanting to have > it as module was to avoid having driver in memory if you're not a Gunyah > guest. What I didn't grasp from this patch description: is the area where a driver would unmap/remap that memory somehow known ahead of time and limited? How would the driver obtain that memory it would try to unmap/remap the direct map of? Simply allocate some pages and then unmap the direct map? For example, we do have mm/secretmem.c, where we unmap the directmap on allocation and remap when freeing a page. A nice abstraction on alloc/free, so one cannot really do a lot of harm. Further, we enlightened the remainder of the system about secretmem, such that we can detect that the directmap is no longer there. As one example, see the secretmem_active() check in kernel/power/hibernate.c. A similar abstraction would make sense (I remember a discussion about having secretmem functionality in guest_memfd, would that help?), but the question is "which" memory you want to unmap the direct map of, and how the driver became "owner" of that memory such that it would really be allowed to mess with the directmap.
On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote: > On 26.02.24 18:27, Elliot Berman wrote: > > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote: > > > On 26.02.24 12:06, Christoph Hellwig wrote: > > > > The point is that we can't we just allow modules to unmap data from > > > > the kernel mapping, no matter how noble your intentions are. > > > > > > I absolutely agree. > > > > > > > Hi David and Chirstoph, > > > > Are your preferences that we should make Gunyah builtin only or should add > > fixing up S2 PTW errors (or something else)? > > Having that built into the kernel certainly does sound better than exposing > that functionality to arbitrary OOT modules. But still, this feels like it > is using a "too-low-level" interface. > What are your thoughts about fixing up the stage-2 fault instead? I think this gives mmu-based isolation a slight speed boost because we avoid modifying kernel mapping. The hypervisor driver (KVM or Gunyah) knows that the page isn't mapped. Whether we get S2 or S1 fault, the kernel is likely going to crash, except in the rare case where we want to fix the exception. In that case, we can modify the S2 fault handler to call fixup_exception() when appropriate. > > > > Also, do you extend that preference to modifying S2 mappings? This would > > require any hypervisor driver that supports confidential compute > > usecases to only ever be builtin. > > > > Is your concern about unmapping data from kernel mapping, then module > > being unloaded, and then having no way to recover the mapping? Would a > > permanent module be better? The primary reason we were wanting to have > > it as module was to avoid having driver in memory if you're not a Gunyah > > guest. > > What I didn't grasp from this patch description: is the area where a driver > would unmap/remap that memory somehow known ahead of time and limited? > > How would the driver obtain that memory it would try to unmap/remap the > direct map of? Simply allocate some pages and then unmap the direct map? That's correct. > > For example, we do have mm/secretmem.c, where we unmap the directmap on > allocation and remap when freeing a page. A nice abstraction on alloc/free, > so one cannot really do a lot of harm. > > Further, we enlightened the remainder of the system about secretmem, such > that we can detect that the directmap is no longer there. As one example, > see the secretmem_active() check in kernel/power/hibernate.c. > I'll take a look at this. guest_memfd might be able to use PM notifiers here instead, but I'll dig in the archives to see why secretmem isn't using that. > A similar abstraction would make sense (I remember a discussion about having > secretmem functionality in guest_memfd, would that help?), but the question > is "which" memory you want to unmap the direct map of, and how the driver > became "owner" of that memory such that it would really be allowed to mess > with the directmap.
On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote: > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote: > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > > > Firmware and hypervisor drivers can donate system heap memory to their > > > respective firmware/hypervisor entities. Those drivers should unmap the > > > pages from the kernel's logical map before doing so. > > > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > > > set_direct_map_default_noflush. > > > > Err, not they should not. And not using such super low-level interfaces > > from modular code. > > Hi Cristoph, > > We've observed a few times that Linux can unintentionally access a page > we've unmapped from host's stage 2 page table via an unaligned load from > an adjacent page. The stage 2 is managed by Gunyah. There are few > scenarios where even though we allocate and own a page from buddy, > someone else could try to access the page without going through the > hypervisor driver. One such instance we know about is > load_unaligned_zeropad() via pathlookup_at() [1]. > > load_unaligned_zeropad() could be called near the end of a page. If the > next page isn't mapped by the kernel in the stage one page tables, then > the access from to the unmapped page from load_unaligned_zeropad() will > land in __do_kernel_fault(), call fixup_exception(), and fill the > remainder of the load with zeroes. If the page in question is mapped in > stage 1 but was unmapped from stage 2, then the access lands back in > Linux in do_sea(), leading to a panic(). > > Our preference would be to add fixup_exception() to S2 PTW errors for > two reasons: > 1. It's cheaper to do performance wise: we've already manipulated S2 > page table and prevent intentional access to the page because > pKVM/Gunyah drivers know that access to the page has been lost. > 2. Page-granular S1 mappings only happen on arm64 with rodata=full. > > In an off-list discussion with the Android pkvm folks, their preference > was to have the pages unmapped from stage 1. I've gone with that > approach to get started but welcome discussion on the best approach. > > The Android (downstream) implementation of arm64 pkvm is currently > implementing a hack where s2 ptw faults are given back to the host as s1 > ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) -- > allowing the kernel to fixup the exception. > > arm64 pKVM will also face this issue when implementing guest_memfd or > when donating more memory to the hyp for s2 page tables, etc. As far as > I can tell, this isn't an issue for arm64 pKVM today because memory > isn't being dynamically donated to the hypervisor. FWIW pKVM already donates memory dynamically to the hypervisor, to store e.g. guest VM metadata and page-tables, and we've never seen that problem as far as I can recall. A key difference is that pKVM injects a data abort back into the kernel in case of a stage-2 fault, so the whole EXTABLE trick/hack in load_unaligned_zeropad() should work fine out of the box. As discussed offline, Gunyah injecting an SEA into the kernel is questionable, but I understand that the architecture is a bit lacking in this department, and that's probably the next best thing. Could the Gunyah driver allocate from a CMA region instead? That would surely simplify unmapping from EL1 stage-1 (similar to how drivers usually donate memory to TZ). Thanks, Quentin
On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote: > On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote: > > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote: > > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > > > > Firmware and hypervisor drivers can donate system heap memory to their > > > > respective firmware/hypervisor entities. Those drivers should unmap the > > > > pages from the kernel's logical map before doing so. > > > > > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > > > > set_direct_map_default_noflush. > > > > > > Err, not they should not. And not using such super low-level interfaces > > > from modular code. > > > > Hi Cristoph, > > > > We've observed a few times that Linux can unintentionally access a page > > we've unmapped from host's stage 2 page table via an unaligned load from > > an adjacent page. The stage 2 is managed by Gunyah. There are few > > scenarios where even though we allocate and own a page from buddy, > > someone else could try to access the page without going through the > > hypervisor driver. One such instance we know about is > > load_unaligned_zeropad() via pathlookup_at() [1]. > > > > load_unaligned_zeropad() could be called near the end of a page. If the > > next page isn't mapped by the kernel in the stage one page tables, then > > the access from to the unmapped page from load_unaligned_zeropad() will > > land in __do_kernel_fault(), call fixup_exception(), and fill the > > remainder of the load with zeroes. If the page in question is mapped in > > stage 1 but was unmapped from stage 2, then the access lands back in > > Linux in do_sea(), leading to a panic(). > > > > Our preference would be to add fixup_exception() to S2 PTW errors for > > two reasons: > > 1. It's cheaper to do performance wise: we've already manipulated S2 > > page table and prevent intentional access to the page because > > pKVM/Gunyah drivers know that access to the page has been lost. > > 2. Page-granular S1 mappings only happen on arm64 with rodata=full. > > > > In an off-list discussion with the Android pkvm folks, their preference > > was to have the pages unmapped from stage 1. I've gone with that > > approach to get started but welcome discussion on the best approach. > > > > The Android (downstream) implementation of arm64 pkvm is currently > > implementing a hack where s2 ptw faults are given back to the host as s1 > > ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) -- > > allowing the kernel to fixup the exception. > > > > arm64 pKVM will also face this issue when implementing guest_memfd or > > when donating more memory to the hyp for s2 page tables, etc. As far as > > I can tell, this isn't an issue for arm64 pKVM today because memory > > isn't being dynamically donated to the hypervisor. > > FWIW pKVM already donates memory dynamically to the hypervisor, to store > e.g. guest VM metadata and page-tables, and we've never seen that > problem as far as I can recall. > > A key difference is that pKVM injects a data abort back into the kernel > in case of a stage-2 fault, so the whole EXTABLE trick/hack in > load_unaligned_zeropad() should work fine out of the box. > > As discussed offline, Gunyah injecting an SEA into the kernel is > questionable, but I understand that the architecture is a bit lacking in > this department, and that's probably the next best thing. > > Could the Gunyah driver allocate from a CMA region instead? That would > surely simplify unmapping from EL1 stage-1 (similar to how drivers > usually donate memory to TZ). In my opinion, CMA is overly restrictive because we'd have to define the region up front and we don't know how much memory the virtual machines the user will want to launch. Thanks, Elliot
On Monday 04 Mar 2024 at 15:37:41 (-0800), Elliot Berman wrote: > On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote: > > On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote: > > > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote: > > > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > > > > > Firmware and hypervisor drivers can donate system heap memory to their > > > > > respective firmware/hypervisor entities. Those drivers should unmap the > > > > > pages from the kernel's logical map before doing so. > > > > > > > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > > > > > set_direct_map_default_noflush. > > > > > > > > Err, not they should not. And not using such super low-level interfaces > > > > from modular code. > > > > > > Hi Cristoph, > > > > > > We've observed a few times that Linux can unintentionally access a page > > > we've unmapped from host's stage 2 page table via an unaligned load from > > > an adjacent page. The stage 2 is managed by Gunyah. There are few > > > scenarios where even though we allocate and own a page from buddy, > > > someone else could try to access the page without going through the > > > hypervisor driver. One such instance we know about is > > > load_unaligned_zeropad() via pathlookup_at() [1]. > > > > > > load_unaligned_zeropad() could be called near the end of a page. If the > > > next page isn't mapped by the kernel in the stage one page tables, then > > > the access from to the unmapped page from load_unaligned_zeropad() will > > > land in __do_kernel_fault(), call fixup_exception(), and fill the > > > remainder of the load with zeroes. If the page in question is mapped in > > > stage 1 but was unmapped from stage 2, then the access lands back in > > > Linux in do_sea(), leading to a panic(). > > > > > > Our preference would be to add fixup_exception() to S2 PTW errors for > > > two reasons: > > > 1. It's cheaper to do performance wise: we've already manipulated S2 > > > page table and prevent intentional access to the page because > > > pKVM/Gunyah drivers know that access to the page has been lost. > > > 2. Page-granular S1 mappings only happen on arm64 with rodata=full. > > > > > > In an off-list discussion with the Android pkvm folks, their preference > > > was to have the pages unmapped from stage 1. I've gone with that > > > approach to get started but welcome discussion on the best approach. > > > > > > The Android (downstream) implementation of arm64 pkvm is currently > > > implementing a hack where s2 ptw faults are given back to the host as s1 > > > ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) -- > > > allowing the kernel to fixup the exception. > > > > > > arm64 pKVM will also face this issue when implementing guest_memfd or > > > when donating more memory to the hyp for s2 page tables, etc. As far as > > > I can tell, this isn't an issue for arm64 pKVM today because memory > > > isn't being dynamically donated to the hypervisor. > > > > FWIW pKVM already donates memory dynamically to the hypervisor, to store > > e.g. guest VM metadata and page-tables, and we've never seen that > > problem as far as I can recall. > > > > A key difference is that pKVM injects a data abort back into the kernel > > in case of a stage-2 fault, so the whole EXTABLE trick/hack in > > load_unaligned_zeropad() should work fine out of the box. > > > > As discussed offline, Gunyah injecting an SEA into the kernel is > > questionable, but I understand that the architecture is a bit lacking in > > this department, and that's probably the next best thing. > > > > Could the Gunyah driver allocate from a CMA region instead? That would > > surely simplify unmapping from EL1 stage-1 (similar to how drivers > > usually donate memory to TZ). > > In my opinion, CMA is overly restrictive because we'd have to define the > region up front and we don't know how much memory the virtual machines > the user will want to launch. I was thinking of using CMA to allocate pages needed to store guest metadata and such at EL2, but not to back the actual guest pages themselves. That still means overallocating somehow, but that should hopefully be much smaller and be less of a problem? For the actual guest pages, the gunyah variant of guestmem will have to unmap the pages from the direct map itself, but I'd be personally happy with making that part non-modular to avoid the issue Christoph and others have raised. Thanks, Quentin
On Tue, Mar 05, 2024 at 03:30:58PM +0000, Quentin Perret wrote: > On Monday 04 Mar 2024 at 15:37:41 (-0800), Elliot Berman wrote: > > On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote: > > > On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote: > > > > On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote: > > > > > On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote: > > > > > > Firmware and hypervisor drivers can donate system heap memory to their > > > > > > respective firmware/hypervisor entities. Those drivers should unmap the > > > > > > pages from the kernel's logical map before doing so. > > > > > > > > > > > > Export can_set_direct_map, set_direct_map_invalid_noflush, and > > > > > > set_direct_map_default_noflush. > > > > > > > > > > Err, not they should not. And not using such super low-level interfaces > > > > > from modular code. > > > > > > > > Hi Cristoph, > > > > > > > > We've observed a few times that Linux can unintentionally access a page > > > > we've unmapped from host's stage 2 page table via an unaligned load from > > > > an adjacent page. The stage 2 is managed by Gunyah. There are few > > > > scenarios where even though we allocate and own a page from buddy, > > > > someone else could try to access the page without going through the > > > > hypervisor driver. One such instance we know about is > > > > load_unaligned_zeropad() via pathlookup_at() [1]. > > > > > > > > load_unaligned_zeropad() could be called near the end of a page. If the > > > > next page isn't mapped by the kernel in the stage one page tables, then > > > > the access from to the unmapped page from load_unaligned_zeropad() will > > > > land in __do_kernel_fault(), call fixup_exception(), and fill the > > > > remainder of the load with zeroes. If the page in question is mapped in > > > > stage 1 but was unmapped from stage 2, then the access lands back in > > > > Linux in do_sea(), leading to a panic(). > > > > > > > > Our preference would be to add fixup_exception() to S2 PTW errors for > > > > two reasons: > > > > 1. It's cheaper to do performance wise: we've already manipulated S2 > > > > page table and prevent intentional access to the page because > > > > pKVM/Gunyah drivers know that access to the page has been lost. > > > > 2. Page-granular S1 mappings only happen on arm64 with rodata=full. > > > > > > > > In an off-list discussion with the Android pkvm folks, their preference > > > > was to have the pages unmapped from stage 1. I've gone with that > > > > approach to get started but welcome discussion on the best approach. > > > > > > > > The Android (downstream) implementation of arm64 pkvm is currently > > > > implementing a hack where s2 ptw faults are given back to the host as s1 > > > > ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) -- > > > > allowing the kernel to fixup the exception. > > > > > > > > arm64 pKVM will also face this issue when implementing guest_memfd or > > > > when donating more memory to the hyp for s2 page tables, etc. As far as > > > > I can tell, this isn't an issue for arm64 pKVM today because memory > > > > isn't being dynamically donated to the hypervisor. > > > > > > FWIW pKVM already donates memory dynamically to the hypervisor, to store > > > e.g. guest VM metadata and page-tables, and we've never seen that > > > problem as far as I can recall. > > > > > > A key difference is that pKVM injects a data abort back into the kernel > > > in case of a stage-2 fault, so the whole EXTABLE trick/hack in > > > load_unaligned_zeropad() should work fine out of the box. > > > > > > As discussed offline, Gunyah injecting an SEA into the kernel is > > > questionable, but I understand that the architecture is a bit lacking in > > > this department, and that's probably the next best thing. > > > > > > Could the Gunyah driver allocate from a CMA region instead? That would > > > surely simplify unmapping from EL1 stage-1 (similar to how drivers > > > usually donate memory to TZ). > > > > In my opinion, CMA is overly restrictive because we'd have to define the > > region up front and we don't know how much memory the virtual machines > > the user will want to launch. > > I was thinking of using CMA to allocate pages needed to store guest > metadata and such at EL2, but not to back the actual guest pages > themselves. That still means overallocating somehow, but that should > hopefully be much smaller and be less of a problem? Ah, I understood the context now. Yes, we might need to use CMA region when donating memory to Gunyah if we have to ensure the memory is unmapped from stage 1, since we wouldn't use guest_memfd for that. > > For the actual guest pages, the gunyah variant of guestmem will have to > unmap the pages from the direct map itself, but I'd be personally happy I still disagree that this is a Gunyah-specific problem. As far as we can tell, Arm doesn't specify how EL2 can tell EL1 its S2 page tables couldn't give a validation translation of the IPA from stage 1. IMO, downstream/Android pKVM is violating spec for ESR_EL1 by using the S1PTW bit (which is res0 for everyone except EL2 [1]) and this means that guests need to be pKVM-enlightened. If we are adding pKVM enlightment in the exception handlers, can we add Gunyah enlightment to handle the same? Thanks, Elliot [1]: https://developer.arm.com/documentation/ddi0601/2023-12/AArch64-Registers/ESR-EL1--Exception-Syndrome-Register--EL1-?lang=en#fieldset_0-24_0_16-7_7
On Tuesday 05 Mar 2024 at 12:26:59 (-0800), Elliot Berman wrote: > I still disagree that this is a Gunyah-specific problem. As far as we > can tell, Arm doesn't specify how EL2 can tell EL1 its S2 page tables > couldn't give a validation translation of the IPA from stage 1. IMO, > downstream/Android pKVM is violating spec for ESR_EL1 by using the > S1PTW bit (which is res0 for everyone except EL2 [1]) and this means > that guests need to be pKVM-enlightened. Not really, in pKVM we have a very clear distinction between host Linux and guests, and only the host needs to be enlightened. But luckily, since pKVM is part of Linux, this is pretty much an internal kernel thing, so we're very flexible and if the S1PTW trick ever conflicts with something else (e.g. NV) we can fairly easily switch to another approach. We can tolerate non-architectural tricks like that between pKVM and host Linux because that is not ABI, but we certainly can't do that for guests. > If we are adding pKVM > enlightment in the exception handlers, can we add Gunyah enlightment to > handle the same? If you mean extending the Linux SEA handler so it does what Gunyah wants, then I'm personally not supportive of that idea since the 'contract' between Linux and Gunyah _is_ the architecture. The only ways I could see Gunyah delegate stage-2 fault handling to Linux cleanly is: - either talk to Arm to introduce a new ESR specifically for this, which doesn't sound entirely crazy to me; - or have Gunyah and Linux negotiate in software the location of the handlers. That probably means SDEI or equivalent which is a can of worm in itself I presume, and I'm not sure how feasible it would be for this handler to live in the Gunyah driver (that too probably requires exporting kernel symbols we don't want to export). Thanks, Quentin
On Wed, Mar 06, 2024 at 12:05:38PM +0000, Quentin Perret wrote: > On Tuesday 05 Mar 2024 at 12:26:59 (-0800), Elliot Berman wrote: > > I still disagree that this is a Gunyah-specific problem. As far as we > > can tell, Arm doesn't specify how EL2 can tell EL1 its S2 page tables > > couldn't give a validation translation of the IPA from stage 1. IMO, > > downstream/Android pKVM is violating spec for ESR_EL1 by using the > > S1PTW bit (which is res0 for everyone except EL2 [1]) and this means > > that guests need to be pKVM-enlightened. > > Not really, in pKVM we have a very clear distinction between host Linux > and guests, and only the host needs to be enlightened. But luckily, > since pKVM is part of Linux, this is pretty much an internal kernel > thing, so we're very flexible and if the S1PTW trick ever conflicts > with something else (e.g. NV) we can fairly easily switch to another > approach. We can tolerate non-architectural tricks like that between > pKVM and host Linux because that is not ABI, but we certainly can't do > that for guests. > > > If we are adding pKVM > > enlightment in the exception handlers, can we add Gunyah enlightment to > > handle the same? > > If you mean extending the Linux SEA handler so it does what Gunyah > wants, then I'm personally not supportive of that idea since the > 'contract' between Linux and Gunyah _is_ the architecture. Fair enough. We're building out more use cases where we want to allocate memory from buddy and donate it to some entity which unmaps it from Linux (some entity = Gunyah or Qualcomm firmware). Video DRM is an example we're working on. I imagine OP-TEE might eventually have use-cases as well since pKVM is doing same. David expressed concerns about exporting the direct unmap functions. What kind of framework/restrictions do we want to have instead? I don't think making drivers like Gunyah a builtin-only module [1] (even a refactored/small portion) is the best approach, but maybe that is what we want to do. Thanks, Elliot [1]: qcom_scm_assign_mem (d/firmware/qcom/qcom_scm.ko) is an example of a module that would have to become builtin as we upstream use cases that lend buddy-allocated memory to firmware
I wanted to revive this thread based on the mm alignment discussion for guest_memfd. Gunyah's guest_memfd allocates memory via filemap_alloc_folio, identical to KVM's guest_memfd. There's a possiblity of a stage-2 fault when memory is donated to guest VM and Linux incidentally tries to access the donated memory with an unaligned access. This access will cause kernel to panic as it expects to be able to access all memory which has been mapped in stage 1. We don't want to disallow unaligned access simply because Gunyah drivers are enabled. There are two options I see to prevent the stage-2 fault from crashing the kernel: we can fix up the stage-2 fault or ensure that Linux has a S1 table consistent with S2. To do the latter, the obvious solution seemed to be using the set_direct_map functions, but you and Christoph have valid concerns about exporting this to modules since it's a low-level API. One way to avoid exporting the symbols is to make Gunyah a built-in, but I'd like to find a better solution. One way I can think of is to create a "guest_memfd library" that both KVM and Gunyah can use. It abstracts the common bits between the 2 into a built-in module and can be the one to call the set_direct_map functions. I also think the abstraction will also help keep KVM guest_memfd cleaner once we start supporting huge folios (and splitting them). Do KVM and mm folks also see value to using a library-fied guest_memfd? Thanks, Elliot On Thu, Feb 29, 2024 at 05:35:45PM -0800, Elliot Berman wrote: > On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote: > > On 26.02.24 18:27, Elliot Berman wrote: > > > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote: > > > > On 26.02.24 12:06, Christoph Hellwig wrote: > > > > > The point is that we can't we just allow modules to unmap data from > > > > > the kernel mapping, no matter how noble your intentions are. > > > > > > > > I absolutely agree. > > > > > > > > > > Hi David and Chirstoph, > > > > > > Are your preferences that we should make Gunyah builtin only or should add > > > fixing up S2 PTW errors (or something else)? > > > > Having that built into the kernel certainly does sound better than exposing > > that functionality to arbitrary OOT modules. But still, this feels like it > > is using a "too-low-level" interface. > > > > What are your thoughts about fixing up the stage-2 fault instead? I > think this gives mmu-based isolation a slight speed boost because we > avoid modifying kernel mapping. The hypervisor driver (KVM or Gunyah) > knows that the page isn't mapped. Whether we get S2 or S1 fault, the > kernel is likely going to crash, except in the rare case where we want > to fix the exception. In that case, we can modify the S2 fault handler > to call fixup_exception() when appropriate. > > > > > > > Also, do you extend that preference to modifying S2 mappings? This would > > > require any hypervisor driver that supports confidential compute > > > usecases to only ever be builtin. > > > > > > Is your concern about unmapping data from kernel mapping, then module > > > being unloaded, and then having no way to recover the mapping? Would a > > > permanent module be better? The primary reason we were wanting to have > > > it as module was to avoid having driver in memory if you're not a Gunyah > > > guest. > > > > What I didn't grasp from this patch description: is the area where a driver > > would unmap/remap that memory somehow known ahead of time and limited? > > > > How would the driver obtain that memory it would try to unmap/remap the > > direct map of? Simply allocate some pages and then unmap the direct map? > > That's correct. > > > > > For example, we do have mm/secretmem.c, where we unmap the directmap on > > allocation and remap when freeing a page. A nice abstraction on alloc/free, > > so one cannot really do a lot of harm. > > > > Further, we enlightened the remainder of the system about secretmem, such > > that we can detect that the directmap is no longer there. As one example, > > see the secretmem_active() check in kernel/power/hibernate.c. > > > > I'll take a look at this. guest_memfd might be able to use PM notifiers here > instead, but I'll dig in the archives to see why secretmem isn't using that. > > > A similar abstraction would make sense (I remember a discussion about having > > secretmem functionality in guest_memfd, would that help?), but the question > > is "which" memory you want to unmap the direct map of, and how the driver > > became "owner" of that memory such that it would really be allowed to mess > > with the directmap. >
On 01.08.24 00:21, Elliot Berman wrote: > I wanted to revive this thread based on the mm alignment discussion for > guest_memfd. > > Gunyah's guest_memfd allocates memory via filemap_alloc_folio, identical > to KVM's guest_memfd. There's a possiblity of a stage-2 fault when > memory is donated to guest VM and Linux incidentally tries to access the > donated memory with an unaligned access. This access will cause kernel > to panic as it expects to be able to access all memory which has been > mapped in stage 1. We don't want to disallow unaligned access simply > because Gunyah drivers are enabled. > > There are two options I see to prevent the stage-2 fault from crashing > the kernel: we can fix up the stage-2 fault or ensure that Linux has a > S1 table consistent with S2. > > To do the latter, the obvious solution seemed to be using the > set_direct_map functions, but you and Christoph have valid concerns > about exporting this to modules since it's a low-level API. One way to > avoid exporting the symbols is to make Gunyah a built-in, but I'd like > to find a better solution. > > One way I can think of is to create a "guest_memfd library" that both > KVM and Gunyah can use. It abstracts the common bits between the 2 into > a built-in module and can be the one to call the set_direct_map > functions. I also think the abstraction will also help keep KVM > guest_memfd cleaner once we start supporting huge folios (and splitting > them). Do KVM and mm folks also see value to using a library-fied > guest_memfd? Without knowing about any details, this sounds like the right approach to me!
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c index 924843f1f661b..a9bd84588c98a 100644 --- a/arch/arm64/mm/pageattr.c +++ b/arch/arm64/mm/pageattr.c @@ -32,6 +32,7 @@ bool can_set_direct_map(void) return rodata_full || debug_pagealloc_enabled() || arm64_kfence_can_set_direct_map(); } +EXPORT_SYMBOL_GPL(can_set_direct_map); static int change_page_range(pte_t *ptep, unsigned long addr, void *data) { @@ -176,6 +177,7 @@ int set_direct_map_invalid_noflush(struct page *page) (unsigned long)page_address(page), PAGE_SIZE, change_page_range, &data); } +EXPORT_SYMBOL_GPL(set_direct_map_invalid_noflush); int set_direct_map_default_noflush(struct page *page) { @@ -191,6 +193,7 @@ int set_direct_map_default_noflush(struct page *page) (unsigned long)page_address(page), PAGE_SIZE, change_page_range, &data); } +EXPORT_SYMBOL_GPL(set_direct_map_default_noflush); #ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable)
Firmware and hypervisor drivers can donate system heap memory to their respective firmware/hypervisor entities. Those drivers should unmap the pages from the kernel's logical map before doing so. Export can_set_direct_map, set_direct_map_invalid_noflush, and set_direct_map_default_noflush. Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> --- arch/arm64/mm/pageattr.c | 3 +++ 1 file changed, 3 insertions(+)