Message ID | 20240805-guest-memfd-lib-v1-3-e5a29a4ff5d7@quicinc.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm: Introduce guest_memfd library | expand |
On 05.08.24 20:34, Elliot Berman wrote: > This patch was reworked from Patrick's patch: > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ > > While guest_memfd is not available to be mapped by userspace, it is > still accessible through the kernel's direct map. This means that in > scenarios where guest-private memory is not hardware protected, it can > be speculatively read and its contents potentially leaked through > hardware side-channels. Removing guest-private memory from the direct > map, thus mitigates a large class of speculative execution issues > [1, Table 1]. I think you have to point out here that the speculative execution issues are primarily only an issue when guest_memfd private memory is used without TDX and friends where the memory would be encrypted either way. Or am I wrong? > > Direct map removal do not reuse the `.prepare` machinery, since > `prepare` can be called multiple time, and it is the responsibility of > the preparation routine to not "prepare" the same folio twice [2]. Thus, > instead explicitly check if `filemap_grab_folio` allocated a new folio, > and remove the returned folio from the direct map only if this was the > case. > > The patch uses release_folio instead of free_folio to reinsert pages > back into the direct map as by the time free_folio is called, > folio->mapping can already be NULL. This means that a call to > folio_inode inside free_folio might deference a NULL pointer, leaving no > way to access the inode which stores the flags that allow determining > whether the page was removed from the direct map in the first place. > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > Cc: Patrick Roy <roypat@amazon.co.uk> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> > --- > include/linux/guest_memfd.h | 8 ++++++ > mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 72 insertions(+), 1 deletion(-) > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h > index be56d9d53067..f9e4a27aed67 100644 > --- a/include/linux/guest_memfd.h > +++ b/include/linux/guest_memfd.h > @@ -25,6 +25,14 @@ struct guest_memfd_operations { > int (*release)(struct inode *inode); > }; > > +/** > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also > + * remove them from the kernel's direct map. > + */ Should we start introducing the concept of private and shared first, such that we can then say that this only applies to private memory? > +enum { > + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), > +}; > + > /** > * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. > * If trusted hyp will do it, can ommit this flag > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c > index 580138b0f9d4..e9d8cab72b28 100644 > --- a/mm/guest_memfd.c > +++ b/mm/guest_memfd.c > @@ -7,9 +7,55 @@ > #include <linux/falloc.h> > #include <linux/guest_memfd.h> > #include <linux/pagemap.h> > +#include <linux/set_memory.h> > + > +static inline int guest_memfd_folio_private(struct folio *folio) > +{ > + unsigned long nr_pages = folio_nr_pages(folio); guest_memfd only supports small folios, this can be simplified. > + unsigned long i; > + int r; > + > + for (i = 0; i < nr_pages; i++) { > + struct page *page = folio_page(folio, i); > + > + r = set_direct_map_invalid_noflush(page); > + if (r < 0) > + goto out_remap; > + } > + > + folio_set_private(folio); > + return 0; > +out_remap: > + for (; i > 0; i--) { > + struct page *page = folio_page(folio, i - 1); > + > + BUG_ON(set_direct_map_default_noflush(page)); > + } > + return r; > +} > + > +static inline void guest_memfd_folio_clear_private(struct folio *folio) Set set/clear private semantics in this context are a bit confusing. I assume you mean "make inaccessible" "make accessible" and using the PG_private flag is just an implementation detail. > +{ > + unsigned long start = (unsigned long)folio_address(folio); > + unsigned long nr = folio_nr_pages(folio); > + unsigned long i; > + > + if (!folio_test_private(folio)) > + return; > + > + for (i = 0; i < nr; i++) { > + struct page *page = folio_page(folio, i); > + > + BUG_ON(set_direct_map_default_noflush(page)); > + } > + flush_tlb_kernel_range(start, start + folio_size(folio)); > + > + folio_clear_private(folio); > +} > > struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) > { > + unsigned long gmem_flags = (unsigned long)file->private_data; > struct inode *inode = file_inode(file); > struct guest_memfd_operations *ops = inode->i_private; > struct folio *folio; > @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags > goto out_err; > } > > + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { > + r = guest_memfd_folio_private(folio); > + if (r) > + goto out_err; > + } > + > /* > * Ignore accessed, referenced, and dirty flags. The memory is > * unevictable and there is no storage to write back to. > @@ -213,14 +265,25 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp) > if (ops->invalidate_end) > ops->invalidate_end(inode, offset, nr); > > + guest_memfd_folio_clear_private(folio); > + > return true; > } > > +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len) > +{ > + /* not yet supported */ > + BUG_ON(offset || len != folio_size(folio)); > + > + BUG_ON(!gmem_release_folio(folio, 0)); In general, no BUG_ON please. WARN_ON_ONCE() is sufficient. > +} > + > static const struct address_space_operations gmem_aops = { > .dirty_folio = noop_dirty_folio, > .migrate_folio = gmem_migrate_folio, > .error_remove_folio = gmem_error_folio, > .release_folio = gmem_release_folio, > + .invalidate_folio = gmem_invalidate_folio, > }; > > static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops) > @@ -241,7 +304,7 @@ struct file *guest_memfd_alloc(const char *name, > if (!guest_memfd_check_ops(ops)) > return ERR_PTR(-EINVAL); > > - if (flags) > + if (flags & ~GUEST_MEMFD_FLAG_NO_DIRECT_MAP) > return ERR_PTR(-EINVAL); > > /* >
Hi Elliot, On Mon, 2024-08-05 at 19:34 +0100, Elliot Berman wrote: > This patch was reworked from Patrick's patch: > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ yaay :D > While guest_memfd is not available to be mapped by userspace, it is > still accessible through the kernel's direct map. This means that in > scenarios where guest-private memory is not hardware protected, it can > be speculatively read and its contents potentially leaked through > hardware side-channels. Removing guest-private memory from the direct > map, thus mitigates a large class of speculative execution issues > [1, Table 1]. > > Direct map removal do not reuse the `.prepare` machinery, since > `prepare` can be called multiple time, and it is the responsibility of > the preparation routine to not "prepare" the same folio twice [2]. Thus, > instead explicitly check if `filemap_grab_folio` allocated a new folio, > and remove the returned folio from the direct map only if this was the > case. My patch did this, but you separated the PG_uptodate logic from the direct map removal, right? > The patch uses release_folio instead of free_folio to reinsert pages > back into the direct map as by the time free_folio is called, > folio->mapping can already be NULL. This means that a call to > folio_inode inside free_folio might deference a NULL pointer, leaving no > way to access the inode which stores the flags that allow determining > whether the page was removed from the direct map in the first place. I thought release_folio was only called for folios with PG_private=1? You choose PG_private=1 to mean "this folio is in the direct map", so it gets called for exactly the wrong folios (more on that below, too). > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > Cc: Patrick Roy <roypat@amazon.co.uk> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> > --- > include/linux/guest_memfd.h | 8 ++++++ > mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 72 insertions(+), 1 deletion(-) > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h > index be56d9d53067..f9e4a27aed67 100644 > --- a/include/linux/guest_memfd.h > +++ b/include/linux/guest_memfd.h > @@ -25,6 +25,14 @@ struct guest_memfd_operations { > int (*release)(struct inode *inode); > }; > > +/** > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also > + * remove them from the kernel's direct map. > + */ > +enum { > + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), > +}; > + > /** > * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. > * If trusted hyp will do it, can ommit this flag > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c > index 580138b0f9d4..e9d8cab72b28 100644 > --- a/mm/guest_memfd.c > +++ b/mm/guest_memfd.c > @@ -7,9 +7,55 @@ > #include <linux/falloc.h> > #include <linux/guest_memfd.h> > #include <linux/pagemap.h> > +#include <linux/set_memory.h> > + > +static inline int guest_memfd_folio_private(struct folio *folio) > +{ > + unsigned long nr_pages = folio_nr_pages(folio); > + unsigned long i; > + int r; > + > + for (i = 0; i < nr_pages; i++) { > + struct page *page = folio_page(folio, i); > + > + r = set_direct_map_invalid_noflush(page); > + if (r < 0) > + goto out_remap; > + } > + > + folio_set_private(folio); Mh, you've inverted the semantics of PG_private in the context of gmem here, compared to my patch. For me, PG_private=1 meant "this folio is back in the direct map". For you it means "this folio is removed from the direct map". Could you elaborate on why you require these different semantics for PG_private? Actually, I think in this patch series, you could just drop the PG_private stuff altogether, as the only place you do folio_test_private is in guest_memfd_clear_private, but iirc calling set_direct_map_default_noflush on a page that's already in the direct map is a NOOP anyway. On the other hand, as Paolo pointed out in my patches [1], just using a page flag to track direct map presence for gmem is not enough. We actually need to keep a refcount in folio->private to keep track of how many different actors request a folio's direct map presence (in the specific case in my patch series, it was different pfn_to_gfn_caches for the kvm-clock structures of different vcpus, which the guest can place into the same gfn). While this might not be a concern for the the pKVM/Gunyah case, where the guest dictates memory state, it's required for the non-CoCo case where KVM/userspace can set arbitrary guest gfns to shared if it needs/wants to access them for whatever reason. So for this we'd need to have PG_private=1 mean "direct map entry restored" (as if PG_private=0, there is no folio->private). [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315 > + return 0; > +out_remap: > + for (; i > 0; i--) { > + struct page *page = folio_page(folio, i - 1); > + > + BUG_ON(set_direct_map_default_noflush(page)); > + } > + return r; > +} > + > +static inline void guest_memfd_folio_clear_private(struct folio *folio) > +{ > + unsigned long start = (unsigned long)folio_address(folio); > + unsigned long nr = folio_nr_pages(folio); > + unsigned long i; > + > + if (!folio_test_private(folio)) > + return; > + > + for (i = 0; i < nr; i++) { > + struct page *page = folio_page(folio, i); > + > + BUG_ON(set_direct_map_default_noflush(page)); > + } > + flush_tlb_kernel_range(start, start + folio_size(folio)); > + > + folio_clear_private(folio); > +} > > struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) > { > + unsigned long gmem_flags = (unsigned long)file->private_data; > struct inode *inode = file_inode(file); > struct guest_memfd_operations *ops = inode->i_private; > struct folio *folio; > @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags > goto out_err; > } > > + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { > + r = guest_memfd_folio_private(folio); > + if (r) > + goto out_err; > + } > + How does a caller of guest_memfd_grab_folio know whether a folio needs to be removed from the direct map? E.g. how can a caller know ahead of time whether guest_memfd_grab_folio will return a freshly allocated folio (which thus needs to be removed from the direct map), vs a folio that already exists and has been removed from the direct map (probably fine to remove from direct map again), vs a folio that already exists and is currently re-inserted into the direct map for whatever reason (must not remove these from the direct map, as other parts of KVM/userspace probably don't expect the direct map entries to disappear from underneath them). I couldn't figure this one out for my series, which is why I went with hooking into the PG_uptodate logic to always remove direct map entries on freshly allocated folios. > /* > * Ignore accessed, referenced, and dirty flags. The memory is > * unevictable and there is no storage to write back to. > @@ -213,14 +265,25 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp) > if (ops->invalidate_end) > ops->invalidate_end(inode, offset, nr); > > + guest_memfd_folio_clear_private(folio); > + > return true; > } > > +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len) > +{ > + /* not yet supported */ > + BUG_ON(offset || len != folio_size(folio)); > + > + BUG_ON(!gmem_release_folio(folio, 0)); > +} > + > static const struct address_space_operations gmem_aops = { > .dirty_folio = noop_dirty_folio, > .migrate_folio = gmem_migrate_folio, > .error_remove_folio = gmem_error_folio, > .release_folio = gmem_release_folio, > + .invalidate_folio = gmem_invalidate_folio, > }; > > static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops) > @@ -241,7 +304,7 @@ struct file *guest_memfd_alloc(const char *name, > if (!guest_memfd_check_ops(ops)) > return ERR_PTR(-EINVAL); > > - if (flags) > + if (flags & ~GUEST_MEMFD_FLAG_NO_DIRECT_MAP) > return ERR_PTR(-EINVAL); > > /* > > -- > 2.34.1 > Best, Patrick
On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote: > > Hi Elliot, > > On Mon, 2024-08-05 at 19:34 +0100, Elliot Berman wrote: > > This patch was reworked from Patrick's patch: > > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ > > yaay :D > > > While guest_memfd is not available to be mapped by userspace, it is > > still accessible through the kernel's direct map. This means that in > > scenarios where guest-private memory is not hardware protected, it can > > be speculatively read and its contents potentially leaked through > > hardware side-channels. Removing guest-private memory from the direct > > map, thus mitigates a large class of speculative execution issues > > [1, Table 1]. > > > > Direct map removal do not reuse the `.prepare` machinery, since > > `prepare` can be called multiple time, and it is the responsibility of > > the preparation routine to not "prepare" the same folio twice [2]. Thus, > > instead explicitly check if `filemap_grab_folio` allocated a new folio, > > and remove the returned folio from the direct map only if this was the > > case. > > My patch did this, but you separated the PG_uptodate logic from the > direct map removal, right? > > > The patch uses release_folio instead of free_folio to reinsert pages > > back into the direct map as by the time free_folio is called, > > folio->mapping can already be NULL. This means that a call to > > folio_inode inside free_folio might deference a NULL pointer, leaving no > > way to access the inode which stores the flags that allow determining > > whether the page was removed from the direct map in the first place. > > I thought release_folio was only called for folios with PG_private=1? > You choose PG_private=1 to mean "this folio is in the direct map", so it > gets called for exactly the wrong folios (more on that below, too). > PG_private=1 should be meaning "this folio is not in the direct map". > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > > > Cc: Patrick Roy <roypat@amazon.co.uk> > > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> > > --- > > include/linux/guest_memfd.h | 8 ++++++ > > mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- > > 2 files changed, 72 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h > > index be56d9d53067..f9e4a27aed67 100644 > > --- a/include/linux/guest_memfd.h > > +++ b/include/linux/guest_memfd.h > > @@ -25,6 +25,14 @@ struct guest_memfd_operations { > > int (*release)(struct inode *inode); > > }; > > > > +/** > > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also > > + * remove them from the kernel's direct map. > > + */ > > +enum { > > + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), > > +}; > > + > > /** > > * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. > > * If trusted hyp will do it, can ommit this flag > > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c > > index 580138b0f9d4..e9d8cab72b28 100644 > > --- a/mm/guest_memfd.c > > +++ b/mm/guest_memfd.c > > @@ -7,9 +7,55 @@ > > #include <linux/falloc.h> > > #include <linux/guest_memfd.h> > > #include <linux/pagemap.h> > > +#include <linux/set_memory.h> > > + > > +static inline int guest_memfd_folio_private(struct folio *folio) > > +{ > > + unsigned long nr_pages = folio_nr_pages(folio); > > + unsigned long i; > > + int r; > > + > > + for (i = 0; i < nr_pages; i++) { > > + struct page *page = folio_page(folio, i); > > + > > + r = set_direct_map_invalid_noflush(page); > > + if (r < 0) > > + goto out_remap; > > + } > > + > > + folio_set_private(folio); > > Mh, you've inverted the semantics of PG_private in the context of gmem > here, compared to my patch. For me, PG_private=1 meant "this folio is > back in the direct map". For you it means "this folio is removed from > the direct map". > > Could you elaborate on why you require these different semantics for > PG_private? Actually, I think in this patch series, you could just drop > the PG_private stuff altogether, as the only place you do > folio_test_private is in guest_memfd_clear_private, but iirc calling > set_direct_map_default_noflush on a page that's already in the direct > map is a NOOP anyway. > > On the other hand, as Paolo pointed out in my patches [1], just using a > page flag to track direct map presence for gmem is not enough. We > actually need to keep a refcount in folio->private to keep track of how > many different actors request a folio's direct map presence (in the > specific case in my patch series, it was different pfn_to_gfn_caches for > the kvm-clock structures of different vcpus, which the guest can place > into the same gfn). While this might not be a concern for the the > pKVM/Gunyah case, where the guest dictates memory state, it's required > for the non-CoCo case where KVM/userspace can set arbitrary guest gfns > to shared if it needs/wants to access them for whatever reason. So for > this we'd need to have PG_private=1 mean "direct map entry restored" (as > if PG_private=0, there is no folio->private). > > [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315 > I wonder if we can use the folio refcount itself, assuming we can rely on refcount == 1 means we can do shared->private conversion. In gpc_map_gmem, we convert private->shared. There's no problem here in the non-CoCo case. In gpc_unmap, we *try* to convert back from shared->private. If refcount>2, then the conversion would fail. The last gpc_unmap would be able to successfully convert back to private. Do you see any concerns with this approach? > > + return 0; > > +out_remap: > > + for (; i > 0; i--) { > > + struct page *page = folio_page(folio, i - 1); > > + > > + BUG_ON(set_direct_map_default_noflush(page)); > > + } > > + return r; > > +} > > + > > +static inline void guest_memfd_folio_clear_private(struct folio *folio) > > +{ > > + unsigned long start = (unsigned long)folio_address(folio); > > + unsigned long nr = folio_nr_pages(folio); > > + unsigned long i; > > + > > + if (!folio_test_private(folio)) > > + return; > > + > > + for (i = 0; i < nr; i++) { > > + struct page *page = folio_page(folio, i); > > + > > + BUG_ON(set_direct_map_default_noflush(page)); > > + } > > + flush_tlb_kernel_range(start, start + folio_size(folio)); > > + > > + folio_clear_private(folio); > > +} > > > > struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) > > { > > + unsigned long gmem_flags = (unsigned long)file->private_data; > > struct inode *inode = file_inode(file); > > struct guest_memfd_operations *ops = inode->i_private; > > struct folio *folio; > > @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags > > goto out_err; > > } > > > > + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { > > + r = guest_memfd_folio_private(folio); > > + if (r) > > + goto out_err; > > + } > > + > > How does a caller of guest_memfd_grab_folio know whether a folio needs > to be removed from the direct map? E.g. how can a caller know ahead of > time whether guest_memfd_grab_folio will return a freshly allocated > folio (which thus needs to be removed from the direct map), vs a folio > that already exists and has been removed from the direct map (probably > fine to remove from direct map again), vs a folio that already exists > and is currently re-inserted into the direct map for whatever reason > (must not remove these from the direct map, as other parts of > KVM/userspace probably don't expect the direct map entries to disappear > from underneath them). I couldn't figure this one out for my series, > which is why I went with hooking into the PG_uptodate logic to always > remove direct map entries on freshly allocated folios. > gmem_flags come from the owner. If the caller (in non-CoCo case) wants to restore the direct map right away, it'd have to be a direct operation. As an optimization, we could add option that asks for page in "shared" state. If allocating new page, we can return it right away without removing from direct map. If grabbing existing folio, it would try to do the private->shared conversion. Thanks for the feedback, it was helpful! - Elliot > > /* > > * Ignore accessed, referenced, and dirty flags. The memory is > > * unevictable and there is no storage to write back to. > > @@ -213,14 +265,25 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp) > > if (ops->invalidate_end) > > ops->invalidate_end(inode, offset, nr); > > > > + guest_memfd_folio_clear_private(folio); > > + > > return true; > > } > > > > +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len) > > +{ > > + /* not yet supported */ > > + BUG_ON(offset || len != folio_size(folio)); > > + > > + BUG_ON(!gmem_release_folio(folio, 0)); > > +} > > + > > static const struct address_space_operations gmem_aops = { > > .dirty_folio = noop_dirty_folio, > > .migrate_folio = gmem_migrate_folio, > > .error_remove_folio = gmem_error_folio, > > .release_folio = gmem_release_folio, > > + .invalidate_folio = gmem_invalidate_folio, > > }; > > > > static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops) > > @@ -241,7 +304,7 @@ struct file *guest_memfd_alloc(const char *name, > > if (!guest_memfd_check_ops(ops)) > > return ERR_PTR(-EINVAL); > > > > - if (flags) > > + if (flags & ~GUEST_MEMFD_FLAG_NO_DIRECT_MAP) > > return ERR_PTR(-EINVAL); > > > > /* > > > > -- > > 2.34.1 > > > > Best, > Patrick >
On Tue, 2024-08-06 at 21:13 +0100, Elliot Berman wrote: > On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote: >> >> Hi Elliot, >> >> On Mon, 2024-08-05 at 19:34 +0100, Elliot Berman wrote: >>> This patch was reworked from Patrick's patch: >>> https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ >> >> yaay :D >> >>> While guest_memfd is not available to be mapped by userspace, it is >>> still accessible through the kernel's direct map. This means that in >>> scenarios where guest-private memory is not hardware protected, it can >>> be speculatively read and its contents potentially leaked through >>> hardware side-channels. Removing guest-private memory from the direct >>> map, thus mitigates a large class of speculative execution issues >>> [1, Table 1]. >>> >>> Direct map removal do not reuse the `.prepare` machinery, since >>> `prepare` can be called multiple time, and it is the responsibility of >>> the preparation routine to not "prepare" the same folio twice [2]. Thus, >>> instead explicitly check if `filemap_grab_folio` allocated a new folio, >>> and remove the returned folio from the direct map only if this was the >>> case. >> >> My patch did this, but you separated the PG_uptodate logic from the >> direct map removal, right? >> >>> The patch uses release_folio instead of free_folio to reinsert pages >>> back into the direct map as by the time free_folio is called, >>> folio->mapping can already be NULL. This means that a call to >>> folio_inode inside free_folio might deference a NULL pointer, leaving no >>> way to access the inode which stores the flags that allow determining >>> whether the page was removed from the direct map in the first place. >> >> I thought release_folio was only called for folios with PG_private=1? >> You choose PG_private=1 to mean "this folio is in the direct map", so it >> gets called for exactly the wrong folios (more on that below, too). >> > > PG_private=1 should be meaning "this folio is not in the direct map". Right. I just checked my patch and it indeed means the same there. No idea what I was on about yesterday. I think I only had Paolo's comment about using folio->private for refcounting sharings in mind, so I thought "to use folio->private, you need PG_private=1, therefore PG_private=1 means shared" (I just checked, and while folio_attach_private causes PG_private=1 to be set, page_set_private does not). Obviously my comments below and especially here on PG_private were nonsense. Sorry about that! >>> [1]: https://download.vusec.net/papers/quarantine_raid23.pdf >>> >>> Cc: Patrick Roy <roypat@amazon.co.uk> >>> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> >>> --- >>> include/linux/guest_memfd.h | 8 ++++++ >>> mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- >>> 2 files changed, 72 insertions(+), 1 deletion(-) >>> >>> diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h >>> index be56d9d53067..f9e4a27aed67 100644 >>> --- a/include/linux/guest_memfd.h >>> +++ b/include/linux/guest_memfd.h >>> @@ -25,6 +25,14 @@ struct guest_memfd_operations { >>> int (*release)(struct inode *inode); >>> }; >>> >>> +/** >>> + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also >>> + * remove them from the kernel's direct map. >>> + */ >>> +enum { >>> + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), >>> +}; >>> + >>> /** >>> * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. >>> * If trusted hyp will do it, can ommit this flag >>> diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c >>> index 580138b0f9d4..e9d8cab72b28 100644 >>> --- a/mm/guest_memfd.c >>> +++ b/mm/guest_memfd.c >>> @@ -7,9 +7,55 @@ >>> #include <linux/falloc.h> >>> #include <linux/guest_memfd.h> >>> #include <linux/pagemap.h> >>> +#include <linux/set_memory.h> >>> + >>> +static inline int guest_memfd_folio_private(struct folio *folio) >>> +{ >>> + unsigned long nr_pages = folio_nr_pages(folio); >>> + unsigned long i; >>> + int r; >>> + >>> + for (i = 0; i < nr_pages; i++) { >>> + struct page *page = folio_page(folio, i); >>> + >>> + r = set_direct_map_invalid_noflush(page); >>> + if (r < 0) >>> + goto out_remap; >>> + } >>> + >>> + folio_set_private(folio); >> >> Mh, you've inverted the semantics of PG_private in the context of gmem >> here, compared to my patch. For me, PG_private=1 meant "this folio is >> back in the direct map". For you it means "this folio is removed from >> the direct map". >> >> Could you elaborate on why you require these different semantics for >> PG_private? Actually, I think in this patch series, you could just drop >> the PG_private stuff altogether, as the only place you do >> folio_test_private is in guest_memfd_clear_private, but iirc calling >> set_direct_map_default_noflush on a page that's already in the direct >> map is a NOOP anyway. >> >> On the other hand, as Paolo pointed out in my patches [1], just using a >> page flag to track direct map presence for gmem is not enough. We >> actually need to keep a refcount in folio->private to keep track of how >> many different actors request a folio's direct map presence (in the >> specific case in my patch series, it was different pfn_to_gfn_caches for >> the kvm-clock structures of different vcpus, which the guest can place >> into the same gfn). While this might not be a concern for the the >> pKVM/Gunyah case, where the guest dictates memory state, it's required >> for the non-CoCo case where KVM/userspace can set arbitrary guest gfns >> to shared if it needs/wants to access them for whatever reason. So for >> this we'd need to have PG_private=1 mean "direct map entry restored" (as >> if PG_private=0, there is no folio->private). >> >> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315 >> > > I wonder if we can use the folio refcount itself, assuming we can rely > on refcount == 1 means we can do shared->private conversion. > > In gpc_map_gmem, we convert private->shared. There's no problem here in > the non-CoCo case. > > In gpc_unmap, we *try* to convert back from shared->private. If > refcount>2, then the conversion would fail. The last gpc_unmap would be > able to successfully convert back to private. > > Do you see any concerns with this approach? The gfn_to_pfn_cache does not keep an elevated refcount on the cached page, and instead responds to MMU notifiers to detect whether the cached translation has been invalidated, iirc. So the folio refcount will not reflect the number of gpcs holding that folio. >>> + return 0; >>> +out_remap: >>> + for (; i > 0; i--) { >>> + struct page *page = folio_page(folio, i - 1); >>> + >>> + BUG_ON(set_direct_map_default_noflush(page)); >>> + } >>> + return r; >>> +} >>> + >>> +static inline void guest_memfd_folio_clear_private(struct folio *folio) >>> +{ >>> + unsigned long start = (unsigned long)folio_address(folio); >>> + unsigned long nr = folio_nr_pages(folio); >>> + unsigned long i; >>> + >>> + if (!folio_test_private(folio)) >>> + return; >>> + >>> + for (i = 0; i < nr; i++) { >>> + struct page *page = folio_page(folio, i); >>> + >>> + BUG_ON(set_direct_map_default_noflush(page)); >>> + } >>> + flush_tlb_kernel_range(start, start + folio_size(folio)); >>> + >>> + folio_clear_private(folio); >>> +} >>> >>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) >>> { >>> + unsigned long gmem_flags = (unsigned long)file->private_data; >>> struct inode *inode = file_inode(file); >>> struct guest_memfd_operations *ops = inode->i_private; >>> struct folio *folio; >>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags >>> goto out_err; >>> } >>> >>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { >>> + r = guest_memfd_folio_private(folio); >>> + if (r) >>> + goto out_err; >>> + } >>> + >> >> How does a caller of guest_memfd_grab_folio know whether a folio needs >> to be removed from the direct map? E.g. how can a caller know ahead of >> time whether guest_memfd_grab_folio will return a freshly allocated >> folio (which thus needs to be removed from the direct map), vs a folio >> that already exists and has been removed from the direct map (probably >> fine to remove from direct map again), vs a folio that already exists >> and is currently re-inserted into the direct map for whatever reason >> (must not remove these from the direct map, as other parts of >> KVM/userspace probably don't expect the direct map entries to disappear >> from underneath them). I couldn't figure this one out for my series, >> which is why I went with hooking into the PG_uptodate logic to always >> remove direct map entries on freshly allocated folios. >> > > gmem_flags come from the owner. If the caller (in non-CoCo case) wants > to restore the direct map right away, it'd have to be a direct > operation. As an optimization, we could add option that asks for page in > "shared" state. If allocating new page, we can return it right away > without removing from direct map. If grabbing existing folio, it would > try to do the private->shared conversion. > > Thanks for the feedback, it was helpful! > > - Elliot > >>> /* >>> * Ignore accessed, referenced, and dirty flags. The memory is >>> * unevictable and there is no storage to write back to. >>> @@ -213,14 +265,25 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp) >>> if (ops->invalidate_end) >>> ops->invalidate_end(inode, offset, nr); >>> >>> + guest_memfd_folio_clear_private(folio); >>> + >>> return true; >>> } >>> >>> +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len) >>> +{ >>> + /* not yet supported */ >>> + BUG_ON(offset || len != folio_size(folio)); >>> + >>> + BUG_ON(!gmem_release_folio(folio, 0)); >>> +} >>> + >>> static const struct address_space_operations gmem_aops = { >>> .dirty_folio = noop_dirty_folio, >>> .migrate_folio = gmem_migrate_folio, >>> .error_remove_folio = gmem_error_folio, >>> .release_folio = gmem_release_folio, >>> + .invalidate_folio = gmem_invalidate_folio, >>> }; >>> >>> static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops) >>> @@ -241,7 +304,7 @@ struct file *guest_memfd_alloc(const char *name, >>> if (!guest_memfd_check_ops(ops)) >>> return ERR_PTR(-EINVAL); >>> >>> - if (flags) >>> + if (flags & ~GUEST_MEMFD_FLAG_NO_DIRECT_MAP) >>> return ERR_PTR(-EINVAL); >>> >>> /* >>> >>> -- >>> 2.34.1 >>> >> >> Best, >> Patrick >>
On Wed, 2024-08-07 at 07:48 +0100, Patrick Roy wrote: > > > On Tue, 2024-08-06 at 21:13 +0100, Elliot Berman wrote: >> On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote: >>> >>> Hi Elliot, >>> >>> On Mon, 2024-08-05 at 19:34 +0100, Elliot Berman wrote: >>>> This patch was reworked from Patrick's patch: >>>> https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ >>> >>> yaay :D >>> >>>> While guest_memfd is not available to be mapped by userspace, it is >>>> still accessible through the kernel's direct map. This means that in >>>> scenarios where guest-private memory is not hardware protected, it can >>>> be speculatively read and its contents potentially leaked through >>>> hardware side-channels. Removing guest-private memory from the direct >>>> map, thus mitigates a large class of speculative execution issues >>>> [1, Table 1]. >>>> >>>> Direct map removal do not reuse the `.prepare` machinery, since >>>> `prepare` can be called multiple time, and it is the responsibility of >>>> the preparation routine to not "prepare" the same folio twice [2]. Thus, >>>> instead explicitly check if `filemap_grab_folio` allocated a new folio, >>>> and remove the returned folio from the direct map only if this was the >>>> case. >>> >>> My patch did this, but you separated the PG_uptodate logic from the >>> direct map removal, right? >>> >>>> The patch uses release_folio instead of free_folio to reinsert pages >>>> back into the direct map as by the time free_folio is called, >>>> folio->mapping can already be NULL. This means that a call to >>>> folio_inode inside free_folio might deference a NULL pointer, leaving no >>>> way to access the inode which stores the flags that allow determining >>>> whether the page was removed from the direct map in the first place. >>> >>> I thought release_folio was only called for folios with PG_private=1? >>> You choose PG_private=1 to mean "this folio is in the direct map", so it >>> gets called for exactly the wrong folios (more on that below, too). >>> >> >> PG_private=1 should be meaning "this folio is not in the direct map". > > Right. I just checked my patch and it indeed means the same there. No > idea what I was on about yesterday. I think I only had Paolo's comment > about using folio->private for refcounting sharings in mind, so I > thought "to use folio->private, you need PG_private=1, therefore > PG_private=1 means shared" (I just checked, and while > folio_attach_private causes PG_private=1 to be set, page_set_private > does not). Obviously my comments below and especially here on PG_private > were nonsense. Sorry about that! > >>>> [1]: https://download.vusec.net/papers/quarantine_raid23.pdf >>>> >>>> Cc: Patrick Roy <roypat@amazon.co.uk> >>>> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> >>>> --- >>>> include/linux/guest_memfd.h | 8 ++++++ >>>> mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- >>>> 2 files changed, 72 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h >>>> index be56d9d53067..f9e4a27aed67 100644 >>>> --- a/include/linux/guest_memfd.h >>>> +++ b/include/linux/guest_memfd.h >>>> @@ -25,6 +25,14 @@ struct guest_memfd_operations { >>>> int (*release)(struct inode *inode); >>>> }; >>>> >>>> +/** >>>> + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also >>>> + * remove them from the kernel's direct map. >>>> + */ >>>> +enum { >>>> + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), >>>> +}; >>>> + >>>> /** >>>> * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. >>>> * If trusted hyp will do it, can ommit this flag >>>> diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c >>>> index 580138b0f9d4..e9d8cab72b28 100644 >>>> --- a/mm/guest_memfd.c >>>> +++ b/mm/guest_memfd.c >>>> @@ -7,9 +7,55 @@ >>>> #include <linux/falloc.h> >>>> #include <linux/guest_memfd.h> >>>> #include <linux/pagemap.h> >>>> +#include <linux/set_memory.h> >>>> + >>>> +static inline int guest_memfd_folio_private(struct folio *folio) >>>> +{ >>>> + unsigned long nr_pages = folio_nr_pages(folio); >>>> + unsigned long i; >>>> + int r; >>>> + >>>> + for (i = 0; i < nr_pages; i++) { >>>> + struct page *page = folio_page(folio, i); >>>> + >>>> + r = set_direct_map_invalid_noflush(page); >>>> + if (r < 0) >>>> + goto out_remap; >>>> + } >>>> + >>>> + folio_set_private(folio); >>> >>> Mh, you've inverted the semantics of PG_private in the context of gmem >>> here, compared to my patch. For me, PG_private=1 meant "this folio is >>> back in the direct map". For you it means "this folio is removed from >>> the direct map". >>> >>> Could you elaborate on why you require these different semantics for >>> PG_private? Actually, I think in this patch series, you could just drop >>> the PG_private stuff altogether, as the only place you do >>> folio_test_private is in guest_memfd_clear_private, but iirc calling >>> set_direct_map_default_noflush on a page that's already in the direct >>> map is a NOOP anyway. >>> >>> On the other hand, as Paolo pointed out in my patches [1], just using a >>> page flag to track direct map presence for gmem is not enough. We >>> actually need to keep a refcount in folio->private to keep track of how >>> many different actors request a folio's direct map presence (in the >>> specific case in my patch series, it was different pfn_to_gfn_caches for >>> the kvm-clock structures of different vcpus, which the guest can place >>> into the same gfn). While this might not be a concern for the the >>> pKVM/Gunyah case, where the guest dictates memory state, it's required >>> for the non-CoCo case where KVM/userspace can set arbitrary guest gfns >>> to shared if it needs/wants to access them for whatever reason. So for >>> this we'd need to have PG_private=1 mean "direct map entry restored" (as >>> if PG_private=0, there is no folio->private). >>> >>> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315 >>> >> >> I wonder if we can use the folio refcount itself, assuming we can rely >> on refcount == 1 means we can do shared->private conversion. >> >> In gpc_map_gmem, we convert private->shared. There's no problem here in >> the non-CoCo case. >> >> In gpc_unmap, we *try* to convert back from shared->private. If >> refcount>2, then the conversion would fail. The last gpc_unmap would be >> able to successfully convert back to private. >> >> Do you see any concerns with this approach? > > The gfn_to_pfn_cache does not keep an elevated refcount on the cached > page, and instead responds to MMU notifiers to detect whether the cached > translation has been invalidated, iirc. So the folio refcount will > not reflect the number of gpcs holding that folio. > >>>> + return 0; >>>> +out_remap: >>>> + for (; i > 0; i--) { >>>> + struct page *page = folio_page(folio, i - 1); >>>> + >>>> + BUG_ON(set_direct_map_default_noflush(page)); >>>> + } >>>> + return r; >>>> +} >>>> + >>>> +static inline void guest_memfd_folio_clear_private(struct folio *folio) >>>> +{ >>>> + unsigned long start = (unsigned long)folio_address(folio); >>>> + unsigned long nr = folio_nr_pages(folio); >>>> + unsigned long i; >>>> + >>>> + if (!folio_test_private(folio)) >>>> + return; >>>> + >>>> + for (i = 0; i < nr; i++) { >>>> + struct page *page = folio_page(folio, i); >>>> + >>>> + BUG_ON(set_direct_map_default_noflush(page)); >>>> + } >>>> + flush_tlb_kernel_range(start, start + folio_size(folio)); >>>> + >>>> + folio_clear_private(folio); >>>> +} >>>> >>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) >>>> { >>>> + unsigned long gmem_flags = (unsigned long)file->private_data; >>>> struct inode *inode = file_inode(file); >>>> struct guest_memfd_operations *ops = inode->i_private; >>>> struct folio *folio; >>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags >>>> goto out_err; >>>> } >>>> >>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { >>>> + r = guest_memfd_folio_private(folio); >>>> + if (r) >>>> + goto out_err; >>>> + } >>>> + >>> >>> How does a caller of guest_memfd_grab_folio know whether a folio needs >>> to be removed from the direct map? E.g. how can a caller know ahead of >>> time whether guest_memfd_grab_folio will return a freshly allocated >>> folio (which thus needs to be removed from the direct map), vs a folio >>> that already exists and has been removed from the direct map (probably >>> fine to remove from direct map again), vs a folio that already exists >>> and is currently re-inserted into the direct map for whatever reason >>> (must not remove these from the direct map, as other parts of >>> KVM/userspace probably don't expect the direct map entries to disappear >>> from underneath them). I couldn't figure this one out for my series, >>> which is why I went with hooking into the PG_uptodate logic to always >>> remove direct map entries on freshly allocated folios. >>> >> >> gmem_flags come from the owner. If the caller (in non-CoCo case) wants Ah, oops, I got it mixed up with the new `flags` parameter. >> to restore the direct map right away, it'd have to be a direct >> operation. As an optimization, we could add option that asks for page in >> "shared" state. If allocating new page, we can return it right away >> without removing from direct map. If grabbing existing folio, it would >> try to do the private->shared conversion. My concern is more with the implicit shared->private conversion that happens on every call to guest_memfd_grab_folio (and thus kvm_gmem_get_pfn) when grabbing existing folios. If something else marked the folio as shared, then we cannot punch it out of the direct map again until that something is done using the folio (when working on my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were temporarily marked shared, as I was seeing panics because of this). And if the folio is currently private, there's nothing to do. So either way, guest_memfd_grab_folio shouldn't touch the direct map entry for existing folios. >> >> Thanks for the feedback, it was helpful! >> >> - Elliot >> >>>> /* >>>> * Ignore accessed, referenced, and dirty flags. The memory is >>>> * unevictable and there is no storage to write back to. >>>> @@ -213,14 +265,25 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp) >>>> if (ops->invalidate_end) >>>> ops->invalidate_end(inode, offset, nr); >>>> >>>> + guest_memfd_folio_clear_private(folio); >>>> + >>>> return true; >>>> } >>>> >>>> +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len) >>>> +{ >>>> + /* not yet supported */ >>>> + BUG_ON(offset || len != folio_size(folio)); >>>> + >>>> + BUG_ON(!gmem_release_folio(folio, 0)); >>>> +} >>>> + >>>> static const struct address_space_operations gmem_aops = { >>>> .dirty_folio = noop_dirty_folio, >>>> .migrate_folio = gmem_migrate_folio, >>>> .error_remove_folio = gmem_error_folio, >>>> .release_folio = gmem_release_folio, >>>> + .invalidate_folio = gmem_invalidate_folio, >>>> }; >>>> >>>> static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops) >>>> @@ -241,7 +304,7 @@ struct file *guest_memfd_alloc(const char *name, >>>> if (!guest_memfd_check_ops(ops)) >>>> return ERR_PTR(-EINVAL); >>>> >>>> - if (flags) >>>> + if (flags & ~GUEST_MEMFD_FLAG_NO_DIRECT_MAP) >>>> return ERR_PTR(-EINVAL); >>>> >>>> /* >>>> >>>> -- >>>> 2.34.1 >>>> >>> >>> Best, >>> Patrick >>>
On Wed, Aug 07, 2024 at 11:57:35AM +0100, Patrick Roy wrote: > On Wed, 2024-08-07 at 07:48 +0100, Patrick Roy wrote: > > On Tue, 2024-08-06 at 21:13 +0100, Elliot Berman wrote: > >> On Tue, Aug 06, 2024 at 04:39:24PM +0100, Patrick Roy wrote: > >>> On the other hand, as Paolo pointed out in my patches [1], just using a > >>> page flag to track direct map presence for gmem is not enough. We > >>> actually need to keep a refcount in folio->private to keep track of how > >>> many different actors request a folio's direct map presence (in the > >>> specific case in my patch series, it was different pfn_to_gfn_caches for > >>> the kvm-clock structures of different vcpus, which the guest can place > >>> into the same gfn). While this might not be a concern for the the > >>> pKVM/Gunyah case, where the guest dictates memory state, it's required > >>> for the non-CoCo case where KVM/userspace can set arbitrary guest gfns > >>> to shared if it needs/wants to access them for whatever reason. So for > >>> this we'd need to have PG_private=1 mean "direct map entry restored" (as > >>> if PG_private=0, there is no folio->private). > >>> > >>> [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#m0608c4b6a069b3953d7ee97f48577d32688a3315 > >>> > >> > >> I wonder if we can use the folio refcount itself, assuming we can rely > >> on refcount == 1 means we can do shared->private conversion. > >> > >> In gpc_map_gmem, we convert private->shared. There's no problem here in > >> the non-CoCo case. > >> > >> In gpc_unmap, we *try* to convert back from shared->private. If > >> refcount>2, then the conversion would fail. The last gpc_unmap would be > >> able to successfully convert back to private. > >> > >> Do you see any concerns with this approach? > > > > The gfn_to_pfn_cache does not keep an elevated refcount on the cached > > page, and instead responds to MMU notifiers to detect whether the cached > > translation has been invalidated, iirc. So the folio refcount will > > not reflect the number of gpcs holding that folio. > > Ah, fair enough. This is kinda like a GUP pin which would prevent us from making page private, but without the pin part. [...] > >>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) > >>>> { > >>>> + unsigned long gmem_flags = (unsigned long)file->private_data; > >>>> struct inode *inode = file_inode(file); > >>>> struct guest_memfd_operations *ops = inode->i_private; > >>>> struct folio *folio; > >>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags > >>>> goto out_err; > >>>> } > >>>> > >>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { > >>>> + r = guest_memfd_folio_private(folio); > >>>> + if (r) > >>>> + goto out_err; > >>>> + } > >>>> + > >>> > >>> How does a caller of guest_memfd_grab_folio know whether a folio needs > >>> to be removed from the direct map? E.g. how can a caller know ahead of > >>> time whether guest_memfd_grab_folio will return a freshly allocated > >>> folio (which thus needs to be removed from the direct map), vs a folio > >>> that already exists and has been removed from the direct map (probably > >>> fine to remove from direct map again), vs a folio that already exists > >>> and is currently re-inserted into the direct map for whatever reason > >>> (must not remove these from the direct map, as other parts of > >>> KVM/userspace probably don't expect the direct map entries to disappear > >>> from underneath them). I couldn't figure this one out for my series, > >>> which is why I went with hooking into the PG_uptodate logic to always > >>> remove direct map entries on freshly allocated folios. > >>> > >> > >> gmem_flags come from the owner. If the caller (in non-CoCo case) wants > > Ah, oops, I got it mixed up with the new `flags` parameter. > > >> to restore the direct map right away, it'd have to be a direct > >> operation. As an optimization, we could add option that asks for page in > >> "shared" state. If allocating new page, we can return it right away > >> without removing from direct map. If grabbing existing folio, it would > >> try to do the private->shared conversion. > > My concern is more with the implicit shared->private conversion that > happens on every call to guest_memfd_grab_folio (and thus > kvm_gmem_get_pfn) when grabbing existing folios. If something else > marked the folio as shared, then we cannot punch it out of the direct > map again until that something is done using the folio (when working on > my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were > temporarily marked shared, as I was seeing panics because of this). And > if the folio is currently private, there's nothing to do. So either way, > guest_memfd_grab_folio shouldn't touch the direct map entry for existing > folios. > What I did could be documented/commented better. If ops->accessible() is *not* provided, all guest_memfd allocations will immediately remove from direct map and treat them immediately like guest private (goal is to match what KVM does today on tip). If ops->accessible() is provided, then guest_memfd allocations start as "shared" and KVM/Gunyah need to do the shared->private conversion when they want to do the private conversion on the folio. "Shared" is the default because that is effectively a no-op. For the non-CoCo case you're interested in, we'd have the ops->accessible() provided and we wouldn't pull out the direct map from gpc. Thanks, Elliot
On 2024-08-06 07:10-0700 David Hildenbrand wrote: > > While guest_memfd is not available to be mapped by userspace, it is > > still accessible through the kernel's direct map. This means that in > > scenarios where guest-private memory is not hardware protected, it can > > be speculatively read and its contents potentially leaked through > > hardware side-channels. Removing guest-private memory from the direct > > map, thus mitigates a large class of speculative execution issues > > [1, Table 1]. > > I think you have to point out here that the speculative execution issues > are primarily only an issue when guest_memfd private memory is used > without TDX and friends where the memory would be encrypted either way. > > Or am I wrong? Actually, I'm not sure how much protection CoCo solutions offer in this regard. I'd love to hear more from Intel and AMD on this, but it looks like they are not targeting full coverage for these types of attacks (beyond protecting guest mitigation settings from manipulation by the host). For example, see this selection from AMD's 2020 whitepaper [1] on SEV-SNP: "There are certain classes of attacks that are not in scope for any of these three features. Architectural side channel attacks on CPU data structures are not specifically prevented by any hardware means. As with standard software security practices, code which is sensitive to such side channel attacks (e.g., cryptographic libraries) should be written in a way which helps prevent such attacks." And: "While SEV-SNP offers guests several options when it comes to protection from speculative side channel attacks and SMT, it is not able to protect against all possible side channel attacks. For example, traditional side channel attacks on software such as PRIME+PROBE are not protected by SEV-SNP." Intel's docs also indicate guests need to protect themselves in some cases saying, "TD software should be aware that potentially untrusted software running outside a TD may be able to influence conditional branch predictions of software running in a TD" [2] and "a TDX guest VM is no different from a legacy guest VM in terms of protecting this userspace <-> OS kernel boundary" [3]. But these focus on hardening kernel & software within the guest. What's not clear to me is what happens during transient execution when the host kernel attempts to access a page in physical memory that belongs to a guest. I assume if it only happens transiently, it will not result in a machine check like it would if the instructions were actually retired. As far as I can tell encryption happens between the CPU & main memory, so cache contents will be plaintext. This seems to leave open the possibility of the host kernel retrieving the plaintext cache contents with a transient execution attack. I assume vendors have controls in place to stop this, but Foreshadow/L1TF is a good example of one place this fell apart for SGX [4]. All that said, we're also dependent on hardware not being subject to L1TF-style issues for the currently proposed non-CoCo method to be effective. We're simply clearing the Present bit while the physmap PTE still points to the guest physical page. This was found to be exploitable across OS & VMM boundaries on Intel server parts before Cascade Lake [5] (thanks to Claudio for highlighting this). So that's a long way of saying TDX may offer similar protection, but not because of encryption. Derek [1] https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf#page=19 [2] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/trusted-domain-security-guidance-for-developers.html [3] https://intel.github.io/ccc-linux-guest-hardening-docs/security-spec.html#transient-execution-attacks-and-their-mitigation [4] https://foreshadowattack.eu/foreshadow.pdf [5] https://foreshadowattack.eu/foreshadow-NG.pdf
On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote: >>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) >>>>>> { >>>>>> + unsigned long gmem_flags = (unsigned long)file->private_data; >>>>>> struct inode *inode = file_inode(file); >>>>>> struct guest_memfd_operations *ops = inode->i_private; >>>>>> struct folio *folio; >>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags >>>>>> goto out_err; >>>>>> } >>>>>> >>>>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { >>>>>> + r = guest_memfd_folio_private(folio); >>>>>> + if (r) >>>>>> + goto out_err; >>>>>> + } >>>>>> + >>>>> >>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs >>>>> to be removed from the direct map? E.g. how can a caller know ahead of >>>>> time whether guest_memfd_grab_folio will return a freshly allocated >>>>> folio (which thus needs to be removed from the direct map), vs a folio >>>>> that already exists and has been removed from the direct map (probably >>>>> fine to remove from direct map again), vs a folio that already exists >>>>> and is currently re-inserted into the direct map for whatever reason >>>>> (must not remove these from the direct map, as other parts of >>>>> KVM/userspace probably don't expect the direct map entries to disappear >>>>> from underneath them). I couldn't figure this one out for my series, >>>>> which is why I went with hooking into the PG_uptodate logic to always >>>>> remove direct map entries on freshly allocated folios. >>>>> >>>> >>>> gmem_flags come from the owner. If the caller (in non-CoCo case) wants >> >> Ah, oops, I got it mixed up with the new `flags` parameter. >> >>>> to restore the direct map right away, it'd have to be a direct >>>> operation. As an optimization, we could add option that asks for page in >>>> "shared" state. If allocating new page, we can return it right away >>>> without removing from direct map. If grabbing existing folio, it would >>>> try to do the private->shared conversion. >> >> My concern is more with the implicit shared->private conversion that >> happens on every call to guest_memfd_grab_folio (and thus >> kvm_gmem_get_pfn) when grabbing existing folios. If something else >> marked the folio as shared, then we cannot punch it out of the direct >> map again until that something is done using the folio (when working on >> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were >> temporarily marked shared, as I was seeing panics because of this). And >> if the folio is currently private, there's nothing to do. So either way, >> guest_memfd_grab_folio shouldn't touch the direct map entry for existing >> folios. >> > > What I did could be documented/commented better. No worries, thanks for taking the time to walk me through understanding it! > If ops->accessible() is *not* provided, all guest_memfd allocations will > immediately remove from direct map and treat them immediately like guest > private (goal is to match what KVM does today on tip). Ah, so if ops->accessible() is not provided, then there will never be any shared memory inside gmem (like today, where gmem doesn't support shared memory altogether), and thus there's no problems with just unconditionally doing set_direct_map_invalid_noflush in guest_memfd_grab_folio, because all existing folios already have their direct map entry removed. Got it! > If ops->accessible() is provided, then guest_memfd allocations start > as "shared" and KVM/Gunyah need to do the shared->private conversion > when they want to do the private conversion on the folio. "Shared" is > the default because that is effectively a no-op. > For the non-CoCo case you're interested in, we'd have the > ops->accessible() provided and we wouldn't pull out the direct map from > gpc. So in pKVM/Gunyah's case, guest memory starts as shared, and at some point the guest will issue a hypercall (or similar) to flip it to private, at which point it'll get removed from the direct map? That isn't really what we want for our case. We consider the folios as private straight away, as we do not let the guest control their state at all. Everything is always "accessible" to both KVM and userspace in the sense that they can just flip gfns to shared as they please without the guest having any say in it. I think we should untangle the behavior of guest_memfd_grab_folio from the presence of ops->accessible. E.g. instead of direct map removal being dependent on ops->accessible we should have some GRAB_FOLIO_RETURN_SHARED flag for gmem_flags, which is set for y'all, and not set for us (I don't think we should have a "call set_direct_map_invalid_noflush unconditionally in guest_memfd_grab_folio" mode at all, because if sharing gmem is supported, then that is broken, and if sharing gmem is not supported then only removing direct map entries for freshly allocated folios gets us the same result of "all folios never in the direct map" while avoiding some no-op direct map operations). Because we would still use ->accessible, albeit for us that would be more for bookkeeping along the lines of "which gfns does userspace currently require to be in the direct map?". I haven't completely thought it through, but what I could see working for us would be a pair of ioctls for marking ranges accessible/inaccessible, with "accessibility" stored in some xarray (somewhat like Fuad's patches, I guess? [1]). In a world where we have a "sharing refcount", the "make accessible" ioctl reinserts into the direct map (if needed), lifts the "sharings refcount" for each folio in the given gfn range, and marks the range as accessible. And the "make inaccessible" ioctl would first check that userspace has unmapped all those gfns again, and if yes, mark them as inaccessible, drop the "sharings refcount" by 1 for each, and removes from the direct map again if it held the last reference (if userspace still has some gfns mapped, the ioctl would just fail). I guess for pKVM/Gunyah, there wouldn't be userspace ioctls, but instead the above would happen in handlers for share/unshare hypercalls. But the overall flow would be similar. The only difference is the default state of guest memory (shared for you, private for us). You want a guest_memfd_grab_folio that essentially returns folios with "sharing refcount == 1" (and thus present in the direct map), while we want the opposite. So I think something like the following should work for both of us (modulo some error handling): static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare, bool *fresh) { // as today's kvm_gmem_get_folio, except ... if (!folio_test_uptodate(folio)) { ... if (fresh) *fresh = true } ... } struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare) { bool fresh; unsigned long gmem_flags = /* ... */ struct folio *folio = __kvm_gmem_get_folio(file, index, prepare, &fresh); if (gmem_flag & GRAB_FOLIO_RETURN_SHARED != 0) { // if "sharing refcount == 0", inserts back into direct map and lifts refcount, otherwise just lifts refcount guest_memfd_folio_clear_private(folio); } else { if (fresh) guest_memfd_folio_private(folio); } return folio; } Now, thinking ahead, there's probably optimizations here where we defer the direct map manipulations to gmem_fault, at which point having a guest_memfd_grab_folio that doesn't remove direct map entries for fresh folios would be useful in our non-CoCo usecase too. But that should also be easily achievable by maybe having a flag to kvm_gmem_get_folio that forces the behavior of GRAB_FOLIO_RETURN_SHARED, indendently of whether GRAB_FOLIO_RETURN_SHARED is set in gmem_flags. How does that sound to you? [1]: https://lore.kernel.org/kvm/20240801090117.3841080-1-tabba@google.com/ > Thanks, > Elliot Best, Patrick
On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote: > On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote: > >>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) > >>>>>> { > >>>>>> + unsigned long gmem_flags = (unsigned long)file->private_data; > >>>>>> struct inode *inode = file_inode(file); > >>>>>> struct guest_memfd_operations *ops = inode->i_private; > >>>>>> struct folio *folio; > >>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags > >>>>>> goto out_err; > >>>>>> } > >>>>>> > >>>>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { > >>>>>> + r = guest_memfd_folio_private(folio); > >>>>>> + if (r) > >>>>>> + goto out_err; > >>>>>> + } > >>>>>> + > >>>>> > >>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs > >>>>> to be removed from the direct map? E.g. how can a caller know ahead of > >>>>> time whether guest_memfd_grab_folio will return a freshly allocated > >>>>> folio (which thus needs to be removed from the direct map), vs a folio > >>>>> that already exists and has been removed from the direct map (probably > >>>>> fine to remove from direct map again), vs a folio that already exists > >>>>> and is currently re-inserted into the direct map for whatever reason > >>>>> (must not remove these from the direct map, as other parts of > >>>>> KVM/userspace probably don't expect the direct map entries to disappear > >>>>> from underneath them). I couldn't figure this one out for my series, > >>>>> which is why I went with hooking into the PG_uptodate logic to always > >>>>> remove direct map entries on freshly allocated folios. > >>>>> > >>>> > >>>> gmem_flags come from the owner. If the caller (in non-CoCo case) wants > >> > >> Ah, oops, I got it mixed up with the new `flags` parameter. > >> > >>>> to restore the direct map right away, it'd have to be a direct > >>>> operation. As an optimization, we could add option that asks for page in > >>>> "shared" state. If allocating new page, we can return it right away > >>>> without removing from direct map. If grabbing existing folio, it would > >>>> try to do the private->shared conversion. > >> > >> My concern is more with the implicit shared->private conversion that > >> happens on every call to guest_memfd_grab_folio (and thus > >> kvm_gmem_get_pfn) when grabbing existing folios. If something else > >> marked the folio as shared, then we cannot punch it out of the direct > >> map again until that something is done using the folio (when working on > >> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were > >> temporarily marked shared, as I was seeing panics because of this). And > >> if the folio is currently private, there's nothing to do. So either way, > >> guest_memfd_grab_folio shouldn't touch the direct map entry for existing > >> folios. > >> > > > > What I did could be documented/commented better. > > No worries, thanks for taking the time to walk me through understanding > it! > > > If ops->accessible() is *not* provided, all guest_memfd allocations will > > immediately remove from direct map and treat them immediately like guest > > private (goal is to match what KVM does today on tip). > > Ah, so if ops->accessible() is not provided, then there will never be > any shared memory inside gmem (like today, where gmem doesn't support > shared memory altogether), and thus there's no problems with just > unconditionally doing set_direct_map_invalid_noflush in > guest_memfd_grab_folio, because all existing folios already have their > direct map entry removed. Got it! > > > If ops->accessible() is provided, then guest_memfd allocations start > > as "shared" and KVM/Gunyah need to do the shared->private conversion > > when they want to do the private conversion on the folio. "Shared" is > > the default because that is effectively a no-op. > > For the non-CoCo case you're interested in, we'd have the > > ops->accessible() provided and we wouldn't pull out the direct map from > > gpc. > > So in pKVM/Gunyah's case, guest memory starts as shared, and at some > point the guest will issue a hypercall (or similar) to flip it to > private, at which point it'll get removed from the direct map? > > That isn't really what we want for our case. We consider the folios as > private straight away, as we do not let the guest control their state at > all. Everything is always "accessible" to both KVM and userspace in the > sense that they can just flip gfns to shared as they please without the > guest having any say in it. > > I think we should untangle the behavior of guest_memfd_grab_folio from > the presence of ops->accessible. E.g. instead of direct map removal > being dependent on ops->accessible we should have some > GRAB_FOLIO_RETURN_SHARED flag for gmem_flags, which is set for y'all, > and not set for us (I don't think we should have a "call > set_direct_map_invalid_noflush unconditionally in > guest_memfd_grab_folio" mode at all, because if sharing gmem is > supported, then that is broken, and if sharing gmem is not supported > then only removing direct map entries for freshly allocated folios gets > us the same result of "all folios never in the direct map" while > avoiding some no-op direct map operations). > > Because we would still use ->accessible, albeit for us that would be > more for bookkeeping along the lines of "which gfns does userspace > currently require to be in the direct map?". I haven't completely > thought it through, but what I could see working for us would be a pair > of ioctls for marking ranges accessible/inaccessible, with > "accessibility" stored in some xarray (somewhat like Fuad's patches, I > guess? [1]). > > In a world where we have a "sharing refcount", the "make accessible" > ioctl reinserts into the direct map (if needed), lifts the "sharings > refcount" for each folio in the given gfn range, and marks the range as > accessible. And the "make inaccessible" ioctl would first check that > userspace has unmapped all those gfns again, and if yes, mark them as > inaccessible, drop the "sharings refcount" by 1 for each, and removes > from the direct map again if it held the last reference (if userspace > still has some gfns mapped, the ioctl would just fail). > I am warming up to the sharing refcount idea. How does the sharing refcount look for kvm gpc? > I guess for pKVM/Gunyah, there wouldn't be userspace ioctls, but instead > the above would happen in handlers for share/unshare hypercalls. But the > overall flow would be similar. The only difference is the default state > of guest memory (shared for you, private for us). You want a > guest_memfd_grab_folio that essentially returns folios with "sharing > refcount == 1" (and thus present in the direct map), while we want the > opposite. > > So I think something like the following should work for both of us > (modulo some error handling): > > static struct folio *__kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare, bool *fresh) > { > // as today's kvm_gmem_get_folio, except > ... > if (!folio_test_uptodate(folio)) { > ... > if (fresh) > *fresh = true > } > ... > } > > struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index, bool prepare) > { > bool fresh; > unsigned long gmem_flags = /* ... */ > struct folio *folio = __kvm_gmem_get_folio(file, index, prepare, &fresh); > if (gmem_flag & GRAB_FOLIO_RETURN_SHARED != 0) { > // if "sharing refcount == 0", inserts back into direct map and lifts refcount, otherwise just lifts refcount > guest_memfd_folio_clear_private(folio); > } else { > if (fresh) > guest_memfd_folio_private(folio); > } > return folio; > } > > Now, thinking ahead, there's probably optimizations here where we defer > the direct map manipulations to gmem_fault, at which point having a > guest_memfd_grab_folio that doesn't remove direct map entries for fresh > folios would be useful in our non-CoCo usecase too. But that should also > be easily achievable by maybe having a flag to kvm_gmem_get_folio that > forces the behavior of GRAB_FOLIO_RETURN_SHARED, indendently of whether > GRAB_FOLIO_RETURN_SHARED is set in gmem_flags. > > How does that sound to you? > Yeah, I think this is a good idea. I'm also thinking to make a few tweaks to the ops structure: struct guest_memfd_operations { int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr); void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr); int (*prepare_accessible)(struct inode *inode, struct folio *folio); int (*prepare_private)(struct inode *inode, struct folio *folio); int (*release)(struct inode *inode); }; When grabbing a folio, we'd always call either prepare_accessible() or prepare_private() based on GRAB_FOLIO_RETURN_SHARED. In the prepare_private() case, guest_memfd can also ensure the folio is unmapped and not pinned. If userspace tries to grab the folio in pKVM/Gunyah case, prepare_accessible() will fail and grab_folio returns error. There's a lot of details I'm glossing over, but I hope it gives some brief idea of the direction I was thinking. In some cases, prepare_accessible() and the invalidate_*() functions might effectively be the same thing, except that invalidate_*() could operate on a range larger-than-a-folio. That would be useful becase we might offer optimization to reclaim a batch of pages versus e.g. flushing caches every page. Thanks, Elliot
On Thu, 2024-08-08 at 23:16 +0100, Elliot Berman wrote > On Thu, Aug 08, 2024 at 02:05:55PM +0100, Patrick Roy wrote: >> On Wed, 2024-08-07 at 20:06 +0100, Elliot Berman wrote: >>>>>>>> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) >>>>>>>> { >>>>>>>> + unsigned long gmem_flags = (unsigned long)file->private_data; >>>>>>>> struct inode *inode = file_inode(file); >>>>>>>> struct guest_memfd_operations *ops = inode->i_private; >>>>>>>> struct folio *folio; >>>>>>>> @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags >>>>>>>> goto out_err; >>>>>>>> } >>>>>>>> >>>>>>>> + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { >>>>>>>> + r = guest_memfd_folio_private(folio); >>>>>>>> + if (r) >>>>>>>> + goto out_err; >>>>>>>> + } >>>>>>>> + >>>>>>> >>>>>>> How does a caller of guest_memfd_grab_folio know whether a folio needs >>>>>>> to be removed from the direct map? E.g. how can a caller know ahead of >>>>>>> time whether guest_memfd_grab_folio will return a freshly allocated >>>>>>> folio (which thus needs to be removed from the direct map), vs a folio >>>>>>> that already exists and has been removed from the direct map (probably >>>>>>> fine to remove from direct map again), vs a folio that already exists >>>>>>> and is currently re-inserted into the direct map for whatever reason >>>>>>> (must not remove these from the direct map, as other parts of >>>>>>> KVM/userspace probably don't expect the direct map entries to disappear >>>>>>> from underneath them). I couldn't figure this one out for my series, >>>>>>> which is why I went with hooking into the PG_uptodate logic to always >>>>>>> remove direct map entries on freshly allocated folios. >>>>>>> >>>>>> >>>>>> gmem_flags come from the owner. If the caller (in non-CoCo case) wants >>>> >>>> Ah, oops, I got it mixed up with the new `flags` parameter. >>>> >>>>>> to restore the direct map right away, it'd have to be a direct >>>>>> operation. As an optimization, we could add option that asks for page in >>>>>> "shared" state. If allocating new page, we can return it right away >>>>>> without removing from direct map. If grabbing existing folio, it would >>>>>> try to do the private->shared conversion. >>>> >>>> My concern is more with the implicit shared->private conversion that >>>> happens on every call to guest_memfd_grab_folio (and thus >>>> kvm_gmem_get_pfn) when grabbing existing folios. If something else >>>> marked the folio as shared, then we cannot punch it out of the direct >>>> map again until that something is done using the folio (when working on >>>> my RFC, kvm_gmem_get_pfn was indeed called on existing folios that were >>>> temporarily marked shared, as I was seeing panics because of this). And >>>> if the folio is currently private, there's nothing to do. So either way, >>>> guest_memfd_grab_folio shouldn't touch the direct map entry for existing >>>> folios. >>>> >>> >>> What I did could be documented/commented better. >> >> No worries, thanks for taking the time to walk me through understanding >> it! >> >>> If ops->accessible() is *not* provided, all guest_memfd allocations will >>> immediately remove from direct map and treat them immediately like guest >>> private (goal is to match what KVM does today on tip). >> >> Ah, so if ops->accessible() is not provided, then there will never be >> any shared memory inside gmem (like today, where gmem doesn't support >> shared memory altogether), and thus there's no problems with just >> unconditionally doing set_direct_map_invalid_noflush in >> guest_memfd_grab_folio, because all existing folios already have their >> direct map entry removed. Got it! >> >>> If ops->accessible() is provided, then guest_memfd allocations start >>> as "shared" and KVM/Gunyah need to do the shared->private conversion >>> when they want to do the private conversion on the folio. "Shared" is >>> the default because that is effectively a no-op. >>> For the non-CoCo case you're interested in, we'd have the >>> ops->accessible() provided and we wouldn't pull out the direct map from >>> gpc. >> >> So in pKVM/Gunyah's case, guest memory starts as shared, and at some >> point the guest will issue a hypercall (or similar) to flip it to >> private, at which point it'll get removed from the direct map? >> >> That isn't really what we want for our case. We consider the folios as >> private straight away, as we do not let the guest control their state at >> all. Everything is always "accessible" to both KVM and userspace in the >> sense that they can just flip gfns to shared as they please without the >> guest having any say in it. >> >> I think we should untangle the behavior of guest_memfd_grab_folio from >> the presence of ops->accessible. E.g. instead of direct map removal >> being dependent on ops->accessible we should have some >> GRAB_FOLIO_RETURN_SHARED flag for gmem_flags, which is set for y'all, >> and not set for us (I don't think we should have a "call >> set_direct_map_invalid_noflush unconditionally in >> guest_memfd_grab_folio" mode at all, because if sharing gmem is >> supported, then that is broken, and if sharing gmem is not supported >> then only removing direct map entries for freshly allocated folios gets >> us the same result of "all folios never in the direct map" while >> avoiding some no-op direct map operations). >> >> Because we would still use ->accessible, albeit for us that would be >> more for bookkeeping along the lines of "which gfns does userspace >> currently require to be in the direct map?". I haven't completely >> thought it through, but what I could see working for us would be a pair >> of ioctls for marking ranges accessible/inaccessible, with >> "accessibility" stored in some xarray (somewhat like Fuad's patches, I >> guess? [1]). >> >> In a world where we have a "sharing refcount", the "make accessible" >> ioctl reinserts into the direct map (if needed), lifts the "sharings >> refcount" for each folio in the given gfn range, and marks the range as >> accessible. And the "make inaccessible" ioctl would first check that >> userspace has unmapped all those gfns again, and if yes, mark them as >> inaccessible, drop the "sharings refcount" by 1 for each, and removes >> from the direct map again if it held the last reference (if userspace >> still has some gfns mapped, the ioctl would just fail). >> > > I am warming up to the sharing refcount idea. How does the sharing > refcount look for kvm gpc? I've come up with the below rough draft (written as a new commit on top of my RFC series [1], with some bits from your patch copied in). With this, I was able to actually boot a Firecracker VM with multiple vCPUs (which previously didn't work because of different vCPUs putting their kvm-clock structures into the same guest page). Best, Patrick [1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/T/#ma44793da6bc000a2c22b1ffe37292b9615881838 --- From 2005c5a06b8a8f8568e9140b275d2c219488a71a Mon Sep 17 00:00:00 2001 From: Patrick Roy <roypat@amazon.co.uk> Date: Fri, 9 Aug 2024 15:13:08 +0100 Subject: [RFC PATCH 009/008] kvm: gmem: Introduce "sharing refcount" The assumption that there would never be two gfn_to_pfn_caches holding the same gfn was wrong. The guest can put the kvm-clock structures for different vCPUs into the same gfn. On KVM's side, one gfn_to_pfn_cache is initialized by vCPU, meaning in multi-vCPU setups, multiple gfn_to_pfn_caches to the same gfn exist. For gmem, this means that multiple gfn_to_pfn_caches will want direct map entries for the same page to be present - the direct map entry needs to be removed when the first gpc is initialized, and can only be removed again after the last gpc to this page is invalidated. To handle this, introduce the concept of a "sharing refcount" to gmem: If something inside of KVM wants to access gmem it should now do struct folio *gmem_folio = /* ... */; int r = kvm_gmem_folio_share(gmem_folio); if (r) goto err; /* do stuff */ kvm_gmem_folio_unshare(gmem_folio); The first call to kvm_gmem_folio_share will increment this new "sharing refcount" by 1 (and insert the folio back into the direct map if it acquires the first refcount), while kvm_gmem_folio_unshare will decrement the refcount by 1 (and remove the folio from the direct map again if it held the last refcount). One quirk is that we use "sharing_refcount == 1" to mean "folio is not in the direct map" (aka not shared), as letting the refcount temporarily drop to 0 would cause refcount_t functions to WARN. Signed-off-by: Patrick Roy <roypat@amazon.co.uk> --- virt/kvm/guest_memfd.c | 139 ++++++++++++++++++++++++++++++++++++++--- virt/kvm/kvm_main.c | 32 ++++------ virt/kvm/kvm_mm.h | 2 + virt/kvm/pfncache.c | 54 ++-------------- 4 files changed, 148 insertions(+), 79 deletions(-) diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 29abbc883c73a..05fd6149c11c2 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -55,6 +55,96 @@ static bool kvm_gmem_not_present(struct inode *inode) return ((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) != 0; } +static int kvm_gmem_folio_private(struct folio* folio) +{ + unsigned long nr_pages = folio_nr_pages(folio); + unsigned long i; + int r; + + /* + * We must only remove direct map entries after the last "sharing + * reference" has gone away. + */ + if(WARN_ON_ONCE(refcount_read(folio_get_private(folio)) != 1)) + return -EPERM; + + for (i = 0; i < nr_pages; i++) { + struct page *page = folio_page(folio, i); + + r = set_direct_map_invalid_noflush(page); + if (r < 0) + goto out_remap; + } + + // We use the private flag to track whether the folio has been removed + // from the direct map. This is because inside of ->free_folio, + // we do not have access to the address_space anymore, meaning we + // cannot check folio_inode(folio)->i_private to determine whether + // KVM_GMEM_NO_DIRECT_MAP was set. + folio_set_private(folio); + return 0; +out_remap: + for (; i > 0; i--) { + struct page *page = folio_page(folio, i - 1); + + BUG_ON(set_direct_map_default_noflush(page)); + } + return r; +} + +static int kvm_gmem_folio_clear_private(struct folio *folio) +{ + unsigned long start = (unsigned long)folio_address(folio); + unsigned long nr = folio_nr_pages(folio); + unsigned long i; + int r; + + /* + * We must restore direct map entries on acquiring the first "sharing + * reference" (although restoring it before that is fine too - we + * restore direct map entries with sharing_refcount == 1 in + * kvm_gmem_invalidate_folio). + */ + WARN_ON_ONCE(refcount_read(folio_get_private(folio)) > 2); + + for (i = 0; i < nr; i++) { + struct page *page = folio_page(folio, i); + + r = set_direct_map_default_noflush(page); + if (r) + goto out_remap; + } + flush_tlb_kernel_range(start, start + folio_size(folio)); + + folio_clear_private(folio); + return 0; +out_remap: + for (; i > 0; i--) { + for (; i > 0; i--) { + struct page *page = folio_page(folio, i - 1); + + BUG_ON(set_direct_map_invalid_noflush(page)); + } + } + return r; +} + +static int kvm_gmem_init_sharing_count(struct folio *folio) +{ + refcount_t *sharing_count = kmalloc(sizeof(*sharing_count), GFP_KERNEL); + if (!sharing_count) + return -ENOMEM; + + /* + * we need to use sharing_count == 1 to mean "no sharing", because dropping + * a refcount_t to 0 and later inc-ing it again would result in a WARN + */ + refcount_set(sharing_count, 1); + folio_change_private(folio, (void *)sharing_count); + + return 0; +} + static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool prepare) { struct folio *folio; @@ -96,16 +186,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index, bool } if (zap_direct_map) { - r = set_direct_map_invalid_noflush(&folio->page); + r = kvm_gmem_init_sharing_count(folio); + if (r < 0) + goto out_err; + r = kvm_gmem_folio_private(folio); if (r < 0) goto out_err; - - // We use the private flag to track whether the folio has been removed - // from the direct map. This is because inside of ->free_folio, - // we do not have access to the address_space anymore, meaning we - // cannot check folio_inode(folio)->i_private to determine whether - // KVM_GMEM_NO_DIRECT_MAP was set. - folio_set_private(folio); } /* @@ -413,11 +499,21 @@ static void kvm_gmem_free_folio(struct folio *folio) static void kvm_gmem_invalidate_folio(struct folio *folio, size_t start, size_t end) { if (start == 0 && end == PAGE_SIZE) { + refcount_t *sharing_count = folio_get_private(folio); + /* + * sharing_count != 1 means that something else forgot + * to call kvm_gmem_folio_unshare after it was done with the + * folio (meaning the folio has been in the direct map + * this entire time, which means we haven't been getting the + * spculation protection we wanted). + */ + WARN_ON_ONCE(refcount_read(sharing_count) != 1); + // We only get here if PG_private is set, which only happens if kvm_gmem_not_present // returned true in kvm_gmem_get_folio. Thus no need to do that check again. - BUG_ON(set_direct_map_default_noflush(&folio->page)); + kvm_gmem_folio_clear_private(folio); + kfree(sharing_count); - folio_clear_private(folio); } } @@ -610,6 +706,29 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot) fput(file); } +int kvm_gmem_folio_unshare(struct folio *folio) +{ + if (kvm_gmem_not_present(folio_inode(folio))) { + refcount_t *sharing_count = folio_get_private(folio); + + refcount_dec(sharing_count); + if (refcount_read(sharing_count) == 1) + return kvm_gmem_folio_private(folio); + } + return 0; +} + +int kvm_gmem_folio_share(struct folio *folio) +{ + if (kvm_gmem_not_present(folio_inode(folio))) { + refcount_inc(folio_get_private(folio)); + + if (folio_test_private(folio)) + return kvm_gmem_folio_clear_private(folio); + } + return 0; +} + static int __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prepare) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 762decd9f2da0..d0680564ad52f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3301,17 +3301,13 @@ static int __kvm_read_guest_private_page(struct kvm *kvm, folio = pfn_folio(pfn); folio_lock(folio); kaddr = folio_address(folio); - if (folio_test_private(folio)) { - r = set_direct_map_default_noflush(&folio->page); - if (r) - goto out_unlock; - } + r = kvm_gmem_folio_share(folio); + if (r) + goto out_unlock; memcpy(data, kaddr + offset, len); - if (folio_test_private(folio)) { - r = set_direct_map_invalid_noflush(&folio->page); - if (r) - goto out_unlock; - } + r = kvm_gmem_folio_unshare(folio); + if (r) + goto out_unlock; out_unlock: folio_unlock(folio); folio_put(folio); @@ -3458,17 +3454,13 @@ static int __kvm_write_guest_private_page(struct kvm *kvm, folio = pfn_folio(pfn); folio_lock(folio); kaddr = folio_address(folio); - if (folio_test_private(folio)) { - r = set_direct_map_default_noflush(&folio->page); - if (r) - goto out_unlock; - } + r = kvm_gmem_folio_share(folio); + if (r) + goto out_unlock; memcpy(kaddr + offset, data, len); - if (folio_test_private(folio)) { - r = set_direct_map_invalid_noflush(&folio->page); - if (r) - goto out_unlock; - } + r = kvm_gmem_folio_unshare(folio); + if (r) + goto out_unlock; out_unlock: folio_unlock(folio); diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 715f19669d01f..f3fb31a39a66f 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -41,6 +41,8 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args); int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot, unsigned int fd, loff_t offset); void kvm_gmem_unbind(struct kvm_memory_slot *slot); +int kvm_gmem_folio_share(struct folio *folio); +int kvm_gmem_folio_unshare(struct folio *folio); #else static inline void kvm_gmem_init(struct module *module) { diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 55f39fd60f8af..9f955e07efb90 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -111,45 +111,8 @@ static int gpc_map_gmem(kvm_pfn_t pfn) if (((unsigned long)inode->i_private & KVM_GMEM_NO_DIRECT_MAP) == 0) goto out; - /* We need to avoid race conditions where set_memory_np is called for - * pages that other parts of KVM still try to access. We use the - * PG_private bit for this. If it is set, then the page is removed from - * the direct map. If it is cleared, the page is present in the direct - * map. All changes to this bit, and all modifications of the direct - * map entries for the page happen under the page lock. The _only_ - * place where a page will be in the direct map while the page lock is - * _not_ held is if it is inside a gpc. All other parts of KVM that - * temporarily re-insert gmem pages into the direct map (currently only - * guest_{read,write}_page) take the page lock before the direct map - * entry is restored, and hold it until it is zapped again. This means - * - If we reach gpc_map while, say, guest_read_page is operating on - * this page, we block on acquiring the page lock until - * guest_read_page is done. - * - If we reach gpc_map while another gpc is already caching this - * page, the page is present in the direct map and the PG_private - * flag is cleared. Int his case, we return -EINVAL below to avoid - * two gpcs caching the same page (since we do not ref-count - * insertions back into the direct map, when the first cache gets - * invalidated it would "break" the second cache that assumes the - * page is present in the direct map until the second cache itself - * gets invalidated). - * - Lastly, if guest_read_page is called for a page inside of a gpc, - * it will see that the PG_private flag is cleared, and thus assume - * it is present in the direct map (and leave the direct map entry - * untouched). Since it will be holding the page lock, it cannot race - * with gpc_unmap. - */ folio_lock(folio); - if (folio_test_private(folio)) { - r = set_direct_map_default_noflush(&folio->page); - if (r) - goto out_unlock; - - folio_clear_private(folio); - } else { - r = -EINVAL; - } -out_unlock: + r = kvm_gmem_folio_share(folio); folio_unlock(folio); out: return r; @@ -181,17 +144,10 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva, bool private) if (pfn_valid(pfn)) { if (private) { struct folio *folio = pfn_folio(pfn); - struct inode *inode = folio_inode(folio); - - if ((unsigned long)inode->i_private & - KVM_GMEM_NO_DIRECT_MAP) { - folio_lock(folio); - BUG_ON(folio_test_private(folio)); - BUG_ON(set_direct_map_invalid_noflush( - &folio->page)); - folio_set_private(folio); - folio_unlock(folio); - } + + folio_lock(folio); + kvm_gmem_folio_unshare(folio); + folio_unlock(folio); } kunmap(pfn_to_page(pfn)); return; -- 2.46.0
On 2024-08-07 17:16-0700 Derek Manwaring wrote: > All that said, we're also dependent on hardware not being subject to > L1TF-style issues for the currently proposed non-CoCo method to be > effective. We're simply clearing the Present bit while the physmap PTE > still points to the guest physical page. I was wrong here. The set_direct_map_invalid_noflush implementation moves through __change_page_attr and pfn_pte, eventually arriving at flip_protnone_guard where the PFN is inverted & thus no longer valid for pages marked not present. So we do benefit from that prior work's extra protection against L1TF. Thank you for finding this, Patrick. Derek
On Mon, Aug 05, 2024 at 11:34:49AM -0700, Elliot Berman wrote: > This patch was reworked from Patrick's patch: > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ > > While guest_memfd is not available to be mapped by userspace, it is > still accessible through the kernel's direct map. This means that in > scenarios where guest-private memory is not hardware protected, it can > be speculatively read and its contents potentially leaked through > hardware side-channels. Removing guest-private memory from the direct > map, thus mitigates a large class of speculative execution issues > [1, Table 1]. > > Direct map removal do not reuse the `.prepare` machinery, since > `prepare` can be called multiple time, and it is the responsibility of > the preparation routine to not "prepare" the same folio twice [2]. Thus, > instead explicitly check if `filemap_grab_folio` allocated a new folio, > and remove the returned folio from the direct map only if this was the > case. > > The patch uses release_folio instead of free_folio to reinsert pages > back into the direct map as by the time free_folio is called, > folio->mapping can already be NULL. This means that a call to > folio_inode inside free_folio might deference a NULL pointer, leaving no > way to access the inode which stores the flags that allow determining > whether the page was removed from the direct map in the first place. > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > Cc: Patrick Roy <roypat@amazon.co.uk> > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> > --- > include/linux/guest_memfd.h | 8 ++++++ > mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 72 insertions(+), 1 deletion(-) > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h > index be56d9d53067..f9e4a27aed67 100644 > --- a/include/linux/guest_memfd.h > +++ b/include/linux/guest_memfd.h > @@ -25,6 +25,14 @@ struct guest_memfd_operations { > int (*release)(struct inode *inode); > }; > > +/** > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also > + * remove them from the kernel's direct map. > + */ > +enum { please name this enum, otherwise kernel-doc wont' be happy > + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), > +}; > + > /** > * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. > * If trusted hyp will do it, can ommit this flag > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c > index 580138b0f9d4..e9d8cab72b28 100644 > --- a/mm/guest_memfd.c > +++ b/mm/guest_memfd.c > @@ -7,9 +7,55 @@ > #include <linux/falloc.h> > #include <linux/guest_memfd.h> > #include <linux/pagemap.h> > +#include <linux/set_memory.h> > + > +static inline int guest_memfd_folio_private(struct folio *folio) > +{ > + unsigned long nr_pages = folio_nr_pages(folio); > + unsigned long i; > + int r; > + > + for (i = 0; i < nr_pages; i++) { > + struct page *page = folio_page(folio, i); > + > + r = set_direct_map_invalid_noflush(page); > + if (r < 0) > + goto out_remap; > + } > + > + folio_set_private(folio); > + return 0; > +out_remap: > + for (; i > 0; i--) { > + struct page *page = folio_page(folio, i - 1); > + > + BUG_ON(set_direct_map_default_noflush(page)); > + } > + return r; > +} > + > +static inline void guest_memfd_folio_clear_private(struct folio *folio) > +{ > + unsigned long start = (unsigned long)folio_address(folio); > + unsigned long nr = folio_nr_pages(folio); > + unsigned long i; > + > + if (!folio_test_private(folio)) > + return; > + > + for (i = 0; i < nr; i++) { > + struct page *page = folio_page(folio, i); > + > + BUG_ON(set_direct_map_default_noflush(page)); > + } > + flush_tlb_kernel_range(start, start + folio_size(folio)); I think that TLB flush should come after removing pages from the direct map rather than after adding them back. > + > + folio_clear_private(folio); > +} > > struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) > { > + unsigned long gmem_flags = (unsigned long)file->private_data; > struct inode *inode = file_inode(file); > struct guest_memfd_operations *ops = inode->i_private; > struct folio *folio;
On Mon, Aug 19, 2024 at 01:09:52PM +0300, Mike Rapoport wrote: > On Mon, Aug 05, 2024 at 11:34:49AM -0700, Elliot Berman wrote: > > This patch was reworked from Patrick's patch: > > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ > > > > While guest_memfd is not available to be mapped by userspace, it is > > still accessible through the kernel's direct map. This means that in > > scenarios where guest-private memory is not hardware protected, it can > > be speculatively read and its contents potentially leaked through > > hardware side-channels. Removing guest-private memory from the direct > > map, thus mitigates a large class of speculative execution issues > > [1, Table 1]. > > > > Direct map removal do not reuse the `.prepare` machinery, since > > `prepare` can be called multiple time, and it is the responsibility of > > the preparation routine to not "prepare" the same folio twice [2]. Thus, > > instead explicitly check if `filemap_grab_folio` allocated a new folio, > > and remove the returned folio from the direct map only if this was the > > case. > > > > The patch uses release_folio instead of free_folio to reinsert pages > > back into the direct map as by the time free_folio is called, > > folio->mapping can already be NULL. This means that a call to > > folio_inode inside free_folio might deference a NULL pointer, leaving no > > way to access the inode which stores the flags that allow determining > > whether the page was removed from the direct map in the first place. > > > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > > > Cc: Patrick Roy <roypat@amazon.co.uk> > > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> > > --- > > include/linux/guest_memfd.h | 8 ++++++ > > mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- > > 2 files changed, 72 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h > > index be56d9d53067..f9e4a27aed67 100644 > > --- a/include/linux/guest_memfd.h > > +++ b/include/linux/guest_memfd.h > > @@ -25,6 +25,14 @@ struct guest_memfd_operations { > > int (*release)(struct inode *inode); > > }; > > > > +/** > > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also > > + * remove them from the kernel's direct map. > > + */ > > +enum { > > please name this enum, otherwise kernel-doc wont' be happy > > > + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), > > +}; > > + > > /** > > * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. > > * If trusted hyp will do it, can ommit this flag > > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c > > index 580138b0f9d4..e9d8cab72b28 100644 > > --- a/mm/guest_memfd.c > > +++ b/mm/guest_memfd.c > > @@ -7,9 +7,55 @@ > > #include <linux/falloc.h> > > #include <linux/guest_memfd.h> > > #include <linux/pagemap.h> > > +#include <linux/set_memory.h> > > + > > +static inline int guest_memfd_folio_private(struct folio *folio) > > +{ > > + unsigned long nr_pages = folio_nr_pages(folio); > > + unsigned long i; > > + int r; > > + > > + for (i = 0; i < nr_pages; i++) { > > + struct page *page = folio_page(folio, i); > > + > > + r = set_direct_map_invalid_noflush(page); > > + if (r < 0) > > + goto out_remap; > > + } > > + > > + folio_set_private(folio); > > + return 0; > > +out_remap: > > + for (; i > 0; i--) { > > + struct page *page = folio_page(folio, i - 1); > > + > > + BUG_ON(set_direct_map_default_noflush(page)); > > + } > > + return r; > > +} > > + > > +static inline void guest_memfd_folio_clear_private(struct folio *folio) > > +{ > > + unsigned long start = (unsigned long)folio_address(folio); > > + unsigned long nr = folio_nr_pages(folio); > > + unsigned long i; > > + > > + if (!folio_test_private(folio)) > > + return; > > + > > + for (i = 0; i < nr; i++) { > > + struct page *page = folio_page(folio, i); > > + > > + BUG_ON(set_direct_map_default_noflush(page)); > > + } > > + flush_tlb_kernel_range(start, start + folio_size(folio)); > > I think that TLB flush should come after removing pages from the direct map > rather than after adding them back. > Gunyah flushes the tlb when it removes the stage 2 mapping, so we skipped it on removal as a performance optimization. I remember seeing that pKVM does the same (tlb flush for the stage 2 unmap & the equivalent for x86). Patrick had also done the same in their patches. Thanks, Elliot
On Tue, Aug 20, 2024 at 09:56:10AM -0700, Elliot Berman wrote: > On Mon, Aug 19, 2024 at 01:09:52PM +0300, Mike Rapoport wrote: > > On Mon, Aug 05, 2024 at 11:34:49AM -0700, Elliot Berman wrote: > > > This patch was reworked from Patrick's patch: > > > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ > > > > > > While guest_memfd is not available to be mapped by userspace, it is > > > still accessible through the kernel's direct map. This means that in > > > scenarios where guest-private memory is not hardware protected, it can > > > be speculatively read and its contents potentially leaked through > > > hardware side-channels. Removing guest-private memory from the direct > > > map, thus mitigates a large class of speculative execution issues > > > [1, Table 1]. > > > > > > Direct map removal do not reuse the `.prepare` machinery, since > > > `prepare` can be called multiple time, and it is the responsibility of > > > the preparation routine to not "prepare" the same folio twice [2]. Thus, > > > instead explicitly check if `filemap_grab_folio` allocated a new folio, > > > and remove the returned folio from the direct map only if this was the > > > case. > > > > > > The patch uses release_folio instead of free_folio to reinsert pages > > > back into the direct map as by the time free_folio is called, > > > folio->mapping can already be NULL. This means that a call to > > > folio_inode inside free_folio might deference a NULL pointer, leaving no > > > way to access the inode which stores the flags that allow determining > > > whether the page was removed from the direct map in the first place. > > > > > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf > > > > > > Cc: Patrick Roy <roypat@amazon.co.uk> > > > Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> > > > --- > > > include/linux/guest_memfd.h | 8 ++++++ > > > mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- > > > 2 files changed, 72 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h > > > index be56d9d53067..f9e4a27aed67 100644 > > > --- a/include/linux/guest_memfd.h > > > +++ b/include/linux/guest_memfd.h > > > @@ -25,6 +25,14 @@ struct guest_memfd_operations { > > > int (*release)(struct inode *inode); > > > }; > > > > > > +/** > > > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also > > > + * remove them from the kernel's direct map. > > > + */ > > > +enum { > > > > please name this enum, otherwise kernel-doc wont' be happy > > > > > + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), > > > +}; > > > + > > > /** > > > * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. > > > * If trusted hyp will do it, can ommit this flag > > > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c > > > index 580138b0f9d4..e9d8cab72b28 100644 > > > --- a/mm/guest_memfd.c > > > +++ b/mm/guest_memfd.c > > > @@ -7,9 +7,55 @@ > > > #include <linux/falloc.h> > > > #include <linux/guest_memfd.h> > > > #include <linux/pagemap.h> > > > +#include <linux/set_memory.h> > > > + > > > +static inline int guest_memfd_folio_private(struct folio *folio) > > > +{ > > > + unsigned long nr_pages = folio_nr_pages(folio); > > > + unsigned long i; > > > + int r; > > > + > > > + for (i = 0; i < nr_pages; i++) { > > > + struct page *page = folio_page(folio, i); > > > + > > > + r = set_direct_map_invalid_noflush(page); > > > + if (r < 0) > > > + goto out_remap; > > > + } > > > + > > > + folio_set_private(folio); > > > + return 0; > > > +out_remap: > > > + for (; i > 0; i--) { > > > + struct page *page = folio_page(folio, i - 1); > > > + > > > + BUG_ON(set_direct_map_default_noflush(page)); > > > + } > > > + return r; > > > +} > > > + > > > +static inline void guest_memfd_folio_clear_private(struct folio *folio) > > > +{ > > > + unsigned long start = (unsigned long)folio_address(folio); > > > + unsigned long nr = folio_nr_pages(folio); > > > + unsigned long i; > > > + > > > + if (!folio_test_private(folio)) > > > + return; > > > + > > > + for (i = 0; i < nr; i++) { > > > + struct page *page = folio_page(folio, i); > > > + > > > + BUG_ON(set_direct_map_default_noflush(page)); > > > + } > > > + flush_tlb_kernel_range(start, start + folio_size(folio)); > > > > I think that TLB flush should come after removing pages from the direct map > > rather than after adding them back. > > > > Gunyah flushes the tlb when it removes the stage 2 mapping, so we > skipped it on removal as a performance optimization. I remember seeing > that pKVM does the same (tlb flush for the stage 2 unmap & the > equivalent for x86). Patrick had also done the same in their patches. Strictly from the API perspective, unmapping the pages from the direct map would imply removing potentially stale TLB entries. If all currently anticipated users do it elsewhere, at the very least there should be a huge bold comment. And what's the point of tlb flush after setting the direct map to default? There should not be stale tlb entries for the unmapped pages. > Thanks, > Elliot
diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h index be56d9d53067..f9e4a27aed67 100644 --- a/include/linux/guest_memfd.h +++ b/include/linux/guest_memfd.h @@ -25,6 +25,14 @@ struct guest_memfd_operations { int (*release)(struct inode *inode); }; +/** + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible by host, also + * remove them from the kernel's direct map. + */ +enum { + GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0), +}; + /** * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date. * If trusted hyp will do it, can ommit this flag diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c index 580138b0f9d4..e9d8cab72b28 100644 --- a/mm/guest_memfd.c +++ b/mm/guest_memfd.c @@ -7,9 +7,55 @@ #include <linux/falloc.h> #include <linux/guest_memfd.h> #include <linux/pagemap.h> +#include <linux/set_memory.h> + +static inline int guest_memfd_folio_private(struct folio *folio) +{ + unsigned long nr_pages = folio_nr_pages(folio); + unsigned long i; + int r; + + for (i = 0; i < nr_pages; i++) { + struct page *page = folio_page(folio, i); + + r = set_direct_map_invalid_noflush(page); + if (r < 0) + goto out_remap; + } + + folio_set_private(folio); + return 0; +out_remap: + for (; i > 0; i--) { + struct page *page = folio_page(folio, i - 1); + + BUG_ON(set_direct_map_default_noflush(page)); + } + return r; +} + +static inline void guest_memfd_folio_clear_private(struct folio *folio) +{ + unsigned long start = (unsigned long)folio_address(folio); + unsigned long nr = folio_nr_pages(folio); + unsigned long i; + + if (!folio_test_private(folio)) + return; + + for (i = 0; i < nr; i++) { + struct page *page = folio_page(folio, i); + + BUG_ON(set_direct_map_default_noflush(page)); + } + flush_tlb_kernel_range(start, start + folio_size(folio)); + + folio_clear_private(folio); +} struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags) { + unsigned long gmem_flags = (unsigned long)file->private_data; struct inode *inode = file_inode(file); struct guest_memfd_operations *ops = inode->i_private; struct folio *folio; @@ -43,6 +89,12 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags goto out_err; } + if (gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP) { + r = guest_memfd_folio_private(folio); + if (r) + goto out_err; + } + /* * Ignore accessed, referenced, and dirty flags. The memory is * unevictable and there is no storage to write back to. @@ -213,14 +265,25 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp) if (ops->invalidate_end) ops->invalidate_end(inode, offset, nr); + guest_memfd_folio_clear_private(folio); + return true; } +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len) +{ + /* not yet supported */ + BUG_ON(offset || len != folio_size(folio)); + + BUG_ON(!gmem_release_folio(folio, 0)); +} + static const struct address_space_operations gmem_aops = { .dirty_folio = noop_dirty_folio, .migrate_folio = gmem_migrate_folio, .error_remove_folio = gmem_error_folio, .release_folio = gmem_release_folio, + .invalidate_folio = gmem_invalidate_folio, }; static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops) @@ -241,7 +304,7 @@ struct file *guest_memfd_alloc(const char *name, if (!guest_memfd_check_ops(ops)) return ERR_PTR(-EINVAL); - if (flags) + if (flags & ~GUEST_MEMFD_FLAG_NO_DIRECT_MAP) return ERR_PTR(-EINVAL); /*
This patch was reworked from Patrick's patch: https://lore.kernel.org/all/20240709132041.3625501-6-roypat@amazon.co.uk/ While guest_memfd is not available to be mapped by userspace, it is still accessible through the kernel's direct map. This means that in scenarios where guest-private memory is not hardware protected, it can be speculatively read and its contents potentially leaked through hardware side-channels. Removing guest-private memory from the direct map, thus mitigates a large class of speculative execution issues [1, Table 1]. Direct map removal do not reuse the `.prepare` machinery, since `prepare` can be called multiple time, and it is the responsibility of the preparation routine to not "prepare" the same folio twice [2]. Thus, instead explicitly check if `filemap_grab_folio` allocated a new folio, and remove the returned folio from the direct map only if this was the case. The patch uses release_folio instead of free_folio to reinsert pages back into the direct map as by the time free_folio is called, folio->mapping can already be NULL. This means that a call to folio_inode inside free_folio might deference a NULL pointer, leaving no way to access the inode which stores the flags that allow determining whether the page was removed from the direct map in the first place. [1]: https://download.vusec.net/papers/quarantine_raid23.pdf Cc: Patrick Roy <roypat@amazon.co.uk> Signed-off-by: Elliot Berman <quic_eberman@quicinc.com> --- include/linux/guest_memfd.h | 8 ++++++ mm/guest_memfd.c | 65 ++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 72 insertions(+), 1 deletion(-)