[01/35] mm:gup/writeback: add callbacks for inaccessible pages
diff mbox series

Message ID 20200207113958.7320-2-borntraeger@de.ibm.com
State New
Headers show
Series
  • KVM: s390: Add support for protected VMs
Related show

Commit Message

Christian Borntraeger Feb. 7, 2020, 11:39 a.m. UTC
From: Claudio Imbrenda <imbrenda@linux.ibm.com>

With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.

While cpu accesses will trigger a fault that can be resolved, I/O
accesses will just fail.  We need to add a callback into architecture
code for places that will do I/O, namely when writeback is started or
when a page reference is taken.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 include/linux/gfp.h | 6 ++++++
 mm/gup.c            | 2 ++
 mm/page-writeback.c | 1 +
 3 files changed, 9 insertions(+)

Comments

Christian Borntraeger Feb. 10, 2020, 5:27 p.m. UTC | #1
CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
use for this on KVM/ARM in the future?

CC Sean Christopherson/Tom Lendacky. Any obvious use case for Intel/AMD
to have a callback before a page is used for I/O?

Andrew (or other mm people) any chance to get an ACK for this change?
I could then carry that via s390 or KVM tree. Or if you want to carry
that yourself I can send an updated version (we need to kind of 
synchronize that Linus will pull the KVM changes after the mm changes).

Andrea asked if others would benefit from this, so here are some more
information about this (and I can also put this into the patch
description).  So we have talked to the POWER folks. They do not use
the standard normal memory management, instead they have a hard split
between secure and normal memory. The secure memory  is the handled by
the hypervisor as device memory and the ultravisor and the hypervisor
move this forth and back when needed.

On s390 there is no *separate* pool of physical pages that are secure.
Instead, *any* physical page can be marked as secure or not, by
setting a bit in a per-page data structure that hardware uses to stop
unauthorized access.  (That bit is under control of the ultravisor.)

Note that one side effect of this strategy is that the decision
*which* secure pages to encrypt and then swap out is actually done by
the hypervisor, not the ultravisor.  In our case, the hypervisor is
Linux/KVM, so we're using the regular Linux memory management scheme
(active/inactive LRU lists etc.) to make this decision.  The advantage
is that the Ultravisor code does not need to itself implement any
memory management code, making it a lot simpler.

However, in the end this is why we need the hook into Linux memory
management: once Linux has decided to swap a page out, we need to get
a chance to tell the Ultravisor to "export" the page (i.e., encrypt
its contents and mark it no longer secure).

As outlined below this should be a no-op for anybody not opting in.

Christian                                   

On 07.02.20 12:39, Christian Borntraeger wrote:
> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
> 
> With the introduction of protected KVM guests on s390 there is now a
> concept of inaccessible pages. These pages need to be made accessible
> before the host can access them.
> 
> While cpu accesses will trigger a fault that can be resolved, I/O
> accesses will just fail.  We need to add a callback into architecture
> code for places that will do I/O, namely when writeback is started or
> when a page reference is taken.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  include/linux/gfp.h | 6 ++++++
>  mm/gup.c            | 2 ++
>  mm/page-writeback.c | 1 +
>  3 files changed, 9 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..be2754841369 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
>  #ifndef HAVE_ARCH_ALLOC_PAGE
>  static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> +static inline int arch_make_page_accessible(struct page *page)
> +{
> +	return 0;
> +}
> +#endif
>  
>  struct page *
>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> diff --git a/mm/gup.c b/mm/gup.c
> index 7646bf993b25..a01262cd2821 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -257,6 +257,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  			page = ERR_PTR(-ENOMEM);
>  			goto out;
>  		}
> +		arch_make_page_accessible(page);
>  	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
> @@ -1870,6 +1871,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  
>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>  
> +		arch_make_page_accessible(page);
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
>  		(*nr)++;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2caf780a42e7..0f0bd14571b1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2806,6 +2806,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		inc_lruvec_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>  	}
> +	arch_make_page_accessible(page);
>  	unlock_page_memcg(page);

As outlined by Ulrich, we can move the callback after the unlock.

>  	return ret;
>  
>
David Hildenbrand Feb. 10, 2020, 6:17 p.m. UTC | #2
On 07.02.20 12:39, Christian Borntraeger wrote:
> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
> 
> With the introduction of protected KVM guests on s390 there is now a
> concept of inaccessible pages. These pages need to be made accessible
> before the host can access them.
> 
> While cpu accesses will trigger a fault that can be resolved, I/O
> accesses will just fail.  We need to add a callback into architecture
> code for places that will do I/O, namely when writeback is started or
> when a page reference is taken.

My question would be: What guarantees that the page will stay accessible
(for I/O)? IIRC, pages can be converted back to secure/inaccessible
whenever the guest wants to access them. How will that be dealt with?

I would assume some magic counter that tracks if the page still has to
remain accessible. Once all clients that require the page to be
"accessible" on the I/O path are done, the page can be made inaccessible
again. But then, I would assume there would be something like a

/* make page accessible and make sure the page will remain accessible */
arch_get_page_accessible(page);

/* we're done dealing with the page content */
arch_put_page_accessible(page);

You mention page references. I think you should elaborate how that is
expected to work in the patch description more detailed.


(side note: I assume you guys have a plan for dealing with kdump wanting
to dump inaccessible pages. the kexec kernel would have to talk to the
UV to convert pages - and also make pages accessible on the I/O path I
guess - or one would want to mark and skip encrypted pages completely in
kdump somehow, as the content is essentially garbage)



cc Michal (not sure if your area of expertise)
https://lore.kernel.org/kvm/20200207113958.7320-2-borntraeger@de.ibm.com/
Christian Borntraeger Feb. 10, 2020, 6:28 p.m. UTC | #3
On 10.02.20 19:17, David Hildenbrand wrote:
> On 07.02.20 12:39, Christian Borntraeger wrote:
>> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>
>> With the introduction of protected KVM guests on s390 there is now a
>> concept of inaccessible pages. These pages need to be made accessible
>> before the host can access them.
>>
>> While cpu accesses will trigger a fault that can be resolved, I/O
>> accesses will just fail.  We need to add a callback into architecture
>> code for places that will do I/O, namely when writeback is started or
>> when a page reference is taken.
> 
> My question would be: What guarantees that the page will stay accessible
> (for I/O)? IIRC, pages can be converted back to secure/inaccessible
> whenever the guest wants to access them. How will that be dealt with?

Yes, in patch 5 we do use the page lock, PageWriteBack and page_ref_freeze
to only make the page secure again if no I/O is going to be started or
still running.

We have minimized the common code impact (just these 3 callbacks) so that 
architecture code can do the right thing.

> 
> I would assume some magic counter that tracks if the page still has to
> remain accessible. Once all clients that require the page to be
> "accessible" on the I/O path are done, the page can be made inaccessible
> again. But then, I would assume there would be something like a
> 
> /* make page accessible and make sure the page will remain accessible */
> arch_get_page_accessible(page);
> 
> /* we're done dealing with the page content */
> arch_put_page_accessible(page);
> 
> You mention page references. I think you should elaborate how that is
> expected to work in the patch description more detailed.
> 
> 
> (side note: I assume you guys have a plan for dealing with kdump wanting
> to dump inaccessible pages. the kexec kernel would have to talk to the
> UV to convert pages - and also make pages accessible on the I/O path I
> guess - or one would want to mark and skip encrypted pages completely in
> kdump somehow, as the content is essentially garbage)

On kexec and kdump the ultravisor is called as part of the the diagnose
308 subcodes 0 and 1 to make sure that a: kdump works (no fault on a 
previously secure page) and b: the content of the secure page is no 
longer accessible.
David Hildenbrand Feb. 10, 2020, 6:43 p.m. UTC | #4
On 10.02.20 19:28, Christian Borntraeger wrote:
> 
> 
> On 10.02.20 19:17, David Hildenbrand wrote:
>> On 07.02.20 12:39, Christian Borntraeger wrote:
>>> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>>
>>> With the introduction of protected KVM guests on s390 there is now a
>>> concept of inaccessible pages. These pages need to be made accessible
>>> before the host can access them.
>>>
>>> While cpu accesses will trigger a fault that can be resolved, I/O
>>> accesses will just fail.  We need to add a callback into architecture
>>> code for places that will do I/O, namely when writeback is started or
>>> when a page reference is taken.
>>
>> My question would be: What guarantees that the page will stay accessible
>> (for I/O)? IIRC, pages can be converted back to secure/inaccessible
>> whenever the guest wants to access them. How will that be dealt with?
> 
> Yes, in patch 5 we do use the page lock, PageWriteBack and page_ref_freeze
> to only make the page secure again if no I/O is going to be started or
> still running.
> 
> We have minimized the common code impact (just these 3 callbacks) so that 
> architecture code can do the right thing.

So the magic is

+static int expected_page_refs(struct page *page)
+{
+	int res;
+
+	res = page_mapcount(page);
+	if (PageSwapCache(page))
+		res++;
+	else if (page_mapping(page)) {
+		res++;
+		if (page_has_private(page))
+			res++;
+	}
+	return res;
+}
[...]
+static int make_secure_pte(pte_t *ptep, unsigned long addr, void *data)
[...]
+	if (PageWriteback(page))
+		return -EAGAIN;
+	expected = expected_page_refs(page);
+	if (!page_ref_freeze(page, expected))
+		return -EBUSY;
[...]
+	rc = uv_call(0, (u64)params->uvcb);
+	page_ref_unfreeze(page, expected);

As long as a page is does not have the expected refcount, it cannot be
convert to secure and not used by the guest.

I assume this implies, that if a guest page is pinned somewhere (e.g.,
in KVM), it won't be usable by the guest.

Please add all these details to the patch description. I think they are
crucial to understand how this is expected to work and to be used.
Christian Borntraeger Feb. 10, 2020, 6:51 p.m. UTC | #5
On 10.02.20 19:43, David Hildenbrand wrote:
> On 10.02.20 19:28, Christian Borntraeger wrote:
>>
>>
>> On 10.02.20 19:17, David Hildenbrand wrote:
>>> On 07.02.20 12:39, Christian Borntraeger wrote:
>>>> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>>>
>>>> With the introduction of protected KVM guests on s390 there is now a
>>>> concept of inaccessible pages. These pages need to be made accessible
>>>> before the host can access them.
>>>>
>>>> While cpu accesses will trigger a fault that can be resolved, I/O
>>>> accesses will just fail.  We need to add a callback into architecture
>>>> code for places that will do I/O, namely when writeback is started or
>>>> when a page reference is taken.
>>>
>>> My question would be: What guarantees that the page will stay accessible
>>> (for I/O)? IIRC, pages can be converted back to secure/inaccessible
>>> whenever the guest wants to access them. How will that be dealt with?
>>
>> Yes, in patch 5 we do use the page lock, PageWriteBack and page_ref_freeze
>> to only make the page secure again if no I/O is going to be started or
>> still running.
>>
>> We have minimized the common code impact (just these 3 callbacks) so that 
>> architecture code can do the right thing.
> 
> So the magic is
> 
> +static int expected_page_refs(struct page *page)
> +{
> +	int res;
> +
> +	res = page_mapcount(page);
> +	if (PageSwapCache(page))
> +		res++;
> +	else if (page_mapping(page)) {
> +		res++;
> +		if (page_has_private(page))
> +			res++;
> +	}
> +	return res;
> +}
> [...]
> +static int make_secure_pte(pte_t *ptep, unsigned long addr, void *data)
> [...]
> +	if (PageWriteback(page))
> +		return -EAGAIN;
> +	expected = expected_page_refs(page);
> +	if (!page_ref_freeze(page, expected))
> +		return -EBUSY;
> [...]
> +	rc = uv_call(0, (u64)params->uvcb);
> +	page_ref_unfreeze(page, expected);
> 
> As long as a page is does not have the expected refcount, it cannot be
> convert to secure and not used by the guest.
> 
> I assume this implies, that if a guest page is pinned somewhere (e.g.,
> in KVM), it won't be usable by the guest.

Yes, but you can always exit QEMU nothing will "block". You you have a permanent
SIE exit. This is something that should not happen for QEMU/KVM and the expected
refcount logic can be found in many common code places. 
> 
> Please add all these details to the patch description. I think they are
> crucial to understand how this is expected to work and to be used.

Makes sense. Will add more explanation to patch 5.
Ulrich also had some idea how to simplify patch 5 in some places.
Will Deacon Feb. 11, 2020, 11:26 a.m. UTC | #6
On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
> use for this on KVM/ARM in the future?

I can't speak for Marc, but I can say that we're interested in something
like this for potentially isolating VMs from a KVM host in Android.
However, we've currently been working on the assumption that the memory
removed from the host won't usually be touched by the host (i.e. no
KSM or swapping out), so all we'd probably want at the moment is to be
able to return an error back from arch_make_page_accessible(). Its return
code is ignored in this patch :/

One thing I don't grok about the ultravisor encryption is how it avoids
replay attacks when paging back in. For example, if the host is compromised
and replaces the page contents with an old encrypted value. Are you storing
per-page metadata somewhere to ensure "freshness" of the encrypted data?

Will
Christian Borntraeger Feb. 11, 2020, 11:43 a.m. UTC | #7
On 11.02.20 12:26, Will Deacon wrote:
> On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
>> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
>> use for this on KVM/ARM in the future?
> 
> I can't speak for Marc, but I can say that we're interested in something
> like this for potentially isolating VMs from a KVM host in Android.
> However, we've currently been working on the assumption that the memory
> removed from the host won't usually be touched by the host (i.e. no
> KSM or swapping out), so all we'd probably want at the moment is to be
> able to return an error back from arch_make_page_accessible(). Its return
> code is ignored in this patch :/
> 
> One thing I don't grok about the ultravisor encryption is how it avoids
> replay attacks when paging back in. For example, if the host is compromised
> and replaces the page contents with an old encrypted value. Are you storing
> per-page metadata somewhere to ensure "freshness" of the encrypted data?

Cant talk about the others, but on s390 the ultravisor stores counter,
tweak, address and hashing information. No replay or page exchange within
the guest is possible. (We can move the guest content to a different host
page though be using the export/import as this will revalidate the 
correctness from the guest point of view)
Christian Borntraeger Feb. 13, 2020, 2:48 p.m. UTC | #8
On 11.02.20 12:26, Will Deacon wrote:
> On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
>> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
>> use for this on KVM/ARM in the future?
> 
> I can't speak for Marc, but I can say that we're interested in something
> like this for potentially isolating VMs from a KVM host in Android.
> However, we've currently been working on the assumption that the memory
> removed from the host won't usually be touched by the host (i.e. no
> KSM or swapping out), so all we'd probably want at the moment is to be
> able to return an error back from arch_make_page_accessible(). Its return
> code is ignored in this patch :/

I think there are two ways at the moment. One is to keep the memory away from
Linux, e.g. by using the memory as device driver memory like kmalloc. This is
kind of what Power does. And I understand you as you want to follow that model
and do not want to use paging, file backing or so.
Our approach tries to fully integrate into the existing Linux LRU methods.

Back to your approach. What happens when a malicious QEMU would start direct I/O
on such isolated memory? Is that what you meant by adding error checking in these
hooks. For the gup.c code returning an error seems straightforward.

I have no idea what to do in writeback. When somebody managed to trigger writeback
on such a page, it already seems too late.
Sean Christopherson Feb. 13, 2020, 7:56 p.m. UTC | #9
On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
> use for this on KVM/ARM in the future?
> 
> CC Sean Christopherson/Tom Lendacky. Any obvious use case for Intel/AMD
> to have a callback before a page is used for I/O?

Yes?

> Andrew (or other mm people) any chance to get an ACK for this change?
> I could then carry that via s390 or KVM tree. Or if you want to carry
> that yourself I can send an updated version (we need to kind of 
> synchronize that Linus will pull the KVM changes after the mm changes).
> 
> Andrea asked if others would benefit from this, so here are some more
> information about this (and I can also put this into the patch
> description).  So we have talked to the POWER folks. They do not use
> the standard normal memory management, instead they have a hard split
> between secure and normal memory. The secure memory  is the handled by
> the hypervisor as device memory and the ultravisor and the hypervisor
> move this forth and back when needed.
> 
> On s390 there is no *separate* pool of physical pages that are secure.
> Instead, *any* physical page can be marked as secure or not, by
> setting a bit in a per-page data structure that hardware uses to stop
> unauthorized access.  (That bit is under control of the ultravisor.)
> 
> Note that one side effect of this strategy is that the decision
> *which* secure pages to encrypt and then swap out is actually done by
> the hypervisor, not the ultravisor.  In our case, the hypervisor is
> Linux/KVM, so we're using the regular Linux memory management scheme
> (active/inactive LRU lists etc.) to make this decision.  The advantage
> is that the Ultravisor code does not need to itself implement any
> memory management code, making it a lot simpler.

Disclaimer: I'm not familiar with s390 guest page faults or UV.  I tried
to give myself a crash course, apologies if I'm way out in left field...

AIUI, pages will first be added to a secure guest by converting a normal,
non-secure page to secure and stuffing it into the guest page tables.  To
swap a page from a secure guest, arch_make_page_accessible() will be called
to encrypt the page in place so that it can be accessed by the untrusted
kernel/VMM and written out to disk.  And to fault the page back in, on s390
a secure guest access to a non-secure page will generate a page fault with
a dedicated type.  That fault routes directly to
do_non_secure_storage_access(), which converts the page to secure and thus
makes it re-accessible to the guest.

That all sounds sane and usable for Intel.

My big question is the follow/get flows, more on that below.

> However, in the end this is why we need the hook into Linux memory
> management: once Linux has decided to swap a page out, we need to get
> a chance to tell the Ultravisor to "export" the page (i.e., encrypt
> its contents and mark it no longer secure).
> 
> As outlined below this should be a no-op for anybody not opting in.
> 
> Christian                                   
> 
> On 07.02.20 12:39, Christian Borntraeger wrote:
> > From: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > 
> > With the introduction of protected KVM guests on s390 there is now a
> > concept of inaccessible pages. These pages need to be made accessible
> > before the host can access them.
> > 
> > While cpu accesses will trigger a fault that can be resolved, I/O
> > accesses will just fail.  We need to add a callback into architecture
> > code for places that will do I/O, namely when writeback is started or
> > when a page reference is taken.
> > 
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > ---
> >  include/linux/gfp.h | 6 ++++++
> >  mm/gup.c            | 2 ++
> >  mm/page-writeback.c | 1 +
> >  3 files changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index e5b817cb86e7..be2754841369 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
> >  #ifndef HAVE_ARCH_ALLOC_PAGE
> >  static inline void arch_alloc_page(struct page *page, int order) { }
> >  #endif
> > +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> > +static inline int arch_make_page_accessible(struct page *page)
> > +{
> > +	return 0;
> > +}
> > +#endif
> >  
> >  struct page *
> >  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 7646bf993b25..a01262cd2821 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -257,6 +257,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> >  			page = ERR_PTR(-ENOMEM);
> >  			goto out;
> >  		}
> > +		arch_make_page_accessible(page);

As Will pointed out, the return value definitely needs to be checked, there
will undoubtedly be scenarios where the page cannot be made accessible.

What is the use case for calling arch_make_page_accessible() in the follow()
and gup() paths?  Live migration is the only thing that comes to mind, and
for live migration I would expect you would want to keep the secure guest
running when copying pages to the target, i.e. use pre-copy.  That would
conflict with converting the page in place.  Rather, migration would use a
separate dedicated path to copy the encrypted contents of the secure page to
a completely different page, and send *that* across the wire so that the
guest can continue accessing the original page.

Am I missing a need to do this for the swap/reclaim case?  Or is there a
completely different use case I'm overlooking?

Tangentially related, hooks here could be quite useful for sanity checking
the kernel/KVM and/or debugging kernel/KVM bugs.  Would it make sense to
pass a param to arch_make_page_accessible() to provide some information as
to why the page needs to be made accessible?

> >  	}
> >  	if (flags & FOLL_TOUCH) {
> >  		if ((flags & FOLL_WRITE) &&
> > @@ -1870,6 +1871,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >  
> >  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> >  
> > +		arch_make_page_accessible(page);
> >  		SetPageReferenced(page);
> >  		pages[*nr] = page;
> >  		(*nr)++;
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 2caf780a42e7..0f0bd14571b1 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2806,6 +2806,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> >  		inc_lruvec_page_state(page, NR_WRITEBACK);
> >  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> >  	}
> > +	arch_make_page_accessible(page);
> >  	unlock_page_memcg(page);
> 
> As outlined by Ulrich, we can move the callback after the unlock.
> 
> >  	return ret;
> >  
> > 
>
Christian Borntraeger Feb. 13, 2020, 8:13 p.m. UTC | #10
On 13.02.20 20:56, Sean Christopherson wrote:
> On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
>> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
>> use for this on KVM/ARM in the future?
>>
>> CC Sean Christopherson/Tom Lendacky. Any obvious use case for Intel/AMD
>> to have a callback before a page is used for I/O?
> 
> Yes?
> 
>> Andrew (or other mm people) any chance to get an ACK for this change?
>> I could then carry that via s390 or KVM tree. Or if you want to carry
>> that yourself I can send an updated version (we need to kind of 
>> synchronize that Linus will pull the KVM changes after the mm changes).
>>
>> Andrea asked if others would benefit from this, so here are some more
>> information about this (and I can also put this into the patch
>> description).  So we have talked to the POWER folks. They do not use
>> the standard normal memory management, instead they have a hard split
>> between secure and normal memory. The secure memory  is the handled by
>> the hypervisor as device memory and the ultravisor and the hypervisor
>> move this forth and back when needed.
>>
>> On s390 there is no *separate* pool of physical pages that are secure.
>> Instead, *any* physical page can be marked as secure or not, by
>> setting a bit in a per-page data structure that hardware uses to stop
>> unauthorized access.  (That bit is under control of the ultravisor.)
>>
>> Note that one side effect of this strategy is that the decision
>> *which* secure pages to encrypt and then swap out is actually done by
>> the hypervisor, not the ultravisor.  In our case, the hypervisor is
>> Linux/KVM, so we're using the regular Linux memory management scheme
>> (active/inactive LRU lists etc.) to make this decision.  The advantage
>> is that the Ultravisor code does not need to itself implement any
>> memory management code, making it a lot simpler.
> 
> Disclaimer: I'm not familiar with s390 guest page faults or UV.  I tried
> to give myself a crash course, apologies if I'm way out in left field...
> 
> AIUI, pages will first be added to a secure guest by converting a normal,
> non-secure page to secure and stuffing it into the guest page tables.  To
> swap a page from a secure guest, arch_make_page_accessible() will be called
> to encrypt the page in place so that it can be accessed by the untrusted
> kernel/VMM and written out to disk.  And to fault the page back in, on s390
> a secure guest access to a non-secure page will generate a page fault with
> a dedicated type.  That fault routes directly to
> do_non_secure_storage_access(), which converts the page to secure and thus
> makes it re-accessible to the guest.
> 
> That all sounds sane and usable for Intel.
> 
> My big question is the follow/get flows, more on that below.
> 
>> However, in the end this is why we need the hook into Linux memory
>> management: once Linux has decided to swap a page out, we need to get
>> a chance to tell the Ultravisor to "export" the page (i.e., encrypt
>> its contents and mark it no longer secure).
>>
>> As outlined below this should be a no-op for anybody not opting in.
>>
>> Christian                                   
>>
>> On 07.02.20 12:39, Christian Borntraeger wrote:
>>> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>>
>>> With the introduction of protected KVM guests on s390 there is now a
>>> concept of inaccessible pages. These pages need to be made accessible
>>> before the host can access them.
>>>
>>> While cpu accesses will trigger a fault that can be resolved, I/O
>>> accesses will just fail.  We need to add a callback into architecture
>>> code for places that will do I/O, namely when writeback is started or
>>> when a page reference is taken.
>>>
>>> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>> ---
>>>  include/linux/gfp.h | 6 ++++++
>>>  mm/gup.c            | 2 ++
>>>  mm/page-writeback.c | 1 +
>>>  3 files changed, 9 insertions(+)
>>>
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index e5b817cb86e7..be2754841369 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
>>>  #ifndef HAVE_ARCH_ALLOC_PAGE
>>>  static inline void arch_alloc_page(struct page *page, int order) { }
>>>  #endif
>>> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
>>> +static inline int arch_make_page_accessible(struct page *page)
>>> +{
>>> +	return 0;
>>> +}
>>> +#endif
>>>  
>>>  struct page *
>>>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 7646bf993b25..a01262cd2821 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -257,6 +257,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>  			page = ERR_PTR(-ENOMEM);
>>>  			goto out;
>>>  		}
>>> +		arch_make_page_accessible(page);
> 
> As Will pointed out, the return value definitely needs to be checked, there
> will undoubtedly be scenarios where the page cannot be made accessible.

Actually onm s390 this should always succeed unless we have a bug.

But we can certainly provide a variant of that patch that does check the return
value. 
Proper error handling for gup and WARN_ON for pae-writeback.
> 
> What is the use case for calling arch_make_page_accessible() in the follow()
> and gup() paths?  Live migration is the only thing that comes to mind, and
> for live migration I would expect you would want to keep the secure guest
> running when copying pages to the target, i.e. use pre-copy.  That would
> conflict with converting the page in place.  Rather, migration would use a
> separate dedicated path to copy the encrypted contents of the secure page to
> a completely different page, and send *that* across the wire so that the
> guest can continue accessing the original page.
> Am I missing a need to do this for the swap/reclaim case?  Or is there a
> completely different use case I'm overlooking?

This is actually to protect the host against a malicious user space. For 
example a bad QEMU could simply start direct I/O on such protected memory.
We do not want userspace to be able to trigger I/O errors and thus we
implemented the logic to "whenever somebody accesses that page (gup) or
doing I/O, make sure that this page can be accessed. When the guest tries
to access that page we will wait in the page fault handler for writeback to
have finished and for the page_ref to be the expected value.



> 
> Tangentially related, hooks here could be quite useful for sanity checking
> the kernel/KVM and/or debugging kernel/KVM bugs.  Would it make sense to
> pass a param to arch_make_page_accessible() to provide some information as
> to why the page needs to be made accessible?

Some kind of enum that can be used optionally to optimize things?

> 
>>>  	}
>>>  	if (flags & FOLL_TOUCH) {
>>>  		if ((flags & FOLL_WRITE) &&
>>> @@ -1870,6 +1871,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>  
>>>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>>>  
>>> +		arch_make_page_accessible(page);
>>>  		SetPageReferenced(page);
>>>  		pages[*nr] = page;
>>>  		(*nr)++;
>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>> index 2caf780a42e7..0f0bd14571b1 100644
>>> --- a/mm/page-writeback.c
>>> +++ b/mm/page-writeback.c
>>> @@ -2806,6 +2806,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>>>  		inc_lruvec_page_state(page, NR_WRITEBACK);
>>>  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>>>  	}
>>> +	arch_make_page_accessible(page);
>>>  	unlock_page_memcg(page);
>>
>> As outlined by Ulrich, we can move the callback after the unlock.
>>
>>>  	return ret;
>>>  
>>>
>>
Sean Christopherson Feb. 13, 2020, 8:46 p.m. UTC | #11
On Thu, Feb 13, 2020 at 09:13:35PM +0100, Christian Borntraeger wrote:
> 
> On 13.02.20 20:56, Sean Christopherson wrote:
> > On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
> > Am I missing a need to do this for the swap/reclaim case?  Or is there a
> > completely different use case I'm overlooking?
> 
> This is actually to protect the host against a malicious user space. For 
> example a bad QEMU could simply start direct I/O on such protected memory.
> We do not want userspace to be able to trigger I/O errors and thus we
> implemented the logic to "whenever somebody accesses that page (gup) or
> doing I/O, make sure that this page can be accessed. When the guest tries
> to access that page we will wait in the page fault handler for writeback to
> have finished and for the page_ref to be the expected value.

Ah.  I was assuming the pages would unmappable by userspace, enforced by
some other mechanism

> > 
> > Tangentially related, hooks here could be quite useful for sanity checking
> > the kernel/KVM and/or debugging kernel/KVM bugs.  Would it make sense to
> > pass a param to arch_make_page_accessible() to provide some information as
> > to why the page needs to be made accessible?
> 
> Some kind of enum that can be used optionally to optimize things?

Not just optimize, in the case above it'd probably preferable for us to
reject a userspace mapping outright, e.g. return -EFAULT if called from
gup()/follow().  Debug scenarios might also require differentiating between
writeback and "other".
Tom Lendacky Feb. 17, 2020, 8:55 p.m. UTC | #12
On 2/13/20 2:13 PM, Christian Borntraeger wrote:
> 
> 
> On 13.02.20 20:56, Sean Christopherson wrote:
>> On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
>>> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
>>> use for this on KVM/ARM in the future?
>>>
>>> CC Sean Christopherson/Tom Lendacky. Any obvious use case for Intel/AMD
>>> to have a callback before a page is used for I/O?

From an SEV-SNP perspective, I don't think so. The SEV-SNP architecture
uses page states and having the hypervisor change the state from beneath
the guest might trigger the guest into thinking it's being attacked vs
just allowing the I/O to fail. Is this a concern with flooding the console
with I/O error messages?

>>
>> Yes?
>>
>>> Andrew (or other mm people) any chance to get an ACK for this change?
>>> I could then carry that via s390 or KVM tree. Or if you want to carry
>>> that yourself I can send an updated version (we need to kind of 
>>> synchronize that Linus will pull the KVM changes after the mm changes).
>>>
>>> Andrea asked if others would benefit from this, so here are some more
>>> information about this (and I can also put this into the patch
>>> description).  So we have talked to the POWER folks. They do not use
>>> the standard normal memory management, instead they have a hard split
>>> between secure and normal memory. The secure memory  is the handled by
>>> the hypervisor as device memory and the ultravisor and the hypervisor
>>> move this forth and back when needed.
>>>
>>> On s390 there is no *separate* pool of physical pages that are secure.
>>> Instead, *any* physical page can be marked as secure or not, by
>>> setting a bit in a per-page data structure that hardware uses to stop
>>> unauthorized access.  (That bit is under control of the ultravisor.)
>>>
>>> Note that one side effect of this strategy is that the decision
>>> *which* secure pages to encrypt and then swap out is actually done by
>>> the hypervisor, not the ultravisor.  In our case, the hypervisor is
>>> Linux/KVM, so we're using the regular Linux memory management scheme
>>> (active/inactive LRU lists etc.) to make this decision.  The advantage
>>> is that the Ultravisor code does not need to itself implement any
>>> memory management code, making it a lot simpler.
>>
>> Disclaimer: I'm not familiar with s390 guest page faults or UV.  I tried
>> to give myself a crash course, apologies if I'm way out in left field...
>>
>> AIUI, pages will first be added to a secure guest by converting a normal,
>> non-secure page to secure and stuffing it into the guest page tables.  To
>> swap a page from a secure guest, arch_make_page_accessible() will be called
>> to encrypt the page in place so that it can be accessed by the untrusted
>> kernel/VMM and written out to disk.  And to fault the page back in, on s390
>> a secure guest access to a non-secure page will generate a page fault with
>> a dedicated type.  That fault routes directly to
>> do_non_secure_storage_access(), which converts the page to secure and thus
>> makes it re-accessible to the guest.
>>
>> That all sounds sane and usable for Intel.
>>
>> My big question is the follow/get flows, more on that below.
>>
>>> However, in the end this is why we need the hook into Linux memory
>>> management: once Linux has decided to swap a page out, we need to get
>>> a chance to tell the Ultravisor to "export" the page (i.e., encrypt
>>> its contents and mark it no longer secure).
>>>
>>> As outlined below this should be a no-op for anybody not opting in.
>>>
>>> Christian                                   
>>>
>>> On 07.02.20 12:39, Christian Borntraeger wrote:
>>>> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>>>
>>>> With the introduction of protected KVM guests on s390 there is now a
>>>> concept of inaccessible pages. These pages need to be made accessible
>>>> before the host can access them.
>>>>
>>>> While cpu accesses will trigger a fault that can be resolved, I/O
>>>> accesses will just fail.  We need to add a callback into architecture
>>>> code for places that will do I/O, namely when writeback is started or
>>>> when a page reference is taken.
>>>>
>>>> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>> ---
>>>>  include/linux/gfp.h | 6 ++++++
>>>>  mm/gup.c            | 2 ++
>>>>  mm/page-writeback.c | 1 +
>>>>  3 files changed, 9 insertions(+)
>>>>
>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>> index e5b817cb86e7..be2754841369 100644
>>>> --- a/include/linux/gfp.h
>>>> +++ b/include/linux/gfp.h
>>>> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
>>>>  #ifndef HAVE_ARCH_ALLOC_PAGE
>>>>  static inline void arch_alloc_page(struct page *page, int order) { }
>>>>  #endif
>>>> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
>>>> +static inline int arch_make_page_accessible(struct page *page)
>>>> +{
>>>> +	return 0;
>>>> +}
>>>> +#endif
>>>>  
>>>>  struct page *
>>>>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 7646bf993b25..a01262cd2821 100644
>>>> --- a/mm/gup.c
>>>> +++ b/mm/gup.c
>>>> @@ -257,6 +257,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>>  			page = ERR_PTR(-ENOMEM);
>>>>  			goto out;
>>>>  		}
>>>> +		arch_make_page_accessible(page);
>>
>> As Will pointed out, the return value definitely needs to be checked, there
>> will undoubtedly be scenarios where the page cannot be made accessible.
> 
> Actually onm s390 this should always succeed unless we have a bug.
> 
> But we can certainly provide a variant of that patch that does check the return
> value. 
> Proper error handling for gup and WARN_ON for pae-writeback.
>>
>> What is the use case for calling arch_make_page_accessible() in the follow()
>> and gup() paths?  Live migration is the only thing that comes to mind, and
>> for live migration I would expect you would want to keep the secure guest
>> running when copying pages to the target, i.e. use pre-copy.  That would
>> conflict with converting the page in place.  Rather, migration would use a
>> separate dedicated path to copy the encrypted contents of the secure page to
>> a completely different page, and send *that* across the wire so that the
>> guest can continue accessing the original page.
>> Am I missing a need to do this for the swap/reclaim case?  Or is there a
>> completely different use case I'm overlooking?
> 
> This is actually to protect the host against a malicious user space. For 
> example a bad QEMU could simply start direct I/O on such protected memory.
> We do not want userspace to be able to trigger I/O errors and thus we
> implemented the logic to "whenever somebody accesses that page (gup) or
> doing I/O, make sure that this page can be accessed. When the guest tries
> to access that page we will wait in the page fault handler for writeback to
> have finished and for the page_ref to be the expected value.

So in this case, when the guest tries to access the page, the page may now
be corrupted because I/O was allowed to be done to it? Or will the I/O
have been blocked in some way, but without generating the I/O error?

Thanks,
Tom

> 
> 
> 
>>
>> Tangentially related, hooks here could be quite useful for sanity checking
>> the kernel/KVM and/or debugging kernel/KVM bugs.  Would it make sense to
>> pass a param to arch_make_page_accessible() to provide some information as
>> to why the page needs to be made accessible?
> 
> Some kind of enum that can be used optionally to optimize things?
> 
>>
>>>>  	}
>>>>  	if (flags & FOLL_TOUCH) {
>>>>  		if ((flags & FOLL_WRITE) &&
>>>> @@ -1870,6 +1871,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>>>  
>>>>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>>>>  
>>>> +		arch_make_page_accessible(page);
>>>>  		SetPageReferenced(page);
>>>>  		pages[*nr] = page;
>>>>  		(*nr)++;
>>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>>> index 2caf780a42e7..0f0bd14571b1 100644
>>>> --- a/mm/page-writeback.c
>>>> +++ b/mm/page-writeback.c
>>>> @@ -2806,6 +2806,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>>>>  		inc_lruvec_page_state(page, NR_WRITEBACK);
>>>>  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>>>>  	}
>>>> +	arch_make_page_accessible(page);
>>>>  	unlock_page_memcg(page);
>>>
>>> As outlined by Ulrich, we can move the callback after the unlock.
>>>
>>>>  	return ret;
>>>>  
>>>>
>>>
>
Christian Borntraeger Feb. 17, 2020, 9:14 p.m. UTC | #13
On 17.02.20 21:55, Tom Lendacky wrote:
[...]

>>> What is the use case for calling arch_make_page_accessible() in the follow()
>>> and gup() paths?  Live migration is the only thing that comes to mind, and
>>> for live migration I would expect you would want to keep the secure guest
>>> running when copying pages to the target, i.e. use pre-copy.  That would
>>> conflict with converting the page in place.  Rather, migration would use a
>>> separate dedicated path to copy the encrypted contents of the secure page to
>>> a completely different page, and send *that* across the wire so that the
>>> guest can continue accessing the original page.
>>> Am I missing a need to do this for the swap/reclaim case?  Or is there a
>>> completely different use case I'm overlooking?
>>
>> This is actually to protect the host against a malicious user space. For 
>> example a bad QEMU could simply start direct I/O on such protected memory.
>> We do not want userspace to be able to trigger I/O errors and thus we
>> implemented the logic to "whenever somebody accesses that page (gup) or
>> doing I/O, make sure that this page can be accessed. When the guest tries
>> to access that page we will wait in the page fault handler for writeback to
>> have finished and for the page_ref to be the expected value.
> 
> So in this case, when the guest tries to access the page, the page may now
> be corrupted because I/O was allowed to be done to it? Or will the I/O
> have been blocked in some way, but without generating the I/O error?

No the I/O would be blocked by the hardware. Thats why we encrypt and export
the page for I/O usage. As soon as the refcount drops to the expected value
the guest can access its (unchanged) content after the import. the import
would check the hash etc. so no corruption of the guest state in any case.
(apart from denial of service, which is always possible)
If we would not have these hooks a malicious user could trigger I/O (which 
would be blocked) but the blocked I/O would generate an I/O error. And this
could bring trouble to some device drivers. And we want to avoid that.

In other words: the hardware/firmware will ensure guest integrity.But host
integrity (kernel vs userspace) must be enforced by the host kernel as usual
and this is one part of it.

But thanks for the clarification that you do not need those hooks.
Tian, Kevin Feb. 18, 2020, 3:36 a.m. UTC | #14
> From: Christian Borntraeger
> Sent: Friday, February 7, 2020 7:39 PM
> 
> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
> 
> With the introduction of protected KVM guests on s390 there is now a
> concept of inaccessible pages. These pages need to be made accessible
> before the host can access them.
> 
> While cpu accesses will trigger a fault that can be resolved, I/O
> accesses will just fail.  We need to add a callback into architecture
> code for places that will do I/O, namely when writeback is started or
> when a page reference is taken.

What about hooking the callback to DMA API ops?

> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  include/linux/gfp.h | 6 ++++++
>  mm/gup.c            | 2 ++
>  mm/page-writeback.c | 1 +
>  3 files changed, 9 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..be2754841369 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page,
> int order) { }
>  #ifndef HAVE_ARCH_ALLOC_PAGE
>  static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> +static inline int arch_make_page_accessible(struct page *page)
> +{
> +	return 0;
> +}
> +#endif
> 
>  struct page *
>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> preferred_nid,
> diff --git a/mm/gup.c b/mm/gup.c
> index 7646bf993b25..a01262cd2821 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -257,6 +257,7 @@ static struct page *follow_page_pte(struct
> vm_area_struct *vma,
>  			page = ERR_PTR(-ENOMEM);
>  			goto out;
>  		}
> +		arch_make_page_accessible(page);
>  	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
> @@ -1870,6 +1871,7 @@ static int gup_pte_range(pmd_t pmd, unsigned
> long addr, unsigned long end,
> 
>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> 
> +		arch_make_page_accessible(page);
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
>  		(*nr)++;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2caf780a42e7..0f0bd14571b1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2806,6 +2806,7 @@ int __test_set_page_writeback(struct page *page,
> bool keep_write)
>  		inc_lruvec_page_state(page, NR_WRITEBACK);
>  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>  	}
> +	arch_make_page_accessible(page);
>  	unlock_page_memcg(page);
>  	return ret;
> 
> --
> 2.24.0
Christian Borntraeger Feb. 18, 2020, 6:44 a.m. UTC | #15
On 18.02.20 04:36, Tian, Kevin wrote:
>> From: Christian Borntraeger
>> Sent: Friday, February 7, 2020 7:39 PM
>>
>> From: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>
>> With the introduction of protected KVM guests on s390 there is now a
>> concept of inaccessible pages. These pages need to be made accessible
>> before the host can access them.
>>
>> While cpu accesses will trigger a fault that can be resolved, I/O
>> accesses will just fail.  We need to add a callback into architecture
>> code for places that will do I/O, namely when writeback is started or
>> when a page reference is taken.
> 
> What about hooking the callback to DMA API ops?

Not all device drivers do use the DMA API so it wont work for us.
Will Deacon Feb. 18, 2020, 4:02 p.m. UTC | #16
On Thu, Feb 13, 2020 at 03:48:16PM +0100, Christian Borntraeger wrote:
> 
> 
> On 11.02.20 12:26, Will Deacon wrote:
> > On Mon, Feb 10, 2020 at 06:27:04PM +0100, Christian Borntraeger wrote:
> >> CC Marc Zyngier for KVM on ARM.  Marc, see below. Will there be any
> >> use for this on KVM/ARM in the future?
> > 
> > I can't speak for Marc, but I can say that we're interested in something
> > like this for potentially isolating VMs from a KVM host in Android.
> > However, we've currently been working on the assumption that the memory
> > removed from the host won't usually be touched by the host (i.e. no
> > KSM or swapping out), so all we'd probably want at the moment is to be
> > able to return an error back from arch_make_page_accessible(). Its return
> > code is ignored in this patch :/
> 
> I think there are two ways at the moment. One is to keep the memory away from
> Linux, e.g. by using the memory as device driver memory like kmalloc. This is
> kind of what Power does. And I understand you as you want to follow that model
> and do not want to use paging, file backing or so.

Correct.

> Our approach tries to fully integrate into the existing Linux LRU methods.
> 
> Back to your approach. What happens when a malicious QEMU would start direct I/O
> on such isolated memory? Is that what you meant by adding error checking in these
> hooks. For the gup.c code returning an error seems straightforward.

Yes, it would be nice if the host could avoid even trying to access the
page if it's inaccessible and so returning an error from
arch_make_page_accessible() would be a good way to achieve that. If the
access goes ahead anyway, then the hypervisor will have to handle the
fault and effectively ignore the host access (writes will be lost, reads
will return poison).

> I have no idea what to do in writeback. When somebody managed to trigger writeback
> on such a page, it already seems too late.

For now, we could just have a BUG_ON().

Will

Patch
diff mbox series

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..be2754841369 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -485,6 +485,12 @@  static inline void arch_free_page(struct page *page, int order) { }
 #ifndef HAVE_ARCH_ALLOC_PAGE
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+	return 0;
+}
+#endif
 
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..a01262cd2821 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -257,6 +257,7 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 			page = ERR_PTR(-ENOMEM);
 			goto out;
 		}
+		arch_make_page_accessible(page);
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -1870,6 +1871,7 @@  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		arch_make_page_accessible(page);
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..0f0bd14571b1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2806,6 +2806,7 @@  int __test_set_page_writeback(struct page *page, bool keep_write)
 		inc_lruvec_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
+	arch_make_page_accessible(page);
 	unlock_page_memcg(page);
 	return ret;