diff mbox

[RFC,v2,14/27] mm: Handle THP/HugeTLB shadow stack page fault

Message ID 20180710222639.8241-15-yu-cheng.yu@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yu-cheng Yu July 10, 2018, 10:26 p.m. UTC
This patch implements THP shadow stack memory copying in the same
way as the previous patch for regular PTE.

In copy_huge_pmd(), we clear the dirty bit from the PMD.  On the
next shadow stack access to the PMD, a page fault occurs.  At
that time, the page is copied/re-used and the PMD is fixed.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 mm/huge_memory.c | 8 ++++++++
 mm/memory.c      | 8 +++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

Comments

Dave Hansen July 10, 2018, 11:08 p.m. UTC | #1
On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> @@ -1347,6 +1353,8 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>  		pmd_t entry;
>  		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		if (is_shstk_mapping(vma->vm_flags))
> +			entry = pmd_mkdirty_shstk(entry);

This pattern is repeated enough that it makes me wonder if we should
just be doing the shadowstack PTE creation in mk_huge_pmd() itself.

Or, should we just be setting the shadowstack pte bit combination in
vma->vm_page_prot so we don't have to go set it explicitly every time?
Peter Zijlstra July 11, 2018, 9:10 a.m. UTC | #2
On Tue, Jul 10, 2018 at 03:26:26PM -0700, Yu-cheng Yu wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index a2695dbc0418..f7c46d61eaea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4108,7 +4108,13 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>  			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>  				return do_huge_pmd_numa_page(&vmf, orig_pmd);
>  
> -			if (dirty && !pmd_write(orig_pmd)) {
> +			/*
> +			 * Shadow stack trans huge PMDs are copy-on-access,
> +			 * so wp_huge_pmd() on them no mater if we have a
> +			 * write fault or not.
> +			 */
> +			if (is_shstk_mapping(vma->vm_flags) ||
> +			    (dirty && !pmd_write(orig_pmd))) {
>  				ret = wp_huge_pmd(&vmf, orig_pmd);
>  				if (!(ret & VM_FAULT_FALLBACK))
>  					return ret;

Can't we do this (and the do_wp_page thing) by setting FAULT_FLAG_WRITE
in the arch fault handler on shadow stack faults?
Yu-cheng Yu July 11, 2018, 4:11 p.m. UTC | #3
On Wed, 2018-07-11 at 11:10 +0200, Peter Zijlstra wrote:
> On Tue, Jul 10, 2018 at 03:26:26PM -0700, Yu-cheng Yu wrote:
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a2695dbc0418..f7c46d61eaea 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4108,7 +4108,13 @@ static int __handle_mm_fault(struct
> > vm_area_struct *vma, unsigned long address,
> >  			if (pmd_protnone(orig_pmd) &&
> > vma_is_accessible(vma))
> >  				return do_huge_pmd_numa_page(&vmf,
> > orig_pmd);
> >  
> > -			if (dirty && !pmd_write(orig_pmd)) {
> > +			/*
> > +			 * Shadow stack trans huge PMDs are copy-
> > on-access,
> > +			 * so wp_huge_pmd() on them no mater if we
> > have a
> > +			 * write fault or not.
> > +			 */
> > +			if (is_shstk_mapping(vma->vm_flags) ||
> > +			    (dirty && !pmd_write(orig_pmd))) {
> >  				ret = wp_huge_pmd(&vmf, orig_pmd);
> >  				if (!(ret & VM_FAULT_FALLBACK))
> >  					return ret;
> Can't we do this (and the do_wp_page thing) by setting
> FAULT_FLAG_WRITE
> in the arch fault handler on shadow stack faults?

This can work.  I don't know if that will create other issues.
Let me think about that.

Yu-cheng
Dave Hansen July 20, 2018, 2:20 p.m. UTC | #4
On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> @@ -1193,6 +1195,8 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
>  		pte_t entry;
>  		entry = mk_pte(pages[i], vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		if (is_shstk_mapping(vma->vm_flags))
> +			entry = pte_mkdirty_shstk(entry);

Peter Z was pointing out that we should get rid of all this generic code
manipulation.  We might not easily be able to do it *all*, but we can do
better than what we've got here.

Basically, if you have code outside of arch/x86 in your patch set that
refers to shadow stacks, you should consider it a bug (for now),
especially if you have to hack .c files.

For instance, in the code above, you could move the is_shstk_mapping() into:

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
        if (likely(vma->vm_flags & VM_WRITE))
                pte = pte_mkwrite(pte);
	
+	pte = arch_pte_mkwrite(pte, vma);
+
        return pte;
}

... and add an arch callback that does:

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	if (!is_shstk_mapping(vma->vm_flags))
		return pte;

	WARN_ON(... pte bits incompatible with shadow stacks?);

	/* Lots of comments of course */
	entry = pte_mkdirty_shstk(entry);
}

This is just one example.  You are probably going to need a couple of
similar things.  Just remember: the bar is very high to make changes to
.c files outside of arch/x86.  You can do a _bit_ more in non-x86
headers, but you have the most freedom to patch what you want as long as
it's in arch/x86.
Yu-cheng Yu July 20, 2018, 2:58 p.m. UTC | #5
On Fri, 2018-07-20 at 07:20 -0700, Dave Hansen wrote:
> On 07/10/2018 03:26 PM, Yu-cheng Yu wrote:
> > 
> > @@ -1193,6 +1195,8 @@ static int
> > do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
> >  		pte_t entry;
> >  		entry = mk_pte(pages[i], vma->vm_page_prot);
> >  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> > +		if (is_shstk_mapping(vma->vm_flags))
> > +			entry = pte_mkdirty_shstk(entry);
> Peter Z was pointing out that we should get rid of all this generic
> code
> manipulation.  We might not easily be able to do it *all*, but we
> can do
> better than what we've got here.
> 
> Basically, if you have code outside of arch/x86 in your patch set
> that
> refers to shadow stacks, you should consider it a bug (for now),
> especially if you have to hack .c files.
> 
> For instance, in the code above, you could move the
> is_shstk_mapping() into:
> 
> static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct
> *vma)
> {
>         if (likely(vma->vm_flags & VM_WRITE))
>                 pte = pte_mkwrite(pte);
> 	
> +	pte = arch_pte_mkwrite(pte, vma);
> +
>         return pte;
> }
> 
> ... and add an arch callback that does:
> 
> static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct
> *vma)
> {
> 	if (!is_shstk_mapping(vma->vm_flags))
> 		return pte;
> 
> 	WARN_ON(... pte bits incompatible with shadow stacks?);
> 
> 	/* Lots of comments of course */
> 	entry = pte_mkdirty_shstk(entry);
> }
> 
> This is just one example.  You are probably going to need a couple
> of
> similar things.  Just remember: the bar is very high to make changes
> to
> .c files outside of arch/x86.  You can do a _bit_ more in non-x86
> headers, but you have the most freedom to patch what you want as
> long as
> it's in arch/x86.

Ok, I will work on that.  Thanks!

Yu-cheng
diff mbox

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cd7c1a57a14..7f3e11d3b64a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -597,6 +597,8 @@  static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (is_shstk_mapping(vma->vm_flags))
+			entry = pmd_mkdirty_shstk(entry);
 		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
@@ -1193,6 +1195,8 @@  static int do_huge_pmd_wp_page_fallback(struct vm_fault *vmf, pmd_t orig_pmd,
 		pte_t entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (is_shstk_mapping(vma->vm_flags))
+			entry = pte_mkdirty_shstk(entry);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -1277,6 +1281,8 @@  int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (is_shstk_mapping(vma->vm_flags))
+			entry = pmd_mkdirty_shstk(entry);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry,  1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		ret |= VM_FAULT_WRITE;
@@ -1347,6 +1353,8 @@  int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		pmd_t entry;
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (is_shstk_mapping(vma->vm_flags))
+			entry = pmd_mkdirty_shstk(entry);
 		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
diff --git a/mm/memory.c b/mm/memory.c
index a2695dbc0418..f7c46d61eaea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4108,7 +4108,13 @@  static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-			if (dirty && !pmd_write(orig_pmd)) {
+			/*
+			 * Shadow stack trans huge PMDs are copy-on-access,
+			 * so wp_huge_pmd() on them no mater if we have a
+			 * write fault or not.
+			 */
+			if (is_shstk_mapping(vma->vm_flags) ||
+			    (dirty && !pmd_write(orig_pmd))) {
 				ret = wp_huge_pmd(&vmf, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;