[3/4] add replace_page(): change the page pte is pointing to.

Message ID 1239249521-5013-4-git-send-email-ieidus@redhat.com (mailing list archive)
State Not Applicable

Commit Message

Izik Eidus April 9, 2009, 3:58 a.m. UTC
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

This function works by removing oldpage from the rmap and calling
put_page() on it, and by setting the pte to point to newpage and
inserting it into the rmap using page_add_file_rmap().

Note: newpage must be a non-anonymous page. The reason is that
replace_page() is built to allow mapping one page at more than one
virtual address; the page can be mapped at different offsets inside
each vma, and therefore we cannot trust page->index anymore.

The side effect of this is that newpage cannot be anything but a
kernel-allocated page that is not swappable.

Signed-off-by: Izik Eidus <ieidus@redhat.com>
---
 include/linux/mm.h |    5 +++
 mm/memory.c        |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 0 deletions(-)
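
The bookkeeping described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the toy_* types and functions are hypothetical stand-ins invented for illustration, not kernel APIs, and the model ignores locking, TLB flushing and MMU notifiers. It mirrors only the order of operations in the patch: check that the pte is unchanged, take a reference on newpage and add it to the rmap, switch the pte, then drop oldpage from the rmap and release its reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; none of these are real kernel types. */
struct toy_page {
	int refcount;	/* models get_page()/put_page() */
	int mapcount;	/* models rmap add/remove */
	bool anon;	/* models PageAnon() */
};

struct toy_mm {
	long anon_rss;
	long file_rss;
};

struct toy_pte {
	struct toy_page *page;	/* page this entry points to */
	bool writable;
};

/* Models pte_same(): two entries are equal if every field matches. */
bool toy_pte_same(struct toy_pte a, struct toy_pte b)
{
	return a.page == b.page && a.writable == b.writable;
}

/* Returns 0 on success, -1 if the entry changed since orig was sampled. */
int toy_replace_page(struct toy_mm *mm, struct toy_pte *ptep,
		     struct toy_page *oldpage, struct toy_page *newpage,
		     struct toy_pte orig, bool writable)
{
	assert(!newpage->anon);	/* mirrors BUG_ON(PageAnon(newpage)) */

	if (!toy_pte_same(*ptep, orig))
		return -1;

	newpage->refcount++;	/* get_page(newpage) */
	newpage->mapcount++;	/* page_add_file_rmap(newpage) */

	ptep->page = newpage;	/* clear the old entry, install the new one */
	ptep->writable = writable;

	oldpage->mapcount--;	/* page_remove_rmap(oldpage) */
	if (oldpage->anon) {	/* the mapping is now accounted as file-backed */
		mm->anon_rss--;
		mm->file_rss++;
	}
	oldpage->refcount--;	/* put_page(oldpage) */
	return 0;
}
```

In this model a caller samples the entry first and passes the sampled value as orig; a second call with a stale orig fails without touching any counter, which is the role pte_same() plays in the patch.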

Comments

Andrew Morton April 14, 2009, 10:09 p.m. UTC | #1
On Thu,  9 Apr 2009 06:58:40 +0300
Izik Eidus <ieidus@redhat.com> wrote:

> replace_page() allows changing the mapping of a pte from one physical page
> to a different physical page.

At a high level, this is very similar to what page migration does.  Yet
this implementation shares nothing with the page migration code.

Can this situation be improved?

> This function works by removing oldpage from the rmap and calling
> put_page() on it, and by setting the pte to point to newpage and
> inserting it into the rmap using page_add_file_rmap().
> 
> Note: newpage must be a non-anonymous page. The reason is that
> replace_page() is built to allow mapping one page at more than one
> virtual address; the page can be mapped at different offsets inside
> each vma, and therefore we cannot trust page->index anymore.
> 
> The side effect of this is that newpage cannot be anything but a
> kernel-allocated page that is not swappable.
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrea Arcangeli April 15, 2009, 11:25 a.m. UTC | #2
On Tue, Apr 14, 2009 at 03:09:25PM -0700, Andrew Morton wrote:
> On Thu,  9 Apr 2009 06:58:40 +0300
> Izik Eidus <ieidus@redhat.com> wrote:
> 
> > replace_page() allows changing the mapping of a pte from one physical page
> > to a different physical page.
> 
> At a high level, this is very similar to what page migration does.  Yet
> this implementation shares nothing with the page migration code.
> 
> Can this situation be improved?

This was discussed last time too. Basically the thing is that using
migration entry with its special page fault paths, for this looks a
bit of an overkill complexity and unnecessary dependency on the
migration code. All we need is to mark the pte readonly. replace_page
is a no brainer then. The brainer part is page_wrprotect
(page_wrprotect is like fork).

The data visibility in the final memcmp you mentioned in the other
mail is supposedly taken care of by page_wrprotect too. It already
does flush_cache_page for the virtual indexed and not physically
tagged caches. page_wrprotect has to also IPI all CPUs to nuke any not
wrprotected tlb entry. I don't think we need further smp memory
barriers when we're guaranteed all tlb entries are wrprotected in the
other cpus and an IPI and invlpg run in them, to be sure we read the
data stable during memcmp even if we read through the kernel
pagetables and the last userland write happened through userland ptes
before they become effective wrprotected by the IPI.
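
The ordering argued for above (write-protect first, compare the contents, then switch the pte only if it is unchanged) can be sketched as a single-threaded user-space model. Everything here is an assumption made for illustration: the wp_* names are hypothetical, a boolean stands in for the pte write bit, and no caches, TLBs or IPIs are modelled, so the final recheck can never fail in this toy even though it matters on a real SMP system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WP_PAGE_SIZE 64		/* toy page size, not the real PAGE_SIZE */

struct wp_pte {
	unsigned char *data;	/* backing bytes of the mapped page */
	bool writable;
};

/* Models a userland store through the mapping. */
bool wp_user_write(struct wp_pte *pte, size_t off, unsigned char val)
{
	if (!pte->writable || off >= WP_PAGE_SIZE)
		return false;	/* real hardware would fault here */
	pte->data[off] = val;
	return true;
}

/* Models the write-protect step: clear write permission, return the entry. */
struct wp_pte wp_protect(struct wp_pte *pte)
{
	pte->writable = false;
	return *pte;
}

/* Returns 0 if the mapping was switched to the shared page, -1 otherwise. */
int wp_try_merge(struct wp_pte *pte, unsigned char *shared)
{
	struct wp_pte orig = wp_protect(pte);

	/* Contents are stable from here on: no store can go through pte. */
	if (memcmp(pte->data, shared, WP_PAGE_SIZE) != 0)
		return -1;

	/* Recheck in the spirit of pte_same(); always true in this toy. */
	if (pte->data != orig.data || pte->writable != orig.writable)
		return -1;

	pte->data = shared;	/* the replace step */
	return 0;
}
```

Note that a failed merge leaves the entry write-protected in this model; restoring write access would need a separate fault path, which the sketch does not cover.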
Izik Eidus April 15, 2009, 10:48 p.m. UTC | #3
Andrea Arcangeli wrote:
> On Tue, Apr 14, 2009 at 03:09:25PM -0700, Andrew Morton wrote:
>   
>> On Thu,  9 Apr 2009 06:58:40 +0300
>> Izik Eidus <ieidus@redhat.com> wrote:
>>
>>     
>>> replace_page() allows changing the mapping of a pte from one physical page
>>> to a different physical page.
>>>       
>> At a high level, this is very similar to what page migration does.  Yet
>> this implementation shares nothing with the page migration code.
>>
>> Can this situation be improved?
>>     
>
> This was discussed last time too. Basically the thing is that using
> migration entry with its special page fault paths, for this looks a
> bit of an overkill complexity and unnecessary dependency on the
> migration code. 

I agree about that.

> All we need is to mark the pte readonly. replace_page
> is a no brainer then. The brainer part is page_wrprotect
> (page_wrprotect is like fork).
>
> The data visibility in the final memcmp you mentioned in the other
> mail is supposedly taken care of by page_wrprotect too. It already
> does flush_cache_page for the virtual indexed and not physically
> tagged caches. page_wrprotect has to also IPI all CPUs to nuke any not
> wrprotected tlb entry. I don't think we need further smp memory
> barriers when we're guaranteed all tlb entries are wrprotected in the
> other cpus and an IPI and invlpg run in them, to be sure we read the
> data stable during memcmp even if we read through the kernel
> pagetables and the last userland write happened through userland ptes
> before they become effective wrprotected by the IPI.
>   

Yup agree.

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+		 struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
 			unsigned int foll_flags);
 #define FOLL_WRITE	0x01	/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:      vma that holds the pte pointing to oldpage
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot:     page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+		 struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int ret;
+
+	BUG_ON(PageAnon(newpage));
+
+	ret = -EFAULT;
+	addr = page_address_in_vma(oldpage, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	if (!pte_same(*ptep, orig_pte)) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
+	ret = 0;
+	get_page(newpage);
+	page_add_file_rmap(newpage);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+	page_remove_rmap(oldpage);
+	if (PageAnon(oldpage)) {
+		dec_mm_counter(mm, anon_rss);
+		inc_mm_counter(mm, file_rss);
+	}
+	put_page(oldpage);
+
+	pte_unmap_unlock(ptep, ptl);
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results