
[v5,1/5] mm: add vm_insert_mixed_mkwrite()

Message ID 20170724170616.25810-2-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Ross Zwisler July 24, 2017, 5:06 p.m. UTC
To be able to use the common 4k zero page in DAX we need to have our PTE
fault path look more like our PMD fault path, where a PTE entry can be
marked as dirty and writable as it is first inserted rather than waiting
for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call.

Right now we can rely on having a dax_pfn_mkwrite() call because we can
distinguish between these two cases in do_wp_page():

	case 1: 4k zero page => writable DAX storage
	case 2: read-only DAX storage => writable DAX storage

This distinction is made via vm_normal_page().  vm_normal_page() returns
NULL for the common 4k zero page, though, just as it does for DAX PTEs.
Instead of special-casing the DAX + 4k zero page combination, we will
simplify our DAX PTE page fault sequence so that it matches our DAX PMD
sequence and get rid of the dax_pfn_mkwrite() helper.  We will instead use
dax_iomap_fault() to handle write-protection faults.

This means that insert_pfn() needs to follow the lead of insert_pfn_pmd()
and allow us to pass in a 'mkwrite' flag.  If 'mkwrite' is set, insert_pfn()
will do the work that was previously done by wp_page_reuse() as part of the
dax_pfn_mkwrite() call path.
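
To illustrate, a minimal caller-side sketch (not part of this patch; the
dax_pte_fault_sketch() name and the error handling are made up) of how a
DAX PTE fault handler can then install a dirty, writable entry in one
step instead of the old insert-read-only plus dax_pfn_mkwrite() two-step:

	/*
	 * Illustrative sketch only: on a write fault the PTE is made
	 * dirty and writable up front, so no follow-up mkwrite fault is
	 * needed.  Real callers map errnos more carefully.
	 */
	static int dax_pte_fault_sketch(struct vm_fault *vmf, pfn_t pfn)
	{
		int error;

		if (vmf->flags & FAULT_FLAG_WRITE)
			error = vm_insert_mixed_mkwrite(vmf->vma,
					vmf->address, pfn);
		else
			error = vm_insert_mixed(vmf->vma, vmf->address, pfn);

		return error ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
	}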

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 45 insertions(+), 7 deletions(-)

Comments

Kirill A. Shutemov July 24, 2017, 10:14 p.m. UTC | #1
On Mon, Jul 24, 2017 at 11:06:12AM -0600, Ross Zwisler wrote:
> @@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  	if (!pte)
>  		goto out;
>  	retval = -EBUSY;
> -	if (!pte_none(*pte))
> -		goto out_unlock;
> +	if (!pte_none(*pte)) {
> +		if (mkwrite) {
> +			/*
> +			 * For read faults on private mappings the PFN passed
> +			 * in may not match the PFN we have mapped if the
> +			 * mapped PFN is a writeable COW page.  In the mkwrite
> +			 * case we are creating a writable PTE for a shared
> +			 * mapping and we expect the PFNs to match.
> +			 */

Can we?

I guess it's up to the filesystem whether it wants to reuse the same spot
to write data or not. I think your assumption works for ext4 and xfs. I
wouldn't be that sure about btrfs or other filesystems with CoW support.
Christoph Hellwig July 25, 2017, 8:01 a.m. UTC | #2
On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> I guess it's up to the filesystem whether it wants to reuse the same spot
> to write data or not. I think your assumption works for ext4 and xfs. I
> wouldn't be that sure about btrfs or other filesystems with CoW support.

Or XFS with reflinks for that matter.  Which currently can't be
combined with DAX, but I had a somewhat working version a few months
ago.
Jan Kara July 25, 2017, 9:35 a.m. UTC | #3
On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > I guess it's up to the filesystem whether it wants to reuse the same spot
> > to write data or not. I think your assumption works for ext4 and xfs. I
> > wouldn't be that sure about btrfs or other filesystems with CoW support.
> 
> > Or XFS with reflinks for that matter.  Which currently can't be
> > combined with DAX, but I had a somewhat working version a few months
> > ago.

But in cases like COW, when the block mapping changes, the process must
run unmap_mapping_range() before installing the new PTE so that all
processes mapping this file offset actually refault and see the new
mapping. So this would go through the pte_none() case. Am I missing something?

								Honza
Christoph Hellwig July 25, 2017, 12:15 p.m. UTC | #4
On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > I guess it's up to the filesystem whether it wants to reuse the same spot
> > > to write data or not. I think your assumption works for ext4 and xfs. I
> > > wouldn't be that sure about btrfs or other filesystems with CoW support.
> > 
> > > Or XFS with reflinks for that matter.  Which currently can't be
> > > combined with DAX, but I had a somewhat working version a few months
> > > ago.
> 
> But in cases like COW, when the block mapping changes, the process must
> run unmap_mapping_range() before installing the new PTE so that all
> processes mapping this file offset actually refault and see the new
> mapping. So this would go through the pte_none() case. Am I missing something?

Yes, for DAX COW mappings we'd probably need something like this, unlike
the pagecache COW handling, for which only the underlying block changes
but not the page.
Jan Kara July 25, 2017, 12:50 p.m. UTC | #5
On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > > I guess it's up to the filesystem whether it wants to reuse the same spot
> > > > to write data or not. I think your assumption works for ext4 and xfs. I
> > > > wouldn't be that sure about btrfs or other filesystems with CoW support.
> > > 
> > > > Or XFS with reflinks for that matter.  Which currently can't be
> > > > combined with DAX, but I had a somewhat working version a few months
> > > > ago.
> > 
> > But in cases like COW, when the block mapping changes, the process must
> > run unmap_mapping_range() before installing the new PTE so that all
> > processes mapping this file offset actually refault and see the new
> > mapping. So this would go through the pte_none() case. Am I missing something?
> 
> Yes, for DAX COW mappings we'd probably need something like this, unlike
> the pagecache COW handling, for which only the underlying block changes
> but not the page.

Right. So again nothing where the WARN_ON should trigger. That being said,
I don't care about the WARN_ON too deeply, but it can help to catch DAX
bugs, so if we can keep it I'd prefer to do so...

								Honza
Kirill A. Shutemov July 25, 2017, 2:31 p.m. UTC | #6
On Tue, Jul 25, 2017 at 02:50:37PM +0200, Jan Kara wrote:
> On Tue 25-07-17 14:15:22, Christoph Hellwig wrote:
> > On Tue, Jul 25, 2017 at 11:35:08AM +0200, Jan Kara wrote:
> > > On Tue 25-07-17 10:01:58, Christoph Hellwig wrote:
> > > > On Tue, Jul 25, 2017 at 01:14:00AM +0300, Kirill A. Shutemov wrote:
> > > > > I guess it's up to the filesystem whether it wants to reuse the same spot
> > > > > to write data or not. I think your assumption works for ext4 and xfs. I
> > > > > wouldn't be that sure about btrfs or other filesystems with CoW support.
> > > > 
> > > > > Or XFS with reflinks for that matter.  Which currently can't be
> > > > > combined with DAX, but I had a somewhat working version a few months
> > > > > ago.
> > > 
> > > But in cases like COW, when the block mapping changes, the process must
> > > run unmap_mapping_range() before installing the new PTE so that all
> > > processes mapping this file offset actually refault and see the new
> > > mapping. So this would go through the pte_none() case. Am I missing something?
> > 
> > Yes, for DAX COW mappings we'd probably need something like this, unlike
> > the pagecache COW handling, for which only the underlying block changes
> > but not the page.
> 
> Right. So again nothing where the WARN_ON should trigger.

Yes. I was confused about how COW is handled.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
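
For reference, the ordering Jan describes can be sketched as follows
(fs_dax_cow_write_sketch() is a hypothetical helper, not from any posted
patch): the filesystem unmaps the stale block mapping first, so every
process refaults and insert_pfn() takes the pte_none() path instead of
tripping the PFN-match WARN_ON_ONCE():

	/*
	 * Hypothetical sketch of the CoW ordering discussed above:
	 * unmap the old mapping before installing a PTE for the new
	 * block.
	 */
	static int fs_dax_cow_write_sketch(struct vm_fault *vmf,
			struct address_space *mapping, loff_t pos,
			pfn_t new_pfn)
	{
		/* Force all mappers of this file offset to refault... */
		unmap_mapping_range(mapping, pos, PAGE_SIZE, 0);

		/*
		 * ...so this insert sees pte_none(), and the PFN-match
		 * WARN_ON_ONCE() in insert_pfn() cannot fire.
		 */
		return vm_insert_mixed_mkwrite(vmf->vma, vmf->address,
				new_pfn);
	}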

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..483e84c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2293,6 +2293,8 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index 0e517be..b29dd42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1646,7 +1646,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL(vm_insert_page);
 
 static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn, pgprot_t prot)
+			pfn_t pfn, pgprot_t prot, bool mkwrite)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int retval;
@@ -1658,14 +1658,35 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	if (!pte)
 		goto out;
 	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
+	if (!pte_none(*pte)) {
+		if (mkwrite) {
+			/*
+			 * For read faults on private mappings the PFN passed
+			 * in may not match the PFN we have mapped if the
+			 * mapped PFN is a writeable COW page.  In the mkwrite
+			 * case we are creating a writable PTE for a shared
+			 * mapping and we expect the PFNs to match.
+			 */
+			if (WARN_ON_ONCE(pte_pfn(*pte) != pfn_t_to_pfn(pfn)))
+				goto out_unlock;
+			entry = *pte;
+			goto out_mkwrite;
+		} else
+			goto out_unlock;
+	}
 
 	/* Ok, finally just insert the thing.. */
 	if (pfn_t_devmap(pfn))
 		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
 	else
 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+
+out_mkwrite:
+	if (mkwrite) {
+		entry = pte_mkyoung(entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	}
+
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
@@ -1736,14 +1757,15 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
-	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot);
+	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
+			false);
 
 	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn_prot);
 
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
-			pfn_t pfn)
+static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn, bool mkwrite)
 {
 	pgprot_t pgprot = vma->vm_page_prot;
 
@@ -1772,10 +1794,24 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 		page = pfn_to_page(pfn_t_to_pfn(pfn));
 		return insert_page(vma, addr, page, pgprot);
 	}
-	return insert_pfn(vma, addr, pfn, pgprot);
+	return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
+}
+
+int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	return __vm_insert_mixed(vma, addr, pfn, false);
+
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr,
+			pfn_t pfn)
+{
+	return __vm_insert_mixed(vma, addr, pfn, true);
+}
+EXPORT_SYMBOL(vm_insert_mixed_mkwrite);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
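
As a usage note, a caller-side sketch of handling the new helper's return
values (the dax_insert_result_sketch() name is made up; the errnos are
the ones insert_pfn() can return):

	/*
	 * insert_pfn() returns 0 on success, -ENOMEM if the page table
	 * cannot be allocated, and -EBUSY if a PTE is already present
	 * and 'mkwrite' is not set (or if the PFN-match WARN_ON_ONCE()
	 * trips).
	 */
	static int dax_insert_result_sketch(int error)
	{
		if (error == -ENOMEM)
			return VM_FAULT_OOM;
		if (error < 0 && error != -EBUSY)
			return VM_FAULT_SIGBUS;
		/* -EBUSY: somebody else already installed the PTE. */
		return VM_FAULT_NOPAGE;
	}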