
[3/3] dax: Clear dirty entry tags on cache flush

Message ID 1466523915-14644-4-git-send-email-jack@suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara June 21, 2016, 3:45 p.m. UTC
Currently we never clear dirty tags in DAX mappings, and thus the address
ranges to flush keep accumulating. Now that we have locking of radix tree
entries, we have all the locking necessary to reliably clear the radix
tree dirty tag when flushing caches for the corresponding address range.
Similarly to page_mkclean(), we also have to write-protect pages to get a
page fault when the page is next written to, so that we can mark the
entry dirty again.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)
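
In outline, the ordering this patch establishes in dax_writeback_one() is
the following (a simplified sketch condensed from the hunk in the patch
below, not the verbatim code):

	/* write-protect and clean all user PTEs mapping this pfn */
	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
	/* flush CPU caches for the range to persistent memory */
	wb_cache_pmem(dax.addr, dax.size);
	/* only after the flush has completed, clear the radix tree dirty tag */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);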

Comments

kernel test robot June 21, 2016, 5:31 p.m. UTC | #1
Hi,

[auto build test ERROR on v4.7-rc4]
[also build test ERROR on next-20160621]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/dax-Clear-dirty-bits-after-flushing-caches/20160621-234931
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/pagemap.h:10:0,
                    from include/linux/blkdev.h:14,
                    from fs/dax.c:18:
   fs/dax.c: In function 'dax_mapping_entry_mkclean':
>> arch/x86/include/asm/pgtable_32.h:52:38: error: incompatible type for argument 1 of '__kunmap_atomic'
    #define pte_unmap(pte) kunmap_atomic((pte))
                                         ^
   include/linux/highmem.h:127:18: note: in definition of macro 'kunmap_atomic'
     __kunmap_atomic(addr);                                  \
                     ^~~~
>> include/linux/mm.h:1681:2: note: in expansion of macro 'pte_unmap'
     pte_unmap(pte);     \
     ^~~~~~~~~
>> fs/dax.c:714:3: note: in expansion of macro 'pte_unmap_unlock'
      pte_unmap_unlock(pte, ptl);
      ^~~~~~~~~~~~~~~~
   In file included from include/linux/highmem.h:34:0,
                    from include/linux/pagemap.h:10,
                    from include/linux/blkdev.h:14,
                    from fs/dax.c:18:
   arch/x86/include/asm/highmem.h:68:6: note: expected 'void *' but argument is of type 'pte_t {aka union <anonymous>}'
    void __kunmap_atomic(void *kvaddr);
         ^~~~~~~~~~~~~~~

vim +/pte_unmap_unlock +714 fs/dax.c

   698			address = pgoff_address(index, vma);
   699			changed = false;
   700			if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
   701				continue;
   702			if (pfn != pte_pfn(*ptep))
   703				goto unlock;
   704			if (!pte_dirty(*ptep) && !pte_write(*ptep))
   705				goto unlock;
   706	
   707			flush_cache_page(vma, address, pfn);
   708			pte = ptep_clear_flush(vma, address, ptep);
   709			pte = pte_wrprotect(pte);
   710			pte = pte_mkclean(pte);
   711			set_pte_at(vma->vm_mm, address, ptep, pte);
   712			changed = true;
   713	unlock:
 > 714			pte_unmap_unlock(pte, ptl);
   715	
   716			if (changed)
   717				mmu_notifier_invalidate_page(vma->vm_mm, address);
   718		}
   719		i_mmap_unlock_read(mapping);
   720	}
   721	
   722	static int dax_writeback_one(struct block_device *bdev,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
kernel test robot June 21, 2016, 8:59 p.m. UTC | #2
Hi,

[auto build test ERROR on v4.7-rc4]
[also build test ERROR on next-20160621]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/dax-Clear-dirty-bits-after-flushing-caches/20160621-234931
config: i386-allyesconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/dax.c: In function 'dax_mapping_entry_mkclean':
>> fs/dax.c:714:343: error: incompatible type for argument 1 of '__kunmap_atomic'
      pte_unmap_unlock(pte, ptl);
                                                                                                                                                                                                                                                                                                                                                          ^
   In file included from include/linux/highmem.h:34:0,
                    from include/linux/pagemap.h:10,
                    from include/linux/blkdev.h:14,
                    from fs/dax.c:18:
   arch/x86/include/asm/highmem.h:68:6: note: expected 'void *' but argument is of type 'pte_t {aka union <anonymous>}'
    void __kunmap_atomic(void *kvaddr);
         ^~~~~~~~~~~~~~~

vim +/__kunmap_atomic +714 fs/dax.c

   708			pte = ptep_clear_flush(vma, address, ptep);
   709			pte = pte_wrprotect(pte);
   710			pte = pte_mkclean(pte);
   711			set_pte_at(vma->vm_mm, address, ptep, pte);
   712			changed = true;
   713	unlock:
 > 714			pte_unmap_unlock(pte, ptl);
   715	
   716			if (changed)
   717				mmu_notifier_invalidate_page(vma->vm_mm, address);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
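
Both robot reports point at the same problem: pte_unmap_unlock() takes the
mapped PTE pointer, and on the i386 configs above pte_unmap() expands to
kunmap_atomic(), which wants a void *. The patch passes the pte_t value
instead of the pointer, so the call at fs/dax.c:714 presumably needs to
become something like (a guess at the intended fix, not part of the posted
patch):

	unlock:
		pte_unmap_unlock(ptep, ptl);
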
Ross Zwisler June 28, 2016, 9:38 p.m. UTC | #3
On Tue, Jun 21, 2016 at 05:45:15PM +0200, Jan Kara wrote:
> Currently we never clear dirty tags in DAX mappings and thus address
> ranges to flush accumulate. Now that we have locking of radix tree
> entries, we have all the locking necessary to reliably clear the radix
> tree dirty tag when flushing caches for corresponding address range.
> Similarly to page_mkclean() we also have to write-protect pages to get a
> page fault when the page is next written to so that we can mark the
> entry dirty again.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I think we still have a race where we can end up with a writeable PTE but a
clean DAX radix tree entry.  Here is the flow:

Thread 1					Thread 2
======== 					========
wp_pfn_shared()
  dax_pfn_mkwrite()
    get_unlocked_mapping_entry()
    radix_tree_tag_set(DIRTY)
    put_unlocked_mapping_entry()
						dax_writeback_one()
						  lock_slot()
						  radix_tree_tag_clear(TOWRITE)
						  dax_mapping_entry_mkclean()
						  wb_cache_pmem()
						  radix_tree_tag_clear(DIRTY)
						  put_locked_mapping_entry()
  wp_page_reuse()
    maybe_mkwrite()

When this ends the radix tree entry will have been written back and the
TOWRITE and DIRTY radix tree tags will have been cleared, but the PTE will be
writeable due to the last maybe_mkwrite() call.  This will result in no new
dax_pfn_mkwrite() calls happening to re-dirty the radix tree entry.

Essentially the problem is that we don't hold any locks through all the work
done by wp_pfn_shared() so that we can do maybe_mkwrite() safely.
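
For reference, the v4.7-era dax_pfn_mkwrite() only touches the radix tree
entry and returns, roughly like this (a simplified reconstruction based on
the flow above, not quoted from fs/dax.c):

	int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct address_space *mapping = vma->vm_file->f_mapping;
		pgoff_t index = vmf->pgoff;
		void *entry;

		spin_lock_irq(&mapping->tree_lock);
		entry = get_unlocked_mapping_entry(mapping, index, NULL);
		if (entry && radix_tree_exceptional_entry(entry)) {
			/*
			 * Re-dirty the entry here; the PTE is made writeable
			 * only later, back in wp_page_reuse()/maybe_mkwrite(),
			 * with no DAX lock held in between. That is the window.
			 */
			radix_tree_tag_set(&mapping->page_tree, index,
					   PAGECACHE_TAG_DIRTY);
			put_unlocked_mapping_entry(mapping, index, entry);
		}
		spin_unlock_irq(&mapping->tree_lock);
		return VM_FAULT_NOPAGE;
	}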

Perhaps we can lock the radix tree slot in dax_pfn_mkwrite(), then have
another callback into DAX after the wp_page_reuse() => maybe_mkwrite() to
unlock the slot?  This would guarantee that they happen together from DAX's
point of view.

Thread 1's flow would then be:

Thread 1
========
wp_pfn_shared()
  dax_pfn_mkwrite()
    lock_slot()
    radix_tree_tag_set(DIRTY)
  wp_page_reuse()
    maybe_mkwrite()
    new_dax_call_to_unlock_slot()
      put_unlocked_mapping_entry()

> ---
>  fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5209f8cd0bee..c0c4eecb5f73 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -31,6 +31,7 @@
>  #include <linux/vmstat.h>
>  #include <linux/pfn_t.h>
>  #include <linux/sizes.h>
> +#include <linux/mmu_notifier.h>
>  
>  /*
>   * We use lowest available bit in exceptional entry for locking, other two
> @@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  	return new_entry;
>  }
>  
> +static inline unsigned long
> +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> +{
> +	unsigned long address;
> +
> +	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +	return address;
> +}
> +
> +/* Walk all mappings of a given index of a file and writeprotect them */
> +static void dax_mapping_entry_mkclean(struct address_space *mapping,
> +				      pgoff_t index, unsigned long pfn)
> +{
> +	struct vm_area_struct *vma;
> +	pte_t *ptep;
> +	pte_t pte;
> +	spinlock_t *ptl;
> +	bool changed;
> +
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
> +		unsigned long address;
> +
> +		cond_resched();

Do we really need to cond_resched() between every PTE clean?  Maybe it would
be better to cond_resched() after each call to dax_writeback_one() in
dax_writeback_mapping_range() or something so we can be sure we've done a good
amount of work between each?

> +
> +		if (!(vma->vm_flags & VM_SHARED))
> +			continue;
> +
> +		address = pgoff_address(index, vma);
> +		changed = false;
> +		if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
> +			continue;
> +		if (pfn != pte_pfn(*ptep))
> +			goto unlock;
> +		if (!pte_dirty(*ptep) && !pte_write(*ptep))
> +			goto unlock;
> +
> +		flush_cache_page(vma, address, pfn);
> +		pte = ptep_clear_flush(vma, address, ptep);
> +		pte = pte_wrprotect(pte);
> +		pte = pte_mkclean(pte);
> +		set_pte_at(vma->vm_mm, address, ptep, pte);
> +		changed = true;
> +unlock:
> +		pte_unmap_unlock(pte, ptl);
> +
> +		if (changed)
> +			mmu_notifier_invalidate_page(vma->vm_mm, address);
> +	}
> +	i_mmap_unlock_read(mapping);
> +}
> +
>  static int dax_writeback_one(struct block_device *bdev,
>  		struct address_space *mapping, pgoff_t index, void *entry)
>  {
> @@ -723,17 +777,30 @@ static int dax_writeback_one(struct block_device *bdev,
>  	 * eventually calls cond_resched().
>  	 */
>  	ret = dax_map_atomic(bdev, &dax);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		put_locked_mapping_entry(mapping, index, entry);
>  		return ret;
> +	}
>  
>  	if (WARN_ON_ONCE(ret < dax.size)) {
>  		ret = -EIO;
>  		goto unmap;
>  	}
>  
> +	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
>  	wb_cache_pmem(dax.addr, dax.size);
> +	/*
> +	 * After we have flushed the cache, we can clear the dirty tag. There
> +	 * cannot be new dirty data in the pfn after the flush has completed as
> +	 * the pfn mappings are writeprotected and fault waits for mapping
> +	 * entry lock.
> +	 */
> +	spin_lock_irq(&mapping->tree_lock);
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
> +	spin_unlock_irq(&mapping->tree_lock);
>  unmap:
>  	dax_unmap_atomic(bdev, &dax);
> +	put_locked_mapping_entry(mapping, index, entry);
>  	return ret;
>  
>  put_unlock:
> -- 
> 2.6.6
> 
Jan Kara June 29, 2016, 8:47 p.m. UTC | #4
On Tue 28-06-16 15:38:57, Ross Zwisler wrote:
> On Tue, Jun 21, 2016 at 05:45:15PM +0200, Jan Kara wrote:
> > Currently we never clear dirty tags in DAX mappings and thus address
> > ranges to flush accumulate. Now that we have locking of radix tree
> > entries, we have all the locking necessary to reliably clear the radix
> > tree dirty tag when flushing caches for corresponding address range.
> > Similarly to page_mkclean() we also have to write-protect pages to get a
> > page fault when the page is next written to so that we can mark the
> > entry dirty again.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> I think we still have a race where we can end up with a writeable PTE but a
> clean DAX radix tree entry.  Here is the flow:
> 
> Thread 1					Thread 2
> ======== 					========
> wp_pfn_shared()
>   dax_pfn_mkwrite()
>     get_unlocked_mapping_entry()
>     radix_tree_tag_set(DIRTY)
>     put_unlocked_mapping_entry()
> 						dax_writeback_one()
> 						  lock_slot()
> 						  radix_tree_tag_clear(TOWRITE)
> 						  dax_mapping_entry_mkclean()
> 						  wb_cache_pmem()
> 						  radix_tree_tag_clear(DIRTY)
> 						  put_locked_mapping_entry()
>   wp_page_reuse()
>     maybe_mkwrite()
> 
> When this ends the radix tree entry will have been written back and the
> TOWRITE and DIRTY radix tree tags will have been cleared, but the PTE will be
> writeable due to the last maybe_mkwrite() call.  This will result in no new
> dax_pfn_mkwrite() calls happening to re-dirty the radix tree entry.

Good point, thanks for spotting this! There are several ways to fix this -
one is the callback you suggest below, but I don't like that much. Another
is returning VM_FAULT_DAX_LOCKED, as we do for COW faults, and handling that
properly in the page_mkwrite() paths. Finally, we could just handle the PTE
changes within the pfn_mkwrite() handler, like we already do for the ->fault
path, and thus keep the locking private to the DAX code. I haven't decided
yet what's best. I'll think about it.

> Essentially the problem is that we don't hold any locks through all the work
> done by wp_pfn_shared() so that we can do maybe_mkwrite() safely.
> 
> Perhaps we can lock the radix tree slot in dax_pfn_mkwrite(), then have
> another callback into DAX after the wp_page_reuse() => maybe_mkwrite() to
> unlock the slot?  This would guarantee that they happen together from DAX's
> point of view.
> 
> Thread 1's flow would then be:
> 
> Thread 1
> ========
> wp_pfn_shared()
>   dax_pfn_mkwrite()
>     lock_slot()
>     radix_tree_tag_set(DIRTY)
>   wp_page_reuse()
>     maybe_mkwrite()
>     new_dax_call_to_unlock_slot()
>       put_unlocked_mapping_entry()
> 
> > ---
> >  fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 68 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5209f8cd0bee..c0c4eecb5f73 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -31,6 +31,7 @@
> >  #include <linux/vmstat.h>
> >  #include <linux/pfn_t.h>
> >  #include <linux/sizes.h>
> > +#include <linux/mmu_notifier.h>
> >  
> >  /*
> >   * We use lowest available bit in exceptional entry for locking, other two
> > @@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  	return new_entry;
> >  }
> >  
> > +static inline unsigned long
> > +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> > +{
> > +	unsigned long address;
> > +
> > +	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> > +	return address;
> > +}
> > +
> > +/* Walk all mappings of a given index of a file and writeprotect them */
> > +static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > +				      pgoff_t index, unsigned long pfn)
> > +{
> > +	struct vm_area_struct *vma;
> > +	pte_t *ptep;
> > +	pte_t pte;
> > +	spinlock_t *ptl;
> > +	bool changed;
> > +
> > +	i_mmap_lock_read(mapping);
> > +	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
> > +		unsigned long address;
> > +
> > +		cond_resched();
> 
> Do we really need to cond_resched() between every PTE clean?  Maybe it would
> be better to cond_resched() after each call to dax_writeback_one() in
> dax_writeback_mapping_range() or something so we can be sure we've done a good
> amount of work between each?

This is copied over from the rmap code and I'd prefer to keep the code similar.
Note that cond_resched() does not mean we are going to reschedule. It is a
pretty cheap call, and we will be rescheduled only if we have used up our CPU
slice...

								Honza
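
For reference, cond_resched() amounts to roughly the following (a simplified
sketch; the real kernel goes through _cond_resched()/preempt_schedule_common()
and the details vary between versions):

	static inline int cond_resched(void)
	{
		/* only reschedule if the scheduler has flagged us to */
		if (should_resched(0)) {
			schedule();
			return 1;
		}
		/* common case: a cheap flag test, no reschedule */
		return 0;
	}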

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 5209f8cd0bee..c0c4eecb5f73 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,7 @@ 
 #include <linux/vmstat.h>
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * We use lowest available bit in exceptional entry for locking, other two
@@ -665,6 +666,59 @@  static void *dax_insert_mapping_entry(struct address_space *mapping,
 	return new_entry;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	return address;
+}
+
+/* Walk all mappings of a given index of a file and writeprotect them */
+static void dax_mapping_entry_mkclean(struct address_space *mapping,
+				      pgoff_t index, unsigned long pfn)
+{
+	struct vm_area_struct *vma;
+	pte_t *ptep;
+	pte_t pte;
+	spinlock_t *ptl;
+	bool changed;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+		unsigned long address;
+
+		cond_resched();
+
+		if (!(vma->vm_flags & VM_SHARED))
+			continue;
+
+		address = pgoff_address(index, vma);
+		changed = false;
+		if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
+			continue;
+		if (pfn != pte_pfn(*ptep))
+			goto unlock;
+		if (!pte_dirty(*ptep) && !pte_write(*ptep))
+			goto unlock;
+
+		flush_cache_page(vma, address, pfn);
+		pte = ptep_clear_flush(vma, address, ptep);
+		pte = pte_wrprotect(pte);
+		pte = pte_mkclean(pte);
+		set_pte_at(vma->vm_mm, address, ptep, pte);
+		changed = true;
+unlock:
+		pte_unmap_unlock(pte, ptl);
+
+		if (changed)
+			mmu_notifier_invalidate_page(vma->vm_mm, address);
+	}
+	i_mmap_unlock_read(mapping);
+}
+
 static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
@@ -723,17 +777,30 @@  static int dax_writeback_one(struct block_device *bdev,
 	 * eventually calls cond_resched().
 	 */
 	ret = dax_map_atomic(bdev, &dax);
-	if (ret < 0)
+	if (ret < 0) {
+		put_locked_mapping_entry(mapping, index, entry);
 		return ret;
+	}
 
 	if (WARN_ON_ONCE(ret < dax.size)) {
 		ret = -EIO;
 		goto unmap;
 	}
 
+	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
 	wb_cache_pmem(dax.addr, dax.size);
+	/*
+	 * After we have flushed the cache, we can clear the dirty tag. There
+	 * cannot be new dirty data in the pfn after the flush has completed as
+	 * the pfn mappings are writeprotected and fault waits for mapping
+	 * entry lock.
+	 */
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&mapping->tree_lock);
 unmap:
 	dax_unmap_atomic(bdev, &dax);
+	put_locked_mapping_entry(mapping, index, entry);
 	return ret;
 
 put_unlock: