[v8,6/9] dax: add support for fsync/msync

Message ID 1452230879-18117-7-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Ross Zwisler Jan. 8, 2016, 5:27 a.m. UTC
To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   2 +
 mm/filemap.c        |   6 ++
 3 files changed, 196 insertions(+), 6 deletions(-)
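
For context, the RADIX_DAX_*() helpers used throughout this patch come from
an earlier patch in the series and pack the block sector plus an entry type
into the exceptional radix tree entry.  Roughly (illustrative sketch only --
the exact definitions live in the earlier patch and the bit layout there may
differ):

	/* sketch of the assumed exceptional-entry encoding, not the
	 * literal definitions from the series */
	#define RADIX_DAX_MASK		0xf
	#define RADIX_DAX_SHIFT		4
	#define RADIX_DAX_PTE		(0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
	#define RADIX_DAX_PMD		(0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
	#define RADIX_DAX_TYPE(entry)	((unsigned long)entry & RADIX_DAX_MASK)
	#define RADIX_DAX_SECTOR(entry)	((unsigned long)entry >> RADIX_DAX_SHIFT)
	#define RADIX_DAX_ENTRY(sector, pmd) \
		((void *)((unsigned long)(sector) << RADIX_DAX_SHIFT | \
			  ((pmd) ? RADIX_DAX_PMD : RADIX_DAX_PTE)))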

Comments

Jan Kara Jan. 12, 2016, 10:57 a.m. UTC | #1
On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
> 
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Some comments below.

> ---
>  fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/dax.h |   2 +
>  mm/filemap.c        |   6 ++
>  3 files changed, 196 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b84a46..0db21ea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -24,6 +24,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/pagevec.h>
>  #include <linux/pmem.h>
>  #include <linux/sched.h>
>  #include <linux/uio.h>
> @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
>  	return 0;
>  }
>  
> +#define NO_SECTOR -1
> +
> +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> +		sector_t sector, bool pmd_entry, bool dirty)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type, error = 0;
> +	void *entry;
> +
> +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = radix_tree_lookup(page_tree, index);
> +
> +	if (entry) {
> +		type = RADIX_DAX_TYPE(entry);
> +		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> +					type != RADIX_DAX_PMD)) {
> +			error = -EIO;
> +			goto unlock;
> +		}
> +
> +		if (!pmd_entry || type == RADIX_DAX_PMD)
> +			goto dirty;
> +		radix_tree_delete(&mapping->page_tree, index);
> +		mapping->nrexceptional--;

In theory, you can delete a DIRTY / TOWRITE PTE entry here and insert a
clean PMD entry in its place. That will cause fsync() to miss some flushes,
so you should make sure you transfer all the tags to the new entry.
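
Something along these lines (completely untested sketch, just to illustrate
the idea) would carry the tags across the delete/insert:

	/* untested: preserve DIRTY/TOWRITE across the PTE -> PMD upgrade */
	bool dirty_tag = radix_tree_tag_get(page_tree, index,
					    PAGECACHE_TAG_DIRTY);
	bool towrite_tag = radix_tree_tag_get(page_tree, index,
					      PAGECACHE_TAG_TOWRITE);

	radix_tree_delete(page_tree, index);
	mapping->nrexceptional--;
	error = radix_tree_insert(page_tree, index,
			RADIX_DAX_ENTRY(sector, pmd_entry));
	if (error)
		goto unlock;
	mapping->nrexceptional++;
	if (dirty_tag)
		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
	if (towrite_tag)
		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_TOWRITE);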

> +static int dax_writeback_one(struct block_device *bdev,
> +		struct address_space *mapping, pgoff_t index, void *entry)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type = RADIX_DAX_TYPE(entry);
> +	struct radix_tree_node *node;
> +	struct blk_dax_ctl dax;
> +	void **slot;
> +	int ret = 0;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */
> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +
> +	/* another fsync thread may have already written back this entry */
> +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto unlock;
> +
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +
> +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	dax.sector = RADIX_DAX_SECTOR(entry);
> +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> +	spin_unlock_irq(&mapping->tree_lock);

This seems to be somewhat racy as well - if there are two fsyncs running
against the same inode, one wins the race and clears the TOWRITE tag, and
the second then bails out and may finish before the skipped page gets
flushed.

So we should clear the TOWRITE tag only after the range is flushed.  This
can result in some amount of duplicate flushing, but I don't think the race
will happen frequently enough in practice to be performance relevant.
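
I.e., roughly (untested, just to show the ordering in dax_writeback_one()):

	/* flush first ... */
	dax.sector = RADIX_DAX_SECTOR(entry);
	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
	spin_unlock_irq(&mapping->tree_lock);

	ret = dax_map_atomic(bdev, &dax);
	if (ret < 0)
		return ret;
	wb_cache_pmem(dax.addr, dax.size);
	dax_unmap_atomic(bdev, &dax);

	/* ... and only then drop the TOWRITE tag */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
	spin_unlock_irq(&mapping->tree_lock);
	return ret;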

And secondly: You must write-protect all mappings of the flushed range so
that you get a fault when the sector gets written to again. We spoke about
this in the past already, but somehow it got lost and I forgot about it as
well. You need something like rmap_walk_file()...

> +	/*
> +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> +	 * eventually calls cond_resched().
> +	 */
> +	ret = dax_map_atomic(bdev, &dax);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (WARN_ON_ONCE(ret < dax.size)) {
> +		ret = -EIO;
> +		goto unmap;
> +	}
> +
> +	wb_cache_pmem(dax.addr, dax.size);
> + unmap:
> +	dax_unmap_atomic(bdev, &dax);
> +	return ret;
> +
> + unlock:
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}

...

> @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
>   * dax_pfn_mkwrite - handle first write to DAX page
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
> - *
>   */
>  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +	struct file *file = vma->vm_file;
>  
> -	sb_start_pagefault(sb);
> -	file_update_time(vma->vm_file);
> -	sb_end_pagefault(sb);
> +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);

Why is NO_SECTOR argument correct here?

								Honza
Ross Zwisler Jan. 13, 2016, 7:30 a.m. UTC | #2
On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Some comments below.
> 
> > ---
> >  fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  include/linux/dax.h |   2 +
> >  mm/filemap.c        |   6 ++
> >  3 files changed, 196 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5b84a46..0db21ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> >  #include <linux/pmem.h>
> >  #include <linux/sched.h>
> >  #include <linux/uio.h>
> > @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> >  	return 0;
> >  }
> >  
> > +#define NO_SECTOR -1
> > +
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> > +		sector_t sector, bool pmd_entry, bool dirty)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type, error = 0;
> > +	void *entry;
> > +
> > +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = radix_tree_lookup(page_tree, index);
> > +
> > +	if (entry) {
> > +		type = RADIX_DAX_TYPE(entry);
> > +		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > +					type != RADIX_DAX_PMD)) {
> > +			error = -EIO;
> > +			goto unlock;
> > +		}
> > +
> > +		if (!pmd_entry || type == RADIX_DAX_PMD)
> > +			goto dirty;
> > +		radix_tree_delete(&mapping->page_tree, index);
> > +		mapping->nrexceptional--;
> 
> In theory, you can delete here DIRTY / TOWRITE PTE entry and insert a clean
> PMD entry instead of it. That will cause fsync() to miss some flushes. So
> you should make sure you transfer all the tags to the new entry.

Ah, great catch, I'll address it in v9 which I'll send out tomorrow.

> > +static int dax_writeback_one(struct block_device *bdev,
> > +		struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	struct radix_tree_node *node;
> > +	struct blk_dax_ctl dax;
> > +	void **slot;
> > +	int ret = 0;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked.  These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> > +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > +		goto unlock;
> > +	if (*slot != entry)
> > +		goto unlock;
> > +
> > +	/* another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto unlock;
> > +
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > +		ret = -EIO;
> > +		goto unlock;
> > +	}
> > +
> > +	dax.sector = RADIX_DAX_SECTOR(entry);
> > +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> 
> This seems to be somewhat racy as well - if there are two fsyncs running
> against the same inode, one wins the race and clears TOWRITE tag, the
> second then bails out and may finish before the skipped page gets flushed.
> 
> So we should clear the TOWRITE tag only after the range is flushed.  This
> can result in some amount of duplicate flushing but I don't think the race
> will happen that frequently in practice to be performance relevant.

Yep, this make sense.  I'll also fix that in v9.

> And secondly: You must write-protect all mappings of the flushed range so
> that you get fault when the sector gets written-to again. We spoke about
> this in the past already but somehow it got lost and I forgot about it as
> well. You need something like rmap_walk_file()...

The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:

https://lkml.org/lkml/2015/11/13/759

I removed all the code that cleaned PTE entries and radix tree entries for v3.
The reason behind this was that there was a race that I couldn't figure out
how to solve between the cleaning of the PTEs and the cleaning of the radix
tree entries.

The race goes like this:

Thread 1 (write)			Thread 2 (fsync)
================			================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
					dax_writeback_mapping_range()
					dax_writeback_one()
					radix_tag_clear(DIRTY)
					pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()

After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry.  This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.

The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.

I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock.  I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.

The current code that leaves the radix tree entry will give us correct
behavior - it'll just be less efficient because we will have an ever-growing
dirty set to flush.

> > +	/*
> > +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> > +	 * eventually calls cond_resched().
> > +	 */
> > +	ret = dax_map_atomic(bdev, &dax);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	if (WARN_ON_ONCE(ret < dax.size)) {
> > +		ret = -EIO;
> > +		goto unmap;
> > +	}
> > +
> > +	wb_cache_pmem(dax.addr, dax.size);
> > + unmap:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return ret;
> > +
> > + unlock:
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	return ret;
> > +}
> 
> ...
> 
> > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   * dax_pfn_mkwrite - handle first write to DAX page
> >   * @vma: The virtual memory area where the fault occurred
> >   * @vmf: The description of the fault
> > - *
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct file *file = vma->vm_file;
> >  
> > -	sb_start_pagefault(sb);
> > -	file_update_time(vma->vm_file);
> > -	sb_end_pagefault(sb);
> > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> 
> Why is NO_SECTOR argument correct here?

Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty".  This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write.  If the fault was for a write, the radix tree entry will also
be made dirty.

For reads the radix tree entry will be inserted but left clean.  When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
tree entry & set the dirty tag.

This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().
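
To put the two call sites from the patch side by side:

	/* page fault (dax_insert_mapping): insert the entry, and dirty it
	 * only if this was a write fault */
	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
			vmf->flags & FAULT_FLAG_WRITE);

	/* first write to a read-faulted page (dax_pfn_mkwrite): the entry
	 * already exists, so no sector is needed -- just set the dirty tag */
	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);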
Jan Kara Jan. 13, 2016, 9:35 a.m. UTC | #3
On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > And secondly: You must write-protect all mappings of the flushed range so
> > that you get fault when the sector gets written-to again. We spoke about
> > this in the past already but somehow it got lost and I forgot about it as
> > well. You need something like rmap_walk_file()...
> 
> The code that write protected mappings and then cleaned the radix tree entries
> did get written, and was part of v2:
> 
> https://lkml.org/lkml/2015/11/13/759
> 
> I removed all the code that cleaned PTE entries and radix tree entries for v3.
> The reason behind this was that there was a race that I couldn't figure out
> how to solve between the cleaning of the PTEs and the cleaning of the radix
> tree entries.
> 
> The race goes like this:
> 
> Thread 1 (write)			Thread 2 (fsync)
> ================			================
> wp_pfn_shared()
> pfn_mkwrite()
> dax_radix_entry()
> radix_tree_tag_set(DIRTY)
> 					dax_writeback_mapping_range()
> 					dax_writeback_one()
> 					radix_tag_clear(DIRTY)
> 					pgoff_mkclean()
> ... return up to wp_pfn_shared()
> wp_page_reuse()
> pte_mkdirty()
> 
> After this sequence we end up with a dirty PTE that is writeable, but with a
> clean radix tree entry.  This means that users can write to the page, but that
> a follow-up fsync or msync won't flush this dirty data to media.
> 
> The overall issue is that in the write path that goes through wp_pfn_shared(),
> the DAX code has control over when the radix tree entry is dirtied but not
> when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
> This means that we can't easily add locking, etc. to protect ourselves.
> 
> I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> easy solutions presented themselves in the absence of a page lock.  I do have
> one idea, but I think it's pretty invasive and will need to wait for another
> kernel cycle.
> 
> The current code that leaves the radix tree entry will give us correct
> behavior - it'll just be less efficient because we will have an ever-growing
> dirty set to flush.

Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
tags but it does not (I should have known, I wrote that function a few
years ago ;). Makes sense. Thanks for the clarification.
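
For the record, all tag_pages_for_writeback() does is copy the DIRTY tag
into TOWRITE for the range, roughly like this (simplified; the real code in
mm/page-writeback.c works in batches with cond_resched() in between):

	/* DIRTY is copied to TOWRITE; DIRTY itself is never cleared here */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_range_tag_if_tagged(&mapping->page_tree, &start, end,
			WRITEBACK_TAG_BATCH,
			PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
	spin_unlock_irq(&mapping->tree_lock);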

> > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > >   * dax_pfn_mkwrite - handle first write to DAX page
> > >   * @vma: The virtual memory area where the fault occurred
> > >   * @vmf: The description of the fault
> > > - *
> > >   */
> > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > >  {
> > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > +	struct file *file = vma->vm_file;
> > >  
> > > -	sb_start_pagefault(sb);
> > > -	file_update_time(vma->vm_file);
> > > -	sb_end_pagefault(sb);
> > > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > 
> > Why is NO_SECTOR argument correct here?
> 
> Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> tree - just make that entry dirty".  This works because pfn_mkwrite() always
> follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> will insert the radix tree entry, regardless of whether the fault was for a
> read or a write.  If the fault was for a write, the radix tree entry will also
> be made dirty.
>
> For reads the radix tree entry will be inserted but left clean.  When the
> first write happens we will get a pfn_mkwrite() call, which will call
> dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> tree entry & set the dirty tag.

So the explanation of this should be somewhere so that everyone knows that
we must have radix tree entries even for clean mapped blocks, because up to
now that was not clear to me.  Also __dax_pmd_fault() seems to insert
entries only for write faults, so the assumption doesn't seem to hold there?

I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
with hole punching. But I guess I can live with that.
 
								Honza
Ross Zwisler Jan. 13, 2016, 6:58 p.m. UTC | #4
On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > And secondly: You must write-protect all mappings of the flushed range so
> > > that you get fault when the sector gets written-to again. We spoke about
> > > this in the past already but somehow it got lost and I forgot about it as
> > > well. You need something like rmap_walk_file()...
> > 
> > The code that write protected mappings and then cleaned the radix tree entries
> > did get written, and was part of v2:
> > 
> > https://lkml.org/lkml/2015/11/13/759
> > 
> > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > The reason behind this was that there was a race that I couldn't figure out
> > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > tree entries.
> > 
> > The race goes like this:
> > 
> > Thread 1 (write)			Thread 2 (fsync)
> > ================			================
> > wp_pfn_shared()
> > pfn_mkwrite()
> > dax_radix_entry()
> > radix_tree_tag_set(DIRTY)
> > 					dax_writeback_mapping_range()
> > 					dax_writeback_one()
> > 					radix_tag_clear(DIRTY)
> > 					pgoff_mkclean()
> > ... return up to wp_pfn_shared()
> > wp_page_reuse()
> > pte_mkdirty()
> > 
> > After this sequence we end up with a dirty PTE that is writeable, but with a
> > clean radix tree entry.  This means that users can write to the page, but that
> > a follow-up fsync or msync won't flush this dirty data to media.
> > 
> > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > the DAX code has control over when the radix tree entry is dirtied but not
> > when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
> > This means that we can't easily add locking, etc. to protect ourselves.
> > 
> > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > easy solutions presented themselves in the absence of a page lock.  I do have
> > one idea, but I think it's pretty invasive and will need to wait for another
> > kernel cycle.
> > 
> > The current code that leaves the radix tree entry will give us correct
> > behavior - it'll just be less efficient because we will have an ever-growing
> > dirty set to flush.
> 
> Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> tags but it does not (I should have known, I wrote that function a few
> years ago ;). Makes sense. Thanks for the clarification.
> 
> > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > >   * dax_pfn_mkwrite - handle first write to DAX page
> > > >   * @vma: The virtual memory area where the fault occurred
> > > >   * @vmf: The description of the fault
> > > > - *
> > > >   */
> > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > >  {
> > > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > +	struct file *file = vma->vm_file;
> > > >  
> > > > -	sb_start_pagefault(sb);
> > > > -	file_update_time(vma->vm_file);
> > > > -	sb_end_pagefault(sb);
> > > > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > > 
> > > Why is NO_SECTOR argument correct here?
> > 
> > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > tree - just make that entry dirty".  This works because pfn_mkwrite() always
> > follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> > will insert the radix tree entry, regardless of whether the fault was for a
> > read or a write.  If the fault was for a write, the radix tree entry will also
> > be made dirty.
> >
> > For reads the radix tree entry will be inserted but left clean.  When the
> > first write happens we will get a pfn_mkwrite() call, which will call
> > dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> > tree entry & set the dirty tag.
> 
> So the explanation of this should be somewhere so that everyone knows that
> we must have radix tree entries even for clean mapped blocks, because up to
> now that was not clear to me.  Also __dax_pmd_fault() seems to insert
> entries only for write faults, so the assumption doesn't seem to hold there?

Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
4k pages.  You are right about our handling of 2MiB pages - for a read
followed by a write we will just call into the normal __dax_pmd_fault() code
again, which will do the get_block() call and insert a dirty radix tree entry.
Because we have to go all the way through the fault handler again at write
time there isn't a benefit to inserting a clean radix tree entry on read, so
we just skip it.

> I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
> with hole punching. But I guess I can live with that.
>  
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Jan. 15, 2016, 1:10 p.m. UTC | #5
On Wed 13-01-16 11:58:02, Ross Zwisler wrote:
> On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> > On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > > And secondly: You must write-protect all mappings of the flushed range so
> > > > that you get fault when the sector gets written-to again. We spoke about
> > > > this in the past already but somehow it got lost and I forgot about it as
> > > > well. You need something like rmap_walk_file()...
> > > 
> > > The code that write protected mappings and then cleaned the radix tree entries
> > > did get written, and was part of v2:
> > > 
> > > https://lkml.org/lkml/2015/11/13/759
> > > 
> > > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > > The reason behind this was that there was a race that I couldn't figure out
> > > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > > tree entries.
> > > 
> > > The race goes like this:
> > > 
> > > Thread 1 (write)			Thread 2 (fsync)
> > > ================			================
> > > wp_pfn_shared()
> > > pfn_mkwrite()
> > > dax_radix_entry()
> > > radix_tree_tag_set(DIRTY)
> > > 					dax_writeback_mapping_range()
> > > 					dax_writeback_one()
> > > 					radix_tag_clear(DIRTY)
> > > 					pgoff_mkclean()
> > > ... return up to wp_pfn_shared()
> > > wp_page_reuse()
> > > pte_mkdirty()
> > > 
> > > After this sequence we end up with a dirty PTE that is writeable, but with a
> > > clean radix tree entry.  This means that users can write to the page, but that
> > > a follow-up fsync or msync won't flush this dirty data to media.
> > > 
> > > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > > the DAX code has control over when the radix tree entry is dirtied but not
> > > when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
> > > This means that we can't easily add locking, etc. to protect ourselves.
> > > 
> > > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > > easy solutions presented themselves in the absence of a page lock.  I do have
> > > one idea, but I think it's pretty invasive and will need to wait for another
> > > kernel cycle.
> > > 
> > > The current code that leaves the radix tree entry will give us correct
> > > behavior - it'll just be less efficient because we will have an ever-growing
> > > dirty set to flush.
> > 
> > Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> > tags but it does not (I should have known, I wrote that function a few
> > years ago ;). Makes sense. Thanks for the clarification.
> > 
> > > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > > >   * dax_pfn_mkwrite - handle first write to DAX page
> > > > >   * @vma: The virtual memory area where the fault occurred
> > > > >   * @vmf: The description of the fault
> > > > > - *
> > > > >   */
> > > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > >  {
> > > > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > > +	struct file *file = vma->vm_file;
> > > > >  
> > > > > -	sb_start_pagefault(sb);
> > > > > -	file_update_time(vma->vm_file);
> > > > > -	sb_end_pagefault(sb);
> > > > > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > > > 
> > > > Why is NO_SECTOR argument correct here?
> > > 
> > > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > > tree - just make that entry dirty".  This works because pfn_mkwrite() always
> > > follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> > > will insert the radix tree entry, regardless of whether the fault was for a
> > > read or a write.  If the fault was for a write, the radix tree entry will also
> > > be made dirty.
> > >
> > > For reads the radix tree entry will be inserted but left clean.  When the
> > > first write happens we will get a pfn_mkwrite() call, which will call
> > > dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> > > tree entry & set the dirty tag.
> > 
> > So the explanation of this should be somewhere so that everyone knows that
> > we must have radix tree entries even for clean mapped blocks, because up to
> > now that was not clear to me.  Also __dax_pmd_fault() seems to insert
> > entries only for write faults, so the assumption doesn't seem to hold there?
> 
> Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
> 4k pages.  You are right about our handling of 2MiB pages - for a read
> followed by a write we will just call into the normal __dax_pmd_fault() code
> again, which will do the get_block() call and insert a dirty radix tree entry.
> Because we have to go all the way through the fault handler again at write
> time there isn't a benefit to inserting a clean radix tree entry on read, so
> we just skip it.

Ouch, I wasn't aware of this asymmetry between PMD and PTE faults. OK, so
please just document this all somewhere, because I'm pretty sure a casual
reader won't be able to figure it all out just from the code.

								Honza
Jan Kara Feb. 8, 2016, 9:44 a.m. UTC | #6
On Sat 06-02-16 17:33:07, Dmitry Monakhov wrote:
> > +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > +		loff_t end)
> > +{
> > +	struct inode *inode = mapping->host;
> > +	struct block_device *bdev = inode->i_sb->s_bdev;
> > +	pgoff_t indices[PAGEVEC_SIZE];
> > +	pgoff_t start_page, end_page;
> > +	struct pagevec pvec;
> > +	void *entry;
> > +	int i, ret = 0;
> > +
> > +	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> > +		return -EIO;
> > +
> > +	rcu_read_lock();
> > +	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > +	rcu_read_unlock();
> > +
> > +	/* see if the start of our range is covered by a PMD entry */
> > +	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +		start &= PMD_MASK;
> > +
> > +	start_page = start >> PAGE_CACHE_SHIFT;
> > +	end_page = end >> PAGE_CACHE_SHIFT;
> > +
> > +	tag_pages_for_writeback(mapping, start_page, end_page);
> > +
> > +	pagevec_init(&pvec, 0);
> > +	while (1) {
> > +		pvec.nr = find_get_entries_tag(mapping, start_page,
> > +				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> > +				pvec.pages, indices);
> > +
> > +		if (pvec.nr == 0)
> > +			break;
> > +
> > +		for (i = 0; i < pvec.nr; i++) {
> > +			ret = dax_writeback_one(bdev, mapping, indices[i],
> > +					pvec.pages[i]);
> > +			if (ret < 0)
> > +				return ret;
> > +		}
> I think it would be more efficient to use batched locking, as follows:
>                 spin_lock_irq(&mapping->tree_lock);
> 		for (i = 0; i < pvec.nr; i++) {
>                     struct blk_dax_ctl dax[PAGEVEC_SIZE];                
>                     radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
>                     /* It is also reasonable to merge adjacent dax
>                      * regions in to one */
>                     dax[i].sector = RADIX_DAX_SECTOR(entry);
>                     dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);                    
> 
>                 }
>                 spin_unlock_irq(&mapping->tree_lock);
>                	if (blk_queue_enter(q, true) != 0)
>                     goto error;
>                 for (i = 0; i < pvec.nr; i++) {
>                     rc = bdev_direct_access(bdev, dax[i]);
>                     wb_cache_pmem(dax[i].addr, dax[i].size);
>                 }
>                 ret = blk_queue_exit(q, true)

We need to clear the radix tree tag only after flushing caches. But in
principle I agree that some batching of radix tree tag manipulations should
be doable. But frankly, so far we have issues with correctness, so speed is
not our main concern.

> > +	}
> > +	wmb_pmem();
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
> > +
> >  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  			struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > @@ -363,6 +532,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	}
> >  	dax_unmap_atomic(bdev, &dax);
> >  
> > +	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> > +			vmf->flags & FAULT_FLAG_WRITE);
> > +	if (error)
> > +		goto out;
> > +
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > @@ -487,6 +661,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		delete_from_page_cache(page);
> >  		unlock_page(page);
> >  		page_cache_release(page);
> > +		page = NULL;
> >  	}
> I've realized that I do not understand why the dax_fault code works at all.
> During dax_fault we want to remove the page from the mapping and insert a
> dax entry.  Basically the code looks like this:
> 0 page = find_get_page()
> 1 lock_page(page)
> 2 delete_from_page_cache(page);
> 3 unlock_page(page);
> 4 dax_insert_mapping(inode, &bh, vma, vmf);
> 
> BUT what on earth protects us from another process reinserting the page
> again after step (2) but before step (4)?

Nothing, it's a bug and Ross / Matthew are working on fixing it...

								Honza
Ross Zwisler Feb. 8, 2016, 10:06 p.m. UTC | #7
On Sat, Feb 06, 2016 at 05:33:07PM +0300, Dmitry Monakhov wrote:
> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
<>
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> IMHO it would be sane to call that function dax_radix_entry_insert()

I think I may have actually had it named that at some point. :)  I changed it
because it doesn't always insert an entry - in the read case for example we
insert a clean entry, and then on the following dax_pfn_mkwrite() we call back
in and mark it as dirty.

<>
> > +/*
> > + * Flush the mapping to the persistent domain within the byte range of [start,
> > + * end]. This is required by data integrity operations to ensure file data is
> > + * on persistent storage prior to completion of the operation.
> > + */
> > +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > +		loff_t end)
> > +{
> > +	struct inode *inode = mapping->host;
> > +	struct block_device *bdev = inode->i_sb->s_bdev;
> > +	pgoff_t indices[PAGEVEC_SIZE];
> > +	pgoff_t start_page, end_page;
> > +	struct pagevec pvec;
> > +	void *entry;
> > +	int i, ret = 0;
> > +
> > +	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> > +		return -EIO;
> > +
> > +	rcu_read_lock();
> > +	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > +	rcu_read_unlock();
> > +
> > +	/* see if the start of our range is covered by a PMD entry */
> > +	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +		start &= PMD_MASK;
> > +
> > +	start_page = start >> PAGE_CACHE_SHIFT;
> > +	end_page = end >> PAGE_CACHE_SHIFT;
> > +
> > +	tag_pages_for_writeback(mapping, start_page, end_page);
> > +
> > +	pagevec_init(&pvec, 0);
> > +	while (1) {
> > +		pvec.nr = find_get_entries_tag(mapping, start_page,
> > +				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> > +				pvec.pages, indices);
> > +
> > +		if (pvec.nr == 0)
> > +			break;
> > +
> > +		for (i = 0; i < pvec.nr; i++) {
> > +			ret = dax_writeback_one(bdev, mapping, indices[i],
> > +					pvec.pages[i]);
> > +			if (ret < 0)
> > +				return ret;
> > +		}
> I think it would be more efficient to use batched locking, as follows:
>                 spin_lock_irq(&mapping->tree_lock);
> 		for (i = 0; i < pvec.nr; i++) {
>                     struct blk_dax_ctl dax[PAGEVEC_SIZE];                
>                     radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
>                     /* It is also reasonable to merge adjacent dax
>                      * regions in to one */
>                     dax[i].sector = RADIX_DAX_SECTOR(entry);
>                     dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);                    
> 
>                 }
>                 spin_unlock_irq(&mapping->tree_lock);
>                	if (blk_queue_enter(q, true) != 0)
>                     goto error;
>                 for (i = 0; i < pvec.nr; i++) {
>                     rc = bdev_direct_access(bdev, dax[i]);
>                     wb_cache_pmem(dax[i].addr, dax[i].size);
>                 }
>                 ret = blk_queue_exit(q, true)

I guess this could be more efficient, but as Jan said in his response we're
currently focused on correctness.  I also wonder if it would be measurably
better?

In any case, Jan is right - you have to clear the TOWRITE tag only after
you've flushed, and you also need to include the entry verification code from
dax_writeback_one() after you grab the tree lock.  Basically, I believe all
the code in dax_writeback_one() is needed - this change would essentially just
be inlining that code in dax_writeback_mapping_range() so you could do
multiple operations without giving up a lock.
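
If someone wants to pursue the batching idea later, a correctly ordered
version of that inner loop would look roughly like the sketch below
(untested, error handling elided) - keep the per-entry verification, and
clear TOWRITE only after the cache lines have actually been flushed:

	struct radix_tree_root *page_tree = &mapping->page_tree;
	struct blk_dax_ctl dax[PAGEVEC_SIZE];
	pgoff_t flushed[PAGEVEC_SIZE];
	int i, nr = 0;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < pvec.nr; i++) {
		void *entry = pvec.pages[i];
		void **slot;

		/* same verification as in dax_writeback_one() */
		if (!__radix_tree_lookup(page_tree, indices[i], NULL, &slot) ||
		    *slot != entry)
			continue;
		if (!radix_tree_tag_get(page_tree, indices[i],
					PAGECACHE_TAG_TOWRITE))
			continue;

		dax[nr].sector = RADIX_DAX_SECTOR(entry);
		dax[nr].size = (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD ?
				PMD_SIZE : PAGE_SIZE);
		flushed[nr] = indices[i];
		nr++;
	}
	spin_unlock_irq(&mapping->tree_lock);

	for (i = 0; i < nr; i++) {
		if (dax_map_atomic(bdev, &dax[i]) < 0)
			return -EIO;
		wb_cache_pmem(dax[i].addr, dax[i].size);
		dax_unmap_atomic(bdev, &dax[i]);
	}

	/* only now is it safe to drop the TOWRITE tags */
	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < nr; i++)
		radix_tree_tag_clear(page_tree, flushed[i],
				     PAGECACHE_TAG_TOWRITE);
	spin_unlock_irq(&mapping->tree_lock);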

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 5b84a46..0db21ea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@ 
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
@@ -324,6 +325,174 @@  static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+#define NO_SECTOR -1
+
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type, error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, index);
+
+	if (entry) {
+		type = RADIX_DAX_TYPE(entry);
+		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
+					type != RADIX_DAX_PMD)) {
+			error = -EIO;
+			goto unlock;
+		}
+
+		if (!pmd_entry || type == RADIX_DAX_PMD)
+			goto dirty;
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrexceptional--;
+	}
+
+	if (sector == NO_SECTOR) {
+		/*
+		 * This can happen during correct operation if our pfn_mkwrite
+		 * fault raced against a hole punch operation.  If this
+		 * happens the pte that was hole punched will have been
+		 * unmapped and the radix tree entry will have been removed by
+		 * the time we are called, but the call will still happen.  We
+		 * will return all the way up to wp_pfn_shared(), where the
+		 * pte_same() check will fail, eventually causing page fault
+		 * to be retried by the CPU.
+		 */
+		goto unlock;
+	}
+
+	error = radix_tree_insert(page_tree, index,
+			RADIX_DAX_ENTRY(sector, pmd_entry));
+	if (error)
+		goto unlock;
+
+	mapping->nrexceptional++;
+ dirty:
+	if (dirty)
+		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
+static int dax_writeback_one(struct block_device *bdev,
+		struct address_space *mapping, pgoff_t index, void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	struct radix_tree_node *node;
+	struct blk_dax_ctl dax;
+	void **slot;
+	int ret = 0;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+
+	/* another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto unlock;
+
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	dax.sector = RADIX_DAX_SECTOR(entry);
+	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * We cannot hold tree_lock while calling dax_map_atomic() because it
+	 * eventually calls cond_resched().
+	 */
+	ret = dax_map_atomic(bdev, &dax);
+	if (ret < 0)
+		return ret;
+
+	if (WARN_ON_ONCE(ret < dax.size)) {
+		ret = -EIO;
+		goto unmap;
+	}
+
+	wb_cache_pmem(dax.addr, dax.size);
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+	return ret;
+
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start_page, end_page;
+	struct pagevec pvec;
+	void *entry;
+	int i, ret = 0;
+
+	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+		return -EIO;
+
+	rcu_read_lock();
+	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	rcu_read_unlock();
+
+	/* see if the start of our range is covered by a PMD entry */
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start &= PMD_MASK;
+
+	start_page = start >> PAGE_CACHE_SHIFT;
+	end_page = end >> PAGE_CACHE_SHIFT;
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			ret = dax_writeback_one(bdev, mapping, indices[i],
+					pvec.pages[i]);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	wmb_pmem();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -363,6 +532,11 @@  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
+	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+			vmf->flags & FAULT_FLAG_WRITE);
+	if (error)
+		goto out;
+
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
@@ -487,6 +661,7 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -591,7 +766,7 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pgoff_t size, pgoff;
 	loff_t lstart, lend;
 	sector_t block;
-	int result = 0;
+	int error, result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -733,6 +908,16 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		if (write) {
+			error = dax_radix_entry(mapping, pgoff, dax.sector,
+					true, true);
+			if (error) {
+				dax_pmd_dbg(&bh, address,
+						"PMD radix insertion failed");
+				goto fallback;
+			}
+		}
+
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
 				__func__, current->comm, address,
@@ -791,15 +976,12 @@  EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..8204c3d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@  static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e215fc..2e7c8d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,12 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, lstart, lend);
+		if (err)
+			return err;
+	}
+
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);