
[v7,2/9] dax: fix conversion of holes to PMDs

Message ID 1452103263-1592-3-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State Accepted
Commit de14b9cb5e02

Commit Message

Ross Zwisler Jan. 6, 2016, 6 p.m. UTC
When we get a DAX PMD fault for a write it is possible that there could be
some number of 4k zero pages already present for the same range that were
inserted to service reads from a hole.  These 4k zero pages need to be
unmapped from the VMAs and removed from the struct address_space radix tree
before the real DAX PMD entry can be inserted.

For PTE faults this same use case also exists and is handled by a
combination of unmap_mapping_range() to unmap the VMAs and
delete_from_page_cache() to remove the page from the address_space radix
tree.
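
Roughly, the PTE write-fault path in the fs/dax.c of this era does the following when it finds an existing hole page (paraphrased, not a verbatim quote):

	if (page) {
		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
							PAGE_CACHE_SIZE, 0);
		delete_from_page_cache(page);
		unlock_page(page);
		page_cache_release(page);
		page = NULL;
	}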

For PMD faults we do have a call to unmap_mapping_range() (protected by a
buffer_new() check), but nothing clears out the radix tree entry.  The
buffer_new() check is also incorrect as the current ext4 and XFS filesystem
code will never return a buffer_head with BH_New set, even when allocating
new blocks over a hole.  Instead the filesystem will zero the blocks
manually and return a buffer_head with only BH_Mapped set.

Fix this situation by removing the buffer_new() check and adding a call to
truncate_inode_pages_range() to clear out the radix tree entries before we
insert the DAX PMD.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Comments

Dan Williams Jan. 6, 2016, 7:04 p.m. UTC | #1
On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> When we get a DAX PMD fault for a write it is possible that there could be
> some number of 4k zero pages already present for the same range that were
> inserted to service reads from a hole.  These 4k zero pages need to be
> unmapped from the VMAs and removed from the struct address_space radix tree
> before the real DAX PMD entry can be inserted.
>
> For PTE faults this same use case also exists and is handled by a
> combination of unmap_mapping_range() to unmap the VMAs and
> delete_from_page_cache() to remove the page from the address_space radix
> tree.
>
> For PMD faults we do have a call to unmap_mapping_range() (protected by a
> buffer_new() check), but nothing clears out the radix tree entry.  The
> buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> code will never return a buffer_head with BH_New set, even when allocating
> new blocks over a hole.  Instead the filesystem will zero the blocks
> manually and return a buffer_head with only BH_Mapped set.
>
> Fix this situation by removing the buffer_new() check and adding a call to
> truncate_inode_pages_range() to clear out the radix tree entries before we
> insert the DAX PMD.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Replaced the current contents of v6 in -mm from next-20160106 with
this v7 set and it looks good.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>

One question below...

> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 03cc4a3..9dc0c97 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>         bool write = flags & FAULT_FLAG_WRITE;
>         struct block_device *bdev;
>         pgoff_t size, pgoff;
> +       loff_t lstart, lend;
>         sector_t block;
>         int result = 0;
>
> @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>                 goto fallback;
>         }
>
> -       /*
> -        * If we allocated new storage, make sure no process has any
> -        * zero pages covering this hole
> -        */
> -       if (buffer_new(&bh)) {
> -               i_mmap_unlock_read(mapping);
> -               unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> -               i_mmap_lock_read(mapping);
> -       }
> +       /* make sure no process has any zero pages covering this hole */
> +       lstart = pgoff << PAGE_SHIFT;
> +       lend = lstart + PMD_SIZE - 1; /* inclusive */
> +       i_mmap_unlock_read(mapping);
> +       unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> +       truncate_inode_pages_range(mapping, lstart, lend);

Do we need to do both unmap and truncate given that
truncate_inode_page() optionally does an unmap_mapping_range()
internally?
Jan Kara Jan. 7, 2016, 1:22 p.m. UTC | #2
On Wed 06-01-16 11:00:56, Ross Zwisler wrote:
> When we get a DAX PMD fault for a write it is possible that there could be
> some number of 4k zero pages already present for the same range that were
> inserted to service reads from a hole.  These 4k zero pages need to be
> unmapped from the VMAs and removed from the struct address_space radix tree
> before the real DAX PMD entry can be inserted.
> 
> For PTE faults this same use case also exists and is handled by a
> combination of unmap_mapping_range() to unmap the VMAs and
> delete_from_page_cache() to remove the page from the address_space radix
> tree.
> 
> For PMD faults we do have a call to unmap_mapping_range() (protected by a
> buffer_new() check), but nothing clears out the radix tree entry.  The
> buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> code will never return a buffer_head with BH_New set, even when allocating
> new blocks over a hole.  Instead the filesystem will zero the blocks
> manually and return a buffer_head with only BH_Mapped set.
> 
> Fix this situation by removing the buffer_new() check and adding a call to
> truncate_inode_pages_range() to clear out the radix tree entries before we
> insert the DAX PMD.

Ho, hum, let me understand this. So we have a file, different processes are
mapping it. One process maps it with normal page granularity and another
process with huge page granularity. Thus when the first process read-faults
a few normal pages and then the second process write-faults the huge page
in the same range, we have a problem. Do I understand this correctly?
Because otherwise I don't understand how a single page table can have both
huge page and normal page in the same range...

And if this is indeed the problem then what prevents the unmapping and
truncation in the huge page fault from racing with mapping the same range again
using small pages? Sure now blocks are allocated so the mapping itself will
be consistent but radix tree will have the same issues it had before this
patch, won't it?

... thinking some more about this ...

OK, there is some difference - we will only have DAX exceptional entries
for the range covered by huge page and those we replace properly in
dax_radix_entry() code. So things are indeed fine *except* that nothing
seems to serialize dax_load_hole() with a PMD fault. A race like the following
seems possible:

CPU1 - process 1		CPU2 - process 2

__dax_fault() - file f, index 1
  get_block() -> returns hole
				__dax_pmd_fault() - file f, index 0
				  get_block() -> allocates blocks
				  ...
				  truncate_pagecache_range()
  dax_load_hole()

Boom, we have hole page instantiated for allocated range (data corruption)
and corruption of radix tree entries as well. Actually this problem is
there even for two different processes doing normal page faults (one read,
one write) against the same page in the file.

... thinking about possible fixes ...

So we need some exclusion that makes sure pgoff->block mapping information
is uptodate at the moment we insert it into page tables. The simplest
reasonably fast thing I can see is:

When handling a read fault, things stay as is and filesystem protects the
fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading. When
handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for reading
and try a read fault. If __dax_fault() sees a hole returned from
get_blocks() during a write fault, it bails out. Filesystem grabs
EXT4_I(inode)->i_mmap_sem for writing and retries with different
get_blocks() callback which will allocate blocks. That way we get proper
exclusion for faults needing to allocate blocks. Thoughts?
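
A rough sketch of that scheme, using hypothetical helper names (ext4_dax_get_block_ro(), ext4_dax_get_block_alloc()) and a hypothetical VM_FAULT_DAX_HOLE return flag - not code from this thread or from ext4:

static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vma->vm_file);
	int result;

	/* Optimistic pass: shared lock, get_block callback that never allocates. */
	down_read(&EXT4_I(inode)->i_mmap_sem);
	result = __dax_fault(vma, vmf, ext4_dax_get_block_ro, NULL);
	up_read(&EXT4_I(inode)->i_mmap_sem);

	/* Hypothetical flag meaning "write fault found a hole, needs allocation". */
	if (!(result & VM_FAULT_DAX_HOLE))
		return result;

	/* Retry with the lock held exclusively so nobody can instantiate hole
	 * pages (or stale radix tree entries) for this range concurrently. */
	down_write(&EXT4_I(inode)->i_mmap_sem);
	result = __dax_fault(vma, vmf, ext4_dax_get_block_alloc, NULL);
	up_write(&EXT4_I(inode)->i_mmap_sem);

	return result;
}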

								Honza

> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 03cc4a3..9dc0c97 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	bool write = flags & FAULT_FLAG_WRITE;
>  	struct block_device *bdev;
>  	pgoff_t size, pgoff;
> +	loff_t lstart, lend;
>  	sector_t block;
>  	int result = 0;
>  
> @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		goto fallback;
>  	}
>  
> -	/*
> -	 * If we allocated new storage, make sure no process has any
> -	 * zero pages covering this hole
> -	 */
> -	if (buffer_new(&bh)) {
> -		i_mmap_unlock_read(mapping);
> -		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> -		i_mmap_lock_read(mapping);
> -	}
> +	/* make sure no process has any zero pages covering this hole */
> +	lstart = pgoff << PAGE_SHIFT;
> +	lend = lstart + PMD_SIZE - 1; /* inclusive */
> +	i_mmap_unlock_read(mapping);
> +	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> +	truncate_inode_pages_range(mapping, lstart, lend);
> +	i_mmap_lock_read(mapping);
>  
>  	/*
>  	 * If a truncate happened while we were allocating blocks, we may
> @@ -669,7 +668,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		goto out;
>  	}
>  	if ((pgoff | PG_PMD_COLOUR) >= size) {
> -		dax_pmd_dbg(&bh, address, "pgoff unaligned");
> +		dax_pmd_dbg(&bh, address,
> +				"offset + huge page size > file size");
>  		goto fallback;
>  	}
>  
> -- 
> 2.5.0
> 
>
Ross Zwisler Jan. 7, 2016, 10:11 p.m. UTC | #3
On Thu, Jan 07, 2016 at 02:22:06PM +0100, Jan Kara wrote:
> On Wed 06-01-16 11:00:56, Ross Zwisler wrote:
> > When we get a DAX PMD fault for a write it is possible that there could be
> > some number of 4k zero pages already present for the same range that were
> > inserted to service reads from a hole.  These 4k zero pages need to be
> > unmapped from the VMAs and removed from the struct address_space radix tree
> > before the real DAX PMD entry can be inserted.
> > 
> > For PTE faults this same use case also exists and is handled by a
> > combination of unmap_mapping_range() to unmap the VMAs and
> > delete_from_page_cache() to remove the page from the address_space radix
> > tree.
> > 
> > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > buffer_new() check), but nothing clears out the radix tree entry.  The
> > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > code will never return a buffer_head with BH_New set, even when allocating
> > new blocks over a hole.  Instead the filesystem will zero the blocks
> > manually and return a buffer_head with only BH_Mapped set.
> > 
> > Fix this situation by removing the buffer_new() check and adding a call to
> > truncate_inode_pages_range() to clear out the radix tree entries before we
> > insert the DAX PMD.
> 
> Ho, hum, let me understand this. So we have a file, different processes are
> mapping it. One process maps it with normal page granularity and another
> process with huge page granularity. Thus when the first process read-faults
> a few normal pages and then the second process write-faults the huge page
> in the same range, we have a problem. Do I understand this correctly?
> Because otherwise I don't understand how a single page table can have both
> huge page and normal page in the same range...

I don't think that it necessarily has to do with multiple threads.  The bit to
notice here is we *always* use 4k zero pages to cover holes.  So, a single
thread can hit this condition by doing some reads from a hole (insert 4k
pages), then doing a write.  This write is the first time that we will try and
use real DAX storage to insert into the page tables, and we may end up getting
a PMD.  This means that we need to clear out all the 4k pages that we inserted
while reading holes in this same range, now that we have a 2M segment
allocated by the filesystem and the entire range is no longer a hole.
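
A schematic userspace reproducer for that ordering might look like this (the mount path is made up, and whether the write fault actually takes the PMD path also depends on alignment and on the state of the page tables for that range):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one PMD-sized (2M) range */
	int fd = open("/mnt/dax/testfile", O_CREAT | O_RDWR, 0644);
	volatile char c;
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	c = p[0];	/* read fault over a hole: a 4k zero page is inserted */
	c = p[4096];	/* another zero page in the same 2M range */
	(void)c;

	p[8192] = 1;	/* write fault: the fs allocates blocks and DAX may want
			 * a PMD here, so the zero pages above must first be
			 * unmapped and evicted from the radix tree */

	munmap(p, len);
	close(fd);
	return 0;
}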

> And if this is indeed the problem then what prevents the unmapping and
> truncation in the huge page fault from racing with mapping the same range again
> using small pages? Sure now blocks are allocated so the mapping itself will
> be consistent but radix tree will have the same issues it had before this
> patch, won't it?

Yep, this is a separate issue, but I think that we handle this case
successfully, though we may end up flushing the same address multiple times.
Once the filesystem has established a block mapping (assuming we avoid the
race described below where one thread is mapping in holes and the other sees a
block allocation), I think we are okay.  It's true that one thread can map in
PMDs, and another thread could potentially map in PTEs that cover the same
range if they are working with mmaps that are smaller than a PMD, but the
sectors inserted into the radix tree by each of those threads will be
individually correct - the only issue is that they may overlap.

Say, for example you have the following:

CPU1 - process 1				CPU2 - process 2
mmap for sector 0, size 2M
insert PMD into radix tree for sector 0
  This radix tree entry covers sectors 0-4095
						mmap for sector 32, size 4k
						insert PTE entry into radix
						tree for sector 32

In this case an fsync of the fd by process 1 will end up flushing sector 32
twice, which is correct but inefficient.  I think we can make this more
efficient by adjusting the insertion code and dirtying code in
dax_radix_entry() to look for PMDs that cover this same range.

> ... thinking some more about this ...
> 
> OK, there is some difference - we will only have DAX exceptional entries
> for the range covered by huge page and those we replace properly in
> dax_radix_entry() code. So things are indeed fine *except* that nothing
> seems to serialize dax_load_hole() with a PMD fault. A race like the following
> seems possible:
> 
> CPU1 - process 1		CPU2 - process 2
> 
> __dax_fault() - file f, index 1
>   get_block() -> returns hole
> 				__dax_pmd_fault() - file f, index 0
> 				  get_block() -> allocates blocks
> 				  ...
> 				  truncate_pagecache_range()
>   dax_load_hole()
> 
> Boom, we have hole page instantiated for allocated range (data corruption)
> and corruption of radix tree entries as well. Actually this problem is
> there even for two different processes doing normal page faults (one read,
> one write) against the same page in the file.

Yea, I agree, this seems like an existing issue that you could hit with just
the PTE path:

CPU1 - process 1		CPU2 - process 2

__dax_fault() - read file f, index 0
  get_block() -> returns hole
				__dax_fault() - write file f, index 0
				  get_block() -> allocates blocks
				  ...
				  skips unmap_mapping_range() and
				    delete_from_page_cache() because it didn't
				    find a page for this pgoff
				  dax_insert_mapping()
  dax_load_hole()
  *data corruption*

> ... thinking about possible fixes ...
> 
> So we need some exclusion that makes sure pgoff->block mapping information
> is uptodate at the moment we insert it into page tables. The simplest
> reasonably fast thing I can see is:
> 
> When handling a read fault, things stay as is and filesystem protects the
> fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading. When
> handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for reading
> and try a read fault. If __dax_fault() sees a hole returned from
> get_blocks() during a write fault, it bails out. Filesystem grabs
> EXT4_I(inode)->i_mmap_sem for writing and retries with different
> get_blocks() callback which will allocate blocks. That way we get proper
> exclusion for faults needing to allocate blocks. Thoughts?

I think this would work.  ext4, ext2 and xfs all handle their exclusion with
rw_semaphores, so this should work for each of them, I think.  Thanks for the
problem statement & solution!  :) 

I guess our best course is to make sure that we don't make this existing
problem worse via the fsync/msync patches by handling the error gracefully,
and fix this for v4.6.  I do feel the need to point out that this is a
pre-existing issue with DAX, and that my fsync patches just happened to help
us find it.  They don't make the situation any better or any worse, and I
really hope this issue doesn't end up blocking the fsync/msync patches from
getting merged for v4.5.

Thanks,
- Ross
Ross Zwisler Jan. 7, 2016, 10:34 p.m. UTC | #4
On Wed, Jan 06, 2016 at 11:04:35AM -0800, Dan Williams wrote:
> On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > When we get a DAX PMD fault for a write it is possible that there could be
> > some number of 4k zero pages already present for the same range that were
> > inserted to service reads from a hole.  These 4k zero pages need to be
> > unmapped from the VMAs and removed from the struct address_space radix tree
> > before the real DAX PMD entry can be inserted.
> >
> > For PTE faults this same use case also exists and is handled by a
> > combination of unmap_mapping_range() to unmap the VMAs and
> > delete_from_page_cache() to remove the page from the address_space radix
> > tree.
> >
> > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > buffer_new() check), but nothing clears out the radix tree entry.  The
> > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > code will never return a buffer_head with BH_New set, even when allocating
> > new blocks over a hole.  Instead the filesystem will zero the blocks
> > manually and return a buffer_head with only BH_Mapped set.
> >
> > Fix this situation by removing the buffer_new() check and adding a call to
> > truncate_inode_pages_range() to clear out the radix tree entries before we
> > insert the DAX PMD.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Replaced the current contents of v6 in -mm from next-20160106 with
> this v7 set and it looks good.
> 
> Reported-by: Dan Williams <dan.j.williams@intel.com>
> Tested-by: Dan Williams <dan.j.williams@intel.com>
> 
> One question below...
> 
> > ---
> >  fs/dax.c | 20 ++++++++++----------
> >  1 file changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 03cc4a3..9dc0c97 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >         bool write = flags & FAULT_FLAG_WRITE;
> >         struct block_device *bdev;
> >         pgoff_t size, pgoff;
> > +       loff_t lstart, lend;
> >         sector_t block;
> >         int result = 0;
> >
> > @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >                 goto fallback;
> >         }
> >
> > -       /*
> > -        * If we allocated new storage, make sure no process has any
> > -        * zero pages covering this hole
> > -        */
> > -       if (buffer_new(&bh)) {
> > -               i_mmap_unlock_read(mapping);
> > -               unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> > -               i_mmap_lock_read(mapping);
> > -       }
> > +       /* make sure no process has any zero pages covering this hole */
> > +       lstart = pgoff << PAGE_SHIFT;
> > +       lend = lstart + PMD_SIZE - 1; /* inclusive */
> > +       i_mmap_unlock_read(mapping);
> > +       unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > +       truncate_inode_pages_range(mapping, lstart, lend);
> 
> Do we need to do both unmap and truncate given that
> truncate_inode_page() optionally does an unmap_mapping_range()
> internally?

Ah, indeed it does.  Sure, having just the call to truncate_inode_page() seems
cleaner.  I'll re-test and send this out in v8.
Ross Zwisler Jan. 8, 2016, 4:18 a.m. UTC | #5
On Thu, Jan 07, 2016 at 03:34:55PM -0700, Ross Zwisler wrote:
> On Wed, Jan 06, 2016 at 11:04:35AM -0800, Dan Williams wrote:
> > On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > When we get a DAX PMD fault for a write it is possible that there could be
> > > some number of 4k zero pages already present for the same range that were
> > > inserted to service reads from a hole.  These 4k zero pages need to be
> > > unmapped from the VMAs and removed from the struct address_space radix tree
> > > before the real DAX PMD entry can be inserted.
> > >
> > > For PTE faults this same use case also exists and is handled by a
> > > combination of unmap_mapping_range() to unmap the VMAs and
> > > delete_from_page_cache() to remove the page from the address_space radix
> > > tree.
> > >
> > > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > > buffer_new() check), but nothing clears out the radix tree entry.  The
> > > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > > code will never return a buffer_head with BH_New set, even when allocating
> > > new blocks over a hole.  Instead the filesystem will zero the blocks
> > > manually and return a buffer_head with only BH_Mapped set.
> > >
> > > Fix this situation by removing the buffer_new() check and adding a call to
> > > truncate_inode_pages_range() to clear out the radix tree entries before we
> > > insert the DAX PMD.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > Replaced the current contents of v6 in -mm from next-20160106 with
> > this v7 set and it looks good.
> > 
> > Reported-by: Dan Williams <dan.j.williams@intel.com>
> > Tested-by: Dan Williams <dan.j.williams@intel.com>
> > 
> > One question below...
> > 
> > > ---
> > >  fs/dax.c | 20 ++++++++++----------
> > >  1 file changed, 10 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 03cc4a3..9dc0c97 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >         bool write = flags & FAULT_FLAG_WRITE;
> > >         struct block_device *bdev;
> > >         pgoff_t size, pgoff;
> > > +       loff_t lstart, lend;
> > >         sector_t block;
> > >         int result = 0;
> > >
> > > @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >                 goto fallback;
> > >         }
> > >
> > > -       /*
> > > -        * If we allocated new storage, make sure no process has any
> > > -        * zero pages covering this hole
> > > -        */
> > > -       if (buffer_new(&bh)) {
> > > -               i_mmap_unlock_read(mapping);
> > > -               unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> > > -               i_mmap_lock_read(mapping);
> > > -       }
> > > +       /* make sure no process has any zero pages covering this hole */
> > > +       lstart = pgoff << PAGE_SHIFT;
> > > +       lend = lstart + PMD_SIZE - 1; /* inclusive */
> > > +       i_mmap_unlock_read(mapping);
> > > +       unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > > +       truncate_inode_pages_range(mapping, lstart, lend);
> > 
> > Do we need to do both unmap and truncate given that
> > truncate_inode_page() optionally does an unmap_mapping_range()
> > internally?
> 
> Ah, indeed it does.  Sure, having just the call to truncate_inode_page() seems
> cleaner.  I'll re-test and send this out in v8.

Actually, in testing it doesn't look like unmap_mapping_range() in
truncate_inode_page() gets called.  We fail the page_mapped(page) check for
our read-only zero pages.  I think we need to keep the unmap_mapping_range()
call in __dax_pmd_fault().
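
For reference, truncate_inode_page() in this timeframe looks roughly like the following (paraphrased from mm/truncate.c, not an exact quote), which is why a hole page that fails the page_mapped() test never gets the internal unmap:

int truncate_inode_page(struct address_space *mapping, struct page *page)
{
	if (page_mapped(page)) {
		unmap_mapping_range(mapping,
				   (loff_t)page->index << PAGE_CACHE_SHIFT,
				   PAGE_CACHE_SIZE, 0);
	}
	return truncate_complete_page(mapping, page);
}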
Jan Kara Jan. 11, 2016, 12:23 p.m. UTC | #6
On Thu 07-01-16 15:11:14, Ross Zwisler wrote:
> On Thu, Jan 07, 2016 at 02:22:06PM +0100, Jan Kara wrote:
> > On Wed 06-01-16 11:00:56, Ross Zwisler wrote:
> > > When we get a DAX PMD fault for a write it is possible that there could be
> > > some number of 4k zero pages already present for the same range that were
> > > inserted to service reads from a hole.  These 4k zero pages need to be
> > > unmapped from the VMAs and removed from the struct address_space radix tree
> > > before the real DAX PMD entry can be inserted.
> > > 
> > > For PTE faults this same use case also exists and is handled by a
> > > combination of unmap_mapping_range() to unmap the VMAs and
> > > delete_from_page_cache() to remove the page from the address_space radix
> > > tree.
> > > 
> > > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > > buffer_new() check), but nothing clears out the radix tree entry.  The
> > > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > > code will never return a buffer_head with BH_New set, even when allocating
> > > new blocks over a hole.  Instead the filesystem will zero the blocks
> > > manually and return a buffer_head with only BH_Mapped set.
> > > 
> > > Fix this situation by removing the buffer_new() check and adding a call to
> > > truncate_inode_pages_range() to clear out the radix tree entries before we
> > > insert the DAX PMD.
> > 
> > Ho, hum, let me understand this. So we have a file, different processes are
> > mapping it. One process maps it with normal page granularity and another
> > process with huge page granularity. Thus when the first process read-faults
> > a few normal pages and then the second process write-faults the huge page
> > in the same range, we have a problem. Do I understand this correctly?
> > Because otherwise I don't understand how a single page table can have both
> > huge page and normal page in the same range...
> 
> I don't think that it necessarily has to do with multiple threads.  The bit to
> notice here is we *always* use 4k zero pages to cover holes.  So, a single
> thread can hit this condition by doing some reads from a hole (insert 4k
> pages), then doing a write.  This write is the first time that we will try and
> use real DAX storage to insert into the page tables, and we may end up getting
> a PMD.  This means that we need to clear out all the 4k pages that we inserted
> while reading holes in this same range, now that we have a 2M segment
> allocated by the filesystem and the entire range is no longer a hole.

OK, I see. Thanks for explanation.

> > And if this is indeed the problem then what prevents the unmapping and
> > truncation in the huge page fault from racing with mapping the same range again
> > using small pages? Sure now blocks are allocated so the mapping itself will
> > be consistent but radix tree will have the same issues it had before this
> > patch, won't it?
> 
> Yep, this is a separate issue, but I think that we handle this case
> successfully, though we may end up flushing the same address multiple times.
> Once the filesystem has established a block mapping (assuming we avoid the
> race described below where one thread is mapping in holes and the other sees a
> block allocation), I think we are okay.  It's true that one thread can map in
> PMDs, and another thread could potentially map in PTEs that cover the same
> range if they are working with mmaps that are smaller than a PMD, but the
> sectors inserted into the radix tree by each of those threads will be
> individually correct - the only issue is that they may overlap.
> 
> Say, for example you have the following:
> 
> CPU1 - process 1				CPU2 - process 2
> mmap for sector 0, size 2M
> insert PMD into radix tree for sector 0
>   This radix tree entry covers sectors 0-4095
> 						mmap for sector 32, size 4k
> 						insert PTE entry into radix
> 						tree for sector 32
> 
> In this case an fsync of the fd by process 1 will end up flushing sector 32
> twice, which is correct but inefficient.  I think we can make this more
> efficient by adjusting the insertion code and dirtying code in
> dax_radix_entry() to look for PMDs that cover this same range.

Yes, this is what I ended up with as well. So we are in agreement here.

> > ... thinking some more about this ...
> > 
> > OK, there is some difference - we will only have DAX exceptional entries
> > for the range covered by huge page and those we replace properly in
> > dax_radix_entry() code. So things are indeed fine *except* that nothing
> > seems to serialize dax_load_hole() with a PMD fault. A race like the following
> > seems possible:
> > 
> > CPU1 - process 1		CPU2 - process 2
> > 
> > __dax_fault() - file f, index 1
> >   get_block() -> returns hole
> > 				__dax_pmd_fault() - file f, index 0
> > 				  get_block() -> allocates blocks
> > 				  ...
> > 				  truncate_pagecache_range()
> >   dax_load_hole()
> > 
> > Boom, we have hole page instantiated for allocated range (data corruption)
> > and corruption of radix tree entries as well. Actually this problem is
> > there even for two different processes doing normal page faults (one read,
> > one write) against the same page in the file.
> 
> Yea, I agree, this seems like an existing issue that you could hit with just
> the PTE path:
> 
> CPU1 - process 1		CPU2 - process 2
> 
> __dax_fault() - read file f, index 0
>   get_block() -> returns hole
> 				__dax_fault() - write file f, index 0
> 				  get_block() -> allocates blocks
> 				  ...
> 				  skips unmap_mapping_range() and
> 				    delete_from_page_cache() because it didn't
> 				    find a page for this pgoff
> 				  dax_insert_mapping()
>   dax_load_hole()
>   *data corruption*
> 
> > ... thinking about possible fixes ...
> > 
> > So we need some exclusion that makes sure pgoff->block mapping information
> > is uptodate at the moment we insert it into page tables. The simplest
> > reasonably fast thing I can see is:
> > 
> > When handling a read fault, things stay as is and filesystem protects the
> > fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading. When
> > handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for reading
> > and try a read fault. If __dax_fault() sees a hole returned from
> > get_blocks() during a write fault, it bails out. Filesystem grabs
> > EXT4_I(inode)->i_mmap_sem for writing and retries with different
> > get_blocks() callback which will allocate blocks. That way we get proper
> > exclusion for faults needing to allocate blocks. Thoughts?
> 
> I think this would work.  ext4, ext2 and xfs all handle their exclusion with
> rw_semaphores, so this should work for each of them, I think.  Thanks for the
> problem statement & solution!  :) 
> 
> I guess our best course is to make sure that we don't make this existing
> problem worse via the fsync/msync patches by handling the error gracefully,
> and fix this for v4.6.  I do feel the need to point out that this is a
> pre-existing issue with DAX, and that my fsync patches just happened to help
> us find it.  They don't make the situation any better or any worse, and I
> really hope this issue doesn't end up blocking the fsync/msync patches from
> getting merged for v4.5.

Yeah, I agree this is a preexisting problem and mostly independent of your
fsync series, so it can be dealt with once the fsync series lands. The only
place where the two meet is the locking - you will have exclusive hold of the
inode and all its mappings when doing a write fault that allocates a block.
That may save you some dances with i_mmap_lock (but I didn't think about this
too much).

								Honza

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 03cc4a3..9dc0c97 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -594,6 +594,7 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
+	loff_t lstart, lend;
 	sector_t block;
 	int result = 0;
 
@@ -647,15 +648,13 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (buffer_new(&bh)) {
-		i_mmap_unlock_read(mapping);
-		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
-		i_mmap_lock_read(mapping);
-	}
+	/* make sure no process has any zero pages covering this hole */
+	lstart = pgoff << PAGE_SHIFT;
+	lend = lstart + PMD_SIZE - 1; /* inclusive */
+	i_mmap_unlock_read(mapping);
+	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
+	truncate_inode_pages_range(mapping, lstart, lend);
+	i_mmap_lock_read(mapping);
 
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
@@ -669,7 +668,8 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto out;
 	}
 	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(&bh, address, "pgoff unaligned");
+		dax_pmd_dbg(&bh, address,
+				"offset + huge page size > file size");
 		goto fallback;
 	}