[2/2] dax: fix data corruption due to stale mmap reads

Message ID: 20170421034437.4359-2-ross.zwisler@linux.intel.com (mailing list archive)
State: New, archived
Return-Path: <linux-fsdevel-owner@kernel.org>
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
    Alexander Viro <viro@zeniv.linux.org.uk>,
    Alexey Kuznetsov <kuznet@virtuozzo.com>,
    Andrey Ryabinin <aryabinin@virtuozzo.com>,
    Anna Schumaker <anna.schumaker@netapp.com>,
    Christoph Hellwig <hch@lst.de>,
    Dan Williams <dan.j.williams@intel.com>,
    "Darrick J. Wong" <darrick.wong@oracle.com>,
    Eric Van Hensbergen <ericvh@gmail.com>,
    Jan Kara <jack@suse.cz>,
    Jens Axboe <axboe@kernel.dk>,
    Johannes Weiner <hannes@cmpxchg.org>,
    Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
    Latchesar Ionkov <lucho@ionkov.net>,
    linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-nfs@vger.kernel.org,
    linux-nvdimm@lists.01.org,
    Matthew Wilcox <mawilcox@microsoft.com>,
    Ron Minnich <rminnich@sandia.gov>,
    samba-technical@lists.samba.org,
    Steve French <sfrench@samba.org>,
    Trond Myklebust <trond.myklebust@primarydata.com>,
    v9fs-developer@lists.sourceforge.net
Subject: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Date: Thu, 20 Apr 2017 21:44:37 -0600
Message-Id: <20170421034437.4359-2-ross.zwisler@linux.intel.com>
In-Reply-To: <20170421034437.4359-1-ross.zwisler@linux.intel.com>
References: <20170420191446.GA21694@linux.intel.com>
    <20170421034437.4359-1-ross.zwisler@linux.intel.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Ross Zwisler April 21, 2017, 3:44 a.m. UTC

Users of DAX can suffer data corruption from stale mmap reads via the
following sequence:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we incorrectly
  leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero page
  mapping is still intact we read back zeroes instead of the new data.

We fix this by unconditionally calling invalidate_inode_pages2_range() in
dax_iomap_actor() for new block allocations, and by enhancing
__dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
being removed from the radix tree.

This is based on an initial patch from Jan Kara.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---
 fs/dax.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)
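
For illustration, the four-step sequence from the commit message can be
sketched as a small user-space check. This is only a sketch under
assumptions: the function name check_mmap_coherency() and the scratch
path are hypothetical, the corruption can only show up on a filesystem
mounted with DAX, and whether the kernel really uses a 2MiB zero page
depends on alignment and configuration. On a coherent kernel it should
return 0.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HOLE_SIZE (2UL * 1024 * 1024)	/* 2MiB, the PMD size on x86-64 */

/*
 * Returns 0 if the mmap view sees data written with pwrite(),
 * 1 if it still reads stale zeroes, and -1 on setup errors.
 * 'path' is a hypothetical scratch file chosen by the caller.
 */
int check_mmap_coherency(const char *path)
{
	static const char msg[] = "new data";
	char *map;
	int ret = -1;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;

	/* Make the file sparse: the whole 2MiB range is a hole. */
	if (ftruncate(fd, HOLE_SIZE) != 0)
		goto out_close;

	/* Step 1: open an mmap over the hole. */
	map = mmap(NULL, HOLE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Step 2: read through the mapping, faulting in zero page(s). */
	if (*(volatile char *)map != 0)
		goto out_unmap;

	/* Step 3: write to the hole through the file descriptor. */
	if (pwrite(fd, msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
		goto out_unmap;

	/* Step 4: read back via the mmap; stale zeroes indicate the bug. */
	ret = memcmp(map, msg, sizeof(msg)) == 0 ? 0 : 1;

out_unmap:
	munmap(map, HOLE_SIZE);
out_close:
	close(fd);
	unlink(path);
	return ret;
}
```

Calling check_mmap_coherency() with a path on the filesystem under test
and checking for a return value of 1 would flag the stale-read case.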

Jan Kara April 25, 2017, 11:10 a.m. UTC | #1

On Thu 20-04-17 21:44:37, Ross Zwisler wrote:
> Users of DAX can suffer data corruption from stale mmap reads via the
> following sequence:
> 
> - open an mmap over a 2MiB hole
> 
> - read from a 2MiB hole, faulting in a 2MiB zero page
> 
> - write to the hole with write(3p).  The write succeeds but we incorrectly
>   leave the 2MiB zero page mapping intact.
> 
> - via the mmap, read the data that was just written.  Since the zero page
>   mapping is still intact we read back zeroes instead of the new data.
> 
> We fix this by unconditionally calling invalidate_inode_pages2_range() in
> dax_iomap_actor() for new block allocations, and by enhancing
> __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
> being removed from the radix tree.
> 
> This is based on an initial patch from Jan Kara.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]
> ---
>  fs/dax.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 166504c..3f445d5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
>  					  pgoff_t index, bool trunc)
>  {
>  	int ret = 0;
> -	void *entry;
> +	void *entry, **slot;
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	if (!trunc &&
>  	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
>  	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
>  		goto out;
> +
> +	/*
> +	 * Make sure 'entry' remains valid while we drop mapping->tree_lock to
> +	 * do the unmap_mapping_range() call.
> +	 */
> +	entry = lock_slot(mapping, slot);

This also stops page faults from mapping the entry again. Maybe worth
mentioning here as well.

> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
> +			(loff_t)PAGE_SIZE << dax_radix_order(entry), 0);

Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping
large ranges - each unmap means an rmap walk... Since this is a data
corruption class of bug, let's fix it this way for now but I think we'll
need to improve this later.

E.g. what if we called unmap_mapping_range() for the whole invalidated
range after removing the radix tree entries?
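
As a worked example of the two expressions passed to
unmap_mapping_range() in the hunk above, here is a stand-alone sketch of
the arithmetic. The constants are assumptions for x86-64 with 4KiB base
pages (an order of 9 for a 2MiB entry, 0 for a 4KiB one), and the struct
and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for x86-64 with 4KiB base pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define PMD_ORDER  9	/* 2MiB / 4KiB = 512 = 2^9 pages */

struct unmap_range {
	uint64_t start;	/* byte offset into the file */
	uint64_t len;	/* number of bytes to unmap */
};

/* Mirrors "index << PAGE_SHIFT" and "PAGE_SIZE << order" from the patch. */
struct unmap_range entry_unmap_range(uint64_t index, unsigned int order)
{
	struct unmap_range r = {
		.start = index << PAGE_SHIFT,
		.len   = PAGE_SIZE << order,
	};
	return r;
}
```

With these assumptions, page index 512 at order 9 gives a start offset
of 2MiB and a length of 2MiB, while index 3 at order 0 gives offset
12288 and length 4096.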

Hum, but now that I think about it more, I have a hard time figuring out why
write vs fault cannot actually still race:

CPU1 - write(2)				CPU2 - read fault

					dax_iomap_pte_fault()
					  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
					  grab_mapping_entry()
					  - we add zero page in the radix
					    tree & map it to page tables

Similarly, read fault vs write fault may end up racing the wrong way and try
to replace an already existing exceptional entry with a hole page?

								Honza
> +
> +	spin_lock_irq(&mapping->tree_lock);
>  	radix_tree_delete(page_tree, index);
>  	mapping->nrexceptional--;
>  	ret = 1;
>  out:
> -	put_unlocked_mapping_entry(mapping, index, entry);
>  	spin_unlock_irq(&mapping->tree_lock);
> +	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
>  	return ret;
>  }
>  /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		return -EIO;
>  
>  	/*
> -	 * Write can allocate block for an area which has a hole page mapped
> -	 * into page tables. We have to tear down these mappings so that data
> -	 * written by write(2) is visible in mmap.
> +	 * Write can allocate block for an area which has a hole page or zero
> +	 * PMD entry in the radix tree.  We have to tear down these mappings so
> +	 * that data written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if (iomap->flags & IOMAP_F_NEW) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);
> -- 
> 2.9.3
>

Ross Zwisler April 25, 2017, 10:59 p.m. UTC | #2

On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
<>
> Hum, but now thinking more about it I have hard time figuring out why write
> vs fault cannot actually still race:
> 
> CPU1 - write(2)				CPU2 - read fault
> 
> 					dax_iomap_pte_fault()
> 					  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 					  grab_mapping_entry()
> 					  - we add zero page in the radix
> 					    tree & map it to page tables
> 
> Similarly read vs write fault may end up racing in a wrong way and try to
> replace already existing exceptional entry with a hole page?

Yep, this race seems real to me, too.  This seems very much like the issues
that exist when a thread is doing direct I/O.  One thread is doing I/O to an
intermediate buffer (page cache for direct I/O case, zero page for us), and
the other is going around it directly to media, and they can get out of sync.

IIRC the direct I/O code looked something like:

1/ invalidate existing mappings
2/ do direct I/O to media
3/ invalidate mappings again, just in case.  Should be cheap if there weren't
   any conflicting faults.  This makes sure any new allocations we made are
   faulted in.
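
For illustration, the three steps above can be modeled in user space.
This is a toy model, not the kernel's direct I/O code: struct toy_file,
cached_read() and direct_write() are hypothetical names, and the
racing_reader flag stands in for a conflicting fault that lands between
steps 1 and 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* Toy model: 'media' is the backing store, 'cache' an intermediate copy. */
struct toy_file {
	int  media[NBLOCKS];
	int  cache[NBLOCKS];
	bool cached[NBLOCKS];
};

/* A reader that goes through the cache, filling it from media on a miss. */
int cached_read(struct toy_file *f, size_t blk)
{
	if (!f->cached[blk]) {
		f->cache[blk] = f->media[blk];
		f->cached[blk] = true;
	}
	return f->cache[blk];
}

static void invalidate(struct toy_file *f, size_t blk)
{
	f->cached[blk] = false;
}

/*
 * Direct write following the three steps. 'racing_reader' is a
 * hypothetical hook that models a fault arriving between steps 1 and 2;
 * pass false to model the uncontended case.
 */
void direct_write(struct toy_file *f, size_t blk, int val, bool racing_reader)
{
	invalidate(f, blk);			/* 1: invalidate existing mappings */
	if (racing_reader)
		(void)cached_read(f, blk);	/* re-populates a stale copy */
	f->media[blk] = val;			/* 2: do the I/O directly to media */
	invalidate(f, blk);			/* 3: invalidate again, just in case */
}
```

In this model the second invalidation removes the stale copy created by
the racing reader, so a later cached_read() returns the new value. It
does not cover a reader that samples the media before step 2 and stores
its copy only after step 3.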

I guess one option would be to replicate that logic in the DAX I/O path, or we
could try and enhance our locking so page faults can't race with I/O since
both can allocate blocks.

I'm not sure, but will think on it.

Jan Kara April 26, 2017, 8:52 a.m. UTC | #3

On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> <>
> > Hum, but now thinking more about it I have hard time figuring out why write
> > vs fault cannot actually still race:
> > 
> > CPU1 - write(2)				CPU2 - read fault
> > 
> > 					dax_iomap_pte_fault()
> > 					  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 					  grab_mapping_entry()
> > 					  - we add zero page in the radix
> > 					    tree & map it to page tables
> > 
> > Similarly read vs write fault may end up racing in a wrong way and try to
> > replace already existing exceptional entry with a hole page?
> 
> Yep, this race seems real to me, too.  This seems very much like the issues
> that exist when a thread is doing direct I/O.  One thread is doing I/O to an
> intermediate buffer (page cache for direct I/O case, zero page for us), and
> the other is going around it directly to media, and they can get out of sync.
> 
> IIRC the direct I/O code looked something like:
> 
> 1/ invalidate existing mappings
> 2/ do direct I/O to media
> 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
>    any conflicting faults.  This makes sure any new allocations we made are
>    faulted in.

Yeah, the problem is that people generally expect weird behavior when they mix
direct and buffered IO (let alone mmap); however, everyone expects standard
read(2) and write(2) to be completely coherent with mmap(2).

> I guess one option would be to replicate that logic in the DAX I/O path, or we
> could try and enhance our locking so page faults can't race with I/O since
> both can allocate blocks.

Put abstractly, the problem is that the radix tree (and the page tables)
cache block mapping information, and the operation "read block mapping
information, store it in the radix tree" is not serialized in any way
against other block allocations, so the information we store can be out of
date by the time we store it.
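
A minimal single-threaded model of that unserialized "read, then store"
sequence, replaying the CPU1/CPU2 interleaving from the diagram earlier
in this thread. All names here (struct toy_inode, fault_lookup(),
fault_store(), write_allocate()) are hypothetical and only mimic the
roles of ->iomap_begin, grab_mapping_entry() and
invalidate_inode_pages2_range().

```c
#include <assert.h>
#include <stdbool.h>

enum radix_entry { ENTRY_EMPTY, ENTRY_ZERO_PAGE, ENTRY_BLOCK };

struct toy_inode {
	bool block_allocated;		/* filesystem block mapping info */
	enum radix_entry entry;		/* what the radix tree caches */
};

/* Fault, first half: ask the filesystem what backs this offset. */
bool fault_lookup(const struct toy_inode *ino)
{
	return ino->block_allocated;
}

/* Fault, second half: cache the (possibly outdated) answer. */
void fault_store(struct toy_inode *ino, bool saw_block)
{
	ino->entry = saw_block ? ENTRY_BLOCK : ENTRY_ZERO_PAGE;
}

/* write(2): allocate the block, then drop whatever is cached right now. */
void write_allocate(struct toy_inode *ino)
{
	ino->block_allocated = true;
	ino->entry = ENTRY_EMPTY;
}

/* True when the cache claims a hole although a block exists. */
bool entry_is_stale(const struct toy_inode *ino)
{
	return ino->entry == ENTRY_ZERO_PAGE && ino->block_allocated;
}

/* Replays: CPU2 looks up, CPU1 writes and invalidates, CPU2 stores. */
bool replay_race(void)
{
	struct toy_inode ino = { false, ENTRY_EMPTY };
	bool seen = fault_lookup(&ino);	/* CPU2: sees hole */

	write_allocate(&ino);		/* CPU1: nothing to invalidate yet */
	fault_store(&ino, seen);	/* CPU2: stores zero page */
	return entry_is_stale(&ino);
}
```

In this model replay_race() reports a stale entry, whereas running the
lookup and the store back to back before write_allocate() does not.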

One way to solve this would be to move the ->iomap_begin call in the fault
paths under the entry lock, although that would mean I have to redo how ext4
handles DAX faults, because with the current code it would create a lock
inversion wrt transaction start.

Another solution would be to grab i_mmap_sem for write when handling a write
fault of a page, and similarly to grab it for write when doing write(2). This
would scale rather poorly, but if we later replaced it with a range lock
(Davidlohr has already posted a nice implementation of one) it wouldn't be as
bad. But I guess option 1) is better...

								Honza

Ross Zwisler April 26, 2017, 10:52 p.m. UTC | #4

On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
> On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> > <>
> > > Hum, but now thinking more about it I have hard time figuring out why write
> > > vs fault cannot actually still race:
> > > 
> > > CPU1 - write(2)				CPU2 - read fault
> > > 
> > > 					dax_iomap_pte_fault()
> > > 					  ->iomap_begin() - sees hole
> > > dax_iomap_rw()
> > >   iomap_apply()
> > >     ->iomap_begin - allocates blocks
> > >     dax_iomap_actor()
> > >       invalidate_inode_pages2_range()
> > >         - there's nothing to invalidate
> > > 					  grab_mapping_entry()
> > > 					  - we add zero page in the radix
> > > 					    tree & map it to page tables
> > > 
> > > Similarly read vs write fault may end up racing in a wrong way and try to
> > > replace already existing exceptional entry with a hole page?
> > 
> > Yep, this race seems real to me, too.  This seems very much like the issues
> > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
> > intermediate buffer (page cache for direct I/O case, zero page for us), and
> > the other is going around it directly to media, and they can get out of sync.
> > 
> > IIRC the direct I/O code looked something like:
> > 
> > 1/ invalidate existing mappings
> > 2/ do direct I/O to media
> > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
> >    any conflicting faults.  This makes sure any new allocations we made are
> >    faulted in.
> 
> Yeah, the problem is people generally expect weird behavior when they mix
> direct and buffered IO (or let alone mmap) however everyone expects
> standard read(2) and write(2) to be completely coherent with mmap(2).

Yep, fair enough.

> > I guess one option would be to replicate that logic in the DAX I/O path, or we
> > could try and enhance our locking so page faults can't race with I/O since
> > both can allocate blocks.
> 
> In the abstract way, the problem is that we have radix tree (and page
> tables) cache block mapping information and the operation: "read block
> mapping information, store it in the radix tree" is not serialized in any
> way against other block allocations so the information we store can be out
> of date by the time we store it.
> 
> One way to solve this would be to move ->iomap_begin call in the fault
> paths under entry lock although that would mean I have to redo how ext4
> handles DAX faults because with current code it would create lock inversion
> wrt transaction start.

I don't think this alone is enough to save us.  The I/O path doesn't currently
take any DAX radix tree entry locks, so our race would just become:

CPU1 - write(2)				CPU2 - read fault

					dax_iomap_pte_fault()
					  grab_mapping_entry() // newly moved
					  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
					  - we add zero page in the radix
					    tree & map it to page tables

In their current form I don't think we want to take DAX radix tree entry locks
in the I/O path because that would effectively serialize I/O over a given
radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
would be serialized.

> Another solution would be to grab i_mmap_sem for write when doing write
> fault of a page and similarly have it grabbed for writing when doing
> write(2). This would scale rather poorly but if we later replaced it with a
> range lock (Davidlohr has already posted a nice implementation of it) it
> won't be as bad. But I guess option 1) is better...

The best idea I had for handling this sounds similar, which would be to
convert the radix tree locks to essentially be reader/writer locks.  I/O and
faults that don't modify the block mapping could just take read-level locks,
and could all run concurrently.  I/O or faults that modify a block mapping
would take a write lock, and serialize with other writers and readers.

You could know if you needed a write lock without asking the filesystem - if
you're a write and the radix tree entry is empty or is for a zero page, you
grab the write lock.
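
The rule in the previous paragraph can be written down as a small
decision function. This is only a sketch of the proposal as stated here;
the enum and function names are hypothetical and nothing like this
exists in fs/dax.c.

```c
#include <assert.h>
#include <stdbool.h>

enum entry_state { ENTRY_EMPTY, ENTRY_ZERO_PAGE, ENTRY_MAPPED };
enum lock_mode   { LOCK_READ, LOCK_WRITE };

/*
 * A write that hits an empty slot or a zero page may have to allocate
 * blocks, so it takes the write lock. Everything else leaves the block
 * mapping alone and can share a read lock.
 */
enum lock_mode pick_lock_mode(bool is_write, enum entry_state state)
{
	if (is_write && (state == ENTRY_EMPTY || state == ENTRY_ZERO_PAGE))
		return LOCK_WRITE;
	return LOCK_READ;
}
```

The point of the rule is that the decision uses only the radix tree
entry and the access type, so no call into the filesystem is needed.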

This dovetails nicely with the idea of having the radix tree act as a cache
for block mappings.  You take the appropriate lock on the radix tree entry,
and it has the block mapping info for your I/O or fault so you don't have to
call into the FS.  I/O would also participate so we would keep info about
block mappings that we gather from I/O to help shortcut our page faults.

How does this sound vs the range lock idea?  How hard do you think it would be
to convert our current wait queue system to reader/writer style locking?

Also, how do you think we should deal with the current PMD corruption?  Should
we go with the current fix (I can augment the comments as you suggested), and
then handle optimizations to that approach and the solution to this larger
race as a follow-on?

Jan Kara April 27, 2017, 7:26 a.m. UTC | #5

On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
> On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
> > On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> > > <>
> > > > Hum, but now thinking more about it I have hard time figuring out why write
> > > > vs fault cannot actually still race:
> > > > 
> > > > CPU1 - write(2)				CPU2 - read fault
> > > > 
> > > > 					dax_iomap_pte_fault()
> > > > 					  ->iomap_begin() - sees hole
> > > > dax_iomap_rw()
> > > >   iomap_apply()
> > > >     ->iomap_begin - allocates blocks
> > > >     dax_iomap_actor()
> > > >       invalidate_inode_pages2_range()
> > > >         - there's nothing to invalidate
> > > > 					  grab_mapping_entry()
> > > > 					  - we add zero page in the radix
> > > > 					    tree & map it to page tables
> > > > 
> > > > Similarly read vs write fault may end up racing in a wrong way and try to
> > > > replace already existing exceptional entry with a hole page?
> > > 
> > > Yep, this race seems real to me, too.  This seems very much like the issues
> > > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
> > > intermediate buffer (page cache for direct I/O case, zero page for us), and
> > > the other is going around it directly to media, and they can get out of sync.
> > > 
> > > IIRC the direct I/O code looked something like:
> > > 
> > > 1/ invalidate existing mappings
> > > 2/ do direct I/O to media
> > > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
> > >    any conflicting faults.  This makes sure any new allocations we made are
> > >    faulted in.
> > 
> > Yeah, the problem is people generally expect weird behavior when they mix
> > direct and buffered IO (or let alone mmap) however everyone expects
> > standard read(2) and write(2) to be completely coherent with mmap(2).
> 
> Yep, fair enough.
> 
> > > I guess one option would be to replicate that logic in the DAX I/O path, or we
> > > could try and enhance our locking so page faults can't race with I/O since
> > > both can allocate blocks.
> > 
> > In the abstract way, the problem is that we have radix tree (and page
> > tables) cache block mapping information and the operation: "read block
> > mapping information, store it in the radix tree" is not serialized in any
> > way against other block allocations so the information we store can be out
> > of date by the time we store it.
> > 
> > One way to solve this would be to move ->iomap_begin call in the fault
> > paths under entry lock although that would mean I have to redo how ext4
> > handles DAX faults because with current code it would create lock inversion
> > wrt transaction start.
> 
> I don't think this alone is enough to save us.  The I/O path doesn't currently
> take any DAX radix tree entry locks, so our race would just become:
> 
> CPU1 - write(2)				CPU2 - read fault
> 
> 					dax_iomap_pte_fault()
> 					  grab_mapping_entry() // newly moved
> 					  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 					  - we add zero page in the radix
> 					    tree & map it to page tables
> 
> In their current form I don't think we want to take DAX radix tree entry locks
> in the I/O path because that would effectively serialize I/O over a given
> radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
> would be serialized.

Note that invalidate_inode_pages2_range() will see the entry created by
grab_mapping_entry() on CPU2 and block waiting for its lock, and this is
exactly what stops the race. The invalidate_inode_pages2_range() call
effectively makes sure there isn't any page fault in progress for the given
range...

Also note that writes to a file are serialized by i_rwsem anyway (and at
least serialization of writes to the overlapping range is required by POSIX)
so this doesn't add any more serialization than we already have.
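
To make the argument concrete, here is a single-threaded toy replay of
the diagram with the filesystem lookup moved under the entry lock. It is
a model under assumptions, not kernel code: the names are hypothetical,
and "blocking" is represented by try_invalidate() returning false so the
writer retries once the fault has finished.

```c
#include <assert.h>
#include <stdbool.h>

enum radix_entry {
	ENTRY_NONE, ENTRY_PLACEHOLDER, ENTRY_ZERO_PAGE, ENTRY_BLOCK
};

struct toy_inode {
	bool block_allocated;		/* filesystem block mapping info */
	enum radix_entry entry;		/* radix tree entry for the range */
	bool entry_locked;		/* entry lock held by a fault */
};

/* Fault: create and lock the entry BEFORE asking the filesystem. */
bool fault_begin(struct toy_inode *ino)
{
	ino->entry = ENTRY_PLACEHOLDER;
	ino->entry_locked = true;
	return ino->block_allocated;	/* stands in for ->iomap_begin */
}

/* Fault: store what was seen and drop the entry lock. */
void fault_finish(struct toy_inode *ino, bool saw_block)
{
	ino->entry = saw_block ? ENTRY_BLOCK : ENTRY_ZERO_PAGE;
	ino->entry_locked = false;
}

/* Returns false when the caller would have to sleep on the entry lock. */
bool try_invalidate(struct toy_inode *ino)
{
	if (ino->entry == ENTRY_NONE)
		return true;
	if (ino->entry_locked)
		return false;
	ino->entry = ENTRY_NONE;
	return true;
}

/* Returns true if a stale zero page entry survives the replay. */
bool replay_with_entry_lock(bool *writer_had_to_wait)
{
	struct toy_inode ino = { false, ENTRY_NONE, false };
	bool seen = fault_begin(&ino);		/* CPU2: locks entry, sees hole */

	ino.block_allocated = true;		/* CPU1: allocates blocks */
	*writer_had_to_wait = !try_invalidate(&ino);	/* CPU1: must wait */

	fault_finish(&ino, seen);		/* CPU2: stores zero page, unlocks */
	if (*writer_had_to_wait)
		(void)try_invalidate(&ino);	/* CPU1: wakes up, drops entry */

	return ino.entry == ENTRY_ZERO_PAGE && ino.block_allocated;
}
```

In this model the writer has to wait and no stale entry is left behind,
because the invalidation can only run after the fault has stored its
entry and released the lock.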

> > Another solution would be to grab i_mmap_sem for write when doing write
> > fault of a page and similarly have it grabbed for writing when doing
> > write(2). This would scale rather poorly but if we later replaced it with a
> > range lock (Davidlohr has already posted a nice implementation of it) it
> > won't be as bad. But I guess option 1) is better...
> 
> The best idea I had for handling this sounds similar, which would be to
> convert the radix tree locks to essentially be reader/writer locks.  I/O and
> faults that don't modify the block mapping could just take read-level locks,
> and could all run concurrently.  I/O or faults that modify a block mapping
> would take a write lock, and serialize with other writers and readers.

Well, this would be difficult to implement inside the radix tree (not
enough bits in the entry), so you'd have to go for some external locking
primitive anyway. And if you do that, the read-write range lock Davidlohr
has implemented is what you describe. We could also have a radix tree with
rwsems, but I suspect the overhead of maintaining that would be too large.
It would require a larger rewrite than reusing entry locks as I suggest
above, though, and it isn't an obvious performance win for realistic
workloads either, so I'd like to see some performance numbers before going
that way. It likely improves the situation where processes race to fault the
same page for which we already know the block mapping, but I'm not sure
that translates to any measurable performance win for workloads on a DAX
filesystem.

> You could know if you needed a write lock without asking the filesystem - if
> you're a write and the radix tree entry is empty or is for a zero page, you
> grab the write lock.
> 
> This dovetails nicely with the idea of having the radix tree act as a cache
> for block mappings.  You take the appropriate lock on the radix tree entry,
> and it has the block mapping info for your I/O or fault so you don't have to
> call into the FS.  I/O would also participate so we would keep info about
> block mappings that we gather from I/O to help shortcut our page faults.
> 
> How does this sound vs the range lock idea?  How hard do you think it would be
> to convert our current wait queue system to reader/writer style locking?
> 
> Also, how do you think we should deal with the current PMD corruption?  Should
> we go with the current fix (I can augment the comments as you suggested), and
> then handle optimizations to that approach and the solution to this larger
> race as a follow-on?

So for now I'm still more inclined to stay with the radix tree lock as is,
fix up the locking as I suggest, and go for a larger rewrite only if we can
demonstrate further performance wins.

WRT your second patch, if we go with the locking as I suggest, it is enough
to unmap the whole range after invalidate_inode_pages2() has cleared the
radix tree entries (*), which will be much cheaper (for large writes) than
unmapping entry by entry. So I'd go for that. I'll prepare a patch for the
locking change - it will require changes to ext4 transaction handling, so it
won't be completely trivial.

(*) The flow of information is: filesystem block mapping info -> radix tree
-> page tables. So if the filesystem block mapping info changes, we should
first invalidate the corresponding radix tree entries (new entries will
already have up-to-date info) and then invalidate the corresponding page
tables (again, once the radix tree has no stale entries, we are sure new
page table entries will be up to date).
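
As a rough illustration of the cost difference only, the following toy
model counts unmap calls for the two strategies: unmapping entry by
entry versus clearing all radix tree entries first and unmapping the
whole range once. The names are hypothetical, "one rmap walk per unmap
call" is a modeling assumption, and the sketch says nothing about
whether the batched ordering is safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NENTRIES 16

struct toy_mapping {
	bool radix[NENTRIES];	/* radix tree entry present */
	bool pte[NENTRIES];	/* page table entry present */
	unsigned int rmap_walks;	/* number of unmap calls made */
};

/* Start with every entry cached and mapped. */
void populate(struct toy_mapping *m)
{
	for (size_t i = 0; i < NENTRIES; i++) {
		m->radix[i] = true;
		m->pte[i] = true;
	}
	m->rmap_walks = 0;
}

static void unmap_range(struct toy_mapping *m, size_t start, size_t end)
{
	m->rmap_walks++;	/* one reverse-map walk per call, in this model */
	for (size_t i = start; i <= end; i++)
		m->pte[i] = false;
}

/* Unmap each entry individually, then delete it from the radix tree. */
void invalidate_per_entry(struct toy_mapping *m, size_t start, size_t end)
{
	for (size_t i = start; i <= end; i++) {
		if (!m->radix[i])
			continue;
		unmap_range(m, i, i);
		m->radix[i] = false;
	}
}

/* Clear all radix tree entries first, then unmap the range once. */
void invalidate_then_unmap(struct toy_mapping *m, size_t start, size_t end)
{
	for (size_t i = start; i <= end; i++)
		m->radix[i] = false;
	unmap_range(m, start, end);
}
```

For a fully populated range of NENTRIES entries, the per-entry variant
makes NENTRIES unmap calls in this model and the batched variant makes
one, with the same final state in both cases.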

								Honza

Ross Zwisler May 1, 2017, 10:38 p.m. UTC | #6

On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote:
> On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
<>
> > I don't think this alone is enough to save us.  The I/O path doesn't currently
> > take any DAX radix tree entry locks, so our race would just become:
> > 
> > CPU1 - write(2)				CPU2 - read fault
> > 
> > 					dax_iomap_pte_fault()
> > 					  grab_mapping_entry() // newly moved
> > 					  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 					  - we add zero page in the radix
> > 					    tree & map it to page tables
> > 
> > In their current form I don't think we want to take DAX radix tree entry locks
> > in the I/O path because that would effectively serialize I/O over a given
> > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
> > would be serialized.
> 
> Note that invalidate_inode_pages2_range() will see the entry created by
> grab_mapping_entry() on CPU2 and block waiting for its lock and this is
> exactly what stops the race. The invalidate_inode_pages2_range()
> effectively makes sure there isn't any page fault in progress for given
> range...

Yep, this is the bit that I was missing.  Thanks.

> Also note that writes to a file are serialized by i_rwsem anyway (and at
> least serialization of writes to the overlapping range is required by POSIX)
> so this doesn't add any more serialization than we already have.
> 
> > > Another solution would be to grab i_mmap_sem for write when doing write
> > > fault of a page and similarly have it grabbed for writing when doing
> > > write(2). This would scale rather poorly but if we later replaced it with a
> > > range lock (Davidlohr has already posted a nice implementation of it) it
> > > won't be as bad. But I guess option 1) is better...
> > 
> > The best idea I had for handling this sounds similar, which would be to
> > convert the radix tree locks to essentially be reader/writer locks.  I/O and
> > faults that don't modify the block mapping could just take read-level locks,
> > and could all run concurrently.  I/O or faults that modify a block mapping
> > would take a write lock, and serialize with other writers and readers.
> 
> Well, this would be difficult to implement inside the radix tree (not
> enough bits in the entry) so you'd have to go for some external locking
> primitive anyway. And if you do that, read-write range lock Davidlohr has
> implemented is what you describe - well we could also have a radix tree
> with rwsems but I suspect the overhead of maintaining that would be too
> large. It would require larger rewrite than reusing entry locks as I
> suggest above though and it isn't an obvious performance win for realistic
> workloads either so I'd like to see some performance numbers before going
> that way. It likely improves a situation where processes race to fault the
> same page for which we already know the block mapping but I'm not sure if
> that translates to any measurable performance wins for workloads on DAX
> filesystem.
> 
> > You could know if you needed a write lock without asking the filesystem - if
> > you're a write and the radix tree entry is empty or is for a zero page, you
> > grab the write lock.
> > 
> > This dovetails nicely with the idea of having the radix tree act as a cache
> > for block mappings.  You take the appropriate lock on the radix tree entry,
> > and it has the block mapping info for your I/O or fault so you don't have to
> > call into the FS.  I/O would also participate so we would keep info about
> > block mappings that we gather from I/O to help shortcut our page faults.
> > 
> > How does this sound vs the range lock idea?  How hard do you think it would be
> > to convert our current wait queue system to reader/writer style locking?
> > 
> > Also, how do you think we should deal with the current PMD corruption?  Should
> > we go with the current fix (I can augment the comments as you suggested), and
> > then handle optimizations to that approach and the solution to this larger
> > race as a follow-on?
> 
> So for now I'm still more inclined to just stay with the radix tree lock as
> is and just fix up the locking as I suggest and go for larger rewrite only
> if we can demonstrate further performance wins.

Sounds good.

> WRT your second patch, if we go with the locking as I suggest, it is enough
> to unmap the whole range after invalidate_inode_pages2() has cleared radix
> tree entries (*) which will be much cheaper (for large writes) than doing
> unmapping entry by entry.

I'm still not convinced that it is safe to do the unmap in a separate step.  I
see your point about it being expensive to do a rmap walk to unmap each entry
in __dax_invalidate_mapping_entry(), but I think we might need to because the
unmap is part of the contract imposed by invalidate_inode_pages2_range() and
invalidate_inode_pages2().  This exists in the header comment above each:

 * Any pages which are found to be mapped into pagetables are unmapped prior
 * to invalidation.

If you look at the usage of invalidate_inode_pages2_range() in
generic_file_direct_write() for example (which I realize we won't call for a
DAX inode, but still), I think that it really does rely on the fact that
invalidated pages are unmapped, right?  If it didn't, and hole pages were
mapped, the hole pages could remain mapped while a direct I/O write allocated
blocks and then wrote real data.

If we really want to unmap the entire range at once, maybe it would have to be
done in invalidate_inode_pages2_range(), after the loop?  My hesitation about
this is that we'd be leaking yet more DAX special casing up into the
mm/truncate.c code.

Or am I missing something?

> So I'd go for that. I'll prepare a patch for the
> locking change - it will require changes to ext4 transaction handling so it
> won't be completely trivial.
> 
> (*) The flow of information is: filesystem block mapping info -> radix tree
> -> page tables so if 'filesystem block mapping info' changes, we should go
> invalidate corresponding radix tree entries (new entries will already have
> uptodate info) and then invalidate corresponding page tables (again once
> radix tree has no stale entries, we are sure new page table entries will be
> uptodate).
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Dan Williams May 1, 2017, 10:59 p.m. UTC | #7

On Thu, Apr 27, 2017 at 12:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
>> On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
>> > On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
>> > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
>> > > <>
>> > > > Hum, but now thinking more about it I have hard time figuring out why write
>> > > > vs fault cannot actually still race:
>> > > >
>> > > > CPU1 - write(2)                         CPU2 - read fault
>> > > >
>> > > >                                         dax_iomap_pte_fault()
>> > > >                                           ->iomap_begin() - sees hole
>> > > > dax_iomap_rw()
>> > > >   iomap_apply()
>> > > >     ->iomap_begin - allocates blocks
>> > > >     dax_iomap_actor()
>> > > >       invalidate_inode_pages2_range()
>> > > >         - there's nothing to invalidate
>> > > >                                           grab_mapping_entry()
>> > > >                                           - we add zero page in the radix
>> > > >                                             tree & map it to page tables
>> > > >
>> > > > Similarly read vs write fault may end up racing in a wrong way and try to
>> > > > replace already existing exceptional entry with a hole page?
>> > >
>> > > Yep, this race seems real to me, too.  This seems very much like the issues
>> > > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
>> > > intermediate buffer (page cache for direct I/O case, zero page for us), and
>> > > the other is going around it directly to media, and they can get out of sync.
>> > >
>> > > IIRC the direct I/O code looked something like:
>> > >
>> > > 1/ invalidate existing mappings
>> > > 2/ do direct I/O to media
>> > > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
>> > >    any conflicting faults.  This makes sure any new allocations we made are
>> > >    faulted in.
>> >
>> > Yeah, the problem is people generally expect weird behavior when they mix
>> > direct and buffered IO (or let alone mmap) however everyone expects
>> > standard read(2) and write(2) to be completely coherent with mmap(2).
>>
>> Yep, fair enough.
>>
>> > > I guess one option would be to replicate that logic in the DAX I/O path, or we
>> > > could try and enhance our locking so page faults can't race with I/O since
>> > > both can allocate blocks.
>> >
>> > In the abstract way, the problem is that we have radix tree (and page
>> > tables) cache block mapping information and the operation: "read block
>> > mapping information, store it in the radix tree" is not serialized in any
>> > way against other block allocations so the information we store can be out
>> > of date by the time we store it.
>> >
>> > One way to solve this would be to move ->iomap_begin call in the fault
>> > paths under entry lock although that would mean I have to redo how ext4
>> > handles DAX faults because with current code it would create lock inversion
>> > wrt transaction start.
>>
>> I don't think this alone is enough to save us.  The I/O path doesn't currently
>> take any DAX radix tree entry locks, so our race would just become:
>>
>> CPU1 - write(2)                               CPU2 - read fault
>>
>>                                       dax_iomap_pte_fault()
>>                                         grab_mapping_entry() // newly moved
>>                                         ->iomap_begin() - sees hole
>> dax_iomap_rw()
>>   iomap_apply()
>>     ->iomap_begin - allocates blocks
>>     dax_iomap_actor()
>>       invalidate_inode_pages2_range()
>>         - there's nothing to invalidate
>>                                         - we add zero page in the radix
>>                                           tree & map it to page tables
>>
>> In their current form I don't think we want to take DAX radix tree entry locks
>> in the I/O path because that would effectively serialize I/O over a given
>> radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
>> would be serialized.
>
> Note that invalidate_inode_pages2_range() will see the entry created by
> grab_mapping_entry() on CPU2 and block waiting for its lock and this is
> exactly what stops the race. The invalidate_inode_pages2_range()
> effectively makes sure there isn't any page fault in progress for given
> range...
>
> Also note that writes to a file are serialized by i_rwsem anyway (and at
> least serialization of writes to the overlapping range is required by POSIX)
> so this doesn't add any more serialization than we already have.
>
>> > Another solution would be to grab i_mmap_sem for write when doing write
>> > fault of a page and similarly have it grabbed for writing when doing
>> > write(2). This would scale rather poorly but if we later replaced it with a
>> > range lock (Davidlohr has already posted a nice implementation of it) it
>> > won't be as bad. But I guess option 1) is better...
>>
>> The best idea I had for handling this sounds similar, which would be to
>> convert the radix tree locks to essentially be reader/writer locks.  I/O and
>> faults that don't modify the block mapping could just take read-level locks,
>> and could all run concurrently.  I/O or faults that modify a block mapping
>> would take a write lock, and serialize with other writers and readers.
>
> Well, this would be difficult to implement inside the radix tree (not
> enough bits in the entry) so you'd have to go for some external locking
> primitive anyway. And if you do that, read-write range lock Davidlohr has
> implemented is what you describe - well we could also have a radix tree
> with rwsems but I suspect the overhead of maintaining that would be too
> large. It would require larger rewrite than reusing entry locks as I
> suggest above though and it isn't an obvious performance win for realistic
> workloads either so I'd like to see some performance numbers before going
> that way. It likely improves a situation where processes race to fault the
> same page for which we already know the block mapping but I'm not sure if
> that translates to any measurable performance wins for workloads on DAX
> filesystem.

I'm also concerned about inventing new / fancy radix infrastructure
when we're already in the space of needing struct page for any
non-trivial usage of dax. As Kirill's transparent-huge-page page cache
implementation matures I'd be interested in looking at a transition
path away from radix locking towards something that is shared with the
common case page cache locking.
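
The write-versus-read-fault interleaving quoted above, and Jan's point that
invalidate_inode_pages2_range() blocks on the entry lock, can be followed in a
small single-threaded toy model. This is only a sketch under stated
assumptions: struct toy_file, toy_stale_after_race() and the "pending
invalidation" flag are hypothetical names invented for illustration, not
kernel code, and sleeping on the entry lock is modelled by deferring the
invalidation until the fault has dropped the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the state discussed in the thread (all hypothetical). */
enum toy_entry { TOY_ENTRY_NONE, TOY_ENTRY_ZERO_PAGE };

struct toy_file {
	bool block_allocated;   /* filesystem block mapping info */
	enum toy_entry entry;   /* radix tree entry for the faulting offset */
	bool entry_locked;      /* radix tree entry lock held by the fault */
};

/*
 * Replay the interleaving from the thread in a fixed order.
 * Returns true if a zero page entry survives although blocks were allocated.
 */
static bool toy_stale_after_race(bool lookup_under_entry_lock)
{
	struct toy_file f = { false, TOY_ENTRY_NONE, false };
	bool invalidate_pending = false;
	bool fault_saw_hole;

	/* CPU2: read fault begins. */
	if (lookup_under_entry_lock)
		f.entry_locked = true;          /* entry lock taken first */
	fault_saw_hole = !f.block_allocated;    /* ->iomap_begin() sees hole */

	/* CPU1: write(2) allocates blocks, then invalidates the range. */
	f.block_allocated = true;
	if (f.entry_locked)
		invalidate_pending = true;      /* would sleep on the entry lock */
	else
		f.entry = TOY_ENTRY_NONE;       /* nothing to invalidate yet */

	/* CPU2: fault completes using its (now stale) lookup result. */
	if (fault_saw_hole)
		f.entry = TOY_ENTRY_ZERO_PAGE;
	f.entry_locked = false;

	/* CPU1: the invalidation that waited for the lock runs now. */
	if (invalidate_pending)
		f.entry = TOY_ENTRY_NONE;

	return f.block_allocated && f.entry == TOY_ENTRY_ZERO_PAGE;
}
```

Calling toy_stale_after_race(false) replays the current ordering and leaves
the stale zero page entry behind; toy_stale_after_race(true) replays the
ordering with the lookup moved under the entry lock, where the delayed
invalidation removes it.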

Jan Kara May 4, 2017, 9:12 a.m. UTC | #8

On Mon 01-05-17 16:38:55, Ross Zwisler wrote:
> > So for now I'm still more inclined to just stay with the radix tree lock as
> > is and just fix up the locking as I suggest and go for larger rewrite only
> > if we can demonstrate further performance wins.
> 
> Sounds good.
> 
> > WRT your second patch, if we go with the locking as I suggest, it is enough
> > to unmap the whole range after invalidate_inode_pages2() has cleared radix
> > tree entries (*) which will be much cheaper (for large writes) than doing
> > unmapping entry by entry.
> 
> I'm still not convinced that it is safe to do the unmap in a separate step.  I
> see your point about it being expensive to do a rmap walk to unmap each entry
> in __dax_invalidate_mapping_entry(), but I think we might need to because the
> unmap is part of the contract imposed by invalidate_inode_pages2_range() and
> invalidate_inode_pages2().  This exists in the header comment above each:
> 
>  * Any pages which are found to be mapped into pagetables are unmapped prior
>  * to invalidation.
> 
> If you look at the usage of invalidate_inode_pages2_range() in
> generic_file_direct_write() for example (which I realize we won't call for a
> DAX inode, but still), I think that it really does rely on the fact that
> invalidated pages are unmapped, right?  If it didn't, and hole pages were
> mapped, the hole pages could remain mapped while a direct I/O write allocated
> blocks and then wrote real data.
> 
> If we really want to unmap the entire range at once, maybe it would have to be
> done in invalidate_inode_pages2_range(), after the loop?  My hesitation about
> this is that we'd be leaking yet more DAX special casing up into the
> mm/truncate.c code.
> 
> Or am I missing something?

No, my thinking was to put the page table invalidation at the end of
invalidate_inode_pages2_range(). I agree it means more special-casing for
DAX in mm/truncate.c.

								Honza
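
The cost difference between the two approaches discussed here (unmapping
entry by entry versus one unmap of the whole range after the loop) can be
sketched with a toy model. Everything below is hypothetical and for
illustration only: struct toy_mapping, TOY_NR and the rmap_walks counter are
invented names, and the model counts unmap operations rather than performing
real rmap walks. It says nothing about whether the ordering is safe, which is
the question Ross raises.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR 8	/* hypothetical number of entries covered by the write */

struct toy_mapping {
	bool entry[TOY_NR];	/* radix tree entry present */
	bool mapped[TOY_NR];	/* page table mapping present */
	size_t rmap_walks;	/* number of unmap operations issued */
};

/* Start with every entry present and mapped. */
static void toy_fill(struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NR; i++)
		m->entry[i] = m->mapped[i] = true;
	m->rmap_walks = 0;
}

/* Variant A: unmap each entry as it is invalidated (one walk per entry). */
static void toy_invalidate_per_entry(struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NR; i++) {
		m->mapped[i] = false;
		m->rmap_walks++;
		m->entry[i] = false;
	}
}

/* Variant B: clear all entries in the loop, then unmap the range once. */
static void toy_invalidate_then_unmap(struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NR; i++)
		m->entry[i] = false;

	for (size_t i = 0; i < TOY_NR; i++)	/* models a single range unmap */
		m->mapped[i] = false;
	m->rmap_walks++;
}

/* True if no entry and no mapping is left behind. */
static bool toy_clean(const struct toy_mapping *m)
{
	for (size_t i = 0; i < TOY_NR; i++)
		if (m->entry[i] || m->mapped[i])
			return false;
	return true;
}
```

Both variants end in the same clean state; in this model variant A issues
TOY_NR unmap operations while variant B issues one, which is the saving Jan
refers to for large writes.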
