[12/12] dax: New fault locking
diff mbox

Message ID 1457637535-21633-13-git-send-email-jack@suse.cz
State New
Headers show

Commit Message

Jan Kara March 10, 2016, 7:18 p.m. UTC
Currently DAX page fault locking is racy.

CPU0 (write fault)		CPU1 (read fault)

__dax_fault()			__dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
				  get_block(inode, block, &bh, 0)
				    -> not mapped
  if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
      get_block(inode, block, &bh, 1) -> allocates blocks
  if (page) -> no
				  if (!buffer_mapped(&bh))
				    if (vmf->flags & FAULT_FLAG_WRITE) {
				    } else {
				      dax_load_hole();
				    }
  dax_insert_mapping()

And we are in a situation where we fail in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear dirty tag in the radix tree. We can always
end up with clean radix tree and dirty data in CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings acting very similarly to page lock and thus
synchronizing properly faults against the same mapping index. The same
lock can later be used to avoid races when clearing radix tree dirty
tag.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 470 ++++++++++++++++++++++++++++++++++++++--------------
 include/linux/dax.h |   1 +
 mm/truncate.c       |  62 ++++---
 3 files changed, 374 insertions(+), 159 deletions(-)

Comments

NeilBrown March 10, 2016, 11:54 p.m. UTC | #1
On Fri, Mar 11 2016, Jan Kara wrote:

> Currently DAX page fault locking is racy.
>
> CPU0 (write fault)		CPU1 (read fault)
>
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
>
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
>
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.

Hi,
 I think the exception locking bits look good - I cannot comment on the
 rest.
 I looks like it was a good idea to bring the locking into dax.c instead
 of trying to make it generic.

Thanks,
NeilBrown
NeilBrown March 15, 2016, 9:34 p.m. UTC | #2
On Fri, Mar 11 2016, NeilBrown wrote:

> On Fri, Mar 11 2016, Jan Kara wrote:
>
>> Currently DAX page fault locking is racy.
>>
>> CPU0 (write fault)		CPU1 (read fault)
>>
>> __dax_fault()			__dax_fault()
>>   get_block(inode, block, &bh, 0) -> not mapped
>> 				  get_block(inode, block, &bh, 0)
>> 				    -> not mapped
>>   if (!buffer_mapped(&bh))
>>     if (vmf->flags & FAULT_FLAG_WRITE)
>>       get_block(inode, block, &bh, 1) -> allocates blocks
>>   if (page) -> no
>> 				  if (!buffer_mapped(&bh))
>> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
>> 				    } else {
>> 				      dax_load_hole();
>> 				    }
>>   dax_insert_mapping()
>>
>> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>>
>> Another problem with the current DAX page fault locking is that there is
>> no race-free way to clear dirty tag in the radix tree. We can always
>> end up with clean radix tree and dirty data in CPU cache.
>>
>> We fix the first problem by introducing locking of exceptional radix
>> tree entries in DAX mappings acting very similarly to page lock and thus
>> synchronizing properly faults against the same mapping index. The same
>> lock can later be used to avoid races when clearing radix tree dirty
>> tag.
>
> Hi,
>  I think the exception locking bits look good - I cannot comment on the
>  rest.
>  I looks like it was a good idea to bring the locking into dax.c instead
>  of trying to make it generic.
>

Actually ... I'm still bothered by the exclusive waiting.  If an entry
is locked and there are two threads in dax_pfn_mkwrite() then one would
be woken up when the entry is unlocked and it will just set the TAG_DIRTY
flag and then continue without ever waking the next waiter on the
wait queue.

I *think* that any thread which gets an exclusive wakeup is responsible
for performing another wakeup.  In this case it must either lock the
slot, or call __wakeup.
That means:
  grab_mapping_entry needs to call wakeup:
     if radix_tree_preload() fails
     if radix_tree_insert fails other than with -EEXIST
     if a valid page was found
  dax_delete_mapping_entry needs to call wakeup
     if the fail case, though as that isn't expect (WARN_ON_ONCE)
        it should be a problem not to wakeup here
  dax_pfn_mkwrite needs to call wakeup unconditionally


Am I missing something?

Thanks,
NeilBrown
Jan Kara March 18, 2016, 2:16 p.m. UTC | #3
On Wed 16-03-16 08:34:28, NeilBrown wrote:
> On Fri, Mar 11 2016, NeilBrown wrote:
> 
> > On Fri, Mar 11 2016, Jan Kara wrote:
> >
> >> Currently DAX page fault locking is racy.
> >>
> >> CPU0 (write fault)		CPU1 (read fault)
> >>
> >> __dax_fault()			__dax_fault()
> >>   get_block(inode, block, &bh, 0) -> not mapped
> >> 				  get_block(inode, block, &bh, 0)
> >> 				    -> not mapped
> >>   if (!buffer_mapped(&bh))
> >>     if (vmf->flags & FAULT_FLAG_WRITE)
> >>       get_block(inode, block, &bh, 1) -> allocates blocks
> >>   if (page) -> no
> >> 				  if (!buffer_mapped(&bh))
> >> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> >> 				    } else {
> >> 				      dax_load_hole();
> >> 				    }
> >>   dax_insert_mapping()
> >>
> >> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> >>
> >> Another problem with the current DAX page fault locking is that there is
> >> no race-free way to clear dirty tag in the radix tree. We can always
> >> end up with clean radix tree and dirty data in CPU cache.
> >>
> >> We fix the first problem by introducing locking of exceptional radix
> >> tree entries in DAX mappings acting very similarly to page lock and thus
> >> synchronizing properly faults against the same mapping index. The same
> >> lock can later be used to avoid races when clearing radix tree dirty
> >> tag.
> >
> > Hi,
> >  I think the exception locking bits look good - I cannot comment on the
> >  rest.
> >  I looks like it was a good idea to bring the locking into dax.c instead
> >  of trying to make it generic.
> >
> 
> Actually ... I'm still bothered by the exclusive waiting.  If an entry
> is locked and there are two threads in dax_pfn_mkwrite() then one would
> be woken up when the entry is unlocked and it will just set the TAG_DIRTY
> flag and then continue without ever waking the next waiter on the
> wait queue.
> 
> I *think* that any thread which gets an exclusive wakeup is responsible
> for performing another wakeup.  In this case it must either lock the
> slot, or call __wakeup.
> That means:
>   grab_mapping_entry needs to call wakeup:
>      if radix_tree_preload() fails
>      if radix_tree_insert fails other than with -EEXIST
>      if a valid page was found

Why would we need to call wake up when a valid page was found? In that case
there should not be any process waiting for the radix tree entry lock.
Otherwise I agree with you. Thanks for pointing this out, you've likely
saved me quite some debugging ;).


>   dax_delete_mapping_entry needs to call wakeup
>      if the fail case, though as that isn't expect (WARN_ON_ONCE)
>         it should be a problem not to wakeup here
>   dax_pfn_mkwrite needs to call wakeup unconditionally

								Honza
Jan Kara March 18, 2016, 3:39 p.m. UTC | #4
On Fri 18-03-16 15:16:18, Jan Kara wrote:
> On Wed 16-03-16 08:34:28, NeilBrown wrote:
> > On Fri, Mar 11 2016, NeilBrown wrote:
> > 
> > > On Fri, Mar 11 2016, Jan Kara wrote:
> > >
> > >> Currently DAX page fault locking is racy.
> > >>
> > >> CPU0 (write fault)		CPU1 (read fault)
> > >>
> > >> __dax_fault()			__dax_fault()
> > >>   get_block(inode, block, &bh, 0) -> not mapped
> > >> 				  get_block(inode, block, &bh, 0)
> > >> 				    -> not mapped
> > >>   if (!buffer_mapped(&bh))
> > >>     if (vmf->flags & FAULT_FLAG_WRITE)
> > >>       get_block(inode, block, &bh, 1) -> allocates blocks
> > >>   if (page) -> no
> > >> 				  if (!buffer_mapped(&bh))
> > >> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> > >> 				    } else {
> > >> 				      dax_load_hole();
> > >> 				    }
> > >>   dax_insert_mapping()
> > >>
> > >> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> > >>
> > >> Another problem with the current DAX page fault locking is that there is
> > >> no race-free way to clear dirty tag in the radix tree. We can always
> > >> end up with clean radix tree and dirty data in CPU cache.
> > >>
> > >> We fix the first problem by introducing locking of exceptional radix
> > >> tree entries in DAX mappings acting very similarly to page lock and thus
> > >> synchronizing properly faults against the same mapping index. The same
> > >> lock can later be used to avoid races when clearing radix tree dirty
> > >> tag.
> > >
> > > Hi,
> > >  I think the exception locking bits look good - I cannot comment on the
> > >  rest.
> > >  I looks like it was a good idea to bring the locking into dax.c instead
> > >  of trying to make it generic.
> > >
> > 
> > Actually ... I'm still bothered by the exclusive waiting.  If an entry
> > is locked and there are two threads in dax_pfn_mkwrite() then one would
> > be woken up when the entry is unlocked and it will just set the TAG_DIRTY
> > flag and then continue without ever waking the next waiter on the
> > wait queue.
> > 
> > I *think* that any thread which gets an exclusive wakeup is responsible
> > for performing another wakeup.  In this case it must either lock the
> > slot, or call __wakeup.
> > That means:
> >   grab_mapping_entry needs to call wakeup:
> >      if radix_tree_preload() fails
> >      if radix_tree_insert fails other than with -EEXIST
> >      if a valid page was found
> 
> Why would we need to call wake up when a valid page was found? In that case
> there should not be any process waiting for the radix tree entry lock.
> Otherwise I agree with you. Thanks for pointing this out, you've likely
> saved me quite some debugging ;).
> 
> 
> >   dax_delete_mapping_entry needs to call wakeup
> >      if the fail case, though as that isn't expect (WARN_ON_ONCE)
> >         it should be a problem not to wakeup here
> >   dax_pfn_mkwrite needs to call wakeup unconditionally

Actually, after some thought I don't think the wakeup is needed except for
dax_pfn_mkwrite(). In the other cases we know there is no radix tree
exceptional entry and thus there can be no waiters for its lock...

								Honza
NeilBrown March 22, 2016, 9:10 p.m. UTC | #5
On Sat, Mar 19 2016, Jan Kara wrote:
>
> Actually, after some thought I don't think the wakeup is needed except for
> dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> exceptional entry and thus there can be no waiters for its lock...
>

I think that is fragile logic - though it may be correct at present.

A radix tree slot can transition from "Locked exception" to "unlocked
exception" to "deleted" to "struct page".

So it is absolutely certain that a thread cannot go to sleep after
finding a "locked exception" and wake up to find a "struct page" ??

How about a much simpler change.
 - new local variable "slept" in lookup_unlocked_mapping_entry() which
   is set if prepare_to_wait_exclusive() gets called.
 - if after __radix_tree_lookup() returns:
        (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
   then it calls wakeup immediately - because if it was waiting,
   something else might be to.

That would cover all vaguely possible cases except dax_pfn_mkwrite()


Thanks,
NeilBrown
Jan Kara March 23, 2016, 11 a.m. UTC | #6
On Wed 23-03-16 08:10:42, NeilBrown wrote:
> On Sat, Mar 19 2016, Jan Kara wrote:
> >
> > Actually, after some thought I don't think the wakeup is needed except for
> > dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> > exceptional entry and thus there can be no waiters for its lock...
> >
> 
> I think that is fragile logic - though it may be correct at present.
> 
> A radix tree slot can transition from "Locked exception" to "unlocked
> exception" to "deleted" to "struct page".

Yes.
 
> So it is absolutely certain that a thread cannot go to sleep after
> finding a "locked exception" and wake up to find a "struct page" ??

With current implementation this should not happen but I agree entry
locking code should not rely on this.

> How about a much simpler change.
>  - new local variable "slept" in lookup_unlocked_mapping_entry() which
>    is set if prepare_to_wait_exclusive() gets called.
>  - if after __radix_tree_lookup() returns:
>         (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
>    then it calls wakeup immediately - because if it was waiting,
>    something else might be to.
> 
> That would cover all vaguely possible cases except dax_pfn_mkwrite()

But how does this really help? If lookup_unlocked_mapping_entry() finds
there is no entry (and it was there before), the process deleting the entry
(or replacing it with something else) is responsible for waking up
everybody. So your change would only duplicate what
dax_delete_mapping_entry() does. The potential for breakage is that callers
of lookup_unlocked_mapping_entry() are responsible for waking up other
waiters *even if* they do not lock or delete the entry in the end. Maybe
I'll rename lookup_unlocked_mapping_entry() to get_unlocked_mapping_entry()
so that it is clearer that one must call either put_unlocked_mapping_entry()
or put_locked_mapping_entry() on it.

								Honza
NeilBrown March 31, 2016, 4:20 a.m. UTC | #7
On Wed, Mar 23 2016, Jan Kara wrote:

> On Wed 23-03-16 08:10:42, NeilBrown wrote:
>> On Sat, Mar 19 2016, Jan Kara wrote:
>> >
>> > Actually, after some thought I don't think the wakeup is needed except for
>> > dax_pfn_mkwrite(). In the other cases we know there is no radix tree
>> > exceptional entry and thus there can be no waiters for its lock...
>> >
>> 
>> I think that is fragile logic - though it may be correct at present.
>> 
>> A radix tree slot can transition from "Locked exception" to "unlocked
>> exception" to "deleted" to "struct page".
>
> Yes.
>  
>> So it is absolutely certain that a thread cannot go to sleep after
>> finding a "locked exception" and wake up to find a "struct page" ??
>
> With current implementation this should not happen but I agree entry
> locking code should not rely on this.
>
>> How about a much simpler change.
>>  - new local variable "slept" in lookup_unlocked_mapping_entry() which
>>    is set if prepare_to_wait_exclusive() gets called.
>>  - if after __radix_tree_lookup() returns:
>>         (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
>>    then it calls wakeup immediately - because if it was waiting,
>>    something else might be to.
>> 
>> That would cover all vaguely possible cases except dax_pfn_mkwrite()
>
> But how does this really help? If lookup_unlocked_mapping_entry() finds
> there is no entry (and it was there before), the process deleting the entry
> (or replacing it with something else) is responsible for waking up
> everybody.

"everybody" - yes.  But it doesn't wake everybody does it?  It just
wakes one.

+		__wake_up(wq, TASK_NORMAL, 1, &key);
                                           ^one!

Or am I misunderstanding how exclusive waiting works?

Thanks,
NeilBrown

>             So your change would only duplicate what
> dax_delete_mapping_entry() does. The potential for breakage is that callers
> of lookup_unlocked_mapping_entry() are responsible for waking up other
> waiters *even if* they do not lock or delete the entry in the end. Maybe
> I'll rename lookup_unlocked_mapping_entry() to get_unlocked_mapping_entry()
> so that it is clearer that one must call either put_unlocked_mapping_entry()
> or put_locked_mapping_entry() on it.
>
> 								Honza
>
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara March 31, 2016, 8:54 a.m. UTC | #8
On Thu 31-03-16 15:20:00, NeilBrown wrote:
> On Wed, Mar 23 2016, Jan Kara wrote:
> 
> > On Wed 23-03-16 08:10:42, NeilBrown wrote:
> >> On Sat, Mar 19 2016, Jan Kara wrote:
> >> >
> >> > Actually, after some thought I don't think the wakeup is needed except for
> >> > dax_pfn_mkwrite(). In the other cases we know there is no radix tree
> >> > exceptional entry and thus there can be no waiters for its lock...
> >> >
> >> 
> >> I think that is fragile logic - though it may be correct at present.
> >> 
> >> A radix tree slot can transition from "Locked exception" to "unlocked
> >> exception" to "deleted" to "struct page".
> >
> > Yes.
> >  
> >> So it is absolutely certain that a thread cannot go to sleep after
> >> finding a "locked exception" and wake up to find a "struct page" ??
> >
> > With current implementation this should not happen but I agree entry
> > locking code should not rely on this.
> >
> >> How about a much simpler change.
> >>  - new local variable "slept" in lookup_unlocked_mapping_entry() which
> >>    is set if prepare_to_wait_exclusive() gets called.
> >>  - if after __radix_tree_lookup() returns:
> >>         (ret==NULL || !radix_tree_exceptional_entry(ret)) && slept
> >>    then it calls wakeup immediately - because if it was waiting,
> >>    something else might be to.
> >> 
> >> That would cover all vaguely possible cases except dax_pfn_mkwrite()
> >
> > But how does this really help? If lookup_unlocked_mapping_entry() finds
> > there is no entry (and it was there before), the process deleting the entry
> > (or replacing it with something else) is responsible for waking up
> > everybody.
> 
> "everybody" - yes.  But it doesn't wake everybody does it?  It just
> wakes one.
> 
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
>                                            ^one!
> 
> Or am I misunderstanding how exclusive waiting works?

Ah, OK. I have already fixed that in my local version of the patches so
that we wake-up everybody after deleting the entry but forgot to tell you.
So I have there now:

		__wake_up(wq, TASK_NORMAL, 0, &key);

Are you OK with the code now?

								Honza
NeilBrown April 1, 2016, 12:34 a.m. UTC | #9
On Thu, Mar 31 2016, Jan Kara wrote:

> On Thu 31-03-16 15:20:00, NeilBrown wrote:
>> On Wed, Mar 23 2016, Jan Kara wrote:
>> 

>> > But how does this really help? If lookup_unlocked_mapping_entry() finds
>> > there is no entry (and it was there before), the process deleting the entry
>> > (or replacing it with something else) is responsible for waking up
>> > everybody.
>> 
>> "everybody" - yes.  But it doesn't wake everybody does it?  It just
>> wakes one.
>> 
>> +		__wake_up(wq, TASK_NORMAL, 1, &key);
>>                                            ^one!
>> 
>> Or am I misunderstanding how exclusive waiting works?
>
> Ah, OK. I have already fixed that in my local version of the patches so
> that we wake-up everybody after deleting the entry but forgot to tell you.
> So I have there now:
>
> 		__wake_up(wq, TASK_NORMAL, 0, &key);

"0" meaning "MAX_INT+1" (or something).  I didn't realize you could do
that, but it makes perfect sense.

>
> Are you OK with the code now?

Certainly less unhappy.  Assume that I am OK but I'll try to have
another look when you post again and make sure any lingering doubts are
gone.

Thanks,
NeilBrown


>
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Patch
diff mbox

diff --git a/fs/dax.c b/fs/dax.c
index 7148fcdb2c92..7d2e34f90654 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -41,6 +41,30 @@ 
 #define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
 		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
 
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+static int __init init_dax_wait_table(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+	return 0;
+}
+fs_initcall(init_dax_wait_table);
+
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+					      pgoff_t index)
+{
+	unsigned long hash = hash_long((unsigned long)mapping ^ index,
+				       DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
 static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
 {
 	struct request_queue *q = bdev->bd_queue;
@@ -306,6 +330,214 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 EXPORT_SYMBOL_GPL(dax_do_io);
 
 /*
+ * DAX radix tree locking
+ */
+struct exceptional_entry_key {
+	struct radix_tree_root *root;
+	unsigned long index;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_t wait;
+	struct exceptional_entry_key key;
+};
+
+static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned mode,
+				       int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->root != ewait->key.root || key->index != ewait->key.index)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+static inline int slot_locked(void **v)
+{
+	unsigned long l = *(unsigned long *)v;
+	return l & DAX_ENTRY_LOCK;
+}
+
+static inline void *lock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l |= DAX_ENTRY_LOCK);
+}
+
+static inline void *unlock_slot(void **v)
+{
+	unsigned long *l = (unsigned long *)v;
+	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
+}
+
+/*
+ * Lookup entry in radix tree, wait for it to become unlocked if it is
+ * exceptional entry and return.
+ *
+ * The function must be called with mapping->tree_lock held.
+ */
+static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
+					   pgoff_t index, void ***slotp)
+{
+	void *ret, **slot;
+	struct wait_exceptional_entry_queue wait;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	init_wait(&wait.wait);
+	wait.wait.func = wake_exceptional_entry_func;
+	wait.key.root = &mapping->page_tree;
+	wait.key.index = index;
+
+	for (;;) {
+		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+					  &slot);
+		if (!ret || !radix_tree_exceptional_entry(ret) ||
+		    !slot_locked(slot)) {
+			if (slotp)
+				*slotp = slot;
+			return ret;
+		}
+		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		spin_unlock_irq(&mapping->tree_lock);
+		schedule();
+		finish_wait(wq, &wait.wait);
+		spin_lock_irq(&mapping->tree_lock);
+	}
+}
+
+/*
+ * Find radix tree entry at given index. If it points to a page, return with
+ * the page locked. If it points to the exceptional entry, return with the
+ * radix tree entry locked. If the radix tree doesn't contain given index,
+ * create empty exceptional entry for the index and return with it locked.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ */
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+
+restart:
+	spin_lock_irq(&mapping->tree_lock);
+	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
+	/* No entry for given index? Make sure radix tree is big enough. */
+	if (!ret) {
+		int err;
+
+		spin_unlock_irq(&mapping->tree_lock);
+		err = radix_tree_preload(mapping_gfp_mask(mapping));
+		if (err)
+			return ERR_PTR(err);
+		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
+		spin_lock_irq(&mapping->tree_lock);
+		err = radix_tree_insert(&mapping->page_tree, index, ret);
+		radix_tree_preload_end();
+		if (err) {
+			spin_unlock_irq(&mapping->tree_lock);
+			/* Someone already created the entry? */
+			if (err == -EEXIST)
+				goto restart;
+			return ERR_PTR(err);
+		}
+		/* Good, we have inserted empty locked entry into the tree. */
+		mapping->nrexceptional++;
+		spin_unlock_irq(&mapping->tree_lock);
+		return ret;
+	}
+	/* Normal page in radix tree? */
+	if (!radix_tree_exceptional_entry(ret)) {
+		struct page *page = ret;
+
+		page_cache_get(page);
+		spin_unlock_irq(&mapping->tree_lock);
+		lock_page(page);
+		/* Page got truncated? Retry... */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto restart;
+		}
+		return page;
+	}
+	ret = lock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	void *ret, **slot;
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+	spin_lock_irq(&mapping->tree_lock);
+	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	if (WARN_ON_ONCE(!slot_locked(slot))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	unlock_slot(slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+}
+
+static void put_mapping_entry(struct address_space *mapping, pgoff_t index,
+			      void *entry)
+{
+	if (!radix_tree_exceptional_entry(entry)) {
+		unlock_page(entry);
+		page_cache_release(entry);
+	} else {
+		unlock_mapping_entry(mapping, index);
+	}
+}
+
+/*
+ * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * entry to get unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+	void *entry;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	/*
+	 * Caller should make sure radix tree modifications don't race and
+	 * we have seen exceptional entry here before.
+	 */
+	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+	radix_tree_delete(&mapping->page_tree, index);
+	mapping->nrexceptional--;
+	spin_unlock_irq(&mapping->tree_lock);
+	if (waitqueue_active(wq)) {
+		struct exceptional_entry_key key;
+
+		key.root = &mapping->page_tree;
+		key.index = index;
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+	}
+	return 1;
+}
+
+/*
  * The user has performed a load from a hole in the file.  Allocating
  * a new page in the file would cause excessive storage usage for
  * workloads with sparse files.  We allocate a page cache page instead.
@@ -313,16 +545,24 @@  EXPORT_SYMBOL_GPL(dax_do_io);
  * otherwise it will simply fall out of the page cache under memory
  * pressure without ever having been dirtied.
  */
-static int dax_load_hole(struct address_space *mapping, struct page *page,
-							struct vm_fault *vmf)
+static int dax_load_hole(struct address_space *mapping, void *entry,
+			 struct vm_fault *vmf)
 {
-	struct inode *inode = mapping->host;
-	if (!page)
-		page = find_or_create_page(mapping, vmf->pgoff,
-						GFP_KERNEL | __GFP_ZERO);
-	if (!page)
-		return VM_FAULT_OOM;
+	struct page *page;
 
+	/* Hole page already exists? Return it...  */
+	if (!radix_tree_exceptional_entry(entry)) {
+		vmf->page = entry;
+		return VM_FAULT_LOCKED;
+	}
+
+	/* This will replace locked radix tree entry with a hole page */
+	page = find_or_create_page(mapping, vmf->pgoff,
+				   vmf->gfp_mask | __GFP_ZERO);
+	if (!page) {
+		put_mapping_entry(mapping, vmf->pgoff, entry);
+		return VM_FAULT_OOM;
+	}
 	vmf->page = page;
 	return VM_FAULT_LOCKED;
 }
@@ -346,77 +586,58 @@  static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
-#define NO_SECTOR -1
 #define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_CACHE_SHIFT))
 
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
-		sector_t sector, bool pmd_entry, bool dirty)
+static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
+			       void *entry, sector_t sector, bool dirty,
+			       gfp_t gfp_mask)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	pgoff_t pmd_index = DAX_PMD_INDEX(index);
-	int type, error = 0;
-	void *entry;
+	int error = 0;
+	bool hole_fill = false;
+	struct mem_cgroup *memcg;
+	void *ret;
 
-	WARN_ON_ONCE(pmd_entry && !dirty);
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	spin_lock_irq(&mapping->tree_lock);
-
-	entry = radix_tree_lookup(page_tree, pmd_index);
-	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
-		index = pmd_index;
-		goto dirty;
+	/* Replacing hole page with block mapping? */
+	if (!radix_tree_exceptional_entry(entry)) {
+		hole_fill = true;
+		error = radix_tree_preload(gfp_mask);
+		if (error)
+			return ERR_PTR(error);
+		memcg = mem_cgroup_begin_page_stat(entry);
 	}
 
-	entry = radix_tree_lookup(page_tree, index);
-	if (entry) {
-		type = RADIX_DAX_TYPE(entry);
-		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
-					type != RADIX_DAX_PMD)) {
-			error = -EIO;
+	spin_lock_irq(&mapping->tree_lock);
+	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
+		       DAX_ENTRY_LOCK);
+	if (hole_fill) {
+		__delete_from_page_cache(entry, NULL, memcg);
+		error = radix_tree_insert(page_tree, index, ret);
+		if (error) {
+			ret = ERR_PTR(error);
 			goto unlock;
 		}
+		mapping->nrexceptional++;
+	} else {
+		void **slot;
+		void *ret2;
 
-		if (!pmd_entry || type == RADIX_DAX_PMD)
-			goto dirty;
-
-		/*
-		 * We only insert dirty PMD entries into the radix tree.  This
-		 * means we don't need to worry about removing a dirty PTE
-		 * entry and inserting a clean PMD entry, thus reducing the
-		 * range we would flush with a follow-up fsync/msync call.
-		 */
-		radix_tree_delete(&mapping->page_tree, index);
-		mapping->nrexceptional--;
+		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
+		WARN_ON_ONCE(ret2 != entry);
+		radix_tree_replace_slot(slot, ret);
 	}
-
-	if (sector == NO_SECTOR) {
-		/*
-		 * This can happen during correct operation if our pfn_mkwrite
-		 * fault raced against a hole punch operation.  If this
-		 * happens the pte that was hole punched will have been
-		 * unmapped and the radix tree entry will have been removed by
-		 * the time we are called, but the call will still happen.  We
-		 * will return all the way up to wp_pfn_shared(), where the
-		 * pte_same() check will fail, eventually causing page fault
-		 * to be retried by the CPU.
-		 */
-		goto unlock;
-	}
-
-	error = radix_tree_insert(page_tree, index,
-			RADIX_DAX_ENTRY(sector, pmd_entry));
-	if (error)
-		goto unlock;
-
-	mapping->nrexceptional++;
- dirty:
 	if (dirty)
 		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
  unlock:
 	spin_unlock_irq(&mapping->tree_lock);
-	return error;
+	if (hole_fill) {
+		mem_cgroup_end_page_stat(memcg);
+		radix_tree_preload_end();
+	}
+	return ret;
 }
 
 static int dax_writeback_one(struct block_device *bdev,
@@ -542,17 +763,18 @@  int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+static int dax_insert_mapping(struct address_space *mapping,
+			struct buffer_head *bh, void *entry,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
 	struct blk_dax_ctl dax = {
-		.sector = to_sector(bh, inode),
+		.sector = to_sector(bh, mapping->host),
 		.size = bh->b_size,
 	};
 	int error;
+	void *ret;
 
 	if (dax_map_atomic(bdev, &dax) < 0) {
 		error = PTR_ERR(dax.addr);
@@ -560,14 +782,25 @@  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
-	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
-			vmf->flags & FAULT_FLAG_WRITE);
-	if (error)
+	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
+			        vmf->flags & FAULT_FLAG_WRITE,
+			        vmf->gfp_mask & ~__GFP_HIGHMEM);
+	if (IS_ERR(ret)) {
+		error = PTR_ERR(ret);
 		goto out;
+	}
+	/* Have we replaced hole page? Unmap and free it. */
+	if (!radix_tree_exceptional_entry(entry)) {
+		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+		unlock_page(entry);
+		page_cache_release(entry);
+	}
+	entry = ret;
 
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
-
  out:
+	put_mapping_entry(mapping, vmf->pgoff, entry);
 	return error;
 }
 
@@ -587,7 +820,7 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	struct inode *inode = mapping->host;
-	struct page *page;
+	void *entry;
 	struct buffer_head bh;
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	unsigned blkbits = inode->i_blkbits;
@@ -596,6 +829,11 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int error;
 	int major = 0;
 
+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is supposed
+	 * to hold locks serializing us with truncate / punch hole so this is
+	 * a reliable test.
+	 */
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
@@ -605,40 +843,17 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	bh.b_bdev = inode->i_sb->s_bdev;
 	bh.b_size = PAGE_SIZE;
 
- repeat:
-	page = find_get_page(mapping, vmf->pgoff);
-	if (page) {
-		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
-			page_cache_release(page);
-			return VM_FAULT_RETRY;
-		}
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto repeat;
-		}
+	entry = grab_mapping_entry(mapping, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		error = PTR_ERR(entry);
+		goto out;
 	}
 
 	error = get_block(inode, block, &bh, 0);
 	if (!error && (bh.b_size < PAGE_SIZE))
 		error = -EIO;		/* fs corruption? */
 	if (error)
-		goto unlock_page;
-
-	if (!buffer_mapped(&bh) && !vmf->cow_page) {
-		if (vmf->flags & FAULT_FLAG_WRITE) {
-			error = get_block(inode, block, &bh, 1);
-			count_vm_event(PGMAJFAULT);
-			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
-			if (!error && (bh.b_size < PAGE_SIZE))
-				error = -EIO;
-			if (error)
-				goto unlock_page;
-		} else {
-			return dax_load_hole(mapping, page, vmf);
-		}
-	}
+		goto unlock_entry;
 
 	if (vmf->cow_page) {
 		struct page *new_page = vmf->cow_page;
@@ -647,30 +862,33 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		else
 			clear_user_highpage(new_page, vaddr);
 		if (error)
-			goto unlock_page;
-		vmf->page = page;
-		if (page)
+			goto unlock_entry;
+		if (!radix_tree_exceptional_entry(entry)) {
+			vmf->page = entry;
 			return VM_FAULT_LOCKED;
+		}
+		unlock_mapping_entry(mapping, vmf->pgoff);
 		return 0;
 	}
 
-	/* Check we didn't race with a read fault installing a new page */
-	if (!page && major)
-		page = find_lock_page(mapping, vmf->pgoff);
-
-	if (page) {
-		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
-							PAGE_CACHE_SIZE, 0);
-		delete_from_page_cache(page);
-		unlock_page(page);
-		page_cache_release(page);
-		page = NULL;
+	if (!buffer_mapped(&bh)) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+				error = -EIO;
+			if (error)
+				goto unlock_entry;
+		} else {
+			return dax_load_hole(mapping, entry, vmf);
+		}
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
 	WARN_ON_ONCE(buffer_unwritten(&bh));
-	error = dax_insert_mapping(inode, &bh, vma, vmf);
-
+	error = dax_insert_mapping(mapping, &bh, entry, vma, vmf);
  out:
 	if (error == -ENOMEM)
 		return VM_FAULT_OOM | major;
@@ -679,11 +897,8 @@  int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		return VM_FAULT_SIGBUS | major;
 	return VM_FAULT_NOPAGE | major;
 
- unlock_page:
-	if (page) {
-		unlock_page(page);
-		page_cache_release(page);
-	}
+ unlock_entry:
+	put_mapping_entry(mapping, vmf->pgoff, entry);
 	goto out;
 }
 EXPORT_SYMBOL(__dax_fault);
@@ -968,16 +1183,17 @@  EXPORT_SYMBOL_GPL(dax_pmd_fault);
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	void *entry;
+	pgoff_t index = vmf->pgoff;
 
-	/*
-	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
-	 * RADIX_DAX_PTE entry already exists in the radix tree from a
-	 * previous call to __dax_fault().  We just want to look up that PTE
-	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
-	 * saves us from having to make a call to get_block() here to look
-	 * up the sector.
-	 */
-	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
+	spin_lock_irq(&mapping->tree_lock);
+	entry = lookup_unlocked_mapping_entry(mapping, index, NULL);
+	if (!entry || !radix_tree_exceptional_entry(entry))
+		goto out;
+	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+out:
+	spin_unlock_irq(&mapping->tree_lock);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fd28d824254b..da2416d916e6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,6 +18,7 @@  int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
 int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 
 #ifdef CONFIG_FS_DAX
 struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/truncate.c b/mm/truncate.c
index e3ee0e27cd17..2dd33df71e42 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -34,40 +34,38 @@  static void clear_exceptional_entry(struct address_space *mapping,
 	if (shmem_mapping(mapping))
 		return;
 
-	spin_lock_irq(&mapping->tree_lock);
-
 	if (dax_mapping(mapping)) {
-		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
-			mapping->nrexceptional--;
-	} else {
-		/*
-		 * Regular page slots are stabilized by the page lock even
-		 * without the tree itself locked.  These unlocked entries
-		 * need verification under the tree lock.
-		 */
-		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
-					&slot))
-			goto unlock;
-		if (*slot != entry)
-			goto unlock;
-		radix_tree_replace_slot(slot, NULL);
-		mapping->nrexceptional--;
-		if (!node)
-			goto unlock;
-		workingset_node_shadows_dec(node);
-		/*
-		 * Don't track node without shadow entries.
-		 *
-		 * Avoid acquiring the list_lru lock if already untracked.
-		 * The list_empty() test is safe as node->private_list is
-		 * protected by mapping->tree_lock.
-		 */
-		if (!workingset_node_shadows(node) &&
-		    !list_empty(&node->private_list))
-			list_lru_del(&workingset_shadow_nodes,
-					&node->private_list);
-		__radix_tree_delete_node(&mapping->page_tree, node);
+		dax_delete_mapping_entry(mapping, index);
+		return;
 	}
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+				&slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+	radix_tree_replace_slot(slot, NULL);
+	mapping->nrexceptional--;
+	if (!node)
+		goto unlock;
+	workingset_node_shadows_dec(node);
+	/*
+	 * Don't track node without shadow entries.
+	 *
+	 * Avoid acquiring the list_lru lock if already untracked.
+	 * The list_empty() test is safe as node->private_list is
+	 * protected by mapping->tree_lock.
+	 */
+	if (!workingset_node_shadows(node) &&
+	    !list_empty(&node->private_list))
+		list_lru_del(&workingset_shadow_nodes,
+				&node->private_list);
+	__radix_tree_delete_node(&mapping->page_tree, node);
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }