
[v8,4/9] dax: support dirty DAX entries in radix tree

Message ID 1452230879-18117-5-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Ross Zwisler Jan. 8, 2016, 5:27 a.m. UTC
Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback sectors for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely on
the fact that only one type of exceptional entry can be found in a given
radix tree based on its usage.  This happens for free with DAX vs shmem but
we explicitly prevent shadow entries from being added to radix trees for
DAX mappings.

The only shadow entries that would be generated for DAX radix trees would
be to track zero page mappings that were created for holes.  These pages
would receive minimal benefit from having shadow entries, and the choice
to have only one type of exceptional entry in a given radix tree makes the
logic simpler both in clear_exceptional_entry() and in the rest of DAX.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c             |  2 +-
 fs/inode.c                 |  2 +-
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  3 +-
 include/linux/radix-tree.h |  9 ++++++
 mm/filemap.c               | 17 ++++++++----
 mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
 mm/vmscan.c                |  9 +++++-
 mm/workingset.c            |  4 +--
 9 files changed, 78 insertions(+), 42 deletions(-)
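
For reference, the new radix tree encoding introduced by this patch packs both
the entry type (PTE vs PMD) and the writeback sector into a single exceptional
slot.  The snippet below is an illustration only: the RADIX_DAX_* macros are
copied verbatim from the include/linux/radix-tree.h hunk further down, while
the main() harness is not part of the patch and simply shows a round trip
through the encoding that fsync/msync will rely on.

#include <stdio.h>

#define RADIX_TREE_EXCEPTIONAL_ENTRY	2

#define RADIX_DAX_MASK	0xf
#define RADIX_DAX_SHIFT	4
#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))

int main(void)
{
	/* A dirty PMD fault at sector 0x10000 would store this entry... */
	void *entry = RADIX_DAX_ENTRY(0x10000, 1);

	/* ...and fsync/msync later recovers the type and the sector. */
	printf("type   %#lx (PMD=%#x, PTE=%#x)\n",
	       RADIX_DAX_TYPE(entry), RADIX_DAX_PMD, RADIX_DAX_PTE);
	printf("sector %#lx\n", RADIX_DAX_SECTOR(entry));
	return 0;
}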

Comments

Jan Kara Jan. 13, 2016, 9:44 a.m. UTC | #1
On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> Add support for tracking dirty DAX entries in the struct address_space
> radix tree.  This tree is already used for dirty page writeback, and it
> already supports the use of exceptional (non struct page*) entries.
> 
> In order to properly track dirty DAX pages we will insert new exceptional
> entries into the radix tree that represent dirty DAX PTE or PMD pages.
> These exceptional entries will also contain the writeback sectors for the
> PTE or PMD faults that we can use at fsync/msync time.
> 
> There are currently two types of exceptional entries (shmem and shadow)
> that can be placed into the radix tree, and this adds a third.  We rely on
> the fact that only one type of exceptional entry can be found in a given
> radix tree based on its usage.  This happens for free with DAX vs shmem but
> we explicitly prevent shadow entries from being added to radix trees for
> DAX mappings.
> 
> The only shadow entries that would be generated for DAX radix trees would
> be to track zero page mappings that were created for holes.  These pages
> would receive minimal benefit from having shadow entries, and the choice
> to have only one type of exceptional entry in a given radix tree makes the
> logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

I have realized there's one issue with this code. See below:

> @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  		return;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	/*
> -	 * Regular page slots are stabilized by the page lock even
> -	 * without the tree itself locked.  These unlocked entries
> -	 * need verification under the tree lock.
> -	 */
> -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> -		goto unlock;
> -	if (*slot != entry)
> -		goto unlock;
> -	radix_tree_replace_slot(slot, NULL);
> -	mapping->nrshadows--;
> -	if (!node)
> -		goto unlock;
> -	workingset_node_shadows_dec(node);
> -	/*
> -	 * Don't track node without shadow entries.
> -	 *
> -	 * Avoid acquiring the list_lru lock if already untracked.
> -	 * The list_empty() test is safe as node->private_list is
> -	 * protected by mapping->tree_lock.
> -	 */
> -	if (!workingset_node_shadows(node) &&
> -	    !list_empty(&node->private_list))
> -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> -	__radix_tree_delete_node(&mapping->page_tree, node);
> +
> +	if (dax_mapping(mapping)) {
> +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> +			mapping->nrexceptional--;

So when you punch a hole in a file, you can delete a PMD entry from the
radix tree even though it covers a part of the file that still remains. So
in this case you have to split the PMD entry into PTE entries (that
probably needs to happen up in truncate_inode_pages_range()) or do
something similar...

								Honza
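
To make the case Jan describes concrete, here is a standalone sketch (not
part of the patch).  The 2MiB PMD size is the x86 value, and the example
offsets are assumptions chosen to resemble the addresses that show up in the
crash report later in this thread: a hole punch that only covers part of a
PMD entry's range leaves the rest of the file in place, so its dirty state
must not be lost when the entry is cleared.

#include <stdio.h>

#define PMD_SIZE	(2UL << 20)		/* 2 MiB, x86 value assumed */
#define PMD_MASK	(~(PMD_SIZE - 1))

int main(void)
{
	unsigned long pmd_start   = 0x10400000 & PMD_MASK;	/* entry start */
	unsigned long pmd_end     = pmd_start + PMD_SIZE;	/* entry end   */
	unsigned long punch_start = 0x10403000;			/* hole punch  */
	unsigned long punch_end   = 0x10405000;

	if (punch_start > pmd_start || punch_end < pmd_end)
		printf("punch covers only [%#lx, %#lx) of the PMD entry "
		       "[%#lx, %#lx); the rest of the range survives and "
		       "its dirty state still needs to be tracked\n",
		       punch_start, punch_end, pmd_start, pmd_end);
	return 0;
}
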
Ross Zwisler Jan. 13, 2016, 6:48 p.m. UTC | #2
On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > Add support for tracking dirty DAX entries in the struct address_space
> > radix tree.  This tree is already used for dirty page writeback, and it
> > already supports the use of exceptional (non struct page*) entries.
> > 
> > In order to properly track dirty DAX pages we will insert new exceptional
> > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > These exceptional entries will also contain the writeback sectors for the
> > PTE or PMD faults that we can use at fsync/msync time.
> > 
> > There are currently two types of exceptional entries (shmem and shadow)
> > that can be placed into the radix tree, and this adds a third.  We rely on
> > the fact that only one type of exceptional entry can be found in a given
> > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > we explicitly prevent shadow entries from being added to radix trees for
> > DAX mappings.
> > 
> > The only shadow entries that would be generated for DAX radix trees would
> > be to track zero page mappings that were created for holes.  These pages
> > would receive minimal benefit from having shadow entries, and the choice
> > to have only one type of exceptional entry in a given radix tree makes the
> > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> I have realized there's one issue with this code. See below:
> 
> > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> >  		return;
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	/*
> > -	 * Regular page slots are stabilized by the page lock even
> > -	 * without the tree itself locked.  These unlocked entries
> > -	 * need verification under the tree lock.
> > -	 */
> > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > -		goto unlock;
> > -	if (*slot != entry)
> > -		goto unlock;
> > -	radix_tree_replace_slot(slot, NULL);
> > -	mapping->nrshadows--;
> > -	if (!node)
> > -		goto unlock;
> > -	workingset_node_shadows_dec(node);
> > -	/*
> > -	 * Don't track node without shadow entries.
> > -	 *
> > -	 * Avoid acquiring the list_lru lock if already untracked.
> > -	 * The list_empty() test is safe as node->private_list is
> > -	 * protected by mapping->tree_lock.
> > -	 */
> > -	if (!workingset_node_shadows(node) &&
> > -	    !list_empty(&node->private_list))
> > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > +
> > +	if (dax_mapping(mapping)) {
> > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > +			mapping->nrexceptional--;
> 
> So when you punch hole in a file, you can delete a PMD entry from a radix
> tree which covers part of the file which still stays. So in this case you
> have to split the PMD entry into PTE entries (probably that needs to happen
> up in truncate_inode_pages_range()) or something similar...

I think (and will verify) that the DAX code just unmaps the entire PMD range
when we receive a hole punch request inside of the PMD.  If this is true then
I think the radix tree code should behave the same way and just remove the PMD
entry in the radix tree.

This will cause new accesses that used to land in the PMD range to get new
page faults.  These faults will call get_blocks(), where presumably the
filesystem will tell us that we don't have a contiguous 2MiB range anymore, so
we will fall back to PTE faults.  These PTEs will fill in both the radix tree
and the page tables.

So, I think the work here is to verify the behavior of DAX wrt hole punches
for PMD ranges, and make the radix tree code match that behavior.  Sound good?
Jan Kara Jan. 15, 2016, 1:22 p.m. UTC | #3
On Wed 13-01-16 11:48:32, Ross Zwisler wrote:
> On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> > On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > > Add support for tracking dirty DAX entries in the struct address_space
> > > radix tree.  This tree is already used for dirty page writeback, and it
> > > already supports the use of exceptional (non struct page*) entries.
> > > 
> > > In order to properly track dirty DAX pages we will insert new exceptional
> > > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > > These exceptional entries will also contain the writeback sectors for the
> > > PTE or PMD faults that we can use at fsync/msync time.
> > > 
> > > There are currently two types of exceptional entries (shmem and shadow)
> > > that can be placed into the radix tree, and this adds a third.  We rely on
> > > the fact that only one type of exceptional entry can be found in a given
> > > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > > we explicitly prevent shadow entries from being added to radix trees for
> > > DAX mappings.
> > > 
> > > The only shadow entries that would be generated for DAX radix trees would
> > > be to track zero page mappings that were created for holes.  These pages
> > > would receive minimal benefit from having shadow entries, and the choice
> > > to have only one type of exceptional entry in a given radix tree makes the
> > > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > I have realized there's one issue with this code. See below:
> > 
> > > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > >  		return;
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	/*
> > > -	 * Regular page slots are stabilized by the page lock even
> > > -	 * without the tree itself locked.  These unlocked entries
> > > -	 * need verification under the tree lock.
> > > -	 */
> > > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > > -		goto unlock;
> > > -	if (*slot != entry)
> > > -		goto unlock;
> > > -	radix_tree_replace_slot(slot, NULL);
> > > -	mapping->nrshadows--;
> > > -	if (!node)
> > > -		goto unlock;
> > > -	workingset_node_shadows_dec(node);
> > > -	/*
> > > -	 * Don't track node without shadow entries.
> > > -	 *
> > > -	 * Avoid acquiring the list_lru lock if already untracked.
> > > -	 * The list_empty() test is safe as node->private_list is
> > > -	 * protected by mapping->tree_lock.
> > > -	 */
> > > -	if (!workingset_node_shadows(node) &&
> > > -	    !list_empty(&node->private_list))
> > > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > > +
> > > +	if (dax_mapping(mapping)) {
> > > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > > +			mapping->nrexceptional--;
> > 
> > So when you punch hole in a file, you can delete a PMD entry from a radix
> > tree which covers part of the file which still stays. So in this case you
> > have to split the PMD entry into PTE entries (probably that needs to happen
> > up in truncate_inode_pages_range()) or something similar...
> 
> I think (and will verify) that the DAX code just unmaps the entire PMD range
> when we receive a hole punch request inside of the PMD.  If this is true then
> I think the radix tree code should behave the same way and just remove the PMD
> entry in the radix tree.

But you cannot just remove it if it is dirty... You have to keep the
information that part of the PMD range is still dirty somewhere (or write
that range out before removing the radix tree entry).

> This will cause new accesses that used to land in the PMD range to get new
> page faults.  These faults will call get_blocks(), where presumably the
> filesystem will tell us that we don't have a contiguous 2MiB range anymore, so
> we will fall back to PTE faults.  These PTEs will fill in both the radix tree
> and the page tables.
> 
> So, I think the work here is to verify the behavior of DAX wrt hole punches
> for PMD ranges, and make the radix tree code match that behavior.  Sound good?

								Honza
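
Jan's point here is essentially an ordering requirement: a dirty entry has
to be written back (or its dirtiness preserved elsewhere) before it can be
dropped.  The toy below is not kernel code; flush_range() is a purely
hypothetical stand-in for the real writeback step, and the struct only
models a single radix tree slot.

#include <stdbool.h>
#include <stdio.h>

struct dax_entry {		/* stand-in for one radix tree slot */
	unsigned long sector;
	bool dirty;
	bool present;
};

static void flush_range(unsigned long sector, unsigned long nr_sectors)
{
	printf("writing back %lu sectors starting at %#lx\n",
	       nr_sectors, sector);
}

static void punch_hole(struct dax_entry *e, unsigned long nr_sectors)
{
	if (e->dirty)			/* flush before dropping the entry */
		flush_range(e->sector, nr_sectors);
	e->present = false;		/* only now is it safe to delete */
}

int main(void)
{
	struct dax_entry pmd = { .sector = 0x10000, .dirty = true,
				 .present = true };

	punch_hole(&pmd, 4096);		/* 2 MiB in 512-byte sectors */
	return 0;
}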
Ross Zwisler Jan. 15, 2016, 7:03 p.m. UTC | #4
On Fri, Jan 15, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> On Wed 13-01-16 11:48:32, Ross Zwisler wrote:
> > On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> > > On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > > > Add support for tracking dirty DAX entries in the struct address_space
> > > > radix tree.  This tree is already used for dirty page writeback, and it
> > > > already supports the use of exceptional (non struct page*) entries.
> > > > 
> > > > In order to properly track dirty DAX pages we will insert new exceptional
> > > > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > > > These exceptional entries will also contain the writeback sectors for the
> > > > PTE or PMD faults that we can use at fsync/msync time.
> > > > 
> > > > There are currently two types of exceptional entries (shmem and shadow)
> > > > that can be placed into the radix tree, and this adds a third.  We rely on
> > > > the fact that only one type of exceptional entry can be found in a given
> > > > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > > > we explicitly prevent shadow entries from being added to radix trees for
> > > > DAX mappings.
> > > > 
> > > > The only shadow entries that would be generated for DAX radix trees would
> > > > be to track zero page mappings that were created for holes.  These pages
> > > > would receive minimal benefit from having shadow entries, and the choice
> > > > to have only one type of exceptional entry in a given radix tree makes the
> > > > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > 
> > > I have realized there's one issue with this code. See below:
> > > 
> > > > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > > >  		return;
> > > >  
> > > >  	spin_lock_irq(&mapping->tree_lock);
> > > > -	/*
> > > > -	 * Regular page slots are stabilized by the page lock even
> > > > -	 * without the tree itself locked.  These unlocked entries
> > > > -	 * need verification under the tree lock.
> > > > -	 */
> > > > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > > > -		goto unlock;
> > > > -	if (*slot != entry)
> > > > -		goto unlock;
> > > > -	radix_tree_replace_slot(slot, NULL);
> > > > -	mapping->nrshadows--;
> > > > -	if (!node)
> > > > -		goto unlock;
> > > > -	workingset_node_shadows_dec(node);
> > > > -	/*
> > > > -	 * Don't track node without shadow entries.
> > > > -	 *
> > > > -	 * Avoid acquiring the list_lru lock if already untracked.
> > > > -	 * The list_empty() test is safe as node->private_list is
> > > > -	 * protected by mapping->tree_lock.
> > > > -	 */
> > > > -	if (!workingset_node_shadows(node) &&
> > > > -	    !list_empty(&node->private_list))
> > > > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > > > +
> > > > +	if (dax_mapping(mapping)) {
> > > > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > > > +			mapping->nrexceptional--;
> > > 
> > > So when you punch hole in a file, you can delete a PMD entry from a radix
> > > tree which covers part of the file which still stays. So in this case you
> > > have to split the PMD entry into PTE entries (probably that needs to happen
> > > up in truncate_inode_pages_range()) or something similar...
> > 
> > I think (and will verify) that the DAX code just unmaps the entire PMD range
> > when we receive a hole punch request inside of the PMD.  If this is true then
> > I think the radix tree code should behave the same way and just remove the PMD
> > entry in the radix tree.
> 
> But you cannot just remove it if it is dirty... You have to keep somewhere
> information that part of the PMD range is still dirty (or write that range
> out before removing the radix tree entry).

Yep, agreed.
Ross Zwisler Feb. 3, 2016, 4:42 p.m. UTC | #5
On Fri, Jan 15, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> On Wed 13-01-16 11:48:32, Ross Zwisler wrote:
> > On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> > > On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > > > Add support for tracking dirty DAX entries in the struct address_space
> > > > radix tree.  This tree is already used for dirty page writeback, and it
> > > > already supports the use of exceptional (non struct page*) entries.
> > > > 
> > > > In order to properly track dirty DAX pages we will insert new exceptional
> > > > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > > > These exceptional entries will also contain the writeback sectors for the
> > > > PTE or PMD faults that we can use at fsync/msync time.
> > > > 
> > > > There are currently two types of exceptional entries (shmem and shadow)
> > > > that can be placed into the radix tree, and this adds a third.  We rely on
> > > > the fact that only one type of exceptional entry can be found in a given
> > > > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > > > we explicitly prevent shadow entries from being added to radix trees for
> > > > DAX mappings.
> > > > 
> > > > The only shadow entries that would be generated for DAX radix trees would
> > > > be to track zero page mappings that were created for holes.  These pages
> > > > would receive minimal benefit from having shadow entries, and the choice
> > > > to have only one type of exceptional entry in a given radix tree makes the
> > > > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > 
> > > I have realized there's one issue with this code. See below:
> > > 
> > > > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > > >  		return;
> > > >  
> > > >  	spin_lock_irq(&mapping->tree_lock);
> > > > -	/*
> > > > -	 * Regular page slots are stabilized by the page lock even
> > > > -	 * without the tree itself locked.  These unlocked entries
> > > > -	 * need verification under the tree lock.
> > > > -	 */
> > > > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > > > -		goto unlock;
> > > > -	if (*slot != entry)
> > > > -		goto unlock;
> > > > -	radix_tree_replace_slot(slot, NULL);
> > > > -	mapping->nrshadows--;
> > > > -	if (!node)
> > > > -		goto unlock;
> > > > -	workingset_node_shadows_dec(node);
> > > > -	/*
> > > > -	 * Don't track node without shadow entries.
> > > > -	 *
> > > > -	 * Avoid acquiring the list_lru lock if already untracked.
> > > > -	 * The list_empty() test is safe as node->private_list is
> > > > -	 * protected by mapping->tree_lock.
> > > > -	 */
> > > > -	if (!workingset_node_shadows(node) &&
> > > > -	    !list_empty(&node->private_list))
> > > > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > > > +
> > > > +	if (dax_mapping(mapping)) {
> > > > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > > > +			mapping->nrexceptional--;
> > > 
> > > So when you punch hole in a file, you can delete a PMD entry from a radix
> > > tree which covers part of the file which still stays. So in this case you
> > > have to split the PMD entry into PTE entries (probably that needs to happen
> > > up in truncate_inode_pages_range()) or something similar...
> > 
> > I think (and will verify) that the DAX code just unmaps the entire PMD range
> > when we receive a hole punch request inside of the PMD.  If this is true then
> > I think the radix tree code should behave the same way and just remove the PMD
> > entry in the radix tree.
> 
> But you cannot just remove it if it is dirty... You have to keep somewhere
> information that part of the PMD range is still dirty (or write that range
> out before removing the radix tree entry).

It turns out that hole punching a DAX PMD hits a BUG:

[  247.821632] ------------[ cut here ]------------
[  247.822744] kernel BUG at mm/memory.c:1195!
[  247.823742] invalid opcode: 0000 [#1] SMP 
[  247.824768] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[  247.826299] CPU: 1 PID: 1544 Comm: test Tainted: G        W       4.4.0-rc8+ #9
[  247.828017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  247.830298] task: ffff880036756200 ti: ffff8800a077c000 task.ti: ffff8800a077c000
[  247.831935] RIP: 0010:[<ffffffff81203157>]  [<ffffffff81203157>] unmap_page_range+0x907/0x910
[  247.833847] RSP: 0018:ffff8800a077fb88  EFLAGS: 00010282
[  247.835030] RAX: 0000000000000073 RBX: ffffc00000000fff RCX: 0000000000000000
[  247.836595] RDX: 0000000000000000 RSI: ffff88051a3ce168 RDI: ffff88051a3ce168
[  247.838168] RBP: ffff8800a077fc68 R08: 0000000000000001 R09: 0000000000000001
[  247.839728] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000010405000
[  247.841244] R13: 0000000010403000 R14: ffff8800a077fcb0 R15: 0000000010403000
[  247.842715] FS:  00007f533a5bb700(0000) GS:ffff88051a200000(0000) knlGS:0000000000000000
[  247.844395] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  247.845589] CR2: 0000000010403000 CR3: 0000000514337000 CR4: 00000000000006e0
[  247.847076] Stack:
[  247.847502]  ffff8800a077fcb0 ffff8800a077fc38 ffff880036756200 ffff8800a077fbb0
[  247.849126]  0000000010404fff 0000000010404fff 0000000010404fff ffff880514337000
[  247.850714]  ffff880513499000 ffff8800a077fcc0 0000000010405000 0000000000000000
[  247.852246] Call Trace:
[  247.852740]  [<ffffffff810fbb4d>] ? trace_hardirqs_on+0xd/0x10
[  247.853883]  [<ffffffff812031dd>] unmap_single_vma+0x7d/0xe0
[  247.855004]  [<ffffffff812032ed>] zap_page_range_single+0xad/0xf0
[  247.856195]  [<ffffffff81203410>] ? unmap_mapping_range+0xa0/0x190
[  247.857403]  [<ffffffff812034d6>] unmap_mapping_range+0x166/0x190
[  247.858596]  [<ffffffff811e0948>] truncate_pagecache_range+0x48/0x60
[  247.859839]  [<ffffffff8130b7ba>] ext4_punch_hole+0x33a/0x4b0
[  247.860837]  [<ffffffff8133a274>] ext4_fallocate+0x144/0x890
[  247.861784]  [<ffffffff810f6cd7>] ? update_fast_ctr+0x17/0x30
[  247.862751]  [<ffffffff810f6d69>] ? percpu_down_read+0x49/0x90
[  247.863731]  [<ffffffff812640c4>] ? __sb_start_write+0xb4/0xf0
[  247.864709]  [<ffffffff8125d900>] vfs_fallocate+0x140/0x220
[  247.865645]  [<ffffffff8125e7d4>] SyS_fallocate+0x44/0x70
[  247.866553]  [<ffffffff81a68ef2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  247.867632] Code: 12 fe ff ff 0f 0b 48 8b 45 98 48 8b 4d 90 4c 89 fa 48 c7 c6 78 30 c2 81 48 c7 c7 50 6b f0 81 4c 8b 48 08 4c 8b 00 e8 06 4a fc ff <0f> 0b e8 d2 2f ea ff 66 90 66 66 66 66 90 48 8b 06 4c 8b 4e 08 
[  247.871862] RIP  [<ffffffff81203157>] unmap_page_range+0x907/0x910
[  247.872843]  RSP <ffff8800a077fb88>
[  247.873435] ---[ end trace 75145e78670ba43d ]---

This happens with XFS as well.  I'm not sure that this path has ever been run,
so essentially the next step is "add PMD hole punch support to DAX, including
fsync/msync support".

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e123641..303b7cd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,7 @@  void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 4c8f719..6e3e5d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -495,7 +495,7 @@  void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrexceptional);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@  static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ddb7bad..fdab768 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,8 @@  struct address_space {
 	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
+	/* number of shadow or DAX exceptional entries */
+	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 57e7d87..7c88ad1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@ 
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 847ee43..7b8be78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@ 
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -123,9 +124,9 @@  static void page_cache_tree_delete(struct address_space *mapping,
 	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
 
 	if (shadow) {
-		mapping->nrshadows++;
+		mapping->nrexceptional++;
 		/*
-		 * Make sure the nrshadows update is committed before
+		 * Make sure the nrexceptional update is committed before
 		 * the nrpages update so that final truncate racing
 		 * with reclaim does not see both counters 0 at the
 		 * same time and miss a shadow entry.
@@ -579,9 +580,13 @@  static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		if (WARN_ON(dax_mapping(mapping)))
+			return -EINVAL;
+
 		if (shadowp)
 			*shadowp = p;
-		mapping->nrshadows--;
+		mapping->nrexceptional--;
 		if (node)
 			workingset_node_shadows_dec(node);
 	}
@@ -1245,9 +1250,9 @@  repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..e3ee0e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@ 
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,39 @@  static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+			mapping->nrexceptional--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+					&slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrexceptional--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+					&node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +237,7 @@  void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -402,7 +411,7 @@  EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-	unsigned long nrshadows;
+	unsigned long nrexceptional;
 	unsigned long nrpages;
 
 	/*
@@ -416,14 +425,14 @@  void truncate_inode_pages_final(struct address_space *mapping)
 
 	/*
 	 * When reclaim installs eviction entries, it increases
-	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * nrexceptional first, then decreases nrpages.  Make sure we see
 	 * this in the right order or we might miss an entry.
 	 */
 	nrpages = mapping->nrpages;
 	smp_rmb();
-	nrshadows = mapping->nrshadows;
+	nrexceptional = mapping->nrexceptional;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrexceptional) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44ec50f..30e0cd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,7 @@ 
 #include <linux/oom.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
+#include <linux/dax.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -671,9 +672,15 @@  static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 * inode reclaim needs to empty out the radix tree or
 		 * the nodes are lost.  Don't plant shadows behind its
 		 * back.
+		 *
+		 * We also don't store shadows for DAX mappings because the
+		 * only page cache pages found in these are zero pages
+		 * covering holes, and because we don't want to mix DAX
+		 * exceptional entries and shadow exceptional entries in the
+		 * same page_tree.
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
-		    !mapping_exiting(mapping))
+		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa01713..61ead9e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,8 +351,8 @@  static enum lru_status shadow_lru_isolate(struct list_head *item,
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-			BUG_ON(!mapping->nrshadows);
-			mapping->nrshadows--;
+			BUG_ON(!mapping->nrexceptional);
+			mapping->nrexceptional--;
 		}
 	}
 	BUG_ON(node->count);