
[v10] mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS

Message ID: 152658753673.26786.16458605771414761966.stgit@dwillia2-desk3.amr.corp.intel.com

Commit Message

Dan Williams May 17, 2018, 8:06 p.m. UTC
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
be able to rely on the fact that they will get wakeups on dev_pagemap
page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
generic_dax_pagefree() as the common indicator / infrastructure that dax
filesystems can depend on. With this change there are no users of the
MEMORY_DEVICE_HOST designation, so remove it.

The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap-managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1, this requires an additional branch
to check the page type at put_page() time. Given put_page() is a hot
path, we do not want to incur that check if HMM is not in use, so a
static branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.
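
With the static key enabled, all devmap page types funnel through one
helper on the release path; the resulting put_page() fast path
(condensed from the include/linux/mm.h change below) is:

        static inline void put_page(struct page *page)
        {
                page = compound_head(page);

                /* gated by the devmap_managed_key static key */
                if (put_devmap_managed_page(page))
                        return;

                if (put_page_testzero(page))
                        __put_page(page);
        }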

For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
However, we still need to support FS_DAX in the FS_DAX_LIMITED case
implemented by the s390/dcssblk driver.
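
In Kconfig terms this amounts to (condensed from the fs/Kconfig and
mm/Kconfig hunks below):

        config DEV_PAGEMAP_OPS
                bool

        config FS_DAX
                ...
                select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)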

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Reported-by: Dave Jiang <dave.jiang@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This patch replaces and consolidates patches 2 [1] and 4 [2] from the v9
series [3] for "dax: fix dma vs truncate/hole-punch".

The original implementation, which introduced fs_dax_claim(), was broken
in the presence of partitions: filesystems on partitions of the same
pmem device would collide when attempting to issue fs_dax_claim().

Instead, since this new page wakeup behavior is a property of
dev_pagemap pages and there is a 1:1 relationship between a pmem device
and its dev_pagemap instance, make the pmem driver own the page wakeup
initialization rather than the filesystem.
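
Concretely, the driver-side initialization is a small helper invoked
before devm_memremap_pages() (from the drivers/nvdimm/pmem.c change
below):

        static int setup_pagemap_fsdax(struct device *dev, struct dev_pagemap *pgmap)
        {
                /* enable the put_page() static key and arrange teardown */
                dev_pagemap_get_ops();
                if (devm_add_action_or_reset(dev, pmem_release_pgmap_ops, pgmap))
                        return -ENOMEM;
                pgmap->type = MEMORY_DEVICE_FS_DAX;
                pgmap->page_free = generic_dax_pagefree;

                return 0;
        }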

This simplifies the implementation considerably. The diffstat for the
series is now:

    21 files changed, 546 insertions(+), 277 deletions(-)

...down from:

    24 files changed, 730 insertions(+), 297 deletions(-)

The other patches in the series are not included since they did not
change in any meaningful way. Let me know if anyone wants a full resend;
it will otherwise be available in -next shortly. Given the change in
approach I did not carry the Reviewed-by tags from patches 2 and 4 over
to this patch.

[1]: [PATCH v9 2/9] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
https://lists.01.org/pipermail/linux-nvdimm/2018-April/015459.html

[2]: [PATCH v9 4/9] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
https://lists.01.org/pipermail/linux-nvdimm/2018-April/015461.html

[3]: [PATCH v9 0/9] dax: fix dma vs truncate/hole-punch
https://lists.01.org/pipermail/linux-nvdimm/2018-April/015457.html


 drivers/dax/super.c       |   18 ++++++++---
 drivers/nvdimm/pfn_devs.c |    2 -
 drivers/nvdimm/pmem.c     |   20 +++++++++++++
 fs/Kconfig                |    1 +
 include/linux/memremap.h  |   41 ++++++++++----------------
 include/linux/mm.h        |   71 ++++++++++++++++++++++++++++++++++-----------
 kernel/memremap.c         |   37 +++++++++++++++++++++--
 mm/Kconfig                |    5 +++
 mm/hmm.c                  |   13 +-------
 mm/swap.c                 |    3 +-
 10 files changed, 143 insertions(+), 68 deletions(-)

Comments

Jan Kara May 18, 2018, 9 a.m. UTC | #1
On Thu 17-05-18 13:06:54, Dan Williams wrote:
> In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
> be able to rely on the fact that they will get wakeups on dev_pagemap
> page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
> generic_dax_pagefree() as the common indicator / infrastructure that dax
> filesystems can depend on. With this change there are no users of the
> MEMORY_DEVICE_HOST designation, so remove it.
> 
> The HMM sub-system extended dev_pagemap to arrange a callback when a
> dev_pagemap-managed page is freed. Since a dev_pagemap page is free /
> idle when its reference count is 1, this requires an additional branch
> to check the page type at put_page() time. Given put_page() is a hot
> path, we do not want to incur that check if HMM is not in use, so a
> static branch is used to avoid that overhead when not necessary.
> 
> Now, the FS_DAX implementation wants to reuse this mechanism for
> receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
> static-key into a generic mechanism that either HMM or FS_DAX code paths
> can enable.
> 
> For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
> care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
> However, we still need to support FS_DAX in the FS_DAX_LIMITED case
> implemented by the s390/dcssblk driver.
> 
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Reported-by: kbuild test robot <lkp@intel.com>
> Reported-by: Thomas Meyer <thomas@m3y3r.de>
> Reported-by: Dave Jiang <dave.jiang@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Yeah, it looks simpler than the original patches and it looks OK to me. You can
add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
Christoph Hellwig May 18, 2018, 9:46 a.m. UTC | #2
This looks reasonable to me.  A few more comments below.

> This patch replaces and consolidates patches 2 [1] and 4 [2] from the v9
> series [3] for "dax: fix dma vs truncate/hole-punch".

Can you repost the whole series?  Otherwise things might get a little
too confusing.

>  		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
> +		return 0;
>  	} else if (pfn_t_devmap(pfn)) {
> +		struct dev_pagemap *pgmap;

This should probably become something like:

	bool supported = false;

	...


	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
		...
		supported = true;
	} else if (pfn_t_devmap(pfn)) {
		pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
		if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX)
			supported = true;
		put_dev_pagemap(pgmap);
	}

	if (!supported) {
		pr_debug("VFS (%s): error: dax support not enabled\n",
			sb->s_id);
		return -EOPNOTSUPP;
	}
	return 0;

> +	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)

Btw, what was the reason again we couldn't get rid of FS_DAX_LIMITED?

> +void generic_dax_pagefree(struct page *page, void *data)
> +{
> +	wake_up_var(&page->_refcount);
> +}
> +EXPORT_SYMBOL_GPL(generic_dax_pagefree);

Why is this here and exported instead of static in drivers/nvdimm/pmem.c?
Dan Williams May 18, 2018, 4 p.m. UTC | #3
On Fri, May 18, 2018 at 2:46 AM, Christoph Hellwig <hch@lst.de> wrote:
> This looks reasonable to me.  A few more comments below.
>
>> This patch replaces and consolidates patches 2 [1] and 4 [2] from the v9
>> series [3] for "dax: fix dma vs truncate/hole-punch".
>
> Can you repost the whole series?  Otherwise things might get a little
> too confusing.

Sure thing.

>>               WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
>> +             return 0;
>>       } else if (pfn_t_devmap(pfn)) {
>> +             struct dev_pagemap *pgmap;
>
> This should probably become something like:
>
>         bool supported = false;
>
>         ...
>
>
>         if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
>                 ...
>                 supported = true;
>         } else if (pfn_t_devmap(pfn)) {
>                 pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
>                 if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX)
>                         supported = true;
>                 put_dev_pagemap(pgmap);
>         }
>
>         if (!supported) {
>                 pr_debug("VFS (%s): error: dax support not enabled\n",
>                         sb->s_id);
>                 return -EOPNOTSUPP;
>         }
>         return 0;

Looks good, will do.

>> +     select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
>
> Btw, what was the reason again we couldn't get rid of FS_DAX_LIMITED?

The last I heard from Gerald they were still mildly interested in
keeping the dcssblk dax support going with this limited mode, and
threatened to add full page support at a later date:

---
From: Gerald

dcssblk seems to work fine, I did not see any SIGBUS while "executing
in place" from dcssblk with the current upstream kernel, maybe because
we only use dcssblk with fs dax in read-only mode.

Anyway, the dcssblk change is fine with me. I will look into adding
struct pages for dcssblk memory later, to make it work again with
this change, but for now I do not know of anyone needing this in the
upstream kernel.

https://www.spinics.net/lists/linux-xfs/msg14628.html
---

>> +void generic_dax_pagefree(struct page *page, void *data)
>> +{
>> +     wake_up_var(&page->_refcount);
>> +}
>> +EXPORT_SYMBOL_GPL(generic_dax_pagefree);
>
> Why is this here and exported instead of static in drivers/nvdimm/pmem.c?

I was thinking it did not belong to the pmem driver, but you're right:
unless / until we grow another fsdax-capable driver, this detail can
stay internal to the pmem driver.
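
For context, the consumer side of this wakeup elsewhere in the series
blocks on the same variable via wait_var_event(). A minimal sketch of
the pairing (the real truncate/hole-punch path re-validates the mapping
in a retry loop):

        /* driver side, via ->page_free() when the page goes idle */
        wake_up_var(&page->_refcount);

        /* filesystem side, waiting for in-flight DMA to complete */
        wait_var_event(&page->_refcount,
                        atomic_read(&page->_refcount) == 1);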
Jan Kara May 21, 2018, 9:04 a.m. UTC | #4
On Fri 18-05-18 09:00:29, Dan Williams wrote:
> On Fri, May 18, 2018 at 2:46 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +     select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
> >
> > Btw, what was the reason again we couldn't get rid of FS_DAX_LIMITED?
> 
> The last I heard from Gerald they were still mildly interested in
> keeping the dcssblk dax support going with this limited mode, and
> threatened to add full page support at a later date:
> 
> ---
> From: Gerald
> 
> dcssblk seems to work fine, I did not see any SIGBUS while "executing
> in place" from dcssblk with the current upstream kernel, maybe because
> we only use dcssblk with fs dax in read-only mode.
> 
> Anyway, the dcssblk change is fine with me. I will look into adding
> struct pages for dcssblk memory later, to make it work again with
> this change, but for now I do not know of anyone needing this in the
> upstream kernel.
> 
> https://www.spinics.net/lists/linux-xfs/msg14628.html
> ---

We definitely do have customers using "execute in place" on s390x from
dcssblk. I've got about two bug reports for it when customers were updating
from old kernels using original XIP to kernels using DAX. So we need to
keep that working.

								Honza
Christoph Hellwig May 22, 2018, 6:28 a.m. UTC | #5
On Mon, May 21, 2018 at 11:04:10AM +0200, Jan Kara wrote:
> We definitely do have customers using "execute in place" on s390x from
> dcssblk. I've got about two bug reports for it when customers were updating
> from old kernels using original XIP to kernels using DAX. So we need to
> keep that working.

That is all good and fine, but I think the time has come for s390 to
provide the pmem API so that we can get rid of these special cases.
Especially given that the old XIP/legacy DAX has all kinds of known
bugs at this point in time.
Gerald Schaefer May 23, 2018, 6:50 p.m. UTC | #6
On Tue, 22 May 2018 08:28:06 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Mon, May 21, 2018 at 11:04:10AM +0200, Jan Kara wrote:
> > We definitely do have customers using "execute in place" on s390x from
> > dcssblk. I've got about two bug reports for it when customers were updating
> > from old kernels using original XIP to kernels using DAX. So we need to
> > keep that working.  
> 
> That is all good and fine, but I think the time has come for s390 to
> provide the pmem API so that we can get rid of these special cases.
> Especially given that the old XIP/legacy DAX has all kinds of known
> bugs at this point in time.

I haven't yet looked at this patch series, but I can feel that this
FS_DAX_LIMITED workaround is beginning to cause some headaches, apart
from being quite ugly of course.

Just to make sure I still understand the basic problem, which I thought
was the missing struct pages for dcssblk memory: what exactly do you
mean by "provide the pmem API"? Is there more to do?

I do have a prototype patch lying around that adds struct pages, but
didn't yet have time to fully test/complete it. Of course we initially
introduced XIP as a mechanism to reduce memory consumption, and that
is probably the use case for the remaining customer(s). Adding struct
pages would somewhat reduce that benefit, but as long as we can still
"execute in place", I guess it will be OK.

Regards,
Gerald
Dan Williams May 29, 2018, 8:26 p.m. UTC | #7
On Wed, May 23, 2018 at 11:50 AM, Gerald Schaefer
<gerald.schaefer@de.ibm.com> wrote:
> On Tue, 22 May 2018 08:28:06 +0200
> Christoph Hellwig <hch@lst.de> wrote:
>
>> On Mon, May 21, 2018 at 11:04:10AM +0200, Jan Kara wrote:
>> > We definitely do have customers using "execute in place" on s390x from
>> > dcssblk. I've got about two bug reports for it when customers were updating
>> > from old kernels using original XIP to kernels using DAX. So we need to
>> > keep that working.
>>
>> That is all good and fine, but I think the time has come for s390 to
>> provide the pmem API so that we can get rid of these special cases.
>> Especially given that the old XIP/legacy DAX has all kinds of known
>> bugs at this point in time.
>
> I haven't yet looked at this patch series, but I can feel that this
> FS_DAX_LIMITED workaround is beginning to cause some headaches, apart
> from being quite ugly of course.
>
> Just to make sure I still understand the basic problem, which I thought
> was the missing struct pages for dcssblk memory: what exactly do you
> mean by "provide the pmem API"? Is there more to do?

No, just 'struct page' is needed.

What used to be the pmem API is now pushed down into the dax_operations
provided by the device driver. dcssblk is free to just redirect to the
generic implementations for copy_from_iter() and copy_to_iter(), and
be done. I.e. we've removed the "pmem API" requirement.
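
For reference, the redirect is essentially a one-liner per operation;
dcssblk's existing ->copy_from_iter already takes this route, and a
->copy_to_iter op would follow the same pattern:

        static size_t dcssblk_dax_copy_from_iter(struct dax_device *dax_dev,
                        pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i)
        {
                /* directly addressable memory, no pmem-style cache flushing */
                return copy_from_iter(addr, bytes, i);
        }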

> I do have a prototype patch lying around that adds struct pages, but
> didn't yet have time to fully test/complete it. Of course we initially
> introduced XIP as a mechanism to reduce memory consumption, and that
> is probably the use case for the remaining customer(s). Adding struct
> pages would somewhat reduce that benefit, but as long as we can still
> "execute in place", I guess it will be OK.

The pmem driver has the option to allocate the 'struct page' map out
of pmem directly. If the overhead of having the map in System RAM is
too high it could borrow the same approach, but that adds another
degree of configuration complexity.
Gerald Schaefer June 1, 2018, 3:01 p.m. UTC | #8
On Tue, 29 May 2018 13:26:33 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, May 23, 2018 at 11:50 AM, Gerald Schaefer
> <gerald.schaefer@de.ibm.com> wrote:
[..]
> > I haven't yet looked at this patch series, but I can feel that this
> > FS_DAX_LIMITED workaround is beginning to cause some headaches, apart
> > from being quite ugly of course.
> >
> > Just to make sure I still understand the basic problem, which I thought
> > was the missing struct pages for dcssblk memory: what exactly do you
> > mean by "provide the pmem API"? Is there more to do?
> 
> No, just 'struct page' is needed.
> 
> What used to be the pmem API is now pushed down into the dax_operations
> provided by the device driver. dcssblk is free to just redirect to the
> generic implementations for copy_from_iter() and copy_to_iter(), and
> be done. I.e. we've removed the "pmem API" requirement.
> 
> > I do have a prototype patch lying around that adds struct pages, but
> > didn't yet have time to fully test/complete it. Of course we initially
> > introduced XIP as a mechanism to reduce memory consumption, and that
> > is probably the use case for the remaining customer(s). Adding struct
> > pages would somewhat reduce that benefit, but as long as we can still
> > "execute in place", I guess it will be OK.  
> 
> The pmem driver has the option to allocate the 'struct page' map out
> of pmem directly. If the overhead of having the map in System RAM is
> too high it could borrow the same approach, but that adds another
> degree of configuration complexity.
> 

Thanks for clarifying, and mentioning the pmem altmap support, that
looks interesting. I also noticed that I probably should enable
CONFIG_ZONE_DEVICE for s390, and use devm_memremap_pages() to get
the struct pages, rather than my homegrown solution so far. This will
take some time, however, so I hope you can live with the FS_DAX_LIMITED
workaround a little longer.

Regards,
Gerald
Dan Williams June 2, 2018, 12:03 a.m. UTC | #9
On Fri, Jun 1, 2018 at 8:01 AM, Gerald Schaefer
<gerald.schaefer@de.ibm.com> wrote:
[..]
> Thanks for clarifying, and mentioning the pmem altmap support, that
> looks interesting. I also noticed that I probably should enable
> CONFIG_ZONE_DEVICE for s390, and use devm_memremap_pages() to get
> the struct pages, rather than my homegrown solution so far. This will
> take some time, however, so I hope you can live with the FS_DAX_LIMITED
> workaround a little longer.
>

Yes, no urgent rush as far as I can see. Thanks Gerald.

Patch

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..4928b7fcfb71 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -134,15 +134,21 @@  int __bdev_dax_supported(struct super_block *sb, int blocksize)
 		 * on being able to do (page_address(pfn_to_page())).
 		 */
 		WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+		return 0;
 	} else if (pfn_t_devmap(pfn)) {
-		/* pass */;
-	} else {
-		pr_debug("VFS (%s): error: dax support not enabled\n",
-				sb->s_id);
-		return -EOPNOTSUPP;
+		struct dev_pagemap *pgmap;
+
+		pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
+		if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX) {
+			put_dev_pagemap(pgmap);
+			return 0;
+		}
+		put_dev_pagemap(pgmap);
 	}
 
-	return 0;
+	pr_debug("VFS (%s): error: dax support not enabled\n",
+			sb->s_id);
+	return -EOPNOTSUPP;
 }
 EXPORT_SYMBOL_GPL(__bdev_dax_supported);
 #endif
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 30b08791597d..3f7ad5bc443e 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -561,8 +561,6 @@  static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 	res->start += start_pad;
 	res->end -= end_trunc;
 
-	pgmap->type = MEMORY_DEVICE_HOST;
-
 	if (nd_pfn->mode == PFN_MODE_RAM) {
 		if (offset < SZ_8K)
 			return -EINVAL;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..e6d94604e9a4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -294,6 +294,22 @@  static void pmem_release_disk(void *__pmem)
 	put_disk(pmem->disk);
 }
 
+static void pmem_release_pgmap_ops(void *__pgmap)
+{
+	dev_pagemap_put_ops();
+}
+
+static int setup_pagemap_fsdax(struct device *dev, struct dev_pagemap *pgmap)
+{
+	dev_pagemap_get_ops();
+	if (devm_add_action_or_reset(dev, pmem_release_pgmap_ops, pgmap))
+		return -ENOMEM;
+	pgmap->type = MEMORY_DEVICE_FS_DAX;
+	pgmap->page_free = generic_dax_pagefree;
+
+	return 0;
+}
+
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns)
 {
@@ -353,6 +369,8 @@  static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	pmem->pgmap.ref = &q->q_usage_counter;
 	if (is_nd_pfn(dev)) {
+		if (setup_pagemap_fsdax(dev, &pmem->pgmap))
+			return -ENOMEM;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -364,6 +382,8 @@  static int pmem_attach_disk(struct device *dev,
 	} else if (pmem_should_map_pages(dev)) {
 		memcpy(&pmem->pgmap.res, &nsio->res, sizeof(pmem->pgmap.res));
 		pmem->pgmap.altmap_valid = false;
+		if (setup_pagemap_fsdax(dev, &pmem->pgmap))
+			return -ENOMEM;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
 		memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1e050e012eb9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@  config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
 	select FS_IOMAP
 	select DAX
 	help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7b4899c06f49..29ea63544c4d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@ 
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include <linux/mm.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -30,13 +29,6 @@  struct vmem_altmap {
  * Specialize ZONE_DEVICE memory into multiple types each having differents
  * usage.
  *
- * MEMORY_DEVICE_HOST:
- * Persistent device memory (pmem): struct page might be allocated in different
- * memory and architecture might want to perform special actions. It is similar
- * to regular memory, in that the CPU can access it transparently. However,
- * it is likely to have different bandwidth and latency than regular memory.
- * See Documentation/nvdimm/nvdimm.txt for more information.
- *
  * MEMORY_DEVICE_PRIVATE:
  * Device memory that is not directly addressable by the CPU: CPU can neither
  * read nor write private memory. In this case, we do still have struct pages
@@ -53,11 +45,19 @@  struct vmem_altmap {
  * driver can hotplug the device memory using ZONE_DEVICE and with that memory
  * type. Any page of a process can be migrated to such memory. However no one
  * should be allow to pin such memory so that it can always be evicted.
+ *
+ * MEMORY_DEVICE_FS_DAX:
+ * Host memory that has similar access semantics as System RAM i.e. DMA
+ * coherent and supports page pinning. In support of coordinating page
+ * pinning vs other operations MEMORY_DEVICE_FS_DAX arranges for a
+ * wakeup event whenever a page is unpinned and becomes idle. This
+ * wakeup is used to coordinate physical address space management (ex:
+ * fs truncate/hole punch) vs pinned pages (ex: device dma).
  */
 enum memory_type {
-	MEMORY_DEVICE_HOST = 0,
-	MEMORY_DEVICE_PRIVATE,
+	MEMORY_DEVICE_PRIVATE = 1,
 	MEMORY_DEVICE_PUBLIC,
+	MEMORY_DEVICE_FS_DAX,
 };
 
 /*
@@ -123,15 +123,18 @@  struct dev_pagemap {
 };
 
 #ifdef CONFIG_ZONE_DEVICE
+void generic_dax_pagefree(struct page *page, void *data);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
+static inline void generic_dax_pagefree(struct page *page, void *data)
+{
+}
+
 static inline void *devm_memremap_pages(struct device *dev,
 		struct dev_pagemap *pgmap)
 {
@@ -161,20 +164,6 @@  static inline void vmem_altmap_free(struct vmem_altmap *altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-	return is_zone_device_page(page) &&
-		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
 	if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06a4be6..6e19265ee8f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -821,27 +821,65 @@  static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(&device_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	if (!static_branch_unlikely(&devmap_managed_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_PUBLIC:
+	case MEMORY_DEVICE_FS_DAX:
+		__put_devmap_managed_page(page);
+		return true;
+	default:
+		break;
+	}
+	return false;
+}
+
+static inline bool is_device_private_page(const struct page *page)
 {
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
-#define IS_HMM_ENABLED 0
+
+static inline bool is_device_public_page(const struct page *page)
+{
+	return is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_PUBLIC;
+}
+
+#else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline void dev_pagemap_get_ops(void)
+{
+}
+
+static inline void dev_pagemap_put_ops(void)
+{
+}
+
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	return false;
+}
+
 static inline bool is_device_private_page(const struct page *page)
 {
 	return false;
 }
+
 static inline bool is_device_public_page(const struct page *page)
 {
 	return false;
 }
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
@@ -859,16 +897,13 @@  static inline void put_page(struct page *page)
 	page = compound_head(page);
 
 	/*
-	 * For private device pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the private device page is
-	 * free and we need to inform the device driver through callback. See
+	 * For devmap managed pages we need to catch refcount transition from
+	 * 2 to 1, when refcount reach one it means the page is free and we
+	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (IS_HMM_ENABLED && unlikely(is_device_private_page(page) ||
-	    unlikely(is_device_public_page(page)))) {
-		put_zone_device_private_or_public_page(page);
+	if (put_devmap_managed_page(page))
 		return;
-	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 37a9604133f6..3e546c6beb42 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -9,12 +9,19 @@ 
 #include <linux/memory_hotplug.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/wait_bit.h>
 
 static DEFINE_MUTEX(pgmap_lock);
 static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
+void generic_dax_pagefree(struct page *page, void *data)
+{
+	wake_up_var(&page->_refcount);
+}
+EXPORT_SYMBOL_GPL(generic_dax_pagefree);
+
 static unsigned long order_at(struct resource *res, unsigned long pgoff)
 {
 	unsigned long phys_pgoff = PHYS_PFN(res->start) + pgoff;
@@ -301,8 +308,30 @@  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	return pgmap;
 }
 
-#if IS_ENABLED(CONFIG_DEVICE_PRIVATE) ||  IS_ENABLED(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
+EXPORT_SYMBOL_GPL(devmap_managed_key);
+static atomic_t devmap_enable;
+
+/*
+ * Toggle the static key for ->page_free() callbacks when dev_pagemap
+ * pages go idle.
+ */
+void dev_pagemap_get_ops(void)
+{
+	if (atomic_inc_return(&devmap_enable) == 1)
+		static_branch_enable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_get_ops);
+
+void dev_pagemap_put_ops(void)
+{
+	if (atomic_dec_and_test(&devmap_enable))
+		static_branch_disable(&devmap_managed_key);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_put_ops);
+
+void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
@@ -322,5 +351,5 @@  void put_zone_device_private_or_public_page(struct page *page)
 	} else if (!count)
 		__put_page(page);
 }
-EXPORT_SYMBOL(put_zone_device_private_or_public_page);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+EXPORT_SYMBOL_GPL(__put_devmap_managed_page);
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/Kconfig b/mm/Kconfig
index d5004d82a1d6..bf9d6366bced 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -692,6 +692,9 @@  config ARCH_HAS_HMM
 config MIGRATE_VMA_HELPER
 	bool
 
+config DEV_PAGEMAP_OPS
+	bool
+
 config HMM
 	bool
 	select MIGRATE_VMA_HELPER
@@ -712,6 +715,7 @@  config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent unaddressable device
@@ -722,6 +726,7 @@  config DEVICE_PUBLIC
 	bool "Addressable device memory (like GPU memory)"
 	depends on ARCH_HAS_HMM
 	select HMM
+	select DEV_PAGEMAP_OPS
 
 	help
 	  Allows creation of struct pages to represent addressable device
diff --git a/mm/hmm.c b/mm/hmm.c
index 486dc394a5a3..de7b6bf77201 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -35,15 +35,6 @@ 
 
 #define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-/*
- * Device private memory see HMM (Documentation/vm/hmm.txt) or hmm.h
- */
-DEFINE_STATIC_KEY_FALSE(device_private_key);
-EXPORT_SYMBOL(device_private_key);
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
@@ -1167,7 +1158,7 @@  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 	resource_size_t addr;
 	int ret;
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
@@ -1261,7 +1252,7 @@  struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
 	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
 		return ERR_PTR(-EINVAL);
 
-	static_branch_enable(&device_private_key);
+	dev_pagemap_get_ops();
 
 	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
 				   GFP_KERNEL, dev_to_node(device));
diff --git a/mm/swap.c b/mm/swap.c
index 3dd518832096..26fc9b5f1b6c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -29,6 +29,7 @@ 
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/backing-dev.h>
+#include <linux/memremap.h>
 #include <linux/memcontrol.h>
 #include <linux/gfp.h>
 #include <linux/uio.h>
@@ -743,7 +744,7 @@  void release_pages(struct page **pages, int nr)
 						       flags);
 				locked_pgdat = NULL;
 			}
-			put_zone_device_private_or_public_page(page);
+			put_devmap_managed_page(page);
 			continue;
 		}