[v2,10/18] fsdax: Manage pgmap references at entry insertion and deletion

Message ID 166329936739.2786261.14035402420254589047.stgit@dwillia2-xfh.jf.intel.com (mailing list archive)
State New, archived
Series Fix the DAX-gup mistake

Commit Message

Dan Williams Sept. 16, 2022, 3:36 a.m. UTC
The percpu_ref in 'struct dev_pagemap' is used to coordinate active
mappings of device-memory with the device-removal / unbind path. It
enables the semantic that initiating device-removal (or
device-driver-unbind) blocks new mapping and DMA attempts, and waits for
mapping revocation or inflight DMA to complete.

Expand the scope of the reference count so that the DAX device is pinned
active at mapping time rather than later, at the first gup event. With a
device reference held while any page on that device is mapped, there is
no longer a need to manage pgmap reference counts in the gup code. That
cleanup is saved for a follow-on change.

For now, teach dax_insert_entry() and dax_delete_mapping_entry() to take
and drop pgmap references, respectively: dax_insert_entry() takes the
initial reference when the page is first mapped, and
dax_delete_mapping_entry() drops it once there are no outstanding
references to the given page(s).

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   34 ++++++++++++++++++++++++++++------
 include/linux/memremap.h |   18 ++++++++++++++----
 mm/memremap.c            |   13 ++++++++-----
 3 files changed, 50 insertions(+), 15 deletions(-)

Comments

Jason Gunthorpe Sept. 21, 2022, 2:03 p.m. UTC | #1
On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> mappings of device-memory with the device-removal / unbind path. It
> enables the semantic that initiating device-removal (or
> device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> mapping revocation or inflight DMA to complete.

This seems strange to me

The pagemap should be ref'd as long as the filesystem is mounted over
the dax. The ref should be incrd when the filesystem is mounted and
decrd when it is unmounted.

When the filesystem unmounts it should zap all the mappings (actually
I don't think you can even unmount a filesystem while mappings are
open) and wait for all page references to go to zero, then put the
final pagemap back.

The rule is nothing can touch page->pgmap while page->refcount == 0,
and if page->refcount != 0 then page->pgmap must be valid, without any
refcounting on the page map itself.

So, why do we need pgmap refcounting all over the place? It seems like
it only existed before because of the abuse of the page->refcount?

Jason
Dan Williams Sept. 21, 2022, 3:18 p.m. UTC | #2
Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > mappings of device-memory with the device-removal / unbind path. It
> > enables the semantic that initiating device-removal (or
> > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > mapping revocation or inflight DMA to complete.
> 
> This seems strange to me
> 
> The pagemap should be ref'd as long as the filesystem is mounted over
> the dax. The ref should be incrd when the filesystem is mounted and
> decrd when it is unmounted.
> 
> When the filesystem unmounts it should zap all the mappings (actually
> I don't think you can even unmount a filesystem while mappings are
> open) and wait for all page references to go to zero, then put the
> final pagemap back.
> 
> The rule is nothing can touch page->pgmap while page->refcount == 0,
> and if page->refcount != 0 then page->pgmap must be valid, without any
> refcounting on the page map itself.
> 
> So, why do we need pgmap refcounting all over the place? It seems like
> it only existed before because of the abuse of the page->refcount?

Recall that this percpu_ref is mirroring the same function as
blk_queue_enter() whereby every new request is checking to make sure the
device is still alive, or whether it has started exiting.

So pgmap 'live' reference taking in fs/dax.c allows the core to start
failing fault requests once device teardown has started. It is a 'block
new, and drain old' semantic.
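
As a rough sketch of the gate being described (the helper name is
illustrative; the patch below routes this through get_dev_pagemap_many()
in dax_associate_entry()):

/*
 * Sketch of the 'block new, drain old' semantic: a new mapping attempt
 * only proceeds if it can take a live reference on the pgmap, and that
 * reference is what memunmap_pages() later waits to drain.
 */
static vm_fault_t dax_pgmap_enter(struct dev_pagemap *pgmap)
{
        /* refuse new mappings once device teardown has started */
        if (!percpu_ref_tryget_live(&pgmap->ref))
                return VM_FAULT_SIGBUS;

        /* the reference is dropped when the mapping entry is deleted */
        return 0;
}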
Dan Williams Sept. 21, 2022, 9:38 p.m. UTC | #3
Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > > mappings of device-memory with the device-removal / unbind path. It
> > > enables the semantic that initiating device-removal (or
> > > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > > mapping revocation or inflight DMA to complete.
> > 
> > This seems strange to me
> > 
> > The pagemap should be ref'd as long as the filesystem is mounted over
> > the dax. The ref should be incrd when the filesystem is mounted and
> > decrd when it is unmounted.
> > 
> > When the filesystem unmounts it should zap all the mappings (actually
> > I don't think you can even unmount a filesystem while mappings are
> > open) and wait for all page references to go to zero, then put the
> > final pagemap back.
> > 
> > The rule is nothing can touch page->pgmap while page->refcount == 0,
> > and if page->refcount != 0 then page->pgmap must be valid, without any
> > refcounting on the page map itself.
> > 
> > So, why do we need pgmap refcounting all over the place? It seems like
> > it only existed before because of the abuse of the page->refcount?
> 
> Recall that this percpu_ref is mirroring the same function as
> blk_queue_enter() whereby every new request is checking to make sure the
> device is still alive, or whether it has started exiting.
> 
> So pgmap 'live' reference taking in fs/dax.c allows the core to start
> failing fault requests once device teardown has started. It is a 'block
> new, and drain old' semantic.

However this line of questioning has me realizing that I have the
put_dev_pagemap() in the wrong place. It needs to go in
free_zone_device_page(), so that gup extends the lifetime of the device.
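
A simplified sketch of that placement (the real free_zone_device_page()
also resets more page state before handing the page back):

/*
 * Drop the pgmap reference only when the page's own refcount finally
 * returns to zero, so an elevated page refcount from gup keeps the
 * device alive.
 */
void free_zone_device_page(struct page *page)
{
        struct dev_pagemap *pgmap = page->pgmap;

        page->mapping = NULL;
        pgmap->ops->page_free(page);

        /* pairs with the reference taken when the page was mapped */
        put_dev_pagemap(pgmap);
}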
Jason Gunthorpe Sept. 21, 2022, 10:07 p.m. UTC | #4
On Wed, Sep 21, 2022 at 02:38:56PM -0700, Dan Williams wrote:
> Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > > > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > > > mappings of device-memory with the device-removal / unbind path. It
> > > > enables the semantic that initiating device-removal (or
> > > > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > > > mapping revocation or inflight DMA to complete.
> > > 
> > > This seems strange to me
> > > 
> > > The pagemap should be ref'd as long as the filesystem is mounted over
> > > the dax. The ref should be incrd when the filesystem is mounted and
> > > decrd when it is unmounted.
> > > 
> > > When the filesystem unmounts it should zap all the mappings (actually
> > > I don't think you can even unmount a filesystem while mappings are
> > > open) and wait for all page references to go to zero, then put the
> > > final pagemap back.
> > > 
> > > The rule is nothing can touch page->pgmap while page->refcount == 0,
> > > and if page->refcount != 0 then page->pgmap must be valid, without any
> > > refcounting on the page map itself.
> > > 
> > > So, why do we need pgmap refcounting all over the place? It seems like
> > > it only existed before because of the abuse of the page->refcount?
> > 
> > Recall that this percpu_ref is mirroring the same function as
> > blk_queue_enter() whereby every new request is checking to make sure the
> > device is still alive, or whether it has started exiting.
> > 
> > So pgmap 'live' reference taking in fs/dax.c allows the core to start
> > failing fault requests once device teardown has started. It is a 'block
> > new, and drain old' semantic.

It is weird this email never arrived for me..

I think that is all fine, but it would be much more logically
expressed as a simple 'is pgmap alive' call before doing a new mapping
than mucking with the refcount logic. Such a test could simply
READ_ONCE a bool value in the pgmap struct.

Indeed, you could reasonably put such a liveness test at the moment
every driver takes a 0 refcount struct page and turns it into a 1
refcount struct page.

Jason
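
A minimal sketch of that alternative (the 'live' field here is
hypothetical, not an existing member of struct dev_pagemap):

/*
 * Hypothetical flag-based liveness test: memunmap_pages() would clear
 * 'live' before draining pages, and new mappings would check it instead
 * of taking a pgmap reference.
 */
static inline bool pgmap_is_alive(struct dev_pagemap *pgmap)
{
        return READ_ONCE(pgmap->live);
}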
Dan Williams Sept. 22, 2022, 12:14 a.m. UTC | #5
Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 02:38:56PM -0700, Dan Williams wrote:
> > Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > > > > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > > > > mappings of device-memory with the device-removal / unbind path. It
> > > > > enables the semantic that initiating device-removal (or
> > > > > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > > > > mapping revocation or inflight DMA to complete.
> > > > 
> > > > This seems strange to me
> > > > 
> > > > The pagemap should be ref'd as long as the filesystem is mounted over
> > > > the dax. The ref should be incrd when the filesystem is mounted and
> > > > decrd when it is unmounted.
> > > > 
> > > > When the filesystem unmounts it should zap all the mappings (actually
> > > > I don't think you can even unmount a filesystem while mappings are
> > > > open) and wait for all page references to go to zero, then put the
> > > > final pagemap back.
> > > > 
> > > > The rule is nothing can touch page->pgmap while page->refcount == 0,
> > > > and if page->refcount != 0 then page->pgmap must be valid, without any
> > > > refcounting on the page map itself.
> > > > 
> > > > So, why do we need pgmap refcounting all over the place? It seems like
> > > > it only existed before because of the abuse of the page->refcount?
> > > 
> > > Recall that this percpu_ref is mirroring the same function as
> > > blk_queue_enter() whereby every new request is checking to make sure the
> > > device is still alive, or whether it has started exiting.
> > > 
> > > So pgmap 'live' reference taking in fs/dax.c allows the core to start
> > > failing fault requests once device teardown has started. It is a 'block
> > > new, and drain old' semantic.
> 
> It is weird this email never arrived for me..
> 
> I think that is all fine, but it would be much more logically
> expressed as a simple 'is pgmap alive' call before doing a new mapping
> than mucking with the refcount logic. Such a test could simply
> READ_ONCE a bool value in the pgmap struct.
> 
> Indeed, you could reasonably put such a liveness test at the moment
> every driver takes a 0 refcount struct page and turns it into a 1
> refcount struct page.

I could do it with a flag, but the reason to have pgmap->ref managed at
the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
time memunmap_pages() can look at the one counter rather than scanning
and rescanning all the pages to see when they go to final idle.
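
A sketch of that pairing (the helpers are hypothetical; the point is only
that the pgmap reference follows each page's 0 -> 1 and 1 -> 0 edges):

/* hypothetical helpers illustrating the refcount pairing */
static void dax_page_get(struct page *page)
{
        /* 0 -> 1: the first user of this page pins the pgmap */
        if (page_ref_inc_return(page) == 1)
                percpu_ref_get(&page->pgmap->ref);
}

static void dax_page_put(struct page *page)
{
        /* 1 -> 0: the last user is gone, release the pgmap */
        if (page_ref_dec_and_test(page))
                percpu_ref_put(&page->pgmap->ref);
}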
Jason Gunthorpe Sept. 22, 2022, 12:25 a.m. UTC | #6
On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:

> > Indeed, you could reasonably put such a liveness test at the moment
> > every driver takes a 0 refcount struct page and turns it into a 1
> > refcount struct page.
> 
> I could do it with a flag, but the reason to have pgmap->ref managed at
> the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> time memunmap_pages() can look at the one counter rather than scanning
> and rescanning all the pages to see when they go to final idle.

That makes some sense too, but the logical way to do that is to put some
counter along the page_free() path, and establish a 'make a page not
free' path that does the other side.

ie it should not be in DAX code, it should be all in common pgmap
code. The pgmap should never be freed while any page->refcount != 0
and that should be an intrinsic property of pgmap, not relying on
external parties.

Though I suspect if we were to look at performance it is probably
better to scan the memory on the unlikely case of pgmap removal than
to put more code in hot paths to keep track of refcounts.. It doesn't
need rescanning, just one sweep where it waits on every non-zero page
to become zero.

Jason
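
A sketch of that single sweep (the wake-up plumbing is omitted; something
on the page-free path would need to wake the waiter):

/*
 * Walk the pgmap's pfn range once and wait for every page that still
 * has users to drop to refcount zero before tearing the pgmap down.
 */
static void pgmap_wait_for_idle(struct dev_pagemap *pgmap,
                                unsigned long first_pfn,
                                unsigned long nr_pages)
{
        unsigned long pfn;

        for (pfn = first_pfn; pfn < first_pfn + nr_pages; pfn++) {
                struct page *page = pfn_to_page(pfn);

                wait_var_event(&page->_refcount,
                               page_ref_count(page) == 0);
        }
}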
Dan Williams Sept. 22, 2022, 2:17 a.m. UTC | #7
Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> 
> > > Indeed, you could reasonably put such a liveness test at the moment
> > > every driver takes a 0 refcount struct page and turns it into a 1
> > > refcount struct page.
> > 
> > I could do it with a flag, but the reason to have pgmap->ref managed at
> > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > time memunmap_pages() can look at the one counter rather than scanning
> > and rescanning all the pages to see when they go to final idle.
> 
> That makes some sense too, but the logical way to do that is to put some
> counter along the page_free() path, and establish a 'make a page not
> free' path that does the other side.
> 
> ie it should not be in DAX code, it should be all in common pgmap
> code. The pgmap should never be freed while any page->refcount != 0
> and that should be an intrinsic property of pgmap, not relying on
> external parties.

I just do not know where to put such intrinsics since there is nothing
today that requires going through the pgmap object to discover the pfn
and 'allocate' the page.

I think you may be asking to unify dax_direct_access() with pgmap
management where all dax_direct_access() users are required to take a
page reference if the pfn it returns is going to be used outside of
dax_read_lock().

In other words make dax_direct_access() the 'allocation' event that pins
the pgmap? I might be speaking a foreign language if you're not familiar
with the relationship of 'struct dax_device' to 'struct dev_pagemap'
instances. This is not the first time I have considered making them one
and the same.

> Though I suspect if we were to look at performance it is probably
> better to scan the memory on the unlikely case of pgmap removal than
> to put more code in hot paths to keep track of refcounts.. It doesn't
> need rescanning, just one sweep where it waits on every non-zero page
> to become zero.

True, on the way down nothing should be elevating page references, just
waiting for the last one to drain. I am just not sure that pgmap removal
is that unlikely going forward with things like the dax_kmem driver and
CXL Dynamic Capacity Devices where tearing down DAX devices happens.
Perhaps something to revisit if the pgmap percpu_ref ever shows up in
profiles.
Jason Gunthorpe Sept. 22, 2022, 5:55 p.m. UTC | #8
On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > 
> > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > refcount struct page.
> > > 
> > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > time memunmap_pages() can look at the one counter rather than scanning
> > > and rescanning all the pages to see when they go to final idle.
> > 
> > That makes some sense too, but the logical way to do that is to put some
> > counter along the page_free() path, and establish a 'make a page not
> > free' path that does the other side.
> > 
> > ie it should not be in DAX code, it should be all in common pgmap
> > code. The pgmap should never be freed while any page->refcount != 0
> > and that should be an intrinsic property of pgmap, not relying on
> > external parties.
> 
> I just do not know where to put such intrinsics since there is nothing
> today that requires going through the pgmap object to discover the pfn
> and 'allocate' the page.

I think that is just a new API that wrappers the set refcount = 1,
percpu refcount and maybe building appropriate compound pages too.

Eg maybe something like:

  struct folio *pgmap_alloc_folios(pgmap, start, length)

And you get back maximally sized allocated folios with refcount = 1
that span the requested range.

> In other words make dax_direct_access() the 'allocation' event that pins
> the pgmap? I might be speaking a foreign language if you're not familiar
> with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> instances. This is not the first time I have considered making them one
> in the same.

I don't know enough about dax, so yes very foreign :)

I'm thinking broadly about how to make pgmap usable to all the other
drivers in a safe and robust way that makes some kind of logical sense.

Jason
Dan Williams Sept. 22, 2022, 9:54 p.m. UTC | #9
Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > 
> > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > refcount struct page.
> > > > 
> > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > and rescanning all the pages to see when they go to final idle.
> > > 
> > > That makes some sense too, but the logical way to do that is to put some
> > > counter along the page_free() path, and establish a 'make a page not
> > > free' path that does the other side.
> > > 
> > > ie it should not be in DAX code, it should be all in common pgmap
> > > code. The pgmap should never be freed while any page->refcount != 0
> > > and that should be an intrinsic property of pgmap, not relying on
> > > external parties.
> > 
> > I just do not know where to put such intrinsics since there is nothing
> > today that requires going through the pgmap object to discover the pfn
> > and 'allocate' the page.
> 
> I think that is just a new API that wrappers the set refcount = 1,
> percpu refcount and maybe building appropriate compound pages too.
> 
> Eg maybe something like:
> 
>   struct folio *pgmap_alloc_folios(pgmap, start, length)
> 
> And you get back maximally sized allocated folios with refcount = 1
> that span the requested range.
> 
> > In other words make dax_direct_access() the 'allocation' event that pins
> > the pgmap? I might be speaking a foreign language if you're not familiar
> > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > instances. This is not the first time I have considered making them one
> > in the same.
> 
> I don't know enough about dax, so yes very foreign :)
> 
> I'm thinking broadly about how to make pgmap usable to all the other
> drivers in a safe and robust way that makes some kind of logical sense.

I think the API should be pgmap_folio_get() because, at least for DAX,
the memory is already allocated. The 'allocator' for fsdax is the
filesystem block allocator, and pgmap_folio_get() grants access to a
folio in the pgmap by a pfn that the block allocator knows about. If the
GPU use case wants to wrap an allocator around that they can, but the
fundamental requirement is check if the pgmap is dead and if not elevate
the page reference.

So something like:

/**
 * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
 * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
 * @pfn: page frame number covered by @pgmap
 */
struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, unsigned long pfn)
{
        struct page *page;
        
        VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
        
        if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
                return NULL;
        page = pfn_to_page(pfn);
        return page_folio(page);
}

This does not create compound folios, that needs to be coordinated with
the caller and likely needs an explicit

    pgmap_construct_folio(pgmap, pfn, order)

...call that can be done while holding locks against operations that
will cause the folio to be broken down.
Dave Chinner Sept. 23, 2022, 1:36 a.m. UTC | #10
On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > > 
> > > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > > refcount struct page.
> > > > > 
> > > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > > and rescanning all the pages to see when they go to final idle.
> > > > 
> > > > That makes some sense too, but the logical way to do that is to put some
> > > > counter along the page_free() path, and establish a 'make a page not
> > > > free' path that does the other side.
> > > > 
> > > > ie it should not be in DAX code, it should be all in common pgmap
> > > > code. The pgmap should never be freed while any page->refcount != 0
> > > > and that should be an intrinsic property of pgmap, not relying on
> > > > external parties.
> > > 
> > > I just do not know where to put such intrinsics since there is nothing
> > > today that requires going through the pgmap object to discover the pfn
> > > and 'allocate' the page.
> > 
> > I think that is just a new API that wrappers the set refcount = 1,
> > percpu refcount and maybe building appropriate compound pages too.
> > 
> > Eg maybe something like:
> > 
> >   struct folio *pgmap_alloc_folios(pgmap, start, length)
> > 
> > And you get back maximally sized allocated folios with refcount = 1
> > that span the requested range.
> > 
> > > In other words make dax_direct_access() the 'allocation' event that pins
> > > the pgmap? I might be speaking a foreign language if you're not familiar
> > > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > > instances. This is not the first time I have considered making them one
> > > in the same.
> > 
> > I don't know enough about dax, so yes very foreign :)
> > 
> > I'm thinking broadly about how to make pgmap usable to all the other
> > drivers in a safe and robust way that makes some kind of logical sense.
> 
> I think the API should be pgmap_folio_get() because, at least for DAX,
> the memory is already allocated. The 'allocator' for fsdax is the
> filesystem block allocator, and pgmap_folio_get() grants access to a

No, the "allocator" for fsdax is the inode iomap interface, not the
filesystem block allocator. The filesystem block allocator is only
involved in iomapping if we have to allocate a new mapping for a
given file offset.

A better name for this is "arbiter", not allocator.  To get an
active mapping of the DAX pages backing a file, we need to ask the
inode iomap subsystem to *map a file offset* and it will return
kaddr and/or pfns for the backing store the file offset maps to.

IOWs, for FSDAX, access to the backing store (i.e. the physical pages) is
arbitrated by the *inode*, not the filesystem allocator or the dax
device. Hence if a subsystem needs to pin the backing store for some
use, it must first ensure that it holds an inode reference (direct
or indirect) for that range of the backing store that spans the
life of the pin. When the pin is done, it can tear down the mappings
it was using and then the inode reference can be released.

This ensures that any racing unlink of the inode will not result in
the backing store being freed from under the application that has a
pin. It will prevent the inode from being reclaimed and so
potentially accessing stale or freed in-memory structures. And it
will prevent the filesystem from being unmounted while the
application using FSDAX access is still actively using that
functionality even if it has already closed all its fds....

Cheers,

Dave.
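
Expressed as code, that arbitration rule amounts to bracketing the pin
with an inode reference; a rough sketch, with the pin/unpin steps left as
placeholders:

/*
 * The backing store may only be pinned while an inode reference is held,
 * so a racing unlink or unmount cannot free the blocks out from under
 * the pin.
 */
static int fsdax_pin_range(struct address_space *mapping /* , range */)
{
        struct inode *inode = igrab(mapping->host);

        if (!inode)
                return -ENOENT; /* inode is already being torn down */

        /* map the file offset via iomap, take page references, do DMA */

        /* ...and only once the pin is released: */
        iput(inode);
        return 0;
}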
Dan Williams Sept. 23, 2022, 2:01 a.m. UTC | #11
Dave Chinner wrote:
> On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > > > Jason Gunthorpe wrote:
> > > > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > > > 
> > > > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > > > refcount struct page.
> > > > > > 
> > > > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > > > and rescanning all the pages to see when they go to final idle.
> > > > > 
> > > > > That makes some sense too, but the logical way to do that is to put some
> > > > > counter along the page_free() path, and establish a 'make a page not
> > > > > free' path that does the other side.
> > > > > 
> > > > > ie it should not be in DAX code, it should be all in common pgmap
> > > > > code. The pgmap should never be freed while any page->refcount != 0
> > > > > and that should be an intrinsic property of pgmap, not relying on
> > > > > external parties.
> > > > 
> > > > I just do not know where to put such intrinsics since there is nothing
> > > > today that requires going through the pgmap object to discover the pfn
> > > > and 'allocate' the page.
> > > 
> > > I think that is just a new API that wrappers the set refcount = 1,
> > > percpu refcount and maybe building appropriate compound pages too.
> > > 
> > > Eg maybe something like:
> > > 
> > >   struct folio *pgmap_alloc_folios(pgmap, start, length)
> > > 
> > > And you get back maximally sized allocated folios with refcount = 1
> > > that span the requested range.
> > > 
> > > > In other words make dax_direct_access() the 'allocation' event that pins
> > > > the pgmap? I might be speaking a foreign language if you're not familiar
> > > > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > > > instances. This is not the first time I have considered making them one
> > > > in the same.
> > > 
> > > I don't know enough about dax, so yes very foreign :)
> > > 
> > > I'm thinking broadly about how to make pgmap usable to all the other
> > > drivers in a safe and robust way that makes some kind of logical sense.
> > 
> > I think the API should be pgmap_folio_get() because, at least for DAX,
> > the memory is already allocated. The 'allocator' for fsdax is the
> > filesystem block allocator, and pgmap_folio_get() grants access to a
> 
> No, the "allocator" for fsdax is the inode iomap interface, not the
> filesystem block allocator. The filesystem block allocator is only
> involved in iomapping if we have to allocate a new mapping for a
> given file offset.
> 
> A better name for this is "arbiter", not allocator.  To get an
> active mapping of the DAX pages backing a file, we need to ask the
> inode iomap subsystem to *map a file offset* and it will return
> kaddr and/or pfns for the backing store the file offset maps to.
> 
> IOWs, for FSDAX, access to the backing store (i.e. the physical pages) is
> arbitrated by the *inode*, not the filesystem allocator or the dax
> device. Hence if a subsystem needs to pin the backing store for some
> use, it must first ensure that it holds an inode reference (direct
> or indirect) for that range of the backing store that will spans the
> life of the pin. When the pin is done, it can tear down the mappings
> it was using and then the inode reference can be released.
> 
> This ensures that any racing unlink of the inode will not result in
> the backing store being freed from under the application that has a
> pin. It will prevent the inode from being reclaimed and so
> potentially accessing stale or freed in-memory structures. And it
> will prevent the filesytem from being unmounted while the
> application using FSDAX access is still actively using that
> functionality even if it's already closed all it's fds....

Sounds so simple when you put it that way. I'll give it a shot and stop
the gymnastics of trying to get in front of truncate_inode_pages_final()
with a 'dax break layouts', just hold it off until final unpin.
Jason Gunthorpe Sept. 23, 2022, 1:24 p.m. UTC | #12
On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:

> > I'm thinking broadly about how to make pgmap usable to all the other
> > drivers in a safe and robust way that makes some kind of logical sense.
> 
> I think the API should be pgmap_folio_get() because, at least for DAX,
> the memory is already allocated. 

I would pick a name that has some logical connection to
ops->page_free()

This function is starting a pairing where once it completes page_free
will eventually be called.

> /**
>  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
>  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
>  * @pfn: page frame number covered by @pgmap
>  */
> struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, unsigned long pfn)
> {
>         struct page *page;
>         
>         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
>
>         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
>                 return NULL;

This shouldn't be a WARN?

>         page = pfn_to_page(pfn);
>         return page_folio(page);
> }

Yeah, makes sense to me, but I would do a len as well to amortize the
cost of all these checks..

> This does not create compound folios, that needs to be coordinated with
> the caller and likely needs an explicit

Does it? What situations do you think the caller needs to coordinate
the folio size? Caller should call the function for each logical unit
of storage it wants to allocate from the pgmap..

Jason
Dan Williams Sept. 23, 2022, 4:29 p.m. UTC | #13
Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> 
> > > I'm thinking broadly about how to make pgmap usable to all the other
> > > drivers in a safe and robust way that makes some kind of logical sense.
> > 
> > I think the API should be pgmap_folio_get() because, at least for DAX,
> > the memory is already allocated. 
> 
> I would pick a name that has some logical connection to
> ops->page_free()
> 
> This function is starting a pairing where once it completes page_free
> will eventually be called.

Following Dave's note that this is an 'arbitration' mechanism I think
request/release is more appropriate than alloc/free for what this is doing.

> 
> > /**
> >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
> >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
> >  * @pfn: page frame number covered by @pgmap
> >  */
> > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, unsigned long pfn)
> > {
> >         struct page *page;
> >         
> >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
> >
> >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> >                 return NULL;
> 
> This shouldn't be a WARN?

It's a bug if someone calls this after killing the pgmap. I.e.  the
expectation is that the caller is synchronizing this. The only reason
this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
not expect it to fire on anything but a development kernel.

> 
> >         page = pfn_to_page(pfn);
> >         return page_folio(page);
> > }
> 
> Yeah, makes sense to me, but I would do a len as well to amortize the
> cost of all these checks..
> 
> > This does not create compound folios, that needs to be coordinated with
> > the caller and likely needs an explicit
> 
> Does it? What situations do you think the caller needs to coordinate
> the folio size? Caller should call the function for each logical unit
> of storage it wants to allocate from the pgmap..

The problem for fsdax is that it needs to gather all the PTEs, hold a
lock to synchronize against events that would shatter a huge page, and
then build up the compound folio metadata before inserting the PMD. So I
think that flow is request all pfns, lock, fixup refcounts, build up
compound folio, insert huge i_pages entry, unlock and install the pmd.
Jason Gunthorpe Sept. 23, 2022, 5:42 p.m. UTC | #14
On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
> > > /**
> > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
> > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
> > >  * @pfn: page frame number covered by @pgmap
> > >  */
> > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
> > > unsigned long pfn)

Maybe it should not be pfn but 'offset from the first page of the
pgmap'? Then we don't need the xa_load stuff, since it can't be
wrong by definition.

> > > {
> > >         struct page *page;
> > >         
> > >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
> > >
> > >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> > >                 return NULL;
> > 
> > This shouldn't be a WARN?
> 
> It's a bug if someone calls this after killing the pgmap. I.e.  the
> expectation is that the caller is synchronzing this. The only reason
> this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
> not expect it to fire on anything but a development kernel.

OK, that makes sense

But shouldn't this get the pgmap refcount here? The reason we started
talking about this was to make all the pgmap logic self contained so
that the pgmap doesn't pass its own destroy until all the
page_free()'s have been done.

> > > This does not create compound folios, that needs to be coordinated with
> > > the caller and likely needs an explicit
> > 
> > Does it? What situations do you think the caller needs to coordinate
> > the folio size? Caller should call the function for each logical unit
> > of storage it wants to allocate from the pgmap..
> 
> The problem for fsdax is that it needs to gather all the PTEs, hold a
> lock to synchronize against events that would shatter a huge page, and
> then build up the compound folio metadata before inserting the PMD. 

Er, at this point we are just talking about acquiring virgin pages
nobody else is using, not inserting things. There is no possibility of
concurrent shattering because, by definition, nothing else can
reference these struct pages at this instant.

Also, the caller must already be serializing pgmap_get_folio()
against concurrent calls on the same pfn (since it is an error to call
pgmap_get_folio() on a non-free pfn)

So, I would expect the caller must already have all the necessary
locking to accept maximally sized folios.

eg if it has some reason to punch a hole in the contiguous range
(shatter the folio) it must *already* serialize against
pgmap_get_folio(), since something like punching a hole must know with
certainty if any struct pages are refcount != 0 or not, and must not
race with something trying to set their refcount to 1.

Jason
Dan Williams Sept. 23, 2022, 7:03 p.m. UTC | #15
Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
> > > > /**
> > > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
> > > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
> > > >  * @pfn: page frame number covered by @pgmap
> > > >  */
> > > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
> > > > unsigned long pfn)
> 
> Maybe should be not be pfn but be 'offset from the first page of the
> pgmap' ? Then we don't need the xa_load stuff, since it cann't be
> wrong by definition.
> 
> > > > {
> > > >         struct page *page;
> > > >         
> > > >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
> > > >
> > > >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> > > >                 return NULL;
> > > 
> > > This shouldn't be a WARN?
> > 
> > It's a bug if someone calls this after killing the pgmap. I.e.  the
> > expectation is that the caller is synchronzing this. The only reason
> > this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
> > not expect it to fire on anything but a development kernel.
> 
> OK, that makes sense
> 
> But shouldn't this get the pgmap refcount here? The reason we started
> talking about this was to make all the pgmap logic self contained so
> that the pgmap doesn't pass its own destroy until all the all the
> page_free()'s have been done.
> 
> > > > This does not create compound folios, that needs to be coordinated with
> > > > the caller and likely needs an explicit
> > > 
> > > Does it? What situations do you think the caller needs to coordinate
> > > the folio size? Caller should call the function for each logical unit
> > > of storage it wants to allocate from the pgmap..
> > 
> > The problem for fsdax is that it needs to gather all the PTEs, hold a
> > lock to synchronize against events that would shatter a huge page, and
> > then build up the compound folio metadata before inserting the PMD. 
> 
> Er, at this point we are just talking about acquiring virgin pages
> nobody else is using, not inserting things. There is no possibility of
> conurrent shattering because, by definition, nothing else can
> reference these struct pages at this instant.
> 
> Also, the caller must already be serializating pgmap_get_folio()
> against concurrent calls on the same pfn (since it is an error to call
> pgmap_get_folio() on an non-free pfn)
> 
> So, I would expect the caller must already have all the necessary
> locking to accept maximally sized folios.
> 
> eg if it has some reason to punch a hole in the contiguous range
> (shatter the folio) it must *already* serialize against
> pgmap_get_folio(), since something like punching a hole must know with
> certainty if any struct pages are refcount != 0 or not, and must not
> race with something trying to set their refcount to 1.

Perhaps, I'll take a look. The scenario I am more concerned about is
processA sets up a VMA of PAGE_SIZE and races processB to fault in the
same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
PTE mapping and processB gets a PMD mapping, but the refcounting is all
handled in small pages. I need to investigate more what is needed for
fsdax to support folio_size() > mapping entry size.
Jason Gunthorpe Sept. 23, 2022, 7:23 p.m. UTC | #16
On Fri, Sep 23, 2022 at 12:03:53PM -0700, Dan Williams wrote:

> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA sets up a VMA of PAGE_SIZE and races processB to fault in the
> same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
> PTE mapping and processB gets a PMD mapping, but the refcounting is all
> handled in small pages. I need to investigate more what is needed for
> fsdax to support folio_size() > mapping entry size.

This is fine actually.

The PMD/PTE can hold a tail page. So the page cache will hold a PMD
sized folio, processA will have a PTE pointing to a tail page and
processB will have a PMD pointing at the head page.

For the immediate instant you can keep accounting for each tail page
as you do now, just with folio wrappers. Once you have proper folios
you shift the accounting responsibility to the core code and the core
will be faster with one ref per PMD/PTE.

The trick with folios is probably going to be breaking up a folio. THP
has some nasty stuff for that, but I think a FS would be better to
just revoke the entire folio, bring the refcount to 0, change the
underlying physical mapping, and then fault will naturally restore a
properly sized folio to accommodate the new physical layout.

ie you never break up a folio once it is created from the pgmap.

What you want is to have the largest possible folios because that optimizes
all the handling logic.

.. and then you are well positioned to do some kind of trick where the
FS asserts at mount time that it never needs a folio less than order X
and you can then trigger the devdax optimization of folding struct
page memory and significantly reducing the wastage for struct page..

Jason
Alistair Popple Sept. 27, 2022, 6:07 a.m. UTC | #17
Dan Williams <dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
>> > > > /**
>> > > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
>> > > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
>> > > >  * @pfn: page frame number covered by @pgmap
>> > > >  */
>> > > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
>> > > > unsigned long pfn)
>>
>> Maybe should be not be pfn but be 'offset from the first page of the
>> pgmap' ? Then we don't need the xa_load stuff, since it cann't be
>> wrong by definition.
>>
>> > > > {
>> > > >         struct page *page;
>> > > >
>> > > >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
>> > > >
>> > > >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
>> > > >                 return NULL;
>> > >
>> > > This shouldn't be a WARN?
>> >
>> > It's a bug if someone calls this after killing the pgmap. I.e.  the
>> > expectation is that the caller is synchronzing this. The only reason
>> > this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
>> > not expect it to fire on anything but a development kernel.
>>
>> OK, that makes sense
>>
>> But shouldn't this get the pgmap refcount here? The reason we started
>> talking about this was to make all the pgmap logic self contained so
>> that the pgmap doesn't pass its own destroy until all the all the
>> page_free()'s have been done.

That sounds good to me at least. I just noticed we introduced this exact
bug for device private/coherent pages when making their refcounts zero
based. Nothing currently takes pgmap->ref when a private/coherent page
is mapped. Therefore memunmap_pages() will complete and the pgmap will be
destroyed while pgmap pages are still mapped.

So I think we need to call put_dev_pagemap() as part of
free_zone_device_page().

 - Alistair

>> > > > This does not create compound folios, that needs to be coordinated with
>> > > > the caller and likely needs an explicit
>> > >
>> > > Does it? What situations do you think the caller needs to coordinate
>> > > the folio size? Caller should call the function for each logical unit
>> > > of storage it wants to allocate from the pgmap..
>> >
>> > The problem for fsdax is that it needs to gather all the PTEs, hold a
>> > lock to synchronize against events that would shatter a huge page, and
>> > then build up the compound folio metadata before inserting the PMD.
>>
>> Er, at this point we are just talking about acquiring virgin pages
>> nobody else is using, not inserting things. There is no possibility of
>> conurrent shattering because, by definition, nothing else can
>> reference these struct pages at this instant.
>>
>> Also, the caller must already be serializating pgmap_get_folio()
>> against concurrent calls on the same pfn (since it is an error to call
>> pgmap_get_folio() on an non-free pfn)
>>
>> So, I would expect the caller must already have all the necessary
>> locking to accept maximally sized folios.
>>
>> eg if it has some reason to punch a hole in the contiguous range
>> (shatter the folio) it must *already* serialize against
>> pgmap_get_folio(), since something like punching a hole must know with
>> certainty if any struct pages are refcount != 0 or not, and must not
>> race with something trying to set their refcount to 1.
>
> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA sets up a VMA of PAGE_SIZE and races processB to fault in the
> same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
> PTE mapping and processB gets a PMD mapping, but the refcounting is all
> handled in small pages. I need to investigate more what is needed for
> fsdax to support folio_size() > mapping entry size.
Jason Gunthorpe Sept. 27, 2022, 12:56 p.m. UTC | #18
On Tue, Sep 27, 2022 at 04:07:05PM +1000, Alistair Popple wrote:

> That sounds good to me at least. I just noticed we introduced this exact
> bug for device private/coherent pages when making their refcounts zero
> based. Nothing currently takes pgmap->ref when a private/coherent page
> is mapped. Therefore memunmap_pages() will complete and pgmap destroyed
> while pgmap pages are still mapped.

To kind of summarize this thread

Either we should get the pgmap reference during the refcount = 1 flow,
and put it during page_free()

Or we should have the pgmap destroy sweep all the pages and wait for
them to become ref == 0

I don't think we should have pgmap references randomly strewn all over
the place. A positive refcount on the page alone must be enough to
prove that the struct page exists and the pgmap is not destroyed.

Every driver using pgmap needs something like this, so I'd prefer it
be in the pgmap code..

Jason

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 5d9f30105db4..ee2568c8b135 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -376,14 +376,26 @@  static inline void dax_mapping_set_cow(struct page *page)
  * whether this entry is shared by multiple files.  If so, set the page->mapping
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-				struct vm_fault *vmf, unsigned long flags)
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_fault *vmf, unsigned long flags)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
+	struct dev_pagemap *pgmap;
 	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
+		return 0;
+
+	if (!size)
+		return 0;
+
+	if (!(flags & DAX_COW)) {
+		pfn = dax_to_pfn(entry);
+		pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size));
+		if (!pgmap)
+			return VM_FAULT_SIGBUS;
+	}
 
 	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
 	for_each_mapped_pfn(entry, pfn) {
@@ -398,19 +410,24 @@  static void dax_associate_entry(void *entry, struct address_space *mapping,
 			page_ref_inc(page);
 		}
 	}
+
+	return 0;
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		bool trunc)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn;
+	struct page *page;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
+	if (!size)
+		return;
 
+	for_each_mapped_pfn(entry, pfn) {
+		page = pfn_to_page(pfn);
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
@@ -423,6 +440,11 @@  static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		page->mapping = NULL;
 		page->index = 0;
 	}
+
+	if (trunc && !dax_mapping_is_cow(page->mapping)) {
+		page = pfn_to_page(dax_to_pfn(entry));
+		put_dev_pagemap_many(page->pgmap, PHYS_PFN(size));
+	}
 }
 
 /*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c3b4cc84877b..fd57407e7f3d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -191,8 +191,13 @@  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
-struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn,
+					 struct dev_pagemap *pgmap, int refs);
+static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
+						  struct dev_pagemap *pgmap)
+{
+	return get_dev_pagemap_many(pfn, pgmap, 1);
+}
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -244,10 +249,15 @@  static inline unsigned long memremap_compat_align(void)
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+static inline void put_dev_pagemap_many(struct dev_pagemap *pgmap, int refs)
 {
 	if (pgmap)
-		percpu_ref_put(&pgmap->ref);
+		percpu_ref_put_many(&pgmap->ref, refs);
+}
+
+static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+{
+	put_dev_pagemap_many(pgmap, 1);
 }
 
 #endif /* _LINUX_MEMREMAP_H_ */
diff --git a/mm/memremap.c b/mm/memremap.c
index 95f6ffe9cb0f..83c5e6fafd84 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -430,15 +430,16 @@  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
 }
 
 /**
- * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
+ * get_dev_pagemap_many() - take new live references(s) on the dev_pagemap for @pfn
  * @pfn: page frame number to lookup page_map
  * @pgmap: optional known pgmap that already has a reference
+ * @refs: number of references to take
  *
  * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
  * is non-NULL but does not cover @pfn the reference to it will be released.
  */
-struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn,
+					 struct dev_pagemap *pgmap, int refs)
 {
 	resource_size_t phys = PFN_PHYS(pfn);
 
@@ -454,13 +455,15 @@  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	/* fall back to slow path lookup */
 	rcu_read_lock();
 	pgmap = xa_load(&pgmap_array, PHYS_PFN(phys));
-	if (pgmap && !percpu_ref_tryget_live(&pgmap->ref))
+	if (pgmap && !percpu_ref_tryget_live_rcu(&pgmap->ref))
 		pgmap = NULL;
+	if (pgmap && refs > 1)
+		percpu_ref_get_many(&pgmap->ref, refs - 1);
 	rcu_read_unlock();
 
 	return pgmap;
 }
-EXPORT_SYMBOL_GPL(get_dev_pagemap);
+EXPORT_SYMBOL_GPL(get_dev_pagemap_many);
 
 void free_zone_device_page(struct page *page)
 {