mbox series

[00/13] Fix the DAX-gup mistake

Message ID 166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com (mailing list archive)
Headers show
Series Fix the DAX-gup mistake | expand

Message

Dan Williams Sept. 4, 2022, 2:16 a.m. UTC
tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to
map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages()
for fsdax and devdax inodes, and use page_maybe_dma_pinned() to
determine when filesystems can safely truncate DAX mappings vs DMA.

The longer story is that DAX has caused friction with folio development
and other device-memory use cases due to its hack of using a
page-reference count of 1 to indicate that the page is DMA idle. That
situation arose from the mistake of not managing DAX page reference
counts at map time. The lack of page reference counting at map time grew
organically from the original DAX experiment of attempting to manage DAX
mappings without page structures. The page lock, dirty tracking and
other entry management was supported sans pages. However, the page
support was then bolted on incrementally so solve problems with gup,
memory-failure, and all the other kernel services that are missing when
a pfn does not have an associated page structure.

Since then John has led an effort to account for when a page is pinned
for DMA vs other sources that elevate the reference count. The
page_maybe_dma_pinned() helper slots in seamlessly to replace the need
to track transitions to page->_refount == 1.

The larger change in this set comes from Jason's observation that
inserting DAX mappings without any reference taken is a bug. So
dax_insert_entry(), that fsdax uses, is updated to take 'struct
dev_pagemap' references, and devdax is updated to reuse the same.

This allows for 'struct dev_pagemap' manipulation to be self-contained
to DAX-specific paths. It is also a foundation to build towards removing
pte_devmap() and start treating DAX pages as another vm_normal_page(),
and perhaps more conversions of the DAX infrastructure to reuse typical
page mapping helpers. One of the immediate hurdles is the usage of
pmd_devmap() to distinguish large page mappings that are not transparent
huge pages.

---

Dan Williams (13):
      fsdax: Rename "busy page" to "pinned page"
      fsdax: Use page_maybe_dma_pinned() for DAX vs DMA collisions
      fsdax: Delete put_devmap_managed_page_refs()
      fsdax: Update dax_insert_entry() calling convention to return an error
      fsdax: Cleanup dax_associate_entry()
      fsdax: Rework dax_insert_entry() calling convention
      fsdax: Manage pgmap references at entry insertion and deletion
      devdax: Minor warning fixups
      devdax: Move address_space helpers to the DAX core
      dax: Prep dax_{associate,disassociate}_entry() for compound pages
      devdax: add PUD support to the DAX mapping infrastructure
      devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
      mm/gup: Drop DAX pgmap accounting


 .clang-format             |    1
 drivers/Makefile          |    2
 drivers/dax/Kconfig       |    6
 drivers/dax/Makefile      |    1
 drivers/dax/bus.c         |    2
 drivers/dax/dax-private.h |    1
 drivers/dax/device.c      |   73 ++-
 drivers/dax/mapping.c     | 1020 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/dax/super.c       |    2
 fs/dax.c                  | 1049 ++-------------------------------------------
 fs/ext4/inode.c           |    9
 fs/fuse/dax.c             |   10
 fs/xfs/xfs_file.c         |    8
 fs/xfs/xfs_inode.c        |    2
 include/linux/dax.h       |  124 ++++-
 include/linux/huge_mm.h   |   23 -
 include/linux/memremap.h  |   24 +
 include/linux/mm.h        |   58 +-
 mm/gup.c                  |   92 +---
 mm/huge_memory.c          |   54 --
 mm/memremap.c             |   31 -
 mm/swap.c                 |    2
 22 files changed, 1326 insertions(+), 1268 deletions(-)
 create mode 100644 drivers/dax/mapping.c

base-commit: 1c23f9e627a7b412978b4e852793c5e3c3efc555

Comments

Jason Gunthorpe Sept. 6, 2022, 1:05 p.m. UTC | #1
On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote:
> tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to
> map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages()
> for fsdax and devdax inodes, and use page_maybe_dma_pinned() to
> determine when filesystems can safely truncate DAX mappings vs DMA.
> 
> The longer story is that DAX has caused friction with folio development
> and other device-memory use cases due to its hack of using a
> page-reference count of 1 to indicate that the page is DMA idle. That
> situation arose from the mistake of not managing DAX page reference
> counts at map time. The lack of page reference counting at map time grew
> organically from the original DAX experiment of attempting to manage DAX
> mappings without page structures. The page lock, dirty tracking and
> other entry management was supported sans pages. However, the page
> support was then bolted on incrementally so solve problems with gup,
> memory-failure, and all the other kernel services that are missing when
> a pfn does not have an associated page structure.
> 
> Since then John has led an effort to account for when a page is pinned
> for DMA vs other sources that elevate the reference count. The
> page_maybe_dma_pinned() helper slots in seamlessly to replace the need
> to track transitions to page->_refount == 1.
> 
> The larger change in this set comes from Jason's observation that
> inserting DAX mappings without any reference taken is a bug. So
> dax_insert_entry(), that fsdax uses, is updated to take 'struct
> dev_pagemap' references, and devdax is updated to reuse the same.

It wasn't pagemap references that were the problem, it was struct page
references.

pagemap is just something that should be ref'd in the background, as
long as a struct page has a positive reference the pagemap should be
considered referenced, IMHO free_zone_device_page() should be dealing
with this - put the pagemap after calling page_free().

Pagemap is protecting page->pgmap from UAF so we must ensure we hold
it when we do pgmap->ops

That should be the only put, and it should pair with the only get
which happens when the driver takes a 0 refcount page out of its free
list and makes it have a refcount of 1.

> page mapping helpers. One of the immediate hurdles is the usage of
> pmd_devmap() to distinguish large page mappings that are not transparent
> huge pages.

And this is because the struct page refcounting is not right :|

I had thought the progression would be to make fsdax use compound
folios, install compound folios in the PMD, remove all the special
case refcounting for DAX from the pagetable code, then address the
pgmap issue from the basis of working page->refcount, eg by putting a
pgmap put in right after the op->page_free call.

Can we continue to have the weird page->refcount behavior and still
change the other things?

Jason
Dan Williams Sept. 6, 2022, 5:23 p.m. UTC | #2
Jason Gunthorpe wrote:
> On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote:
> > tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to
> > map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages()
> > for fsdax and devdax inodes, and use page_maybe_dma_pinned() to
> > determine when filesystems can safely truncate DAX mappings vs DMA.
> > 
> > The longer story is that DAX has caused friction with folio development
> > and other device-memory use cases due to its hack of using a
> > page-reference count of 1 to indicate that the page is DMA idle. That
> > situation arose from the mistake of not managing DAX page reference
> > counts at map time. The lack of page reference counting at map time grew
> > organically from the original DAX experiment of attempting to manage DAX
> > mappings without page structures. The page lock, dirty tracking and
> > other entry management was supported sans pages. However, the page
> > support was then bolted on incrementally so solve problems with gup,
> > memory-failure, and all the other kernel services that are missing when
> > a pfn does not have an associated page structure.
> > 
> > Since then John has led an effort to account for when a page is pinned
> > for DMA vs other sources that elevate the reference count. The
> > page_maybe_dma_pinned() helper slots in seamlessly to replace the need
> > to track transitions to page->_refount == 1.
> > 
> > The larger change in this set comes from Jason's observation that
> > inserting DAX mappings without any reference taken is a bug. So
> > dax_insert_entry(), that fsdax uses, is updated to take 'struct
> > dev_pagemap' references, and devdax is updated to reuse the same.
> 
> It wasn't pagemap references that were the problem, it was struct page
> references.
> 
> pagemap is just something that should be ref'd in the background, as
> long as a struct page has a positive reference the pagemap should be
> considered referenced, IMHO free_zone_device_page() should be dealing
> with this - put the pagemap after calling page_free().

Yes.

I think I got caught admiring the solution of the
page_maybe_dma_pinned() conversion for replacing the ugly observation of
the 2 -> 1 refcount transition, and then introduced this breakage. I
will rework this to catch the 0 to 1 transition of the refcount for
incrementing and use free_zone_device_page() to drop the pgmap
reference.

> Pagemap is protecting page->pgmap from UAF so we must ensure we hold
> it when we do pgmap->ops
> 
> That should be the only put, and it should pair with the only get
> which happens when the driver takes a 0 refcount page out of its free
> list and makes it have a refcount of 1.
> 
> > page mapping helpers. One of the immediate hurdles is the usage of
> > pmd_devmap() to distinguish large page mappings that are not transparent
> > huge pages.
> 
> And this is because the struct page refcounting is not right :|
> 
> I had thought the progression would be to make fsdax use compound
> folios, install compound folios in the PMD, remove all the special
> case refcounting for DAX from the pagetable code, then address the
> pgmap issue from the basis of working page->refcount, eg by putting a
> pgmap put in right after the op->page_free call.

As far as I can see as long as the pgmap is managed at map and
free_zone_device_page() time then the large folio conversion can come
later.

> Can we continue to have the weird page->refcount behavior and still
> change the other things?

No at a minimum the pgmap vs page->refcount problem needs to be solved
first.
Jason Gunthorpe Sept. 6, 2022, 5:29 p.m. UTC | #3
On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:

> > Can we continue to have the weird page->refcount behavior and still
> > change the other things?
> 
> No at a minimum the pgmap vs page->refcount problem needs to be solved
> first.

So who will do the put page after the PTE/PMD's are cleared out? In
the normal case the tlb flusher does it integrated into zap..

Can we safely have the put page in the fsdax side after the zap?

Jason
Dan Williams Sept. 6, 2022, 6:37 p.m. UTC | #4
Jason Gunthorpe wrote:
> On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> 
> > > Can we continue to have the weird page->refcount behavior and still
> > > change the other things?
> > 
> > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > first.
> 
> So who will do the put page after the PTE/PMD's are cleared out? In
> the normal case the tlb flusher does it integrated into zap..

AFAICS the zap manages the _mapcount not _refcount. Are you talking
about page_remove_rmap() or some other reference count drop?

> Can we safely have the put page in the fsdax side after the zap?

The _refcount is managed from the lifetime insert_page() to
truncate_inode_pages(), where for DAX those are managed from
dax_insert_dentry() to dax_delete_mapping_entry().

I think that is sufficient modulo the gap you identified where I was not
accounting for additional _refcount increases outside of gup while the
DAX entry is live.
Jason Gunthorpe Sept. 6, 2022, 6:49 p.m. UTC | #5
On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > 
> > > > Can we continue to have the weird page->refcount behavior and still
> > > > change the other things?
> > > 
> > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > first.
> > 
> > So who will do the put page after the PTE/PMD's are cleared out? In
> > the normal case the tlb flusher does it integrated into zap..
> 
> AFAICS the zap manages the _mapcount not _refcount. Are you talking
> about page_remove_rmap() or some other reference count drop?

No, page refcount.

__tlb_remove_page() eventually causes a put_page() via
tlb_batch_pages_flush() calling free_pages_and_swap_cache()

Eg:

 *  MMU_GATHER_NO_GATHER
 *
 *  If the option is set the mmu_gather will not track individual pages for
 *  delayed page free anymore. A platform that enables the option needs to
 *  provide its own implementation of the __tlb_remove_page_size() function to
 *  free pages.

> > Can we safely have the put page in the fsdax side after the zap?
> 
> The _refcount is managed from the lifetime insert_page() to
> truncate_inode_pages(), where for DAX those are managed from
> dax_insert_dentry() to dax_delete_mapping_entry().

As long as we all understand the page doesn't become re-allocatable
until the refcount reaches 0 and the free op is called it may be OK!

Jason
Dan Williams Sept. 6, 2022, 7:41 p.m. UTC | #6
Jason Gunthorpe wrote:
> On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > > 
> > > > > Can we continue to have the weird page->refcount behavior and still
> > > > > change the other things?
> > > > 
> > > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > > first.
> > > 
> > > So who will do the put page after the PTE/PMD's are cleared out? In
> > > the normal case the tlb flusher does it integrated into zap..
> > 
> > AFAICS the zap manages the _mapcount not _refcount. Are you talking
> > about page_remove_rmap() or some other reference count drop?
> 
> No, page refcount.
> 
> __tlb_remove_page() eventually causes a put_page() via
> tlb_batch_pages_flush() calling free_pages_and_swap_cache()
> 
> Eg:
> 
>  *  MMU_GATHER_NO_GATHER
>  *
>  *  If the option is set the mmu_gather will not track individual pages for
>  *  delayed page free anymore. A platform that enables the option needs to
>  *  provide its own implementation of the __tlb_remove_page_size() function to
>  *  free pages.

Ok, yes, that is a vm_normal_page() mechanism which I was going to defer
since it is incremental to the _refcount handling fix and maintain that
DAX pages are still !vm_normal_page() in this set.

> > > Can we safely have the put page in the fsdax side after the zap?
> > 
> > The _refcount is managed from the lifetime insert_page() to
> > truncate_inode_pages(), where for DAX those are managed from
> > dax_insert_dentry() to dax_delete_mapping_entry().
> 
> As long as we all understand the page doesn't become re-allocatable
> until the refcount reaches 0 and the free op is called it may be OK!

Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for
when the filesystem can safely reuse the page, it really needs to wait
for the reference count to drop to 0 similar to how it waits for the
page-idle condition today.
Dan Williams Sept. 7, 2022, 12:54 a.m. UTC | #7
Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > > > 
> > > > > > Can we continue to have the weird page->refcount behavior and still
> > > > > > change the other things?
> > > > > 
> > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > > > first.
> > > > 
> > > > So who will do the put page after the PTE/PMD's are cleared out? In
> > > > the normal case the tlb flusher does it integrated into zap..
> > > 
> > > AFAICS the zap manages the _mapcount not _refcount. Are you talking
> > > about page_remove_rmap() or some other reference count drop?
> > 
> > No, page refcount.
> > 
> > __tlb_remove_page() eventually causes a put_page() via
> > tlb_batch_pages_flush() calling free_pages_and_swap_cache()
> > 
> > Eg:
> > 
> >  *  MMU_GATHER_NO_GATHER
> >  *
> >  *  If the option is set the mmu_gather will not track individual pages for
> >  *  delayed page free anymore. A platform that enables the option needs to
> >  *  provide its own implementation of the __tlb_remove_page_size() function to
> >  *  free pages.
> 
> Ok, yes, that is a vm_normal_page() mechanism which I was going to defer
> since it is incremental to the _refcount handling fix and maintain that
> DAX pages are still !vm_normal_page() in this set.
> 
> > > > Can we safely have the put page in the fsdax side after the zap?
> > > 
> > > The _refcount is managed from the lifetime insert_page() to
> > > truncate_inode_pages(), where for DAX those are managed from
> > > dax_insert_dentry() to dax_delete_mapping_entry().
> > 
> > As long as we all understand the page doesn't become re-allocatable
> > until the refcount reaches 0 and the free op is called it may be OK!
> 
> Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for
> when the filesystem can safely reuse the page, it really needs to wait
> for the reference count to drop to 0 similar to how it waits for the
> page-idle condition today.

This gets tricky with break_layouts(). For example xfs_break_layouts()
wants to ensure that the page is gup idle while holding the mmap lock.
If the page is not gup idle it needs to drop locks and retry. It is
possible the path to drop a page reference also needs to acquire
filesystem locs. Consider odd cases like DMA from one offset to another
in the same file. So waiting with filesystem locks held is off the
table, which also means that deferring the wait until
dax_delete_mapping_entry() time is also off the table.

That means that even after the conversion to make DAX page references
0-based it will still be the case that filesystem code will be waiting
for the 2 -> 1 transition to indicate "mapped DAX page has no more
external references".

Then dax_delete_mapping_entry() can validate that it is performing the 1
-> 0 transition since no more refernences should have been taken while
holding filesystem locks.
Jason Gunthorpe Sept. 7, 2022, 12:58 p.m. UTC | #8
On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote:
> Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> > > > Jason Gunthorpe wrote:
> > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > > > > 
> > > > > > > Can we continue to have the weird page->refcount behavior and still
> > > > > > > change the other things?
> > > > > > 
> > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > > > > first.
> > > > > 
> > > > > So who will do the put page after the PTE/PMD's are cleared out? In
> > > > > the normal case the tlb flusher does it integrated into zap..
> > > > 
> > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking
> > > > about page_remove_rmap() or some other reference count drop?
> > > 
> > > No, page refcount.
> > > 
> > > __tlb_remove_page() eventually causes a put_page() via
> > > tlb_batch_pages_flush() calling free_pages_and_swap_cache()
> > > 
> > > Eg:
> > > 
> > >  *  MMU_GATHER_NO_GATHER
> > >  *
> > >  *  If the option is set the mmu_gather will not track individual pages for
> > >  *  delayed page free anymore. A platform that enables the option needs to
> > >  *  provide its own implementation of the __tlb_remove_page_size() function to
> > >  *  free pages.
> > 
> > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer
> > since it is incremental to the _refcount handling fix and maintain that
> > DAX pages are still !vm_normal_page() in this set.
> > 
> > > > > Can we safely have the put page in the fsdax side after the zap?
> > > > 
> > > > The _refcount is managed from the lifetime insert_page() to
> > > > truncate_inode_pages(), where for DAX those are managed from
> > > > dax_insert_dentry() to dax_delete_mapping_entry().
> > > 
> > > As long as we all understand the page doesn't become re-allocatable
> > > until the refcount reaches 0 and the free op is called it may be OK!
> > 
> > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for
> > when the filesystem can safely reuse the page, it really needs to wait
> > for the reference count to drop to 0 similar to how it waits for the
> > page-idle condition today.
> 
> This gets tricky with break_layouts(). For example xfs_break_layouts()
> wants to ensure that the page is gup idle while holding the mmap lock.
> If the page is not gup idle it needs to drop locks and retry. It is
> possible the path to drop a page reference also needs to acquire
> filesystem locs. Consider odd cases like DMA from one offset to another
> in the same file. So waiting with filesystem locks held is off the
> table, which also means that deferring the wait until
> dax_delete_mapping_entry() time is also off the table.
> 
> That means that even after the conversion to make DAX page references
> 0-based it will still be the case that filesystem code will be waiting
> for the 2 -> 1 transition to indicate "mapped DAX page has no more
> external references".

Why?

If you are doing the break layouts wouldn't you first zap the ptes,
which will bring the reference to 0 if there are not other users.

If the reference did not become 0 then you have to drop all locks,
sleep until it reaches zero, and retry?

How does adding 2->1 help anything?

Jason
Dan Williams Sept. 7, 2022, 5:10 p.m. UTC | #9
Jason Gunthorpe wrote:
> On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote:
> > Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> > > > > Jason Gunthorpe wrote:
> > > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > > > > > 
> > > > > > > > Can we continue to have the weird page->refcount behavior and still
> > > > > > > > change the other things?
> > > > > > > 
> > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > > > > > first.
> > > > > > 
> > > > > > So who will do the put page after the PTE/PMD's are cleared out? In
> > > > > > the normal case the tlb flusher does it integrated into zap..
> > > > > 
> > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking
> > > > > about page_remove_rmap() or some other reference count drop?
> > > > 
> > > > No, page refcount.
> > > > 
> > > > __tlb_remove_page() eventually causes a put_page() via
> > > > tlb_batch_pages_flush() calling free_pages_and_swap_cache()
> > > > 
> > > > Eg:
> > > > 
> > > >  *  MMU_GATHER_NO_GATHER
> > > >  *
> > > >  *  If the option is set the mmu_gather will not track individual pages for
> > > >  *  delayed page free anymore. A platform that enables the option needs to
> > > >  *  provide its own implementation of the __tlb_remove_page_size() function to
> > > >  *  free pages.
> > > 
> > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer
> > > since it is incremental to the _refcount handling fix and maintain that
> > > DAX pages are still !vm_normal_page() in this set.
> > > 
> > > > > > Can we safely have the put page in the fsdax side after the zap?
> > > > > 
> > > > > The _refcount is managed from the lifetime insert_page() to
> > > > > truncate_inode_pages(), where for DAX those are managed from
> > > > > dax_insert_dentry() to dax_delete_mapping_entry().
> > > > 
> > > > As long as we all understand the page doesn't become re-allocatable
> > > > until the refcount reaches 0 and the free op is called it may be OK!
> > > 
> > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for
> > > when the filesystem can safely reuse the page, it really needs to wait
> > > for the reference count to drop to 0 similar to how it waits for the
> > > page-idle condition today.
> > 
> > This gets tricky with break_layouts(). For example xfs_break_layouts()
> > wants to ensure that the page is gup idle while holding the mmap lock.
> > If the page is not gup idle it needs to drop locks and retry. It is
> > possible the path to drop a page reference also needs to acquire
> > filesystem locs. Consider odd cases like DMA from one offset to another
> > in the same file. So waiting with filesystem locks held is off the
> > table, which also means that deferring the wait until
> > dax_delete_mapping_entry() time is also off the table.
> > 
> > That means that even after the conversion to make DAX page references
> > 0-based it will still be the case that filesystem code will be waiting
> > for the 2 -> 1 transition to indicate "mapped DAX page has no more
> > external references".
> 
> Why?
> 
> If you are doing the break layouts wouldn't you first zap the ptes,
> which will bring the reference to 0 if there are not other users.

The internals of break layouts does zap the ptes, but it does not remove
the mapping entries. So, I was limiting my thinking to that constraint,
but now that I push on it, the need to keep the entry around until the
final truncate_setsize() event seems soft. It should be ok to upgrade
break layouts to delete mapping entries, wait for _refcount to drop to
zero, and then re-evaluate that nothing installed a new entry after
acquiring the filesystem locks again.
Dan Williams Sept. 7, 2022, 6:43 p.m. UTC | #10
Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote:
> > > Dan Williams wrote:
> > > > Jason Gunthorpe wrote:
> > > > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote:
> > > > > > Jason Gunthorpe wrote:
> > > > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote:
> > > > > > > 
> > > > > > > > > Can we continue to have the weird page->refcount behavior and still
> > > > > > > > > change the other things?
> > > > > > > > 
> > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved
> > > > > > > > first.
> > > > > > > 
> > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In
> > > > > > > the normal case the tlb flusher does it integrated into zap..
> > > > > > 
> > > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking
> > > > > > about page_remove_rmap() or some other reference count drop?
> > > > > 
> > > > > No, page refcount.
> > > > > 
> > > > > __tlb_remove_page() eventually causes a put_page() via
> > > > > tlb_batch_pages_flush() calling free_pages_and_swap_cache()
> > > > > 
> > > > > Eg:
> > > > > 
> > > > >  *  MMU_GATHER_NO_GATHER
> > > > >  *
> > > > >  *  If the option is set the mmu_gather will not track individual pages for
> > > > >  *  delayed page free anymore. A platform that enables the option needs to
> > > > >  *  provide its own implementation of the __tlb_remove_page_size() function to
> > > > >  *  free pages.
> > > > 
> > > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer
> > > > since it is incremental to the _refcount handling fix and maintain that
> > > > DAX pages are still !vm_normal_page() in this set.
> > > > 
> > > > > > > Can we safely have the put page in the fsdax side after the zap?
> > > > > > 
> > > > > > The _refcount is managed from the lifetime insert_page() to
> > > > > > truncate_inode_pages(), where for DAX those are managed from
> > > > > > dax_insert_dentry() to dax_delete_mapping_entry().
> > > > > 
> > > > > As long as we all understand the page doesn't become re-allocatable
> > > > > until the refcount reaches 0 and the free op is called it may be OK!
> > > > 
> > > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for
> > > > when the filesystem can safely reuse the page, it really needs to wait
> > > > for the reference count to drop to 0 similar to how it waits for the
> > > > page-idle condition today.
> > > 
> > > This gets tricky with break_layouts(). For example xfs_break_layouts()
> > > wants to ensure that the page is gup idle while holding the mmap lock.
> > > If the page is not gup idle it needs to drop locks and retry. It is
> > > possible the path to drop a page reference also needs to acquire
> > > filesystem locs. Consider odd cases like DMA from one offset to another
> > > in the same file. So waiting with filesystem locks held is off the
> > > table, which also means that deferring the wait until
> > > dax_delete_mapping_entry() time is also off the table.
> > > 
> > > That means that even after the conversion to make DAX page references
> > > 0-based it will still be the case that filesystem code will be waiting
> > > for the 2 -> 1 transition to indicate "mapped DAX page has no more
> > > external references".
> > 
> > Why?
> > 
> > If you are doing the break layouts wouldn't you first zap the ptes,
> > which will bring the reference to 0 if there are not other users.
> 
> The internals of break layouts does zap the ptes, but it does not remove
> the mapping entries. So, I was limiting my thinking to that constraint,
> but now that I push on it, the need to keep the entry around until the
> final truncate_setsize() event seems soft. It should be ok to upgrade
> break layouts to delete mapping entries, wait for _refcount to drop to
> zero, and then re-evaluate that nothing installed a new entry after
> acquiring the filesystem locks again.

It is still the case that while waiting for the page to go idle it is
associated with its given file / inode. It is possible that
memory-failure, or some other event that requires looking up the page's
association, fires in that time span.

If that happens the page's association to the file needs to be kept in
tact. So it is still the case that while waiting for the final put the
page count needs to remain elevated to maintain the page's association
to the file until break layouts can be sure it is doing the final put
under filesystem locks. I.e. break layouts is "make it safe to do the
truncate", not "do the truncate up front".

The truncate will still move from being done while the _refcount is 1 to
being done while the _refcount is 0, but the condition for break layouts
to signal it is safe to proceed is when it can observe _refcount == 0,
or the 1 -> 0 transition.
Jason Gunthorpe Sept. 7, 2022, 7:30 p.m. UTC | #11
On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote:

> It is still the case that while waiting for the page to go idle it is
> associated with its given file / inode. It is possible that
> memory-failure, or some other event that requires looking up the page's
> association, fires in that time span.

Can't the page->mapping can remain set to the address space even if it is
not installed into any PTEs? Zap should only remove the PTEs, not
clear the page->mapping.

Or, said another way, page->mapping should only change while the page
refcount is 0 and thus the filesystem is completely in control of when
it changes, and can do so under its own locks

If the refcount is 0 then memory failure should not happen - it would
require someone accessed the page without referencing it. The only
thing that could do that is the kernel, and if the kernel is
referencing a 0 refcount page (eg it got converted to meta-data or
something), it is probably not linked to an address space anymore
anyhow?

> under filesystem locks. I.e. break layouts is "make it safe to do the
> truncate", not "do the truncate up front".

The truncate action is reorganizing the metadata in the filesystem,
the lead up to it is to fence of all access to the DAX pages from all
sources, so it does seem to me that 0ing the refcount in advance is
exactly the right thing to do.

It returns the page back to the exclusive control of the filesystem,
and nothing else does this.

Jason
Dan Williams Sept. 7, 2022, 8:45 p.m. UTC | #12
Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote:
> 
> > It is still the case that while waiting for the page to go idle it is
> > associated with its given file / inode. It is possible that
> > memory-failure, or some other event that requires looking up the page's
> > association, fires in that time span.
> 
> Can't the page->mapping can remain set to the address space even if it is
> not installed into any PTEs? Zap should only remove the PTEs, not
> clear the page->mapping.
> 
> Or, said another way, page->mapping should only change while the page
> refcount is 0 and thus the filesystem is completely in control of when
> it changes, and can do so under its own locks
> 
> If the refcount is 0 then memory failure should not happen - it would
> require someone accessed the page without referencing it. The only
> thing that could do that is the kernel, and if the kernel is
> referencing a 0 refcount page (eg it got converted to meta-data or
> something), it is probably not linked to an address space anymore
> anyhow?

First, thank you for helping me think through this, I am going to need
this thread in 6 months when I revisit this code.

I agree with the observation that page->mapping should only change while
the reference count is zero, but my problem is catching the 1 -> 0 in
its natural location in free_zone_device_page(). That and the fact that
the entry needs to be maintained until the page is actually disconnected
from the file to me means that break layouts holds off truncate until it
can observe the 0 refcount condition while holding filesystem locks, and
then the final truncate deletes the mapping entry which is already at 0.

I.e. break layouts waits until _refcount reaches 0, but entry removal
still needs one more dax_delete_mapping_entry() event to transitition to
the _refcount == 0 plus no address_space entry condition. Effectively
simulating _mapcount with address_space tracking until DAX pages can
become vm_normal_page().
Jason Gunthorpe Sept. 8, 2022, 6:49 p.m. UTC | #13
On Wed, Sep 07, 2022 at 01:45:35PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote:
> > 
> > > It is still the case that while waiting for the page to go idle it is
> > > associated with its given file / inode. It is possible that
> > > memory-failure, or some other event that requires looking up the page's
> > > association, fires in that time span.
> > 
> > Can't the page->mapping can remain set to the address space even if it is
> > not installed into any PTEs? Zap should only remove the PTEs, not
> > clear the page->mapping.
> > 
> > Or, said another way, page->mapping should only change while the page
> > refcount is 0 and thus the filesystem is completely in control of when
> > it changes, and can do so under its own locks
> > 
> > If the refcount is 0 then memory failure should not happen - it would
> > require someone accessed the page without referencing it. The only
> > thing that could do that is the kernel, and if the kernel is
> > referencing a 0 refcount page (eg it got converted to meta-data or
> > something), it is probably not linked to an address space anymore
> > anyhow?
> 
> First, thank you for helping me think through this, I am going to need
> this thread in 6 months when I revisit this code.
> 
> I agree with the observation that page->mapping should only change while
> the reference count is zero, but my problem is catching the 1 -> 0 in
> its natural location in free_zone_device_page(). That and the fact that
> the entry needs to be maintained until the page is actually disconnected
> from the file to me means that break layouts holds off truncate until it
> can observe the 0 refcount condition while holding filesystem locks, and
> then the final truncate deletes the mapping entry which is already at 0.

Okay, that makes sense to me.. but what is "entry need to be
maintained" mean?

> I.e. break layouts waits until _refcount reaches 0, but entry removal
> still needs one more dax_delete_mapping_entry() event to transitition to
> the _refcount == 0 plus no address_space entry condition. Effectively
> simulating _mapcount with address_space tracking until DAX pages can
> become vm_normal_page().

This I don't follow.. Who will do the one more
dax_delete_mapping_entry()?

I'm not sure what it has to do with normal_page?

Jason
Dan Williams Sept. 8, 2022, 7:27 p.m. UTC | #14
Jason Gunthorpe wrote:
> On Wed, Sep 07, 2022 at 01:45:35PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote:
> > > 
> > > > It is still the case that while waiting for the page to go idle it is
> > > > associated with its given file / inode. It is possible that
> > > > memory-failure, or some other event that requires looking up the page's
> > > > association, fires in that time span.
> > > 
> > > Can't the page->mapping can remain set to the address space even if it is
> > > not installed into any PTEs? Zap should only remove the PTEs, not
> > > clear the page->mapping.
> > > 
> > > Or, said another way, page->mapping should only change while the page
> > > refcount is 0 and thus the filesystem is completely in control of when
> > > it changes, and can do so under its own locks
> > > 
> > > If the refcount is 0 then memory failure should not happen - it would
> > > require someone accessed the page without referencing it. The only
> > > thing that could do that is the kernel, and if the kernel is
> > > referencing a 0 refcount page (eg it got converted to meta-data or
> > > something), it is probably not linked to an address space anymore
> > > anyhow?
> > 
> > First, thank you for helping me think through this, I am going to need
> > this thread in 6 months when I revisit this code.
> > 
> > I agree with the observation that page->mapping should only change while
> > the reference count is zero, but my problem is catching the 1 -> 0 in
> > its natural location in free_zone_device_page(). That and the fact that
> > the entry needs to be maintained until the page is actually disconnected
> > from the file to me means that break layouts holds off truncate until it
> > can observe the 0 refcount condition while holding filesystem locks, and
> > then the final truncate deletes the mapping entry which is already at 0.
> 
> Okay, that makes sense to me.. but what is "entry need to be
> maintained" mean?

I am talking about keeping an entry in the address_space until that page
is truncated out of the file, and I think the "zapped but not truncated"
state needs a new Xarray-value flag (DAX_ZAP) to track it. So the
life-cycle of a DAX page that is truncated becomes:

0/ devm_memremap_pages() to instantiate the page with _refcount==0

1/ dax_insert_entry() and vmf_insert_mixed() add an entry to the
address_space and install a pte for the page. _refcount==1.

2/ gup elevates _refcount to 2

3/ truncate or punch hole attempts to free the DAX page

4/ break layouts zaps the ptes, drops the reference from 1/, and waits
   for _refcount to drop to zero while holding fs locks.

5/ at the 1 -> 0 transition the address_space entry is tagged with a new
   flag DAX_ZAP to track that this page is unreferenced, but still
   associated with the mapping until the final truncate. I.e. the DAX_ZAP
   flag lets the fsdax core track when it has already dropped a page
   reference, but still has use for things like memory-failure to
   opportunistically use page->mapping on a 0-reference page.

I think this could be done without the DAX_ZAP flag, but I want to have
some safety to catch programming errors where the truncate path finds
entries already at a zero reference count without having first been
zapped.

> > I.e. break layouts waits until _refcount reaches 0, but entry removal
> > still needs one more dax_delete_mapping_entry() event to transitition to
> > the _refcount == 0 plus no address_space entry condition. Effectively
> > simulating _mapcount with address_space tracking until DAX pages can
> > become vm_normal_page().
> 
> This I don't follow.. Who will do the one more
> dax_delete_mapping_entry()?

The core of dax_delete_mapping_entry() is __dax_invalidate_entry(). I am
thinking something like __dax_invalidate_entry(mapping, index, ZAP) is
called for break layouts and __dax_invalidate_entry(mapping, index,
TRUNC) is called for finally disconnecting that page from its mapping.

> 
> I'm not sure what it has to do with normal_page?
> 

This thread is mainly about DAX slowly reinventing _mapcount that gets
managed in all the right places for a normal_page. Longer term I think
we either need to get back to the page-less DAX experiment and find a
way to unwind some of this page usage creep, or cut over to finally
making DAX-pages be normal pages and delete most of the special case
handling in fs/dax.c. I am open to discussing the former, but I think
the latter is more likely.
Jason Gunthorpe Sept. 9, 2022, 11:53 a.m. UTC | #15
On Thu, Sep 08, 2022 at 12:27:06PM -0700, Dan Williams wrote:
>    flag lets the fsdax core track when it has already dropped a page
>    reference, but still has use for things like memory-failure to
>    opportunistically use page->mapping on a 0-reference page.

This is not straightforward, as discussed before the page->mapping is
allowed to change while the refcount is zero, so there is no generic
way to safely obtain a pointer to the address space from a 0 reference
page.

You'd have to pass the 0 reference page into a new pgmap operation
which could obtain an appropriate internal lock to read page->mapping.

> > I'm not sure what it has to do with normal_page?
> 
> This thread is mainly about DAX slowly reinventing _mapcount that gets
> managed in all the right places for a normal_page. Longer term I think
> we either need to get back to the page-less DAX experiment and find a
> way to unwind some of this page usage creep, or cut over to finally
> making DAX-pages be normal pages and delete most of the special case
> handling in fs/dax.c. I am open to discussing the former, but I think
> the latter is more likely.

I think the latter is inevitable at this point..

Jason
Dan Williams Sept. 9, 2022, 5:52 p.m. UTC | #16
Jason Gunthorpe wrote:
> On Thu, Sep 08, 2022 at 12:27:06PM -0700, Dan Williams wrote:
> >    flag lets the fsdax core track when it has already dropped a page
> >    reference, but still has use for things like memory-failure to
> >    opportunistically use page->mapping on a 0-reference page.
> 
> This is not straightforward, as discussed before the page->mapping is
> allowed to change while the refcount is zero, so there is no generic
> way to safely obtain a pointer to the address space from a 0 reference
> page.

Agree.

> 
> You'd have to pass the 0 reference page into a new pgmap operation
> which could obtain an appropriate internal lock to read page->mapping.

Correct, that's what the memory-failure code does via dax_lock_page().
It pins the pgmap, freezes page->mapping associations via
rcu_read_lock(), speculatively reads page->mapping, takes the Xarray
lock, revalidates page->mapping is the one we read speculatively, and
then finally locks the entry in place until the memory-failure handling
completes.
Matthew Wilcox Sept. 9, 2022, 6:11 p.m. UTC | #17
On Thu, Sep 08, 2022 at 12:27:06PM -0700, Dan Williams wrote:
> This thread is mainly about DAX slowly reinventing _mapcount that gets
> managed in all the right places for a normal_page. Longer term I think
> we either need to get back to the page-less DAX experiment and find a
> way to unwind some of this page usage creep, or cut over to finally
> making DAX-pages be normal pages and delete most of the special case
> handling in fs/dax.c. I am open to discussing the former, but I think
> the latter is more likely.

I'm still very much in favour of the former.  I've started on replacing
scatterlist with phyr, but haven't got very far yet.