Message ID | 166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com (mailing list archive) |
---|---|
Headers | show |
Series | Fix the DAX-gup mistake | expand |
On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote: > tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to > map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages() > for fsdax and devdax inodes, and use page_maybe_dma_pinned() to > determine when filesystems can safely truncate DAX mappings vs DMA. > > The longer story is that DAX has caused friction with folio development > and other device-memory use cases due to its hack of using a > page-reference count of 1 to indicate that the page is DMA idle. That > situation arose from the mistake of not managing DAX page reference > counts at map time. The lack of page reference counting at map time grew > organically from the original DAX experiment of attempting to manage DAX > mappings without page structures. The page lock, dirty tracking and > other entry management was supported sans pages. However, the page > support was then bolted on incrementally so solve problems with gup, > memory-failure, and all the other kernel services that are missing when > a pfn does not have an associated page structure. > > Since then John has led an effort to account for when a page is pinned > for DMA vs other sources that elevate the reference count. The > page_maybe_dma_pinned() helper slots in seamlessly to replace the need > to track transitions to page->_refount == 1. > > The larger change in this set comes from Jason's observation that > inserting DAX mappings without any reference taken is a bug. So > dax_insert_entry(), that fsdax uses, is updated to take 'struct > dev_pagemap' references, and devdax is updated to reuse the same. It wasn't pagemap references that were the problem, it was struct page references. pagemap is just something that should be ref'd in the background, as long as a struct page has a positive reference the pagemap should be considered referenced, IMHO free_zone_device_page() should be dealing with this - put the pagemap after calling page_free(). Pagemap is protecting page->pgmap from UAF so we must ensure we hold it when we do pgmap->ops That should be the only put, and it should pair with the only get which happens when the driver takes a 0 refcount page out of its free list and makes it have a refcount of 1. > page mapping helpers. One of the immediate hurdles is the usage of > pmd_devmap() to distinguish large page mappings that are not transparent > huge pages. And this is because the struct page refcounting is not right :| I had thought the progression would be to make fsdax use compound folios, install compound folios in the PMD, remove all the special case refcounting for DAX from the pagetable code, then address the pgmap issue from the basis of working page->refcount, eg by putting a pgmap put in right after the op->page_free call. Can we continue to have the weird page->refcount behavior and still change the other things? Jason
Jason Gunthorpe wrote: > On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote: > > tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to > > map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages() > > for fsdax and devdax inodes, and use page_maybe_dma_pinned() to > > determine when filesystems can safely truncate DAX mappings vs DMA. > > > > The longer story is that DAX has caused friction with folio development > > and other device-memory use cases due to its hack of using a > > page-reference count of 1 to indicate that the page is DMA idle. That > > situation arose from the mistake of not managing DAX page reference > > counts at map time. The lack of page reference counting at map time grew > > organically from the original DAX experiment of attempting to manage DAX > > mappings without page structures. The page lock, dirty tracking and > > other entry management was supported sans pages. However, the page > > support was then bolted on incrementally so solve problems with gup, > > memory-failure, and all the other kernel services that are missing when > > a pfn does not have an associated page structure. > > > > Since then John has led an effort to account for when a page is pinned > > for DMA vs other sources that elevate the reference count. The > > page_maybe_dma_pinned() helper slots in seamlessly to replace the need > > to track transitions to page->_refount == 1. > > > > The larger change in this set comes from Jason's observation that > > inserting DAX mappings without any reference taken is a bug. So > > dax_insert_entry(), that fsdax uses, is updated to take 'struct > > dev_pagemap' references, and devdax is updated to reuse the same. > > It wasn't pagemap references that were the problem, it was struct page > references. > > pagemap is just something that should be ref'd in the background, as > long as a struct page has a positive reference the pagemap should be > considered referenced, IMHO free_zone_device_page() should be dealing > with this - put the pagemap after calling page_free(). Yes. I think I got caught admiring the solution of the page_maybe_dma_pinned() conversion for replacing the ugly observation of the 2 -> 1 refcount transition, and then introduced this breakage. I will rework this to catch the 0 to 1 transition of the refcount for incrementing and use free_zone_device_page() to drop the pgmap reference. > Pagemap is protecting page->pgmap from UAF so we must ensure we hold > it when we do pgmap->ops > > That should be the only put, and it should pair with the only get > which happens when the driver takes a 0 refcount page out of its free > list and makes it have a refcount of 1. > > > page mapping helpers. One of the immediate hurdles is the usage of > > pmd_devmap() to distinguish large page mappings that are not transparent > > huge pages. > > And this is because the struct page refcounting is not right :| > > I had thought the progression would be to make fsdax use compound > folios, install compound folios in the PMD, remove all the special > case refcounting for DAX from the pagetable code, then address the > pgmap issue from the basis of working page->refcount, eg by putting a > pgmap put in right after the op->page_free call. As far as I can see as long as the pgmap is managed at map and free_zone_device_page() time then the large folio conversion can come later. > Can we continue to have the weird page->refcount behavior and still > change the other things? No at a minimum the pgmap vs page->refcount problem needs to be solved first.
On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > Can we continue to have the weird page->refcount behavior and still > > change the other things? > > No at a minimum the pgmap vs page->refcount problem needs to be solved > first. So who will do the put page after the PTE/PMD's are cleared out? In the normal case the tlb flusher does it integrated into zap.. Can we safely have the put page in the fsdax side after the zap? Jason
Jason Gunthorpe wrote: > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > Can we continue to have the weird page->refcount behavior and still > > > change the other things? > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > first. > > So who will do the put page after the PTE/PMD's are cleared out? In > the normal case the tlb flusher does it integrated into zap.. AFAICS the zap manages the _mapcount not _refcount. Are you talking about page_remove_rmap() or some other reference count drop? > Can we safely have the put page in the fsdax side after the zap? The _refcount is managed from the lifetime insert_page() to truncate_inode_pages(), where for DAX those are managed from dax_insert_dentry() to dax_delete_mapping_entry(). I think that is sufficient modulo the gap you identified where I was not accounting for additional _refcount increases outside of gup while the DAX entry is live.
On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote: > Jason Gunthorpe wrote: > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > > > Can we continue to have the weird page->refcount behavior and still > > > > change the other things? > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > > first. > > > > So who will do the put page after the PTE/PMD's are cleared out? In > > the normal case the tlb flusher does it integrated into zap.. > > AFAICS the zap manages the _mapcount not _refcount. Are you talking > about page_remove_rmap() or some other reference count drop? No, page refcount. __tlb_remove_page() eventually causes a put_page() via tlb_batch_pages_flush() calling free_pages_and_swap_cache() Eg: * MMU_GATHER_NO_GATHER * * If the option is set the mmu_gather will not track individual pages for * delayed page free anymore. A platform that enables the option needs to * provide its own implementation of the __tlb_remove_page_size() function to * free pages. > > Can we safely have the put page in the fsdax side after the zap? > > The _refcount is managed from the lifetime insert_page() to > truncate_inode_pages(), where for DAX those are managed from > dax_insert_dentry() to dax_delete_mapping_entry(). As long as we all understand the page doesn't become re-allocatable until the refcount reaches 0 and the free op is called it may be OK! Jason
Jason Gunthorpe wrote: > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote: > > Jason Gunthorpe wrote: > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > > > > > Can we continue to have the weird page->refcount behavior and still > > > > > change the other things? > > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > > > first. > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In > > > the normal case the tlb flusher does it integrated into zap.. > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking > > about page_remove_rmap() or some other reference count drop? > > No, page refcount. > > __tlb_remove_page() eventually causes a put_page() via > tlb_batch_pages_flush() calling free_pages_and_swap_cache() > > Eg: > > * MMU_GATHER_NO_GATHER > * > * If the option is set the mmu_gather will not track individual pages for > * delayed page free anymore. A platform that enables the option needs to > * provide its own implementation of the __tlb_remove_page_size() function to > * free pages. Ok, yes, that is a vm_normal_page() mechanism which I was going to defer since it is incremental to the _refcount handling fix and maintain that DAX pages are still !vm_normal_page() in this set. > > > Can we safely have the put page in the fsdax side after the zap? > > > > The _refcount is managed from the lifetime insert_page() to > > truncate_inode_pages(), where for DAX those are managed from > > dax_insert_dentry() to dax_delete_mapping_entry(). > > As long as we all understand the page doesn't become re-allocatable > until the refcount reaches 0 and the free op is called it may be OK! Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for when the filesystem can safely reuse the page, it really needs to wait for the reference count to drop to 0 similar to how it waits for the page-idle condition today.
Dan Williams wrote: > Jason Gunthorpe wrote: > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote: > > > Jason Gunthorpe wrote: > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > > > > > > > Can we continue to have the weird page->refcount behavior and still > > > > > > change the other things? > > > > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > > > > first. > > > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In > > > > the normal case the tlb flusher does it integrated into zap.. > > > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking > > > about page_remove_rmap() or some other reference count drop? > > > > No, page refcount. > > > > __tlb_remove_page() eventually causes a put_page() via > > tlb_batch_pages_flush() calling free_pages_and_swap_cache() > > > > Eg: > > > > * MMU_GATHER_NO_GATHER > > * > > * If the option is set the mmu_gather will not track individual pages for > > * delayed page free anymore. A platform that enables the option needs to > > * provide its own implementation of the __tlb_remove_page_size() function to > > * free pages. > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer > since it is incremental to the _refcount handling fix and maintain that > DAX pages are still !vm_normal_page() in this set. > > > > > Can we safely have the put page in the fsdax side after the zap? > > > > > > The _refcount is managed from the lifetime insert_page() to > > > truncate_inode_pages(), where for DAX those are managed from > > > dax_insert_dentry() to dax_delete_mapping_entry(). > > > > As long as we all understand the page doesn't become re-allocatable > > until the refcount reaches 0 and the free op is called it may be OK! > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for > when the filesystem can safely reuse the page, it really needs to wait > for the reference count to drop to 0 similar to how it waits for the > page-idle condition today. This gets tricky with break_layouts(). For example xfs_break_layouts() wants to ensure that the page is gup idle while holding the mmap lock. If the page is not gup idle it needs to drop locks and retry. It is possible the path to drop a page reference also needs to acquire filesystem locs. Consider odd cases like DMA from one offset to another in the same file. So waiting with filesystem locks held is off the table, which also means that deferring the wait until dax_delete_mapping_entry() time is also off the table. That means that even after the conversion to make DAX page references 0-based it will still be the case that filesystem code will be waiting for the 2 -> 1 transition to indicate "mapped DAX page has no more external references". Then dax_delete_mapping_entry() can validate that it is performing the 1 -> 0 transition since no more refernences should have been taken while holding filesystem locks.
On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote: > Dan Williams wrote: > > Jason Gunthorpe wrote: > > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote: > > > > Jason Gunthorpe wrote: > > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > > > > > > > > > Can we continue to have the weird page->refcount behavior and still > > > > > > > change the other things? > > > > > > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > > > > > first. > > > > > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In > > > > > the normal case the tlb flusher does it integrated into zap.. > > > > > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking > > > > about page_remove_rmap() or some other reference count drop? > > > > > > No, page refcount. > > > > > > __tlb_remove_page() eventually causes a put_page() via > > > tlb_batch_pages_flush() calling free_pages_and_swap_cache() > > > > > > Eg: > > > > > > * MMU_GATHER_NO_GATHER > > > * > > > * If the option is set the mmu_gather will not track individual pages for > > > * delayed page free anymore. A platform that enables the option needs to > > > * provide its own implementation of the __tlb_remove_page_size() function to > > > * free pages. > > > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer > > since it is incremental to the _refcount handling fix and maintain that > > DAX pages are still !vm_normal_page() in this set. > > > > > > > Can we safely have the put page in the fsdax side after the zap? > > > > > > > > The _refcount is managed from the lifetime insert_page() to > > > > truncate_inode_pages(), where for DAX those are managed from > > > > dax_insert_dentry() to dax_delete_mapping_entry(). > > > > > > As long as we all understand the page doesn't become re-allocatable > > > until the refcount reaches 0 and the free op is called it may be OK! > > > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for > > when the filesystem can safely reuse the page, it really needs to wait > > for the reference count to drop to 0 similar to how it waits for the > > page-idle condition today. > > This gets tricky with break_layouts(). For example xfs_break_layouts() > wants to ensure that the page is gup idle while holding the mmap lock. > If the page is not gup idle it needs to drop locks and retry. It is > possible the path to drop a page reference also needs to acquire > filesystem locs. Consider odd cases like DMA from one offset to another > in the same file. So waiting with filesystem locks held is off the > table, which also means that deferring the wait until > dax_delete_mapping_entry() time is also off the table. > > That means that even after the conversion to make DAX page references > 0-based it will still be the case that filesystem code will be waiting > for the 2 -> 1 transition to indicate "mapped DAX page has no more > external references". Why? If you are doing the break layouts wouldn't you first zap the ptes, which will bring the reference to 0 if there are not other users. If the reference did not become 0 then you have to drop all locks, sleep until it reaches zero, and retry? How does adding 2->1 help anything? Jason
Jason Gunthorpe wrote: > On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote: > > Dan Williams wrote: > > > Jason Gunthorpe wrote: > > > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote: > > > > > Jason Gunthorpe wrote: > > > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > > > > > > > > > > > Can we continue to have the weird page->refcount behavior and still > > > > > > > > change the other things? > > > > > > > > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > > > > > > first. > > > > > > > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In > > > > > > the normal case the tlb flusher does it integrated into zap.. > > > > > > > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking > > > > > about page_remove_rmap() or some other reference count drop? > > > > > > > > No, page refcount. > > > > > > > > __tlb_remove_page() eventually causes a put_page() via > > > > tlb_batch_pages_flush() calling free_pages_and_swap_cache() > > > > > > > > Eg: > > > > > > > > * MMU_GATHER_NO_GATHER > > > > * > > > > * If the option is set the mmu_gather will not track individual pages for > > > > * delayed page free anymore. A platform that enables the option needs to > > > > * provide its own implementation of the __tlb_remove_page_size() function to > > > > * free pages. > > > > > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer > > > since it is incremental to the _refcount handling fix and maintain that > > > DAX pages are still !vm_normal_page() in this set. > > > > > > > > > Can we safely have the put page in the fsdax side after the zap? > > > > > > > > > > The _refcount is managed from the lifetime insert_page() to > > > > > truncate_inode_pages(), where for DAX those are managed from > > > > > dax_insert_dentry() to dax_delete_mapping_entry(). > > > > > > > > As long as we all understand the page doesn't become re-allocatable > > > > until the refcount reaches 0 and the free op is called it may be OK! > > > > > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for > > > when the filesystem can safely reuse the page, it really needs to wait > > > for the reference count to drop to 0 similar to how it waits for the > > > page-idle condition today. > > > > This gets tricky with break_layouts(). For example xfs_break_layouts() > > wants to ensure that the page is gup idle while holding the mmap lock. > > If the page is not gup idle it needs to drop locks and retry. It is > > possible the path to drop a page reference also needs to acquire > > filesystem locs. Consider odd cases like DMA from one offset to another > > in the same file. So waiting with filesystem locks held is off the > > table, which also means that deferring the wait until > > dax_delete_mapping_entry() time is also off the table. > > > > That means that even after the conversion to make DAX page references > > 0-based it will still be the case that filesystem code will be waiting > > for the 2 -> 1 transition to indicate "mapped DAX page has no more > > external references". > > Why? > > If you are doing the break layouts wouldn't you first zap the ptes, > which will bring the reference to 0 if there are not other users. The internals of break layouts does zap the ptes, but it does not remove the mapping entries. So, I was limiting my thinking to that constraint, but now that I push on it, the need to keep the entry around until the final truncate_setsize() event seems soft. It should be ok to upgrade break layouts to delete mapping entries, wait for _refcount to drop to zero, and then re-evaluate that nothing installed a new entry after acquiring the filesystem locks again.
Dan Williams wrote: > Jason Gunthorpe wrote: > > On Tue, Sep 06, 2022 at 05:54:54PM -0700, Dan Williams wrote: > > > Dan Williams wrote: > > > > Jason Gunthorpe wrote: > > > > > On Tue, Sep 06, 2022 at 11:37:36AM -0700, Dan Williams wrote: > > > > > > Jason Gunthorpe wrote: > > > > > > > On Tue, Sep 06, 2022 at 10:23:41AM -0700, Dan Williams wrote: > > > > > > > > > > > > > > > > Can we continue to have the weird page->refcount behavior and still > > > > > > > > > change the other things? > > > > > > > > > > > > > > > > No at a minimum the pgmap vs page->refcount problem needs to be solved > > > > > > > > first. > > > > > > > > > > > > > > So who will do the put page after the PTE/PMD's are cleared out? In > > > > > > > the normal case the tlb flusher does it integrated into zap.. > > > > > > > > > > > > AFAICS the zap manages the _mapcount not _refcount. Are you talking > > > > > > about page_remove_rmap() or some other reference count drop? > > > > > > > > > > No, page refcount. > > > > > > > > > > __tlb_remove_page() eventually causes a put_page() via > > > > > tlb_batch_pages_flush() calling free_pages_and_swap_cache() > > > > > > > > > > Eg: > > > > > > > > > > * MMU_GATHER_NO_GATHER > > > > > * > > > > > * If the option is set the mmu_gather will not track individual pages for > > > > > * delayed page free anymore. A platform that enables the option needs to > > > > > * provide its own implementation of the __tlb_remove_page_size() function to > > > > > * free pages. > > > > > > > > Ok, yes, that is a vm_normal_page() mechanism which I was going to defer > > > > since it is incremental to the _refcount handling fix and maintain that > > > > DAX pages are still !vm_normal_page() in this set. > > > > > > > > > > > Can we safely have the put page in the fsdax side after the zap? > > > > > > > > > > > > The _refcount is managed from the lifetime insert_page() to > > > > > > truncate_inode_pages(), where for DAX those are managed from > > > > > > dax_insert_dentry() to dax_delete_mapping_entry(). > > > > > > > > > > As long as we all understand the page doesn't become re-allocatable > > > > > until the refcount reaches 0 and the free op is called it may be OK! > > > > > > > > Yes, but this does mean that page_maybe_dma_pinned() is not sufficient for > > > > when the filesystem can safely reuse the page, it really needs to wait > > > > for the reference count to drop to 0 similar to how it waits for the > > > > page-idle condition today. > > > > > > This gets tricky with break_layouts(). For example xfs_break_layouts() > > > wants to ensure that the page is gup idle while holding the mmap lock. > > > If the page is not gup idle it needs to drop locks and retry. It is > > > possible the path to drop a page reference also needs to acquire > > > filesystem locs. Consider odd cases like DMA from one offset to another > > > in the same file. So waiting with filesystem locks held is off the > > > table, which also means that deferring the wait until > > > dax_delete_mapping_entry() time is also off the table. > > > > > > That means that even after the conversion to make DAX page references > > > 0-based it will still be the case that filesystem code will be waiting > > > for the 2 -> 1 transition to indicate "mapped DAX page has no more > > > external references". > > > > Why? > > > > If you are doing the break layouts wouldn't you first zap the ptes, > > which will bring the reference to 0 if there are not other users. > > The internals of break layouts does zap the ptes, but it does not remove > the mapping entries. So, I was limiting my thinking to that constraint, > but now that I push on it, the need to keep the entry around until the > final truncate_setsize() event seems soft. It should be ok to upgrade > break layouts to delete mapping entries, wait for _refcount to drop to > zero, and then re-evaluate that nothing installed a new entry after > acquiring the filesystem locks again. It is still the case that while waiting for the page to go idle it is associated with its given file / inode. It is possible that memory-failure, or some other event that requires looking up the page's association, fires in that time span. If that happens the page's association to the file needs to be kept in tact. So it is still the case that while waiting for the final put the page count needs to remain elevated to maintain the page's association to the file until break layouts can be sure it is doing the final put under filesystem locks. I.e. break layouts is "make it safe to do the truncate", not "do the truncate up front". The truncate will still move from being done while the _refcount is 1 to being done while the _refcount is 0, but the condition for break layouts to signal it is safe to proceed is when it can observe _refcount == 0, or the 1 -> 0 transition.
On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote: > It is still the case that while waiting for the page to go idle it is > associated with its given file / inode. It is possible that > memory-failure, or some other event that requires looking up the page's > association, fires in that time span. Can't the page->mapping can remain set to the address space even if it is not installed into any PTEs? Zap should only remove the PTEs, not clear the page->mapping. Or, said another way, page->mapping should only change while the page refcount is 0 and thus the filesystem is completely in control of when it changes, and can do so under its own locks If the refcount is 0 then memory failure should not happen - it would require someone accessed the page without referencing it. The only thing that could do that is the kernel, and if the kernel is referencing a 0 refcount page (eg it got converted to meta-data or something), it is probably not linked to an address space anymore anyhow? > under filesystem locks. I.e. break layouts is "make it safe to do the > truncate", not "do the truncate up front". The truncate action is reorganizing the metadata in the filesystem, the lead up to it is to fence of all access to the DAX pages from all sources, so it does seem to me that 0ing the refcount in advance is exactly the right thing to do. It returns the page back to the exclusive control of the filesystem, and nothing else does this. Jason
Jason Gunthorpe wrote: > On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote: > > > It is still the case that while waiting for the page to go idle it is > > associated with its given file / inode. It is possible that > > memory-failure, or some other event that requires looking up the page's > > association, fires in that time span. > > Can't the page->mapping can remain set to the address space even if it is > not installed into any PTEs? Zap should only remove the PTEs, not > clear the page->mapping. > > Or, said another way, page->mapping should only change while the page > refcount is 0 and thus the filesystem is completely in control of when > it changes, and can do so under its own locks > > If the refcount is 0 then memory failure should not happen - it would > require someone accessed the page without referencing it. The only > thing that could do that is the kernel, and if the kernel is > referencing a 0 refcount page (eg it got converted to meta-data or > something), it is probably not linked to an address space anymore > anyhow? First, thank you for helping me think through this, I am going to need this thread in 6 months when I revisit this code. I agree with the observation that page->mapping should only change while the reference count is zero, but my problem is catching the 1 -> 0 in its natural location in free_zone_device_page(). That and the fact that the entry needs to be maintained until the page is actually disconnected from the file to me means that break layouts holds off truncate until it can observe the 0 refcount condition while holding filesystem locks, and then the final truncate deletes the mapping entry which is already at 0. I.e. break layouts waits until _refcount reaches 0, but entry removal still needs one more dax_delete_mapping_entry() event to transitition to the _refcount == 0 plus no address_space entry condition. Effectively simulating _mapcount with address_space tracking until DAX pages can become vm_normal_page().
On Wed, Sep 07, 2022 at 01:45:35PM -0700, Dan Williams wrote: > Jason Gunthorpe wrote: > > On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote: > > > > > It is still the case that while waiting for the page to go idle it is > > > associated with its given file / inode. It is possible that > > > memory-failure, or some other event that requires looking up the page's > > > association, fires in that time span. > > > > Can't the page->mapping can remain set to the address space even if it is > > not installed into any PTEs? Zap should only remove the PTEs, not > > clear the page->mapping. > > > > Or, said another way, page->mapping should only change while the page > > refcount is 0 and thus the filesystem is completely in control of when > > it changes, and can do so under its own locks > > > > If the refcount is 0 then memory failure should not happen - it would > > require someone accessed the page without referencing it. The only > > thing that could do that is the kernel, and if the kernel is > > referencing a 0 refcount page (eg it got converted to meta-data or > > something), it is probably not linked to an address space anymore > > anyhow? > > First, thank you for helping me think through this, I am going to need > this thread in 6 months when I revisit this code. > > I agree with the observation that page->mapping should only change while > the reference count is zero, but my problem is catching the 1 -> 0 in > its natural location in free_zone_device_page(). That and the fact that > the entry needs to be maintained until the page is actually disconnected > from the file to me means that break layouts holds off truncate until it > can observe the 0 refcount condition while holding filesystem locks, and > then the final truncate deletes the mapping entry which is already at 0. Okay, that makes sense to me.. but what is "entry need to be maintained" mean? > I.e. break layouts waits until _refcount reaches 0, but entry removal > still needs one more dax_delete_mapping_entry() event to transitition to > the _refcount == 0 plus no address_space entry condition. Effectively > simulating _mapcount with address_space tracking until DAX pages can > become vm_normal_page(). This I don't follow.. Who will do the one more dax_delete_mapping_entry()? I'm not sure what it has to do with normal_page? Jason
Jason Gunthorpe wrote: > On Wed, Sep 07, 2022 at 01:45:35PM -0700, Dan Williams wrote: > > Jason Gunthorpe wrote: > > > On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote: > > > > > > > It is still the case that while waiting for the page to go idle it is > > > > associated with its given file / inode. It is possible that > > > > memory-failure, or some other event that requires looking up the page's > > > > association, fires in that time span. > > > > > > Can't the page->mapping can remain set to the address space even if it is > > > not installed into any PTEs? Zap should only remove the PTEs, not > > > clear the page->mapping. > > > > > > Or, said another way, page->mapping should only change while the page > > > refcount is 0 and thus the filesystem is completely in control of when > > > it changes, and can do so under its own locks > > > > > > If the refcount is 0 then memory failure should not happen - it would > > > require someone accessed the page without referencing it. The only > > > thing that could do that is the kernel, and if the kernel is > > > referencing a 0 refcount page (eg it got converted to meta-data or > > > something), it is probably not linked to an address space anymore > > > anyhow? > > > > First, thank you for helping me think through this, I am going to need > > this thread in 6 months when I revisit this code. > > > > I agree with the observation that page->mapping should only change while > > the reference count is zero, but my problem is catching the 1 -> 0 in > > its natural location in free_zone_device_page(). That and the fact that > > the entry needs to be maintained until the page is actually disconnected > > from the file to me means that break layouts holds off truncate until it > > can observe the 0 refcount condition while holding filesystem locks, and > > then the final truncate deletes the mapping entry which is already at 0. > > Okay, that makes sense to me.. but what is "entry need to be > maintained" mean? I am talking about keeping an entry in the address_space until that page is truncated out of the file, and I think the "zapped but not truncated" state needs a new Xarray-value flag (DAX_ZAP) to track it. So the life-cycle of a DAX page that is truncated becomes: 0/ devm_memremap_pages() to instantiate the page with _refcount==0 1/ dax_insert_entry() and vmf_insert_mixed() add an entry to the address_space and install a pte for the page. _refcount==1. 2/ gup elevates _refcount to 2 3/ truncate or punch hole attempts to free the DAX page 4/ break layouts zaps the ptes, drops the reference from 1/, and waits for _refcount to drop to zero while holding fs locks. 5/ at the 1 -> 0 transition the address_space entry is tagged with a new flag DAX_ZAP to track that this page is unreferenced, but still associated with the mapping until the final truncate. I.e. the DAX_ZAP flag lets the fsdax core track when it has already dropped a page reference, but still has use for things like memory-failure to opportunistically use page->mapping on a 0-reference page. I think this could be done without the DAX_ZAP flag, but I want to have some safety to catch programming errors where the truncate path finds entries already at a zero reference count without having first been zapped. > > I.e. break layouts waits until _refcount reaches 0, but entry removal > > still needs one more dax_delete_mapping_entry() event to transitition to > > the _refcount == 0 plus no address_space entry condition. Effectively > > simulating _mapcount with address_space tracking until DAX pages can > > become vm_normal_page(). > > This I don't follow.. Who will do the one more > dax_delete_mapping_entry()? The core of dax_delete_mapping_entry() is __dax_invalidate_entry(). I am thinking something like __dax_invalidate_entry(mapping, index, ZAP) is called for break layouts and __dax_invalidate_entry(mapping, index, TRUNC) is called for finally disconnecting that page from its mapping. > > I'm not sure what it has to do with normal_page? > This thread is mainly about DAX slowly reinventing _mapcount that gets managed in all the right places for a normal_page. Longer term I think we either need to get back to the page-less DAX experiment and find a way to unwind some of this page usage creep, or cut over to finally making DAX-pages be normal pages and delete most of the special case handling in fs/dax.c. I am open to discussing the former, but I think the latter is more likely.
On Thu, Sep 08, 2022 at 12:27:06PM -0700, Dan Williams wrote: > flag lets the fsdax core track when it has already dropped a page > reference, but still has use for things like memory-failure to > opportunistically use page->mapping on a 0-reference page. This is not straightforward, as discussed before the page->mapping is allowed to change while the refcount is zero, so there is no generic way to safely obtain a pointer to the address space from a 0 reference page. You'd have to pass the 0 reference page into a new pgmap operation which could obtain an appropriate internal lock to read page->mapping. > > I'm not sure what it has to do with normal_page? > > This thread is mainly about DAX slowly reinventing _mapcount that gets > managed in all the right places for a normal_page. Longer term I think > we either need to get back to the page-less DAX experiment and find a > way to unwind some of this page usage creep, or cut over to finally > making DAX-pages be normal pages and delete most of the special case > handling in fs/dax.c. I am open to discussing the former, but I think > the latter is more likely. I think the latter is inevitable at this point.. Jason
Jason Gunthorpe wrote: > On Thu, Sep 08, 2022 at 12:27:06PM -0700, Dan Williams wrote: > > flag lets the fsdax core track when it has already dropped a page > > reference, but still has use for things like memory-failure to > > opportunistically use page->mapping on a 0-reference page. > > This is not straightforward, as discussed before the page->mapping is > allowed to change while the refcount is zero, so there is no generic > way to safely obtain a pointer to the address space from a 0 reference > page. Agree. > > You'd have to pass the 0 reference page into a new pgmap operation > which could obtain an appropriate internal lock to read page->mapping. Correct, that's what the memory-failure code does via dax_lock_page(). It pins the pgmap, freezes page->mapping associations via rcu_read_lock(), speculatively reads page->mapping, takes the Xarray lock, revalidates page->mapping is the one we read speculatively, and then finally locks the entry in place until the memory-failure handling completes.
On Thu, Sep 08, 2022 at 12:27:06PM -0700, Dan Williams wrote: > This thread is mainly about DAX slowly reinventing _mapcount that gets > managed in all the right places for a normal_page. Longer term I think > we either need to get back to the page-less DAX experiment and find a > way to unwind some of this page usage creep, or cut over to finally > making DAX-pages be normal pages and delete most of the special case > handling in fs/dax.c. I am open to discussing the former, but I think > the latter is more likely. I'm still very much in favour of the former. I've started on replacing scatterlist with phyr, but haven't got very far yet.