Message ID | 150949217152.24061.9869502311102659784.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
---|---
State | New, archived
Headers | show
> +	struct {
> +		/*
> +		 * ZONE_DEVICE pages are never on an lru or handled by
> +		 * a slab allocator, this points to the hosting device
> +		 * page map.
> +		 */
> +		struct dev_pagemap *pgmap;
> +		/*
> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> +		 * callbacks. Note that we don't use ->mapping since
> +		 * that has hard coded page-cache assumptions in
> +		 * several paths.
> +		 */

What assumptions?  I'd much rather fix those up than having two fields
that have the same functionality.
On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +	struct {
>> +		/*
>> +		 * ZONE_DEVICE pages are never on an lru or handled by
>> +		 * a slab allocator, this points to the hosting device
>> +		 * page map.
>> +		 */
>> +		struct dev_pagemap *pgmap;
>> +		/*
>> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> +		 * callbacks. Note that we don't use ->mapping since
>> +		 * that has hard coded page-cache assumptions in
>> +		 * several paths.
>> +		 */
>
> What assumptions? I'd much rather fix those up than having two fields
> that have the same functionality.

[ Reviving this old thread where you asked why I introduce page->inode
instead of reusing page->mapping ]

For example, xfs_vm_set_page_dirty() assumes that page->mapping being
non-NULL indicates a typical page cache page, this is a false
assumption for DAX. My guess at a fix for this is to add
pagecache_page() checks to locations like this, but I worry about how
to find them all. Where pagecache_page() is:

bool pagecache_page(struct page *page)
{
	if (!page->mapping)
		return false;
	if (!IS_DAX(page->mapping->host))
		return false;
	return true;
}

Otherwise we go off the rails:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O
 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96
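[Editor's note: the predicate above can be modeled in userspace to make its behavior concrete. The stub types below (a `struct inode` with an `is_dax` flag, `IS_DAX()` as a plain function) are assumptions for illustration, not the kernel definitions, and the logic is kept exactly as quoted — which, despite the name, returns true only for pages whose host inode is DAX.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace model of the pagecache_page() sketch quoted above.
 * These stub types are assumptions for illustration only; in the
 * kernel, IS_DAX() tests the S_DAX flag on the inode.
 */
struct inode { bool is_dax; };
struct address_space { struct inode *host; };
struct page { struct address_space *mapping; };

static bool IS_DAX(const struct inode *inode)
{
	return inode->is_dax;
}

/* Logic kept exactly as in the mail above. */
static bool pagecache_page(struct page *page)
{
	if (!page->mapping)
		return false;
	if (!IS_DAX(page->mapping->host))
		return false;
	return true;
}
```

As written, the check passes only when the page belongs to a DAX-capable inode's mapping, so a caller like xfs_vm_set_page_dirty() would presumably use it to decide whether page-cache bookkeeping applies.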
On Tue 19-12-17 17:11:38, Dan Williams wrote:
> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +	struct {
> >> +		/*
> >> +		 * ZONE_DEVICE pages are never on an lru or handled by
> >> +		 * a slab allocator, this points to the hosting device
> >> +		 * page map.
> >> +		 */
> >> +		struct dev_pagemap *pgmap;
> >> +		/*
> >> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> +		 * callbacks. Note that we don't use ->mapping since
> >> +		 * that has hard coded page-cache assumptions in
> >> +		 * several paths.
> >> +		 */
> >
> > What assumptions? I'd much rather fix those up than having two fields
> > that have the same functionality.
>
> [ Reviving this old thread where you asked why I introduce page->inode
> instead of reusing page->mapping ]
>
> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> non-NULL indicates a typical page cache page, this is a false
> assumption for DAX. My guess at a fix for this is to add
> pagecache_page() checks to locations like this, but I worry about how
> to find them all. Where pagecache_page() is:
>
> bool pagecache_page(struct page *page)
> {
>	if (!page->mapping)
>		return false;
>	if (!IS_DAX(page->mapping->host))
>		return false;
>	return true;
> }
>
> Otherwise we go off the rails:
>
> WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]

But this just shows that mapping->a_ops are wrong for this mapping, doesn't
it? ->set_page_dirty handler for DAX mapping should just properly handle
DAX pages... (and only those)

> [..]
> CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O
> 4.15.0-rc2+ #984
> [..]
> Call Trace:
> set_page_dirty_lock+0x40/0x60
> bio_set_pages_dirty+0x37/0x50
> iomap_dio_actor+0x2b7/0x3b0
> ? iomap_dio_zero+0x110/0x110
> iomap_apply+0xa4/0x110
> iomap_dio_rw+0x29e/0x3b0
> ? iomap_dio_zero+0x110/0x110
> ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
> xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
> xfs_file_read_iter+0xa0/0xc0 [xfs]
> __vfs_read+0xf9/0x170
> vfs_read+0xa6/0x150
> SyS_pread64+0x93/0xb0
> entry_SYSCALL_64_fastpath+0x1f/0x96

								Honza
On Tue, Dec 19, 2017 at 05:11:38PM -0800, Dan Williams wrote:
> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +	struct {
> >> +		/*
> >> +		 * ZONE_DEVICE pages are never on an lru or handled by
> >> +		 * a slab allocator, this points to the hosting device
> >> +		 * page map.
> >> +		 */
> >> +		struct dev_pagemap *pgmap;
> >> +		/*
> >> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> +		 * callbacks. Note that we don't use ->mapping since
> >> +		 * that has hard coded page-cache assumptions in
> >> +		 * several paths.
> >> +		 */
> >
> > What assumptions? I'd much rather fix those up than having two fields
> > that have the same functionality.
>
> [ Reviving this old thread where you asked why I introduce page->inode
> instead of reusing page->mapping ]
>
> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> non-NULL indicates a typical page cache page, this is a false
> assumption for DAX.

That means every single filesystem has an incorrect assumption for DAX
pages. xfs_vm_set_page_dirty() is derived directly from
__set_page_dirty_buffers(), which is the default function that
set_page_dirty() calls to do its work. Indeed, ext4 also calls
__set_page_dirty_buffers(), so whatever problem XFS has here with DAX
and racing truncates is going to manifest in ext4 as well.

> My guess at a fix for this is to add
> pagecache_page() checks to locations like this, but I worry about how
> to find them all. Where pagecache_page() is:
>
> bool pagecache_page(struct page *page)
> {
>	if (!page->mapping)
>		return false;
>	if (!IS_DAX(page->mapping->host))
>		return false;
>	return true;
> }

This is likely to be a problem in lots more places if we have to treat
"has page been truncated away" race checks on dax mappings differently
to page cache mappings. This smells of a whack-a-mole style bandaid to
me....

Cheers,

Dave.
On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 19-12-17 17:11:38, Dan Williams wrote:
>> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> +	struct {
>> >> +		/*
>> >> +		 * ZONE_DEVICE pages are never on an lru or handled by
>> >> +		 * a slab allocator, this points to the hosting device
>> >> +		 * page map.
>> >> +		 */
>> >> +		struct dev_pagemap *pgmap;
>> >> +		/*
>> >> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> >> +		 * callbacks. Note that we don't use ->mapping since
>> >> +		 * that has hard coded page-cache assumptions in
>> >> +		 * several paths.
>> >> +		 */
>> >
>> > What assumptions? I'd much rather fix those up than having two fields
>> > that have the same functionality.
>>
>> [ Reviving this old thread where you asked why I introduce page->inode
>> instead of reusing page->mapping ]
>>
>> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
>> non-NULL indicates a typical page cache page, this is a false
>> assumption for DAX. My guess at a fix for this is to add
>> pagecache_page() checks to locations like this, but I worry about how
>> to find them all. Where pagecache_page() is:
>>
>> bool pagecache_page(struct page *page)
>> {
>>	if (!page->mapping)
>>		return false;
>>	if (!IS_DAX(page->mapping->host))
>>		return false;
>>	return true;
>> }
>>
>> Otherwise we go off the rails:
>>
>> WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>
> But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> it? ->set_page_dirty handler for DAX mapping should just properly handle
> DAX pages... (and only those)

Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
enable all the address_space_operations to start firing. However,
instead of adding DAX specific address_space_operations it appears
->mapping should never be set for DAX pages, because DAX pages are
disconnected from the page-writeback machinery. In other words never
setting ->mapping bypasses all the possible broken assumptions and
code paths that take page-cache specific actions before calling an
address_space_operation.

>
>> [..]
>> CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O
>> 4.15.0-rc2+ #984
>> [..]
>> Call Trace:
>> set_page_dirty_lock+0x40/0x60
>> bio_set_pages_dirty+0x37/0x50
>> iomap_dio_actor+0x2b7/0x3b0
>> ? iomap_dio_zero+0x110/0x110
>> iomap_apply+0xa4/0x110
>> iomap_dio_rw+0x29e/0x3b0
>> ? iomap_dio_zero+0x110/0x110
>> ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>> xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>> xfs_file_read_iter+0xa0/0xc0 [xfs]
>> __vfs_read+0xf9/0x170
>> vfs_read+0xa6/0x150
>> SyS_pread64+0x93/0xb0
>> entry_SYSCALL_64_fastpath+0x1f/0x96
>
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Wed 20-12-17 14:41:14, Dan Williams wrote:
> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> +	struct {
> >> >> +		/*
> >> >> +		 * ZONE_DEVICE pages are never on an lru or handled by
> >> >> +		 * a slab allocator, this points to the hosting device
> >> >> +		 * page map.
> >> >> +		 */
> >> >> +		struct dev_pagemap *pgmap;
> >> >> +		/*
> >> >> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> >> +		 * callbacks. Note that we don't use ->mapping since
> >> >> +		 * that has hard coded page-cache assumptions in
> >> >> +		 * several paths.
> >> >> +		 */
> >> >
> >> > What assumptions? I'd much rather fix those up than having two fields
> >> > that have the same functionality.
> >>
> >> [ Reviving this old thread where you asked why I introduce page->inode
> >> instead of reusing page->mapping ]
> >>
> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> >> non-NULL indicates a typical page cache page, this is a false
> >> assumption for DAX. My guess at a fix for this is to add
> >> pagecache_page() checks to locations like this, but I worry about how
> >> to find them all. Where pagecache_page() is:
> >>
> >> bool pagecache_page(struct page *page)
> >> {
> >>	if (!page->mapping)
> >>		return false;
> >>	if (!IS_DAX(page->mapping->host))
> >>		return false;
> >>	return true;
> >> }
> >>
> >> Otherwise we go off the rails:
> >>
> >> WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
> >
> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> > it? ->set_page_dirty handler for DAX mapping should just properly handle
> > DAX pages... (and only those)
>
> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
> enable all the address_space_operations to start firing. However,
> instead of adding DAX specific address_space_operations it appears
> ->mapping should never be set for DAX pages, because DAX pages are
> disconnected from the page-writeback machinery.

page->mapping is not only about page-writeback machinery. It is generally
about page <-> inode relation and that still exists for DAX pages. We even
reuse the mapping->page_tree to store DAX pages. Also requiring proper
address_space_operations for DAX inodes is IMO not a bad thing as such.

That being said I'm not 100% convinced we should really set page->mapping
for DAX pages. After all they are not page cache pages but rather a
physical storage for the data, don't ever get to LRU, etc. But if you need
page->inode relation somewhere, that is a good indication to me that it
might be just easier to set page->mapping and provide aops that do the
right thing (i.e. usually not much) for them.

BTW: the ->set_page_dirty() in particular actually *does* need to do
something for DAX pages - corresponding radix tree entries should be
marked dirty so that caches can get flushed when needed.

> In other words never
> setting ->mapping bypasses all the possible broken assumptions and
> code paths that take page-cache specific actions before calling an
> address_space_operation.

If there are any assumptions left after aops are set properly, then we can
reconsider this but for now setting ->mapping and proper aops looks cleaner
to me...

								Honza
On Thu, Dec 21, 2017 at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 20-12-17 14:41:14, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
>> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> +	struct {
>> >> >> +		/*
>> >> >> +		 * ZONE_DEVICE pages are never on an lru or handled by
>> >> >> +		 * a slab allocator, this points to the hosting device
>> >> >> +		 * page map.
>> >> >> +		 */
>> >> >> +		struct dev_pagemap *pgmap;
>> >> >> +		/*
>> >> >> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> >> >> +		 * callbacks. Note that we don't use ->mapping since
>> >> >> +		 * that has hard coded page-cache assumptions in
>> >> >> +		 * several paths.
>> >> >> +		 */
>> >> >
>> >> > What assumptions? I'd much rather fix those up than having two fields
>> >> > that have the same functionality.
>> >>
>> >> [ Reviving this old thread where you asked why I introduce page->inode
>> >> instead of reusing page->mapping ]
>> >>
>> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
>> >> non-NULL indicates a typical page cache page, this is a false
>> >> assumption for DAX. My guess at a fix for this is to add
>> >> pagecache_page() checks to locations like this, but I worry about how
>> >> to find them all. Where pagecache_page() is:
>> >>
>> >> bool pagecache_page(struct page *page)
>> >> {
>> >>	if (!page->mapping)
>> >>		return false;
>> >>	if (!IS_DAX(page->mapping->host))
>> >>		return false;
>> >>	return true;
>> >> }
>> >>
>> >> Otherwise we go off the rails:
>> >>
>> >> WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>> >
>> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
>> > it? ->set_page_dirty handler for DAX mapping should just properly handle
>> > DAX pages... (and only those)
>>
>> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
>> enable all the address_space_operations to start firing. However,
>> instead of adding DAX specific address_space_operations it appears
>> ->mapping should never be set for DAX pages, because DAX pages are
>> disconnected from the page-writeback machinery.
>
> page->mapping is not only about page-writeback machinery. It is generally
> about page <-> inode relation and that still exists for DAX pages. We even
> reuse the mapping->page_tree to store DAX pages. Also requiring proper
> address_space_operations for DAX inodes is IMO not a bad thing as such.
>
> That being said I'm not 100% convinced we should really set page->mapping
> for DAX pages. After all they are not page cache pages but rather a
> physical storage for the data, don't ever get to LRU, etc. But if you need
> page->inode relation somewhere, that is a good indication to me that it
> might be just easier to set page->mapping and provide aops that do the
> right thing (i.e. usually not much) for them.
>
> BTW: the ->set_page_dirty() in particular actually *does* need to do
> something for DAX pages - corresponding radix tree entries should be
> marked dirty so that caches can get flushed when needed.

For this specific concern, the get_user_pages() path will have
triggered mkwrite, so the dax dirty tracking in the radix will have
already happened by the time we call ->set_page_dirty(). So, it's not
yet clear to me that we need that particular op.

>> In other words never
>> setting ->mapping bypasses all the possible broken assumptions and
>> code paths that take page-cache specific actions before calling an
>> address_space_operation.
>
> If there are any assumptions left after aops are set properly, then we can
> reconsider this but for now setting ->mapping and proper aops looks cleaner
> to me...

I'll try an address_space_operation with a nop ->set_page_dirty() and
see if anything else falls out.
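[Editor's note: the nop experiment Dan describes can be modeled in userspace with stub types. Nothing below is code from this series; the struct definitions and the `dax_aops` name are assumptions for illustration. The point is simply that the callback does nothing page-based, because DAX dirty tracking already happened in the radix tree via `->page_mkwrite` at fault time.]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel types (assumptions). */
struct page { int flags; };
struct address_space_operations {
	int (*set_page_dirty)(struct page *page);
};

/*
 * A ->set_page_dirty that intentionally does nothing: for DAX, dirty
 * state lives in the mapping's radix tree, not in per-page flags, and
 * it was already recorded on the mkwrite fault.
 */
static int dax_set_page_dirty(struct page *page)
{
	(void)page;
	return 0;	/* no page flags touched */
}

static const struct address_space_operations dax_aops = {
	.set_page_dirty = dax_set_page_dirty,
};
```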
On Thu 21-12-17 09:31:50, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 20-12-17 14:41:14, Dan Williams wrote:
> >> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
> >> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> >> +	struct {
> >> >> >> +		/*
> >> >> >> +		 * ZONE_DEVICE pages are never on an lru or handled by
> >> >> >> +		 * a slab allocator, this points to the hosting device
> >> >> >> +		 * page map.
> >> >> >> +		 */
> >> >> >> +		struct dev_pagemap *pgmap;
> >> >> >> +		/*
> >> >> >> +		 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> >> >> +		 * callbacks. Note that we don't use ->mapping since
> >> >> >> +		 * that has hard coded page-cache assumptions in
> >> >> >> +		 * several paths.
> >> >> >> +		 */
> >> >> >
> >> >> > What assumptions? I'd much rather fix those up than having two fields
> >> >> > that have the same functionality.
> >> >>
> >> >> [ Reviving this old thread where you asked why I introduce page->inode
> >> >> instead of reusing page->mapping ]
> >> >>
> >> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> >> >> non-NULL indicates a typical page cache page, this is a false
> >> >> assumption for DAX. My guess at a fix for this is to add
> >> >> pagecache_page() checks to locations like this, but I worry about how
> >> >> to find them all. Where pagecache_page() is:
> >> >>
> >> >> bool pagecache_page(struct page *page)
> >> >> {
> >> >>	if (!page->mapping)
> >> >>		return false;
> >> >>	if (!IS_DAX(page->mapping->host))
> >> >>		return false;
> >> >>	return true;
> >> >> }
> >> >>
> >> >> Otherwise we go off the rails:
> >> >>
> >> >> WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> >> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
> >> >
> >> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> >> > it? ->set_page_dirty handler for DAX mapping should just properly handle
> >> > DAX pages... (and only those)
> >>
> >> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
> >> enable all the address_space_operations to start firing. However,
> >> instead of adding DAX specific address_space_operations it appears
> >> ->mapping should never be set for DAX pages, because DAX pages are
> >> disconnected from the page-writeback machinery.
> >
> > page->mapping is not only about page-writeback machinery. It is generally
> > about page <-> inode relation and that still exists for DAX pages. We even
> > reuse the mapping->page_tree to store DAX pages. Also requiring proper
> > address_space_operations for DAX inodes is IMO not a bad thing as such.
> >
> > That being said I'm not 100% convinced we should really set page->mapping
> > for DAX pages. After all they are not page cache pages but rather a
> > physical storage for the data, don't ever get to LRU, etc. But if you need
> > page->inode relation somewhere, that is a good indication to me that it
> > might be just easier to set page->mapping and provide aops that do the
> > right thing (i.e. usually not much) for them.
> >
> > BTW: the ->set_page_dirty() in particular actually *does* need to do
> > something for DAX pages - corresponding radix tree entries should be
> > marked dirty so that caches can get flushed when needed.
>
> For this specific concern, the get_user_pages() path will have
> triggered mkwrite, so the dax dirty tracking in the radix will have
> already happened by the time we call ->set_page_dirty(). So, it's not
> yet clear to me that we need that particular op.

Right, but it would still be nice to at least verify in ->set_page_dirty()
that the DAX page is marked as dirty in the radix tree (that may be
slightly expensive to keep forever but probably still worth it at least
for some time).

BTW: You just made me realize get_user_pages() is likely the path I have
been looking for for about a year which can indeed dirty a mapped page
from a different path than page fault (I'm now speaking about the non-DAX
case but it translates to DAX as well). I have been getting sporadic
reports of filesystems (ext4 and xfs) crashing because a page was dirty
but the filesystem was not prepared for that (which happens in
->page_mkwrite or ->write_begin). What I think could happen is that
someone used an mmaped file as a buffer for DIO or something like that,
that triggered GUP, which triggered fault and dirtied a page. Then kswapd
came, wrote the page, writeprotected it in page tables, and reclaimed
buffers from the page. It could not free the page itself because DIO
still held a reference to it. And then DIO eventually called
bio_set_pages_dirty() which marked pages dirty and the filesystem
eventually barfed on these... I'll talk to MM people about this and how
it could possibly be fixed - but probably after Christmas so I'm writing
it here so that I don't forget :).

> >> In other words never
> >> setting ->mapping bypasses all the possible broken assumptions and
> >> code paths that take page-cache specific actions before calling an
> >> address_space_operation.
> >
> > If there are any assumptions left after aops are set properly, then we can
> > reconsider this but for now setting ->mapping and proper aops looks cleaner
> > to me...
>
> I'll try an address_space_operation with a nop ->set_page_dirty() and
> see if anything else falls out.

There are also other aops which would be good to NOPify. readpage and
writepage should probably directly BUG, write_begin and write_end as well
as we expect iomap to be used with DAX. Ditto for invalidatepage,
releasepage, freepage, launder_page, is_dirty_writeback, isolate_page as
we don't expect page reclaim for our pages. Other functions in there would
need some thought...

								Honza
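[Editor's note: Jan's list of aops to stub out could look something like the kernel-style sketch below. None of these names come from this series, the exact callback set varies across kernel versions, and this is an illustration of the idea only, not a proposed implementation.]

```c
/*
 * Sketch: aops for a DAX inode.  All I/O goes through the iomap path
 * and DAX pages never reach the LRU, so the page-cache entry points
 * should never fire; make it loud if they somehow do.
 */
static int dax_readpage(struct file *file, struct page *page)
{
	BUG();		/* reads are via iomap, never ->readpage */
}

static int dax_writepage(struct page *page, struct writeback_control *wbc)
{
	BUG();		/* writeback walks the radix tree, not pages */
}

static const struct address_space_operations dax_aops = {
	.readpage	= dax_readpage,
	.writepage	= dax_writepage,
	/* invalidatepage, releasepage, etc. similarly stubbed */
};
```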
diff --git a/fs/dax.c b/fs/dax.c
index ac6497dcfebd..fd5d385988d1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -297,6 +297,55 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct inode *inode)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->inode);
+		page->inode = inode;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->inode && page->inode != inode);
+		page->inode = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -403,6 +452,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping->host, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -452,6 +502,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping->host, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -529,6 +580,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	unsigned long pfn = pfn_t_to_pfn(pfn_t);
+	struct inode *inode = mapping->host;
 	pgoff_t index = vmf->pgoff;
 	void *new_entry;
@@ -548,6 +600,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, inode, false);
+		dax_associate_entry(new_entry, inode);
+	}
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 46f4ecf5479a..dd976851e8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,11 +118,21 @@ struct page {
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
-		struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an
-					 * lru or handled by a slab
-					 * allocator, this points to the
-					 * hosting device page map.
-					 */
+		struct {
+			/*
+			 * ZONE_DEVICE pages are never on an lru or handled by
+			 * a slab allocator, this points to the hosting device
+			 * page map.
+			 */
+			struct dev_pagemap *pgmap;
+			/*
+			 * inode association for MEMORY_DEVICE_FS_DAX page-idle
+			 * callbacks. Note that we don't use ->mapping since
+			 * that has hard coded page-cache assumptions in
+			 * several paths.
+			 */
+			struct inode *inode;
+		};
 		struct {	/* slub per cpu partial pages */
 			struct page *next;	/* Next partial slab */
 #ifdef CONFIG_64BIT
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 8a4ebfe9db4e..f9a2929fc310 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -441,13 +441,13 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct page *page = pfn_to_page(pfn);
 
 		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer. It is a bug if a ZONE_DEVICE page is ever
-		 * freed or placed on a driver-private list. Seed the
-		 * storage with LIST_POISON* values.
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+		 * and ->inode (for the MEMORY_DEVICE_FS_DAX case) association.
+		 * It is a bug if a ZONE_DEVICE page is ever freed or placed on
+		 * a driver-private list.
 		 */
-		list_del(&page->lru);
 		page->pgmap = pgmap;
+		page->inode = NULL;
 		percpu_ref_get(ref);
 		if (!(++i % 1024))
 			cond_resched();
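[Editor's note: the pfn walk at the heart of the patch can be modeled in userspace. The flag encoding below is an assumption for illustration only — the real code packs flags and pfn into a radix tree slot — but the sizing rule matches `dax_entry_size()` above: zero and empty entries cover no pages, PMD entries cover `HPAGE_SIZE`, and everything else covers one `PAGE_SIZE`.]

```c
#include <assert.h>

/* Illustrative encoding (assumption): low bits of an entry carry flags. */
#define DAX_ENTRY_PMD	(1UL << 0)
#define DAX_ENTRY_ZERO	(1UL << 1)
#define DAX_ENTRY_EMPTY	(1UL << 2)

#define MODEL_PAGE_SIZE		4096UL
#define MODEL_HPAGE_SIZE	(512 * MODEL_PAGE_SIZE)	/* x86-64 2MiB */

/* Mirrors the sizing rule in dax_entry_size() from the patch. */
static unsigned long model_dax_entry_size(unsigned long entry)
{
	if (entry & DAX_ENTRY_ZERO)
		return 0;
	if (entry & DAX_ENTRY_EMPTY)
		return 0;
	if (entry & DAX_ENTRY_PMD)
		return MODEL_HPAGE_SIZE;
	return MODEL_PAGE_SIZE;
}

/* How many pfns for_each_entry_pfn() would visit for this entry. */
static unsigned long model_dax_entry_pfns(unsigned long entry)
{
	return model_dax_entry_size(entry) / MODEL_PAGE_SIZE;
}
```

This makes the asymmetry concrete: a PMD entry associates 512 pages with the inode in one `dax_associate_entry()` call, while zero and empty entries touch no pages at all.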
Catch cases where truncate encounters pages that are still under active
dma. This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   56 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   20 ++++++++++++----
 kernel/memremap.c        |   10 ++++----
 3 files changed, 76 insertions(+), 10 deletions(-)