
[14/15] dax: associate mappings with inodes, and warn if dma collides with truncate

Message ID 150949217152.24061.9869502311102659784.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams Oct. 31, 2017, 11:22 p.m. UTC
Catch cases where truncate encounters pages that are still under active
dma. This warning is a canary for potential data corruption, as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   56 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   20 ++++++++++++----
 kernel/memremap.c        |   10 ++++----
 3 files changed, 76 insertions(+), 10 deletions(-)

Comments

Christoph Hellwig Nov. 10, 2017, 9:08 a.m. UTC | #1
> +		struct {
> +			/*
> +			 * ZONE_DEVICE pages are never on an lru or handled by
> +			 * a slab allocator, this points to the hosting device
> +			 * page map.
> +			 */
> +			struct dev_pagemap *pgmap;
> +			/*
> +			 * inode association for MEMORY_DEVICE_FS_DAX page-idle
> +			 * callbacks. Note that we don't use ->mapping since
> +			 * that has hard coded page-cache assumptions in
> +			 * several paths.
> +			 */

What assumptions?  I'd much rather fix those up than having two fields
that have the same functionality.
Dan Williams Dec. 20, 2017, 1:11 a.m. UTC | #2
On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> +             struct {
>> +                     /*
>> +                      * ZONE_DEVICE pages are never on an lru or handled by
>> +                      * a slab allocator, this points to the hosting device
>> +                      * page map.
>> +                      */
>> +                     struct dev_pagemap *pgmap;
>> +                     /*
>> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> +                      * callbacks. Note that we don't use ->mapping since
>> +                      * that has hard coded page-cache assumptions in
>> +                      * several paths.
>> +                      */
>
> What assumptions?  I'd much rather fix those up than having two fields
> that have the same functionality.

[ Reviving this old thread where you asked why I introduce page->inode
instead of reusing page->mapping ]

For example, xfs_vm_set_page_dirty() assumes that page->mapping being
non-NULL indicates a typical page cache page, this is a false
assumption for DAX. My guess at a fix for this is to add
pagecache_page() checks to locations like this, but I worry about how
to find them all. Where pagecache_page() is:

bool pagecache_page(struct page *page)
{
        if (!page->mapping)
                return false;
        if (!IS_DAX(page->mapping->host))
                return false;
        return true;
}

Otherwise we go off the rails:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O
4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96
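For illustration, a guard like the following (hypothetical sketch only, not
part of the posted series; it tests for DAX directly rather than going
through pagecache_page()) could sit at the top of xfs_vm_set_page_dirty()
and bail out before the bufferhead walk that trips the warning above:

static bool dax_page(struct page *page)
{
        /* a DAX page carries no bufferheads, so the buffer-walking
         * parts of ->set_page_dirty() must be skipped for it */
        return page->mapping && IS_DAX(page->mapping->host);
}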
Jan Kara Dec. 20, 2017, 2:38 p.m. UTC | #3
On Tue 19-12-17 17:11:38, Dan Williams wrote:
> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +             struct {
> >> +                     /*
> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> +                      * a slab allocator, this points to the hosting device
> >> +                      * page map.
> >> +                      */
> >> +                     struct dev_pagemap *pgmap;
> >> +                     /*
> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> +                      * callbacks. Note that we don't use ->mapping since
> >> +                      * that has hard coded page-cache assumptions in
> >> +                      * several paths.
> >> +                      */
> >
> > What assumptions?  I'd much rather fix those up than having two fields
> > that have the same functionality.
> 
> [ Reviving this old thread where you asked why I introduce page->inode
> instead of reusing page->mapping ]
> 
> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> non-NULL indicates a typical page cache page, this is a false
> assumption for DAX. My guess at a fix for this is to add
> pagecache_page() checks to locations like this, but I worry about how
> to find them all. Where pagecache_page() is:
> 
> bool pagecache_page(struct page *page)
> {
>         if (!page->mapping)
>                 return false;
>         if (!IS_DAX(page->mapping->host))
>                 return false;
>         return true;
> }
> 
> Otherwise we go off the rails:
> 
>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]

But this just shows that mapping->a_ops are wrong for this mapping, doesn't
it? ->set_page_dirty handler for DAX mapping should just properly handle
DAX pages... (and only those)

>  [..]
>  CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O
> 4.15.0-rc2+ #984
>  [..]
>  Call Trace:
>   set_page_dirty_lock+0x40/0x60
>   bio_set_pages_dirty+0x37/0x50
>   iomap_dio_actor+0x2b7/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   iomap_apply+0xa4/0x110
>   iomap_dio_rw+0x29e/0x3b0
>   ? iomap_dio_zero+0x110/0x110
>   ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>   xfs_file_read_iter+0xa0/0xc0 [xfs]
>   __vfs_read+0xf9/0x170
>   vfs_read+0xa6/0x150
>   SyS_pread64+0x93/0xb0
>   entry_SYSCALL_64_fastpath+0x1f/0x96

								Honza
Dave Chinner Dec. 20, 2017, 10:14 p.m. UTC | #4
On Tue, Dec 19, 2017 at 05:11:38PM -0800, Dan Williams wrote:
> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> +             struct {
> >> +                     /*
> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> +                      * a slab allocator, this points to the hosting device
> >> +                      * page map.
> >> +                      */
> >> +                     struct dev_pagemap *pgmap;
> >> +                     /*
> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> +                      * callbacks. Note that we don't use ->mapping since
> >> +                      * that has hard coded page-cache assumptions in
> >> +                      * several paths.
> >> +                      */
> >
> > What assumptions?  I'd much rather fix those up than having two fields
> > that have the same functionality.
> 
> [ Reviving this old thread where you asked why I introduce page->inode
> instead of reusing page->mapping ]
> 
> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> non-NULL indicates a typical page cache page, this is a false
> assumption for DAX.

That means every single filesystem has an incorrect assumption for
DAX pages. xfs_vm_set_page_dirty() is derived directly from
__set_page_dirty_buffers(), which is the default function that
set_page_dirty() calls to do its work. Indeed, ext4 also calls
__set_page_dirty_buffers(), so whatever problem XFS has here with
DAX and racing truncates is going to manifest in ext4 as well.

> My guess at a fix for this is to add
> pagecache_page() checks to locations like this, but I worry about how
> to find them all. Where pagecache_page() is:
> 
> bool pagecache_page(struct page *page)
> {
>         if (!page->mapping)
>                 return false;
>         if (!IS_DAX(page->mapping->host))
>                 return false;
>         return true;
> }

This is likely to be a problem in lots more places if we have to
treat "has page been truncated away" race checks on dax mappings
differently to page cache mappings. This smells of a whack-a-mole
style bandaid to me....

Cheers,

Dave.
Dan Williams Dec. 20, 2017, 10:41 p.m. UTC | #5
On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 19-12-17 17:11:38, Dan Williams wrote:
>> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> +             struct {
>> >> +                     /*
>> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
>> >> +                      * a slab allocator, this points to the hosting device
>> >> +                      * page map.
>> >> +                      */
>> >> +                     struct dev_pagemap *pgmap;
>> >> +                     /*
>> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> >> +                      * callbacks. Note that we don't use ->mapping since
>> >> +                      * that has hard coded page-cache assumptions in
>> >> +                      * several paths.
>> >> +                      */
>> >
>> > What assumptions?  I'd much rather fix those up than having two fields
>> > that have the same functionality.
>>
>> [ Reviving this old thread where you asked why I introduce page->inode
>> instead of reusing page->mapping ]
>>
>> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
>> non-NULL indicates a typical page cache page, this is a false
>> assumption for DAX. My guess at a fix for this is to add
>> pagecache_page() checks to locations like this, but I worry about how
>> to find them all. Where pagecache_page() is:
>>
>> bool pagecache_page(struct page *page)
>> {
>>         if (!page->mapping)
>>                 return false;
>>         if (!IS_DAX(page->mapping->host))
>>                 return false;
>>         return true;
>> }
>>
>> Otherwise we go off the rails:
>>
>>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>
> But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> it? ->set_page_dirty handler for DAX mapping should just properly handle
> DAX pages... (and only those)

Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
enable all the address_space_operations to start firing. However,
instead of adding DAX specific address_space_operations it appears
->mapping should never be set for DAX pages, because DAX pages are
disconnected from the page-writeback machinery. In other words never
setting ->mapping bypasses all the possible broken assumptions and
code paths that take page-cache specific actions before calling an
address_space_operation.

>
>>  [..]
>>  CPU: 27 PID: 1783 Comm: dma-collision Tainted: G           O
>> 4.15.0-rc2+ #984
>>  [..]
>>  Call Trace:
>>   set_page_dirty_lock+0x40/0x60
>>   bio_set_pages_dirty+0x37/0x50
>>   iomap_dio_actor+0x2b7/0x3b0
>>   ? iomap_dio_zero+0x110/0x110
>>   iomap_apply+0xa4/0x110
>>   iomap_dio_rw+0x29e/0x3b0
>>   ? iomap_dio_zero+0x110/0x110
>>   ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>>   xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
>>   xfs_file_read_iter+0xa0/0xc0 [xfs]
>>   __vfs_read+0xf9/0x170
>>   vfs_read+0xa6/0x150
>>   SyS_pread64+0x93/0xb0
>>   entry_SYSCALL_64_fastpath+0x1f/0x96
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Dec. 21, 2017, 12:14 p.m. UTC | #6
On Wed 20-12-17 14:41:14, Dan Williams wrote:
> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> +             struct {
> >> >> +                     /*
> >> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> >> +                      * a slab allocator, this points to the hosting device
> >> >> +                      * page map.
> >> >> +                      */
> >> >> +                     struct dev_pagemap *pgmap;
> >> >> +                     /*
> >> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> >> +                      * callbacks. Note that we don't use ->mapping since
> >> >> +                      * that has hard coded page-cache assumptions in
> >> >> +                      * several paths.
> >> >> +                      */
> >> >
> >> > What assumptions?  I'd much rather fix those up than having two fields
> >> > that have the same functionality.
> >>
> >> [ Reviving this old thread where you asked why I introduce page->inode
> >> instead of reusing page->mapping ]
> >>
> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> >> non-NULL indicates a typical page cache page, this is a false
> >> assumption for DAX. My guess at a fix for this is to add
> >> pagecache_page() checks to locations like this, but I worry about how
> >> to find them all. Where pagecache_page() is:
> >>
> >> bool pagecache_page(struct page *page)
> >> {
> >>         if (!page->mapping)
> >>                 return false;
> >>         if (!IS_DAX(page->mapping->host))
> >>                 return false;
> >>         return true;
> >> }
> >>
> >> Otherwise we go off the rails:
> >>
> >>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
> >
> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> > it? ->set_page_dirty handler for DAX mapping should just properly handle
> > DAX pages... (and only those)
> 
> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
> enable all the address_space_operations to start firing. However,
> instead of adding DAX specific address_space_operations it appears
> ->mapping should never be set for DAX pages, because DAX pages are
> disconnected from the page-writeback machinery.

page->mapping is not only about page-writeback machinery. It is generally
about page <-> inode relation and that still exists for DAX pages. We even
reuse the mapping->page_tree to store DAX pages. Also requiring proper
address_space_operations for DAX inodes is IMO not a bad thing as such.

That being said I'm not 100% convinced we should really set page->mapping
for DAX pages. After all they are not page cache pages but rather a
physical storage for the data, don't ever get to LRU, etc. But if you need
page->inode relation somewhere, that is a good indication to me that it
might be just easier to set page->mapping and provide aops that do the
right thing (i.e. usually not much) for them.

BTW: the ->set_page_dirty() in particular actually *does* need to do
something for DAX pages - corresponding radix tree entries should be
marked dirty so that caches can get flushed when needed.
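Something like the following sketch (illustrative only, and assuming the
DAX page's file offset is available, here via page->index, which the
posted series does not yet establish):

static int dax_set_page_dirty(struct page *page)
{
        struct address_space *mapping = page->mapping;
        pgoff_t index = page->index;

        if (!mapping)
                return 0;

        /* tag the DAX radix tree entry so that a later
         * dax_writeback_mapping_range() flushes CPU caches for it */
        spin_lock_irq(&mapping->tree_lock);
        if (radix_tree_lookup(&mapping->page_tree, index))
                radix_tree_tag_set(&mapping->page_tree, index,
                                   PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);
        return 1;
}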

> In other words never
> setting ->mapping bypasses all the possible broken assumptions and
> code paths that take page-cache specific actions before calling an
> address_space_operation.

If there are any assumptions left after aops are set properly, then we can
reconsider this but for now setting ->mapping and proper aops looks cleaner
to me...

								Honza
Dan Williams Dec. 21, 2017, 5:31 p.m. UTC | #7
On Thu, Dec 21, 2017 at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 20-12-17 14:41:14, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
>> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> >> +             struct {
>> >> >> +                     /*
>> >> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
>> >> >> +                      * a slab allocator, this points to the hosting device
>> >> >> +                      * page map.
>> >> >> +                      */
>> >> >> +                     struct dev_pagemap *pgmap;
>> >> >> +                     /*
>> >> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
>> >> >> +                      * callbacks. Note that we don't use ->mapping since
>> >> >> +                      * that has hard coded page-cache assumptions in
>> >> >> +                      * several paths.
>> >> >> +                      */
>> >> >
>> >> > What assumptions?  I'd much rather fix those up than having two fields
>> >> > that have the same functionality.
>> >>
>> >> [ Reviving this old thread where you asked why I introduce page->inode
>> >> instead of reusing page->mapping ]
>> >>
>> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
>> >> non-NULL indicates a typical page cache page, this is a false
>> >> assumption for DAX. My guess at a fix for this is to add
>> >> pagecache_page() checks to locations like this, but I worry about how
>> >> to find them all. Where pagecache_page() is:
>> >>
>> >> bool pagecache_page(struct page *page)
>> >> {
>> >>         if (!page->mapping)
>> >>                 return false;
>> >>         if (!IS_DAX(page->mapping->host))
>> >>                 return false;
>> >>         return true;
>> >> }
>> >>
>> >> Otherwise we go off the rails:
>> >>
>> >>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
>> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
>> >
>> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
>> > it? ->set_page_dirty handler for DAX mapping should just properly handle
>> > DAX pages... (and only those)
>>
>> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
>> enable all the address_space_operations to start firing. However,
>> instead of adding DAX specific address_space_operations it appears
>> ->mapping should never be set for DAX pages, because DAX pages are
>> disconnected from the page-writeback machinery.
>
> page->mapping is not only about page-writeback machinery. It is generally
> about page <-> inode relation and that still exists for DAX pages. We even
> reuse the mapping->page_tree to store DAX pages. Also requiring proper
> address_space_operations for DAX inodes is IMO not a bad thing as such.
>
> That being said I'm not 100% convinced we should really set page->mapping
> for DAX pages. After all they are not page cache pages but rather a
> physical storage for the data, don't ever get to LRU, etc. But if you need
> page->inode relation somewhere, that is a good indication to me that it
> might be just easier to set page->mapping and provide aops that do the
> right thing (i.e. usually not much) for them.
>
> BTW: the ->set_page_dirty() in particular actually *does* need to do
> something for DAX pages - corresponding radix tree entries should be
> marked dirty so that caches can get flushed when needed.

For this specific concern, the get_user_pages() path will have
triggered mkwrite, so the dax dirty tracking in the radix will have
already happened by the time we call ->set_page_dirty(). So, it's not
yet clear to me that we need that particular op.

>> In other words never
>> setting ->mapping bypasses all the possible broken assumptions and
>> code paths that take page-cache specific actions before calling an
>> address_space_operation.
>
> If there are any assumptions left after aops are set properly, then we can
> reconsider this but for now setting ->mapping and proper aops looks cleaner
> to me...

I'll try an address_space_operation with a nop ->set_page_dirty() and
see if anything else falls out.
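Roughly something like this (sketch only; the aops name and the wiring are
hypothetical at this point):

static int dax_nop_set_page_dirty(struct page *page)
{
        /* DAX dirty tracking already happened at mkwrite time */
        return 0;
}

static const struct address_space_operations xfs_dax_aops = {
        .set_page_dirty         = dax_nop_set_page_dirty,
};

...with xfs_setup_iops() selecting xfs_dax_aops instead of
xfs_address_space_operations when IS_DAX(inode) is set.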
Jan Kara Dec. 22, 2017, 8:51 a.m. UTC | #8
On Thu 21-12-17 09:31:50, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 20-12-17 14:41:14, Dan Williams wrote:
> >> On Wed, Dec 20, 2017 at 6:38 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Tue 19-12-17 17:11:38, Dan Williams wrote:
> >> >> On Fri, Nov 10, 2017 at 1:08 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> >> +             struct {
> >> >> >> +                     /*
> >> >> >> +                      * ZONE_DEVICE pages are never on an lru or handled by
> >> >> >> +                      * a slab allocator, this points to the hosting device
> >> >> >> +                      * page map.
> >> >> >> +                      */
> >> >> >> +                     struct dev_pagemap *pgmap;
> >> >> >> +                     /*
> >> >> >> +                      * inode association for MEMORY_DEVICE_FS_DAX page-idle
> >> >> >> +                      * callbacks. Note that we don't use ->mapping since
> >> >> >> +                      * that has hard coded page-cache assumptions in
> >> >> >> +                      * several paths.
> >> >> >> +                      */
> >> >> >
> >> >> > What assumptions?  I'd much rather fix those up than having two fields
> >> >> > that have the same functionality.
> >> >>
> >> >> [ Reviving this old thread where you asked why I introduce page->inode
> >> >> instead of reusing page->mapping ]
> >> >>
> >> >> For example, xfs_vm_set_page_dirty() assumes that page->mapping being
> >> >> non-NULL indicates a typical page cache page, this is a false
> >> >> assumption for DAX. My guess at a fix for this is to add
> >> >> pagecache_page() checks to locations like this, but I worry about how
> >> >> to find them all. Where pagecache_page() is:
> >> >>
> >> >> bool pagecache_page(struct page *page)
> >> >> {
> >> >>         if (!page->mapping)
> >> >>                 return false;
> >> >>         if (!IS_DAX(page->mapping->host))
> >> >>                 return false;
> >> >>         return true;
> >> >> }
> >> >>
> >> >> Otherwise we go off the rails:
> >> >>
> >> >>  WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
> >> >> xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
> >> >
> >> > But this just shows that mapping->a_ops are wrong for this mapping, doesn't
> >> > it? ->set_page_dirty handler for DAX mapping should just properly handle
> >> > DAX pages... (and only those)
> >>
> >> Ah, yes. Now that I change ->mapping to be non-NULL for DAX pages I
> >> enable all the address_space_operations to start firing. However,
> >> instead of adding DAX specific address_space_operations it appears
> >> ->mapping should never be set for DAX pages, because DAX pages are
> >> disconnected from the page-writeback machinery.
> >
> > page->mapping is not only about page-writeback machinery. It is generally
> > about page <-> inode relation and that still exists for DAX pages. We even
> > reuse the mapping->page_tree to store DAX pages. Also requiring proper
> > address_space_operations for DAX inodes is IMO not a bad thing as such.
> >
> > That being said I'm not 100% convinced we should really set page->mapping
> > for DAX pages. After all they are not page cache pages but rather a
> > physical storage for the data, don't ever get to LRU, etc. But if you need
> > page->inode relation somewhere, that is a good indication to me that it
> > might be just easier to set page->mapping and provide aops that do the
> > right thing (i.e. usually not much) for them.
> >
> > BTW: the ->set_page_dirty() in particular actually *does* need to do
> > something for DAX pages - corresponding radix tree entries should be
> > marked dirty so that caches can get flushed when needed.
> 
> For this specific concern, the get_user_pages() path will have
> triggered mkwrite, so the dax dirty tracking in the radix will have
> already happened by the time we call ->set_page_dirty(). So, it's not
> yet clear to me that we need that particular op.

Right, but it would still be nice to at least verify in ->set_page_dirty()
that the DAX page is marked as dirty in the radix tree (that may be
slightly expensive to keep forever but probably still worth it at least for
some time).

BTW: You just made me realize get_user_pages() is likely the path I have
been looking for, for about a year, which can indeed dirty a mapped page
from a path other than a page fault (I'm speaking about the non-DAX case
now, but it translates to DAX as well). I have been getting sporadic
reports of filesystems (ext4 and xfs) crashing because a page was dirty
but the filesystem was not prepared for that (the preparation happens in
->page_mkwrite or ->write_begin). What I think could happen is that
someone used an mmapped file as a buffer for DIO or something like that,
which triggered GUP, which triggered a fault and dirtied a page. Then
kswapd came, wrote the page, write-protected it in the page tables, and
reclaimed the buffers from the page. It could not free the page itself
because DIO still held a reference to it. And then DIO eventually called
bio_set_pages_dirty(), which marked the pages dirty, and the filesystem
eventually barfed on them... I'll talk to MM people about this and how it
could possibly be fixed - but probably after Christmas, so I'm writing it
here so that I don't forget :).

> >> In other words never
> >> setting ->mapping bypasses all the possible broken assumptions and
> >> code paths that take page-cache specific actions before calling an
> >> address_space_operation.
> >
> > If there are any assumptions left after aops are set properly, then we can
> > reconsider this but for now setting ->mapping and proper aops looks cleaner
> > to me...
> 
> I'll try an address_space_operation with a nop ->set_page_dirty() and
> see if anything else falls out.

There are also other aops which would be good to NOPify. readpage and
writepage should probably BUG directly, and write_begin and write_end as
well, since we expect iomap to be used with DAX. Ditto for invalidatepage,
releasepage, freepage, launder_page, is_dirty_writeback, and isolate_page,
as we don't expect page reclaim for our pages. Other functions in there
would need some thought...
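E.g. something like (sketch only, names made up):

static int dax_no_readpage(struct file *file, struct page *page)
{
        /* DAX inodes read through iomap, never through ->readpage */
        BUG();
        return -EIO;
}

static int dax_no_writepage(struct page *page, struct writeback_control *wbc)
{
        /* DAX writeback walks the radix tree, not individual pages */
        BUG();
        return -EIO;
}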

								Honza

Patch

diff --git a/fs/dax.c b/fs/dax.c
index ac6497dcfebd..fd5d385988d1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -297,6 +297,55 @@  static void put_unlocked_mapping_entry(struct address_space *mapping,
 	dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return HPAGE_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+	for (pfn = dax_radix_pfn(entry), \
+			end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+			pfn < end_pfn; \
+			pfn++)
+
+static void dax_associate_entry(void *entry, struct inode *inode)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(page->inode);
+		page->inode = inode;
+	}
+}
+
+static void dax_disassociate_entry(void *entry, struct inode *inode, bool trunc)
+{
+	unsigned long pfn, end_pfn;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	for_each_entry_pfn(entry, pfn, end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(page->inode && page->inode != inode);
+		page->inode = NULL;
+	}
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -403,6 +452,7 @@  static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
 		}
 
 		if (pmd_downgrade) {
+			dax_disassociate_entry(entry, mapping->host, false);
 			radix_tree_delete(&mapping->page_tree, index);
 			mapping->nrexceptional--;
 			dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -452,6 +502,7 @@  static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+	dax_disassociate_entry(entry, mapping->host, trunc);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -529,6 +580,7 @@  static void *dax_insert_mapping_entry(struct address_space *mapping,
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 	unsigned long pfn = pfn_t_to_pfn(pfn_t);
+	struct inode *inode = mapping->host;
 	pgoff_t index = vmf->pgoff;
 	void *new_entry;
 
@@ -548,6 +600,10 @@  static void *dax_insert_mapping_entry(struct address_space *mapping,
 
 	spin_lock_irq(&mapping->tree_lock);
 	new_entry = dax_radix_locked_entry(pfn, flags);
+	if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+		dax_disassociate_entry(entry, inode, false);
+		dax_associate_entry(new_entry, inode);
+	}
 
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 46f4ecf5479a..dd976851e8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,11 +118,21 @@  struct page {
 					 * Can be used as a generic list
 					 * by the page owner.
 					 */
-		struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an
-					    * lru or handled by a slab
-					    * allocator, this points to the
-					    * hosting device page map.
-					    */
+		struct {
+			/*
+			 * ZONE_DEVICE pages are never on an lru or handled by
+			 * a slab allocator, this points to the hosting device
+			 * page map.
+			 */
+			struct dev_pagemap *pgmap;
+			/*
+			 * inode association for MEMORY_DEVICE_FS_DAX page-idle
+			 * callbacks. Note that we don't use ->mapping since
+			 * that has hard coded page-cache assumptions in
+			 * several paths.
+			 */
+			struct inode *inode;
+		};
 		struct {		/* slub per cpu partial pages */
 			struct page *next;	/* Next partial slab */
 #ifdef CONFIG_64BIT
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 8a4ebfe9db4e..f9a2929fc310 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -441,13 +441,13 @@  void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct page *page = pfn_to_page(pfn);
 
 		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
-		 * freed or placed on a driver-private list.  Seed the
-		 * storage with LIST_POISON* values.
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+		 * and ->inode (for the MEMORY_DEVICE_FS_DAX case) association.
+		 * It is a bug if a ZONE_DEVICE page is ever freed or placed on
+		 * a driver-private list.
 		 */
-		list_del(&page->lru);
 		page->pgmap = pgmap;
+		page->inode = NULL;
 		percpu_ref_get(ref);
 		if (!(++i % 1024))
 			cond_resched();