
[v3,12/13] dax: handle truncate of dma-busy pages

Message ID: 150846720244.24336.16885325309403883980.stgit@dwillia2-desk3.amr.corp.intel.com
State: New, archived

Commit Message

Dan Williams Oct. 20, 2017, 2:40 a.m. UTC
get_user_pages() pins file-backed memory pages for access by dma
devices. However, it only pins the memory pages, not the page-to-file
offset association. If a file is truncated, the pages are mapped out of
the file and dma may continue indefinitely into a page that is owned by
a device driver. This breaks coherency between the file and dma, but the
assumption is that if userspace wants the file-space truncated it does
not matter what data is inbound from the device; it is not relevant
anymore.
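
For illustration, the problematic sequence looks roughly like this (a
hypothetical sketch; the mount path and the driver ioctl are invented):

    int fd = open("/mnt/dax/data", O_RDWR);
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* the driver does get_user_pages() on buf and starts device dma;
     * the pages are pinned, the page-to-file association is not */
    ioctl(dev_fd, EXAMPLE_START_DMA, buf);

    /* elsewhere: the pages are unmapped from the file while dma into
     * them continues */
    ftruncate(fd, 0);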

The assumptions of the truncate-page-cache model are broken by DAX where
the target DMA page *is* the filesystem block. Leaving the page pinned
for DMA, but truncating the file block out of the file, means that the
filesystem is free to reallocate a block under active DMA to another
file!

Here are some possible options for fixing this situation ('truncate' and
'fallocate(punch hole)' are synonymous below):

    1/ Fail truncate while any file blocks might be under dma

    2/ Block (sleep-wait) truncate while any file blocks might be under
       dma

    3/ Remap file blocks to a "lost+found"-like file-inode where
       dma can continue and we might see what inbound data from DMA was
       mapped out of the original file. Blocks in this file could be
       freed back to the filesystem when dma eventually ends.

    4/ Disable dax until option 3 or another long term solution has been
       implemented. However, filesystem-dax is still marked experimental
       for concerns like this.

Option 1 will throw failures where userspace has never expected them
before, option 2 might hang the truncating process indefinitely, and
option 3 requires per filesystem enabling to remap blocks from one inode
to another.  Option 2 is implemented in this patch for the DAX path with
the expectation that non-transient users of get_user_pages() (RDMA) are
disallowed from setting up dax mappings and that the potential delay
introduced to the truncate path is acceptable compared to the response
time of the page cache case. This can only be seen as a stop-gap until
we can solve the problem of safely sequestering unallocated filesystem
blocks under active dma.
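
As a rough sketch, a DAX filesystem's truncate path would then wait out
the new lease before freeing blocks, via the break_allocated() helper
added below (illustrative only; the real hook point is filesystem
specific):

    /* sketch: a dax-aware setattr(ATTR_SIZE) path */
    static int example_dax_truncate(struct inode *inode, loff_t newsize)
    {
    	int error;

    	/* sleep until FL_LAYOUT/FL_ALLOCATED leases are released */
    	error = break_allocated(inode, true);
    	if (error)
    		return error;

    	truncate_setsize(inode, newsize);
    	/* ... free the now-idle blocks ... */
    	return 0;
    }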

The solution introduces a new FL_ALLOCATED lease to pin the allocated
blocks in a dax file while dma might be accessing them. It behaves
identically to an FL_LAYOUT lease save for the fact that it is
immediately scheduled to be reaped, and that the only path that waits for
its removal is the truncate path. We cannot reuse FL_LAYOUT directly
since that would deadlock in the case where userspace did a direct-I/O
operation with a target buffer backed by an mmap range of the same file.
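
Roughly, that self-deadlock would look like this (hypothetical sketch;
the path is invented):

    int fd = open("/mnt/dax/data", O_RDWR | O_DIRECT);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* the direct-I/O path breaks FL_LAYOUT leases on this inode, and
     * get_user_pages() on buf would then try to install another
     * FL_LAYOUT lease on the very same inode from within the same
     * operation */
    pwrite(fd, buf, len, 0);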

Credit / inspiration for option 3 goes to Dave Hansen, who proposed
something similar as an alternative way to solve the problem that
MAP_DIRECT was trying to solve.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Kconfig          |    1 
 fs/dax.c            |  188 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/locks.c          |   17 ++++-
 include/linux/dax.h |   23 ++++++
 include/linux/fs.h  |   22 +++++-
 mm/gup.c            |   27 ++++++-
 6 files changed, 268 insertions(+), 10 deletions(-)



Comments

Jeff Layton Oct. 20, 2017, 1:05 p.m. UTC | #1
On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
> [...]
> Here are some possible options for fixing this situation ('truncate' and
> 'fallocate(punch hole)' are synonymous below):
> 
>     1/ Fail truncate while any file blocks might be under dma
> 
>     2/ Block (sleep-wait) truncate while any file blocks might be under
>        dma
> 
>     3/ Remap file blocks to a "lost+found"-like file-inode where
>        dma can continue and we might see what inbound data from DMA was
>        mapped out of the original file. Blocks in this file could be
>        freed back to the filesystem when dma eventually ends.
> 
>     4/ Disable dax until option 3 or another long term solution has been
>        implemented. However, filesystem-dax is still marked experimental
>        for concerns like this.
> 
> Option 1 will throw failures where userspace has never expected them
> before, option 2 might hang the truncating process indefinitely, and
> option 3 requires per filesystem enabling to remap blocks from one inode
> to another.  Option 2 is implemented in this patch for the DAX path with
> the expectation that non-transient users of get_user_pages() (RDMA) are
> disallowed from setting up dax mappings and that the potential delay
> introduced to the truncate path is acceptable compared to the response
> time of the page cache case. This can only be seen as a stop-gap until
> we can solve the problem of safely sequestering unallocated filesystem
> blocks under active dma.
> 

FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
more work though, so no objection to this as a stop-gap fix.


> [...]
> +static void put_dax_lease(struct dax_lease *dl)
> +{
> +	if (atomic_dec_and_test(&dl->dl_count)) {
> +		fput(dl->dl_file);
> +		kfree(dl);
> +	}
> +}

Any reason not to use the new refcount_t type for dl_count? Seems like a
good place for it.
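
A sketch of that conversion (untested, not part of the posted patch);
as a bonus, refcount_inc() would warn on exactly the 0->1 transition
discussed below:

    #include <linux/refcount.h>

    /* in struct dax_lease: */
    	refcount_t dl_count;	/* was: atomic_t dl_count; */

    /* at lease setup and in lm_break: */
    	refcount_set(&dl->dl_count, 2);
    	refcount_inc(&dl->dl_count);

    static void put_dax_lease(struct dax_lease *dl)
    {
    	/* refcount_dec_and_test() warns on underflow, unlike atomic_t */
    	if (refcount_dec_and_test(&dl->dl_count)) {
    		fput(dl->dl_file);
    		kfree(dl);
    	}
    }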

> [...]
> +static bool dax_lease_lm_break(struct file_lock *fl)
> +{
> +	struct dax_lease *dl = fl->fl_owner;
> +
> +	if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
> +		atomic_inc(&dl->dl_count);
> +		schedule_delayed_work(&dl->dl_break_work, HZ);
> +	}
> +

I haven't gone over this completely, but what prevents you from doing a
0->1 transition on the dl_count here, and possibly doing a
use-after-free?

Ahh ok...I guess we know that we hold a reference since this is on the
flc_lease list? Fair enough. Still, might be worth a comment there as to
why that's safe.


> [...]

Dan Williams Oct. 20, 2017, 3:42 p.m. UTC | #2
On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@kernel.org> wrote:
> On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote:
>> [...]
>> Here are some possible options for fixing this situation ('truncate' and
>> 'fallocate(punch hole)' are synonymous below):
>>
>>     1/ Fail truncate while any file blocks might be under dma
>>
>>     2/ Block (sleep-wait) truncate while any file blocks might be under
>>        dma
>>
>>     3/ Remap file blocks to a "lost+found"-like file-inode where
>>        dma can continue and we might see what inbound data from DMA was
>>        mapped out of the original file. Blocks in this file could be
>>        freed back to the filesystem when dma eventually ends.
>>
>>     4/ Disable dax until option 3 or another long term solution has been
>>        implemented. However, filesystem-dax is still marked experimental
>>        for concerns like this.
>>
>> Option 1 will throw failures where userspace has never expected them
>> before, option 2 might hang the truncating process indefinitely, and
>> option 3 requires per filesystem enabling to remap blocks from one inode
>> to another.  Option 2 is implemented in this patch for the DAX path with
>> the expectation that non-transient users of get_user_pages() (RDMA) are
>> disallowed from setting up dax mappings and that the potential delay
>> introduced to the truncate path is acceptable compared to the response
>> time of the page cache case. This can only be seen as a stop-gap until
>> we can solve the problem of safely sequestering unallocated filesystem
>> blocks under active dma.
>>
>
> FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit
> more work though, so no objection to this as a stop-gap fix.

I agree, but it needs quite a bit more thought and restructuring of
the truncate path. I also wonder how we reclaim those stranded
filesystem blocks; a first approximation is to wait for the
administrator to delete them, or to auto-delete them at the next mount.
XFS seems well prepared to reflink-swap these DMA blocks around, but
I'm not sure about EXT4.

>> [...]
>> +static void put_dax_lease(struct dax_lease *dl)
>> +{
>> +     if (atomic_dec_and_test(&dl->dl_count)) {
>> +             fput(dl->dl_file);
>> +             kfree(dl);
>> +     }
>> +}
>
> Any reason not to use the new refcount_t type for dl_count? Seems like a
> good place for it.

I'll take a look.

>> [...]
>> +static bool dax_lease_lm_break(struct file_lock *fl)
>> +{
>> +     struct dax_lease *dl = fl->fl_owner;
>> +
>> +     if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
>> +             atomic_inc(&dl->dl_count);
>> +             schedule_delayed_work(&dl->dl_break_work, HZ);
>> +     }
>> +
>
> I haven't gone over this completely, but what prevents you from doing a
> 0->1 transition on the dl_count here, and possibly doing a
> use-after-free?
>
> Ahh ok...I guess we know that we hold a reference since this is on the
> flc_lease list? Fair enough. Still, might be worth a comment there as to
> why that's safe.

Right, we hold a reference taken at the beginning of time (lease
setup) that is only dropped when the lease is unlocked. If the break
happens before the unlock, we take an additional reference for as long
as the break_work is running. I'll add this as a comment.
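
Perhaps along these lines (a sketch of the comment only; the code is
unchanged from the patch):

    static bool dax_lease_lm_break(struct file_lock *fl)
    {
    	struct dax_lease *dl = fl->fl_owner;

    	/*
    	 * Safe to take a new reference here: the lease holds a
    	 * reference on @dl from __dax_truncate_lease() that is not
    	 * dropped until lm_change(F_UNLCK), so dl_count cannot be
    	 * zero while @fl is still on the flc_lease list.
    	 */
    	if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
    		atomic_inc(&dl->dl_count);
    		schedule_delayed_work(&dl->dl_break_work, HZ);
    	}
    	...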
Christoph Hellwig Oct. 20, 2017, 4:32 p.m. UTC | #3
On Fri, Oct 20, 2017 at 08:42:00AM -0700, Dan Williams wrote:
> I agree, but it needs quite a bit more thought and restructuring of
> the truncate path. I also wonder how we reclaim those stranded
> filesystem blocks, but a first approximation is wait for the
> administrator to delete them or auto-delete them at the next mount.
> XFS seems well prepared to reflink-swap these DMA blocks around, but
> I'm not sure about EXT4.

reflink is still an optional and experimental feature in XFS.  That
being said, we should not need to swap block pointers around on disk.
We just need to prevent the block allocator from reusing the blocks
for new allocations, and we have code for that, both for transactions
that haven't been committed to disk yet, and for deleted blocks
undergoing discard operations.

But as mentioned in my second mail from this morning, I'm not even
sure we need that.  For short-term elevated page counts like normal
get_user_pages users I think we can just wait for the page count
to reach zero, while for abuses of get_user_pages for long-term
memory pinning (not sure if anyone but rdma is doing that) we'll need
something like FL_LAYOUT leases to release the mapping.
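
A minimal sketch of that wait, reusing the patch's own busy test
(page_ref_count() > 1); how the filesystem finds its dma-busy pages is
left open here:

    /* sketch: block truncate until gup references on the pages drain */
    static void example_wait_dma_idle(struct page **pages, long nr)
    {
    	long i;

    	for (i = 0; i < nr; i++)
    		while (page_ref_count(pages[i]) > 1)
    			schedule_timeout_uninterruptible(HZ);
    }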
Dan Williams Oct. 20, 2017, 5:27 p.m. UTC | #4
On Fri, Oct 20, 2017 at 9:32 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Oct 20, 2017 at 08:42:00AM -0700, Dan Williams wrote:
> [...]
> But as mentioned in my second mail from this morning, I'm not even
> sure we need that.  For short-term elevated page counts like normal
> get_user_pages users I think we can just wait for the page count
> to reach zero, while for abuses of get_user_pages for long-term
> memory pinning (not sure if anyone but rdma is doing that) we'll need
> something like FL_LAYOUT leases to release the mapping.

I'll take a look at hooking this up through a page-idle callback. Can
I get some breadcrumbs to grep for from XFS folks on how to set/clear
the busy state of extents?
Brian Foster Oct. 20, 2017, 8:36 p.m. UTC | #5
On Fri, Oct 20, 2017 at 10:27:22AM -0700, Dan Williams wrote:
> [...]
> I'll take a look at hooking this up through a page-idle callback. Can
> I get some breadcrumbs to grep for from XFS folks on how to set/clear
> the busy state of extents?

See fs/xfs/xfs_extent_busy.c.

Brian
Christoph Hellwig Oct. 21, 2017, 8:11 a.m. UTC | #6
On Fri, Oct 20, 2017 at 10:27:22AM -0700, Dan Williams wrote:
> I'll take a look at hooking this up through a page-idle callback. Can
> I get some breadcrumbs to grep for from XFS folks on how to set/clear
> the busy state of extents?

As Brian pointed out, it's the xfs_extent_busy.c file (and I pointed
out the same in a reply to the previous series).  Be careful because
you'll need a refcount or flags now that there are different busy
reasons.
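
For reference, the main entry points there look roughly like this
(approximate signatures from this era, paraphrased from
fs/xfs/xfs_extent_busy.h):

    void xfs_extent_busy_insert(struct xfs_trans *tp, xfs_agnumber_t agno,
    		xfs_agblock_t bno, xfs_extlen_t len, unsigned int flags);
    void xfs_extent_busy_clear(struct xfs_mount *mp, struct list_head *list,
    		bool do_discard);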

I still think we'd be better off just blocking on an elevated page
count directly in truncate, as that will avoid all the busy-list
manipulations.

Patch

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@  source "fs/f2fs/Kconfig"
 config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
+	depends on FILE_LOCKING
 	depends on !(ARM || MIPS || SPARC)
 	select FS_IOMAP
 	select DAX
diff --git a/fs/dax.c b/fs/dax.c
index b03f547b36e7..e0a3958fc5f2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -22,6 +22,7 @@ 
 #include <linux/genhd.h>
 #include <linux/highmem.h>
 #include <linux/memcontrol.h>
+#include <linux/file.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
 #include <linux/pagevec.h>
@@ -1481,3 +1482,190 @@  int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	}
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+enum dax_lease_flags {
+	DAX_LEASE_PAGES,
+	DAX_LEASE_BREAK,
+};
+
+struct dax_lease {
+	struct page **dl_pages;
+	unsigned long dl_nr_pages;
+	unsigned long dl_state;
+	struct file *dl_file;
+	atomic_t dl_count;
+	/*
+	 * Once the lease is taken and the pages have references we
+	 * start the reap_work to poll for lease release while acquiring
+	 * fs locks that synchronize with truncate. So, either reap_work
+	 * cleans up the dax_lease instances or truncate itself.
+	 *
+	 * The break_work sleepily polls for DMA completion and then
+	 * unlocks/removes the lease.
+	 */
+	struct delayed_work dl_reap_work;
+	struct delayed_work dl_break_work;
+};
+
+static void put_dax_lease(struct dax_lease *dl)
+{
+	if (atomic_dec_and_test(&dl->dl_count)) {
+		fput(dl->dl_file);
+		kfree(dl);
+	}
+}
+
+static void dax_lease_unlock_one(struct work_struct *work)
+{
+	struct dax_lease *dl = container_of(work, typeof(*dl),
+			dl_break_work.work);
+	unsigned long i;
+
+	/* wait for the gup path to finish recording pages in the lease */
+	if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) {
+		schedule_delayed_work(&dl->dl_break_work, HZ);
+		return;
+	}
+
+	/* barrier pairs with dax_lease_set_pages() */
+	smp_mb__after_atomic();
+
+	/*
+	 * If we see all pages idle at least once we can remove the
+	 * lease. If we happen to race with someone else taking a
+	 * reference on a page they will have their own lease to protect
+	 * against truncate.
+	 */
+	for (i = 0; i < dl->dl_nr_pages; i++)
+		if (page_ref_count(dl->dl_pages[i]) > 1) {
+			schedule_delayed_work(&dl->dl_break_work, HZ);
+			return;
+		}
+	vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
+	put_dax_lease(dl);
+}
+
+static void dax_lease_reap_all(struct work_struct *work)
+{
+	struct dax_lease *dl = container_of(work, typeof(*dl),
+			dl_reap_work.work);
+	struct file *file = dl->dl_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = inode->i_mapping;
+
+	if (mapping->a_ops->dax_flush_dma) {
+		mapping->a_ops->dax_flush_dma(inode);
+	} else {
+		/* FIXME: dax-filesystem needs to add dax-dma support */
+		break_allocated(inode, true);
+	}
+	put_dax_lease(dl);
+}
+
+static bool dax_lease_lm_break(struct file_lock *fl)
+{
+	struct dax_lease *dl = fl->fl_owner;
+
+	if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) {
+		atomic_inc(&dl->dl_count);
+		schedule_delayed_work(&dl->dl_break_work, HZ);
+	}
+
+	/* Tell the core lease code to wait for delayed work completion */
+	fl->fl_break_time = 0;
+
+	return false;
+}
+
+static int dax_lease_lm_change(struct file_lock *fl, int arg,
+		struct list_head *dispose)
+{
+	struct dax_lease *dl;
+	int rc;
+
+	WARN_ON(!(arg & F_UNLCK));
+	dl = fl->fl_owner;
+	rc = lease_modify(fl, arg, dispose);
+	put_dax_lease(dl);
+	return rc;
+}
+
+static const struct lock_manager_operations dax_lease_lm_ops = {
+	.lm_break = dax_lease_lm_break,
+	.lm_change = dax_lease_lm_change,
+};
+
+struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct dax_lease *dl;
+	struct file_lock *fl;
+	int rc = -ENOMEM;
+
+	if (!vma_is_dax(vma))
+		return NULL;
+
+	/* device-dax can not be truncated */
+	if (!S_ISREG(inode->i_mode))
+		return NULL;
+
+	dl = kzalloc(sizeof(*dl) + sizeof(struct page *) * nr_pages, GFP_KERNEL);
+	if (!dl)
+		return ERR_PTR(-ENOMEM);
+
+	fl = locks_alloc_lock();
+	if (!fl)
+		goto err_lock_alloc;
+
+	dl->dl_pages = (struct page **)(dl + 1);
+	INIT_DELAYED_WORK(&dl->dl_break_work, dax_lease_unlock_one);
+	INIT_DELAYED_WORK(&dl->dl_reap_work, dax_lease_reap_all);
+	dl->dl_file = get_file(file);
+	/* need dl alive until dax_lease_set_pages() and final put */
+	atomic_set(&dl->dl_count, 2);
+
+	locks_init_lock(fl);
+	fl->fl_lmops = &dax_lease_lm_ops;
+	fl->fl_flags = FL_ALLOCATED;
+	fl->fl_type = F_RDLCK;
+	fl->fl_end = OFFSET_MAX;
+	fl->fl_owner = dl;
+	fl->fl_pid = current->tgid;
+	fl->fl_file = file;
+
+	rc = vfs_setlease(fl->fl_file, fl->fl_type, &fl, (void **) &dl);
+	if (rc)
+		goto err_setlease;
+	return dl;
+err_setlease:
+	locks_free_lock(fl);
+err_lock_alloc:
+	kfree(dl);
+	return ERR_PTR(rc);
+}
+
+void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
+		long nr_pages)
+{
+	if (IS_ERR_OR_NULL(dl))
+		return;
+
+	if (nr_pages <= 0) {
+		dl->dl_nr_pages = 0;
+		smp_mb__before_atomic();
+		set_bit(DAX_LEASE_PAGES, &dl->dl_state);
+		vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl);
+		flush_delayed_work(&dl->dl_break_work);
+		put_dax_lease(dl);
+		return;
+	}
+
+	dl->dl_nr_pages = nr_pages;
+	memcpy(dl->dl_pages, pages, sizeof(struct page *) * nr_pages);
+	smp_mb__before_atomic();
+	set_bit(DAX_LEASE_PAGES, &dl->dl_state);
+	queue_delayed_work(system_long_wq, &dl->dl_reap_work, HZ);
+}
+EXPORT_SYMBOL_GPL(dax_lease_set_pages);
diff --git a/fs/locks.c b/fs/locks.c
index 1bd71c4d663a..0a7841590b35 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -135,7 +135,7 @@ 
 
 #define IS_POSIX(fl)	(fl->fl_flags & FL_POSIX)
 #define IS_FLOCK(fl)	(fl->fl_flags & FL_FLOCK)
-#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT))
+#define IS_LEASE(fl)	(fl->fl_flags & (FL_LEASE|FL_DELEG|FL_LAYOUT|FL_ALLOCATED))
 #define IS_OFDLCK(fl)	(fl->fl_flags & FL_OFDLCK)
 #define IS_REMOTELCK(fl)	(fl->fl_pid <= 0)
 
@@ -1414,7 +1414,9 @@  static void time_out_leases(struct inode *inode, struct list_head *dispose)
 
 static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
 {
-	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT))
+	/* FL_LAYOUT and FL_ALLOCATED only conflict with each other */
+	if (!!(breaker->fl_flags & (FL_LAYOUT|FL_ALLOCATED))
+			!= !!(lease->fl_flags & (FL_LAYOUT|FL_ALLOCATED)))
 		return false;
 	if ((breaker->fl_flags & FL_DELEG) && (lease->fl_flags & FL_LEASE))
 		return false;
@@ -1653,7 +1655,7 @@  check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
 	int ret = 0;
 	struct inode *inode = dentry->d_inode;
 
-	if (flags & FL_LAYOUT)
+	if (flags & (FL_LAYOUT|FL_ALLOCATED))
 		return 0;
 
 	if ((arg == F_RDLCK) &&
@@ -1733,6 +1735,15 @@  generic_add_lease(struct file *filp, long arg, struct file_lock **flp, void **pr
 		 */
 		if (arg == F_WRLCK)
 			goto out;
+
+		/*
+		 * Taking out a new FL_ALLOCATED lease while a previous
+		 * one is being locked is expected since each instance
+		 * may be responsible for a distinct range of pages.
+		 */
+		if (fl->fl_flags & FL_ALLOCATED)
+			continue;
+
 		/*
 		 * Modifying our existing lease is OK, but no getting a
 		 * new lease if someone else is opening for write:
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 122197124b9d..3ff61dc6241e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -100,10 +100,15 @@  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 
+struct dax_lease;
 #ifdef CONFIG_FS_DAX
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
+struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages);
+void dax_lease_set_pages(struct dax_lease *dl, struct page **pages,
+		long nr_pages);
 #else
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
@@ -111,8 +116,26 @@  static inline int __dax_zero_page_range(struct block_device *bdev,
 {
 	return -ENXIO;
 }
+static inline struct dax_lease *__dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages)
+{
+	return NULL;
+}
+
+static inline void dax_lease_set_pages(struct dax_lease *dl,
+		struct page **pages, long nr_pages)
+{
+}
 #endif
 
+static inline struct dax_lease *dax_truncate_lease(struct vm_area_struct *vma,
+		long nr_pages)
+{
+	if (!vma_is_dax(vma))
+		return NULL;
+	return __dax_truncate_lease(vma, nr_pages);
+}
+
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eace2c5396a7..a3ed74833919 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -371,6 +371,9 @@  struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+
+	/* dax dma support */
+	void (*dax_flush_dma)(struct inode *inode);
 };
 
 extern const struct address_space_operations empty_aops;
@@ -927,6 +930,7 @@  static inline struct file *get_file(struct file *f)
 #define FL_UNLOCK_PENDING	512 /* Lease is being broken */
 #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
 #define FL_LAYOUT	2048	/* outstanding pNFS layout */
+#define FL_ALLOCATED	4096	/* pin allocated dax blocks against dma */
 
 #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
 
@@ -2324,17 +2328,27 @@  static inline int break_deleg_wait(struct inode **delegated_inode)
 	return ret;
 }
 
-static inline int break_layout(struct inode *inode, bool wait)
+static inline int __break_layout(struct inode *inode, bool wait,
+		unsigned int type)
 {
 	struct file_lock_context *ctx = smp_load_acquire(&inode->i_flctx);
 
 	if (ctx && !list_empty_careful(&ctx->flc_lease))
 		return __break_lease(inode,
 				wait ? O_WRONLY : O_WRONLY | O_NONBLOCK,
-				FL_LAYOUT);
+				type);
 	return 0;
 }
 
+static inline int break_layout(struct inode *inode, bool wait)
+{
+	return __break_layout(inode, wait, FL_LAYOUT);
+}
+
+static inline int break_allocated(struct inode *inode, bool wait)
+{
+	return __break_layout(inode, wait, FL_LAYOUT|FL_ALLOCATED);
+}
 #else /* !CONFIG_FILE_LOCKING */
 static inline int break_lease(struct inode *inode, unsigned int mode)
 {
@@ -2362,6 +2376,10 @@  static inline int break_layout(struct inode *inode, bool wait)
 	return 0;
 }
 
+static inline int break_allocated(struct inode *inode, bool wait)
+{
+	return 0;
+}
 #endif /* CONFIG_FILE_LOCKING */
 
 /* fs/open.c */
diff --git a/mm/gup.c b/mm/gup.c
index 308be897d22a..6a7cf371e656 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -9,6 +9,7 @@ 
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/dax.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -640,9 +641,11 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, result = 0;
+	int dax_lease_once = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
+	struct dax_lease *dax_lease = NULL;
 
 	if (!nr_pages)
 		return 0;
@@ -693,6 +696,14 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		if (unlikely(fatal_signal_pending(current)))
 			return i ? i : -ERESTARTSYS;
 		cond_resched();
+		if (pages && !dax_lease_once) {
+			dax_lease_once = 1;
+			dax_lease = dax_truncate_lease(vma, nr_pages);
+			if (IS_ERR(dax_lease)) {
+				result = PTR_ERR(dax_lease);
+				goto out;
+			}
+		}
 		page = follow_page_mask(vma, start, foll_flags, &page_mask);
 		if (!page) {
 			int ret;
@@ -704,9 +715,11 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			case -EFAULT:
 			case -ENOMEM:
 			case -EHWPOISON:
-				return i ? i : ret;
+				result = i ? i : ret;
+				goto out;
 			case -EBUSY:
-				return i;
+				result = i;
+				goto out;
 			case -ENOENT:
 				goto next_page;
 			}
@@ -718,7 +731,8 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			 */
 			goto next_page;
 		} else if (IS_ERR(page)) {
-			return i ? i : PTR_ERR(page);
+			result = i ? i : PTR_ERR(page);
+			goto out;
 		}
 		if (pages) {
 			pages[i] = page;
@@ -738,7 +752,10 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		start += page_increm * PAGE_SIZE;
 		nr_pages -= page_increm;
 	} while (nr_pages);
-	return i;
+	result = i;
+out:
+	dax_lease_set_pages(dax_lease, pages, result);
+	return result;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,