From patchwork Sun Sep 4 02:16:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12965093 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 065C2ECAAD5 for ; Sun, 4 Sep 2022 02:16:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5026A8016F; Sat, 3 Sep 2022 22:16:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4B20E8015A; Sat, 3 Sep 2022 22:16:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 37A198016F; Sat, 3 Sep 2022 22:16:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 25F998015A for ; Sat, 3 Sep 2022 22:16:09 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id EE6A3AAE1A for ; Sun, 4 Sep 2022 02:16:08 +0000 (UTC) X-FDA: 79872788016.17.E96750C Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by imf02.hostedemail.com (Postfix) with ESMTP id 72BBD80061 for ; Sun, 4 Sep 2022 02:16:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1662257767; x=1693793767; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=GVe4dBjvMQpZmNRdpmLWUnxXhArGe+l3tmmLFYYezBs=; b=D/AvnE8inyBQD37iVm8hix/Uz1vRJpapsj7//iKz+kb49uFLCzEV3lDi fI/+ikA3P19Afp4cdb0ba69UM1RYpXvUBX5x1I7wAHPdrbjwO1aC1/KaF UTCwe8IeJ0ADNCYKEdHPIqk4TDXJ273NKsT2PAtgdtfBf1EFyVLONU0ky AharUtZ/4EyhgX46HB3xKN4G9lMhmu9w5P/UMf22qWxwWWciSgGcck40h igQ7ybcXfAyJbU/L3fu/2O8W1kajde/jdw4ksKyOqTPTk6jMKhIC0gIIC 
wnyPwTLq7c4ynDbqQu9EvGWThUVD8irB8ZvHgB4DVVgGiXFM8QibYYitw A==; X-IronPort-AV: E=McAfee;i="6500,9779,10459"; a="279219081" X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="279219081" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:06 -0700 X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="702523876" Received: from pg4-mobl3.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.212.132.198]) by fmsmga003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:06 -0700 Subject: [PATCH 01/13] fsdax: Rename "busy page" to "pinned page" From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org Date: Sat, 03 Sep 2022 19:16:05 -0700 Message-ID: <166225776577.2351842.7326849167823619889.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662257767; a=rsa-sha256; cv=none; b=ivX5JH6wj+HhmH7FjmUQhvcdt/5lxJ74lyeWYlapw6UhK9337LxNQKL9RBpQyKun33PvIF SqKDp4T9m7XJ6HN1JXnKBk/VnG8OxHksrHZvpMuHRPMbJ9+g2WDOg3Q51GWRLh1V8f6qxd N2Tn3BXqJGCcK+jJU1qvqXjTJcxiJi0= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b="D/AvnE8i"; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf02.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662257767; 
h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=v3vK86LsJt3COzE06buSphht4rzk/Pon0rVWIOZxx2c=; b=QZGRTpHKB6tPzqxYlDC/THXMUGwHP4qWDWOpq+Ca7JfHzcR3AyUAm7JxPgrw5dj53MfDWK qA7msjGJPVkgqLa15+ARlz8Yri8gZk9uScsVcv0KtuJ52wOn3Smk9yC+RM8ESKE1uVUXqI q4l7ZX5k8sUJXh/55mnqAJxIJCXUL0c= X-Rspam-User: Authentication-Results: imf02.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b="D/AvnE8i"; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf02.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.126 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: 72BBD80061 X-Stat-Signature: yha1zquazczwngpao7ucu4ieog9xg9hs X-HE-Tag: 1662257767-449580 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: FSDAX needs to hold off truncate for pages undergoing DMA. Replace the DAX-specific "busy" terminology with the "pinned" term. This is in preparation for moving FSDAX from watching transitions of page->_refcount to '1' to observing page_maybe_dma_pinned() returning false. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Signed-off-by: Dan Williams --- fs/dax.c | 16 ++++++++-------- fs/ext4/inode.c | 2 +- fs/fuse/dax.c | 4 ++-- fs/xfs/xfs_file.c | 2 +- fs/xfs/xfs_inode.c | 2 +- include/linux/dax.h | 10 ++++++---- 6 files changed, 19 insertions(+), 17 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index c440dcef4b1b..0f22f7b46de0 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -407,7 +407,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, } } -static struct page *dax_busy_page(void *entry) +static struct page *dax_pinned_page(void *entry) { unsigned long pfn; @@ -665,7 +665,7 @@ static void *grab_mapping_entry(struct xa_state *xas, } /** - * dax_layout_busy_page_range - find first pinned page in @mapping + * dax_layout_pinned_page_range - find first pinned page in @mapping * @mapping: address space to scan for a page with ref count > 1 * @start: Starting offset. Page containing 'start' is included. * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, @@ -682,7 +682,7 @@ static void *grab_mapping_entry(struct xa_state *xas, * to be able to run unmap_mapping_range() and subsequently not race * mapping_mapped() becoming true. 
*/ -struct page *dax_layout_busy_page_range(struct address_space *mapping, +struct page *dax_layout_pinned_page_range(struct address_space *mapping, loff_t start, loff_t end) { void *entry; @@ -727,7 +727,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, if (unlikely(dax_is_locked(entry))) entry = get_unlocked_entry(&xas, 0); if (entry) - page = dax_busy_page(entry); + page = dax_pinned_page(entry); put_unlocked_entry(&xas, entry, WAKE_NEXT); if (page) break; @@ -742,13 +742,13 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, xas_unlock_irq(&xas); return page; } -EXPORT_SYMBOL_GPL(dax_layout_busy_page_range); +EXPORT_SYMBOL_GPL(dax_layout_pinned_page_range); -struct page *dax_layout_busy_page(struct address_space *mapping) +struct page *dax_layout_pinned_page(struct address_space *mapping) { - return dax_layout_busy_page_range(mapping, 0, LLONG_MAX); + return dax_layout_pinned_page_range(mapping, 0, LLONG_MAX); } -EXPORT_SYMBOL_GPL(dax_layout_busy_page); +EXPORT_SYMBOL_GPL(dax_layout_pinned_page); static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index, bool trunc) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 601214453c3a..bf49bf506965 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3957,7 +3957,7 @@ int ext4_break_layouts(struct inode *inode) return -EINVAL; do { - page = dax_layout_busy_page(inode->i_mapping); + page = dax_layout_pinned_page(inode->i_mapping); if (!page) return 0; diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index e23e802a8013..e0b846f16bc5 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -443,7 +443,7 @@ static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos, /* * Can't do inline reclaim in fault path. We call - * dax_layout_busy_page() before we free a range. And + * dax_layout_pinned_page() before we free a range. And * fuse_wait_dax_page() drops mapping->invalidate_lock and requires it. 
* In fault path we enter with mapping->invalidate_lock held and can't * drop it. Also in fault path we hold mapping->invalidate_lock shared @@ -671,7 +671,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, { struct page *page; - page = dax_layout_busy_page_range(inode->i_mapping, start, end); + page = dax_layout_pinned_page_range(inode->i_mapping, start, end); if (!page) return 0; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index c6c80265c0b2..954bb6e83796 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -822,7 +822,7 @@ xfs_break_dax_layouts( ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL)); - page = dax_layout_busy_page(inode->i_mapping); + page = dax_layout_pinned_page(inode->i_mapping); if (!page) return 0; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 28493c8e9bb2..9d0bea03501e 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -3481,7 +3481,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout( * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable * for this nested lock case. 
*/ - page = dax_layout_busy_page(VFS_I(ip2)->i_mapping); + page = dax_layout_pinned_page(VFS_I(ip2)->i_mapping); if (page && page_ref_count(page) != 1) { xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL); xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL); diff --git a/include/linux/dax.h b/include/linux/dax.h index ba985333e26b..54f099166a29 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -157,8 +157,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder) int dax_writeback_mapping_range(struct address_space *mapping, struct dax_device *dax_dev, struct writeback_control *wbc); -struct page *dax_layout_busy_page(struct address_space *mapping); -struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end); +struct page *dax_layout_pinned_page(struct address_space *mapping); +struct page *dax_layout_pinned_page_range(struct address_space *mapping, loff_t start, loff_t end); dax_entry_t dax_lock_page(struct page *page); void dax_unlock_page(struct page *page, dax_entry_t cookie); dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, @@ -166,12 +166,14 @@ dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, void dax_unlock_mapping_entry(struct address_space *mapping, unsigned long index, dax_entry_t cookie); #else -static inline struct page *dax_layout_busy_page(struct address_space *mapping) +static inline struct page *dax_layout_pinned_page(struct address_space *mapping) { return NULL; } -static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages) +static inline struct page * +dax_layout_pinned_page_range(struct address_space *mapping, pgoff_t start, + pgoff_t nr_pages) { return NULL; } From patchwork Sun Sep 4 02:16:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12965094 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on 
aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D3DCECAAD5 for ; Sun, 4 Sep 2022 02:16:15 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C06FF80170; Sat, 3 Sep 2022 22:16:14 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BB5828015A; Sat, 3 Sep 2022 22:16:14 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AA66380170; Sat, 3 Sep 2022 22:16:14 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 9B1CD8015A for ; Sat, 3 Sep 2022 22:16:14 -0400 (EDT) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 6829C12048F for ; Sun, 4 Sep 2022 02:16:14 +0000 (UTC) X-FDA: 79872788268.08.DFD7193 Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by imf25.hostedemail.com (Postfix) with ESMTP id B0815A006A for ; Sun, 4 Sep 2022 02:16:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1662257773; x=1693793773; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=XECMJ3rDr2QuKr70k+HRef4UN7pYK04Pd7WfZFPApLY=; b=gt5vui+HKKa+NLnCjBReVUKc8c1b0ZCiGemHBqN10v35doffckrR6D/t JmXnNmtLPa/DGw7OwSBkXCmWAvSRLPQp7Dw3sYdLaGfEI8itqTsMwHKl9 T3L+uyjJytaW5nfyxg2JF97G11co9AGqk/qomTEBRwlfCo6HhudtEW9I5 oFbAmMfsK1sepvVWALK7gaIv2RCHjTFUtajbaMwTXpU4YxPwi9AtUFVwz OchAIQaaT/wzlhfyryieC+8+/ZE4sewKVUE/gaRF5nGdkRJ0iiATflPr8 8j8D0y30boPbuK6OxOkuUWkW3XF9VaFX/bb2mn9gyqL5EUvRUGE62our/ Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10459"; a="357917710" X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="357917710" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by 
orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:12 -0700 X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="646515349" Received: from pg4-mobl3.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.212.132.198]) by orsmga001-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:12 -0700 Subject: [PATCH 02/13] fsdax: Use page_maybe_dma_pinned() for DAX vs DMA collisions From: Dan Williams To: akpm@linux-foundation.org Cc: Jan Kara , "Darrick J. Wong" , Christoph Hellwig , John Hubbard , Jason Gunthorpe , Matthew Wilcox , linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org Date: Sat, 03 Sep 2022 19:16:12 -0700 Message-ID: <166225777193.2351842.16365701080007152185.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Authentication-Results: i=1; imf25.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=gt5vui+H; spf=pass (imf25.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662257774; a=rsa-sha256; cv=none; b=N/KBa9oVo2mO5CO6ZNYgl3BUENL9i7yLhge7NFh1bQAne9dHUDcJ6xhBSLd382RQECjQcO KjCmAndWCm0jn9Go+tccqHc6TxBCYBcBaBsXK1G1ipT2jeDMXF7mddIfb/PtsmC26HBGim tNMMSvz1Wzy0kCMA9ClSbEWvD5toAsY= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662257774; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references:dkim-signature; bh=fOFbKmJVavJclhJ9AQw9D3Ma3DEFJSx6W2KGZ0TzlVs=; b=tInbtaxj0Wsui4YW6/HZZJpbD6AYMfQUB/Vx2YbVvqA7zLn34Hv/TNOERSDKX198gv1cyI G0MIEZEt9Tv/08mTKfFPR5a+ziXUXjZwT6VafugG5hxaO3dzpHQVk5CHQI3E0MS5EQSDKQ vu2ESgt5yY7pRHploHrD5bbKgyG8VRk= X-Rspam-User: Authentication-Results: imf25.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=gt5vui+H; spf=pass (imf25.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: B0815A006A X-Stat-Signature: fph8gwwoec6qzysfdhr6iuyt78tq17xd X-HE-Tag: 1662257773-293530 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The pin_user_pages() + page_maybe_dma_pinned() infrastructure is a framework for tackling the kernel's struggles with gup+DMA. DAX presents a unique flavor of the gup+DMA problem since pinned pages are identical to physical filesystem blocks. Unlike the page-cache case, a mapping of a file can not be truncated while DMA is in-flight because the DMA must complete before the filesystem block is reclaimed. DAX has a homegrown solution to this problem based on watching the page->_refcount go idle. Beyond being awkward to catch that idle transition in put_page(), it is overkill when only the page_maybe_dma_pinned() transition needs to be captured. Move the wakeup of filesystem-DAX truncate paths ({ext4,xfs,fuse_dax}_break_layouts()) to unpin_user_pages() with a new wakeup_fsdax_pin_waiters() helper, and use !page_maybe_dma_pinned() as the wake condition. Cc: Jan Kara Cc: "Darrick J. 
Wong" Cc: Christoph Hellwig Cc: John Hubbard Reported-by: Jason Gunthorpe Reported-by: Matthew Wilcox Signed-off-by: Dan Williams --- fs/dax.c | 4 ++-- fs/ext4/inode.c | 7 +++---- fs/fuse/dax.c | 6 +++--- fs/xfs/xfs_file.c | 6 +++--- include/linux/mm.h | 28 ++++++++++++++++++++++++++++ mm/gup.c | 6 ++++-- 6 files changed, 43 insertions(+), 14 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 0f22f7b46de0..aceb587bc27e 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -395,7 +395,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - WARN_ON_ONCE(trunc && page_ref_count(page) > 1); + WARN_ON_ONCE(trunc && page_maybe_dma_pinned(page)); if (dax_mapping_is_cow(page->mapping)) { /* keep the CoW flag if this page is still shared */ if (page->index-- > 0) @@ -414,7 +414,7 @@ static struct page *dax_pinned_page(void *entry) for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn); - if (page_ref_count(page) > 1) + if (page_maybe_dma_pinned(page)) return page; } return NULL; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index bf49bf506965..5e68e64f155a 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3961,10 +3961,9 @@ int ext4_break_layouts(struct inode *inode) if (!page) return 0; - error = ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, - TASK_INTERRUPTIBLE, 0, 0, - ext4_wait_dax_page(inode)); + error = ___wait_var_event(page, !page_maybe_dma_pinned(page), + TASK_INTERRUPTIBLE, 0, 0, + ext4_wait_dax_page(inode)); } while (error == 0); return error; diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index e0b846f16bc5..6419ca420c42 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -676,9 +676,9 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, return 0; *retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, fuse_wait_dax_page(inode)); + return 
___wait_var_event(page, !page_maybe_dma_pinned(page), + TASK_INTERRUPTIBLE, 0, 0, + fuse_wait_dax_page(inode)); } /* dmap_end == 0 leads to unmapping of whole file */ diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 954bb6e83796..dbffb9481b71 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -827,9 +827,9 @@ xfs_break_dax_layouts( return 0; *retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, xfs_wait_dax_page(inode)); + return ___wait_var_event(page, !page_maybe_dma_pinned(page), + TASK_INTERRUPTIBLE, 0, 0, + xfs_wait_dax_page(inode)); } int diff --git a/include/linux/mm.h b/include/linux/mm.h index 3bedc449c14d..557d5447ebec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1517,6 +1517,34 @@ static inline bool page_maybe_dma_pinned(struct page *page) return folio_maybe_dma_pinned(page_folio(page)); } +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX) +/* + * Unlike typical file backed pages that support truncating a page from + * a file while it is under active DMA, DAX pages need to hold off + * truncate operations until transient page pins are released. + * + * The filesystem (via dax_layout_pinned_page()) takes steps to make + * sure that any observation of the !page_maybe_dma_pinned() state is + * stable until the truncation completes. + */ +static inline void wakeup_fsdax_pin_waiters(struct folio *folio) +{ + struct page *page = &folio->page; + + if (!folio_is_zone_device(folio)) + return; + if (page->pgmap->type != MEMORY_DEVICE_FS_DAX) + return; + if (folio_maybe_dma_pinned(folio)) + return; + wake_up_var(page); +} +#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */ +static inline void wakeup_fsdax_pin_waiters(struct folio *folio) +{ +} +#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */ + /* * This should most likely only be called during fork() to see whether we * should break the cow immediately for an anon page on the src mm. 
diff --git a/mm/gup.c b/mm/gup.c index 732825157430..499c46296fda 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -177,8 +177,10 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags) refs *= GUP_PIN_COUNTING_BIAS; } - if (!put_devmap_managed_page_refs(&folio->page, refs)) - folio_put_refs(folio, refs); + folio_put_refs(folio, refs); + + if (flags & FOLL_PIN) + wakeup_fsdax_pin_waiters(folio); } /** From patchwork Sun Sep 4 02:16:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12965095 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 004E3C38145 for ; Sun, 4 Sep 2022 02:16:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6FE3F80171; Sat, 3 Sep 2022 22:16:21 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6AD348015A; Sat, 3 Sep 2022 22:16:21 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 54DCB80171; Sat, 3 Sep 2022 22:16:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 45A9D8015A for ; Sat, 3 Sep 2022 22:16:21 -0400 (EDT) Received: from smtpin13.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 294091A0754 for ; Sun, 4 Sep 2022 02:16:21 +0000 (UTC) X-FDA: 79872788562.13.2064DE5 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by imf21.hostedemail.com (Postfix) with ESMTP id 65B9A1C006A for ; Sun, 4 Sep 2022 02:16:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1662257779; x=1693793779; 
h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=sipDxMIA1gHM4rUMbFjWahw5uArq5Zq1WCPqnel5xNA=; b=Qdae1fGS/7ZyShgCoYqDr93DDmZmTotqn8Gar3TDOnfG7CdGcbaRSAgT uzUclrCwv5Uh/6DXVYdKo5YPRpY/Ek4rS5DukV1W+76vN1XoKWyHtIKPG WbbSOA7Xe7VzXDMkZ7kT6k95j+N6Dd62b2WAefRlGQwIddtp+neSiwKJO Aah3BFwbrhhJT+0U4f05ci2ZdKykbuBgXdi3OGaOxS3g6GrK9zL0LLeE5 mTgU5TC4dm73odxBluDWlfRooU5+4Nz4rTRNW3pqDNhAKIl9zVd5ZNxk1 cl2GX3S80NQQBH93i3cb+UqSHDZyz/iZJNzXq1cLQIUo9tKZFlTbfqAMG A==; X-IronPort-AV: E=McAfee;i="6500,9779,10459"; a="382504015" X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="382504015" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:18 -0700 X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="643384493" Received: from pg4-mobl3.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.212.132.198]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:17 -0700 Subject: [PATCH 03/13] fsdax: Delete put_devmap_managed_page_refs() From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. 
Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org Date: Sat, 03 Sep 2022 19:16:17 -0700 Message-ID: <166225777752.2351842.10384480208879805937.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662257779; a=rsa-sha256; cv=none; b=knmr3q1GLX50CC2UJKCTvTXAgENtOSmibS1BdIsG7k24UJz//PYoubj/KxZ6AjN058zV4K mylfNmDCZNFij7Y2ktUB51DpocKLq8pLlq3m+XpFVIi4VFOYq2qpJbc6XMFzYANnEej9hP SyMHFv+5xDeT5ZH4iknAakdXy0WOeqo= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=Qdae1fGS; spf=pass (imf21.hostedemail.com: domain of dan.j.williams@intel.com designates 192.55.52.43 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662257779; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=SfNjKOCVvcvo92uLvcXlC3YyD3xm2NrPLRDYDBXuL6Y=; b=253BnJPDG3YdvF2Zw5kWFWN7BvHBH+YyhaoypCCA2rpyqXiJ9xSnw6G6oCejpiFZ0JatYw eKvoNKFS2pggj4mC9mEK1AlVoQWJZd6g7kMOlZ/nK05LXpXuF4RrZ4jdse+BOoakSkjyqE hR4Xkvr5x6aDZHpoGQCHVDXM1fFDgEU= X-Stat-Signature: nfscj18fmky7azq7wr7o6cenwyazp9no X-Rspamd-Queue-Id: 65B9A1C006A X-Rspam-User: Authentication-Results: imf21.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=Qdae1fGS; spf=pass (imf21.hostedemail.com: domain of dan.j.williams@intel.com designates 
192.55.52.43 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam01 X-HE-Tag: 1662257779-554348 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now that fsdax DMA-idle detection no longer depends on catching transitions of page->_refcount to 1, remove put_devmap_managed_page_refs() and associated infrastructure. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J. Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- include/linux/mm.h | 30 ------------------------------ mm/gup.c | 3 +-- mm/memremap.c | 18 ------------------ mm/swap.c | 2 -- 4 files changed, 1 insertion(+), 52 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 557d5447ebec..24f8682d0cd7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1048,30 +1048,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf); * back into memory. 
*/ -#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX) -DECLARE_STATIC_KEY_FALSE(devmap_managed_key); - -bool __put_devmap_managed_page_refs(struct page *page, int refs); -static inline bool put_devmap_managed_page_refs(struct page *page, int refs) -{ - if (!static_branch_unlikely(&devmap_managed_key)) - return false; - if (!is_zone_device_page(page)) - return false; - return __put_devmap_managed_page_refs(page, refs); -} -#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */ -static inline bool put_devmap_managed_page_refs(struct page *page, int refs) -{ - return false; -} -#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */ - -static inline bool put_devmap_managed_page(struct page *page) -{ - return put_devmap_managed_page_refs(page, 1); -} - /* 127: arbitrary random number, small enough to assemble well */ #define folio_ref_zero_or_close_to_overflow(folio) \ ((unsigned int) folio_ref_count(folio) + 127u <= 127u) @@ -1168,12 +1144,6 @@ static inline void put_page(struct page *page) { struct folio *folio = page_folio(page); - /* - * For some devmap managed pages we need to catch refcount transition - * from 2 to 1: - */ - if (put_devmap_managed_page(&folio->page)) - return; folio_put(folio); } diff --git a/mm/gup.c b/mm/gup.c index 499c46296fda..67dfffe97917 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -87,8 +87,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs) * belongs to this folio. 
*/ if (unlikely(page_folio(page) != folio)) { - if (!put_devmap_managed_page_refs(&folio->page, refs)) - folio_put_refs(folio, refs); + folio_put_refs(folio, refs); goto retry; } diff --git a/mm/memremap.c b/mm/memremap.c index 58b20c3c300b..433500e955fb 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -507,21 +507,3 @@ void free_zone_device_page(struct page *page) */ set_page_count(page, 1); } - -#ifdef CONFIG_FS_DAX -bool __put_devmap_managed_page_refs(struct page *page, int refs) -{ - if (page->pgmap->type != MEMORY_DEVICE_FS_DAX) - return false; - - /* - * fsdax page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. - */ - if (page_ref_sub_return(page, refs) == 1) - wake_up_var(&page->_refcount); - return true; -} -EXPORT_SYMBOL(__put_devmap_managed_page_refs); -#endif /* CONFIG_FS_DAX */ diff --git a/mm/swap.c b/mm/swap.c index 9cee7f6a3809..b346dd24cde8 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -960,8 +960,6 @@ void release_pages(struct page **pages, int nr) unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - if (put_devmap_managed_page(&folio->page)) - continue; if (folio_put_testzero(folio)) free_zone_device_page(&folio->page); continue; From patchwork Sun Sep 4 02:16:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12965096 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1392CC6FA82 for ; Sun, 4 Sep 2022 02:16:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A0AE880172; Sat, 3 Sep 2022 22:16:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9BA5B8015A; Sat, 3 Sep 2022 22:16:26 -0400 (EDT) X-Delivered-To: 
Subject: [PATCH 04/13] fsdax: Update dax_insert_entry() calling convention to return an error
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:16:23 -0700
Message-ID: <166225778308.2351842.10359830461531484766.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>

In preparation for teaching dax_insert_entry() to take live @pgmap
references, enable it to return errors. Given the observation that all
callers overwrite the passed in entry with the return value, just update
@entry in place and convert the return code to a vm_fault_t status.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Signed-off-by: Dan Williams
---
 fs/dax.c |   27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index aceb587bc27e..d2fb58a7449b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -853,14 +853,15 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-		const struct iomap_iter *iter, void *entry, pfn_t pfn,
-		unsigned long flags)
+static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+				   const struct iomap_iter *iter, void **pentry,
+				   pfn_t pfn, unsigned long flags)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
 	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
 	bool cow = dax_fault_is_cow(iter);
+	void *entry = *pentry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -906,7 +907,8 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
 	xas_unlock_irq(xas);
-	return entry;
+	*pentry = entry;
+	return 0;
 }
 
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
@@ -1154,9 +1156,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE);
+	if (ret)
+		goto out;
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
 }
@@ -1173,6 +1178,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	struct page *zero_page;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	vm_fault_t ret;
 	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
@@ -1181,8 +1187,10 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn,
+			       DAX_PMD | DAX_ZERO_PAGE);
+	if (ret)
+		return ret;
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1534,6 +1542,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
 	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
+	vm_fault_t ret;
 	int err = 0;
 	pfn_t pfn;
 	void *kaddr;
@@ -1558,7 +1567,9 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
+	if (ret)
+		return ret;
 
 	if (write &&
 	    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {

From patchwork Sun Sep 4 02:16:29 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965097
Subject: [PATCH 05/13] fsdax: Cleanup dax_associate_entry()
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:16:29 -0700
Message-ID: <166225778919.2351842.8691837577077340308.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>

Pass @vmf to drop the separate @vma and @address arguments to
dax_associate_entry(), use the existing DAX flags to convey the @cow
argument, and replace the open-coded ALIGN().

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Signed-off-by: Dan Williams
---
 fs/dax.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d2fb58a7449b..fad1c8a1d913 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -362,7 +362,7 @@ static inline void dax_mapping_set_cow(struct page *page)
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address, bool cow)
+		struct vm_fault *vmf, unsigned long flags)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
 	int i = 0;
@@ -370,11 +370,11 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	index = linear_page_index(vma, address & ~(size - 1));
+	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (cow) {
+		if (flags & DAX_COW) {
 			dax_mapping_set_cow(page);
 		} else {
 			WARN_ON_ONCE(page->mapping);
@@ -882,8 +882,7 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-				cow);
+		dax_associate_entry(new_entry, mapping, vmf, flags);
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or

From patchwork Sun Sep 4 02:16:34 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965098
Subject: [PATCH 06/13] fsdax: Rework dax_insert_entry() calling convention
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:16:34 -0700
Message-ID: <166225779478.2351842.2371440980289644924.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>

Move the determination of @dirty and @cow in dax_insert_entry() to flags
(DAX_DIRTY and DAX_COW) that are passed in. This allows the iomap-related
code to remain in fs/dax.c in preparation for the Xarray infrastructure to
move to drivers/dax/mapping.c.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Signed-off-by: Dan Williams
---
 fs/dax.c |   44 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fad1c8a1d913..65d55c5ecdef 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -75,11 +75,19 @@ fs_initcall(init_dax_wait_table);
  * block allocation.
  */
 #define DAX_SHIFT	(4)
+#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
 #define DAX_ZERO_PAGE	(1UL << 2)
 #define DAX_EMPTY	(1UL << 3)
 
+/*
+ * These flags are not conveyed in Xarray value entries, they are just
+ * modifiers to dax_insert_entry().
+ */
+#define DAX_DIRTY	(1UL << (DAX_SHIFT + 0))
+#define DAX_COW		(1UL << (DAX_SHIFT + 1))
+
 static unsigned long dax_to_pfn(void *entry)
 {
 	return xa_to_value(entry) >> DAX_SHIFT;
@@ -87,7 +95,8 @@ static unsigned long dax_to_pfn(void *entry)
 
 static void *dax_make_entry(pfn_t pfn, unsigned long flags)
 {
-	return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
+	return xa_mk_value((flags & DAX_MASK) |
+			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
 }
 
 static bool dax_is_locked(void *entry)
@@ -846,6 +855,20 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
 		(iter->iomap.flags & IOMAP_F_SHARED);
 }
 
+static unsigned long dax_iter_flags(const struct iomap_iter *iter,
+				    struct vm_fault *vmf)
+{
+	unsigned long flags = 0;
+
+	if (!dax_fault_is_synchronous(iter, vmf->vma))
+		flags |= DAX_DIRTY;
+
+	if (dax_fault_is_cow(iter))
+		flags |= DAX_COW;
+
+	return flags;
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -854,13 +877,13 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
  * appropriate.
  */
 static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-				   const struct iomap_iter *iter, void **pentry,
-				   pfn_t pfn, unsigned long flags)
+				   void **pentry, pfn_t pfn,
+				   unsigned long flags)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
-	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
-	bool cow = dax_fault_is_cow(iter);
+	bool dirty = flags & DAX_DIRTY;
+	bool cow = flags & DAX_COW;
 	void *entry = *pentry;
 
 	if (dirty)
@@ -1155,7 +1178,8 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       DAX_ZERO_PAGE | dax_iter_flags(iter, vmf));
 	if (ret)
 		goto out;
 
@@ -1186,8 +1210,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn,
-			       DAX_PMD | DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       DAX_PMD | DAX_ZERO_PAGE |
+				       dax_iter_flags(iter, vmf));
 	if (ret)
 		return ret;
 
@@ -1566,7 +1591,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       entry_flags | dax_iter_flags(iter, vmf));
 	if (ret)
 		return ret;
 

From patchwork Sun Sep 4 02:16:40 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965099
Subject: [PATCH 07/13] fsdax: Manage pgmap references at entry insertion and deletion
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:16:40 -0700
Message-ID: <166225780079.2351842.2977488896744053462.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>

The percpu_ref in 'struct dev_pagemap' is used to coordinate active
mappings of device-memory with the device-removal / unbind path. It
enables the semantic that initiating device-removal (or
device-driver-unbind) blocks new mapping and DMA attempts, and waits for
mapping revocation or inflight DMA to complete.

Expand the scope of the reference count to pin the DAX device active at
mapping time and not later at the first gup event. With a device
reference being held while any page on that device is mapped, the need to
manage pgmap reference counts in the gup code is eliminated. That
cleanup is saved for a follow-on change.

For now, teach dax_insert_entry() and dax_delete_mapping_entry() to take
and drop pgmap references respectively.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Signed-off-by: Dan Williams
---
 fs/dax.c                 |   30 ++++++++++++++++++++++++------
 include/linux/memremap.h |   18 ++++++++++++++----
 mm/memremap.c            |   13 ++++++++-----
 3 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 65d55c5ecdef..b23222b0dae4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -370,14 +370,24 @@ static inline void dax_mapping_set_cow(struct page *page)
  * whether this entry is shared by multiple files.  If so, set the page->mapping
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_fault *vmf, unsigned long flags)
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_fault *vmf, unsigned long flags)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
+	struct dev_pagemap *pgmap;
 	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
+		return 0;
+
+	if (!size)
+		return 0;
+
+	pfn = dax_to_pfn(entry);
+	pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size));
+	if (!pgmap)
+		return VM_FAULT_SIGBUS;
 
 	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
 	for_each_mapped_pfn(entry, pfn) {
@@ -391,19 +401,27 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 			page->index = index + i++;
 		}
 	}
+
+	return 0;
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		bool trunc)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn;
+	struct page *page;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
+	if (!size)
+		return;
+
+	page = pfn_to_page(dax_to_pfn(entry));
+	put_dev_pagemap_many(page->pgmap, PHYS_PFN(size));
 
+	for_each_mapped_pfn(entry, pfn) {
+		page = pfn_to_page(pfn);
 		WARN_ON_ONCE(trunc && page_maybe_dma_pinned(page));
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c3b4cc84877b..fd57407e7f3d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -191,8 +191,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
-struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn,
+					 struct dev_pagemap *pgmap, int refs);
+static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
+						  struct dev_pagemap *pgmap)
+{
+	return get_dev_pagemap_many(pfn, pgmap, 1);
+}
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -244,10 +249,15 @@ static inline unsigned long memremap_compat_align(void)
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+static inline void put_dev_pagemap_many(struct dev_pagemap *pgmap, int refs)
 {
 	if (pgmap)
-		percpu_ref_put(&pgmap->ref);
+		percpu_ref_put_many(&pgmap->ref, refs);
+}
+
+static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+{
+	put_dev_pagemap_many(pgmap, 1);
 }
 
 #endif /* _LINUX_MEMREMAP_H_ */
diff --git a/mm/memremap.c b/mm/memremap.c
index 433500e955fb..4debe7b211ae 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -430,15 +430,16 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
 }
 
 /**
- * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
+ * get_dev_pagemap_many() - take new live reference(s) on the dev_pagemap for @pfn
  * @pfn: page frame number to lookup page_map
  * @pgmap: optional known pgmap that already has a reference
+ * @refs: number of references to take
  *
  * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
  * is non-NULL but does not cover @pfn the reference to it will be released.
  */
-struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn,
+					 struct dev_pagemap *pgmap, int refs)
 {
 	resource_size_t phys = PFN_PHYS(pfn);
 
@@ -454,13 +455,15 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	/* fall back to slow path lookup */
 	rcu_read_lock();
 	pgmap = xa_load(&pgmap_array, PHYS_PFN(phys));
-	if (pgmap && !percpu_ref_tryget_live(&pgmap->ref))
+	if (pgmap && !percpu_ref_tryget_live_rcu(&pgmap->ref))
 		pgmap = NULL;
+	if (pgmap && refs > 1)
+		percpu_ref_get_many(&pgmap->ref, refs - 1);
 	rcu_read_unlock();
 
 	return pgmap;
 }
-EXPORT_SYMBOL_GPL(get_dev_pagemap);
+EXPORT_SYMBOL_GPL(get_dev_pagemap_many);
 
 void free_zone_device_page(struct page *page)
 {

From patchwork Sun Sep 4 02:16:46 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965100
79872789906.13.FC5D560 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by imf01.hostedemail.com (Postfix) with ESMTP id 8107A4005F for ; Sun, 4 Sep 2022 02:16:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1662257812; x=1693793812; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qfzx5y9B9XHcY/WNGZvpd9adYJw1q+hyjxJ0sZQS8JA=; b=TpnvUjBdWmfH5kqLUYq9fRmv7Akt2igcgUILmSmsLsJZcsGwHIcT5X9D 4/at/Xl7oNPj8LSDu9sXeISQhjy7gkAztKhQOyz0Qoe8d532Vq5YUx5+K eMwHRNDtbIve7L67OvDxOfiibVCdI84LW5jKJJN6ApHDISp9U2gYsL8sI cyXlULHOOjHyEoEn8WCuEz/Pf5pUOaeOSLWny16W4BY9il4RAX8ugL2FB GvaSqYC1kMOv7RifdUNIO1WjFzvtfQ3UyuHU4uTPkzzxHLyqmViQmyyi9 C9lSPpURwuTsQ1dlyWO3SEl1KZlfWmShB1SRbANFWyZJlOpHV09LzB632 Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10459"; a="294943552" X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="294943552" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:47 -0700 X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="755659485" Received: from pg4-mobl3.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.212.132.198]) by fmsmga001-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:16:46 -0700 Subject: [PATCH 08/13] devdax: Minor warning fixups From: Dan Williams To: akpm@linux-foundation.org Cc: hch@lst.de, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org Date: Sat, 03 Sep 2022 19:16:46 -0700 Message-ID: <166225780636.2351842.2332609175968045796.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662257812; 
a=rsa-sha256; cv=none; b=OYh6Y0CVs/3S6WtVQKUrU4nFEXsTJ4tkGJ9NOjF4KUsQa4ObaeuJ5xV4VSpibtA6mzDT3F zGP+e67J/S5RoKcVkBneYZjSc5G766D/qB2snHXvaH1C4Y9rEmiJOohyEMFymEn+qUPW39 u8VaEtOBdQmrb3DLSbXCHy50PC08a9E= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=TpnvUjBd; spf=softfail (imf01.hostedemail.com: 192.55.52.120 is neither permitted nor denied by domain of dan.j.williams@intel.com) smtp.mailfrom=dan.j.williams@intel.com; dmarc=fail reason="No valid SPF" header.from=intel.com (policy=none) ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662257812; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=1vgZdDU9Ny4uGe480EWaiLp5fsg1lyKdoZ2O0fPs0wc=; b=vpiP+FlG9ZMrkqxfR51cKA2+8ERhwD14wZ55V6/IGw3N9L/emloDOHWlPNDx9eZHjqycpV JkqvBAgDqxGozAD2r1BfilSKo24XW+4WSLfKhYF/4teBU56LggX/Y4vE9rrfjONxas6PZw BM4ubU9iBnuHK34nEXxURYKi3vNNOHo= X-Rspam-User: X-Stat-Signature: yiwyomkarg346snsgu9zuph4dcdruarn X-Rspamd-Queue-Id: 8107A4005F X-Rspamd-Server: rspam10 Authentication-Results: imf01.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=TpnvUjBd; spf=softfail (imf01.hostedemail.com: 192.55.52.120 is neither permitted nor denied by domain of dan.j.williams@intel.com) smtp.mailfrom=dan.j.williams@intel.com; dmarc=fail reason="No valid SPF" header.from=intel.com (policy=none) X-HE-Tag: 1662257812-232073 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Fix a missing prototype warning for dev_dax_probe(), and fix dax_holder() comment block format. 
Signed-off-by: Dan Williams --- drivers/dax/dax-private.h | 1 + drivers/dax/super.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 1c974b7caae6..202cafd836e8 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -87,6 +87,7 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev) } phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size); +int dev_dax_probe(struct dev_dax *dev_dax); #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline bool dax_align_valid(unsigned long align) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9b5e2a5eb0ae..4909ad945a49 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(put_dax); /** * dax_holder() - obtain the holder of a dax device * @dax_dev: a dax_device instance - + * * Return: the holder's data which represents the holder if registered, * otherwize NULL. 
*/

From patchwork Sun Sep 4 02:16:52 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965101
Subject: [PATCH 09/13] devdax: Move address_space helpers to the DAX core
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:16:52 -0700
Message-ID: <166225781235.2351842.9238883139775499926.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>

In preparation for decamping 'get_dev_pagemap()' from code paths outside of DAX, device-dax needs to track mapping references. Reuse the same infrastructure as fsdax (dax_{insert,delete_mapping}_entry()). For now, just move that infrastructure into a common location.

The move involves splitting iomap and supporting helpers into fs/dax.c and all 'struct address_space' and DAX-entry manipulation into drivers/dax/mapping.c. grab_mapping_entry() is renamed dax_grab_mapping_entry(), and some common definitions and declarations are moved to include/linux/dax.h.

No functional change is intended, just code movement.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J.
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- .clang-format | 1 drivers/Makefile | 2 drivers/dax/Kconfig | 5 drivers/dax/Makefile | 1 drivers/dax/mapping.c | 973 ++++++++++++++++++++++++++++++++++++++++++ fs/dax.c | 1071 +--------------------------------------------- include/linux/dax.h | 116 ++++- include/linux/memremap.h | 6 8 files changed, 1106 insertions(+), 1069 deletions(-) create mode 100644 drivers/dax/mapping.c diff --git a/.clang-format b/.clang-format index 1247d54f9e49..336fa266386e 100644 --- a/.clang-format +++ b/.clang-format @@ -269,6 +269,7 @@ ForEachMacros: - 'for_each_link_cpus' - 'for_each_link_platforms' - 'for_each_lru' + - 'for_each_mapped_pfn' - 'for_each_matching_node' - 'for_each_matching_node_and_match' - 'for_each_mem_pfn_range' diff --git a/drivers/Makefile b/drivers/Makefile index 057857258bfd..ec6c4146b966 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -71,7 +71,7 @@ obj-$(CONFIG_FB_INTEL) += video/fbdev/intelfb/ obj-$(CONFIG_PARPORT) += parport/ obj-y += base/ block/ misc/ mfd/ nfc/ obj-$(CONFIG_LIBNVDIMM) += nvdimm/ -obj-$(CONFIG_DAX) += dax/ +obj-y += dax/ obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/ obj-$(CONFIG_NUBUS) += nubus/ obj-y += cxl/ diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index 5fdf269a822e..3ed4da3935e5 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only menuconfig DAX tristate "DAX: direct access to differentiated memory" + depends on MMU select SRCU default m if NVDIMM_DAX @@ -49,6 +50,10 @@ config DEV_DAX_HMEM_DEVICES depends on DEV_DAX_HMEM && DAX=y def_bool y +config DAX_MAPPING + depends on DAX + def_bool y + config DEV_DAX_KMEM tristate "KMEM DAX: volatile-use of persistent memory" default DEV_DAX diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 90a56ca3b345..d57f1f34e8a8 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -3,6 +3,7 @@ 
obj-$(CONFIG_DAX) += dax.o obj-$(CONFIG_DEV_DAX) += device_dax.o obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o +obj-$(CONFIG_DAX_MAPPING) += mapping.o dax-y := super.o dax-y += bus.o diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c new file mode 100644 index 000000000000..0810af7d9503 --- /dev/null +++ b/drivers/dax/mapping.c @@ -0,0 +1,973 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Direct Access mapping infrastructure split from fs/dax.c + * Copyright (c) 2013-2014 Intel Corporation + * Author: Matthew Wilcox + * Author: Ross Zwisler + */ + +#include +#include +#include +#include +#include +#include +#include + +#define CREATE_TRACE_POINTS +#include + +/* We choose 4096 entries - same as per-zone page wait tables */ +#define DAX_WAIT_TABLE_BITS 12 +#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS) + +static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; + +static int __init init_dax_wait_table(void) +{ + int i; + + for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++) + init_waitqueue_head(wait_table + i); + return 0; +} +fs_initcall(init_dax_wait_table); + +static unsigned long dax_to_pfn(void *entry) +{ + return xa_to_value(entry) >> DAX_SHIFT; +} + +static void *dax_make_entry(pfn_t pfn, unsigned long flags) +{ + return xa_mk_value((flags & DAX_MASK) | + (pfn_t_to_pfn(pfn) << DAX_SHIFT)); +} + +static bool dax_is_locked(void *entry) +{ + return xa_to_value(entry) & DAX_LOCKED; +} + +static unsigned int dax_entry_order(void *entry) +{ + if (xa_to_value(entry) & DAX_PMD) + return PMD_ORDER; + return 0; +} + +static unsigned long dax_is_pmd_entry(void *entry) +{ + return xa_to_value(entry) & DAX_PMD; +} + +static bool dax_is_pte_entry(void *entry) +{ + return !(xa_to_value(entry) & DAX_PMD); +} + +static int dax_is_zero_entry(void *entry) +{ + return xa_to_value(entry) & DAX_ZERO_PAGE; +} + +static int dax_is_empty_entry(void *entry) +{ + return xa_to_value(entry) & DAX_EMPTY; +} + +/* + * true if the entry 
that was found is of a smaller order than the entry + * we were looking for + */ +static bool dax_is_conflict(void *entry) +{ + return entry == XA_RETRY_ENTRY; +} + +/* + * DAX page cache entry locking + */ +struct exceptional_entry_key { + struct xarray *xa; + pgoff_t entry_start; +}; + +struct wait_exceptional_entry_queue { + wait_queue_entry_t wait; + struct exceptional_entry_key key; +}; + +/** + * enum dax_wake_mode: waitqueue wakeup behaviour + * @WAKE_ALL: wake all waiters in the waitqueue + * @WAKE_NEXT: wake only the first waiter in the waitqueue + */ +enum dax_wake_mode { + WAKE_ALL, + WAKE_NEXT, +}; + +static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, void *entry, + struct exceptional_entry_key *key) +{ + unsigned long hash; + unsigned long index = xas->xa_index; + + /* + * If 'entry' is a PMD, align the 'index' that we use for the wait + * queue to the start of that PMD. This ensures that all offsets in + * the range covered by the PMD map to the same bit lock. + */ + if (dax_is_pmd_entry(entry)) + index &= ~PG_PMD_COLOUR; + key->xa = xas->xa; + key->entry_start = index; + + hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS); + return wait_table + hash; +} + +static int wake_exceptional_entry_func(wait_queue_entry_t *wait, + unsigned int mode, int sync, void *keyp) +{ + struct exceptional_entry_key *key = keyp; + struct wait_exceptional_entry_queue *ewait = + container_of(wait, struct wait_exceptional_entry_queue, wait); + + if (key->xa != ewait->key.xa || + key->entry_start != ewait->key.entry_start) + return 0; + return autoremove_wake_function(wait, mode, sync, NULL); +} + +/* + * @entry may no longer be the entry at the index in the mapping. + * The important information it's conveying is whether the entry at + * this index used to be a PMD entry. 
+ */ +static void dax_wake_entry(struct xa_state *xas, void *entry, + enum dax_wake_mode mode) +{ + struct exceptional_entry_key key; + wait_queue_head_t *wq; + + wq = dax_entry_waitqueue(xas, entry, &key); + + /* + * Checking for locked entry and prepare_to_wait_exclusive() happens + * under the i_pages lock, ditto for entry handling in our callers. + * So at this point all tasks that could have seen our entry locked + * must be in the waitqueue and the following check will see them. + */ + if (waitqueue_active(wq)) + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key); +} + +/* + * Look up entry in page cache, wait for it to become unlocked if it + * is a DAX entry and return it. The caller must subsequently call + * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry() + * if it did. The entry returned may have a larger order than @order. + * If @order is larger than the order of the entry found in i_pages, this + * function returns a dax_is_conflict entry. + * + * Must be called with the i_pages lock held. + */ +static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) +{ + void *entry; + struct wait_exceptional_entry_queue ewait; + wait_queue_head_t *wq; + + init_wait(&ewait.wait); + ewait.wait.func = wake_exceptional_entry_func; + + for (;;) { + entry = xas_find_conflict(xas); + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) + return entry; + if (dax_entry_order(entry) < order) + return XA_RETRY_ENTRY; + if (!dax_is_locked(entry)) + return entry; + + wq = dax_entry_waitqueue(xas, entry, &ewait.key); + prepare_to_wait_exclusive(wq, &ewait.wait, + TASK_UNINTERRUPTIBLE); + xas_unlock_irq(xas); + xas_reset(xas); + schedule(); + finish_wait(wq, &ewait.wait); + xas_lock_irq(xas); + } +} + +/* + * The only thing keeping the address space around is the i_pages lock + * (it's cycled in clear_inode() after removing the entries from i_pages) + * After we call xas_unlock_irq(), we cannot touch xas->xa. 
+ */ +static void wait_entry_unlocked(struct xa_state *xas, void *entry) +{ + struct wait_exceptional_entry_queue ewait; + wait_queue_head_t *wq; + + init_wait(&ewait.wait); + ewait.wait.func = wake_exceptional_entry_func; + + wq = dax_entry_waitqueue(xas, entry, &ewait.key); + /* + * Unlike get_unlocked_entry() there is no guarantee that this + * path ever successfully retrieves an unlocked entry before an + * inode dies. Perform a non-exclusive wait in case this path + * never successfully performs its own wake up. + */ + prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE); + xas_unlock_irq(xas); + schedule(); + finish_wait(wq, &ewait.wait); +} + +static void put_unlocked_entry(struct xa_state *xas, void *entry, + enum dax_wake_mode mode) +{ + if (entry && !dax_is_conflict(entry)) + dax_wake_entry(xas, entry, mode); +} + +/* + * We used the xa_state to get the entry, but then we locked the entry and + * dropped the xa_lock, so we know the xa_state is stale and must be reset + * before use. + */ +void dax_unlock_entry(struct xa_state *xas, void *entry) +{ + void *old; + + WARN_ON(dax_is_locked(entry)); + xas_reset(xas); + xas_lock_irq(xas); + old = xas_store(xas, entry); + xas_unlock_irq(xas); + WARN_ON(!dax_is_locked(old)); + dax_wake_entry(xas, entry, WAKE_NEXT); +} + +/* + * Return: The entry stored at this location before it was locked. + */ +static void *dax_lock_entry(struct xa_state *xas, void *entry) +{ + unsigned long v = xa_to_value(entry); + + return xas_store(xas, xa_mk_value(v | DAX_LOCKED)); +} + +static unsigned long dax_entry_size(void *entry) +{ + if (dax_is_zero_entry(entry)) + return 0; + else if (dax_is_empty_entry(entry)) + return 0; + else if (dax_is_pmd_entry(entry)) + return PMD_SIZE; + else + return PAGE_SIZE; +} + +static unsigned long dax_end_pfn(void *entry) +{ + return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE; +} + +/* + * Iterate through all mapped pfns represented by an entry, i.e. skip + * 'empty' and 'zero' entries. 
+ */ +#define for_each_mapped_pfn(entry, pfn) \ + for (pfn = dax_to_pfn(entry); pfn < dax_end_pfn(entry); pfn++) + +static bool dax_mapping_is_cow(struct address_space *mapping) +{ + return (unsigned long)mapping == PAGE_MAPPING_DAX_COW; +} + +/* + * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount. + */ +static void dax_mapping_set_cow(struct page *page) +{ + if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) { + /* + * Reset the index if the page was already mapped + * regularly before. + */ + if (page->mapping) + page->index = 1; + page->mapping = (void *)PAGE_MAPPING_DAX_COW; + } + page->index++; +} + +struct page *dax_pinned_page(void *entry) +{ + unsigned long pfn; + + for_each_mapped_pfn(entry, pfn) { + struct page *page = pfn_to_page(pfn); + + if (page_maybe_dma_pinned(page)) + return page; + } + return NULL; +} + +/* + * When it is called in dax_insert_entry(), the cow flag will indicate that + * whether this entry is shared by multiple files. If so, set the page->mapping + * FS_DAX_MAPPING_COW, and use page->index as refcount. 
+ */ +static vm_fault_t dax_associate_entry(void *entry, + struct address_space *mapping, + struct vm_fault *vmf, unsigned long flags) +{ + unsigned long size = dax_entry_size(entry), pfn, index; + struct dev_pagemap *pgmap; + int i = 0; + + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) + return 0; + + if (!size) + return 0; + + pfn = dax_to_pfn(entry); + pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size)); + if (!pgmap) + return VM_FAULT_SIGBUS; + + index = linear_page_index(vmf->vma, ALIGN(vmf->address, size)); + for_each_mapped_pfn(entry, pfn) { + struct page *page = pfn_to_page(pfn); + + if (flags & DAX_COW) { + dax_mapping_set_cow(page); + } else { + WARN_ON_ONCE(page->mapping); + page->mapping = mapping; + page->index = index + i++; + } + } + + return 0; +} + +static void dax_disassociate_entry(void *entry, struct address_space *mapping, + bool trunc) +{ + unsigned long size = dax_entry_size(entry), pfn; + struct page *page; + + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) + return; + + if (!size) + return; + + page = pfn_to_page(dax_to_pfn(entry)); + put_dev_pagemap_many(page->pgmap, PHYS_PFN(size)); + + for_each_mapped_pfn(entry, pfn) { + page = pfn_to_page(pfn); + WARN_ON_ONCE(trunc && page_maybe_dma_pinned(page)); + if (dax_mapping_is_cow(page->mapping)) { + /* keep the CoW flag if this page is still shared */ + if (page->index-- > 0) + continue; + } else + WARN_ON_ONCE(page->mapping && page->mapping != mapping); + page->mapping = NULL; + page->index = 0; + } +} + +/* + * dax_lock_page - Lock the DAX entry corresponding to a page + * @page: The page whose entry we want to lock + * + * Context: Process context. + * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could + * not be locked. 
+ */ +dax_entry_t dax_lock_page(struct page *page) +{ + XA_STATE(xas, NULL, 0); + void *entry; + + /* Ensure page->mapping isn't freed while we look at it */ + rcu_read_lock(); + for (;;) { + struct address_space *mapping = READ_ONCE(page->mapping); + + entry = NULL; + if (!mapping || !dax_mapping(mapping)) + break; + + /* + * In the device-dax case there's no need to lock, a + * struct dev_pagemap pin is sufficient to keep the + * inode alive, and we assume we have dev_pagemap pin + * otherwise we would not have a valid pfn_to_page() + * translation. + */ + entry = (void *)~0UL; + if (S_ISCHR(mapping->host->i_mode)) + break; + + xas.xa = &mapping->i_pages; + xas_lock_irq(&xas); + if (mapping != page->mapping) { + xas_unlock_irq(&xas); + continue; + } + xas_set(&xas, page->index); + entry = xas_load(&xas); + if (dax_is_locked(entry)) { + rcu_read_unlock(); + wait_entry_unlocked(&xas, entry); + rcu_read_lock(); + continue; + } + dax_lock_entry(&xas, entry); + xas_unlock_irq(&xas); + break; + } + rcu_read_unlock(); + return (dax_entry_t)entry; +} + +void dax_unlock_page(struct page *page, dax_entry_t cookie) +{ + struct address_space *mapping = page->mapping; + XA_STATE(xas, &mapping->i_pages, page->index); + + if (S_ISCHR(mapping->host->i_mode)) + return; + + dax_unlock_entry(&xas, (void *)cookie); +} + +/* + * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping + * @mapping: the file's mapping whose entry we want to lock + * @index: the offset within this file + * @page: output the dax page corresponding to this dax entry + * + * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry + * could not be locked. 
+ */ +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index, + struct page **page) +{ + XA_STATE(xas, NULL, 0); + void *entry; + + rcu_read_lock(); + for (;;) { + entry = NULL; + if (!dax_mapping(mapping)) + break; + + xas.xa = &mapping->i_pages; + xas_lock_irq(&xas); + xas_set(&xas, index); + entry = xas_load(&xas); + if (dax_is_locked(entry)) { + rcu_read_unlock(); + wait_entry_unlocked(&xas, entry); + rcu_read_lock(); + continue; + } + if (!entry || dax_is_zero_entry(entry) || + dax_is_empty_entry(entry)) { + /* + * Because we are looking for entry from file's mapping + * and index, so the entry may not be inserted for now, + * or even a zero/empty entry. We don't think this is + * an error case. So, return a special value and do + * not output @page. + */ + entry = (void *)~0UL; + } else { + *page = pfn_to_page(dax_to_pfn(entry)); + dax_lock_entry(&xas, entry); + } + xas_unlock_irq(&xas); + break; + } + rcu_read_unlock(); + return (dax_entry_t)entry; +} + +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index, + dax_entry_t cookie) +{ + XA_STATE(xas, &mapping->i_pages, index); + + if (cookie == ~0UL) + return; + + dax_unlock_entry(&xas, (void *)cookie); +} + +/* + * Find page cache entry at given index. If it is a DAX entry, return it + * with the entry locked. If the page cache doesn't contain an entry at + * that index, add a locked empty entry. + * + * When requesting an entry with size DAX_PMD, dax_grab_mapping_entry() will + * either return that locked entry or will return VM_FAULT_FALLBACK. + * This will happen if there are any PTE entries within the PMD range + * that we are requesting. + * + * We always favor PTE entries over PMD entries. There isn't a flow where we + * evict PTE entries in order to 'upgrade' them to a PMD entry. 
A PMD + * insertion will fail if it finds any PTE entries already in the tree, and a + * PTE insertion will cause an existing PMD entry to be unmapped and + * downgraded to PTE entries. This happens for both PMD zero pages as + * well as PMD empty entries. + * + * The exception to this downgrade path is for PMD entries that have + * real storage backing them. We will leave these real PMD entries in + * the tree, and PTE writes will simply dirty the entire PMD entry. + * + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For + * persistent memory the benefit is doubtful. We can add that later if we can + * show it helps. + * + * On error, this function does not return an ERR_PTR. Instead it returns + * a VM_FAULT code, encoded as an xarray internal entry. The ERR_PTR values + * overlap with xarray value entries. + */ +void *dax_grab_mapping_entry(struct xa_state *xas, + struct address_space *mapping, unsigned int order) +{ + unsigned long index = xas->xa_index; + bool pmd_downgrade; /* splitting PMD entry into PTE entries? */ + void *entry; + +retry: + pmd_downgrade = false; + xas_lock_irq(xas); + entry = get_unlocked_entry(xas, order); + + if (entry) { + if (dax_is_conflict(entry)) + goto fallback; + if (!xa_is_value(entry)) { + xas_set_err(xas, -EIO); + goto out_unlock; + } + + if (order == 0) { + if (dax_is_pmd_entry(entry) && + (dax_is_zero_entry(entry) || + dax_is_empty_entry(entry))) { + pmd_downgrade = true; + } + } + } + + if (pmd_downgrade) { + /* + * Make sure 'entry' remains valid while we drop + * the i_pages lock. + */ + dax_lock_entry(xas, entry); + + /* + * Besides huge zero pages the only other thing that gets + * downgraded are empty entries which don't need to be + * unmapped. 
+ */ + if (dax_is_zero_entry(entry)) { + xas_unlock_irq(xas); + unmap_mapping_pages(mapping, + xas->xa_index & ~PG_PMD_COLOUR, + PG_PMD_NR, false); + xas_reset(xas); + xas_lock_irq(xas); + } + + dax_disassociate_entry(entry, mapping, false); + xas_store(xas, NULL); /* undo the PMD join */ + dax_wake_entry(xas, entry, WAKE_ALL); + mapping->nrpages -= PG_PMD_NR; + entry = NULL; + xas_set(xas, index); + } + + if (entry) { + dax_lock_entry(xas, entry); + } else { + unsigned long flags = DAX_EMPTY; + + if (order > 0) + flags |= DAX_PMD; + entry = dax_make_entry(pfn_to_pfn_t(0), flags); + dax_lock_entry(xas, entry); + if (xas_error(xas)) + goto out_unlock; + mapping->nrpages += 1UL << order; + } + +out_unlock: + xas_unlock_irq(xas); + if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) + goto retry; + if (xas->xa_node == XA_ERROR(-ENOMEM)) + return xa_mk_internal(VM_FAULT_OOM); + if (xas_error(xas)) + return xa_mk_internal(VM_FAULT_SIGBUS); + return entry; +fallback: + xas_unlock_irq(xas); + return xa_mk_internal(VM_FAULT_FALLBACK); +} + +/** + * dax_layout_pinned_page_range - find first pinned page in @mapping + * @mapping: address space to scan for a page with ref count > 1 + * @start: Starting offset. Page containing 'start' is included. + * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, + * pages from 'start' till the end of file are included. + * + * DAX requires ZONE_DEVICE mapped pages. These pages are never + * 'onlined' to the page allocator so they are considered idle when + * page->count == 1. A filesystem uses this interface to determine if + * any page in the mapping is busy, i.e. for DMA, or other + * get_user_pages() usages. + * + * It is expected that the filesystem is holding locks to block the + * establishment of new mappings in this address_space. I.e. it expects + * to be able to run unmap_mapping_range() and subsequently not race + * mapping_mapped() becoming true. 
+ */ +struct page *dax_layout_pinned_page_range(struct address_space *mapping, + loff_t start, loff_t end) +{ + void *entry; + unsigned int scanned = 0; + struct page *page = NULL; + pgoff_t start_idx = start >> PAGE_SHIFT; + pgoff_t end_idx; + XA_STATE(xas, &mapping->i_pages, start_idx); + + /* + * In the 'limited' case get_user_pages() for dax is disabled. + */ + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) + return NULL; + + if (!dax_mapping(mapping) || !mapping_mapped(mapping)) + return NULL; + + /* If end == LLONG_MAX, all pages from start to till end of file */ + if (end == LLONG_MAX) + end_idx = ULONG_MAX; + else + end_idx = end >> PAGE_SHIFT; + /* + * If we race get_user_pages_fast() here either we'll see the + * elevated page count in the iteration and wait, or + * get_user_pages_fast() will see that the page it took a reference + * against is no longer mapped in the page tables and bail to the + * get_user_pages() slow path. The slow path is protected by + * pte_lock() and pmd_lock(). New references are not taken without + * holding those locks, and unmap_mapping_pages() will not zero the + * pte or pmd without holding the respective lock, so we are + * guaranteed to either see new references or prevent new + * references from being established. 
+ */ + unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0); + + xas_lock_irq(&xas); + xas_for_each(&xas, entry, end_idx) { + if (WARN_ON_ONCE(!xa_is_value(entry))) + continue; + if (unlikely(dax_is_locked(entry))) + entry = get_unlocked_entry(&xas, 0); + if (entry) + page = dax_pinned_page(entry); + put_unlocked_entry(&xas, entry, WAKE_NEXT); + if (page) + break; + if (++scanned % XA_CHECK_SCHED) + continue; + + xas_pause(&xas); + xas_unlock_irq(&xas); + cond_resched(); + xas_lock_irq(&xas); + } + xas_unlock_irq(&xas); + return page; +} +EXPORT_SYMBOL_GPL(dax_layout_pinned_page_range); + +struct page *dax_layout_pinned_page(struct address_space *mapping) +{ + return dax_layout_pinned_page_range(mapping, 0, LLONG_MAX); +} +EXPORT_SYMBOL_GPL(dax_layout_pinned_page); + +static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index, + bool trunc) +{ + XA_STATE(xas, &mapping->i_pages, index); + int ret = 0; + void *entry; + + xas_lock_irq(&xas); + entry = get_unlocked_entry(&xas, 0); + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) + goto out; + if (!trunc && (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) || + xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE))) + goto out; + dax_disassociate_entry(entry, mapping, trunc); + xas_store(&xas, NULL); + mapping->nrpages -= 1UL << dax_entry_order(entry); + ret = 1; +out: + put_unlocked_entry(&xas, entry, WAKE_ALL); + xas_unlock_irq(&xas); + return ret; +} + +int dax_invalidate_mapping_entry_sync(struct address_space *mapping, + pgoff_t index) +{ + return __dax_invalidate_entry(mapping, index, false); +} + +/* + * Delete DAX entry at @index from @mapping. Wait for it + * to be unlocked before deleting it. + */ +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index) +{ + int ret = __dax_invalidate_entry(mapping, index, true); + + /* + * This gets called from truncate / punch_hole path. 
As such, the caller + * must hold locks protecting against concurrent modifications of the + * page cache (usually fs-private i_mmap_sem for writing). Since the + * caller has seen a DAX entry for this index, we better find it + * at that index as well... + */ + WARN_ON_ONCE(!ret); + return ret; +} + +/* + * By this point dax_grab_mapping_entry() has ensured that we have a locked entry + * of the appropriate size so we don't have to worry about downgrading PMDs to + * PTEs. If we happen to be trying to insert a PTE and there is a PMD + * already in the tree, we will skip the insertion and just dirty the PMD as + * appropriate. + */ +vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, + void **pentry, pfn_t pfn, unsigned long flags) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + void *new_entry = dax_make_entry(pfn, flags); + bool dirty = flags & DAX_DIRTY; + bool cow = flags & DAX_COW; + void *entry = *pentry; + + if (dirty) + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { + unsigned long index = xas->xa_index; + /* we are replacing a zero page with block mapping */ + if (dax_is_pmd_entry(entry)) + unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR, + PG_PMD_NR, false); + else /* pte entry */ + unmap_mapping_pages(mapping, index, 1, false); + } + + xas_reset(xas); + xas_lock_irq(xas); + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { + void *old; + + dax_disassociate_entry(entry, mapping, false); + dax_associate_entry(new_entry, mapping, vmf, flags); + /* + * Only swap our new entry into the page cache if the current + * entry is a zero page or an empty entry. If a normal PTE or + * PMD entry is already in the cache, we leave it alone. This + * means that if we are trying to insert a PTE and the + * existing entry is a PMD, we will just leave the PMD in the + * tree and dirty it if necessary. 
+ */ + old = dax_lock_entry(xas, new_entry); + WARN_ON_ONCE(old != + xa_mk_value(xa_to_value(entry) | DAX_LOCKED)); + entry = new_entry; + } else { + xas_load(xas); /* Walk the xa_state */ + } + + if (dirty) + xas_set_mark(xas, PAGECACHE_TAG_DIRTY); + + if (cow) + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); + + xas_unlock_irq(xas); + *pentry = entry; + return 0; +} + +int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, + struct address_space *mapping, void *entry) +{ + unsigned long pfn, index, count, end; + long ret = 0; + struct vm_area_struct *vma; + + /* + * A page got tagged dirty in DAX mapping? Something is seriously + * wrong. + */ + if (WARN_ON(!xa_is_value(entry))) + return -EIO; + + if (unlikely(dax_is_locked(entry))) { + void *old_entry = entry; + + entry = get_unlocked_entry(xas, 0); + + /* Entry got punched out / reallocated? */ + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) + goto put_unlocked; + /* + * Entry got reallocated elsewhere? No need to writeback. + * We have to compare pfns as we must not bail out due to + * difference in lockbit or entry type. + */ + if (dax_to_pfn(old_entry) != dax_to_pfn(entry)) + goto put_unlocked; + if (WARN_ON_ONCE(dax_is_empty_entry(entry) || + dax_is_zero_entry(entry))) { + ret = -EIO; + goto put_unlocked; + } + + /* Another fsync thread may have already done this entry */ + if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE)) + goto put_unlocked; + } + + /* Lock the entry to serialize with page faults */ + dax_lock_entry(xas, entry); + + /* + * We can clear the tag now but we have to be careful so that concurrent + * dax_writeback_one() calls for the same index cannot finish before we + * actually flush the caches. This is achieved as the calls will look + * at the entry only under the i_pages lock and once they do that + * they will see the entry locked and wait for it to unlock. 
+ */ + xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE); + xas_unlock_irq(xas); + + /* + * If dax_writeback_mapping_range() was given a wbc->range_start + * in the middle of a PMD, the 'index' we use needs to be + * aligned to the start of the PMD. + * This allows us to flush for PMD_SIZE and not have to worry about + * partial PMD writebacks. + */ + pfn = dax_to_pfn(entry); + count = 1UL << dax_entry_order(entry); + index = xas->xa_index & ~(count - 1); + end = index + count - 1; + + /* Walk all mappings of a given index of a file and writeprotect them */ + i_mmap_lock_read(mapping); + vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) { + pfn_mkclean_range(pfn, count, index, vma); + cond_resched(); + } + i_mmap_unlock_read(mapping); + + dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE); + /* + * After we have flushed the cache, we can clear the dirty tag. There + * cannot be new dirty data in the pfn after the flush has completed as + * the pfn mappings are writeprotected and fault waits for mapping + * entry lock. + */ + xas_reset(xas); + xas_lock_irq(xas); + xas_store(xas, entry); + xas_clear_mark(xas, PAGECACHE_TAG_DIRTY); + dax_wake_entry(xas, entry, WAKE_NEXT); + + trace_dax_writeback_one(mapping->host, index, count); + return ret; + + put_unlocked: + put_unlocked_entry(xas, entry, WAKE_NEXT); + return ret; +} + +/* + * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables + * @vmf: The description of the fault + * @pfn: PFN to insert + * @order: Order of entry to insert. + * + * This function inserts a writeable PTE or PMD entry into the page tables + * for an mmaped DAX file. It also marks the page cache entry as dirty. 
+ */ +vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, + unsigned int order) +{ + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order); + void *entry; + vm_fault_t ret; + + xas_lock_irq(&xas); + entry = get_unlocked_entry(&xas, order); + /* Did we race with someone splitting entry or so? */ + if (!entry || dax_is_conflict(entry) || + (order == 0 && !dax_is_pte_entry(entry))) { + put_unlocked_entry(&xas, entry, WAKE_NEXT); + xas_unlock_irq(&xas); + trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, + VM_FAULT_NOPAGE); + return VM_FAULT_NOPAGE; + } + xas_set_mark(&xas, PAGECACHE_TAG_DIRTY); + dax_lock_entry(&xas, entry); + xas_unlock_irq(&xas); + if (order == 0) + ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn); +#ifdef CONFIG_FS_DAX_PMD + else if (order == PMD_ORDER) + ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE); +#endif + else + ret = VM_FAULT_FALLBACK; + dax_unlock_entry(&xas, entry); + trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret); + return ret; +} diff --git a/fs/dax.c b/fs/dax.c index b23222b0dae4..79e49e718d33 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -27,809 +27,8 @@ #include #include -#define CREATE_TRACE_POINTS #include -static inline unsigned int pe_order(enum page_entry_size pe_size) -{ - if (pe_size == PE_SIZE_PTE) - return PAGE_SHIFT - PAGE_SHIFT; - if (pe_size == PE_SIZE_PMD) - return PMD_SHIFT - PAGE_SHIFT; - if (pe_size == PE_SIZE_PUD) - return PUD_SHIFT - PAGE_SHIFT; - return ~0; -} - -/* We choose 4096 entries - same as per-zone page wait tables */ -#define DAX_WAIT_TABLE_BITS 12 -#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS) - -/* The 'colour' (ie low bits) within a PMD of a page offset. 
*/ -#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1) -#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT) - -/* The order of a PMD entry */ -#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) - -static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES]; - -static int __init init_dax_wait_table(void) -{ - int i; - - for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++) - init_waitqueue_head(wait_table + i); - return 0; -} -fs_initcall(init_dax_wait_table); - -/* - * DAX pagecache entries use XArray value entries so they can't be mistaken - * for pages. We use one bit for locking, one bit for the entry size (PMD) - * and two more to tell us if the entry is a zero page or an empty entry that - * is just used for locking. In total four special bits. - * - * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE - * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem - * block allocation. - */ -#define DAX_SHIFT (4) -#define DAX_MASK ((1UL << DAX_SHIFT) - 1) -#define DAX_LOCKED (1UL << 0) -#define DAX_PMD (1UL << 1) -#define DAX_ZERO_PAGE (1UL << 2) -#define DAX_EMPTY (1UL << 3) - -/* - * These flags are not conveyed in Xarray value entries, they are just - * modifiers to dax_insert_entry(). 
- */ -#define DAX_DIRTY (1UL << (DAX_SHIFT + 0)) -#define DAX_COW (1UL << (DAX_SHIFT + 1)) - -static unsigned long dax_to_pfn(void *entry) -{ - return xa_to_value(entry) >> DAX_SHIFT; -} - -static void *dax_make_entry(pfn_t pfn, unsigned long flags) -{ - return xa_mk_value((flags & DAX_MASK) | - (pfn_t_to_pfn(pfn) << DAX_SHIFT)); -} - -static bool dax_is_locked(void *entry) -{ - return xa_to_value(entry) & DAX_LOCKED; -} - -static unsigned int dax_entry_order(void *entry) -{ - if (xa_to_value(entry) & DAX_PMD) - return PMD_ORDER; - return 0; -} - -static unsigned long dax_is_pmd_entry(void *entry) -{ - return xa_to_value(entry) & DAX_PMD; -} - -static bool dax_is_pte_entry(void *entry) -{ - return !(xa_to_value(entry) & DAX_PMD); -} - -static int dax_is_zero_entry(void *entry) -{ - return xa_to_value(entry) & DAX_ZERO_PAGE; -} - -static int dax_is_empty_entry(void *entry) -{ - return xa_to_value(entry) & DAX_EMPTY; -} - -/* - * true if the entry that was found is of a smaller order than the entry - * we were looking for - */ -static bool dax_is_conflict(void *entry) -{ - return entry == XA_RETRY_ENTRY; -} - -/* - * DAX page cache entry locking - */ -struct exceptional_entry_key { - struct xarray *xa; - pgoff_t entry_start; -}; - -struct wait_exceptional_entry_queue { - wait_queue_entry_t wait; - struct exceptional_entry_key key; -}; - -/** - * enum dax_wake_mode: waitqueue wakeup behaviour - * @WAKE_ALL: wake all waiters in the waitqueue - * @WAKE_NEXT: wake only the first waiter in the waitqueue - */ -enum dax_wake_mode { - WAKE_ALL, - WAKE_NEXT, -}; - -static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, - void *entry, struct exceptional_entry_key *key) -{ - unsigned long hash; - unsigned long index = xas->xa_index; - - /* - * If 'entry' is a PMD, align the 'index' that we use for the wait - * queue to the start of that PMD. This ensures that all offsets in - * the range covered by the PMD map to the same bit lock. 
- */ - if (dax_is_pmd_entry(entry)) - index &= ~PG_PMD_COLOUR; - key->xa = xas->xa; - key->entry_start = index; - - hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS); - return wait_table + hash; -} - -static int wake_exceptional_entry_func(wait_queue_entry_t *wait, - unsigned int mode, int sync, void *keyp) -{ - struct exceptional_entry_key *key = keyp; - struct wait_exceptional_entry_queue *ewait = - container_of(wait, struct wait_exceptional_entry_queue, wait); - - if (key->xa != ewait->key.xa || - key->entry_start != ewait->key.entry_start) - return 0; - return autoremove_wake_function(wait, mode, sync, NULL); -} - -/* - * @entry may no longer be the entry at the index in the mapping. - * The important information it's conveying is whether the entry at - * this index used to be a PMD entry. - */ -static void dax_wake_entry(struct xa_state *xas, void *entry, - enum dax_wake_mode mode) -{ - struct exceptional_entry_key key; - wait_queue_head_t *wq; - - wq = dax_entry_waitqueue(xas, entry, &key); - - /* - * Checking for locked entry and prepare_to_wait_exclusive() happens - * under the i_pages lock, ditto for entry handling in our callers. - * So at this point all tasks that could have seen our entry locked - * must be in the waitqueue and the following check will see them. - */ - if (waitqueue_active(wq)) - __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key); -} - -/* - * Look up entry in page cache, wait for it to become unlocked if it - * is a DAX entry and return it. The caller must subsequently call - * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry() - * if it did. The entry returned may have a larger order than @order. - * If @order is larger than the order of the entry found in i_pages, this - * function returns a dax_is_conflict entry. - * - * Must be called with the i_pages lock held. 
- */ -static void *get_unlocked_entry(struct xa_state *xas, unsigned int order) -{ - void *entry; - struct wait_exceptional_entry_queue ewait; - wait_queue_head_t *wq; - - init_wait(&ewait.wait); - ewait.wait.func = wake_exceptional_entry_func; - - for (;;) { - entry = xas_find_conflict(xas); - if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) - return entry; - if (dax_entry_order(entry) < order) - return XA_RETRY_ENTRY; - if (!dax_is_locked(entry)) - return entry; - - wq = dax_entry_waitqueue(xas, entry, &ewait.key); - prepare_to_wait_exclusive(wq, &ewait.wait, - TASK_UNINTERRUPTIBLE); - xas_unlock_irq(xas); - xas_reset(xas); - schedule(); - finish_wait(wq, &ewait.wait); - xas_lock_irq(xas); - } -} - -/* - * The only thing keeping the address space around is the i_pages lock - * (it's cycled in clear_inode() after removing the entries from i_pages) - * After we call xas_unlock_irq(), we cannot touch xas->xa. - */ -static void wait_entry_unlocked(struct xa_state *xas, void *entry) -{ - struct wait_exceptional_entry_queue ewait; - wait_queue_head_t *wq; - - init_wait(&ewait.wait); - ewait.wait.func = wake_exceptional_entry_func; - - wq = dax_entry_waitqueue(xas, entry, &ewait.key); - /* - * Unlike get_unlocked_entry() there is no guarantee that this - * path ever successfully retrieves an unlocked entry before an - * inode dies. Perform a non-exclusive wait in case this path - * never successfully performs its own wake up. - */ - prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE); - xas_unlock_irq(xas); - schedule(); - finish_wait(wq, &ewait.wait); -} - -static void put_unlocked_entry(struct xa_state *xas, void *entry, - enum dax_wake_mode mode) -{ - if (entry && !dax_is_conflict(entry)) - dax_wake_entry(xas, entry, mode); -} - -/* - * We used the xa_state to get the entry, but then we locked the entry and - * dropped the xa_lock, so we know the xa_state is stale and must be reset - * before use. 
- */ -static void dax_unlock_entry(struct xa_state *xas, void *entry) -{ - void *old; - - BUG_ON(dax_is_locked(entry)); - xas_reset(xas); - xas_lock_irq(xas); - old = xas_store(xas, entry); - xas_unlock_irq(xas); - BUG_ON(!dax_is_locked(old)); - dax_wake_entry(xas, entry, WAKE_NEXT); -} - -/* - * Return: The entry stored at this location before it was locked. - */ -static void *dax_lock_entry(struct xa_state *xas, void *entry) -{ - unsigned long v = xa_to_value(entry); - return xas_store(xas, xa_mk_value(v | DAX_LOCKED)); -} - -static unsigned long dax_entry_size(void *entry) -{ - if (dax_is_zero_entry(entry)) - return 0; - else if (dax_is_empty_entry(entry)) - return 0; - else if (dax_is_pmd_entry(entry)) - return PMD_SIZE; - else - return PAGE_SIZE; -} - -static unsigned long dax_end_pfn(void *entry) -{ - return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE; -} - -/* - * Iterate through all mapped pfns represented by an entry, i.e. skip - * 'empty' and 'zero' entries. - */ -#define for_each_mapped_pfn(entry, pfn) \ - for (pfn = dax_to_pfn(entry); \ - pfn < dax_end_pfn(entry); pfn++) - -static inline bool dax_mapping_is_cow(struct address_space *mapping) -{ - return (unsigned long)mapping == PAGE_MAPPING_DAX_COW; -} - -/* - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount. - */ -static inline void dax_mapping_set_cow(struct page *page) -{ - if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) { - /* - * Reset the index if the page was already mapped - * regularly before. - */ - if (page->mapping) - page->index = 1; - page->mapping = (void *)PAGE_MAPPING_DAX_COW; - } - page->index++; -} - -/* - * When it is called in dax_insert_entry(), the cow flag will indicate that - * whether this entry is shared by multiple files. If so, set the page->mapping - * FS_DAX_MAPPING_COW, and use page->index as refcount. 
- */ -static vm_fault_t dax_associate_entry(void *entry, - struct address_space *mapping, - struct vm_fault *vmf, unsigned long flags) -{ - unsigned long size = dax_entry_size(entry), pfn, index; - struct dev_pagemap *pgmap; - int i = 0; - - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return 0; - - if (!size) - return 0; - - pfn = dax_to_pfn(entry); - pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size)); - if (!pgmap) - return VM_FAULT_SIGBUS; - - index = linear_page_index(vmf->vma, ALIGN(vmf->address, size)); - for_each_mapped_pfn(entry, pfn) { - struct page *page = pfn_to_page(pfn); - - if (flags & DAX_COW) { - dax_mapping_set_cow(page); - } else { - WARN_ON_ONCE(page->mapping); - page->mapping = mapping; - page->index = index + i++; - } - } - - return 0; -} - -static void dax_disassociate_entry(void *entry, struct address_space *mapping, - bool trunc) -{ - unsigned long size = dax_entry_size(entry), pfn; - struct page *page; - - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return; - - if (!size) - return; - - page = pfn_to_page(dax_to_pfn(entry)); - put_dev_pagemap_many(page->pgmap, PHYS_PFN(size)); - - for_each_mapped_pfn(entry, pfn) { - page = pfn_to_page(pfn); - WARN_ON_ONCE(trunc && page_maybe_dma_pinned(page)); - if (dax_mapping_is_cow(page->mapping)) { - /* keep the CoW flag if this page is still shared */ - if (page->index-- > 0) - continue; - } else - WARN_ON_ONCE(page->mapping && page->mapping != mapping); - page->mapping = NULL; - page->index = 0; - } -} - -static struct page *dax_pinned_page(void *entry) -{ - unsigned long pfn; - - for_each_mapped_pfn(entry, pfn) { - struct page *page = pfn_to_page(pfn); - - if (page_maybe_dma_pinned(page)) - return page; - } - return NULL; -} - -/* - * dax_lock_page - Lock the DAX entry corresponding to a page - * @page: The page whose entry we want to lock - * - * Context: Process context. - * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could - * not be locked. 
- */ -dax_entry_t dax_lock_page(struct page *page) -{ - XA_STATE(xas, NULL, 0); - void *entry; - - /* Ensure page->mapping isn't freed while we look at it */ - rcu_read_lock(); - for (;;) { - struct address_space *mapping = READ_ONCE(page->mapping); - - entry = NULL; - if (!mapping || !dax_mapping(mapping)) - break; - - /* - * In the device-dax case there's no need to lock, a - * struct dev_pagemap pin is sufficient to keep the - * inode alive, and we assume we have dev_pagemap pin - * otherwise we would not have a valid pfn_to_page() - * translation. - */ - entry = (void *)~0UL; - if (S_ISCHR(mapping->host->i_mode)) - break; - - xas.xa = &mapping->i_pages; - xas_lock_irq(&xas); - if (mapping != page->mapping) { - xas_unlock_irq(&xas); - continue; - } - xas_set(&xas, page->index); - entry = xas_load(&xas); - if (dax_is_locked(entry)) { - rcu_read_unlock(); - wait_entry_unlocked(&xas, entry); - rcu_read_lock(); - continue; - } - dax_lock_entry(&xas, entry); - xas_unlock_irq(&xas); - break; - } - rcu_read_unlock(); - return (dax_entry_t)entry; -} - -void dax_unlock_page(struct page *page, dax_entry_t cookie) -{ - struct address_space *mapping = page->mapping; - XA_STATE(xas, &mapping->i_pages, page->index); - - if (S_ISCHR(mapping->host->i_mode)) - return; - - dax_unlock_entry(&xas, (void *)cookie); -} - -/* - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping - * @mapping: the file's mapping whose entry we want to lock - * @index: the offset within this file - * @page: output the dax page corresponding to this dax entry - * - * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry - * could not be locked. 
- */ -dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index, - struct page **page) -{ - XA_STATE(xas, NULL, 0); - void *entry; - - rcu_read_lock(); - for (;;) { - entry = NULL; - if (!dax_mapping(mapping)) - break; - - xas.xa = &mapping->i_pages; - xas_lock_irq(&xas); - xas_set(&xas, index); - entry = xas_load(&xas); - if (dax_is_locked(entry)) { - rcu_read_unlock(); - wait_entry_unlocked(&xas, entry); - rcu_read_lock(); - continue; - } - if (!entry || - dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { - /* - * Because we are looking for entry from file's mapping - * and index, so the entry may not be inserted for now, - * or even a zero/empty entry. We don't think this is - * an error case. So, return a special value and do - * not output @page. - */ - entry = (void *)~0UL; - } else { - *page = pfn_to_page(dax_to_pfn(entry)); - dax_lock_entry(&xas, entry); - } - xas_unlock_irq(&xas); - break; - } - rcu_read_unlock(); - return (dax_entry_t)entry; -} - -void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index, - dax_entry_t cookie) -{ - XA_STATE(xas, &mapping->i_pages, index); - - if (cookie == ~0UL) - return; - - dax_unlock_entry(&xas, (void *)cookie); -} - -/* - * Find page cache entry at given index. If it is a DAX entry, return it - * with the entry locked. If the page cache doesn't contain an entry at - * that index, add a locked empty entry. - * - * When requesting an entry with size DAX_PMD, grab_mapping_entry() will - * either return that locked entry or will return VM_FAULT_FALLBACK. - * This will happen if there are any PTE entries within the PMD range - * that we are requesting. - * - * We always favor PTE entries over PMD entries. There isn't a flow where we - * evict PTE entries in order to 'upgrade' them to a PMD entry. 
A PMD - * insertion will fail if it finds any PTE entries already in the tree, and a - * PTE insertion will cause an existing PMD entry to be unmapped and - * downgraded to PTE entries. This happens for both PMD zero pages as - * well as PMD empty entries. - * - * The exception to this downgrade path is for PMD entries that have - * real storage backing them. We will leave these real PMD entries in - * the tree, and PTE writes will simply dirty the entire PMD entry. - * - * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For - * persistent memory the benefit is doubtful. We can add that later if we can - * show it helps. - * - * On error, this function does not return an ERR_PTR. Instead it returns - * a VM_FAULT code, encoded as an xarray internal entry. The ERR_PTR values - * overlap with xarray value entries. - */ -static void *grab_mapping_entry(struct xa_state *xas, - struct address_space *mapping, unsigned int order) -{ - unsigned long index = xas->xa_index; - bool pmd_downgrade; /* splitting PMD entry into PTE entries? */ - void *entry; - -retry: - pmd_downgrade = false; - xas_lock_irq(xas); - entry = get_unlocked_entry(xas, order); - - if (entry) { - if (dax_is_conflict(entry)) - goto fallback; - if (!xa_is_value(entry)) { - xas_set_err(xas, -EIO); - goto out_unlock; - } - - if (order == 0) { - if (dax_is_pmd_entry(entry) && - (dax_is_zero_entry(entry) || - dax_is_empty_entry(entry))) { - pmd_downgrade = true; - } - } - } - - if (pmd_downgrade) { - /* - * Make sure 'entry' remains valid while we drop - * the i_pages lock. - */ - dax_lock_entry(xas, entry); - - /* - * Besides huge zero pages the only other thing that gets - * downgraded are empty entries which don't need to be - * unmapped. 
- */ - if (dax_is_zero_entry(entry)) { - xas_unlock_irq(xas); - unmap_mapping_pages(mapping, - xas->xa_index & ~PG_PMD_COLOUR, - PG_PMD_NR, false); - xas_reset(xas); - xas_lock_irq(xas); - } - - dax_disassociate_entry(entry, mapping, false); - xas_store(xas, NULL); /* undo the PMD join */ - dax_wake_entry(xas, entry, WAKE_ALL); - mapping->nrpages -= PG_PMD_NR; - entry = NULL; - xas_set(xas, index); - } - - if (entry) { - dax_lock_entry(xas, entry); - } else { - unsigned long flags = DAX_EMPTY; - - if (order > 0) - flags |= DAX_PMD; - entry = dax_make_entry(pfn_to_pfn_t(0), flags); - dax_lock_entry(xas, entry); - if (xas_error(xas)) - goto out_unlock; - mapping->nrpages += 1UL << order; - } - -out_unlock: - xas_unlock_irq(xas); - if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM)) - goto retry; - if (xas->xa_node == XA_ERROR(-ENOMEM)) - return xa_mk_internal(VM_FAULT_OOM); - if (xas_error(xas)) - return xa_mk_internal(VM_FAULT_SIGBUS); - return entry; -fallback: - xas_unlock_irq(xas); - return xa_mk_internal(VM_FAULT_FALLBACK); -} - -/** - * dax_layout_pinned_page_range - find first pinned page in @mapping - * @mapping: address space to scan for a page with ref count > 1 - * @start: Starting offset. Page containing 'start' is included. - * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, - * pages from 'start' till the end of file are included. - * - * DAX requires ZONE_DEVICE mapped pages. These pages are never - * 'onlined' to the page allocator so they are considered idle when - * page->count == 1. A filesystem uses this interface to determine if - * any page in the mapping is busy, i.e. for DMA, or other - * get_user_pages() usages. - * - * It is expected that the filesystem is holding locks to block the - * establishment of new mappings in this address_space. I.e. it expects - * to be able to run unmap_mapping_range() and subsequently not race - * mapping_mapped() becoming true. 
- */ -struct page *dax_layout_pinned_page_range(struct address_space *mapping, - loff_t start, loff_t end) -{ - void *entry; - unsigned int scanned = 0; - struct page *page = NULL; - pgoff_t start_idx = start >> PAGE_SHIFT; - pgoff_t end_idx; - XA_STATE(xas, &mapping->i_pages, start_idx); - - /* - * In the 'limited' case get_user_pages() for dax is disabled. - */ - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) - return NULL; - - if (!dax_mapping(mapping) || !mapping_mapped(mapping)) - return NULL; - - /* If end == LLONG_MAX, all pages from start to till end of file */ - if (end == LLONG_MAX) - end_idx = ULONG_MAX; - else - end_idx = end >> PAGE_SHIFT; - /* - * If we race get_user_pages_fast() here either we'll see the - * elevated page count in the iteration and wait, or - * get_user_pages_fast() will see that the page it took a reference - * against is no longer mapped in the page tables and bail to the - * get_user_pages() slow path. The slow path is protected by - * pte_lock() and pmd_lock(). New references are not taken without - * holding those locks, and unmap_mapping_pages() will not zero the - * pte or pmd without holding the respective lock, so we are - * guaranteed to either see new references or prevent new - * references from being established. 
- */ - unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0); - - xas_lock_irq(&xas); - xas_for_each(&xas, entry, end_idx) { - if (WARN_ON_ONCE(!xa_is_value(entry))) - continue; - if (unlikely(dax_is_locked(entry))) - entry = get_unlocked_entry(&xas, 0); - if (entry) - page = dax_pinned_page(entry); - put_unlocked_entry(&xas, entry, WAKE_NEXT); - if (page) - break; - if (++scanned % XA_CHECK_SCHED) - continue; - - xas_pause(&xas); - xas_unlock_irq(&xas); - cond_resched(); - xas_lock_irq(&xas); - } - xas_unlock_irq(&xas); - return page; -} -EXPORT_SYMBOL_GPL(dax_layout_pinned_page_range); - -struct page *dax_layout_pinned_page(struct address_space *mapping) -{ - return dax_layout_pinned_page_range(mapping, 0, LLONG_MAX); -} -EXPORT_SYMBOL_GPL(dax_layout_pinned_page); - -static int __dax_invalidate_entry(struct address_space *mapping, - pgoff_t index, bool trunc) -{ - XA_STATE(xas, &mapping->i_pages, index); - int ret = 0; - void *entry; - - xas_lock_irq(&xas); - entry = get_unlocked_entry(&xas, 0); - if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) - goto out; - if (!trunc && - (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) || - xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE))) - goto out; - dax_disassociate_entry(entry, mapping, trunc); - xas_store(&xas, NULL); - mapping->nrpages -= 1UL << dax_entry_order(entry); - ret = 1; -out: - put_unlocked_entry(&xas, entry, WAKE_ALL); - xas_unlock_irq(&xas); - return ret; -} - -/* - * Delete DAX entry at @index from @mapping. Wait for it - * to be unlocked before deleting it. - */ -int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index) -{ - int ret = __dax_invalidate_entry(mapping, index, true); - - /* - * This gets called from truncate / punch_hole path. As such, the caller - * must hold locks protecting against concurrent modifications of the - * page cache (usually fs-private i_mmap_sem for writing). 
Since the - * caller has seen a DAX entry for this index, we better find it - * at that index as well... - */ - WARN_ON_ONCE(!ret); - return ret; -} - -/* - * Invalidate DAX entry if it is clean. - */ -int dax_invalidate_mapping_entry_sync(struct address_space *mapping, - pgoff_t index) -{ - return __dax_invalidate_entry(mapping, index, false); -} - static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos) { return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset); @@ -856,195 +55,6 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter return 0; } -/* - * MAP_SYNC on a dax mapping guarantees dirty metadata is - * flushed on write-faults (non-cow), but not read-faults. - */ -static bool dax_fault_is_synchronous(const struct iomap_iter *iter, - struct vm_area_struct *vma) -{ - return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) && - (iter->iomap.flags & IOMAP_F_DIRTY); -} - -static bool dax_fault_is_cow(const struct iomap_iter *iter) -{ - return (iter->flags & IOMAP_WRITE) && - (iter->iomap.flags & IOMAP_F_SHARED); -} - -static unsigned long dax_iter_flags(const struct iomap_iter *iter, - struct vm_fault *vmf) -{ - unsigned long flags = 0; - - if (!dax_fault_is_synchronous(iter, vmf->vma)) - flags |= DAX_DIRTY; - - if (dax_fault_is_cow(iter)) - flags |= DAX_COW; - - return flags; -} - -/* - * By this point grab_mapping_entry() has ensured that we have a locked entry - * of the appropriate size so we don't have to worry about downgrading PMDs to - * PTEs. If we happen to be trying to insert a PTE and there is a PMD - * already in the tree, we will skip the insertion and just dirty the PMD as - * appropriate. 
- */ -static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, - void **pentry, pfn_t pfn, - unsigned long flags) -{ - struct address_space *mapping = vmf->vma->vm_file->f_mapping; - void *new_entry = dax_make_entry(pfn, flags); - bool dirty = flags & DAX_DIRTY; - bool cow = flags & DAX_COW; - void *entry = *pentry; - - if (dirty) - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - - if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { - unsigned long index = xas->xa_index; - /* we are replacing a zero page with block mapping */ - if (dax_is_pmd_entry(entry)) - unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR, - PG_PMD_NR, false); - else /* pte entry */ - unmap_mapping_pages(mapping, index, 1, false); - } - - xas_reset(xas); - xas_lock_irq(xas); - if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { - void *old; - - dax_disassociate_entry(entry, mapping, false); - dax_associate_entry(new_entry, mapping, vmf, flags); - /* - * Only swap our new entry into the page cache if the current - * entry is a zero page or an empty entry. If a normal PTE or - * PMD entry is already in the cache, we leave it alone. This - * means that if we are trying to insert a PTE and the - * existing entry is a PMD, we will just leave the PMD in the - * tree and dirty it if necessary. - */ - old = dax_lock_entry(xas, new_entry); - WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) | - DAX_LOCKED)); - entry = new_entry; - } else { - xas_load(xas); /* Walk the xa_state */ - } - - if (dirty) - xas_set_mark(xas, PAGECACHE_TAG_DIRTY); - - if (cow) - xas_set_mark(xas, PAGECACHE_TAG_TOWRITE); - - xas_unlock_irq(xas); - *pentry = entry; - return 0; -} - -static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, - struct address_space *mapping, void *entry) -{ - unsigned long pfn, index, count, end; - long ret = 0; - struct vm_area_struct *vma; - - /* - * A page got tagged dirty in DAX mapping? 
Something is seriously
-	 * wrong.
-	 */
-	if (WARN_ON(!xa_is_value(entry)))
-		return -EIO;
-
-	if (unlikely(dax_is_locked(entry))) {
-		void *old_entry = entry;
-
-		entry = get_unlocked_entry(xas, 0);
-
-		/* Entry got punched out / reallocated? */
-		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-			goto put_unlocked;
-		/*
-		 * Entry got reallocated elsewhere? No need to writeback.
-		 * We have to compare pfns as we must not bail out due to
-		 * difference in lockbit or entry type.
-		 */
-		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
-			goto put_unlocked;
-		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
-					dax_is_zero_entry(entry))) {
-			ret = -EIO;
-			goto put_unlocked;
-		}
-
-		/* Another fsync thread may have already done this entry */
-		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
-			goto put_unlocked;
-	}
-
-	/* Lock the entry to serialize with page faults */
-	dax_lock_entry(xas, entry);
-
-	/*
-	 * We can clear the tag now but we have to be careful so that concurrent
-	 * dax_writeback_one() calls for the same index cannot finish before we
-	 * actually flush the caches. This is achieved as the calls will look
-	 * at the entry only under the i_pages lock and once they do that
-	 * they will see the entry locked and wait for it to unlock.
-	 */
-	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
-	xas_unlock_irq(xas);
-
-	/*
-	 * If dax_writeback_mapping_range() was given a wbc->range_start
-	 * in the middle of a PMD, the 'index' we use needs to be
-	 * aligned to the start of the PMD.
-	 * This allows us to flush for PMD_SIZE and not have to worry about
-	 * partial PMD writebacks.
-	 */
-	pfn = dax_to_pfn(entry);
-	count = 1UL << dax_entry_order(entry);
-	index = xas->xa_index & ~(count - 1);
-	end = index + count - 1;
-
-	/* Walk all mappings of a given index of a file and writeprotect them */
-	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
-		pfn_mkclean_range(pfn, count, index, vma);
-		cond_resched();
-	}
-	i_mmap_unlock_read(mapping);
-
-	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
-	/*
-	 * After we have flushed the cache, we can clear the dirty tag. There
-	 * cannot be new dirty data in the pfn after the flush has completed as
-	 * the pfn mappings are writeprotected and fault waits for mapping
-	 * entry lock.
-	 */
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	xas_store(xas, entry);
-	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-	dax_wake_entry(xas, entry, WAKE_NEXT);
-
-	trace_dax_writeback_one(mapping->host, index, count);
-	return ret;
-
- put_unlocked:
-	put_unlocked_entry(xas, entry, WAKE_NEXT);
-	return ret;
-}
-
 /*
  * Flush the mapping to the persistent domain within the byte range of [start,
  * end]. This is required by data integrity operations to ensure file data is
@@ -1181,6 +191,37 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
 	return 0;
 }

+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+		struct vm_area_struct *vma)
+{
+	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+		(iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+	return (iter->flags & IOMAP_WRITE) &&
+		(iter->iomap.flags & IOMAP_F_SHARED);
+}
+
+static unsigned long dax_iter_flags(const struct iomap_iter *iter,
+				    struct vm_fault *vmf)
+{
+	unsigned long flags = 0;
+
+	if (!dax_fault_is_synchronous(iter, vmf->vma))
+		flags |= DAX_DIRTY;
+
+	if (dax_fault_is_cow(iter))
+		flags |= DAX_COW;
+
+	return flags;
+}
+
 /*
  * The user has performed a load from a hole in the file. Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1663,7 +704,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		iter.flags |= IOMAP_WRITE;

-	entry = grab_mapping_entry(&xas, mapping, 0);
+	entry = dax_grab_mapping_entry(&xas, mapping, 0);
 	if (xa_is_internal(entry)) {
 		ret = xa_to_internal(entry);
 		goto out;
@@ -1780,12 +821,12 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		goto fallback;

 	/*
-	 * grab_mapping_entry() will make sure we get an empty PMD entry,
+	 * dax_grab_mapping_entry() will make sure we get an empty PMD entry,
 	 * a zero PMD entry or a DAX PMD. If it can't (because a PTE
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
 	if (xa_is_internal(entry)) {
 		ret = xa_to_internal(entry);
 		goto fallback;
@@ -1859,50 +900,6 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);

-/*
- * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
- * @vmf: The description of the fault
- * @pfn: PFN to insert
- * @order: Order of entry to insert.
- *
- * This function inserts a writeable PTE or PMD entry into the page tables
- * for an mmaped DAX file. It also marks the page cache entry as dirty.
- */
-static vm_fault_t
-dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
-{
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
-	void *entry;
-	vm_fault_t ret;
-
-	xas_lock_irq(&xas);
-	entry = get_unlocked_entry(&xas, order);
-	/* Did we race with someone splitting entry or so?
*/
-	if (!entry || dax_is_conflict(entry) ||
-	    (order == 0 && !dax_is_pte_entry(entry))) {
-		put_unlocked_entry(&xas, entry, WAKE_NEXT);
-		xas_unlock_irq(&xas);
-		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
-						      VM_FAULT_NOPAGE);
-		return VM_FAULT_NOPAGE;
-	}
-	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
-	dax_lock_entry(&xas, entry);
-	xas_unlock_irq(&xas);
-	if (order == 0)
-		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
-#ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
-		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
-#endif
-	else
-		ret = VM_FAULT_FALLBACK;
-	dax_unlock_entry(&xas, entry);
-	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
-	return ret;
-}
-
 /**
  * dax_finish_sync_fault - finish synchronous page fault
  * @vmf: The description of the fault
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 54f099166a29..05ce7992ac43 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,41 +157,50 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
-struct page *dax_layout_pinned_page(struct address_space *mapping);
-struct page *dax_layout_pinned_page_range(struct address_space *mapping, loff_t start, loff_t end);
+#else
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+		struct dax_device *dax_dev, struct writeback_control *wbc)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif
+
+int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
+		const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		const struct iomap_ops *ops);
+
+#if IS_ENABLED(CONFIG_DAX)
+int dax_read_lock(void);
+void dax_read_unlock(int id);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+void run_dax(struct dax_device *dax_dev);
 dax_entry_t
dax_lock_mapping_entry(struct address_space *mapping,
 		unsigned long index, struct page **page);
 void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie);
+struct page *dax_layout_pinned_page(struct address_space *mapping);
+struct page *dax_layout_pinned_page_range(struct address_space *mapping, loff_t start, loff_t end);
 #else
-static inline struct page *dax_layout_pinned_page(struct address_space *mapping)
-{
-	return NULL;
-}
-
-static inline struct page *
-dax_layout_pinned_page_range(struct address_space *mapping, pgoff_t start,
-			     pgoff_t nr_pages)
+static inline dax_entry_t dax_lock_page(struct page *page)
 {
-	return NULL;
+	if (IS_DAX(page->mapping->host))
+		return ~0UL;
+	return 0;
 }

-static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct dax_device *dax_dev, struct writeback_control *wbc)
+static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
-	return -EOPNOTSUPP;
 }

-static inline dax_entry_t dax_lock_page(struct page *page)
+static inline int dax_read_lock(void)
 {
-	if (IS_DAX(page->mapping->host))
-		return ~0UL;
 	return 0;
 }

-static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
+static inline void dax_read_unlock(int id)
 {
 }

@@ -205,24 +214,17 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie)
 {
 }
-#endif
-
-int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);

-#if IS_ENABLED(CONFIG_DAX)
-int dax_read_lock(void);
-void dax_read_unlock(int id);
-#else
-static inline int dax_read_lock(void)
+static inline struct page *dax_layout_pinned_page(struct address_space *mapping)
 {
-	return 0;
+	return NULL;
 }

-static inline void dax_read_unlock(int id)
+static inline struct page *
+dax_layout_pinned_page_range(struct
address_space *mapping, loff_t start,
+			     loff_t end)
 {
+	return NULL;
 }
 #endif /* CONFIG_DAX */
 bool dax_alive(struct dax_device *dax_dev);
@@ -245,6 +247,10 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size, pfn_t pfn);
+struct page *dax_pinned_page(void *entry);
+void *dax_grab_mapping_entry(struct xa_state *xas,
+			     struct address_space *mapping, unsigned int order);
+void dax_unlock_entry(struct xa_state *xas, void *entry);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
@@ -261,6 +267,54 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }

+/*
+ * DAX pagecache entries use XArray value entries so they can't be mistaken
+ * for pages. We use one bit for locking, one bit for the entry size (PMD)
+ * and two more to tell us if the entry is a zero page or an empty entry that
+ * is just used for locking. In total four special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
+ * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
+ */
+#define DAX_SHIFT	(4)
+#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
+#define DAX_LOCKED	(1UL << 0)
+#define DAX_PMD		(1UL << 1)
+#define DAX_ZERO_PAGE	(1UL << 2)
+#define DAX_EMPTY	(1UL << 3)
+
+/*
+ * These flags are not conveyed in Xarray value entries, they are just
+ * modifiers to dax_insert_entry().
+ */
+#define DAX_DIRTY	(1UL << (DAX_SHIFT + 0))
+#define DAX_COW		(1UL << (DAX_SHIFT + 1))
+vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+			    void **pentry, pfn_t pfn, unsigned long flags);
+vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
+				  unsigned int order);
+int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
+		      struct address_space *mapping, void *entry);
+
+/* The 'colour' (ie low bits) within a PMD of a page offset. */
+#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
+#define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
+
+/* The order of a PMD entry */
+#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
+static inline unsigned int pe_order(enum page_entry_size pe_size)
+{
+	if (pe_size == PE_SIZE_PTE)
+		return PAGE_SHIFT - PAGE_SHIFT;
+	if (pe_size == PE_SIZE_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	if (pe_size == PE_SIZE_PUD)
+		return PUD_SHIFT - PAGE_SHIFT;
+	return ~0;
+}
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index fd57407e7f3d..e5d30eec3bf1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -221,6 +221,12 @@ static inline void devm_memunmap_pages(struct device *dev,
 {
 }

+static inline struct dev_pagemap *
+get_dev_pagemap_many(unsigned long pfn, struct dev_pagemap *pgmap, int refs)
+{
+	return NULL;
+}
+
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
 {

From patchwork Sun Sep 4 02:16:58 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965102
Subject: [PATCH 10/13] dax: Prep dax_{associate, disassociate}_entry() for compound pages
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:16:58 -0700
Message-ID: <166225781800.2351842.4542681429835252305.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
User-Agent: StGit/0.18-3-g996c
MIME-Version: 1.0

In preparation for device-dax to use the same mapping machinery as
fsdax, add support for device-dax compound pages.

Presently this is handled by dax_set_mapping() which is careful to only
update page->mapping for head pages. However, it does that by looking at
properties in the 'struct dev_dax' instance associated with the page.
Switch to just checking PageHead() directly.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Signed-off-by: Dan Williams
---
 drivers/dax/Kconfig   |  1 +
 drivers/dax/mapping.c | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3ed4da3935e5..2dcc8744277d 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -10,6 +10,7 @@ if DAX
 config DEV_DAX
 	tristate "Device DAX: direct access mapping device"
 	depends on TRANSPARENT_HUGEPAGE
+	depends on !FS_DAX_LIMITED
 	help
 	  Support raw access to differentiated (persistence, bandwidth, latency...)
 	  memory via an mmap(2) capable character
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 0810af7d9503..6bd38ddba2cb 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -351,6 +351,8 @@ static vm_fault_t dax_associate_entry(void *entry,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);

+		page = compound_head(page);
+
 		if (flags & DAX_COW) {
 			dax_mapping_set_cow(page);
 		} else {
@@ -358,6 +360,13 @@ static vm_fault_t dax_associate_entry(void *entry,
 			page->mapping = mapping;
 			page->index = index + i++;
 		}
+
+		/*
+		 * page->mapping and page->index are only manipulated on
+		 * head pages
+		 */
+		if (PageHead(page))
+			break;
 	}

 	return 0;
@@ -380,6 +389,8 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,

 	for_each_mapped_pfn(entry, pfn) {
 		page = pfn_to_page(pfn);
+		page = compound_head(page);
+
 		WARN_ON_ONCE(trunc && page_maybe_dma_pinned(page));
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
@@ -389,6 +400,13 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
+
+		/*
+		 * page->mapping and page->index are only manipulated on
+		 * head pages
+		 */
+		if (PageHead(page))
+			break;
 	}
 }

From patchwork Sun Sep 4 02:17:03 2022
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 12965103
Subject: [PATCH 11/13] devdax: add PUD support to the DAX mapping infrastructure
From: Dan Williams
To: akpm@linux-foundation.org
Cc: Matthew Wilcox, Jan Kara, "Darrick J. Wong", Jason Gunthorpe, Christoph Hellwig, John Hubbard, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org
Date: Sat, 03 Sep 2022 19:17:03 -0700
Message-ID: <166225782359.2351842.11436411972119201331.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com>
User-Agent: StGit/0.18-3-g996c
MIME-Version: 1.0

In preparation for using the DAX mapping infrastructure for device-dax,
update the helpers to handle PUD entries.

In practice the code related to @size_downgrade will go unused for PUD
entries since only devdax creates DAX PUD entries and devdax enforces
aligned mappings. The conversion is included for completeness.

The addition of PUD support to dax_insert_pfn_mkwrite() requires a new
stub for vmf_insert_pfn_pud() in the
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=n case.

Cc: Matthew Wilcox
Cc: Jan Kara
Cc: "Darrick J. Wong"
Cc: Jason Gunthorpe
Cc: Christoph Hellwig
Cc: John Hubbard
Signed-off-by: Dan Williams
---
 drivers/dax/mapping.c   | 50 ++++++++++++++++++++++++++++++++++++-----------
 include/linux/dax.h     | 30 +++++++++++++++++++---------
 include/linux/huge_mm.h | 11 ++++++++--
 3 files changed, 67 insertions(+), 24 deletions(-)

diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 6bd38ddba2cb..6eaa0fe33c16 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include

 #define CREATE_TRACE_POINTS
 #include
@@ -51,6 +52,8 @@ static bool dax_is_locked(void *entry)

 static unsigned int dax_entry_order(void *entry)
 {
+	if (xa_to_value(entry) & DAX_PUD)
+		return PUD_ORDER;
 	if (xa_to_value(entry) & DAX_PMD)
 		return PMD_ORDER;
 	return 0;
@@ -61,9 +64,14 @@ static unsigned long dax_is_pmd_entry(void *entry)
 	return xa_to_value(entry) & DAX_PMD;
 }

+static unsigned long dax_is_pud_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_PUD;
+}
+
 static bool dax_is_pte_entry(void *entry)
 {
-	return !(xa_to_value(entry) & DAX_PMD);
+	return !(xa_to_value(entry) & (DAX_PMD|DAX_PUD));
 }

 static int dax_is_zero_entry(void *entry)
@@ -272,6 +280,8 @@ static unsigned long dax_entry_size(void *entry)
 		return 0;
 	else if (dax_is_pmd_entry(entry))
 		return PMD_SIZE;
+	else if (dax_is_pud_entry(entry))
+		return PUD_SIZE;
 	else
 		return PAGE_SIZE;
 }
@@ -572,11 +582,11 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 			     struct address_space *mapping, unsigned int order)
 {
 	unsigned long index = xas->xa_index;
-	bool pmd_downgrade;	/* splitting PMD entry into PTE entries? */
+	bool size_downgrade;	/* splitting entry into PTE entries?
*/
 	void *entry;

 retry:
-	pmd_downgrade = false;
+	size_downgrade = false;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);

@@ -589,15 +599,25 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		}

 		if (order == 0) {
-			if (dax_is_pmd_entry(entry) &&
+			if (!dax_is_pte_entry(entry) &&
 			    (dax_is_zero_entry(entry) ||
 			     dax_is_empty_entry(entry))) {
-				pmd_downgrade = true;
+				size_downgrade = true;
 			}
 		}
 	}

-	if (pmd_downgrade) {
+	if (size_downgrade) {
+		unsigned long colour, nr;
+
+		if (dax_is_pmd_entry(entry)) {
+			colour = PG_PMD_COLOUR;
+			nr = PG_PMD_NR;
+		} else {
+			colour = PG_PUD_COLOUR;
+			nr = PG_PUD_NR;
+		}
+
 		/*
 		 * Make sure 'entry' remains valid while we drop
 		 * the i_pages lock.
@@ -611,9 +631,8 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		 */
 		if (dax_is_zero_entry(entry)) {
 			xas_unlock_irq(xas);
-			unmap_mapping_pages(mapping,
-					xas->xa_index & ~PG_PMD_COLOUR,
-					PG_PMD_NR, false);
+			unmap_mapping_pages(mapping, xas->xa_index & ~colour,
+					    nr, false);
 			xas_reset(xas);
 			xas_lock_irq(xas);
 		}
@@ -621,7 +640,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		dax_disassociate_entry(entry, mapping, false);
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, WAKE_ALL);
-		mapping->nrpages -= PG_PMD_NR;
+		mapping->nrpages -= nr;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -631,7 +650,9 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 	} else {
 		unsigned long flags = DAX_EMPTY;

-		if (order > 0)
+		if (order == PUD_SHIFT - PAGE_SHIFT)
+			flags |= DAX_PUD;
+		else if (order == PMD_SHIFT - PAGE_SHIFT)
 			flags |= DAX_PMD;
 		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
 		dax_lock_entry(xas, entry);
@@ -811,7 +832,10 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
-		if (dax_is_pmd_entry(entry))
+		if (dax_is_pud_entry(entry))
+			unmap_mapping_pages(mapping, index &
~PG_PUD_COLOUR,
+					PG_PUD_NR, false);
+		else if (dax_is_pmd_entry(entry))
 			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
 					PG_PMD_NR, false);
 		else /* pte entry */
@@ -983,6 +1007,8 @@ vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
 	else if (order == PMD_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
+	else if (order == PUD_ORDER)
+		ret = vmf_insert_pfn_pud(vmf, pfn, FAULT_FLAG_WRITE);
 	else
 		ret = VM_FAULT_FALLBACK;
 	dax_unlock_entry(&xas, entry);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 05ce7992ac43..81fcc0e4a070 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -268,21 +268,24 @@ static inline bool dax_mapping(struct address_space *mapping)
 }

 /*
- * DAX pagecache entries use XArray value entries so they can't be mistaken
- * for pages. We use one bit for locking, one bit for the entry size (PMD)
- * and two more to tell us if the entry is a zero page or an empty entry that
- * is just used for locking. In total four special bits.
+ * DAX pagecache entries use XArray value entries so they can't be
+ * mistaken for pages. We use one bit for locking, two bits for the
+ * entry size (PMD, PUD) and two more to tell us if the entry is a zero
+ * page or an empty entry that is just used for locking. In total 5
+ * special bits which limits the max pfn that can be stored as:
+ * (1UL << 57 - PAGE_SHIFT). 63 - DAX_SHIFT - 1 (for xa_mk_value()).
  *
- * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
- * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
- * block allocation.
+ * If the P{M,U}D bits are not set the entry has size PAGE_SIZE, and if
+ * the ZERO_PAGE and EMPTY bits aren't set the entry is a normal DAX
+ * entry with a filesystem block allocation.
  */
-#define DAX_SHIFT	(4)
+#define DAX_SHIFT	(5)
 #define DAX_MASK	((1UL << DAX_SHIFT) - 1)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
-#define DAX_ZERO_PAGE	(1UL << 2)
-#define DAX_EMPTY	(1UL << 3)
+#define DAX_PUD		(1UL << 2)
+#define DAX_ZERO_PAGE	(1UL << 3)
+#define DAX_EMPTY	(1UL << 4)

 /*
  * These flags are not conveyed in Xarray value entries, they are just
@@ -304,6 +307,13 @@ int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 /* The order of a PMD entry */
 #define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)

+/* The 'colour' (ie low bits) within a PUD of a page offset. */
+#define PG_PUD_COLOUR	((PUD_SIZE >> PAGE_SHIFT) - 1)
+#define PG_PUD_NR	(PUD_SIZE >> PAGE_SHIFT)
+
+/* The order of a PUD entry */
+#define PUD_ORDER	(PUD_SHIFT - PAGE_SHIFT)
+
 static inline unsigned int pe_order(enum page_entry_size pe_size)
 {
 	if (pe_size == PE_SIZE_PTE)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 768e5261fdae..de73f5a16252 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -18,10 +18,19 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,

 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
+				   pgprot_t pgprot, bool write);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
+
+static inline vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf,
+						 pfn_t pfn, pgprot_t pgprot,
+						 bool write)
+{
+	return VM_FAULT_SIGBUS;
+}
 #endif

 vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
@@ -58,8 +67,6 @@ static inline vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn,
 {
 	return vmf_insert_pfn_pmd_prot(vmf, pfn, vmf->vma->vm_page_prot, write);
 }
-vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
-				   pgprot_t pgprot, bool write);

 /**
  * vmf_insert_pfn_pud - insert a pud size pfn

From patchwork Sun Sep 4
02:17:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12965104 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id D6A15ECAAD5 for ; Sun, 4 Sep 2022 02:17:12 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6FE788017A; Sat, 3 Sep 2022 22:17:12 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6AD178015A; Sat, 3 Sep 2022 22:17:12 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 54E258017A; Sat, 3 Sep 2022 22:17:12 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 463088015A for ; Sat, 3 Sep 2022 22:17:12 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 19680A0458 for ; Sun, 4 Sep 2022 02:17:12 +0000 (UTC) X-FDA: 79872790704.01.CDFBD06 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf04.hostedemail.com (Postfix) with ESMTP id 637FB4005F for ; Sun, 4 Sep 2022 02:17:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1662257831; x=1693793831; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=pmXSmi9LguX30LS/sFbhN9mBClDXZExgw2W8d8OhiMU=; b=nmqO8q1vr1nfFuD47qmt3eDwoZhfSlO8JKFrOqX7ScHyP47LHi5407yG CKQCfz8VI/duj0JVl46Aqbg4e64/2P7QkFZetGzDQ4gJWb7DIfSMIk3eo D0vZQYc1OgvWpn+hZ6yFFo2U5yNO9arSzGGNuAhJv1EvVPPoy8JBiyAyb 1XlBf7vRALzteeUnIDkqP1QumNUnXjUzthWOKXufuLt/V+XJsgugtS29B anIeghZ4uIogXZPAekGJFfD+tT8l4P8XSR6hpyWnlZBbY8uwh88hMkXO5 
lDo9YV6GUvNRGocD7cNSS1c1EW2kdeH27LbaLiH7WW4SvkYSZRd/VKMPg g==; X-IronPort-AV: E=McAfee;i="6500,9779,10459"; a="360158844" X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="360158844" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:17:10 -0700 X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="590497071" Received: from pg4-mobl3.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.212.132.198]) by orsmga006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:17:10 -0700 Subject: [PATCH 12/13] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Jason Gunthorpe , Christoph Hellwig , John Hubbard , linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org Date: Sat, 03 Sep 2022 19:17:09 -0700 Message-ID: <166225782976.2351842.16939728802182084191.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662257831; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=P5PCqaubjv7eA3qakPYtbAtHMIDGnBPrY/wsWVo1qo8=; b=e3LTx0e+Y7xQXW30capUOkbIa0sNHD/s4ClhsukoF4um/mT2nb0csy+3myRd8juwGr2Csq 6quZwdQSenIOQjUAqAHDFz0G2RLrIyoARwEVzBfaKJCEBdZFaDzqHzGnVHZ87i7TqnerLe JsPmIaDRuIEyz2PpEWDg0fwaQPB5zxk= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com 
header.s=Intel header.b=nmqO8q1v; spf=pass (imf04.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662257831; a=rsa-sha256; cv=none; b=xSHfLb3g9QA/NYghk1zULV4fzrQ022Yr8urBCKvYBxkpHIsRB0lR71s03LSrp98k9ouOVH bY7Ipg7a1D/GH+O50VEPkk9EMCOiL/P1LV30Xu40/MDgGG6tjGIPM+JLm33Lc9jrfuc4Vm qhZudAlpELEYdLg7YwjyiQaw7luGyXQ= X-Rspam-User: Authentication-Results: imf04.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=nmqO8q1v; spf=pass (imf04.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 637FB4005F X-Stat-Signature: kht6x6oq4dducku878is6but6ukrack3 X-HE-Tag: 1662257831-242595 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Track entries and take pgmap references at mapping insertion time. Revoke mappings and drop the associated pgmap references at device destruction or inode eviction time. With this in place, and the fsdax equivalent already in place, the gup code no longer needs to consider PTE_DEVMAP as an indicator to get a pgmap reference before taking a page reference. In other words, gup takes additional references on pages that are already mapped. Until now, DAX in all its forms was failing to take references at mapping time. With that fixed, gup is no longer required to manage @pgmap references. That cleanup is saved for a follow-on patch. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J.
Wong" Cc: Jason Gunthorpe Cc: Christoph Hellwig Cc: John Hubbard Signed-off-by: Dan Williams --- drivers/dax/bus.c | 2 + drivers/dax/device.c | 73 +++++++++++++++++++++++++++++-------------------- drivers/dax/mapping.c | 3 ++ 3 files changed, 47 insertions(+), 31 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1dad813ee4a6..f4dd9b8b88a9 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -384,7 +384,7 @@ void kill_dev_dax(struct dev_dax *dev_dax) struct inode *inode = dax_inode(dax_dev); kill_dax(dax_dev); - unmap_mapping_range(inode->i_mapping, 0, 0, 1); + truncate_inode_pages(inode->i_mapping, 0); /* * Dynamic dax region have the pgmap allocated via dev_kzalloc() diff --git a/drivers/dax/device.c b/drivers/dax/device.c index 5494d745ced5..7f306939807e 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -73,38 +73,15 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, return -1; } -static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn, - unsigned long fault_size) -{ - unsigned long i, nr_pages = fault_size / PAGE_SIZE; - struct file *filp = vmf->vma->vm_file; - struct dev_dax *dev_dax = filp->private_data; - pgoff_t pgoff; - - /* mapping is only set on the head */ - if (dev_dax->pgmap->vmemmap_shift) - nr_pages = 1; - - pgoff = linear_page_index(vmf->vma, - ALIGN(vmf->address, fault_size)); - - for (i = 0; i < nr_pages; i++) { - struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i); - - page = compound_head(page); - if (page->mapping) - continue; - - page->mapping = filp->f_mapping; - page->index = pgoff + i; - } -} - static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; + XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct device *dev = &dev_dax->dev; phys_addr_t phys; + vm_fault_t ret; + void *entry; pfn_t pfn; unsigned int fault_size = PAGE_SIZE; @@ -128,7 +105,16 @@ static vm_fault_t 
__dev_dax_pte_fault(struct dev_dax *dev_dax, pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); - dax_set_mapping(vmf, pfn, fault_size); + entry = dax_grab_mapping_entry(&xas, mapping, 0); + if (xa_is_internal(entry)) + return xa_to_internal(entry); + + ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0); + + dax_unlock_entry(&xas, entry); + + if (ret) + return ret; return vmf_insert_mixed(vmf->vma, vmf->address, pfn); } @@ -136,10 +122,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax, static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; unsigned long pmd_addr = vmf->address & PMD_MASK; + XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct device *dev = &dev_dax->dev; phys_addr_t phys; + vm_fault_t ret; pgoff_t pgoff; + void *entry; pfn_t pfn; unsigned int fault_size = PMD_SIZE; @@ -171,7 +161,16 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); - dax_set_mapping(vmf, pfn, fault_size); + entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER); + if (xa_is_internal(entry)) + return xa_to_internal(entry); + + ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PMD); + + dax_unlock_entry(&xas, entry); + + if (ret) + return ret; return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); } @@ -180,10 +179,14 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax, static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf) { + struct address_space *mapping = vmf->vma->vm_file->f_mapping; unsigned long pud_addr = vmf->address & PUD_MASK; + XA_STATE(xas, &mapping->i_pages, vmf->pgoff); struct device *dev = &dev_dax->dev; phys_addr_t phys; + vm_fault_t ret; pgoff_t pgoff; + void *entry; pfn_t pfn; unsigned int fault_size = PUD_SIZE; @@ -216,7 +219,16 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); - 
dax_set_mapping(vmf, pfn, fault_size); + entry = dax_grab_mapping_entry(&xas, mapping, PUD_ORDER); + if (xa_is_internal(entry)) + return xa_to_internal(entry); + + ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PUD); + + dax_unlock_entry(&xas, entry); + + if (ret) + return ret; return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE); } @@ -494,3 +506,4 @@ MODULE_LICENSE("GPL v2"); module_init(dax_init); module_exit(dax_exit); MODULE_ALIAS_DAX_DEVICE(0); +MODULE_IMPORT_NS(DAX); diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c index 6eaa0fe33c16..b9851cfd4cbd 100644 --- a/drivers/dax/mapping.c +++ b/drivers/dax/mapping.c @@ -261,6 +261,7 @@ void dax_unlock_entry(struct xa_state *xas, void *entry) WARN_ON(!dax_is_locked(old)); dax_wake_entry(xas, entry, WAKE_NEXT); } +EXPORT_SYMBOL_NS_GPL(dax_unlock_entry, DAX); /* * Return: The entry stored at this location before it was locked. @@ -674,6 +675,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas, xas_unlock_irq(xas); return xa_mk_internal(VM_FAULT_FALLBACK); } +EXPORT_SYMBOL_NS_GPL(dax_grab_mapping_entry, DAX); /** * dax_layout_pinned_page_range - find first pinned page in @mapping @@ -875,6 +877,7 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, *pentry = entry; return 0; } +EXPORT_SYMBOL_NS_GPL(dax_insert_entry, DAX); int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, struct address_space *mapping, void *entry) From patchwork Sun Sep 4 02:17:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dan Williams X-Patchwork-Id: 12965114 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 386E8C6FA82 for ; Sun, 4 Sep 2022 02:17:18 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C2DC68017B; Sat, 3 Sep 
2022 22:17:17 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BDD8F8015A; Sat, 3 Sep 2022 22:17:17 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A7E858017B; Sat, 3 Sep 2022 22:17:17 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 99D9D8015A for ; Sat, 3 Sep 2022 22:17:17 -0400 (EDT) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 736E1120158 for ; Sun, 4 Sep 2022 02:17:17 +0000 (UTC) X-FDA: 79872790914.19.A2E5658 Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by imf20.hostedemail.com (Postfix) with ESMTP id E635F1C006D for ; Sun, 4 Sep 2022 02:17:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1662257836; x=1693793836; h=subject:from:to:cc:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=s7cUPCFRfEM51i7BRJG+0ccdu8jS149l2Tkf0+0JoZ4=; b=NKd3oshaX7nU6kA7ROgyuXGspn5ysv/Ugla0EuJzd0zDXeM5yqLv+oIi sWP28qsbSQNjxp2qdttc3giakYe4NFsXsfazJlnP2P+aYZuBuVeNdRejM SuPQH5dJQo0yAoQgtUXPnZHwC1PZ7gr06CoHxVNDKMWVaBXJGJxAGu4hp lWpEGyKc38rsOS65L/mhcODfcHwAxJLcSKhvNPSP26duhavy/O2e5x7iR ymwUzTDBi0ARdd6rlLFjY+vHUbO5BPfKs6IXgrCdiWIAEQUrf7X46LhNG uelMJHgZNCTz+tqSflADHWUsq1X2OQEeW/TwSGe6d7rjEKrY7VKT71sRR Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10459"; a="357917882" X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="357917882" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:17:15 -0700 X-IronPort-AV: E=Sophos;i="5.93,288,1654585200"; d="scan'208";a="609343164" Received: from pg4-mobl3.amr.corp.intel.com (HELO dwillia2-xfh.jf.intel.com) ([10.212.132.198]) by 
orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Sep 2022 19:17:15 -0700 Subject: [PATCH 13/13] mm/gup: Drop DAX pgmap accounting From: Dan Williams To: akpm@linux-foundation.org Cc: Matthew Wilcox , Jan Kara , "Darrick J. Wong" , Christoph Hellwig , John Hubbard , Jason Gunthorpe , linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-fsdevel@vger.kernel.org Date: Sat, 03 Sep 2022 19:17:15 -0700 Message-ID: <166225783530.2351842.9198292974545499645.stgit@dwillia2-xfh.jf.intel.com> In-Reply-To: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> References: <166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com> User-Agent: StGit/0.18-3-g996c MIME-Version: 1.0 ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=NKd3osha; spf=pass (imf20.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1662257837; a=rsa-sha256; cv=none; b=KwmITbaE+dzj6nX7OrMTffjCdstYCXSnj74nYsXkInt6lfoM9hJW/7A6lR+cJAv7y+Chq9 gwRRSqLddu/bWTvjn98L/o2IfT0ok1jmc4EwISVK7SIhzvBRaZUVbae6qpdhmOGlHBfayy KrKkWNr488fw3z5EAlryrU+BvEZPTYw= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1662257837; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=WEERqm/sdYnI4MV4s0yDozvdHiPcUsyKaTsbTExdnRQ=; b=EReoCjnLjKtfJt8QtWS5GnECrWykC6i1YkbQ2HqLWwtjVicYDwCqvO659vgNuT6JJx8juW aGE2TGBk/di7ujnYkgJXGz7BHYR7ALQoK35my8mUpqJYlBFgjDqCKKVp/SJNw6gQW0y9Ui A6agV6Q7B3s3EAz3BTN8rhIZv7TiseQ= X-Rspam-User: Authentication-Results: imf20.hostedemail.com; 
dkim=none ("invalid DKIM record") header.d=intel.com header.s=Intel header.b=NKd3osha; spf=pass (imf20.hostedemail.com: domain of dan.j.williams@intel.com designates 134.134.136.31 as permitted sender) smtp.mailfrom=dan.j.williams@intel.com; dmarc=pass (policy=none) header.from=intel.com X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: E635F1C006D X-Stat-Signature: 8qxmmmyz7dxs5p64676tkrbmhjz9eobr X-HE-Tag: 1662257836-69234 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now that pgmap accounting is handled at map time, it can be dropped from the gup path. One hurdle remains: filesystem-DAX huge pages are not compound pages, so infrastructure like __gup_device_huge_p{m,u}d() still has to stick around. Additionally, ZONE_DEVICE pages with this change are still not suitable to be returned from vm_normal_page(), so this cleanup is limited to deleting pgmap reference manipulation. This is an incremental step on the path to removing pte_devmap() altogether. Note that follow_devmap_pmd() can be deleted entirely, since a few added pmd_devmap() checks allow the transparent huge page path to be reused. Cc: Matthew Wilcox Cc: Jan Kara Cc: "Darrick J.
Wong" Cc: Christoph Hellwig Cc: John Hubbard Reported-by: Jason Gunthorpe Signed-off-by: Dan Williams --- include/linux/huge_mm.h | 12 +------ mm/gup.c | 83 +++++++++++------------------------------------ mm/huge_memory.c | 54 +------------------------------ 3 files changed, 22 insertions(+), 127 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index de73f5a16252..b8ed373c6090 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -263,10 +263,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio) return folio_order(folio) >= HPAGE_PMD_ORDER; } -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, int flags, struct dev_pagemap **pgmap); struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, int flags, struct dev_pagemap **pgmap); + pud_t *pud, int flags); vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); @@ -418,14 +416,8 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm) return; } -static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma, - unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap) -{ - return NULL; -} - static inline struct page *follow_devmap_pud(struct vm_area_struct *vma, - unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap) + unsigned long addr, pud_t *pud, int flags) { return NULL; } diff --git a/mm/gup.c b/mm/gup.c index 67dfffe97917..3832edd27dfd 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -25,7 +25,6 @@ #include "internal.h" struct follow_page_context { - struct dev_pagemap *pgmap; unsigned int page_mask; }; @@ -490,8 +489,7 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags) } static struct page *follow_page_pte(struct vm_area_struct *vma, - unsigned long address, pmd_t *pmd, unsigned int flags, - struct dev_pagemap **pgmap) + unsigned long address, pmd_t *pmd, unsigned int flags) { struct mm_struct *mm = vma->vm_mm; 
struct page *page; @@ -535,17 +533,13 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } page = vm_normal_page(vma, address, pte); - if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { + if (!page && pte_devmap(pte)) { /* - * Only return device mapping pages in the FOLL_GET or FOLL_PIN - * case since they are only valid while holding the pgmap - * reference. + * ZONE_DEVICE pages are not yet treated as vm_normal_page() + * instances, with respect to mapcount and compound-page + * metadata */ - *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap); - if (*pgmap) - page = pte_page(pte); - else - goto no_page; + page = pte_page(pte); } else if (unlikely(!page)) { if (flags & FOLL_DUMP) { /* Avoid special (like zero) pages in core dumps */ @@ -663,15 +657,8 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return no_page_table(vma, flags); goto retry; } - if (pmd_devmap(pmdval)) { - ptl = pmd_lock(mm, pmd); - page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap); - spin_unlock(ptl); - if (page) - return page; - } - if (likely(!pmd_trans_huge(pmdval))) - return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + if (likely(!(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))) + return follow_page_pte(vma, address, pmd, flags); if ((flags & FOLL_NUMA) && pmd_protnone(pmdval)) return no_page_table(vma, flags); @@ -689,9 +676,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, pmd_migration_entry_wait(mm, pmd); goto retry_locked; } - if (unlikely(!pmd_trans_huge(*pmd))) { + if (unlikely(!(pmd_trans_huge(*pmd) || pmd_devmap(pmdval)))) { spin_unlock(ptl); - return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + return follow_page_pte(vma, address, pmd, flags); } if (flags & FOLL_SPLIT_PMD) { int ret; @@ -709,7 +696,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, } return ret ? 
ERR_PTR(ret) : - follow_page_pte(vma, address, pmd, flags, &ctx->pgmap); + follow_page_pte(vma, address, pmd, flags); } page = follow_trans_huge_pmd(vma, address, pmd, flags); spin_unlock(ptl); @@ -746,7 +733,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, } if (pud_devmap(*pud)) { ptl = pud_lock(mm, pud); - page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap); + page = follow_devmap_pud(vma, address, pud, flags); spin_unlock(ptl); if (page) return page; @@ -793,9 +780,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, * * @flags can have FOLL_ flags set, defined in * - * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches - * the device's dev_pagemap metadata to avoid repeating expensive lookups. - * * When getting an anonymous page and the caller has to trigger unsharing * of a shared anonymous page first, -EMLINK is returned. The caller should * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only @@ -850,7 +834,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma, struct page *follow_page(struct vm_area_struct *vma, unsigned long address, unsigned int foll_flags) { - struct follow_page_context ctx = { NULL }; + struct follow_page_context ctx = { 0 }; struct page *page; if (vma_is_secretmem(vma)) @@ -860,8 +844,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, return NULL; page = follow_page_mask(vma, address, foll_flags, &ctx); - if (ctx.pgmap) - put_dev_pagemap(ctx.pgmap); return page; } @@ -1121,7 +1103,7 @@ static long __get_user_pages(struct mm_struct *mm, { long ret = 0, i = 0; struct vm_area_struct *vma = NULL; - struct follow_page_context ctx = { NULL }; + struct follow_page_context ctx = { 0 }; if (!nr_pages) return 0; @@ -1244,8 +1226,6 @@ static long __get_user_pages(struct mm_struct *mm, nr_pages -= page_increm; } while (nr_pages); out: - if (ctx.pgmap) - put_dev_pagemap(ctx.pgmap); return i ? 
i : ret; } @@ -2325,9 +2305,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { - struct dev_pagemap *pgmap = NULL; - int nr_start = *nr, ret = 0; pte_t *ptep, *ptem; + int ret = 0; ptem = ptep = pte_offset_map(&pmd, addr); do { @@ -2348,12 +2327,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, if (pte_devmap(pte)) { if (unlikely(flags & FOLL_LONGTERM)) goto pte_unmap; - - pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); - if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, flags, pages); - goto pte_unmap; - } } else if (pte_special(pte)) goto pte_unmap; @@ -2400,8 +2373,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, ret = 1; pte_unmap: - if (pgmap) - put_dev_pagemap(pgmap); pte_unmap(ptem); return ret; } @@ -2428,28 +2399,17 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { - int nr_start = *nr; - struct dev_pagemap *pgmap = NULL; - do { struct page *page = pfn_to_page(pfn); - pgmap = get_dev_pagemap(pfn, pgmap); - if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, flags, pages); - break; - } SetPageReferenced(page); pages[*nr] = page; - if (unlikely(!try_grab_page(page, flags))) { - undo_dev_pagemap(nr, nr_start, flags, pages); + if (unlikely(!try_grab_page(page, flags))) break; - } (*nr)++; pfn++; } while (addr += PAGE_SIZE, addr != end); - put_dev_pagemap(pgmap); return addr == end; } @@ -2458,16 +2418,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, struct page **pages, int *nr) { unsigned long fault_pfn; - int nr_start = *nr; fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; - if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - 
undo_dev_pagemap(nr, nr_start, flags, pages); + if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) return 0; - } + return 1; } @@ -2476,16 +2434,13 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, struct page **pages, int *nr) { unsigned long fault_pfn; - int nr_start = *nr; fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr)) return 0; - if (unlikely(pud_val(orig) != pud_val(*pudp))) { - undo_dev_pagemap(nr, nr_start, flags, pages); + if (unlikely(pud_val(orig) != pud_val(*pudp))) return 0; - } return 1; } #else diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8a7c1b344abe..ef68296f2158 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1031,55 +1031,6 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr, update_mmu_cache_pmd(vma, addr, pmd); } -struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, - pmd_t *pmd, int flags, struct dev_pagemap **pgmap) -{ - unsigned long pfn = pmd_pfn(*pmd); - struct mm_struct *mm = vma->vm_mm; - struct page *page; - - assert_spin_locked(pmd_lockptr(mm, pmd)); - - /* - * When we COW a devmap PMD entry, we split it into PTEs, so we should - * not be in this function with `flags & FOLL_COW` set. - */ - WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set"); - - /* FOLL_GET and FOLL_PIN are mutually exclusive. */ - if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == - (FOLL_PIN | FOLL_GET))) - return NULL; - - if (flags & FOLL_WRITE && !pmd_write(*pmd)) - return NULL; - - if (pmd_present(*pmd) && pmd_devmap(*pmd)) - /* pass */; - else - return NULL; - - if (flags & FOLL_TOUCH) - touch_pmd(vma, addr, pmd, flags & FOLL_WRITE); - - /* - * device mapped pages can only be returned if the - * caller will manage the page reference count. 
- */ - if (!(flags & (FOLL_GET | FOLL_PIN))) - return ERR_PTR(-EEXIST); - - pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; - *pgmap = get_dev_pagemap(pfn, *pgmap); - if (!*pgmap) - return ERR_PTR(-EFAULT); - page = pfn_to_page(pfn); - if (!try_grab_page(page, flags)) - page = ERR_PTR(-ENOMEM); - - return page; -} - int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) @@ -1196,7 +1147,7 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr, } struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, - pud_t *pud, int flags, struct dev_pagemap **pgmap) + pud_t *pud, int flags) { unsigned long pfn = pud_pfn(*pud); struct mm_struct *mm = vma->vm_mm; @@ -1230,9 +1181,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, return ERR_PTR(-EEXIST); pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; - *pgmap = get_dev_pagemap(pfn, *pgmap); - if (!*pgmap) - return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); if (!try_grab_page(page, flags)) page = ERR_PTR(-ENOMEM);