From patchwork Thu Oct 22 23:41:27 2015
From: "Williams, Dan J"
To: "jack@suse.cz"
Cc: "linux-nvdimm@lists.01.org", "david@fromorbit.com",
 "linux-kernel@vger.kernel.org", "axboe@fb.com",
 "akpm@linux-foundation.org", "hch@lst.de"
Subject: Re: [PATCH 5/5] block: enable dax for raw block devices
Date: Thu, 22 Oct 2015 23:41:27 +0000
Message-ID: <1445557283.17208.30.camel@intel.com>
In-Reply-To: <20151022210818.GC8670@quack.suse.cz>
References: <20151022064142.12700.11849.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20151022064211.12700.77105.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20151022093549.GE14445@quack.suse.cz>
 <1445529945.17208.4.camel@intel.com>
 <20151022210818.GC8670@quack.suse.cz>

On Thu, 2015-10-22 at 23:08 +0200, Jan Kara wrote:
> On Thu 22-10-15 16:05:46, Williams, Dan J wrote:
> > On Thu, 2015-10-22 at 11:35 +0200, Jan Kara wrote:
> > > On Thu 22-10-15 02:42:11, Dan Williams wrote:
> > > > If an application wants exclusive access to all of the persistent memory
> > > > provided by an NVDIMM namespace it can use this raw-block-dax facility
> > > > to forgo establishing a filesystem.
> > > > This capability is targeted
> > > > primarily to hypervisors wanting to provision persistent memory for
> > > > guests.
> > > >
> > > > Cc: Jan Kara
> > > > Cc: Jeff Moyer
> > > > Cc: Christoph Hellwig
> > > > Cc: Dave Chinner
> > > > Cc: Andrew Morton
> > > > Cc: Ross Zwisler
> > > > Signed-off-by: Dan Williams
> > > > ---
> > > >  fs/block_dev.c |   54 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 53 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > > index 3255dcec96b4..c27cd1a21a13 100644
> > > > --- a/fs/block_dev.c
> > > > +++ b/fs/block_dev.c
> > > > @@ -1687,13 +1687,65 @@ static const struct address_space_operations def_blk_aops = {
> > > >  	.is_dirty_writeback = buffer_check_dirty_writeback,
> > > >  };
> > > >
> > > > +#ifdef CONFIG_FS_DAX
> > > > +/*
> > > > + * In the raw block case we do not need to contend with truncation nor
> > > > + * unwritten file extents.  Without those concerns there is no need for
> > > > + * additional locking beyond the mmap_sem context that these routines
> > > > + * are already executing under.
> > > > + *
> > > > + * Note, there is no protection if the block device is dynamically
> > > > + * resized (partition grow/shrink) during a fault.  A stable block device
> > > > + * size is already not enforced in the blkdev_direct_IO path.
> > > > + *
> > > > + * For DAX, it is the responsibility of the block device driver to
> > > > + * ensure the whole-disk device size is stable while requests are in
> > > > + * flight.
> > > > + *
> > > > + * Finally, these paths do not synchronize against freezing
> > > > + * (sb_start_pagefault(), etc...) since bdev_sops does not support
> > > > + * freezing.
> > >
> > > Well, for devices freezing is handled directly in the block layer code
> > > (blk_stop_queue()) since there's no need to put some metadata structures
> > > into a consistent state. So the comment about bdev_sops is somewhat
> > > strange.
> >
> > This text was aimed at the request from Ross to document the differences
> > vs the generic_file_mmap() path.  Is the following incremental change
> > more clear?
>
> Well, not really. I thought you'd just delete that paragraph :) The thing
> is: When doing IO directly to the block device, it makes no sense to look
> at a filesystem on top of it - hopefully there is none since you'd be
> corrupting it. So the paragraph that would make sense to me would be:
>
>  * Finally, in contrast to filemap_page_mkwrite(), we don't bother calling
>  * sb_start_pagefault(). There is no filesystem which could be frozen here
>  * and when bdev gets frozen, IO gets blocked in the request queue.
>
> But when spelled out like this, I've realized that with DAX, this blocking
> of requests in the request queue doesn't really block the IO to the device.
> So block device freezing (aka blk_queue_stop()) doesn't work reliably with
> DAX. That should be fixed but it's not easy as the only way to do that
> would be to hook into blk_stop_queue() and unmap (or at least
> write-protect) all the mappings of the device. Ugh...
>
> Ugh2: Now I realized that DAX mmap isn't safe wrt fs freezing even for
> filesystems since there's nothing which writeprotects pages that are
> writeably mapped. In normal path, page writeback does this but that doesn't
> happen for DAX. I remember we once talked about this but it got lost.
> We need something like walk all filesystem inodes during fs freeze and
> writeprotect all pages that are mapped. But that's going to be slow...

This sounds suspiciously like what I'm planning to do for the device
teardown path when we've dynamically allocated struct page.  The backing
memory for those pages is freed when the driver runs its ->remove()
path, so we have to be sure there are no outstanding references to them.

My current proposal for the teardown case, that we might re-purpose for
this freeze case, is below.
It relies on the percpu_ref in the request_queue to block new faults and
then uses truncate_pagecache() to tear down mappings.  However, this
assumes we've inserted pages into the address_space radix at fault,
which we don't currently do...

In general, as this page-backed-pmem support lands upstream, I'm of the
opinion that the page-less DAX support be deprecated/disabled
unless/until it can be made as functionally capable as the page-enabled
paths.

8<----
Subject: mm, pmem: devm_memunmap_pages(), truncate and unmap ZONE_DEVICE pages

From: Dan Williams

Before we allow ZONE_DEVICE pages to be put into active use outside of
the pmem driver, we need to arrange for them to be reclaimed when the
driver is shutdown.  devm_memunmap_pages() must wait for all pages to
return to the initial mapcount of 1.  If a given page is mapped by a
process we will truncate it out of its inode mapping and unmap it out of
the process vma.

This truncation is done while the dev_pagemap reference count is "dead",
preventing new references from being taken while the truncate+unmap scan
is in progress.
Cc: Dave Hansen
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Ross Zwisler
Cc: Matthew Wilcox
Cc: Alexander Viro
Cc: Dave Chinner
Signed-off-by: Dan Williams
---
 drivers/nvdimm/pmem.c |   42 ++++++++++++++++++++++++++++++++++++------
 fs/dax.c              |    2 ++
 include/linux/mm.h    |    5 +++++
 kernel/memremap.c     |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f7acce594fa0..2c9aebbc3fea 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -24,12 +24,15 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include "pfn.h"
 #include "nd.h"

+static ASYNC_DOMAIN_EXCLUSIVE(async_pmem);
+
 struct pmem_device {
 	struct request_queue *pmem_queue;
 	struct gendisk *pmem_disk;
@@ -164,14 +167,43 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	return pmem;
 }

-static void pmem_detach_disk(struct pmem_device *pmem)
+
+static void async_blk_cleanup_queue(void *data, async_cookie_t cookie)
+{
+	struct pmem_device *pmem = data;
+
+	blk_cleanup_queue(pmem->pmem_queue);
+}
+
+static void pmem_detach_disk(struct device *dev)
 {
+	struct pmem_device *pmem = dev_get_drvdata(dev);
+	struct request_queue *q = pmem->pmem_queue;
+
 	if (!pmem->pmem_disk)
 		return;

 	del_gendisk(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
+	async_schedule_domain(async_blk_cleanup_queue, pmem, &async_pmem);
+
+	if (pmem->pfn_flags & PFN_MAP) {
+		/*
+		 * Wait for queue to go dead so that we know no new
+		 * references will be taken against the pages allocated
+		 * by devm_memremap_pages().
+		 */
+		blk_wait_queue_dead(q);
+
+		/*
+		 * Manually release the page mapping so that
+		 * blk_cleanup_queue() can complete queue draining.
+		 */
+		devm_memunmap_pages(dev, (void __force *) pmem->virt_addr);
+	}
+
+	/* Wait for blk_cleanup_queue() to finish */
+	async_synchronize_full_domain(&async_pmem);
 }

 static int pmem_attach_disk(struct device *dev,
@@ -299,11 +331,9 @@ static int nd_pfn_init(struct nd_pfn *nd_pfn)
 static int nvdimm_namespace_detach_pfn(struct nd_namespace_common *ndns)
 {
 	struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
-	struct pmem_device *pmem;

 	/* free pmem disk */
-	pmem = dev_get_drvdata(&nd_pfn->dev);
-	pmem_detach_disk(pmem);
+	pmem_detach_disk(&nd_pfn->dev);

 	/* release nd_pfn resources */
 	kfree(nd_pfn->pfn_sb);
@@ -446,7 +476,7 @@ static int nd_pmem_remove(struct device *dev)
 	else if (is_nd_pfn(dev))
 		nvdimm_namespace_detach_pfn(pmem->ndns);
 	else
-		pmem_detach_disk(pmem);
+		pmem_detach_disk(dev);

 	return 0;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 8d756562fcf0..0bc9b315d16f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -46,6 +46,7 @@ static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
 		blk_queue_exit(q);
 		return (void __pmem *) ERR_PTR(rc);
 	}
+	rcu_read_lock();
 	return addr;
 }

@@ -62,6 +63,7 @@ static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
 	if (IS_ERR(addr))
 		return;
 	blk_queue_exit(bdev->bd_queue);
+	rcu_read_unlock();
 }

 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5b5267eae5b..294518ddf5bc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -801,6 +801,7 @@ struct dev_pagemap {
 #ifdef CONFIG_ZONE_DEVICE
 struct dev_pagemap *__get_dev_pagemap(resource_size_t phys);
+void devm_memunmap_pages(struct device *dev, void *addr);
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 #else
@@ -809,6 +810,10 @@ static inline struct dev_pagemap *__get_dev_pagemap(resource_size_t phys)
 	return NULL;
 }

+static inline void devm_memunmap_pages(struct device *dev, void *addr)
+{
+}
+
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
 		struct vmem_altmap *altmap)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4698071a1c43..ac74336e6d73 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -187,10 +188,39 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap)

 static void devm_memremap_pages_release(struct device *dev, void *data)
 {
+	unsigned long pfn;
 	struct page_map *page_map = data;
 	struct resource *res = &page_map->res;
+	struct address_space *mapping_prev = NULL;
 	struct dev_pagemap *pgmap = &page_map->pgmap;

+	if (percpu_ref_tryget_live(pgmap->ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(pgmap->ref);
+	}
+
+	/* flush in-flight dax_map_atomic() operations */
+	synchronize_rcu();
+
+	for_each_device_pfn(pfn, pgmap) {
+		struct page *page = pfn_to_page(pfn);
+		struct address_space *mapping = page->mapping;
+		struct inode *inode = mapping ? mapping->host : NULL;
+
+		dev_WARN_ONCE(dev, atomic_read(&page->_count) < 1,
+				"%s: ZONE_DEVICE page was freed!\n", __func__);
+
+		if (!mapping || !inode || mapping == mapping_prev) {
+			dev_WARN_ONCE(dev, atomic_read(&page->_count) > 1,
+					"%s: unexpected elevated page count pfn: %lx\n",
+					__func__, pfn);
+			continue;
+		}
+
+		truncate_pagecache(inode, 0);
+		mapping_prev = mapping;
+	}
+
 	/* pages are dead and unused, undo the arch mapping */
 	arch_remove_memory(res->start, resource_size(res));
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -292,6 +322,24 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	return __va(res->start);
 }
 EXPORT_SYMBOL(devm_memremap_pages);
+
+static int page_map_match(struct device *dev, void *res, void *match_data)
+{
+	struct page_map *page_map = res;
+	resource_size_t phys = *(resource_size_t *) match_data;
+
+	return page_map->res.start == phys;
+}
+
+void devm_memunmap_pages(struct device *dev, void *addr)
+{
+	resource_size_t start = __pa(addr);
+
+	if (devres_release(dev, devm_memremap_pages_release, page_map_match,
+				&start) != 0)
+		dev_WARN(dev, "failed to find page map to release\n");
+}
+EXPORT_SYMBOL(devm_memunmap_pages);
 #endif /* CONFIG_ZONE_DEVICE */

 #ifdef CONFIG_SPARSEMEM_VMEMMAP