From patchwork Mon Nov 2 04:30:26 2015
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 7533481
Subject: [PATCH v3 08/15] mm, dax, pmem: introduce pfn_t
From: Dan Williams
To: axboe@fb.com
Date: Sun, 01 Nov 2015 23:30:26 -0500
Message-ID: <20151102043025.6610.24022.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <20151102042941.6610.27784.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <20151102042941.6610.27784.stgit@dwillia2-desk3.amr.corp.intel.com>
User-Agent: StGit/0.17.1-9-g687f
Cc: Dave Hansen, jack@suse.cz, linux-nvdimm@lists.01.org, david@fromorbit.com,
 linux-kernel@vger.kernel.org, hch@lst.de, Andrew Morton

For the purpose of communicating the optional presence of a 'struct page'
for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory": a set of pfns that are not part of the kernel's
linear mapping by default, but are accessed via the same memory
controller as RAM.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target).  It is also
needed to maintain a list of mapped inodes which must be unmapped at
driver teardown or freeze_bdev() time.
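For illustration, the encoding packs the flags into the high bits of the
value, above any valid pfn, so the pfn and the flags can be recovered
independently.  A minimal user-space model of the masking (assumes a
64-bit build and 4K pages; illustrative only, not part of the patch):

	/* user-space model of the pfn_t encoding, illustrative only */
	#include <assert.h>
	#include <stdio.h>

	#define BITS_PER_LONG	64
	#define PAGE_SHIFT	12	/* assumed 4K page size */
	#define PAGE_MASK	(~((1UL << PAGE_SHIFT) - 1))

	#define PFN_FLAGS_MASK	(~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
	#define PFN_DEV		(1UL << (BITS_PER_LONG - 3))
	#define PFN_MAP		(1UL << (BITS_PER_LONG - 4))

	int main(void)
	{
		unsigned long pfn = 0x12345;
		unsigned long val = pfn | PFN_DEV | PFN_MAP;

		/* both halves of the encoding survive a round-trip */
		assert((val & ~PFN_FLAGS_MASK) == pfn);
		assert((val & PFN_FLAGS_MASK) == (PFN_DEV | PFN_MAP));
		printf("pfn=%#lx flags=%#lx\n", val & ~PFN_FLAGS_MASK,
				val & PFN_FLAGS_MASK);
		return 0;
	}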
Cc: Christoph Hellwig
Cc: Dave Hansen
Cc: Andrew Morton
Cc: Ross Zwisler
Signed-off-by: Dan Williams
---
 arch/powerpc/sysdev/axonram.c |    8 ++---
 drivers/block/brd.c           |    4 +--
 drivers/nvdimm/pmem.c         |    6 +++-
 drivers/s390/block/dcssblk.c  |   10 +++---
 fs/block_dev.c                |    2 +
 fs/dax.c                      |   19 ++++++------
 include/linux/blkdev.h        |    4 +--
 include/linux/mm.h            |   65 +++++++++++++++++++++++++++++++++++++++++
 include/linux/pfn.h           |    9 ++++++
 9 files changed, 101 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index d2b79bc336c1..59ca4c0ab529 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -141,15 +141,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
  */
 static long
 axon_ram_direct_access(struct block_device *device, sector_t sector,
-		       void __pmem **kaddr, unsigned long *pfn)
+		       void __pmem **kaddr, pfn_t *pfn)
 {
 	struct axon_ram_bank *bank = device->bd_disk->private_data;
 	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
-	void *addr = (void *)(bank->ph_addr + offset);
-
-	*kaddr = (void __pmem *)addr;
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
+	*kaddr = (void __pmem __force *) bank->io_addr + offset;
+	*pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV);
 
 	return bank->size - offset;
 }
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index b9794aeeb878..0bbc60463779 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -374,7 +374,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct brd_device *brd = bdev->bd_disk->private_data;
 	struct page *page;
@@ -385,7 +385,7 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	if (!page)
 		return -ENOSPC;
 	*kaddr = (void __pmem *)page_address(page);
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn_t(page);
 
 	return PAGE_SIZE;
 }
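The two conversions above capture the distinction the new type encodes:
brd's pages are ordinary pfn_valid() memory, so it uses page_to_pfn_t()
with no flags, while axon_ram advertises device memory with PFN_DEV.  A
hypothetical driver (sketch only; the "foo" structure and its fields
are made up, not part of this patch) would pick the constructor
accordingly:

	/* sketch only: "foo" is a made-up driver, not from this patch */
	struct foo_dev {
		void __pmem *virt_addr;	/* kernel mapping of the device */
		phys_addr_t phys_addr;	/* base of the device range */
		size_t size;
		bool backed_by_pages;	/* is there a memmap for this range? */
	};

	static long foo_direct_access(struct block_device *bdev,
			sector_t sector, void __pmem **kaddr, pfn_t *pfn)
	{
		struct foo_dev *foo = bdev->bd_disk->private_data;
		resource_size_t offset = sector * 512;

		*kaddr = foo->virt_addr + offset;
		if (foo->backed_by_pages)
			/* pfn_valid() memory: no flags needed */
			*pfn = pfn_to_pfn_t((foo->phys_addr + offset)
					>> PAGE_SHIFT);
		else
			/* device memory outside the linear mapping */
			*pfn = phys_to_pfn_t(foo->phys_addr + offset,
					PFN_DEV);

		return foo->size - offset;
	}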
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 93472953e231..09093372e5f0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -39,6 +39,7 @@ struct pmem_device {
 	phys_addr_t		phys_addr;
 	/* when non-zero this device is hosting a 'pfn' instance */
 	phys_addr_t		data_offset;
+	unsigned long		pfn_flags;
 	void __pmem		*virt_addr;
 	size_t			size;
 };
@@ -106,7 +107,7 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 	resource_size_t offset = sector * 512 + pmem->data_offset;
 
 	*kaddr = pmem->virt_addr + offset;
-	*pfn = __phys_to_pfn(pmem->phys_addr + offset);
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags);
 
 	return pmem->size - offset;
 }
@@ -144,8 +145,10 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	if (!q)
 		return ERR_PTR(-ENOMEM);
 
+	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res);
+		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
 				pmem->phys_addr, pmem->size,
@@ -356,6 +359,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	pmem = dev_get_drvdata(dev);
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res);
+	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
 		goto err;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 5ed44fe21380..e2b2839e4de5 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -29,7 +29,7 @@
 static int dcssblk_open(struct block_device *bdev, fmode_t mode);
 static void dcssblk_release(struct gendisk *disk, fmode_t mode);
 static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
 static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
-			 void __pmem **kaddr, unsigned long *pfn);
+			 void __pmem **kaddr, pfn_t *pfn);
 
 static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
 
@@ -881,20 +881,18 @@ fail:
 static long
 dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
-			void __pmem **kaddr, unsigned long *pfn)
+			void __pmem **kaddr, pfn_t *pfn)
 {
 	struct dcssblk_dev_info *dev_info;
 	unsigned long offset, dev_sz;
-	void *addr;
 
 	dev_info = bdev->bd_disk->private_data;
 	if (!dev_info)
 		return -ENODEV;
 	dev_sz = dev_info->end - dev_info->start;
 	offset = secnum * 512;
-	addr = (void *) (dev_info->start + offset);
-	*pfn = virt_to_phys(addr) >> PAGE_SHIFT;
-	*kaddr = (void __pmem *) addr;
+	*kaddr = (void __pmem *) (dev_info->start + offset);
+	*pfn = phys_to_pfn_t(dev_info->start + offset, PFN_DEV);
 
 	return dev_sz - offset;
 }
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0a793c7930eb..84b042778812 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -442,7 +442,7 @@ EXPORT_SYMBOL_GPL(bdev_write_page);
  * accessible at this address.
  */
 long bdev_direct_access(struct block_device *bdev, sector_t sector,
-			void __pmem **addr, unsigned long *pfn, long size)
+			void __pmem **addr, pfn_t *pfn, long size)
 {
 	long avail;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
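On the consumer side, bdev_direct_access() now hands back a pfn_t, so a
caller that wants the backing struct page must go through the accessors
instead of assuming memmap coverage.  A sketch of such a caller
(illustrative only; foo_get_dax_page() is made up, not from this
patch):

	/* sketch only: check for memmap coverage before using the page */
	static int foo_get_dax_page(struct block_device *bdev,
			sector_t sector, struct page **page)
	{
		void __pmem *addr;
		pfn_t pfn;
		long avail = bdev_direct_access(bdev, sector, &addr, &pfn,
				PAGE_SIZE);

		if (avail < 0)
			return avail;
		if (!pfn_t_has_page(pfn))
			return -ENXIO;	/* PFN_DEV without PFN_MAP */
		*page = pfn_t_to_page(pfn);
		return 0;
	}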
diff --git a/fs/dax.c b/fs/dax.c
index a480729c00ec..4d6861f022d9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,7 +31,7 @@
 #include
 
 static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
-		long size, unsigned long *pfn, long *len)
+		long size, pfn_t *pfn, long *len)
 {
 	long rc;
 	void __pmem *addr;
@@ -52,7 +52,7 @@ static void __pmem *__dax_map_atomic(struct block_device *bdev, sector_t sector,
 static void __pmem *dax_map_atomic(struct block_device *bdev, sector_t sector,
 		long size)
 {
-	unsigned long pfn;
+	pfn_t pfn;
 
 	return __dax_map_atomic(bdev, sector, size, &pfn, NULL);
 }
@@ -72,8 +72,8 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	might_sleep();
 	do {
 		void __pmem *addr;
-		unsigned long pfn;
 		long count, sz;
+		pfn_t pfn;
 
 		sz = min_t(long, size, SZ_1M);
 		addr = __dax_map_atomic(bdev, sector, size, &pfn, &count);
@@ -141,7 +141,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	struct block_device *bdev = NULL;
 	int rw = iov_iter_rw(iter), rc;
 	long map_len = 0;
-	unsigned long pfn;
+	pfn_t pfn;
 	void __pmem *addr = NULL;
 	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
@@ -333,9 +333,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
 	struct address_space *mapping = inode->i_mapping;
 	struct block_device *bdev = bh->b_bdev;
-	unsigned long pfn;
 	void __pmem *addr;
 	pgoff_t size;
+	pfn_t pfn;
 	int error;
 
 	i_mmap_lock_read(mapping);
@@ -366,7 +366,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, addr);
 
-	error = vm_insert_mixed(vma, vaddr, pfn);
+	error = vm_insert_mixed(vma, vaddr, pfn_t_to_pfn(pfn));
 
  out:
 	i_mmap_unlock_read(mapping);
@@ -655,8 +655,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
+		pfn_t pfn;
 		long length;
-		unsigned long pfn;
 		void __pmem *kaddr = __dax_map_atomic(bdev,
 				to_sector(&bh, inode), HPAGE_SIZE, &pfn,
 				&length);
@@ -665,7 +665,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) {
+		if ((length < PMD_SIZE) || (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)) {
 			dax_unmap_atomic(bdev, kaddr);
 			goto fallback;
 		}
@@ -679,7 +679,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, kaddr);
 
-		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+		result |= vmf_insert_pfn_pmd(vma, address, pmd,
+				pfn_t_to_pfn(pfn), write);
 	}
 
  out:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 59a770dad804..b78e01542e9e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1620,7 +1620,7 @@ struct block_device_operations {
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	long (*direct_access)(struct block_device *, sector_t, void __pmem **,
-			unsigned long *pfn);
+			pfn_t *);
 	unsigned int (*check_events) (struct gendisk *disk,
 				      unsigned int clearing);
 	/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1639,7 +1639,7 @@ extern int bdev_read_page(struct block_device *, sector_t, struct page *);
 extern int bdev_write_page(struct block_device *, sector_t, struct page *,
 						struct writeback_control *);
 extern long bdev_direct_access(struct block_device *, sector_t,
-		void __pmem **addr, unsigned long *pfn, long size);
+		void __pmem **addr, pfn_t *pfn, long size);
 #else /* CONFIG_BLOCK */
 
 struct block_device;
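A side benefit of passing a single-member struct through these
signatures, rather than a raw unsigned long, is type safety: a bare pfn
can no longer be handed to a pfn_t consumer without an explicit
conversion.  A user-space illustration (not from the patch):

	/* user-space illustration of the struct-wrapper type safety */
	typedef struct {
		unsigned long val;
	} pfn_t;

	static void insert_mapping(pfn_t pfn)	/* new-style consumer */
	{
		(void) pfn;
	}

	int main(void)
	{
		unsigned long raw_pfn = 0x12345;
		pfn_t pfn = { .val = raw_pfn };

		insert_mapping(pfn);		/* ok: explicit pfn_t */
		/* insert_mapping(raw_pfn);	   compile error now */
		return 0;
	}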
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de019ba..b8a90c481ae4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -927,6 +927,71 @@ static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
 #endif
 
 /*
+ * PFN_FLAGS_MASK - mask of all the possible valid pfn_t flags
+ * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
+ * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
+ * PFN_DEV - pfn is not covered by system memmap by default
+ * PFN_MAP - pfn has a dynamic page mapping established by a device driver
+ */
+#define PFN_FLAGS_MASK (~PAGE_MASK << (BITS_PER_LONG - PAGE_SHIFT))
+#define PFN_SG_CHAIN (1UL << (BITS_PER_LONG - 1))
+#define PFN_SG_LAST (1UL << (BITS_PER_LONG - 2))
+#define PFN_DEV (1UL << (BITS_PER_LONG - 3))
+#define PFN_MAP (1UL << (BITS_PER_LONG - 4))
+
+static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
+{
+	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
+
+	return pfn_t;
+}
+
+/* a default pfn to pfn_t conversion assumes that @pfn is pfn_valid() */
+static inline pfn_t pfn_to_pfn_t(unsigned long pfn)
+{
+	return __pfn_to_pfn_t(pfn, 0);
+}
+
+static inline pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
+{
+	return __pfn_to_pfn_t(addr >> PAGE_SHIFT, flags);
+}
+
+static inline bool pfn_t_has_page(pfn_t pfn)
+{
+	return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
+}
+
+static inline unsigned long pfn_t_to_pfn(pfn_t pfn)
+{
+	return pfn.val & ~PFN_FLAGS_MASK;
+}
+
+static inline struct page *pfn_t_to_page(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return pfn_to_page(pfn_t_to_pfn(pfn));
+	return NULL;
+}
+
+static inline dma_addr_t pfn_t_to_phys(pfn_t pfn)
+{
+	return PFN_PHYS(pfn_t_to_pfn(pfn));
+}
+
+static inline void *pfn_t_to_virt(pfn_t pfn)
+{
+	if (pfn_t_has_page(pfn))
+		return __va(pfn_t_to_phys(pfn));
+	return NULL;
+}
+
+static inline pfn_t page_to_pfn_t(struct page *page)
+{
+	return pfn_to_pfn_t(page_to_pfn(page));
+}
+
+/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>
diff --git a/include/linux/pfn.h b/include/linux/pfn.h
index 7646637221f3..96df85985f16 100644
--- a/include/linux/pfn.h
+++ b/include/linux/pfn.h
@@ -3,6 +3,15 @@
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
+
+/*
+ * pfn_t: encapsulates a page-frame number that is optionally backed
+ * by memmap (struct page).  Whether a pfn_t has a 'struct page'
+ * backing is indicated by flags in the high bits of the value.
+ */
+typedef struct {
+	unsigned long val;
+} pfn_t;
 #endif
 
 #define PFN_ALIGN(x)	(((unsigned long)(x) + (PAGE_SIZE - 1)) & PAGE_MASK)
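To make pfn_t_has_page() concrete: it reports true for ordinary
memmap-backed pfns (PFN_DEV clear) and for device memory with a
driver-established memmap (PFN_DEV | PFN_MAP), and false only for
unmapped device memory, where pfn_t_to_page() returns NULL.  A
user-space model of the predicate (assumes a 64-bit build; not part of
the patch):

	/* user-space model of pfn_t_has_page(), illustrative only */
	#include <assert.h>

	#define BITS_PER_LONG	64
	#define PFN_DEV		(1UL << (BITS_PER_LONG - 3))
	#define PFN_MAP		(1UL << (BITS_PER_LONG - 4))

	typedef struct {
		unsigned long val;
	} pfn_t;

	static int pfn_t_has_page(pfn_t pfn)
	{
		return (pfn.val & PFN_MAP) == PFN_MAP ||
			(pfn.val & PFN_DEV) == 0;
	}

	int main(void)
	{
		pfn_t ram = { .val = 0x1000 };
		pfn_t dev = { .val = 0x1000 | PFN_DEV };
		pfn_t map = { .val = 0x1000 | PFN_DEV | PFN_MAP };

		assert(pfn_t_has_page(ram));	/* struct page exists */
		assert(!pfn_t_has_page(dev));	/* no memmap coverage */
		assert(pfn_t_has_page(map));	/* driver-mapped device memory */
		return 0;
	}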