Message ID | 1457979277-26791-1-git-send-email-stephen.bates@pmcs.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Mar 14, 2016 at 12:14:37PM -0600, Stephen Bates wrote:
> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
> this isn't much different than the existing situation with RDMA: if
> userspace sets up an MR for remote use, they need to be careful about
> using that memory region themselves.

There's more to the coherency problem than this. As I understand it, on
x86, memory in a PCI BAR does not participate in the coherency protocol.
So you can get a situation where CPU A stores 4 bytes to offset 8 in a
cacheline, then CPU B stores 4 bytes to offset 16 in the same cacheline,
and CPU A's write mysteriously goes missing.

I may have misunderstood the exact details when this was explained to me
a few years ago, but the details were horrible enough to run away
screaming. Pretending PCI BARs are real memory? Just Say No.
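[For illustration only, a minimal user-space sketch of the lost-write
scenario described above. The device path and offsets are made up, and
the hazard only arises if the BAR were mapped cacheable on hardware
where BAR memory does not participate in coherency; as noted in the
reply below, a write-combining mapping alone cannot produce it.]

/*
 * Hypothetical sketch: two threads store to different dwords of the
 * same 64-byte line in a PCI BAR. If the mapping were cacheable but
 * non-coherent, one store could vanish when the other CPU's copy of
 * the line is written back.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

static volatile uint32_t *bar;

static void *cpu_a(void *unused) { bar[2] = 0xaaaaaaaa; return NULL; } /* offset 8 */
static void *cpu_b(void *unused) { bar[4] = 0xbbbbbbbb; return NULL; } /* offset 16 */

int main(void)
{
        /* Illustrative path; sysfs resource0 mappings are normally uncached. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
        pthread_t a, b;

        bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        pthread_create(&a, NULL, cpu_a, NULL);
        pthread_create(&b, NULL, cpu_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* On a coherent mapping both values survive; the worry is that
         * on a non-coherent one they might not. */
        printf("offset 8: %x, offset 16: %x\n", bar[2], bar[4]);
        return 0;
}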
On Mon, Mar 14, 2016 at 05:23:44PM -0400, Matthew Wilcox wrote:
> On Mon, Mar 14, 2016 at 12:14:37PM -0600, Stephen Bates wrote:
> > 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> > peer there is potential for coherency issues and for writes to occur out
> > of order. This is something that users of this feature need to be
> > cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
> > this isn't much different than the existing situation with RDMA: if
> > userspace sets up an MR for remote use, they need to be careful about
> > using that memory region themselves.
>
> There's more to the coherency problem than this. As I understand it, on
> x86, memory in a PCI BAR does not participate in the coherency protocol.
> So you can get a situation where CPU A stores 4 bytes to offset 8 in a
> cacheline, then CPU B stores 4 bytes to offset 16 in the same cacheline,
> and CPU A's write mysteriously goes missing.

No, this cannot happen with write combining. You need full caching turned
on to get that kind of problem. Write combining can only combine writes;
it cannot make up writes that never existed.

That said, a question I don't know the answer to is how write
locking/memory barriers interact with the write combining CPU buffers,
and whether all the fencing semantics are guaranteed. There is some
interaction there (some drivers use write combining a lot).. but that
sure is a rarely used corner area...

The other issue is that the fencing mechanism RDMA uses to create
ordering with system memory is not good enough to fence peer-peer
transactions in the general case. It is only possibly good enough if
all the transactions run through the root complex.

> I may have misunderstood the exact details when this was explained to me
> a few years ago, but the details were horrible enough to run away
> screaming. Pretending PCI BARs are real memory? Just Say No.

Someone should probably explain in more detail what this is even good
for; DAX on PCI-E BAR memory seems goofy in the general case. I was
under the impression the main use case involved the CPU never touching
these memories and just using them to route-through to another IO device
(eg network). So all these discussions about CPU coherency seem a bit
strange.

Jason
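[For context on the barrier question raised above: the common driver
pattern pairs write-combined MMIO with an explicit fence before a
doorbell write. A hedged kernel-style sketch, with names and layout that
are illustrative rather than taken from any real driver:]

/*
 * Illustrative only: post descriptors through a write-combined BAR
 * mapping, then fence so any stores still sitting in the CPU's WC
 * buffers are flushed and ordered before the doorbell write.
 */
#include <linux/io.h>

static void post_descs(u32 __iomem *wc_buf, u32 __iomem *doorbell,
                       const u32 *desc, int n)
{
        int i;

        for (i = 0; i < n; i++)
                iowrite32(desc[i], wc_buf + i); /* stores may combine */

        wmb();                  /* drain WC buffers before the doorbell */
        iowrite32(1, doorbell); /* device may now fetch the descriptors */
}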
On 14/03/16 03:57 PM, Jason Gunthorpe wrote:
> Someone should probably explain in more detail what this is even good
> for; DAX on PCI-E BAR memory seems goofy in the general case. I was
> under the impression the main use case involved the CPU never touching
> these memories and just using them to route-through to another IO device
> (eg network). So all these discussions about CPU coherency seem a bit
> strange.

Yes, the primary purpose is to enable P2P transactions that don't
involve the CPU at all. To enable this, we do mmap the BAR region into
user space, which is then technically able to read/write to it using the
CPU. However, you're right, it is silly to write to the mmap'd PCI BAR
for anything but debug/testing purposes -- this type of access also has
horrible performance. Really, the mmapping is just a convenient way to
pass around the addresses with existing interfaces that expect system
RAM (RDMA, O_DIRECT).

Putting DAX on the PCI-E BAR is actually more of a curiosity at the
moment than anything. The current plan for NVMe with CMB would not
involve DAX. CMB buffers would be allocated, perhaps by mapping the
nvmeX char device, which could then be used with O_DIRECT access on a
file on the NVMe device and also be passed to RDMA devices. In this way
data could flow from the NVMe device to an RDMA network without using
system memory to buffer it.

Logan
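[A rough user-space sketch of the flow Logan outlines -- mmap a
CMB-backed char device and hand the mapping to the RDMA stack. The
device path, offset and length are hypothetical, and error handling is
trimmed for brevity:]

/*
 * Sketch only: /dev/nvme0 as a CMB provider is a placeholder. With
 * struct pages behind the BAR (the point of this patch set), the RDMA
 * core's get_user_pages() can pin this range like ordinary RAM.
 */
#include <fcntl.h>
#include <infiniband/verbs.h>
#include <sys/mman.h>

static void *map_cmb_for_rdma(struct ibv_pd *pd, size_t len,
                              struct ibv_mr **mr)
{
        int fd = open("/dev/nvme0", O_RDWR);    /* hypothetical CMB provider */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);

        *mr = ibv_reg_mr(pd, buf, len,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                         IBV_ACCESS_REMOTE_WRITE);
        return buf;
}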
> On 14/03/16 03:57 PM, Jason Gunthorpe wrote:
> > Someone should probably explain in more detail what this is even good
> > for; DAX on PCI-E BAR memory seems goofy in the general case. I was
> > under the impression the main use case involved the CPU never touching
> > these memories and just using them to route-through to another IO device
> > (eg network). So all these discussions about CPU coherency seem a bit
> > strange.
>
> Yes, the primary purpose is to enable P2P transactions that don't
> involve the CPU at all. To enable this, we do mmap the BAR region into
> user space, which is then technically able to read/write to it using the
> CPU. However, you're right, it is silly to write to the mmap'd PCI BAR
> for anything but debug/testing purposes -- this type of access also has
> horrible performance. Really, the mmapping is just a convenient way to
> pass around the addresses with existing interfaces that expect system
> RAM (RDMA, O_DIRECT).
>
> Putting DAX on the PCI-E BAR is actually more of a curiosity at the
> moment than anything. The current plan for NVMe with CMB would not
> involve DAX. CMB buffers would be allocated, perhaps by mapping the
> nvmeX char device, which could then be used with O_DIRECT access on a
> file on the NVMe device and also be passed to RDMA devices. In this way
> data could flow from the NVMe device to an RDMA network without using
> system memory to buffer it.
>
> Logan

The transfer of data between PCIe devices is the main use-case for this
proposed patch. Some other applications include direct transfer of data
between PCIe SSDs (for background data movement and replication) and the
movement of data between PCIe SSDs and GPGPUs/FPGAs, accelerating
applications that use those types of offload engines.

In some of our performance tests we have seen a significant reduction in
residual DRAM bandwidth when all the data involved in these transfers
has to pass through system memory buffers. This proposed patch avoids
that problem.

Stephen
On 3/14/2016 11:57 PM, Jason Gunthorpe wrote:
> The other issue is that the fencing mechanism RDMA uses to create
> ordering with system memory is not good enough to fence peer-peer
> transactions in the general case. It is only possibly good enough if
> all the transactions run through the root complex.

Are you sure this is a problem? I'm not sure it is clear in the PCIe
specs, but I thought that for transactions that are not relaxed-ordered
and don't use ID-based ordering, a PCIe switch must prevent reads and
writes from passing writes. I assume this is true even when the
requestor ID is different, because IDO relaxes these constraints
specifically for transactions coming from different requestor IDs.

Regards,
Haggai
On Thu, Mar 17, 2016 at 05:18:11PM +0200, Haggai Eran wrote:
> On 3/14/2016 11:57 PM, Jason Gunthorpe wrote:
> > The other issue is that the fencing mechanism RDMA uses to create
> > ordering with system memory is not good enough to fence peer-peer
> > transactions in the general case. It is only possibly good enough if
> > all the transactions run through the root complex.
>
> Are you sure this is a problem? I'm not sure it is clear in the PCIe
> specs, but I thought that for transactions that are not relaxed-ordered
> and don't use ID-based ordering, a PCIe switch must prevent reads and
> writes from passing writes.

Yes, that is right, and that is good enough if the PCI-E fabric is a
simple acyclic configuration (ie the common case).

There are fringe cases that are more complex, and maybe the correct
reading of the spec is to set up routing to avoid optimal paths, but it
certainly is possible to configure switches in a way that could not
guarantee global ordering.

Jason
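[As a side note on the ordering attributes Haggai mentions: a driver
that wants to rule out relaxed-ordered TLPs from its own endpoint can
clear the enable bit in Device Control via the standard PCI helpers.
A small sketch; whether this is sufficient for any given topology is
exactly the open question above:]

/*
 * Clear the Relaxed Ordering enable bit so the endpoint emits only
 * strongly-ordered requests. Illustrative; a real driver would do
 * this from its probe() routine.
 */
#include <linux/pci.h>

static void disable_relaxed_ordering(struct pci_dev *pdev)
{
        pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
                                   PCI_EXP_DEVCTL_RELAX_EN);
}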
> There are fringe cases that are more complex, and maybe the correct
> reading of the spec is to set up routing to avoid optimal paths, but it
> certainly is possible to configure switches in a way that could not
> guarantee global ordering.
>
> Jason

If someone configures a multipath PCIe topology, I think they will have
the potential for out-of-order RDMA reads/writes regardless of whether
the target MR is in system memory or PCIe memory. So I don't think these
crazy topologies are uniquely problematic for IOPMEM.

Stephen
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8d0b546..095f358 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -184,7 +184,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
-				&q->q_usage_counter, NULL);
+				&q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
@@ -410,7 +410,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	q = pmem->pmem_queue;
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
-			&q->q_usage_counter, altmap);
+			&q->q_usage_counter, altmap, ARCH_MEMREMAP_PMEM);
 	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
diff --git a/include/linux/io.h b/include/linux/io.h
index 32403b5..e2c8419 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -135,6 +135,7 @@ enum {
 	/* See memremap() kernel-doc for usage description... */
 	MEMREMAP_WB = 1 << 0,
 	MEMREMAP_WT = 1 << 1,
+	MEMREMAP_WC = 1 << 2,
 };
 
 void *memremap(resource_size_t offset, size_t size, unsigned long flags);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index bcaa634..ad5cf23 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, unsigned long flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b981a7b..a739286 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -41,7 +41,7 @@ static void *try_ram_remap(resource_size_t offset, size_t size)
  * memremap() - remap an iomem_resource as cacheable memory
  * @offset: iomem resource start address
  * @size: size of remap
- * @flags: either MEMREMAP_WB or MEMREMAP_WT
+ * @flags: either MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC
  *
  * memremap() is "ioremap" for cases where it is known that the resource
  * being mapped does not have i/o side effects and the __iomem
@@ -57,6 +57,9 @@ static void *try_ram_remap(resource_size_t offset, size_t size)
  * cache or are written through to memory and never exist in a
  * cache-dirty state with respect to program visibility. Attempts to
  * map "System RAM" with this mapping type will fail.
+ *
+ * MEMREMAP_WC - like MEMREMAP_WT but enables write combining.
+ *
  */
 void *memremap(resource_size_t offset, size_t size, unsigned long flags)
 {
@@ -101,6 +104,12 @@ void *memremap(resource_size_t offset, size_t size, unsigned long flags)
 		addr = ioremap_wt(offset, size);
 	}
 
+	if (!addr && (flags & MEMREMAP_WC)) {
+		flags &= ~MEMREMAP_WC;
+		addr = ioremap_wc(offset, size);
+	}
+
 	return addr;
 }
 EXPORT_SYMBOL(memremap);
@@ -164,13 +173,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
+enum {
+	PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
 	struct resource res;
 	struct percpu_ref *ref;
 	struct dev_pagemap pgmap;
 	struct vmem_altmap altmap;
+	void *kaddr;
+	int flags;
 };
 
+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	struct zone *zone = pgdat->node_zones + ZONE_DEVICE;
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	return __add_pages(nid, zone, start_pfn, nr_pages);
+}
+
+static void remove_zone_device_pages(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+}
+
 void get_zone_device_page(struct page *page)
 {
 	percpu_ref_get(page->pgmap->ref);
@@ -235,8 +272,15 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+
+	if (page_map->flags & PAGEMAP_IO_MEM) {
+		remove_zone_device_pages(align_start, align_size);
+		iounmap(page_map->kaddr);
+	} else {
+		arch_remove_memory(align_start, align_size);
+	}
+
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
 			"%s: failed to free all reserved pages\n", __func__);
 }
@@ -258,6 +302,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 * @res: "host memory" address range
 * @ref: a live per-cpu reference count
 * @altmap: optional descriptor for allocating the memmap from @res
+ * @flags: either MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC
+ *	   see memremap() for a description of the flags
 *
 * Notes:
 * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -268,7 +314,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 * this is not enforced.
 */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap)
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
@@ -277,6 +324,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	struct page_map *page_map;
 	unsigned long pfn;
 	int error, nid;
+	void *addr = NULL;
 
 	if (is_ram == REGION_MIXED) {
 		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
@@ -344,10 +392,28 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	error = arch_add_memory(nid, align_start, align_size, true);
+	if (flags & MEMREMAP_WB || !flags) {
+		error = arch_add_memory(nid, align_start, align_size, true);
+		addr = __va(res->start);
+	} else {
+		page_map->flags |= PAGEMAP_IO_MEM;
+		error = add_zone_device_pages(nid, align_start, align_size);
+	}
+
 	if (error)
 		goto err_add_memory;
 
+	if (!addr && (flags & MEMREMAP_WT))
+		addr = ioremap_wt(res->start, resource_size(res));
+
+	if (!addr && (flags & MEMREMAP_WC))
+		addr = ioremap_wc(res->start, resource_size(res));
+
+	if (!addr && page_map->flags & PAGEMAP_IO_MEM) {
+		remove_zone_device_pages(res->start, resource_size(res));
+		goto err_add_memory;
+	}
+
 	for_each_device_pfn(pfn, page_map) {
 		struct page *page = pfn_to_page(pfn);
 
@@ -355,8 +421,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		list_force_poison(&page->lru);
 		page->pgmap = pgmap;
 	}
+
+	page_map->kaddr = addr;
 	devres_add(dev, page_map);
-	return __va(res->start);
+	return addr;
 
  err_add_memory:
  err_radix:
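[To make the new interface concrete, a hedged sketch of how a driver
might use the extended devm_memremap_pages() from this patch to back a
PCI BAR with struct pages and a write-combined kernel mapping. The
device, BAR index and refcount handling are placeholders:]

/*
 * Sketch against the API added above: MEMREMAP_WC sends
 * devm_memremap_pages() down the add_zone_device_pages() +
 * ioremap_wc() path instead of arch_add_memory().
 */
#include <linux/memremap.h>
#include <linux/pci.h>

static void *map_bar_with_pages(struct pci_dev *pdev,
                                struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[4]; /* e.g. an NVMe CMB BAR */

        return devm_memremap_pages(&pdev->dev, res, ref, NULL, MEMREMAP_WC);
}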