Message ID | 1457979277-26791-1-git-send-email-stephen.bates@pmcs.com (mailing list archive) |
---|---|
State | New, archived |
On Mon, Mar 14, 2016 at 12:14:37PM -0600, Stephen Bates wrote:
> 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> peer there is potential for coherency issues and for writes to occur out
> of order. This is something that users of this feature need to be
> cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
> this isn't much different than the existing situation with RDMA: if
> userspace sets up an MR for remote use, they need to be careful about
> using that memory region themselves.

There's more to the coherency problem than this. As I understand it, on
x86, memory in a PCI BAR does not participate in the coherency protocol.
So you can get a situation where CPU A stores 4 bytes to offset 8 in a
cacheline, then CPU B stores 4 bytes to offset 16 in the same cacheline,
and CPU A's write mysteriously goes missing.

I may have misunderstood the exact details when this was explained to me
a few years ago, but the details were horrible enough to run away
screaming. Pretending PCI BARs are real memory? Just Say No.
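[For illustration only, a minimal user-space sketch of the lost-write
scenario described above. The device path and offsets are made up, and
the hazard only arises if the BAR were mapped cacheable on hardware
where BAR memory does not participate in coherency; as noted in the
reply below, a write-combining mapping alone cannot produce it.]

/*
 * Hypothetical sketch: two threads store to different dwords of the
 * same 64-byte line in a PCI BAR. If the mapping were cacheable but
 * non-coherent, one store could vanish when the other CPU's copy of
 * the line is written back.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

static volatile uint32_t *bar;

static void *cpu_a(void *unused) { bar[2] = 0xaaaaaaaa; return NULL; } /* offset 8 */
static void *cpu_b(void *unused) { bar[4] = 0xbbbbbbbb; return NULL; } /* offset 16 */

int main(void)
{
        /* Illustrative path; sysfs resource0 mappings are normally uncached. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
        pthread_t a, b;

        bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        pthread_create(&a, NULL, cpu_a, NULL);
        pthread_create(&b, NULL, cpu_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* On a coherent mapping both values survive; the worry is that
         * on a non-coherent one they might not. */
        printf("offset 8: %x, offset 16: %x\n", bar[2], bar[4]);
        return 0;
}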
On Mon, Mar 14, 2016 at 05:23:44PM -0400, Matthew Wilcox wrote:
> On Mon, Mar 14, 2016 at 12:14:37PM -0600, Stephen Bates wrote:
> > 3. Coherency Issues. When IOMEM is written from both the CPU and a PCIe
> > peer there is potential for coherency issues and for writes to occur out
> > of order. This is something that users of this feature need to be
> > cognizant of and may necessitate the use of CONFIG_EXPERT. Though really,
> > this isn't much different than the existing situation with RDMA: if
> > userspace sets up an MR for remote use, they need to be careful about
> > using that memory region themselves.
>
> There's more to the coherency problem than this. As I understand it, on
> x86, memory in a PCI BAR does not participate in the coherency protocol.
> So you can get a situation where CPU A stores 4 bytes to offset 8 in a
> cacheline, then CPU B stores 4 bytes to offset 16 in the same cacheline,
> and CPU A's write mysteriously goes missing.

No, this cannot happen with write combining. You need full caching turned
on to get that kind of problem. Write combining can only combine writes;
it cannot make up writes that never existed.

That said, a question I don't know the answer to is how write
locking/memory barriers interact with the write combining CPU buffers,
and whether all the fencing semantics are guaranteed. There is some
interaction there (some drivers use write combining a lot).. but that
sure is a rarely used corner area...

The other issue is that the fencing mechanism RDMA uses to create
ordering with system memory is not good enough to fence peer-peer
transactions in the general case. It is only possibly good enough if
all the transactions run through the root complex.

> I may have misunderstood the exact details when this was explained to me
> a few years ago, but the details were horrible enough to run away
> screaming. Pretending PCI BARs are real memory? Just Say No.

Someone should probably explain in more detail what this is even good
for; DAX on PCI-E BAR memory seems goofy in the general case. I was
under the impression the main use case involved the CPU never touching
these memories and just using them to route-through to another IO device
(eg network). So all these discussions about CPU coherency seem a bit
strange.

Jason
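[For context on the barrier question raised above: the common driver
pattern pairs write-combined MMIO with an explicit fence before a
doorbell write. A hedged kernel-style sketch, with names and layout that
are illustrative rather than taken from any real driver:]

/*
 * Illustrative only: post descriptors through a write-combined BAR
 * mapping, then fence so any stores still sitting in the CPU's WC
 * buffers are flushed and ordered before the doorbell write.
 */
#include <linux/io.h>

static void post_descs(u32 __iomem *wc_buf, u32 __iomem *doorbell,
                       const u32 *desc, int n)
{
        int i;

        for (i = 0; i < n; i++)
                iowrite32(desc[i], wc_buf + i); /* stores may combine */

        wmb();                  /* drain WC buffers before the doorbell */
        iowrite32(1, doorbell); /* device may now fetch the descriptors */
}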
On 14/03/16 03:57 PM, Jason Gunthorpe wrote:
> Someone should probably explain in more detail what this is even good
> for; DAX on PCI-E BAR memory seems goofy in the general case. I was
> under the impression the main use case involved the CPU never touching
> these memories and just using them to route-through to another IO device
> (eg network). So all these discussions about CPU coherency seem a bit
> strange.

Yes, the primary purpose is to enable P2P transactions that don't
involve the CPU at all. To enable this, we do mmap the BAR region into
user space, which is then technically able to read/write to it using the
CPU. However, you're right, it is silly to write to the mmap'd PCI BAR
for anything but debug/testing purposes -- this type of access also has
horrible performance. Really, the mmapping is just a convenient way to
pass around the addresses with existing interfaces that expect system
RAM (RDMA, O_DIRECT).

Putting DAX on the PCI-E BAR is actually more of a curiosity at the
moment than anything. The current plan for NVMe with CMB would not
involve DAX. CMB buffers would be allocated, perhaps by mapping the
nvmeX char device, which could then be used with O_DIRECT access on a
file on the NVMe device and also be passed to RDMA devices. In this way
data could flow from the NVMe device to an RDMA network without using
system memory to buffer it.

Logan
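[A rough user-space sketch of the flow Logan outlines -- mmap a
CMB-backed char device and hand the mapping to the RDMA stack. The
device path, offset and length are hypothetical, and error handling is
trimmed for brevity:]

/*
 * Sketch only: /dev/nvme0 as a CMB provider is a placeholder. With
 * struct pages behind the BAR (the point of this patch set), the RDMA
 * core's get_user_pages() can pin this range like ordinary RAM.
 */
#include <fcntl.h>
#include <infiniband/verbs.h>
#include <sys/mman.h>

static void *map_cmb_for_rdma(struct ibv_pd *pd, size_t len,
                              struct ibv_mr **mr)
{
        int fd = open("/dev/nvme0", O_RDWR);    /* hypothetical CMB provider */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);

        *mr = ibv_reg_mr(pd, buf, len,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                         IBV_ACCESS_REMOTE_WRITE);
        return buf;
}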
> On 14/03/16 03:57 PM, Jason Gunthorpe wrote:
> > Someone should probably explain in more detail what this is even good
> > for; DAX on PCI-E BAR memory seems goofy in the general case. I was
> > under the impression the main use case involved the CPU never touching
> > these memories and just using them to route-through to another IO device
> > (eg network). So all these discussions about CPU coherency seem a bit
> > strange.
>
> Yes, the primary purpose is to enable P2P transactions that don't
> involve the CPU at all. To enable this, we do mmap the BAR region into
> user space, which is then technically able to read/write to it using the
> CPU. However, you're right, it is silly to write to the mmap'd PCI BAR
> for anything but debug/testing purposes -- this type of access also has
> horrible performance. Really, the mmapping is just a convenient way to
> pass around the addresses with existing interfaces that expect system
> RAM (RDMA, O_DIRECT).
>
> Putting DAX on the PCI-E BAR is actually more of a curiosity at the
> moment than anything. The current plan for NVMe with CMB would not
> involve DAX. CMB buffers would be allocated, perhaps by mapping the
> nvmeX char device, which could then be used with O_DIRECT access on a
> file on the NVMe device and also be passed to RDMA devices. In this way
> data could flow from the NVMe device to an RDMA network without using
> system memory to buffer it.
>
> Logan

The transfer of data between PCIe devices is the main use-case for this
proposed patch. Some other applications include direct transfer of data
between PCIe SSDs (for background data movement and replication) and the
movement of data between PCIe SSDs and GPGPUs/FPGAs, accelerating
applications that use those types of offload engines.

In some of our performance tests we have seen a significant reduction in
residual DRAM bandwidth when all the data involved in these transfers
has to pass through system memory buffers. This proposed patch avoids
that problem.

Stephen
On 3/14/2016 11:57 PM, Jason Gunthorpe wrote:
> The other issue is that the fencing mechanism RDMA uses to create
> ordering with system memory is not good enough to fence peer-peer
> transactions in the general case. It is only possibly good enough if
> all the transactions run through the root complex.

Are you sure this is a problem? I'm not sure it is clear in the PCIe
specs, but I thought that for transactions that are not relaxed-ordered
and don't use ID-based ordering, a PCIe switch must prevent reads and
writes from passing writes. I assume this is true even when the
requestor ID is different, because IDO relaxes these constraints
specifically for transactions coming from different requestor IDs.

Regards,
Haggai
On Thu, Mar 17, 2016 at 05:18:11PM +0200, Haggai Eran wrote:
> On 3/14/2016 11:57 PM, Jason Gunthorpe wrote:
> > The other issue is that the fencing mechanism RDMA uses to create
> > ordering with system memory is not good enough to fence peer-peer
> > transactions in the general case. It is only possibly good enough if
> > all the transactions run through the root complex.
>
> Are you sure this is a problem? I'm not sure it is clear in the PCIe
> specs, but I thought that for transactions that are not relaxed-ordered
> and don't use ID-based ordering, a PCIe switch must prevent reads and
> writes from passing writes.

Yes, that is right, and that is good enough if the PCI-E fabric is a
simple acyclic configuration (ie the common case).

There are fringe cases that are more complex, and maybe the correct
reading of the spec is to set up routing to avoid optimal paths, but it
certainly is possible to configure switches in a way that could not
guarantee global ordering.

Jason
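[As a side note on the ordering attributes Haggai mentions: a driver
that wants to rule out relaxed-ordered TLPs from its own endpoint can
clear the enable bit in Device Control via the standard PCI helpers.
A small sketch; whether this is sufficient for any given topology is
exactly the open question above:]

/*
 * Clear the Relaxed Ordering enable bit so the endpoint emits only
 * strongly-ordered requests. Illustrative; a real driver would do
 * this from its probe() routine.
 */
#include <linux/pci.h>

static void disable_relaxed_ordering(struct pci_dev *pdev)
{
        pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
                                   PCI_EXP_DEVCTL_RELAX_EN);
}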
> There are fringe cases that are more complex, and maybe the correct
> reading of the spec is to set up routing to avoid optimal paths, but it
> certainly is possible to configure switches in a way that could not
> guarantee global ordering.
>
> Jason

If someone configures a multipath PCIe topology, I think they will have
the potential for out-of-order RDMA reads/writes regardless of whether
the target MR is in system memory or PCIe memory. So I don't think these
crazy topologies are uniquely problematic for IOPMEM.

Stephen
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8d0b546..095f358 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -184,7 +184,7 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (pmem_should_map_pages(dev)) {
 		pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, res,
-				&q->q_usage_counter, NULL);
+				&q->q_usage_counter, NULL, ARCH_MEMREMAP_PMEM);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		pmem->virt_addr = (void __pmem *) devm_memremap(dev,
@@ -410,7 +410,7 @@ static int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
 	q = pmem->pmem_queue;
 	devm_memunmap(dev, (void __force *) pmem->virt_addr);
 	pmem->virt_addr = (void __pmem *) devm_memremap_pages(dev, &nsio->res,
-			&q->q_usage_counter, altmap);
+			&q->q_usage_counter, altmap, ARCH_MEMREMAP_PMEM);
 	pmem->pfn_flags |= PFN_MAP;
 	if (IS_ERR(pmem->virt_addr)) {
 		rc = PTR_ERR(pmem->virt_addr);
diff --git a/include/linux/io.h b/include/linux/io.h
index 32403b5..e2c8419 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -135,6 +135,7 @@ enum {
 	/* See memremap() kernel-doc for usage description... */
 	MEMREMAP_WB = 1 << 0,
 	MEMREMAP_WT = 1 << 1,
+	MEMREMAP_WC = 1 << 2,
 };
 
 void *memremap(resource_size_t offset, size_t size, unsigned long flags);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index bcaa634..ad5cf23 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -51,12 +51,13 @@ struct dev_pagemap {
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+		struct vmem_altmap *altmap, unsigned long flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b981a7b..a739286 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -41,7 +41,7 @@ static void *try_ram_remap(resource_size_t offset, size_t size)
  * memremap() - remap an iomem_resource as cacheable memory
  * @offset: iomem resource start address
  * @size: size of remap
- * @flags: either MEMREMAP_WB or MEMREMAP_WT
+ * @flags: either MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC
  *
  * memremap() is "ioremap" for cases where it is known that the resource
  * being mapped does not have i/o side effects and the __iomem
@@ -57,6 +57,9 @@ static void *try_ram_remap(resource_size_t offset, size_t size)
  * cache or are written through to memory and never exist in a
  * cache-dirty state with respect to program visibility. Attempts to
  * map "System RAM" with this mapping type will fail.
+ *
+ * MEMREMAP_WC - like MEMREMAP_WT but enables write combining.
+ *
  */
 void *memremap(resource_size_t offset, size_t size, unsigned long flags)
 {
@@ -101,6 +104,12 @@ void *memremap(resource_size_t offset, size_t size, unsigned long flags)
 		addr = ioremap_wt(offset, size);
 	}
 
+	if (!addr && (flags & MEMREMAP_WC)) {
+		flags &= ~MEMREMAP_WC;
+		addr = ioremap_wc(offset, size);
+	}
+
 	return addr;
 }
 EXPORT_SYMBOL(memremap);
@@ -164,13 +173,41 @@ static RADIX_TREE(pgmap_radix, GFP_KERNEL);
 #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
 #define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
+enum {
+	PAGEMAP_IO_MEM = 1 << 0,
+};
+
 struct page_map {
 	struct resource res;
 	struct percpu_ref *ref;
 	struct dev_pagemap pgmap;
 	struct vmem_altmap altmap;
+	void *kaddr;
+	int flags;
 };
 
+static int add_zone_device_pages(int nid, u64 start, u64 size)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	struct zone *zone = pgdat->node_zones + ZONE_DEVICE;
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	return __add_pages(nid, zone, start_pfn, nr_pages);
+}
+
+static void remove_zone_device_pages(u64 start, u64 size)
+{
+	unsigned long start_pfn = start >> PAGE_SHIFT;
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+	ret = __remove_pages(zone, start_pfn, nr_pages);
+	WARN_ON_ONCE(ret);
+}
+
 void get_zone_device_page(struct page *page)
 {
 	percpu_ref_get(page->pgmap->ref);
@@ -235,8 +272,15 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+
+	if (page_map->flags & PAGEMAP_IO_MEM) {
+		remove_zone_device_pages(align_start, align_size);
+		iounmap(page_map->kaddr);
+	} else {
+		arch_remove_memory(align_start, align_size);
+	}
+
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
 			"%s: failed to free all reserved pages\n", __func__);
 }
@@ -258,6 +302,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 * @res: "host memory" address range
 * @ref: a live per-cpu reference count
 * @altmap: optional descriptor for allocating the memmap from @res
+ * @flags: either MEMREMAP_WB, MEMREMAP_WT or MEMREMAP_WC
+ *	   see memremap() for a description of the flags
 *
 * Notes:
 * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -268,7 +314,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 * this is not enforced.
 */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap)
+		struct percpu_ref *ref, struct vmem_altmap *altmap,
+		unsigned long flags)
 {
 	int is_ram = region_intersects(res->start, resource_size(res),
 			"System RAM");
@@ -277,6 +324,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	struct page_map *page_map;
 	unsigned long pfn;
 	int error, nid;
+	void *addr = NULL;
 
 	if (is_ram == REGION_MIXED) {
 		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
@@ -344,10 +392,28 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	error = arch_add_memory(nid, align_start, align_size, true);
+	if (flags & MEMREMAP_WB || !flags) {
+		error = arch_add_memory(nid, align_start, align_size, true);
+		addr = __va(res->start);
+	} else {
+		page_map->flags |= PAGEMAP_IO_MEM;
+		error = add_zone_device_pages(nid, align_start, align_size);
+	}
+
 	if (error)
 		goto err_add_memory;
 
+	if (!addr && (flags & MEMREMAP_WT))
+		addr = ioremap_wt(res->start, resource_size(res));
+
+	if (!addr && (flags & MEMREMAP_WC))
+		addr = ioremap_wc(res->start, resource_size(res));
+
+	if (!addr && page_map->flags & PAGEMAP_IO_MEM) {
+		remove_zone_device_pages(res->start, resource_size(res));
+		goto err_add_memory;
+	}
+
 	for_each_device_pfn(pfn, page_map) {
 		struct page *page = pfn_to_page(pfn);
 
@@ -355,8 +421,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		list_force_poison(&page->lru);
 		page->pgmap = pgmap;
 	}
+
+	page_map->kaddr = addr;
 	devres_add(dev, page_map);
-	return __va(res->start);
+	return addr;
 
  err_add_memory:
  err_radix:
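[To make the new interface concrete, a hedged sketch of how a driver
might use the extended devm_memremap_pages() from this patch to back a
PCI BAR with struct pages and a write-combined kernel mapping. The
device, BAR index and refcount handling are placeholders:]

/*
 * Sketch against the API added above: MEMREMAP_WC sends
 * devm_memremap_pages() down the add_zone_device_pages() +
 * ioremap_wc() path instead of arch_add_memory().
 */
#include <linux/memremap.h>
#include <linux/pci.h>

static void *map_bar_with_pages(struct pci_dev *pdev,
                                struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[4]; /* e.g. an NVMe CMB BAR */

        return devm_memremap_pages(&pdev->dev, res, ref, NULL, MEMREMAP_WC);
}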