
[5/5] libnvdimm: add DMA support for pmem blk-mq

Message ID 150153988620.49768.12914164179718467335.stgit@djiang5-desk3.ch.intel.com (mailing list archive)
State New, archived

Commit Message

Dave Jiang July 31, 2017, 10:24 p.m. UTC
Add DMA support for pmem blk reads. This provides a significant CPU
reduction for large memory reads while maintaining good performance. DMA
is triggered by a test against bio_multiple_segments(), so small I/Os
(roughly 4k or less) are still performed by the CPU in order to reduce
latency. By default the pmem driver will use blk-mq with DMA.
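
With this patch the submission path in pmem_queue_rq() boils down to (see
the hunk at the end of this patch):

    if (cmd->chan && bio_multiple_segments(req->bio))
        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
    else
        rc = pmem_handle_cmd(cmd);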

Numbers below are measured against pmem simulated via DRAM using
memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
platform.  Keep in mind the performance for actual persistent memory
will differ.
Fio 2.21 was used.

64k: 1 task queuedepth=1
CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
CPU Write: 3552 MB/s  100% CPU     DMA Write: 2173 MB/s  54% CPU

64k: 16 tasks queuedepth=16
CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
CPU Write: 20900 MB/s 1589% CPU    DMA Write: 23400 MB/s  585% CPU

2M: 1 task queuedepth=1
CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU

2M: 16 tasks queuedepth=16
CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/nvdimm/pmem.c |  225 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 208 insertions(+), 17 deletions(-)

Comments

Johannes Thumshirn Aug. 1, 2017, 7:34 a.m. UTC | #1
Dave Jiang <dave.jiang@intel.com> writes:

> Adding DMA support for pmem blk reads. This provides signficant CPU
> reduction with large memory reads with good performance. DMAs are triggered
> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
> are still performed by the CPU in order to reduce latency. By default
> the pmem driver will be using blk-mq with DMA.
>
> Numbers below are measured against pmem simulated via DRAM using
> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> platform.  Keep in mind the performance for actual persistent memory
> will differ.
> Fio 2.21 was used.
>
> 64k: 1 task queuedepth=1
> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>
> 64k: 16 tasks queuedepth=16
> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>
> 2M: 1 task queuedepth=1
> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>
> 2M: 16 tasks queuedepth=16
> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---

Hi Dave,

The above table shows that there's a performance benefit for 2M
transfers but a regression for 64k transfers, if we forget about the CPU
utilization for a second. Would it be beneficial to have heuristics on
the transfer size that decide when to use dma and when not? You
introduced this hunk:

-   rc = pmem_handle_cmd(cmd);
+   if (cmd->chan && bio_multiple_segments(req->bio))
+      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
+   else
+       rc = pmem_handle_cmd(cmd);

This uses dma for bios with multiple segments, while single-segment bios
take the old path. Maybe the single/multi-segment logic can be amended to
something like:

    if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
       rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
    else
       rc = pmem_handle_cmd(cmd);

Just something worth considering IMHO.

> +	len = blk_rq_payload_bytes(req);
> +	page = virt_to_page(pmem_addr);
> +	off = (u64)pmem_addr & ~PAGE_MASK;

off = offset_in_page(pmem_addr); ?
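
(For reference, offset_in_page() in include/linux/mm.h is just

    #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)

so the two are equivalent; the helper is simply more readable.)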
Dave Jiang Aug. 1, 2017, 4:40 p.m. UTC | #2
On 08/01/2017 12:34 AM, Johannes Thumshirn wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
> 
>> Adding DMA support for pmem blk reads. This provides signficant CPU
>> reduction with large memory reads with good performance. DMAs are triggered
>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> are still performed by the CPU in order to reduce latency. By default
>> the pmem driver will be using blk-mq with DMA.
>>
>> Numbers below are measured against pmem simulated via DRAM using
>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> platform.  Keep in mind the performance for actual persistent memory
>> will differ.
>> Fio 2.21 was used.
>>
>> 64k: 1 task queuedepth=1
>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>
>> 64k: 16 tasks queuedepth=16
>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>
>> 2M: 1 task queuedepth=1
>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>
>> 2M: 16 tasks queuedepth=16
>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
> 
> Hi Dave,
> 
> The above table shows that there's a performance benefit for 2M
> transfers but a regression for 64k transfers, if we forget about the CPU
> utilization for a second. Would it be beneficial to have heuristics on
> the transfer size that decide when to use dma and when not? You
> introduced this hunk:
> 
> -   rc = pmem_handle_cmd(cmd);
> +   if (cmd->chan && bio_multiple_segments(req->bio))
> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> +   else
> +       rc = pmem_handle_cmd(cmd);
> 
> Which utilizes dma for bios with multiple segments and for single
> segment bios you use the old path, maybe the single/multi segment logic
> can be amended to have something like:
> 
>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>     else
>        rc = pmem_handle_cmd(cmd);

Yes, good idea. I will introduce a threshold parameter that's tunable.
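
Roughly something along these lines (untested sketch; the parameter name
and default are just placeholders):

    /* minimum number of bio segments before DMA is used */
    static int dma_thresh = 2;
    module_param(dma_thresh, int, 0644);
    MODULE_PARM_DESC(dma_thresh, "Minimum bio segments before using DMA");

and in pmem_queue_rq():

    if (cmd->chan && bio_segments(req->bio) > dma_thresh)
        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
    else
        rc = pmem_handle_cmd(cmd);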

> 
> Just something woth considering IMHO.
> 
>> +	len = blk_rq_payload_bytes(req);
>> +	page = virt_to_page(pmem_addr);
>> +	off = (u64)pmem_addr & ~PAGE_MASK;
> 
> off = offset_in_page(pmem_addr); ?

Thanks. Will fix.
Dan Williams Aug. 1, 2017, 5:43 p.m. UTC | #3
On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> Dave Jiang <dave.jiang@intel.com> writes:
>
>> Adding DMA support for pmem blk reads. This provides signficant CPU
>> reduction with large memory reads with good performance. DMAs are triggered
>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> are still performed by the CPU in order to reduce latency. By default
>> the pmem driver will be using blk-mq with DMA.
>>
>> Numbers below are measured against pmem simulated via DRAM using
>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> platform.  Keep in mind the performance for actual persistent memory
>> will differ.
>> Fio 2.21 was used.
>>
>> 64k: 1 task queuedepth=1
>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>
>> 64k: 16 tasks queuedepth=16
>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>
>> 2M: 1 task queuedepth=1
>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>
>> 2M: 16 tasks queuedepth=16
>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>
>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>
> Hi Dave,
>
> The above table shows that there's a performance benefit for 2M
> transfers but a regression for 64k transfers, if we forget about the CPU
> utilization for a second.

I don't think we can forget about cpu utilization, and I would expect
most users would value cpu over small bandwidth increases. Especially
when those numbers show efficiency like this (the CPU path relative to
the DMA path in the 64k, 16-task case):

+160% cpu for +26% read bandwidth
+171% cpu for -10% write bandwidth

> Would it be beneficial to have heuristics on
> the transfer size that decide when to use dma and when not? You
> introduced this hunk:
>
> -   rc = pmem_handle_cmd(cmd);
> +   if (cmd->chan && bio_multiple_segments(req->bio))
> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> +   else
> +       rc = pmem_handle_cmd(cmd);
>
> Which utilizes dma for bios with multiple segments and for single
> segment bios you use the old path, maybe the single/multi segment logic
> can be amended to have something like:
>
>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>     else
>        rc = pmem_handle_cmd(cmd);
>
> Just something woth considering IMHO.

The thing we want to avoid most is having the cpu stall doing nothing
while waiting for memory access, so I think once the efficiency of adding
more cpu goes non-linear it's better to let the dma engine handle that.
The current heuristic seems to achieve that, but the worry is whether it
over-selects the cpu for cases where requests could be merged into a
bulkier i/o that is more suitable for dma.

I'm not sure we want another tunable vs predictable / efficient
behavior by default.
Ross Zwisler Aug. 1, 2017, 8:42 p.m. UTC | #4
On Mon, Jul 31, 2017 at 03:24:46PM -0700, Dave Jiang wrote:
> Adding DMA support for pmem blk reads. This provides signficant CPU
> reduction with large memory reads with good performance. DMAs are triggered
> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
> are still performed by the CPU in order to reduce latency. By default
> the pmem driver will be using blk-mq with DMA.
> 
> Numbers below are measured against pmem simulated via DRAM using
> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> platform.  Keep in mind the performance for actual persistent memory
> will differ.
> Fio 2.21 was used.
> 
> 64k: 1 task queuedepth=1
> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
> 
> 64k: 16 tasks queuedepth=16
> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
> 
> 2M: 1 task queuedepth=1
> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
> 
> 2M: 16 tasks queuedepth=16
> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
> 
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> ---
<>
> +static void nd_pmem_dma_callback(void *data,
> +		const struct dmaengine_result *res)
> +{
> +	struct pmem_cmd *cmd = data;
> +	struct request *req = cmd->rq;
> +	struct request_queue *q = req->q;
> +	struct pmem_device *pmem = q->queuedata;
> +	struct nd_region *nd_region = to_region(pmem);
> +	struct device *dev = to_dev(pmem);
> +	int rc = 0;
> +
> +	dev_dbg(dev, "%s()\n", __func__);

Is this left-over debug, or did you mean to leave it in?

> +
> +	if (res) {
> +		enum dmaengine_tx_result dma_err = res->result;
> +
> +		switch (dma_err) {
> +		case DMA_TRANS_READ_FAILED:
> +		case DMA_TRANS_WRITE_FAILED:
> +		case DMA_TRANS_ABORTED:
> +			dev_dbg(dev, "bio failed\n");
> +			rc = -ENXIO;
> +			break;
> +		case DMA_TRANS_NOERROR:
> +		default:
> +			break;
> +		}
> +	}
> +
> +	if (req->cmd_flags & REQ_FUA)
> +		nvdimm_flush(nd_region);
> +
> +	dev_dbg(dev, "ending request\n");
> +	blk_mq_end_request(cmd->rq, rc);
> +}
> +
> +static int pmem_handle_cmd_dma(struct pmem_cmd *cmd, bool is_write)
> +{
> +	struct request *req = cmd->rq;
> +	struct request_queue *q = req->q;
> +	struct pmem_device *pmem = q->queuedata;
> +	struct device *dev = to_dev(pmem);
> +	phys_addr_t pmem_off = blk_rq_pos(req) * 512 + pmem->data_offset;
> +	void *pmem_addr = pmem->virt_addr + pmem_off;
> +	struct nd_region *nd_region = to_region(pmem);
> +	size_t len;
> +	struct dma_device *dma = cmd->chan->device;
> +	struct dmaengine_unmap_data *unmap;
> +	dma_cookie_t cookie;
> +	struct dma_async_tx_descriptor *txd;
> +	struct page *page;
> +	unsigned int off;
> +	int rc;
> +	enum dma_data_direction dir;
> +	dma_addr_t dma_addr;
> +
> +	if (req->cmd_flags & REQ_FLUSH)
> +		nvdimm_flush(nd_region);
> +
> +	unmap = dmaengine_get_unmap_data(dma->dev, 2, GFP_NOWAIT);
> +	if (!unmap) {
> +		dev_dbg(dev, "failed to get dma unmap data\n");
> +		rc = -ENOMEM;

The value of 'rc' isn't used at all in the error paths at the end of this
function.  Instead it ends the mq request with -ENXIO and returns -ENXIO
unconditionally.  That code should probably use 'rc' instead ... (continued in
next block)


> +		goto err;
> +	}
> +
> +	/*
> +	 * If reading from pmem, writing to scatterlist,
> +	 * and if writing to pmem, reading from scatterlist.
> +	 */
> +	dir = is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
> +	cmd->sg_nents = blk_rq_map_sg(req->q, req, cmd->sg);
> +	if (cmd->sg_nents < 1) {
> +		rc = -EINVAL;
> +		goto err;
> +	}
> +
> +	if (cmd->sg_nents > 128) {
> +		rc = -ENOMEM;
> +		dev_warn(dev, "Number of sg greater than allocated\n");
> +		goto err;
> +	}
> +
> +	rc = dma_map_sg(dma->dev, cmd->sg, cmd->sg_nents, dir);
> +	if (rc < 1) {
> +		rc = -ENXIO;
> +		goto err;
> +	}
> +
> +	len = blk_rq_payload_bytes(req);
> +	page = virt_to_page(pmem_addr);
> +	off = (u64)pmem_addr & ~PAGE_MASK;
> +	dir = is_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
> +	dma_addr = dma_map_page(dma->dev, page, off, len, dir);
> +	if (dma_mapping_error(dma->dev, unmap->addr[0])) {
> +		dev_dbg(dma->dev, "src DMA mapping error\n");
> +		goto err_unmap_sg;

... which means that these later gotos need to set 'rc'.  This applies to the
rest of the gotos in this function.
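
Something like this (sketch only): each failing branch would do

    rc = -ENXIO;
    goto err_unmap_buffer;

(or whichever label applies), and the exit path would become

err:
        blk_mq_end_request(cmd->rq, rc);
        return rc;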

> +	}
> +
> +	unmap->len = len;
> +
> +	if (is_write) {
> +		unmap->addr[0] = dma_addr;
> +		unmap->addr[1] = (dma_addr_t)cmd->sg;
> +		unmap->to_cnt = 1;
> +		unmap->from_cnt = 0;
> +		dma_unmap_data_sg_from_nents(unmap, 2) = cmd->sg_nents;
> +		txd = dma->device_prep_dma_memcpy_from_sg(cmd->chan, dma_addr,
> +				cmd->sg, cmd->sg_nents, DMA_PREP_INTERRUPT);
> +	} else {
> +		unmap->addr[0] = (dma_addr_t)cmd->sg;
> +		unmap->addr[1] = dma_addr;
> +		unmap->from_cnt = 1;
> +		unmap->to_cnt = 0;
> +		dma_unmap_data_sg_to_nents(unmap, 2) = cmd->sg_nents;
> +		txd = dma->device_prep_dma_memcpy_to_sg(cmd->chan, cmd->sg,
> +			cmd->sg_nents, dma_addr, DMA_PREP_INTERRUPT);
> +	}
> +
> +	if (!txd) {
> +		dev_dbg(dma->dev, "dma prep failed\n");
> +		goto err_unmap_buffer;
> +	}
> +
> +	txd->callback_result = nd_pmem_dma_callback;
> +	txd->callback_param = cmd;
> +	dma_set_unmap(txd, unmap);
> +	cookie = dmaengine_submit(txd);
> +	if (dma_submit_error(cookie)) {
> +		dev_dbg(dma->dev, "dma submit error\n");
> +		goto err_set_unmap;
> +	}
> +
> +	dmaengine_unmap_put(unmap);
> +	dma_async_issue_pending(cmd->chan);
> +
> +	return 0;
> +
> +err_set_unmap:
> +	dmaengine_unmap_put(unmap);
> +err_unmap_buffer:
> +	dma_unmap_page(dev, dma_addr, len, dir);
> +err_unmap_sg:
> +	if (dir == DMA_TO_DEVICE)
> +		dir = DMA_FROM_DEVICE;
> +	else
> +		dir = DMA_TO_DEVICE;
> +	dma_unmap_sg(dev, cmd->sg, cmd->sg_nents, dir);
> +	dmaengine_unmap_put(unmap);
> +err:
> +	blk_mq_end_request(cmd->rq, -ENXIO);
> +	return -ENXIO;
> +}
Johannes Thumshirn Aug. 3, 2017, 8:06 a.m. UTC | #5
On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > Dave Jiang <dave.jiang@intel.com> writes:
> >
> >> Adding DMA support for pmem blk reads. This provides signficant CPU
> >> reduction with large memory reads with good performance. DMAs are triggered
> >> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
> >> are still performed by the CPU in order to reduce latency. By default
> >> the pmem driver will be using blk-mq with DMA.
> >>
> >> Numbers below are measured against pmem simulated via DRAM using
> >> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> >> platform.  Keep in mind the performance for actual persistent memory
> >> will differ.
> >> Fio 2.21 was used.
> >>
> >> 64k: 1 task queuedepth=1
> >> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
> >> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
> >>
> >> 64k: 16 tasks queuedepth=16
> >> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
> >> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
> >>
> >> 2M: 1 task queuedepth=1
> >> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> >> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
> >>
> >> 2M: 16 tasks queuedepth=16
> >> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
> >> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
> >>
> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> >> ---
> >
> > Hi Dave,
> >
> > The above table shows that there's a performance benefit for 2M
> > transfers but a regression for 64k transfers, if we forget about the CPU
> > utilization for a second.
> 
> I don't think we can forget about cpu utilization, and I would expect
> most users would value cpu over small bandwidth increases. Especially
> when those numbers show efficiency like this
> 
> +160% cpu +26% read bandwidth
> +171% cpu -10% write bandwidth
> 
> > Would it be beneficial to have heuristics on
> > the transfer size that decide when to use dma and when not? You
> > introduced this hunk:
> >
> > -   rc = pmem_handle_cmd(cmd);
> > +   if (cmd->chan && bio_multiple_segments(req->bio))
> > +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
> > +   else
> > +       rc = pmem_handle_cmd(cmd);
> >
> > Which utilizes dma for bios with multiple segments and for single
> > segment bios you use the old path, maybe the single/multi segment logic
> > can be amended to have something like:
> >
> >     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
> >        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
> >     else
> >        rc = pmem_handle_cmd(cmd);
> >
> > Just something woth considering IMHO.
> 
> The thing we want to avoid most is having the cpu stall doing nothing
> waiting for memory access, so I think once the efficiency of adding
> more cpu goes non-linear it's better to let the dma engine handle
> that. The current heuristic seems to achieve that, but the worry is
> does it over-select the cpu for cases where requests could be merged
> into a bulkier i/o that is more suitable for dma.
> 
> I'm not sure we want another tunable vs predictable / efficient
> behavior by default.

Sorry for my late reply, but lots of automated performance regression CI tools
_will_ report errors because of the performance drop. Yes these may be false
positives, but it'll cost hours to resolve them. If we have a tunable it's way
easier to set it correctly for the use cases. And please keep in mind that
some people don't really care about CPU utilization, only maximum throughput.

Thanks,
	Johannes
Dan Williams Aug. 3, 2017, 3:41 p.m. UTC | #6
On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> > Dave Jiang <dave.jiang@intel.com> writes:
>> >
>> >> Adding DMA support for pmem blk reads. This provides signficant CPU
>> >> reduction with large memory reads with good performance. DMAs are triggered
>> >> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> >> are still performed by the CPU in order to reduce latency. By default
>> >> the pmem driver will be using blk-mq with DMA.
>> >>
>> >> Numbers below are measured against pmem simulated via DRAM using
>> >> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> >> platform.  Keep in mind the performance for actual persistent memory
>> >> will differ.
>> >> Fio 2.21 was used.
>> >>
>> >> 64k: 1 task queuedepth=1
>> >> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> >> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>> >>
>> >> 64k: 16 tasks queuedepth=16
>> >> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> >> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>> >>
>> >> 2M: 1 task queuedepth=1
>> >> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> >> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>> >>
>> >> 2M: 16 tasks queuedepth=16
>> >> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> >> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>> >>
>> >> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>> >> ---
>> >
>> > Hi Dave,
>> >
>> > The above table shows that there's a performance benefit for 2M
>> > transfers but a regression for 64k transfers, if we forget about the CPU
>> > utilization for a second.
>>
>> I don't think we can forget about cpu utilization, and I would expect
>> most users would value cpu over small bandwidth increases. Especially
>> when those numbers show efficiency like this
>>
>> +160% cpu +26% read bandwidth
>> +171% cpu -10% write bandwidth
>>
>> > Would it be beneficial to have heuristics on
>> > the transfer size that decide when to use dma and when not? You
>> > introduced this hunk:
>> >
>> > -   rc = pmem_handle_cmd(cmd);
>> > +   if (cmd->chan && bio_multiple_segments(req->bio))
>> > +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>> > +   else
>> > +       rc = pmem_handle_cmd(cmd);
>> >
>> > Which utilizes dma for bios with multiple segments and for single
>> > segment bios you use the old path, maybe the single/multi segment logic
>> > can be amended to have something like:
>> >
>> >     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>> >        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>> >     else
>> >        rc = pmem_handle_cmd(cmd);
>> >
>> > Just something woth considering IMHO.
>>
>> The thing we want to avoid most is having the cpu stall doing nothing
>> waiting for memory access, so I think once the efficiency of adding
>> more cpu goes non-linear it's better to let the dma engine handle
>> that. The current heuristic seems to achieve that, but the worry is
>> does it over-select the cpu for cases where requests could be merged
>> into a bulkier i/o that is more suitable for dma.
>>
>> I'm not sure we want another tunable vs predictable / efficient
>> behavior by default.
>
> Sorry for my late reply, but lots of automated performance regression CI tools
> _will_ report errors because of the performance drop. Yes these may be false
> positives, but it'll cost hours to resolve them. If we have a tunable it's way
> easier to set it correctly for the use cases. And please keep in mind some
> people don't really case about CPU utilization but maxiumum throughput.

Returning more cpu to the application may also increase bandwidth
because we stop spinning on completion and spend more time issuing
I/O. I'm also worried that this threshold cross-over point is workload
and pmem media specific. The tunable should be there as an override,
but I think we need simple/sane defaults out of the gate. A "no
regressions whatsoever on any metric" policy effectively kills this
effort as far as I can see. We otherwise need quite a bit more
qualification of different workloads; these three fio runs are not enough.
Dave Jiang Aug. 3, 2017, 4:12 p.m. UTC | #7
On 08/03/2017 08:41 AM, Dan Williams wrote:
> On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>>> Dave Jiang <dave.jiang@intel.com> writes:
>>>>
>>>>> Adding DMA support for pmem blk reads. This provides signficant CPU
>>>>> reduction with large memory reads with good performance. DMAs are triggered
>>>>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>>>>> are still performed by the CPU in order to reduce latency. By default
>>>>> the pmem driver will be using blk-mq with DMA.
>>>>>
>>>>> Numbers below are measured against pmem simulated via DRAM using
>>>>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>>>>> platform.  Keep in mind the performance for actual persistent memory
>>>>> will differ.
>>>>> Fio 2.21 was used.
>>>>>
>>>>> 64k: 1 task queuedepth=1
>>>>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>>>>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>>>>
>>>>> 64k: 16 tasks queuedepth=16
>>>>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>>>>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>>>>
>>>>> 2M: 1 task queuedepth=1
>>>>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>>>>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>>>>
>>>>> 2M: 16 tasks queuedepth=16
>>>>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>>>>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>>>>
>>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>>> ---
>>>>
>>>> Hi Dave,
>>>>
>>>> The above table shows that there's a performance benefit for 2M
>>>> transfers but a regression for 64k transfers, if we forget about the CPU
>>>> utilization for a second.
>>>
>>> I don't think we can forget about cpu utilization, and I would expect
>>> most users would value cpu over small bandwidth increases. Especially
>>> when those numbers show efficiency like this
>>>
>>> +160% cpu +26% read bandwidth
>>> +171% cpu -10% write bandwidth
>>>
>>>> Would it be beneficial to have heuristics on
>>>> the transfer size that decide when to use dma and when not? You
>>>> introduced this hunk:
>>>>
>>>> -   rc = pmem_handle_cmd(cmd);
>>>> +   if (cmd->chan && bio_multiple_segments(req->bio))
>>>> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>>>> +   else
>>>> +       rc = pmem_handle_cmd(cmd);
>>>>
>>>> Which utilizes dma for bios with multiple segments and for single
>>>> segment bios you use the old path, maybe the single/multi segment logic
>>>> can be amended to have something like:
>>>>
>>>>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>>>>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>>>>     else
>>>>        rc = pmem_handle_cmd(cmd);
>>>>
>>>> Just something woth considering IMHO.
>>>
>>> The thing we want to avoid most is having the cpu stall doing nothing
>>> waiting for memory access, so I think once the efficiency of adding
>>> more cpu goes non-linear it's better to let the dma engine handle
>>> that. The current heuristic seems to achieve that, but the worry is
>>> does it over-select the cpu for cases where requests could be merged
>>> into a bulkier i/o that is more suitable for dma.
>>>
>>> I'm not sure we want another tunable vs predictable / efficient
>>> behavior by default.
>>
>> Sorry for my late reply, but lots of automated performance regression CI tools
>> _will_ report errors because of the performance drop. Yes these may be false
>> positives, but it'll cost hours to resolve them. If we have a tunable it's way
>> easier to set it correctly for the use cases. And please keep in mind some
>> people don't really case about CPU utilization but maxiumum throughput.
> 
> Returning more cpu to the application may also increase bandwidth
> because we stop spinning on completion and spend more time issuing
> I/O. I'm also worried that this threshold cross-over point is workload
> and pmem media specific. The tunable should be there as an override,
> but I think we need simple/sane defaults out of the gate. A "no
> regressions whatsoever on any metric" policy effectively kills this
> effort as far as I can see. We otherwise need quite a bit more
> qualification of different workloads, these 3 fio runs is not enough.
> 

I can put in a tunable knob like Johannes suggested and leave its default
at the current behavior. We can adjust as necessary as we do more testing.
Dan Williams Aug. 3, 2017, 4:15 p.m. UTC | #8
On Thu, Aug 3, 2017 at 9:12 AM, Dave Jiang <dave.jiang@intel.com> wrote:
>
>
> On 08/03/2017 08:41 AM, Dan Williams wrote:
>> On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>>>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>>>>> Dave Jiang <dave.jiang@intel.com> writes:
>>>>>
>>>>>> Adding DMA support for pmem blk reads. This provides signficant CPU
>>>>>> reduction with large memory reads with good performance. DMAs are triggered
>>>>>> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>>>>>> are still performed by the CPU in order to reduce latency. By default
>>>>>> the pmem driver will be using blk-mq with DMA.
>>>>>>
>>>>>> Numbers below are measured against pmem simulated via DRAM using
>>>>>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>>>>>> platform.  Keep in mind the performance for actual persistent memory
>>>>>> will differ.
>>>>>> Fio 2.21 was used.
>>>>>>
>>>>>> 64k: 1 task queuedepth=1
>>>>>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>>>>>> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>>>>>>
>>>>>> 64k: 16 tasks queuedepth=16
>>>>>> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>>>>>> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>>>>>>
>>>>>> 2M: 1 task queuedepth=1
>>>>>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>>>>>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>>>>>
>>>>>> 2M: 16 tasks queuedepth=16
>>>>>> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>>>>>> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>>>>>>
>>>>>> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
>>>>>> ---
>>>>>
>>>>> Hi Dave,
>>>>>
>>>>> The above table shows that there's a performance benefit for 2M
>>>>> transfers but a regression for 64k transfers, if we forget about the CPU
>>>>> utilization for a second.
>>>>
>>>> I don't think we can forget about cpu utilization, and I would expect
>>>> most users would value cpu over small bandwidth increases. Especially
>>>> when those numbers show efficiency like this
>>>>
>>>> +160% cpu +26% read bandwidth
>>>> +171% cpu -10% write bandwidth
>>>>
>>>>> Would it be beneficial to have heuristics on
>>>>> the transfer size that decide when to use dma and when not? You
>>>>> introduced this hunk:
>>>>>
>>>>> -   rc = pmem_handle_cmd(cmd);
>>>>> +   if (cmd->chan && bio_multiple_segments(req->bio))
>>>>> +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>>>>> +   else
>>>>> +       rc = pmem_handle_cmd(cmd);
>>>>>
>>>>> Which utilizes dma for bios with multiple segments and for single
>>>>> segment bios you use the old path, maybe the single/multi segment logic
>>>>> can be amended to have something like:
>>>>>
>>>>>     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>>>>>        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>>>>>     else
>>>>>        rc = pmem_handle_cmd(cmd);
>>>>>
>>>>> Just something woth considering IMHO.
>>>>
>>>> The thing we want to avoid most is having the cpu stall doing nothing
>>>> waiting for memory access, so I think once the efficiency of adding
>>>> more cpu goes non-linear it's better to let the dma engine handle
>>>> that. The current heuristic seems to achieve that, but the worry is
>>>> does it over-select the cpu for cases where requests could be merged
>>>> into a bulkier i/o that is more suitable for dma.
>>>>
>>>> I'm not sure we want another tunable vs predictable / efficient
>>>> behavior by default.
>>>
>>> Sorry for my late reply, but lots of automated performance regression CI tools
>>> _will_ report errors because of the performance drop. Yes these may be false
>>> positives, but it'll cost hours to resolve them. If we have a tunable it's way
>>> easier to set it correctly for the use cases. And please keep in mind some
>>> people don't really case about CPU utilization but maxiumum throughput.
>>
>> Returning more cpu to the application may also increase bandwidth
>> because we stop spinning on completion and spend more time issuing
>> I/O. I'm also worried that this threshold cross-over point is workload
>> and pmem media specific. The tunable should be there as an override,
>> but I think we need simple/sane defaults out of the gate. A "no
>> regressions whatsoever on any metric" policy effectively kills this
>> effort as far as I can see. We otherwise need quite a bit more
>> qualification of different workloads, these 3 fio runs is not enough.
>>
>
> I can put in a tunable knob like Johannes suggested and leave the
> default for that to what it is currently. And we can adjust as necessary
> as we do more testing.

That's the problem: we need a multi-dimensional knob. It's not just
transfer size that affects total delivered bandwidth.
Johannes Thumshirn Aug. 4, 2017, 6:07 a.m. UTC | #9
On Thu, Aug 03, 2017 at 09:15:12AM -0700, Dan Williams wrote:
> > I can put in a tunable knob like Johannes suggested and leave the
> > default for that to what it is currently. And we can adjust as necessary
> > as we do more testing.
> 
> That's the problem, we need a multi-dimensional knob. It's not just
> transfer-size that effects total delivered bandwidth.

If Dan opposes it so much, I'm OK with no tunable. I'm just expressing my
concern that we'll get reports like this one [1] as a result.

Thanks,
	Johannes

[1] http://marc.info/?l=linux-kernel&m=150175029403913&w=2
Dan Williams Aug. 4, 2017, 3:47 p.m. UTC | #10
On Thu, Aug 3, 2017 at 11:07 PM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Thu, Aug 03, 2017 at 09:15:12AM -0700, Dan Williams wrote:
>> > I can put in a tunable knob like Johannes suggested and leave the
>> > default for that to what it is currently. And we can adjust as necessary
>> > as we do more testing.
>>
>> That's the problem, we need a multi-dimensional knob. It's not just
>> transfer-size that effects total delivered bandwidth.
>
> If Dan opposes it so much, I'm OK with no tunable. I'm just expressing my
> concern that we get reports like this [1] one here as a result.

Sorry if I came off harsh. In that report from Mel he's showing a slowdown
in a representative workload, not just a canned bandwidth benchmark. I'm
worried that tuning for a single dimension on an fio run doesn't actually
improve the throughput delivered to an application.

Patch

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 347835b..02ca9f0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -32,6 +32,8 @@ 
 #include <linux/dax.h>
 #include <linux/nd.h>
 #include <linux/blk-mq.h>
+#include <linux/dmaengine.h>
+#include <linux/dma-mapping.h>
 #include "pmem.h"
 #include "pfn.h"
 #include "nd.h"
@@ -41,12 +43,24 @@  enum {
 	PMEM_Q_MQ = 1,
 };
 
-static int queue_mode = PMEM_Q_BIO;
+static int queue_mode = PMEM_Q_MQ;
 module_param(queue_mode, int, 0444);
 MODULE_PARM_DESC(queue_mode, "Pmem Queue Mode");
 
+static int queue_depth = 128;
+module_param(queue_depth, int, 0444);
+MODULE_PARM_DESC(queue_depth, "I/O Queue Depth for multi queue mode");
+
+/* typically maps to number of DMA channels/devices per socket */
+static int q_per_node = 8;
+module_param(q_per_node, int, 0444);
+MODULE_PARM_DESC(q_per_node, "Hardware queues per node\n");
+
 struct pmem_cmd {
 	struct request *rq;
+	struct dma_chan *chan;
+	int sg_nents;
+	struct scatterlist sg[];
 };
 
 static struct device *to_dev(struct pmem_device *pmem)
@@ -276,6 +290,160 @@  static void pmem_release_disk(void *__pmem)
 	put_disk(pmem->disk);
 }
 
+static void nd_pmem_dma_callback(void *data,
+		const struct dmaengine_result *res)
+{
+	struct pmem_cmd *cmd = data;
+	struct request *req = cmd->rq;
+	struct request_queue *q = req->q;
+	struct pmem_device *pmem = q->queuedata;
+	struct nd_region *nd_region = to_region(pmem);
+	struct device *dev = to_dev(pmem);
+	int rc = 0;
+
+	dev_dbg(dev, "%s()\n", __func__);
+
+	if (res) {
+		enum dmaengine_tx_result dma_err = res->result;
+
+		switch (dma_err) {
+		case DMA_TRANS_READ_FAILED:
+		case DMA_TRANS_WRITE_FAILED:
+		case DMA_TRANS_ABORTED:
+			dev_dbg(dev, "bio failed\n");
+			rc = -ENXIO;
+			break;
+		case DMA_TRANS_NOERROR:
+		default:
+			break;
+		}
+	}
+
+	if (req->cmd_flags & REQ_FUA)
+		nvdimm_flush(nd_region);
+
+	dev_dbg(dev, "ending request\n");
+	blk_mq_end_request(cmd->rq, rc);
+}
+
+static int pmem_handle_cmd_dma(struct pmem_cmd *cmd, bool is_write)
+{
+	struct request *req = cmd->rq;
+	struct request_queue *q = req->q;
+	struct pmem_device *pmem = q->queuedata;
+	struct device *dev = to_dev(pmem);
+	phys_addr_t pmem_off = blk_rq_pos(req) * 512 + pmem->data_offset;
+	void *pmem_addr = pmem->virt_addr + pmem_off;
+	struct nd_region *nd_region = to_region(pmem);
+	size_t len;
+	struct dma_device *dma = cmd->chan->device;
+	struct dmaengine_unmap_data *unmap;
+	dma_cookie_t cookie;
+	struct dma_async_tx_descriptor *txd;
+	struct page *page;
+	unsigned int off;
+	int rc;
+	enum dma_data_direction dir;
+	dma_addr_t dma_addr;
+
+	if (req->cmd_flags & REQ_FLUSH)
+		nvdimm_flush(nd_region);
+
+	unmap = dmaengine_get_unmap_data(dma->dev, 2, GFP_NOWAIT);
+	if (!unmap) {
+		dev_dbg(dev, "failed to get dma unmap data\n");
+		rc = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * If reading from pmem, writing to scatterlist,
+	 * and if writing to pmem, reading from scatterlist.
+	 */
+	dir = is_write ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+	cmd->sg_nents = blk_rq_map_sg(req->q, req, cmd->sg);
+	if (cmd->sg_nents < 1) {
+		rc = -EINVAL;
+		goto err;
+	}
+
+	if (cmd->sg_nents > 128) {
+		rc = -ENOMEM;
+		dev_warn(dev, "Number of sg greater than allocated\n");
+		goto err;
+	}
+
+	rc = dma_map_sg(dma->dev, cmd->sg, cmd->sg_nents, dir);
+	if (rc < 1) {
+		rc = -ENXIO;
+		goto err;
+	}
+
+	len = blk_rq_payload_bytes(req);
+	page = virt_to_page(pmem_addr);
+	off = (u64)pmem_addr & ~PAGE_MASK;
+	dir = is_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
+	dma_addr = dma_map_page(dma->dev, page, off, len, dir);
+	if (dma_mapping_error(dma->dev, unmap->addr[0])) {
+		dev_dbg(dma->dev, "src DMA mapping error\n");
+		goto err_unmap_sg;
+	}
+
+	unmap->len = len;
+
+	if (is_write) {
+		unmap->addr[0] = dma_addr;
+		unmap->addr[1] = (dma_addr_t)cmd->sg;
+		unmap->to_cnt = 1;
+		unmap->from_cnt = 0;
+		dma_unmap_data_sg_from_nents(unmap, 2) = cmd->sg_nents;
+		txd = dma->device_prep_dma_memcpy_from_sg(cmd->chan, dma_addr,
+				cmd->sg, cmd->sg_nents, DMA_PREP_INTERRUPT);
+	} else {
+		unmap->addr[0] = (dma_addr_t)cmd->sg;
+		unmap->addr[1] = dma_addr;
+		unmap->from_cnt = 1;
+		unmap->to_cnt = 0;
+		dma_unmap_data_sg_to_nents(unmap, 2) = cmd->sg_nents;
+		txd = dma->device_prep_dma_memcpy_to_sg(cmd->chan, cmd->sg,
+			cmd->sg_nents, dma_addr, DMA_PREP_INTERRUPT);
+	}
+
+	if (!txd) {
+		dev_dbg(dma->dev, "dma prep failed\n");
+		goto err_unmap_buffer;
+	}
+
+	txd->callback_result = nd_pmem_dma_callback;
+	txd->callback_param = cmd;
+	dma_set_unmap(txd, unmap);
+	cookie = dmaengine_submit(txd);
+	if (dma_submit_error(cookie)) {
+		dev_dbg(dma->dev, "dma submit error\n");
+		goto err_set_unmap;
+	}
+
+	dmaengine_unmap_put(unmap);
+	dma_async_issue_pending(cmd->chan);
+
+	return 0;
+
+err_set_unmap:
+	dmaengine_unmap_put(unmap);
+err_unmap_buffer:
+	dma_unmap_page(dev, dma_addr, len, dir);
+err_unmap_sg:
+	if (dir == DMA_TO_DEVICE)
+		dir = DMA_FROM_DEVICE;
+	else
+		dir = DMA_TO_DEVICE;
+	dma_unmap_sg(dev, cmd->sg, cmd->sg_nents, dir);
+	dmaengine_unmap_put(unmap);
+err:
+	blk_mq_end_request(cmd->rq, -ENXIO);
+	return -ENXIO;
+}
+
 static int pmem_handle_cmd(struct pmem_cmd *cmd)
 {
 	struct request *req = cmd->rq;
@@ -309,13 +477,17 @@  static int pmem_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct pmem_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct request *req;
 	int rc;
 
-	cmd->rq = bd->rq;
-
-	blk_mq_start_request(bd->rq);
+	req = cmd->rq = bd->rq;
+	cmd->chan = dma_find_channel(DMA_MEMCPY);
+	blk_mq_start_request(req);
 
-	rc = pmem_handle_cmd(cmd);
+	if (cmd->chan && bio_multiple_segments(req->bio))
+		rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
+	else
+		rc = pmem_handle_cmd(cmd);
 	if (rc < 0)
 		rc = BLK_MQ_RQ_QUEUE_ERROR;
 	else
@@ -348,10 +520,10 @@  static int pmem_attach_disk(struct device *dev,
 	struct nd_pfn_sb *pfn_sb;
 	struct pmem_device *pmem;
 	struct resource pfn_res;
-	struct request_queue *q;
 	struct gendisk *disk;
 	void *addr;
 	int rc;
+	struct dma_chan *chan;
 
 	/* while nsio_rw_bytes is active, parse a pfn info block if present */
 	if (is_nd_pfn(dev)) {
@@ -381,11 +553,20 @@  static int pmem_attach_disk(struct device *dev,
 	}
 
 	if (queue_mode == PMEM_Q_MQ) {
+		chan = dma_find_channel(DMA_MEMCPY);
+		if (!chan) {
+			queue_mode = PMEM_Q_BIO;
+			dev_warn(dev, "Forced back to PMEM_Q_BIO, no DMA\n");
+		}
+	}
+
+	if (queue_mode == PMEM_Q_MQ) {
 		pmem->tag_set.ops = &pmem_mq_ops;
-		pmem->tag_set.nr_hw_queues = nr_online_nodes;
-		pmem->tag_set.queue_depth = 64;
+		pmem->tag_set.nr_hw_queues = nr_online_nodes * q_per_node;
+		pmem->tag_set.queue_depth = queue_depth;
 		pmem->tag_set.numa_node = dev_to_node(dev);
-		pmem->tag_set.cmd_size = sizeof(struct pmem_cmd);
+		pmem->tag_set.cmd_size = sizeof(struct pmem_cmd) +
+			sizeof(struct scatterlist) * 128;
 		pmem->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 		pmem->tag_set.driver_data = pmem;
 
@@ -446,14 +627,17 @@  static int pmem_attach_disk(struct device *dev,
 		return PTR_ERR(addr);
 	pmem->virt_addr = addr;
 
-	blk_queue_write_cache(q, true, true);
-	blk_queue_make_request(q, pmem_make_request);
-	blk_queue_physical_block_size(q, PAGE_SIZE);
-	blk_queue_max_hw_sectors(q, UINT_MAX);
-	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
-	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
-	queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
-	q->queuedata = pmem;
+	blk_queue_write_cache(pmem->q, true, true);
+	blk_queue_physical_block_size(pmem->q, PAGE_SIZE);
+	if (queue_mode == PMEM_Q_MQ) {
+		blk_queue_max_hw_sectors(pmem->q, 0x200000);
+		blk_queue_max_segments(pmem->q, 128);
+	} else
+		blk_queue_max_hw_sectors(pmem->q, UINT_MAX);
+	blk_queue_bounce_limit(pmem->q, BLK_BOUNCE_ANY);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->q);
+	queue_flag_set_unlocked(QUEUE_FLAG_DAX, pmem->q);
+	pmem->q->queuedata = pmem;
 
 	disk = alloc_disk_node(0, nid);
 	if (!disk)
@@ -588,15 +772,22 @@  static struct nd_device_driver nd_pmem_driver = {
 
 static int __init pmem_init(void)
 {
+	if (queue_mode == PMEM_Q_MQ)
+		dmaengine_get();
+
 	return nd_driver_register(&nd_pmem_driver);
 }
 module_init(pmem_init);
 
 static void pmem_exit(void)
 {
+	if (queue_mode == PMEM_Q_MQ)
+		dmaengine_put();
+
 	driver_unregister(&nd_pmem_driver.drv);
 }
 module_exit(pmem_exit);
 
+MODULE_SOFTDEP("pre: dmaengine");
 MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
 MODULE_LICENSE("GPL v2");