[v3,1/2] dma-mapping: add benchmark support for streaming DMA APIs

Message ID 20201102080646.2180-2-song.bao.hua@hisilicon.com (mailing list archive)
State New
Series dma-mapping: provide a benchmark for streaming DMA mapping

Commit Message

Song Bao Hua (Barry Song) Nov. 2, 2020, 8:06 a.m. UTC
Nowadays, there are increasing requirements to benchmark the performance
of dma_map and dma_unmap, particularly while the device is attached to an
IOMMU.

This patch enables the support. Users can run a specified number of threads
to do dma_map_page and dma_unmap_page on a specific NUMA node for the
specified duration. Then dma_map_benchmark will calculate the average
latency for map and unmap.
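
As a rough illustration of how the averages (and the standard deviation added in v2) can be derived from running sums, here is a minimal userspace sketch. It assumes each thread accumulates the sum and the sum of squares of its latencies in 100ns units, as map_benchmark.c in this patch does. The names isqrt_u64, latency_stats and summarize are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/* Integer square root: largest r such that r * r <= x. */
static uint64_t isqrt_u64(uint64_t x)
{
	uint64_t r = 0;
	uint64_t bit = (uint64_t)1 << 62;

	while (bit > x)
		bit >>= 2;
	while (bit != 0) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* Hypothetical result holder, both fields in units of 100ns. */
struct latency_stats {
	uint64_t avg;
	uint64_t stddev;
};

/*
 * Derive mean and standard deviation from accumulated totals:
 * variance = E[x^2] - (E[x])^2, using integer division throughout.
 * Returns -1 when no loop completed, so nothing can be reported.
 */
static int summarize(uint64_t sum, uint64_t sum_sq, uint64_t loops,
		     struct latency_stats *out)
{
	uint64_t mean_sq;

	if (loops == 0)
		return -1;

	out->avg = sum / loops;
	mean_sq = sum_sq / loops;
	out->stddev = isqrt_u64(mean_sq - out->avg * out->avg);
	return 0;
}
```

Keeping only two running totals per metric means the threads never need to store individual samples, at the cost of some rounding from the integer divisions.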

A difficulty for this benchmark is that the dma_map/unmap APIs must run on
a particular device. Each device might have a different backend, IOMMU or
non-IOMMU.

So we use driver_override to bind dma_map_benchmark to a particular
device by:
For platform devices:
echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
echo xxx > /sys/bus/platform/drivers/xxx/unbind
echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind

For PCI devices:
echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
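
The same three writes could also be scripted from a small C helper. This is only a sketch of the sysfs sequence shown above: the helper names driver_override_path, driver_action_path and write_sysfs are hypothetical, and the bus, device and driver names are whatever the user passes in.

```c
#include <stdio.h>

/* Build /sys/bus/<bus>/devices/<dev>/driver_override into buf. */
static int driver_override_path(char *buf, size_t len,
				const char *bus, const char *dev)
{
	return snprintf(buf, len, "/sys/bus/%s/devices/%s/driver_override",
			bus, dev);
}

/* Build /sys/bus/<bus>/drivers/<drv>/<action>, action is "bind" or "unbind". */
static int driver_action_path(char *buf, size_t len, const char *bus,
			      const char *drv, const char *action)
{
	return snprintf(buf, len, "/sys/bus/%s/drivers/%s/%s",
			bus, drv, action);
}

/* Write a string to a sysfs attribute; returns 0 on success, -1 on error. */
static int write_sysfs(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would write "dma_map_benchmark" to the override path, the device name to the old driver's unbind path, and the device name again to the dma_map_benchmark bind path, in that order.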

Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
-v3:
  * fix build issues reported by 0day kernel test robot
-v2:
  * add PCI support; v1 supported platform devices only
  * replace ssleep by msleep_interruptible() to permit users to exit
    benchmark before it is completed
  * many changes according to Robin's suggestions, thanks! Robin
    - add standard deviation output to reflect the worst case
    - check users' parameters strictly like the number of threads
    - make cache dirty before dma_map
    - fix unpaired dma_map_page and dma_unmap_single;
    - remove redundant "long long" before ktime_to_ns();
    - use devm_add_action()

 kernel/dma/Kconfig         |   8 +
 kernel/dma/Makefile        |   1 +
 kernel/dma/map_benchmark.c | 296 +++++++++++++++++++++++++++++++++++++
 3 files changed, 305 insertions(+)
 create mode 100644 kernel/dma/map_benchmark.c
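
For readers who want to see how a userspace caller might drive this interface, here is a minimal sketch. It assumes the struct layout, the ioctl number and the limits from map_benchmark.c in this patch, and the debugfs path from the Kconfig help text; it is not the selftest tool from patch 2/2. The helpers params_valid and run_benchmark are hypothetical names.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Copied from the patch, with fixed-width userspace types (assumption). */
struct map_benchmark {
	uint64_t avg_map_100ns;   /* average map latency in 100ns */
	uint64_t map_stddev;      /* standard deviation of map latency */
	uint64_t avg_unmap_100ns; /* average unmap latency in 100ns */
	uint64_t unmap_stddev;
	uint32_t threads;         /* threads doing map/unmap in parallel */
	uint32_t seconds;         /* how long the test will last */
	int node;                 /* NUMA node to run on */
	uint64_t expansion[10];   /* for future use */
};

#define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
#define DMA_MAP_MAX_THREADS	1024
#define DMA_MAP_MAX_SECONDS	300

/* Mirror the range checks the driver applies before running. */
static int params_valid(uint32_t threads, uint32_t seconds)
{
	return threads >= 1 && threads <= DMA_MAP_MAX_THREADS &&
	       seconds >= 1 && seconds <= DMA_MAP_MAX_SECONDS;
}

/* Run one benchmark; results are written back into *b. Returns -1 on error. */
static int run_benchmark(const char *path, struct map_benchmark *b)
{
	int fd, ret;

	if (!params_valid(b->threads, b->seconds))
		return -1;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, DMA_MAP_BENCHMARK, b);
	close(fd);
	return ret;
}
```

With a device bound to the driver, a call such as run_benchmark("/sys/kernel/debug/dma_map_benchmark", &b) with b.threads, b.seconds and b.node filled in would be expected to return the averages and standard deviations in the same struct.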

Comments

John Garry Nov. 2, 2020, 9:18 a.m. UTC | #1
On 02/11/2020 08:06, Barry Song wrote:
> Nowadays, there are increasing requirements to benchmark the performance
> of dma_map and dma_unmap particually while the device is attached to an
> IOMMU.
> 
> This patch enables the support. Users can run specified number of threads
> to do dma_map_page and dma_unmap_page on a specific NUMA node with the
> specified duration. Then dma_map_benchmark will calculate the average
> latency for map and unmap.
> 
> A difficulity for this benchmark is that dma_map/unmap APIs must run on
> a particular device. Each device might have different backend of IOMMU or
> non-IOMMU.
> 
> So we use the driver_override to bind dma_map_benchmark to a particual
> device by:
> For platform devices:
> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> 
> For PCI devices:
> echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
> echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
> -v3:
>    * fix build issues reported by 0day kernel test robot
> -v2:
>    * add PCI support; v1 supported platform devices only
>    * replace ssleep by msleep_interruptible() to permit users to exit
>      benchmark before it is completed
>    * many changes according to Robin's suggestions, thanks! Robin
>      - add standard deviation output to reflect the worst case
>      - check users' parameters strictly like the number of threads
>      - make cache dirty before dma_map
>      - fix unpaired dma_map_page and dma_unmap_single;
>      - remove redundant "long long" before ktime_to_ns();
>      - use devm_add_action()
> 
>   kernel/dma/Kconfig         |   8 +
>   kernel/dma/Makefile        |   1 +
>   kernel/dma/map_benchmark.c | 296 +++++++++++++++++++++++++++++++++++++
>   3 files changed, 305 insertions(+)
>   create mode 100644 kernel/dma/map_benchmark.c
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index c99de4a21458..949c53da5991 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
>   	  is technically out-of-spec.
>   
>   	  If unsure, say N.
> +
> +config DMA_MAP_BENCHMARK
> +	bool "Enable benchmarking of streaming DMA mapping"
> +	help
> +	  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> +	  performance of dma_(un)map_page.

Since this is a driver, is there any reason why it cannot be loadable? If
so, it seems any functionality would depend on DEBUG_FS, I figure that's
just how we work for debugfs.

Thanks,
John

> +
> +	  See tools/testing/selftests/dma/dma_map_benchmark.c
> diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> index dc755ab68aab..7aa6b26b1348 100644
> --- a/kernel/dma/Makefile
> +++ b/kernel/dma/Makefile
Song Bao Hua (Barry Song) Nov. 2, 2020, 9:37 a.m. UTC | #2
> -----Original Message-----
> From: John Garry
> Sent: Monday, November 2, 2020 10:19 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com;
> m.szyprowski@samsung.com
> Cc: linux-kselftest@vger.kernel.org; Shuah Khan <shuah@kernel.org>; Joerg
> Roedel <joro@8bytes.org>; Linuxarm <linuxarm@huawei.com>; xuwei (O)
> <xuwei5@huawei.com>; Will Deacon <will@kernel.org>
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 02/11/2020 08:06, Barry Song wrote:
> > Nowadays, there are increasing requirements to benchmark the performance
> > of dma_map and dma_unmap particually while the device is attached to an
> > IOMMU.
> >
> > This patch enables the support. Users can run specified number of threads
> > to do dma_map_page and dma_unmap_page on a specific NUMA node with the
> > specified duration. Then dma_map_benchmark will calculate the average
> > latency for map and unmap.
> >
> > A difficulity for this benchmark is that dma_map/unmap APIs must run on
> > a particular device. Each device might have different backend of IOMMU or
> > non-IOMMU.
> >
> > So we use the driver_override to bind dma_map_benchmark to a particual
> > device by:
> > For platform devices:
> > echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> > echo xxx > /sys/bus/platform/drivers/xxx/unbind
> > echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >
> > For PCI devices:
> > echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
> > echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> > echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> >
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> > -v3:
> >    * fix build issues reported by 0day kernel test robot
> > -v2:
> >    * add PCI support; v1 supported platform devices only
> >    * replace ssleep by msleep_interruptible() to permit users to exit
> >      benchmark before it is completed
> >    * many changes according to Robin's suggestions, thanks! Robin
> >      - add standard deviation output to reflect the worst case
> >      - check users' parameters strictly like the number of threads
> >      - make cache dirty before dma_map
> >      - fix unpaired dma_map_page and dma_unmap_single;
> >      - remove redundant "long long" before ktime_to_ns();
> >      - use devm_add_action()
> >
> >   kernel/dma/Kconfig         |   8 +
> >   kernel/dma/Makefile        |   1 +
> >   kernel/dma/map_benchmark.c | 296 +++++++++++++++++++++++++++++++++++++
> >   3 files changed, 305 insertions(+)
> >   create mode 100644 kernel/dma/map_benchmark.c
> >
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index c99de4a21458..949c53da5991 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
> >   	  is technically out-of-spec.
> >
> >   	  If unsure, say N.
> > +
> > +config DMA_MAP_BENCHMARK
> > +	bool "Enable benchmarking of streaming DMA mapping"
> > +	help
> > +	  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> > +	  performance of dma_(un)map_page.
> 
> Since this is a driver, any reason for which it cannot be loadable? If
> so, it seems any functionality would depend on DEBUG FS, I figure that's
> just how we work for debugfs.

We depend on kthread_bind_mask(), which isn't an exported symbol.
Maybe it is worth sending a patch to export it?

> 
> Thanks,
> John
> 
> > +
> > +	  See tools/testing/selftests/dma/dma_map_benchmark.c
> > diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> > index dc755ab68aab..7aa6b26b1348 100644
> > --- a/kernel/dma/Makefile
> > +++ b/kernel/dma/Makefile

Thanks
Barry
Song Bao Hua (Barry Song) Nov. 10, 2020, 8:10 a.m. UTC | #3
Hello Robin, Christoph,
Any further comments? John suggested that "depends on DEBUG_FS" should be added in Kconfig.
I am collecting more comments so that I can send v4 together with a fix for this minor issue :-)

Thanks
Barry

> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Monday, November 2, 2020 9:07 PM
> To: iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com;
> m.szyprowski@samsung.com
> Cc: Linuxarm <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org; xuwei
> (O) <xuwei5@huawei.com>; Song Bao Hua (Barry Song)
> <song.bao.hua@hisilicon.com>; Joerg Roedel <joro@8bytes.org>; Will Deacon
> <will@kernel.org>; Shuah Khan <shuah@kernel.org>
> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming
> DMA APIs
> 
> Nowadays, there are increasing requirements to benchmark the performance
> of dma_map and dma_unmap particually while the device is attached to an
> IOMMU.
> 
> This patch enables the support. Users can run specified number of threads to
> do dma_map_page and dma_unmap_page on a specific NUMA node with the
> specified duration. Then dma_map_benchmark will calculate the average
> latency for map and unmap.
> 
> A difficulity for this benchmark is that dma_map/unmap APIs must run on a
> particular device. Each device might have different backend of IOMMU or
> non-IOMMU.
> 
> So we use the driver_override to bind dma_map_benchmark to a particual
> device by:
> For platform devices:
> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> 
> For PCI devices:
> echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
> echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
> -v3:
>   * fix build issues reported by 0day kernel test robot
> -v2:
>   * add PCI support; v1 supported platform devices only
>   * replace ssleep by msleep_interruptible() to permit users to exit
>     benchmark before it is completed
>   * many changes according to Robin's suggestions, thanks! Robin
>     - add standard deviation output to reflect the worst case
>     - check users' parameters strictly like the number of threads
>     - make cache dirty before dma_map
>     - fix unpaired dma_map_page and dma_unmap_single;
>     - remove redundant "long long" before ktime_to_ns();
>     - use devm_add_action()
> 
>  kernel/dma/Kconfig         |   8 +
>  kernel/dma/Makefile        |   1 +
>  kernel/dma/map_benchmark.c | 296 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 305 insertions(+)
>  create mode 100644 kernel/dma/map_benchmark.c
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index c99de4a21458..949c53da5991 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -225,3 +225,11 @@ config DMA_API_DEBUG_SG
>  	  is technically out-of-spec.
> 
>  	  If unsure, say N.
> +
> +config DMA_MAP_BENCHMARK
> +	bool "Enable benchmarking of streaming DMA mapping"
> +	help
> +	  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> +	  performance of dma_(un)map_page.
> +
> +	  See tools/testing/selftests/dma/dma_map_benchmark.c
> diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
> index dc755ab68aab..7aa6b26b1348 100644
> --- a/kernel/dma/Makefile
> +++ b/kernel/dma/Makefile
> @@ -10,3 +10,4 @@ obj-$(CONFIG_DMA_API_DEBUG)		+= debug.o
>  obj-$(CONFIG_SWIOTLB)			+= swiotlb.o
>  obj-$(CONFIG_DMA_COHERENT_POOL)		+= pool.o
>  obj-$(CONFIG_DMA_REMAP)			+= remap.o
> +obj-$(CONFIG_DMA_MAP_BENCHMARK)		+= map_benchmark.o
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> new file mode 100644
> index 000000000000..dc4e5ff48a2d
> --- /dev/null
> +++ b/kernel/dma/map_benchmark.c
> @@ -0,0 +1,296 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Hisilicon Limited.
> + */
> +
> +#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
> +
> +#include <linux/debugfs.h>
> +#include <linux/delay.h>
> +#include <linux/device.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/kernel.h>
> +#include <linux/kthread.h>
> +#include <linux/math64.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +#include <linux/timekeeping.h>
> +
> +#define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
> +#define DMA_MAP_MAX_THREADS	1024
> +#define DMA_MAP_MAX_SECONDS	300
> +
> +struct map_benchmark {
> +	__u64 avg_map_100ns; /* average map latency in 100ns */
> +	__u64 map_stddev; /* standard deviation of map latency */
> +	__u64 avg_unmap_100ns; /* as above */
> +	__u64 unmap_stddev;
> +	__u32 threads; /* how many threads will do map/unmap in parallel */
> +	__u32 seconds; /* how long the test will last */
> +	int node; /* which numa node this benchmark will run on */
> +	__u64 expansion[10];	/* For future use */
> +};
> +
> +struct map_benchmark_data {
> +	struct map_benchmark bparam;
> +	struct device *dev;
> +	struct dentry  *debugfs;
> +	atomic64_t sum_map_100ns;
> +	atomic64_t sum_unmap_100ns;
> +	atomic64_t sum_square_map;
> +	atomic64_t sum_square_unmap;
> +	atomic64_t loops;
> +};
> +
> +static int map_benchmark_thread(void *data)
> +{
> +	void *buf;
> +	dma_addr_t dma_addr;
> +	struct map_benchmark_data *map = data;
> +	int ret = 0;
> +
> +	buf = (void *)__get_free_page(GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	while (!kthread_should_stop())  {
> +		__u64 map_100ns, unmap_100ns, map_square, unmap_square;
> +		ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
> +
> +		/*
> +		 * for a non-coherent device, if we don't stain them in the cache,
> +		 * this will give an underestimate of the real-world overhead of
> +		 * BIDIRECTIONAL or TO_DEVICE mappings
> +		 * 66 means evertything goes well! 66 is lucky.
> +		 */
> +		memset(buf, 0x66, PAGE_SIZE);
> +
> +		map_stime = ktime_get();
> +		dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
> +		if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> +			pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		map_etime = ktime_get();
> +
> +		unmap_stime = ktime_get();
> +		dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
> +		unmap_etime = ktime_get();
> +
> +		/* calculate sum and sum of squares */
> +		map_100ns = div64_ul(ktime_to_ns(ktime_sub(map_etime, map_stime)), 100);
> +		unmap_100ns = div64_ul(ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)), 100);
> +		map_square = map_100ns * map_100ns;
> +		unmap_square = unmap_100ns * unmap_100ns;
> +
> +		atomic64_add(map_100ns, &map->sum_map_100ns);
> +		atomic64_add(unmap_100ns, &map->sum_unmap_100ns);
> +		atomic64_add(map_square, &map->sum_square_map);
> +		atomic64_add(unmap_square, &map->sum_square_unmap);
> +		atomic64_inc(&map->loops);
> +	}
> +
> +out:
> +	free_page((unsigned long)buf);
> +	return ret;
> +}
> +
> +static int do_map_benchmark(struct map_benchmark_data *map)
> +{
> +	struct task_struct **tsk;
> +	int threads = map->bparam.threads;
> +	int node = map->bparam.node;
> +	const cpumask_t *cpu_mask = cpumask_of_node(node);
> +	__u64 loops;
> +	int ret = 0;
> +	int i;
> +
> +	tsk = kmalloc_array(threads, sizeof(tsk), GFP_KERNEL);
> +	if (!tsk)
> +		return -ENOMEM;
> +
> +	get_device(map->dev);
> +
> +	for (i = 0; i < threads; i++) {
> +		tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
> +				map->bparam.node, "dma-map-benchmark/%d", i);
> +		if (IS_ERR(tsk[i])) {
> +			pr_err("create dma_map thread failed\n");
> +			ret = PTR_ERR(tsk[i]);
> +			goto out;
> +		}
> +
> +		if (node != NUMA_NO_NODE && node_online(node))
> +			kthread_bind_mask(tsk[i], cpu_mask);
> +	}
> +
> +	/* clear the old value in the previous benchmark */
> +	atomic64_set(&map->sum_map_100ns, 0);
> +	atomic64_set(&map->sum_unmap_100ns, 0);
> +	atomic64_set(&map->sum_square_map, 0);
> +	atomic64_set(&map->sum_square_unmap, 0);
> +	atomic64_set(&map->loops, 0);
> +
> +	for (i = 0; i < threads; i++)
> +		wake_up_process(tsk[i]);
> +
> +	msleep_interruptible(map->bparam.seconds * 1000);
> +
> +	/* wait for the completion of benchmark threads */
> +	for (i = 0; i < threads; i++) {
> +		ret = kthread_stop(tsk[i]);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	loops = atomic64_read(&map->loops);
> +	if (likely(loops > 0)) {
> +		__u64 map_variance, unmap_variance;
> +
> +		/* average latency */
> +		map->bparam.avg_map_100ns = div64_u64(atomic64_read(&map->sum_map_100ns), loops);
> +		map->bparam.avg_unmap_100ns = div64_u64(atomic64_read(&map->sum_unmap_100ns), loops);
> +
> +		/* standard deviation of latency */
> +		map_variance = div64_u64(atomic64_read(&map->sum_square_map), loops) -
> +			map->bparam.avg_map_100ns * map->bparam.avg_map_100ns;
> +		unmap_variance = div64_u64(atomic64_read(&map->sum_square_unmap), loops) -
> +			map->bparam.avg_unmap_100ns * map->bparam.avg_unmap_100ns;
> +		map->bparam.map_stddev = int_sqrt64(map_variance);
> +		map->bparam.unmap_stddev = int_sqrt64(unmap_variance);
> +	}
> +
> +out:
> +	put_device(map->dev);
> +	kfree(tsk);
> +	return ret;
> +}
> +
> +static long map_benchmark_ioctl(struct file *filep, unsigned int cmd,
> +		unsigned long arg)
> +{
> +	struct map_benchmark_data *map = filep->private_data;
> +	int ret;
> +
> +	if (copy_from_user(&map->bparam, (void __user *)arg, sizeof(map->bparam)))
> +		return -EFAULT;
> +
> +	switch (cmd) {
> +	case DMA_MAP_BENCHMARK:
> +		if (map->bparam.threads == 0 || map->bparam.threads > DMA_MAP_MAX_THREADS) {
> +			pr_err("invalid thread number\n");
> +			return -EINVAL;
> +		}
> +		if (map->bparam.seconds == 0 || map->bparam.seconds > DMA_MAP_MAX_SECONDS) {
> +			pr_err("invalid duration seconds\n");
> +			return -EINVAL;
> +		}
> +
> +		ret = do_map_benchmark(map);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (copy_to_user((void __user *)arg, &map->bparam, sizeof(map->bparam)))
> +		return -EFAULT;
> +
> +	return ret;
> +}
> +
> +static const struct file_operations map_benchmark_fops = {
> +	.open = simple_open,
> +	.unlocked_ioctl = map_benchmark_ioctl,
> +};
> +
> +static void map_benchmark_remove_debugfs(void *data)
> +{
> +	struct map_benchmark_data *map = (struct map_benchmark_data *)data;
> +
> +	debugfs_remove(map->debugfs);
> +}
> +
> +static int __map_benchmark_probe(struct device *dev)
> +{
> +	struct dentry *entry;
> +	struct map_benchmark_data *map;
> +	int ret;
> +
> +	map = devm_kzalloc(dev, sizeof(*map), GFP_KERNEL);
> +	if (!map)
> +		return -ENOMEM;
> +	map->dev = dev;
> +
> +	ret = devm_add_action(dev, map_benchmark_remove_debugfs, map);
> +	if (ret) {
> +		pr_err("Can't add debugfs remove action\n");
> +		return ret;
> +	}
> +
> +	/*
> +	 * we only permit a device bound with this driver, 2nd probe
> +	 * will fail
> +	 */
> +	entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map,
> +			&map_benchmark_fops);
> +	if (IS_ERR(entry))
> +		return PTR_ERR(entry);
> +	map->debugfs = entry;
> +
> +	return 0;
> +}
> +
> +static int map_benchmark_platform_probe(struct platform_device *pdev)
> +{
> +	return __map_benchmark_probe(&pdev->dev);
> +}
> +
> +static struct platform_driver map_benchmark_platform_driver = {
> +	.driver		= {
> +		.name	= "dma_map_benchmark",
> +	},
> +	.probe = map_benchmark_platform_probe,
> +};
> +
> +static int map_benchmark_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	return __map_benchmark_probe(&pdev->dev);
> +}
> +
> +static struct pci_driver map_benchmark_pci_driver = {
> +	.name	= "dma_map_benchmark",
> +	.probe	= map_benchmark_pci_probe,
> +};
> +
> +static int __init map_benchmark_init(void)
> +{
> +	int ret;
> +
> +	ret = pci_register_driver(&map_benchmark_pci_driver);
> +	if (ret)
> +		return ret;
> +
> +	ret = platform_driver_register(&map_benchmark_platform_driver);
> +	if (ret) {
> +		pci_unregister_driver(&map_benchmark_pci_driver);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void __exit map_benchmark_cleanup(void)
> +{
> +	platform_driver_unregister(&map_benchmark_platform_driver);
> +	pci_unregister_driver(&map_benchmark_pci_driver);
> +}
> +
> +module_init(map_benchmark_init);
> +module_exit(map_benchmark_cleanup);
> +
> +MODULE_AUTHOR("Barry Song <song.bao.hua@hisilicon.com>");
> +MODULE_DESCRIPTION("dma_map benchmark driver");
> +MODULE_LICENSE("GPL");
> --
> 2.25.1
John Garry Nov. 10, 2020, 8:38 a.m. UTC | #4
On 10/11/2020 08:10, Song Bao Hua (Barry Song) wrote:
> Hello Robin, Christoph,
> Any further comment? John suggested that "depends on DEBUG_FS" should be added in Kconfig.
> I am collecting more comments to send v4 together with fixing this minor issue :-)
> 
> Thanks
> Barry
> 
>> -----Original Message-----
>> From: Song Bao Hua (Barry Song)
>> Sent: Monday, November 2, 2020 9:07 PM
>> To: iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com;
>> m.szyprowski@samsung.com
>> Cc: Linuxarm <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org; xuwei
>> (O) <xuwei5@huawei.com>; Song Bao Hua (Barry Song)
>> <song.bao.hua@hisilicon.com>; Joerg Roedel <joro@8bytes.org>; Will Deacon
>> <will@kernel.org>; Shuah Khan <shuah@kernel.org>
>> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for streaming
>> DMA APIs
>>
>> Nowadays, there are increasing requirements to benchmark the performance
>> of dma_map and dma_unmap particually while the device is attached to an
>> IOMMU.
>>
>> This patch enables the support. Users can run specified number of threads to
>> do dma_map_page and dma_unmap_page on a specific NUMA node with the
>> specified duration. Then dma_map_benchmark will calculate the average
>> latency for map and unmap.
>>
>> A difficulity for this benchmark is that dma_map/unmap APIs must run on a
>> particular device. Each device might have different backend of IOMMU or
>> non-IOMMU.
>>
>> So we use the driver_override to bind dma_map_benchmark to a particual
>> device by:
>> For platform devices:
>> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
>> echo xxx > /sys/bus/platform/drivers/xxx/unbind
>> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
>>

Hi Barry,

>> For PCI devices:
>> echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
>> echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
>> echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind

Do we need to check if the device to which we attach actually has DMA 
mapping capability?

>>
>> Cc: Joerg Roedel <joro@8bytes.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>> Cc: Robin Murphy <robin.murphy@arm.com>
>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> ---

Thanks,
John
Song Bao Hua (Barry Song) Nov. 11, 2020, 1:29 a.m. UTC | #5
> -----Original Message-----
> From: John Garry
> Sent: Tuesday, November 10, 2020 9:39 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com;
> m.szyprowski@samsung.com
> Cc: linux-kselftest@vger.kernel.org; Will Deacon <will@kernel.org>; Joerg
> Roedel <joro@8bytes.org>; Linuxarm <linuxarm@huawei.com>; xuwei (O)
> <xuwei5@huawei.com>; Shuah Khan <shuah@kernel.org>
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 10/11/2020 08:10, Song Bao Hua (Barry Song) wrote:
> > Hello Robin, Christoph,
> > Any further comment? John suggested that "depends on DEBUG_FS" should be added in Kconfig.
> > I am collecting more comments to send v4 together with fixing this minor issue :-)
> >
> > Thanks
> > Barry
> >
> >> -----Original Message-----
> >> From: Song Bao Hua (Barry Song)
> >> Sent: Monday, November 2, 2020 9:07 PM
> >> To: iommu@lists.linux-foundation.org; hch@lst.de;
> robin.murphy@arm.com;
> >> m.szyprowski@samsung.com
> >> Cc: Linuxarm <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org;
> xuwei
> >> (O) <xuwei5@huawei.com>; Song Bao Hua (Barry Song)
> >> <song.bao.hua@hisilicon.com>; Joerg Roedel <joro@8bytes.org>; Will
> Deacon
> >> <will@kernel.org>; Shuah Khan <shuah@kernel.org>
> >> Subject: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming
> >> DMA APIs
> >>
> >> Nowadays, there are increasing requirements to benchmark the performance
> >> of dma_map and dma_unmap particually while the device is attached to an
> >> IOMMU.
> >>
> >> This patch enables the support. Users can run specified number of threads to
> >> do dma_map_page and dma_unmap_page on a specific NUMA node with the
> >> specified duration. Then dma_map_benchmark will calculate the average
> >> latency for map and unmap.
> >>
> >> A difficulity for this benchmark is that dma_map/unmap APIs must run on a
> >> particular device. Each device might have different backend of IOMMU or
> >> non-IOMMU.
> >>
> >> So we use the driver_override to bind dma_map_benchmark to a particual
> >> device by:
> >> For platform devices:
> >> echo dma_map_benchmark > /sys/bus/platform/devices/xxx/driver_override
> >> echo xxx > /sys/bus/platform/drivers/xxx/unbind
> >> echo xxx > /sys/bus/platform/drivers/dma_map_benchmark/bind
> >>
> 
> Hi Barry,
> 
> >> For PCI devices:
> >> echo dma_map_benchmark > /sys/bus/pci/devices/0000:00:01.0/driver_override
> >> echo 0000:00:01.0 > /sys/bus/pci/drivers/xxx/unbind
> >> echo 0000:00:01.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
> 
> Do we need to check if the device to which we attach actually has DMA
> mapping capability?

Hello John,

I'd like to think checking this here would be overdesign. We just give users the
freedom to bind any device they care about to the benchmark driver. Usually
that means a real hardware device, either behind an IOMMU or using a direct
mapping.

If for any reason users bind a wrong "device", that is the users' choice. Anyhow,
the code below will still handle it properly and users will get a report in which
everything is zero.

+static int map_benchmark_thread(void *data)
+{
...
+		dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
+		if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+			pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
+			ret = -ENOMEM;
+			goto out;
+		}
...
+}

> 
> >>
> >> Cc: Joerg Roedel <joro@8bytes.org>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Shuah Khan <shuah@kernel.org>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> >> Cc: Robin Murphy <robin.murphy@arm.com>
> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> >> ---
> 
> Thanks,
> John

Thanks
Barry
Song Bao Hua (Barry Song) Nov. 11, 2020, 9:42 a.m. UTC | #6
> -----Original Message-----
> From: John Garry
> Sent: Wednesday, November 11, 2020 10:37 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com;
> m.szyprowski@samsung.com
> Cc: linux-kselftest@vger.kernel.org; Will Deacon <will@kernel.org>; Joerg
> Roedel <joro@8bytes.org>; Linuxarm <linuxarm@huawei.com>; xuwei (O)
> <xuwei5@huawei.com>; Shuah Khan <shuah@kernel.org>
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On 11/11/2020 01:29, Song Bao Hua (Barry Song) wrote:
> > I'd like to think checking this here would be overdesign. We just give users the
> > freedom to bind any device they care about to the benchmark driver. Usually
> > that means a real hardware either behind an IOMMU or through a direct
> > mapping.
> >
> > if for any reason users put a wrong "device", that is the choice of users.
> 
> Right, but if the device simply has no DMA ops supported, it could be
> better to fail the probe rather than let them try the test at all.
> 
>   Anyhow,
> > the below code will still handle it properly and users will get a report in which
> > everything is zero.
> >
> > +static int map_benchmark_thread(void *data)
> > +{
> > ...
> > +		dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > +		if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
> 
> Doing this is proper, but I am not sure if this tells the user the real
> problem.

Telling users the real problem isn't the design intention of this test
benchmark. It is never the purpose of this benchmark.

> 
> > +			pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
> 
> Not sure why use pr_err() over dev_err().

We are reporting errors in dma-benchmark driver rather than reporting errors
in the driver of the specific device. I think we should have "dma-benchmark"
as the prefix while printing the name of the device by dev_name().

> 
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> 
> Thanks,
> John

Thanks
Barry
Christoph Hellwig Nov. 14, 2020, 4:53 p.m. UTC | #7
Lots of > 80 char lines.  Please fix up the style.

I think this needs to set a dma mask as behavior for unlimited dma
mask vs the default 32-bit one can be very different.  I also think
you need to be able to pass the direction or have different tests
for directions.  bidirectional is not exactly heavily used and pays
more cache management penalty.
Song Bao Hua (Barry Song) Nov. 15, 2020, 12:11 a.m. UTC | #8
> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@lst.de]
> Sent: Sunday, November 15, 2020 5:54 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: iommu@lists.linux-foundation.org; hch@lst.de; robin.murphy@arm.com;
> m.szyprowski@samsung.com; Linuxarm <linuxarm@huawei.com>;
> linux-kselftest@vger.kernel.org; xuwei (O) <xuwei5@huawei.com>; Joerg
> Roedel <joro@8bytes.org>; Will Deacon <will@kernel.org>; Shuah Khan
> <shuah@kernel.org>
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> Lots of > 80 char lines.  Please fix up the style.

Checkpatch has changed 80 to 100. That's probably why my local checkpatch didn't report any warning:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d

I am happy to change them to be less than 80 if you like.

> 
> I think this needs to set a dma mask as behavior for unlimited dma
> mask vs the default 32-bit one can be very different. 

I would actually prefer that users bind real devices with their real dma_mask for testing, rather
than forcing a change of the dma_mask in this benchmark.

Some devices might have a 32-bit dma_mask while others might have an unlimited one. Both kinds can
bind to this driver and unbind from it after the test is done. So users just need to bind those
different real devices, with their different real dma_masks, to dma_benchmark.

I think this reflects the real performance of the real device better.

> I also think you need to be able to pass the direction or have different tests
> for directions.  bidirectional is not exactly heavily used and pays
> more cache management penalty.

For this, I'd like to add a direction option to the test app and pass the option to the benchmark
driver.

Thanks
Barry
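
A direction option like the one proposed above could be validated and translated roughly as follows. This is a sketch under assumptions: the BENCH_DIR_* names and bench_dir_to_dma_dir() are hypothetical and not part of this patch; only the enum values mirror the kernel's enum dma_data_direction.

```c
#include <stdint.h>

/* Values mirror the kernel's enum dma_data_direction. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

/* Hypothetical user-visible encoding of the new option (not in the patch). */
#define BENCH_DIR_BIDIRECTIONAL	0
#define BENCH_DIR_TO_DEVICE	1
#define BENCH_DIR_FROM_DEVICE	2

/*
 * Translate the user-supplied value into a DMA direction.
 * Returns 0 on success and -1 for a value the benchmark does not know.
 */
static int bench_dir_to_dma_dir(uint32_t user_dir, enum dma_data_direction *out)
{
	switch (user_dir) {
	case BENCH_DIR_BIDIRECTIONAL:
		*out = DMA_BIDIRECTIONAL;
		return 0;
	case BENCH_DIR_TO_DEVICE:
		*out = DMA_TO_DEVICE;
		return 0;
	case BENCH_DIR_FROM_DEVICE:
		*out = DMA_FROM_DEVICE;
		return 0;
	default:
		return -1;
	}
}
```

Rejecting unknown values up front keeps a bad option from ever reaching the map/unmap calls.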
Christoph Hellwig Nov. 15, 2020, 8:45 a.m. UTC | #9
On Sun, Nov 15, 2020 at 12:11:15AM +0000, Song Bao Hua (Barry Song) wrote:
> 
> Checkpatch has changed 80 to 100. That's probably why my local checkpatch didn't report any warning:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d
> 
> I am happy to change them to be less than 80 if you like.

Don't rely on checkpatch, it is broken.  Look at the coding style document.

> > I think this needs to set a dma mask as behavior for unlimited dma
> > mask vs the default 32-bit one can be very different. 
> 
> I actually prefer users bind real devices with real dma_mask to test rather than force to change
> the dma_mask in this benchmark.

The mask is set by the driver, not the device.  So you need to set it
when you bind, real device or not.
Song Bao Hua (Barry Song) Nov. 15, 2020, 9:54 p.m. UTC | #10
> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@lst.de]
> Sent: Sunday, November 15, 2020 9:45 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>
> Cc: Christoph Hellwig <hch@lst.de>; iommu@lists.linux-foundation.org;
> robin.murphy@arm.com; m.szyprowski@samsung.com; Linuxarm
> <linuxarm@huawei.com>; linux-kselftest@vger.kernel.org; xuwei (O)
> <xuwei5@huawei.com>; Joerg Roedel <joro@8bytes.org>; Will Deacon
> <will@kernel.org>; Shuah Khan <shuah@kernel.org>
> Subject: Re: [PATCH v3 1/2] dma-mapping: add benchmark support for
> streaming DMA APIs
> 
> On Sun, Nov 15, 2020 at 12:11:15AM +0000, Song Bao Hua (Barry Song) wrote:
> >
> > Checkpatch has changed 80 to 100. That's probably why my local checkpatch didn't report any warning:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdc48fa11e46f867ea4d
> >
> > I am happy to change them to be less than 80 if you like.
> 
> Don't rely on checkpatch, it is broken.  Look at the coding style document.
> 
> > > I think this needs to set a dma mask as behavior for unlimited dma
> > > mask vs the default 32-bit one can be very different.
> >
> > I actually prefer users bind real devices with real dma_mask to test rather than force to change
> > the dma_mask in this benchmark.
> 
> The mask is set by the driver, not the device.  So you need to set it
> when you bind, real device or not.

Yes, though it is a little bit tricky.

Sometimes, it is done by the "device" in architecture code, e.g. there is a lot
of dma_mask configuration code in arch/arm/mach-xxx:
arch/arm/mach-davinci/da850.c
static u64 da850_vpif_dma_mask = DMA_BIT_MASK(32);
static struct platform_device da850_vpif_dev = {
	.name		= "vpif",
	.id		= -1,
	.dev		= {
		.dma_mask		= &da850_vpif_dma_mask,
		.coherent_dma_mask	= DMA_BIT_MASK(32),
	},
	.resource	= da850_vpif_resource,
	.num_resources	= ARRAY_SIZE(da850_vpif_resource),
};

Sometimes, it is done by "of" or "acpi", for example:
drivers/acpi/arm64/iort.c
void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size)
{
	u64 end, mask, dmaaddr = 0, size = 0, offset = 0;
	int ret;

	...

	ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
	if (!ret) {
		/*
		 * Limit coherent and dma mask based on size retrieved from
		 * firmware.
		 */
		end = dmaaddr + size - 1;
		mask = DMA_BIT_MASK(ilog2(end) + 1);
		dev->bus_dma_limit = end;
		dev->coherent_dma_mask = mask;
		*dev->dma_mask = mask;
	}
	...
}
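
The mask arithmetic in iort_dma_setup() above can be checked in user space. A minimal sketch: DMA_BIT_MASK() is copied from the kernel's definition, while ilog2_u64() and mask_for_range() are hypothetical helpers written for this illustration.

```c
#include <stdint.h>

/* Same definition as the kernel's DMA_BIT_MASK(). */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* User-space stand-in for the kernel's ilog2(): index of the highest set bit. */
static int ilog2_u64(uint64_t v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* Smallest all-ones mask that covers the last address of the DMA range. */
static uint64_t mask_for_range(uint64_t dmaaddr, uint64_t size)
{
	uint64_t end = dmaaddr + size - 1;

	return DMA_BIT_MASK(ilog2_u64(end) + 1);
}
```

For a 4 GiB range starting at 0, end is 0xffffffff, ilog2(end) is 31, and the result is DMA_BIT_MASK(32).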

Sometimes, it is done by "bus", for example, ISA:
		isa_dev->dev.coherent_dma_mask = DMA_BIT_MASK(24);
		isa_dev->dev.dma_mask = &isa_dev->dev.coherent_dma_mask;

		error = device_register(&isa_dev->dev);
		if (error) {
			put_device(&isa_dev->dev);
			break;
		}

And in many cases, it is done by the driver. On the ARM64 server platform I am
testing, drivers actually rarely set dma_mask.

So to make the dma benchmark work on all platforms, it seems worth adding a
dma_mask_bit parameter. But, in order to avoid breaking the dma_mask of those
devices whose dma_mask is set by the architecture, acpi or the bus, it seems we
need to do the below in dma_benchmark:

u64 old_mask;

old_mask = dma_get_mask(dev);

dma_set_mask(dev, new_mask);

do_map_benchmark();

/* restore old dma_mask so that the dma_mask of the device is not changed due to
benchmark when it is bound back to its original driver */
dma_set_mask(dev, old_mask);

Thanks
Barry
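
The save/set/restore idea above can be modelled in user space as follows. This is a sketch only: struct mock_device, the mock_dma_* helpers, run_with_mask() and record_mask_bench() are all hypothetical stand-ins, and the real dma_get_mask()/dma_set_mask() operate on struct device.

```c
#include <stdint.h>

/* Hypothetical user-space mock; the real struct device is far larger. */
struct mock_device {
	uint64_t dma_mask;
};

static uint64_t mock_dma_get_mask(const struct mock_device *dev)
{
	return dev->dma_mask;
}

/* Like dma_set_mask(), takes the mask by value and returns 0 on success. */
static int mock_dma_set_mask(struct mock_device *dev, uint64_t mask)
{
	dev->dma_mask = mask;
	return 0;
}

/* Mask observed by the sample benchmark callback below. */
static uint64_t seen_mask;

/* Sample callback: records which mask was active while it ran. */
static int record_mask_bench(struct mock_device *dev)
{
	seen_mask = mock_dma_get_mask(dev);
	return 0;
}

/* Run bench() under a temporary mask, then put the original mask back. */
static int run_with_mask(struct mock_device *dev, uint64_t new_mask,
			 int (*bench)(struct mock_device *))
{
	uint64_t old_mask = mock_dma_get_mask(dev);
	int ret = mock_dma_set_mask(dev, new_mask);

	if (ret)
		return ret;

	ret = bench(dev);

	/* Restore even if the benchmark failed, so the next driver sees it. */
	mock_dma_set_mask(dev, old_mask);
	return ret;
}
```

Restoring the mask on the failure path as well means the device looks untouched when it is bound back to its original driver.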

Patch

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..949c53da5991 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -225,3 +225,11 @@  config DMA_API_DEBUG_SG
 	  is technically out-of-spec.
 
 	  If unsure, say N.
+
+config DMA_MAP_BENCHMARK
+	bool "Enable benchmarking of streaming DMA mapping"
+	help
+	  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
+	  performance of dma_(un)map_page.
+
+	  See tools/testing/selftests/dma/dma_map_benchmark.c
diff --git a/kernel/dma/Makefile b/kernel/dma/Makefile
index dc755ab68aab..7aa6b26b1348 100644
--- a/kernel/dma/Makefile
+++ b/kernel/dma/Makefile
@@ -10,3 +10,4 @@  obj-$(CONFIG_DMA_API_DEBUG)		+= debug.o
 obj-$(CONFIG_SWIOTLB)			+= swiotlb.o
 obj-$(CONFIG_DMA_COHERENT_POOL)		+= pool.o
 obj-$(CONFIG_DMA_REMAP)			+= remap.o
+obj-$(CONFIG_DMA_MAP_BENCHMARK)		+= map_benchmark.o
diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
new file mode 100644
index 000000000000..dc4e5ff48a2d
--- /dev/null
+++ b/kernel/dma/map_benchmark.c
@@ -0,0 +1,296 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Hisilicon Limited.
+ */
+
+#define pr_fmt(fmt)	KBUILD_MODNAME ": " fmt
+
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/math64.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/timekeeping.h>
+
+#define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)
+#define DMA_MAP_MAX_THREADS	1024
+#define DMA_MAP_MAX_SECONDS	300
+
+struct map_benchmark {
+	__u64 avg_map_100ns; /* average map latency in 100ns */
+	__u64 map_stddev; /* standard deviation of map latency */
+	__u64 avg_unmap_100ns; /* as above */
+	__u64 unmap_stddev;
+	__u32 threads; /* how many threads will do map/unmap in parallel */
+	__u32 seconds; /* how long the test will last */
+	int node; /* which numa node this benchmark will run on */
+	__u64 expansion[10];	/* For future use */
+};
+
+struct map_benchmark_data {
+	struct map_benchmark bparam;
+	struct device *dev;
+	struct dentry  *debugfs;
+	atomic64_t sum_map_100ns;
+	atomic64_t sum_unmap_100ns;
+	atomic64_t sum_square_map;
+	atomic64_t sum_square_unmap;
+	atomic64_t loops;
+};
+
+static int map_benchmark_thread(void *data)
+{
+	void *buf;
+	dma_addr_t dma_addr;
+	struct map_benchmark_data *map = data;
+	int ret = 0;
+
+	buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	while (!kthread_should_stop())  {
+		__u64 map_100ns, unmap_100ns, map_square, unmap_square;
+		ktime_t map_stime, map_etime, unmap_stime, unmap_etime;
+
+		/*
+		 * For a non-coherent device, if we don't dirty the buffer in the
+		 * cache first, this will underestimate the real-world overhead of
+		 * BIDIRECTIONAL or TO_DEVICE mappings.
+		 * 66 means everything goes well! 66 is lucky.
+		 */
+		memset(buf, 0x66, PAGE_SIZE);
+
+		map_stime = ktime_get();
+		dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, DMA_BIDIRECTIONAL);
+		if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
+			pr_err("dma_map_single failed on %s\n", dev_name(map->dev));
+			ret = -ENOMEM;
+			goto out;
+		}
+		map_etime = ktime_get();
+
+		unmap_stime = ktime_get();
+		dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+		unmap_etime = ktime_get();
+
+		/* calculate sum and sum of squares */
+		map_100ns = div64_ul(ktime_to_ns(ktime_sub(map_etime, map_stime)),  100);
+		unmap_100ns = div64_ul(ktime_to_ns(ktime_sub(unmap_etime, unmap_stime)), 100);
+		map_square = map_100ns * map_100ns;
+		unmap_square = unmap_100ns * unmap_100ns;
+
+		atomic64_add(map_100ns, &map->sum_map_100ns);
+		atomic64_add(unmap_100ns, &map->sum_unmap_100ns);
+		atomic64_add(map_square, &map->sum_square_map);
+		atomic64_add(unmap_square, &map->sum_square_unmap);
+		atomic64_inc(&map->loops);
+	}
+
+out:
+	free_page((unsigned long)buf);
+	return ret;
+}
+
+static int do_map_benchmark(struct map_benchmark_data *map)
+{
+	struct task_struct **tsk;
+	int threads = map->bparam.threads;
+	int node = map->bparam.node;
+	const cpumask_t *cpu_mask = cpumask_of_node(node);
+	__u64 loops;
+	int ret = 0;
+	int i;
+
+	tsk = kmalloc_array(threads, sizeof(*tsk), GFP_KERNEL);
+	if (!tsk)
+		return -ENOMEM;
+
+	get_device(map->dev);
+
+	for (i = 0; i < threads; i++) {
+		tsk[i] = kthread_create_on_node(map_benchmark_thread, map,
+				map->bparam.node, "dma-map-benchmark/%d", i);
+		if (IS_ERR(tsk[i])) {
+			pr_err("create dma_map thread failed\n");
+			ret = PTR_ERR(tsk[i]);
+			goto out;
+		}
+
+		if (node != NUMA_NO_NODE && node_online(node))
+			kthread_bind_mask(tsk[i], cpu_mask);
+	}
+
+	/* clear the old value in the previous benchmark */
+	atomic64_set(&map->sum_map_100ns, 0);
+	atomic64_set(&map->sum_unmap_100ns, 0);
+	atomic64_set(&map->sum_square_map, 0);
+	atomic64_set(&map->sum_square_unmap, 0);
+	atomic64_set(&map->loops, 0);
+
+	for (i = 0; i < threads; i++)
+		wake_up_process(tsk[i]);
+
+	msleep_interruptible(map->bparam.seconds * 1000);
+
+	/* wait for the completion of benchmark threads */
+	for (i = 0; i < threads; i++) {
+		ret = kthread_stop(tsk[i]);
+		if (ret)
+			goto out;
+	}
+
+	loops = atomic64_read(&map->loops);
+	if (likely(loops > 0)) {
+		__u64 map_variance, unmap_variance;
+
+		/* average latency */
+		map->bparam.avg_map_100ns = div64_u64(atomic64_read(&map->sum_map_100ns), loops);
+		map->bparam.avg_unmap_100ns = div64_u64(atomic64_read(&map->sum_unmap_100ns), loops);
+
+		/* standard deviation of latency */
+		map_variance = div64_u64(atomic64_read(&map->sum_square_map),  loops) -
+			map->bparam.avg_map_100ns * map->bparam.avg_map_100ns;
+		unmap_variance = div64_u64(atomic64_read(&map->sum_square_unmap), loops) -
+			map->bparam.avg_unmap_100ns * map->bparam.avg_unmap_100ns;
+		map->bparam.map_stddev = int_sqrt64(map_variance);
+		map->bparam.unmap_stddev = int_sqrt64(unmap_variance);
+	}
+
+out:
+	put_device(map->dev);
+	kfree(tsk);
+	return ret;
+}
+
+static long map_benchmark_ioctl(struct file *filep, unsigned int cmd,
+		unsigned long arg)
+{
+	struct map_benchmark_data *map = filep->private_data;
+	int ret;
+
+	if (copy_from_user(&map->bparam, (void __user *)arg, sizeof(map->bparam)))
+		return -EFAULT;
+
+	switch (cmd) {
+	case DMA_MAP_BENCHMARK:
+		if (map->bparam.threads == 0 || map->bparam.threads > DMA_MAP_MAX_THREADS) {
+			pr_err("invalid thread number\n");
+			return -EINVAL;
+		}
+		if (map->bparam.seconds == 0 || map->bparam.seconds > DMA_MAP_MAX_SECONDS) {
+			pr_err("invalid duration seconds\n");
+			return -EINVAL;
+		}
+
+		ret = do_map_benchmark(map);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (copy_to_user((void __user *)arg, &map->bparam, sizeof(map->bparam)))
+		return -EFAULT;
+
+	return ret;
+}
+
+static const struct file_operations map_benchmark_fops = {
+	.open = simple_open,
+	.unlocked_ioctl = map_benchmark_ioctl,
+};
+
+static void map_benchmark_remove_debugfs(void *data)
+{
+	struct map_benchmark_data *map = (struct map_benchmark_data *)data;
+
+	debugfs_remove(map->debugfs);
+}
+
+static int __map_benchmark_probe(struct device *dev)
+{
+	struct dentry *entry;
+	struct map_benchmark_data *map;
+	int ret;
+
+	map = devm_kzalloc(dev, sizeof(*map), GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+	map->dev = dev;
+
+	ret = devm_add_action(dev, map_benchmark_remove_debugfs, map);
+	if (ret) {
+		pr_err("Can't add debugfs remove action\n");
+		return ret;
+	}
+
+	/*
+	 * we only permit a device bound with this driver, 2nd probe
+	 * will fail
+	 */
+	entry = debugfs_create_file("dma_map_benchmark", 0600, NULL, map,
+			&map_benchmark_fops);
+	if (IS_ERR(entry))
+		return PTR_ERR(entry);
+	map->debugfs = entry;
+
+	return 0;
+}
+
+static int map_benchmark_platform_probe(struct platform_device *pdev)
+{
+	return __map_benchmark_probe(&pdev->dev);
+}
+
+static struct platform_driver map_benchmark_platform_driver = {
+	.driver		= {
+		.name	= "dma_map_benchmark",
+	},
+	.probe = map_benchmark_platform_probe,
+};
+
+static int map_benchmark_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	return __map_benchmark_probe(&pdev->dev);
+}
+
+static struct pci_driver map_benchmark_pci_driver = {
+	.name	= "dma_map_benchmark",
+	.probe	= map_benchmark_pci_probe,
+};
+
+static int __init map_benchmark_init(void)
+{
+	int ret;
+
+	ret = pci_register_driver(&map_benchmark_pci_driver);
+	if (ret)
+		return ret;
+
+	ret = platform_driver_register(&map_benchmark_platform_driver);
+	if (ret) {
+		pci_unregister_driver(&map_benchmark_pci_driver);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void __exit map_benchmark_cleanup(void)
+{
+	platform_driver_unregister(&map_benchmark_platform_driver);
+	pci_unregister_driver(&map_benchmark_pci_driver);
+}
+
+module_init(map_benchmark_init);
+module_exit(map_benchmark_cleanup);
+
+MODULE_AUTHOR("Barry Song <song.bao.hua@hisilicon.com>");
+MODULE_DESCRIPTION("dma_map benchmark driver");
+MODULE_LICENSE("GPL");
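
The average and standard deviation that do_map_benchmark() reports are derived from a running sum and a running sum of squares, using variance = E[x^2] - (E[x])^2. The same arithmetic can be sketched in user space; isqrt_u64() is a hypothetical stand-in for the kernel's int_sqrt64(), and struct latency_stats and summarize() are names invented for this illustration.

```c
#include <stdint.h>

/* Integer square root by binary search; stands in for int_sqrt64(). */
static uint64_t isqrt_u64(uint64_t x)
{
	uint64_t lo = 0, hi = 0xffffffffULL;	/* sqrt of a u64 fits in 32 bits */

	while (lo < hi) {
		uint64_t mid = lo + (hi - lo + 1) / 2;

		if (mid * mid <= x)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}

struct latency_stats {
	uint64_t avg;		/* mean latency, same unit as the samples */
	uint64_t stddev;	/* standard deviation, same unit */
};

/*
 * Derive mean and standard deviation from the accumulated totals.
 * Returns -1 if no loop completed, mirroring the loops > 0 check.
 */
static int summarize(uint64_t sum, uint64_t sum_square, uint64_t loops,
		     struct latency_stats *out)
{
	uint64_t variance;

	if (loops == 0)
		return -1;

	out->avg = sum / loops;
	/* E[x^2] - (E[x])^2, all in truncating integer arithmetic */
	variance = sum_square / loops - out->avg * out->avg;
	out->stddev = isqrt_u64(variance);
	return 0;
}
```

For the samples 2, 4, 4, 4, 5, 5, 7, 9 the totals are sum = 40 and sum of squares = 232 over 8 loops, giving an average of 5 and a standard deviation of 2.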