diff mbox series

[3/3] virtio-pmem: Add virtio pmem driver

Message ID 20180831133019.27579-4-pagupta@redhat.com (mailing list archive)
State New, archived
Headers show
Series kvm "fake DAX" device | expand

Commit Message

Pankaj Gupta Aug. 31, 2018, 1:30 p.m. UTC
This patch adds virtio-pmem driver for KVM guest.

Guest reads the persistent memory range information from
Qemu over VIRTIO and registers it on nvdimm_bus. It also
creates a nd_region object with the persistent memory
range information so that existing 'nvdimm/pmem' driver
can reserve this into system memory map. This way
'virtio-pmem' driver uses existing functionality of pmem
driver to register persistent memory compatible for DAX
capable filesystems.

This also provides function to perform guest flush over
VIRTIO from 'pmem' driver when userspace performs flush
on DAX memory range.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
---
 drivers/virtio/Kconfig           |   9 ++
 drivers/virtio/Makefile          |   1 +
 drivers/virtio/virtio_pmem.c     | 255 +++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_ids.h  |   1 +
 include/uapi/linux/virtio_pmem.h |  40 ++++++
 5 files changed, 306 insertions(+)
 create mode 100644 drivers/virtio/virtio_pmem.c
 create mode 100644 include/uapi/linux/virtio_pmem.h

Comments

kernel test robot Sept. 4, 2018, 3:17 p.m. UTC | #1
Hi Pankaj,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
[also build test ERROR on v4.19-rc2 next-20180903]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: i386-randconfig-a3-201835 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 
:::::: branch date: 21 hours ago
:::::: commit date: 21 hours ago

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_pmem.o: In function `virtio_pmem_remove':
>> drivers/virtio/virtio_pmem.c:220: undefined reference to `nvdimm_bus_unregister'
   drivers/virtio/virtio_pmem.o: In function `virtio_pmem_probe':
>> drivers/virtio/virtio_pmem.c:186: undefined reference to `nvdimm_bus_register'
>> drivers/virtio/virtio_pmem.c:198: undefined reference to `nvdimm_pmem_region_create'
   drivers/virtio/virtio_pmem.c:207: undefined reference to `nvdimm_bus_unregister'

# https://github.com/0day-ci/linux/commit/acce2633da18b0ad58d0cc9243a85b03020ca099
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout acce2633da18b0ad58d0cc9243a85b03020ca099
vim +220 drivers/virtio/virtio_pmem.c

acce2633 Pankaj Gupta 2018-08-31  147  
acce2633 Pankaj Gupta 2018-08-31  148  static int virtio_pmem_probe(struct virtio_device *vdev)
acce2633 Pankaj Gupta 2018-08-31  149  {
acce2633 Pankaj Gupta 2018-08-31  150  	int err = 0;
acce2633 Pankaj Gupta 2018-08-31  151  	struct resource res;
acce2633 Pankaj Gupta 2018-08-31  152  	struct virtio_pmem *vpmem;
acce2633 Pankaj Gupta 2018-08-31  153  	struct nvdimm_bus *nvdimm_bus;
acce2633 Pankaj Gupta 2018-08-31  154  	struct nd_region_desc ndr_desc;
acce2633 Pankaj Gupta 2018-08-31  155  	int nid = dev_to_node(&vdev->dev);
acce2633 Pankaj Gupta 2018-08-31  156  	struct nd_region *nd_region;
acce2633 Pankaj Gupta 2018-08-31  157  
acce2633 Pankaj Gupta 2018-08-31  158  	if (!vdev->config->get) {
acce2633 Pankaj Gupta 2018-08-31  159  		dev_err(&vdev->dev, "%s failure: config disabled\n",
acce2633 Pankaj Gupta 2018-08-31  160  			__func__);
acce2633 Pankaj Gupta 2018-08-31  161  		return -EINVAL;
acce2633 Pankaj Gupta 2018-08-31  162  	}
acce2633 Pankaj Gupta 2018-08-31  163  
acce2633 Pankaj Gupta 2018-08-31  164  	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
acce2633 Pankaj Gupta 2018-08-31  165  			GFP_KERNEL);
acce2633 Pankaj Gupta 2018-08-31  166  	if (!vpmem) {
acce2633 Pankaj Gupta 2018-08-31  167  		err = -ENOMEM;
acce2633 Pankaj Gupta 2018-08-31  168  		goto out_err;
acce2633 Pankaj Gupta 2018-08-31  169  	}
acce2633 Pankaj Gupta 2018-08-31  170  
acce2633 Pankaj Gupta 2018-08-31  171  	vpmem->vdev = vdev;
acce2633 Pankaj Gupta 2018-08-31  172  	err = init_vq(vpmem);
acce2633 Pankaj Gupta 2018-08-31  173  	if (err)
acce2633 Pankaj Gupta 2018-08-31  174  		goto out_err;
acce2633 Pankaj Gupta 2018-08-31  175  
acce2633 Pankaj Gupta 2018-08-31  176  	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
acce2633 Pankaj Gupta 2018-08-31  177  			start, &vpmem->start);
acce2633 Pankaj Gupta 2018-08-31  178  	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
acce2633 Pankaj Gupta 2018-08-31  179  			size, &vpmem->size);
acce2633 Pankaj Gupta 2018-08-31  180  
acce2633 Pankaj Gupta 2018-08-31  181  	res.start = vpmem->start;
acce2633 Pankaj Gupta 2018-08-31  182  	res.end   = vpmem->start + vpmem->size-1;
acce2633 Pankaj Gupta 2018-08-31  183  	vpmem->nd_desc.provider_name = "virtio-pmem";
acce2633 Pankaj Gupta 2018-08-31  184  	vpmem->nd_desc.module = THIS_MODULE;
acce2633 Pankaj Gupta 2018-08-31  185  
acce2633 Pankaj Gupta 2018-08-31 @186  	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
acce2633 Pankaj Gupta 2018-08-31  187  						&vpmem->nd_desc);
acce2633 Pankaj Gupta 2018-08-31  188  	if (!nvdimm_bus)
acce2633 Pankaj Gupta 2018-08-31  189  		goto out_vq;
acce2633 Pankaj Gupta 2018-08-31  190  
acce2633 Pankaj Gupta 2018-08-31  191  	dev_set_drvdata(&vdev->dev, nvdimm_bus);
acce2633 Pankaj Gupta 2018-08-31  192  	memset(&ndr_desc, 0, sizeof(ndr_desc));
acce2633 Pankaj Gupta 2018-08-31  193  
acce2633 Pankaj Gupta 2018-08-31  194  	ndr_desc.res = &res;
acce2633 Pankaj Gupta 2018-08-31  195  	ndr_desc.numa_node = nid;
acce2633 Pankaj Gupta 2018-08-31  196  	ndr_desc.flush = virtio_pmem_flush;
acce2633 Pankaj Gupta 2018-08-31  197  	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
acce2633 Pankaj Gupta 2018-08-31 @198  	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
acce2633 Pankaj Gupta 2018-08-31  199  
acce2633 Pankaj Gupta 2018-08-31  200  	if (!nd_region)
acce2633 Pankaj Gupta 2018-08-31  201  		goto out_nd;
acce2633 Pankaj Gupta 2018-08-31  202  
acce2633 Pankaj Gupta 2018-08-31  203  	//virtio_device_ready(vdev);
acce2633 Pankaj Gupta 2018-08-31  204  	return 0;
acce2633 Pankaj Gupta 2018-08-31  205  out_nd:
acce2633 Pankaj Gupta 2018-08-31  206  	err = -ENXIO;
acce2633 Pankaj Gupta 2018-08-31  207  	nvdimm_bus_unregister(nvdimm_bus);
acce2633 Pankaj Gupta 2018-08-31  208  out_vq:
acce2633 Pankaj Gupta 2018-08-31  209  	vdev->config->del_vqs(vdev);
acce2633 Pankaj Gupta 2018-08-31  210  out_err:
acce2633 Pankaj Gupta 2018-08-31  211  	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
acce2633 Pankaj Gupta 2018-08-31  212  	return err;
acce2633 Pankaj Gupta 2018-08-31  213  }
acce2633 Pankaj Gupta 2018-08-31  214  
acce2633 Pankaj Gupta 2018-08-31  215  static void virtio_pmem_remove(struct virtio_device *vdev)
acce2633 Pankaj Gupta 2018-08-31  216  {
acce2633 Pankaj Gupta 2018-08-31  217  	struct virtio_pmem *vpmem = vdev->priv;
acce2633 Pankaj Gupta 2018-08-31  218  	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
acce2633 Pankaj Gupta 2018-08-31  219  
acce2633 Pankaj Gupta 2018-08-31 @220  	nvdimm_bus_unregister(nvdimm_bus);
acce2633 Pankaj Gupta 2018-08-31  221  	vdev->config->del_vqs(vdev);
acce2633 Pankaj Gupta 2018-08-31  222  	kfree(vpmem);
acce2633 Pankaj Gupta 2018-08-31  223  }
acce2633 Pankaj Gupta 2018-08-31  224  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Pankaj Gupta Sept. 5, 2018, 8:34 a.m. UTC | #2
Hello,

Thanks for the report.

> Hi Pankaj,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
> [also build test ERROR on v4.19-rc2 next-20180903]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
> libnvdimm-for-next
> config: i386-randconfig-a3-201835 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
> :::::: branch date: 21 hours ago
> :::::: commit date: 21 hours ago
> 
> All errors (new ones prefixed by >>):
> 
>    drivers/virtio/virtio_pmem.o: In function `virtio_pmem_remove':
> >> drivers/virtio/virtio_pmem.c:220: undefined reference to
> >> `nvdimm_bus_unregister'
>    drivers/virtio/virtio_pmem.o: In function `virtio_pmem_probe':
> >> drivers/virtio/virtio_pmem.c:186: undefined reference to
> >> `nvdimm_bus_register'
> >> drivers/virtio/virtio_pmem.c:198: undefined reference to
> >> `nvdimm_pmem_region_create'
>    drivers/virtio/virtio_pmem.c:207: undefined reference to
>    `nvdimm_bus_unregister'

It looks like dependent configiguration 'LIBNVDIMM' is not enabled. I will add the 
dependency in Kconfig file for virtio_pmem in v2.

Thanks,
Pankaj

> 
> #
> https://github.com/0day-ci/linux/commit/acce2633da18b0ad58d0cc9243a85b03020ca099
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout acce2633da18b0ad58d0cc9243a85b03020ca099
> vim +220 drivers/virtio/virtio_pmem.c
> 
> acce2633 Pankaj Gupta 2018-08-31  147
> acce2633 Pankaj Gupta 2018-08-31  148  static int virtio_pmem_probe(struct
> virtio_device *vdev)
> acce2633 Pankaj Gupta 2018-08-31  149  {
> acce2633 Pankaj Gupta 2018-08-31  150  	int err = 0;
> acce2633 Pankaj Gupta 2018-08-31  151  	struct resource res;
> acce2633 Pankaj Gupta 2018-08-31  152  	struct virtio_pmem *vpmem;
> acce2633 Pankaj Gupta 2018-08-31  153  	struct nvdimm_bus *nvdimm_bus;
> acce2633 Pankaj Gupta 2018-08-31  154  	struct nd_region_desc ndr_desc;
> acce2633 Pankaj Gupta 2018-08-31  155  	int nid = dev_to_node(&vdev->dev);
> acce2633 Pankaj Gupta 2018-08-31  156  	struct nd_region *nd_region;
> acce2633 Pankaj Gupta 2018-08-31  157
> acce2633 Pankaj Gupta 2018-08-31  158  	if (!vdev->config->get) {
> acce2633 Pankaj Gupta 2018-08-31  159  		dev_err(&vdev->dev, "%s failure:
> config disabled\n",
> acce2633 Pankaj Gupta 2018-08-31  160  			__func__);
> acce2633 Pankaj Gupta 2018-08-31  161  		return -EINVAL;
> acce2633 Pankaj Gupta 2018-08-31  162  	}
> acce2633 Pankaj Gupta 2018-08-31  163
> acce2633 Pankaj Gupta 2018-08-31  164  	vdev->priv = vpmem =
> devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> acce2633 Pankaj Gupta 2018-08-31  165  			GFP_KERNEL);
> acce2633 Pankaj Gupta 2018-08-31  166  	if (!vpmem) {
> acce2633 Pankaj Gupta 2018-08-31  167  		err = -ENOMEM;
> acce2633 Pankaj Gupta 2018-08-31  168  		goto out_err;
> acce2633 Pankaj Gupta 2018-08-31  169  	}
> acce2633 Pankaj Gupta 2018-08-31  170
> acce2633 Pankaj Gupta 2018-08-31  171  	vpmem->vdev = vdev;
> acce2633 Pankaj Gupta 2018-08-31  172  	err = init_vq(vpmem);
> acce2633 Pankaj Gupta 2018-08-31  173  	if (err)
> acce2633 Pankaj Gupta 2018-08-31  174  		goto out_err;
> acce2633 Pankaj Gupta 2018-08-31  175
> acce2633 Pankaj Gupta 2018-08-31  176  	virtio_cread(vpmem->vdev, struct
> virtio_pmem_config,
> acce2633 Pankaj Gupta 2018-08-31  177  			start, &vpmem->start);
> acce2633 Pankaj Gupta 2018-08-31  178  	virtio_cread(vpmem->vdev, struct
> virtio_pmem_config,
> acce2633 Pankaj Gupta 2018-08-31  179  			size, &vpmem->size);
> acce2633 Pankaj Gupta 2018-08-31  180
> acce2633 Pankaj Gupta 2018-08-31  181  	res.start = vpmem->start;
> acce2633 Pankaj Gupta 2018-08-31  182  	res.end   = vpmem->start +
> vpmem->size-1;
> acce2633 Pankaj Gupta 2018-08-31  183  	vpmem->nd_desc.provider_name =
> "virtio-pmem";
> acce2633 Pankaj Gupta 2018-08-31  184  	vpmem->nd_desc.module = THIS_MODULE;
> acce2633 Pankaj Gupta 2018-08-31  185
> acce2633 Pankaj Gupta 2018-08-31 @186  	vpmem->nvdimm_bus = nvdimm_bus =
> nvdimm_bus_register(&vdev->dev,
> acce2633 Pankaj Gupta 2018-08-31  187  						&vpmem->nd_desc);
> acce2633 Pankaj Gupta 2018-08-31  188  	if (!nvdimm_bus)
> acce2633 Pankaj Gupta 2018-08-31  189  		goto out_vq;
> acce2633 Pankaj Gupta 2018-08-31  190
> acce2633 Pankaj Gupta 2018-08-31  191  	dev_set_drvdata(&vdev->dev,
> nvdimm_bus);
> acce2633 Pankaj Gupta 2018-08-31  192  	memset(&ndr_desc, 0,
> sizeof(ndr_desc));
> acce2633 Pankaj Gupta 2018-08-31  193
> acce2633 Pankaj Gupta 2018-08-31  194  	ndr_desc.res = &res;
> acce2633 Pankaj Gupta 2018-08-31  195  	ndr_desc.numa_node = nid;
> acce2633 Pankaj Gupta 2018-08-31  196  	ndr_desc.flush = virtio_pmem_flush;
> acce2633 Pankaj Gupta 2018-08-31  197  	set_bit(ND_REGION_PAGEMAP,
> &ndr_desc.flags);
> acce2633 Pankaj Gupta 2018-08-31 @198  	nd_region =
> nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> acce2633 Pankaj Gupta 2018-08-31  199
> acce2633 Pankaj Gupta 2018-08-31  200  	if (!nd_region)
> acce2633 Pankaj Gupta 2018-08-31  201  		goto out_nd;
> acce2633 Pankaj Gupta 2018-08-31  202
> acce2633 Pankaj Gupta 2018-08-31  203  	//virtio_device_ready(vdev);
> acce2633 Pankaj Gupta 2018-08-31  204  	return 0;
> acce2633 Pankaj Gupta 2018-08-31  205  out_nd:
> acce2633 Pankaj Gupta 2018-08-31  206  	err = -ENXIO;
> acce2633 Pankaj Gupta 2018-08-31  207  	nvdimm_bus_unregister(nvdimm_bus);
> acce2633 Pankaj Gupta 2018-08-31  208  out_vq:
> acce2633 Pankaj Gupta 2018-08-31  209  	vdev->config->del_vqs(vdev);
> acce2633 Pankaj Gupta 2018-08-31  210  out_err:
> acce2633 Pankaj Gupta 2018-08-31  211  	dev_err(&vdev->dev, "failed to
> register virtio pmem memory\n");
> acce2633 Pankaj Gupta 2018-08-31  212  	return err;
> acce2633 Pankaj Gupta 2018-08-31  213  }
> acce2633 Pankaj Gupta 2018-08-31  214
> acce2633 Pankaj Gupta 2018-08-31  215  static void virtio_pmem_remove(struct
> virtio_device *vdev)
> acce2633 Pankaj Gupta 2018-08-31  216  {
> acce2633 Pankaj Gupta 2018-08-31  217  	struct virtio_pmem *vpmem =
> vdev->priv;
> acce2633 Pankaj Gupta 2018-08-31  218  	struct nvdimm_bus *nvdimm_bus =
> dev_get_drvdata(&vdev->dev);
> acce2633 Pankaj Gupta 2018-08-31  219
> acce2633 Pankaj Gupta 2018-08-31 @220  	nvdimm_bus_unregister(nvdimm_bus);
> acce2633 Pankaj Gupta 2018-08-31  221  	vdev->config->del_vqs(vdev);
> acce2633 Pankaj Gupta 2018-08-31  222  	kfree(vpmem);
> acce2633 Pankaj Gupta 2018-08-31  223  }
> acce2633 Pankaj Gupta 2018-08-31  224
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>
kernel test robot Sept. 5, 2018, 12:02 p.m. UTC | #3
Hi Pankaj,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux-nvdimm/libnvdimm-for-next]
[also build test ERROR on v4.19-rc2 next-20180905]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Pankaj-Gupta/kvm-fake-DAX-device/20180903-160032
base:   https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git libnvdimm-for-next
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/virtio/virtio_pmem.o: In function `virtio_pmem_remove':
>> virtio_pmem.c:(.text+0x299): undefined reference to `nvdimm_bus_unregister'
   drivers/virtio/virtio_pmem.o: In function `virtio_pmem_probe':
>> virtio_pmem.c:(.text+0x5e3): undefined reference to `nvdimm_bus_register'
>> virtio_pmem.c:(.text+0x62a): undefined reference to `nvdimm_pmem_region_create'
   virtio_pmem.c:(.text+0x63b): undefined reference to `nvdimm_bus_unregister'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Luiz Capitulino Sept. 12, 2018, 4:54 p.m. UTC | #4
On Fri, 31 Aug 2018 19:00:18 +0530
Pankaj Gupta <pagupta@redhat.com> wrote:

> This patch adds virtio-pmem driver for KVM guest.
> 
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
> 
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
> 
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/virtio/Kconfig           |   9 ++
>  drivers/virtio/Makefile          |   1 +
>  drivers/virtio/virtio_pmem.c     | 255 +++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  40 ++++++
>  5 files changed, 306 insertions(+)
>  create mode 100644 drivers/virtio/virtio_pmem.c
>  create mode 100644 include/uapi/linux/virtio_pmem.h
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 3589764..a331e23 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
>  
>  	  If unsure, say Y.
>  
> +config VIRTIO_PMEM
> +	tristate "Support for virtio pmem driver"
> +	depends on VIRTIO
> +	help
> +	This driver provides support for virtio based flushing interface
> +	for persistent memory range.
> +
> +	If unsure, say M.
> +
>  config VIRTIO_BALLOON
>  	tristate "Virtio balloon driver"
>  	depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5..cbe91c6 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 0000000..c22cc87
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio.h>
> +#include <linux/module.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/spinlock.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/nd.h>
> +
> +struct virtio_pmem_request {
> +	/* Host return status corresponding to flush request */
> +	int ret;
> +
> +	/* command name*/
> +	char name[16];
> +
> +	/* Wait queue to process deferred work after ack from host */
> +	wait_queue_head_t host_acked;
> +	bool done;
> +
> +	/* Wait queue to process deferred work after virt queue buffer avail */
> +	wait_queue_head_t wq_buf;
> +	bool wq_buf_avail;
> +	struct list_head list;
> +};
> +
> +struct virtio_pmem {
> +	struct virtio_device *vdev;
> +
> +	/* Virtio pmem request queue */
> +	struct virtqueue *req_vq;
> +
> +	/* nvdimm bus registers virtio pmem device */
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nvdimm_bus_descriptor nd_desc;
> +
> +	/* List to store deferred work if virtqueue is full */
> +	struct list_head req_list;
> +
> +	/* Synchronize virtqueue data */
> +	spinlock_t pmem_lock;
> +
> +	/* Memory region information */
> +	uint64_t start;
> +	uint64_t size;
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> + /* The interrupt handler */
> +static void host_ack(struct virtqueue *vq)
> +{
> +	unsigned int len;
> +	unsigned long flags;
> +	struct virtio_pmem_request *req, *req_buf;
> +	struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> +		req->done = true;
> +		wake_up(&req->host_acked);
> +
> +		if (!list_empty(&vpmem->req_list)) {
> +			req_buf = list_first_entry(&vpmem->req_list,
> +					struct virtio_pmem_request, list);
> +			list_del(&vpmem->req_list);
> +			req_buf->wq_buf_avail = true;
> +			wake_up(&req_buf->wq_buf);
> +		}
> +	}
> +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> +	struct virtqueue *vq;
> +
> +	/* single vq */
> +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> +				host_ack, "flush_queue");
> +	if (IS_ERR(vq))
> +		return PTR_ERR(vq);
> +
> +	spin_lock_init(&vpmem->pmem_lock);
> +	INIT_LIST_HEAD(&vpmem->req_list);
> +
> +	return 0;
> +};
> +
> + /* The request submission function */
> +static int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> +	int err;
> +	unsigned long flags;
> +	struct scatterlist *sgs[2], sg, ret;
> +	struct virtio_device *vdev =
> +		dev_to_virtio(nd_region->dev.parent->parent);
> +	struct virtio_pmem *vpmem = vdev->priv;

I'm missing a might_sleep() call in this function.

> +	struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> +
> +	if (!req)
> +		return -ENOMEM;
> +
> +	req->done = req->wq_buf_avail = false;
> +	strcpy(req->name, "FLUSH");
> +	init_waitqueue_head(&req->host_acked);
> +	init_waitqueue_head(&req->wq_buf);
> +
> +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +	sg_init_one(&sg, req->name, strlen(req->name));
> +	sgs[0] = &sg;
> +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> +	sgs[1] = &ret;

It seems that sg_init_one() is only setting fields, in this
case you can move spin_lock_irqsave() here.

> +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> +	if (err) {
> +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> +		list_add_tail(&vpmem->req_list, &req->list);
> +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +		/* When host has read buffer, this completes via host_ack */
> +		wait_event(req->wq_buf, req->wq_buf_avail);
> +		spin_lock_irqsave(&vpmem->pmem_lock, flags);

Is this error handling code assuming that at some point
virtqueue_add_sgs() will succeed for a different thread? If yes,
what happens if the assumption is false? That is, what happens if
virtqueue_add_sgs() never succeeds anymore?

Why not just return an error?

> +	}
> +	virtqueue_kick(vpmem->req_vq);
> +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +	/* When host has read buffer, this completes via host_ack */
> +	wait_event(req->host_acked, req->done);
> +	err = req->ret;

If I'm understanding the QEMU code correctly, you're returning EIO
from QEMU if fsync() fails. I think this is wrong, since we don't know
if EIO in QEMU will be the same EIO in the guest. One way to solve this
would be to return 0 for success and 1 for failure from QEMU, and let the
guest implementation pick its error code (for your implementation it
could be EIO).

> +	kfree(req);
> +
> +	return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +	int err = 0;
> +	struct resource res;
> +	struct virtio_pmem *vpmem;
> +	struct nvdimm_bus *nvdimm_bus;
> +	struct nd_region_desc ndr_desc;
> +	int nid = dev_to_node(&vdev->dev);
> +	struct nd_region *nd_region;
> +
> +	if (!vdev->config->get) {
> +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +
> +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +			GFP_KERNEL);
> +	if (!vpmem) {
> +		err = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	vpmem->vdev = vdev;
> +	err = init_vq(vpmem);
> +	if (err)
> +		goto out_err;
> +
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			start, &vpmem->start);
> +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +			size, &vpmem->size);
> +
> +	res.start = vpmem->start;
> +	res.end   = vpmem->start + vpmem->size-1;
> +	vpmem->nd_desc.provider_name = "virtio-pmem";
> +	vpmem->nd_desc.module = THIS_MODULE;
> +
> +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> +						&vpmem->nd_desc);
> +	if (!nvdimm_bus)
> +		goto out_vq;
> +
> +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +	memset(&ndr_desc, 0, sizeof(ndr_desc));
> +
> +	ndr_desc.res = &res;
> +	ndr_desc.numa_node = nid;
> +	ndr_desc.flush = virtio_pmem_flush;
> +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +
> +	if (!nd_region)
> +		goto out_nd;
> +
> +	//virtio_device_ready(vdev);
> +	return 0;
> +out_nd:
> +	err = -ENXIO;
> +	nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> +	vdev->config->del_vqs(vdev);
> +out_err:
> +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> +	return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> +	struct virtio_pmem *vpmem = vdev->priv;
> +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> +	nvdimm_bus_unregister(nvdimm_bus);
> +	vdev->config->del_vqs(vdev);
> +	kfree(vpmem);
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static int virtio_pmem_freeze(struct virtio_device *vdev)
> +{
> +	/* todo: handle freeze function */
> +	return -EPERM;
> +}
> +
> +static int virtio_pmem_restore(struct virtio_device *vdev)
> +{
> +	/* todo: handle restore function */
> +	return -EPERM;
> +}
> +#endif
> +
> +
> +static struct virtio_driver virtio_pmem_driver = {
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.probe			= virtio_pmem_probe,
> +	.remove			= virtio_pmem_remove,
> +#ifdef CONFIG_PM_SLEEP
> +	.freeze                 = virtio_pmem_freeze,
> +	.restore                = virtio_pmem_restore,
> +#endif
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2..3463895 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
>  #define VIRTIO_ID_INPUT        18 /* virtio input */
>  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
>  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
>  
>  #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 0000000..c7c22a5
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible drivers/servers:
> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2018-2019
> + * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
> + */
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> +	__le64 start;
> +	__le64 size;
> +};
> +#endif
Pankaj Gupta Sept. 13, 2018, 6:58 a.m. UTC | #5
Hi Luiz,

Thanks for the review.

> 
> > This patch adds virtio-pmem driver for KVM guest.
> > 
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> > 
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> > 
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/virtio/Kconfig           |   9 ++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/virtio_pmem.c     | 255
> >  +++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  40 ++++++
> >  5 files changed, 306 insertions(+)
> >  create mode 100644 drivers/virtio/virtio_pmem.c
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > 
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 3589764..a331e23 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> >  
> >  	  If unsure, say Y.
> >  
> > +config VIRTIO_PMEM
> > +	tristate "Support for virtio pmem driver"
> > +	depends on VIRTIO
> > +	help
> > +	This driver provides support for virtio based flushing interface
> > +	for persistent memory range.
> > +
> > +	If unsure, say M.
> > +
> >  config VIRTIO_BALLOON
> >  	tristate "Virtio balloon driver"
> >  	depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5..cbe91c6 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 0000000..c22cc87
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,255 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/nd.h>
> > +
> > +struct virtio_pmem_request {
> > +	/* Host return status corresponding to flush request */
> > +	int ret;
> > +
> > +	/* command name*/
> > +	char name[16];
> > +
> > +	/* Wait queue to process deferred work after ack from host */
> > +	wait_queue_head_t host_acked;
> > +	bool done;
> > +
> > +	/* Wait queue to process deferred work after virt queue buffer avail */
> > +	wait_queue_head_t wq_buf;
> > +	bool wq_buf_avail;
> > +	struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > +	struct virtio_device *vdev;
> > +
> > +	/* Virtio pmem request queue */
> > +	struct virtqueue *req_vq;
> > +
> > +	/* nvdimm bus registers virtio pmem device */
> > +	struct nvdimm_bus *nvdimm_bus;
> > +	struct nvdimm_bus_descriptor nd_desc;
> > +
> > +	/* List to store deferred work if virtqueue is full */
> > +	struct list_head req_list;
> > +
> > +	/* Synchronize virtqueue data */
> > +	spinlock_t pmem_lock;
> > +
> > +	/* Memory region information */
> > +	uint64_t start;
> > +	uint64_t size;
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +	{ 0 },
> > +};
> > +
> > + /* The interrupt handler */
> > +static void host_ack(struct virtqueue *vq)
> > +{
> > +	unsigned int len;
> > +	unsigned long flags;
> > +	struct virtio_pmem_request *req, *req_buf;
> > +	struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > +		req->done = true;
> > +		wake_up(&req->host_acked);
> > +
> > +		if (!list_empty(&vpmem->req_list)) {
> > +			req_buf = list_first_entry(&vpmem->req_list,
> > +					struct virtio_pmem_request, list);
> > +			list_del(&vpmem->req_list);
> > +			req_buf->wq_buf_avail = true;
> > +			wake_up(&req_buf->wq_buf);
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > +	struct virtqueue *vq;
> > +
> > +	/* single vq */
> > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +				host_ack, "flush_queue");
> > +	if (IS_ERR(vq))
> > +		return PTR_ERR(vq);
> > +
> > +	spin_lock_init(&vpmem->pmem_lock);
> > +	INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > +	return 0;
> > +};
> > +
> > + /* The request submission function */
> > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > +	int err;
> > +	unsigned long flags;
> > +	struct scatterlist *sgs[2], sg, ret;
> > +	struct virtio_device *vdev =
> > +		dev_to_virtio(nd_region->dev.parent->parent);
> > +	struct virtio_pmem *vpmem = vdev->priv;
> 
> I'm missing a might_sleep() call in this function.

I am not sure if we need might_sleep here? 
We can add it as debugging aid for detecting any problems
in sleeping from acquired atomic context?

> 
> > +	struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > +
> > +	if (!req)
> > +		return -ENOMEM;
> > +
> > +	req->done = req->wq_buf_avail = false;
> > +	strcpy(req->name, "FLUSH");
> > +	init_waitqueue_head(&req->host_acked);
> > +	init_waitqueue_head(&req->wq_buf);
> > +
> > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +	sg_init_one(&sg, req->name, strlen(req->name));
> > +	sgs[0] = &sg;
> > +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > +	sgs[1] = &ret;
> 
> It seems that sg_init_one() is only setting fields, in this
> case you can move spin_lock_irqsave() here.

yes, will move spin_lock_irqsave here.

> 
> > +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > +	if (err) {
> > +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > +
> > +		list_add_tail(&vpmem->req_list, &req->list);
> > +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +		/* When host has read buffer, this completes via host_ack */
> > +		wait_event(req->wq_buf, req->wq_buf_avail);
> > +		spin_lock_irqsave(&vpmem->pmem_lock, flags);
> 
> Is this error handling code assuming that at some point
> virtqueue_add_sgs() will succeed for a different thread? If yes,
> what happens if the assumption is false? That is, what happens if
> virtqueue_add_sgs() never succeeds anymore?

virtqueue_add_sgs will not succeed and corresponding thread should wait.
All subsequent calling threads should also wait. As soon as there is first
available free entry(from host), first waiting thread is acknowledged.

In worst case if Qemu is not utilizing any of the used buffer will keep
multiple threads waiting. 

> 
> Why not just return an error?

As per suggestion by Stefan in previous discussion: if the virtqueue is full.  
Printing a message and failing the flush isn't appropriate.  This thread needs to 
wait until virtqueue space becomes available.

> 
> > +	}
> > +	virtqueue_kick(vpmem->req_vq);
> > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +	/* When host has read buffer, this completes via host_ack */
> > +	wait_event(req->host_acked, req->done);
> > +	err = req->ret;
> 
> If I'm understanding the QEMU code correctly, you're returning EIO
> from QEMU if fsync() fails. I think this is wrong, since we don't know
> if EIO in QEMU will be the same EIO in the guest. One way to solve this
> would be to return 0 for success and 1 for failure from QEMU, and let the
> guest implementation pick its error code (for your implementation it
> could be EIO).

Makes sense, will change this. 

Thanks,
Pankaj 
> 
> > +	kfree(req);
> > +
> > +	return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +	int err = 0;
> > +	struct resource res;
> > +	struct virtio_pmem *vpmem;
> > +	struct nvdimm_bus *nvdimm_bus;
> > +	struct nd_region_desc ndr_desc;
> > +	int nid = dev_to_node(&vdev->dev);
> > +	struct nd_region *nd_region;
> > +
> > +	if (!vdev->config->get) {
> > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> > +			__func__);
> > +		return -EINVAL;
> > +	}
> > +
> > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +			GFP_KERNEL);
> > +	if (!vpmem) {
> > +		err = -ENOMEM;
> > +		goto out_err;
> > +	}
> > +
> > +	vpmem->vdev = vdev;
> > +	err = init_vq(vpmem);
> > +	if (err)
> > +		goto out_err;
> > +
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			start, &vpmem->start);
> > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +			size, &vpmem->size);
> > +
> > +	res.start = vpmem->start;
> > +	res.end   = vpmem->start + vpmem->size-1;
> > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > +	vpmem->nd_desc.module = THIS_MODULE;
> > +
> > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > +						&vpmem->nd_desc);
> > +	if (!nvdimm_bus)
> > +		goto out_vq;
> > +
> > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +	memset(&ndr_desc, 0, sizeof(ndr_desc));
> > +
> > +	ndr_desc.res = &res;
> > +	ndr_desc.numa_node = nid;
> > +	ndr_desc.flush = virtio_pmem_flush;
> > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +
> > +	if (!nd_region)
> > +		goto out_nd;
> > +
> > +	//virtio_device_ready(vdev);
> > +	return 0;
> > +out_nd:
> > +	err = -ENXIO;
> > +	nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > +	vdev->config->del_vqs(vdev);
> > +out_err:
> > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > +	return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > +	struct virtio_pmem *vpmem = vdev->priv;
> > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > +	nvdimm_bus_unregister(nvdimm_bus);
> > +	vdev->config->del_vqs(vdev);
> > +	kfree(vpmem);
> > +}
> > +
> > +#ifdef CONFIG_PM_SLEEP
> > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > +{
> > +	/* todo: handle freeze function */
> > +	return -EPERM;
> > +}
> > +
> > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > +{
> > +	/* todo: handle restore function */
> > +	return -EPERM;
> > +}
> > +#endif
> > +
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > +	.driver.name		= KBUILD_MODNAME,
> > +	.driver.owner		= THIS_MODULE,
> > +	.id_table		= id_table,
> > +	.probe			= virtio_pmem_probe,
> > +	.remove			= virtio_pmem_remove,
> > +#ifdef CONFIG_PM_SLEEP
> > +	.freeze                 = virtio_pmem_freeze,
> > +	.restore                = virtio_pmem_restore,
> > +#endif
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2..3463895 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> >  
> >  #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 0000000..c7c22a5
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible drivers/servers:
> > + *
> > + *
> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + *    notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + *    notice, this list of conditions and the following disclaimer in the
> > + *    documentation and/or other materials provided with the distribution.
> > + * 3. Neither the name of IBM nor the names of its contributors
> > + *    may be used to endorse or promote products derived from this
> > software
> > + *    without specific prior written permission.
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > ``AS IS''
> > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> > THE
> > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > PURPOSE
> > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > STRICT
> > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
> > WAY
> > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + *
> > + * Copyright (C) Red Hat, Inc., 2018-2019
> > + * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
> > + */
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > +	__le64 start;
> > +	__le64 size;
> > +};
> > +#endif
> 
> 
>
Luiz Capitulino Sept. 13, 2018, 12:19 p.m. UTC | #6
On Thu, 13 Sep 2018 02:58:21 -0400 (EDT)
Pankaj Gupta <pagupta@redhat.com> wrote:

> Hi Luiz,
> 
> Thanks for the review.
> 
> >   
> > > This patch adds virtio-pmem driver for KVM guest.
> > > 
> > > Guest reads the persistent memory range information from
> > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > creates a nd_region object with the persistent memory
> > > range information so that existing 'nvdimm/pmem' driver
> > > can reserve this into system memory map. This way
> > > 'virtio-pmem' driver uses existing functionality of pmem
> > > driver to register persistent memory compatible for DAX
> > > capable filesystems.
> > > 
> > > This also provides function to perform guest flush over
> > > VIRTIO from 'pmem' driver when userspace performs flush
> > > on DAX memory range.
> > > 
> > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > ---
> > >  drivers/virtio/Kconfig           |   9 ++
> > >  drivers/virtio/Makefile          |   1 +
> > >  drivers/virtio/virtio_pmem.c     | 255
> > >  +++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/virtio_ids.h  |   1 +
> > >  include/uapi/linux/virtio_pmem.h |  40 ++++++
> > >  5 files changed, 306 insertions(+)
> > >  create mode 100644 drivers/virtio/virtio_pmem.c
> > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > 
> > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > index 3589764..a331e23 100644
> > > --- a/drivers/virtio/Kconfig
> > > +++ b/drivers/virtio/Kconfig
> > > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> > >  
> > >  	  If unsure, say Y.
> > >  
> > > +config VIRTIO_PMEM
> > > +	tristate "Support for virtio pmem driver"
> > > +	depends on VIRTIO
> > > +	help
> > > +	This driver provides support for virtio based flushing interface
> > > +	for persistent memory range.
> > > +
> > > +	If unsure, say M.
> > > +
> > >  config VIRTIO_BALLOON
> > >  	tristate "Virtio balloon driver"
> > >  	depends on VIRTIO
> > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > > index 3a2b5c5..cbe91c6 100644
> > > --- a/drivers/virtio/Makefile
> > > +++ b/drivers/virtio/Makefile
> > > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > > new file mode 100644
> > > index 0000000..c22cc87
> > > --- /dev/null
> > > +++ b/drivers/virtio/virtio_pmem.c
> > > @@ -0,0 +1,255 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * virtio_pmem.c: Virtio pmem Driver
> > > + *
> > > + * Discovers persistent memory range information
> > > + * from host and provides a virtio based flushing
> > > + * interface.
> > > + */
> > > +#include <linux/virtio.h>
> > > +#include <linux/module.h>
> > > +#include <linux/virtio_ids.h>
> > > +#include <linux/virtio_config.h>
> > > +#include <uapi/linux/virtio_pmem.h>
> > > +#include <linux/spinlock.h>
> > > +#include <linux/libnvdimm.h>
> > > +#include <linux/nd.h>
> > > +
> > > +struct virtio_pmem_request {
> > > +	/* Host return status corresponding to flush request */
> > > +	int ret;
> > > +
> > > +	/* command name*/
> > > +	char name[16];
> > > +
> > > +	/* Wait queue to process deferred work after ack from host */
> > > +	wait_queue_head_t host_acked;
> > > +	bool done;
> > > +
> > > +	/* Wait queue to process deferred work after virt queue buffer avail */
> > > +	wait_queue_head_t wq_buf;
> > > +	bool wq_buf_avail;
> > > +	struct list_head list;
> > > +};
> > > +
> > > +struct virtio_pmem {
> > > +	struct virtio_device *vdev;
> > > +
> > > +	/* Virtio pmem request queue */
> > > +	struct virtqueue *req_vq;
> > > +
> > > +	/* nvdimm bus registers virtio pmem device */
> > > +	struct nvdimm_bus *nvdimm_bus;
> > > +	struct nvdimm_bus_descriptor nd_desc;
> > > +
> > > +	/* List to store deferred work if virtqueue is full */
> > > +	struct list_head req_list;
> > > +
> > > +	/* Synchronize virtqueue data */
> > > +	spinlock_t pmem_lock;
> > > +
> > > +	/* Memory region information */
> > > +	uint64_t start;
> > > +	uint64_t size;
> > > +};
> > > +
> > > +static struct virtio_device_id id_table[] = {
> > > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > +	{ 0 },
> > > +};
> > > +
> > > + /* The interrupt handler */
> > > +static void host_ack(struct virtqueue *vq)
> > > +{
> > > +	unsigned int len;
> > > +	unsigned long flags;
> > > +	struct virtio_pmem_request *req, *req_buf;
> > > +	struct virtio_pmem *vpmem = vq->vdev->priv;
> > > +
> > > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > +		req->done = true;
> > > +		wake_up(&req->host_acked);
> > > +
> > > +		if (!list_empty(&vpmem->req_list)) {
> > > +			req_buf = list_first_entry(&vpmem->req_list,
> > > +					struct virtio_pmem_request, list);
> > > +			list_del(&vpmem->req_list);
> > > +			req_buf->wq_buf_avail = true;
> > > +			wake_up(&req_buf->wq_buf);
> > > +		}
> > > +	}
> > > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +}
> > > + /* Initialize virt queue */
> > > +static int init_vq(struct virtio_pmem *vpmem)
> > > +{
> > > +	struct virtqueue *vq;
> > > +
> > > +	/* single vq */
> > > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > +				host_ack, "flush_queue");
> > > +	if (IS_ERR(vq))
> > > +		return PTR_ERR(vq);
> > > +
> > > +	spin_lock_init(&vpmem->pmem_lock);
> > > +	INIT_LIST_HEAD(&vpmem->req_list);
> > > +
> > > +	return 0;
> > > +};
> > > +
> > > + /* The request submission function */
> > > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > +	int err;
> > > +	unsigned long flags;
> > > +	struct scatterlist *sgs[2], sg, ret;
> > > +	struct virtio_device *vdev =
> > > +		dev_to_virtio(nd_region->dev.parent->parent);
> > > +	struct virtio_pmem *vpmem = vdev->priv;  
> > 
> > I'm missing a might_sleep() call in this function.  
> 
> I am not sure if we need might_sleep here? 
> We can add it as debugging aid for detecting any problems
> in sleeping from acquired atomic context?

Yes. Since this function sleeps and since some functions that
may run in atomic context call it, it's a good idea to
call might_sleep().

> > > +	struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > +
> > > +	if (!req)
> > > +		return -ENOMEM;
> > > +
> > > +	req->done = req->wq_buf_avail = false;
> > > +	strcpy(req->name, "FLUSH");
> > > +	init_waitqueue_head(&req->host_acked);
> > > +	init_waitqueue_head(&req->wq_buf);
> > > +
> > > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +	sg_init_one(&sg, req->name, strlen(req->name));
> > > +	sgs[0] = &sg;
> > > +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > +	sgs[1] = &ret;  
> > 
> > It seems that sg_init_one() is only setting fields, in this
> > case you can move spin_lock_irqsave() here.  
> 
> yes, will move spin_lock_irqsave here.
> 
> >   
> > > +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > > +	if (err) {
> > > +		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> > > +
> > > +		list_add_tail(&vpmem->req_list, &req->list);
> > > +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > +		/* When host has read buffer, this completes via host_ack */
> > > +		wait_event(req->wq_buf, req->wq_buf_avail);
> > > +		spin_lock_irqsave(&vpmem->pmem_lock, flags);  
> > 
> > Is this error handling code assuming that at some point
> > virtqueue_add_sgs() will succeed for a different thread? If yes,
> > what happens if the assumption is false? That is, what happens if
> > virtqueue_add_sgs() never succeeds anymore?  
> 
> virtqueue_add_sgs will not succeed and corresponding thread should wait.
> All subsequent calling threads should also wait. As soon as there is first
> available free entry(from host), first waiting thread is acknowledged.
> 
> In worst case if Qemu is not utilizing any of the used buffer will keep
> multiple threads waiting. 
> 
> > 
> > Why not just return an error?  
> 
> As per suggestion by Stefan in previous discussion: if the virtqueue is full.  
> Printing a message and failing the flush isn't appropriate.  This thread needs to 
> wait until virtqueue space becomes available.

If virtqueue_add_sgs() is guaranteed to succeed at some point then OK.
Otherwise, you'll get threads getting stuck forever.

> > > +	}
> > > +	virtqueue_kick(vpmem->req_vq);
> > > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +
> > > +	/* When host has read buffer, this completes via host_ack */
> > > +	wait_event(req->host_acked, req->done);
> > > +	err = req->ret;  
> > 
> > If I'm understanding the QEMU code correctly, you're returning EIO
> > from QEMU if fsync() fails. I think this is wrong, since we don't know
> > if EIO in QEMU will be the same EIO in the guest. One way to solve this
> > would be to return 0 for success and 1 for failure from QEMU, and let the
> > guest implementation pick its error code (for your implementation it
> > could be EIO).  
> 
> Makes sense, will change this. 
> 
> Thanks,
> Pankaj 
> >   
> > > +	kfree(req);
> > > +
> > > +	return err;
> > > +};
> > > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > > +
> > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > +{
> > > +	int err = 0;
> > > +	struct resource res;
> > > +	struct virtio_pmem *vpmem;
> > > +	struct nvdimm_bus *nvdimm_bus;
> > > +	struct nd_region_desc ndr_desc;
> > > +	int nid = dev_to_node(&vdev->dev);
> > > +	struct nd_region *nd_region;
> > > +
> > > +	if (!vdev->config->get) {
> > > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> > > +			__func__);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > +			GFP_KERNEL);
> > > +	if (!vpmem) {
> > > +		err = -ENOMEM;
> > > +		goto out_err;
> > > +	}
> > > +
> > > +	vpmem->vdev = vdev;
> > > +	err = init_vq(vpmem);
> > > +	if (err)
> > > +		goto out_err;
> > > +
> > > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > +			start, &vpmem->start);
> > > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > +			size, &vpmem->size);
> > > +
> > > +	res.start = vpmem->start;
> > > +	res.end   = vpmem->start + vpmem->size-1;
> > > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > > +	vpmem->nd_desc.module = THIS_MODULE;
> > > +
> > > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > +						&vpmem->nd_desc);
> > > +	if (!nvdimm_bus)
> > > +		goto out_vq;
> > > +
> > > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > +	memset(&ndr_desc, 0, sizeof(ndr_desc));
> > > +
> > > +	ndr_desc.res = &res;
> > > +	ndr_desc.numa_node = nid;
> > > +	ndr_desc.flush = virtio_pmem_flush;
> > > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > +
> > > +	if (!nd_region)
> > > +		goto out_nd;
> > > +
> > > +	//virtio_device_ready(vdev);
> > > +	return 0;
> > > +out_nd:
> > > +	err = -ENXIO;
> > > +	nvdimm_bus_unregister(nvdimm_bus);
> > > +out_vq:
> > > +	vdev->config->del_vqs(vdev);
> > > +out_err:
> > > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > +	return err;
> > > +}
> > > +
> > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > +{
> > > +	struct virtio_pmem *vpmem = vdev->priv;
> > > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > +
> > > +	nvdimm_bus_unregister(nvdimm_bus);
> > > +	vdev->config->del_vqs(vdev);
> > > +	kfree(vpmem);
> > > +}
> > > +
> > > +#ifdef CONFIG_PM_SLEEP
> > > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > > +{
> > > +	/* todo: handle freeze function */
> > > +	return -EPERM;
> > > +}
> > > +
> > > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > > +{
> > > +	/* todo: handle restore function */
> > > +	return -EPERM;
> > > +}
> > > +#endif
> > > +
> > > +
> > > +static struct virtio_driver virtio_pmem_driver = {
> > > +	.driver.name		= KBUILD_MODNAME,
> > > +	.driver.owner		= THIS_MODULE,
> > > +	.id_table		= id_table,
> > > +	.probe			= virtio_pmem_probe,
> > > +	.remove			= virtio_pmem_remove,
> > > +#ifdef CONFIG_PM_SLEEP
> > > +	.freeze                 = virtio_pmem_freeze,
> > > +	.restore                = virtio_pmem_restore,
> > > +#endif
> > > +};
> > > +
> > > +module_virtio_driver(virtio_pmem_driver);
> > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > +MODULE_LICENSE("GPL");
> > > diff --git a/include/uapi/linux/virtio_ids.h
> > > b/include/uapi/linux/virtio_ids.h
> > > index 6d5c3b2..3463895 100644
> > > --- a/include/uapi/linux/virtio_ids.h
> > > +++ b/include/uapi/linux/virtio_ids.h
> > > @@ -43,5 +43,6 @@
> > >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> > >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> > >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> > >  
> > >  #endif /* _LINUX_VIRTIO_IDS_H */
> > > diff --git a/include/uapi/linux/virtio_pmem.h
> > > b/include/uapi/linux/virtio_pmem.h
> > > new file mode 100644
> > > index 0000000..c7c22a5
> > > --- /dev/null
> > > +++ b/include/uapi/linux/virtio_pmem.h
> > > @@ -0,0 +1,40 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > > + * anyone can use the definitions to implement compatible drivers/servers:
> > > + *
> > > + *
> > > + * Redistribution and use in source and binary forms, with or without
> > > + * modification, are permitted provided that the following conditions
> > > + * are met:
> > > + * 1. Redistributions of source code must retain the above copyright
> > > + *    notice, this list of conditions and the following disclaimer.
> > > + * 2. Redistributions in binary form must reproduce the above copyright
> > > + *    notice, this list of conditions and the following disclaimer in the
> > > + *    documentation and/or other materials provided with the distribution.
> > > + * 3. Neither the name of IBM nor the names of its contributors
> > > + *    may be used to endorse or promote products derived from this
> > > software
> > > + *    without specific prior written permission.
> > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > ``AS IS''
> > > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> > > THE
> > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > > PURPOSE
> > > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > > CONSEQUENTIAL
> > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > > STRICT
> > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
> > > WAY
> > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > > + * SUCH DAMAGE.
> > > + *
> > > + * Copyright (C) Red Hat, Inc., 2018-2019
> > > + * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
> > > + */
> > > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > > +
> > > +struct virtio_pmem_config {
> > > +	__le64 start;
> > > +	__le64 size;
> > > +};
> > > +#endif  
> > 
> > 
> >   
>
Pankaj Gupta Sept. 14, 2018, 12:13 p.m. UTC | #7
> 
> > Hi Luiz,
> > 
> > Thanks for the review.
> > 
> > >   
> > > > This patch adds virtio-pmem driver for KVM guest.
> > > > 
> > > > Guest reads the persistent memory range information from
> > > > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > > > creates a nd_region object with the persistent memory
> > > > range information so that existing 'nvdimm/pmem' driver
> > > > can reserve this into system memory map. This way
> > > > 'virtio-pmem' driver uses existing functionality of pmem
> > > > driver to register persistent memory compatible for DAX
> > > > capable filesystems.
> > > > 
> > > > This also provides function to perform guest flush over
> > > > VIRTIO from 'pmem' driver when userspace performs flush
> > > > on DAX memory range.
> > > > 
> > > > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > > > ---
> > > >  drivers/virtio/Kconfig           |   9 ++
> > > >  drivers/virtio/Makefile          |   1 +
> > > >  drivers/virtio/virtio_pmem.c     | 255
> > > >  +++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/virtio_ids.h  |   1 +
> > > >  include/uapi/linux/virtio_pmem.h |  40 ++++++
> > > >  5 files changed, 306 insertions(+)
> > > >  create mode 100644 drivers/virtio/virtio_pmem.c
> > > >  create mode 100644 include/uapi/linux/virtio_pmem.h
> > > > 
> > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > index 3589764..a331e23 100644
> > > > --- a/drivers/virtio/Kconfig
> > > > +++ b/drivers/virtio/Kconfig
> > > > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> > > >  
> > > >  	  If unsure, say Y.
> > > >  
> > > > +config VIRTIO_PMEM
> > > > +	tristate "Support for virtio pmem driver"
> > > > +	depends on VIRTIO
> > > > +	help
> > > > +	This driver provides support for virtio based flushing interface
> > > > +	for persistent memory range.
> > > > +
> > > > +	If unsure, say M.
> > > > +
> > > >  config VIRTIO_BALLOON
> > > >  	tristate "Virtio balloon driver"
> > > >  	depends on VIRTIO
> > > > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > > > index 3a2b5c5..cbe91c6 100644
> > > > --- a/drivers/virtio/Makefile
> > > > +++ b/drivers/virtio/Makefile
> > > > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> > > >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> > > >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> > > >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > > > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > > > diff --git a/drivers/virtio/virtio_pmem.c
> > > > b/drivers/virtio/virtio_pmem.c
> > > > new file mode 100644
> > > > index 0000000..c22cc87
> > > > --- /dev/null
> > > > +++ b/drivers/virtio/virtio_pmem.c
> > > > @@ -0,0 +1,255 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * virtio_pmem.c: Virtio pmem Driver
> > > > + *
> > > > + * Discovers persistent memory range information
> > > > + * from host and provides a virtio based flushing
> > > > + * interface.
> > > > + */
> > > > +#include <linux/virtio.h>
> > > > +#include <linux/module.h>
> > > > +#include <linux/virtio_ids.h>
> > > > +#include <linux/virtio_config.h>
> > > > +#include <uapi/linux/virtio_pmem.h>
> > > > +#include <linux/spinlock.h>
> > > > +#include <linux/libnvdimm.h>
> > > > +#include <linux/nd.h>
> > > > +
> > > > +struct virtio_pmem_request {
> > > > +	/* Host return status corresponding to flush request */
> > > > +	int ret;
> > > > +
> > > > +	/* command name*/
> > > > +	char name[16];
> > > > +
> > > > +	/* Wait queue to process deferred work after ack from host */
> > > > +	wait_queue_head_t host_acked;
> > > > +	bool done;
> > > > +
> > > > +	/* Wait queue to process deferred work after virt queue buffer avail
> > > > */
> > > > +	wait_queue_head_t wq_buf;
> > > > +	bool wq_buf_avail;
> > > > +	struct list_head list;
> > > > +};
> > > > +
> > > > +struct virtio_pmem {
> > > > +	struct virtio_device *vdev;
> > > > +
> > > > +	/* Virtio pmem request queue */
> > > > +	struct virtqueue *req_vq;
> > > > +
> > > > +	/* nvdimm bus registers virtio pmem device */
> > > > +	struct nvdimm_bus *nvdimm_bus;
> > > > +	struct nvdimm_bus_descriptor nd_desc;
> > > > +
> > > > +	/* List to store deferred work if virtqueue is full */
> > > > +	struct list_head req_list;
> > > > +
> > > > +	/* Synchronize virtqueue data */
> > > > +	spinlock_t pmem_lock;
> > > > +
> > > > +	/* Memory region information */
> > > > +	uint64_t start;
> > > > +	uint64_t size;
> > > > +};
> > > > +
> > > > +static struct virtio_device_id id_table[] = {
> > > > +	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > > > +	{ 0 },
> > > > +};
> > > > +
> > > > + /* The interrupt handler */
> > > > +static void host_ack(struct virtqueue *vq)
> > > > +{
> > > > +	unsigned int len;
> > > > +	unsigned long flags;
> > > > +	struct virtio_pmem_request *req, *req_buf;
> > > > +	struct virtio_pmem *vpmem = vq->vdev->priv;
> > > > +
> > > > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > > > +		req->done = true;
> > > > +		wake_up(&req->host_acked);
> > > > +
> > > > +		if (!list_empty(&vpmem->req_list)) {
> > > > +			req_buf = list_first_entry(&vpmem->req_list,
> > > > +					struct virtio_pmem_request, list);
> > > > +			list_del(&vpmem->req_list);
> > > > +			req_buf->wq_buf_avail = true;
> > > > +			wake_up(&req_buf->wq_buf);
> > > > +		}
> > > > +	}
> > > > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +}
> > > > + /* Initialize virt queue */
> > > > +static int init_vq(struct virtio_pmem *vpmem)
> > > > +{
> > > > +	struct virtqueue *vq;
> > > > +
> > > > +	/* single vq */
> > > > +	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > > > +				host_ack, "flush_queue");
> > > > +	if (IS_ERR(vq))
> > > > +		return PTR_ERR(vq);
> > > > +
> > > > +	spin_lock_init(&vpmem->pmem_lock);
> > > > +	INIT_LIST_HEAD(&vpmem->req_list);
> > > > +
> > > > +	return 0;
> > > > +};
> > > > +
> > > > + /* The request submission function */
> > > > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > > > +{
> > > > +	int err;
> > > > +	unsigned long flags;
> > > > +	struct scatterlist *sgs[2], sg, ret;
> > > > +	struct virtio_device *vdev =
> > > > +		dev_to_virtio(nd_region->dev.parent->parent);
> > > > +	struct virtio_pmem *vpmem = vdev->priv;
> > > 
> > > I'm missing a might_sleep() call in this function.
> > 
> > I am not sure if we need might_sleep here?
> > We can add it as debugging aid for detecting any problems
> > in sleeping from acquired atomic context?
> 
> Yes. Since this function sleeps and since some functions that
> may run in atomic context call it, it's a good idea to
> call might_sleep().

o.k Will add might_sleep.

> 
> > > > +	struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> > > > +
> > > > +	if (!req)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	req->done = req->wq_buf_avail = false;
> > > > +	strcpy(req->name, "FLUSH");
> > > > +	init_waitqueue_head(&req->host_acked);
> > > > +	init_waitqueue_head(&req->wq_buf);
> > > > +
> > > > +	spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > > +	sg_init_one(&sg, req->name, strlen(req->name));
> > > > +	sgs[0] = &sg;
> > > > +	sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > > +	sgs[1] = &ret;
> > > 
> > > It seems that sg_init_one() is only setting fields, in this
> > > case you can move spin_lock_irqsave() here.
> > 
> > yes, will move spin_lock_irqsave here.
> > 
> > >   
> > > > +	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > > > +	if (err) {
> > > > +		dev_err(&vdev->dev, "failed to send command to virtio pmem
> > > > device\n");
> > > > +
> > > > +		list_add_tail(&vpmem->req_list, &req->list);
> > > > +		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > +		/* When host has read buffer, this completes via host_ack */
> > > > +		wait_event(req->wq_buf, req->wq_buf_avail);
> > > > +		spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > 
> > > Is this error handling code assuming that at some point
> > > virtqueue_add_sgs() will succeed for a different thread? If yes,
> > > what happens if the assumption is false? That is, what happens if
> > > virtqueue_add_sgs() never succeeds anymore?
> > 
> > virtqueue_add_sgs will not succeed and corresponding thread should wait.
> > All subsequent calling threads should also wait. As soon as there is first
> > available free entry(from host), first waiting thread is acknowledged.
> > 
> > In worst case if Qemu is not utilizing any of the used buffer will keep
> > multiple threads waiting.
> > 
> > > 
> > > Why not just return an error?
> > 
> > As per suggestion by Stefan in previous discussion: if the virtqueue is
> > full.
> > Printing a message and failing the flush isn't appropriate.  This thread
> > needs to
> > wait until virtqueue space becomes available.
> 
> If virtqueue_add_sgs() is guaranteed to succeed at some point then OK.
> Otherwise, you'll get threads getting stuck forever.

We are handling here 'virtqueue_add_sgs' failure when virtqueue is full.

For regular virtqueue full case, guest threads should wait. This scales for
more number of fsync requests than current virtqueue size and avoids returning
failure to userspace.

Even if we return error when qemu threads are stuck, every time we return error
unless threads actually progress and free an entry in virtqueue. 
 
> 
> > > > +	}
> > > > +	virtqueue_kick(vpmem->req_vq);
> > > > +	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > > +
> > > > +	/* When host has read buffer, this completes via host_ack */
> > > > +	wait_event(req->host_acked, req->done);
> > > > +	err = req->ret;
> > > 
> > > If I'm understanding the QEMU code correctly, you're returning EIO
> > > from QEMU if fsync() fails. I think this is wrong, since we don't know
> > > if EIO in QEMU will be the same EIO in the guest. One way to solve this
> > > would be to return 0 for success and 1 for failure from QEMU, and let the
> > > guest implementation pick its error code (for your implementation it
> > > could be EIO).
> > 
> > Makes sense, will change this.
> > 
> > Thanks,
> > Pankaj
> > >   
> > > > +	kfree(req);
> > > > +
> > > > +	return err;
> > > > +};
> > > > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > > > +
> > > > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > +{
> > > > +	int err = 0;
> > > > +	struct resource res;
> > > > +	struct virtio_pmem *vpmem;
> > > > +	struct nvdimm_bus *nvdimm_bus;
> > > > +	struct nd_region_desc ndr_desc;
> > > > +	int nid = dev_to_node(&vdev->dev);
> > > > +	struct nd_region *nd_region;
> > > > +
> > > > +	if (!vdev->config->get) {
> > > > +		dev_err(&vdev->dev, "%s failure: config disabled\n",
> > > > +			__func__);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > > > +			GFP_KERNEL);
> > > > +	if (!vpmem) {
> > > > +		err = -ENOMEM;
> > > > +		goto out_err;
> > > > +	}
> > > > +
> > > > +	vpmem->vdev = vdev;
> > > > +	err = init_vq(vpmem);
> > > > +	if (err)
> > > > +		goto out_err;
> > > > +
> > > > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > > +			start, &vpmem->start);
> > > > +	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > > > +			size, &vpmem->size);
> > > > +
> > > > +	res.start = vpmem->start;
> > > > +	res.end   = vpmem->start + vpmem->size-1;
> > > > +	vpmem->nd_desc.provider_name = "virtio-pmem";
> > > > +	vpmem->nd_desc.module = THIS_MODULE;
> > > > +
> > > > +	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > > > +						&vpmem->nd_desc);
> > > > +	if (!nvdimm_bus)
> > > > +		goto out_vq;
> > > > +
> > > > +	dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > > > +	memset(&ndr_desc, 0, sizeof(ndr_desc));
> > > > +
> > > > +	ndr_desc.res = &res;
> > > > +	ndr_desc.numa_node = nid;
> > > > +	ndr_desc.flush = virtio_pmem_flush;
> > > > +	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > > > +	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > > > +
> > > > +	if (!nd_region)
> > > > +		goto out_nd;
> > > > +
> > > > +	//virtio_device_ready(vdev);
> > > > +	return 0;
> > > > +out_nd:
> > > > +	err = -ENXIO;
> > > > +	nvdimm_bus_unregister(nvdimm_bus);
> > > > +out_vq:
> > > > +	vdev->config->del_vqs(vdev);
> > > > +out_err:
> > > > +	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > > > +	return err;
> > > > +}
> > > > +
> > > > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > > > +{
> > > > +	struct virtio_pmem *vpmem = vdev->priv;
> > > > +	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > > > +
> > > > +	nvdimm_bus_unregister(nvdimm_bus);
> > > > +	vdev->config->del_vqs(vdev);
> > > > +	kfree(vpmem);
> > > > +}
> > > > +
> > > > +#ifdef CONFIG_PM_SLEEP
> > > > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > > > +{
> > > > +	/* todo: handle freeze function */
> > > > +	return -EPERM;
> > > > +}
> > > > +
> > > > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > > > +{
> > > > +	/* todo: handle restore function */
> > > > +	return -EPERM;
> > > > +}
> > > > +#endif
> > > > +
> > > > +
> > > > +static struct virtio_driver virtio_pmem_driver = {
> > > > +	.driver.name		= KBUILD_MODNAME,
> > > > +	.driver.owner		= THIS_MODULE,
> > > > +	.id_table		= id_table,
> > > > +	.probe			= virtio_pmem_probe,
> > > > +	.remove			= virtio_pmem_remove,
> > > > +#ifdef CONFIG_PM_SLEEP
> > > > +	.freeze                 = virtio_pmem_freeze,
> > > > +	.restore                = virtio_pmem_restore,
> > > > +#endif
> > > > +};
> > > > +
> > > > +module_virtio_driver(virtio_pmem_driver);
> > > > +MODULE_DEVICE_TABLE(virtio, id_table);
> > > > +MODULE_DESCRIPTION("Virtio pmem driver");
> > > > +MODULE_LICENSE("GPL");
> > > > diff --git a/include/uapi/linux/virtio_ids.h
> > > > b/include/uapi/linux/virtio_ids.h
> > > > index 6d5c3b2..3463895 100644
> > > > --- a/include/uapi/linux/virtio_ids.h
> > > > +++ b/include/uapi/linux/virtio_ids.h
> > > > @@ -43,5 +43,6 @@
> > > >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> > > >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> > > >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > > > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> > > >  
> > > >  #endif /* _LINUX_VIRTIO_IDS_H */
> > > > diff --git a/include/uapi/linux/virtio_pmem.h
> > > > b/include/uapi/linux/virtio_pmem.h
> > > > new file mode 100644
> > > > index 0000000..c7c22a5
> > > > --- /dev/null
> > > > +++ b/include/uapi/linux/virtio_pmem.h
> > > > @@ -0,0 +1,40 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed
> > > > so
> > > > + * anyone can use the definitions to implement compatible
> > > > drivers/servers:
> > > > + *
> > > > + *
> > > > + * Redistribution and use in source and binary forms, with or without
> > > > + * modification, are permitted provided that the following conditions
> > > > + * are met:
> > > > + * 1. Redistributions of source code must retain the above copyright
> > > > + *    notice, this list of conditions and the following disclaimer.
> > > > + * 2. Redistributions in binary form must reproduce the above
> > > > copyright
> > > > + *    notice, this list of conditions and the following disclaimer in
> > > > the
> > > > + *    documentation and/or other materials provided with the
> > > > distribution.
> > > > + * 3. Neither the name of IBM nor the names of its contributors
> > > > + *    may be used to endorse or promote products derived from this
> > > > software
> > > > + *    without specific prior written permission.
> > > > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > > > ``AS IS''
> > > > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> > > > TO,
> > > > THE
> > > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > > > PURPOSE
> > > > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > > > CONSEQUENTIAL
> > > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
> > > > GOODS
> > > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> > > > INTERRUPTION)
> > > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > > > STRICT
> > > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
> > > > ANY
> > > > WAY
> > > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
> > > > OF
> > > > + * SUCH DAMAGE.
> > > > + *
> > > > + * Copyright (C) Red Hat, Inc., 2018-2019
> > > > + * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
> > > > + */
> > > > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > > > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > > > +
> > > > +struct virtio_pmem_config {
> > > > +	__le64 start;
> > > > +	__le64 size;
> > > > +};
> > > > +#endif
> > > 
> > > 
> > >   
> > 
> 
>
Dan Williams Sept. 22, 2018, 1:08 a.m. UTC | #8
On Fri, Aug 31, 2018 at 6:32 AM Pankaj Gupta <pagupta@redhat.com> wrote:
>
> This patch adds virtio-pmem driver for KVM guest.
>
> Guest reads the persistent memory range information from
> Qemu over VIRTIO and registers it on nvdimm_bus. It also
> creates a nd_region object with the persistent memory
> range information so that existing 'nvdimm/pmem' driver
> can reserve this into system memory map. This way
> 'virtio-pmem' driver uses existing functionality of pmem
> driver to register persistent memory compatible for DAX
> capable filesystems.
>
> This also provides function to perform guest flush over
> VIRTIO from 'pmem' driver when userspace performs flush
> on DAX memory range.
>
> Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> ---
>  drivers/virtio/Kconfig           |   9 ++
>  drivers/virtio/Makefile          |   1 +
>  drivers/virtio/virtio_pmem.c     | 255 +++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/virtio_ids.h  |   1 +
>  include/uapi/linux/virtio_pmem.h |  40 ++++++
>  5 files changed, 306 insertions(+)
>  create mode 100644 drivers/virtio/virtio_pmem.c
>  create mode 100644 include/uapi/linux/virtio_pmem.h
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 3589764..a331e23 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
>
>           If unsure, say Y.
>
> +config VIRTIO_PMEM
> +       tristate "Support for virtio pmem driver"
> +       depends on VIRTIO
> +       help
> +       This driver provides support for virtio based flushing interface
> +       for persistent memory range.
> +
> +       If unsure, say M.
> +
>  config VIRTIO_BALLOON
>         tristate "Virtio balloon driver"
>         depends on VIRTIO
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 3a2b5c5..cbe91c6 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
>  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
>  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> new file mode 100644
> index 0000000..c22cc87
> --- /dev/null
> +++ b/drivers/virtio/virtio_pmem.c
> @@ -0,0 +1,255 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * virtio_pmem.c: Virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + */
> +#include <linux/virtio.h>
> +#include <linux/module.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/virtio_config.h>
> +#include <uapi/linux/virtio_pmem.h>
> +#include <linux/spinlock.h>
> +#include <linux/libnvdimm.h>
> +#include <linux/nd.h>

I think we need to split this driver into 2 files,
drivers/virtio/pmem.c would discover and register the virtual pmem
device with the libnvdimm core, and drivers/nvdimm/virtio.c would
house virtio_pmem_flush().

> +
> +struct virtio_pmem_request {
> +       /* Host return status corresponding to flush request */
> +       int ret;
> +
> +       /* command name*/
> +       char name[16];
> +
> +       /* Wait queue to process deferred work after ack from host */
> +       wait_queue_head_t host_acked;
> +       bool done;
> +
> +       /* Wait queue to process deferred work after virt queue buffer avail */
> +       wait_queue_head_t wq_buf;

Why does this need wait_queue's per request? shouldn't this be per-device?

> +       bool wq_buf_avail;
> +       struct list_head list;
> +};
> +
> +struct virtio_pmem {
> +       struct virtio_device *vdev;
> +
> +       /* Virtio pmem request queue */
> +       struct virtqueue *req_vq;
> +
> +       /* nvdimm bus registers virtio pmem device */
> +       struct nvdimm_bus *nvdimm_bus;
> +       struct nvdimm_bus_descriptor nd_desc;
> +
> +       /* List to store deferred work if virtqueue is full */
> +       struct list_head req_list;
> +
> +       /* Synchronize virtqueue data */
> +       spinlock_t pmem_lock;
> +
> +       /* Memory region information */
> +       uint64_t start;
> +       uint64_t size;
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +       { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> +       { 0 },
> +};
> +
> + /* The interrupt handler */
> +static void host_ack(struct virtqueue *vq)
> +{
> +       unsigned int len;
> +       unsigned long flags;
> +       struct virtio_pmem_request *req, *req_buf;
> +       struct virtio_pmem *vpmem = vq->vdev->priv;
> +
> +       spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +       while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> +               req->done = true;
> +               wake_up(&req->host_acked);
> +
> +               if (!list_empty(&vpmem->req_list)) {
> +                       req_buf = list_first_entry(&vpmem->req_list,
> +                                       struct virtio_pmem_request, list);
> +                       list_del(&vpmem->req_list);
> +                       req_buf->wq_buf_avail = true;
> +                       wake_up(&req_buf->wq_buf);
> +               }
> +       }
> +       spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +}
> + /* Initialize virt queue */
> +static int init_vq(struct virtio_pmem *vpmem)
> +{
> +       struct virtqueue *vq;
> +
> +       /* single vq */
> +       vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> +                               host_ack, "flush_queue");
> +       if (IS_ERR(vq))
> +               return PTR_ERR(vq);
> +
> +       spin_lock_init(&vpmem->pmem_lock);
> +       INIT_LIST_HEAD(&vpmem->req_list);
> +
> +       return 0;
> +};
> +
> + /* The request submission function */
> +static int virtio_pmem_flush(struct nd_region *nd_region)
> +{
> +       int err;
> +       unsigned long flags;
> +       struct scatterlist *sgs[2], sg, ret;
> +       struct virtio_device *vdev =
> +               dev_to_virtio(nd_region->dev.parent->parent);

That's a long de-ref chain I would just stash the vdev in
nd_region->provider_data.

> +       struct virtio_pmem *vpmem = vdev->priv;
> +       struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
> +
> +       if (!req)
> +               return -ENOMEM;
> +
> +       req->done = req->wq_buf_avail = false;
> +       strcpy(req->name, "FLUSH");
> +       init_waitqueue_head(&req->host_acked);
> +       init_waitqueue_head(&req->wq_buf);
> +
> +       spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +       sg_init_one(&sg, req->name, strlen(req->name));
> +       sgs[0] = &sg;
> +       sg_init_one(&ret, &req->ret, sizeof(req->ret));
> +       sgs[1] = &ret;
> +       err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> +       if (err) {
> +               dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
> +
> +               list_add_tail(&vpmem->req_list, &req->list);
> +               spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +               /* When host has read buffer, this completes via host_ack */
> +               wait_event(req->wq_buf, req->wq_buf_avail);
> +               spin_lock_irqsave(&vpmem->pmem_lock, flags);
> +       }
> +       virtqueue_kick(vpmem->req_vq);
> +       spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> +
> +       /* When host has read buffer, this completes via host_ack */
> +       wait_event(req->host_acked, req->done);

Hmm, this seems awkward if this is called from pmem_make_request. If
we need to wait for completion that should be managed by the guest
block layer. I.e. make_request should just queue request and then
trigger bio_endio() when the response comes back.

However this does mean that nvdimm_flush() becomes asynchronous. So
maybe we need to pass in a 'sync' flag or the bio directly to indicate
this is an asynchronous flush request from pmem_make_request() vs a
synchronous one from nsio_rw_bytes().

> +       err = req->ret;
> +       kfree(req);
> +
> +       return err;
> +};
> +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> +
> +static int virtio_pmem_probe(struct virtio_device *vdev)
> +{
> +       int err = 0;
> +       struct resource res;
> +       struct virtio_pmem *vpmem;
> +       struct nvdimm_bus *nvdimm_bus;
> +       struct nd_region_desc ndr_desc;
> +       int nid = dev_to_node(&vdev->dev);
> +       struct nd_region *nd_region;
> +
> +       if (!vdev->config->get) {
> +               dev_err(&vdev->dev, "%s failure: config disabled\n",
> +                       __func__);
> +               return -EINVAL;
> +       }
> +
> +       vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> +                       GFP_KERNEL);
> +       if (!vpmem) {
> +               err = -ENOMEM;
> +               goto out_err;
> +       }
> +
> +       vpmem->vdev = vdev;
> +       err = init_vq(vpmem);
> +       if (err)
> +               goto out_err;
> +
> +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +                       start, &vpmem->start);
> +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> +                       size, &vpmem->size);
> +
> +       res.start = vpmem->start;
> +       res.end   = vpmem->start + vpmem->size-1;
> +       vpmem->nd_desc.provider_name = "virtio-pmem";
> +       vpmem->nd_desc.module = THIS_MODULE;
> +
> +       vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> +                                               &vpmem->nd_desc);
> +       if (!nvdimm_bus)
> +               goto out_vq;
> +
> +       dev_set_drvdata(&vdev->dev, nvdimm_bus);
> +       memset(&ndr_desc, 0, sizeof(ndr_desc));
> +
> +       ndr_desc.res = &res;
> +       ndr_desc.numa_node = nid;
> +       ndr_desc.flush = virtio_pmem_flush;
> +       set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> +       nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> +
> +       if (!nd_region)
> +               goto out_nd;
> +
> +       //virtio_device_ready(vdev);
> +       return 0;
> +out_nd:
> +       err = -ENXIO;
> +       nvdimm_bus_unregister(nvdimm_bus);
> +out_vq:
> +       vdev->config->del_vqs(vdev);
> +out_err:
> +       dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> +       return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> +       struct virtio_pmem *vpmem = vdev->priv;
> +       struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> +
> +       nvdimm_bus_unregister(nvdimm_bus);
> +       vdev->config->del_vqs(vdev);
> +       kfree(vpmem);
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static int virtio_pmem_freeze(struct virtio_device *vdev)
> +{
> +       /* todo: handle freeze function */
> +       return -EPERM;
> +}
> +
> +static int virtio_pmem_restore(struct virtio_device *vdev)
> +{
> +       /* todo: handle restore function */
> +       return -EPERM;
> +}
> +#endif

As far as I can see there's nothing to do on a power transition, I
would just omit this completely.

> +
> +
> +static struct virtio_driver virtio_pmem_driver = {
> +       .driver.name            = KBUILD_MODNAME,
> +       .driver.owner           = THIS_MODULE,
> +       .id_table               = id_table,
> +       .probe                  = virtio_pmem_probe,
> +       .remove                 = virtio_pmem_remove,
> +#ifdef CONFIG_PM_SLEEP
> +       .freeze                 = virtio_pmem_freeze,
> +       .restore                = virtio_pmem_restore,
> +#endif
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2..3463895 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
>  #define VIRTIO_ID_INPUT        18 /* virtio input */
>  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
>  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
>
>  #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
> new file mode 100644
> index 0000000..c7c22a5
> --- /dev/null
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> + * anyone can use the definitions to implement compatible drivers/servers:

The SPDX identifier does not match this BSD license, and the whole
point of the SPDX identifier is to get out of the need to have these
large text blobs of license goop.

> + *
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
> + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + *
> + * Copyright (C) Red Hat, Inc., 2018-2019
> + * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
> + */
> +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> +#define _UAPI_LINUX_VIRTIO_PMEM_H
> +
> +struct virtio_pmem_config {
> +       __le64 start;
> +       __le64 size;
> +};
> +#endif

Why does this need to be in the uapi?
Pankaj Gupta Sept. 24, 2018, 9:41 a.m. UTC | #9
Hi Dan,

Thanks for the review. Please find my reply inline.

> > This patch adds virtio-pmem driver for KVM guest.
> >
> > Guest reads the persistent memory range information from
> > Qemu over VIRTIO and registers it on nvdimm_bus. It also
> > creates a nd_region object with the persistent memory
> > range information so that existing 'nvdimm/pmem' driver
> > can reserve this into system memory map. This way
> > 'virtio-pmem' driver uses existing functionality of pmem
> > driver to register persistent memory compatible for DAX
> > capable filesystems.
> >
> > This also provides function to perform guest flush over
> > VIRTIO from 'pmem' driver when userspace performs flush
> > on DAX memory range.
> >
> > Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
> > ---
> >  drivers/virtio/Kconfig           |   9 ++
> >  drivers/virtio/Makefile          |   1 +
> >  drivers/virtio/virtio_pmem.c     | 255
> >  +++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/virtio_ids.h  |   1 +
> >  include/uapi/linux/virtio_pmem.h |  40 ++++++
> >  5 files changed, 306 insertions(+)
> >  create mode 100644 drivers/virtio/virtio_pmem.c
> >  create mode 100644 include/uapi/linux/virtio_pmem.h
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 3589764..a331e23 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -42,6 +42,15 @@ config VIRTIO_PCI_LEGACY
> >
> >           If unsure, say Y.
> >
> > +config VIRTIO_PMEM
> > +       tristate "Support for virtio pmem driver"
> > +       depends on VIRTIO
> > +       help
> > +       This driver provides support for virtio based flushing interface
> > +       for persistent memory range.
> > +
> > +       If unsure, say M.
> > +
> >  config VIRTIO_BALLOON
> >         tristate "Virtio balloon driver"
> >         depends on VIRTIO
> > diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> > index 3a2b5c5..cbe91c6 100644
> > --- a/drivers/virtio/Makefile
> > +++ b/drivers/virtio/Makefile
> > @@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
> >  virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
> >  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> >  obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
> > +obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
> > diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
> > new file mode 100644
> > index 0000000..c22cc87
> > --- /dev/null
> > +++ b/drivers/virtio/virtio_pmem.c
> > @@ -0,0 +1,255 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * virtio_pmem.c: Virtio pmem Driver
> > + *
> > + * Discovers persistent memory range information
> > + * from host and provides a virtio based flushing
> > + * interface.
> > + */
> > +#include <linux/virtio.h>
> > +#include <linux/module.h>
> > +#include <linux/virtio_ids.h>
> > +#include <linux/virtio_config.h>
> > +#include <uapi/linux/virtio_pmem.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/libnvdimm.h>
> > +#include <linux/nd.h>
> 
> I think we need to split this driver into 2 files,
> drivers/virtio/pmem.c would discover and register the virtual pmem
> device with the libnvdimm core, and drivers/nvdimm/virtio.c would
> house virtio_pmem_flush().

o.k. Will split the driver into two files as suggested.

> 
> > +
> > +struct virtio_pmem_request {
> > +       /* Host return status corresponding to flush request */
> > +       int ret;
> > +
> > +       /* command name*/
> > +       char name[16];
> > +
> > +       /* Wait queue to process deferred work after ack from host */
> > +       wait_queue_head_t host_acked;
> > +       bool done;
> > +
> > +       /* Wait queue to process deferred work after virt queue buffer
> > avail */
> > +       wait_queue_head_t wq_buf;
> 
> Why does this need wait_queue's per request? shouldn't this be per-device?

This is used to wait flush calling threads when virtio queue is full. 
wait_queue in request struct binds waitqueue and request. When host acknowledges
guest, first waiting request is selected and corresponding thread is woken-up.

Alternatively, we can use "add_wait_queue_exclusive" with device wait_queue.
This will wake up only one exclusive process waiting. This will avoid using 
additional list for tracking.

> 
> > +       bool wq_buf_avail;
> > +       struct list_head list;
> > +};
> > +
> > +struct virtio_pmem {
> > +       struct virtio_device *vdev;
> > +
> > +       /* Virtio pmem request queue */
> > +       struct virtqueue *req_vq;
> > +
> > +       /* nvdimm bus registers virtio pmem device */
> > +       struct nvdimm_bus *nvdimm_bus;
> > +       struct nvdimm_bus_descriptor nd_desc;
> > +
> > +       /* List to store deferred work if virtqueue is full */
> > +       struct list_head req_list;
> > +
> > +       /* Synchronize virtqueue data */
> > +       spinlock_t pmem_lock;
> > +
> > +       /* Memory region information */
> > +       uint64_t start;
> > +       uint64_t size;
> > +};
> > +
> > +static struct virtio_device_id id_table[] = {
> > +       { VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
> > +       { 0 },
> > +};
> > +
> > + /* The interrupt handler */
> > +static void host_ack(struct virtqueue *vq)
> > +{
> > +       unsigned int len;
> > +       unsigned long flags;
> > +       struct virtio_pmem_request *req, *req_buf;
> > +       struct virtio_pmem *vpmem = vq->vdev->priv;
> > +
> > +       spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +       while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
> > +               req->done = true;
> > +               wake_up(&req->host_acked);
> > +
> > +               if (!list_empty(&vpmem->req_list)) {
> > +                       req_buf = list_first_entry(&vpmem->req_list,
> > +                                       struct virtio_pmem_request, list);
> > +                       list_del(&vpmem->req_list);
> > +                       req_buf->wq_buf_avail = true;
> > +                       wake_up(&req_buf->wq_buf);
> > +               }
> > +       }
> > +       spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +}
> > + /* Initialize virt queue */
> > +static int init_vq(struct virtio_pmem *vpmem)
> > +{
> > +       struct virtqueue *vq;
> > +
> > +       /* single vq */
> > +       vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
> > +                               host_ack, "flush_queue");
> > +       if (IS_ERR(vq))
> > +               return PTR_ERR(vq);
> > +
> > +       spin_lock_init(&vpmem->pmem_lock);
> > +       INIT_LIST_HEAD(&vpmem->req_list);
> > +
> > +       return 0;
> > +};
> > +
> > + /* The request submission function */
> > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > +{
> > +       int err;
> > +       unsigned long flags;
> > +       struct scatterlist *sgs[2], sg, ret;
> > +       struct virtio_device *vdev =
> > +               dev_to_virtio(nd_region->dev.parent->parent);
> 
> That's a long de-ref chain I would just stash the vdev in
> nd_region->provider_data.

Sure. Will use 'nd_region->provider_data' for vdev.

> 
> > +       struct virtio_pmem *vpmem = vdev->priv;
> > +       struct virtio_pmem_request *req = kmalloc(sizeof(*req),
> > GFP_KERNEL);
> > +
> > +       if (!req)
> > +               return -ENOMEM;
> > +
> > +       req->done = req->wq_buf_avail = false;
> > +       strcpy(req->name, "FLUSH");
> > +       init_waitqueue_head(&req->host_acked);
> > +       init_waitqueue_head(&req->wq_buf);
> > +
> > +       spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +       sg_init_one(&sg, req->name, strlen(req->name));
> > +       sgs[0] = &sg;
> > +       sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > +       sgs[1] = &ret;
> > +       err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
> > +       if (err) {
> > +               dev_err(&vdev->dev, "failed to send command to virtio pmem
> > device\n");
> > +
> > +               list_add_tail(&vpmem->req_list, &req->list);
> > +               spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +               /* When host has read buffer, this completes via host_ack
> > */
> > +               wait_event(req->wq_buf, req->wq_buf_avail);
> > +               spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > +       }
> > +       virtqueue_kick(vpmem->req_vq);
> > +       spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > +
> > +       /* When host has read buffer, this completes via host_ack */
> > +       wait_event(req->host_acked, req->done);
> 
> Hmm, this seems awkward if this is called from pmem_make_request. If
> we need to wait for completion that should be managed by the guest
> block layer. I.e. make_request should just queue request and then
> trigger bio_endio() when the response comes back.

We are plugging VIRTIO based flush callback for virtio_pmem driver. If pmem 
driver (pmem_make_request) has to queue request we have to plug "blk_mq_ops" 
callbacks for corresponding  VIRTIO vqs. AFAICU there is no existing multiqueue 
code merged for pmem driver yet, though i could see patches by Dave upstream.

Anything I am missing here?   

> 
> However this does mean that nvdimm_flush() becomes asynchronous. So
> maybe we need to pass in a 'sync' flag or the bio directly to indicate
> this is an asynchronous flush request from pmem_make_request() vs a
> synchronous one from nsio_rw_bytes().

Sure.

> 
> > +       err = req->ret;
> > +       kfree(req);
> > +
> > +       return err;
> > +};
> > +EXPORT_SYMBOL_GPL(virtio_pmem_flush);
> > +
> > +static int virtio_pmem_probe(struct virtio_device *vdev)
> > +{
> > +       int err = 0;
> > +       struct resource res;
> > +       struct virtio_pmem *vpmem;
> > +       struct nvdimm_bus *nvdimm_bus;
> > +       struct nd_region_desc ndr_desc;
> > +       int nid = dev_to_node(&vdev->dev);
> > +       struct nd_region *nd_region;
> > +
> > +       if (!vdev->config->get) {
> > +               dev_err(&vdev->dev, "%s failure: config disabled\n",
> > +                       __func__);
> > +               return -EINVAL;
> > +       }
> > +
> > +       vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
> > +                       GFP_KERNEL);
> > +       if (!vpmem) {
> > +               err = -ENOMEM;
> > +               goto out_err;
> > +       }
> > +
> > +       vpmem->vdev = vdev;
> > +       err = init_vq(vpmem);
> > +       if (err)
> > +               goto out_err;
> > +
> > +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +                       start, &vpmem->start);
> > +       virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> > +                       size, &vpmem->size);
> > +
> > +       res.start = vpmem->start;
> > +       res.end   = vpmem->start + vpmem->size-1;
> > +       vpmem->nd_desc.provider_name = "virtio-pmem";
> > +       vpmem->nd_desc.module = THIS_MODULE;
> > +
> > +       vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
> > +                                               &vpmem->nd_desc);
> > +       if (!nvdimm_bus)
> > +               goto out_vq;
> > +
> > +       dev_set_drvdata(&vdev->dev, nvdimm_bus);
> > +       memset(&ndr_desc, 0, sizeof(ndr_desc));
> > +
> > +       ndr_desc.res = &res;
> > +       ndr_desc.numa_node = nid;
> > +       ndr_desc.flush = virtio_pmem_flush;
> > +       set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> > +       nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
> > +
> > +       if (!nd_region)
> > +               goto out_nd;
> > +
> > +       //virtio_device_ready(vdev);
> > +       return 0;
> > +out_nd:
> > +       err = -ENXIO;
> > +       nvdimm_bus_unregister(nvdimm_bus);
> > +out_vq:
> > +       vdev->config->del_vqs(vdev);
> > +out_err:
> > +       dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
> > +       return err;
> > +}
> > +
> > +static void virtio_pmem_remove(struct virtio_device *vdev)
> > +{
> > +       struct virtio_pmem *vpmem = vdev->priv;
> > +       struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
> > +
> > +       nvdimm_bus_unregister(nvdimm_bus);
> > +       vdev->config->del_vqs(vdev);
> > +       kfree(vpmem);
> > +}
> > +
> > +#ifdef CONFIG_PM_SLEEP
> > +static int virtio_pmem_freeze(struct virtio_device *vdev)
> > +{
> > +       /* todo: handle freeze function */
> > +       return -EPERM;
> > +}
> > +
> > +static int virtio_pmem_restore(struct virtio_device *vdev)
> > +{
> > +       /* todo: handle restore function */
> > +       return -EPERM;
> > +}
> > +#endif
> 
> As far as I can see there's nothing to do on a power transition, I
> would just omit this completely.

o.k Will remove these handlers.

> 
> > +
> > +
> > +static struct virtio_driver virtio_pmem_driver = {
> > +       .driver.name            = KBUILD_MODNAME,
> > +       .driver.owner           = THIS_MODULE,
> > +       .id_table               = id_table,
> > +       .probe                  = virtio_pmem_probe,
> > +       .remove                 = virtio_pmem_remove,
> > +#ifdef CONFIG_PM_SLEEP
> > +       .freeze                 = virtio_pmem_freeze,
> > +       .restore                = virtio_pmem_restore,
> > +#endif
> > +};
> > +
> > +module_virtio_driver(virtio_pmem_driver);
> > +MODULE_DEVICE_TABLE(virtio, id_table);
> > +MODULE_DESCRIPTION("Virtio pmem driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/uapi/linux/virtio_ids.h
> > b/include/uapi/linux/virtio_ids.h
> > index 6d5c3b2..3463895 100644
> > --- a/include/uapi/linux/virtio_ids.h
> > +++ b/include/uapi/linux/virtio_ids.h
> > @@ -43,5 +43,6 @@
> >  #define VIRTIO_ID_INPUT        18 /* virtio input */
> >  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
> >  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> > +#define VIRTIO_ID_PMEM         25 /* virtio pmem */
> >
> >  #endif /* _LINUX_VIRTIO_IDS_H */
> > diff --git a/include/uapi/linux/virtio_pmem.h
> > b/include/uapi/linux/virtio_pmem.h
> > new file mode 100644
> > index 0000000..c7c22a5
> > --- /dev/null
> > +++ b/include/uapi/linux/virtio_pmem.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
> > + * anyone can use the definitions to implement compatible drivers/servers:
> 
> The SPDX identifier does not match this BSD license, and the whole
> point of the SPDX identifier is to get out of the need to have these
> large text blobs of license goop.

Right, just copied this. Will remove these BSD lines.

> 
> > + *
> > + *
> > + * Redistribution and use in source and binary forms, with or without
> > + * modification, are permitted provided that the following conditions
> > + * are met:
> > + * 1. Redistributions of source code must retain the above copyright
> > + *    notice, this list of conditions and the following disclaimer.
> > + * 2. Redistributions in binary form must reproduce the above copyright
> > + *    notice, this list of conditions and the following disclaimer in the
> > + *    documentation and/or other materials provided with the distribution.
> > + * 3. Neither the name of IBM nor the names of its contributors
> > + *    may be used to endorse or promote products derived from this
> > software
> > + *    without specific prior written permission.
> > + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > ``AS IS''
> > + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> > THE
> > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> > PURPOSE
> > + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > CONSEQUENTIAL
> > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> > STRICT
> > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY
> > WAY
> > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> > + * SUCH DAMAGE.
> > + *
> > + * Copyright (C) Red Hat, Inc., 2018-2019
> > + * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
> > + */
> > +#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
> > +#define _UAPI_LINUX_VIRTIO_PMEM_H
> > +
> > +struct virtio_pmem_config {
> > +       __le64 start;
> > +       __le64 size;
> > +};
> > +#endif
> 
> Why does this need to be in the uapi?

This struct is defined by userspace(Qemu) and used by kernel to fetch
values passed by qemu device.

Thanks,
Pankaj
Pankaj Gupta Sept. 27, 2018, 1:06 p.m. UTC | #10
Hello Dan,

> > > + /* The request submission function */
> > > +static int virtio_pmem_flush(struct nd_region *nd_region)
> > > +{
> > > +       int err;

[...]

> > > +       init_waitqueue_head(&req->host_acked);
> > > +       init_waitqueue_head(&req->wq_buf);
> > > +
> > > +       spin_lock_irqsave(&vpmem->pmem_lock, flags);
> > > +       sg_init_one(&sg, req->name, strlen(req->name));
> > > +       sgs[0] = &sg;
> > > +       sg_init_one(&ret, &req->ret, sizeof(req->ret));
> > > +       sgs[1] = &ret;

[...]
> > > +       spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
> > > +       /* When host has read buffer, this completes via host_ack */
> > > +       wait_event(req->host_acked, req->done);
> > 
> > Hmm, this seems awkward if this is called from pmem_make_request. If
> > we need to wait for completion that should be managed by the guest
> > block layer. I.e. make_request should just queue request and then
> > trigger bio_endio() when the response comes back.
> 
> We are plugging VIRTIO based flush callback for virtio_pmem driver. If pmem
> driver (pmem_make_request) has to queue request we have to plug "blk_mq_ops"
> callbacks for corresponding  VIRTIO vqs. AFAICU there is no existing
> multiqueue
> code merged for pmem driver yet, though i could see patches by Dave upstream.
> 

I thought about this and with current infrastructure "make_request" releases spinlock 
and makes current thread/task. All Other threads are free to call 'make_request'/flush 
and similarly wait by releasing the lock. This actually works like a queue of threads 
waiting for notifications from host.

Current pmem code do not have multiqueue support and I am not sure if core pmem code 
needs it. Adding multiqueue support just for virtio-pmem and not for pmem in same driver
will be confusing or require alot of tweaking.

Could you please give your suggestions on this. 
 
Thanks,
Pankaj
Dan Williams Sept. 27, 2018, 3:55 p.m. UTC | #11
On Thu, Sep 27, 2018 at 6:07 AM Pankaj Gupta <pagupta@redhat.com> wrote:
[..]
> > We are plugging VIRTIO based flush callback for virtio_pmem driver. If pmem
> > driver (pmem_make_request) has to queue request we have to plug "blk_mq_ops"
> > callbacks for corresponding  VIRTIO vqs. AFAICU there is no existing
> > multiqueue
> > code merged for pmem driver yet, though i could see patches by Dave upstream.
> >
>
> I thought about this and with current infrastructure "make_request" releases spinlock
> and makes current thread/task. All Other threads are free to call 'make_request'/flush
> and similarly wait by releasing the lock.

Which lock are you referring?

> This actually works like a queue of threads
> waiting for notifications from host.
>
> Current pmem code do not have multiqueue support and I am not sure if core pmem code
> needs it. Adding multiqueue support just for virtio-pmem and not for pmem in same driver
> will be confusing or require alot of tweaking.

Why does the pmem driver need to be converted to multiqueue support?

> Could you please give your suggestions on this.

I was expecting that flush requests that cannot be completed
synchronously be placed on a queue and have bio_endio() called at a
future time. I.e. use bio_chain() to manage the async portion of the
flush request. This causes the guest block layer to just assume the
bio was queued and will be completed at some point in the future.
diff mbox series

Patch

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 3589764..a331e23 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -42,6 +42,15 @@  config VIRTIO_PCI_LEGACY
 
 	  If unsure, say Y.
 
+config VIRTIO_PMEM
+	tristate "Support for virtio pmem driver"
+	depends on VIRTIO
+	help
+	This driver provides support for virtio based flushing interface
+	for persistent memory range.
+
+	If unsure, say M.
+
 config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5..cbe91c6 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@  virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
 virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
 obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_PMEM) += virtio_pmem.o
diff --git a/drivers/virtio/virtio_pmem.c b/drivers/virtio/virtio_pmem.c
new file mode 100644
index 0000000..c22cc87
--- /dev/null
+++ b/drivers/virtio/virtio_pmem.c
@@ -0,0 +1,255 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * virtio_pmem.c: Virtio pmem Driver
+ *
+ * Discovers persistent memory range information
+ * from host and provides a virtio based flushing
+ * interface.
+ */
+#include <linux/virtio.h>
+#include <linux/module.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <uapi/linux/virtio_pmem.h>
+#include <linux/spinlock.h>
+#include <linux/libnvdimm.h>
+#include <linux/nd.h>
+
+struct virtio_pmem_request {
+	/* Host return status corresponding to flush request */
+	int ret;
+
+	/* command name*/
+	char name[16];
+
+	/* Wait queue to process deferred work after ack from host */
+	wait_queue_head_t host_acked;
+	bool done;
+
+	/* Wait queue to process deferred work after virt queue buffer avail */
+	wait_queue_head_t wq_buf;
+	bool wq_buf_avail;
+	struct list_head list;
+};
+
+struct virtio_pmem {
+	struct virtio_device *vdev;
+
+	/* Virtio pmem request queue */
+	struct virtqueue *req_vq;
+
+	/* nvdimm bus registers virtio pmem device */
+	struct nvdimm_bus *nvdimm_bus;
+	struct nvdimm_bus_descriptor nd_desc;
+
+	/* List to store deferred work if virtqueue is full */
+	struct list_head req_list;
+
+	/* Synchronize virtqueue data */
+	spinlock_t pmem_lock;
+
+	/* Memory region information */
+	uint64_t start;
+	uint64_t size;
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_PMEM, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+ /* The interrupt handler */
+static void host_ack(struct virtqueue *vq)
+{
+	unsigned int len;
+	unsigned long flags;
+	struct virtio_pmem_request *req, *req_buf;
+	struct virtio_pmem *vpmem = vq->vdev->priv;
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	while ((req = virtqueue_get_buf(vq, &len)) != NULL) {
+		req->done = true;
+		wake_up(&req->host_acked);
+
+		if (!list_empty(&vpmem->req_list)) {
+			req_buf = list_first_entry(&vpmem->req_list,
+					struct virtio_pmem_request, list);
+			list_del(&vpmem->req_list);
+			req_buf->wq_buf_avail = true;
+			wake_up(&req_buf->wq_buf);
+		}
+	}
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+}
+ /* Initialize virt queue */
+static int init_vq(struct virtio_pmem *vpmem)
+{
+	struct virtqueue *vq;
+
+	/* single vq */
+	vpmem->req_vq = vq = virtio_find_single_vq(vpmem->vdev,
+				host_ack, "flush_queue");
+	if (IS_ERR(vq))
+		return PTR_ERR(vq);
+
+	spin_lock_init(&vpmem->pmem_lock);
+	INIT_LIST_HEAD(&vpmem->req_list);
+
+	return 0;
+};
+
+ /* The request submission function */
+static int virtio_pmem_flush(struct nd_region *nd_region)
+{
+	int err;
+	unsigned long flags;
+	struct scatterlist *sgs[2], sg, ret;
+	struct virtio_device *vdev =
+		dev_to_virtio(nd_region->dev.parent->parent);
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct virtio_pmem_request *req = kmalloc(sizeof(*req), GFP_KERNEL);
+
+	if (!req)
+		return -ENOMEM;
+
+	req->done = req->wq_buf_avail = false;
+	strcpy(req->name, "FLUSH");
+	init_waitqueue_head(&req->host_acked);
+	init_waitqueue_head(&req->wq_buf);
+
+	spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	sg_init_one(&sg, req->name, strlen(req->name));
+	sgs[0] = &sg;
+	sg_init_one(&ret, &req->ret, sizeof(req->ret));
+	sgs[1] = &ret;
+	err = virtqueue_add_sgs(vpmem->req_vq, sgs, 1, 1, req, GFP_ATOMIC);
+	if (err) {
+		dev_err(&vdev->dev, "failed to send command to virtio pmem device\n");
+
+		list_add_tail(&vpmem->req_list, &req->list);
+		spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+		/* When host has read buffer, this completes via host_ack */
+		wait_event(req->wq_buf, req->wq_buf_avail);
+		spin_lock_irqsave(&vpmem->pmem_lock, flags);
+	}
+	virtqueue_kick(vpmem->req_vq);
+	spin_unlock_irqrestore(&vpmem->pmem_lock, flags);
+
+	/* When host has read buffer, this completes via host_ack */
+	wait_event(req->host_acked, req->done);
+	err = req->ret;
+	kfree(req);
+
+	return err;
+};
+EXPORT_SYMBOL_GPL(virtio_pmem_flush);
+
+static int virtio_pmem_probe(struct virtio_device *vdev)
+{
+	int err = 0;
+	struct resource res;
+	struct virtio_pmem *vpmem;
+	struct nvdimm_bus *nvdimm_bus;
+	struct nd_region_desc ndr_desc;
+	int nid = dev_to_node(&vdev->dev);
+	struct nd_region *nd_region;
+
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	vdev->priv = vpmem = devm_kzalloc(&vdev->dev, sizeof(*vpmem),
+			GFP_KERNEL);
+	if (!vpmem) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	vpmem->vdev = vdev;
+	err = init_vq(vpmem);
+	if (err)
+		goto out_err;
+
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			start, &vpmem->start);
+	virtio_cread(vpmem->vdev, struct virtio_pmem_config,
+			size, &vpmem->size);
+
+	res.start = vpmem->start;
+	res.end   = vpmem->start + vpmem->size-1;
+	vpmem->nd_desc.provider_name = "virtio-pmem";
+	vpmem->nd_desc.module = THIS_MODULE;
+
+	vpmem->nvdimm_bus = nvdimm_bus = nvdimm_bus_register(&vdev->dev,
+						&vpmem->nd_desc);
+	if (!nvdimm_bus)
+		goto out_vq;
+
+	dev_set_drvdata(&vdev->dev, nvdimm_bus);
+	memset(&ndr_desc, 0, sizeof(ndr_desc));
+
+	ndr_desc.res = &res;
+	ndr_desc.numa_node = nid;
+	ndr_desc.flush = virtio_pmem_flush;
+	set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
+	nd_region = nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc);
+
+	if (!nd_region)
+		goto out_nd;
+
+	//virtio_device_ready(vdev);
+	return 0;
+out_nd:
+	err = -ENXIO;
+	nvdimm_bus_unregister(nvdimm_bus);
+out_vq:
+	vdev->config->del_vqs(vdev);
+out_err:
+	dev_err(&vdev->dev, "failed to register virtio pmem memory\n");
+	return err;
+}
+
+static void virtio_pmem_remove(struct virtio_device *vdev)
+{
+	struct virtio_pmem *vpmem = vdev->priv;
+	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&vdev->dev);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	vdev->config->del_vqs(vdev);
+	kfree(vpmem);
+}
+
+#ifdef CONFIG_PM_SLEEP
+static int virtio_pmem_freeze(struct virtio_device *vdev)
+{
+	/* todo: handle freeze function */
+	return -EPERM;
+}
+
+static int virtio_pmem_restore(struct virtio_device *vdev)
+{
+	/* todo: handle restore function */
+	return -EPERM;
+}
+#endif
+
+
+static struct virtio_driver virtio_pmem_driver = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.probe			= virtio_pmem_probe,
+	.remove			= virtio_pmem_remove,
+#ifdef CONFIG_PM_SLEEP
+	.freeze                 = virtio_pmem_freeze,
+	.restore                = virtio_pmem_restore,
+#endif
+};
+
+module_virtio_driver(virtio_pmem_driver);
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio pmem driver");
+MODULE_LICENSE("GPL");
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2..3463895 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@ 
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_PMEM         25 /* virtio pmem */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
new file mode 100644
index 0000000..c7c22a5
--- /dev/null
+++ b/include/uapi/linux/virtio_pmem.h
@@ -0,0 +1,40 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers:
+ *
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS''
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * Copyright (C) Red Hat, Inc., 2018-2019
+ * Copyright (C) Pankaj Gupta <pagupta@redhat.com>, 2018
+ */
+#ifndef _UAPI_LINUX_VIRTIO_PMEM_H
+#define _UAPI_LINUX_VIRTIO_PMEM_H
+
+struct virtio_pmem_config {
+	__le64 start;
+	__le64 size;
+};
+#endif