[v2,0/2] kvm "fake DAX" device

Message ID 20181013050021.11962-1-pagupta@redhat.com (mailing list archive)

Message

Pankaj Gupta Oct. 13, 2018, 5 a.m. UTC
 This patch series implements "fake DAX": fake persistent
 memory (nvdimm) in the guest that allows bypassing the guest
 page cache. It also implements a VIRTIO based asynchronous
 flush mechanism.

 This patchset shares the guest kernel driver with the changes
 suggested in v1. It was tested with the Qemu side device
 emulation for virtio-pmem [4].

 Details of the project idea for the 'fake DAX' flushing
 interface are shared in [2] & [3].

 The implementation is divided into two parts:
 a new virtio pmem guest driver, and Qemu code changes for the
 new virtio pmem paravirtualized device.

1. Guest virtio-pmem kernel driver
---------------------------------
   - Reads persistent memory range from paravirt device and 
     registers with 'nvdimm_bus'.  
   - The 'nvdimm/pmem' driver uses this information to allocate
     a persistent memory region and set up filesystem operations
     on the allocated memory.
   - The virtio pmem driver implements an asynchronous flushing
     interface to flush from guest to host.

2. Qemu virtio-pmem device
---------------------------------
   - Creates the virtio pmem device and exposes a memory range to
     the KVM guest.
   - On the host side this is file backed memory which acts as
     persistent memory.
   - The Qemu side flush uses the aio thread pool APIs and virtio
     for asynchronous handling of multiple guest requests.
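
   As a rough illustration of the host side described above, the
   sketch below assumes that flushing a file backed range amounts to
   fsync() on the backing file descriptor. It is plain POSIX C, not
   Qemu code: 'flush_req', open_backing_file(), complete_flush() and
   handle_flush() are hypothetical names, and the thread pool and
   virtqueue are only indicated in comments.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical request: one guest flush against the backing file. */
struct flush_req {
	int fd;         /* backing file of the exposed memory range */
	int ret;        /* 0 on success, negative errno on failure */
	int completed;  /* set once the result is ready for the guest */
};

/* Open (or create) the file that backs the guest-visible range. */
static int open_backing_file(const char *path)
{
	return open(path, O_RDWR | O_CREAT, 0600);
}

/* Completion step: a real device would push the result back through
 * the virtqueue and notify the guest here. */
static void complete_flush(struct flush_req *req)
{
	req->completed = 1;
}

/* Worker body: the blocking fsync() is the part a thread pool worker
 * would run, so the main loop stays free for further guest requests. */
static void handle_flush(struct flush_req *req)
{
	assert(req != NULL);
	req->ret = (fsync(req->fd) == 0) ? 0 : -errno;
	complete_flush(req);
}
```

   Each request carries its own result, so several guest flushes can
   be in flight at once and complete independently.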

   David Hildenbrand (CCed) also posted a modified version [5] of
   the Qemu virtio-pmem code based on the updated Qemu memory device API.

 Virtio-pmem error handling:
 ----------------------------------------
  I checked the behaviour of virtio-pmem for the types of errors below.
  Suggestions are needed on the expected behaviour for handling them.

  - Hardware Errors: Uncorrectable recoverable Errors:
  a] virtio-pmem:
    - As per current logic, if the error page belongs to the Qemu
      process, the host MCE handler isolates (hwpoison) that page and
      sends SIGBUS. The Qemu SIGBUS handler injects an exception into
      the KVM guest.
    - The KVM guest then isolates the page and sends SIGBUS to the
      guest userspace process which has mapped the page.
  
  b] Existing implementation for ACPI pmem driver: 
    - Handles such errors with an MCE notifier and creates a list
      of bad blocks. Read/direct access DAX operations return EIO
      if the accessed memory page falls in the bad block list.
    - It also starts background scrubbing.
    - Similar functionality can be reused in virtio-pmem with an MCE
      notifier but without scrubbing (no ACPI/ARS)? Need inputs to
      confirm if this behaviour is ok or needs any change.
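
  To make the question in b] concrete, here is a minimal user-space
  model of the bad block idea: a notifier records poisoned ranges and
  every read consults the list, failing with -EIO on overlap. This is
  only a sketch under assumptions. 'bad_range', 'badblock_list',
  record_bad_range() and pmem_read_check() are made-up names, not the
  kernel's badblocks API, and no scrubbing is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BAD_RANGES 16

/* Hypothetical poisoned byte range [start, start + len). */
struct bad_range {
	uint64_t start;
	uint64_t len;
};

/* Hypothetical fixed-size bad block list for one region. */
struct badblock_list {
	struct bad_range ranges[MAX_BAD_RANGES];
	size_t count;
};

/* What an MCE notifier would call when a page is reported bad. */
static int record_bad_range(struct badblock_list *bb,
			    uint64_t start, uint64_t len)
{
	assert(bb != NULL);
	if (len == 0)
		return -EINVAL;
	if (bb->count == MAX_BAD_RANGES)
		return -ENOSPC;
	bb->ranges[bb->count].start = start;
	bb->ranges[bb->count].len = len;
	bb->count++;
	return 0;
}

/* Return -EIO if [off, off + len) overlaps any recorded bad range. */
static int pmem_read_check(const struct badblock_list *bb,
			   uint64_t off, uint64_t len)
{
	size_t i;

	for (i = 0; i < bb->count; i++) {
		const struct bad_range *r = &bb->ranges[i];

		if (off < r->start + r->len && r->start < off + len)
			return -EIO;
	}
	return 0;
}
```

  Without ARS there is nothing in this model that ever clears an
  entry, which is part of the behaviour that needs confirming.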

Changes from PATCH v1: [1]
- 0-day build test for build dependency on libnvdimm 

 Changes suggested by - [Dan Williams]
- Split the driver into two parts virtio & pmem  
- Move queuing of async block request to block layer
- Add "sync" parameter in nvdimm_flush function
- Use indirect call for nvdimm_flush
- Don't move declarations to common global header e.g. nd.h
- nvdimm_flush() return 0 or -EIO if it fails
- Teach nsio_rw_bytes() that the flush can fail
- Rename nvdimm_flush() to generic_nvdimm_flush()
- Use 'nd_region->provider_data' for long dereferencing
- Remove virtio_pmem_freeze/restore functions
- Replace BSD license text with SPDX license text
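
The flush related items above fit together roughly as in the following
stand-alone user-space model. It is not the patch itself: the struct
layout, 'host_model', generic_flush_model(), virtio_flush_model() and
region_flush() are assumptions for illustration. Only the ideas of a
per-region flush callback, 'provider_data' and a 0 / -EIO return value
come from the list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct region;
typedef int (*region_flush_fn)(struct region *r);

/* Hypothetical stand-in for a region with an optional flush hook. */
struct region {
	region_flush_fn flush;  /* NULL means: use the generic path */
	void *provider_data;    /* owned by whoever registered the region */
	int generic_flushes;    /* counts generic flushes in this model */
};

/* Hypothetical host state reached through provider_data. */
struct host_model {
	int fail_next;          /* make the next flush fail once */
	int flushes;            /* successful host flushes */
};

/* Generic path: stands in for a CPU cache / write queue flush. */
static int generic_flush_model(struct region *r)
{
	r->generic_flushes++;
	return 0;
}

/* Paravirt path: ask the (modelled) host to flush, which can fail. */
static int virtio_flush_model(struct region *r)
{
	struct host_model *host = r->provider_data;

	if (host->fail_next) {
		host->fail_next = 0;
		return -EIO;
	}
	host->flushes++;
	return 0;
}

/* Indirect call: callers no longer assume that a flush cannot fail. */
static int region_flush(struct region *r)
{
	assert(r != NULL);
	return r->flush ? r->flush(r) : generic_flush_model(r);
}
```

A caller in the position of nsio_rw_bytes() would check the return
value of region_flush() and pass -EIO up instead of ignoring it.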

- Add might_sleep() in virtio_pmem_flush - [Luiz]
- Make spin_lock_irqsave() narrow

Changes from RFC v3:
- Rebase to latest upstream - Luiz
- Call ndregion->flush in place of nvdimm_flush - Luiz
- kmalloc return check - Luiz
- virtqueue full handling - Stefan
- Don't map entire virtio_pmem_req to device - Stefan
- request leak, correct sizeof req - Stefan
- Move declaration to virtio_pmem.c

Changes from RFC v2:
- Add flush function in the nd_region in place of switching
  on a flag - Dan & Stefan
- Add flush completion function with proper locking and wait
  for host side flush completion - Stefan & Dan
- Keep userspace API in uapi header file - Stefan, MST
- Use LE fields & New device id - MST
- Indentation & spacing suggestions - MST & Eric
- Remove extra header files & add licensing - Stefan

Changes from RFC v1:
- Reuse existing 'pmem' code for registering persistent 
  memory and other operations instead of creating an entirely 
  new block driver.
- Use VIRTIO driver to register memory information with 
  nvdimm_bus and create region_type accordingly. 
- Call VIRTIO flush from existing pmem driver.

Pankaj Gupta (2):
   libnvdimm: nd_region flush callback support
   virtio-pmem: Add virtio-pmem guest driver

[1] https://lkml.org/lkml/2018/8/31/407
[2] https://www.spinics.net/lists/kvm/msg149761.html
[3] https://www.spinics.net/lists/kvm/msg153095.html  
[4] https://lkml.org/lkml/2018/8/31/413
[5] https://marc.info/?l=qemu-devel&m=153555721901824&w=2

 drivers/acpi/nfit/core.c     |    8 ++--
 drivers/nvdimm/claim.c       |   12 ++++--
 drivers/nvdimm/nd.h          |    2 +
 drivers/nvdimm/pmem.c        |   24 +++++++++----
 drivers/nvdimm/region_devs.c |   76 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/libnvdimm.h    |   10 ++++-
 6 files changed, 110 insertions(+), 22 deletions(-)

Comments

Dan Williams Oct. 13, 2018, 4:06 p.m. UTC | #1
On Fri, Oct 12, 2018 at 10:00 PM Pankaj Gupta <pagupta@redhat.com> wrote:
>
>  This patch series has implementation for "fake DAX".
>  "fake DAX" is fake persistent memory(nvdimm) in guest
>  which allows to bypass the guest page cache. This also
>  implements a VIRTIO based asynchronous flush mechanism.

Can we stop calling this 'fake DAX', because it isn't 'DAX' and it's
starting to confuse people. This enabling is effectively a
host-page-cache-passthrough mechanism not DAX. Let's call the whole
approach virtio-pmem, and leave DAX out of the name to hopefully
prevent people from wondering why some DAX features are disabled with
this driver. For example MAP_SYNC is not compatible with this
approach.

Additional enabling is needed to disable MAP_SYNC in the presence of a
virtio-pmem device. See the rough proposal here:

    https://lkml.org/lkml/2018/4/25/756

Pankaj Gupta Oct. 13, 2018, 4:29 p.m. UTC | #2
> >
> >  This patch series has implementation for "fake DAX".
> >  "fake DAX" is fake persistent memory(nvdimm) in guest
> >  which allows to bypass the guest page cache. This also
> >  implements a VIRTIO based asynchronous flush mechanism.
> 
> Can we stop calling this 'fake DAX', because it isn't 'DAX' and it's
> starting to confuse people. This enabling is effectively a
> host-page-cache-passthrough mechanism not DAX. Let's call the whole
> approach virtio-pmem, and leave DAX out of the name to hopefully
> prevent people from wondering why some DAX features are disabled with
> this driver. For example MAP_SYNC is not compatible with this
> approach.

Sure. I got your point. I will use "virtio-pmem" in future.  

> 
> Additional enabling is needed to disable MAP_SYNC in the presence of a
> virtio-pmem device. See the rough proposal here:
> 
>     https://lkml.org/lkml/2018/4/25/756

Yes, I will handle disabling of MAP_SYNC for this use-case.


Thanks,
Pankaj