Message ID: 161966810162.652.13723419108625443430.stgit@17be908f7c1c (mailing list archive)
Series: nvdimm: Enable sync-dax property for nvdimm
On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote: > The nvdimm devices are expected to ensure write persistence during power > failure kind of scenarios. > > The libpmem has architecture specific instructions like dcbf on POWER > to flush the cache data to backend nvdimm device during normal writes > followed by explicit flushes if the backend devices are not synchronous > DAX capable. > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > and the subsequent flush doesn't traslate to actual flush to the backend > file on the host in case of file backed v-nvdimms. This is addressed by > virtio-pmem in case of x86_64 by making explicit flushes translating to > fsync at qemu. > > On SPAPR, the issue is addressed by adding a new hcall to > request for an explicit flush from the guest ndctl driver when the backend > nvdimm cannot ensure write persistence with dcbf alone. So, the approach > here is to convey when the hcall flush is required in a device tree > property. The guest makes the hcall when the property is found, instead > of relying on dcbf. Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the virtio-nvdimm device already exists? Stefan
On 4/29/21 9:25 PM, Stefan Hajnoczi wrote: > On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote: >> The nvdimm devices are expected to ensure write persistence during power >> failure kind of scenarios. >> >> The libpmem has architecture specific instructions like dcbf on POWER >> to flush the cache data to backend nvdimm device during normal writes >> followed by explicit flushes if the backend devices are not synchronous >> DAX capable. >> >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest >> and the subsequent flush doesn't traslate to actual flush to the backend >> file on the host in case of file backed v-nvdimms. This is addressed by >> virtio-pmem in case of x86_64 by making explicit flushes translating to >> fsync at qemu. >> >> On SPAPR, the issue is addressed by adding a new hcall to >> request for an explicit flush from the guest ndctl driver when the backend >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach >> here is to convey when the hcall flush is required in a device tree >> property. The guest makes the hcall when the property is found, instead >> of relying on dcbf. > > Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the > virtio-nvdimm device already exists? > On virtualized ppc64 platforms, guests use the papr_scm.ko kernel driver for persistent memory support. This was done so that we can use one kernel driver to support persistent memory with multiple hypervisors. To avoid supporting multiple drivers in the guest, the -device nvdimm Qemu command line results in Qemu using the PAPR SCM backend. What this patch series does is make sure we expose the correct synchronous fault support when we back such an nvdimm device with a file. The existing PAPR SCM backend enables persistent memory support with the help of multiple hypercalls. #define H_SCM_READ_METADATA 0x3E4 #define H_SCM_WRITE_METADATA 0x3E8 #define H_SCM_BIND_MEM 0x3EC #define H_SCM_UNBIND_MEM 0x3F0 #define H_SCM_UNBIND_ALL 0x3FC Most of them are already implemented in Qemu. This patch series implements the H_SCM_FLUSH hypercall. -aneesh
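For readers following the flow being described here, a minimal sketch of how a guest region flush could funnel into an H_SCM_FLUSH-style hypercall is shown below. The hcall_scm_flush() stub, the return codes, and the retry loop are illustrative assumptions, not the actual papr_scm or QEMU implementation; only the hcall names listed above come from the thread.

```c
/*
 * Illustrative sketch only: models a guest pmem flush being routed through
 * an H_SCM_FLUSH-style hypercall. hcall_scm_flush() is a stand-in for the
 * real hypervisor interface; names and return codes are assumptions.
 */
#include <stdint.h>
#include <stdio.h>

#define H_SUCCESS 0
#define H_BUSY    1   /* assumed: hypervisor asks the guest to retry */

/* Stand-in for the hcall; the host would flush the backing file here. */
static long hcall_scm_flush(uint64_t drc_index, uint64_t continue_token)
{
    (void)drc_index;
    (void)continue_token;
    return H_SUCCESS;
}

/*
 * Guest-side flush: issue the hcall and retry while the hypervisor
 * reports that the flush is still in progress.
 */
static int guest_region_flush(uint64_t drc_index)
{
    uint64_t token = 0;
    long rc;

    do {
        rc = hcall_scm_flush(drc_index, token);
    } while (rc == H_BUSY);

    return rc == H_SUCCESS ? 0 : -1;
}

int main(void)
{
    /* 0x1234 is a made-up DRC index for demonstration. */
    printf("flush rc=%d\n", guest_region_flush(0x1234));
    return 0;
}
```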
On Thu, Apr 29, 2021 at 10:02:23PM +0530, Aneesh Kumar K.V wrote: > On 4/29/21 9:25 PM, Stefan Hajnoczi wrote: > > On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote: > > > The nvdimm devices are expected to ensure write persistence during power > > > failure kind of scenarios. > > > > > > The libpmem has architecture specific instructions like dcbf on POWER > > > to flush the cache data to backend nvdimm device during normal writes > > > followed by explicit flushes if the backend devices are not synchronous > > > DAX capable. > > > > > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > > > and the subsequent flush doesn't traslate to actual flush to the backend > > > file on the host in case of file backed v-nvdimms. This is addressed by > > > virtio-pmem in case of x86_64 by making explicit flushes translating to > > > fsync at qemu. > > > > > > On SPAPR, the issue is addressed by adding a new hcall to > > > request for an explicit flush from the guest ndctl driver when the backend > > > nvdimm cannot ensure write persistence with dcbf alone. So, the approach > > > here is to convey when the hcall flush is required in a device tree > > > property. The guest makes the hcall when the property is found, instead > > > of relying on dcbf. > > > > Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the > > virtio-nvdimm device already exists? > > > > On virtualized ppc64 platforms, guests use papr_scm.ko kernel drive for > persistent memory support. This was done such that we can use one kernel > driver to support persistent memory with multiple hypervisors. To avoid > supporting multiple drivers in the guest, -device nvdimm Qemu command-line > results in Qemu using PAPR SCM backend. What this patch series does is to > make sure we expose the correct synchronous fault support, when we back such > nvdimm device with a file. > > The existing PAPR SCM backend enables persistent memory support with the > help of multiple hypercall. > > #define H_SCM_READ_METADATA 0x3E4 > #define H_SCM_WRITE_METADATA 0x3E8 > #define H_SCM_BIND_MEM 0x3EC > #define H_SCM_UNBIND_MEM 0x3F0 > #define H_SCM_UNBIND_ALL 0x3FC > > Most of them are already implemented in Qemu. This patch series implements > H_SCM_FLUSH hypercall. The overall point here is that we didn't define the hypercall. It was defined in order to support NVDIMM/pmem devices under PowerVM. For uniformity between PowerVM and KVM guests, we want to support the same hypercall interface on KVM/qemu as well.
On Fri, Apr 30, 2021 at 02:27:18PM +1000, David Gibson wrote: > On Thu, Apr 29, 2021 at 10:02:23PM +0530, Aneesh Kumar K.V wrote: > > On 4/29/21 9:25 PM, Stefan Hajnoczi wrote: > > > On Wed, Apr 28, 2021 at 11:48:21PM -0400, Shivaprasad G Bhat wrote: > > > > The nvdimm devices are expected to ensure write persistence during power > > > > failure kind of scenarios. > > > > > > > > The libpmem has architecture specific instructions like dcbf on POWER > > > > to flush the cache data to backend nvdimm device during normal writes > > > > followed by explicit flushes if the backend devices are not synchronous > > > > DAX capable. > > > > > > > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > > > > and the subsequent flush doesn't traslate to actual flush to the backend > > > > file on the host in case of file backed v-nvdimms. This is addressed by > > > > virtio-pmem in case of x86_64 by making explicit flushes translating to > > > > fsync at qemu. > > > > > > > > On SPAPR, the issue is addressed by adding a new hcall to > > > > request for an explicit flush from the guest ndctl driver when the backend > > > > nvdimm cannot ensure write persistence with dcbf alone. So, the approach > > > > here is to convey when the hcall flush is required in a device tree > > > > property. The guest makes the hcall when the property is found, instead > > > > of relying on dcbf. > > > > > > Sorry, I'm not very familiar with SPAPR. Why add a hypercall when the > > > virtio-nvdimm device already exists? > > > > > > > On virtualized ppc64 platforms, guests use papr_scm.ko kernel drive for > > persistent memory support. This was done such that we can use one kernel > > driver to support persistent memory with multiple hypervisors. To avoid > > supporting multiple drivers in the guest, -device nvdimm Qemu command-line > > results in Qemu using PAPR SCM backend. What this patch series does is to > > make sure we expose the correct synchronous fault support, when we back such > > nvdimm device with a file. > > > > The existing PAPR SCM backend enables persistent memory support with the > > help of multiple hypercall. > > > > #define H_SCM_READ_METADATA 0x3E4 > > #define H_SCM_WRITE_METADATA 0x3E8 > > #define H_SCM_BIND_MEM 0x3EC > > #define H_SCM_UNBIND_MEM 0x3F0 > > #define H_SCM_UNBIND_ALL 0x3FC > > > > Most of them are already implemented in Qemu. This patch series implements > > H_SCM_FLUSH hypercall. > > The overall point here is that we didn't define the hypercall. It was > defined in order to support NVDIMM/pmem devices under PowerVM. For > uniformity between PowerVM and KVM guests, we want to support the same > hypercall interface on KVM/qemu as well. Okay, that's fine. Now Linux and QEMU have multiple ways of doing this, but it's fair enough if it's an existing platform hypercall. Stefan
Some corrections to terminology confusion below... On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote: > > The nvdimm devices are expected to ensure write persistence during power > failure kind of scenarios. No, QEMU is not expected to make that guarantee. QEMU is free to lie to the guest about the persistence guarantees of the guest PMEM ranges. It's more accurate to say that QEMU nvdimm devices can emulate persistent memory and optionally pass through host power-fail persistence guarantees to the guest. The power-fail persistence domain can be one of "cpu_cache", or "memory_controller" if the persistent memory region is "synchronous". If the persistent range is not synchronous, it really isn't "persistent memory"; it's memory mapped storage that needs I/O commands to flush. > The libpmem has architecture specific instructions like dcbf on POWER Which "libpmem" is this? PMDK is a reference library not a PMEM interface... maybe I'm missing what libpmem has to do with QEMU? > to flush the cache data to backend nvdimm device during normal writes > followed by explicit flushes if the backend devices are not synchronous > DAX capable. > > Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > and the subsequent flush doesn't traslate to actual flush to the backend s/traslate/translate/ > file on the host in case of file backed v-nvdimms. This is addressed by > virtio-pmem in case of x86_64 by making explicit flushes translating to > fsync at qemu. Note that virtio-pmem was a proposal for a specific optimization of allowing guests to share page cache. The virtio-pmem approach is not to be confused with actual persistent memory. > On SPAPR, the issue is addressed by adding a new hcall to > request for an explicit flush from the guest ndctl driver when the backend What is an "ndctl" driver? ndctl is userspace tooling, do you mean the guest pmem driver? > nvdimm cannot ensure write persistence with dcbf alone. So, the approach > here is to convey when the hcall flush is required in a device tree > property. The guest makes the hcall when the property is found, instead > of relying on dcbf. > > A new device property sync-dax is added to the nvdimm device. When the > sync-dax is 'writeback'(default for PPC), device property > "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH > requesting for an explicit flush. I'm not sure "sync-dax" is a suitable name for the property of the guest persistent memory. There is no requirement that the memory-backend file for a guest be a dax-capable file. It's also implementation specific what hypercall needs to be invoked for a given occurrence of "sync-dax". What does that map to on non-PPC platforms for example? It seems to me that an "nvdimm" device presents the synchronous usage model and a whole other device type implements an async-hypercall setup that the guest happens to service with its nvdimm stack, but it's not an "nvdimm" anymore at that point. > sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines > prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented > now as the flush semantics are unimplemented. "sync-dax" has no meaning on its own, I think this needs an explicit mechanism to convey both the "not-sync" property *and* the callback method, it shouldn't be inferred by arch type. > When the backend file is actually synchronous DAX capable and no explicit > flushes are required, the sync-dax mode 'direct' is to be used. 
> > The below demonstration shows the map_sync behavior with sync-dax writeback & > direct. > (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c) > > The pmem0 is from nvdimm with With sync-dax=direct, and pmem1 is from > nvdimm with syn-dax=writeback, mounted as > /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) > /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) > > [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct > [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback > Failed to mmap with Operation not supported > > The first patch does the header file cleanup necessary for the > subsequent ones. Second patch implements the hcall, adds the necessary > vmstate properties to spapr machine structure for carrying the hcall > status during save-restore. The nature of the hcall being asynchronus, > the patch uses aio utilities to offload the flush. The third patch adds > the 'sync-dax' device property and enables the device tree property > for the guest to utilise the hcall. > > The kernel changes to exploit this hcall is at > https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch > > --- > v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html > Changes from v3: > - Fixed the forward declaration coding guideline violations in 1st patch. > - Removed the code waiting for the flushes to complete during migration, > instead restart the flush worker on destination qemu in post load. > - Got rid of the randomization of the flush tokens, using simple > counter. > - Got rid of the redundant flush state lock, relying on the BQL now. > - Handling the memory-backend-ram usage > - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'. > Added prevention code using 'writeback' on arm and x86_64. > - Fixed all the miscellaneous comments. > > v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html > Changes from v2: > - Using the thread pool based approach as suggested > - Moved the async hcall handling code to spapr_nvdimm.c along > with some simplifications > - Added vmstate to preserve the hcall status during save-restore > along with pre_save handler code to complete all ongoning flushes. > - Added hw_compat magic for sync-dax 'on' on previous machines. > - Miscellanious minor fixes. > > v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html > Changes from v1 > - Fixed a missed-out unlock > - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token > > Shivaprasad G Bhat (3): > spapr: nvdimm: Forward declare and move the definitions > spapr: nvdimm: Implement H_SCM_FLUSH hcall > nvdimm: Enable sync-dax device property for nvdimm > > > hw/arm/virt.c | 28 ++++ > hw/i386/pc.c | 28 ++++ > hw/mem/nvdimm.c | 52 +++++++ > hw/ppc/spapr.c | 16 ++ > hw/ppc/spapr_nvdimm.c | 285 +++++++++++++++++++++++++++++++++++++++++ > include/hw/mem/nvdimm.h | 11 ++ > include/hw/ppc/spapr.h | 11 +- > include/hw/ppc/spapr_nvdimm.h | 27 ++-- > qapi/common.json | 20 +++ > 9 files changed, 455 insertions(+), 23 deletions(-) > > -- > Signature > _______________________________________________ > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org > To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
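The map_sync.c test referenced above essentially probes whether the kernel will grant a MAP_SYNC mapping on a given file. A minimal stand-alone equivalent is sketched below; it is an assumed approximation of the linked test, not a copy of it. On a region that requires the hcall flush (sync-dax=writeback), the mmap is expected to fail with "Operation not supported", matching the output shown in the cover letter.

```c
/*
 * Minimal MAP_SYNC probe, assumed to mirror what the referenced map_sync.c
 * demonstrates: MAP_SYNC is only granted when the kernel can promise that
 * CPU cache flushes alone make stores durable.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-dax-fs>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, 4096) < 0) {
        perror("ftruncate");
        return 1;
    }

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) {
        perror("Failed to mmap");   /* EOPNOTSUPP on non-sync regions */
        return 1;
    }

    printf("MAP_SYNC mapping granted\n");
    munmap(p, 4096);
    close(fd);
    return 0;
}
```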
On 5/1/21 12:44 AM, Dan Williams wrote: > Some corrections to terminology confusion below... ....... > >> file on the host in case of file backed v-nvdimms. This is addressed by >> virtio-pmem in case of x86_64 by making explicit flushes translating to >> fsync at qemu. > > Note that virtio-pmem was a proposal for a specific optimization of > allowing guests to share page cache. The virtio-pmem approach is not > to be confused with actual persistent memory. > ..... >> >> A new device property sync-dax is added to the nvdimm device. When the >> sync-dax is 'writeback'(default for PPC), device property >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH >> requesting for an explicit flush. > > I'm not sure "sync-dax" is a suitable name for the property of the > guest persistent memory. There is no requirement that the > memory-backend file for a guest be a dax-capable file. It's also > implementation specific what hypercall needs to be invoked for a given > occurrence of "sync-dax". What does that map to on non-PPC platforms > for example? It seems to me that an "nvdimm" device presents the > synchronous usage model and a whole other device type implements an > async-hypercall setup that the guest happens to service with its > nvdimm stack, but it's not an "nvdimm" anymore at that point. > What is attempted here is to use the same guest driver, papr_scm.ko, to support the use case of sharing page cache from the host instead of depending on a new guest driver, virtio-pmem. This also tries to ensure that a usage like -object memory-backend-file,id=memnvdimm1,mem-path=file_name -device nvdimm,memdev=memnvdimm1 correctly indicates to the guest that we are indeed sharing page cache and not really emulating persistent memory. W.r.t. non-ppc platforms, this was discussed earlier and one of the suggestions there was to mark that as "unsafe". Any suggestion for an alternative property name to "sync-dax"? >> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines >> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented >> now as the flush semantics are unimplemented. > > "sync-dax" has no meaning on its own, I think this needs an explicit > mechanism to convey both the "not-sync" property *and* the callback > method, it shouldn't be inferred by arch type. Won't a non-sync property imply that the guest needs to do a callback to ensure persistence? Hence, aren't they related? -aneesh
On 5/1/21 12:44 AM, Dan Williams wrote: > Some corrections to terminology confusion below... > > > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote: >> The nvdimm devices are expected to ensure write persistence during power >> failure kind of scenarios. > No, QEMU is not expected to make that guarantee. QEMU is free to lie > to the guest about the persistence guarantees of the guest PMEM > ranges. It's more accurate to say that QEMU nvdimm devices can emulate > persistent memory and optionally pass through host power-fail > persistence guarantees to the guest. The power-fail persistence domain > can be one of "cpu_cache", or "memory_controller" if the persistent > memory region is "synchronous". If the persistent range is not > synchronous, it really isn't "persistent memory"; it's memory mapped > storage that needs I/O commands to flush. Since this is virtual nvdimm(v-nvdimm) backed by a file, and the data is completely in the host pagecache, and we need a way to ensure that host pagecaches are flushed to the backend. This analogous to the WPQ flush being offloaded to the hypervisor. Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt > >> The libpmem has architecture specific instructions like dcbf on POWER > Which "libpmem" is this? PMDK is a reference library not a PMEM > interface... maybe I'm missing what libpmem has to do with QEMU? I was referrering to semantics of flushing pmem cache lines as in PMDK/libpmem. > >> to flush the cache data to backend nvdimm device during normal writes >> followed by explicit flushes if the backend devices are not synchronous >> DAX capable. >> >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest >> and the subsequent flush doesn't traslate to actual flush to the backend > s/traslate/translate/ > >> file on the host in case of file backed v-nvdimms. This is addressed by >> virtio-pmem in case of x86_64 by making explicit flushes translating to >> fsync at qemu. > Note that virtio-pmem was a proposal for a specific optimization of > allowing guests to share page cache. The virtio-pmem approach is not > to be confused with actual persistent memory. > >> On SPAPR, the issue is addressed by adding a new hcall to >> request for an explicit flush from the guest ndctl driver when the backend > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the > guest pmem driver? oops, wrong terminologies. I was referring to guest libnvdimm and papr_scm kernel modules. > >> nvdimm cannot ensure write persistence with dcbf alone. So, the approach >> here is to convey when the hcall flush is required in a device tree >> property. The guest makes the hcall when the property is found, instead >> of relying on dcbf. >> >> A new device property sync-dax is added to the nvdimm device. When the >> sync-dax is 'writeback'(default for PPC), device property >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH >> requesting for an explicit flush. > I'm not sure "sync-dax" is a suitable name for the property of the > guest persistent memory. sync-dax property translates ND_REGION_ASYNC flag being set/unset for the pmem region also if the nvdimm_flush callback is provided in the papr_scm or not. As everything boils down to synchronous nature of the device, I chose sync-dax for the name. > There is no requirement that the > memory-backend file for a guest be a dax-capable file. It's also > implementation specific what hypercall needs to be invoked for a given > occurrence of "sync-dax". 
What does that map to on non-PPC platforms > for example? The backend file can be dax-capable, to be hinted using "sync-dax=direct". When the backend is not dax-capable, the "sync-dax=writeback" to used, so that the guest makes the hcall. On all non-PPC archs, with the "sync-dax=writeback" qemu errors out stating the lack of support. > It seems to me that an "nvdimm" device presents the > synchronous usage model and a whole other device type implements an > async-hypercall setup that the guest happens to service with its > nvdimm stack, but it's not an "nvdimm" anymore at that point. In case the file backing the v-nvdimm is not dax-capable, we need flush semantics on the guest to be mapped to pagecache flush on the host side. > >> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines >> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented >> now as the flush semantics are unimplemented. > "sync-dax" has no meaning on its own, I think this needs an explicit > mechanism to convey both the "not-sync" property *and* the callback > method, it shouldn't be inferred by arch type. Yes. On all platforms the "sync-dax=unsafe" meaning - with host power failure the host pagecache is lost and subsequently data written by the guest will also be gone. This is the default for non-PPC. On PPC, the default is "sync-dax=writeback" - so the ND_REGION_ASYNC is set for the region and the guest makes hcalls to issue fsync on the host. Are you suggesting me to keep it "unsafe" as default for all architectures including PPC and a user can set it to "writeback" if desired. > >> When the backend file is actually synchronous DAX capable and no explicit >> flushes are required, the sync-dax mode 'direct' is to be used. >> >> The below demonstration shows the map_sync behavior with sync-dax writeback & >> direct. >> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c) >> >> The pmem0 is from nvdimm with With sync-dax=direct, and pmem1 is from >> nvdimm with syn-dax=writeback, mounted as >> /dev/pmem0 on /mnt1 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) >> /dev/pmem1 on /mnt2 type xfs (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota) >> >> [root@atest-guest ~]# ./mapsync /mnt1/newfile ----> When sync-dax=unsafe/direct >> [root@atest-guest ~]# ./mapsync /mnt2/newfile ----> when sync-dax=writeback >> Failed to mmap with Operation not supported >> >> The first patch does the header file cleanup necessary for the >> subsequent ones. Second patch implements the hcall, adds the necessary >> vmstate properties to spapr machine structure for carrying the hcall >> status during save-restore. The nature of the hcall being asynchronus, >> the patch uses aio utilities to offload the flush. The third patch adds >> the 'sync-dax' device property and enables the device tree property >> for the guest to utilise the hcall. >> >> The kernel changes to exploit this hcall is at >> https://github.com/linuxppc/linux/commit/75b7c05ebf9026.patch >> >> --- >> v3 - https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07916.html >> Changes from v3: >> - Fixed the forward declaration coding guideline violations in 1st patch. >> - Removed the code waiting for the flushes to complete during migration, >> instead restart the flush worker on destination qemu in post load. >> - Got rid of the randomization of the flush tokens, using simple >> counter. 
>> - Got rid of the redundant flush state lock, relying on the BQL now. >> - Handling the memory-backend-ram usage >> - Changed the sync-dax symantics from on/off to 'unsafe','writeback' and 'direct'. >> Added prevention code using 'writeback' on arm and x86_64. >> - Fixed all the miscellaneous comments. >> >> v2 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg07031.html >> Changes from v2: >> - Using the thread pool based approach as suggested >> - Moved the async hcall handling code to spapr_nvdimm.c along >> with some simplifications >> - Added vmstate to preserve the hcall status during save-restore >> along with pre_save handler code to complete all ongoning flushes. >> - Added hw_compat magic for sync-dax 'on' on previous machines. >> - Miscellanious minor fixes. >> >> v1 - https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg06330.html >> Changes from v1 >> - Fixed a missed-out unlock >> - using QLIST_FOREACH instead of QLIST_FOREACH_SAFE while generating token >> >> Shivaprasad G Bhat (3): >> spapr: nvdimm: Forward declare and move the definitions >> spapr: nvdimm: Implement H_SCM_FLUSH hcall >> nvdimm: Enable sync-dax device property for nvdimm >> >> >> hw/arm/virt.c | 28 ++++ >> hw/i386/pc.c | 28 ++++ >> hw/mem/nvdimm.c | 52 +++++++ >> hw/ppc/spapr.c | 16 ++ >> hw/ppc/spapr_nvdimm.c | 285 +++++++++++++++++++++++++++++++++++++++++ >> include/hw/mem/nvdimm.h | 11 ++ >> include/hw/ppc/spapr.h | 11 +- >> include/hw/ppc/spapr_nvdimm.h | 27 ++-- >> qapi/common.json | 20 +++ >> 9 files changed, 455 insertions(+), 23 deletions(-) >> >> -- >> Signature >> _______________________________________________ >> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org >> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org > _______________________________________________ > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org > To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
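To tie this back to the PMDK/libpmem flush semantics mentioned above: the usual libpmem pattern is to rely on CPU cache flushes (e.g. dcbf on POWER) only when the mapping is genuine, synchronous persistent memory, and to fall back to an msync()-style flush otherwise. A hedged sketch of that pattern (requires libpmem; error handling trimmed):

```c
/*
 * Sketch of the standard PMDK/libpmem flush pattern referred to in the
 * thread. Link with -lpmem; not code from this patch series.
 */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pmem-file>\n", argv[0]);
        return 1;
    }

    size_t mapped_len;
    int is_pmem;
    char *addr = pmem_map_file(argv[1], 4096, PMEM_FILE_CREATE, 0644,
                               &mapped_len, &is_pmem);
    if (!addr) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(addr, "hello");

    if (is_pmem) {
        /* Synchronous DAX: CPU cache flush + fence is enough. */
        pmem_persist(addr, mapped_len);
    } else {
        /* Not real pmem: fall back to msync() of the mapping. */
        pmem_msync(addr, mapped_len);
    }

    pmem_unmap(addr, mapped_len);
    return 0;
}
```

Whether pmem_is_pmem()/is_pmem can legitimately report true inside the guest is exactly the property the sync-dax / hcall-flush-required plumbing discussed in this thread is trying to convey.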
On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote: > > > On 5/1/21 12:44 AM, Dan Williams wrote: > > Some corrections to terminology confusion below... > > > > > > On Wed, Apr 28, 2021 at 8:49 PM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote: > >> The nvdimm devices are expected to ensure write persistence during power > >> failure kind of scenarios. > > No, QEMU is not expected to make that guarantee. QEMU is free to lie > > to the guest about the persistence guarantees of the guest PMEM > > ranges. It's more accurate to say that QEMU nvdimm devices can emulate > > persistent memory and optionally pass through host power-fail > > persistence guarantees to the guest. The power-fail persistence domain > > can be one of "cpu_cache", or "memory_controller" if the persistent > > memory region is "synchronous". If the persistent range is not > > synchronous, it really isn't "persistent memory"; it's memory mapped > > storage that needs I/O commands to flush. > > Since this is virtual nvdimm(v-nvdimm) backed by a file, and the data is > completely > > in the host pagecache, and we need a way to ensure that host pagecaches > > are flushed to the backend. This analogous to the WPQ flush being offloaded > > to the hypervisor. No, it isn't analogous. WPQ flush is an optional mechanism to force data to a higher durability domain. The flush in this interface is mandatory. It's a different class of device. The proposal that "sync-dax=unsafe" for non-PPC architectures, is a fundamental misrepresentation of how this is supposed to work. Rather than make "sync-dax" a first class citizen of the device-description interface I'm proposing that you make this a separate device-type. This also solves the problem that "sync-dax" with an implicit architecture backend assumption is not precise, but a new "non-nvdimm" device type would make it explicit what the host is advertising to the guest. > > > Ref: https://github.com/dgibson/qemu/blob/main/docs/nvdimm.txt > > > > > > >> The libpmem has architecture specific instructions like dcbf on POWER > > Which "libpmem" is this? PMDK is a reference library not a PMEM > > interface... maybe I'm missing what libpmem has to do with QEMU? > > > I was referrering to semantics of flushing pmem cache lines as in > > PMDK/libpmem. > > > > > >> to flush the cache data to backend nvdimm device during normal writes > >> followed by explicit flushes if the backend devices are not synchronous > >> DAX capable. > >> > >> Qemu - virtual nvdimm devices are memory mapped. The dcbf in the guest > >> and the subsequent flush doesn't traslate to actual flush to the backend > > s/traslate/translate/ > > > >> file on the host in case of file backed v-nvdimms. This is addressed by > >> virtio-pmem in case of x86_64 by making explicit flushes translating to > >> fsync at qemu. > > Note that virtio-pmem was a proposal for a specific optimization of > > allowing guests to share page cache. The virtio-pmem approach is not > > to be confused with actual persistent memory. > > > >> On SPAPR, the issue is addressed by adding a new hcall to > >> request for an explicit flush from the guest ndctl driver when the backend > > What is an "ndctl" driver? ndctl is userspace tooling, do you mean the > > guest pmem driver? > > > oops, wrong terminologies. I was referring to guest libnvdimm and > > papr_scm kernel modules. > > > > > >> nvdimm cannot ensure write persistence with dcbf alone. 
So, the approach > >> here is to convey when the hcall flush is required in a device tree > >> property. The guest makes the hcall when the property is found, instead > >> of relying on dcbf. > >> > >> A new device property sync-dax is added to the nvdimm device. When the > >> sync-dax is 'writeback'(default for PPC), device property > >> "hcall-flush-required" is set, and the guest makes hcall H_SCM_FLUSH > >> requesting for an explicit flush. > > I'm not sure "sync-dax" is a suitable name for the property of the > > guest persistent memory. > > > sync-dax property translates ND_REGION_ASYNC flag being set/unset Yes, I am aware, but that property alone is not sufficient to identify the flush mechanism. > > for the pmem region also if the nvdimm_flush callback is provided in the > > papr_scm or not. As everything boils down to synchronous nature > > of the device, I chose sync-dax for the name. > > > > There is no requirement that the > > memory-backend file for a guest be a dax-capable file. It's also > > implementation specific what hypercall needs to be invoked for a given > > occurrence of "sync-dax". What does that map to on non-PPC platforms > > for example? > > > The backend file can be dax-capable, to be hinted using "sync-dax=direct". All memory-mapped files are "dax-capable". "DAX" is an access mechanism, not a data-integrity contract. > When the backend is not dax-capable, the "sync-dax=writeback" to used, No, the qemu property for this shuold be a separate device-type. > so that the guest makes the hcall. On all non-PPC archs, with the > > "sync-dax=writeback" qemu errors out stating the lack of support. There is no "lack of support" to be worried about on other archs if the interface is explicit about the atypical flush arrangement. > > > > It seems to me that an "nvdimm" device presents the > > synchronous usage model and a whole other device type implements an > > async-hypercall setup that the guest happens to service with its > > nvdimm stack, but it's not an "nvdimm" anymore at that point. > > > In case the file backing the v-nvdimm is not dax-capable, we need flush > > semantics on the guest to be mapped to pagecache flush on the host side. > > > > > >> sync-dax is "unsafe" on all other platforms(x86, ARM) and old pseries machines > >> prior to 5.2 on PPC. sync-dax="writeback" on ARM and x86_64 is prevented > >> now as the flush semantics are unimplemented. > > "sync-dax" has no meaning on its own, I think this needs an explicit > > mechanism to convey both the "not-sync" property *and* the callback > > method, it shouldn't be inferred by arch type. > > > Yes. On all platforms the "sync-dax=unsafe" meaning - with host power > > failure the host pagecache is lost and subsequently data written by the > > guest will also be gone. This is the default for non-PPC. The default to date has been for the guest to trust that an nvdimm is an nvdimm with no explicit flushing required. It's too late now to introduce an "unsafe" default. > > > On PPC, the default is "sync-dax=writeback" - so the ND_REGION_ASYNC > > is set for the region and the guest makes hcalls to issue fsync on the host. > > > Are you suggesting me to keep it "unsafe" as default for all architectures > > including PPC and a user can set it to "writeback" if desired. No, I am suggesting that "sync-dax" is insufficient to convey this property. This behavior warrants its own device type, not an ambiguous property of the memory-backend-file with implicit architecture assumptions attached.
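For context on the persistence-domain terminology used earlier in the thread ("cpu_cache" vs "memory_controller"), a guest can inspect what its libnvdimm core concluded for a region via sysfs. The snippet below is a small hedged reader; the path follows the usual /sys/bus/nd layout and "region0" is an assumption, pick the region you care about.

```c
/*
 * Hedged sketch: read the persistence domain the guest kernel assigned to
 * an nvdimm region. Adjust the region name for your system.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *path = "/sys/bus/nd/devices/region0/persistence_domain";
    char buf[64] = "";

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    if (fgets(buf, sizeof(buf), f)) {
        buf[strcspn(buf, "\n")] = '\0';
        /* Expected values include "cpu_cache", "memory_controller",
         * or "unknown" when no power-fail guarantee is advertised. */
        printf("persistence_domain: %s\n", buf);
    }
    fclose(f);
    return 0;
}
```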
On 5/4/21 1:11 AM, Dan Williams wrote: > On Mon, May 3, 2021 at 7:06 AM Shivaprasad G Bhat <sbhat@linux.ibm.com> wrote: >> ..... > > The proposal that "sync-dax=unsafe" for non-PPC architectures, is a > fundamental misrepresentation of how this is supposed to work. Rather > than make "sync-dax" a first class citizen of the device-description > interface I'm proposing that you make this a separate device-type. > This also solves the problem that "sync-dax" with an implicit > architecture backend assumption is not precise, but a new "non-nvdimm" > device type would make it explicit what the host is advertising to the > guest. > Currently, users can use virtualized nvdimm support in Qemu to share the host page cache with the guest via the below command -object memory-backend-file,id=memnvdimm1,mem-path=file_name_in_host_fs -device nvdimm,memdev=memnvdimm1 Such usage can result in wrong application behavior because there is no hint to the application/guest OS that a cpu cache flush is not sufficient to ensure persistence. I understand that virtio-pmem is suggested as an alternative for that. But why not fix virtualized nvdimm if platforms can express the details? i.e., can ACPI indicate to the guest OS that the device needs a flush mechanism to ensure persistence in the above case? What this patch series did was to express that property via a device tree node, and the guest driver enables a hypercall-based flush mechanism to ensure persistence. .... >> >> On PPC, the default is "sync-dax=writeback" - so the ND_REGION_ASYNC >> >> is set for the region and the guest makes hcalls to issue fsync on the host. >> >> >> Are you suggesting me to keep it "unsafe" as default for all architectures >> >> including PPC and a user can set it to "writeback" if desired. > > No, I am suggesting that "sync-dax" is insufficient to convey this > property. This behavior warrants its own device type, not an ambiguous > property of the memory-backend-file with implicit architecture > assumptions attached. > Why is it insufficient? Is it because other architectures don't have the ability to express this detail to the guest OS? Isn't that an arch limitation? -aneesh
> > The proposal that "sync-dax=unsafe" for non-PPC architectures, is a > > fundamental misrepresentation of how this is supposed to work. Rather > > than make "sync-dax" a first class citizen of the device-description > > interface I'm proposing that you make this a separate device-type. > > This also solves the problem that "sync-dax" with an implicit > > architecture backend assumption is not precise, but a new "non-nvdimm" > > device type would make it explicit what the host is advertising to the > > guest. > > > > Currently, users can use a virtualized nvdimm support in Qemu to share > host page cache to the guest via the below command > > -object memory-backend-file,id=memnvdimm1,mem-path=file_name_in_host_fs > -device nvdimm,memdev=memnvdimm1 > > Such usage can results in wrong application behavior because there is no > hint to the application/guest OS that a cpu cache flush is not > sufficient to ensure persistence. > > I understand that virio-pmem is suggested as an alternative for that. > But why not fix virtualized nvdimm if platforms can express the details. > > ie, can ACPI indicate to the guest OS that the device need a flush > mechanism to ensure persistence in the above case? > > What this patch series did was to express that property via a device > tree node and guest driver enables a hypercall based flush mechanism to > ensure persistence. Would a VIRTIO-based mechanism (entirely asynchronous, no trap at host side) be better than a hypercall-based one? Registering memory can be done either way. We implemented the virtio-pmem flush mechanism with the below considerations: - Proper semantics for guest flush requests. - An efficient mechanism from a performance point of view. I am just asking myself if we have a platform-agnostic mechanism already there; maybe we can extend it to suit our needs? Maybe I am missing some points here. > >> On PPC, the default is "sync-dax=writeback" - so the ND_REGION_ASYNC > >> > >> is set for the region and the guest makes hcalls to issue fsync on the host. > >> > >> > >> Are you suggesting me to keep it "unsafe" as default for all architectures > >> > >> including PPC and a user can set it to "writeback" if desired. > > > > No, I am suggesting that "sync-dax" is insufficient to convey this > > property. This behavior warrants its own device type, not an ambiguous > > property of the memory-backend-file with implicit architecture > > assumptions attached. > > > > Why is it insufficient? Is it because other architectures don't have an > ability express this detail to guest OS? Isn't that an arch limitations?
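Whichever transport carries the guest's flush request (a virtio-pmem request or the H_SCM_FLUSH hcall discussed in this series), the host-side effect described in this thread is essentially the same: flush the page cache of the backing file. The sketch below is a deliberately simplified model, not the spapr_nvdimm.c or virtio-pmem code; names and return codes are assumptions.

```c
/*
 * Simplified host-side model of servicing a guest flush request for a
 * file-backed virtual NVDIMM: push dirty page cache to stable storage.
 * Return codes are placeholders, not the PAPR/virtio status values.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define FLUSH_OK  0
#define FLUSH_ERR (-1)

static int host_service_guest_flush(int backing_fd)
{
    /* fsync() is what "explicit flushes translating to fsync at qemu"
     * boils down to on the host side. */
    return fsync(backing_fd) == 0 ? FLUSH_OK : FLUSH_ERR;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <backing-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int rc = host_service_guest_flush(fd);
    printf("flush %s\n", rc == FLUSH_OK ? "ok" : "failed");
    close(fd);
    return 0;
}
```

A real implementation would offload the fsync() to a worker thread so the flush does not stall the vCPU, which is what the cover letter's thread-pool/aio approach refers to.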
On 5/4/21 11:13 AM, Pankaj Gupta wrote: .... >> >> What this patch series did was to express that property via a device >> tree node and guest driver enables a hypercall based flush mechanism to >> ensure persistence. > > Would VIRTIO (entirely asynchronous, no trap at host side) based > mechanism is better > than hyper-call based? Registering memory can be done any way. We > implemented virtio-pmem > flush mechanisms with below considerations: > > - Proper semantic for guest flush requests. > - Efficient mechanism for performance pov. > Sure, virtio-pmem can be used as an alternative. > I am just asking myself if we have platform agnostic mechanism already > there, maybe > we can extend it to suit our needs? Maybe I am missing some points here. > What is being attempted in this series is to indicate to the guest OS that the backing device/file used for the emulated nvdimm device cannot guarantee persistence via cpu cache flush instructions. >>>> On PPC, the default is "sync-dax=writeback" - so the ND_REGION_ASYNC >>>> >>>> is set for the region and the guest makes hcalls to issue fsync on the host. >>>> >>>> >>>> Are you suggesting me to keep it "unsafe" as default for all architectures >>>> >>>> including PPC and a user can set it to "writeback" if desired. >>> >>> No, I am suggesting that "sync-dax" is insufficient to convey this >>> property. This behavior warrants its own device type, not an ambiguous >>> property of the memory-backend-file with implicit architecture >>> assumptions attached. >>> >> >> Why is it insufficient? Is it because other architectures don't have an >> ability express this detail to guest OS? Isn't that an arch limitations? -aneesh
On Tue, May 4, 2021 at 2:03 AM Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> wrote: > > On 5/4/21 11:13 AM, Pankaj Gupta wrote: > .... > > >> > >> What this patch series did was to express that property via a device > >> tree node and guest driver enables a hypercall based flush mechanism to > >> ensure persistence. > > > > Would VIRTIO (entirely asynchronous, no trap at host side) based > > mechanism is better > > than hyper-call based? Registering memory can be done any way. We > > implemented virtio-pmem > > flush mechanisms with below considerations: > > > > - Proper semantic for guest flush requests. > > - Efficient mechanism for performance pov. > > > > sure, virio-pmem can be used as an alternative. > > > I am just asking myself if we have platform agnostic mechanism already > > there, maybe > > we can extend it to suit our needs? Maybe I am missing some points here. > > > > What is being attempted in this series is to indicate to the guest OS > that the backing device/file used for emulated nvdimm device cannot > guarantee the persistence via cpu cache flush instructions. Right, the detail I am arguing is that it should be a device description, not a backend file property. In other words this proposal is advocating this: -object memory-backend-file,id=mem1,share,sync-dax=$sync-dax,mem-path=$path -device nvdimm,memdev=mem1 ...and I am advocating for reusing or duplicating the virtio-pmem model like this: -object memory-backend-file,id=mem1,share,mem-path=$path -device spapr-hcall,memdev=mem1 ...because the interface to the guest is the device, not the memory-backend-file.