
[4/4] hvmloader: add support to load extra ACPI tables from qemu

Message ID 1451388711-18646-5-git-send-email-haozhong.zhang@intel.com (mailing list archive)
State New, archived

Commit Message

Haozhong Zhang Dec. 29, 2015, 11:31 a.m. UTC
NVDIMM devices are detected and configured by software through
ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
patch extends the existing mechanism in hvmloader of loading passthrough
ACPI tables to load extra ACPI tables built by QEMU.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
---
 tools/firmware/hvmloader/acpi/build.c   | 34 +++++++++++++++++++++++++++------
 xen/include/public/hvm/hvm_xs_strings.h |  3 +++
 2 files changed, 31 insertions(+), 6 deletions(-)
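
For context, the existing passthrough-table mechanism that this patch extends
works roughly as sketched below: the toolstack/device model publishes the
address and length of pre-built ACPI tables via xenstore, and hvmloader copies
each table into the guest ACPI area. This is an illustrative sketch only;
xs_read_u64() and acpi_add_table() are hypothetical helpers standing in for
hvmloader's real utility functions, and the keys shown are the existing
passthrough keys, not the new ones this patch introduces.

    #include <stdint.h>

    #define HVM_XS_ACPI_PT_ADDRESS "hvmloader/acpi/pt/address"
    #define HVM_XS_ACPI_PT_LENGTH  "hvmloader/acpi/pt/length"

    extern uint64_t xs_read_u64(const char *key);              /* hypothetical */
    extern void acpi_add_table(const void *tbl, uint32_t len); /* hypothetical */

    static void load_passthrough_tables(void)
    {
        uint64_t addr = xs_read_u64(HVM_XS_ACPI_PT_ADDRESS);
        uint64_t len  = xs_read_u64(HVM_XS_ACPI_PT_LENGTH);
        uint8_t *p = (uint8_t *)(uintptr_t)addr;
        uint8_t *end = p + len;

        /* Each table starts with a standard ACPI header; bytes 4..7 hold
         * the table length.  Walk the blob table by table. */
        while ( p + 8 <= end )
        {
            uint32_t tbl_len = *(uint32_t *)(p + 4);

            if ( tbl_len < 36 || p + tbl_len > end )
                break;
            acpi_add_table(p, tbl_len);
            p += tbl_len;
        }
    }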

Comments

Jan Beulich Jan. 15, 2016, 5:10 p.m. UTC | #1
>>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> NVDIMM devices are detected and configured by software through
> ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> patch extends the existing mechanism in hvmloader of loading passthrough
> ACPI tables to load extra ACPI tables built by QEMU.

Mechanically the patch looks okay, but whether it's actually needed
depends on whether we indeed want NV RAM managed in qemu
instead of in the hypervisor (where imo it belongs); I didn't see any
reply yet to that same comment of mine made (iirc) in the context
of another patch.

Jan
Haozhong Zhang Jan. 18, 2016, 12:52 a.m. UTC | #2
On 01/15/16 10:10, Jan Beulich wrote:
> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> > NVDIMM devices are detected and configured by software through
> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> > patch extends the existing mechanism in hvmloader of loading passthrough
> > ACPI tables to load extra ACPI tables built by QEMU.
> 
> Mechanically the patch looks okay, but whether it's actually needed
> depends on whether indeed we want NV RAM managed in qemu
> instead of in the hypervisor (where imo it belongs); I didn' see any
> reply yet to that same comment of mine made (iirc) in the context
> of another patch.
> 
> Jan
> 

One purpose of this patch series is to provide vNVDIMM backed by host
NVDIMM devices. That requires drivers to detect and manage host
NVDIMM devices (including parsing ACPI, managing labels, etc.), which is
not trivial, so I leave this work to the Dom0 Linux kernel. The current
Linux kernel abstracts NVDIMM devices as block devices (/dev/pmemXX). QEMU
then mmaps them into a certain range of dom0's address space and asks the
Xen hypervisor to map that range of address space into a domU.

However, there are two problems in this Xen patch series and the
corresponding QEMU patch series, which may require further
changes in the hypervisor and/or toolstack.

(1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
    the host NVDIMM to a domU, which results in a VM exit for every guest
    read/write to the corresponding vNVDIMM devices. I'm going to find
    a way to pass through the address space range of the host NVDIMM to a
    guest domU (similar to what xen-pt in QEMU does); a rough sketch of
    the current call follows after this list.

(2) Xen currently does not check whether the address that QEMU asks to
    map to a domU is really within the host NVDIMM address
    space. Therefore, the Xen hypervisor needs a way to determine the host
    NVDIMM address space, which can be done by parsing the ACPI NFIT
    tables.
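
For reference, the registration in the current QEMU patches is roughly the
call below (a sketch only; the domain ID, ioreq-server ID and range are
placeholders):

    #include <stdint.h>
    #include <xenctrl.h>

    /* Sketch: how a vNVDIMM guest-physical range gets trapped by the
     * ioreq server today.  Every guest access to [start, end] then
     * exits to QEMU, which is the performance problem described in (1). */
    static int register_vnvdimm_range(xc_interface *xch, domid_t domid,
                                      ioservid_t ioservid,
                                      uint64_t start, uint64_t end)
    {
        return xc_hvm_map_io_range_to_ioreq_server(xch, domid, ioservid,
                                                   1 /* is_mmio */,
                                                   start, end);
    }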

Haozhong
Jan Beulich Jan. 18, 2016, 8:46 a.m. UTC | #3
>>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> On 01/15/16 10:10, Jan Beulich wrote:
>> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
>> > NVDIMM devices are detected and configured by software through
>> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
>> > patch extends the existing mechanism in hvmloader of loading passthrough
>> > ACPI tables to load extra ACPI tables built by QEMU.
>> 
>> Mechanically the patch looks okay, but whether it's actually needed
>> depends on whether indeed we want NV RAM managed in qemu
>> instead of in the hypervisor (where imo it belongs); I didn' see any
>> reply yet to that same comment of mine made (iirc) in the context
>> of another patch.
> 
> One purpose of this patch series is to provide vNVDIMM backed by host
> NVDIMM devices. It requires some drivers to detect and manage host
> NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> are not trivial, so I leave this work to the dom0 linux. Current Linux
> kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> then mmaps them into certain range of dom0's address space and asks
> Xen hypervisor to map that range of address space to a domU.
> 
> However, there are two problems in this Xen patch series and the
> corresponding QEMU patch series, which may require further
> changes in hypervisor and/or toolstack.
> 
> (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
>     the host NVDIMM to domU, which results VMEXIT for every guest
>     read/write to the corresponding vNVDIMM devices. I'm going to find
>     a way to passthrough the address space range of host NVDIMM to a
>     guest domU (similarly to what xen-pt in QEMU uses)
>     
> (2) Xen currently does not check whether the address that QEMU asks to
>     map to domU is really within the host NVDIMM address
>     space. Therefore, Xen hypervisor needs a way to decide the host
>     NVDIMM address space which can be done by parsing ACPI NFIT
>     tables.

These problems are a pretty direct result of the management of
NVDIMM not being done by the hypervisor.

Stating what qemu currently does is, I'm afraid, not really serving
the purpose of hashing out whether the management of NVDIMM,
just like that of "normal" RAM, wouldn't better be done by the
hypervisor. In fact so far I haven't seen any rationale (other than
the desire to share code with KVM) for the presently chosen
solution. Yet in KVM qemu is - afaict - much more of an integral part
of the hypervisor than it is in the Xen case (and even there core
management of the memory is left to the kernel, i.e. what
constitutes the core hypervisor there).

Jan
Wei Liu Jan. 19, 2016, 11:37 a.m. UTC | #4
On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> > On 01/15/16 10:10, Jan Beulich wrote:
> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> >> > NVDIMM devices are detected and configured by software through
> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> >> > ACPI tables to load extra ACPI tables built by QEMU.
> >> 
> >> Mechanically the patch looks okay, but whether it's actually needed
> >> depends on whether indeed we want NV RAM managed in qemu
> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> >> reply yet to that same comment of mine made (iirc) in the context
> >> of another patch.
> > 
> > One purpose of this patch series is to provide vNVDIMM backed by host
> > NVDIMM devices. It requires some drivers to detect and manage host
> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > then mmaps them into certain range of dom0's address space and asks
> > Xen hypervisor to map that range of address space to a domU.
> > 

OOI, do we have a viable solution for doing all these non-trivial things in
the core hypervisor? Are you proposing designing a new set of hypercalls
for NVDIMM?

Wei.
Jan Beulich Jan. 19, 2016, 11:46 a.m. UTC | #5
>>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote:
> On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
>> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
>> > On 01/15/16 10:10, Jan Beulich wrote:
>> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
>> >> > NVDIMM devices are detected and configured by software through
>> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
>> >> > patch extends the existing mechanism in hvmloader of loading passthrough
>> >> > ACPI tables to load extra ACPI tables built by QEMU.
>> >> 
>> >> Mechanically the patch looks okay, but whether it's actually needed
>> >> depends on whether indeed we want NV RAM managed in qemu
>> >> instead of in the hypervisor (where imo it belongs); I didn' see any
>> >> reply yet to that same comment of mine made (iirc) in the context
>> >> of another patch.
>> > 
>> > One purpose of this patch series is to provide vNVDIMM backed by host
>> > NVDIMM devices. It requires some drivers to detect and manage host
>> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
>> > are not trivial, so I leave this work to the dom0 linux. Current Linux
>> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
>> > then mmaps them into certain range of dom0's address space and asks
>> > Xen hypervisor to map that range of address space to a domU.
>> > 
> 
> OOI Do we have a viable solution to do all these non-trivial things in
> core hypervisor?  Are you proposing designing a new set of hypercalls
> for NVDIMM?  

That's certainly a possibility; I lack sufficient detail to form
an opinion on which route is going to be best.

Jan
Tian, Kevin Jan. 20, 2016, 5:14 a.m. UTC | #6
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, January 19, 2016 7:47 PM
> 
> >>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote:
> > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> >> > On 01/15/16 10:10, Jan Beulich wrote:
> >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> >> >> > NVDIMM devices are detected and configured by software through
> >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> >> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> >> >> > ACPI tables to load extra ACPI tables built by QEMU.
> >> >>
> >> >> Mechanically the patch looks okay, but whether it's actually needed
> >> >> depends on whether indeed we want NV RAM managed in qemu
> >> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> >> >> reply yet to that same comment of mine made (iirc) in the context
> >> >> of another patch.
> >> >
> >> > One purpose of this patch series is to provide vNVDIMM backed by host
> >> > NVDIMM devices. It requires some drivers to detect and manage host
> >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> >> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> >> > then mmaps them into certain range of dom0's address space and asks
> >> > Xen hypervisor to map that range of address space to a domU.
> >> >
> >
> > OOI Do we have a viable solution to do all these non-trivial things in
> > core hypervisor?  Are you proposing designing a new set of hypercalls
> > for NVDIMM?
> 
> That's certainly a possibility; I lack sufficient detail to make myself
> an opinion which route is going to be best.
> 
> Jan

Hi, Haozhong,

Are the NVDIMM-related ACPI tables in plain text format, or do they require
an ACPI parser to decode? Is there a corresponding E820 entry?

The above information would be useful to help decide the direction.

At first glance I like Jan's idea that it's better to let Xen manage NVDIMM,
since it's a type of memory resource and we expect the hypervisor to
centrally manage memory.

On second thought, however, the answer is different if we view this
resource as an MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc.
Then it should be fine to have Dom0 manage the NVDIMM, with Xen just
controlling the mapping based on the existing I/O permission mechanism.

Another possible point in favour of this model is that PMEM is only one
mode of an NVDIMM device, which can also be exposed as a storage device.
In the latter case the management has to be in Dom0, so we wouldn't need
to scatter the management role across Dom0/Xen based on different modes.

Back to your earlier questions:

> (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
>     the host NVDIMM to domU, which results VMEXIT for every guest
>     read/write to the corresponding vNVDIMM devices. I'm going to find
>     a way to passthrough the address space range of host NVDIMM to a
>     guest domU (similarly to what xen-pt in QEMU uses)
> 
> (2) Xen currently does not check whether the address that QEMU asks to
>     map to domU is really within the host NVDIMM address
>     space. Therefore, Xen hypervisor needs a way to decide the host
>     NVDIMM address space which can be done by parsing ACPI NFIT
>     tables.

If you look at how ACPI OpRegion is handled for IGD passthrough:

    ret = xc_domain_iomem_permission(xen_xc, xen_domid,
            (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
            XEN_PCI_INTEL_OPREGION_PAGES,
            XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED);

    ret = xc_domain_memory_mapping(xen_xc, xen_domid,
            (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT),
            (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
            XEN_PCI_INTEL_OPREGION_PAGES,
            DPCI_ADD_MAPPING);

The above can address your two questions. Xen doesn't need to tell exactly
whether the assigned range actually belongs to the NVDIMM, just like
the policy for PCI assignment today.

Thanks
Kevin
Haozhong Zhang Jan. 20, 2016, 5:31 a.m. UTC | #7
Hi Jan, Wei and Kevin,

On 01/18/16 01:46, Jan Beulich wrote:
> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> > On 01/15/16 10:10, Jan Beulich wrote:
> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> >> > NVDIMM devices are detected and configured by software through
> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> >> > ACPI tables to load extra ACPI tables built by QEMU.
> >> 
> >> Mechanically the patch looks okay, but whether it's actually needed
> >> depends on whether indeed we want NV RAM managed in qemu
> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> >> reply yet to that same comment of mine made (iirc) in the context
> >> of another patch.
> > 
> > One purpose of this patch series is to provide vNVDIMM backed by host
> > NVDIMM devices. It requires some drivers to detect and manage host
> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > then mmaps them into certain range of dom0's address space and asks
> > Xen hypervisor to map that range of address space to a domU.
> > 
> > However, there are two problems in this Xen patch series and the
> > corresponding QEMU patch series, which may require further
> > changes in hypervisor and/or toolstack.
> > 
> > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
> >     the host NVDIMM to domU, which results VMEXIT for every guest
> >     read/write to the corresponding vNVDIMM devices. I'm going to find
> >     a way to passthrough the address space range of host NVDIMM to a
> >     guest domU (similarly to what xen-pt in QEMU uses)
> >     
> > (2) Xen currently does not check whether the address that QEMU asks to
> >     map to domU is really within the host NVDIMM address
> >     space. Therefore, Xen hypervisor needs a way to decide the host
> >     NVDIMM address space which can be done by parsing ACPI NFIT
> >     tables.
> 
> These problems are a pretty direct result of the management of
> NVDIMM not being done by the hypervisor.
> 
> Stating what qemu currently does is, I'm afraid, not really serving
> the purpose of hashing out whether the management of NVDIMM,
> just like that of "normal" RAM, wouldn't better be done by the
> hypervisor. In fact so far I haven't seen any rationale (other than
> the desire to share code with KVM) for the presently chosen
> solution. Yet in KVM qemu is - afaict - much more of an integral part
> of the hypervisor than it is in the Xen case (and even there core
> management of the memory is left to the kernel, i.e. what
> constitutes the core hypervisor there).
> 
> Jan
> 

Sorry for the late reply; I was reading some code and trying to
get things clear for myself.

The primary reason for the current solution is to reuse the existing NVDIMM
driver in the Linux kernel.

One responsibility of this driver is to discover NVDIMM devices and
their parameters (e.g. which portion of an NVDIMM device can be mapped
into the system address space and which address it is mapped to) by
parsing the ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
the ACPI Specification v6 and the actual code in the Linux kernel
(drivers/acpi/nfit.*), this is not a trivial task.

Secondly, the driver implements a convenient block device interface to
let software access the areas where NVDIMM devices are mapped. The
existing vNVDIMM implementation in QEMU uses this interface.

As the Linux NVDIMM driver already does all of the above, why bother
reimplementing it in Xen?

For the two problems raised in my previous reply, the following are my
thoughts.

(1) (for the first problem) QEMU mmaps /dev/pmemXX into its virtual
    address space. When it works with KVM, it calls the KVM API to map
    that virtual address space range into a guest physical address
    space.

    For Xen, I'm going to do a similar thing, but Xen does not seem to
    provide such an API. The closest one I can find is
    XEN_DOMCTL_memory_mapping (which is used by VGA passthrough in
    QEMU's xen_pt_graphics), but it does not accept a guest virtual
    address. Thus, I'm going to add a new one that does similar work
    but can accept a guest virtual address.

(2) (for the second problem) After having looked at the corresponding
    Linux kernel code and given my comments at the beginning, I now
    doubt whether it's necessary to parse the NFIT in Xen. Maybe I can
    follow what xen_pt_graphics does, that is, assign the guest
    permission to access the corresponding host NVDIMM address space
    range and then call the new hypercall added in (1).

    Again, a new hypercall that is similar to
    XEN_DOMCTL_iomem_permission and can accept a guest virtual address
    is needed; for comparison, a rough sketch of the existing mfn-based
    interface follows below.
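
The existing interface used by xen_pt_graphics has roughly the shape sketched
below (the frame numbers would be placeholders); it is exactly the first_mfn
argument that QEMU cannot supply for /dev/pmemXX, hence the need for a
variant accepting a virtual address:

    #include <xenctrl.h>

    /* Sketch: mapping host machine frames into a guest the way
     * xen_pt_graphics-style code does today.  first_mfn must already be
     * a host machine frame number; QEMU only has a virtual address from
     * mmap()ing /dev/pmemXX. */
    static int map_host_frames(xc_interface *xch, uint32_t domid,
                               unsigned long first_gfn,
                               unsigned long first_mfn,
                               unsigned long nr_mfns)
    {
        return xc_domain_memory_mapping(xch, domid, first_gfn, first_mfn,
                                        nr_mfns, DPCI_ADD_MAPPING);
    }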

Any comments?

Thanks,
Haozhong
Haozhong Zhang Jan. 20, 2016, 5:58 a.m. UTC | #8
On 01/20/16 13:14, Tian, Kevin wrote:
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Tuesday, January 19, 2016 7:47 PM
> > 
> > >>> On 19.01.16 at 12:37, <wei.liu2@citrix.com> wrote:
> > > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> > >> >>> On 18.01.16 at 01:52, <haozhong.zhang@intel.com> wrote:
> > >> > On 01/15/16 10:10, Jan Beulich wrote:
> > >> >> >>> On 29.12.15 at 12:31, <haozhong.zhang@intel.com> wrote:
> > >> >> > NVDIMM devices are detected and configured by software through
> > >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> > >> >> > patch extends the existing mechanism in hvmloader of loading passthrough
> > >> >> > ACPI tables to load extra ACPI tables built by QEMU.
> > >> >>
> > >> >> Mechanically the patch looks okay, but whether it's actually needed
> > >> >> depends on whether indeed we want NV RAM managed in qemu
> > >> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> > >> >> reply yet to that same comment of mine made (iirc) in the context
> > >> >> of another patch.
> > >> >
> > >> > One purpose of this patch series is to provide vNVDIMM backed by host
> > >> > NVDIMM devices. It requires some drivers to detect and manage host
> > >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > >> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > >> > then mmaps them into certain range of dom0's address space and asks
> > >> > Xen hypervisor to map that range of address space to a domU.
> > >> >
> > >
> > > OOI Do we have a viable solution to do all these non-trivial things in
> > > core hypervisor?  Are you proposing designing a new set of hypercalls
> > > for NVDIMM?
> > 
> > That's certainly a possibility; I lack sufficient detail to make myself
> > an opinion which route is going to be best.
> > 
> > Jan
> 
> Hi, Haozhong,
> 
> Are NVDIMM related ACPI table in plain text format, or do they require
> a ACPI parser to decode? Is there a corresponding E820 entry?
>

Mostly in plain data format, but the driver still evaluates the _FIT
(firmware interface table) method, and decoding is needed for that.

> Above information would be useful to help decide the direction.
> 
> In a glimpse I like Jan's idea that it's better to let Xen manage NVDIMM
> since it's a type of memory resource while for memory we expect hypervisor
> to centrally manage.
> 
> However in another thought the answer is different if we view this 
> resource as a MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc.
> then it should be fine to have Dom0 manage NVDIMM then Xen just controls
> the mapping based on existing io permission mechanism.
>

It's more like an MMIO device than normal RAM.

> Another possible point for this model is that PMEM is only one mode of 
> NVDIMM device, which can be also exposed as a storage device. In the
> latter case the management has to be in Dom0. So we don't need to
> scatter the management role into Dom0/Xen based on different modes.
>

An NVDIMM device in pmem mode is exposed as a storage device (a block
device /dev/pmemXX) in Linux, and it's also used like a disk drive
(you can make a file system on it, create files on it, and even pass
files rather than the whole /dev/pmemXX to guests).

> Back to your earlier questions:
> 
> > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
> >     the host NVDIMM to domU, which results VMEXIT for every guest
> >     read/write to the corresponding vNVDIMM devices. I'm going to find
> >     a way to passthrough the address space range of host NVDIMM to a
> >     guest domU (similarly to what xen-pt in QEMU uses)
> > 
> > (2) Xen currently does not check whether the address that QEMU asks to
> >     map to domU is really within the host NVDIMM address
> >     space. Therefore, Xen hypervisor needs a way to decide the host
> >     NVDIMM address space which can be done by parsing ACPI NFIT
> >     tables.
> 
> If you look at how ACPI OpRegion is handled for IGD passthrough:
> 
>  241     ret = xc_domain_iomem_permission(xen_xc, xen_domid,
>  242             (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
>  243             XEN_PCI_INTEL_OPREGION_PAGES,
>  244             XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED);
> 
>  254     ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>  255             (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT),
>  256             (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
>  257             XEN_PCI_INTEL_OPREGION_PAGES,
>  258             DPCI_ADD_MAPPING);
>

Yes, I've noticed these two functions. The additional work would be
adding new ones that can accept a virtual address, as QEMU has no easy
way to get the physical address of /dev/pmemXX and can only mmap it
into its virtual address space.
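
As a concrete illustration of that constraint, the device model side only
ever ends up with something like the sketch below (the path and size are
placeholders), i.e. a dom0 virtual address with no view of the underlying
machine frames:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Sketch only: mmap a pmem block device into QEMU's address space.
     * The returned pointer is a dom0 virtual address; translating it to
     * host machine frames is exactly what QEMU cannot easily do itself. */
    static void *map_pmem(const char *path, size_t size)
    {
        int fd = open(path, O_RDWR);
        void *va;

        if ( fd < 0 )
            return NULL;

        va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);  /* the mapping stays valid after close() */
        return va == MAP_FAILED ? NULL : va;
    }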

> Above can address your 2 questions. Xen doesn't need to tell exactly
> whether the assigned range actually belongs to NVDIMM, just like
> the policy for PCI assignment today.
>

Does that mean the Xen hypervisor can trust whatever addresses the dom0
kernel and QEMU provide?

Thanks,
Haozhong
Jan Beulich Jan. 20, 2016, 8:46 a.m. UTC | #9
>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> The primary reason of current solution is to reuse existing NVDIMM
> driver in Linux kernel.

Re-using code in the Dom0 kernel has benefits and drawbacks, and
in any event needs to depend on proper layering to remain in place.
A benefit is less code duplication between Xen and Linux; along the
same lines a drawback is code duplication between various Dom0
OS variants.

> One responsibility of this driver is to discover NVDIMM devices and
> their parameters (e.g. which portion of an NVDIMM device can be mapped
> into the system address space and which address it is mapped to) by
> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> ACPI Specification v6 and the actual code in Linux kernel
> (drivers/acpi/nfit.*), it's not a trivial task.

To answer one of Kevin's questions: the NFIT table doesn't appear
to require the ACPI interpreter; it seems more like SRAT and SLIT.
Also, you failed to answer Kevin's question regarding E820 entries: I
think NVDIMMs (or at least parts thereof) get represented in the E820 (or
the EFI memory map), and if that's the case this would be a very
strong hint towards management needing to be in the hypervisor.

> Secondly, the driver implements a convenient block device interface to
> let software access areas where NVDIMM devices are mapped. The
> existing vNVDIMM implementation in QEMU uses this interface.
> 
> As Linux NVDIMM driver has already done above, why do we bother to
> reimplement them in Xen?

See above; a possibility is that we may need a split model (block
layer parts in Dom0, "normal memory" parts in the hypervisor).
Iirc the split is determined by firmware, and hence set in
stone by the time the OS (or hypervisor) boot starts.

Jan
Andrew Cooper Jan. 20, 2016, 8:58 a.m. UTC | #10
On 20/01/2016 08:46, Jan Beulich wrote:
>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>> The primary reason of current solution is to reuse existing NVDIMM
>> driver in Linux kernel.
> Re-using code in the Dom0 kernel has benefits and drawbacks, and
> in any event needs to depend on proper layering to remain in place.
> A benefit is less code duplication between Xen and Linux; along the
> same lines a drawback is code duplication between various Dom0
> OS variants.
>
>> One responsibility of this driver is to discover NVDIMM devices and
>> their parameters (e.g. which portion of an NVDIMM device can be mapped
>> into the system address space and which address it is mapped to) by
>> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
>> ACPI Specification v6 and the actual code in Linux kernel
>> (drivers/acpi/nfit.*), it's not a trivial task.
> To answer one of Kevin's questions: The NFIT table doesn't appear
> to require the ACPI interpreter. They seem more like SRAT and SLIT.
> Also you failed to answer Kevin's question regarding E820 entries: I
> think NVDIMM (or at least parts thereof) get represented in E820 (or
> the EFI memory map), and if that's the case this would be a very
> strong hint towards management needing to be in the hypervisor.

Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped
into memory.  I am still on the dom0 side of this fence.

The real question is whether it is possible to take an NVDIMM, split it
in half, give each half to two different guests (with appropriate NFIT
tables), and have that be sufficient for the guests to just work.

Either way, it needs to be a toolstack policy decision as to how to
split the resource.

~Andrew

>
>> Secondly, the driver implements a convenient block device interface to
>> let software access areas where NVDIMM devices are mapped. The
>> existing vNVDIMM implementation in QEMU uses this interface.
>>
>> As Linux NVDIMM driver has already done above, why do we bother to
>> reimplement them in Xen?
> See above; a possibility is that we may need a split model (block
> layer parts on Dom0, "normal memory" parts in the hypervisor.
> Iirc the split is being determined by firmware, and hence set in
> stone by the time OS (or hypervisor) boot starts.
>
> Jan
>
Haozhong Zhang Jan. 20, 2016, 10:15 a.m. UTC | #11
On 01/20/16 08:58, Andrew Cooper wrote:
> On 20/01/2016 08:46, Jan Beulich wrote:
> >>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> >> The primary reason of current solution is to reuse existing NVDIMM
> >> driver in Linux kernel.
> > Re-using code in the Dom0 kernel has benefits and drawbacks, and
> > in any event needs to depend on proper layering to remain in place.
> > A benefit is less code duplication between Xen and Linux; along the
> > same lines a drawback is code duplication between various Dom0
> > OS variants.
> >
> >> One responsibility of this driver is to discover NVDIMM devices and
> >> their parameters (e.g. which portion of an NVDIMM device can be mapped
> >> into the system address space and which address it is mapped to) by
> >> parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> >> ACPI Specification v6 and the actual code in Linux kernel
> >> (drivers/acpi/nfit.*), it's not a trivial task.
> > To answer one of Kevin's questions: The NFIT table doesn't appear
> > to require the ACPI interpreter. They seem more like SRAT and SLIT.
> > Also you failed to answer Kevin's question regarding E820 entries: I
> > think NVDIMM (or at least parts thereof) get represented in E820 (or
> > the EFI memory map), and if that's the case this would be a very
> > strong hint towards management needing to be in the hypervisor.
>

CCing QEMU vNVDIMM maintainer: Xiao Guangrong

> Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped
> into memory.  I am still on the dom0 side of this fence.
> 
> The real question is whether it is possible to take an NVDIMM, split it
> in half, give each half to two different guests (with appropriate NFIT
> tables) and that be sufficient for the guests to just work.
>

Yes, one NVDIMM device can be split into multiple parts and assigned
to different guests, and QEMU is responsible for maintaining virtual NFIT
tables for each part.

> Either way, it needs to be a toolstack policy decision as to how to
> split the resource.
>

But the split does not need to be done on the Xen side, IMO. It can be done
by the dom0 kernel and QEMU, as long as they tell the Xen hypervisor the
address space range of each part.

Haozhong

> ~Andrew
> 
> >
> >> Secondly, the driver implements a convenient block device interface to
> >> let software access areas where NVDIMM devices are mapped. The
> >> existing vNVDIMM implementation in QEMU uses this interface.
> >>
> >> As Linux NVDIMM driver has already done above, why do we bother to
> >> reimplement them in Xen?
> > See above; a possibility is that we may need a split model (block
> > layer parts on Dom0, "normal memory" parts in the hypervisor.
> > Iirc the split is being determined by firmware, and hence set in
> > stone by the time OS (or hypervisor) boot starts.
> >
> > Jan
> >
>
Xiao Guangrong Jan. 20, 2016, 10:36 a.m. UTC | #12
Hi,

On 01/20/2016 06:15 PM, Haozhong Zhang wrote:

> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
>
>> Conceptually, an NVDIMM is just like a fast SSD which is linearly mapped
>> into memory.  I am still on the dom0 side of this fence.
>>
>> The real question is whether it is possible to take an NVDIMM, split it
>> in half, give each half to two different guests (with appropriate NFIT
>> tables) and that be sufficient for the guests to just work.
>>
>
> Yes, one NVDIMM device can be split into multiple parts and assigned
> to different guests, and QEMU is responsible to maintain virtual NFIT
> tables for each part.
>
>> Either way, it needs to be a toolstack policy decision as to how to
>> split the resource.

Currently, we are using the NVDIMM as a block device, and a DAX-based
filesystem is created on it in Linux so that file-related accesses
directly reach the NVDIMM device.

In KVM, if the NVDIMM device needs to be shared by different VMs, we can
create multiple files on the DAX-based filesystem and assign a file to
each VM. In the future, we can enable namespaces (partition-like) for
PMEM memory and assign a namespace to each VM (the current Linux driver
uses the whole PMEM as a single namespace).

I think it is not easy work to let the Xen hypervisor recognize NVDIMM
devices and manage NVDIMM resources.

Thanks!
Haozhong Zhang Jan. 20, 2016, 11:04 a.m. UTC | #13
On 01/20/16 01:46, Jan Beulich wrote:
> >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> > The primary reason of current solution is to reuse existing NVDIMM
> > driver in Linux kernel.
>

CC'ing QEMU vNVDIMM maintainer: Xiao Guangrong

> Re-using code in the Dom0 kernel has benefits and drawbacks, and
> in any event needs to depend on proper layering to remain in place.
> A benefit is less code duplication between Xen and Linux; along the
> same lines a drawback is code duplication between various Dom0
> OS variants.
>

I'm not sure about other Dom0 OSes, but Linux has had an NVDIMM
driver since 4.2.

> > One responsibility of this driver is to discover NVDIMM devices and
> > their parameters (e.g. which portion of an NVDIMM device can be mapped
> > into the system address space and which address it is mapped to) by
> > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> > ACPI Specification v6 and the actual code in Linux kernel
> > (drivers/acpi/nfit.*), it's not a trivial task.
> 
> To answer one of Kevin's questions: The NFIT table doesn't appear
> to require the ACPI interpreter. They seem more like SRAT and SLIT.

Sorry, I made a mistake in another reply. The NFIT does not contain
anything requiring the ACPI interpreter, but there are some _DSM methods
for NVDIMM in the SSDT, which do need the ACPI interpreter.

> Also you failed to answer Kevin's question regarding E820 entries: I
> think NVDIMM (or at least parts thereof) get represented in E820 (or
> the EFI memory map), and if that's the case this would be a very
> strong hint towards management needing to be in the hypervisor.
>

Legacy NVDIMM devices may use E820 entries or other ad-hoc ways to
announce their locations, but newer ones that follow the ACPI v6 spec do
not need E820 any more and only need the ACPI NFIT (i.e. firmware may not
build E820 entries for them).

The current Linux kernel can handle both legacy and new NVDIMM devices
and provides the same block device interface for them.

> > Secondly, the driver implements a convenient block device interface to
> > let software access areas where NVDIMM devices are mapped. The
> > existing vNVDIMM implementation in QEMU uses this interface.
> > 
> > As Linux NVDIMM driver has already done above, why do we bother to
> > reimplement them in Xen?
> 
> See above; a possibility is that we may need a split model (block
> layer parts on Dom0, "normal memory" parts in the hypervisor.
> Iirc the split is being determined by firmware, and hence set in
> stone by the time OS (or hypervisor) boot starts.
>

For the "normal memory" parts, do you mean parts that map the host
NVDIMM device's address space range to the guest? I'm going to
implement that part in hypervisor and expose it as a hypercall so that
it can be used by QEMU.

Haozhong
Jan Beulich Jan. 20, 2016, 11:20 a.m. UTC | #14
>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
> On 01/20/16 01:46, Jan Beulich wrote:
>> >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>> > Secondly, the driver implements a convenient block device interface to
>> > let software access areas where NVDIMM devices are mapped. The
>> > existing vNVDIMM implementation in QEMU uses this interface.
>> > 
>> > As Linux NVDIMM driver has already done above, why do we bother to
>> > reimplement them in Xen?
>> 
>> See above; a possibility is that we may need a split model (block
>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>> Iirc the split is being determined by firmware, and hence set in
>> stone by the time OS (or hypervisor) boot starts.
> 
> For the "normal memory" parts, do you mean parts that map the host
> NVDIMM device's address space range to the guest? I'm going to
> implement that part in hypervisor and expose it as a hypercall so that
> it can be used by QEMU.

To answer this I need to have my understanding of the partitioning
being done by firmware confirmed: If that's the case, then "normal"
means the part that doesn't get exposed as a block device (SSD).
In any event there's no correlation to guest exposure here.

Jan
Andrew Cooper Jan. 20, 2016, 1:16 p.m. UTC | #15
On 20/01/16 10:36, Xiao Guangrong wrote:
>
> Hi,
>
> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
>
>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
>>
>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
>>> mapped
>>> into memory.  I am still on the dom0 side of this fence.
>>>
>>> The real question is whether it is possible to take an NVDIMM, split it
>>> in half, give each half to two different guests (with appropriate NFIT
>>> tables) and that be sufficient for the guests to just work.
>>>
>>
>> Yes, one NVDIMM device can be split into multiple parts and assigned
>> to different guests, and QEMU is responsible to maintain virtual NFIT
>> tables for each part.
>>
>>> Either way, it needs to be a toolstack policy decision as to how to
>>> split the resource.
>
> Currently, we are using NVDIMM as a block device and a DAX-based
> filesystem
> is created upon it in Linux so that file-related accesses directly reach
> the NVDIMM device.
>
> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> create multiple files on the DAX-based filesystem and assign the file to
> each VMs. In the future, we can enable namespace (partition-like) for
> PMEM
> memory and assign the namespace to each VMs (current Linux driver uses
> the
> whole PMEM as a single namespace).
>
> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> device
> and manager NVDIMM resource.
>
> Thanks!
>

The more I see about this, the more sure I am that we want to keep it as
a block device managed by dom0.

In the case of the DAX-based filesystem, I presume files are not
necessarily contiguous.  I also presume that this is worked around by
permuting the mapping of the virtual NVDIMM such that it appears as
a contiguous block of addresses to the guest?

Today in Xen, Qemu already has the ability to create mappings in the
guest's address space, e.g. to map PCI device BARs.  I don't see a
conceptual difference here, although the security/permission model
certainly is more complicated.

~Andrew
Stefano Stabellini Jan. 20, 2016, 2:29 p.m. UTC | #16
On Wed, 20 Jan 2016, Andrew Cooper wrote:
> On 20/01/16 10:36, Xiao Guangrong wrote:
> >
> > Hi,
> >
> > On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >
> >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>
> >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>> mapped
> >>> into memory.  I am still on the dom0 side of this fence.
> >>>
> >>> The real question is whether it is possible to take an NVDIMM, split it
> >>> in half, give each half to two different guests (with appropriate NFIT
> >>> tables) and that be sufficient for the guests to just work.
> >>>
> >>
> >> Yes, one NVDIMM device can be split into multiple parts and assigned
> >> to different guests, and QEMU is responsible to maintain virtual NFIT
> >> tables for each part.
> >>
> >>> Either way, it needs to be a toolstack policy decision as to how to
> >>> split the resource.
> >
> > Currently, we are using NVDIMM as a block device and a DAX-based
> > filesystem
> > is created upon it in Linux so that file-related accesses directly reach
> > the NVDIMM device.
> >
> > In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > create multiple files on the DAX-based filesystem and assign the file to
> > each VMs. In the future, we can enable namespace (partition-like) for
> > PMEM
> > memory and assign the namespace to each VMs (current Linux driver uses
> > the
> > whole PMEM as a single namespace).
> >
> > I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > device
> > and manager NVDIMM resource.
> >
> > Thanks!
> >
> 
> The more I see about this, the more sure I am that we want to keep it as
> a block device managed by dom0.
> 
> In the case of the DAX-based filesystem, I presume files are not
> necessarily contiguous.  I also presume that this is worked around by
> permuting the mapping of the virtual NVDIMM such that the it appears as
> a contiguous block of addresses to the guest?
> 
> Today in Xen, Qemu already has the ability to create mappings in the
> guest's address space, e.g. to map PCI device BARs.  I don't see a
> conceptual difference here, although the security/permission model
> certainly is more complicated.

I imagine that mmap'ing these /dev/pmemXX devices requires root
privileges, does it not?

I wouldn't encourage the introduction of anything else that requires
root privileges in QEMU. With QEMU running as non-root by default in
4.7, the feature will not be available unless users explicitly ask to
run QEMU as root (which they shouldn't really).
Haozhong Zhang Jan. 20, 2016, 2:38 p.m. UTC | #17
On 01/20/16 13:16, Andrew Cooper wrote:
> On 20/01/16 10:36, Xiao Guangrong wrote:
> >
> > Hi,
> >
> > On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >
> >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>
> >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>> mapped
> >>> into memory.  I am still on the dom0 side of this fence.
> >>>
> >>> The real question is whether it is possible to take an NVDIMM, split it
> >>> in half, give each half to two different guests (with appropriate NFIT
> >>> tables) and that be sufficient for the guests to just work.
> >>>
> >>
> >> Yes, one NVDIMM device can be split into multiple parts and assigned
> >> to different guests, and QEMU is responsible to maintain virtual NFIT
> >> tables for each part.
> >>
> >>> Either way, it needs to be a toolstack policy decision as to how to
> >>> split the resource.
> >
> > Currently, we are using NVDIMM as a block device and a DAX-based
> > filesystem
> > is created upon it in Linux so that file-related accesses directly reach
> > the NVDIMM device.
> >
> > In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > create multiple files on the DAX-based filesystem and assign the file to
> > each VMs. In the future, we can enable namespace (partition-like) for
> > PMEM
> > memory and assign the namespace to each VMs (current Linux driver uses
> > the
> > whole PMEM as a single namespace).
> >
> > I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > device
> > and manager NVDIMM resource.
> >
> > Thanks!
> >
> 
> The more I see about this, the more sure I am that we want to keep it as
> a block device managed by dom0.
> 
> In the case of the DAX-based filesystem, I presume files are not
> necessarily contiguous.  I also presume that this is worked around by
> permuting the mapping of the virtual NVDIMM such that the it appears as
> a contiguous block of addresses to the guest?
>

No, they don't need to be contiguous. We can map those
non-contiguous parts into a contiguous guest physical address space area,
and QEMU fills in the base and size of that area in the vNFIT.

> Today in Xen, Qemu already has the ability to create mappings in the
> guest's address space, e.g. to map PCI device BARs.  I don't see a
> conceptual difference here, although the security/permission model
> certainly is more complicated.
>

I'm preparing a design document; let's see afterwards what would be
the better solution.

Thanks,
Haozhong
Haozhong Zhang Jan. 20, 2016, 2:42 p.m. UTC | #18
On 01/20/16 14:29, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Andrew Cooper wrote:
> > On 20/01/16 10:36, Xiao Guangrong wrote:
> > >
> > > Hi,
> > >
> > > On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> > >
> > >> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> > >>
> > >>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> > >>> mapped
> > >>> into memory.  I am still on the dom0 side of this fence.
> > >>>
> > >>> The real question is whether it is possible to take an NVDIMM, split it
> > >>> in half, give each half to two different guests (with appropriate NFIT
> > >>> tables) and that be sufficient for the guests to just work.
> > >>>
> > >>
> > >> Yes, one NVDIMM device can be split into multiple parts and assigned
> > >> to different guests, and QEMU is responsible to maintain virtual NFIT
> > >> tables for each part.
> > >>
> > >>> Either way, it needs to be a toolstack policy decision as to how to
> > >>> split the resource.
> > >
> > > Currently, we are using NVDIMM as a block device and a DAX-based
> > > filesystem
> > > is created upon it in Linux so that file-related accesses directly reach
> > > the NVDIMM device.
> > >
> > > In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > > create multiple files on the DAX-based filesystem and assign the file to
> > > each VMs. In the future, we can enable namespace (partition-like) for
> > > PMEM
> > > memory and assign the namespace to each VMs (current Linux driver uses
> > > the
> > > whole PMEM as a single namespace).
> > >
> > > I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > > device
> > > and manager NVDIMM resource.
> > >
> > > Thanks!
> > >
> > 
> > The more I see about this, the more sure I am that we want to keep it as
> > a block device managed by dom0.
> > 
> > In the case of the DAX-based filesystem, I presume files are not
> > necessarily contiguous.  I also presume that this is worked around by
> > permuting the mapping of the virtual NVDIMM such that the it appears as
> > a contiguous block of addresses to the guest?
> > 
> > Today in Xen, Qemu already has the ability to create mappings in the
> > guest's address space, e.g. to map PCI device BARs.  I don't see a
> > conceptual difference here, although the security/permission model
> > certainly is more complicated.
> 
> I imagine that mmap'ing  these /dev/pmemXX devices require root
> privileges, does it not?
>

Yes, unless we assign non-root access permissions to /dev/pmemXX (but
this is not the default behavior of the Linux kernel so far).

> I wouldn't encourage the introduction of anything else that requires
> root privileges in QEMU. With QEMU running as non-root by default in
> 4.7, the feature will not be available unless users explicitly ask to
> run QEMU as root (which they shouldn't really).
>

Yes, I'll include those privileged operations in the design document.

Haozhong

Andrew Cooper Jan. 20, 2016, 2:45 p.m. UTC | #19
On 20/01/16 14:29, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Andrew Cooper wrote:
>> On 20/01/16 10:36, Xiao Guangrong wrote:
>>> Hi,
>>>
>>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
>>>
>>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
>>>>
>>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
>>>>> mapped
>>>>> into memory.  I am still on the dom0 side of this fence.
>>>>>
>>>>> The real question is whether it is possible to take an NVDIMM, split it
>>>>> in half, give each half to two different guests (with appropriate NFIT
>>>>> tables) and that be sufficient for the guests to just work.
>>>>>
>>>> Yes, one NVDIMM device can be split into multiple parts and assigned
>>>> to different guests, and QEMU is responsible to maintain virtual NFIT
>>>> tables for each part.
>>>>
>>>>> Either way, it needs to be a toolstack policy decision as to how to
>>>>> split the resource.
>>> Currently, we are using NVDIMM as a block device and a DAX-based
>>> filesystem
>>> is created upon it in Linux so that file-related accesses directly reach
>>> the NVDIMM device.
>>>
>>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
>>> create multiple files on the DAX-based filesystem and assign the file to
>>> each VMs. In the future, we can enable namespace (partition-like) for
>>> PMEM
>>> memory and assign the namespace to each VMs (current Linux driver uses
>>> the
>>> whole PMEM as a single namespace).
>>>
>>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
>>> device
>>> and manager NVDIMM resource.
>>>
>>> Thanks!
>>>
>> The more I see about this, the more sure I am that we want to keep it as
>> a block device managed by dom0.
>>
>> In the case of the DAX-based filesystem, I presume files are not
>> necessarily contiguous.  I also presume that this is worked around by
>> permuting the mapping of the virtual NVDIMM such that the it appears as
>> a contiguous block of addresses to the guest?
>>
>> Today in Xen, Qemu already has the ability to create mappings in the
>> guest's address space, e.g. to map PCI device BARs.  I don't see a
>> conceptual difference here, although the security/permission model
>> certainly is more complicated.
> I imagine that mmap'ing  these /dev/pmemXX devices require root
> privileges, does it not?

I presume it does, although mmap()ing a file on a DAX filesystem will
work in the standard POSIX way.

Neither of these is sufficient, however.  That gets Qemu a mapping of
the NVDIMM, not the guest.  Something, one way or another, has to turn
this into the appropriate add-to-physmap hypercalls.
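
For reference, the generic libxc call with that shape looks roughly like the
sketch below; none of the existing XENMAPSPACE_* values is an obvious fit
for host NVDIMM frames, so treat this purely as an illustration of the kind
of per-page operation that has to be issued on the guest's behalf:

    #include <xenctrl.h>

    /* Illustration only: issue an add-to-physmap operation per page for
     * a range.  Which map space (if any existing one) is appropriate for
     * NVDIMM frames is exactly the open design question. */
    static int add_range_to_physmap(xc_interface *xch, uint32_t domid,
                                    unsigned int space,
                                    unsigned long first_idx,
                                    xen_pfn_t first_gpfn, unsigned long nr)
    {
        unsigned long i;
        int rc = 0;

        for ( i = 0; i < nr && rc == 0; i++ )
            rc = xc_domain_add_to_physmap(xch, domid, space,
                                          first_idx + i, first_gpfn + i);
        return rc;
    }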

>
> I wouldn't encourage the introduction of anything else that requires
> root privileges in QEMU. With QEMU running as non-root by default in
> 4.7, the feature will not be available unless users explicitly ask to
> run QEMU as root (which they shouldn't really).

This isn't how design works.

First, design a feature in an architecturally correct way, and then
design a security policy to fit.  (Note: both before implementation happens.)

We should not stunt the design based on an existing implementation.  In
particular, if the design shows that being a root-only feature is the only
sane way of doing this, it should be a root-only feature.  (I hope this
is not the case, but it shouldn't cloud the judgement of a design.)

~Andrew
Haozhong Zhang Jan. 20, 2016, 2:53 p.m. UTC | #20
On 01/20/16 14:45, Andrew Cooper wrote:
> On 20/01/16 14:29, Stefano Stabellini wrote:
> > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> >> On 20/01/16 10:36, Xiao Guangrong wrote:
> >>> Hi,
> >>>
> >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >>>
> >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>>>
> >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>>>> mapped
> >>>>> into memory.  I am still on the dom0 side of this fence.
> >>>>>
> >>>>> The real question is whether it is possible to take an NVDIMM, split it
> >>>>> in half, give each half to two different guests (with appropriate NFIT
> >>>>> tables) and that be sufficient for the guests to just work.
> >>>>>
> >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> >>>> tables for each part.
> >>>>
> >>>>> Either way, it needs to be a toolstack policy decision as to how to
> >>>>> split the resource.
> >>> Currently, we are using NVDIMM as a block device and a DAX-based
> >>> filesystem
> >>> is created upon it in Linux so that file-related accesses directly reach
> >>> the NVDIMM device.
> >>>
> >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> >>> create multiple files on the DAX-based filesystem and assign the file to
> >>> each VMs. In the future, we can enable namespace (partition-like) for
> >>> PMEM
> >>> memory and assign the namespace to each VMs (current Linux driver uses
> >>> the
> >>> whole PMEM as a single namespace).
> >>>
> >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> >>> device
> >>> and manager NVDIMM resource.
> >>>
> >>> Thanks!
> >>>
> >> The more I see about this, the more sure I am that we want to keep it as
> >> a block device managed by dom0.
> >>
> >> In the case of the DAX-based filesystem, I presume files are not
> >> necessarily contiguous.  I also presume that this is worked around by
> >> permuting the mapping of the virtual NVDIMM such that the it appears as
> >> a contiguous block of addresses to the guest?
> >>
> >> Today in Xen, Qemu already has the ability to create mappings in the
> >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> >> conceptual difference here, although the security/permission model
> >> certainly is more complicated.
> > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > privileges, does it not?
> 
> I presume it does, although mmap()ing a file on a DAX filesystem will
> work in the standard POSIX way.
> 
> Neither of these are sufficient however.  That gets Qemu a mapping of
> the NVDIMM, not the guest.  Something, one way or another, has to turn
> this into appropriate add-to-phymap hypercalls.
>

Yes, those hypercalls are what I'm going to add.

Haozhong

> >
> > I wouldn't encourage the introduction of anything else that requires
> > root privileges in QEMU. With QEMU running as non-root by default in
> > 4.7, the feature will not be available unless users explicitly ask to
> > run QEMU as root (which they shouldn't really).
> 
> This isn't how design works.
> 
> First, design a feature in an architecturally correct way, and then
> design an security policy to fit.  (note, both before implement happens).
> 
> We should not stunt design based on an existing implementation.  In
> particular, if design shows that being a root only feature is the only
> sane way of doing this, it should be a root only feature.  (I hope this
> is not the case, but it shouldn't cloud the judgement of a design).
> 
> ~Andrew
> 
Stefano Stabellini Jan. 20, 2016, 3:05 p.m. UTC | #21
On Wed, 20 Jan 2016, Andrew Cooper wrote:
> On 20/01/16 14:29, Stefano Stabellini wrote:
> > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> >> On 20/01/16 10:36, Xiao Guangrong wrote:
> >>> Hi,
> >>>
> >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> >>>
> >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> >>>>
> >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> >>>>> mapped
> >>>>> into memory.  I am still on the dom0 side of this fence.
> >>>>>
> >>>>> The real question is whether it is possible to take an NVDIMM, split it
> >>>>> in half, give each half to two different guests (with appropriate NFIT
> >>>>> tables) and that be sufficient for the guests to just work.
> >>>>>
> >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> >>>> tables for each part.
> >>>>
> >>>>> Either way, it needs to be a toolstack policy decision as to how to
> >>>>> split the resource.
> >>> Currently, we are using NVDIMM as a block device and a DAX-based
> >>> filesystem
> >>> is created upon it in Linux so that file-related accesses directly reach
> >>> the NVDIMM device.
> >>>
> >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> >>> create multiple files on the DAX-based filesystem and assign the file to
> >>> each VMs. In the future, we can enable namespace (partition-like) for
> >>> PMEM
> >>> memory and assign the namespace to each VMs (current Linux driver uses
> >>> the
> >>> whole PMEM as a single namespace).
> >>>
> >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> >>> device
> >>> and manager NVDIMM resource.
> >>>
> >>> Thanks!
> >>>
> >> The more I see about this, the more sure I am that we want to keep it as
> >> a block device managed by dom0.
> >>
> >> In the case of the DAX-based filesystem, I presume files are not
> >> necessarily contiguous.  I also presume that this is worked around by
> >> permuting the mapping of the virtual NVDIMM such that the it appears as
> >> a contiguous block of addresses to the guest?
> >>
> >> Today in Xen, Qemu already has the ability to create mappings in the
> >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> >> conceptual difference here, although the security/permission model
> >> certainly is more complicated.
> > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > privileges, does it not?
> 
> I presume it does, although mmap()ing a file on a DAX filesystem will
> work in the standard POSIX way.
> 
> Neither of these are sufficient however.  That gets Qemu a mapping of
> the NVDIMM, not the guest.  Something, one way or another, has to turn
> this into appropriate add-to-phymap hypercalls.
> 
> >
> > I wouldn't encourage the introduction of anything else that requires
> > root privileges in QEMU. With QEMU running as non-root by default in
> > 4.7, the feature will not be available unless users explicitly ask to
> > run QEMU as root (which they shouldn't really).
> 
> This isn't how design works.
> 
> First, design a feature in an architecturally correct way, and then
> design an security policy to fit.
>
> We should not stunt design based on an existing implementation.  In
> particular, if design shows that being a root only feature is the only
> sane way of doing this, it should be a root only feature.  (I hope this
> is not the case, but it shouldn't cloud the judgement of a design).

I would argue that security is an integral part of the architecture and
should not be retrofitted into it.

Is it really a good design if the only sane way to implement it is
making it a root-only feature? I think not. Designing security policies
for pieces of software that don't have the infrastructure for them is
costly, and that cost should be accounted for as part of the overall cost of
the solution rather than added to it in a second stage.


> (note, both before implement happens).

That is ideal but realistically in many cases nobody is able to produce
a design before the implementation happens. There are plenty of articles
written about this since the 90s / early 00s.
Konrad Rzeszutek Wilk Jan. 20, 2016, 3:07 p.m. UTC | #22
On Wed, Jan 20, 2016 at 07:04:49PM +0800, Haozhong Zhang wrote:
> On 01/20/16 01:46, Jan Beulich wrote:
> > >>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> > > The primary reason of current solution is to reuse existing NVDIMM
> > > driver in Linux kernel.
> >
> 
> CC'ing QEMU vNVDIMM maintainer: Xiao Guangrong
> 
> > Re-using code in the Dom0 kernel has benefits and drawbacks, and
> > in any event needs to depend on proper layering to remain in place.
> > A benefit is less code duplication between Xen and Linux; along the
> > same lines a drawback is code duplication between various Dom0
> > OS variants.
> >
> 
> Not clear about other Dom0 OS. But for Linux, it already has a NVDIMM
> driver since 4.2.
> 
> > > One responsibility of this driver is to discover NVDIMM devices and
> > > their parameters (e.g. which portion of an NVDIMM device can be mapped
> > > into the system address space and which address it is mapped to) by
> > > parsing ACPI NFIT tables. Looking at the NFIT spec in Sec 5.2.25 of
> > > ACPI Specification v6 and the actual code in Linux kernel
> > > (drivers/acpi/nfit.*), it's not a trivial task.
> > 
> > To answer one of Kevin's questions: The NFIT table doesn't appear
> > to require the ACPI interpreter. They seem more like SRAT and SLIT.
> 
> Sorry, I made a mistake in another reply. NFIT does not contain
> anything requiring ACPI interpreter. But there are some _DSM methods
> for NVDIMM in SSDT, which needs ACPI interpreter.

Right, but those are for health checks and such. Not needed for boot-time
discovery of the ranges in memory of the NVDIMM.
> 
> > Also you failed to answer Kevin's question regarding E820 entries: I
> > think NVDIMM (or at least parts thereof) get represented in E820 (or
> > the EFI memory map), and if that's the case this would be a very
> > strong hint towards management needing to be in the hypervisor.
> >
> 
> Legacy NVDIMM devices may use E820 entries or other ad-hoc ways to
> announce their locations, but newer ones that follow ACPI v6 spec do
> not need E820 any more and only need ACPI NFIT (i.e. firmware may not
> build E820 entries for them).

I am missing something here.

Linux pvops uses a hypercall to construct its E820 (XENMEM_machine_memory_map);
see arch/x86/xen/setup.c:xen_memory_setup.

That hypercall gets a filtered E820 from the hypervisor. And the
hypervisor gets the E820 from multiboot2 - which gets it from grub2.

With 'legacy NVDIMMs' using an E820_NVDIMM type (12? 13?) - they don't
show up in multiboot2 - which means Xen will ignore them (not sure
if it changes them to E820_RSRV or just leaves them alone).

Anyhow, for the /dev/pmem0 driver in Linux to construct a block
device on the E820_NVDIMM range - it MUST have the E820 entry - but we don't
construct that.

I would think that one of the patches would be for the hypervisor
to recognize the E820_NVDIMM and associate that area with p2m_mmio
(so that the xc_memory_mapping hypercall would work on the MFNs)?

But you also mention ACPI v6 defining them as using ACPI NFIT -
so that would mean treating the system address extracted from the
ACPI NFIT just as MMIO (except being WB instead of UC).

Either way - the Xen hypervisor should also parse the ACPI NFIT so
that it can mark that range as p2m_mmio (or does it do that by
default for any non-E820 ranges?). Does it actually need to
do that? Or is that optional?

I hope the design document will explain a bit of this.
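
For concreteness, something along these lines is what I have in mind for
the E820 side - a rough, untested sketch only (the type values are the
ACPI 6.0 persistent-memory type 7 and the legacy non-standard 12; the
helper name is made up):

/* Illustrative only - not an actual patch. */
#define E820_PMEM        7   /* ACPI 6.0 "persistent memory" type */
#define E820_PRAM_LEGACY 12  /* legacy, non-standard pre-ACPI-6 type */

static void nvdimm_scan_e820(const struct e820entry *map, unsigned int nr)
{
    unsigned int i;

    for ( i = 0; i < nr; i++ )
    {
        if ( map[i].type != E820_PMEM && map[i].type != E820_PRAM_LEGACY )
            continue;

        /*
         * Hypothetical helper: record [addr, addr + size) so that a
         * later mapping hypercall on these MFNs can be validated (and
         * the range treated as p2m_mmio / WB rather than normal RAM).
         */
        nvdimm_register_range(map[i].addr, map[i].addr + map[i].size);
    }
}

The NFIT route would end up feeding the same kind of helper, just with
ranges parsed out of the SPA structures instead of E820.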

> 
> The current linux kernel can handle both legacy and new NVDIMM devices
> and provide the same block device interface for them.

OK, so Xen would need to do that as well - so that the Linux kernel
can utilize it.
> 
> > > Secondly, the driver implements a convenient block device interface to
> > > let software access areas where NVDIMM devices are mapped. The
> > > existing vNVDIMM implementation in QEMU uses this interface.
> > > 
> > > As Linux NVDIMM driver has already done above, why do we bother to
> > > reimplement them in Xen?
> > 
> > See above; a possibility is that we may need a split model (block
> > layer parts on Dom0, "normal memory" parts in the hypervisor.
> > Iirc the split is being determined by firmware, and hence set in
> > stone by the time OS (or hypervisor) boot starts.
> >
> 
> For the "normal memory" parts, do you mean parts that map the host
> NVDIMM device's address space range to the guest? I'm going to
> implement that part in hypervisor and expose it as a hypercall so that
> it can be used by QEMU.
> 
> Haozhong
> 
Konrad Rzeszutek Wilk Jan. 20, 2016, 3:13 p.m. UTC | #23
On Wed, Jan 20, 2016 at 10:53:10PM +0800, Haozhong Zhang wrote:
> On 01/20/16 14:45, Andrew Cooper wrote:
> > On 20/01/16 14:29, Stefano Stabellini wrote:
> > > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> > >> On 20/01/16 10:36, Xiao Guangrong wrote:
> > >>> Hi,
> > >>>
> > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> > >>>
> > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> > >>>>
> > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> > >>>>> mapped
> > >>>>> into memory.  I am still on the dom0 side of this fence.
> > >>>>>
> > >>>>> The real question is whether it is possible to take an NVDIMM, split it
> > >>>>> in half, give each half to two different guests (with appropriate NFIT
> > >>>>> tables) and that be sufficient for the guests to just work.
> > >>>>>
> > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> > >>>> tables for each part.
> > >>>>
> > >>>>> Either way, it needs to be a toolstack policy decision as to how to
> > >>>>> split the resource.
> > >>> Currently, we are using NVDIMM as a block device and a DAX-based
> > >>> filesystem
> > >>> is created upon it in Linux so that file-related accesses directly reach
> > >>> the NVDIMM device.
> > >>>
> > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > >>> create multiple files on the DAX-based filesystem and assign the file to
> > >>> each VMs. In the future, we can enable namespace (partition-like) for
> > >>> PMEM
> > >>> memory and assign the namespace to each VMs (current Linux driver uses
> > >>> the
> > >>> whole PMEM as a single namespace).
> > >>>
> > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > >>> device
> > >>> and manager NVDIMM resource.
> > >>>
> > >>> Thanks!
> > >>>
> > >> The more I see about this, the more sure I am that we want to keep it as
> > >> a block device managed by dom0.
> > >>
> > >> In the case of the DAX-based filesystem, I presume files are not
> > >> necessarily contiguous.  I also presume that this is worked around by
> > >> permuting the mapping of the virtual NVDIMM such that the it appears as
> > >> a contiguous block of addresses to the guest?
> > >>
> > >> Today in Xen, Qemu already has the ability to create mappings in the
> > >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> > >> conceptual difference here, although the security/permission model
> > >> certainly is more complicated.
> > > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > > privileges, does it not?
> > 
> > I presume it does, although mmap()ing a file on a DAX filesystem will
> > work in the standard POSIX way.
> > 
> > Neither of these are sufficient however.  That gets Qemu a mapping of
> > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > this into appropriate add-to-phymap hypercalls.
> >
> 
> Yes, those hypercalls are what I'm going to add.

Why?

What you need (in a rough hand-wave way) is to:
 - mount /dev/pmem0
 - mmap the file on the /dev/pmem0 FS
 - walk the VMA for the file - extract the MFNs (machine frame numbers)
 - feed those frame numbers to the xc_memory_mapping hypercall. The
   guest pfns would be contiguous.
   Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
   the /dev/pmem0 FS - the guest pfns are 0x200000 upward.

   However the MFNs may be discontiguous as the NVDIMM could be
   1TB - and the 8GB file is scattered all over.

I believe that is all you would need to do?
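
In (untested, hand-wavy) code the QEMU/toolstack side of that would look
roughly like the sketch below - xc_domain_memory_mapping() is the libxc
wrapper I have in mind for the mapping hypercall, virt_to_mfn() is a
placeholder for whatever ends up providing the machine frame numbers,
and error handling is omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <xenctrl.h>

int map_pmem_file_to_guest(xc_interface *xch, uint32_t domid,
                           const char *path, unsigned long gpfn_base)
{
    int fd = open(path, O_RDWR);
    off_t size = lseek(fd, 0, SEEK_END);
    unsigned long i, nr = size >> XC_PAGE_SHIFT;
    char *va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    for ( i = 0; i < nr; i++ )
    {
        /* MFNs may be discontiguous, so map one page at a time here;
         * a real implementation would batch contiguous runs. */
        unsigned long mfn = virt_to_mfn(va + (i << XC_PAGE_SHIFT));

        if ( xc_domain_memory_mapping(xch, domid, gpfn_base + i, mfn,
                                      1, DPCI_ADD_MAPPING) )
            return -1;
    }

    return 0;
}

The open question is then only where virt_to_mfn() comes from - which is
what the rest of this sub-thread is about.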
> 
> Haozhong
> 
> > >
> > > I wouldn't encourage the introduction of anything else that requires
> > > root privileges in QEMU. With QEMU running as non-root by default in
> > > 4.7, the feature will not be available unless users explicitly ask to
> > > run QEMU as root (which they shouldn't really).
> > 
> > This isn't how design works.
> > 
> > First, design a feature in an architecturally correct way, and then
> > design an security policy to fit.  (note, both before implement happens).
> > 
> > We should not stunt design based on an existing implementation.  In
> > particular, if design shows that being a root only feature is the only
> > sane way of doing this, it should be a root only feature.  (I hope this
> > is not the case, but it shouldn't cloud the judgement of a design).
> > 
> > ~Andrew
> > 
Haozhong Zhang Jan. 20, 2016, 3:29 p.m. UTC | #24
On 01/20/16 10:13, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 20, 2016 at 10:53:10PM +0800, Haozhong Zhang wrote:
> > On 01/20/16 14:45, Andrew Cooper wrote:
> > > On 20/01/16 14:29, Stefano Stabellini wrote:
> > > > On Wed, 20 Jan 2016, Andrew Cooper wrote:
> > > >> On 20/01/16 10:36, Xiao Guangrong wrote:
> > > >>> Hi,
> > > >>>
> > > >>> On 01/20/2016 06:15 PM, Haozhong Zhang wrote:
> > > >>>
> > > >>>> CCing QEMU vNVDIMM maintainer: Xiao Guangrong
> > > >>>>
> > > >>>>> Conceptually, an NVDIMM is just like a fast SSD which is linearly
> > > >>>>> mapped
> > > >>>>> into memory.  I am still on the dom0 side of this fence.
> > > >>>>>
> > > >>>>> The real question is whether it is possible to take an NVDIMM, split it
> > > >>>>> in half, give each half to two different guests (with appropriate NFIT
> > > >>>>> tables) and that be sufficient for the guests to just work.
> > > >>>>>
> > > >>>> Yes, one NVDIMM device can be split into multiple parts and assigned
> > > >>>> to different guests, and QEMU is responsible to maintain virtual NFIT
> > > >>>> tables for each part.
> > > >>>>
> > > >>>>> Either way, it needs to be a toolstack policy decision as to how to
> > > >>>>> split the resource.
> > > >>> Currently, we are using NVDIMM as a block device and a DAX-based
> > > >>> filesystem
> > > >>> is created upon it in Linux so that file-related accesses directly reach
> > > >>> the NVDIMM device.
> > > >>>
> > > >>> In KVM, If the NVDIMM device need to be shared by different VMs, we can
> > > >>> create multiple files on the DAX-based filesystem and assign the file to
> > > >>> each VMs. In the future, we can enable namespace (partition-like) for
> > > >>> PMEM
> > > >>> memory and assign the namespace to each VMs (current Linux driver uses
> > > >>> the
> > > >>> whole PMEM as a single namespace).
> > > >>>
> > > >>> I think it is not a easy work to let Xen hypervisor recognize NVDIMM
> > > >>> device
> > > >>> and manager NVDIMM resource.
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >> The more I see about this, the more sure I am that we want to keep it as
> > > >> a block device managed by dom0.
> > > >>
> > > >> In the case of the DAX-based filesystem, I presume files are not
> > > >> necessarily contiguous.  I also presume that this is worked around by
> > > >> permuting the mapping of the virtual NVDIMM such that the it appears as
> > > >> a contiguous block of addresses to the guest?
> > > >>
> > > >> Today in Xen, Qemu already has the ability to create mappings in the
> > > >> guest's address space, e.g. to map PCI device BARs.  I don't see a
> > > >> conceptual difference here, although the security/permission model
> > > >> certainly is more complicated.
> > > > I imagine that mmap'ing  these /dev/pmemXX devices require root
> > > > privileges, does it not?
> > > 
> > > I presume it does, although mmap()ing a file on a DAX filesystem will
> > > work in the standard POSIX way.
> > > 
> > > Neither of these are sufficient however.  That gets Qemu a mapping of
> > > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > > this into appropriate add-to-phymap hypercalls.
> > >
> > 
> > Yes, those hypercalls are what I'm going to add.
> 
> Why?
> 
> What you need (in a rought hand-wave way) is to:
>  - mount /dev/pmem0
>  - mmap the file on /dev/pmem0 FS
>  - walk the VMA for the file - extract the MFN (machien frame numbers)

Can this step be done by QEMU? Or does the Linux kernel provide some
way for userspace to do the translation?

Haozhong

>  - feed those frame numbers to xc_memory_mapping hypercall. The
>    guest pfns would be contingous.
>    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
>    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
> 
>    However the MFNs may be discontingous as the NVDIMM could be an
>    1TB - and the 8GB file is scattered all over.
> 
> I believe that is all you would need to do?
> > 
> > Haozhong
> > 
> > > >
> > > > I wouldn't encourage the introduction of anything else that requires
> > > > root privileges in QEMU. With QEMU running as non-root by default in
> > > > 4.7, the feature will not be available unless users explicitly ask to
> > > > run QEMU as root (which they shouldn't really).
> > > 
> > > This isn't how design works.
> > > 
> > > First, design a feature in an architecturally correct way, and then
> > > design an security policy to fit.  (note, both before implement happens).
> > > 
> > > We should not stunt design based on an existing implementation.  In
> > > particular, if design shows that being a root only feature is the only
> > > sane way of doing this, it should be a root only feature.  (I hope this
> > > is not the case, but it shouldn't cloud the judgement of a design).
> > > 
> > > ~Andrew
> > > 
Xiao Guangrong Jan. 20, 2016, 3:29 p.m. UTC | #25
On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
>> On 01/20/16 01:46, Jan Beulich wrote:
>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>>>> Secondly, the driver implements a convenient block device interface to
>>>> let software access areas where NVDIMM devices are mapped. The
>>>> existing vNVDIMM implementation in QEMU uses this interface.
>>>>
>>>> As Linux NVDIMM driver has already done above, why do we bother to
>>>> reimplement them in Xen?
>>>
>>> See above; a possibility is that we may need a split model (block
>>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>>> Iirc the split is being determined by firmware, and hence set in
>>> stone by the time OS (or hypervisor) boot starts.
>>
>> For the "normal memory" parts, do you mean parts that map the host
>> NVDIMM device's address space range to the guest? I'm going to
>> implement that part in hypervisor and expose it as a hypercall so that
>> it can be used by QEMU.
>
> To answer this I need to have my understanding of the partitioning
> being done by firmware confirmed: If that's the case, then "normal"
> means the part that doesn't get exposed as a block device (SSD).
> In any event there's no correlation to guest exposure here.

Firmware does not manage NVDIMM. All the operations on nvdimm are handled
by the OS.

Actually, there are lots of things we should take into account if we move
NVDIMM management to the hypervisor:
a) ACPI NFIT interpretation
    A new ACPI table introduced in ACPI 6.0, named NFIT, exports the
    base information of NVDIMM devices, which includes PMEM info, PBLK
    info, nvdimm device interleave, vendor info, etc. Let me explain it one
    by one (a rough sketch of the SPA Range Structure is at the end of
    this list).

    PMEM and PBLK are two modes to access NVDIMM devices:
    1) PMEM can be treated as NV-RAM which is directly mapped into the CPU's
       address space so that the CPU can r/w it directly.
    2) as NVDIMM has huge capacity and the CPU's address space is limited, NVDIMM
       only offers two windows which are mapped into the CPU's address space, the
       data window and the access window, so that the CPU can use these two
       windows to access the whole NVDIMM device.

    NVDIMM devices are interleaved, and the interleave info is also exported so
    that we can calculate the address used to access a specified NVDIMM device.

    NVDIMM devices from different vendors can have different functions, so the
    vendor info is exported by NFIT to make the vendor's driver work.

b) ACPI SSDT interpretation
    SSDT offers the _DSM method which controls the NVDIMM device, such as label
    operations, health checks etc, and hotplug support.

c) Resource management
    NVDIMM resource management is challenging because:
    1) PMEM is huge and slightly slower to access than RAM, so it is not suitable
       to manage it via page structs (I think it is not a big problem in the Xen
       hypervisor?)
    2) we need to partition it so it can be used by multiple VMs.
    3) we need to support PBLK and partition it in the future.

d) management tools support
    S.M.A.R.T? error detection and recovery?

c) hotplug support

d) third party drivers
    Vendor drivers need to be ported to the xen hypervisor and supported in
    the management tool.

e) ...
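
To give an idea of what a) actually carries, below is roughly the layout
of one NFIT sub-table, the SPA (system physical address) Range Structure
(field layout taken from the ACPI 6.0 NFIT definition - treat this as a
sketch rather than a reference):

#include <stdint.h>

/* One "SPA Range Structure" entry inside the NFIT. It describes a
 * system-physical-address range and what it is used for (the GUID
 * distinguishes e.g. PMEM from the PBLK control/data windows). */
struct nfit_spa_range {
    uint16_t type;              /* 0 = SPA Range Structure */
    uint16_t length;            /* length of this sub-table */
    uint16_t range_index;       /* referenced by other NFIT sub-tables */
    uint16_t flags;
    uint32_t reserved;
    uint32_t proximity_domain;  /* NUMA node of the range */
    uint8_t  range_guid[16];    /* address range type (PMEM, PBLK, ...) */
    uint64_t address;           /* SPA range base */
    uint64_t range_length;      /* SPA range length in bytes */
    uint64_t memory_mapping;    /* EFI-style memory mapping attributes */
};

The interleave, control-region and vendor bits mentioned above live in
further sub-table types within the same NFIT.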
Konrad Rzeszutek Wilk Jan. 20, 2016, 3:41 p.m. UTC | #26
> > > > Neither of these are sufficient however.  That gets Qemu a mapping of
> > > > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > > > this into appropriate add-to-phymap hypercalls.
> > > >
> > > 
> > > Yes, those hypercalls are what I'm going to add.
> > 
> > Why?
> > 
> > What you need (in a rought hand-wave way) is to:
> >  - mount /dev/pmem0
> >  - mmap the file on /dev/pmem0 FS
> >  - walk the VMA for the file - extract the MFN (machien frame numbers)
> 
> Can this step be done by QEMU? Or does linux kernel provide some
> approach for the userspace to do the translation?

I don't know. I would think no - as you wouldn't want a userspace
application to figure out the physical frames from a virtual
address (unless it is root). But then if you look in
/proc/<pid>/maps and /proc/<pid>/smaps there is some data there.

Hm, /proc/<pid>/pagemap has something interesting.

See the pagemap_read function. That looks to be doing it?
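
Quick sketch of what I mean (untested): pagemap gives you one 64-bit
entry per virtual page, with bit 63 = present and bits 0-54 = PFN. Note
that recent kernels hide the PFN bits from unprivileged readers, which
ties back into the root/non-root question above.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a virtual address of the calling process into a PFN, or
 * return 0 if the page is not present / not visible to us. */
static uint64_t va_to_pfn(const void *va)
{
    uint64_t entry = 0;
    long psize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if ( fd >= 0 )
    {
        off_t off = (off_t)((uintptr_t)va / psize) * sizeof(entry);

        if ( pread(fd, &entry, sizeof(entry), off) != sizeof(entry) )
            entry = 0;
        close(fd);
    }

    /* Bit 63: page present; bits 0-54: page frame number. */
    return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
}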

> 
> Haozhong
> 
> >  - feed those frame numbers to xc_memory_mapping hypercall. The
> >    guest pfns would be contingous.
> >    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
> >    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
> > 
> >    However the MFNs may be discontingous as the NVDIMM could be an
> >    1TB - and the 8GB file is scattered all over.
> > 
> > I believe that is all you would need to do?
Konrad Rzeszutek Wilk Jan. 20, 2016, 3:47 p.m. UTC | #27
On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
> 
> 
> On 01/20/2016 07:20 PM, Jan Beulich wrote:
> >>>>On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
> >>On 01/20/16 01:46, Jan Beulich wrote:
> >>>>>>On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> >>>>Secondly, the driver implements a convenient block device interface to
> >>>>let software access areas where NVDIMM devices are mapped. The
> >>>>existing vNVDIMM implementation in QEMU uses this interface.
> >>>>
> >>>>As Linux NVDIMM driver has already done above, why do we bother to
> >>>>reimplement them in Xen?
> >>>
> >>>See above; a possibility is that we may need a split model (block
> >>>layer parts on Dom0, "normal memory" parts in the hypervisor.
> >>>Iirc the split is being determined by firmware, and hence set in
> >>>stone by the time OS (or hypervisor) boot starts.
> >>
> >>For the "normal memory" parts, do you mean parts that map the host
> >>NVDIMM device's address space range to the guest? I'm going to
> >>implement that part in hypervisor and expose it as a hypercall so that
> >>it can be used by QEMU.
> >
> >To answer this I need to have my understanding of the partitioning
> >being done by firmware confirmed: If that's the case, then "normal"
> >means the part that doesn't get exposed as a block device (SSD).
> >In any event there's no correlation to guest exposure here.
> 
> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
> by OS.
> 
> Actually, there are lots of things we should take into account if we move
> the NVDIMM management to hypervisor:

If you remove the block device part and just deal with the pmem part then this
gets smaller.

Also the _DSM operations - I can't see them being in the hypervisor - but only
in dom0 - which would have the right software to tickle the correct
ioctl on /dev/pmem to do the "management" (carve up the NVDIMM, perform
a SMART operation, etc).

> a) ACPI NFIT interpretation
>    A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>    base information of NVDIMM devices which includes PMEM info, PBLK
>    info, nvdimm device interleave, vendor info, etc. Let me explain it one
>    by one.

And it is a static table. As in part of the MADT.
> 
>    PMEM and PBLK are two modes to access NVDIMM devices:
>    1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>       space so that CPU can r/w it directly.
>    2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>       only offers two windows which are mapped to CPU's address space, the data
>       window and access window, so that CPU can use these two windows to access
>       the whole NVDIMM device.
> 
>    NVDIMM device is interleaved whose info is also exported so that we can
>    calculate the address to access the specified NVDIMM device.

Right, along with the serial numbers.
> 
>    NVDIMM devices from different vendor can have different function so that the
>    vendor info is exported by NFIT to make vendor's driver work.

via _DSM right?
> 
> b) ACPI SSDT interpretation
>    SSDT offers _DSM method which controls NVDIMM device, such as label operation,
>    health check etc and hotplug support.

Sounds like the control domain (dom0) would be in charge of that.
> 
> c) Resource management
>    NVDIMM resource management challenged as:
>    1) PMEM is huge and it is little slower access than RAM so it is not suitable
>       to manage it as page struct (i think it is not a big problem in Xen
>       hypervisor?)
>    2) need to partition it to it be used in multiple VMs.
>    3) need to support PBLK and partition it in the future.

That all sounds to me like control domain (dom0) decisions. Not the Xen hypervisor's.
> 
> d) management tools support
>    S.M.A.R.T? error detection and recovering?
> 
> c) hotplug support

How does that work? Ah, the _DSM will point to the new ACPI NFIT for the OS
to scan. That would require the hypervisor also reading this for it to
update its data structures.
> 
> d) third parts drivers
>    Vendor drivers need to be ported to xen hypervisor and let it be supported in
>    the management tool.

Ewww.

I presume the 'third party drivers' mean more interesting _DSM features right?
On the base level the firmware with this type of NVDIMM would still have
the basic - ACPI NFIT + E820_NVDIMM (optional).
> 
> e) ...
> 
> 
> 
> 
Haozhong Zhang Jan. 20, 2016, 3:54 p.m. UTC | #28
On 01/20/16 10:41, Konrad Rzeszutek Wilk wrote:
> > > > > Neither of these are sufficient however.  That gets Qemu a mapping of
> > > > > the NVDIMM, not the guest.  Something, one way or another, has to turn
> > > > > this into appropriate add-to-phymap hypercalls.
> > > > >
> > > > 
> > > > Yes, those hypercalls are what I'm going to add.
> > > 
> > > Why?
> > > 
> > > What you need (in a rought hand-wave way) is to:
> > >  - mount /dev/pmem0
> > >  - mmap the file on /dev/pmem0 FS
> > >  - walk the VMA for the file - extract the MFN (machien frame numbers)
> > 
> > Can this step be done by QEMU? Or does linux kernel provide some
> > approach for the userspace to do the translation?
> 
> I don't know. I would think no - as you wouldn't want the userspace
> application to figure out the physical frames from the virtual
> address (unless they are root). But then if you look in
> /proc/<pid>/maps and /proc/<pid>/smaps there are some data there.
> 
> Hm, /proc/<pid>/pagemaps has something intersting
> 
> See pagemap_read function. That looks to be doing it?
>

Interesting and good to know this. I'll have a look at it.

Thanks,
Haozhong

> > 
> > Haozhong
> > 
> > >  - feed those frame numbers to xc_memory_mapping hypercall. The
> > >    guest pfns would be contingous.
> > >    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
> > >    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
> > > 
> > >    However the MFNs may be discontingous as the NVDIMM could be an
> > >    1TB - and the 8GB file is scattered all over.
> > > 
> > > I believe that is all you would need to do?
> 
Xiao Guangrong Jan. 20, 2016, 4:25 p.m. UTC | #29
On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
>>>> On 01/20/16 01:46, Jan Beulich wrote:
>>>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>>>>>> Secondly, the driver implements a convenient block device interface to
>>>>>> let software access areas where NVDIMM devices are mapped. The
>>>>>> existing vNVDIMM implementation in QEMU uses this interface.
>>>>>>
>>>>>> As Linux NVDIMM driver has already done above, why do we bother to
>>>>>> reimplement them in Xen?
>>>>>
>>>>> See above; a possibility is that we may need a split model (block
>>>>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>>>>> Iirc the split is being determined by firmware, and hence set in
>>>>> stone by the time OS (or hypervisor) boot starts.
>>>>
>>>> For the "normal memory" parts, do you mean parts that map the host
>>>> NVDIMM device's address space range to the guest? I'm going to
>>>> implement that part in hypervisor and expose it as a hypercall so that
>>>> it can be used by QEMU.
>>>
>>> To answer this I need to have my understanding of the partitioning
>>> being done by firmware confirmed: If that's the case, then "normal"
>>> means the part that doesn't get exposed as a block device (SSD).
>>> In any event there's no correlation to guest exposure here.
>>
>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>> by OS.
>>
>> Actually, there are lots of things we should take into account if we move
>> the NVDIMM management to hypervisor:
>
> If you remove the block device part and just deal with pmem part then this
> gets smaller.
>

Yes indeed. But Xen cannot benefit from NVDIMM BLK; I think it is not a
long-term plan. :)

> Also the _DSM operations - I can't see them being in hypervisor - but only
> in the dom0 - which would have the right software to tickle the correct
> ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform
> an SMART operation, etc).

Yes, it is reasonable to put it in dom0, and it makes management tools happy.

>
>> a) ACPI NFIT interpretation
>>     A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>     base information of NVDIMM devices which includes PMEM info, PBLK
>>     info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>     by one.
>
> And it is a static table. As in part of the MADT.

Yes, it is, but we need to fetch updated nvdimm info from _FIT in the SSDT/DSDT
instead if an nvdimm device is hotplugged; please see below.

>>
>>     PMEM and PBLK are two modes to access NVDIMM devices:
>>     1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>        space so that CPU can r/w it directly.
>>     2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>        only offers two windows which are mapped to CPU's address space, the data
>>        window and access window, so that CPU can use these two windows to access
>>        the whole NVDIMM device.
>>
>>     NVDIMM device is interleaved whose info is also exported so that we can
>>     calculate the address to access the specified NVDIMM device.
>
> Right, along with the serial numbers.
>>
>>     NVDIMM devices from different vendor can have different function so that the
>>     vendor info is exported by NFIT to make vendor's driver work.
>
> via _DSM right?

Yes.

>>
>> b) ACPI SSDT interpretation
>>     SSDT offers _DSM method which controls NVDIMM device, such as label operation,
>>     health check etc and hotplug support.
>
> Sounds like the control domain (dom0) would be in charge of that.

Yup. Dom0 is a better place to handle it.

>>
>> c) Resource management
>>     NVDIMM resource management challenged as:
>>     1) PMEM is huge and it is little slower access than RAM so it is not suitable
>>        to manage it as page struct (i think it is not a big problem in Xen
>>        hypervisor?)
>>     2) need to partition it to it be used in multiple VMs.
>>     3) need to support PBLK and partition it in the future.
>
> That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor.

Sure, so letting dom0 handle this is better; we are on the same page. :)

>>
>> d) management tools support
>>     S.M.A.R.T? error detection and recovering?
>>
>> c) hotplug support
>
> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> to scan. That would require the hypervisor also reading this for it to
> update it's data-structures.

Similar to what you said. The NVDIMM root device in SSDT/DSDT provides a new
interface, _FIT, which returns the new NFIT once a new device is hotplugged.
And yes, domain 0 is the better place to handle this case too.

>>
>> d) third parts drivers
>>     Vendor drivers need to be ported to xen hypervisor and let it be supported in
>>     the management tool.
>
> Ewww.
>
> I presume the 'third party drivers' mean more interesting _DSM features right?

Yes.

> On the base level the firmware with this type of NVDIMM would still have
> the basic - ACPI NFIT + E820_NVDIMM (optional).
>>

Yes.
Konrad Rzeszutek Wilk Jan. 20, 2016, 4:47 p.m. UTC | #30
On Thu, Jan 21, 2016 at 12:25:08AM +0800, Xiao Guangrong wrote:
> 
> 
> On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote:
> >On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
> >>
> >>
> >>On 01/20/2016 07:20 PM, Jan Beulich wrote:
> >>>>>>On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
> >>>>On 01/20/16 01:46, Jan Beulich wrote:
> >>>>>>>>On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
> >>>>>>Secondly, the driver implements a convenient block device interface to
> >>>>>>let software access areas where NVDIMM devices are mapped. The
> >>>>>>existing vNVDIMM implementation in QEMU uses this interface.
> >>>>>>
> >>>>>>As Linux NVDIMM driver has already done above, why do we bother to
> >>>>>>reimplement them in Xen?
> >>>>>
> >>>>>See above; a possibility is that we may need a split model (block
> >>>>>layer parts on Dom0, "normal memory" parts in the hypervisor.
> >>>>>Iirc the split is being determined by firmware, and hence set in
> >>>>>stone by the time OS (or hypervisor) boot starts.
> >>>>
> >>>>For the "normal memory" parts, do you mean parts that map the host
> >>>>NVDIMM device's address space range to the guest? I'm going to
> >>>>implement that part in hypervisor and expose it as a hypercall so that
> >>>>it can be used by QEMU.
> >>>
> >>>To answer this I need to have my understanding of the partitioning
> >>>being done by firmware confirmed: If that's the case, then "normal"
> >>>means the part that doesn't get exposed as a block device (SSD).
> >>>In any event there's no correlation to guest exposure here.
> >>
> >>Firmware does not manage NVDIMM. All the operations of nvdimm are handled
> >>by OS.
> >>
> >>Actually, there are lots of things we should take into account if we move
> >>the NVDIMM management to hypervisor:
> >
> >If you remove the block device part and just deal with pmem part then this
> >gets smaller.
> >
> 
> Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long
> time plan. :)
> 
> >Also the _DSM operations - I can't see them being in hypervisor - but only
> >in the dom0 - which would have the right software to tickle the correct
> >ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform
> >an SMART operation, etc).
> 
> Yes, it is reasonable to put it in dom 0 and it makes management tools happy.
> 
> >
> >>a) ACPI NFIT interpretation
> >>    A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
> >>    base information of NVDIMM devices which includes PMEM info, PBLK
> >>    info, nvdimm device interleave, vendor info, etc. Let me explain it one
> >>    by one.
> >
> >And it is a static table. As in part of the MADT.
> 
> Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead
> if a nvdimm device is hotpluged, please see below.
> 
> >>
> >>    PMEM and PBLK are two modes to access NVDIMM devices:
> >>    1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
> >>       space so that CPU can r/w it directly.
> >>    2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
> >>       only offers two windows which are mapped to CPU's address space, the data
> >>       window and access window, so that CPU can use these two windows to access
> >>       the whole NVDIMM device.
> >>
> >>    NVDIMM device is interleaved whose info is also exported so that we can
> >>    calculate the address to access the specified NVDIMM device.
> >
> >Right, along with the serial numbers.
> >>
> >>    NVDIMM devices from different vendor can have different function so that the
> >>    vendor info is exported by NFIT to make vendor's driver work.
> >
> >via _DSM right?
> 
> Yes.
> 
> >>
> >>b) ACPI SSDT interpretation
> >>    SSDT offers _DSM method which controls NVDIMM device, such as label operation,
> >>    health check etc and hotplug support.
> >
> >Sounds like the control domain (dom0) would be in charge of that.
> 
> Yup. Dom0 is a better place to handle it.
> 
> >>
> >>c) Resource management
> >>    NVDIMM resource management challenged as:
> >>    1) PMEM is huge and it is little slower access than RAM so it is not suitable
> >>       to manage it as page struct (i think it is not a big problem in Xen
> >>       hypervisor?)
> >>    2) need to partition it to it be used in multiple VMs.
> >>    3) need to support PBLK and partition it in the future.
> >
> >That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor.
> 
> Sure, so let dom0 handle this is better, we are on the same page. :)
> 
> >>
> >>d) management tools support
> >>    S.M.A.R.T? error detection and recovering?
> >>
> >>c) hotplug support
> >
> >How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> >to scan. That would require the hypervisor also reading this for it to
> >update it's data-structures.
> 
> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> the better place handing this case too.

That one is a bit difficult. Both the OS and the hypervisor would need to know about
this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
the hypervisor needs to be told so it can slurp it up.

However I don't know if the hypervisor needs to know all the details of an
NVDIMM - or just the starting and ending ranges so that when a guest is created
and the VT-d is constructed - it can be assured that the ranges are valid.

I am not an expert on the P2M code - but I think that would need to be looked
at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into a guest PFN.

> 
> >>
> >>d) third parts drivers
> >>    Vendor drivers need to be ported to xen hypervisor and let it be supported in
> >>    the management tool.
> >
> >Ewww.
> >
> >I presume the 'third party drivers' mean more interesting _DSM features right?
> 
> Yes.
> 
> >On the base level the firmware with this type of NVDIMM would still have
> >the basic - ACPI NFIT + E820_NVDIMM (optional).
> >>
> 
> Yes.
Xiao Guangrong Jan. 20, 2016, 4:55 p.m. UTC | #31
On 01/21/2016 12:47 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 21, 2016 at 12:25:08AM +0800, Xiao Guangrong wrote:
>>
>>
>> On 01/20/2016 11:47 PM, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Jan 20, 2016 at 11:29:55PM +0800, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>>>>>> On 20.01.16 at 12:04, <haozhong.zhang@intel.com> wrote:
>>>>>> On 01/20/16 01:46, Jan Beulich wrote:
>>>>>>>>>> On 20.01.16 at 06:31, <haozhong.zhang@intel.com> wrote:
>>>>>>>> Secondly, the driver implements a convenient block device interface to
>>>>>>>> let software access areas where NVDIMM devices are mapped. The
>>>>>>>> existing vNVDIMM implementation in QEMU uses this interface.
>>>>>>>>
>>>>>>>> As Linux NVDIMM driver has already done above, why do we bother to
>>>>>>>> reimplement them in Xen?
>>>>>>>
>>>>>>> See above; a possibility is that we may need a split model (block
>>>>>>> layer parts on Dom0, "normal memory" parts in the hypervisor.
>>>>>>> Iirc the split is being determined by firmware, and hence set in
>>>>>>> stone by the time OS (or hypervisor) boot starts.
>>>>>>
>>>>>> For the "normal memory" parts, do you mean parts that map the host
>>>>>> NVDIMM device's address space range to the guest? I'm going to
>>>>>> implement that part in hypervisor and expose it as a hypercall so that
>>>>>> it can be used by QEMU.
>>>>>
>>>>> To answer this I need to have my understanding of the partitioning
>>>>> being done by firmware confirmed: If that's the case, then "normal"
>>>>> means the part that doesn't get exposed as a block device (SSD).
>>>>> In any event there's no correlation to guest exposure here.
>>>>
>>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>>>> by OS.
>>>>
>>>> Actually, there are lots of things we should take into account if we move
>>>> the NVDIMM management to hypervisor:
>>>
>>> If you remove the block device part and just deal with pmem part then this
>>> gets smaller.
>>>
>>
>> Yes indeed. But xen can not benefit from NVDIMM BLK, i think it is not a long
>> time plan. :)
>>
>>> Also the _DSM operations - I can't see them being in hypervisor - but only
>>> in the dom0 - which would have the right software to tickle the correct
>>> ioctl on /dev/pmem to do the "management" (carve the NVDIMM, perform
>>> an SMART operation, etc).
>>
>> Yes, it is reasonable to put it in dom 0 and it makes management tools happy.
>>
>>>
>>>> a) ACPI NFIT interpretation
>>>>     A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>>>     base information of NVDIMM devices which includes PMEM info, PBLK
>>>>     info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>>>     by one.
>>>
>>> And it is a static table. As in part of the MADT.
>>
>> Yes, it is, but we need to fetch updated nvdimm info from _FIT in SSDT/DSDT instead
>> if a nvdimm device is hotpluged, please see below.
>>
>>>>
>>>>     PMEM and PBLK are two modes to access NVDIMM devices:
>>>>     1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>>>        space so that CPU can r/w it directly.
>>>>     2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>>>        only offers two windows which are mapped to CPU's address space, the data
>>>>        window and access window, so that CPU can use these two windows to access
>>>>        the whole NVDIMM device.
>>>>
>>>>     NVDIMM device is interleaved whose info is also exported so that we can
>>>>     calculate the address to access the specified NVDIMM device.
>>>
>>> Right, along with the serial numbers.
>>>>
>>>>     NVDIMM devices from different vendor can have different function so that the
>>>>     vendor info is exported by NFIT to make vendor's driver work.
>>>
>>> via _DSM right?
>>
>> Yes.
>>
>>>>
>>>> b) ACPI SSDT interpretation
>>>>     SSDT offers _DSM method which controls NVDIMM device, such as label operation,
>>>>     health check etc and hotplug support.
>>>
>>> Sounds like the control domain (dom0) would be in charge of that.
>>
>> Yup. Dom0 is a better place to handle it.
>>
>>>>
>>>> c) Resource management
>>>>     NVDIMM resource management challenged as:
>>>>     1) PMEM is huge and it is little slower access than RAM so it is not suitable
>>>>        to manage it as page struct (i think it is not a big problem in Xen
>>>>        hypervisor?)
>>>>     2) need to partition it to it be used in multiple VMs.
>>>>     3) need to support PBLK and partition it in the future.
>>>
>>> That all sounds to me like an control domain (dom0) decisions. Not Xen hypervisor.
>>
>> Sure, so let dom0 handle this is better, we are on the same page. :)
>>
>>>>
>>>> d) management tools support
>>>>     S.M.A.R.T? error detection and recovering?
>>>>
>>>> c) hotplug support
>>>
>>> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
>>> to scan. That would require the hypervisor also reading this for it to
>>> update it's data-structures.
>>
>> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
>> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
>> the better place handing this case too.
>
> That one is a bit difficult. Both the OS and the hypervisor would need to know about
> this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> the hypervisor needs to be told so it can slurp it up.

Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
handle everything as on native. If it cannot, dom0 can interpret the ACPI, fetch
the irq info out and tell the hypervisor to pass the irq to dom0 - is that doable?

>
> However I don't know if the hypervisor needs to know all the details of an
> NVDIMM - or just the starting and ending ranges so that when an guest is created
> and the VT-d is constructed - it can be assured that the ranges are valid.
>
> I am not an expert on the P2M code - but I think that would need to be looked
> at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.

We had better not use "E820" as it lacks some advantages of ACPI, such as NUMA,
hotplug, label support (namespaces)...
Jan Beulich Jan. 20, 2016, 5:07 p.m. UTC | #32
>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>> To answer this I need to have my understanding of the partitioning
>> being done by firmware confirmed: If that's the case, then "normal"
>> means the part that doesn't get exposed as a block device (SSD).
>> In any event there's no correlation to guest exposure here.
> 
> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
> by OS.
> 
> Actually, there are lots of things we should take into account if we move
> the NVDIMM management to hypervisor:
> a) ACPI NFIT interpretation
>     A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>     base information of NVDIMM devices which includes PMEM info, PBLK
>     info, nvdimm device interleave, vendor info, etc. Let me explain it one
>     by one.
> 
>     PMEM and PBLK are two modes to access NVDIMM devices:
>     1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>        space so that CPU can r/w it directly.
>     2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>        only offers two windows which are mapped to CPU's address space, the data
>        window and access window, so that CPU can use these two windows to access
>        the whole NVDIMM device.

You fail to mention PBLK. The question above really was about what
entity controls which of the two modes get used (and perhaps for
which parts of the overall NVDIMM).

Jan
Xiao Guangrong Jan. 20, 2016, 5:17 p.m. UTC | #33
On 01/21/2016 01:07 AM, Jan Beulich wrote:
>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>> To answer this I need to have my understanding of the partitioning
>>> being done by firmware confirmed: If that's the case, then "normal"
>>> means the part that doesn't get exposed as a block device (SSD).
>>> In any event there's no correlation to guest exposure here.
>>
>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>> by OS.
>>
>> Actually, there are lots of things we should take into account if we move
>> the NVDIMM management to hypervisor:
>> a) ACPI NFIT interpretation
>>      A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>      base information of NVDIMM devices which includes PMEM info, PBLK
>>      info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>      by one.
>>
>>      PMEM and PBLK are two modes to access NVDIMM devices:
>>      1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>         space so that CPU can r/w it directly.
>>      2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>         only offers two windows which are mapped to CPU's address space, the data
>>         window and access window, so that CPU can use these two windows to access
>>         the whole NVDIMM device.
>
> You fail to mention PBLK. The question above really was about what

The 2) is PBLK.

> entity controls which of the two modes get used (and perhaps for
> which parts of the overall NVDIMM).

So I think the "normal" you mentioned is about PMEM. :)
Konrad Rzeszutek Wilk Jan. 20, 2016, 5:18 p.m. UTC | #34
> >>>>c) hotplug support
> >>>
> >>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> >>>to scan. That would require the hypervisor also reading this for it to
> >>>update it's data-structures.
> >>
> >>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> >>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> >>the better place handing this case too.
> >
> >That one is a bit difficult. Both the OS and the hypervisor would need to know about
> >this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> >the hypervisor needs to be told so it can slurp it up.
> 
> Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0

Yes of course it can.
> handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
> the irq info out and tell hypervior to pass the irq to dom0, it is doable?
> 
> >
> >However I don't know if the hypervisor needs to know all the details of an
> >NVDIMM - or just the starting and ending ranges so that when an guest is created
> >and the VT-d is constructed - it can be assured that the ranges are valid.
> >
> >I am not an expert on the P2M code - but I think that would need to be looked
> >at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
> 
> We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
> lable support (namespace)...

<hand-waves> I don't know what QEMU does for guests? I naively assumed it would
create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
the _DSM).

Either way what I think you need to investigate is what is necessary for the
Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
needs to the _FIT and NFIT tables.

(Adding Feng Wu, the VT-d maintainer).
Xiao Guangrong Jan. 20, 2016, 5:23 p.m. UTC | #35
On 01/21/2016 01:18 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> c) hotplug support
>>>>>
>>>>> How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
>>>>> to scan. That would require the hypervisor also reading this for it to
>>>>> update it's data-structures.
>>>>
>>>> Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
>>>> _FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
>>>> the better place handing this case too.
>>>
>>> That one is a bit difficult. Both the OS and the hypervisor would need to know about
>>> this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
>>> the hypervisor needs to be told so it can slurp it up.
>>
>> Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
>
> Yes of course it can.
>> handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
>> the irq info out and tell hypervior to pass the irq to dom0, it is doable?
>>
>>>
>>> However I don't know if the hypervisor needs to know all the details of an
>>> NVDIMM - or just the starting and ending ranges so that when an guest is created
>>> and the VT-d is constructed - it can be assured that the ranges are valid.
>>>
>>> I am not an expert on the P2M code - but I think that would need to be looked
>>> at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
>>
>> We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
>> lable support (namespace)...
>
> <hand-waves> I don't know what QEMU does for guests? I naively assumed it would
> create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
> the _DSM).

Ah, ACPI eliminates this E820 entry.

>
> Either way what I think you need to investigate is what is neccessary for the
> Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
> the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
> needs to the _FIT and NFIT tables.
>

Interesting. I did not consider using NVDIMM as a DMA target. Do you have a use
case for this kind of NVDIMM usage?
Konrad Rzeszutek Wilk Jan. 20, 2016, 5:48 p.m. UTC | #36
On Thu, Jan 21, 2016 at 01:23:31AM +0800, Xiao Guangrong wrote:
> 
> 
> On 01/21/2016 01:18 AM, Konrad Rzeszutek Wilk wrote:
> >>>>>>c) hotplug support
> >>>>>
> >>>>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> >>>>>to scan. That would require the hypervisor also reading this for it to
> >>>>>update it's data-structures.
> >>>>
> >>>>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> >>>>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> >>>>the better place handing this case too.
> >>>
> >>>That one is a bit difficult. Both the OS and the hypervisor would need to know about
> >>>this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> >>>the hypervisor needs to be told so it can slurp it up.
> >>
> >>Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
> >
> >Yes of course it can.
> >>handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
> >>the irq info out and tell hypervior to pass the irq to dom0, it is doable?
> >>
> >>>
> >>>However I don't know if the hypervisor needs to know all the details of an
> >>>NVDIMM - or just the starting and ending ranges so that when an guest is created
> >>>and the VT-d is constructed - it can be assured that the ranges are valid.
> >>>
> >>>I am not an expert on the P2M code - but I think that would need to be looked
> >>>at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
> >>
> >>We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
> >>lable support (namespace)...
> >
> ><hand-waves> I don't know what QEMU does for guests? I naively assumed it would
> >create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
> >the _DSM).
> 
> Ah, ACPI eliminates this E820 entry.
> 
> >
> >Either way what I think you need to investigate is what is neccessary for the
> >Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
> >the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
> >needs to the _FIT and NFIT tables.
> >
> 
> Interesting. I did not consider using NVDIMM as DMA. Do you have usecase for this
> kind of NVDIMM usage?

An easy one is an iSCSI target. You could have an SR-IOV NIC and the TCM target
enabled (CONFIG_TCM_FILEIO or CONFIG_TCM_IBLOCK). Put a file on /dev/pmem0
(using a DAX-enabled FS) and export it as an iSCSI LUN. The traffic would go over
the SR-IOV NIC.

The DMA transactions would be SR-IOV NIC <-> NVDIMM.
Andrew Cooper Jan. 20, 2016, 6:14 p.m. UTC | #37
On 20/01/16 15:05, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Andrew Cooper wrote:
>> On 20/01/16 14:29, Stefano Stabellini wrote:
>>> On Wed, 20 Jan 2016, Andrew Cooper wrote:
>>>>
>>> I wouldn't encourage the introduction of anything else that requires
>>> root privileges in QEMU. With QEMU running as non-root by default in
>>> 4.7, the feature will not be available unless users explicitly ask to
>>> run QEMU as root (which they shouldn't really).
>> This isn't how design works.
>>
>> First, design a feature in an architecturally correct way, and then
>> design an security policy to fit.
>>
>> We should not stunt design based on an existing implementation.  In
>> particular, if design shows that being a root only feature is the only
>> sane way of doing this, it should be a root only feature.  (I hope this
>> is not the case, but it shouldn't cloud the judgement of a design).
> I would argue that security is an integral part of the architecture and
> should not be retrofitted into it.

There is no retrofitting - it is all part of the same overall design,
done before any coding starts to happen.

>
> Is it really a good design if the only sane way to implement it is
> making it a root-only feature? I think not.

Then you have missed the point.

If you fail at architecting the feature in the first place, someone else
is going to have to come along and reimplement it properly, then provide
some form of compatibility with the old one.

Security is an important consideration in the design; I do not wish to
understate that.  However, if the only way for a feature to be
architected properly is for the feature to be a root-only feature, then
it should be a root-only feature.

>  Designing security policies
> for pieces of software that don't have the infrastructure for them is
> costly and that cost should be accounted as part of the overall cost of
> the solution rather than added to it in a second stage.

That cost is far better spent designing it properly in the first place,
rather than having to come along and reimplement a v2 because v1 was broken.

>
>
>> (note, both before implement happens).
> That is ideal but realistically in many cases nobody is able to produce
> a design before the implementation happens.

It is perfectly easy.  This is the difference between software
engineering and software hacking.

There has been a lot of positive feedback from on-list design
documents.  It is a trend which needs to continue.

~Andrew
Haozhong Zhang Jan. 21, 2016, 3:12 a.m. UTC | #38
On 01/20/16 12:18, Konrad Rzeszutek Wilk wrote:
> > >>>>c) hotplug support
> > >>>
> > >>>How does that work? Ah the _DSM will point to the new ACPI NFIT for the OS
> > >>>to scan. That would require the hypervisor also reading this for it to
> > >>>update it's data-structures.
> > >>
> > >>Similar as you said. The NVDIMM root device in SSDT/DSDT dedicates a new interface,
> > >>_FIT, which return the new NFIT once new device hotplugged. And yes, domain 0 is
> > >>the better place handing this case too.
> > >
> > >That one is a bit difficult. Both the OS and the hypervisor would need to know about
> > >this (I think?). dom0 since it gets the ACPI event and needs to process it. Then
> > >the hypervisor needs to be told so it can slurp it up.
> > 
> > Can dom0 receive the interrupt triggered by device hotplug? If yes, we can let dom0
> 
> Yes of course it can.
> > handle all the things like native. If it can not, dom0 can interpret ACPI and fetch
> > the irq info out and tell hypervior to pass the irq to dom0, it is doable?
> > 
> > >
> > >However I don't know if the hypervisor needs to know all the details of an
> > >NVDIMM - or just the starting and ending ranges so that when an guest is created
> > >and the VT-d is constructed - it can be assured that the ranges are valid.
> > >
> > >I am not an expert on the P2M code - but I think that would need to be looked
> > >at to make sure it is OK with stitching an E820_NVDIMM type "MFN" into an guest PFN.
> > 
> > We do better do not use "E820" as it lacks some advantages of ACPI, such as, NUMA, hotplug,
> > lable support (namespace)...
> 
> <hand-waves> I don't know what QEMU does for guests? I naively assumed it would
> create an E820_NVDIMM along with the ACPI MADT NFIT tables (and the SSDT to have
> the _DSM).
>

ACPI 6 defines E820 type 7 for pmem (see table 15-312 in Section 15);
legacy firmware may use the non-standard type 12 (and even older firmware
may use type 6, which Linux no longer recognizes). However, a hot-plugged
NVDIMM may not appear in E820 at all. I still think it's better to let
dom0 Linux, which already has the necessary drivers, handle all these
device-probing tasks.
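
For reference, a trivial sketch of filtering an e820-style map for those
type values (7 for ACPI 6 pmem, 12 for the legacy non-standard type). The
struct layout below just mirrors the usual addr/size/type triple and is
only illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define E820_PMEM 7   /* ACPI 6.0 "persistent memory" range type */
    #define E820_PRAM 12  /* legacy, non-standard type used by older firmware */

    struct e820entry {
        uint64_t addr;
        uint64_t size;
        uint32_t type;
    } __attribute__((packed));

    /* Print the persistent-memory ranges found in an e820-style map. */
    static void dump_pmem_ranges(const struct e820entry *map, unsigned int nr)
    {
        for (unsigned int i = 0; i < nr; i++)
            if (map[i].type == E820_PMEM || map[i].type == E820_PRAM)
                printf("pmem range: %#llx-%#llx (type %u)\n",
                       (unsigned long long)map[i].addr,
                       (unsigned long long)(map[i].addr + map[i].size),
                       map[i].type);
    }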

> Either way what I think you need to investigate is what is neccessary for the
> Xen hypervisor VT-d code (IOMMU) to have an entry which is the system address for
> the NVDIMM. Based on that - you will know what kind of exposure the hypervisor
> needs to the _FIT and NFIT tables.
>
> (Adding Feng Wu, the VT-d maintainer).

I haven't considered VT-d at all. From your example in another reply,
it looks like the VT-d code needs to be aware of the address space
range of the NVDIMM, otherwise that example would not work. If so, maybe
we can let the dom0 Linux kernel report the address space ranges of
detected NVDIMM devices to the Xen hypervisor. Anyway, I'll investigate
this issue.

Haozhong
Bob Liu Jan. 21, 2016, 3:35 a.m. UTC | #39
On 01/20/2016 11:41 PM, Konrad Rzeszutek Wilk wrote:
>>>>> Neither of these are sufficient however.  That gets Qemu a mapping of
>>>>> the NVDIMM, not the guest.  Something, one way or another, has to turn
>>>>> this into appropriate add-to-phymap hypercalls.
>>>>>
>>>>
>>>> Yes, those hypercalls are what I'm going to add.
>>>
>>> Why?
>>>
>>> What you need (in a rought hand-wave way) is to:
>>>  - mount /dev/pmem0
>>>  - mmap the file on /dev/pmem0 FS
>>>  - walk the VMA for the file - extract the MFN (machien frame numbers)
>>

If I understand correctly, in this case the MFNs follow from the block layout of the DAX file?
If we find all the file's blocks, then we get all the MFNs.

>> Can this step be done by QEMU? Or does linux kernel provide some
>> approach for the userspace to do the translation?
> 

The ioctl(fd, FIBMAP, &block) call may help; it can retrieve the LBAs that a given file occupies.
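
A minimal sketch of that FIBMAP approach (illustrative only; FIBMAP needs
CAP_SYS_RAWIO, and FIEMAP would be the more capable interface for extents):

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Print the device block backing one logical block of a file: FIBMAP
     * takes the logical block index in *block and overwrites it with the
     * physical (device) block number. */
    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <file-on-pmem-fs> <logical-block>\n",
                    argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int block = atoi(argv[2]);            /* logical block in */
        if (ioctl(fd, FIBMAP, &block) < 0) {  /* device block out */
            perror("ioctl(FIBMAP)");
            return 1;
        }
        printf("logical block %s -> device block %d\n", argv[2], block);
        close(fd);
        return 0;
    }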

-Bob

> I don't know. I would think no - as you wouldn't want the userspace
> application to figure out the physical frames from the virtual
> address (unless they are root). But then if you look in
> /proc/<pid>/maps and /proc/<pid>/smaps there are some data there.
> 
> Hm, /proc/<pid>/pagemaps has something intersting
> 
> See pagemap_read function. That looks to be doing it?
> 
>>
>> Haozhong
>>
>>>  - feed those frame numbers to xc_memory_mapping hypercall. The
>>>    guest pfns would be contingous.
>>>    Example: say the E820_NVDIMM starts at 8GB->16GB, so an 8GB file on
>>>    /dev/pmem0 FS - the guest pfns are 0x200000 upward.
>>>
>>>    However the MFNs may be discontingous as the NVDIMM could be an
>>>    1TB - and the 8GB file is scattered all over.
>>>
>>> I believe that is all you would need to do?
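
On Konrad's /proc/<pid>/pagemap pointer quoted above: a minimal user-space
sketch of translating a virtual address of the calling process into a PFN
(standard pagemap bit layout; reading PFNs needs CAP_SYS_ADMIN on recent
kernels) could look like this:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Translate a virtual address of the calling process into a physical
     * frame number via /proc/self/pagemap.  Each entry is 64 bits:
     * bit 63 = page present, bits 0-54 = PFN. */
    static int64_t va_to_pfn(const void *va)
    {
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
            return -1;

        long page_size = sysconf(_SC_PAGESIZE);
        off_t offset = ((uintptr_t)va / page_size) * sizeof(uint64_t);
        uint64_t entry;

        ssize_t n = pread(fd, &entry, sizeof(entry), offset);
        close(fd);
        if (n != (ssize_t)sizeof(entry) || !(entry & (1ULL << 63)))
            return -1;                          /* error or page not present */

        return entry & ((1ULL << 55) - 1);      /* bits 0-54 hold the PFN */
    }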
Jan Beulich Jan. 21, 2016, 8:18 a.m. UTC | #40
>>> On 20.01.16 at 18:17, <guangrong.xiao@linux.intel.com> wrote:

> 
> On 01/21/2016 01:07 AM, Jan Beulich wrote:
>>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
>>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>> To answer this I need to have my understanding of the partitioning
>>>> being done by firmware confirmed: If that's the case, then "normal"
>>>> means the part that doesn't get exposed as a block device (SSD).
>>>> In any event there's no correlation to guest exposure here.
>>>
>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>>> by OS.
>>>
>>> Actually, there are lots of things we should take into account if we move
>>> the NVDIMM management to hypervisor:
>>> a) ACPI NFIT interpretation
>>>      A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>>      base information of NVDIMM devices which includes PMEM info, PBLK
>>>      info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>>      by one.
>>>
>>>      PMEM and PBLK are two modes to access NVDIMM devices:
>>>      1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>>         space so that CPU can r/w it directly.
>>>      2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>>         only offers two windows which are mapped to CPU's address space, the data
>>>         window and access window, so that CPU can use these two windows to access
>>>         the whole NVDIMM device.
>>
>> You fail to mention PBLK. The question above really was about what
> 
> The 2) is PBLK.
> 
>> entity controls which of the two modes get used (and perhaps for
>> which parts of the overall NVDIMM).
> 
> So i think the "normal" you mentioned is about PMEM. :)

Yes. But then - other than you said above - it still looks to me as
if the split between PMEM and PBLK is arranged for by firmware?

Jan
Xiao Guangrong Jan. 21, 2016, 8:25 a.m. UTC | #41
On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>>> On 20.01.16 at 18:17, <guangrong.xiao@linux.intel.com> wrote:
>
>>
>> On 01/21/2016 01:07 AM, Jan Beulich wrote:
>>>>>> On 20.01.16 at 16:29, <guangrong.xiao@linux.intel.com> wrote:
>>>> On 01/20/2016 07:20 PM, Jan Beulich wrote:
>>>>> To answer this I need to have my understanding of the partitioning
>>>>> being done by firmware confirmed: If that's the case, then "normal"
>>>>> means the part that doesn't get exposed as a block device (SSD).
>>>>> In any event there's no correlation to guest exposure here.
>>>>
>>>> Firmware does not manage NVDIMM. All the operations of nvdimm are handled
>>>> by OS.
>>>>
>>>> Actually, there are lots of things we should take into account if we move
>>>> the NVDIMM management to hypervisor:
>>>> a) ACPI NFIT interpretation
>>>>       A new ACPI table introduced in ACPI 6.0 is named NFIT which exports the
>>>>       base information of NVDIMM devices which includes PMEM info, PBLK
>>>>       info, nvdimm device interleave, vendor info, etc. Let me explain it one
>>>>       by one.
>>>>
>>>>       PMEM and PBLK are two modes to access NVDIMM devices:
>>>>       1) PMEM can be treated as NV-RAM which is directly mapped to CPU's address
>>>>          space so that CPU can r/w it directly.
>>>>       2) as NVDIMM has huge capability and CPU's address space is limited, NVDIMM
>>>>          only offers two windows which are mapped to CPU's address space, the data
>>>>          window and access window, so that CPU can use these two windows to access
>>>>          the whole NVDIMM device.
>>>
>>> You fail to mention PBLK. The question above really was about what
>>
>> The 2) is PBLK.
>>
>>> entity controls which of the two modes get used (and perhaps for
>>> which parts of the overall NVDIMM).
>>
>> So i think the "normal" you mentioned is about PMEM. :)
>
> Yes. But then - other than you said above - it still looks to me as
> if the split between PMEM and PBLK is arranged for by firmware?

Yes. But the OS/hypervisor is not expected to dynamically change that configuration (re-split);
i.e., from the PoV of the OS/hypervisor, it is static.
Jan Beulich Jan. 21, 2016, 8:53 a.m. UTC | #42
>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>> Yes. But then - other than you said above - it still looks to me as
>> if the split between PMEM and PBLK is arranged for by firmware?
> 
> Yes. But OS/Hypervisor is not excepted to dynamically change its configure 
> (re-split),
> i,e, for PoV of OS/Hypervisor, it is static.

Exactly, that has been my understanding. And hence the PMEM part
could be under the hypervisor's control, while the PBLK part could be
Dom0's responsibility.

Jan
Xiao Guangrong Jan. 21, 2016, 9:10 a.m. UTC | #43
On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>> Yes. But then - other than you said above - it still looks to me as
>>> if the split between PMEM and PBLK is arranged for by firmware?
>>
>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
>> (re-split),
>> i,e, for PoV of OS/Hypervisor, it is static.
>
> Exactly, that has been my understanding. And hence the PMEM part
> could be under the hypervisor's control, while the PBLK part could be
> Dom0's responsibility.
>

I am not sure I have understood your point. Is your suggestion to leave
PMEM to the hypervisor and all other parts (PBLK and _DSM handling) to
Dom0? If yes, we should:
a) handle hotplug in the hypervisor (new PMEM add/remove), which requires the
    hypervisor to interpret ACPI SSDT/DSDT;
b) some _DSMs control PMEM, so those kinds of _DSMs have to be filtered out and
    handled in the hypervisor;
c) the hypervisor should manage the PMEM resource pool and partition it among
    multiple VMs.
Andrew Cooper Jan. 21, 2016, 9:29 a.m. UTC | #44
On 21/01/16 09:10, Xiao Guangrong wrote:
>
>
> On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>>> Yes. But then - other than you said above - it still looks to me as
>>>> if the split between PMEM and PBLK is arranged for by firmware?
>>>
>>> Yes. But OS/Hypervisor is not excepted to dynamically change its
>>> configure
>>> (re-split),
>>> i,e, for PoV of OS/Hypervisor, it is static.
>>
>> Exactly, that has been my understanding. And hence the PMEM part
>> could be under the hypervisor's control, while the PBLK part could be
>> Dom0's responsibility.
>>
>
> I am not sure if i have understood your point. What your suggestion is
> that
> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> Dom0? If yes, we should:
> a) handle hotplug in hypervisor (new PMEM add/remove) that causes
> hyperivsor
>    interpret ACPI SSDT/DSDT.
> b) some _DSMs control PMEM so you should filter out these kind of
> _DSMs and
>    handle them in hypervisor.
> c) hypervisor should mange PMEM resource pool and partition it to
> multiple
>    VMs.

It is not possible for Xen to handle ACPI such as this.

There can only be one OSPM on a system, and 9/10ths of the functionality
needing it already lives in Dom0.

The only rational course of action is for Xen to treat both PBLK and
PMEM as "devices" and leave them in Dom0's hands.

~Andrew
Jan Beulich Jan. 21, 2016, 10:25 a.m. UTC | #45
>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> On 01/21/2016 04:53 PM, Jan Beulich wrote:
>>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
>>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
>>>> Yes. But then - other than you said above - it still looks to me as
>>>> if the split between PMEM and PBLK is arranged for by firmware?
>>>
>>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
>>> (re-split),
>>> i,e, for PoV of OS/Hypervisor, it is static.
>>
>> Exactly, that has been my understanding. And hence the PMEM part
>> could be under the hypervisor's control, while the PBLK part could be
>> Dom0's responsibility.
>>
> 
> I am not sure if i have understood your point. What your suggestion is that
> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> Dom0? If yes, we should:
> a) handle hotplug in hypervisor (new PMEM add/remove) that causes hyperivsor
>     interpret ACPI SSDT/DSDT.

Why would this be different from ordinary memory hotplug, where
Dom0 deals with the ACPI CA interaction, notifying Xen about the
added memory?

> b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
>     handle them in hypervisor.

Not if (see above) following the model we currently have in place.

> c) hypervisor should mange PMEM resource pool and partition it to multiple
>     VMs.

Yes.

Jan
Jan Beulich Jan. 21, 2016, 10:26 a.m. UTC | #46
>>> On 21.01.16 at 10:29, <andrew.cooper3@citrix.com> wrote:
> On 21/01/16 09:10, Xiao Guangrong wrote:
>> I am not sure if i have understood your point. What your suggestion is
>> that
>> leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
>> Dom0? If yes, we should:
>> a) handle hotplug in hypervisor (new PMEM add/remove) that causes
>> hyperivsor
>>    interpret ACPI SSDT/DSDT.
>> b) some _DSMs control PMEM so you should filter out these kind of
>> _DSMs and
>>    handle them in hypervisor.
>> c) hypervisor should mange PMEM resource pool and partition it to
>> multiple
>>    VMs.
> 
> It is not possible for Xen to handle ACPI such as this.
> 
> There can only be one OSPM on a system, and 9/10ths of the functionality
> needing it already lives in Dom0.
> 
> The only rational course of action is for Xen to treat both PBLK and
> PMEM as "devices" and leave them in Dom0's hands.

See my other reply: Why would this be different from "ordinary"
memory hotplug?

Jan
Haozhong Zhang Jan. 21, 2016, 2:01 p.m. UTC | #47
On 01/21/16 03:25, Jan Beulich wrote:
> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> > On 01/21/2016 04:53 PM, Jan Beulich wrote:
> >>>>> On 21.01.16 at 09:25, <guangrong.xiao@linux.intel.com> wrote:
> >>> On 01/21/2016 04:18 PM, Jan Beulich wrote:
> >>>> Yes. But then - other than you said above - it still looks to me as
> >>>> if the split between PMEM and PBLK is arranged for by firmware?
> >>>
> >>> Yes. But OS/Hypervisor is not excepted to dynamically change its configure
> >>> (re-split),
> >>> i,e, for PoV of OS/Hypervisor, it is static.
> >>
> >> Exactly, that has been my understanding. And hence the PMEM part
> >> could be under the hypervisor's control, while the PBLK part could be
> >> Dom0's responsibility.
> >>
> > 
> > I am not sure if i have understood your point. What your suggestion is that
> > leave PMEM for hypervisor and all other parts (PBLK and _DSM handling) to
> > Dom0? If yes, we should:
> > a) handle hotplug in hypervisor (new PMEM add/remove) that causes hyperivsor
> >     interpret ACPI SSDT/DSDT.
> 
> Why would this be different from ordinary memory hotplug, where
> Dom0 deals with the ACPI CA interaction, notifying Xen about the
> added memory?
>

The process of NVDIMM hotplug is similar to ordinary memory hotplug, so
it seems possible to support it in the Xen hypervisor in the same way as
ordinary memory hotplug.

> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
> >     handle them in hypervisor.
> 
> Not if (see above) following the model we currently have in place.
>

You mean let dom0 Linux evaluate those _DSMs and interact with the
hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?

> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> >     VMs.
> 
> Yes.
>

But I still do not quite understand this part: why must pmem resource
management and partitioning be done in the hypervisor?

I mean, if we allow the following sequence of operations (for example):
(1) partition pmem in dom0
(2) get the address and size of each partition (part_addr, part_size)
(3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, gpfn) to
    map a partition to the address gpfn in domain d.
Only the last step requires the hypervisor. Would anything be wrong if we
allowed the above operations?
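
To make step (3) concrete, here is a rough toolstack-side sketch.
nvdimm_memory_mapping() is only the name proposed above, not an existing
hypercall; the sketch models it on the existing xc_domain_memory_mapping()
libxc call used to map machine frame (MMIO) ranges into a guest, and
assumes part_addr/part_size are page-aligned:

    #include <xenctrl.h>

    /*
     * Hypothetical wrapper for step (3): map the pmem partition that dom0
     * carved out at machine address part_addr (part_size bytes) into domain
     * domid at guest frame gpfn.  Modeled on xc_domain_memory_mapping(),
     * which already maps machine frame ranges into a guest's p2m.
     */
    static int nvdimm_memory_mapping(xc_interface *xch, uint32_t domid,
                                     uint64_t part_addr, uint64_t part_size,
                                     unsigned long gpfn)
    {
        unsigned long first_mfn = part_addr >> XC_PAGE_SHIFT;
        unsigned long nr_mfns   = part_size >> XC_PAGE_SHIFT;

        return xc_domain_memory_mapping(xch, domid, gpfn, first_mfn,
                                        nr_mfns, DPCI_ADD_MAPPING);
    }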

Ha
Jan Beulich Jan. 21, 2016, 2:52 p.m. UTC | #48
>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> On 01/21/16 03:25, Jan Beulich wrote:
>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
>> >     handle them in hypervisor.
>> 
>> Not if (see above) following the model we currently have in place.
>>
> 
> You mean let dom0 linux evaluates those _DSMs and interact with
> hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?

Yes.

>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
>> >     VMs.
>> 
>> Yes.
>>
> 
> But I Still do not quite understand this part: why must pmem resource
> management and partition be done in hypervisor?

Because that's where memory management belongs. And PMEM,
other than PBLK, is just another form of RAM.

> I mean if we allow the following steps of operations (for example)
> (1) partition pmem in dom 0
> (2) get address and size of each partition (part_addr, part_size)
> (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, 
> gpfn) to
>     map a partition to the address gpfn in dom d.
> Only the last step requires hypervisor. Would anything be wrong if we
> allow above operations?

The main issue is that this would imo be a layering violation. I'm
sure it can be made work, but that doesn't mean that's the way
it ought to work.

Jan
Haozhong Zhang Jan. 22, 2016, 2:43 a.m. UTC | #49
On 01/21/16 07:52, Jan Beulich wrote:
> >>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> > On 01/21/16 03:25, Jan Beulich wrote:
> >> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> >> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
> >> >     handle them in hypervisor.
> >> 
> >> Not if (see above) following the model we currently have in place.
> >>
> > 
> > You mean let dom0 linux evaluates those _DSMs and interact with
> > hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?
> 
> Yes.
> 
> >> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> >> >     VMs.
> >> 
> >> Yes.
> >>
> > 
> > But I Still do not quite understand this part: why must pmem resource
> > management and partition be done in hypervisor?
> 
> Because that's where memory management belongs. And PMEM,
> other than PBLK, is just another form of RAM.
> 
> > I mean if we allow the following steps of operations (for example)
> > (1) partition pmem in dom 0
> > (2) get address and size of each partition (part_addr, part_size)
> > (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size, 
> > gpfn) to
> >     map a partition to the address gpfn in dom d.
> > Only the last step requires hypervisor. Would anything be wrong if we
> > allow above operations?
> 
> The main issue is that this would imo be a layering violation. I'm
> sure it can be made work, but that doesn't mean that's the way
> it ought to work.
> 
> Jan
> 

OK, then it makes sense to put them in the hypervisor. I'll think about
this and note it in the design document.

Thanks,
Haozhong
George Dunlap Jan. 26, 2016, 11:44 a.m. UTC | #50
On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>> On 01/21/16 03:25, Jan Beulich wrote:
>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>> > b) some _DSMs control PMEM so you should filter out these kind of _DSMs and
>>> >     handle them in hypervisor.
>>>
>>> Not if (see above) following the model we currently have in place.
>>>
>>
>> You mean let dom0 linux evaluates those _DSMs and interact with
>> hypervisor if necessary (e.g. XENPF_mem_hotadd for memory hotplug)?
>
> Yes.
>
>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
>>> >     VMs.
>>>
>>> Yes.
>>>
>>
>> But I Still do not quite understand this part: why must pmem resource
>> management and partition be done in hypervisor?
>
> Because that's where memory management belongs. And PMEM,
> other than PBLK, is just another form of RAM.

I haven't looked more deeply into the details of this, but this
argument doesn't seem right to me.

Normal RAM in Xen is what might be called "fungible" -- at boot, all
RAM is zeroed, and it basically doesn't matter at all what RAM is
given to what guest.  (There are restrictions of course: lowmem for
DMA, contiguous superpages, &c; but within those groups, it doesn't
matter *which* bit of lowmem you get, as long as you get enough to do
your job.)  If you reboot your guest or hand RAM back to the
hypervisor, you assume that everything in it will disappear.  When you
ask for RAM, you can request some parameters that it will have
(lowmem, on a specific node, &c), but you can't request a specific
page that you had before.

This is not the case for PMEM.  The whole point of PMEM (correct me if
I'm wrong) is to be used for long-term storage that survives over
reboot.  It matters very much that a guest be given the same PRAM
after the host is rebooted that it was given before.  It doesn't make
any sense to manage it the way Xen currently manages RAM (i.e., that
you request a page and get whatever Xen happens to give you).

So if Xen is going to use PMEM, it will have to invent an entirely new
interface for guests, and it will have to keep track of those
resources across host reboots.  In other words, it will have to
duplicate all the work that Linux already does.  What do we gain from
that duplication?  Why not just leverage what's already implemented in
dom0?

>> I mean if we allow the following steps of operations (for example)
>> (1) partition pmem in dom 0
>> (2) get address and size of each partition (part_addr, part_size)
>> (3) call a hypercall like nvdimm_memory_mapping(d, part_addr, part_size,
>> gpfn) to
>>     map a partition to the address gpfn in dom d.
>> Only the last step requires hypervisor. Would anything be wrong if we
>> allow above operations?
>
> The main issue is that this would imo be a layering violation. I'm
> sure it can be made work, but that doesn't mean that's the way
> it ought to work.

Jan, from a toolstack <-> Xen perspective, I'm not sure what
alternative there is to the interface above.  Won't the toolstack have to
1) figure out what nvdimm regions there are and 2) tell Xen how and
where to assign them to the guest no matter what we do?  And if we
want to assign arbitrary regions to arbitrary guests, then (part_addr,
part_size) and (gpfn) are going to be necessary bits of information.
The only difference would be whether part_addr is the machine address
or some abstracted address space (possibly starting at 0).

What does your ideal toolstack <-> Xen interface look like?

 -George
Jan Beulich Jan. 26, 2016, 12:44 p.m. UTC | #51
>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>>> On 01/21/16 03:25, Jan Beulich wrote:
>>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
>>>> >     VMs.
>>>>
>>>> Yes.
>>>>
>>>
>>> But I Still do not quite understand this part: why must pmem resource
>>> management and partition be done in hypervisor?
>>
>> Because that's where memory management belongs. And PMEM,
>> other than PBLK, is just another form of RAM.
> 
> I haven't looked more deeply into the details of this, but this
> argument doesn't seem right to me.
> 
> Normal RAM in Xen is what might be called "fungible" -- at boot, all
> RAM is zeroed, and it basically doesn't matter at all what RAM is
> given to what guest.  (There are restrictions of course: lowmem for
> DMA, contiguous superpages, &c; but within those groups, it doesn't
> matter *which* bit of lowmem you get, as long as you get enough to do
> your job.)  If you reboot your guest or hand RAM back to the
> hypervisor, you assume that everything in it will disappear.  When you
> ask for RAM, you can request some parameters that it will have
> (lowmem, on a specific node, &c), but you can't request a specific
> page that you had before.
> 
> This is not the case for PMEM.  The whole point of PMEM (correct me if
> I'm wrong) is to be used for long-term storage that survives over
> reboot.  It matters very much that a guest be given the same PRAM
> after the host is rebooted that it was given before.  It doesn't make
> any sense to manage it the way Xen currently manages RAM (i.e., that
> you request a page and get whatever Xen happens to give you).

Interesting. This isn't the usage model I have been thinking about
so far. Having just gone back to the original 0/4 mail, I'm afraid
we're really left guessing, and you guessed differently than I did.
My understanding of the intentions of PMEM so far was that this
is a high-capacity, slower than DRAM but much faster than e.g.
swapping to disk alternative to normal RAM. I.e. the persistent
aspect of it wouldn't matter at all in this case (other than for PBLK,
obviously).

However, thinking through your usage model I have problems
seeing it work in a reasonable way even with virtualization left
aside: To my knowledge there's no established protocol on how
multiple parties (different versions of the same OS, or even
completely different OSes) would arbitrate using such memory
ranges. And even for a single OS it is, other than for disks (and
hence PBLK), not immediately clear how it would communicate
from one boot to another what information got stored where,
or how it would react to some or all of this storage having
disappeared (just like a disk which got removed, which - unless
it held the boot partition - would normally have pretty little
effect on the OS coming back up).

> So if Xen is going to use PMEM, it will have to invent an entirely new
> interface for guests, and it will have to keep track of those
> resources across host reboots.  In other words, it will have to
> duplicate all the work that Linux already does.  What do we gain from
> that duplication?  Why not just leverage what's already implemented in
> dom0?

Indeed if my guessing on the intentions was wrong, then the
picture completely changes (also for the points you've made
further down).

Jan
Jürgen Groß Jan. 26, 2016, 12:54 p.m. UTC | #52
On 26/01/16 13:44, Jan Beulich wrote:
>>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>>>> On 01/21/16 03:25, Jan Beulich wrote:
>>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple
>>>>>>     VMs.
>>>>>
>>>>> Yes.
>>>>>
>>>>
>>>> But I Still do not quite understand this part: why must pmem resource
>>>> management and partition be done in hypervisor?
>>>
>>> Because that's where memory management belongs. And PMEM,
>>> other than PBLK, is just another form of RAM.
>>
>> I haven't looked more deeply into the details of this, but this
>> argument doesn't seem right to me.
>>
>> Normal RAM in Xen is what might be called "fungible" -- at boot, all
>> RAM is zeroed, and it basically doesn't matter at all what RAM is
>> given to what guest.  (There are restrictions of course: lowmem for
>> DMA, contiguous superpages, &c; but within those groups, it doesn't
>> matter *which* bit of lowmem you get, as long as you get enough to do
>> your job.)  If you reboot your guest or hand RAM back to the
>> hypervisor, you assume that everything in it will disappear.  When you
>> ask for RAM, you can request some parameters that it will have
>> (lowmem, on a specific node, &c), but you can't request a specific
>> page that you had before.
>>
>> This is not the case for PMEM.  The whole point of PMEM (correct me if
>> I'm wrong) is to be used for long-term storage that survives over
>> reboot.  It matters very much that a guest be given the same PRAM
>> after the host is rebooted that it was given before.  It doesn't make
>> any sense to manage it the way Xen currently manages RAM (i.e., that
>> you request a page and get whatever Xen happens to give you).
> 
> Interesting. This isn't the usage model I have been thinking about
> so far. Having just gone back to the original 0/4 mail, I'm afraid
> we're really left guessing, and you guessed differently than I did.
> My understanding of the intentions of PMEM so far was that this
> is a high-capacity, slower than DRAM but much faster than e.g.
> swapping to disk alternative to normal RAM. I.e. the persistent
> aspect of it wouldn't matter at all in this case (other than for PBLK,
> obviously).
> 
> However, thinking through your usage model I have problems
> seeing it work in a reasonable way even with virtualization left
> aside: To my knowledge there's no established protocol on how
> multiple parties (different versions of the same OS, or even
> completely different OSes) would arbitrate using such memory
> ranges. And even for a single OS it is, other than for disks (and
> hence PBLK), not immediately clear how it would communicate
> from one boot to another what information got stored where,
> or how it would react to some or all of this storage having
> disappeared (just like a disk which got removed, which - unless
> it held the boot partition - would normally have pretty little
> effect on the OS coming back up).

Last year at Linux Plumbers Conference I attended a session dedicated
to NVDIMM support. I asked the very same question and the INTEL guy
there told me there is indeed something like a partition table meant
to describe the layout of the memory areas and their contents.

It would be nice to have a pointer to such information. Without anything
like this it might be rather difficult to find the best way to implement
NVDIMM support in Xen or any other product.


Juergen
George Dunlap Jan. 26, 2016, 1:58 p.m. UTC | #53
On 26/01/16 12:44, Jan Beulich wrote:
>>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
>>>> On 01/21/16 03:25, Jan Beulich wrote:
>>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
>>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple
>>>>>>     VMs.
>>>>>
>>>>> Yes.
>>>>>
>>>>
>>>> But I Still do not quite understand this part: why must pmem resource
>>>> management and partition be done in hypervisor?
>>>
>>> Because that's where memory management belongs. And PMEM,
>>> other than PBLK, is just another form of RAM.
>>
>> I haven't looked more deeply into the details of this, but this
>> argument doesn't seem right to me.
>>
>> Normal RAM in Xen is what might be called "fungible" -- at boot, all
>> RAM is zeroed, and it basically doesn't matter at all what RAM is
>> given to what guest.  (There are restrictions of course: lowmem for
>> DMA, contiguous superpages, &c; but within those groups, it doesn't
>> matter *which* bit of lowmem you get, as long as you get enough to do
>> your job.)  If you reboot your guest or hand RAM back to the
>> hypervisor, you assume that everything in it will disappear.  When you
>> ask for RAM, you can request some parameters that it will have
>> (lowmem, on a specific node, &c), but you can't request a specific
>> page that you had before.
>>
>> This is not the case for PMEM.  The whole point of PMEM (correct me if
>> I'm wrong) is to be used for long-term storage that survives over
>> reboot.  It matters very much that a guest be given the same PRAM
>> after the host is rebooted that it was given before.  It doesn't make
>> any sense to manage it the way Xen currently manages RAM (i.e., that
>> you request a page and get whatever Xen happens to give you).
> 
> Interesting. This isn't the usage model I have been thinking about
> so far. Having just gone back to the original 0/4 mail, I'm afraid
> we're really left guessing, and you guessed differently than I did.
> My understanding of the intentions of PMEM so far was that this
> is a high-capacity, slower than DRAM but much faster than e.g.
> swapping to disk alternative to normal RAM. I.e. the persistent
> aspect of it wouldn't matter at all in this case (other than for PBLK,
> obviously).

Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
then you're right -- it is just another form of RAM, that should be
treated no differently than say, lowmem: a fungible resource that can be
requested by setting a flag.

Haozhong?

 -George
Konrad Rzeszutek Wilk Jan. 26, 2016, 2:44 p.m. UTC | #54
> Last year at Linux Plumbers Conference I attended a session dedicated
> to NVDIMM support. I asked the very same question and the INTEL guy
> there told me there is indeed something like a partition table meant
> to describe the layout of the memory areas and their contents.

It is described in detail at pmem.io; look under Documents, at
http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf, specifically the Namespaces section.

Then I would recommend you read:
http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

followed by http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

And then for dessert:
https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt
which explains it in more technical terms.
> 
> It would be nice to have a pointer to such information. Without anything
> like this it might be rather difficult to find the best solution how to
> implement NVDIMM support in Xen or any other product.
Konrad Rzeszutek Wilk Jan. 26, 2016, 2:46 p.m. UTC | #55
On Tue, Jan 26, 2016 at 01:58:35PM +0000, George Dunlap wrote:
> On 26/01/16 12:44, Jan Beulich wrote:
> >>>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> >> On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> >>>> On 01/21/16 03:25, Jan Beulich wrote:
> >>>>>>>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> >>>>>> c) hypervisor should mange PMEM resource pool and partition it to multiple
> >>>>>>     VMs.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>
> >>>> But I Still do not quite understand this part: why must pmem resource
> >>>> management and partition be done in hypervisor?
> >>>
> >>> Because that's where memory management belongs. And PMEM,
> >>> other than PBLK, is just another form of RAM.
> >>
> >> I haven't looked more deeply into the details of this, but this
> >> argument doesn't seem right to me.
> >>
> >> Normal RAM in Xen is what might be called "fungible" -- at boot, all
> >> RAM is zeroed, and it basically doesn't matter at all what RAM is
> >> given to what guest.  (There are restrictions of course: lowmem for
> >> DMA, contiguous superpages, &c; but within those groups, it doesn't
> >> matter *which* bit of lowmem you get, as long as you get enough to do
> >> your job.)  If you reboot your guest or hand RAM back to the
> >> hypervisor, you assume that everything in it will disappear.  When you
> >> ask for RAM, you can request some parameters that it will have
> >> (lowmem, on a specific node, &c), but you can't request a specific
> >> page that you had before.
> >>
> >> This is not the case for PMEM.  The whole point of PMEM (correct me if
> >> I'm wrong) is to be used for long-term storage that survives over
> >> reboot.  It matters very much that a guest be given the same PRAM
> >> after the host is rebooted that it was given before.  It doesn't make
> >> any sense to manage it the way Xen currently manages RAM (i.e., that
> >> you request a page and get whatever Xen happens to give you).
> > 
> > Interesting. This isn't the usage model I have been thinking about
> > so far. Having just gone back to the original 0/4 mail, I'm afraid
> > we're really left guessing, and you guessed differently than I did.
> > My understanding of the intentions of PMEM so far was that this
> > is a high-capacity, slower than DRAM but much faster than e.g.
> > swapping to disk alternative to normal RAM. I.e. the persistent
> > aspect of it wouldn't matter at all in this case (other than for PBLK,
> > obviously).
> 
> Oh, right -- yes, if the usage model of PRAM is just "cheap slow RAM",
> then you're right -- it is just another form of RAM, that should be
> treated no differently than say, lowmem: a fungible resource that can be
> requested by setting a flag.

I would think of it as MMIO ranges rather than RAM. Yes, it is behind an MMC - but
there are subtle things such as the new instructions - pcommit, clflushopt,
and others - that impact it.
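
As a rough illustration of why those instructions matter, a user-space
flush of a pmem-backed buffer (a minimal sketch, assuming CLFLUSHOPT is
available and the compiler is invoked with -mclflushopt) might look like:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHELINE 64

    /* Push dirty cache lines of a pmem buffer out towards the NVDIMM.
     * CLFLUSHOPT evicts each line without CLFLUSH's strong ordering; the
     * trailing SFENCE orders the flushes before whatever comes next
     * (e.g. a PCOMMIT on hardware of that era, or relying on ADR). */
    static void pmem_flush(const void *addr, size_t len)
    {
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHELINE)
            _mm_clflushopt((void *)p);
        _mm_sfence();
    }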

Furthermore, ranges (contiguous and most likely discontiguous)
of this "RAM" have to be shared with guests (at least dom0)
and with others (multiple HVM guests).


> 
> Haozhong?
> 
>  -George
> 
> 
Haozhong Zhang Jan. 26, 2016, 3:30 p.m. UTC | #56
On 01/26/16 05:44, Jan Beulich wrote:
> >>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> >>> On 01/21/16 03:25, Jan Beulich wrote:
> >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> >>>> >     VMs.
> >>>>
> >>>> Yes.
> >>>>
> >>>
> >>> But I Still do not quite understand this part: why must pmem resource
> >>> management and partition be done in hypervisor?
> >>
> >> Because that's where memory management belongs. And PMEM,
> >> other than PBLK, is just another form of RAM.
> > 
> > I haven't looked more deeply into the details of this, but this
> > argument doesn't seem right to me.
> > 
> > Normal RAM in Xen is what might be called "fungible" -- at boot, all
> > RAM is zeroed, and it basically doesn't matter at all what RAM is
> > given to what guest.  (There are restrictions of course: lowmem for
> > DMA, contiguous superpages, &c; but within those groups, it doesn't
> > matter *which* bit of lowmem you get, as long as you get enough to do
> > your job.)  If you reboot your guest or hand RAM back to the
> > hypervisor, you assume that everything in it will disappear.  When you
> > ask for RAM, you can request some parameters that it will have
> > (lowmem, on a specific node, &c), but you can't request a specific
> > page that you had before.
> > 
> > This is not the case for PMEM.  The whole point of PMEM (correct me if
> > I'm wrong) is to be used for long-term storage that survives over
> > reboot.  It matters very much that a guest be given the same PRAM
> > after the host is rebooted that it was given before.  It doesn't make
> > any sense to manage it the way Xen currently manages RAM (i.e., that
> > you request a page and get whatever Xen happens to give you).
> 
> Interesting. This isn't the usage model I have been thinking about
> so far. Having just gone back to the original 0/4 mail, I'm afraid
> we're really left guessing, and you guessed differently than I did.
> My understanding of the intentions of PMEM so far was that this
> is a high-capacity, slower than DRAM but much faster than e.g.
> swapping to disk alternative to normal RAM. I.e. the persistent
> aspect of it wouldn't matter at all in this case (other than for PBLK,
> obviously).
>

Of course, pmem could be used in the way you describe because of its
'RAM' aspect. But I think the more meaningful usage comes from its
persistent aspect. For example, a journaling file system could store its
logs in pmem rather than in normal RAM, so that if a power failure
happens before those in-memory logs are completely written to disk,
there is still a chance to restore them from pmem after the next boot
(rather than abandoning all of them).

(I'm still writing the design doc, which will include more details of the
underlying hardware and of the NVDIMM software interface exposed by
current Linux.)

> However, thinking through your usage model I have problems
> seeing it work in a reasonable way even with virtualization left
> aside: To my knowledge there's no established protocol on how
> multiple parties (different versions of the same OS, or even
> completely different OSes) would arbitrate using such memory
> ranges. And even for a single OS it is, other than for disks (and
> hence PBLK), not immediately clear how it would communicate
> from one boot to another what information got stored where,
> or how it would react to some or all of this storage having
> disappeared (just like a disk which got removed, which - unless
> it held the boot partition - would normally have pretty little
> effect on the OS coming back up).
>

The label storage area is a persistent area on an NVDIMM that can be used
to store partition information. It's not included in pmem (the part that
is mapped into the system address space); instead, it can only be
accessed through the NVDIMM _DSM method [1]. However, what contents are
stored there and how they are interpreted are left to software. One way is
to follow the NVDIMM Namespace Specification [2] and store an array of
labels that describe the start address (relative to base 0 of pmem) and
the size of each partition, which is called a namespace. On Linux,
each namespace is exposed as a /dev/pmemXX device.
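
For reference, a pmem namespace label roughly looks like the structure
below (field layout abridged from my reading of the v1.1 Namespace
Specification / Linux's libnvdimm label format; treat it as illustrative
rather than normative):

    #include <stdint.h>

    /* One 128-byte slot in the NVDIMM label storage area (v1.1 layout).
     * 'dpa' and 'rawsize' say where the namespace starts within the DIMM's
     * persistent region and how big this label's contribution is; the OS
     * stitches per-DIMM labels with matching UUIDs into one /dev/pmemXX. */
    struct namespace_label {
        uint8_t  uuid[16];      /* UUID shared by all labels of the namespace */
        char     name[64];      /* optional friendly name */
        uint32_t flags;         /* ROLABEL, LOCAL, UPDATING, ... */
        uint16_t nlabel;        /* number of labels making up the namespace */
        uint16_t position;      /* this label's position in that set */
        uint64_t isetcookie;    /* interleave-set cookie derived from the NFIT */
        uint64_t lbasize;       /* sector size for blk namespaces, 0 for pmem */
        uint64_t dpa;           /* DIMM-relative start address */
        uint64_t rawsize;       /* size of this label's contribution */
        uint32_t slot;          /* slot index in the label storage area */
        uint32_t unused;
    };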

In virtualization, the (virtual) label storage area of a vNVDIMM and
the corresponding _DSM method are emulated by QEMU. The virtual label
storage area is not written to the host one; instead, we can reserve a
small area on pmem for the virtual one.

Besides namespaces, we can also create DAX file systems on pmem and
use files as the unit of partitioning.

Haozhong

> > So if Xen is going to use PMEM, it will have to invent an entirely new
> > interface for guests, and it will have to keep track of those
> > resources across host reboots.  In other words, it will have to
> > duplicate all the work that Linux already does.  What do we gain from
> > that duplication?  Why not just leverage what's already implemented in
> > dom0?
> 
> Indeed if my guessing on the intentions was wrong, then the
> picture completely changes (also for the points you've made
> further down).
> 
> Jan
>
Haozhong Zhang Jan. 26, 2016, 3:33 p.m. UTC | #57
On 01/26/16 23:30, Haozhong Zhang wrote:
> On 01/26/16 05:44, Jan Beulich wrote:
> > >>> On 26.01.16 at 12:44, <George.Dunlap@eu.citrix.com> wrote:
> > > On Thu, Jan 21, 2016 at 2:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
> > >>>>> On 21.01.16 at 15:01, <haozhong.zhang@intel.com> wrote:
> > >>> On 01/21/16 03:25, Jan Beulich wrote:
> > >>>> >>> On 21.01.16 at 10:10, <guangrong.xiao@linux.intel.com> wrote:
> > >>>> > c) hypervisor should mange PMEM resource pool and partition it to multiple
> > >>>> >     VMs.
> > >>>>
> > >>>> Yes.
> > >>>>
> > >>>
> > >>> But I Still do not quite understand this part: why must pmem resource
> > >>> management and partition be done in hypervisor?
> > >>
> > >> Because that's where memory management belongs. And PMEM,
> > >> other than PBLK, is just another form of RAM.
> > > 
> > > I haven't looked more deeply into the details of this, but this
> > > argument doesn't seem right to me.
> > > 
> > > Normal RAM in Xen is what might be called "fungible" -- at boot, all
> > > RAM is zeroed, and it basically doesn't matter at all what RAM is
> > > given to what guest.  (There are restrictions of course: lowmem for
> > > DMA, contiguous superpages, &c; but within those groups, it doesn't
> > > matter *which* bit of lowmem you get, as long as you get enough to do
> > > your job.)  If you reboot your guest or hand RAM back to the
> > > hypervisor, you assume that everything in it will disappear.  When you
> > > ask for RAM, you can request some parameters that it will have
> > > (lowmem, on a specific node, &c), but you can't request a specific
> > > page that you had before.
> > > 
> > > This is not the case for PMEM.  The whole point of PMEM (correct me if
> > > I'm wrong) is to be used for long-term storage that survives over
> > > reboot.  It matters very much that a guest be given the same PRAM
> > > after the host is rebooted that it was given before.  It doesn't make
> > > any sense to manage it the way Xen currently manages RAM (i.e., that
> > > you request a page and get whatever Xen happens to give you).
> > 
> > Interesting. This isn't the usage model I have been thinking about
> > so far. Having just gone back to the original 0/4 mail, I'm afraid
> > we're really left guessing, and you guessed differently than I did.
> > My understanding of the intentions of PMEM so far was that this
> > is a high-capacity, slower than DRAM but much faster than e.g.
> > swapping to disk alternative to normal RAM. I.e. the persistent
> > aspect of it wouldn't matter at all in this case (other than for PBLK,
> > obviously).
> >
> 
> Of course, pmem could be used in the way you thought because of its
> 'ram' aspect. But I think the more meaningful usage is from its
> persistent aspect. For example, the implementation of some journal
> file systems could store logs in pmem rather than the normal ram, so
> that if a power failure happens before those in-memory logs are
> completely written to the disk, there would still be chance to restore
> them from pmem after next booting (rather than abandoning all of
> them).
> 
> (I'm still writing the design doc which will include more details of
> underlying hardware and the software interface of nvdimm exposed by
> current linux)
> 
> > However, thinking through your usage model I have problems
> > seeing it work in a reasonable way even with virtualization left
> > aside: To my knowledge there's no established protocol on how
> > multiple parties (different versions of the same OS, or even
> > completely different OSes) would arbitrate using such memory
> > ranges. And even for a single OS it is, other than for disks (and
> > hence PBLK), not immediately clear how it would communicate
> > from one boot to another what information got stored where,
> > or how it would react to some or all of this storage having
> > disappeared (just like a disk which got removed, which - unless
> > it held the boot partition - would normally have pretty little
> > effect on the OS coming back up).
> >
> 
> Label storage area is a persistent area on NVDIMM and can be used to
> store partitions information. It's not included in pmem (that part
> that is mapped into the system address space). Instead, it can be only
> accessed through NVDIMM _DSM method [1]. However, what contents are
> stored and how they are interpreted are left to software. One way is
> to follow NVDIMM Namespace Specification [2] to store an array of
> labels that describe the start address (from the base 0 of pmem) and
> the size of each partition, which is called as namespace. On Linux,
> each namespace is exposed as a /dev/pmemXX device.
> 
> In the virtualization, the (virtual) label storage area of vNVDIMM and
> the corresponding _DSM method are emulated by QEMU. The virtual label
> storage area is not written to the host one. Instead, we can reserve a
> piece area on pmem for the virtual one.
> 
> Besides namespaces, we can also create DAX file systems on pmem and
> use files to partition.
>

Forgot references:
[1] NVDIMM DSM Interface Examples, http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[2] NVDIMM Namespace Specification, http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

> Haozhong
> 
> > > So if Xen is going to use PMEM, it will have to invent an entirely new
> > > interface for guests, and it will have to keep track of those
> > > resources across host reboots.  In other words, it will have to
> > > duplicate all the work that Linux already does.  What do we gain from
> > > that duplication?  Why not just leverage what's already implemented in
> > > dom0?
> > 
> > Indeed if my guessing on the intentions was wrong, then the
> > picture completely changes (also for the points you've made
> > further down).
> > 
> > Jan
> >
Jan Beulich Jan. 26, 2016, 3:37 p.m. UTC | #58
>>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>>  Last year at Linux Plumbers Conference I attended a session dedicated
>> to NVDIMM support. I asked the very same question and the INTEL guy
>> there told me there is indeed something like a partition table meant
>> to describe the layout of the memory areas and their contents.
> 
> It is described in details at pmem.io, look at  Documents, see
> http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.

Well, that's about how PMEM and PBLK ranges get marked, but not
about how use of the space inside a PMEM range is coordinated.

Jan
Jan Beulich Jan. 26, 2016, 3:57 p.m. UTC | #59
>>> On 26.01.16 at 16:30, <haozhong.zhang@intel.com> wrote:
> On 01/26/16 05:44, Jan Beulich wrote:
>> Interesting. This isn't the usage model I have been thinking about
>> so far. Having just gone back to the original 0/4 mail, I'm afraid
>> we're really left guessing, and you guessed differently than I did.
>> My understanding of the intentions of PMEM so far was that this
>> is a high-capacity, slower than DRAM but much faster than e.g.
>> swapping to disk alternative to normal RAM. I.e. the persistent
>> aspect of it wouldn't matter at all in this case (other than for PBLK,
>> obviously).
> 
> Of course, pmem could be used in the way you thought because of its
> 'ram' aspect. But I think the more meaningful usage is from its
> persistent aspect. For example, the implementation of some journal
> file systems could store logs in pmem rather than the normal ram, so
> that if a power failure happens before those in-memory logs are
> completely written to the disk, there would still be chance to restore
> them from pmem after next booting (rather than abandoning all of
> them).

Well, that leaves open how that file system would find its log
after reboot, or how that log is protected from clobbering by
another OS booted in between.

>> However, thinking through your usage model I have problems
>> seeing it work in a reasonable way even with virtualization left
>> aside: To my knowledge there's no established protocol on how
>> multiple parties (different versions of the same OS, or even
>> completely different OSes) would arbitrate using such memory
>> ranges. And even for a single OS it is, other than for disks (and
>> hence PBLK), not immediately clear how it would communicate
>> from one boot to another what information got stored where,
>> or how it would react to some or all of this storage having
>> disappeared (just like a disk which got removed, which - unless
>> it held the boot partition - would normally have pretty little
>> effect on the OS coming back up).
> 
> Label storage area is a persistent area on NVDIMM and can be used to
> store partitions information. It's not included in pmem (that part
> that is mapped into the system address space). Instead, it can be only
> accessed through NVDIMM _DSM method [1]. However, what contents are
> stored and how they are interpreted are left to software. One way is
> to follow NVDIMM Namespace Specification [2] to store an array of
> labels that describe the start address (from the base 0 of pmem) and
> the size of each partition, which is called as namespace. On Linux,
> each namespace is exposed as a /dev/pmemXX device.

According to what I've just read in one of the documents Konrad
pointed us to, there can be just one PMEM label per DIMM. Unless
I misread of course...

Jan
Haozhong Zhang Jan. 26, 2016, 3:57 p.m. UTC | #60
On 01/26/16 08:37, Jan Beulich wrote:
> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> there told me there is indeed something like a partition table meant
> >> to describe the layout of the memory areas and their contents.
> > 
> > It is described in details at pmem.io, look at  Documents, see
> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
> 
> Well, that's about how PMEM and PBLK ranges get marked, but not
> about how use of the space inside a PMEM range is coordinated.
>

How an NVDIMM is partitioned into pmem and pblk is described by the ACPI NFIT table.
A namespace is to pmem what a partition table is to a disk.

Haozhong
Jan Beulich Jan. 26, 2016, 4:34 p.m. UTC | #61
>>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> On 01/26/16 08:37, Jan Beulich wrote:
>> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>> >>  Last year at Linux Plumbers Conference I attended a session dedicated
>> >> to NVDIMM support. I asked the very same question and the INTEL guy
>> >> there told me there is indeed something like a partition table meant
>> >> to describe the layout of the memory areas and their contents.
>> > 
>> > It is described in details at pmem.io, look at  Documents, see
>> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
>> 
>> Well, that's about how PMEM and PBLK ranges get marked, but not
>> about how use of the space inside a PMEM range is coordinated.
>>
> 
> How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT 
> table.
> Namespace to pmem is something like partition table to disk.

But I'm talking about sub-dividing the space inside an individual
PMEM range.

Jan
Konrad Rzeszutek Wilk Jan. 26, 2016, 7:32 p.m. UTC | #62
On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> > On 01/26/16 08:37, Jan Beulich wrote:
> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> >> there told me there is indeed something like a partition table meant
> >> >> to describe the layout of the memory areas and their contents.
> >> > 
> >> > It is described in details at pmem.io, look at  Documents, see
> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
> >> 
> >> Well, that's about how PMEM and PBLK ranges get marked, but not
> >> about how use of the space inside a PMEM range is coordinated.
> >>
> > 
> > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT 
> > table.
> > Namespace to pmem is something like partition table to disk.
> 
> But I'm talking about sub-dividing the space inside an individual
> PMEM range.

The namespaces are it.

Once you have created them you can mount the PMEM range under say /dev/pmem0
and then put a filesystem on it (ext4, xfs) - and enable DAX support.
DAX just means that the FS will bypass the page cache and write directly
to the virtual address.

Then one can create giant 'dd' images on this filesystem and pass them
to QEMU to expose as NVDIMM to the guest. Because each image is a file, the
blocks (or MFNs) backing its contents are most certainly discontiguous.
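
To make that last step concrete, here is a minimal sketch of how a process
such as QEMU could map one of those images; the path is made up and error
handling is trimmed to the bare minimum:

/* Minimal sketch: map an image file living on a DAX-mounted pmem
 * filesystem into this process' address space. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/pmem0/guest-nvdimm.img";   /* hypothetical */
    struct stat st;
    int fd = open(path, O_RDWR);

    if ( fd < 0 || fstat(fd, &st) < 0 )
    {
        perror(path);
        return 1;
    }

    /* MAP_SHARED on a DAX mount: loads/stores go straight to the NVDIMM
     * pages, no page cache in between. */
    void *va = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if ( va == MAP_FAILED )
    {
        perror("mmap");
        return 1;
    }

    /* 'va' is what would then be handed on as the vNVDIMM backing memory. */
    munmap(va, st.st_size);
    close(fd);
    return 0;
}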

> 
> Jan
>
Haozhong Zhang Jan. 27, 2016, 2:23 a.m. UTC | #63
On 01/26/16 08:57, Jan Beulich wrote:
> >>> On 26.01.16 at 16:30, <haozhong.zhang@intel.com> wrote:
> > On 01/26/16 05:44, Jan Beulich wrote:
> >> Interesting. This isn't the usage model I have been thinking about
> >> so far. Having just gone back to the original 0/4 mail, I'm afraid
> >> we're really left guessing, and you guessed differently than I did.
> >> My understanding of the intentions of PMEM so far was that this
> >> is a high-capacity, slower than DRAM but much faster than e.g.
> >> swapping to disk alternative to normal RAM. I.e. the persistent
> >> aspect of it wouldn't matter at all in this case (other than for PBLK,
> >> obviously).
> > 
> > Of course, pmem could be used in the way you thought because of its
> > 'ram' aspect. But I think the more meaningful usage is from its
> > persistent aspect. For example, the implementation of some journal
> > file systems could store logs in pmem rather than the normal ram, so
> > that if a power failure happens before those in-memory logs are
> > completely written to the disk, there would still be chance to restore
> > them from pmem after next booting (rather than abandoning all of
> > them).
> 
> Well, that leaves open how that file system would find its log
> after reboot, or how that log is protected from clobbering by
> another OS booted in between.
>

It would depend on the concrete design of those OSes or applications.
This is just an example to show a possible use of the persistent
aspect.

> >> However, thinking through your usage model I have problems
> >> seeing it work in a reasonable way even with virtualization left
> >> aside: To my knowledge there's no established protocol on how
> >> multiple parties (different versions of the same OS, or even
> >> completely different OSes) would arbitrate using such memory
> >> ranges. And even for a single OS it is, other than for disks (and
> >> hence PBLK), not immediately clear how it would communicate
> >> from one boot to another what information got stored where,
> >> or how it would react to some or all of this storage having
> >> disappeared (just like a disk which got removed, which - unless
> >> it held the boot partition - would normally have pretty little
> >> effect on the OS coming back up).
> > 
> > Label storage area is a persistent area on NVDIMM and can be used to
> > store partitions information. It's not included in pmem (that part
> > that is mapped into the system address space). Instead, it can be only
> > accessed through NVDIMM _DSM method [1]. However, what contents are
> > stored and how they are interpreted are left to software. One way is
> > to follow NVDIMM Namespace Specification [2] to store an array of
> > labels that describe the start address (from the base 0 of pmem) and
> > the size of each partition, which is called as namespace. On Linux,
> > each namespace is exposed as a /dev/pmemXX device.
> 
> According to what I've just read in one of the documents Konrad
> pointed us to, there can be just one PMEM label per DIMM. Unless
> I misread of course...
>

My mistake, only one pmem label per DIMM.

Haozhong
Haozhong Zhang Jan. 27, 2016, 7:22 a.m. UTC | #64
On 01/26/16 14:32, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> > >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> > > On 01/26/16 08:37, Jan Beulich wrote:
> > >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> > >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> > >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> > >> >> there told me there is indeed something like a partition table meant
> > >> >> to describe the layout of the memory areas and their contents.
> > >> > 
> > >> > It is described in details at pmem.io, look at  Documents, see
> > >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
> > >> 
> > >> Well, that's about how PMEM and PBLK ranges get marked, but not
> > >> about how use of the space inside a PMEM range is coordinated.
> > >>
> > > 
> > > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT 
> > > table.
> > > Namespace to pmem is something like partition table to disk.
> > 
> > But I'm talking about sub-dividing the space inside an individual
> > PMEM range.
> 
> The namespaces are it.
>

Because only one persistent memory namespace is allowed for an
individual pmem range, namespaces cannot be used to sub-divide it.

> Once you have done them you can mount the PMEM range under say /dev/pmem0
> and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> The DAX just means that the FS will bypass the page cache and write directly
> to the virtual address.
> 
> then one can create giant 'dd' images on this filesystem and pass it
> to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the blocks
> (or MFNs) for the contents of the file are most certainly discontingous.
>

Though the 'dd' image may occupy discontiguous MFNs on the host pmem, we can
map them to contiguous guest PFNs.
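
To sketch what that mapping loop could look like (a sketch only: the extent
structure below is hypothetical, discovering the extents of the image file is
left out, and whether reusing the existing MMIO-mapping interface is the
right thing for pmem is exactly the open question in this thread):

/* Rough sketch: map a discontiguous set of host pmem extents to a
 * contiguous range of guest frames. */
#include <xenctrl.h>

struct pmem_extent {
    unsigned long mfn;      /* first host machine frame of the extent */
    unsigned long nr_mfns;  /* number of frames in the extent */
};

static int map_pmem_extents(xc_interface *xch, uint32_t domid,
                            unsigned long first_gfn,
                            const struct pmem_extent *ext, unsigned int nr)
{
    unsigned long gfn = first_gfn;
    unsigned int i;

    for ( i = 0; i < nr; i++ )
    {
        /* Last argument: 1 == DPCI_ADD_MAPPING, i.e. add the mapping. */
        int rc = xc_domain_memory_mapping(xch, domid, gfn, ext[i].mfn,
                                          ext[i].nr_mfns, 1);
        if ( rc )
            return rc;
        gfn += ext[i].nr_mfns;  /* guest frames stay contiguous */
    }
    return 0;
}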
Jan Beulich Jan. 27, 2016, 10:16 a.m. UTC | #65
>>> On 26.01.16 at 20:32, <konrad.wilk@oracle.com> wrote:
> On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
>> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
>> > On 01/26/16 08:37, Jan Beulich wrote:
>> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>> >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
>> >> >> to NVDIMM support. I asked the very same question and the INTEL guy
>> >> >> there told me there is indeed something like a partition table meant
>> >> >> to describe the layout of the memory areas and their contents.
>> >> > 
>> >> > It is described in details at pmem.io, look at  Documents, see
>> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
>> >> 
>> >> Well, that's about how PMEM and PBLK ranges get marked, but not
>> >> about how use of the space inside a PMEM range is coordinated.
>> >>
>> > 
>> > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT 
>> > table.
>> > Namespace to pmem is something like partition table to disk.
>> 
>> But I'm talking about sub-dividing the space inside an individual
>> PMEM range.
> 
> The namespaces are it.
> 
> Once you have done them you can mount the PMEM range under say /dev/pmem0
> and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> The DAX just means that the FS will bypass the page cache and write directly
> to the virtual address.
> 
> then one can create giant 'dd' images on this filesystem and pass it
> to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the 
> blocks
> (or MFNs) for the contents of the file are most certainly discontingous.

And what's the advantage of this over PBLK? I.e. why would one
want to separate PMEM and PBLK ranges if everything gets used
the same way anyway?

Jan
George Dunlap Jan. 27, 2016, 10:55 a.m. UTC | #66
On Tue, Jan 26, 2016 at 4:34 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
>> On 01/26/16 08:37, Jan Beulich wrote:
>>> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
>>> >>  Last year at Linux Plumbers Conference I attended a session dedicated
>>> >> to NVDIMM support. I asked the very same question and the INTEL guy
>>> >> there told me there is indeed something like a partition table meant
>>> >> to describe the layout of the memory areas and their contents.
>>> >
>>> > It is described in details at pmem.io, look at  Documents, see
>>> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
>>>
>>> Well, that's about how PMEM and PBLK ranges get marked, but not
>>> about how use of the space inside a PMEM range is coordinated.
>>>
>>
>> How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT
>> table.
>> Namespace to pmem is something like partition table to disk.
>
> But I'm talking about sub-dividing the space inside an individual
> PMEM range.

Well, as long as full PMEM blocks can, at a high level, be allocated /
marked to a single OS, that OS can figure out if / how to further
subdivide them (and store information about that subdivision).

But in any case, since it seems from what Haozhong and Konrad say
that the point of this *is* in fact to take advantage of the
persistence, allowing Linux to solve the problem of how to subdivide
PMEM blocks and just leveraging its solution would be better than
trying to duplicate all that effort inside of Xen.

 -George
Konrad Rzeszutek Wilk Jan. 27, 2016, 2:50 p.m. UTC | #67
On Wed, Jan 27, 2016 at 03:16:59AM -0700, Jan Beulich wrote:
> >>> On 26.01.16 at 20:32, <konrad.wilk@oracle.com> wrote:
> > On Tue, Jan 26, 2016 at 09:34:13AM -0700, Jan Beulich wrote:
> >> >>> On 26.01.16 at 16:57, <haozhong.zhang@intel.com> wrote:
> >> > On 01/26/16 08:37, Jan Beulich wrote:
> >> >> >>> On 26.01.16 at 15:44, <konrad.wilk@oracle.com> wrote:
> >> >> >>  Last year at Linux Plumbers Conference I attended a session dedicated
> >> >> >> to NVDIMM support. I asked the very same question and the INTEL guy
> >> >> >> there told me there is indeed something like a partition table meant
> >> >> >> to describe the layout of the memory areas and their contents.
> >> >> > 
> >> >> > It is described in details at pmem.io, look at  Documents, see
> >> >> > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf see Namespaces section.
> >> >> 
> >> >> Well, that's about how PMEM and PBLK ranges get marked, but not
> >> >> about how use of the space inside a PMEM range is coordinated.
> >> >>
> >> > 
> >> > How a NVDIMM is partitioned into pmem and pblk is described by ACPI NFIT 
> >> > table.
> >> > Namespace to pmem is something like partition table to disk.
> >> 
> >> But I'm talking about sub-dividing the space inside an individual
> >> PMEM range.
> > 
> > The namespaces are it.
> > 
> > Once you have done them you can mount the PMEM range under say /dev/pmem0
> > and then put a filesystem on it (ext4, xfs) - and enable DAX support.
> > The DAX just means that the FS will bypass the page cache and write directly
> > to the virtual address.
> > 
> > then one can create giant 'dd' images on this filesystem and pass it
> > to QEMU to .. expose as NVDIMM to the guest. Because it is a file - the 
> > blocks
> > (or MFNs) for the contents of the file are most certainly discontingous.
> 
> And what's the advantage of this over PBLK? I.e. why would one
> want to separate PMEM and PBLK ranges if everything gets used
> the same way anyway?

Speed. PBLK emulates hardware - by having a sliding window into the DIMM. The
OS can only write to a ring-buffer with the system address and the payload
(64 bytes I think?) - and the hardware (or firmware) picks it up and does the
writes to the NVDIMM.
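
Purely for illustration, the flavour of interface being described looks
roughly like the sketch below; the struct layout, the aperture size and the
command encoding are all made up here, the real ones come from the NFIT and
the vendor _DSM:

/* Illustrative only: "program a block window, move data through its
 * aperture, then check a status register".  Not a real register layout. */
#include <stdint.h>
#include <string.h>

struct blk_window {                    /* hypothetical layout */
    volatile uint64_t cmd;             /* target block + direction bit */
    volatile uint64_t status;          /* non-zero => error on this transfer */
    volatile uint8_t  aperture[8192];  /* data window into the DIMM */
};

/* Write one block through the window; len must not exceed the aperture. */
static int pblk_write_block(struct blk_window *w, uint64_t blk,
                            const void *buf, size_t len)
{
    w->cmd = blk | (1ULL << 63);           /* hypothetical "this is a write" bit */
    memcpy((void *)w->aperture, buf, len); /* payload goes through the aperture */
    return w->status ? -1 : 0;             /* errors show up synchronously here */
}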

The only motivation behind this is to deal with errors. Normal PMEM writes
do not report errors. As in, if the media is busted, the hardware will engage
its remap logic and write somewhere else - until all of its remap blocks have
been exhausted. At that point writes (I presume, not sure) and reads will
report an error - but via an #MCE.

Part of this Xen design will be how to handle that :-)

With PBLK - I presume the hardware/firmware will read the block after it has
written it - and if there are errors it will report them right away. Which
means you can easily hook PBLK nicely into RAID setups. It will be slower
than PMEM, but it does give you the normal error reporting. That is, until
the MCE# -> OS -> fs error logic gets figured out.

The MCE# handling code is being developed right now by Tony Luck on LKML -
and the last I saw, the MCE# carries the system address, and the MCE code
would tag the pages with some bit so that the applications would get a
signal.

> 
> Jan
>
diff mbox

Patch

diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 503648c..72be3e0 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -292,8 +292,10 @@  static struct acpi_20_slit *construct_slit(void)
     return slit;
 }
 
-static int construct_passthrough_tables(unsigned long *table_ptrs,
-                                        int nr_tables)
+static int construct_passthrough_tables_common(unsigned long *table_ptrs,
+                                               int nr_tables,
+                                               const char *xs_acpi_pt_addr,
+                                               const char *xs_acpi_pt_length)
 {
     const char *s;
     uint8_t *acpi_pt_addr;
@@ -304,26 +306,28 @@  static int construct_passthrough_tables(unsigned long *table_ptrs,
     uint32_t total = 0;
     uint8_t *buffer;
 
-    s = xenstore_read(HVM_XS_ACPI_PT_ADDRESS, NULL);
+    s = xenstore_read(xs_acpi_pt_addr, NULL);
     if ( s == NULL )
-        return 0;    
+        return 0;
 
     acpi_pt_addr = (uint8_t*)(uint32_t)strtoll(s, NULL, 0);
     if ( acpi_pt_addr == NULL )
         return 0;
 
-    s = xenstore_read(HVM_XS_ACPI_PT_LENGTH, NULL);
+    s = xenstore_read(xs_acpi_pt_length, NULL);
     if ( s == NULL )
         return 0;
 
     acpi_pt_length = (uint32_t)strtoll(s, NULL, 0);
 
     for ( nr_added = 0; nr_added < nr_max; nr_added++ )
-    {        
+    {
         if ( (acpi_pt_length - total) < sizeof(struct acpi_header) )
             break;
 
         header = (struct acpi_header*)acpi_pt_addr;
+        set_checksum(header, offsetof(struct acpi_header, checksum),
+                     header->length);
 
         buffer = mem_alloc(header->length, 16);
         if ( buffer == NULL )
@@ -338,6 +342,21 @@  static int construct_passthrough_tables(unsigned long *table_ptrs,
     return nr_added;
 }
 
+static int construct_passthrough_tables(unsigned long *table_ptrs,
+                                        int nr_tables)
+{
+    return construct_passthrough_tables_common(table_ptrs, nr_tables,
+                                               HVM_XS_ACPI_PT_ADDRESS,
+                                               HVM_XS_ACPI_PT_LENGTH);
+}
+
+static int construct_dm_tables(unsigned long *table_ptrs, int nr_tables)
+{
+    return construct_passthrough_tables_common(table_ptrs, nr_tables,
+                                               HVM_XS_DM_ACPI_PT_ADDRESS,
+                                               HVM_XS_DM_ACPI_PT_LENGTH);
+}
+
 static int construct_secondary_tables(unsigned long *table_ptrs,
                                       struct acpi_info *info)
 {
@@ -454,6 +473,9 @@  static int construct_secondary_tables(unsigned long *table_ptrs,
     /* Load any additional tables passed through. */
     nr_tables += construct_passthrough_tables(table_ptrs, nr_tables);
 
+    /* Load any additional tables from device model */
+    nr_tables += construct_dm_tables(table_ptrs, nr_tables);
+
     table_ptrs[nr_tables] = 0;
     return nr_tables;
 }
diff --git a/xen/include/public/hvm/hvm_xs_strings.h b/xen/include/public/hvm/hvm_xs_strings.h
index 146b0b0..4698495 100644
--- a/xen/include/public/hvm/hvm_xs_strings.h
+++ b/xen/include/public/hvm/hvm_xs_strings.h
@@ -41,6 +41,9 @@ 
 #define HVM_XS_ACPI_PT_ADDRESS         "hvmloader/acpi/address"
 #define HVM_XS_ACPI_PT_LENGTH          "hvmloader/acpi/length"
 
+#define HVM_XS_DM_ACPI_PT_ADDRESS      "hvmloader/dm-acpi/address"
+#define HVM_XS_DM_ACPI_PT_LENGTH       "hvmloader/dm-acpi/length"
+
 /* Any number of SMBIOS types can be passed through to an HVM guest using
  * the following xenstore values. The values specify the guest physical
  * address and length of a block of SMBIOS structures for hvmloader to use.
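
For completeness, a sketch of how a device model (or the toolstack) might
populate the two new keys once it has placed its extra tables in guest
memory. Only the key names come from this patch; the helper, the choice of
writer and the address/length values are assumptions for illustration:

/* Sketch: advertise extra ACPI tables to hvmloader via the new keys.
 * hvmloader parses both values with strtoll(..., 0), so hex is fine. */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <xenstore.h>

static int advertise_dm_acpi(uint32_t domid, uint64_t gpa, uint32_t len)
{
    struct xs_handle *xsh = xs_open(0);
    char *dompath, path[256], val[32];
    bool ok;

    if ( !xsh )
        return -1;

    dompath = xs_get_domain_path(xsh, domid);   /* "/local/domain/<domid>" */

    snprintf(path, sizeof(path), "%s/hvmloader/dm-acpi/address", dompath);
    snprintf(val, sizeof(val), "0x%" PRIx64, gpa);
    ok = xs_write(xsh, XBT_NULL, path, val, strlen(val));

    snprintf(path, sizeof(path), "%s/hvmloader/dm-acpi/length", dompath);
    snprintf(val, sizeof(val), "%" PRIu32, len);
    ok = ok && xs_write(xsh, XBT_NULL, path, val, strlen(val));

    free(dompath);
    xs_close(xsh);
    return ok ? 0 : -1;
}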