
[RFC] hw/cxl: type 3 devices can now present volatile or persistent memory

Message ID 20221006000103.49542-1-gregory.price@memverge.com (mailing list archive)
State New, archived
Series [RFC] hw/cxl: type 3 devices can now present volatile or persistent memory

Commit Message

Gregory Price Oct. 6, 2022, 12:01 a.m. UTC
Type 3 devices were hard-coded to always present as persistent memory devices.
This patch adds the "is_pmem" attribute which can be used to instantiate
a device as either a pmem or vmem.

Right now it is only possible to choose one or the other, but future
devices may present both (such as multi-logical devices with different
regions backed by different types of memory).
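
For example, with the new property (backend ids here are illustrative; full
command lines are in the documentation update below):

  # persistent memory device (current behaviour)
  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0
  # volatile memory device
  -device cxl-type3,bus=root_port13,pmem=false,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-vmem0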
---
 docs/system/devices/cxl.rst    | 31 ++++++++++++++++---------
 hw/cxl/cxl-mailbox-utils.c     | 24 +++++++++++---------
 hw/mem/cxl_type3.c             | 10 +++++----
 include/hw/cxl/cxl_device.h    |  5 +++--
 tests/qtest/bios-tables-test.c |  8 +++----
 tests/qtest/cxl-test.c         | 41 ++++++++++++++++++++++++++++------
 6 files changed, 82 insertions(+), 37 deletions(-)

Comments

Jonathan Cameron Oct. 6, 2022, 8:45 a.m. UTC | #1
On Wed,  5 Oct 2022 20:01:03 -0400
Gourry <gourry.memverge@gmail.com> wrote:

> Type 3 devices were hard-coded to always present as persistent memory devices.
> This patch adds the "is_pmem" attribute which can be used to instantiate
> a device as either a pmem or vmem.

Hi Gourry,

Great to see this.

Missing Signed-off-by so we can't apply this (no Developer Certificate of
Origin).  Probably want your From address to match that, and I've no idea
whether your full name is Gourry (apologies if it is!)

+CC linux-cxl as most folks who will be interested in this hang out there
and are more likely to notice it there than buried on qemu-devel.
Also +CC some others who will be interested in this discussion.

Please also cc the listed CXL QEMU maintainers (myself and Ben W) - see
MAINTAINERS

> 
> Right now it is only possible to choose one or the other, but future
> devices may present both (such as multi-logical devices with different
> regions backed by different types of memory).

Would it be possible to introduce this support in a fashion that allowed
us to transition to devices presenting both without needing a change of
interface?  With hindsight the memdev naming of the existing parameter is
less than helpful, but one path to doing this would be to allow for
an alternative memdev-volatile= parameter.  For now we can keep the patch
small by only supporting either memdev or memdev-volatile

That way when we want to come along and add mixed devices later, we simply
enable supporting the specification of two different memory backends.
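
For example, a volatile-only device under that scheme might look something like
this (a sketch of the proposed interface only, not something QEMU accepts today):

  -object memory-backend-ram,id=cxl-vmem0,share=on,size=256M \
  -device cxl-type3,bus=root_port13,memdev-volatile=cxl-vmem0,lsa=cxl-lsa0,id=cxl-vmem-dev0 \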

The tricky bit then would be to support variable capacity (devices can
support a region of memory which may be persistent or non persistent
depending on settings - see Set Partition Info).

So my gut feeling is we may eventually allow all of:

1) single memory backend for persistent or volatile.
2) two memory backends, covering a persistent region and a volatile region
  (likely you may want actual non volatile backend for the persistent
   region but not the volatile one).
3) single memory backend in which we can allow for the Set Partition Info stuff
   to work.

As we move beyond just one backend that is presented as persistent we need
to define the interface...

I can think of far too many options!  Let me lay out a few for comment.

A) Array of memory backends, separate control of configurable size
   + starting persistent size, starting volatile size.
   Default of one memory backend gives you current non-volatile only.
   To get volatile only you provide one memory backend and set
   persistent size to 0 and volatile size to size of memory backend.

B) Two memory backends, (later add some trickery around presented size)
   memdev as defined is persistent memory (maybe we deprecate that interface
   and make it memdev-persistent) memdev-volatile as volatile.

   Steps to add functionality:
   1. Add your code to enable volatile, but use 'which memdev is supplied' to
      act as the flag.  For now, supplying both is an error (see the sketch
      after this list).
   2. Enable static volatile + non-volatile devices (both memdevs supplied).
   3. (hackish) Add a control for the amount of partitionable capacity.  Cheat
      and remove that capacity from both memdevs.  Changing the partition just
      messes with how much of each memdev is used.  We'd need to be a little
      careful to ensure we wipe data etc.
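
A rough sketch of how step 1 might look, assuming a hypothetical
hostmem_volatile link property next to the existing memdev one (names and
structure are illustrative only, not a tested implementation):

    static Property ct3_props[] = {
        DEFINE_PROP_LINK("memdev", CXLType3Dev, hostmem, TYPE_MEMORY_BACKEND,
                         HostMemoryBackend *),
        DEFINE_PROP_LINK("memdev-volatile", CXLType3Dev, hostmem_volatile,
                         TYPE_MEMORY_BACKEND, HostMemoryBackend *),
        /* ... lsa link and end-of-list as today ... */
    };

    /* In cxl_setup_memory(): which backend was supplied acts as the flag */
    if (ct3d->hostmem && ct3d->hostmem_volatile) {
        error_setg(errp, "only one of memdev and memdev-volatile may be set");
        return false;
    } else if (ct3d->hostmem_volatile) {
        backend = ct3d->hostmem_volatile;   /* volatile capacity */
        is_pmem = false;
    } else if (ct3d->hostmem) {
        backend = ct3d->hostmem;            /* persistent capacity */
        is_pmem = true;
    } else {
        error_setg(errp, "memdev or memdev-volatile property must be set");
        return false;
    }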

Your code mostly looks good to me, so I think discussion here is going to be
all about the interface and how we build one more suitable for the future!

Jonathan




> ---
>  docs/system/devices/cxl.rst    | 31 ++++++++++++++++---------
>  hw/cxl/cxl-mailbox-utils.c     | 24 +++++++++++---------
>  hw/mem/cxl_type3.c             | 10 +++++----
>  include/hw/cxl/cxl_device.h    |  5 +++--
>  tests/qtest/bios-tables-test.c |  8 +++----
>  tests/qtest/cxl-test.c         | 41 ++++++++++++++++++++++++++++------
>  6 files changed, 82 insertions(+), 37 deletions(-)
> 
> diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst
> index f25783a4ec..3a62d46e8a 100644
> --- a/docs/system/devices/cxl.rst
> +++ b/docs/system/devices/cxl.rst
> @@ -300,7 +300,7 @@ Example topology involving a switch::
>  
>  Example command lines
>  ---------------------
> -A very simple setup with just one directly attached CXL Type 3 device::
> +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
>  
>    qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
>    ...
> @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device::
>    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
>    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> +  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

Can't do this as it breaks any existing setup.  If a flag makes sense (I don't think
it is flexible enough) then it would need to be a 'volatile' flag, so that it not
being present corresponds to false and to the current situation.

> +
> +A very simple setup with just one directly attached CXL Type 3 Volatile Memory device::
> +
> +  qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
> +  ...
> +  -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
> +  -object memory-backend-ram,id=cxl-lsa1,share=on,size=256M \
> +  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> +  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> +  -device cxl-type3,bus=root_port13,pmem=false,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
>    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
>  
>  A setup suitable for 4 way interleave. Only one fixed window provided, to enable 2 way
> @@ -328,13 +339,13 @@ the CXL Type3 device directly attached (no switches).::
>    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>    -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
>    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
>    -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \
> -  -device cxl-type3,bus=root_port14,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
> +  -device cxl-type3,bus=root_port14,pmem=true,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
>    -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \
> -  -device cxl-type3,bus=root_port15,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
> +  -device cxl-type3,bus=root_port15,pmem=true,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
>    -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \
> -  -device cxl-type3,bus=root_port16,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
> +  -device cxl-type3,bus=root_port16,pmem=true,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
>    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k
>  
>  An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
> @@ -354,13 +365,13 @@ An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
>    -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
>    -device cxl-upstream,bus=root_port0,id=us0 \
>    -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
> -  -device cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
> +  -device cxl-type3,bus=swport0,pmem=true,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
>    -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
> -  -device cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
> +  -device cxl-type3,bus=swport1,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
>    -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
> -  -device cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
> +  -device cxl-type3,bus=swport2,pmem=true,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
>    -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
> -  -device cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
> +  -device cxl-type3,bus=swport3,pmem=true,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
>    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
>  
>  Kernel Configuration Options
> diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> index bc1bb18844..3ed4dfeb69 100644
> --- a/hw/cxl/cxl-mailbox-utils.c
> +++ b/hw/cxl/cxl-mailbox-utils.c
> @@ -138,7 +138,7 @@ static ret_code cmd_firmware_update_get_info(struct cxl_cmd *cmd,
>      } QEMU_PACKED *fw_info;
>      QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50);
>  
> -    if (cxl_dstate->pmem_size < (256 << 20)) {
> +    if (cxl_dstate->mem_size < (256 << 20)) {

Huh.  I wonder why this check is here in the first place? It looks very odd: why
should we be checking the memory size in a firmware-update command?

Probably a cut-and-paste error from long ago.

>          return CXL_MBOX_INTERNAL_ERROR;
>      }
>  
> @@ -281,7 +281,7 @@ static ret_code cmd_identify_memory_device(struct cxl_cmd *cmd,
>  
>      CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
>      CXLType3Class *cvc = CXL_TYPE3_GET_CLASS(ct3d);
> -    uint64_t size = cxl_dstate->pmem_size;
> +    uint64_t size = cxl_dstate->mem_size;
>  
>      if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
>          return CXL_MBOX_INTERNAL_ERROR;
> @@ -290,11 +290,13 @@ static ret_code cmd_identify_memory_device(struct cxl_cmd *cmd,
>      id = (void *)cmd->payload;
>      memset(id, 0, sizeof(*id));
>  
> -    /* PMEM only */
> -    snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 0);
> +    /* Version 0: PMEM Only.  Version 1: PMEM and VMEM */
> +    snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 1);
>  
> -    id->total_capacity = size / (256 << 20);
> -    id->persistent_capacity = size / (256 << 20);
> +    size /= (256 << 20);
> +    id->total_capacity = size;
> +    id->persistent_capacity = ct3d->is_pmem ? size : 0;
> +    id->volatile_capacity = ct3d->is_pmem ? 0 : size;
>      id->lsa_size = cvc->get_lsa_size(ct3d);
>  
>      *len = sizeof(*id);
> @@ -312,16 +314,18 @@ static ret_code cmd_ccls_get_partition_info(struct cxl_cmd *cmd,
>          uint64_t next_pmem;
>      } QEMU_PACKED *part_info = (void *)cmd->payload;
>      QEMU_BUILD_BUG_ON(sizeof(*part_info) != 0x20);
> -    uint64_t size = cxl_dstate->pmem_size;
> +
> +    CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
> +    uint64_t size = cxl_dstate->mem_size;
>  
>      if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
>          return CXL_MBOX_INTERNAL_ERROR;
>      }
>  
> -    /* PMEM only */
> -    part_info->active_vmem = 0;
> +    size /= (256 << 20);
> +    part_info->active_vmem = ct3d->is_pmem ? 0 : size;
>      part_info->next_vmem = 0;
> -    part_info->active_pmem = size / (256 << 20);
> +    part_info->active_pmem = ct3d->is_pmem ? size : 0;
>      part_info->next_pmem = 0;
>  
>      *len = sizeof(*part_info);
> diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> index ada2108fac..18c5b9ff90 100644
> --- a/hw/mem/cxl_type3.c
> +++ b/hw/mem/cxl_type3.c
> @@ -111,7 +111,7 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
>          error_setg(errp, "memdev property must be set");
>          return false;
>      }
> -    memory_region_set_nonvolatile(mr, true);
> +    memory_region_set_nonvolatile(mr, ct3d->is_pmem);
>      memory_region_set_enabled(mr, true);
>      host_memory_backend_set_mapped(ct3d->hostmem, true);
>  
> @@ -123,7 +123,7 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
>      address_space_init(&ct3d->hostmem_as, mr, name);
>      g_free(name);
>  
> -    ct3d->cxl_dstate.pmem_size = ct3d->hostmem->size;
> +    ct3d->cxl_dstate.mem_size = ct3d->hostmem->size;
>  
>      if (!ct3d->lsa) {
>          error_setg(errp, "lsa property must be set");
> @@ -271,6 +271,7 @@ static void ct3d_reset(DeviceState *dev)
>  }
>  
>  static Property ct3_props[] = {
> +    DEFINE_PROP_BOOL("pmem", CXLType3Dev, is_pmem, false),
>      DEFINE_PROP_LINK("memdev", CXLType3Dev, hostmem, TYPE_MEMORY_BACKEND,
>                       HostMemoryBackend *),
>      DEFINE_PROP_LINK("lsa", CXLType3Dev, lsa, TYPE_MEMORY_BACKEND,
> @@ -278,6 +279,7 @@ static Property ct3_props[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +
>  static uint64_t get_lsa_size(CXLType3Dev *ct3d)
>  {
>      MemoryRegion *mr;
> @@ -338,10 +340,10 @@ static void ct3_class_init(ObjectClass *oc, void *data)
>      pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
>      pc->vendor_id = PCI_VENDOR_ID_INTEL;
>      pc->device_id = 0xd93; /* LVF for now */
> -    pc->revision = 1;
> +    pc->revision = 2;
Given the driver isn't checking this and your code change isn't breaking
older versions I would not update the revision for this.

>  
>      set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> -    dc->desc = "CXL PMEM Device (Type 3)";
> +    dc->desc = "CXL Memory Device (Type 3)";
>      dc->reset = ct3d_reset;
>      device_class_set_props(dc, ct3_props);
>  
> diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
> index 1e141b6621..67fc65f047 100644
> --- a/include/hw/cxl/cxl_device.h
> +++ b/include/hw/cxl/cxl_device.h
> @@ -117,8 +117,8 @@ typedef struct cxl_device_state {
>          uint64_t host_set;
>      } timestamp;
>  
> -    /* memory region for persistent memory, HDM */
> -    uint64_t pmem_size;
> +    /* memory region for persistent and volatile memory, HDM */
> +    uint64_t mem_size;
>  } CXLDeviceState;
>  
>  /* Initialize the register block for a device */
> @@ -235,6 +235,7 @@ struct CXLType3Dev {
>      PCIDevice parent_obj;
>  
>      /* Properties */
> +    bool is_pmem;
>      HostMemoryBackend *hostmem;
>      HostMemoryBackend *lsa;
>  
> diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
> index 2ebeb530b2..40c392056d 100644
> --- a/tests/qtest/bios-tables-test.c
> +++ b/tests/qtest/bios-tables-test.c
> @@ -1578,13 +1578,13 @@ static void test_acpi_q35_cxl(void)
>                               " -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1"
>                               " -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2"
>                               " -device cxl-rp,port=0,bus=cxl.1,id=rp1,chassis=0,slot=2"
> -                             " -device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1"
> +                             " -device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1,lsa=lsa1"
>                               " -device cxl-rp,port=1,bus=cxl.1,id=rp2,chassis=0,slot=3"
> -                             " -device cxl-type3,bus=rp2,memdev=cxl-mem2,lsa=lsa2"
> +                             " -device cxl-type3,bus=rp2,pmem=true,memdev=cxl-mem2,lsa=lsa2"
>                               " -device cxl-rp,port=0,bus=cxl.2,id=rp3,chassis=0,slot=5"
> -                             " -device cxl-type3,bus=rp3,memdev=cxl-mem3,lsa=lsa3"
> +                             " -device cxl-type3,bus=rp3,pmem=true,memdev=cxl-mem3,lsa=lsa3"
>                               " -device cxl-rp,port=1,bus=cxl.2,id=rp4,chassis=0,slot=6"
> -                             " -device cxl-type3,bus=rp4,memdev=cxl-mem4,lsa=lsa4"
> +                             " -device cxl-type3,bus=rp4,pmem=true,memdev=cxl-mem4,lsa=lsa4"
>                               " -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k,"
>                               "cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.targets.1=cxl.2,cxl-fmw.1.size=4G,cxl-fmw.1.interleave-granularity=8k",
>                               tmp_path, tmp_path, tmp_path, tmp_path,
> diff --git a/tests/qtest/cxl-test.c b/tests/qtest/cxl-test.c
> index cbe0fb549b..667e590c5f 100644
> --- a/tests/qtest/cxl-test.c
> +++ b/tests/qtest/cxl-test.c
> @@ -31,27 +31,40 @@
>  
>  #define QEMU_T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M " \
>                   "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
> -                 "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 "
> +                 "-device cxl-type3,bus=rp0,pmem=false,memdev=cxl-mem0," \
> +                  "lsa=lsa0,id=cxl-pmem0 "
> +
> +#define QEMU_T3DV "-object memory-backend-ram,id=cxl-mem0,size=256M " \
> +                  "-object memory-backend-ram,id=lsa0,size=256M "    \
> +                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
> +                   "lsa=lsa0,id=cxl-pmem0 "
> +
>  
>  #define QEMU_2T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M "    \
>                    "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
> -                  "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 " \
> +                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
> +                   "lsa=lsa0,id=cxl-pmem0 " \
>                    "-object memory-backend-file,id=cxl-mem1,mem-path=%s,size=256M "    \
>                    "-object memory-backend-file,id=lsa1,mem-path=%s,size=256M "    \
> -                  "-device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1,id=cxl-pmem1 "
> +                  "-device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1," \
> +                   "lsa=lsa1,id=cxl-pmem1 "
>  
>  #define QEMU_4T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M " \
>                    "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
> -                  "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 " \
> +                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
> +                   "lsa=lsa0,id=cxl-pmem0 " \
>                    "-object memory-backend-file,id=cxl-mem1,mem-path=%s,size=256M "    \
>                    "-object memory-backend-file,id=lsa1,mem-path=%s,size=256M "    \
> -                  "-device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1,id=cxl-pmem1 " \
> +                  "-device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1," \
> +                   "lsa=lsa1,id=cxl-pmem1 " \
>                    "-object memory-backend-file,id=cxl-mem2,mem-path=%s,size=256M "    \
>                    "-object memory-backend-file,id=lsa2,mem-path=%s,size=256M "    \
> -                  "-device cxl-type3,bus=rp2,memdev=cxl-mem2,lsa=lsa2,id=cxl-pmem2 " \
> +                  "-device cxl-type3,bus=rp2,pmem=true,memdev=cxl-mem2," \
> +                   "lsa=lsa2,id=cxl-pmem2 " \
>                    "-object memory-backend-file,id=cxl-mem3,mem-path=%s,size=256M "    \
>                    "-object memory-backend-file,id=lsa3,mem-path=%s,size=256M "    \
> -                  "-device cxl-type3,bus=rp3,memdev=cxl-mem3,lsa=lsa3,id=cxl-pmem3 "
> +                  "-device cxl-type3,bus=rp3,pmem=true,memdev=cxl-mem3," \
> +                   "lsa=lsa3,id=cxl-pmem3 "
>  
>  static void cxl_basic_hb(void)
>  {
> @@ -103,6 +116,19 @@ static void cxl_t3d(void)
>      qtest_end();
>  }
>  
> +static void cxl_t3d_vmem(void)
> +{
> +    g_autoptr(GString) cmdline = g_string_new(NULL);
> +    g_autofree const char *tmpfs = NULL;
> +
> +    tmpfs = g_dir_make_tmp("cxl-test-XXXXXX", NULL);
> +
> +    g_string_printf(cmdline, QEMU_PXB_CMD QEMU_RP QEMU_T3DV);
> +
> +    qtest_start(cmdline->str);
> +    qtest_end();
> +}
> +
>  static void cxl_1pxb_2rp_2t3d(void)
>  {
>      g_autoptr(GString) cmdline = g_string_new(NULL);
> @@ -145,6 +171,7 @@ int main(int argc, char **argv)
>      qtest_add_func("/pci/cxl/rp_x2", cxl_2root_port);
>  #ifdef CONFIG_POSIX
>      qtest_add_func("/pci/cxl/type3_device", cxl_t3d);
> +    qtest_add_func("/pci/cxl/type3_vmem_device", cxl_t3d_vmem);
>      qtest_add_func("/pci/cxl/rp_x2_type3_x2", cxl_1pxb_2rp_2t3d);
>      qtest_add_func("/pci/cxl/pxb_x2_root_port_x4_type3_x4", cxl_2pxb_4rp_4t3d);
>  #endif

Great to see tests covered. Those often get forgotten.

Jonathan
Jonathan Cameron Oct. 6, 2022, 8:50 a.m. UTC | #2
On Thu, 6 Oct 2022 09:45:57 +0100
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Wed,  5 Oct 2022 20:01:03 -0400
> Gourry <gourry.memverge@gmail.com> wrote:
> 
> > Type 3 devices were hard-coded to always present as persistent memory devices.
> > This patch adds the "is_pmem" attribute which can be used to instantiate
> > a device as either a pmem or vmem.  
> 
> Hi Gourry,
> 
> Great to see this.
> 
> Missing Signed-off-by so we can't apply this (no Developer Certificate of
> Origin).  Probably want your From address to match that, and I've no idea
> whether your full name is Gourry (apologies if it is!)
> 
> +CC linux-cxl as most folks who will be interested in this hang out there
> and are more likely to notice it there than buried on qemu-devel.
> Also +CC some others who will be interested in this discussion.
> 
> Please also cc the listed CXL QEMU maintainers (myself and Ben W) - see
> MAINTAINERS
> 
I forgot to ask.  How are you testing this?

One of the blockers for volatile support was that we had no means to poke
it properly as the kernel doesn't yet support volatile capacity and
no one has done the relevant work in EDK2 or similar to do it before the kernel boots.
There has been some work on EDK2 support for ARM N2 FVPs from 
Saanta Pattanayak, but not upstream yet.
https://lpc.events/event/16/contributions/1254/ 

Thanks,

Jonathan

> > 
> > Right now it is only possible to choose one or the other, but future
> > devices may present both (such as multi-logical devices with different
> > regions backed by different types of memory).  
> 
> Would it be possible to introduce this support in a fashion that allowed
> us to transition to devices presenting both without needing a change of
> interface?  With hindsight the memdev naming of the existing parameter is
> less than helpful, but one path to doing this would be to allow for
> an alternative memdev-volatile= parameter.  For now we can keep the patch
> small by only supporting either memdev or memdev-volatile
> 
> That way when we want to come along and add mixed devices later, we simply
> enable supporting the specification of two different memory backends.
> 
> The tricky bit then would be to support variable capacity (devices can
> support a region of memory which may be persistent or non persistent
> depending on settings - see Set Partition Info).
> 
> So my gut feeling is we may eventually allow all of:
> 
> 1) single memory backend for persistent or volatile.
> 2) two memory backends, covering a persistent region and a volatile region
>   (likely you may want actual non volatile backend for the persistent
>    region but not the volatile one).
> 3) single memory backend in which we can allow for the Set Partition Info stuff
>    to work.
> 
> As we move beyond just one backend that is presented as persistent we need
> to define the interface...
> 
> I can think of far too many options!  Let me lay out a few for comment.
> 
> A) Array of memory backends, separate control of configurable size
>    + starting persistent size, starting volatile size.
>    Default of one memory backend gives you current non-volatile only.
>    To get volatile only you provide one memory backend and set
>    persistent size to 0 and volatile size to size of memory backend.
> 
> B) Two memory backends, (later add some trickery around presented size)
>    memdev as defined is persistent memory (maybe we deprecate that interface
>    and make it memdev-persistent) memdev-volatile as volatile.
> 
>    Steps to add functionality.
>    1. Add your code to enable volatile but use 'which memdev is supplied' to
>       act as the flag.  For now, supplying both is an error.
>    2. Enable static volatile + non-volatile devices (both memdevs supplied).
>    3. (hackish) Add a control for the amount of partitionable capacity.  Cheat
>       and remove that capacity from both memdevs.  Changing the partition just
>       messes with how much of each memdev is used.  We'd need to be a little
>       careful to ensure we wipe data etc.
> 
> Your code mostly looks good to me, so I think discussion here is going to be
> all about the interface and how we build one more suitable for the future!
> 
> Jonathan
> 
> 
> 
> 
> > ---
> >  docs/system/devices/cxl.rst    | 31 ++++++++++++++++---------
> >  hw/cxl/cxl-mailbox-utils.c     | 24 +++++++++++---------
> >  hw/mem/cxl_type3.c             | 10 +++++----
> >  include/hw/cxl/cxl_device.h    |  5 +++--
> >  tests/qtest/bios-tables-test.c |  8 +++----
> >  tests/qtest/cxl-test.c         | 41 ++++++++++++++++++++++++++++------
> >  6 files changed, 82 insertions(+), 37 deletions(-)
> > 
> > diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst
> > index f25783a4ec..3a62d46e8a 100644
> > --- a/docs/system/devices/cxl.rst
> > +++ b/docs/system/devices/cxl.rst
> > @@ -300,7 +300,7 @@ Example topology involving a switch::
> >  
> >  Example command lines
> >  ---------------------
> > -A very simple setup with just one directly attached CXL Type 3 device::
> > +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
> >  
> >    qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
> >    ...
> > @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device::
> >    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
> >    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > +  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G  
> 
> Can't do this as it breaks any existing setup.  If a flag makes sense (I don't think
> it is flexible enough) then it would need to be a 'volatile' flag, so that it not
> being present corresponds to false and to the current situation.
> 
> > +
> > +A very simple setup with just one directly attached CXL Type 3 Volatile Memory device::
> > +
> > +  qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
> > +  ...
> > +  -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
> > +  -object memory-backend-ram,id=cxl-lsa1,share=on,size=256M \
> > +  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > +  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > +  -device cxl-type3,bus=root_port13,pmem=false,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> >    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
> >  
> >  A setup suitable for 4 way interleave. Only one fixed window provided, to enable 2 way
> > @@ -328,13 +339,13 @@ the CXL Type3 device directly attached (no switches).::
> >    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >    -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
> >    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> >    -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \
> > -  -device cxl-type3,bus=root_port14,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
> > +  -device cxl-type3,bus=root_port14,pmem=true,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
> >    -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \
> > -  -device cxl-type3,bus=root_port15,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
> > +  -device cxl-type3,bus=root_port15,pmem=true,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
> >    -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \
> > -  -device cxl-type3,bus=root_port16,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
> > +  -device cxl-type3,bus=root_port16,pmem=true,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
> >    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k
> >  
> >  An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
> > @@ -354,13 +365,13 @@ An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
> >    -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
> >    -device cxl-upstream,bus=root_port0,id=us0 \
> >    -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
> > -  -device cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
> > +  -device cxl-type3,bus=swport0,pmem=true,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
> >    -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
> > -  -device cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
> > +  -device cxl-type3,bus=swport1,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
> >    -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
> > -  -device cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
> > +  -device cxl-type3,bus=swport2,pmem=true,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
> >    -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
> > -  -device cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
> > +  -device cxl-type3,bus=swport3,pmem=true,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
> >    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
> >  
> >  Kernel Configuration Options
> > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> > index bc1bb18844..3ed4dfeb69 100644
> > --- a/hw/cxl/cxl-mailbox-utils.c
> > +++ b/hw/cxl/cxl-mailbox-utils.c
> > @@ -138,7 +138,7 @@ static ret_code cmd_firmware_update_get_info(struct cxl_cmd *cmd,
> >      } QEMU_PACKED *fw_info;
> >      QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50);
> >  
> > -    if (cxl_dstate->pmem_size < (256 << 20)) {
> > +    if (cxl_dstate->mem_size < (256 << 20)) {  
> 
> Huh.  I wonder why this check is here in the first place? It looks very odd: why
> should we be checking the memory size in a firmware-update command?
> 
> Probably a cut-and-paste error from long ago.
> 
> >          return CXL_MBOX_INTERNAL_ERROR;
> >      }
> >  
> > @@ -281,7 +281,7 @@ static ret_code cmd_identify_memory_device(struct cxl_cmd *cmd,
> >  
> >      CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
> >      CXLType3Class *cvc = CXL_TYPE3_GET_CLASS(ct3d);
> > -    uint64_t size = cxl_dstate->pmem_size;
> > +    uint64_t size = cxl_dstate->mem_size;
> >  
> >      if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
> >          return CXL_MBOX_INTERNAL_ERROR;
> > @@ -290,11 +290,13 @@ static ret_code cmd_identify_memory_device(struct cxl_cmd *cmd,
> >      id = (void *)cmd->payload;
> >      memset(id, 0, sizeof(*id));
> >  
> > -    /* PMEM only */
> > -    snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 0);
> > +    /* Version 0: PMEM Only.  Version 1: PMEM and VMEM */
> > +    snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 1);
> >  
> > -    id->total_capacity = size / (256 << 20);
> > -    id->persistent_capacity = size / (256 << 20);
> > +    size /= (256 << 20);
> > +    id->total_capacity = size;
> > +    id->persistent_capacity = ct3d->is_pmem ? size : 0;
> > +    id->volatile_capacity = ct3d->is_pmem ? 0 : size;
> >      id->lsa_size = cvc->get_lsa_size(ct3d);
> >  
> >      *len = sizeof(*id);
> > @@ -312,16 +314,18 @@ static ret_code cmd_ccls_get_partition_info(struct cxl_cmd *cmd,
> >          uint64_t next_pmem;
> >      } QEMU_PACKED *part_info = (void *)cmd->payload;
> >      QEMU_BUILD_BUG_ON(sizeof(*part_info) != 0x20);
> > -    uint64_t size = cxl_dstate->pmem_size;
> > +
> > +    CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
> > +    uint64_t size = cxl_dstate->mem_size;
> >  
> >      if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
> >          return CXL_MBOX_INTERNAL_ERROR;
> >      }
> >  
> > -    /* PMEM only */
> > -    part_info->active_vmem = 0;
> > +    size /= (256 << 20);
> > +    part_info->active_vmem = ct3d->is_pmem ? 0 : size;
> >      part_info->next_vmem = 0;
> > -    part_info->active_pmem = size / (256 << 20);
> > +    part_info->active_pmem = ct3d->is_pmem ? size : 0;
> >      part_info->next_pmem = 0;
> >  
> >      *len = sizeof(*part_info);
> > diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
> > index ada2108fac..18c5b9ff90 100644
> > --- a/hw/mem/cxl_type3.c
> > +++ b/hw/mem/cxl_type3.c
> > @@ -111,7 +111,7 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
> >          error_setg(errp, "memdev property must be set");
> >          return false;
> >      }
> > -    memory_region_set_nonvolatile(mr, true);
> > +    memory_region_set_nonvolatile(mr, ct3d->is_pmem);
> >      memory_region_set_enabled(mr, true);
> >      host_memory_backend_set_mapped(ct3d->hostmem, true);
> >  
> > @@ -123,7 +123,7 @@ static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
> >      address_space_init(&ct3d->hostmem_as, mr, name);
> >      g_free(name);
> >  
> > -    ct3d->cxl_dstate.pmem_size = ct3d->hostmem->size;
> > +    ct3d->cxl_dstate.mem_size = ct3d->hostmem->size;
> >  
> >      if (!ct3d->lsa) {
> >          error_setg(errp, "lsa property must be set");
> > @@ -271,6 +271,7 @@ static void ct3d_reset(DeviceState *dev)
> >  }
> >  
> >  static Property ct3_props[] = {
> > +    DEFINE_PROP_BOOL("pmem", CXLType3Dev, is_pmem, false),
> >      DEFINE_PROP_LINK("memdev", CXLType3Dev, hostmem, TYPE_MEMORY_BACKEND,
> >                       HostMemoryBackend *),
> >      DEFINE_PROP_LINK("lsa", CXLType3Dev, lsa, TYPE_MEMORY_BACKEND,
> > @@ -278,6 +279,7 @@ static Property ct3_props[] = {
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> >  
> > +
> >  static uint64_t get_lsa_size(CXLType3Dev *ct3d)
> >  {
> >      MemoryRegion *mr;
> > @@ -338,10 +340,10 @@ static void ct3_class_init(ObjectClass *oc, void *data)
> >      pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
> >      pc->vendor_id = PCI_VENDOR_ID_INTEL;
> >      pc->device_id = 0xd93; /* LVF for now */
> > -    pc->revision = 1;
> > +    pc->revision = 2;  
> Given the driver isn't checking this and your code change isn't breaking
> older versions I would not update the revision for this.
> 
> >  
> >      set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> > -    dc->desc = "CXL PMEM Device (Type 3)";
> > +    dc->desc = "CXL Memory Device (Type 3)";
> >      dc->reset = ct3d_reset;
> >      device_class_set_props(dc, ct3_props);
> >  
> > diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
> > index 1e141b6621..67fc65f047 100644
> > --- a/include/hw/cxl/cxl_device.h
> > +++ b/include/hw/cxl/cxl_device.h
> > @@ -117,8 +117,8 @@ typedef struct cxl_device_state {
> >          uint64_t host_set;
> >      } timestamp;
> >  
> > -    /* memory region for persistent memory, HDM */
> > -    uint64_t pmem_size;
> > +    /* memory region for persistent and volatile memory, HDM */
> > +    uint64_t mem_size;
> >  } CXLDeviceState;
> >  
> >  /* Initialize the register block for a device */
> > @@ -235,6 +235,7 @@ struct CXLType3Dev {
> >      PCIDevice parent_obj;
> >  
> >      /* Properties */
> > +    bool is_pmem;
> >      HostMemoryBackend *hostmem;
> >      HostMemoryBackend *lsa;
> >  
> > diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
> > index 2ebeb530b2..40c392056d 100644
> > --- a/tests/qtest/bios-tables-test.c
> > +++ b/tests/qtest/bios-tables-test.c
> > @@ -1578,13 +1578,13 @@ static void test_acpi_q35_cxl(void)
> >                               " -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1"
> >                               " -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2"
> >                               " -device cxl-rp,port=0,bus=cxl.1,id=rp1,chassis=0,slot=2"
> > -                             " -device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1"
> > +                             " -device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1,lsa=lsa1"
> >                               " -device cxl-rp,port=1,bus=cxl.1,id=rp2,chassis=0,slot=3"
> > -                             " -device cxl-type3,bus=rp2,memdev=cxl-mem2,lsa=lsa2"
> > +                             " -device cxl-type3,bus=rp2,pmem=true,memdev=cxl-mem2,lsa=lsa2"
> >                               " -device cxl-rp,port=0,bus=cxl.2,id=rp3,chassis=0,slot=5"
> > -                             " -device cxl-type3,bus=rp3,memdev=cxl-mem3,lsa=lsa3"
> > +                             " -device cxl-type3,bus=rp3,pmem=true,memdev=cxl-mem3,lsa=lsa3"
> >                               " -device cxl-rp,port=1,bus=cxl.2,id=rp4,chassis=0,slot=6"
> > -                             " -device cxl-type3,bus=rp4,memdev=cxl-mem4,lsa=lsa4"
> > +                             " -device cxl-type3,bus=rp4,pmem=true,memdev=cxl-mem4,lsa=lsa4"
> >                               " -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k,"
> >                               "cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.targets.1=cxl.2,cxl-fmw.1.size=4G,cxl-fmw.1.interleave-granularity=8k",
> >                               tmp_path, tmp_path, tmp_path, tmp_path,
> > diff --git a/tests/qtest/cxl-test.c b/tests/qtest/cxl-test.c
> > index cbe0fb549b..667e590c5f 100644
> > --- a/tests/qtest/cxl-test.c
> > +++ b/tests/qtest/cxl-test.c
> > @@ -31,27 +31,40 @@
> >  
> >  #define QEMU_T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M " \
> >                   "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
> > -                 "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 "
> > +                 "-device cxl-type3,bus=rp0,pmem=false,memdev=cxl-mem0," \
> > +                  "lsa=lsa0,id=cxl-pmem0 "
> > +
> > +#define QEMU_T3DV "-object memory-backend-ram,id=cxl-mem0,size=256M " \
> > +                  "-object memory-backend-ram,id=lsa0,size=256M "    \
> > +                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
> > +                   "lsa=lsa0,id=cxl-pmem0 "
> > +
> >  
> >  #define QEMU_2T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M "    \
> >                    "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
> > -                  "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 " \
> > +                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
> > +                   "lsa=lsa0,id=cxl-pmem0 " \
> >                    "-object memory-backend-file,id=cxl-mem1,mem-path=%s,size=256M "    \
> >                    "-object memory-backend-file,id=lsa1,mem-path=%s,size=256M "    \
> > -                  "-device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1,id=cxl-pmem1 "
> > +                  "-device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1," \
> > +                   "lsa=lsa1,id=cxl-pmem1 "
> >  
> >  #define QEMU_4T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M " \
> >                    "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
> > -                  "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 " \
> > +                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
> > +                   "lsa=lsa0,id=cxl-pmem0 " \
> >                    "-object memory-backend-file,id=cxl-mem1,mem-path=%s,size=256M "    \
> >                    "-object memory-backend-file,id=lsa1,mem-path=%s,size=256M "    \
> > -                  "-device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1,id=cxl-pmem1 " \
> > +                  "-device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1," \
> > +                   "lsa=lsa1,id=cxl-pmem1 " \
> >                    "-object memory-backend-file,id=cxl-mem2,mem-path=%s,size=256M "    \
> >                    "-object memory-backend-file,id=lsa2,mem-path=%s,size=256M "    \
> > -                  "-device cxl-type3,bus=rp2,memdev=cxl-mem2,lsa=lsa2,id=cxl-pmem2 " \
> > +                  "-device cxl-type3,bus=rp2,pmem=true,memdev=cxl-mem2," \
> > +                   "lsa=lsa2,id=cxl-pmem2 " \
> >                    "-object memory-backend-file,id=cxl-mem3,mem-path=%s,size=256M "    \
> >                    "-object memory-backend-file,id=lsa3,mem-path=%s,size=256M "    \
> > -                  "-device cxl-type3,bus=rp3,memdev=cxl-mem3,lsa=lsa3,id=cxl-pmem3 "
> > +                  "-device cxl-type3,bus=rp3,pmem=true,memdev=cxl-mem3," \
> > +                   "lsa=lsa3,id=cxl-pmem3 "
> >  
> >  static void cxl_basic_hb(void)
> >  {
> > @@ -103,6 +116,19 @@ static void cxl_t3d(void)
> >      qtest_end();
> >  }
> >  
> > +static void cxl_t3d_vmem(void)
> > +{
> > +    g_autoptr(GString) cmdline = g_string_new(NULL);
> > +    g_autofree const char *tmpfs = NULL;
> > +
> > +    tmpfs = g_dir_make_tmp("cxl-test-XXXXXX", NULL);
> > +
> > +    g_string_printf(cmdline, QEMU_PXB_CMD QEMU_RP QEMU_T3DV);
> > +
> > +    qtest_start(cmdline->str);
> > +    qtest_end();
> > +}
> > +
> >  static void cxl_1pxb_2rp_2t3d(void)
> >  {
> >      g_autoptr(GString) cmdline = g_string_new(NULL);
> > @@ -145,6 +171,7 @@ int main(int argc, char **argv)
> >      qtest_add_func("/pci/cxl/rp_x2", cxl_2root_port);
> >  #ifdef CONFIG_POSIX
> >      qtest_add_func("/pci/cxl/type3_device", cxl_t3d);
> > +    qtest_add_func("/pci/cxl/type3_vmem_device", cxl_t3d_vmem);
> >      qtest_add_func("/pci/cxl/rp_x2_type3_x2", cxl_1pxb_2rp_2t3d);
> >      qtest_add_func("/pci/cxl/pxb_x2_root_port_x4_type3_x4", cxl_2pxb_4rp_4t3d);
> >  #endif  
> 
> Great to see tests covered. Those often get forgotten.
> 
> Jonathan
>
Gregory Price Oct. 6, 2022, 3:52 p.m. UTC | #3
On Thu, Oct 06, 2022 at 09:50:07AM +0100, Jonathan Cameron wrote:
> On Thu, 6 Oct 2022 09:45:57 +0100
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> > Great to see this.
> > 
> > Missing Signed-off-by so we can't apply this (no Developer Certificate of
> > Origin).  Probably want your From address to match that, and I've no idea
> > whether your full name is Gourry (apologies if it is!)
> > 
> > +CC linux-cxl as most folks who will be interested in this hang out there
> > and are more likely to notice it there than buried on qemu-devel.
> > Also +CC some others who will be interested in this discussion.
> > 
> > Please also cc the listed CXL QEMU maintainers (myself and Ben W) - see
> > MAINTAINERS
> > 


Sorry about that, this is literally my first patch submission to the
kernel dev lists, and I'm still getting my environment set up (just
learning mutt as I type this).

Gourry is a nickname I go by, but I see pseudonyms are not particularly
welcome on the lists, so no worries.


In addition, I'm relatively new to the QEMU, CXL, and PCIe ecosystem, so
please excuse some of my ignorance of the overall state of things -
doing my best to pick things up as quickly as possible.


> I forgot to ask.  How are you testing this?
> 
> One of the blockers for volatile support was that we had no means to poke
> it properly as the kernel doesn't yet support volatile capacity and
> no one has done the relevant work in EDK2 or similar to do it before the kernel boots.
> There has been some work on EDK2 support for ARM N2 FVPs from 
> Saanta Pattanayak, but not upstream yet.
> https://lpc.events/event/16/contributions/1254/ 
> 
> Thanks,
> 
> Jonathan
>

Presently this is untested for exactly the reasons you described. I had
originally intended to shoot off a cover letter with this patch
discussing the issue, but I managed to muck that up... so let's just plop
that in here:


[PATCH RFC] hw/cxl: Volatile Type-3 CXL Devices

The idea behind this patch is simple: Not all type-3 devices are
persistent memory devices, so add a simple boolean (is_pmem) to the
attributes of type-3 devices and have the mailbox-utility report memory
accordingly (persistent_capacity if is_pmem, else volatile_capacity).

In an emulated system, there isn't really a difference between the
devices except for the backing storage (file vs ram), but having an
emulation mode for each is useful for the sake of testing fabric
manager/orchestrator corner cases.

At the moment this can be tested with `cxl list` and it will show
the ram capacity as requested, instead of pmem capacity.


There are a few issues with this patch set that I'm not able to discern
the right answer to just yet, and was hoping for some input:

1) The PCI device type is set prior to realize/attributes, and is
currently still set to PCI_CLASS_STORAGE_EXPRESS.  Should this instead
be PCI_CLASS_MEMORY_CXL when presenting as a simple memory expander?


1a) If so, that means we'll probably need to make type-3 into an abstract
class of devices and explicitly create devices that represent vmem and
pmem devices separately.  That could probably be done within cxl_type3.c
since most of the code will be shared anyway.

But this is confusing, because a pmem device can be made to operate in
volatile mode... so should it ALSO be assigned as PCI_CLASS_MEMORY_CXL,
and a volatile device simply be a pmem device with persistence turned
off? Many design questions here.


1b) If not, does this have an appreciable effect on system setup from
the perspective of the EFI/BIOS? It should be possible (see #2) to have
the EFI/BIOS set up memory expanders and just present them to the kernel
as additional available memory, as if more RAM were simply present.



2) EDK2 sets the memory area as reserved, and the memory is not
configured by the system as RAM.  I'm fairly sure EDK2 just doesn't
support this yet, but there's a chicken/egg problem.  If the device
isn't there, there's nothing to test against... if there's nothing to
test against, no one will write the support.  So I figure we should
kick-start the process (probably by getting it wrong on the first go around!)



3) Upstream Linux drivers haven't touched RAM configurations yet.  I
just confirmed this with Dan Williams yesterday on IRC.  My
understanding is that it's been worked on but nothing has been
upstreamed, in part because only a very small set of devices is
available to developers at the moment.


4) Should (is_pmem) be defaulted to 'true' (or invert and make the flag
(is_vmem)), and should pmem devices also present their memory as
volatile capacity?  I think they should, because all pmem devices should
be capable of running in a volatile mode.


> > 
> > Would it be possible to introduce this support in a fashion that allowed
> > us to transition to devices presenting both without needing a change of
> > interface?  With hindsight the memdev naming of the existing parameter is
> > less than helpful, but one path to doing this would be to allow for
> > an alternative memdev-volatile= parameter.  For now we can keep the patch
> > small by only supporting either memdev or memdev-volatile

My understanding of the current design is that the memdev backing itself
is irrelevant.  You can totally have a file memdev backing a volatile
region and a ram memdev backing a persistent region.

By adding more/alternative flags that can conflict with each other, we
will also create a boondoggle at initialization.

I had first considered making Type3Dev into an abstract device and
instantiating separate explicit pmem and vmem devices, but ultimately
that just looked like two empty structures wrapping Type3Dev and setting
the internal is_pmem boolean... lol.

> > 
> > That way when we want to come along and add mixed devices later, we simply
> > enable supporting the specification of two different memory backends.
> > 
> > The tricky bit then would be to support variable capacity (devices can
> > support a region of memory which may be persistent or non persistent
> > depending on settings - see Set Partition Info).
> > 
> > So my gut feeling is we may eventually allow all of:
> > 
> > 1) single memory backend for persistent or volatile.
> > 2) two memory backends, covering a persistent region and a volatile region
> >   (likely you may want actual non volatile backend for the persistent
> >    region but not the volatile one).
> > 3) single memory backend in which we can allow for the Set Partition Info stuff
> >    to work.
> >

Seems a little odd to use two memory backends.  Of what use is it to
software developers? It should be completely transparent to them, right?

The only thing I can think of is maybe reset mechanics for volatile
regions being set differently than persistent regions, but even then it
seems simple enough to just emulate the behavior and use a single
backing device.

> > As we move beyond just one backend that is presented as persistent we need
> > to define the interface...
> > 
> > I can think of far too many options!  Let me lay out a few for comment.
> > 
> > A) Array of memory backends, separate control of configurable size
> >    + starting persistent size, starting volatile size.
> >    Default of one memory backend gives you current non-volatile only.
> >    To get volatile only you provide one memory backend and set
> >    persistent size to 0 and volatile size to size of memory backend.
> > 
> > B) Two memory backends, (later add some trickery around presented size)
> >    memdev as defined is persistent memory (maybe we deprecate that interface
> >    and make it memdev-persistent) memdev-volatile as volatile.

I had considered this, but - at least in my head - you're describing a
multi-logical device.  The device we're looking at right now is a basic
Single-Logical Device which can be either persistent or volatile.

This goes back to my question:  Should we be designing this SLD as a
component which can slot into an MLD, or are there special features of
an MLD sub-device that aren't present on an SLD?

Basically I just don't see the need for multiple backends, as opposed to
multiple devices which are separately manageable.

If I'm misunderstanding something, please let me know.

> > 
> >    Steps to add functionality.
> >    1. Add your code to enable volatile but use 'which memdev is supplied' to
> >       act as the flag.  For now, supplying both is an error.
> >    2. Enable static volatile + non-volatile devices (both memdevs supplied).
> >    3. (hackish) Add a control for the amount of partitionable capacity.  Cheat
> >       and remove that capacity from both memdevs.  Changing the partition just
> >       messes with how much of each memdev is used.  We'd need to be a little
> >       careful to ensure we wipe data etc.
> > 
> > Your code mostly looks good to me, so I think discussion here is going to be
> > all about the interface and how we build one more suitable for the future!
> > 
> > Jonathan
> > 
> > 
> > 
> > 

Feels like I'm 



> > >  Example command lines
> > >  ---------------------
> > > -A very simple setup with just one directly attached CXL Type 3 device::
> > > +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
> > >  
> > >    qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
> > >    ...
> > > @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device::
> > >    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
> > >    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > >    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > > -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > > +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > > +  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G  
> > 
> > Can't do this as it breaks any existing setup.  If a flag makes sense (I don't think
> > it is flexible enough) then it would need to be a 'volatile' flag, so that it not
> > being present corresponds to false and to the current situation.
> > 

This would be another reason to simply default the pmem flag to true.
It would allow existing setups to continue as-is.
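
A minimal sketch of that default, if the bool property is kept at all rather
than being replaced by the memdev-volatile scheme discussed above:

    DEFINE_PROP_BOOL("pmem", CXLType3Dev, is_pmem, true),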

I can make this swap, but I think the above discussions were more
important to get started (which is why this is an RFC).



> > > diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
> > > index bc1bb18844..3ed4dfeb69 100644
> > > --- a/hw/cxl/cxl-mailbox-utils.c
> > > +++ b/hw/cxl/cxl-mailbox-utils.c
> > > @@ -138,7 +138,7 @@ static ret_code cmd_firmware_update_get_info(struct cxl_cmd *cmd,
> > >      } QEMU_PACKED *fw_info;
> > >      QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50);
> > >  
> > > -    if (cxl_dstate->pmem_size < (256 << 20)) {
> > > +    if (cxl_dstate->mem_size < (256 << 20)) {  
> > 
> > Huh.  I wonder why this check is here in the first place? It looks very odd: why
> > should we be checking the memory size in a firmware-update command?
> > 
> > Probably a cut-and-paste error from long ago.
> > 


I had actually wondered about all these 256MB alignment checks, and
whether they should actually be sunk into device creation rather
than strewn around.  I would have to go digging through the spec to see
whether there is explicit error handling at each level if a device
presents an out-of-spec size.

Something for another patch.
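
For illustration, sinking such a check into cxl_setup_memory() could look
roughly like this (a sketch only; whether the 256MiB multiple is actually
mandated, and what error the spec requires, would need checking first):

    if (!QEMU_IS_ALIGNED(ct3d->hostmem->size, 256 * MiB)) {
        error_setg(errp, "memdev size must be a multiple of 256MiB");
        return false;
    }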



> > > @@ -338,10 +340,10 @@ static void ct3_class_init(ObjectClass *oc, void *data)
> > >      pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
> > >      pc->vendor_id = PCI_VENDOR_ID_INTEL;
> > >      pc->device_id = 0xd93; /* LVF for now */
> > > -    pc->revision = 1;
> > > +    pc->revision = 2;  
> > Given the driver isn't checking this and your code change isn't breaking
> > older versions I would not update the revision for this.
> > 

Will drop that from the patch
Jonathan Cameron Oct. 6, 2022, 4:42 p.m. UTC | #4
On Thu, 6 Oct 2022 11:52:06 -0400
Gregory Price <gourry.memverge@gmail.com> wrote:

> On Thu, Oct 06, 2022 at 09:50:07AM +0100, Jonathan Cameron wrote:
> > On Thu, 6 Oct 2022 09:45:57 +0100
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >   
> > > Great to see this.
> > > 
> > > Missing Signed-off-by so we can't apply this (no Developer Certificate of
> > > Origin).  Probably want your From address to match that, and I've no idea
> > > whether your full name is Gourry (apologies if it is!)
> > > 
> > > +CC linux-cxl as most folks who will be interested in this hang out there
> > > and are more likely to notice it there than buried on qemu-devel.
> > > Also +CC some others who will be interested in this discussion.
> > > 
> > > Please also cc the listed CXL QEMU maintainers (myself and Ben W) - see
> > > MAINTAINERS
> > >   
> 
> 
> Sorry about that, this is my literal first patch submission to the
> kernel devlists ever, I'm still getting my environment set up (just
> learning mutt as I type this).

No problem and welcome!
> 
> Gourry is a nickname I go by, but I see pseudonyms are not particularly
> welcome on the lists, so no worries.

Fine to use nicknames on the list, but they make the lawyers get nervous
when it comes to tracking where code came from etc!

> 
> 
> In addition, I'm relatively new to the QEMU, CXL, and PCIe ecosystem, so
> please excuse some of my ignorance of the overall state of things -
> doing my best to pick things up as quickly as possible.
> 
> 
> > I forgot to ask.  How are you testing this?
> > 
> > One of the blockers for volatile support was that we had no means to poke
> > it properly as the kernel doesn't yet support volatile capacity and
> > no one has done the relevant work in EDK2 or similar to do it before the kernel boots.
> > There has been some work on EDK2 support for ARM N2 FVPs from 
> > Saanta Pattanayak, but not upstream eyt.
> > https://lpc.events/event/16/contributions/1254/ 
> > 
> > Thanks,
> > 
> > Jonathan
> >  
> 
> Presently this is untested for exactly the reasons you described. I had
> originally intended to shoot off a cover letter with this patch
> discussing the issue, but I managed to muck that up... so let's just plop
> that in here:
> 
> 
> [PATCH RFC] hw/cxl: Volatile Type-3 CXL Devices
> 
> The idea behind this patch is simple: Not all type-3 devices are
> persistent memory devices, so add a simple boolean (is_pmem) to the
> attributes of type-3 devices and have the mailbox-utility report memory
> accordingly (persistent_capacity if is_pmem, else volatile_capacity).
> 
> In an emulated system, there isn't really a difference between the
> devices except for the backing storage (file vs ram), but having an
> emulation mode for each is useful for the sake of testing fabric
> manager/orchestrator corner cases.

I'm not sure there is even a difference with respect to the backing store.
We can happily back volatile memory with a file and non-volatile memory
with ram.   Obviously that doesn't make a lot of sense, but it will 'work' :)
Ah you note this below...

> 
> At the moment this can be tested with `cxl list` and it will show
> the ram capacity as requested, instead of pmem capacity.
> 
> 
> There are a few issues with this patch set that i'm not able to discern
> the right answer to just yet, and was hoping for some input:
> 
> 1) The PCI device type is set prior to realize/attributes, and is
> currently still set to PCI_CLASS_STORAGE_EXPRESS.  Should this instead
> be PCI_CLASS_MEMORY_CXL when presenting as a simple memory expander?

We override this later in realize, but it is indeed a bit odd that we first set
it to the wrong thing. I guess that is really old code.  Nice thing to clean up.
I just tried it and setting it right in the first place + dropping where we
change it later works fine.

> 
> 1a) If so, that means we'll probably need to make type-3 into an abstract
> class of devices and explicitly create devices that represents vmem and
> pmem devices separately.  That could probably be done within cxl_type3.c
> since most of the code will be shared anyway.
> 
> But this is confusing, because a pmem device can be made to operate in
> volatile mode... so should it ALSO be assigned as PCI_CLASS_MEMORY_CXL,
> and a volatile device simply be a pmem device with persistence turned
> off? Many design questions here.

We definitely don't want to have separate device types. Whether they
have volatile or non volatile or both should just be a configuration option.

> 
> 
> 1b) If not, does this have an appreciable affect on system-setup from
> the perspective of the EFI/Bios? It should be possible, (see #2) to have
> the EFI/bios setup memory expanders and just present them to the kernel
> as additional, available memory as if more ram is simply present.

Absolutely - that's both what I understand the ARM EDK2 patches do
and what is suggested in the software guide that intel folk wrote (note
we treat that as guidance only rather than a spec, but it is useful).
Linked to from this page:
https://www.intel.co.uk/content/www/uk/en/io/pci-express/pci-express-architecture-devnet-resources.html


> 
> 
> 
> 2) EDK2 sets the memory area as a reserved, and the memory is not
> configured by the system as ram.  I'm fairly sure edk2 just doesn't
> support this yet, but there's a chicken/egg problem.  If the device
> isn't there, there's nothing to test against... if there's nothing to
> test against, no one will write the support.  So I figure we should kick
> start the process (probably by getting it wrong on the first go around!)

Yup, if the BIOS left it alone, OS drivers need to treat it the same way
they would deal with hotplugged memory.  Note my strong suspicion is there
will be host vendors who won't ever handle volatile CXL memory in firmware.
They will just let the OS bring it up after boot. As long as you have DDR
as well on the system that will be fine.  Means there is one code path
to verify rather than two.  Not everyone will care about legacy OS support.


> 
> 3) Upstream linux drivers haven't touched ram configurations yet.  I
> just configured this with Dan Williams yesterday on IRC.  My
> understanding is that it's been worked on but nothing has been
> upstreamed, in part because there are only a very small set of devices
> available to developers at the moment.

There was an offer of similar volatile-memory QEMU emulation in the
session on QEMU CXL at Linux Plumbers.  That will look something like what
you have here, and maybe reflects that someone has hardware as well...

> 
> 
> 4) Should (is_pmem) be defaulted to 'true' (or invert and make the flag
> (is_vmem)), and should pmem devices also present their memory as
> volatile capacity?  I think they should, because all pmem devices should
> be capable of running in a volatile mode.

Disagree.  Running in volatile mode implies a bunch of security guarantees
(no retention over reboots for starters).
The OS must be able to use them in volatile mode, but that's an OS decision.
For now, we bring them up as DAX and then you can rebind them with the kmem
driver to treat them as volatile.

It's definitely not a spec requirement for a non volatile device to
be configured to present as volatile (though it is an option via the partition
commands).

> 
> 
> > > 
> > > Would it be possible to introduce this support in a fashion that allowed
> > > us to transition to devices presenting both without needing a change of
> > > interface?  With hindsight the memdev naming of the existing parameter is
> > > less than helpful, but one path to doing this would be to allow for
> > > an alternative memdev-volatile= parameter.  For now we can keep the patch
> > > small by only supporting either memdev or memdev-volatile  
> 
> My understanding of the current design is that the memdev backing itself
> is irrelevant.  You can totally have file-memdevs backing a volatile
> region and a ram-memdev backing a persistent region.

True, but a definite 'usecase' would be to back persistent memory with persistent
backend and volatile with volatile.

> 
> By adding more/alternative flags that can conflict with each other, we
> will also create a boondoggle on initialization.
> 
> I had first considered making Type3Dev into an abstract device and
> instantiating separate explicit pmem and vmem devices, but ultimately
> that just looked like 2 empty structures wrapping Type3Dev and setting
> the internal is_pmem boolean... lol.

Definitely don't do that.  A single device needs to support a mixture as
that's what the spec allows and people will probably build.

> 
> > > 
> > > That way when we want to come along and add mixed devices later, we simply
> > > enable supporting the specification of two different memory backends.
> > > 
> > > The tricky bit then would be to support variable capacity (devices can
> > > support a region of memory which may be persistent or non persistent
> > > depending on settings - see Set Partition Info).
> > > 
> > > So my gut feeling is we may eventually allow all of:
> > > 
> > > 1) single memory backend for persistent or volatile.
> > > 2) two memory backends, covering a persistent region and a volatile region
> > >   (likely you may want actual non volatile backend for the persistent
> > >    region but not the volatile one).
> > > 3) single memory backend in which we can allow for the Set Partition Info stuff
> > >    to work.
> > >  
> 
> Seems a little odd to use two memory backends.  Of what use is it to the
> software developers, it should be completely transparent to them, right?
> 
> The only thing I can think of is maybe reset mechanics for volatile
> regions being set differently than persistent regions, but even then it
> seems simple enough to just emulate the behavior and use a single
> backing device.

It's a very convenient path as it lets us define sizes and things from the
actual memdev rather than duplicating all the configuration in multiple
devices.  If it weren't for the ability to change the allocations at runtime
I'd definitely say this was the best path.  That corner makes it complex,
but I'd still like the simplicity of throwing a backend at the device and
setting up all the description on the basis that the backend is what we want
to use.

> 
> > > As we move beyond just one backend that is presented as persistent we need
> > > to define the interface...
> > > 
> > > I can think of far too many options!  Let me lay out a few for comment.
> > > 
> > > A) Array of memory backends, separate control of configurable size
> > >    + starting persistent size, starting volatile size.
> > >    Default of one memory backend gives you current non-volatile only.
> > >    To get volatile only you provide one memory backend and set
> > >    persistent size to 0 and volatile size to size of memory backend.
> > > 
> > > B) Two memory backends, (later add some trickery around presented size)
> > >    memdev as defined is persistent memory (maybe we deprecate that interface
> > >    and make it memdev-persistent) memdev-volatile as volatile.  
> 
> I had considered this, but - at least in my head - you're describing a
> multi-logical device.  The device we're looking at right now is a basic
> Single-Logical Device which can be either persistent or volatile.

Not the same as an MLD.  This is just representing a normal
SLD with a mixture of volatile and non volatile capacity.
MLDs are a whole lot more complex - on the todo list...

> 
> This goes back to my question:  Should we be designing this SLD as a
> component which can slot into an MLD, or are there special features of
> a MLD sub-device that aren't present on an SLD?

They are different things.
An SLD with a mixture of memory types is fine and can be really dumb.
Imagine a controller with multi purpose slots similar to some of the ones
on the Intel systems that take either NVDIMMS or normal DIMMS.  Depending
on what is plugged in, the amount of each memory type available changes.

MLDs look just like an SLD from the host's perspective as well. They become
interesting only once we have emulation of the FM-API etc. and can start moving memory around.
That's something I'm working on but a lot of parts are needed to make that
work fully.

> 
> Basically I just don't see the need for multiple backends, as opposed to
> multiple devices which are separately manageable.

Depends on what you want to do.  The aim here is to emulate devices that 'might'
exist. A combined device is allowed by the spec, so we need a path to
emulate that. No need to do it for now, but the interface defined needs to
extend to that case. Obviously a lot of code paths can be tested without that
using a patch similar to yours. Just not all of them.
A single flag is not sufficiently extensible.  My aim here was to describe
possible paths that are. Note however we only need to emulate some of it
today - just with an interface that allows other stuff to work later!


> 
> If I'm misunderstanding something, please let me know.
> 
> > > 
> > >    Steps to add functionality.
> > >    1. Add your code to enable volatile but use 'which memdev is supplied' to
> > >       act as the flag.  For now supply both is an error.
> > >    2. Enable static volatile + non-volatile devices (both memdevs supplied).
> > >    3. (hackish) Add a control for amount of partionable capacity.  Cheat
> > >       and remove that capacity from both memdevs.  Changing partition just
> > >       messes without how much of each memdev is used.  We'd need to be a little
> > >       careful to ensure we wipe data etc.
> > > 
> > > You code mostly looks good to me so I think discussion here is going to be
> > > all about the interface and how we build one more suitable for the future!
> > > 
> > > Jonathan
> > > 
> > > 
> > > 
> > >   
> 
> Feels like I'm 
?

> 
> 
> 
> > > >  Example command lines
> > > >  ---------------------
> > > > -A very simple setup with just one directly attached CXL Type 3 device::
> > > > +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
> > > >  
> > > >    qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
> > > >    ...
> > > > @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device::
> > > >    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
> > > >    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > >    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > > > -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > > > +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > > > +  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G    
> > > 
> > > Can't do this as it breaks any existing setup.  If a flag makes sense (I don't think
> > > it is flexible enough) then would need to be volatile so it not being present
> > > will correspond to false and current situation.
> > >   
> 
> this would be another reason to simply default the pmem flag to true.
> It would allow existing setups to continue as-is.
> 
> I can make this swap, but i think the above discussions were more
> important to get started (and why this is an RFC).
> 
Agreed.  I'm in the process of rebasing all the other stuff we have out of tree.
Once that's done (next week probably) I can carry this on top and that will
give us something to iterate on + develop the Linux / EDK2 side against.


Jonathan
Gregory Price Oct. 6, 2022, 5:29 p.m. UTC | #5
On Thu, Oct 06, 2022 at 05:42:14PM +0100, Jonathan Cameron wrote:
> > 
> > 1) The PCI device type is set prior to realize/attributes, and is
> > currently still set to PCI_CLASS_STORAGE_EXPRESS.  Should this instead
> > be PCI_CLASS_MEMORY_CXL when presenting as a simple memory expander?
> 
> We override this later in realize but indeed a bit odd that we first set
> it to the wrong thing. I guess that is really old code.  Nice thing to clean up.
> I just tried it and setting it right in the first place + dropping where we
> change it later works fine.
> 

I'll add it as a pullout patch ahead of my next revision.

/**** snip - skipping ahead for the sake of brevity ****/


I was unaware that an SLD could be composed of multiple regions of both
persistent and volatile memory.  I was under the impression that it could
contain only one type of memory.  Of course that makes sense in the case
of a memory expander that simply lets you plug DIMMs in *facepalm*

I see the reason to have separate backends in this case.

The reason to allow an array of backing devices is if we believe each
individual DIMM plugged into a memory expander is likely to show up as a
(configurable?) individual NUMA node, or if it's all likely to get
classified as one NUMA node.

Maybe we should consider 2 new options:
--persistent-memdevs=pm1 pm2 pm3
--volatile-memdevs=vm1 vm2 vm3

etc, and deprecate --memdev, and go with your array of memdevs idea.

I think I could probably whip that up in a day or two.  Thoughts?
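
Roughly, a sketch of what that might look like on the command line (the
option names are only the proposal above, and the list separator here is
made up, so treat this as illustrative rather than an existing interface):

  -object memory-backend-ram,id=vm0,size=256M \
  -object memory-backend-ram,id=vm1,size=256M \
  -object memory-backend-file,id=pm0,share=on,mem-path=/tmp/cxl-pm0,size=256M \
  -device cxl-type3,bus=root_port13,volatile-memdevs=vm0:vm1,persistent-memdevs=pm0,id=cxl-dev0 \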



> > 
> > 2) EDK2 sets the memory area as a reserved, and the memory is not
> > configured by the system as ram.  I'm fairly sure edk2 just doesn't
> > support this yet, but there's a chicken/egg problem.  If the device
> > isn't there, there's nothing to test against... if there's nothing to
> > test against, no one will write the support.  So I figure we should kick
> > start the process (probably by getting it wrong on the first go around!)
> 
> Yup, if the bios left it alone, OS drivers need to treat it the same as
> they would deal with hotplugged memory.  Note my strong suspicion is there
> will be host vendors who won't ever handle volatile CXL memory in firmware.
> They will just let the OS bring it up after boot. As long as you have DDR
> as well on the system that will be fine.  Means there is one code path
> to verify rather than two.  Not everyone will care about legacy OS support.
> 

Presently I'm failing to bring up a region of memory even when this is
set to persistent (even on the upstream configuration).  The kernel is
failing set_size because the region is already in use.

I can't tell if this is a driver error or because EDK2 is marking the
region as reserved.

relevant boot output:
[    0.000000] BIOS-e820: [mem 0x0000000290000000-0x000000029fffffff] reserved
[    1.229097] acpi ACPI0016:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
[    1.244082] acpi ACPI0016:00: _OSC: OS supports [CXL20PortDevRegAccess CXLProtocolErrorReporting CXLNativeHotPlug]
[    1.261245] acpi ACPI0016:00: _OSC: platform does not support [LTR DPC]
[    1.272347] acpi ACPI0016:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability]
[    1.286092] acpi ACPI0016:00: _OSC: OS now controls [CXLMemErrorReporting]

The device is otherwise available for use

cli output
# cxl list
[
  {
    "memdev":"mem0",
    "pmem_size":268435456,
    "ram_size":0,
    "serial":0,
    "host":"0000:35:00.0"
  }
]

but it fails to setup correctly

cxl create-region -m -d decoder0.0 -w 1 -g 256 mem0
cxl region: create_region: region0: set_size failed: Numerical result out of range
cxl region: cmd_create_region: created 0 regions

I tracked this down to this part of the kernel:

kernel/resource.c

static struct resource *get_free_mem_region(...)
{
	... snip ...
	enumerate regions, fail to find a useable region
	... snip ...
	return ERR_PTR(-ERANGE);
}

but I'm not sure what to do with this info.  We have some proof that
real hardware works with this without problems, and the only difference
is that the EFI/BIOS/firmware is setting the memory regions as `usable`
or `soft reserved`, which would imply that EDK2 is the blocker here
regardless of the OS driver status.

But I'd seen elsewhere that you had gotten some of this working, and I'm
failing to get anything working at the moment.  If you have any input I
would greatly appreciate the help.

QEMU config:

/opt/qemu-cxl2/bin/qemu-system-x86_64 \
-drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=d\
-m 2G,slots=4,maxmem=4G \
-smp 4 \
-machine type=q35,accel=kvm,cxl=on \
-enable-kvm \
-nographic \
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
-device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
-object memory-backend-file,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=256M \
-object memory-backend-file,id=lsa0,mem-path=/tmp/cxl-lsa0,size=256M \
-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=256M

I'd seen on the lists that you had hit issues with single-rp setups,
but no combination of configuration I've tried (including all the ones
in the docs and tests) leads to a successful region creation with
`cxl create-region`.

> > 
> > 3) Upstream linux drivers haven't touched ram configurations yet.  I
> > just configured this with Dan Williams yesterday on IRC.  My
> > understanding is that it's been worked on but nothing has been
> > upstreamed, in part because there are only a very small set of devices
> > available to developers at the moment.
> 
> There was an offer of similar volatile memory QEMU emulation in the
> session on QEMU CXL at Linux Plumbers.  That will look something like you have
> here and maybe reflects that someone has hardware as well...
> 

I saw that, and I figured I'd start the conversation by pushing
something :].
Gregory Price Oct. 7, 2022, 2:50 p.m. UTC | #6
Now that I've had some time to look at the spec, the DVSEC CXL Capability
register (8.1.3.1 in the 3.0 spec) only supports enabling two HDM ranges at
the moment, which to me means we should implement

memdev0=..., memdev1=...

Yesterday I pushed a patch proposal that separated the regions into
persistent and volatile, but I had discovered that there isn't a good way
(presently) to determine whether a MemoryBacking object is both file-backed
AND has pmem=true, meaning we'd have to either expose that interface
(this seems dubious to me) or do the following:

memdev0=mem0,memdev0-pmem=true,memdev1=mem1,memdev1-pmem=false

and then simply store a bit for each hostmem region accordingly to report
pmem/volatile.  This would allow the flexibility for the backing device to
be whatever the user wants while still being able to mark the behavior of
the type-3 device as pmem or volatile.  Alternatively we could make
`memdev0-type=` and allow for new types in the future, I suppose.

The one thing I'm a bit confused about is the Media_type and Memory_Class
fields in the DVSEC CXL Range registers (8.1.3.8.2).  Right now they're set
to "010b" = "Memory Characteristics are communicated via CDAT".  Do the
devices presently emulate this?  I'm finding it hard to pick apart the code
to identify it.

On Thu, Oct 6, 2022 at 1:30 PM Gregory Price <gourry.memverge@gmail.com>
wrote:

> On Thu, Oct 06, 2022 at 05:42:14PM +0100, Jonathan Cameron wrote:
> > >
> > > 1) The PCI device type is set prior to realize/attributes, and is
> > > currently still set to PCI_CLASS_STORAGE_EXPRESS.  Should this instead
> > > be PCI_CLASS_MEMORY_CXL when presenting as a simple memory expander?
> >
> > We override this later in realize but indeed a bit odd that we first set
> > it to the wrong thing. I guess that is really old code.  Nice thing to
> clean up.
> > I just tried it and setting it right in the first place + dropping where
> we
> > change it later works fine.
> >
>
> I'll add it as a pullout patch ahead of my next revision.
>
> /**** snip - skipping ahead for the sake of brevity ****/
>
>
> I was unaware that an SLD could be comprised of multiple regions
> of both persistent and volatile memory.  I was under the impression that
> it could only be one type of memory.  Of course that makes sense in the
> case of a memory expander that simply lets you plug DIMMs in *facepalm*
>
> I see the reason to have separate backends in this case.
>
> The reason to allow an array of backing devices is if we believe each
> individual DIMM plugged into a memexpander is likely to show up as
> (configurably?) individual NUMA nodes, or if it's likely to get
> classified as one numa node.
>
> Maybe we should consider 2 new options:
> --persistent-memdevs=pm1 pm2 pm3
> --volatile-memdevs=vm1 vm2 vm3
>
> etc, and deprecate --memdev, and go with your array of memdevs idea.
>
> I think I could probably whip that up in a day or two.  Thoughts?
>
>
>
> > >
> > > 2) EDK2 sets the memory area as a reserved, and the memory is not
> > > configured by the system as ram.  I'm fairly sure edk2 just doesn't
> > > support this yet, but there's a chicken/egg problem.  If the device
> > > isn't there, there's nothing to test against... if there's nothing to
> > > test against, no one will write the support.  So I figure we should
> kick
> > > start the process (probably by getting it wrong on the first go
> around!)
> >
> > Yup, if the bios left it alone, OS drivers need to treat it the same as
> > they would deal with hotplugged memory.  Note my strong suspicion is
> there
> > will be host vendors who won't ever handle volatile CXL memory in
> firmware.
> > They will just let the OS bring it up after boot. As long as you have DDR
> > as well on the system that will be fine.  Means there is one code path
> > to verify rather than two.  Not everyone will care about legacy OS
> support.
> >
>
> Presently i'm failing to bring up a region of memory even when this is
> set to persistent (even on upstream configuration).  The kernel is
> presently failing to set_size because the region is used.
>
> I can't tell if this is a driver error or because EDK2 is marking the
> region as reserved.
>
> relevant boot output:
> [    0.000000] BIOS-e820: [mem 0x0000000290000000-0x000000029fffffff]
> reserved
> [    1.229097] acpi ACPI0016:00: _OSC: OS supports [ExtendedConfig ASPM
> ClockPM Segments MSI EDR HPX-Type3]
> [    1.244082] acpi ACPI0016:00: _OSC: OS supports [CXL20PortDevRegAccess
> CXLProtocolErrorReporting CXLNativeHotPlug]
> [    1.261245] acpi ACPI0016:00: _OSC: platform does not support [LTR DPC]
> [    1.272347] acpi ACPI0016:00: _OSC: OS now controls [PCIeHotplug
> SHPCHotplug PME AER PCIeCapability]
> [    1.286092] acpi ACPI0016:00: _OSC: OS now controls
> [CXLMemErrorReporting]
>
> The device is otherwise available for use
>
> cli output
> # cxl list
> [
>   {
>     "memdev":"mem0",
>     "pmem_size":268435456,
>     "ram_size":0,
>     "serial":0,
>     "host":"0000:35:00.0"
>   }
> ]
>
> but it fails to setup correctly
>
> cxl create-region -m -d decoder0.0 -w 1 -g 256 mem0
> cxl region: create_region: region0: set_size failed: Numerical result out
> of range
> cxl region: cmd_create_region: created 0 regions
>
> I tracked this down to this part of the kernel:
>
> kernel/resource.c
>
> static struct resource *get_free_mem_region(...)
> {
>         ... snip ...
>         enumerate regions, fail to find a useable region
>         ... snip ...
>         return ERR_PTR(-ERANGE);
> }
>
> but i'm not sure of what to do with this info.  We have some proof
> that real hardware works with this no problem, and the only difference
> is that the EFI/bios/firmware is setting the memory regions as `usable`
> or `soft reserved`, which would imply the EDK2 is the blocker here
> regardless of the OS driver status.
>
> But I'd seen elsewhere you had gotten some of this working, and I'm
> failing to get anything working at the moment.  If you have any input i
> would greatly appreciate the help.
>
> QEMU config:
>
> /opt/qemu-cxl2/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=d\
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=256M \
> -object memory-backend-file,id=lsa0,mem-path=/tmp/cxl-lsa0,size=256M \
> -device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=256M
>
> I'd seen on the lists that you had seen issues with single-rp setups,
> but no combination of configuration I've tried (including all the ones
> in the docs and tests) lead to a successful region creation with
> `cxl create-region`
>
> > >
> > > 3) Upstream linux drivers haven't touched ram configurations yet.  I
> > > just configured this with Dan Williams yesterday on IRC.  My
> > > understanding is that it's been worked on but nothing has been
> > > upstreamed, in part because there are only a very small set of devices
> > > available to developers at the moment.
> >
> > There was an offer of similar volatile memory QEMU emulation in the
> > session on QEMU CXL at Linux Plumbers.  That will look something like
> you have
> > here and maybe reflects that someone has hardware as well...
> >
>
> I saw that, and I figured I'd start the conversation by pushing
> something :].
>
Davidlohr Bueso Oct. 7, 2022, 6:16 p.m. UTC | #7
On Thu, 06 Oct 2022, Jonathan Cameron wrote:
>> 3) Upstream linux drivers haven't touched ram configurations yet.  I
>> just configured this with Dan Williams yesterday on IRC.  My
>> understanding is that it's been worked on but nothing has been
>> upstreamed, in part because there are only a very small set of devices
>> available to developers at the moment.
>
>There was an offer of similar volatile memory QEMU emulation in the
>session on QEMU CXL at Linux Plumbers.  That will look something like you have
>here and maybe reflects that someone has hardware as well...

Yeah, putting this back together was on my todo list, but happy to see
the patches are out. Recollecting my thoughts on this, my original approach
was also to support only volatile or persistent capacities, but through
two backends, and thus two address spaces. Afaik the last idea that was
discussed on IRC in this regard was to do it with a single backend along
with a pmem_offset=N boundary (0 or 100%, for example, for one type or the
other) tunable.
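
A minimal sketch of that idea, assuming a hypothetical pmem_offset property
(it does not exist today), would be something like:

  -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxl-mem0,size=1G \
  -device cxl-type3,bus=root_port13,memdev=cxl-mem0,pmem_offset=512M,id=cxl-dev0 \

where capacity on one side of the offset is presented as volatile and the
rest as persistent, with 0 or the full backend size giving a single-type
device.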

...

>> Seems a little odd to use two memory backends.  Of what use is it to the
>> software developers, it should be completely transparent to them, right?
>>
>> The only thing I can think of is maybe reset mechanics for volatile
>> regions being set differently than persistent regions, but even then it
>> seems simple enough to just emulate the behavior and use a single
>> backing device.
>
>It's a very convenient path as lets us define sizes and things from the
>actual memdev rather than duplicating all the configuration in multiple
>devices.  If it weren't for the ability to change the allocations at runtime
>I'd definitely say this was the best path.  That corner makes it complex
>but I'd still like the simplicity of you throw a backend at the device
>and we set up all the description on basis that backend is what we want
>to use.

Agreed.

...

>> > > >  Example command lines
>> > > >  ---------------------
>> > > > -A very simple setup with just one directly attached CXL Type 3 device::
>> > > > +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
>> > > >
>> > > >    qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
>> > > >    ...
>> > > > @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device::
>> > > >    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
>> > > >    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>> > > >    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
>> > > > -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
>> > > > +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \

So regardless of the interface we end up with, volatile and lsa parameters
should be mutually exclusive.

Thanks,
Davidlohr
Gregory Price Oct. 7, 2022, 6:46 p.m. UTC | #8
On Fri, Oct 07, 2022 at 11:16:19AM -0700, Davidlohr Bueso wrote:
> 
> Yeah, putting this back together was on my todo list, but happy to see
> patches are out. Recollecting my thoughts on this, my original approach
> was also to support only volatile or persistent capacities, but through
> two backends, and thus two address spaces. Afaik the last idea that was
> discussed on IRC in this regard was to do it with a single backend along
> with a pmem_offset=N boundary (0 or 100% for example for one type or the
> other) tunnable.
> 

This makes sense.  Referencing another message I sent: are the region
areas in the DVSECs an artifact from CXL 1.x? They suggest only two
regions are supported.  Was this overridden by the introduction of CDAT
fields that describe the memory layout?

(sorry, just trying to put together the puzzle pieces here, jumping in a
bit late to the party).

> > > > > >  Example command lines
> > > > > >  ---------------------
> > > > > > -A very simple setup with just one directly attached CXL Type 3 device::
> > > > > > +A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
> > > > > >
> > > > > >    qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
> > > > > >    ...
> > > > > > @@ -308,7 +308,18 @@ A very simple setup with just one directly attached CXL Type 3 device::
> > > > > >    -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
> > > > > >    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > > > > >    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > > > > > -  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> > > > > > +  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
> 
> So regardless of the interface we end up with, volatile and lsa parameters
> should be mutually exclusive.
> 

Spec says that volatile devices `may` implement an lsa.

Get LSA (Opcode 4102h)
The Label Storage Area (LSA) shall be supported by a memory device
that provides persistent memory capacity and may be supported by a
device that provides only volatile memory capacity. The format of
the LSA is specified in Section 9.13.2. The size of the Label Storage
Area is retrieved from the Identify Memory Device command.
Davidlohr Bueso Oct. 7, 2022, 7:52 p.m. UTC | #9
On Thu, 06 Oct 2022, Jonathan Cameron wrote:

>One of the blockers for volatile support was that we had no means to poke
>it properly as the kernel doesn't yet support volatile capacity and
>no one has done the relevant work in EDK2 or similar to do it before the kernel boots.
>There has been some work on EDK2 support for ARM N2 FVPs from
>Saanta Pattanayak, but not upstream eyt.
>https://lpc.events/event/16/contributions/1254/

fwiw I had been trying to build up some of the firmware boot support for the
ACPI tables that are specific to volatile capacity (SRAT/SLIT, HMAT and the
EFI Memory Map) via parameters, but it quickly got out of hand. For example,
the SRAT could use a passed 'node' and have a cxl_build_srat(), etc. But yeah,
it would be nice for the EDK2 work to advance on the x86 end.

Thanks,
Davidlohr

diff --git a/hw/acpi/cxl.c b/hw/acpi/cxl.c
index 2bf8c0799359..1c3c6d17c222 100644
--- a/hw/acpi/cxl.c
+++ b/hw/acpi/cxl.c
@@ -254,3 +255,46 @@ void build_cxl_osc_method(Aml *dev)
+static int cxl_device_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_CXL_TYPE3)) {
+        *list = g_slist_append(*list, DEVICE(obj));
+    }
+
+    object_child_foreach(obj, cxl_device_list, opaque);
+    return 0;
+}
+
+static GSList *cxl_get_device_list(void)
+{
+    GSList *list = NULL;
+
+    object_child_foreach(qdev_get_machine(), cxl_device_list, &list);
+    return list;
+}
+
+void cxl_build_srat(GArray *table_data)
+{
+    GSList *device_list, *list = cxl_get_device_list();
+
+    for (device_list = list; device_list; device_list = device_list->next) {
+        DeviceState *dev = device_list->data;
+        Object *obj = OBJECT(dev);
+        CXLType3Dev *ct3d = CXL_TYPE3(dev);
+        MemoryRegion *mr;
+        int node;
+
+        mr = host_memory_backend_get_memory(ct3d->hostmem);
+        if (!mr) {
+            continue;
+        }
+        node = object_property_get_uint(obj, "node", &error_abort);
+
+        build_srat_memory(table_data, mr->addr, mr->size, node, MEM_AFFINITY_ENABLED);
+    }
+
+    g_slist_free(list);
+}
Davidlohr Bueso Oct. 7, 2022, 7:55 p.m. UTC | #10
On Fri, 07 Oct 2022, Gregory Price wrote:

>Spec says that volatile devices `may` implement an lsa.

Right you are.

>Get LSA (Opcode 4102h)
>The Label Storage Area (LSA) shall be supported by a memory device
>that provides persistent memory capacity and may be supported by a
>device that provides only volatile memory capacity. The format of
>the LSA is specified in Section 9.13.2. The size of the Label Storage
>Area is retrieved from the Identify Memory Device command.
Jonathan Cameron Oct. 10, 2022, 2:43 p.m. UTC | #11
> 
> I was unaware that an SLD could be comprised of multiple regions
> of both persistent and volatile memory.  I was under the impression that
> it could only be one type of memory.  Of course that makes sense in the
> case of a memory expander that simply lets you plug DIMMs in *facepalm*
> 
> I see the reason to have separate backends in this case.
> 
> The reason to allow an array of backing devices is if we believe each
> individual DIMM plugged into a memexpander is likely to show up as
> (configurably?) individual NUMA nodes, or if it's likely to get
> classified as one numa node.

I'm not sure it would be each DIMM separately as there are likely to only
be a couple of types.

> 
> Maybe we should consider 2 new options:
> --persistent-memdevs=pm1 pm2 pm3
> --volatile-memdevs=vm1 vm2 vm3
> 
> etc, and deprecate --memdev, and go with your array of memdevs idea.
> 
> I think I could probably whip that up in a day or two.  Thoughts?

I wonder if we care to emulate beyond 1 volatile and 1 persistent.
Sure devices might exist, but if we can exercise all the code paths
with a simpler configuration, perhaps we don't need to handle the
more complex ones?

The sticky corner here is Set Partition Info 
CXL r3.0 8.2.9.8.2.1

Separation between volatile and non volatile is configurable at runtime.

> 
> 
> 
> > > 
> > > 2) EDK2 sets the memory area as a reserved, and the memory is not
> > > configured by the system as ram.  I'm fairly sure edk2 just doesn't
> > > support this yet, but there's a chicken/egg problem.  If the device
> > > isn't there, there's nothing to test against... if there's nothing to
> > > test against, no one will write the support.  So I figure we should kick
> > > start the process (probably by getting it wrong on the first go around!)  
> > 
> > Yup, if the bios left it alone, OS drivers need to treat it the same as
> > they would deal with hotplugged memory.  Note my strong suspicion is there
> > will be host vendors who won't ever handle volatile CXL memory in firmware.
> > They will just let the OS bring it up after boot. As long as you have DDR
> > as well on the system that will be fine.  Means there is one code path
> > to verify rather than two.  Not everyone will care about legacy OS support.
> >   
> 
> Presently i'm failing to bring up a region of memory even when this is
> set to persistent (even on upstream configuration).  The kernel is
> presently failing to set_size because the region is used.
> 
> I can't tell if this is a driver error or because EDK2 is marking the
> region as reserved.
> 
> relevant boot output:
> [    0.000000] BIOS-e820: [mem 0x0000000290000000-0x000000029fffffff] reserved
> [    1.229097] acpi ACPI0016:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI EDR HPX-Type3]
> [    1.244082] acpi ACPI0016:00: _OSC: OS supports [CXL20PortDevRegAccess CXLProtocolErrorReporting CXLNativeHotPlug]
> [    1.261245] acpi ACPI0016:00: _OSC: platform does not support [LTR DPC]
> [    1.272347] acpi ACPI0016:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability]
> [    1.286092] acpi ACPI0016:00: _OSC: OS now controls [CXLMemErrorReporting]
> 
> The device is otherwise available for use
> 
> cli output
> # cxl list
> [
>   {
>     "memdev":"mem0",
>     "pmem_size":268435456,
>     "ram_size":0,
>     "serial":0,
>     "host":"0000:35:00.0"
>   }
> ]
> 
> but it fails to setup correctly
> 
> cxl create-region -m -d decoder0.0 -w 1 -g 256 mem0
> cxl region: create_region: region0: set_size failed: Numerical result out of range
> cxl region: cmd_create_region: created 0 regions
> 
> I tracked this down to this part of the kernel:
> 
> kernel/resource.c
> 
> static struct resource *get_free_mem_region(...)
> {
> 	... snip ...
> 	enumerate regions, fail to find a useable region
> 	... snip ...
> 	return ERR_PTR(-ERANGE);
> }
> 
> but i'm not sure of what to do with this info.  We have some proof
> that real hardware works with this no problem, and the only difference
> is that the EFI/bios/firmware is setting the memory regions as `usable`
> or `soft reserved`, which would imply the EDK2 is the blocker here
> regardless of the OS driver status.
> 
> But I'd seen elsewhere you had gotten some of this working, and I'm
> failing to get anything working at the moment.  If you have any input i
> would greatly appreciate the help.
> 
> QEMU config:
> 
> /opt/qemu-cxl2/bin/qemu-system-x86_64 \
> -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=d\
> -m 2G,slots=4,maxmem=4G \
> -smp 4 \
> -machine type=q35,accel=kvm,cxl=on \
> -enable-kvm \
> -nographic \
> -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=256M \
> -object memory-backend-file,id=lsa0,mem-path=/tmp/cxl-lsa0,size=256M \
> -device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=256M
> 
> I'd seen on the lists that you had seen issues with single-rp setups,
> but no combination of configuration I've tried (including all the ones
> in the docs and tests) lead to a successful region creation with
> `cxl create-region`

Hmm. Let me have a play.  I've not run x86 tests for a while so
perhaps something is missing there.

I'm carrying a patch to override check_last_peer() in
cxl_port_setup_targets() as that is wrong for some combinations,
but that doesn't look like it's related to what you are seeing.

> 
> > > 
> > > 3) Upstream linux drivers haven't touched ram configurations yet.  I
> > > just configured this with Dan Williams yesterday on IRC.  My
> > > understanding is that it's been worked on but nothing has been
> > > upstreamed, in part because there are only a very small set of devices
> > > available to developers at the moment.  
> > 
> > There was an offer of similar volatile memory QEMU emulation in the
> > session on QEMU CXL at Linux Plumbers.  That will look something like you have
> > here and maybe reflects that someone has hardware as well...
> >   
> 
> I saw that, and I figured I'd start the conversation by pushing
> something :].
Jonathan Cameron Oct. 10, 2022, 3:18 p.m. UTC | #12
On Fri, 7 Oct 2022 10:50:12 -0400
Gregory Price <gourry.memverge@gmail.com> wrote:

> Now that i've had some time to look at the spec, the DVSEC CXL Capability
> register (8.1.3.1 in 3.0 spec)
> only supports enabling two HDM ranges at the moment, which to me means we
> should implement
> 
> memdev0=..., memdev1=...

That needs to work (for backwards compatibility)
but doesn't need to be the most optimal mapping.  So you could have
two HDM ranges defined, but if your software is clever enough to
use the CXL 2.0 infrastructure you can have more (Linux is).

> 
> Yesterday I pushed a patch proposal that separated the regions into
> persistent and volatile, but
> i had discovered that there isn't a good way (presently) to determine of a
> MemoryBacking object
> is both file File-type AND has pmem=true, meaning we'd have to either
> expose that interface
> (this seems dubious to me) or do the following:
Whilst there are good use cases for using actual volatile and non-volatile
backends, I don't think we want to enforce that, as it should 'work' with
whatever backends are used.
> 
> memdev0=mem0,memdev0-pmem=true,memdev1=mem0,memdev0-pmem=false
> 
> and then simply store a bit for each hostmem region accordingly to report
> pmem/volatile.
> This would allow the flexibility for the backing device to be whatever the
> user wants while
> still being able to mark the behavior of the type-3 device as pmem or
> volatile.  Alternatively
> we could make `memdev0-type=` and allow for new types in the future I
> suppose.
> 
That was why I had one suggestion of just supporting two memdevs.

memdev-volatile and memdev-nonvolatile, but then messing with whether the
whole memdev is presented, depending on the set partition.  I'm increasingly
thinking that is the least ugly solution.  Plus, until we implement Set
Partition Info along with a control on the maximum size of the partitionable
region, we can just use a pair of memdevs to provide all the information
needed to emulate the devices.
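
As a sketch (property names as above, not an existing interface), that would
look something like:

  -device cxl-type3,bus=root_port13,memdev-volatile=cxl-vmem0,memdev-nonvolatile=cxl-pmem0,lsa=cxl-lsa0,id=cxl-dev0 \

with cxl-vmem0, cxl-pmem0 and cxl-lsa0 being ordinary memory backends
defined in the usual way.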

> The one thing I'm a bit confused on - the Media_type and Memory_Class
> fields in the DVSEC
> CXL Range registers (8.1.3.8.2).  Right now they're set to "010b" = Memory
> Characteristics
> are communicated via CDAT".  Do the devices presently emulate this? I'm
> finding it hard to pick
> apart the code to identify it.

Not upstream yet.

Patches on list + can be found at

https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-09
There are a few messy corners in that tree but it should work. I'll be
pushing out a new version in a few days.

I updated that in latest version to build the tables based on the
memdev provided.  We'll want to add the volatile support to that alongside
your patch.

Jonathan
Gregory Price Oct. 10, 2022, 3:20 p.m. UTC | #13
> > 
> > Maybe we should consider 2 new options:
> > --persistent-memdevs=pm1 pm2 pm3
> > --volatile-memdevs=vm1 vm2 vm3
> > 
> > etc, and deprecate --memdev, and go with your array of memdevs idea.
> > 
> > I think I could probably whip that up in a day or two.  Thoughts?
> 
> I wonder if we care to emulate beyond 1 volatile and 1 persistent.
> Sure devices might exist, but if we can exercise all the code paths
> with a simpler configuration, perhaps we don't need to handle the
> more complex ones?
> 
> The sticky corner here is Set Partition Info 
> CXL r3.0 8.2.9.8.2.1
> 
> Separation between volatile and non volatile is configurable at runtime.
> 

Set Partition Info (Opcode 4101h)
... snip ... Partitionable capacity is equal to
(total capacity - (volatile-only) - (persistent-only))

By definition, volatile backed memory can only contain volatile regions.
That's simple.

For persistent-backed memory, capacity can be chopped up (if Identify
Memory Device reports Partition Alignment to be non-zero).

--persistent-memdev=pmem-backing
--volatile-memdev=vmem-backing

By default, maybe we should make the entire size of persistent regions
persistent-only for the initial prototype.  This would let us merge
without support for Set Partition Info.  Then add an additional argument
to set the partition alignment.

--partitionable-pmem=
  : Defaults to 0. If non-zero, up to N bytes of pmem may be
    used as vmem by the operating system at runtime.

On the CXL tool side we should then see the following output for these
settings:

--persistent-memdev=pmem0     1GB
--volatile-memdev=vmem0       1GB
--partitionable-pmem=512M

Capacities:
----------------
Total Capacity: 2GB
Volatile Capacity: 1GB
Persistent Capacity: 1GB
Partitionable Capacity: 512MB
Partitioned Memory: 0MB

Available Memory:
-----------------
Volatile Memory: 1GB
Persistent Memory: 1GB


Then Run something along the lines of:
`cxl partition-pmem 512MB`   (or whatever args are needed)


Capacities:
----------------
Total Capacity: 2GB
Volatile Capacity: 1GB
Persistent Capacity: 1GB
Partitionable Capacity: 512MB
Partitioned Memory: 512MB

Available Memory:
-----------------
Volatile Memory: 1.5GB
Persistent Memory: 512MB


Thoughts?
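
For concreteness, the invocation matching the example above might look
something like this (persistent-memdev/volatile-memdev as proposed, and
partitionable-pmem purely hypothetical at this stage):

  -object memory-backend-file,id=pmem0,share=on,mem-path=/tmp/cxl-pmem0,size=1G \
  -object memory-backend-ram,id=vmem0,size=1G \
  -device cxl-type3,bus=root_port13,persistent-memdev=pmem0,volatile-memdev=vmem0,partitionable-pmem=512M,id=cxl-dev0 \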
Gregory Price Oct. 10, 2022, 3:25 p.m. UTC | #14
> 
> https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-09
> There are a few messy corners in that tree but it should work. I'll be
> pushing out a new version in a few days.
> 
> I updated that in latest version to build the tables based on the
> memdev provided.  We'll want to add the volatile support to that alongside
> your patch.
> 

I will rebase my --persistent-memdev and --volatile-memdev patch on your
branch and send out the commits when i'm done.  I may also add the
scafolding for the partitionable-pmem field but not actually expose it.
Jonathan Cameron Oct. 10, 2022, 4:26 p.m. UTC | #15
On Mon, 10 Oct 2022 11:20:05 -0400
Gregory Price <gregory.price@memverge.com> wrote:

> > > 
> > > Maybe we should consider 2 new options:
> > > --persistent-memdevs=pm1 pm2 pm3
> > > --volatile-memdevs=vm1 vm2 vm3
> > > 
> > > etc, and deprecate --memdev, and go with your array of memdevs idea.
> > > 
> > > I think I could probably whip that up in a day or two.  Thoughts?  
> > 
> > I wonder if we care to emulate beyond 1 volatile and 1 persistent.
> > Sure devices might exist, but if we can exercise all the code paths
> > with a simpler configuration, perhaps we don't need to handle the
> > more complex ones?
> > 
> > The sticky corner here is Set Partition Info 
> > CXL r3.0 8.2.9.8.2.1
> > 
> > Separation between volatile and non volatile is configurable at runtime.
> >   
> 
> Set Partition Info (Opcode 4101h)
> ... snip ... Paritionable capacity is equal to
> (total capacity - (volatile-only) - (persistent-only))
> 
> By definition, volatile backed memory can only contain volatile regions.
> That's simple.

I guess that's a reasonable model for the QEMU command line.

> 
> For persistent-backed memory, capacity can be chopped up (iif Identify
> Memory Device reports Paritition Alignment to be non-zero).
> 
> --persistent-memdev=pmem-backing
> --volatile-memdev=vmem-backing
> 
> By default, maybe we should make the entire size of persistent regions
> to be persistent-only for the initial prototype. 

Agreed.  Partitioning wasn't high on my todo list though maybe it is
for others...


> This would let us
> merge without support for Set Partition Info.  Then add an additional
> argument to enable additional argument to set the partition alignment.
> 
> --partitionable-pmem=
>   : Defaults to 0. If non-zero, up to N XBytes of PMem may be
>     used as vmem by the operating system at runtime.
> 
> On the CXL tool side we should then see the following output for these
> settings:
> 
> --persistent-memdev=pmem0     1GB
> --volatile-memdev=vmem0       1GB
> --partitionable-pmem=512M
> 
> Capacities:
> ----------------
> Total Capacity: 2GB
> Volatile Capacity: 1GB
> Persistent Capacity: 1GB
> Partitionable Capacity: 512MB
> Partitioned Memory: 0MB
> 
> Available Memory:
> -----------------
> Volatile Memory: 1GB
> Persistent Memory: 1GB
> 
> 
> Then Run something along the lines of:
> `cxl partition-pmem 512MB`   (or whatever args are needed)
> 
> 
> Capacities:
> ----------------
> Total Capacity: 2GB
> Volatile Capacity: 1GB
> Persistent Capacity: 1GB
> Partitionable Capacity: 512MB
> Partitioned Memory: 512MB
> 
> Available Memory:
> -----------------
> Volatile Memory: 1.5GB
> Persistent Memory: 512MB
> 
> 
> Thoughts?

Looks good to me.  Alison, I think you were looking at the partitioning stuff...
What do you think of this?

Jonathan
Jonathan Cameron Oct. 10, 2022, 4:32 p.m. UTC | #16
> > but i'm not sure of what to do with this info.  We have some proof
> > that real hardware works with this no problem, and the only difference
> > is that the EFI/bios/firmware is setting the memory regions as `usable`
> > or `soft reserved`, which would imply the EDK2 is the blocker here
> > regardless of the OS driver status.
> > 
> > But I'd seen elsewhere you had gotten some of this working, and I'm
> > failing to get anything working at the moment.  If you have any input i
> > would greatly appreciate the help.
> > 
> > QEMU config:
> > 
> > /opt/qemu-cxl2/bin/qemu-system-x86_64 \
> > -drive file=/var/lib/libvirt/images/cxl.qcow2,format=qcow2,index=0,media=d\
> > -m 2G,slots=4,maxmem=4G \
> > -smp 4 \
> > -machine type=q35,accel=kvm,cxl=on \
> > -enable-kvm \
> > -nographic \
> > -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
> > -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,slot=0 \
> > -object memory-backend-file,id=cxl-mem0,mem-path=/tmp/cxl-mem0,size=256M \
> > -object memory-backend-file,id=lsa0,mem-path=/tmp/cxl-lsa0,size=256M \
> > -device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 \
> > -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=256M
> > 
> > I'd seen on the lists that you had seen issues with single-rp setups,
> > but no combination of configuration I've tried (including all the ones
> > in the docs and tests) lead to a successful region creation with
> > `cxl create-region`  
> 
> Hmm. Let me have a play.  I've not run x86 tests for a while so
> perhaps something is missing there.
> 
> I'm carrying a patch to override check_last_peer() in
> cxl_port_setup_targets() as that is wrong for some combinations,
> but that doesn't look like it's related to what you are seeing.

I'm not sure if it's relevant, but it turned out I'd forgotten that I'm
carrying 3 patches that aren't upstream (and one is a horrible hack).

Hack: https://lore.kernel.org/linux-cxl/20220819094655.000005ed@huawei.com/
Shouldn't affect a simple case like this...

https://lore.kernel.org/linux-cxl/20220819093133.00006c22@huawei.com/T/#t
(Dan's version)

https://lore.kernel.org/linux-cxl/20220815154044.24733-1-Jonathan.Cameron@huawei.com/T/#t

For writes to work you will currently need two rps (having nothing on the
second is fine), as we still haven't resolved whether the kernel should
support an HDM decoder on a host bridge with one port.  I think it should
(the spec allows it), but others are unconvinced.

Note I haven't shifted over to x86 yet, so there may still be something
different from arm64.

Jonathan


> 
> >   
> > > > 
> > > > 3) Upstream linux drivers haven't touched ram configurations yet.  I
> > > > just configured this with Dan Williams yesterday on IRC.  My
> > > > understanding is that it's been worked on but nothing has been
> > > > upstreamed, in part because there are only a very small set of devices
> > > > available to developers at the moment.    
> > > 
> > > There was an offer of similar volatile memory QEMU emulation in the
> > > session on QEMU CXL at Linux Plumbers.  That will look something like you have
> > > here and maybe reflects that someone has hardware as well...
> > >     
> > 
> > I saw that, and I figured I'd start the conversation by pushing
> > something :].  
> 
>
Davidlohr Bueso Oct. 10, 2022, 5:18 p.m. UTC | #17
On Mon, 10 Oct 2022, Jonathan Cameron wrote:

>I wonder if we care to emulate beyond 1 volatile and 1 persistent.
>Sure devices might exist, but if we can exercise all the code paths
>with a simpler configuration, perhaps we don't need to handle the
>more complex ones?

Yes, I completely agree. 1 of each seems like the best balance between
exercising code paths vs complexity.

Thanks,
Davidlohr
Gregory Price Oct. 11, 2022, 1:23 a.m. UTC | #18
I've pushed 5 new commits to this branch here (@Jonathan I've also made
a merge request to pull them into your branch).

https://gitlab.com/gourry.memverge/qemu/-/commits/cxl-2022-10-09

They're built on top of Jonathan's extensions for the CDAT, since the
CDAT has memory-region-relevant entries and trying to do this separately
would be unwise.

1/5: PCI_CLASS_MEMORY_CXL patch
2/5: CXL_CAPACITY_MULTIPLIER pullout patch (@Davidlohr request)
3/5: Generalizes CDATDsmas initialization ahead of multi-region
4/5: Multi-region support w/ backward compatibility
     * Requires extra eyes for CDAT and Read/Write Change Validation*
5/5: Test and documentation update


On Mon, Oct 10, 2022 at 11:25:31AM -0400, Gregory Price wrote:
> > 
> > https://gitlab.com/jic23/qemu/-/commits/cxl-2022-10-09
> > There are a few messy corners in that tree but it should work. I'll be
> > pushing out a new version in a few days.
> > 
> > I updated that in latest version to build the tables based on the
> > memdev provided.  We'll want to add the volatile support to that alongside
> > your patch.
> > 
> 
> I will rebase my --persistent-memdev and --volatile-memdev patch on your
> branch and send out the commits when i'm done.  I may also add the
> scafolding for the partitionable-pmem field but not actually expose it.
Davidlohr Bueso Oct. 11, 2022, 5:14 p.m. UTC | #19
On Mon, 10 Oct 2022, Gregory Price wrote:

>I've pushed 5 new commits to this branch here (@Jonathan I've also made
>a merge request to pull them into your branch).
>
>https://gitlab.com/gourry.memverge/qemu/-/commits/cxl-2022-10-09

This series could perhaps be posted as a reply to the CDAT extensions
cover letter. But regardless, at some point it should be in linux-cxl@.

>
>They're built on top of Jonathan's extensions for the CDAT since the
>CDAT has memory region relevant entries and trying to do this separate
>would be unwise.
>
>1/5: PCI_CLASS_MEMORY_CXL patch
>2/5: CXL_CAPACITY_MULTIPLIER pullout patch (@Davidlohr request)

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>

However this needs a changelog; for however redundant it may be.

>3/5: Generalizes CDATDsmas intialization ahead of multi-region

>4/5: Multi-region support w/ backward compatibility
>     * Requires extra eyes for CDAT and Read/Write Change Validation*

I'm still eyeballing this but it certainly looks much more complete now -
at least with the minimal support I was hoping for.

>5/5: Test and documentation update

I think that there should be two examples here for volatile and LSA usage.
The first without the LSA, as it is quite unintuitive otherwise, then a
second example specifying the LSA. Also in these cases you want
id=cxl-vmem0. And the documentation should be updated to mention that
memdev is deprecated and {persistent/volatile}-memdev should be used.

Thanks,
Davidlohr
Gregory Price Oct. 11, 2022, 5:22 p.m. UTC | #20
I'll push the patches to qemu-cxl and linux-cxl today or tomorrow; I
wanted to get them into a state on gitlab for Jonathan to rebase and
merge into his work.  He'll likely end up pushing the entire series at
the end of the day.

Will update the tests/docs accordingly.

Re: changelog - I'm new to mailing list contributions; where would these
go exactly?


On Tue, Oct 11, 2022 at 10:14:38AM -0700, Davidlohr Bueso wrote:
> On Mon, 10 Oct 2022, Gregory Price wrote:
> 
> > I've pushed 5 new commits to this branch here (@Jonathan I've also made
> > a merge request to pull them into your branch).
> > 
> > https://gitlab.com/gourry.memverge/qemu/-/commits/cxl-2022-10-09
> 
> This series could perhaps be posted as a reply to the CDAT extensions
> cover letter. But regardless, at some point it should be in linux-cxl@.
> 
> > 
> > They're built on top of Jonathan's extensions for the CDAT since the
> > CDAT has memory region relevant entries and trying to do this separate
> > would be unwise.
> > 
> > 1/5: PCI_CLASS_MEMORY_CXL patch
> > 2/5: CXL_CAPACITY_MULTIPLIER pullout patch (@Davidlohr request)
> 
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> 
> However this needs a changelog; for however redundant it may be.
> 
> > 3/5: Generalizes CDATDsmas intialization ahead of multi-region
> 
> > 4/5: Multi-region support w/ backward compatibility
> >     * Requires extra eyes for CDAT and Read/Write Change Validation*
> 
> I'm still eyeballing this but it certainly looks much more complete now -
> at least with the minimal support I was hoping for.
> 
> > 5/5: Test and documentation update
> 
> I think that there should two examples here with volatile and LSA usage.
> The first is without as it is quite unintuitive otherwise, then a
> second example with specifying the lba. Also in these cases you want
> id=cxl-vmem0. And the documentation should be updated to mention that
> memdev is deprecated and {persistent/volatile}-memdev should be used.
> 
> Thanks,
> Davidlohr
Jonathan Cameron Oct. 11, 2022, 5:28 p.m. UTC | #21
On Tue, 11 Oct 2022 13:22:40 -0400
Gregory Price <gregory.price@memverge.com> wrote:

> I'll push the patches to qemu-cxl and linux-cxl today or tomorrow, I
> wanted to get them into a state on gitlab for Jonathan to rebase and
> merge into his work.  He'll likely end up pushing the entire series at
> the end of the day.
> 
> Will update the tests/docs accordingly.
> 
> re: changelog - i'm new to mailing list contributions, where would these
> go exactly?

Two places - in the cover letter, and under the ---
in the individual patches.  
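
For the per-patch part, roughly this layout (content made up purely for
illustration):

  [PATCH v2 N/M] hw/cxl: subject line

  Commit message explaining the change.

  Signed-off-by: Your Name <you@example.com>
  ---
  v2: per-patch changelog goes here, below the ---, so git am drops it
   hw/cxl/cxl-mailbox-utils.c | 24 +++++++++-------
   ...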

At this stage I'm not super fussed by tests because there won't be
anything visible beyond 'does it crash'.  Don't think we will be affecting
the ACPI table test results for example.  Good to change to the proposed
new command line for persistent memdev though.

Definitely good to get some testing in place for this longer term, but
our whole strategy for that is a bit ad hoc for now because tooling isn't
yet commonly available in distros etc. (and I for one haven't looked
at what is necessary yet!)

It was pointed out to me last week that the docs currently give arm64
examples and we don't have upstream arm64 support yet :( 

The wonders of partial pickup of patch sets!

Thanks,

Jonathan

> 
> 
> On Tue, Oct 11, 2022 at 10:14:38AM -0700, Davidlohr Bueso wrote:
> > On Mon, 10 Oct 2022, Gregory Price wrote:
> >   
> > > I've pushed 5 new commits to this branch here (@Jonathan I've also made
> > > a merge request to pull them into your branch).
> > > 
> > > https://gitlab.com/gourry.memverge/qemu/-/commits/cxl-2022-10-09  
> > 
> > This series could perhaps be posted as a reply to the CDAT extensions
> > cover letter. But regardless, at some point it should be in linux-cxl@.
> >   
> > > 
> > > They're built on top of Jonathan's extensions for the CDAT since the
> > > CDAT has memory region relevant entries and trying to do this separate
> > > would be unwise.
> > > 
> > > 1/5: PCI_CLASS_MEMORY_CXL patch
> > > 2/5: CXL_CAPACITY_MULTIPLIER pullout patch (@Davidlohr request)  
> > 
> > Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> > 
> > However this needs a changelog; for however redundant it may be.
> >   
> > > 3/5: Generalizes CDATDsmas intialization ahead of multi-region  
> >   
> > > 4/5: Multi-region support w/ backward compatibility
> > >     * Requires extra eyes for CDAT and Read/Write Change Validation*  
> > 
> > I'm still eyeballing this but it certainly looks much more complete now -
> > at least with the minimal support I was hoping for.
> >   
> > > 5/5: Test and documentation update  
> > 
> > I think that there should two examples here with volatile and LSA usage.
> > The first is without as it is quite unintuitive otherwise, then a
> > second example with specifying the lba. Also in these cases you want
> > id=cxl-vmem0. And the documentation should be updated to mention that
> > memdev is deprecated and {persistent/volatile}-memdev should be used.
> > 
> > Thanks,
> > Davidlohr
diff mbox series

Patch

diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst
index f25783a4ec..3a62d46e8a 100644
--- a/docs/system/devices/cxl.rst
+++ b/docs/system/devices/cxl.rst
@@ -300,7 +300,7 @@  Example topology involving a switch::
 
 Example command lines
 ---------------------
-A very simple setup with just one directly attached CXL Type 3 device::
+A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
 
   qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
   ...
@@ -308,7 +308,18 @@  A very simple setup with just one directly attached CXL Type 3 device::
   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
+  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
+  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
+
+A very simple setup with just one directly attached CXL Type 3 Volatile Memory device::
+
+  qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
+  ...
+  -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
+  -object memory-backend-ram,id=cxl-lsa1,share=on,size=256M \
+  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
+  -device cxl-type3,bus=root_port13,pmem=false,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
 
 A setup suitable for 4 way interleave. Only one fixed window provided, to enable 2 way
@@ -328,13 +339,13 @@  the CXL Type3 device directly attached (no switches).::
   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
   -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-  -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
+  -device cxl-type3,bus=root_port13,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
   -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \
-  -device cxl-type3,bus=root_port14,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
+  -device cxl-type3,bus=root_port14,pmem=true,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
   -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \
-  -device cxl-type3,bus=root_port15,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
+  -device cxl-type3,bus=root_port15,pmem=true,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
   -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \
-  -device cxl-type3,bus=root_port16,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
+  -device cxl-type3,bus=root_port16,pmem=true,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k
 
 An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
@@ -354,13 +365,13 @@  An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
   -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
   -device cxl-upstream,bus=root_port0,id=us0 \
   -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-  -device cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
+  -device cxl-type3,bus=swport0,pmem=true,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
   -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
-  -device cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
+  -device cxl-type3,bus=swport1,pmem=true,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
   -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
-  -device cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
+  -device cxl-type3,bus=swport2,pmem=true,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
   -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
-  -device cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
+  -device cxl-type3,bus=swport3,pmem=true,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
 
 Kernel Configuration Options
diff --git a/hw/cxl/cxl-mailbox-utils.c b/hw/cxl/cxl-mailbox-utils.c
index bc1bb18844..3ed4dfeb69 100644
--- a/hw/cxl/cxl-mailbox-utils.c
+++ b/hw/cxl/cxl-mailbox-utils.c
@@ -138,7 +138,7 @@  static ret_code cmd_firmware_update_get_info(struct cxl_cmd *cmd,
     } QEMU_PACKED *fw_info;
     QEMU_BUILD_BUG_ON(sizeof(*fw_info) != 0x50);
 
-    if (cxl_dstate->pmem_size < (256 << 20)) {
+    if (cxl_dstate->mem_size < (256 << 20)) {
         return CXL_MBOX_INTERNAL_ERROR;
     }
 
@@ -281,7 +281,7 @@  static ret_code cmd_identify_memory_device(struct cxl_cmd *cmd,
 
     CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
     CXLType3Class *cvc = CXL_TYPE3_GET_CLASS(ct3d);
-    uint64_t size = cxl_dstate->pmem_size;
+    uint64_t size = cxl_dstate->mem_size;
 
     if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
         return CXL_MBOX_INTERNAL_ERROR;
@@ -290,11 +290,13 @@  static ret_code cmd_identify_memory_device(struct cxl_cmd *cmd,
     id = (void *)cmd->payload;
     memset(id, 0, sizeof(*id));
 
-    /* PMEM only */
-    snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 0);
+    /* Version 0: PMEM Only.  Version 1: PMEM and VMEM */
+    snprintf(id->fw_revision, 0x10, "BWFW VERSION %02d", 1);
 
-    id->total_capacity = size / (256 << 20);
-    id->persistent_capacity = size / (256 << 20);
+    size /= (256 << 20);
+    id->total_capacity = size;
+    id->persistent_capacity = ct3d->is_pmem ? size : 0;
+    id->volatile_capacity = ct3d->is_pmem ? 0 : size;
     id->lsa_size = cvc->get_lsa_size(ct3d);
 
     *len = sizeof(*id);
@@ -312,16 +314,18 @@  static ret_code cmd_ccls_get_partition_info(struct cxl_cmd *cmd,
         uint64_t next_pmem;
     } QEMU_PACKED *part_info = (void *)cmd->payload;
     QEMU_BUILD_BUG_ON(sizeof(*part_info) != 0x20);
-    uint64_t size = cxl_dstate->pmem_size;
+
+    CXLType3Dev *ct3d = container_of(cxl_dstate, CXLType3Dev, cxl_dstate);
+    uint64_t size = cxl_dstate->mem_size;
 
     if (!QEMU_IS_ALIGNED(size, 256 << 20)) {
         return CXL_MBOX_INTERNAL_ERROR;
     }
 
-    /* PMEM only */
-    part_info->active_vmem = 0;
+    size /= (256 << 20);
+    part_info->active_vmem = ct3d->is_pmem ? 0 : size;
     part_info->next_vmem = 0;
-    part_info->active_pmem = size / (256 << 20);
+    part_info->active_pmem = ct3d->is_pmem ? size : 0;
     part_info->next_pmem = 0;
 
     *len = sizeof(*part_info);
diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index ada2108fac..18c5b9ff90 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -111,7 +111,7 @@  static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
         error_setg(errp, "memdev property must be set");
         return false;
     }
-    memory_region_set_nonvolatile(mr, true);
+    memory_region_set_nonvolatile(mr, ct3d->is_pmem);
     memory_region_set_enabled(mr, true);
     host_memory_backend_set_mapped(ct3d->hostmem, true);
 
@@ -123,7 +123,7 @@  static bool cxl_setup_memory(CXLType3Dev *ct3d, Error **errp)
     address_space_init(&ct3d->hostmem_as, mr, name);
     g_free(name);
 
-    ct3d->cxl_dstate.pmem_size = ct3d->hostmem->size;
+    ct3d->cxl_dstate.mem_size = ct3d->hostmem->size;
 
     if (!ct3d->lsa) {
         error_setg(errp, "lsa property must be set");
@@ -271,6 +271,7 @@  static void ct3d_reset(DeviceState *dev)
 }
 
 static Property ct3_props[] = {
+    DEFINE_PROP_BOOL("pmem", CXLType3Dev, is_pmem, false),
     DEFINE_PROP_LINK("memdev", CXLType3Dev, hostmem, TYPE_MEMORY_BACKEND,
                      HostMemoryBackend *),
     DEFINE_PROP_LINK("lsa", CXLType3Dev, lsa, TYPE_MEMORY_BACKEND,
@@ -278,6 +279,7 @@  static Property ct3_props[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+
 static uint64_t get_lsa_size(CXLType3Dev *ct3d)
 {
     MemoryRegion *mr;
@@ -338,10 +340,10 @@  static void ct3_class_init(ObjectClass *oc, void *data)
     pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
     pc->vendor_id = PCI_VENDOR_ID_INTEL;
     pc->device_id = 0xd93; /* LVF for now */
-    pc->revision = 1;
+    pc->revision = 2;
 
     set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
-    dc->desc = "CXL PMEM Device (Type 3)";
+    dc->desc = "CXL Memory Device (Type 3)";
     dc->reset = ct3d_reset;
     device_class_set_props(dc, ct3_props);
 
diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 1e141b6621..67fc65f047 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -117,8 +117,8 @@  typedef struct cxl_device_state {
         uint64_t host_set;
     } timestamp;
 
-    /* memory region for persistent memory, HDM */
-    uint64_t pmem_size;
+    /* memory region for persistent and volatile memory, HDM */
+    uint64_t mem_size;
 } CXLDeviceState;
 
 /* Initialize the register block for a device */
@@ -235,6 +235,7 @@  struct CXLType3Dev {
     PCIDevice parent_obj;
 
     /* Properties */
+    bool is_pmem;
     HostMemoryBackend *hostmem;
     HostMemoryBackend *lsa;
 
diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
index 2ebeb530b2..40c392056d 100644
--- a/tests/qtest/bios-tables-test.c
+++ b/tests/qtest/bios-tables-test.c
@@ -1578,13 +1578,13 @@  static void test_acpi_q35_cxl(void)
                              " -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1"
                              " -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2"
                              " -device cxl-rp,port=0,bus=cxl.1,id=rp1,chassis=0,slot=2"
-                             " -device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1"
+                             " -device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1,lsa=lsa1"
                              " -device cxl-rp,port=1,bus=cxl.1,id=rp2,chassis=0,slot=3"
-                             " -device cxl-type3,bus=rp2,memdev=cxl-mem2,lsa=lsa2"
+                             " -device cxl-type3,bus=rp2,pmem=true,memdev=cxl-mem2,lsa=lsa2"
                              " -device cxl-rp,port=0,bus=cxl.2,id=rp3,chassis=0,slot=5"
-                             " -device cxl-type3,bus=rp3,memdev=cxl-mem3,lsa=lsa3"
+                             " -device cxl-type3,bus=rp3,pmem=true,memdev=cxl-mem3,lsa=lsa3"
                              " -device cxl-rp,port=1,bus=cxl.2,id=rp4,chassis=0,slot=6"
-                             " -device cxl-type3,bus=rp4,memdev=cxl-mem4,lsa=lsa4"
+                             " -device cxl-type3,bus=rp4,pmem=true,memdev=cxl-mem4,lsa=lsa4"
                              " -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k,"
                              "cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.targets.1=cxl.2,cxl-fmw.1.size=4G,cxl-fmw.1.interleave-granularity=8k",
                              tmp_path, tmp_path, tmp_path, tmp_path,
diff --git a/tests/qtest/cxl-test.c b/tests/qtest/cxl-test.c
index cbe0fb549b..667e590c5f 100644
--- a/tests/qtest/cxl-test.c
+++ b/tests/qtest/cxl-test.c
@@ -31,27 +31,40 @@ 
 
 #define QEMU_T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M " \
                  "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
-                 "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 "
+                 "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
+                  "lsa=lsa0,id=cxl-pmem0 "
+
+#define QEMU_T3DV "-object memory-backend-ram,id=cxl-mem0,size=256M " \
+                  "-object memory-backend-ram,id=lsa0,size=256M "    \
+                  "-device cxl-type3,bus=rp0,pmem=false,memdev=cxl-mem0," \
+                   "lsa=lsa0,id=cxl-pmem0 "
+
 
 #define QEMU_2T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M "    \
                   "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
-                  "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 " \
+                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
+                   "lsa=lsa0,id=cxl-pmem0 " \
                   "-object memory-backend-file,id=cxl-mem1,mem-path=%s,size=256M "    \
                   "-object memory-backend-file,id=lsa1,mem-path=%s,size=256M "    \
-                  "-device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1,id=cxl-pmem1 "
+                  "-device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1," \
+                   "lsa=lsa1,id=cxl-pmem1 "
 
 #define QEMU_4T3D "-object memory-backend-file,id=cxl-mem0,mem-path=%s,size=256M " \
                   "-object memory-backend-file,id=lsa0,mem-path=%s,size=256M "    \
-                  "-device cxl-type3,bus=rp0,memdev=cxl-mem0,lsa=lsa0,id=cxl-pmem0 " \
+                  "-device cxl-type3,bus=rp0,pmem=true,memdev=cxl-mem0," \
+                   "lsa=lsa0,id=cxl-pmem0 " \
                   "-object memory-backend-file,id=cxl-mem1,mem-path=%s,size=256M "    \
                   "-object memory-backend-file,id=lsa1,mem-path=%s,size=256M "    \
-                  "-device cxl-type3,bus=rp1,memdev=cxl-mem1,lsa=lsa1,id=cxl-pmem1 " \
+                  "-device cxl-type3,bus=rp1,pmem=true,memdev=cxl-mem1," \
+                   "lsa=lsa1,id=cxl-pmem1 " \
                   "-object memory-backend-file,id=cxl-mem2,mem-path=%s,size=256M "    \
                   "-object memory-backend-file,id=lsa2,mem-path=%s,size=256M "    \
-                  "-device cxl-type3,bus=rp2,memdev=cxl-mem2,lsa=lsa2,id=cxl-pmem2 " \
+                  "-device cxl-type3,bus=rp2,pmem=true,memdev=cxl-mem2," \
+                   "lsa=lsa2,id=cxl-pmem2 " \
                   "-object memory-backend-file,id=cxl-mem3,mem-path=%s,size=256M "    \
                   "-object memory-backend-file,id=lsa3,mem-path=%s,size=256M "    \
-                  "-device cxl-type3,bus=rp3,memdev=cxl-mem3,lsa=lsa3,id=cxl-pmem3 "
+                  "-device cxl-type3,bus=rp3,pmem=true,memdev=cxl-mem3," \
+                   "lsa=lsa3,id=cxl-pmem3 "
 
 static void cxl_basic_hb(void)
 {
@@ -103,6 +116,19 @@  static void cxl_t3d(void)
     qtest_end();
 }
 
+static void cxl_t3d_vmem(void)
+{
+    g_autoptr(GString) cmdline = g_string_new(NULL);
+    g_autofree const char *tmpfs = NULL;
+
+    tmpfs = g_dir_make_tmp("cxl-test-XXXXXX", NULL);
+
+    g_string_printf(cmdline, QEMU_PXB_CMD QEMU_RP QEMU_T3DV);
+
+    qtest_start(cmdline->str);
+    qtest_end();
+}
+
 static void cxl_1pxb_2rp_2t3d(void)
 {
     g_autoptr(GString) cmdline = g_string_new(NULL);
@@ -145,6 +171,7 @@  int main(int argc, char **argv)
     qtest_add_func("/pci/cxl/rp_x2", cxl_2root_port);
 #ifdef CONFIG_POSIX
     qtest_add_func("/pci/cxl/type3_device", cxl_t3d);
+    qtest_add_func("/pci/cxl/type3_vmem_device", cxl_t3d_vmem);
     qtest_add_func("/pci/cxl/rp_x2_type3_x2", cxl_1pxb_2rp_2t3d);
     qtest_add_func("/pci/cxl/pxb_x2_root_port_x4_type3_x4", cxl_2pxb_4rp_4t3d);
 #endif