
[ndctl,v2,0/7] cxl: add support for listing and creating volatile regions

Message ID 20230120-vv-volatile-regions-v2-0-4ea6253000e5@intel.com

Message

Verma, Vishal L Feb. 8, 2023, 8 p.m. UTC
While enumeration of ram type regions already works in libcxl and
cxl-cli, region listings lacked an attribute to indicate pmem vs. ram.
Add a new 'type' attribute to region listings to address this.
Additionally, add
support for creating ram regions to the cxl-create-region command. The
region listings are also updated with dax-region information for
volatile regions.

This also includes fixes for a few bugs / usability issues identified
along the way - patches 1, 4, and 6. Patch 5 is a usability improvement
where, based on decoder capabilities, the type of a region can be
inferred for the create-region command.

These have been tested against the ram-region additions to cxl_test
which are part of the kernel support patch set[1].
Additionally, tested against qemu using a WIP branch for volatile
support found here[2]. The 'run_qemu' script has a branch that creates
volatile memdevs in addition to pmem ones. This is also in a branch[3]
since it depends on [2].

These cxl-cli / libcxl patches themselves are also available in a
branch at [4].

[1]: https://lore.kernel.org/linux-cxl/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com/
[2]: https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-26
[3]: https://github.com/pmem/run_qemu/commits/vv/ram-memdevs
[4]: https://github.com/pmem/ndctl/tree/vv/volatile-regions
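
As a rough, editorial illustration of what the new 'type' attribute
enables: a consumer of `cxl list -R` JSON could distinguish volatile
from persistent regions as sketched below. The field names and values
here are assumptions based on this cover letter, not a guaranteed
output format.

```python
import json

# Hypothetical `cxl list -R` output once this series is applied; the
# "type" key distinguishing "pmem" from "ram" is what the series adds.
listing = """
[
  {"region": "region0", "size": 268435456, "type": "pmem"},
  {"region": "region1", "size": 268435456, "type": "ram"}
]
"""

regions = json.loads(listing)
ram_regions = [r["region"] for r in regions if r.get("type") == "ram"]
print(ram_regions)  # prints ['region1']
```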

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
Changes in v2:
- Fix typos in the commit message of patch 1 (Fan)
- Gate the type attr in region listings on mode != 'none' (Dan)
- Clarify unreachability of the default case in collect_minsize() (Ira)
- Simplify the mode setup in set_type_from_decoder() (Dan)
- Fix typo in the commit message of Patch 7 (Dan)
- Remove unneeded daxctl/json.h include from cxl/filter.c (Dan)
- Link to v1: https://lore.kernel.org/r/20230120-vv-volatile-regions-v1-0-b42b21ee8d0b@intel.com

---
Dan Williams (2):
      cxl/list: Include regions in the verbose listing
      cxl/list: Enumerate device-dax properties for regions

Vishal Verma (5):
      cxl/region: skip region_actions for region creation
      cxl: add a type attribute to region listings
      cxl: add core plumbing for creation of ram regions
      cxl/region: accept user-supplied UUIDs for pmem regions
      cxl/region: determine region type based on root decoder capability

 Documentation/cxl/cxl-create-region.txt |  6 ++-
 Documentation/cxl/cxl-list.txt          | 31 ++++++++++++++
 Documentation/cxl/lib/libcxl.txt        |  8 ++++
 cxl/lib/private.h                       |  2 +
 cxl/lib/libcxl.c                        | 72 +++++++++++++++++++++++++++++++--
 cxl/filter.h                            |  3 ++
 cxl/libcxl.h                            |  3 ++
 cxl/json.c                              | 23 +++++++++++
 cxl/list.c                              |  3 ++
 cxl/region.c                            | 66 +++++++++++++++++++++++++++---
 cxl/lib/libcxl.sym                      |  7 ++++
 cxl/lib/meson.build                     |  1 +
 cxl/meson.build                         |  3 ++
 13 files changed, 217 insertions(+), 11 deletions(-)
---
base-commit: 08720628d2ba469e203a18c0b1ffbd90f4bfab1d
change-id: 20230120-vv-volatile-regions-063950cef590

Best regards,

Comments

Brice Goglin Feb. 9, 2023, 11:04 a.m. UTC | #1
On 08/02/2023 21:00, Vishal Verma wrote:
> [..]


Hello Vishal

I am trying to play with this but all my attempts failed so far. Could 
you provide Qemu and cxl-cli command-lines to get a volatile region 
enabled in a Qemu VM?

Thanks

Brice
Verma, Vishal L Feb. 9, 2023, 7:17 p.m. UTC | #2
On Thu, 2023-02-09 at 12:04 +0100, Brice Goglin wrote:
> 
> [..]
> 
> 
> Hello Vishal
> 
> I am trying to play with this but all my attempts failed so far. Could 
> you provide Qemu and cxl-cli command-lines to get a volatile region 
> enabled in a Qemu VM?

Hi Brice,

Greg had posted his working config in another thread:
https://lore.kernel.org/linux-cxl/Y9sMs0FGulQSIe9t@memverge.com/

I've also pasted below the qemu command line generated by the run_qemu
script I referenced. (Note that this adds a bunch of stuff not strictly
needed for a minimal CXL configuration - you can certainly trim a lot
of that out - this is just the default setup that the script generates
and that I usually run.)

Feel free to post what errors / problems you're hitting and we can
debug further from there.

Thanks
Vishal


$ run_qemu.sh -g --cxl --cxl-debug --rw -r none --cmdline
/home/vverma7/git/qemu/build/qemu-system-x86_64 
-machine q35,accel=kvm,nvdimm=on,cxl=on 
-m 8192M,slots=4,maxmem=40964M 
-smp 8,sockets=2,cores=2,threads=2 
-enable-kvm 
-display none 
-nographic 
-drive if=pflash,format=raw,unit=0,file=OVMF_CODE.fd,readonly=on 
-drive if=pflash,format=raw,unit=1,file=OVMF_VARS.fd 
-debugcon file:uefi_debug.log 
-global isa-debugcon.iobase=0x402 
-drive file=root.img,format=raw,media=disk 
-kernel ./mkosi.extra/lib/modules/6.2.0-rc6+/vmlinuz 
-initrd mkosi.extra/boot/initramfs-6.2.0-rc2+.img 
-append selinux=0 audit=0 console=tty0 console=ttyS0 root=/dev/sda2 ignore_loglevel rw cxl_acpi.dyndbg=+fplm cxl_pci.dyndbg=+fplm cxl_core.dyndbg=+fplm cxl_mem.dyndbg=+fplm cxl_pmem.dyndbg=+fplm cxl_port.dyndbg=+fplm cxl_region.dyndbg=+fplm cxl_test.dyndbg=+fplm cxl_mock.dyndbg=+fplm cxl_mock_mem.dyndbg=+fplm memmap=2G!4G efi_fake_mem=2G@6G:0x40000 
-device e1000,netdev=net0,mac=52:54:00:12:34:56 
-netdev user,id=net0,hostfwd=tcp::10022-:22 
-object memory-backend-file,id=cxl-mem0,share=on,mem-path=cxltest0.raw,size=256M 
-object memory-backend-file,id=cxl-mem1,share=on,mem-path=cxltest1.raw,size=256M 
-object memory-backend-file,id=cxl-mem2,share=on,mem-path=cxltest2.raw,size=256M 
-object memory-backend-file,id=cxl-mem3,share=on,mem-path=cxltest3.raw,size=256M 
-object memory-backend-ram,id=cxl-mem4,share=on,size=256M 
-object memory-backend-ram,id=cxl-mem5,share=on,size=256M 
-object memory-backend-ram,id=cxl-mem6,share=on,size=256M 
-object memory-backend-ram,id=cxl-mem7,share=on,size=256M 
-object memory-backend-file,id=cxl-lsa0,share=on,mem-path=lsa0.raw,size=1K 
-object memory-backend-file,id=cxl-lsa1,share=on,mem-path=lsa1.raw,size=1K 
-object memory-backend-file,id=cxl-lsa2,share=on,mem-path=lsa2.raw,size=1K 
-object memory-backend-file,id=cxl-lsa3,share=on,mem-path=lsa3.raw,size=1K 
-device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=53 
-device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 
-device cxl-rp,id=hb0rp0,bus=cxl.0,chassis=0,slot=0,port=0 
-device cxl-rp,id=hb0rp1,bus=cxl.0,chassis=0,slot=1,port=1 
-device cxl-rp,id=hb0rp2,bus=cxl.0,chassis=0,slot=2,port=2 
-device cxl-rp,id=hb0rp3,bus=cxl.0,chassis=0,slot=3,port=3 
-device cxl-rp,id=hb1rp0,bus=cxl.1,chassis=0,slot=4,port=0 
-device cxl-rp,id=hb1rp1,bus=cxl.1,chassis=0,slot=5,port=1 
-device cxl-rp,id=hb1rp2,bus=cxl.1,chassis=0,slot=6,port=2 
-device cxl-rp,id=hb1rp3,bus=cxl.1,chassis=0,slot=7,port=3 
-device cxl-type3,bus=hb0rp0,memdev=cxl-mem0,id=cxl-dev0,lsa=cxl-lsa0 
-device cxl-type3,bus=hb0rp1,memdev=cxl-mem1,id=cxl-dev1,lsa=cxl-lsa1 
-device cxl-type3,bus=hb1rp0,memdev=cxl-mem2,id=cxl-dev2,lsa=cxl-lsa2 
-device cxl-type3,bus=hb1rp1,memdev=cxl-mem3,id=cxl-dev3,lsa=cxl-lsa3 
-device cxl-type3,bus=hb0rp2,volatile-memdev=cxl-mem4,id=cxl-dev4 
-device cxl-type3,bus=hb0rp3,volatile-memdev=cxl-mem5,id=cxl-dev5 
-device cxl-type3,bus=hb1rp2,volatile-memdev=cxl-mem6,id=cxl-dev6 
-device cxl-type3,bus=hb1rp3,volatile-memdev=cxl-mem7,id=cxl-dev7 
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k,cxl-fmw.1.targets.0=cxl.0,cxl-fmw.1.targets.1=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.1.interleave-granularity=8k 
-object memory-backend-ram,id=mem0,size=2048M 
-numa node,nodeid=0,memdev=mem0, 
-numa cpu,node-id=0,socket-id=0 
-object memory-backend-ram,id=mem1,size=2048M 
-numa node,nodeid=1,memdev=mem1, 
-numa cpu,node-id=1,socket-id=1 
-object memory-backend-ram,id=mem2,size=2048M 
-numa node,nodeid=2,memdev=mem2, 
-object memory-backend-ram,id=mem3,size=2048M 
-numa node,nodeid=3,memdev=mem3, 
-numa node,nodeid=4, 
-object memory-backend-file,id=nvmem0,share=on,mem-path=nvdimm-0,size=16384M,align=1G 
-device nvdimm,memdev=nvmem0,id=nv0,label-size=2M,node=4 
-numa node,nodeid=5, 
-object memory-backend-file,id=nvmem1,share=on,mem-path=nvdimm-1,size=16384M,align=1G 
-device nvdimm,memdev=nvmem1,id=nv1,label-size=2M,node=5 
-numa dist,src=0,dst=0,val=10 
-numa dist,src=0,dst=1,val=21 
-numa dist,src=0,dst=2,val=12 
-numa dist,src=0,dst=3,val=21 
-numa dist,src=0,dst=4,val=17 
-numa dist,src=0,dst=5,val=28 
-numa dist,src=1,dst=1,val=10 
-numa dist,src=1,dst=2,val=21 
-numa dist,src=1,dst=3,val=12 
-numa dist,src=1,dst=4,val=28 
-numa dist,src=1,dst=5,val=17 
-numa dist,src=2,dst=2,val=10 
-numa dist,src=2,dst=3,val=21 
-numa dist,src=2,dst=4,val=28 
-numa dist,src=2,dst=5,val=28 
-numa dist,src=3,dst=3,val=10 
-numa dist,src=3,dst=4,val=28 
-numa dist,src=3,dst=5,val=28 
-numa dist,src=4,dst=4,val=10 
-numa dist,src=4,dst=5,val=28 
-numa dist,src=5,dst=5,val=10
Jonathan Cameron Feb. 10, 2023, 12:43 p.m. UTC | #3
On Fri, 10 Feb 2023 11:15:44 +0100
Brice Goglin <Brice.Goglin@inria.fr> wrote:

> [..]
> 
> 
> Hello Vishal
> 
> Thanks a lot, things were failing because my kernel didn't have
> CONFIG_CXL_REGION_INVALIDATION_TEST=y. Now I am able to create a single
> ram region, either with a single device or multiple interleaved ones.
> 
> However I can't get multiple separate ram regions. If I boot a config
> like yours below, I get 4 ram devices. How can I create one region for
> each? Once I create the first one, others fail saying something like
> below. I tried using other decoders but it didn't help (I still need
> to read more CXL docs about decoders, why new ones appear when creating
> a region, etc).
> 
> cxl region: collect_memdevs: no active memdevs found: decoder: decoder0.0 filter: mem3

Hi Brice,

QEMU emulation currently only supports a single HDM decoder at each
level (HB, switch USP, EP), with the exception of the top-level CFMWS
entries, of which the example above has two. We should fix that...

For now, you should be able to do it with multiple pxb-cxl instances
with an appropriate CFMWS entry for each one. Which is horrible, but
might work for you in the meantime.

> 
> By the way, once configured in system ram, my CXL ram is merged into an
> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
> be part of a new NUMA node? I assume that's what's going to happen on
> real hardware?

We don't yet have kernel code to deal with assigning a new NUMA node.
Was on the todo list in last sync call I think.

Brice Goglin Feb. 10, 2023, 4:09 p.m. UTC | #4
On 10/02/2023 13:43, Jonathan Cameron wrote:
>
>> [..]
> Hi Brice,
>
> QEMU emulation currently only supports single HDM decoder at each level,
> so HB, Switch USP, EP (with exception of the CFMWS top level ones as shown
> in the example which has two of those). We should fix that...
>
> For now, you should be able to do it with multiple pxb-cxl instances with
> appropriate CFMWS entries for each one. Which is horrible but might work
> for you in the meantime.


Thanks Jonathan, this works fine:

   -object memory-backend-ram,id=vmem0,share=on,size=256M \
   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
   -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
   -object memory-backend-ram,id=vmem1,share=on,size=256M \
   -device pxb-cxl,bus_nr=14,bus=pcie.0,id=cxl.2 \
   -device cxl-rp,port=0,bus=cxl.2,id=root_port14,chassis=1,slot=2 \
   -device cxl-type3,bus=root_port14,volatile-memdev=vmem1,id=cxl-vmem1 \
   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=4G


>> By the way, once configured in system ram, my CXL ram is merged into an
>> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
>> be part of a new NUMA node? I assume that's what's going to happen on
>> real hardware?
> We don't yet have kernel code to deal with assigning a new NUMA node.
> Was on the todo list in last sync call I think.


Good to know, thanks again.

Brice
Dan Williams Feb. 11, 2023, 1:53 a.m. UTC | #5
Brice Goglin wrote:
[..]
> >> By the way, once configured in system ram, my CXL ram is merged into an
> >> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
> >> be part of a new NUMA node? I assume that's what's going to happen on
> >> real hardware?
> > We don't yet have kernel code to deal with assigning a new NUMA node.
> > Was on the todo list in last sync call I think.
> 
> 
> Good to known, thanks again.

In fact, there is no plan to support "new" NUMA node creation. A node
can only be onlined / populated from a set of static nodes defined by
platform-firmware. The set of static nodes is defined by the union of
all the proximity domain numbers in the SRAT as well as a node per
CFMWS / QTG id. See:

    fd49f99c1809 ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT

...for the CXL node enumeration scheme.

Once you have a node per CFMWS then it is up to CDAT and the QTG DSM to
group devices by window. This scheme attempts to be as simple as
possible, but no simpler. If more granularity is necessary in practice,
that would be a good discussion to have soonish. LSF/MM comes to mind.
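
[Editorial note: the static-node enumeration described above can be
sketched as a toy model. All values below are invented for
illustration; real node numbers come from platform firmware.]

```python
# Toy model of the static NUMA node enumeration described above: the
# OS-visible node set is the union of the SRAT proximity domains and
# one extra node per CFMWS window not already covered by SRAT
# (see commit fd49f99c1809). Numbers here are hypothetical.
srat_proximity_domains = {0, 1}   # e.g. two CPU sockets with local DRAM
cfmws_window_count = 2            # CXL fixed memory windows from ACPI

nodes = set(srat_proximity_domains)
next_node = max(nodes) + 1
for _ in range(cfmws_window_count):
    nodes.add(next_node)          # one new static node per window
    next_node += 1

print(sorted(nodes))  # prints [0, 1, 2, 3]
```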
Brice Goglin Feb. 11, 2023, 3:55 p.m. UTC | #6
On 11/02/2023 02:53, Dan Williams wrote:

> Brice Goglin wrote:
> [..]
>>>> By the way, once configured in system ram, my CXL ram is merged into an
>>>> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
>>>> be part of a new NUMA node? I assume that's what's going to happen on
>>>> real hardware?
>>> We don't yet have kernel code to deal with assigning a new NUMA node.
>>> Was on the todo list in last sync call I think.
>>
> In fact, there is no plan to support "new" NUMA node creation. A node
> can only be onlined / populated from set of static nodes defined by
> platform-firmware. The set of static nodes is defined by the union of
> all the proximity domain numbers in the SRAT as well as a node per
> CFMWS / QTG id. See:
>
>      fd49f99c1809 ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT
>
> ...for the CXL node enumeration scheme.
>
> Once you have a node per CFMWS then it is up to CDAT and the QTG DSM to
> group devices by window. This scheme attempts to be as simple as
> possible, but no simpler. If more granularity is necessary in practice,
> that would be a good discussion to have soonish.. LSF/MM comes to mind.

Actually I was mistaken, there's already a new NUMA node when creating
a region under Qemu, but my tools ignored it because it's empty.
After daxctl online-memory, things look good.

Can you clarify your above sentences on a real node? If I connect two
memory expanders on two slots of the same CPU, do I get a single CFMWS or two?
What if I connect two devices to a single slot across a CXL switch?

Brice
Dan Williams Feb. 13, 2023, 11:10 p.m. UTC | #7
Brice Goglin wrote:
> [..]
> 
> Actually I was mistaken, there's already a new NUMA node when creating
> a region under Qemu, but my tools ignored it because it's empty.
> After daxctl online-memory, things look good.
> 
> Can you clarify your above sentences on a real node? If I connect two
> memory expanders on two slots of the same CPU, do I get a single CFMWS or two?
> What if I connect two devices to a single slot across a CXL switch?

Ultimately the answer is "ask your platform vendor", because this is a
firmware decision. However, my expectation is that, since the ACPI HMAT
requires a proximity domain per distinct performance class, and since
it needs to distinguish the memory that is "attached" to a CPU
initiator domain, CXL will at a minimum be described in a proximity
domain distinct from "local DRAM".

The number of CFMWS windows published is gated by the degrees of freedom
platform-firmware wants to give the OS relative to the number of CXL
host-bridges in the system. One scheme that seems plausible is one CFMWS
window for each host-bridge / x1 interleave (to maximize RAS) and one
CFMWS with all host-bridges interleaved together (to maximize
performance).
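
[Editorial note: that plausible scheme can be sketched as a toy
example. Host-bridge names and window layout below are invented.]

```python
# Sketch of the CFMWS publication scheme described above: one x1
# window per host bridge (to maximize RAS) plus one window that
# interleaves all host bridges together (to maximize performance).
# A two-host-bridge platform would then publish three windows.
host_bridges = ["hb0", "hb1"]

windows = [{"targets": [hb], "interleave_ways": 1} for hb in host_bridges]
windows.append({"targets": list(host_bridges),
                "interleave_ways": len(host_bridges)})

print(len(windows))  # prints 3
```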

The above is just my personal opinion as a Linux kernel developer, a
platform implementation is free to be as restrictive or generous as it
wants with CFMWS resources.