Message ID: 20230120-vv-volatile-regions-v2-0-4ea6253000e5@intel.com
Series: cxl: add support for listing and creating volatile regions
On 08/02/2023 at 21:00, Vishal Verma wrote:
> While enumeration of ram type regions already works in libcxl and
> cxl-cli, it lacked an attribute to indicate pmem vs. ram. Add a new
> 'type' attribute to region listings to address this. Additionally, add
> support for creating ram regions to the cxl-create-region command. The
> region listings are also updated with dax-region information for
> volatile regions.
>
> This also includes fixes for a few bugs / usability issues identified
> along the way - patches 1, 4, and 6. Patch 5 is a usability improvement
> where, based on decoder capabilities, the type of a region can be
> inferred for the create-region command.
>
> These have been tested against the ram-region additions to cxl_test
> which are part of the kernel support patch set[1].
> Additionally, tested against qemu using a WIP branch for volatile
> support found here[2]. The 'run_qemu' script has a branch that creates
> volatile memdevs in addition to pmem ones. This is also in a branch[3]
> since it depends on [2].
>
> These cxl-cli / libcxl patches themselves are also available in a
> branch at [4].
>
> [1]: https://lore.kernel.org/linux-cxl/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com/
> [2]: https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-26
> [3]: https://github.com/pmem/run_qemu/commits/vv/ram-memdevs
> [4]: https://github.com/pmem/ndctl/tree/vv/volatile-regions

Hello Vishal

I am trying to play with this but all my attempts failed so far. Could
you provide Qemu and cxl-cli command-lines to get a volatile region
enabled in a Qemu VM?

Thanks
Brice
On Thu, 2023-02-09 at 12:04 +0100, Brice Goglin wrote:
> On 08/02/2023 at 21:00, Vishal Verma wrote:
[..]
> Hello Vishal
>
> I am trying to play with this but all my attempts failed so far. Could
> you provide Qemu and cxl-cli command-lines to get a volatile region
> enabled in a Qemu VM?

Hi Brice,

Greg had posted his working config in another thread:
https://lore.kernel.org/linux-cxl/Y9sMs0FGulQSIe9t@memverge.com/

I've also pasted below the qemu command line generated by the run_qemu
script I referenced.
(Note that this adds a bunch of stuff not strictly needed for a minimal
CXL configuration - you can certainly trim a lot of that out - this is
just the default setup that is generated and I usually run).

Feel free to post what errors / problems you're hitting and we can
debug further from there.

Thanks
Vishal

$ run_qemu.sh -g --cxl --cxl-debug --rw -r none --cmdline
/home/vverma7/git/qemu/build/qemu-system-x86_64
  -machine q35,accel=kvm,nvdimm=on,cxl=on
  -m 8192M,slots=4,maxmem=40964M
  -smp 8,sockets=2,cores=2,threads=2
  -enable-kvm
  -display none
  -nographic
  -drive if=pflash,format=raw,unit=0,file=OVMF_CODE.fd,readonly=on
  -drive if=pflash,format=raw,unit=1,file=OVMF_VARS.fd
  -debugcon file:uefi_debug.log
  -global isa-debugcon.iobase=0x402
  -drive file=root.img,format=raw,media=disk
  -kernel ./mkosi.extra/lib/modules/6.2.0-rc6+/vmlinuz
  -initrd mkosi.extra/boot/initramfs-6.2.0-rc2+.img
  -append selinux=0 audit=0 console=tty0 console=ttyS0 root=/dev/sda2 ignore_loglevel rw cxl_acpi.dyndbg=+fplm cxl_pci.dyndbg=+fplm cxl_core.dyndbg=+fplm cxl_mem.dyndbg=+fplm cxl_pmem.dyndbg=+fplm cxl_port.dyndbg=+fplm cxl_region.dyndbg=+fplm cxl_test.dyndbg=+fplm cxl_mock.dyndbg=+fplm cxl_mock_mem.dyndbg=+fplm memmap=2G!4G efi_fake_mem=2G@6G:0x40000
  -device e1000,netdev=net0,mac=52:54:00:12:34:56
  -netdev user,id=net0,hostfwd=tcp::10022-:22
  -object memory-backend-file,id=cxl-mem0,share=on,mem-path=cxltest0.raw,size=256M
  -object memory-backend-file,id=cxl-mem1,share=on,mem-path=cxltest1.raw,size=256M
  -object memory-backend-file,id=cxl-mem2,share=on,mem-path=cxltest2.raw,size=256M
  -object memory-backend-file,id=cxl-mem3,share=on,mem-path=cxltest3.raw,size=256M
  -object memory-backend-ram,id=cxl-mem4,share=on,size=256M
  -object memory-backend-ram,id=cxl-mem5,share=on,size=256M
  -object memory-backend-ram,id=cxl-mem6,share=on,size=256M
  -object memory-backend-ram,id=cxl-mem7,share=on,size=256M
  -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=lsa0.raw,size=1K
  -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=lsa1.raw,size=1K
  -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=lsa2.raw,size=1K
  -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=lsa3.raw,size=1K
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=53
  -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191
  -device cxl-rp,id=hb0rp0,bus=cxl.0,chassis=0,slot=0,port=0
  -device cxl-rp,id=hb0rp1,bus=cxl.0,chassis=0,slot=1,port=1
  -device cxl-rp,id=hb0rp2,bus=cxl.0,chassis=0,slot=2,port=2
  -device cxl-rp,id=hb0rp3,bus=cxl.0,chassis=0,slot=3,port=3
  -device cxl-rp,id=hb1rp0,bus=cxl.1,chassis=0,slot=4,port=0
  -device cxl-rp,id=hb1rp1,bus=cxl.1,chassis=0,slot=5,port=1
  -device cxl-rp,id=hb1rp2,bus=cxl.1,chassis=0,slot=6,port=2
  -device cxl-rp,id=hb1rp3,bus=cxl.1,chassis=0,slot=7,port=3
  -device cxl-type3,bus=hb0rp0,memdev=cxl-mem0,id=cxl-dev0,lsa=cxl-lsa0
  -device cxl-type3,bus=hb0rp1,memdev=cxl-mem1,id=cxl-dev1,lsa=cxl-lsa1
  -device cxl-type3,bus=hb1rp0,memdev=cxl-mem2,id=cxl-dev2,lsa=cxl-lsa2
  -device cxl-type3,bus=hb1rp1,memdev=cxl-mem3,id=cxl-dev3,lsa=cxl-lsa3
  -device cxl-type3,bus=hb0rp2,volatile-memdev=cxl-mem4,id=cxl-dev4
  -device cxl-type3,bus=hb0rp3,volatile-memdev=cxl-mem5,id=cxl-dev5
  -device cxl-type3,bus=hb1rp2,volatile-memdev=cxl-mem6,id=cxl-dev6
  -device cxl-type3,bus=hb1rp3,volatile-memdev=cxl-mem7,id=cxl-dev7
  -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k,cxl-fmw.1.targets.0=cxl.0,cxl-fmw.1.targets.1=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.1.interleave-granularity=8k
  -object memory-backend-ram,id=mem0,size=2048M
  -numa node,nodeid=0,memdev=mem0,
  -numa cpu,node-id=0,socket-id=0
  -object memory-backend-ram,id=mem1,size=2048M
  -numa node,nodeid=1,memdev=mem1,
  -numa cpu,node-id=1,socket-id=1
  -object memory-backend-ram,id=mem2,size=2048M
  -numa node,nodeid=2,memdev=mem2,
  -object memory-backend-ram,id=mem3,size=2048M
  -numa node,nodeid=3,memdev=mem3,
  -numa node,nodeid=4,
  -object memory-backend-file,id=nvmem0,share=on,mem-path=nvdimm-0,size=16384M,align=1G
  -device nvdimm,memdev=nvmem0,id=nv0,label-size=2M,node=4
  -numa node,nodeid=5,
  -object memory-backend-file,id=nvmem1,share=on,mem-path=nvdimm-1,size=16384M,align=1G
  -device nvdimm,memdev=nvmem1,id=nv1,label-size=2M,node=5
  -numa dist,src=0,dst=0,val=10
  -numa dist,src=0,dst=1,val=21
  -numa dist,src=0,dst=2,val=12
  -numa dist,src=0,dst=3,val=21
  -numa dist,src=0,dst=4,val=17
  -numa dist,src=0,dst=5,val=28
  -numa dist,src=1,dst=1,val=10
  -numa dist,src=1,dst=2,val=21
  -numa dist,src=1,dst=3,val=12
  -numa dist,src=1,dst=4,val=28
  -numa dist,src=1,dst=5,val=17
  -numa dist,src=2,dst=2,val=10
  -numa dist,src=2,dst=3,val=21
  -numa dist,src=2,dst=4,val=28
  -numa dist,src=2,dst=5,val=28
  -numa dist,src=3,dst=3,val=10
  -numa dist,src=3,dst=4,val=28
  -numa dist,src=3,dst=5,val=28
  -numa dist,src=4,dst=4,val=10
  -numa dist,src=4,dst=5,val=28
  -numa dist,src=5,dst=5,val=10
On Fri, 10 Feb 2023 11:15:44 +0100 Brice Goglin <Brice.Goglin@inria.fr> wrote:
> On 09/02/2023 at 20:17, Verma, Vishal L wrote:
> > On Thu, 2023-02-09 at 12:04 +0100, Brice Goglin wrote:
> > > Hello Vishal
> > >
> > > I am trying to play with this but all my attempts failed so far. Could
> > > you provide Qemu and cxl-cli command-lines to get a volatile region
> > > enabled in a Qemu VM?
> > Hi Brice,
> >
> > Greg had posted his working config in another thread:
> > https://lore.kernel.org/linux-cxl/Y9sMs0FGulQSIe9t@memverge.com/
> >
> > I've also pasted below the qemu command line generated by the run_qemu
> > script I referenced. (Note that this adds a bunch of stuff not strictly
> > needed for a minimal CXL configuration - you can certainly trim a lot
> > of that out - this is just the default setup that is generated and I
> > usually run).
>
> Hello Vishal
>
> Thanks a lot, things were failing because my kernel didn't have
> CONFIG_CXL_REGION_INVALIDATION_TEST=y. Now I am able to create a single
> ram region, either with a single device or multiple interleaved ones.
>
> However I can't get multiple separate ram regions. If I boot a config
> like yours below, I get 4 ram devices. How can I create one region for
> each? Once I create the first one, others fail saying something like
> below. I tried using other decoders but it didn't help (I still need
> to read more CXL docs about decoders, why new ones appear when creating
> a region, etc).
>
> cxl region: collect_memdevs: no active memdevs found: decoder: decoder0.0 filter: mem3

Hi Brice,

QEMU emulation currently only supports a single HDM decoder at each level,
so HB, Switch USP, EP (with the exception of the CFMWS top level ones, as
shown in the example, which has two of those). We should fix that...

For now, you should be able to do it with multiple pxb-cxl instances with
appropriate CFMWS entries for each one. Which is horrible but might work
for you in the meantime.

> By the way, once configured in system ram, my CXL ram is merged into an
> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
> be part of a new NUMA node? I assume that's what's going to happen on
> real hardware?

We don't yet have kernel code to deal with assigning a new NUMA node.
Was on the todo list in last sync call I think.

> Thanks
> Brice
[..]
On 10/02/2023 at 13:43, Jonathan Cameron wrote:
> > Hello Vishal
> >
> > Thanks a lot, things were failing because my kernel didn't have
> > CONFIG_CXL_REGION_INVALIDATION_TEST=y. Now I am able to create a single
> > ram region, either with a single device or multiple interleaved ones.
> >
> > However I can't get multiple separate ram regions. If I boot a config
> > like yours below, I get 4 ram devices. How can I create one region for
> > each? Once I create the first one, others fail saying something like
> > below. I tried using other decoders but it didn't help (I still need
> > to read more CXL docs about decoders, why new ones appear when creating
> > a region, etc).
> >
> > cxl region: collect_memdevs: no active memdevs found: decoder: decoder0.0 filter: mem3
>
> Hi Brice,
>
> QEMU emulation currently only supports a single HDM decoder at each level,
> so HB, Switch USP, EP (with the exception of the CFMWS top level ones, as
> shown in the example, which has two of those). We should fix that...
>
> For now, you should be able to do it with multiple pxb-cxl instances with
> appropriate CFMWS entries for each one. Which is horrible but might work
> for you in the meantime.

Thanks Jonathan, this works fine:

-object memory-backend-ram,id=vmem0,share=on,size=256M \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
-object memory-backend-ram,id=vmem1,share=on,size=256M \
-device pxb-cxl,bus_nr=14,bus=pcie.0,id=cxl.2 \
-device cxl-rp,port=0,bus=cxl.2,id=root_port14,chassis=1,slot=2 \
-device cxl-type3,bus=root_port14,volatile-memdev=vmem1,id=cxl-vmem1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.2,cxl-fmw.1.size=4G

> > By the way, once configured in system ram, my CXL ram is merged into an
> > existing "normal" NUMA node. How do I tell Qemu that a CXL region should
> > be part of a new NUMA node? I assume that's what's going to happen on
> > real hardware?
> We don't yet have kernel code to deal with assigning a new NUMA node.
> Was on the todo list in last sync call I think.

Good to know, thanks again.

Brice
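For completeness, once the guest sees one root decoder per host bridge, the per-decoder ram regions can be created with the cxl-cli from this series along these lines. This is a sketch, not taken verbatim from the thread: the decoder and memdev names (decoder0.0, decoder1.0, mem0, mem1) are assumptions to be checked against `cxl list` output on the actual system.

```shell
# Enumerate decoders and memdevs first; the names used below are examples.
cxl list -D
cxl list -M

# One ram region per root decoder. With patch 5 of this series the region
# type can be inferred from the decoder's capabilities, so an explicit
# type option should not be required for a volatile-only decoder.
cxl create-region -m -d decoder0.0 mem0
cxl create-region -m -d decoder1.0 mem1
```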
Brice Goglin wrote:
[..]
> >> By the way, once configured in system ram, my CXL ram is merged into an
> >> existing "normal" NUMA node. How do I tell Qemu that a CXL region should
> >> be part of a new NUMA node? I assume that's what's going to happen on
> >> real hardware?
> > We don't yet have kernel code to deal with assigning a new NUMA node.
> > Was on the todo list in last sync call I think.
>
> Good to know, thanks again.

In fact, there is no plan to support "new" NUMA node creation. A node
can only be onlined / populated from the set of static nodes defined by
platform-firmware. The set of static nodes is defined by the union of
all the proximity domain numbers in the SRAT as well as a node per
CFMWS / QTG id. See:

    fd49f99c1809 ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT

...for the CXL node enumeration scheme.

Once you have a node per CFMWS then it is up to CDAT and the QTG DSM to
group devices by window. This scheme attempts to be as simple as
possible, but no simpler. If more granularity is necessary in practice,
that would be a good discussion to have soonish.. LSF/MM comes to mind.
On 11/02/2023 at 02:53, Dan Williams wrote:
> Brice Goglin wrote:
> [..]
[..]
> In fact, there is no plan to support "new" NUMA node creation. A node
> can only be onlined / populated from the set of static nodes defined by
> platform-firmware. The set of static nodes is defined by the union of
> all the proximity domain numbers in the SRAT as well as a node per
> CFMWS / QTG id. See:
>
> fd49f99c1809 ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT
>
> ...for the CXL node enumeration scheme.
>
> Once you have a node per CFMWS then it is up to CDAT and the QTG DSM to
> group devices by window. This scheme attempts to be as simple as
> possible, but no simpler. If more granularity is necessary in practice,
> that would be a good discussion to have soonish.. LSF/MM comes to mind.

Actually I was mistaken, there's already a new NUMA node when creating
a region under Qemu, but my tools ignored it because it's empty.
After daxctl online-memory, things look good.

Can you clarify what the above means on real hardware? If I connect two
memory expanders to two slots of the same CPU, do I get a single CFMWS
or two? What if I connect two devices to a single slot across a CXL
switch?

Brice
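For reference, the dax-region step Brice describes maps to daxctl commands roughly as follows. This is a sketch; the device name dax0.0 is an assumption, so check the listing first.

```shell
# List the dax regions and devices created for the new CXL ram region:
daxctl list -R -D

# If the device came up in devdax mode, convert it so the kernel can
# use the capacity as system-ram (this also onlines the memory blocks
# by default):
daxctl reconfigure-device --mode=system-ram dax0.0

# If the device is already in system-ram mode but its memory is offline:
daxctl online-memory dax0.0
```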
Brice Goglin wrote:
[..]
> Actually I was mistaken, there's already a new NUMA node when creating
> a region under Qemu, but my tools ignored it because it's empty.
> After daxctl online-memory, things look good.
>
> Can you clarify your above sentences on a real node? If I connect two
> memory expanders on two slots of the same CPU, do I get a single CFMWS or two?
> What if I connect two devices to a single slot across a CXL switch?

Ultimately the answer is "ask your platform vendor", because this is a
firmware decision.
However, my expectation is that, since the ACPI HMAT requires a proximity
domain per distinct performance class, and since it needs to distinguish
the memory that is "attached" to a CPU initiator domain, CXL will at a
minimum be described in a proximity domain distinct from "local DRAM".

The number of CFMWS windows published is gated by the degrees of freedom
platform-firmware wants to give the OS relative to the number of CXL
host-bridges in the system. One scheme that seems plausible is one CFMWS
window for each host-bridge / x1 interleave (to maximize RAS) and one
CFMWS with all host-bridges interleaved together (to maximize
performance).

The above is just my personal opinion as a Linux kernel developer; a
platform implementation is free to be as restrictive or generous as it
wants with CFMWS resources.
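Using the cxl-fmw syntax from the qemu command lines earlier in the thread, that plausible scheme could be sketched for a two-host-bridge setup as below. This is a hypothetical fragment, not from the thread; the sizes and interleave granularity are arbitrary placeholders.

```shell
# One x1 CFMWS per host bridge (maximize RAS), plus one CFMWS
# interleaving both host bridges (maximize performance):
-M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,\
cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,\
cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.targets.1=cxl.1,cxl-fmw.2.size=8G,cxl-fmw.2.interleave-granularity=8k
```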
While enumeration of ram type regions already works in libcxl and
cxl-cli, it lacked an attribute to indicate pmem vs. ram. Add a new
'type' attribute to region listings to address this. Additionally, add
support for creating ram regions to the cxl-create-region command. The
region listings are also updated with dax-region information for
volatile regions.

This also includes fixes for a few bugs / usability issues identified
along the way - patches 1, 4, and 6. Patch 5 is a usability improvement
where, based on decoder capabilities, the type of a region can be
inferred for the create-region command.

These have been tested against the ram-region additions to cxl_test
which are part of the kernel support patch set[1]. Additionally, tested
against qemu using a WIP branch for volatile support found here[2]. The
'run_qemu' script has a branch that creates volatile memdevs in
addition to pmem ones. This is also in a branch[3] since it depends
on [2].

These cxl-cli / libcxl patches themselves are also available in a
branch at [4].
[1]: https://lore.kernel.org/linux-cxl/167564534874.847146.5222419648551436750.stgit@dwillia2-xfh.jf.intel.com/
[2]: https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-26
[3]: https://github.com/pmem/run_qemu/commits/vv/ram-memdevs
[4]: https://github.com/pmem/ndctl/tree/vv/volatile-regions

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
Changes in v2:
- Fix typos in the commit message of patch 1 (Fan)
- Gate the type attr in region listings on mode != 'none' (Dan)
- Clarify unreachability of the default case in collect_minsize() (Ira)
- Simplify the mode setup in set_type_from_decoder() (Dan)
- Fix typo in the commit message of Patch 7 (Dan)
- Remove unneeded daxctl/json.h include from cxl/filter.c (Dan)
- Link to v1: https://lore.kernel.org/r/20230120-vv-volatile-regions-v1-0-b42b21ee8d0b@intel.com

---
Dan Williams (2):
      cxl/list: Include regions in the verbose listing
      cxl/list: Enumerate device-dax properties for regions

Vishal Verma (5):
      cxl/region: skip region_actions for region creation
      cxl: add a type attribute to region listings
      cxl: add core plumbing for creation of ram regions
      cxl/region: accept user-supplied UUIDs for pmem regions
      cxl/region: determine region type based on root decoder capability

 Documentation/cxl/cxl-create-region.txt |  6 ++-
 Documentation/cxl/cxl-list.txt          | 31 ++++++++++++++
 Documentation/cxl/lib/libcxl.txt        |  8 ++++
 cxl/lib/private.h                       |  2 +
 cxl/lib/libcxl.c                        | 72 +++++++++++++++++++++++++++++++--
 cxl/filter.h                            |  3 ++
 cxl/libcxl.h                            |  3 ++
 cxl/json.c                              | 23 +++++++++++
 cxl/list.c                              |  3 ++
 cxl/region.c                            | 66 +++++++++++++++++++++++++++---
 cxl/lib/libcxl.sym                      |  7 ++++
 cxl/lib/meson.build                     |  1 +
 cxl/meson.build                         |  3 ++
 13 files changed, 217 insertions(+), 11 deletions(-)
---
base-commit: 08720628d2ba469e203a18c0b1ffbd90f4bfab1d
change-id: 20230120-vv-volatile-regions-063950cef590

Best regards,