Message ID: 20240101075315.43167-2-horenchuang@bytedance.com
State: New, archived
Series: Introduce HostMemType for 'memory-backend-*'
Hi Jack,

On 1/1/24 08:53, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
>
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
>
> In v1, we plan to move most of the implementations to util and break down
> this patch into different smaller patches.
>
> Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>  backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
>  include/sysemu/hostmem.h |   1 +
>  qapi/common.json         |  19 ++++
>  qapi/qom.json            |   1 +
>  qemu-options.hx          |   2 +-
>  5 files changed, 206 insertions(+), 1 deletion(-)
>
> diff --git a/backends/hostmem.c b/backends/hostmem.c

> +#define CXL_DEVICE_PATH "/sys/bus/cxl/devices/"
> +#define REGION_PATH_LEN 307
> +#define DAX_REGION_PATH_LEN 563
> +#define DAX_PATH_LEN 819
> +#define TARGET_FILE_PATH_LEN 831

How do you get these numbers?

> diff --git a/qapi/common.json b/qapi/common.json
> index 6fed9cde1a..591fd73291 100644
> --- a/qapi/common.json
> +++ b/qapi/common.json
> @@ -167,6 +167,25 @@
>  { 'enum': 'HostMemPolicy',
>    'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
>
> +##
> +# @HostMemType:
> +#
> +# Automatically find a backend memory type on host.

This description is not clear (to me).

> +# Can be further extened to support other types such as cxlpmem, hbm.

Typo "extended", although I'm not sure it is helpful to mention it.

> +#
> +# @none: do nothing (default).

"do nothing" is confusing here, I'd drop it.

> +#
> +# @cxlram: a CXL RAM backend on host.
> +#
> +# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
> +#       time. HostMemType is used to automatically bind with one kind of
> +#       host memory types.
> +#
> +# Since: 8.3

9.0

> +##
> +{ 'enum': 'HostMemType',
> +  'data': [ 'none', 'cxlram' ] }
> +
>  ##
>  # @NetFilterDirection:
>  #

> diff --git a/qapi/qom.json b/qapi/qom.json
> index 95516ba325..fa3bc29708 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -626,6 +626,7 @@
>      '*host-nodes': ['uint16'],
>      '*merge': 'bool',
>      '*policy': 'HostMemPolicy',
> +    '*host-mem-type': 'HostMemType',

Missing documentation in MemoryBackendProperties.

>      '*prealloc': 'bool',
>      '*prealloc-threads': 'uint32',
>      '*prealloc-context': 'str',
On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
>
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
>

You can achieve the exact same thing already simply by using memory
policies and detecting the node(s) before calling QEMU, no?

There has to be a good reason to add such a shortcut into QEMU, and it
should be spelled out here.
On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
>
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \

Stupid questions:

Why not just use `host-nodes` and pass in the numa node you want to
allocate from? Why should QEMU be made "CXL-aware" in the sense that
QEMU is responsible for figuring out what host node has CXL memory?

This feels like an "upper level software" operation (orchestration), rather
than something qemu should internally understand.

> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \

For a variety of performance reasons, this will not work the way you
want it to. You are essentially telling QEMU to map the vmem0 into a
virtual cxl device, and now any memory accesses to that memory region
will end up going through the cxl-type3 device logic - which is an IO
path from the perspective of QEMU.

You may want to double check that your tests aren't using swap space in
the guest, because I found in my testing that linux would prefer to use
swap rather than attempt to use virtual cxl-type3 device memory because
of how god-awful slow it is (even if it is backed by DRAM).

Additionally, this configuration will not (should not) presently work
with VMX enabled. Unless I missed some other update, QEMU w/ CXL memory
presently crashes when VMX is enabled for not-yet-debugged reasons.

Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxl-cli. I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.

If I understand the goal here, it's to pass CXL-hosted DRAM through to
the guest in a way that the system can manage it according to its
performance attributes. You're better off just using the `host-nodes`
field of host-memory and passing bandwidth/latency attributes through
via `-numa hmat-lb`.

In that scenario, the guest software doesn't even need to know CXL
exists at all, it can just read the attributes of the numa node
that QEMU created for it.

In the future to deal with proper dynamic capacity, we may need to
consider a different backend object altogether that allows sparse
allocations, and a virtual cxl device which pre-allocates the CFMW
can at least be registered to manage it. I'm not quite sure how that
looks just yet.

For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
on a cpuless node.

qemu-system-x86_64 \
-nographic \
-accel kvm \
-machine type=q35,hmat=on \
-drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 8G,slots=4,maxmem=16G \
-smp cpus=4 \
-object memory-backend-ram,size=4G,id=ram-node0,numa=X \    <-- extend here
-object memory-backend-ram,size=4G,id=ram-node1,numa=Y \    <-- extend here
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \             <-- cpu node
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \          <-- cpuless node
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880

[root@fedora ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 3965 MB
node 0 free: 3611 MB
node 1 cpus:
node 1 size: 3986 MB
node 1 free: 3960 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

[root@fedora initiators]# cd /sys/devices/system/node/node1/access0/initiators
node0  read_bandwidth  read_latency  write_bandwidth  write_latency
[root@fedora initiators]# cat read_bandwidth
5
[root@fedora initiators]# cat read_latency
20

~Gregory
On Tue, Jan 2, 2024 at 5:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> > -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> > -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> >
>
> You can achieve the exact same thing already simply by using memory
> policies and detecting the node(s) before calling QEMU, no?

Yes, I agree this can be done with memory policy, binding to the CXL
memory NUMA nodes on the host.

>
> There has to be a good reason to add such a shortcut into QEMU, and it
> should be spelled out here.

Our end goal here is to enable CXL memory in the guest VM and have the
guest kernel place the CXL memory in the correct memory tier (slow tier)
of the Linux kernel's tiered memory system. Here is what we observed:

* The kernel tiered memory system relies on calculating the memory
  attributes (read/write latency and bandwidth, from ACPI) to distinguish
  the fast and slow tiers.
* The kernel tiered memory system has two paths that place memory into a
  tier: 1) in mm subsystem init, memory_tier_init, and 2) in the kmem
  driver's device probe, dev_dax_kmem_probe. Since the ACPI subsystem is
  initialized after mm, reading the memory attributes from ACPI can only
  be done in 2). CXL memory therefore has to be presented as a devdax
  device, which can then be probed by the kmem driver in the guest and
  recognized as the slow tier.

We do see that QEMU has the "-numa hmat-lb" option to set the memory
attributes per VM NUMA node. The problem is that setting the memory
attributes per NUMA node means the NUMA node is created during guest
kernel initialization, while a CXL devdax device can only be created
after kernel initialization, and new NUMA nodes are created for the CXL
devdax devices. The guest kernel does not read the memory attributes
from "-numa hmat-lb" for the CXL devdax devices.

So we thought that if we create an explicit CXL memory backend and
associate it with the virtual CXL type-3 frontend, we can pass the CXL
memory attributes from the host into the guest VM and avoid using memory
policy and "-numa hmat-lb", thus simplifying the configuration. We are
still figuring out exactly how to pass the memory attributes from the
CXL backend into the VM. There is probably a solution involving
"-numa hmat-lb" for the CXL devdax devices as well, and we are also
looking into it.

>
> --
> Cheers,
>
> David / dhildenb
>
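[Editor's note: the guest-side devdax/kmem flow described above is normally driven with the cxl-cli and daxctl tools. A hypothetical sketch; the device and decoder names (mem0, decoder0.0, dax0.0) are examples, not taken from the thread:]

```shell
# Hypothetical guest-side flow for onlining CXL memory as a slow tier.
# Assemble the CXL memdev into a ram region (names are illustrative):
cxl create-region -m -d decoder0.0 -w 1 mem0

# Hand the resulting devdax device to the kmem driver; dev_dax_kmem_probe
# runs here, reads the attributes, and the memory appears as a new NUMA node:
daxctl reconfigure-device --mode=system-ram dax0.0

# Verify the new (CPU-less, slow-tier) node:
numactl -H
```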
On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> > -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
>
> Stupid questions:
>
> Why not just use `host-nodes` and pass in the numa node you want to
> allocate from? Why should QEMU be made "CXL-aware" in the sense that
> QEMU is responsible for figuring out what host node has CXL memory?
>
> This feels like an "upper level software" operation (orchestration), rather
> than something qemu should internally understand.

I don't have a "big picture" and I am learning. Maybe we proposed
something not useful :-) I replied to the same question on a fork of
this thread.

>
> > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
>
> For a variety of performance reasons, this will not work the way you
> want it to. You are essentially telling QEMU to map the vmem0 into a
> virtual cxl device, and now any memory accesses to that memory region
> will end up going through the cxl-type3 device logic - which is an IO
> path from the perspective of QEMU.

I didn't understand exactly how the virtual cxl-type3 device works. I
thought it would go through the same "guest virtual address -> guest
physical address -> host physical address" translation done entirely by
the CPU. But if it goes through an emulation path handled by the virtual
cxl-type3 device, I agree the performance would be bad. Do you know why
accessing memory on a virtual cxl-type3 device can't go through nested
page table translation?

>
> You may want to double check that your tests aren't using swap space in
> the guest, because I found in my testing that linux would prefer to use
> swap rather than attempt to use virtual cxl-type3 device memory because
> of how god-awful slow it is (even if it is backed by DRAM).

We didn't enable swap in our current test setup. I think there was a
kernel change that made the mm page reclamation path use CXL memory
instead of swap if memory tiering is enabled. Did you try that? Swap is
persistent storage. I would be very surprised if virtual CXL is
actually slower than swap.

>
> Additionally, this configuration will not (should not) presently work
> with VMX enabled. Unless I missed some other update, QEMU w/ CXL memory
> presently crashes when VMX is enabled for not-yet-debugged reasons.

When we had a discussion with Intel, they told us not to use the KVM
option in QEMU while using a virtual CXL type-3 device. That's probably
related to the issue you described here? We enabled KVM though and
haven't seen the crash yet.

>
> Another possibility: You mapped this memory-backend into another numa
> node explicitly and never onlined the memory via cxl-cli. I've done
> this, and it works, but it's a "hidden feature" that probably should
> not exist / be supported.

I thought general purpose memory nodes are onlined by default?

>
> If I understand the goal here, it's to pass CXL-hosted DRAM through to
> the guest in a way that the system can manage it according to its
> performance attributes.

Yes.

>
> You're better off just using the `host-nodes` field of host-memory
> and passing bandwidth/latency attributes through via `-numa hmat-lb`

We tried this but it doesn't work from end to end right now. I
described the issue in another fork of this thread.

>
> In that scenario, the guest software doesn't even need to know CXL
> exists at all, it can just read the attributes of the numa node
> that QEMU created for it.

We thought about this before. But the current kernel implementation
requires a devdax device to be probed and recognized as a slow tier
(by reading the memory attributes). I don't think this can be done via
the path you described. Have you tried this before?

>
> In the future to deal with proper dynamic capacity, we may need to
> consider a different backend object altogether that allows sparse
> allocations, and a virtual cxl device which pre-allocates the CFMW
> can at least be registered to manage it. I'm not quite sure how that
> looks just yet.

Are we talking about CXL memory pooling?

>
> For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
> on a cpuless node.
>
> qemu-system-x86_64 \
> -nographic \
> -accel kvm \
> -machine type=q35,hmat=on \
> -drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 8G,slots=4,maxmem=16G \
> -smp cpus=4 \
> -object memory-backend-ram,size=4G,id=ram-node0,numa=X \    <-- extend here
> -object memory-backend-ram,size=4G,id=ram-node1,numa=Y \    <-- extend here
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \             <-- cpu node
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \          <-- cpuless node
> -netdev bridge,id=hn0,br=virbr0 \
> -device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> [root@fedora ~]# numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 3965 MB
> node 0 free: 3611 MB
> node 1 cpus:
> node 1 size: 3986 MB
> node 1 free: 3960 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> [root@fedora initiators]# cd /sys/devices/system/node/node1/access0/initiators
> node0  read_bandwidth  read_latency  write_bandwidth  write_latency
> [root@fedora initiators]# cat read_bandwidth
> 5
> [root@fedora initiators]# cat read_latency
> 20
>
> ~Gregory
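[Editor's note: the swap-vs-demotion behavior debated above is gated by a kernel knob on tiering-capable kernels. A hypothetical check; the sysfs path is as found on recent kernels, not quoted from the thread:]

```shell
# On kernels with memory tiering support, reclaim only demotes cold pages
# to a slower NUMA node when this knob is enabled; otherwise anonymous
# pages fall back to swap:
cat /sys/kernel/mm/numa/demotion_enabled
echo true | sudo tee /sys/kernel/mm/numa/demotion_enabled
```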
On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > For a variety of performance reasons, this will not work the way you
> > want it to. You are essentially telling QEMU to map the vmem0 into a
> > virtual cxl device, and now any memory accesses to that memory region
> > will end up going through the cxl-type3 device logic - which is an IO
> > path from the perspective of QEMU.
>
> I didn't understand exactly how the virtual cxl-type3 device works. I
> thought it would go with the same "guest virtual address -> guest
> physical address -> host physical address" translation totally done by
> CPU. But if it is going through an emulation path handled by virtual
> cxl-type3, I agree the performance would be bad. Do you know why
> accessing memory on a virtual cxl-type3 device can't go with the
> nested page table translation?
>

Because a byte-access on CXL memory can have checks on it that must be
emulated by the virtual device, and because there are caching
implications that have to be emulated as well.

The cxl device you are using is an emulated CXL device - not a
virtualization interface. Nuanced difference: the emulated device has
to emulate *everything* that CXL device does.

What you want is passthrough / managed access to a real device -
virtualization. This is not the way to accomplish that. A better way
to accomplish that is to simply pass the memory through as a static numa
node as I described.

> When we had a discussion with Intel, they told us to not use the KVM
> option in QEMU while using virtual cxl type3 device. That's probably
> related to the issue you described here? We enabled KVM though but
> haven't seen the crash yet.
>

The crash really only happens, IIRC, if code ends up hosted in that
memory. I forget the exact scenario, but the working theory is it has
to do with the way instruction caches are managed with KVM and this
device.

> > You're better off just using the `host-nodes` field of host-memory
> > and passing bandwidth/latency attributes though via `-numa hmat-lb`
>
> We tried this but it doesn't work from end to end right now. I
> described the issue in another fork of this thread.
>
> > In that scenario, the guest software doesn't even need to know CXL
> > exists at all, it can just read the attributes of the numa node
> > that QEMU created for it.
>
> We thought about this before. But the current kernel implementation
> requires a devdax device to be probed and recognized as a slow tier
> (by reading the memory attributes). I don't think this can be done via
> the path you described. Have you tried this before?
>

Right, because the memory tiering component lumps the nodes together.

Better idea: Fix the memory tiering component.

I cc'd you on another patch line that is discussing something relevant
to this.

https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8

The point is: There's no need for this to be a dax device at all, there
is no need for the guest to even know what is providing the memory, or
for the guest to have any management access to the memory. It just
wants the memory and the ability to tier it.

So we should fix the memory tiering component to work with this
workflow.

~Gregory
On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > For a variety of performance reasons, this will not work the way you
> > > want it to. You are essentially telling QEMU to map the vmem0 into a
> > > virtual cxl device, and now any memory accesses to that memory region
> > > will end up going through the cxl-type3 device logic - which is an IO
> > > path from the perspective of QEMU.
> >
> > I didn't understand exactly how the virtual cxl-type3 device works. I
> > thought it would go with the same "guest virtual address -> guest
> > physical address -> host physical address" translation totally done by
> > CPU. But if it is going through an emulation path handled by virtual
> > cxl-type3, I agree the performance would be bad. Do you know why
> > accessing memory on a virtual cxl-type3 device can't go with the
> > nested page table translation?
> >
>
> Because a byte-access on CXL memory can have checks on it that must be
> emulated by the virtual device, and because there are caching
> implications that have to be emulated as well.

Interesting. Now I see cxl_type3_read/cxl_type3_write. If the CXL
memory data path goes through them, the performance would be pretty
problematic. We have actually run Intel's Memory Latency Checker
benchmark from inside a guest VM with both system DRAM and virtual
CXL type-3 memory configured. The idle latency of the virtual CXL
memory is 2X that of system DRAM, which is on par with the benchmark
running on a physical host. I need to debug this more to understand
why the latency is actually much better than I would expect now.

>
> The cxl device you are using is an emulated CXL device - not a
> virtualization interface. Nuanced difference: the emulated device has
> to emulate *everything* that CXL device does.
>
> What you want is passthrough / managed access to a real device -
> virtualization. This is not the way to accomplish that. A better way
> to accomplish that is to simply pass the memory through as a static numa
> node as I described.

That would work, too. But I think a kernel change is required to
establish the correct memory tiering if we go this route.

>
> > When we had a discussion with Intel, they told us to not use the KVM
> > option in QEMU while using virtual cxl type3 device. That's probably
> > related to the issue you described here? We enabled KVM though but
> > haven't seen the crash yet.
> >
>
> The crash really only happens, IIRC, if code ends up hosted in that
> memory. I forget the exact scenario, but the working theory is it has
> to do with the way instruction caches are managed with KVM and this
> device.
>
> > > You're better off just using the `host-nodes` field of host-memory
> > > and passing bandwidth/latency attributes though via `-numa hmat-lb`
> >
> > We tried this but it doesn't work from end to end right now. I
> > described the issue in another fork of this thread.
> >
> > > In that scenario, the guest software doesn't even need to know CXL
> > > exists at all, it can just read the attributes of the numa node
> > > that QEMU created for it.
> >
> > We thought about this before. But the current kernel implementation
> > requires a devdax device to be probed and recognized as a slow tier
> > (by reading the memory attributes). I don't think this can be done via
> > the path you described. Have you tried this before?
> >
>
> Right, because the memory tiering component lumps the nodes together.
>
> Better idea: Fix the memory tiering component
>
> I cc'd you on another patch line that is discussing something relevant
> to this.
>
> https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
>
> The point is: There's no need for this to be a dax device at all, there
> is no need for the guest to even know what is providing the memory, or
> for the guest to have any management access to the memory. It just
> wants the memory and the ability to tier it.
>
> So we should fix the memory tiering component to work with this
> workflow.

Agreed. We really don't need the devdax device at all. I thought that
choice was made due to the memory tiering concept being started with
pmem ... Let's continue this part of the discussion on the above
thread.

>
> ~Gregory
On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
>
> On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > >
> > > > For a variety of performance reasons, this will not work the way you
> > > > want it to. You are essentially telling QEMU to map the vmem0 into a
> > > > virtual cxl device, and now any memory accesses to that memory region
> > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > path from the perspective of QEMU.
> > >
> > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > thought it would go with the same "guest virtual address -> guest
> > > physical address -> host physical address" translation totally done by
> > > CPU. But if it is going through an emulation path handled by virtual
> > > cxl-type3, I agree the performance would be bad. Do you know why
> > > accessing memory on a virtual cxl-type3 device can't go with the
> > > nested page table translation?
> > >
> >
> > Because a byte-access on CXL memory can have checks on it that must be
> > emulated by the virtual device, and because there are caching
> > implications that have to be emulated as well.
>
> Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> CXL memory data path goes through them, the performance would be
> pretty problematic. We have actually run Intel's Memory Latency
> Checker benchmark from inside a guest VM with both system-DRAM and
> virtual CXL-type3 configured. The idle latency on the virtual CXL
> memory is 2X of system DRAM, which is on-par with the benchmark
> running from a physical host. I need to debug this more to understand
> why the latency is actually much better than I would expect now.

So we double checked on benchmark testing. What we see is that running
Intel Memory Latency Checker from a guest VM with virtual CXL memory
vs. from a physical host with a CXL 1.1 memory expander gives the same
latency.

From the guest VM: local socket system-DRAM latency is 117.0ns, local
socket CXL-DRAM latency is 269.4ns.
From the physical host: local socket system-DRAM latency is 113.6ns,
local socket CXL-DRAM latency is 267.5ns.

I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
while running the benchmark testing but those two functions are never
hit. We used the virtual CXL configuration while launching QEMU but
the CXL memory is present as a separate NUMA node and we are not
creating devdax devices. Does that make any difference?

>
> >
> > The cxl device you are using is an emulated CXL device - not a
> > virtualization interface. Nuanced difference: the emulated device has
> > to emulate *everything* that CXL device does.
> >
> > What you want is passthrough / managed access to a real device -
> > virtualization. This is not the way to accomplish that. A better way
> > to accomplish that is to simply pass the memory through as a static numa
> > node as I described.
>
> That would work, too. But I think a kernel change is required to
> establish the correct memory tiering if we go this route.
>
> >
> > > When we had a discussion with Intel, they told us to not use the KVM
> > > option in QEMU while using virtual cxl type3 device. That's probably
> > > related to the issue you described here? We enabled KVM though but
> > > haven't seen the crash yet.
> > >
> >
> > The crash really only happens, IIRC, if code ends up hosted in that
> > memory. I forget the exact scenario, but the working theory is it has
> > to do with the way instruction caches are managed with KVM and this
> > device.
> >
> > > > You're better off just using the `host-nodes` field of host-memory
> > > > and passing bandwidth/latency attributes though via `-numa hmat-lb`
> > >
> > > We tried this but it doesn't work from end to end right now. I
> > > described the issue in another fork of this thread.
> > >
> > > > In that scenario, the guest software doesn't even need to know CXL
> > > > exists at all, it can just read the attributes of the numa node
> > > > that QEMU created for it.
> > >
> > > We thought about this before. But the current kernel implementation
> > > requires a devdax device to be probed and recognized as a slow tier
> > > (by reading the memory attributes). I don't think this can be done via
> > > the path you described. Have you tried this before?
> > >
> >
> > Right, because the memory tiering component lumps the nodes together.
> >
> > Better idea: Fix the memory tiering component
> >
> > I cc'd you on another patch line that is discussing something relevant
> > to this.
> >
> > https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
> >
> > The point is: There's no need for this to be a dax device at all, there
> > is no need for the guest to even know what is providing the memory, or
> > for the guest to have any management access to the memory. It just
> > wants the memory and the ability to tier it.
> >
> > So we should fix the memory tiering component to work with this
> > workflow.
>
> Agreed. We really don't need the devdax device at all. I thought that
> choice was made due to the memory tiering concept being started with
> pmem ... Let's continue this part of the discussion on the above
> thread.
>
> >
> > ~Gregory
On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
> >
> > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > > >
> > > > > For a variety of performance reasons, this will not work the way you
> > > > > want it to. You are essentially telling QEMU to map the vmem0 into a
> > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > path from the perspective of QEMU.
> > > >
> > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > thought it would go with the same "guest virtual address -> guest
> > > > physical address -> host physical address" translation totally done by
> > > > CPU. But if it is going through an emulation path handled by virtual
> > > > cxl-type3, I agree the performance would be bad. Do you know why
> > > > accessing memory on a virtual cxl-type3 device can't go with the
> > > > nested page table translation?
> > > >
> > >
> > > Because a byte-access on CXL memory can have checks on it that must be
> > > emulated by the virtual device, and because there are caching
> > > implications that have to be emulated as well.
> >
> > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> > CXL memory data path goes through them, the performance would be
> > pretty problematic. We have actually run Intel's Memory Latency
> > Checker benchmark from inside a guest VM with both system-DRAM and
> > virtual CXL-type3 configured. The idle latency on the virtual CXL
> > memory is 2X of system DRAM, which is on-par with the benchmark
> > running from a physical host. I need to debug this more to understand
> > why the latency is actually much better than I would expect now.
>
> So we double checked on benchmark testing. What we see is that running
> Intel Memory Latency Checker from a guest VM with virtual CXL memory
> VS from a physical host with CXL1.1 memory expander has the same
> latency.
>
> From guest VM: local socket system-DRAM latency is 117.0ns, local
> socket CXL-DRAM latency is 269.4ns
> From physical host: local socket system-DRAM latency is 113.6ns,
> local socket CXL-DRAM latency is 267.5ns
>
> I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> while running the benchmark testing but those two functions are not
> ever hit. We used the virtual CXL configuration while launching QEMU
> but the CXL memory is present as a separate NUMA node and we are not
> creating devdax devices. Does that make any difference?
>

Could you possibly share your full QEMU configuration and what
OS/kernel you are running inside the guest?

The only thing I'm surprised by is that the numa node appears without
requiring the driver to generate the NUMA node. It's possible I missed
a QEMU update that allows this.

~Gregory
On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote: > > On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote: > > On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote: > > > > > > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote: > > > > > > > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote: > > > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote: > > > > > > > > > > > > For a variety of performance reasons, this will not work the way you > > > > > > want it to. You are essentially telling QEMU to map the vmem0 into a > > > > > > virtual cxl device, and now any memory accesses to that memory region > > > > > > will end up going through the cxl-type3 device logic - which is an IO > > > > > > path from the perspective of QEMU. > > > > > > > > > > I didn't understand exactly how the virtual cxl-type3 device works. I > > > > > thought it would go with the same "guest virtual address -> guest > > > > > physical address -> host physical address" translation totally done by > > > > > CPU. But if it is going through an emulation path handled by virtual > > > > > cxl-type3, I agree the performance would be bad. Do you know why > > > > > accessing memory on a virtual cxl-type3 device can't go with the > > > > > nested page table translation? > > > > > > > > > > > > > Because a byte-access on CXL memory can have checks on it that must be > > > > emulated by the virtual device, and because there are caching > > > > implications that have to be emulated as well. > > > > > > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the > > > CXL memory data path goes through them, the performance would be > > > pretty problematic. We have actually run Intel's Memory Latency > > > Checker benchmark from inside a guest VM with both system-DRAM and > > > virtual CXL-type3 configured. 
The idle latency on the virtual CXL > > > memory is 2X of system DRAM, which is on-par with the benchmark > > > running from a physical host. I need to debug this more to understand > > > why the latency is actually much better than I would expect now. > > > > So we double checked on benchmark testing. What we see is that running > > Intel Memory Latency Checker from a guest VM with virtual CXL memory > > VS from a physical host with CXL1.1 memory expander has the same > > latency. > > > > From guest VM: local socket system-DRAM latency is 117.0ns, local > > socket CXL-DRAM latency is 269.4ns > > From physical host: local socket system-DRAM latency is 113.6ns , > > local socket CXL-DRAM latency is 267.5ns > > > > I also set debugger breakpoints on cxl_type3_read/cxl_type3_write > > while running the benchmark testing but those two functions are not > > ever hit. We used the virtual CXL configuration while launching QEMU > > but the CXL memory is present as a separate NUMA node and we are not > > creating devdax devices. Does that make any difference? > > > > Could you possibly share your full QEMU configuration and what OS/kernel > you are running inside the guest? Sounds like the technical details are explained on the other thread. From what I understand now, if we don't go through a complex CXL setup, it wouldn't go through the emulation path. Here is our exact setup. 
Guest runs Linux kernel 6.6rc2

taskset --cpu-list 0-47,96-143 \
numactl -N 0 -m 0 ${QEMU} \
-M q35,cxl=on,hmat=on \
-m 64G \
-smp 8,sockets=1,cores=8,threads=1 \
-object memory-backend-ram,id=ram0,size=45G \
-numa node,memdev=ram0,cpus=0-7,nodeid=0 \
-msg timestamp=on -L /usr/share/seabios \
-enable-kvm \
-object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
-numa node,memdev=vmem0,nodeid=1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=14 \
-numa dist,src=1,dst=0,val=14 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-bandwidth,bandwidth=262100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-bandwidth,bandwidth=30000M \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-bandwidth,bandwidth=176100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-bandwidth,bandwidth=30000M \
-drive file="${DISK_IMG}",format=qcow2 \
-device pci-bridge,chassis_nr=3,id=pci.3,bus=pcie.0,addr=0xd \
-netdev tap,id=vm-sk-tap22,ifname=tap22,script=/usr/local/etc/qemu-ifup,downscript=no \
-device virtio-net-pci,netdev=vm-sk-tap22,id=net0,mac=02:11:17:01:7e:33,bus=pci.3,addr=0x1,bootindex=3 \
-serial mon:stdio

>
> The only thing I'm surprised by is that the numa node appears without
> requiring the driver to generate the NUMA node. It's possible I missed
> a QEMU update that allows this.
>
> ~Gregory
On Tue, Jan 09, 2024 at 11:33:04AM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> Sounds like the technical details are explained on the other thread.
> From what I understand now, if we don't go through a complex CXL
> setup, it wouldn't go through the emulation path.
>
> Here is our exact setup. Guest runs Linux kernel 6.6rc2
>
> taskset --cpu-list 0-47,96-143 \
> numactl -N 0 -m 0 ${QEMU} \
> -M q35,cxl=on,hmat=on \
> -m 64G \
> -smp 8,sockets=1,cores=8,threads=1 \
> -object memory-backend-ram,id=ram0,size=45G \
> -numa node,memdev=ram0,cpus=0-7,nodeid=0 \
> -msg timestamp=on -L /usr/share/seabios \
> -enable-kvm \
> -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> -numa node,memdev=vmem0,nodeid=1 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k

:] you did what i thought you did

-numa node,memdev=vmem0,nodeid=1

"""
Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxlcli. I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.
"""

You're mapping vmem0 into an explicit numa node *and* into the type3
device. You don't need to do both - and technically this shouldn't be
allowed.

With this configuration, you can go through the cxl-cli setup process
for the CXL device; you'll find that you can create *another* node
(node 2 in this case) that maps to the same memory you mapped to node1.

You can drop the cxl device objects in here and the memory will still
come up the way you want it to.

If you drop this line:

-numa node,memdev=vmem0,nodeid=1

You have to use the CXL driver to instantiate the dax device and the
numa node, and at *that* point you will see the read/write functions
being called.

~Gregory
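The driver-based bring-up Gregory refers to is done with the ndctl/daxctl userspace tools from inside the guest. A rough sketch of the sequence (exact flags vary by ndctl version, and the decoder/memdev/dax names here are illustrative, not taken from this thread):

```shell
# Create a RAM-type region on the root decoder, backed by memdev mem0
# (decoder0.0 and mem0 are example names; check "cxl list" for yours).
cxl create-region -d decoder0.0 -w 1 -m mem0

# The region surfaces as a dax device; reconfiguring it to system-ram
# makes the kernel online it as a new (previously dormant) NUMA node.
daxctl reconfigure-device --mode=system-ram dax0.0
```

At that point the new node appears in `numactl -H` and memory accesses to it go through the emulated type3 read/write path, as Gregory describes.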
On Tue, Jan 9, 2024 at 11:58 AM Gregory Price <gregory.price@memverge.com> wrote: > > On Tue, Jan 09, 2024 at 11:33:04AM -0800, Hao Xiang wrote: > > On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote: > > > > Sounds like the technical details are explained on the other thread. > > From what I understand now, if we don't go through a complex CXL > > setup, it wouldn't go through the emulation path. > > > > Here is our exact setup. Guest runs Linux kernel 6.6rc2 > > > > taskset --cpu-list 0-47,96-143 \ > > numactl -N 0 -m 0 ${QEMU} \ > > -M q35,cxl=on,hmat=on \ > > -m 64G \ > > -smp 8,sockets=1,cores=8,threads=1 \ > > -object memory-backend-ram,id=ram0,size=45G \ > > -numa node,memdev=ram0,cpus=0-7,nodeid=0 \ > > -msg timestamp=on -L /usr/share/seabios \ > > -enable-kvm \ > > -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind > > \ > > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ > > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ > > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \ > > -numa node,memdev=vmem0,nodeid=1 \ > > -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k > > :] you did what i thought you did > > -numa node,memdev=vmem0,nodeid=1 > > """ > Another possiblity: You mapped this memory-backend into another numa > node explicitly and never onlined the memory via cxlcli. I've done > this, and it works, but it's a "hidden feature" that probably should > not exist / be supported. > """ > > You're mapping vmem0 into an explicit numa node *and* into the type3 > device. You don't need to do both - and technically this shouldn't be > allowed. > > With this configuration, you can go thorugh the cxl-cli setup process > for the CXL device, you'll find that you can create *another* node > (node 2 in this case) that maps to the same memory you mapped to node1.. 
>
> You can drop the cxl devices objects in here and the memory will still
> come up the way you want it to.
>
> If you drop this line:
>
> -numa node,memdev=vmem0,nodeid=1

We tried this as well and it works after going through the cxlcli
process and creating the devdax device. The problem is that without the
"nodeid=1" configuration, we cannot connect with the explicit per numa
node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
code in hw/numa.c; parse_numa_hmat_lb() looks like the one passing the
lb information to VM's hmat.

From what I understand so far: 1. the guest kernel will dynamically
create a numa node only after a cxl devdax device is created, which
means we don't know the numa node until after VM boot; 2. QEMU can only
statically pass the lb information to the VM at boot time. How do we
connect these two things?

Assuming that the same issue applies to a physical server with CXL,
were you able to see a host kernel getting the correct lb information
for a CXL devdax device?

>
> You have to use the CXL driver to instantiate the dax device and the
> numa node, and at *that* point you will see the read/write functions
> being called.
>
> ~Gregory
On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote: > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price > <gregory.price@memverge.com> wrote: > > > > If you drop this line: > > > > -numa node,memdev=vmem0,nodeid=1 > > We tried this as well and it works after going through the cxlcli > process and created the devdax device. The problem is that without the > "nodeid=1" configuration, we cannot connect with the explicit per numa > node latency/bandwidth configuration "-numa hmat-lb". I glanced at the > code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the > lb information to VM's hmat. > Yeah, this is what Jonathan was saying - right now there isn't a good way (in QEMU) to pass the hmat/cdat stuff down through the device. Needs to be plumbed out. In the meantime: You should just straight up drop the cxl device from your QEMU config. It doesn't actually get you anything. > From what I understand so far, the guest kernel will dynamically > create a numa node after a cxl devdax device is created. That means we > don't know the numa node until after VM boot. 2. QEMU can only > statically parse the lb information to the VM at boot time. How do we > connect these two things? during boot, the kernel discovers all the memory regions exposed to bios. In this qemu configuration you have defined: region 0: CPU + DRAM node region 1: DRAM only node region 2: CXL Fixed Memory Window (the last line of the cxl stuff) The kernel reads this information on boot and reserves 1 numa node for each of these regions. The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1 respectively. Node2 sits dormant until you go through the cxl-cli startup sequence. What you're asking for is for the QEMU team to plumb hmat/cdat information down through the type3 device. I *think* that is presently possible with a custom CDAT file - but Jonathan probably has more details on that. You'll have to go digging for answers on that one. 
Now - even if you did that - the current state of the cxl-type3 device is *not what you want* because your memory accesses will be routed through the read/write functions in the emulated device. What Jonathan and I discussed on the other thread is how you might go about slimming this down to allow pass-through of the memory without the need for all the fluff. This is a non-trivial refactor of the existing device, so i would not expect that any time soon. At the end of the day, quickest way to get-there-from-here is to just drop the cxl related lines from your QEMU config, and keep everything else. > > Assuming that the same issue applies to a physical server with CXL. > Were you able to see a host kernel getting the correct lb information > for a CXL devdax device? > Yes, if you bring up a CXL device via cxl-cli on real hardware, the subsequent numa node ends up in the "lower tier" of the memory-tiering infrastructure. ~Gregory
On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.price@memverge.com> wrote: > > On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote: > > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price > > <gregory.price@memverge.com> wrote: > > > > > > If you drop this line: > > > > > > -numa node,memdev=vmem0,nodeid=1 > > > > We tried this as well and it works after going through the cxlcli > > process and created the devdax device. The problem is that without the > > "nodeid=1" configuration, we cannot connect with the explicit per numa > > node latency/bandwidth configuration "-numa hmat-lb". I glanced at the > > code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the > > lb information to VM's hmat. > > > > Yeah, this is what Jonathan was saying - right now there isn't a good > way (in QEMU) to pass the hmat/cdat stuff down through the device. > Needs to be plumbed out. > > In the meantime: You should just straight up drop the cxl device from > your QEMU config. It doesn't actually get you anything. > > > From what I understand so far, the guest kernel will dynamically > > create a numa node after a cxl devdax device is created. That means we > > don't know the numa node until after VM boot. 2. QEMU can only > > statically parse the lb information to the VM at boot time. How do we > > connect these two things? > > during boot, the kernel discovers all the memory regions exposed to > bios. In this qemu configuration you have defined: > > region 0: CPU + DRAM node > region 1: DRAM only node > region 2: CXL Fixed Memory Window (the last line of the cxl stuff) > > The kernel reads this information on boot and reserves 1 numa node for > each of these regions. > > The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1 > respectively. > > Node2 sits dormant until you go through the cxl-cli startup sequence. > > > What you're asking for is for the QEMU team to plumb hmat/cdat > information down through the type3 device. 
I *think* that is presently > possible with a custom CDAT file - but Jonathan probably has more > details on that. You'll have to go digging for answers on that one. I think this is exactly what I was looking for. When we started with the idea of having an explicit CXL memory backend, we wanted to 1) Bind a virtual CXL device to an actual CXL memory node on host. 2) Pass the latency/bandwidth information from the CXL backend into the virtual CXL device. I didn't have a concrete idea of how to do 2) With the discussion here, I learned that the information is passed from CDAT. Just looked into the virtual CXL code and found that ct3_build_cdat_entries_for_mr() is the function that builds this information. But the latency and bandwidth there are currently hard-coded. I think it makes sense to have an explicit CXL memory backend where QEMU can query the CXL memory attributes from the host and pass that information from the CXL backend into the virtual CXL type-3 device. > > > Now - even if you did that - the current state of the cxl-type3 device > is *not what you want* because your memory accesses will be routed > through the read/write functions in the emulated device. > > What Jonathan and I discussed on the other thread is how you might go > about slimming this down to allow pass-through of the memory without the > need for all the fluff. This is a non-trivial refactor of the existing > device, so i would not expect that any time soon. > > At the end of the day, quickest way to get-there-from-here is to just > drop the cxl related lines from your QEMU config, and keep everything > else. Agreed. We need the kernel to be capable of reading the memory attributes from HMAT and establish the correct memory tier for system-DRAM (on a CPUless numa node). Currently system-DRAM is assumed to always be fast tier. > > > > > Assuming that the same issue applies to a physical server with CXL. 
> > Were you able to see a host kernel getting the correct lb information > > for a CXL devdax device? > > > > Yes, if you bring up a CXL device via cxl-cli on real hardware, the > subsequent numa node ends up in the "lower tier" of the memory-tiering > infrastructure. > > ~Gregory
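On kernels with the memory tiering sysfs interface (roughly v6.1 onward), the tier placement Gregory describes can be checked directly; the tier numbers themselves are system-dependent:

```shell
# Each memory_tierN directory lists the NUMA nodes assigned to that tier.
# A CXL node brought up via cxl-cli should land in a higher-numbered
# (slower) tier than the DRAM nodes.
grep . /sys/devices/virtual/memory_tiering/memory_tier*/nodelist
```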
On Tue, 9 Jan 2024 15:55:46 -0800 Hao Xiang <hao.xiang@bytedance.com> wrote: > On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.price@memverge.com> wrote: > > > > On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote: > > > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price > > > <gregory.price@memverge.com> wrote: > > > > > > > > If you drop this line: > > > > > > > > -numa node,memdev=vmem0,nodeid=1 > > > > > > We tried this as well and it works after going through the cxlcli > > > process and created the devdax device. The problem is that without the > > > "nodeid=1" configuration, we cannot connect with the explicit per numa > > > node latency/bandwidth configuration "-numa hmat-lb". I glanced at the > > > code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the > > > lb information to VM's hmat. > > > > > > > Yeah, this is what Jonathan was saying - right now there isn't a good > > way (in QEMU) to pass the hmat/cdat stuff down through the device. > > Needs to be plumbed out. > > > > In the meantime: You should just straight up drop the cxl device from > > your QEMU config. It doesn't actually get you anything. > > > > > From what I understand so far, the guest kernel will dynamically > > > create a numa node after a cxl devdax device is created. That means we > > > don't know the numa node until after VM boot. 2. QEMU can only > > > statically parse the lb information to the VM at boot time. How do we > > > connect these two things? > > > > during boot, the kernel discovers all the memory regions exposed to > > bios. In this qemu configuration you have defined: > > > > region 0: CPU + DRAM node > > region 1: DRAM only node > > region 2: CXL Fixed Memory Window (the last line of the cxl stuff) > > > > The kernel reads this information on boot and reserves 1 numa node for > > each of these regions. > > > > The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1 > > respectively. 
> > > > Node2 sits dormant until you go through the cxl-cli startup sequence. > > > > > > What you're asking for is for the QEMU team to plumb hmat/cdat > > information down through the type3 device. I *think* that is presently > > possible with a custom CDAT file - but Jonathan probably has more > > details on that. You'll have to go digging for answers on that one. > > I think this is exactly what I was looking for. When we started with > the idea of having an explicit CXL memory backend, we wanted to > 1) Bind a virtual CXL device to an actual CXL memory node on host. > 2) Pass the latency/bandwidth information from the CXL backend into > the virtual CXL device. > I didn't have a concrete idea of how to do 2) > With the discussion here, I learned that the information is passed > from CDAT. Just looked into the virtual CXL code and found that > ct3_build_cdat_entries_for_mr() is the function that builds this > information. But the latency and bandwidth there are currently > hard-coded. I think it makes sense to have an explicit CXL memory > backend where QEMU can query the CXL memory attributes from the host > and pass that information from the CXL backend into the virtual CXL > type-3 device. There is probably an argument for a memory backend to be able to take perf numbers in general (I don't see this as being CXL specific) or for us adding more parameters to the cxl device entry, but for now you can inject a cdat file that presents whatever you like. What we are missing though is generic port creation, so even with everything else in place it won't quite work yet. There was a hacky patch for generic ports, but it's not upstream yet (or in my tree). Usefully there is work under review for adding generic initiators to qemu that we can repurpose most of for GPs. Jonathan
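The CDAT injection Jonathan mentions goes through the type3 device's cdat property, which points at a raw CDAT table blob. A sketch based on the config earlier in this thread (the filename is illustrative, and this assumes a QEMU recent enough to expose the property):

```shell
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0,cdat=custom-cdat.bin
```

With a custom table, the latency/bandwidth values the guest reads from CDAT can be made to match the host's real CXL memory instead of the hard-coded defaults in ct3_build_cdat_entries_for_mr().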
QAPI schema sanity review only. "Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com> writes: > Introduce a new configuration option 'host-mem-type=' in the > '-object memory-backend-ram', allowing users to specify > from which type of memory to allocate. > > Users can specify 'cxlram' as an argument, and QEMU will then > automatically locate CXL RAM NUMA nodes and use them as the backend memory. > For example: > -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \ > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \ > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \ > -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \ > > In v1, we plan to move most of the implementations to util and break down > this patch into different smaller patches. > > Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> > Signed-off-by: Hao Xiang <hao.xiang@bytedance.com> [...] > diff --git a/qapi/common.json b/qapi/common.json > index 6fed9cde1a..591fd73291 100644 > --- a/qapi/common.json > +++ b/qapi/common.json > @@ -167,6 +167,25 @@ > { 'enum': 'HostMemPolicy', > 'data': [ 'default', 'preferred', 'bind', 'interleave' ] } > > +## > +# @HostMemType: > +# > +# Automatically find a backend memory type on host. Types don't do, they are. Suggest something like # Host memory types > +# Can be further extened to support other types such as cxlpmem, hbm. Doc comments are reference documentation for users of QMP. This sentence doesn't fit there. > +# > +# @none: do nothing (default). Again, types don't do, they are. (default) makes no sense here. Types don't have defaults, optional arguments do. In this case, @host-mem-type below. What kind of memory does this setting request? > +# > +# @cxlram: a CXL RAM backend on host. Suggest to spell it cxl-ram. > +# > +# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same > +# time. 
HostMemType is used to automatically bind with one kind of > +# host memory types. I feel this note doesn't belong here. Add it to the users of HostMemType and HostMemPolicy instead. > +# > +# Since: 8.3 > +## > +{ 'enum': 'HostMemType', > + 'data': [ 'none', 'cxlram' ] } > + I guess you're adding this to common.json and not qom.json so it's next to HostMemPolicy. Okay. > ## > # @NetFilterDirection: > # > diff --git a/qapi/qom.json b/qapi/qom.json > index 95516ba325..fa3bc29708 100644 > --- a/qapi/qom.json > +++ b/qapi/qom.json > @@ -626,6 +626,7 @@ > '*host-nodes': ['uint16'], > '*merge': 'bool', > '*policy': 'HostMemPolicy', > + '*host-mem-type': 'HostMemType', > '*prealloc': 'bool', > '*prealloc-threads': 'uint32', > '*prealloc-context': 'str', Missing: doc comment update. [...]
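Putting Markus's comments together, the enum might end up reading something like this (a sketch only, not a committed schema):

```
##
# @HostMemType:
#
# Host memory types.
#
# @none: no particular type of host memory is requested
#
# @cxl-ram: allocate from a CXL RAM NUMA node on the host
#
# Since: 8.3
##
{ 'enum': 'HostMemType',
  'data': [ 'none', 'cxl-ram' ] }
```

The incompatibility with host-nodes/policy would then be documented on the memory-backend property that uses the enum, per the review.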
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 747e7838c0..3bede13879 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -44,6 +44,133 @@ host_memory_backend_get_name(HostMemoryBackend *backend)
     return object_get_canonical_path(OBJECT(backend));
 }
 
+#define FILE_LINE_LEN 256
+static int
+is_valid_node(const char *path) {
+    FILE *file = fopen(path, "r");
+    if (file == NULL) {
+        return -1;
+    }
+
+    char line[FILE_LINE_LEN];
+    if (fgets(line, sizeof(line), file) != NULL) {
+        int target_node = atoi(line);
+
+        if (target_node >= 0) {
+            fclose(file);
+            return target_node;
+        }
+    }
+
+    fclose(file);
+    return -1;
+}
+
+static int
+is_directory(const char *path) {
+    struct stat path_stat;
+    stat(path, &path_stat);
+    return S_ISDIR(path_stat.st_mode);
+}
+
+static int
+is_symlink(const char *path) {
+    struct stat path_stat;
+    if (lstat(path, &path_stat) == -1) {
+        return 0;
+    }
+    return S_ISLNK(path_stat.st_mode);
+}
+
+#define CXL_DEVICE_PATH "/sys/bus/cxl/devices/"
+#define REGION_PATH_LEN 307
+#define DAX_REGION_PATH_LEN 563
+#define DAX_PATH_LEN 819
+#define TARGET_FILE_PATH_LEN 831
+/*
+ * return: the number of valid numa node id found
+ */
+static int
+host_memory_backend_get_cxlram_nodes(int *valid_cxlram_nodes) {
+    DIR *base_dir = NULL, *region_dir = NULL, *dax_region_dir = NULL;
+    const char *base_dir_path = CXL_DEVICE_PATH;
+    struct dirent *entry;
+    int valid_node = 0, ret = 0;
+
+    base_dir = opendir(base_dir_path);
+    if (base_dir == NULL) {
+        return valid_node;
+    }
+
+    while ((entry = readdir(base_dir)) != NULL) {
+        char region_path[REGION_PATH_LEN];
+
+        ret = snprintf(region_path, sizeof(region_path), "%s%s",
+                       base_dir_path, entry->d_name);
+        if (ret < 0 ||
+            !is_symlink(region_path) ||
+            strncmp(entry->d_name, "region", ARRAY_SIZE("region") - 1)) {
+            continue;
+        }
+
+        region_dir = opendir(region_path);
+        if (region_dir == NULL) {
+            goto region_exit;
+        }
+
+        while ((entry = readdir(region_dir)) != NULL) {
+            char dax_region_path[DAX_REGION_PATH_LEN];
+
+            ret = snprintf(dax_region_path, sizeof(dax_region_path), "%s/%s",
+                           region_path, entry->d_name);
+            if (ret < 0 ||
+                !is_directory(dax_region_path) ||
+                strncmp(entry->d_name, "dax_region",
+                        ARRAY_SIZE("dax_region") - 1)) {
+
+                continue;
+            }
+
+            dax_region_dir = opendir(dax_region_path);
+            if (dax_region_dir == NULL) {
+                goto dax_region_exit;
+            }
+
+            while ((entry = readdir(dax_region_dir)) != NULL) {
+                int target_node;
+                char dax_path[DAX_PATH_LEN];
+                char target_file_path[TARGET_FILE_PATH_LEN];
+                ret = snprintf(dax_path, sizeof(dax_path), "%s/%s",
+                               dax_region_path, entry->d_name);
+                if (ret < 0 ||
+                    !is_directory(dax_path) ||
+                    strncmp(entry->d_name, "dax", ARRAY_SIZE("dax") - 1)) {
+                    continue;
+                }
+
+                ret = snprintf(target_file_path, sizeof(target_file_path),
+                               "%s/target_node", dax_path);
+                if (ret < 0) {
+                    continue;
+                }
+
+                target_node = is_valid_node(target_file_path);
+                if (target_node >= 0) {
+                    valid_cxlram_nodes[valid_node] = target_node;
+                    valid_node++;
+                }
+            }
+        }
+    }
+
+    closedir(dax_region_dir);
+dax_region_exit:
+    closedir(region_dir);
+region_exit:
+    closedir(base_dir);
+    return valid_node;
+}
+
 static void
 host_memory_backend_get_size(Object *obj, Visitor *v, const char *name,
                              void *opaque, Error **errp)
@@ -117,6 +244,12 @@ host_memory_backend_set_host_nodes(Object *obj, Visitor *v, const char *name,
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     uint16List *l, *host_nodes = NULL;
 
+    if (backend->host_mem_type == HOST_MEM_TYPE_CXLRAM) {
+        error_setg(errp,
+                   "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+        return;
+    }
+
     visit_type_uint16List(v, name, &host_nodes, errp);
 
     for (l = host_nodes; l; l = l->next) {
@@ -150,6 +283,11 @@ host_memory_backend_set_policy(Object *obj, int policy, Error **errp)
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     backend->policy = policy;
 
+    if (backend->host_mem_type == HOST_MEM_TYPE_CXLRAM) {
+        error_setg(errp,
+                   "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+    }
+
 #ifndef CONFIG_NUMA
     if (policy != HOST_MEM_POLICY_DEFAULT) {
         error_setg(errp, "NUMA policies are not supported by this QEMU");
@@ -157,6 +295,46 @@ host_memory_backend_set_policy(Object *obj, int policy, Error **errp)
 #endif
 }
 
+static int
+host_memory_backend_get_host_mem_type(Object *obj, Error **errp G_GNUC_UNUSED)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    return backend->host_mem_type;
+}
+
+static void
+host_memory_backend_set_host_mem_type(Object *obj, int host_mem_type, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    backend->host_mem_type = host_mem_type;
+
+#ifndef CONFIG_NUMA
+    error_setg(errp, "NUMA node host memory types are not supported by this QEMU");
+#else
+    int i, valid_cxlram_nodes[MAX_NODES];
+
+    if (backend->policy > 0 ||
+        !bitmap_empty(backend->host_nodes, MAX_NODES)) {
+        error_setg(errp,
+                   "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+        return;
+    }
+
+    if (host_memory_backend_get_cxlram_nodes(valid_cxlram_nodes) > 0) {
+        for (i = 0; i < MAX_NODES; i++) {
+            if (valid_cxlram_nodes[i] < 0) {
+                break;
+            }
+            bitmap_set(backend->host_nodes, valid_cxlram_nodes[i], 1);
+        }
+    } else {
+        error_setg(errp, "Cannot find CXL RAM on host");
+        return;
+    }
+    backend->policy = HOST_MEM_POLICY_BIND;
+#endif
+}
+
 static bool host_memory_backend_get_merge(Object *obj, Error **errp)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -536,6 +714,12 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_share, host_memory_backend_set_share);
     object_class_property_set_description(oc, "share",
         "Mark the memory as private to QEMU or shared");
+    object_class_property_add_enum(oc, "host-mem-type", "HostMemType",
+                                   &HostMemType_lookup,
+                                   host_memory_backend_get_host_mem_type,
+                                   host_memory_backend_set_host_mem_type);
+    object_class_property_set_description(oc, "host-mem-type",
+        "Set the backend host memory type");
 #ifdef CONFIG_LINUX
     object_class_property_add_bool(oc, "reserve",
         host_memory_backend_get_reserve, host_memory_backend_set_reserve);
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 39326f1d4f..afeb9b71d1 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -70,6 +70,7 @@ struct HostMemoryBackend {
     ThreadContext *prealloc_context;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
+    HostMemType host_mem_type;
     MemoryRegion mr;
 };
 
diff --git a/qapi/common.json b/qapi/common.json
index 6fed9cde1a..591fd73291 100644
--- a/qapi/common.json
+++ b/qapi/common.json
@@ -167,6 +167,25 @@
 { 'enum': 'HostMemPolicy',
   'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
 
+##
+# @HostMemType:
+#
+# Automatically find a backend memory type on host.
+# Can be further extened to support other types such as cxlpmem, hbm.
+#
+# @none: do nothing (default).
+#
+# @cxlram: a CXL RAM backend on host.
+#
+# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
+#     time.  HostMemType is used to automatically bind with one kind of
+#     host memory types.
+#
+# Since: 8.3
+##
+{ 'enum': 'HostMemType',
+  'data': [ 'none', 'cxlram' ] }
+
 ##
 # @NetFilterDirection:
 #
diff --git a/qapi/qom.json b/qapi/qom.json
index 95516ba325..fa3bc29708 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -626,6 +626,7 @@
     '*host-nodes': ['uint16'],
     '*merge': 'bool',
     '*policy': 'HostMemPolicy',
+    '*host-mem-type': 'HostMemType',
     '*prealloc': 'bool',
     '*prealloc-threads': 'uint32',
     '*prealloc-context': 'str',
diff --git a/qemu-options.hx b/qemu-options.hx
index b66570ae00..39074c1aa0 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5211,7 +5211,7 @@ SRST
     (``share=off``).  For this use case, we need writable RAM instead
     of ROM, and want to also set ``rom=off``.
 
-    ``-object memory-backend-ram,id=id,merge=on|off,dump=on|off,share=on|off,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
+    ``-object memory-backend-ram,id=id,merge=on|off,dump=on|off,share=on|off,prealloc=on|off,size=size,host-mem-type=cxlram,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
 
     Creates a memory backend object, which can be used to back the
     guest RAM. Memory backend objects offer more control than the
     ``-m`` option that is traditionally used to define guest RAM.
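For reference, the sysfs walk in host_memory_backend_get_cxlram_nodes() boils down to the following logic, sketched here in shell. CXL_SYSFS is a hypothetical override added purely so the sketch can be exercised without CXL hardware; the real code always scans /sys/bus/cxl/devices/:

```shell
CXL_SYSFS="${CXL_SYSFS:-/sys/bus/cxl/devices}"

cxlram_nodes() {
    for region in "$CXL_SYSFS"/region*; do
        [ -e "$region" ] || continue              # regionN entries (symlinks)
        for dax_region in "$region"/dax_region*; do
            [ -d "$dax_region" ] || continue      # dax_regionN directories
            for dax in "$dax_region"/dax*; do
                [ -d "$dax" ] || continue         # daxN.M directories
                if [ -r "$dax/target_node" ]; then
                    node=$(cat "$dax/target_node")
                    # a negative target_node means "no NUMA node backing"
                    [ "$node" -ge 0 ] 2>/dev/null && echo "$node"
                fi
            done
        done
    done
}

cxlram_nodes    # prints one NUMA node id per CXL RAM dax device
```

The patch then sets these nodes in the backend's host_nodes bitmap and forces policy=bind, which is why it rejects user-supplied host-nodes/policy.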