Message ID | 369d848fdc86994ca646a5aa4e04c4dc049d04f1.1677274611.git.maciej.szmigiero@oracle.com |
---|---|
State | New, archived |
Series | Hyper-V Dynamic Memory Protocol driver (hv-balloon) |
On 24.02.23 22:41, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > This device works like a virtual DIMM stick: it allows inserting extra RAM All DIMMs in QEMU are virtual. What you want is a piece of memory that does not get exposed via ACPI or similar (and doesn't follow the traditional "slots" concept). > into the guest at run time and later removing it without having to > duplicate all of the address space management logic of TYPE_MEMORY_DEVICE > in each memory hot-add protocol driver. ... which are these? virtio-mem and virtio-pmem do their own thing for good reasons. You're adding it for HV. I don't think there is demand for a generic device. In fact, I have no idea what "HAPVDIMM" should actually mean. If you really need such a device after we discussed the alternatives, please keep it hv-specific. > > This device is not meant to be instantiated or removed by the QEMU user > directly: rather, the protocol driver is supposed to add and remove it as > required. That sounds like the wrong approach to me. More on that below. > > In fact, its very existence is supposed to be an implementation detail, > transparent to the QEMU user. > > To prevent the user from accidentally manually creating an instance of this > device the protocol driver is supposed to place the qdev_device_add*() call > (that it uses to add this device) between hapvdimm_allow_adding() and > hapvdimm_disallow_adding() calls in order to temporarily authorize the > operation. > The most important part first: the realize function of a device is not supposed to assign itself any resources. Calling memory device (un)plug functions from the realize function is wrong. (Hot)plug handlers are the right approach for that. Please refer to how we chain hotplug handlers (machine hotplug handler -> bus hotplug handler) to implement virtio-mem and virtio-pmem. These hotplug handlers would also be the place to reject a device created by the user, for example. But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add. An alternative might be the following: Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem. In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container; you do your own placement. So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container. The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend); right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device). I'm adding support for that right now to implement a virtio-mem extension -- the memory device says how many memslots it requires, and these will get reserved for that memory device; the memory device can then consume them later without further checks dynamically. That approach could be extended to increase/decrease the memslot requirement (the device would ask to increase/decrease its limit), if ever required.
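To illustrate the "you do your own placement" part of the container approach described above: once a contiguous region is reserved for the device, assigning an address to each new memory backend is essentially a small first-fit allocation over that region. A minimal standalone sketch in plain C (not QEMU API; the Container type and container_place() are made-up names for illustration, and 'align' is assumed to be a non-zero power of two):

#include <stdint.h>

#define MAX_BACKENDS 32

/* One contiguous region reserved in guest physical address space for the
 * device, into which individual memory backends get mapped. */
typedef struct {
    uint64_t base;                  /* guest-physical base of the container */
    uint64_t size;                  /* total container size                 */
    uint64_t starts[MAX_BACKENDS];  /* offsets of already-mapped backends,  */
    uint64_t sizes[MAX_BACKENDS];   /* kept sorted by offset                */
    int count;
} Container;

/* First-fit placement of a new backend of 'size' bytes with the given
 * alignment.  Returns the assigned guest-physical address, or 0 if no
 * sufficiently large hole is left. */
static uint64_t container_place(Container *c, uint64_t size, uint64_t align)
{
    uint64_t off = 0;

    if (c->count >= MAX_BACKENDS) {
        return 0;
    }
    for (int i = 0; i <= c->count; i++) {
        uint64_t hole_end = (i < c->count) ? c->starts[i] : c->size;

        off = (off + align - 1) & ~(align - 1);    /* align the candidate */
        if (off + size <= hole_end) {
            for (int j = c->count; j > i; j--) {   /* keep the array sorted */
                c->starts[j] = c->starts[j - 1];
                c->sizes[j] = c->sizes[j - 1];
            }
            c->starts[i] = off;
            c->sizes[i] = size;
            c->count++;
            return c->base + off;
        }
        if (i < c->count) {
            off = c->starts[i] + c->sizes[i];      /* skip past this backend */
        }
    }
    return 0;
}

In QEMU itself, reserving the container and mapping each backend's memory region into it would still go through the memory-device and hotplug-handler infrastructure mentioned above; the sketch only shows the placement decision the device has to make on its own.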
On 27.02.2023 16:25, David Hildenbrand wrote: > On 24.02.23 22:41, Maciej S. Szmigiero wrote: >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >> >> This device works like a virtual DIMM stick: it allows inserting extra RAM > > All DIMMs in QEMU are virtual. What you want it, a piece of memory that doesn not get exposed via ACPI or similar (and doesn't follow the traditional "slots" concept). Right. >> into the guest at run time and later removing it without having to >> duplicate all of the address space management logic of TYPE_MEMORY_DEVICE >> in each memory hot-add protocol driver. > > ... which are these? virtio-mem and virtio-pmem do their own thing for good reasons. You're adding it for HV. > > I don't think their is demand for a generic device. In fact, I have no idea what "HAPVDIMM" should actually mean. > > If you really need such a device after we discussed the alternatives, please keep it hv-specific. No problem, the device can be made hv-specific - at least until another use for it is found (if it is found). >> >> This device is not meant to be instantiated or removed by the QEMU user >> directly: rather, the protocol driver is supposed to add and remove it as >> required. > > That sounds like the wrong approach to me. More on that below. > >> >> In fact, its very existence is supposed to be an implementation detail, >> transparent to the QEMU user. >> >> To prevent the user from accidentally manually creating an instance of this >> device the protocol driver is supposed to place the qdev_device_add*() call >> (that is uses to add this device) between hapvdimm_allow_adding() and >> hapvdimm_disallow_adding() calls in order to temporary authorize the >> operation. >> > > The most important part first: the realize function of a device is not supposed to assing itself any resources. Calling memory device (un)plug functions from the realize function is wrong. > > (Hot)plug handlers are the right approach for that. Please refer to how we chain hotplug handlers (machine hotplug handler -> bus hotplug handler) to implement virtio-mem and virtio-pmem. These hotplug handlers would also be the place where to reject a device if created by the user, for example. > That was more or less the approach that v1 of this driver took: The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices, whatever one calls them) explicitly via the machine hotplug handler (using the device_add command). At that time you said [1] that: > 1) I dislike that an external entity has to do vDIMM adaptions / > ballooning adaptions when rebooting or when wanting to resize a guest. because: > Once you have the current approach upstream (vDIMMs, ballooning), > there is no easy way to change that later (requires deprecating, etc.). That's why this version hides these vDIMMs. Instead, the QEMU manager (user) directly provides the raw memory backend device (for example, memory-backend-ram) to the driver via a QMP command. Since now the user is not expected to touch these vDIMMs directly in any way these become an implementation detail than can be changed or even removed if needed at some point, without affecting the existing users. > But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add. > > > An alternative might be the following: > > Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. 
Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem. > > In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container, you do your own placement. > > So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container. > > The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend), right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device). > > Technically in this case a "main" hv-balloon device is still needed - in contrast with virtio-mem (which allows multiple instances) there can be only one Dynamic Memory protocol provider on the VMBus. That means these "container" sub-devices would need to register with that main hv-balloon device. However, I'm not sure what is exactly gained by this approach. These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface so they are accounted for properly (the alternative would be to patch the relevant QEMU code all over the place - that's probably why virtio-mem also implements this interface instead). One still needs some QMP command to add a raw memory backend to the chosen "container" hv-balloon sub-device. Since now the QEMU manager (user) is aware of the presence of these "container" sub-devices, and has to manage them, changing the QEMU interface in the future is more complex (as you said in [1]). I understand that virtio-mem uses a similar approach, however that's because the virtio-mem protocol itself works that way. > I'm adding support for that right now to implement a virtio-mem > extension -- the memory device says how many memslots it requires, > and these will get reserved for that memory device; the memory device > can then consume them later without further checks dynamically. That > approach could be extended to increase/decrease the memslot > requirement (the device would ask to increase/decrease its limit), > if ever required. In terms of future virtio-mem things I'm also eagerly waiting for an ability to set a removed virtio-mem block read-only (or not covered by any memslot) - this most probably could be reused later for implementing the same functionality in this driver. Thanks, Maciej [1]: https://lore.kernel.org/qemu-devel/28ab7005-c31c-239e-4659-e5287f4c5468@redhat.com/
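Independent of how the QMP interface ends up looking, the bookkeeping the driver has to do per memory backend handed to it is fairly small. A rough sketch in plain C (illustrative only; BackendSlot, BackendRegistry and the field names are invented, and a real driver would track QOM objects rather than id strings):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_BACKENDS 32

/* Per-backend record the protocol driver keeps for every memory backend
 * the user hands to it (e.g. a memory-backend-ram named in a QMP command). */
typedef struct {
    char id[64];        /* id of the memory backend object           */
    uint64_t addr;      /* guest-physical address it was mapped at   */
    uint64_t size;
    bool in_use;        /* still (partially) exposed to the guest?   */
} BackendSlot;

typedef struct {
    BackendSlot slots[MAX_BACKENDS];
    int count;
    uint64_t total_hot_added;  /* sum of all backends currently exposed */
} BackendRegistry;

static bool registry_add(BackendRegistry *r, const char *id,
                         uint64_t addr, uint64_t size)
{
    if (r->count >= MAX_BACKENDS) {
        return false;
    }
    BackendSlot *s = &r->slots[r->count++];
    snprintf(s->id, sizeof(s->id), "%s", id);
    s->addr = addr;
    s->size = size;
    s->in_use = true;
    r->total_hot_added += size;
    return true;
}

/* Called once the guest has released everything backed by a slot; this is
 * the point where a "backend unused" notification to the management layer
 * (like the HV_BALLOON_MEMORY_BACKEND_UNUSED event mentioned later in the
 * thread) would be emitted, so the user knows the backend can be removed. */
static void registry_mark_unused(BackendRegistry *r, int idx)
{
    if (idx >= 0 && idx < r->count && r->slots[idx].in_use) {
        r->slots[idx].in_use = false;
        r->total_hot_added -= r->slots[idx].size;
    }
}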
> > That was more or less the approach that v1 of this driver took: > The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices, > whatever one calls them) explicitly via the machine hotplug handler > (using the device_add command). > > At that time you said [1] that: >> 1) I dislike that an external entity has to do vDIMM adaptions / >> ballooning adaptions when rebooting or when wanting to resize a guest. > > because: >> Once you have the current approach upstream (vDIMMs, ballooning), >> there is no easy way to change that later (requires deprecating, etc.). > > That's why this version hides these vDIMMs. Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO). Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend. What memory backends will remain when we reboot? When can we remove memory backends? But that's just about the user interaction in general. My comment here was about the hidden devices: they have to go through plug handlers to get resources assigned, not self-assign resources in the realize function. Note that virtio-mem uses a single sparse memory backend to make resizing easier (well, and to handle migration and some other things easier). But it comes with other things that require optimization. Using multiple memslots to expose memory to the VM is one optimization I'm working on. Resizable memory backends are another one. I think you could implement the memory adding part similar to virtio-mem, and simply have a large sparse memory backend, from which you expose new memory to the VM as you please. And you could even use multiple memslots for that. But that's your design decision, and I won't argue with that, just pointing that out. > Instead, the QEMU manager (user) directly provides the raw memory > backend device (for example, memory-backend-ram) to the driver via a QMP > command. Yes, that's what I understood. > > Since now the user is not expected to touch these vDIMMs directly in any > way these become an implementation detail than can be changed or even > removed if needed at some point, without affecting the existing users. > >> But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add. >> >> >> An alternative might be the following: >> >> Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem. >> >> In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container, you do your own placement. >> >> So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container. 
>> >> The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend); right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device). >> >> > Technically in this case a "main" hv-balloon device is still needed - > in contrast with virtio-mem (which allows multiple instances) there can > be only one Dynamic Memory protocol provider on the VMBus. Yes, just like virtio-balloon. There cannot be multiple instances. > > That means these "container" sub-devices would need to register with that > main hv-balloon device. > My question is whether they really have to be devices. Why wouldn't it be sufficient to map the memory backends directly into the container? > However, I'm not sure what is exactly gained by this approach. > > These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE. > so they are accounted for properly (the alternative would be to patch > the relevant QEMU code all over the place - that's probably why > virtio-mem also implements this interface instead). Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-pmem were responsible for adding some of these hooks. > > One still needs some QMP command to add a raw memory backend to > the chosen "container" hv-balloon sub-device. If you go with multiple memory backends, yes. > > Since now the QEMU manager (user) is aware of the presence of these > "container" sub-devices, and has to manage them, changing the QEMU > interface in the future is more complex (as you said in [1]). Can you elaborate? Yes, when you design the feature around "multiple memory backends", you'll have to have an interface to add such. Well, and to query them during migration. And, maybe also to detect when to remove some (migration)? > > I understand that virtio-mem uses a similar approach, however that's > because the virtio-mem protocol itself works that way. > >> I'm adding support for that right now to implement a virtio-mem >> extension -- the memory device says how many memslots it requires, >> and these will get reserved for that memory device; the memory device >> can then consume them later without further checks dynamically. That >> approach could be extended to increase/decrease the memslot >> requirement (the device would ask to increase/decrease its limit), >> if ever required. > > In terms of future virtio-mem things I'm also eagerly waiting for an > ability to set a removed virtio-mem block read-only (or not covered by > any memslot) - this most probably could be reused later for implementing > the same functionality in this driver. In contrast to setting them read-only, the memslots that contain no plugged blocks anymore will be completely removed. The goal is to not consume any metadata overhead in KVM (well, and also to take one step in the direction of protecting unplugged memory from getting reallocated).
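The "accounted for properly" point is worth spelling out: what the TYPE_MEMORY_DEVICE interface buys is that generic code can iterate over all memory devices through a few hooks and reason about address space and size limits without knowing the device type. A plain-C analogue of that idea (purely illustrative; the real hooks live in QEMU's memory-device interface and differ in detail):

#include <stddef.h>
#include <stdint.h>

/* Plain-C analogue of the hooks a memory device exposes so that generic
 * code can do the accounting, regardless of whether the device is a DIMM,
 * virtio-mem or hv-balloon.  (Illustrative only -- not the real QEMU
 * MemoryDeviceClass.) */
typedef struct {
    uint64_t (*get_addr)(const void *dev);          /* assigned base address */
    uint64_t (*get_region_size)(const void *dev);   /* reserved region size  */
    uint64_t (*get_plugged_size)(const void *dev);  /* currently exposed     */
} MemDevOps;

typedef struct {
    const void *dev;
    const MemDevOps *ops;
} MemDev;

/* Generic accounting across all memory devices: total reserved size, used
 * e.g. to check whether one more device still fits under the device-memory
 * limit derived from "maxmem". */
static uint64_t memdev_total_reserved(const MemDev *devs, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
        total += devs[i].ops->get_region_size(devs[i].dev);
    }
    return total;
}

static int memdev_new_device_fits(const MemDev *devs, size_t n,
                                  uint64_t new_size, uint64_t limit)
{
    return memdev_total_reserved(devs, n) + new_size <= limit;
}

Whether one device or many implement such hooks is exactly the design question being debated here: with a single hv-balloon memory device the accounting covers its whole reserved region once, while per-backend sub-devices would each show up in this loop individually.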
On 28.02.2023 16:02, David Hildenbrand wrote: >> >> That was more or less the approach that v1 of this driver took: >> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices, >> whatever one calls them) explicitly via the machine hotplug handler >> (using the device_add command). >> >> At that time you said [1] that: >>> 1) I dislike that an external entity has to do vDIMM adaptions / >>> ballooning adaptions when rebooting or when wanting to resize a guest. >> >> because: >>> Once you have the current approach upstream (vDIMMs, ballooning), >>> there is no easy way to change that later (requires deprecating, etc.). >> >> That's why this version hides these vDIMMs. > > Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO). > > Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend. Well, the logic here is pretty simple: deflate the balloon first (including deflating it by zero bytes if not inflated), then, if any memory size remains to add, hot-add the remainder. We can't get rid of ballooning altogether because otherwise going below the boot memory size wouldn't be possible. > What memory backends will remain when we reboot? In this driver version, none will remain inserted (virtio-mem also seems to unplug all blocks unconditionally when the VM is rebooted). In version 1, all memory backends were re-inserted once the guest re-connected to the DM protocol after a reboot. As I wrote in my response to Daniel moments ago, there are some issues with automatic re-insertion if the guest never re-connects to the DM protocol - that's why I've removed this functionality from this driver version. > When can we remove memory backends? There's a QMP event generated when a memory backend can be removed: HV_BALLOON_MEMORY_BACKEND_UNUSED > But that's just about the user interaction in general. My comment here was about the hidden devices: they have to go through plug handlers to get resources assigned, not self-assign resources in the realize function. > > Note that virtio-mem uses a single sparse memory backend to make resizing easier (well, and to handle migration and some other things easier). But it comes with other things that require optimization. Using multiple memslots to expose memory to the VM is one optimization I'm working on. Resizable memory backends are another one. > > I think you could implement the memory adding part similar to virtio-mem, and simply have a large sparse memory backend, from which you expose new memory to the VM as you please. And you could even use multiple memslots for that. But that's your design decision, and I won't argue with that, just pointing that out. > > >> Instead, the QEMU manager (user) directly provides the raw memory >> backend device (for example, memory-backend-ram) to the driver via a QMP >> command. > > Yes, that's what I understood. > >> >> Since now the user is not expected to touch these vDIMMs directly in any >> way these become an implementation detail that can be changed or even >> removed if needed at some point, without affecting the existing users. 
>> >>> But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add. >>> >>> >>> An alternative might be the following: >>> >>> Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem. >>> >>> In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container, you do your own placement. >>> >>> So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container. >>> >>> The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend), right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device). >>> >>> >> Technically in this case a "main" hv-balloon device is still needed - >> in contrast with virtio-mem (which allows multiple instances) there can >> be only one Dynamic Memory protocol provider on the VMBus. > > Yes, just like virtio-balloon. There cannot be multiple instances. Right, this has some important consequences (see below). >> >> That means these "container" sub-devices would need to register with that >> main hv-balloon device. >> > > My question is, if they really have to be devices. Why wouldn't it sufficient to map the memory backends directly into the container? Why is the See the answer below the next paragraph. > >> However, I'm not sure what is exactly gained by this approach. >> >> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface > > No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE. In case of virtio-mem if one wants to add even more memory than the "current" backing memory device allows there's always a possibility of adding yet another virtio-mem-pci device with an additional backing memory device. If there would be just the main hv-balloon device (implementing TYPE_MEMORY_DEVICE) then this would not be possible, since one can't have multiple DM VMBus devices. Hence, intermediate sub-devices are necessary (each one implementing TYPE_MEMORY_DEVICE), which do not sit on the VMBus, in order to allow adding new backing memory devices (as virtio-mem allows). >> so they are accounted for properly (the alternative would be to patch >> the relevant QEMU code all over the place - that's probably why >> virtio-mem also implements this interface instead). > > Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-omem was responsible for adding some of these hooks. I was referring to the necessity of implementing TYPE_MEMORY_DEVICE at all in hv-balloon driver - if it didn't implement this interface then it couldn't benefit from the logic in hw/mem/memory-device.c, so it would need to be open-coded inside the driver and every call to functions provided by that file from QEMU would need to be patched to account for the memory provided by this driver. 
> >> >> One still needs some QMP command to add a raw memory backend to >> the chosen "container" hv-balloon sub-device. > > If you go with multiple memory backends, yes. > >> >> Since now the QEMU manager (user) is aware of the presence of these >> "container" sub-devices, and has to manage them, changing the QEMU >> interface in the future is more complex (as you said in [1]).> > Can you elaborate? Yes, when you design the feature around "multiple memory backends", you'll have to have an interface to add such. Well, and to query them during migration. And, maybe also to detect when to remove some (migration)? > As I wrote above, multiple backing memory devices are necessary so the guest can be expanded above the initially provided backing memory device, much like virtio-mem already allows. And then you have to either: 1) Let the hv-balloon driver transparently manage the lifetime of these sub-devices, like this version of the patch set does, OR: 2) Make the QEMU manager (user) insert and remove these sub-devices explicitly, like the version 1 of this driver did. > >> >> I understand that virtio-mem uses a similar approach, however that's >> because the virtio-mem protocol itself works that way. >> >>> I'm adding support for that right now to implement a virtio-mem >>> extension -- the memory device says how many memslots it requires, >>> and these will get reserved for that memory device; the memory device >>> can then consume them later without further checks dynamically. That >>> approach could be extended to increase/decrease the memslot >>> requirement (the device would ask to increase/decrease its limit), >>> if ever required. >> >> In terms of future virtio-mem things I'm also eagerly waiting for an >> ability to set a removed virtio-mem block read-only (or not covered by >> any memslot) - this most probably could be reused later for implementing >> the same functionality in this driver. > > In contrast to setting them read-only, the memslots that contain no plugged blocks anymore will be completely removed. The goal is to not consume any metadata overhead in KVM (well, and also do one step into the direction of protecting unplugged memory from getting reallocated). > Nice, looking forward to having this functionality in QEMU for Linux guests. Thanks, Maciej
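The "deflate first, then hot-add the remainder" rule from earlier in this message is small enough to state as code. A sketch in plain C (names invented for illustration), which also matches the 1 GiB + 2 GiB example that comes up next in the thread:

#include <stdint.h>

/* Growing the guest by 'grow' bytes: give back ballooned-out memory first
 * (possibly 0 bytes if the balloon is not inflated), then hot-add whatever
 * is still missing.  Shrinking would instead inflate the balloon (not
 * shown), which is also the only way to go below the boot memory size. */
typedef struct {
    uint64_t deflate;   /* how much to return to the guest via the balloon */
    uint64_t hot_add;   /* how much new memory to hot-add on top           */
} GrowPlan;

static GrowPlan plan_grow(uint64_t ballooned_out, uint64_t grow)
{
    GrowPlan p;

    p.deflate = grow < ballooned_out ? grow : ballooned_out;
    p.hot_add = grow - p.deflate;
    return p;
}

/* Example: with 1 GiB currently ballooned out and a request to grow by
 * 2 GiB, plan_grow() yields deflate = 1 GiB and hot_add = 1 GiB. */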
On 28.02.23 22:27, Maciej S. Szmigiero wrote: > On 28.02.2023 16:02, David Hildenbrand wrote: >>> >>> That was more or less the approach that v1 of this driver took: >>> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices, >>> whatever one calls them) explicitly via the machine hotplug handler >>> (using the device_add command). >>> >>> At that time you said [1] that: >>>> 1) I dislike that an external entity has to do vDIMM adaptions / >>>> ballooning adaptions when rebooting or when wanting to resize a guest. >>> >>> because: >>>> Once you have the current approach upstream (vDIMMs, ballooning), >>>> there is no easy way to change that later (requires deprecating, etc.). >>> >>> That's why this version hides these vDIMMs. >> >> Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO). >> >> Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend. > > Well, the logic here is pretty simple: deflate the balloon first > (including deflating it by zero bytes if not inflated), then, if any > memory size remains to add, hot-add the remainder. > Yes, but if you have 1 GiB deflated and want to add 2 GiB, things are already getting more involved if you get what I mean. I was going through the exact same model back when I was designing virtio-mem, and eventually added a way where you can just tell QEMU the requested size and be done with it. > We can't get rid of ballooning altogether because otherwise going > below the boot memory size wouldn't be possible. Right, more on that below. > >> What memory backends will remain when we reboot? > > In this driver version, none will remain inserted > (virtio-mem also seems to unplug all blocks unconditionally when the > VM is rebooted). > There is a very important difference: virtio-mem only temporarily unplugs that memory. As the guest boots up it re-adds the requested amount of memory without any user interaction. That was added for two main reasons: (a) We can easily defragment the virtio-mem device that way. (b) If the rebooted guest doesn't load the virtio-mem driver, it wouldn't be able to make use of that memory. Like, rebooting into Windows right now ;) So if you hotplugged some memory using virtio-mem and reboot, that memory will automatically be re-added. > In version 1, all memory backends were re-inserted once the guest > re-connected to the DM protocol after a reboot. > > As I wrote in my response to Daniel moments ago, there are some issues > with automatic re-insertion if the guest never re-connects to the DM > protocol - that's why I've removed this functionality from this > driver version. I think we might be able to do better, but that's just my idea of how it could look. I'll describe it below. [...] >>> However, I'm not sure what is exactly gained by this approach. >>> >>> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface >> >> No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE. 
> In case of virtio-mem if one wants to add even more memory than the > "current" backing memory device allows there's always a possibility of > adding yet another virtio-mem-pci device with an additional backing > memory device. We could, but that's not the way I envision virtio-mem. The thing is, already when starting QEMU we have to make decisions about the maximum VM size when setting the maxmem option. Consequently, we cannot grow a VM to infinity, we already have to plan ahead to some degree. So my goal with virtio-mem is the following (it already works, we just have to work on reduction of metadata and memory overcommit handling -- mostly internal optimizations): qemu-kvm ... \ -m 4G,maxmem=1048G \ -object memory-backend-ram,id=mem0,size=1T, ... \ -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0 So we can grow the guest up to 1T if we like. There is no way we could add more memory to that VM because we're already hitting the limit of maxmem. It gets more complicated with multiple NUMA nodes, NVDIMMs, etc., but the main goal is to make it possible to have the maximum size be so ridiculously large (while optimizing it internally!) that one doesn't even have to worry about adding a new device. I think the same model would work for hv as well, at least with my limited knowledge about it ;) > > If there would be just the main hv-balloon device (implementing > TYPE_MEMORY_DEVICE) then this would not be possible, since one can't > have multiple DM VMBus devices. > > Hence, intermediate sub-devices are necessary (each one implementing > TYPE_MEMORY_DEVICE), which do not sit on the VMBus, in order to allow > adding new backing memory devices (as virtio-mem allows). Not necessarily, I think, as discussed. > >>> so they are accounted for properly (the alternative would be to patch >>> the relevant QEMU code all over the place - that's probably why >>> virtio-mem also implements this interface instead). >> >> Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-pmem were responsible for adding some of these hooks. > > I was referring to the necessity of implementing TYPE_MEMORY_DEVICE at > all in the hv-balloon driver - if it didn't implement this interface then it > couldn't benefit from the logic in hw/mem/memory-device.c, so it would > need to be open-coded inside the driver and every call to functions > provided by that file from QEMU would need to be patched to account for > the memory provided by this driver. Ah, yes, one device has to be a memory device. I was just asking if you really need multiple ones. > >> >>> >>> One still needs some QMP command to add a raw memory backend to >>> the chosen "container" hv-balloon sub-device. >> >> If you go with multiple memory backends, yes. >> >>> >>> Since now the QEMU manager (user) is aware of the presence of these >>> "container" sub-devices, and has to manage them, changing the QEMU >>> interface in the future is more complex (as you said in [1]). >>> Can you elaborate? 
>> > > As I wrote above, multiple backing memory devices are necessary so the > guest can be expanded above the initially provided backing memory device, > much like virtio-mem already allows. > > And then you have to either: > 1) Let the hv-balloon driver transparently manage the lifetime of these > sub-devices, like this version of the patch set does, OR: > > 2) Make the QEMU manager (user) insert and remove these sub-devices > explicitly, like the version 1 of this driver did. Let me raise this idea: qemu-kvm ... \ -m 4G,maxmem=1048G \ -object memory-backend-ram,id=mem0,size=1T, ... \ -device hv-balloon,id=vmem0,memdev=mem0 We'd do the same internal optimizations as we're doing (and the ones I am working on) for virtio-mem. The above would result in a VM with 4G. With virtio-mem, we resize devices; with the balloon, you resize the logical VM size. So the single (existing?) user interface would be the existing balloon cmd. Note that we set the logical VM size here, not the size of the balloon. info balloon -> 4G balloon 2G [will inflate] info balloon -> 2G balloon 128G [will deflate, then hotplug] info balloon -> 128G balloon 8G [will deflate] info balloon -> 8G ... How memory is added (deflate first, then expose some new memory via the memdev, ...) is left to the hv-balloon device, the user doesn't have to bother. We set the logical VM size and hv-balloon will do its thing to eventually reach that goal. Reboot? Logically unplug all memory and re-add it after the guest has booted up. The only thing we can't do is the following: when going below 4G, we cannot resize boot memory. But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup). Even migration could eventually be fairly simple, because virtio-mem already solved it to some degree. The only catch is that, for boot memory, we'd also have to detect discarded ranges. But that would be something to think about in the future.
On 28.02.2023 23:12, David Hildenbrand wrote: > On 28.02.23 22:27, Maciej S. Szmigiero wrote: >> On 28.02.2023 16:02, David Hildenbrand wrote: >>>> >>>> That was more or less the approach that v1 of this driver took: >>>> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices, >>>> whatever one calls them) explicitly via the machine hotplug handler >>>> (using the device_add command). >>>> >>>> At that time you said [1] that: >>>>> 1) I dislike that an external entity has to do vDIMM adaptions / >>>>> ballooning adaptions when rebooting or when wanting to resize a guest. >>>> >>>> because: >>>>> Once you have the current approach upstream (vDIMMs, ballooning), >>>>> there is no easy way to change that later (requires deprecating, etc.). >>>> >>>> That's why this version hides these vDIMMs. >>> >>> Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO). >>> >>> Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend. >> >> Well, the logic here is pretty simple: deflate the balloon first >> (including deflating it by zero bytes if not inflated), then, if any >> memory size remains to add, hot-add the reminder. >> > > Yes, but if you have 1 GiB deflated and want to add 2 GiB, things are already getting more involved if you get what I mean. > > I was going through the exact same model back when I was designing virtio-mem, and eventually added with a way where you can just tell QEMU the requested size an be done with it. Understood, this interface seems obviously more user-friendly. >> We can't get rid of ballooning altogether because otherwise going >> below the boot memory size wouldn't be possible. > > Right, more on that below. > >> >>> What memory backends will remain when we reboot? >> >> In this driver version, none will remain inserted >> (virtio-mem also seems to unplug all blocks unconditionally when the >> VM is rebooted). >> > > There is a very important difference: virtio-mem only temporarily unplugs that memory. As the guest boots up it re-adds the requested amount of memory without any user interaction. That was added for two main reasons > > (a) We can easily defragment the virtio-mem device that way. > (b) If the rebooted guest doesn't load the virtio-mem driver, it > wouldn't be able to make use of that memory. Like, rebooting into > Windows right now ;) > > So if you hotplugged some memory using virtio-mem and reboot, that memory will automatically be re-added. > >> In version 1, all memory backeds were re-inserted once the guest >> re-connected to the DM protocol after a reboot. >> >> As I wrote in my response to Daniel moments ago, there are some issues >> with automatic re-insertion if the guest never re-connects to the DM >> protocol - that's why I've removed this functionality from this >> driver version. > > I think we might be able to to better, but that's just my idea how it could look like. I'll describe it below. > > [...] > >>>> However, I'm not sure what is exactly gained by this approach. 
>>>> >>>> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface >>> >>> No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE. >> In case of virtio-mem if one wants to add even more memory than the >> "current" backing memory device allows there's always a possibility of >> adding yet another virtio-mem-pci device with an additional backing >> memory device. > > We could, but that's not the way I envision virtio-mem. The thing is, already when starting QEMU we have to make decisions about the maximum VM size when setting the maxmem option. Consequently, we cannot grow a VM until infinity, we already have to plan ahead to some degree. > > So what my goal is with virito-mem, is the following (it already works, we just have to work on reduction of metadata and memory overcommit handling -- mostly internal optimizations): > > qemu-kvm ... \ > -m 4G,maxmem=1048G \ > -object memory-backend-ram,id=mem0,size=1T, ... \ > -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0 > > So we an grow the guest up to 1T if we like. There is no way we could add more memory to that VM because we're already hitting the limit of maxmem. > > It gets more complicated with multiple NUMA nodes, NVDIMMS, etc, but the main goal is to make it possible to have the maximum size be ridiculously large (while optimizing it internally!) that one doesn't have to even worry about adding a new device. > > I think the same model would work for hv as well, at least with my limited knowledge about it ;) I understand your idea - responded below, under the hv-balloon example. >> >> If there would be just the main hv-balloon device (implementing >> TYPE_MEMORY_DEVICE) then this would not be possible, since one can't >> have multiple DM VMBus devices. >> >> Hence, intermediate sub-devices are necessary (each one implementing >> TYPE_MEMORY_DEVICE), which do not sit on the VMBus, in order to allow >> adding new backing memory devices (as virtio-mem allows). > > Not necessarily, I think, as discussed. > >> >>>> so they are accounted for properly (the alternative would be to patch >>>> the relevant QEMU code all over the place - that's probably why >>>> virtio-mem also implements this interface instead). >>> >>> Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-omem was responsible for adding some of these hooks. >> >> I was referring to the necessity of implementing TYPE_MEMORY_DEVICE at >> all in hv-balloon driver - if it didn't implement this interface then it >> couldn't benefit from the logic in hw/mem/memory-device.c, so it would >> need to be open-coded inside the driver and every call to functions >> provided by that file from QEMU would need to be patched to account for >> the memory provided by this driver. > > Ah, yes, one device has to be a memory device. I was just asking if you really need multiple ones. > >> >>> >>>> >>>> One still needs some QMP command to add a raw memory backend to >>>> the chosen "container" hv-balloon sub-device. >>> >>> If you go with multiple memory backends, yes. >>> >>>> >>>> Since now the QEMU manager (user) is aware of the presence of these >>>> "container" sub-devices, and has to manage them, changing the QEMU >>>> interface in the future is more complex (as you said in [1]).> >>> Can you elaborate? 
Yes, when you design the feature around "multiple memory backends", you'll have to have an interface to add such. Well, and to query them during migration. And, maybe also to detect when to remove some (migration)? >>> >> >> As I wrote above, multiple backing memory devices are necessary so the >> guest can be expanded above the initially provided backing memory device, >> much like virtio-mem already allows. >> >> And then you have to either: >> 1) Let the hv-balloon driver transparently manage the lifetime of these >> sub-devices, like this version of the patch set does, OR: >> >> 2) Make the QEMU manager (user) insert and remove these sub-devices >> explicitly, like the version 1 of this driver did. > > Let me raise this idea: > > qemu-kvm ... \ > -m 4G,maxmem=1048G \ > -object memory-backend-ram,id=mem0,size=1T, ... \ > -device hv-balloon,id=vmem0,memdev=mem0 > > We'd do the same internal optimizations as we're doing (and the ones I am working on) for virtio-mem. > > The above would result in a VM with 4G. With virtio-mem, we resize devices; with the balloon, you resize the logical VM size. > > So the single (existing?) user interface would be the existing balloon cmd. Note that we set the logical VM size here, not the size of the balloon. > > info balloon -> 4G > balloon 2G [will inflate] > info balloon -> 2G > balloon 128G [will deflate, then hotplug] > info balloon -> 128G > balloon 8G [will deflate] > info balloon -> 8G > ... > > How memory is added (deflate first, then expose some new memory via the memdev, ...) is left to the hv-balloon device, the user doesn't have to bother. We set the logical VM size and hv-balloon will do its thing to eventually reach that goal. The idea would seem reasonable, but: (there's always some "but") 1) Once we implement NUMA support we'd probably need multiple TYPE_MEMORY_DEVICEs anyway, since it seems one memdev can sit on only one NUMA node. With virtio-mem one can simply have per-node virtio-mem devices. 2) I'm not sure what's the overhead of having, let's say, a 1 TiB backing memory device mostly marked with madvise(MADV_DONTNEED). Like, how much memory + swap this setup would actually consume - that's something I would need to measure. 3) In a public cloud environment malicious guests are a possibility. Currently (without things like resizable memslots) the best idea I tried was to place the whole QEMU process into a memory-limited cgroup (limited to the guest target size). There are still some issues with it: one needs to reserve swap space up to the guest maximum size so the QEMU process doesn't get OOM-killed if the guest touches that memory, and the cgroup memory controller for some reason seems to start swapping even before reaching its limit (why that happens is still under investigation). > Reboot? Logically unplug all memory and re-add it after the guest has booted up. > > The only thing we can't do is the following: when going below 4G, we cannot resize boot memory. > > > But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup). Hyper-V actually "cleans up" the guest memory map on reboot - if the guest was effectively resized up, then on reboot the guest boot memory is resized up to match that last size. 
Similarly, if the guest was ballooned out - that amount of memory is removed from the boot memory on reboot. So it's not exactly doing a hot-add after the guest boots. This approach (of resizing the boot memory) also avoids problems if the guest loses hot-add / ballooning capability after a reboot - for example, rebooting into a Linux guest from Windows with hv-balloon. But unfortunately such resizing the guest boot memory seems not trivial to implement in QEMU. > > > Even migration could eventually be fairly simple, because virtio-mem already solved it to some degree. The only catch is, that for boot memory, we'd also have to detect discarded ranges. But that would be something to think about in the future.> Yes, migration support is planned for future versions of the driver, when its final design is known. Thanks, Maciej
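The overhead question from point 2) above (a huge backing region that is mostly madvise(MADV_DONTNEED)) can be probed with a few lines of code before committing to the design. A rough standalone experiment sketch (assumptions: Linux, anonymous memory, 4 KiB pages; SIZE can be scaled up, and /proc/self/smaps inspected, to approximate the 1 TiB case):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE (4ULL << 30)   /* 4 GiB for a quick test */
#define PAGE 4096

/* Print the VmRSS line from /proc/self/status at a given point. */
static void print_rss(const char *when)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    while (f && fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "VmRSS", 5)) {
            printf("%s: %s", when, line);
        }
    }
    if (f) {
        fclose(f);
    }
}

int main(void)
{
    unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    print_rss("after mmap");

    for (size_t off = 0; off < SIZE; off += PAGE) {   /* fault everything in */
        p[off] = 1;
    }
    print_rss("after touching");

    if (madvise(p, SIZE, MADV_DONTNEED) != 0) {       /* discard it again */
        perror("madvise");
    }
    print_rss("after MADV_DONTNEED");

    munmap(p, SIZE);
    return 0;
}

VmRSS should drop back after the MADV_DONTNEED; what remains is page-table and other metadata overhead, which is the part the WIP items discussed in the next message are about.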
> > The idea would seem reasonable, but: (there's always some "but") > 1) Once we implement NUMA support we'd probably need multiple > TYPE_MEMORY_DEVICEs anyway, since it seems one memdev can sit on only > one NUMA node, > Not necessarily. You could extend the hv-balloon device to have one memslot for each NUMA node. Of course, once again, you have to plan ahead how to distribute memory across NUMA nodes (same with virtio-mem). Having that said, last time I checked, HV dynamic memory was force-disabled when enabling vNUMA under HV. Simply because balloon inflation is not NUMA aware. > With virtio-mem one can simply have per-node virtio-mem devices. > > 2) I'm not sure what's the overhead of having, let's say, a 1 TiB backing > memory device mostly marked with madvise(MADV_DONTNEED). > Like, how much memory + swap this setup would actually consume - that's > something I would need to measure. There are some WIP items to improve that (QEMU metadata (e.g., bitmaps), KVM metadata (e.g., per-memslot), Linux metadata (e.g., page tables)). Memory overcommit handling also has to be tackled. So it would be a "shared" problem with virtio-mem and will be sorted out eventually :) > > 3) In a public cloud environment malicious guests are a possibility. > Currently (without things like resizable memslots) the best idea I tried > was to place the whole QEMU process into a memory-limited cgroup > (limited to the guest target size). Yes. Protection of unplugged memory is on my TODO list for virtio-mem as well, to avoid having to rely on cgroups. > > There are still some issues with it: one needs to reserve swap space up > to the guest maximum size so the QEMU process doesn't get OOM-killed if > the guest touches that memory, and the cgroup memory controller for some > reason seems to start swapping even before reaching its limit (why that > happens is still under investigation). Yes, putting a memory cap on Linux was always tricky. > >> Reboot? Logically unplug all memory and re-add it after the guest has booted up. >> >> The only thing we can't do is the following: when going below 4G, we cannot resize boot memory. >> >> >> But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup). > > Hyper-V actually "cleans up" the guest memory map on reboot - if the > guest was effectively resized up then on reboot the guest boot memory is > resized up to match that last size. > Similarly, if the guest was ballooned out - that amount of memory is > removed from the boot memory on reboot. Yes, it cleans up, but as I said, last time I checked there was this concept of startup vs. minimum vs. maximum, at least for dynamic memory: https://www.fastvue.co/tmgreporter/blog/understanding-hyper-v-dynamic-memory-dynamic-ram/ Startup RAM would be whatever you specify for "-m xG". If you go below min, you remove memory via deflation once the guest is up. > > So it's not exactly doing a hot-add after the guest boots. I recall BUG reports in Linux that we got hv-balloon hot-add requests ~1 minute after Linux booted up, because of the above reason of startup memory [in these BUG reports, memory onlining was disabled and the VM would run out of memory because we hotplugged too much memory]. That's why I remember that this approach once was done. Maybe there are multiple implementations nowadays. 
At least in QEMU you could choose whatever makes most sense for QEMU. > This approach (of resizing the boot memory) also avoids problems if the > guest loses hot-add / ballooning capability after a reboot - for example, > rebooting into a Linux guest from Windows with hv-balloon. TBH, I wouldn't be too concerned about that scenario ("hotplugged memory to a guest, guest reboots into a weird OS, weird OS isn't able to use hotplugged memory"). For virtio-mem, the important part was that you always "know" how much memory the VM is aware of. If you always start with "Startup memory" and hotadd later (only if you detected guest support after a bootup), you can handle that scenario. > > But unfortunately such resizing of the guest boot memory seems not trivial > to implement in QEMU. Yes, avoiding changing the memory layout to keep memory migration feasible was another thing I considered when designing virtio-mem. Anyhow, I'm just throwing out ideas here on how to eventually handle it differently.
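On the "protecting unplugged memory from getting reallocated" point: conceptually this is discard-plus-make-inaccessible on the unplugged range, so a stray access faults instead of silently pulling host memory back in. A minimal sketch of that idea (plain C, page-aligned offsets assumed; this is not how QEMU or virtio-mem actually implements it):

#include <stddef.h>
#include <sys/mman.h>

/* Discard an unplugged range inside a larger anonymous mapping and make it
 * inaccessible, so any stray access faults instead of re-populating host
 * memory.  'off' and 'len' must be page-aligned. */
static int range_unplug(void *base, size_t off, size_t len)
{
    char *p = (char *)base + off;

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        return -1;
    }
    return mprotect(p, len, PROT_NONE);
}

/* Re-plugging the range simply makes it accessible again; the pages get
 * faulted in on first use. */
static int range_plug(void *base, size_t off, size_t len)
{
    return mprotect((char *)base + off, len, PROT_READ | PROT_WRITE);
}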
On 1.03.2023 18:24, David Hildenbrand wrote: (...) >> With virtio-mem one can simply have per-node virtio-mem devices. >> >> 2) I'm not sure what's the overhead of having, let's say, 1 TiB backing >> memory device mostly marked madvise(MADV_DONTNEED). >> Like, how much memory + swap this setup would actually consume - that's >> something I would need to measure. > > There are some WIP items to improve that (QEMU metadata (e.g., bitmaps), KVM metadata (e.g., per-memslot), Linux metadata (e.g., page tables). > Memory overcommit handling also has to be tackled. > > So it would be a "shared" problem with virtio-mem and will be sorted out eventually :) > Yes, but this might take a bit of time, especially if kernel-side changes are involved - that's why I will check how this setup works in practice in its current shape. (...) >>> Reboot? Logically unplug all memory and as the guest boots up, re-add the memory after the guest booted up. >>> >>> The only thing we can't do is the following: when going below 4G, we cannot resize boot memory. >>> >>> >>> But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup). >> >> Hyper-V actually "cleans up" the guest memory map on reboot - if the >> guest was effectively resized up then on reboot the guest boot memory is >> resized up to match that last size. >> Similarly, if the guest was ballooned out - that amount of memory is >> removed from the boot memory on reboot. > > Yes, it cleans up, but as I said last time I checked there was this concept of startup vs. minimum vs. maximum, at least for dynamic memory: > > https://www.fastvue.co/tmgreporter/blog/understanding-hyper-v-dynamic-memory-dynamic-ram/ > > Startup RAM would be whatever you specify for "-m xG". If you go below min, you remove memory via deflation once the guest is up. That article was from 2014, so I guess it pertained Windows 2012 R2. The memory settings page in more recent Hyper-V versions looks like on the screenshot at [1]. It no longer calls that main memory amount value "Startup RAM", now it's just "RAM". Despite what one might think the "Enable Dynamic Memory" checkbox does *not* control the Dynamic Memory protocol availability or usage - the protocol is always available/exported to the guest. What the "Enable Dynamic Memory" checkbox controls is some host-side heuristics that automatically resize the guest within chosen bounds based on some metrics. Even if the "Enable Dynamic Memory" checkbox is *not* enabled the guest can still be online-resized via Dynamic Memory protocol by simply changing the value in the "RAM" field and clicking "Apply". At least that's how it works on Windows 2019 with a Linux guest. >> >> So it's not exactly doing a hot-add after the guest boots. > > I recall BUG reports in Linux, that we got hv-balloon hot-add requests ~1 minute after Linux booted up, because of the above reason of startup memory [in these BUG reports, memory onlining was disabled and the VM would run out of memory because we hotplugged too much memory]. That's why I remember that this approach once was done. > > Maybe there are multiple implementations noways. At least in QEMU you could chose whatever makes most sense for QEMU. > Right, it seems that the Hyper-V behavior evolved with time, too. 
>> This approach (of resizing the boot memory) also avoids problems if the >> guest loses hot-add / ballooning capability after a reboot - for example, >> rebooting into a Linux guest from Windows with hv-balloon. > > TBH, I wouldn't be too concerned about that scenario ("hotplugged memory to a guest, guest reboots into a weird OS, weird OS isn't able to use hotplugged memory). For virtio-mem, the important part was that you always "know" how much memory the VM is aware about. If you always start with "Startup memory" and hotadd later (only if you detected guest support after a bootup), you can handle that scenario. I'm not *that* concerned with cross-guest-type scenario either, but if it can be made more smooth then I wouldn't mind. Thanks, Maciej [1]: https://www.tenforums.com/performance-maintenance/38478-windows-10-hyper-v-dynamic-memory.html#post544905
On 01.03.23 23:08, Maciej S. Szmigiero wrote: > On 1.03.2023 18:24, David Hildenbrand wrote: > (...) >>> With virtio-mem one can simply have per-node virtio-mem devices. >>> >>> 2) I'm not sure what's the overhead of having, let's say, 1 TiB backing >>> memory device mostly marked madvise(MADV_DONTNEED). >>> Like, how much memory + swap this setup would actually consume - that's >>> something I would need to measure. >> >> There are some WIP items to improve that (QEMU metadata (e.g., bitmaps), KVM metadata (e.g., per-memslot), Linux metadata (e.g., page tables). >> Memory overcommit handling also has to be tackled. >> >> So it would be a "shared" problem with virtio-mem and will be sorted out eventually :) >> > > Yes, but this might take a bit of time, especially if kernel-side changes > are involved - that's why I will check how this setup works in practice > in its current shape. Yes, let me know if you have any question. I invested a lot of time to figure out all of the details and possible workarounds/approaches in the past. >>> Hyper-V actually "cleans up" the guest memory map on reboot - if the >>> guest was effectively resized up then on reboot the guest boot memory is >>> resized up to match that last size. >>> Similarly, if the guest was ballooned out - that amount of memory is >>> removed from the boot memory on reboot. >> >> Yes, it cleans up, but as I said last time I checked there was this concept of startup vs. minimum vs. maximum, at least for dynamic memory: >> >> https://www.fastvue.co/tmgreporter/blog/understanding-hyper-v-dynamic-memory-dynamic-ram/ >> >> Startup RAM would be whatever you specify for "-m xG". If you go below min, you remove memory via deflation once the guest is up. > > > That article was from 2014, so I guess it pertained Windows 2012 R2. I remember seeing the same interface when I played with that a couple of years ago, but I don't recall which windows version i was using. > > The memory settings page in more recent Hyper-V versions looks like on > the screenshot at [1]. > > It no longer calls that main memory amount value "Startup RAM", now it's > just "RAM". > > Despite what one might think the "Enable Dynamic Memory" checkbox does > *not* control the Dynamic Memory protocol availability or usage - the > protocol is always available/exported to the guest. > > What the "Enable Dynamic Memory" checkbox controls is some host-side > heuristics that automatically resize the guest within chosen bounds > based on some metrics. > > Even if the "Enable Dynamic Memory" checkbox is *not* enabled the guest > can still be online-resized via Dynamic Memory protocol by simply > changing the value in the "RAM" field and clicking "Apply". > > At least that's how it works on Windows 2019 with a Linux guest. Right, I recall that that's a feature that was separately announced as explicit VM resizing, not HV dynamic memory. It uses the same underlying mechanism, yes, which is why the feature is always exposed to the VMs. That's most probably when they performed the "Startup RAM" -> "RAM" rename, to make both features possibly co-exist and easier to configure. > >>> >>> So it's not exactly doing a hot-add after the guest boots. >> >> I recall BUG reports in Linux, that we got hv-balloon hot-add requests ~1 minute after Linux booted up, because of the above reason of startup memory [in these BUG reports, memory onlining was disabled and the VM would run out of memory because we hotplugged too much memory]. That's why I remember that this approach once was done. 
>>
>> Maybe there are multiple implementations nowadays. At least in QEMU you
>> could choose whatever makes most sense for QEMU.
>
> Right, it seems that the Hyper-V behavior evolved with time, too.

Yes. One could think of a split approach, that is, we never resize the
initial RAM size (-m XG) from inside QEMU. Instead, we could have the
following models:

(1) Basic "Startup RAM" model: always (re)boot Linux with "-m XG". Once
    the VM comes up after a reboot, we either add memory or request to
    inflate the balloon, to reach the previous guest size.

    Whenever the VM reboots, we first defrag all hv-balloon provided
    memory ("one contiguous chunk") to then "add" that memory to the VM.
    If the logical VM size <= requested, this hv-balloon memory size
    would be "0". Essentially resembling the "old" HV dynamic memory
    approach.

(2) Extended "Startup RAM" mode: same as (1), but instead of hot-adding
    the RAM after the guest comes up, we simply defrag the hv-balloon RAM
    during reboot ("one contiguous chunk") and expose it via e820/SRAT to
    the guest. Going "below" startup RAM will still require inflation
    once the guest is up.

(3) External "Resize" mode: on reboot, simply shut down the VM and notify
    libvirt. Libvirt will restart the VM with adjusted "Startup RAM".

It's fairly straightforward to extend (1) to achieve (2). That could be a
sane default for QEMU. Whoever wants (3) can simply let libvirt handle it
on top without any special handling.

An internal resize mode is tricky, especially regarding migration. With
sufficient motivation and problem solving, one might be able to turn (1)
or (2) into such a (4) mode. It would just be an implementation detail.

Note that I never considered the "go below initial RAM" and "resize
initial RAM" cases really relevant for virtio-mem. Instead, you choose
the startup size to be reasonably small (e.g., 4 GiB) and expose memory
via the virtio-mem devices right at QEMU startup ("requested-size=XG").
The same approach could be applied to the hv-balloon model.

One main reason to decide against resizing significantly below 4G was,
for example, that you'll end up losing valuable DMA/DMA32 memory the
lower you go -- memory that no hotplugged memory will provide. So using
inflation for everything < 4G does not sound too crazy to me, and could
avoid mode (3) altogether.

But again, just my thoughts.
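[A quick way to get a first number for the question raised earlier in this
exchange - what a large anonymous mapping that is mostly MADV_DONTNEED
actually costs in resident memory - is a small stand-alone test program.
The sketch below is purely illustrative and is not QEMU code; it assumes a
64-bit Linux host, and the sizes are arbitrary.]

/* Map a large anonymous region, populate a small part of it, discard it
 * with MADV_DONTNEED and print the process RSS at each step. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void print_rss(const char *tag)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    while (f && fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            printf("%s: %s", tag, line);
            break;
        }
    }
    if (f) {
        fclose(f);
    }
}

int main(void)
{
    const size_t size = 16ULL * 1024 * 1024 * 1024; /* 16 GiB of address space */
    unsigned char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    print_rss("after mmap");

    memset(p, 0xaa, (size_t)1 << 30);   /* populate 1 GiB */
    print_rss("after touching 1 GiB");

    madvise(p, size, MADV_DONTNEED);    /* discard the populated pages */
    print_rss("after MADV_DONTNEED");

    munmap(p, size);
    return 0;
}

[The real overhead of such a QEMU setup would additionally include the
per-memslot KVM metadata, dirty bitmaps and page tables mentioned above,
which a toy program like this does not capture.]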
diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index 9fbfe748b5..13f70707ed 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -68,6 +68,7 @@ config I440FX
     imply E1000_PCI
     imply VMPORT
     imply VMMOUSE
+    imply HAPVDIMM
     select ACPI_PIIX4
     select PC_PCI
     select PC_ACPI
@@ -95,6 +96,7 @@ config Q35
     imply E1000E_PCI_EXPRESS
     imply VMPORT
     imply VMMOUSE
+    imply HAPVDIMM
     select PC_PCI
     select PC_ACPI
     select PCI_EXPRESS_Q35
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a7a2ededf9..5469d89bcc 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -73,6 +73,7 @@
 #include "hw/acpi/acpi.h"
 #include "hw/acpi/cpu_hotplug.h"
 #include "acpi-build.h"
+#include "hw/mem/hapvdimm.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/mem/nvdimm.h"
 #include "hw/cxl/cxl.h"
@@ -1609,7 +1610,8 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI) ||
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) ||
-        object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
+        object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE) ||
+        object_dynamic_cast(OBJECT(dev), TYPE_HAPVDIMM)) {
         return HOTPLUG_HANDLER(machine);
     }
diff --git a/hw/mem/Kconfig b/hw/mem/Kconfig
index 73c5ae8ad9..d8c1feafed 100644
--- a/hw/mem/Kconfig
+++ b/hw/mem/Kconfig
@@ -16,3 +16,7 @@ config CXL_MEM_DEVICE
     bool
     default y if CXL
     select MEM_DEVICE
+
+config HAPVDIMM
+    bool
+    select MEM_DEVICE
diff --git a/hw/mem/hapvdimm.c b/hw/mem/hapvdimm.c
new file mode 100644
index 0000000000..9ae82edb2c
--- /dev/null
+++ b/hw/mem/hapvdimm.c
@@ -0,0 +1,221 @@
+/*
+ * A memory hot-add protocol vDIMM device
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * Heavily based on pc-dimm.c:
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2014 Red Hat Inc
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "exec/memory.h"
+#include "hw/boards.h"
+#include "hw/mem/hapvdimm.h"
+#include "hw/mem/memory-device.h"
+#include "hw/qdev-core.h"
+#include "hw/qdev-properties.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "qemu/module.h"
+#include "sysemu/hostmem.h"
+#include "trace.h"
+
+typedef struct HAPVDIMMDevice {
+    /* private */
+    DeviceState parent_obj;
+
+    /* public */
+    bool ever_realized;
+    uint64_t addr;
+    uint64_t align;
+    uint32_t node;
+    HostMemoryBackend *hostmem;
+} HAPVDIMMDevice;
+
+typedef struct HAPVDIMMDeviceClass {
+    /* private */
+    DeviceClass parent_class;
+} HAPVDIMMDeviceClass;
+
+static bool hapvdimm_adding_allowed;
+static Property hapvdimm_properties[] = {
+    DEFINE_PROP_UINT64(HAPVDIMM_ADDR_PROP, HAPVDIMMDevice, addr, 0),
+    DEFINE_PROP_UINT64(HAPVDIMM_ALIGN_PROP, HAPVDIMMDevice, align, 0),
+    DEFINE_PROP_LINK(HAPVDIMM_MEMDEV_PROP, HAPVDIMMDevice, hostmem,
+                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+void hapvdimm_allow_adding(void)
+{
+    hapvdimm_adding_allowed = true;
+}
+
+void hapvdimm_disallow_adding(void)
+{
+    hapvdimm_adding_allowed = false;
+}
+
+static void hapvdimm_get_size(Object *obj, Visitor *v, const char *name,
+                              void *opaque, Error **errp)
+{
+    ERRP_GUARD();
+    uint64_t value;
+
+    value = memory_device_get_region_size(MEMORY_DEVICE(obj), errp);
+    if (*errp) {
+        return;
+    }
+
+    visit_type_uint64(v, name, &value, errp);
+}
+
+static void hapvdimm_init(Object *obj)
+{
+    object_property_add(obj, HAPVDIMM_SIZE_PROP, "uint64", hapvdimm_get_size,
+                        NULL, NULL, NULL);
+}
+
+static void hapvdimm_realize(DeviceState *dev, Error **errp)
+{
+    ERRP_GUARD();
+    HAPVDIMMDevice *hapvdimm = HAPVDIMM(dev);
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    if (!hapvdimm->ever_realized) {
+        if (!hapvdimm_adding_allowed) {
+            error_setg(errp, "direct adding not allowed");
+            return;
+        }
+
+        hapvdimm->ever_realized = true;
+    }
+
+    memory_device_pre_plug(MEMORY_DEVICE(hapvdimm), ms,
+                           hapvdimm->align ? &hapvdimm->align : NULL,
+                           errp);
+    if (*errp) {
+        return;
+    }
+
+    if (!hapvdimm->hostmem) {
+        error_setg(errp, "'" HAPVDIMM_MEMDEV_PROP "' property is not set");
+        return;
+    } else if (host_memory_backend_is_mapped(hapvdimm->hostmem)) {
+        const char *path;
+
+        path = object_get_canonical_path_component(OBJECT(hapvdimm->hostmem));
+        error_setg(errp, "can't use already busy memdev: %s", path);
+        return;
+    }
+
+    host_memory_backend_set_mapped(hapvdimm->hostmem, true);
+
+    memory_device_plug(MEMORY_DEVICE(hapvdimm), ms);
+    vmstate_register_ram(host_memory_backend_get_memory(hapvdimm->hostmem),
+                         dev);
+}
+
+static void hapvdimm_unrealize(DeviceState *dev)
+{
+    HAPVDIMMDevice *hapvdimm = HAPVDIMM(dev);
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    memory_device_unplug(MEMORY_DEVICE(hapvdimm), ms);
+    vmstate_unregister_ram(host_memory_backend_get_memory(hapvdimm->hostmem),
+                           dev);
+
+    host_memory_backend_set_mapped(hapvdimm->hostmem, false);
+}
+
+static uint64_t hapvdimm_md_get_addr(const MemoryDeviceState *md)
+{
+    return object_property_get_uint(OBJECT(md), HAPVDIMM_ADDR_PROP,
+                                    &error_abort);
+}
+
+static void hapvdimm_md_set_addr(MemoryDeviceState *md, uint64_t addr,
+                                 Error **errp)
+{
+    object_property_set_uint(OBJECT(md), HAPVDIMM_ADDR_PROP, addr, errp);
+}
+
+static MemoryRegion *hapvdimm_md_get_memory_region(MemoryDeviceState *md,
+                                                   Error **errp)
+{
+    HAPVDIMMDevice *hapvdimm = HAPVDIMM(md);
+
+    if (!hapvdimm->hostmem) {
+        error_setg(errp, "'" HAPVDIMM_MEMDEV_PROP "' property must be set");
+        return NULL;
+    }
+
+    return host_memory_backend_get_memory(hapvdimm->hostmem);
+}
+
+static void hapvdimm_md_fill_device_info(const MemoryDeviceState *md,
+                                         MemoryDeviceInfo *info)
+{
+    PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+    const DeviceClass *dc = DEVICE_GET_CLASS(md);
+    const HAPVDIMMDevice *hapvdimm = HAPVDIMM(md);
+    const DeviceState *dev = DEVICE(md);
+
+    if (dev->id) {
+        di->id = g_strdup(dev->id);
+    }
+    di->hotplugged = dev->hotplugged;
+    di->hotpluggable = dc->hotpluggable;
+    di->addr = hapvdimm->addr;
+    di->slot = -1;
+    di->node = 0; /* FIXME: report proper node */
+    di->size = object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_SIZE_PROP,
+                                        NULL);
+    di->memdev = object_get_canonical_path(OBJECT(hapvdimm->hostmem));
+
+    info->u.dimm.data = di;
+    info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+}
+
+static void hapvdimm_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);
+
+    dc->realize = hapvdimm_realize;
+    dc->unrealize = hapvdimm_unrealize;
+    device_class_set_props(dc, hapvdimm_properties);
+    dc->desc = "vDIMM for a hot add protocol";
+
+    mdc->get_addr = hapvdimm_md_get_addr;
+    mdc->set_addr = hapvdimm_md_set_addr;
+    mdc->get_plugged_size = memory_device_get_region_size;
+    mdc->get_memory_region = hapvdimm_md_get_memory_region;
+    mdc->fill_device_info = hapvdimm_md_fill_device_info;
+}
+
+static const TypeInfo hapvdimm_info = {
+    .name = TYPE_HAPVDIMM,
+    .parent = TYPE_DEVICE,
+    .instance_size = sizeof(HAPVDIMMDevice),
+    .instance_init = hapvdimm_init,
+    .class_init = hapvdimm_class_init,
+    .class_size = sizeof(HAPVDIMMDeviceClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_MEMORY_DEVICE },
+        { }
+    },
+};
+
+static void hapvdimm_register_types(void)
+{
+    type_register_static(&hapvdimm_info);
+}
+
+type_init(hapvdimm_register_types)
diff --git a/hw/mem/meson.build b/hw/mem/meson.build
index 609b2b36fc..5f7a0181d3 100644
--- a/hw/mem/meson.build
+++ b/hw/mem/meson.build
@@ -4,6 +4,7 @@
 mem_ss.add(when: 'CONFIG_DIMM', if_true: files('pc-dimm.c'))
 mem_ss.add(when: 'CONFIG_NPCM7XX', if_true: files('npcm7xx_mc.c'))
 mem_ss.add(when: 'CONFIG_NVDIMM', if_true: files('nvdimm.c'))
 mem_ss.add(when: 'CONFIG_CXL_MEM_DEVICE', if_true: files('cxl_type3.c'))
+mem_ss.add(when: 'CONFIG_HAPVDIMM', if_true: files('hapvdimm.c'))
 
 softmmu_ss.add_all(when: 'CONFIG_MEM_DEVICE', if_true: mem_ss)
diff --git a/include/hw/mem/hapvdimm.h b/include/hw/mem/hapvdimm.h
new file mode 100644
index 0000000000..bb9a135a52
--- /dev/null
+++ b/include/hw/mem/hapvdimm.h
@@ -0,0 +1,27 @@
+/*
+ * A memory hot-add protocol vDIMM device
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_HAPVDIMM_H
+#define QEMU_HAPVDIMM_H
+
+#include "qom/object.h"
+
+#define TYPE_HAPVDIMM "mem-hapvdimm"
+OBJECT_DECLARE_SIMPLE_TYPE(HAPVDIMMDevice, HAPVDIMM)
+
+#define HAPVDIMM_ADDR_PROP "addr"
+#define HAPVDIMM_ALIGN_PROP "align"
+#define HAPVDIMM_SIZE_PROP "size"
+#define HAPVDIMM_MEMDEV_PROP "memdev"
+
+void hapvdimm_allow_adding(void);
+void hapvdimm_disallow_adding(void);
+
+#endif
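[As a usage illustration of the allow/disallow gate declared in the header
above, a hot-add protocol driver could wrap device creation roughly as in
the sketch below. This is not part of the patch: protocol_driver_hot_add()
is a made-up name, and it uses qdev_new()/qdev_realize_and_unref() rather
than the qdev_device_add*() path an actual driver might prefer.]

/* Illustrative sketch only -- not part of this patch. */
static bool protocol_driver_hot_add(HostMemoryBackend *backend, Error **errp)
{
    DeviceState *dev;
    bool ok;

    /* Temporarily authorize creation of the otherwise user-invisible device. */
    hapvdimm_allow_adding();

    dev = qdev_new(TYPE_HAPVDIMM);
    object_property_set_link(OBJECT(dev), HAPVDIMM_MEMDEV_PROP,
                             OBJECT(backend), &error_abort);

    /* Realizing the device runs the memory device (pre-)plug path above. */
    ok = qdev_realize_and_unref(dev, NULL, errp);

    hapvdimm_disallow_adding();
    return ok;
}

[Error handling and unplug are omitted; the point is only that the window
in which a HAPVDIMM may be created is bounded by the two calls.]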