
[RFC] memory: pause all vCPUs for the duration of memory transactions

Message ID 20201026084916.3103221-1-vkuznets@redhat.com (mailing list archive)
State New, archived
Series [RFC] memory: pause all vCPUs for the duration of memory transactions

Commit Message

Vitaly Kuznetsov Oct. 26, 2020, 8:49 a.m. UTC
Currently, KVM doesn't provide an API to make atomic updates to memmap when
the change touches more than one memory slot, e.g. in case we'd like to
punch a hole in an existing slot.

Reports are that multi-CPU Q35 VMs booted with OVMF sometimes print something
like

!!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000003 !!!!
ExceptionData - 0000000000000010  I:1 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
RIP  - 000000007E35FAB6, CS  - 0000000000000038, RFLAGS - 0000000000010006
RAX  - 0000000000000000, RCX - 000000007E3598F2, RDX - 00000000078BFBFF
...

The problem seems to be that TSEG manipulations on one vCPU are not atomic
from other vCPUs' views. In particular, here's the strace:

Initial creation of the 'problematic' slot:

10085 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
   memory_size=2146435072, userspace_addr=0x7fb89bf00000}) = 0

... and then the update (caused by e.g. mch_update_smram()) later:

10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
   memory_size=0, userspace_addr=0x7fb89bf00000}) = 0
10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
   memory_size=2129657856, userspace_addr=0x7fb89bf00000}) = 0

If KVM has to handle any event on a different vCPU in between these
two calls, the #PF will be triggered.

An ideal solution to the problem would probably require KVM to provide a
new API to do the whole transaction in one shot, but as a band-aid we can
just pause all vCPUs to make memory transactions atomic.
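To make the race concrete, here is a small helper (illustrative only, not QEMU code) that computes what remains of a slot after punching a hole. The point is that KVM_SET_USER_MEMORY_REGION can only express the result as a delete of the old slot followed by re-creation of the surviving pieces, leaving a window in which the range is simply unmapped:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only -- not QEMU code. Punching a hole
 * [hole_start, hole_start + hole_size) out of a slot leaves up to two
 * smaller slots; with the current KVM API each surviving piece has to be
 * re-registered with a separate KVM_SET_USER_MEMORY_REGION call after
 * the original slot is deleted, so the update is not atomic. */
typedef struct { uint64_t start, size; } Slot;

static int punch_hole(Slot in, uint64_t hole_start, uint64_t hole_size,
                      Slot out[2])
{
    uint64_t end = in.start + in.size;
    uint64_t hole_end = hole_start + hole_size;
    int n = 0;

    if (hole_start > in.start) {        /* part below the hole survives */
        out[n++] = (Slot){ in.start, hole_start - in.start };
    }
    if (hole_end < end) {               /* part above the hole survives */
        out[n++] = (Slot){ hole_end, end - hole_end };
    }
    return n;
}
```

For the TSEG case in the strace above, the 16 MiB hole sits at the very top of the slot, so a single smaller slot of 2129657856 bytes remains -- exactly the size seen in the second KVM_SET_USER_MEMORY_REGION call.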

Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC: Generally, memmap updates happen only a few times during guest boot,
but I'm not sure there are no scenarios where pausing all vCPUs is
undesirable from a performance point of view. Also, I'm not sure whether
the kvm_enabled() check is needed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 softmmu/memory.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

David Hildenbrand Oct. 26, 2020, 10:43 a.m. UTC | #1
On 26.10.20 09:49, Vitaly Kuznetsov wrote:
> [snip: commit message quoted in full above]
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index fa280a19f7f7..0bf6f3f6d5dc 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -28,6 +28,7 @@
>  
>  #include "exec/memory-internal.h"
>  #include "exec/ram_addr.h"
> +#include "sysemu/cpus.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/runstate.h"
>  #include "sysemu/tcg.h"
> @@ -1057,7 +1058,9 @@ static void address_space_update_topology(AddressSpace *as)
>  void memory_region_transaction_begin(void)
>  {
>      qemu_flush_coalesced_mmio_buffer();
> -    ++memory_region_transaction_depth;
> +    if ((++memory_region_transaction_depth == 1) && kvm_enabled()) {
> +        pause_all_vcpus();
> +    }
>  }
>  
>  void memory_region_transaction_commit(void)
> @@ -1087,7 +1090,11 @@ void memory_region_transaction_commit(void)
>              }
>              ioeventfd_update_pending = false;
>          }
> -   }
> +
> +        if (kvm_enabled()) {
> +            resume_all_vcpus();
> +        }
> +    }
>  }
>  
>  static void memory_region_destructor_none(MemoryRegion *mr)
> 

This is in general unsafe. pause_all_vcpus() will temporarily drop the
BQL, resulting in bad things happening to caller sites.

I studied the involved issues quite intensively when wanting to resize
memory regions from the virtio-mem code. It's not that easy.

Have a look at my RFC for resizing. You can apply something similar to
other operations.

https://www.mail-archive.com/qemu-devel@nongnu.org/msg684979.html
David Hildenbrand Oct. 26, 2020, 11:17 a.m. UTC | #2
On 26.10.20 11:43, David Hildenbrand wrote:
> On 26.10.20 09:49, Vitaly Kuznetsov wrote:
>> [snip: patch quoted in full above]
> 
> This is in general unsafe. pause_all_vcpus() will temporarily drop the
> BQL, resulting in bad things happening to caller sites.
> 
> I studied the involved issues quite intensively when wanting to resize
> memory regions from the virtio-mem code. It's not that easy.
> 
> Have a look at my RFC for resizing. You can apply something similar to
> other operations.
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg684979.html

Oh, and I even mentioned the case you're trying to fix here back then:

"
Instead of inhibiting during the region_resize(), we could inhibit for the
whole memory transaction (from begin() to commit()). This could be nice,
because also splitting of memory regions would be atomic (I remember there
was a BUG report regarding that), however, I am not sure if that might
impact any RT users.
"

The current patches live in
https://github.com/davidhildenbrand/qemu/commits/virtio-mem-next

Especially

https://github.com/davidhildenbrand/qemu/commit/433fbb3abed20f15030e42f2b2bea7e6b9a15180


I haven't proceeded in upstreaming because I'm still busy with
virtio-mem thingies in the kernel.
Vitaly Kuznetsov Oct. 27, 2020, 12:36 p.m. UTC | #3
David Hildenbrand <david@redhat.com> writes:

> On 26.10.20 11:43, David Hildenbrand wrote:
>> On 26.10.20 09:49, Vitaly Kuznetsov wrote:
>>> [snip: patch quoted in full above]
>> 
>> This is in general unsafe. pause_all_vcpus() will temporarily drop the
>> BQL, resulting in bad things happening to caller sites.

Oh, I see, thanks! I was expecting there's a reason we don't have this
simple fix in already :-)

>> 
>> I studied the involved issues quite intensively when wanting to resize
>> memory regions from the virtio-mem code. It's not that easy.
>> 
>> Have a look at my RFC for resizing. You can apply something similar to
>> other operations.
>> 
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg684979.html
>
> Oh, and I even mentioned the case you try to fix here back then
>
> "
> Instead of inhibiting during the region_resize(), we could inhibit for the
> whole memory transaction (from begin() to commit()). This could be nice,
> because also splitting of memory regions would be atomic (I remember there
> was a BUG report regarding that), however, I am not sure if that might
> impact any RT users.
> "
>
> The current patches live in
> https://github.com/davidhildenbrand/qemu/commits/virtio-mem-next
>
> Especially
>
> https://github.com/davidhildenbrand/qemu/commit/433fbb3abed20f15030e42f2b2bea7e6b9a15180
>
>

I'm not sure why we're focusing on ioctls here. I was debugging my case
quite some time ago, but from what I remember it had nothing to do with
ioctls from QEMU. When we are removing a memslot, any exit to KVM may
trigger an error condition, as we'll see that the vCPU or some of our
internal structures (e.g. the VMCS for a nested guest) reference
non-existent memory. I don't see a good solution other than making the
update fully atomic from *all* vCPUs' point of view, and this requires
stopping all vCPUs -- either from QEMU or from KVM.

Resizing a slot can probably be done without removing it first; however,
I expect that organizing the QEMU code so that it can decide whether or
not the old configuration requires removal is not easy. In some cases
(e.g. punching a KVM_MEM_READONLY hole in the middle of an RW slot) it
seems to be impossible with the current KVM API.
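As a sketch of why the read-only case is worse (hypothetical helper, not QEMU code): the middle of the RW slot has to come back as its own KVM_MEM_READONLY slot, so the one slot becomes three, and the transition takes four KVM_SET_USER_MEMORY_REGION calls (one delete, three creates) that cannot be made atomic today:

```c
#include <assert.h>
#include <stdint.h>

#define KVM_MEM_READONLY (1u << 1)   /* flag value from <linux/kvm.h> */

typedef struct { uint64_t start, size; uint32_t flags; } MemSlot;

/* Illustrative only. Making the middle of an RW slot read-only means
 * deleting the slot and re-creating it as three slots (RW low, RO
 * middle, RW high). This helper just computes the resulting layout;
 * each entry would need its own KVM_SET_USER_MEMORY_REGION call. */
static int split_readonly_hole(MemSlot in, uint64_t h_start, uint64_t h_size,
                               MemSlot out[3])
{
    out[0] = (MemSlot){ in.start, h_start - in.start, 0 };
    out[1] = (MemSlot){ h_start, h_size, KVM_MEM_READONLY };
    out[2] = (MemSlot){ h_start + h_size,
                       in.start + in.size - (h_start + h_size), 0 };
    return 3;
}
```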


> I haven't proceeded in upstreaming because I'm still busy with
> virtio-mem thingies in the kernel.
David Hildenbrand Oct. 27, 2020, 12:42 p.m. UTC | #4
On 27.10.20 13:36, Vitaly Kuznetsov wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 26.10.20 11:43, David Hildenbrand wrote:
>>> On 26.10.20 09:49, Vitaly Kuznetsov wrote:
>>>> [snip: patch quoted in full above]
>>>
>>> This is in general unsafe. pause_all_vcpus() will temporarily drop the
>>> BQL, resulting in bad things happening to caller sites.
> 
> Oh, I see, thanks! I was expecting there's a reason we don't have this
> simple fix in already :-)
> 
>>> [snip: RFC discussion and links quoted above]
> 
> I'm not sure why we're focusing on ioctls here. I was debugging my case
> quite some time ago but from what I remember it had nothing to do with
> ioctls from QEMU. When we are removing a memslot any exit to KVM may
> trigger an error condition as we'll see that vCPU or some of our
> internal structures (e.g. VMCS for a nested guest) references
> non-existent memory. I don't see a good solution other than making the
> update fully atomic from *all* vCPUs point of view and this requires
> stopping all CPUs -- either from QEMU or from KVM.

I cannot follow. My patch waits until *all* KVM ioctls are out of the
kernel. That includes vCPUs, but also other ioctls (because there are
some that require a consistent memory block state).

So from a KVM point of view, the CPUs are stopped.
Vitaly Kuznetsov Oct. 27, 2020, 1:02 p.m. UTC | #5
David Hildenbrand <david@redhat.com> writes:

> On 27.10.20 13:36, Vitaly Kuznetsov wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>>> On 26.10.20 11:43, David Hildenbrand wrote:
>>>> On 26.10.20 09:49, Vitaly Kuznetsov wrote:
>>>>> [snip: patch quoted in full above]
>>>>
>>>> This is in general unsafe. pause_all_vcpus() will temporarily drop the
>>>> BQL, resulting in bad things happening to caller sites.
>> 
>> Oh, I see, thanks! I was expecting there's a reason we don't have this
>> simple fix in already :-)
>> 
>>>> [snip: RFC discussion and links quoted above]
>> 
>> I'm not sure why we're focusing on ioctls here. I was debugging my case
>> quite some time ago but from what I remember it had nothing to do with
>> ioctls from QEMU. When we are removing a memslot any exit to KVM may
>> trigger an error condition as we'll see that vCPU or some of our
>> internal structures (e.g. VMCS for a nested guest) references
>> non-existent memory. I don't see a good solution other than making the
>> update fully atomic from *all* vCPUs point of view and this requires
>> stopping all CPUs -- either from QEMU or from KVM.
>
> I cannot follow. My patch waits until *any* KVM ioctls are out of the 
> kernel. That includes VCPUs, but also other ioctls (because there are 
> some that require a consistent memory block state).
>
> So from a KVM point of view, the CPUs are stopped.

Sorry for not being clear: your patch looks good to me. What I tried to
say is that with the current KVM API, the only way to guarantee atomicity
of the update is to make the vCPUs stop one way or another. Kicking them
out and preventing new ioctls from being dispatched is one way;
temporarily pausing them inside KVM would be another, for example, but
that would require a *new* API supplying the whole transaction rather
than one memslot update at a time.
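Whichever side stops the vCPUs, the stop/resume pair must fire only at the outermost level of nested transactions. The RFC keys this off memory_region_transaction_depth, and the pairing can be modeled in isolation (counters stand in for pause_all_vcpus()/resume_all_vcpus(); illustrative only):

```c
#include <assert.h>

/* Toy model of the pairing in the RFC patch: vCPUs are paused when the
 * outermost transaction begins (depth 0 -> 1) and resumed when it
 * commits (depth 1 -> 0); nested begin/commit pairs do nothing extra. */
static int transaction_depth;
static int pause_calls, resume_calls;

static void transaction_begin(void)
{
    if (++transaction_depth == 1) {
        pause_calls++;                /* would call pause_all_vcpus() */
    }
}

static void transaction_commit(void)
{
    assert(transaction_depth > 0);    /* unbalanced commit is a bug */
    if (--transaction_depth == 0) {
        resume_calls++;               /* would call resume_all_vcpus() */
    }
}
```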
David Hildenbrand Oct. 27, 2020, 1:08 p.m. UTC | #6
On 27.10.20 14:02, Vitaly Kuznetsov wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 27.10.20 13:36, Vitaly Kuznetsov wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>
>>>> On 26.10.20 11:43, David Hildenbrand wrote:
>>>>> On 26.10.20 09:49, Vitaly Kuznetsov wrote:
>>>>>> [snip: patch quoted in full above]
>>>>>
>>>>> This is in general unsafe. pause_all_vcpus() will temporarily drop the
>>>>> BQL, resulting in bad things happening to caller sites.
>>>
>>> Oh, I see, thanks! I was expecting there's a reason we don't have this
>>> simple fix in already :-)
>>>
>>>>>
>>>>> I studies the involved issues quite intensively when wanting to resize
>>>>> memory regions from virtio-mem code. It's not that easy.
>>>>>
>>>>> Have a look at my RFC for resizing. You can apply something similar to
>>>>> other operations.
>>>>>
>>>>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg684979.html
>>>>
>>>> Oh, and I even mentioned the case you try to fix here back then
>>>>
>>>> "
>>>> Instead of inhibiting during the region_resize(), we could inhibit for the
>>>> hole memory transaction (from begin() to commit()). This could be nice,
>>>> because also splitting of memory regions would be atomic (I remember there
>>>> was a BUG report regarding that), however, I am not sure if that might
>>>> impact any RT users.
>>>> "
>>>>
>>>> The current patches live in
>>>> https://github.com/davidhildenbrand/qemu/commits/virtio-mem-next
>>>>
>>>> Especially
>>>>
>>>> https://github.com/davidhildenbrand/qemu/commit/433fbb3abed20f15030e42f2b2bea7e6b9a15180
>>>>
>>>>
>>>
>>> I'm not sure why we're focusing on ioctls here. I was debugging my case
>>> quite some time ago but from what I remember it had nothing to do with
>>> ioctls from QEMU. When we are removing a memslot any exit to KVM may
>>> trigger an error condition as we'll see that vCPU or some of our
>>> internal structures (e.g. VMCS for a nested guest) references
>>> non-existent memory. I don't see a good solution other than making the
>>> update fully atomic from *all* vCPUs point of view and this requires
>>> stopping all CPUs -- either from QEMU or from KVM.
>>
>> I cannot follow. My patch waits until *any* KVM ioctls are out of the
>> kernel. That includes VCPUs, but also other ioctls (because there are
>> some that require a consistent memory block state).
>>
>> So from a KVM point of view, the CPUs are stopped.
> 
> Sorry for not being clear: your patch looks good to me, what I tried to
> say is that with the current KVM API the only way to guarantee atomicity
> of the update is to make vCPUs stop (one way or another), kicking them
> out and preventing new IOCTLs from being dispatched is one way
> (temporary pausing them inside KVM would be another, for example -- but
> that would require *new* API supplying the whole transaction and not one
> memslot update).

Ah, got it.

Yes - and I briefly looked into resizing slots inside KVM atomically and 
it already turned out to be a major pain. All that metadata that's 
allocated for a memory slot based on the size is problematic.

Same applies to all other kinds of operations (splitting, punching out, 
...) as you also mentioned.
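To illustrate why size-dependent metadata makes in-kernel atomic resizes
painful, here is a simplified sketch (not the kernel's exact formula; see
kvm_dirty_bitmap_bytes() for the real thing) of how a slot's dirty bitmap
scales with its size, so any resize forces a reallocation:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Simplified: one dirty bit per guest page, rounded up to whole bytes.
 * A slot of a different size needs a differently sized bitmap, so it
 * cannot simply be carried over across a resize. */
static unsigned long dirty_bitmap_bytes(unsigned long memory_size)
{
    unsigned long npages = memory_size / PAGE_SIZE;

    return (npages + 7) / 8;
}
```

For the two slot sizes in the strace above (2146435072 vs 2129657856
bytes) the bitmaps differ, which is exactly the kind of metadata that
would have to be swapped atomically inside KVM.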
Vitaly Kuznetsov Oct. 27, 2020, 1:19 p.m. UTC | #7
David Hildenbrand <david@redhat.com> writes:

> On 27.10.20 14:02, Vitaly Kuznetsov wrote:
>> 
>> Sorry for not being clear: your patch looks good to me, what I tried to
>> say is that with the current KVM API the only way to guarantee atomicity
>> of the update is to make vCPUs stop (one way or another), kicking them
>> out and preventing new IOCTLs from being dispatched is one way
>> (temporary pausing them inside KVM would be another, for example -- but
>> that would require *new* API supplying the whole transaction and not one
>> memslot update).
>
> Ah, got it.
>
> Yes - and I briefly looked into resizing slots inside KVM atomically and 
> it already turned out to be a major pain. All that metadata that's 
> allocated for a memory slot based on the size is problematic.
>

Yep, exactly, and personally I'd rather refrain from doing more tricks
within KVM to keep the code simple. Generally, memslot updates shouldn't
happen very often, so pausing and resuming vCPUs should be acceptable
(that was one of the questions for this RFC).

Overall, I think we're in violent agreement here)

> Same applies to all other kinds of operations (splitting, punching out, 
> ...) as you also mentioned.

One question from a QEMU newbie though: why do you put
kvm_ioctl_inhibit_begin()/kvm_ioctl_inhibit_end() into kvm_region_resize()
only, instead of taking it all the way up to
memory_region_transaction_begin()/memory_region_transaction_end() to
support atomicity for all kinds of updates right away?

The second question is whether you plan to submit this any time soon)

Thanks!
David Hildenbrand Oct. 27, 2020, 1:35 p.m. UTC | #8
>> Same applies to all other kinds of operations (splitting, punching out,
>> ...) as you also mentioned.
> 
> One question from a QEMU newbie though: why do you put
> kvm_ioctl_inhibit_begin()/kvm_ioctl_inhibit_end() to kvm_region_resize()
> only and not taking it all the way up to
> memory_region_transaction_begin()/memory_region_transaction_end() to
> support atomicity for all kinds of updates right away?

The clean way to implement it for 
memory_region_transaction_begin()/memory_region_transaction_end() is by 
implementing
->begin()
->commit()
callbacks for the KVM MemoryListener, and doing it in there, in KVM code.


Now, I wasn't sure how this might affect real-time workloads, where you 
really don't want to kick CPUs out of KVM. A lot of operations can be 
done without requiring this handling, such as:

1. Adding regions (memory hotplug)
2. Removing regions (memory hotunplug)
3. Enabling/disabling dirty logging

Resize/split(/move/...) are the problematic operations where we would 
need that handling. Modifying the size/location of existing slots.

One way to tackle it would be to "sense" upfront if such "modifying" 
operations will be required, communicating that via "->begin()", and 
letting the KVM notifier decide based on that information whether to get 
everything out of KVM. Sounds feasible.

> 
> The second question is whether you plan to sumbit this any time soon)

Once we have an agreement on how to proceed, either I can try to 
dedicate some time, or, if you have some time, you can pick up my work 
and make it also handle the other problematic cases (as discussed, e.g., ^).
Vitaly Kuznetsov Oct. 27, 2020, 1:47 p.m. UTC | #9
David Hildenbrand <david@redhat.com> writes:

>>> Same applies to all other kinds of operations (splitting, punching out,
>>> ...) as you also mentioned.
>> 
>> One question from a QEMU newbie though: why do you put
>> kvm_ioctl_inhibit_begin()/kvm_ioctl_inhibit_end() to kvm_region_resize()
>> only and not taking it all the way up to
>> memory_region_transaction_begin()/memory_region_transaction_end() to
>> support atomicity for all kinds of updates right away?
>
> The clean way to implement it for 
> memory_region_transaction_begin()/memory_region_transaction_end() is by 
> implementing
> ->begin()
> ->commit()
> callbacks for the KVM MemoryListener, and doing it in there, in KVM code.
>
>
> Now, I wasn't sure how this might affect real-time workloads, where you 
> really don't want to kick CPUs out of KVM. You can make a lot of 
> operations without requiring this handling like
>
> 1. Adding regions (memory hotplug)
> 2. Removing regions (memory hotunplug)
> 3. Enabling/disabling dirty logging
>
> Resize/split(/move/...) are the problematic operations where we would 
> need that handling. Modifying the size/location of existing slots.
>
> One way to tackle it would be to "sense" upfront if such "modifying" 
> operations will be required, communicating that via "->begin()", and 
> letting the KVM notifier decide based on that information whether to get 
> everything out of KVM. Sounds feasible.
>

I don't actually know if we have such use-cases but, for example,
punching a hole in the middle of an existing slot requires:
1) Resizing the existing slot to its first half
2) Creating the hole
3) Creating a new slot for the second half of the slot.
In case we'd like to make this atomic, we need to cover the whole
transaction. But again, I don't know if we have a use-case for it or
not.
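A minimal sketch of the arithmetic behind those three steps (the helper
and structure names are hypothetical, purely for illustration):

```c
#include <assert.h>

struct range { unsigned long long base, size; };

/* Given an existing slot [base, base+size) and a hole strictly inside
 * it, compute the two slots that replace it: step 1 resizes the slot
 * down to the first half, step 2 leaves the hole unmapped, step 3 adds
 * a new slot for the second half. */
static void punch_hole(struct range slot, struct range hole,
                       struct range *first, struct range *second)
{
    first->base  = slot.base;
    first->size  = hole.base - slot.base;             /* step 1 */
    second->base = hole.base + hole.size;             /* step 3 */
    second->size = (slot.base + slot.size) - second->base;
}
```

With today's one-slot-at-a-time API, each of those steps is a separate
KVM_SET_USER_MEMORY_REGION call, and a vCPU can observe the intermediate
states in between.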

>> 
>> The second question is whether you plan to sumbit this any time soon)
>
> Once we have an agreement on how to proceed, either I can try to 
> dedicate some time, or if you have some time, you can pickup my work and 
> make it also handle the other problematic cases (as discussed e.g., ^).

Ok, I plan to test your patch in the near future and if it works we
can start with just the 'resize' case as it seems to be causing real
issues. The infrastructure you create with kvm_ioctl_inhibit_begin()/
kvm_ioctl_inhibit_end() can be re-used later for other ops.
Igor Mammedov Oct. 27, 2020, 2:20 p.m. UTC | #10
On Tue, 27 Oct 2020 14:47:14 +0100
Vitaly Kuznetsov <vkuznets@redhat.com> wrote:

> David Hildenbrand <david@redhat.com> writes:
> 
> >>> Same applies to all other kinds of operations (splitting, punching out,
> >>> ...) as you also mentioned.  
> >> 
> >> One question from a QEMU newbie though: why do you put
> >> kvm_ioctl_inhibit_begin()/kvm_ioctl_inhibit_end() to kvm_region_resize()
> >> only and not taking it all the way up to
> >> memory_region_transaction_begin()/memory_region_transaction_end() to
> >> support atomicity for all kinds of updates right away?  
> >
> > The clean way to implement it for 
> > memory_region_transaction_begin()/memory_region_transaction_end() is by 
> > implementing  
> > ->begin()
> > ->commit()  
> > callbacks for the KVM MemoryListener, and doing it in there, in KVM code.
> >
> >
> > Now, I wasn't sure how this might affect real-time workloads, where you 
> > really don't want to kick CPUs out of KVM. You can make a lot of 
> > operations without requiring this handling like
> >
> > 1. Adding regions (memory hotplug)
> > 2. Removing regions (memory hotunplug)
> > 3. Enabling/disabling dirty logging
> >
> > Resize/split(/move/...) are the problematic operations where we would 
> > need that handling. Modifying the size/location of existing slots.
> >
> > One way to tackle it would be to "sense" upfront if such "modifying" 
> > operations will be required, communicating that via "->begin()", and 
> > letting the KVM notifier decide based on that information whether to get 
> > everything out of KVM. Sounds feasible.
> >  
> 
> I don't actually know if we have such use-cases but thinking about
> e.g. punching a hole in a middle of an existing slot requires:
> 1) Resizing the existing slot to its first half
> 2) Creating the hole
> 3) Creating a new slot for the second half of the slot.
> In case we'd like to make this atomic, we need to cover the whole
> transaction. But again, I don't know if we have a use-case for it or
> not.

It usually happens during boot on x86, where MMIO (re)mappings cause
holes to be punched in lower RAM.
(you can observe it by tracing MemoryListener::region_add/del hooks)

[...]
Peter Xu Nov. 2, 2020, 7:57 p.m. UTC | #11
Vitaly,

On Mon, Oct 26, 2020 at 09:49:16AM +0100, Vitaly Kuznetsov wrote:
> Currently, KVM doesn't provide an API to make atomic updates to memmap when
> the change touches more than one memory slot, e.g. in case we'd like to
> punch a hole in an existing slot.
> 
> Reports are that multi-CPU Q35 VMs booted with OVMF sometimes print something
> like
> 
> !!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000003 !!!!
> ExceptionData - 0000000000000010  I:1 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
> RIP  - 000000007E35FAB6, CS  - 0000000000000038, RFLAGS - 0000000000010006
> RAX  - 0000000000000000, RCX - 000000007E3598F2, RDX - 00000000078BFBFF
> ...
> 
> The problem seems to be that TSEG manipulations on one vCPU are not atomic
> from other vCPUs views. In particular, here's the strace:
> 
> Initial creation of the 'problematic' slot:
> 
> 10085 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>    memory_size=2146435072, userspace_addr=0x7fb89bf00000}) = 0
> 
> ... and then the update (caused by e.g. mch_update_smram()) later:
> 
> 10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>    memory_size=0, userspace_addr=0x7fb89bf00000}) = 0
> 10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>    memory_size=2129657856, userspace_addr=0x7fb89bf00000}) = 0
> 
> In case KVM has to handle any event on a different vCPU in between these
> two calls the #PF will get triggered.

A pure question: Why a #PF?  Is it injected into the guest?

My understanding (which could be wrong) is that the whole thing should
start with a vcpu page fault onto the removed range; then, when kvm finds
that the memory accessed is not within a valid memslot (since we're adding
it back, but haven't yet), it'll become a user exit back to QEMU, assuming
it's an MMIO access.  Or am I wrong somewhere?

Thanks,
Vitaly Kuznetsov Nov. 3, 2020, 1:07 p.m. UTC | #12
Peter Xu <peterx@redhat.com> writes:

> Vitaly,
>
> On Mon, Oct 26, 2020 at 09:49:16AM +0100, Vitaly Kuznetsov wrote:
>> Currently, KVM doesn't provide an API to make atomic updates to memmap when
>> the change touches more than one memory slot, e.g. in case we'd like to
>> punch a hole in an existing slot.
>> 
>> Reports are that multi-CPU Q35 VMs booted with OVMF sometimes print something
>> like
>> 
>> !!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000003 !!!!
>> ExceptionData - 0000000000000010  I:1 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
>> RIP  - 000000007E35FAB6, CS  - 0000000000000038, RFLAGS - 0000000000010006
>> RAX  - 0000000000000000, RCX - 000000007E3598F2, RDX - 00000000078BFBFF
>> ...
>> 
>> The problem seems to be that TSEG manipulations on one vCPU are not atomic
>> from other vCPUs views. In particular, here's the strace:
>> 
>> Initial creation of the 'problematic' slot:
>> 
>> 10085 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>>    memory_size=2146435072, userspace_addr=0x7fb89bf00000}) = 0
>> 
>> ... and then the update (caused by e.g. mch_update_smram()) later:
>> 
>> 10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>>    memory_size=0, userspace_addr=0x7fb89bf00000}) = 0
>> 10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>>    memory_size=2129657856, userspace_addr=0x7fb89bf00000}) = 0
>> 
>> In case KVM has to handle any event on a different vCPU in between these
>> two calls the #PF will get triggered.
>
> A pure question: Why a #PF?  Is it injected into the guest?
>

Yes, we see a #PF injected in the guest during OVMF boot.

> My understanding (which could be wrong) is that all thing should start with a
> vcpu page fault onto the removed range, then when kvm finds that the memory
> accessed is not within a valid memslot (since we're adding it back but not
> yet), it'll become an user exit back to QEMU assuming it's an MMIO access.  Or
> am I wrong somewhere?

In case it is a normal access from the guest, yes, but AFAIR here the
guest's CR3 is pointing to non-existent memory, and when KVM detects
that, it injects the #PF by itself without a round-trip through userspace.
Peter Xu Nov. 3, 2020, 4:37 p.m. UTC | #13
On Tue, Nov 03, 2020 at 02:07:09PM +0100, Vitaly Kuznetsov wrote:
> In case it is a normal access from the guest, yes, but AFAIR here
> guest's CR3 is pointing to non existent memory and when KVM detects that
> it injects #PF by itself without a loop through userspace.

I see, thanks Vitaly.  I think this kind of answered my previous confusion
about why we can't just bounce all of these to QEMU (I had thought QEMU
would take the BQL if it's an MMIO access): there are quite a lot of
references to guest memslots in the kernel that cannot naturally be treated
as guest MMIO accesses (especially for nested, maybe?), so it may be very
hard to cover all of them.  Paolo has mentioned quite a few times that he'd
prefer a kernel solution for this; I feel like I understand the reason
better now..

Have any of us tried to collect the requirements for this new kernel
interface (if it is to be proposed)?  I'm kind of thinking about what it
would look like if it were to solve all the pains we have right now.

Firstly, I think we'd likely want the capability to handle "holes" in
memslots: either to punch a hole, which IIUC is the problem behind this
patch, or the reverse operation, which is to fill up a hole that we've
just punched.  The other major one could be virtio-mem, which would like
to extend or shrink an existing memslot.  However, IIUC that's also doable
with the "hole" idea, in that we can create the memslot with the maximum
supported size, then "punch a hole" at the end of the memslot as if it had
shrunk.  To extend, we shrink the hole instead of the memslot.
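A toy model of that resize-via-hole idea (the structure and helper names
below are hypothetical, purely to illustrate the bookkeeping, not a
proposed API):

```c
#include <assert.h>

/* Hypothetical sketch: create the memslot with the maximum supported
 * size and keep a "hole" at its end; resizing then only adjusts the
 * hole instead of recreating the memslot itself. */
struct resizable_slot {
    unsigned long long max_size;   /* memslot size, fixed at creation */
    unsigned long long hole_size;  /* unmapped tail of the slot */
};

static unsigned long long usable_size(const struct resizable_slot *s)
{
    return s->max_size - s->hole_size;
}

static int resize(struct resizable_slot *s, unsigned long long new_size)
{
    if (new_size > s->max_size)
        return -1;                         /* cannot grow past max */
    s->hole_size = s->max_size - new_size; /* grow/shrink the hole */
    return 0;
}
```

Since the slot object itself never changes, its metadata (dirty bitmap
included) could stay allocated at max_size across resizes.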

Then there's the other case where we want to keep the dirty bitmap when
punching a hole in existing RAM.  With the "hole" idea in the kernel, that
seems easy too - when we punch the hole, we drop the dirty bitmap only for
the range covered by the hole.  Then we won't lose the rest of the bitmap,
where the RAM still makes sense, since the memslot will use the same
bitmap before/after punching a hole.

So now a simple idea comes to my mind (I think we can have even better,
or more complicated, ideas, e.g., making the kvm memslots a tree?  But
I'll start with the simple one): maybe we just need to teach each kvm
memslot to take "holes" within itself.  By default, there are no holes in
kvm memslots configured with KVM_SET_USER_MEMORY_REGION; then we can
configure holes for each memslot using a new flag (say, KVM_MEM_SET_HOLE)
of the same ioctl (after LOG_DIRTY_PAGES and READ_ONLY).  Initially we may
add a restriction on how many holes we allow, so the holes can also be an
array.
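A rough sketch of what the lookup side of that could behave like (the
KVM_MEM_SET_HOLE flag and every name below are hypothetical, taken from
the proposal above, not an existing KVM interface):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HOLES 8  /* assumed restriction on the number of holes */

/* Hypothetical model: a slot keeps a small array of holes; an access
 * that lands in a hole is treated as if the memslot were missing,
 * i.e. bounced to userspace as an MMIO access. */
struct hole { unsigned long long start, size; };

struct slot_with_holes {
    unsigned long long base, size;
    struct hole holes[MAX_HOLES];
    int nr_holes;
};

/* true if gpa is backed by RAM in this slot, false if it misses the
 * slot entirely or hits a hole (-> exit to QEMU) */
static bool gpa_is_ram(const struct slot_with_holes *s,
                       unsigned long long gpa)
{
    if (gpa < s->base || gpa >= s->base + s->size)
        return false;                      /* not in this slot at all */
    for (int i = 0; i < s->nr_holes; i++) {
        const struct hole *h = &s->holes[i];
        if (gpa >= h->start && gpa < h->start + h->size)
            return false;                  /* in a hole */
    }
    return true;
}
```

Note that adding or removing a hole element never invalidates the slot
itself, which is what makes the update naturally atomic.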

Thoughts?

Thanks,
Laszlo Ersek Nov. 4, 2020, 5:58 p.m. UTC | #14
On 11/03/20 14:07, Vitaly Kuznetsov wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> Vitaly,
>>
>> On Mon, Oct 26, 2020 at 09:49:16AM +0100, Vitaly Kuznetsov wrote:
>>> Currently, KVM doesn't provide an API to make atomic updates to memmap when
>>> the change touches more than one memory slot, e.g. in case we'd like to
>>> punch a hole in an existing slot.
>>>
>>> Reports are that multi-CPU Q35 VMs booted with OVMF sometimes print something
>>> like
>>>
>>> !!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000003 !!!!
>>> ExceptionData - 0000000000000010  I:1 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
>>> RIP  - 000000007E35FAB6, CS  - 0000000000000038, RFLAGS - 0000000000010006
>>> RAX  - 0000000000000000, RCX - 000000007E3598F2, RDX - 00000000078BFBFF
>>> ...
>>>
>>> The problem seems to be that TSEG manipulations on one vCPU are not atomic
>>> from other vCPUs views. In particular, here's the strace:
>>>
>>> Initial creation of the 'problematic' slot:
>>>
>>> 10085 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>>>    memory_size=2146435072, userspace_addr=0x7fb89bf00000}) = 0
>>>
>>> ... and then the update (caused by e.g. mch_update_smram()) later:
>>>
>>> 10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>>>    memory_size=0, userspace_addr=0x7fb89bf00000}) = 0
>>> 10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, guest_phys_addr=0x100000,
>>>    memory_size=2129657856, userspace_addr=0x7fb89bf00000}) = 0
>>>
>>> In case KVM has to handle any event on a different vCPU in between these
>>> two calls the #PF will get triggered.
>>
>> A pure question: Why a #PF?  Is it injected into the guest?
>>
> 
> Yes, we see a #PF injected in the guest during OVMF boot.
> 
>> My understanding (which could be wrong) is that all thing should start with a
>> vcpu page fault onto the removed range, then when kvm finds that the memory
>> accessed is not within a valid memslot (since we're adding it back but not
>> yet), it'll become an user exit back to QEMU assuming it's an MMIO access.  Or
>> am I wrong somewhere?
> 
> In case it is a normal access from the guest, yes, but AFAIR here
> guest's CR3 is pointing to non existent memory and when KVM detects that
> it injects #PF by itself without a loop through userspace.
> 

Indeed that's how I seem to remember it too; the guest page tables
cannot be walked (by the processor implicitly, or by KVM explicitly -- I
can't tell which one of those applies).

Thanks
Laszlo
Laszlo Ersek Nov. 4, 2020, 6:09 p.m. UTC | #15
On 11/03/20 17:37, Peter Xu wrote:
> On Tue, Nov 03, 2020 at 02:07:09PM +0100, Vitaly Kuznetsov wrote:
>> In case it is a normal access from the guest, yes, but AFAIR here
>> guest's CR3 is pointing to non existent memory and when KVM detects that
>> it injects #PF by itself without a loop through userspace.
> 
> I see, thanks Vitaly.  I think this kind of answered my previous confusion on
> why we can't just bounce all these to QEMU since I thought QEMU should try to
> take the bql if it's mmio access - probably because there're quite a lot of
> references to guest memslots in kernel that cannot be naturally treated as
> guest MMIO access (especially for nested, maybe?) so that maybe it's very hard
> to cover all of them.  Paolo has mentioned quite a few times that he'd prefer a
> kernel solution for this; I feel like I understand better on the reason now..
> 
> Have any of us tried to collect the requirements on this new kernel interface
> (if to be proposed)?  I'm kind of thinking how it would look like to solve all
> the pains we have right now.
> 
> Firstly, I think we'd likely want to have the capability to handle "holes" in
> memslots, either to punch a hole, which iiuc is the problem behind this patch.
> Or the reversed operation, which is to fill up a whole that we've just punched.
> The other major one could be virtio-mem who would like to extend or shrink an
> existing memslot.  However IIUC that's also doable with the "hole" idea in that
> we can create the memslot with the maximum supported size, then "punch a hole"
> at the end of the memslot just like it shrinked.  When extend, we shrink the
> hole instead rather than the memslot.
> 
> Then there's the other case where we want to keep the dirty bitmap when
> punching a hole on existing ram.  If with the "hole" idea in the kernel, it
> seems easy too - when we punch the hole, we drop dirty bitmaps only for the
> range covered by the hole.  Then we won't lose the rest bitmaps that where the
> RAM still makes sense, since the memslot will use the same bitmap before/after
> punching a hole.
> 
> So now an simple idea comes to my mind (I think we can have even more better,
> or more complicated ideas, e.g., to make kvm memslot a tree? But I'll start
> with the simple): maybe we just need to teach each kvm memslot to take "holes"
> within itself.  By default, there's no holes with KVM_SET_USER_MEMORY_REGION
> configured kvm memslots, then we can configure holes for each memslot using a
> new flag (assuming, KVM_MEM_SET_HOLE) of the same ioctl (after LOG_DIRTY_PAGES
> and READ_ONLY).  Initially we may add a restriction on how many holes we need,
> so the holes can also be an array.
> 
> Thoughts?

My only one (and completely unwashed / uneducated) thought is that this
resembles the fact (?) that VMAs are represented as rbtrees. So maybe
don't turn a single KVM memslot into a tree, but represent the full set
of KVM memslots as an rbtree?

My understanding is that "interval tree" is one of the most efficient
data structures for tracking a set of (discontiguous) memory regions,
and that an rbtree can be generalized into an interval tree. I'm super
rusty on the theory (after having contributed a genuine rbtree impl to
edk2 in 2014, sic transit gloria mundi :/), but I think that's what the
VMA stuff in the kernel does, effectively.

Perhaps it could apply to KVM memslots too.

Sorry if I'm making no sense, of course. (I'm going out on a limb with
posting this email, but whatever...)

Laszlo
Peter Xu Nov. 4, 2020, 7:23 p.m. UTC | #16
On Wed, Nov 04, 2020 at 07:09:02PM +0100, Laszlo Ersek wrote:
> On 11/03/20 17:37, Peter Xu wrote:
> > On Tue, Nov 03, 2020 at 02:07:09PM +0100, Vitaly Kuznetsov wrote:
> >> In case it is a normal access from the guest, yes, but AFAIR here
> >> guest's CR3 is pointing to non existent memory and when KVM detects that
> >> it injects #PF by itself without a loop through userspace.
> > 
> > I see, thanks Vitaly.  I think this kind of answered my previous confusion on
> > why we can't just bounce all these to QEMU since I thought QEMU should try to
> > take the bql if it's mmio access - probably because there're quite a lot of
> > references to guest memslots in kernel that cannot be naturally treated as
> > guest MMIO access (especially for nested, maybe?) so that maybe it's very hard
> > to cover all of them.  Paolo has mentioned quite a few times that he'd prefer a
> > kernel solution for this; I feel like I understand better on the reason now..
> > 
> > Have any of us tried to collect the requirements on this new kernel interface
> > (if to be proposed)?  I'm kind of thinking how it would look like to solve all
> > the pains we have right now.
> > 
> > Firstly, I think we'd likely want to have the capability to handle "holes" in
> > memslots, either to punch a hole, which iiuc is the problem behind this patch.
> > Or the reversed operation, which is to fill up a whole that we've just punched.
> > The other major one could be virtio-mem who would like to extend or shrink an
> > existing memslot.  However IIUC that's also doable with the "hole" idea in that
> > we can create the memslot with the maximum supported size, then "punch a hole"
> > at the end of the memslot just like it shrinked.  When extend, we shrink the
> > hole instead rather than the memslot.
> > 
> > Then there's the other case where we want to keep the dirty bitmap when
> > punching a hole on existing ram.  If with the "hole" idea in the kernel, it
> > seems easy too - when we punch the hole, we drop dirty bitmaps only for the
> > range covered by the hole.  Then we won't lose the rest bitmaps that where the
> > RAM still makes sense, since the memslot will use the same bitmap before/after
> > punching a hole.
> > 
> > So now an simple idea comes to my mind (I think we can have even more better,
> > or more complicated ideas, e.g., to make kvm memslot a tree? But I'll start
> > with the simple): maybe we just need to teach each kvm memslot to take "holes"
> > within itself.  By default, there's no holes with KVM_SET_USER_MEMORY_REGION
> > configured kvm memslots, then we can configure holes for each memslot using a
> > new flag (assuming, KVM_MEM_SET_HOLE) of the same ioctl (after LOG_DIRTY_PAGES
> > and READ_ONLY).  Initially we may add a restriction on how many holes we need,
> > so the holes can also be an array.
> > 
> > Thoughts?
> 
> My only one (and completely unwashed / uneducated) thought is that this
> resembles the fact (?) that VMAs are represented as rbtrees. So maybe
> don't turn a single KVM memslot into a tree, but represent the full set
> of KVM memslots as an rbtree?
> 
> My understanding is that "interval tree" is one of the most efficient
> data structures for tracking a set of (discontiguous) memory regions,
> and that an rbtree can be generalized into an interval tree. I'm super
> rusty on the theory (after having contributed a genuine rbtree impl to
> edk2 in 2014, sic transit gloria mundi :/), but I think that's what the
> VMA stuff in the kernel does, effectively.
> 
> Perhaps it could apply to KVM memslots too.
> 
> Sorry if I'm making no sense, of course. (I'm going out on a limb with
> posting this email, but whatever...)

Appreciate your input, Laszlo.

I agree that in most cases a tree is more efficient.  Though, I feel like
there are two issues here, and the structure itself is the less important
one.

Firstly, about the "tree or array" question: I'm not sure whether it's
urgently needed for kvm.  IMHO slot lookup is the major critical operation
for kvm memslots, and the current array implementation of kvm memslots has
already achieved something similar to a tree (O(log n)) by keeping the
array ordered (see search_memslots()).  That's why I think it's still ok.
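A simplified standalone version of what the ordered array buys us (KVM's
search_memslots() actually keeps the array sorted in descending order of
base gfn; ascending order is used here for readability):

```c
#include <assert.h>
#include <stddef.h>

/* Rough sketch: with the slot array kept sorted by base gfn and slots
 * non-overlapping, lookup is a plain binary search, O(log n), without
 * needing a tree structure. */
struct memslot { unsigned long long base_gfn, npages; };

static const struct memslot *
find_slot(const struct memslot *slots, size_t n, unsigned long long gfn)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (gfn < slots[mid].base_gfn)
            hi = mid;                      /* gfn is below this slot */
        else if (gfn >= slots[mid].base_gfn + slots[mid].npages)
            lo = mid + 1;                  /* gfn is above this slot */
        else
            return &slots[mid];            /* gfn falls inside */
    }
    return NULL;                           /* no slot covers gfn */
}
```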

The other issue is the "atomic update" of memslots, for which we're trying
to find a solution that can address all the issues we've seen.  IMHO it's
a different problem to solve.  For example, even if we start to use a tree
structure, we'll still face the same issue when we want to punch a hole in
an existing memslot - we'll still need to remove the tree node before
adding the new one.  AFAICT, the same failure could happen on CR3.

So IMHO the more important question is how to make these operations atomic.

To make things clearer, I'll also try to expand a bit more on the "hole" idea
proposed above.

Firstly, I believe there are also other solutions to these problems.

Assuming we would like to remove slot X (range 0-4G) and add back slots X1
(range 0-1G) and X2 (range 2G-4G), with range 1G-2G as a new MMIO region.

One example is, we can introduce another new ioctl to contain a few operations
of KVM_SET_USER_MEMORY_REGION, then we can put three items in:

  - Remove memslot X
  - Add memslot X1
  - Add memslot X2

So that way the container update will be atomic.  I think it works too, at
least for the issue that this patch wants to address.  However, I think
it's not ideal.  The thing is, the memslot ID could change before/after
the new ioctl even if it's backing mostly the same thing.  Meanwhile,
we'll need to destroy a memslot that is destined to be added back with
just some range information changed.  E.g., if we have a dirty bitmap, we
need to drop it and reallocate it again.  That's a waste, imho.

So I'm thinking, how about we teach kvm something similar to what QEMU
already knows about those memory regions?  That means kvm would start to
know "ok, this memslot is always that one, it never changed, however we
want to mask some of it and bounce to QEMU for whatever reason".  The
simplest way to implement this is to introduce a "hole" idea to kvm.  And
as I mentioned, I believe an array would work here too, similarly to
memslots, but that's another story.

One interesting thing about the "hole" idea is that we don't need
containers to keep a bunch of KVM_SET_USER_MEMORY_REGION ioctls any more!
Our operation is naturally atomic when represented as "holes": when we add
a new MMIO region within a RAM range, we simply add a new hole element to
the existing holes of this kvm memslot (note! adding the hole does _not_
change the slot ID or anything else about the slot itself; e.g. dirty
bitmaps do not need to be reallocated when we operate on the holes).  As a
summary, in the new world a memslot would represent something closer to a
RAMBlock in QEMU.  It means "this range will be used by the guest", but it
does not mean that the guest can access all of it.  When an access reaches
a hole, we bounce to QEMU as if the memslot were missing.

(Since this is really more kvm-specific, cc kvm list too)

Thanks,
Vitaly Kuznetsov Nov. 5, 2020, 3:36 p.m. UTC | #17
Peter Xu <peterx@redhat.com> writes:

> On Wed, Nov 04, 2020 at 07:09:02PM +0100, Laszlo Ersek wrote:
>> On 11/03/20 17:37, Peter Xu wrote:
>> > On Tue, Nov 03, 2020 at 02:07:09PM +0100, Vitaly Kuznetsov wrote:
>> >> In case it is a normal access from the guest, yes, but AFAIR here
>> >> guest's CR3 is pointing to non existent memory and when KVM detects that
>> >> it injects #PF by itself without a loop through userspace.
>> > 
>> > I see, thanks Vitaly.  I think this kind of answered my previous confusion on
>> > why we can't just bounce all these to QEMU since I thought QEMU should try to
>> > take the bql if it's mmio access - probably because there're quite a lot of
>> > references to guest memslots in kernel that cannot be naturally treated as
>> > guest MMIO access (especially for nested, maybe?) so that maybe it's very hard
>> > to cover all of them.  Paolo has mentioned quite a few times that he'd prefer a
>> > kernel solution for this; I feel like I understand better on the reason now..
>> > 
>> > Have any of us tried to collect the requirements on this new kernel interface
>> > (if to be proposed)?  I'm kind of thinking how it would look like to solve all
>> > the pains we have right now.
>> > 
>> > Firstly, I think we'd likely want to have the capability to handle "holes" in
>> > memslots, either to punch a hole, which iiuc is the problem behind this patch.
>> > Or the reversed operation, which is to fill up a whole that we've just punched.
>> > The other major one could be virtio-mem who would like to extend or shrink an
>> > existing memslot.  However IIUC that's also doable with the "hole" idea in that
>> > we can create the memslot with the maximum supported size, then "punch a hole"
>> > at the end of the memslot just like it shrinked.  When extend, we shrink the
>> > hole instead rather than the memslot.
>> > 
>> > Then there's the other case where we want to keep the dirty bitmap when
>> > punching a hole on existing ram.  If with the "hole" idea in the kernel, it
>> > seems easy too - when we punch the hole, we drop dirty bitmaps only for the
>> > range covered by the hole.  Then we won't lose the rest bitmaps that where the
>> > RAM still makes sense, since the memslot will use the same bitmap before/after
>> > punching a hole.
>> > 
>> > So now an simple idea comes to my mind (I think we can have even more better,
>> > or more complicated ideas, e.g., to make kvm memslot a tree? But I'll start
>> > with the simple): maybe we just need to teach each kvm memslot to take "holes"
>> > within itself.  By default, there's no holes with KVM_SET_USER_MEMORY_REGION
>> > configured kvm memslots, then we can configure holes for each memslot using a
>> > new flag (assuming, KVM_MEM_SET_HOLE) of the same ioctl (after LOG_DIRTY_PAGES
>> > and READ_ONLY).  Initially we may add a restriction on how many holes we need,
>> > so the holes can also be an array.
>> > 
>> > Thoughts?
>> 
>> My only one (and completely unwashed / uneducated) thought is that this
>> resembles the fact (?) that VMAs are represented as rbtrees. So maybe
>> don't turn a single KVM memslot into a tree, but represent the full set
>> of KVM memslots as an rbtree?
>> 
>> My understanding is that "interval tree" is one of the most efficient
>> data structures for tracking a set of (discontiguous) memory regions,
>> and that an rbtree can be generalized into an interval tree. I'm super
>> rusty on the theory (after having contributed a genuine rbtree impl to
>> edk2 in 2014, sic transit gloria mundi :/), but I think that's what the
>> VMA stuff in the kernel does, effectively.
>> 
>> Perhaps it could apply to KVM memslots too.
>> 
>> Sorry if I'm making no sense, of course. (I'm going out on a limb with
>> posting this email, but whatever...)
>
> Appreciate your input, Laszlo.
>
> I agree that in most cases tree is more efficient.  Though, I feel like
> there're two issues, while the structure itself is the less important one.
>
> Firstly, about the "tree or array" question: I'm not sure whether it's urgently
> needed for kvm.  Here imho slot lookup should be the major critical operation
> for kvm memslots, while crrent array impl of kvm memslot has already achieved
> something similar of a tree (O(logn)) by keeping the array ordered (can refer
> to search_memslots()).  That's why I think it's still ok.
>
> The other issue is about the "atomic update" of memslots that we're trying to
> look for a solution that can address all the issues we've seen.  IMHO it's a
> different problem to solve.  For example, even if we start to use tree
> structure, we'll still face the same issue when we want to punch a hole in an
> existing memslot - we'll still need to remove the tree node before adding the
> new one.  AFAICT, same failure could happen on CR3.
>
> So IMHO the more important question is how to make these operations atomic.
>
> To make things clearer, I'll also try to expand a bit more on the "hole" idea
> proposed above.
>
> Firstly, I believe there's also other solutions for these problems.
>
> Assuming we would like to remove slot X (range 0-4G), and add back slots X1
> (range 0-1G) and X2 (range 2G-4G), with range 1G-2G as new MMIO region.
>
> One example is, we can introduce another new ioctl to contain a few operations
> of KVM_SET_USER_MEMORY_REGION, then we can put three items in:
>
>   - Remove memslot X
>   - Add memslot X1
>   - Add memslot X2
>
> So that the container update will be atomic.  I think it works too at least for
> the issue that this patch wants to address.  However I think it's not ideal.
> The thing is, the memslot ID could change before/after the new ioctl even if
> it's backing mostly the same thing.  Meanwhile, we'll need to destroy a memslot
> that is destined to be added back with just some range information changed.
> E.g., if we have dirty bitmap, we need to drop it and reallocate again.  That's
> a waste, imho.
>
> So I'm thinking, how about we teach kvm similar things as what QEMU has already
> know with those memory regions?  That means kvm would start to know "ok this
> memslot is always that one, it never changed, however we want to mask some of
> it and bounce to QEMU for whatever reason".  The simplest way to implement this
> is to introduce a "hole" idea to kvm.  And as I mentioned, I believe an array
> would work too similar to memslots, but that's another story.
>
> One thing intersting about the "hole" idea is that, we don't need containers to
> keep a bunch of KVM_SET_USER_MEMORY_REGION ioctls any more!  Because our
> operation is naturally atomic when represented as "holes": when we add a new
> MMIO region within a RAM range, we simply add a new hole element to the
> existing holes of this kvm memslot (note! adding the hole does _not_ change
> slot ID and everything of the slot itself; e.g. dirty bitmaps do not need to be
> re-allocate when we operate on the holes).  As a summary, in the new world: A
> memslot should present something more close to ramblock in QEMU.  It means
> "this range will be used by the guest", but it does not mean that guest can
> access all of it.  When the access reaches a hole, we bounce to QEMU as if the
> memslot is missing.
>
> (Since this is really more kvm-specific, cc kvm list too)

Thank you for the write-up, Peter!

My first reaction to the 'hole' idea was that it's rather redundant and that
sticking to the current concept where 'everything missing is a hole' is
sufficient. I, however, didn't think about the dirty bitmap at all, so maybe
it is worth considering.

As an in-KVM solution to the problem I could suggest an enhanced version
of KVM_SET_USER_MEMORY_REGION taking multiple regions (or even the whole
memmap for the guest) which we would process in one shot. We could either
do it in an incremental form (add this, remove that) or require the caller to
supply the full set, so KVM would start by dropping everything. We could
also adopt Paolo's idea and make the call return the dirty bitmaps of the
removed slots; however, this complicates things from both the API PoV (if I
want to remove two slots and add three, where would the dirty bitmap
go?) and the QEMU PoV (as we don't necessarily want to consume the dirty
bitmap when we do the update). So I'd stick with the 'incremental' idea.
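A rough sketch of what such a batched, incremental interface might look like; every name below (`struct region_list`, `apply_region_list`, the 16-entry limit) is hypothetical and not an existing KVM API. It reuses the KVM_SET_USER_MEMORY_REGION convention that `memory_size == 0` means "remove this slot", and models the slot table as a plain array:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct region_entry {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;      /* 0 == remove this slot */
    uint64_t userspace_addr;
};

struct region_list {
    uint32_t nr_entries;
    struct region_entry entries[16];
};

#define MAX_SLOTS 32

/* Apply the whole list as one unit, so no vCPU ever observes the
 * intermediate "slot removed but not yet re-added" state. */
static void apply_region_list(struct region_entry table[MAX_SLOTS],
                              const struct region_list *list)
{
    /* lock(slots_lock); -- elided in this sketch */
    for (uint32_t i = 0; i < list->nr_entries; i++) {
        const struct region_entry *e = &list->entries[i];
        if (e->memory_size == 0) {
            memset(&table[e->slot], 0, sizeof(table[e->slot]));
        } else {
            table[e->slot] = *e;
        }
    }
    /* unlock(slots_lock); */
}
```

With this shape, the "remove X, add X1, add X2" hole punch from earlier in the thread becomes a single three-entry call instead of three ioctls with a visible gap in between.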

Honestly, I'm not fully convinced we need an in-KVM solution. We'll still
have to kick all vCPUs out to do the update (for a shorter period compared
to the in-QEMU solution, but still), so if we are concerned about e.g. RT
workloads, we should be concerned either way; I don't see a way out: latency
spikes are inevitable.
Peter Xu Nov. 5, 2020, 4:35 p.m. UTC | #18
Hi, Vitaly,

On Thu, Nov 05, 2020 at 04:36:28PM +0100, Vitaly Kuznetsov wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > [snip]
> 
> Thank you for the write up Peter!
> 
> My first reaction to the 'hole' idea was that it's rather redundant and
> sticking to the current concept where 'everything missing is a hole' is
> sufficient. I, however, didn't think about dirty bitmap at all so maybe
> it is worth considering.

You're definitely right that at first glance it seems redundant, in that it
kind of duplicates the idea of the existing memslot, just in a reversed way.
However I'd like to emphasize, just in case:

  - The "hole" idea I tried to introduce is not at the same level as the
    current memslot; instead it's a property of an existing memslot.

  - The "hole" idea can help us achieve atomicity, and imho that's the most
    important thing this idea brings.  We would like atomicity for every
    single memslot update, however the current kvm memslot layout cannot
    provide atomicity for cases like "punching MMIO holes within an existing
    memslot".

By introducing this extra layer of "holes" within a memslot, we can make
everything atomic, by atomically operating either on a memslot or on a hole
of it.  When punching the MMIO hole, we insert a hole (or expand an existing
hole!  We're still free to define the "hole" semantics) into the existing
memslot.  That replaces all three of the previous operations and is naturally
atomic.

The dirty bitmap is another benefit on top of the atomicity that we can get.
IMHO kvm memslots would really be more efficient if they represented the qemu
ramblocks.  It also does not really make sense to change slot ids if all we
did was punch a 1-byte hole into a 4G range: the backing ram never changed,
so neither should the dirty bitmap or the slot id.  KVM memslots tried to
emulate this complexity of QEMU using a single layer, hence we face these
atomicity problems.  So I'm thinking whether we should provide an extra
abstraction (the "hole" idea is the 2nd layer, under kvm memslots; we can
also think about other ways to represent this; anyway I really feel like it
should be a new layer of abstraction) to let kvm memslots know more about
guest rams.
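The dirty-bitmap benefit can be sketched in a few lines: because the slot's bitmap allocation survives the hole punch, only the bits covered by the hole need clearing. The names and constants below are illustrative, not actual KVM internals:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define BITS_PER_LONG 64

/* Clear dirty bits only for the page range covered by a freshly
 * punched hole; bits outside the hole keep tracking the rest of the
 * slot, and the bitmap itself is never freed or reallocated. */
static void clear_dirty_range(unsigned long *bitmap,
                              uint64_t start_off, uint64_t size)
{
    uint64_t first = start_off >> PAGE_SHIFT;
    uint64_t last = (start_off + size) >> PAGE_SHIFT;  /* exclusive */
    for (uint64_t pfn = first; pfn < last; pfn++) {
        bitmap[pfn / BITS_PER_LONG] &= ~(1UL << (pfn % BITS_PER_LONG));
    }
}
```

Contrast this with the remove-and-re-add approach, where the whole bitmap is dropped and re-allocated even though most of the slot's pages are unaffected.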

> 
> As an in-KVM solution to the problem I could suggest an enhanced version
> of KVM_SET_USER_MEMORY_REGION taking multiple regions (or even the whole
> memmap for the guest) and we would process it in one shot. We can either
> do it in an incremental form (add this, remove that) or require to
> supply the full set so KVM will start with dropping everything. We can
> also exploit Paolo's idea and make the call return dirty bitmaps of the
> removed slots, however, this complicates things from both API (like if I
> want to remove two slots and add three -- where would the dirty bitmap
> go?) and QEMU PoV (as we don't necessarily want to consume dirty bitmap
> when we do the update). So I'd stick with the 'incremental' idea.

I'm not very sure what the "incremental" idea you mentioned here is.
However I agree with you on the above, and I think that's also why we may
consider going the other way and adding that extra layer to kvm memslots, so
that we don't run into the issues you mentioned.  The multi-region ioctl
seems to bring unnecessary complexity to the interface, and it does not look
as clean as the "hole" idea to me.

> 
> Honestly, I'm not fully convinced we need an in-KVM idea. We'll still
> have to kick all vCPUs out to do the update (for shorter period compared
> to in-QEMU solution but still) so in case we are concerned about e.g. RT
> workloads we should be as I don't see a way out: latency spikes are
> inevitable.

Yes, it would be perfect if there were a userspace solution. I'd be fine with
stopping all the vcpus as long as we don't release the BQL during that period;
that would be quite simple and efficient if doable.  I actually proposed
similar things while working on the qemu dirty ring series when I wanted to
flush vcpu dirty bitmaps, but I encountered the same issue of incorrect BQL
releasing and got nasty coredumps.  I also questioned why we need to release
the BQL while pausing all vcpus, but I didn't dig deeper (it relates to the
icount and record/replay stuff, I think; Paolo or Alex may know better).  I'm
just afraid there's no single good solution that solves all the similar
problems.

In any case, this is still my very immature understanding.  I really hope
Paolo could chime in to share his thoughts on this too, especially on why we
need to release the bql during vcpu pausing and whether it's necessary.

Regarding RT - I never worried about RT on this. :)

IMHO, a "real-time host" does not mean that the host must guarantee
determinism across the whole lifecycle of the system.  There's plenty of stuff
we can't do while guaranteeing that determinism afaict, e.g., cpu hotplug.  I
don't think "changing the guest memory layout" is a good thing to do while the
guest is running an RT workload.  It should be easier just to avoid it (e.g.,
make sure it only happens before the RT workload starts, and forbid it while
the RT app is running) rather than trying to make that operation
deterministic as well.

Thanks,

Patch

diff --git a/softmmu/memory.c b/softmmu/memory.c
index fa280a19f7f7..0bf6f3f6d5dc 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -28,6 +28,7 @@ 
 
 #include "exec/memory-internal.h"
 #include "exec/ram_addr.h"
+#include "sysemu/cpus.h"
 #include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
 #include "sysemu/tcg.h"
@@ -1057,7 +1058,9 @@  static void address_space_update_topology(AddressSpace *as)
 void memory_region_transaction_begin(void)
 {
     qemu_flush_coalesced_mmio_buffer();
-    ++memory_region_transaction_depth;
+    if ((++memory_region_transaction_depth == 1) && kvm_enabled()) {
+        pause_all_vcpus();
+    }
 }
 
 void memory_region_transaction_commit(void)
@@ -1087,7 +1090,11 @@  void memory_region_transaction_commit(void)
             }
             ioeventfd_update_pending = false;
         }
-   }
+
+        if (kvm_enabled()) {
+            resume_all_vcpus();
+        }
+    }
 }
 
 static void memory_region_destructor_none(MemoryRegion *mr)