diff mbox series

[v3,07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()

Message ID 20231115071519.2864957-8-xiaoyao.li@intel.com (mailing list archive)
State New, archived
Headers show
Series QEMU Guest memfd + QEMU TDX support | expand

Commit Message

Xiaoyao Li Nov. 15, 2023, 7:14 a.m. UTC
Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
ram_block_discard_range(), which took some code from
ram_discard_range(). However, during the code movement, it changed the
alignment check of host_startaddr from qemu_host_page_size to rb->page_size.

When a ramblock is backed by hugepages, this requires the startaddr to be
huge page size aligned, which is overkill. E.g., TDX's private-shared
page conversion is done at 4KB granularity. A shared page is discarded
when it gets converted to private, and when the shared page is backed by
a hugepage it will fail this check.

So change the alignment check back to qemu_host_page_size.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
 - Newly added in v3;
---
 system/physmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

David Hildenbrand Nov. 15, 2023, 6:20 p.m. UTC | #1
On 15.11.23 08:14, Xiaoyao Li wrote:
> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
> ram_block_discard_range() which grabs some code from
> ram_discard_range(). However, during code movement, it changed alignment
> check of host_startaddr from qemu_host_page_size to rb->page_size.
> 
> When ramblock is back'ed by hugepage, it requires the startaddr to be
> huge page size aligned, which is a overkill. e.g., TDX's private-shared
> page conversion is done at 4KB granularity. Shared page is discarded
> when it gets converts to private and when shared page back'ed by
> hugepage it is going to fail on this check.
> 
> So change to alignment check back to qemu_host_page_size.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
>   - Newly added in v3;
> ---
>   system/physmem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index c56b17e44df6..8a4e42c7cf60 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>   
>       uint8_t *host_startaddr = rb->host + start;
>   
> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {

For your use cases, rb->page_size should always match qemu_host_page_size.

IIRC, we only set rb->page_size to different values for hugetlb. And 
guest_memfd does not support hugetlb.

Even if QEMU is using THP, rb->page_size should be 4k.

Please elaborate how you can actually trigger that. From what I recall, 
guest_memfd is not compatible with hugetlb.

And the check here makes perfect sense for existing callers of 
ram_block_discard_range(): you cannot partially zap a hugetlb page.
Xiaoyao Li Nov. 16, 2023, 2:56 a.m. UTC | #2
On 11/16/2023 2:20 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>> ram_block_discard_range() which grabs some code from
>> ram_discard_range(). However, during code movement, it changed alignment
>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>
>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>> huge page size aligned, which is a overkill. e.g., TDX's private-shared
>> page conversion is done at 4KB granularity. Shared page is discarded
>> when it gets converts to private and when shared page back'ed by
>> hugepage it is going to fail on this check.
>>
>> So change to alignment check back to qemu_host_page_size.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>>   - Newly added in v3;
>> ---
>>   system/physmem.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index c56b17e44df6..8a4e42c7cf60 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb, 
>> uint64_t start, size_t length)
>>       uint8_t *host_startaddr = rb->host + start;
>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
> 
> For your use cases, rb->page_size should always match qemu_host_page_size.
> 
> IIRC, we only set rb->page_size to different values for hugetlb. And 
> guest_memfd does not support hugetlb.
> 
> Even if QEMU is using THP, rb->page_size should 4k.
> 
> Please elaborate how you can actually trigger that. From what I recall, 
> guest_memfd is not compatible with hugetlb.

It's the shared memory that can be backed by hugetlb.

Later patch 9 introduces ram_block_convert_page(), which will discard 
shared memory when it gets converted to private. A TD guest can request 
converting a 4K page to private while that page was previously backed by 
hugetlb as a 2M shared page.

> And the check here makes perfect sense for existing callers of 
> ram_block_discard_range(): you cannot partially zap a hugetlb page.
>
David Hildenbrand Nov. 20, 2023, 9:56 a.m. UTC | #3
On 16.11.23 03:56, Xiaoyao Li wrote:
> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>> ram_block_discard_range() which grabs some code from
>>> ram_discard_range(). However, during code movement, it changed alignment
>>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>>
>>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>>> huge page size aligned, which is a overkill. e.g., TDX's private-shared
>>> page conversion is done at 4KB granularity. Shared page is discarded
>>> when it gets converts to private and when shared page back'ed by
>>> hugepage it is going to fail on this check.
>>>
>>> So change to alignment check back to qemu_host_page_size.
>>>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> ---
>>> Changes in v3:
>>>    - Newly added in v3;
>>> ---
>>>    system/physmem.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index c56b17e44df6..8a4e42c7cf60 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>> uint64_t start, size_t length)
>>>        uint8_t *host_startaddr = rb->host + start;
>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>
>> For your use cases, rb->page_size should always match qemu_host_page_size.
>>
>> IIRC, we only set rb->page_size to different values for hugetlb. And
>> guest_memfd does not support hugetlb.
>>
>> Even if QEMU is using THP, rb->page_size should 4k.
>>
>> Please elaborate how you can actually trigger that. From what I recall,
>> guest_memfd is not compatible with hugetlb.
> 
> It's the shared memory that can be back'ed by hugetlb.

Serious question: does that configuration make any sense to support at 
this point? I claim: no.

> 
> Later patch 9 introduces ram_block_convert_page(), which will discard
> shared memory when it gets converted to private. TD guest can request
> convert a 4K to private while the page is previously back'ed by hugetlb
> as 2M shared page.

So you can call ram_block_discard_guest_memfd_range() on a subpage 
basis, but not ram_block_discard_range().

ram_block_convert_range() would have to be taught that that (questionable) 
combination of hugetlb for shmem and ordinary pages for guest_memfd 
cannot discard shared memory.

And it probably shouldn't either way. There are other problems when not 
using hugetlb along with preallocation.

The check in ram_block_discard_range() is correct, whoever ends up 
calling it has to stop calling it.
Xiaoyao Li Dec. 4, 2023, 7:35 a.m. UTC | #4
On 11/20/2023 5:56 PM, David Hildenbrand wrote:
> On 16.11.23 03:56, Xiaoyao Li wrote:
>> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>>> ram_block_discard_range() which grabs some code from
>>>> ram_discard_range(). However, during code movement, it changed 
>>>> alignment
>>>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>>>
>>>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>>>> huge page size aligned, which is a overkill. e.g., TDX's private-shared
>>>> page conversion is done at 4KB granularity. Shared page is discarded
>>>> when it gets converts to private and when shared page back'ed by
>>>> hugepage it is going to fail on this check.
>>>>
>>>> So change to alignment check back to qemu_host_page_size.
>>>>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> ---
>>>> Changes in v3:
>>>>    - Newly added in v3;
>>>> ---
>>>>    system/physmem.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index c56b17e44df6..8a4e42c7cf60 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>>> uint64_t start, size_t length)
>>>>        uint8_t *host_startaddr = rb->host + start;
>>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>>
>>> For your use cases, rb->page_size should always match 
>>> qemu_host_page_size.
>>>
>>> IIRC, we only set rb->page_size to different values for hugetlb. And
>>> guest_memfd does not support hugetlb.
>>>
>>> Even if QEMU is using THP, rb->page_size should 4k.
>>>
>>> Please elaborate how you can actually trigger that. From what I recall,
>>> guest_memfd is not compatible with hugetlb.
>>
>> It's the shared memory that can be back'ed by hugetlb.
> 
> Serious question: does that configuration make any sense to support at 
> this point? I claim: no.
> 
>>
>> Later patch 9 introduces ram_block_convert_page(), which will discard
>> shared memory when it gets converted to private. TD guest can request
>> convert a 4K to private while the page is previously back'ed by hugetlb
>> as 2M shared page.
> 
> So you can call ram_block_discard_guest_memfd_range() on subpage basis, 
> but not ram_block_discard_range().
> 
> ram_block_convert_range() would have to thought that that (questionable) 
> combination of hugetlb for shmem and ordinary pages for guest_memfd 
> cannot discard shared memory.
> 
> And it probably shouldn't either way. There are other problems when not 
> using hugetlb along with preallocation.

If I understand correctly, preallocation needs to be enabled for 
hugetlb. And in the preallocation case, it doesn't need to discard 
memory. Is that correct?

> The check in ram_block_discard_range() is correct, whoever ends up 
> calling it has to stop calling it.
> 

So, do I need to add logic to ram_block_discard_page() so that if the 
size of the shared memory indicates a hugepage, the discard is skipped?
Xiaoyao Li Dec. 4, 2023, 7:53 a.m. UTC | #5
On 12/4/2023 3:35 PM, Xiaoyao Li wrote:
> On 11/20/2023 5:56 PM, David Hildenbrand wrote:
>> On 16.11.23 03:56, Xiaoyao Li wrote:
>>> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>>>> ram_block_discard_range() which grabs some code from
>>>>> ram_discard_range(). However, during code movement, it changed 
>>>>> alignment
>>>>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>>>>
>>>>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>>>>> huge page size aligned, which is a overkill. e.g., TDX's 
>>>>> private-shared
>>>>> page conversion is done at 4KB granularity. Shared page is discarded
>>>>> when it gets converts to private and when shared page back'ed by
>>>>> hugepage it is going to fail on this check.
>>>>>
>>>>> So change to alignment check back to qemu_host_page_size.
>>>>>
>>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>>> ---
>>>>> Changes in v3:
>>>>>    - Newly added in v3;
>>>>> ---
>>>>>    system/physmem.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>> index c56b17e44df6..8a4e42c7cf60 100644
>>>>> --- a/system/physmem.c
>>>>> +++ b/system/physmem.c
>>>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>>>> uint64_t start, size_t length)
>>>>>        uint8_t *host_startaddr = rb->host + start;
>>>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>>>
>>>> For your use cases, rb->page_size should always match 
>>>> qemu_host_page_size.
>>>>
>>>> IIRC, we only set rb->page_size to different values for hugetlb. And
>>>> guest_memfd does not support hugetlb.
>>>>
>>>> Even if QEMU is using THP, rb->page_size should 4k.
>>>>
>>>> Please elaborate how you can actually trigger that. From what I recall,
>>>> guest_memfd is not compatible with hugetlb.
>>>
>>> It's the shared memory that can be back'ed by hugetlb.
>>
>> Serious question: does that configuration make any sense to support at 
>> this point? I claim: no.
>>
>>>
>>> Later patch 9 introduces ram_block_convert_page(), which will discard
>>> shared memory when it gets converted to private. TD guest can request
>>> convert a 4K to private while the page is previously back'ed by hugetlb
>>> as 2M shared page.
>>
>> So you can call ram_block_discard_guest_memfd_range() on subpage 
>> basis, but not ram_block_discard_range().
>>
>> ram_block_convert_range() would have to thought that that 
>> (questionable) combination of hugetlb for shmem and ordinary pages for 
>> guest_memfd cannot discard shared memory.
>>
>> And it probably shouldn't either way. There are other problems when 
>> not using hugetlb along with preallocation.
> 
> If I understand correctly, preallocation needs to be enabled for 
> hugetlb. And in preallocation case, it doesn't need to discard memory. 
> Is it correct?
> 
>> The check in ram_block_discard_range() is correct, whoever ends up 
>> calling it has to stop calling it.
>>
>  > So, I need add logic to ram_block_discard_page() that if the size of

Sorry, I made a typo.

Correct myself, s/ram_block_discard_page()/ram_block_convert_range()

> shared memory indicates hugepage, skip the discarding?
> 
>
David Hildenbrand Dec. 4, 2023, 9:52 a.m. UTC | #6
On 04.12.23 08:53, Xiaoyao Li wrote:
> On 12/4/2023 3:35 PM, Xiaoyao Li wrote:
>> On 11/20/2023 5:56 PM, David Hildenbrand wrote:
>>> On 16.11.23 03:56, Xiaoyao Li wrote:
>>>> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>>>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>>>>> ram_block_discard_range() which grabs some code from
>>>>>> ram_discard_range(). However, during code movement, it changed
>>>>>> alignment
>>>>>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>>>>>
>>>>>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>>>>>> huge page size aligned, which is a overkill. e.g., TDX's
>>>>>> private-shared
>>>>>> page conversion is done at 4KB granularity. Shared page is discarded
>>>>>> when it gets converts to private and when shared page back'ed by
>>>>>> hugepage it is going to fail on this check.
>>>>>>
>>>>>> So change to alignment check back to qemu_host_page_size.
>>>>>>
>>>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>>>> ---
>>>>>> Changes in v3:
>>>>>>     - Newly added in v3;
>>>>>> ---
>>>>>>     system/physmem.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>> index c56b17e44df6..8a4e42c7cf60 100644
>>>>>> --- a/system/physmem.c
>>>>>> +++ b/system/physmem.c
>>>>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>>>>> uint64_t start, size_t length)
>>>>>>         uint8_t *host_startaddr = rb->host + start;
>>>>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>>>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>>>>
>>>>> For your use cases, rb->page_size should always match
>>>>> qemu_host_page_size.
>>>>>
>>>>> IIRC, we only set rb->page_size to different values for hugetlb. And
>>>>> guest_memfd does not support hugetlb.
>>>>>
>>>>> Even if QEMU is using THP, rb->page_size should 4k.
>>>>>
>>>>> Please elaborate how you can actually trigger that. From what I recall,
>>>>> guest_memfd is not compatible with hugetlb.
>>>>
>>>> It's the shared memory that can be back'ed by hugetlb.
>>>
>>> Serious question: does that configuration make any sense to support at
>>> this point? I claim: no.
>>>
>>>>
>>>> Later patch 9 introduces ram_block_convert_page(), which will discard
>>>> shared memory when it gets converted to private. TD guest can request
>>>> convert a 4K to private while the page is previously back'ed by hugetlb
>>>> as 2M shared page.
>>>
>>> So you can call ram_block_discard_guest_memfd_range() on subpage
>>> basis, but not ram_block_discard_range().
>>>
>>> ram_block_convert_range() would have to thought that that
>>> (questionable) combination of hugetlb for shmem and ordinary pages for
>>> guest_memfd cannot discard shared memory.
>>>
>>> And it probably shouldn't either way. There are other problems when
>>> not using hugetlb along with preallocation.
>>
>> If I understand correctly, preallocation needs to be enabled for
>> hugetlb. And in preallocation case, it doesn't need to discard memory.
>> Is it correct?

Yes. The downside is that we'll end up with double the memory 
consumption. But if/how to optimize that in this case can be left for 
future work.

>>
>>> The check in ram_block_discard_range() is correct, whoever ends up
>>> calling it has to stop calling it.
>>>
>>   > So, I need add logic to ram_block_discard_page() that if the size of
> 
> Sorry, I made a typo.
> 
> Correct myself, s/ram_block_discard_page()/ram_block_convert_range()

Yes, just leave any shared memory backend that uses hugetlb alone.
Patch

diff --git a/system/physmem.c b/system/physmem.c
index c56b17e44df6..8a4e42c7cf60 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3532,7 +3532,7 @@  int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 
     uint8_t *host_startaddr = rb->host + start;
 
-    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
+    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
         error_report("ram_block_discard_range: Unaligned start address: %p",
                      host_startaddr);
         goto err;