
[v6,01/12] memory: Introduce RamDiscardMgr for RAM memory regions

Message ID 20210222115708.7623-2-david@redhat.com (mailing list archive)
State New, archived
Series virtio-mem: vfio support

Commit Message

David Hildenbrand Feb. 22, 2021, 11:56 a.m. UTC
We have some special RAM memory regions (managed by virtio-mem) for which
the guest agreed to only use selected memory ranges. "Unused" parts are
discarded so they won't consume memory - this is how these memory ranges
are logically unplugged. Before the VM is allowed to use such logically
unplugged memory
again, coordination with the hypervisor is required.

This results in "sparse" mmaps/RAMBlocks/memory regions, whereby only
coordinated parts are valid to be used/accessed by the VM.

In most cases, we don't care about that - e.g., in KVM, we simply have a
single KVM memory slot. However, in the case of vfio, registering the
whole region with the kernel results in all pages getting pinned, and
therefore unexpectedly high memory consumption - discarding of RAM in
that context is broken.

Let's introduce a way to coordinate discarding/populating memory within a
RAM memory region with such special consumers of RAM memory regions: they
can register as listeners and get updates on memory getting discarded and
populated. Using this machinery, vfio will be able to map only the
currently populated parts, resulting in discarded parts not getting pinned
and not consuming memory.

A RamDiscardMgr has to be set for a memory region before it gets
mapped, and cannot change while the memory region is mapped.
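
A rough usage sketch from a consumer's point of view - illustrative only,
the dummy_* names are made up and the actual vfio integration follows
later in this series:

static int dummy_notify_populate(RamDiscardListener *rdl,
                                 const MemoryRegion *mr, ram_addr_t offset,
                                 ram_addr_t size)
{
    /* e.g., vfio would DMA-map the now-populated range here */
    return 0;
}

static void dummy_notify_discard(RamDiscardListener *rdl,
                                 const MemoryRegion *mr, ram_addr_t offset,
                                 ram_addr_t size)
{
    /* e.g., vfio would DMA-unmap the discarded range here */
}

static void dummy_register(MemoryRegion *mr, RamDiscardListener *rdl)
{
    RamDiscardMgr *rdm = memory_region_get_ram_discard_mgr(mr);
    RamDiscardMgrClass *rdmc;

    g_assert(rdm);
    rdmc = RAM_DISCARD_MGR_GET_CLASS(rdm);
    ram_discard_listener_init(rdl, dummy_notify_populate,
                              dummy_notify_discard, NULL);
    /* registering replays all currently populated parts */
    rdmc->register_listener(rdm, mr, rdl);
}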

Note: At some point, we might want to let RAMBlock users (esp. vfio used
for nvme://) consume this interface as well. We'll need RAMBlock notifier
calls when a RAMBlock is getting mapped/unmapped (via the corresponding
memory region), so we can properly register a listener there as well.

Reviewed-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Auger Eric <eric.auger@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/memory.h | 231 ++++++++++++++++++++++++++++++++++++++++++
 softmmu/memory.c      |  22 ++++
 2 files changed, 253 insertions(+)

Comments

Paolo Bonzini Feb. 22, 2021, 1:27 p.m. UTC | #1
On 22/02/21 12:56, David Hildenbrand wrote:
> 
> +    /**
> +     * @get_min_granularity:
> +     *
> +     * Get the minimum granularity in which listeners will get notified
> +     * about changes within the #MemoryRegion via the #RamDiscardMgr.
> +     *
> +     * @rdm: the #RamDiscardMgr
> +     * @mr: the #MemoryRegion
> +     *
> +     * Returns the minimum granularity.
> +     */
> +    uint64_t (*get_min_granularity)(const RamDiscardMgr *rdm,
> +                                    const MemoryRegion *mr);
> +
> +    /**
> +     * @is_populated:
> +     *
> +     * Check whether the given range within the #MemoryRegion is completely
> +     * populated (i.e., no parts are currently discarded). There are no
> +     * alignment requirements for the range.
> +     *
> +     * @rdm: the #RamDiscardMgr
> +     * @mr: the #MemoryRegion
> +     * @offset: offset into the #MemoryRegion
> +     * @size: size in the #MemoryRegion
> +     *
> +     * Returns whether the given range is completely populated.
> +     */
> +    bool (*is_populated)(const RamDiscardMgr *rdm, const MemoryRegion *mr,
> +                         ram_addr_t offset, ram_addr_t size);
> +
> +    /**
> +     * @register_listener:
> +     *
> +     * Register a #RamDiscardListener for a #MemoryRegion via the
> +     * #RamDiscardMgr and immediately notify the #RamDiscardListener about all
> +     * populated parts within the #MemoryRegion via the #RamDiscardMgr.
> +     *
> +     * In case any notification fails, no further notifications are triggered
> +     * and an error is logged.
> +     *
> +     * @rdm: the #RamDiscardMgr
> +     * @mr: the #MemoryRegion
> +     * @rdl: the #RamDiscardListener
> +     */
> +    void (*register_listener)(RamDiscardMgr *rdm, const MemoryRegion *mr,
> +                              RamDiscardListener *rdl);
> +
> +    /**
> +     * @unregister_listener:
> +     *
> +     * Unregister a previously registered #RamDiscardListener for a
> +     * #MemoryRegion via the #RamDiscardMgr after notifying the
> +     * #RamDiscardListener about all populated parts becoming unpopulated
> +     * within the #MemoryRegion via the #RamDiscardMgr.
> +     *
> +     * @rdm: the #RamDiscardMgr
> +     * @mr: the #MemoryRegion
> +     * @rdl: the #RamDiscardListener
> +     */
> +    void (*unregister_listener)(RamDiscardMgr *rdm, const MemoryRegion *mr,
> +                                RamDiscardListener *rdl);
> +
> +    /**
> +     * @replay_populated:
> +     *
> +     * Notify the #RamDiscardListener about all populated parts within the
> +     * #MemoryRegion via the #RamDiscardMgr.
> +     *
> +     * In case any notification fails, no further notifications are triggered.
> +     * The listener is not required to be registered.
> +     *
> +     * @rdm: the #RamDiscardMgr
> +     * @mr: the #MemoryRegion
> +     * @rdl: the #RamDiscardListener
> +     *
> +     * Returns 0 on success, or a negative error if any notification failed.
> +     */
> +    int (*replay_populated)(const RamDiscardMgr *rdm, const MemoryRegion *mr,
> +                            RamDiscardListener *rdl);

If this function is only going to use a single callback, just pass it 
(together with a void *opaque) as the argument.
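
Something along these lines - just sketching the suggestion, not code from
the series:

typedef int (*ReplayRamPopulate)(const MemoryRegion *mr, ram_addr_t offset,
                                 ram_addr_t size, void *opaque);

int (*replay_populated)(const RamDiscardMgr *rdm, const MemoryRegion *mr,
                        ReplayRamPopulate replay_fn, void *opaque);
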
> +};
> +
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>  typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>  
> @@ -487,6 +683,7 @@ struct MemoryRegion {
>      const char *name;
>      unsigned ioeventfd_nb;
>      MemoryRegionIoeventfd *ioeventfds;
> +    RamDiscardMgr *rdm; /* Only for RAM */
>  };


The idea of sending discard notifications is obviously good.  I have a 
couple of questions on the design that you used for the interface; I'm 
not saying that it should be done differently, I would only like to 
understand the trade offs that you chose:

1) can the RamDiscardManager (no abbreviations :)) be just the owner of 
the memory region (obj->parent)?  Alternatively, if you want to make it 
separate from the owner, does it make sense for it to be a separate 
reusable object, sitting between virtio-mem and the MemoryRegion, so 
that the implementation can be reused?

2) why have the new RamDiscardListener instead of just extending 
MemoryListener? Conveniently, vfio already has a MemoryListener that can 
be extended, and you wouldn't need the list of RamDiscardListeners. 
There is already a precedent of replaying the current state when a 
listener is added (see listener_add_address_space), so this is not 
something different between ML and RDL.

Also, if you add a new interface, you should have "method call" wrappers 
similar to user_creatable_complete or user_creatable_can_be_deleted.
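
For instance, sketched along the lines of the existing wrappers (the
ram_discard_mgr_*() names are assumed, not taken from this patch):

uint64_t ram_discard_mgr_get_min_granularity(const RamDiscardMgr *rdm,
                                             const MemoryRegion *mr)
{
    RamDiscardMgrClass *rdmc = RAM_DISCARD_MGR_GET_CLASS(rdm);

    g_assert(rdmc->get_min_granularity);
    return rdmc->get_min_granularity(rdm, mr);
}

void ram_discard_mgr_register_listener(RamDiscardMgr *rdm,
                                       const MemoryRegion *mr,
                                       RamDiscardListener *rdl)
{
    RamDiscardMgrClass *rdmc = RAM_DISCARD_MGR_GET_CLASS(rdm);

    g_assert(rdmc->register_listener);
    rdmc->register_listener(rdm, mr, rdl);
}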

Thanks,

Paolo
David Hildenbrand Feb. 22, 2021, 2:03 p.m. UTC | #2
>> +    /**
>> +     * @replay_populated:
>> +     *
>> +     * Notify the #RamDiscardListener about all populated parts within the
>> +     * #MemoryRegion via the #RamDiscardMgr.
>> +     *
>> +     * In case any notification fails, no further notifications are triggered.
>> +     * The listener is not required to be registered.
>> +     *
>> +     * @rdm: the #RamDiscardMgr
>> +     * @mr: the #MemoryRegion
>> +     * @rdl: the #RamDiscardListener
>> +     *
>> +     * Returns 0 on success, or a negative error if any notification failed.
>> +     */
>> +    int (*replay_populated)(const RamDiscardMgr *rdm, const MemoryRegion *mr,
>> +                            RamDiscardListener *rdl);
> 
> If this function is only going to use a single callback, just pass it
> (together with a void *opaque) as the argument.
>> +};
>> +
>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>   typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>>   
>> @@ -487,6 +683,7 @@ struct MemoryRegion {
>>       const char *name;
>>       unsigned ioeventfd_nb;
>>       MemoryRegionIoeventfd *ioeventfds;
>> +    RamDiscardMgr *rdm; /* Only for RAM */
>>   };
> 
> 
> The idea of sending discard notifications is obviously good.  I have a
> couple of questions on the design that you used for the interface; I'm
> not saying that it should be done differently, I would only like to
> understand the trade offs that you chose:

Sure!

> 
> 1) can the RamDiscardManager (no abbreviations :)) be just the owner of

I used to call it "SparseRamManager", but wanted to stress the semantics 
- and can use RamDiscardManager ;) . Suggestions welcome.

> the memory region (obj->parent)?  Alternatively, if you want to make it
> separate from the owner, does it make sense for it to be a separate
> reusable object, sitting between virtio-mem and the MemoryRegion, so
> that the implementation can be reused?

virtio-mem consumes a memory backend (e.g., memory-backend-ram). That 
one logically "ownes" the memory region (and thereby the RAMBlock).

The memory backend gets assigned to virtio-mem. At that point, 
virtio-mem "owns" the memory backend. It will set itself as the 
RamDiscardMgr before mapping the memory region into system address
space (whereby it will get exposed to the system).
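
Roughly like this - a simplified sketch, not the actual virtio-mem code from
this series (it assumes virtio-mem implements the TYPE_RAM_DISCARD_MGR
interface):

static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
{
    VirtIOMEM *vmem = VIRTIO_MEM(dev);
    MemoryRegion *mr = host_memory_backend_get_memory(vmem->memdev);

    /* set the manager before the region gets mapped into system RAM */
    memory_region_set_ram_discard_mgr(mr, RAM_DISCARD_MGR(vmem));

    /* ... the region is mapped later, via the memory device plug path ... */
}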

This flow made sense to me. Regarding "reusable object" - I think the 
only stuff we could fit in there would be e.g., maintaining the lists of 
notifiers. I'd rather wait until we actually have a second user to see 
what we can factor out.

If you have any suggestion/preference, please let me know.

> 
> 2) why have the new RamDiscardListener instead of just extending
> MemoryListener? Conveniently, vfio already has a MemoryListener that can

It behaves more like the IOMMU notifier in that regard.

> be extended, and you wouldn't need the list of RamDiscardListeners.
> There is already a precedent of replaying the current state when a
> listener is added (see listener_add_address_space), so this is not
> something different between ML and RDL.

The main motivation is to let the listener decide how it wants to handle the 
memory region. For example, for vhost, vdpa, kvm, ... I only want a 
single region, not separate ones for each and every populated range, 
punching out discarded ranges. Note that there are cases (i.e., 
anonymous memory), where it's even valid for the guest to read discarded 
memory.

Special cases are only required in corner cases, namely whenever we 
unconditionally:

a) Read memory inside a memory region. (e.g., guest-memory-dump)
b) Write memory inside a memory region. (e.g., TPM, migration)
c) Populate memory inside a memory region. (e.g., vfio)

> 
> Also, if you add a new interface, you should have "method call" wrappers
> similar to user_creatable_complete or user_creatable_can_be_deleted.

I think I had those at some point but decided to drop them. Can re-add them.
Paolo Bonzini Feb. 22, 2021, 2:18 p.m. UTC | #3
On 22/02/21 15:03, David Hildenbrand wrote:
> 
>>> +    /**
>>> +     * @replay_populated:
>>> +     *
>>> +     * Notify the #RamDiscardListener about all populated parts 
>>> within the
>>> +     * #MemoryRegion via the #RamDiscardMgr.
>>> +     *
>>> +     * In case any notification fails, no further notifications are 
>>> triggered.
>>> +     * The listener is not required to be registered.
>>> +     *
>>> +     * @rdm: the #RamDiscardMgr
>>> +     * @mr: the #MemoryRegion
>>> +     * @rdl: the #RamDiscardListener
>>> +     *
>>> +     * Returns 0 on success, or a negative error if any notification 
>>> failed.
>>> +     */
>>> +    int (*replay_populated)(const RamDiscardMgr *rdm, const 
>>> MemoryRegion *mr,
>>> +                            RamDiscardListener *rdl);
>>
>> If this function is only going to use a single callback, just pass it
>> (together with a void *opaque) as the argument.
>>> +};
>>> +
>>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>>   typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>>> @@ -487,6 +683,7 @@ struct MemoryRegion {
>>>       const char *name;
>>>       unsigned ioeventfd_nb;
>>>       MemoryRegionIoeventfd *ioeventfds;
>>> +    RamDiscardMgr *rdm; /* Only for RAM */
>>>   };
>>
>>
>> The idea of sending discard notifications is obviously good.  I have a
>> couple of questions on the design that you used for the interface; I'm
>> not saying that it should be done differently, I would only like to
>> understand the trade offs that you chose:
> 
> Sure!
> 
>>
>> 1) can the RamDiscardManager (no abbreviations :)) be just the owner of
> 
> I used to call it "SparseRamManager", but wanted to stress the semantics 
> - and can use RamDiscardManager ;) . Suggestions welcome.
> 
>> the memory region (obj->parent)?  Alternatively, if you want to make it
>> separate from the owner, does it make sense for it to be a separate
>> reusable object, sitting between virtio-mem and the MemoryRegion, so
>> that the implementation can be reused?
> 
> virtio-mem consumes a memory backend (e.g., memory-backend-ram). That 
> one logically "owns" the memory region (and thereby the RAMBlock).
> 
> The memory backend gets assigned to virtio-mem. At that point, 
> virtio-mem "owns" the memory backend. It will set itself as the 
> RamDiscardMgr before mapping the memory region into system address 
> space (whereby it will get exposed to the system).
> 
> This flow made sense to me. Regarding "reusable object" - I think the 
> only stuff we could fit in there would be e.g., maintaining the lists of 
> notifiers. I'd rather wait until we actually have a second user to see 
> what we can factor out.
> 
> If you have any suggestion/preference, please let me know.
> 
>>
>> 2) why have the new RamDiscardListener instead of just extending
>> MemoryListener? Conveniently, vfio already has a MemoryListener that can
> 
> It behaves more like the IOMMU notifier in that regard.

Yes, but does it behave more like the IOMMU notifier in other regards? 
:)  The IOMMU notifier is concerned with an iova concept that doesn't 
exist at the MemoryRegion level, while RamDiscardListener works at the 
(MemoryRegion, offset) level that can easily be represented by a 
MemoryRegionSection.  Using MemoryRegionSection might even simplify the 
listener code.
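
Roughly, just sketching the idea (not code from this series):

typedef int (*NotifyRamPopulate)(struct RamDiscardListener *rdl,
                                 MemoryRegionSection *section);
typedef void (*NotifyRamDiscard)(struct RamDiscardListener *rdl,
                                 MemoryRegionSection *section);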

>> be extended, and you wouldn't need the list of RamDiscardListeners.
>> There is already a precedent of replaying the current state when a
>> listener is added (see listener_add_address_space), so this is not
>> something different between ML and RDL.
> 
> The main motivation is to let listener decide how it wants to handle the 
> memory region. For example, for vhost, vdpa, kvm, ... I only want a 
> single region, not separate ones for each and every populated range, 
> punching out discarded ranges. Note that there are cases (i.e., 
> anonymous memory), where it's even valid for the guest to read discarded 
> memory.

Yes, I agree with that.  You would still have the same 
region-add/region_nop/region_del callbacks for KVM and friends; on top 
of that you would have region_populate/region_discard callbacks for VFIO.

Populated regions would be replayed after region_add, while I don't 
think it makes sense to have a region_discard_all callback before 
region_discard.

Paolo

> Special cases are only required in corner cases, namely whenever we 
> unconditionally:
> 
> a) Read memory inside a memory region. (e.g., guest-memory-dump)
> b) Write memory inside a memory region. (e.g., TPM, migration)
> c) Populate memory inside a memory region. (e.g., vfio)
> 
>>
>> Also, if you add a new interface, you should have "method call" wrappers
>> similar to user_creatable_complete or user_creatable_can_be_deleted.
> 
> I think I had those at some point but decided to drop them. Can readd them.
> 
>
David Hildenbrand Feb. 22, 2021, 2:53 p.m. UTC | #4
On 22.02.21 15:18, Paolo Bonzini wrote:
> On 22/02/21 15:03, David Hildenbrand wrote:
>>
>>>> +    /**
>>>> +     * @replay_populated:
>>>> +     *
>>>> +     * Notify the #RamDiscardListener about all populated parts
>>>> within the
>>>> +     * #MemoryRegion via the #RamDiscardMgr.
>>>> +     *
>>>> +     * In case any notification fails, no further notifications are
>>>> triggered.
>>>> +     * The listener is not required to be registered.
>>>> +     *
>>>> +     * @rdm: the #RamDiscardMgr
>>>> +     * @mr: the #MemoryRegion
>>>> +     * @rdl: the #RamDiscardListener
>>>> +     *
>>>> +     * Returns 0 on success, or a negative error if any notification
>>>> failed.
>>>> +     */
>>>> +    int (*replay_populated)(const RamDiscardMgr *rdm, const
>>>> MemoryRegion *mr,
>>>> +                            RamDiscardListener *rdl);
>>>
>>> If this function is only going to use a single callback, just pass it
>>> (together with a void *opaque) as the argument.
>>>> +};
>>>> +
>>>>    typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>>>    typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>>>> @@ -487,6 +683,7 @@ struct MemoryRegion {
>>>>        const char *name;
>>>>        unsigned ioeventfd_nb;
>>>>        MemoryRegionIoeventfd *ioeventfds;
>>>> +    RamDiscardMgr *rdm; /* Only for RAM */
>>>>    };
>>>
>>>
>>> The idea of sending discard notifications is obviously good.  I have a
>>> couple of questions on the design that you used for the interface; I'm
>>> not saying that it should be done differently, I would only like to
>>> understand the trade offs that you chose:
>>
>> Sure!
>>
>>>
>>> 1) can the RamDiscardManager (no abbreviations :)) be just the owner of
>>
>> I used to call it "SparseRamManager", but wanted to stress the semantics
>> - and can use RamDiscardManager ;) . Suggestions welcome.
>>
>>> the memory region (obj->parent)?  Alternatively, if you want to make it
>>> separate from the owner, does it make sense for it to be a separate
>>> reusable object, sitting between virtio-mem and the MemoryRegion, so
>>> that the implementation can be reused?
>>
>> virtio-mem consumes a memory backend (e.g., memory-backend-ram). That
>> one logically "owns" the memory region (and thereby the RAMBlock).
>>
>> The memory backend gets assigned to virtio-mem. At that point,
>> virtio-mem "owns" the memory backend. It will set itself as the
>> RamDiscardMgr before mapping the memory region into system address
>> space (whereby it will get exposed to the system).
>>
>> This flow made sense to me. Regarding "reusable object" - I think the
>> only stuff we could fit in there would be e.g., maintaining the lists of
>> notifiers. I'd rather wait until we actually have a second user to see
>> what we can factor out.
>>
>> If you have any suggestion/preference, please let me know.
>>
>>>
>>> 2) why have the new RamDiscardListener instead of just extending
>>> MemoryListener? Conveniently, vfio already has a MemoryListener that can
>>
>> It behaves more like the IOMMU notifier in that regard.
> 
> Yes, but does it behave more like the IOMMU notifier in other regards?
> :)  The IOMMU notifier is concerned with an iova concept that doesn't
> exist at the MemoryRegion level, while RamDiscardListener works at the
> (MemoryRegion, offset) level that can easily be represented by a
> MemoryRegionSection.  Using MemoryRegionSection might even simplify the
> listener code.

It's similarly concerned with rather small, lightweight updates I would say.

> 
>>> be extended, and you wouldn't need the list of RamDiscardListeners.
>>> There is already a precedent of replaying the current state when a
>>> listener is added (see listener_add_address_space), so this is not
>>> something different between ML and RDL.
>>
>> The main motivation is to let listener decide how it wants to handle the
>> memory region. For example, for vhost, vdpa, kvm, ... I only want a
>> single region, not separate ones for each and every populated range,
>> punching out discarded ranges. Note that there are cases (i.e.,
>> anonymous memory), where it's even valid for the guest to read discarded
>> memory.
> 
> Yes, I agree with that.  You would still have the same
> region-add/region_nop/region_del callbacks for KVM and friends; on top
> of that you would have region_populate/region_discard callbacks for VFIO.

I see roughly how it could work; however, I am not sure yet whether this is 
the right approach.

I think instead of region_populate/region_discard we would want 
individual region_add/region_del when populating/discarding for all 
MemoryListeners that opt-in somehow (e.g., VFIO, dump-guest-memory, 
...). Similarly, we would want to call log_sync()/log_clear() then only 
for these parts.

But what happens when I populate/discard some memory? I don't want to 
trigger an address space transaction (begin()...region_nop()...commit()) 
- whenever I populate/discard memory (e.g., in 2 MB granularity). 
Especially not, if nothing might have changed for most other 
MemoryListeners.

> 
> Populated regions would be replayed after region_add, while I don't
> think it makes sense to have a region_discard_all callback before
> region_discard.

How would we handle vfio_listener_log_sync()->vfio_sync_dirty_bitmap()?
Paolo Bonzini Feb. 22, 2021, 5:37 p.m. UTC | #5
On 22/02/21 15:53, David Hildenbrand wrote:
>> Yes, but does it behave more like the IOMMU notifier in other regards?
>> :)  The IOMMU notifier is concerned with an iova concept that doesn't
>> exist at the MemoryRegion level, while RamDiscardListener works at the
>> (MemoryRegion, offset) level that can easily be represented by a
>> MemoryRegionSection.  Using MemoryRegionSection might even simplify the
>> listener code.
> 
> It's similarly concerned with rather small, lightweight updates I would 
> say.

Why does that matter?  I think if it's concerned with the MemoryRegion 
address space it should use MemoryListener and MemoryRegionSection.

>>> The main motivation is to let listener decide how it wants to handle the
>>> memory region. For example, for vhost, vdpa, kvm, ... I only want a
>>> single region, not separate ones for each and every populated range,
>>> punching out discarded ranges. Note that there are cases (i.e.,
>>> anonymous memory), where it's even valid for the guest to read discarded
>>> memory.
>>
>> Yes, I agree with that.  You would still have the same
>> region-add/region_nop/region_del callbacks for KVM and friends; on top
>> of that you would have region_populate/region_discard callbacks for VFIO.
> 
> I think instead of region_populate/region_discard we would want 
> individual region_add/region_del when populating/discarding for all 
> MemoryListeners that opt-in somehow (e.g., VFIO, dump-guest-memory, 
> ...). Similarly, we would want to call log_sync()/log_clear() then only 
> for these parts.
> 
> But what happens when I populate/discard some memory? I don't want to 
> trigger an address space transaction (begin()...region_nop()...commit()) 
> - whenever I populate/discard memory (e.g., in 2 MB granularity). 
> Especially not, if nothing might have changed for most other 
> MemoryListeners.

Right, that was the reason why I was suggesting different callbacks. 
For the VFIO listener, which doesn't have begin or commit callbacks, I 
think you could just rename region_add to region_populate, and point 
both region_del and region_discard to the existing region_del commit.

Calling log_sync/log_clear only for populated parts also makes sense. 
log_sync and log_clear do not have to be within begin/commit, so you can 
change the semantics to call them more than once.



Paolo
David Hildenbrand Feb. 22, 2021, 5:48 p.m. UTC | #6
On 22.02.21 18:37, Paolo Bonzini wrote:
> On 22/02/21 15:53, David Hildenbrand wrote:
>>> Yes, but does it behave more like the IOMMU notifier in other regards?
>>> :)  The IOMMU notifier is concerned with an iova concept that doesn't
>>> exist at the MemoryRegion level, while RamDiscardListener works at the
>>> (MemoryRegion, offset) level that can easily be represented by a
>>> MemoryRegionSection.  Using MemoryRegionSection might even simplify the
>>> listener code.
>>
>> It's similarly concerned with rather small, lightweight updates I would
>> say.
> 
> Why does that matter?  I think if it's concerned with the MemoryRegion
> address space it should use MemoryListener and MemoryRegionSection.
> 
>>>> The main motivation is to let listener decide how it wants to handle the
>>>> memory region. For example, for vhost, vdpa, kvm, ... I only want a
>>>> single region, not separate ones for each and every populated range,
>>>> punching out discarded ranges. Note that there are cases (i.e.,
>>>> anonymous memory), where it's even valid for the guest to read discarded
>>>> memory.
>>>
>>> Yes, I agree with that.  You would still have the same
>>> region-add/region_nop/region_del callbacks for KVM and friends; on top
>>> of that you would have region_populate/region_discard callbacks for VFIO.
>>
>> I think instead of region_populate/region_discard we would want
>> individual region_add/region_del when populating/discarding for all
>> MemoryListeners that opt-in somehow (e.g., VFIO, dump-guest-memory,
>> ...). Similarly, we would want to call log_sync()/log_clear() then only
>> for these parts.
>>
>> But what happens when I populate/discard some memory? I don't want to
>> trigger an address space transaction (begin()...region_nop()...commit())
>> - whenever I populate/discard memory (e.g., in 2 MB granularity).
>> Especially not, if nothing might have changed for most other
>> MemoryListeners.
> 
> Right, that was the reason why I was suggesting different callbacks.
> For the VFIO listener, which doesn't have begin or commit callbacks, I
> think you could just rename region_add to region_populate, and point
> both region_del and region_discard to the existing region_del commit.
> 
> Calling log_sync/log_clear only for populated parts also makes sense.
> log_sync and log_clear do not have to be within begin/commit, so you can
> change the semantics to call them more than once.

I'll prototype to see how it looks/feels. As long as it's moving logic 
out of the VFIO code into the address space update code it could be 
quite alright.
David Hildenbrand Feb. 22, 2021, 7:43 p.m. UTC | #7
>>>> The main motivation is to let listener decide how it wants to handle the
>>>> memory region. For example, for vhost, vdpa, kvm, ... I only want a
>>>> single region, not separate ones for each and every populated range,
>>>> punching out discarded ranges. Note that there are cases (i.e.,
>>>> anonymous memory), where it's even valid for the guest to read discarded
>>>> memory.
>>>
>>> Yes, I agree with that.  You would still have the same
>>> region-add/region_nop/region_del callbacks for KVM and friends; on top
>>> of that you would have region_populate/region_discard callbacks for VFIO.
>>
>> I think instead of region_populate/region_discard we would want
>> individual region_add/region_del when populating/discarding for all
>> MemoryListeners that opt-in somehow (e.g., VFIO, dump-guest-memory,
>> ...). Similarly, we would want to call log_sync()/log_clear() then only
>> for these parts.
>>
>> But what happens when I populate/discard some memory? I don't want to
>> trigger an address space transaction (begin()...region_nop()...commit())
>> - whenever I populate/discard memory (e.g., in 2 MB granularity).
>> Especially not, if nothing might have changed for most other
>> MemoryListeners.
> 
> Right, that was the reason why I was suggesting different callbacks.
> For the VFIO listener, which doesn't have begin or commit callbacks, I
> think you could just rename region_add to region_populate, and point
> both region_del and region_discard to the existing region_del commit.
> 
> Calling log_sync/log_clear only for populated parts also makes sense.
> log_sync and log_clear do not have to be within begin/commit, so you can
> change the semantics to call them more than once.

So I looked at the simplest of all cases (discard) and I am not convinced yet
that this is the right approach. I can understand why it looks like this fits
into the MemoryListener, but I am not sure if it gives us any real benefits or
makes the code any clearer (I'd even say it's the contrary).


+void memory_region_notify_discard(MemoryRegion *mr, hwaddr offset,
+                                  hwaddr size)
+{
+    hwaddr mr_start, mr_end;
+    MemoryRegionSection mrs;
+    MemoryListener *listener;
+    AddressSpace *as;
+    FlatView *view;
+    FlatRange *fr;
+
+    QTAILQ_FOREACH(listener, &memory_listeners, link) {
+        if (!listener->region_discard) {
+            continue;
+        }
+        as = listener->address_space;
+        view = address_space_get_flatview(as);
+        FOR_EACH_FLAT_RANGE(fr, view) {
+            if (fr->mr != mr) {
+                continue;
+            }
+
+            mrs = section_from_flat_range(fr, view);
+
+            mr_start = MAX(mrs.offset_within_region, offset);
+            mr_end = MIN(offset + size,
+                          mrs.offset_within_region + int128_get64(mrs.size));
+            mr_end = MIN(mr_end, offset + size);
+
+            if (mr_start >= mr_end) {
+                continue;
+            }
+
+            mrs.offset_within_address_space += mr_start -
+                                               mrs.offset_within_region;
+            mrs.offset_within_region = mr_start;
+            mrs.size = int128_make64(mr_end - mr_start);
+            listener->region_discard(listener, &mrs);
+        }
+        flatview_unref(view);
+    }
+}

Maybe I am missing something important. This looks highly inefficient.

1. Although we know the memory region, we have to walk over the whole address
    space ... over and over again for each potential listener.

2. Even without any applicable listeners (=> ! VFIO) we loop over all listeners.
    There are ways around that but it doesn't make the code nicer IMHO.

3. In the future I am planning on sending populate/discard events
    without the BQL (in my approach, synchronizing internally against
    register/unregister/populate/discard ...). I don't see an easy way
    to achieve that here. I think we are required to hold the BQL on any
    updates.

memory_region_notify_populate() gets quite ugly when we realize halfway that
we have to revert what we already did by notifying about already populated
pieces ...
David Hildenbrand Feb. 23, 2021, 10:50 a.m. UTC | #8
On 22.02.21 20:43, David Hildenbrand wrote:
>>>>> The main motivation is to let listener decide how it wants to handle the
>>>>> memory region. For example, for vhost, vdpa, kvm, ... I only want a
>>>>> single region, not separate ones for each and every populated range,
>>>>> punching out discarded ranges. Note that there are cases (i.e.,
>>>>> anonymous memory), where it's even valid for the guest to read discarded
>>>>> memory.
>>>>
>>>> Yes, I agree with that.  You would still have the same
>>>> region-add/region_nop/region_del callbacks for KVM and friends; on top
>>>> of that you would have region_populate/region_discard callbacks for VFIO.
>>>
>>> I think instead of region_populate/region_discard we would want
>>> individual region_add/region_del when populating/discarding for all
>>> MemoryListeners that opt-in somehow (e.g., VFIO, dump-guest-memory,
>>> ...). Similarly, we would want to call log_sync()/log_clear() then only
>>> for these parts.
>>>
>>> But what happens when I populate/discard some memory? I don't want to
>>> trigger an address space transaction (begin()...region_nop()...commit())
>>> - whenever I populate/discard memory (e.g., in 2 MB granularity).
>>> Especially not, if nothing might have changed for most other
>>> MemoryListeners.
>>
>> Right, that was the reason why I was suggesting different callbacks.
>> For the VFIO listener, which doesn't have begin or commit callbacks, I
>> think you could just rename region_add to region_populate, and point
>> both region_del and region_discard to the existing region_del commit.
>>
>> Calling log_sync/log_clear only for populated parts also makes sense.
>> log_sync and log_clear do not have to be within begin/commit, so you can
>> change the semantics to call them more than once.
> 
> So I looked at the simplest of all cases (discard) and I am not convinced yet
> that this is the right approach. I can understand why it looks like this fits
> into the MemoryListener, but I am not sure if it gives us any real benefits or
> makes the code any clearer (I'd even say it's the contrary).
> 
> 
> +void memory_region_notify_discard(MemoryRegion *mr, hwaddr offset,
> +                                  hwaddr size)
> +{
> +    hwaddr mr_start, mr_end;
> +    MemoryRegionSection mrs;
> +    MemoryListener *listener;
> +    AddressSpace *as;
> +    FlatView *view;
> +    FlatRange *fr;
> +
> +    QTAILQ_FOREACH(listener, &memory_listeners, link) {
> +        if (!listener->region_discard) {
> +            continue;
> +        }
> +        as = listener->address_space;
> +        view = address_space_get_flatview(as);
> +        FOR_EACH_FLAT_RANGE(fr, view) {
> +            if (fr->mr != mr) {
> +                continue;
> +            }
> +
> +            mrs = section_from_flat_range(fr, view);
> +
> +            mr_start = MAX(mrs.offset_within_region, offset);
> +            mr_end = MIN(offset + size,
> +                          mrs.offset_within_region + int128_get64(mrs.size));
> +            mr_end = MIN(mr_end, offset + size);
> +
> +            if (mr_start >= mr_end) {
> +                continue;
> +            }
> +
> +            mrs.offset_within_address_space += mr_start -
> +                                               mrs.offset_within_region;
> +            mrs.offset_within_region = mr_start;
> +            mrs.size = int128_make64(mr_end - mr_start);
> +            listener->region_discard(listener, &mrs);
> +        }
> +        flatview_unref(view);
> +    }
> +}
> 
> Maybe I am missing something important. This looks highly inefficient.
> 
> 1. Although we know the memory region we have to walk over the whole address
>      space ... over and over again for each potential listener.
> 
> 2. Even without any applicable listeners (=> ! VFIO) we loop over all listeners.
>      There are ways around that but it doesn't make the code nicer IMHO.
> 
> 3. In the future I am planning on sending populate/discard events
>      without the BQL (in my approach, synchronizing internally against
>      register/unregister/populate/discard ...). I don't see an easy way
>      to achieve that here. I think we are required to hold the BQL on any
>      updates.
> 
> memory_region_notify_populate() gets quite ugly when we realize halfway that
> we have to revert what we already did by notifying about already populated
> pieces ...
> 

So, the more I look into it the more I doubt this should go into the 
MemoryListener.

The RamDiscardManager is specific to RAM memory regions - similarly the 
IOMMU notifier is specific to IOMMU regions.

In the near future we will have two "clients" (vfio, 
tpm/dump-guest-memory), whereby only vfio will actually have to register 
for updates at runtime.

I really want to have a dedicated registration/notifier mechanism, for 
reasons already mentioned in my last mail, but also to later reuse that 
mechanism in other context as noted in the cover letter:

"Note: At some point, we might want to let RAMBlock users (esp. vfio 
used for nvme://) consume this interface as well. We'll need RAMBlock 
notifier calls when a RAMBlock is getting mapped/unmapped (via the 
corresponding memory region), so we can properly register a listener 
there as well."


However, I do agree that the notifier should work with 
MemoryRegionSection - this will make the "client" code much easier.

The register()/replay_populate() mechanism should consume a 
MemoryRegionSection to work on, and call the listener via (adjusted) 
MemoryRegionSection.
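
The adjustment itself is then just an intersection with the affected range,
roughly like this (sketch, the helper name is made up):

static bool ram_discard_intersect_section(MemoryRegionSection *s,
                                          hwaddr offset, hwaddr size)
{
    hwaddr start = MAX(s->offset_within_region, offset);
    hwaddr end = MIN(s->offset_within_region + int128_get64(s->size),
                     offset + size);

    if (start >= end) {
        return false;
    }
    s->offset_within_address_space += start - s->offset_within_region;
    s->offset_within_region = start;
    s->size = int128_make64(end - start);
    return true;
}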

Maybe I'm even able to simplify/get rid of the discard_all() callback.
Paolo Bonzini Feb. 23, 2021, 3:03 p.m. UTC | #9
On 23/02/21 11:50, David Hildenbrand wrote:
> 
> 
> However, I do agree that the notifier should work with 
> MemoryRegionSection - this will make the "client" code much easier.
> 
> The register()/replay_populate() mechanism should consume a 
> MemoryRegionSection to work on, and call the listener via (adjusted) 
> MemoryRegionSection.
> 
> Maybe I'm even able to simplify/get rid of the discard_all() callback.

Good, thanks for trying and finding the best of both worlds.

Paolo
David Hildenbrand Feb. 23, 2021, 3:09 p.m. UTC | #10
On 23.02.21 16:03, Paolo Bonzini wrote:
> On 23/02/21 11:50, David Hildenbrand wrote:
>>
>>
>> However, I do agree that the notifier should work with
>> MemoryRegionSection - this will make the "client" code much easier.
>>
>> The register()/replay_populate() mechanism should consume a
>> MemoryRegionSection to work on, and call the listener via (adjusted)
>> MemoryRegionSection.
>>
>> Maybe I'm even able to simplify/get rid of the discard_all() callback.
> 
> Good, thanks for trying and finding the best of both worlds.

I'll do a couple of reworks and then share the current state, then we 
can discuss if this is going into the right direction.

Thanks a lot for the feedback!

Patch

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c6fb714e49..6132910767 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -42,6 +42,12 @@  typedef struct IOMMUMemoryRegionClass IOMMUMemoryRegionClass;
 DECLARE_OBJ_CHECKERS(IOMMUMemoryRegion, IOMMUMemoryRegionClass,
                      IOMMU_MEMORY_REGION, TYPE_IOMMU_MEMORY_REGION)
 
+#define TYPE_RAM_DISCARD_MGR "qemu:ram-discard-mgr"
+typedef struct RamDiscardMgrClass RamDiscardMgrClass;
+typedef struct RamDiscardMgr RamDiscardMgr;
+DECLARE_OBJ_CHECKERS(RamDiscardMgr, RamDiscardMgrClass, RAM_DISCARD_MGR,
+                     TYPE_RAM_DISCARD_MGR);
+
 #ifdef CONFIG_FUZZ
 void fuzz_dma_read_cb(size_t addr,
                       size_t len,
@@ -124,6 +130,66 @@  typedef struct IOMMUTLBEvent {
     IOMMUTLBEntry entry;
 } IOMMUTLBEvent;
 
+struct RamDiscardListener;
+typedef int (*NotifyRamPopulate)(struct RamDiscardListener *rdl,
+                                 const MemoryRegion *mr, ram_addr_t offset,
+                                 ram_addr_t size);
+typedef void (*NotifyRamDiscard)(struct RamDiscardListener *rdl,
+                                 const MemoryRegion *mr, ram_addr_t offset,
+                                 ram_addr_t size);
+typedef void (*NotifyRamDiscardAll)(struct RamDiscardListener *rdl,
+                                    const MemoryRegion *mr);
+
+typedef struct RamDiscardListener {
+    /*
+     * @notify_populate:
+     *
+     * Notification that previously discarded memory is about to get populated.
+     * Listeners are able to object. If any listener objects, already
+     * successfully notified listeners are notified about a discard again.
+     *
+     * @rdl: the #RamDiscardListener getting notified
+     * @mr: the relevant #MemoryRegion
+     * @offset: offset into the #MemoryRegion, aligned to minimum granularity of
+     *          the #RamDiscardMgr
+     * @size: the size, aligned to minimum granularity of the #RamDiscardMgr
+     *
+     * Returns 0 on success. If the notification is rejected by the listener,
+     * an error is returned.
+     */
+    NotifyRamPopulate notify_populate;
+
+    /*
+     * @notify_discard:
+     *
+     * Notification that previously populated memory was discarded successfully
+     * and listeners should drop all references to such memory and prevent
+     * new population (e.g., unmap).
+     *
+     * @rdl: the #RamDiscardListener getting notified
+     * @mr: the relevant #MemoryRegion
+     * @offset: offset into the #MemoryRegion, aligned to minimum granularity of
+     *          the #RamDiscardMgr
+     * @size: the size, aligned to minimum granularity of the #RamDiscardMgr
+     */
+    NotifyRamDiscard notify_discard;
+
+    /*
+     * @notify_discard_all:
+     *
+     * Notification that all previously populated memory was discarded
+     * successfully.
+     *
+     * Note: this callback is optional. If not set, individual notify_populate()
+     * notifications are triggered.
+     *
+     * @rdl: the #RamDiscardListener getting notified
+     * @mr: the relevant #MemoryRegion
+     */
+    NotifyRamDiscardAll notify_discard_all;
+    QLIST_ENTRY(RamDiscardListener) next;
+} RamDiscardListener;
+
 /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
 #define RAM_PREALLOC   (1 << 0)
 
@@ -167,6 +233,16 @@  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
     n->iommu_idx = iommu_idx;
 }
 
+static inline void ram_discard_listener_init(RamDiscardListener *rdl,
+                                             NotifyRamPopulate populate_fn,
+                                             NotifyRamDiscard discard_fn,
+                                             NotifyRamDiscardAll discard_all_fn)
+{
+    rdl->notify_populate = populate_fn;
+    rdl->notify_discard = discard_fn;
+    rdl->notify_discard_all = discard_all_fn;
+}
+
 /*
  * Memory region callbacks
  */
@@ -441,6 +517,126 @@  struct IOMMUMemoryRegionClass {
                                      Error **errp);
 };
 
+/*
+ * RamDiscardMgrClass:
+ *
+ * A #RamDiscardMgr coordinates which parts of specific RAM #MemoryRegion
+ * regions are currently populated to be used/accessed by the VM, notifying
+ * after parts were discarded (freeing up memory) and before parts will be
+ * populated (consuming memory), to be used/accessed by the VM.
+ *
+ * A #RamDiscardMgr can only be set for a RAM #MemoryRegion while the
+ * #MemoryRegion isn't mapped yet; it cannot change while the #MemoryRegion is
+ * mapped.
+ *
+ * The #RamDiscardMgr is intended to be used by technologies that are
+ * incompatible with discarding of RAM (e.g., VFIO, which may pin all
+ * memory inside a #MemoryRegion), and require proper coordination to only
+ * map the currently populated parts, to hinder parts that are expected to
+ * remain discarded from silently getting populated and consuming memory.
+ * Technologies that support discarding of RAM don't have to bother and can
+ * simply map the whole #MemoryRegion.
+ *
+ * An example #RamDiscardMgr is virtio-mem, which logically (un)plugs
+ * memory within an assigned RAM #MemoryRegion, coordinated with the VM.
+ * Logically unplugging memory consists of discarding RAM. The VM agreed to not
+ * access unplugged (discarded) memory - especially via DMA. virtio-mem will
+ * properly coordinate with listeners before memory is plugged (populated),
+ * and after memory is unplugged (discarded).
+ *
+ * Listeners are called in multiples of the minimum granularity and changes are
+ * aligned to the minimum granularity within the #MemoryRegion. Listeners have
+ * to prepare for memory becoming discarded in a different granularity than it
+ * was populated and the other way around.
+ */
+struct RamDiscardMgrClass {
+    /* private */
+    InterfaceClass parent_class;
+
+    /* public */
+
+    /**
+     * @get_min_granularity:
+     *
+     * Get the minimum granularity in which listeners will get notified
+     * about changes within the #MemoryRegion via the #RamDiscardMgr.
+     *
+     * @rdm: the #RamDiscardMgr
+     * @mr: the #MemoryRegion
+     *
+     * Returns the minimum granularity.
+     */
+    uint64_t (*get_min_granularity)(const RamDiscardMgr *rdm,
+                                    const MemoryRegion *mr);
+
+    /**
+     * @is_populated:
+     *
+     * Check whether the given range within the #MemoryRegion is completely
+     * populated (i.e., no parts are currently discarded). There are no
+     * alignment requirements for the range.
+     *
+     * @rdm: the #RamDiscardMgr
+     * @mr: the #MemoryRegion
+     * @offset: offset into the #MemoryRegion
+     * @size: size in the #MemoryRegion
+     *
+     * Returns whether the given range is completely populated.
+     */
+    bool (*is_populated)(const RamDiscardMgr *rdm, const MemoryRegion *mr,
+                         ram_addr_t offset, ram_addr_t size);
+
+    /**
+     * @register_listener:
+     *
+     * Register a #RamDiscardListener for a #MemoryRegion via the
+     * #RamDiscardMgr and immediately notify the #RamDiscardListener about all
+     * populated parts within the #MemoryRegion via the #RamDiscardMgr.
+     *
+     * In case any notification fails, no further notifications are triggered
+     * and an error is logged.
+     *
+     * @rdm: the #RamDiscardMgr
+     * @mr: the #MemoryRegion
+     * @rdl: the #RamDiscardListener
+     */
+    void (*register_listener)(RamDiscardMgr *rdm, const MemoryRegion *mr,
+                              RamDiscardListener *rdl);
+
+    /**
+     * @unregister_listener:
+     *
+     * Unregister a previously registered #RamDiscardListener for a
+     * #MemoryRegion via the #RamDiscardMgr after notifying the
+     * #RamDiscardListener about all populated parts becoming unpopulated
+     * within the #MemoryRegion via the #RamDiscardMgr.
+     *
+     * @rdm: the #RamDiscardMgr
+     * @mr: the #MemoryRegion
+     * @rdl: the #RamDiscardListener
+     */
+    void (*unregister_listener)(RamDiscardMgr *rdm, const MemoryRegion *mr,
+                                RamDiscardListener *rdl);
+
+    /**
+     * @replay_populated:
+     *
+     * Notify the #RamDiscardListener about all populated parts within the
+     * #MemoryRegion via the #RamDiscardMgr.
+     *
+     * In case any notification fails, no further notifications are triggered.
+     * The listener is not required to be registered.
+     *
+     * @rdm: the #RamDiscardMgr
+     * @mr: the #MemoryRegion
+     * @rdl: the #RamDiscardListener
+     *
+     * Returns 0 on success, or a negative error if any notification failed.
+     */
+    int (*replay_populated)(const RamDiscardMgr *rdm, const MemoryRegion *mr,
+                            RamDiscardListener *rdl);
+};
+
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
 typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
 
@@ -487,6 +683,7 @@  struct MemoryRegion {
     const char *name;
     unsigned ioeventfd_nb;
     MemoryRegionIoeventfd *ioeventfds;
+    RamDiscardMgr *rdm; /* Only for RAM */
 };
 
 struct IOMMUMemoryRegion {
@@ -1979,6 +2176,40 @@  bool memory_region_present(MemoryRegion *container, hwaddr addr);
  */
 bool memory_region_is_mapped(MemoryRegion *mr);
 
+/**
+ * memory_region_get_ram_discard_mgr: get the #RamDiscardMgr for a
+ * #MemoryRegion
+ *
+ * The #RamDiscardMgr cannot change while a memory region is mapped.
+ *
+ * @mr: the #MemoryRegion
+ */
+RamDiscardMgr *memory_region_get_ram_discard_mgr(MemoryRegion *mr);
+
+/**
+ * memory_region_has_ram_discard_mgr: check whether a #MemoryRegion has a
+ * #RamDiscardMgr assigned
+ *
+ * @mr: the #MemoryRegion
+ */
+static inline bool memory_region_has_ram_discard_mgr(MemoryRegion *mr)
+{
+    return !!memory_region_get_ram_discard_mgr(mr);
+}
+
+/**
+ * memory_region_set_ram_discard_mgr: set the #RamDiscardMgr for a
+ * #MemoryRegion
+ *
+ * This function must not be called for a mapped #MemoryRegion, a #MemoryRegion
+ * that does not cover RAM, or a #MemoryRegion that already has a
+ * #RamDiscardMgr assigned.
+ *
+ * @mr: the #MemoryRegion
+ * @rdm: #RamDiscardMgr to set
+ */
+void memory_region_set_ram_discard_mgr(MemoryRegion *mr, RamDiscardMgr *rdm);
+
 /**
  * memory_region_find: translate an address/size relative to a
  * MemoryRegion into a #MemoryRegionSection.
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 874a8fccde..5e7bea7661 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2029,6 +2029,21 @@  int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
     return imrc->num_indexes(iommu_mr);
 }
 
+RamDiscardMgr *memory_region_get_ram_discard_mgr(MemoryRegion *mr)
+{
+    if (!memory_region_is_mapped(mr) || !memory_region_is_ram(mr)) {
+        return NULL;
+    }
+    return mr->rdm;
+}
+
+void memory_region_set_ram_discard_mgr(MemoryRegion *mr, RamDiscardMgr *rdm)
+{
+    g_assert(memory_region_is_ram(mr) && !memory_region_is_mapped(mr));
+    g_assert(!rdm || !mr->rdm);
+    mr->rdm = rdm;
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
@@ -3309,10 +3324,17 @@  static const TypeInfo iommu_memory_region_info = {
     .abstract           = true,
 };
 
+static const TypeInfo ram_discard_mgr_info = {
+    .parent             = TYPE_INTERFACE,
+    .name               = TYPE_RAM_DISCARD_MGR,
+    .class_size         = sizeof(RamDiscardMgrClass),
+};
+
 static void memory_register_types(void)
 {
     type_register_static(&memory_region_info);
     type_register_static(&iommu_memory_region_info);
+    type_register_static(&ram_discard_mgr_info);
 }
 
 type_init(memory_register_types)