
[13/16] xen/page_alloc: add a path for xenheap when there is no direct map

Message ID 32ae7c14babf7e78b60febb53095a74c5e865456.1588278317.git.hongyxia@amazon.com
State New, archived
Series Remove the direct map

Commit Message

Hongyan Xia April 30, 2020, 8:44 p.m. UTC
From: Hongyan Xia <hongyxia@amazon.com>

When there is not an always-mapped direct map, xenheap allocations need
to be mapped and unmapped on-demand.

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
---
 xen/common/page_alloc.c | 45 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

Comments

Julien Grall May 1, 2020, 8:50 a.m. UTC | #1
Hi Hongyan,

On 30/04/2020 21:44, Hongyan Xia wrote:
> From: Hongyan Xia <hongyxia@amazon.com>
> 
> When there is not an always-mapped direct map, xenheap allocations need
> to be mapped and unmapped on-demand.
> 
> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
> ---
>   xen/common/page_alloc.c | 45 ++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 10b7aeca48..1285fc5977 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -2143,6 +2143,7 @@ void init_xenheap_pages(paddr_t ps, paddr_t pe)
>   void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
>   {
>       struct page_info *pg;
> +    void *ret;
>   
>       ASSERT(!in_irq());
>   
> @@ -2151,14 +2152,27 @@ void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
>       if ( unlikely(pg == NULL) )
>           return NULL;
>   
> -    memguard_unguard_range(page_to_virt(pg), 1 << (order + PAGE_SHIFT));
> +    ret = page_to_virt(pg);
>   
> -    return page_to_virt(pg);
> +    if ( !arch_has_directmap() &&
> +         map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
> +                          PAGE_HYPERVISOR) )

The only user (arm32) of split domheap/xenheap has no directmap. So this 
will break arm32 as we don't support superpage shattering (for good 
reasons).

In this configuration, only xenheap pages are always mapped. Domheap 
will be mapped on-demand. So I don't think we need to map/unmap xenheap 
pages at allocation/free.

Cheers,
Jan Beulich April 22, 2021, 12:31 p.m. UTC | #2
On 30.04.2020 22:44, Hongyan Xia wrote:
> From: Hongyan Xia <hongyxia@amazon.com>
> 
> When there is not an always-mapped direct map, xenheap allocations need
> to be mapped and unmapped on-demand.
> 
> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>

This series has been left uncommented for far too long - I'm sorry.
While earlier patches here are probably reasonable (but would likely
need re-basing, so I'm not sure whether to try to get to look through
them before that makes much sense), I'd like to spell out that I'm
not really happy with the approach taken here: Simply re-introducing
a direct map entry for individual pages is not the way to go imo.
First and foremost this is rather wasteful in terms of resources (VA
space).

As I don't think we have many cases where code actually depends on
being able to apply __va() (or equivalent) to the address returned
from alloc_xenheap_pages(), I think this should instead involve
vmap(), with the vmap area drastically increased (perhaps taking all
of the space the direct map presently consumes). For any remaining
users of __va() or alike these should perhaps be converted into an
alias / derivation of vmap_to_{mfn,page}() then.
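
Roughly, and only as a sketch of the shape this could take (it glosses
over the zone/scrub handling of the real allocator and assumes the
existing __vmap()/vunmap()/vmap_to_page() interfaces), the xenheap
entry points might then look like:

#include <xen/mm.h>
#include <xen/vmap.h>

/* Sketch only: xenheap allocations backed by the vmap area. */
void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
{
    struct page_info *pg = alloc_domheap_pages(NULL, order, memflags);
    mfn_t mfn;
    void *va;

    if ( pg == NULL )
        return NULL;

    mfn = page_to_mfn(pg);

    /* Map 1u << order contiguous MFNs starting at mfn into vmap space. */
    va = __vmap(&mfn, 1u << order, 1, 1, PAGE_HYPERVISOR, VMAP_DEFAULT);
    if ( va == NULL )
        free_domheap_pages(pg, order);

    return va;
}

void free_xenheap_pages(void *v, unsigned int order)
{
    struct page_info *pg;

    if ( v == NULL )
        return;

    /* Recover the page_info before the mapping goes away. */
    pg = vmap_to_page(v);
    vunmap(v);
    free_domheap_pages(pg, order);
}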

Since the goal of eliminating the direct map is to have unrelated
guests' memory not mapped when running a certain guest, it could
then be further considered to "overmap" what is being requested -
rather than just mapping the single 4k page, the containing 2M or 1G
one could be mapped (provided it all belongs to the running guest),
while unmapping could be deferred until the next context switch to a
different domain (or, if necessary, for 64-bit PV guests until the
next switch to guest user mode). Of course this takes as a prerequisite
a sufficiently low-overhead means of establishing whether the larger
page containing a smaller one is owned entirely by the same domain.
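
The ownership check itself is easy to write down; the hard part is doing
it cheaply. A brute-force sketch (using the usual
mfn_to_page()/page_get_owner() helpers; the function name is
illustrative), just to pin down the intended semantics:

#include <xen/mm.h>
#include <xen/sched.h>

/*
 * Sketch: is the 2M superpage containing @mfn wholly owned by @d?
 * A linear scan like this is exactly the overhead to be avoided on
 * hot paths.
 */
static bool superpage_fully_owned_by(mfn_t mfn, const struct domain *d)
{
    unsigned long base = mfn_x(mfn) & ~((1UL << PAGETABLE_ORDER) - 1);
    unsigned long i;

    for ( i = 0; i < (1UL << PAGETABLE_ORDER); i++ )
    {
        struct page_info *pg = mfn_to_page(_mfn(base + i));

        if ( page_get_owner(pg) != d )
            return false;
    }

    return true;
}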

Jan
Hongyan Xia April 28, 2021, 11:04 a.m. UTC | #3
On Thu, 2021-04-22 at 14:31 +0200, Jan Beulich wrote:
> On 30.04.2020 22:44, Hongyan Xia wrote:
> > From: Hongyan Xia <hongyxia@amazon.com>
> > 
> > When there is not an always-mapped direct map, xenheap allocations
> > need
> > to be mapped and unmapped on-demand.
> > 
> > Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
> 
> This series has been left uncommented for far too long - I'm sorry.
> While earlier patches here are probably reasonable (but would likely
> need re-basing, so I'm not sure whether to try to get to look through
> them before that makes much sense),

No worries. This series depends on the domheap Xen page table
conversion series anyway (which was just fully merged, thanks). I will
re-base now since the dependency is resolved.

> As I don't think we have many cases where code actually depends on
> being able to apply __va() (or equivalent) to the address returned
> from alloc_xenheap_pages(), I think this should instead involve
> vmap(), with the vmap area drastically increased (perhaps taking all
> of the space the direct map presently consumes). For any remaining
> users of __va() or alike these should perhaps be converted into an
> alias / derivation of vmap_to_{mfn,page}() then.

That's true, and this was my first implementation (and also Wei's
original proposal) which worked okay. But, several problems got in the
way.

1. Partial unmap. Biggest offender is xmalloc which allocates and could
then free part of it, which means we need to be able to partially unmap
the region. vmap() does not support this.

2. Fast PA->VA. There is currently no way to go from PA to VA in
vmapped pages, unless we somehow repurpose or add new fields in
page_info. Also, VA->PA is possible but very slow now. There is not
much PA->VA in the critical path but see 3.

3. EPT. The EPT map/unmap operations in HVM hypercalls and MMIO
handling are so frequent and so slow that it is probably not possible
to keep EPT pages as domheap pages, due to the big performance drop
after removing the direct map. If we move them to vmap()-ed xenheap
pages, then this depends on 2 for page table walking.

In the end, I could not find a way that met all 3 above without massive
and intrusive changes. If there is a way, it certainly needs a design
document. The "on-demand" direct map solves all the problems without
breaking any APIs and is very easy to understand. We have been running
Xen without the direct map with this approach for a while now, with
decent performance (in fact, real-world benchmarks alone cannot tell
that the direct map is gone).

I do agree that this approach is a little hacky and wastes a big chunk
of virtual address space. It definitely warrants some discussion on
whether a better way can be found that solves these problems.

Thanks,
Hongyan
Jan Beulich April 28, 2021, 11:51 a.m. UTC | #4
On 28.04.2021 13:04, Hongyan Xia wrote:
> On Thu, 2021-04-22 at 14:31 +0200, Jan Beulich wrote:
>> As I don't think we have many cases where code actually depends on
>> being able to apply __va() (or equivalent) to the address returned
>> from alloc_xenheap_pages(), I think this should instead involve
>> vmap(), with the vmap area drastically increased (perhaps taking all
>> of the space the direct map presently consumes). For any remaining
>> users of __va() or alike these should perhaps be converted into an
>> alias / derivation of vmap_to_{mfn,page}() then.
> 
> That's true, and this was my first implementation (and also Wei's
> original proposal) which worked okay. But, several problems got in the
> way.
> 
> 1. Partial unmap. Biggest offender is xmalloc which allocates and could
> then free part of it, which means we need to be able to partially unmap
> the region. vmap() does not support this.

If the direct map went fully away, and hence if Xen heap pages got
vmap()-ed, there's no reason to keep xmalloc() from forwarding to
vmalloc() instead of going this partial-unmap route.
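
In other words, something of this shape (names illustrative, not the
actual xmalloc_tlsf.c internals):

#include <xen/vmap.h>

/*
 * Sketch: with no direct map, page-granular xmalloc() requests can be
 * backed by vmalloc(), so a region is always unmapped and freed as a
 * whole via vfree() and no partial unmap is ever needed.
 */
static void *xmalloc_whole_pages_sketch(size_t size)
{
    return vmalloc(size);   /* page-granular, vmap-backed */
}

static void xfree_whole_pages_sketch(void *p)
{
    vfree(p);               /* tears down the whole mapping and frees it */
}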

> 2. Fast PA->VA. There is currently no way to go from PA to VA in
> vmapped pages, unless we somehow repurpose or add new fields in
> page_info. Also, VA->PA is possible but very slow now. There is not
> much PA->VA in the critical path but see 3.

There would better not be any PA->VA. Can you point out examples
where it would be hard to avoid using such? I also don't see the
connection to 3 - is EPT code using PA->VA a lot? p2m-ept.c does
not look to have a single use of __va() or ..._to_virt().

> 3. EPT. The EPT map/unmap operations in HVM hypercalls and MMIO
> handling are so frequent and so slow that it is probably not possible
> to keep EPT pages as domheap pages, due to the big performance drop
> after removing the direct map. If we move them to vmap()-ed xenheap
> pages, then this depends on 2 for page table walking.

See my proposal to defer unmapping of the domain's own pages
(and I would consider the p2m pages to be part of the domain's
ones for this purpose). In fact, since the p2m pages come from a
fixed, separate pool I wonder whether the entire pool couldn't
be mapped in e.g. the per-domain VA range.

Jan
Hongyan Xia April 28, 2021, 1:22 p.m. UTC | #5
On Wed, 2021-04-28 at 13:51 +0200, Jan Beulich wrote:
> On 28.04.2021 13:04, Hongyan Xia wrote:
> > On Thu, 2021-04-22 at 14:31 +0200, Jan Beulich wrote:
> > > As I don't think we have many cases where code actually depends
> > > on
> > > being able to apply __va() (or equivalent) to the address
> > > returned
> > > from alloc_xenheap_pages(), I think this should instead involve
> > > vmap(), with the vmap area drastically increased (perhaps taking
> > > all
> > > of the space the direct map presently consumes). For any
> > > remaining
> > > users of __va() or alike these should perhaps be converted into
> > > an
> > > alias / derivation of vmap_to_{mfn,page}() then.
> > 
> > That's true, and this was my first implementation (and also Wei's
> > original proposal) which worked okay. But, several problems got in
> > the
> > way.
> > 
> > 1. Partial unmap. Biggest offender is xmalloc which allocates and
> > could
> > then free part of it, which means we need to be able to partially
> > unmap
> > the region. vmap() does not support this.
> 
> If the direct map went fully away, and hence if Xen heap pages got
> vmap()-ed, there's no reason to keep xmalloc() from forwarding to
> vmalloc() instead of going this partial-unmap route.
> 
> > 2. Fast PA->VA. There is currently no way to go from PA to VA in
> > vmapped pages, unless we somehow repurpose or add new fields in
> > page_info. Also, VA->PA is possible but very slow now. There is not
> > much PA->VA in the critical path but see 3.
> 
> There would better not be any PA->VA. Can you point out examples
> where it would be hard to avoid using such? I also don't see the
> connection to 3 - is EPT code using PA->VA a lot? p2m-ept.c does
> not look to have a single use of __va() or ..._to_virt().

p2m does not have any __va(), but my performance results showed that
mapping and unmapping EPT when there is no direct map was incredibly
slow, which is why I moved EPT to xenheap in my local branch, and that
path does use __va().

> See my proposal to defer unmapping of the domain's own pages
> (and I would consider the p2m pages to be part of the domain's
> ones for this purpose). In fact, since the p2m pages come from a
> fixed, separate pool I wonder whether the entire pool couldn't
> be mapped in e.g. the per-domain VA range.

I thought about that as well, not just EPT but a lot of domain-private
pages can be moved to the per-domain range, and the secrets are hidden
by virtue of cr3 switches when switching to other domains. But still we
have the problem of quickly finding PA->VA (I don't mean __va(), I mean
finding the VA that can access a page table page) for EPT walks.

Mapping in bigger pages should work wonders for pre-partitioned guests
where we know the guest mostly just has contiguous physical memory and
a superpage map probably covers all pages in an HVM 2-level walk. But
for a generic solution where domain memory can be really fragmented
(and context switches can happen a lot on a pCPU), how can we quickly
find PA->VA in EPT walking without some intrusive changes to Xen? Of
course, if we do not allow the HAP pool to change and force the HAP
pool to be physically contiguous, we can just remember the base VA of
its vmapped region for quick PA->VA, but I don't think this is a
generic solution.

Am I missing anything?

Hongyan
Jan Beulich April 28, 2021, 1:55 p.m. UTC | #6
On 28.04.2021 15:22, Hongyan Xia wrote:
> On Wed, 2021-04-28 at 13:51 +0200, Jan Beulich wrote:
>> See my proposal to defer unmapping of the domain's own pages
>> (and I would consider the p2m pages to be part of the domain's
>> ones for this purpose). In fact, since the p2m pages come from a
>> fixed, separate pool I wonder whether the entire pool couldn't
>> be mapped in e.g. the per-domain VA range.
> 
> I thought about that as well, not just EPT but a lot of domain-private
> pages can be moved to the per-domain range, and the secrets are hidden
> by virtue of cr3 switches when switching to other domains. But still we
> have the problem of quickly finding PA->VA (I don't mean __va(), I mean
> finding the VA that can access a page table page) for EPT walks.
> 
> Mapping in bigger pages should work wonders for pre-partitioned guests
> where we know the guest mostly just has contiguous physical memory and
> a superpage map probably covers all pages in an HVM 2-level walk. But
> for a generic solution where domain memory can be really fragmented
> (and context switches can happen a lot on a pCPU), how can we quickly
> find PA->VA in EPT walking without some intrusive changes to Xen? Of
> course, if we do not allow the HAP pool to change and force the HAP
> pool to be physically contiguous, we can just remember the base VA of
> its vmapped region for quick PA->VA, but I don't think this is a
> generic solution.

We don't need to make the p2m pool static, but I think we can build on
it changing rarely, if ever. Hence it changing may be acceptable to be
moderately expensive.

If we made the pool physically contiguous, translation - as you say -
would be easy. But even if we can't find enough physical memory for it
to be contiguous, we can still help ourselves. The intermediate case is
when we can still make it consist of all 2M pages. There translation
may be fast enough even via a brute force array lookup. If we need to
resort to 4k pages, why not maintain the PA->VA association in a radix
tree?
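
For the 4k case the existing radix tree code (xen/radix-tree.h) could
carry that association, keyed by MFN. A sketch, with illustrative names
(locking, failure handling and per-domain placement omitted):

#include <xen/mm.h>
#include <xen/radix-tree.h>

/* Sketch: MFN -> VA lookup for vmap()-ed p2m pool pages. */
static struct radix_tree_root p2m_pool_va; /* radix_tree_init() at pool setup */

static int p2m_pool_record_va(mfn_t mfn, void *va)
{
    /* Key is the raw MFN; the stored item is the mapping's VA. */
    return radix_tree_insert(&p2m_pool_va, mfn_x(mfn), va);
}

static void *p2m_pool_mfn_to_va(mfn_t mfn)
{
    return radix_tree_lookup(&p2m_pool_va, mfn_x(mfn));
}

static void p2m_pool_forget_va(mfn_t mfn)
{
    radix_tree_delete(&p2m_pool_va, mfn_x(mfn));
}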

Independent of that I think there are some cycles to be gained by,
already today, not having to map and unmap the root page table for
every access (get, set, etc).

Jan

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 10b7aeca48..1285fc5977 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -2143,6 +2143,7 @@  void init_xenheap_pages(paddr_t ps, paddr_t pe)
 void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
 {
     struct page_info *pg;
+    void *ret;
 
     ASSERT(!in_irq());
 
@@ -2151,14 +2152,27 @@  void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
     if ( unlikely(pg == NULL) )
         return NULL;
 
-    memguard_unguard_range(page_to_virt(pg), 1 << (order + PAGE_SHIFT));
+    ret = page_to_virt(pg);
 
-    return page_to_virt(pg);
+    if ( !arch_has_directmap() &&
+         map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
+                          PAGE_HYPERVISOR) )
+    {
+        /* Failed to map xenheap pages. */
+        free_heap_pages(pg, order, false);
+        return NULL;
+    }
+
+    memguard_unguard_range(ret, 1 << (order + PAGE_SHIFT));
+
+    return ret;
 }
 
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
+    unsigned long va = (unsigned long)v & PAGE_MASK;
+
     ASSERT(!in_irq());
 
     if ( v == NULL )
@@ -2166,6 +2180,12 @@  void free_xenheap_pages(void *v, unsigned int order)
 
     memguard_guard_range(v, 1 << (order + PAGE_SHIFT));
 
+    if ( !arch_has_directmap() &&
+         destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
+        dprintk(XENLOG_WARNING,
+                "Error while destroying xenheap mappings at %p, order %u\n",
+                v, order);
+
     free_heap_pages(virt_to_page(v), order, false);
 }
 
@@ -2189,6 +2209,7 @@  void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
 {
     struct page_info *pg;
     unsigned int i;
+    void *ret;
 
     ASSERT(!in_irq());
 
@@ -2201,16 +2222,28 @@  void *alloc_xenheap_pages(unsigned int order, unsigned int memflags)
     if ( unlikely(pg == NULL) )
         return NULL;
 
+    ret = page_to_virt(pg);
+
+    if ( !arch_has_directmap() &&
+         map_pages_to_xen((unsigned long)ret, page_to_mfn(pg), 1UL << order,
+                          PAGE_HYPERVISOR) )
+    {
+        /* Failed to map xenheap pages. */
+        free_domheap_pages(pg, order);
+        return NULL;
+    }
+
     for ( i = 0; i < (1u << order); i++ )
         pg[i].count_info |= PGC_xen_heap;
 
-    return page_to_virt(pg);
+    return ret;
 }
 
 void free_xenheap_pages(void *v, unsigned int order)
 {
     struct page_info *pg;
     unsigned int i;
+    unsigned long va = (unsigned long)v & PAGE_MASK;
 
     ASSERT(!in_irq());
 
@@ -2222,6 +2255,12 @@  void free_xenheap_pages(void *v, unsigned int order)
     for ( i = 0; i < (1u << order); i++ )
         pg[i].count_info &= ~PGC_xen_heap;
 
+    if ( !arch_has_directmap() &&
+         destroy_xen_mappings(va, va + (1UL << (order + PAGE_SHIFT))) )
+        dprintk(XENLOG_WARNING,
+                "Error while destroying xenheap mappings at %p, order %u\n",
+                v, order);
+
     free_heap_pages(pg, order, true);
 }