[v5,10/28] xsplice: Implement payload loading

Message ID 20160404194444.GA4474@char.us.oracle.com (mailing list archive)
State New, archived

Commit Message

Konrad Rzeszutek Wilk April 4, 2016, 7:44 p.m. UTC
On Fri, Apr 01, 2016 at 03:18:45AM -0600, Jan Beulich wrote:
> >>> On 31.03.16 at 23:26, <konrad@darnok.org> wrote:
> >>  Also - how well will this O(n^2) lookup work once there are enough
> >> payloads? I think this calls for the alternative vmap() extension I've
> >> been suggesting earlier.
> > 
> > Could you elaborate on the vmap extension a bit please?
> > 
> > Your earlier email seems to say: drop the vmap API and just 
> > allocate the underlaying pages yourself.
> 
> Actually I had also said in that earlier mail: "If, otoh, you left that
> VA management to (an extended version of) vmap(), by e.g.
> allowing the caller to request allocation from a different VA range
> (much like iirc x86-64 Linux handles its modules address range
> allocation), things would be different. After all the VA
> management is the important part here, while the backing
> memory allocation is just a trivial auxiliary operation."
> 
> I.e. elaboration here really just consists of the referral to the
> respective Linux approach.

I am afraid I am in need of guidance here.

Let me explain (did this in IRC but this will have a broader scope):

In Linux we have the 'struct vm_area', which internally contains the start
and end address (amongst other things). Callers usually use __vmalloc_node_range
to provide those addresses. Internally the vmalloc API allocates the
'struct vm_area' from the normal SLAB allocator. The vmalloc API also has a
vmap block area (allocated within the vmalloc area) which is a red-black tree
of all the users of its API. When vm_size() is called this tree is searched
to find the 'vm_area' for the provided virtual address. There is a lot
of code in this. Copying it wholesale does not look easy, so it would be
better to take the concepts and implement them.


On Xen we set up a bitmap that covers the full vmalloc area (128MB on my
4GB box, but it can go up to 64GB); the 128MB vmalloc area requires about
32K bits.

For every allocation we "waste" a page (and a bit) so that there is a gap.
This gap is needed when trying to determine the size of the allocated
region: when scanning the bitmap we can easily find the cleared
bit, which acts as a fencepost.
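Here is a minimal stand-alone sketch of that fencepost idea (the names
mark_alloc/alloc_size are mine for illustration; the real code uses Xen's
vm_bitmap with find_next_zero_bit over packed bits):

```c
#include <assert.h>
#include <stddef.h>

#define NR_BITS 64
static unsigned char bitmap[NR_BITS]; /* one byte per "bit", for clarity */

/* Mark 'pages' bits starting at 'start'; the bit just past the region is
 * deliberately left clear -- that is the "wasted" guard page. */
static void mark_alloc(size_t start, size_t pages)
{
    for ( size_t i = 0; i < pages; i++ )
        bitmap[start + i] = 1;
}

/* Size of the allocation starting at 'start': scan forward to the first
 * clear bit (the fencepost) and count the set bits in between. */
static size_t alloc_size(size_t start)
{
    size_t end = start;

    while ( end < NR_BITS && bitmap[end] )
        end++;
    return end - start;
}
```
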


To make Xen's vmalloc API generic I need to make it able
to deal with virtual addresses that are not part of its space (i.e.
not within VMAP_VIRT_START to vm_top). To start with, vm_size()
needs to be able to get the size for a virtual address (either the ones from
the vmalloc area or the ones provided earlier by vmalloc_cb).

One easy mechanism is to embed an array of simplified 'struct vm_area' structures:

struct vm_area {
	unsigned long va;
};

for every slot in the VMAP_VIRT_START area (that is, 32K entries).
vm_size() and the rest can then check this array if the virtual
address provided is not within the vmalloc virtual addresses. If there
is a match we just need to consult vm_bitmap at the same index and
figure out where the clear bit is.
The downside is that I have to walk the full array (32K entries).
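The lookup for such an address would then be a plain linear scan, roughly
like this (NR_SLOTS and vm_slot_of are my placeholder names, and the real
array would have 32K entries):

```c
#include <stddef.h>

#define NR_SLOTS 32 /* 32K in the real proposal; kept small here */

struct vm_area {
    unsigned long va;
};

static struct vm_area vm_areas[NR_SLOTS];

/* Linear scan over every slot; only reached for addresses outside the
 * vmalloc range, which the text argues is the rare case. */
static int vm_slot_of(unsigned long va)
{
    for ( size_t i = 0; i < NR_SLOTS; i++ )
        if ( va && vm_areas[i].va == va )
            return (int)i;
    return -1;
}
```
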

But when you think about it, most of the time we use normal vmalloc addresses
and only in exceptional cases do we need the alternate ones. And the only reason
to keep track of them is to know the size.

The easier way would be to track them via a linked list:

struct vm_area {
	struct list_head list;
	unsigned long va;
	size_t nr;
};

And vm_size, vm_index, etc. would consult this list for the virtual address and
could get the proper information. (See inline patch.)

But if we are doing that, then why even put it in the vmalloc API? Why not
track all of this in the user of the API (like it was done in v4 of this patch series)?

Please advise.

From a0e3015bbf4d99fc8e5428634dc3b281cf8eedb7 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 14 Mar 2016 12:02:05 -0400
Subject: [PATCH] vmap: Add vmalloc_cb

For those users who want to supply their own vmap callback.
This callback is called _after_ the pages have been
allocated; instead of using the internal vmap API, we
call the callback, which is responsible for providing the
virtual addresses.

The vmap API also keeps track of this virtual address (along
with the size) so that vunmap, vm_size, and vm_free can operate
on these virtual addresses.

This allows users (such as xSplice) to provide their own
mechanism to change the page flags, and also to use virtual
addresses closer to the hypervisor virtual addresses (at least
on x86).

The users (such as patch titled "xsplice: Implement payload
loading") can wrap the calls to __vmap for this.

Note that the displacement from the hypervisor virtual addresses to the
vmalloc area (on x86) is more than 32 bits, which means that ELF relocations
(which are limited to 32 bits) won't work (we would truncate the 33rd and
34th bits). Hence we cannot use the vmalloc virtual addresses but must
supply our own ranges.
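A quick way to see the truncation: a signed 32-bit relocation (e.g.
R_X86_64_32S) can only encode displacements that survive a round-trip
through int32_t. A sketch of such a range check (fits_32s is my name,
not from this series):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if 'disp' can be encoded in a signed 32-bit relocation field,
 * i.e. truncating to 32 bits and sign-extending loses nothing. */
static bool fits_32s(int64_t disp)
{
    return disp == (int64_t)(int32_t)disp;
}
```
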

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>

v4: New patch.
v5: Update per Jan's comments.
v6: Drop the stray parentheses on typedefs.
    Ditch the vunmap callback. Stash away the virtual addresses in lists.
---
---
 xen/common/vmap.c      | 80 ++++++++++++++++++++++++++++++++++++++++++++++++--
 xen/include/xen/vmap.h |  5 ++++
 2 files changed, 82 insertions(+), 3 deletions(-)

Comments

Jan Beulich April 5, 2016, 7:34 a.m. UTC | #1
>>> On 04.04.16 at 21:44, <konrad.wilk@oracle.com> wrote:
> On Fri, Apr 01, 2016 at 03:18:45AM -0600, Jan Beulich wrote:
>> >>> On 31.03.16 at 23:26, <konrad@darnok.org> wrote:
>> >>  Also - how well will this O(n^2) lookup work once there are enough
>> >> payloads? I think this calls for the alternative vmap() extension I've
>> >> been suggesting earlier.
>> > 
>> > Could you elaborate on the vmap extension a bit please?
>> > 
>> > Your earlier email seems to say: drop the vmap API and just 
>> > allocate the underlaying pages yourself.
>> 
>> Actually I had also said in that earlier mail: "If, otoh, you left that
>> VA management to (an extended version of) vmap(), by e.g.
>> allowing the caller to request allocation from a different VA range
>> (much like iirc x86-64 Linux handles its modules address range
>> allocation), things would be different. After all the VA
>> management is the important part here, while the backing
>> memory allocation is just a trivial auxiliary operation."
>> 
>> I.e. elaboration here really just consists of the referral to the
>> respective Linux approach.
> 
> I am in need here of guidance I am afraid.
> 
> Let me explain (did this in IRC but this will have a broader scope):
> 
> In Linux we have the 'struct vm_area', which internally contains the start
> and end address (amongst other things). Callers usually use
> __vmalloc_node_range to provide those addresses. Internally the vmalloc
> API allocates the 'struct vm_area' from the normal SLAB allocator. The
> vmalloc API also has a vmap block area (allocated within the vmalloc
> area) which is a red-black tree of all the users of its API. When
> vm_size() is called this tree is searched to find the 'vm_area' for the
> provided virtual address. There is a lot of code in this. Copying it
> wholesale does not look easy, so it would be better to take the concepts
> and implement them.
> 
> 
> On Xen we set up a bitmap that covers the full vmalloc area (128MB on my
> 4GB box, but it can go up to 64GB); the 128MB vmalloc area requires about
> 32K bits.
> 
> For every allocation we "waste" a page (and a bit) so that there is a gap.
> This gap is needed when trying to determine the size of the allocated
> region: when scanning the bitmap we can easily find the cleared
> bit, which acts as a fencepost.
> 
> 
> To make Xen's vmalloc API generic I need to make it able
> to deal with virtual addresses that are not part of its space (i.e.
> not within VMAP_VIRT_START to vm_top). To start with, vm_size()
> needs to be able to get the size for a virtual address (either the ones from
> the vmalloc area or the ones provided earlier by vmalloc_cb).
> 
> One easy mechanism is to embed an array of simplified 'struct vm_area'
> structures:
> 
> struct vm_area {
> 	unsigned long va;
> };
> 
> for every slot in the VMAP_VIRT_START area (that is, 32K entries).
> vm_size() and the rest can then check this array if the virtual
> address provided is not within the vmalloc virtual addresses. If there
> is a match we just need to consult vm_bitmap at the same index and
> figure out where the clear bit is.
> The downside is that I have to walk the full array (32K entries).
> 
> But when you think about it, most of the time we use normal vmalloc
> addresses and only in exceptional cases do we need the alternate ones.
> And the only reason to keep track of them is to know the size.
> 
> The easier way would be to track them via a linked list:
> 
> struct vm_area {
> 	struct list_head list;
> 	unsigned long va;
> 	size_t nr;
> };
> 
> And vm_size, vm_index, etc. would consult this list for the virtual address
> and could get the proper information. (See inline patch.)
> 
> But if we are doing that, then why even put it in the vmalloc API? Why not
> track all of this in the user of the API (like it was done in v4 of this
> patch series)?
> 
> Please advise.

Well, I certainly didn't think of it getting done that way. To me the
most natural generalization would be for an arch to register one or
more secondary ranges (which could even get referred to by an
enum) at boot time (or maybe that could even be arranged for at
compile time, i.e. no active registration necessary), with each such
area getting the same data structures set up as is being done right
now for the "base" VA range.
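A rough sketch of what such per-region registration could look like (all
names here are hypothetical, not from the series; each secondary range
would get its own copy of the bitmap bookkeeping the base range has):

```c
enum vm_region { VM_DEFAULT, VM_XEN_TEXT, VM_NR_REGIONS };

struct vm_range {
    unsigned long start, top;   /* VA bounds of this region */
    unsigned long *bitmap;      /* same structures as the base VA range */
};

static struct vm_range vm_ranges[VM_NR_REGIONS];

/* Called once by the arch at boot (or the table could even be filled in
 * statically at compile time, with no active registration needed). */
static void vm_register(enum vm_region r, unsigned long start,
                        unsigned long top)
{
    vm_ranges[r].start = start;
    vm_ranges[r].top = top;
}
```

Allocation and lookup helpers would then take the enum to pick the range,
with VM_DEFAULT behaving exactly as today's single vmalloc area.
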

But if that's too cumbersome for now, I'm certainly fine with
xSplice code dealing with the VA management itself. We can
always add such an extension later (and then make xSplice use it).

Jan
Patch

diff --git a/xen/common/vmap.c b/xen/common/vmap.c
index 134eda0..0289e22 100644
--- a/xen/common/vmap.c
+++ b/xen/common/vmap.c
@@ -19,6 +19,16 @@  static unsigned int __read_mostly vm_end;
 /* lowest known clear bit in the bitmap */
 static unsigned int vm_low;
 
+static LIST_HEAD(vm_area_list);
+
+struct vm_area {
+    struct list_head list;
+    void *va;
+    unsigned int pages;
+};
+
+static DEFINE_SPINLOCK(vm_area_lock);
+
 void __init vm_init(void)
 {
     unsigned int i, nr;
@@ -146,12 +156,34 @@  static unsigned int vm_index(const void *va)
            test_bit(idx, vm_bitmap) ? idx : 0;
 }
 
+static const struct vm_area *vm_find(const void *va)
+{
+    const struct vm_area *found = NULL, *vm;
+
+    spin_lock(&vm_area_lock);
+    list_for_each_entry( vm, &vm_area_list, list )
+    {
+        if ( vm->va != va )
+            continue;
+        found = vm;
+        break;
+    }
+    spin_unlock(&vm_area_lock);
+
+    return found;
+}
+
 static unsigned int vm_size(const void *va)
 {
     unsigned int start = vm_index(va), end;
 
     if ( !start )
+    {
+        const struct vm_area *vm = vm_find(va);
+        if ( vm )
+            return vm->pages;
         return 0;
+    }
 
     end = find_next_zero_bit(vm_bitmap, vm_top, start + 1);
 
@@ -164,6 +196,16 @@  void vm_free(const void *va)
 
     if ( !bit )
     {
+        struct vm_area *vm = (struct vm_area *)vm_find(va);
+
+        if ( vm )
+        {
+            spin_lock(&vm_area_lock);
+            list_del(&vm->list);
+            spin_unlock(&vm_area_lock);
+            xfree(vm);
+            return;
+        }
         WARN_ON(va != NULL);
         return;
     }
@@ -216,12 +258,13 @@  void vunmap(const void *va)
     vm_free(va);
 }
 
-void *vmalloc(size_t size)
+void *vmalloc_cb(size_t size, vmap_cb_t *vmap_cb, mfn_t **mfn_array)
 {
     mfn_t *mfn;
     size_t pages, i;
     struct page_info *pg;
     void *va;
+    struct vm_area *vm = NULL;
 
     ASSERT(size);
 
@@ -230,6 +273,17 @@  void *vmalloc(size_t size)
     if ( mfn == NULL )
         return NULL;
 
+    if ( vmap_cb )
+    {
+        vm = xmalloc(struct vm_area);
+
+        if ( !vm )
+        {
+            xfree(mfn);
+            return NULL;
+        }
+    }
+
     for ( i = 0; i < pages; i++ )
     {
         pg = alloc_domheap_page(NULL, 0);
@@ -238,20 +292,40 @@  void *vmalloc(size_t size)
         mfn[i] = _mfn(page_to_mfn(pg));
     }
 
-    va = vmap(mfn, pages);
+    va = vmap_cb ? vmap_cb(mfn, pages) : vmap(mfn, pages);
     if ( va == NULL )
         goto error;
 
-    xfree(mfn);
+    if ( vm )
+    {
+        vm->va = va;
+        vm->pages = pages;
+
+        spin_lock(&vm_area_lock);
+        list_add(&vm->list, &vm_area_list);
+        spin_unlock(&vm_area_lock);
+    }
+    if ( mfn_array )
+        *mfn_array = mfn;
+    else
+        xfree(mfn);
+
     return va;
 
  error:
     while ( i-- )
         free_domheap_page(mfn_to_page(mfn_x(mfn[i])));
     xfree(mfn);
+    if ( vm )
+        xfree(vm);
     return NULL;
 }
 
+void *vmalloc(size_t size)
+{
+    return vmalloc_cb(size, NULL, NULL);
+}
+
 void *vzalloc(size_t size)
 {
     void *p = vmalloc(size);
diff --git a/xen/include/xen/vmap.h b/xen/include/xen/vmap.h
index 5671ac8..6cd0707 100644
--- a/xen/include/xen/vmap.h
+++ b/xen/include/xen/vmap.h
@@ -12,6 +12,11 @@  void *__vmap(const mfn_t *mfn, unsigned int granularity,
 void *vmap(const mfn_t *mfn, unsigned int nr);
 void vunmap(const void *);
 void *vmalloc(size_t size);
+
+/* Callback for vmalloc_cb to use when vmap-ing. */
+typedef void *vmap_cb_t(const mfn_t *mfn, unsigned int pages);
+void *vmalloc_cb(size_t size, vmap_cb_t *vmap_cb, mfn_t **);
+
 void *vzalloc(size_t size);
 void vfree(void *va);