
[XEN,v1,1/1] x86/domctl: add gva_to_gfn command

Message ID e047a7a7-2e45-48de-2cb4-69001e95e1a8@list.ru (mailing list archive)
State New, archived
Series [XEN,v1,1/1] x86/domctl: add gva_to_gfn command

Commit Message

Ковалёв Сергей March 20, 2023, 4:32 p.m. UTC
The gva_to_gfn command is used for fast address translation in the LibVMI
project. With such a command it is possible to perform address translation
in a single call instead of a series of queries to get every page table.

Thanks to Dmitry Isaykin for involvement.

Signed-off-by: Sergey Kovalev <valor@list.ru>

---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Wei Liu <wl@xen.org>
Cc: George Dunlap <george.dunlap@citrix.com>
Cc: Julien Grall <julien@xen.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tamas K Lengyel <tamas@tklengyel.com>
Cc: xen-devel@lists.xenproject.org
---

---
  xen/arch/x86/domctl.c       | 17 +++++++++++++++++
  xen/include/public/domctl.h | 13 +++++++++++++
  2 files changed, 30 insertions(+)

Comments

Andrew Cooper March 20, 2023, 7:07 p.m. UTC | #1
On 20/03/2023 4:32 pm, Ковалёв Сергей wrote:
> gva_to_gfn command used for fast address translation in LibVMI project.
> With such a command it is possible to perform address translation in
> single call instead of series of queries to get every page table.
>
> Thanks to Dmitry Isaykin for involvement.
>
> Signed-off-by: Sergey Kovalev <valor@list.ru>

I fully appreciate why you want this hypercall, and I've said several
times that libvmi wants something better than it has, but...

> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> index 2118fcad5d..0c9706ea0a 100644
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -1364,6 +1364,23 @@ long arch_do_domctl(
>              copyback = true;
>          break;
>
> +    case XEN_DOMCTL_gva_to_gfn:
> +    {
> +        uint64_t ga = domctl->u.gva_to_gfn.addr;
> +        uint64_t cr3 = domctl->u.gva_to_gfn.cr3;
> +        struct vcpu* v = d->vcpu[0];

... this isn't safe if you happen to issue this hypercall too early in a
domain's lifecycle.

If nothing else, you want to do a domain_vcpu() check and return -ENOENT
in the failure case.
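A minimal sketch of that check, assuming it sits at the top of the
XEN_DOMCTL_gva_to_gfn case:

    struct vcpu *v = domain_vcpu(d, 0);   /* NULL until vCPUs are allocated */

    if ( !v )
    {
        ret = -ENOENT;
        break;
    }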

More generally, issuing the hypercall under vcpu0 isn't necessarily
correct.  It is common for all vCPUs to have equivalent paging settings,
but e.g. Xen transiently disables CR4.CET and CR0.WP in order to make
self-modifying code changes.

Furthermore, the setting of CR4.{PAE,PSE} determines reserved bits, so
you can't even ignore the access rights and hope that the translation
works out correctly.

Ideally we'd have a pagewalk algorithm which didn't require taking a
vcpu, and instead just took a set of paging configuration, but it is all
chronically entangled right now.

I think, at a minimum, you need to take a vcpu_id as an input, but I
suspect to make this a usable API you want an altp2m view id too.

Also, I'm pretty sure this is only safe for a paused vCPU.  If the vCPU
isn't paused, then there's a TOCTOU race in the pagewalk code when
inspecting control registers.
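A sketch of the pattern, assuming the walk stays bound to a single vCPU:

    vcpu_pause(v);    /* stabilise CR0/CR3/CR4 against concurrent changes */
    gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec, &page_order);
    vcpu_unpause(v);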

> +        uint32_t pfec = PFEC_page_present;
> +        unsigned int page_order;
> +
> +        uint64_t gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec,
> &page_order);
> +        domctl->u.gva_to_gfn.addr = gfn;
> +        domctl->u.gva_to_gfn.page_order = page_order;

page_order is only not stack rubble if gfn is different to INVALID_GFN.
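In other words, something along these lines (assuming the usual
INVALID_GFN convention), so no uninitialised stack contents get copied
back to the caller:

    domctl->u.gva_to_gfn.addr = gfn;
    domctl->u.gva_to_gfn.page_order =
        (gfn != gfn_x(INVALID_GFN)) ? page_order : 0;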

> +        if ( __copy_to_guest(u_domctl, domctl, 1) )
> +            ret = -EFAULT;

You want to restrict this to just the gva_to_gfn sub-portion.  No point
copying back more than necessary.
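For example, a sketch using the field-level copy helper:

    if ( __copy_field_to_guest(u_domctl, domctl, u.gva_to_gfn) )
        ret = -EFAULT;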

~Andrew
Ковалёв Сергей March 20, 2023, 7:22 p.m. UTC | #2
21.03.2023 1:51, Tamas K Lengyel wrote:
> 
> 
> On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
>  >
>  > gva_to_gfn command used for fast address translation in LibVMI project.
>  > With such a command it is possible to perform address translation in
>  > single call instead of series of queries to get every page table.
> 
> You have a couple assumptions here:
>   - Xen will always have a direct map of the entire guest memory - there 
> are already plans to move away from that. Without that this approach 
> won't have any advantage over doing the same mapping by LibVMI

Thanks! I didn't know about the plan. Though I use this patch
backported into 4.16.

>   - LibVMI has to map every page for each page table for every lookup - 
> you have to do that only for the first, afterwards the pages on which 
> the pagetable is are kept in a cache and subsequent lookups would be 
> actually faster then having to do this domctl since you can keep being 
> in the same process instead of having to jump to Xen.

Yes. I know about the page cache. But I have faced several issues
with the cache, like this one: https://github.com/libvmi/libvmi/pull/1058 .
So I had to disable the cache.

> 
> With these perspectives in mind I don't think this would be a useful 
> addition. Please prove me wrong with performance numbers and a specific 
> use-case that warrants adding this and how you plan to introduce it into 
> LibVMI without causing performance regression to all other use-cases.

I will send you a PR to LibVMI in a day or two. I don't have any
performance numbers at this time. I sent this patch to share my current
work as soon as possible.

To prevent regressions in other use-cases we could add a configure option.
Thanks for making me notice that!

> 
> Tamas
Andrew Cooper March 20, 2023, 7:33 p.m. UTC | #3
On 20/03/2023 7:22 pm, Ковалёв Сергей wrote:
>
> 21.03.2023 1:51, Tamas K Lengyel wrote:
>> On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
>>  >
>>  > gva_to_gfn command used for fast address translation in LibVMI
>> project.
>>  > With such a command it is possible to perform address translation in
>>  > single call instead of series of queries to get every page table.
>>
>> You have a couple assumptions here:
>>   - Xen will always have a direct map of the entire guest memory -
>> there are already plans to move away from that. Without that this
>> approach won't have any advantage over doing the same mapping by LibVMI
>
> Thanks! I didn't know about the plan.

To be clear, "not mapping the guest by default" is for speculative
safety/hardening reasons.

Xen will always need to be capable of mapping arbitrary parts of the
guest, even if only transiently, so there's no relevant interaction with
this new proposed hypercall.


The truth is that Xen will always be able to do a single pagewalk faster
than libvmi can do it (via mappings, or otherwise), but if libvmi does
properly maintain a cache of mappings then it will be faster than
repeated hypercalls into Xen.  Where the split lies depends heavily on
the libvmi workload.

I don't see a problem in principle with a hypercall like this - it is
"just" a performance optimisation over capabilities that libvmi already
has - but the version presented here is overly simplistic.

~Andrew
Ковалёв Сергей March 20, 2023, 7:35 p.m. UTC | #4
20.03.2023 22:07, Andrew Cooper wrote:
> On 20/03/2023 4:32 pm, Ковалёв Сергей wrote:
>> gva_to_gfn command used for fast address translation in LibVMI project.
>> With such a command it is possible to perform address translation in
>> single call instead of series of queries to get every page table.
>>
>> Thanks to Dmitry Isaykin for involvement.
>>
>> Signed-off-by: Sergey Kovalev <valor@list.ru>
> 
> I fully appreciate why you want this hypercall, and I've said several
> times that libvmi wants something better than it has, but...
> 
>> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
>> index 2118fcad5d..0c9706ea0a 100644
>> --- a/xen/arch/x86/domctl.c
>> +++ b/xen/arch/x86/domctl.c
>> @@ -1364,6 +1364,23 @@ long arch_do_domctl(
>>               copyback = true;
>>           break;
>>
>> +    case XEN_DOMCTL_gva_to_gfn:
>> +    {
>> +        uint64_t ga = domctl->u.gva_to_gfn.addr;
>> +        uint64_t cr3 = domctl->u.gva_to_gfn.cr3;
>> +        struct vcpu* v = d->vcpu[0];
> 
> ... this isn't safe if you happen to issue this hypercall too early in a
> domain's lifecycle.
> 
> If nothing else, you want to do a domain_vcpu() check and return -ENOENT
> in the failure case.

Thanks!

> 
> More generally, issuing the hypercall under vcpu0 isn't necessarily
> correct.  It is common for all vCPUs to have equivalent paging settings,
> but e.g. Xen transiently disables CR4.CET and CR0.WP in order to make
> self-modifying code changes.
> 
> Furthermore, the setting of CR4.{PAE,PSE} determines reserved bits, so
> you can't even ignore the access rights and hope that the translation
> works out correctly.

Thanks! I didn't think about such things earlier. I should think
about this now more carefully.

> 
> Ideally we'd have a pagewalk algorithm which didn't require taking a
> vcpu, and instead just took a set of paging configuration, but it is all
> chronically entangled right now.
> 

Do you mean to add a new implementation of "paging_ga_to_gfn_cr3"?

> I think, at a minimum, you need to take a vcpu_id as an input, but I
> suspect to make this a usable API you want an altp2m view id too.
> 

Why should we consider altp2m while translating a guest virtual address to
a guest physical one?

> Also, I'm pretty sure this is only safe for a paused vCPU.  If the vCPU
> isn't paused, then there's a TOCTOU race in the pagewalk code when
> inspecting control registers.
> 

Thanks! Should we pause the domain?

>> +        uint32_t pfec = PFEC_page_present;
>> +        unsigned int page_order;
>> +
>> +        uint64_t gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec,
>> &page_order);
>> +        domctl->u.gva_to_gfn.addr = gfn;
>> +        domctl->u.gva_to_gfn.page_order = page_order;
> 
> page_order is only not stack rubble if gfn is different to INVALID_GFN.
> 

Sorry but I don't understand "is only not stack rubble". Do you mean
that I should initialize "page_order" while defining it?

>> +        if ( __copy_to_guest(u_domctl, domctl, 1) )
>> +            ret = -EFAULT;
> 
> You want to restrict this to just the gva_to_gfn sub-portion.  No point
> copying back more than necessary.
> 
> ~Andrew

Thanks a lot!
Andrew Cooper March 20, 2023, 7:46 p.m. UTC | #5
On 20/03/2023 7:35 pm, Ковалёв Сергей wrote:
> 20.03.2023 22:07, Andrew Cooper wrote:
>>
>> More generally, issuing the hypercall under vcpu0 isn't necessarily
>> correct.  It is common for all vCPUs to have equivalent paging settings,
>> but e.g. Xen transiently disables CR4.CET and CR0.WP in order to make
>> self-modifying code changes.
>>
>> Furthermore, the setting of CR4.{PAE,PSE} determines reserved bits, so
>> you can't even ignore the access rights and hope that the translation
>> works out correctly.
>
> Thanks! I didn't think about such things earlier. I should to think
> this know carefully.

If you haven't already, read

https://github.com/xen-project/xen/blob/master/xen/arch/x86/mm/guest_walk.c

and

https://github.com/andyhhp/xtf/blob/pagetable-emulation/tests/pagetable-emulation/main.c

These are various notes and tests I made last time I had to rewrite
Xen's pagewalk from scratch.

>
>>
>> Ideally we'd have a pagewalk algorithm which didn't require taking a
>> vcpu, and instead just took a set of paging configuration, but it is all
>> chronically entangled right now.
>>
>
> Do You mean to add new implementation of "paging_ga_to_gfn_cr3"?

Yes, but I didn't mean for this to be taken as a suggestion.  It's far
more work than it sounds...

>
>> I think, at a minimum, you need to take a vcpu_id as an input, but I
>> suspect to make this a usable API you want an altp2m view id too.
>>
>
> Why we should consider altp2m while translating guest virtual address to
> guest physical one?

Because altp2m can change the gfn mappings, and therefore the contents
of the pagetables.

A pagewalk from cr3 in one view can end up being totally different to a
walk from the same cr3 in a different view.

>
>> Also, I'm pretty sure this is only safe for a paused vCPU.  If the vCPU
>> isn't paused, then there's a TOCTOU race in the pagewalk code when
>> inspecting control registers.
>>
>
> Thanks! Should we pause the domain?

Certainly the vCPU.  Chances are that if you're making this hypercall
from a libvmi callback, the vCPU in question is already paused, at which
point taking one extra pause ref on it is very quick.

>
>>> +        uint32_t pfec = PFEC_page_present;
>>> +        unsigned int page_order;
>>> +
>>> +        uint64_t gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec,
>>> &page_order);
>>> +        domctl->u.gva_to_gfn.addr = gfn;
>>> +        domctl->u.gva_to_gfn.page_order = page_order;
>>
>> page_order is only not stack rubble if gfn is different to INVALID_GFN.
>>
>
> Sorry but I don't understand "is only not stack rubble". Do you mean
> that I should initialize "page_order" while defining it?

page_order is only initialised when gfn returns != INVALID_GFN.

See the function description.
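Roughly paraphrased (not the verbatim header comment), the contract is:

    /*
     * Translate ga using the page tables rooted at cr3.  Returns the gfn,
     * or gfn_x(INVALID_GFN) on failure; *pfec is updated to reflect the
     * walk, and *page_order is written only for successful translations.
     */
    unsigned long paging_ga_to_gfn_cr3(struct vcpu *v, unsigned long cr3,
                                       paddr_t ga, uint32_t *pfec,
                                       unsigned int *page_order);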

~Andrew
Tamas K Lengyel March 20, 2023, 10:51 p.m. UTC | #6
On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
>
> gva_to_gfn command used for fast address translation in LibVMI project.
> With such a command it is possible to perform address translation in
> single call instead of series of queries to get every page table.

You have a couple assumptions here:
 - Xen will always have a direct map of the entire guest memory - there are
already plans to move away from that. Without that this approach won't have
any advantage over doing the same mapping by LibVMI
 - LibVMI has to map every page for each page table for every lookup - you
have to do that only for the first, afterwards the pages on which the
pagetable is are kept in a cache and subsequent lookups would be actually
faster than having to do this domctl since you can keep being in the same
process instead of having to jump to Xen.

With these perspectives in mind I don't think this would be a useful
addition. Please prove me wrong with performance numbers and a specific
use-case that warrants adding this and how you plan to introduce it into
LibVMI without causing performance regression to all other use-cases.

Tamas
Tamas K Lengyel March 20, 2023, 11:34 p.m. UTC | #7
On Mon, Mar 20, 2023 at 3:23 PM Ковалёв Сергей <valor@list.ru> wrote:
>
>
>
> 21.03.2023 1:51, Tamas K Lengyel wrote:
> >
> >
> > On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
> >  >
> >  > gva_to_gfn command used for fast address translation in LibVMI project.
> >  > With such a command it is possible to perform address translation in
> >  > single call instead of series of queries to get every page table.
> >
> > You have a couple assumptions here:
> >   - Xen will always have a direct map of the entire guest memory - there
> > are already plans to move away from that. Without that this approach
> > won't have any advantage over doing the same mapping by LibVMI
>
> Thanks! I didn't know about the plan. Though I use this patch
> back ported into 4.16.
>
> >   - LibVMI has to map every page for each page table for every lookup -
> > you have to do that only for the first, afterwards the pages on which
> > the pagetable is are kept in a cache and subsequent lookups would be
> > actually faster then having to do this domctl since you can keep being
> > in the same process instead of having to jump to Xen.
>
> Yes. I know about the page cache. But I have faced with several issues
> with cache like this one https://github.com/libvmi/libvmi/pull/1058 .
> So I had to disable the cache.

The issue you linked to is an issue with a stale v2p cache, which is a
virtual TLB. The cache I talked about is the page cache, which is just
maintaining a list of the pages that were accessed by LibVMI for future
accesses. You can have one and not the other (ie. ./configure
--disable-address-cache --enable-page-cache).

Tamas
Tamas K Lengyel March 20, 2023, 11:43 p.m. UTC | #8
On Mon, Mar 20, 2023 at 3:34 PM Andrew Cooper <andrew.cooper3@citrix.com>
wrote:
>
> On 20/03/2023 7:22 pm, Ковалёв Сергей wrote:
> >
> > 21.03.2023 1:51, Tamas K Lengyel wrote:
> >> On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
> >>  >
> >>  > gva_to_gfn command used for fast address translation in LibVMI project.
> >>  > With such a command it is possible to perform address translation in
> >>  > single call instead of series of queries to get every page table.
> >>
> >> You have a couple assumptions here:
> >>   - Xen will always have a direct map of the entire guest memory -
> >> there are already plans to move away from that. Without that this
> >> approach won't have any advantage over doing the same mapping by LibVMI
> >
> > Thanks! I didn't know about the plan.
>
> To be clear, "not mapping the guest by default" is for speculative
> safety/hardening reasons.
>
> Xen will always need to be capable of mapping arbitrary parts of the
> guest, even if only transiently, so there's no relevant interaction with
> this new proposed hypercall.
>
>
> The truth is that Xen will always be able to do a single pagewalk faster
> than libvmi can do it (via mappings, or otherwise), but if libvmi does
> properly maintain a cache of mappings then it will be faster that
> repeated hypercalls into Xen.  Where the split lies depends heavily on
> the libvmi workload.
>
> I don't see a problem in principle with a hypercall like this - it is
> "just" a performance optimisation over capabilities that libvmi already
> has - but the version presented here is overly simplistic.

For debugging purposes sure it would be fine to have this hypercall but I
wouldn't set it as the default for LibVMI. Oftentimes the lookup needs to
be more nuanced than what Xen understands about paging. For example, on
Windows guests you can have transition pages that don't have the present
bit set yet are perfectly valid for introspection purposes (
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=3311ed0c63d4ca707c49256655e401f37f25ec50).
Xen would need to be enlightened about these kinds of OS-specific tidbits,
for which I think LibVMI is a much better place to keep the logic.

Tamas
Jan Beulich March 21, 2023, 7:28 a.m. UTC | #9
On 20.03.2023 20:07, Andrew Cooper wrote:
> On 20/03/2023 4:32 pm, Ковалёв Сергей wrote:
>> gva_to_gfn command used for fast address translation in LibVMI project.
>> With such a command it is possible to perform address translation in
>> single call instead of series of queries to get every page table.
>>
>> Thanks to Dmitry Isaykin for involvement.
>>
>> Signed-off-by: Sergey Kovalev <valor@list.ru>
> 
> I fully appreciate why you want this hypercall, and I've said several
> times that libvmi wants something better than it has, but...

But is a domctl really the right vehicle? While not strictly intended for
device models, a dm-op would seem more suitable to me. Considering you
already brought altp2m into play, it could also be a sub-op of HVMOP_altp2m.

Jan
Jan Beulich March 21, 2023, 7:38 a.m. UTC | #10
On 20.03.2023 17:32, Ковалёв Сергей wrote:
> gva_to_gfn command used for fast address translation in LibVMI project.
> With such a command it is possible to perform address translation in
> single call instead of series of queries to get every page table.
> 
> Thanks to Dmitry Isaykin for involvement.
> 
> Signed-off-by: Sergey Kovalev <valor@list.ru>

Along with other comments which were given already, a few more largely
style-related nits. First of all, when sending patches you want to make sure they
make it through intact. You have some line wrapping and white space
corruption in what actually ended up on the list.

> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -1364,6 +1364,23 @@ long arch_do_domctl(
>               copyback = true;
>           break;
> 
> +    case XEN_DOMCTL_gva_to_gfn:
> +    {
> +        uint64_t ga = domctl->u.gva_to_gfn.addr;
> +        uint64_t cr3 = domctl->u.gva_to_gfn.cr3;

Please avoid fixed width types where they're not actually needed. In the
two examples above I guess the 1st wants to be paddr_t while the 2nd
wants to be unsigned long (and pfec below unsigned int).

> +        struct vcpu* v = d->vcpu[0];

* and blank want to switch places.

> +        uint32_t pfec = PFEC_page_present;
> +        unsigned int page_order;
> +
> +        uint64_t gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec, 
> &page_order);
> +        domctl->u.gva_to_gfn.addr = gfn;

Blank line between declaration(s) and statement(s) please (and not in
the middle of declarations).
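Putting those style nits together, the block might read like this (a
sketch only, with types as suggested above):

    paddr_t ga = domctl->u.gva_to_gfn.addr;
    unsigned long cr3 = domctl->u.gva_to_gfn.cr3;
    struct vcpu *v = d->vcpu[0];   /* vcpu choice itself is discussed separately */
    unsigned int pfec = PFEC_page_present, page_order;
    unsigned long gfn;

    gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec, &page_order);
    domctl->u.gva_to_gfn.addr = gfn;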

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -948,6 +948,17 @@ struct xen_domctl_paging_mempool {
>       uint64_aligned_t size; /* Size in bytes. */
>   };
> 
> +/*
> + * XEN_DOMCTL_gva_to_gfn.
> + *
> + * Get the guest virtual to guest physicall address translated.
> + */
> +struct xen_domctl_gva_to_gfn {
> +    uint64_aligned_t addr;
> +    uint64_aligned_t cr3;

This is x86-specific. If it's meant to be an x86-only interface, then
it needs to be marked as such by suitable #ifdef around it (you'll
find other examples higher up in the file). Otherwise, if it's meant
to be generic in principle (which I think would be better), but
implemented only on x86 for the time being, the name wants to change
(e.g. "root_pt").

If it remained "cr3", you'd also need to clarify what the non-address
bits of the field mean: They hardly will have their CR3 meaning.

> +    uint64_aligned_t page_order;

This certainly doesn't need to be 64 bits wide. For the foreseeable
future 8 bits will suffice, with explicit padding added as necessary
(though Andrew's request for further inputs may consume some of that
padding).
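Pulling the interface feedback together, a revised layout might look like
this (illustrative only; field names and widths are not settled):

    struct xen_domctl_gva_to_gfn {
        /* IN: guest virtual address; OUT: translated gfn (or invalid). */
        uint64_aligned_t addr;
        /* IN: root page table address (on x86, the address bits of CR3). */
        uint64_aligned_t root_pt;
        uint32_t vcpu_id;      /* IN */
        uint16_t altp2m_idx;   /* IN */
        uint8_t  page_order;   /* OUT: valid only for a successful translation */
        uint8_t  pad;
    };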

Jan
Ковалёв Сергей March 21, 2023, 7:48 a.m. UTC | #11
21.03.2023 2:34, Tamas K Lengyel wrote:
> 
> 
> On Mon, Mar 20, 2023 at 3:23 PM Ковалёв Сергей <valor@list.ru> wrote:
>  >
>  >
>  >
>  > 21.03.2023 1:51, Tamas K Lengyel wrote:
>  > >
>  > >
>  > > On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
>  > >  >
>  > >  > gva_to_gfn command used for fast address translation in LibVMI project.
>  > >  > With such a command it is possible to perform address translation in
>  > >  > single call instead of series of queries to get every page table.
>  > >
>  > > You have a couple assumptions here:
>  > >   - Xen will always have a direct map of the entire guest memory - there
>  > > are already plans to move away from that. Without that this approach
>  > > won't have any advantage over doing the same mapping by LibVMI
>  >
>  > Thanks! I didn't know about the plan. Though I use this patch
>  > back ported into 4.16.
>  >
>  > >   - LibVMI has to map every page for each page table for every lookup -
>  > > you have to do that only for the first, afterwards the pages on which
>  > > the pagetable is are kept in a cache and subsequent lookups would be
>  > > actually faster then having to do this domctl since you can keep being
>  > > in the same process instead of having to jump to Xen.
>  >
>  > Yes. I know about the page cache. But I have faced with several issues
>  > with cache like this one https://github.com/libvmi/libvmi/pull/1058 .
>  > So I had to disable the cache.
> 
> The issue you linked to is an issue with a stale v2p cache, which is a 
> virtual TLB. The cache I talked about is the page cache, which is just 
> maintaining a list of the pages that were accessed by LibVMI for future 
> accesses. You can have one and not the other (ie. ./configure  
> --disable-address-cache --enable-page-cache).
> 
> Tamas

Thanks. I know about the page cache, though I'm not familiar with
it closely enough.

As far as I understand, the page cache implementation in LibVMI
currently looks like this:
1. Call sequence: vmi_read > vmi_read_page > driver_read_page >
    xen_read_page > memory_cache_insert ..> get_memory_data >
    xen_get_memory > xen_get_memory_pfn > xc_map_foreign_range
2. This is perfectly valid while the guest OS keeps the page there. And
    physical pages are always there.
3. To renew the cache the "age_limit" counter is used.
4. In the Xen driver implementation in LibVMI the "age_limit" is
    disabled.
5. It is also possible to invalidate the cache with "xen_write" or
    "vmi_pagecache_flush". But this is not used.
6. Another way to avoid a too-large cache is the cache size limit: on
    every insert, half of the cache is dropped on size overflow.

So the only thing we need to know is a valid mapping of a guest
virtual address to a guest physical address.

And the slow paths are:
1. The first traversal of a new page table set, e.g. for a new process.
2. Or a new subset of page tables for a known process.
3. A subsequent page access after the cache is cleaned on size overflow.

Am I right?

The main idea behind the patch:
1. The very first translation would be done faster with the hypercall.
2. For subsequent calls a v2p translation cache could be used (as in
    my current work in LibVMI).
3. To avoid errors with a stale cache, the v2p cache could be invalidated
    on every event (VMI_FLUSH_RATE = 1).
Tamas K Lengyel March 21, 2023, 1:07 p.m. UTC | #12
On Tue, Mar 21, 2023 at 3:49 AM Ковалёв Сергей <valor@list.ru> wrote:
>
>
>
> 21.03.2023 2:34, Tamas K Lengyel wrote:
> >
> >
> > On Mon, Mar 20, 2023 at 3:23 PM Ковалёв Сергей <valor@list.ru> wrote:
> >  >
> >  >
> >  >
> >  > 21.03.2023 1:51, Tamas K Lengyel wrote:
> >  > >
> >  > >
> >  > > On Mon, Mar 20, 2023 at 12:32 PM Ковалёв Сергей <valor@list.ru> wrote:
> >  > >  >
> >  > >  > gva_to_gfn command used for fast address translation in LibVMI project.
> >  > >  > With such a command it is possible to perform address translation in
> >  > >  > single call instead of series of queries to get every page table.
> >  > >
> >  > > You have a couple assumptions here:
> >  > >   - Xen will always have a direct map of the entire guest memory - there
> >  > > are already plans to move away from that. Without that this approach
> >  > > won't have any advantage over doing the same mapping by LibVMI
> >  >
> >  > Thanks! I didn't know about the plan. Though I use this patch
> >  > back ported into 4.16.
> >  >
> >  > >   - LibVMI has to map every page for each page table for every lookup -
> >  > > you have to do that only for the first, afterwards the pages on which
> >  > > the pagetable is are kept in a cache and subsequent lookups would be
> >  > > actually faster then having to do this domctl since you can keep being
> >  > > in the same process instead of having to jump to Xen.
> >  >
> >  > Yes. I know about the page cache. But I have faced with several
issues
> >  > with cache like this one https://github.com/libvmi/libvmi/pull/1058 .
> >  > So I had to disable the cache.
> >
> > The issue you linked to is an issue with a stale v2p cache, which is a
> > virtual TLB. The cache I talked about is the page cache, which is just
> > maintaining a list of the pages that were accessed by LibVMI for future
> > accesses. You can have one and not the other (ie. ./configure
> > --disable-address-cache --enable-page-cache).
> >
> > Tamas
>
> Thanks. I know about the page cache. Though I'm not familiar with
> it close enough.
>
> As far as I understand at the time the page cache implementation in
> LibVMI looks like this:
> 1. Call sequence: vmi_read > vmi_read_page > driver_read_page >
>     xen_read_page > memory_cache_insert ..> get_memory_data >
>     xen_get_memory > xen_get_memory_pfn > xc_map_foreign_range
> 2. This is perfectly valid while guest OS keeps page there. And
>     physical pages are always there.
> 3. To renew cache the "age_limit" counter is used.
> 4. In Xen driver implementation in LibVMI the "age_limit" is
>     disabled.
> 5. Also it is possible to invalidate cache with "xen_write" or
>     "vmi_pagecache_flush". But it is not used.
> 6. Other way to avoid too big cache is cache size limit. So on
>     every insert half of the cache is dropped on size overflow.
>
> So the only thing we should know is valid mapping of guest
> virtual address to guest physical address.
>
> And the slow paths are:
> 1. A first traversal of new page table set. E.g. for the new process.
> 2. Or new subset of page tables for known process.
> 3. Subsequent page access after cache clean on size overflow.
>
> Am I right?
>
> The main idea behind the patch:
> 1. For the very first time it would be done faster with hypercall.
> 2. For subsequent calls v2p translation cache could be used (used in
>     my current work in LibVMI).
> 3. To avoid errors with stale cache v2p cache could be invalidated
>     on every event (VMI_FLUSH_RATE = 1).

If you set the flush rate to 1 then you would effectively run without any
caching between events. It will still be costly. Yes, you save some of the
cost of having to map the pages because Xen already has them, but
overall this is still a subpar solution.

IMHO you are not addressing the real issue, which is just the lack of hooks
into the OS that would tell you when the v2p cache actually needs to be
invalidated. The OS does TLB maintenance already when it updates its
pagetables. If you wrote logic to hook into that, you wouldn't have to
disable the caches or run with flush rate = 1. On the DRAKVUF side this has
been a TODO for a long time
https://github.com/tklengyel/drakvuf/blob/df2d274dfe349bbdacdb121229707f6c91449b38/src/libdrakvuf/private.h#L140.
If you had those hooks into the TLB maintenance logic you could just use
the existing page-cache and be done with it. Yes, the very first access may
still be slower than the hypercall but I doubt it would be noticeable in
the long run.

Tamas
Andrew Cooper March 21, 2023, 1:46 p.m. UTC | #13
On 21/03/2023 7:28 am, Jan Beulich wrote:
> On 20.03.2023 20:07, Andrew Cooper wrote:
>> On 20/03/2023 4:32 pm, Ковалёв Сергей wrote:
>>> gva_to_gfn command used for fast address translation in LibVMI project.
>>> With such a command it is possible to perform address translation in
>>> single call instead of series of queries to get every page table.
>>>
>>> Thanks to Dmitry Isaykin for involvement.
>>>
>>> Signed-off-by: Sergey Kovalev <valor@list.ru>
>> I fully appreciate why you want this hypercall, and I've said several
>> times that libvmi wants something better than it has, but...
> But is a domctl really the right vehicle? While not strictly intended for
> device models, a dm-op would seem more suitable to me. Considering you
> already brought altp2m into play, it could also be a sub-op of HVMOP_altp2m.

It definitely feels wrong to be using an altp2m op if you're not using
altp2m, and there are introspection use cases that don't use altp2m.

dm-op would be the place to put this, if I wasn't pretty sure it would
be modified over time.

As I say - I can see why this might be useful, but pagewalking is more
complicated than the interface presented here.

~Andrew
Ковалёв Сергей March 21, 2023, 3:13 p.m. UTC | #14
Thanks to all for the suggestions and notes.

Though, as Andrew Cooper noticed, the current approach is too oversimplified.
As Tamas K Lengyel noticed, the benefit could be negligible and some
OS-specific logic would need to be present.

So for today we could drop the patch.

20.03.2023 19:32, Ковалёв Сергей wrote:
> gva_to_gfn command used for fast address translation in LibVMI project.
> With such a command it is possible to perform address translation in
> single call instead of series of queries to get every page table.
> 
> Thanks to Dmitry Isaykin for involvement.
> 
> Signed-off-by: Sergey Kovalev <valor@list.ru>
> 
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: "Roger Pau Monné" <roger.pau@citrix.com>
> Cc: Wei Liu <wl@xen.org>
> Cc: George Dunlap <george.dunlap@citrix.com>
> Cc: Julien Grall <julien@xen.org>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tamas K Lengyel <tamas@tklengyel.com>
> Cc: xen-devel@lists.xenproject.org
> ---
> 
> ---
>   xen/arch/x86/domctl.c       | 17 +++++++++++++++++
>   xen/include/public/domctl.h | 13 +++++++++++++
>   2 files changed, 30 insertions(+)
> 
> diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
> index 2118fcad5d..0c9706ea0a 100644
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -1364,6 +1364,23 @@ long arch_do_domctl(
>               copyback = true;
>           break;
> 
> +    case XEN_DOMCTL_gva_to_gfn:
> +    {
> +        uint64_t ga = domctl->u.gva_to_gfn.addr;
> +        uint64_t cr3 = domctl->u.gva_to_gfn.cr3;
> +        struct vcpu* v = d->vcpu[0];
> +        uint32_t pfec = PFEC_page_present;
> +        unsigned int page_order;
> +
> +        uint64_t gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec, 
> &page_order);
> +        domctl->u.gva_to_gfn.addr = gfn;
> +        domctl->u.gva_to_gfn.page_order = page_order;
> +        if ( __copy_to_guest(u_domctl, domctl, 1) )
> +            ret = -EFAULT;
> +
> +        break;
> +    }
> +
>       default:
>           ret = -ENOSYS;
>           break;
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index 51be28c3de..628dfc68fd 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -948,6 +948,17 @@ struct xen_domctl_paging_mempool {
>       uint64_aligned_t size; /* Size in bytes. */
>   };
> 
> +/*
> + * XEN_DOMCTL_gva_to_gfn.
> + *
> + * Get the guest virtual to guest physicall address translated.
> + */
> +struct xen_domctl_gva_to_gfn {
> +    uint64_aligned_t addr;
> +    uint64_aligned_t cr3;
> +    uint64_aligned_t page_order;
> +};
> +
>   #if defined(__i386__) || defined(__x86_64__)
>   struct xen_domctl_vcpu_msr {
>       uint32_t         index;
> @@ -1278,6 +1289,7 @@ struct xen_domctl {
>   #define XEN_DOMCTL_vmtrace_op                    84
>   #define XEN_DOMCTL_get_paging_mempool_size       85
>   #define XEN_DOMCTL_set_paging_mempool_size       86
> +#define XEN_DOMCTL_gva_to_gfn                    87
>   #define XEN_DOMCTL_gdbsx_guestmemio            1000
>   #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>   #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
> @@ -1340,6 +1352,7 @@ struct xen_domctl {
>           struct xen_domctl_vuart_op          vuart_op;
>           struct xen_domctl_vmtrace_op        vmtrace_op;
>           struct xen_domctl_paging_mempool    paging_mempool;
> +        struct xen_domctl_gva_to_gfn        gva_to_gfn;
>           uint8_t                             pad[128];
>       } u;
>   };

Patch

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 2118fcad5d..0c9706ea0a 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1364,6 +1364,23 @@  long arch_do_domctl(
              copyback = true;
          break;

+    case XEN_DOMCTL_gva_to_gfn:
+    {
+        uint64_t ga = domctl->u.gva_to_gfn.addr;
+        uint64_t cr3 = domctl->u.gva_to_gfn.cr3;
+        struct vcpu* v = d->vcpu[0];
+        uint32_t pfec = PFEC_page_present;
+        unsigned int page_order;
+
+        uint64_t gfn = paging_ga_to_gfn_cr3(v, cr3, ga, &pfec, 
&page_order);
+        domctl->u.gva_to_gfn.addr = gfn;
+        domctl->u.gva_to_gfn.page_order = page_order;
+        if ( __copy_to_guest(u_domctl, domctl, 1) )
+            ret = -EFAULT;
+
+        break;
+    }
+
      default:
          ret = -ENOSYS;
          break;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 51be28c3de..628dfc68fd 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -948,6 +948,17 @@  struct xen_domctl_paging_mempool {
      uint64_aligned_t size; /* Size in bytes. */
  };

+/*
+ * XEN_DOMCTL_gva_to_gfn.
+ *
+ * Get the guest virtual to guest physical address translated.
+ */
+struct xen_domctl_gva_to_gfn {
+    uint64_aligned_t addr;
+    uint64_aligned_t cr3;
+    uint64_aligned_t page_order;
+};
+
  #if defined(__i386__) || defined(__x86_64__)
  struct xen_domctl_vcpu_msr {
      uint32_t         index;
@@ -1278,6 +1289,7 @@  struct xen_domctl {
  #define XEN_DOMCTL_vmtrace_op                    84
  #define XEN_DOMCTL_get_paging_mempool_size       85
  #define XEN_DOMCTL_set_paging_mempool_size       86
+#define XEN_DOMCTL_gva_to_gfn                    87
  #define XEN_DOMCTL_gdbsx_guestmemio            1000
  #define XEN_DOMCTL_gdbsx_pausevcpu             1001
  #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1340,6 +1352,7 @@  struct xen_domctl {
          struct xen_domctl_vuart_op          vuart_op;
          struct xen_domctl_vmtrace_op        vmtrace_op;
          struct xen_domctl_paging_mempool    paging_mempool;
+        struct xen_domctl_gva_to_gfn        gva_to_gfn;
          uint8_t                             pad[128];
      } u;
  };