x86/vm_event: correctly gather gs_shadow value

Message ID: 20190501042249.1218-1-tamas@tklengyel.com
State: New, archived

Commit Message

Tamas K Lengyel May 1, 2019, 4:22 a.m. UTC
Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Roger Pau Monne <roger.pau@citrix.com>
---
 xen/arch/x86/vm_event.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Razvan Cojocaru May 1, 2019, 7:17 a.m. UTC | #1
On 5/1/19 7:22 AM, Tamas K Lengyel wrote:
> Currently the gs_shadow value is only cached when the vCPU is being scheduled
> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> since it doesn't represent the actual state of the vCPU at the time the event
> was recorded. This prevents vm_event subscribers from correctly finding kernel
> structures in the guest when it is trapped while in ring3.

Isn't the VCPU always scheduled out (since it is paused) with sync
vm_events? Is this an async fix only?

In any case,

Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>


Thanks,
Razvan
Andrew Cooper May 1, 2019, 7:38 a.m. UTC | #2
On 01/05/2019 08:17, Razvan Cojocaru wrote:
> On 5/1/19 7:22 AM, Tamas K Lengyel wrote:
>> Currently the gs_shadow value is only cached when the vCPU is being scheduled
>> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
>> since it doesn't represent the actual state of the vCPU at the time the event
>> was recorded. This prevents vm_event subscribers from correctly finding kernel
>> structures in the guest when it is trapped while in ring3.
> Isn't the VCPU always scheduled out (since it is paused) with sync
> vm_events? Is this an async fix only?

Paused isn't always scheduled out.  It can be half-idle with state still
in hardware.

>
> In any case,
>
> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper May 1, 2019, 7:50 a.m. UTC | #3
On 01/05/2019 05:22, Tamas K Lengyel wrote:
> Currently the gs_shadow value is only cached when the vCPU is being scheduled
> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> since it doesn't represent the actual state of the vCPU at the time the event
> was recorded. This prevents vm_event subscribers from correctly finding kernel
> structures in the guest when it is trapped while in ring3.
>
> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Roger Pau Monne <roger.pau@citrix.com>
> ---
>  xen/arch/x86/vm_event.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> index 51c3493b1d..4464940da7 100644
> --- a/xen/arch/x86/vm_event.c
> +++ b/xen/arch/x86/vm_event.c

Actually, come to think of it, the same is true for the SYSENTER 
details, which by default are read/write to the guest while it is
scheduled.  As a result, the details reported here will be from the last
vcpu context switch, and possibly stale.

It might be worth introducing a "sync state from hw" hook which collects
all the data we intend to pass to the introspection agent.
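A minimal sketch of the shape I have in mind (the hook name and its caller
are illustrative only, not existing Xen API):

    /* Hypothetical hvm_funcs hook: pull lazily-cached state back from hw. */
    static void vmx_sync_state_from_hw(struct vcpu *v)
    {
        ASSERT(v == current);

        /* While the vCPU runs, shadow_gs lives in MSR_SHADOW_GS_BASE. */
        v->arch.hvm.vmx.shadow_gs = rdgsshadow();
    }

    /* vm_event_fill_regs() would then start with something like: */
    if ( hvm_funcs.sync_state_from_hw )
        hvm_funcs.sync_state_from_hw(current);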

~Andrew
Tamas K Lengyel May 1, 2019, 1:45 p.m. UTC | #4
On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 01/05/2019 05:22, Tamas K Lengyel wrote:
> > Currently the gs_shadow value is only cached when the vCPU is being scheduled
> > out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> > since it doesn't represent the actual state of the vCPU at the time the event
> > was recorded. This prevents vm_event subscribers from correctly finding kernel
> > structures in the guest when it is trapped while in ring3.
> >
> > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> > Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: Wei Liu <wei.liu2@citrix.com>
> > Cc: Roger Pau Monne <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/vm_event.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> > index 51c3493b1d..4464940da7 100644
> > --- a/xen/arch/x86/vm_event.c
> > +++ b/xen/arch/x86/vm_event.c
>
> Actually, come to think of it, the same is true for the SYSENTER
> details, which by default are read/write to the guest while it is
> scheduled.  As a result, the details reported here will be from the last
> vcpu context switch, and possibly stale.

I'll look into it.

> It might be worth introducing a "sync state from hw" hook which collects
> all the data we intend to pass to the introspection agent.

You mean adding another hvm hook?

Thanks,
Tamas
Tamas K Lengyel May 1, 2019, 1:58 p.m. UTC | #5
> > It might be worth introducing a "sync state from hw" hook which collects
> > all the data we intend to pass to the introspection agent.
>
> You mean adding another hvm hook?

Actually, instead of another hook I think what would make sense is to
just update vmx_save_vmcs_ctxt to automatically refresh the cached
register values when it's called with "v == current". Thoughts?
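Something like this (untested sketch; field names approximate):

    static void vmx_save_vmcs_ctxt(struct vcpu *v, struct hvm_hw_cpu *ctxt)
    {
        if ( v == current )
            /* State may still be live in hardware; refresh the cache first. */
            v->arch.hvm.vmx.shadow_gs = rdgsshadow();

        vmx_save_cpu_state(v, ctxt);
        vmx_vmcs_save(v, ctxt);
    }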

Tamas
Razvan Cojocaru May 1, 2019, 2:19 p.m. UTC | #6
On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
>>> It might be worth introducing a "sync state from hw" hook which collects
>>> all the data we intend to pass to the introspection agent.
>>
>> You mean adding another hvm hook?
> 
> Actually, instead of another hook I think what would make sense is to
> just update vmx_save_vmcs_ctxt to automatically refresh the cached
> register values when it's called with "v == current". Thoughts?

That's probably the better way to go about it, since otherwise the
xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
(there are two ways of getting guest state: one is via the vm_event
cached values, the other is via the aforementioned hypercall).


Thanks,
Razvan
Tamas K Lengyel May 1, 2019, 2:53 p.m. UTC | #7
On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
>
> On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
> >>> It might be worth introducing a "sync state from hw" hook which collects
> >>> all the data we intend to pass to the introspection agent.
> >>
> >> You mean adding another hvm hook?
> >
> > Actually, instead of another hook I think what would make sense is to
> > just update vmx_save_vmcs_ctxt to automatically refresh the cached
> > register values when it's called with "v == current". Thoughts?
>
> That's probably the better way to go about it, since otherwise the
> xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
> (there are two ways of getting guest state: one is via the vm_event
> cached values, the other is via the aforementioned hypercall).

True, although issuing the hypercall in the vm_event callback is
actually fine - that's how I found the issue to begin with, since the
vCPU will be scheduled out with the cached registers refreshed and
thus be different from what the vm_event itself had. But other callers
of the hypercall can run into the problem if the guest/vcpu is not
paused.

Tamas
Tamas K Lengyel May 1, 2019, 2:58 p.m. UTC | #8
On Wed, May 1, 2019 at 7:45 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>
> On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >
> > On 01/05/2019 05:22, Tamas K Lengyel wrote:
> > > Currently the gs_shadow value is only cached when the vCPU is being scheduled
> > > out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> > > since it doesn't represent the actual state of the vCPU at the time the event
> > > was recorded. This prevents vm_event subscribers from correctly finding kernel
> > > structures in the guest when it is trapped while in ring3.
> > >
> > > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> > > Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> > > Cc: Jan Beulich <jbeulich@suse.com>
> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > Cc: Wei Liu <wei.liu2@citrix.com>
> > > Cc: Roger Pau Monne <roger.pau@citrix.com>
> > > ---
> > >  xen/arch/x86/vm_event.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> > > index 51c3493b1d..4464940da7 100644
> > > --- a/xen/arch/x86/vm_event.c
> > > +++ b/xen/arch/x86/vm_event.c
> >
> > Actually, come to think of it, the same is true for the SYSENTER
> > details, which by default are read/write to the guest while it is
> > scheduled.  As a result, the details reported here will be from the last
> > vcpu context switch, and possibly stale.
>
> I'll look into it.

The sysenter values look fine to me because vmx_save_vmcs_ctxt calls
vmx_vmcs_save, which refreshes the values from the actual VMCS. It's
only shadow_gs that is not refreshed.
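For reference, the relevant part of vmx_vmcs_save reads straight from the
VMCS (abridged sketch, not the full function):

    static void vmx_vmcs_save(struct vcpu *v, struct hvm_hw_cpu *c)
    {
        vmx_vmcs_enter(v);

        /* SYSENTER state comes from the live VMCS, so it cannot be stale. */
        c->sysenter_cs = __vmread(GUEST_SYSENTER_CS);
        c->sysenter_esp = __vmread(GUEST_SYSENTER_ESP);
        c->sysenter_eip = __vmread(GUEST_SYSENTER_EIP);

        /* ... */

        vmx_vmcs_exit(v);
    }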

Tamas
Tamas K Lengyel May 1, 2019, 3:01 p.m. UTC | #9
On Wed, May 1, 2019 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>
> On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
> <rcojocaru@bitdefender.com> wrote:
> >
> > On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
> > >>> It might be worth introducing a "sync state from hw" hook which collects
> > >>> all the data we intend to pass to the introspection agent.
> > >>
> > >> You mean adding another hvm hook?
> > >
> > > Actually, instead of another hook I think what would make sense is to
> > > just update vmx_save_vmcs_ctxt to automatically refresh the cached
> > > register values when it's called with "v == current". Thoughts?
> >
> > That's probably the better way to go about it, since otherwise the
> > xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
> > (there are two ways of getting guest state: one is via the vm_event
> > cached values, the other is via the aforementioned hypercall).
>
> True, although issuing the hypercall in the vm_event callback is
> actually fine - that's how I found the issue to begin with, since the
> vCPU will be scheduled out with the cached registers refreshed and
> > thus be different from what the vm_event itself had. But other callers
> of the hypercall can run into the problem if the guest/vcpu is not
> paused.

Actually, doing the "v == current" check wouldn't really do anything
for the hypercall since it's not the domain issuing the hypercall on
itself. The only way to ensure that the hypercall is returning correct
values would be to pause the vCPU.

Tamas
Razvan Cojocaru May 1, 2019, 3:43 p.m. UTC | #10
On 5/1/19 6:01 PM, Tamas K Lengyel wrote:
> On Wed, May 1, 2019 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>>
>> On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
>> <rcojocaru@bitdefender.com> wrote:
>>>
>>> On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
>>>>>> It might be worth introducing a "sync state from hw" hook which collects
>>>>>> all the data we intend to pass to the introspection agent.
>>>>>
>>>>> You mean adding another hvm hook?
>>>>
>>>> Actually, instead of another hook I think what would make sense is to
>>>> just update vmx_save_vmcs_ctxt to automatically refresh the cached
>>>> register values when it's called with "v == current". Thoughts?
>>>
>>> That's probably the better way to go about it, since otherwise the
>>> xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
>>> (there are two ways of getting guest state: one is via the vm_event
>>> cached values, the other is via the aforementioned hypercall).
>>
>> True, although issuing the hypercall in the vm_event callback is
>> actually fine - that's how I found the issue to begin with, since the
>> vCPU will be scheduled out with the cached registers refreshed and
>> thus be different from what the vm_event itself had. But other callers
>> of the hypercall can run into the problem if the guest/vcpu is not
>> paused.
> 
> Actually, doing the "v == current" check wouldn't really do anything
> for the hypercall since it's not the domain issuing the hypercall on
> itself. The only way to ensure that the hypercall is returning correct
> values would be to pause the vCPU.

I've discussed this with Andrew in the meantime and he's kindly pointed
out that for the hypercall we pause from remote context, which forces a
de-schedule. So the hypercall _should_ be fine. But we write data to the
vm_event ring from the current context, where state might actually be in
hardware.

He'll probably chime in with additional suggestions / comments.


Thanks,
Razvan
Tamas K Lengyel May 1, 2019, 3:52 p.m. UTC | #11
On Wed, May 1, 2019 at 9:44 AM Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
>
> On 5/1/19 6:01 PM, Tamas K Lengyel wrote:
> > On Wed, May 1, 2019 at 8:53 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
> >>
> >> On Wed, May 1, 2019 at 8:20 AM Razvan Cojocaru
> >> <rcojocaru@bitdefender.com> wrote:
> >>>
> >>> On 5/1/19 4:58 PM, Tamas K Lengyel wrote:
> >>>>>> It might be worth introducing a "sync state from hw" hook which collects
> >>>>>> all the data we intend to pass to the introspection agent.
> >>>>>
> >>>>> You mean adding another hvm hook?
> >>>>
> >>>> Actually, instead of another hook I think what would make sense is to
> >>>> just update vmx_save_vmcs_ctxt to automatically refresh the cached
> >>>> register values when it's called with "v == current". Thoughts?
> >>>
> >>> That's probably the better way to go about it, since otherwise the
> >>> xc_hvm_getcontext_partial() hypercall will suffer from the same problem.
> >>> (there are two ways of getting guest state: one is via the vm_event
> >>> cached values, the other is via the aforementioned hypercall).
> >>
> >> True, although issuing the hypercall in the vm_event callback is
> >> actually fine - that's how I found the issue to begin with, since the
> >> vCPU will be scheduled out with the cached registers refreshed and
> >> thus be different from what the vm_event itself had. But other callers
> >> of the hypercall can run into the problem if the guest/vcpu is not
> >> paused.
> >
> > Actually, doing the "v == current" check wouldn't really do anything
> > for the hypercall since it's not the domain issuing the hypercall on
> > itself. The only way to ensure that the hypercall is returning correct
> > values would be to pause the vCPU.
>
> I've discussed this with Andrew in the meantime and he's kindly pointed
> out that for the hypercall we pause from remote context, which forces a
> de-schedule. So the hypercall _should_ be fine. But we write data to the
> vm_event ring from the current context, where state might actually be in
> hardware.

Correct, I've gone through the hypercall code and it does pause the
vCPU. So right now I don't think we need anything else than what's
already in this patch.

Tamas
Andrew Cooper May 1, 2019, 5:02 p.m. UTC | #12
On 01/05/2019 15:58, Tamas K Lengyel wrote:
> On Wed, May 1, 2019 at 7:45 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
>> On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> On 01/05/2019 05:22, Tamas K Lengyel wrote:
>>>> Currently the gs_shadow value is only cached when the vCPU is being scheduled
>>>> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
>>>> since it doesn't represent the actual state of the vCPU at the time the event
>>>> was recorded. This prevents vm_event subscribers from correctly finding kernel
>>>> structures in the guest when it is trapped while in ring3.
>>>>
>>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
>>>> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
>>>> Cc: Jan Beulich <jbeulich@suse.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Cc: Wei Liu <wei.liu2@citrix.com>
>>>> Cc: Roger Pau Monne <roger.pau@citrix.com>
>>>> ---
>>>>  xen/arch/x86/vm_event.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
>>>> index 51c3493b1d..4464940da7 100644
>>>> --- a/xen/arch/x86/vm_event.c
>>>> +++ b/xen/arch/x86/vm_event.c
>>> Actually, come to think of it, the same is true for the SYSENTER
>>> details, which by default are read/write to the guest while it is
>>> scheduled.  As a result, the details reported here will be from the last
>>> vcpu context switch, and possibly stale.
>> I'll look into it.
> The sysenter values look fine to me because vmx_save_vmcs_ctxt calls
> vmx_vmcs_save, which refreshes the values from the actual VMCS. It's
> only shadow_gs that is not refreshed.

So, you are correct that we call into vmx_vmcs_save() which causes the
SYSENTER details to be correct.

However, the same path also calls vmx_save_cpu_state() which saves
shadow_gs, and therefore ought to DTRT for the previous use in vm_event.

The problem is that vmx_{save,load}_cpu_state() only function correctly
in remote context, which is why the stale shadow_gs persists.

Contrary to what I said earlier, I now think that a better fix would be
to make the above functions safe to use in current context (at least
for the save side - the restore side can probably just ASSERT() atm, as
it doesn't seem to have an equivalent use).
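Roughly this shape (a sketch of the direction, not a finished patch):

    static void vmx_save_cpu_state(struct vcpu *v, struct hvm_hw_cpu *data)
    {
        /* Read live from hardware when v is running on this pCPU. */
        data->shadow_gs = (v == current) ? rdgsshadow()
                                         : v->arch.hvm.vmx.shadow_gs;
        /* ... other lazily-cached MSRs ... */
    }

    static void vmx_load_cpu_state(struct vcpu *v, struct hvm_hw_cpu *data)
    {
        ASSERT(v != current);  /* no current-context restore user known */
        /* ... */
    }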

That way, the vm_event code doesn't need to contain any
context-sensitive code, which is better overall, IMO.

~Andrew
Tamas K Lengyel May 1, 2019, 5:10 p.m. UTC | #13
On Wed, May 1, 2019 at 11:03 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> On 01/05/2019 15:58, Tamas K Lengyel wrote:
> > On Wed, May 1, 2019 at 7:45 AM Tamas K Lengyel <tamas@tklengyel.com> wrote:
> >> On Wed, May 1, 2019 at 1:50 AM Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>> On 01/05/2019 05:22, Tamas K Lengyel wrote:
> >>>> Currently the gs_shadow value is only cached when the vCPU is being scheduled
> >>>> out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
> >>>> since it doesn't represent the actual state of the vCPU at the time the event
> >>>> was recorded. This prevents vm_event subscribers from correctly finding kernel
> >>>> structures in the guest when it is trapped while in ring3.
> >>>>
> >>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> >>>> Cc: Razvan Cojocaru <rcojocaru@bitdefender.com>
> >>>> Cc: Jan Beulich <jbeulich@suse.com>
> >>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >>>> Cc: Wei Liu <wei.liu2@citrix.com>
> >>>> Cc: Roger Pau Monne <roger.pau@citrix.com>
> >>>> ---
> >>>>  xen/arch/x86/vm_event.c | 2 +-
> >>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
> >>>> index 51c3493b1d..4464940da7 100644
> >>>> --- a/xen/arch/x86/vm_event.c
> >>>> +++ b/xen/arch/x86/vm_event.c
> >>> Actually, come to think of it, the same is true for the SYSENTER
> >>> details, which by default are read/write to the guest while it is
> >>> scheduled.  As a result, the details reported here will be from the last
> >>> vcpu context switch, and possibly stale.
> >> I'll look into it.
> > The sysenter values look fine to me because vmx_save_vmcs_ctxt calls
> > vmx_vmcs_save, which refreshes the values from the actual VMCS. It's
> > only shadow_gs that is not refreshed.
>
> So, you are correct that we call into vmx_vmcs_save() which causes the
> SYSENTER details to be correct.
>
> However, the same path also calls vmx_save_cpu_state() which saves
> shadow_gs, and therefore ought to DTRT for the previous use in vm_event.
>
> The problem is that vmx_{save,load}_cpu_state() only function correctly
> in remote context, which is why the stale shadow_gs persists.
>
> Contrary to what I said earlier, I now think that a better fix would be
> to make the above functions safe to use in current context (at least
> for the save side - the restore side can probably just ASSERT() atm, as
> it doesn't seem to have an equivalent use).
>
> That way, the vm_event code doesn't need to contain any
> context-sensitive code, which is better overall, IMO.

Sounds good to me. So a "v == current" case to refresh the gs_shadow
value is all this will entail, correct?

Thanks,
Tamas

Patch

diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
index 51c3493b1d..4464940da7 100644
--- a/xen/arch/x86/vm_event.c
+++ b/xen/arch/x86/vm_event.c
@@ -239,7 +239,7 @@  void vm_event_fill_regs(vm_event_request_t *req)
     vm_event_pack_segment_register(x86_seg_ds, &req->data.regs.x86);
     vm_event_pack_segment_register(x86_seg_es, &req->data.regs.x86);
 
-    req->data.regs.x86.shadow_gs = ctxt.shadow_gs;
+    req->data.regs.x86.shadow_gs = rdgsshadow();
     req->data.regs.x86.dr6 = ctxt.dr6;
 #endif
 }
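
For reference, rdgsshadow() reads the shadow GS base live from hardware
rather than from the cached context. A sketch of its logic (paraphrased
from Xen's asm-x86/msr.h; details may differ):

    static inline unsigned long rdgsshadow(void)
    {
        unsigned long base;

        if ( read_cr4() & X86_CR4_FSGSBASE )
        {
            /* Swap the shadow base in, read it directly, swap back. */
            asm volatile ( "swapgs" );
            base = __rdgsbase();
            asm volatile ( "swapgs" );
        }
        else
            rdmsrl(MSR_SHADOW_GS_BASE, base);

        return base;
    }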