[v4,for-4.14] x86/monitor: revert default behavior when monitoring register write events

Message ID 20200603125237.100041-1-tamas@tklengyel.com (mailing list archive)
State New, archived
Series [v4,for-4.14] x86/monitor: revert default behavior when monitoring register write events

Commit Message

Tamas K Lengyel June 3, 2020, 12:52 p.m. UTC
For the last couple of years we have received numerous reports from users of
monitor vm_events of spurious guest crashes when using events. In particular,
it has been observed that the problem occurs when vm_events are being disabled.
The nature of the guest crash varied widely and has only occurred occasionally.
This made debugging the issue particularly hard. We had discussions about this
issue even here on the xen-devel mailing list with no luck figuring it out.

The bug has now been identified as a race-condition between register event
handling and disabling the monitor vm_event interface. The default behavior
regarding emulation of register write events is changed so that they get
postponed until the corresponding vm_event handler decides whether to allow such a
write to take place. Unfortunately, this can only be implemented by performing the
deny/allow step when the vCPU gets scheduled.

Due to that postponed emulation of the event, if the user decides to pause the
VM in the vm_event handler and then disables events, the entire emulation step
is skipped the next time the vCPU is resumed. Even if the user doesn't pause
during the vm_event handling but exits immediately and disables vm_event, the
situation becomes racy, as disabling vm_event may succeed before the guest's
vCPUs get scheduled with the pending emulation task. This has been particularly
the case with VMs that have several vCPUs, as after the VM is unpaused it may
actually take a long time before all vCPUs get scheduled.

In this patch we revert the default behavior to always perform emulation of
register write events when the event occurs. Postponing them can still be turned
on as an option. In that case the user of the interface still has to take care
to only disable the interface when it's safe, as that path remains buggy.

Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
reply').

Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
---
 xen/arch/x86/hvm/hvm.c            | 14 ++++++++------
 xen/arch/x86/hvm/monitor.c        | 13 ++++++++-----
 xen/include/asm-x86/domain.h      |  1 +
 xen/include/asm-x86/hvm/monitor.h |  7 +++----
 xen/include/asm-x86/monitor.h     |  4 ++++
 xen/include/public/domctl.h       |  6 ++++++
 6 files changed, 30 insertions(+), 15 deletions(-)
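
The postponed (deny/allow) handling remains available behind the new
XEN_DOMCTL_MONITOR_OP_CONTROL_REGISTERS sub-op added below. As a rough sketch
only, turning it on from a monitoring client could look like the following,
assuming libxc's generic xc_domctl() passthrough is available (a dedicated
wrapper, if one exists, may be named differently):

    #include <string.h>
    #include <xenctrl.h>

    /*
     * Opt back into postponed register writes: the write is then held
     * until the vm_event reply is acted upon when the vCPU resumes.
     * Sketch only -- check the tree in use for an official wrapper.
     */
    static int monitor_enable_control_register_values(xc_interface *xch,
                                                       uint32_t domid)
    {
        struct xen_domctl domctl;

        memset(&domctl, 0, sizeof(domctl));
        domctl.cmd = XEN_DOMCTL_monitor_op;
        domctl.domain = domid;
        domctl.u.monitor_op.op = XEN_DOMCTL_MONITOR_OP_CONTROL_REGISTERS;

        return xc_domctl(xch, &domctl);
    }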

Comments

Roger Pau Monné June 3, 2020, 3:07 p.m. UTC | #1
On Wed, Jun 03, 2020 at 06:52:37AM -0600, Tamas K Lengyel wrote:
> For the last couple years we have received numerous reports from users of
> monitor vm_events of spurious guest crashes when using events. In particular,
> it has observed that the problem occurs when vm_events are being disabled. The
> nature of the guest crash varied widely and has only occured occasionally. This
> made debugging the issue particularly hard. We had discussions about this issue
> even here on the xen-devel mailinglist with no luck figuring it out.
> 
> The bug has now been identified as a race-condition between register event
> handling and disabling the monitor vm_event interface. The default behavior
> regarding emulation of register write events is changed so that they get
> postponed until the corresponding vm_event handler decides whether to allow such
> write to take place. Unfortunately this can only be implemented by performing the
> deny/allow step when the vCPU gets scheduled.
> 
> Due to that postponed emulation of the event if the user decides to pause the
> VM in the vm_event handler and then disable events, the entire emulation step
> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
> during the vm_event handling but exits immediately and disables vm_event, the
> situation becomes racey as disabling vm_event may succeed before the guest's
> vCPUs get scheduled with the pending emulation task. This has been particularly
> the case with VMS that have several vCPUs as after the VM is unpaused it may
> actually take a long time before all vCPUs get scheduled.
> 
> In this patch we are reverting the default behavior to always perform emulation
> of register write events when the event occurs. To postpone them can be turned
> on as an option. In that case the user of the interface still has to take care
> of only disabling the interface when its safe as it remains buggy.
> 
> Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
> reply').
> 
> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>

Thanks!

Reviewed-by: Roger Pau Monné <rogerpau@citrix.com>

I would like to get some input from Bitdefender really, and whether
they are fine with this approach.
Jan Beulich June 8, 2020, 3:55 p.m. UTC | #2
On 03.06.2020 17:07, Roger Pau Monné wrote:
> On Wed, Jun 03, 2020 at 06:52:37AM -0600, Tamas K Lengyel wrote:
>> For the last couple years we have received numerous reports from users of
>> monitor vm_events of spurious guest crashes when using events. In particular,
>> it has observed that the problem occurs when vm_events are being disabled. The
>> nature of the guest crash varied widely and has only occured occasionally. This
>> made debugging the issue particularly hard. We had discussions about this issue
>> even here on the xen-devel mailinglist with no luck figuring it out.
>>
>> The bug has now been identified as a race-condition between register event
>> handling and disabling the monitor vm_event interface. The default behavior
>> regarding emulation of register write events is changed so that they get
>> postponed until the corresponding vm_event handler decides whether to allow such
>> write to take place. Unfortunately this can only be implemented by performing the
>> deny/allow step when the vCPU gets scheduled.
>>
>> Due to that postponed emulation of the event if the user decides to pause the
>> VM in the vm_event handler and then disable events, the entire emulation step
>> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
>> during the vm_event handling but exits immediately and disables vm_event, the
>> situation becomes racey as disabling vm_event may succeed before the guest's
>> vCPUs get scheduled with the pending emulation task. This has been particularly
>> the case with VMS that have several vCPUs as after the VM is unpaused it may
>> actually take a long time before all vCPUs get scheduled.
>>
>> In this patch we are reverting the default behavior to always perform emulation
>> of register write events when the event occurs. To postpone them can be turned
>> on as an option. In that case the user of the interface still has to take care
>> of only disabling the interface when its safe as it remains buggy.
>>
>> Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
>> reply').
>>
>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> 
> Thanks!
> 
> Reviewed-by: Roger Pau Monné <rogerpau@citrix.com>
> 
> I would like to get some input from Bitdefender really, and whether
> they are fine with this approach.

Alexandru, Petre,

have the two of you seen this request?

Jan
Razvan Cojocaru June 8, 2020, 6:58 p.m. UTC | #3
On 6/8/20 6:55 PM, Jan Beulich wrote:
> On 03.06.2020 17:07, Roger Pau Monné wrote:
>> On Wed, Jun 03, 2020 at 06:52:37AM -0600, Tamas K Lengyel wrote:
>>> For the last couple years we have received numerous reports from users of
>>> monitor vm_events of spurious guest crashes when using events. In particular,
>>> it has observed that the problem occurs when vm_events are being disabled. The
>>> nature of the guest crash varied widely and has only occured occasionally. This
>>> made debugging the issue particularly hard. We had discussions about this issue
>>> even here on the xen-devel mailinglist with no luck figuring it out.
>>>
>>> The bug has now been identified as a race-condition between register event
>>> handling and disabling the monitor vm_event interface. The default behavior
>>> regarding emulation of register write events is changed so that they get
>>> postponed until the corresponding vm_event handler decides whether to allow such
>>> write to take place. Unfortunately this can only be implemented by performing the
>>> deny/allow step when the vCPU gets scheduled.
>>>
>>> Due to that postponed emulation of the event if the user decides to pause the
>>> VM in the vm_event handler and then disable events, the entire emulation step
>>> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
>>> during the vm_event handling but exits immediately and disables vm_event, the
>>> situation becomes racey as disabling vm_event may succeed before the guest's
>>> vCPUs get scheduled with the pending emulation task. This has been particularly
>>> the case with VMS that have several vCPUs as after the VM is unpaused it may
>>> actually take a long time before all vCPUs get scheduled.
>>>
>>> In this patch we are reverting the default behavior to always perform emulation
>>> of register write events when the event occurs. To postpone them can be turned
>>> on as an option. In that case the user of the interface still has to take care
>>> of only disabling the interface when its safe as it remains buggy.
>>>
>>> Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
>>> reply').
>>>
>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
>>
>> Thanks!
>>
>> Reviewed-by: Roger Pau Monné <rogerpau@citrix.com>
>>
>> I would like to get some input from Bitdefender really, and whether
>> they are fine with this approach.

Hello,

Not really my call to make anymore, but I do have a few notes.

First, IIRC the problem stems from the initial choice to have the
vm_event data allocated on-demand when first subscribing to events. The
proper solution (since this patch doesn't actually fix the problem),
IMHO, would be for the vm_event data to _always_ exist, and instead of
relying on the value of its pointer to check if there are event
subscribers, we could just check the emulation flags individually and
never miss a pending emulated something again. I did try to go that way
in the beginning, but it has reasonably been objected that we should cut
back on using hypervisor memory unnecessarily, hence we got to this point.
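
Roughly, the idea is something like the sketch below (not a tested patch: the
v->arch placement of write_data is hypothetical, while the flag and field names
follow the current monitor_write_data layout):

    /*
     * Sketch only: with write_data embedded in the vCPU (hypothetical
     * v->arch.monitor_write_data below) instead of hanging off the
     * on-demand vm_event allocation, the resume path could always flush
     * pending register writes, whether or not the ring is still attached.
     */
    static void flush_pending_register_writes(struct vcpu *v)
    {
        struct monitor_write_data *w = &v->arch.monitor_write_data;

        if ( w->do_write.msr )
        {
            hvm_msr_write_intercept(w->msr, w->value, false);
            w->do_write.msr = 0;
        }

        /* ... and likewise for any pending CR0/CR3/CR4 values ... */
    }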

Secondly, I see no reason why we couldn't adapt to the new default
behaviour provided that the old behaviour continues to work _exactly_ as
before.

And last but not least, the proper sequence is: 1. unsubscribe from
register write events, 2. process all events "still in the chamber"
(keep checking the ring buffer for a while), 3. detach from the guest
(disable the vm_event subsystem). Not ideal perhaps (in that it's not
guaranteed that a VCPU won't resume after a longer period than our
timeout), but if the sequence is followed there should be no guest hangs
or crashes (at least none that we or our clients have observed so far).
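
In pseudo-code, that teardown sequence looks roughly like this (the two helpers
are placeholders for our own ring handling, not libxc functions; only
xc_monitor_disable() is meant to be a real call, and even that should be
checked against the tree in use):

    static void detach_from_guest(xc_interface *xch, uint32_t domid)
    {
        /* 1. Stop generating new register write events. */
        unsubscribe_register_write_events(xch, domid);      /* placeholder */

        /* 2. Keep draining the ring for a while, replying to each request. */
        for ( unsigned int pass = 0; pass < 16 /* arbitrary */; pass++ )
            process_pending_vm_events(xch, domid);          /* placeholder */

        /* 3. Only then detach the vm_event/monitor subsystem. */
        xc_monitor_disable(xch, domid);
    }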

So in short, I think there's a better fix for this by simply not
allocating the vm_event memory on-demand anymore and never having to
deal with lost pending emulations again. It should also decrease code
complexity by a tiny bit. Then again, as stated at the beginning of this
message, that's just a recommendation.


HTH,
Razvan
Tamas K Lengyel June 8, 2020, 7:54 p.m. UTC | #4
On Mon, Jun 8, 2020 at 12:58 PM Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
>
> On 6/8/20 6:55 PM, Jan Beulich wrote:
> > On 03.06.2020 17:07, Roger Pau Monné wrote:
> >> On Wed, Jun 03, 2020 at 06:52:37AM -0600, Tamas K Lengyel wrote:
> >>> For the last couple years we have received numerous reports from users of
> >>> monitor vm_events of spurious guest crashes when using events. In particular,
> >>> it has observed that the problem occurs when vm_events are being disabled. The
> >>> nature of the guest crash varied widely and has only occured occasionally. This
> >>> made debugging the issue particularly hard. We had discussions about this issue
> >>> even here on the xen-devel mailinglist with no luck figuring it out.
> >>>
> >>> The bug has now been identified as a race-condition between register event
> >>> handling and disabling the monitor vm_event interface. The default behavior
> >>> regarding emulation of register write events is changed so that they get
> >>> postponed until the corresponding vm_event handler decides whether to allow such
> >>> write to take place. Unfortunately this can only be implemented by performing the
> >>> deny/allow step when the vCPU gets scheduled.
> >>>
> >>> Due to that postponed emulation of the event if the user decides to pause the
> >>> VM in the vm_event handler and then disable events, the entire emulation step
> >>> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
> >>> during the vm_event handling but exits immediately and disables vm_event, the
> >>> situation becomes racey as disabling vm_event may succeed before the guest's
> >>> vCPUs get scheduled with the pending emulation task. This has been particularly
> >>> the case with VMS that have several vCPUs as after the VM is unpaused it may
> >>> actually take a long time before all vCPUs get scheduled.
> >>>
> >>> In this patch we are reverting the default behavior to always perform emulation
> >>> of register write events when the event occurs. To postpone them can be turned
> >>> on as an option. In that case the user of the interface still has to take care
> >>> of only disabling the interface when its safe as it remains buggy.
> >>>
> >>> Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
> >>> reply').
> >>>
> >>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> >>
> >> Thanks!
> >>
> >> Reviewed-by: Roger Pau Monné <rogerpau@citrix.com>
> >>
> >> I would like to get some input from Bitdefender really, and whether
> >> they are fine with this approach.
>
> Hello,
>
> Not really my call to make anymore, but I do have a few notes.
>
> First, IIRC the problem stems from the initial choice to have the
> vm_event data allocated on-demand when first subscribing to events. The
> proper solution (since this patch doesn't actually fix the problem),
> IMHO, would be for the vm_event data to _always_ exist, and instead of
> relying on the value of its pointer to check if there are event
> subscribers, we could just check the emulation flags individually and
> never miss a pending emulated something again. I did try to go that way
> in the beginning, but it has reasonably been objected that we should cut
> back on using hypervisor memory unnecessarily, hence we got at this point.
>
> Secondly, I see no reason why we couldn't adapt to the new default
> behaviour provided that the old behaviour continues to work _exactly_ as
> before.
>
> And last but not least, the proper sequence is: 1. unsubscribe from
> register write events, 2. process all events "still in the chamber"
> (keep checking the ring buffer for a while), 3. detach from the guest
> (disable the vm_event subsystem). Not ideal perhaps (in that it's not
> guaranteed that a VCPU won't resume after a longer period than our
> timeout), but if the sequence is followed there should be no guest hangs
> or crashes (at least none that we or our clients have observed so far).

Incorrect. That's not enough. You also have to wait for all the vCPUs
to get scheduled before disabling vm_event or otherwise the emulation
is skipped entirely. Please read the patch message. If the user
decides to disable vm_event after getting a CR3 event delivered, the
CR3 never gets updated and results in the guest crashing in
unpredictable ways. Same happens with all the other registers.

>
> So in short, I think there's a better fix for this by simply not
> allocating the vm_event memory on-demand anymore and never having to
> deal with lost pending emulations again. It should also decrease code
> complexity by a tiny bit. Then again, as stated at the beginning of this
> message, that's just a recommendation.

Since only you guys use this feature I'm going to wait for a fix.
Until then, the default behavior should be restored so this buggy
behavior doesn't affect anyone else. You can still turn it on, it's
just not going to be on for vm_event by default. I don't particularly
care what fix is there since only you guys use it. If you don't mind
that there is this buggy behavior because you never disable vm_event
once you activate it then that's that.

Cheers,
Tamas
Razvan Cojocaru June 8, 2020, 8:14 p.m. UTC | #5
On 6/8/20 10:54 PM, Tamas K Lengyel wrote:
> On Mon, Jun 8, 2020 at 12:58 PM Razvan Cojocaru
>> And last but not least, the proper sequence is: 1. unsubscribe from
>> register write events, 2. process all events "still in the chamber"
>> (keep checking the ring buffer for a while), 3. detach from the guest
>> (disable the vm_event subsystem). Not ideal perhaps (in that it's not
>> guaranteed that a VCPU won't resume after a longer period than our
>> timeout), but if the sequence is followed there should be no guest hangs
>> or crashes (at least none that we or our clients have observed so far).
> 
> Incorrect. That's not enough. You also have to wait for all the vCPUs
> to get scheduled before disabling vm_event or otherwise the emulation
> is skipped entirely. Please read the patch message. If the user
> decides to disable vm_event after getting a CR3 event delivered, the
> CR3 never gets updated and results in the guest crashing in
> unpredictable ways. Same happens with all the other registers.

I did read the patch message. As I've previously stated ("it's not
guaranteed that a VCPU won't resume after a longer period than our
timeout"), it's not ideal, and I've made no claim that the problem does
not exist or that it shouldn't be fixed - but really if you've got a
VCPU that will wait more than a couple of seconds to get scheduled then
you've got a bigger problem with the VM.

>> So in short, I think there's a better fix for this by simply not 
>> allocating the vm_event memory on-demand anymore and never having to
>> deal with lost pending emulations again. It should also decrease code
>> complexity by a tiny bit. Then again, as stated at the beginning of this
>> message, that's just a recommendation.
> 
> Since only you guys use this feature I'm going to wait for a fix.
> Until then, the default behavior should be restored so this buggy
> behavior doesn't affect anyone else. You can still turn it on, its
> just not going to be on for vm_event by default. I don't particularly
> care what fix is there since only you guys use it. If you don't mind
> that there is this buggy behavior because you never disable vm_events
> once you activate it then that's that.

Indeed, our usual scenario is that vm_event is always on on the machine.
It's only rarely being disabled while the guest is running, and when it
is, it's always after waiting sufficiently long, which is why we've not seen
any unexplainable hung VMs.

Fair enough, as long as the previous behaviour is optionally available I
see no reason why this patch shouldn't make it in - though, of course,
as also previously stated, I'm just trying to shed as much light as
possible on this from what I remember, and what happens next is not my
call. My colleagues should be able to chime in tomorrow.


Cheers,
Razvan
Tamas K Lengyel June 8, 2020, 8:44 p.m. UTC | #6
On Mon, Jun 8, 2020 at 2:14 PM Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
>
> On 6/8/20 10:54 PM, Tamas K Lengyel wrote:
> > On Mon, Jun 8, 2020 at 12:58 PM Razvan Cojocaru
> >> And last but not least, the proper sequence is: 1. unsubscribe from
> >> register write events, 2. process all events "still in the chamber"
> >> (keep checking the ring buffer for a while), 3. detach from the guest
> >> (disable the vm_event subsystem). Not ideal perhaps (in that it's not
> >> guaranteed that a VCPU won't resume after a longer period than our
> >> timeout), but if the sequence is followed there should be no guest hangs
> >> or crashes (at least none that we or our clients have observed so far).
> >
> > Incorrect. That's not enough. You also have to wait for all the vCPUs
> > to get scheduled before disabling vm_event or otherwise the emulation
> > is skipped entirely. Please read the patch message. If the user
> > decides to disable vm_event after getting a CR3 event delivered, the
> > CR3 never gets updated and results in the guest crashing in
> > unpredictable ways. Same happens with all the other registers.
>
> I did read the patch message. As I've previously stated ("it's not
> guaranteed that a VCPU won't resume after a longer period than our
> timeout"), it's not ideal, and I've made no claim that the problem does
> not exist or that it shouldn't be fixed - but really if you've got a
> VCPU that will wait more than a couple of seconds to get scheduled then
> you've got a bigger problem with the VM.

Sorry, missed the part where you say you knew about this issue. We
didn't and it was absolutely not documented anywhere and certainly not
mentioned in the original patch that added the feature. It literally
took us years to figure out what the hell is going on. While as you it
can be a couple seconds before its safe to disable it can be a lot
longer depending on the system state, how many VMs are running and how
many vCPUs are active on the VM. There is absolutely necessary
use-cases where you want to keep the VM paused after a CR3 event is
received and exit vm_event afterwards. This bug totally blocks those
use-cases. Not to mention that it's a total mess having an interface
where the user has to guess when its safe to do something. If this was
pointed out when the patch was made I would have objected to that
being merged.

>
> >> So in short, I think there's a better fix for this by simply not
> >> allocating the vm_event memory on-demand anymore and never having to
> >> deal with lost pending emulations again. It should also decrease code
> >> complexity by a tiny bit. Then again, as stated at the beginning of this
> >> message, that's just a recommendation.
> >
> > Since only you guys use this feature I'm going to wait for a fix.
> > Until then, the default behavior should be restored so this buggy
> > behavior doesn't affect anyone else. You can still turn it on, its
> > just not going to be on for vm_event by default. I don't particularly
> > care what fix is there since only you guys use it. If you don't mind
> > that there is this buggy behavior because you never disable vm_events
> > once you activate it then that's that.
>
> Indeed, our usual scenario is that vm_event is always on on the machine.
> It's only rarely being disabled while the guest is running, and when it
> is, it's always with waiting sufficiently long that we've not seen any
> unexplainable hung VMs.
>
> Fair enough, as long as the previous behaviour is optionally available I
> see no reason why this patch shouldn't make it in - though, of course,
> as also previously stated, I'm just trying to shed as much light as
> possible on this from what I remember, and what happens next is not my
> call. My colleagues should be able to chime in tomorrow.

Looking forward to it.

Thanks,
Tamas
Razvan Cojocaru June 8, 2020, 9:16 p.m. UTC | #7
On 6/8/20 11:44 PM, Tamas K Lengyel wrote:
> On Mon, Jun 8, 2020 at 2:14 PM Razvan Cojocaru
> <rcojocaru@bitdefender.com> wrote:
>>
>> On 6/8/20 10:54 PM, Tamas K Lengyel wrote:
>>> On Mon, Jun 8, 2020 at 12:58 PM Razvan Cojocaru
>>>> And last but not least, the proper sequence is: 1. unsubscribe from
>>>> register write events, 2. process all events "still in the chamber"
>>>> (keep checking the ring buffer for a while), 3. detach from the guest
>>>> (disable the vm_event subsystem). Not ideal perhaps (in that it's not
>>>> guaranteed that a VCPU won't resume after a longer period than our
>>>> timeout), but if the sequence is followed there should be no guest hangs
>>>> or crashes (at least none that we or our clients have observed so far).
>>>
>>> Incorrect. That's not enough. You also have to wait for all the vCPUs
>>> to get scheduled before disabling vm_event or otherwise the emulation
>>> is skipped entirely. Please read the patch message. If the user
>>> decides to disable vm_event after getting a CR3 event delivered, the
>>> CR3 never gets updated and results in the guest crashing in
>>> unpredictable ways. Same happens with all the other registers.
>>
>> I did read the patch message. As I've previously stated ("it's not
>> guaranteed that a VCPU won't resume after a longer period than our
>> timeout"), it's not ideal, and I've made no claim that the problem does
>> not exist or that it shouldn't be fixed - but really if you've got a
>> VCPU that will wait more than a couple of seconds to get scheduled then
>> you've got a bigger problem with the VM.
> 
> Sorry, missed the part where you say you knew about this issue. We
> didn't and it was absolutely not documented anywhere and certainly not
> mentioned in the original patch that added the feature. It literally
> took us years to figure out what the hell is going on. While as you it
> can be a couple seconds before its safe to disable it can be a lot
> longer depending on the system state, how many VMs are running and how
> many vCPUs are active on the VM. There is absolutely necessary
> use-cases where you want to keep the VM paused after a CR3 event is
> received and exit vm_event afterwards. This bug totally blocks those
> use-cases. Not to mention that it's a total mess having an interface
> where the user has to guess when its safe to do something. If this was
> pointed out when the patch was made I would have objected to that
> being merged.

No, we didn't know about the issue. It's apparently not my most eloquent
evening. I was merely saying that I did understand what the issue was
from your description, and offering an explanation on why we've never
seen this in QA or production. Of course the issue would at least have been
mentioned (but ideally would not have existed to begin with) had it been known.

We do take several passes through the ring buffer (and as a side-effect
reduce the probability of a race occurring to almost zero), but we do
that to make sure we've cleaned up all EPT vm_events; the fact that it
has helped with _this_ issue is a coincidence.

IIRC Windows, at least, will be upset if a VCPU is stuck for too long.

As for how the vm_event system behaves:

1. A security application that is unable to _prevent_ things being done
to a system is not doing a very good job, since in that case you can
only collect stats and not veto anything. I would argue that the default
for such a monitoring application should be the current one (all events
should be pre-action).

2. This is further supported by the fact that that's exactly how the EPT
vm_events work: you get a "I want to write to this page" event _before_
the write occurs. If register writes behave differently, you have _two_
different models: one where you get an event post-action, and one where
you get one pre-action.


Hope that's clearer,
Razvan
Tamas K Lengyel June 8, 2020, 10:50 p.m. UTC | #8
On Mon, Jun 8, 2020 at 3:16 PM Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
>
> On 6/8/20 11:44 PM, Tamas K Lengyel wrote:
> > On Mon, Jun 8, 2020 at 2:14 PM Razvan Cojocaru
> > <rcojocaru@bitdefender.com> wrote:
> >>
> >> On 6/8/20 10:54 PM, Tamas K Lengyel wrote:
> >>> On Mon, Jun 8, 2020 at 12:58 PM Razvan Cojocaru
> >>>> And last but not least, the proper sequence is: 1. unsubscribe from
> >>>> register write events, 2. process all events "still in the chamber"
> >>>> (keep checking the ring buffer for a while), 3. detach from the guest
> >>>> (disable the vm_event subsystem). Not ideal perhaps (in that it's not
> >>>> guaranteed that a VCPU won't resume after a longer period than our
> >>>> timeout), but if the sequence is followed there should be no guest hangs
> >>>> or crashes (at least none that we or our clients have observed so far).
> >>>
> >>> Incorrect. That's not enough. You also have to wait for all the vCPUs
> >>> to get scheduled before disabling vm_event or otherwise the emulation
> >>> is skipped entirely. Please read the patch message. If the user
> >>> decides to disable vm_event after getting a CR3 event delivered, the
> >>> CR3 never gets updated and results in the guest crashing in
> >>> unpredictable ways. Same happens with all the other registers.
> >>
> >> I did read the patch message. As I've previously stated ("it's not
> >> guaranteed that a VCPU won't resume after a longer period than our
> >> timeout"), it's not ideal, and I've made no claim that the problem does
> >> not exist or that it shouldn't be fixed - but really if you've got a
> >> VCPU that will wait more than a couple of seconds to get scheduled then
> >> you've got a bigger problem with the VM.
> >
> > Sorry, missed the part where you say you knew about this issue. We
> > didn't and it was absolutely not documented anywhere and certainly not
> > mentioned in the original patch that added the feature. It literally
> > took us years to figure out what the hell is going on. While as you it
> > can be a couple seconds before its safe to disable it can be a lot
> > longer depending on the system state, how many VMs are running and how
> > many vCPUs are active on the VM. There is absolutely necessary
> > use-cases where you want to keep the VM paused after a CR3 event is
> > received and exit vm_event afterwards. This bug totally blocks those
> > use-cases. Not to mention that it's a total mess having an interface
> > where the user has to guess when its safe to do something. If this was
> > pointed out when the patch was made I would have objected to that
> > being merged.
>
> No, we didn't know about the issue. It's apparently not my most eloquent
> evening. I was merely saying that I did understand what the issue was
> from your description, and offering an explanation on why we've never
> seen this in QA or production. Of course the issue would have been
> mentioned at least (but ideally not exist to begin with) had it been known.
>
> We do take several passes through the ring buffer (and as a side-effect
> reduce the probability of a race occuring to almost zero), but we do
> that to make sure we've cleaned up all EPT vm_events; the fact that it
> has helped with _this_ issue is a coincidence.
>
> IIRC Windows, at least, will be upset if a VCPU is stuck for too long.
>
> As for how the vm_event system behaves:
>
> 1. A security application that is unable to _prevent_ things being done
> to a system is not doing a very good job, since in that case you can
> only collect stats and not veto anything. I would argue that the default
> for such a monitoring application should be the current one (all events
> should be pre-action).

Not all security applications require this though. Malware analysis
where stealth is required would absolutely not want this side-effect
to be visible to the guest where malware could use it to determine
that it's being monitored. So I don't buy into this argument.

>
> 2. This is further supported by the fact that that's exactly how the EPT
> vm_events work: you get a "I want to write to this page" event _before_
> the write occurs. If register writes behave differently, you have _two_
> different models: one where you get an event post-action, and one where
> you get one pre-action.

Whether you get an event before or after the effects of the event have
been applied to the system state shouldn't matter as long as you can
revert that action. I wouldn't care either way to be the default. But
having a default that breaks other use-cases is unacceptable.

Tamas
Razvan Cojocaru June 8, 2020, 11:14 p.m. UTC | #9
On 6/9/20 1:50 AM, Tamas K Lengyel wrote:
> On Mon, Jun 8, 2020 at 3:16 PM Razvan Cojocaru
> <rcojocaru@bitdefender.com> wrote:
>> 1. A security application that is unable to _prevent_ things being done
>> to a system is not doing a very good job, since in that case you can
>> only collect stats and not veto anything. I would argue that the default
>> for such a monitoring application should be the current one (all events
>> should be pre-action).
> 
> Not all security applications require this though. Malware analysis
> where stealth is required would absolutely not want this side-effect
> to be visible to the guest where malware could use it to determine
> that it's being monitored. So I don't buy into this argument.

Fair enough, in that case having both models supported should be fine.
I'll leave the rest of that conversation to my colleagues.

>> 2. This is further supported by the fact that that's exactly how the EPT
>> vm_events work: you get a "I want to write to this page" event _before_
>> the write occurs. If register writes behave differently, you have _two_
>> different models: one where you get an event post-action, and one where
>> you get one pre-action.
> 
> Whether you get an event before or after the effects of the event have
> been applied to the system state shouldn't matter as long as you can
> revert that action. I wouldn't care either way to be the default. But
> having a default that breaks other use-cases is unacceptable.

You keep saying that as if I disagree. :) But we've already established
that the potential for a race condition has been found and needs to be
fixed.

My only (minor) objection has been that a patch fixing the current model
would have been preferable to one that switches the default as a
workaround. Still, it's understandable that perhaps there's no time or
motivation for that.


Thanks for the work,
Razvan
Tamas K Lengyel June 8, 2020, 11:41 p.m. UTC | #10
On Mon, Jun 8, 2020 at 5:14 PM Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
>
> On 6/9/20 1:50 AM, Tamas K Lengyel wrote:
> > On Mon, Jun 8, 2020 at 3:16 PM Razvan Cojocaru
> > <rcojocaru@bitdefender.com> wrote:
> >> 1. A security application that is unable to _prevent_ things being done
> >> to a system is not doing a very good job, since in that case you can
> >> only collect stats and not veto anything. I would argue that the default
> >> for such a monitoring application should be the current one (all events
> >> should be pre-action).
> >
> > Not all security applications require this though. Malware analysis
> > where stealth is required would absolutely not want this side-effect
> > to be visible to the guest where malware could use it to determine
> > that it's being monitored. So I don't buy into this argument.
>
> Fair enough, in that case having both models supported should be fine.
> I'll leave the rest of that conversation to my colleagues.
>
> >> 2. This is further supported by the fact that that's exactly how the EPT
> >> vm_events work: you get a "I want to write to this page" event _before_
> >> the write occurs. If register writes behave differently, you have _two_
> >> different models: one where you get an event post-action, and one where
> >> you get one pre-action.
> >
> > Whether you get an event before or after the effects of the event have
> > been applied to the system state shouldn't matter as long as you can
> > revert that action. I wouldn't care either way to be the default. But
> > having a default that breaks other use-cases is unacceptable.
>
> You keep saying that as if I disagree. :) But we've already established
> that the potential for a race condition has been found and needs to be
> fixed.
>
> My only (minor) objection has been that a patch fixing the current model
> would have been preferable to one that switches the default as a
> workaround. Still, it's understandable that perhaps there's no time or
> motivation for that.

I've already sent two other patches that make it more manageable to
disable the interface when this feature is used. Your colleagues are
welcome to pick those up or send other fixes that they prefer. As I
don't use this feature I won't be spending more time fixing it than
what I've already spent on it. At this point, collectively, I have probably
spent weeks just trying to track the issue down, as it was such an
annoying bug to find.

Tamas
Jan Beulich June 9, 2020, 6:28 a.m. UTC | #11
On 08.06.2020 20:58, Razvan Cojocaru wrote:
> On 6/8/20 6:55 PM, Jan Beulich wrote:
>> On 03.06.2020 17:07, Roger Pau Monné wrote:
>>> On Wed, Jun 03, 2020 at 06:52:37AM -0600, Tamas K Lengyel wrote:
>>>> For the last couple years we have received numerous reports from users of
>>>> monitor vm_events of spurious guest crashes when using events. In particular,
>>>> it has observed that the problem occurs when vm_events are being disabled. The
>>>> nature of the guest crash varied widely and has only occured occasionally. This
>>>> made debugging the issue particularly hard. We had discussions about this issue
>>>> even here on the xen-devel mailinglist with no luck figuring it out.
>>>>
>>>> The bug has now been identified as a race-condition between register event
>>>> handling and disabling the monitor vm_event interface. The default behavior
>>>> regarding emulation of register write events is changed so that they get
>>>> postponed until the corresponding vm_event handler decides whether to allow such
>>>> write to take place. Unfortunately this can only be implemented by performing the
>>>> deny/allow step when the vCPU gets scheduled.
>>>>
>>>> Due to that postponed emulation of the event if the user decides to pause the
>>>> VM in the vm_event handler and then disable events, the entire emulation step
>>>> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
>>>> during the vm_event handling but exits immediately and disables vm_event, the
>>>> situation becomes racey as disabling vm_event may succeed before the guest's
>>>> vCPUs get scheduled with the pending emulation task. This has been particularly
>>>> the case with VMS that have several vCPUs as after the VM is unpaused it may
>>>> actually take a long time before all vCPUs get scheduled.
>>>>
>>>> In this patch we are reverting the default behavior to always perform emulation
>>>> of register write events when the event occurs. To postpone them can be turned
>>>> on as an option. In that case the user of the interface still has to take care
>>>> of only disabling the interface when its safe as it remains buggy.
>>>>
>>>> Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
>>>> reply').
>>>>
>>>> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
>>>
>>> Thanks!
>>>
>>> Reviewed-by: Roger Pau Monné <rogerpau@citrix.com>
>>>
>>> I would like to get some input from Bitdefender really, and whether
>>> they are fine with this approach.
> 
> Hello,
> 
> Not really my call to make anymore, but I do have a few notes.

Even more so thanks for replying; I did Cc you merely because of
history and because I thought with Alexandru and Petre not
replying you might either know someone else who should, or nudge
them.

> Secondly, I see no reason why we couldn't adapt to the new default
> behaviour provided that the old behaviour continues to work _exactly_ as
> before.

That's my understanding (and nothing to the contrary looks to
have surfaced from the other communication between you and Tamas),
so with that I guess I'll go and throw in the patch.

Jan
Jan Beulich June 9, 2020, 9:37 a.m. UTC | #12
On 03.06.2020 14:52, Tamas K Lengyel wrote:
> For the last couple years we have received numerous reports from users of
> monitor vm_events of spurious guest crashes when using events. In particular,
> it has observed that the problem occurs when vm_events are being disabled. The
> nature of the guest crash varied widely and has only occured occasionally. This
> made debugging the issue particularly hard. We had discussions about this issue
> even here on the xen-devel mailinglist with no luck figuring it out.
> 
> The bug has now been identified as a race-condition between register event
> handling and disabling the monitor vm_event interface. The default behavior
> regarding emulation of register write events is changed so that they get
> postponed until the corresponding vm_event handler decides whether to allow such
> write to take place. Unfortunately this can only be implemented by performing the
> deny/allow step when the vCPU gets scheduled.
> 
> Due to that postponed emulation of the event if the user decides to pause the
> VM in the vm_event handler and then disable events, the entire emulation step
> is skipped the next time the vCPU is resumed. Even if the user doesn't pause
> during the vm_event handling but exits immediately and disables vm_event, the
> situation becomes racey as disabling vm_event may succeed before the guest's
> vCPUs get scheduled with the pending emulation task. This has been particularly
> the case with VMS that have several vCPUs as after the VM is unpaused it may
> actually take a long time before all vCPUs get scheduled.
> 
> In this patch we are reverting the default behavior to always perform emulation
> of register write events when the event occurs. To postpone them can be turned
> on as an option. In that case the user of the interface still has to take care
> of only disabling the interface when its safe as it remains buggy.
> 
> Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
> reply').
> 
> Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>

Applicable parts
Acked-by: Jan Beulich <jbeulich@suse.com>

Jan
Paul Durrant June 9, 2020, 9:48 a.m. UTC | #13
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 09 June 2020 10:37
> To: Tamas K Lengyel <tamas@tklengyel.com>
> Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>;
> Roger Pau Monné <roger.pau@citrix.com>; George Dunlap <george.dunlap@citrix.com>; Ian Jackson
> <ian.jackson@eu.citrix.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> <sstabellini@kernel.org>; Alexandru Isaila <aisaila@bitdefender.com>; Petre Pircalabu
> <ppircalabu@bitdefender.com>; Paul Durrant <paul@xen.org>
> Subject: Re: [PATCH v4 for-4.14] x86/monitor: revert default behavior when monitoring register write
> events
> 
> On 03.06.2020 14:52, Tamas K Lengyel wrote:
> > For the last couple years we have received numerous reports from users of
> > monitor vm_events of spurious guest crashes when using events. In particular,
> > it has observed that the problem occurs when vm_events are being disabled. The
> > nature of the guest crash varied widely and has only occured occasionally. This
> > made debugging the issue particularly hard. We had discussions about this issue
> > even here on the xen-devel mailinglist with no luck figuring it out.
> >
> > The bug has now been identified as a race-condition between register event
> > handling and disabling the monitor vm_event interface. The default behavior
> > regarding emulation of register write events is changed so that they get
> > postponed until the corresponding vm_event handler decides whether to allow such
> > write to take place. Unfortunately this can only be implemented by performing the
> > deny/allow step when the vCPU gets scheduled.
> >
> > Due to that postponed emulation of the event if the user decides to pause the
> > VM in the vm_event handler and then disable events, the entire emulation step
> > is skipped the next time the vCPU is resumed. Even if the user doesn't pause
> > during the vm_event handling but exits immediately and disables vm_event, the
> > situation becomes racey as disabling vm_event may succeed before the guest's
> > vCPUs get scheduled with the pending emulation task. This has been particularly
> > the case with VMS that have several vCPUs as after the VM is unpaused it may
> > actually take a long time before all vCPUs get scheduled.
> >
> > In this patch we are reverting the default behavior to always perform emulation
> > of register write events when the event occurs. To postpone them can be turned
> > on as an option. In that case the user of the interface still has to take care
> > of only disabling the interface when its safe as it remains buggy.
> >
> > Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
> > reply').
> >
> > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> 
> Applicable parts
> Acked-by: Jan Beulich <jbeulich@suse.com>
> 

Release-acked-by: Paul Durrant <paul@xen.org>
Tamas K Lengyel June 10, 2020, 4:35 p.m. UTC | #14
On Tue, Jun 9, 2020 at 3:48 AM Paul Durrant <xadimgnik@gmail.com> wrote:
>
> > -----Original Message-----
> > From: Jan Beulich <jbeulich@suse.com>
> > Sent: 09 June 2020 10:37
> > To: Tamas K Lengyel <tamas@tklengyel.com>
> > Cc: xen-devel@lists.xenproject.org; Andrew Cooper <andrew.cooper3@citrix.com>; Wei Liu <wl@xen.org>;
> > Roger Pau Monné <roger.pau@citrix.com>; George Dunlap <george.dunlap@citrix.com>; Ian Jackson
> > <ian.jackson@eu.citrix.com>; Julien Grall <julien@xen.org>; Stefano Stabellini
> > <sstabellini@kernel.org>; Alexandru Isaila <aisaila@bitdefender.com>; Petre Pircalabu
> > <ppircalabu@bitdefender.com>; Paul Durrant <paul@xen.org>
> > Subject: Re: [PATCH v4 for-4.14] x86/monitor: revert default behavior when monitoring register write
> > events
> >
> > On 03.06.2020 14:52, Tamas K Lengyel wrote:
> > > For the last couple years we have received numerous reports from users of
> > > monitor vm_events of spurious guest crashes when using events. In particular,
> > > it has observed that the problem occurs when vm_events are being disabled. The
> > > nature of the guest crash varied widely and has only occured occasionally. This
> > > made debugging the issue particularly hard. We had discussions about this issue
> > > even here on the xen-devel mailinglist with no luck figuring it out.
> > >
> > > The bug has now been identified as a race-condition between register event
> > > handling and disabling the monitor vm_event interface. The default behavior
> > > regarding emulation of register write events is changed so that they get
> > > postponed until the corresponding vm_event handler decides whether to allow such
> > > write to take place. Unfortunately this can only be implemented by performing the
> > > deny/allow step when the vCPU gets scheduled.
> > >
> > > Due to that postponed emulation of the event if the user decides to pause the
> > > VM in the vm_event handler and then disable events, the entire emulation step
> > > is skipped the next time the vCPU is resumed. Even if the user doesn't pause
> > > during the vm_event handling but exits immediately and disables vm_event, the
> > > situation becomes racey as disabling vm_event may succeed before the guest's
> > > vCPUs get scheduled with the pending emulation task. This has been particularly
> > > the case with VMS that have several vCPUs as after the VM is unpaused it may
> > > actually take a long time before all vCPUs get scheduled.
> > >
> > > In this patch we are reverting the default behavior to always perform emulation
> > > of register write events when the event occurs. To postpone them can be turned
> > > on as an option. In that case the user of the interface still has to take care
> > > of only disabling the interface when its safe as it remains buggy.
> > >
> > > Fixes: 96760e2fba10 ('vm_event: deny register writes if refused by vm_event
> > > reply').
> > >
> > > Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
> >
> > Applicable parts
> > Acked-by: Jan Beulich <jbeulich@suse.com>
> >
>
> Release-acked-by: Paul Durrant <paul@xen.org>

Thanks guys!

Tamas

Patch

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 74c9f84462..5bb47583b3 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3601,13 +3601,15 @@  int hvm_msr_write_intercept(unsigned int msr, uint64_t msr_content,
 
         ASSERT(v->arch.vm_event);
 
-        /* The actual write will occur in hvm_do_resume() (if permitted). */
-        v->arch.vm_event->write_data.do_write.msr = 1;
-        v->arch.vm_event->write_data.msr = msr;
-        v->arch.vm_event->write_data.value = msr_content;
+        if ( hvm_monitor_msr(msr, msr_content, msr_old_content) )
+        {
+            /* The actual write will occur in hvm_do_resume(), if permitted. */
+            v->arch.vm_event->write_data.do_write.msr = 1;
+            v->arch.vm_event->write_data.msr = msr;
+            v->arch.vm_event->write_data.value = msr_content;
 
-        hvm_monitor_msr(msr, msr_content, msr_old_content);
-        return X86EMUL_OKAY;
+            return X86EMUL_OKAY;
+        }
     }
 
     if ( (ret = guest_wrmsr(v, msr, msr_content)) != X86EMUL_UNHANDLEABLE )
diff --git a/xen/arch/x86/hvm/monitor.c b/xen/arch/x86/hvm/monitor.c
index 8aa14137e2..e4a09964a0 100644
--- a/xen/arch/x86/hvm/monitor.c
+++ b/xen/arch/x86/hvm/monitor.c
@@ -53,11 +53,11 @@  bool hvm_monitor_cr(unsigned int index, unsigned long value, unsigned long old)
             .u.write_ctrlreg.old_value = old
         };
 
-        if ( monitor_traps(curr, sync, &req) >= 0 )
-            return 1;
+        return monitor_traps(curr, sync, &req) >= 0 &&
+               curr->domain->arch.monitor.control_register_values;
     }
 
-    return 0;
+    return false;
 }
 
 bool hvm_monitor_emul_unimplemented(void)
@@ -77,7 +77,7 @@  bool hvm_monitor_emul_unimplemented(void)
         monitor_traps(curr, true, &req) == 1;
 }
 
-void hvm_monitor_msr(unsigned int msr, uint64_t new_value, uint64_t old_value)
+bool hvm_monitor_msr(unsigned int msr, uint64_t new_value, uint64_t old_value)
 {
     struct vcpu *curr = current;
 
@@ -92,8 +92,11 @@  void hvm_monitor_msr(unsigned int msr, uint64_t new_value, uint64_t old_value)
             .u.mov_to_msr.old_value = old_value
         };
 
-        monitor_traps(curr, 1, &req);
+        return monitor_traps(curr, 1, &req) >= 0 &&
+               curr->domain->arch.monitor.control_register_values;
     }
+
+    return false;
 }
 
 void hvm_monitor_descriptor_access(uint64_t exit_info,
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index e8cee3d5e5..6fd94c2e14 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -418,6 +418,7 @@  struct arch_domain
          * This is used to filter out pagefaults.
          */
         unsigned int inguest_pagefault_disabled                            : 1;
+        unsigned int control_register_values                               : 1;
         struct monitor_msr_bitmap *msr_bitmap;
         uint64_t write_ctrlreg_mask[4];
     } monitor;
diff --git a/xen/include/asm-x86/hvm/monitor.h b/xen/include/asm-x86/hvm/monitor.h
index 66de24cb75..a75cd8545c 100644
--- a/xen/include/asm-x86/hvm/monitor.h
+++ b/xen/include/asm-x86/hvm/monitor.h
@@ -29,15 +29,14 @@  enum hvm_monitor_debug_type
 };
 
 /*
- * Called for current VCPU on crX/MSR changes by guest.
- * The event might not fire if the client has subscribed to it in onchangeonly
- * mode, hence the bool return type for control register write events.
+ * Called for current VCPU on crX/MSR changes by guest. Bool return signals
+ * whether emulation should be postponed.
  */
 bool hvm_monitor_cr(unsigned int index, unsigned long value,
                     unsigned long old);
 #define hvm_monitor_crX(cr, new, old) \
                         hvm_monitor_cr(VM_EVENT_X86_##cr, new, old)
-void hvm_monitor_msr(unsigned int msr, uint64_t value, uint64_t old_value);
+bool hvm_monitor_msr(unsigned int msr, uint64_t value, uint64_t old_value);
 void hvm_monitor_descriptor_access(uint64_t exit_info,
                                    uint64_t vmx_exit_qualification,
                                    uint8_t descriptor, bool is_write);
diff --git a/xen/include/asm-x86/monitor.h b/xen/include/asm-x86/monitor.h
index 4afb0665e8..01c6d63bb9 100644
--- a/xen/include/asm-x86/monitor.h
+++ b/xen/include/asm-x86/monitor.h
@@ -59,6 +59,10 @@  int arch_monitor_domctl_op(struct domain *d, struct xen_domctl_monitor_op *mop)
         domain_unpause(d);
         break;
 
+    case XEN_DOMCTL_MONITOR_OP_CONTROL_REGISTERS:
+        d->arch.monitor.control_register_values = true;
+        break;
+
     default:
         rc = -EOPNOTSUPP;
     }
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 1ad34c35eb..59bdc28c89 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -1025,6 +1025,12 @@  struct xen_domctl_psr_cmt_op {
 #define XEN_DOMCTL_MONITOR_OP_DISABLE           1
 #define XEN_DOMCTL_MONITOR_OP_GET_CAPABILITIES  2
 #define XEN_DOMCTL_MONITOR_OP_EMULATE_EACH_REP  3
+/*
+ * Control register feature can result in guest-crashes when the monitor
+ * subsystem is being turned off. User has to take special precautions
+ * to ensure all vCPUs have resumed before it is safe to turn it off.
+ */
+#define XEN_DOMCTL_MONITOR_OP_CONTROL_REGISTERS 4
 
 #define XEN_DOMCTL_MONITOR_EVENT_WRITE_CTRLREG         0
 #define XEN_DOMCTL_MONITOR_EVENT_MOV_TO_MSR            1