
[v3,5/5] evtchn: don't call Xen consumer callback with per-channel lock held

Message ID d821c715-966a-b48b-a877-c5dac36822f0@suse.com (mailing list archive)
State Superseded
Series evtchn: (not so) recent XSAs follow-on

Commit Message

Jan Beulich Nov. 23, 2020, 1:30 p.m. UTC
While there don't look to be any problems with this right now, the lock
order implications from holding the lock can be very difficult to follow
(and may be easy to violate unknowingly). The present callbacks don't
(and no such callback should) have any need for the lock to be held.

However, vm_event_disable() frees the structures used by respective
callbacks and isn't otherwise synchronized with invocations of these
callbacks, so maintain a count of in-progress calls, for evtchn_close()
to wait to drop to zero before freeing the port (and dropping the lock).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Should we make this accounting optional, to be requested through a new
parameter to alloc_unbound_xen_event_channel(), or derived from other
than the default callback being requested?
---
v3: Drain callbacks before proceeding with closing. Re-base.
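
For readers skimming the thread, the scheme the patch introduces boils down to
the stand-alone sketch below. This is illustration only, not the Xen code
itself: struct chan, send_to_xen_consumer() and close_chan() are invented
names, and pthread/stdatomic primitives stand in for the per-channel lock,
cpu_relax() and Xen's atomics.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    struct chan {
        pthread_mutex_t lock;         /* stands in for the per-channel lock */
        atomic_int active_calls;      /* mirrors u.interdomain.active_calls */
        void (*notify)(void *state);  /* stands in for xen_notification_fn() */
        void *state;                  /* what e.g. vm_event_disable() would free */
    };

    /* evtchn_send()-like path; lock held on entry. */
    static void send_to_xen_consumer(struct chan *c)
    {
        atomic_fetch_add(&c->active_calls, 1);
        pthread_mutex_unlock(&c->lock);       /* drop the lock before the callback */
        c->notify(c->state);                  /* callback runs without the lock held */
        atomic_fetch_sub(&c->active_calls, 1);
    }

    /* evtchn_close()-like path. */
    static void close_chan(struct chan *c)
    {
        pthread_mutex_lock(&c->lock);
        while ( atomic_load(&c->active_calls) )
            ;                                 /* cpu_relax() in the real code */
        free(c->state);                       /* safe: no callback still in flight */
        c->state = NULL;
        pthread_mutex_unlock(&c->lock);
    }

Because the counter is only ever incremented while the lock is held, once
close_chan() owns the lock no new callback can start, and the loop merely has
to wait out the calls already in progress.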

Comments

Alexandru Stefan ISAILA Nov. 30, 2020, 10:39 a.m. UTC | #1
On 23.11.2020 15:30, Jan Beulich wrote:
> While there don't look to be any problems with this right now, the lock
> order implications from holding the lock can be very difficult to follow
> (and may be easy to violate unknowingly). The present callbacks don't
> (and no such callback should) have any need for the lock to be held.
> 
> However, vm_event_disable() frees the structures used by respective
> callbacks and isn't otherwise synchronized with invocations of these
> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> to wait to drop to zero before freeing the port (and dropping the lock).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>

> ---
> Should we make this accounting optional, to be requested through a new
> parameter to alloc_unbound_xen_event_channel(), or derived from other
> than the default callback being requested?
> ---
> v3: Drain callbacks before proceeding with closing. Re-base.
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc
>       
>       rchn->u.interdomain.remote_dom  = ld;
>       rchn->u.interdomain.remote_port = lport;
> +    atomic_set(&rchn->u.interdomain.active_calls, 0);
>       rchn->state                     = ECS_INTERDOMAIN;
>   
>       /*
> @@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int
>   
>           double_evtchn_lock(chn1, chn2);
>   
> +        if ( consumer_is_xen(chn1) )
> +            while ( atomic_read(&chn1->u.interdomain.active_calls) )
> +                cpu_relax();
> +
>           evtchn_free(d1, chn1);
>   
>           chn2->state = ECS_UNBOUND;
> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>           rport = lchn->u.interdomain.remote_port;
>           rchn  = evtchn_from_port(rd, rport);
>           if ( consumer_is_xen(rchn) )
> +        {
> +            /* Don't keep holding the lock for the call below. */
> +            atomic_inc(&rchn->u.interdomain.active_calls);
> +            evtchn_read_unlock(lchn);
>               xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
> -        else
> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
> +            atomic_dec(&rchn->u.interdomain.active_calls);
> +            return 0;
> +        }
> +        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>           break;
>       case ECS_IPI:
>           evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -104,6 +104,7 @@ struct evtchn
>           } unbound;     /* state == ECS_UNBOUND */
>           struct {
>               evtchn_port_t  remote_port;
> +            atomic_t       active_calls;
>               struct domain *remote_dom;
>           } interdomain; /* state == ECS_INTERDOMAIN */
>           struct {
> 
>
Julien Grall Dec. 2, 2020, 9:10 p.m. UTC | #2
Hi Jan,

On 23/11/2020 13:30, Jan Beulich wrote:
> While there don't look to be any problems with this right now, the lock
> order implications from holding the lock can be very difficult to follow
> (and may be easy to violate unknowingly). The present callbacks don't
> (and no such callback should) have any need for the lock to be held.
> 
> However, vm_event_disable() frees the structures used by respective
> callbacks and isn't otherwise synchronized with invocations of these
> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> to wait to drop to zero before freeing the port (and dropping the lock).

AFAICT, this callback is not the only place where the synchronization is 
missing in the VM event code.

For instance, vm_event_put_request() can also race against 
vm_event_disable().

So shouldn't we handle this issue properly in VM event?

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Should we make this accounting optional, to be requested through a new
> parameter to alloc_unbound_xen_event_channel(), or derived from other
> than the default callback being requested?

Aside from the VM event, do you see any value for the other caller?

> ---
> v3: Drain callbacks before proceeding with closing. Re-base.
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc
>       
>       rchn->u.interdomain.remote_dom  = ld;
>       rchn->u.interdomain.remote_port = lport;
> +    atomic_set(&rchn->u.interdomain.active_calls, 0);
>       rchn->state                     = ECS_INTERDOMAIN;
>   
>       /*
> @@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int
>   
>           double_evtchn_lock(chn1, chn2);
>   
> +        if ( consumer_is_xen(chn1) )
> +            while ( atomic_read(&chn1->u.interdomain.active_calls) )
> +                cpu_relax();
> +
>           evtchn_free(d1, chn1);
>   
>           chn2->state = ECS_UNBOUND;
> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>           rport = lchn->u.interdomain.remote_port;
>           rchn  = evtchn_from_port(rd, rport);
>           if ( consumer_is_xen(rchn) )
> +        {
> +            /* Don't keep holding the lock for the call below. */
> +            atomic_inc(&rchn->u.interdomain.active_calls);
> +            evtchn_read_unlock(lchn);
>               xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
> -        else
> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);

atomic_dec() doesn't contain any memory barrier, so we will want one 
between xen_notification_fn() and atomic_dec() to avoid re-ordering.

> +            atomic_dec(&rchn->u.interdomain.active_calls);
> +            return 0;
> +        }
> +        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>           break;
>       case ECS_IPI:
>           evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -104,6 +104,7 @@ struct evtchn
>           } unbound;     /* state == ECS_UNBOUND */
>           struct {
>               evtchn_port_t  remote_port;
> +            atomic_t       active_calls;
>               struct domain *remote_dom;
>           } interdomain; /* state == ECS_INTERDOMAIN */
>           struct {
> 

Cheers,
Jan Beulich Dec. 3, 2020, 10:09 a.m. UTC | #3
On 02.12.2020 22:10, Julien Grall wrote:
> On 23/11/2020 13:30, Jan Beulich wrote:
>> While there don't look to be any problems with this right now, the lock
>> order implications from holding the lock can be very difficult to follow
>> (and may be easy to violate unknowingly). The present callbacks don't
>> (and no such callback should) have any need for the lock to be held.
>>
>> However, vm_event_disable() frees the structures used by respective
>> callbacks and isn't otherwise synchronized with invocations of these
>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>> to wait to drop to zero before freeing the port (and dropping the lock).
> 
> AFAICT, this callback is not the only place where the synchronization is 
> missing in the VM event code.
> 
> For instance, vm_event_put_request() can also race against 
> vm_event_disable().
> 
> So shouldn't we handle this issue properly in VM event?

I suppose that's a question to the VM event folks rather than me?

>> ---
>> Should we make this accounting optional, to be requested through a new
>> parameter to alloc_unbound_xen_event_channel(), or derived from other
>> than the default callback being requested?
> 
> Aside from the VM event, do you see any value for the other caller?

No (albeit I'm not entirely certain about vpl011_notification()'s
needs), hence the consideration. It's unnecessary overhead in
those cases.

>> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>>           rport = lchn->u.interdomain.remote_port;
>>           rchn  = evtchn_from_port(rd, rport);
>>           if ( consumer_is_xen(rchn) )
>> +        {
>> +            /* Don't keep holding the lock for the call below. */
>> +            atomic_inc(&rchn->u.interdomain.active_calls);
>> +            evtchn_read_unlock(lchn);
>>               xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
>> -        else
>> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
> 
> atomic_dec() doesn't contain any memory barrier, so we will want one 
> between xen_notification_fn() and atomic_dec() to avoid re-ordering.

Oh, indeed. But smp_mb() is too heavy handed here - x86 doesn't
really need any barrier, yet would gain a full MFENCE that way.
Actually - looks like I forgot we gained smp_mb__before_atomic()
a little over half a year ago.

Jan
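
For illustration, the adjustment discussed above would presumably end up
looking something like this on top of the earlier hunk (a sketch only, not
part of the posted v3 patch; smp_mb__before_atomic() is the existing helper
referred to above):

             xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
+            smp_mb__before_atomic(); /* order the callback before the decrement */
             atomic_dec(&rchn->u.interdomain.active_calls);
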
Tamas K Lengyel Dec. 3, 2020, 2:40 p.m. UTC | #4
On Thu, Dec 3, 2020 at 5:09 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 02.12.2020 22:10, Julien Grall wrote:
> > On 23/11/2020 13:30, Jan Beulich wrote:
> >> While there don't look to be any problems with this right now, the lock
> >> order implications from holding the lock can be very difficult to follow
> >> (and may be easy to violate unknowingly). The present callbacks don't
> >> (and no such callback should) have any need for the lock to be held.
> >>
> >> However, vm_event_disable() frees the structures used by respective
> >> callbacks and isn't otherwise synchronized with invocations of these
> >> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >> to wait to drop to zero before freeing the port (and dropping the lock).
> >
> > AFAICT, this callback is not the only place where the synchronization is
> > missing in the VM event code.
> >
> > For instance, vm_event_put_request() can also race against
> > vm_event_disable().
> >
> > So shouldn't we handle this issue properly in VM event?
>
> I suppose that's a question to the VM event folks rather than me?

IMHO it would obviously be better if the Xen side could handle
situations like these gracefully. OTOH it is also reasonable to expect
the privileged toolstack to perform its own sanity checks before
disabling. Like right now for disabling vm_event we first pause the
VM, process all remaining events that were on the ring, and only then
disable the interface. This avoids the race condition mentioned above
(among other issues). It's not perfect - we ran into problems where
event replies don't have the desired effect because the VM is paused -
but for the most part doing this is sufficient. So I don't consider
this to be a priority at the moment. That said, if anyone is so
inclined to fix this up, I would be happy to review & ack.

Tamas
Julien Grall Dec. 4, 2020, 11:28 a.m. UTC | #5
Hi Jan,

On 03/12/2020 10:09, Jan Beulich wrote:
> On 02.12.2020 22:10, Julien Grall wrote:
>> On 23/11/2020 13:30, Jan Beulich wrote:
>>> While there don't look to be any problems with this right now, the lock
>>> order implications from holding the lock can be very difficult to follow
>>> (and may be easy to violate unknowingly). The present callbacks don't
>>> (and no such callback should) have any need for the lock to be held.
>>>
>>> However, vm_event_disable() frees the structures used by respective
>>> callbacks and isn't otherwise synchronized with invocations of these
>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>
>> AFAICT, this callback is not the only place where the synchronization is
>> missing in the VM event code.
>>
>> For instance, vm_event_put_request() can also race against
>> vm_event_disable().
>>
>> So shouldn't we handle this issue properly in VM event?
> 
> I suppose that's a question to the VM event folks rather than me?

Yes. From my understanding of Tamas's e-mail, they are relying on the 
monitoring software to do the right thing.

I will refrain to comment on this approach. However, given the race is 
much wider than the event channel, I would recommend to not add more 
code in the event channel to deal with such problem.

Instead, this should be fixed in the VM event code when someone has time 
to harden the subsystem.

> 
>>> ---
>>> Should we make this accounting optional, to be requested through a new
>>> parameter to alloc_unbound_xen_event_channel(), or derived from other
>>> than the default callback being requested?
>>
>> Aside from the VM event, do you see any value for the other caller?
> 
> No (albeit I'm not entirely certain about vpl011_notification()'s
> needs), hence the consideration. It's unnecessary overhead in
> those cases.

I had another look and I think there is a small race in VPL011. It 
should be easy to fix (I will try to have a look later today).

> 
>>> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>>>            rport = lchn->u.interdomain.remote_port;
>>>            rchn  = evtchn_from_port(rd, rport);
>>>            if ( consumer_is_xen(rchn) )
>>> +        {
>>> +            /* Don't keep holding the lock for the call below. */
>>> +            atomic_inc(&rchn->u.interdomain.active_calls);
>>> +            evtchn_read_unlock(lchn);
>>>                xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
>>> -        else
>>> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>>
>> atomic_dec() doesn't contain any memory barrier, so we will want one
>> between xen_notification_fn() and atomic_dec() to avoid re-ordering.
> 
> Oh, indeed. But smp_mb() is too heavy handed here - x86 doesn't
> really need any barrier, yet would gain a full MFENCE that way.
> Actually - looks like I forgot we gained smp_mb__before_atomic()
> a little over half a year ago.

Ah yes, I forgot that atomic instructions are ordered on x86.

Cheers,
Jan Beulich Dec. 4, 2020, 11:48 a.m. UTC | #6
On 04.12.2020 12:28, Julien Grall wrote:
> Hi Jan,
> 
> On 03/12/2020 10:09, Jan Beulich wrote:
>> On 02.12.2020 22:10, Julien Grall wrote:
>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>> While there don't look to be any problems with this right now, the lock
>>>> order implications from holding the lock can be very difficult to follow
>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>> (and no such callback should) have any need for the lock to be held.
>>>>
>>>> However, vm_event_disable() frees the structures used by respective
>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>
>>> AFAICT, this callback is not the only place where the synchronization is
>>> missing in the VM event code.
>>>
>>> For instance, vm_event_put_request() can also race against
>>> vm_event_disable().
>>>
>>> So shouldn't we handle this issue properly in VM event?
>>
>> I suppose that's a question to the VM event folks rather than me?
> 
> Yes. From my understanding of Tamas's e-mail, they are relying on the 
> monitoring software to do the right thing.
> 
> I will refrain to comment on this approach. However, given the race is 
> much wider than the event channel, I would recommend to not add more 
> code in the event channel to deal with such problem.
> 
> Instead, this should be fixed in the VM event code when someone has time 
> to harden the subsystem.

Are you effectively saying I should now undo the addition of the
refcounting, which was added in response to feedback from you?
Or else what exactly am I to take from your reply?

Jan
Julien Grall Dec. 4, 2020, 11:51 a.m. UTC | #7
On 04/12/2020 11:48, Jan Beulich wrote:
> On 04.12.2020 12:28, Julien Grall wrote:
>> Hi Jan,
>>
>> On 03/12/2020 10:09, Jan Beulich wrote:
>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>> While there don't look to be any problems with this right now, the lock
>>>>> order implications from holding the lock can be very difficult to follow
>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>
>>>>> However, vm_event_disable() frees the structures used by respective
>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>
>>>> AFAICT, this callback is not the only place where the synchronization is
>>>> missing in the VM event code.
>>>>
>>>> For instance, vm_event_put_request() can also race against
>>>> vm_event_disable().
>>>>
>>>> So shouldn't we handle this issue properly in VM event?
>>>
>>> I suppose that's a question to the VM event folks rather than me?
>>
>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>> monitoring software to do the right thing.
>>
>> I will refrain to comment on this approach. However, given the race is
>> much wider than the event channel, I would recommend to not add more
>> code in the event channel to deal with such problem.
>>
>> Instead, this should be fixed in the VM event code when someone has time
>> to harden the subsystem.
> 
> Are you effectively saying I should now undo the addition of the
> refcounting, which was added in response to feedback from you?

Please point out where I made the request to use the refcounting...

I pointed out there was an issue with the VM event code. This was later
analysed as a wider issue. The VM event folks don't seem to be very
concerned about the race, so I don't see the reason to try to fix it in the
event channel code.

Cheers,
Jan Beulich Dec. 4, 2020, 12:01 p.m. UTC | #8
On 04.12.2020 12:51, Julien Grall wrote:
> 
> 
> On 04/12/2020 11:48, Jan Beulich wrote:
>> On 04.12.2020 12:28, Julien Grall wrote:
>>> Hi Jan,
>>>
>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>
>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>
>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>> missing in the VM event code.
>>>>>
>>>>> For instance, vm_event_put_request() can also race against
>>>>> vm_event_disable().
>>>>>
>>>>> So shouldn't we handle this issue properly in VM event?
>>>>
>>>> I suppose that's a question to the VM event folks rather than me?
>>>
>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>> monitoring software to do the right thing.
>>>
>>> I will refrain to comment on this approach. However, given the race is
>>> much wider than the event channel, I would recommend to not add more
>>> code in the event channel to deal with such problem.
>>>
>>> Instead, this should be fixed in the VM event code when someone has time
>>> to harden the subsystem.
>>
>> Are you effectively saying I should now undo the addition of the
>> refcounting, which was added in response to feedback from you?
> 
> Please point out where I made the request to use the refcounting...

You didn't ask for this directly, sure, but ...

> I pointed out there was an issue with the VM event code.

... this has ultimately led to the decision to use refcounting
(iirc there was one alternative that I had proposed, besides
the option of doing nothing).

> This was later
> analysed as a wider issue. The VM event folks don't seem to be very
> concerned about the race, so I don't see the reason to try to fix it in the
> event channel code.

And you won't need the refcount for vpl011 then? I can certainly
drop it again, but it feels odd to go back to an earlier version
under the circumstances ...

Jan
Julien Grall Dec. 4, 2020, 3:09 p.m. UTC | #9
Hi,

On 04/12/2020 12:01, Jan Beulich wrote:
> On 04.12.2020 12:51, Julien Grall wrote:
>>
>>
>> On 04/12/2020 11:48, Jan Beulich wrote:
>>> On 04.12.2020 12:28, Julien Grall wrote:
>>>> Hi Jan,
>>>>
>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>
>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>
>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>> missing in the VM event code.
>>>>>>
>>>>>> For instance, vm_event_put_request() can also race against
>>>>>> vm_event_disable().
>>>>>>
>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>
>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>
>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>> monitoring software to do the right thing.
>>>>
>>>> I will refrain to comment on this approach. However, given the race is
>>>> much wider than the event channel, I would recommend to not add more
>>>> code in the event channel to deal with such problem.
>>>>
>>>> Instead, this should be fixed in the VM event code when someone has time
>>>> to harden the subsystem.
>>>
>>> Are you effectively saying I should now undo the addition of the
>>> refcounting, which was added in response to feedback from you?
>>
>> Please point out where I made the request to use the refcounting...
> 
> You didn't ask for this directly, sure, but ...
> 
>> I pointed out there was an issue with the VM event code.
> 
> ... this has ultimately led to the decision to use refcounting
> (iirc there was one alternative that I had proposed, besides
> the option of doing nothing).

One other option that was discussed (maybe only on security@xen.org) is 
to move the spinlock outside of the structure so it is always allocated.

> 
>> This was later
>> analysed as a wider issue. The VM event folks don't seem to be very
>> concerned about the race, so I don't see the reason to try to fix it in the
>> event channel code.
> 
> And you won't need the refcount for vpl011 then?

I don't believe we need it for the vpl011 as the spin lock protecting 
the code should always be allocated. The problem today is the lock is 
initialized too late.

> I can certainly
> drop it again, but it feels odd to go back to an earlier version
> under the circumstances ...

The code introduced doesn't look necessary outside of the VM event code.
So I think it would be wrong to merge it if it is just papering over a 
bigger problem.

Cheers,



> 
> Jan
>
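
To make the "lock outside the structure" option mentioned above slightly more
concrete, the shape would presumably be along these lines; the names below
(struct vm_event_state, struct domain_vm_event, teardown()) are invented for
illustration and are not the actual Xen structures:

    struct vm_event_state {           /* allocated on enable, freed on disable */
        /* ring page, front/back ring pointers, wait queue, ... */
    };

    struct domain_vm_event {          /* embedded in struct domain, never freed on its own */
        spinlock_t lock;              /* stays valid even once state == NULL */
        struct vm_event_state *state;
    };

    static void teardown(struct domain_vm_event *dve)
    {
        spin_lock(&dve->lock);        /* users of the state take the same lock */
        xfree(dve->state);
        dve->state = NULL;
        spin_unlock(&dve->lock);
    }

Any consumer (the notification callback, vm_event_put_request(), ...) would
then check dve->state under the same lock before using it.
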
Tamas K Lengyel Dec. 4, 2020, 3:21 p.m. UTC | #10
On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>
> Hi Jan,
>
> On 03/12/2020 10:09, Jan Beulich wrote:
> > On 02.12.2020 22:10, Julien Grall wrote:
> >> On 23/11/2020 13:30, Jan Beulich wrote:
> >>> While there don't look to be any problems with this right now, the lock
> >>> order implications from holding the lock can be very difficult to follow
> >>> (and may be easy to violate unknowingly). The present callbacks don't
> >>> (and no such callback should) have any need for the lock to be held.
> >>>
> >>> However, vm_event_disable() frees the structures used by respective
> >>> callbacks and isn't otherwise synchronized with invocations of these
> >>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >>> to wait to drop to zero before freeing the port (and dropping the lock).
> >>
> >> AFAICT, this callback is not the only place where the synchronization is
> >> missing in the VM event code.
> >>
> >> For instance, vm_event_put_request() can also race against
> >> vm_event_disable().
> >>
> >> So shouldn't we handle this issue properly in VM event?
> >
> > I suppose that's a question to the VM event folks rather than me?
>
> Yes. From my understanding of Tamas's e-mail, they are relying on the
> monitoring software to do the right thing.
>
> I will refrain to comment on this approach. However, given the race is
> much wider than the event channel, I would recommend to not add more
> code in the event channel to deal with such problem.
>
> Instead, this should be fixed in the VM event code when someone has time
> to harden the subsystem.

I double-checked and the disable route is actually more robust, we
don't just rely on the toolstack doing the right thing. The domain
gets paused before any calls to vm_event_disable. So I don't think
there is really a race-condition here.

Tamas
Julien Grall Dec. 4, 2020, 3:29 p.m. UTC | #11
On 04/12/2020 15:21, Tamas K Lengyel wrote:
> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>
>> Hi Jan,
>>
>> On 03/12/2020 10:09, Jan Beulich wrote:
>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>> While there don't look to be any problems with this right now, the lock
>>>>> order implications from holding the lock can be very difficult to follow
>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>
>>>>> However, vm_event_disable() frees the structures used by respective
>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>
>>>> AFAICT, this callback is not the only place where the synchronization is
>>>> missing in the VM event code.
>>>>
>>>> For instance, vm_event_put_request() can also race against
>>>> vm_event_disable().
>>>>
>>>> So shouldn't we handle this issue properly in VM event?
>>>
>>> I suppose that's a question to the VM event folks rather than me?
>>
>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>> monitoring software to do the right thing.
>>
>> I will refrain to comment on this approach. However, given the race is
>> much wider than the event channel, I would recommend to not add more
>> code in the event channel to deal with such problem.
>>
>> Instead, this should be fixed in the VM event code when someone has time
>> to harden the subsystem.
> 
> I double-checked and the disable route is actually more robust, we
> don't just rely on the toolstack doing the right thing. The domain
> gets paused before any calls to vm_event_disable. So I don't think
> there is really a race-condition here.

The code will *only* pause the monitored domain. I can see two issues:
    1) The toolstack is still sending event while destroy is happening. 
This is the race discussed here.
    2) The implement of vm_event_put_request() suggests that it can be 
called with not-current domain.

I don't see how just pausing the monitored domain is enough here.

Cheers,
Tamas K Lengyel Dec. 4, 2020, 7:15 p.m. UTC | #12
On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>
>
>
> On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> >>
> >> Hi Jan,
> >>
> >> On 03/12/2020 10:09, Jan Beulich wrote:
> >>> On 02.12.2020 22:10, Julien Grall wrote:
> >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> >>>>> While there don't look to be any problems with this right now, the lock
> >>>>> order implications from holding the lock can be very difficult to follow
> >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> >>>>> (and no such callback should) have any need for the lock to be held.
> >>>>>
> >>>>> However, vm_event_disable() frees the structures used by respective
> >>>>> callbacks and isn't otherwise synchronized with invocations of these
> >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> >>>>
> >>>> AFAICT, this callback is not the only place where the synchronization is
> >>>> missing in the VM event code.
> >>>>
> >>>> For instance, vm_event_put_request() can also race against
> >>>> vm_event_disable().
> >>>>
> >>>> So shouldn't we handle this issue properly in VM event?
> >>>
> >>> I suppose that's a question to the VM event folks rather than me?
> >>
> >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> >> monitoring software to do the right thing.
> >>
> >> I will refrain to comment on this approach. However, given the race is
> >> much wider than the event channel, I would recommend to not add more
> >> code in the event channel to deal with such problem.
> >>
> >> Instead, this should be fixed in the VM event code when someone has time
> >> to harden the subsystem.
> >
> > I double-checked and the disable route is actually more robust, we
> > don't just rely on the toolstack doing the right thing. The domain
> > gets paused before any calls to vm_event_disable. So I don't think
> > there is really a race-condition here.
>
> The code will *only* pause the monitored domain. I can see two issues:
>     1) The toolstack is still sending event while destroy is happening.
> This is the race discussed here.
>     2) The implement of vm_event_put_request() suggests that it can be
> called with not-current domain.
>
> I don't see how just pausing the monitored domain is enough here.

Requests only get generated by the monitored domain. So if the domain
is not running you won't get more of them. The toolstack can only send
replies.

Tamas
Julien Grall Dec. 4, 2020, 7:22 p.m. UTC | #13
On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@gmail.com> wrote:
>
> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
> >
> >
> >
> > On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> > >>
> > >> Hi Jan,
> > >>
> > >> On 03/12/2020 10:09, Jan Beulich wrote:
> > >>> On 02.12.2020 22:10, Julien Grall wrote:
> > >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> > >>>>> While there don't look to be any problems with this right now, the lock
> > >>>>> order implications from holding the lock can be very difficult to follow
> > >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> > >>>>> (and no such callback should) have any need for the lock to be held.
> > >>>>>
> > >>>>> However, vm_event_disable() frees the structures used by respective
> > >>>>> callbacks and isn't otherwise synchronized with invocations of these
> > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> > >>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> > >>>>
> > >>>> AFAICT, this callback is not the only place where the synchronization is
> > >>>> missing in the VM event code.
> > >>>>
> > >>>> For instance, vm_event_put_request() can also race against
> > >>>> vm_event_disable().
> > >>>>
> > >>>> So shouldn't we handle this issue properly in VM event?
> > >>>
> > >>> I suppose that's a question to the VM event folks rather than me?
> > >>
> > >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> > >> monitoring software to do the right thing.
> > >>
> > >> I will refrain to comment on this approach. However, given the race is
> > >> much wider than the event channel, I would recommend to not add more
> > >> code in the event channel to deal with such problem.
> > >>
> > >> Instead, this should be fixed in the VM event code when someone has time
> > >> to harden the subsystem.
> > >
> > > I double-checked and the disable route is actually more robust, we
> > > don't just rely on the toolstack doing the right thing. The domain
> > > gets paused before any calls to vm_event_disable. So I don't think
> > > there is really a race-condition here.
> >
> > The code will *only* pause the monitored domain. I can see two issues:
> >     1) The toolstack is still sending event while destroy is happening.
> > This is the race discussed here.
> >     2) The implement of vm_event_put_request() suggests that it can be
> > called with not-current domain.
> >
> > I don't see how just pausing the monitored domain is enough here.
>
> Requests only get generated by the monitored domain.

If that's the case, then why is vm_event_put_request() able to
deal with a non-current domain?

I understand that you are possibly trusting who may call it, but this
looks quite fragile.

Cheers,

---
Tamas K Lengyel Dec. 4, 2020, 9:23 p.m. UTC | #14
On Fri, Dec 4, 2020 at 2:22 PM Julien Grall <julien.grall.oss@gmail.com> wrote:
>
> On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@gmail.com> wrote:
> >
> > On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
> > >
> > >
> > >
> > > On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> > > >>
> > > >> Hi Jan,
> > > >>
> > > >> On 03/12/2020 10:09, Jan Beulich wrote:
> > > >>> On 02.12.2020 22:10, Julien Grall wrote:
> > > >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> > > >>>>> While there don't look to be any problems with this right now, the lock
> > > >>>>> order implications from holding the lock can be very difficult to follow
> > > >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> > > >>>>> (and no such callback should) have any need for the lock to be held.
> > > >>>>>
> > > >>>>> However, vm_event_disable() frees the structures used by respective
> > > >>>>> callbacks and isn't otherwise synchronized with invocations of these
> > > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> > > >>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> > > >>>>
> > > >>>> AFAICT, this callback is not the only place where the synchronization is
> > > >>>> missing in the VM event code.
> > > >>>>
> > > >>>> For instance, vm_event_put_request() can also race against
> > > >>>> vm_event_disable().
> > > >>>>
> > > >>>> So shouldn't we handle this issue properly in VM event?
> > > >>>
> > > >>> I suppose that's a question to the VM event folks rather than me?
> > > >>
> > > >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> > > >> monitoring software to do the right thing.
> > > >>
> > > >> I will refrain to comment on this approach. However, given the race is
> > > >> much wider than the event channel, I would recommend to not add more
> > > >> code in the event channel to deal with such problem.
> > > >>
> > > >> Instead, this should be fixed in the VM event code when someone has time
> > > >> to harden the subsystem.
> > > >
> > > > I double-checked and the disable route is actually more robust, we
> > > > don't just rely on the toolstack doing the right thing. The domain
> > > > gets paused before any calls to vm_event_disable. So I don't think
> > > > there is really a race-condition here.
> > >
> > > The code will *only* pause the monitored domain. I can see two issues:
> > >     1) The toolstack is still sending event while destroy is happening.
> > > This is the race discussed here.
> > >     2) The implement of vm_event_put_request() suggests that it can be
> > > called with not-current domain.
> > >
> > > I don't see how just pausing the monitored domain is enough here.
> >
> > Requests only get generated by the monitored domain.
>
> If that's the case, then why is vm_event_put_request() able to
> deal with a non-current domain?
>
> I understand that you are possibly trusting who may call it, but this
> looks quite fragile.

I didn't write the system. You probably want to ask that question of
the original author.

Tamas
Jan Beulich Dec. 7, 2020, 8:02 a.m. UTC | #15
On 04.12.2020 16:09, Julien Grall wrote:
> On 04/12/2020 12:01, Jan Beulich wrote:
>> On 04.12.2020 12:51, Julien Grall wrote:
>>> On 04/12/2020 11:48, Jan Beulich wrote:
>>>> On 04.12.2020 12:28, Julien Grall wrote:
>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>
>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>
>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>> monitoring software to do the right thing.
>>>>>
>>>>> I will refrain to comment on this approach. However, given the race is
>>>>> much wider than the event channel, I would recommend to not add more
>>>>> code in the event channel to deal with such problem.
>>>>>
>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>> to harden the subsystem.
>>>>
>>>> Are you effectively saying I should now undo the addition of the
>>>> refcounting, which was added in response to feedback from you?
>>>
>>> Please point out where I made the request to use the refcounting...
>>
>> You didn't ask for this directly, sure, but ...
>>
>>> I pointed out there was an issue with the VM event code.
>>
>> ... this has ultimately led to the decision to use refcounting
>> (iirc there was one alternative that I had proposed, besides
>> the option of doing nothing).
> 
> One other option that was discussed (maybe only on security@xen.org) is 
> to move the spinlock outside of the structure so it is always allocated.

Oh, right - forgot about that one, because that's nothing I would
ever have taken on actually carrying out.

>>> This was later
>>> analysed as a wider issue. The VM event folks don't seem to be very
>>> concerned about the race, so I don't see the reason to try to fix it in the
>>> event channel code.
>>
>> And you won't need the refcount for vpl011 then?
> 
> I don't believe we need it for the vpl011 as the spin lock protecting 
> the code should always be allocated. The problem today is the lock is 
> initialized too late.
> 
>> I can certainly
>> drop it again, but it feels odd to go back to an earlier version
>> under the circumstances ...
> 
> The code introduced doesn't look necessary outside of the VM event code.
> So I think it would be wrong to merge it if it is just papering over a 
> bigger problem.

So to translate this to a clear course of action: You want me to
go back to the earlier version by dropping the refcounting again?
(I don't view this as "papering over" btw, but a tiny step towards
a solution.)

Jan
Jan Beulich Dec. 7, 2020, 3:28 p.m. UTC | #16
On 04.12.2020 20:15, Tamas K Lengyel wrote:
> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>
>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>
>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>> missing in the VM event code.
>>>>>>
>>>>>> For instance, vm_event_put_request() can also race against
>>>>>> vm_event_disable().
>>>>>>
>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>
>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>
>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>> monitoring software to do the right thing.
>>>>
>>>> I will refrain to comment on this approach. However, given the race is
>>>> much wider than the event channel, I would recommend to not add more
>>>> code in the event channel to deal with such problem.
>>>>
>>>> Instead, this should be fixed in the VM event code when someone has time
>>>> to harden the subsystem.
>>>
>>> I double-checked and the disable route is actually more robust, we
>>> don't just rely on the toolstack doing the right thing. The domain
>>> gets paused before any calls to vm_event_disable. So I don't think
>>> there is really a race-condition here.
>>
>> The code will *only* pause the monitored domain. I can see two issues:
>>     1) The toolstack is still sending event while destroy is happening.
>> This is the race discussed here.
>>     2) The implement of vm_event_put_request() suggests that it can be
>> called with not-current domain.
>>
>> I don't see how just pausing the monitored domain is enough here.
> 
> Requests only get generated by the monitored domain. So if the domain
> is not running you won't get more of them. The toolstack can only send
> replies.

Julien,

does this change your view on the refcounting added by the patch
at the root of this sub-thread?

Jan
Julien Grall Dec. 7, 2020, 5:22 p.m. UTC | #17
On 07/12/2020 08:02, Jan Beulich wrote:
> On 04.12.2020 16:09, Julien Grall wrote:
>> On 04/12/2020 12:01, Jan Beulich wrote:
>>> On 04.12.2020 12:51, Julien Grall wrote:
>>>> On 04/12/2020 11:48, Jan Beulich wrote:
>>>>> On 04.12.2020 12:28, Julien Grall wrote:
>>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>>
>>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>>
>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>>> monitoring software to do the right thing.
>>>>>>
>>>>>> I will refrain to comment on this approach. However, given the race is
>>>>>> much wider than the event channel, I would recommend to not add more
>>>>>> code in the event channel to deal with such problem.
>>>>>>
>>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>>> to harden the subsystem.
>>>>>
>>>>> Are you effectively saying I should now undo the addition of the
>>>>> refcounting, which was added in response to feedback from you?
>>>>
>>>> Please point out where I made the request to use the refcounting...
>>>
>>> You didn't ask for this directly, sure, but ...
>>>
>>>> I pointed out there was an issue with the VM event code.
>>>
>>> ... this has ultimately led to the decision to use refcounting
>>> (iirc there was one alternative that I had proposed, besides
>>> the option of doing nothing).
>>
>> One other option that was discussed (maybe only on security@xen.org) is
>> to move the spinlock outside of the structure so it is always allocated.
> 
> Oh, right - forgot about that one, because that's nothing I would
> ever have taken on actually carrying out.
> 
>>>> This was later
>>>> analysed as a wider issue. The VM event folks don't seem to be very
>>>> concerned about the race, so I don't see the reason to try to fix it in the
>>>> event channel code.
>>>
>>> And you won't need the refcount for vpl011 then?
>>
>> I don't believe we need it for the vpl011 as the spin lock protecting
>> the code should always be allocated. The problem today is the lock is
>> initialized too late.
>>
>>> I can certainly
>>> drop it again, but it feels odd to go back to an earlier version
>>> under the circumstances ...
>>
>> The code introduced doesn't look necessary outside of the VM event code.
>> So I think it would be wrong to merge it if it is just papering over a
>> bigger problem.
> 
> So to translate this to a clear course of action: You want me to
> go back to the earlier version by dropping the refcounting again?

Yes.

> (I don't view this as "papering over" btw, but a tiny step towards
> a solution.)

This is implying that the refcounting is part of the actual solution. I 
think you can solve it directly in the VM event code without touching 
the event channel code.

Furthermore, I see no point in adding code to the common code if the
maintainers of the affected subsystem think their code is safe (I don't
believe it is).

Cheers,
Julien Grall Dec. 7, 2020, 5:30 p.m. UTC | #18
Hi Jan,

On 07/12/2020 15:28, Jan Beulich wrote:
> On 04.12.2020 20:15, Tamas K Lengyel wrote:
>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>>
>>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>>
>>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>>> missing in the VM event code.
>>>>>>>
>>>>>>> For instance, vm_event_put_request() can also race against
>>>>>>> vm_event_disable().
>>>>>>>
>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>
>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>
>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>> monitoring software to do the right thing.
>>>>>
>>>>> I will refrain to comment on this approach. However, given the race is
>>>>> much wider than the event channel, I would recommend to not add more
>>>>> code in the event channel to deal with such problem.
>>>>>
>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>> to harden the subsystem.
>>>>
>>>> I double-checked and the disable route is actually more robust, we
>>>> don't just rely on the toolstack doing the right thing. The domain
>>>> gets paused before any calls to vm_event_disable. So I don't think
>>>> there is really a race-condition here.
>>>
>>> The code will *only* pause the monitored domain. I can see two issues:
>>>      1) The toolstack is still sending event while destroy is happening.
>>> This is the race discussed here.
>>>      2) The implement of vm_event_put_request() suggests that it can be
>>> called with not-current domain.
>>>
>>> I don't see how just pausing the monitored domain is enough here.
>>
>> Requests only get generated by the monitored domain. So if the domain
>> is not running you won't get more of them. The toolstack can only send
>> replies.
> 
> Julien,
> 
> does this change your view on the refcounting added by the patch
> at the root of this sub-thread?

I still think the code is at best fragile. One example I can find is:

   -> guest_remove_page()
     -> p2m_mem_paging_drop_page()
      -> vm_event_put_request()

guest_remove_page() is not always called on the current domain, so there
is a possibility for vm_event_put_request() to happen on a foreign domain,
which therefore wouldn't be protected by the current hypercall.

Anyway, I don't think the refcounting should be part of the event 
channel without any idea on how this would fit in fixing the VM event race.

Cheers,
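
To make the "harden it in the VM event code" direction concrete, a guarded
request path might look roughly like the sketch below. The signatures, the
d->vm_event_lock field and the NULL-pointer convention are hypothetical, shown
only to illustrate serializing against vm_event_disable() on a lock that is
not itself freed:

    /* Sketch only - not current Xen code. */
    void vm_event_put_request(struct domain *d, struct vm_event_domain **ved,
                              vm_event_request_t *req)
    {
        spin_lock(&d->vm_event_lock);          /* hypothetical always-present lock */

        if ( *ved == NULL )                    /* ring already torn down */
        {
            spin_unlock(&d->vm_event_lock);
            return;
        }

        /* ... existing logic: claim a ring slot, copy *req, kick the port ... */

        spin_unlock(&d->vm_event_lock);
    }

    static int vm_event_disable(struct domain *d, struct vm_event_domain **ved)
    {
        spin_lock(&d->vm_event_lock);
        /* ... unmap the ring, free the event channel port ... */
        xfree(*ved);
        *ved = NULL;
        spin_unlock(&d->vm_event_lock);

        return 0;
    }

Since the lock belongs to the target domain, this would also cover callers
acting on a foreign domain, such as the guest_remove_page() path above.
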
Tamas K Lengyel Dec. 7, 2020, 5:35 p.m. UTC | #19
On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote:
>
> Hi Jan,
>
> On 07/12/2020 15:28, Jan Beulich wrote:
> > On 04.12.2020 20:15, Tamas K Lengyel wrote:
> >> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
> >>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
> >>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> >>>>> On 03/12/2020 10:09, Jan Beulich wrote:
> >>>>>> On 02.12.2020 22:10, Julien Grall wrote:
> >>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
> >>>>>>>> While there don't look to be any problems with this right now, the lock
> >>>>>>>> order implications from holding the lock can be very difficult to follow
> >>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
> >>>>>>>> (and no such callback should) have any need for the lock to be held.
> >>>>>>>>
> >>>>>>>> However, vm_event_disable() frees the structures used by respective
> >>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
> >>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> >>>>>>>
> >>>>>>> AFAICT, this callback is not the only place where the synchronization is
> >>>>>>> missing in the VM event code.
> >>>>>>>
> >>>>>>> For instance, vm_event_put_request() can also race against
> >>>>>>> vm_event_disable().
> >>>>>>>
> >>>>>>> So shouldn't we handle this issue properly in VM event?
> >>>>>>
> >>>>>> I suppose that's a question to the VM event folks rather than me?
> >>>>>
> >>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
> >>>>> monitoring software to do the right thing.
> >>>>>
> >>>>> I will refrain to comment on this approach. However, given the race is
> >>>>> much wider than the event channel, I would recommend to not add more
> >>>>> code in the event channel to deal with such problem.
> >>>>>
> >>>>> Instead, this should be fixed in the VM event code when someone has time
> >>>>> to harden the subsystem.
> >>>>
> >>>> I double-checked and the disable route is actually more robust, we
> >>>> don't just rely on the toolstack doing the right thing. The domain
> >>>> gets paused before any calls to vm_event_disable. So I don't think
> >>>> there is really a race-condition here.
> >>>
> >>> The code will *only* pause the monitored domain. I can see two issues:
> >>>      1) The toolstack is still sending event while destroy is happening.
> >>> This is the race discussed here.
> >>>      2) The implement of vm_event_put_request() suggests that it can be
> >>> called with not-current domain.
> >>>
> >>> I don't see how just pausing the monitored domain is enough here.
> >>
> >> Requests only get generated by the monitored domain. So if the domain
> >> is not running you won't get more of them. The toolstack can only send
> >> replies.
> >
> > Julien,
> >
> > does this change your view on the refcounting added by the patch
> > at the root of this sub-thread?
>
> I still think the code is at best fragile. One example I can find is:
>
>    -> guest_remove_page()
>      -> p2m_mem_paging_drop_page()
>       -> vm_event_put_request()
>
> guest_remove_page() is not always called on the current domain, so there
> is a possibility for vm_event_put_request() to happen on a foreign domain,
> which therefore wouldn't be protected by the current hypercall.
>
> Anyway, I don't think the refcounting should be part of the event
> channel without any idea on how this would fit in fixing the VM event race.

If the problematic patterns only appear with mem_paging I would
suggest just removing the mem_paging code completely. It's been
abandoned for several years now.

Tamas
Jan Beulich Dec. 23, 2020, 1:12 p.m. UTC | #20
On 07.12.2020 18:35, Tamas K Lengyel wrote:
> On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote:
>>
>> Hi Jan,
>>
>> On 07/12/2020 15:28, Jan Beulich wrote:
>>> On 04.12.2020 20:15, Tamas K Lengyel wrote:
>>>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>>>>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>>>>
>>>>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>>>>
>>>>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>>>>> missing in the VM event code.
>>>>>>>>>
>>>>>>>>> For instance, vm_event_put_request() can also race against
>>>>>>>>> vm_event_disable().
>>>>>>>>>
>>>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>>>
>>>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>>>
>>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>>>> monitoring software to do the right thing.
>>>>>>>
>>>>>>> I will refrain to comment on this approach. However, given the race is
>>>>>>> much wider than the event channel, I would recommend to not add more
>>>>>>> code in the event channel to deal with such problem.
>>>>>>>
>>>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>>>> to harden the subsystem.
>>>>>>
>>>>>> I double-checked and the disable route is actually more robust, we
>>>>>> don't just rely on the toolstack doing the right thing. The domain
>>>>>> gets paused before any calls to vm_event_disable. So I don't think
>>>>>> there is really a race-condition here.
>>>>>
>>>>> The code will *only* pause the monitored domain. I can see two issues:
>>>>>      1) The toolstack is still sending event while destroy is happening.
>>>>> This is the race discussed here.
>>>>>      2) The implement of vm_event_put_request() suggests that it can be
>>>>> called with not-current domain.
>>>>>
>>>>> I don't see how just pausing the monitored domain is enough here.
>>>>
>>>> Requests only get generated by the monitored domain. So if the domain
>>>> is not running you won't get more of them. The toolstack can only send
>>>> replies.
>>>
>>> Julien,
>>>
>>> does this change your view on the refcounting added by the patch
>>> at the root of this sub-thread?
>>
>> I still think the code is at best fragile. One example I can find is:
>>
>>    -> guest_remove_page()
>>      -> p2m_mem_paging_drop_page()
>>       -> vm_event_put_request()
>>
>> guest_remove_page() is not always call on the current domain. So there
>> are possibility for vm_event_put_request() to happen on a foreign domain
>> and therefore wouldn't be protected by the current hypercall.
>>
>> Anyway, I don't think the refcounting should be part of the event
>> channel without any idea on how this would fit in fixing the VM event race.
> 
> If the problematic patterns only appear with mem_paging I would
> suggest just removing the mem_paging code completely. It's been
> abandoned for several years now.

Since this is nothing I'm fancying doing, the way forward here needs
to be a different one. From the input by both of you I still can't
conclude whether this patch should remain as is in v4, or revert
back to its v2 version. Please can we get this settled so I can get
v4 out?

Thanks, Jan
Julien Grall Dec. 23, 2020, 1:33 p.m. UTC | #21
On 23/12/2020 13:12, Jan Beulich wrote:
> On 07.12.2020 18:35, Tamas K Lengyel wrote:
>> On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote:
>>>
>>> Hi Jan,
>>>
>>> On 07/12/2020 15:28, Jan Beulich wrote:
>>>> On 04.12.2020 20:15, Tamas K Lengyel wrote:
>>>>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>>>>>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>>>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>>>>>
>>>>>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>>>>>
>>>>>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>>>>>> missing in the VM event code.
>>>>>>>>>>
>>>>>>>>>> For instance, vm_event_put_request() can also race against
>>>>>>>>>> vm_event_disable().
>>>>>>>>>>
>>>>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>>>>
>>>>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>>>>
>>>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>>>>> monitoring software to do the right thing.
>>>>>>>>
>>>>>>>> I will refrain to comment on this approach. However, given the race is
>>>>>>>> much wider than the event channel, I would recommend to not add more
>>>>>>>> code in the event channel to deal with such problem.
>>>>>>>>
>>>>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>>>>> to harden the subsystem.
>>>>>>>
>>>>>>> I double-checked and the disable route is actually more robust, we
>>>>>>> don't just rely on the toolstack doing the right thing. The domain
>>>>>>> gets paused before any calls to vm_event_disable. So I don't think
>>>>>>> there is really a race-condition here.
>>>>>>
>>>>>> The code will *only* pause the monitored domain. I can see two issues:
>>>>>>       1) The toolstack is still sending event while destroy is happening.
>>>>>> This is the race discussed here.
>>>>>>       2) The implement of vm_event_put_request() suggests that it can be
>>>>>> called with not-current domain.
>>>>>>
>>>>>> I don't see how just pausing the monitored domain is enough here.
>>>>>
>>>>> Requests only get generated by the monitored domain. So if the domain
>>>>> is not running you won't get more of them. The toolstack can only send
>>>>> replies.
>>>>
>>>> Julien,
>>>>
>>>> does this change your view on the refcounting added by the patch
>>>> at the root of this sub-thread?
>>>
>>> I still think the code is at best fragile. One example I can find is:
>>>
>>>     -> guest_remove_page()
>>>       -> p2m_mem_paging_drop_page()
>>>        -> vm_event_put_request()
>>>
>>> guest_remove_page() is not always call on the current domain. So there
>>> are possibility for vm_event_put_request() to happen on a foreign domain
>>> and therefore wouldn't be protected by the current hypercall.
>>>
>>> Anyway, I don't think the refcounting should be part of the event
>>> channel without any idea on how this would fit in fixing the VM event race.
>>
>> If the problematic patterns only appear with mem_paging I would
>> suggest just removing the mem_paging code completely. It's been
>> abandoned for several years now.
> 
> Since this is nothing I'm fancying doing, the way forward here needs
> to be a different one. From the input by both of you I still can't
> conclude whether this patch should remain as is in v4, or revert
> back to its v2 version. Please can we get this settled so I can get
> v4 out?

I haven't had time to investigate the rest of the VM event code to find 
other cases where this may happen. I still think there is a bigger 
problem in the VM event code, but the maintainer disagrees here.

At which point, I see limited reason to try to paper over in the common 
code. So I would rather ack/merge v2 rather than v3.

Cheers,
Jan Beulich Dec. 23, 2020, 1:41 p.m. UTC | #22
On 23.12.2020 14:33, Julien Grall wrote:
> On 23/12/2020 13:12, Jan Beulich wrote:
>> From the input by both of you I still can't
>> conclude whether this patch should remain as is in v4, or revert
>> back to its v2 version. Please can we get this settled so I can get
>> v4 out?
> 
> I haven't had time to investigate the rest of the VM event code to find 
> other cases where this may happen. I still think there is a bigger 
> problem in the VM event code, but the maintainer disagrees here.
> 
> At which point, I see limited reason to try to paper over in the common 
> code. So I would rather ack/merge v2 rather than v3.

Since I expect Tamas and/or the Bitdefender folks to be of the
opposite opinion, there's still no way out, at least if "rather
ack" implies a nak for v3. Personally, if this expectation of
mine is correct, I'd prefer to keep the accounting but make it
optional (as suggested in a post-commit-message remark).
That'll eliminate the overhead you appear to be concerned of,
but of course it'll further complicate the logic (albeit just
slightly).

Jan
Julien Grall Dec. 23, 2020, 2:44 p.m. UTC | #23
On 23/12/2020 13:41, Jan Beulich wrote:
> On 23.12.2020 14:33, Julien Grall wrote:
>> On 23/12/2020 13:12, Jan Beulich wrote:
>>>  From the input by both of you I still can't
>>> conclude whether this patch should remain as is in v4, or revert
>>> back to its v2 version. Please can we get this settled so I can get
>>> v4 out?
>>
>> I haven't had time to investigate the rest of the VM event code to find
>> other cases where this may happen. I still think there is a bigger
>> problem in the VM event code, but the maintainer disagrees here.
>>
>> At which point, I see limited reason to try to paper over in the common
>> code. So I would rather ack/merge v2 rather than v3.
> 
> Since I expect Tamas and/or the Bitdefender folks to be of the
> opposite opinion, there's still no way out, at least if "rather
> ack" implies a nak for v3.

The only way out here is for someone to justify why this patch is 
sufficient for the VM event race. I am not convinced it is (see more below).

> Personally, if this expectation of
> mine is correct, I'd prefer to keep the accounting but make it
> optional (as suggested in a post-commit-message remark).
> That'll eliminate the overhead you appear to be concerned of,
> but of course it'll further complicate the logic (albeit just
> slightly).

I am more concerned about adding overly complex code that would just 
paper over a bigger problem. I also can't see any use for it outside of 
the VM event discussion.

I had another look at the code. As I mentioned in the past, 
vm_event_put_request() is able to deal with d != current->domain (it 
will set VM_EVENT_FLAG_FOREIGN). There are 4 callers of the function:
    1) p2m_mem_paging_drop_page()
    2) p2m_mem_paging_populate()
    3) mem_sharing_notify_enomem()
    4) monitor_traps()

1) and 2) belong to the mem paging subsystem. Tamas suggested that it 
has been abandoned.

4) can only be called with the current domain.

This leaves us with 3), in the mem sharing subsystem. As this is called 
from the memory hypercalls, it looks possible to me that d != current->domain. 
The code also seems to be maintained (there were recent non-trivial 
changes).
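
To make the shape of that path concrete, here is a minimal stand-alone 
sketch (NOT actual Xen code; every type, value and signature below is a 
simplified stand-in) of how the chain quoted above can reach 
vm_event_put_request() with d != current->domain:

/* Stand-alone illustration only; builds with any C compiler. */
#include <stdio.h>

struct domain { int domain_id; };

static struct domain dom0 = { 0 };
static struct domain domU = { 1 };
static struct domain *curr = &dom0;      /* hypercall issued by the toolstack */

#define VM_EVENT_FLAG_FOREIGN 0x1        /* value is a placeholder */

static void vm_event_put_request(struct domain *d)
{
    unsigned int flags = 0;

    if ( d != curr )
        flags |= VM_EVENT_FLAG_FOREIGN;  /* request raised from foreign context */

    printf("vm_event request for d%d, flags %#x\n", d->domain_id, flags);
}

static void p2m_mem_paging_drop_page(struct domain *d)
{
    vm_event_put_request(d);
}

static void guest_remove_page(struct domain *d)
{
    p2m_mem_paging_drop_page(d);
}

int main(void)
{
    guest_remove_page(&domU);            /* dom0 removing a page of domU */
    return 0;
}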

Can one of the VM event developers come up with a justification for why 
this patch is enough to make the VM event subsystem safe?

FAOD, the justification should be solely based on the hypervisor code 
(IOW not external trusted software).

Cheers,
Jan Beulich Dec. 23, 2020, 2:56 p.m. UTC | #24
On 23.12.2020 15:44, Julien Grall wrote:
> On 23/12/2020 13:41, Jan Beulich wrote:
>> On 23.12.2020 14:33, Julien Grall wrote:
>>> On 23/12/2020 13:12, Jan Beulich wrote:
>>>>  From the input by both of you I still can't
>>>> conclude whether this patch should remain as is in v4, or revert
>>>> back to its v2 version. Please can we get this settled so I can get
>>>> v4 out?
>>>
>>> I haven't had time to investigate the rest of the VM event code to find
>>> other cases where this may happen. I still think there is a bigger
>>> problem in the VM event code, but the maintainer disagrees here.
>>>
>>> At which point, I see limited reason to try to paper over in the common
>>> code. So I would rather ack/merge v2 rather than v3.
>>
>> Since I expect Tamas and/or the Bitdefender folks to be of the
>> opposite opinion, there's still no way out, at least if "rather
>> ack" implies a nak for v3.
> 
> The only way out here is for someone to justify why this patch is 
> sufficient for the VM event race.

I think this is too much to demand. Moving in the right direction
without arriving at the final destination is still a worthwhile
thing to do imo.

>> Personally, if this expectation of
>> mine is correct, I'd prefer to keep the accounting but make it
>> optional (as suggested in a post-commit-message remark).
>> That'll eliminate the overhead you appear to be concerned of,
>> but of course it'll further complicate the logic (albeit just
>> slightly).
> 
> I am more concerned about adding over complex code that would just 
> papering over a bigger problem. I also can't see use of it outside of 
> the VM event discussion.

I think it is a generally appropriate thing to do to wait for
callbacks to complete before tearing down their origin control
structure. There may be cases where code structure makes this
unnecessary, but I don't think this can be an expectation to
all the users of the functionality. Hence my suggestion to
possibly make this optional (driven directly or indirectly by
the user of the registration function).
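
As a stand-alone illustration of that pattern (NOT Xen code; the actual 
patch further down uses Xen's atomic_t, cpu_relax() and the per-channel 
locks, all replaced by simplified stand-ins here):

#include <stdatomic.h>
#include <stdio.h>

static atomic_int active_calls;

static void consumer_fn(int port)        /* stands in for xen_notification_fn() */
{
    printf("callback for port %d\n", port);
}

static void notify_port(int port)
{
    /* evtchn_send() shape: count the call, drop the per-channel lock
     * (not modelled here), run the callback unlocked, then uncount. */
    atomic_fetch_add(&active_calls, 1);
    consumer_fn(port);
    atomic_fetch_sub(&active_calls, 1);
}

static void close_port(void)
{
    /* evtchn_close() shape: wait for in-flight callbacks to finish
     * before freeing whatever state the callback relies on. */
    while ( atomic_load(&active_calls) )
        ;                                /* cpu_relax() in the hypervisor */

    /* ... now safe to free the consumer's structures ... */
}

int main(void)
{
    notify_port(3);
    close_port();
    return 0;
}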

Jan
Julien Grall Dec. 23, 2020, 3:08 p.m. UTC | #25
Hi Jan,

On 23/12/2020 14:56, Jan Beulich wrote:
> On 23.12.2020 15:44, Julien Grall wrote:
>> On 23/12/2020 13:41, Jan Beulich wrote:
>>> On 23.12.2020 14:33, Julien Grall wrote:
>>>> On 23/12/2020 13:12, Jan Beulich wrote:
>>>>>   From the input by both of you I still can't
>>>>> conclude whether this patch should remain as is in v4, or revert
>>>>> back to its v2 version. Please can we get this settled so I can get
>>>>> v4 out?
>>>>
>>>> I haven't had time to investigate the rest of the VM event code to find
>>>> other cases where this may happen. I still think there is a bigger
>>>> problem in the VM event code, but the maintainer disagrees here.
>>>>
>>>> At which point, I see limited reason to try to paper over in the common
>>>> code. So I would rather ack/merge v2 rather than v3.
>>>
>>> Since I expect Tamas and/or the Bitdefender folks to be of the
>>> opposite opinion, there's still no way out, at least if "rather
>>> ack" implies a nak for v3.
>>
>> The only way out here is for someone to justify why this patch is
>> sufficient for the VM event race.
> 
> I think this is too much to demand.

I guess you didn't notice that I did most of the work by providing an 
analysis in my e-mail... I don't think it is too demanding to read 
the analysis and say whether I am correct or not.

Do you really prefer to add code that would rot because it is unused?

> Moving in the right direction
> without arriving at the final destination is still a worthwhile
> thing to do imo.
> 
>>> Personally, if this expectation of
>>> mine is correct, I'd prefer to keep the accounting but make it
>>> optional (as suggested in a post-commit-message remark).
>>> That'll eliminate the overhead you appear to be concerned of,
>>> but of course it'll further complicate the logic (albeit just
>>> slightly).
>>
>> I am more concerned about adding over complex code that would just
>> papering over a bigger problem. I also can't see use of it outside of
>> the VM event discussion.
> 
> I think it is a generally appropriate thing to do to wait for
> callbacks to complete before tearing down their origin control
> structure. There may be cases where code structure makes this
> unnecessary, but I don't think this can be an expectation to
> all the users of the functionality. Hence my suggestion to
> possibly make this optional (driven directly or indirectly by
> the user of the registration function).

As you tend to say, we should not add code unless there is a user. So 
far the only possible user is dubious. If you find another user, then 
we can discuss whether this patch makes sense.

Cheers,
Tamas K Lengyel Dec. 23, 2020, 3:15 p.m. UTC | #26
On Wed, Dec 23, 2020 at 9:44 AM Julien Grall <julien@xen.org> wrote:
>
>
>
> On 23/12/2020 13:41, Jan Beulich wrote:
> > On 23.12.2020 14:33, Julien Grall wrote:
> >> On 23/12/2020 13:12, Jan Beulich wrote:
> >>>  From the input by both of you I still can't
> >>> conclude whether this patch should remain as is in v4, or revert
> >>> back to its v2 version. Please can we get this settled so I can get
> >>> v4 out?
> >>
> >> I haven't had time to investigate the rest of the VM event code to find
> >> other cases where this may happen. I still think there is a bigger
> >> problem in the VM event code, but the maintainer disagrees here.
> >>
> >> At which point, I see limited reason to try to paper over in the common
> >> code. So I would rather ack/merge v2 rather than v3.
> >
> > Since I expect Tamas and/or the Bitdefender folks to be of the
> > opposite opinion, there's still no way out, at least if "rather
> > ack" implies a nak for v3.
>
> The only way out here is for someone to justify why this patch is
> sufficient for the VM event race. I am not convinced it is (see more below).
>
> > Personally, if this expectation of
> > mine is correct, I'd prefer to keep the accounting but make it
> > optional (as suggested in a post-commit-message remark).
> > That'll eliminate the overhead you appear to be concerned of,
> > but of course it'll further complicate the logic (albeit just
> > slightly).
>
> I am more concerned about adding overly complex code that would just
> paper over a bigger problem. I also can't see any use for it outside of
> the VM event discussion.
>
> I had another look at the code. As I mentioned in the past,
> vm_event_put_request() is able to deal with d != current->domain (it
> will set VM_EVENT_FLAG_FOREIGN). There are 4 callers of the function:
>     1) p2m_mem_paging_drop_page()
>     2) p2m_mem_paging_populate()
>     3) mem_sharing_notify_enomem()
>     4) monitor_traps()
>
> 1) and 2) belong to the mem paging subsystem. Tamas suggested that it
> has been abandoned.
>
> 4) can only be called with the current domain.
>
> This leaves us with 3), in the mem sharing subsystem. As this is called
> from the memory hypercalls, it looks possible to me that d != current->domain.
> The code also seems to be maintained (there were recent non-trivial
> changes).
>
> Can one of the VM event developers come up with a justification for why
> this patch is enough to make the VM event subsystem safe?

3) is an unused feature as well that should likely be dropped at some
point. It can also only be called with current->domain; it effectively
just signals an out-of-memory error to a vm_event listener in dom0,
i.e. that populating an entry for the VM that took the EPT fault failed.
I guess the idea was that the dom0 agent would be able to make a decision
on how to proceed (i.e. which VM to kill to free up memory).
diff mbox series

Patch

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -397,6 +397,7 @@  static long evtchn_bind_interdomain(evtc
     
     rchn->u.interdomain.remote_dom  = ld;
     rchn->u.interdomain.remote_port = lport;
+    atomic_set(&rchn->u.interdomain.active_calls, 0);
     rchn->state                     = ECS_INTERDOMAIN;
 
     /*
@@ -720,6 +721,10 @@  int evtchn_close(struct domain *d1, int
 
         double_evtchn_lock(chn1, chn2);
 
+        if ( consumer_is_xen(chn1) )
+            while ( atomic_read(&chn1->u.interdomain.active_calls) )
+                cpu_relax();
+
         evtchn_free(d1, chn1);
 
         chn2->state = ECS_UNBOUND;
@@ -781,9 +786,15 @@  int evtchn_send(struct domain *ld, unsig
         rport = lchn->u.interdomain.remote_port;
         rchn  = evtchn_from_port(rd, rport);
         if ( consumer_is_xen(rchn) )
+        {
+            /* Don't keep holding the lock for the call below. */
+            atomic_inc(&rchn->u.interdomain.active_calls);
+            evtchn_read_unlock(lchn);
             xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
-        else
-            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
+            atomic_dec(&rchn->u.interdomain.active_calls);
+            return 0;
+        }
+        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
         break;
     case ECS_IPI:
         evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -104,6 +104,7 @@  struct evtchn
         } unbound;     /* state == ECS_UNBOUND */
         struct {
             evtchn_port_t  remote_port;
+            atomic_t       active_calls;
             struct domain *remote_dom;
         } interdomain; /* state == ECS_INTERDOMAIN */
         struct {