Message ID | d821c715-966a-b48b-a877-c5dac36822f0@suse.com
---|---
State | Superseded
Series | evtchn: (not so) recent XSAs follow-on
On 23.11.2020 15:30, Jan Beulich wrote:
> While there don't look to be any problems with this right now, the lock
> order implications from holding the lock can be very difficult to follow
> (and may be easy to violate unknowingly). The present callbacks don't
> (and no such callback should) have any need for the lock to be held.
>
> However, vm_event_disable() frees the structures used by respective
> callbacks and isn't otherwise synchronized with invocations of these
> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> to wait to drop to zero before freeing the port (and dropping the lock).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>

> ---
> Should we make this accounting optional, to be requested through a new
> parameter to alloc_unbound_xen_event_channel(), or derived from other
> than the default callback being requested?
> ---
> v3: Drain callbacks before proceeding with closing. Re-base.
>
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc
>
>      rchn->u.interdomain.remote_dom  = ld;
>      rchn->u.interdomain.remote_port = lport;
> +    atomic_set(&rchn->u.interdomain.active_calls, 0);
>      rchn->state = ECS_INTERDOMAIN;
>
>      /*
> @@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int
>
>          double_evtchn_lock(chn1, chn2);
>
> +        if ( consumer_is_xen(chn1) )
> +            while ( atomic_read(&chn1->u.interdomain.active_calls) )
> +                cpu_relax();
> +
>          evtchn_free(d1, chn1);
>
>          chn2->state = ECS_UNBOUND;
> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>          rport = lchn->u.interdomain.remote_port;
>          rchn  = evtchn_from_port(rd, rport);
>          if ( consumer_is_xen(rchn) )
> +        {
> +            /* Don't keep holding the lock for the call below. */
> +            atomic_inc(&rchn->u.interdomain.active_calls);
> +            evtchn_read_unlock(lchn);
>              xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
> -        else
> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
> +            atomic_dec(&rchn->u.interdomain.active_calls);
> +            return 0;
> +        }
> +        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>          break;
>      case ECS_IPI:
>          evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -104,6 +104,7 @@ struct evtchn
>          } unbound;     /* state == ECS_UNBOUND */
>          struct {
>              evtchn_port_t  remote_port;
> +            atomic_t       active_calls;
>              struct domain *remote_dom;
>          } interdomain; /* state == ECS_INTERDOMAIN */
>          struct {
>
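The pattern under review can be summarised outside of Xen as follows. The
sketch below is a minimal, self-contained model of the same idea - count a
caller in before dropping the lock, and have the teardown path drain the
count while holding the lock - expressed with C11 atomics and pthreads
rather than Xen's primitives. All names in it are illustrative stand-ins,
not Xen code.

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Illustrative stand-ins for the per-channel state and the in-Xen consumer. */
struct channel {
    pthread_rwlock_t lock;   /* plays the role of the per-channel lock */
    atomic_int active_calls; /* in-progress consumer callbacks */
    void (*notify)(void);    /* plays the role of xen_notification_fn() */
};

/* Sender side, mirroring the evtchn_send() hunk above. */
static void send_event(struct channel *ch)
{
    void (*fn)(void);

    pthread_rwlock_rdlock(&ch->lock);
    fn = ch->notify;
    if ( !fn )                            /* channel already torn down */
    {
        pthread_rwlock_unlock(&ch->lock);
        return;
    }
    atomic_fetch_add(&ch->active_calls, 1);
    pthread_rwlock_unlock(&ch->lock);     /* don't hold the lock for the call */

    fn();                                 /* safe: count is non-zero, so the
                                             closer cannot free consumer state */

    /* Release ordering: the callback's work completes before the decrement. */
    atomic_fetch_sub_explicit(&ch->active_calls, 1, memory_order_release);
}

/* Closer side, mirroring the evtchn_close() hunk above. */
static void close_channel(struct channel *ch)
{
    pthread_rwlock_wrlock(&ch->lock);
    while ( atomic_load(&ch->active_calls) ) /* drain in-progress callbacks */
        sched_yield();                       /* cpu_relax() in the hypervisor */
    ch->notify = NULL;                       /* consumer state may now be freed */
    pthread_rwlock_unlock(&ch->lock);
}

The essential property is that the count can only be incremented while the
lock is held, so once the closer owns the lock and observes the count at
zero, no callback is running and none can start.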
Hi Jan,

On 23/11/2020 13:30, Jan Beulich wrote:
> While there don't look to be any problems with this right now, the lock
> order implications from holding the lock can be very difficult to follow
> (and may be easy to violate unknowingly). The present callbacks don't
> (and no such callback should) have any need for the lock to be held.
>
> However, vm_event_disable() frees the structures used by respective
> callbacks and isn't otherwise synchronized with invocations of these
> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> to wait to drop to zero before freeing the port (and dropping the lock).

AFAICT, this callback is not the only place where the synchronization is
missing in the VM event code.

For instance, vm_event_put_request() can also race against
vm_event_disable().

So shouldn't we handle this issue properly in VM event?

>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Should we make this accounting optional, to be requested through a new
> parameter to alloc_unbound_xen_event_channel(), or derived from other
> than the default callback being requested?

Aside the VM event, do you see any value for the other caller?

> ---
> v3: Drain callbacks before proceeding with closing. Re-base.
>
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc
>
>      rchn->u.interdomain.remote_dom  = ld;
>      rchn->u.interdomain.remote_port = lport;
> +    atomic_set(&rchn->u.interdomain.active_calls, 0);
>      rchn->state = ECS_INTERDOMAIN;
>
>      /*
> @@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int
>
>          double_evtchn_lock(chn1, chn2);
>
> +        if ( consumer_is_xen(chn1) )
> +            while ( atomic_read(&chn1->u.interdomain.active_calls) )
> +                cpu_relax();
> +
>          evtchn_free(d1, chn1);
>
>          chn2->state = ECS_UNBOUND;
> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>          rport = lchn->u.interdomain.remote_port;
>          rchn  = evtchn_from_port(rd, rport);
>          if ( consumer_is_xen(rchn) )
> +        {
> +            /* Don't keep holding the lock for the call below. */
> +            atomic_inc(&rchn->u.interdomain.active_calls);
> +            evtchn_read_unlock(lchn);
>              xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
> -        else
> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);

atomic_dec() doesn't contain any memory barrier, so we will want one
between xen_notification_fn() and atomic_dec() to avoid re-ordering.

> +            atomic_dec(&rchn->u.interdomain.active_calls);
> +            return 0;
> +        }
> +        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>          break;
>      case ECS_IPI:
>          evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -104,6 +104,7 @@ struct evtchn
>          } unbound;     /* state == ECS_UNBOUND */
>          struct {
>              evtchn_port_t  remote_port;
> +            atomic_t       active_calls;
>              struct domain *remote_dom;
>          } interdomain; /* state == ECS_INTERDOMAIN */
>          struct {
>

Cheers,
On 02.12.2020 22:10, Julien Grall wrote:
> On 23/11/2020 13:30, Jan Beulich wrote:
>> While there don't look to be any problems with this right now, the lock
>> order implications from holding the lock can be very difficult to follow
>> (and may be easy to violate unknowingly). The present callbacks don't
>> (and no such callback should) have any need for the lock to be held.
>>
>> However, vm_event_disable() frees the structures used by respective
>> callbacks and isn't otherwise synchronized with invocations of these
>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>> to wait to drop to zero before freeing the port (and dropping the lock).
>
> AFAICT, this callback is not the only place where the synchronization is
> missing in the VM event code.
>
> For instance, vm_event_put_request() can also race against
> vm_event_disable().
>
> So shouldn't we handle this issue properly in VM event?

I suppose that's a question to the VM event folks rather than me?

>> ---
>> Should we make this accounting optional, to be requested through a new
>> parameter to alloc_unbound_xen_event_channel(), or derived from other
>> than the default callback being requested?
>
> Aside the VM event, do you see any value for the other caller?

No (albeit I'm not entirely certain about vpl011_notification()'s
needs), hence the consideration. It's unnecessary overhead in
those cases.

>> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>>          rport = lchn->u.interdomain.remote_port;
>>          rchn  = evtchn_from_port(rd, rport);
>>          if ( consumer_is_xen(rchn) )
>> +        {
>> +            /* Don't keep holding the lock for the call below. */
>> +            atomic_inc(&rchn->u.interdomain.active_calls);
>> +            evtchn_read_unlock(lchn);
>>              xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
>> -        else
>> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>
> atomic_dec() doesn't contain any memory barrier, so we will want one
> between xen_notification_fn() and atomic_dec() to avoid re-ordering.

Oh, indeed. But smp_mb() is too heavy handed here - x86 doesn't
really need any barrier, yet would gain a full MFENCE that way.
Actually - looks like I forgot we gained smp_mb__before_atomic()
a little over half a year ago.

Jan
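Put together, the hunk being discussed would then presumably read roughly
as below in a later revision. This is only a sketch inferred from the
exchange above; the exact barrier placement in the real v4 patch may differ.

        if ( consumer_is_xen(rchn) )
        {
            /* Don't keep holding the lock for the call below. */
            atomic_inc(&rchn->u.interdomain.active_calls);
            evtchn_read_unlock(lchn);

            xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);

            /*
             * Order the callback's accesses ahead of the decrement, so
             * evtchn_close() cannot observe a zero count while the callback
             * is still in progress.  Unlike smp_mb(), this avoids a full
             * MFENCE on x86, where atomic RMW operations are already ordered.
             */
            smp_mb__before_atomic();
            atomic_dec(&rchn->u.interdomain.active_calls);
            return 0;
        }
        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);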
On Thu, Dec 3, 2020 at 5:09 AM Jan Beulich <jbeulich@suse.com> wrote: > > On 02.12.2020 22:10, Julien Grall wrote: > > On 23/11/2020 13:30, Jan Beulich wrote: > >> While there don't look to be any problems with this right now, the lock > >> order implications from holding the lock can be very difficult to follow > >> (and may be easy to violate unknowingly). The present callbacks don't > >> (and no such callback should) have any need for the lock to be held. > >> > >> However, vm_event_disable() frees the structures used by respective > >> callbacks and isn't otherwise synchronized with invocations of these > >> callbacks, so maintain a count of in-progress calls, for evtchn_close() > >> to wait to drop to zero before freeing the port (and dropping the lock). > > > > AFAICT, this callback is not the only place where the synchronization is > > missing in the VM event code. > > > > For instance, vm_event_put_request() can also race against > > vm_event_disable(). > > > > So shouldn't we handle this issue properly in VM event? > > I suppose that's a question to the VM event folks rather than me? IMHO it would obviously be better if the Xen side could handle situations like these gracefully. OTOH it is also reasonable to expect the privileged toolstack to perform its own sanity checks before disabling. Like right now for disabling vm_event we first pause the VM, process all remaining events that were on the ring, and only then disable the interface. This avoids the race condition mentioned above (among other issues). It's not perfect - we ran into problems where event replies don't have the desired effect because the VM is paused - but for the most part doing this is sufficient. So I don't consider this to be a priority at the moment. That said, if anyone is so inclined to fix this up, I would be happy to review & ack. Tamas
Hi Jan, On 03/12/2020 10:09, Jan Beulich wrote: > On 02.12.2020 22:10, Julien Grall wrote: >> On 23/11/2020 13:30, Jan Beulich wrote: >>> While there don't look to be any problems with this right now, the lock >>> order implications from holding the lock can be very difficult to follow >>> (and may be easy to violate unknowingly). The present callbacks don't >>> (and no such callback should) have any need for the lock to be held. >>> >>> However, vm_event_disable() frees the structures used by respective >>> callbacks and isn't otherwise synchronized with invocations of these >>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>> to wait to drop to zero before freeing the port (and dropping the lock). >> >> AFAICT, this callback is not the only place where the synchronization is >> missing in the VM event code. >> >> For instance, vm_event_put_request() can also race against >> vm_event_disable(). >> >> So shouldn't we handle this issue properly in VM event? > > I suppose that's a question to the VM event folks rather than me? Yes. From my understanding of Tamas's e-mail, they are relying on the monitoring software to do the right thing. I will refrain to comment on this approach. However, given the race is much wider than the event channel, I would recommend to not add more code in the event channel to deal with such problem. Instead, this should be fixed in the VM event code when someone has time to harden the subsystem. > >>> --- >>> Should we make this accounting optional, to be requested through a new >>> parameter to alloc_unbound_xen_event_channel(), or derived from other >>> than the default callback being requested? >> >> Aside the VM event, do you see any value for the other caller? > > No (albeit I'm not entirely certain about vpl011_notification()'s > needs), hence the consideration. It's unnecessary overhead in > those cases. I had another look and I think there is a small race in VPL011. It should be easy to fix (I will try to have a look later today). > >>> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig >>> rport = lchn->u.interdomain.remote_port; >>> rchn = evtchn_from_port(rd, rport); >>> if ( consumer_is_xen(rchn) ) >>> + { >>> + /* Don't keep holding the lock for the call below. */ >>> + atomic_inc(&rchn->u.interdomain.active_calls); >>> + evtchn_read_unlock(lchn); >>> xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport); >>> - else >>> - evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn); >> >> atomic_dec() doesn't contain any memory barrier, so we will want one >> between xen_notification_fn() and atomic_dec() to avoid re-ordering. > > Oh, indeed. But smp_mb() is too heavy handed here - x86 doesn't > really need any barrier, yet would gain a full MFENCE that way. > Actually - looks like I forgot we gained smp_mb__before_atomic() > a little over half a year ago. Ah yes, I forgot that atomics instruction are ordered on x86. Cheers,
On 04.12.2020 12:28, Julien Grall wrote: > Hi Jan, > > On 03/12/2020 10:09, Jan Beulich wrote: >> On 02.12.2020 22:10, Julien Grall wrote: >>> On 23/11/2020 13:30, Jan Beulich wrote: >>>> While there don't look to be any problems with this right now, the lock >>>> order implications from holding the lock can be very difficult to follow >>>> (and may be easy to violate unknowingly). The present callbacks don't >>>> (and no such callback should) have any need for the lock to be held. >>>> >>>> However, vm_event_disable() frees the structures used by respective >>>> callbacks and isn't otherwise synchronized with invocations of these >>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>> to wait to drop to zero before freeing the port (and dropping the lock). >>> >>> AFAICT, this callback is not the only place where the synchronization is >>> missing in the VM event code. >>> >>> For instance, vm_event_put_request() can also race against >>> vm_event_disable(). >>> >>> So shouldn't we handle this issue properly in VM event? >> >> I suppose that's a question to the VM event folks rather than me? > > Yes. From my understanding of Tamas's e-mail, they are relying on the > monitoring software to do the right thing. > > I will refrain to comment on this approach. However, given the race is > much wider than the event channel, I would recommend to not add more > code in the event channel to deal with such problem. > > Instead, this should be fixed in the VM event code when someone has time > to harden the subsystem. Are effectively saying I should now undo the addition of the refcounting, which was added in response to feedback from you? Or else what exactly am I to take from your reply? Jan
On 04/12/2020 11:48, Jan Beulich wrote: > On 04.12.2020 12:28, Julien Grall wrote: >> Hi Jan, >> >> On 03/12/2020 10:09, Jan Beulich wrote: >>> On 02.12.2020 22:10, Julien Grall wrote: >>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>> While there don't look to be any problems with this right now, the lock >>>>> order implications from holding the lock can be very difficult to follow >>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>> (and no such callback should) have any need for the lock to be held. >>>>> >>>>> However, vm_event_disable() frees the structures used by respective >>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>> >>>> AFAICT, this callback is not the only place where the synchronization is >>>> missing in the VM event code. >>>> >>>> For instance, vm_event_put_request() can also race against >>>> vm_event_disable(). >>>> >>>> So shouldn't we handle this issue properly in VM event? >>> >>> I suppose that's a question to the VM event folks rather than me? >> >> Yes. From my understanding of Tamas's e-mail, they are relying on the >> monitoring software to do the right thing. >> >> I will refrain to comment on this approach. However, given the race is >> much wider than the event channel, I would recommend to not add more >> code in the event channel to deal with such problem. >> >> Instead, this should be fixed in the VM event code when someone has time >> to harden the subsystem. > > Are effectively saying I should now undo the addition of the > refcounting, which was added in response to feedback from you? Please point out where I made the request to use the refcounting... I pointed out there was an issue with the VM event code. This was latter analysed as a wider issue. The VM event folks doesn't seem to be very concerned on the race, so I don't see the reason to try to fix it in the event channel code. Cheers,
On 04.12.2020 12:51, Julien Grall wrote: > > > On 04/12/2020 11:48, Jan Beulich wrote: >> On 04.12.2020 12:28, Julien Grall wrote: >>> Hi Jan, >>> >>> On 03/12/2020 10:09, Jan Beulich wrote: >>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>>> While there don't look to be any problems with this right now, the lock >>>>>> order implications from holding the lock can be very difficult to follow >>>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>>> (and no such callback should) have any need for the lock to be held. >>>>>> >>>>>> However, vm_event_disable() frees the structures used by respective >>>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>>> >>>>> AFAICT, this callback is not the only place where the synchronization is >>>>> missing in the VM event code. >>>>> >>>>> For instance, vm_event_put_request() can also race against >>>>> vm_event_disable(). >>>>> >>>>> So shouldn't we handle this issue properly in VM event? >>>> >>>> I suppose that's a question to the VM event folks rather than me? >>> >>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>> monitoring software to do the right thing. >>> >>> I will refrain to comment on this approach. However, given the race is >>> much wider than the event channel, I would recommend to not add more >>> code in the event channel to deal with such problem. >>> >>> Instead, this should be fixed in the VM event code when someone has time >>> to harden the subsystem. >> >> Are effectively saying I should now undo the addition of the >> refcounting, which was added in response to feedback from you? > > Please point out where I made the request to use the refcounting... You didn't ask for this directly, sure, but ... > I pointed out there was an issue with the VM event code. ... this has ultimately led to the decision to use refcounting (iirc there was one alternative that I had proposed, besides the option of doing nothing). > This was latter > analysed as a wider issue. The VM event folks doesn't seem to be very > concerned on the race, so I don't see the reason to try to fix it in the > event channel code. And you won't need the refcount for vpl011 then? I can certainly drop it again, but it feels odd to go back to an earlier version under the circumstances ... Jan
Hi, On 04/12/2020 12:01, Jan Beulich wrote: > On 04.12.2020 12:51, Julien Grall wrote: >> >> >> On 04/12/2020 11:48, Jan Beulich wrote: >>> On 04.12.2020 12:28, Julien Grall wrote: >>>> Hi Jan, >>>> >>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>>>> While there don't look to be any problems with this right now, the lock >>>>>>> order implications from holding the lock can be very difficult to follow >>>>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>>>> (and no such callback should) have any need for the lock to be held. >>>>>>> >>>>>>> However, vm_event_disable() frees the structures used by respective >>>>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>>>> >>>>>> AFAICT, this callback is not the only place where the synchronization is >>>>>> missing in the VM event code. >>>>>> >>>>>> For instance, vm_event_put_request() can also race against >>>>>> vm_event_disable(). >>>>>> >>>>>> So shouldn't we handle this issue properly in VM event? >>>>> >>>>> I suppose that's a question to the VM event folks rather than me? >>>> >>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>> monitoring software to do the right thing. >>>> >>>> I will refrain to comment on this approach. However, given the race is >>>> much wider than the event channel, I would recommend to not add more >>>> code in the event channel to deal with such problem. >>>> >>>> Instead, this should be fixed in the VM event code when someone has time >>>> to harden the subsystem. >>> >>> Are effectively saying I should now undo the addition of the >>> refcounting, which was added in response to feedback from you? >> >> Please point out where I made the request to use the refcounting... > > You didn't ask for this directly, sure, but ... > >> I pointed out there was an issue with the VM event code. > > ... this has ultimately led to the decision to use refcounting > (iirc there was one alternative that I had proposed, besides > the option of doing nothing). One other option that was discussed (maybe only on security@xen.org) is to move the spinlock outside of the structure so it is always allocated. > >> This was latter >> analysed as a wider issue. The VM event folks doesn't seem to be very >> concerned on the race, so I don't see the reason to try to fix it in the >> event channel code. > > And you won't need the refcount for vpl011 then? I don't believe we need it for the vpl011 as the spin lock protecting the code should always be allocated. The problem today is the lock is initialized too late. > I can certainly > drop it again, but it feels odd to go back to an earlier version > under the circumstances ... The code introduced doesn't look necessary outside of the VM event code. So I think it would be wrong to merge it if it is just papering over a bigger problem. Cheers, > > Jan >
On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: > > Hi Jan, > > On 03/12/2020 10:09, Jan Beulich wrote: > > On 02.12.2020 22:10, Julien Grall wrote: > >> On 23/11/2020 13:30, Jan Beulich wrote: > >>> While there don't look to be any problems with this right now, the lock > >>> order implications from holding the lock can be very difficult to follow > >>> (and may be easy to violate unknowingly). The present callbacks don't > >>> (and no such callback should) have any need for the lock to be held. > >>> > >>> However, vm_event_disable() frees the structures used by respective > >>> callbacks and isn't otherwise synchronized with invocations of these > >>> callbacks, so maintain a count of in-progress calls, for evtchn_close() > >>> to wait to drop to zero before freeing the port (and dropping the lock). > >> > >> AFAICT, this callback is not the only place where the synchronization is > >> missing in the VM event code. > >> > >> For instance, vm_event_put_request() can also race against > >> vm_event_disable(). > >> > >> So shouldn't we handle this issue properly in VM event? > > > > I suppose that's a question to the VM event folks rather than me? > > Yes. From my understanding of Tamas's e-mail, they are relying on the > monitoring software to do the right thing. > > I will refrain to comment on this approach. However, given the race is > much wider than the event channel, I would recommend to not add more > code in the event channel to deal with such problem. > > Instead, this should be fixed in the VM event code when someone has time > to harden the subsystem. I double-checked and the disable route is actually more robust, we don't just rely on the toolstack doing the right thing. The domain gets paused before any calls to vm_event_disable. So I don't think there is really a race-condition here. Tamas
On 04/12/2020 15:21, Tamas K Lengyel wrote: > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: >> >> Hi Jan, >> >> On 03/12/2020 10:09, Jan Beulich wrote: >>> On 02.12.2020 22:10, Julien Grall wrote: >>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>> While there don't look to be any problems with this right now, the lock >>>>> order implications from holding the lock can be very difficult to follow >>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>> (and no such callback should) have any need for the lock to be held. >>>>> >>>>> However, vm_event_disable() frees the structures used by respective >>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>> >>>> AFAICT, this callback is not the only place where the synchronization is >>>> missing in the VM event code. >>>> >>>> For instance, vm_event_put_request() can also race against >>>> vm_event_disable(). >>>> >>>> So shouldn't we handle this issue properly in VM event? >>> >>> I suppose that's a question to the VM event folks rather than me? >> >> Yes. From my understanding of Tamas's e-mail, they are relying on the >> monitoring software to do the right thing. >> >> I will refrain to comment on this approach. However, given the race is >> much wider than the event channel, I would recommend to not add more >> code in the event channel to deal with such problem. >> >> Instead, this should be fixed in the VM event code when someone has time >> to harden the subsystem. > > I double-checked and the disable route is actually more robust, we > don't just rely on the toolstack doing the right thing. The domain > gets paused before any calls to vm_event_disable. So I don't think > there is really a race-condition here. The code will *only* pause the monitored domain. I can see two issues: 1) The toolstack is still sending event while destroy is happening. This is the race discussed here. 2) The implement of vm_event_put_request() suggests that it can be called with not-current domain. I don't see how just pausing the monitored domain is enough here. Cheers,
On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: > > > > On 04/12/2020 15:21, Tamas K Lengyel wrote: > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: > >> > >> Hi Jan, > >> > >> On 03/12/2020 10:09, Jan Beulich wrote: > >>> On 02.12.2020 22:10, Julien Grall wrote: > >>>> On 23/11/2020 13:30, Jan Beulich wrote: > >>>>> While there don't look to be any problems with this right now, the lock > >>>>> order implications from holding the lock can be very difficult to follow > >>>>> (and may be easy to violate unknowingly). The present callbacks don't > >>>>> (and no such callback should) have any need for the lock to be held. > >>>>> > >>>>> However, vm_event_disable() frees the structures used by respective > >>>>> callbacks and isn't otherwise synchronized with invocations of these > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() > >>>>> to wait to drop to zero before freeing the port (and dropping the lock). > >>>> > >>>> AFAICT, this callback is not the only place where the synchronization is > >>>> missing in the VM event code. > >>>> > >>>> For instance, vm_event_put_request() can also race against > >>>> vm_event_disable(). > >>>> > >>>> So shouldn't we handle this issue properly in VM event? > >>> > >>> I suppose that's a question to the VM event folks rather than me? > >> > >> Yes. From my understanding of Tamas's e-mail, they are relying on the > >> monitoring software to do the right thing. > >> > >> I will refrain to comment on this approach. However, given the race is > >> much wider than the event channel, I would recommend to not add more > >> code in the event channel to deal with such problem. > >> > >> Instead, this should be fixed in the VM event code when someone has time > >> to harden the subsystem. > > > > I double-checked and the disable route is actually more robust, we > > don't just rely on the toolstack doing the right thing. The domain > > gets paused before any calls to vm_event_disable. So I don't think > > there is really a race-condition here. > > The code will *only* pause the monitored domain. I can see two issues: > 1) The toolstack is still sending event while destroy is happening. > This is the race discussed here. > 2) The implement of vm_event_put_request() suggests that it can be > called with not-current domain. > > I don't see how just pausing the monitored domain is enough here. Requests only get generated by the monitored domain. So if the domain is not running you won't get more of them. The toolstack can only send replies. Tamas
On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@gmail.com> wrote: > > On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: > > > > > > > > On 04/12/2020 15:21, Tamas K Lengyel wrote: > > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: > > >> > > >> Hi Jan, > > >> > > >> On 03/12/2020 10:09, Jan Beulich wrote: > > >>> On 02.12.2020 22:10, Julien Grall wrote: > > >>>> On 23/11/2020 13:30, Jan Beulich wrote: > > >>>>> While there don't look to be any problems with this right now, the lock > > >>>>> order implications from holding the lock can be very difficult to follow > > >>>>> (and may be easy to violate unknowingly). The present callbacks don't > > >>>>> (and no such callback should) have any need for the lock to be held. > > >>>>> > > >>>>> However, vm_event_disable() frees the structures used by respective > > >>>>> callbacks and isn't otherwise synchronized with invocations of these > > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() > > >>>>> to wait to drop to zero before freeing the port (and dropping the lock). > > >>>> > > >>>> AFAICT, this callback is not the only place where the synchronization is > > >>>> missing in the VM event code. > > >>>> > > >>>> For instance, vm_event_put_request() can also race against > > >>>> vm_event_disable(). > > >>>> > > >>>> So shouldn't we handle this issue properly in VM event? > > >>> > > >>> I suppose that's a question to the VM event folks rather than me? > > >> > > >> Yes. From my understanding of Tamas's e-mail, they are relying on the > > >> monitoring software to do the right thing. > > >> > > >> I will refrain to comment on this approach. However, given the race is > > >> much wider than the event channel, I would recommend to not add more > > >> code in the event channel to deal with such problem. > > >> > > >> Instead, this should be fixed in the VM event code when someone has time > > >> to harden the subsystem. > > > > > > I double-checked and the disable route is actually more robust, we > > > don't just rely on the toolstack doing the right thing. The domain > > > gets paused before any calls to vm_event_disable. So I don't think > > > there is really a race-condition here. > > > > The code will *only* pause the monitored domain. I can see two issues: > > 1) The toolstack is still sending event while destroy is happening. > > This is the race discussed here. > > 2) The implement of vm_event_put_request() suggests that it can be > > called with not-current domain. > > > > I don't see how just pausing the monitored domain is enough here. > > Requests only get generated by the monitored domain. If that's the case, then why is vm_event_put_request() able to deal with a non-current domain? I understand that you are possibly trusting who may call it, but this looks quite fragile. Cheers, ---
On Fri, Dec 4, 2020 at 2:22 PM Julien Grall <julien.grall.oss@gmail.com> wrote: > > On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@gmail.com> wrote: > > > > On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: > > > > > > > > > > > > On 04/12/2020 15:21, Tamas K Lengyel wrote: > > > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: > > > >> > > > >> Hi Jan, > > > >> > > > >> On 03/12/2020 10:09, Jan Beulich wrote: > > > >>> On 02.12.2020 22:10, Julien Grall wrote: > > > >>>> On 23/11/2020 13:30, Jan Beulich wrote: > > > >>>>> While there don't look to be any problems with this right now, the lock > > > >>>>> order implications from holding the lock can be very difficult to follow > > > >>>>> (and may be easy to violate unknowingly). The present callbacks don't > > > >>>>> (and no such callback should) have any need for the lock to be held. > > > >>>>> > > > >>>>> However, vm_event_disable() frees the structures used by respective > > > >>>>> callbacks and isn't otherwise synchronized with invocations of these > > > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() > > > >>>>> to wait to drop to zero before freeing the port (and dropping the lock). > > > >>>> > > > >>>> AFAICT, this callback is not the only place where the synchronization is > > > >>>> missing in the VM event code. > > > >>>> > > > >>>> For instance, vm_event_put_request() can also race against > > > >>>> vm_event_disable(). > > > >>>> > > > >>>> So shouldn't we handle this issue properly in VM event? > > > >>> > > > >>> I suppose that's a question to the VM event folks rather than me? > > > >> > > > >> Yes. From my understanding of Tamas's e-mail, they are relying on the > > > >> monitoring software to do the right thing. > > > >> > > > >> I will refrain to comment on this approach. However, given the race is > > > >> much wider than the event channel, I would recommend to not add more > > > >> code in the event channel to deal with such problem. > > > >> > > > >> Instead, this should be fixed in the VM event code when someone has time > > > >> to harden the subsystem. > > > > > > > > I double-checked and the disable route is actually more robust, we > > > > don't just rely on the toolstack doing the right thing. The domain > > > > gets paused before any calls to vm_event_disable. So I don't think > > > > there is really a race-condition here. > > > > > > The code will *only* pause the monitored domain. I can see two issues: > > > 1) The toolstack is still sending event while destroy is happening. > > > This is the race discussed here. > > > 2) The implement of vm_event_put_request() suggests that it can be > > > called with not-current domain. > > > > > > I don't see how just pausing the monitored domain is enough here. > > > > Requests only get generated by the monitored domain. > > If that's the case, then why is vm_event_put_request() able to > deal with a non-current domain? > > I understand that you are possibly trusting who may call it, but this > looks quite fragile. I didn't write the system. You probably want to ask that question from the original author. Tamas
On 04.12.2020 16:09, Julien Grall wrote: > On 04/12/2020 12:01, Jan Beulich wrote: >> On 04.12.2020 12:51, Julien Grall wrote: >>> On 04/12/2020 11:48, Jan Beulich wrote: >>>> On 04.12.2020 12:28, Julien Grall wrote: >>>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>>> So shouldn't we handle this issue properly in VM event? >>>>>> >>>>>> I suppose that's a question to the VM event folks rather than me? >>>>> >>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>>> monitoring software to do the right thing. >>>>> >>>>> I will refrain to comment on this approach. However, given the race is >>>>> much wider than the event channel, I would recommend to not add more >>>>> code in the event channel to deal with such problem. >>>>> >>>>> Instead, this should be fixed in the VM event code when someone has time >>>>> to harden the subsystem. >>>> >>>> Are effectively saying I should now undo the addition of the >>>> refcounting, which was added in response to feedback from you? >>> >>> Please point out where I made the request to use the refcounting... >> >> You didn't ask for this directly, sure, but ... >> >>> I pointed out there was an issue with the VM event code. >> >> ... this has ultimately led to the decision to use refcounting >> (iirc there was one alternative that I had proposed, besides >> the option of doing nothing). > > One other option that was discussed (maybe only on security@xen.org) is > to move the spinlock outside of the structure so it is always allocated. Oh, right - forgot about that one, because that's nothing I would ever have taken on actually carrying out. >>> This was latter >>> analysed as a wider issue. The VM event folks doesn't seem to be very >>> concerned on the race, so I don't see the reason to try to fix it in the >>> event channel code. >> >> And you won't need the refcount for vpl011 then? > > I don't believe we need it for the vpl011 as the spin lock protecting > the code should always be allocated. The problem today is the lock is > initialized too late. > >> I can certainly >> drop it again, but it feels odd to go back to an earlier version >> under the circumstances ... > > The code introduced doesn't look necessary outside of the VM event code. > So I think it would be wrong to merge it if it is just papering over a > bigger problem. So to translate this to a clear course of action: You want me to go back to the earlier version by dropping the refcounting again? (I don't view this as "papering over" btw, but a tiny step towards a solution.) Jan
On 04.12.2020 20:15, Tamas K Lengyel wrote: > On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: >> On 04/12/2020 15:21, Tamas K Lengyel wrote: >>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: >>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>>>> While there don't look to be any problems with this right now, the lock >>>>>>> order implications from holding the lock can be very difficult to follow >>>>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>>>> (and no such callback should) have any need for the lock to be held. >>>>>>> >>>>>>> However, vm_event_disable() frees the structures used by respective >>>>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>>>> >>>>>> AFAICT, this callback is not the only place where the synchronization is >>>>>> missing in the VM event code. >>>>>> >>>>>> For instance, vm_event_put_request() can also race against >>>>>> vm_event_disable(). >>>>>> >>>>>> So shouldn't we handle this issue properly in VM event? >>>>> >>>>> I suppose that's a question to the VM event folks rather than me? >>>> >>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>> monitoring software to do the right thing. >>>> >>>> I will refrain to comment on this approach. However, given the race is >>>> much wider than the event channel, I would recommend to not add more >>>> code in the event channel to deal with such problem. >>>> >>>> Instead, this should be fixed in the VM event code when someone has time >>>> to harden the subsystem. >>> >>> I double-checked and the disable route is actually more robust, we >>> don't just rely on the toolstack doing the right thing. The domain >>> gets paused before any calls to vm_event_disable. So I don't think >>> there is really a race-condition here. >> >> The code will *only* pause the monitored domain. I can see two issues: >> 1) The toolstack is still sending event while destroy is happening. >> This is the race discussed here. >> 2) The implement of vm_event_put_request() suggests that it can be >> called with not-current domain. >> >> I don't see how just pausing the monitored domain is enough here. > > Requests only get generated by the monitored domain. So if the domain > is not running you won't get more of them. The toolstack can only send > replies. Julien, does this change your view on the refcounting added by the patch at the root of this sub-thread? Jan
On 07/12/2020 08:02, Jan Beulich wrote: > On 04.12.2020 16:09, Julien Grall wrote: >> On 04/12/2020 12:01, Jan Beulich wrote: >>> On 04.12.2020 12:51, Julien Grall wrote: >>>> On 04/12/2020 11:48, Jan Beulich wrote: >>>>> On 04.12.2020 12:28, Julien Grall wrote: >>>>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>>>> So shouldn't we handle this issue properly in VM event? >>>>>>> >>>>>>> I suppose that's a question to the VM event folks rather than me? >>>>>> >>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>>>> monitoring software to do the right thing. >>>>>> >>>>>> I will refrain to comment on this approach. However, given the race is >>>>>> much wider than the event channel, I would recommend to not add more >>>>>> code in the event channel to deal with such problem. >>>>>> >>>>>> Instead, this should be fixed in the VM event code when someone has time >>>>>> to harden the subsystem. >>>>> >>>>> Are effectively saying I should now undo the addition of the >>>>> refcounting, which was added in response to feedback from you? >>>> >>>> Please point out where I made the request to use the refcounting... >>> >>> You didn't ask for this directly, sure, but ... >>> >>>> I pointed out there was an issue with the VM event code. >>> >>> ... this has ultimately led to the decision to use refcounting >>> (iirc there was one alternative that I had proposed, besides >>> the option of doing nothing). >> >> One other option that was discussed (maybe only on security@xen.org) is >> to move the spinlock outside of the structure so it is always allocated. > > Oh, right - forgot about that one, because that's nothing I would > ever have taken on actually carrying out. > >>>> This was latter >>>> analysed as a wider issue. The VM event folks doesn't seem to be very >>>> concerned on the race, so I don't see the reason to try to fix it in the >>>> event channel code. >>> >>> And you won't need the refcount for vpl011 then? >> >> I don't believe we need it for the vpl011 as the spin lock protecting >> the code should always be allocated. The problem today is the lock is >> initialized too late. >> >>> I can certainly >>> drop it again, but it feels odd to go back to an earlier version >>> under the circumstances ... >> >> The code introduced doesn't look necessary outside of the VM event code. >> So I think it would be wrong to merge it if it is just papering over a >> bigger problem. > > So to translate this to a clear course of action: You want me to > go back to the earlier version by dropping the refcounting again? Yes. > (I don't view this as "papering over" btw, but a tiny step towards > a solution.) This is implying that the refcounting is part of the actual solution. I think you can solve it directly in the VM event code without touching the event channel code. Furthermore, I see no point to add code in the common code if the maintainers of the affected subsystem think there code is safe (I don't believe it is). Cheers,
Hi Jan, On 07/12/2020 15:28, Jan Beulich wrote: > On 04.12.2020 20:15, Tamas K Lengyel wrote: >> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: >>> On 04/12/2020 15:21, Tamas K Lengyel wrote: >>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: >>>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>>>>> While there don't look to be any problems with this right now, the lock >>>>>>>> order implications from holding the lock can be very difficult to follow >>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>>>>> (and no such callback should) have any need for the lock to be held. >>>>>>>> >>>>>>>> However, vm_event_disable() frees the structures used by respective >>>>>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>>>>> >>>>>>> AFAICT, this callback is not the only place where the synchronization is >>>>>>> missing in the VM event code. >>>>>>> >>>>>>> For instance, vm_event_put_request() can also race against >>>>>>> vm_event_disable(). >>>>>>> >>>>>>> So shouldn't we handle this issue properly in VM event? >>>>>> >>>>>> I suppose that's a question to the VM event folks rather than me? >>>>> >>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>>> monitoring software to do the right thing. >>>>> >>>>> I will refrain to comment on this approach. However, given the race is >>>>> much wider than the event channel, I would recommend to not add more >>>>> code in the event channel to deal with such problem. >>>>> >>>>> Instead, this should be fixed in the VM event code when someone has time >>>>> to harden the subsystem. >>>> >>>> I double-checked and the disable route is actually more robust, we >>>> don't just rely on the toolstack doing the right thing. The domain >>>> gets paused before any calls to vm_event_disable. So I don't think >>>> there is really a race-condition here. >>> >>> The code will *only* pause the monitored domain. I can see two issues: >>> 1) The toolstack is still sending event while destroy is happening. >>> This is the race discussed here. >>> 2) The implement of vm_event_put_request() suggests that it can be >>> called with not-current domain. >>> >>> I don't see how just pausing the monitored domain is enough here. >> >> Requests only get generated by the monitored domain. So if the domain >> is not running you won't get more of them. The toolstack can only send >> replies. > > Julien, > > does this change your view on the refcounting added by the patch > at the root of this sub-thread? I still think the code is at best fragile. One example I can find is: -> guest_remove_page() -> p2m_mem_paging_drop_page() -> vm_event_put_request() guest_remove_page() is not always call on the current domain. So there are possibility for vm_event_put_request() to happen on a foreign domain and therefore wouldn't be protected by the current hypercall. Anyway, I don't think the refcounting should be part of the event channel without any idea on how this would fit in fixing the VM event race. Cheers,
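To spell out the window Julien is describing: vm_event_put_request() can be
reached on behalf of a domain other than current (e.g. via the
guest_remove_page() -> p2m_mem_paging_drop_page() chain above) while the
toolstack is tearing the interface down, so pausing only the monitored
domain does not close the race. The fragment below is a deliberately
simplified model of that interleaving; the structure and function names are
stand-ins, not the real vm_event code, where the freed object additionally
contains the lock itself.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct vm_event_domain: a lock and a ring page. */
struct vm_event_ring {
    pthread_mutex_t lock;
    void *ring_page;
};

/* Stand-in for vm_event_put_request(), possibly running for a foreign domain. */
void put_request(struct vm_event_ring *ved, const void *req, size_t len)
{
    if ( !ved->ring_page )             /* (1) check passes ...                */
        return;

    pthread_mutex_lock(&ved->lock);    /* (3) ... but by now the page (and in */
    memcpy(ved->ring_page, req, len);  /*     the real code the whole struct) */
    pthread_mutex_unlock(&ved->lock);  /*     may already have been freed     */
}

/* Stand-in for vm_event_disable(): only the monitored domain is paused. */
void disable(struct vm_event_ring *ved)
{
    pthread_mutex_lock(&ved->lock);
    free(ved->ring_page);              /* (2) can run between (1) and (3)     */
    ved->ring_page = NULL;
    pthread_mutex_unlock(&ved->lock);
}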
On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote: > > Hi Jan, > > On 07/12/2020 15:28, Jan Beulich wrote: > > On 04.12.2020 20:15, Tamas K Lengyel wrote: > >> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: > >>> On 04/12/2020 15:21, Tamas K Lengyel wrote: > >>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: > >>>>> On 03/12/2020 10:09, Jan Beulich wrote: > >>>>>> On 02.12.2020 22:10, Julien Grall wrote: > >>>>>>> On 23/11/2020 13:30, Jan Beulich wrote: > >>>>>>>> While there don't look to be any problems with this right now, the lock > >>>>>>>> order implications from holding the lock can be very difficult to follow > >>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't > >>>>>>>> (and no such callback should) have any need for the lock to be held. > >>>>>>>> > >>>>>>>> However, vm_event_disable() frees the structures used by respective > >>>>>>>> callbacks and isn't otherwise synchronized with invocations of these > >>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() > >>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock). > >>>>>>> > >>>>>>> AFAICT, this callback is not the only place where the synchronization is > >>>>>>> missing in the VM event code. > >>>>>>> > >>>>>>> For instance, vm_event_put_request() can also race against > >>>>>>> vm_event_disable(). > >>>>>>> > >>>>>>> So shouldn't we handle this issue properly in VM event? > >>>>>> > >>>>>> I suppose that's a question to the VM event folks rather than me? > >>>>> > >>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the > >>>>> monitoring software to do the right thing. > >>>>> > >>>>> I will refrain to comment on this approach. However, given the race is > >>>>> much wider than the event channel, I would recommend to not add more > >>>>> code in the event channel to deal with such problem. > >>>>> > >>>>> Instead, this should be fixed in the VM event code when someone has time > >>>>> to harden the subsystem. > >>>> > >>>> I double-checked and the disable route is actually more robust, we > >>>> don't just rely on the toolstack doing the right thing. The domain > >>>> gets paused before any calls to vm_event_disable. So I don't think > >>>> there is really a race-condition here. > >>> > >>> The code will *only* pause the monitored domain. I can see two issues: > >>> 1) The toolstack is still sending event while destroy is happening. > >>> This is the race discussed here. > >>> 2) The implement of vm_event_put_request() suggests that it can be > >>> called with not-current domain. > >>> > >>> I don't see how just pausing the monitored domain is enough here. > >> > >> Requests only get generated by the monitored domain. So if the domain > >> is not running you won't get more of them. The toolstack can only send > >> replies. > > > > Julien, > > > > does this change your view on the refcounting added by the patch > > at the root of this sub-thread? > > I still think the code is at best fragile. One example I can find is: > > -> guest_remove_page() > -> p2m_mem_paging_drop_page() > -> vm_event_put_request() > > guest_remove_page() is not always call on the current domain. So there > are possibility for vm_event_put_request() to happen on a foreign domain > and therefore wouldn't be protected by the current hypercall. > > Anyway, I don't think the refcounting should be part of the event > channel without any idea on how this would fit in fixing the VM event race. 
If the problematic patterns only appear with mem_paging I would suggest just removing the mem_paging code completely. It's been abandoned for several years now. Tamas
On 07.12.2020 18:35, Tamas K Lengyel wrote: > On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote: >> >> Hi Jan, >> >> On 07/12/2020 15:28, Jan Beulich wrote: >>> On 04.12.2020 20:15, Tamas K Lengyel wrote: >>>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: >>>>> On 04/12/2020 15:21, Tamas K Lengyel wrote: >>>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: >>>>>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>>>>>>> While there don't look to be any problems with this right now, the lock >>>>>>>>>> order implications from holding the lock can be very difficult to follow >>>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>>>>>>> (and no such callback should) have any need for the lock to be held. >>>>>>>>>> >>>>>>>>>> However, vm_event_disable() frees the structures used by respective >>>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>>>>>>> >>>>>>>>> AFAICT, this callback is not the only place where the synchronization is >>>>>>>>> missing in the VM event code. >>>>>>>>> >>>>>>>>> For instance, vm_event_put_request() can also race against >>>>>>>>> vm_event_disable(). >>>>>>>>> >>>>>>>>> So shouldn't we handle this issue properly in VM event? >>>>>>>> >>>>>>>> I suppose that's a question to the VM event folks rather than me? >>>>>>> >>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>>>>> monitoring software to do the right thing. >>>>>>> >>>>>>> I will refrain to comment on this approach. However, given the race is >>>>>>> much wider than the event channel, I would recommend to not add more >>>>>>> code in the event channel to deal with such problem. >>>>>>> >>>>>>> Instead, this should be fixed in the VM event code when someone has time >>>>>>> to harden the subsystem. >>>>>> >>>>>> I double-checked and the disable route is actually more robust, we >>>>>> don't just rely on the toolstack doing the right thing. The domain >>>>>> gets paused before any calls to vm_event_disable. So I don't think >>>>>> there is really a race-condition here. >>>>> >>>>> The code will *only* pause the monitored domain. I can see two issues: >>>>> 1) The toolstack is still sending event while destroy is happening. >>>>> This is the race discussed here. >>>>> 2) The implement of vm_event_put_request() suggests that it can be >>>>> called with not-current domain. >>>>> >>>>> I don't see how just pausing the monitored domain is enough here. >>>> >>>> Requests only get generated by the monitored domain. So if the domain >>>> is not running you won't get more of them. The toolstack can only send >>>> replies. >>> >>> Julien, >>> >>> does this change your view on the refcounting added by the patch >>> at the root of this sub-thread? >> >> I still think the code is at best fragile. One example I can find is: >> >> -> guest_remove_page() >> -> p2m_mem_paging_drop_page() >> -> vm_event_put_request() >> >> guest_remove_page() is not always call on the current domain. So there >> are possibility for vm_event_put_request() to happen on a foreign domain >> and therefore wouldn't be protected by the current hypercall. 
>> >> Anyway, I don't think the refcounting should be part of the event >> channel without any idea on how this would fit in fixing the VM event race. > > If the problematic patterns only appear with mem_paging I would > suggest just removing the mem_paging code completely. It's been > abandoned for several years now. Since this is nothing I'm fancying doing, the way forward here needs to be a different one. From the input by both of you I still can't conclude whether this patch should remain as is in v4, or revert back to its v2 version. Please can we get this settled so I can get v4 out? Thanks, Jan
On 23/12/2020 13:12, Jan Beulich wrote: > On 07.12.2020 18:35, Tamas K Lengyel wrote: >> On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote: >>> >>> Hi Jan, >>> >>> On 07/12/2020 15:28, Jan Beulich wrote: >>>> On 04.12.2020 20:15, Tamas K Lengyel wrote: >>>>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote: >>>>>> On 04/12/2020 15:21, Tamas K Lengyel wrote: >>>>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote: >>>>>>>> On 03/12/2020 10:09, Jan Beulich wrote: >>>>>>>>> On 02.12.2020 22:10, Julien Grall wrote: >>>>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote: >>>>>>>>>>> While there don't look to be any problems with this right now, the lock >>>>>>>>>>> order implications from holding the lock can be very difficult to follow >>>>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't >>>>>>>>>>> (and no such callback should) have any need for the lock to be held. >>>>>>>>>>> >>>>>>>>>>> However, vm_event_disable() frees the structures used by respective >>>>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these >>>>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close() >>>>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock). >>>>>>>>>> >>>>>>>>>> AFAICT, this callback is not the only place where the synchronization is >>>>>>>>>> missing in the VM event code. >>>>>>>>>> >>>>>>>>>> For instance, vm_event_put_request() can also race against >>>>>>>>>> vm_event_disable(). >>>>>>>>>> >>>>>>>>>> So shouldn't we handle this issue properly in VM event? >>>>>>>>> >>>>>>>>> I suppose that's a question to the VM event folks rather than me? >>>>>>>> >>>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the >>>>>>>> monitoring software to do the right thing. >>>>>>>> >>>>>>>> I will refrain to comment on this approach. However, given the race is >>>>>>>> much wider than the event channel, I would recommend to not add more >>>>>>>> code in the event channel to deal with such problem. >>>>>>>> >>>>>>>> Instead, this should be fixed in the VM event code when someone has time >>>>>>>> to harden the subsystem. >>>>>>> >>>>>>> I double-checked and the disable route is actually more robust, we >>>>>>> don't just rely on the toolstack doing the right thing. The domain >>>>>>> gets paused before any calls to vm_event_disable. So I don't think >>>>>>> there is really a race-condition here. >>>>>> >>>>>> The code will *only* pause the monitored domain. I can see two issues: >>>>>> 1) The toolstack is still sending event while destroy is happening. >>>>>> This is the race discussed here. >>>>>> 2) The implement of vm_event_put_request() suggests that it can be >>>>>> called with not-current domain. >>>>>> >>>>>> I don't see how just pausing the monitored domain is enough here. >>>>> >>>>> Requests only get generated by the monitored domain. So if the domain >>>>> is not running you won't get more of them. The toolstack can only send >>>>> replies. >>>> >>>> Julien, >>>> >>>> does this change your view on the refcounting added by the patch >>>> at the root of this sub-thread? >>> >>> I still think the code is at best fragile. One example I can find is: >>> >>> -> guest_remove_page() >>> -> p2m_mem_paging_drop_page() >>> -> vm_event_put_request() >>> >>> guest_remove_page() is not always call on the current domain. 
So there >>> are possibility for vm_event_put_request() to happen on a foreign domain >>> and therefore wouldn't be protected by the current hypercall. >>> >>> Anyway, I don't think the refcounting should be part of the event >>> channel without any idea on how this would fit in fixing the VM event race. >> >> If the problematic patterns only appear with mem_paging I would >> suggest just removing the mem_paging code completely. It's been >> abandoned for several years now. > > Since this is nothing I'm fancying doing, the way forward here needs > to be a different one. From the input by both of you I still can't > conclude whether this patch should remain as is in v4, or revert > back to its v2 version. Please can we get this settled so I can get > v4 out? I haven't had time to investigate the rest of the VM event code to find other cases where this may happen. I still think there is a bigger problem in the VM event code, but the maintainer disagrees here. At which point, I see limited reason to try to paper over this in the common code. So I would rather ack/merge v2 than v3. Cheers,
On 23.12.2020 14:33, Julien Grall wrote: > On 23/12/2020 13:12, Jan Beulich wrote: >> From the input by both of you I still can't >> conclude whether this patch should remain as is in v4, or revert >> back to its v2 version. Please can we get this settled so I can get >> v4 out? > > I haven't had time to investigate the rest of the VM event code to find > other cases where this may happen. I still think there is a bigger > problem in the VM event code, but the maintainer disagrees here. > > At which point, I see limited reason to try to paper over in the common > code. So I would rather ack/merge v2 rather than v3. Since I expect Tamas and/or the Bitdefender folks to be of the opposite opinion, there's still no way out, at least if "rather ack" implies a nak for v3. Personally, if this expectation of mine is correct, I'd prefer to keep the accounting but make it optional (as suggested in a post-commit-message remark). That'll eliminate the overhead you appear to be concerned of, but of course it'll further complicate the logic (albeit just slightly). Jan
On 23/12/2020 13:41, Jan Beulich wrote: > On 23.12.2020 14:33, Julien Grall wrote: >> On 23/12/2020 13:12, Jan Beulich wrote: >>> From the input by both of you I still can't >>> conclude whether this patch should remain as is in v4, or revert >>> back to its v2 version. Please can we get this settled so I can get >>> v4 out? >> >> I haven't had time to investigate the rest of the VM event code to find >> other cases where this may happen. I still think there is a bigger >> problem in the VM event code, but the maintainer disagrees here. >> >> At which point, I see limited reason to try to paper over in the common >> code. So I would rather ack/merge v2 rather than v3. > > Since I expect Tamas and/or the Bitdefender folks to be of the > opposite opinion, there's still no way out, at least if "rather > ack" implies a nak for v3. The only way out here is for someone to justify why this patch is sufficient for the VM event race. I am not convinced it is (see more below). > Personally, if this expectation of > mine is correct, I'd prefer to keep the accounting but make it > optional (as suggested in a post-commit-message remark). > That'll eliminate the overhead you appear to be concerned of, > but of course it'll further complicate the logic (albeit just > slightly). I am more concerned about adding overly complex code that would just paper over a bigger problem. I also can't see a use for it outside of the VM event discussion. I had another look at the code. As I mentioned in the past, vm_event_put_request() is able to deal with d != current->domain (it will set VM_EVENT_FLAG_FOREIGN). There are 4 callers for the function: 1) p2m_mem_paging_drop_page() 2) p2m_mem_paging_populate() 3) mem_sharing_notify_enomem() 4) monitor_traps() 1) and 2) belong to the mem paging subsystem. Tamas suggested that it was abandoned. 4) can only be called with the current domain. This leaves us with 3), in the mem sharing subsystem. As this is called from the memory hypercalls, it looks possible to me that d != current->domain. The code also seems to be maintained (there were recent non-trivial changes). Can one of the VM event developers come up with a justification why this patch is enough to make the VM event subsystem safe? FAOD, the justification should be solely based on the hypervisor code (IOW not external trusted software). Cheers,
On 23.12.2020 15:44, Julien Grall wrote: > On 23/12/2020 13:41, Jan Beulich wrote: >> On 23.12.2020 14:33, Julien Grall wrote: >>> On 23/12/2020 13:12, Jan Beulich wrote: >>>> From the input by both of you I still can't >>>> conclude whether this patch should remain as is in v4, or revert >>>> back to its v2 version. Please can we get this settled so I can get >>>> v4 out? >>> >>> I haven't had time to investigate the rest of the VM event code to find >>> other cases where this may happen. I still think there is a bigger >>> problem in the VM event code, but the maintainer disagrees here. >>> >>> At which point, I see limited reason to try to paper over in the common >>> code. So I would rather ack/merge v2 rather than v3. >> >> Since I expect Tamas and/or the Bitdefender folks to be of the >> opposite opinion, there's still no way out, at least if "rather >> ack" implies a nak for v3. > > The only way out here is for someone to justify why this patch is > sufficient for the VM event race. I think this is too much to demand. Moving in the right direction without arriving at the final destination is still a worthwhile thing to do imo. >> Personally, if this expectation of >> mine is correct, I'd prefer to keep the accounting but make it >> optional (as suggested in a post-commit-message remark). >> That'll eliminate the overhead you appear to be concerned of, >> but of course it'll further complicate the logic (albeit just >> slightly). > > I am more concerned about adding over complex code that would just > papering over a bigger problem. I also can't see use of it outside of > the VM event discussion. I think it is generally appropriate to wait for callbacks to complete before tearing down their origin control structure. There may be cases where code structure makes this unnecessary, but I don't think this can be an expectation for all users of the functionality. Hence my suggestion to possibly make this optional (driven directly or indirectly by the user of the registration function). Jan
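To make the principle argued for above concrete, here is a minimal standalone sketch (plain C11 with pthreads, not Xen code; the structure, the mutex and all names are illustrative stand-ins) of draining in-flight callback invocations before freeing the state they use, roughly mirroring what the patch does with the per-channel lock and active_calls:

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* One registered callback plus the state it operates on. */
struct hook {
    pthread_mutex_t lock;      /* held only around lookup and teardown */
    void (*fn)(void *arg);
    void *arg;                 /* freed by hook_teardown() */
    atomic_int active_calls;   /* invocations currently in flight */
};

static void hook_invoke(struct hook *h)
{
    void (*fn)(void *);
    void *arg;

    pthread_mutex_lock(&h->lock);
    fn = h->fn;
    arg = h->arg;
    if ( !fn )                             /* already torn down */
    {
        pthread_mutex_unlock(&h->lock);
        return;
    }
    atomic_fetch_add(&h->active_calls, 1); /* counted while still locked */
    pthread_mutex_unlock(&h->lock);        /* don't hold the lock across the call */

    fn(arg);

    atomic_fetch_sub(&h->active_calls, 1);
}

static void hook_teardown(struct hook *h)
{
    pthread_mutex_lock(&h->lock);
    h->fn = NULL;                          /* no new invocations can start */

    /* Drain: wait for in-flight invocations before freeing their state. */
    while ( atomic_load(&h->active_calls) )
        ;                                  /* a cpu_relax()-style pause belongs here */

    free(h->arg);
    h->arg = NULL;
    pthread_mutex_unlock(&h->lock);
}

The teardown side spins while holding the lock, which is tolerable only because invokers hold it just long enough to bump the counter, the same trade-off the patch makes between evtchn_send() and evtchn_close().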
Hi Jan, On 23/12/2020 14:56, Jan Beulich wrote: > On 23.12.2020 15:44, Julien Grall wrote: >> On 23/12/2020 13:41, Jan Beulich wrote: >>> On 23.12.2020 14:33, Julien Grall wrote: >>>> On 23/12/2020 13:12, Jan Beulich wrote: >>>>> From the input by both of you I still can't >>>>> conclude whether this patch should remain as is in v4, or revert >>>>> back to its v2 version. Please can we get this settled so I can get >>>>> v4 out? >>>> >>>> I haven't had time to investigate the rest of the VM event code to find >>>> other cases where this may happen. I still think there is a bigger >>>> problem in the VM event code, but the maintainer disagrees here. >>>> >>>> At which point, I see limited reason to try to paper over in the common >>>> code. So I would rather ack/merge v2 rather than v3. >>> >>> Since I expect Tamas and/or the Bitdefender folks to be of the >>> opposite opinion, there's still no way out, at least if "rather >>> ack" implies a nak for v3. >> >> The only way out here is for someone to justify why this patch is >> sufficient for the VM event race. > > I think this is too much you demand. I guess you didn't notice that I did most of the work by providing an analysis in my e-mail... I don't think it is too demanding to read the analysis and say whether I am correct or not. Do you really prefer to add code that would rot because it is unused? > Moving in the right direction > without arriving at the final destination is still a worthwhile > thing to do imo. > >>> Personally, if this expectation of >>> mine is correct, I'd prefer to keep the accounting but make it >>> optional (as suggested in a post-commit-message remark). >>> That'll eliminate the overhead you appear to be concerned of, >>> but of course it'll further complicate the logic (albeit just >>> slightly). >> >> I am more concerned about adding over complex code that would just >> papering over a bigger problem. I also can't see use of it outside of >> the VM event discussion. > > I think it is a generally appropriate thing to do to wait for > callbacks to complete before tearing down their origin control > structure. There may be cases where code structure makes this > unnecessary, but I don't think this can be an expectation to > all the users of the functionality. Hence my suggestion to > possibly make this optional (driven directly or indirectly by > the user of the registration function). As you tend to say, we should not add code unless there is a user. So far the only possible user is dubious. If you find another user, then we can discuss whether this patch makes sense. Cheers,
On Wed, Dec 23, 2020 at 9:44 AM Julien Grall <julien@xen.org> wrote: > > > > On 23/12/2020 13:41, Jan Beulich wrote: > > On 23.12.2020 14:33, Julien Grall wrote: > >> On 23/12/2020 13:12, Jan Beulich wrote: > >>> From the input by both of you I still can't > >>> conclude whether this patch should remain as is in v4, or revert > >>> back to its v2 version. Please can we get this settled so I can get > >>> v4 out? > >> > >> I haven't had time to investigate the rest of the VM event code to find > >> other cases where this may happen. I still think there is a bigger > >> problem in the VM event code, but the maintainer disagrees here. > >> > >> At which point, I see limited reason to try to paper over in the common > >> code. So I would rather ack/merge v2 rather than v3. > > > > Since I expect Tamas and/or the Bitdefender folks to be of the > > opposite opinion, there's still no way out, at least if "rather > > ack" implies a nak for v3. > > The only way out here is for someone to justify why this patch is > sufficient for the VM event race. I am not convinced it is (see more below). > > > Personally, if this expectation of > > mine is correct, I'd prefer to keep the accounting but make it > > optional (as suggested in a post-commit-message remark). > > That'll eliminate the overhead you appear to be concerned of, > > but of course it'll further complicate the logic (albeit just > > slightly). > > I am more concerned about adding over complex code that would just > papering over a bigger problem. I also can't see use of it outside of > the VM event discussion. > > I had another look at the code. As I mentioned in the past, > vm_put_event_request() is able to deal with d != current->domain (it > will set VM_EVENT_FLAG_FOREIGN). There are 4 callers for the function: > 1) p2m_mem_paging_drop_page() > 2) p2m_mem_paging_populate() > 3) mem_sharing_notify_enomem() > 4) monitor_traps() > > 1) and 2) belongs to the mem paging subsystem. Tamas suggested that it > was abandoned. > > 4) can only be called with the current domain. > > This leave us 3) in the mem sharing subsystem. As this is call the > memory hypercalls, it looks possible to me that d != current->domain. > The code also seems to be maintained (there were recent non-trivial > changes). > > Can one of the VM event developper come up with a justification why this > patch enough to make the VM event subsystem safe? 3) is an unused feature as well that likely should be dropped at some point. It can also only be called with current->domain; it effectively just signals an out-of-memory error to a vm_event listener in dom0, i.e. that populating an entry for the VM that took the EPT fault has failed. I guess the idea was that the dom0 agent would be able to make a decision on how to proceed (i.e. which VM to kill to free up memory).
--- a/xen/common/event_channel.c +++ b/xen/common/event_channel.c @@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc rchn->u.interdomain.remote_dom = ld; rchn->u.interdomain.remote_port = lport; + atomic_set(&rchn->u.interdomain.active_calls, 0); rchn->state = ECS_INTERDOMAIN; /* @@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int double_evtchn_lock(chn1, chn2); + if ( consumer_is_xen(chn1) ) + while ( atomic_read(&chn1->u.interdomain.active_calls) ) + cpu_relax(); + evtchn_free(d1, chn1); chn2->state = ECS_UNBOUND; @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig rport = lchn->u.interdomain.remote_port; rchn = evtchn_from_port(rd, rport); if ( consumer_is_xen(rchn) ) + { + /* Don't keep holding the lock for the call below. */ + atomic_inc(&rchn->u.interdomain.active_calls); + evtchn_read_unlock(lchn); xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport); - else - evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn); + atomic_dec(&rchn->u.interdomain.active_calls); + return 0; + } + evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn); break; case ECS_IPI: evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn); --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -104,6 +104,7 @@ struct evtchn } unbound; /* state == ECS_UNBOUND */ struct { evtchn_port_t remote_port; + atomic_t active_calls; struct domain *remote_dom; } interdomain; /* state == ECS_INTERDOMAIN */ struct {
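Taken in isolation, the send-side sequence in the hunk above is: bump the counter while the channel is still locked, drop the lock, make the call, decrement. Rendered as a standalone sketch with C11 atomics (Xen's own atomic_t primitives and their ordering guarantees are a separate matter), the decrement that marks completion pairs release/acquire with the drain loop, so a closer that observes zero also observes everything the callback did:

#include <stdatomic.h>

static atomic_int active_calls;            /* stand-in for u.interdomain.active_calls */

/* Send side: unlock() stands for dropping the per-channel lock. */
static void notify_one(void (*fn)(void *), void *arg, void (*unlock)(void))
{
    atomic_fetch_add(&active_calls, 1);    /* still under the lock */
    unlock();
    fn(arg);
    atomic_fetch_sub_explicit(&active_calls, 1, memory_order_release);
}

/* Close side: runs with the lock held, pairs with the release above. */
static void drain(void)
{
    while ( atomic_load_explicit(&active_calls, memory_order_acquire) )
        ;                                  /* cpu_relax()-style pause */
}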
While there don't look to be any problems with this right now, the lock order implications from holding the lock can be very difficult to follow (and may be easy to violate unknowingly). The present callbacks don't (and no such callback should) have any need for the lock to be held. However, vm_event_disable() frees the structures used by respective callbacks and isn't otherwise synchronized with invocations of these callbacks, so maintain a count of in-progress calls, for evtchn_close() to wait to drop to zero before freeing the port (and dropping the lock). Signed-off-by: Jan Beulich <jbeulich@suse.com> --- Should we make this accounting optional, to be requested through a new parameter to alloc_unbound_xen_event_channel(), or derived from other than the default callback being requested? --- v3: Drain callbacks before proceeding with closing. Re-base.
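As a usage note rather than anything the patch itself adds: with the drain in evtchn_close() in place, a Xen-internal consumer that wants to free the state its notification callback dereferences only has to order "close the port" before "free the state". A sketch with entirely hypothetical names (struct consumer and consumer_close_port() do not exist in Xen):

#include <stdlib.h>

struct consumer {
    int port;
    void *callback_state;          /* what the notification callback touches */
};

/* Stand-in for whatever path reaches evtchn_close() for this port; with the
 * patch, that path waits for active_calls to reach zero before the port is
 * freed, so no callback for it is still running once this returns. */
static void consumer_close_port(struct consumer *c)
{
    (void)c->port;
}

static void consumer_teardown(struct consumer *c)
{
    consumer_close_port(c);        /* drains in-flight callbacks */
    free(c->callback_state);       /* safe only once the close has drained */
    c->callback_state = NULL;
}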