
xen/pvcalls: fix potential endless loop in pvcalls-front.c

Message ID alpine.DEB.2.10.1711151319250.12676@sstabellini-ThinkPad-X260 (mailing list archive)
State New, archived

Commit Message

Stefano Stabellini Nov. 15, 2017, 9:20 p.m. UTC
On Wed, 15 Nov 2017, Boris Ostrovsky wrote:
> On 11/15/2017 02:09 PM, Stefano Stabellini wrote:
> > On Wed, 15 Nov 2017, Juergen Gross wrote:
> >>>>>         while(mutex_is_locked(&map->active.in_mutex.owner) ||
> >>>>>               mutex_is_locked(&map->active.out_mutex.owner))
> >>>>>                 cpu_relax();
> >>>>>
> >>>>> ?
> >>>> I'm not convinced there isn't a race.
> >>>>
> >>>> In pvcalls_front_recvmsg() sock->sk->sk_send_head is being read and only
> >>>> then in_mutex is taken. What happens if pvcalls_front_release() resets
> >>>> sk_send_head and manages to test the mutex before the mutex is locked?
> >>>>
> >>>> Even in case this is impossible: the whole construct seems to be rather
> >>>> fragile.
> > I agree it looks fragile, and I agree that it might be best to avoid the
> > usage of in_mutex and out_mutex as refcounts. More comments on this
> > below.
> >
> >  
> >>> I think we can wait until pvcalls_refcount is 1 (i.e. it's only us) and
> >>> not rely on mutex state.
> >> Yes, this would work.
> > Yes, I agree it would work and for the sake of getting something in
> > shape for the merge window I am attaching a patch for it. Please go
> > ahead with it. Let me know if you need anything else immediately, and
> > I'll work on it ASAP.
> >
> >
> >
> > However, I should note that this is a pretty big hammer we are using:
> > the refcount is global, while we only need to wait until it's only us
> > _on this specific socket_.
> 
> Can you explain why socket is important?

Yes, of course: there are going to be many open sockets on a given
pvcalls connection. pvcalls_refcount is global: waiting on
pvcalls_refcount means waiting until operations on any unrelated
sockets have stopped, while we only need to wait until the operations on
the one socket we want to close have stopped.
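
For illustration only, a per-socket refcount could look roughly like the
sketch below. The field and the enter/exit steps are hypothetical
placeholders (this is not the appended patch), and memory barriers are
omitted:

	/* hypothetical sketch of a per-socket refcount; names are made up */
	struct sock_mapping {
		/* ... existing fields ... */
		atomic_t sock_refcount;		/* users of this one socket */
	};

	/* sendmsg/recvmsg would take a reference before touching the ring: */
	atomic_inc(&map->sock_refcount);
	if (sock->sk->sk_send_head == NULL) {	/* release already started */
		atomic_dec(&map->sock_refcount);
		return -ENOTCONN;
	}
	/* ... do the actual send/recv ... */
	atomic_dec(&map->sock_refcount);

	/* pvcalls_front_release() would then only wait for this socket: */
	while (atomic_read(&map->sock_refcount) > 0)
		cpu_relax();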


> >
> > We really need a per socket refcount. If we don't want to use the mutex
> > internal counters, then we need another one.
> >
> > See the appended patch that introduces a per socket refcount. However,
> > for the merge window, also using pvcalls_refcount is fine.
> >
> > The race Juergen is concerned about is only theoretically possible:
> >
> > recvmsg:                                 release:
> >   
> >   test sk_send_head                      clear sk_send_head
> >   <only few instructions>                <prepare a message>
> >   grab in_mutex                          <send a message to the server>
> >                                          <wait for an answer>
> >                                          test in_mutex
> >
> > Without kernel preemption it is not possible for release to clear
> > sk_send_head and test in_mutex after recvmsg tests sk_send_head and
> > before recvmsg grabs in_mutex.
> 
> Sorry, I don't follow --- what does preemption have to do with this? If
> recvmsg and release happen on different processors the order of
> operations can be
> 
> CPU0                                   CPU1
> 
> test sk_send_head
> <interrupt>
>                                         clear sk_send_head
>                                         <send a message to the server>
>                                         <wait for an answer>
>                                         test in_mutex
>                                         free everything
> grab in_mutex
> 
> I actually think RCU should take care of all of this.

Preemption could cause something very similar to happen, but your
example is even better, because it could trigger the issue even with
preemption disabled. I'll think more about this and submit a separate
patch on top of the simple pvcalls_refcount patch below.
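
Just to sketch the RCU direction (illustration only, not a patch; field
names, types and casts are glossed over, and since sendmsg/recvmsg can
sleep this would in practice have to be SRCU rather than plain RCU):

	/* reader side, e.g. recvmsg/sendmsg: */
	rcu_read_lock();
	map = (struct sock_mapping *)rcu_dereference(sock->sk->sk_send_head);
	if (map == NULL) {
		rcu_read_unlock();
		return -ENOTCONN;
	}
	/* ... use map ... */
	rcu_read_unlock();

	/* release side: */
	rcu_assign_pointer(sock->sk->sk_send_head, NULL);
	synchronize_rcu();	/* readers that saw the old pointer are done */
	pvcalls_front_free_map(bedata, map);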



> But for now I will take your refcount-based patch. However, it also
> needs comment update.
> 
> How about
> 
> /*
>  * We need to make sure that send/rcvmsg on this socket has not started
>  * before we've cleared sk_send_head here. The easiest (though not optimal)
>  * way to guarantee this is to see that no pvcall (other than us) is in
> progress.
>  */

Yes, this is the patch:

---


xen/pvcalls: fix potential endless loop in pvcalls-front.c

mutex_trylock() returns 1 if it takes the lock and 0 if not. Assume you
take in_mutex on the first try, but you can't take out_mutex. On every
following iteration, mutex_trylock() on in_mutex is going to fail because
you already hold it. It's an endless loop.
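
To spell out the failure mode in the old code (the loop removed by the
hunk below), a walkthrough in comments:

	while (!mutex_trylock(&map->active.in_mutex) ||
	       !mutex_trylock(&map->active.out_mutex))
		cpu_relax();
	/*
	 * 1st iteration: mutex_trylock(in_mutex) succeeds (returns 1), but
	 * mutex_trylock(out_mutex) fails (returns 0), so we loop again.
	 * 2nd iteration onward: we already hold in_mutex, so
	 * mutex_trylock(in_mutex) returns 0; the first operand of || is
	 * already true, out_mutex is never even retried, and the loop
	 * spins forever.
	 */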

Solve the problem by waiting until the global refcount is 1 instead (the
refcount is 1 when the only active pvcalls frontend function is
pvcalls_front_release).
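
This works because every pvcalls frontend entry point holds a reference
on the global counter for its whole duration (sketch only; the helper
names below are assumptions about the existing convention, not new code
in this patch):

	static atomic_t pvcalls_refcount;

	/* on entry to any pvcalls frontend function ("pvcalls_enter"): */
	atomic_inc(&pvcalls_refcount);
	/* ... sendmsg/recvmsg/bind/release/... body ... */
	/* on exit ("pvcalls_exit"): */
	atomic_dec(&pvcalls_refcount);

	/* pvcalls_front_release() itself holds one of those references, so
	 * once the count drops to 1 nobody else is inside the driver: */
	while (atomic_read(&pvcalls_refcount) > 1)
		cpu_relax();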

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com

Comments

Stefano Stabellini Nov. 15, 2017, 9:50 p.m. UTC | #1
On Wed, 15 Nov 2017, Stefano Stabellini wrote:
> On Wed, 15 Nov 2017, Boris Ostrovsky wrote:
> > On 11/15/2017 02:09 PM, Stefano Stabellini wrote:
> > > On Wed, 15 Nov 2017, Juergen Gross wrote:
> > >>>>>         while(mutex_is_locked(&map->active.in_mutex.owner) ||
> > >>>>>               mutex_is_locked(&map->active.out_mutex.owner))
> > >>>>>                 cpu_relax();
> > >>>>>
> > >>>>> ?
> > >>>> I'm not convinced there isn't a race.
> > >>>>
> > >>>> In pvcalls_front_recvmsg() sock->sk->sk_send_head is being read and only
> > >>>> then in_mutex is taken. What happens if pvcalls_front_release() resets
> > >>>> sk_send_head and manages to test the mutex before the mutex is locked?
> > >>>>
> > >>>> Even in case this is impossible: the whole construct seems to be rather
> > >>>> fragile.
> > > I agree it looks fragile, and I agree that it might be best to avoid the
> > > usage of in_mutex and out_mutex as refcounts. More comments on this
> > > below.
> > >
> > >  
> > >>> I think we can wait until pvcalls_refcount is 1 (i.e. it's only us) and
> > >>> not rely on mutex state.
> > >> Yes, this would work.
> > > Yes, I agree it would work and for the sake of getting something in
> > > shape for the merge window I am attaching a patch for it. Please go
> > > ahead with it. Let me know if you need anything else immediately, and
> > > I'll work on it ASAP.
> > >
> > >
> > >
> > > However, I should note that this is a pretty big hammer we are using:
> > > the refcount is global, while we only need to wait until it's only us
> > > _on this specific socket_.
> > 
> > Can you explain why socket is important?
> 
> Yes, of course: there are going to be many open sockets on a given
> pvcalls connection. pvcalls_refcount is global: waiting on
> pvcalls_refcount means waiting until any operations on any unrelated
> sockets stop. While we only need to wait until the operations on the one
> socket we want to close stop.
> 
> 
> > >
> > > We really need a per socket refcount. If we don't want to use the mutex
> > > internal counters, then we need another one.
> > >
> > > See the appended patch that introduces a per socket refcount. However,
> > > for the merge window, also using pvcalls_refcount is fine.
> > >
> > > The race Juergen is concerned about is only theoretically possible:
> > >
> > > recvmsg:                                 release:
> > >   
> > >   test sk_send_head                      clear sk_send_head
> > >   <only few instructions>                <prepare a message>
> > >   grab in_mutex                          <send a message to the server>
> > >                                          <wait for an answer>
> > >                                          test in_mutex
> > >
> > > Without kernel preemption it is not possible for release to clear
> > > sk_send_head and test in_mutex after recvmsg tests sk_send_head and
> > > before recvmsg grabs in_mutex.
> > 
> > Sorry, I don't follow --- what does preemption have to do with this? If
> > recvmsg and release happen on different processors the order of
> > operations can be
> > 
> > CPU0                                   CPU1
> > 
> > test sk_send_head
> > <interrupt>
> >                                         clear sk_send_head
> >                                         <send a message to the server>
> >                                         <wait for an answer>
> >                                         test in_mutex
> >                                         free everything
> > grab in_mutex
> > 
> > I actually think RCU should take care of all of this.
> 
> Preemption could cause something very similar to happen, but your
> example is very good too, even better, because it could trigger the
> issue even with preemption disabled. I'll think more about this and
> submit a separate patch on top of the simple pvcalls_refcount patch
> below.
> 
> 
> 
> > But for now I will take your refcount-based patch. However, it also
> > needs comment update.
> > 
> > How about
> > 
> > /*
> >  * We need to make sure that send/rcvmsg on this socket has not started
> >  * before we've cleared sk_send_head here. The easiest (though not optimal)
> >  * way to guarantee this is to see that no pvcall (other than us) is in
> > progress.
> >  */
> 
> Yes, this is the patch:
> 
> ---
> 
> 
> xen/pvcalls: fix potential endless loop in pvcalls-front.c
> 
> mutex_trylock() returns 1 if it takes the lock and 0 if not. Assume you
> take in_mutex on the first try, but you can't take out_mutex. On every
> following iteration, mutex_trylock() on in_mutex is going to fail because
> you already hold it. It's an endless loop.
> 
> Solve the problem by waiting until the global refcount is 1 instead (the
> refcount is 1 when the only active pvcalls frontend function is
> pvcalls_front_release).
> 
> Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
> CC: boris.ostrovsky@oracle.com
> CC: jgross@suse.com
> 
> 
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 0c1ec68..54c0fda 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -1043,13 +1043,12 @@ int pvcalls_front_release(struct socket *sock)
>  		wake_up_interruptible(&map->active.inflight_conn_req);
>  
>  		/*
> -		 * Wait until there are no more waiters on the mutexes.
> -		 * We know that no new waiters can be added because sk_send_head
> -		 * is set to NULL -- we only need to wait for the existing
> -		 * waiters to return.
> -		 */
> -		while (!mutex_trylock(&map->active.in_mutex) ||
> -			   !mutex_trylock(&map->active.out_mutex))
> +		 * We need to make sure that sendmsg/recvmsg on this socket have
> +		 * not started before we've cleared sk_send_head here. The
> +		 * easiest (though not optimal) way to guarantee this is to see
> +		 * that no pvcall (other than us) is in progress.
> +		*/
> +		while (atomic_read(&pvcalls_refcount) > 1)
>  			cpu_relax();
>  
>  		pvcalls_front_free_map(bedata, map);


Sorry, code style issue: one missing space in the comment. I'll send it
again separately.
Boris Ostrovsky Nov. 15, 2017, 9:51 p.m. UTC | #2
On 11/15/2017 04:50 PM, Stefano Stabellini wrote:
>
> Sorry, code style issue: one missing space in the comment. I'll send it
> again separately


I've already fixed this, no worries.

-boris
Stefano Stabellini Nov. 15, 2017, 9:54 p.m. UTC | #3
On Wed, 15 Nov 2017, Boris Ostrovsky wrote:
> On 11/15/2017 04:50 PM, Stefano Stabellini wrote:
> >
> > Sorry, code style issue: one missing space in the comment. I'll send it
> > again separately
> 
> 
> I've already fixed this, no worries.

Thank you!!
Jürgen Groß Nov. 16, 2017, 5:28 a.m. UTC | #4
On 15/11/17 22:20, Stefano Stabellini wrote:
> On Wed, 15 Nov 2017, Boris Ostrovsky wrote:
>> On 11/15/2017 02:09 PM, Stefano Stabellini wrote:
>>> On Wed, 15 Nov 2017, Juergen Gross wrote:
>>>>>>>         while(mutex_is_locked(&map->active.in_mutex.owner) ||
>>>>>>>               mutex_is_locked(&map->active.out_mutex.owner))
>>>>>>>                 cpu_relax();
>>>>>>>
>>>>>>> ?
>>>>>> I'm not convinced there isn't a race.
>>>>>>
>>>>>> In pvcalls_front_recvmsg() sock->sk->sk_send_head is being read and only
>>>>>> then in_mutex is taken. What happens if pvcalls_front_release() resets
>>>>>> sk_send_head and manages to test the mutex before the mutex is locked?
>>>>>>
>>>>>> Even in case this is impossible: the whole construct seems to be rather
>>>>>> fragile.
>>> I agree it looks fragile, and I agree that it might be best to avoid the
>>> usage of in_mutex and out_mutex as refcounts. More comments on this
>>> below.
>>>
>>>  
>>>>> I think we can wait until pvcalls_refcount is 1 (i.e. it's only us) and
>>>>> not rely on mutex state.
>>>> Yes, this would work.
>>> Yes, I agree it would work and for the sake of getting something in
>>> shape for the merge window I am attaching a patch for it. Please go
>>> ahead with it. Let me know if you need anything else immediately, and
>>> I'll work on it ASAP.
>>>
>>>
>>>
>>> However, I should note that this is a pretty big hammer we are using:
>>> the refcount is global, while we only need to wait until it's only us
>>> _on this specific socket_.
>>
>> Can you explain why socket is important?
> 
> Yes, of course: there are going to be many open sockets on a given
> pvcalls connection. pvcalls_refcount is global: waiting on
> pvcalls_refcount means waiting until any operations on any unrelated
> sockets stop. While we only need to wait until the operations on the one
> socket we want to close stop.
> 
> 
>>>
>>> We really need a per socket refcount. If we don't want to use the mutex
>>> internal counters, then we need another one.
>>>
>>> See the appended patch that introduces a per socket refcount. However,
>>> for the merge window, also using pvcalls_refcount is fine.
>>>
>>> The race Juergen is concerned about is only theoretically possible:
>>>
>>> recvmsg:                                 release:
>>>   
>>>   test sk_send_head                      clear sk_send_head
>>>   <only few instructions>                <prepare a message>
>>>   grab in_mutex                          <send a message to the server>
>>>                                          <wait for an answer>
>>>                                          test in_mutex
>>>
>>> Without kernel preemption it is not possible for release to clear
>>> sk_send_head and test in_mutex after recvmsg tests sk_send_head and
>>> before recvmsg grabs in_mutex.
>>
>> Sorry, I don't follow --- what does preemption have to do with this? If
>> recvmsg and release happen on different processors the order of
>> operations can be
>>
>> CPU0                                   CPU1
>>
>> test sk_send_head
>> <interrupt>
>>                                         clear sk_send_head
>>                                         <send a message to the server>
>>                                         <wait for an answer>
>>                                         test in_mutex
>>                                         free everything
>> grab in_mutex
>>
>> I actually think RCU should take care of all of this.
> 
> Preemption could cause something very similar to happen, but your
> example is very good too, even better, because it could trigger the
> issue even with preemption disabled. I'll think more about this and
> submit a separate patch on top of the simple pvcalls_refcount patch
> below.

We are running as a guest. Even with interrupts off the vcpu could be
off the pcpu for several milliseconds!

Don't count on code length to avoid races!


Juergen

Patch

diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 0c1ec68..54c0fda 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -1043,13 +1043,12 @@  int pvcalls_front_release(struct socket *sock)
 		wake_up_interruptible(&map->active.inflight_conn_req);
 
 		/*
-		 * Wait until there are no more waiters on the mutexes.
-		 * We know that no new waiters can be added because sk_send_head
-		 * is set to NULL -- we only need to wait for the existing
-		 * waiters to return.
-		 */
-		while (!mutex_trylock(&map->active.in_mutex) ||
-			   !mutex_trylock(&map->active.out_mutex))
+		 * We need to make sure that sendmsg/recvmsg on this socket have
+		 * not started before we've cleared sk_send_head here. The
+		 * easiest (though not optimal) way to guarantee this is to see
+		 * that no pvcall (other than us) is in progress.
+		*/
+		while (atomic_read(&pvcalls_refcount) > 1)
 			cpu_relax();
 
 		pvcalls_front_free_map(bedata, map);