Message ID | alpine.DEB.2.10.1711151100490.22767@sstabellini-ThinkPad-X260 (mailing list archive) |
---|---|
State | New, archived |
On 11/15/2017 02:09 PM, Stefano Stabellini wrote:
> On Wed, 15 Nov 2017, Juergen Gross wrote:
>>>>> while(mutex_is_locked(&map->active.in_mutex.owner) ||
>>>>>       mutex_is_locked(&map->active.out_mutex.owner))
>>>>> 	cpu_relax();
>>>>>
>>>>> ?
>>>> I'm not convinced there isn't a race.
>>>>
>>>> In pvcalls_front_recvmsg() sock->sk->sk_send_head is being read and only
>>>> then in_mutex is taken. What happens if pvcalls_front_release() resets
>>>> sk_send_head and manages to test the mutex before the mutex is locked?
>>>>
>>>> Even in case this is impossible: the whole construct seems to be rather
>>>> fragile.
>
> I agree it looks fragile, and I agree that it might be best to avoid the
> usage of in_mutex and out_mutex as refcounts. More comments on this
> below.
>
>>> I think we can wait until pvcalls_refcount is 1 (i.e. it's only us) and
>>> not rely on mutex state.
>>
>> Yes, this would work.
>
> Yes, I agree it would work and for the sake of getting something in
> shape for the merge window I am attaching a patch for it. Please go
> ahead with it. Let me know if you need anything else immediately, and
> I'll work on it ASAP.
>
> However, I should note that this is a pretty big hammer we are using:
> the refcount is global, while we only need to wait until it's only us
> _on this specific socket_.

Can you explain why socket is important?

> We really need a per socket refcount. If we don't want to use the mutex
> internal counters, then we need another one.
>
> See the appended patch that introduces a per socket refcount. However,
> for the merge window, also using pvcalls_refcount is fine.
>
> The race Juergen is concerned about is only theoretically possible:
>
>   recvmsg:                    release:
>
>   test sk_send_head           clear sk_send_head
>   <only few instructions>     <prepare a message>
>   grab in_mutex               <send a message to the server>
>                               <wait for an answer>
>                               test in_mutex
>
> Without kernel preemption it is not possible for release to clear
> sk_send_head and test in_mutex after recvmsg tests sk_send_head and
> before recvmsg grabs in_mutex.

Sorry, I don't follow --- what does preemption have to do with this? If
recvmsg and release happen on different processors, the order of
operations can be

  CPU0                        CPU1

  test sk_send_head
  <interrupt>
                              clear sk_send_head
                              <send a message to the server>
                              <wait for an answer>
                              test in_mutex
                              free everything
  grab in_mutex

I actually think RCU should take care of all of this. But for now I will
take your refcount-based patch. However, it also needs a comment update.
How about

/*
 * We need to make sure that send/rcvmsg on this socket has not started
 * before we've cleared sk_send_head here. The easiest (though not
 * optimal) way to guarantee this is to see that no pvcall (other than
 * us) is in progress.
 */

-boris

> But maybe we need to disable kernel preemption in recvmsg and sendmsg to
> stay on the safe side?
>
> The patch below introduces a per active socket refcount, so that we
> don't have to rely on in_mutex and out_mutex for refcounting. It also
> disables preemption in sendmsg and recvmsg in the region described
> above.
>
> I don't think this patch should go in immediately. We can take our time
> to figure out the best fix.
>
>
> diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
> index 0c1ec68..8c1030b 100644
> --- a/drivers/xen/pvcalls-front.c
> +++ b/drivers/xen/pvcalls-front.c
> @@ -68,6 +68,7 @@ struct sock_mapping {
>  			struct pvcalls_data data;
>  			struct mutex in_mutex;
>  			struct mutex out_mutex;
> +			atomic_t sock_refcount;
>  
>  			wait_queue_head_t inflight_conn_req;
>  		} active;
> @@ -497,15 +498,20 @@ int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
>  	}
>  	bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
>  
> +	preempt_disable();
>  	map = (struct sock_mapping *) sock->sk->sk_send_head;
>  	if (!map) {
> +		preempt_enable();
>  		pvcalls_exit();
>  		return -ENOTSOCK;
>  	}
>  
> +	atomic_inc(&map->active.sock_refcount);
>  	mutex_lock(&map->active.out_mutex);
> +	preempt_enable();
>  	if ((flags & MSG_DONTWAIT) && !pvcalls_front_write_todo(map)) {
>  		mutex_unlock(&map->active.out_mutex);
> +		atomic_dec(&map->active.sock_refcount);
>  		pvcalls_exit();
>  		return -EAGAIN;
>  	}
> @@ -528,6 +534,7 @@ int pvcalls_front_sendmsg(struct socket *sock, struct msghdr *msg,
>  	tot_sent = sent;
>  
>  	mutex_unlock(&map->active.out_mutex);
> +	atomic_dec(&map->active.sock_refcount);
>  	pvcalls_exit();
>  	return tot_sent;
>  }
> @@ -600,13 +607,17 @@ int pvcalls_front_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>  	}
>  	bedata = dev_get_drvdata(&pvcalls_front_dev->dev);
>  
> +	preempt_disable();
>  	map = (struct sock_mapping *) sock->sk->sk_send_head;
>  	if (!map) {
> +		preempt_enable();
>  		pvcalls_exit();
>  		return -ENOTSOCK;
>  	}
>  
> +	atomic_inc(&map->active.sock_refcount);
>  	mutex_lock(&map->active.in_mutex);
> +	preempt_enable();
>  	if (len > XEN_FLEX_RING_SIZE(PVCALLS_RING_ORDER))
>  		len = XEN_FLEX_RING_SIZE(PVCALLS_RING_ORDER);
>  
> @@ -625,6 +636,7 @@ int pvcalls_front_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>  	ret = 0;
>  
>  	mutex_unlock(&map->active.in_mutex);
> +	atomic_dec(&map->active.sock_refcount);
>  	pvcalls_exit();
>  	return ret;
>  }
> @@ -1048,8 +1060,7 @@ int pvcalls_front_release(struct socket *sock)
>  	 * is set to NULL -- we only need to wait for the existing
>  	 * waiters to return.
>  	 */
> -	while (!mutex_trylock(&map->active.in_mutex) ||
> -	       !mutex_trylock(&map->active.out_mutex))
> +	while (atomic_read(&map->active.sock_refcount) > 0)
>  		cpu_relax();
>  
>  	pvcalls_front_free_map(bedata, map);
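
As a side note on the RCU remark above, here is a minimal sketch of how
an RCU-published pointer combined with the per-socket refcount could
close the window without preempt_disable(). This is illustrative only,
not the actual driver code: the "pvcalls_map" pointer and both helper
names are hypothetical (mainline reuses sk->sk_send_head instead), and
sock_refcount is the field added by the patch above.

static struct sock_mapping __rcu *pvcalls_map;	/* hypothetical field */

/* Reader side: what sendmsg/recvmsg would do to pin the mapping. */
static struct sock_mapping *pvcalls_pin_map_sketch(void)
{
	struct sock_mapping *map;

	rcu_read_lock();
	map = rcu_dereference(pvcalls_map);
	if (map)
		atomic_inc(&map->active.sock_refcount);	/* pin it */
	rcu_read_unlock();
	return map;	/* caller may now sleep on in_mutex/out_mutex */
}

/* Release side: make the mapping unreachable, then drain the pins. */
static void pvcalls_release_map_sketch(struct sock_mapping *map)
{
	RCU_INIT_POINTER(pvcalls_map, NULL);
	synchronize_rcu();	/* no reader can still observe the pointer */
	while (atomic_read(&map->active.sock_refcount) > 0)
		cpu_relax();	/* wait out readers that pinned in time */
	/* now safe to free map */
}

Any reader that saw the non-NULL pointer increments the refcount inside
its RCU read-side critical section, so by the time synchronize_rcu()
returns, the refcount accounts for every remaining user.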
On 11/15/2017 02:50 PM, Boris Ostrovsky wrote:
> On 11/15/2017 02:09 PM, Stefano Stabellini wrote:
>>
>> However, I should note that this is a pretty big hammer we are using:
>> the refcount is global, while we only need to wait until it's only us
>> _on this specific socket_.
>
> Can you explain why socket is important?

Nevermind. I was thinking about *processor* socket (as in cores,
threads, packages etc. I am right now looking at a bug that deals with
core behavior ;-))

-boris
xen/pvcalls: fix potential endless loop in pvcalls-front.c

mutex_trylock() returns 1 if you take the lock and 0 if not. Assume you
take in_mutex on the first try, but you can't take out_mutex. The next
time you call mutex_trylock(), in_mutex is going to fail: it's an
endless loop.

Solve the problem by waiting until the global refcount is 1 instead (the
refcount is 1 when the only active pvcalls frontend function is
pvcalls_front_release).

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
CC: boris.ostrovsky@oracle.com
CC: jgross@suse.com

diff --git a/drivers/xen/pvcalls-front.c b/drivers/xen/pvcalls-front.c
index 0c1ec68..72a74c3 100644
--- a/drivers/xen/pvcalls-front.c
+++ b/drivers/xen/pvcalls-front.c
@@ -1048,8 +1048,7 @@ int pvcalls_front_release(struct socket *sock)
 	 * is set to NULL -- we only need to wait for the existing
 	 * waiters to return.
 	 */
-	while (!mutex_trylock(&map->active.in_mutex) ||
-	       !mutex_trylock(&map->active.out_mutex))
+	while (atomic_read(&pvcalls_refcount) > 1)
 		cpu_relax();
 
 	pvcalls_front_free_map(bedata, map);
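
To see the endless loop concretely, here is a small userspace sketch of
the same pattern (illustrative only, not driver code). Note that
pthread_mutex_trylock() returns 0 on success, so the kernel's
!mutex_trylock() test becomes "!= 0", and the spin is capped so the
program terminates. Build with: cc demo.c -lpthread

#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_mutex_t in_mutex  = PTHREAD_MUTEX_INITIALIZER;
	pthread_mutex_t out_mutex = PTHREAD_MUTEX_INITIALIZER;
	int spins = 0;

	/* Stand-in for a concurrent sendmsg holding out_mutex. */
	pthread_mutex_lock(&out_mutex);

	/*
	 * First pass: in_mutex is acquired, out_mutex fails.  Every
	 * later pass fails on in_mutex -- we already hold it and never
	 * release it -- so the condition stays true forever.  Capped
	 * here so the demo exits instead of spinning indefinitely.
	 */
	while (pthread_mutex_trylock(&in_mutex) != 0 ||
	       pthread_mutex_trylock(&out_mutex) != 0) {
		if (++spins == 1000000) {
			printf("still spinning after %d iterations\n",
			       spins);
			return 1;
		}
	}

	printf("acquired both mutexes (never reached)\n");
	return 0;
}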